apache/seatunnel

[Umbrella][RAG] Define and implement production-grade Knowledge Sync for SeaTunnel

Open

#10,889 opened on May 15, 2026

View on GitHub
 (2 comments) (2 reactions) (2 assignees)Java (6,897 stars) (1,432 forks)batch import
featurehelp wanted

Description

Code of Conduct

Search before asking

  • I had searched in the issues and found no similar issues.

Describe the proposal

[Umbrella][RAG] Define and implement production-grade Knowledge Sync for SeaTunnel

Background

This umbrella issue summarizes and tracks the work proposed by the seatunnel-knowledge-sync-execution-plan.

The plan positions SeaTunnel as a Knowledge Indexing and Sync Layer for RAG scenarios. The goal is not to turn SeaTunnel into a RAG application framework, query engine, retriever, reranker, chatbot, or agent runtime. Instead, SeaTunnel should focus on distributed, reliable, and verifiable knowledge data synchronization.

In enterprise RAG scenarios, source documents need to be continuously discovered, parsed, chunked, embedded, and synchronized into vector stores such as Qdrant and Milvus. A simple append or upsert pipeline is not enough, because production knowledge sync must also handle refresh, delete, unchanged-skip, stale chunk cleanup, retry, failover, and parallel writer consistency.

Positioning

SeaTunnel should be positioned as:

SeaTunnel as Knowledge Indexing and Sync Layer

SeaTunnel is responsible for:

  • discovering new or changed documents
  • parsing documents into structured content
  • generating stable document and chunk metadata
  • splitting documents into embedding-ready chunks
  • generating embeddings
  • writing vectors and payloads into vector stores
  • supporting document-level refresh, delete, and skip-unchanged behavior
  • keeping vector indexes consistent under distributed execution

SeaTunnel is not responsible for:

  • query engine
  • retriever
  • reranker
  • chatbot application
  • agent orchestration
  • workflow deployment
  • LlamaCloud or LlamaParse-like hosted platform capability

Main Use Cases

Enterprise knowledge base indexing

Synchronize enterprise documents into vector databases for RAG retrieval.

Typical sources include:

  • local Markdown, TXT, or PDF files
  • HDFS, S3, OSS, or COS object storage
  • product manuals
  • FAQ documents
  • operation and troubleshooting documents
  • internal engineering documents
  • future dedicated sources such as Confluence, Google Drive, and SharePoint

Incremental knowledge synchronization

Knowledge documents change over time. SeaTunnel should support:

  • new document: insert new chunks
  • changed document: refresh only chunks belonging to that document
  • deleted document: delete all chunks belonging to that document
  • unchanged document: skip parse, embedding, and upsert

This is important because embedding is expensive and stale chunks can lead to incorrect RAG retrieval results.

Vector index lifecycle management

A document usually maps to multiple chunks and multiple vector points. When the document changes, the number of chunks may increase or decrease.

Therefore, a production sync pipeline needs lifecycle semantics:

  • read existing chunks for the document
  • compare the stored hash with the new ChunkHash
  • delete stale chunks that no longer exist
  • upsert changed chunks
  • skip unchanged chunks

A simple vector upsert cannot guarantee this behavior.

Future multimodal knowledge sync

The plan also leaves room for multimodal document assets, including:

  • PDF page images
  • screenshots
  • attachments
  • audio segments
  • video segments

These should be represented as document assets and linked back to the parent document.

Why This Is Needed

SeaTunnel already has several useful building blocks:

  • distributed Source / Transform / Sink execution model
  • file source
  • Markdown reader
  • SeaTunnelRow.options
  • CatalogTable.metadataSchema
  • Embedding transform
  • Qdrant sink
  • Milvus sink
  • binary and complete-file read support

However, these are still separate capabilities. They do not yet form a production-grade knowledge sync system.

The missing key capabilities are:

  1. unified Document / Chunk / Asset semantics
  2. stable DocumentHash / ChunkId / ChunkHash contracts
  3. document-level lifecycle write semantics
  4. document_id based partition and routing
  5. persistent parse and embedding cache semantics
  6. production verification under parallel execution

Without these, a pipeline may be able to write vectors, but it cannot safely maintain a correct vector index over time.

Core Design Direction

Zeta-first delivery

The first phase should be Zeta-first.

Flink and Spark support can be added later, but a feature that is only implemented in Zeta must not be documented as tri-engine ready.

Expected behavior:

  • Zeta should support Knowledge Sync routing first.
  • Flink and Spark should fail explicitly when a non-empty sink partition strategy is required but not supported.
  • Silent fallback is not acceptable.

Sink-declared partition and routing

Knowledge lifecycle sinks need to declare that they require records to be partitioned by specific fields, especially document_id.

The expected API direction is:

  • introduce a SinkPartitionStrategy
  • support modes such as NONE, HASH_BY_FIELDS, and ROUND_ROBIN
  • let SeaTunnelSink expose a default empty strategy
  • let lifecycle sinks declare HASH_BY_FIELDS with document_id

A lifecycle sink can declare user-facing options such as:

  • knowledge_sync.enabled = true
  • knowledge_sync.partition_keys = ["document_id"]

The engine should translate this declaration into native shuffle, keyBy, or repartition behavior.

Document and chunk semantic model

The first phase should define stable metadata keys:

  • DocumentId
  • DocumentHash
  • SourceUri
  • SourceVersion
  • SourceModifiedAt
  • MimeType
  • Deleted
  • ChunkId
  • ChunkHash
  • ChunkIndex
  • ParentDocumentId
  • AssetId
  • AssetHash
  • AssetUri
  • AssetRole
  • AssetModality

Important rules:

  • DocumentHash priority 1: source-native stable version
  • DocumentHash priority 2: raw bytes SHA-256 before parsing
  • DocumentHash priority 3: normalized text SHA-256
  • ChunkId: sha256(document_id + "|" + chunk_index + "|" + chunker_config_hash)
  • ChunkHash: sha256(chunk_text)

ChunkId should represent stable chunk identity.

ChunkHash should represent chunk content changes.

They must not be mixed together.

Target Pipeline

The target pipeline is:

  1. Knowledge Source, File Source, or Object Storage Source
  2. Document Identity Resolver
  3. Document Parse Transform
  4. Document Asset Expand Transform
  5. Document Chunk Transform
  6. Metadata Projection or Extraction
  7. Embedding Transform
  8. Lifecycle Sink
  9. Engine Translation Layer
  10. Vector Store

Current Status

SeaTunnel already has several related capabilities.

Already available

  • distributed Source / Transform / Sink execution model
  • SeaTunnelRow.options
  • CatalogTable.metadataSchema
  • metadata propagation in transform base classes
  • SeaTunnelFlatMapTransform
  • Zeta runtime support for flatMap transform execution
  • file source and Markdown reader
  • binary read strategy and complete-file mode
  • Embedding transform
  • Qdrant basic vector sink
  • Milvus basic vector sink
  • Milvus support for reading row.options[Partition]
  • optional Markdown RAG metadata support:
    • source_uri
    • document_id
    • chunk_id
    • chunk_index
    • content_hash

Partially available

Markdown RAG metadata

Current behavior:

  • document_id = "doc_" + sha256(source_uri)
  • content_hash = sha256(text)
  • chunk_id = "chunk_" + sha256(document_id + ":" + chunk_index + ":" + content_hash)

Gap:

  • no DocumentHash
  • no dedicated ChunkHash
  • chunk_id currently includes content hash, so chunk identity changes when content changes
  • this conflicts with the desired separation between stable ChunkId and content-based ChunkHash

Embedding reliability

There is ongoing work around model invocation reliability, including:

  • retry options
  • timeout options
  • response count validation
  • safe logging
  • metrics hooks
  • cache boundaries

Gap:

  • default retry may still be disabled unless configured
  • cache is not yet a persistent source-of-truth backend
  • metrics hooks still need production integration
  • failover cannot rebuild cache view from backend yet

Qdrant and Milvus sinks

Qdrant and Milvus can write vectors today, but they do not yet provide document-level lifecycle semantics.

Gap:

  • no knowledge_sync.enabled
  • no knowledge_sync.partition_keys
  • no read-old-chunks behavior
  • no hash comparison
  • no stale chunk deletion
  • no skip-unchanged behavior
  • no tombstone delete semantics

Proposed PR Train

  • PR-A: Gate 0 / ADR package
  • PR-B0: SinkPartitionStrategy API and Zeta translation
  • PR-B: Metadata contract
  • PR-C: Document identity and parse
  • PR-D: Document chunk
  • PR-E: Qdrant lifecycle sink V1
  • PR-F: Transformation cache
  • PR-G: Milvus lifecycle and vector base
  • PR-H: Document asset expand
  • PR-I: Dedicated knowledge sources

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Contributor guide