[Umbrella][RAG] Define and implement production-grade Knowledge Sync for SeaTunnel
#10,889 opened on May 15, 2026
Description
Code of Conduct
- I agree to follow this project's Code of Conduct
Search before asking
- I had searched in the issues and found no similar issues.
Describe the proposal
[Umbrella][RAG] Define and implement production-grade Knowledge Sync for SeaTunnel
Background
This umbrella issue summarizes and tracks the work proposed by the seatunnel-knowledge-sync-execution-plan.
The plan positions SeaTunnel as a Knowledge Indexing and Sync Layer for RAG scenarios. The goal is not to turn SeaTunnel into a RAG application framework, query engine, retriever, reranker, chatbot, or agent runtime. Instead, SeaTunnel should focus on distributed, reliable, and verifiable knowledge data synchronization.
In enterprise RAG scenarios, source documents need to be continuously discovered, parsed, chunked, embedded, and synchronized into vector stores such as Qdrant and Milvus. A simple append or upsert pipeline is not enough, because production knowledge sync must also handle refresh, delete, unchanged-skip, stale chunk cleanup, retry, failover, and parallel writer consistency.
Positioning
SeaTunnel should be positioned as:
SeaTunnel as Knowledge Indexing and Sync Layer
SeaTunnel is responsible for:
- discovering new or changed documents
- parsing documents into structured content
- generating stable document and chunk metadata
- splitting documents into embedding-ready chunks
- generating embeddings
- writing vectors and payloads into vector stores
- supporting document-level refresh, delete, and skip-unchanged behavior
- keeping vector indexes consistent under distributed execution
SeaTunnel is not responsible for:
- query engine
- retriever
- reranker
- chatbot application
- agent orchestration
- workflow deployment
- LlamaCloud or LlamaParse-like hosted platform capability
Main Use Cases
Enterprise knowledge base indexing
Synchronize enterprise documents into vector databases for RAG retrieval.
Typical sources include:
- local Markdown, TXT, or PDF files
- HDFS, S3, OSS, or COS object storage
- product manuals
- FAQ documents
- operation and troubleshooting documents
- internal engineering documents
- future dedicated sources such as Confluence, Google Drive, and SharePoint
Incremental knowledge synchronization
Knowledge documents change over time. SeaTunnel should support:
- new document: insert new chunks
- changed document: refresh only chunks belonging to that document
- deleted document: delete all chunks belonging to that document
- unchanged document: skip parse, embedding, and upsert
This is important because embedding is expensive and stale chunks can lead to incorrect RAG retrieval results.
Vector index lifecycle management
A document usually maps to multiple chunks and multiple vector points. When the document changes, the number of chunks may increase or decrease.
Therefore, a production sync pipeline needs lifecycle semantics:
- read existing chunks for the document
- compare the stored hash with the new
ChunkHash - delete stale chunks that no longer exist
- upsert changed chunks
- skip unchanged chunks
A simple vector upsert cannot guarantee this behavior.
Future multimodal knowledge sync
The plan also leaves room for multimodal document assets, including:
- PDF page images
- screenshots
- attachments
- audio segments
- video segments
These should be represented as document assets and linked back to the parent document.
Why This Is Needed
SeaTunnel already has several useful building blocks:
- distributed
Source / Transform / Sinkexecution model - file source
- Markdown reader
SeaTunnelRow.optionsCatalogTable.metadataSchema- Embedding transform
- Qdrant sink
- Milvus sink
- binary and complete-file read support
However, these are still separate capabilities. They do not yet form a production-grade knowledge sync system.
The missing key capabilities are:
- unified
Document / Chunk / Assetsemantics - stable
DocumentHash / ChunkId / ChunkHashcontracts - document-level lifecycle write semantics
document_idbased partition and routing- persistent parse and embedding cache semantics
- production verification under parallel execution
Without these, a pipeline may be able to write vectors, but it cannot safely maintain a correct vector index over time.
Core Design Direction
Zeta-first delivery
The first phase should be Zeta-first.
Flink and Spark support can be added later, but a feature that is only implemented in Zeta must not be documented as tri-engine ready.
Expected behavior:
- Zeta should support Knowledge Sync routing first.
- Flink and Spark should fail explicitly when a non-empty sink partition strategy is required but not supported.
- Silent fallback is not acceptable.
Sink-declared partition and routing
Knowledge lifecycle sinks need to declare that they require records to be partitioned by specific fields, especially document_id.
The expected API direction is:
- introduce a
SinkPartitionStrategy - support modes such as
NONE,HASH_BY_FIELDS, andROUND_ROBIN - let
SeaTunnelSinkexpose a default empty strategy - let lifecycle sinks declare
HASH_BY_FIELDSwithdocument_id
A lifecycle sink can declare user-facing options such as:
knowledge_sync.enabled = trueknowledge_sync.partition_keys = ["document_id"]
The engine should translate this declaration into native shuffle, keyBy, or repartition behavior.
Document and chunk semantic model
The first phase should define stable metadata keys:
DocumentIdDocumentHashSourceUriSourceVersionSourceModifiedAtMimeTypeDeletedChunkIdChunkHashChunkIndexParentDocumentIdAssetIdAssetHashAssetUriAssetRoleAssetModality
Important rules:
DocumentHashpriority 1: source-native stable versionDocumentHashpriority 2: raw bytes SHA-256 before parsingDocumentHashpriority 3: normalized text SHA-256ChunkId:sha256(document_id + "|" + chunk_index + "|" + chunker_config_hash)ChunkHash:sha256(chunk_text)
ChunkId should represent stable chunk identity.
ChunkHash should represent chunk content changes.
They must not be mixed together.
Target Pipeline
The target pipeline is:
- Knowledge Source, File Source, or Object Storage Source
- Document Identity Resolver
- Document Parse Transform
- Document Asset Expand Transform
- Document Chunk Transform
- Metadata Projection or Extraction
- Embedding Transform
- Lifecycle Sink
- Engine Translation Layer
- Vector Store
Current Status
SeaTunnel already has several related capabilities.
Already available
- distributed
Source / Transform / Sinkexecution model SeaTunnelRow.optionsCatalogTable.metadataSchema- metadata propagation in transform base classes
SeaTunnelFlatMapTransform- Zeta runtime support for flatMap transform execution
- file source and Markdown reader
- binary read strategy and complete-file mode
- Embedding transform
- Qdrant basic vector sink
- Milvus basic vector sink
- Milvus support for reading
row.options[Partition] - optional Markdown RAG metadata support:
source_uridocument_idchunk_idchunk_indexcontent_hash
Partially available
Markdown RAG metadata
Current behavior:
document_id = "doc_" + sha256(source_uri)content_hash = sha256(text)chunk_id = "chunk_" + sha256(document_id + ":" + chunk_index + ":" + content_hash)
Gap:
- no
DocumentHash - no dedicated
ChunkHash chunk_idcurrently includes content hash, so chunk identity changes when content changes- this conflicts with the desired separation between stable
ChunkIdand content-basedChunkHash
Embedding reliability
There is ongoing work around model invocation reliability, including:
- retry options
- timeout options
- response count validation
- safe logging
- metrics hooks
- cache boundaries
Gap:
- default retry may still be disabled unless configured
- cache is not yet a persistent source-of-truth backend
- metrics hooks still need production integration
- failover cannot rebuild cache view from backend yet
Qdrant and Milvus sinks
Qdrant and Milvus can write vectors today, but they do not yet provide document-level lifecycle semantics.
Gap:
- no
knowledge_sync.enabled - no
knowledge_sync.partition_keys - no read-old-chunks behavior
- no hash comparison
- no stale chunk deletion
- no skip-unchanged behavior
- no tombstone delete semantics
Proposed PR Train
- PR-A: Gate 0 / ADR package
- PR-B0:
SinkPartitionStrategyAPI and Zeta translation - PR-B: Metadata contract
- PR-C: Document identity and parse
- PR-D: Document chunk
- PR-E: Qdrant lifecycle sink V1
- PR-F: Transformation cache
- PR-G: Milvus lifecycle and vector base
- PR-H: Document asset expand
- PR-I: Dedicated knowledge sources
Are you willing to submit PR?
- Yes I am willing to submit a PR!