Description

Code of Conduct

I agree to follow this project's Code of Conduct

Search before asking

I have searched the issues and found no similar issues.

Describe the proposal

[Umbrella][RAG] Define and implement production-grade Knowledge Sync for SeaTunnel

Background

This umbrella issue summarizes and tracks the work proposed by the seatunnel-knowledge-sync-execution-plan.

The plan positions SeaTunnel as a Knowledge Indexing and Sync Layer for RAG scenarios. The goal is not to turn SeaTunnel into a RAG application framework, query engine, retriever, reranker, chatbot, or agent runtime. Instead, SeaTunnel should focus on distributed, reliable, and verifiable knowledge data synchronization.

In enterprise RAG scenarios, source documents need to be continuously discovered, parsed, chunked, embedded, and synchronized into vector stores such as Qdrant and Milvus. A simple append or upsert pipeline is not enough, because production knowledge sync must also handle refresh, delete, unchanged-skip, stale chunk cleanup, retry, failover, and parallel writer consistency.

Positioning

SeaTunnel should be positioned as:

SeaTunnel as Knowledge Indexing and Sync Layer

SeaTunnel is responsible for:

discovering new or changed documents
parsing documents into structured content
generating stable document and chunk metadata
splitting documents into embedding-ready chunks
generating embeddings
writing vectors and payloads into vector stores
supporting document-level refresh, delete, and skip-unchanged behavior
keeping vector indexes consistent under distributed execution

SeaTunnel is not responsible for:

query engine
retriever
reranker
chatbot application
agent orchestration
workflow deployment
LlamaCloud or LlamaParse-like hosted platform capability

Main Use Cases

Enterprise knowledge base indexing

Synchronize enterprise documents into vector databases for RAG retrieval.

Typical sources include:

local Markdown, TXT, or PDF files
HDFS, S3, OSS, or COS object storage
product manuals
FAQ documents
operation and troubleshooting documents
internal engineering documents
future dedicated sources such as Confluence, Google Drive, and SharePoint

Incremental knowledge synchronization

Knowledge documents change over time. SeaTunnel should support:

new document: insert new chunks
changed document: refresh only chunks belonging to that document
deleted document: delete all chunks belonging to that document
unchanged document: skip parse, embedding, and upsert

This is important because embedding is expensive and stale chunks can lead to incorrect RAG retrieval results.
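
The four cases above can be sketched as a small decision function. This is only an illustration, not an existing SeaTunnel API: the class name `DocumentSyncDecision`, the `Action` enum, and the idea of comparing a stored DocumentHash with the incoming one are assumptions made for the example.

```java
import java.util.Objects;

/** Hypothetical sketch: choose a sync action from the stored and incoming DocumentHash. */
final class DocumentSyncDecision {

    enum Action { INSERT, REFRESH, DELETE, SKIP }

    /**
     * @param storedHash   DocumentHash already recorded for this document, or null if none
     * @param incomingHash DocumentHash of the document just read (ignored for deletes)
     * @param deleted      true if the source reported the document as deleted
     */
    static Action decide(String storedHash, String incomingHash, boolean deleted) {
        if (deleted) {
            return Action.DELETE;   // delete all chunks belonging to the document
        }
        if (storedHash == null) {
            return Action.INSERT;   // new document: insert new chunks
        }
        if (Objects.equals(storedHash, incomingHash)) {
            return Action.SKIP;     // unchanged: skip parse, embedding, and upsert
        }
        return Action.REFRESH;      // changed: refresh only this document's chunks
    }

    public static void main(String[] args) {
        System.out.println(decide(null, "h1", false)); // INSERT
        System.out.println(decide("h1", "h2", false)); // REFRESH
        System.out.println(decide("h1", null, true));  // DELETE
        System.out.println(decide("h1", "h1", false)); // SKIP
    }
}
```

Checking the delete flag first means a tombstone always wins, even when no hash is available for the removed document.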

Vector index lifecycle management

A document usually maps to multiple chunks and multiple vector points. When the document changes, the number of chunks may increase or decrease.

Therefore, a production sync pipeline needs lifecycle semantics:

read existing chunks for the document
compare the stored hash with the new ChunkHash
delete stale chunks that no longer exist
upsert changed chunks
skip unchanged chunks

A simple vector upsert cannot guarantee this behavior.
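
A minimal sketch of these lifecycle steps for a single document, assuming the existing chunks have already been read from the vector store as a ChunkId to ChunkHash map. The names `ChunkLifecyclePlanner` and `Plan` are hypothetical and do not refer to existing SeaTunnel classes.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: diff stored chunks against newly produced chunks of one document. */
final class ChunkLifecyclePlanner {

    /** Result of the diff, holding ChunkIds grouped by the action to take. */
    static final class Plan {
        final List<String> toDelete = new ArrayList<>();
        final List<String> toUpsert = new ArrayList<>();
        final List<String> toSkip = new ArrayList<>();
    }

    /** Both maps are ChunkId -> ChunkHash for the same document. */
    static Plan plan(Map<String, String> stored, Map<String, String> incoming) {
        Plan plan = new Plan();
        // Stale chunks: present in the store but no longer produced by the chunker.
        for (String chunkId : stored.keySet()) {
            if (!incoming.containsKey(chunkId)) {
                plan.toDelete.add(chunkId);
            }
        }
        // New or changed chunks are upserted; chunks with an identical hash are skipped.
        for (Map.Entry<String, String> entry : incoming.entrySet()) {
            String storedHash = stored.get(entry.getKey());
            if (entry.getValue().equals(storedHash)) {
                plan.toSkip.add(entry.getKey());
            } else {
                plan.toUpsert.add(entry.getKey());
            }
        }
        return plan;
    }

    public static void main(String[] args) {
        Map<String, String> stored = new LinkedHashMap<>();
        stored.put("c0", "hash-a");
        stored.put("c1", "hash-b");
        stored.put("c2", "hash-c");

        Map<String, String> incoming = new LinkedHashMap<>();
        incoming.put("c0", "hash-a");  // unchanged
        incoming.put("c1", "hash-b2"); // content changed

        Plan plan = plan(stored, incoming);
        System.out.println("delete=" + plan.toDelete); // delete=[c2]
        System.out.println("upsert=" + plan.toUpsert); // upsert=[c1]
        System.out.println("skip=" + plan.toSkip);     // skip=[c0]
    }
}
```

The diff is only correct if one writer sees all chunks of a document, which is why the proposal ties lifecycle sinks to document_id routing.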

Future multimodal knowledge sync

The plan also leaves room for multimodal document assets, including:

PDF page images
screenshots
attachments
audio segments
video segments

These should be represented as document assets and linked back to the parent document.

Why This Is Needed

SeaTunnel already has several useful building blocks:

distributed Source / Transform / Sink execution model
file source
Markdown reader
SeaTunnelRow.options
CatalogTable.metadataSchema
Embedding transform
Qdrant sink
Milvus sink
binary and complete-file read support

However, these are still separate capabilities. They do not yet form a production-grade knowledge sync system.

The missing key capabilities are:

unified Document / Chunk / Asset semantics
stable DocumentHash / ChunkId / ChunkHash contracts
document-level lifecycle write semantics
document_id-based partitioning and routing
persistent parse and embedding cache semantics
production verification under parallel execution

Without these, a pipeline may be able to write vectors, but it cannot safely maintain a correct vector index over time.

Core Design Direction

Zeta-first delivery

The first phase should be Zeta-first.

Flink and Spark support can be added later, but a feature that is only implemented in Zeta must not be documented as tri-engine ready.

Expected behavior:

Zeta should support Knowledge Sync routing first.
Flink and Spark should fail explicitly when a non-empty sink partition strategy is required but not supported.
Silent fallback is not acceptable.

Sink-declared partition and routing

Knowledge lifecycle sinks need to declare that they require records to be partitioned by specific fields, especially document_id.

The expected API direction is:

introduce a SinkPartitionStrategy
support modes such as NONE, HASH_BY_FIELDS, and ROUND_ROBIN
let SeaTunnelSink expose a default empty strategy
let lifecycle sinks declare HASH_BY_FIELDS with document_id

A lifecycle sink can declare user-facing options such as:

knowledge_sync.enabled = true
knowledge_sync.partition_keys = ["document_id"]

The engine should translate this declaration into native shuffle, keyBy, or repartition behavior.
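
One possible shape for this API is sketched below. Only the names SinkPartitionStrategy, NONE, HASH_BY_FIELDS, and ROUND_ROBIN come from this proposal. The outer class, the `route` and `checkSupported` helpers, the map-based record, and the engine-name check are assumptions for illustration. The real design would live in the SeaTunnel API and in the engine translation layer.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of a sink-declared partition strategy and its routing. */
final class SinkPartitionStrategySketch {

    enum Mode { NONE, HASH_BY_FIELDS, ROUND_ROBIN }

    /** Value object a sink could return to declare its routing requirement. */
    static final class SinkPartitionStrategy {
        final Mode mode;
        final List<String> fields;

        private SinkPartitionStrategy(Mode mode, List<String> fields) {
            this.mode = mode;
            this.fields = Collections.unmodifiableList(fields);
        }

        /** Default for sinks that have no routing requirement. */
        static SinkPartitionStrategy none() {
            return new SinkPartitionStrategy(Mode.NONE, Collections.<String>emptyList());
        }

        static SinkPartitionStrategy hashByFields(String... fields) {
            if (fields.length == 0) {
                throw new IllegalArgumentException("at least one field is required");
            }
            return new SinkPartitionStrategy(Mode.HASH_BY_FIELDS, Arrays.asList(fields.clone()));
        }

        boolean isEmpty() {
            return mode == Mode.NONE;
        }
    }

    /** Picks the writer index for a record so equal key values always share a writer. */
    static int route(SinkPartitionStrategy strategy, Map<String, Object> record, int parallelism) {
        if (strategy.mode != Mode.HASH_BY_FIELDS) {
            throw new IllegalArgumentException("route() only handles HASH_BY_FIELDS");
        }
        int hash = 17;
        for (String field : strategy.fields) {
            if (!record.containsKey(field)) {
                throw new IllegalArgumentException("missing partition field: " + field);
            }
            hash = 31 * hash + String.valueOf(record.get(field)).hashCode();
        }
        return Math.floorMod(hash, parallelism);
    }

    /** Engines without routing support fail explicitly instead of silently falling back. */
    static void checkSupported(String engine, SinkPartitionStrategy strategy) {
        boolean supportsRouting = "zeta".equalsIgnoreCase(engine); // assumption: Zeta-first
        if (!strategy.isEmpty() && !supportsRouting) {
            throw new UnsupportedOperationException(
                    "Engine '" + engine + "' does not support sink partition strategy " + strategy.mode);
        }
    }

    public static void main(String[] args) {
        SinkPartitionStrategy strategy = SinkPartitionStrategy.hashByFields("document_id");
        Map<String, Object> row = Collections.<String, Object>singletonMap("document_id", "doc_1");
        System.out.println("writer index: " + route(strategy, row, 4));
        checkSupported("zeta", strategy); // passes
    }
}
```

In this sketch a lifecycle sink would return `SinkPartitionStrategy.hashByFields("document_id")` when knowledge_sync.enabled is true, and every other sink would keep the empty default.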

Document and chunk semantic model

The first phase should define stable metadata keys:

DocumentId
DocumentHash
SourceUri
SourceVersion
SourceModifiedAt
MimeType
Deleted
ChunkId
ChunkHash
ChunkIndex
ParentDocumentId
AssetId
AssetHash
AssetUri
AssetRole
AssetModality

Important rules:

DocumentHash priority 1: source-native stable version
DocumentHash priority 2: raw bytes SHA-256 before parsing
DocumentHash priority 3: normalized text SHA-256
ChunkId: sha256(document_id + "|" + chunk_index + "|" + chunker_config_hash)
ChunkHash: sha256(chunk_text)

ChunkId should represent stable chunk identity.

ChunkHash should represent chunk content changes.

They must not be mixed together.
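
The ChunkId and ChunkHash rules above can be sketched with the JDK's MessageDigest. The class and method names are hypothetical, and hex encoding with UTF-8 input is an assumption, since the proposal does not fix an encoding.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical sketch of the proposed ChunkId and ChunkHash contracts. */
final class ChunkIdentity {

    /** Lowercase hex SHA-256 of the UTF-8 bytes of the input (encoding is an assumption). */
    static String sha256Hex(String input) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] bytes = digest.digest(input.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(bytes.length * 2);
            for (byte b : bytes) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide SHA-256.
            throw new IllegalStateException(e);
        }
    }

    /** Stable identity: depends on position and chunker config, never on chunk text. */
    static String chunkId(String documentId, int chunkIndex, String chunkerConfigHash) {
        return sha256Hex(documentId + "|" + chunkIndex + "|" + chunkerConfigHash);
    }

    /** Content fingerprint: changes whenever the chunk text changes. */
    static String chunkHash(String chunkText) {
        return sha256Hex(chunkText);
    }

    public static void main(String[] args) {
        String id = chunkId("doc_1", 0, "cfg_v1");
        String before = chunkHash("Install the agent.");
        String after = chunkHash("Install the agent on every node.");

        System.out.println("ChunkId: " + id);
        System.out.println("ChunkHash changed: " + !before.equals(after)); // true
    }
}
```

Because the chunk text is not an input to `chunkId`, editing a chunk changes only its ChunkHash. A sink can then upsert the same point in place instead of creating a new point and leaving the old one stale.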

Target Pipeline

The target pipeline is:

Knowledge Source, File Source, or Object Storage Source
Document Identity Resolver
Document Parse Transform
Document Asset Expand Transform
Document Chunk Transform
Metadata Projection or Extraction
Embedding Transform
Lifecycle Sink
Engine Translation Layer
Vector Store

Current Status

SeaTunnel already has several related capabilities.

Already available

distributed Source / Transform / Sink execution model
SeaTunnelRow.options
CatalogTable.metadataSchema
metadata propagation in transform base classes
SeaTunnelFlatMapTransform
Zeta runtime support for flatMap transform execution
file source and Markdown reader
binary read strategy and complete-file mode
Embedding transform
Qdrant basic vector sink
Milvus basic vector sink
Milvus support for reading row.options[Partition]
optional Markdown RAG metadata support:
- source_uri
- document_id
- chunk_id
- chunk_index
- content_hash

Partially available

Markdown RAG metadata

Current behavior:

document_id = "doc_" + sha256(source_uri)
content_hash = sha256(text)
chunk_id = "chunk_" + sha256(document_id + ":" + chunk_index + ":" + content_hash)

Gap:

no DocumentHash
no dedicated ChunkHash
chunk_id currently includes content hash, so chunk identity changes when content changes
this conflicts with the desired separation between stable ChunkId and content-based ChunkHash
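
To make the gap concrete, the current formulas listed above can be reproduced in a few lines. This is a standalone illustration, not the Markdown reader's actual code: the class name is hypothetical, and UTF-8 input with hex output is assumed.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical demo of the current Markdown RAG metadata formulas. */
final class CurrentMarkdownChunkIdDemo {

    static String sha256Hex(String input) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            StringBuilder hex = new StringBuilder(64);
            for (byte b : digest.digest(input.getBytes(StandardCharsets.UTF_8))) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    static String documentId(String sourceUri) {
        return "doc_" + sha256Hex(sourceUri);
    }

    static String contentHash(String text) {
        return sha256Hex(text);
    }

    /** Current behavior: the content hash is part of the chunk identity. */
    static String chunkId(String documentId, int chunkIndex, String contentHash) {
        return "chunk_" + sha256Hex(documentId + ":" + chunkIndex + ":" + contentHash);
    }

    public static void main(String[] args) {
        String doc = documentId("file:///docs/install.md"); // example URI, made up
        String before = chunkId(doc, 0, contentHash("Install the agent."));
        String after = chunkId(doc, 0, contentHash("Install the agent on every node."));

        // Same document, same position, edited text: the identity still changes.
        System.out.println("chunk_id stable across edit: " + before.equals(after)); // false
    }
}
```

With this scheme an edited chunk is written under a new chunk_id, so the previous point stays in the index unless something deletes it explicitly.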

Embedding reliability

There is ongoing work around model invocation reliability, including:

retry options
timeout options
response count validation
safe logging
metrics hooks
cache boundaries
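
As a rough illustration of how retry and response count validation fit together, here is a minimal sketch. It is not the ongoing implementation: the class and method names are hypothetical, the model is modeled as a plain function, and timeouts, backoff, logging, and metrics are left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

/** Hypothetical sketch: bounded retry around an embedding call with count validation. */
final class EmbeddingRetrySketch {

    static List<float[]> embedWithRetry(
            Function<List<String>, List<float[]>> model, List<String> texts, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                List<float[]> vectors = model.apply(texts);
                // Response count validation: exactly one vector per input text.
                int got = vectors == null ? 0 : vectors.size();
                if (got != texts.size()) {
                    throw new IllegalStateException(
                            "expected " + texts.size() + " vectors, got " + got);
                }
                return vectors;
            } catch (RuntimeException e) {
                last = e; // remember the failure and try again
            }
        }
        throw new IllegalStateException(
                "embedding failed after " + maxAttempts + " attempts", last);
    }

    public static void main(String[] args) {
        final int[] calls = {0};
        // Fake model: fails once, then returns one 1-dimensional vector per text.
        Function<List<String>, List<float[]>> flaky = texts -> {
            if (++calls[0] == 1) {
                throw new RuntimeException("transient failure");
            }
            List<float[]> out = new ArrayList<>();
            for (String text : texts) {
                out.add(new float[] {text.length()});
            }
            return out;
        };
        List<float[]> vectors = embedWithRetry(flaky, Arrays.asList("a", "bb"), 3);
        System.out.println(vectors.size() + " vectors after " + calls[0] + " calls"); // 2 vectors after 2 calls
    }
}
```

Treating a short or missing response as a failure keeps a partial batch from being written with vectors attached to the wrong chunks.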

Gap:

retry may still be disabled by default unless explicitly configured
the cache is not yet a persistent source-of-truth backend
metrics hooks still need production integration
failover cannot yet rebuild the cache view from the backend

Qdrant and Milvus sinks

Qdrant and Milvus can write vectors today, but they do not yet provide document-level lifecycle semantics.

Gap:

no knowledge_sync.enabled
no knowledge_sync.partition_keys
no read-old-chunks behavior
no hash comparison
no stale chunk deletion
no skip-unchanged behavior
no tombstone delete semantics

Proposed PR Train

PR-A: Gate 0 / ADR package
PR-B0: SinkPartitionStrategy API and Zeta translation
PR-B: Metadata contract
PR-C: Document identity and parse
PR-D: Document chunk
PR-E: Qdrant lifecycle sink V1
PR-F: Transformation cache
PR-G: Milvus lifecycle and vector base
PR-H: Document asset expand
PR-I: Dedicated knowledge sources

Are you willing to submit PR?

Yes, I am willing to submit a PR!

