[Umbrella][Docs] P0 Documentation Optimization — Tracking · apache/seatunnel#10979


help wanted

Description

We need community help to improve several P0-level Apache SeaTunnel documents.

Based on recent documentation Q&A feedback, users are repeatedly asking about production operations rather than basic usage. The most frequent gaps are around long-running CDC jobs, Zeta state storage, checkpoint/savepoint behavior, REST API job lifecycle, Docker/Kubernetes/EKS job submission, and the capability boundary of multi-table transform/join.

This issue tracks all P0 documentation improvements in one place. No sub-issues are used. Please claim work directly in this issue before starting implementation.

Documentation items open for claim

| Priority | Area | Document Item | Suggested Location | Contributor | Status |
| --- | --- | --- | --- | --- | --- |
| P0 | Zeta Engine | Zeta State Storage / IMap / WAL / Checkpoint / Savepoint Guide | `docs/en/engines/zeta/checkpoint-storage.md`, `docs/zh/engines/zeta/checkpoint-storage.md`, or new `state-storage-and-recovery.md` | @dybyte | Doing |
| P0 | CDC | CDC Production Cookbook: Full + Incremental Synchronization | New `docs/en/connectors/cdc-production-cookbook.md`, `docs/zh/connectors/cdc-production-cookbook.md`; link from CDC connector pages | | Todo |
| P0 | REST API | REST API Job Lifecycle Cookbook | Extend `docs/en/engines/zeta/rest-api-v2.md`, `docs/zh/engines/zeta/rest-api-v2.md`, or new `rest-api-job-lifecycle.md` | | Todo |
| P0 | Deployment | Docker / Kubernetes / EKS Job Submission Guide | Extend Docker/Kubernetes/Helm docs or add `submit-job-to-remote-zeta-cluster.md` | | Todo |
| P0 | Transform | Multi-table Transform / Join / EtLT Capability Boundary | Extend transform docs or add `multi-table-transform-and-join-boundary.md` | | Todo |

Task list

Claim a documentation item before starting implementation.
Update the Contributor and Status columns after a documentation item is claimed.
Add the PR link after work starts.
Update both `docs/en/...` and `docs/zh/...` where possible.
Update `docs/sidebars.js` if new pages are added.
Add cross-links from existing related documents to the new or updated guides.
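
A contributor could check the "update both `docs/en/...` and `docs/zh/...`" item locally before opening a PR. The following is a minimal sketch, not part of the SeaTunnel tooling: the function name `missing_translations` is hypothetical, and it assumes the English and Chinese trees mirror each other path for path under a common `docs` root.

```python
from pathlib import Path


def missing_translations(docs_root: Path, src: str = "en", dst: str = "zh") -> list[str]:
    """Return relative paths of Markdown pages found under src but absent under dst."""
    src_dir = docs_root / src
    dst_dir = docs_root / dst
    missing = []
    for page in sorted(src_dir.rglob("*.md")):
        rel = page.relative_to(src_dir)
        # Assumption: a translated page lives at the same relative path in the other tree.
        if not (dst_dir / rel).exists():
            missing.append(rel.as_posix())
    return missing
```

Running `missing_translations(Path("docs"))` from the repository root would list the English pages that still need a Chinese counterpart, under the mirroring assumption above.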

P0 documentation checklist

Problem details

P0-1. Zeta State Storage / IMap / WAL / Checkpoint / Savepoint Guide

Users often cannot distinguish between checkpoint storage, savepoint, IMap/MapStore, WAL, running job state, and job metrics storage. This causes confusion in production when long-running jobs generate large storage directories in HDFS/S3/OSS/MinIO/local paths.

Typical questions include:

What is stored in checkpoint storage?
What is IMap/MapStore used for in Zeta?
Why are WAL or state-related directories growing?
What is the difference between checkpoint and savepoint?
Which directories can be safely cleaned and which must not be deleted?
Does `history-job-expire-minutes` clean checkpoint/state files or only expired job metadata?
How should long-running CDC jobs plan checkpoint/state storage capacity?

Suggested sidebar location:

Engines
  -> SeaTunnel Engine (Zeta)
     -> checkpoint-storage
     -> state-storage-and-recovery

P0-2. CDC Production Cookbook: Full + Incremental Synchronization

CDC users frequently ask how to configure production-grade full + incremental synchronization, especially for MySQL/PostgreSQL/Oracle CDC to Doris, StarRocks, Kafka, JDBC targets, etc. The current connector pages provide parameters, but users still need an end-to-end production cookbook.

Typical questions include:

What does `startup.mode = initial` really mean?
Does CDC support batch mode?
Why is `server-id` required for MySQL CDC?
Is GTID required?
How does `checkpoint.interval` affect CDC latency and 2PC commit frequency?
How should Doris/StarRocks 2PC be configured?
How should schema evolution and DDL changes be handled?
How can users observe CDC delay and troubleshoot replication problems?
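
The question about `checkpoint.interval` can be made concrete with a little arithmetic. The sketch below is a back-of-the-envelope model, not a description of SeaTunnel internals: it assumes that a 2PC sink commits exactly once per completed checkpoint, and the function names and the `commit_overhead_ms` parameter are hypothetical.

```python
def commits_per_hour(checkpoint_interval_ms: int) -> float:
    """Upper bound on sink commits per hour if every checkpoint triggers one commit."""
    if checkpoint_interval_ms <= 0:
        raise ValueError("checkpoint interval must be positive")
    return 3_600_000 / checkpoint_interval_ms


def worst_case_visibility_delay_s(checkpoint_interval_ms: int, commit_overhead_ms: int = 0) -> float:
    """A row written right after a checkpoint waits roughly one full interval for the next commit."""
    return (checkpoint_interval_ms + commit_overhead_ms) / 1000


# Compare a few candidate intervals (values in milliseconds).
for interval_ms in (5_000, 30_000, 120_000):
    print(interval_ms, commits_per_hour(interval_ms), worst_case_visibility_delay_s(interval_ms))
```

Under this model a shorter interval lowers end-to-end latency but multiplies the number of transactions the target has to absorb, which is the trade-off the cookbook should spell out for Doris and StarRocks.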

P0-3. REST API Job Lifecycle Cookbook

The REST API documentation currently works mainly as an API reference. Users need a practical job lifecycle guide that explains how to submit, query, stop, cancel, recover, and inspect jobs through the REST API.

Typical questions include:

How do I submit a job through the REST API?
How do I stop a job and keep its checkpoint/savepoint state?
What is the difference between cancel, stop, and stop-with-savepoint?
How do I query job status and logs?
Why does `job-info` become slow when many jobs exist?
How do I use the REST API with authentication enabled?
How do I submit jobs with multiple transforms in JSON format?
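
A lifecycle guide would likely walk through submit, inspect, and stop as one sequence. The sketch below only builds the HTTP requests and sends nothing by itself. The base URL, the endpoint paths (`/submit-job`, `/job-info/{jobId}`, `/stop-job`), and the body fields (`jobId`, `isStopWithSavePoint`) are assumptions that should be verified against `rest-api-v2.md` for the SeaTunnel version in use. The helper names are hypothetical, and authentication is not covered.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "http://localhost:8080"  # assumption: address of the Zeta REST API v2 endpoint


def build_submit(job_config: dict, job_name: str) -> urllib.request.Request:
    """Build a POST that submits a JSON job definition (env/source/transform/sink)."""
    url = f"{BASE_URL}/submit-job?jobName={urllib.parse.quote(job_name)}"
    body = json.dumps(job_config).encode("utf-8")
    return urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}, method="POST"
    )


def build_job_info(job_id: int) -> urllib.request.Request:
    """Build a GET that fetches the status and details of one job."""
    return urllib.request.Request(f"{BASE_URL}/job-info/{job_id}", method="GET")


def build_stop(job_id: int, with_savepoint: bool) -> urllib.request.Request:
    """Build a POST that stops a job, optionally asking for a savepoint first."""
    body = json.dumps({"jobId": job_id, "isStopWithSavePoint": with_savepoint}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/stop-job", data=body,
        headers={"Content-Type": "application/json"}, method="POST"
    )


def send(request: urllib.request.Request) -> dict:
    """Send a prepared request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)
```

Keeping request construction separate from `send` makes each step easy to inspect or log before it reaches a production cluster, for example `send(build_stop(job_id, with_savepoint=True))` once the endpoint and field names are confirmed.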

Suggested sidebar location:

Engines
  -> SeaTunnel Engine (Zeta)
     -> REST API
        -> rest-api-v1
        -> rest-api-v2
        -> rest-api-job-lifecycle
        -> security

P0-4. Docker / Kubernetes / EKS Job Submission Guide

Users can find deployment documents, but many still do not know how to submit and operate jobs after deployment, especially in Docker, Kubernetes, Helm, and EKS scenarios.

Typical questions include:

After deploying SeaTunnel on Kubernetes, how should I run a job?
Can I submit a job from my local laptop to a remote Zeta cluster?
How should `--master` and `--deploy-mode` be used?
How should a Docker client connect to master/worker nodes?
How should a Helm deployment expose the REST API or client submission endpoints?
What is the difference between local mode, cluster mode, hybrid mode, and separated mode from an operational perspective?

Suggested sidebar location:

Getting Started
  -> Docker
  -> Kubernetes
  -> Submit Job To Remote Zeta Cluster

P0-5. Multi-table Transform / Join / EtLT Capability Boundary

Many users expect multi-table CDC to support arbitrary joins and business-entity reshaping inside the synchronization pipeline. The current multi-table transform documentation should more explicitly describe what is supported and what is not supported.

Typical questions include:

Can two CDC tables be joined and written into one target table?
Can multiple sources be joined in SQL Transform?
Is TableMerge the same as SQL join?
Can different tables use different transform rules?
Can a multi-table CDC job add a common `load_time` field to all tables?
When should users use downstream ELT/EtLT instead of in-pipeline transform?

Suggested sidebar location:

Transforms
  -> transform-multi-table
  -> table-merge
  -> multi-table-transform-and-join-boundary

Acceptance criteria

All P0 documents are added or updated in both `docs/en` and `docs/zh` where possible.
`docs/sidebars.js` is updated if new pages are added.
Each new/updated document includes production-oriented examples, not only parameter references.
Each document includes a troubleshooting section.
Each document clearly states capability boundaries and safe operation rules.
Existing related pages contain cross-links to the new guides.
The documentation can answer the most common user questions without requiring maintainers to repeatedly explain the same concepts in GitHub issues, Slack, or mailing lists.

Code of Conduct

I agree to follow this project's Code of Conduct.

Contributor guide

Tech stack: None
Domain: documentation, backend
Issue type: documentation
Difficulty: 3
Estimated time: over 1 week
Activity status: active
Clarity: mostly clear
Prerequisites: Git
Beginner-friendly: 50
Suggested approach: Review the issue and pick one of the listed P0 documentation items. Start by reading the existing documentation in the suggested locations, then follow the checklist to add content. Update the tables in the issue and submit a PR with both the English and Chinese versions.