[Feature][Zeta] Implement proper dry-run mode with progressive validation layers · apache/seatunnel#10681


Labels: feature, good first issue, help wanted

Description

Motivation

SeaTunnel currently lacks a meaningful dry-run capability. The existing `--check` flag (`SeaTunnelConfValidateCommand`) is effectively a stub: its `execute()` body contains only a `// TODO: validate config using new api` comment, so nothing is validated beyond checking that the config file path exists.

This means users must submit jobs to a cluster (or run them fully in local mode) just to discover problems like:

Typos in option names
Missing required fields
Wrong types for option values
Unreachable source/sink systems
Schema mismatches between source output and sink expectation
Broken transform SQL expressions

These failures account for the majority of user-filed bugs and support requests, yet they could all be caught before a single byte of data is read or written.

Proposed Solution

Introduce a proper --dry-run mode with four progressive validation layers, each independently selectable via --dry-run=<level>:

Layer 0: Static Analysis (no network, no I/O)

Triggered by: --dry-run=static

Config file syntax parsing (HOCON/YAML validity)
All Option keys validated against the connector's declared option set (catches typos, unknown fields)
Required options presence check
Option value type validation (e.g., string value in an integer field)
Plugin class loadability check (connector JAR exists and class is loadable)
DAG topology validation (at least one Source, at least one Sink, no cycles)
Transform SQL syntax parsing (parse-only, no execution)

Cost: milliseconds, zero external dependencies.
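
The option checks in Layer 0 (unknown keys, missing required options, wrong value types) could look roughly like the sketch below. `OptionSpec` and `StaticOptionValidator` are hypothetical names used only for illustration and are not SeaTunnel's existing option classes. The sketch collects every problem so the output is easy to inspect; a fast-fail variant would return at the first message.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical description of one declared option (not SeaTunnel's Option class).
final class OptionSpec {
    final Class<?> type;
    final boolean required;

    OptionSpec(Class<?> type, boolean required) {
        this.type = type;
        this.required = required;
    }
}

final class StaticOptionValidator {
    // Returns one message per problem; an empty list means the config block passed.
    static List<String> validate(Map<String, OptionSpec> declared, Map<String, Object> config) {
        List<String> errors = new ArrayList<>();
        // Pass 1: every configured key must be declared and carry the declared type.
        for (Map.Entry<String, Object> entry : config.entrySet()) {
            String key = entry.getKey();
            Object value = entry.getValue();
            OptionSpec spec = declared.get(key);
            if (spec == null) {
                errors.add("unknown option '" + key + "'");
            } else if (value != null && !spec.type.isInstance(value)) {
                errors.add("option '" + key + "' expects " + spec.type.getSimpleName()
                        + " but got " + value.getClass().getSimpleName());
            }
        }
        // Pass 2: every required option must be present.
        for (Map.Entry<String, OptionSpec> entry : declared.entrySet()) {
            if (entry.getValue().required && !config.containsKey(entry.getKey())) {
                errors.add("missing required option '" + entry.getKey() + "'");
            }
        }
        return errors;
    }
}
```

Because nothing here touches the network or the file system, a check of this shape stays within the "no network, no I/O" constraint of this layer.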

Layer 1: Connectivity Check (connect, no data)

Triggered by: --dry-run=connect

Establishes connections to source and sink systems
Validates credentials and permissions
Verifies source tables/topics/paths exist
Verifies sink write permissions and target existence (or createability)
Infers source schema
Checks source schema compatibility with sink schema (field names, type mapping)

Cost: seconds, requires live systems. Catches an estimated ~80% of real-world job failures before execution.
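
The schema compatibility step of Layer 1 might be sketched as follows, assuming both schemas have already been inferred and reduced to a field-name to type-name map. `SchemaCompatibility` is a hypothetical class, and the two widening rules are assumptions chosen for illustration; real rules would have to come from each connector's type mapping.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

final class SchemaCompatibility {
    // Assumed widening rules, for illustration only.
    private static final Set<String> WIDENINGS =
            new HashSet<>(Arrays.asList("INT->BIGINT", "FLOAT->DOUBLE"));

    // Returns the first incompatibility found, or empty if every source field
    // has a same-named sink field with an identical or wider type.
    static Optional<String> firstMismatch(Map<String, String> source, Map<String, String> sink) {
        for (Map.Entry<String, String> field : source.entrySet()) {
            String name = field.getKey();
            String sourceType = field.getValue();
            String sinkType = sink.get(name);
            if (sinkType == null) {
                return Optional.of("field '" + name + "' is missing in sink");
            }
            boolean compatible = sourceType.equals(sinkType)
                    || WIDENINGS.contains(sourceType + "->" + sinkType);
            if (!compatible) {
                return Optional.of("field '" + name + "': cannot map "
                        + sourceType + " to " + sinkType);
            }
        }
        return Optional.empty();
    }
}
```

Returning only the first mismatch keeps the sketch in line with the fast-fail behavior this proposal asks for, and the message names the offending field.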

Layer 2: Data Sampling (read N rows, sink to memory only)

Triggered by: --dry-run=sample[=N] (default N=100)

Reads up to N rows from the source
Passes data through the full transform chain
Validates transform logic on real data (NULL handling, type casts, SQL semantics)
Writes output to memory/Console — never to the real sink
For transactional sinks (JDBC, Kafka), opens a transaction/producer but always rolls back or aborts it and never commits
Prints a sample of the output rows for human review

Cost: seconds to minutes depending on source. Zero side effects on sink.
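
The core loop of Layer 2 can be sketched as below, under the simplifying assumption that the source is exposed as an `Iterator` and each transform as a `UnaryOperator`. `SampleRunner` is a hypothetical name; SeaTunnel's real source and transform interfaces are more involved.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.UnaryOperator;

final class SampleRunner {
    // Reads at most maxRows rows, runs each through the transform chain in order,
    // and keeps the results in memory. No sink is ever invoked.
    static <T> List<T> sample(Iterator<T> source, List<UnaryOperator<T>> transforms, int maxRows) {
        List<T> output = new ArrayList<>();
        while (output.size() < maxRows && source.hasNext()) {
            T row = source.next();
            for (UnaryOperator<T> transform : transforms) {
                row = transform.apply(row);
            }
            output.add(row);
        }
        return output;
    }
}
```

The returned list is what would be printed for human review. Since the loop stops as soon as `maxRows` rows are collected, the source is not read further than needed.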

Layer 3: Shadow Execution (full read, sink discarded)

Triggered by: --dry-run=shadow

Full source read + full transform execution
Sink replaced with a no-op implementation — no data written to target system
Generates complete statistics: row count, null distribution, throughput, checkpoint behavior
Useful for capacity planning and validating split enumeration logic at scale

Cost: same as a real job run. Only the final write is skipped.
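
A no-op sink for Layer 3 could discard every row while still accumulating statistics. The sketch below tracks only row count and per-field null counts, and assumes rows arrive as `Map<String, Object>`; `DiscardingSink` is a hypothetical class, not an existing SeaTunnel sink.

```java
import java.util.Collections;
import java.util.Map;
import java.util.TreeMap;

final class DiscardingSink {
    private long rowCount;
    private final Map<String, Long> nullCounts = new TreeMap<>();

    // Records statistics for the row, then drops it. Nothing is written anywhere.
    void write(Map<String, Object> row) {
        rowCount++;
        for (Map.Entry<String, Object> field : row.entrySet()) {
            if (field.getValue() == null) {
                nullCounts.merge(field.getKey(), 1L, Long::sum);
            }
        }
    }

    long rowCount() {
        return rowCount;
    }

    // Fields that never contained a null do not appear in the map.
    Map<String, Long> nullCounts() {
        return Collections.unmodifiableMap(nullCounts);
    }
}
```

Throughput and checkpoint statistics are left out here; they would need timing and engine hooks that this sketch does not model.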

Design Principles

Fast-fail per layer: on any error within a layer, stop immediately and report the exact location (connector index, option name, field name).
Strict no-side-effect guarantee for Layers 0–2: verified at the framework level, rather than relying on connector authors to remember to skip writes.
CI/CD-friendly exit codes: exit 0 on pass, non-zero with structured output on failure, so --dry-run=connect can be used as a pre-deploy gate in pipelines.
Layer independence: --dry-run=static does not require live systems; --dry-run=connect does not read data. Users can choose the trade-off between speed and coverage.
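
The fast-fail and exit-code principles above can be sketched together. In this minimal example each check is a `Supplier` that returns an empty `Optional` on success or a failure message. `FastFailRunner`, the exit value `1`, and the `key=value` output format are all assumptions; the proposal only requires a non-zero code and structured output.

```java
import java.util.List;
import java.util.Optional;
import java.util.function.Supplier;

final class FastFailRunner {
    static final int EXIT_OK = 0;
    static final int EXIT_FAILED = 1; // assumed value; the proposal only requires non-zero

    // Runs checks in order and stops at the first failure.
    static int run(List<Supplier<Optional<String>>> checks) {
        for (int i = 0; i < checks.size(); i++) {
            Optional<String> failure = checks.get(i).get();
            if (failure.isPresent()) {
                // Assumed one-line format that CI tooling can grep.
                System.err.println("status=FAILED check=" + i + " message=" + failure.get());
                return EXIT_FAILED;
            }
        }
        System.out.println("status=PASSED checks=" + checks.size());
        return EXIT_OK;
    }
}
```

A command entry point could pass the returned value to `System.exit`, which is what lets a pipeline treat the dry run as a pre-deploy gate.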

Current State vs Target

| Layer | Target Capability | Current State |
| --- | --- | --- |
| Layer 0 | Full option validation, DAG check, SQL parse | Not implemented (`--check` is a stub) |
| Layer 1 | Connectivity + schema inference + compatibility | Not implemented |
| Layer 2 | Data sampling through full transform chain | Not implemented |
| Layer 3 | Shadow full execution | Not implemented |

Suggested Implementation Priority

Layer 0 + Layer 1 should be addressed first. They require the least infrastructure investment and deliver the highest value-to-cost ratio. Most user-reported "job failed immediately" issues fall into these two layers.

Affected Modules

seatunnel-core/seatunnel-core-starter — AbstractCommandArgs, SeaTunnelConfValidateCommand
seatunnel-core/seatunnel-starter — ClientCommandArgs, ClientExecuteCommand
seatunnel-api — option validation framework
seatunnel-engine — execution mode extension
All connectors (Layer 1 requires a validateConnection() SPI method)
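
For the connector-side change, one possible shape of the `validateConnection()` SPI is sketched below. Only the method name comes from this proposal; the interface name `ConnectionValidator`, the `throws Exception` signature, and the `ConnectivityCheck` helper are assumptions.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical SPI; only the method name validateConnection() comes from the proposal.
interface ConnectionValidator {
    // Expected to open and close a connection without reading or writing data,
    // throwing if the system is unreachable or the credentials are rejected.
    void validateConnection() throws Exception;
}

final class ConnectivityCheck {
    // Returns the first failure, tagged with the connector index, or empty if all pass.
    static Optional<String> firstFailure(List<ConnectionValidator> connectors) {
        for (int i = 0; i < connectors.size(); i++) {
            try {
                connectors.get(i).validateConnection();
            } catch (Exception e) {
                return Optional.of("connector[" + i + "]: " + e.getMessage());
            }
        }
        return Optional.empty();
    }
}
```

Keeping the SPI to a single method would let connectors adopt it incrementally, although how the framework should treat connectors that do not implement it is left open here.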

Contributor guide

Research direction: Study the existing AbstractCommandArgs and the SeaTunnelConfValidateCommand stub in seatunnel-core-starter, along with the option validation framework in seatunnel-api. Understand the connector SPI and how option validation is currently done. Focus on implementing Layer 0 static analysis first: add config syntax parsing, option key validation, the required-fields check, and DAG topology validation. Use the existing TODO comment in SeaTunnelConfValidateCommand as the starting point.
Tech stack: Java
Domain: backend, devops
Issue type: Feature
Difficulty: 4
Estimated time: 3-5 days
Activity status: Active
Clarity: Clear
Prerequisites: Java, Git
Newbie friendliness: 25