apache/seatunnel

[Feature][Zeta] Implement proper dry-run mode with progressive validation layers

Open

#10681 opened on Mar 31, 2026

Labels: feature, good first issue, help wanted

Description

Motivation

SeaTunnel currently lacks a meaningful dry-run capability. The existing --check flag (SeaTunnelConfValidateCommand) is effectively a stub: its execute() body contains only a "// TODO: validate config using new api" comment, so it performs no validation beyond checking that the config file path exists.

This means users must submit jobs to a cluster (or run them fully in local mode) just to discover problems like:

  • Typos in option names
  • Missing required fields
  • Wrong types for option values
  • Unreachable source/sink systems
  • Schema mismatches between source output and sink expectation
  • Broken transform SQL expressions

These failures account for the majority of user-filed bugs and support requests, yet they could all be caught before a single byte of data is read or written.

Proposed Solution

Introduce a proper --dry-run mode with four progressive validation layers, each independently selectable via --dry-run=<level>:


Layer 0: Static Analysis (no network, no I/O)

Triggered by: --dry-run=static

  • Config file syntax parsing (HOCON/YAML validity)
  • All Option keys validated against the connector's declared option set (catches typos, unknown fields)
  • Required options presence check
  • Option value type validation (e.g., string value in an integer field)
  • Plugin class loadability check (connector JAR exists and class is loadable)
  • DAG topology validation (at least one Source, at least one Sink, no cycles)
  • Transform SQL syntax parsing (parse-only, no execution)
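
The option checks in the list above (unknown keys, required options, value types) could look roughly like the following. This is a minimal sketch, assuming a connector exposes its options as a name-to-spec map; OptionSpec and StaticOptionCheck are hypothetical names for illustration and are not the existing seatunnel-api Option classes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical declaration of one connector option (not the real seatunnel-api Option). */
final class OptionSpec {
    final Class<?> type;
    final boolean required;

    OptionSpec(Class<?> type, boolean required) {
        this.type = type;
        this.required = required;
    }
}

/** Layer 0 sketch: validates a parsed config block against the declared option set. */
final class StaticOptionCheck {

    static List<String> validate(Map<String, OptionSpec> declared, Map<String, Object> config) {
        List<String> errors = new ArrayList<>();

        // Keys that are not declared are reported; this is what catches typos.
        for (String key : config.keySet()) {
            if (!declared.containsKey(key)) {
                errors.add("unknown option: " + key);
            }
        }

        // Declared options are checked for presence (if required) and for value type.
        for (Map.Entry<String, OptionSpec> e : declared.entrySet()) {
            Object value = config.get(e.getKey());
            if (value == null) {
                if (e.getValue().required) {
                    errors.add("missing required option: " + e.getKey());
                }
            } else if (!e.getValue().type.isInstance(value)) {
                errors.add("wrong type for option " + e.getKey()
                        + ": expected " + e.getValue().type.getSimpleName()
                        + ", got " + value.getClass().getSimpleName());
            }
        }
        return errors;
    }
}
```

The sketch collects all problems for one config block instead of stopping at the first; a fast-fail variant would return as soon as the list becomes non-empty.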

Cost: milliseconds, zero external dependencies.
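
The DAG topology rule above (at least one Source, at least one Sink, no cycles) can be sketched with Kahn's algorithm. The DagCheck class and its adjacency-map input are assumptions made for illustration; in particular, this sketch infers sources and sinks purely from topology (no upstream, no downstream), whereas the engine would presumably know each node's plugin type.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Layer 0 sketch: rejects job graphs without a source, without a sink, or with a cycle. */
final class DagCheck {

    /** edges maps a node name to its downstream nodes; nodes seen only as targets have none. */
    static List<String> validate(Map<String, List<String>> edges) {
        List<String> errors = new ArrayList<>();

        // Count incoming edges for every node that appears as a key or as a target.
        Map<String, Integer> inDegree = new HashMap<>();
        for (String node : edges.keySet()) {
            inDegree.putIfAbsent(node, 0);
        }
        for (List<String> targets : edges.values()) {
            for (String target : targets) {
                inDegree.merge(target, 1, Integer::sum);
            }
        }

        // Topology-only assumption: no upstream means source, no downstream means sink.
        boolean hasSource = inDegree.containsValue(0);
        boolean hasSink = inDegree.keySet().stream()
                .anyMatch(n -> edges.getOrDefault(n, List.of()).isEmpty());
        if (!hasSource) errors.add("no source: every node has an upstream");
        if (!hasSink) errors.add("no sink: every node has a downstream");

        // Kahn's algorithm: if a topological order cannot cover all nodes, a cycle exists.
        Deque<String> ready = new ArrayDeque<>();
        inDegree.forEach((node, degree) -> { if (degree == 0) ready.add(node); });
        int visited = 0;
        while (!ready.isEmpty()) {
            String node = ready.poll();
            visited++;
            for (String target : edges.getOrDefault(node, List.of())) {
                if (inDegree.merge(target, -1, Integer::sum) == 0) {
                    ready.add(target);
                }
            }
        }
        if (visited < inDegree.size()) errors.add("cycle detected");
        return errors;
    }
}
```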


Layer 1: Connectivity Check (connect, no data)

Triggered by: --dry-run=connect

  • Establishes connections to source and sink systems
  • Validates credentials and permissions
  • Verifies source tables/topics/paths exist
  • Verifies sink write permissions and target existence (or createability)
  • Infers source schema
  • Checks source schema compatibility with sink schema (field names, type mapping)

Cost: seconds, requires live systems. Catches ~80% of real-world job failures before execution.
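
The schema compatibility check (field names, type mapping) might be sketched as below. SchemaCompatibility, the simplified string type names, and the widening table are all assumptions for illustration; real type mapping is connector-specific and would use SeaTunnel's own type system.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Layer 1 sketch: checks that every sink field can be fed by a source field. */
final class SchemaCompatibility {

    // Assumed widening rules for illustration only: source type -> acceptable sink types.
    private static final Map<String, Set<String>> ASSIGNABLE = Map.of(
            "INT", Set.of("INT", "BIGINT", "DOUBLE"),
            "BIGINT", Set.of("BIGINT"),
            "DOUBLE", Set.of("DOUBLE"),
            "STRING", Set.of("STRING"));

    /** Both maps go from field name to a simplified type name. */
    static List<String> check(Map<String, String> source, Map<String, String> sink) {
        List<String> problems = new ArrayList<>();
        for (Map.Entry<String, String> field : sink.entrySet()) {
            String sourceType = source.get(field.getKey());
            if (sourceType == null) {
                problems.add("sink field '" + field.getKey() + "' has no matching source field");
            } else if (!ASSIGNABLE.getOrDefault(sourceType, Set.of()).contains(field.getValue())) {
                problems.add("field '" + field.getKey() + "': source type " + sourceType
                        + " is not assignable to sink type " + field.getValue());
            }
        }
        return problems;
    }
}
```

Source fields that the sink does not consume are ignored here; whether that should be a warning is a design decision this sketch does not take.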


Layer 2: Data Sampling (read N rows, sink to memory only)

Triggered by: --dry-run=sample[=N] (default N=100)

  • Reads up to N rows from the source
  • Passes data through the full transform chain
  • Validates transform logic on real data (NULL handling, type casts, SQL semantics)
  • Writes output to memory/Console, never to the real sink
  • For transactional sinks (JDBC, Kafka), opens a transaction/producer but always rolls back or aborts it; nothing is ever committed
  • Prints a sample of the output rows for human review

Cost: seconds to minutes depending on source. Zero side effects on sink.
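
The sampling loop described above (read up to N rows, run the full transform chain, keep output in memory) could be sketched as follows. SampleRun is a hypothetical helper; it models the source as an Iterator and each transform as a Function, which is a simplification of the real source and transform APIs.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Function;

/** Layer 2 sketch: pull at most `limit` rows, run the transform chain, keep results in memory. */
final class SampleRun {

    static <T> List<T> sample(Iterator<T> source, List<Function<T, T>> transforms, int limit) {
        List<T> inMemorySink = new ArrayList<>();
        // The size check comes first so the source is not asked for a row beyond the limit.
        while (inMemorySink.size() < limit && source.hasNext()) {
            T row = source.next();
            for (Function<T, T> transform : transforms) {
                // Any exception thrown here (bad cast, NULL handling) surfaces on real data.
                row = transform.apply(row);
            }
            inMemorySink.add(row);
        }
        return inMemorySink;
    }
}
```

The returned list is what would be printed for human review; nothing in this path holds a reference to the real sink.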


Layer 3: Shadow Execution (full read, sink discarded)

Triggered by: --dry-run=shadow

  • Full source read + full transform execution
  • Sink replaced with a no-op implementation — no data written to target system
  • Generates complete statistics: row count, null distribution, throughput, checkpoint behavior
  • Useful for capacity planning and validating split enumeration logic at scale

Cost: same as a real job run. Only the final write is skipped.
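
A no-op sink that still gathers statistics might look like the sketch below. DiscardingStatsSink is a hypothetical class, rows are modeled as plain maps, and only row count and per-field null counts are tracked; throughput and checkpoint behavior are left out.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Layer 3 sketch: a sink that discards rows but records row count and per-field null counts. */
final class DiscardingStatsSink {
    private long rowCount;
    private final Map<String, Long> nullCounts = new LinkedHashMap<>();

    /** Accepts a row (field name to value) and drops it after updating statistics. */
    void write(Map<String, Object> row) {
        rowCount++;
        for (Map.Entry<String, Object> field : row.entrySet()) {
            if (field.getValue() == null) {
                nullCounts.merge(field.getKey(), 1L, Long::sum);
            }
        }
        // Intentionally no write to any target system.
    }

    long rowCount() {
        return rowCount;
    }

    long nullCount(String field) {
        return nullCounts.getOrDefault(field, 0L);
    }
}
```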


Design Principles

  1. Fast-fail per layer: on any error within a layer, stop immediately and report the exact location (connector index, option name, field name).
  2. Strict no-side-effect guarantee for Layers 0–2: enforced at the framework level rather than relying on connector authors to remember to skip writes.
  3. CI/CD-friendly exit codes: exit 0 on pass, non-zero with structured output on failure, so --dry-run=connect can be used as a pre-deploy gate in pipelines.
  4. Layer independence: --dry-run=static does not require live systems; --dry-run=connect does not read data. Users can choose the trade-off between speed and coverage.
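
Parsing the value of --dry-run=<level>, including the optional row count of sample[=N], could be sketched like this. DryRunSpec and its members are hypothetical and not part of the existing command-argument classes; how the flag is wired into the CLI parser is not shown.

```java
import java.util.Locale;

/** Sketch of parsing the value of --dry-run=<level>; class and method names are hypothetical. */
final class DryRunSpec {
    enum Level { STATIC, CONNECT, SAMPLE, SHADOW }

    static final int DEFAULT_SAMPLE_ROWS = 100;

    final Level level;
    final int sampleRows; // only meaningful when level == SAMPLE

    private DryRunSpec(Level level, int sampleRows) {
        this.level = level;
        this.sampleRows = sampleRows;
    }

    /** Accepts "static", "connect", "sample", "sample=N" or "shadow". */
    static DryRunSpec parse(String value) {
        String[] parts = value.split("=", 2);
        Level level;
        try {
            level = Level.valueOf(parts[0].trim().toUpperCase(Locale.ROOT));
        } catch (IllegalArgumentException e) {
            throw new IllegalArgumentException("unknown dry-run level: " + parts[0]);
        }
        if (parts.length == 1) {
            return new DryRunSpec(level, DEFAULT_SAMPLE_ROWS);
        }
        if (level != Level.SAMPLE) {
            throw new IllegalArgumentException("only 'sample' accepts a row count");
        }
        // A non-numeric count throws NumberFormatException, a subclass of IllegalArgumentException.
        int rows = Integer.parseInt(parts[1].trim());
        if (rows <= 0) {
            throw new IllegalArgumentException("sample row count must be positive: " + rows);
        }
        return new DryRunSpec(level, rows);
    }
}
```

A caller would catch IllegalArgumentException, print the message, and exit with a non-zero code, which keeps the flag usable as a pipeline gate.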

Current State vs Target

| Layer | Target Capability | Current State |
| --- | --- | --- |
| Layer 0 | Full option validation, DAG check, SQL parse | Not implemented (--check is a stub) |
| Layer 1 | Connectivity + schema inference + compatibility | Not implemented |
| Layer 2 | Data sampling through full transform chain | Not implemented |
| Layer 3 | Shadow full execution | Not implemented |

Suggested Implementation Priority

Layer 0 + Layer 1 should be addressed first. They require the least infrastructure investment and deliver the highest value-to-cost ratio. Most user-reported "job failed immediately" issues fall into these two layers.

Affected Modules

  • seatunnel-core/seatunnel-core-starter: AbstractCommandArgs, SeaTunnelConfValidateCommand
  • seatunnel-core/seatunnel-starter: ClientCommandArgs, ClientExecuteCommand
  • seatunnel-api: option validation framework
  • seatunnel-engine: execution mode extension
  • All connectors (Layer 1 requires a validateConnection() SPI method)
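
One possible shape for the validateConnection() SPI method is sketched below. Only the method name comes from this proposal; the ConnectionValidatable interface, the Optional<String> return type, the default implementation, and the ExampleSource stub are assumptions for illustration.

```java
import java.util.Optional;

/** Hypothetical SPI mix-in; only the method name validateConnection() is from the proposal. */
interface ConnectionValidatable {

    /**
     * Returns an error message if the external system cannot be reached or used,
     * or empty if the check passed. The default treats connectors that have not
     * implemented the check as passing, so existing connectors keep compiling;
     * a real design might want to distinguish "skipped" from "passed".
     */
    default Optional<String> validateConnection() {
        return Optional.empty();
    }
}

/** Example connector stub that fails the check when no host is configured. */
final class ExampleSource implements ConnectionValidatable {
    private final String host;

    ExampleSource(String host) {
        this.host = host;
    }

    @Override
    public Optional<String> validateConnection() {
        // A real connector would open and close a connection here.
        if (host == null || host.isEmpty()) {
            return Optional.of("host is not configured");
        }
        return Optional.empty();
    }
}
```

A default method would let the SPI be rolled out connector by connector rather than requiring all connectors to change at once.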
