apache/seatunnel

[Feature][Zeta] Implement proper dry-run mode with progressive validation layers

Open

#10681 opened on Mar 31, 2026

Labels: feature, good first issue, help wanted

Description

Motivation

SeaTunnel currently lacks a meaningful dry-run capability. The existing --check flag (SeaTunnelConfValidateCommand) is effectively a stub — its execute() body contains only a // TODO: validate config using new api comment and performs no real validation beyond checking that the config file path exists.

This means users must submit jobs to a cluster (or run them fully in local mode) just to discover problems like:

  • Typos in option names
  • Missing required fields
  • Wrong types for option values
  • Unreachable source/sink systems
  • Schema mismatches between source output and sink expectation
  • Broken transform SQL expressions

These failures account for the majority of user-filed bugs and support requests, yet they could all be caught before a single byte of data is read or written.

Proposed Solution

Introduce a proper --dry-run mode with four progressive validation layers, each independently selectable via --dry-run=<level>:
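As an illustration, the `--dry-run=<level>` value could be parsed along these lines. This is a minimal sketch: `DryRunLevel` and `DryRunSpec` are hypothetical names, not existing SeaTunnel classes, and it assumes the text after `--dry-run=` arrives as a single string such as `static` or `sample=50`.

```java
import java.util.Locale;

/** Hypothetical levels for --dry-run=<level>; the names mirror this proposal only. */
enum DryRunLevel {
    STATIC, CONNECT, SAMPLE, SHADOW
}

/** Hypothetical parsed form of the value after "--dry-run=". */
final class DryRunSpec {
    static final int DEFAULT_SAMPLE_ROWS = 100; // default N from the proposal

    final DryRunLevel level;
    final int sampleRows; // only meaningful when level == SAMPLE

    private DryRunSpec(DryRunLevel level, int sampleRows) {
        this.level = level;
        this.sampleRows = sampleRows;
    }

    /** Parses values such as "static", "connect", "sample", "sample=50" or "shadow". */
    static DryRunSpec parse(String value) {
        String[] parts = value.trim().split("=", 2);
        DryRunLevel level;
        try {
            level = DryRunLevel.valueOf(parts[0].toUpperCase(Locale.ROOT));
        } catch (IllegalArgumentException e) {
            throw new IllegalArgumentException("Unknown dry-run level: " + parts[0]);
        }
        if (parts.length == 2 && level != DryRunLevel.SAMPLE) {
            throw new IllegalArgumentException("Only 'sample' accepts a row count");
        }
        int rows = DEFAULT_SAMPLE_ROWS;
        if (parts.length == 2) {
            // NumberFormatException is a subclass of IllegalArgumentException.
            rows = Integer.parseInt(parts[1]);
            if (rows <= 0) {
                throw new IllegalArgumentException("Sample size must be positive: " + rows);
            }
        }
        return new DryRunSpec(level, rows);
    }
}
```

Rejecting unknown levels with an exception keeps a mistyped flag from silently falling back to a real run.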


Layer 0: Static Analysis (no network, no I/O)

Triggered by: --dry-run=static

  • Config file syntax parsing (HOCON/YAML validity)
  • All Option keys validated against the connector's declared option set (catches typos, unknown fields)
  • Required options presence check
  • Option value type validation (e.g., string value in an integer field)
  • Plugin class loadability check (connector JAR exists and class is loadable)
  • DAG topology validation (at least one Source, at least one Sink, no cycles)
  • Transform SQL syntax parsing (parse-only, no execution)
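The option checks in this list (unknown keys, missing required options, wrong value types) might look roughly like the sketch below. `OptionSpec` and `StaticOptionValidator` are hypothetical stand-ins and do not reflect SeaTunnel's actual option API; the config is assumed to be already parsed into a flat map. The sketch collects all findings for readability, whereas a fast-fail variant would return at the first one.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical description of one declared connector option. */
final class OptionSpec {
    final String key;
    final Class<?> type;
    final boolean required;

    OptionSpec(String key, Class<?> type, boolean required) {
        this.key = key;
        this.type = type;
        this.required = required;
    }
}

/** Hypothetical static validator: needs no network and no I/O. */
final class StaticOptionValidator {
    static List<String> validate(List<OptionSpec> declared, Map<String, Object> config) {
        Map<String, OptionSpec> byKey = new LinkedHashMap<>();
        for (OptionSpec spec : declared) {
            byKey.put(spec.key, spec);
        }
        List<String> errors = new ArrayList<>();
        // Pass 1: every configured key must be declared and carry the declared type.
        for (Map.Entry<String, Object> entry : config.entrySet()) {
            OptionSpec spec = byKey.get(entry.getKey());
            Object value = entry.getValue();
            if (spec == null) {
                errors.add("unknown option '" + entry.getKey() + "'");
            } else if (value != null && !spec.type.isInstance(value)) {
                errors.add("option '" + spec.key + "' expects " + spec.type.getSimpleName()
                        + " but got " + value.getClass().getSimpleName());
            }
        }
        // Pass 2: every required option must be present.
        for (OptionSpec spec : declared) {
            if (spec.required && !config.containsKey(spec.key)) {
                errors.add("missing required option '" + spec.key + "'");
            }
        }
        return errors;
    }
}
```

With `url` declared as a required String and `port` as an optional Integer, a config containing `urll` and a string-valued `port` would yield three findings: the unknown key, the type mismatch, and the missing `url`.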

Cost: milliseconds, zero external dependencies.
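The DAG topology check from Layer 0 can be illustrated with Kahn's algorithm for cycle detection. The sketch assumes the job graph is available as a simple adjacency map of node names; `DagValidator` and this representation are hypothetical and are not how the engine models its DAG.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

/** Hypothetical topology check: at least one source, at least one sink, no cycles. */
final class DagValidator {
    /** Returns an error message, or Optional.empty() when the topology is valid. */
    static Optional<String> validate(Set<String> sources, Set<String> sinks,
                                     Map<String, List<String>> edges) {
        if (sources.isEmpty()) {
            return Optional.of("job needs at least one source");
        }
        if (sinks.isEmpty()) {
            return Optional.of("job needs at least one sink");
        }
        // Count incoming edges for every node that appears in the graph.
        Map<String, Integer> inDegree = new HashMap<>();
        for (Map.Entry<String, List<String>> e : edges.entrySet()) {
            inDegree.putIfAbsent(e.getKey(), 0);
            for (String to : e.getValue()) {
                inDegree.merge(to, 1, Integer::sum);
            }
        }
        // Kahn's algorithm: repeatedly remove nodes that have no incoming edges.
        Deque<String> ready = new ArrayDeque<>();
        for (Map.Entry<String, Integer> e : inDegree.entrySet()) {
            if (e.getValue() == 0) {
                ready.add(e.getKey());
            }
        }
        int visited = 0;
        while (!ready.isEmpty()) {
            String node = ready.poll();
            visited++;
            for (String next : edges.getOrDefault(node, Collections.emptyList())) {
                if (inDegree.merge(next, -1, Integer::sum) == 0) {
                    ready.add(next);
                }
            }
        }
        // Nodes that were never removed sit on, or downstream of, a cycle.
        return visited == inDegree.size()
                ? Optional.empty()
                : Optional.of("job graph contains a cycle");
    }
}
```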


Layer 1: Connectivity Check (connect, no data)

Triggered by: --dry-run=connect

  • Establishes connections to source and sink systems
  • Validates credentials and permissions
  • Verifies source tables/topics/paths exist
  • Verifies sink write permissions and target existence (or createability)
  • Infers source schema
  • Checks source schema compatibility with sink schema (field names, type mapping)

Cost: seconds, requires live systems. Catches ~80% of real-world job failures before execution.
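The schema compatibility step of Layer 1 could be sketched as follows. Everything here is an assumption made for illustration: schemas are reduced to maps from field name to a type-name string, the widening table is invented, and the check requires every source field to exist in the sink. Real connectors would use their own type mapping.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical field-by-field comparison of an inferred source schema with a sink schema. */
final class SchemaCompatibility {
    // Illustrative widening rules only: source type -> sink types that can hold it without loss.
    private static final Map<String, Set<String>> WIDENING = new HashMap<>();

    static {
        WIDENING.put("INT", new HashSet<>(Arrays.asList("BIGINT", "DOUBLE")));
        WIDENING.put("FLOAT", new HashSet<>(Arrays.asList("DOUBLE")));
    }

    /** Returns one message per incompatible field; an empty list means compatible. */
    static List<String> check(Map<String, String> source, Map<String, String> sink) {
        List<String> problems = new ArrayList<>();
        for (Map.Entry<String, String> field : source.entrySet()) {
            String name = field.getKey();
            String sinkType = sink.get(name);
            if (sinkType == null) {
                problems.add("sink has no field '" + name + "'");
            } else if (!assignable(field.getValue(), sinkType)) {
                problems.add("field '" + name + "': source type " + field.getValue()
                        + " is not assignable to sink type " + sinkType);
            }
        }
        return problems;
    }

    private static boolean assignable(String from, String to) {
        return from.equals(to)
                || WIDENING.getOrDefault(from, Collections.emptySet()).contains(to);
    }
}
```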


Layer 2: Data Sampling (read N rows, sink to memory only)

Triggered by: --dry-run=sample[=N] (default N=100)

  • Reads up to N rows from the source
  • Passes data through the full transform chain
  • Validates transform logic on real data (NULL handling, type casts, SQL semantics)
  • Writes output to memory/Console — never to the real sink
  • For transactional sinks (JDBC, Kafka), opens a transaction/producer but always rolls back / does not commit
  • Prints a sample of the output rows for human review

Cost: seconds to minutes depending on source. Zero side effects on sink.
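The core loop of Layer 2 can be sketched as reading at most N rows, applying the transform chain, and collecting the results in memory. `SampleRunner` is a hypothetical name, and modelling the source as an `Iterator` and each transform as a one-to-one `Function` is a simplification of the real source and transform APIs.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Function;

/** Hypothetical sampler: bounded read, full transform chain, in-memory output only. */
final class SampleRunner {
    static <T> List<T> sample(Iterator<T> source, List<Function<T, T>> transforms, int maxRows) {
        List<T> output = new ArrayList<>();
        int read = 0;
        // Stop pulling from the source as soon as maxRows rows have been read.
        while (read < maxRows && source.hasNext()) {
            T row = source.next();
            read++;
            for (Function<T, T> transform : transforms) {
                row = transform.apply(row);
            }
            output.add(row); // kept in memory; nothing reaches a real sink
        }
        return output;
    }
}
```

Because the loop checks the row budget before calling `hasNext()`, the source is never asked for more than N rows.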


Layer 3: Shadow Execution (full read, sink discarded)

Triggered by: --dry-run=shadow

  • Full source read + full transform execution
  • Sink replaced with a no-op implementation — no data written to target system
  • Generates complete statistics: row count, null distribution, throughput, checkpoint behavior
  • Useful for capacity planning and validating split enumeration logic at scale

Cost: same as a real job run. Only the final write is skipped.
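A no-op sink that still gathers statistics, as described for Layer 3, might look like this. `DiscardingSink` is a hypothetical class and rows are modelled as plain maps; it covers only row count and per-field null counts, not throughput or checkpoint behavior.

```java
import java.util.Collections;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical no-op sink: records statistics and drops every row. */
final class DiscardingSink {
    private long rowCount;
    private final Map<String, Long> nullCounts = new TreeMap<>();

    /** Accepts a row, updates the counters, and writes nothing anywhere. */
    void write(Map<String, Object> row) {
        rowCount++;
        for (Map.Entry<String, Object> field : row.entrySet()) {
            if (field.getValue() == null) {
                nullCounts.merge(field.getKey(), 1L, Long::sum);
            }
        }
    }

    long rowCount() {
        return rowCount;
    }

    /** Per-field count of null values seen so far; fields without nulls are absent. */
    Map<String, Long> nullCounts() {
        return Collections.unmodifiableMap(nullCounts);
    }
}
```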


Design Principles

  1. Fast-fail per layer: on any error within a layer, stop immediately and report the exact location (connector index, option name, field name).
  2. Strict no-side-effect guarantee for Layers 0–2: verified at the framework level, without relying on connector authors to remember to skip writes.
  3. CI/CD-friendly exit codes: exit 0 on pass, non-zero with structured output on failure, so --dry-run=connect can be used as a pre-deploy gate in pipelines.
  4. Layer independence: --dry-run=static does not require live systems; --dry-run=connect does not read data. Users can choose the trade-off between speed and coverage.
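Principles 1 and 3 together suggest a runner that stops at the first failing check, prints one structured line, and maps the outcome to an exit code. The sketch below is one possible shape. `ValidationFailure`, `Check`, `DryRunRunner`, the key=value output format, and the exit code 1 are all assumptions; the proposal only specifies 0 on pass and non-zero on failure.

```java
import java.io.PrintStream;
import java.util.List;

/** Hypothetical failure carrying the exact location of the problem. */
final class ValidationFailure extends Exception {
    final String layer;
    final int connectorIndex;
    final String location; // option name or field name

    ValidationFailure(String layer, int connectorIndex, String location, String message) {
        super(message);
        this.layer = layer;
        this.connectorIndex = connectorIndex;
        this.location = location;
    }
}

/** Hypothetical unit of validation inside a layer. */
interface Check {
    void run() throws ValidationFailure;
}

final class DryRunRunner {
    /** Runs checks in order; returns 0 if all pass, 1 at the first failure. */
    static int run(List<Check> checks, PrintStream out) {
        for (Check check : checks) {
            try {
                check.run();
            } catch (ValidationFailure f) {
                // One structured line so CI tooling can parse the result.
                out.println("layer=" + f.layer
                        + " connector=" + f.connectorIndex
                        + " location=" + f.location
                        + " message=\"" + f.getMessage() + "\"");
                return 1; // fast-fail: later checks are not executed
            }
        }
        return 0;
    }
}
```

A CLI entry point would pass the returned value to `System.exit`, which is what lets a pipeline gate a deployment on the dry run.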

Current State vs Target

| Layer   | Target Capability                               | Current State                       |
|---------|-------------------------------------------------|-------------------------------------|
| Layer 0 | Full option validation, DAG check, SQL parse    | Not implemented (--check is a stub) |
| Layer 1 | Connectivity + schema inference + compatibility | Not implemented                     |
| Layer 2 | Data sampling through full transform chain      | Not implemented                     |
| Layer 3 | Shadow full execution                           | Not implemented                     |

Suggested Implementation Priority

Layer 0 + Layer 1 should be addressed first. They require the least infrastructure investment and deliver the highest value-to-cost ratio. Most user-reported "job failed immediately" issues fall into these two layers.

Affected Modules

  • seatunnel-core/seatunnel-core-starter: AbstractCommandArgs, SeaTunnelConfValidateCommand
  • seatunnel-core/seatunnel-starter: ClientCommandArgs, ClientExecuteCommand
  • seatunnel-api: option validation framework
  • seatunnel-engine: execution mode extension
  • All connectors (Layer 1 requires a validateConnection() SPI method)
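One way to introduce a `validateConnection()` SPI method without breaking every existing connector is a default method that reports "skipped". The sketch below is only a suggestion: the interface name `ConnectionValidator`, the result type `ConnectionCheckResult`, and the two example connectors are hypothetical, and the proposal does not fix the method's signature.

```java
/** Hypothetical result of a Layer 1 connectivity check. */
final class ConnectionCheckResult {
    enum Status { OK, FAILED, SKIPPED }

    final Status status;
    final String detail;

    private ConnectionCheckResult(Status status, String detail) {
        this.status = status;
        this.detail = detail;
    }

    static ConnectionCheckResult ok() {
        return new ConnectionCheckResult(Status.OK, "");
    }

    static ConnectionCheckResult failed(String detail) {
        return new ConnectionCheckResult(Status.FAILED, detail);
    }

    static ConnectionCheckResult skipped(String detail) {
        return new ConnectionCheckResult(Status.SKIPPED, detail);
    }
}

/** Hypothetical SPI; the default keeps connectors that have not opted in compiling. */
interface ConnectionValidator {
    default ConnectionCheckResult validateConnection() {
        return ConnectionCheckResult.skipped("validateConnection() not implemented");
    }
}

/** A connector that has not been updated yet simply inherits the default. */
final class LegacyConnector implements ConnectionValidator {
}

/** A connector that opts in; the reachability flag stands in for a real connection attempt. */
final class CheckedConnector implements ConnectionValidator {
    private final boolean reachable;

    CheckedConnector(boolean reachable) {
        this.reachable = reachable;
    }

    @Override
    public ConnectionCheckResult validateConnection() {
        return reachable
                ? ConnectionCheckResult.ok()
                : ConnectionCheckResult.failed("target system is unreachable");
    }
}
```

Reporting SKIPPED rather than OK for connectors that have not opted in avoids giving users a false sense that connectivity was verified.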
