[Feature][Zeta] Implement proper dry-run mode with progressive validation layers · apache/seatunnel#10681


Labels: feature, good first issue, help wanted

Description

Motivation

SeaTunnel currently lacks a meaningful dry-run capability. The existing --check flag (SeaTunnelConfValidateCommand) is effectively a stub — its execute() body contains only a // TODO: validate config using new api comment and performs no real validation beyond checking that the config file path exists.

This means users must submit jobs to a cluster (or run them fully in local mode) just to discover problems like:

Typos in option names
Missing required fields
Wrong types for option values
Unreachable source/sink systems
Schema mismatches between source output and sink expectation
Broken transform SQL expressions

These failures account for the majority of user-filed bugs and support requests, yet they could all be caught before a single byte of data is read or written.

Proposed Solution

Introduce a proper --dry-run mode with four progressive validation layers, each independently selectable via --dry-run=<level>:
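Parsing the level value itself could look like the following minimal sketch. `DryRunSpec` and its members are hypothetical names used only for illustration and are not existing SeaTunnel classes; the accepted values (`static`, `connect`, `sample[=N]`, `shadow`) and the default of 100 rows are taken from the layer descriptions in this proposal.

```java
import java.util.Locale;

/** Hypothetical holder for a parsed --dry-run=<level> value; not an existing SeaTunnel class. */
final class DryRunSpec {
    enum Level { STATIC, CONNECT, SAMPLE, SHADOW }

    static final int DEFAULT_SAMPLE_ROWS = 100; // default N from the proposal

    final Level level;
    final int sampleRows; // only meaningful when level == SAMPLE

    private DryRunSpec(Level level, int sampleRows) {
        this.level = level;
        this.sampleRows = sampleRows;
    }

    /** Parses values such as "static", "connect", "sample", "sample=500" or "shadow". */
    static DryRunSpec parse(String value) {
        String v = value.trim().toLowerCase(Locale.ROOT);
        int eq = v.indexOf('=');
        String name = eq >= 0 ? v.substring(0, eq) : v;
        String arg = eq >= 0 ? v.substring(eq + 1) : null;

        Level level;
        try {
            level = Level.valueOf(name.toUpperCase(Locale.ROOT));
        } catch (IllegalArgumentException e) {
            throw new IllegalArgumentException("unknown dry-run level: " + value, e);
        }
        if (arg != null && level != Level.SAMPLE) {
            throw new IllegalArgumentException("only 'sample' accepts a row count: " + value);
        }

        int rows = DEFAULT_SAMPLE_ROWS;
        if (arg != null) {
            try {
                rows = Integer.parseInt(arg);
            } catch (NumberFormatException e) {
                throw new IllegalArgumentException("invalid sample row count: " + value, e);
            }
            if (rows <= 0) {
                throw new IllegalArgumentException("sample row count must be positive: " + value);
            }
        }
        return new DryRunSpec(level, rows);
    }
}
```

Rejecting a row count on any level other than `sample` is a design choice of this sketch, not something the proposal specifies.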

Layer 0: Static Analysis (no network, no I/O)

Triggered by: --dry-run=static

Config file syntax parsing (HOCON/YAML validity)
All Option keys validated against the connector's declared option set (catches typos, unknown fields)
Required options presence check
Option value type validation (e.g., string value in an integer field)
Plugin class loadability check (connector JAR exists and class is loadable)
DAG topology validation (at least one Source, at least one Sink, no cycles)
Transform SQL syntax parsing (parse-only, no execution)

Cost: milliseconds, zero external dependencies.
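The three option checks in this layer (unknown keys, missing required options, wrong value types) can be illustrated with a small sketch. It deliberately does not use SeaTunnel's actual option classes: `OptionSpec` and `StaticOptionValidator` are hypothetical stand-ins, and the config is assumed to be an already-parsed map.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical declared option: key, expected Java type, and whether it is required. */
final class OptionSpec {
    final String key;
    final Class<?> type;
    final boolean required;

    OptionSpec(String key, Class<?> type, boolean required) {
        this.key = key;
        this.type = type;
        this.required = required;
    }
}

/** Hypothetical Layer 0 check; needs no network or I/O. */
final class StaticOptionValidator {

    /** Returns one message per problem found; an empty list means the config passed. */
    static List<String> validate(Map<String, OptionSpec> declared, Map<String, Object> config) {
        List<String> errors = new ArrayList<>();

        // 1. Keys that the connector never declared (typos, unknown fields).
        for (String key : config.keySet()) {
            if (!declared.containsKey(key)) {
                errors.add("unknown option '" + key + "'");
            }
        }

        // 2. Required options that are absent, and 3. values of the wrong type.
        for (OptionSpec spec : declared.values()) {
            Object value = config.get(spec.key);
            if (value == null) {
                if (spec.required) {
                    errors.add("missing required option '" + spec.key + "'");
                }
                continue;
            }
            if (!spec.type.isInstance(value)) {
                errors.add("option '" + spec.key + "' expects " + spec.type.getSimpleName()
                        + " but got " + value.getClass().getSimpleName());
            }
        }
        return errors;
    }
}
```

The sketch collects every problem so that a caller can decide how to report them; a caller that follows a strict fast-fail policy would report only the first entry.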

Layer 1: Connectivity Check (connect, no data)

Triggered by: --dry-run=connect

Establishes connections to source and sink systems
Validates credentials and permissions
Verifies source tables/topics/paths exist
Verifies sink write permissions and target existence (or createability)
Infers source schema
Checks source schema compatibility with sink schema (field names, type mapping)

Cost: seconds, requires live systems. Catches ~80% of real-world job failures before execution.
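The schema compatibility step (field names plus type mapping) might be sketched as below. This is an illustration under simplifying assumptions: schemas are reduced to maps from field name to a type name, and the `assignableTo` table of permitted type conversions is hypothetical rather than SeaTunnel's real type system.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical check of an inferred source schema against a sink schema. */
final class SchemaCompatibility {

    /**
     * @param sourceFields source field name -> type name (as inferred from the source)
     * @param sinkFields   sink field name -> type name
     * @param assignableTo source type -> sink types it may be written to without loss
     * @return one message per incompatible field; empty if compatible
     */
    static List<String> check(Map<String, String> sourceFields,
                              Map<String, String> sinkFields,
                              Map<String, Set<String>> assignableTo) {
        List<String> errors = new ArrayList<>();
        for (Map.Entry<String, String> field : sourceFields.entrySet()) {
            String name = field.getKey();
            String sourceType = field.getValue();
            String sinkType = sinkFields.get(name);

            if (sinkType == null) {
                errors.add("sink has no field '" + name + "'");
                continue;
            }
            boolean sameType = sourceType.equals(sinkType);
            boolean mapped = assignableTo
                    .getOrDefault(sourceType, Collections.<String>emptySet())
                    .contains(sinkType);
            if (!sameType && !mapped) {
                errors.add("field '" + name + "': cannot map " + sourceType + " to " + sinkType);
            }
        }
        return errors;
    }
}
```

Only source fields are checked against the sink here; whether extra sink fields (for example nullable columns) should also be reported is left open.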

Layer 2: Data Sampling (read N rows, sink to memory only)

Triggered by: --dry-run=sample[=N] (default N=100)

Reads up to N rows from the source
Passes data through the full transform chain
Validates transform logic on real data (NULL handling, type casts, SQL semantics)
Writes output to memory/Console — never to the real sink
For transactional sinks (JDBC, Kafka), opens a transaction/producer but always rolls back / does not commit
Prints a sample of the output rows for human review

Cost: seconds to minutes depending on source. Zero side effects on sink.
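The core loop of this layer (read at most N rows, run each through the transform chain, keep the result in memory) can be sketched as follows. `SampleRunner` is a hypothetical name, and the source and transforms are modelled as a plain `Iterator` and `Function` values rather than SeaTunnel's reader and transform interfaces.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Function;

/** Hypothetical Layer 2 driver: samples rows into memory and never touches a real sink. */
final class SampleRunner {

    /**
     * Reads up to maxRows rows from the source, applies the transforms in order,
     * and returns the transformed rows for review.
     */
    static <T> List<T> sample(Iterator<T> source, List<Function<T, T>> transforms, int maxRows) {
        List<T> output = new ArrayList<>();
        int read = 0;
        while (read < maxRows && source.hasNext()) {
            T row = source.next();
            read++;
            for (Function<T, T> transform : transforms) {
                row = transform.apply(row);
            }
            output.add(row); // in-memory only; printing or inspection happens afterwards
        }
        return output;
    }
}
```

The limit is applied to rows read rather than rows emitted, so the source is never consumed past N rows.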

Layer 3: Shadow Execution (full read, sink discarded)

Triggered by: --dry-run=shadow

Full source read + full transform execution
Sink replaced with a no-op implementation — no data written to target system
Generates complete statistics: row count, null distribution, throughput, checkpoint behavior
Useful for capacity planning and validating split enumeration logic at scale

Cost: same as a real job run. Only the final write is skipped.
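A no-op sink that still gathers statistics could look like the sketch below. `ShadowSink` is a hypothetical class that does not implement SeaTunnel's sink interfaces; rows are assumed to be maps from field name to value, and only row count and per-field null counts are tracked (throughput and checkpoint behavior are omitted).

```java
import java.util.Collections;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical no-op sink: discards every row but records statistics about it. */
final class ShadowSink {
    private long rowCount;
    private final Map<String, Long> nullCounts = new TreeMap<>();

    /** Accepts a row, updates the statistics, and writes nothing anywhere. */
    void write(Map<String, Object> row) {
        rowCount++;
        for (Map.Entry<String, Object> field : row.entrySet()) {
            if (field.getValue() == null) {
                nullCounts.merge(field.getKey(), 1L, Long::sum);
            }
        }
    }

    long rowCount() {
        return rowCount;
    }

    /** Null count per field; fields that were never null are absent. */
    Map<String, Long> nullCounts() {
        return Collections.unmodifiableMap(nullCounts);
    }

    /** Fraction of rows in which the given field was null (0.0 if no rows were seen). */
    double nullRatio(String field) {
        if (rowCount == 0) {
            return 0.0;
        }
        return nullCounts.getOrDefault(field, 0L) / (double) rowCount;
    }
}
```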

Design Principles

Fast-fail per layer: on any error within a layer, stop immediately and report the exact location (connector index, option name, field name).
Strict no-side-effect guarantee for Layers 0–2: enforced at the framework level rather than relying on connector authors to remember to skip writes.
CI/CD-friendly exit codes: exit 0 on pass, non-zero with structured output on failure, so --dry-run=connect can be used as a pre-deploy gate in pipelines.
Layer independence: --dry-run=static does not require live systems; --dry-run=connect does not read data. Users can choose the trade-off between speed and coverage.
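The fast-fail and exit-code principles can be combined in a small runner, sketched below. All names (`Check`, `FastFailRunner`), the exit code value 1, and the `key=value` report format are assumptions made for illustration; the proposal only requires exit 0 on pass and a non-zero code with structured output on failure.

```java
import java.util.List;
import java.util.Optional;
import java.util.function.Supplier;

/** Hypothetical single validation step; the body returns an error message or empty on success. */
final class Check {
    final String location; // e.g. connector index plus option or field name
    final Supplier<Optional<String>> body;

    Check(String location, Supplier<Optional<String>> body) {
        this.location = location;
        this.body = body;
    }
}

/** Hypothetical runner: stops at the first failing check and maps the outcome to an exit code. */
final class FastFailRunner {
    static final int EXIT_OK = 0;
    static final int EXIT_FAILED = 1; // assumed value; any non-zero code would satisfy the proposal

    static int run(String layer, List<Check> checks, StringBuilder report) {
        for (Check check : checks) {
            Optional<String> error = check.body.get();
            if (error.isPresent()) {
                // One machine-readable line describing exactly where the layer failed.
                report.append("layer=").append(layer)
                        .append(" location=").append(check.location)
                        .append(" error=").append(error.get())
                        .append('\n');
                return EXIT_FAILED; // later checks are never executed
            }
        }
        return EXIT_OK;
    }
}
```

A command-line entry point would pass the returned value to `System.exit`, which is what lets a pipeline use the dry run as a gate.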

Current State vs Target

| Layer | Target Capability | Current State |
| --- | --- | --- |
| Layer 0 | Full option validation, DAG check, SQL parse | Not implemented (`--check` is a stub) |
| Layer 1 | Connectivity + schema inference + compatibility | Not implemented |
| Layer 2 | Data sampling through full transform chain | Not implemented |
| Layer 3 | Shadow full execution | Not implemented |

Suggested Implementation Priority

Layer 0 + Layer 1 should be addressed first. They require the least infrastructure investment and deliver the highest value-to-cost ratio. Most user-reported "job failed immediately" issues fall into these two layers.

Affected Modules

seatunnel-core/seatunnel-core-starter — AbstractCommandArgs, SeaTunnelConfValidateCommand
seatunnel-core/seatunnel-starter — ClientCommandArgs, ClientExecuteCommand
seatunnel-api — option validation framework
seatunnel-engine — execution mode extension
All connectors (Layer 1 requires a validateConnection() SPI method)
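One possible shape for the `validateConnection()` SPI method is sketched below. Only the method name comes from this proposal; the interface name `ConnectionValidatable`, the result type, and the idea of a default implementation that reports "unsupported" (so existing connectors keep compiling until they opt in) are assumptions. `FakeSource` is a stand-in that performs no real network access.

```java
/** Hypothetical result of a Layer 1 connectivity check. */
final class ConnectionCheckResult {
    enum Status { OK, FAILED, UNSUPPORTED }

    final Status status;
    final String message;

    private ConnectionCheckResult(Status status, String message) {
        this.status = status;
        this.message = message;
    }

    static ConnectionCheckResult ok() {
        return new ConnectionCheckResult(Status.OK, "");
    }

    static ConnectionCheckResult failed(String message) {
        return new ConnectionCheckResult(Status.FAILED, message);
    }

    static ConnectionCheckResult unsupported() {
        return new ConnectionCheckResult(Status.UNSUPPORTED, "connector does not implement validateConnection()");
    }
}

/** Hypothetical SPI mixed into sources and sinks; connectors opt in by overriding the default. */
interface ConnectionValidatable {
    default ConnectionCheckResult validateConnection() {
        return ConnectionCheckResult.unsupported();
    }
}

/** Stand-in connector for illustration; a real one would open and close an actual connection. */
final class FakeSource implements ConnectionValidatable {
    private final boolean reachable;

    FakeSource(boolean reachable) {
        this.reachable = reachable;
    }

    @Override
    public ConnectionCheckResult validateConnection() {
        return reachable
                ? ConnectionCheckResult.ok()
                : ConnectionCheckResult.failed("source endpoint is unreachable");
    }
}
```

Distinguishing `UNSUPPORTED` from `FAILED` would let `--dry-run=connect` report which connectors were not checked instead of silently passing them.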

Contributor guide

Tech stack: Java
Domain: tooling, devops
Issue type: feature
Difficulty: 4
Estimated time: over 1 week
Activity status: active
Clarity: clear
Prerequisites: Java, Apache SeaTunnel, option validation, connector development
Newcomer friendliness: 25
Research direction: Start by examining the existing `SeaTunnelConfValidateCommand` in `seatunnel-core/seatunnel-core-starter` to understand the current stub implementation. Then explore the option validation framework in `seatunnel-api` to identify how to add validation for option keys, types, and required fields. For Layer 1, design a `validateConnection()` SPI method for connectors and implement it in a few representative connectors. The issue has clear design principles and a priority order; focus on Layers 0 and 1 first.