apache/seatunnel

[Discuss] Discuss adding VARIANT SqlType for semi-structured data

Open

#10774 opened on Apr 16, 2026

Labels: discussion, feature, help wanted

Description

Search before asking

I searched existing issues for keywords such as VARIANT, SqlType, and parse_json. I found one related but different issue about the SQL Server variant connector, but none that discusses a generic, API-level VARIANT or semi-structured type for SeaTunnel.

What would you like to be added?

I would like to start a discussion about whether SeaTunnel should add a VARIANT-like type to the API type system for semi-structured data.

Currently, SqlType does not include a generic semi-structured type:

public enum SqlType {
    ARRAY,
    MAP,
    STRING,
    BOOLEAN,
    TINYINT,
    SMALLINT,
    INT,
    BIGINT,
    FLOAT,
    DOUBLE,
    DECIMAL,
    NULL,
    BYTES,
    DATE,
    TIME,
    TIMESTAMP,
    TIMESTAMP_TZ,
    BINARY_VECTOR,
    FLOAT_VECTOR,
    FLOAT16_VECTOR,
    BFLOAT16_VECTOR,
    SPARSE_FLOAT_VECTOR,
    ROW,
    MULTIPLE_ROW;
}

SeaTunnel already has ARRAY, MAP, and ROW, which work well when the schema is known. However, CDC and JSON-oriented pipelines often need to carry semi-structured values whose shape may vary across rows or evolve frequently.

Why is this needed?

Some common scenarios are difficult to model cleanly today:

  1. CDC pipelines that need to preserve semi-structured source columns without converting everything to STRING.
  2. JSON/Kafka/MongoDB-style data where fields may be dynamic or partially unknown.
  3. Transform use cases such as PARSE_JSON, where users may want to parse a JSON string into a typed semi-structured value instead of a plain string.
  4. Sink mappings for systems that have native JSON/VARIANT-like types.

Current workarounds usually fall into two categories:

  • Use STRING, which preserves the raw value but loses type semantics: every consumer has to re-parse the text to get typed access.
  • Use MAP/ROW, which requires the schema to be fixed in advance and is inconvenient for highly dynamic JSON payloads.
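
To make the trade-off concrete, here is a minimal sketch in plain Java. It deliberately uses `java.util` collections as stand-ins rather than SeaTunnel classes, and `WorkaroundSketch`, `toStringMap`, and `toFixedRow` are hypothetical names invented for this illustration. It assumes the JSON payload has already been parsed into a `Map<String, Object>`.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: plain-Java stand-ins, not SeaTunnel API classes. */
final class WorkaroundSketch {

    /** MAP<STRING, STRING>-style workaround: every value is flattened to text. */
    static Map<String, String> toStringMap(Map<String, Object> payload) {
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<String, Object> e : payload.entrySet()) {
            // Numbers, booleans and nested lists all become strings here.
            out.put(e.getKey(), String.valueOf(e.getValue()));
        }
        return out;
    }

    /** ROW-style workaround: only fields named in the fixed schema survive. */
    static Object[] toFixedRow(List<String> fieldNames, Map<String, Object> payload) {
        Object[] row = new Object[fieldNames.size()];
        for (int i = 0; i < row.length; i++) {
            // Absent fields become null; fields not in the schema are dropped.
            row[i] = payload.get(fieldNames.get(i));
        }
        return row;
    }
}
```

With a payload such as `{"id": 7, "tags": ["a", "b"], "extra": true}`, the map variant turns the integer `7` into the string `"7"`, while a fixed row over `(id, name)` silently drops `tags` and `extra`. A semi-structured type would aim to avoid both losses.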

Proposal for discussion

This issue is intended as a design discussion first, not an immediate implementation request.

Possible directions:

  1. Add a new SqlType.VARIANT or SqlType.JSON.
  2. Add a corresponding SeaTunnelDataType, for example VariantType or JsonType.
  3. Define conversion rules for JSON format, CDC deserialization, catalog mapping, and common sinks.
  4. Add a PARSE_JSON transform/function later, either returning the new semi-structured type or requiring an explicit target schema.
  5. Document connector support as a compatibility matrix, because not every sink can store semi-structured values natively.
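
As a starting point for directions 1 and 2, the following is a rough sketch of what the type-level addition might look like. Everything here is hypothetical: `SketchSqlType` and `SketchDataType` are simplified stand-ins written for this issue and may not match the real `SqlType` / `SeaTunnelDataType` contracts, and `SketchVariant` / `SketchVariantType` are placeholder names pending the naming question below.

```java
import java.util.Objects;

/** Stand-in for the API enum; VARIANT is the hypothetical new constant. */
enum SketchSqlType { STRING, MAP, ROW, VARIANT }

/** Stand-in for a data type descriptor, reduced to two accessors. */
interface SketchDataType<T> {
    Class<T> getTypeClass();

    SketchSqlType getSqlType();
}

/** Hypothetical value holder; in this sketch it only wraps the JSON text. */
final class SketchVariant {
    private final String json;

    SketchVariant(String json) {
        this.json = Objects.requireNonNull(json, "json");
    }

    String toJson() {
        return json;
    }
}

/** Hypothetical type descriptor; a singleton because it has no parameters. */
final class SketchVariantType implements SketchDataType<SketchVariant> {
    static final SketchVariantType INSTANCE = new SketchVariantType();

    private SketchVariantType() {}

    @Override
    public Class<SketchVariant> getTypeClass() {
        return SketchVariant.class;
    }

    @Override
    public SketchSqlType getSqlType() {
        return SketchSqlType.VARIANT;
    }

    @Override
    public String toString() {
        return "VARIANT";
    }
}
```

Unlike ARRAY, MAP, or ROW, such a type would carry no element or field types, which is why a single shared instance is enough in this sketch.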

Open questions

  1. Should the type be named VARIANT, JSON, or something else?
  2. Should the physical representation preserve the original JSON text, use a structured object model, or support both?
  3. How should this interact with schema evolution events?
  4. Which connectors should support it in the first phase?
  5. Should SeaTunnel first add PARSE_JSON with explicit ROW/MAP output before introducing a new API type?
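
For question 2, one option worth weighing is a value that keeps the original text and builds the structured view only on demand. The sketch below is an assumption-laden illustration, not a proposed API: `LazyVariant` is a hypothetical name, and the parser is passed in as a `Function<String, Object>` so the example does not depend on any particular JSON library.

```java
import java.util.Objects;
import java.util.function.Function;

/** Hypothetical value that supports both representations. */
final class LazyVariant {
    private final String rawJson;
    private final Function<String, Object> parser;
    private Object parsed;      // cached structured view
    private boolean parsedOnce; // distinguishes "not parsed yet" from a parsed null

    LazyVariant(String rawJson, Function<String, Object> parser) {
        this.rawJson = Objects.requireNonNull(rawJson, "rawJson");
        this.parser = Objects.requireNonNull(parser, "parser");
    }

    /** Original text, returned unchanged (useful for pass-through sinks). */
    String raw() {
        return rawJson;
    }

    /** Structured view, parsed at most once and then cached. */
    synchronized Object structured() {
        if (!parsedOnce) {
            parsed = parser.apply(rawJson);
            parsedOnce = true;
        }
        return parsed;
    }
}
```

A pipeline that only forwards the value to a sink with a native JSON column would never pay the parsing cost, while a transform that needs field access calls `structured()`. The cost of this design is higher memory use once both forms are materialized, so it is only one point in the design space.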

Compatibility

This can be introduced as an additive API capability. Existing behavior does not need to change unless a connector explicitly opts into the new type.
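
One way the opt-in could work is a default method on the connector interface, so that existing implementations compile and behave as before. The sketch below assumes this approach; `VariantSupport`, `SketchSink`, and `VariantPlanner` are hypothetical names, and failing fast for unsupported sinks is just one possible policy (falling back to STRING by default would be another).

```java
/** Hypothetical capability levels a sink could declare. */
enum VariantSupport { NATIVE, AS_STRING, UNSUPPORTED }

/** Stand-in for a sink connector interface; not the real SeaTunnel sink API. */
interface SketchSink {
    String name();

    /** Default keeps existing connectors unchanged: they do not opt in. */
    default VariantSupport variantSupport() {
        return VariantSupport.UNSUPPORTED;
    }
}

/** Hypothetical planning step that decides how a VARIANT column is written. */
final class VariantPlanner {
    static String resolveColumnType(SketchSink sink) {
        switch (sink.variantSupport()) {
            case NATIVE:
                return "VARIANT"; // sink stores the value natively
            case AS_STRING:
                return "STRING";  // sink opted in, but serializes to text
            default:
                throw new UnsupportedOperationException(
                        "Sink '" + sink.name() + "' has not opted into VARIANT");
        }
    }
}
```

The per-connector values returned by `variantSupport()` would also be a natural source for the compatibility matrix mentioned in the proposal.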
