[Discuss] Discuss adding VARIANT SqlType for semi-structured data · apache/seatunnel#10774


Labels: discussion, feature, help wanted

Repository metrics

Stars: (6,897 stars)
PR merge metrics: (Avg merge 13d 21h) (143 merged PRs in 30d)

Description

Search before asking

I searched existing issues with keywords such as VARIANT, SqlType, and parse_json. There is a related but different SQL Server variant connector issue, but I did not find an issue discussing a generic SeaTunnel API-level VARIANT/semi-structured type.

What would you like to be added?

I would like to start a discussion about whether SeaTunnel should add a VARIANT-like type to the API type system for semi-structured data.

Currently, SqlType does not include a generic semi-structured type:

public enum SqlType {
    ARRAY,
    MAP,
    STRING,
    BOOLEAN,
    TINYINT,
    SMALLINT,
    INT,
    BIGINT,
    FLOAT,
    DOUBLE,
    DECIMAL,
    NULL,
    BYTES,
    DATE,
    TIME,
    TIMESTAMP,
    TIMESTAMP_TZ,
    BINARY_VECTOR,
    FLOAT_VECTOR,
    FLOAT16_VECTOR,
    BFLOAT16_VECTOR,
    SPARSE_FLOAT_VECTOR,
    ROW,
    MULTIPLE_ROW;
}

SeaTunnel already has ARRAY, MAP, and ROW, which work well when the schema is known. However, CDC and JSON-oriented pipelines often need to carry semi-structured values whose shape may vary across rows or evolve frequently.

Why is this needed?

Some common scenarios are difficult to model cleanly today:

- CDC pipelines that need to preserve semi-structured source columns without converting everything to STRING.
- JSON/Kafka/MongoDB-style data where fields may be dynamic or partially unknown.
- Transform use cases such as PARSE_JSON, where users may want to parse a JSON string into a typed semi-structured value instead of a plain string.
- Sink mappings for systems that have native JSON/VARIANT-like types.

Current workarounds usually fall into two categories:

- Use STRING, which preserves the raw value but loses type semantics.
- Use MAP/ROW, which requires a more fixed schema and is less convenient for highly dynamic JSON payloads.

Proposal for discussion

This issue is intended as a design discussion first, not an immediate implementation request.

Possible directions:

- Add a new SqlType.VARIANT or SqlType.JSON.
- Add a corresponding SeaTunnelDataType, for example VariantType or JsonType.
- Define conversion rules for JSON format, CDC deserialization, catalog mapping, and common sinks.
- Add a PARSE_JSON transform/function later, either returning the new semi-structured type or requiring an explicit target schema.
- Document connector support as a compatibility matrix, because not every sink can store semi-structured values natively.
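
To make the first two directions concrete, here is a minimal sketch of what a VARIANT type descriptor could look like. Everything in it is hypothetical: SketchSqlType and SketchDataType are small stand-ins written for this example, not the real SeaTunnel classes, and VariantValue is only one possible runtime carrier (here it simply wraps the raw JSON text).

```java
import java.util.Objects;

// Hypothetical stand-in for SqlType, reduced to a few members plus the proposed VARIANT.
enum SketchSqlType { STRING, MAP, ROW, VARIANT }

// Hypothetical stand-in for a data type descriptor: a Java class plus a SqlType tag.
interface SketchDataType<T> {
    Class<T> getTypeClass();
    SketchSqlType getSqlType();
}

// Hypothetical runtime carrier for a semi-structured value (assumption: raw JSON text).
final class VariantValue {
    private final String rawJson;

    VariantValue(String rawJson) {
        this.rawJson = Objects.requireNonNull(rawJson, "rawJson");
    }

    String rawJson() {
        return rawJson;
    }
}

// The proposed type carries no parameters, so a single shared instance is enough.
final class VariantType implements SketchDataType<VariantValue> {
    static final VariantType INSTANCE = new VariantType();

    private VariantType() {}

    @Override
    public Class<VariantValue> getTypeClass() {
        return VariantValue.class;
    }

    @Override
    public SketchSqlType getSqlType() {
        return SketchSqlType.VARIANT;
    }

    @Override
    public String toString() {
        return "VARIANT";
    }
}
```

A connector or transform would then branch on getSqlType() the same way it already does for ARRAY, MAP, and ROW; the conversion rules themselves are left open here, as in the proposal.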

Open questions

- Should the type be named VARIANT, JSON, or something else?
- Should the physical representation preserve the original JSON text, use a structured object model, or support both?
- How should this interact with schema evolution events?
- Which connectors should support it in the first phase?
- Should SeaTunnel first add PARSE_JSON with explicit ROW/MAP output before introducing a new API type?
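
As one illustration of the "support both" option in the physical-representation question, the sketch below keeps the original text and builds a structured view only on first access. DualVariant and its parser argument are hypothetical names for this example; the parser is passed in as a plain function so that the sketch does not assume any particular JSON library or object model.

```java
import java.util.Objects;
import java.util.function.Function;

// Hypothetical value holder: raw text is always kept, the structured view is lazy.
final class DualVariant {
    private final String rawText;
    private final Function<String, Object> parser;
    private Object structured;        // cached result of parser.apply(rawText)
    private boolean structuredReady;  // separate flag, because the parsed value may be null

    DualVariant(String rawText, Function<String, Object> parser) {
        this.rawText = Objects.requireNonNull(rawText, "rawText");
        this.parser = Objects.requireNonNull(parser, "parser");
    }

    // Pass-through path: sinks that store JSON text never pay for parsing.
    String rawText() {
        return rawText;
    }

    // Structured path: parse once on first use, then reuse the cached result.
    synchronized Object structured() {
        if (!structuredReady) {
            structured = parser.apply(rawText);
            structuredReady = true;
        }
        return structured;
    }
}
```

The trade-off is memory versus fidelity: keeping the text preserves the source byte-for-byte (key order, whitespace, number formatting), while the cached structured view roughly doubles the footprint for values that are actually inspected.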

Compatibility

This can be introduced as an additive API capability. Existing behavior does not need to change unless a connector explicitly opts into the new type.
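
A minimal sketch of how such an opt-in could stay additive, assuming a capability flag that defaults to "not supported". SketchSink, supportsVariant(), and VariantFallback are hypothetical names for this example and are not part of the SeaTunnel connector API.

```java
// Hypothetical sink contract: the default method means existing sinks compile and behave as before.
interface SketchSink {
    default boolean supportsVariant() {
        return false;
    }
}

// An existing sink that was written before the new type and is left untouched.
final class LegacySink implements SketchSink {}

// A sink for a system with a native JSON/VARIANT-like column type, opting in explicitly.
final class JsonNativeSink implements SketchSink {
    @Override
    public boolean supportsVariant() {
        return true;
    }
}

final class VariantFallback {
    private VariantFallback() {}

    // Chooses the column type for a semi-structured field:
    // native VARIANT if the sink opted in, otherwise today's STRING workaround.
    static String columnTypeFor(SketchSink sink) {
        return sink.supportsVariant() ? "VARIANT" : "STRING";
    }
}
```

With this shape, VariantFallback.columnTypeFor(new LegacySink()) yields "STRING" and VariantFallback.columnTypeFor(new JsonNativeSink()) yields "VARIANT", so pipelines that never opt in keep their current behavior.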

Contributor guide

Research direction: Read the proposal, review the existing SqlType enum, and consider how a VARIANT type would integrate with connectors and transforms. Share your thoughts on the open questions in the issue.
Tech stack: Java
Domain: backend
Issue type: Research
Difficulty: 2
Estimated time: 1-3 hours
Activity status: Active
Clarity: Clear
Prerequisites: Java
Newbie friendliness: 65