StarRocks/starrocks

[Feature] Arrow Flight SQL: support Arrow IPC compression (LZ4/ZSTD) for DoGet responses

Open

#73,876 opened on May 26, 2026

Labels: good first issue, type/feature-request

Description

Feature request

Is your feature request related to a problem? Please describe.

StarRocks returns Arrow IPC data over Arrow Flight completely uncompressed. Confirmed at source level:

  • be/src/service/service_be/arrow_flight_sql_service.cpp: DoGetStatement returns RecordBatchStream(reader) with no IpcWriteOptions argument; the codec defaults to UNCOMPRESSED.
  • fe/.../ArrowFlightSqlService.java: FlightServer.builder has no .compressor() call; nothing compression-related exists on the builder.
  • No BE or FE config parameters exist to enable compression. Confirmed with StarRocks support (Rocky, 2026-05-22).

Client-side workarounds have no effect: grpc-encoding: gzip and grpc.default_compression_algorithm only compress client→server messages. The server must compress its own DoGet responses, which it does not.

Describe the solution you'd like

Add IpcWriteOptions with a codec to RecordBatchStream in DoGetStatement:

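// Start from the default write options (no codec, i.e. uncompressed).
arrow::ipc::IpcWriteOptions options = arrow::ipc::IpcWriteOptions::Defaults();
// Attach an LZ4 frame codec; on failure this returns the error status to the caller.
ARROW_ASSIGN_OR_RAISE(options.codec, arrow::util::Codec::Create(arrow::Compression::LZ4_FRAME));
return std::make_unique<arrow::flight::RecordBatchStream>(reader, options);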

The Arrow IPC format spec defines CompressionType with exactly two values: LZ4_FRAME and ZSTD (other codecs are not valid for IPC). One implementation note: in the Arrow C++ arrow::Compression enum, LZ4_FRAME (frame format, enum value 6) and LZ4 (raw/block format, enum value 5) are different on-wire formats, so a user-facing lz4 value must map to LZ4_FRAME.

Ideally exposed as a session variable (SET arrow_flight_compression = 'lz4') for per-connection control, with a cluster-level default via BE config.
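The mapping and fallback described above could be sketched roughly as follows. This is a standard-library-only illustration, not StarRocks code: IpcCodec, ParseIpcCodec, and ResolveIpcCodec are hypothetical names, the enum only mirrors the numeric values of the corresponding arrow::Compression entries, and the accepted spellings ("none", "lz4", "zstd") are assumptions.

```cpp
#include <algorithm>
#include <cctype>
#include <optional>
#include <string>

// Hypothetical local mirror of the arrow::Compression values that are
// relevant for IPC. Real code would use arrow::Compression::type directly.
enum class IpcCodec { kUncompressed = 0, kZstd = 4, kLz4Frame = 6 };

// Maps a user-facing setting to a codec. "lz4" deliberately resolves to the
// frame format, never the raw/block LZ4 format. Returns nullopt for values
// that are not valid IPC codecs (e.g. "snappy").
std::optional<IpcCodec> ParseIpcCodec(std::string value) {
    std::transform(value.begin(), value.end(), value.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    if (value == "none" || value == "uncompressed") return IpcCodec::kUncompressed;
    if (value == "lz4" || value == "lz4_frame") return IpcCodec::kLz4Frame;
    if (value == "zstd") return IpcCodec::kZstd;
    return std::nullopt;
}

// Session value wins when present and valid; otherwise the cluster-level
// default applies; an unparsable default falls back to uncompressed.
// (Assumption: a real SET statement would reject invalid values up front
// instead of silently falling back.)
IpcCodec ResolveIpcCodec(const std::optional<std::string>& session_value,
                         const std::string& be_default) {
    if (session_value) {
        if (auto codec = ParseIpcCodec(*session_value)) return *codec;
    }
    return ParseIpcCodec(be_default).value_or(IpcCodec::kUncompressed);
}
```

With this shape, DoGetStatement would only need to translate the resolved value into a codec before building the stream, and leaving both settings unset keeps today's uncompressed behavior.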

Describe alternatives you've considered

  • gRPC message-level compression: no effect on server→client DoGet responses without server-side configuration, and inferior to Arrow IPC compression regardless — gRPC compresses arbitrary byte frames rather than column-aligned record batches, breaking Arrow's zero-copy path and yielding worse compression ratios.

Additional context

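  • IpcWriteOptions has a min_space_savings field (only in newer Arrow C++ releases; the Arrow docs warn that data written with it may be unreadable by Arrow C++ readers older than 12.0.0, so the vendored Arrow version and client versions should be checked): it skips compression for a buffer when the savings, 1 - compressed_size / uncompressed_size, are below a threshold. Worth hardcoding a small default (e.g. 0.05) to avoid sending buffers that grow when compressed, as can happen with already-dense numeric columns.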
  • write_legacy_ipc_format must be false (the default) for compression to work; the legacy format does not support compression. Should be verified in ArrowFlightBatchReader.
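The min_space_savings rule from the first bullet can be illustrated with a small standalone check. This is a sketch of the decision only, under the assumption that savings are computed as 1 - compressed_size / uncompressed_size; SpaceSavings and ShouldSendCompressed are hypothetical helper names, and Arrow's own writer performs the equivalent check internally once the option is set.

```cpp
#include <cstdint>

// Fraction of bytes saved by compression; negative when the compressed
// buffer is larger than the original. Empty buffers report zero savings.
double SpaceSavings(std::int64_t uncompressed_size, std::int64_t compressed_size) {
    if (uncompressed_size <= 0) return 0.0;
    return 1.0 - static_cast<double>(compressed_size) /
                     static_cast<double>(uncompressed_size);
}

// True when the compressed form should be sent, i.e. the savings reach the
// configured threshold. Values exactly at the threshold are subject to
// floating-point rounding, so callers should not rely on boundary behavior.
bool ShouldSendCompressed(std::int64_t uncompressed_size,
                          std::int64_t compressed_size,
                          double min_space_savings) {
    return SpaceSavings(uncompressed_size, compressed_size) >= min_space_savings;
}
```

For example, with a threshold of 0.05, a 1000-byte buffer that compresses to 990 bytes (1% savings) would be sent uncompressed, while one that compresses to 600 bytes would be sent compressed.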
