apache/seatunnel

[Feature][Connector] Add BigQuery Sink Connector

Open

#10355 opened on Jan 17, 2026

batch import
help wanted

Description

Background

BigQuery is Google Cloud's serverless, highly scalable, and cost-effective multi-cloud data warehouse. It is widely used by enterprises globally for data analytics, business intelligence, and machine learning workloads.

Currently, SeaTunnel lacks native support for BigQuery as a sink, which limits its ability to integrate with the Google Cloud ecosystem efficiently.

Motivation

  • High Market Demand: BigQuery is a core service in Google Cloud Platform (GCP) with a large enterprise customer base
  • Cloud-Native Architecture: JDBC drivers exist for BigQuery, but they offer poorer performance and more limited functionality than the native SDK
  • Multi-Table Support: Users need to sync multiple tables from various sources to BigQuery in a single job
  • Advanced Features: A native connector can support:
    • Streaming inserts for real-time data ingestion
    • Table partitioning and clustering
    • Nested and repeated fields (STRUCT, ARRAY)
    • Integration with Cloud Storage for efficient bulk loading
    • Schema auto-detection and evolution

Proposed Solution

Implement a dedicated BigQuery Sink connector, built on the Google Cloud Java SDK, with multi-table support that follows SeaTunnel's standard architecture:

Core Features

  1. Multi-Table Support (Critical)
    • Support multiple tables in a single configuration
    • Table routing and mapping similar to CDC connectors
    • Per-table configuration (partition, clustering, schema)
    • Example:
    sink {
      BigQuery {
        project = "my-gcp-project"
        dataset = "my_dataset"
        
        # Multi-table configuration
        table-configs = [
          {
            table = "customers"
            partition_field = "order_date"
            partition_type = "DAY"
            clustering_fields = ["customer_id", "region"]
          },
          {
            table = "orders"
            partition_field = "created_at"
            partition_type = "HOUR"
          }
        ]
      }
    }
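
As a rough illustration of how the per-table options above might be checked before a job starts, here is a minimal Java sketch. `TableOptionValidator` and its `validate` method are hypothetical names, not existing SeaTunnel classes. The allowed partition types (HOUR, DAY, MONTH, YEAR) and the limit of four clustering columns follow BigQuery's documented constraints; the option names come from the example config, and everything else is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Hypothetical validator for one table-configs entry; not part of SeaTunnel. */
final class TableOptionValidator {
    // Time-unit partitioning granularities that BigQuery accepts.
    private static final Set<String> PARTITION_TYPES = Set.of("HOUR", "DAY", "MONTH", "YEAR");
    // BigQuery allows at most four clustering columns per table.
    private static final int MAX_CLUSTERING_FIELDS = 4;

    /** Returns human-readable problems; an empty list means the entry looks valid. */
    static List<String> validate(String table, String partitionType, List<String> clusteringFields) {
        List<String> problems = new ArrayList<>();
        if (table == null || table.isBlank()) {
            problems.add("table must be set");
        }
        // partition_type is optional, so only check it when present.
        if (partitionType != null && !PARTITION_TYPES.contains(partitionType)) {
            problems.add("partition_type must be one of HOUR, DAY, MONTH, YEAR but was: " + partitionType);
        }
        if (clusteringFields.size() > MAX_CLUSTERING_FIELDS) {
            problems.add("clustering_fields allows at most " + MAX_CLUSTERING_FIELDS
                    + " columns but got " + clusteringFields.size());
        }
        return problems;
    }
}
```

Calling `TableOptionValidator.validate("orders", "HOUR", List.of())` would return an empty list for the `orders` entry above, while an unsupported type such as `"WEEK"` would be reported at configuration time rather than as a failed BigQuery API call.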
    

Configuration Examples

Multi-Table Configuration

sink {
  BigQuery {
    project = "my-gcp-project"
    dataset = "analytics"
    
    # Authentication
    credentials_file = "/path/to/service-account.json"
    
    # Multi-table support
    table-configs = [
      {
        table = "user_events"
        write_mode = "streaming"
        partition_field = "event_timestamp"
        partition_type = "DAY"
        clustering_fields = ["user_id", "event_type"]
        create_disposition = "CREATE_IF_NEEDED"
      },
      {
        table = "orders"
        write_mode = "batch"
        partition_field = "order_date"
        partition_type = "MONTH"
        clustering_fields = ["customer_id"]
      },
      {
        table = "products"
        write_mode = "batch"
        create_disposition = "CREATE_NEVER"
      }
    ]
    
    # Global settings
    max_batch_size = 1000
    flush_interval_ms = 5000
    enable_auto_schema_update = true
  }
}
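
To make the intended interaction of `max_batch_size` and `flush_interval_ms` concrete, here is a minimal Java sketch of a buffer that flushes when either threshold is reached. `RowBuffer` is a hypothetical class, and the exact semantics (flush when the row count reaches the limit, or when the interval has elapsed since the last flush) are an assumption about how these options would behave. The clock is passed in explicitly so the logic is easy to test.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical size- and time-bounded buffer; names are illustrative, not SeaTunnel APIs. */
final class RowBuffer<T> {
    private final int maxBatchSize;        // assumed meaning of max_batch_size
    private final long flushIntervalMs;    // assumed meaning of flush_interval_ms
    private final Consumer<List<T>> flusher;
    private final List<T> rows = new ArrayList<>();
    private long lastFlushMs;

    RowBuffer(int maxBatchSize, long flushIntervalMs, long nowMs, Consumer<List<T>> flusher) {
        this.maxBatchSize = maxBatchSize;
        this.flushIntervalMs = flushIntervalMs;
        this.lastFlushMs = nowMs;
        this.flusher = flusher;
    }

    /** Buffers a row, then flushes if the batch is full or the interval has elapsed. */
    void add(T row, long nowMs) {
        rows.add(row);
        if (rows.size() >= maxBatchSize || nowMs - lastFlushMs >= flushIntervalMs) {
            flush(nowMs);
        }
    }

    /** Hands a copy of the buffered rows to the flusher and resets the timer. */
    void flush(long nowMs) {
        if (!rows.isEmpty()) {
            flusher.accept(new ArrayList<>(rows));
            rows.clear();
        }
        lastFlushMs = nowMs;
    }
}
```

In this sketch the interval is only checked when a new row arrives, so a real connector would presumably also call `flush` from a timer or at checkpoint time so that buffered rows are not held indefinitely.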

Technical Considerations

Dependencies

  • google-cloud-bigquery SDK
  • google-cloud-storage (for batch loading)

Multi-Table Implementation

  • Follow SeaTunnel's CDC connector pattern for table discovery and routing
  • Use TableId for table identification
  • Support table filtering and mapping
  • Per-table configuration merging similar to JdbcSourceTableConfig
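
The routing and merging bullets above could be sketched roughly as follows. `TableConfigResolver` is a hypothetical helper, not an existing SeaTunnel class, and the precedence rule (global settings first, per-table entries overriding them) is an assumption; the issue does not specify which options may be overridden per table.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical resolver that merges global sink options with one table-configs entry. */
final class TableConfigResolver {
    private final Map<String, Object> globalOptions;
    private final Map<String, Map<String, Object>> perTable = new HashMap<>();

    TableConfigResolver(Map<String, Object> globalOptions, List<Map<String, Object>> tableConfigs) {
        this.globalOptions = new HashMap<>(globalOptions);
        // Index each entry by its "table" key so incoming rows can be routed by table name.
        for (Map<String, Object> entry : tableConfigs) {
            perTable.put((String) entry.get("table"), entry);
        }
    }

    /** Effective options for a target table: global settings, with per-table values on top. */
    Map<String, Object> resolve(String table) {
        Map<String, Object> entry = perTable.get(table);
        if (entry == null) {
            throw new IllegalArgumentException("No table-configs entry for table: " + table);
        }
        Map<String, Object> merged = new HashMap<>(globalOptions);
        merged.putAll(entry); // assumed precedence: per-table overrides global
        return merged;
    }
}
```

A writer would call `resolve` once per target table and cache the result. Failing fast on an unknown table is one possible design; silently falling back to global settings is another, and the right choice depends on how table filtering is meant to work.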

Authentication

  • Service account JSON key file (for example via credentials_file)
  • Application default credentials
  • Workload identity (for GKE)
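
One possible way to choose between these modes is sketched below. `AuthModeResolver` and its enum are hypothetical, and the rule is an assumption: use the key file when `credentials_file` is set, otherwise fall back to Application Default Credentials. On GKE, Workload Identity credentials are normally picked up through Application Default Credentials, so it may not need a separate branch.

```java
import java.util.Map;

/** Hypothetical helper that picks an authentication mode from the sink options. */
final class AuthModeResolver {
    enum AuthMode { SERVICE_ACCOUNT_FILE, APPLICATION_DEFAULT }

    /** Prefers an explicit credentials_file; otherwise relies on the environment's default credentials. */
    static AuthMode resolve(Map<String, String> options) {
        String path = options.get("credentials_file");
        if (path != null && !path.isBlank()) {
            return AuthMode.SERVICE_ACCOUNT_FILE;
        }
        // Covers gcloud user credentials, attached service accounts, and (assumed) GKE Workload Identity.
        return AuthMode.APPLICATION_DEFAULT;
    }
}
```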

Testing

  • BigQuery emulator for unit tests
  • Integration tests with test project
  • Multi-table test cases
