[Feature][Connector] Add BigQuery Sink Connector · apache/seatunnel#10355

3 comments · 0 reactions · 1 assignee · Java · 6,897 stars · 1,432 forks · batch import

help wanted

Description

Background

BigQuery is Google Cloud's serverless, highly scalable, and cost-effective multi-cloud data warehouse. It is widely used by enterprises globally for data analytics, business intelligence, and machine learning workloads.

Currently, SeaTunnel lacks native support for BigQuery as a sink, which limits its ability to integrate with the Google Cloud ecosystem efficiently.

Motivation

High Market Demand: BigQuery is a core Google Cloud Platform (GCP) service with a large enterprise customer base.
Cloud-Native Architecture: JDBC drivers exist for BigQuery, but they offer poor performance and limited functionality compared to the native SDK.
Multi-Table Support: A single job needs to be able to sync multiple tables from various sources to BigQuery.
Advanced Features: A native connector can support:
- Streaming inserts for real-time data ingestion
- Table partitioning and clustering
- Nested and repeated fields (STRUCT, ARRAY)
- Integration with Cloud Storage for efficient bulk loading
- Schema auto-detection and evolution

Proposed Solution

Implement a dedicated BigQuery Sink connector, built on the Google Cloud Java SDK, with multi-table support that follows SeaTunnel's standard architecture.

Core Features

Multi-Table Support (Critical)

Support multiple tables in a single configuration
Table routing and mapping, similar to the CDC connectors
Per-table configuration (partition, clustering, schema)
Example:

sink {
  BigQuery {
    project = "my-gcp-project"
    dataset = "my_dataset"
    
    # Multi-table configuration
    table-configs = [
      {
        table = "customers"
        partition_field = "order_date"
        partition_type = "DAY"
        clustering_fields = ["customer_id", "region"]
      },
      {
        table = "orders"
        partition_field = "created_at"
        partition_type = "HOUR"
      }
    ]
  }
}
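Per-table options like these could be checked before the job starts. The following is a minimal sketch, assuming a hypothetical `TableConfigValidator` helper (not an existing SeaTunnel class) and the option names from the example above. The two limits it encodes come from BigQuery itself: time partitioning is available at HOUR, DAY, MONTH, or YEAR granularity, and a table can have at most four clustering columns.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration only; not part of SeaTunnel or the BigQuery SDK.
final class TableConfigValidator {
    // Time-partitioning granularities that BigQuery supports.
    private static final Set<String> PARTITION_TYPES =
            new HashSet<>(Arrays.asList("HOUR", "DAY", "MONTH", "YEAR"));
    // BigQuery allows at most four clustering columns per table.
    private static final int MAX_CLUSTERING_FIELDS = 4;

    private TableConfigValidator() {}

    /** Returns null when the entry looks valid, otherwise a message for the first problem found. */
    static String validate(String table, String partitionType, List<String> clusteringFields) {
        if (table == null || table.trim().isEmpty()) {
            return "table must not be empty";
        }
        if (partitionType != null && !PARTITION_TYPES.contains(partitionType)) {
            return "unsupported partition_type: " + partitionType;
        }
        if (clusteringFields != null && clusteringFields.size() > MAX_CLUSTERING_FIELDS) {
            return "at most " + MAX_CLUSTERING_FIELDS + " clustering_fields are allowed";
        }
        return null;
    }
}
```

Failing fast at configuration time keeps a typo such as `partition_type = "WEEK"` from surfacing only when the first write to BigQuery is attempted.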

Configuration Examples

Multi-Table Configuration

sink {
  BigQuery {
    project = "my-gcp-project"
    dataset = "analytics"
    
    # Authentication
    credentials_file = "/path/to/service-account.json"
    
    # Multi-table support
    table-configs = [
      {
        table = "user_events"
        write_mode = "streaming"
        partition_field = "event_timestamp"
        partition_type = "DAY"
        clustering_fields = ["user_id", "event_type"]
        create_disposition = "CREATE_IF_NEEDED"
      },
      {
        table = "orders"
        write_mode = "batch"
        partition_field = "order_date"
        partition_type = "MONTH"
        clustering_fields = ["customer_id"]
      },
      {
        table = "products"
        write_mode = "batch"
        create_disposition = "CREATE_NEVER"
      }
    ]
    
    # Global settings
    max_batch_size = 1000
    flush_interval_ms = 5000
    enable_auto_schema_update = true
  }
}
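One possible reading of `max_batch_size` and `flush_interval_ms` is "flush when either threshold is reached". The sketch below illustrates that reading with a hypothetical `BatchBuffer` class; the class name, the method names, and the choice to pass the clock in as a parameter are assumptions made for illustration, not part of the proposal.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical buffer for illustration; a real sink writer would also flush on checkpoint and close.
final class BatchBuffer<T> {
    private final int maxBatchSize;
    private final long flushIntervalMs;
    private final List<T> rows = new ArrayList<>();
    private long lastFlushAtMs;

    BatchBuffer(int maxBatchSize, long flushIntervalMs, long nowMs) {
        this.maxBatchSize = maxBatchSize;
        this.flushIntervalMs = flushIntervalMs;
        this.lastFlushAtMs = nowMs;
    }

    /** Buffers a row; returns the rows to write if a threshold was hit, else an empty list. */
    List<T> add(T row, long nowMs) {
        rows.add(row);
        boolean full = rows.size() >= maxBatchSize;
        boolean stale = nowMs - lastFlushAtMs >= flushIntervalMs;
        if (full || stale) {
            return drain(nowMs);
        }
        return Collections.<T>emptyList();
    }

    /** Hands over everything buffered so far and restarts the interval timer. */
    List<T> drain(long nowMs) {
        List<T> batch = new ArrayList<>(rows);
        rows.clear();
        lastFlushAtMs = nowMs;
        return batch;
    }
}
```

With `max_batch_size = 1000` and `flush_interval_ms = 5000`, a busy table would flush every 1000 rows, while a quiet table would flush on the first row that arrives at least 5 seconds after the previous flush. A timer-driven flush for fully idle tables is left out of this sketch.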

Technical Considerations

Dependencies

google-cloud-bigquery SDK
google-cloud-storage (for batch loading)

Multi-Table Implementation

Follow SeaTunnel's CDC connector pattern for table discovery and routing
Use TableId for table identification
Support table filtering and mapping
Per-table configuration merging similar to JdbcSourceTableConfig
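Per-table merging could be as simple as layering each table's entry over the global settings. This is a minimal sketch of that idea using plain maps; `TableConfigMerger` is a hypothetical name, and the sketch does not reproduce how `JdbcSourceTableConfig` is actually implemented.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration; option keys mirror the configuration examples in this issue.
final class TableConfigMerger {
    private TableConfigMerger() {}

    /** Returns global settings overlaid with the per-table entry; per-table values win. */
    static Map<String, Object> merge(Map<String, Object> global, Map<String, Object> perTable) {
        Map<String, Object> merged = new LinkedHashMap<>(global);
        merged.putAll(perTable);
        return Collections.unmodifiableMap(merged);
    }

    /** Looks up the effective config for a table; tables without an entry get the global settings. */
    static Map<String, Object> resolve(
            Map<String, Map<String, Object>> tableConfigs,
            Map<String, Object> global,
            String table) {
        Map<String, Object> perTable = tableConfigs.get(table);
        if (perTable == null) {
            perTable = Collections.<String, Object>emptyMap();
        }
        return merge(global, perTable);
    }
}
```

Returning a new, unmodifiable map leaves the global settings untouched, so one table's overrides cannot leak into another table's effective configuration.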

Authentication

Service account JSON
Application default credentials
Workload identity (for GKE)
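A sketch of how the connector might choose between these options, assuming the `credentials_file` option from the configuration example. `AuthModeResolver` is a hypothetical name. The sketch also assumes that Workload Identity on GKE needs no separate mode, since Application Default Credentials normally pick up the pod's identity there.

```java
// Hypothetical resolver for illustration; it only decides which credential source to use.
final class AuthModeResolver {
    enum AuthMode {
        // Load the service account key from the path given in credentials_file.
        SERVICE_ACCOUNT_FILE,
        // Defer to Application Default Credentials (covers GKE Workload Identity as well).
        APPLICATION_DEFAULT
    }

    private AuthModeResolver() {}

    /** An explicit credentials_file takes precedence; otherwise fall back to the environment. */
    static AuthMode resolve(String credentialsFile) {
        boolean hasFile = credentialsFile != null && !credentialsFile.trim().isEmpty();
        return hasFile ? AuthMode.SERVICE_ACCOUNT_FILE : AuthMode.APPLICATION_DEFAULT;
    }
}
```

In a real implementation the two branches would likely map to `ServiceAccountCredentials.fromStream(...)` and `GoogleCredentials.getApplicationDefault()` from the Google auth library.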

Testing

BigQuery emulator for unit tests
Integration tests with test project
Multi-table test cases

Tech stack: java
Domain: data, database, cloud
Issue type: feature
Difficulty: 5
Estimated time: over 1 week
Activity status: blocked
Clarity: clear
Prerequisites: Java, Google Cloud BigQuery, SeaTunnel connector architecture, multi-table sink patterns
Newbie friendliness: 15
Research direction: Review existing SeaTunnel sink connectors (e.g., the JDBC sink) for architecture patterns. Examine the BigQuery Java SDK and Storage Write API documentation. Study the CDC connector pattern for multi-table support. Set up a local development environment with a BigQuery emulator or a GCP test project. The issue provides detailed configuration examples and lists the required dependencies.