Description
Background
BigQuery is Google Cloud's serverless, highly scalable, and cost-effective multi-cloud data warehouse. It is widely used by enterprises globally for data analytics, business intelligence, and machine learning workloads.
Currently, SeaTunnel lacks native support for BigQuery as a sink, which limits its ability to integrate with the Google Cloud ecosystem efficiently.
Motivation
- High Market Demand: BigQuery is a core service in Google Cloud Platform (GCP) with a large enterprise customer base
- Cloud-Native Architecture: JDBC drivers exist for BigQuery, but they offer poor performance and limited functionality compared with the native SDK
- Multi-Table Support: A single job needs to sync multiple tables from various sources to BigQuery
- Advanced Features: Native connector can support:
  - Streaming inserts for real-time data ingestion
  - Table partitioning and clustering
  - Nested and repeated fields (STRUCT, ARRAY)
  - Integration with Cloud Storage for efficient bulk loading
  - Schema auto-detection and evolution
Proposed Solution
Implement a dedicated BigQuery Sink connector on top of the Google Cloud Java SDK, with multi-table support, following SeaTunnel's standard architecture:
Core Features
- Multi-Table Support (Critical)
  - Support multiple tables in a single configuration
  - Table routing and mapping similar to CDC connectors
  - Per-table configuration (partition, clustering, schema)
  - Example:
    sink {
      BigQuery {
        project = "my-gcp-project"
        dataset = "my_dataset"
        # Multi-table configuration
        table-configs = [
          {
            table = "customers"
            partition_field = "order_date"
            partition_type = "DAY"
            clustering_fields = ["customer_id", "region"]
          },
          {
            table = "orders"
            partition_field = "created_at"
            partition_type = "HOUR"
          }
        ]
      }
    }
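To make the "table routing and mapping" item more concrete, here is a minimal sketch of how an upstream table path could be resolved to one of the configured BigQuery tables. The class name `TableRouter`, the rename map, and the rule "use the last path segment unless a rename exists" are assumptions for illustration only; they are not an existing SeaTunnel API.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

/** Hypothetical router: maps an upstream table path to a configured BigQuery table. */
final class TableRouter {
    private final Set<String> targetTables;    // table names declared in table-configs
    private final Map<String, String> renames; // optional source path -> target table overrides

    TableRouter(Set<String> targetTables, Map<String, String> renames) {
        this.targetTables = new HashSet<>(targetTables);
        this.renames = new HashMap<>(renames);
    }

    /** Returns the target table, or empty if the source table is not configured. */
    Optional<String> route(String sourceTablePath) {
        // Assumed default rule: "db.schema.table" maps to "table".
        String sourceTable = sourceTablePath.substring(sourceTablePath.lastIndexOf('.') + 1);
        // An explicit rename for the full path takes precedence over the default rule.
        String target = renames.getOrDefault(sourceTablePath, sourceTable);
        return targetTables.contains(target) ? Optional.of(target) : Optional.empty();
    }

    public static void main(String[] args) {
        TableRouter router = new TableRouter(
                Set.of("customers", "orders"),
                Map.of("shop.order_items_v2", "orders"));
        System.out.println(router.route("shop.customers"));
        System.out.println(router.route("shop.audit_log"));
    }
}
```

Returning an empty `Optional` for unconfigured tables leaves the policy (skip, fail the job, or auto-create) to the caller, which is one of the decisions this proposal would still need to settle.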
Configuration Examples
Multi-Table Configuration
sink {
  BigQuery {
    project = "my-gcp-project"
    dataset = "analytics"
    # Authentication
    credentials_file = "/path/to/service-account.json"
    # Multi-table support
    table-configs = [
      {
        table = "user_events"
        write_mode = "streaming"
        partition_field = "event_timestamp"
        partition_type = "DAY"
        clustering_fields = ["user_id", "event_type"]
        create_disposition = "CREATE_IF_NEEDED"
      },
      {
        table = "orders"
        write_mode = "batch"
        partition_field = "order_date"
        partition_type = "MONTH"
        clustering_fields = ["customer_id"]
      },
      {
        table = "products"
        write_mode = "batch"
        create_disposition = "CREATE_NEVER"
      }
    ]
    # Global settings
    max_batch_size = 1000
    flush_interval_ms = 5000
    enable_auto_schema_update = true
  }
}
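The global settings `max_batch_size` and `flush_interval_ms` suggest a buffer that flushes on whichever limit is hit first. The following is a rough sketch of that behavior under stated assumptions: `BatchBuffer` is a hypothetical helper, the clock and the writer are injected so the logic can be exercised without BigQuery, and the interval is only checked when a row arrives (a real sink would presumably also flush from a timer and on checkpoint).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.LongSupplier;

/** Hypothetical buffer that flushes by row count or by elapsed time, whichever comes first. */
final class BatchBuffer<T> {
    private final int maxBatchSize;          // corresponds to max_batch_size
    private final long flushIntervalMs;      // corresponds to flush_interval_ms
    private final LongSupplier clockMs;      // injected clock, e.g. System::currentTimeMillis
    private final Consumer<List<T>> writer;  // stand-in for the actual BigQuery write call
    private final List<T> pending = new ArrayList<>();
    private long lastFlushMs;

    BatchBuffer(int maxBatchSize, long flushIntervalMs,
                LongSupplier clockMs, Consumer<List<T>> writer) {
        if (maxBatchSize <= 0 || flushIntervalMs <= 0) {
            throw new IllegalArgumentException("batch size and flush interval must be positive");
        }
        this.maxBatchSize = maxBatchSize;
        this.flushIntervalMs = flushIntervalMs;
        this.clockMs = clockMs;
        this.writer = writer;
        this.lastFlushMs = clockMs.getAsLong();
    }

    /** Buffers one row and flushes if either threshold has been reached. */
    void add(T row) {
        pending.add(row);
        long now = clockMs.getAsLong();
        if (pending.size() >= maxBatchSize || now - lastFlushMs >= flushIntervalMs) {
            flush();
        }
    }

    /** Hands a copy of the pending rows to the writer and resets the interval timer. */
    void flush() {
        if (!pending.isEmpty()) {
            writer.accept(new ArrayList<>(pending));
            pending.clear();
        }
        lastFlushMs = clockMs.getAsLong();
    }

    public static void main(String[] args) {
        BatchBuffer<String> buffer = new BatchBuffer<>(
                2, 5000L, System::currentTimeMillis,
                batch -> System.out.println("flushing " + batch));
        buffer.add("row-1");
        buffer.add("row-2"); // reaches maxBatchSize, so the batch is flushed here
        buffer.add("row-3");
        buffer.flush();      // final flush, e.g. on checkpoint or close
    }
}
```

With multi-table support, one such buffer per target table would keep a slow or large table from delaying flushes for the others; whether the limits are global or per table is left open here.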
Technical Considerations
Dependencies
- google-cloud-bigquery SDK
- google-cloud-storage (for batch loading)
Multi-Table Implementation
- Follow SeaTunnel's CDC connector pattern for table discovery and routing
- Use TableId for table identification
- Support table filtering and mapping
- Per-table configuration merging similar to JdbcSourceTableConfig
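As an illustration of per-table configuration merging, the sketch below overlays each `table-configs` entry on the global settings and indexes the result by table name. `TableConfigResolver` and the rule that per-table keys override global keys are assumptions made for this example; they do not describe how `JdbcSourceTableConfig` works internally.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical resolver: merges global sink settings with per-table overrides. */
final class TableConfigResolver {
    private TableConfigResolver() {}

    /** Returns the effective settings per table, keyed by the entry's "table" value. */
    static Map<String, Map<String, Object>> resolve(
            Map<String, Object> globalSettings, List<Map<String, Object>> tableConfigs) {
        Map<String, Map<String, Object>> resolved = new LinkedHashMap<>();
        for (Map<String, Object> tableConfig : tableConfigs) {
            Object name = tableConfig.get("table");
            if (!(name instanceof String) || ((String) name).isEmpty()) {
                throw new IllegalArgumentException("each table-configs entry needs a non-empty 'table'");
            }
            // Start from the global settings, then let the per-table entry win on conflicts.
            Map<String, Object> merged = new LinkedHashMap<>(globalSettings);
            merged.putAll(tableConfig);
            if (resolved.putIfAbsent((String) name, merged) != null) {
                throw new IllegalArgumentException("duplicate table in table-configs: " + name);
            }
        }
        return resolved;
    }

    public static void main(String[] args) {
        Map<String, Object> global = Map.of("max_batch_size", 1000, "flush_interval_ms", 5000);
        Map<String, Object> orders = Map.of("table", "orders", "write_mode", "batch", "max_batch_size", 200);
        Map<String, Object> products = Map.of("table", "products", "write_mode", "batch");
        System.out.println(resolve(global, List.of(orders, products)));
    }
}
```

Rejecting duplicate or unnamed entries at resolution time surfaces configuration mistakes before the job starts writing.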
Authentication
- Service account JSON
- Application default credentials
- Workload identity (for GKE)
Testing
- BigQuery emulator for unit tests
- Integration tests with test project
- Multi-table test cases