apache/seatunnel

[Feature][Connector] Add Azure CosmosDB Source Connector

Open

#10357 opened on Jan 17, 2026

Labels: good first issue, help wanted

Description

Background

Azure Cosmos DB is Microsoft's globally distributed, multi-model NoSQL database service designed for mission-critical applications. It offers an availability SLA of up to 99.999% (for multi-region configurations), single-digit millisecond latency, and support for multiple API models (SQL, MongoDB, Cassandra, Gremlin, Table).

Currently, SeaTunnel lacks native support for Azure Cosmos DB as a data source, limiting its ability to integrate with Azure cloud-native applications and globally distributed systems.

Motivation

  • Azure Cloud Leadership: Cosmos DB is Microsoft Azure's flagship NoSQL database service.
  • Multi-Model Support: A single service supports document, key-value, graph, and column-family data models through its different APIs.
  • Multi-Container Integration: Users need to sync multiple containers from one or more databases in a single job.
  • No JDBC Support: Cosmos DB is accessed through its native SDK rather than JDBC, so a dedicated connector is needed for optimal performance and feature access.

Proposed Solution

Implement a dedicated Azure Cosmos DB Source connector with multi-container support, starting with the SQL API and leaving room for other API modes.

Crucially, this connector will follow SeaTunnel's standard multi-table configuration (aligned with JDBC Source) using table_list and table_path.

Core Features

  1. Multi-Container Support (Standardized)

    • Use the standard table_list parameter to define multiple containers.
    • Use table_path (format: database.container) to identify each container, consistent with other SeaTunnel connectors.
    • Support container-specific settings (partition key, custom query).
  2. API Support

    • SQL API (Core/SQL)

Configuration Examples

Multi-Container SQL API Configuration (Standardized)

env {
  parallelism = 2
  job.mode = "BATCH"
}

source {
  CosmosDB {
    # Connection
    endpoint = "https://myaccount.documents.azure.com:443/"
    auth_type = "master_key"
    master_key = "your-primary-key"
    api_type = "sql"
    
    # Multi-container standard configuration
    table_list = [
      {
        # Standard table_path format: database.container
        table_path = "ecommerce.customers"
        
        # Container specific settings
        partition_key = "/customerId"
        
        # Extraction settings
        extraction_mode = "incremental"
        incremental_field = "_ts"
        # Epoch seconds (2022-01-01T00:00:00Z); Cosmos DB stores _ts in seconds
        start_timestamp = 1640995200
        
        # Custom query (optional)
        query = "SELECT * FROM c WHERE c._ts > @lastTimestamp AND c.status = 'active'"
      },
      {
        table_path = "ecommerce.orders"
        partition_key = "/orderId"
        
        extraction_mode = "incremental"
        incremental_field = "_ts"
      },
      {
        table_path = "analytics.user_events"
        partition_key = "/userId"
        
        # Change feed (CDC) mode
        extraction_mode = "change_feed"
        change_feed_mode = "incremental" 
        lease_container_name = "leases"
      }
    ]
    
    # Global settings
    max_ru_per_second = 1000
    request_timeout_ms = 30000
  }
}

sink {
  Console {}
}
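The proposal does not spell out how `@lastTimestamp` is bound or how the checkpoint advances between runs. The following is a minimal sketch of one possible approach, assuming the connector falls back to a default `_ts` filter when no custom query is given and carries the largest `_ts` seen forward as the next checkpoint. The class and method names (`IncrementalTsSketch`, `effectiveQuery`, `nextCheckpoint`) are hypothetical, and documents are modeled as plain maps rather than Cosmos SDK types.

```java
import java.util.List;
import java.util.Map;

// Hypothetical sketch: not part of SeaTunnel or the azure-cosmos SDK.
final class IncrementalTsSketch {

    // Assumed default filter when a table_list entry has no custom query.
    static final String DEFAULT_QUERY = "SELECT * FROM c WHERE c._ts > @lastTimestamp";

    // Returns the custom query if one is configured, otherwise the default filter.
    static String effectiveQuery(String customQuery) {
        return (customQuery == null || customQuery.isBlank()) ? DEFAULT_QUERY : customQuery;
    }

    // Returns the value to bind to @lastTimestamp on the next run:
    // the largest _ts among the documents read, never lower than the previous checkpoint.
    static long nextCheckpoint(long lastTimestamp, List<Map<String, Object>> documents) {
        long max = lastTimestamp;
        for (Map<String, Object> doc : documents) {
            Object ts = doc.get("_ts");
            if (ts instanceof Number) {
                max = Math.max(max, ((Number) ts).longValue());
            }
        }
        return max;
    }

    public static void main(String[] args) {
        List<Map<String, Object>> batch = List.of(
                Map.<String, Object>of("id", "a", "_ts", 1640995300L),
                Map.<String, Object>of("id", "b", "_ts", 1640995250L));

        System.out.println(effectiveQuery(null));
        System.out.println(nextCheckpoint(1640995200L, batch));
    }
}
```

Because `_ts` has one-second resolution, a strict `>` comparison could skip documents written in the same second as the checkpoint; a real implementation would need to decide how to handle that boundary.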

Technical Considerations

Multi-Container Configuration Standardization

  • Parameter Alignment: Adopt table_list in place of the earlier custom container-configs proposal. This keeps the connector consistent with JDBC, StarRocks, and other multi-table sources.
  • Table Path Parsing: Use SeaTunnel's TablePath class to parse database.container strings automatically.
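
To illustrate the database.container split described above, here is a minimal, dependency-free sketch. It uses a hypothetical stand-in class (`ContainerPathSketch`) instead of SeaTunnel's TablePath so that it runs on its own, and it assumes that neither the database name nor the container name contains a dot.

```java
import java.util.List;

// Hypothetical stand-in for illustration only; the connector itself would rely on SeaTunnel's TablePath.
final class ContainerPathSketch {
    final String database;
    final String container;

    private ContainerPathSketch(String database, String container) {
        this.database = database;
        this.container = container;
    }

    // Splits "database.container" into its two parts and rejects anything else.
    static ContainerPathSketch parse(String tablePath) {
        if (tablePath == null) {
            throw new IllegalArgumentException("table_path must not be null");
        }
        String[] parts = tablePath.split("\\.");
        if (parts.length != 2 || parts[0].isEmpty() || parts[1].isEmpty()) {
            throw new IllegalArgumentException(
                    "table_path must have the form database.container: " + tablePath);
        }
        return new ContainerPathSketch(parts[0], parts[1]);
    }

    public static void main(String[] args) {
        for (String path : List.of("ecommerce.customers", "analytics.user_events")) {
            ContainerPathSketch p = parse(path);
            System.out.println(p.database + " -> " + p.container);
        }
    }
}
```

Failing fast on malformed paths at configuration time would surface typos in table_list before any request is sent to Cosmos DB.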

Dependencies

  • SQL API: azure-cosmos Java SDK (Maven artifact com.azure:azure-cosmos)
