kvcache-ai/Mooncake

[RFC]: Hot Standby Mode for Master Service Metadata High Availability

Open

Opened on Dec 12, 2025

good first issue

Description

Introduction

To address the issue where all metadata is lost after a master service failure and recovery, this solution proposes a hot-standby synchronization mechanism and a hybrid approach combining hot standby with snapshots, ensuring rapid data recovery following a master service failure.

Motivation

The current HA (High Availability) mode only supports rapid restart of the master service after a failure but cannot guarantee fast data recovery. The pain points of the current HA mode are as follows:

| Issue | Impact | Severity |
| --- | --- | --- |
| Complete loss of metadata after Leader failure | All KV cache location information is lost and must be rebuilt | 🔴 Critical |
| Follower lacks data pre-warming | After successful election, the new Leader must start serving from scratch | 🟠 Medium |
| Long recovery time | Relies on clients to re-register; may take several minutes to tens of minutes | 🟠 Medium |
| Service unavailability window | Requests cannot be processed during Leader failover | 🟡 Low |

Goals:

  1. Near-zero RPO: Minimize the data loss window
  2. Fast failover: Achieve second-level RTO (Recovery Time Objective)
  3. Data consistency guarantee: Ensure correct synchronization between primary and standby
  4. Verifiability: Include mechanisms to validate data integrity
  5. Low performance overhead: Avoid impacting normal business operations

Proposal

1. Core Design Approach

Use etcd as an intermediate reliability component to implement OpLog primary-standby synchronization:

  • OpLog Mechanism: Primary Master records all state change operations to OpLog
  • etcd Storage: OpLog is written to etcd, leveraging etcd's strong consistency and persistence capabilities
  • Watch Mechanism: Standby Master receives OpLog in real-time through etcd Watch mechanism
  • Ordering Guarantee: Guarantee operation order through global sequence_id and key-level key_sequence_id

2. Architecture Design

(1) Overall Architecture

The overall architecture diagram shows the interaction relationships between Primary Master, etcd Cluster, and Standby Master:

Architecture Description:

  • Primary Master: Handles client requests, records OpLog and writes to etcd
  • etcd Cluster: Acts as intermediate storage, providing strong consistency and Watch mechanism
  • Standby Master: Receives OpLog in real-time by watching etcd and applies to local metadata store

(2) Data Flow Diagram

The data flow diagram shows the complete flow from Client request to Standby synchronization:

Flow Description:

  a. Client sends PutEnd request to Primary Master
  b. Primary Master records operation through OpLogManager, generating sequence_id
  c. EtcdOpLogStore writes OpLog to etcd
  d. etcd notifies Standby Master through Watch mechanism
  e. OpLogWatcher receives events and passes to OpLogApplier
  f. OpLogApplier checks order and applies to Standby's metadata store

(3) Failover Sequence

The failover sequence diagram shows the complete process from Primary failure to Standby promotion to Primary:

Flow Description:

  a. Normal Operation: Primary maintains Lease, Standby continuously synchronizes OpLog through Watch
  b. Primary Failure: Primary's Lease expires, etcd notifies Standby
  c. Standby Promotion: Stop Standby service, initialize Lease, clean expired metadata, start leader election

3. Core Component Design

(1) OpLogManager (Primary Side)

Responsibilities:

  • Record all state change operations (PUT_END, PUT_REVOKE, REMOVE)
  • Generate global sequence_id and key-level key_sequence_id
  • Maintain memory buffer (for fast queries)

Key Methods:

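class OpLogManager {
public:
    // Records one state-change operation and returns its global sequence_id.
    uint64_t Append(OpType type, const std::string& key,
                    const std::string& payload = "");
    // Reads buffered entries since since_seq_id, at most `limit` per call.
    std::vector<OpLogEntry> GetEntriesSince(uint64_t since_seq_id,
                                            size_t limit = 1000) const;
    // Returns the most recently assigned global sequence_id.
    uint64_t GetLastSequenceId() const;
};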
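A minimal in-memory sketch of how these methods could assign the two sequence numbers. It is not the proposed implementation: the trimmed OpLogEntry, the mutex, the deque buffer, and the reading of GetEntriesSince as "strictly greater than since_seq_id" are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <mutex>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Trimmed-down stand-ins for the RFC's types (illustration only).
enum class OpType { PUT_END, PUT_REVOKE, REMOVE };

struct OpLogEntry {
    uint64_t sequence_id{0};
    uint64_t key_sequence_id{0};
    OpType op_type{OpType::PUT_END};
    std::string object_key;
    std::string payload;
};

class OpLogManager {
public:
    // Assigns the next global and per-key sequence numbers, buffers the
    // entry in memory, and returns the global sequence_id.
    uint64_t Append(OpType type, const std::string& key,
                    const std::string& payload = "") {
        std::lock_guard<std::mutex> lock(mutex_);
        assert(last_seq_id_ != UINT64_MAX);  // sequence space exhausted
        OpLogEntry entry;
        entry.sequence_id = ++last_seq_id_;
        entry.key_sequence_id = ++key_seq_ids_[key];  // starts at 1 per key
        entry.op_type = type;
        entry.object_key = key;
        entry.payload = payload;
        buffer_.push_back(std::move(entry));
        return last_seq_id_;
    }

    // Assumption: returns entries with sequence_id > since_seq_id.
    std::vector<OpLogEntry> GetEntriesSince(uint64_t since_seq_id,
                                            size_t limit = 1000) const {
        std::lock_guard<std::mutex> lock(mutex_);
        std::vector<OpLogEntry> out;
        for (const auto& e : buffer_) {
            if (out.size() >= limit) break;
            if (e.sequence_id > since_seq_id) out.push_back(e);
        }
        return out;
    }

    uint64_t GetLastSequenceId() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return last_seq_id_;
    }

private:
    mutable std::mutex mutex_;
    uint64_t last_seq_id_{0};
    std::unordered_map<std::string, uint64_t> key_seq_ids_;
    std::deque<OpLogEntry> buffer_;
};
```

A real buffer would also need a size bound or trimming policy, which this sketch leaves out.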

(2) EtcdOpLogStore (Primary Side)

Responsibilities:

  • Write OpLog to etcd
  • Update latest sequence_id
  • Record the sequence_id corresponding to each snapshot
  • Clean up old OpLog

etcd Key Design:

  • OpLog Entry: mooncake-store/oplog/{cluster_id}/{sequence_id}
  • Latest Sequence ID: mooncake-store/oplog/{cluster_id}/latest
  • Snapshot Sequence ID: mooncake-store/oplog/{cluster_id}/snapshot/{snapshot_id}/sequence_id
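The key layout above can be sketched as a few helper functions. The helper names are hypothetical. Zero-padding the sequence_id to 20 digits is also an assumption, not part of the proposal: etcd orders keys lexicographically as byte strings, so fixed-width numbers keep range reads and range deletes in numeric order.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <string>

// Assumption: pad to 20 digits (the decimal width of the largest uint64_t)
// so that lexicographic key order matches numeric sequence order.
std::string FormatSequenceId(uint64_t seq_id) {
    char buf[21];
    std::snprintf(buf, sizeof(buf), "%020llu",
                  static_cast<unsigned long long>(seq_id));
    return std::string(buf);
}

// Common prefix: mooncake-store/oplog/{cluster_id}/
std::string OpLogPrefix(const std::string& cluster_id) {
    assert(!cluster_id.empty());
    return "mooncake-store/oplog/" + cluster_id + "/";
}

// mooncake-store/oplog/{cluster_id}/{sequence_id}
std::string OpLogEntryKey(const std::string& cluster_id, uint64_t seq_id) {
    return OpLogPrefix(cluster_id) + FormatSequenceId(seq_id);
}

// mooncake-store/oplog/{cluster_id}/latest
std::string LatestSequenceKey(const std::string& cluster_id) {
    return OpLogPrefix(cluster_id) + "latest";
}

// mooncake-store/oplog/{cluster_id}/snapshot/{snapshot_id}/sequence_id
std::string SnapshotSequenceKey(const std::string& cluster_id,
                                const std::string& snapshot_id) {
    return OpLogPrefix(cluster_id) + "snapshot/" + snapshot_id +
           "/sequence_id";
}
```

Because ASCII digits sort before lowercase letters, every entry key under a cluster prefix sorts before the latest and snapshot keys, so a range scan over entry keys does not pick them up.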

(3) OpLogWatcher (Standby Side)

Responsibilities:

  • Watch etcd OpLog changes
  • Read historical OpLog (for initial synchronization)
  • Process Watch events and pass to OpLogApplier

Key Methods:

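class OpLogWatcher {
public:
    void Start();  // Begin watching OpLog changes in etcd.
    void Stop();   // Stop watching.
    // Reads historical OpLog from start_seq_id (for initial synchronization).
    bool ReadOpLogSince(uint64_t start_seq_id,
                        std::vector<OpLogEntry>& entries);
};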

(4) OpLogApplier (Standby Side)

Responsibilities:

  • Apply OpLog Entry to local metadata store
  • Check global and key-level order
  • Handle sequence number discontinuities and out-of-order cases
  • Periodically clean up key_sequence_map_ (memory optimization)

Key Methods:

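class OpLogApplier {
public:
    // Applies one entry to the local metadata store.
    bool ApplyOpLogEntry(const OpLogEntry& entry);
    // Checks the global and key-level order of an entry.
    bool CheckSequenceOrder(const OpLogEntry& entry);
    // Periodically drops stale entries from key_sequence_map_.
    void CleanupStaleKeySequences();
};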

(5) HotStandbyService (Standby Side)

Responsibilities:

  • Manage Standby mode lifecycle
  • Coordinate OpLogWatcher and OpLogApplier
  • Handle Standby promotion to Primary logic

Key Methods:

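class HotStandbyService {
public:
    void StartStandby();  // Start OpLogWatcher and OpLogApplier.
    void Stop();          // Stop Standby synchronization.
    void Promote();       // Handle Standby promotion to Primary.
};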

4. OpLog Entry Data Structure

struct OpLogEntry {
    uint64_t sequence_id{0};        // Globally monotonically increasing sequence
    uint64_t timestamp_ms{0};        // Timestamp (milliseconds)
    OpType op_type{OpType::PUT_END}; // PUT_END, PUT_REVOKE, REMOVE
    std::string object_key;          // Object key
    std::string payload;             // Optional payload (carries replica info for PUT_END)
    uint32_t checksum{0};           // Checksum
    uint32_t prefix_hash{0};        // Key prefix hash
    uint64_t key_sequence_id{0};     // Per-key operation sequence (for ordering guarantee)
};

JSON Serialization Format:

{
  "sequence_id": 12345,
  "timestamp": 1704110400123,
  "op_type": "PUT_END",
  "key": "object_key_123",
  "payload": "optional_payload",
  "checksum": 1234567890,
  "prefix_hash": 987654321,
  "key_sequence_id": 5
}
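A minimal sketch of producing this format without a JSON library. The field names follow the example above (note that it uses "timestamp" and "key" where the struct uses timestamp_ms and object_key). The helper names are hypothetical, and the sketch assumes key and payload are text; a binary payload would need an encoding such as base64, which the proposal does not specify.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <sstream>
#include <string>

enum class OpType { PUT_END, PUT_REVOKE, REMOVE };

struct OpLogEntry {
    uint64_t sequence_id{0};
    uint64_t timestamp_ms{0};
    OpType op_type{OpType::PUT_END};
    std::string object_key;
    std::string payload;
    uint32_t checksum{0};
    uint32_t prefix_hash{0};
    uint64_t key_sequence_id{0};
};

const char* OpTypeName(OpType t) {
    switch (t) {
        case OpType::PUT_END: return "PUT_END";
        case OpType::PUT_REVOKE: return "PUT_REVOKE";
        case OpType::REMOVE: return "REMOVE";
    }
    assert(false && "unknown OpType");
    return "UNKNOWN";
}

// Escapes quotes, backslashes, and control characters for a JSON string.
std::string EscapeJson(const std::string& s) {
    std::string out;
    for (unsigned char c : s) {
        if (c == '"') {
            out += "\\\"";
        } else if (c == '\\') {
            out += "\\\\";
        } else if (c < 0x20) {
            char buf[7];
            std::snprintf(buf, sizeof(buf), "\\u%04x",
                          static_cast<unsigned>(c));
            out += buf;
        } else {
            out += static_cast<char>(c);
        }
    }
    return out;
}

// Serializes one entry using the field names from the JSON example.
std::string ToJson(const OpLogEntry& e) {
    std::ostringstream os;
    os << "{\"sequence_id\":" << e.sequence_id
       << ",\"timestamp\":" << e.timestamp_ms
       << ",\"op_type\":\"" << OpTypeName(e.op_type) << "\""
       << ",\"key\":\"" << EscapeJson(e.object_key) << "\""
       << ",\"payload\":\"" << EscapeJson(e.payload) << "\""
       << ",\"checksum\":" << e.checksum
       << ",\"prefix_hash\":" << e.prefix_hash
       << ",\"key_sequence_id\":" << e.key_sequence_id << "}";
    return os.str();
}
```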

5. Ordering Guarantee Mechanism

(1) Global Sequence Number (sequence_id)

  • Purpose: Guarantee global order of all OpLog events
  • Generation: Generated globally incrementally by Primary's OpLogManager
  • Check: Standby checks if sequence_id is continuous

(2) Key-Level Sequence Number (key_sequence_id)

  • Purpose: Guarantee operation order for the same key
  • Generation: Incremented separately for each key on Primary side
  • Check: Standby checks if key_sequence_id is increasing
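The two checks can be sketched together as a small classifier on the Standby side. The class name, the result enum, and the treatment of an already-applied sequence_id as a duplicate are assumptions for illustration; the proposal only states that sequence_id must be continuous and key_sequence_id must be increasing.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>

enum class OrderCheck { kOk, kDuplicate, kGlobalGap, kKeyOutOfOrder };

class SequenceChecker {
public:
    // last_applied_seq_id would come from a snapshot when starting from one.
    explicit SequenceChecker(uint64_t last_applied_seq_id = 0)
        : last_applied_seq_id_(last_applied_seq_id) {}

    // Classifies an incoming entry without modifying any state.
    OrderCheck Check(uint64_t sequence_id, const std::string& key,
                     uint64_t key_sequence_id) const {
        if (sequence_id <= last_applied_seq_id_) {
            return OrderCheck::kDuplicate;  // already applied (assumption)
        }
        if (sequence_id != last_applied_seq_id_ + 1) {
            return OrderCheck::kGlobalGap;  // sequence_id is not continuous
        }
        auto it = key_sequence_map_.find(key);
        if (it != key_sequence_map_.end() && key_sequence_id <= it->second) {
            return OrderCheck::kKeyOutOfOrder;  // key_sequence_id not increasing
        }
        return OrderCheck::kOk;
    }

    // Records that an entry which passed Check() has been applied.
    void MarkApplied(uint64_t sequence_id, const std::string& key,
                     uint64_t key_sequence_id) {
        assert(Check(sequence_id, key, key_sequence_id) == OrderCheck::kOk);
        last_applied_seq_id_ = sequence_id;
        key_sequence_map_[key] = key_sequence_id;
    }

private:
    uint64_t last_applied_seq_id_;
    std::unordered_map<std::string, uint64_t> key_sequence_map_;
};
```

Separating the result into distinct cases lets the caller react differently to each one, for example by re-reading history on a gap instead of rejecting the entry outright.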

(3) Out-of-Order Handling

When an out-of-order key_sequence_id is detected:

  • Rollback: Delete all state of the key from metadata_store
  • Replay: Re-read all OpLog from etcd starting from the key's first sequence_id
  • Rewrite: Re-apply all OpLog in correct order to rebuild metadata
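The rollback, replay, and rewrite steps can be sketched with in-memory stand-ins. Everything here is assumed for illustration: a std::map keyed by sequence_id plays the role of the OpLog in etcd, a string map plays the role of metadata_store, and PUT_REVOKE is treated like REMOVE (both drop the key's state).

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <unordered_map>
#include <vector>

enum class OpType { PUT_END, PUT_REVOKE, REMOVE };

struct OpLogEntry {
    uint64_t sequence_id;
    uint64_t key_sequence_id;
    OpType op_type;
    std::string object_key;
    std::string payload;
};

// Rebuilds the state of one key from the log, starting at first_seq_id.
void RebuildKey(const std::string& key, uint64_t first_seq_id,
                const std::map<uint64_t, OpLogEntry>& log,
                std::unordered_map<std::string, std::string>& store) {
    // Rollback: drop whatever state the key currently has.
    store.erase(key);

    // Replay: collect every entry for this key from first_seq_id onward.
    std::vector<const OpLogEntry*> ops;
    for (auto it = log.lower_bound(first_seq_id); it != log.end(); ++it) {
        if (it->second.object_key == key) ops.push_back(&it->second);
    }

    // Rewrite: re-apply in key_sequence_id order.
    std::stable_sort(ops.begin(), ops.end(),
                     [](const OpLogEntry* a, const OpLogEntry* b) {
                         return a->key_sequence_id < b->key_sequence_id;
                     });
    for (const OpLogEntry* op : ops) {
        assert(op->object_key == key);
        switch (op->op_type) {
            case OpType::PUT_END:
                store[key] = op->payload;
                break;
            case OpType::PUT_REVOKE:  // assumption: same effect as REMOVE
            case OpType::REMOVE:
                store.erase(key);
                break;
        }
    }
}
```

Other keys are left untouched, so the cost of a rebuild is bounded by the log range scanned rather than by the size of the metadata store.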

6. Snapshot Integration

(1) Record Sequence ID During Snapshot

  • When snapshot is generated, record current OpLog sequence_id
  • Write snapshot info to etcd: mooncake-store/oplog/{cluster_id}/snapshot/{snapshot_id}/sequence_id

(2) OpLog Cleanup

  • After snapshot generation, OpLog before snapshot can be cleaned up
  • Cleanup strategy: Query minimum existing sequence_id from etcd, use DeleteRange to delete
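The cleanup strategy can be sketched as computing a half-open range and deleting it. A std::map stands in for etcd here, and the names are hypothetical. The sketch assumes the snapshot already reflects every entry up to and including its recorded sequence_id, so those entries are safe to delete; the half-open [begin, end) form mirrors how etcd's DeleteRange interprets key and range_end.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <map>
#include <string>

// Half-open range of sequence ids: [begin, end).
struct SequenceRange {
    uint64_t begin;
    uint64_t end;
};

// Plans which entries can be removed once a snapshot at snapshot_seq_id
// exists. Returns false when there is nothing to delete.
bool PlanCleanup(const std::map<uint64_t, std::string>& oplog,
                 uint64_t snapshot_seq_id, SequenceRange* out) {
    assert(snapshot_seq_id != UINT64_MAX);  // keep end = id + 1 in range
    if (oplog.empty()) return false;
    const uint64_t min_seq_id = oplog.begin()->first;  // minimum existing id
    if (min_seq_id > snapshot_seq_id) return false;
    out->begin = min_seq_id;
    out->end = snapshot_seq_id + 1;
    return true;
}

// Deletes all entries in the range and returns how many were removed.
size_t DeleteRange(std::map<uint64_t, std::string>& oplog,
                   const SequenceRange& range) {
    assert(range.begin <= range.end);
    auto first = oplog.lower_bound(range.begin);
    auto last = oplog.lower_bound(range.end);
    const size_t deleted = static_cast<size_t>(std::distance(first, last));
    oplog.erase(first, last);
    return deleted;
}
```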

7. Standby Service Integration

(1) Problem

In the existing code, a Standby only blocks and waits during leader election; it does not run a Standby service to synchronize the OpLog.

(2) Solution

In MasterServiceSupervisor::Start():

  • Check if there is currently a leader
  • If there is a leader and it's not self → Start Standby service (watch OpLog and apply)
  • After successful election → Stop Standby service and promote to Primary
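The decision logic above can be sketched as a small state holder. The class, the Role enum, and the callback names are hypothetical and do not reflect the actual MasterServiceSupervisor code. The sketch keeps the Standby role while no leader is visible, on the assumption that synchronization should continue during the election wait.

```cpp
#include <cassert>
#include <string>
#include <utility>

enum class Role { kWaiting, kStandby, kPrimary };

class SupervisorState {
public:
    explicit SupervisorState(std::string self_id)
        : self_id_(std::move(self_id)) {
        assert(!self_id_.empty());
    }

    // Called with the current leader's id, or "" if there is none.
    void OnLeaderObserved(const std::string& leader_id) {
        if (role_ == Role::kPrimary) return;
        if (!leader_id.empty() && leader_id != self_id_) {
            // Another master leads: start (or keep) watching and applying.
            role_ = Role::kStandby;
        }
        // No visible leader: keep the current role and keep waiting.
    }

    // Called after this master wins the election.
    void OnElected() { role_ = Role::kPrimary; }

    Role role() const { return role_; }
    bool StandbyRunning() const { return role_ == Role::kStandby; }

private:
    std::string self_id_;
    Role role_{Role::kWaiting};
};
```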

8. Lease Initialization When Standby Promotes to Primary

(1) Problem

Objects on Standby all have lease = 0, because the OpLog records only PUT_END, PUT_REVOKE, and REMOVE and carries no lease renewal information. Without intervention, all objects would expire immediately after promotion to Primary.

(2) Solution

In HotStandbyService::Promote():

  • Stop Standby service
  • Iterate through all metadata
  • For objects with lease_timeout = 0, grant default lease time (default_kv_lease_ttl)
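The lease step can be sketched as a single pass over the metadata. The ObjectMetadata struct and the function name are hypothetical, and the sketch assumes lease_timeout is stored as an absolute deadline in milliseconds; the actual representation in the metadata store may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>

// Hypothetical, trimmed-down metadata record.
struct ObjectMetadata {
    uint64_t lease_timeout_ms{0};  // 0 means no lease was ever recorded
};

// Grants a default lease to every object that has none.
// Returns the number of objects updated.
size_t GrantDefaultLeases(
    std::unordered_map<std::string, ObjectMetadata>& store,
    uint64_t now_ms, uint64_t default_kv_lease_ttl_ms) {
    assert(default_kv_lease_ttl_ms > 0);
    size_t granted = 0;
    for (auto& kv : store) {
        if (kv.second.lease_timeout_ms == 0) {
            kv.second.lease_timeout_ms = now_ms + default_kv_lease_ttl_ms;
            ++granted;
        }
    }
    return granted;
}
```

Objects that already carry a non-zero lease are left unchanged, so running the pass twice has no further effect.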

9. Memory Optimization: key_sequence_map_ Cleanup

(1) Problem

key_sequence_map_ on the Standby side tracks the latest key_sequence_id for each key. After a key's metadata is deleted, its entry is still retained, so the map grows without bound during long-term operation.

(2) Solution

Implement periodic cleanup mechanism:

  • Cleanup Condition: The key's last operation is REMOVE and more than 1 hour has passed since then
  • Cleanup Frequency: Scan once per hour
  • Retention Strategy: Keys whose last operation is PUT_END or PUT_REVOKE are not cleaned
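The cleanup rule can be sketched as one scan over the map. The KeySequenceInfo struct and its fields are assumptions: the proposal does not say what key_sequence_map_ stores beyond the key_sequence_id, so the last operation type and its timestamp are added here to make the rule checkable.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>

enum class OpType { PUT_END, PUT_REVOKE, REMOVE };

// Hypothetical per-key tracking record.
struct KeySequenceInfo {
    uint64_t key_sequence_id{0};
    OpType last_op{OpType::PUT_END};
    uint64_t last_update_ms{0};
};

// Removes entries whose last operation is REMOVE and is older than
// retention_ms. Returns the number of entries removed.
size_t CleanupStaleKeySequences(
    std::unordered_map<std::string, KeySequenceInfo>& key_sequence_map,
    uint64_t now_ms, uint64_t retention_ms = 60ULL * 60ULL * 1000ULL) {
    assert(retention_ms > 0);
    size_t removed = 0;
    for (auto it = key_sequence_map.begin(); it != key_sequence_map.end();) {
        const KeySequenceInfo& info = it->second;
        const bool expired = now_ms > info.last_update_ms &&
                             now_ms - info.last_update_ms > retention_ms;
        if (info.last_op == OpType::REMOVE && expired) {
            it = key_sequence_map.erase(it);  // erase returns the next iterator
            ++removed;
        } else {
            ++it;
        }
    }
    return removed;
}
```

Keeping a removed key's entry for the retention period means a late or replayed operation on that key can still be checked against its last key_sequence_id.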

Key Design Points Summary

  1. etcd as Intermediate Storage: Leverage etcd's strong consistency and Watch mechanism
  2. Record Only Critical Operations: PUT_END, PUT_REVOKE, REMOVE, do not record LEASE_RENEW
  3. Dual Sequence Number Guarantee: Global sequence_id + key-level key_sequence_id
  4. Snapshot Integration: Integrate with existing snapshot mechanism, support OpLog cleanup
  5. Standby Service Runs in Parallel: Continuously synchronize data during election waiting period
  6. Memory Optimization: Periodically clean up expired entries in key_sequence_map_
