[RFC]: Hot Standby Mode for Master Service Metadata High Availability · kvcache-ai/Mooncake#1200


good first issue

Description

Introduction

Today, all metadata is lost when the master service fails and recovers. This RFC proposes a hot-standby synchronization mechanism, plus a hybrid approach that combines hot standby with snapshots, so that metadata can be recovered quickly after a master service failure.

Motivation

The current HA (High Availability) mode only supports rapid restart of the master service after a failure but cannot guarantee fast data recovery. The pain points of the current HA mode are as follows:

Issue	Impact	Severity
Complete loss of metadata after Leader failure	All KV cache location information is lost and must be rebuilt	🔴 Critical
Follower lacks data pre-warming	After successful election, the new Leader must start serving from scratch	🟠 Medium
Long recovery time	Relies on clients to re-register; may take several minutes to tens of minutes	🟠 Medium
Service unavailability window	Requests cannot be processed during Leader failover	🟡 Low

Goals:

Near-zero RPO: Minimize the data loss window
Fast failover: Achieve second-level RTO (Recovery Time Objective)
Data consistency guarantee: Ensure correct synchronization between primary and standby
Verifiability: Include mechanisms to validate data integrity
Low performance overhead: Avoid impacting normal business operations

Proposal

1. Core Design Approach

Use etcd as an intermediate reliability component to implement OpLog primary-standby synchronization:

OpLog Mechanism: Primary Master records all state change operations to OpLog
etcd Storage: OpLog is written to etcd, leveraging etcd's strong consistency and persistence capabilities
Watch Mechanism: Standby Master receives OpLog in real-time through etcd Watch mechanism
Ordering Guarantee: Guarantee operation order through global sequence_id and key-level key_sequence_id

2. Architecture Design

(1) Overall Architecture

The overall architecture diagram shows the interaction relationships between Primary Master, etcd Cluster, and Standby Master:

Architecture Description:

Primary Master: Handles client requests, records OpLog and writes to etcd
etcd Cluster: Acts as intermediate storage, providing strong consistency and Watch mechanism
Standby Master: Receives OpLog in real-time by watching etcd and applies to local metadata store

(2) Data Flow Diagram

The data flow diagram shows the complete flow from Client request to Standby synchronization:

Flow Description:

a. Client sends PutEnd request to Primary Master
b. Primary Master records the operation through OpLogManager, generating a sequence_id
c. EtcdOpLogStore writes the OpLog to etcd
d. etcd notifies Standby Master through the Watch mechanism
e. OpLogWatcher receives events and passes them to OpLogApplier
f. OpLogApplier checks order and applies the entry to Standby's metadata store

(3) Failover Sequence

The failover sequence diagram shows the complete process from Primary failure to Standby promotion to Primary:

Flow Description:

a. Normal Operation: Primary maintains Lease, Standby continuously synchronizes OpLog through Watch
b. Primary Failure: Primary's Lease expires, etcd notifies Standby
c. Standby Promotion: Stop Standby service, initialize Lease, clean expired metadata, start leader election

3. Core Component Design

(1) OpLogManager (Primary Side)

Responsibilities:

Record all state change operations (PUT_END, PUT_REVOKE, REMOVE)
Generate global sequence_id and key-level key_sequence_id
Maintain memory buffer (for fast queries)

Key Methods:

class OpLogManager {
public:
    uint64_t Append(OpType type, const std::string& key,
                    const std::string& payload = "");
    std::vector<OpLogEntry> GetEntriesSince(uint64_t since_seq_id,
                                            size_t limit = 1000) const;
    uint64_t GetLastSequenceId() const;
};
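The interface above could be backed by something like the following. This is a minimal in-memory sketch, not the proposed implementation: the reduced `OpLogEntry`, the `kMaxBuffered` cap, the per-key counter map, and the choice to treat `since_seq_id` as exclusive are all assumptions made for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>
#include <mutex>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

enum class OpType { PUT_END, PUT_REVOKE, REMOVE };

// Reduced entry for illustration only (assumption); the RFC's full layout
// has more fields (timestamp, checksum, prefix hash).
struct OpLogEntry {
    uint64_t sequence_id{0};
    uint64_t key_sequence_id{0};
    OpType op_type{OpType::PUT_END};
    std::string object_key;
    std::string payload;
};

class OpLogManager {
public:
    // Assigns the next global and per-key sequence numbers under one lock,
    // buffers the entry, and returns the global sequence_id.
    uint64_t Append(OpType type, const std::string& key,
                    const std::string& payload = "") {
        std::lock_guard<std::mutex> lock(mutex_);
        OpLogEntry entry;
        entry.sequence_id = ++last_seq_id_;
        entry.key_sequence_id = ++key_seq_ids_[key];
        entry.op_type = type;
        entry.object_key = key;
        entry.payload = payload;
        buffer_.push_back(std::move(entry));
        // Hypothetical bound on the in-memory buffer.
        if (buffer_.size() > kMaxBuffered) buffer_.pop_front();
        return last_seq_id_;
    }

    // Returns buffered entries with sequence_id > since_seq_id (assumed
    // exclusive), at most `limit` of them, in ascending order.
    std::vector<OpLogEntry> GetEntriesSince(uint64_t since_seq_id,
                                            size_t limit = 1000) const {
        std::lock_guard<std::mutex> lock(mutex_);
        std::vector<OpLogEntry> out;
        for (const auto& e : buffer_) {
            if (e.sequence_id <= since_seq_id) continue;
            if (out.size() >= limit) break;
            out.push_back(e);
        }
        return out;
    }

    uint64_t GetLastSequenceId() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return last_seq_id_;
    }

private:
    static constexpr size_t kMaxBuffered = 100000;  // assumption
    mutable std::mutex mutex_;
    uint64_t last_seq_id_{0};
    std::unordered_map<std::string, uint64_t> key_seq_ids_;
    std::deque<OpLogEntry> buffer_;
};
```

Assigning both counters inside the same critical section keeps the global order and the per-key order consistent with each other, which is what the Standby-side checks rely on.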

(2) EtcdOpLogStore (Primary Side)

Responsibilities:

Write OpLog to etcd
Update latest sequence_id
Record the sequence_id that corresponds to each snapshot
Clean up old OpLog

etcd Key Design:

OpLog Entry: mooncake-store/oplog/{cluster_id}/{sequence_id}
Latest Sequence ID: mooncake-store/oplog/{cluster_id}/latest
Snapshot Sequence ID: mooncake-store/oplog/{cluster_id}/snapshot/{snapshot_id}/sequence_id

(3) OpLogWatcher (Standby Side)

Responsibilities:

Watch etcd OpLog changes
Read historical OpLog (for initial synchronization)
Process Watch events and pass to OpLogApplier

Key Methods:

class OpLogWatcher {
public:
    void Start();
    void Stop();
    bool ReadOpLogSince(uint64_t start_seq_id,
                        std::vector<OpLogEntry>& entries);
};

(4) OpLogApplier (Standby Side)

Responsibilities:

Apply OpLog Entry to local metadata store
Check global and key-level order
Handle sequence number discontinuities and out-of-order cases
Periodically clean up key_sequence_map_ (memory optimization)

Key Methods:

class OpLogApplier {
public:
    bool ApplyOpLogEntry(const OpLogEntry& entry);
    bool CheckSequenceOrder(const OpLogEntry& entry);
    void CleanupStaleKeySequences();
};

(5) HotStandbyService (Standby Side)

Responsibilities:

Manage Standby mode lifecycle
Coordinate OpLogWatcher and OpLogApplier
Handle Standby promotion to Primary logic

Key Methods:

class HotStandbyService {
public:
    void StartStandby();
    void Stop();
    void Promote();
};

4. OpLog Entry Data Structure

struct OpLogEntry {
    uint64_t sequence_id{0};        // Globally monotonically increasing sequence
    uint64_t timestamp_ms{0};        // Timestamp (milliseconds)
    OpType op_type{OpType::PUT_END}; // PUT_END, PUT_REVOKE, REMOVE
    std::string object_key;          // Object key
    std::string payload;             // Optional payload (carries replica info for PUT_END)
    uint32_t checksum{0};           // Checksum
    uint32_t prefix_hash{0};        // Key prefix hash
    uint64_t key_sequence_id{0};     // Per-key operation sequence (for ordering guarantee)
};

JSON Serialization Format:

{
  "sequence_id": 12345,
  "timestamp": 1704110400123,
  "op_type": "PUT_END",
  "key": "object_key_123",
  "payload": "optional_payload",
  "checksum": 1234567890,
  "prefix_hash": 987654321,
  "key_sequence_id": 5
}

5. Ordering Guarantee Mechanism

(1) Global Sequence Number (sequence_id)

Purpose: Guarantee global order of all OpLog events
Generation: Generated globally incrementally by Primary's OpLogManager
Check: Standby checks if sequence_id is continuous

(2) Key-Level Sequence Number (key_sequence_id)

Purpose: Guarantee operation order for the same key
Generation: Incremented separately for each key on Primary side
Check: Standby checks if key_sequence_id is increasing
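The two checks above might be combined on the Standby roughly as follows. This is a sketch under assumptions: the `SequenceChecker` class, the `OrderCheck` result codes, and the rule that a duplicate global sequence_id is reported separately from a gap are illustrative and not taken from the RFC.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>

// Hypothetical result codes for the ordering checks.
enum class OrderCheck { kOk, kStaleOrDuplicate, kGlobalGap, kKeyOutOfOrder };

class SequenceChecker {
public:
    // start_seq_id is the last sequence_id already reflected in local state,
    // for example the one recorded with a loaded snapshot (assumption).
    explicit SequenceChecker(uint64_t start_seq_id)
        : last_applied_(start_seq_id) {}

    OrderCheck Check(uint64_t sequence_id, const std::string& key,
                     uint64_t key_sequence_id) const {
        // Global check: sequence_id must be exactly last_applied_ + 1.
        if (sequence_id <= last_applied_) return OrderCheck::kStaleOrDuplicate;
        if (sequence_id != last_applied_ + 1) return OrderCheck::kGlobalGap;
        // Key-level check: key_sequence_id must be strictly increasing.
        auto it = key_seqs_.find(key);
        const uint64_t last_key_seq = (it == key_seqs_.end()) ? 0 : it->second;
        if (key_sequence_id <= last_key_seq) return OrderCheck::kKeyOutOfOrder;
        return OrderCheck::kOk;
    }

    // Records an entry as applied; call only after Check returned kOk.
    void MarkApplied(uint64_t sequence_id, const std::string& key,
                     uint64_t key_sequence_id) {
        last_applied_ = sequence_id;
        key_seqs_[key] = key_sequence_id;
    }

private:
    uint64_t last_applied_;
    std::unordered_map<std::string, uint64_t> key_seqs_;
};
```

Separating `Check` from `MarkApplied` lets the caller decide how to react to each failure mode (wait for a missing entry, skip a duplicate, or trigger a per-key rebuild) before any state is updated.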

(3) Out-of-Order Handling

When key_sequence_id out-of-order is detected:

Rollback: Delete all state of the key from metadata_store
Replay: Re-read all OpLog from etcd starting from the key's first sequence_id
Rewrite: Re-apply all OpLog in correct order to rebuild metadata
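The rollback, replay, and rewrite steps above can be sketched against an in-memory stand-in for the metadata store. The names `MetadataStore`, `ApplyEntry`, and `RebuildKey` are hypothetical, the log is passed in as a vector rather than read from etcd, and treating PUT_REVOKE like REMOVE (both erase the key) is an assumption about the operation semantics.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

enum class OpType { PUT_END, PUT_REVOKE, REMOVE };

// Reduced entry for illustration only (assumption).
struct OpLogEntry {
    uint64_t sequence_id;
    uint64_t key_sequence_id;
    OpType op_type;
    std::string object_key;
    std::string payload;
};

// Stand-in for the Standby's metadata store: key -> replica info payload.
using MetadataStore = std::unordered_map<std::string, std::string>;

// Assumed semantics: PUT_END installs the payload, PUT_REVOKE and REMOVE
// both drop the key.
void ApplyEntry(const OpLogEntry& e, MetadataStore& store) {
    if (e.op_type == OpType::PUT_END) {
        store[e.object_key] = e.payload;
    } else {
        store.erase(e.object_key);
    }
}

// Rebuilds one key from the log entries re-read for it.
void RebuildKey(const std::string& key, std::vector<OpLogEntry> log,
                MetadataStore& store) {
    // Rollback: delete all state of the key.
    store.erase(key);
    // Replay: keep only this key's entries.
    log.erase(std::remove_if(log.begin(), log.end(),
                             [&key](const OpLogEntry& e) {
                                 return e.object_key != key;
                             }),
              log.end());
    // Rewrite: re-apply in key_sequence_id order.
    std::sort(log.begin(), log.end(),
              [](const OpLogEntry& a, const OpLogEntry& b) {
                  return a.key_sequence_id < b.key_sequence_id;
              });
    for (const auto& e : log) ApplyEntry(e, store);
}
```

Only the affected key is rebuilt, so the rest of the Standby's metadata stays in place while the replay runs.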

6. Snapshot Integration

(1) Record Sequence ID During Snapshot

When snapshot is generated, record current OpLog sequence_id
Write snapshot info to etcd: mooncake-store/oplog/{cluster_id}/snapshot/{snapshot_id}/sequence_id

(2) OpLog Cleanup

After snapshot generation, OpLog before snapshot can be cleaned up
Cleanup strategy: Query minimum existing sequence_id from etcd, use DeleteRange to delete
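The cleanup strategy above amounts to deleting one contiguous range of sequence IDs. The sketch below models it with an ordered `std::map` standing in for etcd; the `FakeOpLogStore` alias and `TruncateOpLog` are hypothetical, and deleting entries up to and including the snapshot's sequence_id is an assumption (the RFC only says entries before the snapshot can be cleaned up).

```cpp
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <map>
#include <string>

// Ordered stand-in for the OpLog keys in etcd: sequence_id -> serialized entry.
using FakeOpLogStore = std::map<uint64_t, std::string>;

// Removes every entry already covered by the snapshot and returns how many
// were removed. In etcd this would be a single DeleteRange over
// [min existing sequence_id, snapshot_seq_id].
size_t TruncateOpLog(FakeOpLogStore& store, uint64_t snapshot_seq_id) {
    if (store.empty()) return 0;
    // Query the minimum existing sequence_id.
    const uint64_t min_seq_id = store.begin()->first;
    if (min_seq_id > snapshot_seq_id) return 0;  // nothing to clean up
    // upper_bound gives the first entry strictly after the snapshot point.
    auto last = store.upper_bound(snapshot_seq_id);
    const size_t removed =
        static_cast<size_t>(std::distance(store.begin(), last));
    store.erase(store.begin(), last);
    return removed;
}
```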

7. Standby Service Integration

(1) Problem

In existing code, Standby only blocks and waits during leader election, without running Standby service to synchronize OpLog.

(2) Solution

In MasterServiceSupervisor::Start():

Check if there is currently a leader
If there is a leader and it's not self → Start Standby service (watch OpLog and apply)
After successful election → Stop Standby service and promote to Primary
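The decision made in the steps above can be written as a small pure function. This is only an illustration of the control flow: `SupervisorAction`, `DecideAction`, and the convention that an empty string means "no leader" are hypothetical and do not reflect the actual MasterServiceSupervisor code.

```cpp
#include <string>

// Hypothetical actions the supervisor loop can take on each iteration.
enum class SupervisorAction { kKeepWaiting, kStartStandby, kPromoteToPrimary };

// current_leader is empty when no leader is known (assumption).
SupervisorAction DecideAction(const std::string& current_leader,
                              const std::string& self_id,
                              bool standby_running, bool election_won) {
    // Winning the election always takes priority: stop Standby and promote.
    if (election_won) return SupervisorAction::kPromoteToPrimary;
    // A different node is leader: make sure OpLog sync is running.
    if (!current_leader.empty() && current_leader != self_id &&
        !standby_running) {
        return SupervisorAction::kStartStandby;
    }
    return SupervisorAction::kKeepWaiting;
}
```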

8. Lease Initialization When Standby Promotes to Primary

(1) Problem

All objects on the Standby have lease = 0, because the OpLog records only PUT_END and carries no lease renewal information. Without further handling, every object would expire immediately after promotion to Primary.

(2) Solution

In HotStandbyService::Promote():

Stop Standby service
Iterate through all metadata
For objects with lease_timeout = 0, grant default lease time (default_kv_lease_ttl)
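The lease initialization pass above might look like this. It is a sketch under assumptions: `ObjectMetadata` is reduced to a single field, times are plain millisecond counters rather than whatever clock type the master service uses, and `GrantDefaultLeases` is a hypothetical helper name.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>

// Reduced metadata record for illustration (assumption): only the lease
// deadline, in milliseconds, with 0 meaning "no lease recorded".
struct ObjectMetadata {
    uint64_t lease_timeout_ms;
};

// Gives every object without a lease a deadline of now + default TTL and
// returns how many objects were updated.
size_t GrantDefaultLeases(
    std::unordered_map<std::string, ObjectMetadata>& store, uint64_t now_ms,
    uint64_t default_kv_lease_ttl_ms) {
    size_t granted = 0;
    for (auto& kv : store) {
        if (kv.second.lease_timeout_ms == 0) {
            kv.second.lease_timeout_ms = now_ms + default_kv_lease_ttl_ms;
            ++granted;
        }
    }
    return granted;
}
```

Running this once inside the promotion path, before the new Primary starts serving, gives clients one full default TTL to renew their leases.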

9. Memory Optimization: key_sequence_map_ Cleanup

(1) Problem

key_sequence_map_ on the Standby side tracks the key_sequence_id of each key. Its entries are retained even after the corresponding metadata is deleted, so the map can grow without bound during long-term operation.

(2) Solution

Implement periodic cleanup mechanism:

Cleanup Condition: Last operation is REMOVE and more than 1 hour has passed
Cleanup Frequency: Scan once per hour
Retention Strategy: Keys with PUT_END and PUT_REVOKE operations are not cleaned
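The cleanup rule above can be sketched as a single pass over the map. The `KeySequenceState` record (last operation type plus last update time) is an assumption about what key_sequence_map_ would need to store, and timestamps are plain millisecond counters for simplicity.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>

enum class OpType { PUT_END, PUT_REVOKE, REMOVE };

// Hypothetical per-key tracking record.
struct KeySequenceState {
    uint64_t key_sequence_id;
    OpType last_op;
    uint64_t last_update_ms;
};

// Erases entries whose last operation was REMOVE more than retention_ms ago
// (one hour by default) and returns how many were erased. Entries whose last
// operation is PUT_END or PUT_REVOKE are always kept.
size_t CleanupStaleKeySequences(
    std::unordered_map<std::string, KeySequenceState>& key_sequence_map,
    uint64_t now_ms, uint64_t retention_ms = 3600000) {
    size_t removed = 0;
    for (auto it = key_sequence_map.begin(); it != key_sequence_map.end();) {
        const KeySequenceState& s = it->second;
        const bool stale = s.last_op == OpType::REMOVE &&
                           now_ms > s.last_update_ms &&
                           now_ms - s.last_update_ms > retention_ms;
        if (stale) {
            it = key_sequence_map.erase(it);  // erase returns the next iterator
            ++removed;
        } else {
            ++it;
        }
    }
    return removed;
}
```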

Key Design Points Summary

etcd as Intermediate Storage: Leverage etcd's strong consistency and Watch mechanism
Record Only Critical Operations: PUT_END, PUT_REVOKE, REMOVE, do not record LEASE_RENEW
Dual Sequence Number Guarantee: Global sequence_id + key-level key_sequence_id
Snapshot Integration: Integrate with existing snapshot mechanism, support OpLog cleanup
Standby Service Runs in Parallel: Continuously synchronize data during election waiting period
Memory Optimization: Periodically clean up expired entries in key_sequence_map_

Contributor guide

Tech stack: cpp
Domain: backend, infrastructure
Issue type: feature
Difficulty: 4
Estimated time: over 1 week
Activity status: fresh
Clarity: clear
Prerequisites: C++, distributed systems, etcd
Beginner-friendliness: 15
Research direction: Familiarize yourself with the Mooncake master service code and its existing etcd integration. Understand the proposed OpLog mechanism and implement the components: OpLogManager, EtcdOpLogStore, OpLogWatcher, OpLogApplier, and HotStandbyService. Integrate with the existing snapshot mechanism and the leader election flow.