kvcache-ai/Mooncake

[RFC]: dynamic replica management

Open

Aperta il 24 nov 2025

Vedi su GitHub
 (11 commenti) (0 reazioni) (0 assegnatari)C++ (5470 star) (803 fork)auto 404
enhancementgood first issue

Descrizione

Changes proposed

Introduction

This RFC proposes a dynamic replica management system for Mooncake Store that adaptively adjusts replica configurations based on workload patterns.

Motivation

Mooncake Store currently uses a static replica configuration where each object has a fixed number of replicas determined at write time. This approach does not account for changing access patterns, leading to suboptimal performance and resource utilization. And also now the mooncake-store only get the first completion replica for read, which may not be the most efficient one compare to choose the local replica. So we need a mechanism to dynamically adjust replicas based on object "hotness" and aslo may need to optimize read replica selection.

This may introduce the following benefits:

  1. Improved Read Performance: Because the hot data is preferred to more memory replicas and will have more probability to be served in the local when read.
  2. Better Resource Utilization: Our memory will have more proportionally allocated to frequently accessed data, reducing waste.
  3. Optimized Read Replica Selection: By selecting the most efficient replica for read operations, we can further enhance performance.

This proposal introduces a comprehensive design including hot data detection algorithm, replica adjustment and placement strategy, and read replica selection optimization.

In Scope

  1. Hot data statistics and identification.
  2. Master side initialized replica adjustment (scale up/down).
  3. Optimized read replica selection based on locality.
  4. log/testing/monitoring

Out of Scope

  1. data promotion/demotion with multiple tier storage

Proposal

Architecture

Architecture Description

Hot data detection algorithm - Access Frequency-Based Hot Data Detection(Basic Idea)

Description:

The algorithm tracks how many times each object is accessed within a sliding time window. Objects are classified as HOT, WARM, or COLD based on their access frequency and recency:

  • HOT: Accessed frequently
  • WARM: Moderate access patterns
  • COLD: Rarely accessed

Atomic Replica Primitives

Description:

To ensure consistency and reliability during replica management, we define two fundamental atomic operations on the Master: ReplicaCopy and ReplicaMove. These operations are designed as two-phase transactions to handle the long-running nature of data transfer while maintaining metadata consistency.

  1. ReplicaCopy (Scale Up)

    • Phase 1: Start (Allocation) - Master allocates a new replica descriptor on the target node. The replica is marked as PROCESSING and is not visible to readers.
    • Phase 2: End (Commit) - After successful data transfer, Master marks the replica as COMPLETE. It becomes immediately visible to readers.
  2. ReplicaMove (Migration)

    • Phase 1: Start (Allocation) - Master allocates a new replica descriptor on the target node (marked PROCESSING).
    • Phase 2: End (Atomic Swap) - After successful data transfer, Master atomically:
      • Marks the new replica as COMPLETE.
      • Removes the source replica from the metadata.
    • This ensures that at no point is the object unavailable or the replica count inconsistent (from the perspective of availability).

Replica Adjustment Strategy - Master Initiated Dynamic Replica Adjustment

Description:

The master service tracks object temperature and makes decisions on replica adjustments (e.g., scale up/down). The Master actively orchestrates the replication process. Clients register themselves as "Workers" capable of executing commands. The Master selects a suitable Worker and sends RPC commands to trigger replica transfers.

This approach allows for global optimization of resource usage and transfer scheduling, decoupled from user read requests. It requires a bi-directional communication channel (or Reverse RPC) where the Master can invoke methods on the Client.

External Replica Control APIs

Description:

To support external systems that require explicit control over replica placement (e.g., for locality-aware scheduling), we provide explicit Copy and Move APIs. These allow clients to manually trigger replication or migration to specific nodes, utilizing the same atomic primitives as the internal scaling logic.

Roadmap

Phase 1: Master-Worker Communication Infrastructure

  • Implement WorkerService on Client side to handle Master commands.
  • Implement Worker Registry on Master side (Client registration).
  • Establish Reverse RPC mechanism (Master -> Client).

Phase 2: Replica Migration Primitives

  • Implement ReplicaCopy and ReplicaMove operations on Master.
  • Implement data transfer logic on Client (WorkerService).
  • Implement Copy and Move APIs for external control.

Phase 3: Hot Data Detection & Basic Framework

  • Implement AccessTracker and HotDataDetector in MasterService.
  • Integrate access recording into GetReplicaList.
  • Add metrics for access patterns and object temperature.

Phase 4: Dynamic Replica Increment (Scale Up)

  • Implement Master-side scaling logic (background thread).
  • Implement ExecuteReplicaCopy in WorkerService.
  • Integrate hot data detection with scaling decisions.
  • Basic load balancing for target node selection.

Phase 5: Replica Decrement & Optimization

  • Implement cold data detection logic.
  • Implement scale-down logic (garbage collection of excess replicas).
  • Optimize source/target selection for transfers (topology awareness).

Other Feature consideration

Current Plan (Phase 1 And Phase 2)

Currently we need support the Phase 1 and Phase 2 to build the basic Master-Worker communication infrastructure and replica migration primitives.

Design

  1. add a new worker server in the client side to handle the master command.
  2. add a worker registry in the master side to manage the client registration and selection.
  3. add a new request handle logic in the master side to handle the ReplicaCopy and ReplicaMove request.
  4. exposure the Copy and Move APIs for external usage.

Before submitting a new issue...

  • Make sure you already searched for relevant issues and read the documentation

Guida contributor