Description
Changes proposed
Introduction
This RFC proposes a dynamic replica management system for Mooncake Store that adaptively adjusts replica configurations based on workload patterns.
Motivation
Mooncake Store currently uses a static replica configuration where each object has a fixed number of replicas determined at write time. This approach does not account for changing access patterns, leading to suboptimal performance and resource utilization. And also now the mooncake-store only get the first completion replica for read, which may not be the most efficient one compare to choose the local replica. So we need a mechanism to dynamically adjust replicas based on object "hotness" and aslo may need to optimize read replica selection.
This may introduce the following benefits:
- Improved Read Performance: Because the hot data is preferred to more memory replicas and will have more probability to be served in the local when read.
- Better Resource Utilization: Our memory will have more proportionally allocated to frequently accessed data, reducing waste.
- Optimized Read Replica Selection: By selecting the most efficient replica for read operations, we can further enhance performance.
This proposal introduces a comprehensive design including hot data detection algorithm, replica adjustment and placement strategy, and read replica selection optimization.
In Scope
- Hot data statistics and identification.
- Master side initialized replica adjustment (scale up/down).
- Optimized read replica selection based on locality.
- log/testing/monitoring
Out of Scope
- data promotion/demotion with multiple tier storage
Proposal
Architecture
Architecture Description
Hot data detection algorithm - Access Frequency-Based Hot Data Detection(Basic Idea)
Description:
The algorithm tracks how many times each object is accessed within a sliding time window. Objects are classified as HOT, WARM, or COLD based on their access frequency and recency:
- HOT: Accessed frequently
- WARM: Moderate access patterns
- COLD: Rarely accessed
Atomic Replica Primitives
Description:
To ensure consistency and reliability during replica management, we define two fundamental atomic operations on the Master: ReplicaCopy and ReplicaMove. These operations are designed as two-phase transactions to handle the long-running nature of data transfer while maintaining metadata consistency.
-
ReplicaCopy (Scale Up)
- Phase 1: Start (Allocation) - Master allocates a new replica descriptor on the target node. The replica is marked as
PROCESSINGand is not visible to readers. - Phase 2: End (Commit) - After successful data transfer, Master marks the replica as
COMPLETE. It becomes immediately visible to readers.
- Phase 1: Start (Allocation) - Master allocates a new replica descriptor on the target node. The replica is marked as
-
ReplicaMove (Migration)
- Phase 1: Start (Allocation) - Master allocates a new replica descriptor on the target node (marked
PROCESSING). - Phase 2: End (Atomic Swap) - After successful data transfer, Master atomically:
- Marks the new replica as
COMPLETE. - Removes the source replica from the metadata.
- Marks the new replica as
- This ensures that at no point is the object unavailable or the replica count inconsistent (from the perspective of availability).
- Phase 1: Start (Allocation) - Master allocates a new replica descriptor on the target node (marked
Replica Adjustment Strategy - Master Initiated Dynamic Replica Adjustment
Description:
The master service tracks object temperature and makes decisions on replica adjustments (e.g., scale up/down). The Master actively orchestrates the replication process. Clients register themselves as "Workers" capable of executing commands. The Master selects a suitable Worker and sends RPC commands to trigger replica transfers.
This approach allows for global optimization of resource usage and transfer scheduling, decoupled from user read requests. It requires a bi-directional communication channel (or Reverse RPC) where the Master can invoke methods on the Client.
External Replica Control APIs
Description:
To support external systems that require explicit control over replica placement (e.g., for locality-aware scheduling), we provide explicit Copy and Move APIs. These allow clients to manually trigger replication or migration to specific nodes, utilizing the same atomic primitives as the internal scaling logic.
Roadmap
Phase 1: Master-Worker Communication Infrastructure
- Implement
WorkerServiceon Client side to handle Master commands. - Implement Worker Registry on Master side (Client registration).
- Establish Reverse RPC mechanism (Master -> Client).
Phase 2: Replica Migration Primitives
- Implement
ReplicaCopyandReplicaMoveoperations on Master. - Implement data transfer logic on Client (WorkerService).
- Implement
CopyandMoveAPIs for external control.
Phase 3: Hot Data Detection & Basic Framework
- Implement
AccessTrackerandHotDataDetectorin MasterService. - Integrate access recording into
GetReplicaList. - Add metrics for access patterns and object temperature.
Phase 4: Dynamic Replica Increment (Scale Up)
- Implement Master-side scaling logic (background thread).
- Implement
ExecuteReplicaCopyin WorkerService. - Integrate hot data detection with scaling decisions.
- Basic load balancing for target node selection.
Phase 5: Replica Decrement & Optimization
- Implement cold data detection logic.
- Implement scale-down logic (garbage collection of excess replicas).
- Optimize source/target selection for transfers (topology awareness).
Other Feature consideration
- Graceful shutdown for the client https://github.com/kvcache-ai/Mooncake/issues/607
Current Plan (Phase 1 And Phase 2)
Currently we need support the Phase 1 and Phase 2 to build the basic Master-Worker communication infrastructure and replica migration primitives.
Design
- add a new worker server in the client side to handle the master command.
- add a worker registry in the master side to manage the client registration and selection.
- add a new request handle logic in the master side to handle the
ReplicaCopyandReplicaMoverequest. - exposure the
CopyandMoveAPIs for external usage.
Before submitting a new issue...
- Make sure you already searched for relevant issues and read the documentation