-
Notifications
You must be signed in to change notification settings - Fork 464
Description
Changes proposed
Introduction
This RFC proposes the introduction of explicit ReplicaCopy and ReplicaMove primitives within the Mooncake Store. These primitives provide the fundamental mechanisms required to migrate data replicas between different storage nodes. By exposing these operations, we enable external orchestration systems (such as gateways or schedulers) to implement cache-aware scheduling policies. Furthermore, this lays the groundwork for future internal features like dynamic replica management.
Motivation
Currently, Mooncake Store's replica placement is largely determined at creation time. As access patterns change or cluster topology evolves, the initial placement may become suboptimal. There is a growing need to adjust replica placement dynamically to optimize performance and resource utilization.
The primary drivers for this feature are:
- External Cache-Aware Scheduling: Upper-layer applications or gateways (e.g., LMCache) often have global visibility into request patterns and compute scheduling. They need a way to proactively move data to where compute will happen (data locality) or replicate hot data to multiple nodes to spread the load. Without explicit copy/move APIs, these external systems cannot leverage Mooncake for optimized data placement.
- Dynamic Replica Management: To support future features like auto-scaling of hot keys (creating more replicas for frequently accessed data) or rebalancing storage usage (moving cold data), the underlying store must support atomic copy and move operations.
- Resource Optimization: Moving replicas allows for better utilization of cluster resources by rebalancing data distribution across nodes, preventing storage hotspots.
In Scope
- Design and implement
ReplicaCopyandReplicaMoveoperations in the Master service for DRAM-based replicas. - Design and implement the corresponding data transfer logic in the Client service.
- Expose APIs for external systems to invoke these operations.
- Add new basic
copy/movepython api in the client side. - Add logging and metrics to monitor replica movements.
- Testing and validation of the new features.
Out of Scope
- Support for Disk or SSD storage media (DRAM only).
- Cross-tier migration (e.g., Memory to Disk).
Proposal
In the previous RFC, we introduced a simple high level design for the ReplicaCopy and ReplicaMove operations. But the previous design use master initiated data transfer, which may introduce extra overhead for the current system architecture. And thanks to the PR, we resue the basic idea and introduce a new design which change the transfer task initialzie from master push to client poll.
Architecture
Component Diagram (Client Poll Model)
Pros and cons (Master push vs Client poll)
-
Master push
- Pros:
- Real-Time Latency: trigger a transfer immediately.
- Traffic Efficiency: No
empty checksfrom clients.
- Cons:
- Master and client complexity: introduce a new communication mechanism which will increase system complexity.
- Pros:
-
Client poll (Recommended)
- Pros:
- Simplicity & Stability: consistent with existing architecture.
- Natural Flow Control:the Master doesn't need to implement complex backpressure logic.
- Cons:
- Latency:there is a delay up to the polling interval.
- Pros:
Sequence Diagram
Core components
-
External RPC Interface (Master Service)
These APIs are exposed to external systems (e.g., Gateways) to trigger operations.
struct ReplicaCopyRequest { std::string key; std::List<std::string> targets; // localhost_name or client_id? }; struct ReplicaMoveRequest { std::string key; std::string source; // localhost_name or client_id? std::string target; // localhost_name or client_id? }; // RPC Methods tl::expected<UUID, ErrorCode> Copy(ReplicaCopyRequest req); tl::expected<UUID, ErrorCode> Move(ReplicaMoveRequest req);
-
Task queue and definition
enum class TaskType { REPLICA_COPY, REPLICA_MOVE, }; enum class TaskStatus { PENDING, RUNNING, SUCCESS, FAILED }; struct Task { uint64_t task_id; TaskType type; TaskStatus status; std::string payload; // Serialized arguments based on type }; struct ReplicaCopyPayload { std::string key; std::string source; std::List<std::string> targets; }; struct ReplicaMovePayload { std::string key; std::string source; std::string target; }; // Task Queue Management class ClientTaskManager { std::mutex client_tasks_mutex_; // the key can be client_id or host_name std::unordered_map<std::string, std::vector<Task>> client_tasks_; // use host_name or client_id void AddTask(const std::string& client_host_name, Task task); }; // RPC Method tl::expected<std::vector<Task>, ErrorCode> PollTasks(const std::string& client_id); // client_id or host_name
-
Internal RPC Interface
These APIs are used by the mooncake client to complete the replica copy/move tasks.
struct AllocationConfig { std::vector<std::string> segment_names; }; // RPC Methods // Copy tl::expected<std::vector<Replica::Descriptor>, ErrorCode> CopyStart(const std::string& key, const std::vector<size_t>& slice_lengths, const AllocationConfig& config); tl::expected<void, ErrorCode> CopyEnd(const std::string& key); // Move tl::expected<std::vector<Replica::Descriptor>, ErrorCode> MoveStart(const std::string& key, const std::vector<size_t>& slice_lengths, const AllocationConfig& config); tl::expected<void, ErrorCode> MoveEnd(const std::string& key);
-
Task executor (client)
The task executor component in the client is responsible for executing the replica copy/move tasks received from the master. It handles the data transfer and communicates the status back to the master.
class ReplicaCopyExecutor { public: void Execute(const ReplicaCopyPayload& payload) { // 1. Read data from local buffer(preferably) or other node auto query_result = client_->Query(key); client_->Get(key, query_result.value(), slices); // 2. Allocate Target AllocationConfig config = { payload.targets }; auto target_descs = master_client_->CopyStart(payload.key, GetLengths(slices), config); // 3. Transfer Data for (const auto& replica : target_descs.value()) { TransferWrite(replica, slices); } // add rollback logic here if needed // 4. Commit master_client_->CopyEnd(payload.key); } }; class ReplicaMoveExecutor { public: void Execute(const ReplicaMovePayload& payload) { // 1. Read data from local buffer(preferably) or other node auto query_result = client_->Query(key); client_->Get(key, query_result.value(), slices); // 2. Allocate Target AllocationConfig config = { {payload.target} }; auto target_descs = master_client_->MoveStart(payload.key, GetLengths(slices), config); // 3. Transfer Data TransferWrite(target_descs[0], slices); // add rollback logic here if needed // 4. Commit master_client_->MoveEnd(payload.key); } };
-
Task Status Monitor (nice to have)
The master service should provide an external API for external systems to monitor the status of ongoing replica copy/move tasks.
Before submitting a new issue...
- Make sure you already searched for relevant issues and read the documentation

