Skip to content

[RFC]: Replica copy and move support #1159

@zhongzhouTan-coder

Description

@zhongzhouTan-coder

Changes proposed

Introduction

This RFC proposes the introduction of explicit ReplicaCopy and ReplicaMove primitives within the Mooncake Store. These primitives provide the fundamental mechanisms required to migrate data replicas between different storage nodes. By exposing these operations, we enable external orchestration systems (such as gateways or schedulers) to implement cache-aware scheduling policies. Furthermore, this lays the groundwork for future internal features like dynamic replica management.

Motivation

Currently, Mooncake Store's replica placement is largely determined at creation time. As access patterns change or cluster topology evolves, the initial placement may become suboptimal. There is a growing need to adjust replica placement dynamically to optimize performance and resource utilization.

The primary drivers for this feature are:

  1. External Cache-Aware Scheduling: Upper-layer applications or gateways (e.g., LMCache) often have global visibility into request patterns and compute scheduling. They need a way to proactively move data to where compute will happen (data locality) or replicate hot data to multiple nodes to spread the load. Without explicit copy/move APIs, these external systems cannot leverage Mooncake for optimized data placement.
  2. Dynamic Replica Management: To support future features like auto-scaling of hot keys (creating more replicas for frequently accessed data) or rebalancing storage usage (moving cold data), the underlying store must support atomic copy and move operations.
  3. Resource Optimization: Moving replicas allows for better utilization of cluster resources by rebalancing data distribution across nodes, preventing storage hotspots.

In Scope

  1. Design and implement ReplicaCopy and ReplicaMove operations in the Master service for DRAM-based replicas.
  2. Design and implement the corresponding data transfer logic in the Client service.
  3. Expose APIs for external systems to invoke these operations.
  4. Add new basic copy/move python api in the client side.
  5. Add logging and metrics to monitor replica movements.
  6. Testing and validation of the new features.

Out of Scope

  1. Support for Disk or SSD storage media (DRAM only).
  2. Cross-tier migration (e.g., Memory to Disk).

Proposal

In the previous RFC, we introduced a simple high level design for the ReplicaCopy and ReplicaMove operations. But the previous design use master initiated data transfer, which may introduce extra overhead for the current system architecture. And thanks to the PR, we resue the basic idea and introduce a new design which change the transfer task initialzie from master push to client poll.

Architecture

Component Diagram (Client Poll Model)

Image

Pros and cons (Master push vs Client poll)

  1. Master push

    • Pros:
      • Real-Time Latency: trigger a transfer immediately.
      • Traffic Efficiency: No empty checks from clients.
    • Cons:
      • Master and client complexity: introduce a new communication mechanism which will increase system complexity.
  2. Client poll (Recommended)

    • Pros:
      • Simplicity & Stability: consistent with existing architecture.
      • Natural Flow Control:the Master doesn't need to implement complex backpressure logic.
    • Cons:
      • Latency:there is a delay up to the polling interval.

Sequence Diagram

  1. Replica Copy

    Image
  2. Replica Move

    Image

Core components

  1. External RPC Interface (Master Service)

    These APIs are exposed to external systems (e.g., Gateways) to trigger operations.

    struct ReplicaCopyRequest {
        std::string key;
        std::List<std::string> targets; // localhost_name or client_id?
    };
    
    struct ReplicaMoveRequest {
        std::string key;
        std::string source; // localhost_name or client_id?
        std::string target; // localhost_name or client_id?
    };
    
    // RPC Methods
    tl::expected<UUID, ErrorCode> Copy(ReplicaCopyRequest req);
    tl::expected<UUID, ErrorCode> Move(ReplicaMoveRequest req);
  2. Task queue and definition

    enum class TaskType {
        REPLICA_COPY,
        REPLICA_MOVE,
    };
    
    enum class TaskStatus {
        PENDING,
        RUNNING,
        SUCCESS,
        FAILED
    };
    
    struct Task {
        uint64_t task_id;
        TaskType type;
        TaskStatus status;
        std::string payload; // Serialized arguments based on type
    };
    
    struct ReplicaCopyPayload {
        std::string key;
        std::string source;
        std::List<std::string> targets;
    };
    
    struct ReplicaMovePayload {
        std::string key;
        std::string source;
        std::string target;
    };
    
    // Task Queue Management
    class ClientTaskManager {
        std::mutex client_tasks_mutex_;
        // the key can be client_id or host_name
        std::unordered_map<std::string, std::vector<Task>> client_tasks_;
    
        // use host_name or client_id
        void AddTask(const std::string& client_host_name, Task task);
    };
    
    // RPC Method
    tl::expected<std::vector<Task>, ErrorCode> PollTasks(const std::string& client_id); // client_id or host_name
  3. Internal RPC Interface

    These APIs are used by the mooncake client to complete the replica copy/move tasks.

    struct AllocationConfig {
        std::vector<std::string> segment_names;
    };
    
    // RPC Methods
    // Copy
    tl::expected<std::vector<Replica::Descriptor>, ErrorCode> CopyStart(const std::string& key, const std::vector<size_t>& slice_lengths, const AllocationConfig& config);
    tl::expected<void, ErrorCode> CopyEnd(const std::string& key);
    // Move
    tl::expected<std::vector<Replica::Descriptor>, ErrorCode> MoveStart(const std::string& key, const std::vector<size_t>& slice_lengths, const AllocationConfig& config);
    tl::expected<void, ErrorCode> MoveEnd(const std::string& key);
  4. Task executor (client)

    The task executor component in the client is responsible for executing the replica copy/move tasks received from the master. It handles the data transfer and communicates the status back to the master.

    class ReplicaCopyExecutor {
    public:
        void Execute(const ReplicaCopyPayload& payload) {
            // 1. Read data from local buffer(preferably) or other node
            auto query_result = client_->Query(key);
            client_->Get(key, query_result.value(), slices);
    
            // 2. Allocate Target
            AllocationConfig config = { payload.targets };
            auto target_descs = master_client_->CopyStart(payload.key, GetLengths(slices), config);
            
            // 3. Transfer Data
            for (const auto& replica : target_descs.value()) {
                TransferWrite(replica, slices);
            }
    
            // add rollback logic here if needed
    
            // 4. Commit
            master_client_->CopyEnd(payload.key);
        }
    };
    
    class ReplicaMoveExecutor {
    public:
        void Execute(const ReplicaMovePayload& payload) {
            // 1. Read data from local buffer(preferably) or other node
            auto query_result = client_->Query(key);
            client_->Get(key, query_result.value(), slices);
    
            // 2. Allocate Target
            AllocationConfig config = { {payload.target} };
            auto target_descs = master_client_->MoveStart(payload.key, GetLengths(slices), config);
    
            // 3. Transfer Data
            TransferWrite(target_descs[0], slices);
            
            // add rollback logic here if needed
    
            // 4. Commit
            master_client_->MoveEnd(payload.key);
        }
    };
  5. Task Status Monitor (nice to have)

    The master service should provide an external API for external systems to monitor the status of ongoing replica copy/move tasks.

Before submitting a new issue...

  • Make sure you already searched for relevant issues and read the documentation

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions