JimyMa commented Dec 18, 2025

⚠️ Breaking Changes
API Refactoring: The Python API has changed significantly. Previous methods such as write_batch and read_batch have been replaced by unified write and read interfaces. Arguments are now passed as vectorized parameters (e.g., vector<uintptr_t>) instead of object lists, to reduce Python overhead.
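To illustrate the shift from object lists to vectorized parameters, here is a minimal sketch. The `Transfer` dataclass, its fields, and `to_vectorized` are hypothetical names chosen for illustration; they are not the dlslime API, whose actual signatures should be taken from `dlslime/_slime_c.pyi`.

```python
from dataclasses import dataclass


@dataclass
class Transfer:
    """Hypothetical per-transfer object, as an object-list API might use."""
    local_addr: int
    remote_addr: int
    length: int


def to_vectorized(transfers):
    """Flatten a list of Transfer objects into three parallel lists of ints.

    Parallel integer lists map directly onto C++ vectors such as
    vector<uintptr_t>, so the binding layer does not need to unpack
    one Python object per transfer.
    """
    if not transfers:
        return [], [], []
    local, remote, lengths = zip(
        *((t.local_addr, t.remote_addr, t.length) for t in transfers)
    )
    return list(local), list(remote), list(lengths)


batch = [Transfer(0x1000, 0x9000, 64), Transfer(0x2000, 0xA000, 128)]
local_addrs, remote_addrs, lengths = to_vectorized(batch)
```

Callers migrating from a batch-style call would do a conversion of this kind once, then pass the flat lists to the unified write or read call.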

Async Pattern Shift: The explicit async_op flag has been removed in favor of a full Future-based pattern. All communication operations (Send/Recv/Read/Write) now return Future objects (e.g., SlimeSendFuture), requiring the user to explicitly call .wait() for synchronization.
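The calling pattern this implies can be sketched as follows. `SketchFuture` and `send` are stand-ins written for this example using only the Python standard library; they assume nothing about how SlimeSendFuture is implemented beyond the `.wait()` call described above, and the `timeout` argument is an assumption of the sketch.

```python
import threading


class SketchFuture:
    """Hypothetical stand-in for a Future such as SlimeSendFuture."""

    def __init__(self):
        self._done = threading.Event()
        self._result = None

    def set_result(self, value):
        # Called by whichever thread finishes the operation.
        self._result = value
        self._done.set()

    def wait(self, timeout=None):
        # Block the caller until the operation completes.
        if not self._done.wait(timeout):
            raise TimeoutError("operation did not complete in time")
        return self._result


def send(payload: bytes) -> SketchFuture:
    """Start a simulated send and return immediately with a future."""
    fut = SketchFuture()

    def _complete():
        # A real backend would signal completion from its worker thread.
        fut.set_result(len(payload))

    threading.Thread(target=_complete, daemon=True).start()
    return fut


fut = send(b"hello")          # returns immediately; there is no async_op flag
sent = fut.wait(timeout=5.0)  # explicit synchronization point
```

Because every operation returns a future, code that relied on the old blocking default now needs an explicit `.wait()` before it reuses or reads the buffers involved.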

🚀 New Features & Core Improvements
Unified Endpoint Architecture:

Introduced the RDMAEndpoint class, which unifies the previously separated RDMAIOEndpoint (for one-sided RDMA) and RDMAMsgEndpoint (for two-sided messaging). Users can now manage multiple communication modes via a single endpoint instance.

Async Worker Runtime:

Implemented RDMAWorker and GlobalWorkerManager. A dedicated background thread now handles Completion Queue (CQ) events without relying on main-thread polling.
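As a rough illustration of the idea, the sketch below uses a `queue.Queue` as a stand-in for a completion queue and a background thread that drains it. `SketchWorker` and its methods are hypothetical; the real RDMAWorker is implemented in C++ and its interface is not shown in this description.

```python
import queue
import threading


class SketchWorker:
    """Hypothetical worker: one background thread drains a completion queue."""

    def __init__(self):
        self._cq = queue.Queue()  # stand-in for an RDMA completion queue
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def post_completion(self, callback, status):
        # In a real runtime the NIC produces completions; here we enqueue them.
        self._cq.put((callback, status))

    def _run(self):
        while not self._stop.is_set():
            try:
                callback, status = self._cq.get(timeout=0.05)
            except queue.Empty:
                continue  # nothing completed yet; check the stop flag again
            callback(status)

    def shutdown(self):
        self._stop.set()
        self._thread.join()


worker = SketchWorker()
done = threading.Event()
statuses = []


def on_complete(status):
    statuses.append(status)
    done.set()


worker.post_completion(on_complete, "ok")
done.wait(timeout=5.0)  # the main thread waits on an event instead of polling
worker.shutdown()
```

The main thread only blocks on the event it cares about, which is the property the worker design is meant to provide.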

Optimized Resource Management:

Introduced GlobalContextManager and pooled RDMAContext management. This supports the reuse of contexts across multiple devices, reducing initialization overhead.
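One way to picture pooled context management is a registry that creates a context on first request and hands back the same instance afterwards. The sketch below makes that assumption explicit; `SketchContext`, `SketchContextManager`, and the device labels are hypothetical and do not reflect the actual GlobalContextManager interface.

```python
import threading


class SketchContext:
    """Hypothetical context object; creating one is assumed to be expensive."""

    def __init__(self, device: str):
        self.device = device


class SketchContextManager:
    """Hypothetical pool that keeps one context per device name."""

    def __init__(self):
        self._lock = threading.Lock()
        self._contexts = {}

    def get(self, device: str) -> SketchContext:
        # Create the context on first request and reuse it afterwards.
        with self._lock:
            ctx = self._contexts.get(device)
            if ctx is None:
                ctx = SketchContext(device)
                self._contexts[device] = ctx
            return ctx


manager = SketchContextManager()
ctx_a = manager.get("dev0")  # "dev0" and "dev1" are placeholder device labels
ctx_b = manager.get("dev0")  # reused, no second initialization
ctx_c = manager.get("dev1")  # a different device gets its own context
```

Endpoints that target the same device would then share one context instead of each paying the initialization cost.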

PyTorch Backend Upgrade:

Rewrote csrc/torch/slime_backend.cpp. The PyTorch distributed backend has been migrated to the new Unified Endpoint and Worker architecture, improving efficiency when used as a ProcessGroup.

Enhanced Developer Experience:

Added dlslime/_slime_c.pyi type stubs, which give IDEs and type checkers the signatures of the compiled extension module for code completion and type checking.


@JimyMa JimyMa requested a review from HaoLiuuu December 18, 2025 12:42