Skip to content

Add comm engine attrs for collective and async ops#723

Open
FangRui0 wants to merge 1 commit into
hw-native-sys:mainfrom
FangRui0:align_args
Open

Add comm engine attrs for collective and async ops#723
FangRui0 wants to merge 1 commit into
hw-native-sys:mainfrom
FangRui0:align_args

Conversation

@FangRui0
Copy link
Copy Markdown
Contributor

No description provided.

@reedhecre
Copy link
Copy Markdown

reedhecre commented May 28, 2026

Codex Review

该评论由 review 机器人自动更新。

  • PR: Add comm engine attrs for collective and async ops #723 Add comm engine attrs for collective and async ops
  • Author: FangRui0
  • Base/Head: main / align_args
  • Head SHA: 36aaefa49536
  • Trigger: PR 有新提交
  • Generated At: 2026-05-28T12:56:56Z
  • Previous Head SHA: f9009868f461
  • Status: completed

Summary

检查到 2 个 P2 问题:新增 engine attrs 引入了生成 builder 的兼容性回归,而且 async session/transfer 的 engine 一致性没有被 verifier 约束。

Findings

  1. P2 在既有 op 签名中间插入 engine attr,破坏了生成 builder 的位置参数兼容性 include/PTO/IR/PTOOps.td:1740

dmaEngine/collEngine 被直接插入到了现有 ODS 参数列表中间:BuildAsyncSessionOp 放在 sync_id 前,collective ops 放在 root/reduceOp 前。TableGen 生成的 C++/Python builder 会按这个顺序重排位置参数,因此下游原来合法的调用(例如 TBroadcastOp(src, ping, group, 1)CommTScatterOp(src, ping, group, 1)TReduceOp(..., reduceOp, 1),以及 BuildAsyncSessionOp(..., 0) 这种把第三个参数当 sync_id 的写法)现在都会报错或绑定到错误形参。PR 里已经不得不把多个 sample 改成 keyword 调用,说明这是实际的 source-compatibility 回归。

  1. P2 async session 和 async transfer 的 `dmaEngine` 可以互相矛盾,但当前 verifier 不会拦截 lib/PTO/IR/PTO.cpp:12254

这个 PR 让 comm.build_async_sessioncomm.tput_asynccomm.tget_async 都各自携带独立的 dmaEngine,但 verifier 仍然只有 verifyAsyncTransferOp 的 shape/type 检查,没有任何跨 op 一致性校验。与此同时,EmitC lowering 会分别按各自属性发出 BuildAsyncSession<...>TPUT_ASYNC/TGET_ASYNC<...> 模板调用,所以现在可以构造出 session=SDMAtransfer=URMA 这种通过 verifier 的 IR,并生成明显自相矛盾的运行时代码。至少在 session 直接来自 BuildAsyncSessionOp 时,需要拒绝这种新引入的 contract mismatch。

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces DMA and collective communication engine attributes (dmaEngine and collEngine) to the PTO dialect, updating TableGen definitions, C API exports, Python bindings, and the EmitC lowering pass to support selecting different engines (e.g., SDMA vs. URMA, and AIV vs. CCU). The review feedback highlights critical issues regarding Python bindings and API backward compatibility. Specifically, using MlirContext directly in the Python binding lambdas instead of DefaultingPyMlirContext can lead to segmentation faults when the context is omitted. Additionally, inserting the new collEngine attribute before existing attributes like root and reduceOp in TableGen arguments breaks backward compatibility for positional arguments in both C++ and Python APIs; these new default-valued attributes should be appended to the end of the argument lists.

Comment on lines +383 to +384
[](py::object cls, mlir::pto::DmaEngine value, MlirContext ctx) -> py::object {
MlirAttribute a = mlirPTODmaEngineAttrGet(ctx, static_cast<int32_t>(value));
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Using MlirContext directly in the Python binding lambda prevents the context from being optional (defaulting to the active context in the thread-local stack). If the context is omitted or passed as None in Python, this will either raise a TypeError or pass a null context to mlirPTODmaEngineAttrGet, leading to a segmentation fault when unwrap(ctx) is called. Use DefaultingPyMlirContext instead, which automatically resolves to the default context.

Suggested change
[](py::object cls, mlir::pto::DmaEngine value, MlirContext ctx) -> py::object {
MlirAttribute a = mlirPTODmaEngineAttrGet(ctx, static_cast<int32_t>(value));
[](py::object cls, mlir::pto::DmaEngine value, DefaultingPyMlirContext ctx) -> py::object {
MlirAttribute a = mlirPTODmaEngineAttrGet(ctx.get(), static_cast<int32_t>(value));

Comment on lines +396 to +397
[](py::object cls, mlir::pto::CollEngine value, MlirContext ctx) -> py::object {
MlirAttribute a = mlirPTOCollEngineAttrGet(ctx, static_cast<int32_t>(value));
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Using MlirContext directly in the Python binding lambda prevents the context from being optional. If the context is omitted or passed as None in Python, this will either raise a TypeError or pass a null context to mlirPTOCollEngineAttrGet, leading to a segmentation fault when unwrap(ctx) is called. Use DefaultingPyMlirContext instead, which automatically resolves to the default context.

Suggested change
[](py::object cls, mlir::pto::CollEngine value, MlirContext ctx) -> py::object {
MlirAttribute a = mlirPTOCollEngineAttrGet(ctx, static_cast<int32_t>(value));
[](py::object cls, mlir::pto::CollEngine value, DefaultingPyMlirContext ctx) -> py::object {
MlirAttribute a = mlirPTOCollEngineAttrGet(ctx.get(), static_cast<int32_t>(value));

Comment thread include/PTO/IR/PTOOps.td
Comment on lines +1939 to 1940
DefaultValuedAttr<PTO_CollEngineAttr, "::mlir::pto::CollEngine::AIV">:$collEngine,
I32Attr:$root
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Inserting the new collEngine attribute before the existing root attribute in the TableGen arguments list breaks backward compatibility for both C++ and Python APIs. Any existing code calling TBroadcastOp with positional arguments will now have their arguments mismatched (e.g., passing the root rank to collEngine). Placing new optional or default-valued attributes at the end of the arguments list preserves the positional argument order and maintains backward compatibility.

      I32Attr:$root,
      DefaultValuedAttr<PTO_CollEngineAttr, "::mlir::pto::CollEngine::AIV">:$collEngine

Comment thread include/PTO/IR/PTOOps.td
Comment on lines +1957 to 1958
DefaultValuedAttr<PTO_CollEngineAttr, "::mlir::pto::CollEngine::AIV">:$collEngine,
I32Attr:$root
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Inserting the new collEngine attribute before the existing root attribute in the TableGen arguments list breaks backward compatibility for both C++ and Python APIs. Placing new optional or default-valued attributes at the end of the arguments list preserves the positional argument order and maintains backward compatibility.

      I32Attr:$root,
      DefaultValuedAttr<PTO_CollEngineAttr, "::mlir::pto::CollEngine::AIV">:$collEngine

Comment thread include/PTO/IR/PTOOps.td
Comment on lines +1975 to 1976
DefaultValuedAttr<PTO_CollEngineAttr, "::mlir::pto::CollEngine::AIV">:$collEngine,
I32Attr:$root
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Inserting the new collEngine attribute before the existing root attribute in the TableGen arguments list breaks backward compatibility for both C++ and Python APIs. Placing new optional or default-valued attributes at the end of the arguments list preserves the positional argument order and maintains backward compatibility.

      I32Attr:$root,
      DefaultValuedAttr<PTO_CollEngineAttr, "::mlir::pto::CollEngine::AIV">:$collEngine

Comment thread include/PTO/IR/PTOOps.td
Comment on lines +1994 to 1996
DefaultValuedAttr<PTO_CollEngineAttr, "::mlir::pto::CollEngine::AIV">:$collEngine,
PTO_ReduceOpAttr:$reduceOp,
I32Attr:$root
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Inserting the new collEngine attribute before the existing reduceOp and root attributes in the TableGen arguments list breaks backward compatibility for both C++ and Python APIs. Placing new optional or default-valued attributes at the end of the arguments list preserves the positional argument order and maintains backward compatibility.

      PTO_ReduceOpAttr:$reduceOp,
      I32Attr:$root,
      DefaultValuedAttr<PTO_CollEngineAttr, "::mlir::pto::CollEngine::AIV">:$collEngine

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants