
feat: add synchronous communication ops#467

Open
FangRui0 wants to merge 4 commits into hw-native-sys:main from FangRui0:add_newop
Conversation

@FangRui0
Contributor

No description provided.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a suite of synchronous point-to-point and collective communication operations to the PTO dialect, including tput, tget, signal operations (tnotify, twait, ttest), and collectives (tbroadcast, gather, scatter, reduce). The changes encompass IR definitions, documentation, C/Python bindings, memory effect specifications, and EmitC lowering patterns. Feedback focuses on correcting memory effects for staging and accumulation tiles in several operations to ensure accurate side-effect analysis. Additionally, there are suggestions to complete the atomic type support in tput lowering, reduce code duplication in the EmitC conversion patterns, and improve documentation consistency for the new collective operations.

Comment on lines +10376 to +10377
addEffect(effects, &getAccMutable(), MemoryEffects::Read::get());
addEffect(effects, &getRecvPingMutable(), MemoryEffects::Read::get());

high

The memory effects for the acc and recvPing operands appear to be incorrect:

  • recvPing is a receive staging tile, so it is written to. Its effect should be MemoryEffects::Write.
  • acc is an accumulation tile. It is both read from and written to during the reduction. It should have both Read and Write effects.
  addEffect(effects, &getAccMutable(), MemoryEffects::Read::get());
  addEffect(effects, &getAccMutable(), MemoryEffects::Write::get());
  addEffect(effects, &getRecvPingMutable(), MemoryEffects::Write::get());

Comment on lines +5545 to +5548
if constexpr (std::is_same_v<OpTy, pto::TPutOp>) {
  if (op.getAtomicType() == pto::AtomicType::AtomicAdd)
    actualCallee = "pto::comm::TPUT<pto::AtomicType::AtomicAdd>";
}

high

The lowering for pto.tput only handles AtomicType::AtomicAdd explicitly. The pto::atomic_type enum also supports AtomicAnd, AtomicOr, AtomicXor, AtomicMin, and AtomicMax, so this implementation is incomplete and will silently lower those variants to the non-atomic callee. Please extend it to cover all variants of pto::AtomicType.

Suggested change
if constexpr (std::is_same_v<OpTy, pto::TPutOp>) {
  if (op.getAtomicType() == pto::AtomicType::AtomicAdd)
    actualCallee = "pto::comm::TPUT<pto::AtomicType::AtomicAdd>";
}
if constexpr (std::is_same_v<OpTy, pto::TPutOp>) {
  switch (op.getAtomicType()) {
  case pto::AtomicType::AtomicNone:
    break; // default callee is correct
  case pto::AtomicType::AtomicAdd:
    actualCallee = "pto::comm::TPUT<pto::AtomicType::AtomicAdd>";
    break;
  // TODO: Add cases for other atomic types like And, Or, Xor, Min, Max
  default:
    return rewriter.notifyMatchFailure(op, "unsupported atomic type for tput lowering");
  }
}

Comment on lines +1105 to +1115
##### `pto.comm_tgather` - Collective Gather

**Summary:** Communication collective that lowers to `pto::comm::TGATHER(...)`. This op is distinct from tile-level `pto.tgather`.

**Arguments:** `dst`, `ping`, optional `pong`, variadic `group`, `root`

**Constraints & Verification:**

- `group` must be non-empty and all members must have identical types.
- `dst` element type must match the group element type.
- `ping` / `pong` must be local VEC tile-like values with matching element type.

medium

The documentation for pto.comm_tgather is quite sparse compared to other operations like pto.tput. To improve clarity and consistency, please consider adding a detailed argument table and a basic MLIR example.
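As a sketch, a basic example could look like the following. The assembly format, operand names, and types here are purely illustrative (I have not checked them against the actual op definition), so adjust to match the real printed form:

```mlir
// Hypothetical syntax: gather tiles from a three-member group into %dst at root 1.
pto.comm_tgather %dst, %ping, %pong group(%g0, %g1, %g2) root(1)
```

Pairing a snippet like this with an argument table (name, type constraint, one-line description per operand) would bring the section in line with the `pto.tput` documentation.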

Comment on lines +1119 to +1129
##### `pto.comm_tscatter` - Collective Scatter

**Summary:** Communication collective that lowers to `pto::comm::TSCATTER(...)`. This op is distinct from tile-level `pto.tscatter`.

**Arguments:** `src`, `ping`, optional `pong`, variadic `group`, `root`

**Constraints & Verification:**

- `group` must be non-empty and all members must have identical types.
- `src` element type must match the group element type.
- `ping` / `pong` must be local VEC tile-like values with matching element type.

medium

Similar to pto.comm_tgather, the documentation for pto.comm_tscatter lacks detail. Please add an argument table and a basic example to align with the documentation style of other operations in this file.

Comment on lines +5340 to +5520
template <typename CollectiveOp>
struct PTOCommCollectiveToEmitC : public OpConversionPattern<CollectiveOp> {
  using OpConversionPattern<CollectiveOp>::OpConversionPattern;

  explicit PTOCommCollectiveToEmitC(TypeConverter &typeConverter, MLIRContext *ctx,
                                    StringRef apiName)
      : OpConversionPattern<CollectiveOp>(typeConverter, ctx),
        apiName(apiName.str()) {}

  LogicalResult matchAndRewrite(CollectiveOp op, typename CollectiveOp::Adaptor adaptor,
                                ConversionPatternRewriter &rewriter) const override {
    auto *ctx = rewriter.getContext();
    Location loc = op.getLoc();

    SmallVector<Value> operands;
    std::string lambda = "([&](";

    auto appendParam = [&](StringRef name) {
      if (lambda.back() != '(')
        lambda += ", ";
      lambda += "auto &";
      lambda += name.str();
    };

    auto appendOperand = [&](Value value, StringRef name) {
      appendParam(name);
      operands.push_back(value);
    };

    auto buildPong = [&](Value original, Value emitted, StringRef name) -> FailureOr<Value> {
      if (!original)
        return failure();
      return buildCommTileValue(rewriter, loc, original, emitted);
    };

    if constexpr (std::is_same_v<CollectiveOp, pto::TBroadcastOp>) {
      FailureOr<Value> srcGT =
          buildCommGlobalTensorValue(rewriter, loc, op.getSrc(), adaptor.getSrc(),
                                     op.getOperation());
      FailureOr<Value> pingTile =
          buildCommTileValue(rewriter, loc, op.getPing(), adaptor.getPing());
      auto groupGTs =
          buildCommGroupGlobalTensors(rewriter, loc, op, op.getGroup(), adaptor.getGroup());
      if (failed(srcGT) || failed(pingTile) || failed(groupGTs))
        return rewriter.notifyMatchFailure(op, "failed to materialize broadcast operands");
      appendOperand(*srcGT, "__src");
      appendOperand(*pingTile, "__ping");
      if (op.getPong()) {
        FailureOr<Value> pongTile =
            buildPong(op.getPong(), adaptor.getPong(), "__pong");
        if (failed(pongTile))
          return rewriter.notifyMatchFailure(op, "failed to materialize pong tile");
        appendOperand(*pongTile, "__pong");
      }
      for (size_t i = 0; i < groupGTs->size(); ++i)
        appendOperand((*groupGTs)[i], ("__g" + Twine(i)).str());
      lambda += ") { ";
      lambda += "using __GT = std::decay_t<decltype(__g0)>; __GT __group[] = {";
      for (size_t i = 0; i < groupGTs->size(); ++i) {
        if (i)
          lambda += ", ";
        lambda += "__g" + std::to_string(i);
      }
      lambda += "}; auto __pg = pto::comm::ParallelGroup<__GT>::Create(__group, ";
      lambda += std::to_string(groupGTs->size()) + ", " + std::to_string(op.getRoot());
      lambda += "); pto::comm::TBROADCAST(__pg, __src, __ping";
      if (op.getPong())
        lambda += ", __pong";
      lambda += "); })";
    } else if constexpr (std::is_same_v<CollectiveOp, pto::CommTGatherOp>) {
      FailureOr<Value> dstGT =
          buildCommGlobalTensorValue(rewriter, loc, op.getDst(), adaptor.getDst(),
                                     op.getOperation());
      FailureOr<Value> pingTile =
          buildCommTileValue(rewriter, loc, op.getPing(), adaptor.getPing());
      auto groupGTs =
          buildCommGroupGlobalTensors(rewriter, loc, op, op.getGroup(), adaptor.getGroup());
      if (failed(dstGT) || failed(pingTile) || failed(groupGTs))
        return rewriter.notifyMatchFailure(op, "failed to materialize gather operands");
      appendOperand(*dstGT, "__dst");
      appendOperand(*pingTile, "__ping");
      if (op.getPong()) {
        FailureOr<Value> pongTile =
            buildPong(op.getPong(), adaptor.getPong(), "__pong");
        if (failed(pongTile))
          return rewriter.notifyMatchFailure(op, "failed to materialize pong tile");
        appendOperand(*pongTile, "__pong");
      }
      for (size_t i = 0; i < groupGTs->size(); ++i)
        appendOperand((*groupGTs)[i], ("__g" + Twine(i)).str());
      lambda += ") { using __GT = std::decay_t<decltype(__g0)>; __GT __group[] = {";
      for (size_t i = 0; i < groupGTs->size(); ++i) {
        if (i)
          lambda += ", ";
        lambda += "__g" + std::to_string(i);
      }
      lambda += "}; auto __pg = pto::comm::ParallelGroup<__GT>::Create(__group, ";
      lambda += std::to_string(groupGTs->size()) + ", " + std::to_string(op.getRoot());
      lambda += "); pto::comm::TGATHER(__pg, __dst, __ping";
      if (op.getPong())
        lambda += ", __pong";
      lambda += "); })";
    } else if constexpr (std::is_same_v<CollectiveOp, pto::CommTScatterOp>) {
      FailureOr<Value> srcGT =
          buildCommGlobalTensorValue(rewriter, loc, op.getSrc(), adaptor.getSrc(),
                                     op.getOperation());
      FailureOr<Value> pingTile =
          buildCommTileValue(rewriter, loc, op.getPing(), adaptor.getPing());
      auto groupGTs =
          buildCommGroupGlobalTensors(rewriter, loc, op, op.getGroup(), adaptor.getGroup());
      if (failed(srcGT) || failed(pingTile) || failed(groupGTs))
        return rewriter.notifyMatchFailure(op, "failed to materialize scatter operands");
      appendOperand(*srcGT, "__src");
      appendOperand(*pingTile, "__ping");
      if (op.getPong()) {
        FailureOr<Value> pongTile =
            buildPong(op.getPong(), adaptor.getPong(), "__pong");
        if (failed(pongTile))
          return rewriter.notifyMatchFailure(op, "failed to materialize pong tile");
        appendOperand(*pongTile, "__pong");
      }
      for (size_t i = 0; i < groupGTs->size(); ++i)
        appendOperand((*groupGTs)[i], ("__g" + Twine(i)).str());
      lambda += ") { using __GT = std::decay_t<decltype(__g0)>; __GT __group[] = {";
      for (size_t i = 0; i < groupGTs->size(); ++i) {
        if (i)
          lambda += ", ";
        lambda += "__g" + std::to_string(i);
      }
      lambda += "}; auto __pg = pto::comm::ParallelGroup<__GT>::Create(__group, ";
      lambda += std::to_string(groupGTs->size()) + ", " + std::to_string(op.getRoot());
      lambda += "); pto::comm::TSCATTER(__pg, __src, __ping";
      if (op.getPong())
        lambda += ", __pong";
      lambda += "); })";
    } else {
      FailureOr<Value> dstGT =
          buildCommGlobalTensorValue(rewriter, loc, op.getDst(), adaptor.getDst(),
                                     op.getOperation());
      FailureOr<Value> accTile =
          buildCommTileValue(rewriter, loc, op.getAcc(), adaptor.getAcc());
      FailureOr<Value> recvPing =
          buildCommTileValue(rewriter, loc, op.getRecvPing(), adaptor.getRecvPing());
      auto groupGTs =
          buildCommGroupGlobalTensors(rewriter, loc, op, op.getGroup(), adaptor.getGroup());
      if (failed(dstGT) || failed(accTile) || failed(recvPing) || failed(groupGTs))
        return rewriter.notifyMatchFailure(op, "failed to materialize reduce operands");
      appendOperand(*dstGT, "__dst");
      appendOperand(*accTile, "__acc");
      appendOperand(*recvPing, "__recv_ping");
      if (op.getRecvPong()) {
        FailureOr<Value> recvPong =
            buildPong(op.getRecvPong(), adaptor.getRecvPong(), "__recv_pong");
        if (failed(recvPong))
          return rewriter.notifyMatchFailure(op, "failed to materialize recv_pong");
        appendOperand(*recvPong, "__recv_pong");
      }
      for (size_t i = 0; i < groupGTs->size(); ++i)
        appendOperand((*groupGTs)[i], ("__g" + Twine(i)).str());
      lambda += ") { using __GT = std::decay_t<decltype(__g0)>; __GT __group[] = {";
      for (size_t i = 0; i < groupGTs->size(); ++i) {
        if (i)
          lambda += ", ";
        lambda += "__g" + std::to_string(i);
      }
      lambda += "}; auto __pg = pto::comm::ParallelGroup<__GT>::Create(__group, ";
      lambda += std::to_string(groupGTs->size()) + ", " + std::to_string(op.getRoot());
      lambda += "); pto::comm::TREDUCE(__pg, __dst, __acc, __recv_ping";
      if (op.getRecvPong())
        lambda += ", __recv_pong";
      lambda += ", " + reduceOpTok(op.getReduceOp()) + "); })";
    }

    rewriter.create<emitc::CallOpaqueOp>(loc, TypeRange{}, lambda, ArrayAttr{},
                                         ArrayAttr{}, operands);
    rewriter.eraseOp(op);
    return success();
  }

  std::string apiName;
};

medium

The matchAndRewrite method in PTOCommCollectiveToEmitC contains a significant amount of duplicated code across the if constexpr branches: the logic that emits the __group C-style array and creates the pto::comm::ParallelGroup is nearly identical in all four branches.

To improve maintainability, consider extracting this common logic into a helper function. This would reduce code duplication and make the lowering logic for each collective op clearer.

FangRui0 and others added 3 commits April 10, 2026 10:56
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
@FangRui0
Contributor Author

/run a3

@reedhecre

A3 board test failed

  • Trigger: manual
  • Source commit: ce6e7a2d124f
  • Result summary: OK 185 / FAIL 3 / SKIP 1
  • Log: /home/zhongxuan/ptoas-board-monitor/runtime/logs/20260410_110805_manual_pr467.log
  • Manual command: /run a3
  • Triggered by: FangRui0
  • Trigger comment: feat: add synchronous communication ops #467 (comment)
  • Failed stage: board-validation / exit=1

Failed test cases

  • mrgsort_format2 (run, exit=2)
  • comm_p2p (run, exit=2)
  • comm_collective (run, exit=2)

@reedhecre

A3 board test failure details: PR #467

mrgsort_format2

stage=run info=exit=2

[ERROR] Mismatch: golden_v5.bin vs v5.bin, max diff=1.0294660965701041 at idx=636 (golden=1.0295116901397705, out=4.559356966638006e-05, dtype=float32)
[ERROR] compare failed
[2026-04-10 11:14:07] ERROR: testcase failed (exit 2): mrgsort_format2
comm_p2p

stage=run info=exit=2

/home/zhongxuan/ptoas-board-monitor/runtime/runs/20260410_110805_manual_pr467/npu_validation/CommSync/comm_p2p/comm_p2p_kernel.cpp:99:3: error: no matching function for call to object of type '(lambda at /home/zhongxuan/ptoas-board-monitor/runtime/runs/20260410_110805_manual_pr467/npu_validation/CommSync/comm_p2p/comm_p2p_kernel.cpp:99:4)'
  ([&](auto &__signal, auto __value){ pto::comm::TNOTIFY(__signal, __value, pto::comm::NotifyOp::Set); })(v17, v5);
  ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/zhongxuan/ptoas-board-monitor/runtime/runs/20260410_110805_manual_pr467/npu_validation/CommSync/comm_p2p/comm_p2p_kernel.cpp:99:4: note: candidate function not viable: call to [host] function from __global__ [aicore] function
  ([&](auto &__signal, auto __value){ pto::comm::TNOTIFY(__signal, __value, pto::comm::NotifyOp::Set); })(v17, v5);
   ^
/home/zhongxuan/ptoas-board-monitor/runtime/runs/20260410_110805_manual_pr467/npu_validation/CommSync/comm_p2p/comm_p2p_kernel.cpp:100:3: error: no matching function for call to object of type '(lambda at /home/zhongxuan/ptoas-board-monitor/runtime/runs/20260410_110805_manual_pr467/npu_validation/CommSync/comm_p2p/comm_p2p_kernel.cpp:100:4)'
  ([&](auto &__signal, auto __cmp){ pto::comm::TWAIT(__signal, __cmp, pto::comm::WaitCmp::GE); })(v17, v5);
  ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/zhongxuan/ptoas-board-monitor/runtime/runs/20260410_110805_manual_pr467/npu_validation/CommSync/comm_p2p/comm_p2p_kernel.cpp:100:4: note: candidate function not viable: call to [host] function from __global__ [aicore] function
  ([&](auto &__signal, auto __cmp){ pto::comm::TWAIT(__signal, __cmp, pto::comm::WaitCmp::GE); })(v17, v5);
   ^
/home/zhongxuan/ptoas-board-monitor/runtime/runs/20260410_110805_manual_pr467/npu_validation/CommSync/comm_p2p/comm_p2p_kernel.cpp:101:14: error: no matching function for call to object of type '(lambda at /home/zhongxuan/ptoas-board-monitor/runtime/runs/20260410_110805_manual_pr467/npu_validation/CommSync/comm_p2p/comm_p2p_kernel.cpp:101:15)'
  bool v20 = ([&](auto &__signal, auto __cmp){ return pto::comm::TTEST(__signal, __cmp, pto::comm::WaitCmp::EQ); })(v17, v5);
             ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/zhongxuan/ptoas-board-monitor/runtime/runs/20260410_110805_manual_pr467/npu_validation/CommSync/comm_p2p/comm_p2p_kernel.cpp:101:15: note: candidate function not viable: call to [host] function from __global__ [aicore] function
  bool v20 = ([&](auto &__signal, auto __cmp){ return pto::comm::TTEST(__signal, __cmp, pto::comm::WaitCmp::EQ); })(v17, v5);
              ^
3 errors generated.
gmake[2]: *** [CMakeFiles/comm_p2p_kernel.dir/build.make:76: CMakeFiles/comm_p2p_kernel.dir/comm_p2p_kernel.cpp.o] Error 1
gmake[2]: *** Waiting for unfinished jobs....
gmake[1]: *** [CMakeFiles/Makefile2:85: CMakeFiles/comm_p2p_kernel.dir/all] Error 2
gmake: *** [Makefile:91: all] Error 2
[2026-04-10 11:18:00] ERROR: testcase failed (exit 2): comm_p2p
comm_collective

stage=run info=exit=2

/home/zhongxuan/ptoas-board-monitor/runtime/runs/20260410_110805_manual_pr467/npu_validation/CommSync/comm_collective/comm_collective_kernel.cpp:103:3: error: no matching function for call to object of type '(lambda at /home/zhongxuan/ptoas-board-monitor/runtime/runs/20260410_110805_manual_pr467/npu_validation/CommSync/comm_collective/comm_collective_kernel.cpp:103:4)'
  ([&](auto &__src, auto &__ping, auto &__g0, auto &__g1, auto &__g2) { using __GT = std::decay_t<decltype(__g0)>; __GT __group[] = {__g0, __g1, __g2}; auto __pg = pto::comm::ParallelGroup<__GT>::Create(__group, 3, 1); pto::comm::TBROADCAST(__pg, __src, __ping); })(v16, v26, v19, v22, v25);
  ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/zhongxuan/ptoas-board-monitor/runtime/runs/20260410_110805_manual_pr467/npu_validation/CommSync/comm_collective/comm_collective_kernel.cpp:103:4: note: candidate function not viable: call to [host] function from __global__ [aicore] function
  ([&](auto &__src, auto &__ping, auto &__g0, auto &__g1, auto &__g2) { using __GT = std::decay_t<decltype(__g0)>; __GT __group[] = {__g0, __g1, __g2}; auto __pg = pto::comm::ParallelGroup<__GT>::Create(__group, 3, 1); pto::comm::TBROADCAST(__pg, __src, __ping); })(v16, v26, v19, v22, v25);
   ^
/home/zhongxuan/ptoas-board-monitor/runtime/runs/20260410_110805_manual_pr467/npu_validation/CommSync/comm_collective/comm_collective_kernel.cpp:104:3: error: no matching function for call to object of type '(lambda at /home/zhongxuan/ptoas-board-monitor/runtime/runs/20260410_110805_manual_pr467/npu_validation/CommSync/comm_collective/comm_collective_kernel.cpp:104:4)'
  ([&](auto &__src, auto &__ping, auto &__pong, auto &__g0, auto &__g1, auto &__g2) { using __GT = std::decay_t<decltype(__g0)>; __GT __group[] = {__g0, __g1, __g2}; auto __pg = pto::comm::ParallelGroup<__GT>::Create(__group, 3, 1); pto::comm::TBROADCAST(__pg, __src, __ping, __pong); })(v16, v26, v27, v19, v22, v25);
  ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/zhongxuan/ptoas-board-monitor/runtime/runs/20260410_110805_manual_pr467/npu_validation/CommSync/comm_collective/comm_collective_kernel.cpp:104:4: note: candidate function not viable: call to [host] function from __global__ [aicore] function
  ([&](auto &__src, auto &__ping, auto &__pong, auto &__g0, auto &__g1, auto &__g2) { using __GT = std::decay_t<decltype(__g0)>; __GT __group[] = {__g0, __g1, __g2}; auto __pg = pto::comm::ParallelGroup<__GT>::Create(__group, 3, 1); pto::comm::TBROADCAST(__pg, __src, __ping, __pong); })(v16, v26, v27, v19, v22, v25);
   ^
/home/zhongxuan/ptoas-board-monitor/runtime/runs/20260410_110805_manual_pr467/npu_validation/CommSync/comm_collective/comm_collective_kernel.cpp:105:3: error: no matching function for call to object of type '(lambda at /home/zhongxuan/ptoas-board-monitor/runtime/runs/20260410_110805_manual_pr467/npu_validation/CommSync/comm_collective/comm_collective_kernel.cpp:105:4)'
  ([&](auto &__dst, auto &__ping, auto &__g0, auto &__g1, auto &__g2) { using __GT = std::decay_t<decltype(__g0)>; __GT __group[] = {__g0, __g1, __g2}; auto __pg = pto::comm::ParallelGroup<__GT>::Create(__group, 3, 1); pto::comm::TGATHER(__pg, __dst, __ping); })(v13, v26, v19, v22, v25);
  ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/zhongxuan/ptoas-board-monitor/runtime/runs/20260410_110805_manual_pr467/npu_validation/CommSync/comm_collective/comm_collective_kernel.cpp:105:4: note: candidate function not viable: call to [host] function from __global__ [aicore] function
  ([&](auto &__dst, auto &__ping, auto &__g0, auto &__g1, auto &__g2) { using __GT = std::decay_t<decltype(__g0)>; __GT __group[] = {__g0, __g1, __g2}; auto __pg = pto::comm::ParallelGroup<__GT>::Create(__group, 3, 1); pto::comm::TGATHER(__pg, __dst, __ping); })(v13, v26, v19, v22, v25);
   ^
/home/zhongxuan/ptoas-board-monitor/runtime/runs/20260410_110805_manual_pr467/npu_validation/CommSync/comm_collective/comm_collective_kernel.cpp:106:3: error: no matching function for call to object of type '(lambda at /home/zhongxuan/ptoas-board-monitor/runtime/runs/20260410_110805_manual_pr467/npu_validation/CommSync/comm_collective/comm_collective_kernel.cpp:106:4)'
  ([&](auto &__dst, auto &__ping, auto &__pong, auto &__g0, auto &__g1, auto &__g2) { using __GT = std::decay_t<decltype(__g0)>; __GT __group[] = {__g0, __g1, __g2}; auto __pg = pto::comm::ParallelGroup<__GT>::Create(__group, 3, 1); pto::comm::TGATHER(__pg, __dst, __ping, __pong); })(v13, v26, v27, v19, v22, v25);
  ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/zhongxuan/ptoas-board-monitor/runtime/runs/20260410_110805_manual_pr467/npu_validation/CommSync/comm_collective/comm_collective_kernel.cpp:106:4: note: candidate function not viable: call to [host] function from __global__ [aicore] function
  ([&](auto &__dst, auto &__ping, auto &__pong, auto &__g0, auto &__g1, auto &__g2) { using __GT = std::decay_t<decltype(__g0)>; __GT __group[] = {__g0, __g1, __g2}; auto __pg = pto::comm::ParallelGroup<__GT>::Create(__group, 3, 1); pto::comm::TGATHER(__pg, __dst, __ping, __pong); })(v13, v26, v27, v19, v22, v25);
   ^
/home/zhongxuan/ptoas-board-monitor/runtime/runs/20260410_110805_manual_pr467/npu_validation/CommSync/comm_collective/comm_collective_kernel.cpp:107:3: error: no matching function for call to object of type '(lambda at /home/zhongxuan/ptoas-board-monitor/runtime/runs/20260410_110805_manual_pr467/npu_validation/CommSync/comm_collective/comm_collective_kernel.cpp:107:4)'
  ([&](auto &__src, auto &__ping, auto &__g0, auto &__g1, auto &__g2) { using __GT = std::decay_t<decltype(__g0)>; __GT __group[] = {__g0, __g1, __g2}; auto __pg = pto::comm::ParallelGroup<__GT>::Create(__group, 3, 1); pto::comm::TSCATTER(__pg, __src, __ping); })(v16, v26, v19, v22, v25);
  ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/zhongxuan/ptoas-board-monitor/runtime/runs/20260410_110805_manual_pr467/npu_validation/CommSync/comm_collective/comm_collective_kernel.cpp:107:4: note: candidate function not viable: call to [host] function from __global__ [aicore] function
  ([&](auto &__src, auto &__ping, auto &__g0, auto &__g1, auto &__g2) { using __GT = std::decay_t<decltype(__g0)>; __GT __group[] = {__g0, __g1, __g2}; auto __pg = pto::comm::ParallelGroup<__GT>::Create(__group, 3, 1); pto::comm::TSCATTER(__pg, __src, __ping); })(v16, v26, v19, v22, v25);
   ^
/home/zhongxuan/ptoas-board-monitor/runtime/runs/20260410_110805_manual_pr467/npu_validation/CommSync/comm_collective/comm_collective_kernel.cpp:108:3: error: no matching function for call to object of type '(lambda at /home/zhongxuan/ptoas-board-monitor/runtime/runs/20260410_110805_manual_pr467/npu_validation/CommSync/comm_collective/comm_collective_kernel.cpp:108:4)'
  ([&](auto &__src, auto &__ping, auto &__pong, auto &__g0, auto &__g1, auto &__g2) { using __GT = std::decay_t<decltype(__g0)>; __GT __group[] = {__g0, __g1, __g2}; auto __pg = pto::comm::ParallelGroup<__GT>::Create(__group, 3, 1); pto::comm::TSCATTER(__pg, __src, __ping, __pong); })(v16, v26, v27, v19, v22, v25);
  ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/zhongxuan/ptoas-board-monitor/runtime/runs/20260410_110805_manual_pr467/npu_validation/CommSync/comm_collective/comm_collective_kernel.cpp:108:4: note: candidate function not viable: call to [host] function from __global__ [aicore] function
  ([&](auto &__src, auto &__ping, auto &__pong, auto &__g0, auto &__g1, auto &__g2) { using __GT = std::decay_t<decltype(__g0)>; __GT __group[] = {__g0, __g1, __g2}; auto __pg = pto::comm::ParallelGroup<__GT>::Create(__group, 3, 1); pto::comm::TSCATTER(__pg, __src, __ping, __pong); })(v16, v26, v27, v19, v22, v25);
   ^
/home/zhongxuan/ptoas-board-monitor/runtime/runs/20260410_110805_manual_pr467/npu_validation/CommSync/comm_collective/comm_collective_kernel.cpp:109:3: error: no matching function for call to object of type '(lambda at /home/zhongxuan/ptoas-board-monitor/runtime/runs/20260410_110805_manual_pr467/npu_validation/CommSync/comm_collective/comm_collective_kernel.cpp:109:4)'
  ([&](auto &__dst, auto &__acc, auto &__recv_ping, auto &__g0, auto &__g1, auto &__g2) { using __GT = std::decay_t<decltype(__g0)>; __GT __group[] = {__g0, __g1, __g2}; auto __pg = pto::comm::ParallelGroup<__GT>::Create(__group, 3, 1); pto::comm::TREDUCE(__pg, __dst, __acc, __recv_ping, pto::comm::ReduceOp::Sum); })(v13, v28, v26, v19, v22, v25);
  ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/zhongxuan/ptoas-board-monitor/runtime/runs/20260410_110805_manual_pr467/npu_validation/CommSync/comm_collective/comm_collective_kernel.cpp:109:4: note: candidate function not viable: call to [host] function from __global__ [aicore] function
  ([&](auto &__dst, auto &__acc, auto &__recv_ping, auto &__g0, auto &__g1, auto &__g2) { using __GT = std::decay_t<decltype(__g0)>; __GT __group[] = {__g0, __g1, __g2}; auto __pg = pto::comm::ParallelGroup<__GT>::Create(__group, 3, 1); pto::comm::TREDUCE(__pg, __dst, __acc, __recv_ping, pto::comm::ReduceOp::Sum); })(v13, v28, v26, v19, v22, v25);
   ^
/home/zhongxuan/ptoas-board-monitor/runtime/runs/20260410_110805_manual_pr467/npu_validation/CommSync/comm_collective/comm_collective_kernel.cpp:110:3: error: no matching function for call to object of type '(lambda at /home/zhongxuan/ptoas-board-monitor/runtime/runs/20260410_110805_manual_pr467/npu_validation/CommSync/comm_collective/comm_collective_kernel.cpp:110:4)'
  ([&](auto &__dst, auto &__acc, auto &__recv_ping, auto &__recv_pong, auto &__g0, auto &__g1, auto &__g2) { using __GT = std::decay_t<decltype(__g0)>; __GT __group[] = {__g0, __g1, __g2}; auto __pg = pto::comm::ParallelGroup<__GT>::Create(__group, 3, 1); pto::comm::TREDUCE(__pg, __dst, __acc, __recv_ping, __recv_pong, pto::comm::ReduceOp::Max); })(v13, v28, v26, v27, v19, v22, v25);
  ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/zhongxuan/ptoas-board-monitor/runtime/runs/20260410_110805_manual_pr467/npu_validation/CommSync/comm_collective/comm_collective_kernel.cpp:110:4: note: candidate function not viable: call to [host] function from __global__ [aicore] function
  ([&](auto &__dst, auto &__acc, auto &__recv_ping, auto &__recv_pong, auto &__g0, auto &__g1, auto &__g2) { using __GT = std::decay_t<decltype(__g0)>; __GT __group[] = {__g0, __g1, __g2}; auto __pg = pto::comm::ParallelGroup<__GT>::Create(__group, 3, 1); pto::comm::TREDUCE(__pg, __dst, __acc, __recv_ping, __recv_pong, pto::comm::ReduceOp::Max); })(v13, v28, v26, v27, v19, v22, v25);
   ^
8 errors generated.
gmake[2]: *** [CMakeFiles/comm_collective_kernel.dir/build.make:76: CMakeFiles/comm_collective_kernel.dir/comm_collective_kernel.cpp.o] Error 1
gmake[2]: *** Waiting for unfinished jobs....
gmake[1]: *** [CMakeFiles/Makefile2:85: CMakeFiles/comm_collective_kernel.dir/all] Error 2
gmake: *** [Makefile:91: all] Error 2
[2026-04-10 11:18:01] ERROR: testcase failed (exit 2): comm_collective
