Summary
PTOAS today rejects the IR pattern PyPTO produces when a mixed cube+vec kernel runs with SplitMode::None on the a2a3 backend and contains an odd-valid trim (e.g. slicing a 16-row Q_HEAD_PAD tile down to 5 rows for Q_HEAD_BATCH output inside a single mixed root).
The IR pattern in question is not an UP_DOWN row-split — fa_fused declares no optimizations=[pl.split(...)] and SplitVectorKernel does not halve any tiles. Instead, the a2a3 backend's RequiresNoSplitDualAivDispatch == true causes PyPTO's ExpandMixedKernel to emit a dual-AIV no-op replay wrapper:
if (subblock_idx == 0) { lane0_body } // real work
else { lane1_body } // no-op replay
The lane1 body is built by BuildNoSplitLane1ReplayStmts (split_vector_kernel_pass.cpp around line 906-996), which calls WithZeroValidShape on every tile type and deletes every tile.store. Purpose: keep cube/vec pipe/sync state aligned across both AIV lanes without emitting real writes from lane 1.
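The replay construction described above can be modelled in a few lines. This is a minimal sketch, not the PyPTO implementation: `Stmt`, `buildLane1Replay`, and `selectLaneBody` are hypothetical names, and a statement is reduced to an op name plus the valid shape of its result tile.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for one IR statement in the mixed kernel.
struct Stmt {
  std::string op;  // e.g. "tile.load", "tile.slice", "tile.store"
  int validRow;    // valid rows carried by the result tile type
  int validCol;    // valid columns carried by the result tile type
};

// Models the lane-1 rewrite: drop every tile.store and zero the valid shape
// of every remaining tile, keeping the op order (and so the sync order) intact.
std::vector<Stmt> buildLane1Replay(const std::vector<Stmt>& lane0Body) {
  std::vector<Stmt> replay;
  for (const Stmt& s : lane0Body) {
    if (s.op == "tile.store") continue;  // lane 1 must not write real data
    replay.push_back(Stmt{s.op, 0, 0});   // same op, empty valid shape
  }
  return replay;
}

// Models the dual-AIV wrapper: lane 0 runs the real body, lane 1 the replay.
std::vector<Stmt> selectLaneBody(int subblockIdx,
                                 const std::vector<Stmt>& lane0Body) {
  assert(subblockIdx == 0 || subblockIdx == 1);
  if (subblockIdx == 0) return lane0Body;
  return buildLane1Replay(lane0Body);
}
```

The point of the sketch is that lane 1 keeps every non-store op, so a tile.slice that yields v_row=5 in lane 0 reappears in lane 1 with v_row=0.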
The blockers live in three places in lib/PTO/IR/PTO.cpp:
- SubViewOp::verify line 10557 rejects an explicit valid_row = 0 constant operand:
    if (vRow <= 0)
      return emitOpError("valid_row must be positive when constant");
- SubViewOp::verify line 10606 rejects any dst.v_row that does not equal the inferred expectedVRow (= sizeR by default). PyPTO has no way today to spell "this lane's subview output is empty".
- Six downstream op verifiers reject valid_shape[0] = 0:
  - row/col reduction helpers at lines 1736 / 1830 (used by trowmax, trowsum, tcolmax, tcolsum)
  - TLReluOp::verify at line 6230
  - TRowExpandOp::verify at lines 8396 / 8400
  - verifyTRowExpandReduceLikeOp at line 8854 (used by trowexpandsub, trowexpandadd, etc.)
For PyPTO's dual-AIV no-op replay to be legal end-to-end, valid_row = 0 must be a legitimate "this lane has no useful output" marker that PTOAS guarantees to lower to either no ISA emission or a hardware-side no-op.
Motivation / Use Case
Companion to hw-native-sys/pypto#1031 (a different but related odd-axis frontend issue) and hw-native-sys/pto-isa#143 (ISA-level Rv=0 no-op spec).
Concrete blocker: Qwen3-14B decode_layer.py cannot fuse online_softmax into fa_fused. fa_fused is SplitMode::None; the trim step ctx[0:Q_HEAD_BATCH=5, :] inside the mixed cube+vec spmd region runs fine in lane 0 (real work, dst v_row=5), but the lane-1 no-op replay rewrites the entire subview chain with valid_shape=0 on all tiles, producing:
// lane 1 replay (else-branch of the dual-AIV if)
%ctx = pto.alloc_tile addr = ... valid_row = %c0_index valid_col = %c0_index
: !pto.tile_buf<loc=vec, dtype=f32, rows=16, cols=128, v_row=?, v_col=?, ...>
pto.trowexpanddiv ins(%oi, %li) outs(%ctx : ...)
%ctx_valid = pto.subview %ctx[%c0_index, %c0_index] sizes [5, 128]
: !pto.tile_buf<..., rows=16, cols=128, v_row=?, v_col=?, ...>
-> !pto.tile_buf<..., rows=5, cols=128, v_row=0, v_col=0, ...>
which fails the SubViewOp::verify check at line 10606 because dstValid[0]=0 ≠ expectedVRow=5. (Lane 0's matching subview has v_row=5, v_col=128 and passes the verifier; only the replay-rewritten lane-1 form is rejected.)
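To make the rejection concrete, here is a rough model of the two SubViewOp::verify checks as this issue describes them. It is a sketch under assumptions: `verifySubviewToday` is a hypothetical name, the second diagnostic string is a placeholder, and treating a constant operand as the expected value is assumed rather than taken from PTO.cpp.

```cpp
#include <cassert>
#include <string>

// Returns an empty string when the subview is accepted, else a diagnostic.
// hasConstValidRow / constValidRow model an optional constant valid_row operand.
std::string verifySubviewToday(bool hasConstValidRow, long constValidRow,
                               long sizeR, long dstValidRow) {
  assert(sizeR > 0);
  // Check at line 10557: a constant valid_row operand must be positive.
  if (hasConstValidRow && constValidRow <= 0)
    return "valid_row must be positive when constant";
  // Check at line 10606: with no operand, the expected valid rows default to
  // the subview's row size (assumption: a constant operand overrides it).
  long expectedVRow = hasConstValidRow ? constValidRow : sizeR;
  if (dstValidRow != expectedVRow)
    return "dst v_row does not match expectedVRow";  // placeholder message
  return "";
}
```

Under this model, lane 0 (`sizeR = 5`, `dstValidRow = 5`, no operand) passes, lane 1 (`dstValidRow = 0`, no operand) trips the second check, and adding an explicit constant 0 operand trips the first one instead.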
Empirical confirmation from a recent failed compile of decode_layer.py --max-seq --lm-head skip (fa_fused.pto):
$ grep "pto.subview" build_output/.../ptoas/fa_fused.pto
288: %slice_view = ... -> v_row=5, v_col=128 ← lane 0, gp=0 (then-body)
382: %211 = ... -> v_row=5, v_col=128 ← lane 0, gp=1 (then-body)
487: %224 = ... -> v_row=0, v_col=0 ← lane 1 replay, gp=0 (else-body) — FAILS HERE
579: %226 = ... -> v_row=0, v_col=0 ← lane 1 replay, gp=1 (else-body)
The same pattern blocks other fusions in the same file:
- Fusing rope_kv_cache (5-row data + 11-row zero-pad assembles) into fa_fused
- Fusing qk_norm (5-row reshape) into q_proj / k_proj / v_proj
Proposed Behavior
1. Subview valid_row=0 acceptance: relax SubViewOp::verify (lines 10557 and 10606) so that an explicit valid_row = 0 (or valid_col = 0) constant operand is legal, and the result type's v_row (resp. v_col) being 0 is consistent with it. Suggested rule:
   - if getValidRow() is a constant c, require c >= 0 and c <= sizeR, and require dstValid[0] == min(c, sizeR)
   - the no-op contract (item 3 below) is what makes c = 0 safe
2. pad-physical + odd-valid tile shape: document that a tile may carry rows = even N and valid_row = odd K < N. Today AllocTileOp::verify does not reject this, but the contract is not spelled out. Make it explicit so PyPTO frontends can rely on it.
3. Codegen no-op contract for valid_row=0: for any op whose destination has valid_row = 0 at codegen time (statically or at runtime), PTOAS guarantees one of:
   - emit no ISA instruction for that op (preferred; works on all hardware), OR
   - emit a predicated ISA instruction that the hardware guarantees to no-op (requires hw-native-sys/pto-isa support, see companion issue)
   Specifically required for: TMOV, TCVT, TASSEMBLE (tile.assemble), TSTORE, TPUSH_TO_AIC, TPUSH_TO_AIV, TSUBVIEW (already pure view).
   The 6 ops that currently reject v_row=0 (row_max, row_sum, relu, row_expand_*, ...) should be audited:
   - if hw has no sensible behavior on Rv=0 → keep reject, document the reason, and PyPTO must ensure these ops never see Rv=0 input
   - if hw is fine → relax with the same no-op contract
4. PyPTO-side commitment (tracked separately on the pypto repo; logged here for cross-team visibility):
   - BuildNoSplitLane1ReplayStmts must emit pto.subview with an explicit valid_row = 0 / valid_col = 0 operand whenever it rewrites a subview into the lane-1 no-op replay path. The current code (split_vector_kernel_pass.cpp:780-786 for tile.slice arg substitution, plus the general WithZeroValidShape type rewrite at line 793) sets the result type's v_row to 0 but does not propagate that into a valid_row operand on the resulting pto.subview, leaving the IR in the verifier-rejected state described above.
   - More generally, PyPTO should emit pto.subview with an explicit valid_row / valid_col operand whenever the source tile's valid shape is dynamic or potentially zero. The default inferred-from-sizes path stays for purely static, non-ragged cases.
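The suggested verifier rule from the subview acceptance item could look roughly like the following. This is a standalone sketch, not a patch against PTO.cpp: `subviewValidRowOk` is a hypothetical helper, and the constant operand is modelled as a flag plus a value rather than through the real getValidRow() accessor.

```cpp
#include <algorithm>
#include <cassert>

// Returns true when the subview's row-valid information is consistent under
// the proposed relaxed rule.
bool subviewValidRowOk(bool hasConstValidRow, long c, long sizeR,
                       long dstValidRow) {
  assert(sizeR > 0);
  if (!hasConstValidRow) {
    // Unchanged default path: valid rows are inferred from the subview sizes.
    return dstValidRow == sizeR;
  }
  // Proposed: 0 becomes legal; negative or oversized constants stay rejected.
  if (c < 0 || c > sizeR) return false;
  // Result type must agree with the operand.
  return dstValidRow == std::min(c, sizeR);
}
```

With this rule the lane-1 replay form (constant 0 operand, result v_row=0, sizes [5, 128]) is accepted, while a result type with v_row=0 and no explicit operand is still rejected, which is why the PyPTO-side commitment to emit the operand matters.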
Acceptance Criteria
- A minimal MLIR fixture with subview(..., valid_row=c0_index, sizes=[5,128]) + tcvt + tassemble compiles successfully through ptoas.
- The generated ISA for the valid_row=0 codepath does not contain a store to GM (verified by inspecting emitted assembly or running a hardware-side no-op probe).
- Existing PTOAS test suite continues to pass.
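The second criterion can be pictured with a toy model of the first option in the no-op contract (emit nothing when the destination valid_row is 0). Everything here is illustrative: `TileOp`, `lowerOps`, and `containsStore` are hypothetical, the emitted "ISA" is just a list of mnemonics, and only the statically known valid_row case is covered.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical op record at codegen time; mnemonics come from the op list
// named in the no-op contract.
struct TileOp {
  std::string mnemonic;  // e.g. "TMOV", "TCVT", "TSTORE"
  long dstValidRow;      // destination valid_row known at codegen time
};

// First option of the contract: skip emission when dst valid_row is 0.
std::vector<std::string> lowerOps(const std::vector<TileOp>& ops) {
  std::vector<std::string> isa;
  for (const TileOp& op : ops) {
    assert(op.dstValidRow >= 0);
    if (op.dstValidRow == 0) continue;  // no instruction for an empty tile
    isa.push_back(op.mnemonic);
  }
  return isa;
}

// Acceptance-style check: does the emitted stream contain a store?
bool containsStore(const std::vector<std::string>& isa) {
  for (const std::string& m : isa) {
    if (m == "TSTORE") return true;
  }
  return false;
}
```

A real check would inspect the assembly that ptoas emits for the else-branch of the dual-AIV wrapper; the sketch only shows the property being asserted.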
Related
- SplitVectorKernel even-dim restriction (orthogonal but related)
Revision Note
The original version of this issue (Nov 2025) described the failing IR pattern as coming from "UP_DOWN split lowering". That was wrong — fa_fused is SplitMode::None, no row halving is applied, and the v_row=0 IR comes from the dual-AIV no-op replay path described above. The Summary and Motivation sections have been updated accordingly. The Proposed Behavior items are unchanged; only the framing of the upstream cause was corrected.