[Feature] Support odd-axis tile shapes via physical-even + odd-valid split, and define a codegen no-op contract for valid_row=0 #708

@lwDavid

Summary

PTOAS today rejects the IR pattern PyPTO produces when a mixed cube+vec kernel runs with SplitMode::None on the a2a3 backend and contains an odd-valid trim (e.g. slicing a 16-row Q_HEAD_PAD tile down to 5 rows for Q_HEAD_BATCH output inside a single mixed root).

The IR pattern in question is not an UP_DOWN row-split — fa_fused declares no optimizations=[pl.split(...)] and SplitVectorKernel does not halve any tiles. Instead, the a2a3 backend's RequiresNoSplitDualAivDispatch == true causes PyPTO's ExpandMixedKernel to emit a dual-AIV no-op replay wrapper:

if (subblock_idx == 0) { lane0_body }   // real work
else                   { lane1_body }   // no-op replay

The lane1 body is built by BuildNoSplitLane1ReplayStmts (split_vector_kernel_pass.cpp, around lines 906-996), which calls WithZeroValidShape on every tile type and deletes every tile.store. The purpose is to keep cube/vec pipe and sync state aligned across both AIV lanes without emitting real writes from lane 1.
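To make the replay rewrite concrete, here is a minimal Python model of what the paragraph above describes. It is a sketch only, not the PyPTO implementation: `TileType`, `Stmt`, `with_zero_valid_shape`, and `build_lane1_replay` are hypothetical names chosen for illustration, and the real pass operates on PyPTO IR in C++.

```python
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass(frozen=True)
class TileType:
    rows: int                # physical rows
    cols: int                # physical cols
    v_row: Optional[int]     # valid rows; None models a dynamic "?"
    v_col: Optional[int]     # valid cols; None models a dynamic "?"


@dataclass(frozen=True)
class Stmt:
    op: str                            # e.g. "tile.slice", "tile.store"
    result: Optional[TileType] = None  # result tile type, if any


def with_zero_valid_shape(t: TileType) -> TileType:
    """Keep the physical shape, force the valid shape to 0 x 0."""
    return replace(t, v_row=0, v_col=0)


def build_lane1_replay(lane0_body: List[Stmt]) -> List[Stmt]:
    """Model of the lane-1 replay: drop stores, zero every valid shape."""
    replay = []
    for s in lane0_body:
        if s.op == "tile.store":
            continue  # lane 1 must not emit real writes
        t = with_zero_valid_shape(s.result) if s.result is not None else None
        replay.append(replace(s, result=t))
    return replay


# Lane 0 trims a tile to 5 valid rows and stores it.
lane0 = [Stmt("tile.slice", TileType(5, 128, 5, 128)), Stmt("tile.store")]
lane1 = build_lane1_replay(lane0)
```

In this model the lane-1 body keeps the `tile.slice` (so the statement sequence still lines up with lane 0) but its result type now carries `v_row=0, v_col=0`, which is the form the verifier currently rejects.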

The blockers live in three places in lib/PTO/IR/PTO.cpp:

  1. SubViewOp::verify line 10557 rejects an explicit constant valid_row = 0 operand:
    if (vRow <= 0)
      return emitOpError("valid_row must be positive when constant");
  2. SubViewOp::verify line 10606 rejects any dst.v_row that does not equal the inferred expectedVRow (= sizeR by default). PyPTO has no way today to spell "this lane's subview output is empty".
  3. Six downstream op verifiers reject valid_shape[0] = 0:
    • row/col reduction helpers at lines 1736 / 1830 (used by trowmax, trowsum, tcolmax, tcolsum)
    • TLReluOp::verify at line 6230
    • TRowExpandOp::verify at lines 8396 / 8400
    • verifyTRowExpandReduceLikeOp at line 8854 (used by trowexpandsub, trowexpandadd, etc.)

For PyPTO's dual-AIV no-op replay to be legal end-to-end, valid_row = 0 must be a legitimate "this lane has no useful output" marker that PTOAS guarantees to lower to either no ISA emission or a hardware-side no-op.

Motivation / Use Case

Companion to hw-native-sys/pypto#1031 (a different but related odd-axis frontend issue) and hw-native-sys/pto-isa#143 (ISA-level Rv=0 no-op spec).

Concrete blocker: Qwen3-14B decode_layer.py cannot fuse online_softmax into fa_fused. fa_fused is SplitMode::None; the trim step ctx[0:Q_HEAD_BATCH=5, :] inside the mixed cube+vec spmd region runs fine in lane 0 (real work, dst v_row=5), but the lane-1 no-op replay rewrites the entire subview chain with valid_shape=0 on all tiles, producing:

// lane 1 replay (else-branch of the dual-AIV if)
%ctx = pto.alloc_tile addr = ... valid_row = %c0_index valid_col = %c0_index
      : !pto.tile_buf<loc=vec, dtype=f32, rows=16, cols=128, v_row=?, v_col=?, ...>
pto.trowexpanddiv ins(%oi, %li) outs(%ctx : ...)
%ctx_valid = pto.subview %ctx[%c0_index, %c0_index] sizes [5, 128]
           : !pto.tile_buf<..., rows=16, cols=128, v_row=?, v_col=?, ...>
          -> !pto.tile_buf<..., rows=5, cols=128, v_row=0, v_col=0, ...>

which fails the SubViewOp::verify check at line 10606 because dstValid[0]=0 ≠ expectedVRow=5. (Lane 0's matching subview has v_row=5, v_col=128 and passes the verifier; only the replay-rewritten lane-1 form is rejected.)

Empirical confirmation from a recent failed compile of decode_layer.py --max-seq --lm-head skip (fa_fused.pto):

$ grep "pto.subview" build_output/.../ptoas/fa_fused.pto
288:    %slice_view = ... -> v_row=5, v_col=128   ← lane 0, gp=0 (then-body)
382:    %211       = ... -> v_row=5, v_col=128   ← lane 0, gp=1 (then-body)
487:    %224       = ... -> v_row=0, v_col=0     ← lane 1 replay, gp=0 (else-body) — FAILS HERE
579:    %226       = ... -> v_row=0, v_col=0     ← lane 1 replay, gp=1 (else-body)

The same pattern blocks other fusions in the same file:

  • Fusing rope_kv_cache (5-row data + 11-row zero-pad assembles) into fa_fused
  • Fusing qk_norm (5-row reshape) into q_proj / k_proj / v_proj

Proposed Behavior

  1. Subview valid_row=0 acceptance: relax SubViewOp::verify (lines 10557 and 10606) so that an explicit valid_row = 0 (or valid_col = 0) constant operand is legal, and the result type's v_row (resp. v_col) being 0 is consistent with it. Suggested rule:

    • if getValidRow() is a constant c, require c >= 0 and c <= sizeR, and require dstValid[0] == min(c, sizeR)
    • the no-op contract (item 3 below) is what makes c = 0 safe
  2. pad-physical + odd-valid tile shape: document that a tile may carry rows = even N and valid_row = odd K < N. Today AllocTileOp::verify does not reject this, but the contract is not spelled out. Make it explicit so PyPTO frontends can rely on it.

  3. Codegen no-op contract for valid_row=0: for any op whose destination has valid_row = 0 at codegen time (statically or at runtime), PTOAS guarantees one of:

    • emit no ISA instruction for that op (preferred — works on all hardware), OR
    • emit a predicated ISA instruction that the hardware guarantees to no-op (requires hw-native-sys/pto-isa support, see companion issue)

    Specifically required for: TMOV, TCVT, TASSEMBLE (tile.assemble), TSTORE, TPUSH_TO_AIC, TPUSH_TO_AIV, TSUBVIEW (already pure view).

    The 6 ops that currently reject v_row=0 (row_max, row_sum, relu, row_expand_*, ...) should be audited:

    • if hw has no sensible behavior on Rv=0 → keep reject, document the reason, and PyPTO must ensure these ops never see Rv=0 input
    • if hw is fine → relax with the same no-op contract
  4. PyPTO-side commitment (tracked separately on the pypto repo; logged here for cross-team visibility):

    • BuildNoSplitLane1ReplayStmts must emit pto.subview with an explicit valid_row = 0 / valid_col = 0 operand whenever it rewrites a subview into the lane-1 no-op replay path. The current code (split_vector_kernel_pass.cpp:780-786 for tile.slice arg substitution, plus the general WithZeroValidShape type rewrite at line 793) sets the result type's v_row to 0 but does not propagate that into a valid_row operand on the resulting pto.subview, leaving the IR in the verifier-rejected state described above.
    • More generally, PyPTO should emit pto.subview with an explicit valid_row / valid_col operand whenever the source tile's valid shape is dynamic or potentially zero. The default inferred-from-sizes path stays for purely static, non-ragged cases.
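The suggested rule in item 1 can be sketched as a small standalone check. This is a Python model under stated assumptions, not the code in lib/PTO/IR/PTO.cpp: the function name and error strings are hypothetical, only the row axis is modeled, and the "no operand" branch assumes the existing default of expectedVRow = sizeR.

```python
from typing import Optional


def verify_subview_valid_row(size_r: int, dst_v_row: int,
                             valid_row_const: Optional[int] = None) -> Optional[str]:
    """Return None if the subview is accepted, else an error message."""
    if valid_row_const is None:
        # No explicit operand: keep the inferred-from-sizes default.
        expected = size_r
    else:
        c = valid_row_const
        # Proposed relaxation: 0 is legal, negatives and c > sizeR are not.
        if c < 0 or c > size_r:
            return f"valid_row must be in [0, {size_r}] when constant"
        expected = min(c, size_r)
    if dst_v_row != expected:
        return f"result v_row {dst_v_row} does not match expected {expected}"
    return None


# Lane 0: sizes [5, 128], result v_row = 5, no explicit operand.
lane0_err = verify_subview_valid_row(size_r=5, dst_v_row=5)
# Lane 1 with an explicit valid_row = 0 operand (item 4): accepted.
lane1_err = verify_subview_valid_row(size_r=5, dst_v_row=0, valid_row_const=0)
# Lane 1 as emitted today, with no operand: still rejected.
lane1_today_err = verify_subview_valid_row(size_r=5, dst_v_row=0)
```

The last case shows why items 1 and 4 go together: relaxing the verifier alone does not help unless PyPTO also emits the explicit valid_row = 0 operand on the lane-1 subview.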

Acceptance Criteria

  • A minimal MLIR fixture with subview(..., valid_row=c0_index, sizes=[5,128]) + tcvt + tassemble compiles successfully through ptoas.
  • The generated ISA for the valid_row=0 codepath does not contain a store to GM (verified by inspecting emitted assembly or running a hardware-side no-op probe).
  • Existing PTOAS test suite continues to pass.
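As an illustration of the second criterion, the following Python sketch models the preferred "emit nothing" form of the no-op contract and the corresponding check. Everything here is hypothetical: `Op`, `emit`, and the textual instruction format are invented for the example, the op names are taken from the list in Proposed Behavior, and only the statically known valid_row case is modeled (a runtime valid_row would need a guard or predication instead).

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Op:
    name: str           # op name from the issue, e.g. "TCVT"
    dst_valid_row: int  # statically known valid_row of the destination


def emit(ops: List[Op]) -> List[str]:
    """Hypothetical emitter: skip every op whose destination is empty."""
    out = []
    for op in ops:
        if op.dst_valid_row == 0:
            continue  # no-op contract: emit no instruction at all
        out.append(f"{op.name} valid_row={op.dst_valid_row}")
    return out


lane0_isa = emit([Op("TCVT", 5), Op("TASSEMBLE", 5), Op("TSTORE", 5)])
lane1_isa = emit([Op("TCVT", 0), Op("TASSEMBLE", 0), Op("TSTORE", 0)])

# The acceptance check: the valid_row=0 path contains no store.
lane1_has_store = any(line.startswith("TSTORE") for line in lane1_isa)
```

A real test would inspect the assembly produced by ptoas rather than a list of strings, but the property being checked is the same: no store reaches GM from the lane-1 replay path.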

Revision Note

The original version of this issue (Nov 2025) described the failing IR pattern as coming from "UP_DOWN split lowering". That was wrong — fa_fused is SplitMode::None, no row halving is applied, and the v_row=0 IR comes from the dual-AIV no-op replay path described above. The Summary and Motivation sections have been updated accordingly. The Proposed Behavior items are unchanged; only the framing of the upstream cause was corrected.

Metadata

Labels: enhancement (New feature or request)
Status: Done
Milestone: No milestone