[Feature] Support odd-axis tile shapes via physical-even + odd-valid split, and define a codegen no-op contract for valid_row=0

### Summary

PTOAS today rejects the IR pattern PyPTO produces when a mixed cube+vec kernel runs with `SplitMode::None` on the a2a3 backend and contains an odd-valid trim (e.g. slicing a 16-row `Q_HEAD_PAD` tile down to 5 rows for `Q_HEAD_BATCH` output inside a single mixed root).

The IR pattern in question is **not** an UP_DOWN row-split — fa_fused declares no `optimizations=[pl.split(...)]` and `SplitVectorKernel` does not halve any tiles. Instead, the a2a3 backend's `RequiresNoSplitDualAivDispatch == true` causes PyPTO's `ExpandMixedKernel` to emit a **dual-AIV no-op replay** wrapper:

```
if (subblock_idx == 0) { lane0_body }   // real work
else                   { lane1_body }   // no-op replay
```

The lane-1 body is built by [`BuildNoSplitLane1ReplayStmts`](https://github.com/hw-native-sys/pypto/blob/main/src/ir/transforms/split_vector_kernel_pass.cpp) (`split_vector_kernel_pass.cpp`, around lines 906-996), which calls `WithZeroValidShape` on every tile type and deletes every `tile.store`. The purpose is to keep the cube/vec pipe and sync state aligned across both AIV lanes without emitting real writes from lane 1.
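As a rough illustration of that rewrite, here is a minimal C++ sketch. `ToyTile`, `ToyStmt`, `withZeroValidShape`, and `buildLane1Replay` are hypothetical stand-ins invented for this example; they are not the real PyPTO classes or functions and only model the two effects described above (zeroed valid shape, dropped stores).

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy stand-ins for PyPTO IR nodes (hypothetical, for illustration only).
struct ToyTile {
  int rows, cols;          // physical shape, left untouched by the replay
  int validRow, validCol;  // valid shape, zeroed by the replay
};

struct ToyStmt {
  std::string op;  // e.g. "tile.load", "tile.slice", "tile.store"
  ToyTile tile;    // result (or destination) tile of the statement
};

// Models the idea of WithZeroValidShape: keep rows/cols, zero the valid shape.
ToyTile withZeroValidShape(ToyTile t) {
  t.validRow = 0;
  t.validCol = 0;
  return t;
}

// Models the idea of the lane-1 replay: every statement is kept so that
// pipe/sync state stays aligned, except stores, which are dropped entirely.
std::vector<ToyStmt> buildLane1Replay(const std::vector<ToyStmt>& lane0) {
  std::vector<ToyStmt> lane1;
  for (const ToyStmt& s : lane0) {
    if (s.op == "tile.store") continue;  // no real writes from lane 1
    lane1.push_back({s.op, withZeroValidShape(s.tile)});
  }
  return lane1;
}
```

In this model a `tile.slice` whose lane-0 result is `rows=5, v_row=5` comes out of the replay as `rows=5, v_row=0`, which is the shape of IR that the verifier checks below reject.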

The blockers live in three places in `lib/PTO/IR/PTO.cpp`:

1. `SubViewOp::verify` (line 10557) rejects an explicit `valid_row = 0` constant operand:
   ```cpp
   if (vRow <= 0)
     return emitOpError("valid_row must be positive when constant");
   ```
2. `SubViewOp::verify` line 10606 rejects any `dst.v_row` that does not equal the inferred `expectedVRow` (= `sizeR` by default). PyPTO has no way today to spell "this lane's subview output is empty".
3. Six downstream op verifiers reject `valid_shape[0] = 0`:
   - row/col reduction helpers at lines 1736 / 1830 (used by `trowmax`, `trowsum`, `tcolmax`, `tcolsum`)
   - `TLReluOp::verify` at line 6230
   - `TRowExpandOp::verify` at lines 8396 / 8400
   - `verifyTRowExpandReduceLikeOp` at line 8854 (used by `trowexpandsub`, `trowexpandadd`, etc.)

For PyPTO's dual-AIV no-op replay to be legal end-to-end, `valid_row = 0` must be a legitimate "this lane has no useful output" marker that PTOAS guarantees to lower to either no ISA emission or a hardware-side no-op.

### Motivation / Use Case

Companion to hw-native-sys/pypto#1031 (a different but related odd-axis frontend issue) and hw-native-sys/pto-isa#143 (ISA-level Rv=0 no-op spec).

Concrete blocker: Qwen3-14B `decode_layer.py` cannot fuse `online_softmax` into `fa_fused`. fa_fused is `SplitMode::None`; the trim step `ctx[0:Q_HEAD_BATCH=5, :]` inside the mixed cube+vec spmd region runs fine in lane 0 (real work, dst v_row=5), but the lane-1 no-op replay rewrites the entire subview chain with `valid_shape=0` on all tiles, producing:

```mlir
// lane 1 replay (else-branch of the dual-AIV if)
%ctx = pto.alloc_tile addr = ... valid_row = %c0_index valid_col = %c0_index
      : !pto.tile_buf<loc=vec, dtype=f32, rows=16, cols=128, v_row=?, v_col=?, ...>
pto.trowexpanddiv ins(%oi, %li) outs(%ctx : ...)
%ctx_valid = pto.subview %ctx[%c0_index, %c0_index] sizes [5, 128]
           : !pto.tile_buf<..., rows=16, cols=128, v_row=?, v_col=?, ...>
          -> !pto.tile_buf<..., rows=5, cols=128, v_row=0, v_col=0, ...>
```

which fails the `SubViewOp::verify` check at line 10606 because `dstValid[0]=0 ≠ expectedVRow=5`. (Lane 0's matching subview has `v_row=5, v_col=128` and passes the verifier; only the replay-rewritten lane-1 form is rejected.)
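The rejection can be reproduced with a small standalone model of the two `SubViewOp::verify` checks. This is a sketch under assumptions, not the PTOAS source: `Verdict` and `verifySubviewToday` are hypothetical names, and the assumption that a constant `valid_row` operand replaces `sizeR` as the expected value is mine.

```cpp
#include <cassert>

// Outcome of the toy verifier (hypothetical enum, for illustration only).
enum class Verdict { Ok, NonPositiveConstant, ValidRowMismatch };

// Simplified model of the two checks described for SubViewOp::verify.
// hasConstValidRow: whether the subview carries a constant valid_row operand.
// constValidRow:    that constant (ignored when hasConstValidRow is false).
// sizeR:            the row extent from `sizes [...]`.
// dstValidRow:      v_row of the result tile type.
Verdict verifySubviewToday(bool hasConstValidRow, long constValidRow,
                           long sizeR, long dstValidRow) {
  // Check at line 10557: a constant valid_row must be positive.
  if (hasConstValidRow && constValidRow <= 0)
    return Verdict::NonPositiveConstant;
  // Check at line 10606: the result v_row must equal the expected value,
  // which defaults to sizeR. (Assumption: a constant operand overrides it.)
  long expectedVRow = hasConstValidRow ? constValidRow : sizeR;
  if (dstValidRow != expectedVRow)
    return Verdict::ValidRowMismatch;
  return Verdict::Ok;
}
```

With `sizeR = 5`, the lane-0 form (`dstValidRow = 5`, no operand) is accepted, the lane-1 replay form (`dstValidRow = 0`, no operand) hits the mismatch, and adding an explicit `valid_row = 0` operand trips the first check instead. Both paths are closed today.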

Empirical confirmation from a recent failed compile of `decode_layer.py --max-seq --lm-head skip` (`fa_fused.pto`):
```
$ grep "pto.subview" build_output/.../ptoas/fa_fused.pto
288:    %slice_view = ... -> v_row=5, v_col=128   ← lane 0, gp=0 (then-body)
382:    %211       = ... -> v_row=5, v_col=128   ← lane 0, gp=1 (then-body)
487:    %224       = ... -> v_row=0, v_col=0     ← lane 1 replay, gp=0 (else-body) — FAILS HERE
579:    %226       = ... -> v_row=0, v_col=0     ← lane 1 replay, gp=1 (else-body)
```

The same pattern blocks other fusions in the same file:
- Fusing `rope_kv_cache` (5-row data + 11-row zero-pad assembles) into `fa_fused`
- Fusing `qk_norm` (5-row reshape) into `q_proj` / `k_proj` / `v_proj`

### Proposed Behavior

1. **Subview valid_row=0 acceptance**: relax `SubViewOp::verify` (lines 10557 and 10606) so that an explicit `valid_row = 0` (or `valid_col = 0`) constant operand is legal, and the result type's `v_row` (resp. `v_col`) being 0 is consistent with it. Suggested rule:
   - if `getValidRow()` is a constant `c`, require `0 <= c <= sizeR` and `dstValid[0] == c` (equivalent to `min(c, sizeR)` once `c <= sizeR` holds)
   - the no-op contract (item 3 below) is what makes `c = 0` safe

2. **`pad-physical + odd-valid` tile shape**: document that a tile may carry `rows = even N` and `valid_row = odd K < N`. Today `AllocTileOp::verify` does not reject this, but the contract is not spelled out. Make it explicit so PyPTO frontends can rely on it.

3. **Codegen no-op contract for valid_row=0**: for any op whose destination has `valid_row = 0` at codegen time (statically or at runtime), PTOAS guarantees one of:
   - emit no ISA instruction for that op (preferred — works on all hardware), OR
   - emit a predicated ISA instruction that the hardware guarantees to no-op (requires hw-native-sys/pto-isa support, see companion issue)

   Specifically required for: `TMOV`, `TCVT`, `TASSEMBLE` (`tile.assemble`), `TSTORE`, `TPUSH_TO_AIC`, `TPUSH_TO_AIV`, `TSUBVIEW` (already a pure view).

   The six verifier checks that currently reject `v_row=0` (covering `row_max`, `row_sum`, `relu`, `row_expand_*`, ...) should be audited:
   - if hw has no sensible behavior on Rv=0 → keep reject, document the reason, and PyPTO must ensure these ops never see Rv=0 input
   - if hw is fine → relax with the same no-op contract

4. **PyPTO-side commitment** (tracked separately on the pypto repo; logged here for cross-team visibility):
   - **`BuildNoSplitLane1ReplayStmts` must emit `pto.subview` with an explicit `valid_row = 0` / `valid_col = 0` operand** whenever it rewrites a subview into the lane-1 no-op replay path. The current code (`split_vector_kernel_pass.cpp:780-786` for `tile.slice` arg substitution, plus the general `WithZeroValidShape` type rewrite at line 793) sets the result type's `v_row` to 0 but does not propagate that into a `valid_row` operand on the resulting `pto.subview`, leaving the IR in the verifier-rejected state described above.
   - More generally, PyPTO should emit `pto.subview` with an explicit `valid_row` / `valid_col` operand whenever the source tile's valid shape is dynamic or potentially zero. The default inferred-from-sizes path stays for purely static, non-ragged cases.
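The rule suggested in item 1 can be sketched as a standalone predicate. This is only an illustration of the proposed semantics: `verifySubviewRelaxed` and its parameters are hypothetical names, and the real change would live inside `SubViewOp::verify`.

```cpp
#include <algorithm>
#include <cassert>

// Toy model of the relaxed subview rule from item 1 (hypothetical helper).
// hasConstValidRow: whether an explicit constant valid_row operand is present.
// c:                that constant (ignored when hasConstValidRow is false).
// sizeR:            the row extent from `sizes [...]`.
// dstValidRow:      v_row of the result tile type.
bool verifySubviewRelaxed(bool hasConstValidRow, long c, long sizeR,
                          long dstValidRow) {
  if (!hasConstValidRow) {
    // Default path is unchanged: v_row is inferred from the sizes.
    return dstValidRow == sizeR;
  }
  // Zero is now legal; negative or oversized constants are still rejected.
  if (c < 0 || c > sizeR) return false;
  // min(c, sizeR) equals c here because c <= sizeR was just checked.
  return dstValidRow == std::min(c, sizeR);
}
```

Under this model the lane-1 replay subview becomes legal only when it carries the explicit `valid_row = 0` operand, which is why item 4 asks PyPTO to emit that operand rather than relying on the result type alone. An odd valid extent inside a larger physical extent (for example `c = 5`, `sizeR = 16`) passes the same predicate.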

### Acceptance Criteria

- A minimal MLIR fixture with `subview(..., valid_row=c0_index, sizes=[5,128])` + `tcvt` + `tassemble` compiles successfully through ptoas.
- The generated ISA for the `valid_row=0` codepath does not contain a store to GM (verified by inspecting emitted assembly or running a hardware-side no-op probe).
- The existing PTOAS test suite continues to pass.
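The "no store to GM" criterion could be checked mechanically along these lines. The sketch below models only the preferred option of the no-op contract (emit nothing when the destination `valid_row` is statically 0); `ToyOp`, `emitIsa`, and `containsStore` are hypothetical names, and the instruction strings are placeholders rather than real PTOAS output.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy op: an ISA mnemonic plus the statically known valid_row of its
// destination tile (hypothetical type, for illustration only).
struct ToyOp {
  std::string name;
  int dstValidRow;
};

// Models the "emit no ISA instruction" option: ops whose destination has
// valid_row == 0 are skipped entirely. A runtime-zero valid_row would need
// a guard or a predicated instruction instead, which is not modelled here.
std::vector<std::string> emitIsa(const std::vector<ToyOp>& ops) {
  std::vector<std::string> isa;
  for (const ToyOp& op : ops) {
    if (op.dstValidRow == 0) continue;  // no-op contract: nothing emitted
    isa.push_back(op.name);
  }
  return isa;
}

// Acceptance-style check: does the emitted stream contain a store?
bool containsStore(const std::vector<std::string>& isa) {
  for (const std::string& inst : isa) {
    if (inst == "TSTORE") return true;
  }
  return false;
}
```

A fixture test would run the lane-1 op list through the real codegen and assert the equivalent of `!containsStore(...)` on the emitted assembly, while the lane-0 list should still produce its store.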

### Related

- hw-native-sys/pypto#1031 (PyPTO-side: `SplitVectorKernel` even-dim restriction — orthogonal but related)
- hw-native-sys/pto-isa#143 (ISA-side: normative Rv=0 no-op semantics)

### Revision Note

The original version of this issue (Nov 2025) described the failing IR pattern as coming from "UP_DOWN split lowering". That was wrong — fa_fused is `SplitMode::None`, no row halving is applied, and the v_row=0 IR comes from the dual-AIV no-op replay path described above. The Summary and Motivation sections have been updated accordingly. The Proposed Behavior items are unchanged; only the framing of the upstream cause was corrected.
