oisee
diff --git a/.gitignore b/.gitignore
Lines changed: 7 additions & 0 deletions
diff --git a/BRUTEFORCE_ROADMAP.md b/BRUTEFORCE_ROADMAP.md
Lines changed: 2 additions & 2 deletions
diff --git a/CLAUDE.md b/CLAUDE.md
Lines changed: 9 additions & 1 deletion
diff --git a/README.md b/README.md
Lines changed: 75 additions & 2 deletions
diff --git a/TODO.md b/TODO.md
Lines changed: 18 additions & 7 deletions
@@ -51,3 +51,10 @@ mulopt
 partopt
 regalloc-enum
 dedelulu.jsonl
+peephole_len2_complete.json
+
+data/enriched_5v.ofb
+data/enriched_6v_dense.ofb
+data/ix_expanded_5v.ofb
+data/merged_5v.ofb
+data/merged_ix_5v.ofb
@@ -11,7 +11,7 @@ A GPU with 5000 cores can test billions of sequences per second. A human can't.
 ### Peephole Rules (602,008 entries)
 - **What:** For every pair of Z80 instructions, find a shorter replacement
 - **Search space:** 4,215² = 17.8M pairs (length-2)
-- **Results:** 602K proven optimizations in `results-len2.json`
+- **Results:** 602K proven optimizations in `data/peephole_len2_complete.json`
 - **Example:** `SLA A / RR A` → `OR A` (saves 3 bytes)
 
 ### Register Allocation Table (61 entries)
@@ -201,7 +201,7 @@ Each thread tests one sequence against the QuickCheck vectors. Survivors get ful
 
 ```
 z80-optimizer/
-├── results-len2.json         # 602K peephole rules (done)
+├── data/peephole_len2_complete.json         # 602K peephole rules (done)
 ├── regalloc_table.json       # 61 register assignments (done)
 ├── mul_table_tier1.json      # constant multiply (in progress)
 ├── mul_table_tier2.json      # + ADC/SBC
 
@@ -43,6 +43,7 @@ nvcc -O3 -o cuda/z80_divmod_fast cuda/z80_divmod_fast.cu    # division/modulo (1
 - `cuda/z80_mulopt_fast.cu` — Constant multiply search (14-op reduced pool, 38x faster)
 - `cuda/z80_divmod_fast.cu` — Division/modulo search (14-op, 5T limit)
 - `cuda/z80_mulopt16.cu` — 16-bit multiply (u8×K=u16, result in HL)
+- `cuda/z80_mulopt16c.cu` — HL×K→HL (16-bit × constant, 12-op pool, EX DE,HL)
 - `cuda/z80_common.h` — Shared Z80 executor, flag tables, test vectors
 
 ### Data
@@ -53,6 +54,7 @@ nvcc -O3 -o cuda/z80_divmod_fast cuda/z80_divmod_fast.cu    # division/modulo (1
 - `data/z80_register_graph.json` — Complete 11-register cost model (moves, ALU, swaps)
 - `data/mulopt8_clobber.json` — 254 mul8 sequences (A×K→A) with clobber masks
 - `data/mulopt16_complete.json` — 254 mul16 sequences (A×K→HL)
+- `data/mulopt16c_complete.json` — 86/254 mul16c sequences (HL×K→HL, all K=2..31 plus selected larger constants)
 - `data/div8_optimal.json` — 254 div8 sequences (A÷K→A) via multiply-and-shift
 - `data/mod8_optimal.json` — 254 mod8 sequences (A%K→A)
 - `data/divmod8_optimal.json` — 254 divmod8 sequences
@@ -64,6 +66,7 @@ nvcc -O3 -o cuda/z80_divmod_fast cuda/z80_divmod_fast.cu    # division/modulo (1
 - `data/bcd_idioms.json` — BCD arithmetic (GPU-proven with H-flag)
 
 ### Documentation
+- `docs/z80_opref.md` — **Complete Z80 operation reference**: all instructions, T-states, flags, encoding, spill tier hierarchy, boolean repr
 - `docs/glossary.md` — Complete glossary of all terms and abbreviations
 - `docs/paper_seed_superopt.md` — Paper/book seed: 8 sections covering all research findings
 - `docs/research_statement.md` — Paper-oriented framing with phase diagram data
@@ -101,10 +104,13 @@ Register count: 7 main (A,B,C,D,E,H,L) + 4 IX/IY halves = **11 registers** for a
 - 37M len-3→len-1 rules (partial, ~0.05% coverage)
 
 ### Constant Multiplication
-- 254/254 constants solved (complete!)
+- 254/254 constants solved (complete!) — A×K→A (mul8) and A×K→HL (mul16)
 - Key finding: 21-instruction universal pool (2.7% of ISA generates ALL optimal arithmetic)
 - NEG trick: ×255 = NEG (1 instruction, 8T)
 - All 254 mul8 preserve A, all DE-safe
+- **mul16c (HL×K→HL)**: 86/254 solved. All K=2..31 complete. Avg 79.3T, 8.4 ops.
+  - Structural limit: floor(log2(K))+hamming(K)≤9. K=47,63,127,255 etc. need SBC HL,rr.
+  - Go table: `pkg/mulopt/Mul16cTable`, `Emit16c(k)`
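The 86/254 figure for mul16c can be cross-checked against the structural limit quoted in this hunk. The sketch below simply counts the constants K in 2..255 for which floor(log2(K)) + hamming(K) ≤ 9. It is an independent sanity check, not code from the repository, and the function names `withinLimit` and `countWithinLimit` are hypothetical.

```go
package main

import (
	"fmt"
	"math/bits"
)

// withinLimit reports whether floor(log2(k)) + hamming(k) <= limit.
// (Hypothetical helper, not part of pkg/mulopt.)
func withinLimit(k uint, limit int) bool {
	floorLog2 := bits.Len(k) - 1 // index of the highest set bit
	return floorLog2+bits.OnesCount(k) <= limit
}

// countWithinLimit counts the constants K in [2, 255] that satisfy the limit.
func countWithinLimit(limit int) int {
	n := 0
	for k := uint(2); k <= 255; k++ {
		if withinLimit(k, limit) {
			n++
		}
	}
	return n
}

func main() {
	fmt.Println(countWithinLimit(9)) // 86
	for _, k := range []uint{31, 47, 63, 127, 255} {
		fmt.Println(k, withinLimit(k, 9)) // only 31 is within the limit
	}
}
```

The count matches the 86 solved constants, and every K in 2..31 passes, which is consistent with the "All K=2..31 complete" claim.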
 
 ### Division/Modulo — COMPLETE (254/254)
 - **div8 v3**: 6 methods, avg **79T** (−49% from v1). All exhaustively verified.
@@ -125,6 +131,8 @@ Register count: 7 main (A,B,C,D,E,H,L) + 4 IX/IY halves = **11 registers** for a
 - ≤4v: 156,506 shapes (78.9% feasible), 40 seconds
 - ≤5v: 17,366,874 shapes (67.7% feasible), 20 minutes
 - 6v dense (tw≥4): 66,118,738 shapes (38.9% feasible), ~6 hours
+- **IX-expanded 5v**: 60.9M entries (79.2% feasible) — `data/merged_ix_5v.bin`
+- **OFB sidecars**: `data/enriched_{4v,5v,6v_dense}.ofb` — 32-bit op-feasibility bitmask per entry (15 flags)
 - Enrichment: 43% lack A (hidden ALU infeasibility), 21% lack HL
 - Smart CALL save: 17T avg (vs 34T naive) = 50% reduction
 - Feasibility cliff: 95.9% (2v) → 0.9% (6v) — phase transition
 
@@ -2,7 +2,15 @@
 
 A GPU-accelerated superoptimizer for the Zilog Z80 processor. The compiler that **never guesses** — every optimization is provably optimal.
 
-## What's New (Birthday Marathon — March 26–29, 2026)
+## What's New (Birthday Marathon — March 26 – April 1, 2026)
+
+### Day 7 Highlights (April 1, 2026)
+
+- **`pkg/regalloc/ofb.go`** — public OFB API: `ComputeOFB()`, `LoadOFB()`, 15 flag constants, `OFBNames()`. No more local duplicates in consumer tools.
+- **OFB sidecars for ALL table files** — `enrich-ofb` now auto-detects ENRT and Z80T v2 formats. Sidecars generated for `merged_ix_5v.bin`, `ix_expanded_5v.bin`, etc. (~233MB each).
+- **`cmd/gen6v-ix-feed`** — fast FuncDesc JSON generator for the 6v IX-expanded GPU run. It pre-computes the 562 valid treewidth≥4 masks (out of 32,768 possible 6v graphs) in <0.5s, then iterates over the masks in the outer loop. The feeder reaches 200K shapes/sec, versus 8 shapes/sec when bottlenecked on the `regalloc-enum` CPU path.
+- **Dual-GPU ix_expanded_6v_dense run** — 298.7M shapes split across GPU0 (masks 0–280) + GPU1 (masks 281–561), running in background, ETA ~5h. Will yield the largest IX-aware regalloc table yet.
+- **EXX zone architecture** — S1+S2 independent table lookups, IXH/IXL/IYH/IYL as zero-cost inter-zone bridges. Full pipeline: `total_cost = lookup(S1) + lookup(S2) + 4T×N_exx + 8T×N_ix_accesses`.
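The EXX zone cost formula in the last bullet is plain arithmetic and can be sketched as below. The two zone costs stand in for the results of the S1 and S2 table lookups. The function name `totalCost` and the example numbers are hypothetical; only the 4T and 8T weights come from the formula itself.

```go
package main

import "fmt"

const (
	exxTStates = 4 // weight per EXX in the formula
	ixTStates  = 8 // weight per IX/IY half-register access in the formula
)

// totalCost combines two per-zone lookup costs (in T-states) with the
// EXX and IX-access overheads. costS1 and costS2 are assumed to come
// from independent table lookups for zones S1 and S2.
func totalCost(costS1, costS2, nExx, nIxAccesses int) int {
	return costS1 + costS2 + exxTStates*nExx + ixTStates*nIxAccesses
}

func main() {
	// Made-up inputs: zone costs 20T and 12T, two EXX switches, three IX accesses.
	fmt.Println(totalCost(20, 12, 2, 3)) // 20 + 12 + 8 + 24 = 64
}
```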
 
 ### Week 1 Highlights
 
@@ -24,6 +32,7 @@ A GPU-accelerated superoptimizer for the Zilog Z80 processor. The compiler that
 | 4 | Mar 28 | Regalloc | Five-level pipeline, backtracking solver, phase cliff — [log](contexts/day4_wisdom.md) |
 | 5 | Mar 29 | Images+u32 | 3 CUDA generators, u32 library, Introspec BB port — [log](contexts/day5_wisdom.md) |
 | 6 | Mar 29 | Division | div8 v3 (−49%), carry_compare, sign/sat/arith16 — [log](contexts/day6_wisdom.md) |
+| 7 | Apr 1 | OFB+6v | OFB public API, Z80T v2 sidecars, gen6v-ix-feed, dual-GPU 298M run |
 
 ### pRNG Image Search & Animation Pipeline — ZX Spectrum Demoscene
 
@@ -435,6 +444,68 @@ Length 4:  4,215^4 = 315T targets  → STOKE only
 Length 5+: combinatorial explosion → STOKE only
 ```
 
+## Pending Tables
+
+These table files are currently being computed or are planned. Each will be a drop-in addition to the existing regalloc pipeline.
+
+### `data/ix_expanded_6v_dense.bin` — **in progress** (ETA ~5h from April 1, 2026)
+
+The largest regalloc table yet: 6 virtual registers with full IX/IY half-register support and treewidth≥4 interference graphs.
+
+**What it enables:**
+- IX-aware register allocation for dense 6-vreg functions — currently the `merged_ix_5v.bin` table (60.9M entries) only covers up to 5 vregs with IX halves
+- Complete `pkg/regalloc` O(1) lookup for the common case of 6 live variables with pointer or EXX-zone patterns
+- Covers `HLH'L' u32` patterns (HL in main bank + BC or DE free as shadow) that appear in 32-bit arithmetic loops
+
+**Generation:**
+```bash
+# Currently running (background, dual-GPU on main i7):
+./gen6v-ix-feed -mask-start 0   -mask-end 281 | ./cuda/z80_regalloc --server --gpu-id 0 > data/ix_6v_gpu0.jsonl &
+./gen6v-ix-feed -mask-start 281 -mask-end 562 | ./cuda/z80_regalloc --server --gpu-id 1 > data/ix_6v_gpu1.jsonl &
+
+# After completion:
+cat data/ix_6v_gpu0.jsonl data/ix_6v_gpu1.jsonl > data/ix_expanded_6v_dense.jsonl
+CGO_ENABLED=0 ~/go/bin/go1.24.3 run ./cmd/build-ix-table/ \
+  -n-locsets8 6 -max-vregs 6 < data/ix_expanded_6v_dense.jsonl > data/ix_expanded_6v_dense.bin
+./enrich-ofb -input data/ix_expanded_6v_dense.bin -output data/ix_expanded_6v_dense.ofb
+```
+
+**Stats (projected):** ~298.7M shapes, ~79% feasible ≈ ~236M feasible assignments, ~2.5GB raw binary.
+
+The key algorithmic insight behind `gen6v-ix-feed`: of 32,768 possible nv=6 interference graphs, only **562 have treewidth≥4** (the ones worth exhaustive search). Pre-computing these 562 masks and iterating them as the outer loop raises feeder throughput from 8 shapes/sec (CPU bottleneck) to 200K shapes/sec.
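Two of the numbers above are easy to reproduce. A graph on 6 vertices has C(6,2) = 15 possible edges, hence 2^15 = 32,768 edge masks. The 562 surviving masks are then split into contiguous half-open ranges, one per GPU. The sketch below assumes `-mask-end` is exclusive, as the "masks 0–280 / 281–561" split suggests. The `maskRange` type and `splitMasks` function are hypothetical and not taken from `gen6v-ix-feed`.

```go
package main

import "fmt"

// maskRange is a half-open interval [start, end) of mask indices.
type maskRange struct{ start, end int }

// splitMasks divides total mask indices into `parts` contiguous ranges
// of near-equal size, one per GPU.
func splitMasks(total, parts int) []maskRange {
	ranges := make([]maskRange, 0, parts)
	for i := 0; i < parts; i++ {
		ranges = append(ranges, maskRange{
			start: i * total / parts,
			end:   (i + 1) * total / parts,
		})
	}
	return ranges
}

func main() {
	edges := 6 * 5 / 2      // C(6,2) possible edges between 6 vregs
	fmt.Println(1 << edges) // 32768 possible interference graphs

	for gpu, r := range splitMasks(562, 2) {
		fmt.Printf("GPU%d: -mask-start %d -mask-end %d\n", gpu, r.start, r.end)
	}
}
```

For 562 masks and 2 GPUs this reproduces the `0..281` and `281..562` arguments used in the commands above.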
+
+### OFB sidecars — complete for all current `.bin` tables
+
+OFB (Op Feasibility Bag) sidecars store 15 precomputed per-assignment flags, so each feasibility check is an O(1) lookup. Entries are aligned 1:1 with the source file:
+
+| Sidecar | Source | Size | Description |
+|---------|--------|------|-------------|
+| `data/enriched_4v.ofb` | enriched_4v.enr | ~625KB | 156K entries |
+| `data/enriched_5v.ofb` | enriched_5v.enr | ~67MB | 17.4M entries |
+| `data/enriched_6v_dense.ofb` | enriched_6v_dense.enr | ~253MB | 66.1M entries |
+| `data/merged_ix_5v.ofb` | merged_ix_5v.bin | ~233MB | 60.9M entries |
+| `data/ix_expanded_5v.ofb` | ix_expanded_5v.bin | ~233MB | 60.9M entries |
+| `data/ix_expanded_6v_dense.ofb` | ix_expanded_6v_dense.bin | ~1.0GB | ~298M entries *(pending)* |
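The sidecar sizes in the table follow from one 32-bit mask per entry. A rough check, assuming the payload is a flat array of 4-byte words with no significant header (the on-disk layout is not specified in this diff, so that is an assumption, and `sidecarBytes` is a hypothetical helper):

```go
package main

import "fmt"

// sidecarBytes estimates the payload size of an OFB sidecar,
// assuming exactly one 4-byte (32-bit) flag word per table entry.
func sidecarBytes(entries int64) int64 {
	return entries * 4
}

func main() {
	// Entry counts taken from the regalloc tables in this README.
	fmt.Println(sidecarBytes(156_506))    // 626024 bytes (enriched_4v)
	fmt.Println(sidecarBytes(17_366_874)) // 69467496 bytes (enriched_5v)
	fmt.Println(sidecarBytes(66_118_738)) // 264474952 bytes (enriched_6v_dense)
}
```

These land close to the approximate sizes listed in the table, which suggests any header is small relative to the payload.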
+
+OFB flags let the backend skip table lookups for common feasibility checks: `OFBMul8Safe` (H/L/C all free → safe to clobber for mul8), `OFBDJNZFree` (B free → DJNZ without save), `OFBHLArith` (HL assigned → ADD HL,rr native), etc.
+
+```go
+// pkg/regalloc usage:
+import "github.com/oisee/z80-optimizer/pkg/regalloc"
+
+ofb := regalloc.ComputeOFB(entry.Assignment)  // O(1), no sidecar needed
+// Or load the precomputed sidecar (check the returned error in real code):
+table, _ := regalloc.LoadOFB("data/enriched_5v.ofb")
+ofb = table.Get(entryIndex)  // plain assignment: ofb is already declared above
+
+if ofb & regalloc.OFBMul8Safe != 0 {
+    // H, L, C all free — safe to use H/L/C as mul8 scratch
+}
+if ofb & regalloc.OFBDJNZFree != 0 {
+    // B not assigned — emit DJNZ loop without PUSH BC / POP BC
+}
+```
+
 ## What's next
 
 ### In progress
@@ -523,7 +594,9 @@ GPU brute-force over all possible register allocation constraint shapes. For eac
 |-------|--------|----------|----------|
 | ≤4 vregs | 156,506 | 40 sec | 78.9% |
 | ≤5 vregs | 17,366,874 | 20 min | 67.7% |
-| 6 vregs (dense, tw≥4) | 66,118,738 | ~6 hours | TBD |
+| 6 vregs (dense, tw≥4) | 66,118,738 | ~6 hours | 38.9% |
+| **IX-expanded ≤5v** | **60,900,000** | **~2h** | **79.2%** — `data/merged_ix_5v.bin` |
+| **IX-expanded 6v** | **~298,700,000** | **~5.2h** | *pending* — dual-GPU run in progress |
 
 **Key findings:**
 
 
@@ -1,6 +1,6 @@
 # TODO — Z80 Superoptimizer Roadmap
 
-> Last updated: 2026-03-29 (Day 6 birthday marathon)
+> Last updated: 2026-04-01 (Day 7 — IX expansion, OFB, mul16c)
 
 Legend: `[x]` done, `[-]` in progress, `[ ]` planned.
 Effort: S = hours, M = day, L = days, XL = week+.
@@ -12,9 +12,11 @@ Effort: S = hours, M = day, L = days, XL = week+.
 ### 1.1 Multiply — COMPLETE
 - [x] **mul8**: 254/254 constants, A×K→A — `data/mulopt8_clobber.json` (S)
 - [x] **mul16**: 254/254 constants, A×K→HL — `data/mulopt16_complete.json` (M)
-- [ ] **mul16c**: HL×K→HL (16-bit × constant, full 16-bit) — needs new CUDA kernel (M)
-  - Approach: decompose as HL×K = L×K + H×K×256, use mul16 building blocks
-  - Or: new CUDA search with HL input, reduced op pool
+- [-] **mul16c**: HL×K→HL (16-bit × constant) — `cuda/z80_mulopt16c.cu` built, running (M)
+  - 12-op pool: ADD HL,HL/BC/DE + 8×LD saves + EX DE,HL
+  - ~86/254 found (max-len=10). Structural limit: floor(log2(K))+hamming(K)≤9
+  - Not found (168/254): K values with high Hamming weight; these need SBC HL,rr added to the op pool
+  - Next: gen_mul16c_table.py → pkg/mulopt/mul16c_table.go; Go wrapper Emit16c(k)
 
 ### 1.2 Division / Modulo — COMPLETE (u8)
 - [x] **div8**: 254/254 constants, A÷K→A — `data/div8_optimal.json` v3 (M)
@@ -112,11 +114,16 @@ Effort: S = hours, M = day, L = days, XL = week+.
 
 ## 3. Register Allocation
 
-### 3.1 Tables — COMPLETE
+### 3.1 Tables — COMPLETE + IX EXPANDED
 - [x] **83.6M shapes** (≤6v): enumerated, enriched, compressed (78MB)
 - [x] **37.6M feasible**: each with optimal assignment + 15 metrics
 - [x] **O(1) lookup**: signature = (interference_shape, operation_bag) → hash
 - [x] **Enriched tables**: 43% lack A, 21% lack HL, smart CALL save 17T avg
+- [x] **IX-expanded 5v**: `data/ix_expanded_5v.bin` (60.9M entries, 79.2% feasible, 117s GPU)
+- [x] **Merged IX+5v**: `data/merged_ix_5v.bin` (60.9M entries, 79.8% feasible, 382MB)
+- [x] **OFB sidecars**: `data/enriched_{4v,5v,6v_dense}.ofb` — 32-bit feasibility bitmask per entry
+  - 15 bits: ALU, ptr ops, mul8-safe, DJNZ, EXX bridge, HLH'L' u32, ADC/SBC src validity
+  - 5v: 11.76M feasible; 6v dense: 25.77M feasible
 
 ### 3.2 Five-Level Pipeline — MOSTLY COMPLETE
 - [x] Level 1: Cut vertex decomposition (free split, 87%)
@@ -176,10 +183,11 @@ Effort: S = hours, M = day, L = days, XL = week+.
 
 ### 5.1 CUDA Kernels — WORKING
 - [x] **z80_search_v2.cu**: 3-stage pipeline (QC→Mid→Exhaustive), dual-GPU
-- [x] **z80_regalloc.cu**: GPU allocator + CPU backtracking fallback
+- [x] **z80_regalloc.cu**: GPU allocator + CPU backtracking fallback (constrained enumeration fix: 97× speedup)
 - [x] **z80_mulopt_fast.cu**: 14-op constant multiply (38× faster)
 - [x] **z80_divmod_fast.cu**: 14-op division/modulo search
-- [x] **z80_mulopt16.cu**: 16-bit multiply search
+- [x] **z80_mulopt16.cu**: 16-bit multiply search (A×K→HL)
+- [x] **z80_mulopt16c.cu**: HL×K→HL search, 12-op pool (new, Day 7)
 - [x] **z80_common.h**: shared executor, flag tables, test vectors
 
 ### 5.2 Multi-Platform DSL — WORKING
@@ -204,6 +212,9 @@ Effort: S = hours, M = day, L = days, XL = week+.
 - [x] `pkg/regalloc/`: LoadBinary(path), IndexOf(shape), Lookup(idx)
 - [x] `pkg/peephole/`: Lookup(source) top500, LoadRules(path) full 739K
 - [x] `pkg/gpugen/`: ISA DSL for multi-platform code generation
+- [ ] `cmd/enrich-ofb/`: Emit OFB sidecar — built and validated (day 7)
+  - [ ] Add to Go package exports for use by regalloc pipeline
+- [ ] `pkg/mulopt/mul16c_table.go`: Emit16c(k) HL×K→HL — pending completion of the mul16c JSON
 
 ### 6.2 Pending Integration
 - [ ] **div8 inline expansion**: MinZ codegen wiring for JP __div8 → inline (S)