
Commit 5dbc8ec

oisee and claude committed
Day 7: OFB public API, gen6v-ix-feed, Z80T sidecars, mul16c, EXX zone arch
- pkg/regalloc/ofb.go: ComputeOFB(), LoadOFB(), OFBTable, 15 flag constants (public API)
- pkg/regalloc/zone.go: EXX zone composition, IXH/IXL bridge model
- pkg/enr/: ENRT reader package for enriched .enr files
- cmd/enrich-ofb: auto-detect ENRT vs Z80T v2 format; generates OFB sidecars for all tables
- cmd/gen6v-ix-feed: fast FuncDesc generator for ix_expanded_6v_dense GPU run
  - pre-computes 562 valid treewidth≥4 masks out of 32768 possible 6v graphs (<0.5s)
  - 200K shapes/sec vs 8 shapes/sec CPU bottleneck; -mask-start/-mask-end for dual-GPU split
- cmd/build-ix-table, cmd/merge-tables, cmd/derive-ix: IX table pipeline commands
- cuda/z80_regalloc.cu: add --gpu-id N flag for dual-GPU regalloc server
- cuda/z80_mulopt16c.cu: HL×K→HL constant multiply search (mul16c)
- data/: OFB sidecars for enriched_{4v,5v,6v_dense}, merged_ix_5v, merged_5v, ix_expanded_5v
- data/mulopt16c_complete.json: 86/254 mul16c sequences (all K=2..31 plus select larger)
- data/peephole_len2_complete.json: 739K len-2→len-1 rules (complete)
- docs/adr/: ADRs 005-008 (arithmetic tables, bool repr, enrichment, 18-reg EXX model)
- docs/z80_opref.md: complete Z80 operation reference
- docs/regalloc_integration_plan.md: integration plan for compiler backends
- scripts/gen_mul16c_table.py: mul16c table generator
- README: Day 7 highlights, Pending Tables section, IX-expanded table rows
- cmd/regalloc-enum: locSets8 extended to 6 sets (added IX/IY halves locset)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
1 parent bbb97df commit 5dbc8ec

33 files changed

Lines changed: 6245 additions & 227 deletions

.gitignore

Lines changed: 7 additions & 0 deletions
@@ -51,3 +51,10 @@ mulopt
 partopt
 regalloc-enum
 dedelulu.jsonl
+peephole_len2_complete.json
+
+data/enriched_5v.ofb
+data/enriched_6v_dense.ofb
+data/ix_expanded_5v.ofb
+data/merged_5v.ofb
+data/merged_ix_5v.ofb

BRUTEFORCE_ROADMAP.md

Lines changed: 2 additions & 2 deletions
@@ -11,7 +11,7 @@ A GPU with 5000 cores can test billions of sequences per second. A human can't.
 ### Peephole Rules (602,008 entries)
 - **What:** For every pair of Z80 instructions, find a shorter replacement
 - **Search space:** 4,215² = 17.8M pairs (length-2)
-- **Results:** 602K proven optimizations in `results-len2.json`
+- **Results:** 602K proven optimizations in `data/peephole_len2_complete.json`
 - **Example:** `SLA A / RR A` → `OR A` (saves 3 bytes)
 
 ### Register Allocation Table (61 entries)
@@ -201,7 +201,7 @@ Each thread tests one sequence against the QuickCheck vectors. Survivors get ful
 
 ```
 z80-optimizer/
-├── results-len2.json # 602K peephole rules (done)
+├── data/peephole_len2_complete.json # 602K peephole rules (done)
 ├── regalloc_table.json # 61 register assignments (done)
 ├── mul_table_tier1.json # constant multiply (in progress)
 ├── mul_table_tier2.json # + ADC/SBC

CLAUDE.md

Lines changed: 9 additions & 1 deletion
@@ -43,6 +43,7 @@ nvcc -O3 -o cuda/z80_divmod_fast cuda/z80_divmod_fast.cu # division/modulo (1
 - `cuda/z80_mulopt_fast.cu` — Constant multiply search (14-op reduced pool, 38x faster)
 - `cuda/z80_divmod_fast.cu` — Division/modulo search (14-op, 5T limit)
 - `cuda/z80_mulopt16.cu` — 16-bit multiply (u8×K=u16, result in HL)
+- `cuda/z80_mulopt16c.cu` — HL×K→HL (16-bit × constant, 12-op pool, EX DE,HL)
 - `cuda/z80_common.h` — Shared Z80 executor, flag tables, test vectors
 
 ### Data
@@ -53,6 +54,7 @@ nvcc -O3 -o cuda/z80_divmod_fast cuda/z80_divmod_fast.cu # division/modulo (1
 - `data/z80_register_graph.json` — Complete 11-register cost model (moves, ALU, swaps)
 - `data/mulopt8_clobber.json` — 254 mul8 sequences (A×K→A) with clobber masks
 - `data/mulopt16_complete.json` — 254 mul16 sequences (A×K→HL)
+- `data/mulopt16c_complete.json` — 86/254 mul16c sequences (HL×K→HL, all K=2..31 plus select larger)
 - `data/div8_optimal.json` — 254 div8 sequences (A÷K→A) via multiply-and-shift
 - `data/mod8_optimal.json` — 254 mod8 sequences (A%K→A)
 - `data/divmod8_optimal.json` — 254 divmod8 sequences
@@ -64,6 +66,7 @@ nvcc -O3 -o cuda/z80_divmod_fast cuda/z80_divmod_fast.cu # division/modulo (1
 - `data/bcd_idioms.json` — BCD arithmetic (GPU-proven with H-flag)
 
 ### Documentation
+- `docs/z80_opref.md` - **Complete Z80 operation reference**: all instructions, T-states, flags, encoding, spill tier hierarchy, boolean repr
 - `docs/glossary.md` — Complete glossary of all terms and abbreviations
 - `docs/paper_seed_superopt.md` — Paper/book seed: 8 sections covering all research findings
 - `docs/research_statement.md` — Paper-oriented framing with phase diagram data
@@ -101,10 +104,13 @@ Register count: 7 main (A,B,C,D,E,H,L) + 4 IX/IY halves = **11 registers** for a
 - 37M len-3→len-1 rules (partial, ~0.05% coverage)
 
 ### Constant Multiplication
-- 254/254 constants solved (complete!)
+- 254/254 constants solved (complete!) — A×K→A (mul8) and A×K→HL (mul16)
 - Key finding: 21-instruction universal pool (2.7% of ISA generates ALL optimal arithmetic)
 - NEG trick: ×255 = NEG (1 instruction, 8T)
 - All 254 mul8 preserve A, all DE-safe
+- **mul16c (HL×K→HL)**: 86/254 solved. All K=2..31 complete. Avg 79.3T, 8.4 ops.
+- Structural limit: floor(log2(K))+hamming(K)≤9. K=47,63,127,255 etc. need SBC HL,rr.
+- Go table: `pkg/mulopt/Mul16cTable`, `Emit16c(k)`
 
 ### Division/Modulo — COMPLETE (254/254)
 - **div8 v3**: 6 methods, avg **79T** (−49% from v1). All exhaustively verified.
@@ -125,6 +131,8 @@ Register count: 7 main (A,B,C,D,E,H,L) + 4 IX/IY halves = **11 registers** for a
 - ≤4v: 156,506 shapes (78.9% feasible), 40 seconds
 - ≤5v: 17,366,874 shapes (67.7% feasible), 20 minutes
 - 6v dense (tw≥4): 66,118,738 shapes (38.9% feasible), ~6 hours
+- **IX-expanded 5v**: 60.9M entries (79.2% feasible) — `data/merged_ix_5v.bin`
+- **OFB sidecars**: `data/enriched_{4v,5v,6v_dense}.ofb` — 32-bit op-feasibility bitmask per entry (15 flags)
 - Enrichment: 43% lack A (hidden ALU infeasibility), 21% lack HL
 - Smart CALL save: 17T avg (vs 34T naive) = 50% reduction
 - Feasibility cliff: 95.9% (2v) → 0.9% (6v) — phase transition
README.md

Lines changed: 75 additions & 2 deletions
@@ -2,7 +2,15 @@
 
 A GPU-accelerated superoptimizer for the Zilog Z80 processor. The compiler that **never guesses** — every optimization is provably optimal.
 
-## What's New (Birthday Marathon — March 26–29, 2026)
+## What's New (Birthday Marathon — March 26 – April 1, 2026)
+
+### Day 7 Highlights (April 1, 2026)
+
+- **`pkg/regalloc/ofb.go`** — public OFB API: `ComputeOFB()`, `LoadOFB()`, 15 flag constants, `OFBNames()`. No more local duplicates in consumer tools.
+- **OFB sidecars for ALL table files** - `enrich-ofb` now auto-detects ENRT and Z80T v2 formats. Sidecars generated for `merged_ix_5v.bin`, `ix_expanded_5v.bin`, etc. (~233MB each).
+- **`cmd/gen6v-ix-feed`** — fast FuncDesc JSON generator for the 6v IX-expanded GPU run. Pre-computes 562 valid treewidth≥4 masks (out of 32,768 possible 6v graphs) in <0.5s, then iterates masks as outer loop → 200K shapes/sec feeder vs 8 shapes/sec from regalloc-enum CPU bottleneck.
+- **Dual-GPU ix_expanded_6v_dense run** — 298.7M shapes split across GPU0 (masks 0–280) + GPU1 (masks 281–561), running in background, ETA ~5h. Will yield the largest IX-aware regalloc table yet.
+- **EXX zone architecture** — S1+S2 independent table lookups, IXH/IXL/IYH/IYL as zero-cost inter-zone bridges. Full pipeline: `total_cost = lookup(S1) + lookup(S2) + 4T×N_exx + 8T×N_ix_accesses`.
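The EXX-zone cost formula in the last bullet can be sketched as a small Go function. This is an illustration only: `ZoneCost`, `exxT` and `ixAccessT` are hypothetical names, not the API of `pkg/regalloc/zone.go`, and the two per-zone costs are passed in as plain integers where the real pipeline would do table lookups.

```go
package main

import "fmt"

const (
	exxT      = 4 // T-states charged per EXX bank switch (from the formula above)
	ixAccessT = 8 // T-states charged per IX/IY half-register access (from the formula above)
)

// ZoneCost composes two independently looked-up zone costs (S1, S2)
// with the switching and bridge overhead. Hypothetical helper.
func ZoneCost(s1, s2, nExx, nIxAccesses int) int {
	return s1 + s2 + exxT*nExx + ixAccessT*nIxAccesses
}

func main() {
	// 40 + 32 + 4*2 + 8*3
	fmt.Println(ZoneCost(40, 32, 2, 3))
}
```

Because the two zones are looked up independently, the overhead terms are the only place where the zones interact in this model.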
 
 ### Week 1 Highlights
 
@@ -24,6 +32,7 @@ A GPU-accelerated superoptimizer for the Zilog Z80 processor. The compiler that
 | 4 | Mar 28 | Regalloc | Five-level pipeline, backtracking solver, phase cliff — [log](contexts/day4_wisdom.md) |
 | 5 | Mar 29 | Images+u32 | 3 CUDA generators, u32 library, Introspec BB port — [log](contexts/day5_wisdom.md) |
 | 6 | Mar 29 | Division | div8 v3 (−49%), carry_compare, sign/sat/arith16 — [log](contexts/day6_wisdom.md) |
+| 7 | Apr 1 | OFB+6v | OFB public API, Z80T v2 sidecars, gen6v-ix-feed, dual-GPU 298M run |
 
 ### pRNG Image Search & Animation Pipeline — ZX Spectrum Demoscene
 
@@ -435,6 +444,68 @@ Length 4: 4,215^4 = 315T targets → STOKE only
 Length 5+: combinatorial explosion → STOKE only
 ```
 
+## Pending Tables
+
+These table files are currently being computed or are planned. Each will be a drop-in addition to the existing regalloc pipeline.
+
+### `data/ix_expanded_6v_dense.bin` - **in progress** (ETA ~5h from April 1, 2026)
+
+The largest regalloc table yet: 6 virtual registers with full IX/IY half-register support and treewidth≥4 interference graphs.
+
+**What it enables:**
+- IX-aware register allocation for dense 6-vreg functions — currently the `merged_ix_5v.bin` table (60.9M entries) only covers up to 5 vregs with IX halves
+- Complete `pkg/regalloc` O(1) lookup for the common case of 6 live variables with pointer or EXX-zone patterns
+- Covers `HLH'L' u32` patterns (HL in main bank + BC or DE free as shadow) that appear in 32-bit arithmetic loops
+
+**Generation:**
+```bash
+# Currently running (background, dual-GPU on main i7):
+./gen6v-ix-feed -mask-start 0 -mask-end 281 | ./cuda/z80_regalloc --server --gpu-id 0 > data/ix_6v_gpu0.jsonl &
+./gen6v-ix-feed -mask-start 281 -mask-end 562 | ./cuda/z80_regalloc --server --gpu-id 1 > data/ix_6v_gpu1.jsonl &
+
+# After completion:
+cat data/ix_6v_gpu0.jsonl data/ix_6v_gpu1.jsonl > data/ix_expanded_6v_dense.jsonl
+CGO_ENABLED=0 ~/go/bin/go1.24.3 run ./cmd/build-ix-table/ \
+-n-locsets8 6 -max-vregs 6 < data/ix_expanded_6v_dense.jsonl > data/ix_expanded_6v_dense.bin
+./enrich-ofb -input data/ix_expanded_6v_dense.bin -output data/ix_expanded_6v_dense.ofb
+```
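The two feeder commands above split the 562 masks into the ranges 0..281 and 281..562. A minimal sketch of that split for an arbitrary GPU count follows. It assumes `-mask-end` is exclusive, which is what the "masks 0–280" and "masks 281–561" wording in the Day 7 highlights suggests; `splitMasks` and `maskRange` are hypothetical names and not part of `gen6v-ix-feed`.

```go
package main

import "fmt"

// maskRange is a half-open interval [Start, End) of mask indices.
type maskRange struct{ Start, End int }

// splitMasks divides total masks into contiguous, near-equal ranges,
// one per GPU. Hypothetical helper for illustration.
func splitMasks(total, gpus int) []maskRange {
	ranges := make([]maskRange, 0, gpus)
	for g := 0; g < gpus; g++ {
		ranges = append(ranges, maskRange{
			Start: total * g / gpus,
			End:   total * (g + 1) / gpus,
		})
	}
	return ranges
}

func main() {
	for gpu, r := range splitMasks(562, 2) {
		fmt.Printf("./gen6v-ix-feed -mask-start %d -mask-end %d | ./cuda/z80_regalloc --server --gpu-id %d\n",
			r.Start, r.End, gpu)
	}
}
```

Splitting by mask count balances the work only if every mask expands to a similar number of shapes, which the source does not state.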
+
+**Stats (projected):** ~298.7M shapes, ~79% feasible ≈ ~236M feasible assignments, ~2.5GB raw binary.
+
+The key algorithmic insight behind `gen6v-ix-feed`: of 32,768 possible nv=6 interference graphs, only **562 have treewidth≥4** (the ones worth exhaustive search). Pre-computing these 562 masks and iterating them as the outer loop reduces the feeder from 8 shapes/sec (CPU bottleneck) to 200K shapes/sec.
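The 32,768 figure is 2^15, since a 6-vertex graph has 15 possible edges. A hedged sketch of a treewidth filter over such edge masks is shown below. It computes treewidth exactly by searching vertex elimination orderings, which is cheap at 6 vertices. The bit layout in `edgeBit` and all function names are assumptions, and the source does not say how `gen6v-ix-feed` defines or counts its masks, so this sketch is not claimed to reproduce the 562 figure.

```go
package main

import (
	"fmt"
	"math/bits"
)

const nv = 6

// edgeBit maps a vertex pair i<j to a bit index 0..14.
// ASSUMPTION: lexicographic pair order; the real tool's layout is not shown.
func edgeBit(i, j int) uint {
	return uint(i*(2*nv-i-1)/2 + (j - i - 1))
}

// adjacency expands a 15-bit edge mask into per-vertex neighbor bitsets.
func adjacency(mask uint16) [nv]uint8 {
	var adj [nv]uint8
	for i := 0; i < nv; i++ {
		for j := i + 1; j < nv; j++ {
			if mask>>edgeBit(i, j)&1 == 1 {
				adj[i] |= 1 << j
				adj[j] |= 1 << i
			}
		}
	}
	return adj
}

// eliminate returns the minimum width over all elimination orderings of the
// remaining vertices, pruning branches that cannot beat best.
func eliminate(adj [nv]uint8, remaining uint8, cur, best int) int {
	if remaining == 0 {
		return cur
	}
	for v := 0; v < nv; v++ {
		vb := uint8(1) << v
		if remaining&vb == 0 {
			continue
		}
		nb := adj[v] & remaining &^ vb // live neighbors of v
		w := cur
		if d := bits.OnesCount8(nb); d > w {
			w = d
		}
		if w >= best {
			continue
		}
		next := adj
		for u := 0; u < nv; u++ {
			if nb&(1<<u) != 0 {
				next[u] |= nb &^ (1 << u) // neighbors of v become a clique
			}
		}
		if r := eliminate(next, remaining&^vb, w, best); r < best {
			best = r
		}
	}
	return best
}

// treewidth is the exact treewidth of the 6-vertex graph encoded by mask.
func treewidth(mask uint16) int {
	return eliminate(adjacency(mask), 1<<nv-1, 0, nv)
}

func main() {
	dense := 0
	for m := 0; m < 1<<15; m++ {
		if treewidth(uint16(m)) >= 4 {
			dense++
		}
	}
	fmt.Println("edge masks with treewidth >= 4 under this encoding:", dense)
}
```

A filter of this kind runs once up front, so the per-shape feeder loop never has to test treewidth again.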
+
+### OFB sidecars — complete for all current `.bin` tables
+
+OFB (Op Feasibility Bag) sidecars precompute 15 per-assignment flags in O(1), aligned 1:1 with the source file:
+
+| Sidecar | Source | Size | Description |
+|---------|--------|------|-------------|
+| `data/enriched_4v.ofb` | enriched_4v.enr | ~625KB | 156K entries |
+| `data/enriched_5v.ofb` | enriched_5v.enr | ~67MB | 17.4M entries |
+| `data/enriched_6v_dense.ofb` | enriched_6v_dense.enr | ~253MB | 66.1M entries |
+| `data/merged_ix_5v.ofb` | merged_ix_5v.bin | ~233MB | 60.9M entries |
+| `data/ix_expanded_5v.ofb` | ix_expanded_5v.bin | ~233MB | 60.9M entries |
+| `data/ix_expanded_6v_dense.ofb` | ix_expanded_6v_dense.bin | ~1.0GB | ~298M entries *(pending)* |
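The sizes in this table are consistent with one 32-bit word per entry (156,506 × 4 = 626,024 bytes for the 4v table). A minimal sketch of an index-aligned reader under that layout is given below. It assumes a headerless file of little-endian `uint32` values; neither the header nor the byte order is confirmed by the source, and `sidecar` and `parseSidecar` are hypothetical names rather than the `LoadOFB` implementation.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// sidecar holds one 32-bit flag word per table entry, in table order.
type sidecar struct{ words []uint32 }

// parseSidecar decodes raw bytes as consecutive little-endian uint32 words.
// ASSUMPTION: no header and little-endian encoding.
func parseSidecar(raw []byte) (*sidecar, error) {
	if len(raw)%4 != 0 {
		return nil, errors.New("sidecar length is not a multiple of 4 bytes")
	}
	words := make([]uint32, len(raw)/4)
	for i := range words {
		words[i] = binary.LittleEndian.Uint32(raw[4*i:])
	}
	return &sidecar{words: words}, nil
}

// Get returns the flag word for entry i (same index as the source table).
func (s *sidecar) Get(i int) uint32 { return s.words[i] }

// Len returns the number of entries.
func (s *sidecar) Len() int { return len(s.words) }

func main() {
	raw := []byte{0x05, 0x00, 0x00, 0x00, 0x00, 0x40, 0x00, 0x00}
	sc, err := parseSidecar(raw)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%d entries, entry 1 = %#x\n", sc.Len(), sc.Get(1))
}
```

Keeping the sidecar index-aligned with the table means a lookup needs no key, only the entry index already used for the table itself.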
+
+OFB flags let the backend skip table lookups for common feasibility checks: `OFBMul8Safe` (H/L/C all free → safe to clobber for mul8), `OFBDJNZFree` (B free → DJNZ without save), `OFBHLArith` (HL assigned → ADD HL,rr native), etc.
+
+```go
+// pkg/regalloc usage:
+import "github.com/oisee/z80-optimizer/pkg/regalloc"
+
+ofb := regalloc.ComputeOFB(entry.Assignment) // O(1), no sidecar needed
+// Or load precomputed sidecar:
+table, _ := regalloc.LoadOFB("data/enriched_5v.ofb")
+ofb = table.Get(entryIndex) // plain assignment: ofb is already declared above
+
+if ofb & regalloc.OFBMul8Safe != 0 {
+	// H, L, C all free — safe to use H/L/C as mul8 scratch
+}
+if ofb & regalloc.OFBDJNZFree != 0 {
+	// B not assigned — emit DJNZ loop without PUSH BC / POP BC
+}
+```
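How a flag word of this kind can be derived from an assignment is sketched below for the three flags described above. The register bitset, the flag bit positions, and the names `computeFlags`, `flagMul8Safe`, `flagDJNZFree` and `flagHLArith` are all assumptions made for illustration; the real `ComputeOFB` signature and encoding are not shown in this commit page.

```go
package main

import "fmt"

// Physical registers used by an assignment, as a bitset (hypothetical encoding).
const (
	regA uint16 = 1 << iota
	regB
	regC
	regD
	regE
	regH
	regL
)

// Feasibility flags (hypothetical bit positions).
const (
	flagMul8Safe uint32 = 1 << iota // H, L and C are all free
	flagDJNZFree                    // B is free
	flagHLArith                     // both H and L are assigned
)

// computeFlags derives the flag word from the set of assigned registers.
func computeFlags(used uint16) uint32 {
	var f uint32
	if used&(regH|regL|regC) == 0 {
		f |= flagMul8Safe
	}
	if used&regB == 0 {
		f |= flagDJNZFree
	}
	if used&(regH|regL) == regH|regL {
		f |= flagHLArith
	}
	return f
}

func main() {
	f := computeFlags(regA | regD)
	fmt.Println(f&flagMul8Safe != 0, f&flagDJNZFree != 0, f&flagHLArith != 0)
}
```

Each flag is a constant-time mask test, which is why computing the word on the fly and reading it from a sidecar are interchangeable for the caller.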
+
 ## What's next
 
 ### In progress
@@ -523,7 +594,9 @@ GPU brute-force over all possible register allocation constraint shapes. For eac
 |-------|--------|----------|----------|
 | ≤4 vregs | 156,506 | 40 sec | 78.9% |
 | ≤5 vregs | 17,366,874 | 20 min | 67.7% |
-| 6 vregs (dense, tw≥4) | 66,118,738 | ~6 hours | TBD |
+| 6 vregs (dense, tw≥4) | 66,118,738 | ~6 hours | 38.9% |
+| **IX-expanded ≤5v** | **60,900,000** | **~2h** | **79.2%** - `data/merged_ix_5v.bin` |
+| **IX-expanded 6v** | **~298,700,000** | **~5.2h** | *pending* — dual-GPU run in progress |
 
 **Key findings:**
 
TODO.md

Lines changed: 18 additions & 7 deletions
@@ -1,6 +1,6 @@
 # TODO — Z80 Superoptimizer Roadmap
 
-> Last updated: 2026-03-29 (Day 6 birthday marathon)
+> Last updated: 2026-04-01 (Day 7 — IX expansion, OFB, mul16c)
 
 Legend: `[x]` done, `[-]` in progress, `[ ]` planned.
 Effort: S = hours, M = day, L = days, XL = week+.
@@ -12,9 +12,11 @@ Effort: S = hours, M = day, L = days, XL = week+.
 ### 1.1 Multiply — COMPLETE
 - [x] **mul8**: 254/254 constants, A×K→A — `data/mulopt8_clobber.json` (S)
 - [x] **mul16**: 254/254 constants, A×K→HL — `data/mulopt16_complete.json` (M)
-- [ ] **mul16c**: HL×K→HL (16-bit × constant, full 16-bit) — needs new CUDA kernel (M)
-- Approach: decompose as HL×K = L×K + H×K×256, use mul16 building blocks
-- Or: new CUDA search with HL input, reduced op pool
+- [-] **mul16c**: HL×K→HL (16-bit × constant) — `cuda/z80_mulopt16c.cu` built, running (M)
+- 12-op pool: ADD HL,HL/BC/DE + 8×LD saves + EX DE,HL
+- ~86/254 found (max-len=10). Structural limit: floor(log2(K))+hamming(K)≤9
+- Not found (168/254): K values with high Hamming weight — need SBC HL,rr to add
+- Next: gen_mul16c_table.py → pkg/mulopt/mul16c_table.go; Go wrapper Emit16c(k)
 
 ### 1.2 Division / Modulo — COMPLETE (u8)
 - [x] **div8**: 254/254 constants, A÷K→A — `data/div8_optimal.json` v3 (M)
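The mul16c structural limit in the hunk above can be checked with a short Go sketch. It assumes the found sequences behave like textbook double-and-add: two `LD` saves of HL into DE, floor(log2(K)) doublings, and hamming(K)−1 additions, which fits `max-len=10` exactly when floor(log2(K))+hamming(K)≤9. Counting such K in 2..255 gives 86, consistent with the 86/254 figure. `withinLimit`, `shiftAdd` and `run` are hypothetical names, and the emitted sequences are a naive baseline, not the GPU-proven optimal ones in `data/mulopt16c_complete.json`.

```go
package main

import (
	"fmt"
	"math/bits"
)

// withinLimit reports whether floor(log2(k)) + hamming(k) <= 9.
func withinLimit(k uint) bool {
	return bits.Len(k)-1+bits.OnesCount(k) <= 9
}

// shiftAdd emits a naive double-and-add sequence computing HL = HL*k (k >= 2).
func shiftAdd(k uint) []string {
	var seq []string
	if bits.OnesCount(k) > 1 {
		seq = append(seq, "LD D,H", "LD E,L") // keep the multiplicand in DE
	}
	for b := bits.Len(k) - 2; b >= 0; b-- {
		seq = append(seq, "ADD HL,HL")
		if k>>uint(b)&1 == 1 {
			seq = append(seq, "ADD HL,DE")
		}
	}
	return seq
}

// run interprets the four mnemonics used by shiftAdd with 16-bit wraparound.
func run(seq []string, hl uint16) uint16 {
	var de uint16
	for _, op := range seq {
		switch op {
		case "LD D,H":
			de = de&0x00FF | hl&0xFF00
		case "LD E,L":
			de = de&0xFF00 | hl&0x00FF
		case "ADD HL,HL":
			hl += hl
		case "ADD HL,DE":
			hl += de
		}
	}
	return hl
}

func main() {
	count := 0
	for k := uint(2); k <= 255; k++ {
		if withinLimit(k) {
			count++
		}
	}
	fmt.Println("K in 2..255 within the limit:", count)
	fmt.Println("K=10:", shiftAdd(10))
}
```

K=47, 63, 127 and 255 all fail `withinLimit`, matching the constants listed as needing `SBC HL,rr`.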
@@ -112,11 +114,16 @@ Effort: S = hours, M = day, L = days, XL = week+.
 
 ## 3. Register Allocation
 
-### 3.1 Tables — COMPLETE
+### 3.1 Tables — COMPLETE + IX EXPANDED
 - [x] **83.6M shapes** (≤6v): enumerated, enriched, compressed (78MB)
 - [x] **37.6M feasible**: each with optimal assignment + 15 metrics
 - [x] **O(1) lookup**: signature = (interference_shape, operation_bag) → hash
 - [x] **Enriched tables**: 43% lack A, 21% lack HL, smart CALL save 17T avg
+- [x] **IX-expanded 5v**: `data/ix_expanded_5v.bin` (60.9M entries, 79.2% feasible, 117s GPU)
+- [x] **Merged IX+5v**: `data/merged_ix_5v.bin` (60.9M entries, 79.8% feasible, 382MB)
+- [x] **OFB sidecars**: `data/enriched_{4v,5v,6v_dense}.ofb` — 32-bit feasibility bitmask per entry
+- 15 bits: ALU, ptr ops, mul8-safe, DJNZ, EXX bridge, HLH'L' u32, ADC/SBC src validity
+- 5v: 11.76M feasible; 6v dense: 25.77M feasible
 
 ### 3.2 Five-Level Pipeline — MOSTLY COMPLETE
 - [x] Level 1: Cut vertex decomposition (free split, 87%)
@@ -176,10 +183,11 @@ Effort: S = hours, M = day, L = days, XL = week+.
 
 ### 5.1 CUDA Kernels — WORKING
 - [x] **z80_search_v2.cu**: 3-stage pipeline (QC→Mid→Exhaustive), dual-GPU
-- [x] **z80_regalloc.cu**: GPU allocator + CPU backtracking fallback
+- [x] **z80_regalloc.cu**: GPU allocator + CPU backtracking fallback (constrained enumeration fix: 97× speedup)
 - [x] **z80_mulopt_fast.cu**: 14-op constant multiply (38× faster)
 - [x] **z80_divmod_fast.cu**: 14-op division/modulo search
-- [x] **z80_mulopt16.cu**: 16-bit multiply search
+- [x] **z80_mulopt16.cu**: 16-bit multiply search (A×K→HL)
+- [x] **z80_mulopt16c.cu**: HL×K→HL search, 12-op pool (new day 7)
 - [x] **z80_common.h**: shared executor, flag tables, test vectors
 
 ### 5.2 Multi-Platform DSL — WORKING
@@ -204,6 +212,9 @@ Effort: S = hours, M = day, L = days, XL = week+.
 - [x] `pkg/regalloc/`: LoadBinary(path), IndexOf(shape), Lookup(idx)
 - [x] `pkg/peephole/`: Lookup(source) top500, LoadRules(path) full 739K
 - [x] `pkg/gpugen/`: ISA DSL for multi-platform code generation
+- [ ] `cmd/enrich-ofb/`: Emit OFB sidecar — built and validated (day 7)
+- [ ] Add to Go package exports for use by regalloc pipeline
+- [ ] `pkg/mulopt/mul16c_table.go`: Emit16c(k) HL×K→HL — pending mul16c JSON complete
 
 ### 6.2 Pending Integration
 - [ ] **div8 inline expansion**: MinZ codegen wiring for JP __div8 → inline (S)
