
Record: 11L Int6 QAT + Warmdown (val_bpb=1.3267, 1xH100)#488

Open
pkim02 wants to merge 1 commit into openai:main from pkim02:11L-int6-qat-warmdown

Conversation


@pkim02 pkim02 commented Mar 23, 2026

Summary

  • Best result: val_bpb = 1.3267 (sliding-window eval with stride=64, after int6 quantization, 1xH100, 600s)
  • 11 layers, MLP3x, GQA with 8 query / 4 KV heads, 26.5M params, 13.3 MB submission size
  • Systematic ablation study across 6 candidate scripts

Techniques

  • Int6 grouped quantization (group_rows=8) for all weights
  • QAT: STE fake-quantization during the last 15% of wallclock, which halves the quantization gap
  • zstd-22 compression (saves ~1 MB vs zlib)
  • Wallclock-fraction warmdown (last 15%): replaces the buggy iteration-based formula, which breaks when torch.compile overhead shifts the iteration budget
  • SWA: 7 checkpoints collected during warmdown, guarded against premature collection
  • Sliding-window eval with stride=64
  • Muon optimizer with WD=0.04, momentum warmup 0.92→0.99
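The quantization and QAT bullets above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the symmetric int6 range, absmax scaling, and the per-8-row grouping layout are assumptions inferred from `group_rows=8`.

```python
import torch

QMAX = 31  # symmetric int6 range [-31, 31] (assumed convention)

def int6_group_quant(w: torch.Tensor, group_rows: int = 8):
    """One absmax-derived scale per block of `group_rows` rows (assumed layout)."""
    rows, cols = w.shape
    grouped = w.reshape(rows // group_rows, group_rows * cols)
    scale = grouped.abs().amax(dim=1, keepdim=True).clamp_min(1e-8) / QMAX
    q = torch.clamp(torch.round(grouped / scale), -QMAX, QMAX)
    return q.to(torch.int8), scale

def int6_group_dequant(q: torch.Tensor, scale: torch.Tensor, shape):
    """Undo int6_group_quant: rescale each group and restore the weight shape."""
    return (q.float() * scale).reshape(shape)

def ste_fake_quant(w: torch.Tensor, group_rows: int = 8) -> torch.Tensor:
    """QAT forward pass: quantize-dequantize the weight; the straight-through
    estimator passes gradients through as if quantization were the identity."""
    q, scale = int6_group_quant(w.detach(), group_rows)
    wq = int6_group_dequant(q, scale, w.shape)
    return w + (wq - w).detach()
```

During the final 15% of wallclock, each weight would be routed through `ste_fake_quant` in the forward pass, so the optimizer adapts to the rounding error before the final post-training quantization.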

Ablation results (1xH100, 600s wallclock)

| Variant | Sliding bpb | Size |
| --- | --- | --- |
| Clean 11L Int6 QAT (baseline A) | 1.3761 | 14.0 MB |
| A + BigramHash + SmearGate | 1.3796 | 14.4 MB |
| A + Mixed Int5/Int6 | 1.3874 | 11.6 MB |
| A + Warmdown + SWA | **1.3267** | 13.3 MB |
| A + Warmdown + Late EMA | 1.3436 | 13.2 MB |
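The SWA step in the winning variant reduces, in its simplest form, to a uniform average of checkpoint state dicts. A minimal sketch, assuming all-float tensors with identical keys; the 7-checkpoint warmdown schedule and the premature-collection guard are not reproduced here:

```python
import torch

def swa_average(state_dicts):
    """Uniform average of model state_dicts collected during warmdown."""
    avg = {k: v.detach().clone().float() for k, v in state_dicts[0].items()}
    for sd in state_dicts[1:]:
        for k in avg:
            avg[k] += sd[k].float()
    n = float(len(state_dicts))
    return {k: v / n for k, v in avg.items()}
```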

Test plan

  • Verify on 8xH100 with torchrun --nproc_per_node=8
  • Confirm submission size < 16 MB
  • Validate sliding window bpb matches reported numbers

🤖 Generated with Claude Code

…=1.3267)

Adds 6 candidate scripts from systematic ablation study on 1xH100:

- 2026-03-20 SmearGate+BigramHash+SWA candidate (original starting point)
- 2026-03-22 Clean_11L_Int6_QAT: stripped baseline (sliding bpb=1.3761)
- 2026-03-22 Clean_11L_Int6_QAT_BigramSmear: +bigram+smear ablation (1.3796)
- 2026-03-22 Clean_11L_Mixed_Int5Int6_QAT: int5 MLP / int6 attn (1.3874, 11.6MB)
- 2026-03-22 Phase2_A_Warmdown: wallclock-fraction warmdown + SWA (1.3267, best)
- 2026-03-23 Phase2_A_Warmdown_EMA: late-start EMA during warmdown (1.3436)

Key techniques: int6 grouped quantization, QAT (STE fake-quant last 15%),
zstd-22 compression, sliding-window eval stride=64, wallclock-fraction
warmdown schedule (fixes the buggy iter-based warmdown), Muon WD=0.04.
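A wallclock-fraction warmdown might look like the sketch below (the function and constants are illustrative; only the "last 15% of the 600 s budget" detail comes from the PR). Keying the schedule to elapsed time rather than an iteration count keeps it correct even when torch.compile eats an unpredictable share of the budget:

```python
import time

def warmdown_lr_scale(start_time: float, budget_s: float = 600.0,
                      warmdown_frac: float = 0.15) -> float:
    """LR multiplier: 1.0 until the final `warmdown_frac` of the wallclock
    budget, then linear decay to 0.0 at the deadline."""
    elapsed_frac = (time.monotonic() - start_time) / budget_s
    warmdown_start = 1.0 - warmdown_frac
    if elapsed_frac < warmdown_start:
        return 1.0
    return max(0.0, (1.0 - elapsed_frac) / warmdown_frac)
```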

Also updates root README with non-MLX smoke test instructions and fixes
enable_gqa compatibility for PyTorch <2.5.
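The `enable_gqa` flag landed in `F.scaled_dot_product_attention` in PyTorch 2.5, so older versions need a fallback. One way such a compatibility shim could look (a sketch, not the PR's actual code):

```python
import torch
import torch.nn.functional as F

# enable_gqa was added to scaled_dot_product_attention in torch 2.5
_TORCH_MAJOR_MINOR = tuple(int(p) for p in torch.__version__.split("+")[0].split(".")[:2])
_SDPA_HAS_GQA = _TORCH_MAJOR_MINOR >= (2, 5)

def gqa_attention(q, k, v, is_causal=True):
    """q: (B, Hq, T, D); k, v: (B, Hkv, T, D) with Hq divisible by Hkv."""
    if _SDPA_HAS_GQA:
        return F.scaled_dot_product_attention(
            q, k, v, is_causal=is_causal, enable_gqa=True)
    # PyTorch < 2.5: materialize the shared KV heads instead.
    rep = q.shape[1] // k.shape[1]
    return F.scaled_dot_product_attention(
        q,
        k.repeat_interleave(rep, dim=1),
        v.repeat_interleave(rep, dim=1),
        is_causal=is_causal,
    )
```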

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@MatoTeziTanka

Community Review — Record: 11L Int6 QAT + Warmdown (val_bpb=1.3267, 1xH100)

BPB: 1.3267 | Compliance: LOOKS CLEAN — pure-neural submission, no TTT/SLOT/n-gram-cache

What I found in the code (head SHA af3eeb9d9817, file records/track_10min_16mb/2026-03-20_11L_Int6_MLP3x_SmearGate_BigramHash_SWA_Candidate/train_gpt.py):

Static code review found no TTT adaptation function, no SLOT optimization loop, no n-gram-cache class, and no pre-quant val-token fine-tune. The eval path uses the standard sliding-window stride-64 pattern. The submission is a pure-neural architecture iteration on the standard SP1024/SP4096/SP8192 baseline.
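For reference, the sliding-window pattern described above typically scores each token exactly once, with up to a full window of left context, advancing by the stride. A sketch under that assumption (`model` here is any callable mapping a (1, T) token tensor to (1, T, V) logits; true bpb would divide the total bits by the byte length of the eval text):

```python
import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def sliding_window_nll_bits(model, tokens, window=1024, stride=64):
    """Total negative log-likelihood in bits over `tokens`, scoring each
    target once via overlapping windows advanced by `stride`."""
    n = tokens.numel()
    total_bits, scored, prev_end = 0.0, 0, 1
    for begin in range(0, n, stride):
        end = min(begin + window, n)
        logits = model(tokens[begin:end].unsqueeze(0))  # (1, T, V)
        logp = F.log_softmax(logits.float(), dim=-1)[0]
        first = max(begin + 1, prev_end)  # skip targets already scored
        tgt = tokens[first:end]
        # logits at local position i predict token i+1
        pred = logp[first - begin - 1 : end - begin - 1]
        total_bits -= pred.gather(1, tgt.unsqueeze(1)).sum().item() / math.log(2)
        scored += tgt.numel()
        prev_end = end
        if end == n:
            break
    return total_bits, scored
```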

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.06s, dim=512, layers=11, vocab=1024, code=93784 B, SMOKE_TEST_PASS

Verdict: LOOKS CLEAN.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual record-track checks (3-seed validation, under-16MB artifact cap, ≤600s train + ≤600s eval on 8×H100 SXM). No compliance flags from the classification pass — this looks like a clean pure-neural iteration on the standard baseline.

Auto-classification caveat: this review was drafted by the AST-based classifier. If there's a non-standard eval mechanism (logit postprocessing, hedge mixing, etc.) that I missed because it's factored into a helper file or a non-standard function name, please flag it and I'll re-run the audit manually.


Reviewed by @MatoTeziTanka (The Agora). Classification via the deterministic AST-based classify_prs.py (pattern bank derived from ~65 manually reviewed PRs earlier in the 2026-04-11 sweep). This review was auto-drafted from a template and spot-checked before posting; if the template misread your code, please call it out so I can iterate on the classifier.

