Record: 11L Int6 QAT + Warmdown (val_bpb=1.3267, 1xH100) #488

pkim02 wants to merge 1 commit into openai:main from
Adds 6 candidate scripts from a systematic ablation study on 1xH100:

- 2026-03-20 SmearGate+BigramHash+SWA candidate (original starting point)
- 2026-03-22 Clean_11L_Int6_QAT: stripped baseline (sliding bpb=1.3761)
- 2026-03-22 Clean_11L_Int6_QAT_BigramSmear: +bigram+smear ablation (1.3796)
- 2026-03-22 Clean_11L_Mixed_Int5Int6_QAT: int5 MLP / int6 attn (1.3874, 11.6MB)
- 2026-03-22 Phase2_A_Warmdown: wallclock-fraction warmdown + SWA (1.3267, best)
- 2026-03-23 Phase2_A_Warmdown_EMA: late-start EMA during warmdown (1.3436)

Key techniques: int6 grouped quantization, QAT (STE fake-quant over the last 15% of training), zstd-22 compression, sliding-window eval with stride=64, a wallclock-fraction warmdown schedule (fixes the buggy iteration-based warmdown), and Muon WD=0.04.

Also updates the root README with non-MLX smoke-test instructions and fixes enable_gqa compatibility for PyTorch <2.5.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
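The int6 grouped quantization plus STE fake-quant combination described above can be sketched roughly as follows. This is a hypothetical illustration, not the submission's actual code: the function name, group size of 64, and per-group absmax scaling are assumptions.

```python
import torch

def fake_quant_grouped(w: torch.Tensor, bits: int = 6, group_size: int = 64) -> torch.Tensor:
    """Straight-through-estimator (STE) fake quantization with per-group scales.

    Hypothetical sketch of QAT for int6 grouped quantization: in the
    forward pass, weights are snapped to a signed int grid per group of
    `group_size` values; in the backward pass gradients flow through
    unchanged (the quantization step is treated as identity).
    """
    orig_shape = w.shape
    g = w.reshape(-1, group_size)
    qmax = 2 ** (bits - 1) - 1  # 31 for int6
    # per-group scale so the largest magnitude maps to +/- qmax
    scale = g.abs().amax(dim=1, keepdim=True).clamp_min(1e-8) / qmax
    q = (g / scale).round().clamp(-qmax - 1, qmax) * scale
    # STE: forward uses quantized values, backward sees identity
    out = g + (q - g).detach()
    return out.reshape(orig_shape)
```

Running the last 15% of training through a wrapper like this lets the network adapt to quantization error before the final int6 export.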
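A wallclock-fraction warmdown keys the learning-rate decay to elapsed time rather than iteration count, so the schedule finishes exactly at the time budget even when throughput varies. A minimal sketch, with all names, the 30% warmdown fraction, and the linear shape assumed rather than taken from the submission:

```python
def warmdown_lr(base_lr: float, elapsed_s: float, budget_s: float,
                warmdown_frac: float = 0.3, min_frac: float = 0.0) -> float:
    """Wallclock-fraction warmdown (hypothetical sketch).

    The LR stays at `base_lr` until the final `warmdown_frac` of the
    wallclock budget, then decays linearly to `base_lr * min_frac` at
    the moment the budget expires, regardless of how many iterations
    fit into that window.
    """
    frac = elapsed_s / budget_s  # fraction of the budget already used
    if frac < 1.0 - warmdown_frac:
        return base_lr
    # position within the warmdown window, clamped to [0, 1]
    p = min(max((frac - (1.0 - warmdown_frac)) / warmdown_frac, 0.0), 1.0)
    return base_lr * ((1.0 - p) + p * min_frac)
```

An iteration-based warmdown with a mis-estimated step budget either truncates the decay or ends early; driving it from `time.time()` deltas avoids both failure modes.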
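The `enable_gqa` kwarg of `torch.nn.functional.scaled_dot_product_attention` only exists from PyTorch 2.5 onward. One compatible pattern (a sketch of the general idea, not necessarily the fix in this PR) is to try the kwarg and fall back to repeating KV heads manually on older versions:

```python
import torch
import torch.nn.functional as F

def sdpa_gqa_compat(q, k, v, **kw):
    """Grouped-query attention that also runs on PyTorch < 2.5.

    q: (B, Hq, T, D); k, v: (B, Hkv, T, D), with Hq a multiple of Hkv.
    Newer PyTorch handles the head-count mismatch via enable_gqa; older
    versions raise TypeError on the unknown kwarg, so we expand the KV
    heads ourselves, which is semantically equivalent.
    """
    try:
        return F.scaled_dot_product_attention(q, k, v, enable_gqa=True, **kw)
    except TypeError:
        rep = q.size(1) // k.size(1)
        k = k.repeat_interleave(rep, dim=1)
        v = v.repeat_interleave(rep, dim=1)
        return F.scaled_dot_product_attention(q, k, v, **kw)
```

The fallback trades a little extra memory (materialized KV copies) for compatibility; numerically the two paths match.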
Community Review — Record: 11L Int6 QAT + Warmdown (val_bpb=1.3267, 1xH100)

BPB: 1.3267 | Compliance: LOOKS CLEAN — pure-neural submission, no TTT/SLOT/n-gram-cache

What I found in the code (head SHA): static code review found no TTT adaptation function, no SLOT optimization loop, no n-gram-cache class, and no pre-quant val-token fine-tune. The eval path uses the standard sliding-window stride-64 pattern. The submission is a pure-neural architecture iteration on the standard SP1024/SP4096/SP8192 baseline.

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.06s, dim=512, layers=11, vocab=1024, code=93784 B, SMOKE_TEST_PASS

Verdict: LOOKS CLEAN. Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual record-track checks (3-seed validation, under-16MB artifact cap, ≤600s train + ≤600s eval on 8×H100 SXM). No compliance flags from the classification pass.

Auto-classification caveat: this review was drafted by the deterministic AST-based classifier. If there is a non-standard eval mechanism (logit postprocessing, hedge mixing, etc.) that I missed because it is factored into a helper file or hidden behind a non-standard function name, please flag it and I will re-run the audit manually.

Reviewed by @MatoTeziTanka — The Agora.
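The "standard sliding-window stride-64 pattern" the review checks for can be sketched as below. This is an illustrative reconstruction under assumptions (function name, window of 1024, a model mapping (1, T) token ids to (1, T, V) logits), not the repo's eval code: each window scores only the targets not covered by the previous window, so every token is evaluated once with long left context, and the mean NLL is converted from nats to bits per byte-pair-token.

```python
import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def sliding_window_bpb(model, tokens: torch.Tensor,
                       window: int = 1024, stride: int = 64) -> float:
    """Sliding-window eval (hypothetical sketch): advance the window by
    `stride` tokens per step and count loss only on freshly covered
    targets, so each target is scored exactly once with up to
    `window - stride` tokens of left context."""
    n_targets = tokens.numel() - 1
    nll_sum, count, prev_end = 0.0, 0, 0
    for begin in range(0, n_targets, stride):
        end = min(begin + window, n_targets)
        x = tokens[begin:end].unsqueeze(0)
        y = tokens[begin + 1:end + 1]
        logits = model(x)[0]
        loss = F.cross_entropy(logits, y, reduction="none")
        fresh = loss[prev_end - begin:]  # targets no earlier window scored
        nll_sum += fresh.sum().item()
        count += fresh.numel()
        prev_end = end
        if end == n_targets:
            break
    return nll_sum / count / math.log(2)  # nats per token -> bits per token
```

With stride=64 each step re-encodes mostly overlapping context, so the eval is roughly `window / stride` times more expensive than a disjoint-chunk eval but gives a tighter bpb estimate.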
Summary
Techniques
Ablation results (1xH100, 600s wallclock)
Test plan
torchrun --nproc_per_node=8

🤖 Generated with Claude Code