Record: 11L EMA + Int6 + XSA + LeakyReLU² + Partial RoPE (val_bpb: 1.1309) #493
parinzee wants to merge 1 commit into openai:main
Conversation
3-seed validation results:
- Seed 42: val_bpb=1.13109, artifact=15,764,564 bytes
- Seed 1337: val_bpb=1.13085, artifact=15,626,741 bytes
- Seed 2024: val_bpb=1.13067, artifact=15,923,256 bytes
- Mean: 1.13087 (std: 0.00017)

Key techniques: 11 layers, GQA (8H/4KV), XSA on last 4 layers, LeakyReLU(0.5)², Partial RoPE (16/64, sketched below), EMA (0.997), int6 quantization, zstd-22 compression, BigramHash(2048,128), warmdown_iters=4500.

Built on baseline by @thwu1 (PR openai#180).
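For reference, a minimal sketch of the partial-RoPE idea (rotary applied to only the first 16 of the 64 head-dim channels), written against a generic PyTorch attention layout; the function name, tensor shapes, and cos/sin tables are assumptions, not the PR's actual code.

```python
import torch

def apply_partial_rope(x, cos, sin, rope_dims=16):
    """Rotate only the first `rope_dims` channels of each head; pass the rest through.

    x:        (batch, heads, seq, head_dim), head_dim = 64 in this record
    cos, sin: (seq, rope_dims // 2) precomputed rotation tables
    The untouched head_dim - rope_dims channels carry position-independent features.
    """
    x_rot, x_pass = x[..., :rope_dims], x[..., rope_dims:]
    x1, x2 = x_rot.chunk(2, dim=-1)                      # pair up channels
    rotated = torch.cat((x1 * cos - x2 * sin,
                         x1 * sin + x2 * cos), dim=-1)   # standard RoPE rotation
    return torch.cat((rotated, x_pass), dim=-1)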
- Replace relu(x)^2 with leaky_relu(x, 0.5)^2 (sketched below)
- PR openai#493 reaches 1.1309 with a partial stack using this activation
- Untried on the full openai#414 stack; could give -0.002 to -0.005 BPB
- Zero param cost, zero speed overhead
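A one-line sketch of the proposed swap, assuming the baseline MLP uses the usual squared-ReLU nonlinearity; function names are illustrative.

```python
import torch.nn.functional as F

# Baseline squared-ReLU activation
def relu_squared(x):
    return F.relu(x).square()

# Drop-in replacement: LeakyReLU(0.5), then square.
# Negative inputs keep a scaled signal (0.5*x, hence 0.25*x**2 after squaring)
# instead of being zeroed, so their gradient stays nonzero; no extra parameters,
# same FLOP count.
def leaky_relu_squared(x, negative_slope=0.5):
    return F.leaky_relu(x, negative_slope).square()
```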
Key changes from studying PR openai#505 (1.1181) and openai#486 (1.0887), summarized as a config diff below:
- train_batch_tokens: 524K → 786K (all top entries use this)
- bigram_hash_buckets: 4096 → 8192 (PR openai#505 uses 8192, openai#493 uses 10240)
- grad_clip_norm: 0.3 → 0.0 (PR openai#505 disables clipping)
- Star-ReLU and TrigramHash enabled in all run scripts
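A hedged sketch of those changes expressed as a config diff; the key names mirror the bullets above and the power-of-two token counts are assumptions, not the repo's actual flag names.

```python
# Hypothetical run-script deltas; actual flag names in the repo may differ.
baseline_cfg = {
    "train_batch_tokens": 524_288,    # "524K", assuming the usual 2**19
    "bigram_hash_buckets": 4096,
    "grad_clip_norm": 0.3,
    "star_relu": False,
    "trigram_hash": False,
}

updated_cfg = {
    **baseline_cfg,
    "train_batch_tokens": 786_432,    # "786K" = 1.5 * 2**19, used by the top entries
    "bigram_hash_buckets": 8192,      # PR openai#505's value (openai#493 uses 10240)
    "grad_clip_norm": 0.0,            # 0.0 interpreted as "clipping disabled"
    "star_relu": True,                # enabled in all run scripts
    "trigram_hash": True,
}
```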
3-seed mean: 1.1178 BPB (std 0.0005), ~15.75 MB artifact, 8×H100 SXM.

Novel contribution: Late Soft-Round QAT, which replaces the STE identity surrogate with a sigmoid soft-round in the backward pass during the final 2% of training, giving bin-aware gradients that settle weights onto int6 grid points (sketched below).

Built on PR openai#414 (base model), PR openai#461 (TTT recipe), PR openai#493 (LeakyReLU²).
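A minimal sketch of what a "late soft-round" int6 fake-quant could look like; the class name, the temperature, and the symmetric [-32, 31] range are assumptions based only on the description above, not the submission's code.

```python
import torch

class LateSoftRoundQuant(torch.autograd.Function):
    """Int6 fake-quant: hard round in the forward pass, soft-round gradient in backward.

    A plain STE passes grad_output through unchanged (identity surrogate).
    Here the backward is scaled by the derivative of a sigmoid "soft round"
    of w/scale, which is smallest for weights already near an int6 grid point
    and largest mid-bin, so weights settle onto the grid instead of drifting
    across bin boundaries.
    """

    @staticmethod
    def forward(ctx, w, scale, temp=8.0):
        ctx.save_for_backward(w, scale)
        ctx.temp = temp
        q = torch.clamp(torch.round(w / scale), -32, 31)   # signed int6 range
        return q * scale

    @staticmethod
    def backward(ctx, grad_output):
        w, scale = ctx.saved_tensors
        frac = (w / scale) - torch.floor(w / scale)         # position inside the bin
        s = torch.sigmoid(ctx.temp * (frac - 0.5))
        d_soft = ctx.temp * s * (1.0 - s)                   # d/dx of the sigmoid step
        return grad_output * d_soft, None, None

# Per the description, this would be switched on only for the final ~2% of
# iterations; before that, a standard STE (identity backward) fake-quant is used.
```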
- Interleaved draft tokens: soft predictions placed between real tokens for 1-2 token lookahead via standard causal attention (sketched below)
- SmearGate and BigramHash naturally gain future context on the interleaved sequence
- Bigram noise curriculum: drafts anneal from ground truth to realistic noise
- Two-pass eval: pass 1 generates drafts, pass 2 refines with interleaving
- LeakyReLU(0.5)² activation toggle (free -0.003 BPB from PR openai#493)
- W&B logging (opt-in via WANDB_PROJECT env var)
- Sweep runner with 13 configs covering baselines, draft variants, and ablations

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
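A minimal sketch of the interleaving step at the token-id level, assuming one draft slot per real token; the PR describes soft (distribution-valued) drafts and a noise curriculum, which are omitted here, and the function name is illustrative.

```python
import torch

def interleave_drafts(real_ids, draft_ids):
    """Interleave draft guesses between real tokens.

    real_ids:  (B, T) ground-truth tokens
    draft_ids: (B, T) a draft guess for the token *after* each position
    returns:   (B, 2T) sequence [x_0, d_0, x_1, d_1, ...]

    Under ordinary causal attention over this sequence, the position that
    predicts x_{t+1} can also attend to d_t, i.e. a one-token-lookahead draft;
    SmearGate/BigramHash features computed on the interleaved sequence pick up
    that future context for free.
    """
    B, T = real_ids.shape
    out = torch.empty(B, 2 * T, dtype=real_ids.dtype, device=real_ids.device)
    out[:, 0::2] = real_ids
    out[:, 1::2] = draft_ids
    return out
```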
Community Review - Record: 11L EMA + Int6 + XSA + LeakyReLU² + Partial RoPE (val_bpb: 1.1309)

BPB: 1.1309 | Compliance: LOOKS CLEAN (pure-neural submission, no TTT/SLOT/n-gram-cache)

What I found in the code (head SHA …): static code review found no TTT adaptation function, no SLOT optimization loop, no n-gram-cache class, and no pre-quant val-token fine-tune. The eval path uses the standard sliding-window stride-64 pattern (sketched below). The submission is a pure-neural architecture iteration on the standard SP1024/SP4096/SP8192 baseline.

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.03s, dim=512, layers=10, vocab=1024, code=53603 B, SMOKE_TEST_PASS.

Verdict: LOOKS CLEAN. Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual record-track checks (3-seed validation, under-16MB artifact cap, ≤600s train + ≤600s eval on 8×H100 SXM). No compliance flags from the classification pass; this looks like a clean pure-neural iteration on the standard baseline.

Auto-classification caveat: this review was drafted by the AST-based classifier. If there's a non-standard eval mechanism (logit postprocessing, hedge mixing, etc.) that I missed because it's factored into a helper file or a non-standard function name, please flag it and I'll re-run the audit manually.

Reviewed by @MatoTeziTanka (The Agora). Classification via deterministic AST-based pass.
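For context, a hedged sketch of the sliding-window, stride-64 evaluation pattern the review refers to; the window size, function names, and bits/BPB bookkeeping are assumptions, not the submission's eval code.

```python
import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def sliding_window_bits(model, ids, window=1024, stride=64):
    """Total negative log-likelihood (in bits) of `ids` under `model`,
    using overlapping windows advanced by `stride` tokens.

    Each token is scored exactly once, but (except near the start) always
    with up to `window - stride` tokens of preceding context.
    Divide the result by the byte length of the eval text to get BPB.
    """
    total_nll, prev_end = 0.0, 0
    for begin in range(0, ids.size(0) - 1, stride):
        end = min(begin + window, ids.size(0) - 1)
        logits = model(ids[begin:end].unsqueeze(0))        # (1, L, vocab)
        targets = ids[begin + 1:end + 1]
        nll = F.cross_entropy(logits[0], targets, reduction="none")
        new = end - prev_end                               # tokens not yet scored
        total_nll += nll[-new:].sum().item()
        prev_end = end
        if end == ids.size(0) - 1:
            break
    return total_nll / math.log(2)                         # nats -> bits

# e.g. val_bpb = sliding_window_bits(model, val_ids) / num_val_bytes
```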