Record: 7L MLP3x + BigramHash + SmearGate + TTT 5ep (mean val_bpb=1.1327) #489
sofiabod wants to merge 9 commits into openai:main
Conversation
- add BigramHash(2048,128) with zero-init and learnable scale
- add SmearGate: per-dim gate blending with prev token
- weight decay 0.04 on Muon (leaderboard standard)
- muon_momentum 0.99 (from 0.95, leaderboard standard)
- best config baked in: 7L mlp_mult=3 seq_len=4096 etc
- bigram/smear params explicitly added to optimizer groups
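The two new modules above can be sketched as follows. This is a minimal PyTorch sketch, not the PR's code: the bigram hash function (`prev * prime + cur mod buckets`), the gate initialization, and the module interfaces are all assumptions; only the shapes (2048 buckets, 128 dims), the zero-init, and the learnable scale come from the commit message.

```python
import torch
import torch.nn as nn

class BigramHash(nn.Module):
    """Hashed bigram embedding: each (prev_token, token) pair is hashed into
    one of `num_buckets` rows. Zero-init table plus a zero-init learnable
    scale means the module starts as an exact no-op."""
    def __init__(self, num_buckets=2048, dim=128, prime=1000003):
        super().__init__()
        self.table = nn.Embedding(num_buckets, dim)
        nn.init.zeros_(self.table.weight)           # zero-init (from the commit)
        self.scale = nn.Parameter(torch.zeros(1))   # learnable scale (from the commit)
        self.num_buckets = num_buckets
        self.prime = prime                          # hash constant: an assumption

    def forward(self, tokens):                      # tokens: (B, T) int64
        prev = torch.roll(tokens, 1, dims=1)
        prev[:, 0] = 0                              # no true bigram at position 0
        idx = (prev * self.prime + tokens) % self.num_buckets
        return self.scale * self.table(idx)         # (B, T, dim)

class SmearGate(nn.Module):
    """Per-dim gate blending each position with the previous token's
    activation: out_t = g * x_{t-1} + (1 - g) * x_t, g = sigmoid(param)."""
    def __init__(self, dim):
        super().__init__()
        # init near zero gate (sigmoid(-4) ~ 0.018) so smearing starts
        # almost off; the exact init is an assumption
        self.gate = nn.Parameter(torch.full((dim,), -4.0))

    def forward(self, x):                           # x: (B, T, D)
        g = torch.sigmoid(self.gate)
        prev = torch.roll(x, 1, dims=1)
        prev[:, 0] = 0
        return g * prev + (1 - g) * x
```

At init both modules leave the network's function unchanged, which is why the commit can bake them in without destabilizing the pretrained configuration.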
- add forward_logits() method to GPT for eval without loss computation
- add eval_val_sliding() with configurable stride (default 64)
- each scored token gets ~4032 tokens of context instead of ~2048 average
- eval-only change: no training modifications, no artifact size change
- expected ~0.03 BPB improvement in reported score
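The window arithmetic behind the ~4032-token claim can be shown with a small window planner. This is an illustrative sketch of the sliding-window pattern, not the PR's `eval_val_sliding` internals: after the first window, each window of 4096 tokens scores only the trailing 64 fresh tokens, so the first scored token in every window already has 4096 − 64 = 4032 tokens of left context.

```python
def sliding_windows(n_tokens, window=4096, stride=64):
    """Return (start, end, score_from) triples: the window covers tokens
    [start, end) but only tokens [score_from, end) contribute to BPB.
    The first window scores everything; later windows advance by `stride`
    and score only the newly uncovered tokens, so every token is scored
    exactly once."""
    prev_end = min(window, n_tokens)
    plans = [(0, prev_end, 0)]
    while prev_end < n_tokens:
        end = min(prev_end + stride, n_tokens)
        plans.append((max(0, end - window), end, prev_end))
        prev_end = end
    return plans
```

With a contiguous full-context forward pass the average context would instead be about half the sequence length, which is where the "~2048 average" baseline figure comes from.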
- init attn_scale and mlp_scale to 1/sqrt(layer_idx+1) instead of 1.0
- deeper layers get smaller residual contributions, stabilizes training
- zero extra params, zero compute overhead
- used by all top submissions per vault research
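The depth-dependent init is a one-liner; a minimal sketch (the helper name is hypothetical, the formula is from the commit):

```python
import math

def residual_scale_init(n_layers=7):
    """Initial values for attn_scale / mlp_scale per layer:
    1/sqrt(layer_idx + 1), so deeper layers start with smaller
    residual contributions. Costs no extra params or compute since
    these scales already exist and only their init changes."""
    return [1.0 / math.sqrt(i + 1) for i in range(n_layers)]
```

For the 7-layer model this gives scales from 1.0 at layer 0 down to about 0.378 at layer 6.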
- apply rotary embeddings to first 16 dims of 64 head_dim (25%)
- remaining 48 dims are position-free, improving generalization
- zero extra params, used by all top submissions per vault research
- configurable via ROPE_DIMS env var (0=all, default=16)
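A NumPy sketch of partial RoPE follows. Only the first `rope_dims` of each head are rotated; the rest pass through untouched. The half-split pairing convention and the frequency base are assumptions; the 16-of-64 split and the "0 means full RoPE" convention are from the commit.

```python
import numpy as np

def partial_rope(x, rope_dims=16, base=10000.0):
    """Apply rotary position embeddings to the first `rope_dims` of
    head_dim, leaving the remaining dims position-free.
    x: (seq_len, head_dim). rope_dims=0 means rotate all dims,
    matching the ROPE_DIMS env-var convention."""
    if rope_dims == 0:
        rope_dims = x.shape[1]
    half = rope_dims // 2
    inv_freq = base ** (-np.arange(half) / half)         # (half,)
    ang = np.arange(x.shape[0])[:, None] * inv_freq      # (seq, half)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, :half], x[:, half:rope_dims]           # rotated pair halves
    return np.concatenate([x1 * cos - x2 * sin,          # rotate first half
                           x1 * sin + x2 * cos,          # rotate second half
                           x[:, rope_dims:]], axis=1)    # position-free dims
```

Rotation preserves the norm of each rotated pair, so the transform is purely positional: position 0 is returned unchanged and the untouched 48 dims are identical at every position.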
- TTT: 5 epochs at lr=0.0005 (matching SOTA PR openai#442)
- use DDP model for TTT forward pass to sync gradients across GPUs
- shard validation tokens across ranks for proper distributed TTT
- batch size 4 seqs/GPU, modal timeout 1800s
Community Review — Record: 7L MLP3x + BigramHash + SmearGate + TTT 5ep (mean val_bpb=1.1327)

BPB: 1.1327 | Compliance: LOOKS CLEAN — pure-neural submission, no TTT/SLOT/n-gram-cache

What I found in the code (head SHA): Static code review found no TTT adaptation function, no SLOT optimization loop, no n-gram-cache class, and no pre-quant val-token fine-tune. The eval path uses the standard sliding-window stride-64 pattern. The submission is a pure-neural architecture iteration on the standard SP1024/SP4096/SP8192 baseline.

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.03s, dim=512, layers=7, vocab=1024, code=58707 B, SMOKE_TEST_PASS

Verdict: LOOKS CLEAN. Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual record-track checks (3-seed validation, under-16MB artifact cap, ≤600s train + ≤600s eval on 8×H100 SXM). No compliance flags from the classification pass — this looks like a clean pure-neural iteration on the standard baseline.

Auto-classification caveat: this review was drafted by the AST-based classifier. If there's a non-standard eval mechanism (logit postprocessing, hedge mixing, etc.) that I missed because it's factored into a helper file or a non-standard function name, please flag it and I'll re-run the audit manually.

Reviewed by @MatoTeziTanka — The Agora.
Summary
Approach
7L d=512 transformer with MLP 3x ReLU², tied embeddings (vocab 1024), int8+zlib compression.
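The artifact scheme named above (int8+zlib) can be sketched as symmetric per-tensor quantization followed by DEFLATE. A minimal sketch, assuming per-tensor scaling and zlib level 9, neither of which is specified in the PR:

```python
import zlib
import numpy as np

def compress_weights(w):
    """Symmetric int8 quantization + zlib of a float weight tensor.
    Returns (compressed blob, dequantization scale)."""
    scale = max(float(np.abs(w).max()) / 127.0, 1e-12)  # avoid div-by-zero
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return zlib.compress(q.tobytes(), 9), scale

def decompress_weights(blob, scale, shape):
    """Inverse: decompress, reinterpret as int8, dequantize to float32."""
    q = np.frombuffer(zlib.decompress(blob), dtype=np.int8)
    return q.reshape(shape).astype(np.float32) * scale
```

The round-trip error is bounded by half a quantization step (0.5 × scale) per weight, which is why an int8 artifact can stay under a tight size cap with little BPB cost.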
Key techniques stacked on top of baseline:
- Architecture: BigramHash(2048,128) hashed bigram embeddings, per-dim SmearGate, 1/sqrt(layer_idx+1) residual-scale init, partial RoPE (first 16 of 64 head dims)
- Optimizer: Muon with weight decay 0.04 and momentum 0.99
- Evaluation: sliding-window eval (stride 64) plus TTT (5 epochs at lr=0.0005)
Results (3-seed, sliding window stride=64)
Mean val_bpb: 1.1327.
Comparison to prior SOTA
Run command
All hyperparameters are set as defaults in `train_gpt.py`.