Record: 7L MLP3x + BigramHash + SmearGate + TTT 5ep (mean val_bpb=1.1327) by sofiabod · Pull Request #489 · openai/parameter-golf

sofiabod · 2026-03-23T01:26:45Z

Record: 7L MLP3x + BigramHash + SmearGate + TTT 5ep (mean val_bpb=1.1327)

Summary

Mean val_bpb 1.1327 (3-seed), best 1.1314 — beats prior SOTA 1.1428 by -0.010
BigramHash(2048) + SmearGate + partial RoPE + depth damping + AdamW TTT 5ep
Training: ~10,480 steps in 600s on 8xH100, eval: TTT 106s + sliding window 233s

Approach

7L d=512 transformer with MLP 3x ReLU², tied embeddings (vocab 1024), int8+zlib compression.

Key techniques stacked on top of baseline:

Architecture:

BigramHash(2048, dim=128): hash consecutive token pairs into learned embeddings, additive before RMSNorm
SmearGate: per-dimension learned gate blending each token with previous token
Partial RoPE (16/64 dims): rotary embeddings on 25% of head dimensions, rest position-free
LN scale depth damping: init attn/mlp scales to 1/sqrt(layer_idx+1)
Sequence length 4096 for training and evaluation

Optimizer:

Muon with weight decay 0.04, momentum 0.99
Tied embedding lr=0.01, matrix lr=0.03
Warmdown 6000 iters, logit softcap 15

Evaluation:

Test-time training: AdamW(lr=0.0005, wd=0.0) for 5 epochs on validation tokens, DDP-synced
Sliding window evaluation with stride=64

Results (3-seed, sliding window stride=64)

Seed	Steps	val_bpb
1337	10482	1.1323
42	10488	1.1314
7	10470	1.1343
Mean±Std		1.1327 ± 0.0015

Comparison to prior SOTA

Metric	Prior SOTA (thwu1)	Ours
Mean BPB	1.1428	1.1327
Architecture	10L Int5-MLP	7L MLP3x
Token tricks	BigramHash(10240)	BigramHash(2048) + SmearGate
Quantization	Int5/Int6 + zstd	Int8 + zlib
TTT	None	AdamW 5ep
Eval	Standard	Sliding window stride=64

Run command

SEED=1337 torchrun --standalone --nproc_per_node=8 train_gpt.py

All hyperparameters are set as defaults in train_gpt.py.

- add BigramHash(2048,128) with zero-init and learnable scale - add SmearGate: per-dim gate blending with prev token - weight decay 0.04 on Muon (leaderboard standard) - muon_momentum 0.99 (from 0.95, leaderboard standard) - best config baked in: 7L mlp_mult=3 seq_len=4096 etc - bigram/smear params explicitly added to optimizer groups

- add forward_logits() method to GPT for eval without loss computation - add eval_val_sliding() with configurable stride (default 64) - each scored token gets ~4032 tokens of context instead of ~2048 average - eval-only change: no training modifications, no artifact size change - expected ~0.03 BPB improvement in reported score

- init attn_scale and mlp_scale to 1/sqrt(layer_idx+1) instead of 1.0 - deeper layers get smaller residual contributions, stabilizes training - zero extra params, zero compute overhead - used by all top submissions per vault research

- apply rotary embeddings to first 16 dims of 64 head_dim (25%) - remaining 48 dims are position-free, improving generalization - zero extra params, used by all top submissions per vault research - configurable via ROPE_DIMS env var (0=all, default=16)

- TTT: 5 epochs at lr=0.0005 (matching SOTA PR openai#442) - use DDP model for TTT forward pass to sync gradients across GPUs - shard validation tokens across ranks for proper distributed TTT - batch size 4 seqs/GPU, modal timeout 1800s

…327)

MatoTeziTanka · 2026-04-11T20:07:23Z

Community Review — Record: 7L MLP3x + BigramHash + SmearGate + TTT 5ep (mean val_bpb=1.1327)

BPB: 1.1327 | Compliance: LOOKS CLEAN — pure-neural submission, no TTT/SLOT/n-gram-cache

What I found in the code (head SHA ea8f7c42cf10, file records/track_10min_16mb/2026-03-22_7L-BigramHash-TTT/train_gpt.py):

Static code review found no TTT adaptation function, no SLOT optimization loop, no n-gram-cache class, and no pre-quant val-token fine-tune. The eval path uses the standard sliding-window stride-64 pattern. The submission is a pure-neural architecture iteration on the standard SP1024/SP4096/SP8192 baseline.

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.03s, dim=512, layers=7, vocab=1024, code=58707 B, SMOKE_TEST_PASS

Verdict: LOOKS CLEAN.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual record-track checks (3-seed validation, under-16MB artifact cap, ≤600s train + ≤600s eval on 8×H100 SXM). No compliance flags from the classification pass — this looks like a clean pure-neural iteration on the standard baseline.

Auto-classification caveat: this review was drafted by the AST-based classifier. If there's a non-standard eval mechanism (logit postprocessing, hedge mixing, etc.) that I missed because it's factored into a helper file or a non-standard function name, please flag it and I'll re-run the audit manually.

Reviewed by @MatoTeziTanka — The Agora. CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.03s, dim=512, layers=7, vocab=1024, code=58707 B, SMOKE_TEST_PASS. Classification via deterministic AST-based classify_prs.py (pattern bank derived from ~65 manually-reviewed PRs earlier in the 2026-04-11 sweep). This review was auto-drafted from a template and spot-checked before posting — if the template misread your code, please call it out so I can iterate the classifier.

sofiabod added 9 commits March 18, 2026 14:34

initial

45422a6

add modal launcher for 8xh100 training

f13c234

fix md + tests

7df4c4b

Record: 7L MLP3x + BigramHash + SmearGate + TTT 5ep (mean val_bpb=1.1…

ea8f7c4

…327)

notapplica mentioned this pull request Mar 23, 2026

Parameter Golf Formerly Live AI Commentary ⛳ + Analysis / Ideas | every 10 minutes. Now disabled #140

Closed

gb250e mentioned this pull request Apr 2, 2026

TPI-020 eval-first monkey model external send required gate gb250e/parameter-golf#19

Draft

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Record: 7L MLP3x + BigramHash + SmearGate + TTT 5ep (mean val_bpb=1.1327)#489

Record: 7L MLP3x + BigramHash + SmearGate + TTT 5ep (mean val_bpb=1.1327)#489
sofiabod wants to merge 9 commits intoopenai:mainfrom
sofiabod:submission/7L-BigramHash-TTT-1.1327

sofiabod commented Mar 23, 2026

Uh oh!

MatoTeziTanka commented Apr 11, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

sofiabod commented Mar 23, 2026

Record: 7L MLP3x + BigramHash + SmearGate + TTT 5ep (mean val_bpb=1.1327)

Summary

Approach

Results (3-seed, sliding window stride=64)

Comparison to prior SOTA

Run command

Uh oh!

MatoTeziTanka commented Apr 11, 2026

Community Review — Record: 7L MLP3x + BigramHash + SmearGate + TTT 5ep (mean val_bpb=1.1327)

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants