Record: 11L EMA + Int6 + XSA + LeakyReLU² + Partial RoPE (val_bpb: 1.1309) #493
parinzee wants to merge 1 commit into openai:main
Conversation
3-seed validation results:
- Seed 42: val_bpb=1.13109, artifact=15,764,564 bytes
- Seed 1337: val_bpb=1.13085, artifact=15,626,741 bytes
- Seed 2024: val_bpb=1.13067, artifact=15,923,256 bytes
- Mean: 1.13087 (std: 0.00017)

Key techniques: 11 layers, GQA (8H/4KV), XSA on last 4 layers, LeakyReLU(0.5)², Partial RoPE (16/64, sketched below), EMA (0.997), int6 quantization, zstd-22 compression, BigramHash(2048,128), warmdown_iters=4500.

Built on baseline by @thwu1 (PR openai#180).
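For reference, a minimal sketch of the partial-RoPE idea (rotary applied to only the first 16 of the 64 head-dim channels), written against a generic PyTorch attention layout; the function name, tensor shapes, and cos/sin tables are assumptions, not the PR's actual code.

```python
import torch

def apply_partial_rope(x, cos, sin, rope_dims=16):
    """Rotate only the first `rope_dims` channels of each head; pass the rest through.

    x:        (batch, heads, seq, head_dim), head_dim = 64 in this record
    cos, sin: (seq, rope_dims // 2) precomputed rotation tables
    The untouched head_dim - rope_dims channels carry position-independent features.
    """
    x_rot, x_pass = x[..., :rope_dims], x[..., rope_dims:]
    x1, x2 = x_rot.chunk(2, dim=-1)                      # pair up channels
    rotated = torch.cat((x1 * cos - x2 * sin,
                         x1 * sin + x2 * cos), dim=-1)   # standard RoPE rotation
    return torch.cat((rotated, x_pass), dim=-1)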
- Replace relu(x)^2 with leaky_relu(x, 0.5)^2 (sketched below)
- PR openai#493 reaches 1.1309 with a partial stack using this activation
- Untried on the full openai#414 stack; could give -0.002 to -0.005 BPB
- Zero param cost, zero speed overhead
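A one-line sketch of the proposed swap, assuming the baseline MLP uses the usual squared-ReLU nonlinearity; function names are illustrative.

```python
import torch.nn.functional as F

# Baseline squared-ReLU activation
def relu_squared(x):
    return F.relu(x).square()

# Drop-in replacement: LeakyReLU(0.5), then square.
# Negative inputs keep a scaled signal (0.5*x, hence 0.25*x**2 after squaring)
# instead of being zeroed, so their gradient stays nonzero; no extra parameters,
# same FLOP count.
def leaky_relu_squared(x, negative_slope=0.5):
    return F.leaky_relu(x, negative_slope).square()
```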
Key changes from studying PR openai#505 (1.1181) and openai#486 (1.0887), summarized as a config diff below:
- train_batch_tokens: 524K → 786K (all top entries use this)
- bigram_hash_buckets: 4096 → 8192 (PR openai#505 uses 8192, openai#493 uses 10240)
- grad_clip_norm: 0.3 → 0.0 (PR openai#505 disables clipping)
- Star-ReLU and TrigramHash enabled in all run scripts
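A hedged sketch of those changes expressed as a config diff; the key names mirror the bullets above and the power-of-two token counts are assumptions, not the repo's actual flag names.

```python
# Hypothetical run-script deltas; actual flag names in the repo may differ.
baseline_cfg = {
    "train_batch_tokens": 524_288,    # "524K", assuming the usual 2**19
    "bigram_hash_buckets": 4096,
    "grad_clip_norm": 0.3,
    "star_relu": False,
    "trigram_hash": False,
}

updated_cfg = {
    **baseline_cfg,
    "train_batch_tokens": 786_432,    # "786K" = 1.5 * 2**19, used by the top entries
    "bigram_hash_buckets": 8192,      # PR openai#505's value (openai#493 uses 10240)
    "grad_clip_norm": 0.0,            # 0.0 interpreted as "clipping disabled"
    "star_relu": True,                # enabled in all run scripts
    "trigram_hash": True,
}
```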
3-seed mean: 1.1178 BPB (std 0.0005), ~15.75 MB artifact, 8×H100 SXM.

Novel contribution: Late Soft-Round QAT, which replaces the STE identity surrogate with a sigmoid soft-round in the backward pass during the final 2% of training, giving bin-aware gradients that settle weights onto int6 grid points (sketched below).

Built on PR openai#414 (base model), PR openai#461 (TTT recipe), PR openai#493 (LeakyReLU²).
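A minimal sketch of what a "late soft-round" int6 fake-quant could look like; the class name, the temperature, and the symmetric [-32, 31] range are assumptions based only on the description above, not the submission's code.

```python
import torch

class LateSoftRoundQuant(torch.autograd.Function):
    """Int6 fake-quant: hard round in the forward pass, soft-round gradient in backward.

    A plain STE passes grad_output through unchanged (identity surrogate).
    Here the backward is scaled by the derivative of a sigmoid "soft round"
    of w/scale, which is smallest for weights already near an int6 grid point
    and largest mid-bin, so weights settle onto the grid instead of drifting
    across bin boundaries.
    """

    @staticmethod
    def forward(ctx, w, scale, temp=8.0):
        ctx.save_for_backward(w, scale)
        ctx.temp = temp
        q = torch.clamp(torch.round(w / scale), -32, 31)   # signed int6 range
        return q * scale

    @staticmethod
    def backward(ctx, grad_output):
        w, scale = ctx.saved_tensors
        frac = (w / scale) - torch.floor(w / scale)         # position inside the bin
        s = torch.sigmoid(ctx.temp * (frac - 0.5))
        d_soft = ctx.temp * s * (1.0 - s)                   # d/dx of the sigmoid step
        return grad_output * d_soft, None, None

# Per the description, this would be switched on only for the final ~2% of
# iterations; before that, a standard STE (identity backward) fake-quant is used.
```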
- Interleaved draft tokens: soft predictions placed between real tokens for 1-2 token lookahead via standard causal attention (sketched below)
- SmearGate and BigramHash naturally gain future context on the interleaved sequence
- Bigram noise curriculum: drafts anneal from ground truth to realistic noise
- Two-pass eval: pass 1 generates drafts, pass 2 refines with interleaving
- LeakyReLU(0.5)² activation toggle (free -0.003 BPB from PR openai#493)
- W&B logging (opt-in via WANDB_PROJECT env var)
- Sweep runner with 13 configs covering baselines, draft variants, and ablations

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
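A minimal sketch of the interleaving step at the token-id level, assuming one draft slot per real token; the PR describes soft (distribution-valued) drafts and a noise curriculum, which are omitted here, and the function name is illustrative.

```python
import torch

def interleave_drafts(real_ids, draft_ids):
    """Interleave draft guesses between real tokens.

    real_ids:  (B, T) ground-truth tokens
    draft_ids: (B, T) a draft guess for the token *after* each position
    returns:   (B, 2T) sequence [x_0, d_0, x_1, d_1, ...]

    Under ordinary causal attention over this sequence, the position that
    predicts x_{t+1} can also attend to d_t, i.e. a one-token-lookahead draft;
    SmearGate/BigramHash features computed on the interleaved sequence pick up
    that future context for free.
    """
    B, T = real_ids.shape
    out = torch.empty(B, 2 * T, dtype=real_ids.dtype, device=real_ids.device)
    out[:, 0::2] = real_ids
    out[:, 1::2] = draft_ids
    return out
```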
Community Review - Record: 11L EMA + Int6 + XSA + LeakyReLU² + Partial RoPE (val_bpb: 1.1309)

BPB: 1.1309 | Compliance: LOOKS CLEAN (pure-neural submission, no TTT/SLOT/n-gram-cache)

What I found in the code (head SHA …): static code review found no TTT adaptation function, no SLOT optimization loop, no n-gram-cache class, and no pre-quant val-token fine-tune. The eval path uses the standard sliding-window stride-64 pattern (sketched below). The submission is a pure-neural architecture iteration on the standard SP1024/SP4096/SP8192 baseline.

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.03s, dim=512, layers=10, vocab=1024, code=53603 B, SMOKE_TEST_PASS.

Verdict: LOOKS CLEAN. Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual record-track checks (3-seed validation, under-16MB artifact cap, ≤600s train + ≤600s eval on 8×H100 SXM). No compliance flags from the classification pass; this looks like a clean pure-neural iteration on the standard baseline.

Auto-classification caveat: this review was drafted by the AST-based classifier. If there's a non-standard eval mechanism (logit postprocessing, hedge mixing, etc.) that I missed because it's factored into a helper file or a non-standard function name, please flag it and I'll re-run the audit manually.

Reviewed by @MatoTeziTanka (The Agora). Classification via deterministic AST-based pass.
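For context, a hedged sketch of the sliding-window, stride-64 evaluation pattern the review refers to; the window size, function names, and bits/BPB bookkeeping are assumptions, not the submission's eval code.

```python
import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def sliding_window_bits(model, ids, window=1024, stride=64):
    """Total negative log-likelihood (in bits) of `ids` under `model`,
    using overlapping windows advanced by `stride` tokens.

    Each token is scored exactly once, but (except near the start) always
    with up to `window - stride` tokens of preceding context.
    Divide the result by the byte length of the eval text to get BPB.
    """
    total_nll, prev_end = 0.0, 0
    for begin in range(0, ids.size(0) - 1, stride):
        end = min(begin + window, ids.size(0) - 1)
        logits = model(ids[begin:end].unsqueeze(0))        # (1, L, vocab)
        targets = ids[begin + 1:end + 1]
        nll = F.cross_entropy(logits[0], targets, reduction="none")
        new = end - prev_end                               # tokens not yet scored
        total_nll += nll[-new:].sum().item()
        prev_end = end
        if end == ids.size(0) - 1:
            break
    return total_nll / math.log(2)                         # nats -> bits

# e.g. val_bpb = sliding_window_bits(model, val_ids) / num_val_bytes
```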