
Non-record: Value Residual (-0.015 BPB) + Gated Attention (-0.003 BPB) on 11L Production Stack#487

Open
anantdgoel wants to merge 1 commit into openai:main from anantdgoel:non-record-vr-ga-production

Conversation

@anantdgoel

val_bpb: 1.1720 | 19.4 MB (unlimited compute) | 1xA6000, 9500 steps, 14.5hr

Summary

  • Value Residual (ResFormer, arXiv:2410.17897): caches layer-0 V vectors and mixes them into subsequent layers via learnable scalars. -0.015 BPB, 22 params added.
  • Gated Attention (arXiv:2505.06708): per-head sigmoid gate after SDPA, eliminates attention sinks. -0.003 BPB, ~37K params added.
  • Techniques stack additively (-0.0172 combined), validated via controlled ablation on 9L baseline.
  • Full community meta-stack: 11L MLP3x + SmearGate + BigramHash(2048) + OrthoInit + WD0.04 + XSA(4) + EMA(0.997) + Partial RoPE + LN Scale + Logit Softcap.
  • Both techniques independently adopted by 5+ community submissions, including a record-tier entry (1.1101 BPB).
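The two techniques can be sketched in one attention block. This is a minimal illustration, not the PR's actual train_gpt.py implementation: the module name, the sigmoid parameterization of the mixing scalar, and the input-conditioned gate projection are all assumptions; only the overall pattern (layer-0 V cache mixed in via a learnable scalar, per-head sigmoid gate applied after SDPA) comes from the summary above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VRGatedAttention(nn.Module):
    """Illustrative attention block: value residual + per-head output gate.
    Names and shapes are assumptions, not the PR's exact implementation."""
    def __init__(self, dim, n_heads):
        super().__init__()
        self.n_heads = n_heads
        self.head_dim = dim // n_heads
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.proj = nn.Linear(dim, dim, bias=False)
        # Value Residual: one learnable mixing scalar per layer
        self.vr_lambda = nn.Parameter(torch.tensor(0.5))
        # Gated Attention: per-head sigmoid gate computed from the input
        self.gate = nn.Linear(dim, n_heads, bias=True)

    def forward(self, x, v0=None):
        B, T, D = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = q.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        k = k.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        v = v.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        if v0 is None:
            v0 = v  # layer 0: cache its own V for later layers
        else:
            lam = torch.sigmoid(self.vr_lambda)  # keep the mix in (0, 1)
            v = lam * v + (1 - lam) * v0         # blend in layer-0 values
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        # per-head sigmoid gate after SDPA (suppresses sink-dominated heads)
        g = torch.sigmoid(self.gate(x)).transpose(1, 2).unsqueeze(-1)  # (B, H, T, 1)
        out = (g * out).transpose(1, 2).reshape(B, T, D)
        return self.proj(out), v0
```

The gate projection accounts for roughly dim x n_heads parameters (the "~37K" scale quoted above), while the value-residual scalars add only a couple of parameters per layer.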

Ablation (9L v1024, 1000 steps, 131K batch, 1x3090)

| Config | val_bpb | Delta |
| --- | --- | --- |
| Control | 1.4697 | — |
| + Gated Attention | 1.4665 | -0.0032 |
| + Value Residual | 1.4546 | -0.0151 |
| + Both | 1.4525 | -0.0172 |

Production Results

| Metric | Value |
| --- | --- |
| Pre-quant val_bpb | 1.1710 |
| Post-quant val_bpb | 1.1720 |
| Quant gap | 0.0010 |
| Artifact | 19.4 MB |

Files

  • README.md — full writeup with ablations and reproducibility command
  • submission.json — metadata
  • train_gpt.py — training script
  • train.log — complete training log

…) on 11L production stack

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@MatoTeziTanka

Community Review — Non-record: Value Residual (-0.015 BPB) + Gated Attention (-0.003 BPB) on 11L Production Stack

Compliance: LOOKS CLEAN — pure-neural submission, no TTT/SLOT/n-gram-cache

PR: #487 — "Non-record: Value Residual (-0.015 BPB) + Gated Attention (-0.003 BPB) on 11L Production Stack"
Author: anantdgoel
Head SHA: be5355d
Track: non-record-unlimited-compute-16mb
Reported val_bpb: 1.17197837


Check 1: N-gram Family Bug (CLOSE trigger)

The BigramHashEmbedding at lines 1018–1023 hashes position i using:

out[..., 1:] = torch.bitwise_xor(36313 * t[..., 1:], 27191 * t[..., :-1]) % mod

This combines t[i] (current token, position 1:) with t[i-1] (previous token, position :-1). The hash at position i does not incorporate t[i+1] (the future/target token). Position 0 is set to mod (fallback). This is the legal BigramHash pattern — the key does not leak the target token. NOT a bug. NO CLOSE.

Note: BIGRAM_HASH defaults to "0" (disabled). The feature is off unless explicitly enabled. Even if enabled, construction is legal.
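The causality argument above can be checked directly. This is a reconstruction from the single quoted line, not the full BigramHashEmbedding: the constants and the position-0 fallback come from the review, and `mod=2048` matches the BigramHash(2048) in the meta-stack; the function name is invented for illustration.

```python
import torch

def bigram_hash(t, mod=2048):
    """Reconstruction of the reviewed hash: position i combines t[i]
    with t[i-1] only; position 0 falls back to the sentinel index `mod`."""
    out = torch.full_like(t, mod)
    out[..., 1:] = torch.bitwise_xor(36313 * t[..., 1:], 27191 * t[..., :-1]) % mod
    return out

# Causality check: perturbing a *future* token leaves earlier hashes unchanged,
# so the key cannot leak the target token.
a = torch.tensor([5, 9, 3, 7])
b = a.clone(); b[3] = 99  # change only the last (future) token
assert torch.equal(bigram_hash(a)[:3], bigram_hash(b)[:3])
```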

Check 2: Pre-Quant TTT (CLOSE trigger)

SGD_TTT defaults to "0" (disabled, line 96). When enabled, eval_val_sgd_ttt runs a two-phase procedure: Phase 1 adapts weights via SGD over val_tokens (lines 453–464), Phase 2 scores with a sliding window on the adapted model (lines 465–532), then restores original weights. This is multi-epoch SGD (not AdamW). No AdamW on val_tokens is present anywhere. The submitted result uses pre_quant_val_bpb: 1.1710 logged from eval_val, the standard non-TTT evaluator — confirmed by SGD_TTT=0 default and no TTT invocation in the submission config. NOT a Pre-Quant TTT violation. NO CLOSE.

Check 3: Legal TTT (CLEAN)

eval_val_sgd_ttt is score-first per-chunk compliant in structure (adapt then score sequentially per window, weights restored after). If submitted BPB were from TTT, this would be CLEAN. However, submitted BPB is from standard eval, so TTT is moot here. N/A — feature not used in submission.
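For readers unfamiliar with the compliance criterion, a score-first per-chunk TTT evaluator looks roughly like the following. This is a hypothetical sketch, not the PR's eval_val_sgd_ttt: the function name, signature, and loss interface are invented; only the structural requirements (each chunk scored with current weights before any adaptation on it, SGD rather than AdamW, original weights restored at the end) come from the checks above.

```python
import copy
import torch

def sgd_ttt_eval(model, val_tokens, window, lr=1e-3, loss_fn=None):
    """Hypothetical score-first per-chunk TTT evaluator (illustrative only)."""
    saved = copy.deepcopy(model.state_dict())           # snapshot original weights
    opt = torch.optim.SGD(model.parameters(), lr=lr)    # SGD, not AdamW
    total_loss, total_tok = 0.0, 0
    for start in range(0, len(val_tokens), window):
        chunk = val_tokens[start:start + window]
        with torch.no_grad():
            # score FIRST, before any adaptation touches this chunk
            total_loss += loss_fn(model, chunk).item() * len(chunk)
            total_tok += len(chunk)
        loss_fn(model, chunk).backward()                # then adapt on it
        opt.step()
        opt.zero_grad()
    model.load_state_dict(saved)                        # restore original weights
    return total_loss / max(total_tok, 1)
```

The illegal variant would adapt on a chunk before scoring it (or never restore weights), which is what the Pre-Quant TTT check screens for.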

Check 4: Scored-Region SLOT (HOLD)

No scored-region manipulation detected. Sliding-window eval (eval_val_sliding) uses standard stride logic and skips context tokens using s = max(wlen - stride, 0) — consistent with upstream baseline. No cherry-picked region reduction. NO HOLD triggered.
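The stride logic being audited can be sketched as follows. Hypothetical names throughout; only the `s = max(wlen - stride, 0)` context-skip expression comes from the review. Each window re-scores only its trailing `stride` tokens, so every token is scored exactly once with full left context and no scored region can be cherry-picked away.

```python
import math

def sliding_window_bits(nll_fn, tokens, wlen=4, stride=2):
    """Illustrative sliding-window eval: nll_fn returns per-token NLLs (nats)
    for a window; the leading wlen - stride tokens are context only.
    Returns bits per scored token (BPB would further normalize by bytes)."""
    total_nll, scored = 0.0, 0
    pos = 0
    while pos < len(tokens):
        window = tokens[max(0, pos + stride - wlen):pos + stride]
        s = max(len(window) - stride, 0)   # context tokens to skip
        nll = nll_fn(window)
        total_nll += sum(nll[s:])          # score only the trailing stride
        scored += len(nll) - s
        pos += stride
    return total_nll / (scored * math.log(2))
```

With a constant per-token NLL of 1 nat, every token is scored once and the result is exactly 1/ln 2 bits per token, confirming there is no double counting or region dropout.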

Verdict: LOOKS CLEAN.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual record-track checks.


Reviewed by @MatoTeziTanka (The Agora). Compliance audit via LLM agent (Sonnet) reviewing the full train_gpt.py source. If this review misread your code, please call it out so I can re-audit manually.
