
Non-record: Value Residual (-0.015 BPB) + Gated Attention (-0.003 BPB) on 11L Production Stack#487

Open
anantdgoel wants to merge 1 commit into openai:main from anantdgoel:non-record-vr-ga-production

Conversation

@anantdgoel

val_bpb: 1.1720 | 19.4 MB (unlimited compute) | 1xA6000, 9500 steps, 14.5hr

Summary

  • Value Residual (ResFormer, arXiv:2410.17897): caches layer-0 V vectors and mixes them into subsequent layers via learnable scalars. -0.015 BPB, 22 params added.
  • Gated Attention (arXiv:2505.06708): per-head sigmoid gate after SDPA, eliminates attention sinks. -0.003 BPB, ~37K params added.
  • Techniques stack additively (-0.0172 combined), validated via controlled ablation on 9L baseline.
  • Full community meta-stack: 11L MLP3x + SmearGate + BigramHash(2048) + OrthoInit + WD0.04 + XSA(4) + EMA(0.997) + Partial RoPE + LN Scale + Logit Softcap.
  • Both techniques independently adopted by 5+ community submissions, including a record-tier entry (1.1101 BPB).
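The two techniques can be sketched in one attention block. This is a minimal illustration, not the PR's actual train_gpt.py implementation: the module name, the sigmoid parameterization of the mixing scalar, and the input-conditioned gate projection are all assumptions; only the overall pattern (layer-0 V cache mixed in via a learnable scalar, per-head sigmoid gate applied after SDPA) comes from the summary above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VRGatedAttention(nn.Module):
    """Illustrative attention block: value residual + per-head output gate.
    Names and shapes are assumptions, not the PR's exact implementation."""
    def __init__(self, dim, n_heads):
        super().__init__()
        self.n_heads = n_heads
        self.head_dim = dim // n_heads
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.proj = nn.Linear(dim, dim, bias=False)
        # Value Residual: one learnable mixing scalar per layer
        self.vr_lambda = nn.Parameter(torch.tensor(0.5))
        # Gated Attention: per-head sigmoid gate computed from the input
        self.gate = nn.Linear(dim, n_heads, bias=True)

    def forward(self, x, v0=None):
        B, T, D = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = q.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        k = k.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        v = v.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        if v0 is None:
            v0 = v  # layer 0: cache its own V for later layers
        else:
            lam = torch.sigmoid(self.vr_lambda)  # keep the mix in (0, 1)
            v = lam * v + (1 - lam) * v0         # blend in layer-0 values
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        # per-head sigmoid gate after SDPA (suppresses sink-dominated heads)
        g = torch.sigmoid(self.gate(x)).transpose(1, 2).unsqueeze(-1)  # (B, H, T, 1)
        out = (g * out).transpose(1, 2).reshape(B, T, D)
        return self.proj(out), v0
```

The gate projection accounts for roughly dim x n_heads parameters (the "~37K" scale quoted above), while the value-residual scalars add only a couple of parameters per layer.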

Ablation (9L v1024, 1000 steps, 131K batch, 1x3090)

| Config | val_bpb | Delta |
| --- | --- | --- |
| Control | 1.4697 | — |
| + Gated Attention | 1.4665 | -0.0032 |
| + Value Residual | 1.4546 | -0.0151 |
| + Both | 1.4525 | -0.0172 |

Production Results

| Metric | Value |
| --- | --- |
| Pre-quant val_bpb | 1.1710 |
| Post-quant val_bpb | 1.1720 |
| Quant gap | 0.0010 |
| Artifact | 19.4 MB |

Files

  • README.md — full writeup with ablations and reproducibility command
  • submission.json — metadata
  • train_gpt.py — training script
  • train.log — complete training log

…) on 11L production stack

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@MatoTeziTanka

Community Review — Non-record: Value Residual (-0.015 BPB) + Gated Attention (-0.003 BPB) on 11L Production Stack

Compliance: LOOKS CLEAN — pure-neural submission, no TTT/SLOT/n-gram-cache

PR: #487 — "Non-record: Value Residual (-0.015 BPB) + Gated Attention (-0.003 BPB) on 11L Production Stack"
Author: anantdgoel
Head SHA: be5355d
Track: non-record-unlimited-compute-16mb
Reported val_bpb: 1.17197837


Check 1: N-gram Family Bug (CLOSE trigger)

The BigramHashEmbedding at lines 1018–1023 hashes position i using:

out[..., 1:] = torch.bitwise_xor(36313 * t[..., 1:], 27191 * t[..., :-1]) % mod

This combines t[i] (current token, position 1:) with t[i-1] (previous token, position :-1). The hash at position i does not incorporate t[i+1] (the future/target token). Position 0 is set to mod (fallback). This is the legal BigramHash pattern — the key does not leak the target token. NOT a bug. NO CLOSE.

Note: BIGRAM_HASH defaults to "0" (disabled). The feature is off unless explicitly enabled. Even if enabled, construction is legal.
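The causality argument above can be checked directly. This is a reconstruction from the single quoted line, not the full BigramHashEmbedding: the constants and the position-0 fallback come from the review, and `mod=2048` matches the BigramHash(2048) in the meta-stack; the function name is invented for illustration.

```python
import torch

def bigram_hash(t, mod=2048):
    """Reconstruction of the reviewed hash: position i combines t[i]
    with t[i-1] only; position 0 falls back to the sentinel index `mod`."""
    out = torch.full_like(t, mod)
    out[..., 1:] = torch.bitwise_xor(36313 * t[..., 1:], 27191 * t[..., :-1]) % mod
    return out

# Causality check: perturbing a *future* token leaves earlier hashes unchanged,
# so the key cannot leak the target token.
a = torch.tensor([5, 9, 3, 7])
b = a.clone(); b[3] = 99  # change only the last (future) token
assert torch.equal(bigram_hash(a)[:3], bigram_hash(b)[:3])
```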

Check 2: Pre-Quant TTT (CLOSE trigger)

SGD_TTT defaults to "0" (disabled, line 96). When enabled, eval_val_sgd_ttt runs a two-phase procedure: Phase 1 adapts weights via SGD over val_tokens (lines 453–464), Phase 2 scores with a sliding window on the adapted model (lines 465–532), then restores original weights. This is multi-epoch SGD (not AdamW). No AdamW on val_tokens is present anywhere. The submitted result uses pre_quant_val_bpb: 1.1710 logged from eval_val, the standard non-TTT evaluator — confirmed by SGD_TTT=0 default and no TTT invocation in the submission config. NOT a Pre-Quant TTT violation. NO CLOSE.

Check 3: Legal TTT (CLEAN)

eval_val_sgd_ttt is score-first per-chunk compliant in structure (adapt then score sequentially per window, weights restored after). If submitted BPB were from TTT, this would be CLEAN. However, submitted BPB is from standard eval, so TTT is moot here. N/A — feature not used in submission.
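For readers unfamiliar with the compliance criterion, a score-first per-chunk TTT evaluator looks roughly like the following. This is a hypothetical sketch, not the PR's eval_val_sgd_ttt: the function name, signature, and loss interface are invented; only the structural requirements (each chunk scored with current weights before any adaptation on it, SGD rather than AdamW, original weights restored at the end) come from the checks above.

```python
import copy
import torch

def sgd_ttt_eval(model, val_tokens, window, lr=1e-3, loss_fn=None):
    """Hypothetical score-first per-chunk TTT evaluator (illustrative only)."""
    saved = copy.deepcopy(model.state_dict())           # snapshot original weights
    opt = torch.optim.SGD(model.parameters(), lr=lr)    # SGD, not AdamW
    total_loss, total_tok = 0.0, 0
    for start in range(0, len(val_tokens), window):
        chunk = val_tokens[start:start + window]
        with torch.no_grad():
            # score FIRST, before any adaptation touches this chunk
            total_loss += loss_fn(model, chunk).item() * len(chunk)
            total_tok += len(chunk)
        loss_fn(model, chunk).backward()                # then adapt on it
        opt.step()
        opt.zero_grad()
    model.load_state_dict(saved)                        # restore original weights
    return total_loss / max(total_tok, 1)
```

The illegal variant would adapt on a chunk before scoring it (or never restore weights), which is what the Pre-Quant TTT check screens for.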

Check 4: Scored-Region SLOT (HOLD)

No scored-region manipulation detected. Sliding-window eval (eval_val_sliding) uses standard stride logic and skips context tokens using s = max(wlen - stride, 0) — consistent with upstream baseline. No cherry-picked region reduction. NO HOLD triggered.
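The stride logic being audited can be sketched as follows. Hypothetical names throughout; only the `s = max(wlen - stride, 0)` context-skip expression comes from the review. Each window re-scores only its trailing `stride` tokens, so every token is scored exactly once with full left context and no scored region can be cherry-picked away.

```python
import math

def sliding_window_bits(nll_fn, tokens, wlen=4, stride=2):
    """Illustrative sliding-window eval: nll_fn returns per-token NLLs (nats)
    for a window; the leading wlen - stride tokens are context only.
    Returns bits per scored token (BPB would further normalize by bytes)."""
    total_nll, scored = 0.0, 0
    pos = 0
    while pos < len(tokens):
        window = tokens[max(0, pos + stride - wlen):pos + stride]
        s = max(len(window) - stride, 0)   # context tokens to skip
        nll = nll_fn(window)
        total_nll += sum(nll[s:])          # score only the trailing stride
        scored += len(nll) - s
        pos += stride
    return total_nll / (scored * math.log(2))
```

With a constant per-token NLL of 1 nat, every token is scored once and the result is exactly 1/ln 2 bits per token, confirming there is no double counting or region dropout.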

Verdict: LOOKS CLEAN.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual record-track checks.


Reviewed by @MatoTeziTanka (The Agora). Compliance audit via LLM agent (Sonnet) reviewing the full train_gpt.py source. If this review misread your code, please call it out so I can re-audit manually.
