
Record: 11L EMA + Int6 + XSA + LeakyReLU² + Partial RoPE (val_bpb: 1.1309)#493

Open
parinzee wants to merge 1 commit into openai:main from parinzee:submission/2026-03-23-11L-EMA-Int6

Conversation

@parinzee

Summary

  • val_bpb: 1.1309 (mean of 3 seeds, std: 0.00017)
  • 11 layers, 512 dim, 8H/4KV GQA
  • Artifact: ~15.8 MB (all seeds under 16 MB)

3-Seed Results

Seed   val_bpb   artifact_bytes
42     1.13109   15,764,564
1337   1.13085   15,626,741
2024   1.13067   15,923,256
Mean   1.13087
Std    0.00017

Key Changes from Baseline

  1. 11 layers (up from 10), 512 dim, 8 heads / 4 KV heads (GQA)
  2. XSA (Exclusive Self Attention) on the last 4 layers for better representation
  3. LeakyReLU(0.5)² activation — squared leaky ReLU with a 0.5 negative slope (sketch below)
  4. Partial RoPE — only 16/64 dims use rotary embeddings (sketch below)
  5. EMA weight averaging (decay=0.997) for smoother final weights (sketch below)
  6. Int6 quantization for all large weight matrices + zstd-22 compression (sketch below)
  7. Scale clamping fix — clamp_min(1/clip_range) improves quantization quality
  8. Smaller batch size (524,288 tokens) to fit more training steps (~8,200 steps in 600s)
  9. BigramHash(2048, dim=128) token embeddings
  10. warmdown_iters=4500 for the learning rate schedule
  11. Higher learning rates (matrix_lr=0.025, scalar_lr=0.025)
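
A minimal PyTorch sketch of the item-3 activation (illustrative only; the function name and placement are assumptions, not code from the submission's train_gpt.py):

import torch
import torch.nn.functional as F

def leaky_relu_squared(x: torch.Tensor, negative_slope: float = 0.5) -> torch.Tensor:
    # Drop-in replacement for the baseline's relu(x) ** 2 MLP activation:
    # squaring the leaky output means negative inputs contribute 0.25 * x**2
    # instead of zero, keeping gradient flow on the negative side.
    y = F.leaky_relu(x, negative_slope=negative_slope)
    return y * y

In a baseline-style MLP this would sit between the two projections, e.g. h = leaky_relu_squared(self.fc1(x)) with hypothetical layer names.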

Run Command

torchrun --standalone --nproc_per_node=8 train_gpt.py
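
For reference, the EMA weight averaging in item 5 amounts to the following sketch (names and wiring are illustrative assumptions, not the submission's code):

import copy
import torch

@torch.no_grad()
def update_ema(ema_model: torch.nn.Module, model: torch.nn.Module, decay: float = 0.997) -> None:
    # ema <- decay * ema + (1 - decay) * current, applied parameter-wise.
    for ema_p, p in zip(ema_model.parameters(), model.parameters()):
        ema_p.lerp_(p, 1.0 - decay)

# Hypothetical wiring: keep a frozen copy, update it after each optimizer step,
# and evaluate / export the averaged copy instead of the live weights.
# ema_model = copy.deepcopy(model)
# ... inside the training loop, after optimizer.step():
# update_ema(ema_model, model, decay=0.997)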

Built on SOTA baseline by @thwu1 (PR #180).
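
For items 6–7, a minimal sketch of symmetric int6 quantization with the scale clamp; the constants and helper names are assumptions, and int8 storage stands in for a real 6-bit pack:

import torch
import zstandard as zstd

def quantize_int6(w: torch.Tensor, clip_range: float = 31.0):
    # Symmetric per-tensor quantization to int6 levels in [-31, 31].
    # clamp_min(1 / clip_range) is the scale fix from item 7: it prevents
    # near-zero tensors from producing a degenerate (tiny) scale.
    scale = (w.abs().max() / clip_range).clamp_min(1.0 / clip_range)
    q = (w / scale).round().clamp_(-clip_range, clip_range).to(torch.int8)
    return q, scale

def dequantize_int6(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale

# The quantized payload is then compressed at max level, e.g.:
# blob = zstd.ZstdCompressor(level=22).compress(q.numpy().tobytes())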

Record: 11L EMA + Int6 + XSA + LeakyReLU² + Partial RoPE (val_bpb: 1.1309)

3-seed validation results:
- Seed 42:   val_bpb=1.13109, artifact=15,764,564 bytes
- Seed 1337: val_bpb=1.13085, artifact=15,626,741 bytes
- Seed 2024: val_bpb=1.13067, artifact=15,923,256 bytes
- Mean: 1.13087 (std: 0.00017)

Key techniques: 11 layers, GQA (8H/4KV), XSA on last 4 layers,
LeakyReLU(0.5)², Partial RoPE (16/64), EMA (0.997), int6 quantization,
zstd-22 compression, BigramHash(2048,128), warmdown_iters=4500.

Built on baseline by @thwu1 (PR openai#180).
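
A sketch of the Partial RoPE mentioned above, rotating only 16 of the 64 head dims; the half-split pairing convention and all names here are assumptions, not the submission's code:

import torch

def apply_partial_rope(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor,
                       rot_dims: int = 16) -> torch.Tensor:
    # Apply rotary embeddings to the first `rot_dims` channels of the head
    # dimension; the remaining channels pass through position-free.
    x_rot, x_pass = x[..., :rot_dims], x[..., rot_dims:]
    x1, x2 = x_rot.chunk(2, dim=-1)              # pair up rotary channels
    rotated = torch.cat((x1 * cos - x2 * sin,    # standard 2-D rotation
                         x1 * sin + x2 * cos), dim=-1)
    return torch.cat((rotated, x_pass), dim=-1)

Here cos and sin are precomputed tables broadcastable to shape (..., rot_dims // 2).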
sofiabod added a commit to sofiabod/parameter-golf that referenced this pull request Mar 23, 2026
- replace relu(x)^2 with leaky_relu(x, 0.5)^2
- PR openai#493 reaches 1.1309 with partial stack using this activation
- untried on full openai#414 stack — could give -0.002 to -0.005 BPB
- zero param cost, zero speed overhead
mrdavtan added a commit to mrdavtan/parameter-golf that referenced this pull request Mar 23, 2026
Key changes from studying PR openai#505 (1.1181) and openai#486 (1.0887):
- train_batch_tokens: 524K → 786K (all top entries use this)
- bigram_hash_buckets: 4096 → 8192 (PR openai#505 uses 8192, openai#493 uses 10240)
- grad_clip_norm: 0.3 → 0.0 (PR openai#505 disables clipping)
- Star-ReLU and TrigramHash enabled in all run scripts
abaybektursun added a commit to abaybektursun/parameter-golf that referenced this pull request Mar 23, 2026
RoyiRa added a commit to RoyiRa/parameter-golf that referenced this pull request Mar 23, 2026
…bpb 1.1178

3-seed mean: 1.1178 BPB (std 0.0005), ~15.75 MB artifact, 8×H100 SXM.

Novel contribution: Late Soft-Round QAT — replaces STE identity surrogate
with sigmoid soft-round in the backward pass during the final 2% of training,
giving bin-aware gradients that settle weights onto int6 grid points.

Built on PR openai#414 (base model), PR openai#461 (TTT recipe), PR openai#493 (LeakyReLU²).
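
A sketch of the Late Soft-Round QAT idea described above; the forward pass stays a hard round, while the backward surrogate and temperature alpha shown here are illustrative, not the commit's exact code:

import torch

class SoftRoundSTE(torch.autograd.Function):
    # Forward: hard round, i.e. what the int6 quantizer actually applies.
    # Backward: gradient of a sigmoid soft-round rather than the identity,
    # so weights feel the pull of the nearest grid point late in training.

    @staticmethod
    def forward(ctx, x: torch.Tensor, alpha: float = 8.0) -> torch.Tensor:
        ctx.save_for_backward(x)
        ctx.alpha = alpha
        return x.round()

    @staticmethod
    def backward(ctx, grad_out: torch.Tensor):
        (x,) = ctx.saved_tensors
        frac = x - x.floor()                       # position within the bin
        s = torch.sigmoid(ctx.alpha * (frac - 0.5))
        grad_soft = ctx.alpha * s * (1.0 - s)      # unnormalized soft-round derivative
        return grad_out * grad_soft, None          # no gradient w.r.t. alpha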
Fraser-Greenlee pushed a commit to Fraser-Greenlee/parameter-golf that referenced this pull request Mar 25, 2026
- Interleaved draft tokens: soft predictions placed between real tokens
  for 1-2 token lookahead via standard causal attention
- SmearGate and BigramHash naturally gain future context on interleaved seq
- Bigram noise curriculum: drafts anneal from GT to realistic noise
- Two-pass eval: pass 1 generates drafts, pass 2 refines with interleaving
- LeakyReLU(0.5)² activation toggle (free -0.003 BPB from PR openai#493)
- W&B logging (opt-in via WANDB_PROJECT env var)
- Sweep runner with 13 configs covering baselines, draft variants, and ablations

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Mistobaan pushed a commit to Mistobaan/parameter-golf that referenced this pull request Mar 25, 2026
TimS-ml referenced this pull request in TimS-ml/parameter-golf-autoresearch Mar 26, 2026
nedcut pushed a commit to nedcut/parameter-golf that referenced this pull request Mar 26, 2026
anish-krishnan pushed a commit to anish-krishnan/parameter-golf that referenced this pull request Mar 30, 2026
Itssshikhar pushed a commit to Itssshikhar/parameter-golf that referenced this pull request Mar 31, 2026
@MatoTeziTanka

Community Review — Record: 11L EMA + Int6 + XSA + LeakyReLU² + Partial RoPE (val_bpb: 1.1309)

BPB: 1.1309 | Compliance: LOOKS CLEAN — pure-neural submission, no TTT/SLOT/n-gram-cache

What I found in the code (head SHA 739a729f6b4a, file records/track_10min_16mb/2026-03-23_11L_EMA_Int6_XSA_LeakyReLU_PartialRoPE/train_gpt.py):

Static code review found no TTT adaptation function, no SLOT optimization loop, no n-gram-cache class, and no pre-quant val-token fine-tune. The eval path uses the standard sliding-window stride-64 pattern. The submission is a pure-neural architecture iteration on the standard SP1024/SP4096/SP8192 baseline.
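
For reference, the stride-64 sliding-window pattern amounts to something like this sketch; the window size, batching, and bpb conversion are illustrative, not the harness's exact code:

import torch
import torch.nn.functional as F

@torch.no_grad()
def sliding_window_nll(model, tokens: torch.Tensor, window: int = 1024, stride: int = 64) -> float:
    # Score each target token exactly once, advancing 64 tokens per forward
    # pass so scored positions see up to window-1 tokens of left context.
    n = tokens.numel()
    total_nll, scored = 0.0, 1            # position 0 has no target
    while scored < n:
        end = min(scored + stride, n)     # targets [scored, end) are new
        begin = max(0, end - window)      # left edge of the context window
        logits = model(tokens[begin:end].unsqueeze(0))[0]           # (T, vocab)
        nll = F.cross_entropy(logits[:-1], tokens[begin + 1:end], reduction="none")
        total_nll += nll[-(end - scored):].sum().item()
        scored = end
    return total_nll                      # sum of nats; divide by bytes * ln(2) for bpb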

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.03s, dim=512, layers=10, vocab=1024, code=53603 B, SMOKE_TEST_PASS

Verdict: LOOKS CLEAN.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual record-track checks (3-seed validation, under-16MB artifact cap, ≤600s train + ≤600s eval on 8×H100 SXM). No compliance flags from the classification pass — this looks like a clean pure-neural iteration on the standard baseline.

Auto-classification caveat: this review was drafted by the AST-based classifier. If there's a non-standard eval mechanism (logit postprocessing, hedge mixing, etc.) that I missed because it's factored into a helper file or a non-standard function name, please flag it and I'll re-run the audit manually.


Reviewed by @MatoTeziTanka (The Agora). Classification via deterministic AST-based classify_prs.py (pattern bank derived from ~65 manually-reviewed PRs earlier in the 2026-04-11 sweep). This review was auto-drafted from a template and spot-checked before posting — if the template misread your code, please call it out so I can iterate the classifier.

