Record: 10L CountInitBigram + XSA + PartialRoPE (val_bpb=1.1522) #485
harsha-gouru wants to merge 1 commit into openai:main from
Conversation
Pull request overview
Adds a new 10-minute / 16MB track record entry for a 10-layer GPT variant that combines an exact count-initialized bigram logit bias (int4 nibble-packed) with XSA, Partial RoPE, LN scaling, and mixed int5/int6(+int4) quantization + compression, along with training logs and submission metadata.
Changes:
- Added a new record folder with training script (`train_gpt.py`), run log, README, and `submission.json`.
- Implemented count-based initialization for an exact bigram logit head and int4 nibble packing for compact storage.
- Added/used XSA on last layers, Partial RoPE, LN scaling, SWA, and mixed quantization + zstd/zlib export with roundtrip evaluation.
Reviewed changes
Copilot reviewed 3 out of 4 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| records/track_10min_16mb/2026-03-22_10L_CountInitBigram_XSA_PartialRoPE/train.log | Captures training + eval/roundtrip outputs for the submitted run. |
| records/track_10min_16mb/2026-03-22_10L_CountInitBigram_XSA_PartialRoPE/train_gpt.py | Full training/eval/quantization script implementing the techniques used by the record. |
| records/track_10min_16mb/2026-03-22_10L_CountInitBigram_XSA_PartialRoPE/submission.json | Declares reported metrics, sizes, and submission metadata for the record. |
| records/track_10min_16mb/2026-03-22_10L_CountInitBigram_XSA_PartialRoPE/README.md | Documents the approach, run command, hyperparameters, and results for the record. |
```bash
torchrun --standalone --nproc_per_node=8 train_gpt.py
```

All parameters are set as defaults in `train_gpt.py`. No env vars needed.
README claims “All parameters are set as defaults … No env vars needed.”, but train_gpt.py defaults BIGRAM_LOGIT_HEAD to 0, so running the command as written won’t enable the count-initialized exact bigram logit head described in this record. Either set the default to enabled for this record script, or update the run command/docs to include BIGRAM_LOGIT_HEAD=1 (and any other required env vars) so results are reproducible.
## Architecture
- 10 layers, 512 dim, 8 heads, 4 KV heads (GQA)
- MLP 3x expansion (hidden=1536), relu squared activation
The Architecture section says “relu squared activation”, but the implementation uses leaky_relu(..., negative_slope=0.5) followed by square(). Please update the README to match the actual activation, or adjust the code if ReLU^2 was intended.
```diff
- - MLP 3x expansion (hidden=1536), relu squared activation
+ - MLP 3x expansion (hidden=1536), leaky relu squared activation (negative_slope=0.5)
```
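For reference, the two activations differ only for negative inputs. A minimal numpy sketch of the distinction (the record's script presumably uses torch's `leaky_relu` followed by squaring; this is an illustrative stand-in, not the record's code):

```python
import numpy as np

def relu_sq(x):
    # ReLU^2: zero for negative inputs, x^2 otherwise
    return np.maximum(x, 0.0) ** 2

def leaky_relu_sq(x, negative_slope=0.5):
    # Leaky-ReLU^2: negative inputs are scaled by negative_slope, then
    # squared -- so the output is nonzero (and positive) for x < 0,
    # unlike ReLU^2. Note the squared activation is not monotonic.
    return np.where(x >= 0, x, negative_slope * x) ** 2
```

The practical difference the reviewer is pointing at: with `negative_slope=0.5`, an input of -2 yields 1.0 rather than 0, so gradients flow for negative pre-activations.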
## Training Hyperparameters
- Muon optimizer: matrix_lr=0.025, WD=0.04, momentum=0.99
- AdamW for embeddings/scalars: WD=0.04
- warmdown=2800 iters, warmup=20 steps
- seq_len=2048, batch=786K tokens
- grad_clip=0.3, 3% magnitude pruning
- SWA: start_frac=0.4, every=50 steps (22 checkpoints)
- Sliding window eval: stride=64
README lists “warmdown=2800 iters”, but train_gpt.py defaults WARMDOWN_ITERS to 3000. Please reconcile the README with the actual default hyperparameters used for this record (or explicitly document the env var override used for the run).
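For readers unfamiliar with the pattern, the stride-64 sliding-window eval mentioned in the hyperparameters can be sketched as follows. This is a hedged sketch: `logprob_fn` is a hypothetical stand-in for the model's scoring call, not the record's actual eval code, and bits-per-token equals bits-per-byte only when tokens are bytes:

```python
import math
import numpy as np

def sliding_window_bpb(logprob_fn, tokens, seq_len=2048, stride=64):
    """Score every token once with long left context.

    Each window advances by `stride` tokens but is padded back with up to
    seq_len tokens of context; only the last `stride` predictions of each
    window are counted, so no token is scored with short context twice.
    `logprob_fn(window)` is assumed to return log p(window[i+1] | window[:i+1])
    for each position i of a 1-D token window.
    """
    total_nll, total_tok = 0.0, 0
    n = tokens.size
    for start in range(0, n - 1, stride):
        end = min(start + stride, n - 1)
        ctx_start = max(0, end - seq_len)          # back-fill context
        window = tokens[ctx_start:end + 1]
        lp = logprob_fn(window)
        take = end - start                          # only the new positions
        total_nll -= float(np.sum(lp[-take:]))
        total_tok += take
    return total_nll / total_tok / math.log(2)      # nats -> bits per token
```

The trade-off versus scoring whole non-overlapping chunks is compute: a smaller stride re-encodes more context per scored token but gives every token near-maximal context.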
```python
bigram_vocab_size = int(os.environ.get("BIGRAM_VOCAB_SIZE", 4096))
bigram_dim = int(os.environ.get("BIGRAM_DIM", 128))

swa_enabled = bool(int(os.environ.get("SWA_ENABLED", "1")))
swa_start_frac = float(os.environ.get("SWA_START_FRAC", 0.4))
swa_every = int(os.environ.get("SWA_EVERY", 50))

ema_enabled = bool(int(os.environ.get("EMA_ENABLED", "0")))
ema_decay = float(os.environ.get("EMA_DECAY", 0.997))

# Exact bigram logit head (can be enabled alongside BigramHash)
bigram_logit_head = bool(int(os.environ.get("BIGRAM_LOGIT_HEAD", "0")))

# Architectural improvements from top PRs
xsa_last_n = int(os.environ.get("XSA_LAST_N", 4))      # XSA on last N layers
rope_dims = int(os.environ.get("ROPE_DIMS", 16))       # Partial RoPE: rotate this many dims
ln_scale = bool(int(os.environ.get("LN_SCALE", "1")))  # RMSNorm scale by 1/sqrt(layer+1)
```
This record is described as “CountInitBigram…”, but Hyperparameters.bigram_logit_head defaults to disabled (BIGRAM_LOGIT_HEAD default "0"). With the current defaults, the count-based init block won’t run and the exact bigram logit head won’t be part of the model unless the env var is set. Consider enabling it by default for this record (or ensure the README/run command documents the required env var).
```python
with open("final_model.int8.ptz", "wb") as f:
    f.write(quant_blob)
quant_file_bytes = os.path.getsize("final_model.int8.ptz")
code_bytes = len(code.encode("utf-8"))
log0(f"Serialized model int6+{_COMPRESSOR}: {quant_file_bytes} bytes")
log0(f"Total submission size int8+zlib: {quant_file_bytes + code_bytes} bytes")
```
The quantization/compression logging labels are inconsistent with what the code actually does: it may use zstd (_COMPRESSOR == "zstd") and mixed int6/int5 plus optional int4-packed tensors, but the log line still says Total submission size int8+zlib, and later messages are hard-coded to final_int8_zlib_roundtrip. Please make these log labels reflect the chosen compressor/quantization so the recorded logs and reported metadata aren’t misleading.
```python
    f"final_int8_zlib_roundtrip val_loss:{q_val_loss:.4f} val_bpb:{q_val_bpb:.4f} "
    f"eval_time:{1000.0 * (time.perf_counter() - t_qeval):.0f}ms"
)
log0(f"final_int8_zlib_roundtrip_exact val_loss:{q_val_loss:.8f} val_bpb:{q_val_bpb:.8f}")
```
Related to the above: the final evaluation log keys are hard-coded as final_int8_zlib_roundtrip* even when using zstd and mixed int6/int5 (+ int4 packed) roundtrip. Consider renaming these log keys or including the compressor/quant scheme in the key/value so downstream parsing and human readers don’t misinterpret the result.
```diff
-    f"final_int8_zlib_roundtrip val_loss:{q_val_loss:.4f} val_bpb:{q_val_bpb:.4f} "
-    f"eval_time:{1000.0 * (time.perf_counter() - t_qeval):.0f}ms"
-)
-log0(f"final_int8_zlib_roundtrip_exact val_loss:{q_val_loss:.8f} val_bpb:{q_val_bpb:.8f}")
+    f"final_mixed_int6_{_COMPRESSOR}_roundtrip val_loss:{q_val_loss:.4f} val_bpb:{q_val_bpb:.4f} "
+    f"eval_time:{1000.0 * (time.perf_counter() - t_qeval):.0f}ms"
+)
+log0(
+    f"final_mixed_int6_{_COMPRESSOR}_roundtrip_exact "
+    f"val_loss:{q_val_loss:.8f} val_bpb:{q_val_bpb:.8f}"
+)
```
Community Review — Record: 10L CountInitBigram + XSA + PartialRoPE (val_bpb=1.1522)

BPB: 1.1522 | Compliance: LOOKS CLEAN — pure-neural submission, no TTT/SLOT/n-gram-cache

What I found in the code (head SHA): static code review found no TTT adaptation function, no SLOT optimization loop, no n-gram-cache class, and no pre-quant val-token fine-tune. The eval path uses the standard sliding-window stride-64 pattern. The submission is a pure-neural architecture iteration on the standard SP1024/SP4096/SP8192 baseline.

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.02s, dim=512, layers=10, vocab=1024, code=61667 B, SMOKE_TEST_PASS.

Verdict: LOOKS CLEAN. Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual record-track checks (3-seed validation, under-16MB artifact cap, ≤600s train + ≤600s eval on 8×H100 SXM). No compliance flags from the classification pass — this looks like a clean pure-neural iteration on the standard baseline.

Auto-classification caveat: this review was drafted by the AST-based classifier. If there's a non-standard eval mechanism (logit postprocessing, hedge mixing, etc.) that I missed because it's factored into a helper file or a non-standard function name, please flag it and I'll re-run the audit manually.

Reviewed by @MatoTeziTanka — The Agora. Classification via deterministic AST-based
## Summary

### Novel Contributions

**Count-Initialized Exact Bigram Logit Head**
A 1024x1024 lookup table initialized from corpus bigram transition probabilities (B[a,b] = log p(b|a) - log p(b)) before training begins. Provides a strong Markov prior from step 0. Applied BEFORE logit softcap. Int4 nibble-packed (524KB).
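A count-based initialization of this form can be sketched as follows. This is a hedged sketch, not the record's code: the additive-smoothing constant `alpha` is an assumption introduced so unseen bigrams get a finite (strongly negative) bias:

```python
import numpy as np

def count_init_bigram(tokens, vocab=1024, alpha=1.0):
    """Build B[a, b] = log p(b|a) - log p(b) from corpus bigram counts.

    `alpha` is an assumed smoothing constant; the record may handle
    unseen bigrams differently.
    """
    counts = np.full((vocab, vocab), alpha, dtype=np.float64)
    # Unbuffered accumulation over all adjacent (a, b) pairs
    np.add.at(counts, (tokens[:-1], tokens[1:]), 1.0)
    p_b_given_a = counts / counts.sum(axis=1, keepdims=True)   # p(b|a)
    unigram = counts.sum(axis=0) / counts.sum()                # p(b)
    return (np.log(p_b_given_a) - np.log(unigram)).astype(np.float32)
```

Subtracting `log p(b)` makes the table a pure transition correction on top of whatever unigram statistics the learned logits already capture.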
**Int4 Nibble Packing**
Custom pack_i4/unpack_i4 routines that pack pairs of signed int4 values into single uint8 bytes, halving the bigram table storage.
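Nibble packing of this kind can be sketched as below. The names mirror the described `pack_i4`/`unpack_i4`, but the implementation details (padding, nibble order) are assumptions, not the record's code:

```python
import numpy as np

def pack_i4(x):
    """Pack signed int4 values (range -8..7) into uint8 bytes, two per byte."""
    assert x.min() >= -8 and x.max() <= 7
    u = (x.astype(np.int16) & 0xF).astype(np.uint8)  # two's-complement nibbles
    if u.size % 2:                                   # pad to an even count
        u = np.concatenate([u, np.zeros(1, np.uint8)])
    return (u[0::2] | (u[1::2] << 4)).astype(np.uint8)

def unpack_i4(packed, n):
    """Inverse of pack_i4: recover the first n signed int4 values."""
    lo = (packed & 0xF).astype(np.int8)
    hi = ((packed >> 4) & 0xF).astype(np.int8)
    u = np.empty(packed.size * 2, dtype=np.int8)
    u[0::2], u[1::2] = lo, hi
    u[u >= 8] -= 16                                  # sign-extend each nibble
    return u[:n]
```

For a 1024x1024 int4 table this stores 1048576 values in 524288 bytes, matching the ~524KB figure above.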
### Adopted Techniques

### Results

Built on baseline by @thwu1 (PR #180).