Record: 10L CountInitBigram + XSA + PartialRoPE (val_bpb=1.1522) #485
harsha-gouru wants to merge 1 commit into openai:main from
Conversation
Pull request overview
Adds a new 10-minute / 16MB track record entry for a 10-layer GPT variant that combines an exact count-initialized bigram logit bias (int4 nibble-packed) with XSA, Partial RoPE, LN scaling, and mixed int5/int6(+int4) quantization + compression, along with training logs and submission metadata.
Changes:
- Added a new record folder with training script (`train_gpt.py`), run log, README, and `submission.json`.
- Implemented count-based initialization for an exact bigram logit head and int4 nibble packing for compact storage.
- Added/used XSA on last layers, Partial RoPE, LN scaling, SWA, and mixed quantization + zstd/zlib export with roundtrip evaluation.
Reviewed changes
Copilot reviewed 3 out of 4 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| records/track_10min_16mb/2026-03-22_10L_CountInitBigram_XSA_PartialRoPE/train.log | Captures training + eval/roundtrip outputs for the submitted run. |
| records/track_10min_16mb/2026-03-22_10L_CountInitBigram_XSA_PartialRoPE/train_gpt.py | Full training/eval/quantization script implementing the techniques used by the record. |
| records/track_10min_16mb/2026-03-22_10L_CountInitBigram_XSA_PartialRoPE/submission.json | Declares reported metrics, sizes, and submission metadata for the record. |
| records/track_10min_16mb/2026-03-22_10L_CountInitBigram_XSA_PartialRoPE/README.md | Documents the approach, run command, hyperparameters, and results for the record. |
```bash
torchrun --standalone --nproc_per_node=8 train_gpt.py
```

All parameters are set as defaults in `train_gpt.py`. No env vars needed.
README claims “All parameters are set as defaults … No env vars needed.”, but train_gpt.py defaults BIGRAM_LOGIT_HEAD to 0, so running the command as written won’t enable the count-initialized exact bigram logit head described in this record. Either set the default to enabled for this record script, or update the run command/docs to include BIGRAM_LOGIT_HEAD=1 (and any other required env vars) so results are reproducible.
## Architecture
- 10 layers, 512 dim, 8 heads, 4 KV heads (GQA)
- MLP 3x expansion (hidden=1536), relu squared activation
The Architecture section says “relu squared activation”, but the implementation uses leaky_relu(..., negative_slope=0.5) followed by square(). Please update the README to match the actual activation, or adjust the code if ReLU^2 was intended.
```diff
- - MLP 3x expansion (hidden=1536), relu squared activation
+ - MLP 3x expansion (hidden=1536), leaky relu squared activation (negative_slope=0.5)
```
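For reference, the two activations differ only for negative inputs. A minimal numpy sketch of the distinction (the record's script presumably uses torch's `leaky_relu` followed by squaring; this is an illustrative stand-in, not the record's code):

```python
import numpy as np

def relu_sq(x):
    # ReLU^2: zero for negative inputs, x^2 otherwise
    return np.maximum(x, 0.0) ** 2

def leaky_relu_sq(x, negative_slope=0.5):
    # Leaky-ReLU^2: negative inputs are scaled by negative_slope, then
    # squared -- so the output is nonzero (and positive) for x < 0,
    # unlike ReLU^2. Note the squared activation is not monotonic.
    return np.where(x >= 0, x, negative_slope * x) ** 2
```

The practical difference the reviewer is pointing at: with `negative_slope=0.5`, an input of -2 yields 1.0 rather than 0, so gradients flow for negative pre-activations.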
## Training Hyperparameters
- Muon optimizer: matrix_lr=0.025, WD=0.04, momentum=0.99
- AdamW for embeddings/scalars: WD=0.04
- warmdown=2800 iters, warmup=20 steps
- seq_len=2048, batch=786K tokens
- grad_clip=0.3, 3% magnitude pruning
- SWA: start_frac=0.4, every=50 steps (22 checkpoints)
- Sliding window eval: stride=64
README lists “warmdown=2800 iters”, but train_gpt.py defaults WARMDOWN_ITERS to 3000. Please reconcile the README with the actual default hyperparameters used for this record (or explicitly document the env var override used for the run).
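For readers unfamiliar with the pattern, the stride-64 sliding-window eval mentioned in the hyperparameters can be sketched as follows. This is a hedged sketch: `logprob_fn` is a hypothetical stand-in for the model's scoring call, not the record's actual eval code, and bits-per-token equals bits-per-byte only when tokens are bytes:

```python
import math
import numpy as np

def sliding_window_bpb(logprob_fn, tokens, seq_len=2048, stride=64):
    """Score every token once with long left context.

    Each window advances by `stride` tokens but is padded back with up to
    seq_len tokens of context; only the last `stride` predictions of each
    window are counted, so no token is scored with short context twice.
    `logprob_fn(window)` is assumed to return log p(window[i+1] | window[:i+1])
    for each position i of a 1-D token window.
    """
    total_nll, total_tok = 0.0, 0
    n = tokens.size
    for start in range(0, n - 1, stride):
        end = min(start + stride, n - 1)
        ctx_start = max(0, end - seq_len)          # back-fill context
        window = tokens[ctx_start:end + 1]
        lp = logprob_fn(window)
        take = end - start                          # only the new positions
        total_nll -= float(np.sum(lp[-take:]))
        total_tok += take
    return total_nll / total_tok / math.log(2)      # nats -> bits per token
```

The trade-off versus scoring whole non-overlapping chunks is compute: a smaller stride re-encodes more context per scored token but gives every token near-maximal context.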
```python
bigram_vocab_size = int(os.environ.get("BIGRAM_VOCAB_SIZE", 4096))
bigram_dim = int(os.environ.get("BIGRAM_DIM", 128))

swa_enabled = bool(int(os.environ.get("SWA_ENABLED", "1")))
swa_start_frac = float(os.environ.get("SWA_START_FRAC", 0.4))
swa_every = int(os.environ.get("SWA_EVERY", 50))

ema_enabled = bool(int(os.environ.get("EMA_ENABLED", "0")))
ema_decay = float(os.environ.get("EMA_DECAY", 0.997))

# Exact bigram logit head (can be enabled alongside BigramHash)
bigram_logit_head = bool(int(os.environ.get("BIGRAM_LOGIT_HEAD", "0")))

# Architectural improvements from top PRs
xsa_last_n = int(os.environ.get("XSA_LAST_N", 4))      # XSA on last N layers
rope_dims = int(os.environ.get("ROPE_DIMS", 16))       # Partial RoPE: rotate this many dims
ln_scale = bool(int(os.environ.get("LN_SCALE", "1")))  # RMSNorm scale by 1/sqrt(layer+1)
```
This record is described as “CountInitBigram…”, but Hyperparameters.bigram_logit_head defaults to disabled (BIGRAM_LOGIT_HEAD default "0"). With the current defaults, the count-based init block won’t run and the exact bigram logit head won’t be part of the model unless the env var is set. Consider enabling it by default for this record (or ensure the README/run command documents the required env var).
```python
with open("final_model.int8.ptz", "wb") as f:
    f.write(quant_blob)
quant_file_bytes = os.path.getsize("final_model.int8.ptz")
code_bytes = len(code.encode("utf-8"))
log0(f"Serialized model int6+{_COMPRESSOR}: {quant_file_bytes} bytes")
log0(f"Total submission size int8+zlib: {quant_file_bytes + code_bytes} bytes")
```
The quantization/compression logging labels are inconsistent with what the code actually does: it may use zstd (_COMPRESSOR == "zstd") and mixed int6/int5 plus optional int4-packed tensors, but the log line still says Total submission size int8+zlib, and later messages are hard-coded to final_int8_zlib_roundtrip. Please make these log labels reflect the chosen compressor/quantization so the recorded logs and reported metadata aren’t misleading.
```python
    f"final_int8_zlib_roundtrip val_loss:{q_val_loss:.4f} val_bpb:{q_val_bpb:.4f} "
    f"eval_time:{1000.0 * (time.perf_counter() - t_qeval):.0f}ms"
)
log0(f"final_int8_zlib_roundtrip_exact val_loss:{q_val_loss:.8f} val_bpb:{q_val_bpb:.8f}")
```
Related to the above: the final evaluation log keys are hard-coded as final_int8_zlib_roundtrip* even when using zstd and mixed int6/int5 (+ int4 packed) roundtrip. Consider renaming these log keys or including the compressor/quant scheme in the key/value so downstream parsing and human readers don’t misinterpret the result.
```diff
-    f"final_int8_zlib_roundtrip val_loss:{q_val_loss:.4f} val_bpb:{q_val_bpb:.4f} "
-    f"eval_time:{1000.0 * (time.perf_counter() - t_qeval):.0f}ms"
-)
-log0(f"final_int8_zlib_roundtrip_exact val_loss:{q_val_loss:.8f} val_bpb:{q_val_bpb:.8f}")
+    f"final_mixed_int6_{_COMPRESSOR}_roundtrip val_loss:{q_val_loss:.4f} val_bpb:{q_val_bpb:.4f} "
+    f"eval_time:{1000.0 * (time.perf_counter() - t_qeval):.0f}ms"
+)
+log0(
+    f"final_mixed_int6_{_COMPRESSOR}_roundtrip_exact "
+    f"val_loss:{q_val_loss:.8f} val_bpb:{q_val_bpb:.8f}"
+)
```
Community Review — Record: 10L CountInitBigram + XSA + PartialRoPE (val_bpb=1.1522)

BPB: 1.1522 | Compliance: LOOKS CLEAN — pure-neural submission, no TTT/SLOT/n-gram-cache

What I found in the code (head SHA): static code review found no TTT adaptation function, no SLOT optimization loop, no n-gram-cache class, and no pre-quant val-token fine-tune. The eval path uses the standard sliding-window stride-64 pattern. The submission is a pure-neural architecture iteration on the standard SP1024/SP4096/SP8192 baseline.

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.02s, dim=512, layers=10, vocab=1024, code=61667 B, SMOKE_TEST_PASS.

Verdict: LOOKS CLEAN. Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual record-track checks (3-seed validation, under-16MB artifact cap, ≤600s train + ≤600s eval on 8×H100 SXM). No compliance flags from the classification pass — this looks like a clean pure-neural iteration on the standard baseline.

Auto-classification caveat: this review was drafted by the AST-based classifier. If there's a non-standard eval mechanism (logit postprocessing, hedge mixing, etc.) that I missed because it's factored into a helper file or a non-standard function name, please flag it and I'll re-run the audit manually.

Reviewed by @MatoTeziTanka — The Agora. Classification via deterministic AST-based
## Summary

### Novel Contributions

**Count-Initialized Exact Bigram Logit Head**
A 1024x1024 lookup table initialized from corpus bigram transition probabilities (B[a,b] = log p(b|a) - log p(b)) before training begins. Provides a strong Markov prior from step 0. Applied BEFORE logit softcap. Int4 nibble-packed (524KB).
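A count-based initialization of this form can be sketched as follows. This is a hedged sketch, not the record's code: the additive-smoothing constant `alpha` is an assumption introduced so unseen bigrams get a finite (strongly negative) bias:

```python
import numpy as np

def count_init_bigram(tokens, vocab=1024, alpha=1.0):
    """Build B[a, b] = log p(b|a) - log p(b) from corpus bigram counts.

    `alpha` is an assumed smoothing constant; the record may handle
    unseen bigrams differently.
    """
    counts = np.full((vocab, vocab), alpha, dtype=np.float64)
    # Unbuffered accumulation over all adjacent (a, b) pairs
    np.add.at(counts, (tokens[:-1], tokens[1:]), 1.0)
    p_b_given_a = counts / counts.sum(axis=1, keepdims=True)   # p(b|a)
    unigram = counts.sum(axis=0) / counts.sum()                # p(b)
    return (np.log(p_b_given_a) - np.log(unigram)).astype(np.float32)
```

Subtracting `log p(b)` makes the table a pure transition correction on top of whatever unigram statistics the learned logits already capture.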
**Int4 Nibble Packing**
Custom pack_i4/unpack_i4 routines that pack pairs of signed int4 values into single uint8 bytes, halving the bigram table storage.
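Nibble packing of this kind can be sketched as below. The names mirror the described `pack_i4`/`unpack_i4`, but the implementation details (padding, nibble order) are assumptions, not the record's code:

```python
import numpy as np

def pack_i4(x):
    """Pack signed int4 values (range -8..7) into uint8 bytes, two per byte."""
    assert x.min() >= -8 and x.max() <= 7
    u = (x.astype(np.int16) & 0xF).astype(np.uint8)  # two's-complement nibbles
    if u.size % 2:                                   # pad to an even count
        u = np.concatenate([u, np.zeros(1, np.uint8)])
    return (u[0::2] | (u[1::2] << 4)).astype(np.uint8)

def unpack_i4(packed, n):
    """Inverse of pack_i4: recover the first n signed int4 values."""
    lo = (packed & 0xF).astype(np.int8)
    hi = ((packed >> 4) & 0xF).astype(np.int8)
    u = np.empty(packed.size * 2, dtype=np.int8)
    u[0::2], u[1::2] = lo, hi
    u[u >= 8] -= 16                                  # sign-extend each nibble
    return u[:n]
```

For a 1024x1024 int4 table this stores 1048576 values in 524288 bytes, matching the ~524KB figure above.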
### Adopted Techniques

### Results

Built on baseline by @thwu1 (PR #180).