
Record: 10L CountInitBigram + XSA + PartialRoPE (val_bpb=1.1522)#485

Open
harsha-gouru wants to merge 1 commit into openai:main from harsha-gouru:submission/2026-03-22_CountInitBigram_XSA

Conversation

@harsha-gouru

Summary

  • val_bpb: 1.1522 (sliding window stride=64, post int5/int6+zstd quantization roundtrip)
  • 10 layers, 512 dim, 8 heads / 4 KV heads, tied embeddings
  • Artifact: 15,384,232 bytes (15.38 MB)

Novel Contributions

Count-Initialized Exact Bigram Logit Head

A 1024x1024 lookup table initialized from corpus bigram transition probabilities (B[a,b] = log p(b|a) - log p(b)) before training begins. Provides a strong Markov prior from step 0. Applied BEFORE logit softcap. Int4 nibble-packed (524KB).
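A minimal sketch of this initialization in NumPy (the Laplace smoothing constant and the flat `tokens` id array are assumptions; the actual init lives in `train_gpt.py`):

```python
import numpy as np

def count_init_bigram(tokens: np.ndarray, vocab: int = 1024, alpha: float = 1.0) -> np.ndarray:
    """Build B[a, b] = log p(b|a) - log p(b) from corpus bigram counts."""
    counts = np.zeros((vocab, vocab), dtype=np.float64)
    np.add.at(counts, (tokens[:-1], tokens[1:]), 1.0)  # accumulate bigram counts
    counts += alpha                                     # Laplace smoothing (assumed)
    log_cond = np.log(counts / counts.sum(axis=1, keepdims=True))  # log p(b|a)
    uni = counts.sum(axis=0)
    log_uni = np.log(uni / uni.sum())                              # log p(b)
    return (log_cond - log_uni[None, :]).astype(np.float32)
```

At inference, row B[prev_token] is added to the logits before the softcap, which is what supplies the Markov prior from step 0.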

Int4 Nibble Packing

Custom pack_i4/unpack_i4 for signed int4 values into uint8 bytes. Halves the bigram table storage.
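A sketch of what such packing can look like (the exact `pack_i4`/`unpack_i4` are in `train_gpt.py`; two's-complement nibbles with the low nibble first are assumptions here):

```python
import numpy as np

def pack_i4(x: np.ndarray) -> np.ndarray:
    """Pack signed int4 values in [-8, 7] two-per-byte (low nibble first)."""
    assert x.size % 2 == 0
    u = (x.astype(np.int16) & 0xF).astype(np.uint8)  # two's-complement nibbles
    return (u[0::2] | (u[1::2] << 4)).astype(np.uint8)

def unpack_i4(b: np.ndarray) -> np.ndarray:
    """Inverse of pack_i4: recover signed int4 values from packed bytes."""
    lo = (b & 0xF).astype(np.int8)
    hi = (b >> 4).astype(np.int8)
    out = np.empty(b.size * 2, dtype=np.int8)
    out[0::2], out[1::2] = lo, hi
    return np.where(out > 7, out - 16, out).astype(np.int8)  # sign-extend
```

Packing a 1024x1024 int4 table this way costs 1024*1024/2 = 524,288 bytes, matching the ~524KB figure above.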

Adopted Techniques

  • XSA on last 4 layers (arxiv:2603.09078)
  • Partial RoPE 16/64 dims
  • LN Scale 1/sqrt(layer+1)
  • Higher LR (0.025)
  • SmearGate, OrthoInit, U-Net skips, SWA, int5/int6 + zstd-22
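For the Partial RoPE item, a NumPy sketch that rotates only the first 16 of 64 head dims and passes the rest through unrotated (the half-split pairing and base 10000 are assumptions, not confirmed by the PR):

```python
import numpy as np

def partial_rope(x: np.ndarray, rope_dims: int = 16, base: float = 10000.0) -> np.ndarray:
    """Apply RoPE to the first rope_dims of each head; x: (..., seq, head_dim)."""
    seq = x.shape[-2]
    half = rope_dims // 2
    inv_freq = base ** (-np.arange(half) / half)
    ang = np.arange(seq)[:, None] * inv_freq[None, :]  # (seq, half) angles
    cos, sin = np.cos(ang), np.sin(ang)
    x1 = x[..., :half]                  # first half of the rotated slice
    x2 = x[..., half:rope_dims]         # second half, paired with x1
    rotated = np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x[..., rope_dims:]], axis=-1)  # dims 16+ untouched
```

Position 0 is left unchanged (angle 0), and dims 16 through 63 carry no positional rotation at all.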

Results

  • Steps: 6267 at 95.75 ms/step (8xH100 SXM, 600s wallclock)
  • Pre-SWA val_bpb: 1.1563
  • Post-SWA+quant val_bpb: 1.1522
  • Artifact: 15.38 MB (0.62 MB headroom)
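The int5/int6 roundtrip in the results can be sketched as symmetric per-tensor quantization (an assumption: the record uses a mixed scheme, and the scale handling here is purely illustrative):

```python
import numpy as np

def quantize_roundtrip(w: np.ndarray, bits: int = 6):
    """Symmetric intN quantize; int8 storage is valid for bits <= 8."""
    qmax = 2 ** (bits - 1) - 1                      # 31 for int6, 15 for int5
    scale = float(np.abs(w).max()) / qmax or 1.0    # guard against all-zero tensors
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale
```

Evaluating val_bpb on the dequantized weights, as the roundtrip eval in the log does, is what produces the post-quantization number actually reported.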

Built on baseline by @thwu1 (PR #180).

@harsha-gouru harsha-gouru marked this pull request as ready for review March 23, 2026 00:59
Copilot AI review requested due to automatic review settings March 23, 2026 00:59
Contributor

Copilot AI left a comment


Pull request overview

Adds a new 10-minute / 16MB track record entry for a 10-layer GPT variant that combines an exact count-initialized bigram logit bias (int4 nibble-packed) with XSA, Partial RoPE, LN scaling, and mixed int5/int6(+int4) quantization + compression, along with training logs and submission metadata.

Changes:

  • Added a new record folder with training script (train_gpt.py), run log, README, and submission.json.
  • Implemented count-based initialization for an exact bigram logit head and int4 nibble packing for compact storage.
  • Added/used XSA on last layers, Partial RoPE, LN scaling, SWA, and mixed quantization + zstd/zlib export with roundtrip evaluation.

Reviewed changes

Copilot reviewed 3 out of 4 changed files in this pull request and generated 6 comments.

File Description
records/track_10min_16mb/2026-03-22_10L_CountInitBigram_XSA_PartialRoPE/train.log Captures training + eval/roundtrip outputs for the submitted run.
records/track_10min_16mb/2026-03-22_10L_CountInitBigram_XSA_PartialRoPE/train_gpt.py Full training/eval/quantization script implementing the techniques used by the record.
records/track_10min_16mb/2026-03-22_10L_CountInitBigram_XSA_PartialRoPE/submission.json Declares reported metrics, sizes, and submission metadata for the record.
records/track_10min_16mb/2026-03-22_10L_CountInitBigram_XSA_PartialRoPE/README.md Documents the approach, run command, hyperparameters, and results for the record.


Comment on lines +7 to +12
```bash
torchrun --standalone --nproc_per_node=8 train_gpt.py
```

All parameters are set as defaults in `train_gpt.py`. No env vars needed.


Copilot AI Mar 23, 2026


README claims “All parameters are set as defaults … No env vars needed.”, but train_gpt.py defaults BIGRAM_LOGIT_HEAD to 0, so running the command as written won’t enable the count-initialized exact bigram logit head described in this record. Either set the default to enabled for this record script, or update the run command/docs to include BIGRAM_LOGIT_HEAD=1 (and any other required env vars) so results are reproducible.

Copilot uses AI. Check for mistakes.

## Architecture
- 10 layers, 512 dim, 8 heads, 4 KV heads (GQA)
- MLP 3x expansion (hidden=1536), relu squared activation

Copilot AI Mar 23, 2026


The Architecture section says “relu squared activation”, but the implementation uses leaky_relu(..., negative_slope=0.5) followed by square(). Please update the README to match the actual activation, or adjust the code if ReLU^2 was intended.

Suggested change
- MLP 3x expansion (hidden=1536), relu squared activation
- MLP 3x expansion (hidden=1536), leaky relu squared activation (negative_slope=0.5)
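For reference, the two activations at issue in this comment, sketched in NumPy:

```python
import numpy as np

def relu_squared(x: np.ndarray) -> np.ndarray:
    """ReLU^2: negatives are zeroed before squaring."""
    return np.maximum(x, 0.0) ** 2

def leaky_relu_squared(x: np.ndarray, negative_slope: float = 0.5) -> np.ndarray:
    """leaky_relu followed by square: negatives contribute (0.5*x)**2, not 0."""
    return np.where(x > 0, x, negative_slope * x) ** 2
```

The two differ only on negative inputs, which is exactly the discrepancy this review flags.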

Comment on lines +57 to +64
## Training Hyperparameters
- Muon optimizer: matrix_lr=0.025, WD=0.04, momentum=0.99
- AdamW for embeddings/scalars: WD=0.04
- warmdown=2800 iters, warmup=20 steps
- seq_len=2048, batch=786K tokens
- grad_clip=0.3, 3% magnitude pruning
- SWA: start_frac=0.4, every=50 steps (22 checkpoints)
- Sliding window eval: stride=64

Copilot AI Mar 23, 2026


README lists “warmdown=2800 iters”, but train_gpt.py defaults WARMDOWN_ITERS to 3000. Please reconcile the README with the actual default hyperparameters used for this record (or explicitly document the env var override used for the run).

Comment on lines +89 to +105
bigram_vocab_size = int(os.environ.get("BIGRAM_VOCAB_SIZE", 4096))
bigram_dim = int(os.environ.get("BIGRAM_DIM", 128))

swa_enabled = bool(int(os.environ.get("SWA_ENABLED", "1")))
swa_start_frac = float(os.environ.get("SWA_START_FRAC", 0.4))
swa_every = int(os.environ.get("SWA_EVERY", 50))

ema_enabled = bool(int(os.environ.get("EMA_ENABLED", "0")))
ema_decay = float(os.environ.get("EMA_DECAY", 0.997))

# Exact bigram logit head (can be enabled alongside BigramHash)
bigram_logit_head = bool(int(os.environ.get("BIGRAM_LOGIT_HEAD", "0")))

# Architectural improvements from top PRs
xsa_last_n = int(os.environ.get("XSA_LAST_N", 4)) # XSA on last N layers
rope_dims = int(os.environ.get("ROPE_DIMS", 16)) # Partial RoPE: rotate this many dims
ln_scale = bool(int(os.environ.get("LN_SCALE", "1"))) # RMSNorm scale by 1/sqrt(layer+1)

Copilot AI Mar 23, 2026


This record is described as “CountInitBigram…”, but Hyperparameters.bigram_logit_head defaults to disabled (BIGRAM_LOGIT_HEAD default "0"). With the current defaults, the count-based init block won’t run and the exact bigram logit head won’t be part of the model unless the env var is set. Consider enabling it by default for this record (or ensure the README/run command documents the required env var).

Comment on lines +1371 to +1376
with open("final_model.int8.ptz", "wb") as f:
    f.write(quant_blob)
quant_file_bytes = os.path.getsize("final_model.int8.ptz")
code_bytes = len(code.encode("utf-8"))
log0(f"Serialized model int6+{_COMPRESSOR}: {quant_file_bytes} bytes")
log0(f"Total submission size int8+zlib: {quant_file_bytes + code_bytes} bytes")

Copilot AI Mar 23, 2026


The quantization/compression logging labels are inconsistent with what the code actually does: it may use zstd (_COMPRESSOR == "zstd") and mixed int6/int5 plus optional int4-packed tensors, but the log line still says Total submission size int8+zlib, and later messages are hard-coded to final_int8_zlib_roundtrip. Please make these log labels reflect the chosen compressor/quantization so the recorded logs and reported metadata aren’t misleading.

Comment on lines +1408 to +1411
f"final_int8_zlib_roundtrip val_loss:{q_val_loss:.4f} val_bpb:{q_val_bpb:.4f} "
f"eval_time:{1000.0 * (time.perf_counter() - t_qeval):.0f}ms"
)
log0(f"final_int8_zlib_roundtrip_exact val_loss:{q_val_loss:.8f} val_bpb:{q_val_bpb:.8f}")

Copilot AI Mar 23, 2026


Related to the above: the final evaluation log keys are hard-coded as final_int8_zlib_roundtrip* even when using zstd and mixed int6/int5 (+ int4 packed) roundtrip. Consider renaming these log keys or including the compressor/quant scheme in the key/value so downstream parsing and human readers don’t misinterpret the result.

Suggested change
f"final_int8_zlib_roundtrip val_loss:{q_val_loss:.4f} val_bpb:{q_val_bpb:.4f} "
f"eval_time:{1000.0 * (time.perf_counter() - t_qeval):.0f}ms"
)
log0(f"final_int8_zlib_roundtrip_exact val_loss:{q_val_loss:.8f} val_bpb:{q_val_bpb:.8f}")
f"final_mixed_int6_{_COMPRESSOR}_roundtrip val_loss:{q_val_loss:.4f} val_bpb:{q_val_bpb:.4f} "
f"eval_time:{1000.0 * (time.perf_counter() - t_qeval):.0f}ms"
)
log0(
f"final_mixed_int6_{_COMPRESSOR}_roundtrip_exact "
f"val_loss:{q_val_loss:.8f} val_bpb:{q_val_bpb:.8f}"
)

@MatoTeziTanka

Community Review — Record: 10L CountInitBigram + XSA + PartialRoPE (val_bpb=1.1522)

BPB: 1.1522 | Compliance: LOOKS CLEAN — pure-neural submission, no TTT/SLOT/n-gram-cache

What I found in the code (head SHA 142228728144, file records/track_10min_16mb/2026-03-22_10L_CountInitBigram_XSA_PartialRoPE/train_gpt.py):

Static code review found no TTT adaptation function, no SLOT optimization loop, no n-gram-cache class, and no pre-quant val-token fine-tune. The eval path uses the standard sliding-window stride-64 pattern. The submission is a pure-neural architecture iteration on the standard SP1024/SP4096/SP8192 baseline.

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.02s, dim=512, layers=10, vocab=1024, code=61667 B, SMOKE_TEST_PASS

Verdict: LOOKS CLEAN.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual record-track checks (3-seed validation, under-16MB artifact cap, ≤600s train + ≤600s eval on 8×H100 SXM). No compliance flags from the classification pass — this looks like a clean pure-neural iteration on the standard baseline.

Auto-classification caveat: this review was drafted by the AST-based classifier. If there's a non-standard eval mechanism (logit postprocessing, hedge mixing, etc.) that I missed because it's factored into a helper file or a non-standard function name, please flag it and I'll re-run the audit manually.


Reviewed by @MatoTeziTanka (The Agora). Classification via deterministic AST-based classify_prs.py (pattern bank derived from ~65 manually-reviewed PRs earlier in the 2026-04-11 sweep). This review was auto-drafted from a template and spot-checked before posting; if the template misread your code, please call it out so I can iterate the classifier.
