
Record: 11L Int6 QAT + Warmdown (val_bpb=1.3267, 1xH100)#488

Open
pkim02 wants to merge 1 commit into openai:main from pkim02:11L-int6-qat-warmdown

Conversation


@pkim02 pkim02 commented Mar 23, 2026

Summary

  • Best result: val_bpb = 1.3267 (sliding-window eval with stride=64, after int6 quantization, 1xH100, 600s)
  • 11 layers, MLP3x, GQA with 8 query / 4 KV heads, 26.5M params, 13.3 MB submission size
  • Systematic ablation study across 6 candidate scripts

Techniques

  • Int6 grouped quantization (group_rows=8) for all weights
  • QAT: STE fake-quantization during the last 15% of wallclock, which halves the quantization gap
  • zstd-22 compression (saves ~1 MB vs zlib)
  • Wallclock-fraction warmdown (last 15%): replaces the buggy iteration-based formula, which breaks when torch.compile overhead shifts the iteration budget
  • SWA: 7 checkpoints collected during warmdown, guarded against premature collection
  • Sliding-window eval with stride=64
  • Muon optimizer with WD=0.04, momentum warmup 0.92→0.99
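The quantization and QAT bullets above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the symmetric int6 range, absmax scaling, and the per-8-row grouping layout are assumptions inferred from `group_rows=8`.

```python
import torch

QMAX = 31  # symmetric int6 range [-31, 31] (assumed convention)

def int6_group_quant(w: torch.Tensor, group_rows: int = 8):
    """One absmax-derived scale per block of `group_rows` rows (assumed layout)."""
    rows, cols = w.shape
    grouped = w.reshape(rows // group_rows, group_rows * cols)
    scale = grouped.abs().amax(dim=1, keepdim=True).clamp_min(1e-8) / QMAX
    q = torch.clamp(torch.round(grouped / scale), -QMAX, QMAX)
    return q.to(torch.int8), scale

def int6_group_dequant(q: torch.Tensor, scale: torch.Tensor, shape):
    """Undo int6_group_quant: rescale each group and restore the weight shape."""
    return (q.float() * scale).reshape(shape)

def ste_fake_quant(w: torch.Tensor, group_rows: int = 8) -> torch.Tensor:
    """QAT forward pass: quantize-dequantize the weight; the straight-through
    estimator passes gradients through as if quantization were the identity."""
    q, scale = int6_group_quant(w.detach(), group_rows)
    wq = int6_group_dequant(q, scale, w.shape)
    return w + (wq - w).detach()
```

During the final 15% of wallclock, each weight would be routed through `ste_fake_quant` in the forward pass, so the optimizer adapts to the rounding error before the final post-training quantization.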

Ablation results (1xH100, 600s wallclock)

| Variant | Sliding bpb | Size |
| --- | --- | --- |
| Clean 11L Int6 QAT (baseline A) | 1.3761 | 14.0 MB |
| A + BigramHash + SmearGate | 1.3796 | 14.4 MB |
| A + Mixed Int5/Int6 | 1.3874 | 11.6 MB |
| A + Warmdown + SWA | **1.3267** | 13.3 MB |
| A + Warmdown + Late EMA | 1.3436 | 13.2 MB |
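The SWA step in the winning variant reduces, in its simplest form, to a uniform average of checkpoint state dicts. A minimal sketch, assuming all-float tensors with identical keys; the 7-checkpoint warmdown schedule and the premature-collection guard are not reproduced here:

```python
import torch

def swa_average(state_dicts):
    """Uniform average of model state_dicts collected during warmdown."""
    avg = {k: v.detach().clone().float() for k, v in state_dicts[0].items()}
    for sd in state_dicts[1:]:
        for k in avg:
            avg[k] += sd[k].float()
    n = float(len(state_dicts))
    return {k: v / n for k, v in avg.items()}
```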

Test plan

  • Verify on 8xH100 with torchrun --nproc_per_node=8
  • Confirm submission size < 16 MB
  • Validate sliding window bpb matches reported numbers

🤖 Generated with Claude Code

…=1.3267)

Adds 6 candidate scripts from systematic ablation study on 1xH100:

- 2026-03-20 SmearGate+BigramHash+SWA candidate (original starting point)
- 2026-03-22 Clean_11L_Int6_QAT: stripped baseline (sliding bpb=1.3761)
- 2026-03-22 Clean_11L_Int6_QAT_BigramSmear: +bigram+smear ablation (1.3796)
- 2026-03-22 Clean_11L_Mixed_Int5Int6_QAT: int5 MLP / int6 attn (1.3874, 11.6MB)
- 2026-03-22 Phase2_A_Warmdown: wallclock-fraction warmdown + SWA (1.3267, best)
- 2026-03-23 Phase2_A_Warmdown_EMA: late-start EMA during warmdown (1.3436)

Key techniques: int6 grouped quantization, QAT (STE fake-quant last 15%),
zstd-22 compression, sliding-window eval stride=64, wallclock-fraction
warmdown schedule (fixes the buggy iter-based warmdown), Muon WD=0.04.
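A wallclock-fraction warmdown might look like the sketch below (the function and constants are illustrative; only the "last 15% of the 600 s budget" detail comes from the PR). Keying the schedule to elapsed time rather than an iteration count keeps it correct even when torch.compile eats an unpredictable share of the budget:

```python
import time

def warmdown_lr_scale(start_time: float, budget_s: float = 600.0,
                      warmdown_frac: float = 0.15) -> float:
    """LR multiplier: 1.0 until the final `warmdown_frac` of the wallclock
    budget, then linear decay to 0.0 at the deadline."""
    elapsed_frac = (time.monotonic() - start_time) / budget_s
    warmdown_start = 1.0 - warmdown_frac
    if elapsed_frac < warmdown_start:
        return 1.0
    return max(0.0, (1.0 - elapsed_frac) / warmdown_frac)
```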

Also updates root README with non-MLX smoke test instructions and fixes
enable_gqa compatibility for PyTorch <2.5.
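The `enable_gqa` flag landed in `F.scaled_dot_product_attention` in PyTorch 2.5, so older versions need a fallback. One way such a compatibility shim could look (a sketch, not the PR's actual code):

```python
import torch
import torch.nn.functional as F

# enable_gqa was added to scaled_dot_product_attention in torch 2.5
_TORCH_MAJOR_MINOR = tuple(int(p) for p in torch.__version__.split("+")[0].split(".")[:2])
_SDPA_HAS_GQA = _TORCH_MAJOR_MINOR >= (2, 5)

def gqa_attention(q, k, v, is_causal=True):
    """q: (B, Hq, T, D); k, v: (B, Hkv, T, D) with Hq divisible by Hkv."""
    if _SDPA_HAS_GQA:
        return F.scaled_dot_product_attention(
            q, k, v, is_causal=is_causal, enable_gqa=True)
    # PyTorch < 2.5: materialize the shared KV heads instead.
    rep = q.shape[1] // k.shape[1]
    return F.scaled_dot_product_attention(
        q,
        k.repeat_interleave(rep, dim=1),
        v.repeat_interleave(rep, dim=1),
        is_causal=is_causal,
    )
```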

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@MatoTeziTanka

Community Review — Record: 11L Int6 QAT + Warmdown (val_bpb=1.3267, 1xH100)

BPB: 1.3267 | Compliance: LOOKS CLEAN — pure-neural submission, no TTT/SLOT/n-gram-cache

What I found in the code (head SHA af3eeb9d9817, file records/track_10min_16mb/2026-03-20_11L_Int6_MLP3x_SmearGate_BigramHash_SWA_Candidate/train_gpt.py):

Static code review found no TTT adaptation function, no SLOT optimization loop, no n-gram-cache class, and no pre-quant val-token fine-tune. The eval path uses the standard sliding-window stride-64 pattern. The submission is a pure-neural architecture iteration on the standard SP1024/SP4096/SP8192 baseline.
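For reference, the sliding-window pattern described above typically scores each token exactly once, with up to a full window of left context, advancing by the stride. A sketch under that assumption (`model` here is any callable mapping a (1, T) token tensor to (1, T, V) logits; true bpb would divide the total bits by the byte length of the eval text):

```python
import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def sliding_window_nll_bits(model, tokens, window=1024, stride=64):
    """Total negative log-likelihood in bits over `tokens`, scoring each
    target once via overlapping windows advanced by `stride`."""
    n = tokens.numel()
    total_bits, scored, prev_end = 0.0, 0, 1
    for begin in range(0, n, stride):
        end = min(begin + window, n)
        logits = model(tokens[begin:end].unsqueeze(0))  # (1, T, V)
        logp = F.log_softmax(logits.float(), dim=-1)[0]
        first = max(begin + 1, prev_end)  # skip targets already scored
        tgt = tokens[first:end]
        # logits at local position i predict token i+1
        pred = logp[first - begin - 1 : end - begin - 1]
        total_bits -= pred.gather(1, tgt.unsqueeze(1)).sum().item() / math.log(2)
        scored += tgt.numel()
        prev_end = end
        if end == n:
            break
    return total_bits, scored
```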

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.06s, dim=512, layers=11, vocab=1024, code=93784 B, SMOKE_TEST_PASS

Verdict: LOOKS CLEAN.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual record-track checks (3-seed validation, under-16MB artifact cap, ≤600s train + ≤600s eval on 8×H100 SXM). No compliance flags from the classification pass — this looks like a clean pure-neural iteration on the standard baseline.

Auto-classification caveat: this review was drafted by the AST-based classifier. If there's a non-standard eval mechanism (logit postprocessing, hedge mixing, etc.) that I missed because it's factored into a helper file or a non-standard function name, please flag it and I'll re-run the audit manually.


Reviewed by @MatoTeziTanka (The Agora). Classification via the deterministic AST-based classify_prs.py (pattern bank derived from ~65 manually reviewed PRs earlier in the 2026-04-11 sweep). This review was auto-drafted from a template and spot-checked before posting; if the template misread your code, please call it out so I can iterate on the classifier.

