Non-record: Shared-weight transformer with extended warmdown (1.1454 val_bpb) #470
leofeasby wants to merge 1 commit into openai:main
Conversation
Community Review — Non-record: Shared-weight transformer with extended warmdown (1.1454 val_bpb)

Compliance: LOOKS CLEAN — legal score-first-per-chunk TTT (PR #1413 pattern)
head_sha: 9f96828

Analysis

1. N-gram family bug (CLOSE trigger: target token in hash key)
The bigram hash is computed as: … where … However, the submission.json uses no environment variable overrides, and both …

2. Pre-Quant TTT (CLOSE trigger: multi-epoch AdamW on val_tokens without score-first)
The TTT implementation uses …

3. Legal TTT (CLEAN)
Confirmed: score-first-per-chunk is respected. The forward pass runs, BPB is accumulated from the scored tokens, then the LoRA adapter is updated once on the same chunk before moving to the next. Single Adam step, no epoch loop, no look-ahead. CLEAN.

4. Scored-region SLOT (HOLD)

Verdict: LOOKS CLEAN — legal TTT implementation.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual record-track checks.

Reviewed by @MatoTeziTanka — The Agora. Compliance audit via an LLM agent (Sonnet) reviewing the full train_gpt.py source. If this review misread your code, please call it out so I can re-audit manually.
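For context on item 3, here is a minimal sketch of a score-first-per-chunk TTT evaluation loop of the kind the review describes. `model`, `lora_params`, `chunks`, and `bytes_per_chunk` are hypothetical names, the model is assumed to return logits of shape (batch, seq, vocab), and the base weights are assumed frozen so only the LoRA adapter updates; this illustrates the pattern, not the submission's actual code.

```python
import math
import torch
import torch.nn.functional as F

LOG2E = 1.0 / math.log(2.0)  # nats -> bits conversion factor

def evaluate_with_ttt(model, lora_params, chunks, bytes_per_chunk, lr=1e-4):
    """Score-first-per-chunk TTT: each chunk is scored BEFORE the adapter
    sees it, so the reported BPB never benefits from training on that chunk.
    Assumes base weights have requires_grad=False; only LoRA params update."""
    opt = torch.optim.Adam(lora_params, lr=lr)
    total_bits, total_bytes = 0.0, 0
    for chunk, n_bytes in zip(chunks, bytes_per_chunk):
        inputs, targets = chunk[:-1], chunk[1:]
        # 1) Score first: forward pass with the CURRENT adapter, no grad.
        with torch.no_grad():
            logits = model(inputs.unsqueeze(0))
            nll = F.cross_entropy(logits.squeeze(0), targets, reduction="sum")
        total_bits += nll.item() * LOG2E
        total_bytes += n_bytes
        # 2) Then adapt: a single Adam step on the SAME chunk
        #    (no epoch loop, no look-ahead at future chunks).
        opt.zero_grad(set_to_none=True)
        logits = model(inputs.unsqueeze(0))
        loss = F.cross_entropy(logits.squeeze(0), targets)
        loss.backward()
        opt.step()
    return total_bits / total_bytes  # val_bpb
```

The crucial property is the ordering: each chunk contributes to BPB before the adapter takes its single step on it, so no scored token is ever predicted by weights that trained on it.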
This is a non-record submission to the 16MB track.
We study a shared-weight transformer in which a single transformer block is reused across depth (9 passes), forming a recurrent-style stack with U-Net skip connections.
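As a rough illustration, here is a minimal sketch of the shared-weight recurrence, assuming a standard block interface. The 9-pass loop and the pairing of early and late passes via U-Net-style skips are the point; the learned gating and exact wiring are assumptions, since the PR does not spell out the block internals.

```python
import torch
import torch.nn as nn

class SharedWeightStack(nn.Module):
    """One transformer block reused for `depth` passes, with U-Net-style
    skips pairing pass i with pass depth-1-i (illustrative wiring; the
    submission's exact scheme may differ)."""
    def __init__(self, block: nn.Module, depth: int = 9):
        super().__init__()
        self.block = block  # single shared-weight block, reused every pass
        self.depth = depth
        # Learned per-skip gates, initialized to zero so skips start disabled.
        self.skip_gates = nn.ParameterList(
            [nn.Parameter(torch.zeros(1)) for _ in range(depth // 2)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        skips = []
        for i in range(self.depth):
            mirror = self.depth - 1 - i
            if i > mirror:                 # second half: consume skips
                gate = self.skip_gates[self.depth - 1 - i]
                x = x + gate * skips.pop()
            x = self.block(x)              # same weights on every pass
            if i < mirror:                 # first half: stash skips
                skips.append(x)
        return x
```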
Result:
The model reaches 1.1454 val_bpb after ~2.3 hours on 8×H100, with loss still decreasing at the end of training. Training terminated due to schedule constraints (LR→0), not convergence.
Key observation:
The majority of improvement occurs during extended warmdown. The model continues improving steadily throughout the low-LR phase, with no plateau observed within the explored horizon.
This behaviour is consistent with a regime in which performance is strongly influenced by schedule alignment, potentially more than by parameter capacity, for this architecture. We do not claim this as a universal property, only as an observed characteristic of this shared-weight setup.
Notable components:
- Step-count-based warmdown start (WARMDOWN_START_STEP) to decouple schedule from wallclock (sketched below)

This submission targets long-horizon optimisation behaviour rather than the 10-minute constraint, and aims to highlight differences in convergence dynamics between shared-weight and standard transformers.
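A minimal sketch of a step-keyed schedule with an extended linear warmdown reaching LR = 0, which matches the stated termination condition. Only WARMDOWN_START_STEP is named in the PR; the other constants and the trapezoidal shape are illustrative assumptions.

```python
# Hypothetical values for illustration; only WARMDOWN_START_STEP is named in the PR.
WARMUP_STEPS = 256
WARMDOWN_START_STEP = 4_000   # schedule keyed on step count, not wallclock
TOTAL_STEPS = 20_000          # extended warmdown: most of training is low-LR
BASE_LR = 3e-3

def lr_at(step: int) -> float:
    """Trapezoidal schedule: linear warmup, constant plateau, then a long
    linear warmdown that reaches exactly 0 at TOTAL_STEPS."""
    if step < WARMUP_STEPS:
        return BASE_LR * (step + 1) / WARMUP_STEPS
    if step < WARMDOWN_START_STEP:
        return BASE_LR
    frac = (TOTAL_STEPS - step) / (TOTAL_STEPS - WARMDOWN_START_STEP)
    return BASE_LR * max(frac, 0.0)
```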