
Non-record: Shared-weight transformer with extended warmdown (1.1454 val_bpb)#470

Open
leofeasby wants to merge 1 commit into openai:main from leofeasby:shared-weight-nonrecord-clean

Conversation

@leofeasby

This is a non-record submission to the 16MB track.

We study a shared-weight transformer in which a single transformer block is reused across depth (9 passes), forming a recurrent-style stack with U-Net skip connections.
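
For readers unfamiliar with the setup, a minimal sketch of the recurrence, assuming a straightforward stash-and-mirror skip scheme; class and attribute names such as `SharedWeightStack` and `skip_gains` are illustrative, not the submission's code:

```python
import torch
import torch.nn as nn

class SharedWeightStack(nn.Module):
    """One transformer block reused for n_passes forward passes, with
    U-Net style skips: activations from the first half of the passes are
    stashed and added back in mirrored order in the second half."""

    def __init__(self, block: nn.Module, n_passes: int = 9):
        super().__init__()
        self.block = block          # single set of weights, shared across depth
        self.n_passes = n_passes
        # per-pass residual mixing scalars to break symmetry between passes
        self.skip_gains = nn.Parameter(torch.ones(n_passes))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        half = self.n_passes // 2
        stash = []                  # encoder-side activations for U-Net skips
        for i in range(self.n_passes):
            if i < half:
                stash.append(x)
            elif stash:             # decoder side: fold back the mirrored skip
                x = x + self.skip_gains[i] * stash.pop()
            x = self.block(x)       # same weights on every pass
        return x
```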

Result:
The model reaches 1.1454 val_bpb after ~2.3 hours on 8×H100, with loss still decreasing at the end of training. Training terminated due to schedule constraints (LR→0), not convergence.

Key observation:
The majority of improvement occurs during extended warmdown. The model continues improving steadily throughout the low-LR phase, with no plateau observed within the explored horizon.

This behaviour is consistent with a regime in which performance is strongly influenced by schedule alignment, potentially more than by parameter capacity, for this architecture. We do not claim this as a universal property, only as an observed characteristic of this shared-weight setup.

Notable components:

  • Shared-core transformer (full weight sharing across depth)
  • Per-layer scaling (attention, MLP, residual mixing) to break symmetry
  • U-Net style skip connections across passes
  • Step-based warmdown control (WARMDOWN_START_STEP) to decouple the schedule from wallclock (see the sketch after this list)
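
A minimal sketch of the step-based control, assuming a linear decay to zero; the function and every argument name other than WARMDOWN_START_STEP are illustrative:

```python
def get_lr(step: int, base_lr: float, warmdown_start_step: int, total_steps: int) -> float:
    # Hold base_lr until warmdown_start_step, then decay linearly to zero
    # by total_steps. Keying the schedule on the step count rather than
    # elapsed time decouples it from wallclock, as described above.
    if step < warmdown_start_step:
        return base_lr
    frac = (step - warmdown_start_step) / max(1, total_steps - warmdown_start_step)
    return base_lr * max(0.0, 1.0 - frac)
```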

This submission targets long-horizon optimisation behaviour rather than the 10-minute constraint, and aims to highlight differences in convergence dynamics between shared-weight and standard transformers.

@MatoTeziTanka

Community Review — Non-record: Shared-weight transformer with extended warmdown (1.1454 val_bpb)

Compliance: LOOKS CLEAN — legal score-first-per-chunk TTT (PR #1413 pattern)

head_sha: 9f96828
pr: 470
author: leofeasby
track: non_record_unlimited_compute (non-record submission)
val_bpb: 1.1454


Analysis

1. N-gram family bug (CLOSE trigger: target token in hash key)

The bigram hash is computed as:

bigram_idx = (prev * 1619 + input_ids) % self.bigram_table_size

where prev[:, 1:] = input_ids[:, :-1] — i.e., prev is the shifted-left context, and input_ids is the current token being predicted. The key uses (prev_token, current_token), which means the target token is included in the hash lookup key. This is the N-gram family bug pattern.

However, the submission.json uses no environment variable overrides, and both BIGRAM_TABLE_SIZE and TRIGRAM_TABLE_SIZE default to 0 (lines 87–88). With both table sizes at zero, the n-gram embedding branches are never entered. The reported score of 1.1454 val_bpb was achieved with bigram/trigram disabled. Bug is present in code but not exercised in this submission. No CLOSE.
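
To make the pattern concrete, here is a minimal reconstruction of the leak alongside a leak-free variant. Only the bigram_idx line is quoted from the source; the shapes, the helper, and the suggested fix are illustrative:

```python
import torch

def bigram_keys(input_ids: torch.Tensor, table_size: int):
    """input_ids: (batch, seq). At position t the model predicts
    input_ids[:, t], so any feature at t must depend only on tokens < t."""
    prev = torch.zeros_like(input_ids)
    prev[:, 1:] = input_ids[:, :-1]          # shifted-left context

    # Buggy pattern: the key mixes prev with the CURRENT token, i.e. the
    # very token being predicted at this position leaks into the lookup.
    leaky = (prev * 1619 + input_ids) % table_size

    # Leak-free alternative: key the lookup on the two PRECEDING tokens only.
    prev2 = torch.zeros_like(input_ids)
    prev2[:, 1:] = prev[:, :-1]
    clean = (prev2 * 1619 + prev) % table_size
    return leaky, clean
```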

2. Pre-Quant TTT (CLOSE trigger: multi-epoch AdamW on val_tokens without score-first)

The TTT implementation uses torch.optim.Adam (not AdamW) with a single step per chunk (lines 985–986, 1116–1123). Crucially, the per-chunk loop in eval_val_ttt_lora follows strict score-first-then-train ordering: _accumulate_bpb is called before the Adam step on each chunk (lines 1099–1115 precede lines 1116–1123). There is no multi-epoch inner loop. This matches the legal TTT pattern (ref: PR #1413). No CLOSE.

3. Legal TTT (CLEAN)

Confirmed: score-first-per-chunk is respected. The forward pass runs, BPB is accumulated from the scored tokens, then the LoRA adapter is updated once on the same chunk before moving to the next. Single Adam step, no epoch loop, no look-ahead. CLEAN.
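
A minimal sketch of that ordering; the function and argument names here are ours (the submission's loop lives in eval_val_ttt_lora), and loss_fn is assumed to return a mean per-token loss:

```python
import torch

def eval_with_ttt(model, lora_params, chunks, loss_fn):
    """Legal score-first-per-chunk TTT: each chunk is scored BEFORE the
    single adapter update on that same chunk, so no scored token has
    already been trained on."""
    opt = torch.optim.Adam(lora_params, lr=1e-3)   # plain Adam, per the review
    total_loss, total_tokens = 0.0, 0
    for inputs, targets in chunks:
        # 1) Score first: accumulate loss on this chunk with current weights.
        with torch.no_grad():
            logits = model(inputs)
            total_loss += loss_fn(logits, targets).item() * targets.numel()
            total_tokens += targets.numel()
        # 2) Then train: exactly one Adam step on the same chunk,
        #    no inner epoch loop, no look-ahead to future chunks.
        opt.zero_grad(set_to_none=True)
        loss_fn(model(inputs), targets).backward()
        opt.step()
    return total_loss / max(1, total_tokens)
```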

4. Scored-region SLOT (HOLD)

Overall verdict: LOOKS CLEAN — legal TTT implementation.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual record-track checks.


Reviewed by @MatoTeziTanka (The Agora). Compliance audit via LLM agent (Sonnet) reviewing the full train_gpt.py source. If this review misread your code, please call it out so I can re-audit manually.

