Non-record: Shared-weight transformer with extended warmdown (1.1454 val_bpb) #470
leofeasby wants to merge 1 commit into openai:main
Conversation
Community Review — Non-record: Shared-weight transformer with extended warmdown (1.1454 val_bpb)

Compliance: LOOKS CLEAN — legal score-first-per-chunk TTT (PR #1413 pattern)
head_sha: 9f96828

Analysis

1. N-gram family bug (CLOSE trigger: target token in hash key)
The bigram hash is computed as: … where … However, the submission.json uses no environment variable overrides, and both …

2. Pre-Quant TTT (CLOSE trigger: multi-epoch AdamW on val_tokens without score-first)
The TTT implementation uses …

3. Legal TTT (CLEAN)
Confirmed: score-first-per-chunk is respected. The forward pass runs, BPB is accumulated from the scored tokens, then the LoRA adapter is updated once on the same chunk before moving to the next. Single Adam step, no epoch loop, no look-ahead. CLEAN.

4. Scored-region SLOT (HOLD)

Verdict: LOOKS CLEAN — legal TTT implementation.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual record-track checks.

Reviewed by @MatoTeziTanka — The Agora. Compliance audit via an LLM agent (Sonnet) reviewing the full train_gpt.py source. If this review misread your code, please call it out so I can re-audit manually.
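For context on item 3, here is a minimal sketch of a score-first-per-chunk TTT evaluation loop of the kind the review describes. `model`, `lora_params`, `chunks`, and `bytes_per_chunk` are hypothetical names, the model is assumed to return logits of shape (batch, seq, vocab), and the base weights are assumed frozen so only the LoRA adapter updates; this illustrates the pattern, not the submission's actual code.

```python
import math
import torch
import torch.nn.functional as F

LOG2E = 1.0 / math.log(2.0)  # nats -> bits conversion factor

def evaluate_with_ttt(model, lora_params, chunks, bytes_per_chunk, lr=1e-4):
    """Score-first-per-chunk TTT: each chunk is scored BEFORE the adapter
    sees it, so the reported BPB never benefits from training on that chunk.
    Assumes base weights have requires_grad=False; only LoRA params update."""
    opt = torch.optim.Adam(lora_params, lr=lr)
    total_bits, total_bytes = 0.0, 0
    for chunk, n_bytes in zip(chunks, bytes_per_chunk):
        inputs, targets = chunk[:-1], chunk[1:]
        # 1) Score first: forward pass with the CURRENT adapter, no grad.
        with torch.no_grad():
            logits = model(inputs.unsqueeze(0))
            nll = F.cross_entropy(logits.squeeze(0), targets, reduction="sum")
        total_bits += nll.item() * LOG2E
        total_bytes += n_bytes
        # 2) Then adapt: a single Adam step on the SAME chunk
        #    (no epoch loop, no look-ahead at future chunks).
        opt.zero_grad(set_to_none=True)
        logits = model(inputs.unsqueeze(0))
        loss = F.cross_entropy(logits.squeeze(0), targets)
        loss.backward()
        opt.step()
    return total_bits / total_bytes  # val_bpb
```

The crucial property is the ordering: each chunk contributes to BPB before the adapter takes its single step on it, so no scored token is ever predicted by weights that trained on it.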
This is a non-record submission to the 16MB track.
We study a shared-weight transformer in which a single transformer block is reused across depth (9 passes), forming a recurrent-style stack with U-Net skip connections.
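As a rough illustration, here is a minimal sketch of the shared-weight recurrence, assuming a standard block interface. The 9-pass loop and the pairing of early and late passes via U-Net-style skips are the point; the learned gating and exact wiring are assumptions, since the PR does not spell out the block internals.

```python
import torch
import torch.nn as nn

class SharedWeightStack(nn.Module):
    """One transformer block reused for `depth` passes, with U-Net-style
    skips pairing pass i with pass depth-1-i (illustrative wiring; the
    submission's exact scheme may differ)."""
    def __init__(self, block: nn.Module, depth: int = 9):
        super().__init__()
        self.block = block  # single shared-weight block, reused every pass
        self.depth = depth
        # Learned per-skip gates, initialized to zero so skips start disabled.
        self.skip_gates = nn.ParameterList(
            [nn.Parameter(torch.zeros(1)) for _ in range(depth // 2)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        skips = []
        for i in range(self.depth):
            mirror = self.depth - 1 - i
            if i > mirror:                 # second half: consume skips
                gate = self.skip_gates[self.depth - 1 - i]
                x = x + gate * skips.pop()
            x = self.block(x)              # same weights on every pass
            if i < mirror:                 # first half: stash skips
                skips.append(x)
        return x
```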
Result:
The model reaches 1.1454 val_bpb after ~2.3 hours on 8×H100, with loss still decreasing at the end of training. Training terminated due to schedule constraints (LR→0), not convergence.
Key observation:
The majority of improvement occurs during extended warmdown. The model continues improving steadily throughout the low-LR phase, with no plateau observed within the explored horizon.
This behaviour is consistent with a regime in which performance is strongly influenced by schedule alignment, potentially more than by parameter capacity, for this architecture. We do not claim this as a universal property, only as an observed characteristic of this shared-weight setup.
Notable components:
- Step-count-based warmdown start (WARMDOWN_START_STEP) to decouple schedule from wallclock (sketched below)

This submission targets long-horizon optimisation behaviour rather than the 10-minute constraint, and aims to highlight differences in convergence dynamics between shared-weight and standard transformers.
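A minimal sketch of a step-keyed schedule with an extended linear warmdown reaching LR = 0, which matches the stated termination condition. Only WARMDOWN_START_STEP is named in the PR; the other constants and the trapezoidal shape are illustrative assumptions.

```python
# Hypothetical values for illustration; only WARMDOWN_START_STEP is named in the PR.
WARMUP_STEPS = 256
WARMDOWN_START_STEP = 4_000   # schedule keyed on step count, not wallclock
TOTAL_STEPS = 20_000          # extended warmdown: most of training is low-LR
BASE_LR = 3e-3

def lr_at(step: int) -> float:
    """Trapezoidal schedule: linear warmup, constant plateau, then a long
    linear warmdown that reaches exactly 0 at TOTAL_STEPS."""
    if step < WARMUP_STEPS:
        return BASE_LR * (step + 1) / WARMUP_STEPS
    if step < WARMDOWN_START_STEP:
        return BASE_LR
    frac = (TOTAL_STEPS - step) / (TOTAL_STEPS - WARMDOWN_START_STEP)
    return BASE_LR * max(frac, 0.0)
```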