Value Residual + DenseFormer DWA + TTT#491

Closed
ahmettrkck wants to merge 6 commits into openai:main from ahmettrkck:value-resid-dwa-ttt

Conversation


@ahmettrkck ahmettrkck commented Mar 23, 2026

Summary

  • Value Residual (ResFormer): cache the first layer's V vectors and blend them into every subsequent layer via V_used = 0.5 * (V_n + V_1). Zero extra parameters, ~10% parameter-efficiency gain (arXiv:2410.17897).
  • DenseFormer DWA: Replace U-Net skip connections with full depth-weighted averaging over all previous layer outputs. ~65 scalar params for 10 layers. Strictly more general than U-Net skips (arXiv:2402.02622).
  • Test-Time Training: AdamW fine-tuning on already-scored validation tokens during the sliding-window eval. The score-then-train strategy ensures no token influences its own score.
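
The two architectural changes above can be sketched in a few lines of NumPy (function names and shapes here are illustrative, not the actual model code):

```python
import numpy as np

def value_residual(v_n, v_1):
    # ResFormer blend: average the current layer's value vectors with
    # the cached first-layer values; adds no learned parameters.
    return 0.5 * (v_n + v_1)

def dwa(layer_outputs, weights):
    # DenseFormer depth-weighted average: combine all previous layer
    # outputs with learned scalar weights (one scalar per past layer),
    # which subsumes a plain U-Net skip as a special case.
    return sum(w * o for w, o in zip(weights, layer_outputs))
```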

Built on top of the current #1 submission (10L_Int5MLP_MuonWD04_SWA50 by thwu1).

Status

Pending 8xH100 run logs. Will update with 3-seed results once compute is available.

Test plan

  • Run 3 seeds (42, 1337, 2024) on 8xH100, each in under 10 min
  • Verify artifact size < 16 MB
  • Confirm val_bpb improvement > 0.005 bits per byte over the current SOTA
  • Add train logs to PR

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ahmettrkck and others added 5 commits March 25, 2026 17:32
- Replace full-weight TTT with LoRA rank-8 on Q+V+LM head
- Per-document adaptation with reset between docs
- Multi-epoch (5) with cosine LR decay (0.01 -> 0.001)
- Score-then-train per 256-token chunk
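
The LoRA parametrization used for TTT in this commit can be sketched as follows (a minimal NumPy illustration; the matrix names and `scale` parameter are assumptions, not the PR's actual code):

```python
import numpy as np

def lora_forward(x, W, A, B, scale=1.0):
    # LoRA: frozen base weight W plus a low-rank update B @ A.
    # Only A (r x d_in) and B (d_out x r) are updated at test time,
    # e.g. rank r = 8 on the Q, V, and LM-head projections.
    return x @ W.T + ((x @ A.T) @ B.T) * scale
```

With B initialized to zero, the adapted layer starts out identical to the frozen base layer, so adaptation begins from the pretrained behavior.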

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Short docs called forward() without LoRA, which returns a 0-dim mean-loss scalar; the code then tried to index it with nll[0]. Use forward_logits + cross_entropy to get per-token NLLs instead.
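
The fixed pattern, computing an indexable per-token NLL vector from raw logits rather than relying on a mean-reduced loss, looks roughly like this (a NumPy sketch, not the repo's forward_logits implementation):

```python
import numpy as np

def per_token_nll(logits, targets):
    # Per-token NLL via log-softmax + gather, so even a one-token
    # document yields a 1-D vector (indexable with nll[0]) rather
    # than a 0-dim mean scalar.
    logits = logits - logits.max(axis=-1, keepdims=True)
    logp = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -logp[np.arange(len(targets)), targets]
```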

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Multi-epoch TTT was ruled invalid by organizers (PR #568 closed).
Now: score each chunk BEFORE training, single pass, each token
scored exactly once. Matches PR #77 pattern.
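
The legal single-pass pattern, score each chunk with the current weights before any update touches it, can be sketched as (method names and the 256-token chunking are illustrative):

```python
def sliding_window_eval(model, tokens, chunk=256):
    # Score-then-train: each chunk is scored exactly once, BEFORE the
    # model is fine-tuned on it, so no token influences its own score.
    total_nll, n = 0.0, 0
    for i in range(0, len(tokens), chunk):
        c = tokens[i:i + chunk]
        total_nll += model.score(c)   # scored with pre-update weights
        n += len(c)
        model.train_step(c)           # adapt only after scoring
    return total_nll / n
```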

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Start from PR #549 base (33 techniques, 1.1194 bpb)
- Add CTW-weighted n-gram backoff (orders 2-7, logistic domain mixing)
- Replace heuristic alpha with provably optimal Bayesian model averaging
- Keep legal score-first TTT
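
A minimal two-model version of the Bayesian model averaging used for the mixing weight might look like this (a sketch under the assumption that alpha is the posterior weight on the n-gram model given each model's likelihood on previously scored tokens; not the PR's actual code):

```python
import numpy as np

def bma_alpha(prior, p_neural_hist, p_ngram_hist):
    # Posterior weight on the n-gram model: proportional to its prior
    # times its likelihood on the tokens scored so far. This replaces
    # a hand-tuned heuristic alpha with the Bayes-optimal mixture.
    l_ngram = prior * np.prod(p_ngram_hist)
    l_neural = (1.0 - prior) * np.prod(p_neural_hist)
    return l_ngram / (l_ngram + l_neural)
```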

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Logistic (logit-space) domain mixing was the wrong operation for mixing target probabilities.
PR #753 uses linear: p_mixed = (1-a)*p_neural + a*p_ngram.
Keep CTW-inspired depth-adaptive alpha boost.
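
The corrected linear mixing is a one-liner; unlike logit-space mixing, it interpolates directly in probability space, so the result is always a valid probability when both inputs are:

```python
def mix_probs(p_neural, p_ngram, alpha):
    # Linear interpolation in probability space:
    # p_mixed = (1 - alpha) * p_neural + alpha * p_ngram
    return (1.0 - alpha) * p_neural + alpha * p_ngram
```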

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@ahmettrkck ahmettrkck closed this Mar 25, 2026
