Value Residual + DenseFormer DWA + TTT#491

Closed
ahmettrkck wants to merge 6 commits into openai:main from ahmettrkck:value-resid-dwa-ttt

Conversation


@ahmettrkck ahmettrkck commented Mar 23, 2026

Summary

  • Value Residual (ResFormer): cache the first layer's V vectors and blend them into every subsequent layer via V_used = 0.5 * (V_n + V_1). Zero extra parameters, ~10% parameter-efficiency gain (arXiv:2410.17897).
  • DenseFormer DWA: Replace U-Net skip connections with full depth-weighted averaging over all previous layer outputs. ~65 scalar params for 10 layers. Strictly more general than U-Net skips (arXiv:2402.02622).
  • Test-Time Training: AdamW fine-tuning on already-scored validation tokens during the sliding-window eval. The score-then-train strategy ensures no token influences its own score.
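
The two architectural changes above can be sketched in a few lines of NumPy (function names and shapes here are illustrative, not the actual model code):

```python
import numpy as np

def value_residual(v_n, v_1):
    # ResFormer blend: average the current layer's value vectors with
    # the cached first-layer values; adds no learned parameters.
    return 0.5 * (v_n + v_1)

def dwa(layer_outputs, weights):
    # DenseFormer depth-weighted average: combine all previous layer
    # outputs with learned scalar weights (one scalar per past layer),
    # which subsumes a plain U-Net skip as a special case.
    return sum(w * o for w, o in zip(weights, layer_outputs))
```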

Built on top of the current #1 submission (10L_Int5MLP_MuonWD04_SWA50 by thwu1).

Status

Pending 8xH100 run logs. Will update with 3-seed results once compute is available.

Test plan

  • Run 3 seeds (42, 1337, 2024) on 8xH100, each in under 10 min
  • Verify artifact size < 16 MB
  • Confirm val_bpb improvement > 0.005 bits per byte over the current SOTA
  • Add train logs to PR

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ahmettrkck and others added 5 commits March 25, 2026 17:32
- Replace full-weight TTT with LoRA rank-8 on Q+V+LM head
- Per-document adaptation with reset between docs
- Multi-epoch (5) with cosine LR decay (0.01 -> 0.001)
- Score-then-train per 256-token chunk
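
The LoRA parametrization used for TTT in this commit can be sketched as follows (a minimal NumPy illustration; the matrix names and `scale` parameter are assumptions, not the PR's actual code):

```python
import numpy as np

def lora_forward(x, W, A, B, scale=1.0):
    # LoRA: frozen base weight W plus a low-rank update B @ A.
    # Only A (r x d_in) and B (d_out x r) are updated at test time,
    # e.g. rank r = 8 on the Q, V, and LM-head projections.
    return x @ W.T + ((x @ A.T) @ B.T) * scale
```

With B initialized to zero, the adapted layer starts out identical to the frozen base layer, so adaptation begins from the pretrained behavior.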

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Short docs called forward() without LoRA, which returns a 0-dim mean-loss scalar; the code then tried to index it with nll[0]. Use forward_logits + cross_entropy to get per-token NLLs instead.
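
The fixed pattern, computing an indexable per-token NLL vector from raw logits rather than relying on a mean-reduced loss, looks roughly like this (a NumPy sketch, not the repo's forward_logits implementation):

```python
import numpy as np

def per_token_nll(logits, targets):
    # Per-token NLL via log-softmax + gather, so even a one-token
    # document yields a 1-D vector (indexable with nll[0]) rather
    # than a 0-dim mean scalar.
    logits = logits - logits.max(axis=-1, keepdims=True)
    logp = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -logp[np.arange(len(targets)), targets]
```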

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Multi-epoch TTT was ruled invalid by organizers (PR #568 closed).
Now: score each chunk BEFORE training, single pass, each token
scored exactly once. Matches PR #77 pattern.
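
The legal single-pass pattern, score each chunk with the current weights before any update touches it, can be sketched as (method names and the 256-token chunking are illustrative):

```python
def sliding_window_eval(model, tokens, chunk=256):
    # Score-then-train: each chunk is scored exactly once, BEFORE the
    # model is fine-tuned on it, so no token influences its own score.
    total_nll, n = 0.0, 0
    for i in range(0, len(tokens), chunk):
        c = tokens[i:i + chunk]
        total_nll += model.score(c)   # scored with pre-update weights
        n += len(c)
        model.train_step(c)           # adapt only after scoring
    return total_nll / n
```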

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Start from PR #549 base (33 techniques, 1.1194 bpb)
- Add CTW-weighted n-gram backoff (orders 2-7, logistic domain mixing)
- Replace heuristic alpha with provably optimal Bayesian model averaging
- Keep legal score-first TTT
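
A minimal two-model version of the Bayesian model averaging used for the mixing weight might look like this (a sketch under the assumption that alpha is the posterior weight on the n-gram model given each model's likelihood on previously scored tokens; not the PR's actual code):

```python
import numpy as np

def bma_alpha(prior, p_neural_hist, p_ngram_hist):
    # Posterior weight on the n-gram model: proportional to its prior
    # times its likelihood on the tokens scored so far. This replaces
    # a hand-tuned heuristic alpha with the Bayes-optimal mixture.
    l_ngram = prior * np.prod(p_ngram_hist)
    l_neural = (1.0 - prior) * np.prod(p_neural_hist)
    return l_ngram / (l_ngram + l_neural)
```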

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Logistic (logit-space) domain mixing was the wrong operation for mixing target probabilities.
PR #753 uses linear: p_mixed = (1-a)*p_neural + a*p_ngram.
Keep CTW-inspired depth-adaptive alpha boost.
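
The corrected linear mixing is a one-liner; unlike logit-space mixing, it interpolates directly in probability space, so the result is always a valid probability when both inputs are:

```python
def mix_probs(p_neural, p_ngram, alpha):
    # Linear interpolation in probability space:
    # p_mixed = (1 - alpha) * p_neural + alpha * p_ngram
    return (1.0 - alpha) * p_neural + alpha * p_ngram
```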

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@ahmettrkck ahmettrkck closed this Mar 25, 2026
