fix(tokenizer): apply config-build fallback to offset tokenizer too #75

Merged
hallerite merged 1 commit into main from fix/offset-tokenizer-config-fallback
May 28, 2026


@hallerite hallerite commented May 28, 2026

Problem

#72 wired _load_fast_tokenizer_directly into _load_tokenizer_via_auto so load_tokenizer survives model-config build failures (e.g. HF RoPE validation rejecting nested rope_parameters for poolside/Laguna-XS.2).

But _get_offset_tokenizer was calling AutoTokenizer.from_pretrained directly to keep the fastokens patch out of this path — bypassing the fallback entirely.

That meant every hand-coded renderer (LagunaXS2, Qwen35, ...) still crashed on the first rollout for Laguna-family models, with the same KeyError #72 was meant to fix:

LagunaXS2Renderer.render
 → emit_text_segments
 → attribute_text_segments
 → _get_offset_tokenizer
 → AutoTokenizer.from_pretrained
 → HF RoPE validator
 → KeyError("Missing required keys in `rope_parameters` for 'rope_type'='default': {'rope_theta'}")
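For readers unfamiliar with the failure at the end of this trace, here is a toy reproduction. It is not Hugging Face's validator: `validate_rope_parameters`, `REQUIRED_KEYS`, and the nested `full_attention` layout are hypothetical stand-ins, chosen only to show how a required key that lives one level down can look "missing" to a check that inspects the top level.

```python
# Toy stand-in for a RoPE config validator; NOT the Hugging Face implementation.
# Assumption: the "default" rope_type requires a top-level "rope_theta" key.
REQUIRED_KEYS = {"default": {"rope_theta"}}


def validate_rope_parameters(rope_parameters: dict) -> None:
    """Raise KeyError if top-level keys required for the rope_type are missing."""
    rope_type = rope_parameters.get("rope_type", "default")
    missing = REQUIRED_KEYS.get(rope_type, set()) - set(rope_parameters)
    if missing:
        raise KeyError(
            f"Missing required keys in `rope_parameters` for "
            f"'rope_type'='{rope_type}': {missing}"
        )


# Flat layout: the required key sits at the top level, so validation passes.
flat = {"rope_type": "default", "rope_theta": 10000.0}
validate_rope_parameters(flat)

# Nested layout (hypothetical shape): the key exists only one level down,
# so a top-level check reports it as missing.
nested = {"full_attention": {"rope_type": "default", "rope_theta": 10000.0}}
try:
    validate_rope_parameters(nested)
    error = None
except KeyError as exc:
    error = exc
```

The point is only that the tokenizer itself is fine; the failure comes from building the model config on the way to the tokenizer.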

This bypass is also why disabling renderers made the symptom go away on hosted-rl: the offset-tokenizer path is only hit when a hand-coded renderer is in use.

Fix

Route the offset-tokenizer load through _load_tokenizer_via_auto. Same vanilla path (no fastokens patching, since that helper doesn't apply it), but now the _load_fast_tokenizer_directly fallback runs when the model config build fails.
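A minimal sketch of the pattern, using stand-ins rather than the real helpers. `FakeFastTokenizer`, `auto_load`, `direct_load`, and `load_with_fallback` are hypothetical names, and the set of exception types treated as recoverable is an assumption; the actual signatures of `_load_tokenizer_via_auto` and `_load_fast_tokenizer_directly` are not shown in this PR.

```python
class FakeFastTokenizer:
    """Stand-in for a fast tokenizer; carries only what the sketch checks."""

    is_fast = True

    def __init__(self, source: str) -> None:
        self.source = source  # records which loader produced this object


def auto_load(name: str) -> FakeFastTokenizer:
    # Simulates the primary loader failing while the model config is built.
    raise KeyError("Missing required keys in `rope_parameters`")


def direct_load(name: str) -> FakeFastTokenizer:
    # Simulates loading from the tokenizer files alone, skipping the model config.
    return FakeFastTokenizer(source="direct")


def load_with_fallback(
    name: str,
    primary=auto_load,
    fallback=direct_load,
    recoverable=(KeyError, ValueError),  # assumption: which errors trigger the fallback
) -> FakeFastTokenizer:
    """Try the primary loader; on a recoverable error, use the fallback."""
    try:
        tok = primary(name)
    except recoverable:
        tok = fallback(name)
    # Offset attribution needs a fast tokenizer, whichever loader produced it.
    if not getattr(tok, "is_fast", False):
        raise TypeError("offset attribution needs a fast tokenizer")
    return tok


tok = load_with_fallback("poolside/Laguna-XS.2")
```

When the primary loader succeeds, the fallback is never called, so the happy path is unchanged.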

Verification

Reproduced end-to-end against poolside/Laguna-XS.2 with prime-rl + reverse-text:

  • Without this patch — every rollout aborts with ModelError() -> KeyError("Missing required keys in 'rope_parameters' for 'rope_type'='default': {'rope_theta'}").
  • With this patch — rollouts succeed; first two RL steps complete cleanly:
    • Step 0: reward 0.3553, 97.9 tok/sample
    • Step 1: reward 0.3116, 110.8 tok/sample

(The run then OOMs on the trainer side at step 2, which is an unrelated single-GPU FSDP capacity issue with the 256-expert Laguna model.)

Follow-up worth a separate issue

For hand-coded renderers, the bulk of message-body tokenization goes through attribute_text_segments → vanilla offset tokenizer, not through the fastokens-patched main tokenizer. fastokens only helps with decoding and with the few _encode calls that don't need offsets. It is worth considering whether to skip fastokens entirely for hand-coded renderers (load_tokenizer(..., use_fastokens=False)), which would also avoid keeping two tokenizer copies in memory. Not in scope for this fix.
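To illustrate why this path needs character offsets at all, here is a rough sketch of offset-based attribution. It is not the library's attribute_text_segments: `tokenize_with_offsets` is a toy whitespace tokenizer and `attribute_tokens` is a hypothetical helper. With a real fast tokenizer the spans would come from the offset mapping (for example via `return_offsets_mapping=True`) instead.

```python
import re
from bisect import bisect_right


def tokenize_with_offsets(text: str):
    """Toy tokenizer: one token per whitespace-delimited word, with char spans."""
    return [(m.group(), (m.start(), m.end())) for m in re.finditer(r"\S+", text)]


def attribute_tokens(segments):
    """Tokenize the joined text once, then map each token to the segment
    that contains the token's first character."""
    text = "".join(segments)

    # Character position at which each segment begins in the joined text.
    starts = []
    pos = 0
    for seg in segments:
        starts.append(pos)
        pos += len(seg)

    owners = []
    for token, (start, _end) in tokenize_with_offsets(text):
        # Index of the last segment whose start is <= the token's start.
        owners.append((token, bisect_right(starts, start) - 1))
    return owners


owners = attribute_tokens(["system prompt ", "user says hi"])
# owners == [("system", 0), ("prompt", 0), ("user", 1), ("says", 1), ("hi", 1)]
```

Tokenizing the joined text once, rather than each segment separately, keeps token boundaries identical to what the model sees, which is why the offsets (and therefore a fast tokenizer) are required.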

🤖 Generated with Claude Code

Note

Apply config-build fallback to _get_offset_tokenizer in renderer base

Routes tokenizer instantiation in _get_offset_tokenizer through _load_tokenizer_via_auto instead of calling AutoTokenizer.from_pretrained directly, bringing it in line with the main tokenizer loading path. This ensures the config-build fallback is available when loading offset tokenizers, while still requiring the result to be a fast tokenizer with offset_mapping support.

Macroscope summarized f0435d4.


Note

Low Risk
Single call-site change in tokenizer loading for offset attribution; aligns with an existing tested fallback path and does not alter fastokens or main load_tokenizer behavior.

Overview
_get_offset_tokenizer no longer calls AutoTokenizer.from_pretrained directly. It now loads the vanilla, offset-capable tokenizer through _load_tokenizer_via_auto, with the same trust/revision kwargs as before.

This keeps the path free of fastokens (it still does not go through load_tokenizer) but applies the config-build fallback used elsewhere: when HF fails to build the model config (e.g. Laguna rope_parameters / missing rope_theta), loading can succeed via tokenizer.json instead. Hand-coded renderers that depend on attribute_text_segments / offset mapping were still crashing on those models because the offset path skipped that fallback.

Reviewed by Cursor Bugbot for commit f0435d4.

#72 wired ``_load_fast_tokenizer_directly`` into ``_load_tokenizer_via_auto``
so ``load_tokenizer`` survives model-config build failures (e.g. HF RoPE
validation rejecting nested ``rope_parameters`` for ``poolside/Laguna-XS.2``).
But ``_get_offset_tokenizer`` was calling ``AutoTokenizer.from_pretrained``
directly to keep the fastokens patch out of this path — bypassing the
fallback entirely.

That meant every hand-coded renderer (LagunaXS2, Qwen35, etc.) still
crashed on the first rollout for Laguna-family models: ``render`` →
``emit_text_segments`` → ``attribute_text_segments`` → ``_get_offset_tokenizer``
→ raw ``AutoTokenizer.from_pretrained`` → RoPE validator → ``KeyError``.

Reproduced end-to-end with prime-rl + reverse-text + ``poolside/Laguna-XS.2``:
without this patch, every rollout aborts with the same ``KeyError``
that #72 was supposed to fix. With it, the first two RL steps complete
cleanly (reward 0.36 / 0.31, ~100 tok/sample).

Route through ``_load_tokenizer_via_auto`` instead. Same vanilla path
(no fastokens patching, since that helper doesn't apply it), but now
the ``_load_fast_tokenizer_directly`` fallback runs when the model
config build fails.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@hallerite hallerite marked this pull request as ready for review May 28, 2026 18:49
macroscopeapp Bot commented May 28, 2026

Approvability

Verdict: Approved

Small fix that applies an existing fallback pattern to the offset tokenizer loading path. The change replaces a direct AutoTokenizer.from_pretrained call with an internal wrapper that adds error recovery for edge cases while keeping the happy path unchanged.


@hallerite hallerite merged commit 35c2407 into main May 28, 2026
11 checks passed
@hallerite hallerite deleted the fix/offset-tokenizer-config-fallback branch May 28, 2026 18:58
eexwhyzee pushed a commit to PrimeIntellect-ai/prime-rl that referenced this pull request May 28, 2026
…or Laguna) (#2663)

Bumps the deps/renderers submodule 89ab3f0 (v0.1.8.dev35) -> 35c2407
(v0.1.8.dev37), pulling in PrimeIntellect-ai/renderers#75: apply the
config-build fallback to ``_get_offset_tokenizer`` too.

#72 only wired the fallback into ``load_tokenizer``;
``_get_offset_tokenizer`` still called ``AutoTokenizer.from_pretrained``
directly to avoid the fastokens shim and so kept hitting HF's RoPE
validator on Laguna-family models. Every rollout through a hand-coded
renderer (LagunaXS2, Qwen35, ...) crashed with the same KeyError #72
was supposed to fix.

Verified end-to-end against poolside/Laguna-XS.2 with reverse-text:
two RL steps complete cleanly (reward 0.36 / 0.31, ~100 tok/sample);
the run then hits an unrelated single-GPU FSDP OOM on the 256-expert
trainer, which is a config-sizing issue, not the renderers bug.

Also pulls in renderers#74 (``message_tool_names`` field for per-message
tool attribution).

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>