fix(tokenizer): apply config-build fallback to offset tokenizer too by hallerite · Pull Request #75 · PrimeIntellect-ai/renderers

hallerite · 2026-05-28T18:47:40Z

Problem

#72 wired _load_fast_tokenizer_directly into _load_tokenizer_via_auto so load_tokenizer survives model-config build failures (e.g. HF RoPE validation rejecting nested rope_parameters for poolside/Laguna-XS.2).

But _get_offset_tokenizer was still calling AutoTokenizer.from_pretrained directly, to keep the fastokens patch out of this path, and so bypassed the fallback entirely.

That meant every hand-coded renderer (LagunaXS2, Qwen35, ...) still crashed on the first rollout for Laguna-family models, with the same KeyError #72 was meant to fix:

LagunaXS2Renderer.render
 → emit_text_segments
 → attribute_text_segments
 → _get_offset_tokenizer
 → AutoTokenizer.from_pretrained
 → HF RoPE validator
 → KeyError("Missing required keys in `rope_parameters` for 'rope_type'='default': {'rope_theta'}")

This bypass is also why disabling renderers made the symptom go away on hosted-rl: the offset-tokenizer path is only hit when a hand-coded renderer is in use.

Fix

Route the offset-tokenizer load through _load_tokenizer_via_auto. Same vanilla path (no fastokens patching, since that helper doesn't apply it), but now the _load_fast_tokenizer_directly fallback runs when the model config build fails.
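
The routing change can be sketched roughly as follows. This is a minimal, self-contained illustration, not the repository's code: the real signatures of `_load_tokenizer_via_auto` and `_load_fast_tokenizer_directly` are not shown in this PR, so the loaders are injected as plain callables. The names `load_with_config_fallback`, `broken_auto_load` and `direct_load`, and the choice of caught exception types, are all assumptions made for the example.

```python
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def load_with_config_fallback(
    primary: Callable[[], T],
    fallback: Callable[[], T],
    config_errors: Tuple[Type[BaseException], ...] = (KeyError, ValueError),
) -> T:
    """Try the normal loader; if the model config build fails, use the fallback.

    `primary` stands in for the AutoTokenizer-based load and `fallback` for a
    direct load from tokenizer.json. Both are hypothetical stand-ins.
    """
    try:
        return primary()
    except config_errors:
        # Assumption: config-build failures surface as KeyError or ValueError.
        return fallback()


def broken_auto_load() -> str:
    # Simulates the RoPE validator rejecting the model config.
    raise KeyError("Missing required keys in `rope_parameters`")


def direct_load() -> str:
    # Simulates loading the fast tokenizer without building the model config.
    return "fast-tokenizer-from-tokenizer.json"


def get_offset_tokenizer_before() -> str:
    # Before the fix: the offset path calls the primary loader directly,
    # so a config-build failure propagates to the renderer.
    return broken_auto_load()


def get_offset_tokenizer_after() -> str:
    # After the fix: the offset path goes through the shared wrapper,
    # so the fallback gets a chance to run.
    return load_with_config_fallback(broken_auto_load, direct_load)
```

In this sketch the happy path is unchanged: when `primary` succeeds, `fallback` is never called.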

Verification

Reproduced end-to-end against poolside/Laguna-XS.2 with prime-rl + reverse-text:

Without this patch: every rollout aborts with ModelError() -> KeyError("Missing required keys in `rope_parameters` for 'rope_type'='default': {'rope_theta'}").
With this patch: rollouts succeed, and the first two RL steps complete cleanly:
- Step 0: reward 0.3553, 97.9 tok/sample
- Step 1: reward 0.3116, 110.8 tok/sample

(The run then OOMs on the trainer side at step 2, which is an unrelated single-GPU FSDP capacity issue with the 256-expert Laguna model.)

Follow-up worth a separate issue

For hand-coded renderers, the bulk of message-body tokenization goes through attribute_text_segments → vanilla offset tokenizer, not through the fastokens-patched main tokenizer. fastokens only helps decode and the few _encode calls that don't need offsets. It may be worth skipping fastokens entirely for hand-coded renderers (load_tokenizer(..., use_fastokens=False)), which would also avoid keeping two tokenizer copies in memory. Not in scope for this fix.
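
To illustrate why this path needs character offsets at all, here is a rough sketch of offset-based attribution. It is an assumption about the general idea only: the real `attribute_text_segments` is not shown in this PR and may work differently. The function name `attribute_tokens_to_segments` and its inputs are hypothetical. It takes the `(start, end)` character offsets a fast tokenizer reports for each token of the concatenated text, plus the length of each text segment, and assigns every token to the segment that contains its first character.

```python
from typing import List, Tuple


def attribute_tokens_to_segments(
    token_offsets: List[Tuple[int, int]],
    segment_lengths: List[int],
) -> List[int]:
    """Return, for each token, the index of the segment holding its first char.

    Assumes `token_offsets` is sorted by start position, as offset mappings
    from a single left-to-right encode normally are.
    """
    # Exclusive end position of each segment in the concatenated text.
    ends: List[int] = []
    total = 0
    for length in segment_lengths:
        total += length
        ends.append(total)

    owners: List[int] = []
    seg = 0
    for start, _end in token_offsets:
        # Advance to the segment whose span contains `start`; tokens past the
        # final boundary stay attributed to the last segment.
        while seg < len(ends) - 1 and start >= ends[seg]:
            seg += 1
        owners.append(seg)
    return owners


# Segments "Hello " (6 chars) and "world" (5 chars), three tokens.
print(attribute_tokens_to_segments([(0, 5), (5, 6), (6, 11)], [6, 5]))  # [0, 0, 1]
```

Without offsets, the only way to recover this mapping would be to tokenize each segment separately, which can produce different tokens at segment boundaries.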

🤖 Generated with Claude Code

Note

Apply config-build fallback to `_get_offset_tokenizer` in renderer base

Routes tokenizer instantiation in _get_offset_tokenizer through _load_tokenizer_via_auto instead of calling AutoTokenizer.from_pretrained directly, bringing it in line with the main tokenizer loading path. This ensures the config-build fallback is available when loading offset tokenizers, while still requiring the result to be a fast tokenizer with offset_mapping support.

Macroscope summarized f0435d4.

Note

Low Risk
Single call-site change in tokenizer loading for offset attribution; aligns with an existing tested fallback path and does not alter fastokens or main load_tokenizer behavior.

Overview
_get_offset_tokenizer no longer calls AutoTokenizer.from_pretrained directly. It now loads the vanilla, offset-capable tokenizer through _load_tokenizer_via_auto, with the same trust/revision kwargs as before.

This keeps the path free of fastokens (it still does not go through load_tokenizer) but applies the config-build fallback used elsewhere: when HF fails to build the model config (e.g. Laguna rope_parameters with a missing rope_theta), loading can still succeed via tokenizer.json. Hand-coded renderers that depend on attribute_text_segments / offset mapping were crashing on those models because the offset path skipped that fallback.

Reviewed by Cursor Bugbot for commit f0435d4.

#72 wired ``_load_fast_tokenizer_directly`` into ``_load_tokenizer_via_auto`` so ``load_tokenizer`` survives model-config build failures (e.g. HF RoPE validation rejecting nested ``rope_parameters`` for ``poolside/Laguna-XS.2``).

But ``_get_offset_tokenizer`` was calling ``AutoTokenizer.from_pretrained`` directly to keep the fastokens patch out of this path, bypassing the fallback entirely. That meant every hand-coded renderer (LagunaXS2, Qwen35, etc.) still crashed on the first rollout for Laguna-family models: ``render`` → ``emit_text_segments`` → ``attribute_text_segments`` → ``_get_offset_tokenizer`` → raw ``AutoTokenizer.from_pretrained`` → RoPE validator → ``KeyError``.

Reproduced end-to-end with prime-rl + reverse-text + ``poolside/Laguna-XS.2``: without this patch, every rollout aborts with the same ``KeyError`` that #72 was supposed to fix. With it, the first two RL steps complete cleanly (reward 0.36 / 0.31, ~100 tok/sample).

Route through ``_load_tokenizer_via_auto`` instead. Same vanilla path (no fastokens patching, since that helper doesn't apply it), but now the ``_load_fast_tokenizer_directly`` fallback runs when the model config build fails.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

macroscopeapp · 2026-05-28T18:57:26Z

Approvability

Verdict: Approved

Small fix that applies an existing fallback pattern to the offset tokenizer loading path. The change replaces a direct AutoTokenizer.from_pretrained call with an internal wrapper that adds error recovery for edge cases while keeping the happy path unchanged.


…or Laguna) (#2663)

Bumps the deps/renderers submodule 89ab3f0 (v0.1.8.dev35) -> 35c2407 (v0.1.8.dev37), pulling in PrimeIntellect-ai/renderers#75: apply the config-build fallback to ``_get_offset_tokenizer`` too.

#72 only wired the fallback into ``load_tokenizer``; ``_get_offset_tokenizer`` still called ``AutoTokenizer.from_pretrained`` directly to avoid the fastokens shim and so kept hitting HF's RoPE validator on Laguna-family models. Every rollout through a hand-coded renderer (LagunaXS2, Qwen35, ...) crashed with the same KeyError #72 was supposed to fix.

Verified end-to-end against poolside/Laguna-XS.2 with reverse-text: two RL steps complete cleanly (reward 0.36 / 0.31, ~100 tok/sample); the run then hits an unrelated single-GPU FSDP OOM on the 256-expert trainer, which is a config-sizing issue, not the renderers bug.

Also pulls in renderers#74 (``message_tool_names`` field for per-message tool attribution).

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

hallerite marked this pull request as ready for review May 28, 2026 18:49

macroscopeapp Bot approved these changes May 28, 2026


hallerite merged commit 35c2407 into main May 28, 2026
11 checks passed

hallerite deleted the fix/offset-tokenizer-config-fallback branch May 28, 2026 18:58

hallerite mentioned this pull request May 28, 2026

chore(deps): bump renderers (offset-tokenizer config-build fallback for Laguna) PrimeIntellect-ai/prime-rl#2663

Merged
