fix(tokenizer): apply config-build fallback to offset tokenizer too #75
Merged
#72 wired ``_load_fast_tokenizer_directly`` into ``_load_tokenizer_via_auto`` so ``load_tokenizer`` survives model-config build failures (e.g. HF RoPE validation rejecting nested ``rope_parameters`` for ``poolside/Laguna-XS.2``). But ``_get_offset_tokenizer`` was calling ``AutoTokenizer.from_pretrained`` directly to keep the fastokens patch out of this path, bypassing the fallback entirely.

That meant every hand-coded renderer (LagunaXS2, Qwen35, etc.) still crashed on the first rollout for Laguna-family models: ``render`` → ``emit_text_segments`` → ``attribute_text_segments`` → ``_get_offset_tokenizer`` → raw ``AutoTokenizer.from_pretrained`` → RoPE validator → ``KeyError``.

Reproduced end-to-end with prime-rl + reverse-text + ``poolside/Laguna-XS.2``: without this patch, every rollout aborts with the same ``KeyError`` that #72 was supposed to fix. With it, the first two RL steps complete cleanly (reward 0.36 / 0.31, ~100 tok/sample).

Route through ``_load_tokenizer_via_auto`` instead. Same vanilla path (no fastokens patching, since that helper doesn't apply it), but now the ``_load_fast_tokenizer_directly`` fallback runs when the model config build fails.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Approvability
Verdict: Approved
Small fix that applies an existing fallback pattern to the offset tokenizer loading path. The change replaces a direct …
eexwhyzee pushed a commit to PrimeIntellect-ai/prime-rl that referenced this pull request on May 28, 2026
…or Laguna) (#2663)

Bumps the deps/renderers submodule 89ab3f0 (v0.1.8.dev35) -> 35c2407 (v0.1.8.dev37), pulling in PrimeIntellect-ai/renderers#75: apply the config-build fallback to ``_get_offset_tokenizer`` too.

#72 only wired the fallback into ``load_tokenizer``; ``_get_offset_tokenizer`` still called ``AutoTokenizer.from_pretrained`` directly to avoid the fastokens shim and so kept hitting HF's RoPE validator on Laguna-family models. Every rollout through a hand-coded renderer (LagunaXS2, Qwen35, ...) crashed with the same KeyError #72 was supposed to fix.

Verified end-to-end against poolside/Laguna-XS.2 with reverse-text: two RL steps complete cleanly (reward 0.36 / 0.31, ~100 tok/sample); the run then hits an unrelated single-GPU FSDP OOM on the 256-expert trainer, which is a config-sizing issue, not the renderers bug.

Also pulls in renderers#74 (``message_tool_names`` field for per-message tool attribution).

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Problem
#72 wired ``_load_fast_tokenizer_directly`` into ``_load_tokenizer_via_auto`` so ``load_tokenizer`` survives model-config build failures (e.g. HF RoPE validation rejecting nested ``rope_parameters`` for ``poolside/Laguna-XS.2``).

But ``_get_offset_tokenizer`` was calling ``AutoTokenizer.from_pretrained`` directly to keep the fastokens patch out of this path, bypassing the fallback entirely.

That meant every hand-coded renderer (``LagunaXS2``, ``Qwen35``, ...) still crashed on the first rollout for Laguna-family models, with the same ``KeyError`` #72 was meant to fix: ``render`` → ``emit_text_segments`` → ``attribute_text_segments`` → ``_get_offset_tokenizer`` → raw ``AutoTokenizer.from_pretrained`` → RoPE validator → ``KeyError``.

This bypass is also why disabling renderers made the symptom go away on hosted-rl: the offset-tokenizer path is only hit when a hand-coded renderer is in use.
Fix
Route the offset-tokenizer load through ``_load_tokenizer_via_auto``. Same vanilla path (no fastokens patching, since that helper doesn't apply it), but now the ``_load_fast_tokenizer_directly`` fallback runs when the model config build fails.

Verification
Reproduced end-to-end against ``poolside/Laguna-XS.2`` with prime-rl + ``reverse-text``. Without this patch, every rollout aborts with ``ModelError() -> KeyError("Missing required keys in 'rope_parameters' for 'rope_type'='default': {'rope_theta'}")``. With it, the first two RL steps complete cleanly (reward 0.36 / 0.31, ~100 tok/sample).

(The run then OOMs on the trainer side at step 2, which is an unrelated single-GPU FSDP capacity issue with the 256-expert Laguna model.)
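The fix described above follows a try-then-fall-back loading pattern. A minimal sketch of that pattern is given here, with both loaders passed in as callables so it runs without any model download. The names `load_with_fallback`, `auto_loader`, and `direct_loader` are hypothetical stand-ins, and the set of exception types treated as recoverable is an assumption; the source does not show which errors the real helpers catch.

```python
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def load_with_fallback(
    primary: Callable[[], T],
    fallback: Callable[[], T],
    recoverable: Tuple[Type[BaseException], ...] = (KeyError, ValueError),
) -> T:
    """Try the primary loader; on a recoverable error, use the fallback."""
    try:
        return primary()
    except recoverable:
        # Assumption: only config-build style errors trigger the fallback;
        # any other exception propagates to the caller unchanged.
        return fallback()


# Hypothetical stand-ins for the two loading strategies.
def auto_loader() -> str:
    # Mimics a config validator rejecting the model config.
    raise KeyError("rope_theta")


def direct_loader() -> str:
    # Mimics building the tokenizer from tokenizer.json alone.
    return "fast-tokenizer-from-tokenizer.json"


tokenizer = load_with_fallback(auto_loader, direct_loader)
print(tokenizer)
```

The point of routing every call site through one helper like this is that a new loading path cannot silently skip the fallback, which is the failure mode this PR addresses.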
Follow-up worth a separate issue
For hand-coded renderers, the bulk of message-body tokenization goes through ``attribute_text_segments`` → vanilla offset tokenizer, not through the fastokens-patched main tokenizer. fastokens only helps decode + the few ``_encode`` calls that don't need offsets. Worth considering whether to skip fastokens entirely for hand-coded renderers (``load_tokenizer(..., use_fastokens=False)``) and avoid keeping two tokenizer copies in memory. Not in scope for this fix.

🤖 Generated with Claude Code
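To illustrate why the offset tokenizer matters for hand-coded renderers, here is a generic sketch of offset-based attribution: given per-token character offsets over a concatenated string, each token is assigned to the text segment that contains its first character. The source does not show how ``attribute_text_segments`` works internally, so the function `attribute_tokens` and its inputs are hypothetical, and the offsets are hand-written rather than produced by a real tokenizer.

```python
from typing import List, Tuple


def attribute_tokens(
    offsets: List[Tuple[int, int]], segment_lengths: List[int]
) -> List[int]:
    """Map each token to the index of the segment holding its first char.

    Assumes offsets are sorted by start position, as in a left-to-right
    tokenization of the concatenated segments.
    """
    # Exclusive end position of each segment in the concatenated text.
    ends = []
    total = 0
    for length in segment_lengths:
        total += length
        ends.append(total)

    owners = []
    seg = 0
    for start, _end in offsets:
        # Advance to the segment that contains this token's start offset.
        while seg < len(ends) - 1 and start >= ends[seg]:
            seg += 1
        owners.append(seg)
    return owners


segments = ["Hello ", "world", "!"]
# Hand-written offsets for the tokens "Hello", " ", "world", "!".
offsets = [(0, 5), (5, 6), (6, 11), (11, 12)]
print(attribute_tokens(offsets, [len(s) for s in segments]))  # prints [0, 0, 1, 2]
```

Attribution of this kind needs character offsets for every token, which is why the offset path requires a tokenizer that can report them.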
Note
Apply config-build fallback to ``_get_offset_tokenizer`` in renderer base

Routes tokenizer instantiation in ``_get_offset_tokenizer`` through ``_load_tokenizer_via_auto`` instead of calling ``AutoTokenizer.from_pretrained`` directly, bringing it in line with the main tokenizer loading path. This ensures the config-build fallback is available when loading offset tokenizers, while still requiring the result to be a fast tokenizer with ``offset_mapping`` support.

Macroscope summarized f0435d4.
Note
Low Risk
Single call-site change in tokenizer loading for offset attribution; aligns with an existing tested fallback path and does not alter fastokens or main ``load_tokenizer`` behavior.

Overview
``_get_offset_tokenizer`` no longer calls ``AutoTokenizer.from_pretrained`` directly. It now loads the vanilla, offset-capable tokenizer through ``_load_tokenizer_via_auto``, with the same trust/revision kwargs as before.

That keeps this path without fastokens (still not ``load_tokenizer``) but applies the config-build fallback used elsewhere: when HF fails building the model config (e.g. Laguna ``rope_parameters`` / missing ``rope_theta``), loading can succeed via ``tokenizer.json`` instead. Hand-coded renderers that depend on ``attribute_text_segments`` / offset mapping were still crashing on those models because the offset path skipped that fallback.

Reviewed by Cursor Bugbot for commit f0435d4.