
feat: route load_tokenizer through fastokens by default #10

Open
hallerite wants to merge 1 commit into main from feat/fastokens-default

Conversation

@hallerite (Member)

Summary

Add fastokens (Crusoe's Rust BPE tokenizer, ~10x faster encode) as a required dependency and patch it in by default for every supported model except a small denylist. The patch is bracketed around from_pretrained, so the loaded tokenizer keeps the fastokens shim while the user's process-global AutoTokenizer.from_pretrained stays vanilla.

Audit results — every entry of MODEL_RENDERER_MAP

Probe: a 5-case encoding comparison (plain text, Lorem ipsum, emoji + CJK, literal <|im_start|>, 200-word long input), vanilla vs. fastokens-patched; a sketch of the probe follows the table.

| Result | Count | Models |
| --- | --- | --- |
| Byte-identical | 31/35 | All Qwen3.x (10), Qwen3.5 (3), Qwen3.6 (1), Qwen3-VL (3), GLM-5 / 4.7 / 5.1 (3), GLM-4.5 family (2), Kimi-K2.x (3), Nemotron 3 (2), Llama-3.2 Instruct (2), gpt-oss (2) |
| Load error | 2/35 | deepseek-ai/DeepSeek-V3{,-Base} — fastokens 0.1.1 doesn't support the Metaspace pretokenizer |
| Silent encode divergence | 2/35 | MiniMaxAI/MiniMax-M2{,.5} — diverge from vanilla on content containing literal `<\|im_start\|>`-like text |
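
The audit script itself isn't part of this diff; a minimal sketch of the per-model check, assuming the load_tokenizer signature shown under "Implementation notes" (probe strings are illustrative):

```python
PROBE_CASES = [
    "The quick brown fox jumps over the lazy dog.",    # plain text
    "Lorem ipsum dolor sit amet, consectetur adipiscing elit.",
    "🚀🎉 你好，世界 こんにちは",                      # emoji + CJK
    "text with a literal <|im_start|> marker inside",  # special-token literal
    " ".join(["word"] * 200),                          # 200-word long case
]

def byte_parity(model_name_or_path: str) -> bool:
    """True iff vanilla and fastokens-patched encodes agree on every probe."""
    vanilla = load_tokenizer(model_name_or_path, use_fastokens=False)
    patched = load_tokenizer(model_name_or_path)  # default: fastokens path
    return all(
        vanilla.encode(case) == patched.encode(case) for case in PROBE_CASES
    )
```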

The 4 incompatible models live in FASTOKENS_INCOMPATIBLE (a frozenset in renderers/base.py) and skip the patch unconditionally. Unknown / fine-tuned models hit the patched path first and fall back to vanilla on any fastokens load error (logged at INFO).
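
For reference, the denylist has this shape (repo IDs taken from the audit table above; renderers/base.py is authoritative):

```python
# renderers/base.py; the exact shape is locked by a test below
FASTOKENS_INCOMPATIBLE = frozenset({
    "deepseek-ai/DeepSeek-V3",       # Metaspace pretokenizer unsupported
    "deepseek-ai/DeepSeek-V3-Base",  # Metaspace pretokenizer unsupported
    "MiniMaxAI/MiniMax-M2",          # silent divergence on <|im_start|>-like text
    "MiniMaxAI/MiniMax-M2.5",        # silent divergence on <|im_start|>-like text
})
```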

Implementation notes

```python
def load_tokenizer(model_name_or_path, *, use_fastokens=True):
    ...
    # Denylisted models and explicit opt-outs take the vanilla path.
    if not use_fastokens or model_name_or_path in FASTOKENS_INCOMPATIBLE:
        return AutoTokenizer.from_pretrained(...)
    try:
        return _patched_load(...)  # patches fastokens, loads, unpatches
    except Exception:
        logger.info("fastokens couldn't load %r; falling back to vanilla",
                    model_name_or_path)
        return AutoTokenizer.from_pretrained(...)
```
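
Typical call sites, as a sketch (the Hub org prefix on the Qwen checkpoint is an assumption):

```python
# Default: fastokens-backed unless the model is denylisted or the load fails.
tok = load_tokenizer("Qwen/Qwen3.5-9B")

# Explicit opt-out: always the vanilla Hugging Face backend.
vanilla = load_tokenizer("Qwen/Qwen3.5-9B", use_fastokens=False)
```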

Per-call patch/unpatch keeps the side effect minimal: the returned tokenizer keeps the fastokens shim (because fastokens captures the backend at load time), but subsequent AutoTokenizer.from_pretrained calls outside load_tokenizer stay vanilla. Verified by test_patch_is_unloaded_after_call.
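
_patched_load isn't spelled out above; a minimal sketch of the bracket, assuming fastokens exposes module-level patch()/unpatch() helpers (names hypothetical):

```python
def _patched_load(model_name_or_path, **kwargs):
    """patch → from_pretrained → unpatch, i.e. the bracket described above."""
    import fastokens  # assumed API surface

    fastokens.patch()  # hypothetical: reroute tokenizer loading to the Rust BPE
    try:
        # The tokenizer captures the fastokens backend at load time, so the
        # returned object keeps the shim even after the finally-block runs.
        return AutoTokenizer.from_pretrained(model_name_or_path, **kwargs)
    finally:
        fastokens.unpatch()  # hypothetical: restore the vanilla backend globally
```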

Tests

tests/test_load_tokenizer_fastokens.py — 10 cases pinning the policy:

- FASTOKENS_INCOMPATIBLE exact-shape lock
- Default path produces a fastokens-shim backend
- use_fastokens=False produces a vanilla backend
- Encode parity vanilla vs fastokens on Qwen3.5-9B (4 sample strings)
- Each of the 4 incompatible models loads via vanilla (skips the patch)
- Patch leak test (sketched below): a direct AutoTokenizer.from_pretrained outside load_tokenizer stays vanilla
- Simulated fastokens load failure falls back to vanilla cleanly
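
A sketch of the patch-leak case (the is_fastokens_backend helper is hypothetical; the real test may assert differently):

```python
def test_patch_is_unloaded_after_call():
    # The tokenizer returned by load_tokenizer carries the fastokens shim...
    tok = load_tokenizer("Qwen/Qwen3.5-9B")
    assert is_fastokens_backend(tok)  # hypothetical backend check

    # ...but a direct from_pretrained afterwards must stay vanilla.
    direct = AutoTokenizer.from_pretrained("Qwen/Qwen3.5-9B")
    assert not is_fastokens_backend(direct)
```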

The existing ~900-test suite passes unchanged with fastokens patched globally — verified before this PR was designed (full run: 947 passed on the feat/llama-3-renderer branch with fastokens; this branch off origin/main: 910 passed, 48 skipped, 1 xfailed in 129.15s).

No version bump — batched with the open Qwen3.5 / Llama-3 PRs.

Test plan

- pytest tests/test_load_tokenizer_fastokens.py — 10 cases pass
- Full suite (pytest tests/ --ignore=tests/test_client.py) — 910 passed, 48 skipped, 1 xfailed (no regressions)
- Per-model byte parity probe over all 35 MODEL_RENDERER_MAP entries — 31 report parity, 4 denylisted after deliberate review
- Pre-commit hooks (ruff check + format) clean

🤖 Generated with Claude Code

