
fix(t5): use prefix-redecode for streaming so byte-level BPE tokenizers work#1420

Open
alvinttang wants to merge 1 commit into
ml-explore:mainfrom
alvinttang:fix/t5-streaming-decode

Conversation

@alvinttang

Summary

Fixes #1021. In `t5/t5.py`, `Tokenizer.decode` called `self._tokenizer.convert_ids_to_tokens(t)` per token and only stripped the SentencePiece `▁` marker, and the `__main__` streaming loop fed it one token id at a time. For tokenizers whose ids don't map 1:1 to characters this is broken:

  • Byte-level BPE (`RobertaTokenizerFast`, used by `Salesforce/codet5p-220m`) emits raw GPT-2 byte markers `Ġ` (space) and `Ċ` (newline), which the `▁` replacement never touches.
  • SentencePiece byte-fallback emits hex byte tokens that also need decoding.

Fix

Replace the per-token `convert_ids_to_tokens` call with a small streaming-state API that delegates to the HuggingFace tokenizer's own `decode(prefix, skip_special_tokens=True)` and yields the new substring at each step: exactly the "redecode the entire prefix" path @awni suggested in the issue thread.

def new_stream_state(self) -> dict: ...
def stream_decode(self, state, token_id) -> str: ...
def decode(self, t) -> str:  # batch decode now also via HF
    return self._tokenizer.decode(t, skip_special_tokens=True)

`__main__` is updated to call `stream_decode(state, token.item())` instead of `decode([token.item()], with_sep=...)`. The `with_sep` flag is removed; it was only ever set at this one call site (verified via grep across the repo). +22/−4 lines of production code.
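For reference, the elided bodies above amount to a few lines each. Here is a minimal, self-contained sketch of the prefix-redecode idea; `ToyTokenizer` and its vocabulary are invented for illustration, while the real code delegates to the actual HF tokenizer:

```python
# Sketch of prefix-redecode streaming. ToyTokenizer is a stand-in invented
# for this example; the real fix calls the HuggingFace tokenizer's
# decode(ids, skip_special_tokens=True).
class ToyTokenizer:
    # id -> byte-level token string, GPT-2 style (Ġ = space, Ċ = newline)
    VOCAB = {0: "def", 1: "Ġgreet", 2: "():", 3: "Ċ", 4: "ĠĠĠĠpass"}

    def decode(self, ids, skip_special_tokens=True):
        text = "".join(self.VOCAB[i] for i in ids)
        return text.replace("Ġ", " ").replace("Ċ", "\n")


class Tokenizer:
    def __init__(self, hf_tokenizer):
        self._tokenizer = hf_tokenizer

    def new_stream_state(self) -> dict:
        # ids seen so far, plus the text already emitted to the caller
        return {"ids": [], "text": ""}

    def stream_decode(self, state, token_id) -> str:
        # Redecode the entire prefix, then return only the new suffix.
        state["ids"].append(token_id)
        text = self._tokenizer.decode(state["ids"], skip_special_tokens=True)
        new = text[len(state["text"]):]
        state["text"] = text
        return new


tok = Tokenizer(ToyTokenizer())
state = tok.new_stream_state()
streamed = "".join(tok.stream_decode(state, i) for i in [0, 1, 2, 3, 4])
print(streamed)  # def greet():\n    pass  (no Ġ/Ċ markers leak)
```

Because each step decodes the whole prefix, multi-id sequences that only become valid text together (byte-fallback, merged BPE pieces) are always rendered correctly.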

Test

`t5/test_tokenizer_stream.py`, parametrized over `t5-small` (SentencePiece) and `Salesforce/codet5p-220m` (byte-level BPE). Asserts that the streamed output equals the full-prefix `decode(ids, skip_special_tokens=True)`. Regression: asserts that no `Ġ`/`Ċ` markers ever appear in streamed output for codet5p code generation.
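In spirit, the equivalence check looks like the following (a hypothetical outline: the real test loads the actual checkpoints, which are replaced here by `StubByteBPE`, a tiny invented byte-level vocabulary):

```python
# Hypothetical outline of the streamed-vs-batch equivalence assertion; the
# real test loads t5-small and Salesforce/codet5p-220m from HuggingFace.
class StubByteBPE:
    VOCAB = {0: "Ċ", 1: "ĠĠĠĠprint", 2: '("Hello', 3: 'ĠWorld")', 4: "ĊĊ"}

    def decode(self, ids, skip_special_tokens=True):
        text = "".join(self.VOCAB[i] for i in ids)
        return text.replace("Ġ", " ").replace("Ċ", "\n")


def stream_all(tok, ids):
    """Redecode the growing prefix and collect each new suffix."""
    seen, emitted, out = [], "", []
    for i in ids:
        seen.append(i)
        text = tok.decode(seen, skip_special_tokens=True)
        out.append(text[len(emitted):])
        emitted = text
    return "".join(out)


tok, ids = StubByteBPE(), [0, 1, 2, 3, 4]
streamed = stream_all(tok, ids)
# Streamed concatenation must equal one-shot decoding of the same ids...
assert streamed == tok.decode(ids, skip_special_tokens=True)
# ...and no raw byte markers may leak into the stream.
assert "Ġ" not in streamed and "Ċ" not in streamed
print(repr(streamed))  # '\n    print("Hello World")\n\n'
```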

TDD: RED first (`AttributeError: 'Tokenizer' object has no attribute 'new_stream_state'`), then GREEN (3/3 pass). An end-to-end repro encoded the exact buggy snippet from the issue (`'\n    print("Hello World")\n\n'`), streamed it via the new API, and the output matched the expected text with no markers leaked.

`black` and `isort --profile=black` clean.

Risk notes

  • `Tokenizer.decode(list_of_ids)` no longer accepts the `with_sep` kwarg. The only call site has been rewritten.
  • Redecoding the full prefix makes each stream quadratic in output length. This is acceptable for typical T5 output lengths and was explicitly accepted by the maintainer in the issue thread; the alternative would be a proper incremental detokenizer, a much larger change.
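As a back-of-the-envelope check on the quadratic claim: step k redecodes a k-id prefix, so an n-token stream performs n(n+1)/2 id-decodes in total. A toy cost model (illustrative arithmetic, not a benchmark):

```python
# Cost model for prefix-redecode streaming: step k decodes a k-id prefix,
# so total work over n steps is 1 + 2 + ... + n = n * (n + 1) // 2.
def redecode_work(n: int) -> int:
    return sum(range(1, n + 1))

print(redecode_work(100))   # 5050 id-decodes for a 100-token stream
print(redecode_work(1000))  # 500500, still cheap next to the model's forward passes
```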

Refs #1021

…encePiece

The streaming generation loop in `t5.py` was decoding each new token in
isolation via `convert_ids_to_tokens` and only stripping the SentencePiece
`▁` marker. This is incorrect for tokenizers whose ids do not map 1:1 to
characters: byte-level BPE tokenizers (e.g. `Salesforce/codet5p-220m`,
which uses `RobertaTokenizerFast`) produced raw `Ġ`/`Ċ` markers in the
output, and SentencePiece byte-fallback would leak hex byte tokens.

Issue ml-explore#1021 demonstrated:

    $ python3 t5.py --model codet5p-220m --prompt 'def print_hello_world():<extra_id_0>'
    <extra_id_0>ĊĠĠĠĠprintĠ"HelloĠWorld"ĊĊ

Fix: introduce a small streaming-state API on `Tokenizer` that delegates to
the underlying HuggingFace tokenizer's own `decode` over the full prefix and
yields the new substring on each step. This is the simplest correct fix
suggested by the maintainer in the issue thread.

Adds a regression test that compares streamed output to full-prefix decode
for both `t5-small` (SentencePiece) and `Salesforce/codet5p-220m`
(byte-level BPE), and asserts no `Ġ`/`Ċ` markers leak.


Development

Successfully merging this pull request may close these issues.

T5 tokenizer decoding error with CodeT5+
