
fix(t5): use prefix-redecode for streaming so byte-level BPE tokenizers work#1420

Open
alvinttang wants to merge 1 commit into
ml-explore:mainfrom
alvinttang:fix/t5-streaming-decode

Conversation

@alvinttang

Summary

Fixes #1021. In `t5/t5.py`, `Tokenizer.decode` called `self._tokenizer.convert_ids_to_tokens(t)` per token and only stripped the SentencePiece `▁` marker, and the `__main__` streaming loop fed it one token id at a time. For tokenizers whose ids don't map 1:1 to characters this is broken:

  • Byte-level BPE (`RobertaTokenizerFast`, used by `Salesforce/codet5p-220m`) emits raw GPT-2 byte markers `Ġ` (space) and `Ċ` (newline), which the `▁` replacement never touches.
  • SentencePiece byte-fallback emits hex byte tokens that also need decoding.

Fix

Replace the per-token `convert_ids_to_tokens` call with a small streaming-state API that delegates to the HuggingFace tokenizer's own `decode(prefix, skip_special_tokens=True)` and yields the new substring at each step: exactly the "redecode the entire prefix" path @awni suggested in the issue thread.

def new_stream_state(self) -> dict: ...
def stream_decode(self, state, token_id) -> str: ...
def decode(self, t) -> str:  # batch decode now also via HF
    return self._tokenizer.decode(t, skip_special_tokens=True)

`__main__` is updated to call `stream_decode(state, token.item())` instead of `decode([token.item()], with_sep=...)`. The `with_sep` flag is removed; it was only ever set at this one call site (verified via grep across the repo). +22/−4 lines of production code.
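For reference, the elided bodies above amount to a few lines each. Here is a minimal, self-contained sketch of the prefix-redecode idea; `ToyTokenizer` and its vocabulary are invented for illustration, while the real code delegates to the actual HF tokenizer:

```python
# Sketch of prefix-redecode streaming. ToyTokenizer is a stand-in invented
# for this example; the real fix calls the HuggingFace tokenizer's
# decode(ids, skip_special_tokens=True).
class ToyTokenizer:
    # id -> byte-level token string, GPT-2 style (Ġ = space, Ċ = newline)
    VOCAB = {0: "def", 1: "Ġgreet", 2: "():", 3: "Ċ", 4: "ĠĠĠĠpass"}

    def decode(self, ids, skip_special_tokens=True):
        text = "".join(self.VOCAB[i] for i in ids)
        return text.replace("Ġ", " ").replace("Ċ", "\n")


class Tokenizer:
    def __init__(self, hf_tokenizer):
        self._tokenizer = hf_tokenizer

    def new_stream_state(self) -> dict:
        # ids seen so far, plus the text already emitted to the caller
        return {"ids": [], "text": ""}

    def stream_decode(self, state, token_id) -> str:
        # Redecode the entire prefix, then return only the new suffix.
        state["ids"].append(token_id)
        text = self._tokenizer.decode(state["ids"], skip_special_tokens=True)
        new = text[len(state["text"]):]
        state["text"] = text
        return new


tok = Tokenizer(ToyTokenizer())
state = tok.new_stream_state()
streamed = "".join(tok.stream_decode(state, i) for i in [0, 1, 2, 3, 4])
print(streamed)  # def greet():\n    pass  (no Ġ/Ċ markers leak)
```

Because each step decodes the whole prefix, multi-id sequences that only become valid text together (byte-fallback, merged BPE pieces) are always rendered correctly.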

Test

`t5/test_tokenizer_stream.py`, parametrized over `t5-small` (SentencePiece) and `Salesforce/codet5p-220m` (byte-level BPE). Asserts that the streamed output equals the full-prefix `decode(ids, skip_special_tokens=True)`. Regression: asserts that no `Ġ`/`Ċ` markers ever appear in streamed output for codet5p code generation.
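In spirit, the equivalence check looks like the following (a hypothetical outline: the real test loads the actual checkpoints, which are replaced here by `StubByteBPE`, a tiny invented byte-level vocabulary):

```python
# Hypothetical outline of the streamed-vs-batch equivalence assertion; the
# real test loads t5-small and Salesforce/codet5p-220m from HuggingFace.
class StubByteBPE:
    VOCAB = {0: "Ċ", 1: "ĠĠĠĠprint", 2: '("Hello', 3: 'ĠWorld")', 4: "ĊĊ"}

    def decode(self, ids, skip_special_tokens=True):
        text = "".join(self.VOCAB[i] for i in ids)
        return text.replace("Ġ", " ").replace("Ċ", "\n")


def stream_all(tok, ids):
    """Redecode the growing prefix and collect each new suffix."""
    seen, emitted, out = [], "", []
    for i in ids:
        seen.append(i)
        text = tok.decode(seen, skip_special_tokens=True)
        out.append(text[len(emitted):])
        emitted = text
    return "".join(out)


tok, ids = StubByteBPE(), [0, 1, 2, 3, 4]
streamed = stream_all(tok, ids)
# Streamed concatenation must equal one-shot decoding of the same ids...
assert streamed == tok.decode(ids, skip_special_tokens=True)
# ...and no raw byte markers may leak into the stream.
assert "Ġ" not in streamed and "Ċ" not in streamed
print(repr(streamed))  # '\n    print("Hello World")\n\n'
```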

TDD: RED first (`AttributeError: 'Tokenizer' object has no attribute 'new_stream_state'`), then GREEN (3/3 pass). An end-to-end repro encoded the exact buggy snippet from the issue (`'\n    print("Hello World")\n\n'`), streamed it via the new API, and the output matched the expected text with no markers leaked.

`black` and `isort --profile=black` clean.

Risk notes

  • `Tokenizer.decode(list_of_ids)` no longer accepts the `with_sep` kwarg. The only call site has been rewritten.
  • Redecoding the full prefix makes each stream quadratic in output length. This is acceptable for typical T5 output lengths and was explicitly accepted by the maintainer in the issue thread; the alternative would be a proper incremental detokenizer, a much larger change.
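As a back-of-the-envelope check on the quadratic claim: step k redecodes a k-id prefix, so an n-token stream performs n(n+1)/2 id-decodes in total. A toy cost model (illustrative arithmetic, not a benchmark):

```python
# Cost model for prefix-redecode streaming: step k decodes a k-id prefix,
# so total work over n steps is 1 + 2 + ... + n = n * (n + 1) // 2.
def redecode_work(n: int) -> int:
    return sum(range(1, n + 1))

print(redecode_work(100))   # 5050 id-decodes for a 100-token stream
print(redecode_work(1000))  # 500500, still cheap next to the model's forward passes
```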

Refs #1021

…encePiece

The streaming generation loop in `t5.py` was decoding each new token in
isolation via `convert_ids_to_tokens` and only stripping the SentencePiece
`▁` marker. This is incorrect for tokenizers whose ids do not map 1:1 to
characters: byte-level BPE tokenizers (e.g. `Salesforce/codet5p-220m`,
which uses `RobertaTokenizerFast`) produced raw `Ġ`/`Ċ` markers in the
output, and SentencePiece byte-fallback would leak hex byte tokens.

Issue ml-explore#1021 demonstrated:

    $ python3 t5.py --model codet5p-220m --prompt 'def print_hello_world():<extra_id_0>'
    <extra_id_0>ĊĠĠĠĠprintĠ"HelloĠWorld"ĊĊ

Fix: introduce a small streaming-state API on `Tokenizer` that delegates to
the underlying HuggingFace tokenizer's own `decode` over the full prefix and
yields the new substring on each step. This is the simplest correct fix
suggested by the maintainer in the issue thread.

Adds a regression test that compares streamed output to full-prefix decode
for both `t5-small` (SentencePiece) and `Salesforce/codet5p-220m`
(byte-level BPE), and asserts no `Ġ`/`Ċ` markers leak.


Development

Successfully merging this pull request may close these issues.

T5 tokenizer decoding error with CodeT5+
