fix(t5): use prefix-redecode for streaming so byte-level BPE tokenizers work #1420
Open
alvinttang wants to merge 1 commit into
Conversation
…SentencePiece

The streaming generation loop in `t5.py` was decoding each new token in isolation via `convert_ids_to_tokens` and only stripping the SentencePiece `▁` marker. This is incorrect for tokenizers whose ids do not map 1:1 to characters: byte-level BPE tokenizers (e.g. `Salesforce/codet5p-220m`, which uses `RobertaTokenizerFast`) produced raw `Ġ`/`Ċ` markers in the output, and SentencePiece byte-fallback would leak hex byte tokens.

Issue ml-explore#1021 demonstrated:

```
$ python3 t5.py --model codet5p-220m --prompt 'def print_hello_world():<extra_id_0>'
<extra_id_0>ĊĠĠĠĠprintĠ"HelloĠWorld"ĊĊ
```

Fix: introduce a small streaming-state API on `Tokenizer` that delegates to the underlying HuggingFace tokenizer's own `decode` over the full prefix and yields the new substring on each step. This is the simplest correct fix suggested by the maintainer in the issue thread.

Adds a regression test that compares streamed output to full-prefix decode for both `t5-small` (SentencePiece) and `Salesforce/codet5p-220m` (byte-level BPE), and asserts no `Ġ`/`Ċ` markers leak.
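The tokenizer-level mismatch is easy to reproduce directly against HuggingFace, outside of `t5.py`. A minimal sketch, assuming `transformers` is installed and the model files can be downloaded:

```python
from transformers import AutoTokenizer

# Salesforce/codet5p-220m loads as RobertaTokenizerFast (byte-level BPE).
tok = AutoTokenizer.from_pretrained("Salesforce/codet5p-220m")

ids = tok.encode('\n    print("Hello World")\n\n', add_special_tokens=False)

# Per-token lookup returns raw vocabulary symbols: the byte-level markers
# (Ġ for space, Ċ for newline) are part of the token strings themselves.
print(tok.convert_ids_to_tokens(ids))

# Decoding the whole id sequence maps the bytes back to text, so no
# markers appear.
print(tok.decode(ids, skip_special_tokens=True))
```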
Summary
Fixes #1021.
In `t5/t5.py`, `Tokenizer.decode` called `self._tokenizer.convert_ids_to_tokens(t)` per token and only stripped the SentencePiece `▁` marker. The `__main__` streaming loop fed it one token id at a time. For tokenizers whose ids don't map 1:1 to characters this is broken: `RobertaTokenizerFast` (used by `Salesforce/codet5p-220m`) emits raw GPT-2 byte markers `Ġ` (space) and `Ċ` (newline), which the `▁` replacement never touches.
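For reference, the pre-fix decode path looked roughly like this (a paraphrased sketch of the old behavior, not the verbatim `t5.py` source):

```python
# Sketch of the old behavior: per-token lookup plus SentencePiece-only cleanup.
def decode(self, tokens, with_sep=True):
    text = "".join(self._tokenizer.convert_ids_to_tokens(t) for t in tokens)
    # Replacing the SentencePiece word-boundary marker handles t5-small,
    # but GPT-2 byte markers (Ġ, Ċ) pass through untouched.
    return text.replace("▁", " " if with_sep else "")
```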
Fix

- Replaced the per-token `convert_ids_to_tokens` path with a small streaming-state API that delegates to the HuggingFace tokenizer's own `decode(prefix, skip_special_tokens=True)` and yields the new substring at each step, exactly the "redecode the entire prefix" path @awni suggested in the issue thread. A sketch of the API shape follows this list.
- `__main__` updated to call `stream_decode(state, token.item())` instead of `decode([token.item()], with_sep=...)`.
- `with_sep` flag removed; it was only ever set at this one call site (checked via grep across the repo). Diff: +22/-4 lines of production code.
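A minimal sketch of the API shape, built only from the names above (`new_stream_state`, `stream_decode`); the actual patch may cache the previously decoded text in the state rather than redecoding it:

```python
from typing import List

class Tokenizer:
    # ... existing wrapper around a HuggingFace tokenizer (self._tokenizer) ...

    def new_stream_state(self) -> List[int]:
        # The state is simply the prefix of generated token ids so far.
        return []

    def stream_decode(self, state: List[int], token_id: int) -> str:
        # Redecode the full prefix and emit only the newly added suffix.
        # Quadratic in generation length, but correct for SentencePiece,
        # byte-fallback, and byte-level BPE alike.
        prev = self._tokenizer.decode(state, skip_special_tokens=True)
        state.append(token_id)
        new = self._tokenizer.decode(state, skip_special_tokens=True)
        return new[len(prev):]
```

Schematically, the `__main__` loop then becomes:

```python
state = tokenizer.new_stream_state()
for token in generate(prompt):  # hypothetical generation loop
    print(tokenizer.stream_decode(state, token.item()), end="", flush=True)
```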
Test

- `t5/test_tokenizer_stream.py`, parametrized over `t5-small` (SentencePiece) and `Salesforce/codet5p-220m` (byte-level BPE). Asserts that streamed output equals the full-prefix `decode(ids, skip_special_tokens=True)`. Regression: asserts that no `Ġ`/`Ċ` ever appears in streamed output for codet5p code generation. A condensed sketch of the test follows this list.
- TDD RED first (`AttributeError: 'Tokenizer' object has no attribute 'new_stream_state'`), then GREEN (3/3 pass).
- End-to-end repro encoded the exact buggy snippet from the issue (`'\n    print("Hello World")\n\n'`), streamed it through the new API, and confirmed the output matches the expected text with no markers leaked.
- `black` + `isort --profile=black` clean.
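A condensed sketch of what the test can look like; the `Tokenizer` constructor call and the import path are assumptions about the repo layout:

```python
import pytest
from transformers import AutoTokenizer

from t5 import Tokenizer  # hypothetical import path

@pytest.mark.parametrize(
    "model_name", ["t5-small", "Salesforce/codet5p-220m"]
)
def test_stream_matches_full_decode(model_name):
    hf = AutoTokenizer.from_pretrained(model_name)
    ids = hf.encode('def print_hello_world():\n    pass\n', add_special_tokens=False)

    tokenizer = Tokenizer(model_name)  # hypothetical constructor signature
    state = tokenizer.new_stream_state()
    streamed = "".join(tokenizer.stream_decode(state, t) for t in ids)

    # Streamed concatenation must equal the one-shot full-prefix decode,
    assert streamed == hf.decode(ids, skip_special_tokens=True)
    # and no byte-level BPE markers may leak into user-visible output.
    assert "Ġ" not in streamed and "Ċ" not in streamed
```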
Risk notes

- `Tokenizer.decode(list_of_ids)` no longer accepts the `with_sep` kwarg. The only call site has been rewritten.

Refs #1021