Summary
When loading openpecha/BoSentencePiece via the Hugging Face tokenizer wrappers (AutoTokenizer), the tokenizer loads in a degenerate state containing only 7 special tokens.
This causes Tibetan text to tokenize almost entirely as <unk>, which breaks downstream OCR training.
Direct loading via the sentencepiece Python package works correctly and shows a valid 20k-piece model.
Impact
In OCR training:
- Ground-truth labels become mostly [CLS] <unk> <unk> ... [SEP]
- Model effectively trains on special tokens instead of real text
- Debug decode can appear as an empty string if skip_special_tokens=True
This results in unusable OCR training behavior.
Expected behavior
Loading openpecha/BoSentencePiece should produce a tokenizer with a normal vocabulary size (~20,000 pieces) and functional Tibetan tokenization.
Actual behavior
HF tokenizer wrappers load a tokenizer with:
- len(tokenizer) == 7
- sp_model is None
- only the ALBERT-style special tokens (<s>, </s>, <unk>, [SEP], <pad>, [CLS], [MASK])
Environment
- Python 3.12
- transformers==5.2.0
- tokenizers==0.22.2
- sentencepiece==0.2.1
Also observed with:
- transformers==4.46.3
- tokenizers==0.20.3
- sentencepiece==0.2.1
(Direct downgrade to sentencepiece==0.1.99 was not feasible on Python 3.12 due to wheel/build issues.)
Workaround
Load the tokenizer directly with the sentencepiece package (SentencePieceProcessor.load()) and use encode_as_ids / encode_as_pieces for tokenization, bypassing HF tokenizer wrappers.