
openpecha/BoSentencePiece loads as a degenerate tokenizer (7 tokens only) via HF tokenizer wrappers, breaking OCR training #2

@nih23

Description


Summary

When loading openpecha/BoSentencePiece via the Hugging Face tokenizer wrappers (AutoTokenizer), the tokenizer loads in a degenerate state containing only 7 special tokens.

This causes Tibetan text to tokenize almost entirely as <unk>, which breaks downstream OCR training.

Direct loading via the sentencepiece Python package works correctly and shows a valid 20k-piece model.

Impact

In OCR training:

  • Ground-truth labels become mostly [CLS] <unk> <unk> ... [SEP]
  • Model effectively trains on special tokens instead of real text
  • Debug decoding can yield an empty string when skip_special_tokens=True, hiding the problem

This results in unusable OCR training behavior.

Expected behavior

Loading openpecha/BoSentencePiece should produce a tokenizer with a normal vocabulary size (~20,000 pieces) and functional Tibetan tokenization.

Actual behavior

HF tokenizer wrappers load a tokenizer with:

  • len(tokenizer) == 7
  • sp_model is None
  • only ALBERT-style special tokens (<s>, </s>, <unk>, [SEP], <pad>, [CLS], [MASK])
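A minimal sanity check for this failure mode can be run against any loaded tokenizer object. The 100-piece threshold below is an arbitrary bound chosen for illustration; a healthy BoSentencePiece load should report roughly 20,000 pieces.

```python
# Heuristic check for a degenerate SentencePiece-backed tokenizer load.
# The threshold (100) is an illustrative assumption, not a library constant;
# a healthy openpecha/BoSentencePiece vocabulary is ~20,000 pieces.

def looks_degenerate(tokenizer) -> bool:
    """Return True if the tokenizer appears to have loaded only special tokens."""
    vocab_size = len(tokenizer)
    # HF SentencePiece-based slow tokenizers expose the underlying model
    # as `sp_model`; in the broken state reported here it is None.
    sp_model = getattr(tokenizer, "sp_model", None)
    return vocab_size < 100 or sp_model is None
```

With the affected versions, `looks_degenerate(AutoTokenizer.from_pretrained("openpecha/BoSentencePiece"))` would return True, since `len(tokenizer) == 7` and `sp_model is None`.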

Environment

  • Python 3.12
  • transformers==5.2.0
  • tokenizers==0.22.2
  • sentencepiece==0.2.1

Also observed with:

  • transformers==4.46.3
  • tokenizers==0.20.3
  • sentencepiece==0.2.1

(Downgrading to sentencepiece==0.1.99 was not feasible on Python 3.12 due to wheel/build issues.)

Workaround

Load the tokenizer directly with the sentencepiece package (SentencePieceProcessor.load()) and use encode_as_ids / encode_as_pieces for tokenization, bypassing HF tokenizer wrappers.
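A sketch of that workaround: wrap a raw SentencePieceProcessor behind the small interface the training code needs, bypassing the HF wrapper entirely. The class name and the model path in the comment are illustrative assumptions; the processor is injected so the adapter itself depends only on the standard SentencePieceProcessor methods.

```python
# Workaround sketch: bypass the HF tokenizer wrapper and call the
# sentencepiece package directly. In practice the processor would be
# created via (path is an assumption):
#   sp = sentencepiece.SentencePieceProcessor()
#   sp.load("BoSentencePiece.model")

class SentencePieceLabelCodec:
    """Encode/decode OCR ground-truth labels with a raw sentencepiece model."""

    def __init__(self, sp):
        # `sp` must provide encode_as_ids / encode_as_pieces / decode_ids,
        # the standard SentencePieceProcessor methods.
        self.sp = sp

    def encode(self, text):
        """Token ids for a ground-truth label string."""
        return self.sp.encode_as_ids(text)

    def pieces(self, text):
        """Subword pieces, useful for debugging tokenization."""
        return self.sp.encode_as_pieces(text)

    def decode(self, ids):
        """Recover text from predicted token ids."""
        return self.sp.decode_ids(ids)
```

Because the adapter takes any object with those three methods, it can also be unit-tested without the real model file.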
