
openpecha/BoSentencePiece loads as a degenerate tokenizer (7 tokens only) via HF tokenizer wrappers, breaking OCR training #2

@nih23

Description


Summary

When loading openpecha/BoSentencePiece via the Hugging Face tokenizer wrappers (AutoTokenizer), the tokenizer loads in a degenerate state containing only 7 special tokens.

This causes Tibetan text to tokenize almost entirely as <unk>, which breaks downstream OCR training.

Direct loading via the sentencepiece Python package works correctly and shows a valid 20k-piece model.

Impact

In OCR training:

  • Ground-truth labels become mostly [CLS] <unk> <unk> ... [SEP]
  • Model effectively trains on special tokens instead of real text
  • Debug decoding can yield an empty string when skip_special_tokens=True, hiding the problem

This results in unusable OCR training behavior.

Expected behavior

Loading openpecha/BoSentencePiece should produce a tokenizer with a normal vocabulary size (~20,000 pieces) and functional Tibetan tokenization.

Actual behavior

HF tokenizer wrappers load a tokenizer with:

  • len(tokenizer) == 7
  • sp_model is None
  • only ALBERT-style special tokens (<s>, </s>, <unk>, [SEP], <pad>, [CLS], [MASK])
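A minimal sanity check for this failure mode can be run against any loaded tokenizer object. The 100-piece threshold below is an arbitrary bound chosen for illustration; a healthy BoSentencePiece load should report roughly 20,000 pieces.

```python
# Heuristic check for a degenerate SentencePiece-backed tokenizer load.
# The threshold (100) is an illustrative assumption, not a library constant;
# a healthy openpecha/BoSentencePiece vocabulary is ~20,000 pieces.

def looks_degenerate(tokenizer) -> bool:
    """Return True if the tokenizer appears to have loaded only special tokens."""
    vocab_size = len(tokenizer)
    # HF SentencePiece-based slow tokenizers expose the underlying model
    # as `sp_model`; in the broken state reported here it is None.
    sp_model = getattr(tokenizer, "sp_model", None)
    return vocab_size < 100 or sp_model is None
```

With the affected versions, `looks_degenerate(AutoTokenizer.from_pretrained("openpecha/BoSentencePiece"))` would return True, since `len(tokenizer) == 7` and `sp_model is None`.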

Environment

  • Python 3.12
  • transformers==5.2.0
  • tokenizers==0.22.2
  • sentencepiece==0.2.1

Also observed with:

  • transformers==4.46.3
  • tokenizers==0.20.3
  • sentencepiece==0.2.1

(Downgrading to sentencepiece==0.1.99 was not feasible on Python 3.12 due to wheel/build issues.)

Workaround

Load the tokenizer directly with the sentencepiece package (SentencePieceProcessor.load()) and use encode_as_ids / encode_as_pieces for tokenization, bypassing HF tokenizer wrappers.
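A sketch of that workaround: wrap a raw SentencePieceProcessor behind the small interface the training code needs, bypassing the HF wrapper entirely. The class name and the model path in the comment are illustrative assumptions; the processor is injected so the adapter itself depends only on the standard SentencePieceProcessor methods.

```python
# Workaround sketch: bypass the HF tokenizer wrapper and call the
# sentencepiece package directly. In practice the processor would be
# created via (path is an assumption):
#   sp = sentencepiece.SentencePieceProcessor()
#   sp.load("BoSentencePiece.model")

class SentencePieceLabelCodec:
    """Encode/decode OCR ground-truth labels with a raw sentencepiece model."""

    def __init__(self, sp):
        # `sp` must provide encode_as_ids / encode_as_pieces / decode_ids,
        # the standard SentencePieceProcessor methods.
        self.sp = sp

    def encode(self, text):
        """Token ids for a ground-truth label string."""
        return self.sp.encode_as_ids(text)

    def pieces(self, text):
        """Subword pieces, useful for debugging tokenization."""
        return self.sp.encode_as_pieces(text)

    def decode(self, ids):
        """Recover text from predicted token ids."""
        return self.sp.decode_ids(ids)
```

Because the adapter takes any object with those three methods, it can also be unit-tested without the real model file.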
