
fix: refix tokenizer with added token shenanigans#304

Open
stephantul wants to merge 2 commits into MinishLab:main from stephantul:fix-tokenizer-again

Conversation

@stephantul
Contributor

This PR fixes the tokenizer AGAIN by

  1. Removing all added tokens that are meaningless in the tokenizer, e.g., [MASK] for BERT models. This fixes a bug where some tokenizers had added tokens that were out of bounds.
  2. Updating the base model itself, so that the embedding matrix always matches the tokenizer.

I had to update a whole bunch of tests. The vocabulary counts are different because we now also remove special tokens we used to keep, such as [EOS]. The only special tokens we keep are the ones actually relevant to model2vec: the padding token (which is now set correctly) and the unknown token.
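To make the mechanics concrete, here is a minimal sketch of the idea, assuming a Hugging Face tokenizer and a NumPy embedding matrix. The model name, matrix, and variable names are placeholders for illustration, not the actual model2vec code:

```python
import numpy as np
from transformers import AutoTokenizer

# Illustrative sketch only; the real model2vec implementation differs.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Stand-in embedding matrix with one row per base-vocab token. Added
# tokens can be assigned ids past these rows, which is the
# out-of-bounds bug described above.
embeddings = np.random.rand(tokenizer.vocab_size, 768)

# Only the padding and unknown tokens survive; every other special or
# added token ([MASK], [CLS], [SEP], [EOS], ...) is dropped.
keep = {t for t in (tokenizer.pad_token, tokenizer.unk_token) if t is not None}
drop = {t for t in tokenizer.all_special_tokens if t not in keep}
drop_ids = {tokenizer.convert_tokens_to_ids(t) for t in drop}

# Keep every remaining id that actually has a row in the matrix, then
# slice the matrix so it matches the pruned vocabulary one-to-one.
keep_ids = [i for i in range(len(tokenizer))
            if i not in drop_ids and i < embeddings.shape[0]]
embeddings = embeddings[keep_ids]
```

The slice in the last step is what keeps the embedding matrix and the tokenizer in sync: after pruning, row k of the matrix corresponds to the k-th surviving token id in `keep_ids`.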

NB: tests will fail because of utils.py
