
fix: refix tokenizer with added token shenanigans#304

Open
stephantul wants to merge 2 commits into MinishLab:main from stephantul:fix-tokenizer-again

Conversation

@stephantul
Contributor

This PR fixes the tokenizer AGAIN by

  1. Removing all added tokens that are meaningless in the tokenizer, e.g., [MASK] for BERT models. This fixes a bug where some tokenizers had added tokens that were out of bounds.
  2. Updating the base model itself, so that the embedding matrix always matches the tokenizer.

I had to update a whole bunch of tests. The vocabulary counts are different because we now also remove special tokens we used to keep, such as [EOS]. The only special tokens we keep are the ones actually relevant to model2vec: the padding token (which is now set correctly) and the unknown token.
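To make the mechanics concrete, here is a minimal sketch of the idea, assuming a Hugging Face tokenizer and a NumPy embedding matrix. The model name, matrix, and variable names are placeholders for illustration, not the actual model2vec code:

```python
import numpy as np
from transformers import AutoTokenizer

# Illustrative sketch only; the real model2vec implementation differs.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Stand-in embedding matrix with one row per base-vocab token. Added
# tokens can be assigned ids past these rows, which is the
# out-of-bounds bug described above.
embeddings = np.random.rand(tokenizer.vocab_size, 768)

# Only the padding and unknown tokens survive; every other special or
# added token ([MASK], [CLS], [SEP], [EOS], ...) is dropped.
keep = {t for t in (tokenizer.pad_token, tokenizer.unk_token) if t is not None}
drop = {t for t in tokenizer.all_special_tokens if t not in keep}
drop_ids = {tokenizer.convert_tokens_to_ids(t) for t in drop}

# Keep every remaining id that actually has a row in the matrix, then
# slice the matrix so it matches the pruned vocabulary one-to-one.
keep_ids = [i for i in range(len(tokenizer))
            if i not in drop_ids and i < embeddings.shape[0]]
embeddings = embeddings[keep_ids]
```

The slice in the last step is what keeps the embedding matrix and the tokenizer in sync: after pruning, row k of the matrix corresponds to the k-th surviving token id in `keep_ids`.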

NB: tests will fail because of utils.py
