Thanks for making this open-source! The following function checks for the `_pad_token` attribute:

```python
def _tokenize(self, text_sample):
    if self.tokenizer._pad_token is None:
        # Some tokenizers (e.g. GPT2 tokenizer) have no padding token which causes bugs
        raise RuntimeError("If tokenizing on-the-fly, tokenizer must have a pad_token_id")
    return self.tokenizer(text_sample["text"], truncation=True, padding="max_length", max_length=self.max_seq_len)
```
But shouldn't it simply check for `pad_token_id`? My tokenizer has `pad_token_id` and `pad_token`, but no `_pad_token`.
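For illustration, here is a sketch of the suggested change, checking the public `pad_token_id` attribute instead of the private `_pad_token`. The `DummyTokenizer` and `Dataset` classes below are stand-ins I made up to keep the example self-contained, not part of the original code:

```python
class DummyTokenizer:
    # Hypothetical stand-in: exposes pad_token_id like a Hugging Face
    # tokenizer, but has no private _pad_token attribute.
    pad_token_id = 0

    def __call__(self, text, truncation=True, padding="max_length", max_length=8):
        ids = [ord(c) % 100 for c in text][:max_length]
        ids += [self.pad_token_id] * (max_length - len(ids))  # pad to max_length
        return {"input_ids": ids}


class Dataset:
    # Hypothetical wrapper mirroring the shape of the original method.
    def __init__(self, tokenizer, max_seq_len=8):
        self.tokenizer = tokenizer
        self.max_seq_len = max_seq_len

    def _tokenize(self, text_sample):
        # Check the public attribute rather than the private one
        if self.tokenizer.pad_token_id is None:
            raise RuntimeError("If tokenizing on-the-fly, tokenizer must have a pad_token_id")
        return self.tokenizer(
            text_sample["text"],
            truncation=True,
            padding="max_length",
            max_length=self.max_seq_len,
        )


out = Dataset(DummyTokenizer())._tokenize({"text": "hi"})
```

With this version, a tokenizer like mine (public `pad_token_id` set, no `_pad_token`) passes the check instead of raising.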