_pad_token attribute? #241

@ceferisbarov

Thanks for making this open-source! The following function checks for the _pad_token attribute:

    def _tokenize(self, text_sample):
        if self.tokenizer._pad_token is None:
            # Some tokenizers (e.g. GPT2 tokenizer) have no padding token which causes bugs
            raise RuntimeError("If tokenizing on-the-fly, tokenizer must have a pad_token_id")

        return self.tokenizer(text_sample["text"], truncation=True, padding="max_length", max_length=self.max_seq_len)

But shouldn't it simply check for pad_token_id? My tokenizer has pad_token_id and pad_token, but no _pad_token.
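
For illustration, a check against the public attribute could look roughly like the sketch below. This is not the project's code: `require_pad_token_id` is a hypothetical helper name, and the `SimpleNamespace` objects are stand-ins for real tokenizers so the snippet runs without any tokenizer library installed.

```python
from types import SimpleNamespace


def require_pad_token_id(tokenizer):
    """Raise if the tokenizer exposes no usable pad_token_id (public attribute only)."""
    # getattr with a default also covers objects that lack the attribute entirely.
    # Compare against None explicitly, since 0 is a valid pad token id.
    if getattr(tokenizer, "pad_token_id", None) is None:
        raise RuntimeError("If tokenizing on-the-fly, tokenizer must have a pad_token_id")
    return tokenizer.pad_token_id


# Stand-in objects (assumptions, not real tokenizers):
# one with pad_token and pad_token_id but no _pad_token, and one with no padding token.
custom_tok = SimpleNamespace(pad_token="<pad>", pad_token_id=0)
no_pad_tok = SimpleNamespace(pad_token=None, pad_token_id=None)

print(require_pad_token_id(custom_tok))
```

With a check like this, a tokenizer that defines pad_token_id but has no private _pad_token attribute would pass, while one without a padding token would still raise the same RuntimeError.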
