Tokenization issue in transformer NER #22

@mukesh-mehta

Description

In your custom data loader:

import torch
from torch.utils.data import Dataset

class CustomDataset(Dataset):
    def __init__(self, tokenizer, sentences, labels, max_len):
        self.len = len(sentences)
        self.sentences = sentences
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_len = max_len
        
    def __getitem__(self, index):
        sentence = str(self.sentences[index])
        inputs = self.tokenizer.encode_plus(
            sentence,
            None,
            add_special_tokens=True,
            max_length=self.max_len,
            pad_to_max_length=True,  # deprecated in newer transformers; use padding='max_length'
            return_token_type_ids=True
        )
        ids = inputs['input_ids']
        mask = inputs['attention_mask']
        label = list(self.labels[index])  # copy, so extend() below does not mutate the stored labels
        label.extend([4] * 200)           # pad the label ids with 4, then truncate
        label = label[:200]               # note: 200 is hard-coded here instead of self.max_len

        return {
            'ids': torch.tensor(ids, dtype=torch.long),
            'mask': torch.tensor(mask, dtype=torch.long),
            'tags': torch.tensor(label, dtype=torch.long)
        }
    
    def __len__(self):
        return self.len

According to my understanding:
Suppose you have a sentence w1 w2 w3 w4 whose BIO labels are O B-class1 I-class1 O.
When you encode the sentence, the tokenizer applies WordPiece and splits words into subwords, making the token sequence longer than the word sequence, and you then pad it to length 200 (say, to 10 for illustration): w1-a w1-b w2 w3-a w3-b w4 [PAD] [PAD] [PAD] [PAD]. But your labels stay word-level: O B-class1 I-class1 O 4 4 4 4 4 4. After the first split, the labels no longer line up with the tokens, so you are passing incorrect labels to your model.
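A minimal sketch of the misalignment and the usual fix (repeat each word's label over all of its subword pieces before padding). The `wordpiece` splitter below is a made-up stand-in for the real tokenizer, just to make the example self-contained:

```python
# Toy word-piece splitter: pretend any word longer than 2 chars
# splits into two pieces, mimicking how a real tokenizer can
# expand one word into several subword tokens.
def wordpiece(word):
    if len(word) > 2:
        return [word[:2], "##" + word[2:]]
    return [word]

PAD_LABEL = 4  # same pad label id as in the loader above

def naive_labels(labels, max_len):
    """What the current loader does: pad word-level labels, ignore subword splits."""
    return (labels + [PAD_LABEL] * max_len)[:max_len]

def aligned_labels(words, labels, max_len):
    """Repeat each word's label over all of its subword pieces, then pad."""
    out = []
    for word, lab in zip(words, labels):
        out.extend([lab] * len(wordpiece(word)))
    return (out + [PAD_LABEL] * max_len)[:max_len]

words = ["w1x", "w2", "w3x", "w4"]   # w1x and w3x each split into 2 pieces
labels = [0, 1, 2, 0]                # O B-class1 I-class1 O

tokens = [p for w in words for p in wordpiece(w)]
print(tokens)                         # 6 subword tokens for 4 words
print(naive_labels(labels, 10))       # labels drift after the first split
print(aligned_labels(words, labels, 10))  # labels track their subwords
```

With the HuggingFace fast tokenizers the same alignment can be done with the word-to-token map the tokenizer returns (e.g. `word_ids()`), instead of a hand-rolled splitter.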
