Tokenization issue in transformer NER #22

@mukesh-mehta

Description

In your custom data loader:

import torch
from torch.utils.data import Dataset

class CustomDataset(Dataset):
    def __init__(self, tokenizer, sentences, labels, max_len):
        self.len = len(sentences)
        self.sentences = sentences
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_len = max_len
        
    def __getitem__(self, index):
        sentence = str(self.sentences[index])
        inputs = self.tokenizer.encode_plus(
            sentence,
            None,
            add_special_tokens=True,
            max_length=self.max_len,
            pad_to_max_length=True,  # deprecated in newer transformers; use padding='max_length'
            return_token_type_ids=True
        )
        ids = inputs['input_ids']
        mask = inputs['attention_mask']
        label = list(self.labels[index])  # copy, so extend() below does not mutate the stored labels
        label.extend([4] * 200)           # pad the label ids with 4, then truncate
        label = label[:200]               # note: 200 is hard-coded here instead of self.max_len

        return {
            'ids': torch.tensor(ids, dtype=torch.long),
            'mask': torch.tensor(mask, dtype=torch.long),
            'tags': torch.tensor(label, dtype=torch.long)
        }
    
    def __len__(self):
        return self.len

According to my understanding:
Suppose you have a sentence w1 w2 w3 w4 whose BIO labels are O B-class1 I-class1 O.
When you encode the sentence, the tokenizer applies WordPiece and splits words into subwords, making the token sequence longer than the word sequence, and you then pad it to length 200 (say, to 10 for illustration): w1-a w1-b w2 w3-a w3-b w4 [PAD] [PAD] [PAD] [PAD]. But your labels stay word-level: O B-class1 I-class1 O 4 4 4 4 4 4. After the first split, the labels no longer line up with the tokens, so you are passing incorrect labels to your model.
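A minimal sketch of the misalignment and the usual fix (repeat each word's label over all of its subword pieces before padding). The `wordpiece` splitter below is a made-up stand-in for the real tokenizer, just to make the example self-contained:

```python
# Toy word-piece splitter: pretend any word longer than 2 chars
# splits into two pieces, mimicking how a real tokenizer can
# expand one word into several subword tokens.
def wordpiece(word):
    if len(word) > 2:
        return [word[:2], "##" + word[2:]]
    return [word]

PAD_LABEL = 4  # same pad label id as in the loader above

def naive_labels(labels, max_len):
    """What the current loader does: pad word-level labels, ignore subword splits."""
    return (labels + [PAD_LABEL] * max_len)[:max_len]

def aligned_labels(words, labels, max_len):
    """Repeat each word's label over all of its subword pieces, then pad."""
    out = []
    for word, lab in zip(words, labels):
        out.extend([lab] * len(wordpiece(word)))
    return (out + [PAD_LABEL] * max_len)[:max_len]

words = ["w1x", "w2", "w3x", "w4"]   # w1x and w3x each split into 2 pieces
labels = [0, 1, 2, 0]                # O B-class1 I-class1 O

tokens = [p for w in words for p in wordpiece(w)]
print(tokens)                         # 6 subword tokens for 4 words
print(naive_labels(labels, 10))       # labels drift after the first split
print(aligned_labels(words, labels, 10))  # labels track their subwords
```

With the HuggingFace fast tokenizers the same alignment can be done with the word-to-token map the tokenizer returns (e.g. `word_ids()`), instead of a hand-rolled splitter.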
