In your custom data loader:
import torch
from torch.utils.data import Dataset

class CustomDataset(Dataset):
    def __init__(self, tokenizer, sentences, labels, max_len):
        self.len = len(sentences)
        self.sentences = sentences
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __getitem__(self, index):
        sentence = str(self.sentences[index])
        inputs = self.tokenizer.encode_plus(
            sentence,
            None,
            add_special_tokens=True,
            max_length=self.max_len,
            pad_to_max_length=True,
            return_token_type_ids=True
        )
        ids = inputs['input_ids']
        mask = inputs['attention_mask']
        # Labels are padded/truncated to 200 per word, not per wordpiece token,
        # so they drift out of alignment with the encoded ids (see below).
        label = self.labels[index]
        label.extend([4] * 200)
        label = label[:200]
        return {
            'ids': torch.tensor(ids, dtype=torch.long),
            'mask': torch.tensor(mask, dtype=torch.long),
            'tags': torch.tensor(label, dtype=torch.long)
        }

    def __len__(self):
        return self.len

According to my understanding:
Suppose you have a sentence w1 w2 w3 w4, and its BIO labels are O B-class1 I-class1 O.
Once you encode the sentence, the tokenizer applies WordPiece and splits some words into subwords, making the token sequence longer than the word sequence, and you then pad it to max_len (say 200; using a length of 10 here for illustration): w1-a w1-b w2 w3-a w3-b w4 [PAD] [PAD] [PAD] [PAD]. But your labels are O B-class1 I-class1 O 4 4 4 4 4 4, so the label at each position no longer matches the token at that position. You are therefore passing incorrect labels to your model. See the sketch below for one way to keep them aligned.
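A minimal sketch of aligning BIO labels with wordpiece tokens, assuming a fast tokenizer (here "bert-base-uncased") so that word_ids() is available; the label ids, the pad/filler label id of 4, and max_len are illustrative, not taken from the repo:

import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

words = ["w1", "w2", "w3", "w4"]   # sentence pre-split into words
word_labels = [0, 1, 2, 0]         # e.g. O, B-class1, I-class1, O
pad_label_id = 4                   # same filler id the loader uses
max_len = 10

encoding = tokenizer(
    words,
    is_split_into_words=True,
    padding="max_length",
    truncation=True,
    max_length=max_len,
)

# word_ids() maps each token position back to its source word (None for
# special tokens and padding), so the labels are expanded in step with the
# subword split instead of being padded blindly at the end.
aligned_labels = [
    pad_label_id if word_id is None else word_labels[word_id]
    for word_id in encoding.word_ids()
]

ids = torch.tensor(encoding["input_ids"], dtype=torch.long)
mask = torch.tensor(encoding["attention_mask"], dtype=torch.long)
tags = torch.tensor(aligned_labels, dtype=torch.long)

With this, every subword of a labelled word inherits that word's label, and only special tokens and padding get the filler id.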