In train_diloco_torch.py, the validation set is loaded with streaming=True.
This means that evaluation never terminates: a streamed dataset is an IterableDataset, which does not implement __len__, so the evaluation loop has no defined endpoint.
    ds = (
        load_dataset("PrimeIntellect/c4-tiny", "en", ignore_verifications=True)
        if c4_tiny
        else load_dataset(
            "allenai/c4",
            "en",
            streaming=True,
            data_files={
                "train": "en/c4-train.*.json.gz",
                "validation": "en/c4-validation.00000-of-00008.json.gz",
            },
        )
    )
To fix this, we can either cap evaluation at a fixed number of samples (for example, 1000) when computing perplexity, or simply load the validation dataset with streaming=False so that it has a defined length.
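
A minimal sketch of the first option, assuming the evaluation loop can be written as a plain iteration over samples. The names `capped_perplexity`, `loss_fn`, and `endless_stream` are hypothetical and do not come from train_diloco_torch.py; `loss_fn` stands in for whatever computes the model's loss on one sample. Averaging the loss per sample is also an assumption, since the script may average per token instead.

```python
import itertools
import math
from typing import Callable, Iterable, Iterator


def capped_perplexity(
    samples: Iterable,
    loss_fn: Callable[[object], float],
    max_samples: int = 1000,
) -> float:
    """Return exp(mean loss) over at most `max_samples` samples.

    itertools.islice stops after `max_samples` items, so the loop ends
    even when `samples` has no length or never runs out.
    """
    total_loss = 0.0
    count = 0
    for sample in itertools.islice(samples, max_samples):
        total_loss += loss_fn(sample)  # hypothetical per-sample loss
        count += 1
    if count == 0:
        raise ValueError("no samples were available for evaluation")
    return math.exp(total_loss / count)


def endless_stream() -> Iterator[int]:
    """Stand-in for a streamed dataset: iterable, but without __len__."""
    return itertools.count()


# Terminates after 1000 samples even though the stream is endless.
ppl = capped_perplexity(endless_stream(), lambda _sample: 2.0, max_samples=1000)
```

With the datasets library, the same cap can likely be applied directly to the streamed split with `ds["validation"].take(1000)`, since `IterableDataset` provides a `take` method.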