Skip to content

Conversation

@mpeex
Copy link

@mpeex mpeex commented Dec 8, 2025

…sample

Before submitting
  • [N ] Was this discussed/agreed via a Github issue? (no need for typos and docs improvements)
  • [ Y] Did you read the contributor guideline, Pull Request section?
  • [ N] Did you make sure to update the docs?
  • [ N] Did you write any new necessary tests?

What does this PR do?

  1. enable support for HuggingFace revision/branches as per 1 and 2
  2. transform function of StreamingDataset can return None to skip the sample

Fixes # (issue).

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues there's a high chance it will not be merged.

Did you have fun? oh ye

Make sure you had fun coding 🙃

item = self.transform(item)

return item
return item if item else self.__next__()
Copy link
Collaborator

@deependujha deependujha Dec 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It makes more sense to have this in __next__ method, and not in __getitem__.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for the suggestion

@deependujha
Copy link
Collaborator

deependujha commented Dec 9, 2025

Hi @mpeex, can you share more details on when HF dataset's __getitem__ can return None.

If this is your specific use case, one option can be to override the __next__ method in your implementation.

class MyDS(StreamingDataset):
    def __next__(self, index):
         item = super().__next__(self)
         return item if item else self.__next__()

ds = MyDs(...)

Also, can you write some tests verifying the behaviour for HF dataset revision/branches?

@mpeex
Copy link
Author

mpeex commented Dec 9, 2025

hi @deependujha thanks for reviewing. The usecase is to filter samples while streaming using the transform function, for ex:

t = [partial(t_1, args..),
        ...
        partial(t_N, args..)
        ]
StreamingDatasets(uri, transform=t)

returning None in any of the t_x function can be used to skip the sample, very straightforward.
I understand your suggestion (I didn't think about it).

I will write some tests to verify the behaviour for HF dataset revision/branches.

thanks for your time

@philgzl
Copy link
Contributor

philgzl commented Dec 12, 2025

I think that skipping samples, whether that's via transform or by subclassing StreamingDataset, breaks integration with StreamingDataLoader and ParallelStreamingDataset. See #642.

@bhimrazy bhimrazy marked this pull request as draft January 4, 2026 18:29
@mpeex
Copy link
Author

mpeex commented Jan 6, 2026

considering @philgzl comment, what is than the correct way to filter out samples if using StreamingDataLoader ?

@bhimrazy
Copy link
Collaborator

considering @philgzl comment, what is than the correct way to filter out samples if using StreamingDataLoader ?

@mpeex — since this appears to be specific to your filtering use case, one option is to use the approach suggested by @deependujha and override the __next__ method to skip samples.

However, as @philgzl pointed out, this comes with caveats and may not work correctly with StreamingDataLoader / ParallelStreamingDataset (e.g. sharding, worker sync, resume semantics).

To fully avoid these caveats, I’d recommend filtering during the litdata optimize step instead. By doing the filtering upfront at optimization time, you ensure that only valid samples are written into the optimized dataset, and the streaming layer can then operate with stable indexing and correct resume behavior.

@bhimrazy
Copy link
Collaborator

bhimrazy commented Jan 14, 2026

@mpeex, would you be open to opening a separate PR that only contains the HuggingFace revision/branch support?
Ideally, this would come from a separate branch (not main) of your fork. I tried to apply edits directly to this PR but wasn’t able to.

Let me know if that works for you, and thanks again for working through this 👍

@deependujha
Copy link
Collaborator

Having a separate PR for HF revision/branches makes more sense.

@deependujha deependujha added help wanted Extra attention is needed waiting on author Waiting for user input or feedback. and removed help wanted Extra attention is needed labels Jan 14, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

waiting on author Waiting for user input or feedback.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants