Download Hugging Face datasets for text classification tutorials #1288
Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.
Greptile Overview
Greptile Summary
This PR addresses issue #1084 by replacing toy datasets with real-world Hugging Face datasets across all text classification tutorials. The changes include:
All notebooks have been executed successfully, with outputs showing realistic classification results. The changes are consistent across all 10 affected tutorial notebooks: 8 receive major updates (dataset downloads and Parquet support) and 2 receive minor documentation updates.
Confidence Score: 5/5
Important Files Changed
File Analysis
Sequence Diagram
sequenceDiagram
participant User
participant Notebook
participant HF as Hugging Face Hub
participant FileSystem
participant Reader as ParquetReader/JsonlReader
participant Classifier
participant Writer as ParquetWriter/JsonlWriter
User->>Notebook: Run tutorial notebook
Notebook->>FileSystem: Check if input_data_dir exists
alt Directory empty or missing
Notebook->>HF: snapshot_download(fineweb-edu-llama3-annotations)
HF-->>FileSystem: Download Parquet files to input_data_dir/data/*
end
Notebook->>Reader: Initialize reader stage with text_field
Reader->>FileSystem: Read files from input_data_dir
FileSystem-->>Reader: Return text data
Notebook->>Classifier: Initialize classifier with text_field parameter
Reader->>Classifier: Pass text data
Classifier->>Classifier: Tokenize on CPU
Classifier->>Classifier: Model inference on GPU
Classifier-->>Writer: Return predictions
Writer->>FileSystem: Write results to output directory
FileSystem-->>Notebook: Confirmation
Notebook-->>User: Display results with pd.read_parquet/read_json
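The "directory empty or missing" branch of the diagram can be sketched roughly as follows. This is a minimal illustration, not the code from the notebooks: the helper name `ensure_dataset` and its `download` parameter are hypothetical, and the full `repo_id` (including the organization prefix, which the diagram does not show) has to be supplied by the caller. The only library call assumed is `huggingface_hub.snapshot_download` with its `repo_id`, `repo_type`, and `local_dir` arguments.

```python
from pathlib import Path
from typing import Callable, Optional


def ensure_dataset(
    input_data_dir: str,
    repo_id: str,
    download: Optional[Callable[..., str]] = None,
) -> bool:
    """Download a Hugging Face dataset only if input_data_dir is empty or missing.

    Hypothetical helper for illustration. Returns True if a download was
    triggered and False if existing files were reused.
    """
    path = Path(input_data_dir)

    # Reuse whatever is already on disk so reruns do not download again.
    if path.is_dir() and any(path.iterdir()):
        return False

    path.mkdir(parents=True, exist_ok=True)

    if download is None:
        # Imported lazily so the helper can be exercised without the library.
        from huggingface_hub import snapshot_download

        download = snapshot_download

    # repo_type="dataset" is needed because the default repo type is "model".
    download(repo_id=repo_id, repo_type="dataset", local_dir=str(path))
    return True
```

Passing the downloader in as a parameter is a design choice for this sketch only: it lets the skip-if-present logic be checked offline with a stub, while the default path still goes through `snapshot_download`.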
10 files reviewed, no comments
Closes #1084.