@sarahyurick
Contributor

Closes #1084.

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
@copy-pr-bot

copy-pr-bot bot commented Dec 3, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

sarahyurick and others added 14 commits December 3, 2025 12:41
Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
@sarahyurick sarahyurick marked this pull request as ready for review December 4, 2025 00:17
@sarahyurick sarahyurick requested a review from arhamm1 December 4, 2025 00:18
@greptile-apps
Contributor

greptile-apps bot commented Dec 4, 2025

Greptile Overview

Greptile Summary

This PR successfully addresses issue #1084 by replacing toy datasets with real-world Hugging Face datasets across all text classification tutorials. The changes include:

  • Replaced hardcoded toy data with automatic downloads from the HuggingFaceFW/fineweb-edu-llama3-annotations dataset
  • Added Parquet file format support alongside existing JSONL support
  • Configured text_field parameter for all classifiers to properly specify the input text column
  • Updated documentation to reflect support for both JSONL and Parquet formats
  • Enhanced tutorials to demonstrate real-world usage patterns including DocumentBatch, performance optimization, and filtering

All notebooks have been executed successfully with outputs showing realistic classification results. The changes are consistent across all 10 affected tutorial notebooks, with 8 receiving major updates (dataset downloads and Parquet support) and 2 receiving minor documentation updates.
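
The download-if-missing behavior described above can be sketched roughly as follows. This is a minimal illustration, not the notebooks' actual code: the helper names `needs_download` and `ensure_dataset` and the injectable `download` parameter are hypothetical. Only the dataset ID and the use of `snapshot_download` from `huggingface_hub` come from this PR's summary and sequence diagram.

```python
from pathlib import Path
from typing import Callable, Optional

# Dataset ID taken from the PR summary.
REPO_ID = "HuggingFaceFW/fineweb-edu-llama3-annotations"


def needs_download(input_data_dir: str) -> bool:
    """Return True if the directory is missing or contains no entries."""
    path = Path(input_data_dir)
    return not path.is_dir() or not any(path.iterdir())


def ensure_dataset(
    input_data_dir: str,
    download: Optional[Callable[[str], None]] = None,
) -> bool:
    """Download the dataset only when input_data_dir is empty or missing.

    `download` is a hypothetical hook so the logic can be exercised offline;
    by default it calls huggingface_hub.snapshot_download.
    Returns True if a download was triggered, False otherwise.
    """
    if not needs_download(input_data_dir):
        return False

    if download is None:
        # Imported lazily so the helper can be loaded without the package.
        from huggingface_hub import snapshot_download

        def download(local_dir: str) -> None:
            snapshot_download(
                repo_id=REPO_ID,
                repo_type="dataset",
                local_dir=local_dir,
            )

    download(input_data_dir)
    return True
```

Calling `ensure_dataset("./input_data")` a second time is a no-op once the directory holds files, which keeps repeated notebook runs from re-downloading the data.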

Confidence Score: 5/5

  • This PR is safe to merge with minimal risk
  • Score reflects successful notebook executions with realistic outputs, consistent implementation patterns across all tutorials, proper error handling, and no breaking changes to existing functionality
  • No files require special attention

Important Files Changed

File Analysis

| Filename | Score | Overview |
| --- | --- | --- |
| tutorials/text/distributed-data-classification/quality-classification.ipynb | 5/5 | Replaced toy data with real Hugging Face dataset download, added Parquet support, and configured text_field parameter for classifier |
| tutorials/text/distributed-data-classification/domain-classification.ipynb | 5/5 | Replaced toy data with real Hugging Face dataset download, added Parquet support, and configured text_field parameter for classifier |
| tutorials/text/distributed-data-classification/content-type-classification.ipynb | 5/5 | Replaced toy data with real Hugging Face dataset download, added Parquet support, and configured text_field parameter for classifier |
| tutorials/text/distributed-data-classification/fineweb-edu-classification.ipynb | 5/5 | Replaced toy data with real Hugging Face dataset download, added Parquet support, and configured text_field parameter for classifier |
| tutorials/text/distributed-data-classification/multilingual-domain-classification.ipynb | 5/5 | Replaced toy data with real Hugging Face dataset download, added Parquet support, and configured text_field parameter for classifier |
| tutorials/text/distributed-data-classification/prompt-task-complexity-classification.ipynb | 5/5 | Replaced toy data with real Hugging Face dataset download, added Parquet support, and configured text_field parameter for classifier |

Sequence Diagram

sequenceDiagram
    participant User
    participant Notebook
    participant HF as Hugging Face Hub
    participant FileSystem
    participant Reader as ParquetReader/JsonlReader
    participant Classifier
    participant Writer as ParquetWriter/JsonlWriter
    
    User->>Notebook: Run tutorial notebook
    Notebook->>FileSystem: Check if input_data_dir exists
    alt Directory empty or missing
        Notebook->>HF: snapshot_download(fineweb-edu-llama3-annotations)
        HF-->>FileSystem: Download Parquet files to input_data_dir/data/*
    end
    
    Notebook->>Reader: Initialize reader stage with text_field
    Reader->>FileSystem: Read files from input_data_dir
    FileSystem-->>Reader: Return text data
    
    Notebook->>Classifier: Initialize classifier with text_field parameter
    Reader->>Classifier: Pass text data
    Classifier->>Classifier: Tokenize on CPU
    Classifier->>Classifier: Model inference on GPU
    Classifier-->>Writer: Return predictions
    
    Writer->>FileSystem: Write results to output directory
    FileSystem-->>Notebook: Confirmation
    Notebook-->>User: Display results with pd.read_parquet/read_json
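
The last step of the diagram, displaying results with `pd.read_parquet` or `pd.read_json`, might look something like the sketch below. The helper name `load_results`, the flat output-directory layout, and the `*.parquet` / `*.jsonl` file patterns are assumptions made for illustration and are not taken from the notebooks.

```python
from pathlib import Path

import pandas as pd


def load_results(output_dir: str) -> pd.DataFrame:
    """Load classifier output written as Parquet or JSONL into one DataFrame.

    Assumes (hypothetically) that the writer stage leaves *.parquet or
    *.jsonl files directly inside output_dir. Parquet is preferred when
    both kinds of file are present.
    """
    out = Path(output_dir)
    parquet_files = sorted(out.glob("*.parquet"))
    jsonl_files = sorted(out.glob("*.jsonl"))

    if parquet_files:
        frames = [pd.read_parquet(p) for p in parquet_files]
    elif jsonl_files:
        # lines=True reads one JSON object per line (the JSONL convention).
        frames = [pd.read_json(p, lines=True) for p in jsonl_files]
    else:
        raise FileNotFoundError(f"No Parquet or JSONL files found in {out}")

    return pd.concat(frames, ignore_index=True)
```

Reading Parquet this way requires a Parquet engine such as `pyarrow` to be installed alongside pandas.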

Contributor

@greptile-apps greptile-apps bot left a comment

10 files reviewed, no comments

Development

Successfully merging this pull request may close these issues.

Consider using real-world datasets for text classifier tutorials