feat: add HuggingFace Hub integration for dataset publishing #275

Merged
nabinchha merged 28 commits into main from nmulepati/feat/7-push-to-hf
Feb 4, 2026

Conversation

@nabinchha
Contributor

@nabinchha nabinchha commented Jan 30, 2026

📋 Summary

This PR adds comprehensive HuggingFace Hub integration, enabling users to publish DataDesigner datasets directly to the HuggingFace Hub with automated dataset card generation, flexible upload options, and robust error handling.

🔄 Changes

✨ Added

  • HuggingFaceHubClient - Complete client for uploading datasets to HuggingFace Hub with support for:
    • Parquet and JSON file formats
    • Automatic repository creation
    • Path mapping configuration for column-specific file handling
    • Comprehensive logging and progress tracking
  • DataDesignerDatasetCard - Rich dataset card generator with:
    • Automatic metadata extraction from DataDesigner configs
    • Model information (aliases, inference parameters, system prompts)
    • Column configuration details (samplers, validators, LLM types)
    • Generation statistics and filtering info
    • Custom markdown template rendering
  • push_to_hub() method on DatasetCreationResults - Simple API for publishing results
  • Comprehensive test suite with 569+ test lines covering edge cases and error scenarios

🔧 Changed

  • Enhanced DatasetCreationResults class to support HuggingFace publishing workflow

🔍 Attention Areas

⚠️ Reviewers: Please pay special attention to the following:

  • client.py (349 lines) - Core upload logic with path mapping and error handling
  • dataset_card.py (139 lines) - Dataset card template rendering and metadata extraction
  • test_client.py (569 lines) - Extensive test coverage for upload scenarios

For an example, see this dataset card, which was published directly from the create results object: https://huggingface.co/datasets/nabinnvidia/multi-lingual-greetings

create_result.push_to_hub(
    repo_id="nabinnvidia/multi-lingual-greetings",
    private=False,
    description="This dataset is a test dataset for multi-lingual greetings."
)

[Image: Screenshot 2026-02-02 at 11 55 14 AM]

Closes #7

The implementation draws inspiration from the conversations in PR #127.

🤖 Generated with AI

Implement HuggingFace Hub integration to upload DataDesigner datasets:
- Add HuggingFaceHubClient with upload_dataset method
- Upload main parquet files to data/ subset
- Upload processor outputs to data/{processor_name}/ subsets
- Generate dataset card from metadata.json with column statistics
- Include sdg.json and metadata.json configuration files
- Comprehensive validation and error handling
- Add push_to_hub() method to DatasetCreationResults
…nitions

- Add progress logging with emojis following codebase style
- Add repository exists check before creation
- Update metadata.json paths for HuggingFace structure (parquet-files/ → data/, processors-files/{name}/ → {name}/)
- Enhance dataset card with detailed intro, tabular schema/statistics, and clickable config links
- Add explicit configs in YAML frontmatter to fix schema mismatch between main dataset and processor outputs
- Set data config as default configuration
- Add description parameter to push_to_hub() for custom dataset card content
- Description appears after NeMo Data Designer intro section
- Update dataset card template to conditionally render custom description
- Add tests for with/without custom description scenarios
- Make description parameter required in push_to_hub()
- Improve dataset card layout with flexbox header (title + right-aligned tagline)
- Add horizontal dividers between sections for visual separation
- Add emoji icons to section headers for better readability
- Move About NeMo Data Designer section after Citation
- Update section order: Description → Quick Start → Dataset Summary → Schema & Statistics → Generation Details → Citation → About
- Update all tests to provide required description parameter
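
The explicit configs mentioned in the commit notes above could look roughly like the following. This is a minimal sketch, not the code from this PR: `build_card_configs`, the processor name, and the `*.parquet` glob patterns are assumptions for illustration. Only the `data` default config and the per-processor subsets come from the notes, and the `config_name` / `data_files` / `default` keys follow the Hub's dataset card YAML conventions.

```python
from typing import Any


def build_card_configs(processor_names: list[str]) -> list[dict[str, Any]]:
    """Build the `configs` entries for a dataset card's YAML frontmatter.

    Hypothetical helper: the main dataset lives under data/ and is marked
    as the default config; each processor output gets its own config so
    that differing schemas are not merged into one subset.
    """
    configs: list[dict[str, Any]] = [
        {"config_name": "data", "data_files": "data/*.parquet", "default": True}
    ]
    for name in processor_names:
        configs.append({"config_name": name, "data_files": f"{name}/*.parquet"})
    return configs


# "my_processor" is a made-up processor name.
configs = build_card_configs(["my_processor"])
for config in configs:
    print(config)
```

Keeping each processor output in its own config is what avoids the schema mismatch described above, since the Hub then loads every subset with its own schema.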
@nabinchha nabinchha requested a review from a team as a code owner January 30, 2026 21:42
@greptile-apps
greptile-apps bot commented Jan 30, 2026

Greptile Overview

Greptile Summary

This PR adds comprehensive HuggingFace Hub integration to DataDesigner, enabling users to publish datasets directly to the HuggingFace Hub with a single method call. The implementation includes automated dataset card generation with rich metadata, robust error handling with specific exception types, and flexible upload options for both main datasets and processor outputs.

Key Changes:

  • Added HuggingFaceHubClient with comprehensive validation, authentication handling, and multi-stage upload workflow (dataset card → data files → processor outputs → config files)
  • Created DataDesignerDatasetCard with custom Jinja2 template featuring automatic metadata extraction, column statistics tables, generation details, and multi-config support
  • Extended DatasetCreationResults with push_to_hub() method providing clean public API
  • Added helper methods to ArtifactStorage for tracking file paths across main dataset and processor outputs
  • Implemented extensive test suite (826 lines) with fixtures, mocks, and edge case coverage

Implementation Highlights:

  • Proper resource cleanup with try/finally for temporary files
  • Specific HTTP error handling (401, 403) with actionable error messages
  • Path mapping to transform local structure (parquet-files/) to HuggingFace conventions (data/)
  • Support for custom tags and private repositories
  • Follows all project style guidelines (type annotations, error handling, documentation)
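
To make the "specific HTTP error handling" highlight concrete, here is a minimal sketch of mapping status codes to actionable messages. The function name `explain_upload_error` and the message wording are assumptions for illustration and are not taken from `client.py`; only the special handling of 401 and 403 comes from the summary above.

```python
def explain_upload_error(status_code: int, repo_id: str) -> str:
    """Map an HTTP status code to an actionable message (illustrative wording)."""
    if status_code == 401:
        # 401: the request was not authenticated.
        return (
            f"Authentication failed while uploading to '{repo_id}'. "
            "Check that a valid Hugging Face token is provided."
        )
    if status_code == 403:
        # 403: authenticated, but not allowed to write to this repository.
        return (
            f"Permission denied for '{repo_id}'. "
            "Check that the token has write access to this repository."
        )
    return f"Upload to '{repo_id}' failed with HTTP status {status_code}."


print(explain_upload_error(401, "username/dataset-name"))
```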

The PR is well-structured, thoroughly tested, and ready for production use.

Confidence Score: 5/5

  • This PR is safe to merge with high confidence
  • The implementation demonstrates exceptional code quality with comprehensive error handling, thorough validation, extensive test coverage (826 test lines), proper resource cleanup, and adherence to project style guidelines. The HuggingFace integration is well-architected with clear separation of concerns between client logic, dataset card generation, and the public API.
  • No files require special attention - all implementations are production-ready

Important Files Changed

| Filename | Overview |
| --- | --- |
| packages/data-designer/src/data_designer/integrations/huggingface/client.py | comprehensive HuggingFace client with robust validation, error handling, and upload functionality |
| packages/data-designer/src/data_designer/integrations/huggingface/dataset_card.py | clean dataset card generator with proper metadata extraction and size categorization |
| packages/data-designer/src/data_designer/interface/results.py | added push_to_hub() method with clear API and comprehensive documentation |
| packages/data-designer/tests/integrations/huggingface/test_client.py | extensive test coverage (559 lines) with fixtures, mocks, and edge case validation |
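
The "size categorization" noted for `dataset_card.py` presumably buckets the record count into the Hub's `size_categories` tags. The sketch below shows one way to do that. `size_category` is a hypothetical name, and the choice to place exact boundaries (for example, 1,000 records) in the higher bucket is an assumption, not something confirmed by this PR.

```python
def size_category(num_records: int) -> str:
    """Return a Hub-style size_categories tag for a record count."""
    buckets = [
        (1_000, "n<1K"),
        (10_000, "1K<n<10K"),
        (100_000, "10K<n<100K"),
        (1_000_000, "100K<n<1M"),
        (10_000_000, "1M<n<10M"),
        (100_000_000, "10M<n<100M"),
        (1_000_000_000, "100M<n<1B"),
        (10_000_000_000, "1B<n<10B"),
        (100_000_000_000, "10B<n<100B"),
        (1_000_000_000_000, "100B<n<1T"),
    ]
    # Pick the first bucket whose upper bound exceeds the count.
    for upper_bound, label in buckets:
        if num_records < upper_bound:
            return label
    return "n>1T"


print(size_category(500))
print(size_category(25_000))
```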

Sequence Diagram

sequenceDiagram
    participant User
    participant Results as DatasetCreationResults
    participant Client as HuggingFaceHubClient
    participant API as HfApi
    participant Card as DataDesignerDatasetCard
    participant Storage as ArtifactStorage

    User->>Results: push_to_hub(repo_id, description, token, private, tags)
    Results->>Client: __init__(token)
    Client->>API: HfApi(token)
    Results->>Client: upload_dataset(repo_id, base_dataset_path, description, private, tags)
    
    Client->>Client: _validate_repo_id(repo_id)
    Note over Client: Check format: username/dataset-name<br/>Validate with HF validator
    
    Client->>Client: _validate_dataset_path(base_dataset_path)
    Note over Client: Verify metadata.json exists<br/>Check parquet-files/ directory<br/>Validate JSON structure
    
    Client->>API: repo_exists(repo_id)
    API-->>Client: True/False
    Client->>API: create_repo(repo_id, exist_ok=True, private)
    API-->>Client: Repo created/exists
    
    Client->>Client: _upload_dataset_card(...)
    Client->>Storage: Read metadata.json
    Storage-->>Client: metadata dict
    Client->>Storage: Read builder_config.json
    Storage-->>Client: builder_config dict
    Client->>Card: from_metadata(metadata, builder_config, repo_id, description, tags)
    Card->>Card: Extract stats, compute size category
    Card->>Card: Render Jinja2 template
    Card-->>Client: DatasetCard instance
    Client->>Card: push_to_hub(repo_id)
    Card->>API: Upload README.md
    API-->>Card: Success
    
    Client->>Client: _upload_main_dataset_files(...)
    Client->>API: upload_folder(parquet_folder → data/)
    API-->>Client: Success
    
    Client->>Client: _upload_processor_files(...)
    loop For each processor
        Client->>API: upload_folder(processor_dir → processor_name/)
        API-->>Client: Success
    end
    
    Client->>Client: _upload_config_files(...)
    Client->>API: upload_file(builder_config.json)
    API-->>Client: Success
    Client->>Client: _update_metadata_paths(metadata_path)
    Note over Client: Transform paths:<br/>parquet-files/ → data/<br/>processors-files/X/ → X/
    Client->>API: upload_file(metadata.json)
    API-->>Client: Success
    
    Client-->>Results: HuggingFace dataset URL
    Results-->>User: https://huggingface.co/datasets/username/dataset-name
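
The path transformation noted in the diagram (parquet-files/ → data/, processors-files/X/ → X/) can be sketched as a small pure function. This is an illustration under assumptions, not the PR's `_update_metadata_paths`: the name `to_hub_path`, the example file names, and the pass-through behavior for unrecognized paths are all hypothetical.

```python
from pathlib import PurePosixPath


def to_hub_path(local_path: str) -> str:
    """Map a local artifact path to its location in the Hub repository.

    parquet-files/<file>            -> data/<file>
    processors-files/<name>/<file>  -> <name>/<file>
    Anything else is returned unchanged.
    """
    parts = PurePosixPath(local_path).parts
    if parts and parts[0] == "parquet-files":
        return str(PurePosixPath("data", *parts[1:]))
    if len(parts) >= 2 and parts[0] == "processors-files":
        return str(PurePosixPath(*parts[1:]))
    return local_path


# File and processor names below are made up for the example.
print(to_hub_path("parquet-files/batch_00000.parquet"))
print(to_hub_path("processors-files/my_processor/batch_00000.parquet"))
```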

@greptile-apps greptile-apps bot left a comment

4 files reviewed, no comments

@greptile-apps greptile-apps bot left a comment

4 files reviewed, no comments

@greptile-apps greptile-apps bot left a comment

4 files reviewed, 5 comments

@greptile-apps greptile-apps bot left a comment

4 files reviewed, no comments

@greptile-apps greptile-apps bot left a comment

4 files reviewed, no comments

@greptile-apps greptile-apps bot left a comment

4 files reviewed, 2 comments

…ace/dataset_card.py

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

@greptile-apps greptile-apps bot left a comment

4 files reviewed, no comments

@greptile-apps greptile-apps bot left a comment

5 files reviewed, no comments

@greptile-apps greptile-apps bot left a comment

4 files reviewed, no comments

@greptile-apps greptile-apps bot left a comment

4 files reviewed, no comments

@greptile-apps greptile-apps bot left a comment

4 files reviewed, no comments

@greptile-apps greptile-apps bot left a comment

4 files reviewed, 1 comment

@greptile-apps greptile-apps bot left a comment

4 files reviewed, 1 comment

Contributor

@davanstrien davanstrien left a comment

Left a few small suggestions

nabinchha and others added 3 commits February 4, 2026 09:55
…ace/client.py

Co-authored-by: Daniel van Strien <davanstrien@users.noreply.github.com>
Co-authored-by: Daniel van Strien <davanstrien@users.noreply.github.com>
…ace/client.py

Co-authored-by: Daniel van Strien <davanstrien@users.noreply.github.com>

@greptile-apps greptile-apps bot left a comment

3 files reviewed, 3 comments

@nabinchha nabinchha requested a review from johnnygreco February 4, 2026 18:29
@johnnygreco
Contributor

Nice, thanks @nabinchha!

Thank you @davanstrien and @Wauplin for your feedback – super helpful!

@nabinchha nabinchha merged commit 236f62b into main Feb 4, 2026
46 checks passed
@nabinchha nabinchha deleted the nmulepati/feat/7-push-to-hf branch February 4, 2026 18:41

Development

Successfully merging this pull request may close these issues.

Publish to dataset HF Hub

5 participants