feat: add HuggingFace Hub integration for dataset publishing #275
Merged
Implement HuggingFace Hub integration to upload DataDesigner datasets:
- Add HuggingFaceHubClient with upload_dataset method
- Upload main parquet files to data/ subset
- Upload processor outputs to data/{processor_name}/ subsets
- Generate dataset card from metadata.json with column statistics
- Include sdg.json and metadata.json configuration files
- Comprehensive validation and error handling
- Add push_to_hub() method to DatasetCreationResults
…nitions
- Add progress logging with emojis following codebase style
- Add repository exists check before creation
- Update metadata.json paths for HuggingFace structure (parquet-files/ → data/, processors-files/{name}/ → {name}/)
- Enhance dataset card with detailed intro, tabular schema/statistics, and clickable config links
- Add explicit configs in YAML frontmatter to fix schema mismatch between main dataset and processor outputs
- Set data config as default configuration
- Add description parameter to push_to_hub() for custom dataset card content
  - Description appears after NeMo Data Designer intro section
  - Update dataset card template to conditionally render custom description
  - Add tests for with/without custom description scenarios
- Make description parameter required in push_to_hub()
  - Improve dataset card layout with flexbox header (title + right-aligned tagline)
  - Add horizontal dividers between sections for visual separation
  - Add emoji icons to section headers for better readability
  - Move About NeMo Data Designer section after Citation
  - Update section order: Description → Quick Start → Dataset Summary → Schema & Statistics → Generation Details → Citation → About
  - Update all tests to provide required description parameter
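The path rewrite listed in the notes above (parquet-files/ → data/, processors-files/{name}/ → {name}/) can be sketched as a small pure function. This is a minimal illustration under stated assumptions, not the code from this PR: the name `to_hub_path` is hypothetical, and it assumes paths are stored as forward-slash relative strings.

```python
def to_hub_path(local_path: str) -> str:
    """Map a local artifact path to its assumed location in the Hub repo.

    Hypothetical helper: mirrors only the two prefix rewrites named in the
    commit notes and leaves every other path unchanged.
    """
    main_prefix = "parquet-files/"
    processor_prefix = "processors-files/"
    if local_path.startswith(main_prefix):
        # Main dataset files move under data/.
        return "data/" + local_path[len(main_prefix):]
    if local_path.startswith(processor_prefix):
        # Processor outputs keep their {name}/ directory at the repo root.
        return local_path[len(processor_prefix):]
    return local_path
```

For example, `to_hub_path("processors-files/dedupe/part.parquet")` yields `dedupe/part.parquet`, where `dedupe` is a made-up processor name.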
Greptile Overview
Greptile Summary
This PR adds comprehensive HuggingFace Hub integration to DataDesigner, enabling users to publish datasets directly to the HuggingFace Hub with a single method call. The implementation includes automated dataset card generation with rich metadata, robust error handling with specific exception types, and flexible upload options for both main datasets and processor outputs.
Key Changes:
Implementation Highlights:
The PR is well-structured, thoroughly tested, and ready for production use.
| Filename | Overview |
|---|---|
| packages/data-designer/src/data_designer/integrations/huggingface/client.py | comprehensive HuggingFace client with robust validation, error handling, and upload functionality |
| packages/data-designer/src/data_designer/integrations/huggingface/dataset_card.py | clean dataset card generator with proper metadata extraction and size categorization |
| packages/data-designer/src/data_designer/interface/results.py | added push_to_hub() method with clear API and comprehensive documentation |
| packages/data-designer/tests/integrations/huggingface/test_client.py | extensive test coverage (559 lines) with fixtures, mocks, and edge case validation |
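The "size categorization" mentioned for dataset_card.py presumably maps a record count to one of the Hub's `size_categories` tag values. A minimal sketch of such a mapping follows; the function name `size_category` and the treatment of exact boundaries (a count equal to a threshold goes into the higher bucket) are assumptions, not taken from the PR.

```python
def size_category(num_records: int) -> str:
    """Return the Hub size_categories tag for a record count (sketch)."""
    # (exclusive upper bound, tag) pairs in ascending order.
    buckets = [
        (10**3, "n<1K"),
        (10**4, "1K<n<10K"),
        (10**5, "10K<n<100K"),
        (10**6, "100K<n<1M"),
        (10**7, "1M<n<10M"),
        (10**8, "10M<n<100M"),
        (10**9, "100M<n<1B"),
        (10**10, "1B<n<10B"),
        (10**11, "10B<n<100B"),
        (10**12, "100B<n<1T"),
    ]
    for upper_bound, tag in buckets:
        if num_records < upper_bound:
            return tag
    return "n>1T"
```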
Sequence Diagram
sequenceDiagram
participant User
participant Results as DatasetCreationResults
participant Client as HuggingFaceHubClient
participant API as HfApi
participant Card as DataDesignerDatasetCard
participant Storage as ArtifactStorage
User->>Results: push_to_hub(repo_id, description, token, private, tags)
Results->>Client: __init__(token)
Client->>API: HfApi(token)
Results->>Client: upload_dataset(repo_id, base_dataset_path, description, private, tags)
Client->>Client: _validate_repo_id(repo_id)
Note over Client: Check format: username/dataset-name<br/>Validate with HF validator
Client->>Client: _validate_dataset_path(base_dataset_path)
Note over Client: Verify metadata.json exists<br/>Check parquet-files/ directory<br/>Validate JSON structure
Client->>API: repo_exists(repo_id)
API-->>Client: True/False
Client->>API: create_repo(repo_id, exist_ok=True, private)
API-->>Client: Repo created/exists
Client->>Client: _upload_dataset_card(...)
Client->>Storage: Read metadata.json
Storage-->>Client: metadata dict
Client->>Storage: Read builder_config.json
Storage-->>Client: builder_config dict
Client->>Card: from_metadata(metadata, builder_config, repo_id, description, tags)
Card->>Card: Extract stats, compute size category
Card->>Card: Render Jinja2 template
Card-->>Client: DatasetCard instance
Client->>Card: push_to_hub(repo_id)
Card->>API: Upload README.md
API-->>Card: Success
Client->>Client: _upload_main_dataset_files(...)
Client->>API: upload_folder(parquet_folder → data/)
API-->>Client: Success
Client->>Client: _upload_processor_files(...)
loop For each processor
Client->>API: upload_folder(processor_dir → processor_name/)
API-->>Client: Success
end
Client->>Client: _upload_config_files(...)
Client->>API: upload_file(builder_config.json)
API-->>Client: Success
Client->>Client: _update_metadata_paths(metadata_path)
Note over Client: Transform paths:<br/>parquet-files/ → data/<br/>processors-files/X/ → X/
Client->>API: upload_file(metadata.json)
API-->>Client: Success
Client-->>Results: HuggingFace dataset URL
Results-->>User: https://huggingface.co/datasets/username/dataset-name
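The upload order in the diagram can be approximated with a short function. This is a sketch under stated assumptions rather than the PR's `HuggingFaceHubClient`: `publish_dataset` is a hypothetical name, the `api` argument is assumed to expose the `create_repo`, `upload_folder`, and `upload_file` methods of `huggingface_hub.HfApi`, the repo ID check is a simplified stand-in for the Hub's own validator, and the dataset card and metadata path rewrite steps are left out.

```python
from pathlib import Path


def publish_dataset(api, repo_id: str, base_path: Path, private: bool = False) -> str:
    """Upload a dataset directory in roughly the order the diagram shows."""
    # Simplified format check: exactly one "/" with non-empty parts.
    if repo_id.count("/") != 1 or not all(repo_id.split("/")):
        raise ValueError(f"repo_id must look like 'username/dataset-name': {repo_id!r}")
    if not (base_path / "metadata.json").is_file():
        raise FileNotFoundError("metadata.json is missing")
    if not (base_path / "parquet-files").is_dir():
        raise FileNotFoundError("parquet-files/ directory is missing")

    # exist_ok=True makes this a no-op when the repo already exists.
    api.create_repo(repo_id=repo_id, repo_type="dataset", exist_ok=True, private=private)

    # Main dataset: parquet-files/ -> data/
    api.upload_folder(
        repo_id=repo_id,
        repo_type="dataset",
        folder_path=str(base_path / "parquet-files"),
        path_in_repo="data",
    )

    # Processor outputs: processors-files/X/ -> X/
    processors_dir = base_path / "processors-files"
    if processors_dir.is_dir():
        for processor_dir in sorted(p for p in processors_dir.iterdir() if p.is_dir()):
            api.upload_folder(
                repo_id=repo_id,
                repo_type="dataset",
                folder_path=str(processor_dir),
                path_in_repo=processor_dir.name,
            )

    # Config files are uploaded last and skipped when absent.
    for name in ("builder_config.json", "metadata.json"):
        config_path = base_path / name
        if config_path.is_file():
            api.upload_file(
                path_or_fileobj=str(config_path),
                path_in_repo=name,
                repo_id=repo_id,
                repo_type="dataset",
            )

    return f"https://huggingface.co/datasets/{repo_id}"
```

Passing the API object in as a parameter keeps the function testable with a stub; in real use one would pass an `HfApi` instance created with a token.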
…ace/dataset_card.py Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
johnnygreco
reviewed
Jan 30, 2026
davanstrien
reviewed
Feb 4, 2026
Left a few small suggestions
johnnygreco
reviewed
Feb 4, 2026
…ace/client.py Co-authored-by: Daniel van Strien <davanstrien@users.noreply.github.com>
Co-authored-by: Daniel van Strien <davanstrien@users.noreply.github.com>
…ace/client.py Co-authored-by: Daniel van Strien <davanstrien@users.noreply.github.com>
johnnygreco
approved these changes
Feb 4, 2026
Nice, thanks @nabinchha! Thank you @davanstrien and @Wauplin for your feedback – super helpful!
📋 Summary
This PR adds comprehensive HuggingFace Hub integration, enabling users to publish DataDesigner datasets directly to the HuggingFace Hub with automated dataset card generation, flexible upload options, and robust error handling.
🔄 Changes
✨ Added
- HuggingFaceHubClient - Complete client for uploading datasets to HuggingFace Hub
- DataDesignerDatasetCard - Rich dataset card generator
- push_to_hub() method on DatasetCreationResults - Simple API for publishing results
🔧 Changed
- DatasetCreationResults class to support HuggingFace publishing workflow
🔍 Attention Areas
- client.py (349 lines) - Core upload logic with path mapping and error handling
- dataset_card.py (139 lines) - Dataset card template rendering and metadata extraction
- test_client.py (569 lines) - Extensive test coverage for upload scenarios
See this dataset card as an example published directly from the create results object: https://huggingface.co/datasets/nabinnvidia/multi-lingual-greetings
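The commit notes mention explicit configs in the card's YAML frontmatter, with data as the default config, to keep the main dataset and processor outputs from being read under one schema. A rough sketch of what such frontmatter could look like is below. `build_configs_frontmatter` is a hypothetical helper, and the `*.parquet` globs assume that processor outputs are parquet files laid out as {name}/ at the repo root; the PR's actual template may differ.

```python
def build_configs_frontmatter(processor_names: list[str]) -> str:
    """Build dataset card frontmatter with one config per subset (sketch)."""
    lines = [
        "---",
        "configs:",
        "- config_name: data",
        '  data_files: "data/*.parquet"',
        "  default: true",  # the main dataset is the default config
    ]
    for name in processor_names:
        # Each processor output becomes its own config with its own schema.
        lines.append(f"- config_name: {name}")
        lines.append(f'  data_files: "{name}/*.parquet"')
    lines.append("---")
    return "\n".join(lines)
```

With separate configs, a reader could then load a single subset by passing its config name to `datasets.load_dataset`.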
Closes #7
Implementation draws inspiration from conversations in this PR: #127
🤖 Generated with AI