1 change: 1 addition & 0 deletions .github/CODEOWNERS
@@ -7,5 +7,6 @@
/.github/ @NVIDIA-NeMo/data_designer_reviewers

# Plugins
/plugins/data-designer-github/ @NVIDIA-NeMo/data_designer_reviewers @eric-tramel
/plugins/data-designer-retrieval-sdg/ @NVIDIA-NeMo/data_designer_reviewers @shan-nvidia @oliverholworthy
/plugins/data-designer-template/ @NVIDIA-NeMo/data_designer_reviewers
32 changes: 32 additions & 0 deletions catalog/plugins.json
@@ -1,6 +1,38 @@
{
  "schema_version": 2,
  "packages": [
    {
      "name": "data-designer-github",
      "description": "GitHub and local git repository seed reader for Data Designer",
      "install": {
        "requirement": "data-designer-github",
        "index_url": "https://nvidia-nemo.github.io/DataDesignerPlugins/simple/"
      },
      "compatibility": {
        "python": {
          "specifier": ">=3.10"
        },
        "data_designer": {
          "requirement": "data-designer>=0.5.7",
          "specifier": ">=0.5.7",
          "marker": null
        }
      },
      "docs": {
        "url": "https://nvidia-nemo.github.io/DataDesignerPlugins/plugins/data-designer-github/"
      },
      "plugins": [
        {
          "name": "github",
          "plugin_type": "seed-reader",
          "entry_point": {
            "group": "data_designer.plugins",
            "name": "github",
            "value": "data_designer_github.plugin:plugin"
          }
        }
      ]
    },
    {
      "name": "data-designer-retrieval-sdg",
      "description": "Retriever SDG toolkit: registers the embedding-dedup column generator and document-chunker seed reader, plus a multi-step QA generation pipeline, CLI, and Automodel-compatible data conversion",
79 changes: 79 additions & 0 deletions docs/plugins/data-designer-github/index.md
@@ -0,0 +1,79 @@
# data-designer-github

`data-designer-github` is a Data Designer seed reader for repository files. It
turns GitHub repositories or local git repositories into seed rows that carry
file content, path metadata, repository provenance, and commit identifiers.

Use it when a workflow needs code repository data as the starting point for
generation, review, transformation, or indexing tasks. The reader is intentionally
file-oriented: each matching text file becomes one seed row, and downstream Data
Designer columns decide how to summarize, critique, rewrite, label, or enrich
that row.

## Installation

```bash
uv add data-designer data-designer-github
```

The plugin is discovered through the `data_designer.plugins` entry point once it
is installed in the same environment as Data Designer.

## Seed source

Use the `github` seed source when the seed dataset should come from one or more
repositories.

| Field | Required | Description |
| --- | --- | --- |
| `path` | No | A local git repository path, or a directory whose immediate children are git repositories. |
| `repositories` | No | GitHub repositories to clone. Entries may be `owner/name`, `https://github.com/owner/name`, or `https://github.com/owner/name.git`. |
| `repository_paths` | No | Additional explicit local git repository paths to read. |
| `ref` | No | Branch, tag, or commit to check out for cloned GitHub repositories. |
| `clone_depth` | No | Shallow clone depth for GitHub repositories. Defaults to `1`; set to `None` for a full clone. |
| `clone_timeout_seconds` | No | Timeout for each clone or checkout operation. Defaults to `300`. |
| `file_pattern` | No | Inherited file glob from Data Designer's filesystem seed source. For example, `*.py`. |
| `recursive` | No | Whether `file_pattern` is applied recursively. |
| `include_extensions` | No | File extensions to include after the glob match. Defaults to common code and documentation extensions. Set to `None` to allow every extension. |
| `include_file_names` | No | Extensionless file names to include, such as `Dockerfile` and `Makefile`. |
| `exclude_patterns` | No | Relative path glob patterns to skip, including `.git`, cache, build, virtualenv, and dependency directories by default. |
| `max_file_size_bytes` | No | Maximum file size to hydrate into `content`. Defaults to `1_000_000`. |
| `encoding` | No | Text encoding used when reading file contents. Defaults to `utf-8`. |

At least one of `path`, `repositories`, or `repository_paths` is required.
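
The accepted `repositories` formats and the "at least one source" rule can be illustrated with a small standard-library sketch. The helper names `normalize_repository` and `has_repository_source` are hypothetical, and the regular expression is an assumption about what a reasonable parser would accept; the plugin's real validation may differ.

```python
import re

# Assumed shapes: "owner/name", "https://github.com/owner/name", and the ".git" variant.
_REPO_PATTERN = re.compile(
    r"^(?:https://github\.com/)?(?P<owner>[A-Za-z0-9_.-]+)/(?P<name>[A-Za-z0-9_.-]+?)(?:\.git)?/?$"
)


def normalize_repository(entry: str) -> tuple[str, str]:
    """Return (repo_id, clone_url) for one `repositories` entry."""
    match = _REPO_PATTERN.match(entry.strip())
    if match is None:
        raise ValueError(f"Unrecognized repository entry: {entry!r}")
    repo_id = f"{match['owner']}/{match['name']}"
    return repo_id, f"https://github.com/{repo_id}.git"


def has_repository_source(path=None, repositories=None, repository_paths=None) -> bool:
    """True when at least one of the three source fields is set and non-empty."""
    return bool(path or repositories or repository_paths)
```

All three spellings of the same repository normalize to one `owner/name` identifier, which matches the form the `repo_id` output column uses for GitHub repositories.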

## Output columns

| Column | Description |
| --- | --- |
| `repo_id` | Repository identifier. GitHub repositories use `owner/name`; local repositories use their GitHub remote when available, otherwise the directory name. |
| `repo_url` | Remote origin URL when available. |
| `commit_sha` | Checked-out commit SHA for the repository. |
| `source_kind` | `github` for cloned repositories, or `git_repository` for local repositories. |
| `repository_path` | Local path used by the reader. GitHub repositories are cloned into a temporary runtime directory. |
| `source_path` | Absolute path to the file that produced the seed row. |
| `relative_path` | File path relative to the repository root. |
| `file_name` | Basename of the file. |
| `file_extension` | Lowercase file extension. |
| `code_lang` | Language hint inferred from the file name or extension. |
| `size_bytes` | File size at manifest time. |
| `content_sha256` | SHA-256 hash of the hydrated file bytes. |
| `content` | Decoded text content. |
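
One use for `content_sha256` is dropping rows whose file bytes are identical, for example vendored copies of the same file across repositories. The sketch below works on plain dictionaries rather than the reader's actual row type, and it assumes the column holds a lowercase hex digest; both helper names are hypothetical.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Hex digest in the form `content_sha256` is assumed to use."""
    return hashlib.sha256(data).hexdigest()


def drop_duplicate_content(rows: list[dict]) -> list[dict]:
    """Keep the first row for each distinct `content_sha256` value."""
    seen: set[str] = set()
    unique = []
    for row in rows:
        digest = row["content_sha256"]
        if digest not in seen:
            seen.add(digest)
            unique.append(row)
    return unique
```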

## Behavior

When the reader is attached, it resolves local repository roots, clones any
configured GitHub repositories, records the checked-out commit, and builds a
manifest of matching files. File content is read during row hydration, so Data
Designer can batch and sample repository content using the same seed reader
interfaces as other filesystem-backed datasets.
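
The split between manifest building and row hydration can be pictured with a two-phase sketch. This is not the plugin's implementation: `ManifestEntry`, `build_manifest`, and `hydrate` are hypothetical names, and the sketch only shows why deferring reads keeps the manifest cheap.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class ManifestEntry:
    relative_path: str
    size_bytes: int


def build_manifest(repo_root: Path, pattern: str = "*.py") -> list[ManifestEntry]:
    """Phase 1: record matching files without reading their contents."""
    return [
        ManifestEntry(path.relative_to(repo_root).as_posix(), path.stat().st_size)
        for path in sorted(repo_root.rglob(pattern))
        if path.is_file()
    ]


def hydrate(repo_root: Path, entry: ManifestEntry, encoding: str = "utf-8") -> dict:
    """Phase 2: read one file only when its row is actually requested."""
    content = (repo_root / entry.relative_path).read_text(encoding=encoding)
    return {
        "relative_path": entry.relative_path,
        "size_bytes": entry.size_bytes,
        "content": content,
    }
```

Sampling or batching can then operate on the lightweight manifest and pay the read cost only for the rows that are selected.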

The plugin reads repository files only. It does not parse code into functions,
classes, symbols, dependency graphs, or AST nodes. If a workflow needs those
structures, use this reader to collect stable file-level inputs and add
downstream columns that perform the language-specific analysis.

The plugin shells out to `git` for repository operations and does not manage
GitHub API tokens. Public repositories work directly. Private repositories
require the execution environment's git credential configuration to already have
access.
165 changes: 165 additions & 0 deletions docs/plugins/data-designer-github/usage.md
@@ -0,0 +1,165 @@
# Usage

This tutorial walks through the common patterns for turning repositories into
Data Designer seed rows. The examples use the Python builder API, but the same
configuration fields apply when a workflow is built from serialized config.

## Read a GitHub repository

Start with a small repository and a narrow file pattern. This keeps previews
fast and makes it clear which rows are entering the workflow.

```python
from data_designer.config.config_builder import DataDesignerConfigBuilder
from data_designer.interface.data_designer import DataDesigner
from data_designer_github.config import GitHubSeedSource

builder = DataDesignerConfigBuilder()
builder.with_seed_dataset(
    GitHubSeedSource(
        repositories=["pallets/markupsafe"],
        file_pattern="*.py",
        recursive=True,
    )
)

builder.add_column(
    name="_row_id",
    column_type="sampler",
    sampler_type="uuid",
    params={},
)

preview = DataDesigner().preview(builder, num_records=5)
print(preview.dataset[["repo_id", "relative_path", "code_lang", "content"]])
```

The seed rows contain repository provenance and file text. Downstream columns can
then ask questions such as "summarize this file", "identify risky APIs", "write
a short module description", or "extract candidate test scenarios" using the
`content`, `relative_path`, `code_lang`, and `commit_sha` columns.

## Pin a branch, tag, or commit

Use `ref` when the dataset must be reproducible against a specific branch, tag,
or commit. Branches and tags are passed to `git clone --branch`; commit SHAs are
checked out after cloning.

```python
source = GitHubSeedSource(
    repositories=["NVIDIA-NeMo/DataDesigner"],
    ref="v0.5.7",
    clone_depth=1,
    file_pattern="*.py",
    recursive=True,
)
```

For arbitrary commit SHAs, set `clone_depth=None` if the commit may not be
reachable from the shallow default clone.

```python
source = GitHubSeedSource(
    repositories=["NVIDIA-NeMo/DataDesigner"],
    ref="0123456789abcdef0123456789abcdef01234567",
    clone_depth=None,
    file_pattern="*.py",
    recursive=True,
)
```
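
The difference between pinning a branch or tag and pinning a commit can be sketched as the git invocations each case implies. The function below only builds argument lists and never runs git. `plan_git_commands` is a hypothetical helper, and treating any 40-character hex string as a commit SHA is an assumption; the plugin may distinguish refs differently.

```python
import re
from typing import Optional

# Assumption: a full 40-hex string is a commit SHA; anything else is a branch or tag.
_FULL_SHA = re.compile(r"^[0-9a-fA-F]{40}$")


def plan_git_commands(
    url: str,
    dest: str,
    ref: Optional[str] = None,
    clone_depth: Optional[int] = 1,
) -> list[list[str]]:
    """Return the git invocations for one repository, without executing them."""
    clone = ["git", "clone"]
    if clone_depth is not None:
        clone += ["--depth", str(clone_depth)]
    is_sha = ref is not None and _FULL_SHA.match(ref) is not None
    if ref is not None and not is_sha:
        # Branches and tags can be selected at clone time.
        clone += ["--branch", ref]
    clone += [url, dest]
    commands = [clone]
    if is_sha:
        # Commits are checked out after the clone completes.
        commands.append(["git", "-C", dest, "checkout", ref])
    return commands
```

With a shallow clone, the follow-up checkout can only succeed if the commit is part of the fetched history, which is why a full clone is the safer choice for arbitrary SHAs.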

## Read local repositories

Local repositories are useful for private code, local experiments, or a
checked-out monorepo that already exists on disk.

```python
source = GitHubSeedSource(
    repository_paths=[
        "/workspace/services/api",
        "/workspace/libraries/shared",
    ],
    file_pattern="*.py",
    recursive=True,
)
```

If `path` points at a git repository, that repository is read. If `path` points
at a directory whose immediate children are git repositories, each child
repository is discovered and read.

```python
source = GitHubSeedSource(
    path="/workspace/repos",
    file_pattern="*.ts",
    recursive=True,
)
```
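
The two interpretations of `path` can be sketched with a simple discovery routine. This is an illustration under assumptions, not the plugin's code: `discover_repositories` is a hypothetical helper, and checking for a `.git` entry is one common way to recognize a repository root.

```python
from pathlib import Path


def is_git_repository(path: Path) -> bool:
    """Treat a directory as a repository if it contains a `.git` entry."""
    # `.git` is a directory in ordinary clones and a file in worktrees or submodules.
    return (path / ".git").exists()


def discover_repositories(path: Path) -> list[Path]:
    """Return `path` itself if it is a repository, else its immediate child repositories."""
    if is_git_repository(path):
        return [path]
    return sorted(
        child for child in path.iterdir() if child.is_dir() and is_git_repository(child)
    )
```

Only immediate children are inspected, so repositories nested more deeply would need to be listed explicitly through `repository_paths`.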

## Control which files become rows

The reader first applies `file_pattern` and `recursive`, then filters by
extension, file name, exclude pattern, and file size.

```python
source = GitHubSeedSource(
    repositories=["NVIDIA-NeMo/DataDesigner"],
    file_pattern="*",
    recursive=True,
    include_extensions=["py", "toml", "md"],
    include_file_names=["Dockerfile", "Makefile"],
    exclude_patterns=[
        ".git/**",
        "**/__pycache__/**",
        "**/build/**",
        "**/dist/**",
        "docs/generated/**",
    ],
    max_file_size_bytes=250_000,
)
```

Use `include_extensions=None` for broad repository inventory tasks where the
glob and exclude patterns should decide the candidate set.

```python
source = GitHubSeedSource(
    repositories=["owner/repo"],
    file_pattern="LICENSE*",
    recursive=False,
    include_extensions=None,
)
```
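
The filter order described in this section (glob first, then extension, file name, exclude pattern, and size) can be sketched with the standard library. `select_files` is a hypothetical helper, not the plugin's API. It assumes that a file passes the allow-list when either its extension or its exact name is listed, it defaults `include_extensions` to `None` unlike the plugin, and it uses `fnmatch`, whose matching rules may differ from the plugin's glob handling.

```python
from fnmatch import fnmatch
from pathlib import Path
from typing import Optional


def select_files(
    repo_root: Path,
    file_pattern: str = "*",
    recursive: bool = True,
    include_extensions: Optional[list[str]] = None,
    include_file_names: tuple[str, ...] = (),
    exclude_patterns: tuple[str, ...] = (),
    max_file_size_bytes: int = 1_000_000,
) -> list[str]:
    """Return relative paths that survive each filter, applied in the documented order."""
    candidates = repo_root.rglob(file_pattern) if recursive else repo_root.glob(file_pattern)
    selected = []
    for path in sorted(candidates):
        if not path.is_file():
            continue
        relative = path.relative_to(repo_root).as_posix()
        # Allow-list step; None disables the extension check entirely.
        if include_extensions is not None:
            extension = path.suffix.lower().lstrip(".")
            if extension not in include_extensions and path.name not in include_file_names:
                continue
        # fnmatch lets "*" cross "/", so "build/**" covers everything under build/.
        # A pattern such as "**/build/**" needs at least one parent directory here.
        if any(fnmatch(relative, pattern) for pattern in exclude_patterns):
            continue
        if path.stat().st_size > max_file_size_bytes:
            continue
        selected.append(relative)
    return selected
```

Running the cheap name-based checks before `stat()` keeps the size check limited to files that are otherwise eligible.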

## Typical workflows

`data-designer-github` works best as the seed layer for file-level code
workflows:

- Repository QA: score files for risky dependencies, missing license headers, or
stale implementation notes.
- Documentation generation: turn source files into module summaries, migration
notes, or API reference drafts.
- Test ideation: derive test scenarios from implementation files and route them
to a code-generation column.
- Code search preparation: create embeddings or labels from stable file content
and repository metadata.
- Dataset construction: sample representative code files from several projects
while preserving `repo_id`, `relative_path`, and `commit_sha` provenance.

Because the reader emits full file content, prompts should account for file
length and language. A common pattern is to filter or sample seed rows first,
then generate focused columns that reference only the metadata and content each
task needs.
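
A pre-filter along these lines might look like the sketch below. It operates on plain dictionaries with the documented `content` and `code_lang` keys rather than on the reader's actual row type; `within_budget` is a hypothetical helper, and the `"python"` language label in the usage is an assumed value.

```python
def within_budget(rows: list[dict], max_chars: int, languages: set[str]) -> list[dict]:
    """Keep rows whose content fits the prompt budget and whose language is wanted."""
    return [
        row
        for row in rows
        if len(row["content"]) <= max_chars and row["code_lang"] in languages
    ]


rows = [
    {"relative_path": "small.py", "code_lang": "python", "content": "x = 1\n"},
    {"relative_path": "huge.py", "code_lang": "python", "content": "#" * 10_000},
    {"relative_path": "notes.md", "code_lang": "markdown", "content": "# Notes\n"},
]
kept = within_budget(rows, max_chars=2_000, languages={"python"})
```

Character count is only a rough proxy for prompt length; a tokenizer-based budget would be more precise where the target model is known.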

## Operational notes

The plugin requires `git` on `PATH`. GitHub repositories are cloned into a
temporary runtime directory for the reader attachment, and local repositories
are read in place. Files that exceed `max_file_size_bytes` are skipped before
hydration. Files that cannot be decoded with `encoding` are skipped with a
warning rather than producing partial text.
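
The skip-on-decode-failure behavior can be sketched as strict decoding with a logged warning. `read_text_or_skip` is a hypothetical helper and the log message is illustrative; the plugin's actual warning text and mechanism are not shown on this page.

```python
import logging
from pathlib import Path
from typing import Optional

logger = logging.getLogger(__name__)


def read_text_or_skip(path: Path, encoding: str = "utf-8") -> Optional[str]:
    """Return decoded text, or None when the bytes are not valid in `encoding`."""
    try:
        # Strict decoding: no errors="ignore" or "replace", so text is never partial.
        return path.read_bytes().decode(encoding)
    except UnicodeDecodeError:
        logger.warning("Skipping %s: cannot decode as %s", path, encoding)
        return None
```

Returning `None` instead of lossy text keeps downstream hashes and prompts tied to the exact file bytes.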

The reader does not call the GitHub API, manage credentials, or expand GitHub
issues and pull requests. It is scoped to repository file content so workflows
can compose repository-aware seed data with the rest of Data Designer.
11 changes: 11 additions & 0 deletions docs/plugins/index.md
@@ -5,6 +5,17 @@
Browse available Data Designer plugins by what they add to your data generation workflow.

<div class="plugin-doc-grid">
<a class="plugin-doc-card" href="data-designer-github/" aria-label="Open data-designer-github documentation">
<span class="plugin-doc-card__header">
<span class="plugin-doc-card__title">data-designer-github</span>
<span class="plugin-doc-card__version">v0.1.0</span>
</span>
<span class="plugin-doc-card__description">GitHub and local git repository seed reader for Data Designer</span>
<span class="plugin-doc-card__section">
<span class="plugin-doc-card__label">Entry points</span>
<span class="plugin-doc-card__chips"><span class="plugin-doc-chip">github</span></span>
</span>
</a>
<a class="plugin-doc-card" href="data-designer-retrieval-sdg/" aria-label="Open data-designer-retrieval-sdg documentation">
<span class="plugin-doc-card__header">
<span class="plugin-doc-card__title">data-designer-retrieval-sdg</span>
3 changes: 3 additions & 0 deletions plugins/data-designer-github/CODEOWNERS
@@ -0,0 +1,3 @@
# Owner(s) of this plugin — used to generate the root CODEOWNERS file.
# GitHub accepts @username, @org/team, or email format.
* @NVIDIA-NeMo/data_designer_reviewers @eric-tramel
51 changes: 51 additions & 0 deletions plugins/data-designer-github/README.md
@@ -0,0 +1,51 @@
# data-designer-github

GitHub and local git repository seed reader for
[NeMo Data Designer](https://github.com/NVIDIA-NeMo/DataDesigner).

## Installation

```bash
pip install data-designer-github
```

## Usage

This plugin provides a `github` seed source. Once installed, the seed reader is
automatically discovered by Data Designer.

```python
from data_designer.config.config_builder import DataDesignerConfigBuilder
from data_designer.interface.data_designer import DataDesigner
from data_designer_github.config import GitHubSeedSource

builder = DataDesignerConfigBuilder()
builder.with_seed_dataset(
    GitHubSeedSource(
        repositories=["NVIDIA-NeMo/DataDesigner"],
        file_pattern="*.py",
        recursive=True,
    )
)

preview = DataDesigner().preview(builder, num_records=5)
print(preview.dataset[["repo_id", "relative_path", "code_lang", "content"]])
```

The reader can also scan local git repositories:

```python
builder.with_seed_dataset(
    GitHubSeedSource(
        path="/path/to/repos",
        repository_paths=["/path/to/one/repo"],
        file_pattern="*.py",
    )
)
```

Seed columns include repository metadata, file paths, language hints, file
content, and content SHA-256 hashes.

For the full plugin authoring guide, see the
[main repository docs](https://github.com/NVIDIA-NeMo/DataDesignerPlugins/blob/main/docs/adding-a-plugin.md).