NVIDIA-NeMo · eric-tramel · May 8, 2026 · May 5, 2026 · May 6, 2026 · May 7, 2026
@@ -7,5 +7,6 @@
 /.github/ @NVIDIA-NeMo/data_designer_reviewers
 
 # Plugins
+/plugins/data-designer-github/ @NVIDIA-NeMo/data_designer_reviewers @eric-tramel
 /plugins/data-designer-retrieval-sdg/ @NVIDIA-NeMo/data_designer_reviewers @shan-nvidia @oliverholworthy
 /plugins/data-designer-template/ @NVIDIA-NeMo/data_designer_reviewers
@@ -1,6 +1,38 @@
 {
   "schema_version": 2,
   "packages": [
+    {
+      "name": "data-designer-github",
+      "description": "GitHub and local git repository seed reader for Data Designer",
+      "install": {
+        "requirement": "data-designer-github",
+        "index_url": "https://nvidia-nemo.github.io/DataDesignerPlugins/simple/"
+      },
+      "compatibility": {
+        "python": {
+          "specifier": ">=3.10"
+        },
+        "data_designer": {
+          "requirement": "data-designer>=0.5.7",
+          "specifier": ">=0.5.7",
+          "marker": null
+        }
+      },
+      "docs": {
+        "url": "https://nvidia-nemo.github.io/DataDesignerPlugins/plugins/data-designer-github/"
+      },
+      "plugins": [
+        {
+          "name": "github",
+          "plugin_type": "seed-reader",
+          "entry_point": {
+            "group": "data_designer.plugins",
+            "name": "github",
+            "value": "data_designer_github.plugin:plugin"
+          }
+        }
+      ]
+    },
     {
       "name": "data-designer-retrieval-sdg",
       "description": "Retriever SDG toolkit: registers the embedding-dedup column generator and document-chunker seed reader, plus a multi-step QA generation pipeline, CLI, and Automodel-compatible data conversion",

@@ -0,0 +1,79 @@
+# data-designer-github
+
+`data-designer-github` is a Data Designer seed reader for repository files. It
+turns GitHub repositories or local git repositories into seed rows that carry
+file content, path metadata, repository provenance, and commit identifiers.
+
+Use it when a workflow needs code repository data as the starting point for
+generation, review, transformation, or indexing tasks. The reader is intentionally
+file-oriented: each matching text file becomes one seed row, and downstream Data
+Designer columns decide how to summarize, critique, rewrite, label, or enrich
+that row.
+
+## Installation
+
+```bash
+uv add data-designer data-designer-github
+```
+
+The plugin is discovered through the `data_designer.plugins` entry point once it
+is installed in the same environment as Data Designer.
+
+## Seed source
+
+Use the `github` seed source when the seed dataset should come from one or more
+repositories.
+
+| Field | Required | Description |
+| --- | --- | --- |
+| `path` | No | A local git repository path, or a directory whose immediate children are git repositories. |
+| `repositories` | No | GitHub repositories to clone. Entries may be `owner/name`, `https://github.com/owner/name`, or `https://github.com/owner/name.git`. |
+| `repository_paths` | No | Additional explicit local git repository paths to read. |
+| `ref` | No | Branch, tag, or commit to check out for cloned GitHub repositories. |
+| `clone_depth` | No | Shallow clone depth for GitHub repositories. Defaults to `1`; set to `None` for a full clone. |
+| `clone_timeout_seconds` | No | Timeout for each clone or checkout operation. Defaults to `300`. |
+| `file_pattern` | No | File glob inherited from Data Designer's filesystem seed source, for example `*.py`. |
+| `recursive` | No | Whether `file_pattern` is applied recursively. |
+| `include_extensions` | No | File extensions to include after the glob match. Defaults to common code and documentation extensions. Set to `None` to allow every extension. |
+| `include_file_names` | No | Extensionless file names to include, such as `Dockerfile` and `Makefile`. |
+| `exclude_patterns` | No | Glob patterns matched against relative paths; matching files are skipped. The defaults cover `.git`, cache, build, virtualenv, and dependency directories. |
+| `max_file_size_bytes` | No | Maximum file size to hydrate into `content`. Defaults to `1_000_000`. |
+| `encoding` | No | Text encoding used when reading file contents. Defaults to `utf-8`. |
+
+At least one of `path`, `repositories`, or `repository_paths` is required.
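The accepted `repositories` forms and the "at least one source" rule can be sketched as follows. This is an illustration, not the plugin's validation code: `normalize_repository` and `require_any_source` are hypothetical helpers, and the regular expressions are assumptions about what counts as a valid entry.

```python
import re

# Assumed patterns for the three documented entry forms.
_GITHUB_URL = re.compile(
    r"^https://github\.com/(?P<owner>[^/\s]+)/(?P<name>[^/\s]+?)(?:\.git)?/?$"
)
_SHORTHAND = re.compile(r"^(?P<owner>[^/\s]+)/(?P<name>[^/\s]+)$")


def normalize_repository(entry: str) -> tuple[str, str]:
    """Return (repo_id, clone_url) for one `repositories` entry."""
    text = entry.strip()
    match = _GITHUB_URL.match(text) or _SHORTHAND.match(text)
    if match is None:
        raise ValueError(f"Unsupported repository entry: {entry!r}")
    repo_id = f"{match['owner']}/{match['name']}"
    return repo_id, f"https://github.com/{repo_id}.git"


def require_any_source(path=None, repositories=None, repository_paths=None) -> None:
    """Raise if none of the three source fields is set."""
    if not (path or repositories or repository_paths):
        raise ValueError(
            "Provide at least one of path, repositories, or repository_paths"
        )
```

All three documented spellings of the same repository resolve to the same `owner/name` identifier, which is the form the `repo_id` column uses for GitHub repositories.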
+
+## Output columns
+
+| Column | Description |
+| --- | --- |
+| `repo_id` | Repository identifier. GitHub repositories use `owner/name`; local repositories use their GitHub remote when available, otherwise the directory name. |
+| `repo_url` | Remote origin URL when available. |
+| `commit_sha` | Checked-out commit SHA for the repository. |
+| `source_kind` | `github` for cloned repositories, or `git_repository` for local repositories. |
+| `repository_path` | Local path used by the reader. GitHub repositories are cloned into a temporary runtime directory. |
+| `source_path` | Absolute path to the file that produced the seed row. |
+| `relative_path` | File path relative to the repository root. |
+| `file_name` | Basename of the file. |
+| `file_extension` | Lowercase file extension. |
+| `code_lang` | Language hint inferred from the file name or extension. |
+| `size_bytes` | File size at manifest time. |
+| `content_sha256` | SHA-256 hash of the hydrated file bytes. |
+| `content` | Decoded text content. |
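To make the file-level columns concrete, the sketch below derives several of them from a relative path and the raw file bytes. It is an approximation under stated assumptions: `file_columns` is a hypothetical helper, the extension is kept with its leading dot (the plugin may store it differently), and `size_bytes` is taken from the bytes passed in rather than from a manifest step.

```python
import hashlib
from pathlib import PurePosixPath


def file_columns(relative_path: str, raw: bytes, encoding: str = "utf-8") -> dict[str, object]:
    """Build a subset of the seed row for one file."""
    path = PurePosixPath(relative_path)
    return {
        "relative_path": path.as_posix(),
        "file_name": path.name,
        # Assumption: lowercase extension including the leading dot.
        "file_extension": path.suffix.lower(),
        "size_bytes": len(raw),
        # The hash covers the raw bytes, not the decoded text.
        "content_sha256": hashlib.sha256(raw).hexdigest(),
        "content": raw.decode(encoding),
    }
```

Hashing the bytes rather than the decoded string keeps `content_sha256` independent of the `encoding` setting.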
+
+## Behavior
+
+When the reader is attached, it resolves local repository roots, clones any
+configured GitHub repositories, records the checked-out commit, and builds a
+manifest of matching files. File content is read during row hydration, so Data
+Designer can batch and sample repository content using the same seed reader
+interfaces as other filesystem-backed datasets.
+
+The plugin reads repository files only. It does not parse code into functions,
+classes, symbols, dependency graphs, or AST nodes. If a workflow needs those
+structures, use this reader to collect stable file-level inputs and add
+downstream columns that perform the language-specific analysis.
+
+The plugin shells out to `git` for repository operations and does not manage
+GitHub API tokens. Public repositories work directly. Private repositories
+require that `git` in the execution environment is already configured with
+credentials that grant access.
@@ -0,0 +1,165 @@
+# Usage
+
+This tutorial walks through the common patterns for turning repositories into
+Data Designer seed rows. The examples use the Python builder API, but the same
+configuration fields apply when a workflow is built from serialized config.
+
+## Read a GitHub repository
+
+Start with a small repository and a narrow file pattern. This keeps previews
+fast and makes it clear which rows are entering the workflow.
+
+```python
+from data_designer.config.config_builder import DataDesignerConfigBuilder
+from data_designer.interface.data_designer import DataDesigner
+from data_designer_github.config import GitHubSeedSource
+
+builder = DataDesignerConfigBuilder()
+builder.with_seed_dataset(
+    GitHubSeedSource(
+        repositories=["pallets/markupsafe"],
+        file_pattern="*.py",
+        recursive=True,
+    )
+)
+
+builder.add_column(
+    name="_row_id",
+    column_type="sampler",
+    sampler_type="uuid",
+    params={},
+)
+
+preview = DataDesigner().preview(builder, num_records=5)
+print(preview.dataset[["repo_id", "relative_path", "code_lang", "content"]])
+```
+
+The seed rows contain repository provenance and file text. Downstream columns can
+then ask questions such as "summarize this file", "identify risky APIs", "write
+a short module description", or "extract candidate test scenarios" using the
+`content`, `relative_path`, `code_lang`, and `commit_sha` columns.
+
+## Pin a branch, tag, or commit
+
+Use `ref` when the dataset must be reproducible against a specific branch, tag,
+or commit. Branches and tags are passed to `git clone --branch`; commit SHAs are
+checked out after cloning.
+
+```python
+source = GitHubSeedSource(
+    repositories=["NVIDIA-NeMo/DataDesigner"],
+    ref="v0.5.7",
+    clone_depth=1,
+    file_pattern="*.py",
+    recursive=True,
+)
+```
+
+For arbitrary commit SHAs, set `clone_depth=None` if the commit may not be
+reachable from the shallow default clone.
+
+```python
+source = GitHubSeedSource(
+    repositories=["NVIDIA-NeMo/DataDesigner"],
+    ref="0123456789abcdef0123456789abcdef01234567",
+    clone_depth=None,
+    file_pattern="*.py",
+    recursive=True,
+)
+```
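The `ref` handling described in this section can be sketched as the list of `git` commands it implies. This is a sketch under assumptions, not the plugin's code: `build_git_commands` is a hypothetical helper, and treating any 40-character hex string as a commit SHA is an assumed heuristic.

```python
import re

# Assumption: a ref counts as a commit SHA when it is 40 hex characters.
_FULL_SHA = re.compile(r"^[0-9a-fA-F]{40}$")


def build_git_commands(clone_url, destination, ref=None, clone_depth=1) -> list[list[str]]:
    """Return the git invocations for one repository, in execution order."""
    clone = ["git", "clone"]
    if clone_depth is not None:
        clone += ["--depth", str(clone_depth)]
    is_sha = ref is not None and _FULL_SHA.match(ref) is not None
    if ref is not None and not is_sha:
        # Branches and tags can be selected at clone time.
        clone += ["--branch", ref]
    clone += [clone_url, destination]
    commands = [clone]
    if is_sha:
        # Commit SHAs are checked out after the clone completes.
        commands.append(["git", "-C", destination, "checkout", ref])
    return commands
```

Each command could then be run with `subprocess.run(cmd, check=True, timeout=...)`, where the timeout would correspond to `clone_timeout_seconds`. Using a full clone for SHAs avoids checking out a commit that a depth-1 clone never fetched.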
+
+## Read local repositories
+
+Local repositories are useful for private code, local experiments, or a
+checked-out monorepo that already exists on disk.
+
+```python
+source = GitHubSeedSource(
+    repository_paths=[
+        "/workspace/services/api",
+        "/workspace/libraries/shared",
+    ],
+    file_pattern="*.py",
+    recursive=True,
+)
+```
+
+If `path` points at a git repository, that repository is read. If `path` points
+at a directory whose immediate children are git repositories, each child
+repository is discovered and read.
+
+```python
+source = GitHubSeedSource(
+    path="/workspace/repos",
+    file_pattern="*.ts",
+    recursive=True,
+)
+```
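The two `path` cases can be sketched with `pathlib`. The helpers `is_git_repository` and `discover_repositories` are hypothetical, and detecting a repository by the presence of a `.git` entry is an assumption; the plugin may ask `git` directly instead.

```python
import tempfile
from pathlib import Path


def is_git_repository(directory: Path) -> bool:
    """Assumed check: a `.git` directory or file marks a repository root."""
    return (directory / ".git").exists()


def discover_repositories(path: Path) -> list[Path]:
    """Return `path` itself if it is a repository, else its repository children."""
    if is_git_repository(path):
        return [path]
    return sorted(
        child
        for child in path.iterdir()
        if child.is_dir() and is_git_repository(child)
    )


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        for name in ("api", "shared"):
            (root / name / ".git").mkdir(parents=True)
        (root / "notes").mkdir()  # plain directory, not a repository
        print([repo.name for repo in discover_repositories(root)])
```

Only immediate children are inspected, so repositories nested more deeply would need to be listed through `repository_paths`.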
+
+## Control which files become rows
+
+The reader first applies `file_pattern` and `recursive`, then filters by
+extension, file name, exclude pattern, and file size.
+
+```python
+source = GitHubSeedSource(
+    repositories=["NVIDIA-NeMo/DataDesigner"],
+    file_pattern="*",
+    recursive=True,
+    include_extensions=["py", "toml", "md"],
+    include_file_names=["Dockerfile", "Makefile"],
+    exclude_patterns=[
+        ".git/**",
+        "**/__pycache__/**",
+        "**/build/**",
+        "**/dist/**",
+        "docs/generated/**",
+    ],
+    max_file_size_bytes=250_000,
+)
+```
+
+Use `include_extensions=None` for broad repository inventory tasks where the
+glob and exclude patterns should decide the candidate set.
+
+```python
+source = GitHubSeedSource(
+    repositories=["owner/repo"],
+    file_pattern="LICENSE*",
+    recursive=False,
+    include_extensions=None,
+)
+```
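The filter order described in this section (extension and file name, then exclude patterns, then size) can be sketched as a single predicate applied to files that already matched `file_pattern`. This is an illustration only: `keep_file` is a hypothetical helper, extensions are compared without the leading dot, and matching uses `fnmatch.fnmatchcase`, whose `*` also crosses `/`, so results may differ from the plugin's own glob semantics.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath


def keep_file(
    relative_path: str,
    size_bytes: int,
    *,
    include_extensions,
    include_file_names,
    exclude_patterns,
    max_file_size_bytes: int,
) -> bool:
    """Decide whether a glob-matched file should become a seed row."""
    path = PurePosixPath(relative_path)
    # Assumption: include_extensions=None disables the extension/name filter.
    if include_extensions is not None:
        allowed = {ext.lower().lstrip(".") for ext in include_extensions}
        by_extension = path.suffix.lower().lstrip(".") in allowed
        by_name = path.name in include_file_names
        if not (by_extension or by_name):
            return False
    if any(fnmatchcase(relative_path, pattern) for pattern in exclude_patterns):
        return False
    return size_bytes <= max_file_size_bytes
```

Because every stage can only reject a file, the order does not change the final set; it only determines which check rejects a file first.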
+
+## Typical workflows
+
+`data-designer-github` works best as the seed layer for file-level code
+workflows:
+
+- Repository QA: score files for risky dependencies, missing license headers, or
+  stale implementation notes.
+- Documentation generation: turn source files into module summaries, migration
+  notes, or API reference drafts.
+- Test ideation: derive test scenarios from implementation files and route them
+  to a code-generation column.
+- Code search preparation: create embeddings or labels from stable file content
+  and repository metadata.
+- Dataset construction: sample representative code files from several projects
+  while preserving `repo_id`, `relative_path`, and `commit_sha` provenance.
+
+Because the reader emits full file content, prompts should account for file
+length and language. A common pattern is to filter or sample seed rows first,
+then generate focused columns that reference only the metadata and content each
+task needs.
+
+## Operational notes
+
+The plugin requires `git` on `PATH`. GitHub repositories are cloned into a
+temporary runtime directory when the reader is attached, and local repositories
+are read in place. Files that exceed `max_file_size_bytes` are skipped before
+hydration. Files that cannot be decoded with `encoding` are skipped with a
+warning rather than producing partial text.
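The skip-on-decode-failure behavior can be sketched with strict decoding and a logged warning. The helper `decode_or_skip` and the log message are hypothetical; the sketch only shows why a strict decode never yields partial text.

```python
import logging
from typing import Optional

logger = logging.getLogger(__name__)


def decode_or_skip(relative_path: str, raw: bytes, encoding: str = "utf-8") -> Optional[str]:
    """Return decoded text, or None when the bytes are not valid for `encoding`."""
    try:
        # bytes.decode defaults to errors="strict", so invalid input raises
        # instead of returning a truncated or replaced string.
        return raw.decode(encoding)
    except UnicodeDecodeError as error:
        logger.warning(
            "Skipping %s: cannot decode as %s (%s)", relative_path, encoding, error
        )
        return None
```

A caller would drop rows for which the helper returns `None`, so every emitted `content` value is a complete decode of the file.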
+
+The reader does not call the GitHub API, manage credentials, or expand GitHub
+issues and pull requests. It is scoped to repository file content so workflows
+can compose repository-aware seed data with the rest of Data Designer.
@@ -5,6 +5,17 @@
 Browse available Data Designer plugins by what they add to your data generation workflow.
 
 <div class="plugin-doc-grid">
+  <a class="plugin-doc-card" href="data-designer-github/" aria-label="Open data-designer-github documentation">
+    <span class="plugin-doc-card__header">
+      <span class="plugin-doc-card__title">data-designer-github</span>
+      <span class="plugin-doc-card__version">v0.1.0</span>
+    </span>
+    <span class="plugin-doc-card__description">GitHub and local git repository seed reader for Data Designer</span>
+    <span class="plugin-doc-card__section">
+      <span class="plugin-doc-card__label">Entry points</span>
+      <span class="plugin-doc-card__chips"><span class="plugin-doc-chip">github</span></span>
+    </span>
+  </a>
   <a class="plugin-doc-card" href="data-designer-retrieval-sdg/" aria-label="Open data-designer-retrieval-sdg documentation">
     <span class="plugin-doc-card__header">
       <span class="plugin-doc-card__title">data-designer-retrieval-sdg</span>

@@ -0,0 +1,3 @@
+# Owner(s) of this plugin — used to generate the root CODEOWNERS file.
+# GitHub accepts @username, @org/team, or email format.
+* @NVIDIA-NeMo/data_designer_reviewers @eric-tramel
@@ -0,0 +1,51 @@
+# data-designer-github
+
+GitHub and local git repository seed reader for
+[NeMo Data Designer](https://github.com/NVIDIA-NeMo/DataDesigner).
+
+## Installation
+
+```bash
+pip install data-designer-github
+```
+
+## Usage
+
+This plugin provides a `github` seed source. Once installed, the seed reader is
+automatically discovered by Data Designer.
+
+```python
+from data_designer.config.config_builder import DataDesignerConfigBuilder
+from data_designer.interface.data_designer import DataDesigner
+from data_designer_github.config import GitHubSeedSource
+
+builder = DataDesignerConfigBuilder()
+builder.with_seed_dataset(
+    GitHubSeedSource(
+        repositories=["NVIDIA-NeMo/DataDesigner"],
+        file_pattern="*.py",
+        recursive=True,
+    )
+)
+
+preview = DataDesigner().preview(builder, num_records=5)
+print(preview.dataset[["repo_id", "relative_path", "code_lang", "content"]])
+```
+
+The reader can also scan local git repositories:
+
+```python
+builder.with_seed_dataset(
+    GitHubSeedSource(
+        path="/path/to/repos",
+        repository_paths=["/path/to/one/repo"],
+        file_pattern="*.py",
+    )
+)
+```
+
+Seed columns include repository metadata, file paths, language hints, file
+content, and content SHA-256 hashes.
+
+For the full plugin authoring guide, see the
+[main repository docs](https://github.com/NVIDIA-NeMo/DataDesignerPlugins/blob/main/docs/adding-a-plugin.md).