feat: add Claude Code skill for Data Designer library #309

johnnygreco wants to merge 3 commits into main
Conversation
Greptile Overview

Greptile Summary
| Filename | Overview |
|---|---|
| packages/data-designer-config/src/data_designer/config/__init__.py | Exports ConstraintType and InequalityOperator via TYPE_CHECKING and lazy imports; small API-surface fix. |
| skill/README.md | Adds setup instructions for installing and using the Claude Code skill; includes download commands for the skill folder. |
| skill/data-designer/SKILL.md | Introduces core skill definition and workflow docs; currently instructs capturing 'chain-of-thought' via extract_reasoning_content. |
| skill/data-designer/hooks/check_data_designer.sh | Adds session-start hook to detect data_designer installation and print version/path context. |
| skill/data-designer/references/advanced_patterns.md | Adds advanced usage guide; currently recommends storing chain-of-thought via extract_reasoning_content. |
| skill/data-designer/references/api_reference.md | Adds large API reference; includes wording that extract_reasoning_content captures 'chain-of-thought'. |
| skill/data-designer/scripts/helpers/pydantic_info_utils.py | Adds shared helper utilities for printing Pydantic model schemas/types, including nested model and enum expansion. |
| skill/test_info_scripts.py | Adds comprehensive unit/integration tests for the introspection scripts using subprocess uv runs. |
Sequence Diagram
    sequenceDiagram
        participant User
        participant ClaudeCode as Claude Code
        participant HookStart as check_data_designer.sh
        participant InfoScripts as get_*_info.py
        participant DD as data_designer (DataDesigner)
        participant MCP as MCP Provider/Tools
        participant HooksPost as ruff_lint.sh & ty_check.sh
        User->>ClaudeCode: Activate data-designer skill
        ClaudeCode->>HookStart: SessionStart hook (startup)
        HookStart-->>ClaudeCode: Prints install/version/library path
        User->>ClaudeCode: Describe dataset + requirements
        ClaudeCode->>InfoScripts: (Optional) uv run get_column/sampler/processor/validator info
        InfoScripts-->>ClaudeCode: Prints schema/field references
        ClaudeCode->>DD: Build DataDesignerConfigBuilder + add columns
        alt Tool-augmented generation
            ClaudeCode->>DD: Configure ToolConfig(tool_alias, providers, allow_tools)
            DD->>MCP: Call tool(s) via MCP during LLM column execution
            MCP-->>DD: Tool results
        end
        ClaudeCode->>DD: validate(config_builder)
        ClaudeCode->>DD: preview(config_builder, num_records=...)
        ClaudeCode->>DD: create(config_builder, num_records=..., dataset_name=...)
        ClaudeCode->>HooksPost: PostToolUse hook on Write/Edit
        HooksPost-->>ClaudeCode: Lint/type-check edited file
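For orientation, the flow above maps roughly to the following Python. This is a minimal sketch: the builder, column, and validate/preview/create names come from the diagram and from the API excerpt quoted later in this review, while the import paths, the sampler column, and the exact signatures are assumptions rather than confirmed library surface.

```python
# Sketch only: names follow the sequence diagram and the skill's API reference excerpt;
# import paths and exact signatures are assumptions, not verified against the library.
import data_designer.config as dd        # assumed import alias ("dd") used in the skill docs
from data_designer import DataDesigner   # assumed entry point

builder = dd.DataDesignerConfigBuilder()

# Hypothetical sampler column, shown only so the Jinja2 reference below resolves.
builder.add_column(dd.SamplerColumnConfig(name="topic", values=["GPUs", "networking", "storage"]))

# LLM text column: field names match the api_reference.md excerpt below.
builder.add_column(
    dd.LLMTextColumnConfig(
        name="answer",
        prompt="Write a short, factual answer about {{ topic }}.",  # Jinja2 template
        model_alias="nvidia-reasoning",
    )
)

designer = DataDesigner()
designer.validate(builder)                          # catch config errors before generating
preview = designer.preview(builder, num_records=5)  # iterate on a small sample first
designer.create(builder, num_records=500, dataset_name="answers_v1")
```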
From skill/data-designer/SKILL.md (lines 247-259):

### Trace & Reasoning Capture

```python
dd.LLMTextColumnConfig(
    name="answer",
    prompt="...",
    model_alias="nvidia-reasoning",
    with_trace=dd.TraceType.ALL_MESSAGES,   # -> answer__trace column
    extract_reasoning_content=True,         # -> answer__reasoning_content column
)
```
**Encourages chain-of-thought capture**

The skill docs recommend `extract_reasoning_content=True` and describe it as "capture chain-of-thought". Persisting model reasoning content is typically disallowed or undesired (privacy/safety and provider ToS) and can also break runs if the underlying provider/model doesn't return a separate reasoning field. Consider removing this recommendation (or reframing it to capture tool traces only) and documenting safe alternatives (e.g., `with_trace=TraceType.ALL_MESSAGES` without extracting reasoning).

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
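For concreteness, the safer alternative the reviewer points to could look like the sketch below: the fields are the ones from the excerpt above, with reasoning extraction left at its default, and `dd` is the import alias assumed in the earlier workflow sketch.

```python
# Sketch of the suggested alternative: keep the message/tool-call trace for debugging,
# but do not persist a separate reasoning column.
dd.LLMTextColumnConfig(
    name="answer",
    prompt="...",
    model_alias="nvidia-reasoning",
    with_trace=dd.TraceType.ALL_MESSAGES,  # full conversation trace -> answer__trace
    # extract_reasoning_content left at its default (False): no answer__reasoning_content column
)
```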
From skill/data-designer/references/advanced_patterns.md (lines 372-390):

## 8. Trace & Reasoning Extraction
### Trace Types

- `TraceType.NONE` (default): No trace
- `TraceType.LAST_MESSAGE`: Only final response -> `{name}__trace`
- `TraceType.ALL_MESSAGES`: Full conversation -> `{name}__trace`

### Reasoning Content

`extract_reasoning_content=True` creates `{name}__reasoning_content` with chain-of-thought.

Available on all LLM column types.

### Use Cases

- **Debugging**: `ALL_MESSAGES` to see full conversation including tool calls
- **Fine-tuning data**: `extract_reasoning_content=True` for clean reasoning extraction
- **Tool-use training**: `ALL_MESSAGES` to capture tool call patterns
**Avoid persisting model reasoning**

This section documents `extract_reasoning_content=True` as producing "chain-of-thought" and suggests using it for fine-tuning. Persisting model reasoning content is generally not appropriate (privacy/safety and provider policies), and it may not be supported consistently across providers/models. Recommend removing or rewriting this guidance to focus on safer debugging artifacts (e.g., tool traces / final outputs) rather than storing reasoning content.
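If this guidance is rewritten toward safer debugging artifacts, a pattern along these lines could be shown instead. Sketch only: `designer` and `builder` are from the earlier workflow sketch, the `to_pandas()` accessor is hypothetical, and the `{name}__trace` column naming is taken from the excerpt above.

```python
# Sketch: inspect final outputs and traces from a small preview instead of storing reasoning.
preview_df = designer.preview(builder, num_records=3).to_pandas()  # hypothetical tabular accessor

print(preview_df["answer"].iloc[0])         # final response: the artifact worth keeping
print(preview_df["answer__trace"].iloc[0])  # conversation/tool-call trace for debugging
# No "answer__reasoning_content" column is requested, so no reasoning content is persisted.
```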
From skill/data-designer/references/api_reference.md (lines 103-113):

```python
dd.LLMTextColumnConfig(
    name: str,
    prompt: str,                                      # Jinja2 template
    model_alias: str,
    system_prompt: str | None = None,
    tool_alias: str | None = None,                    # MCP tool reference
    with_trace: dd.TraceType = TraceType.NONE,        # NONE, LAST_MESSAGE, ALL_MESSAGES
    extract_reasoning_content: bool = False,          # -> {name}__reasoning_content
    multi_modal_context: list[dd.ImageContext] | None = None,
    drop: bool = False,
```
**Reasoning field documentation**

The API reference describes `extract_reasoning_content` as "capture chain-of-thought". If the library intentionally supports extracting a separate reasoning field, this should be documented more carefully (what exactly is captured, which providers support it, and whether it's safe/allowed to store). Otherwise, consider removing or rewording to avoid encouraging collection of chain-of-thought.
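If the field stays in the reference, one way to pair it with clearer guardrails is an explicit opt-in gate, sketched below. The helper and the allow-list are hypothetical; only the `LLMTextColumnConfig` fields come from the excerpt above.

```python
# Sketch: treat reasoning extraction as a gated opt-in rather than a default recommendation.
REASONING_FIELD_SUPPORTED = {"nvidia-reasoning"}  # hypothetical: models known to return a reasoning field

def make_text_column(name: str, prompt: str, model_alias: str) -> "dd.LLMTextColumnConfig":
    return dd.LLMTextColumnConfig(
        name=name,
        prompt=prompt,
        model_alias=model_alias,
        with_trace=dd.TraceType.LAST_MESSAGE,  # final response only; enough for most debugging
        # Request {name}__reasoning_content only for models that expose a separate reasoning
        # field, and only where storing it is allowed by policy and provider terms.
        extract_reasoning_content=model_alias in REASONING_FIELD_SUPPORTED,
    )
```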
From packages/data-designer-config/src/data_designer/config/__init__.py:

    ConstraintType,
    InequalityOperator,
ugh, these leaked over from a different PR 🤦♂️
📋 Summary
Adds a comprehensive Claude Code skill that teaches Claude how to generate synthetic datasets using NVIDIA NeMo Data Designer. When activated, Claude can design and build complete data generation pipelines — choosing the right column types, writing prompts, wiring up dependencies, and iterating on previews — all from a natural language description of the dataset.
🔄 Changes
✨ Added
- skill/data-designer/SKILL.md — Core skill definition with step-by-step workflow, column type decision tree, key patterns (Jinja2 templating, seed datasets, constraints, processors, validators), and best practices
- skill/data-designer/examples/ — 5 runnable pattern-reference scripts:
  - basic_text_generation.py — Sampler + expression + LLM text columns
  - structured_and_code.py — Pydantic structured output + code generation
  - seed_dataset_with_judge.py — Seed data + LLM judge scoring
  - custom_column_with_llm.py — @custom_column_generator with multi-model orchestration
  - mcp_tool_use.py — MCP tool calling with trace capture
- skill/data-designer/references/ — Complete API reference (api_reference.md) and advanced patterns guide (advanced_patterns.md)
- skill/data-designer/scripts/ — Discovery tools for API introspection (get_column_info.py, get_sampler_info.py, get_processor_info.py, get_validator_info.py, and shared pydantic_info_utils.py); a sample invocation sketch follows this list
- skill/data-designer/hooks/ — Session hooks for environment checks (check_data_designer.sh), ruff linting (ruff_lint.sh), and ty type checking (ty_check.sh)
- skill/README.md — Setup instructions and quick start guide
- skill/test_info_scripts.py — 900-line test suite for the discovery scripts
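The discovery scripts are exercised by skill/test_info_scripts.py through subprocess `uv run` calls; a minimal invocation along those lines might look like the sketch below (the script path is from the list above, while the lack of arguments and the output format are assumptions).

```python
# Sketch mirroring how the tests drive a discovery script: run it with uv and read the
# printed schema/field reference. Arguments and output format are assumptions.
import subprocess

result = subprocess.run(
    ["uv", "run", "skill/data-designer/scripts/get_column_info.py"],
    capture_output=True,
    text=True,
    check=True,
)
print(result.stdout)  # Pydantic schema / field reference for the column config types
```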
🔧 Changed

- AGENTS.md — Added EmbeddingColumnConfig and CustomColumnConfig to the column configuration
- packages/data-designer-config/src/data_designer/config/__init__.py — Added missing ConstraintType and InequalityOperator exports to lazy imports
🔍 Attention Areas

- skill/data-designer/SKILL.md — Core skill definition that drives Claude's behavior; verify the workflow, decision tree, and patterns are accurate and complete
- skill/data-designer/references/api_reference.md — Complete API documentation (557 lines); ensure accuracy against current codebase
- packages/data-designer-config/src/data_designer/config/__init__.py — New exports (ConstraintType, InequalityOperator) that affect the user-facing API surface

🤖 Generated with AI