76 changes: 74 additions & 2 deletions docs/concepts/seed-datasets.md
@@ -54,7 +54,7 @@ Every column in your seed dataset becomes available as a Jinja2 variable in prom

## Seed Sources

Data Designer supports three ways to provide seed data:
Data Designer supports five ways to provide seed data:

P1 Source count is off by one

The PR increments the count from three to five, but AgentRolloutSeedSource is also a shipped seed source — it has its own recipe at docs/recipes/trace_ingestion/agent_rollout_distillation.md and is featured on docs/recipes/cards.md. That makes six sources in total, so the sentence is factually wrong.

Either update the count to "six" and add a brief entry for AgentRolloutSeedSource, or rephrase to avoid embedding a hard count (e.g., "Data Designer supports multiple ways to provide seed data:").

Suggested change
Data Designer supports five ways to provide seed data:
Data Designer supports six ways to provide seed data:


### 📁 LocalFileSeedSource

@@ -100,6 +100,78 @@ seed_source = dd.DataFrameSeedSource(df=df)
!!! warning "Serialization"
    `DataFrameSeedSource` can't be serialized to YAML/JSON configs. Use `LocalFileSeedSource` if you need to save and share configurations.

### 🗂️ DirectorySeedSource

Treat a directory tree as the seed dataset. Each matching file becomes one seed row, exposing file metadata you can reference in prompts and expressions.

```python
seed_source = dd.DirectorySeedSource(
    path="docs/",
    file_pattern="*.md",
    recursive=True,
)

config_builder.with_seed_dataset(seed_source)
config_builder.add_column(
    dd.ExpressionColumnConfig(
        name="doc_label",
        expr="{{ source_kind }}::{{ relative_path }}",
    )
)
```

Directory-backed seed datasets expose these columns:

- `source_kind` - always `"directory_file"`
- `source_path` - full path to the matched file
- `relative_path` - path relative to the configured directory
- `file_name` - basename of the matched file

!!! note "Filesystem matching"
    `file_pattern` matches file names only, not relative paths. `recursive=True` is the default, so nested subdirectories are searched unless you turn it off.
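
The matching rule in the note above can be illustrated with a short standard-library sketch. It is not Data Designer's implementation: `list_directory_rows` is a hypothetical helper, and treating `source_path` as a resolved absolute path is an assumption. It only shows why a pattern such as `*.md` is compared against each file's name rather than its relative path.

```python
from fnmatch import fnmatch
from pathlib import Path


def list_directory_rows(path, file_pattern="*", recursive=True):
    """Hypothetical helper that mimics the row shape listed above."""
    root = Path(path)
    # Walk the whole tree when recursive, otherwise only the top level.
    candidates = root.rglob("*") if recursive else root.glob("*")
    rows = []
    for candidate in sorted(candidates):
        # The pattern is tested against the file name only, so a pattern
        # containing a directory part (e.g. "sub/*.md") matches nothing.
        if candidate.is_file() and fnmatch(candidate.name, file_pattern):
            rows.append(
                {
                    "source_kind": "directory_file",
                    # Assumption: "full path" means a resolved absolute path.
                    "source_path": str(candidate.resolve()),
                    "relative_path": candidate.relative_to(root).as_posix(),
                    "file_name": candidate.name,
                }
            )
    return rows
```

Under these assumptions, `list_directory_rows("docs/", "*.md")` would return one dictionary per Markdown file at any depth, while passing `recursive=False` would restrict the result to files directly inside `docs/`.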

### 📄 FileContentsSeedSource

Read matching text files into the seed dataset. Each file becomes one seed row with the same metadata as `DirectorySeedSource`, plus the decoded file contents in a `content` column.

```python
seed_source = dd.FileContentsSeedSource(
    path="docs/",
    file_pattern="*.md",
    encoding="utf-8",
)

config_builder.with_seed_dataset(seed_source)
config_builder.add_column(
    dd.LLMTextColumnConfig(
        name="summary",
        model_alias="my-model",
        prompt="""\
Summarize the following document.

File: {{ file_name }}
Path: {{ relative_path }}

{{ content }}
""",
    )
)
```

`FileContentsSeedSource` exposes these seeded columns:

- `source_kind` - always `"file_contents"`
- `source_path` - full path to the matched file
- `relative_path` - path relative to the configured directory
- `file_name` - basename of the matched file
- `content` - decoded text contents of the matched file

!!! note "Encoding"
    `encoding="utf-8"` is the default. Set a different Python codec name if your files use another text encoding.
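
To make the role of `encoding` concrete, here is a minimal standard-library sketch of reading files into rows with a `content` column. `read_file_rows` is a hypothetical helper, not the library's reader. The recursive search and the resolved `source_path` are assumptions, and how Data Designer itself reacts to a decoding failure is not covered here.

```python
from fnmatch import fnmatch
from pathlib import Path


def read_file_rows(path, file_pattern="*", encoding="utf-8"):
    """Hypothetical helper that mimics the columns listed above."""
    root = Path(path)
    rows = []
    # Assumption: subdirectories are searched, as in the directory sketch.
    for candidate in sorted(root.rglob("*")):
        if not (candidate.is_file() and fnmatch(candidate.name, file_pattern)):
            continue
        rows.append(
            {
                "source_kind": "file_contents",
                # Assumption: "full path" means a resolved absolute path.
                "source_path": str(candidate.resolve()),
                "relative_path": candidate.relative_to(root).as_posix(),
                "file_name": candidate.name,
                # The bytes on disk are decoded with the named Python codec.
                "content": candidate.read_text(encoding=encoding),
            }
        )
    return rows
```

In this sketch, a file saved as Latin-1 has to be read with `encoding="latin-1"`; decoding the same bytes as UTF-8 can raise `UnicodeDecodeError`.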

!!! tip "Advanced Filesystem Readers"
    If you need custom row construction, fan-out behavior, or expensive hydration logic, build a custom filesystem seed reader and pass it via `DataDesigner(seed_readers=[...])`. See the [plugin example](../plugins/example.md) for the extension pattern.

## Sampling Strategies

Control how rows are read from the seed dataset.
@@ -234,7 +306,7 @@ Write detailed clinical notes for this visit.
)

# Preview
preview = designer.preview(config_builder, num_records=5)
preview = data_designer.preview(config_builder, num_records=5)
preview.display_sample_record()
```
