4 changes: 2 additions & 2 deletions docs/conf.py
@@ -22,7 +22,6 @@

_mod = importlib.import_module("dataframely")


project = "dataframely"
copyright = f"{datetime.date.today().year}, QuantCo, Inc"
author = "QuantCo, Inc."
@@ -42,6 +41,7 @@
"sphinx_copybutton",
"sphinx_design",
"sphinx_toolbox.more_autodoc.overloads",
"sphinx_llms_txt",
]

## sphinx
@@ -71,7 +71,7 @@
maximum_signature_line_length = 88

# source files
exclude_patterns = ["_build", "Thumbs.db", ".DS_Store"]
exclude_patterns = ["_build", "Thumbs.db", ".DS_Store", "SKILL.md"]
source_suffix = {
".rst": "restructuredtext",
".txt": "markdown",
65 changes: 65 additions & 0 deletions docs/guides/coding-agents.md
@@ -0,0 +1,65 @@
# Using `dataframely` with coding agents

Coding agents are particularly powerful when two criteria are met:

1. The agent can know all required information and does not need to guess.
2. The results of the agent's work can be easily verified.

`dataframely` helps you fulfill these criteria.

To help your coding agent write good `dataframely` code, we provide a `dataframely` [skill](https://raw.githubusercontent.com/Quantco/dataframely/refs/heads/main/docs/guides/coding-agents/SKILL.md) following the [`agentskills.io` spec](https://agentskills.io/specification). You can install it by placing it where your agent can find it. For example, if you are using `claude`:

```bash
mkdir -p .claude/skills/dataframely/
curl -o .claude/skills/dataframely/SKILL.md https://raw.githubusercontent.com/Quantco/dataframely/refs/heads/main/docs/guides/coding-agents/SKILL.md
```

Refer to the documentation of your coding agent for instructions on how to add custom skills.

## Tell the agent about your data with `dataframely` schemas

`dataframely` schemas provide a clear format for documenting dataframe structure and contents, which helps coding agents understand your codebase. We recommend structuring your data processing code with clear interfaces documented via `dataframely` type hints. This streamlines your coding agent's ability to find the right schema at the right time.

For example:

```python
def preprocess(raw: dy.LazyFrame[MyRawSchema]) -> dy.DataFrame[MyPreprocessedSchema]:
...
```

gives a coding agent much more information than the schema-less alternative:

```python
def preprocess(raw: pl.LazyFrame) -> pl.DataFrame:
...
```

This convention also makes your code more readable and maintainable for human developers.

If there is additional domain information that is not natively expressed through the structure of the schema, we recommend documenting it directly on the definition of the schema columns. One common example is the semantic meaning of enum values referring to conventions in the data:

```python
class HospitalStaySchema(dy.Schema):
# Reason for admission to the hospital
# N = Emergency
# V = Transfer from another hospital
# ...
admission_reason = dy.Enum(["N", "V", ...])
```

## Verifying results

`dataframely` supports you and your coding agent in writing unit tests for individual pieces of logic. One significant bottleneck is the generation of appropriate test data. Check out [our documentation on synthetic data generation](./features/data-generation.md) to see how `dataframely` can help you generate realistic test data that meets the constraints of your schema. We recommend requiring your coding agent to write tests using this functionality to verify its work.
118 changes: 118 additions & 0 deletions docs/guides/coding-agents/SKILL.md
@@ -0,0 +1,118 @@
---
name: dataframely
description: A declarative, Polars-native data frame validation library. Use when implementing data processing logic in polars.
license: BSD-3-Clause
---

# Dataframely skill

`dataframely` provides `dy.Schema` and `dy.Collection` to document and enforce the structure of single or multiple related data frames.

## `dy.Schema` example

A `dy.Schema` describes the structure of a single dataframe.

```python
class HouseSchema(dy.Schema):
    """A schema for a dataframe describing houses."""

    street = dy.String(primary_key=True)
    number = dy.UInt16(primary_key=True)
    # Number of rooms
    rooms = dy.UInt8()
    # Area in square meters
    area = dy.UInt16()
```

## `dy.Collection` example

A `dy.Collection` describes a set of related dataframes, each described by a `dy.Schema`. Dataframes in a collection should share at least a subset of their primary key columns.

```python
class StreetSchema(dy.Schema):
    """A schema for a dataframe describing streets."""

    # Shared primary key component with HouseSchema
    street = dy.String(primary_key=True)
    city = dy.String()


class MyCollection(dy.Collection):
    """A collection of related dataframes."""

    houses: dy.LazyFrame[HouseSchema]
    streets: dy.LazyFrame[StreetSchema]
```

# Usage conventions

## Use clear interfaces

Structure data processing code with clear interfaces documented using `dataframely` type hints:

```python
def preprocess(raw: dy.LazyFrame[MyRawSchema]) -> dy.DataFrame[MyPreprocessedSchema]:
# Internal dataframes do not require schemas
df: pl.LazyFrame = ...
return MyPreprocessedSchema.validate(df, cast=True)
```

Use schemas for all input, output, and intermediate dataframes. Schemas may be omitted for short-lived temporary dataframes and private helper functions (prefixed with `_`).

## `filter` vs `validate`

Both `.validate` and `.filter` enforce the schema at runtime. Pass `cast=True` for safe type-casting.

- **`Schema.validate`** — raises on failure. Use when failures are unexpected (e.g. transforming already-validated data).
- **`Schema.filter`** — returns valid rows plus a `FailureInfo` describing filtered-out rows. Use when failures are possible and should be handled gracefully (e.g. logging and skipping invalid rows).

## Testing

Every data transformation must have unit tests. Test each branch of the transformation logic. Do not test properties already guaranteed by the schema.

### Test structure

1. Create synthetic input data
2. Define the expected output
3. Execute the transformation
4. Compare using `assert_frame_equal` from `polars.testing` (or `diffly.testing` if installed)

```python
import polars as pl
from polars.testing import assert_frame_equal


def test_grouped_sum():
    df = pl.DataFrame({
        "col1": [1, 2, 3],
        "col2": ["a", "a", "b"],
    }).pipe(MyInputSchema.validate, cast=True)

    expected = pl.DataFrame({
        "col1": ["a", "b"],
        "col2": [3, 3],
    })

    result = my_code(df)

    # assert_frame_equal raises on mismatch; do not wrap it in `assert`
    assert_frame_equal(expected, result)
```

### Generating synthetic input data

For complex schemas where only some columns are relevant to the test, use `dataframely`'s synthetic data generation:

```python
# Random data meeting all schema constraints
random_data = MyInputSchema.sample(num_rows=100)
```

Use fully random data for property tests where exact contents don't matter. Use overrides to pin specific columns while randomly sampling the rest:

```python
random_data_with_overrides = HouseSchema.sample(
num_rows=5,
overrides={
"street": ["Main St.", "Main St.", "Main St.", "Second St.", "Second St."],
}
)
```
1 change: 1 addition & 0 deletions docs/guides/index.md
@@ -7,6 +7,7 @@
quickstart
examples/index
features/index
coding-agents
development
migration/index
faq
41 changes: 41 additions & 0 deletions pixi.lock
