1 change: 1 addition & 0 deletions .github/CODEOWNERS
Original file line number Diff line number Diff line change
@@ -9,4 +9,5 @@
# Plugins
/plugins/data-designer-github/ @NVIDIA-NeMo/data_designer_reviewers @eric-tramel
/plugins/data-designer-retrieval-sdg/ @NVIDIA-NeMo/data_designer_reviewers @shan-nvidia @oliverholworthy
/plugins/data-designer-sandbox-piston/ @NVIDIA-NeMo/data_designer_reviewers @eric-tramel
/plugins/data-designer-template/ @NVIDIA-NeMo/data_designer_reviewers
140 changes: 140 additions & 0 deletions docs/plugins/data-designer-sandbox-piston/index.md
@@ -0,0 +1,140 @@
# data-designer-sandbox-piston

`data-designer-sandbox-piston` adds Piston-backed code execution to Data
Designer. It provides a `code-sandbox` column type for batch workflows and a
stdio MCP server that exposes the same Piston endpoint as a `run_code` tool for
tool-calling LLM columns.

The plugin is deployment-neutral: point it at any reachable Piston API URL. That
URL can be a local Docker container on macOS or Linux, a service running beside a
Data Designer worker, or a remote endpoint behind your own proxy.

## Installation

```bash
uv add data-designer data-designer-sandbox-piston
```

## Column type

Use the `code-sandbox` column type when a dataset already contains source code
that should be executed by Piston.

| Field | Required | Description |
| --- | --- | --- |
| `name` | Yes | Output column name. Each value is a dictionary with execution results. |
| `target_column` | Yes | Existing column containing source code. |
| `language` | Yes | Piston runtime language, such as `python` or `gcc`. |
| `version` | No | Piston runtime version selector. Defaults to `*`. Required when `python_packages` is non-empty. |
| `python_packages` | No | Python package requirements for a prebuilt custom Python runtime. Only valid with `language="python"`; the deployment must provide the matching runtime before execution. |
| `stdin` | No | Text passed to standard input. Defaults to an empty string. |
| `args` | No | Command-line arguments passed to the program. Defaults to an empty list. |
| `compile_timeout` | No | Compile wall-time limit in milliseconds. Defaults to `10000`. |
| `run_timeout` | No | Run wall-time limit in milliseconds. Defaults to `3000`, matching stock Piston's default run limit. |
| `compile_cpu_time` | No | Compile CPU-time limit in milliseconds. Defaults to `3000`. |
| `run_cpu_time` | No | Run CPU-time limit in milliseconds. Defaults to `3000`. |
| `sandbox_url` | Yes | HTTP or HTTPS Piston API base URL, such as `http://localhost:2000`. |

## Output shape

The output column contains a dictionary per row:

```python
{
    "stdout": "42\n",
    "stderr": "",
    "output": "42\n",
    "exit_code": 0,
    "signal": None,
    "message": None,
    "status": None,
    "cpu_time": 12.5,
    "wall_time": 15.2,
    "memory": 8192,
}
```

Empty or missing source code returns `exit_code=-2`. Sandbox API failures return
`exit_code=-1` with the final error in `stderr` and `message`.
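
The exit-code conventions above can be turned into a small triage step after generation. The following is a minimal sketch, not part of the plugin: `classify_result` and its label strings are hypothetical names chosen for illustration, and only the `-2`, `-1`, and `0` meanings come from the text above.

```python
from typing import Any, Dict


def classify_result(result: Dict[str, Any]) -> str:
    """Hypothetical helper: label one sandbox result dictionary."""
    exit_code = result.get("exit_code")
    if exit_code == -2:
        return "no_code"  # empty or missing source code
    if exit_code == -1:
        return "sandbox_error"  # the sandbox API call itself failed
    if exit_code == 0:
        return "ok"
    # Any other value, including None, is treated as a failure of the program.
    return "program_error"


label = classify_result({"stdout": "42\n", "stderr": "", "exit_code": 0})
```

Keeping sandbox failures separate from program failures makes it easier to retry only the rows where the endpoint, rather than the generated code, was at fault.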

## Python packages

`python_packages` is declarative metadata. The plugin sends `language` and
`version` to Piston; it does not install packages or build runtimes during
generation. If you set a non-empty `python_packages` list, also set `version` to
the exact custom Python runtime version that your deployment has already built
and installed in Piston.
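
The rules in this section can be checked before a run starts. The sketch below is illustrative only: `check_python_packages` is a hypothetical helper, and treating the `*` default as "no exact version given" is an assumption drawn from the wording above, not confirmed plugin behavior.

```python
from typing import List, Optional


def check_python_packages(
    language: str,
    version: str = "*",
    python_packages: Optional[List[str]] = None,
) -> None:
    """Hypothetical pre-flight check for the python_packages rules."""
    packages = list(python_packages or [])
    if not packages:
        return  # nothing to validate when no packages are declared
    if language != "python":
        raise ValueError("python_packages is only valid with language='python'")
    if version == "*":
        # Assumption: the wildcard default does not count as an exact version.
        raise ValueError("set version to the exact custom runtime version")


check_python_packages("python", "3.12.0", ["numpy"])  # passes silently
```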

## Example

```python
import pandas as pd
from data_designer.config.config_builder import DataDesignerConfigBuilder
from data_designer.config.seed_source_dataframe import DataFrameSeedSource

builder = DataDesignerConfigBuilder()
builder.with_seed_dataset(
    DataFrameSeedSource(df=pd.DataFrame({"code": ["print(6 * 7)"]}))
)
builder.add_column(
    name="sandbox_result",
    column_type="code-sandbox",
    target_column="code",
    language="python",
    version="*",
    sandbox_url="http://localhost:2000",
)
```

## MCP tool

Use `SandboxMCPConfig` to create a `LocalStdioMCPProvider` for Data Designer
tool-calling workflows:

```python
from data_designer_sandbox_piston import SandboxMCPConfig

sandbox_mcp = SandboxMCPConfig(
    name="sandbox",
    sandbox_url="http://localhost:2000",
    language="python",
    result_fields=["stdout", "stderr", "exit_code"],
)
mcp_provider = sandbox_mcp.to_provider()
tool_config = sandbox_mcp.to_tool_config()
```

The MCP process can also be launched directly:

```bash
SANDBOX_URL=http://localhost:2000 \
SANDBOX_LANGUAGE=python \
SANDBOX_VERSION='*' \
SANDBOX_RUN_TIMEOUT=3000 \
SANDBOX_RUN_CPU_TIME=3000 \
SANDBOX_TOOL_DESCRIPTION='Execute Python code in a sandbox.' \
SANDBOX_RESULT_FIELDS=stdout,stderr,exit_code \
python -m data_designer_sandbox_piston.mcp_server
```
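
To illustrate what a comma-separated `SANDBOX_RESULT_FIELDS` value implies, here is a minimal sketch of parsing the variable and trimming a result dictionary to the selected keys. The function names are hypothetical and the MCP server's real parsing is not shown in this document, so treat the whitespace handling and the "empty means everything" rule as assumptions.

```python
import os
from typing import Any, Dict, List


def parse_result_fields(raw: str) -> List[str]:
    """Split a comma-separated field list, dropping blanks (assumed behavior)."""
    return [field.strip() for field in raw.split(",") if field.strip()]


def filter_result(result: Dict[str, Any], fields: List[str]) -> Dict[str, Any]:
    """Keep only the selected keys; an empty selection keeps everything."""
    if not fields:
        return dict(result)
    return {key: result[key] for key in fields if key in result}


fields = parse_result_fields(os.environ.get("SANDBOX_RESULT_FIELDS", ""))
trimmed = filter_result({"stdout": "42\n", "stderr": "", "exit_code": 0}, fields)
```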

## Local and remote Piston

For local development on macOS or Linux, run Piston in Docker and point
`sandbox_url` at `http://localhost:2000`. The package includes a convenience
script under `scripts/` and a Docker Compose example under `docker/`.

The local container stores Piston runtime packages under `/piston` in a Docker
volume. A fresh volume has no runtimes installed; install the runtimes you need
through Piston's package API, for example:

```bash
curl -X POST http://localhost:2000/api/v2/packages \
  -H 'Content-Type: application/json' \
  -d '{"language":"python","version":"3.12.0"}'
```

For remote deployment, build or run a Piston API image and expose port `2000`
inside your deployment boundary. Piston must be run with the privileges and
kernel support required by its own sandboxing model. See the
[Piston project](https://github.com/engineer-man/piston) for current runtime
installation and security guidance.
93 changes: 93 additions & 0 deletions docs/plugins/data-designer-sandbox-piston/usage.md
@@ -0,0 +1,93 @@
# Usage

This plugin is a client for a running Piston API. It does not install Piston
runtimes for you, which keeps the Data Designer package portable across local,
cluster, and remote service deployments.

## Start a local sandbox

Use Docker on macOS or Linux:

```bash
bash plugins/data-designer-sandbox-piston/scripts/run-local-piston.sh
```

The script starts a container on `http://localhost:2000`. You can also use the
Docker Compose example:

```bash
docker compose -f plugins/data-designer-sandbox-piston/docker/docker-compose.yml up --build
```

Piston runtime packages are stored under `/piston` inside a Docker volume. A new
Piston package volume starts empty, so install the runtimes your workflow needs:

```bash
curl -X POST http://localhost:2000/api/v2/packages \
  -H 'Content-Type: application/json' \
  -d '{"language":"python","version":"3.12.0"}'
```
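
Before running a workflow, it can help to confirm that the runtime is actually installed. The sketch below assumes Piston's `GET /api/v2/runtimes` endpoint (the same one the Compose health check queries) returns a list of objects with `language`, `version`, and `aliases` keys. `fetch_runtimes` and `has_runtime` are hypothetical helpers, and the version match is a plain equality check rather than Piston's own selector logic.

```python
import json
import urllib.request
from typing import Any, Dict, List


def fetch_runtimes(base_url: str, timeout: float = 5.0) -> List[Dict[str, Any]]:
    """Fetch the installed runtimes from a running Piston API."""
    url = base_url.rstrip("/") + "/api/v2/runtimes"
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


def has_runtime(
    runtimes: List[Dict[str, Any]], language: str, version: str = "*"
) -> bool:
    """Return True if a runtime matches by name or alias and by exact version."""
    for runtime in runtimes:
        names = [runtime.get("language")] + list(runtime.get("aliases", []))
        if language in names and version in ("*", runtime.get("version")):
            return True
    return False


# Example with an in-memory response; against a live sandbox, use
# has_runtime(fetch_runtimes("http://localhost:2000"), "python").
sample = [{"language": "python", "version": "3.12.0", "aliases": ["py"]}]
installed = has_runtime(sample, "python", "3.12.0")
```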

## Execute a code column

```python
from data_designer.config.config_builder import DataDesignerConfigBuilder

builder = DataDesignerConfigBuilder()
builder.add_column(
    name="result",
    column_type="code-sandbox",
    target_column="python_code",
    language="python",
    sandbox_url="http://localhost:2000",
)
```

Each output value is a JSON-serializable dictionary containing stdout, stderr,
exit code, status, timing, and memory fields.

If a workflow needs Python packages, build or provide a custom Piston Python
runtime first, then set `version` to that exact runtime version and list the
expected packages in `python_packages`. The package list is metadata for humans
and tool descriptions; execution still depends on the configured Piston runtime.

## Add an MCP `run_code` tool

```python
from data_designer_sandbox_piston import SandboxMCPConfig

sandbox_mcp = SandboxMCPConfig(
    name="sandbox",
    sandbox_url="http://localhost:2000",
    language="python",
    result_fields=["stdout", "stderr", "exit_code"],
)

mcp_provider = sandbox_mcp.to_provider()
tool_config = sandbox_mcp.to_tool_config()
```

Pass `mcp_provider` and `tool_config` into the Data Designer configuration path
that configures MCP providers and tool aliases for your LLM columns.

If the sandbox URL is assigned by your launcher at runtime, omit `sandbox_url`
from `SandboxMCPConfig` and set `SANDBOX_URL` in the environment inherited by the
MCP subprocess.
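
The fallback described above can be pictured as a small resolution step. This is a sketch under stated assumptions, not the plugin's code: `resolve_sandbox_url` is a hypothetical helper, and both the precedence (explicit value first, then `SANDBOX_URL`) and the trailing-slash normalization are illustrative choices.

```python
import os
from typing import Mapping, Optional
from urllib.parse import urlparse


def resolve_sandbox_url(
    configured: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Pick the explicit URL if given, else SANDBOX_URL from the environment."""
    env = os.environ if env is None else env
    url = configured or env.get("SANDBOX_URL")
    if not url:
        raise ValueError("no sandbox URL configured and SANDBOX_URL is unset")
    parsed = urlparse(url)
    # The column documentation asks for an HTTP or HTTPS base URL.
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"not an HTTP(S) base URL: {url!r}")
    return url.rstrip("/")


url = resolve_sandbox_url(env={"SANDBOX_URL": "http://localhost:2000"})
```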

## Remote endpoint

For a remote deployment, use the same configuration with a different URL:

```python
builder.add_column(
    name="result",
    column_type="code-sandbox",
    target_column="python_code",
    language="python",
    sandbox_url="https://piston.example.internal",
)
```

Keep the Piston endpoint private to trusted Data Designer workers or put it
behind your own authentication proxy. The plugin forwards code to the configured
endpoint and relies on that service for isolation and runtime availability.
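
To make "forwards code to the configured endpoint" concrete, the following sketch builds a request for Piston's `POST /api/v2/execute` endpoint without sending it. The body keys (`language`, `version`, `files`, `stdin`, `args`, `run_timeout`) follow Piston's public API; `build_execute_request` is a hypothetical helper, and the plugin's actual request may differ.

```python
import json
import urllib.request
from typing import List, Optional


def build_execute_request(
    base_url: str,
    source: str,
    language: str,
    version: str = "*",
    stdin: str = "",
    args: Optional[List[str]] = None,
    run_timeout: int = 3000,
) -> urllib.request.Request:
    """Hypothetical helper: prepare (but do not send) an execute request."""
    payload = {
        "language": language,
        "version": version,
        "files": [{"content": source}],
        "stdin": stdin,
        "args": list(args or []),
        "run_timeout": run_timeout,  # milliseconds
    }
    return urllib.request.Request(
        base_url.rstrip("/") + "/api/v2/execute",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_execute_request(
    "https://piston.example.internal", "print(6 * 7)", "python"
)
# Sending is left to the caller, for example:
# urllib.request.urlopen(request, timeout=10)
```

Because the source code travels in the request body as-is, anyone who can reach the endpoint can run code on it, which is why the endpoint should stay private or sit behind an authentication proxy.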
11 changes: 11 additions & 0 deletions docs/plugins/index.md
@@ -27,6 +27,17 @@ Browse available Data Designer plugins by what they add to your data generation
<span class="plugin-doc-card__chips"><span class="plugin-doc-chip">document-chunker</span><span class="plugin-doc-chip">embedding-dedup</span></span>
</span>
</a>
<a class="plugin-doc-card" href="data-designer-sandbox-piston/" aria-label="Open data-designer-sandbox-piston documentation">
<span class="plugin-doc-card__header">
<span class="plugin-doc-card__title">data-designer-sandbox-piston</span>
<span class="plugin-doc-card__version">v0.1.0</span>
</span>
<span class="plugin-doc-card__description">Piston-backed code execution columns and MCP tools for Data Designer</span>
<span class="plugin-doc-card__section">
<span class="plugin-doc-card__label">Entry points</span>
<span class="plugin-doc-card__chips"><span class="plugin-doc-chip">code-sandbox</span></span>
</span>
</a>
<a class="plugin-doc-card" href="data-designer-template/" aria-label="Open data-designer-template documentation">
<span class="plugin-doc-card__header">
<span class="plugin-doc-card__title">data-designer-template</span>
3 changes: 3 additions & 0 deletions plugins/data-designer-sandbox-piston/CODEOWNERS
@@ -0,0 +1,3 @@
# Owner(s) of this plugin — used to generate the root CODEOWNERS file.
# GitHub accepts @username, @org/team, or email format.
* @NVIDIA-NeMo/data_designer_reviewers @eric-tramel
45 changes: 45 additions & 0 deletions plugins/data-designer-sandbox-piston/README.md
@@ -0,0 +1,45 @@
# data-designer-sandbox-piston

Piston-backed code execution for Data Designer.

This package provides:

- A `code-sandbox` column type that executes code from an existing column via a
  local or remote Piston API.
- A stdio MCP server that exposes the same sandbox as a `run_code` tool.
- Local Docker helper examples for macOS and Linux development.

## Installation

```bash
uv add data-designer data-designer-sandbox-piston
```

## Column usage

```python
builder.add_column(
    name="sandbox_result",
    column_type="code-sandbox",
    target_column="python_code",
    language="python",
    sandbox_url="http://localhost:2000",
)
```

## MCP usage

```python
from data_designer_sandbox_piston import SandboxMCPConfig

sandbox_mcp = SandboxMCPConfig(
    name="sandbox",
    sandbox_url="http://localhost:2000",
    language="python",
)
mcp_provider = sandbox_mcp.to_provider()
tool_config = sandbox_mcp.to_tool_config()
```

Plugin documentation for the repository site lives in this package's `docs/`
directory.
7 changes: 7 additions & 0 deletions plugins/data-designer-sandbox-piston/docker/Dockerfile
@@ -0,0 +1,7 @@
# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0

ARG PISTON_IMAGE=ghcr.io/engineer-man/piston:latest
FROM ${PISTON_IMAGE}

EXPOSE 2000
25 changes: 25 additions & 0 deletions plugins/data-designer-sandbox-piston/docker/docker-compose.yml
@@ -0,0 +1,25 @@
# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0

services:
  piston:
    build:
      context: .
      dockerfile: Dockerfile
    container_name: data-designer-piston
    privileged: true
    ports:
      - "${PISTON_PORT:-2000}:2000"
    volumes:
      - piston-data:/piston
    restart: unless-stopped
    healthcheck:
      test:
        - CMD-SHELL
        - "python3 -c 'import urllib.request; urllib.request.urlopen(\"http://127.0.0.1:2000/api/v2/runtimes\", timeout=1)'"
      interval: 10s
      timeout: 2s
      retries: 12

volumes:
  piston-data: