46 changes: 46 additions & 0 deletions docs/advanced/testing-environments.mdx
@@ -89,6 +89,52 @@ hud debug . --max-phase 3 # Stop after phase 3
hud debug --config mcp.json # Debug from config file
```

## Scenario MCP Protocol Mapping

Understanding how scenarios map to MCP is crucial for debugging. Each scenario registers **two MCP endpoints**:

| Phase | MCP Type | Endpoint | What it does |
|-------|----------|----------|--------------|
| Setup | **Prompt** | `get_prompt("{env}:{scenario}", args)` | Runs code before first `yield`, returns the prompt |
| Evaluate | **Resource** | `read_resource("{env}:{scenario}")` | Runs code after first `yield`, returns `{"reward": float}` |

### Debug with raw MCP calls

If a scenario isn't working, test each phase directly:

```python
async with env:
# Phase 1: Setup (runs code before first yield)
prompt_result = await env.get_prompt(
"myenv:checkout",
{"product": "laptop", "user_id": "alice"}
)
print(f"Prompt: {prompt_result.messages[0].content}")

# ... agent runs here ...

# Phase 2: Submit answer (stores it for evaluation)
await env.submit("checkout", answer="Order completed successfully")

# Phase 3: Evaluate (runs code after first yield)
resource_result = await env.read_resource("myenv:checkout")
print(f"Reward: {resource_result}") # {"reward": 1.0}
```

### Common debugging scenarios

**Problem:** `evaluate_tool: NULL` but using v5 scenarios
- **Cause:** v5 scenarios don't use `evaluate_tool`—they return rewards via `read_resource`
- **Fix:** Ensure your orchestrator calls `read_resource()` after agent completion

**Problem:** `TypeError` when evaluating with complex args like `list[dict]`
- **Cause:** MCP passes all arguments as strings; the SDK deserializes them
- **Debug:** Add logging to check `type(arg)` at scenario entry (see the sketch after this list)

**Problem:** Scenario setup works but evaluate returns no reward
- **Cause:** `submit()` wasn't called before `read_resource()`
- **Fix:** Call `await env.submit(scenario_name, answer)` first
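
If you hit the `TypeError` case above, a quick way to confirm what the SDK actually handed you is to log argument types at scenario entry. A minimal sketch, assuming scenarios are registered with an `@env.scenario(...)` decorator (the decorator, argument names, and grading logic below are illustrative placeholders, not the SDK's exact API):

```python
import logging

logger = logging.getLogger(__name__)

@env.scenario("checkout")  # hypothetical registration; match your environment's actual API
async def checkout(product: str, items: list[dict]):
    # Log the runtime type of each argument at entry so string-vs-structured
    # mismatches from MCP deserialization show up in the environment logs.
    logger.info("product=%r (%s)", product, type(product).__name__)
    logger.info("items=%r (%s)", items, type(items).__name__)

    yield f"Buy {product} and add {len(items)} accessories to the cart"  # prompt

    # Grading logic runs here, after the agent submits its answer.
    yield 1.0  # placeholder reward
```

If `items` arrives as a `str` instead of a `list`, the log line makes that obvious immediately.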

## Useful Environment Properties

```python
5 changes: 3 additions & 2 deletions docs/docs.json
@@ -33,7 +33,7 @@
"icon": "code",
"versions": [
{
"version": "0.5.4",
"version": "0.5.5",
"groups": [
{
"group": "Get Started",
@@ -198,7 +198,8 @@
{
"group": "Get Started",
"pages": [
"platform/index"
"platform/index",
"platform/mcp"
]
},
{
4 changes: 4 additions & 0 deletions docs/guides/best-practices.mdx
@@ -61,6 +61,10 @@ HUD sandboxes each eval—containers don't share state. But if your environment

## Good Evals

<Tip>
**Scenarios are the atomic skills your agent must get right.** If your agent can't reliably pass a scenario, that's a gap to close—through prompting, fine-tuning, or tool design.
</Tip>

An eval combines a prompt (the first `yield`) with grading logic (everything after). The prompt tells agents what to do—write short-to-medium length instructions that ask for an unambiguous change you can verify.
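
For example, a sketch of one such eval (the `@env.scenario` decorator and the `db` state helper below are placeholders for however your environment registers scenarios and exposes state):

```python
@env.scenario("rename-column")  # hypothetical registration API
async def rename_column():
    # Prompt: one unambiguous change the grader can verify.
    yield "Rename the `customer_email` column of the `orders` table to `email`."

    # Grading: inspect environment state directly instead of trusting the
    # agent's description of what it did.
    columns = db.get_columns("orders")  # placeholder state-access helper
    yield 1.0 if "email" in columns and "customer_email" not in columns else 0.0
```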

### Be Specific
2 changes: 1 addition & 1 deletion docs/index.mdx
@@ -63,7 +63,7 @@ async def find_answer(question: str):
yield 1.0 if "correct" in answer.lower() else 0.0
```

Scenarios define the prompt (first yield) and the scoring logic (second yield). The agent runs in between.
**Scenarios are the atomic skills your agent must get right.** Each one defines a prompt (first yield) and scoring logic (second yield). If your agent can't reliably pass a scenario, that's a gap to close—through prompting, fine-tuning, or tool design.

→ [More on Environments](/quick-links/environments)

48 changes: 48 additions & 0 deletions docs/platform/environments.mdx
@@ -191,6 +191,54 @@ hud eval my-env/checkout --model gpt-4o --group-size 10

For local debugging before pushing, see [`hud debug`](/reference/cli/debug).

## Debugging Traces

When viewing a trace (task run), the environment pane on the right side provides powerful debugging tools through two special tabs:

### DEBUG Tab

The DEBUG tab shows low-level information about the environment and agent execution:

- **Environment Info** — Container ID, pod status, and connection details
- **MCP Operations** — All MCP protocol messages including prompts, resources, and internal operations that aren't tool calls
- **Raw Attributes** — Expand any operation to see the full request/response payload
- **Worker Logs** (Admin only) — Server-side logs from the Celery worker that executed the rollout

This is useful for diagnosing:
- Why a scenario setup or evaluation failed
- MCP protocol issues between the agent and environment
- Authentication or connection problems
- Server-side errors that don't surface in the agent trace

### LOGS Tab

The LOGS tab shows container stdout/stderr from the environment:

- **Real-time streaming** — Logs update as the environment runs
- **Timestamp filtering** — See when specific events occurred
- **Error highlighting** — Errors and warnings are visually distinct

This helps debug:
- Environment startup issues
- Tool execution failures
- Python exceptions in your environment code
- Resource exhaustion (memory, CPU, disk)

<Tip>
If an agent run fails with no obvious error, check the LOGS tab first—often the environment container logged an exception that explains what went wrong.
</Tip>

### Accessing Debug Information

1. Open any trace at `hud.ai/trace/{id}`
2. Look at the environment pane on the right side
3. Click the **DEBUG** or **LOGS** tab at the top
4. For MCP operations in DEBUG, click the expand icon to see full payloads

<Note>
Worker logs in the DEBUG tab are only visible to platform administrators. Regular users see environment logs in the LOGS tab.
</Note>

## Next Steps

<CardGroup cols={2}>
7 changes: 6 additions & 1 deletion docs/platform/index.mdx
@@ -14,10 +14,15 @@ Sign in at [hud.ai](https://hud.ai) with your account. The main navigation gives
- **Models** — Browse models, view checkpoints, train custom models
- **Environments** — Deploy and monitor your agent environments
- **Tasksets** — Organize tasks for evaluations and benchmarks
- **🔍 MCP** — Connect your AI agent to query platform data

## Quick Links

<CardGroup cols={3}>
<CardGroup cols={2}>
<Card title="MCP Integration" icon="magnifying-glass" href="/platform/mcp">
Let your AI agent query traces, debug environments, and explore tasks
</Card>

<Card title="Models" icon="robot" href="/platform/models">
Browse, fork, and train models
</Card>
101 changes: 101 additions & 0 deletions docs/platform/mcp.mdx
@@ -0,0 +1,101 @@
---
title: "MCP Integration"
description: "Let your AI agent query the platform. Analyze traces, debug environments, explore tasks."
icon: "magnifying-glass"
---

Connect your AI agent to the HUD platform via MCP. Your agent can query traces from the Home dashboard, check environment build status, explore your tasksets—all through natural conversation. When you're reviewing jobs and spot failure patterns, ask your agent to analyze them and suggest new tasks.

## Setup

Click the 🔍 button in the platform header to get the config, or add manually:

```json
{
"hud": {
"url": "https://api.hud.ai/v3/mcp/",
"headers": {
"Authorization": "Bearer YOUR_HUD_API_KEY"
}
}
}
```

Get your API key from [Settings → API Keys](https://hud.ai/project/api-keys).

## Analyze Traces

From the Home dashboard, you see your recent jobs and traces. With MCP, your agent can dig deeper:

```
"Get the traces from my last failed job and explain what the agent did wrong."
```

```
"Show me traces where the reward was 0. What patterns do you see in how the agent failed?"
```

Your agent retrieves the trace data—every action, tool call, and response—and helps you understand what happened.

## Debug Environments

When an environment build fails or behaves unexpectedly, ask your agent to investigate:

```
"Check the status of my remote-browser environment."
```

```
"List my environments and tell me which ones are ready vs still building."
```

This surfaces the same info you see on the Environments page, but lets you query it conversationally while you're working.

## Explore Tasksets

Browse your tasksets and see what's in each one:

```
"What tasksets do I have? How many tasks are in SheetBench-50?"
```

```
"Show me the tasks in my latest evalset and describe what they test."
```

## Write New Tasks from Failures

The real power: after analyzing failed traces, have your agent suggest new tasks that target those weaknesses.

```
"Based on the failures you found, write 3 new tasks that would test
those specific edge cases."
```

This closes the loop—run evals → analyze failures → create targeted tasks → run again.

## Available Tools

| Tool | What it queries |
|------|-----------------|
| `list_jobs` | Your jobs from Home (status, metrics) |
| `get_job` | Job details and summary |
| `get_job_traces` | Traces in a job |
| `get_trace` | Full trace with trajectory and logs |
| `list_environments` | Your environments from Environments page |
| `get_environment` | Environment details and build status |
| `list_evalsets` | Your tasksets from Tasksets page |
| `get_evalset_tasks` | Tasks in a specific evalset |
| `list_scenarios` | Scenarios for an environment |

All read-only—your agent can query but not modify platform data.
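
These are the same calls an MCP-connected agent makes. To try them from a script, a sketch using the FastMCP client may help (the transport class, header handling, and tool arguments below are assumptions; check your installed `fastmcp` version):

```python
import asyncio
import os

from fastmcp import Client
from fastmcp.client.transports import StreamableHttpTransport

async def main() -> None:
    # Same endpoint and Authorization header as the JSON config above.
    transport = StreamableHttpTransport(
        url="https://api.hud.ai/v3/mcp/",
        headers={"Authorization": f"Bearer {os.environ['HUD_API_KEY']}"},
    )
    async with Client(transport) as client:
        tools = await client.list_tools()
        print([tool.name for tool in tools])  # expect list_jobs, get_trace, ...

        jobs = await client.call_tool("list_jobs", {})  # argument shape is an assumption
        print(jobs)

asyncio.run(main())
```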

<CardGroup cols={2}>
<Card title="Environments" icon="cube" href="/platform/environments">
Deploy and manage agent environments
</Card>

<Card title="Tasksets" icon="list-check" href="/platform/tasksets">
Organize tasks for evaluation
</Card>
</CardGroup>
60 changes: 53 additions & 7 deletions docs/reference/environments.mdx
@@ -303,21 +303,32 @@ env.serve(transport="streamable-http", host="0.0.0.0", port=8765)

### http_app()

Get a Starlette/ASGI app to mount on an existing FastAPI server:
Get a Starlette/ASGI app to mount on an existing FastAPI server. This is inherited from FastMCP and enables deployment on platforms like Railway, Fly.io, or Vercel.

```python
from contextlib import asynccontextmanager
from fastapi import FastAPI
from hud import Environment

app = FastAPI()
env = Environment("my-env")

@env.tool()
def my_tool(arg: str) -> str:
return f"Got: {arg}"

# Mount the HUD environment's MCP endpoint at /mcp
app.mount("/mcp", env.http_app())
# Create the MCP app with stateless_http=True for multi-replica deployments
mcp_app = env.http_app(path="/", stateless_http=True)

@asynccontextmanager
async def lifespan(app: FastAPI):
# Enter BOTH the environment context AND the MCP app's lifespan
async with env, mcp_app.router.lifespan_context(mcp_app):
yield

app = FastAPI(lifespan=lifespan, redirect_slashes=False)

# Mount the MCP app
app.mount("/mcp", mcp_app)

# Your other FastAPI routes work normally
@app.get("/health")
@@ -328,10 +339,45 @@ def health():
| Parameter | Type | Description | Default |
|-----------|------|-------------|---------|
| `path` | `str \| None` | Internal path for the MCP endpoint | `"/"` |
| `transport` | `Literal["http", "streamable-http", "sse"]` | Transport protocol | `"http"` |
| `stateless_http` | `bool` | Stateless mode for multi-replica deployments | `False` |
| `middleware` | `list[ASGIMiddleware] \| None` | Starlette middleware | `None` |
| `json_response` | `bool \| None` | Use JSON response format | `None` |
| `stateless_http` | `bool \| None` | Use stateless HTTP mode | `None` |

<Warning>
**Lifespan is critical.** You must enter both `env` (the Environment context) and `mcp_app.router.lifespan_context(mcp_app)` (the MCP session manager). Missing either will cause tools to fail or sessions to not initialize.
</Warning>

#### Stateless HTTP Mode

Enable `stateless_http=True` when deploying to platforms with multiple replicas (Railway, Fly.io, etc.). This ensures each request creates a fresh transport context, eliminating session affinity requirements:

```python
# For single-replica or sticky sessions
mcp_app = env.http_app(path="/")

# For multi-replica deployments (Railway, Fly.io, Vercel)
mcp_app = env.http_app(path="/", stateless_http=True)
```

#### Authentication via Headers

For authenticated tools, use FastMCP's `get_http_headers()` to extract the API key:

```python
from fastmcp.server.dependencies import get_http_headers

@env.tool()
async def protected_tool(query: str) -> dict:
"""A tool that requires authentication."""
headers = get_http_headers()
auth_header = headers.get("authorization", "")

if not auth_header.startswith("Bearer "):
return {"error": "Missing API key"}

api_key = auth_header[7:] # Remove "Bearer " prefix
# Validate api_key and proceed...
return {"result": "authenticated"}
```

MCP clients can then connect at `http://your-server/mcp`:

17 changes: 12 additions & 5 deletions hud/environment/environment.py
@@ -225,6 +225,9 @@ async def _broadcast_tool(
Automatically filters to only connections where the tool exists
(based on cached_tools from initial discovery).

For internal tools (starting with _), tries ALL connections since
internal tools are hidden from list_tools() and won't be in cached_tools.

Args:
tool_name: Name of the tool to call
**kwargs: Arguments to pass to the tool
@@ -234,10 +237,13 @@
"""
import asyncio

# Only call connections that have this tool
targets = self._connections_with_tool(tool_name)
if not targets:
return {}
# For internal tools (underscore prefix), try ALL connections since
# they're hidden from list_tools() and won't appear in cached_tools.
# For regular tools, only try connections that advertise the tool.
if tool_name.startswith("_"):
targets = set(self._connections.keys())
else:
targets = self._connections_with_tool(tool_name)

results: dict[str, Any] = {}

@@ -246,7 +252,8 @@ async def call_one(name: str) -> None:
if not connector or not connector.client:
return
try:
results[name] = await connector.client.call_tool(tool_name, **kwargs)
# Use connector.call_tool which expects arguments as a dict
results[name] = await connector.call_tool(tool_name, kwargs)
logger.debug("Broadcast '%s' to '%s' succeeded", tool_name, name)
except Exception as e:
results[name] = e