46 changes: 46 additions & 0 deletions docs/advanced/testing-environments.mdx
@@ -89,6 +89,52 @@ hud debug . --max-phase 3 # Stop after phase 3
hud debug --config mcp.json # Debug from config file
```

## Scenario MCP Protocol Mapping

Understanding how scenarios map to MCP is crucial for debugging. Each scenario registers **two MCP endpoints**:

| Phase | MCP Type | Endpoint | What it does |
|-------|----------|----------|--------------|
| Setup | **Prompt** | `get_prompt("{env}:{scenario}", args)` | Runs code before first `yield`, returns the prompt |
| Evaluate | **Resource** | `read_resource("{env}:{scenario}")` | Runs code after first `yield`, returns `{"reward": float}` |

### Debug with raw MCP calls

If a scenario isn't working, test each phase directly:

```python
async with env:
# Phase 1: Setup (runs code before first yield)
prompt_result = await env.get_prompt(
"myenv:checkout",
{"product": "laptop", "user_id": "alice"}
)
print(f"Prompt: {prompt_result.messages[0].content}")

# ... agent runs here ...

# Phase 2: Submit answer (stores it for evaluation)
await env.submit("checkout", answer="Order completed successfully")

# Phase 3: Evaluate (runs code after first yield)
resource_result = await env.read_resource("myenv:checkout")
print(f"Reward: {resource_result}") # {"reward": 1.0}
```

### Common debugging scenarios

**Problem:** `evaluate_tool: NULL` but using v5 scenarios
- **Cause:** v5 scenarios don't use `evaluate_tool`—they return rewards via `read_resource`
- **Fix:** Ensure your orchestrator calls `read_resource()` after agent completion

**Problem:** `TypeError` when evaluating with complex args like `list[dict]`
- **Cause:** MCP passes all arguments as strings; the SDK deserializes them
- **Debug:** Add logging to check `type(arg)` at scenario entry (see the sketch after this list)

**Problem:** Scenario setup works but evaluate returns no reward
- **Cause:** `submit()` wasn't called before `read_resource()`
- **Fix:** Call `await env.submit(scenario_name, answer)` first
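
If you hit the `TypeError` case above, a quick way to confirm what the SDK actually handed you is to log argument types at scenario entry. A minimal sketch, assuming scenarios are registered with an `@env.scenario(...)` decorator (the decorator, argument names, and grading logic below are illustrative placeholders, not the SDK's exact API):

```python
import logging

logger = logging.getLogger(__name__)

@env.scenario("checkout")  # hypothetical registration; match your environment's actual API
async def checkout(product: str, items: list[dict]):
    # Log the runtime type of each argument at entry so string-vs-structured
    # mismatches from MCP deserialization show up in the environment logs.
    logger.info("product=%r (%s)", product, type(product).__name__)
    logger.info("items=%r (%s)", items, type(items).__name__)

    yield f"Buy {product} and add {len(items)} accessories to the cart"  # prompt

    # Grading logic runs here, after the agent submits its answer.
    yield 1.0  # placeholder reward
```

If `items` arrives as a `str` instead of a `list`, the log line makes that obvious immediately.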

## Useful Environment Properties

```python
5 changes: 3 additions & 2 deletions docs/docs.json
@@ -33,7 +33,7 @@
"icon": "code",
"versions": [
{
"version": "0.5.4",
"version": "0.5.5",
"groups": [
{
"group": "Get Started",
@@ -198,7 +198,8 @@
{
"group": "Get Started",
"pages": [
"platform/index"
"platform/index",
"platform/mcp"
]
},
{
4 changes: 4 additions & 0 deletions docs/guides/best-practices.mdx
@@ -61,6 +61,10 @@ HUD sandboxes each eval—containers don't share state. But if your environment

## Good Evals

<Tip>
**Scenarios are the atomic skills your agent must get right.** If your agent can't reliably pass a scenario, that's a gap to close—through prompting, fine-tuning, or tool design.
</Tip>

An eval combines a prompt (the first `yield`) with grading logic (everything after). The prompt tells agents what to do—write short-to-medium length instructions that ask for an unambiguous change you can verify.
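
For example, a sketch of one such eval (the `@env.scenario` decorator and the `db` state helper below are placeholders for however your environment registers scenarios and exposes state):

```python
@env.scenario("rename-column")  # hypothetical registration API
async def rename_column():
    # Prompt: one unambiguous change the grader can verify.
    yield "Rename the `customer_email` column of the `orders` table to `email`."

    # Grading: inspect environment state directly instead of trusting the
    # agent's description of what it did.
    columns = db.get_columns("orders")  # placeholder state-access helper
    yield 1.0 if "email" in columns and "customer_email" not in columns else 0.0
```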

### Be Specific
2 changes: 1 addition & 1 deletion docs/index.mdx
@@ -63,7 +63,7 @@ async def find_answer(question: str):
yield 1.0 if "correct" in answer.lower() else 0.0
```

Scenarios define the prompt (first yield) and the scoring logic (second yield). The agent runs in between.
**Scenarios are the atomic skills your agent must get right.** Each one defines a prompt (first yield) and scoring logic (second yield). If your agent can't reliably pass a scenario, that's a gap to close—through prompting, fine-tuning, or tool design.

→ [More on Environments](/quick-links/environments)

48 changes: 48 additions & 0 deletions docs/platform/environments.mdx
@@ -191,6 +191,54 @@ hud eval my-env/checkout --model gpt-4o --group-size 10

For local debugging before pushing, see [`hud debug`](/reference/cli/debug).

## Debugging Traces

When viewing a trace (task run), the environment pane on the right side provides powerful debugging tools through two special tabs:

### DEBUG Tab

The DEBUG tab shows low-level information about the environment and agent execution:

- **Environment Info** — Container ID, pod status, and connection details
- **MCP Operations** — All MCP protocol messages including prompts, resources, and internal operations that aren't tool calls
- **Raw Attributes** — Expand any operation to see the full request/response payload
- **Worker Logs** (Admin only) — Server-side logs from the Celery worker that executed the rollout

This is useful for diagnosing:
- Why a scenario setup or evaluation failed
- MCP protocol issues between the agent and environment
- Authentication or connection problems
- Server-side errors that don't surface in the agent trace

### LOGS Tab

The LOGS tab shows container stdout/stderr from the environment:

- **Real-time streaming** — Logs update as the environment runs
- **Timestamp filtering** — See when specific events occurred
- **Error highlighting** — Errors and warnings are visually distinct

This helps debug:
- Environment startup issues
- Tool execution failures
- Python exceptions in your environment code
- Resource exhaustion (memory, CPU, disk)

<Tip>
If an agent run fails with no obvious error, check the LOGS tab first—often the environment container logged an exception that explains what went wrong.
</Tip>

### Accessing Debug Information

1. Open any trace at `hud.ai/trace/{id}`
2. Look at the environment pane on the right side
3. Click the **DEBUG** or **LOGS** tab at the top
4. For MCP operations in DEBUG, click the expand icon to see full payloads

<Note>
Worker logs in the DEBUG tab are only visible to platform administrators. Regular users see environment logs in the LOGS tab.
</Note>

## Next Steps

<CardGroup cols={2}>
7 changes: 6 additions & 1 deletion docs/platform/index.mdx
@@ -14,10 +14,15 @@ Sign in at [hud.ai](https://hud.ai) with your account. The main navigation gives
- **Models** — Browse models, view checkpoints, train custom models
- **Environments** — Deploy and monitor your agent environments
- **Tasksets** — Organize tasks for evaluations and benchmarks
- **🔍 MCP** — Connect your AI agent to query platform data

## Quick Links

<CardGroup cols={3}>
<CardGroup cols={2}>
<Card title="MCP Integration" icon="magnifying-glass" href="/platform/mcp">
Let your AI agent query traces, debug environments, and explore tasks
</Card>

<Card title="Models" icon="robot" href="/platform/models">
Browse, fork, and train models
</Card>
101 changes: 101 additions & 0 deletions docs/platform/mcp.mdx
@@ -0,0 +1,101 @@
---
title: "MCP Integration"
description: "Let your AI agent query the platform. Analyze traces, debug environments, explore tasks."
icon: "magnifying-glass"
---

Connect your AI agent to the HUD platform via MCP. Your agent can query traces from the Home dashboard, check environment build status, explore your tasksets—all through natural conversation. When you're reviewing jobs and spot failure patterns, ask your agent to analyze them and suggest new tasks.

## Setup

Click the 🔍 button in the platform header to get the config, or add manually:

```json
{
"hud": {
"url": "https://api.hud.ai/v3/mcp/",
"headers": {
"Authorization": "Bearer YOUR_HUD_API_KEY"
}
}
}
```

Get your API key from [Settings → API Keys](https://hud.ai/project/api-keys).

## Analyze Traces

From the Home dashboard, you see your recent jobs and traces. With MCP, your agent can dig deeper:

```
"Get the traces from my last failed job and explain what the agent did wrong."
```

```
"Show me traces where the reward was 0. What patterns do you see in how the agent failed?"
```

Your agent retrieves the trace data—every action, tool call, and response—and helps you understand what happened.

## Debug Environments

When an environment build fails or behaves unexpectedly, ask your agent to investigate:

```
"Check the status of my remote-browser environment."
```

```
"List my environments and tell me which ones are ready vs still building."
```

This surfaces the same info you see on the Environments page, but lets you query it conversationally while you're working.

## Explore Tasksets

Browse your tasksets and see what's in each one:

```
"What tasksets do I have? How many tasks are in SheetBench-50?"
```

```
"Show me the tasks in my latest evalset and describe what they test."
```

## Write New Tasks from Failures

The real power: after analyzing failed traces, have your agent suggest new tasks that target those weaknesses.

```
"Based on the failures you found, write 3 new tasks that would test
those specific edge cases."
```

This closes the loop—run evals → analyze failures → create targeted tasks → run again.

## Available Tools

| Tool | What it queries |
|------|-----------------|
| `list_jobs` | Your jobs from Home (status, metrics) |
| `get_job` | Job details and summary |
| `get_job_traces` | Traces in a job |
| `get_trace` | Full trace with trajectory and logs |
| `list_environments` | Your environments from Environments page |
| `get_environment` | Environment details and build status |
| `list_evalsets` | Your tasksets from Tasksets page |
| `get_evalset_tasks` | Tasks in a specific evalset |
| `list_scenarios` | Scenarios for an environment |

All read-only—your agent can query but not modify platform data.
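
These are the same calls an MCP-connected agent makes. To try them from a script, a sketch using the FastMCP client may help (the transport class, header handling, and tool arguments below are assumptions; check your installed `fastmcp` version):

```python
import asyncio
import os

from fastmcp import Client
from fastmcp.client.transports import StreamableHttpTransport

async def main() -> None:
    # Same endpoint and Authorization header as the JSON config above.
    transport = StreamableHttpTransport(
        url="https://api.hud.ai/v3/mcp/",
        headers={"Authorization": f"Bearer {os.environ['HUD_API_KEY']}"},
    )
    async with Client(transport) as client:
        tools = await client.list_tools()
        print([tool.name for tool in tools])  # expect list_jobs, get_trace, ...

        jobs = await client.call_tool("list_jobs", {})  # argument shape is an assumption
        print(jobs)

asyncio.run(main())
```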

<CardGroup cols={2}>
<Card title="Environments" icon="cube" href="/platform/environments">
Deploy and manage agent environments
</Card>

<Card title="Tasksets" icon="list-check" href="/platform/tasksets">
Organize tasks for evaluation
</Card>
</CardGroup>
60 changes: 53 additions & 7 deletions docs/reference/environments.mdx
@@ -303,21 +303,32 @@ env.serve(transport="streamable-http", host="0.0.0.0", port=8765)

### http_app()

Get a Starlette/ASGI app to mount on an existing FastAPI server:
Get a Starlette/ASGI app to mount on an existing FastAPI server. This is inherited from FastMCP and enables deployment on platforms like Railway, Fly.io, or Vercel.

```python
from contextlib import asynccontextmanager
from fastapi import FastAPI
from hud import Environment

app = FastAPI()
env = Environment("my-env")

@env.tool()
def my_tool(arg: str) -> str:
return f"Got: {arg}"

# Mount the HUD environment's MCP endpoint at /mcp
app.mount("/mcp", env.http_app())
# Create the MCP app with stateless_http=True for multi-replica deployments
mcp_app = env.http_app(path="/", stateless_http=True)

@asynccontextmanager
async def lifespan(app: FastAPI):
# Enter BOTH the environment context AND the MCP app's lifespan
async with env, mcp_app.router.lifespan_context(mcp_app):
yield

app = FastAPI(lifespan=lifespan, redirect_slashes=False)

# Mount the MCP app
app.mount("/mcp", mcp_app)

# Your other FastAPI routes work normally
@app.get("/health")
@@ -328,10 +339,45 @@ def health():
| Parameter | Type | Description | Default |
|-----------|------|-------------|---------|
| `path` | `str \| None` | Internal path for the MCP endpoint | `"/"` |
| `transport` | `Literal["http", "streamable-http", "sse"]` | Transport protocol | `"http"` |
| `stateless_http` | `bool` | Stateless mode for multi-replica deployments | `False` |
| `middleware` | `list[ASGIMiddleware] \| None` | Starlette middleware | `None` |
| `json_response` | `bool \| None` | Use JSON response format | `None` |
| `stateless_http` | `bool \| None` | Use stateless HTTP mode | `None` |

<Warning>
**Lifespan is critical.** You must enter both `env` (the Environment context) and `mcp_app.router.lifespan_context(mcp_app)` (the MCP session manager). Missing either will cause tools to fail or sessions to not initialize.
</Warning>

#### Stateless HTTP Mode

Enable `stateless_http=True` when deploying to platforms with multiple replicas (Railway, Fly.io, etc.). This ensures each request creates a fresh transport context, eliminating session affinity requirements:

```python
# For single-replica or sticky sessions
mcp_app = env.http_app(path="/")

# For multi-replica deployments (Railway, Fly.io, Vercel)
mcp_app = env.http_app(path="/", stateless_http=True)
```

#### Authentication via Headers

For authenticated tools, use FastMCP's `get_http_headers()` to extract the API key:

```python
from fastmcp.server.dependencies import get_http_headers

@env.tool()
async def protected_tool(query: str) -> dict:
"""A tool that requires authentication."""
headers = get_http_headers()
auth_header = headers.get("authorization", "")

if not auth_header.startswith("Bearer "):
return {"error": "Missing API key"}

api_key = auth_header[7:] # Remove "Bearer " prefix
# Validate api_key and proceed...
return {"result": "authenticated"}
```

MCP clients can then connect at `http://your-server/mcp`:

17 changes: 12 additions & 5 deletions hud/environment/environment.py
@@ -225,6 +225,9 @@ async def _broadcast_tool(
Automatically filters to only connections where the tool exists
(based on cached_tools from initial discovery).

For internal tools (starting with _), tries ALL connections since
internal tools are hidden from list_tools() and won't be in cached_tools.

Args:
tool_name: Name of the tool to call
**kwargs: Arguments to pass to the tool
@@ -234,10 +237,13 @@
"""
import asyncio

# Only call connections that have this tool
targets = self._connections_with_tool(tool_name)
if not targets:
return {}
# For internal tools (underscore prefix), try ALL connections since
# they're hidden from list_tools() and won't appear in cached_tools.
# For regular tools, only try connections that advertise the tool.
if tool_name.startswith("_"):
targets = set(self._connections.keys())
else:
targets = self._connections_with_tool(tool_name)

results: dict[str, Any] = {}

@@ -246,7 +252,8 @@ async def call_one(name: str) -> None:
if not connector or not connector.client:
return
try:
results[name] = await connector.client.call_tool(tool_name, **kwargs)
# Use connector.call_tool which expects arguments as a dict
results[name] = await connector.call_tool(tool_name, kwargs)
logger.debug("Broadcast '%s' to '%s' succeeded", tool_name, name)
except Exception as e:
results[name] = e