spacedriveapp · jamiepine · Feb 22, 2026 · Feb 21, 2026
diff --git a/METRICS.md b/METRICS.md
@@ -1,6 +1,6 @@
 # Metrics Reference
 
-Comprehensive reference for Spacebot's Prometheus metrics. For quick-start setup, see `docs/metrics.md`.
+Comprehensive reference for Spacebot's Prometheus metrics. For quick-start setup, see `docs/metrics.md`. For the published docs, see the metrics page on [docs.spacebot.sh](https://docs.spacebot.sh).
 
 ## Feature Gate
 
@@ -27,9 +27,7 @@ All metrics are prefixed with `spacebot_`. The registry uses a private `promethe
 | Instrumented in | `src/llm/model.rs` — `SpacebotModel::completion()` |
 | Description | Total LLM completion requests (one per `completion()` call, including retries and fallbacks). |
 
-**Cardinality:** `agents × models × tiers`. Currently `agent_id` and `tier` are hardcoded to `"unknown"` because `SpacebotModel` doesn't carry process context. Effective cardinality is just the number of distinct model names (typically 5–15). Once agent context is threaded through, expect `agents(1–5) × models(5–15) × tiers(5)` = 25–375 series.
-
-**Known limitation:** Labels `agent_id` and `tier` are always `"unknown"`. The `SpacebotHook` has these values but can't be used here without structural changes.
+**Cardinality:** `agents × models × tiers`. With agent context wired, expect `agents(1–5) × models(5–15) × tiers(5)` = 25–375 series.
 
 #### `spacebot_tool_calls_total`
 
@@ -40,7 +38,7 @@ All metrics are prefixed with `spacebot_`. The registry uses a private `promethe
 | Instrumented in | `src/hooks/spacebot.rs` — `SpacebotHook::on_tool_result()` |
 | Description | Total tool calls executed across all processes. Incremented after each tool call completes (success or failure). |
 
-**Cardinality:** `agents × tools`. With 1–5 agents and ~20 tool names, expect 20–100 series. Tool names are a bounded set defined in `src/tools/`.
+**Cardinality:** `agents × tools`. With 1–5 agents and ~20 tool names, expect 20–100 series.
 
 #### `spacebot_memory_reads_total`
 
@@ -62,6 +60,52 @@ All metrics are prefixed with `spacebot_`. The registry uses a private `promethe
 
 **Cardinality:** 1 series.
 
+#### `spacebot_llm_tokens_total`
+
+| Field | Value |
+|-------|-------|
+| Type | `IntCounterVec` |
+| Labels | `agent_id`, `model`, `tier`, `direction` |
+| Instrumented in | `src/llm/model.rs` — `SpacebotModel::completion()` |
+| Description | Total LLM tokens consumed. `direction` is one of `input`, `output`, or `cached_input`. |
+
+**Cardinality:** `agents × models × tiers × 3`. Expect 75–1125 series.
+
+#### `spacebot_llm_estimated_cost_dollars`
+
+| Field | Value |
+|-------|-------|
+| Type | `CounterVec` (f64) |
+| Labels | `agent_id`, `model`, `tier` |
+| Instrumented in | `src/llm/model.rs` — `SpacebotModel::completion()` |
+| Description | Estimated LLM cost in USD. Uses a built-in pricing table (`src/llm/pricing.rs`). |
+
+**Cardinality:** Same as `spacebot_llm_requests_total`.
+
+**Note:** Costs are best-effort estimates. The pricing table covers major models (Claude 4/3.5/3, GPT-4o, o-series, Gemini, DeepSeek) and falls back to a conservative rate for unknown models ($3 per million input tokens, $15 per million output tokens).
+
+#### `spacebot_process_errors_total`
+
+| Field | Value |
+|-------|-------|
+| Type | `IntCounterVec` |
+| Labels | `agent_id`, `process_type`, `error_type` |
+| Instrumented in | `src/llm/model.rs` — `SpacebotModel::completion()` error paths |
+| Description | Process errors by type. `error_type` classifies the failure (timeout, rate_limit, auth, server, provider, unknown). |
+
+**Cardinality:** `agents × process_types × error_types`. Expect 15–75 series.
+
+#### `spacebot_memory_updates_total`
+
+| Field | Value |
+|-------|-------|
+| Type | `IntCounterVec` |
+| Labels | `agent_id`, `operation` |
+| Instrumented in | `src/memory/store.rs` (save/delete), `src/tools/memory_save.rs`, `src/tools/memory_delete.rs` (forget) |
+| Description | Memory mutation operations. `operation` is one of `save`, `delete`, or `forget`. |
+
+**Cardinality:** `agents × operations(3)`. Expect 3–15 series.
+
 ### Histograms
 
 #### `spacebot_llm_request_duration_seconds`
@@ -70,16 +114,12 @@ All metrics are prefixed with `spacebot_`. The registry uses a private `promethe
 |-------|-------|
 | Type | `HistogramVec` |
 | Labels | `agent_id`, `model`, `tier` |
-| Buckets | 0.1, 0.25, 0.5, 1, 2.5, 5, 10 |
+| Buckets | 0.1, 0.25, 0.5, 1, 2.5, 5, 10, 15, 30, 60, 120 |
 | Instrumented in | `src/llm/model.rs` — `SpacebotModel::completion()` |
 | Description | End-to-end LLM request duration in seconds. Includes retry loops and fallback chain traversal. |
 
 **Cardinality:** Same as `spacebot_llm_requests_total` (per-bucket overhead is fixed, not per-series).
 
-**Known limitation:** Buckets max out at 10s. LLM requests with retries and fallbacks routinely exceed 10s (15–60s is common). Everything above 10s collapses into the +Inf bucket, losing resolution. A future fix should extend buckets to cover `[..., 15, 30, 60, 120]`.
-
-**What the timer measures:** The timer wraps the entire `completion()` method body, including all retry attempts on the primary model and the full fallback chain. This measures user-perceived latency, not individual provider call latency.
-
 #### `spacebot_tool_call_duration_seconds`
 
 | Field | Value |
@@ -91,7 +131,19 @@ All metrics are prefixed with `spacebot_`. The registry uses a private `promethe
 
 **Cardinality:** 1 series.
 
-**Implementation note:** Duration is tracked via a `LazyLock<Mutex<HashMap<String, Instant>>>` static keyed by Rig's internal call ID. The timer starts in `on_tool_call` and is consumed in `on_tool_result`. If a tool call starts but the agent terminates before `on_tool_result` fires (e.g. leak detection terminates the agent), the timer entry remains in the map. These orphaned entries are small (String + Instant) and bounded by concurrent tool calls, so this is not a practical concern.
+**Implementation note:** Duration is tracked via a `LazyLock<Mutex<HashMap<String, Instant>>>` static keyed by Rig's internal call ID. If a tool call starts but the agent terminates before `on_tool_result` fires (e.g. leak detection), the timer entry remains in the map. Such orphaned entries are bounded by the number of concurrent tool calls, so this is not a practical concern.
+
+#### `spacebot_worker_duration_seconds`
+
+| Field | Value |
+|-------|-------|
+| Type | `HistogramVec` |
+| Labels | `agent_id`, `worker_type` |
+| Buckets | 1, 5, 10, 30, 60, 120, 300, 600, 1800 |
+| Instrumented in | `src/agent/channel.rs` — `spawn_worker_task()` |
+| Description | Worker lifetime duration in seconds from spawn to completion. |
+
+**Cardinality:** `agents × worker_types`. Currently `worker_type` is `"builtin"` — expect 1–5 series.
 
 ### Gauges
 
@@ -102,40 +154,55 @@ All metrics are prefixed with `spacebot_`. The registry uses a private `promethe
 | Type | `IntGaugeVec` |
 | Labels | `agent_id` |
 | Instrumented in | `src/agent/channel.rs` — `spawn_worker_task()` |
-| Description | Currently active workers. Incremented when a worker task is spawned, decremented when it completes (success or failure). Covers both builtin Rig workers and OpenCode workers. |
+| Description | Currently active workers. Incremented when a worker task is spawned, decremented when it completes. |
 
-**Cardinality:** Number of agents (typically 1–5).
+**Cardinality:** Number of agents (1–5).
 
 #### `spacebot_memory_entry_count`
 
 | Field | Value |
 |-------|-------|
 | Type | `IntGaugeVec` |
 | Labels | `agent_id` |
-| Instrumented in | **Not instrumented.** Defined in registry but not wired to any call site. |
-| Description | Intended to track total memory entries per agent. |
+| Instrumented in | `src/memory/store.rs` — `save()` (inc) and `delete()` (dec) |
+| Description | Approximate memory entry count per agent. Tracks net saves minus deletes — starts at 0 on process start, not the actual database count. |
 
-**Cardinality:** Number of agents (typically 1–5). Currently always 0.
+**Cardinality:** Number of agents (1–5).
 
-**Status:** Requires periodic store queries or integration into `MemoryStore::save()` / `MemoryStore::delete()` to maintain an accurate count. Not blocked for merge — the metric is registered but idle.
+**Note:** This gauge tracks deltas from process start, not the absolute database count. On restart it resets to 0. For the true count, query the database directly.
 
-## Total Cardinality
+#### `spacebot_active_branches`
 
-With the current instrumentation (hardcoded `"unknown"` labels on LLM metrics):
+| Field | Value |
+|-------|-------|
+| Type | `IntGaugeVec` |
+| Labels | `agent_id` |
+| Instrumented in | `src/agent/channel.rs` — branch spawn (inc) and completion (dec) |
+| Description | Currently active branches per agent. |
+
+**Cardinality:** Number of agents (1–5).
+
+## Total Cardinality
 
 | Metric | Series estimate |
 |--------|-----------------|
-| `llm_requests_total` | ~10 (distinct models) |
-| `tool_calls_total` | ~20–100 (agents × tools) |
+| `llm_requests_total` | ~25–375 |
+| `llm_tokens_total` | ~75–1125 |
+| `llm_estimated_cost_dollars` | ~25–375 |
+| `tool_calls_total` | ~20–100 |
 | `memory_reads_total` | 1 |
 | `memory_writes_total` | 1 |
-| `llm_request_duration_seconds` | ~10 (distinct models) |
+| `llm_request_duration_seconds` | ~25–375 |
 | `tool_call_duration_seconds` | 1 |
-| `active_workers` | ~1–5 (agents) |
-| `memory_entry_count` | 0 (not instrumented) |
-| **Total** | **~45–130** |
+| `worker_duration_seconds` | ~1–5 |
+| `active_workers` | ~1–5 |
+| `active_branches` | ~1–5 |
+| `memory_entry_count` | ~1–5 |
+| `process_errors_total` | ~15–75 |
+| `memory_updates_total` | ~3–15 |
+| **Total** | **~195–2463** |
 
-This is well within safe operating range for any Prometheus deployment.
+This is well within the safe operating range for a typical Prometheus deployment.
 
 ## Feature Gate Consistency
 
@@ -147,9 +214,11 @@ Every instrumentation call site uses `#[cfg(feature = "metrics")]` at the statem
 | `src/main.rs` | `#[cfg(feature = "metrics")] let _metrics_handle = ...` |
 | `src/llm/model.rs` | `#[cfg(feature = "metrics")] let start` + `#[cfg(feature = "metrics")] { ... }` |
 | `src/hooks/spacebot.rs` | `#[cfg(feature = "metrics")] static TOOL_CALL_TIMERS` + 2 blocks |
-| `src/tools/memory_save.rs` | `#[cfg(feature = "metrics")] crate::telemetry::Metrics::global()...` |
+| `src/tools/memory_save.rs` | `#[cfg(feature = "metrics")] { ... }` |
 | `src/tools/memory_recall.rs` | `#[cfg(feature = "metrics")] crate::telemetry::Metrics::global()...` |
-| `src/agent/channel.rs` | `#[cfg(feature = "metrics")] ...` (×2, inc + dec) |
+| `src/tools/memory_delete.rs` | `#[cfg(feature = "metrics")] crate::telemetry::Metrics::global()...` |
+| `src/memory/store.rs` | `#[cfg(feature = "metrics")] if _result...` + `#[cfg(feature = "metrics")] { ... }` |
+| `src/agent/channel.rs` | `#[cfg(feature = "metrics")]` (×4, branches + workers) |
 | `Cargo.toml` | `prometheus = { version = "0.13", optional = true }`, `metrics = ["dep:prometheus"]` |
 
 All consistent. No path references `crate::telemetry` without a `cfg` gate.
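
Review note: the **Total** row in the `METRICS.md` cardinality table should equal the column-wise sum of the per-metric ranges. A minimal sketch for re-deriving it, assuming the low and high bounds are copied directly from the table (the `ESTIMATES` constant and `total_series` function are hypothetical names for illustration, not part of Spacebot):

```rust
/// Inclusive (low, high) series estimate for one metric.
type Range = (u64, u64);

/// Per-metric estimates copied from the "Total Cardinality" table in METRICS.md.
const ESTIMATES: [(&str, Range); 14] = [
    ("llm_requests_total", (25, 375)),
    ("llm_tokens_total", (75, 1125)),
    ("llm_estimated_cost_dollars", (25, 375)),
    ("tool_calls_total", (20, 100)),
    ("memory_reads_total", (1, 1)),
    ("memory_writes_total", (1, 1)),
    ("llm_request_duration_seconds", (25, 375)),
    ("tool_call_duration_seconds", (1, 1)),
    ("worker_duration_seconds", (1, 5)),
    ("active_workers", (1, 5)),
    ("active_branches", (1, 5)),
    ("memory_entry_count", (1, 5)),
    ("process_errors_total", (15, 75)),
    ("memory_updates_total", (3, 15)),
];

/// Sums the low bounds and the high bounds separately.
fn total_series() -> Range {
    ESTIMATES
        .iter()
        .fold((0, 0), |(lo, hi), (_, (l, h))| (lo + l, hi + h))
}
```

Re-running this whenever a metric is added or a range changes keeps the total row in step with the rows above it.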

diff --git a/docs/content/docs/(deployment)/meta.json b/docs/content/docs/(deployment)/meta.json
@@ -1,4 +1,4 @@
 {
   "title": "Deployment",
-  "pages": ["hosted", "roadmap"]
+  "pages": ["hosted", "metrics", "roadmap"]
 }
diff --git a/docs/content/docs/(deployment)/metrics.mdx b/docs/content/docs/(deployment)/metrics.mdx
@@ -0,0 +1,155 @@
+---
+title: Metrics
+description: Prometheus-compatible metrics for monitoring Spacebot's LLM usage, costs, and agent activity.
+---
+
+# Metrics
+
+Spacebot exposes Prometheus-compatible metrics for monitoring LLM costs, token usage, agent activity, and memory operations. All telemetry code is behind the `metrics` cargo feature flag — without it, every instrumentation block compiles out to nothing.
+
+## Building with Metrics
+
+```bash
+cargo build --release --features metrics
+```
+
+## Configuration
+
+Add a `[metrics]` block to your `spacebot.toml`:
+
+```toml
+[metrics]
+enabled = true
+port = 9090
+bind = "0.0.0.0"
+```
+
+| Key       | Default       | Description                          |
+| --------- | ------------- | ------------------------------------ |
+| `enabled` | `false`       | Enable the /metrics HTTP endpoint    |
+| `port`    | `9090`        | Port for the metrics server          |
+| `bind`    | `"0.0.0.0"`  | Address to bind the metrics server   |
+
+The metrics server runs as a separate tokio task alongside the main API server and shuts down gracefully with the rest of the process.
+
+## Endpoints
+
+| Path       | Description                                |
+| ---------- | ------------------------------------------ |
+| `/metrics` | Prometheus text exposition format (0.0.4)  |
+| `/health`  | Returns 200 OK (for liveness probes)       |
+
+## Exposed Metrics
+
+All metrics are prefixed with `spacebot_`.
+
+### LLM Metrics
+
+These metrics track every LLM completion request, including token counts and estimated costs.
+
+| Metric | Type | Labels | Description |
+| ------ | ---- | ------ | ----------- |
+| `spacebot_llm_requests_total` | Counter | `agent_id`, `model`, `tier` | Total LLM completion requests |
+| `spacebot_llm_request_duration_seconds` | Histogram | `agent_id`, `model`, `tier` | End-to-end LLM request duration |
+| `spacebot_llm_tokens_total` | Counter | `agent_id`, `model`, `tier`, `direction` | Token counts (`direction`: input, output, cached_input) |
+| `spacebot_llm_estimated_cost_dollars` | Counter | `agent_id`, `model`, `tier` | Estimated cost in USD |
+
+The `tier` label corresponds to the process type: `channel`, `branch`, `worker`, `compactor`, or `cortex`.
+
+The `direction` label on token counts distinguishes input tokens, output (completion) tokens, and cached input tokens. Cached tokens are billed at a lower rate by most providers.
+
+**Cost estimation** uses a built-in pricing table covering Claude, GPT-4o, o-series, Gemini, and DeepSeek models. Unknown models use a conservative fallback rate. Costs are best-effort estimates — exact billing depends on your provider agreement.
+
+### Tool Metrics
+
+| Metric | Type | Labels | Description |
+| ------ | ---- | ------ | ----------- |
+| `spacebot_tool_calls_total` | Counter | `agent_id`, `tool_name` | Total tool calls executed |
+| `spacebot_tool_call_duration_seconds` | Histogram | — | Tool call execution duration |
+
+### Agent & Worker Metrics
+
+| Metric | Type | Labels | Description |
+| ------ | ---- | ------ | ----------- |
+| `spacebot_active_workers` | Gauge | `agent_id` | Currently active workers |
+| `spacebot_active_branches` | Gauge | `agent_id` | Currently active branches |
+| `spacebot_worker_duration_seconds` | Histogram | `agent_id`, `worker_type` | Worker lifetime duration |
+| `spacebot_process_errors_total` | Counter | `agent_id`, `process_type`, `error_type` | Process errors by type |
+
+### Memory Metrics
+
+| Metric | Type | Labels | Description |
+| ------ | ---- | ------ | ----------- |
+| `spacebot_memory_reads_total` | Counter | — | Total memory recall operations |
+| `spacebot_memory_writes_total` | Counter | — | Total memory save operations |
+| `spacebot_memory_entry_count` | Gauge | `agent_id` | Approximate memory entries per agent (net saves minus deletes since process start; resets to 0 on restart) |
+| `spacebot_memory_updates_total` | Counter | `agent_id`, `operation` | Memory mutations (`operation`: save, delete, forget) |
+
+## Cost Tracking
+
+Token usage and estimated costs are recorded on every request. To see total estimated spend per agent since the process last started:
+
+```promql
+sum(spacebot_llm_estimated_cost_dollars) by (agent_id)
+```
+
+To see the average spend rate in dollars per hour over the last hour, per agent and model:
+
+```promql
+sum(rate(spacebot_llm_estimated_cost_dollars[1h])) by (agent_id, model) * 3600
+```
+
+To see token throughput (tokens per second) by direction:
+
+```promql
+sum(rate(spacebot_llm_tokens_total[5m])) by (direction)
+```
+
+## Prometheus Scrape Config
+
+```yaml
+scrape_configs:
+  - job_name: spacebot
+    scrape_interval: 15s
+    static_configs:
+      - targets: ["localhost:9090"]
+```
+
+## Docker
+
+Expose the metrics port alongside the API port:
+
+```bash
+docker run -d \
+  --name spacebot \
+  -e ANTHROPIC_API_KEY="sk-ant-..." \
+  -v spacebot-data:/data \
+  -p 19898:19898 \
+  -p 9090:9090 \
+  ghcr.io/spacedriveapp/spacebot:slim
+```
+
+The Docker image must be built with `--features metrics` for this to work. The default images do not include metrics support.
+
+## Histogram Buckets
+
+| Metric | Buckets (seconds) |
+| ------ | ----------------- |
+| `llm_request_duration_seconds` | 0.1, 0.25, 0.5, 1, 2.5, 5, 10, 15, 30, 60, 120 |
+| `tool_call_duration_seconds` | 0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, 30 |
+| `worker_duration_seconds` | 1, 5, 10, 30, 60, 120, 300, 600, 1800 |
+
+## Cardinality
+
+| Metric | Estimated series |
+| ------ | --------------- |
+| `llm_requests_total` | agents × models × tiers (~25–375) |
+| `llm_tokens_total` | agents × models × tiers × 3 directions (~75–1125) |
+| `llm_estimated_cost_dollars` | agents × models × tiers (~25–375) |
+| `tool_calls_total` | agents × tools (~20–100) |
+| `active_workers` / `active_branches` | agents (~1–5 each) |
+| `process_errors_total` | agents × process_types × error_types (~15–75) |
+| `memory_*` | 1–10 per metric |
+| **Total** | **~160–2000** |
+
+This is well within the safe operating range for a typical Prometheus deployment.
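
Review note: the cost paragraph in `metrics.mdx` mentions a fallback rate for unknown models, and `METRICS.md` gives it as $3/M input and $15/M output. A rough sketch of how such an estimate could be computed, assuming a flat per-million-token rate and ignoring cached-input discounts (the `Rates` struct and `estimate_cost_usd` function are hypothetical and do not mirror the actual `src/llm/pricing.rs` API):

```rust
/// Hypothetical per-million-token rates in USD (illustrative only).
#[derive(Clone, Copy)]
struct Rates {
    input_per_million: f64,
    output_per_million: f64,
}

/// Fallback rates quoted in METRICS.md for models missing from the pricing table.
const FALLBACK: Rates = Rates {
    input_per_million: 3.0,
    output_per_million: 15.0,
};

/// Estimates the cost of one request in USD from its token counts.
/// Cached input tokens are not modeled here; that is a simplifying assumption.
fn estimate_cost_usd(rates: Rates, input_tokens: u64, output_tokens: u64) -> f64 {
    (input_tokens as f64 * rates.input_per_million
        + output_tokens as f64 * rates.output_per_million)
        / 1_000_000.0
}
```

Under these assumed rates, a request with 1,000,000 input tokens and 1,000,000 output tokens would be estimated at $18.00.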