125 changes: 97 additions & 28 deletions METRICS.md
@@ -1,6 +1,6 @@
# Metrics Reference

Comprehensive reference for Spacebot's Prometheus metrics. For quick-start setup, see `docs/metrics.md`.
Comprehensive reference for Spacebot's Prometheus metrics. For quick-start setup, see `docs/metrics.md`. For the published docs, see the metrics page on [docs.spacebot.sh](https://docs.spacebot.sh).

## Feature Gate

@@ -27,9 +27,7 @@ All metrics are prefixed with `spacebot_`. The registry uses a private `promethe
| Instrumented in | `src/llm/model.rs` — `SpacebotModel::completion()` |
| Description | Total LLM completion requests (one per `completion()` call, including retries and fallbacks). |

**Cardinality:** `agents × models × tiers`. Currently `agent_id` and `tier` are hardcoded to `"unknown"` because `SpacebotModel` doesn't carry process context. Effective cardinality is just the number of distinct model names (typically 5–15). Once agent context is threaded through, expect `agents(1–5) × models(5–15) × tiers(5)` = 25–375 series.

**Known limitation:** Labels `agent_id` and `tier` are always `"unknown"`. The `SpacebotHook` has these values but can't be used here without structural changes.
**Cardinality:** `agents × models × tiers`. With agent context wired, expect `agents(1–5) × models(5–15) × tiers(5)` = 25–375 series.

#### `spacebot_tool_calls_total`

@@ -40,7 +38,7 @@ All metrics are prefixed with `spacebot_`. The registry uses a private `promethe
| Instrumented in | `src/hooks/spacebot.rs` — `SpacebotHook::on_tool_result()` |
| Description | Total tool calls executed across all processes. Incremented after each tool call completes (success or failure). |

**Cardinality:** `agents × tools`. With 1–5 agents and ~20 tool names, expect 20–100 series. Tool names are a bounded set defined in `src/tools/`.
**Cardinality:** `agents × tools`. With 1–5 agents and ~20 tool names, expect 20–100 series.

#### `spacebot_memory_reads_total`

@@ -62,6 +60,52 @@ All metrics are prefixed with `spacebot_`. The registry uses a private `promethe

**Cardinality:** 1 series.

#### `spacebot_llm_tokens_total`

| Field | Value |
|-------|-------|
| Type | `IntCounterVec` |
| Labels | `agent_id`, `model`, `tier`, `direction` |
| Instrumented in | `src/llm/model.rs` — `SpacebotModel::completion()` |
| Description | Total LLM tokens consumed. `direction` is one of `input`, `output`, or `cached_input`. |

**Cardinality:** `agents × models × tiers × 3`. Expect 75–1125 series.

#### `spacebot_llm_estimated_cost_dollars`

| Field | Value |
|-------|-------|
| Type | `CounterVec` (f64) |
| Labels | `agent_id`, `model`, `tier` |
| Instrumented in | `src/llm/model.rs` — `SpacebotModel::completion()` |
| Description | Estimated LLM cost in USD. Uses a built-in pricing table (`src/llm/pricing.rs`). |

**Cardinality:** Same as `spacebot_llm_requests_total`.

**Note:** Costs are best-effort estimates. The pricing table covers major models (Claude 4/3.5/3, GPT-4o, o-series, Gemini, DeepSeek) with a conservative fallback for unknown models ($3/M input, $15/M output).
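
As a worked example of the fallback rate in the note above, the arithmetic can be sketched as follows. This is a minimal illustration, not the code in `src/llm/pricing.rs`: the function name and constants are hypothetical, and the cached-input rate is left out because the note does not state one.

```rust
/// Fallback rates quoted in the note above, in USD per million tokens.
/// The constant names are assumptions made for this sketch.
const FALLBACK_INPUT_PER_MTOK: f64 = 3.0;
const FALLBACK_OUTPUT_PER_MTOK: f64 = 15.0;

/// Hypothetical helper: estimate the cost of one request for a model
/// that is missing from the pricing table.
fn fallback_cost_usd(input_tokens: u64, output_tokens: u64) -> f64 {
    // Convert token counts to millions of tokens, then apply each rate.
    let input = input_tokens as f64 / 1_000_000.0 * FALLBACK_INPUT_PER_MTOK;
    let output = output_tokens as f64 / 1_000_000.0 * FALLBACK_OUTPUT_PER_MTOK;
    input + output
}
```

Under these assumptions, a request with 2,000 input tokens and 500 output tokens is estimated at $0.006 + $0.0075 = $0.0135.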

#### `spacebot_process_errors_total`

| Field | Value |
|-------|-------|
| Type | `IntCounterVec` |
| Labels | `agent_id`, `process_type`, `error_type` |
| Instrumented in | `src/llm/model.rs` — `SpacebotModel::completion()` error paths |
| Description | Process errors by type. `error_type` classifies the failure (timeout, rate_limit, auth, server, provider, unknown). |

**Cardinality:** `agents × process_types × error_types`. Expect 15–75 series.

#### `spacebot_memory_updates_total`

| Field | Value |
|-------|-------|
| Type | `IntCounterVec` |
| Labels | `agent_id`, `operation` |
| Instrumented in | `src/memory/store.rs` (save/delete), `src/tools/memory_save.rs`, `src/tools/memory_delete.rs` (forget) |
| Description | Memory mutation operations. `operation` is one of `save`, `delete`, or `forget`. |

**Cardinality:** `agents × operations(3)`. Expect 3–15 series.

### Histograms

#### `spacebot_llm_request_duration_seconds`
@@ -70,16 +114,12 @@ All metrics are prefixed with `spacebot_`. The registry uses a private `promethe
|-------|-------|
| Type | `HistogramVec` |
| Labels | `agent_id`, `model`, `tier` |
| Buckets | 0.1, 0.25, 0.5, 1, 2.5, 5, 10 |
| Buckets | 0.1, 0.25, 0.5, 1, 2.5, 5, 10, 15, 30, 60, 120 |
| Instrumented in | `src/llm/model.rs` — `SpacebotModel::completion()` |
| Description | End-to-end LLM request duration in seconds. Includes retry loops and fallback chain traversal. |

**Cardinality:** Same as `spacebot_llm_requests_total` (per-bucket overhead is fixed, not per-series).

**Known limitation:** Buckets max out at 10s. LLM requests with retries and fallbacks routinely exceed 10s (15–60s is common). Everything above 10s collapses into the +Inf bucket, losing resolution. A future fix should extend buckets to cover `[..., 15, 30, 60, 120]`.

**What the timer measures:** The timer wraps the entire `completion()` method body, including all retry attempts on the primary model and the full fallback chain. This measures user-perceived latency, not individual provider call latency.

#### `spacebot_tool_call_duration_seconds`

| Field | Value |
@@ -91,7 +131,19 @@ All metrics are prefixed with `spacebot_`. The registry uses a private `promethe

**Cardinality:** 1 series.

**Implementation note:** Duration is tracked via a `LazyLock<Mutex<HashMap<String, Instant>>>` static keyed by Rig's internal call ID. The timer starts in `on_tool_call` and is consumed in `on_tool_result`. If a tool call starts but the agent terminates before `on_tool_result` fires (e.g. leak detection terminates the agent), the timer entry remains in the map. These orphaned entries are small (String + Instant) and bounded by concurrent tool calls, so this is not a practical concern.
**Implementation note:** Duration is tracked via a `LazyLock<Mutex<HashMap<String, Instant>>>` static keyed by Rig's internal call ID. If a tool call starts but the agent terminates before `on_tool_result` fires (e.g. leak detection), the timer entry remains — bounded by concurrent tool calls, not a practical concern.
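
The timer map described in the implementation note can be sketched roughly like this. It is an illustration only: the static mirrors the type named above, but `start_timer`, `finish_timer`, and `pending_timers` are hypothetical helpers, not functions from `src/hooks/spacebot.rs`.

```rust
use std::collections::HashMap;
use std::sync::{LazyLock, Mutex};
use std::time::{Duration, Instant};

/// Start times keyed by tool call ID, as in the note above.
static TOOL_CALL_TIMERS: LazyLock<Mutex<HashMap<String, Instant>>> =
    LazyLock::new(|| Mutex::new(HashMap::new()));

/// Hypothetical stand-in for the work done when a tool call starts.
fn start_timer(call_id: &str) {
    TOOL_CALL_TIMERS
        .lock()
        .unwrap()
        .insert(call_id.to_string(), Instant::now());
}

/// Hypothetical stand-in for the work done when a tool result arrives.
/// Removing the entry consumes the timer, so a second call returns None.
fn finish_timer(call_id: &str) -> Option<Duration> {
    TOOL_CALL_TIMERS
        .lock()
        .unwrap()
        .remove(call_id)
        .map(|started| started.elapsed())
}

/// Entries whose result never arrived stay in the map until the process exits.
fn pending_timers() -> usize {
    TOOL_CALL_TIMERS.lock().unwrap().len()
}
```

An orphaned entry in this sketch is one `String` plus one `Instant`, which is why the note treats the leak as bounded by the number of concurrent tool calls.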

#### `spacebot_worker_duration_seconds`

| Field | Value |
|-------|-------|
| Type | `HistogramVec` |
| Labels | `agent_id`, `worker_type` |
| Buckets | 1, 5, 10, 30, 60, 120, 300, 600, 1800 |
| Instrumented in | `src/agent/channel.rs` — `spawn_worker_task()` |
| Description | Worker lifetime duration in seconds from spawn to completion. |

**Cardinality:** `agents × worker_types`. Currently `worker_type` is `"builtin"` — expect 1–5 series.

### Gauges

@@ -102,40 +154,55 @@ All metrics are prefixed with `spacebot_`. The registry uses a private `promethe
| Type | `IntGaugeVec` |
| Labels | `agent_id` |
| Instrumented in | `src/agent/channel.rs` — `spawn_worker_task()` |
| Description | Currently active workers. Incremented when a worker task is spawned, decremented when it completes (success or failure). Covers both builtin Rig workers and OpenCode workers. |
| Description | Currently active workers. Incremented when a worker task is spawned, decremented when it completes. |

**Cardinality:** Number of agents (typically 1–5).
**Cardinality:** Number of agents (1–5).

#### `spacebot_memory_entry_count`

| Field | Value |
|-------|-------|
| Type | `IntGaugeVec` |
| Labels | `agent_id` |
| Instrumented in | **Not instrumented.** Defined in registry but not wired to any call site. |
| Description | Intended to track total memory entries per agent. |
| Instrumented in | `src/memory/store.rs` — `save()` (inc) and `delete()` (dec) |
| Description | Approximate memory entry count per agent. Tracks net saves minus deletes — starts at 0 on process start, not the actual database count. |

**Cardinality:** Number of agents (typically 1–5). Currently always 0.
**Cardinality:** Number of agents (1–5).

**Status:** Requires periodic store queries or integration into `MemoryStore::save()` / `MemoryStore::delete()` to maintain an accurate count. Not blocked for merge — the metric is registered but idle.
**Note:** This gauge tracks deltas from process start, not the absolute database count. On restart it resets to 0. For the true count, query the database directly.

## Total Cardinality
#### `spacebot_active_branches`

With the current instrumentation (hardcoded `"unknown"` labels on LLM metrics):
| Field | Value |
|-------|-------|
| Type | `IntGaugeVec` |
| Labels | `agent_id` |
| Instrumented in | `src/agent/channel.rs` — branch spawn (inc) and completion (dec) |
| Description | Currently active branches per agent. |

**Cardinality:** Number of agents (1–5).

## Total Cardinality

| Metric | Series estimate |
|--------|-----------------|
| `llm_requests_total` | ~10 (distinct models) |
| `tool_calls_total` | ~20–100 (agents × tools) |
| `llm_requests_total` | ~25–375 |
| `llm_tokens_total` | ~75–1125 |
| `llm_estimated_cost_dollars` | ~25–375 |
| `tool_calls_total` | ~20–100 |
| `memory_reads_total` | 1 |
| `memory_writes_total` | 1 |
| `llm_request_duration_seconds` | ~10 (distinct models) |
| `llm_request_duration_seconds` | ~25–375 |
| `tool_call_duration_seconds` | 1 |
| `active_workers` | ~1–5 (agents) |
| `memory_entry_count` | 0 (not instrumented) |
| **Total** | **~45–130** |
| `worker_duration_seconds` | ~1–5 |
| `active_workers` | ~1–5 |
| `active_branches` | ~1–5 |
| `memory_entry_count` | ~1–5 |
| `process_errors_total` | ~15–75 |
| `memory_updates_total` | ~3–15 |
| **Total** | **~195–2463** |

This is well within safe operating range for any Prometheus deployment.
Well within safe operating range for any Prometheus deployment.
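
Summing the per-metric bounds is a quick way to re-check the total row whenever a metric is added. The sketch below copies the ranges from the table; the constant and function names are illustrative and do not exist in the codebase.

```rust
/// (metric, low, high) series estimates copied from the table above.
const SERIES_BOUNDS: [(&str, u32, u32); 14] = [
    ("llm_requests_total", 25, 375),
    ("llm_tokens_total", 75, 1125),
    ("llm_estimated_cost_dollars", 25, 375),
    ("tool_calls_total", 20, 100),
    ("memory_reads_total", 1, 1),
    ("memory_writes_total", 1, 1),
    ("llm_request_duration_seconds", 25, 375),
    ("tool_call_duration_seconds", 1, 1),
    ("worker_duration_seconds", 1, 5),
    ("active_workers", 1, 5),
    ("active_branches", 1, 5),
    ("memory_entry_count", 1, 5),
    ("process_errors_total", 15, 75),
    ("memory_updates_total", 3, 15),
];

/// Hypothetical helper: add up the low and high estimates separately.
fn total_series() -> (u32, u32) {
    SERIES_BOUNDS
        .iter()
        .fold((0, 0), |(low, high), &(_, l, h)| (low + l, high + h))
}
```

These figures count label combinations, as the table does. A histogram also exposes one series per bucket plus sum and count for each combination, so the number of stored time series is higher than this total.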

## Feature Gate Consistency

@@ -147,9 +214,11 @@ Every instrumentation call site uses `#[cfg(feature = "metrics")]` at the statem
| `src/main.rs` | `#[cfg(feature = "metrics")] let _metrics_handle = ...` |
| `src/llm/model.rs` | `#[cfg(feature = "metrics")] let start` + `#[cfg(feature = "metrics")] { ... }` |
| `src/hooks/spacebot.rs` | `#[cfg(feature = "metrics")] static TOOL_CALL_TIMERS` + 2 blocks |
| `src/tools/memory_save.rs` | `#[cfg(feature = "metrics")] crate::telemetry::Metrics::global()...` |
| `src/tools/memory_save.rs` | `#[cfg(feature = "metrics")] { ... }` |
| `src/tools/memory_recall.rs` | `#[cfg(feature = "metrics")] crate::telemetry::Metrics::global()...` |
| `src/agent/channel.rs` | `#[cfg(feature = "metrics")] ...` (×2, inc + dec) |
| `src/tools/memory_delete.rs` | `#[cfg(feature = "metrics")] crate::telemetry::Metrics::global()...` |
| `src/memory/store.rs` | `#[cfg(feature = "metrics")] if _result...` + `#[cfg(feature = "metrics")] { ... }` |
| `src/agent/channel.rs` | `#[cfg(feature = "metrics")]` (×4, branches + workers) |
| `Cargo.toml` | `prometheus = { version = "0.13", optional = true }`, `metrics = ["dep:prometheus"]` |

All consistent. No path references `crate::telemetry` without a `cfg` gate.
2 changes: 1 addition & 1 deletion docs/content/docs/(deployment)/meta.json
@@ -1,4 +1,4 @@
{
"title": "Deployment",
"pages": ["hosted", "roadmap"]
"pages": ["hosted", "metrics", "roadmap"]
}
155 changes: 155 additions & 0 deletions docs/content/docs/(deployment)/metrics.mdx
@@ -0,0 +1,155 @@
---
title: Metrics
description: Prometheus-compatible metrics for monitoring Spacebot's LLM usage, costs, and agent activity.
---

# Metrics

Spacebot exposes Prometheus-compatible metrics for monitoring LLM costs, token usage, agent activity, and memory operations. All telemetry code is behind the `metrics` cargo feature flag — without it, every instrumentation block compiles out to nothing.

## Building with Metrics

```bash
cargo build --release --features metrics
```

## Configuration

Add a `[metrics]` block to your `spacebot.toml`:

```toml
[metrics]
enabled = true
port = 9090
bind = "0.0.0.0"
```

| Key | Default | Description |
| --------- | ------------- | ------------------------------------ |
| `enabled` | `false` | Enable the /metrics HTTP endpoint |
| `port` | `9090` | Port for the metrics server |
| `bind` | `"0.0.0.0"` | Address to bind the metrics server |

The metrics server runs as a separate tokio task alongside the main API server and shuts down gracefully with the rest of the process.

## Endpoints

| Path | Description |
| ---------- | ------------------------------------------ |
| `/metrics` | Prometheus text exposition format (0.0.4) |
| `/health` | Returns 200 OK (for liveness probes) |

## Exposed Metrics

All metrics are prefixed with `spacebot_`.

### LLM Metrics

These metrics track every LLM completion request, including token counts and estimated costs.

| Metric | Type | Labels | Description |
| ------ | ---- | ------ | ----------- |
| `spacebot_llm_requests_total` | Counter | `agent_id`, `model`, `tier` | Total LLM completion requests |
| `spacebot_llm_request_duration_seconds` | Histogram | `agent_id`, `model`, `tier` | End-to-end LLM request duration |
| `spacebot_llm_tokens_total` | Counter | `agent_id`, `model`, `tier`, `direction` | Token counts (`direction`: input, output, cached_input) |
| `spacebot_llm_estimated_cost_dollars` | Counter | `agent_id`, `model`, `tier` | Estimated cost in USD |

The `tier` label corresponds to the process type: `channel`, `branch`, `worker`, `compactor`, or `cortex`.

The `direction` label on token counts distinguishes input tokens, output (completion) tokens, and cached input tokens. Cached tokens are billed at a lower rate by most providers.

**Cost estimation** uses a built-in pricing table covering Claude, GPT-4o, o-series, Gemini, and DeepSeek models. Unknown models use a conservative fallback rate. Costs are best-effort estimates — exact billing depends on your provider agreement.

### Tool Metrics

| Metric | Type | Labels | Description |
| ------ | ---- | ------ | ----------- |
| `spacebot_tool_calls_total` | Counter | `agent_id`, `tool_name` | Total tool calls executed |
| `spacebot_tool_call_duration_seconds` | Histogram | — | Tool call execution duration |

### Agent & Worker Metrics

| Metric | Type | Labels | Description |
| ------ | ---- | ------ | ----------- |
| `spacebot_active_workers` | Gauge | `agent_id` | Currently active workers |
| `spacebot_active_branches` | Gauge | `agent_id` | Currently active branches |
| `spacebot_worker_duration_seconds` | Histogram | `agent_id`, `worker_type` | Worker lifetime duration |
| `spacebot_process_errors_total` | Counter | `agent_id`, `process_type`, `error_type` | Process errors by type |

### Memory Metrics

| Metric | Type | Labels | Description |
| ------ | ---- | ------ | ----------- |
| `spacebot_memory_reads_total` | Counter | — | Total memory recall operations |
| `spacebot_memory_writes_total` | Counter | — | Total memory save operations |
| `spacebot_memory_entry_count` | Gauge | `agent_id` | Approximate memory entries per agent (net saves minus deletes since process start) |
| `spacebot_memory_updates_total` | Counter | `agent_id`, `operation` | Memory mutations (`operation`: save, delete, forget) |

## Cost Tracking

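Token usage and estimated costs are tracked per request. Counters reset when the process restarts, so this query shows the estimated spend accumulated since each process started, per agent: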

```promql
sum(spacebot_llm_estimated_cost_dollars) by (agent_id)
```

To see the average spend rate over the last hour, in dollars per hour, per agent and model:

```promql
sum(rate(spacebot_llm_estimated_cost_dollars[1h])) by (agent_id, model) * 3600
```

To see token throughput:

```promql
sum(rate(spacebot_llm_tokens_total[5m])) by (direction)
```

## Prometheus Scrape Config

```yaml
scrape_configs:
  - job_name: spacebot
    scrape_interval: 15s
    static_configs:
      - targets: ["localhost:9090"]
```

## Docker

Expose the metrics port alongside the API port:

```bash
docker run -d \
  --name spacebot \
  -e ANTHROPIC_API_KEY="sk-ant-..." \
  -v spacebot-data:/data \
  -p 19898:19898 \
  -p 9090:9090 \
  ghcr.io/spacedriveapp/spacebot:slim
```

The Docker image must be built with `--features metrics` for this to work. The default images do not include metrics support.

## Histogram Buckets

| Metric | Buckets (seconds) |
| ------ | ----------------- |
| `llm_request_duration_seconds` | 0.1, 0.25, 0.5, 1, 2.5, 5, 10, 15, 30, 60, 120 |
| `tool_call_duration_seconds` | 0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, 30 |
| `worker_duration_seconds` | 1, 5, 10, 30, 60, 120, 300, 600, 1800 |
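
To see how the LLM bucket layout resolves slow requests, the sketch below finds the smallest bucket boundary that contains an observation. It is a standalone illustration with hypothetical names, not the `prometheus` crate's implementation. Real Prometheus buckets are cumulative, so an observation is also counted in every larger bucket.

```rust
/// Upper bounds (in seconds) from the `llm_request_duration_seconds` row above.
const LLM_BUCKETS: [f64; 11] = [
    0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0, 15.0, 30.0, 60.0, 120.0,
];

/// Hypothetical helper: return the smallest upper bound that contains the
/// observation, or None when it only lands in the implicit +Inf bucket.
fn smallest_bucket(buckets: &[f64], seconds: f64) -> Option<f64> {
    // Bucket bounds are inclusive ("less than or equal").
    buckets.iter().copied().find(|&upper| seconds <= upper)
}
```

With these boundaries a 45-second request falls in the 60-second bucket, and only requests longer than 120 seconds are left to `+Inf`.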

## Cardinality

| Metric | Estimated series |
| ------ | --------------- |
| `llm_requests_total` | agents × models × tiers (~25–375) |
| `llm_tokens_total` | agents × models × tiers × 3 directions (~75–1125) |
| `llm_estimated_cost_dollars` | agents × models × tiers (~25–375) |
| `tool_calls_total` | agents × tools (~20–100) |
| `active_workers` / `active_branches` | agents (~1–5 each) |
| `process_errors_total` | agents × process_types × error_types (~15–75) |
| `memory_*` | 1–10 per metric |
| **Total** | **~160–2000** |

Well within safe operating range for any Prometheus deployment.