feat(telemetry): complete metrics instrumentation with cost tracking and per-agent context by l33t0 · Pull Request #102 · spacedriveapp/spacebot

l33t0 · 2026-02-21T12:12:37Z

Summary

Continues from #35, which laid the foundation (registry, metrics server, 8 initial metrics, feature gate). This PR resolves every known limitation called out in that PR and adds the operational visibility metrics needed for cost control and agent monitoring.

What's new:

- Metrics server is now wired: `start_metrics_server()` is called in `main.rs`, so the `/metrics` endpoint actually starts
- Per-agent LLM labels: `agent_id` and `tier` are no longer hardcoded to "unknown"; `SpacebotModel` carries process context via `.with_context()`, wired at all 7 call sites (channel, branch, worker, compactor, ingestion, cortex, cortex_chat)
- Token usage tracking: `spacebot_llm_tokens_total` counter with a `direction` label (input/output/cached_input)
- Cost estimation: `spacebot_llm_estimated_cost_dollars` counter with a static pricing table (`src/llm/pricing.rs`) covering the Claude 4/3.5/3, GPT-4o, o-series, Gemini, and DeepSeek families
- Worker/branch visibility: `spacebot_active_branches` gauge, `spacebot_worker_duration_seconds` histogram
- Error classification: `spacebot_process_errors_total` counter with `error_type` labels (rate_limit, timeout, context_overflow, provider_error, other)
- Memory audit trail: `spacebot_memory_updates_total` counter tracking save/delete/forget operations; the `memory_entry_count` gauge is now wired in `MemoryStore`
- LLM histogram fix: buckets extended from [0.1 … 10s] to [0.1 … 120s] to capture retry/fallback latency
- Docs: new metrics page for docs.spacebot.sh, updated `METRICS.md` and `docs/metrics.md` with the full 14-metric inventory, cardinality estimates, and PromQL examples
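For reviewers who want a feel for how the cost estimate can be derived from token counts, here is a minimal sketch of a prefix-matched static pricing table. It is not the contents of `src/llm/pricing.rs`: the type and function names, the model prefixes, and every price are placeholders chosen for illustration, and it assumes cached input tokens are reported separately from regular input tokens.

```rust
/// Per-million-token prices in USD for one model family.
/// Every figure in this sketch is a placeholder, not a real provider price.
#[derive(Debug, Clone, Copy)]
pub struct ModelPricing {
    pub input_per_mtok: f64,
    pub output_per_mtok: f64,
    pub cached_input_per_mtok: f64,
}

/// Hypothetical static table, matched by model-name prefix.
const PRICING_TABLE: &[(&str, ModelPricing)] = &[
    (
        "example-large",
        ModelPricing { input_per_mtok: 4.0, output_per_mtok: 16.0, cached_input_per_mtok: 0.5 },
    ),
    (
        "example-small",
        ModelPricing { input_per_mtok: 1.0, output_per_mtok: 2.0, cached_input_per_mtok: 0.25 },
    ),
];

/// Finds the first table entry whose prefix matches the model name.
pub fn lookup_pricing(model: &str) -> Option<ModelPricing> {
    PRICING_TABLE
        .iter()
        .find(|(prefix, _)| model.starts_with(*prefix))
        .map(|(_, pricing)| *pricing)
}

/// Estimated cost in USD for one call, or `None` for a model the table does not know.
pub fn estimate_cost_dollars(
    model: &str,
    input_tokens: u64,
    output_tokens: u64,
    cached_input_tokens: u64,
) -> Option<f64> {
    let pricing = lookup_pricing(model)?;
    let cost = |tokens: u64, per_mtok: f64| tokens as f64 * per_mtok / 1_000_000.0;
    Some(
        cost(input_tokens, pricing.input_per_mtok)
            + cost(output_tokens, pricing.output_per_mtok)
            + cost(cached_input_tokens, pricing.cached_input_per_mtok),
    )
}
```

Returning `None` for unknown models lets the caller skip the cost counter instead of recording a misleading zero, which is one possible design; the PR may handle unknown models differently.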

New metrics

| Metric | Type | Labels |
| --- | --- | --- |
| `spacebot_llm_tokens_total` | Counter | agent_id, model, tier, direction |
| `spacebot_llm_estimated_cost_dollars` | Counter | agent_id, model, tier |
| `spacebot_active_branches` | Gauge | agent_id |
| `spacebot_worker_duration_seconds` | Histogram | agent_id, worker_type |
| `spacebot_process_errors_total` | Counter | agent_id, process_type, error_type |
| `spacebot_memory_updates_total` | Counter | agent_id, operation |
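The `error_type` label takes one of five fixed values, which keeps the cardinality of `spacebot_process_errors_total` bounded. A rough sketch of how such a classifier could look is below. The enum, the function name, and the matching rules (HTTP status plus message substrings) are assumptions made for illustration; the PR's actual classification logic may differ.

```rust
/// The five `error_type` label values listed in this PR.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ErrorType {
    RateLimit,
    Timeout,
    ContextOverflow,
    ProviderError,
    Other,
}

impl ErrorType {
    /// Label value as it would appear on the metric.
    pub fn as_label(self) -> &'static str {
        match self {
            ErrorType::RateLimit => "rate_limit",
            ErrorType::Timeout => "timeout",
            ErrorType::ContextOverflow => "context_overflow",
            ErrorType::ProviderError => "provider_error",
            ErrorType::Other => "other",
        }
    }
}

/// Hypothetical classifier: the status codes and substrings checked here
/// are illustrative guesses, not the rules used in the PR.
pub fn classify_error(status: Option<u16>, message: &str) -> ErrorType {
    let message = message.to_lowercase();
    if status == Some(429) || message.contains("rate limit") {
        ErrorType::RateLimit
    } else if message.contains("timed out") || message.contains("timeout") {
        ErrorType::Timeout
    } else if message.contains("context length") || message.contains("context window") {
        ErrorType::ContextOverflow
    } else if matches!(status, Some(500..=599)) {
        ErrorType::ProviderError
    } else {
        ErrorType::Other
    }
}
```

Mapping everything unrecognised to `other` means a new provider error string can never create a new label value.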

| Area | Files |
| --- | --- |
| Metrics infra | `src/telemetry/registry.rs`, `src/main.rs` |
| LLM instrumentation | `src/llm/model.rs`, `src/llm/pricing.rs` (new), `src/llm.rs` |
| Agent context wiring | `src/agent/{channel,branch,worker,compactor,cortex,cortex_chat,ingestion}.rs` |
| Memory instrumentation | `src/memory/store.rs`, `src/tools/memory_save.rs`, `src/tools/memory_delete.rs` |
| Hook cleanup | `src/hooks/spacebot.rs` |
| Documentation | `METRICS.md`, `docs/metrics.md`, `docs/content/docs/(deployment)/metrics.mdx` (new), `meta.json` |
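The agent context wiring follows a builder pattern: each call site attaches its identity to the model before making requests, and the labels fall back to "unknown" when no context was attached. The sketch below illustrates that shape only. `ModelSketch` is a stand-in type, and the argument list of `with_context` is an assumption; the real `SpacebotModel::with_context()` signature is not shown in this PR description.

```rust
/// Stand-in for a model handle that carries optional process context.
/// This is not the real `SpacebotModel`.
#[derive(Debug, Clone)]
pub struct ModelSketch {
    model_name: String,
    agent_id: Option<String>,
    process_type: Option<String>,
}

impl ModelSketch {
    pub fn new(model_name: impl Into<String>) -> Self {
        Self { model_name: model_name.into(), agent_id: None, process_type: None }
    }

    /// Attaches the calling agent and process type (hypothetical signature).
    pub fn with_context(
        mut self,
        agent_id: impl Into<String>,
        process_type: impl Into<String>,
    ) -> Self {
        self.agent_id = Some(agent_id.into());
        self.process_type = Some(process_type.into());
        self
    }

    pub fn model_name(&self) -> &str {
        &self.model_name
    }

    /// Label values for metrics, falling back to "unknown" without context.
    pub fn metric_labels(&self) -> (&str, &str) {
        (
            self.agent_id.as_deref().unwrap_or("unknown"),
            self.process_type.as_deref().unwrap_or("unknown"),
        )
    }
}
```

A call site such as a worker would then construct its model along the lines of `ModelSketch::new("example-model").with_context("agent-1", "worker")`.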

Test plan

- `cargo build --features metrics`: compiles (19 pre-existing warnings)
- `cargo build` (without feature): compiles (19 pre-existing warnings, no metric code included)
- `cargo test --lib --bins`: 96 passed, 0 failed
- All `#[cfg(feature = "metrics")]` gates consistent: no `crate::telemetry` reference without a gate
- Manual: start with `metrics.enabled = true`, `curl localhost:9090/metrics` returns all 14 metrics
- Manual: verify cost counter increments after LLM calls
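The gate-consistency check above is about a pattern like the following, where every reference to the telemetry module sits behind the `metrics` feature and a no-op takes its place otherwise. This is a simplified sketch: the `telemetry` module here is a local stand-in built on an atomic counter, and `record_memory_update` is a hypothetical helper, not a function from the PR.

```rust
/// Stand-in for the gated telemetry module; only compiled with the feature on.
#[cfg(feature = "metrics")]
mod telemetry {
    use std::sync::atomic::{AtomicU64, Ordering};

    pub static MEMORY_UPDATES: AtomicU64 = AtomicU64::new(0);

    pub fn inc_memory_updates(_operation: &str) {
        MEMORY_UPDATES.fetch_add(1, Ordering::Relaxed);
    }
}

/// With the feature enabled, records the update and reports that it did so.
#[cfg(feature = "metrics")]
pub fn record_memory_update(operation: &str) -> bool {
    telemetry::inc_memory_updates(operation);
    true
}

/// Without the feature, nothing references `telemetry`, so no metric code is compiled.
#[cfg(not(feature = "metrics"))]
pub fn record_memory_update(_operation: &str) -> bool {
    false
}
```

Gating at the item level keeps call sites free of `#[cfg]` attributes: they call `record_memory_update("save")` in both build configurations.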

…and per-agent context

Wire the metrics server startup, fix LLM histogram buckets, and resolve all known limitations from spacedriveapp#35: agent_id/tier labels are no longer hardcoded to "unknown", memory_entry_count gauge is instrumented, and six new metrics cover token usage, estimated USD cost, branch/worker lifecycle, process errors, and memory audit trail.

- Wire start_metrics_server() call in main.rs
- Extend LLM duration buckets to [0.1 … 120s]
- Add agent_id + process_type context to SpacebotModel, wired at all 7 call sites
- Add spacebot_llm_tokens_total (input/output/cached_input)
- Add spacebot_llm_estimated_cost_dollars with static pricing table (src/llm/pricing.rs)
- Add spacebot_active_branches gauge
- Add spacebot_worker_duration_seconds histogram
- Add spacebot_process_errors_total counter with error classification
- Add spacebot_memory_updates_total counter (save/delete/forget)
- Wire memory_entry_count gauge in MemoryStore save/delete
- Add metrics docs page for docs.spacebot.sh
- Update METRICS.md and docs/metrics.md with full inventory

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
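To illustrate why the bucket extension matters, the sketch below places an observation into the first bucket whose upper bound covers it, following Prometheus `le` semantics. Only the endpoints (0.1, 10, 120) come from this PR; the intermediate boundaries in both arrays are made up for the example, and the names are hypothetical.

```rust
/// Returns the upper bound (`le`) of the first bucket that contains `value`,
/// or `None` when the value exceeds every finite bound and only counts
/// toward the implicit +Inf bucket. Assumes `bounds` is sorted ascending.
pub fn bucket_for(bounds: &[f64], value: f64) -> Option<f64> {
    bounds.iter().copied().find(|&le| value <= le)
}

/// Illustrative boundaries only: the PR states the range endpoints,
/// not the values in between.
pub const OLD_BOUNDS: &[f64] = &[0.1, 0.5, 1.0, 5.0, 10.0];
pub const NEW_BOUNDS: &[f64] = &[0.1, 0.5, 1.0, 5.0, 10.0, 30.0, 60.0, 120.0];
```

With these example bounds, a 45-second call that went through retries lands only in +Inf under the old range, so quantile estimates cannot say more than "above 10s". Under the extended range it falls into a finite bucket and stays measurable.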

l33t0 · 2026-02-21T12:16:08Z

I'm letting my local setup run for a while so it collects enough data to build a few Grafana dashboards that visualise the metrics added here.

l33t0 wants to merge 1 commit into spacedriveapp:main from l33t0:feat/metrics-otel
