feat(telemetry): complete metrics instrumentation with cost tracking and per-agent context #102

Open
l33t0 wants to merge 1 commit into spacedriveapp:main from l33t0:feat/metrics-otel

l33t0 commented Feb 21, 2026

Summary

Continues from #35, which laid the foundation (registry, metrics server, 8 initial metrics, feature gate). This PR resolves every known limitation called out in that PR and adds the operational visibility metrics needed for cost control and agent monitoring.

What's new:

  • Metrics server is now wired: start_metrics_server() is called in main.rs, so the /metrics endpoint actually starts
  • Per-agent LLM labels: agent_id and tier are no longer hardcoded to "unknown"; SpacebotModel carries process context via .with_context(), wired at all 7 call sites (channel, branch, worker, compactor, ingestion, cortex, cortex_chat)
  • Token usage tracking: spacebot_llm_tokens_total counter with direction label (input/output/cached_input)
  • Cost estimation: spacebot_llm_estimated_cost_dollars counter with a static pricing table (src/llm/pricing.rs) covering Claude 4/3.5/3, GPT-4o, o-series, Gemini, and DeepSeek families
  • Worker/branch visibility: spacebot_active_branches gauge, spacebot_worker_duration_seconds histogram
  • Error classification: spacebot_process_errors_total counter with error_type labels (rate_limit, timeout, context_overflow, provider_error, other)
  • Memory audit trail: spacebot_memory_updates_total counter tracking save/delete/forget operations; memory_entry_count gauge now wired in MemoryStore
  • LLM histogram fix: buckets extended from [0.1 … 10s] to [0.1 … 120s] to capture retry/fallback latency
  • Docs: new metrics page for docs.spacebot.sh, updated METRICS.md and docs/metrics.md with full 14-metric inventory, cardinality estimates, and PromQL examples
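
For readers who want a feel for how a static pricing table can turn token counts into a dollar estimate, here is a minimal sketch. It is not the contents of src/llm/pricing.rs: the names `ModelPricing`, `lookup_pricing`, and `estimate_cost_usd`, the prefix-matching strategy, and every price are assumptions made up for illustration. Only the three token directions (input, output, cached_input) come from the list above.

```rust
/// Per-million-token prices in USD for one model family.
/// Hypothetical type; every price in this file is a placeholder.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct ModelPricing {
    pub input_per_mtok: f64,
    pub output_per_mtok: f64,
    pub cached_input_per_mtok: f64,
}

/// Hypothetical static table keyed by model-name prefix.
/// The model names and figures are invented, not real provider prices.
const PRICING: &[(&str, ModelPricing)] = &[
    (
        "example-large",
        ModelPricing { input_per_mtok: 10.0, output_per_mtok: 30.0, cached_input_per_mtok: 1.0 },
    ),
    (
        "example-small",
        ModelPricing { input_per_mtok: 1.0, output_per_mtok: 2.0, cached_input_per_mtok: 0.1 },
    ),
];

/// Returns the first table entry whose prefix matches the model name.
pub fn lookup_pricing(model: &str) -> Option<ModelPricing> {
    PRICING
        .iter()
        .find(|(prefix, _)| model.starts_with(*prefix))
        .map(|(_, pricing)| *pricing)
}

/// Estimates the USD cost of one call from its token counts.
/// Returns None for models missing from the table, so the caller
/// can decide whether to skip the cost counter or record zero.
pub fn estimate_cost_usd(model: &str, input: u64, output: u64, cached_input: u64) -> Option<f64> {
    const TOKENS_PER_MTOK: f64 = 1_000_000.0;
    let p = lookup_pricing(model)?;
    let total = input as f64 * p.input_per_mtok
        + output as f64 * p.output_per_mtok
        + cached_input as f64 * p.cached_input_per_mtok;
    Some(total / TOKENS_PER_MTOK)
}
```

Under these assumptions, the returned value would be what gets added to a counter such as spacebot_llm_estimated_cost_dollars after each call. A static table is simple and needs no network access, but it drifts as providers change prices, so the resulting figure is best read as an estimate.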

New metrics

| Metric | Type | Labels |
| --- | --- | --- |
| spacebot_llm_tokens_total | Counter | agent_id, model, tier, direction |
| spacebot_llm_estimated_cost_dollars | Counter | agent_id, model, tier |
| spacebot_active_branches | Gauge | agent_id |
| spacebot_worker_duration_seconds | Histogram | agent_id, worker_type |
| spacebot_process_errors_total | Counter | agent_id, process_type, error_type |
| spacebot_memory_updates_total | Counter | agent_id, operation |

| Area | Files |
| --- | --- |
| Metrics infra | src/telemetry/registry.rs, src/main.rs |
| LLM instrumentation | src/llm/model.rs, src/llm/pricing.rs (new), src/llm.rs |
| Agent context wiring | src/agent/{channel,branch,worker,compactor,cortex,cortex_chat,ingestion}.rs |
| Memory instrumentation | src/memory/store.rs, src/tools/memory_save.rs, src/tools/memory_delete.rs |
| Hook cleanup | src/hooks/spacebot.rs |
| Documentation | METRICS.md, docs/metrics.md, docs/content/docs/(deployment)/metrics.mdx (new), meta.json |
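
The error_type label on spacebot_process_errors_total takes five values (rate_limit, timeout, context_overflow, provider_error, other). One plausible way to map a failure onto those values is sketched below. The `ErrorType` enum, the `classify_error` function, and the status-code and substring heuristics are all assumptions for illustration; the PR's actual classification logic may look quite different.

```rust
/// The five error_type label values named in this PR.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ErrorType {
    RateLimit,
    Timeout,
    ContextOverflow,
    ProviderError,
    Other,
}

impl ErrorType {
    /// Label value to attach to the counter.
    pub fn as_label(self) -> &'static str {
        match self {
            ErrorType::RateLimit => "rate_limit",
            ErrorType::Timeout => "timeout",
            ErrorType::ContextOverflow => "context_overflow",
            ErrorType::ProviderError => "provider_error",
            ErrorType::Other => "other",
        }
    }
}

/// Hypothetical classifier: inspects an optional HTTP status and the
/// error message. The matching rules are illustrative guesses only.
pub fn classify_error(status: Option<u16>, message: &str) -> ErrorType {
    let msg = message.to_ascii_lowercase();
    if status == Some(429) || msg.contains("rate limit") {
        ErrorType::RateLimit
    } else if msg.contains("timed out") || msg.contains("timeout") {
        ErrorType::Timeout
    } else if msg.contains("context length") || msg.contains("context window") {
        ErrorType::ContextOverflow
    } else if matches!(status, Some(500..=599)) {
        ErrorType::ProviderError
    } else {
        ErrorType::Other
    }
}
```

Keeping the label set closed (an enum rather than free-form strings) is what keeps the cardinality of this metric bounded.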

Test plan

  • cargo build --features metrics — compiles (19 pre-existing warnings)
  • cargo build (without feature) — compiles (19 pre-existing warnings, no metric code included)
  • cargo test --lib --bins — 96 passed, 0 failed
  • All #[cfg(feature = "metrics")] gates consistent — no crate::telemetry reference without a gate
  • Manual: start with metrics.enabled = true, curl localhost:9090/metrics returns all 14 metrics
  • Manual: verify cost counter increments after LLM calls

…and per-agent context

Wire the metrics server startup, fix LLM histogram buckets, and resolve all
known limitations from spacedriveapp#35: agent_id/tier labels are no longer hardcoded to
"unknown", memory_entry_count gauge is instrumented, and six new metrics
cover token usage, estimated USD cost, branch/worker lifecycle, process
errors, and memory audit trail.

- Wire start_metrics_server() call in main.rs
- Extend LLM duration buckets to [0.1 … 120s]
- Add agent_id + process_type context to SpacebotModel, wired at all 7 call sites
- Add spacebot_llm_tokens_total (input/output/cached_input)
- Add spacebot_llm_estimated_cost_dollars with static pricing table (src/llm/pricing.rs)
- Add spacebot_active_branches gauge
- Add spacebot_worker_duration_seconds histogram
- Add spacebot_process_errors_total counter with error classification
- Add spacebot_memory_updates_total counter (save/delete/forget)
- Wire memory_entry_count gauge in MemoryStore save/delete
- Add metrics docs page for docs.spacebot.sh
- Update METRICS.md and docs/metrics.md with full inventory
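
As a rough illustration of the context wiring in the list above, a builder-style method can replace the "unknown" defaults at each call site. This is a sketch under assumptions, not the real SpacebotModel: the `ModelHandle` type, its fields, and the `with_context` signature shown here are hypothetical, and the fields follow this commit message's agent_id + process_type wording.

```rust
/// Hypothetical stand-in for a model handle that carries metric context.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ModelHandle {
    model_name: String,
    agent_id: String,
    process_type: String,
}

impl ModelHandle {
    /// Starts with "unknown" context, mirroring the old hardcoded labels.
    pub fn new(model_name: impl Into<String>) -> Self {
        Self {
            model_name: model_name.into(),
            agent_id: "unknown".to_string(),
            process_type: "unknown".to_string(),
        }
    }

    /// Builder-style setter; the signature is an assumption for this sketch.
    pub fn with_context(
        mut self,
        agent_id: impl Into<String>,
        process_type: impl Into<String>,
    ) -> Self {
        self.agent_id = agent_id.into();
        self.process_type = process_type.into();
        self
    }

    /// Label values in a fixed order: agent_id, model, process_type.
    pub fn metric_labels(&self) -> [&str; 3] {
        [
            self.agent_id.as_str(),
            self.model_name.as_str(),
            self.process_type.as_str(),
        ]
    }
}
```

A call site such as a worker might then write `ModelHandle::new("example-model").with_context("agent-1", "worker")`, so every metric recorded through that handle carries the right labels without threading extra arguments through each call.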

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

l33t0 commented Feb 21, 2026

I'm letting my local setup run for a while to collect enough data, then I'll build a few Grafana dashboards to visualise the metrics added here.
