feat(telemetry): complete metrics instrumentation with cost tracking and per-agent context #102
Open
l33t0 wants to merge 1 commit into spacedriveapp:main from
…and per-agent context

Wire the metrics server startup, fix LLM histogram buckets, and resolve all known limitations from spacedriveapp#35: agent_id/tier labels are no longer hardcoded to "unknown", memory_entry_count gauge is instrumented, and six new metrics cover token usage, estimated USD cost, branch/worker lifecycle, process errors, and memory audit trail.

- Wire start_metrics_server() call in main.rs
- Extend LLM duration buckets to [0.1 … 120s]
- Add agent_id + process_type context to SpacebotModel, wired at all 7 call sites
- Add spacebot_llm_tokens_total (input/output/cached_input)
- Add spacebot_llm_estimated_cost_dollars with static pricing table (src/llm/pricing.rs)
- Add spacebot_active_branches gauge
- Add spacebot_worker_duration_seconds histogram
- Add spacebot_process_errors_total counter with error classification
- Add spacebot_memory_updates_total counter (save/delete/forget)
- Wire memory_entry_count gauge in MemoryStore save/delete
- Add metrics docs page for docs.spacebot.sh
- Update METRICS.md and docs/metrics.md with full inventory

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Waiting for my local setup to collect enough metrics so that I can build a few Grafana dashboards visualising the metrics generated here.
Summary
Continues from #35, which laid the foundation (registry, metrics server, 8 initial metrics, feature gate). This PR resolves every known limitation called out in that PR and adds the operational visibility metrics needed for cost control and agent monitoring.
What's new:
- `start_metrics_server()` is called in `main.rs`, so the `/metrics` endpoint actually starts
- `agent_id` and `tier` are no longer hardcoded to `"unknown"`; `SpacebotModel` carries process context via `.with_context()`, wired at all 7 call sites (channel, branch, worker, compactor, ingestion, cortex, cortex_chat)
- `spacebot_llm_tokens_total` counter with `direction` label (input/output/cached_input)
- `spacebot_llm_estimated_cost_dollars` counter with a static pricing table (`src/llm/pricing.rs`) covering Claude 4/3.5/3, GPT-4o, o-series, Gemini, and DeepSeek families
- `spacebot_active_branches` gauge, `spacebot_worker_duration_seconds` histogram
- `spacebot_process_errors_total` counter with `error_type` labels (rate_limit, timeout, context_overflow, provider_error, other)
- `spacebot_memory_updates_total` counter tracking save/delete/forget operations; `memory_entry_count` gauge now wired in `MemoryStore`
- LLM duration buckets extended from `[0.1 … 10s]` to `[0.1 … 120s]` to capture retry/fallback latency
- `METRICS.md` and `docs/metrics.md` updated with full 14-metric inventory, cardinality estimates, and PromQL examples

New metrics
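To illustrate how a static pricing table can turn token counts into an estimated dollar figure for `spacebot_llm_estimated_cost_dollars`, here is a minimal sketch. It is not the contents of `src/llm/pricing.rs`: the names `ModelPricing`, `lookup`, and `estimate_cost_usd`, the prefix-matching rule, the model names, and every price are hypothetical placeholders. It also assumes the three token counts are disjoint.

```rust
/// Hypothetical per-million-token prices in USD.
/// The numbers used below are placeholders, not real provider pricing.
#[derive(Clone, Copy, Debug, PartialEq)]
struct ModelPricing {
    input_per_mtok: f64,
    output_per_mtok: f64,
    cached_input_per_mtok: f64,
}

/// Static table keyed by model-name prefix (illustrative entries only).
const PRICING: &[(&str, ModelPricing)] = &[
    (
        "example-large",
        ModelPricing { input_per_mtok: 10.0, output_per_mtok: 30.0, cached_input_per_mtok: 1.0 },
    ),
    (
        "example-small",
        ModelPricing { input_per_mtok: 1.0, output_per_mtok: 2.0, cached_input_per_mtok: 0.1 },
    ),
];

/// Return the first table entry whose prefix matches the model name.
fn lookup(model: &str) -> Option<ModelPricing> {
    PRICING
        .iter()
        .find(|(prefix, _)| model.starts_with(prefix))
        .map(|(_, pricing)| *pricing)
}

/// Estimate the USD cost of one request.
/// Returns None for models missing from the table, so the caller can
/// skip the cost counter instead of recording a misleading zero.
fn estimate_cost_usd(model: &str, input: u64, output: u64, cached_input: u64) -> Option<f64> {
    let p = lookup(model)?;
    let total = input as f64 * p.input_per_mtok
        + output as f64 * p.output_per_mtok
        + cached_input as f64 * p.cached_input_per_mtok;
    Some(total / 1_000_000.0)
}

fn main() {
    match estimate_cost_usd("example-large-v1", 1_000_000, 500_000, 0) {
        Some(cost) => println!("estimated cost: ${cost:.4}"),
        None => println!("no pricing entry for this model"),
    }
}
```

A caller would add the returned value to the cost counter after each completed request. Returning `None` for unknown models is one possible design choice: it keeps the estimate from silently under-reporting when a new model has no table entry yet.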
- `spacebot_llm_tokens_total` (counter)
- `spacebot_llm_estimated_cost_dollars` (counter)
- `spacebot_active_branches` (gauge)
- `spacebot_worker_duration_seconds` (histogram)
- `spacebot_process_errors_total` (counter)
- `spacebot_memory_updates_total` (counter)

Files changed

- `src/telemetry/registry.rs`, `src/main.rs`
- `src/llm/model.rs`, `src/llm/pricing.rs` (new), `src/llm.rs`
- `src/agent/{channel,branch,worker,compactor,cortex,cortex_chat,ingestion}.rs`
- `src/memory/store.rs`, `src/tools/memory_save.rs`, `src/tools/memory_delete.rs`
- `src/hooks/spacebot.rs`
- `METRICS.md`, `docs/metrics.md`, `docs/content/docs/(deployment)/metrics.mdx` (new), `meta.json`

Test plan
- `cargo build --features metrics`: compiles (19 pre-existing warnings)
- `cargo build` (without feature): compiles (19 pre-existing warnings, no metric code included)
- `cargo test --lib --bins`: 96 passed, 0 failed
- `#[cfg(feature = "metrics")]` gates consistent: no `crate::telemetry` reference without a gate
- With `metrics.enabled = true`, `curl localhost:9090/metrics` returns all 14 metrics
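For readers who want a concrete picture of the `error_type` labels on `spacebot_process_errors_total`, the following sketch shows one way such a classification could work. Only the five label strings come from this PR. The `ErrorType` enum, the `classify_error` function, its `(status, message)` inputs, and the matching rules are assumptions made for illustration and may differ from the actual implementation.

```rust
/// The five label values listed in this PR for `error_type`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ErrorType {
    RateLimit,
    Timeout,
    ContextOverflow,
    ProviderError,
    Other,
}

impl ErrorType {
    /// Label string as it would appear on the counter.
    fn as_label(self) -> &'static str {
        match self {
            ErrorType::RateLimit => "rate_limit",
            ErrorType::Timeout => "timeout",
            ErrorType::ContextOverflow => "context_overflow",
            ErrorType::ProviderError => "provider_error",
            ErrorType::Other => "other",
        }
    }
}

/// Hypothetical classifier: maps an optional HTTP status and an error
/// message to one of the five labels. The rules are checked in order,
/// so a message mentioning a timeout wins over a 5xx status.
fn classify_error(status: Option<u16>, message: &str) -> ErrorType {
    let msg = message.to_ascii_lowercase();
    if status == Some(429) || msg.contains("rate limit") {
        ErrorType::RateLimit
    } else if msg.contains("timed out") || msg.contains("timeout") {
        ErrorType::Timeout
    } else if msg.contains("context length") || msg.contains("context window") {
        ErrorType::ContextOverflow
    } else if matches!(status, Some(code) if code >= 500) {
        ErrorType::ProviderError
    } else {
        ErrorType::Other
    }
}

fn main() {
    let kind = classify_error(Some(429), "Too Many Requests");
    println!("error_type={}", kind.as_label());
}
```

Mapping every error onto a small fixed set of labels keeps the counter's cardinality bounded, which is the usual reason to classify rather than record raw error strings.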