fix(metrics): aggregate mcp.request + tool_executed into 30-min usage rollup#1082
Conversation
These two events were firing once per call and dwarfed every other event in volume — at peak, ~87% of total events on the project. They carry no usable per-call signal beyond raw counts: mcp.request has no tool/server/method props on the noisy code path; tool_executed's duration/error fields are observability noise we route through Sentry and logs anyway. Per-call captures cost real money in PostHog when an upstream caller polls the server. Aggregate both inside MetricsService: - mcp.request and tool_executed now feed an in-process RollupBuffer (a simple counter struct keyed by tool name + source) instead of hitting PostHog per call. - Every 30 minutes, wall-clock aligned (`:00` and `:30` past the hour), drain the buffer and emit one `browseros.server.usage_rollup` event carrying mcp_requests_count, tool_executions_count, _failed, _by_source, _by_tool. Empty buckets skip the emit entirely. - Tool name cardinality is capped at 200 distinct keys per bucket; overflow rolls into __other__. PostHog property-key dots are sanitized to underscores so HogQL prop access stays clean. - All other emit sites (chat.request, http_server.started, cli.command_executed, etc.) keep firing per-call as before — their volume is normal. - Interval is hardcoded — one less knob to misconfigure across deployments. If staging or scale-up ever needs a tighter window, change the constant and ship a release. Side change: drop the `'anonymous'` distinct_id fallback. Instances booted without BROWSEROS_CLIENT_ID and BROWSEROS_INSTALL_ID were funneling every event into a single un-attributable bucket and inflating billing dramatically. Now: no identity ⇒ no event emitted. main.ts surfaces a clear warn at boot so a misconfigured deployment doesn't silently produce zero analytics. On the worst-case caller observed in production (~3 req/sec sustained polling), this drops their PostHog billing from ~10,800 events/hour to 2 events/hour during active intervals — ~5,400× reduction. For a typical user doing 100 tool calls in 30 minutes it's a ~100× reduction. Tests cover: aggregated events buffer instead of capturing; other events still capture immediately; flush drain produces exactly one usage_rollup with correct counts; empty bucket skips emit; shutdown drains before posthog.shutdown; no-identity short-circuit; tool-name cap + __other__ overflow; sanitize-dots in tool names.
✅ Tests passed — 1074/1079
|
Greptile SummaryThis PR replaces per-call PostHog captures for
Confidence Score: 4/5The buffering and flush logic is correct end-to-end; the main risk is that period_start_ts will never match the aligned window boundary, so PostHog dashboards computing event rates from that field will produce subtly wrong numbers. The core reduction from per-call to 30-min rollup works correctly and is well-tested. The Math.ceil boundary case can waste timer cycles at exact UTC boundaries. The period_start_ts issue is a consistent data inaccuracy affecting any rate metric built on this event. The test suite has a hidden ordering dependency that could silently break count assertions if tests are reordered. metrics.ts warrants a second look for the period_start_ts stamping in ensureCurrent() and the Math.ceil boundary in scheduleNextFlush(); the test file needs an explicit rollup-reset in beforeEach. Important Files Changed
Prompt To Fix All With AIFix the following 3 code review issues. Work through them one at a time, proposing concise fixes.
---
### Issue 1 of 3
packages/browseros-agent/apps/server/src/lib/metrics.ts:196
`Math.ceil` returns the same value when `now` is exactly on a 30-minute UTC boundary (all `:00`/`:30` wall-clock marks are exact multiples of 1 800 000 ms since the Unix epoch), so `delay` computes to `0`. The timer fires immediately on the next event loop tick, the bucket is empty so nothing emits, but the recursive call to `scheduleNextFlush()` then re-enters the same arithmetic with the exact same `Date.now()` and may loop through several 0-delay iterations before accumulating enough drift to escape. Using `Math.floor + 1` always rounds up to the strictly-next boundary.
```suggestion
const nextTs = (Math.floor(now / ROLLUP_INTERVAL_MS) + 1) * ROLLUP_INTERVAL_MS
```
### Issue 2 of 3
packages/browseros-agent/apps/server/src/lib/metrics.ts:113-127
**`period_start_ts` mismatches `interval_seconds`**
`ensureCurrent()` lazy-stamps `period_start_ts` at the moment the first event arrives in a window, not at the aligned window boundary. If the first tool call in a 30-minute window arrives at T+20 min, the emitted event will say `interval_seconds: 1800` while `period_end_ts - period_start_ts` is only ~10 min. Any consumer computing a rate as `count / interval_seconds` will understate the actual rate. Consider setting `period_start_ts` to the aligned boundary (`Math.floor(Date.now() / ROLLUP_INTERVAL_MS) * ROLLUP_INTERVAL_MS`) when the bucket is initialised.
### Issue 3 of 3
packages/browseros-agent/apps/server/tests/lib/metrics.test.ts:43-49
**`beforeEach` never resets the `RollupBuffer`**
The `beforeEach` calls `initialize()` twice to reset config identity but neither call drains the `RollupBuffer`. Tests in the log-dispatch suite record events without calling `shutdown()`, leaving accumulated counts in the singleton buffer. The count-sensitive "emits a single usage_rollup" test only passes because the preceding test drains the buffer via `shutdown()` with no identity. Reordering or adding new non-draining tests can silently break the exact-count assertions. Adding `await metrics.shutdown()` to `beforeEach` would make each test hermetic.
Reviews (1): Last reviewed commit: "fix(metrics): aggregate mcp.request + to..." | Re-trigger Greptile |
- scheduleNextFlush: use `floor(now/I) + 1` instead of `ceil(now/I)`. On the rare tick that lands exactly on a 30-minute UTC boundary, `ceil` returned `now` itself and the resulting `delay=0` re-entered the recursive schedule until the clock advanced by 1 ms. `floor+1` always rounds to the strictly-next boundary so the timer is always armed for the future. - RollupBuffer.ensureCurrent: anchor `period_start_ts` to the aligned window boundary instead of the moment the first event arrived. Previously a first event at T+20min would emit a rollup with `interval_seconds=1800` while `period_end_ts - period_start_ts` was only ~10min, so any dashboard computing `count / interval_seconds` understated the rate. Aligning the start timestamp makes the three time fields agree about the window. - tests/lib/metrics.test.ts beforeEach now calls `await metrics.shutdown()` first to drain the singleton RollupBuffer. Tests previously passed only because of test ordering; this makes every case hermetic. - tests/main.test.ts: add `instanceClientId: 'client-test'` to the shared config so the existing happy-path assertions keep working with the new "no identity ⇒ no event" behavior, and add a new test that explicitly verifies the boot-time warn fires when both identity fields are absent.
Summary
metrics.log('mcp.request', …)andmetrics.log('tool_executed', …)are now buffered in-process instead of hitting PostHog per call.browseros.server.usage_rollupevent flushes every 30 minutes (wall-clock aligned to:00and:30) with rolled-up counts and breakdowns. Empty buckets emit nothing.MetricsService.log.'anonymous'distinct_idfallback so unconfigured instances no longer pool unattributable events into one bucket and inflate billing.Why
A volume audit traced ~87% of PostHog events on the worst day to these two event names, with a 14× jump over baseline in a 48-hour window. Per-call captures (
mcp.requestandtool_executed) had no per-call signal we use —mcp.requestcarries notool/server/methodproperties on the noisy code path, andtool_executed'sduration_ms/error_messageare observability noise already routed through Sentry and logs. The right fix is to stop firing per call.On the worst-case caller observed in prod (~3 req/sec sustained polling), this drops PostHog billing from ~10,800 events/hour to 2 events/hour during active intervals — ~5,400× reduction. For a typical user doing 100 tool calls in 30 minutes it's ~100× reduction. Other events (
chat.request,cli.command_executed,http_server.started, etc.) keep their per-call shape because their volume was always fine.Shape of the rollup event
{ "event": "browseros.server.usage_rollup", "properties": { "interval_seconds": 1800, "period_start_ts": "2026-06-01T14:00:00.000Z", "period_end_ts": "2026-06-01T14:30:00.000Z", "mcp_requests_count": 124, "tool_executions_count": 87, "tool_executions_failed": 3, "tool_executions_by_source": { \"chat\": 60, \"mcp\": 27 }, "tool_executions_by_tool": { "navigate_page": 24, "fill": 18, "click": 15, "take_snapshot": 12, "scroll": 8, "press_key": 6, "get_dom": 4 } } }tool_executions_by_toolkeys are sanitized so PostHog property paths stay clean (dots → underscores, length capped at 80). The map itself is capped at 200 distinct keys per bucket; overflow rolls into__other__.Notes
ROLLUP_INTERVAL_MSconstant and ship a release.'anonymous'fallback is a separate concern from the aggregation: it ensures that other unaggregated emit sites (chat.request, etc.) on unconfigured instances stop firing too.main.tssurfaces a clearlogger.warnat boot when neitherBROWSEROS_CLIENT_IDnorBROWSEROS_INSTALL_IDis set so a misconfigured deployment doesn't silently produce zero analytics.Manual Verification
usage_rollupevents appear on the staging PostHog project every 30 minutes when there's activity, and thatmcp.request/tool_executedstop appearing. Spot-checktool_executions_by_toolandtool_executions_by_sourcefor sensible breakdowns.mcp.request/tool_executedevents.Test plan
The new
tests/lib/metrics.test.tscovers:mcp.requestandtool_executedgo into the buffer instead of capturing immediatelyusage_rollupevent with the right counts,by_toolmap,by_sourcemap, and instance metadata__other__RollupBufferresets afterdrain