fix(metrics): aggregate mcp.request + tool_executed into 30-min usage rollup by DaniAkash · Pull Request #1082 · browseros-ai/BrowserOS

Dani Akash (DaniAkash) · 2026-06-01T15:04:58Z

Summary

metrics.log('mcp.request', …) and metrics.log('tool_executed', …) are now buffered in-process instead of hitting PostHog per call.
A new browseros.server.usage_rollup event flushes every 30 minutes (wall-clock aligned to :00 and :30) with rolled-up counts and breakdowns. Empty buckets emit nothing.
The 11 existing call sites for these two events don't change — aggregation happens inside MetricsService.log.
Drops the 'anonymous' distinct_id fallback so unconfigured instances no longer pool unattributable events into one bucket and inflate billing.

Why

A volume audit traced ~87% of PostHog events on the worst day to these two event names, with a 14× jump over baseline in a 48-hour window. Per-call captures (mcp.request and tool_executed) had no per-call signal we use — mcp.request carries no tool / server / method properties on the noisy code path, and tool_executed's duration_ms / error_message are observability noise already routed through Sentry and logs. The right fix is to stop firing per call.

On the worst-case caller observed in prod (~3 req/sec sustained polling), this drops PostHog billing from ~10,800 events/hour to 2 events/hour during active intervals — ~5,400× reduction. For a typical user doing 100 tool calls in 30 minutes it's ~100× reduction. Other events (chat.request, cli.command_executed, http_server.started, etc.) keep their per-call shape because their volume was always fine.

Shape of the rollup event

{
  "event": "browseros.server.usage_rollup",
  "properties": {
    "interval_seconds": 1800,
    "period_start_ts": "2026-06-01T14:00:00.000Z",
    "period_end_ts":   "2026-06-01T14:30:00.000Z",
    "mcp_requests_count": 124,
    "tool_executions_count": 87,
    "tool_executions_failed": 3,
    "tool_executions_by_source": { \"chat\": 60, \"mcp\": 27 },
    "tool_executions_by_tool": {
      "navigate_page": 24, "fill": 18, "click": 15, "take_snapshot": 12,
      "scroll": 8, "press_key": 6, "get_dom": 4
    }
  }
}

tool_executions_by_tool keys are sanitized so PostHog property paths stay clean (dots → underscores, length capped at 80). The map itself is capped at 200 distinct keys per bucket; overflow rolls into __other__.

Notes

Interval is hardcoded at 30 minutes — one less knob to misconfigure across deployments. If staging or a future scale-up needs tighter feedback, change the ROLLUP_INTERVAL_MS constant and ship a release.
The dropped 'anonymous' fallback is a separate concern from the aggregation: it ensures that other unaggregated emit sites (chat.request, etc.) on unconfigured instances stop firing too. main.ts surfaces a clear logger.warn at boot when neither BROWSEROS_CLIENT_ID nor BROWSEROS_INSTALL_ID is set so a misconfigured deployment doesn't silently produce zero analytics.
Tool name cardinality and length caps are defensive — current tool names are statically registered and well-behaved; the caps protect against future dynamic / user-supplied names.

Manual Verification

`bun test apps/server/tests/lib/metrics.test.ts` — 14 pass, 0 fail.
`bun run --filter '@browseros/server' typecheck` — clean.
`bunx biome check apps/server/src/lib/metrics.ts apps/server/src/main.ts apps/server/tests/lib/metrics.test.ts` — clean.
Staging rollout: verify usage_rollup events appear on the staging PostHog project every 30 minutes when there's activity, and that mcp.request / tool_executed stop appearing. Spot-check tool_executions_by_tool and tool_executions_by_source for sensible breakdowns.
Prod rollout sequencing: once a majority of active instances are on the new code, disable then delete the temporary PostHog ingestion-side block that's filtering anonymous mcp.request / tool_executed events.

Test plan

The new tests/lib/metrics.test.ts covers:

mcp.request and tool_executed go into the buffer instead of capturing immediately
Other events still capture immediately
Shutdown drain emits exactly one usage_rollup event with the right counts, by_tool map, by_source map, and instance metadata
Empty buffer emits nothing
No-identity short-circuits the underlying capture path
Tool-name cardinality cap rolls overflow into __other__
Dotted tool names are sanitized
RollupBuffer resets after drain

These two events were firing once per call and dwarfed every other event in volume — at peak, ~87% of total events on the project. They carry no usable per-call signal beyond raw counts: mcp.request has no tool/server/method props on the noisy code path; tool_executed's duration/error fields are observability noise we route through Sentry and logs anyway. Per-call captures cost real money in PostHog when an upstream caller polls the server. Aggregate both inside MetricsService: - mcp.request and tool_executed now feed an in-process RollupBuffer (a simple counter struct keyed by tool name + source) instead of hitting PostHog per call. - Every 30 minutes, wall-clock aligned (`:00` and `:30` past the hour), drain the buffer and emit one `browseros.server.usage_rollup` event carrying mcp_requests_count, tool_executions_count, _failed, _by_source, _by_tool. Empty buckets skip the emit entirely. - Tool name cardinality is capped at 200 distinct keys per bucket; overflow rolls into __other__. PostHog property-key dots are sanitized to underscores so HogQL prop access stays clean. - All other emit sites (chat.request, http_server.started, cli.command_executed, etc.) keep firing per-call as before — their volume is normal. - Interval is hardcoded — one less knob to misconfigure across deployments. If staging or scale-up ever needs a tighter window, change the constant and ship a release. Side change: drop the `'anonymous'` distinct_id fallback. Instances booted without BROWSEROS_CLIENT_ID and BROWSEROS_INSTALL_ID were funneling every event into a single un-attributable bucket and inflating billing dramatically. Now: no identity ⇒ no event emitted. main.ts surfaces a clear warn at boot so a misconfigured deployment doesn't silently produce zero analytics. On the worst-case caller observed in production (~3 req/sec sustained polling), this drops their PostHog billing from ~10,800 events/hour to 2 events/hour during active intervals — ~5,400× reduction. For a typical user doing 100 tool calls in 30 minutes it's a ~100× reduction. Tests cover: aggregated events buffer instead of capturing; other events still capture immediately; flush drain produces exactly one usage_rollup with correct counts; empty bucket skips emit; shutdown drains before posthog.shutdown; no-identity short-circuit; tool-name cap + __other__ overflow; sanitize-dots in tool names.

github-actions · 2026-06-01T15:07:21Z

✅ Tests passed — 1074/1079

Suite	Passed	Skipped
✅ `agent`	91/91	0
✅ `build`	10/10	0
✅ `eval`	95/95	0
✅ `server-agent`	246/246	0
✅ `server-api`	64/64	0
✅ `server-browser`	4/4	0
✅ `server-integration`	9/10	1
✅ `server-lib`	266/267	1
✅ `server-root`	60/63	3
✅ `server-tools`	229/229	0

View workflow run

greptile-apps · 2026-06-01T15:12:22Z

Greptile Summary

This PR replaces per-call PostHog captures for mcp.request and tool_executed with an in-process RollupBuffer that flushes a browseros.server.usage_rollup event every 30 minutes (wall-clock aligned), and drops the 'anonymous' distinct-ID fallback so unconfigured instances no longer emit unattributable events.

RollupBuffer accumulates mcp_requests, tool_executions totals, by_source, and by_tool breakdowns with cardinality and length caps; MetricsService.flushRollup() drains it via a wall-clock-aligned setTimeout chain and also on shutdown().
main.ts gains a boot-time logger.warn when metrics is enabled but no instance identity is configured, surfacing the silent no-op the removed 'anonymous' fallback used to mask.
Two data-accuracy issues in metrics.ts: Math.ceil on an exact 30-minute UTC boundary yields delay = 0, and period_start_ts is stamped at first-event time rather than the aligned window boundary, mismatching the static interval_seconds: 1800 field.

Confidence Score: 4/5

The buffering and flush logic is correct end-to-end; the main risk is that period_start_ts will never match the aligned window boundary, so PostHog dashboards computing event rates from that field will produce subtly wrong numbers.

The core reduction from per-call to 30-min rollup works correctly and is well-tested. The Math.ceil boundary case can waste timer cycles at exact UTC boundaries. The period_start_ts issue is a consistent data inaccuracy affecting any rate metric built on this event. The test suite has a hidden ordering dependency that could silently break count assertions if tests are reordered.

metrics.ts warrants a second look for the period_start_ts stamping in ensureCurrent() and the Math.ceil boundary in scheduleNextFlush(); the test file needs an explicit rollup-reset in beforeEach.

Important Files Changed

Filename	Overview
packages/browseros-agent/apps/server/src/lib/metrics.ts	Adds RollupBuffer and wires mcp.request/tool_executed into 30-min aggregation; two issues: Math.ceil exact-boundary 0ms delay and period_start_ts reflecting first-event time rather than aligned window start.
packages/browseros-agent/apps/server/src/main.ts	Adds boot-time logger.warn when no instance identity is configured; minimal and low-risk change.
packages/browseros-agent/apps/server/tests/lib/metrics.test.ts	New test suite covering rollup buffering and flush; beforeEach doesn't reset the singleton RollupBuffer, creating a hidden test-ordering dependency.

Prompt To Fix All With AI

Fix the following 3 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 3
packages/browseros-agent/apps/server/src/lib/metrics.ts:196
`Math.ceil` returns the same value when `now` is exactly on a 30-minute UTC boundary (all `:00`/`:30` wall-clock marks are exact multiples of 1 800 000 ms since the Unix epoch), so `delay` computes to `0`. The timer fires immediately on the next event loop tick, the bucket is empty so nothing emits, but the recursive call to `scheduleNextFlush()` then re-enters the same arithmetic with the exact same `Date.now()` and may loop through several 0-delay iterations before accumulating enough drift to escape. Using `Math.floor + 1` always rounds up to the strictly-next boundary.

```suggestion
    const nextTs = (Math.floor(now / ROLLUP_INTERVAL_MS) + 1) * ROLLUP_INTERVAL_MS
```

### Issue 2 of 3
packages/browseros-agent/apps/server/src/lib/metrics.ts:113-127
**`period_start_ts` mismatches `interval_seconds`**

`ensureCurrent()` lazy-stamps `period_start_ts` at the moment the first event arrives in a window, not at the aligned window boundary. If the first tool call in a 30-minute window arrives at T+20 min, the emitted event will say `interval_seconds: 1800` while `period_end_ts - period_start_ts` is only ~10 min. Any consumer computing a rate as `count / interval_seconds` will understate the actual rate. Consider setting `period_start_ts` to the aligned boundary (`Math.floor(Date.now() / ROLLUP_INTERVAL_MS) * ROLLUP_INTERVAL_MS`) when the bucket is initialised.

### Issue 3 of 3
packages/browseros-agent/apps/server/tests/lib/metrics.test.ts:43-49
**`beforeEach` never resets the `RollupBuffer`**

The `beforeEach` calls `initialize()` twice to reset config identity but neither call drains the `RollupBuffer`. Tests in the log-dispatch suite record events without calling `shutdown()`, leaving accumulated counts in the singleton buffer. The count-sensitive "emits a single usage_rollup" test only passes because the preceding test drains the buffer via `shutdown()` with no identity. Reordering or adding new non-draining tests can silently break the exact-count assertions. Adding `await metrics.shutdown()` to `beforeEach` would make each test hermetic.

_{Reviews (1): Last reviewed commit: "fix(metrics): aggregate mcp.request + to..." | Re-trigger Greptile}

- scheduleNextFlush: use `floor(now/I) + 1` instead of `ceil(now/I)`. On the rare tick that lands exactly on a 30-minute UTC boundary, `ceil` returned `now` itself and the resulting `delay=0` re-entered the recursive schedule until the clock advanced by 1 ms. `floor+1` always rounds to the strictly-next boundary so the timer is always armed for the future. - RollupBuffer.ensureCurrent: anchor `period_start_ts` to the aligned window boundary instead of the moment the first event arrived. Previously a first event at T+20min would emit a rollup with `interval_seconds=1800` while `period_end_ts - period_start_ts` was only ~10min, so any dashboard computing `count / interval_seconds` understated the rate. Aligning the start timestamp makes the three time fields agree about the window. - tests/lib/metrics.test.ts beforeEach now calls `await metrics.shutdown()` first to drain the singleton RollupBuffer. Tests previously passed only because of test ordering; this makes every case hermetic. - tests/main.test.ts: add `instanceClientId: 'client-test'` to the shared config so the existing happy-path assertions keep working with the new "no identity ⇒ no event" behavior, and add a new test that explicitly verifies the boot-time warn fires when both identity fields are absent.

github-actions Bot added the fix label Jun 1, 2026

greptile-apps Bot reviewed Jun 1, 2026

View reviewed changes

Comment thread packages/browseros-agent/apps/server/src/lib/metrics.ts Outdated

Comment thread packages/browseros-agent/apps/server/src/lib/metrics.ts

Comment thread packages/browseros-agent/apps/server/tests/lib/metrics.test.ts

Dani Akash (DaniAkash) merged commit 2fc7d6d into main Jun 2, 2026
22 checks passed

Dani Akash (DaniAkash) deleted the fix/metrics-rollup-buffer branch June 2, 2026 11:25

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(metrics): aggregate mcp.request + tool_executed into 30-min usage rollup#1082

fix(metrics): aggregate mcp.request + tool_executed into 30-min usage rollup#1082
Dani Akash (DaniAkash) merged 2 commits into
mainfrom
fix/metrics-rollup-buffer

Dani Akash (DaniAkash) commented Jun 1, 2026

Uh oh!

github-actions Bot commented Jun 1, 2026 •

edited

Loading

Uh oh!

greptile-apps Bot commented Jun 1, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

Dani Akash (DaniAkash) commented Jun 1, 2026

Summary

Why

Shape of the rollup event

Notes

Manual Verification

Test plan

Uh oh!

github-actions Bot commented Jun 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

✅ Tests passed — 1074/1079

Uh oh!

greptile-apps Bot commented Jun 1, 2026

Greptile Summary

Confidence Score: 4/5

Important Files Changed

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

github-actions Bot commented Jun 1, 2026 •

edited

Loading