Skip to content

fix(metrics): aggregate mcp.request + tool_executed into 30-min usage rollup#1082

Merged
Dani Akash (DaniAkash) merged 2 commits into
mainfrom
fix/metrics-rollup-buffer
Jun 2, 2026
Merged

fix(metrics): aggregate mcp.request + tool_executed into 30-min usage rollup#1082
Dani Akash (DaniAkash) merged 2 commits into
mainfrom
fix/metrics-rollup-buffer

Conversation

@DaniAkash
Copy link
Copy Markdown
Contributor

Summary

  • metrics.log('mcp.request', …) and metrics.log('tool_executed', …) are now buffered in-process instead of hitting PostHog per call.
  • A new browseros.server.usage_rollup event flushes every 30 minutes (wall-clock aligned to :00 and :30) with rolled-up counts and breakdowns. Empty buckets emit nothing.
  • The 11 existing call sites for these two events don't change — aggregation happens inside MetricsService.log.
  • Drops the 'anonymous' distinct_id fallback so unconfigured instances no longer pool unattributable events into one bucket and inflate billing.

Why

A volume audit traced ~87% of PostHog events on the worst day to these two event names, with a 14× jump over baseline in a 48-hour window. Per-call captures (mcp.request and tool_executed) had no per-call signal we use — mcp.request carries no tool / server / method properties on the noisy code path, and tool_executed's duration_ms / error_message are observability noise already routed through Sentry and logs. The right fix is to stop firing per call.

On the worst-case caller observed in prod (~3 req/sec sustained polling), this drops PostHog billing from ~10,800 events/hour to 2 events/hour during active intervals — ~5,400× reduction. For a typical user doing 100 tool calls in 30 minutes it's ~100× reduction. Other events (chat.request, cli.command_executed, http_server.started, etc.) keep their per-call shape because their volume was always fine.

Shape of the rollup event

{
  "event": "browseros.server.usage_rollup",
  "properties": {
    "interval_seconds": 1800,
    "period_start_ts": "2026-06-01T14:00:00.000Z",
    "period_end_ts":   "2026-06-01T14:30:00.000Z",
    "mcp_requests_count": 124,
    "tool_executions_count": 87,
    "tool_executions_failed": 3,
    "tool_executions_by_source": { \"chat\": 60, \"mcp\": 27 },
    "tool_executions_by_tool": {
      "navigate_page": 24, "fill": 18, "click": 15, "take_snapshot": 12,
      "scroll": 8, "press_key": 6, "get_dom": 4
    }
  }
}

tool_executions_by_tool keys are sanitized so PostHog property paths stay clean (dots → underscores, length capped at 80). The map itself is capped at 200 distinct keys per bucket; overflow rolls into __other__.

Notes

  • Interval is hardcoded at 30 minutes — one less knob to misconfigure across deployments. If staging or a future scale-up needs tighter feedback, change the ROLLUP_INTERVAL_MS constant and ship a release.
  • The dropped 'anonymous' fallback is a separate concern from the aggregation: it ensures that other unaggregated emit sites (chat.request, etc.) on unconfigured instances stop firing too. main.ts surfaces a clear logger.warn at boot when neither BROWSEROS_CLIENT_ID nor BROWSEROS_INSTALL_ID is set so a misconfigured deployment doesn't silently produce zero analytics.
  • Tool name cardinality and length caps are defensive — current tool names are statically registered and well-behaved; the caps protect against future dynamic / user-supplied names.

Manual Verification

  • `bun test apps/server/tests/lib/metrics.test.ts` — 14 pass, 0 fail.
  • `bun run --filter '@browseros/server' typecheck` — clean.
  • `bunx biome check apps/server/src/lib/metrics.ts apps/server/src/main.ts apps/server/tests/lib/metrics.test.ts` — clean.
  • Staging rollout: verify usage_rollup events appear on the staging PostHog project every 30 minutes when there's activity, and that mcp.request / tool_executed stop appearing. Spot-check tool_executions_by_tool and tool_executions_by_source for sensible breakdowns.
  • Prod rollout sequencing: once a majority of active instances are on the new code, disable then delete the temporary PostHog ingestion-side block that's filtering anonymous mcp.request / tool_executed events.

Test plan

The new tests/lib/metrics.test.ts covers:

  • mcp.request and tool_executed go into the buffer instead of capturing immediately
  • Other events still capture immediately
  • Shutdown drain emits exactly one usage_rollup event with the right counts, by_tool map, by_source map, and instance metadata
  • Empty buffer emits nothing
  • No-identity short-circuits the underlying capture path
  • Tool-name cardinality cap rolls overflow into __other__
  • Dotted tool names are sanitized
  • RollupBuffer resets after drain

These two events were firing once per call and dwarfed every other
event in volume — at peak, ~87% of total events on the project. They
carry no usable per-call signal beyond raw counts: mcp.request has
no tool/server/method props on the noisy code path; tool_executed's
duration/error fields are observability noise we route through Sentry
and logs anyway. Per-call captures cost real money in PostHog when an
upstream caller polls the server.

Aggregate both inside MetricsService:

- mcp.request and tool_executed now feed an in-process RollupBuffer
  (a simple counter struct keyed by tool name + source) instead of
  hitting PostHog per call.
- Every 30 minutes, wall-clock aligned (`:00` and `:30` past the hour),
  drain the buffer and emit one `browseros.server.usage_rollup` event
  carrying mcp_requests_count, tool_executions_count, _failed,
  _by_source, _by_tool. Empty buckets skip the emit entirely.
- Tool name cardinality is capped at 200 distinct keys per bucket;
  overflow rolls into __other__. PostHog property-key dots are
  sanitized to underscores so HogQL prop access stays clean.
- All other emit sites (chat.request, http_server.started,
  cli.command_executed, etc.) keep firing per-call as before — their
  volume is normal.
- Interval is hardcoded — one less knob to misconfigure across
  deployments. If staging or scale-up ever needs a tighter window,
  change the constant and ship a release.

Side change: drop the `'anonymous'` distinct_id fallback. Instances
booted without BROWSEROS_CLIENT_ID and BROWSEROS_INSTALL_ID were
funneling every event into a single un-attributable bucket and
inflating billing dramatically. Now: no identity ⇒ no event emitted.
main.ts surfaces a clear warn at boot so a misconfigured deployment
doesn't silently produce zero analytics.

On the worst-case caller observed in production (~3 req/sec sustained
polling), this drops their PostHog billing from ~10,800 events/hour
to 2 events/hour during active intervals — ~5,400× reduction. For a
typical user doing 100 tool calls in 30 minutes it's a ~100× reduction.

Tests cover: aggregated events buffer instead of capturing; other
events still capture immediately; flush drain produces exactly one
usage_rollup with correct counts; empty bucket skips emit; shutdown
drains before posthog.shutdown; no-identity short-circuit; tool-name
cap + __other__ overflow; sanitize-dots in tool names.
@github-actions github-actions Bot added the fix label Jun 1, 2026
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Jun 1, 2026

✅ Tests passed — 1074/1079

Suite Passed Failed Skipped
agent 91/91 0 0
build 10/10 0 0
eval 95/95 0 0
server-agent 246/246 0 0
server-api 64/64 0 0
server-browser 4/4 0 0
server-integration 9/10 0 1
server-lib 266/267 0 1
server-root 60/63 0 3
server-tools 229/229 0 0

View workflow run

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Jun 1, 2026

Greptile Summary

This PR replaces per-call PostHog captures for mcp.request and tool_executed with an in-process RollupBuffer that flushes a browseros.server.usage_rollup event every 30 minutes (wall-clock aligned), and drops the 'anonymous' distinct-ID fallback so unconfigured instances no longer emit unattributable events.

  • RollupBuffer accumulates mcp_requests, tool_executions totals, by_source, and by_tool breakdowns with cardinality and length caps; MetricsService.flushRollup() drains it via a wall-clock-aligned setTimeout chain and also on shutdown().
  • main.ts gains a boot-time logger.warn when metrics is enabled but no instance identity is configured, surfacing the silent no-op the removed 'anonymous' fallback used to mask.
  • Two data-accuracy issues in metrics.ts: Math.ceil on an exact 30-minute UTC boundary yields delay = 0, and period_start_ts is stamped at first-event time rather than the aligned window boundary, mismatching the static interval_seconds: 1800 field.

Confidence Score: 4/5

The buffering and flush logic is correct end-to-end; the main risk is that period_start_ts will never match the aligned window boundary, so PostHog dashboards computing event rates from that field will produce subtly wrong numbers.

The core reduction from per-call to 30-min rollup works correctly and is well-tested. The Math.ceil boundary case can waste timer cycles at exact UTC boundaries. The period_start_ts issue is a consistent data inaccuracy affecting any rate metric built on this event. The test suite has a hidden ordering dependency that could silently break count assertions if tests are reordered.

metrics.ts warrants a second look for the period_start_ts stamping in ensureCurrent() and the Math.ceil boundary in scheduleNextFlush(); the test file needs an explicit rollup-reset in beforeEach.

Important Files Changed

Filename Overview
packages/browseros-agent/apps/server/src/lib/metrics.ts Adds RollupBuffer and wires mcp.request/tool_executed into 30-min aggregation; two issues: Math.ceil exact-boundary 0ms delay and period_start_ts reflecting first-event time rather than aligned window start.
packages/browseros-agent/apps/server/src/main.ts Adds boot-time logger.warn when no instance identity is configured; minimal and low-risk change.
packages/browseros-agent/apps/server/tests/lib/metrics.test.ts New test suite covering rollup buffering and flush; beforeEach doesn't reset the singleton RollupBuffer, creating a hidden test-ordering dependency.
Prompt To Fix All With AI
Fix the following 3 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 3
packages/browseros-agent/apps/server/src/lib/metrics.ts:196
`Math.ceil` returns the same value when `now` is exactly on a 30-minute UTC boundary (all `:00`/`:30` wall-clock marks are exact multiples of 1 800 000 ms since the Unix epoch), so `delay` computes to `0`. The timer fires immediately on the next event loop tick, the bucket is empty so nothing emits, but the recursive call to `scheduleNextFlush()` then re-enters the same arithmetic with the exact same `Date.now()` and may loop through several 0-delay iterations before accumulating enough drift to escape. Using `Math.floor + 1` always rounds up to the strictly-next boundary.

```suggestion
    const nextTs = (Math.floor(now / ROLLUP_INTERVAL_MS) + 1) * ROLLUP_INTERVAL_MS
```

### Issue 2 of 3
packages/browseros-agent/apps/server/src/lib/metrics.ts:113-127
**`period_start_ts` mismatches `interval_seconds`**

`ensureCurrent()` lazy-stamps `period_start_ts` at the moment the first event arrives in a window, not at the aligned window boundary. If the first tool call in a 30-minute window arrives at T+20 min, the emitted event will say `interval_seconds: 1800` while `period_end_ts - period_start_ts` is only ~10 min. Any consumer computing a rate as `count / interval_seconds` will understate the actual rate. Consider setting `period_start_ts` to the aligned boundary (`Math.floor(Date.now() / ROLLUP_INTERVAL_MS) * ROLLUP_INTERVAL_MS`) when the bucket is initialised.

### Issue 3 of 3
packages/browseros-agent/apps/server/tests/lib/metrics.test.ts:43-49
**`beforeEach` never resets the `RollupBuffer`**

The `beforeEach` calls `initialize()` twice to reset config identity but neither call drains the `RollupBuffer`. Tests in the log-dispatch suite record events without calling `shutdown()`, leaving accumulated counts in the singleton buffer. The count-sensitive "emits a single usage_rollup" test only passes because the preceding test drains the buffer via `shutdown()` with no identity. Reordering or adding new non-draining tests can silently break the exact-count assertions. Adding `await metrics.shutdown()` to `beforeEach` would make each test hermetic.

Reviews (1): Last reviewed commit: "fix(metrics): aggregate mcp.request + to..." | Re-trigger Greptile

Comment thread packages/browseros-agent/apps/server/src/lib/metrics.ts Outdated
Comment thread packages/browseros-agent/apps/server/src/lib/metrics.ts
Comment thread packages/browseros-agent/apps/server/tests/lib/metrics.test.ts
- scheduleNextFlush: use `floor(now/I) + 1` instead of `ceil(now/I)`.
  On the rare tick that lands exactly on a 30-minute UTC boundary,
  `ceil` returned `now` itself and the resulting `delay=0` re-entered
  the recursive schedule until the clock advanced by 1 ms. `floor+1`
  always rounds to the strictly-next boundary so the timer is always
  armed for the future.

- RollupBuffer.ensureCurrent: anchor `period_start_ts` to the aligned
  window boundary instead of the moment the first event arrived.
  Previously a first event at T+20min would emit a rollup with
  `interval_seconds=1800` while `period_end_ts - period_start_ts`
  was only ~10min, so any dashboard computing
  `count / interval_seconds` understated the rate. Aligning the start
  timestamp makes the three time fields agree about the window.

- tests/lib/metrics.test.ts beforeEach now calls
  `await metrics.shutdown()` first to drain the singleton
  RollupBuffer. Tests previously passed only because of test ordering;
  this makes every case hermetic.

- tests/main.test.ts: add `instanceClientId: 'client-test'` to the
  shared config so the existing happy-path assertions keep working
  with the new "no identity ⇒ no event" behavior, and add a new test
  that explicitly verifies the boot-time warn fires when both
  identity fields are absent.
@DaniAkash Dani Akash (DaniAkash) merged commit 2fc7d6d into main Jun 2, 2026
22 checks passed
@DaniAkash Dani Akash (DaniAkash) deleted the fix/metrics-rollup-buffer branch June 2, 2026 11:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant