fix(iam): expose sql metrics for different dashboards#29
fix(iam): expose sql metrics for different dashboards#29wilsonbeltran-bc wants to merge 1 commit into
Conversation
07787cc to
8aca5f5
Compare
8aca5f5 to
1a57b29
Compare
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, have a team admin enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit 1a57b29. Configure here.
881e44b to
45d9bc6
Compare
| server.add_type_collector(tc) | ||
| end | ||
| server.start | ||
| ::Bigcommerce::Prometheus::Integrations::ActiveRecord.start_safe(client: Bigcommerce::Prometheus.client, process_name: @process_name) |
There was a problem hiding this comment.
should this start after the prometheus server is started?
There was a problem hiding this comment.
Yes, intentional. The subscriber pushes through Bigcommerce::Prometheus.client which is independent of the local server lifecycle, but starting it after server.start ensures the exporter is ready to receive when the first SQL event fires.
|
|
||
| def setup_before_fork | ||
| @app.config.before_fork_callbacks = [] unless @app.config.before_fork_callbacks | ||
| process_name = @process_name |
There was a problem hiding this comment.
why not just pass the instance variable to your method? not sure I see why we need a local variable masking here
|
|
||
| def setup_after_fork | ||
| @app.config.after_fork_callbacks = [] unless @app.config.after_fork_callbacks | ||
| process_name = @process_name |
| # Wrapper for instrumentor wiring: registers the AR type collector on the given server, | ||
| # gated on the active_record_enabled config flag, and swallows errors so an additive | ||
| # feature failure cannot take down core instrumentation. | ||
| def self.register_type_collector(server, process_name: nil) |
There was a problem hiding this comment.
do we need this method? this can be done directly on the hutch and web processes with server.add_type_collector(::Bigcommerce::Prometheus::TypeCollectors::ActiveRecord.new) unless ::Bigcommerce::Prometheus.active_record_enabled
There was a problem hiding this comment.
The wrapper exists for two reasons flagged in earlier review:
Same wiring is needed in both Web and Hutch — Cursor's bugbot flagged the duplication when it was inline.
The wrapper rescues StandardError so a failure in the optional AR instrumentation doesn't take down setup_middleware / @collectors.each(&:start) — also flagged by the bot.
| # Render-side counterpart to Integrations::ActiveRecord. Aggregates per-operation | ||
| # SQL query duration into a Prometheus Histogram exposed at /metrics. | ||
| # | ||
| class ActiveRecord < Bigcommerce::Prometheus::TypeCollectors::Base |
There was a problem hiding this comment.
I feel like OTEL would be a better place to get this info from. Not opposed to adding this here, but we already get this info OOB from the OTEL collectors
There was a problem hiding this comment.
OTEL captures this at the span level, which is great for tracing. The motivation here is to fix existing Grafana panels that query Thanos/Prometheus for aggregate p99/p95 SQL latency over 5m windows — that's much cheaper as a precomputed histogram than running aggregations over OTEL traces in Lightstep. Both can coexist; this isn't replacing OTEL, just exposing the same signal in the format the dashboards already expect.
45d9bc6 to
2d41d61
Compare

What? Why?
There is no Prometheus equivalent in BigCommerce for "how long do my SQL queries take, broken down by SELECT / UPDATE / DELETE". The upstream
prometheus_exporteronly exposes connection-pool stats (active_record_connection_pool_*), not query timing. Services that used to consume StatsD timers (stats.timers.*.sql.<op>.query_time.{p99,p95,mean}) lost that signal when they migrated to Prometheus, and several legacy Grafana panels (e.g. App Registry Dashboard) still query the dead metric.This was first prototyped per-service in
app-registry, but during review it was raised that the instrumentation is generic and belongs in this gem so other services (A-A, permissions, MAUI's Koa side, etc.) can adopt it with a single line of wiring.What changed
New files
lib/bigcommerce/prometheus/integrations/active_record.rbsql.active_recordActiveSupport::Notifications, classifies the operation from the SQL payload (select/update/delete/other), ignoresSCHEMAandCACHEevents, and pushes duration in seconds to the local exporter viaBigcommerce::Prometheus.client. Errors are swallowed so instrumentation never propagates into the request path.lib/bigcommerce/prometheus/type_collectors/active_record.rbsql_query_duration_seconds(Histogram) withoperationlabel and exponential buckets covering ~1 ms to ~10 s. Uses the explicitbc_sqltype to avoid colliding with the upstreamactive_recordtype collector that handles connection-pool stats.spec/bigcommerce/prometheus/integrations/active_record_spec.rbsubscribe!registration, and "never raises if the client raises".spec/bigcommerce/prometheus/type_collectors/active_record_spec.rbduration_secondsmissing.Modified files
lib/bigcommerce/prometheus.rblib/bigcommerce/prometheus/instrumentors/web.rbsetup_before_fork: registers the AR type collector with the local server. Insetup_after_fork: starts the AR integration so the subscriber is wired in each forked worker. Both gated onBigcommerce::Prometheus.active_record_enabled.lib/bigcommerce/prometheus/instrumentors/hutch.rbstart, integrationstartafter server start. Same opt-out gate.lib/bigcommerce/prometheus/configuration.rbactive_record_enabled(defaulttrue, opt-out viaPROMETHEUS_ACTIVE_RECORD_ENABLED=0).bc-prometheus-ruby.gemspecactivesupportas a development dependency. The integration usesActiveSupport::Notificationsat runtime, butactivesupportis not added as a runtime dep because every Rails consumer already has it loaded — making it a hard runtime dep would force non-Rails consumers to install a heavy library they don't need.Exposed metric
After scraping
/metrics(with the defaultruby_prefix added byprometheus_exporter):Default buckets:
[0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]seconds. Configurable via the type collector constructor (TypeCollectors::ActiveRecord.new(buckets: [...])).Useful PromQL recipes for consumers
How was it tested?
Specs
Rubocop
End-to-end with a real consumer
Validated locally by pointing
app-registry'sGemfileat this branch and:Bigcommerce::Prometheus::TypeCollectors::ActiveRecord.newto the service'sweb_type_collectors(and to the hardcoded list insystem/providers/grpc.rb).Bigcommerce::Prometheus::Integrations::ActiveRecord.start.foreman startfor app-registry.rails runnerpointed at the grpc process's exporter (PROMETHEUS_SERVER_PORT=9371).The exporter at
http://localhost:9371/metricsthen exposed:Counts and operation labels matched exactly what the runner did (2 SELECT + 1 UPDATE + 2 BEGIN/COMMIT classified as "other").
Backward compatibility
PROMETHEUS_ACTIVE_RECORD_ENABLED=0.activesupportis a dev dep only; non-Rails consumers don't pick up an unwanted runtime dep.Out of scope
operation="other").Resqueinstrumentor (Resque jobs do AR queries, but the existing Resque instrumentor in this gem is a pull-based collector and adding push-based AR there is a separate design question — left for follow-up).app-registry. That's a separate PR that bumps this gem's version and adds 2-3 lines of wiring.Backport
If a backport to the
v0.6.xline is desired (for services still on Rack 2 / Thin), happy to open a separate PR — the changes apply cleanly there too (verified locally on av0.6.0-based branch with the same specs passing).Note
Medium Risk
Adds new runtime SQL instrumentation via
ActiveSupport::Notificationsand wires it into web/hutch processes, which could impact performance or metric volume if misconfigured. Risk is mitigated by opt-out (PROMETHEUS_ACTIVE_RECORD_ENABLED) and error-swallowing/idempotent startup.Overview
Adds ActiveRecord SQL timing metrics by subscribing to
sql.active_recordnotifications and pushing per-query durations (labeled by operation:select/insert/update/delete/other) to the exporter under a newbc_sqltype.Wires the new integration and
sql_query_duration_secondsHistogram type collector into thewebandhutchinstrumentors (plus anactive_record_enabledconfig flag to disable it), and includes specs plus anactivesupportdev dependency to support the notification-based tests.Reviewed by Cursor Bugbot for commit 2d41d61. Bugbot is set up for automated code reviews on this repo. Configure here.