Skip to content

fix(iam): expose sql metrics for different dashboards#29

Open
wilsonbeltran-bc wants to merge 1 commit into
mainfrom
ANA-6973-fix-failing-app-registry-dashboard-grafana-iam
Open

fix(iam): expose sql metrics for different dashboards#29
wilsonbeltran-bc wants to merge 1 commit into
mainfrom
ANA-6973-fix-failing-app-registry-dashboard-grafana-iam

Conversation

@wilsonbeltran-bc
Copy link
Copy Markdown

@wilsonbeltran-bc wilsonbeltran-bc commented May 8, 2026

What? Why?

There is no Prometheus equivalent in BigCommerce for "how long do my SQL queries take, broken down by SELECT / UPDATE / DELETE". The upstream prometheus_exporter only exposes connection-pool stats (active_record_connection_pool_*), not query timing. Services that used to consume StatsD timers (stats.timers.*.sql.<op>.query_time.{p99,p95,mean}) lost that signal when they migrated to Prometheus, and several legacy Grafana panels (e.g. App Registry Dashboard) still query the dead metric.

This was first prototyped per-service in app-registry, but during review it was raised that the instrumentation is generic and belongs in this gem so other services (A-A, permissions, MAUI's Koa side, etc.) can adopt it with a single line of wiring.

What changed

New files

File Purpose
lib/bigcommerce/prometheus/integrations/active_record.rb Push side. Subscribes to sql.active_record ActiveSupport::Notifications, classifies the operation from the SQL payload (select / update / delete / other), ignores SCHEMA and CACHE events, and pushes duration in seconds to the local exporter via Bigcommerce::Prometheus.client. Errors are swallowed so instrumentation never propagates into the request path.
lib/bigcommerce/prometheus/type_collectors/active_record.rb Render side. Registers sql_query_duration_seconds (Histogram) with operation label and exponential buckets covering ~1 ms to ~10 s. Uses the explicit bc_sql type to avoid colliding with the upstream active_record type collector that handles connection-pool stats.
spec/bigcommerce/prometheus/integrations/active_record_spec.rb Specs for the integration: SELECT / UPDATE / DELETE / unknown classification, SCHEMA/CACHE filtering, case-insensitivity, leading whitespace, nil SQL, duration-in-seconds conversion, subscribe! registration, and "never raises if the client raises".
spec/bigcommerce/prometheus/type_collectors/active_record_spec.rb Specs for the type collector: histogram registration, observation, no-op when duration_seconds missing.

Modified files

File Change
lib/bigcommerce/prometheus.rb Adds the two new requires.
lib/bigcommerce/prometheus/instrumentors/web.rb In setup_before_fork: registers the AR type collector with the local server. In setup_after_fork: starts the AR integration so the subscriber is wired in each forked worker. Both gated on Bigcommerce::Prometheus.active_record_enabled.
lib/bigcommerce/prometheus/instrumentors/hutch.rb Same wiring: type collector on start, integration start after server start. Same opt-out gate.
lib/bigcommerce/prometheus/configuration.rb Adds active_record_enabled (default true, opt-out via PROMETHEUS_ACTIVE_RECORD_ENABLED=0).
bc-prometheus-ruby.gemspec Adds activesupport as a development dependency. The integration uses ActiveSupport::Notifications at runtime, but activesupport is not added as a runtime dep because every Rails consumer already has it loaded — making it a hard runtime dep would force non-Rails consumers to install a heavy library they don't need.

Exposed metric

After scraping /metrics (with the default ruby_ prefix added by prometheus_exporter):

# HELP ruby_sql_query_duration_seconds ActiveRecord SQL query duration in seconds, labeled by operation.
# TYPE ruby_sql_query_duration_seconds histogram
ruby_sql_query_duration_seconds_bucket{operation="select",le="0.001"} ...
ruby_sql_query_duration_seconds_bucket{operation="select",le="0.005"} ...
...
ruby_sql_query_duration_seconds_count{operation="select"} ...
ruby_sql_query_duration_seconds_sum{operation="select"} ...

Default buckets: [0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10] seconds. Configurable via the type collector constructor (TypeCollectors::ActiveRecord.new(buckets: [...])).

Useful PromQL recipes for consumers

# p99 latency for SELECT queries on a service named "app-registry":
histogram_quantile(0.99,
  sum by (le) (rate(ruby_sql_query_duration_seconds_bucket{job="app-registry", operation="select"}[5m]))
)

# Mean latency:
sum(rate(ruby_sql_query_duration_seconds_sum{job="app-registry", operation="select"}[5m]))
/
sum(rate(ruby_sql_query_duration_seconds_count{job="app-registry", operation="select"}[5m]))

How was it tested?

Specs

bundle exec rspec
# 41 examples, 0 failures (+15 new, all passing)

Rubocop

bundle exec rubocop
# clean

End-to-end with a real consumer

Validated locally by pointing app-registry's Gemfile at this branch and:

  1. Adding Bigcommerce::Prometheus::TypeCollectors::ActiveRecord.new to the service's web_type_collectors (and to the hardcoded list in system/providers/grpc.rb).
  2. Adding a one-line initializer that calls Bigcommerce::Prometheus::Integrations::ActiveRecord.start.
  3. Restarting foreman start for app-registry.
  4. Triggering real ActiveRecord queries from a rails runner pointed at the grpc process's exporter (PROMETHEUS_SERVER_PORT=9371).

The exporter at http://localhost:9371/metrics then exposed:

ruby_sql_query_duration_seconds_count{operation="select"} 3
ruby_sql_query_duration_seconds_sum{operation="select"} 0.762
ruby_sql_query_duration_seconds_count{operation="update"} 1
ruby_sql_query_duration_seconds_sum{operation="update"} 0.103
ruby_sql_query_duration_seconds_count{operation="other"} 2
ruby_sql_query_duration_seconds_sum{operation="other"} 0.207

Counts and operation labels matched exactly what the runner did (2 SELECT + 1 UPDATE + 2 BEGIN/COMMIT classified as "other").

Backward compatibility

  • The integration is on by default but can be disabled per-service with PROMETHEUS_ACTIVE_RECORD_ENABLED=0.
  • No existing public APIs change. Existing services that don't touch ActiveRecord (no Rails, no AR) get nothing from this — the subscriber simply never fires.
  • activesupport is a dev dep only; non-Rails consumers don't pick up an unwanted runtime dep.

Out of scope

  • Per-table or per-controller breakdown.
  • Instrumenting non-DML statements (TRANSACTION, SCHEMA — currently rolled up under operation="other").
  • Wiring into the Resque instrumentor (Resque jobs do AR queries, but the existing Resque instrumentor in this gem is a pull-based collector and adding push-based AR there is a separate design question — left for follow-up).
  • The corresponding consumer change in app-registry. That's a separate PR that bumps this gem's version and adds 2-3 lines of wiring.

Backport

If a backport to the v0.6.x line is desired (for services still on Rack 2 / Thin), happy to open a separate PR — the changes apply cleanly there too (verified locally on a v0.6.0-based branch with the same specs passing).


Note

Medium Risk
Adds new runtime SQL instrumentation via ActiveSupport::Notifications and wires it into web/hutch processes, which could impact performance or metric volume if misconfigured. Risk is mitigated by opt-out (PROMETHEUS_ACTIVE_RECORD_ENABLED) and error-swallowing/idempotent startup.

Overview
Adds ActiveRecord SQL timing metrics by subscribing to sql.active_record notifications and pushing per-query durations (labeled by operation: select/insert/update/delete/other) to the exporter under a new bc_sql type.

Wires the new integration and sql_query_duration_seconds Histogram type collector into the web and hutch instrumentors (plus an active_record_enabled config flag to disable it), and includes specs plus an activesupport dev dependency to support the notification-based tests.

Reviewed by Cursor Bugbot for commit 2d41d61. Bugbot is set up for automated code reviews on this repo. Configure here.

@wilsonbeltran-bc wilsonbeltran-bc self-assigned this May 8, 2026
@wilsonbeltran-bc wilsonbeltran-bc requested a review from a team as a code owner May 8, 2026 18:12
Comment thread lib/bigcommerce/prometheus/instrumentors/hutch.rb Outdated
@wilsonbeltran-bc wilsonbeltran-bc force-pushed the ANA-6973-fix-failing-app-registry-dashboard-grafana-iam branch from 07787cc to 8aca5f5 Compare May 8, 2026 18:25
Comment thread lib/bigcommerce/prometheus/integrations/active_record.rb
Comment thread lib/bigcommerce/prometheus/instrumentors/web.rb Outdated
@wilsonbeltran-bc wilsonbeltran-bc force-pushed the ANA-6973-fix-failing-app-registry-dashboard-grafana-iam branch from 8aca5f5 to 1a57b29 Compare May 8, 2026 18:36
Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, have a team admin enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 1a57b29. Configure here.

Comment thread lib/bigcommerce/prometheus/integrations/active_record.rb
@wilsonbeltran-bc wilsonbeltran-bc force-pushed the ANA-6973-fix-failing-app-registry-dashboard-grafana-iam branch 2 times, most recently from 881e44b to 45d9bc6 Compare May 8, 2026 18:54
server.add_type_collector(tc)
end
server.start
::Bigcommerce::Prometheus::Integrations::ActiveRecord.start_safe(client: Bigcommerce::Prometheus.client, process_name: @process_name)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this start after the prometheus server is started?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, intentional. The subscriber pushes through Bigcommerce::Prometheus.client which is independent of the local server lifecycle, but starting it after server.start ensures the exporter is ready to receive when the first SQL event fires.


def setup_before_fork
@app.config.before_fork_callbacks = [] unless @app.config.before_fork_callbacks
process_name = @process_name
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not just pass the instance variable to your method? not sure I see why we need a local variable masking here

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed


def setup_after_fork
@app.config.after_fork_callbacks = [] unless @app.config.after_fork_callbacks
process_name = @process_name
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as above

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

# Wrapper for instrumentor wiring: registers the AR type collector on the given server,
# gated on the active_record_enabled config flag, and swallows errors so an additive
# feature failure cannot take down core instrumentation.
def self.register_type_collector(server, process_name: nil)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need this method? this can be done directly on the hutch and web processes with server.add_type_collector(::Bigcommerce::Prometheus::TypeCollectors::ActiveRecord.new) unless ::Bigcommerce::Prometheus.active_record_enabled

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The wrapper exists for two reasons flagged in earlier review:

Same wiring is needed in both Web and Hutch — Cursor's bugbot flagged the duplication when it was inline.
The wrapper rescues StandardError so a failure in the optional AR instrumentation doesn't take down setup_middleware / @collectors.each(&:start) — also flagged by the bot.

# Render-side counterpart to Integrations::ActiveRecord. Aggregates per-operation
# SQL query duration into a Prometheus Histogram exposed at /metrics.
#
class ActiveRecord < Bigcommerce::Prometheus::TypeCollectors::Base
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like OTEL would be a better place to get this info from. Not opposed to adding this here, but we already get this info OOB from the OTEL collectors

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OTEL captures this at the span level, which is great for tracing. The motivation here is to fix existing Grafana panels that query Thanos/Prometheus for aggregate p99/p95 SQL latency over 5m windows — that's much cheaper as a precomputed histogram than running aggregations over OTEL traces in Lightstep. Both can coexist; this isn't replacing OTEL, just exposing the same signal in the format the dashboards already expect.

@wilsonbeltran-bc wilsonbeltran-bc force-pushed the ANA-6973-fix-failing-app-registry-dashboard-grafana-iam branch from 45d9bc6 to 2d41d61 Compare May 11, 2026 19:16
@wilsonbeltran-bc wilsonbeltran-bc requested a review from exeDog May 11, 2026 19:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants