Public status pages v1: live today + 90-day history + RSS feed by lewispb · Pull Request #79 · basecamp/upright

lewispb · 2026-05-13T14:34:56Z

Summary

First PR for the status-page feature — a read-only view that
public users can hit at status.<host> (or a CNAMEd custom domain) to
see whether each user-facing service is operational. Off by default;
opt in via Upright.configuration.public_status_enabled = true.

The page renders an overall banner, a degraded-services list with outage
durations, and a 90-day uptime bar per service. Today's status comes
live from Prometheus, past days from the daily rollup table — both
flow through a single DailyStatus collection so the view is agnostic
to the source.

Data model

Upright::Service — FrozenRecord loaded from config/services.yml.
New public: flag drives the public_facing scope. Class methods
(overall_status, by_history, degraded) own the collection logic
so controllers/views stay thin.
Upright::Status — plain module with VALUES, PRIORITY, and a
pure for(uptime_fraction) threshold mapper. Both ProbeRollup and
Service depend on it; nobody reaches through ProbeRollup to map a
fraction.
Upright::Rollups::ProbeRollup — one row per (probe, day) with
uptime_fraction and an enum status derived in a before_save
callback (so status can't drift from fraction).
Upright::Rollups::DailyAggregationJob — recurring hourly job
iterating past.ago.to_date..Date.yesterday (today is in progress
and represented live, not persisted).
Upright::Services::LiveStatus (concern) — reads
upright:probe_down_fraction from Prometheus for today's status and
the most recent outage's start time.
Upright::Service::DailyStatus — value object representing one
day; carries status + optional fraction + a tooltip helper.
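
As a rough illustration of the Status module described above, here is a minimal sketch in plain Ruby. The module name `StatusSketch`, the `worst` helper, and every threshold value are assumptions made for the example; the PR does not state the real cut-offs, and the worst-first ordering of PRIORITY is inferred from how overall_status is described.

```ruby
# Hypothetical stand-in for Upright::Status; thresholds are illustrative only.
module StatusSketch
  VALUES = %i[operational degraded partial_outage major_outage].freeze

  # Assumed ordering: worst first, so the first match is the most severe.
  PRIORITY = VALUES.reverse.freeze

  # Assumed cut-offs; the PR does not say what the real ones are.
  THRESHOLDS = { operational: 0.999, degraded: 0.99, partial_outage: 0.95 }.freeze

  # Pure mapping from an uptime fraction (0.0..1.0) to a status symbol.
  # nil means "no data" and maps to nil rather than :operational.
  def self.for(uptime_fraction)
    return nil if uptime_fraction.nil?

    THRESHOLDS.each { |status, floor| return status if uptime_fraction >= floor }
    :major_outage
  end

  # Worst status present in a list, defaulting to :operational when empty.
  def self.worst(statuses)
    PRIORITY.find { |status| statuses.include?(status) } || :operational
  end
end
```

Keeping the mapper free of any model dependency is what lets both the rollup and the service call it directly, for example `StatusSketch.for(0.97)` or `StatusSketch.worst(%i[operational degraded])`.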

Page UI

Upright::Public::ServicesController#index powers everything. Views
under app/views/upright/public/services/:

_overall_banner — worst current status across services
_degraded_list — currently-degraded services with outage duration
_service + _uptime_bar — per-service row + 90-day strip

Helpers (Upright::Public::ServicesHelper) own the status-to-label
mapping and outage-duration phrasing.

RSS feed

Same action serves RSS at /feed (route forces format: :rss,
template at index.rss.builder). One item per currently-degraded
service, keyed on outage start time so feed readers see each new
outage. Empty channel envelope when all clear. The layout dispatches
by format — public.html.erb for the page, public.rss.builder for
the feed.
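
To show why keying items on the outage start time makes each new outage appear as a new entry, here is a small sketch. The `feed_guid` helper name and the `"gmail"` service code are hypothetical; the string format follows the guid line quoted in the review diff.

```ruby
# Hypothetical helper mirroring the guid format "<service code>-<epoch seconds>".
def feed_guid(service_code, started_at)
  "#{service_code}-#{started_at.to_i}"
end

first_outage  = Time.utc(2026, 5, 13, 9, 0, 0)
second_outage = Time.utc(2026, 5, 14, 18, 30, 0)

# Re-fetching the feed during the same outage yields the same guid,
# so a reader treats it as an item it has already seen.
same_item = feed_guid("gmail", first_outage) == feed_guid("gmail", first_outage)

# A later outage of the same service produces a different guid.
new_item = feed_guid("gmail", first_outage) != feed_guid("gmail", second_outage)
```

One consequence worth keeping in mind: if the start time is unknown and a fallback such as the current time is substituted, the guid would change on every render.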

HTTP caching

expires_in 15.seconds, public: true on the action sends
Cache-Control: max-age=15, public for both formats; Rails' default
ETag (body-derived) handles conditional GETs. Load-bearing — SQLite +
an outage-driven traffic spike was the failure mode the original todo
flagged.
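
The interaction between the short max-age and a body-derived ETag can be sketched in plain Ruby as below. This is a simplified illustration, not Rails' actual implementation: the `etag_for` and `respond` helpers are hypothetical and the digest choice is an assumption.

```ruby
require "digest"

# Simplified body-derived ETag; Rails' real middleware differs in detail.
def etag_for(body)
  "W/\"#{Digest::SHA256.hexdigest(body)[0, 32]}\""
end

# Returns [status, headers, body] in the style of a Rack response.
# A matching If-None-Match short-circuits to 304 with an empty body.
def respond(body, if_none_match: nil)
  etag    = etag_for(body)
  headers = { "Cache-Control" => "max-age=15, public", "ETag" => etag }
  return [304, headers, ""] if if_none_match == etag

  [200, headers, body]
end
```

Within the 15-second window a shared cache can answer without reaching the app at all; after it expires, a revalidation whose body is unchanged costs only a 304.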

Routes

constraints public_status do
  scope module: :public, as: :public do
    root "services#index", as: :services_root
    get "feed", to: "services#index", as: :services_feed, defaults: { format: :rss }
  end
end

Subdomain constraint requires both public_status_enabled AND the
request to hit Upright.configuration.public_status_subdomain (or a
configured CNAME). Routes don't exist on other subdomains, so the
admin app is unaffected.
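
A constraint of this shape can be sketched as an object responding to `matches?(request)`, which is the interface Rails route constraints accept. The class name, constructor arguments, and `FakeRequest` struct are hypothetical; the PR does not show how `public_status` is actually built.

```ruby
# Hypothetical constraint: enabled flag AND (status subdomain OR allowed CNAME host).
class PublicStatusConstraintSketch
  def initialize(enabled:, subdomain:, custom_domains: [])
    @enabled        = enabled
    @subdomain      = subdomain
    @custom_domains = custom_domains
  end

  # Rails calls matches?(request) when deciding whether the routes apply.
  def matches?(request)
    return false unless @enabled

    request.subdomain == @subdomain || @custom_domains.include?(request.host)
  end
end

# Minimal stand-in for the two request attributes the sketch reads.
FakeRequest = Struct.new(:host, :subdomain, keyword_init: true)
```

When `matches?` returns false the routes inside the block are never considered, which is why disabling the flag makes both URLs fall through to a 404.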

What's deferred

These were called out in the original UI todo but depend on later work:

ComponentsController#show — per-service 90-day page. The
per-row bars in the index already render the same data; standalone
per-service pages can land later.
IncidentsController#index/show + richer RSS items — need the
incidents domain (separate todo).
Statuspage-compatible API + webhooks — separate todo.

Test plan

bin/rails test passes (174 tests)
On a dev host with public_status_enabled=true, hit
status.<host>/ and confirm the page renders
curl -i status.<host>/ shows Cache-Control: max-age=15, public
and a text/html Content-Type
curl -i status.<host>/feed shows
Content-Type: application/rss+xml and a valid <rss>
envelope
Toggle config.public_status_enabled = false and confirm both
URLs 404 (route falls through the subdomain constraint)

CLAUDE.md @-imports AGENTS.md so Claude Code, Codex, and other tools all pick up the same notes.

Adds Upright.configuration.public_status_enabled (off by default) and public_status_custom_domains for CNAME support. Routes constrain on the 'status' subdomain so both status.<hostname> and CNAMEd customer domains hit Upright::Public::* controllers.

Upright::Service is a FrozenRecord loaded from config/services.yml. Probes opt in via 'service: <code>' in their YAML — Probeable defaults probe_service to try(:service), so no probe-class changes are needed. Probeable also self-registers including classes so Service#probes can iterate them without a central registry.

ProbeRollup.aggregate_day reads upright:probe_uptime_daily from Prometheus at end-of-day and upserts one row per probe with the uptime_fraction and a derived status (operational, degraded_performance, partial_outage, major_outage). ServiceRollup.aggregate_day takes min(uptime_fraction) of the day's ProbeRollups per service, so a service is only as healthy as its worst component. DailyAggregationJob just orchestrates per day — all logic lives on the rollup models. Requires the host app's upright:probe_uptime_daily recording rule to preserve probe_service in its 'by' clauses.

- Inline the enum values rather than naming a STATUSES constant nothing else references.
- Fold the nil/operational guards into the case/when so status_for is one expression.
- Use Time.now to pair with Date.today elsewhere in the rollup path (both Ruby stdlib clock sources, no AS time-zone coupling).

ServiceRollup was a materialized min(uptime_fraction) of the day's ProbeRollups, grouped by probe_service. That's cheap to compute on demand from ProbeRollup, so the extra table, write step, and aggregation-lag window weren't earning their keep. Upright::Service now exposes uptime_for(day), status_for(day), and daily_uptime(days:) — all backed by ProbeRollup queries. The job only aggregates ProbeRollup; ServiceRollup.aggregate_day is gone. Also switches the job to a lookback: Duration kwarg, iterating lookback.ago.to_date..Date.today. Default 1.day mirrors the previous behaviour (yesterday + today).
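
The on-demand computation that replaces the materialized table can be illustrated with plain Ruby over in-memory rows. The `RollupRow` struct and the standalone `uptime_for` function are hypothetical stand-ins for the ProbeRollup query; only the min-of-probes rule comes from the commit message.

```ruby
require "date"

# Hypothetical in-memory stand-in for a ProbeRollup row.
RollupRow = Struct.new(:probe_name, :probe_service, :period_start, :uptime_fraction,
                       keyword_init: true)

# Minimum uptime across a service's probes on a given day.
# Returns nil when the service has no rollups for that day.
def uptime_for(rows, service:, day:)
  rows.select { |row| row.probe_service == service && row.period_start == day }
      .map(&:uptime_fraction)
      .min
end
```

Because the result is derived at read time, there is no second table to keep in sync and no lag between probe rollups and the service-level figure.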

bin/seed-prometheus now emits a probe_service label on upright:probe_uptime_daily and upright_probe_up, then runs DailyAggregationJob with a 30-day lookback so ProbeRollup is populated against the seeded series. test/dummy/probes/*.yml gain matching service: attributes so a live probe run produces the same probe_service label as the seed, keeping the rollup path consistent end-to-end.

Flips the dummy app's public_status_enabled, gives the status controller a real action that loads services plus a 90-day window of days, and adds the show view rendering each service with a 90-day uptime bar strip. A new public layout keeps the page free of the admin's signed_in? layout chrome.

Adds a status banner that surfaces the worst service's status for today, a per-service row with the current status label, and a 90-bar uptime strip with per-day tooltips and an average uptime % below. CSS lives alongside the existing app/assets/stylesheets/upright files, which the layout's upright_stylesheet_link_tag globs in automatically. Uses the project's OKLCH design tokens; adds a small status color palette (operational/degraded/partial/major) keyed off the rollup status enum.

The previous distribution only generated uptimes >= 0.92, so partial_outage and major_outage bars never appeared on the status page. Updated each probe's distribution to occasionally hit those tiers, plus deterministic incident days (a MultiProxy major outage 22 days ago, a Gmail partial outage 12 days ago) so the service-level min-of-probes rollup visibly reflects them.

status_for(nil) returns :operational so any rollup that's missing for a day was getting the operational green colour, making empty days look identical to perfect-uptime days. Only assign a status class when the day actually has a rollup; otherwise the bar falls back to --status-none (neutral grey).
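
A view helper along these lines would express the fix. The helper name and the `uptime-bar--` class prefix are hypothetical; only the rule (no class when the day has no rollup, so the neutral fallback colour applies) comes from the commit message.

```ruby
# Hypothetical helper: emit a status modifier class only when the day has data.
# A nil status means "no rollup", so the bar keeps its neutral default colour.
def uptime_bar_class(status)
  return nil if status.nil?

  "uptime-bar--#{status.to_s.tr('_', '-')}"
end
```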

Three stacked problems were silently dropping seeded uptime samples:

* DailyAggregationJob queries upright:probe_uptime_daily at day.end_of_day, but the seed emitted samples at NOW - day*86400, and Prometheus's 5-min lookback rejected anything off by more than a few hours. Anchor each day's sample at 23:59 UTC, clamping today to NOW.
* Default 15d retention dropped seeded blocks older than 15 days. Bump to 90d to match the public status page window.
* The recording rule grouped by (name, type, probe_target) and stripped probe_service. Add probe_service to both by clauses so ProbeRollup can tie rollups back to their Service.

Also wait until the OLDEST seeded block is queryable before invoking the job. Prometheus loads blocks oldest-last, so racing ahead produced partial rollups.
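
The timestamp anchoring from the first fix can be sketched as follows. This is a Ruby rendering of the arithmetic only; the `sample_time` name is hypothetical and the seed script itself may be written quite differently.

```ruby
require "date"

# Timestamp for the sample representing the day `days_ago` days back:
# 23:59:00 UTC on that day, clamped so today's sample never lands in the future.
def sample_time(days_ago, now: Time.now.utc)
  day      = now.to_date - days_ago
  anchored = Time.utc(day.year, day.month, day.day, 23, 59, 0)
  [anchored, now].min
end
```

Anchoring near the end of each day keeps every sample close to the instant the job queries (`day.end_of_day`), instead of drifting by however far into the day the seed happened to run.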

Today is still in progress, so persisting an aggregate from an incomplete day just produces a stale value the rest of the day. The public status page now shows today from live Prometheus state instead. Rename the job's `lookback:` keyword to `past:` since it's a Duration, and cap the range at Date.yesterday. Switch ProbeRollup.fetch_uptime_for from Time.now to Time.current so the iso8601 query time matches Prometheus's UTC samples — the comparison was epoch-correct either way, but the rendered timestamp picked up the system offset.
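
The resulting date window can be illustrated without ActiveSupport. The `days_to_rollup` name and the integer `past_days` argument are assumptions for the example; the job itself takes a Duration and uses `past.ago.to_date..Date.yesterday`.

```ruby
require "date"

# Days the job would aggregate: from `past_days` ago through yesterday.
# Today is excluded because it is still in progress.
def days_to_rollup(past_days, today: Date.today)
  ((today - past_days)..(today - 1)).to_a
end
```

With a window of zero days the range is empty, so the job has nothing to persist and today is only ever shown from live state.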

Add a `public:` flag to services.yml so only public-facing services show up. Render today live from Prometheus (Service#live_status, via a new LiveStatus concern that reads upright:probe_down_fraction) and past days from ProbeRollup, presented through a single DailyStatus value object so the view is agnostic to the source. Move collection logic onto Service: `overall_status`, `by_history`, `degraded` (with `current_outage_started_at` for outage duration in the banner-adjacent list). The controller is now a one-liner. Extract a StatusHelper and four partials (overall_banner, degraded_list, service, uptime_bar). Rename the `degraded_performance` enum to `degraded` and pull the enum order into Status::VALUES/PRIORITY so the overall_status calculation can reuse it. Swap ProbeRollup tests to fixtures and `travel_to` for stable dates. Add a `:month_day` Date format ("%b %-d") so DailyStatus#tooltip can use to_fs instead of strftime.

The Status concern's helpers and Service collection methods are short enough that an early return obscures rather than clarifies the happy-path expression — wrap them instead. Collapse the three stacked `return nil if` guards in `current_outage_started_at` into a single conditional, leaning on `rindex` returning nil for both empty arrays and no-match.
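
The `rindex` point can be shown with a small sketch. The sample shape (chronological `[time, down_fraction]` pairs) and the function name are assumptions; the PR does not show the real body of `current_outage_started_at`.

```ruby
# samples: chronological [time, down_fraction] pairs (assumed shape).
# The outage is taken to start at the first sample after the last healthy one.
def outage_started_at(samples)
  # rindex returns nil both for an empty array and when no sample is healthy,
  # so one conditional covers every "unknown start" case.
  last_up = samples.rindex { |_, down| down.zero? }
  samples[last_up + 1]&.first if last_up
end
```

When the most recent sample is healthy, the lookup past the end of the array yields nil as well, so the method also returns nil when there is no current outage.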

The public status page renders a collection of services, not a singular status resource. Rename to align with the standard collection→index Rails convention. The status page is now Upright::Public::ServicesController#index, served from `services/index.html.erb`, with helpers in Upright::Public::ServicesHelper.

Status was a Rails concern under Upright::Rollups, which meant the only way to call its `status_for` mapping was through an including class — forcing Service to reach into `Upright::Rollups::ProbeRollup.status_for(...)` for a concept that has nothing to do with rollups. Promote it to a plain module at Upright::Status with VALUES, PRIORITY, and a pure `for(uptime_fraction)`. ProbeRollup declares `enum :status, Upright::Status::VALUES` directly and owns its own `uptime_percentage`. Service and LiveStatus call `Upright::Status.for(fraction)` without the detour. `Upright::Status.for(nil)` now returns nil (a missing rollup is no-data, not :operational) so callers can drop the `fraction && ...` guard. Service#status_for(day) drops out — `live_status` and `daily_status_history` cover every remaining caller.

Drop the dedicated upsert_day method in favor of `find_or_create_by` with a block — inserts the rollup when the (probe_name, period_start) slot is empty and leaves an existing rollup alone. Move the fraction → status derivation into a before_save callback so the rollup's status can never drift from its uptime_fraction. aggregate_day no longer has to spell it out. Rename the per-element variable from `sample` (Prometheus jargon) to `probe_uptime` and the `:name` key to `:probe_name` so the hash matches the rollup column it'll populate. Tests switch to fixtures, dropping the `delete_all` setup and the ad-hoc `create!`, and gain coverage for the no-op-on-existing-rollup path and the before_save callback.

ServicesController#index now sets `Cache-Control: max-age=15, public` (plus the body-derived ETag Rails adds by default) so an outage-driven traffic spike doesn't tear through SQLite. Same TTL on both representations. The same action also responds to RSS at /feed (route defaults format to :rss), listing each currently-degraded service as a feed item keyed on the outage's start time. Channel envelope still renders when nothing is degraded. Layout dispatch is format-driven: `app/views/layouts/upright/public.html.erb` wraps the page, `public.rss.builder` is a one-line passthrough so the RSS body flows through unchanged.

Copilot

Pull request overview

Introduces the first public status-page surface (HTML + RSS) served from a dedicated status subdomain (optionally via custom CNAME hosts), backed by a unified 90-day DailyStatus history that combines today’s live Prometheus state with persisted daily rollups.

Changes:

Added a public status page (HTML) and RSS feed endpoint gated behind public_status_enabled and a subdomain route constraint.
Introduced service/status domain objects (Upright::Service, Upright::Status, Upright::Service::DailyStatus) plus Prometheus-backed live status and a persisted daily rollup model/job.
Updated Prometheus rule templates + dummy dev tooling/fixtures to emit and aggregate 90 days of service-labeled uptime.


Reviewed changes

Copilot reviewed 43 out of 44 changed files in this pull request and generated 6 comments.

File	Description
test/models/upright/status_test.rb	Adds unit tests for uptime-fraction → status mapping.
test/models/upright/service_test.rb	Adds tests for service loading, public scope, and rollup-backed uptime queries.
test/models/upright/rollups/probe_rollup_test.rb	Tests rollup persistence, derived status, and rollup-day behavior.
test/jobs/upright/rollups/daily_aggregation_job_test.rb	Verifies the aggregation job excludes today and handles empty windows.
test/integration/public/services_controller_test.rb	Integration coverage for HTML and RSS responses + caching header.
test/fixtures/upright_rollups_probe_rollups.yml	Fixtures for probe rollup history used by service tests.
test/dummy/probes/traceroute_probes.yml	Adds `service` mapping for traceroute probes in the dummy app.
test/dummy/probes/smtp_probes.yml	Adds `service` mapping for SMTP probes in the dummy app.
test/dummy/probes/http_probes.yml	Adds `service` mapping for HTTP probes in the dummy app.
test/dummy/docker-compose.yml	Extends Prometheus retention to 90 days for the dummy environment.
test/dummy/db/schema.rb	Updates dummy schema with new rollups table.
test/dummy/config/services.yml	Defines dummy `Upright::Service` records (public + internal).
test/dummy/config/recurring.yml	Schedules rollup aggregation job in dummy recurring config.
test/dummy/config/prometheus/rules/upright.yml	Updates dummy Prometheus rule grouping to keep `probe_service`.
test/dummy/config/initializers/upright.rb	Enables public status in dummy initializer for development/testing.
test/dummy/bin/seed-prometheus	Seeds 90 days of service-labeled metrics and runs aggregation.
lib/upright/engine.rb	Prints public status URL hint in debug callback when enabled.
lib/upright/configuration.rb	Adds public-status config + custom domain host allowlisting.
lib/generators/upright/install/templates/upright.rules.yml	Updates install template rules to group by `probe_service`.
db/migrate/20260512000001_create_upright_rollups.rb	Creates the `upright_rollups_probe_rollups` table and indexes.
config/routes.rb	Adds public-status constrained routes for root + `/feed`.
config/initializers/mime_types.rb	Registers `application/rss+xml` MIME type.
config/initializers/date_formats.rb	Adds a `:month_day` date format for tooltips.
CLAUDE.md	Points to `AGENTS.md`.
app/views/upright/public/services/index.rss.builder	RSS feed template for degraded services.
app/views/upright/public/services/index.html.erb	Public status index page layout.
app/views/upright/public/services/_uptime_bar.html.erb	Renders a single day “bar” in the uptime strip.
app/views/upright/public/services/_service.html.erb	Renders a service row + 90-day strip + summary uptime.
app/views/upright/public/services/_overall_banner.html.erb	Renders the overall status banner.
app/views/upright/public/services/_degraded_list.html.erb	Renders the list of currently degraded services.
app/views/layouts/upright/public.rss.builder	RSS layout wrapper.
app/views/layouts/upright/public.html.erb	Public HTML layout for status pages.
app/models/upright/status.rb	Defines status values/priority + uptime-threshold mapping.
app/models/upright/service/daily_status.rb	Value object for a single day’s status, fraction, and tooltip.
app/models/upright/service.rb	FrozenRecord-backed service model + history/degraded/overall logic.
app/models/upright/rollups/probe_rollup.rb	Rollup model + Prometheus fetch and persistence behavior.
app/models/concerns/upright/services/live_status.rb	Live Prometheus status + outage start inference.
app/models/concerns/upright/probeable.rb	Adds probe-class tracking and maps probes to a `service`.
app/jobs/upright/rollups/daily_aggregation_job.rb	Job to roll up completed days into probe rollups.
app/helpers/upright/public/services_helper.rb	Labels, outage phrasing, and uptime-average helpers.
app/controllers/upright/public/services_controller.rb	Public controller index with short `expires_in` caching.
app/controllers/upright/public/base_controller.rb	Base controller for public pages with dedicated layout.
app/assets/stylesheets/upright/public_status.css	Styling for the public status page components.
AGENTS.md	Contributor guidance for engine-local workflows.

+  def self.overall_status
+    Upright::Status::PRIORITY.find { |status| all.any? { |service| service.live_status == status } } || :operational
+  end


+  def self.degraded
+    all.filter_map do |service|
+      status = service.live_status
+      unless status == :operational
+        { service: service, status: status, started_at: service.current_outage_started_at }
+      end
+    end


+  def self.rollup_day(day)
+    fetch_uptime_for(day).each do |probe_uptime|
+      find_or_create_by(probe_name: probe_uptime.fetch(:probe_name), period_start: day.beginning_of_day) do |rollup|
+        rollup.probe_service   = probe_uptime[:probe_service]
+        rollup.uptime_fraction = probe_uptime.fetch(:uptime_fraction)
+      end
+    end
+  end
+
+  def self.fetch_uptime_for(day)
+    query_time = [ day.end_of_day, Time.current ].min
+
+    response = prometheus_client.query(query: PROMETHEUS_METRIC, time: query_time.iso8601).deep_symbolize_keys


    def configure_allowed_hosts
      port_suffix = Rails.env.local? ? "(:\\d+)?" : ""
-      Rails.application.config.hosts = [ /.*\.#{Regexp.escape(hostname)}#{port_suffix}/, /#{Regexp.escape(hostname)}#{port_suffix}/ ]
+      hosts = [ /.*\.#{Regexp.escape(hostname)}#{port_suffix}/, /#{Regexp.escape(hostname)}#{port_suffix}/ ]
+      Array(@public_status_custom_domains).each do |domain|
+        hosts << /\A#{Regexp.escape(domain)}#{port_suffix}\z/
+      end
+      Rails.application.config.hosts = hosts
      Rails.application.config.action_dispatch.tld_length = 1
    end
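
The effect of the anchored, escaped pattern added for each custom domain can be checked in isolation. The `custom_domain_pattern` helper name and the example hostnames are hypothetical; the regex construction is copied from the diff above.

```ruby
# Builds the same anchored pattern the diff adds for each custom domain.
def custom_domain_pattern(domain, port_suffix: "")
  /\A#{Regexp.escape(domain)}#{port_suffix}\z/
end

pattern      = custom_domain_pattern("status.example.com")
with_port    = custom_domain_pattern("status.example.com", port_suffix: "(:\\d+)?")

exact_match  = pattern.match?("status.example.com")
# Regexp.escape keeps "." literal, so look-alike hosts do not match.
lookalike    = pattern.match?("statusXexample.com")
# \A and \z stop a longer host from matching on a prefix.
suffixed     = pattern.match?("status.example.com.evil.test")
local_port   = with_port.match?("status.example.com:3000")
```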


+xml.rss(version: "2.0") do
+  xml.channel do
+    xml.title "Upright Status"
+    xml.link upright.public_services_root_url


+      started_at = issue[:started_at] || Time.current
+      xml.item do
+        xml.title "#{issue[:service].name} — #{status_label(issue[:status])}"
+        xml.description "#{issue[:service].name} is currently #{status_label(issue[:status]).downcase} #{outage_duration_phrase(started_at: issue[:started_at])}."
+        xml.pubDate started_at.rfc822
+        xml.guid "#{issue[:service].code}-#{started_at.to_i}", isPermaLink: "false"


lewispb added 20 commits May 13, 2026 15:34

Add AGENTS.md with engine test and DB commands

4711717

CLAUDE.md @-imports AGENTS.md so Claude Code, Codex, and other tools all pick up the same notes.

Rename ProbeRollup.aggregate_day to rollup_day

d64f6ff

lewispb changed the title ~~Public status pages: scaffolding + Service + Rollups~~ Public status pages v1: live today + 90-day history + RSS feed May 15, 2026

lewispb marked this pull request as ready for review May 15, 2026 13:03

Copilot AI review requested due to automatic review settings May 15, 2026 13:03

Copilot started reviewing on behalf of lewispb May 15, 2026 13:04

Copilot AI reviewed May 15, 2026

lewispb wants to merge 20 commits into main from lewis/public-status-pages
