Public status pages v1: live today + 90-day history + RSS feed #79
Open
lewispb wants to merge 20 commits into
CLAUDE.md @-imports AGENTS.md so Claude Code, Codex, and other tools all pick up the same notes.
Adds Upright.configuration.public_status_enabled (off by default) and public_status_custom_domains for CNAME support. Routes constrain on the 'status' subdomain so both status.<hostname> and CNAMEd customer domains hit Upright::Public::* controllers.
Upright::Service is a FrozenRecord loaded from config/services.yml. Probes opt in via 'service: <code>' in their YAML — Probeable defaults probe_service to try(:service), so no probe-class changes are needed. Probeable also self-registers including classes so Service#probes can iterate them without a central registry.
ProbeRollup.aggregate_day reads upright:probe_uptime_daily from Prometheus at end-of-day and upserts one row per probe with the uptime_fraction and a derived status (operational, degraded_performance, partial_outage, major_outage). ServiceRollup.aggregate_day takes min(uptime_fraction) of the day's ProbeRollups per service, so a service is only as healthy as its worst component. DailyAggregationJob just orchestrates per day — all logic lives on the rollup models. Requires the host app's upright:probe_uptime_daily recording rule to preserve probe_service in its 'by' clauses.
- Inline the enum values rather than naming a STATUSES constant nothing else references.
- Fold the nil/operational guards into the case/when so status_for is one expression.
- Use Time.now to pair with Date.today elsewhere in the rollup path (both Ruby stdlib clock sources, no AS time-zone coupling).
ServiceRollup was a materialized min(uptime_fraction) of the day's ProbeRollups, grouped by probe_service. That's cheap to compute on demand from ProbeRollup, so the extra table, write step, and aggregation-lag window weren't earning their keep. Upright::Service now exposes uptime_for(day), status_for(day), and daily_uptime(days:) — all backed by ProbeRollup queries. The job only aggregates ProbeRollup; ServiceRollup.aggregate_day is gone. Also switches the job to a lookback: Duration kwarg, iterating lookback.ago.to_date..Date.today. Default 1.day mirrors the previous behaviour (yesterday + today).
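A minimal sketch of the on-demand calculation described above, using plain Ruby objects instead of ActiveRecord. `ProbeRollupRow` and the free-standing `uptime_for` are hypothetical stand-ins; the real `Service#uptime_for(day)` queries ProbeRollup and may be shaped differently.

```ruby
require "date"

# Hypothetical in-memory stand-in for one ProbeRollup row (not the real model).
ProbeRollupRow = Struct.new(:probe_name, :probe_service, :day, :uptime_fraction, keyword_init: true)

# Uptime of a service on a given day: the lowest uptime_fraction among the
# rollups of its probes. Returns nil when no rollup exists for that day.
def uptime_for(rollups, service_code, day)
  rollups
    .select { |rollup| rollup.probe_service == service_code && rollup.day == day }
    .map(&:uptime_fraction)
    .min
end
```

Returning nil for a day without rollups keeps "no data" distinguishable from a perfect day.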
bin/seed-prometheus now emits a probe_service label on upright:probe_uptime_daily and upright_probe_up, then runs DailyAggregationJob with a 30-day lookback so ProbeRollup is populated against the seeded series. test/dummy/probes/*.yml gain matching service: attributes so a live probe run produces the same probe_service label as the seed, keeping the rollup path consistent end-to-end.
Flips the dummy app's public_status_enabled, gives the status controller a real action that loads services plus a 90-day window of days, and adds the show view rendering each service with a 90-day uptime bar strip. A new public layout keeps the page free of the admin's signed_in? layout chrome.
Adds a status banner that surfaces the worst service's status for today, a per-service row with the current status label, and a 90-bar uptime strip with per-day tooltips and an average uptime % below. CSS lives alongside the existing app/assets/stylesheets/upright files, which the layout's upright_stylesheet_link_tag globs in automatically. Uses the project's OKLCH design tokens; adds a small status color palette (operational/degraded/partial/major) keyed off the rollup status enum.
The previous distribution only generated uptimes >= 0.92, so partial_outage and major_outage bars never appeared on the status page. Updated each probe's distribution to occasionally hit those tiers, plus deterministic incident days (a MultiProxy major outage 22 days ago, a Gmail partial outage 12 days ago) so the service-level min-of-probes rollup visibly reflects them.
status_for(nil) returns :operational so any rollup that's missing for a day was getting the operational green colour, making empty days look identical to perfect-uptime days. Only assign a status class when the day actually has a rollup; otherwise the bar falls back to --status-none (neutral grey).
Three stacked problems were silently dropping seeded uptime samples:
* DailyAggregationJob queries upright:probe_uptime_daily at day.end_of_day,
but the seed emitted samples at NOW - day*86400 — Prometheus's 5-min
lookback rejected anything off by more than a few hours. Anchor each
day's sample at 23:59 UTC, clamping today to NOW.
* Default 15d retention dropped seeded blocks older than 15 days. Bump to
90d to match the public status page window.
* The recording rule grouped by (name, type, probe_target) and stripped
probe_service. Add probe_service to both by clauses so ProbeRollup can
tie rollups back to their Service.
Also wait until the OLDEST seeded block is queryable before invoking the
job — Prometheus loads blocks oldest-last, so racing ahead produced
partial rollups.
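The timestamp anchoring from the first bullet can be sketched in Ruby as follows. The seed script itself may be written differently; `seed_sample_time` is a hypothetical helper name used only for illustration.

```ruby
# Timestamp for the seeded sample representing `days_ago` days before `now`:
# 23:59 UTC on that day, but never later than `now`, so today's sample is
# clamped to the current time.
def seed_sample_time(now, days_ago)
  day = now.getutc - (days_ago * 86_400)
  anchored = Time.utc(day.year, day.month, day.day, 23, 59)
  [anchored, now].min
end
```

With this anchoring, a query at the end of a past day finds a sample about a minute old, well inside the 5-minute lookback.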
Today is still in progress, so persisting an aggregate from an incomplete day just produces a stale value the rest of the day. The public status page now shows today from live Prometheus state instead. Rename the job's `lookback:` keyword to `past:` since it's a Duration, and cap the range at Date.yesterday. Switch ProbeRollup.fetch_uptime_for from Time.now to Time.current so the iso8601 query time matches Prometheus's UTC samples — the comparison was epoch-correct either way, but the rendered timestamp picked up the system offset.
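A rough stdlib-only sketch of the date window the job iterates. The job itself uses ActiveSupport (`past.ago.to_date..Date.yesterday`); here `past_days` is an integer and `days_to_aggregate` is a hypothetical name.

```ruby
require "date"

# Days to aggregate: from `past_days` days ago through yesterday.
# Today is excluded because it is still in progress.
def days_to_aggregate(today, past_days)
  (today - past_days)..(today - 1)
end
```

A `past_days` of 0 yields an empty range, so the job would have nothing to aggregate.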
Add a `public:` flag to services.yml so only public-facing services
show up. Render today live from Prometheus (Service#live_status, via a
new LiveStatus concern that reads upright:probe_down_fraction) and past
days from ProbeRollup, presented through a single DailyStatus value
object so the view is agnostic to the source.
Move collection logic onto Service: `overall_status`, `by_history`,
`degraded` (with `current_outage_started_at` for outage duration in the
banner-adjacent list). The controller is now a one-liner.
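For illustration, the `overall_status` calculation might look roughly like this when reduced to plain Ruby. The worst-first ordering in `STATUS_PRIORITY` is an assumption; the PR's `Status::PRIORITY` may be ordered or named differently.

```ruby
# Assumed ordering, worst first (hypothetical; not copied from the PR).
STATUS_PRIORITY = %i[major_outage partial_outage degraded operational].freeze

# The overall status is the worst live status any service currently reports.
# Unknown (nil) statuses are ignored; with nothing to report, fall back to
# :operational.
def overall_status(live_statuses)
  STATUS_PRIORITY.find { |status| live_statuses.include?(status) } || :operational
end
```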
Extract a StatusHelper and four partials (overall_banner, degraded_list,
service, uptime_bar). Rename the `degraded_performance` enum to
`degraded` and pull the enum order into Status::VALUES/PRIORITY so the
overall_status calculation can reuse it. Swap ProbeRollup tests to
fixtures and `travel_to` for stable dates.
Add a `:month_day` Date format ("%b %-d") so DailyStatus#tooltip can use
to_fs instead of strftime.
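As a sketch of what the `"%b %-d"` pattern produces, here is a stdlib-only version of a tooltip builder. The tooltip wording and the `tooltip` signature are assumptions; the PR's `DailyStatus#tooltip` goes through `to_fs(:month_day)` and may read differently.

```ruby
require "date"

MONTH_DAY = "%b %-d" # the pattern the PR registers as :month_day

# Builds a tooltip such as "May 3: Operational (99.50% uptime)".
# The uptime part is omitted when no fraction is available.
def tooltip(day, status_label, uptime_fraction = nil)
  text = "#{day.strftime(MONTH_DAY)}: #{status_label}"
  text += format(" (%.2f%% uptime)", uptime_fraction * 100) if uptime_fraction
  text
end
```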
The Status concern's helpers and Service collection methods are short enough that an early return obscures rather than clarifies the happy-path expression — wrap them instead. Collapse the three stacked `return nil if` guards in `current_outage_started_at` into a single conditional, leaning on `rindex` returning nil for both empty arrays and no-match.
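The `rindex` behaviour this relies on can be shown with a small sketch. The sample shape (`{ time:, down_fraction: }`) and the method body are assumptions for illustration, not the PR's actual `current_outage_started_at`.

```ruby
# samples: chronological array of { time:, down_fraction: } hashes (assumed shape).
# The outage starts at the first sample after the last fully-up sample.
# rindex returns nil for an empty array and when no sample matches, so one
# conditional covers both cases; the safe navigation covers "currently up".
def outage_started_at(samples)
  last_ok = samples.rindex { |sample| sample[:down_fraction].zero? }
  samples[last_ok + 1]&.fetch(:time) if last_ok
end
```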
The public status page renders a collection of services, not a singular status resource. Rename to align with the standard collection→index Rails convention. The status page is now Upright::Public::ServicesController#index, served from `services/index.html.erb`, with helpers in Upright::Public::ServicesHelper.
Status was a Rails concern under Upright::Rollups, which meant the only way to call its `status_for` mapping was through an including class — forcing Service to reach into `Upright::Rollups::ProbeRollup.status_for(...)` for a concept that has nothing to do with rollups. Promote it to a plain module at Upright::Status with VALUES, PRIORITY, and a pure `for(uptime_fraction)`. ProbeRollup declares `enum :status, Upright::Status::VALUES` directly and owns its own `uptime_percentage`. Service and LiveStatus call `Upright::Status.for(fraction)` without the detour. `Upright::Status.for(nil)` now returns nil (a missing rollup is no-data, not :operational) so callers can drop the `fraction && ...` guard. Service#status_for(day) drops out — `live_status` and `daily_status_history` cover every remaining caller.
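A minimal sketch of such a plain module follows. The threshold numbers are placeholders invented for illustration (the PR does not state them here), and `StatusSketch` is a hypothetical name chosen to avoid implying this is `Upright::Status` itself.

```ruby
module StatusSketch
  VALUES = %i[operational degraded partial_outage major_outage].freeze
  PRIORITY = VALUES.reverse.freeze # worst first (assumed ordering)

  # Minimum uptime fraction per status. These numbers are assumptions.
  THRESHOLDS = { operational: 0.99, degraded: 0.9, partial_outage: 0.5 }.freeze

  # Maps an uptime fraction to a status. nil (no rollup) maps to nil, so
  # "no data" is never reported as :operational.
  def self.for(uptime_fraction)
    unless uptime_fraction.nil?
      status, _minimum = THRESHOLDS.find { |_status, minimum| uptime_fraction >= minimum }
      status || :major_outage
    end
  end
end
```

Because the mapper is a pure function on a module, any caller can use it without including a concern.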
Drop the dedicated upsert_day method in favor of `find_or_create_by` with a block — inserts the rollup when the (probe_name, period_start) slot is empty and leaves an existing rollup alone. Move the fraction → status derivation into a before_save callback so the rollup's status can never drift from its uptime_fraction. aggregate_day no longer has to spell it out. Rename the per-element variable from `sample` (Prometheus jargon) to `probe_uptime` and the `:name` key to `:probe_name` so the hash matches the rollup column it'll populate. Tests switch to fixtures, dropping the `delete_all` setup and the ad-hoc `create!`, and gain coverage for the no-op-on-existing-rollup path and the before_save callback.
ServicesController#index now sets `Cache-Control: max-age=15, public` (plus the body-derived ETag Rails adds by default) so an outage-driven traffic spike doesn't tear through SQLite. Same TTL on both representations. The same action also responds to RSS at /feed (route defaults format to :rss), listing each currently-degraded service as a feed item keyed on the outage's start time. Channel envelope still renders when nothing is degraded. Layout dispatch is format-driven: `app/views/layouts/upright/public.html.erb` wraps the page, `public.rss.builder` is a one-line passthrough so the RSS body flows through unchanged.
Pull request overview
Introduces the first public status-page surface (HTML + RSS) served from a dedicated status subdomain (optionally via custom CNAME hosts), backed by a unified 90-day DailyStatus history that combines today’s live Prometheus state with persisted daily rollups.
Changes:
- Added a public status page (HTML) and RSS feed endpoint gated behind `public_status_enabled` and a subdomain route constraint.
- Introduced service/status domain objects (`Upright::Service`, `Upright::Status`, `Upright::Service::DailyStatus`) plus Prometheus-backed live status and a persisted daily rollup model/job.
- Updated Prometheus rule templates + dummy dev tooling/fixtures to emit and aggregate 90 days of service-labeled uptime.
Reviewed changes
Copilot reviewed 43 out of 44 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| test/models/upright/status_test.rb | Adds unit tests for uptime-fraction → status mapping. |
| test/models/upright/service_test.rb | Adds tests for service loading, public scope, and rollup-backed uptime queries. |
| test/models/upright/rollups/probe_rollup_test.rb | Tests rollup persistence, derived status, and rollup-day behavior. |
| test/jobs/upright/rollups/daily_aggregation_job_test.rb | Verifies the aggregation job excludes today and handles empty windows. |
| test/integration/public/services_controller_test.rb | Integration coverage for HTML and RSS responses + caching header. |
| test/fixtures/upright_rollups_probe_rollups.yml | Fixtures for probe rollup history used by service tests. |
| test/dummy/probes/traceroute_probes.yml | Adds service mapping for traceroute probes in the dummy app. |
| test/dummy/probes/smtp_probes.yml | Adds service mapping for SMTP probes in the dummy app. |
| test/dummy/probes/http_probes.yml | Adds service mapping for HTTP probes in the dummy app. |
| test/dummy/docker-compose.yml | Extends Prometheus retention to 90 days for the dummy environment. |
| test/dummy/db/schema.rb | Updates dummy schema with new rollups table. |
| test/dummy/config/services.yml | Defines dummy Upright::Service records (public + internal). |
| test/dummy/config/recurring.yml | Schedules rollup aggregation job in dummy recurring config. |
| test/dummy/config/prometheus/rules/upright.yml | Updates dummy Prometheus rule grouping to keep probe_service. |
| test/dummy/config/initializers/upright.rb | Enables public status in dummy initializer for development/testing. |
| test/dummy/bin/seed-prometheus | Seeds 90 days of service-labeled metrics and runs aggregation. |
| lib/upright/engine.rb | Prints public status URL hint in debug callback when enabled. |
| lib/upright/configuration.rb | Adds public-status config + custom domain host allowlisting. |
| lib/generators/upright/install/templates/upright.rules.yml | Updates install template rules to group by probe_service. |
| db/migrate/20260512000001_create_upright_rollups.rb | Creates the upright_rollups_probe_rollups table and indexes. |
| config/routes.rb | Adds public-status constrained routes for root + /feed. |
| config/initializers/mime_types.rb | Registers application/rss+xml MIME type. |
| config/initializers/date_formats.rb | Adds a :month_day date format for tooltips. |
| CLAUDE.md | Points to AGENTS.md. |
| app/views/upright/public/services/index.rss.builder | RSS feed template for degraded services. |
| app/views/upright/public/services/index.html.erb | Public status index page layout. |
| app/views/upright/public/services/_uptime_bar.html.erb | Renders a single day “bar” in the uptime strip. |
| app/views/upright/public/services/_service.html.erb | Renders a service row + 90-day strip + summary uptime. |
| app/views/upright/public/services/_overall_banner.html.erb | Renders the overall status banner. |
| app/views/upright/public/services/_degraded_list.html.erb | Renders the list of currently degraded services. |
| app/views/layouts/upright/public.rss.builder | RSS layout wrapper. |
| app/views/layouts/upright/public.html.erb | Public HTML layout for status pages. |
| app/models/upright/status.rb | Defines status values/priority + uptime-threshold mapping. |
| app/models/upright/service/daily_status.rb | Value object for a single day’s status, fraction, and tooltip. |
| app/models/upright/service.rb | FrozenRecord-backed service model + history/degraded/overall logic. |
| app/models/upright/rollups/probe_rollup.rb | Rollup model + Prometheus fetch and persistence behavior. |
| app/models/concerns/upright/services/live_status.rb | Live Prometheus status + outage start inference. |
| app/models/concerns/upright/probeable.rb | Adds probe-class tracking and maps probes to a service. |
| app/jobs/upright/rollups/daily_aggregation_job.rb | Job to roll up completed days into probe rollups. |
| app/helpers/upright/public/services_helper.rb | Labels, outage phrasing, and uptime-average helpers. |
| app/controllers/upright/public/services_controller.rb | Public controller index with short expires_in caching. |
| app/controllers/upright/public/base_controller.rb | Base controller for public pages with dedicated layout. |
| app/assets/stylesheets/upright/public_status.css | Styling for the public status page components. |
| AGENTS.md | Contributor guidance for engine-local workflows. |
Comment on lines +10 to +12

    def self.overall_status
      Upright::Status::PRIORITY.find { |status| all.any? { |service| service.live_status == status } } || :operational
    end

Comment on lines +18 to +24

    def self.degraded
      all.filter_map do |service|
        status = service.live_status
        unless status == :operational
          { service: service, status: status, started_at: service.current_outage_started_at }
        end
      end

Comment on lines +14 to +26

    def self.rollup_day(day)
      fetch_uptime_for(day).each do |probe_uptime|
        find_or_create_by(probe_name: probe_uptime.fetch(:probe_name), period_start: day.beginning_of_day) do |rollup|
          rollup.probe_service = probe_uptime[:probe_service]
          rollup.uptime_fraction = probe_uptime.fetch(:uptime_fraction)
        end
      end
    end

    def self.fetch_uptime_for(day)
      query_time = [ day.end_of_day, Time.current ].min

      response = prometheus_client.query(query: PROMETHEUS_METRIC, time: query_time.iso8601).deep_symbolize_keys
Comment on lines 140 to 148

      def configure_allowed_hosts
        port_suffix = Rails.env.local? ? "(:\\d+)?" : ""
    -   Rails.application.config.hosts = [ /.*\.#{Regexp.escape(hostname)}#{port_suffix}/, /#{Regexp.escape(hostname)}#{port_suffix}/ ]
    +   hosts = [ /.*\.#{Regexp.escape(hostname)}#{port_suffix}/, /#{Regexp.escape(hostname)}#{port_suffix}/ ]
    +   Array(@public_status_custom_domains).each do |domain|
    +     hosts << /\A#{Regexp.escape(domain)}#{port_suffix}\z/
    +   end
    +   Rails.application.config.hosts = hosts
        Rails.application.config.action_dispatch.tld_length = 1
      end
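The host matching in the snippet above can be tried in isolation with a sketch like the following. `allowed_host_patterns` and `host_allowed?` are hypothetical helpers; every pattern is anchored explicitly here because the sketch matches strings directly rather than going through Rails' host authorization.

```ruby
# Builds anchored host patterns for the engine hostname, its subdomains,
# and each custom status domain. `allow_port` mirrors the local-only
# port suffix in the snippet above.
def allowed_host_patterns(hostname, custom_domains, allow_port: false)
  port_suffix = allow_port ? "(:\\d+)?" : ""
  patterns = [ /\A(.+\.)?#{Regexp.escape(hostname)}#{port_suffix}\z/ ]
  Array(custom_domains).each do |domain|
    patterns << /\A#{Regexp.escape(domain)}#{port_suffix}\z/
  end
  patterns
end

# True when any pattern matches the given Host header value.
def host_allowed?(host, patterns)
  patterns.any? { |pattern| pattern.match?(host) }
end
```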
    xml.rss(version: "2.0") do
      xml.channel do
        xml.title "Upright Status"
        xml.link upright.public_services_root_url

Comment on lines +10 to +15

    started_at = issue[:started_at] || Time.current
    xml.item do
      xml.title "#{issue[:service].name} — #{status_label(issue[:status])}"
      xml.description "#{issue[:service].name} is currently #{status_label(issue[:status]).downcase} #{outage_duration_phrase(started_at: issue[:started_at])}."
      xml.pubDate started_at.rfc822
      xml.guid "#{issue[:service].code}-#{started_at.to_i}", isPermaLink: "false"
Summary
First PR for the status-page feature: a read-only view that public users can hit at `status.<host>` (or a CNAMEd custom domain) to see whether each user-facing service is operational. Off by default; opt in via `Upright.configuration.public_status_enabled = true`.
The page renders an overall banner, a degraded-services list with outage durations, and a 90-day uptime bar per service. Today's status comes live from Prometheus, past days from the daily rollup table; both flow through a single `DailyStatus` collection so the view is agnostic to the source.
Data model
- `Upright::Service`: FrozenRecord loaded from `config/services.yml`. New `public:` flag drives the `public_facing` scope. Class methods (`overall_status`, `by_history`, `degraded`) own the collection logic so controllers/views stay thin.
- `Upright::Status`: plain module with `VALUES`, `PRIORITY`, and a pure `for(uptime_fraction)` threshold mapper. Both ProbeRollup and Service depend on it; nobody reaches through ProbeRollup to map a fraction.
- `Upright::Rollups::ProbeRollup`: one row per (probe, day) with `uptime_fraction` and an enum `status` derived in a `before_save` callback (so status can't drift from fraction).
- `Upright::Rollups::DailyAggregationJob`: recurring hourly job iterating `past.ago.to_date..Date.yesterday` (today is in progress and represented live, not persisted).
- `Upright::Services::LiveStatus` (concern): reads `upright:probe_down_fraction` from Prometheus for today's status and the most recent outage's start time.
- `Upright::Service::DailyStatus`: value object representing one day; carries status + optional fraction + a tooltip helper.
Page UI
`Upright::Public::ServicesController#index` powers everything. Views under `app/views/upright/public/services/`:
- `_overall_banner`: worst current status across services
- `_degraded_list`: currently-degraded services with outage duration
- `_service` + `_uptime_bar`: per-service row + 90-day strip
Helpers (`Upright::Public::ServicesHelper`) own the status-to-label mapping and outage-duration phrasing.
RSS feed
Same action serves RSS at `/feed` (route forces `format: :rss`, template at `index.rss.builder`). One item per currently-degraded service, keyed on outage start time so feed readers see each new outage. Empty channel envelope when all clear. The layout dispatches by format: `public.html.erb` for the page, `public.rss.builder` for the feed.
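To illustrate why keying items on the outage start time matters, here is a stdlib-only sketch that builds the fields of one feed item. The `issue` hash shape and `feed_item` are assumptions; the real template uses Builder and the helpers in `ServicesHelper`.

```ruby
require "time"

# issue: { code:, name:, status_label:, started_at: } (assumed shape).
# The guid combines the service code with the outage start, so repeated
# renders of the same outage produce the same guid, and a new outage of
# the same service produces a new one.
def feed_item(issue)
  started_at = issue.fetch(:started_at)
  {
    title: "#{issue.fetch(:name)} - #{issue.fetch(:status_label)}",
    pub_date: started_at.rfc822,
    guid: "#{issue.fetch(:code)}-#{started_at.to_i}"
  }
end
```

A guid derived from the current time instead would change on every request, and feed readers would treat each fetch as a new outage.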
HTTP caching
`expires_in 15.seconds, public: true` on the action sends `Cache-Control: max-age=15, public` for both formats; Rails' default ETag (body-derived) handles conditional GETs. Load-bearing: SQLite + an outage-driven traffic spike was the failure mode the original todo flagged.
Routes
Subdomain constraint requires both `public_status_enabled` AND the request to hit `Upright.configuration.public_status_subdomain` (or a configured CNAME). Routes don't exist on other subdomains, so the admin app is unaffected.
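A constraint of this kind could be sketched as an object responding to `matches?(request)`, which is the interface Rails routing constraints use. `PublicStatusConstraint`, `SketchRequest`, and `SketchConfig` are hypothetical names; the PR's actual constraint may be a lambda or live elsewhere.

```ruby
# Hypothetical stand-ins for the request and configuration objects.
SketchRequest = Struct.new(:host, :subdomain, keyword_init: true)
SketchConfig = Struct.new(:public_status_enabled, :public_status_subdomain,
                          :public_status_custom_domains, keyword_init: true)

class PublicStatusConstraint
  def initialize(config)
    @config = config
  end

  # Matches only when the feature is enabled AND the request targets the
  # status subdomain or one of the configured custom domains.
  def matches?(request)
    @config.public_status_enabled &&
      (request.subdomain == @config.public_status_subdomain ||
        Array(@config.public_status_custom_domains).include?(request.host))
  end
end
```

When the constraint does not match, the routes are simply absent for that request, which is why disabling the flag results in a 404.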
What's deferred
These were called out in the original UI todo but depend on later work:
- `ComponentsController#show`: per-service 90-day page. The per-row bars in the index already render the same data; standalone per-service pages can land later.
- `IncidentsController#index/show` + richer RSS items: need the incidents domain (separate todo).
Test plan
- `bin/rails test` passes (174 tests)
- `public_status_enabled=true`, hit `status.<host>/` and confirm the page renders
- `curl -i status.<host>/` shows `Cache-Control: max-age=15, public` and a `text/html` Content-Type
- `curl -i status.<host>/feed` shows `Content-Type: application/rss+xml` and a valid `<rss>` envelope
- `config.public_status_enabled = false` and confirm both URLs 404 (route falls through the subdomain constraint)