Public status pages v1: live today + 90-day history + RSS feed #79

Open
lewispb wants to merge 20 commits into main from lewis/public-status-pages

@lewispb lewispb commented May 13, 2026

Summary

First PR for the status-page feature — a read-only view that
public users can hit at status.<host> (or a CNAMEd custom domain) to
see whether each user-facing service is operational. Off by default;
opt in via Upright.configuration.public_status_enabled = true.

The page renders an overall banner, a degraded-services list with outage
durations, and a 90-day uptime bar per service. Today's status comes
live from Prometheus, past days from the daily rollup table — both
flow through a single DailyStatus collection so the view is agnostic
to the source.

Data model

  • Upright::Service — FrozenRecord loaded from config/services.yml.
    New public: flag drives the public_facing scope. Class methods
    (overall_status, by_history, degraded) own the collection logic
    so controllers/views stay thin.
  • Upright::Status — plain module with VALUES, PRIORITY, and a
    pure for(uptime_fraction) threshold mapper. Both ProbeRollup and
    Service depend on it; nobody reaches through ProbeRollup to map a
    fraction.
  • Upright::Rollups::ProbeRollup — one row per (probe, day) with
    uptime_fraction and an enum status derived in a before_save
    callback (so status can't drift from fraction).
  • Upright::Rollups::DailyAggregationJob — recurring hourly job
    iterating past.ago.to_date..Date.yesterday (today is in progress
    and represented live, not persisted).
  • Upright::Services::LiveStatus (concern) — reads
    upright:probe_down_fraction from Prometheus for today's status and
    the most recent outage's start time.
  • Upright::Service::DailyStatus — value object representing one
    day; carries status + optional fraction + a tooltip helper.
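
A minimal sketch of the threshold mapper and the worst-status lookup described above. The module name, the cut-off values, and the assumption that PRIORITY is simply VALUES reversed are illustrative; the PR description does not state the real thresholds.

```ruby
# Sketch in the spirit of Upright::Status. StatusSketch and THRESHOLDS are
# hypothetical names; the cut-offs are assumptions, not values from the PR.
module StatusSketch
  # Ordered best to worst. PRIORITY lists the worst first so that `find`
  # returns the worst status that is present.
  VALUES   = %i[operational degraded partial_outage major_outage].freeze
  PRIORITY = VALUES.reverse.freeze

  # Assumed minimum uptime fraction for each status, checked in order.
  THRESHOLDS = {
    operational: 0.999,
    degraded: 0.99,
    partial_outage: 0.95
  }.freeze

  # Pure mapping: nil in, nil out (a missing rollup means no data).
  def self.for(uptime_fraction)
    return nil if uptime_fraction.nil?

    THRESHOLDS.each do |status, minimum|
      return status if uptime_fraction >= minimum
    end
    :major_outage
  end

  # Worst status in the list, defaulting to :operational when none match.
  def self.worst(statuses)
    PRIORITY.find { |status| statuses.include?(status) } || :operational
  end
end
```

Keeping the mapper free of persistence concerns is what lets both the rollup model and the live Prometheus path call it directly.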

Page UI

Upright::Public::ServicesController#index powers everything. Views
under app/views/upright/public/services/:

  • _overall_banner — worst current status across services
  • _degraded_list — currently-degraded services with outage duration
  • _service + _uptime_bar — per-service row + 90-day strip

Helpers (Upright::Public::ServicesHelper) own the status-to-label
mapping and outage-duration phrasing.
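
For illustration, the two helper responsibilities could look roughly like this. The method names status_label and outage_duration_phrase(started_at:) appear in the feed template; the label wording, the `now:` parameter, and the phrasing rules below are assumptions.

```ruby
# Hypothetical stand-in for the helper module; not the PR's implementation.
module StatusPhrasingSketch
  # Assumed label wording.
  LABELS = {
    operational: "Operational",
    degraded: "Degraded",
    partial_outage: "Partial outage",
    major_outage: "Major outage"
  }.freeze

  module_function

  # Falls back to a neutral label when the status is unknown or nil.
  def status_label(status)
    LABELS.fetch(status, "No data")
  end

  # Phrases how long an outage has lasted; empty string when the start is unknown.
  def outage_duration_phrase(started_at:, now: Time.now)
    return "" if started_at.nil?

    minutes = ((now - started_at) / 60).floor
    if minutes < 1
      "for less than a minute"
    elsif minutes < 60
      "for #{minutes} #{minutes == 1 ? 'minute' : 'minutes'}"
    else
      hours = minutes / 60
      "for #{hours} #{hours == 1 ? 'hour' : 'hours'}"
    end
  end
end
```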

RSS feed

Same action serves RSS at /feed (route forces format: :rss,
template at index.rss.builder). One item per currently-degraded
service, keyed on outage start time so feed readers see each new
outage. Empty channel envelope when all clear. The layout dispatches
by format — public.html.erb for the page, public.rss.builder for
the feed.
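
A small sketch of why keying items on the outage start time works: the guid stays the same across renders of one outage and changes when a new outage begins. The Struct and method names are hypothetical; only the `code-epoch` guid shape is taken from the feed template.

```ruby
require "time" # provides Time#rfc822

# Hypothetical stand-in for one degraded-service entry.
module FeedEntrySketch
  Issue = Struct.new(:code, :name, :started_at, keyword_init: true)

  module_function

  # Builds the fields a feed item needs. The guid combines the service code
  # with the outage start as epoch seconds.
  def entry_for(issue, now: Time.now)
    started_at = issue.started_at || now
    {
      title: issue.name,
      pub_date: started_at.rfc822,
      guid: "#{issue.code}-#{started_at.to_i}"
    }
  end
end
```

One consequence of this shape: when the start time is unknown and the code falls back to the current time, the guid changes on every render, so feed readers may treat each fetch as a new item.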

HTTP caching

expires_in 15.seconds, public: true on the action sends
Cache-Control: max-age=15, public for both formats; Rails' default
ETag (body-derived) handles conditional GETs. This is load-bearing: an
outage-driven traffic spike hitting SQLite was the failure mode the
original todo flagged.
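
To make the conditional-GET half concrete, here is an illustration of a body-derived validator. It is not Rails' implementation (Rails differs in digest choice and uses weak validators); the module and method names are hypothetical.

```ruby
require "digest/sha2"

# Illustration only: derive a validator from the response body and answer
# 304 when the client already holds the same representation.
module ConditionalGetSketch
  module_function

  def etag_for(body)
    %("#{Digest::SHA256.hexdigest(body)}")
  end

  # Returns [status, headers, body] for a request carrying `if_none_match`.
  def respond(body, if_none_match: nil)
    etag = etag_for(body)
    headers = { "Cache-Control" => "max-age=15, public", "ETag" => etag }
    if if_none_match == etag
      [304, headers, ""]
    else
      [200, headers, body]
    end
  end
end
```

Within the 15-second window a shared cache can answer without touching the app at all; after that, an unchanged page still costs only a 304.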

Routes

constraints public_status do
  scope module: :public, as: :public do
    root "services#index", as: :services_root
    get "feed", to: "services#index", as: :services_feed, defaults: { format: :rss }
  end
end

Subdomain constraint requires both public_status_enabled AND the
request to hit Upright.configuration.public_status_subdomain (or a
configured CNAME). Routes don't exist on other subdomains, so the
admin app is unaffected.
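
A sketch of how such a constraint could be expressed. Rails accepts any object responding to matches?(request) as a route constraint; the class, the Config and Request structs, and the example hosts are stand-ins, not the PR's code.

```ruby
# Hypothetical constraint object with stand-in config and request types.
module PublicStatusConstraintSketch
  Config  = Struct.new(:enabled, :subdomain, :custom_domains, keyword_init: true)
  Request = Struct.new(:host, :subdomain, keyword_init: true)

  class Constraint
    def initialize(config)
      @config = config
    end

    # True only when the feature is on AND the request targets the status
    # subdomain or one of the configured custom domains.
    def matches?(request)
      return false unless @config.enabled

      request.subdomain == @config.subdomain ||
        @config.custom_domains.include?(request.host)
    end
  end
end
```

Because the feature flag is checked inside the constraint, disabling it makes the routes fall through to a 404 rather than requiring a separate guard in the controller.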

What's deferred

These were called out in the original UI todo but depend on later work:

  • ComponentsController#show — per-service 90-day page. The
    per-row bars in the index already render the same data; standalone
    per-service pages can land later.
  • IncidentsController#index/show + richer RSS items — need the
    incidents domain (separate todo).
  • Statuspage-compatible API + webhooks — separate todo.

Test plan

  • bin/rails test passes (174 tests)
  • On a dev host with public_status_enabled=true, hit
    status.<host>/ and confirm the page renders
  • curl -i status.<host>/ shows Cache-Control: max-age=15, public
    and a text/html Content-Type
  • curl -i status.<host>/feed shows
    Content-Type: application/rss+xml and a valid <rss>
    envelope
  • Toggle config.public_status_enabled = false and confirm both
    URLs 404 (route falls through the subdomain constraint)

lewispb added 20 commits May 13, 2026 15:34
CLAUDE.md @-imports AGENTS.md so Claude Code, Codex, and other tools
all pick up the same notes.
Adds Upright.configuration.public_status_enabled (off by default) and
public_status_custom_domains for CNAME support. Routes constrain on the
'status' subdomain so both status.<hostname> and CNAMEd customer
domains hit Upright::Public::* controllers.
Upright::Service is a FrozenRecord loaded from config/services.yml.
Probes opt in via 'service: <code>' in their YAML — Probeable defaults
probe_service to try(:service), so no probe-class changes are needed.
Probeable also self-registers including classes so Service#probes can
iterate them without a central registry.
ProbeRollup.aggregate_day reads upright:probe_uptime_daily from
Prometheus at end-of-day and upserts one row per probe with the
uptime_fraction and a derived status (operational, degraded_performance,
partial_outage, major_outage).

ServiceRollup.aggregate_day takes min(uptime_fraction) of the day's
ProbeRollups per service, so a service is only as healthy as its worst
component. DailyAggregationJob just orchestrates per day — all logic
lives on the rollup models.

Requires the host app's upright:probe_uptime_daily recording rule to
preserve probe_service in its 'by' clauses.
- Inline the enum values rather than naming a STATUSES constant nothing
  else references.
- Fold the nil/operational guards into the case/when so status_for is
  one expression.
- Use Time.now to pair with Date.today elsewhere in the rollup path
  (both Ruby stdlib clock sources, no AS time-zone coupling).
ServiceRollup was a materialized min(uptime_fraction) of the day's
ProbeRollups, grouped by probe_service. That's cheap to compute on
demand from ProbeRollup, so the extra table, write step, and
aggregation-lag window weren't earning their keep.

Upright::Service now exposes uptime_for(day), status_for(day), and
daily_uptime(days:) — all backed by ProbeRollup queries. The job only
aggregates ProbeRollup; ServiceRollup.aggregate_day is gone.

Also switches the job to a lookback: Duration kwarg, iterating
lookback.ago.to_date..Date.today. Default 1.day mirrors the previous
behaviour (yesterday + today).
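
The on-demand computation amounts to taking the minimum over a day's probe rollups. A plain-Ruby sketch, with an in-memory array standing in for the ProbeRollup query (the Rollup struct and its field names are assumptions):

```ruby
require "date"

# In-memory stand-in for rollup rows; the real model is ActiveRecord-backed.
module ServiceUptimeSketch
  Rollup = Struct.new(:probe_service, :day, :uptime_fraction, keyword_init: true)

  module_function

  # Worst probe uptime for a service on a day; nil when no rollup exists,
  # because min of an empty array is nil.
  def uptime_for(rollups, service:, day:)
    rollups
      .select { |r| r.probe_service == service && r.day == day }
      .map(&:uptime_fraction)
      .min
  end
end
```
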
bin/seed-prometheus now emits a probe_service label on
upright:probe_uptime_daily and upright_probe_up, then runs
DailyAggregationJob with a 30-day lookback so ProbeRollup is
populated against the seeded series.

test/dummy/probes/*.yml gain matching service: attributes so a live
probe run produces the same probe_service label as the seed, keeping
the rollup path consistent end-to-end.
Flips the dummy app's public_status_enabled, gives the status
controller a real action that loads services plus a 90-day window of
days, and adds the show view rendering each service with a 90-day
uptime bar strip. A new public layout keeps the page free of the
admin's signed_in? layout chrome.
Adds a status banner that surfaces the worst service's status for
today, a per-service row with the current status label, and a 90-bar
uptime strip with per-day tooltips and an average uptime % below.

CSS lives alongside the existing app/assets/stylesheets/upright files,
which the layout's upright_stylesheet_link_tag globs in automatically.
Uses the project's OKLCH design tokens; adds a small status color
palette (operational/degraded/partial/major) keyed off the rollup
status enum.
The previous distribution only generated uptimes >= 0.92, so partial_outage
and major_outage bars never appeared on the status page. Updated each probe's
distribution to occasionally hit those tiers, plus deterministic incident
days (a MultiProxy major outage 22 days ago, a Gmail partial outage 12 days
ago) so the service-level min-of-probes rollup visibly reflects them.
status_for(nil) returns :operational so any rollup that's missing for a
day was getting the operational green colour, making empty days look
identical to perfect-uptime days. Only assign a status class when the
day actually has a rollup; otherwise the bar falls back to
--status-none (neutral grey).
Three stacked problems were silently dropping seeded uptime samples:

  * DailyAggregationJob queries upright:probe_uptime_daily at day.end_of_day,
    but the seed emitted samples at NOW - day*86400 — Prometheus's 5-min
    lookback rejected anything off by more than a few hours. Anchor each
    day's sample at 23:59 UTC, clamping today to NOW.
  * Default 15d retention dropped seeded blocks older than 15 days. Bump to
    90d to match the public status page window.
  * The recording rule grouped by (name, type, probe_target) and stripped
    probe_service. Add probe_service to both by clauses so ProbeRollup can
    tie rollups back to their Service.

Also wait until the OLDEST seeded block is queryable before invoking the
job — Prometheus loads blocks oldest-last, so racing ahead produced
partial rollups.
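
The timestamp anchoring from the first bullet can be sketched as follows. This is Ruby for illustration only; the seed script's actual language and variable names are not shown here, and sample_time is a hypothetical helper.

```ruby
require "date"

module SeedTimestampSketch
  module_function

  # 23:59 UTC on `day`, clamped so today's sample never lands in the future.
  def sample_time(day, now: Time.now.utc)
    anchored = Time.utc(day.year, day.month, day.day, 23, 59, 0)
    [anchored, now].min
  end
end
```
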
Today is still in progress, so persisting an aggregate from an incomplete
day just produces a value that stays stale for the rest of the day. The public status
page now shows today from live Prometheus state instead. Rename the
job's `lookback:` keyword to `past:` since it's a Duration, and cap the
range at Date.yesterday.

Switch ProbeRollup.fetch_uptime_for from Time.now to Time.current so the
iso8601 query time matches Prometheus's UTC samples — the comparison was
epoch-correct either way, but the rendered timestamp picked up the
system offset.
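
The resulting window can be sketched in plain Ruby. The job itself uses an ActiveSupport Duration; the integer past_days parameter and the module name below are simplifications for illustration.

```ruby
require "date"

module AggregationWindowSketch
  module_function

  # Days to aggregate: `past_days` back from today, ending yesterday.
  # Today is excluded because it is still in progress.
  def days(past_days:, today: Date.today)
    ((today - past_days)..(today - 1)).to_a
  end
end
```

With past_days set to zero the range is empty, so the job has nothing to do rather than touching today.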
Add a `public:` flag to services.yml so only public-facing services
show up. Render today live from Prometheus (Service#live_status, via a
new LiveStatus concern that reads upright:probe_down_fraction) and past
days from ProbeRollup, presented through a single DailyStatus value
object so the view is agnostic to the source.

Move collection logic onto Service: `overall_status`, `by_history`,
`degraded` (with `current_outage_started_at` for outage duration in the
banner-adjacent list). The controller is now a one-liner.

Extract a StatusHelper and four partials (overall_banner, degraded_list,
service, uptime_bar). Rename the `degraded_performance` enum to
`degraded` and pull the enum order into Status::VALUES/PRIORITY so the
overall_status calculation can reuse it. Swap ProbeRollup tests to
fixtures and `travel_to` for stable dates.

Add a `:month_day` Date format ("%b %-d") so DailyStatus#tooltip can use
to_fs instead of strftime.
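
A plain-Ruby sketch of a one-day value object with a tooltip. It uses strftime with the same "%b %-d" pattern directly, since to_fs needs ActiveSupport; the Struct fields and tooltip wording are assumptions.

```ruby
require "date"

module DailyStatusSketch
  # Hypothetical shape: a date, a status symbol (or nil), an optional fraction.
  DailyStatus = Struct.new(:date, :status, :fraction, keyword_init: true) do
    def tooltip
      label = date.strftime("%b %-d") # e.g. "May 3"
      return "#{label}: no data" if status.nil?
      return "#{label}: #{status}" if fraction.nil?

      "#{label}: #{(fraction * 100).round(2)}% uptime"
    end
  end
end
```
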
The Status concern's helpers and Service collection methods are short
enough that an early return obscures rather than clarifies the
happy-path expression — wrap them instead. Collapse the three stacked
`return nil if` guards in `current_outage_started_at` into a single
conditional, leaning on `rindex` returning nil for both empty arrays
and no-match.
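
The rindex point can be shown with a small sketch. The Sample struct and the rule used here (the outage starts at the first sample after the most recent healthy one) are assumptions for illustration, not the PR's implementation.

```ruby
module OutageStartSketch
  # Hypothetical sample shape: a timestamp and the fraction of probes down.
  Sample = Struct.new(:time, :down_fraction, keyword_init: true)

  module_function

  # rindex returns nil for an empty array and for no match alike, so one
  # conditional covers both. If the latest sample is healthy, the index
  # past the end yields nil as well.
  def current_outage_started_at(samples)
    last_healthy = samples.rindex { |s| s.down_fraction.zero? }
    samples[last_healthy + 1]&.time if last_healthy
  end
end
```
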
The public status page renders a collection of services, not a singular
status resource. Rename to align with the standard collection→index
Rails convention.

The status page is now Upright::Public::ServicesController#index, served
from `services/index.html.erb`, with helpers in
Upright::Public::ServicesHelper.
Status was a Rails concern under Upright::Rollups, which meant the only
way to call its `status_for` mapping was through an including class —
forcing Service to reach into `Upright::Rollups::ProbeRollup.status_for(...)`
for a concept that has nothing to do with rollups.

Promote it to a plain module at Upright::Status with VALUES, PRIORITY,
and a pure `for(uptime_fraction)`. ProbeRollup declares `enum :status,
Upright::Status::VALUES` directly and owns its own `uptime_percentage`.
Service and LiveStatus call `Upright::Status.for(fraction)` without the
detour.

`Upright::Status.for(nil)` now returns nil (a missing rollup is no-data,
not :operational) so callers can drop the `fraction && ...` guard.
Service#status_for(day) drops out — `live_status` and
`daily_status_history` cover every remaining caller.
Drop the dedicated upsert_day method in favor of `find_or_create_by`
with a block — inserts the rollup when the (probe_name, period_start)
slot is empty and leaves an existing rollup alone.

Move the fraction → status derivation into a before_save callback so
the rollup's status can never drift from its uptime_fraction. aggregate_day
no longer has to spell it out.

Rename the per-element variable from `sample` (Prometheus jargon) to
`probe_uptime` and the `:name` key to `:probe_name` so the hash matches
the rollup column it'll populate.

Tests switch to fixtures, dropping the `delete_all` setup and the
ad-hoc `create!`, and gain coverage for the no-op-on-existing-rollup
path and the before_save callback.
ServicesController#index now sets `Cache-Control: max-age=15, public`
(plus the body-derived ETag Rails adds by default) so an outage-driven
traffic spike doesn't tear through SQLite. Same TTL on both
representations.

The same action also responds to RSS at /feed (route defaults format
to :rss), listing each currently-degraded service as a feed item keyed
on the outage's start time. Channel envelope still renders when nothing
is degraded.

Layout dispatch is format-driven: `app/views/layouts/upright/public.html.erb`
wraps the page, `public.rss.builder` is a one-line passthrough so the
RSS body flows through unchanged.
@lewispb lewispb changed the title from "Public status pages: scaffolding + Service + Rollups" to "Public status pages v1: live today + 90-day history + RSS feed" May 15, 2026
@lewispb lewispb marked this pull request as ready for review May 15, 2026 13:03
Copilot AI review requested due to automatic review settings May 15, 2026 13:03

Copilot AI left a comment

Pull request overview

Introduces the first public status-page surface (HTML + RSS) served from a dedicated status subdomain (optionally via custom CNAME hosts), backed by a unified 90-day DailyStatus history that combines today’s live Prometheus state with persisted daily rollups.

Changes:

  • Added a public status page (HTML) and RSS feed endpoint gated behind public_status_enabled and a subdomain route constraint.
  • Introduced service/status domain objects (Upright::Service, Upright::Status, Upright::Service::DailyStatus) plus Prometheus-backed live status and a persisted daily rollup model/job.
  • Updated Prometheus rule templates + dummy dev tooling/fixtures to emit and aggregate 90 days of service-labeled uptime.


Reviewed changes

Copilot reviewed 43 out of 44 changed files in this pull request and generated 6 comments.

File Description
test/models/upright/status_test.rb Adds unit tests for uptime-fraction → status mapping.
test/models/upright/service_test.rb Adds tests for service loading, public scope, and rollup-backed uptime queries.
test/models/upright/rollups/probe_rollup_test.rb Tests rollup persistence, derived status, and rollup-day behavior.
test/jobs/upright/rollups/daily_aggregation_job_test.rb Verifies the aggregation job excludes today and handles empty windows.
test/integration/public/services_controller_test.rb Integration coverage for HTML and RSS responses + caching header.
test/fixtures/upright_rollups_probe_rollups.yml Fixtures for probe rollup history used by service tests.
test/dummy/probes/traceroute_probes.yml Adds service mapping for traceroute probes in the dummy app.
test/dummy/probes/smtp_probes.yml Adds service mapping for SMTP probes in the dummy app.
test/dummy/probes/http_probes.yml Adds service mapping for HTTP probes in the dummy app.
test/dummy/docker-compose.yml Extends Prometheus retention to 90 days for the dummy environment.
test/dummy/db/schema.rb Updates dummy schema with new rollups table.
test/dummy/config/services.yml Defines dummy Upright::Service records (public + internal).
test/dummy/config/recurring.yml Schedules rollup aggregation job in dummy recurring config.
test/dummy/config/prometheus/rules/upright.yml Updates dummy Prometheus rule grouping to keep probe_service.
test/dummy/config/initializers/upright.rb Enables public status in dummy initializer for development/testing.
test/dummy/bin/seed-prometheus Seeds 90 days of service-labeled metrics and runs aggregation.
lib/upright/engine.rb Prints public status URL hint in debug callback when enabled.
lib/upright/configuration.rb Adds public-status config + custom domain host allowlisting.
lib/generators/upright/install/templates/upright.rules.yml Updates install template rules to group by probe_service.
db/migrate/20260512000001_create_upright_rollups.rb Creates the upright_rollups_probe_rollups table and indexes.
config/routes.rb Adds public-status constrained routes for root + /feed.
config/initializers/mime_types.rb Registers application/rss+xml MIME type.
config/initializers/date_formats.rb Adds a :month_day date format for tooltips.
CLAUDE.md Points to AGENTS.md.
app/views/upright/public/services/index.rss.builder RSS feed template for degraded services.
app/views/upright/public/services/index.html.erb Public status index page layout.
app/views/upright/public/services/_uptime_bar.html.erb Renders a single day “bar” in the uptime strip.
app/views/upright/public/services/_service.html.erb Renders a service row + 90-day strip + summary uptime.
app/views/upright/public/services/_overall_banner.html.erb Renders the overall status banner.
app/views/upright/public/services/_degraded_list.html.erb Renders the list of currently degraded services.
app/views/layouts/upright/public.rss.builder RSS layout wrapper.
app/views/layouts/upright/public.html.erb Public HTML layout for status pages.
app/models/upright/status.rb Defines status values/priority + uptime-threshold mapping.
app/models/upright/service/daily_status.rb Value object for a single day’s status, fraction, and tooltip.
app/models/upright/service.rb FrozenRecord-backed service model + history/degraded/overall logic.
app/models/upright/rollups/probe_rollup.rb Rollup model + Prometheus fetch and persistence behavior.
app/models/concerns/upright/services/live_status.rb Live Prometheus status + outage start inference.
app/models/concerns/upright/probeable.rb Adds probe-class tracking and maps probes to a service.
app/jobs/upright/rollups/daily_aggregation_job.rb Job to roll up completed days into probe rollups.
app/helpers/upright/public/services_helper.rb Labels, outage phrasing, and uptime-average helpers.
app/controllers/upright/public/services_controller.rb Public controller index with short expires_in caching.
app/controllers/upright/public/base_controller.rb Base controller for public pages with dedicated layout.
app/assets/stylesheets/upright/public_status.css Styling for the public status page components.
AGENTS.md Contributor guidance for engine-local workflows.

Comment on lines +10 to +12
def self.overall_status
  Upright::Status::PRIORITY.find { |status| all.any? { |service| service.live_status == status } } || :operational
end
Comment on lines +18 to +24
def self.degraded
  all.filter_map do |service|
    status = service.live_status
    unless status == :operational
      { service: service, status: status, started_at: service.current_outage_started_at }
    end
  end
Comment on lines +14 to +26
def self.rollup_day(day)
  fetch_uptime_for(day).each do |probe_uptime|
    find_or_create_by(probe_name: probe_uptime.fetch(:probe_name), period_start: day.beginning_of_day) do |rollup|
      rollup.probe_service = probe_uptime[:probe_service]
      rollup.uptime_fraction = probe_uptime.fetch(:uptime_fraction)
    end
  end
end

def self.fetch_uptime_for(day)
  query_time = [ day.end_of_day, Time.current ].min

  response = prometheus_client.query(query: PROMETHEUS_METRIC, time: query_time.iso8601).deep_symbolize_keys
Comment on lines 140 to 148
  def configure_allowed_hosts
    port_suffix = Rails.env.local? ? "(:\\d+)?" : ""
-   Rails.application.config.hosts = [ /.*\.#{Regexp.escape(hostname)}#{port_suffix}/, /#{Regexp.escape(hostname)}#{port_suffix}/ ]
+   hosts = [ /.*\.#{Regexp.escape(hostname)}#{port_suffix}/, /#{Regexp.escape(hostname)}#{port_suffix}/ ]
+   Array(@public_status_custom_domains).each do |domain|
+     hosts << /\A#{Regexp.escape(domain)}#{port_suffix}\z/
+   end
+   Rails.application.config.hosts = hosts
    Rails.application.config.action_dispatch.tld_length = 1
  end
xml.rss(version: "2.0") do
  xml.channel do
    xml.title "Upright Status"
    xml.link upright.public_services_root_url
Comment on lines +10 to +15
started_at = issue[:started_at] || Time.current
xml.item do
  xml.title "#{issue[:service].name} — #{status_label(issue[:status])}"
  xml.description "#{issue[:service].name} is currently #{status_label(issue[:status]).downcase} #{outage_duration_phrase(started_at: issue[:started_at])}."
  xml.pubDate started_at.rfc822
  xml.guid "#{issue[:service].code}-#{started_at.to_i}", isPermaLink: "false"