85 changes: 85 additions & 0 deletions core-services/ship-status/component-monitor-config.yaml
@@ -554,3 +554,88 @@ components:
history_runs: 5
failed_runs_threshold: 3
# END: auto-generated build-farm entries

  # BEGIN: Release Controller entries
  - component_slug: "release-controller"
    sub_component_slug: "openshift-release"
    http_monitor:
      url: "https://openshift-release.apps.ci.l2s4.p1.openshiftapps.com"
      code: 200
      retry_after: 5s

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check retry_after values across all HTTP monitors
rg -n 'retry_after:' core-services/ship-status/component-monitor-config.yaml

Repository: openshift/release

Length of output: 449


🏁 Script executed:

# Check the context around the new retry_after values to see if there are explanatory comments
head -n 650 core-services/ship-status/component-monitor-config.yaml | tail -n 100

Repository: openshift/release

Length of output: 3537


🏁 Script executed:

# Check git history for when these lines were added
git log --oneline -p core-services/ship-status/component-monitor-config.yaml | head -n 200

Repository: openshift/release

Length of output: 6442


Confirm that retry_after: 5s is intentional for Release Controller and CRT services.

The Release Controller entries (lines 564–624) and CRT Services entries (lines 633–639) all use retry_after: 5s, while existing Prow monitors use retry_after: 4m (lines 8, 109). This 48× difference in retry frequency is consistent across all new entries, suggesting intentional design, but no comments explain the rationale for the difference.
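The 48× figure can be checked with a short calculation. This is a minimal sketch, assuming `retry_after` values are simple `<number><unit>` strings; the helper names `parse_duration` and `retry_ratio` are hypothetical and not part of the monitor's code.

```python
import re

# Assumed unit table: only the suffixes that could plausibly appear in the config.
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}


def parse_duration(text: str) -> int:
    """Convert a simple duration such as '5s' or '4m' into seconds."""
    match = re.fullmatch(r"(\d+)([smh])", text.strip())
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    value, unit = match.groups()
    return int(value) * _UNIT_SECONDS[unit]


def retry_ratio(slow: str, fast: str) -> float:
    """Return how many times longer the slow interval is than the fast one."""
    return parse_duration(slow) / parse_duration(fast)


# Existing Prow monitors use 4m; the new entries use 5s.
print(retry_ratio("4m", "5s"))  # 48.0
```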

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@core-services/ship-status/component-monitor-config.yaml` at line 564, Check
and confirm whether the shorter retry interval is intentional: review the
Release Controller entries and CRT Services entries that set "retry_after: 5s"
and either (a) change them to match existing Prow monitors' "retry_after: 4m" if
they should follow the same backoff policy, or (b) keep "retry_after: 5s" but
add an inline comment above those entries explaining the rationale and risk
tradeoffs for the 5s interval; update the Release Controller and CRT Services
monitor blocks that currently contain "retry_after: 5s" accordingly so the
intent is explicit.

  - component_slug: "release-controller"
    sub_component_slug: "origin-release"
    http_monitor:
      url: "https://origin-release.apps.ci.l2s4.p1.openshiftapps.com"
      code: 200
      retry_after: 5s
  - component_slug: "release-controller"
    sub_component_slug: "openshift-release-arm64"
    http_monitor:
      url: "https://openshift-release-arm64.apps.ci.l2s4.p1.openshiftapps.com"
      code: 200
      retry_after: 5s
  - component_slug: "release-controller"
    sub_component_slug: "openshift-release-multi"
    http_monitor:
      url: "https://openshift-release-multi.apps.ci.l2s4.p1.openshiftapps.com"
      code: 200
      retry_after: 5s
  - component_slug: "release-controller"
    sub_component_slug: "openshift-release-ppc64le"
    http_monitor:
      url: "https://openshift-release-ppc64le.apps.ci.l2s4.p1.openshiftapps.com"
      code: 200
      retry_after: 5s
  - component_slug: "release-controller"
    sub_component_slug: "openshift-release-s390x"
    http_monitor:
      url: "https://openshift-release-s390x.apps.ci.l2s4.p1.openshiftapps.com"
      code: 200
      retry_after: 5s
  - component_slug: "release-controller"
    sub_component_slug: "openshift-release-priv"
    http_monitor:
      url: "https://openshift-release-priv.apps.ci.l2s4.p1.openshiftapps.com"
      code: 403
      retry_after: 5s
  - component_slug: "release-controller"
    sub_component_slug: "openshift-release-arm64-priv"
    http_monitor:
      url: "https://openshift-release-arm64-priv.apps.ci.l2s4.p1.openshiftapps.com"
      code: 403
      retry_after: 5s
  - component_slug: "release-controller"
    sub_component_slug: "openshift-release-multi-priv"
    http_monitor:
      url: "https://openshift-release-multi-priv.apps.ci.l2s4.p1.openshiftapps.com"
      code: 403
      retry_after: 5s
  - component_slug: "release-controller"
    sub_component_slug: "openshift-release-ppc64le-priv"
    http_monitor:
      url: "https://openshift-release-ppc64le-priv.apps.ci.l2s4.p1.openshiftapps.com"
      code: 403
      retry_after: 5s
  - component_slug: "release-controller"
    sub_component_slug: "openshift-release-s390x-priv"
    http_monitor:
      url: "https://openshift-release-s390x-priv.apps.ci.l2s4.p1.openshiftapps.com"
      code: 403
      retry_after: 5s
  # END: Release Controller entries
Comment on lines +558 to +625

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Missing monitor configuration for dpcr-openshift-release sub-component.

The dashboard configuration (dashboard-config.yaml lines 535-542) includes a dpcr-openshift-release sub-component, but there is no corresponding monitor entry in this file. This will cause the dashboard to display a component without any health monitoring.
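A cross-check of this kind could be automated. The following is a rough sketch, not the project's actual validation: it assumes that a dashboard name maps to a monitor slug by lowercasing and replacing spaces with hyphens (a rule inferred from the names in the two files), and the function names `slugify` and `missing_monitors` are hypothetical. The data is copied by hand from the diff rather than parsed from YAML.

```python
def slugify(name: str) -> str:
    """Assumed name-to-slug rule: lowercase, spaces become hyphens."""
    return name.strip().lower().replace(" ", "-")


def missing_monitors(
    dashboard: dict[str, list[str]],
    monitored: set[tuple[str, str]],
) -> list[tuple[str, str]]:
    """Return (component_slug, sub_component_slug) pairs that have no monitor entry."""
    missing = []
    for component, sub_components in dashboard.items():
        for sub in sub_components:
            key = (slugify(component), slugify(sub))
            if key not in monitored:
                missing.append(key)
    return missing


# Sub-components that have a monitor entry in component-monitor-config.yaml.
release_subs = [
    "openshift-release", "origin-release",
    "openshift-release-arm64", "openshift-release-multi",
    "openshift-release-ppc64le", "openshift-release-s390x",
    "openshift-release-priv", "openshift-release-arm64-priv",
    "openshift-release-multi-priv", "openshift-release-ppc64le-priv",
    "openshift-release-s390x-priv",
]

# Sub-components listed in dashboard-config.yaml.
dashboard = {
    "Release Controller": release_subs + ["dpcr-openshift-release"],
    "CRT": ["Backstage", "CI-Search"],
}

monitored = {("release-controller", s) for s in release_subs}
monitored |= {("crt", "backstage"), ("crt", "ci-search")}

print(missing_monitors(dashboard, monitored))
# [('release-controller', 'dpcr-openshift-release')]
```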

📊 Proposed fix: Add monitor for dpcr-openshift-release

Add the following entry after line 624 (before the "END: Release Controller entries" comment):

       code: 403
       retry_after: 5s
+  - component_slug: "release-controller"
+    sub_component_slug: "dpcr-openshift-release"
+    http_monitor:
+      url: "https://openshift-release.apps.cr.j7t7.p1.openshiftapps.com"
+      code: 200
+      retry_after: 5s
   # END: Release Controller entries
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
  # BEGIN: Release Controller entries
  - component_slug: "release-controller"
    sub_component_slug: "openshift-release"
    http_monitor:
      url: "https://openshift-release.apps.ci.l2s4.p1.openshiftapps.com"
      code: 200
      retry_after: 5s
  - component_slug: "release-controller"
    sub_component_slug: "origin-release"
    http_monitor:
      url: "https://origin-release.apps.ci.l2s4.p1.openshiftapps.com"
      code: 200
      retry_after: 5s
  - component_slug: "release-controller"
    sub_component_slug: "openshift-release-arm64"
    http_monitor:
      url: "https://openshift-release-arm64.apps.ci.l2s4.p1.openshiftapps.com"
      code: 200
      retry_after: 5s
  - component_slug: "release-controller"
    sub_component_slug: "openshift-release-multi"
    http_monitor:
      url: "https://openshift-release-multi.apps.ci.l2s4.p1.openshiftapps.com"
      code: 200
      retry_after: 5s
  - component_slug: "release-controller"
    sub_component_slug: "openshift-release-ppc64le"
    http_monitor:
      url: "https://openshift-release-ppc64le.apps.ci.l2s4.p1.openshiftapps.com"
      code: 200
      retry_after: 5s
  - component_slug: "release-controller"
    sub_component_slug: "openshift-release-s390x"
    http_monitor:
      url: "https://openshift-release-s390x.apps.ci.l2s4.p1.openshiftapps.com"
      code: 200
      retry_after: 5s
  - component_slug: "release-controller"
    sub_component_slug: "openshift-release-priv"
    http_monitor:
      url: "https://openshift-release-priv.apps.ci.l2s4.p1.openshiftapps.com"
      code: 403
      retry_after: 5s
  - component_slug: "release-controller"
    sub_component_slug: "openshift-release-arm64-priv"
    http_monitor:
      url: "https://openshift-release-arm64-priv.apps.ci.l2s4.p1.openshiftapps.com"
      code: 403
      retry_after: 5s
  - component_slug: "release-controller"
    sub_component_slug: "openshift-release-multi-priv"
    http_monitor:
      url: "https://openshift-release-multi-priv.apps.ci.l2s4.p1.openshiftapps.com"
      code: 403
      retry_after: 5s
  - component_slug: "release-controller"
    sub_component_slug: "openshift-release-ppc64le-priv"
    http_monitor:
      url: "https://openshift-release-ppc64le-priv.apps.ci.l2s4.p1.openshiftapps.com"
      code: 403
      retry_after: 5s
  - component_slug: "release-controller"
    sub_component_slug: "openshift-release-s390x-priv"
    http_monitor:
      url: "https://openshift-release-s390x-priv.apps.ci.l2s4.p1.openshiftapps.com"
      code: 403
      retry_after: 5s
  - component_slug: "release-controller"
    sub_component_slug: "dpcr-openshift-release"
    http_monitor:
      url: "https://openshift-release.apps.cr.j7t7.p1.openshiftapps.com"
      code: 200
      retry_after: 5s
  # END: Release Controller entries
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@core-services/ship-status/component-monitor-config.yaml` around lines 558 -
625, Add a missing monitor entry for the release-controller sub_component_slug
"dpcr-openshift-release": create a block matching the other release-controller
entries (component_slug "release-controller", sub_component_slug
"dpcr-openshift-release") with an http_monitor that points to the
dpcr-openshift-release service (the host given in dashboard-config.yaml,
https://openshift-release.apps.cr.j7t7.p1.openshiftapps.com), set the
expected code to 200 and retry_after to 5s, and insert it alongside the other
Release Controller entries just before the "END: Release Controller entries"
marker.


  # BEGIN: CRT Services entries
  - component_slug: "crt"
    sub_component_slug: "backstage"
    http_monitor:
      url: "https://backstage.dptools.openshift.org"
      code: 403
      retry_after: 5s
  - component_slug: "crt"
    sub_component_slug: "ci-search"
    http_monitor:
      url: "https://search.dptools.openshift.org"
      code: 200
      retry_after: 5s
  # END: CRT Services entries

119 changes: 119 additions & 0 deletions core-services/ship-status/dashboard-config.yaml
@@ -448,6 +448,125 @@ components:
- rover_group: "test-platform-ci-admins"
- service_account: "system:serviceaccount:ship-status:component-monitor"

- name: "Release Controller"
description: "Manages OpenShift release payloads, acceptance testing, and promotion"
ship_team: "CRT"
sub_components:
- name: "openshift-release"
description: "openshift-release.apps.ci.l2s4.p1.openshiftapps.com"
long_description: "The primary release controller for OpenShift Container Platform. Manages release payload assembly, acceptance job orchestration, and promotion of nightly and CI releases for the amd64 architecture."
monitoring:
frequency: 5m
component_monitor: "crt-component-monitor"
auto_resolve: true
requires_confirmation: false
- name: "origin-release"
description: "origin-release.apps.ci.l2s4.p1.openshiftapps.com"
long_description: "Release controller for the OKD (Origin) distribution. Manages the upstream community release pipeline separate from the downstream OCP releases."
monitoring:
frequency: 5m
component_monitor: "crt-component-monitor"
auto_resolve: true
requires_confirmation: false
- name: "openshift-release-arm64"
description: "openshift-release-arm64.apps.ci.l2s4.p1.openshiftapps.com"
monitoring:
frequency: 5m
component_monitor: "crt-component-monitor"
auto_resolve: true
requires_confirmation: false
- name: "openshift-release-multi"
description: "openshift-release-multi.apps.ci.l2s4.p1.openshiftapps.com"
monitoring:
frequency: 5m
component_monitor: "crt-component-monitor"
auto_resolve: true
requires_confirmation: false
- name: "openshift-release-ppc64le"
description: "openshift-release-ppc64le.apps.ci.l2s4.p1.openshiftapps.com"
monitoring:
frequency: 5m
component_monitor: "crt-component-monitor"
auto_resolve: true
requires_confirmation: false
- name: "openshift-release-s390x"
description: "openshift-release-s390x.apps.ci.l2s4.p1.openshiftapps.com"
monitoring:
frequency: 5m
component_monitor: "crt-component-monitor"
auto_resolve: true
requires_confirmation: false
- name: "openshift-release-priv"
description: "openshift-release-priv.apps.ci.l2s4.p1.openshiftapps.com"
long_description: "Handles release payloads for private/embargoed OCP content behind OAuth authentication."
monitoring:
frequency: 5m
component_monitor: "crt-component-monitor"
auto_resolve: true
requires_confirmation: false
- name: "openshift-release-arm64-priv"
description: "openshift-release-arm64-priv.apps.ci.l2s4.p1.openshiftapps.com"
monitoring:
frequency: 5m
component_monitor: "crt-component-monitor"
auto_resolve: true
requires_confirmation: false
- name: "openshift-release-multi-priv"
description: "openshift-release-multi-priv.apps.ci.l2s4.p1.openshiftapps.com"
monitoring:
frequency: 5m
component_monitor: "crt-component-monitor"
auto_resolve: true
requires_confirmation: false
- name: "openshift-release-ppc64le-priv"
description: "openshift-release-ppc64le-priv.apps.ci.l2s4.p1.openshiftapps.com"
monitoring:
frequency: 5m
component_monitor: "crt-component-monitor"
auto_resolve: true
requires_confirmation: false
- name: "openshift-release-s390x-priv"
description: "openshift-release-s390x-priv.apps.ci.l2s4.p1.openshiftapps.com"
monitoring:
frequency: 5m
component_monitor: "crt-component-monitor"
auto_resolve: true
requires_confirmation: false
- name: "dpcr-openshift-release"
description: "openshift-release.apps.cr.j7t7.p1.openshiftapps.com"
long_description: "Release controller instance running on the DPCR cluster, providing redundancy and serving the DPCR-hosted release pipeline."
monitoring:
frequency: 5m
component_monitor: "crt-component-monitor"
auto_resolve: true
requires_confirmation: false
owners:
- rover_group: "continuous-release-team"
- service_account: "system:serviceaccount:crt-argocd:component-monitor"
Comment on lines +543 to +545

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check if the crt-argocd service account exists and compare with referenced component-monitor

# Look for any references to crt-argocd namespace in the codebase
rg -n 'crt-argocd' --type yaml

# Look for component-monitor service account definitions
fd -e yaml -e yml | xargs rg -l 'kind:\s*ServiceAccount' | xargs rg -A5 'name:\s*component-monitor'

Repository: openshift/release

Length of output: 3189


🏁 Script executed:

# Search for all service accounts in crt-argocd namespace
rg -A3 'namespace:\s*crt-argocd' --type yaml | rg -B3 'kind:\s*ServiceAccount'

# Also search for any ServiceAccount definitions with crt-argocd context
fd -e yaml -e yml | xargs rg -B5 -A5 'namespace:\s*crt-argocd' | grep -A10 'ServiceAccount'

# Check if crt-argocd namespace is defined anywhere
rg 'namespace.*crt-argocd' --type yaml -B2 -A2

Repository: openshift/release

Length of output: 43


🏁 Script executed:

# Search for all references to crt-argocd in YAML files
rg 'crt-argocd' --type yaml -B2 -A2

# Search for all ServiceAccount definitions in the codebase
fd -e yaml -e yml -x grep -l 'kind:\s*ServiceAccount' {} \; | head -20

# Look for Release Controller and CRT related configs
rg -i 'release.*controller|continuous.*release' --type yaml -B1 -A1 | head -50

Repository: openshift/release

Length of output: 4665


🏁 Script executed:

# Search for all ServiceAccount definitions
rg 'kind:\s*ServiceAccount' --type yaml -A2

# Look for all namespaces in serviceaccount.yaml to understand the full context
cat clusters/build-clusters/common/ship-status/serviceaccount.yaml | head -100

Repository: openshift/release

Length of output: 50374


🏁 Script executed:

cat clusters/build-clusters/common/ship-status/serviceaccount.yaml

Repository: openshift/release

Length of output: 1048


Fix namespace reference for component-monitor service account.

The service account referenced at lines 545 and 568 uses system:serviceaccount:crt-argocd:component-monitor, but this account does not exist in the crt-argocd namespace. The component-monitor service account is defined only in the ship-status namespace (clusters/build-clusters/common/ship-status/serviceaccount.yaml). Update the references to use system:serviceaccount:ship-status:component-monitor to match the actual service account location, consistent with other components in this file.
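The mismatch can be illustrated with a small check. This is a sketch under stated assumptions: the `defined` set below only reflects what the analysis found in this repository (a `component-monitor` account in `ship-status`); an account created outside the repository would not appear in it. `ServiceAccountRef`, `parse_service_account`, and `undefined_owners` are hypothetical names.

```python
from typing import NamedTuple


class ServiceAccountRef(NamedTuple):
    namespace: str
    name: str


def parse_service_account(subject: str) -> ServiceAccountRef:
    """Split 'system:serviceaccount:<namespace>:<name>' into its parts."""
    parts = subject.split(":")
    if len(parts) != 4 or parts[:2] != ["system", "serviceaccount"]:
        raise ValueError(f"not a service account subject: {subject!r}")
    return ServiceAccountRef(namespace=parts[2], name=parts[3])


def undefined_owners(owners: list[str], defined: set[ServiceAccountRef]) -> list[str]:
    """Return owner subjects whose namespace/name pair is not in the defined set."""
    return [o for o in owners if parse_service_account(o) not in defined]


# Assumption: the only definition found in the repository.
defined = {ServiceAccountRef("ship-status", "component-monitor")}

owners = [
    "system:serviceaccount:ship-status:component-monitor",
    "system:serviceaccount:crt-argocd:component-monitor",
]

print(undefined_owners(owners, defined))
# ['system:serviceaccount:crt-argocd:component-monitor']
```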

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@core-services/ship-status/dashboard-config.yaml` around lines 543 - 545,
Update the incorrect namespace on the component-monitor service account
references: replace occurrences of
"system:serviceaccount:crt-argocd:component-monitor" with
"system:serviceaccount:ship-status:component-monitor" (the entries under the
owners list where the service account is specified) so they point to the actual
service account defined in the ship-status namespace and match other components
in this file.

- name: "CRT"
description: "Services managed by the Continuous Release Tooling team"
ship_team: "CRT"
sub_components:
- name: "Backstage"
description: "backstage.dptools.openshift.org"
long_description: "Backstage provides a centralized service catalog, documentation hub, and developer portal for the OpenShift CI platform. It aggregates team information, service ownership, and technical documentation into a single interface."
monitoring:
frequency: 5m
component_monitor: "crt-component-monitor"
auto_resolve: true
requires_confirmation: false
- name: "CI-Search"
description: "search.dptools.openshift.org"
long_description: "CI-Search indexes and provides full-text search across OpenShift CI test results, build logs, and job artifacts. It integrates with Bugzilla and Jira to help developers quickly find relevant test failures and correlate them with known issues."
monitoring:
frequency: 5m
component_monitor: "crt-component-monitor"
auto_resolve: true
requires_confirmation: false
owners:
- rover_group: "continuous-release-team"
- service_account: "system:serviceaccount:crt-argocd:component-monitor"

tags:
- name: "ci-frontend"
description: "User-facing web interfaces for viewing CI jobs and test results"