HYPERFLEET-988 - docs: define and document E2E test tier classification criteria by rafabene · Pull Request #88 · openshift-hyperfleet/hyperfleet-e2e

rafabene · 2026-05-04T14:50:27Z

Summary

Created test-design/README.md with comprehensive tier classification guide: explicit criteria (blast radius, user impact, data integrity, frequency), decision flowchart, and 5 concrete examples per tier drawn from existing test cases
Updated pkg/labels/labels.go to reference the new guide for full definitions
Updated test-design/templates/testcase-template.md to mention the tier classification guide
Spot-checked all 50+ existing test cases across 9 files -- all tier assignments are consistent with the documented criteria

Test plan

make check passes (vet, lint, unit tests)
All markdown code blocks have language identifiers (MD040)
All anchor links in README.md verified against actual headings in test case files
No runtime code changes -- documentation only

Closes: HYPERFLEET-988

Summary by CodeRabbit

Documentation
- Added comprehensive E2E test design documentation defining tier classification (Tier0/Tier1/Tier2), CI/CD gating rules, decision criteria, and a decision flowchart.
- Introduced orthogonal test labels (negative, perf, upgrade, disruptive, slow) and guidance on combining them with tiers.
- Updated test case template to reference the new design doc and clarified severity label scope and linking to full criteria.

…on criteria

openshift-ci · 2026-05-04T14:50:32Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign ma-hill for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

coderabbitai · 2026-05-04T14:53:22Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: cf62d9d4-d352-495c-9a77-dcb0886e5bcb

📥 Commits

Reviewing files that changed from the base of the PR and between 71e6649 and 6f6b559.

📒 Files selected for processing (1)

test-design/README.md

Walkthrough

Added a new test-design README documenting HyperFleet E2E test design: a three-tier test classification (Tier0/Tier1/Tier2) with CI/CD gating rules, decision questions, and a tier-decision flowchart; introduced orthogonal test labels (negative, perf, upgrade, disruptive, slow) and step-by-step test-case authoring guidance. Updated a comment in pkg/labels/labels.go to reference the new test-design README and adjusted the test case template header to point to the tier classification guide. No code or exported symbols were changed.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~2 minutes

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly summarizes the main change: defining and documenting E2E test tier classification criteria across three files with criteria, flowchart, and examples.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

rafabene · 2026-05-04T14:59:49Z

/retest

pnguyen44 · 2026-05-04T15:43:36Z

+```text
+Is this a core lifecycle happy path (create/update/delete)?
+  |
+  +-- YES --> Tier0


The flowchart only checks for "core lifecycle happy path" but Tier0 also covers things like platform guarantees (e.g. adapter ordering) and data integrity. Someone using just the flowchart could accidentally mark those as Tier1. Maybe broaden the first question to include all the Tier0 criteria? Or if that makes the flowchart too complex, could drop it since the criteria bullets already cover the same ground.

Good catch — the ambiguity matters since this drives Prow job configuration. Tier1 is intended as non-blocking (advisory). Updated the table to make this explicit:

| Tier | CI/CD Gate | Response SLA |
|------|-----------||--------------|
| Tier0 | Blocks release | Fix immediately |
| Tier1 | Advisory (non-blocking) | Fix before next sprint ends |
| Tier2 | Informational | Fix when capacity allows |

This also makes the Tier1 vs Tier2 distinction clearer: Tier1 has a sprint SLA, Tier2 has no deadline.

Good catch — the flowchart's first question was too narrow. Updated it to cover all four Tier0 criteria:

Does this test validate a core Tier0 concern? (lifecycle happy path, data integrity, platform guarantee, or system-wide blast radius)

This keeps the flowchart concise while ensuring all Tier0 criteria are represented in the first decision point.

pnguyen44 · 2026-05-04T15:46:21Z

+| Tier | CI/CD Gate | Response SLA |
+|------|-----------|--------------|
+| **Tier0** | Blocks release | Fix immediately |
+| **Tier1** | Should be addressed before release | Fix before next sprint ends |


Does Tier1 block the release pipeline or not? "Should be addressed" is ambiguous. If it blocks, say "Blocks release" (maybe with a lower urgency SLA than Tier0). If it doesn't block, clarify that it's advisory so the distinction from Tier2 is clear.

pnguyen44 · 2026-05-04T15:50:23Z

+|-----------|-----------|
+| [Cluster reflects adapter failure in top-level status](testcases/cluster.md#test-title-cluster-can-reflect-adapter-failure-in-top-level-status) | Error handling -- validates failure reporting, not the happy path. Core operations work; this tests that failures are visible |
+| [PATCH to soft-deleted cluster returns 409](testcases/delete-cluster.md#test-title-patch-to-soft-deleted-cluster-returns-409-conflict) | API contract -- validates rejection of invalid mutations. The delete path works; this tests that the API guards against misuse |
+| [Concurrent cluster creations without conflicts](testcases/concurrent-processing.md#test-title-system-can-process-concurrent-cluster-creations-without-resource-conflicts) | Secondary workflow -- concurrent creation is important but not the primary single-cluster path |


Should this be Tier0 instead? If concurrent creations actually conflict with each other, that sounds more like data integrity than a secondary workflow.

Fair point — the name "without conflicts" does imply data integrity risk if it fails. The reasoning for Tier1 is that the single-cluster create path still works; this tests the concurrent/multi-tenant scenario. If concurrent creates actually produce corrupted resources (data integrity), I'd agree it should be Tier0. But if the failure mode is more like "one of the creates fails with a retryable conflict error," that's Tier1 error handling territory.

I'll clarify the "Why Tier1" justification to address this: the test validates that the system handles concurrent load gracefully, not that individual creates work. If the test scope actually covers data corruption scenarios, we should revisit the tier.

pnguyen44 · 2026-05-04T16:50:39Z

+|-----------|-----------|
+| [Cluster reaches correct status after adapter crash and recovery](testcases/cluster.md#test-title-cluster-can-reach-correct-status-after-adapter-crash-and-recovery) | Infrastructure recovery -- involves killing adapter pods and verifying self-healing. Crashes are rare; operators can restart pods manually |
+| [Maestro server unavailability graceful handling](testcases/adapter-with-maestro-transport.md#test-title-adapter-can-handle-maestro-server-unavailability-gracefully) | Infrastructure recovery -- simulates Maestro outage. Server failures are rare and recoverable |
+| [Stuck deletion -- adapter unable to finalize](testcases/delete-cluster.md#test-title-stuck-deletion----adapter-unable-to-finalize-prevents-hard-delete) | Error case -- adapter permanently fails to finalize. Requires specific adapter misconfiguration and manual intervention |


Is stuck deletion really Tier2? A cluster that can never finish deleting leaks resources over time. Feels more like Tier1 since the impact grows if no one catches it.

You're right — the compounding resource leak elevates this beyond a pure edge case. Moved "Stuck deletion" from Tier2 to Tier1 under the "Operational visibility" criterion. The distinction from crash recovery (which stays Tier2) is that crash recovery is self-healing, while stuck deletion silently leaks resources if nobody catches it.

Updated justification: Operational visibility — adapter permanently fails to finalize, causing silent resource leak that compounds over time if undetected.

pnguyen44 · 2026-05-04T16:55:10Z

+#### Decision questions
+
+- "If this test fails in production, can users still create/update/delete clusters and nodepools?" -- If **no**, it's Tier0


The decision questions in each tier restate the criteria in question form without adding new info. Consider dropping them to keep the doc shorter.

I see the overlap — the criteria are the authoritative rules and the questions restate them. I kept the questions because they're how I'd actually think through a tier assignment during review ("if this fails, can users still...?"), and the flowchart at the bottom condenses them into a quick path. But I'm open to dropping them if the team feels the doc is too long — the criteria bullets alone are sufficient. What do others think?

…on guide - Clarify CI/CD gate: Tier1 is advisory (non-blocking), Tier2 is informational - Promote 'Stuck deletion' from Tier2 to Tier1 (operational visibility - silent resource leak) - Broaden flowchart first question to cover all four Tier0 criteria

HYPERFLEET-988 - docs: define and document E2E test tier classificati…

71e6649

…on criteria

openshift-ci Bot requested review from mbrudnoy and vkareh May 4, 2026 14:50

pnguyen44 reviewed May 4, 2026

View reviewed changes

		#### Decision questions

		- "If this test fails in production, can users still create/update/delete clusters and nodepools?" -- If no, it's Tier0

Conversation

rafabene commented May 4, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test plan

Summary by CodeRabbit

Uh oh!

openshift-ci Bot commented May 4, 2026

Uh oh!

coderabbitai Bot commented May 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Estimated code review effort

Uh oh!

rafabene commented May 4, 2026

Uh oh!

pnguyen44 May 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

rafabene commented May 4, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented May 4, 2026 •

edited

Loading

pnguyen44 May 4, 2026 •

edited

Loading