Skip to content

HYPERFLEET-988 - docs: define and document E2E test tier classification criteria#88

Open
rafabene wants to merge 2 commits intoopenshift-hyperfleet:mainfrom
rafabene:HYPERFLEET-988-tier-classification-criteria
Open

HYPERFLEET-988 - docs: define and document E2E test tier classification criteria#88
rafabene wants to merge 2 commits intoopenshift-hyperfleet:mainfrom
rafabene:HYPERFLEET-988-tier-classification-criteria

Conversation

@rafabene
Copy link
Copy Markdown
Contributor

@rafabene rafabene commented May 4, 2026

Summary

  • Created test-design/README.md with comprehensive tier classification guide: explicit criteria (blast radius, user impact, data integrity, frequency), decision flowchart, and 5 concrete examples per tier drawn from existing test cases
  • Updated pkg/labels/labels.go to reference the new guide for full definitions
  • Updated test-design/templates/testcase-template.md to mention the tier classification guide
  • Spot-checked all 50+ existing test cases across 9 files -- all tier assignments are consistent with the documented criteria

Test plan

  • make check passes (vet, lint, unit tests)
  • All markdown code blocks have language identifiers (MD040)
  • All anchor links in README.md verified against actual headings in test case files
  • No runtime code changes -- documentation only

Closes: HYPERFLEET-988

Summary by CodeRabbit

  • Documentation
    • Added comprehensive E2E test design documentation defining tier classification (Tier0/Tier1/Tier2), CI/CD gating rules, decision criteria, and a decision flowchart.
    • Introduced orthogonal test labels (negative, perf, upgrade, disruptive, slow) and guidance on combining them with tiers.
    • Updated test case template to reference the new design doc and clarified severity label scope and linking to full criteria.

@openshift-ci openshift-ci Bot requested review from mbrudnoy and vkareh May 4, 2026 14:50
@openshift-ci
Copy link
Copy Markdown

openshift-ci Bot commented May 4, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign ma-hill for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 4, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: cf62d9d4-d352-495c-9a77-dcb0886e5bcb

📥 Commits

Reviewing files that changed from the base of the PR and between 71e6649 and 6f6b559.

📒 Files selected for processing (1)
  • test-design/README.md

Walkthrough

Added a new test-design README documenting HyperFleet E2E test design: a three-tier test classification (Tier0/Tier1/Tier2) with CI/CD gating rules, decision questions, and a tier-decision flowchart; introduced orthogonal test labels (negative, perf, upgrade, disruptive, slow) and step-by-step test-case authoring guidance. Updated a comment in pkg/labels/labels.go to reference the new test-design README and adjusted the test case template header to point to the tier classification guide. No code or exported symbols were changed.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~2 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly summarizes the main change: defining and documenting E2E test tier classification criteria across three files with criteria, flowchart, and examples.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@rafabene
Copy link
Copy Markdown
Contributor Author

rafabene commented May 4, 2026

/retest

Comment thread test-design/README.md
Comment on lines +106 to +109
```text
Is this a core lifecycle happy path (create/update/delete)?
|
+-- YES --> Tier0
Copy link
Copy Markdown

@pnguyen44 pnguyen44 May 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The flowchart only checks for "core lifecycle happy path" but Tier0 also covers things like platform guarantees (e.g. adapter ordering) and data integrity. Someone using just the flowchart could accidentally mark those as Tier1. Maybe broaden the first question to include all the Tier0 criteria? Or if that makes the flowchart too complex, could drop it since the criteria bullets already cover the same ground.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch — the ambiguity matters since this drives Prow job configuration. Tier1 is intended as non-blocking (advisory). Updated the table to make this explicit:

| Tier | CI/CD Gate | Response SLA |
|------|-----------||--------------|
| Tier0 | Blocks release | Fix immediately |
| Tier1 | Advisory (non-blocking) | Fix before next sprint ends |
| Tier2 | Informational | Fix when capacity allows |

This also makes the Tier1 vs Tier2 distinction clearer: Tier1 has a sprint SLA, Tier2 has no deadline.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch — the flowchart's first question was too narrow. Updated it to cover all four Tier0 criteria:

Does this test validate a core Tier0 concern?
(lifecycle happy path, data integrity, platform guarantee, or system-wide blast radius)

This keeps the flowchart concise while ensuring all Tier0 criteria are represented in the first decision point.

Comment thread test-design/README.md Outdated
| Tier | CI/CD Gate | Response SLA |
|------|-----------|--------------|
| **Tier0** | Blocks release | Fix immediately |
| **Tier1** | Should be addressed before release | Fix before next sprint ends |
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does Tier1 block the release pipeline or not? "Should be addressed" is ambiguous. If it blocks, say "Blocks release" (maybe with a lower urgency SLA than Tier0). If it doesn't block, clarify that it's advisory so the distinction from Tier2 is clear.

Comment thread test-design/README.md
|-----------|-----------|
| [Cluster reflects adapter failure in top-level status](testcases/cluster.md#test-title-cluster-can-reflect-adapter-failure-in-top-level-status) | Error handling -- validates failure reporting, not the happy path. Core operations work; this tests that failures are visible |
| [PATCH to soft-deleted cluster returns 409](testcases/delete-cluster.md#test-title-patch-to-soft-deleted-cluster-returns-409-conflict) | API contract -- validates rejection of invalid mutations. The delete path works; this tests that the API guards against misuse |
| [Concurrent cluster creations without conflicts](testcases/concurrent-processing.md#test-title-system-can-process-concurrent-cluster-creations-without-resource-conflicts) | Secondary workflow -- concurrent creation is important but not the primary single-cluster path |
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be Tier0 instead? If concurrent creations actually conflict with each other, that sounds more like data integrity than a secondary workflow.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair point — the name "without conflicts" does imply data integrity risk if it fails. The reasoning for Tier1 is that the single-cluster create path still works; this tests the concurrent/multi-tenant scenario. If concurrent creates actually produce corrupted resources (data integrity), I'd agree it should be Tier0. But if the failure mode is more like "one of the creates fails with a retryable conflict error," that's Tier1 error handling territory.

I'll clarify the "Why Tier1" justification to address this: the test validates that the system handles concurrent load gracefully, not that individual creates work. If the test scope actually covers data corruption scenarios, we should revisit the tier.

Comment thread test-design/README.md Outdated
|-----------|-----------|
| [Cluster reaches correct status after adapter crash and recovery](testcases/cluster.md#test-title-cluster-can-reach-correct-status-after-adapter-crash-and-recovery) | Infrastructure recovery -- involves killing adapter pods and verifying self-healing. Crashes are rare; operators can restart pods manually |
| [Maestro server unavailability graceful handling](testcases/adapter-with-maestro-transport.md#test-title-adapter-can-handle-maestro-server-unavailability-gracefully) | Infrastructure recovery -- simulates Maestro outage. Server failures are rare and recoverable |
| [Stuck deletion -- adapter unable to finalize](testcases/delete-cluster.md#test-title-stuck-deletion----adapter-unable-to-finalize-prevents-hard-delete) | Error case -- adapter permanently fails to finalize. Requires specific adapter misconfiguration and manual intervention |
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is stuck deletion really Tier2? A cluster that can never finish deleting leaks resources over time. Feels more like Tier1 since the impact grows if no one catches it.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right — the compounding resource leak elevates this beyond a pure edge case. Moved "Stuck deletion" from Tier2 to Tier1 under the "Operational visibility" criterion. The distinction from crash recovery (which stays Tier2) is that crash recovery is self-healing, while stuck deletion silently leaks resources if nobody catches it.

Updated justification: Operational visibility — adapter permanently fails to finalize, causing silent resource leak that compounds over time if undetected.

Comment thread test-design/README.md
Comment on lines +34 to +36
#### Decision questions

- "If this test fails in production, can users still create/update/delete clusters and nodepools?" -- If **no**, it's Tier0
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The decision questions in each tier restate the criteria in question form without adding new info. Consider dropping them to keep the doc shorter.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see the overlap — the criteria are the authoritative rules and the questions restate them. I kept the questions because they're how I'd actually think through a tier assignment during review ("if this fails, can users still...?"), and the flowchart at the bottom condenses them into a quick path. But I'm open to dropping them if the team feels the doc is too long — the criteria bullets alone are sufficient. What do others think?

…on guide

- Clarify CI/CD gate: Tier1 is advisory (non-blocking), Tier2 is informational
- Promote 'Stuck deletion' from Tier2 to Tier1 (operational visibility - silent resource leak)
- Broaden flowchart first question to cover all four Tier0 criteria
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants