HYPERFLEET-988 - docs: define and document E2E test tier classification criteria#88
Conversation
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (1)
WalkthroughAdded a new test-design README documenting HyperFleet E2E test design: a three-tier test classification (Tier0/Tier1/Tier2) with CI/CD gating rules, decision questions, and a tier-decision flowchart; introduced orthogonal test labels (negative, perf, upgrade, disruptive, slow) and step-by-step test-case authoring guidance. Updated a comment in pkg/labels/labels.go to reference the new test-design README and adjusted the test case template header to point to the tier classification guide. No code or exported symbols were changed. Estimated code review effort🎯 1 (Trivial) | ⏱️ ~2 minutes 🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
/retest |
| ```text | ||
| Is this a core lifecycle happy path (create/update/delete)? | ||
| | | ||
| +-- YES --> Tier0 |
There was a problem hiding this comment.
The flowchart only checks for "core lifecycle happy path" but Tier0 also covers things like platform guarantees (e.g. adapter ordering) and data integrity. Someone using just the flowchart could accidentally mark those as Tier1. Maybe broaden the first question to include all the Tier0 criteria? Or if that makes the flowchart too complex, could drop it since the criteria bullets already cover the same ground.
There was a problem hiding this comment.
Good catch — the ambiguity matters since this drives Prow job configuration. Tier1 is intended as non-blocking (advisory). Updated the table to make this explicit:
| Tier | CI/CD Gate | Response SLA |
|------|-----------||--------------|
| Tier0 | Blocks release | Fix immediately |
| Tier1 | Advisory (non-blocking) | Fix before next sprint ends |
| Tier2 | Informational | Fix when capacity allows |
This also makes the Tier1 vs Tier2 distinction clearer: Tier1 has a sprint SLA, Tier2 has no deadline.
There was a problem hiding this comment.
Good catch — the flowchart's first question was too narrow. Updated it to cover all four Tier0 criteria:
Does this test validate a core Tier0 concern?
(lifecycle happy path, data integrity, platform guarantee, or system-wide blast radius)
This keeps the flowchart concise while ensuring all Tier0 criteria are represented in the first decision point.
| | Tier | CI/CD Gate | Response SLA | | ||
| |------|-----------|--------------| | ||
| | **Tier0** | Blocks release | Fix immediately | | ||
| | **Tier1** | Should be addressed before release | Fix before next sprint ends | |
There was a problem hiding this comment.
Does Tier1 block the release pipeline or not? "Should be addressed" is ambiguous. If it blocks, say "Blocks release" (maybe with a lower urgency SLA than Tier0). If it doesn't block, clarify that it's advisory so the distinction from Tier2 is clear.
| |-----------|-----------| | ||
| | [Cluster reflects adapter failure in top-level status](testcases/cluster.md#test-title-cluster-can-reflect-adapter-failure-in-top-level-status) | Error handling -- validates failure reporting, not the happy path. Core operations work; this tests that failures are visible | | ||
| | [PATCH to soft-deleted cluster returns 409](testcases/delete-cluster.md#test-title-patch-to-soft-deleted-cluster-returns-409-conflict) | API contract -- validates rejection of invalid mutations. The delete path works; this tests that the API guards against misuse | | ||
| | [Concurrent cluster creations without conflicts](testcases/concurrent-processing.md#test-title-system-can-process-concurrent-cluster-creations-without-resource-conflicts) | Secondary workflow -- concurrent creation is important but not the primary single-cluster path | |
There was a problem hiding this comment.
Should this be Tier0 instead? If concurrent creations actually conflict with each other, that sounds more like data integrity than a secondary workflow.
There was a problem hiding this comment.
Fair point — the name "without conflicts" does imply data integrity risk if it fails. The reasoning for Tier1 is that the single-cluster create path still works; this tests the concurrent/multi-tenant scenario. If concurrent creates actually produce corrupted resources (data integrity), I'd agree it should be Tier0. But if the failure mode is more like "one of the creates fails with a retryable conflict error," that's Tier1 error handling territory.
I'll clarify the "Why Tier1" justification to address this: the test validates that the system handles concurrent load gracefully, not that individual creates work. If the test scope actually covers data corruption scenarios, we should revisit the tier.
| |-----------|-----------| | ||
| | [Cluster reaches correct status after adapter crash and recovery](testcases/cluster.md#test-title-cluster-can-reach-correct-status-after-adapter-crash-and-recovery) | Infrastructure recovery -- involves killing adapter pods and verifying self-healing. Crashes are rare; operators can restart pods manually | | ||
| | [Maestro server unavailability graceful handling](testcases/adapter-with-maestro-transport.md#test-title-adapter-can-handle-maestro-server-unavailability-gracefully) | Infrastructure recovery -- simulates Maestro outage. Server failures are rare and recoverable | | ||
| | [Stuck deletion -- adapter unable to finalize](testcases/delete-cluster.md#test-title-stuck-deletion----adapter-unable-to-finalize-prevents-hard-delete) | Error case -- adapter permanently fails to finalize. Requires specific adapter misconfiguration and manual intervention | |
There was a problem hiding this comment.
Is stuck deletion really Tier2? A cluster that can never finish deleting leaks resources over time. Feels more like Tier1 since the impact grows if no one catches it.
There was a problem hiding this comment.
You're right — the compounding resource leak elevates this beyond a pure edge case. Moved "Stuck deletion" from Tier2 to Tier1 under the "Operational visibility" criterion. The distinction from crash recovery (which stays Tier2) is that crash recovery is self-healing, while stuck deletion silently leaks resources if nobody catches it.
Updated justification: Operational visibility — adapter permanently fails to finalize, causing silent resource leak that compounds over time if undetected.
| #### Decision questions | ||
|
|
||
| - "If this test fails in production, can users still create/update/delete clusters and nodepools?" -- If **no**, it's Tier0 |
There was a problem hiding this comment.
The decision questions in each tier restate the criteria in question form without adding new info. Consider dropping them to keep the doc shorter.
There was a problem hiding this comment.
I see the overlap — the criteria are the authoritative rules and the questions restate them. I kept the questions because they're how I'd actually think through a tier assignment during review ("if this fails, can users still...?"), and the flowchart at the bottom condenses them into a quick path. But I'm open to dropping them if the team feels the doc is too long — the criteria bullets alone are sufficient. What do others think?
…on guide - Clarify CI/CD gate: Tier1 is advisory (non-blocking), Tier2 is informational - Promote 'Stuck deletion' from Tier2 to Tier1 (operational visibility - silent resource leak) - Broaden flowchart first question to cover all four Tier0 criteria
Summary
test-design/README.mdwith comprehensive tier classification guide: explicit criteria (blast radius, user impact, data integrity, frequency), decision flowchart, and 5 concrete examples per tier drawn from existing test casespkg/labels/labels.goto reference the new guide for full definitionstest-design/templates/testcase-template.mdto mention the tier classification guideTest plan
make checkpasses (vet, lint, unit tests)Closes: HYPERFLEET-988
Summary by CodeRabbit