-
Notifications
You must be signed in to change notification settings - Fork 428
CNTRLPLANE-642: Handle missing resource group gracefully during Azure cluster deletion #7394
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
@bryan-cox: This pull request references CNTRLPLANE-642 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target either version "4.22." or "openshift-4.22.", but it targets "openshift-4.16" instead. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
Skipping CI for Draft Pull Request. |
|
Important Review skippedAuto reviews are limited based on label configuration. 🚫 Excluded labels (none allowed) (1)
Please check the settings in the CodeRabbit UI or the You can disable this status message by setting the ✨ Finishing touches🧪 Generate unit tests (beta)
Comment |
|
@bryan-cox: This pull request references CNTRLPLANE-642 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target either version "4.22." or "openshift-4.22.", but it targets "openshift-4.16" instead. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: bryan-cox The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
@bryan-cox: This pull request references CNTRLPLANE-642 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target either version "4.22." or "openshift-4.22.", but it targets "openshift-4.16" instead. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
1 similar comment
|
@bryan-cox: This pull request references CNTRLPLANE-642 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target either version "4.22." or "openshift-4.22.", but it targets "openshift-4.16" instead. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
@bryan-cox: This pull request references CNTRLPLANE-642 which is a valid jira issue. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
1 similar comment
|
@bryan-cox: This pull request references CNTRLPLANE-642 which is a valid jira issue. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
@bryan-cox: This pull request references CNTRLPLANE-642 which is a valid jira issue. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
1 similar comment
|
@bryan-cox: This pull request references CNTRLPLANE-642 which is a valid jira issue. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
0eafb40 to
78f9262
Compare
|
@bryan-cox: This pull request references CNTRLPLANE-642 which is a valid jira issue. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
1 similar comment
|
@bryan-cox: This pull request references CNTRLPLANE-642 which is a valid jira issue. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
78f9262 to
f61eac0
Compare
|
@bryan-cox: This pull request references CNTRLPLANE-642 which is a valid jira issue. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
1 similar comment
|
@bryan-cox: This pull request references CNTRLPLANE-642 which is a valid jira issue. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
59fce85 to
c7aa13f
Compare
|
/test all |
|
/retest |
|
/test e2e-aks |
|
/auto-cc |
jparrill
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some minor comments, looks good overall
cmd/cluster/azure/destroy.go
Outdated
| "github.com/Azure/azure-sdk-for-go/sdk/resourcemanager/resources/armresources" | ||
|
|
||
| "k8s.io/apimachinery/pkg/util/errors" | ||
| kerrors "k8s.io/apimachinery/pkg/util/errors" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rename not needed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. Removed the kerrors alias since there's no conflicting import.
AI-assisted response via Claude Code
cmd/cluster/azure/destroy.go
Outdated
| inputErrors = append(inputErrors, fmt.Errorf("location is required")) | ||
| } | ||
| if err := errors.NewAggregate(inputErrors); err != nil { | ||
| if err := kerrors.NewAggregate(inputErrors); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not needed too
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. Updated to use errors.NewAggregate directly.
AI-assisted response via Claude Code
cmd/infra/azure/destroy.go
Outdated
| } else { | ||
| logger.Info("Deleting main resource group", "resource-group", mainResourceGroup) | ||
| destroyFuture, err = resourceGroupClient.BeginDelete(ctx, mainResourceGroup, nil) | ||
| if err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see many if/else in this block (from if o.PreserveResourceGroup {. Is possible to enhance it a bit?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. Extracted the resource group deletion logic into a deleteResourceGroup helper method that handles the 404 case gracefully. This reduces nesting and makes the code more readable.
AI-assisted response via Claude Code
| // - Azure SDK Design Guidelines for Go: https://azure.github.io/azure-sdk/golang_introduction.html | ||
| // - Azure SDK internal mock package: https://pkg.go.dev/github.com/Azure/azure-sdk-for-go/sdk/internal/mock | ||
| // - Azure SDK fake server examples: https://github.com/Azure/azure-sdk-for-go/tree/main/sdk/samples/fakes | ||
| // - policy.Transporter interface: https://pkg.go.dev/github.com/Azure/azure-sdk-for-go/sdk/azcore/policy#Transporter | ||
| // - azcore.TokenCredential interface: https://pkg.go.dev/github.com/Azure/azure-sdk-for-go/sdk/azcore#TokenCredential |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pretty useful 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! Glad these utilities are helpful.
AI-assisted response via Claude Code
| } | ||
|
|
||
| // Do returns a mock HTTP response with the configured status code and body. | ||
| func (m *MockAzureTransport) Do(req *http.Request) (*http.Response, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could these mocks be changed by mockgen ones? (Maybe not due to the best practices things, but just confirming)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Azure SDK for Go recommends mocking at the HTTP transport level rather than using interface mocks. This is because the SDK's client types aren't designed around mockable interfaces - they use concrete types that wrap the transport layer.
The Azure SDK team provides guidance on this approach:
Using policy.Transporter gives us control at the HTTP level, which is where Azure recommends testing. This allows us to test the actual client code paths including error handling, retries, and response parsing - things that would be bypassed with interface mocks.
That said, if the team prefers mockgen for consistency with other parts of the codebase, we could explore wrapping the client behind an interface, though that would require additional abstraction layers.
AI-assisted response via Claude Code
…deletion When a user-provided resource group name does not exist in Azure (404), the destroy operation now continues gracefully rather than failing. This prevents cluster destroy from getting stuck when the Azure infrastructure has already been manually deleted or was never fully created. Changes: - Add IsNotFoundError helper to check for Azure 404 responses using Azure SDK best practices (errors.As with azcore.ResponseError) - Add deleteResourceGroup helper to consolidate resource group deletion logic - Check resource group existence before attempting deletion when using user-provided resource group names - Handle 404 errors gracefully by logging and continuing with cluster deletion The helper function follows Azure SDK guidelines for error handling: https://pkg.go.dev/github.com/Azure/azure-sdk-for-go/sdk/azcore#ResponseError 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add integration-style tests that verify the actual DestroyInfraOptions.Run behavior when Azure returns a 404 ResourceGroupNotFound error. This follows the Azure SDK testing pattern of injecting mock transports at the HTTP level. Changes: - Add clientOptions and azureCredential fields to DestroyInfraOptions for dependency injection during testing - Add shared Azure test utilities in support/testutil/azure.go: - FakeAzureCredential: implements azcore.TokenCredential - MockAzureTransport: implements policy.Transporter - MockAzureResourceGroupTransport: handles HEAD/DELETE methods properly - Factory functions for common error scenarios - Add TestDestroyInfraRun test that verifies: - Resource group exists: deletion succeeds - 404 errors are handled gracefully (no error returned) - Other errors (403) are properly propagated References: - Azure SDK Design Guidelines: https://azure.github.io/azure-sdk/golang_introduction.html - Azure SDK internal mock package: https://pkg.go.dev/github.com/Azure/azure-sdk-for-go/sdk/internal/mock - Azure SDK fake server examples: https://github.com/Azure/azure-sdk-for-go/tree/main/sdk/samples/fakes 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
c7aa13f to
380fcea
Compare
|
Tested this out. Here is what previously happened. Here is what happens now with this PR: |
When deleting a HostedCluster after the Azure resource group has been removed externally, the cluster deletion gets stuck because CAPZ AzureMachine finalizers prevent Machine deletion, which blocks NodePool and HostedCluster cleanup. This implements the platform.OrphanDeleter interface for Azure: - Check if the Azure resource group exists via Azure SDK - If gone (404), remove finalizers from all deleting AzureMachines - This unblocks the deletion chain The implementation uses a fail-safe approach: only remove finalizers when the Azure API definitively returns 404. For any other error (permissions, network issues, etc.), assume the resource group exists to avoid orphaning cloud resources that would continue to incur costs. This extends CNTRLPLANE-642 with operator-side logic to complement the CLI-side fix for handling missing resource groups during cluster deletion. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
@bryan-cox: This pull request references CNTRLPLANE-642 which is a valid jira issue. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
@bryan-cox: This pull request references CNTRLPLANE-642 which is a valid jira issue. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
/test all |
|
/retest-required |
|
@bryan-cox: The following test failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
What this PR does / why we need it:
Fixes Azure HostedCluster deletion when the resource group is already gone (404).
Two fixes:
hypershift destroy cluster azurenow handles missing resource groups gracefully instead of erroringDeleteOrphanedMachinesfor Azure to remove CAPZ finalizers from orphaned AzureMachines, unblocking the Machine → NodePool → HostedCluster deletion chainKey changes:
errors.AspatternresourceGroupCheckerinterface with fail-safe approach (only act on definitive 404)Which issue(s) this PR fixes:
Fixes https://issues.redhat.com/browse/CNTRLPLANE-642
Special notes for your reviewer:
The operator fix uses a fail-safe approach: finalizers are only removed when Azure definitively returns 404. Other errors (403, network issues) assume the resource group exists to avoid orphaning cloud resources.
Checklist: