Skip to content

Conversation

@bryan-cox
Copy link
Member

@bryan-cox bryan-cox commented Dec 16, 2025

What this PR does / why we need it:

Fixes Azure HostedCluster deletion when the resource group is already gone (404).

Two fixes:

  1. CLI: hypershift destroy cluster azure now handles missing resource groups gracefully instead of erroring
  2. Operator: Implements DeleteOrphanedMachines for Azure to remove CAPZ finalizers from orphaned AzureMachines, unblocking the Machine → NodePool → HostedCluster deletion chain

Key changes:

  • Refactor Azure 404 error detection to use SDK-recommended errors.As pattern
  • Add resourceGroupChecker interface with fail-safe approach (only act on definitive 404)
  • Add unit/integration tests with Azure SDK mock patterns

Which issue(s) this PR fixes:

Fixes https://issues.redhat.com/browse/CNTRLPLANE-642

Special notes for your reviewer:

The operator fix uses a fail-safe approach: finalizers are only removed when Azure definitively returns 404. Other errors (403, network issues) assume the resource group exists to avoid orphaning cloud resources.

Checklist:

  • Subject and description added to both, commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests.

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Dec 16, 2025
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 16, 2025
@openshift-ci-robot
Copy link

openshift-ci-robot commented Dec 16, 2025

@bryan-cox: This pull request references CNTRLPLANE-642 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target either version "4.22." or "openshift-4.22.", but it targets "openshift-4.16" instead.

Details

In response to this:

What this PR does / why we need it:

When deleting an Azure HostedCluster, if the associated resource group does not exist (404), the CLI now handles this gracefully instead of failing with an error. This is the expected behavior since the desired state (resources deleted) is already achieved.

Changes:

  • In cmd/cluster/azure/destroy.go: When a user-provided resource group returns 404, log an info message, skip Azure cleanup, and proceed with cluster CR deletion
  • In cmd/infra/azure/destroy.go: Refactor error handling to use the Azure SDK recommended pattern (errors.As with *azcore.ResponseError) instead of fragile string matching

This follows Azure SDK best practices per the documentation at vendor/github.com/Azure/azure-sdk-for-go/sdk/azcore/errors.go which states: "Use errors.As() to access this type in the error chain."

Which issue(s) this PR fixes:

Fixes https://issues.redhat.com/browse/CNTRLPLANE-642

Special notes for your reviewer:

The error handling now properly distinguishes between:

  • Resource group not found (404) - treated as non-fatal, logs info message and continues
  • Other Azure API errors - treated as fatal and returned as errors

Checklist:

  • Subject and description added to both, commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests.

🤖 Generated with Claude Code via /jira:solve [CNTRLPLANE-642](https://issues.redhat.com//browse/CNTRLPLANE-642)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Dec 16, 2025

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Dec 16, 2025

Important

Review skipped

Auto reviews are limited based on label configuration.

🚫 Excluded labels (none allowed) (1)
  • do-not-merge/work-in-progress

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

✨ Finishing touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci-robot
Copy link

openshift-ci-robot commented Dec 16, 2025

@bryan-cox: This pull request references CNTRLPLANE-642 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target either version "4.22." or "openshift-4.22.", but it targets "openshift-4.16" instead.

Details

In response to this:

What this PR does / why we need it:

When deleting an Azure HostedCluster, if the associated resource group does not exist (404), the CLI now handles this gracefully instead of failing with an error. This is the expected behavior since the desired state (resources deleted) is already achieved.

Changes:

  • In cmd/cluster/azure/destroy.go: When a user-provided resource group returns 404, log an info message, skip Azure cleanup, and proceed with cluster CR deletion
  • In cmd/infra/azure/destroy.go: Refactor error handling to use the Azure SDK recommended pattern (errors.As with *azcore.ResponseError) instead of fragile string matching

This follows Azure SDK best practices per the documentation at vendor/github.com/Azure/azure-sdk-for-go/sdk/azcore/errors.go which states: "Use errors.As() to access this type in the error chain."

Which issue(s) this PR fixes:

Fixes https://issues.redhat.com/browse/CNTRLPLANE-642

Special notes for your reviewer:

The error handling now properly distinguishes between:

  • Resource group not found (404) - treated as non-fatal, logs info message and continues
  • Other Azure API errors - treated as fatal and returned as errors

Checklist:

  • Subject and description added to both, commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests.

🤖 Generated with Claude Code via /jira:solve [CNTRLPLANE-642](https://issues.redhat.com//browse/CNTRLPLANE-642)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot added do-not-merge/needs-area area/cli Indicates the PR includes changes for CLI area/platform/azure PR/issue for Azure (AzurePlatform) platform and removed do-not-merge/needs-area labels Dec 16, 2025
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Dec 16, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bryan-cox

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 16, 2025
@openshift-ci-robot
Copy link

openshift-ci-robot commented Dec 16, 2025

@bryan-cox: This pull request references CNTRLPLANE-642 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target either version "4.22." or "openshift-4.22.", but it targets "openshift-4.16" instead.

Details

In response to this:

What this PR does / why we need it:

When deleting an Azure HostedCluster, if the associated resource group does not exist (404), the CLI now handles this gracefully instead of failing with an error. This is the expected behavior since the desired state (resources deleted) is already achieved.

Changes:

  • In cmd/cluster/azure/destroy.go: When a user-provided resource group returns 404, log an info message, skip Azure cleanup, and proceed with cluster CR deletion
  • In cmd/infra/azure/destroy.go: Refactor error handling to use the Azure SDK recommended pattern (errors.As with *azcore.ResponseError) instead of fragile string matching

This follows Azure SDK best practices which states: "Use errors.As() to access this type in the error chain."

Which issue(s) this PR fixes:

Fixes https://issues.redhat.com/browse/CNTRLPLANE-642

Special notes for your reviewer:

The error handling now properly distinguishes between:

  • Resource group not found (404) - treated as non-fatal, logs info message and continues
  • Other Azure API errors - treated as fatal and returned as errors

Checklist:

  • Subject and description added to both, commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests.

🤖 Generated with Claude Code via /jira:solve [CNTRLPLANE-642](https://issues.redhat.com//browse/CNTRLPLANE-642)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

1 similar comment
@openshift-ci-robot
Copy link

openshift-ci-robot commented Dec 16, 2025

@bryan-cox: This pull request references CNTRLPLANE-642 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target either version "4.22." or "openshift-4.22.", but it targets "openshift-4.16" instead.

Details

In response to this:

What this PR does / why we need it:

When deleting an Azure HostedCluster, if the associated resource group does not exist (404), the CLI now handles this gracefully instead of failing with an error. This is the expected behavior since the desired state (resources deleted) is already achieved.

Changes:

  • In cmd/cluster/azure/destroy.go: When a user-provided resource group returns 404, log an info message, skip Azure cleanup, and proceed with cluster CR deletion
  • In cmd/infra/azure/destroy.go: Refactor error handling to use the Azure SDK recommended pattern (errors.As with *azcore.ResponseError) instead of fragile string matching

This follows Azure SDK best practices which states: "Use errors.As() to access this type in the error chain."

Which issue(s) this PR fixes:

Fixes https://issues.redhat.com/browse/CNTRLPLANE-642

Special notes for your reviewer:

The error handling now properly distinguishes between:

  • Resource group not found (404) - treated as non-fatal, logs info message and continues
  • Other Azure API errors - treated as fatal and returned as errors

Checklist:

  • Subject and description added to both, commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests.

🤖 Generated with Claude Code via /jira:solve [CNTRLPLANE-642](https://issues.redhat.com//browse/CNTRLPLANE-642)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot
Copy link

openshift-ci-robot commented Dec 16, 2025

@bryan-cox: This pull request references CNTRLPLANE-642 which is a valid jira issue.

Details

In response to this:

What this PR does / why we need it:

When deleting an Azure HostedCluster, if the associated resource group does not exist (404), the CLI now handles this gracefully instead of failing with an error. This is the expected behavior since the desired state (resources deleted) is already achieved.

Changes:

  • In cmd/cluster/azure/destroy.go: When a user-provided resource group returns 404, log an info message, skip Azure cleanup, and proceed with cluster CR deletion
  • In cmd/infra/azure/destroy.go: Refactor error handling to use the Azure SDK recommended pattern (errors.As with *azcore.ResponseError) instead of fragile string matching
  • Add IsNotFoundError helper function for reusable, testable 404 error detection
  • Add comprehensive unit tests for the helper function

This follows Azure SDK best practices which states: "Use errors.As() to access this type in the error chain."

Which issue(s) this PR fixes:

Fixes https://issues.redhat.com/browse/CNTRLPLANE-642

Special notes for your reviewer:

The error handling now properly distinguishes between:

  • Resource group not found (404) - treated as non-fatal, logs info message and continues
  • Other Azure API errors - treated as fatal and returned as errors

Checklist:

  • Subject and description added to both, commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests.

🤖 Generated with Claude Code via /jira:solve [CNTRLPLANE-642](https://issues.redhat.com//browse/CNTRLPLANE-642)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

1 similar comment
@openshift-ci-robot
Copy link

openshift-ci-robot commented Dec 16, 2025

@bryan-cox: This pull request references CNTRLPLANE-642 which is a valid jira issue.

Details

In response to this:

What this PR does / why we need it:

When deleting an Azure HostedCluster, if the associated resource group does not exist (404), the CLI now handles this gracefully instead of failing with an error. This is the expected behavior since the desired state (resources deleted) is already achieved.

Changes:

  • In cmd/cluster/azure/destroy.go: When a user-provided resource group returns 404, log an info message, skip Azure cleanup, and proceed with cluster CR deletion
  • In cmd/infra/azure/destroy.go: Refactor error handling to use the Azure SDK recommended pattern (errors.As with *azcore.ResponseError) instead of fragile string matching
  • Add IsNotFoundError helper function for reusable, testable 404 error detection
  • Add comprehensive unit tests for the helper function

This follows Azure SDK best practices which states: "Use errors.As() to access this type in the error chain."

Which issue(s) this PR fixes:

Fixes https://issues.redhat.com/browse/CNTRLPLANE-642

Special notes for your reviewer:

The error handling now properly distinguishes between:

  • Resource group not found (404) - treated as non-fatal, logs info message and continues
  • Other Azure API errors - treated as fatal and returned as errors

Checklist:

  • Subject and description added to both, commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests.

🤖 Generated with Claude Code via /jira:solve [CNTRLPLANE-642](https://issues.redhat.com//browse/CNTRLPLANE-642)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot added area/control-plane-operator Indicates the PR includes changes for the control plane operator - in an OCP release area/hypershift-operator Indicates the PR includes changes for the hypershift operator and API - outside an OCP release labels Dec 16, 2025
@openshift-ci-robot
Copy link

openshift-ci-robot commented Dec 16, 2025

@bryan-cox: This pull request references CNTRLPLANE-642 which is a valid jira issue.

Details

In response to this:

What this PR does / why we need it:

When deleting an Azure HostedCluster, if the associated resource group does not exist (404), the CLI now handles this gracefully instead of failing with an error. This is the expected behavior since the desired state (resources deleted) is already achieved.

Changes:

  • In cmd/cluster/azure/destroy.go: When a user-provided resource group returns 404, log an info message, skip Azure cleanup, and proceed with cluster CR deletion
  • In cmd/infra/azure/destroy.go: Refactor error handling to use the Azure SDK recommended pattern (errors.As with *azcore.ResponseError) instead of fragile string matching
  • Add IsNotFoundError helper function for reusable, testable 404 error detection
  • Add integration-style tests using Azure SDK mock transport pattern
  • Add shared Azure test utilities in support/testutil/azure.go for reuse by other Azure CLI tests

This follows Azure SDK best practices which states: "Use errors.As() to access this type in the error chain."

Testing approach follows the Azure SDK testing patterns:

  • Mock HTTP transport at the policy.Transporter level
  • Fake azcore.TokenCredential for authentication
  • Inject via ClientOptions without modifying production code structure

Which issue(s) this PR fixes:

Fixes https://issues.redhat.com/browse/CNTRLPLANE-642

Special notes for your reviewer:

The error handling now properly distinguishes between:

  • Resource group not found (404) - treated as non-fatal, logs info message and continues
  • Other Azure API errors - treated as fatal and returned as errors

The test utilities in support/testutil/azure.go can be reused by other Azure CLI tests.

Checklist:

  • Subject and description added to both, commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests.

🤖 Generated with Claude Code via /jira:solve [CNTRLPLANE-642](https://issues.redhat.com//browse/CNTRLPLANE-642)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

1 similar comment
@openshift-ci-robot
Copy link

openshift-ci-robot commented Dec 16, 2025

@bryan-cox: This pull request references CNTRLPLANE-642 which is a valid jira issue.

Details

In response to this:

What this PR does / why we need it:

When deleting an Azure HostedCluster, if the associated resource group does not exist (404), the CLI now handles this gracefully instead of failing with an error. This is the expected behavior since the desired state (resources deleted) is already achieved.

Changes:

  • In cmd/cluster/azure/destroy.go: When a user-provided resource group returns 404, log an info message, skip Azure cleanup, and proceed with cluster CR deletion
  • In cmd/infra/azure/destroy.go: Refactor error handling to use the Azure SDK recommended pattern (errors.As with *azcore.ResponseError) instead of fragile string matching
  • Add IsNotFoundError helper function for reusable, testable 404 error detection
  • Add integration-style tests using Azure SDK mock transport pattern
  • Add shared Azure test utilities in support/testutil/azure.go for reuse by other Azure CLI tests

This follows Azure SDK best practices which states: "Use errors.As() to access this type in the error chain."

Testing approach follows the Azure SDK testing patterns:

  • Mock HTTP transport at the policy.Transporter level
  • Fake azcore.TokenCredential for authentication
  • Inject via ClientOptions without modifying production code structure

Which issue(s) this PR fixes:

Fixes https://issues.redhat.com/browse/CNTRLPLANE-642

Special notes for your reviewer:

The error handling now properly distinguishes between:

  • Resource group not found (404) - treated as non-fatal, logs info message and continues
  • Other Azure API errors - treated as fatal and returned as errors

The test utilities in support/testutil/azure.go can be reused by other Azure CLI tests.

Checklist:

  • Subject and description added to both, commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests.

🤖 Generated with Claude Code via /jira:solve [CNTRLPLANE-642](https://issues.redhat.com//browse/CNTRLPLANE-642)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot
Copy link

openshift-ci-robot commented Dec 16, 2025

@bryan-cox: This pull request references CNTRLPLANE-642 which is a valid jira issue.

Details

In response to this:

What this PR does / why we need it:

When deleting an Azure HostedCluster, if the associated resource group does not exist (404), the CLI now handles this gracefully instead of failing with an error. This is the expected behavior since the desired state (resources deleted) is already achieved.

Changes:

  • In cmd/cluster/azure/destroy.go: When a user-provided resource group returns 404, log an info message, skip Azure cleanup, and proceed with cluster CR deletion
  • In cmd/infra/azure/destroy.go: Refactor error handling to use the Azure SDK recommended pattern (errors.As with *azcore.ResponseError) instead of fragile string matching
  • Add IsNotFoundError helper function for reusable 404 error detection
  • Add integration tests using Azure SDK mock transport pattern
  • Add shared Azure test utilities in support/testutil/azure.go for reuse by other Azure CLI tests

This follows Azure SDK best practices which states: "Use errors.As() to access this type in the error chain."

Testing approach follows the Azure SDK testing patterns:

  • Mock HTTP transport at the policy.Transporter level
  • Fake azcore.TokenCredential for authentication
  • Inject via ClientOptions without modifying production code structure

Which issue(s) this PR fixes:

Fixes https://issues.redhat.com/browse/CNTRLPLANE-642

Special notes for your reviewer:

The error handling now properly distinguishes between:

  • Resource group not found (404) - treated as non-fatal, logs info message and continues
  • Other Azure API errors - treated as fatal and returned as errors

The test utilities in support/testutil/azure.go can be reused by other Azure CLI tests.

Checklist:

  • Subject and description added to both, commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests.

🤖 Generated with Claude Code via /jira:solve [CNTRLPLANE-642](https://issues.redhat.com//browse/CNTRLPLANE-642)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

1 similar comment
@openshift-ci-robot
Copy link

openshift-ci-robot commented Dec 16, 2025

@bryan-cox: This pull request references CNTRLPLANE-642 which is a valid jira issue.

Details

In response to this:

What this PR does / why we need it:

When deleting an Azure HostedCluster, if the associated resource group does not exist (404), the CLI now handles this gracefully instead of failing with an error. This is the expected behavior since the desired state (resources deleted) is already achieved.

Changes:

  • In cmd/cluster/azure/destroy.go: When a user-provided resource group returns 404, log an info message, skip Azure cleanup, and proceed with cluster CR deletion
  • In cmd/infra/azure/destroy.go: Refactor error handling to use the Azure SDK recommended pattern (errors.As with *azcore.ResponseError) instead of fragile string matching
  • Add IsNotFoundError helper function for reusable 404 error detection
  • Add integration tests using Azure SDK mock transport pattern
  • Add shared Azure test utilities in support/testutil/azure.go for reuse by other Azure CLI tests

This follows Azure SDK best practices which states: "Use errors.As() to access this type in the error chain."

Testing approach follows the Azure SDK testing patterns:

  • Mock HTTP transport at the policy.Transporter level
  • Fake azcore.TokenCredential for authentication
  • Inject via ClientOptions without modifying production code structure

Which issue(s) this PR fixes:

Fixes https://issues.redhat.com/browse/CNTRLPLANE-642

Special notes for your reviewer:

The error handling now properly distinguishes between:

  • Resource group not found (404) - treated as non-fatal, logs info message and continues
  • Other Azure API errors - treated as fatal and returned as errors

The test utilities in support/testutil/azure.go can be reused by other Azure CLI tests.

Checklist:

  • Subject and description added to both, commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests.

🤖 Generated with Claude Code via /jira:solve [CNTRLPLANE-642](https://issues.redhat.com//browse/CNTRLPLANE-642)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot
Copy link

openshift-ci-robot commented Dec 16, 2025

@bryan-cox: This pull request references CNTRLPLANE-642 which is a valid jira issue.

Details

In response to this:

What this PR does / why we need it:

When deleting an Azure HostedCluster, if the associated resource group does not exist (404), the CLI now handles this gracefully instead of failing with an error. This is the expected behavior since the desired state (resources deleted) is already achieved.

Changes:

  • In cmd/cluster/azure/destroy.go: When a user-provided resource group returns 404, log an info message, skip Azure cleanup, and proceed with cluster CR deletion
  • In cmd/infra/azure/destroy.go: Refactor error handling to use the Azure SDK recommended pattern (errors.As with *azcore.ResponseError) instead of fragile string matching
  • Add IsNotFoundError helper function for reusable 404 error detection
  • Add integration tests using Azure SDK mock transport pattern
  • Add shared Azure test utilities in support/testutil/azure.go

This follows Azure SDK best practices which states: "Use errors.As() to access this type in the error chain."

Testing approach follows the Azure SDK testing patterns:

  • Mock HTTP transport at the policy.Transporter level
  • Fake azcore.TokenCredential for authentication
  • Inject via ClientOptions without modifying CLI behavior

Test scenarios:

  • Resource group exists → deletion succeeds
  • Resource group not found (404) → succeeds without error
  • Authorization error (403) → returns error

Which issue(s) this PR fixes:

Fixes https://issues.redhat.com/browse/CNTRLPLANE-642

Special notes for your reviewer:

The error handling now properly distinguishes between:

  • Resource group not found (404) - treated as non-fatal, logs info message and continues
  • Other Azure API errors - treated as fatal and returned as errors

The test utilities in support/testutil/azure.go can be reused by other Azure CLI tests.

Checklist:

  • Subject and description added to both, commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests.

🤖 Generated with Claude Code via /jira:solve [CNTRLPLANE-642](https://issues.redhat.com//browse/CNTRLPLANE-642)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

1 similar comment
@openshift-ci-robot
Copy link

openshift-ci-robot commented Dec 16, 2025

@bryan-cox: This pull request references CNTRLPLANE-642 which is a valid jira issue.

Details

In response to this:

What this PR does / why we need it:

When deleting an Azure HostedCluster, if the associated resource group does not exist (404), the CLI now handles this gracefully instead of failing with an error. This is the expected behavior since the desired state (resources deleted) is already achieved.

Changes:

  • In cmd/cluster/azure/destroy.go: When a user-provided resource group returns 404, log an info message, skip Azure cleanup, and proceed with cluster CR deletion
  • In cmd/infra/azure/destroy.go: Refactor error handling to use the Azure SDK recommended pattern (errors.As with *azcore.ResponseError) instead of fragile string matching
  • Add IsNotFoundError helper function for reusable 404 error detection
  • Add integration tests using Azure SDK mock transport pattern
  • Add shared Azure test utilities in support/testutil/azure.go

This follows Azure SDK best practices which states: "Use errors.As() to access this type in the error chain."

Testing approach follows the Azure SDK testing patterns:

  • Mock HTTP transport at the policy.Transporter level
  • Fake azcore.TokenCredential for authentication
  • Inject via ClientOptions without modifying CLI behavior

Test scenarios:

  • Resource group exists → deletion succeeds
  • Resource group not found (404) → succeeds without error
  • Authorization error (403) → returns error

Which issue(s) this PR fixes:

Fixes https://issues.redhat.com/browse/CNTRLPLANE-642

Special notes for your reviewer:

The error handling now properly distinguishes between:

  • Resource group not found (404) - treated as non-fatal, logs info message and continues
  • Other Azure API errors - treated as fatal and returned as errors

The test utilities in support/testutil/azure.go can be reused by other Azure CLI tests.

Checklist:

  • Subject and description added to both, commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests.

🤖 Generated with Claude Code via /jira:solve [CNTRLPLANE-642](https://issues.redhat.com//browse/CNTRLPLANE-642)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@bryan-cox bryan-cox force-pushed the CNTRLPLANE-642 branch 2 times, most recently from 59fce85 to c7aa13f Compare December 16, 2025 15:03
@bryan-cox
Copy link
Member Author

/test all

@bryan-cox
Copy link
Member Author

/retest

@bryan-cox
Copy link
Member Author

/test e2e-aks

@bryan-cox
Copy link
Member Author

/auto-cc

@openshift-ci openshift-ci bot requested review from enxebre and jparrill December 17, 2025 10:48
Copy link
Contributor

@jparrill jparrill left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some minor comments, looks good overall

"github.com/Azure/azure-sdk-for-go/sdk/resourcemanager/resources/armresources"

"k8s.io/apimachinery/pkg/util/errors"
kerrors "k8s.io/apimachinery/pkg/util/errors"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rename not needed

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Removed the kerrors alias since there's no conflicting import.


AI-assisted response via Claude Code

inputErrors = append(inputErrors, fmt.Errorf("location is required"))
}
if err := errors.NewAggregate(inputErrors); err != nil {
if err := kerrors.NewAggregate(inputErrors); err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not needed too

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Updated to use errors.NewAggregate directly.


AI-assisted response via Claude Code

} else {
logger.Info("Deleting main resource group", "resource-group", mainResourceGroup)
destroyFuture, err = resourceGroupClient.BeginDelete(ctx, mainResourceGroup, nil)
if err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see many if/else in this block (from if o.PreserveResourceGroup {. Is possible to enhance it a bit?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Extracted the resource group deletion logic into a deleteResourceGroup helper method that handles the 404 case gracefully. This reduces nesting and makes the code more readable.


AI-assisted response via Claude Code

Comment on lines +9 to +13
// - Azure SDK Design Guidelines for Go: https://azure.github.io/azure-sdk/golang_introduction.html
// - Azure SDK internal mock package: https://pkg.go.dev/github.com/Azure/azure-sdk-for-go/sdk/internal/mock
// - Azure SDK fake server examples: https://github.com/Azure/azure-sdk-for-go/tree/main/sdk/samples/fakes
// - policy.Transporter interface: https://pkg.go.dev/github.com/Azure/azure-sdk-for-go/sdk/azcore/policy#Transporter
// - azcore.TokenCredential interface: https://pkg.go.dev/github.com/Azure/azure-sdk-for-go/sdk/azcore#TokenCredential
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pretty useful 👍

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Glad these utilities are helpful.


AI-assisted response via Claude Code

}

// Do returns a mock HTTP response with the configured status code and body.
func (m *MockAzureTransport) Do(req *http.Request) (*http.Response, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could these mocks be changed by mockgen ones? (Maybe not due to the best practices things, but just confirming)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Azure SDK for Go recommends mocking at the HTTP transport level rather than using interface mocks. This is because the SDK's client types aren't designed around mockable interfaces - they use concrete types that wrap the transport layer.

The Azure SDK team provides guidance on this approach:

Using policy.Transporter gives us control at the HTTP level, which is where Azure recommends testing. This allows us to test the actual client code paths including error handling, retries, and response parsing - things that would be bypassed with interface mocks.

That said, if the team prefers mockgen for consistency with other parts of the codebase, we could explore wrapping the client behind an interface, though that would require additional abstraction layers.


AI-assisted response via Claude Code

bryan-cox and others added 2 commits December 17, 2025 06:47
…deletion

When a user-provided resource group name does not exist in Azure (404),
the destroy operation now continues gracefully rather than failing.
This prevents cluster destroy from getting stuck when the Azure infrastructure
has already been manually deleted or was never fully created.

Changes:
- Add IsNotFoundError helper to check for Azure 404 responses using
  Azure SDK best practices (errors.As with azcore.ResponseError)
- Add deleteResourceGroup helper to consolidate resource group deletion logic
- Check resource group existence before attempting deletion when using
  user-provided resource group names
- Handle 404 errors gracefully by logging and continuing with cluster deletion

The helper function follows Azure SDK guidelines for error handling:
https://pkg.go.dev/github.com/Azure/azure-sdk-for-go/sdk/azcore#ResponseError

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add integration-style tests that verify the actual DestroyInfraOptions.Run
behavior when Azure returns a 404 ResourceGroupNotFound error. This follows
the Azure SDK testing pattern of injecting mock transports at the HTTP level.

Changes:
- Add clientOptions and azureCredential fields to DestroyInfraOptions
  for dependency injection during testing
- Add shared Azure test utilities in support/testutil/azure.go:
  - FakeAzureCredential: implements azcore.TokenCredential
  - MockAzureTransport: implements policy.Transporter
  - MockAzureResourceGroupTransport: handles HEAD/DELETE methods properly
  - Factory functions for common error scenarios
- Add TestDestroyInfraRun test that verifies:
  - Resource group exists: deletion succeeds
  - 404 errors are handled gracefully (no error returned)
  - Other errors (403) are properly propagated

References:
- Azure SDK Design Guidelines: https://azure.github.io/azure-sdk/golang_introduction.html
- Azure SDK internal mock package: https://pkg.go.dev/github.com/Azure/azure-sdk-for-go/sdk/internal/mock
- Azure SDK fake server examples: https://github.com/Azure/azure-sdk-for-go/tree/main/sdk/samples/fakes

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@bryan-cox
Copy link
Member Author

Tested this out. Here is what previously happened.

% ./bin/hypershift destroy cluster azure --name $CLUSTER_NAME --azure-creds $AZURE_CREDS --resource-group-name ${MANAGED_RG_NAME}
{"level":"info","ts":"2025-12-17T10:55:06-05:00","msg":"Found hosted cluster","namespace":"clusters","name":"brcox-sm-hc"}
{"level":"info","ts":"2025-12-17T10:55:06-05:00","msg":"Using credentials from file","path":"/Users/brcox/.azure/credentials"}
{"level":"error","ts":"2025-12-17T10:55:07-05:00","msg":"Failed to destroy cluster","error":"failed to get resource group name, 'brcox-sm-managed-rg': GET https://management.azure.com/subscriptions/5f99720c-6823-4792-8a28-69efb0719eea/resourcegroups/brcox-sm-managed-rg\n--------------------------------------------------------------------------------\nRESPONSE 404: 404 Not Found\nERROR CODE: ResourceGroupNotFound\n--------------------------------------------------------------------------------\n{\n  \"error\": {\n    \"code\": \"ResourceGroupNotFound\",\n    \"message\": \"Resource group 'brcox-sm-managed-rg' could not be found.\"\n  }\n}\n--------------------------------------------------------------------------------\n","stacktrace":"github.com/openshift/hypershift/cmd/cluster/azure.NewDestroyCommand.func1\n\t/Users/brcox/bryan-cox/hypershift_ws/hypershift/cmd/cluster/azure/destroy.go:53\ngithub.com/spf13/cobra.(*Command).execute\n\t/Users/brcox/bryan-cox/hypershift_ws/hypershift/vendor/github.com/spf13/cobra/command.go:1019\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\t/Users/brcox/bryan-cox/hypershift_ws/hypershift/vendor/github.com/spf13/cobra/command.go:1148\ngithub.com/spf13/cobra.(*Command).Execute\n\t/Users/brcox/bryan-cox/hypershift_ws/hypershift/vendor/github.com/spf13/cobra/command.go:1071\ngithub.com/spf13/cobra.(*Command).ExecuteContext\n\t/Users/brcox/bryan-cox/hypershift_ws/hypershift/vendor/github.com/spf13/cobra/command.go:1064\nmain.main\n\t/Users/brcox/bryan-cox/hypershift_ws/hypershift/main.go:78\nruntime.main\n\t/opt/homebrew/Cellar/go/1.24.5/libexec/src/runtime/proc.go:283"}

Here is what happens now with this PR:

% ./bin/hypershift destroy cluster azure --name $CLUSTER_NAME --azure-creds $AZURE_CREDS --resource-group-name ${MANAGED_RG_NAME}
{"level":"info","ts":"2025-12-17T10:57:43-05:00","msg":"Found hosted cluster","namespace":"clusters","name":"brcox-sm-hc"}
{"level":"info","ts":"2025-12-17T10:57:43-05:00","msg":"Using credentials from file","path":"/Users/brcox/.azure/credentials"}
{"level":"info","ts":"2025-12-17T10:57:44-05:00","msg":"Resource group not found, skipping Azure resource cleanup","resource-group":"brcox-sm-managed-rg"}

When deleting a HostedCluster after the Azure resource group has been
removed externally, the cluster deletion gets stuck because CAPZ
AzureMachine finalizers prevent Machine deletion, which blocks NodePool
and HostedCluster cleanup.

This implements the platform.OrphanDeleter interface for Azure:
- Check if the Azure resource group exists via Azure SDK
- If gone (404), remove finalizers from all deleting AzureMachines
- This unblocks the deletion chain

The implementation uses a fail-safe approach: only remove finalizers
when the Azure API definitively returns 404. For any other error
(permissions, network issues, etc.), assume the resource group exists
to avoid orphaning cloud resources that would continue to incur costs.

This extends CNTRLPLANE-642 with operator-side logic to complement the
CLI-side fix for handling missing resource groups during cluster deletion.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@openshift-ci-robot
Copy link

openshift-ci-robot commented Dec 17, 2025

@bryan-cox: This pull request references CNTRLPLANE-642 which is a valid jira issue.

Details

In response to this:

What this PR does / why we need it:

When deleting an Azure HostedCluster, if the associated resource group does not exist (404), the deletion should complete gracefully. This PR addresses two scenarios:

1. CLI-side fix (hypershift destroy cluster azure)

When a user runs the destroy command and the resource group is already gone, the CLI now handles this gracefully instead of failing with an error.

2. Operator-side fix (HostedCluster deletion)

When deleting a HostedCluster after the Azure resource group has been removed externally (via hypershift destroy infra azure or manual deletion), the cluster deletion would get stuck because:

  • CAPZ has finalizers on AzureMachine objects
  • CAPZ tries to delete the Azure VM but the resource group is gone (404)
  • The finalizer is never removed, blocking Machine → NodePool → HostedCluster deletion

This implements the platform.OrphanDeleter interface for Azure (following the AWS pattern) to remove finalizers from orphaned AzureMachines when the resource group no longer exists.

Changes:

CLI (cmd/):

  • In cmd/cluster/azure/destroy.go: When a user-provided resource group returns 404, log an info message, skip Azure cleanup, and proceed with cluster CR deletion
  • In cmd/infra/azure/destroy.go: Refactor error handling to use the Azure SDK recommended pattern (errors.As with *azcore.ResponseError) instead of fragile string matching
  • Add IsNotFoundError helper function for reusable 404 error detection
  • Add integration tests using Azure SDK mock transport pattern
  • Add shared Azure test utilities in support/testutil/azure.go

Operator (hypershift-operator/):

  • Implement DeleteOrphanedMachines for Azure platform in azure.go
  • Add resourceGroupChecker interface for testability
  • Use fail-safe approach: only remove finalizers when Azure API definitively returns 404
  • Add comprehensive unit tests with mock resource group checker

This follows Azure SDK best practices which states: "Use errors.As() to access this type in the error chain."

Testing approach follows the Azure SDK testing patterns:

  • Mock HTTP transport at the policy.Transporter level
  • Fake azcore.TokenCredential for authentication
  • Inject via ClientOptions without modifying CLI behavior
  • Interface-based mocking for operator unit tests

Test scenarios:

  • Resource group exists → deletion succeeds / no orphan cleanup needed
  • Resource group not found (404) → succeeds without error / removes finalizers from deleting machines
  • Authorization error (403) → returns error / assumes resource group exists (fail-safe)

Which issue(s) this PR fixes:

Fixes https://issues.redhat.com/browse/CNTRLPLANE-642

Special notes for your reviewer:

The error handling now properly distinguishes between:

  • Resource group not found (404) - treated as non-fatal, logs info message and continues
  • Other Azure API errors - treated as fatal and returned as errors

The operator-side fix uses a fail-safe approach: it only removes finalizers when the Azure API definitively returns 404. For any other error (permissions, network issues, etc.), it assumes the resource group exists to avoid orphaning cloud resources that would continue to incur costs.

The test utilities in support/testutil/azure.go can be reused by other Azure CLI tests.

Checklist:

  • Subject and description added to both, commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests.

🤖 Generated with Claude Code

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot
Copy link

openshift-ci-robot commented Dec 17, 2025

@bryan-cox: This pull request references CNTRLPLANE-642 which is a valid jira issue.

Details

In response to this:

What this PR does / why we need it:

Fixes Azure HostedCluster deletion when the resource group is already gone (404).

Two fixes:

  1. CLI: hypershift destroy cluster azure now handles missing resource groups gracefully instead of erroring
  2. Operator: Implements DeleteOrphanedMachines for Azure to remove CAPZ finalizers from orphaned AzureMachines, unblocking the Machine → NodePool → HostedCluster deletion chain

Key changes:

  • Refactor Azure 404 error detection to use SDK-recommended errors.As pattern
  • Add resourceGroupChecker interface with fail-safe approach (only act on definitive 404)
  • Add unit/integration tests with Azure SDK mock patterns

Which issue(s) this PR fixes:

Fixes https://issues.redhat.com/browse/CNTRLPLANE-642

Special notes for your reviewer:

The operator fix uses a fail-safe approach: finalizers are only removed when Azure definitively returns 404. Other errors (403, network issues) assume the resource group exists to avoid orphaning cloud resources.

Checklist:

  • Subject and description added to both, commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@bryan-cox
Copy link
Member Author

/test all

@bryan-cox
Copy link
Member Author

/retest-required

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Dec 18, 2025

@bryan-cox: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-aws-4-21 aac2b64 link true /test e2e-aws-4-21

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. area/cli Indicates the PR includes changes for CLI area/control-plane-operator Indicates the PR includes changes for the control plane operator - in an OCP release area/hypershift-operator Indicates the PR includes changes for the hypershift operator and API - outside an OCP release area/platform/azure PR/issue for Azure (AzurePlatform) platform do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants