@bryan-cox
Member

@bryan-cox bryan-cox commented Nov 4, 2025

Summary

Fixes Azure Disk and File CSI drivers on Azure self-managed hosted clusters by adding a token-minter sidecar container.

Problem

On Azure self-managed hosted clusters (HyperShift mode), Azure Disk and File CSI driver controllers fail to provision volumes with errors:

  • Azure-Disk: WorkloadIdentityCredential: open /var/run/secrets/openshift/serviceaccount/token: no such file or directory
  • Azure-File: failed to ensure storage account: clientFactory is nil

The CSI driver controllers run in the management cluster but need guest cluster service account tokens for Azure workload identity authentication. The token file at /var/run/secrets/openshift/serviceaccount/token does not exist because there is no mechanism to create it.

Solution

This PR adds a shared WithTokenMinter(serviceAccountName string) deployment hook function in pkg/driver/common/operator/hooks.go that both Azure Disk and File CSI driver operators use to inject a token-minter sidecar container.

The token-minter sidecar:

  • Runs the /usr/bin/control-plane-operator token-minter command
  • Creates service account tokens for the guest cluster namespace openshift-cluster-csi-drivers
  • Writes tokens to /var/run/secrets/openshift/serviceaccount/token in a shared emptyDir volume
  • Uses the service-network-admin-kubeconfig secret to access the guest cluster
  • Reads the HYPERSHIFT_IMAGE environment variable directly instead of using a ${HYPERSHIFT_IMAGE} placeholder, because deployment hooks run after asset placeholder replacement

Note: The bound-sa-token emptyDir volume and hosted-kubeconfig secret volume are already added by the HyperShift patch files (controller_add_hypershift_controller.yaml), so the hook only adds the token-minter container.
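
The hook described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code in pkg/driver/common/operator/hooks.go: the Deployment, Container, and VolumeMount types are simplified stand-ins for the Kubernetes API structs, DeploymentHook and tokenMinterHook are hypothetical names, and the flags passed to token-minter are assumptions. The sidecar is skipped when HYPERSHIFT_IMAGE is unset, which is how one hook can serve both self-managed Azure and ARO HCP.

```go
package main

import (
	"fmt"
	"os"
)

// Simplified stand-ins for the Kubernetes API types, so the sketch
// compiles without external modules.
type VolumeMount struct{ Name, MountPath string }

type Container struct {
	Name         string
	Image        string
	Command      []string
	Args         []string
	VolumeMounts []VolumeMount
}

type Deployment struct{ Containers []Container }

// DeploymentHook is a hypothetical hook signature: it mutates a deployment in place.
type DeploymentHook func(d *Deployment) error

// Directory that holds the token file named in the PR description.
const tokenDir = "/var/run/secrets/openshift/serviceaccount"

// tokenMinterHook (hypothetical helper) appends the sidecar only when an
// image is available; an empty image means "leave the deployment untouched".
func tokenMinterHook(serviceAccountName, image string) DeploymentHook {
	return func(d *Deployment) error {
		if image == "" {
			return nil // e.g. ARO HCP, where HYPERSHIFT_IMAGE is not passed
		}
		d.Containers = append(d.Containers, Container{
			Name:    "token-minter",
			Image:   image,
			Command: []string{"/usr/bin/control-plane-operator", "token-minter"},
			// Flag names below are assumptions, not confirmed by the PR text.
			Args: []string{
				"--service-account-namespace=openshift-cluster-csi-drivers",
				"--service-account-name=" + serviceAccountName,
				"--token-file=" + tokenDir + "/token",
			},
			// The bound-sa-token volume itself comes from the HyperShift patch
			// files; the kubeconfig mount is omitted because its path is not
			// stated in the description.
			VolumeMounts: []VolumeMount{{Name: "bound-sa-token", MountPath: tokenDir}},
		})
		return nil
	}
}

// WithTokenMinter reads the image from the environment rather than from a
// placeholder, mirroring the behavior the description outlines.
func WithTokenMinter(serviceAccountName string) DeploymentHook {
	return tokenMinterHook(serviceAccountName, os.Getenv("HYPERSHIFT_IMAGE"))
}

func main() {
	d := &Deployment{Containers: []Container{{Name: "csi-driver"}}}
	if err := WithTokenMinter("azure-disk-csi-driver-controller-sa")(d); err != nil {
		fmt.Println("hook failed:", err)
		return
	}
	for _, c := range d.Containers {
		fmt.Println(c.Name)
	}
}
```

Splitting the environment lookup from the container construction keeps the construction logic testable without touching process state.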

Platform-Specific Behavior

The hook is added to both Azure Disk and File drivers. The platform-specific behavior is controlled by cluster-storage-operator:

  • Self-managed Azure: cluster-storage-operator passes HYPERSHIFT_IMAGE env var to the CSI driver operators, enabling token-minter functionality
  • ARO HCP: cluster-storage-operator does NOT pass HYPERSHIFT_IMAGE, as ARO HCP uses Secret Provider Class with managed identities instead

This follows the same pattern already used by AWS EBS CSI driver and Azure Cloud Controller Manager.

Changes

  • pkg/driver/common/operator/hooks.go: Added WithTokenMinter(serviceAccountName string) deployment hook (lines 257-301)
  • pkg/driver/common/operator/replacer.go: Fixed copy-paste bug checking wrong variable for HYPERSHIFT_IMAGE (line 64)
  • pkg/driver/azure-disk/azure_disk.go: Use common WithTokenMinter() with azure-disk-csi-driver-controller-sa (line 228)
  • pkg/driver/azure-file/azure_file.go: Use common WithTokenMinter() with azure-file-csi-driver-controller-sa (line 187)
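
The replacer.go fix is small but easy to misread, so here is a hedged sketch of the kind of guard involved. The function names buildReplacements and render, the ${DRIVER_IMAGE} placeholder, and the image reference are hypothetical; only the ${HYPERSHIFT_IMAGE} placeholder and the two variable names (csiDriver, hyperShiftImage) come from the commit message. Before the fix, the second condition tested csiDriver instead of hyperShiftImage.

```go
package main

import (
	"fmt"
	"strings"
)

// buildReplacements is a hypothetical stand-in for DefaultReplacements().
// It returns old/new pairs in the form strings.NewReplacer expects.
func buildReplacements(csiDriver, hyperShiftImage string) []string {
	var pairs []string
	if csiDriver != "" {
		pairs = append(pairs, "${DRIVER_IMAGE}", csiDriver)
	}
	// Corrected guard: gate the HYPERSHIFT_IMAGE pair on its own variable.
	// The copy-paste bug checked csiDriver != "" here instead.
	if hyperShiftImage != "" {
		pairs = append(pairs, "${HYPERSHIFT_IMAGE}", hyperShiftImage)
	}
	return pairs
}

// render applies the replacement pairs to an asset string.
func render(asset, csiDriver, hyperShiftImage string) string {
	return strings.NewReplacer(buildReplacements(csiDriver, hyperShiftImage)...).Replace(asset)
}

func main() {
	// With the corrected guard, the placeholder is replaced even when
	// csiDriver is empty.
	fmt.Println(render("image: ${HYPERSHIFT_IMAGE}", "", "quay.io/example/hypershift:latest"))
	// prints "image: quay.io/example/hypershift:latest"
}
```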

Testing

On an Azure self-managed hosted cluster:

  1. Verify CSI driver controller deployments have the token-minter sidecar
  2. Verify the token file exists in the controller pods
  3. Create PVCs using Azure Disk and Azure File storage classes
  4. Verify PVCs reach Bound status
  5. Verify pods can successfully use the volumes

@openshift-ci openshift-ci bot requested review from jsafrane and tsmetana November 4, 2025 15:31
@bryan-cox bryan-cox marked this pull request as draft November 4, 2025 15:32
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 4, 2025
@bryan-cox
Member Author

/uncc @jsafrane

@openshift-ci openshift-ci bot removed the request for review from jsafrane November 4, 2025 15:32
@bryan-cox
Member Author

/uncc @tsmetana

@openshift-ci openshift-ci bot removed the request for review from tsmetana November 4, 2025 15:33
@bryan-cox bryan-cox force-pushed the HOSTEDCP-2033 branch 2 times, most recently from bef9f49 to fd11123 Compare November 4, 2025 15:38
@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 4, 2025
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 4, 2025
@bryan-cox bryan-cox changed the title from "fix(azure): add token-minter for self-managed hosted clusters" to "OCPBUGS-63698: fix(azure): add token-minter for self-managed hosted clusters" Nov 4, 2025
@openshift-ci-robot openshift-ci-robot added jira/severity-critical Referenced Jira bug's severity is critical for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Nov 4, 2025
@openshift-ci-robot

@bryan-cox: This pull request references Jira Issue OCPBUGS-63698, which is invalid:

  • expected the bug to target the "4.21.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

Summary

Fixes Azure Disk and File CSI drivers on Azure self-managed hosted clusters by adding a token-minter sidecar container.

Problem

On Azure self-managed hosted clusters (HyperShift mode), Azure Disk and File CSI driver controllers fail to provision volumes with errors:

  • Azure-Disk: WorkloadIdentityCredential: open /var/run/secrets/openshift/serviceaccount/token: no such file or directory
  • Azure-File: failed to ensure storage account: clientFactory is nil

The CSI driver controllers run in the management cluster but need guest cluster service account tokens for Azure workload identity authentication. The token file at /var/run/secrets/openshift/serviceaccount/token does not exist because there is no mechanism to create it.

Solution

This PR adds a shared WithTokenMinter(serviceAccountName string) deployment hook function in pkg/driver/common/operator/hooks.go that both Azure Disk and File CSI driver operators use to:

  1. For self-managed Azure clusters: Inject a token-minter sidecar container that creates guest cluster service account tokens
  2. For ARO HCP: Continue using the existing Secret Provider Class approach with managed identities (no changes)

The conditional logic checks for the presence of ARO_HCP_SECRET_PROVIDER_CLASS_FOR_* environment variables:

  • If present → ARO HCP mode (use Secret Provider Class)
  • If absent → Self-managed Azure mode (use token-minter)

The token-minter sidecar:

  • Runs the /usr/bin/control-plane-operator token-minter command
  • Creates service account tokens for the guest cluster namespace openshift-cluster-csi-drivers
  • Writes tokens to /var/run/secrets/openshift/serviceaccount/token in a shared emptyDir volume
  • Uses the service-network-admin-kubeconfig secret to access the guest cluster

Note: The bound-sa-token emptyDir volume and hosted-kubeconfig secret volume are already added by the HyperShift patch files (controller_add_hypershift_controller.yaml), so the hook only adds the token-minter container.

This follows the same pattern already used by AWS EBS CSI driver and Azure Cloud Controller Manager.

Changes

  • pkg/driver/common/operator/hooks.go: Added WithTokenMinter(serviceAccountName string) deployment hook
  • pkg/driver/azure-disk/azure_disk.go: Use common WithTokenMinter() with azure-disk-csi-driver-controller-sa
  • pkg/driver/azure-file/azure_file.go: Use common WithTokenMinter() with azure-file-csi-driver-controller-sa

Testing

On an Azure self-managed hosted cluster:

  1. Verify CSI driver controller deployments have the token-minter sidecar
  2. Verify the token file exists in the controller pods
  3. Create PVCs using Azure Disk and Azure File storage classes
  4. Verify PVCs reach Bound status
  5. Verify pods can successfully use the volumes

References

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

Azure Disk and File CSI drivers fail on Azure self-managed hosted
clusters because the service account token at
/var/run/secrets/openshift/serviceaccount/token does not exist.

Add runtime deployment hooks that conditionally inject a token-minter
sidecar container for self-managed Azure clusters. The token-minter
creates guest cluster service account tokens that the CSI drivers
use for Azure workload identity authentication.

ARO HCP continues to use Secret Provider Class with managed
identities and is not affected by this change.

Fixes: OCPBUGS-63698
Signed-off-by: Bryan Cox <brcox@redhat.com>
Commit-Message-Assisted-by: Claude (via Claude Code)

The token-minter image should use the ${HYPERSHIFT_IMAGE} placeholder
instead of reading os.Getenv() directly. The placeholder is replaced
at runtime by the DefaultReplacements() function when the operator
processes the deployment.

This matches the pattern used in AWS EBS static patches.

Fix copy-paste error in DefaultReplacements() where HYPERSHIFT_IMAGE
placeholder replacement was incorrectly gated on csiDriver != ""
instead of hyperShiftImage != "".

This bug prevented ${HYPERSHIFT_IMAGE} placeholders from being
replaced with the actual image value, causing token-minter containers
to have invalid image references.

Signed-off-by: Bryan Cox <brcox@redhat.com>
Commit-Message-Assisted-by: Claude (via Claude Code)

Deployment hooks run after asset placeholder replacement, so
placeholders added by hooks never get replaced. Fix by directly
reading os.Getenv("HYPERSHIFT_IMAGE") in the hook instead of using
a placeholder string.

Also add conditional behavior: if HYPERSHIFT_IMAGE is not set, skip
adding the token-minter container. This allows the same hook to work
for both self-managed Azure (where cluster-storage-operator sets
HYPERSHIFT_IMAGE) and ARO HCP (where it doesn't).

Signed-off-by: Bryan Cox <brcox@redhat.com>
Commit-Message-Assisted-by: Claude (via Claude Code)
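
The ordering problem described in the last commit message can be illustrated with a small sketch. It treats a manifest as a plain string and uses a hypothetical process function and hook type; the real operator works on structured deployment objects, and the image reference is made up. The point is only that text added by a hook after substitution keeps its placeholder.

```go
package main

import (
	"fmt"
	"strings"
)

// hook is a hypothetical post-processing step applied to a manifest.
type hook func(manifest string) string

// process mimics the ordering from the commit message: placeholders are
// substituted first, and hooks run afterwards on the substituted result.
func process(asset, image string, hooks ...hook) string {
	out := strings.ReplaceAll(asset, "${HYPERSHIFT_IMAGE}", image)
	for _, h := range hooks {
		out = h(out)
	}
	return out
}

func main() {
	asset := "driver-image: ${HYPERSHIFT_IMAGE}"
	image := "quay.io/example/hypershift:latest" // hypothetical image reference

	// A hook that emits a placeholder runs too late for substitution.
	withPlaceholder := func(m string) string {
		return m + "\nsidecar-image: ${HYPERSHIFT_IMAGE}"
	}
	// A hook that resolves the value itself (the real hook would read
	// os.Getenv("HYPERSHIFT_IMAGE")) produces a usable reference.
	withResolved := func(m string) string {
		return m + "\nsidecar-image: " + image
	}

	fmt.Println(process(asset, image, withPlaceholder))
	fmt.Println(process(asset, image, withResolved))
}
```

In the first call the sidecar line still contains the literal ${HYPERSHIFT_IMAGE} text, while in the second both lines carry the resolved image.
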
@openshift-ci
Contributor

openshift-ci bot commented Dec 8, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: bryan-cox
Once this PR has been reviewed and has the lgtm label, please assign tsmetana for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@bryan-cox
Member Author

/test all

@bryan-cox
Member Author

/jira refresh

@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Dec 9, 2025
@openshift-ci-robot

@bryan-cox: This pull request references Jira Issue OCPBUGS-63698, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.21.0) matches configured target version for branch (4.21.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

No GitHub users were found matching the public email listed for the QA contact in Jira (wduan@redhat.com), skipping review request.

Details

In response to this:

/jira refresh

@bryan-cox
Member Author

/retest

@bryan-cox bryan-cox marked this pull request as ready for review December 10, 2025 13:53
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 10, 2025
@openshift-ci openshift-ci bot requested review from jsafrane and mpatlasov December 10, 2025 13:55
@bryan-cox
Member Author

/retest-required

@openshift-ci
Contributor

openshift-ci bot commented Dec 10, 2025

@bryan-cox: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name                    Commit   Details  Required  Rerun command
ci/prow/e2e-azurestack-csi   0af2834  link     false     /test e2e-azurestack-csi

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-ci
Contributor

openshift-ci bot commented Dec 12, 2025

@duanwei33: This PR was included in a payload test run from openshift/cluster-storage-operator#643
trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-openshift-tests-private-release-4.21-amd64-nightly-azure-ipi-ovn-hypershift-guest-f7

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/7142dd10-d6fe-11f0-8d59-3afbbfee03e5-0
