
fix(hypershift/gcp): reconstruct resource names when SHARED_DIR is empty #77347

Open
cristianoveiga wants to merge 1 commit into openshift:main from cristianoveiga:fix/deprovision-reconstruct-fallback

Conversation

@cristianoveiga
Contributor

Summary

  • When the provision step is aborted (SIGTERM), the SHARED_DIR Kubernetes Secret may not be updated, leaving post steps with no project IDs to clean up. This results in orphaned GCP projects.
  • Since resource names are deterministic (derived from NAMESPACE and UNIQUE_HASH env vars), the deprovision step can now reconstruct them as a fallback using the same naming logic as hypershift-gcp-gke-provision.
  • Also adds GKE_REGION env var to the deprovision ref for region reconstruction.
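The fallback described in the bullets above might look roughly like the following. This is a minimal sketch, not the code in this PR. It assumes that INFRA_ID is `${NAMESPACE}-${UNIQUE_HASH}`, that the GKE cluster name is INFRA_ID with a `-gke` suffix, and that `us-central1` is the default when GKE_REGION is unset. The function name `reconstruct_names` and the variable `HC_PROJECT_ID` are hypothetical. The project-name pattern (first 14 characters of INFRA_ID plus a suffix) follows the test plan, and the sample values come from the rehearsal log later in this conversation.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: rebuild resource names from env vars alone.
# Assumption: INFRA_ID is "${NAMESPACE}-${UNIQUE_HASH}" (not confirmed by the PR text).
reconstruct_names() {
  local namespace="$1" unique_hash="$2"
  INFRA_ID="${namespace}-${unique_hash}"
  # Project names: first 14 characters of INFRA_ID plus a fixed suffix.
  CP_PROJECT_ID="${INFRA_ID:0:14}-control-plane"
  HC_PROJECT_ID="${INFRA_ID:0:14}-hosted-cluster"
  # Assumption: the cluster name is INFRA_ID with a "-gke" suffix.
  CP_CLUSTER_NAME="${INFRA_ID}-gke"
  # Assumption: us-central1 is the default region when GKE_REGION is unset.
  GCP_REGION="${GKE_REGION:-us-central1}"
}

# Sample values taken from the rehearsal log; real runs would use the job's env.
reconstruct_names "${NAMESPACE:-ci-op-kjg0m3c3}" "${UNIQUE_HASH:-6ab55}"
echo "CP_PROJECT_ID=${CP_PROJECT_ID}"
echo "HC_PROJECT_ID=${HC_PROJECT_ID}"
echo "CP_CLUSTER_NAME=${CP_CLUSTER_NAME}"
echo "GCP_REGION=${GCP_REGION}"
```

Because every value is a pure function of environment variables that all steps of the job share, the deprovision step can compute them without reading anything the provision step wrote.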

How SHARED_DIR works

Each step runs as a separate pod. SHARED_DIR is backed by a Kubernetes Secret that is copied into the pod on start and uploaded back after the step exits. If a step is aborted (SIGTERM → SIGKILL), the Secret update may not complete, leaving SHARED_DIR empty for subsequent steps.
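Given that behavior, a post step cannot assume any file in SHARED_DIR exists. One way to handle this is a read-or-fallback helper, sketched below. The helper name `read_shared_or_fallback` and the file name used in the usage note are hypothetical and are not taken from the actual step scripts.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: print the content of $SHARED_DIR/<file> when it exists
# and is non-empty; otherwise warn on stderr and print the fallback value.
read_shared_or_fallback() {
  local file="$1" fallback="$2"
  local path="${SHARED_DIR:-/nonexistent}/${file}"
  if [[ -s "${path}" ]]; then
    cat "${path}"
  else
    echo "WARNING: ${file} missing from SHARED_DIR - using reconstructed value" >&2
    echo "${fallback}"
  fi
}
```

A caller could then write, for example, `CP_PROJECT_ID="$(read_shared_or_fallback control-plane-project-id "${RECONSTRUCTED_ID}")"`, where `control-plane-project-id` and `RECONSTRUCTED_ID` are placeholder names. Values written by a provision step that completed normally still take precedence, and the reconstructed value is used only when the Secret upload was lost.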

Evidence

Test plan

  • Rehearsal: /pj-rehearse pull-ci-openshift-hypershift-main-e2e-gke
  • Verify reconstruction logic matches provision naming: ${INFRA_ID:0:14}-control-plane / ${INFRA_ID:0:14}-hosted-cluster

🤖 Generated with Claude Code

When the provision step is aborted (SIGTERM), the SHARED_DIR Kubernetes
Secret may not be updated, leaving post steps with no project IDs to
clean up. Since resource names are deterministic (derived from NAMESPACE
and UNIQUE_HASH env vars), the deprovision step can reconstruct them
as a fallback.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress label (Indicates that a PR should not merge because it is a work in progress.) Apr 2, 2026
@openshift-ci
Contributor

openshift-ci bot commented Apr 2, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@cristianoveiga
Contributor Author

/pj-rehearse pull-ci-openshift-hypershift-main-e2e-gke

@openshift-ci-robot
Contributor

@cristianoveiga: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@openshift-ci
Contributor

openshift-ci bot commented Apr 2, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cristianoveiga

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved label (Indicates a PR has been approved by an approver from all required OWNERS files.) Apr 2, 2026
@openshift-ci-robot
Contributor

[REHEARSALNOTIFIER]
@cristianoveiga: the pj-rehearse plugin accommodates running rehearsal tests for the changes in this PR. Expand 'Interacting with pj-rehearse' for usage details. The following rehearsable tests have been affected by this change:

Test name                                          Repo                  Type       Reason
pull-ci-openshift-hypershift-main-e2e-gke          openshift/hypershift  presubmit  Registry content changed
pull-ci-openshift-hypershift-release-5.0-e2e-gke   openshift/hypershift  presubmit  Registry content changed
pull-ci-openshift-hypershift-release-4.23-e2e-gke  openshift/hypershift  presubmit  Registry content changed
pull-ci-openshift-hypershift-release-4.22-e2e-gke  openshift/hypershift  presubmit  Registry content changed

Prior to this PR being merged, you will need to either run and acknowledge or opt to skip these rehearsals.

Interacting with pj-rehearse

Comment: /pj-rehearse to run up to 5 rehearsals
Comment: /pj-rehearse skip to opt-out of rehearsals
Comment: /pj-rehearse {test-name}, with each test separated by a space, to run one or more specific rehearsals
Comment: /pj-rehearse more to run up to 10 rehearsals
Comment: /pj-rehearse max to run up to 25 rehearsals
Comment: /pj-rehearse auto-ack to run up to 5 rehearsals, and add the rehearsals-ack label on success
Comment: /pj-rehearse list to get an up-to-date list of affected jobs
Comment: /pj-rehearse abort to abort all active rehearsals
Comment: /pj-rehearse network-access-allowed to allow rehearsals of tests that have the restrict_network_access field set to false. This must be executed by an openshift org member who is not the PR author

Once you are satisfied with the results of the rehearsals, comment: /pj-rehearse ack to unblock merge. When the rehearsals-ack label is present on your PR, merge will no longer be blocked by rehearsals.
If you would like the rehearsals-ack label removed, comment: /pj-rehearse reject to re-block merging.

@cristianoveiga
Contributor Author

Rehearsal validated — manually aborted the e2e-gke job mid-provision to reproduce the orphaned projects scenario.

Result: The deprovision step successfully reconstructed resource names from env vars and cleaned up all GCP resources:

WARNING: SHARED_DIR files missing - reconstructed resource names from env vars
  CP_PROJECT_ID=ci-op-kjg0m3c3-control-plane
  CP_CLUSTER_NAME=ci-op-kjg0m3c3-6ab55-gke
  GCP_REGION=us-central1
  • Hosted Cluster project deleted
  • GKE cluster deleted (22 polling attempts)
  • Control Plane project deleted
  • DNS cleanup correctly skipped (hosted-cluster-setup never ran)
  • No orphaned projects remain
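The "22 polling attempts" above reflect that GKE cluster deletion is asynchronous, so the step has to poll until the cluster is gone. A generic polling loop of that kind could be sketched as follows. The helper name `wait_until_gone`, its arguments, and its messages are hypothetical and do not reproduce the step's actual implementation or log format.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: run the given probe command repeatedly until it fails
# (taken to mean the resource no longer exists) or the attempts run out.
wait_until_gone() {
  local max_attempts="$1" interval="$2"
  shift 2
  local attempt
  for ((attempt = 1; attempt <= max_attempts; attempt++)); do
    if ! "$@" >/dev/null 2>&1; then
      echo "resource gone after ${attempt} polling attempts"
      return 0
    fi
    sleep "${interval}"
  done
  echo "resource still present after ${max_attempts} attempts" >&2
  return 1
}
```

A call such as `wait_until_gone 60 30 gcloud container clusters describe "${CP_CLUSTER_NAME}" --region "${GCP_REGION}" --project "${CP_PROJECT_ID}"` would fit this shape, with the attempt count and interval chosen arbitrarily here. Note that the sketch treats any probe failure as "gone", so a real script would want to distinguish a not-found error from an authentication or network failure.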

Deprovision build log

Note: The e2e-gke job itself is expected to fail due to a known pre-existing issue — controlPlaneVersion stays Partial because cloud-network-config-controller is missing its credentials secret for GCP. Fix pending in hypershift#7824. This PR only validates the deprovision cleanup behavior.

@cristianoveiga cristianoveiga marked this pull request as ready for review April 2, 2026 21:39
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress label (Indicates that a PR should not merge because it is a work in progress.) Apr 2, 2026
@openshift-ci openshift-ci bot requested review from csrwng and patjlm April 2, 2026 21:40
