
fix(ci): k8s cluster auth (cherry-pick to release-1.8)#4441

Open
zdrapela wants to merge 10 commits into redhat-developer:release-1.8 from zdrapela:cherry-pick-cb44f17-release-1.8

Conversation

@zdrapela
Member

@zdrapela zdrapela commented Mar 23, 2026

Cherry-pick of cb44f17 to release-1.8.

Changes

K8s cluster authentication (fix(ci): k8s cluster auth)

  • Adds common::kubectl_login() function to authenticate to Kubernetes clusters using service account tokens (sets up kubectl credentials, cluster, and context)
  • Adds common::oc_login() with common::require_vars check and oc whoami verification
  • Adds common::require_vars() utility to validate required environment variables
  • Adds lib/common.sh and lib/log.sh (structured logging with levels, colors, and timestamps)
  • Calls common::kubectl_login in all K8s job files (AKS, EKS, GKE — both helm and operator)
  • Removes re_create_k8s_service_account_and_get_token() from k8s-utils.sh (token is now provided externally)
  • Removes aws_eks_verify_cluster() and aws_eks_get_cluster_info() from aws.sh (replaced by common::kubectl_login auth check)
  • Removes is_openshift(), detect_ocp(), and detect_container_platform() from utils.sh
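The helper names above come from the PR; their bodies are not shown here, so the following is only a rough sketch of what `common::require_vars` and `common::kubectl_login` plausibly look like based on the bullet descriptions (the real implementations live in `.ci/pipelines/lib/common.sh`):

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the helpers described above; bodies are assumed,
# only the names and behavior come from the PR description.

# Fail if any named environment variable is unset or empty.
common::require_vars() {
  local var missing=0
  for var in "$@"; do
    if [[ -z "${!var:-}" ]]; then
      echo "ERROR: required variable ${var} is not set" >&2
      missing=1
    fi
  done
  return "${missing}"
}

# Authenticate kubectl against a cluster using a service account token:
# set up credentials, cluster, and context, then switch to that context.
common::kubectl_login() {
  common::require_vars K8S_CLUSTER_URL K8S_CLUSTER_TOKEN || return 1
  kubectl config set-credentials ci-user --token="${K8S_CLUSTER_TOKEN}"
  kubectl config set-cluster ci-cluster --server="${K8S_CLUSTER_URL}"
  kubectl config set-context ci-context --cluster=ci-cluster --user=ci-user
  kubectl config use-context ci-context
}
```

The context/user names (`ci-*`) are placeholders; the actual file may also verify the login afterwards, as `common::oc_login` does with `oc whoami`.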

Environment variable defaults (chore(ci): use IS_OPENSHIFT from CI)

  • Changes env_variables.sh to inherit IS_OPENSHIFT, CONTAINER_PLATFORM, and CONTAINER_PLATFORM_VERSION from the CI environment instead of initializing them as empty strings
    • IS_OPENSHIFT defaults to true, CONTAINER_PLATFORM and CONTAINER_PLATFORM_VERSION default to unknown

Conflict resolution

  • common.sh and log.sh did not exist on release-1.8 — created both files from the commit's version
  • GKE job files had a different structure (IS_OPENSHIFT, gcloud_auth, inline base64 encoding) — kept release-1.8 structure and placed common::kubectl_login after the existing GKE auth setup where both K8S_CLUSTER_URL and K8S_CLUSTER_TOKEN are available
  • AKS/EKS job files auto-merged cleanly
  • Added source "${DIR}/lib/common.sh" to utils.sh

https://redhat.atlassian.net/browse/RHDHBUGS-2863

@zdrapela zdrapela force-pushed the cherry-pick-cb44f17-release-1.8 branch 2 times, most recently from 120f41f to 1aabedf on March 23, 2026 12:04
@github-actions
Contributor

The container image build workflow finished with status: cancelled.

@github-actions
Contributor

The container image build workflow finished with status: cancelled.

@zdrapela
Member Author

/test e2e-aks-helm-nightly
/test e2e-eks-helm-nightly
/test e2e-gke-helm-nightly

@github-actions
Contributor

Image was built and published successfully. It is available at:

@github-actions
Contributor

Image was built and published successfully. It is available at:

@zdrapela zdrapela force-pushed the cherry-pick-cb44f17-release-1.8 branch from 1aabedf to d1bb138 on March 23, 2026 13:30
@github-actions
Contributor

The container image build and publish workflows were skipped (either due to [skip-build] tag or no relevant changes with existing image).

@zdrapela zdrapela force-pushed the cherry-pick-cb44f17-release-1.8 branch from b8074c6 to e09f9ab on March 24, 2026 10:14
@zdrapela
Member Author

/test e2e-eks-helm-nightly

@github-actions
Contributor

The container image build workflow finished with status: cancelled.

@github-actions
Contributor

Image was built and published successfully. It is available at:

@zdrapela
Member Author

/test e2e-eks-helm-nightly

@github-actions
Contributor

The container image build and publish workflows were skipped (either due to [skip-build] tag or no relevant changes with existing image).

@zdrapela
Member Author

/agentic_review

@rhdh-qodo-merge

rhdh-qodo-merge bot commented Mar 24, 2026

Code Review by Qodo

🐞 Bugs (2) 📘 Rule violations (0) 📎 Requirement gaps (0)



Action required

1. GKE login called too early 🐞 Bug ✓ Correctness
Description
GKE jobs call common::kubectl_login before gcloud ... get-credentials and before
K8S_CLUSTER_URL is set, so common::require_vars fails and the job exits before doing any work.
Code

.ci/pipelines/jobs/gke-helm.sh[R15-18]

  echo "Starting GKE Helm deployment"

+  common::kubectl_login
+
Evidence
common::kubectl_login requires K8S_CLUSTER_TOKEN and K8S_CLUSTER_URL to be set (and
non-empty). In gke-helm.sh, the login is called at line 17, while the kubeconfig is only
configured at lines 27-29 and K8S_CLUSTER_URL is only computed at line 31. Same ordering exists in
gke-operator.sh (login at line 19, credentials at 27-29, URL at 31). This guarantees failure when
K8S_CLUSTER_URL is not pre-injected by CI.

.ci/pipelines/jobs/gke-helm.sh[14-34]
.ci/pipelines/jobs/gke-operator.sh[16-34]
.ci/pipelines/lib/common.sh[40-76]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
GKE jobs call `common::kubectl_login` before they have a kubeconfig context and before `K8S_CLUSTER_URL` is populated, causing `common::require_vars` to fail.

### Issue Context
On GKE, the kubeconfig context is established by `gcloud container clusters get-credentials`, and `K8S_CLUSTER_URL` is derived from the active kubeconfig.

### Fix Focus Areas
- .ci/pipelines/jobs/gke-helm.sh[14-34]
- .ci/pipelines/jobs/gke-operator.sh[16-34]

### Suggested change
Reorder the flow so that:
1) `gcloud_auth` + `gcloud_gke_get_credentials` run first,
2) then set `K8S_CLUSTER_URL` (and ensure `K8S_CLUSTER_TOKEN` is set if required),
3) only then call `common::kubectl_login`.
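The three-step reordering above can be sketched as a single setup function. `gcloud_auth`, `gcloud_gke_get_credentials`, and `common::kubectl_login` are the repo's own helpers (not defined here); the jsonpath lookup for the cluster URL is an assumption about how `K8S_CLUSTER_URL` is derived:

```shell
# Hypothetical reordered GKE job flow, per the review suggestion. This is a
# sketch, not the actual gke-helm.sh; helper bodies live elsewhere in the repo.
gke_setup() {
  # 1) Establish the kubeconfig context first.
  gcloud_auth
  gcloud_gke_get_credentials

  # 2) Derive the cluster URL from the now-active kubeconfig.
  K8S_CLUSTER_URL="$(kubectl config view --minify \
    -o jsonpath='{.clusters[0].cluster.server}')"
  export K8S_CLUSTER_URL

  # 3) Only then perform the token-based login (K8S_CLUSTER_TOKEN must
  #    already be provided by CI at this point).
  common::kubectl_login
}
```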



2. IS_OPENSHIFT defaults to true 🐞 Bug ✓ Correctness
Description
IS_OPENSHIFT now defaults to true, but AKS/EKS jobs do not override it to false, which makes
Kubernetes jobs take OpenShift-only branches (e.g., applying route.openshift.io/v1 resources) and
fail on non-OpenShift clusters.
Code

.ci/pipelines/env_variables.sh[R173-175]

+IS_OPENSHIFT="${IS_OPENSHIFT:-true}"
+CONTAINER_PLATFORM="${CONTAINER_PLATFORM:-unknown}"
+CONTAINER_PLATFORM_VERSION="${CONTAINER_PLATFORM_VERSION:-unknown}"
Evidence
env_variables.sh now sets IS_OPENSHIFT to true by default. In AKS jobs, there is no
IS_OPENSHIFT=false override (see aks-helm.sh/aks-operator.sh). The function apply_yaml_files
uses IS_OPENSHIFT to choose between Kubernetes Ingress vs OpenShift Route, and when IS_OPENSHIFT
is not empty and not false, it applies topology-test-route.yaml, which is an OpenShift Route
(apiVersion: route.openshift.io/v1) and will not exist on vanilla Kubernetes. Previously-removed
detect_ocp() was responsible for populating this value based on cluster capabilities, so this
default becomes the effective value for AKS/EKS.

.ci/pipelines/env_variables.sh[171-175]
.ci/pipelines/jobs/aks-helm.sh[1-25]
.ci/pipelines/jobs/aks-operator.sh[1-25]
.ci/pipelines/utils.sh[422-504]
.ci/pipelines/resources/topology_test/topology-test-route.yaml[1-14]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
`IS_OPENSHIFT` now defaults to `true`, which is incorrect for AKS/EKS/GKE Kubernetes clusters unless explicitly overridden. This causes Kubernetes jobs to take OpenShift-only branches.

### Issue Context
AKS/EKS jobs don't set `IS_OPENSHIFT=false`. The codebase uses `IS_OPENSHIFT` to pick OpenShift Route vs Kubernetes Ingress.

### Fix Focus Areas
- .ci/pipelines/env_variables.sh[173-175]
- .ci/pipelines/jobs/aks-helm.sh[1-25]
- .ci/pipelines/jobs/aks-operator.sh[1-25]
- .ci/pipelines/jobs/eks-helm.sh[14-26]
- .ci/pipelines/jobs/eks-operator.sh[14-26]

### Suggested change
Either:
- revert `IS_OPENSHIFT` default to empty and restore detection logic, or
- explicitly set `IS_OPENSHIFT=false` at the start of all AKS/EKS/GKE Kubernetes jobs (before any branching), keeping `true` only for OCP jobs.
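The second option above amounts to pinning the variable per job type before any branching. A sketch, with the Route/Ingress branch mirroring the `apply_yaml_files` behavior described in the Evidence (the echo lines and the Ingress counterpart are placeholders):

```shell
# Top of aks-helm.sh / eks-helm.sh / gke-helm.sh (and the operator variants):
export IS_OPENSHIFT=false

# utils.sh can then branch safely; this stand-in mirrors the described logic.
apply_topology_route_or_ingress() {
  if [[ "${IS_OPENSHIFT:-false}" == "true" ]]; then
    echo "applying OpenShift Route"       # topology-test-route.yaml
  else
    echo "applying Kubernetes Ingress"    # hypothetical Ingress manifest
  fi
}
```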




Remediation recommended

3. Hardcoded EKS kubeconfig path 🐞 Bug ⛯ Reliability
Description
get_cluster_aws_region forces KUBECONFIG to ${SHARED_DIR}/kubeconfig, ignoring the
environment-provided KUBECONFIG and potentially breaking region detection when that file is
missing or different, which then breaks certificate/DNS operations.
Code

.ci/pipelines/cluster/eks/aws.sh[328]

+  cluster_arn=$(KUBECONFIG="${SHARED_DIR}/kubeconfig" kubectl config view --minify -o jsonpath='{.clusters[0].cluster.server}' 2> /dev/null)
Evidence
aws.sh states that KUBECONFIG is provided by the test environment, but get_cluster_aws_region
now overrides it to ${SHARED_DIR}/kubeconfig for the kubectl config view call. If CI provides
kubeconfig via KUBECONFIG env var (common pattern) and does not also place a copy at
${SHARED_DIR}/kubeconfig, the function returns failure and get_eks_certificate aborts early (it
treats any non-zero result as fatal). There are no other repo references indicating
${SHARED_DIR}/kubeconfig is created/populated for EKS.

.ci/pipelines/cluster/eks/aws.sh[3-6]
.ci/pipelines/cluster/eks/aws.sh[324-339]
.ci/pipelines/cluster/eks/aws.sh[122-147]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
`get_cluster_aws_region` hardcodes `KUBECONFIG=${SHARED_DIR}/kubeconfig`, which may not exist and may diverge from the kubeconfig actually used by the job.

### Issue Context
The file itself documents that kubeconfig is provided by the test environment; overriding it introduces an extra assumption.

### Fix Focus Areas
- .ci/pipelines/cluster/eks/aws.sh[324-339]

### Suggested change
Use the current `kubectl` context (no override) or conditionally use `${SHARED_DIR}/kubeconfig` only if it exists, e.g.:
- if `${SHARED_DIR}/kubeconfig` exists, use it;
- else use the existing `KUBECONFIG` (or default) without overriding.
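The conditional suggested above can be factored into a small helper; the helper name and the final fallback to `~/.kube/config` are ours, not from the repo:

```shell
# Prefer ${SHARED_DIR}/kubeconfig only when it actually exists; otherwise
# respect the environment-provided KUBECONFIG (or kubectl's default path).
pick_kubeconfig() {
  if [[ -n "${SHARED_DIR:-}" && -f "${SHARED_DIR}/kubeconfig" ]]; then
    echo "${SHARED_DIR}/kubeconfig"
  else
    echo "${KUBECONFIG:-${HOME}/.kube/config}"
  fi
}

# Hypothetical usage inside get_cluster_aws_region:
#   cluster_arn=$(KUBECONFIG="$(pick_kubeconfig)" kubectl config view --minify \
#     -o jsonpath='{.clusters[0].cluster.server}' 2> /dev/null)
```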




@zdrapela
Member Author

/test e2e-ocp-helm
/test e2e-eks-helm-nightly
/test e2e-aks-operator-nightly

@github-actions
Contributor

The container image build workflow finished with status: failure.

@zdrapela
Member Author

/retest

@github-actions
Contributor

Image was built and published successfully. It is available at:

@github-actions
Contributor

The container image build and publish workflows were skipped (either due to [skip-build] tag or no relevant changes with existing image).

@zdrapela
Member Author

/retest

@github-actions
Contributor

Image was built and published successfully. It is available at:

@zdrapela
Member Author

/retest

@github-actions
Contributor

The container image build and publish workflows were skipped (either due to [skip-build] tag or no relevant changes with existing image).

@zdrapela zdrapela force-pushed the cherry-pick-cb44f17-release-1.8 branch from c49189d to 8cd959c on March 27, 2026 08:46
@github-actions
Contributor

The container image build and publish workflows were skipped (either due to [skip-build] tag or no relevant changes with existing image).

The psql command in the create-sonataflow-db-manual Job used
`&& echo ok || echo fail`, which always exits 0, masking real
errors like password authentication failures. The Job was marked
Complete by Kubernetes even when psql failed, causing the downstream
jobs-service rollout to time out.

Now capture psql output and only treat "already exists" as benign;
all other failures (auth errors, connection refused, etc.) exit 1
so the Job correctly reports failure and the pipeline can detect it.
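The commit message above describes the fix only in prose; a minimal sketch of that pattern, with the connection flags, database name, and helper name as placeholders, might look like:

```shell
# Capture psql output and treat only "already exists" as benign; any other
# failure (auth error, connection refused, ...) exits non-zero so the
# Kubernetes Job is correctly marked failed. Details here are assumed.
create_db_or_fail() {
  local output
  if ! output=$(psql -h "${DB_HOST}" -U "${DB_USER}" \
        -c "CREATE DATABASE sonataflow" 2>&1); then
    if echo "${output}" | grep -q "already exists"; then
      echo "database already exists, continuing"
      return 0
    fi
    echo "psql failed: ${output}" >&2
    return 1
  fi
}
```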
Two changes to address flaky audit-log tests:

1. Increase --tail from 100 to 500: the 100-line window was too small
   and target log lines were getting pushed out by concurrent test
   activity (permission evaluations, catalog reads) from other spec
   files running in parallel workers.

2. Add 2s delay before first log fetch: gives the backend time to
   flush the audit log entry to pod stdout before oc logs is called,
   eliminating the race between API response and log availability.
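Expressed as shell, the two mitigations reduce to a delay plus a wider log window; the pod variable and function name below are placeholders, while `--tail=500` and the 2s delay come from the commit message:

```shell
# Sketch of the flake mitigations described above (not the actual test code).
fetch_audit_logs() {
  sleep 2                                   # let the backend flush to stdout
  oc logs "pod/${BACKEND_POD}" --tail=500   # was --tail=100: too small a window
}
```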
@zdrapela zdrapela force-pushed the cherry-pick-cb44f17-release-1.8 branch from 04c6519 to 74cd93a Compare March 27, 2026 12:56
@github-actions
Contributor

The container image build workflow finished with status: cancelled.

@github-actions
Contributor

Image was built and published successfully. It is available at:

@github-actions
Contributor

Image was built and published successfully. It is available at:

@github-actions
Contributor

Image was built and published successfully. It is available at:

@zdrapela
Member Author

/retest

1 similar comment
@zdrapela
Member Author

/retest

@openshift-ci

openshift-ci bot commented Mar 31, 2026

@zdrapela: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-gke-helm-nightly | 1aabedf | link | false | /test e2e-gke-helm-nightly |
| ci/prow/e2e-aks-helm-nightly | 1aabedf | link | false | /test e2e-aks-helm-nightly |
| ci/prow/e2e-eks-helm-nightly | 52c0d0e | link | false | /test e2e-eks-helm-nightly |
| ci/prow/e2e-aks-operator-nightly | 52c0d0e | link | false | /test e2e-aks-operator-nightly |
| ci/prow/e2e-ocp-helm | 2d4e404 | link | true | /test e2e-ocp-helm |

Full PR test history. Your PR dashboard.

