
fix(ci): k8s cluster auth (cherry-pick to release-1.8)#4441

Open
zdrapela wants to merge 10 commits into redhat-developer:release-1.8 from zdrapela:cherry-pick-cb44f17-release-1.8

Conversation

@zdrapela
Member

@zdrapela zdrapela commented Mar 23, 2026

Cherry-pick of cb44f17 to release-1.8.

Changes

K8s cluster authentication (fix(ci): k8s cluster auth)

  • Adds common::kubectl_login() function to authenticate to Kubernetes clusters using service account tokens (sets up kubectl credentials, cluster, and context)
  • Adds common::oc_login() with common::require_vars check and oc whoami verification
  • Adds common::require_vars() utility to validate required environment variables
  • Adds lib/common.sh and lib/log.sh (structured logging with levels, colors, and timestamps)
  • Calls common::kubectl_login in all K8s job files (AKS, EKS, GKE — both helm and operator)
  • Removes re_create_k8s_service_account_and_get_token() from k8s-utils.sh (token is now provided externally)
  • Removes aws_eks_verify_cluster() and aws_eks_get_cluster_info() from aws.sh (replaced by common::kubectl_login auth check)
  • Removes is_openshift(), detect_ocp(), and detect_container_platform() from utils.sh
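The helper names above come from the PR; their bodies are not shown here, so the following is only a rough sketch of what `common::require_vars` and `common::kubectl_login` plausibly look like based on the bullet descriptions (the real implementations live in `.ci/pipelines/lib/common.sh`):

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the helpers described above; bodies are assumed,
# only the names and behavior come from the PR description.

# Fail if any named environment variable is unset or empty.
common::require_vars() {
  local var missing=0
  for var in "$@"; do
    if [[ -z "${!var:-}" ]]; then
      echo "ERROR: required variable ${var} is not set" >&2
      missing=1
    fi
  done
  return "${missing}"
}

# Authenticate kubectl against a cluster using a service account token:
# set up credentials, cluster, and context, then switch to that context.
common::kubectl_login() {
  common::require_vars K8S_CLUSTER_URL K8S_CLUSTER_TOKEN || return 1
  kubectl config set-credentials ci-user --token="${K8S_CLUSTER_TOKEN}"
  kubectl config set-cluster ci-cluster --server="${K8S_CLUSTER_URL}"
  kubectl config set-context ci-context --cluster=ci-cluster --user=ci-user
  kubectl config use-context ci-context
}
```

The context/user names (`ci-*`) are placeholders; the actual file may also verify the login afterwards, as `common::oc_login` does with `oc whoami`.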

Environment variable defaults (chore(ci): use IS_OPENSHIFT from CI)

  • Changes env_variables.sh to inherit IS_OPENSHIFT, CONTAINER_PLATFORM, and CONTAINER_PLATFORM_VERSION from the CI environment instead of initializing them as empty strings
    • IS_OPENSHIFT defaults to true, CONTAINER_PLATFORM and CONTAINER_PLATFORM_VERSION default to unknown

Conflict resolution

  • common.sh and log.sh did not exist on release-1.8 — created both files from the commit's version
  • GKE job files had a different structure (IS_OPENSHIFT, gcloud_auth, inline base64 encoding) — kept release-1.8 structure and placed common::kubectl_login after the existing GKE auth setup where both K8S_CLUSTER_URL and K8S_CLUSTER_TOKEN are available
  • AKS/EKS job files auto-merged cleanly
  • Added source "${DIR}/lib/common.sh" to utils.sh

https://redhat.atlassian.net/browse/RHDHBUGS-2863

@zdrapela zdrapela force-pushed the cherry-pick-cb44f17-release-1.8 branch 2 times, most recently from 120f41f to 1aabedf on March 23, 2026 12:04
@github-actions
Contributor

The container image build workflow finished with status: cancelled.

@github-actions
Contributor

The container image build workflow finished with status: cancelled.

@zdrapela
Member Author

/test e2e-aks-helm-nightly
/test e2e-eks-helm-nightly
/test e2e-gke-helm-nightly

@github-actions
Contributor

Image was built and published successfully. It is available at:

@github-actions
Contributor

Image was built and published successfully. It is available at:

@zdrapela zdrapela force-pushed the cherry-pick-cb44f17-release-1.8 branch from 1aabedf to d1bb138 on March 23, 2026 13:30
@github-actions
Contributor

The container image build and publish workflows were skipped (either due to [skip-build] tag or no relevant changes with existing image).

@zdrapela zdrapela force-pushed the cherry-pick-cb44f17-release-1.8 branch from b8074c6 to e09f9ab on March 24, 2026 10:14
@zdrapela
Member Author

/test e2e-eks-helm-nightly

@github-actions
Contributor

The container image build workflow finished with status: cancelled.

@github-actions
Contributor

Image was built and published successfully. It is available at:

@zdrapela
Member Author

/test e2e-eks-helm-nightly

@github-actions
Contributor

The container image build and publish workflows were skipped (either due to [skip-build] tag or no relevant changes with existing image).

@zdrapela
Member Author

/agentic_review

@rhdh-qodo-merge

rhdh-qodo-merge bot commented Mar 24, 2026

Code Review by Qodo

🐞 Bugs (2) 📘 Rule violations (0) 📎 Requirement gaps (0)



Action required

1. GKE login called too early 🐞 Bug ✓ Correctness
Description
GKE jobs call common::kubectl_login before gcloud ... get-credentials and before
K8S_CLUSTER_URL is set, so common::require_vars fails and the job exits before doing any work.
Code

.ci/pipelines/jobs/gke-helm.sh[R15-18]

  echo "Starting GKE Helm deployment"

+  common::kubectl_login
+
Evidence
common::kubectl_login requires K8S_CLUSTER_TOKEN and K8S_CLUSTER_URL to be set (and
non-empty). In gke-helm.sh, the login is called at line 17, while the kubeconfig is only
configured at lines 27-29 and K8S_CLUSTER_URL is only computed at line 31. Same ordering exists in
gke-operator.sh (login at line 19, credentials at 27-29, URL at 31). This guarantees failure when
K8S_CLUSTER_URL is not pre-injected by CI.

.ci/pipelines/jobs/gke-helm.sh[14-34]
.ci/pipelines/jobs/gke-operator.sh[16-34]
.ci/pipelines/lib/common.sh[40-76]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
GKE jobs call `common::kubectl_login` before they have a kubeconfig context and before `K8S_CLUSTER_URL` is populated, causing `common::require_vars` to fail.

### Issue Context
On GKE, the kubeconfig context is established by `gcloud container clusters get-credentials`, and `K8S_CLUSTER_URL` is derived from the active kubeconfig.

### Fix Focus Areas
- .ci/pipelines/jobs/gke-helm.sh[14-34]
- .ci/pipelines/jobs/gke-operator.sh[16-34]

### Suggested change
Reorder the flow so that:
1) `gcloud_auth` + `gcloud_gke_get_credentials` run first,
2) then set `K8S_CLUSTER_URL` (and ensure `K8S_CLUSTER_TOKEN` is set if required),
3) only then call `common::kubectl_login`.
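The three-step reordering above can be sketched as a single setup function. `gcloud_auth`, `gcloud_gke_get_credentials`, and `common::kubectl_login` are the repo's own helpers (not defined here); the jsonpath lookup for the cluster URL is an assumption about how `K8S_CLUSTER_URL` is derived:

```shell
# Hypothetical reordered GKE job flow, per the review suggestion. This is a
# sketch, not the actual gke-helm.sh; helper bodies live elsewhere in the repo.
gke_setup() {
  # 1) Establish the kubeconfig context first.
  gcloud_auth
  gcloud_gke_get_credentials

  # 2) Derive the cluster URL from the now-active kubeconfig.
  K8S_CLUSTER_URL="$(kubectl config view --minify \
    -o jsonpath='{.clusters[0].cluster.server}')"
  export K8S_CLUSTER_URL

  # 3) Only then perform the token-based login (K8S_CLUSTER_TOKEN must
  #    already be provided by CI at this point).
  common::kubectl_login
}
```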



2. IS_OPENSHIFT defaults to true 🐞 Bug ✓ Correctness
Description
IS_OPENSHIFT now defaults to true, but AKS/EKS jobs do not override it to false, which makes
Kubernetes jobs take OpenShift-only branches (e.g., applying route.openshift.io/v1 resources) and
fail on non-OpenShift clusters.
Code

.ci/pipelines/env_variables.sh[R173-175]

+IS_OPENSHIFT="${IS_OPENSHIFT:-true}"
+CONTAINER_PLATFORM="${CONTAINER_PLATFORM:-unknown}"
+CONTAINER_PLATFORM_VERSION="${CONTAINER_PLATFORM_VERSION:-unknown}"
Evidence
env_variables.sh now sets IS_OPENSHIFT to true by default. In AKS jobs, there is no
IS_OPENSHIFT=false override (see aks-helm.sh/aks-operator.sh). The function apply_yaml_files
uses IS_OPENSHIFT to choose between Kubernetes Ingress vs OpenShift Route, and when IS_OPENSHIFT
is not empty and not false, it applies topology-test-route.yaml, which is an OpenShift Route
(apiVersion: route.openshift.io/v1) and will not exist on vanilla Kubernetes. Previously-removed
detect_ocp() was responsible for populating this value based on cluster capabilities, so this
default becomes the effective value for AKS/EKS.

.ci/pipelines/env_variables.sh[171-175]
.ci/pipelines/jobs/aks-helm.sh[1-25]
.ci/pipelines/jobs/aks-operator.sh[1-25]
.ci/pipelines/utils.sh[422-504]
.ci/pipelines/resources/topology_test/topology-test-route.yaml[1-14]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
`IS_OPENSHIFT` now defaults to `true`, which is incorrect for AKS/EKS/GKE Kubernetes clusters unless explicitly overridden. This causes Kubernetes jobs to take OpenShift-only branches.

### Issue Context
AKS/EKS jobs don't set `IS_OPENSHIFT=false`. The codebase uses `IS_OPENSHIFT` to pick OpenShift Route vs Kubernetes Ingress.

### Fix Focus Areas
- .ci/pipelines/env_variables.sh[173-175]
- .ci/pipelines/jobs/aks-helm.sh[1-25]
- .ci/pipelines/jobs/aks-operator.sh[1-25]
- .ci/pipelines/jobs/eks-helm.sh[14-26]
- .ci/pipelines/jobs/eks-operator.sh[14-26]

### Suggested change
Either:
- revert `IS_OPENSHIFT` default to empty and restore detection logic, or
- explicitly set `IS_OPENSHIFT=false` at the start of all AKS/EKS/GKE Kubernetes jobs (before any branching), keeping `true` only for OCP jobs.
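The second option above amounts to pinning the variable per job type before any branching. A sketch, with the Route/Ingress branch mirroring the `apply_yaml_files` behavior described in the Evidence (the echo lines and the Ingress counterpart are placeholders):

```shell
# Top of aks-helm.sh / eks-helm.sh / gke-helm.sh (and the operator variants):
export IS_OPENSHIFT=false

# utils.sh can then branch safely; this stand-in mirrors the described logic.
apply_topology_route_or_ingress() {
  if [[ "${IS_OPENSHIFT:-false}" == "true" ]]; then
    echo "applying OpenShift Route"       # topology-test-route.yaml
  else
    echo "applying Kubernetes Ingress"    # hypothetical Ingress manifest
  fi
}
```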




Remediation recommended

3. Hardcoded EKS kubeconfig path 🐞 Bug ⛯ Reliability
Description
get_cluster_aws_region forces KUBECONFIG to ${SHARED_DIR}/kubeconfig, ignoring the
environment-provided KUBECONFIG and potentially breaking region detection when that file is
missing or different, which then breaks certificate/DNS operations.
Code

.ci/pipelines/cluster/eks/aws.sh[328]

+  cluster_arn=$(KUBECONFIG="${SHARED_DIR}/kubeconfig" kubectl config view --minify -o jsonpath='{.clusters[0].cluster.server}' 2> /dev/null)
Evidence
aws.sh states that KUBECONFIG is provided by the test environment, but get_cluster_aws_region
now overrides it to ${SHARED_DIR}/kubeconfig for the kubectl config view call. If CI provides
kubeconfig via KUBECONFIG env var (common pattern) and does not also place a copy at
${SHARED_DIR}/kubeconfig, the function returns failure and get_eks_certificate aborts early (it
treats any non-zero result as fatal). There are no other repo references indicating
${SHARED_DIR}/kubeconfig is created/populated for EKS.

.ci/pipelines/cluster/eks/aws.sh[3-6]
.ci/pipelines/cluster/eks/aws.sh[324-339]
.ci/pipelines/cluster/eks/aws.sh[122-147]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
`get_cluster_aws_region` hardcodes `KUBECONFIG=${SHARED_DIR}/kubeconfig`, which may not exist and may diverge from the kubeconfig actually used by the job.

### Issue Context
The file itself documents that kubeconfig is provided by the test environment; overriding it introduces an extra assumption.

### Fix Focus Areas
- .ci/pipelines/cluster/eks/aws.sh[324-339]

### Suggested change
Use the current `kubectl` context (no override) or conditionally use `${SHARED_DIR}/kubeconfig` only if it exists, e.g.:
- if `${SHARED_DIR}/kubeconfig` exists, use it;
- else use the existing `KUBECONFIG` (or default) without overriding.
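The conditional suggested above can be factored into a small helper; the helper name and the final fallback to `~/.kube/config` are ours, not from the repo:

```shell
# Prefer ${SHARED_DIR}/kubeconfig only when it actually exists; otherwise
# respect the environment-provided KUBECONFIG (or kubectl's default path).
pick_kubeconfig() {
  if [[ -n "${SHARED_DIR:-}" && -f "${SHARED_DIR}/kubeconfig" ]]; then
    echo "${SHARED_DIR}/kubeconfig"
  else
    echo "${KUBECONFIG:-${HOME}/.kube/config}"
  fi
}

# Hypothetical usage inside get_cluster_aws_region:
#   cluster_arn=$(KUBECONFIG="$(pick_kubeconfig)" kubectl config view --minify \
#     -o jsonpath='{.clusters[0].cluster.server}' 2> /dev/null)
```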




@zdrapela
Member Author

/test e2e-ocp-helm
/test e2e-eks-helm-nightly
/test e2e-aks-operator-nightly

@github-actions
Contributor

The container image build workflow finished with status: failure.

@zdrapela
Member Author

/retest

@github-actions
Contributor

Image was built and published successfully. It is available at:

@github-actions
Contributor

The container image build and publish workflows were skipped (either due to [skip-build] tag or no relevant changes with existing image).

@zdrapela
Member Author

/retest

@github-actions
Contributor

Image was built and published successfully. It is available at:

@zdrapela
Member Author

/retest

@github-actions
Contributor

The container image build and publish workflows were skipped (either due to [skip-build] tag or no relevant changes with existing image).

@zdrapela zdrapela force-pushed the cherry-pick-cb44f17-release-1.8 branch from c49189d to 8cd959c on March 27, 2026 08:46
@github-actions
Contributor

The container image build and publish workflows were skipped (either due to [skip-build] tag or no relevant changes with existing image).

The psql command in the create-sonataflow-db-manual Job used
`&& echo ok || echo fail`, which always exits 0, masking real
errors like password authentication failures. The Job was marked
Complete by Kubernetes even when psql failed, causing the downstream
jobs-service rollout to time out.

Now capture psql output and only treat "already exists" as benign;
all other failures (auth errors, connection refused, etc.) exit 1
so the Job correctly reports failure and the pipeline can detect it.
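The commit message above describes the fix only in prose; a minimal sketch of that pattern, with the connection flags, database name, and helper name as placeholders, might look like:

```shell
# Capture psql output and treat only "already exists" as benign; any other
# failure (auth error, connection refused, ...) exits non-zero so the
# Kubernetes Job is correctly marked failed. Details here are assumed.
create_db_or_fail() {
  local output
  if ! output=$(psql -h "${DB_HOST}" -U "${DB_USER}" \
        -c "CREATE DATABASE sonataflow" 2>&1); then
    if echo "${output}" | grep -q "already exists"; then
      echo "database already exists, continuing"
      return 0
    fi
    echo "psql failed: ${output}" >&2
    return 1
  fi
}
```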
Two changes to address flaky audit-log tests:

1. Increase --tail from 100 to 500: the 100-line window was too small
   and target log lines were getting pushed out by concurrent test
   activity (permission evaluations, catalog reads) from other spec
   files running in parallel workers.

2. Add 2s delay before first log fetch: gives the backend time to
   flush the audit log entry to pod stdout before oc logs is called,
   eliminating the race between API response and log availability.
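Expressed as shell, the two mitigations reduce to a delay plus a wider log window; the pod variable and function name below are placeholders, while `--tail=500` and the 2s delay come from the commit message:

```shell
# Sketch of the flake mitigations described above (not the actual test code).
fetch_audit_logs() {
  sleep 2                                   # let the backend flush to stdout
  oc logs "pod/${BACKEND_POD}" --tail=500   # was --tail=100: too small a window
}
```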
@zdrapela zdrapela force-pushed the cherry-pick-cb44f17-release-1.8 branch from 04c6519 to 74cd93a Compare March 27, 2026 12:56
@github-actions
Contributor

The container image build workflow finished with status: cancelled.

@github-actions
Contributor

Image was built and published successfully. It is available at:

@github-actions
Contributor

Image was built and published successfully. It is available at:

@github-actions
Contributor

Image was built and published successfully. It is available at:

@zdrapela
Member Author

/retest

1 similar comment
@zdrapela
Member Author

/retest

@openshift-ci

openshift-ci bot commented Mar 31, 2026

@zdrapela: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-gke-helm-nightly | 1aabedf | link | false | /test e2e-gke-helm-nightly |
| ci/prow/e2e-aks-helm-nightly | 1aabedf | link | false | /test e2e-aks-helm-nightly |
| ci/prow/e2e-eks-helm-nightly | 52c0d0e | link | false | /test e2e-eks-helm-nightly |
| ci/prow/e2e-aks-operator-nightly | 52c0d0e | link | false | /test e2e-aks-operator-nightly |
| ci/prow/e2e-ocp-helm | 2d4e404 | link | true | /test e2e-ocp-helm |

Full PR test history. Your PR dashboard.

