fix: EBS CSI driver crashes on AL2023 — missing IRSA role on addon#3213

Open
asmacdo wants to merge 1 commit into nebari-dev:main from asmacdo:fix/ebs-csi-irsa-wiring

@asmacdo asmacdo commented Mar 26, 2026

Summary

The EBS CSI addon was created in #3166 without service_account_role_arn, leaving the controller without AWS credentials (no IRSA, no IMDS fallback). The controller then crashlooped with "no EC2 IMDS role found", which blocked all PVC provisioning. The fix is one line: wire the existing IRSA role to the addon.

There is no viable manual workaround: attaching the role via aws eks update-addon creates drift, and the next nebari deploy then fails with "Cross-account pass role is not allowed".

After applying this fix (and correcting various drift left over from manual debugging), the staging cluster fully recovered. Keycloak data was lost because PVCs were recreated during recovery, but that's a non-issue on staging.

cc @viniciusdc

Debug logs

EBS CSI controller had no AWS creds

kubectl logs -n kube-system ebs-csi-controller-<pod> --all-containers --tail=50
Failed health check (verify network connection and IAM credentials):
dry-run EC2 API call failed:
operation error EC2: DescribeAvailabilityZones,
get identity: get credentials:
failed to refresh cached credentials,
no EC2 IMDS role found
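
To make this signature easier to spot when triaging, here is a minimal sketch of a filter that flags controller logs showing the credential failure above. The function name has_credential_failure is hypothetical, and the two matched strings are taken directly from the log excerpt. It is only a heuristic, not an exhaustive list of credential errors.

```shell
#!/usr/bin/env bash
# Sketch only: has_credential_failure is a hypothetical helper name.
# Reads controller logs on stdin and exits 0 if they contain either of the
# credential-failure messages quoted in the log excerpt above.
has_credential_failure() {
  grep -qE 'no EC2 IMDS role found|failed to refresh cached credentials'
}

# Example usage (needs cluster access, so it is left commented out):
#   kubectl logs -n kube-system ebs-csi-controller-<pod> --all-containers --tail=50 \
#     | has_credential_failure && echo "controller has no AWS credentials"
```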

CSI controller crashlooping

kubectl get pods -n kube-system | grep ebs
ebs-csi-controller-xxxxx   1/6   CrashLoopBackOff

PVCs stuck waiting for provisioner

kubectl describe pvc <pvc-name>
Waiting for a volume to be created either by the external provisioner 'ebs.csi.aws.com'

Addon missing IRSA

aws eks describe-addon --cluster-name dandi-hub-staging --addon-name aws-ebs-csi-driver
"status": "DEGRADED"
// no serviceAccountRoleArn present
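
The same check can be scripted rather than read by eye. This is a rough sketch, assuming the AWS CLI's global --query and --output text options, where a missing field is printed as "None". The helper names irsa_state and addon_irsa_state are hypothetical, and the ARN pattern is a loose sanity check rather than a full validator.

```shell
#!/usr/bin/env bash
# Sketch only: irsa_state and addon_irsa_state are hypothetical helper names.

# Classifies the value printed by
#   aws eks describe-addon ... --query 'addon.serviceAccountRoleArn' --output text
# The CLI prints "None" in text mode when the queried field is absent.
irsa_state() {
  local arn="$1"
  case "$arn" in
    ""|None|null) echo "missing" ;;
    arn:aws*:iam::*:role/*) echo "present" ;;
    *) echo "unexpected" ;;
  esac
}

# Thin wrapper around the AWS CLI; takes the cluster and addon names.
addon_irsa_state() {
  local cluster="$1" addon="$2" arn
  arn=$(aws eks describe-addon \
    --cluster-name "$cluster" \
    --addon-name "$addon" \
    --query 'addon.serviceAccountRoleArn' \
    --output text) || return 1
  irsa_state "$arn"
}

# Example usage (needs AWS credentials, so it is left commented out):
#   addon_irsa_state dandi-hub-staging aws-ebs-csi-driver
```

Keeping the classification separate from the CLI call means the logic can be exercised without cluster access.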

Manual role attach causes drift on next deploy

aws eks update-addon --service-account-role-arn arn:aws:iam::<acct>:role/...

Controller recovers, but next nebari deploy fails:

Error: updating EKS Add-On: AccessDeniedException: Cross-account pass role is not allowed.

Test plan

  • Tested against live AWS EKS deployment upgrading from AL2 to AL2023
  • Confirmed removing and re-adding IRSA reproduces and fixes the issue deterministically
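
For anyone repeating the test plan, one possible recovery check is sketched below. It assumes the controller pods are named ebs-csi-controller-*, as in the kubectl output above, and treats a pod as healthy only when every container is ready and the status is Running. The function name ebs_csi_controller_healthy is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: ebs_csi_controller_healthy is a hypothetical helper name.
# Reads `kubectl get pods -n kube-system --no-headers` output on stdin.
# Exits 0 only if at least one ebs-csi-controller pod is listed and every
# such pod shows READY n/n with STATUS Running.
ebs_csi_controller_healthy() {
  awk '
    $1 ~ /^ebs-csi-controller-/ {
      seen = 1
      split($2, ready, "/")
      if (ready[1] != ready[2] || $3 != "Running") bad = 1
    }
    END { exit (seen && !bad) ? 0 : 1 }
  '
}

# Example usage (needs cluster access, so it is left commented out):
#   kubectl get pods -n kube-system --no-headers | ebs_csi_controller_healthy \
#     && echo "EBS CSI controller recovered"
```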

🤖 Generated with Claude Code

The IRSA IAM role for the EBS CSI driver was created in nebari-dev#3166 but never
connected to the addon via service_account_role_arn. Without this, the
controller pods fall back to IMDS for credentials, which fails with
IMDSv2 (hop limit 1), causing CrashLoopBackOff.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@asmacdo asmacdo requested a review from a team as a code owner March 26, 2026 23:20
@asmacdo asmacdo requested review from dcmcand and marcelovilla and removed request for a team March 26, 2026 23:20

asmacdo commented Mar 26, 2026

@Adam-D-Lewis maybe you'd like to have a look also?

asmacdo commented Mar 27, 2026

One reason this was tricky to debug: our working prod cluster also has no IRSA on the EBS CSI addon, and it works fine on AL2. So the missing IRSA looked "normal" when comparing (I assume there's a node policy fallback?). It's unclear exactly why that stops working on latest main + AL2023, but the IRSA wiring is what seems to be intended by #3166, and adding it to the EBS CSI addon did fix the issue.

asmacdo commented Mar 27, 2026

I've successfully finished upgrading the production cluster with this patch. I had a number of issues, mostly related to AZs since the main node group was destroyed. But I can confirm this works :)
