fix: EBS CSI driver crashes on AL2023 — missing IRSA role on addon#3213
Open
asmacdo wants to merge 1 commit into nebari-dev:main
The IRSA IAM role for the EBS CSI driver was created in nebari-dev#3166 but never connected to the addon via service_account_role_arn. Without this, the controller pods fall back to IMDS for credentials, which fails with IMDSv2 (hop limit 1), causing CrashLoopBackOff.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
asmacdo (Contributor, Author):
@Adam-D-Lewis maybe you'd like to have a look also?
asmacdo (Contributor, Author):
One reason this was tricky to debug: our working prod cluster also has no IRSA on the EBS CSI addon, and it works fine on AL2. So the missing IRSA looked "normal" when comparing (I assume there's a node policy fallback?). It's unclear exactly why that stops working on latest main + AL2023, but the IRSA wiring is what seems to be intended by #3166, and adding it to the EBS CSI addon did fix the issue.
asmacdo (Contributor, Author):
I've successfully finished upgrading the production cluster with this patch. I had a number of issues, mostly related to AZs since the main node group was destroyed. But I can confirm this works :)
This was referenced Mar 27, 2026
Summary
The EBS CSI addon was created without `service_account_role_arn` in #3166, leaving the controller without AWS credentials (no IRSA, no IMDS fallback). This caused the controller to crashloop with `no EC2 IMDS role found`, preventing all PVC provisioning. One-line fix: wire the existing IRSA role to the addon.

There is no viable manual workaround: attaching the role via `aws eks update-addon` creates drift that causes nebari to fail with `Cross-account pass role is not allowed` on the next deploy.

After applying this fix (and correcting various drift from manual debugging), the staging cluster was fully recovered. Keycloak data was lost due to PVC recreation during recovery, but that's a non-issue on staging.
cc @viniciusdc
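The "addon missing IRSA" condition from the summary can be checked with a small script. This is only a sketch: it assumes the JSON printed by `aws eks describe-addon` has an `addon` object whose `serviceAccountRoleArn` key carries the attached role, and it treats a missing or empty key as "no role". The helper name `addon_irsa_role`, the sample payloads, the account ID, and the role name are all hypothetical and are not taken from this PR.

```python
import json


def addon_irsa_role(describe_addon_output: str):
    """Return the IRSA role ARN attached to an EKS addon, or None.

    Expects the JSON text printed by `aws eks describe-addon`.
    A missing or empty `serviceAccountRoleArn` is treated as "no role".
    """
    addon = json.loads(describe_addon_output).get("addon", {})
    return addon.get("serviceAccountRoleArn") or None


# Hypothetical, trimmed-down responses for illustration only.
broken = json.dumps(
    {"addon": {"addonName": "aws-ebs-csi-driver", "status": "ACTIVE"}}
)
fixed = json.dumps(
    {
        "addon": {
            "addonName": "aws-ebs-csi-driver",
            "status": "ACTIVE",
            "serviceAccountRoleArn": (
                "arn:aws:iam::111122223333:role/example-ebs-csi-irsa"
            ),
        }
    }
)

for label, payload in (("before fix", broken), ("after fix", fixed)):
    role = addon_irsa_role(payload)
    print(f"{label}: {role if role else 'no IRSA role attached'}")
```

In practice the input would come from `aws eks describe-addon --cluster-name <cluster> --addon-name aws-ebs-csi-driver`, with `<cluster>` replaced by the real cluster name. The script only reads the addon; it does not attach a role, since the summary notes that manual attachment causes drift.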
Debug logs
EBS CSI controller had no AWS creds
CSI controller crashlooping
PVCs stuck waiting for provisioner
Addon missing IRSA
Manual role attach causes drift on next deploy
Controller recovers, but next `nebari deploy` fails:

Test plan
🤖 Generated with Claude Code