LCORE-2323: In the CUDA lab, pin NVIDIA driver and kernel versions, improve error handling by syedriko · Pull Request #192 · lightspeed-core/rag-content

syedriko · 2026-05-21T17:19:35Z

Description

Exclude the kernel from 'dnf update' to avoid picking up a kernel that:

- is incompatible with the NVIDIA driver, or
- does not have a published kernel-devel package.

The kernel can change when the AMI is upgraded; that is when NVIDIA driver compatibility and the availability of kernel-devel will need to be re-evaluated.

Improved error handling and failure detection in the on-the-host script.
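
The kernel exclusion described above could look roughly like the following. This is a sketch under assumptions, not the PR's actual script: the helper names `kernel_build_pkgs` and `print_phase_a_plan` are hypothetical, the commands are printed rather than executed, and it assumes `kernel-devel` and `kernel-headers` packages matching the running kernel are published.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: names of the build packages matching one kernel release.
kernel_build_pkgs() {
    local kver="$1"
    printf '%s\n' "kernel-devel-${kver}" "kernel-headers-${kver}"
}

# Hypothetical helper: print (do not run) the update and install commands.
print_phase_a_plan() {
    local kver="$1"
    local pkgs
    # Join the package names onto one line, separated by spaces.
    pkgs="$(kernel_build_pkgs "${kver}" | tr '\n' ' ')"
    # 'kernel*' keeps every kernel package out of the update transaction.
    echo "dnf -y update --exclude='kernel*'"
    # The build packages are then installed explicitly for the pinned kernel.
    echo "dnf -y install ${pkgs% }"
}

print_phase_a_plan "$(uname -r)"
```

Because the `kernel*` glob also matches `kernel-devel` and `kernel-headers`, the sketch installs those two explicitly for the running kernel, so that DKMS has sources that match the kernel that stays in place.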

Type of change

Tools used to create PR

Identify any AI code assistants used in this PR (for transparency and review context)

Assisted-by: Cursor
Generated by: (e.g., tool name and version; N/A if not used)

Related Tickets & Documents

Related Issue #
Closes # https://redhat.atlassian.net/browse/LCORE-2323

Checklist before requesting a review

I have performed a self-review of my code.
PR has passed all pre-merge test jobs.
If it is a core feature, I have added thorough tests.

Testing

Please provide detailed steps to perform tests related to this code change.
How were the fix/results from this change verified? Please provide relevant screenshots or results.

Summary by CodeRabbit

New Features
- Enhanced installation process with improved kernel module validation and state management.
- Added explicit module presence checks before CDI config generation.
Bug Fixes
- Improved error handling for installation termination and state persistence.
Documentation
- Added architectural decision record documenting CUDA lab workflow kernel pinning approach and rationale.
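
The module presence check mentioned in the summary might be sketched as below. The function names `module_loaded` and `require_nvidia_module` are hypothetical, and the listing path is a parameter only so the check can be exercised on a machine without a GPU. Unlike the PR's `phase_c`, this sketch does not attempt `modprobe nvidia` itself; it only reports.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Return 0 if the named module appears in a /proc/modules-style listing.
# The first whitespace-separated field of each line is the module name.
module_loaded() {
    local name="$1" listing="${2:-/proc/modules}"
    awk -v m="${name}" '$1 == m { found = 1 } END { exit !found }' "${listing}"
}

# Hypothetical gate to run before CDI config generation.
require_nvidia_module() {
    local listing="${1:-/proc/modules}"
    if module_loaded nvidia "${listing}"; then
        echo "nvidia module loaded"
        return 0
    fi
    echo "nvidia module not loaded; try 'modprobe nvidia' and check 'dkms status'" >&2
    return 1
}
```

Matching on the exact first field means related modules such as `nvidia_uvm` do not satisfy the check on their own.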

coderabbitai · 2026-05-21T17:19:54Z

Warning

Rate limit exceeded

@syedriko has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 4 minutes and 55 seconds before requesting another review.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 041dd5d9-5765-4b97-bf83-96b4c844d166

📥 Commits

Reviewing files that changed from the base of the PR and between c487026 and 805489d.

📒 Files selected for processing (2)

scripts/cuda/README.md
scripts/cuda/install_cuda_rhel9_ec2.sh

Walkthrough

This PR improves CUDA lab EC2 installation reliability by pinning the kernel version, explicitly managing the NVIDIA driver version, and adding robust error handling and kernel module validation across the multi-phase installation workflow.

Changes

CUDA Installation Reliability: Kernel Pinning & Driver Version Management

| Layer / File(s) | Summary |
| --- | --- |
| Documentation & Driver Version Pin: `scripts/cuda/README.md`, `scripts/cuda/install_cuda_rhel9_ec2.sh` | Added ADR section explaining the decision to pin the kernel and manage driver version explicitly; bumped NVIDIA_TESLA_VERSION to 580.159.03. |
| Exit Trap Handler & Error Cleanup: `scripts/cuda/install_cuda_rhel9_ec2.sh` | Introduced `on_install_exit()` function as an EXIT-trap safety net to mark install state failed on non-zero exit; updated ERR-trap cleanup to also disable EXIT trap. |
| Phase A: Kernel Exclusion & DKMS Preparation: `scripts/cuda/install_cuda_rhel9_ec2.sh` | Updated `phase_a` to exclude kernel packages from `dnf update` and explicitly install kernel-devel/headers for the running kernel to ensure DKMS compatibility before reboot. |
| Phase B: Driver Installation & Module Validation: `scripts/cuda/install_cuda_rhel9_ec2.sh` | Updated `phase_b` to conditionally install kernel-devel/headers, download the NVIDIA driver local-repo RPM, build DKMS module, and verify successful module installation for the running kernel. |
| Phase C: Kernel Module Loading Verification: `scripts/cuda/install_cuda_rhel9_ec2.sh` | Added verification in `phase_c` to confirm the `nvidia` kernel module is loaded (attempting `modprobe nvidia` if needed) before CDI generation; fails with diagnostic guidance if module is absent. |
| State Machine Trap & Exit Management: `scripts/cuda/install_cuda_rhel9_ec2.sh` | Updated `advance_install` to install the EXIT trap at startup and manage ERR/EXIT trap removal consistently across all state branches to ensure correct persisted-state transitions across reboots. |
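
The EXIT-trap safety net from the table can be illustrated with a small sketch. Only the name `on_install_exit` comes from the walkthrough; the `STATE_FILE` variable, the state strings, and the `run_phase` wrapper are assumptions made for the example, and the PR's script may persist state quite differently.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical state file; defaults to a temp file so the demo is harmless.
STATE_FILE="${STATE_FILE:-$(mktemp)}"

# EXIT-trap safety net: record a failure whenever the shell exits non-zero.
on_install_exit() {
    local rc=$?
    if [ "${rc}" -ne 0 ]; then
        echo "failed" > "${STATE_FILE}"
    fi
}

# Hypothetical wrapper: run one phase in a subshell guarded by the trap.
run_phase() {
    (
        trap on_install_exit EXIT
        echo "running" > "${STATE_FILE}"
        # Exit explicitly so the trap fires even where 'set -e' is suppressed.
        "$@" || exit $?
        echo "done" > "${STATE_FILE}"
    )
}

run_phase true
echo "state after success: $(cat "${STATE_FILE}")"
run_phase false || true
echo "state after failure: $(cat "${STATE_FILE}")"
```

The explicit `|| exit $?` is a deliberate choice in this sketch: `set -e` is ignored inside a function called from an `if` condition or an `||` list, so relying on it alone could let a failed phase fall through and be recorded as done.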

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested reviewers

are-ces
tisnik

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 33.33% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately captures the main objectives: pinning NVIDIA driver and kernel versions, and improving error handling, all of which are reflected in the changeset modifications. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@scripts/cuda/install_cuda_rhel9_ec2.sh`:
- Around line 189-197: The DKMS gate in phase_b currently accepts either "built"
or "installed" by checking dkms_status with grep -qE 'installed|built'; change
the check to require only "installed" (e.g., match the literal "installed"
token) so that dkms_status="$(dkms status ...)" is considered OK only when the
module is actually installed into /lib/modules for ${kver}; update the related
log_install message (phase_b: DKMS OK) and the failure message to reflect that
only "installed" passes, referencing the dkms_status variable,
NVIDIA_TESLA_VERSION, kver, and the phase_b logic where the check occurs.
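
A stricter gate along the lines of this comment might look like the sketch below. The function name `dkms_installed_for` is hypothetical, and the sketch assumes `dkms status` prints one line per module and kernel ending in `: installed` or `: built`; the exact output format varies between DKMS versions, so the pattern may need adjusting.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Return 0 only if the status text reports "installed" for the given kernel.
# A line that merely says "built" does not pass.
dkms_installed_for() {
    local status_text="$1" kver="$2"
    awk -v k="${kver}" '
        index($0, k) && /: installed/ { ok = 1 }
        END { exit !ok }
    ' <<< "${status_text}"
}

# Example with sample text in the assumed format (not captured from a real host).
sample="nvidia/580.159.03, 5.14.0-1.el9.x86_64, x86_64: built"
if dkms_installed_for "${sample}" "5.14.0-1.el9.x86_64"; then
    echo "phase_b: DKMS OK (installed)"
else
    echo "phase_b: DKMS module not installed for this kernel"
fi
```

In the script itself, the status text would come from the existing `dkms_status` variable and the kernel release from `kver`, so the helper only replaces the `grep -qE 'installed|built'` test.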

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 156a0385-7d67-48d3-b818-8a93650f902e

📥 Commits

Reviewing files that changed from the base of the PR and between af7cb0f and c487026.

📒 Files selected for processing (2)

scripts/cuda/README.md
scripts/cuda/install_cuda_rhel9_ec2.sh

…mprove error handling

coderabbitai Bot reviewed May 21, 2026

Comment thread scripts/cuda/install_cuda_rhel9_ec2.sh Outdated

LCORE-2323: In the CUDA lab, pin NVIDIA driver and kernel versions, i…

805489d

…mprove error handling

syedriko force-pushed the syedriko-lcore-2323 branch from c487026 to 805489d Compare May 21, 2026 18:14

syedriko wants to merge 1 commit into lightspeed-core:main from syedriko:syedriko-lcore-2323
