
fix(provisioner): expose CRI-O bundled runtimes on PATH for AL2023 #824

Merged
ArangoGutierrez merged 1 commit into NVIDIA:main from ArangoGutierrez:fix/al2023-crio-runtime-symlinks on May 26, 2026

@ArangoGutierrez
Collaborator

Problem

The rpm-al2023 E2E job has failed on every merge to main since 2026-05-25 12:15Z: 4 consecutive CI runs, each on a different commit. The first failing run was on the docs-only #819, which shows this is not a code regression introduced by holodeck.

CI log (run #26402728532, job e2e-test / E2E Test (rpm-al2023)):

+ sudo nvidia-ctk runtime configure --runtime=crio --set-as-default --enable-cdi=false
time="2026-05-25T14:02:55Z" level=warning msg="Could not infer options from runtimes [runc crun]"
time="2026-05-25T14:02:55Z" level=info msg="Wrote updated config to /etc/crio/crio.conf.d/99-nvidia.toml"
time="2026-05-25T14:02:55Z" level=info msg="It is recommended that crio daemon be restarted."
+ sudo systemctl restart crio
Job for crio.service failed because the control process exited with error code.
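
The log above shows the warning arriving one step before the failure, so a provisioning script could stop at the warning instead of at the crio restart. A minimal sketch, assuming the warning text stays as logged here; the function name `ctk_output_ok` is hypothetical and not part of holodeck or nvidia-ctk:

```shell
# Hypothetical helper: reads nvidia-ctk output on stdin and returns
# non-zero when the "Could not infer options" warning is present.
ctk_output_ok() {
  # -F matches the text literally; -q suppresses output.
  ! grep -qF 'Could not infer options from runtimes'
}
```

It could be used as `sudo nvidia-ctk runtime configure --runtime=crio --set-as-default --enable-cdi=false 2>&1 | ctk_output_ok`, with the restart skipped when the check fails.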

Root cause

nvidia-container-toolkit 1.19.1 changed nvidia-ctk runtime configure --runtime=crio to require pre-existing runc/crun entries in crio's config in order to preserve them when writing /etc/crio/crio.conf.d/99-nvidia.toml. The openSUSE CRI-O RPM on AL2023 bundles its OCI runtimes under /usr/libexec/crio/ rather than installing them on PATH, so nvidia-ctk cannot discover them and writes a drop-in that omits the runtime tables. The subsequent systemctl restart crio then fails because crio has no usable runtimes.

RHEL/Fedora and Debian are unaffected because their CRI-O packages install runtimes on PATH to begin with.
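
To confirm the root cause on a given host, one can check whether each binary resolves on PATH or exists only in the bundled directory. A rough sketch, using POSIX `[ ]` tests so it runs under any sh; the function name `locate_runtimes` is hypothetical:

```shell
# Report where each OCI runtime helper can be found.
# $1: bundled directory (this PR names /usr/libexec/crio)
# remaining arguments: binary names to look up
locate_runtimes() (
  bundled_dir=$1
  shift
  for name in "$@"; do
    if command -v "$name" >/dev/null 2>&1; then
      echo "$name: on PATH"
    elif [ -x "$bundled_dir/$name" ]; then
      echo "$name: bundled only ($bundled_dir/$name)"
    else
      echo "$name: missing"
    fi
  done
)
```

On an affected AL2023 host, `locate_runtimes /usr/libexec/crio runc crun conmon pinns` would be expected to report the binaries as bundled only, assuming nothing else has installed them on PATH.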

Fix

Symlink the bundled binaries (runc, crun, conmon, pinns) from /usr/libexec/crio/ into /usr/bin/ in the AL2023 branch of pkg/provisioner/templates/crio.go, after cri-o is installed but before nvidia-container-toolkit runs. Symlink creation is idempotent: each link is guarded by [[ -x src && ! -e dst ]], so it is created only when the source is executable and the destination does not already exist.
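
The guarded symlink step can be sketched roughly as follows. This is not the template text from crio.go; it is an approximation that takes the source and destination directories as arguments and uses the POSIX `[ ]` form of the guard so it can be exercised outside bash. The function name `link_bundled_runtimes` is hypothetical:

```shell
# Symlink bundled CRI-O helper binaries into a directory on PATH.
# $1: source dir (this PR uses /usr/libexec/crio)
# $2: destination dir (this PR uses /usr/bin)
link_bundled_runtimes() (
  src_dir=$1
  dst_dir=$2
  for bin in runc crun conmon pinns; do
    src=$src_dir/$bin
    dst=$dst_dir/$bin
    # Guard from the PR description: link only when the source is
    # executable and nothing already exists at the destination.
    if [ -x "$src" ] && [ ! -e "$dst" ]; then
      ln -s "$src" "$dst"
    fi
  done
)
```

Run with root privileges as `link_bundled_runtimes /usr/libexec/crio /usr/bin`, a second invocation does nothing, and a binary already present in the destination is left untouched.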

Testing

  • Unit: TestCriO_Execute_PackageTemplate_OSFamilyBranching extended to assert the AL2023 branch installs symlinks covering at least runc and crun.
  • Full go test ./pkg/... ./internal/...: green locally.
  • E2E: this PR will exercise rpm-al2023 (and the rest of the matrix) in CI.

Scope

Targeted, AL2023-only, 25-line diff. No behavioral change for any other OS family. Required for cutting v0.3.4.

The opensuse CRI-O RPM on Amazon Linux 2023 bundles its OCI runtimes
(runc, crun, conmon, pinns) under /usr/libexec/crio/ rather than on
PATH. nvidia-container-toolkit 1.19.1 changed 'nvidia-ctk runtime
configure --runtime=crio' to require pre-existing runc/crun entries
in crio's config to preserve them when writing the
/etc/crio/crio.conf.d/99-nvidia.toml drop-in. Without those entries,
nvidia-ctk logs:

  level=warning msg="Could not infer options from runtimes [runc crun]"

and writes a drop-in that omits the runtime tables. The subsequent
'systemctl restart crio' then fails because crio has no usable
runtimes, breaking every AL2023 E2E job (rpm-al2023).

Symlinking the bundled runtimes into /usr/bin/ in the AL2023 branch
of the CRI-O install template lets nvidia-ctk discover them and
preserve them in the merged config. RHEL and Debian remain
unaffected because their CRI-O packages install runtimes on PATH
to begin with.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
@coveralls

Coverage Report for CI Build 26411556733

Coverage remained the same at 48.745%

Details

  • Coverage remained the same as the base build.
  • Patch coverage: No coverable lines changed in this PR.
  • No coverage regressions found.

Uncovered Changes

No uncovered changes found.

Coverage Regressions

No coverage regressions found.


Coverage Stats

Relevant Lines: 11558
Covered Lines: 5634
Line Coverage: 48.75%
Coverage Strength: 0.54 hits per line

💛 - Coveralls

@ArangoGutierrez ArangoGutierrez marked this pull request as ready for review May 25, 2026 19:17
@ArangoGutierrez ArangoGutierrez enabled auto-merge (squash) May 26, 2026 08:44
@ArangoGutierrez
Collaborator Author

Friendly bump @mchmarny — this PR fixes the AL2023 crio.service regression that's been breaking every merge to main since 2026-05-25 (~4 consecutive red runs on rpm-al2023). Auto-merge (squash) is armed; just needs your nod. Required to unblock cutting v0.3.4. The diff is 25 lines and the root cause is documented in the PR description.

@ArangoGutierrez ArangoGutierrez merged commit 483116e into NVIDIA:main May 26, 2026
18 checks passed