
feat: onboard NVIDIA GPU support for ACL#8112

Open
henryli001 wants to merge 1 commit into main from
lihl/acl-gpu-aks

Conversation


henryli001 commented Mar 17, 2026

What this PR does / why we need it:

This PR onboards the managed GPU provisioning flow for Azure Container Linux on AKS

Which issue(s) this PR fixes:

Fixes #

Copilot AI review requested due to automatic review settings March 17, 2026 23:46

Copilot AI left a comment


Pull request overview

This PR adds NVIDIA GPU enablement for ACL (Azure Container Linux) by installing GPU components via systemd sysexts during node provisioning and extending validation through E2E and VHD build-time adjustments.

Changes:

  • Add ACL-specific GPU sysext install flow (driver/toolkit/fabric-manager) and shared GPU helper functions in CSE scripts.
  • Update VHD build/test scripts for ACL specifics (chrony configuration, udev rules, VHD content test behavior).
  • Add ACL GPU E2E scenarios and adjust supporting infra (firewall rules), plus update some build pipeline/packer configuration.
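The sysext-based install flow in the first bullet can be sketched roughly as follows. This is a hedged illustration only: the directory layout, helper name, and version are stand-ins, not the PR's actual code (the real flow pulls .raw images from MCR with oras and activates them via systemd-sysext merge):

```shell
#!/usr/bin/env bash
set -euo pipefail

# Illustrative stand-in for the sysext staging step. The real flow would
# `oras pull` a .raw extension image from MCR into /var/lib/extensions and
# then run `systemd-sysext merge`. A temp dir keeps this sketch runnable.
SYSEXT_DIR="${SYSEXT_DIR:-$(mktemp -d)/extensions}"

stage_sysext() {
    local name="$1" version="$2"
    mkdir -p "${SYSEXT_DIR}"
    # Placeholder for: oras pull "<registry>/${name}:${version}" ...
    touch "${SYSEXT_DIR}/${name}.raw"
    echo "staged ${SYSEXT_DIR}/${name}.raw (version ${version})"
}

stage_sysext "nvidia-driver" "550.90.07"
# The real flow would follow with:  systemd-sysext merge
```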

Reviewed changes

Copilot reviewed 21 out of 88 changed files in this pull request and generated 6 comments.

File Description
vhdbuilder/scripts/linux/flatcar/tool_installs_flatcar.sh Adds ACL-only chrony configuration using an /etc config + systemd drop-in.
vhdbuilder/packer/vhd-image-builder-flatcar-arm64.json Changes waagent invocation during deprovision in Flatcar ARM64 build.
vhdbuilder/packer/vhd-image-builder-acl.json Alters ACL source SIG image configuration (now hardcoded).
vhdbuilder/packer/test/linux-vhd-content-test.sh Adds an ACL/Flatcar-specific skip in umask validation and passes OS_SKU into the test.
vhdbuilder/packer/pre-install-dependencies.sh Installs Azure disk udev rules for ACL when missing.
spec/parts/linux/cloud-init/artifacts/cse_install_mariner_spec.sh Includes shared cse_install.sh in Mariner ShellSpec tests (for moved GPU helpers).
pkg/agent/testdata/CustomizedImage/CustomData Updates generated CustomData snapshot content.
parts/linux/cloud-init/artifacts/mariner/cse_install_mariner.sh Removes GPU helper functions (moved to shared cse_install.sh).
parts/linux/cloud-init/artifacts/cse_main.sh Adds ACL path to install NVIDIA Fabric Manager via sysext.
parts/linux/cloud-init/artifacts/cse_install.sh Adds shared GPU helper functions (open-driver selection + persistence mode unit).
parts/linux/cloud-init/artifacts/cse_helpers.sh Adds a new error code for missing VERSION_ID used in sysext resolution.
parts/linux/cloud-init/artifacts/cse_config.sh Adds ACL GPU driver/toolkit installation via sysext and GRID licensing retry logic.
parts/linux/cloud-init/artifacts/acl/cse_install_acl.sh Extends sysext matching and adds ACL GPU sysext install functions based on VERSION_ID.
e2e/scenario_test.go Adds ACL GPU E2E scenarios for NC(v3), H100(v5), and A10 GRID.
e2e/config/config.go Changes default E2E settings (location, KEEP_VMSS, and default TAGS_TO_RUN).
e2e/aks_model.go Adds firewall allow rules for ACR + blob redirect domains needed for sysext pulls.
.pipelines/templates/.builder-release-template.yaml Comments out the “Test, Scan, and Cleanup” pipeline step.
.pipelines/.vsts-vhd-builder-release.yaml Updates ACL build variables for source image name/version.

Copilot AI review requested due to automatic review settings March 18, 2026 02:15

Copilot AI left a comment


Pull request overview

This PR adds ACL (Azure Container Linux) GPU enablement by installing NVIDIA drivers/toolkit via systemd sysexts, wiring Fabric Manager support for ACL, and adding E2E coverage for ACL GPU VM sizes.

Changes:

  • Add ACL GPU driver + NVIDIA container toolkit sysext install flow (including Fabric Manager sysext and GRID license handling).
  • Adjust VHD build/provisioning scripts for ACL (chrony config, udev rule bootstrap) and add ACL GPU E2E scenarios.
  • Update pipeline/test defaults and VHD builder configs to use ACL preview sources (and other build/test behavior changes).

Reviewed changes

Copilot reviewed 21 out of 88 changed files in this pull request and generated 7 comments.

File Description
vhdbuilder/scripts/linux/flatcar/tool_installs_flatcar.sh Implements ACL-only chrony configuration via systemd drop-in + /etc config.
vhdbuilder/packer/vhd-image-builder-flatcar-arm64.json Adjusts waagent deprovision command.
vhdbuilder/packer/vhd-image-builder-acl.json Changes source SIG image selection configuration.
vhdbuilder/packer/test/linux-vhd-content-test.sh Skips umask validation for ACL (currently keyed off Flatcar).
vhdbuilder/packer/pre-install-dependencies.sh Adds ACL udev rule bootstrap before starting disk_queue.
spec/parts/linux/cloud-init/artifacts/cse_install_mariner_spec.sh Includes shared cse_install.sh to pick up moved GPU helper functions.
pkg/agent/testdata/CustomizedImage/CustomData Regenerates embedded testdata payload.
parts/linux/cloud-init/artifacts/mariner/cse_install_mariner.sh Removes GPU helper functions moved into shared cse_install.sh.
parts/linux/cloud-init/artifacts/cse_main.sh Calls ACL Fabric Manager sysext install when needed.
parts/linux/cloud-init/artifacts/cse_install.sh Adds shared GPU helper functions (open-vs-proprietary selection + persistence daemon unit).
parts/linux/cloud-init/artifacts/cse_helpers.sh Adds a new error code for missing VERSION_ID.
parts/linux/cloud-init/artifacts/cse_config.sh Adds ACL GPU sysext install path + GRID license retry logic.
parts/linux/cloud-init/artifacts/acl/cse_install_acl.sh Adds ACL GPU sysext pulling/tag resolution logic and Fabric Manager sysext install function.
e2e/scenario_test.go Adds ACL GPU E2E scenarios (NCv3, H100, A10/GRID).
e2e/config/config.go Changes local-run defaults (location, KeepVMSS, TagsToRun).
e2e/aks_model.go Allows firewall egress to *.azurecr.io and *.blob.core.windows.net for sysext pulls.
.pipelines/templates/.builder-release-template.yaml Comments out “Test, Scan, and Cleanup” pipeline step.
.pipelines/.vsts-vhd-builder-release.yaml Updates ACL SIG source image name/version used by the build stage.


Copilot AI left a comment


Pull request overview

Adds ACL (Azure Container Linux) NVIDIA GPU enablement by switching ACL GPU driver installation to systemd sysexts (plus related provisioning updates), and extends validation via E2E scenarios and VHD build/test adjustments.

Changes:

  • Add ACL GPU driver/toolkit/fabric-manager installation via sysext pulls and hook into existing CSE GPU flow.
  • Add/adjust VHD build + validation logic for ACL/Flatcar (chrony config, udev rules, content tests).
  • Add E2E GPU scenarios for ACL (and an AzureLinux A10 GRID scenario), plus firewall allowances for sysext pulls.

Reviewed changes

Copilot reviewed 21 out of 88 changed files in this pull request and generated 8 comments.

File Description
vhdbuilder/scripts/linux/flatcar/tool_installs_flatcar.sh Implements ACL-specific chrony configuration via systemd drop-in.
vhdbuilder/packer/vhd-image-builder-flatcar-arm64.json Adjusts waagent deprovision command invocation path.
vhdbuilder/packer/vhd-image-builder-acl.json Updates ACL packer builder source SIG image configuration.
vhdbuilder/packer/test/linux-vhd-content-test.sh Adds ACL/Flatcar conditional logic to umask validation.
vhdbuilder/packer/pre-install-dependencies.sh Adds ACL-specific Azure disk udev rules injection.
spec/parts/linux/cloud-init/artifacts/cse_install_mariner_spec.sh Includes shared cse_install.sh in Mariner spec.
pkg/agent/testdata/CustomizedImage/CustomData Regenerated snapshot test data.
parts/linux/cloud-init/artifacts/mariner/cse_install_mariner.sh Moves shared NVIDIA helper funcs out to common install script.
parts/linux/cloud-init/artifacts/cse_main.sh Calls ACL fabric-manager sysext install path when needed.
parts/linux/cloud-init/artifacts/cse_install.sh Adds shared GPU helper functions (open-driver selection, persistenced).
parts/linux/cloud-init/artifacts/cse_helpers.sh Adds new error code for missing VERSION_ID for sysext tag resolution.
parts/linux/cloud-init/artifacts/cse_config.sh Adds ACL GPU driver/toolkit sysext flow + GRID service start logic.
parts/linux/cloud-init/artifacts/acl/cse_install_acl.sh Implements ACL GPU sysext pull + driver selection logic.
e2e/scenario_test.go Adds multiple ACL GPU E2E scenarios and an AzureLinux A10 GRID test.
e2e/config/config.go Changes several E2E defaults (location, keep resources, default tags).
e2e/aks_model.go Adds firewall allow rules for ACR + blob egress needed by sysext pulls.
.pipelines/templates/.builder-release-template.yaml Disables the “Test, Scan, and Cleanup” pipeline step.
.pipelines/.vsts-vhd-builder-release.yaml Updates ACL build stage source image name/version variables.
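The ACL-specific chrony configuration called out in the table rows above (an /etc config activated via a systemd drop-in) might look roughly like this sketch; the drop-in path and chronyd flags are assumptions, not the PR's actual content, and a temp root keeps it runnable:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Write a systemd drop-in that re-points chronyd at a config under /etc.
# A temp root is used so the sketch does not touch the host system.
root="$(mktemp -d)"
dropin_dir="${root}/etc/systemd/system/chronyd.service.d"
mkdir -p "${dropin_dir}"

cat > "${dropin_dir}/10-chrony.conf" <<'EOF'
[Service]
ExecStart=
ExecStart=/usr/sbin/chronyd -f /etc/chrony/chrony.conf
EOF

cat "${dropin_dir}/10-chrony.conf"
```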


Copilot AI left a comment


Pull request overview

This PR adds NVIDIA GPU enablement for Azure Container Linux (ACL) by extending the provisioning scripts to install GPU components via systemd sysexts, updating VHD build/test logic for ACL specifics, and adding/adjusting E2E coverage and build pipeline settings.

Changes:

  • Add ACL GPU driver/toolkit/fabric-manager installation via sysexts and wire it into GPU provisioning flow.
  • Update VHD build scripts/tests for ACL behaviors (chrony config, udev rules, umask test behavior).
  • Add/adjust E2E scenarios and network egress rules needed for ACL GPU sysext pulls; update pipeline inputs for ACL base image.

Reviewed changes

Copilot reviewed 21 out of 88 changed files in this pull request and generated 10 comments.

File Description
vhdbuilder/scripts/linux/flatcar/tool_installs_flatcar.sh Adds ACL-only chronyd configuration via /etc config + systemd drop-in.
vhdbuilder/packer/vhd-image-builder-flatcar-arm64.json Changes waagent invocation path for Flatcar ARM64 deprovision step.
vhdbuilder/packer/vhd-image-builder-acl.json Alters ACL packer builder source image configuration (currently hardcoded).
vhdbuilder/packer/test/linux-vhd-content-test.sh Adjusts testUmaskSettings to accept OS SKU and adds an ACL skip (currently too broad).
vhdbuilder/packer/pre-install-dependencies.sh Adds ACL udev rules installation for /dev/disk/azure/* symlinks.
spec/parts/linux/cloud-init/artifacts/cse_install_mariner_spec.sh Includes cse_install.sh in spec (to access moved shared GPU helper functions).
pkg/agent/testdata/CustomizedImage/CustomData Updates generated snapshot testdata (large binary diffs).
parts/linux/cloud-init/artifacts/mariner/cse_install_mariner.sh Moves shared GPU helper functions out of Mariner-specific install script.
parts/linux/cloud-init/artifacts/cse_main.sh Calls ACL fabric-manager sysext install when fabric manager is needed on ACL.
parts/linux/cloud-init/artifacts/cse_install.sh Adds shared GPU helper functions (should_use_nvidia_open_drivers, enableNvidiaPersistenceMode).
parts/linux/cloud-init/artifacts/cse_helpers.sh Adds new error code for missing VERSION_ID for sysext tag resolution.
parts/linux/cloud-init/artifacts/cse_config.sh Wires ACL GPU install flow (toolkit/driver sysext + persistence mode) and adjusts GRID service startup.
parts/linux/cloud-init/artifacts/acl/cse_install_acl.sh Implements ACL GPU sysext pulling and tag resolution using /etc/os-release VERSION_ID.
e2e/scenario_test.go Adds ACL GPU E2E scenarios (NC, A100, H100, A10/GRID) and AzureLinuxV3 A10 GRID scenario.
e2e/config/config.go Changes defaults for E2E location, VMSS retention, and tag filtering (currently risky defaults).
e2e/aks_model.go Adds firewall egress allowlist for *.azurecr.io and *.blob.core.windows.net for ACL GPU sysext pulls.
.pipelines/templates/.builder-release-template.yaml Comments out the shared post-build test/scan/cleanup step (currently disables validation).
.pipelines/.vsts-vhd-builder-release.yaml Updates ACL SIG source image name/version variables.

Copilot AI review requested due to automatic review settings March 18, 2026 22:08
henryli001 marked this pull request as ready for review March 18, 2026 22:12

djsly left a comment


not a fan of hardcoding mcr.microsoft.com

@@ -18,6 +18,7 @@ Describe 'cse_install_mariner.sh'
}
}
BeforeAll 'setup'
Include "./parts/linux/cloud-init/artifacts/cse_install.sh"


@Devinwong I would like to check with you on whether this change is needed.

I modified cse_install.sh in this PR to include some functions that will be called by both Azure Linux and ACL.

Copilot AI review requested due to automatic review settings March 19, 2026 22:16
@henryli001

not a fan of hardcoding mcr.microsoft.com

Pushed another change to replace the hardcoded string with MCR_REPOSITORY_BASE that is defined here
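A minimal sketch of that replacement, assuming MCR_REPOSITORY_BASE defaults to the public MCR host; the repository path below is hypothetical, made up for illustration:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Derive the sysext repository URL from a configurable base instead of a
# hardcoded registry host. The path segment is illustrative only.
MCR_REPOSITORY_BASE="${MCR_REPOSITORY_BASE:-mcr.microsoft.com}"
seURL="${MCR_REPOSITORY_BASE}/aks/packages/nvidia-driver-sysext"
echo "${seURL}"
```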


Copilot AI left a comment


Pull request overview

This PR onboards managed NVIDIA GPU provisioning for Azure Container Linux (ACL) on AKS by installing GPU drivers and related components via systemd-sysext, while consolidating shared GPU helper functions across AzureLinux/Mariner and ACL.

Changes:

  • Add ACL GPU provisioning via sysext (driver/toolkit/fabric-manager) and wire it into the main CSE GPU flow.
  • Move GPU helper functions into a shared install script used by multiple distros and update Mariner implementation accordingly.
  • Add/extend E2E scenarios covering ACL GPU SKUs (NC, A100, A10/GRID) and AzureLinux A10/GRID.

Reviewed changes

Copilot reviewed 18 out of 80 changed files in this pull request and generated 5 comments.

File Description
spec/parts/linux/cloud-init/artifacts/cse_install_mariner_spec.sh Includes shared cse_install.sh so Mariner spec tests can access moved GPU helper functions.
pkg/agent/testdata/CustomizedImage/CustomData Updates embedded cloud-init testdata blob, likely reflecting new/updated provisioning scripts.
parts/linux/cloud-init/artifacts/mariner/cse_install_mariner.sh Removes GPU helper functions now shared from cse_install.sh.
parts/linux/cloud-init/artifacts/cse_main.sh Adds ACL-specific Fabric Manager sysext install path.
parts/linux/cloud-init/artifacts/cse_install.sh Adds shared GPU helper functions (should_use_nvidia_open_drivers, enableNvidiaPersistenceMode).
parts/linux/cloud-init/artifacts/cse_helpers.sh Adds error code for missing VERSION_ID needed for sysext tag resolution.
parts/linux/cloud-init/artifacts/cse_config.sh Wires ACL GPU flow using sysext installs; adjusts GRID licensing service start behavior.
parts/linux/cloud-init/artifacts/acl/cse_install_acl.sh Implements ACL GPU sysext resolution using /etc/os-release VERSION_ID, plus improved local/remote sysext matching.
e2e/scenario_test.go Adds ACL GPU E2Es (NC, A100, A10/GRID) and AzureLinux A10/GRID scenario.
.pipelines/.vsts-vhd-builder-release.yaml Updates source SIG image name/version used by the ACL builder stage.
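The VERSION_ID-based sysext tag resolution described for cse_install_acl.sh can be sketched like this; the stand-in os-release file, the error-code value, and the tag-suffix format are all assumptions for illustration, not the PR's actual code:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Placeholder value; the PR adds a dedicated error code in cse_helpers.sh.
ERR_MISSING_VERSION_ID=190

# Stand-in for /etc/os-release so the sketch is self-contained.
os_release="$(mktemp)"
printf 'ID=azurelinux\nVERSION_ID="3.0"\n' > "${os_release}"

# Source in a subshell so the current environment is not polluted.
VERSION_ID="$(. "${os_release}"; echo "${VERSION_ID:-}")"
if [ -z "${VERSION_ID}" ]; then
    echo "VERSION_ID missing from os-release; cannot resolve sysext tag" >&2
    exit "${ERR_MISSING_VERSION_ID}"
fi
echo "resolved sysext tag suffix: azlinux${VERSION_ID%%.*}"
```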

Comment on lines +169 to +170
echo "Failed to determine GPU driver type"
exit $ERR_MISSING_CUDA_PACKAGE

Copilot AI Mar 19, 2026


should_use_nvidia_open_drivers returns 2 specifically for “unable to determine VM SKU”, but this path exits with ERR_MISSING_CUDA_PACKAGE, which is misleading and can cause incorrect failure categorization/telemetry. Prefer propagating the function’s error (or add a dedicated error code like ERR_GPU_DRIVER_SELECTION_FAIL) and emit an error message that matches the underlying cause (e.g., IMDS SKU lookup failure).

Suggested change
echo "Failed to determine GPU driver type"
exit $ERR_MISSING_CUDA_PACKAGE
echo "Failed to determine GPU driver type for this VM: unable to determine VM SKU (should_use_nvidia_open_drivers returned ${driver_ret})"
exit "${driver_ret}"

Comment on lines +41 to 43
# Match either arch-specific tags (v{ver}[.~-]*-azlinux3-{arch}) or exact version tags ({ver})
retrycmd_silent 120 5 20 oras repo tags --registry-config "${ORAS_REGISTRY_CONFIG_FILE}" "${seURL}" | grep -Ex "(v${desiredVer//./\\.}[.~-].*-azlinux3-${seArch}|${desiredVer//./\\.})" | sort -V | tail -n1
test ${PIPESTATUS[0]} -eq 0

Copilot AI Mar 19, 2026


The updated regex allows both arch-specific tags and “exact version” tags, but the single-pass sort -V | tail -n1 selection can become ambiguous if both forms exist (it may choose the wrong one depending on tag set/sort behavior). To make selection deterministic, consider doing a two-pass lookup: first attempt the arch-specific pattern; only if that yields no match, fallback to the exact-version tag.

Suggested change
# Match either arch-specific tags (v{ver}[.~-]*-azlinux3-{arch}) or exact version tags ({ver})
retrycmd_silent 120 5 20 oras repo tags --registry-config "${ORAS_REGISTRY_CONFIG_FILE}" "${seURL}" | grep -Ex "(v${desiredVer//./\\.}[.~-].*-azlinux3-${seArch}|${desiredVer//./\\.})" | sort -V | tail -n1
test ${PIPESTATUS[0]} -eq 0
local tags archPattern exactPattern match
# Fetch all tags once; retrycmd_silent handles retries and logging.
tags=$(retrycmd_silent 120 5 20 oras repo tags --registry-config "${ORAS_REGISTRY_CONFIG_FILE}" "${seURL}")
if [ $? -ne 0 ]; then
# Propagate failure from oras/registry access.
return 1
fi
# First pass: prefer arch-specific tags (v{ver}[.~-]*-azlinux3-{arch}).
archPattern="^v${desiredVer//./\\.}[.~-].*-azlinux3-${seArch}$"
match=$(printf '%s\n' "${tags}" | grep -E "${archPattern}" | sort -V | tail -n1)
if [ -n "${match}" ]; then
echo "${match}"
return 0
fi
# Second pass: fall back to exact-version tags ({ver}) if no arch-specific tag exists.
exactPattern="^${desiredVer//./\\.}$"
match=$(printf '%s\n' "${tags}" | grep -E "${exactPattern}" | sort -V | tail -n1)
echo "${match}"

Comment on lines +893 to +895
PERSISTENCED_SERVICE_FILE_PATH="/etc/systemd/system/nvidia-persistenced.service"
touch ${PERSISTENCED_SERVICE_FILE_PATH}
cat << EOF > ${PERSISTENCED_SERVICE_FILE_PATH}

Copilot AI Mar 19, 2026


PERSISTENCED_SERVICE_FILE_PATH is written as a global variable and is used unquoted in touch/redirection. Making it local and quoting expansions improves safety and avoids accidental global state leakage when cse_install.sh is sourced by multiple distro scripts.

Suggested change
PERSISTENCED_SERVICE_FILE_PATH="/etc/systemd/system/nvidia-persistenced.service"
touch ${PERSISTENCED_SERVICE_FILE_PATH}
cat << EOF > ${PERSISTENCED_SERVICE_FILE_PATH}
local PERSISTENCED_SERVICE_FILE_PATH="/etc/systemd/system/nvidia-persistenced.service"
touch "${PERSISTENCED_SERVICE_FILE_PATH}"
cat << EOF > "${PERSISTENCED_SERVICE_FILE_PATH}"


Copilot AI left a comment


Pull request overview

Copilot reviewed 20 out of 79 changed files in this pull request and generated 3 comments.

Comment on lines +911 to +912
systemctl enable nvidia-persistenced.service || exit 1
systemctl restart nvidia-persistenced.service || exit 1

Copilot AI Mar 20, 2026


enableNvidiaPersistenceMode calls systemctl enable/restart and exit 1 on failure. Since this function is now shared and used for ACL, exiting with a generic code loses the repo’s standardized error codes and skips the retry/timeout wrappers (systemctlEnableAndStart, systemctl_*). Consider using the helper wrappers and returning a specific error code (e.g. ERR_SYSTEMCTL_START_FAIL) so failures are actionable in CSE telemetry.

Suggested change
systemctl enable nvidia-persistenced.service || exit 1
systemctl restart nvidia-persistenced.service || exit 1
if ! systemctlEnableAndStart nvidia-persistenced.service; then
return $ERR_SYSTEMCTL_START_FAIL
fi

Comment on lines +33 to +35
# MCR artifacts may place files in an arch subdirectory (e.g. amd64/name.raw),
# so search up to 2 levels deep.
match=$(find "${downloadDir}" -maxdepth 2 -name "${seName}.raw" -type f 2>/dev/null | head -n1)

Copilot AI Mar 20, 2026


matchLocalSysext fallback (find ... -name "${seName}.raw" | head -n1) can return an arbitrary file and does not filter by the requested systemd arch (seArch is x86-64 on amd64). If both arch subdirs exist (e.g. amd64/ + arm64/), this can select the wrong sysext and break provisioning. Consider preferring an arch-specific path/pattern (e.g. ${downloadDir}/${seArch}/${seName}.raw or filtering find results), and make the selection deterministic (e.g. sort -V | tail -n1).

Suggested change
# MCR artifacts may place files in an arch subdirectory (e.g. amd64/name.raw),
# so search up to 2 levels deep.
match=$(find "${downloadDir}" -maxdepth 2 -name "${seName}.raw" -type f 2>/dev/null | head -n1)
# Prefer an arch-specific subdirectory (${downloadDir}/${seArch}) when present,
# then fall back to an arch-neutral file directly under ${downloadDir}. In both
# cases, pick the highest version deterministically.
match=$(find "${downloadDir}/${seArch}" -maxdepth 1 -name "${seName}.raw" -type f 2>/dev/null | sort -V | tail -n1)
if [ -f "${match}" ]; then
echo "${match}"
return
fi
match=$(find "${downloadDir}" -maxdepth 1 -name "${seName}.raw" -type f 2>/dev/null | sort -V | tail -n1)

Copilot AI review requested due to automatic review settings March 21, 2026 00:18

Copilot AI left a comment


Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.
