Skip to content

Conversation

@mipresmsft
Copy link

What this PR does / why we need it:

Adding support for AMDAMA (Supernova) GPUs.

@mipresmsft mipresmsft changed the title Moved changes to sync with main branch. Adding support for AMDAMA (Supernova) GPUs. Jan 21, 2026
@github-actions
Copy link
Contributor

github-actions bot commented Jan 21, 2026

PR Title Lint Failed ❌

Current Title: Adding support for AMDAMA (Supernova) GPUs.

Your PR title doesn't follow the expected format. Please update your PR title to follow one of these patterns:

Conventional Commits Format:

  • feat: add new feature - for new features
  • fix: resolve bug in component - for bug fixes
  • docs: update README - for documentation changes
  • refactor: improve code structure - for refactoring
  • test: add unit tests - for test additions
  • chore: remove dead code - for maintenance tasks
  • chore(deps): update dependencies - for updating dependencies
  • ci: update build pipeline - for CI/CD changes

Guidelines:

  • Use lowercase for the type and description
  • Keep the description concise but descriptive
  • Use imperative mood (e.g., "add" not "adds" or "added")
  • Don't end with a period

Examples:

  • feat(windows): add secure TLS bootstrapping for Windows nodes
  • fix: resolve kubelet certificate rotation issue
  • docs: update installation guide
  • Added new feature
  • Fix bug.
  • Update docs

Please update your PR title and the lint check will run again automatically.

return true
}
return false
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method duplicated 3 times. Should be no more than 2. (Ideally 1)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just following the method used in each file. Presumably the ones we don't need will get removed when the move from scripted to scriptless is done?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also. would you recommend breaking form the model to make lint happy?

"userAssignedIdentityID": config.UserAssignedIdentityClientID,
"isVHD": isVHD(profile),
"gpuNode": strconv.FormatBool(config.EnableNvidia),
"amdamaNode": strconv.FormatBool(datamodel.IsAmdAmaEnabledSKU(profile.VMSize)),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks unused?

})
}

func Test_AzureLinuxV3_MA35D(t *testing.T) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AzureLinux is the only OS that support this VMSku correct ?

}

setupAmdAma() {
if [ "$(isARM64)" -eq 1 ]; then
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this even possible ? to have AMDAMA_MODE == true and with a arm64 arch ? What would be the consequences here if this cluase is true and we expect to have AMDAMA_MODE and we dont setup anymore.

return
fi

if isMarinerOrAzureLinux "$OS"; then
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same, could we have ubuntu ? if it's the case what will happen ? should we baill out instead and cause anode bootstrap failure ? I would expect that Ubuntu/Flatcar and ARM64 will all be blocked at the RP level to provide users with early validation failure ?

One more question, does isMarinerOrAzureLinux return true for OSGuard I recall there were extrra check required to ensure it wasn't LinuxOSGuard


if isMarinerOrAzureLinux "$OS"; then
# Install driver
sudo tdnf install -y azurelinux-repos-amd
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we don't need sudo, since the CSE extension is running as root. we recently had an outage cause some scripts were calling SUDO using an expirer root password..

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should use the helper function for any dnf command. which handle retries, etc

sudo tdnf install -y azurelinux-repos-amd
KERNEL_VERSION=$(uname -r | sed 's/-/./g')
AMD_AMA_DRIVER_PACKAGE=$(dnf repoquery -y --available "amd-ama-driver*" | grep -E "amd-ama-driver-[0-9]+.*_$KERNEL_VERSION" | sort -V | tail -n 1)
sudo tdnf install -y $AMD_AMA_DRIVER_PACKAGE
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we combine into a single dnf call ?

AMD_AMA_DRIVER_PACKAGE=$(dnf repoquery -y --available "amd-ama-driver*" | grep -E "amd-ama-driver-[0-9]+.*_$KERNEL_VERSION" | sort -V | tail -n 1)
sudo tdnf install -y $AMD_AMA_DRIVER_PACKAGE
# Install core package
sudo tdnf install -y libzip
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

all tdnf calls will not work on VM with restricted egress network.

# Install core package
sudo tdnf install -y libzip
sudo tdnf install -y azurelinux-repos-extended
sudo RPM_FRONTEND=noninteractive tdnf install -y https://download.microsoft.com/download/16b04fa7-883e-4a94-88c2-801881a47b28/amd-ama-core_1.3.0-2503242033-amd64.rpm
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is very much hardcoded, if it's for testing purposes for now, I'm ok but we need to move the package to components.json and also ensure we can detect new versions automatically with renovate. Let's do that in another PR.

if [ "${AMDAMA_NODE}" = "true" ]; then
logs_to_events "AKS.CSE.setupAmdAma" setupAmdAma
else
logs_to_events "AKS.CSE.setupAmdAma" "echo AMD AMA HW not found!"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need to log this eventk, knowing it will be the case 99.9% of the time ?

}

// IsAmdAmaEnabledSKU determines if an VM SKU has AMD AMA GPU HW support.
func IsAmdAmaEnabledSKU(vmSize string) bool {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this is the first time we perform a logic in the VHD to mapping SKU to feature... normally this is done in RP.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants