-
Notifications
You must be signed in to change notification settings - Fork 244
Adding support for AMDAMA (Supernova) GPUs. #7697
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
PR Title Lint Failed ❌Current Title: Your PR title doesn't follow the expected format. Please update your PR title to follow one of these patterns: Conventional Commits Format:
Guidelines:
Examples:
Please update your PR title and the lint check will run again automatically. |
| return true | ||
| } | ||
| return false | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method duplicated 3 times. Should be no more than 2. (Ideally 1)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just following the method used in each file. Presumably the ones we don't need will get removed when the move from scripted to scriptless is done?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also. would you recommend breaking form the model to make lint happy?
| "userAssignedIdentityID": config.UserAssignedIdentityClientID, | ||
| "isVHD": isVHD(profile), | ||
| "gpuNode": strconv.FormatBool(config.EnableNvidia), | ||
| "amdamaNode": strconv.FormatBool(datamodel.IsAmdAmaEnabledSKU(profile.VMSize)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks unused?
| }) | ||
| } | ||
|
|
||
| func Test_AzureLinuxV3_MA35D(t *testing.T) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AzureLinux is the only OS that support this VMSku correct ?
| } | ||
|
|
||
| setupAmdAma() { | ||
| if [ "$(isARM64)" -eq 1 ]; then |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this even possible ? to have AMDAMA_MODE == true and with a arm64 arch ? What would be the consequences here if this cluase is true and we expect to have AMDAMA_MODE and we dont setup anymore.
| return | ||
| fi | ||
|
|
||
| if isMarinerOrAzureLinux "$OS"; then |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same, could we have ubuntu ? if it's the case what will happen ? should we baill out instead and cause anode bootstrap failure ? I would expect that Ubuntu/Flatcar and ARM64 will all be blocked at the RP level to provide users with early validation failure ?
One more question, does isMarinerOrAzureLinux return true for OSGuard I recall there were extrra check required to ensure it wasn't LinuxOSGuard
|
|
||
| if isMarinerOrAzureLinux "$OS"; then | ||
| # Install driver | ||
| sudo tdnf install -y azurelinux-repos-amd |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we don't need sudo, since the CSE extension is running as root. we recently had an outage cause some scripts were calling SUDO using an expirer root password..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should use the helper function for any dnf command. which handle retries, etc
| sudo tdnf install -y azurelinux-repos-amd | ||
| KERNEL_VERSION=$(uname -r | sed 's/-/./g') | ||
| AMD_AMA_DRIVER_PACKAGE=$(dnf repoquery -y --available "amd-ama-driver*" | grep -E "amd-ama-driver-[0-9]+.*_$KERNEL_VERSION" | sort -V | tail -n 1) | ||
| sudo tdnf install -y $AMD_AMA_DRIVER_PACKAGE |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we combine into a single dnf call ?
| AMD_AMA_DRIVER_PACKAGE=$(dnf repoquery -y --available "amd-ama-driver*" | grep -E "amd-ama-driver-[0-9]+.*_$KERNEL_VERSION" | sort -V | tail -n 1) | ||
| sudo tdnf install -y $AMD_AMA_DRIVER_PACKAGE | ||
| # Install core package | ||
| sudo tdnf install -y libzip |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
all tdnf calls will not work on VM with restricted egress network.
| # Install core package | ||
| sudo tdnf install -y libzip | ||
| sudo tdnf install -y azurelinux-repos-extended | ||
| sudo RPM_FRONTEND=noninteractive tdnf install -y https://download.microsoft.com/download/16b04fa7-883e-4a94-88c2-801881a47b28/amd-ama-core_1.3.0-2503242033-amd64.rpm |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is very much hardcoded, if it's for testing purposes for now, I'm ok but we need to move the package to components.json and also ensure we can detect new versions automatically with renovate. Let's do that in another PR.
| if [ "${AMDAMA_NODE}" = "true" ]; then | ||
| logs_to_events "AKS.CSE.setupAmdAma" setupAmdAma | ||
| else | ||
| logs_to_events "AKS.CSE.setupAmdAma" "echo AMD AMA HW not found!" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we need to log this eventk, knowing it will be the case 99.9% of the time ?
| } | ||
|
|
||
| // IsAmdAmaEnabledSKU determines if an VM SKU has AMD AMA GPU HW support. | ||
| func IsAmdAmaEnabledSKU(vmSize string) bool { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this is the first time we perform a logic in the VHD to mapping SKU to feature... normally this is done in RP.
What this PR does / why we need it:
Adding support for AMDAMA (Supernova) GPUs.