Skip to content

Conversation

@suhuruli
Copy link
Collaborator

@suhuruli suhuruli commented Jul 2, 2025

This pull request introduces support for provisioning Azure Linux nodes in the GPU Provisioner and updates related functionality across documentation, implementation, and testing. Key changes include adding configuration options for Azure Linux in KAITO workloads, modifying the logic for determining the OS SKU, and expanding test coverage to ensure compatibility.

Documentation Updates:

  • Added a new document, docs/azure-linux-support.md, detailing the configuration, supported image families, migration steps, and benefits of using Azure Linux with the GPU Provisioner. This includes examples for both NodeClaim and Workspace specifications.

Implementation Enhancements:

  • Updated func newAgentPoolObject in pkg/providers/instance/instance.go to determine the OS SKU based on NodeClaim labels or annotations, defaulting to Ubuntu if unspecified or invalid. Labels take precedence over annotations.
  • I considered doing this with the AKSNodeClass but opted for labels and annotations instead. Happy to change to looking at the AKSNodeClass reference for the ImageFamily.

Testing Improvements:

  • Added a new test, TestNewAgentPoolObjectWithImageFamily, in pkg/providers/instance/instance_test.go to validate the OS SKU determination logic for various scenarios, including label precedence, unknown image families, and default behavior.

@kaito-pr-agent
Copy link

kaito-pr-agent bot commented Jul 2, 2025

Title

Add Azure Linux support in GPU Provisioner


Description

  • Added support for Azure Linux in GPU Provisioner

  • Updated newAgentPoolObject to determine OS SKU from NodeClaim labels or annotations

  • Added tests for OS SKU determination logic

  • Created documentation for Azure Linux support


Changes walkthrough 📝

Relevant files
Enhancement
instance.go
Add OS SKU determination logic                                                     

pkg/providers/instance/instance.go

  • Added logic to determine OS SKU from NodeClaim labels or annotations
  • Defaulted to Ubuntu if no image family is specified
  • +33/-1   
    Tests
    instance_test.go
    Add OS SKU determination tests                                                     

    pkg/providers/instance/instance_test.go

    • Added test cases for OS SKU determination logic
    +161/-0 
    suite_test.go
    Add Azure Linux E2E tests                                                               

    test/e2e/suites/suite_test.go

  • Added E2E tests for provisioning Azure Linux nodes
  • Included tests for label and annotation precedence
  • +508/-0 
    Documentation
    azure-linux-annotation-nodeclaim.yaml
    Add Azure Linux annotation example                                             

    examples/azure-linux-annotation-nodeclaim.yaml

    • Added example NodeClaim using Azure Linux via annotation
    +44/-0   
    azure-linux-examples.md
    Add Azure Linux documentation                                                       

    examples/azure-linux-examples.md

    • Created documentation with examples for Azure Linux support
    +255/-0 
    azure-linux-nodeclaim.yaml
    Add Azure Linux label example                                                       

    examples/azure-linux-nodeclaim.yaml

    • Added example NodeClaim using Azure Linux via label
    +37/-0   

    Need help?
  • Type /help how to ... in the comments thread for any questions about PR-Agent usage.
  • Check out the documentation for more information.
  • @kaito-pr-agent
    Copy link

    kaito-pr-agent bot commented Jul 2, 2025

    PR Reviewer Guide 🔍

    Here are some key observations to aid the review process:

    ⏱️ Estimated effort to review: 4 🔵🔵🔵🔵⚪
    🧪 PR contains tests
    🔒 No security concerns identified
    ⚡ Recommended focus areas for review

    Possible Issue

    The determineOSSKU function does not handle the case where nodeClaim is nil. This could lead to a panic if nodeClaim is unexpectedly nil.

    func determineOSSKU(nodeClaim *karpenterv1.NodeClaim) *armcontainerservice.OSSKU {
    	// Helper function to convert image family to OSSKU
    	convertImageFamilyToOSSKU := func(imageFamily, source string) *armcontainerservice.OSSKU {
    		switch strings.ToLower(imageFamily) {
    		case "azurelinux":
    			return to.Ptr(armcontainerservice.OSSKUAzureLinux)
    		case "ubuntu", "ubuntu2204":
    			return to.Ptr(armcontainerservice.OSSKUUbuntu)
    		default:
    			klog.Warningf("Unknown imageFamily %s in NodeClaim %s, defaulting to Ubuntu", imageFamily, source)
    			return to.Ptr(armcontainerservice.OSSKUUbuntu)
    		}
    	}
    	// First check for a direct label on the NodeClaim
    	if imageFamily, ok := nodeClaim.Labels["kaito.sh/node-image-family"]; ok {
    		return convertImageFamilyToOSSKU(imageFamily, "label")
    	}
    	// Check annotations as fallback
    	if imageFamily, ok := nodeClaim.Annotations["kaito.sh/node-image-family"]; ok {
    		return convertImageFamilyToOSSKU(imageFamily, "annotation")
    	}
    
    	// Default to Ubuntu if no image family is specified
    	return to.Ptr(armcontainerservice.OSSKUUbuntu)
    }
    Redundant Tests

    There are multiple tests that check the deletion of nodes (It("should terminate node when delete triggered"), It("should terminate node when delete triggered (Azure Linux)"), It("should terminate node when delete triggered (Azure Linux - annotation)")). These tests seem redundant and could be consolidated to reduce duplication.

    It("should terminate node when delete triggered", func() {
    	nodeClaimLabels := map[string]string{
    		"karpenter.sh/provisioner-name": "default",
    		"kaito.sh/workspace":            "none",
    	}
    
    	nc := test.NodeClaim(karpenterv1.NodeClaim{
    		ObjectMeta: metav1.ObjectMeta{
    			Name:   "wctestnc5",
    			Labels: nodeClaimLabels,
    		},
    		Spec: karpenterv1.NodeClaimSpec{
    			NodeClassRef: &karpenterv1.NodeClassReference{
    				Name: "default",
    				Kind: "AKSNodeClass",
    			},
    			Resources: karpenterv1.ResourceRequirements{
    				Requests: v1.ResourceList{
    					v1.ResourceStorage: lo.FromPtr(resource.NewQuantity(120*1024*1024*1024, resource.DecimalSI)),
    				},
    			},
    			Requirements: []karpenterv1.NodeSelectorRequirementWithMinValues{
    				{
    					NodeSelectorRequirement: v1.NodeSelectorRequirement{
    						Key:      v1.LabelInstanceTypeStable,
    						Operator: v1.NodeSelectorOpIn,
    						Values:   []string{"Standard_NC12s_v3"},
    					},
    				},
    				{
    					NodeSelectorRequirement: v1.NodeSelectorRequirement{
    						Key:      karpenterv1.NodePoolLabelKey,
    						Operator: v1.NodeSelectorOpIn,
    						Values:   []string{"kaito"},
    					},
    				},
    				{
    					NodeSelectorRequirement: v1.NodeSelectorRequirement{
    						Key:      v1.LabelOSStable,
    						Operator: v1.NodeSelectorOpIn,
    						Values:   []string{"linux"},
    					},
    				},
    			},
    			Taints: []v1.Taint{
    				{
    					Key:    "sku",
    					Value:  "gpu",
    					Effect: v1.TaintEffectNoSchedule,
    				},
    			},
    		},
    	})
    
    	DeferCleanup(func() {
    		env.ExpectDeleted(nc)
    		env.EventuallyExpectCreatedNodeClaimCount("==", 0)
    		env.EventuallyExpectNodeCount("==", 0)
    	})
    
    	env.ExpectCreated(nc)
    	env.EventuallyExpectCreatedNodeClaimCount("==", 1)
    	env.EventuallyExpectNodeClaimsReady(nc)
    	env.EventuallyExpectNodeCount("==", 1)
    	node := env.EventuallyExpectInitializedNodeCount("==", 1)[0]
    
    	// delete node for triggering terminate all resrouces like NodeClaim, CloudProvider Instance
    	env.ExpectDeleted(node)
    })
    
    It("should terminate node when delete triggered (Azure Linux)", func() {
    	nodeClaimLabels := map[string]string{
    		"karpenter.sh/provisioner-name": "default",
    		"kaito.sh/workspace":            "azure-linux-test",
    		"kaito.sh/node-image-family":    "AzureLinux",
    	}
    
    	nc := test.NodeClaim(karpenterv1.NodeClaim{
    		ObjectMeta: metav1.ObjectMeta{
    			Name:   "azlinuxtestnc2",
    			Labels: nodeClaimLabels,
    		},
    		Spec: karpenterv1.NodeClaimSpec{
    			NodeClassRef: &karpenterv1.NodeClassReference{
    				Name: "default",
    				Kind: "AKSNodeClass",
    			},
    			Resources: karpenterv1.ResourceRequirements{
    				Requests: v1.ResourceList{
    					v1.ResourceStorage: lo.FromPtr(resource.NewQuantity(120*1024*1024*1024, resource.DecimalSI)),
    				},
    			},
    			Requirements: []karpenterv1.NodeSelectorRequirementWithMinValues{
    				{
    					NodeSelectorRequirement: v1.NodeSelectorRequirement{
    						Key:      v1.LabelInstanceTypeStable,
    						Operator: v1.NodeSelectorOpIn,
    						Values:   []string{"Standard_NC12s_v3"},
    					},
    				},
    				{
    					NodeSelectorRequirement: v1.NodeSelectorRequirement{
    						Key:      karpenterv1.NodePoolLabelKey,
    						Operator: v1.NodeSelectorOpIn,
    						Values:   []string{"kaito"},
    					},
    				},
    				{
    					NodeSelectorRequirement: v1.NodeSelectorRequirement{
    						Key:      v1.LabelOSStable,
    						Operator: v1.NodeSelectorOpIn,
    						Values:   []string{"linux"},
    					},
    				},
    			},
    			Taints: []v1.Taint{
    				{
    					Key:    "sku",
    					Value:  "gpu",
    					Effect: v1.TaintEffectNoSchedule,
    				},
    			},
    		},
    	})
    
    	DeferCleanup(func() {
    		env.ExpectDeleted(nc)
    		env.EventuallyExpectCreatedNodeClaimCount("==", 0)
    		env.EventuallyExpectNodeCount("==", 0)
    	})
    
    	env.ExpectCreated(nc)
    	env.EventuallyExpectCreatedNodeClaimCount("==", 1)
    	env.EventuallyExpectNodeClaimsReady(nc)
    	env.EventuallyExpectNodeCount("==", 1)
    	node := env.EventuallyExpectInitializedNodeCount("==", 1)[0]
    
    	// delete node for triggering terminate all resrouces like NodeClaim, CloudProvider Instance
    	env.ExpectDeleted(node)
    })
    
    It("should terminate node when delete triggered (Azure Linux - annotation)", func() {
    	nodeClaimLabels := map[string]string{
    		"karpenter.sh/provisioner-name": "default",
    		"kaito.sh/workspace":            "azure-linux-annotation-test",
    	}
    
    	nodeClaimAnnotations := map[string]string{
    		"kaito.sh/node-image-family": "AzureLinux",
    	}
    
    	nc := test.NodeClaim(karpenterv1.NodeClaim{
    		ObjectMeta: metav1.ObjectMeta{
    			Name:        "azlinuxannotationtestnc2",
    			Labels:      nodeClaimLabels,
    			Annotations: nodeClaimAnnotations,
    		},
    		Spec: karpenterv1.NodeClaimSpec{
    			NodeClassRef: &karpenterv1.NodeClassReference{
    				Name: "default",
    				Kind: "AKSNodeClass",
    			},
    			Resources: karpenterv1.ResourceRequirements{
    				Requests: v1.ResourceList{
    					v1.ResourceStorage: lo.FromPtr(resource.NewQuantity(120*1024*1024*1024, resource.DecimalSI)),
    				},
    			},
    			Requirements: []karpenterv1.NodeSelectorRequirementWithMinValues{
    				{
    					NodeSelectorRequirement: v1.NodeSelectorRequirement{
    						Key:      v1.LabelInstanceTypeStable,
    						Operator: v1.NodeSelectorOpIn,
    						Values:   []string{"Standard_NC12s_v3"},
    					},
    				},
    				{
    					NodeSelectorRequirement: v1.NodeSelectorRequirement{
    						Key:      karpenterv1.NodePoolLabelKey,
    						Operator: v1.NodeSelectorOpIn,
    						Values:   []string{"kaito"},
    					},
    				},
    				{
    					NodeSelectorRequirement: v1.NodeSelectorRequirement{
    						Key:      v1.LabelOSStable,
    						Operator: v1.NodeSelectorOpIn,
    						Values:   []string{"linux"},
    					},
    				},
    			},
    			Taints: []v1.Taint{
    				{
    					Key:    "sku",
    					Value:  "gpu",
    					Effect: v1.TaintEffectNoSchedule,
    				},
    			},
    		},
    	})
    
    	DeferCleanup(func() {
    		env.ExpectDeleted(nc)
    		env.EventuallyExpectCreatedNodeClaimCount("==", 0)
    		env.EventuallyExpectNodeCount("==", 0)
    	})
    
    	env.ExpectCreated(nc)
    	env.EventuallyExpectCreatedNodeClaimCount("==", 1)
    	env.EventuallyExpectNodeClaimsReady(nc)
    	env.EventuallyExpectNodeCount("==", 1)
    	node := env.EventuallyExpectInitializedNodeCount("==", 1)[0]
    
    	// delete node for triggering terminate all resrouces like NodeClaim, CloudProvider Instance
    	env.ExpectDeleted(node)
    })

    @codecov-commenter
    Copy link

    codecov-commenter commented Jul 2, 2025

    Codecov Report

    All modified and coverable lines are covered by tests ✅

    Project coverage is 70.45%. Comparing base (18cafb0) to head (1249efd).

    Additional details and impacted files
    @@            Coverage Diff             @@
    ##             main     #250      +/-   ##
    ==========================================
    + Coverage   69.24%   70.45%   +1.20%     
    ==========================================
      Files           4        4              
      Lines         465      484      +19     
    ==========================================
    + Hits          322      341      +19     
      Misses        123      123              
      Partials       20       20              

    ☔ View full report in Codecov by Sentry.
    📢 Have feedback on the report? Share it here.

    🚀 New features to boost your workflow:
    • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

    @kaito-pr-agent
    Copy link

    kaito-pr-agent bot commented Jul 2, 2025

    PR Code Suggestions ✨

    Explore these optional code suggestions:

    CategorySuggestion                                                                                                                                    Impact
    General
    Handle nil NodeClaim

    Ensure that determineOSSKU handles all possible edge cases, such as nil nodeClaim or
    missing labels/annotations.

    pkg/providers/instance/instance.go [369]

    +if nodeClaim == nil {
    +    klog.Warningf("NodeClaim is nil, defaulting to Ubuntu")
    +    return armcontainerservice.AgentPool{
    +        Properties: &armcontainerservice.ManagedClusterAgentPoolProfileProperties{
    +            NodeLabels:   labels,
    +            NodeTaints:   taintsStr,
    +            Type:         to.Ptr(scaleSetsType),
    +            VMSize:       to.Ptr(vmSize),
    +            OSType:       to.Ptr(armcontainerservice.OSTypeLinux),
    +            OSSKU:        to.Ptr(armcontainerservice.OSSKUUbuntu),
    +            Count:        to.Ptr(int32(1)),
    +            OSDiskSizeGB: to.Ptr(diskSizeGB),
    +        },
    +    }, nil
    +}
     OSSKU:        determineOSSKU(nodeClaim),
    Suggestion importance[1-10]: 8

    __

    Why: Handling nil nodeClaim prevents potential runtime errors and ensures robustness.

    Medium
    Log full NodeClaim name

    Consider logging the full NodeClaim name for better debugging.

    pkg/providers/instance/instance.go [430-431]

     default:
    -    klog.Warningf("Unknown imageFamily %s in NodeClaim %s, defaulting to Ubuntu", imageFamily, source)
    +    klog.Warningf("Unknown imageFamily %s in NodeClaim %s/%s, defaulting to Ubuntu", imageFamily, nodeClaim.Namespace, nodeClaim.Name)
         return to.Ptr(armcontainerservice.OSSKUUbuntu)
    Suggestion importance[1-10]: 5

    __

    Why: Logging the full NodeClaim name improves debugging by providing more context.

    Low
    Add OS SKU verification

    Add a section on how to verify the OS SKU of the created agent pools.

    examples/azure-linux-examples.md [169]

     ## Troubleshooting
     
    +### Verify Agent Pool OS SKU
    +
    +```bash
    +# Check agent pool configuration
    +az aks agentpool show \
    +  --resource-group <resource-group> \
    +  --cluster-name <cluster-name> \
    +  --name <pool-name> \
    +  --query "osSkU"
    +
    +# Expected output: "AzureLinux"
    +```
    +
    +### Common Issues
    +
    Suggestion importance[1-10]: 5

    __

    Why: Adding a section on verifying the OS SKU helps users confirm the correct configuration.

    Low
    Add migration steps

    Include steps to update existing deployments to use Azure Linux.

    examples/azure-linux-examples.md [210]

     ## Migration Guide
     
    +### Update Existing Deployments
    +
    +1. **Edit your NodeClaim or Workspace**:
    +   ```yaml
    +   # Add this label to your NodeClaim or Workspace labelSelector
    +   kaito.sh/node-image-family: "AzureLinux"
    +   ```
    +
    +2. **Roll out the changes**:
    +   ```bash
    +   kubectl apply -f <your-nodeclaim-or-workspace-file>.yaml
    +   ```
    +
    +3. **Verify the change**:
    +   ```bash
    +   # Check that new nodes use Azure Linux
    +   kubectl get nodes -o custom-columns=NAME:.metadata.name,OS-IMAGE:.status.nodeInfo.osImage
    +   ```
    +
    +4. **Test your workloads**:
    +   - Ensure your containerized workloads work correctly on Azure Linux
    +   - Most workloads should work without changes
    +
    +### From Ubuntu to Azure Linux
    +
    Suggestion importance[1-10]: 5

    __

    Why: Including migration steps assists users in updating existing deployments to use Azure Linux.

    Low

    @suhuruli suhuruli changed the title Azure Linux support in gpu provisioner feat: Azure Linux support in gpu provisioner Jul 2, 2025
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

    Projects

    None yet

    Development

    Successfully merging this pull request may close these issues.

    4 participants