
Conversation

@chucklever (Contributor)

Add support for tier-based GPU instance selection for Lambda Labs, similar to the existing DataCrunch implementation. This allows users to specify a maximum GPU tier and the system will automatically select the highest available GPU within that tier.

mcgrof and others added 4 commits December 16, 2025 11:05
Add support for tier-based GPU instance selection for Lambda Labs, similar
to the existing DataCrunch implementation. This allows users to specify
a maximum GPU tier and the system will automatically select the highest
available GPU within that tier.

The implementation adds capacity checking and tier selection scripts that
query the Lambda Labs API to find available instances. Single GPU tier
groups fall back from GH200 to H100 to A100 to A6000 to A10. Multi-GPU
tier groups fall back from 8x B200 to 8x H100 to 8x A100 to 8x V100.

New Kconfig options provide tier-based selections like H100_OR_LESS and
8X_H100_OR_LESS. The terraform ansible tasks detect these wildcard types
and invoke the tier selection script to find available capacity before
provisioning.

Defconfigs are provided for common tier combinations to simplify usage.
Users can now run commands like make defconfig-lambdalabs-h100-or-less
to get the best available single-GPU instance up to the H100 tier.

Generated-by: Claude AI
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
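
To illustrate the fallback behavior this commit describes, here is a minimal sketch of tier-based selection. It is not the code from this series: the API endpoint, the response fields, and the gpu_1x_* instance type names are assumptions made for the example.

```python
#!/usr/bin/env python3
# Minimal sketch of tier-based fallback selection (illustrative only).
# The endpoint, response shape, and instance type names are assumptions.
import os
import sys

import requests

API_URL = "https://cloud.lambdalabs.com/api/v1/instance-types"

# Single GPU tier group, highest GPU first (names are hypothetical).
SINGLE_GPU_TIER = [
    "gpu_1x_gh200",
    "gpu_1x_h100_pcie",
    "gpu_1x_a100",
    "gpu_1x_a6000",
    "gpu_1x_a10",
]


def pick_instance(tier, api_key):
    """Return (instance_type, region) for the first tier entry with capacity."""
    resp = requests.get(
        API_URL, headers={"Authorization": f"Bearer {api_key}"}, timeout=30
    )
    resp.raise_for_status()
    data = resp.json().get("data", {})
    for instance_type in tier:
        regions = data.get(instance_type, {}).get(
            "regions_with_capacity_available", []
        )
        if regions:
            return instance_type, regions[0]["name"]
    return None


if __name__ == "__main__":
    choice = pick_instance(SINGLE_GPU_TIER, os.environ["LAMBDALABS_API_KEY"])
    if choice is None:
        sys.exit("no capacity available in the requested tier")
    # Print the "instance_type region" pair that provisioning later parses.
    print(f"{choice[0]} {choice[1]}")
```

Provisioning would then consume the printed "instance_type region" pair, which is the output format the later commits in this series reference.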
Address review feedback regarding inconsistent error handling in the
check_availability function. The function contract implies returning int
values for both success and failure, but error paths were calling
sys.exit() directly.

Change the error handling to return non-zero integers instead of calling
sys.exit(), making the function consistent and easier to test. Remove
the unused instance_data binding from get_instance_types_with_capacity()
since only capacity_map is used. Add exception handling around the API
call so the function never raises unhandled exceptions.

Generated-by: Claude AI
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
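
A sketch of the described contract follows, with the usual caveat that the function bodies, endpoint, and response fields below are assumptions rather than the script's actual code.

```python
# Sketch of returning int status codes instead of calling sys.exit()
# (illustrative only; names and response fields are assumptions).
import requests

API_URL = "https://cloud.lambdalabs.com/api/v1/instance-types"


def get_instance_types_with_capacity(api_key: str) -> dict:
    """Map instance type name -> regions with capacity; may raise on API errors."""
    resp = requests.get(
        API_URL, headers={"Authorization": f"Bearer {api_key}"}, timeout=30
    )
    resp.raise_for_status()
    data = resp.json().get("data", {})
    return {
        name: entry.get("regions_with_capacity_available", [])
        for name, entry in data.items()
    }


def check_availability(instance_type: str, api_key: str) -> int:
    """Return 0 when the instance type has capacity, non-zero otherwise."""
    try:
        capacity_map = get_instance_types_with_capacity(api_key)
    except Exception as err:
        # Per the contract above: report the error and return, never raise
        # or call sys.exit() from inside the function.
        print(f"API request failed: {err}")
        return 2
    if not capacity_map.get(instance_type):
        print(f"{instance_type}: no capacity available")
        return 1
    return 0
```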
Address review feedback about duplicate code in check_availability(). The
logic that builds a region_map dictionary from gpu_instances was duplicated
verbatim, violating the DRY principle.

Extract this common pattern into a private _build_region_map() helper
function that takes gpu_instances and returns the region-to-instance-type
mapping. Both the JSON output and text output code paths now call this
helper instead of duplicating the iteration logic.

Generated-by: Claude AI
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
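
A possible shape for such a helper is sketched below, assuming gpu_instances maps instance type names to lists of region dictionaries with a "name" key; that data shape is a guess for the sketch, not taken from the script.

```python
def _build_region_map(gpu_instances: dict) -> dict:
    """Map each region name to the instance types with capacity there.

    Sketch only: assumes gpu_instances maps instance type names to lists
    of region dicts carrying a "name" key, which may differ from the
    real data structure used by the script.
    """
    region_map: dict = {}
    for instance_type, regions in gpu_instances.items():
        for region in regions:
            region_map.setdefault(region["name"], []).append(instance_type)
    return region_map
```

Both output paths can then call this helper instead of repeating the nested iteration.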
The tier selection script outputs "instance_type region" which is then
parsed by splitting on whitespace and accessing indices [0] and [1].
If the script produces unexpected output such as an empty line or a
single word, the split operation produces a list with fewer than two
elements, causing Ansible to fail with a cryptic index error.

Add an explicit validation task using ansible.builtin.assert to verify
the output contains exactly two whitespace-separated values before
attempting to parse it. This provides a clear error message showing the
actual output when the format is invalid, making debugging easier.

Generated-by: Claude AI
Signed-off-by: Chuck Lever <cel@kernel.org>
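
In Python terms, the check the assert task performs before splitting amounts to something like this sketch; the helper name is invented for illustration and is not part of the playbook.

```python
def parse_selection(output: str) -> tuple[str, str]:
    """Validate and parse "instance_type region" output from the tier script."""
    fields = output.split()
    if len(fields) != 2:
        # Fail with the actual output rather than a cryptic index error.
        raise ValueError(f"expected 'instance_type region', got: {output!r}")
    return fields[0], fields[1]
```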
@chucklever chucklever merged commit 8a5387d into main Dec 16, 2025
20 of 22 checks passed
@chucklever chucklever deleted the cel/lambdalabs-gpu-selection branch December 16, 2025 18:03