
Conversation

@chucklever (Contributor)

Add support for tier-based GPU instance selection for Lambda Labs, similar to the existing DataCrunch implementation. This allows users to specify a maximum GPU tier and the system will automatically select the highest available GPU within that tier.

mcgrof and others added 4 commits December 16, 2025 11:05
Add support for tier-based GPU instance selection for Lambda Labs, similar
to the existing DataCrunch implementation. This allows users to specify
a maximum GPU tier and the system will automatically select the highest
available GPU within that tier.

The implementation adds capacity checking and tier selection scripts that
query the Lambda Labs API to find available instances. Single GPU tier
groups fall back from GH200 to H100 to A100 to A6000 to A10. Multi-GPU
tier groups fall back from 8x B200 to 8x H100 to 8x A100 to 8x V100.

New Kconfig options provide tier-based selections like H100_OR_LESS and
8X_H100_OR_LESS. The terraform ansible tasks detect these wildcard types
and invoke the tier selection script to find available capacity before
provisioning.

Defconfigs are provided for common tier combinations to simplify usage.
Users can now run commands like make defconfig-lambdalabs-h100-or-less
to get the best available single-GPU instance up to the H100 tier.

Generated-by: Claude AI
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
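
To illustrate the fallback behavior this commit describes, here is a minimal sketch of tier-based selection. It is not the code from this series: the API endpoint, the response fields, and the gpu_1x_* instance type names are assumptions made for the example.

```python
#!/usr/bin/env python3
# Minimal sketch of tier-based fallback selection (illustrative only).
# The endpoint, response shape, and instance type names are assumptions.
import os
import sys

import requests

API_URL = "https://cloud.lambdalabs.com/api/v1/instance-types"

# Single GPU tier group, highest GPU first (names are hypothetical).
SINGLE_GPU_TIER = [
    "gpu_1x_gh200",
    "gpu_1x_h100_pcie",
    "gpu_1x_a100",
    "gpu_1x_a6000",
    "gpu_1x_a10",
]


def pick_instance(tier, api_key):
    """Return (instance_type, region) for the first tier entry with capacity."""
    resp = requests.get(
        API_URL, headers={"Authorization": f"Bearer {api_key}"}, timeout=30
    )
    resp.raise_for_status()
    data = resp.json().get("data", {})
    for instance_type in tier:
        regions = data.get(instance_type, {}).get(
            "regions_with_capacity_available", []
        )
        if regions:
            return instance_type, regions[0]["name"]
    return None


if __name__ == "__main__":
    choice = pick_instance(SINGLE_GPU_TIER, os.environ["LAMBDALABS_API_KEY"])
    if choice is None:
        sys.exit("no capacity available in the requested tier")
    # Print the "instance_type region" pair that provisioning later parses.
    print(f"{choice[0]} {choice[1]}")
```

Provisioning would then consume the printed "instance_type region" pair, which is the output format the later commits in this series reference.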
Address review feedback regarding inconsistent error handling in the
check_availability function. The function contract implies returning int
values for both success and failure, but error paths were calling
sys.exit() directly.

Change the error handling to return non-zero integers instead of calling
sys.exit(), making the function consistent and easier to test. Remove
the unused instance_data binding from get_instance_types_with_capacity()
since only capacity_map is used. Add exception handling around the API
call so the function never raises unhandled exceptions.

Generated-by: Claude AI
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
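
A sketch of the described contract follows, with the usual caveat that the function bodies, endpoint, and response fields below are assumptions rather than the script's actual code.

```python
# Sketch of returning int status codes instead of calling sys.exit()
# (illustrative only; names and response fields are assumptions).
import requests

API_URL = "https://cloud.lambdalabs.com/api/v1/instance-types"


def get_instance_types_with_capacity(api_key: str) -> dict:
    """Map instance type name -> regions with capacity; may raise on API errors."""
    resp = requests.get(
        API_URL, headers={"Authorization": f"Bearer {api_key}"}, timeout=30
    )
    resp.raise_for_status()
    data = resp.json().get("data", {})
    return {
        name: entry.get("regions_with_capacity_available", [])
        for name, entry in data.items()
    }


def check_availability(instance_type: str, api_key: str) -> int:
    """Return 0 when the instance type has capacity, non-zero otherwise."""
    try:
        capacity_map = get_instance_types_with_capacity(api_key)
    except Exception as err:
        # Per the contract above: report the error and return, never raise
        # or call sys.exit() from inside the function.
        print(f"API request failed: {err}")
        return 2
    if not capacity_map.get(instance_type):
        print(f"{instance_type}: no capacity available")
        return 1
    return 0
```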
Address review feedback about duplicate code in check_availability(). The
logic that builds a region_map dictionary from gpu_instances was duplicated
verbatim, violating the DRY principle.

Extract this common pattern into a private _build_region_map() helper
function that takes gpu_instances and returns the region-to-instance-type
mapping. Both the JSON output and text output code paths now call this
helper instead of duplicating the iteration logic.

Generated-by: Claude AI
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
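
A possible shape for such a helper is sketched below, assuming gpu_instances maps instance type names to lists of region dictionaries with a "name" key; that data shape is a guess for the sketch, not taken from the script.

```python
def _build_region_map(gpu_instances: dict) -> dict:
    """Map each region name to the instance types with capacity there.

    Sketch only: assumes gpu_instances maps instance type names to lists
    of region dicts carrying a "name" key, which may differ from the
    real data structure used by the script.
    """
    region_map: dict = {}
    for instance_type, regions in gpu_instances.items():
        for region in regions:
            region_map.setdefault(region["name"], []).append(instance_type)
    return region_map
```

Both output paths can then call this helper instead of repeating the nested iteration.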
The tier selection script outputs "instance_type region" which is then
parsed by splitting on whitespace and accessing indices [0] and [1].
If the script produces unexpected output such as an empty line or a
single word, the split operation produces a list with fewer than two
elements, causing Ansible to fail with a cryptic index error.

Add an explicit validation task using ansible.builtin.assert to verify
the output contains exactly two whitespace-separated values before
attempting to parse it. This provides a clear error message showing the
actual output when the format is invalid, making debugging easier.

Generated-by: Claude AI
Signed-off-by: Chuck Lever <cel@kernel.org>
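
In Python terms, the check the assert task performs before splitting amounts to something like this sketch; the helper name is invented for illustration and is not part of the playbook.

```python
def parse_selection(output: str) -> tuple[str, str]:
    """Validate and parse "instance_type region" output from the tier script."""
    fields = output.split()
    if len(fields) != 2:
        # Fail with the actual output rather than a cryptic index error.
        raise ValueError(f"expected 'instance_type region', got: {output!r}")
    return fields[0], fields[1]
```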
@chucklever chucklever merged commit 8a5387d into main Dec 16, 2025
20 of 22 checks passed
@chucklever chucklever deleted the cel/lambdalabs-gpu-selection branch December 16, 2025 18:03