README.md (1 addition, 1 deletion)
@@ -526,7 +526,7 @@ Below are sections which get into technical details of how kdevops works.
* [Linux distribution support](docs/linux-distro-support.md)
* [Overriding all Ansible role options with one file](docs/ansible-override.md)
* [kdevops Vagrant support](docs/kdevops-vagrant.md)
* [kdevops terraform support - cloud setup with kdevops](docs/kdevops-terraform.md)
* [kdevops terraform and cloud provider support](docs/kdevops-terraform.md) - AWS, Azure, GCE, OCI, Lambda Labs, DataCrunch
* [kdevops local Ansible roles](docs/ansible-roles.md)
* [Tutorial on building your own custom Vagrant boxes](docs/custom-vagrant-boxes.md)

docs/kdevops-terraform.md (32 additions, 1 deletion)
@@ -7,10 +7,13 @@ a Terraform plan.
Terraform is used to deploy your development hosts on cloud virtual machines.
Below is the list of cloud providers currently supported:

**Traditional Cloud Providers:**
* azure - Microsoft Azure
* aws - Amazon Web Services
* gce - Google Cloud Compute
* oci - Oracle Cloud Infrastructure

**Neoclouds (GPU-optimized):**
* datacrunch - DataCrunch GPU Cloud
* lambdalabs - Lambda Labs GPU Cloud

@@ -271,7 +274,18 @@ If your Ansible controller (where you run "make bringup") and your
test instances operate inside the same subnet, you can disable the
TERRAFORM_OCI_ASSIGN_PUBLIC_IP option for better network security.

### DataCrunch - GPU Cloud Provider
## Neoclouds

A neocloud is a new type of specialized cloud provider that focuses on
high-performance computing, particularly GPU-as-a-Service, to handle demanding
AI and machine learning workloads. Unlike traditional, general-purpose cloud
providers such as AWS or Azure, neoclouds are purpose-built for AI, with
infrastructure optimized for raw speed, specialized hardware such as dense GPU
clusters, and tailored services such as fast deployment and simplified pricing.

kdevops supports the following neocloud providers:

### DataCrunch

kdevops supports DataCrunch, a cloud provider specialized in GPU computing
with competitive pricing for NVIDIA A100, H100, B200, and B300 instances.
@@ -450,3 +464,20 @@ provider_installation {
```

For more information, visit: https://datacrunch.io/

### Lambda Labs

kdevops supports Lambda Labs, a cloud provider focused on GPU instances for
machine learning workloads with competitive pricing.

For detailed documentation on Lambda Labs integration, including tier-based
GPU selection, smart instance selection, and dynamic Kconfig generation, see:

* [Lambda Labs Dynamic Cloud Kconfig](dynamic-cloud-kconfig.md) - Dynamic configuration generation for Lambda Labs
* [Lambda Labs CLI Reference](lambda-cli.1) - Man page for the lambda-cli tool

Lambda Labs offers various GPU instance types including A10, A100, and H100
configurations. kdevops provides smart selection features that automatically
choose the cheapest available instance type and region.
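
As a rough sketch of the idea (not the actual `lambdalabs_smart_inference.py`
implementation), smart selection boils down to querying the Lambda Cloud API
for instance types, keeping those with capacity, and taking the lowest hourly
price; the endpoint and response fields below follow the public API but should
be treated as assumptions.

```python
import requests


def cheapest_available(api_key):
    """Pick the cheapest instance type that has capacity in some region.

    Sketch only: the response fields used here ("data",
    "regions_with_capacity_available", "price_cents_per_hour") are
    assumptions based on the public Lambda Cloud API.
    """
    resp = requests.get(
        "https://cloud.lambdalabs.com/api/v1/instance-types",
        auth=(api_key, ""),  # API key as the basic-auth username
        timeout=30,
    )
    resp.raise_for_status()
    candidates = []
    for name, entry in resp.json()["data"].items():
        regions = entry.get("regions_with_capacity_available", [])
        if regions:
            price = entry["instance_type"]["price_cents_per_hour"]
            candidates.append((price, name, regions[0]["name"]))
    return min(candidates) if candidates else None  # (price, type, region)
```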

For more information, visit: https://lambdalabs.com/
terraform/lambdalabs/README.md (103 additions, 0 deletions)
@@ -8,6 +8,7 @@ This directory contains the Terraform configuration for deploying kdevops infras
- [Prerequisites](#prerequisites)
- [Quick Start](#quick-start)
- [Dynamic Configuration](#dynamic-configuration)
- [Tier-Based GPU Selection](#tier-based-gpu-selection)
- [SSH Key Security](#ssh-key-security)
- [Configuration Options](#configuration-options)
- [Provider Limitations](#provider-limitations)
@@ -111,6 +112,101 @@ scripts/lambda-cli --output json pricing list

For more details on the dynamic configuration system, see [Dynamic Cloud Kconfig Documentation](../../docs/dynamic-cloud-kconfig.md).

## Tier-Based GPU Selection

kdevops supports tier-based GPU selection on Lambda Labs with automatic fallback. Instead of
specifying a single instance type, you can specify a maximum tier, and kdevops will
automatically select the highest available GPU within that tier.

### How It Works

1. **Specify Maximum Tier**: Choose a tier group like `H100_OR_LESS`
2. **Capacity Check**: The system queries the Lambda Labs API for available instances
3. **Tier Fallback**: The system tries each tier from highest to lowest until one is available (see the sketch after this list)
4. **Auto-Provision**: Deploys to the first region with available capacity
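
A minimal Python sketch of this fallback logic follows, assuming the public
Lambda Cloud `instance-types` endpoint; the tier ordering and function names
are illustrative, not the actual `lambdalabs_select_tier.py` implementation.

```python
import os

import requests

# Hypothetical fallback order for the h100-or-less tier group; the real
# ordering lives in scripts/lambdalabs_select_tier.py.
H100_OR_LESS = [
    "gpu_1x_h100_sxm5",
    "gpu_1x_h100_pcie",
    "gpu_1x_a100_sxm4",
    "gpu_1x_a100",
    "gpu_1x_a6000",
    "gpu_1x_rtx6000",
    "gpu_1x_a10",
]


def select_tier(api_key, fallback_order):
    """Return (instance_type, region) for the highest available tier."""
    # Assumption: the public Lambda Cloud API, API key as basic-auth user.
    resp = requests.get(
        "https://cloud.lambdalabs.com/api/v1/instance-types",
        auth=(api_key, ""),
        timeout=30,
    )
    resp.raise_for_status()
    data = resp.json()["data"]
    for instance_type in fallback_order:  # highest tier first
        entry = data.get(instance_type)
        if not entry:
            continue
        regions = entry.get("regions_with_capacity_available", [])
        if regions:
            return instance_type, regions[0]["name"]
    return None, None


if __name__ == "__main__":
    selected, region = select_tier(os.environ["LAMBDALABS_API_KEY"], H100_OR_LESS)
    print(f"Selected: {selected} in {region}" if selected else "No capacity available")
```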

### Single GPU Tier Groups

| Tier Group | Fallback Order | Use Case |
|------------|----------------|----------|
| `GH200_OR_LESS` | GH200 → H100-SXM → H100-PCIe → A100-SXM → A100 → A6000 → RTX6000 → A10 | Maximum performance |
| `H100_OR_LESS` | H100-SXM → H100-PCIe → A100-SXM → A100 → A6000 → RTX6000 → A10 | High performance |
| `A100_OR_LESS` | A100-SXM → A100 → A6000 → RTX6000 → A10 | Cost-effective |
| `A6000_OR_LESS` | A6000 → RTX6000 → A10 | Budget-friendly |

### Multi-GPU (8x) Tier Groups

| Tier Group | Fallback Order | Use Case |
|------------|----------------|----------|
| `8X_B200_OR_LESS` | 8x B200 → 8x H100 → 8x A100-80 → 8x A100 → 8x V100 | Maximum multi-GPU |
| `8X_H100_OR_LESS` | 8x H100 → 8x A100-80 → 8x A100 → 8x V100 | High-end multi-GPU |
| `8X_A100_OR_LESS` | 8x A100-80 → 8x A100 → 8x V100 | Cost-effective multi-GPU |

### Quick Start with Tier Selection

```bash
# Single GPU - best available up to H100
make defconfig-lambdalabs-h100-or-less
make bringup

# Single GPU - best available up to GH200
make defconfig-lambdalabs-gh200-or-less
make bringup

# 8x GPU - best available up to H100
make defconfig-lambdalabs-8x-h100-or-less
make bringup
```

### Checking Capacity

Before deploying, you can check current GPU availability:

```bash
# Check all available GPU instances
python3 scripts/lambdalabs_check_capacity.py

# Check specific instance type
python3 scripts/lambdalabs_check_capacity.py --instance-type gpu_1x_h100_sxm5

# JSON output for scripting
python3 scripts/lambdalabs_check_capacity.py --json
```

### Tier Selection Script

The tier selection script finds the best available GPU:

```bash
# Find best single GPU up to H100
python3 scripts/lambdalabs_select_tier.py h100-or-less --verbose

# Find best 8x GPU up to H100
python3 scripts/lambdalabs_select_tier.py 8x-h100-or-less --verbose

# List all available tier groups
python3 scripts/lambdalabs_select_tier.py --list-tiers
```

Example output:
```
Checking tier group: h100-or-less
Tiers to check (highest to lowest): h100-sxm, h100-pcie, a100-sxm, a100, a6000, rtx6000, a10

Checking tier 'h100-sxm': gpu_1x_h100_sxm5
Checking gpu_1x_h100_sxm5... ✓ AVAILABLE in us-west-1

Selected: gpu_1x_h100_sxm5 in us-west-1 (tier: h100-sxm)
gpu_1x_h100_sxm5 us-west-1
```

### Benefits of Tier-Based Selection

- **Higher Success Rate**: Automatically falls back to available GPUs
- **No Manual Intervention**: System handles capacity changes
- **Best Performance**: Always gets the highest tier available
- **Simple Configuration**: One defconfig covers multiple GPU types

## SSH Key Security

### Automatic Unique Keys (Default - Recommended)
@@ -168,6 +264,11 @@ The default configuration automatically:
|--------|-------------|----------|
| `defconfig-lambdalabs` | Smart instance + unique SSH keys | Production (recommended) |
| `defconfig-lambdalabs-shared-key` | Smart instance + shared SSH key | Legacy/testing |
| `defconfig-lambdalabs-gh200-or-less` | Best single GPU up to GH200 | Maximum performance |
| `defconfig-lambdalabs-h100-or-less` | Best single GPU up to H100 | High performance |
| `defconfig-lambdalabs-a100-or-less` | Best single GPU up to A100 | Cost-effective |
| `defconfig-lambdalabs-8x-b200-or-less` | Best 8-GPU up to B200 | Maximum multi-GPU |
| `defconfig-lambdalabs-8x-h100-or-less` | Best 8-GPU up to H100 | High-end multi-GPU |

### Manual Configuration

@@ -274,6 +375,8 @@ The Lambda Labs Terraform provider (elct9620/lambdalabs v0.3.0) has significant
|--------|---------|
| `lambdalabs_api.py` | Main API integration, generates Kconfig |
| `lambdalabs_smart_inference.py` | Smart instance/region selection |
| `lambdalabs_check_capacity.py` | Check GPU availability across regions |
| `lambdalabs_select_tier.py` | Tier-based GPU selection with fallback |
| `lambdalabs_ssh_keys.py` | SSH key management |
| `lambdalabs_list_instances.py` | List running instances |
| `lambdalabs_credentials.py` | Manage API credentials |