docs(guides): add holodeck + AICR integration preview (provisioning + snapshot/recipe) #819
Merged: ArangoGutierrez merged 14 commits into NVIDIA:main from ArangoGutierrez:docs-aicr-integration-demo on May 25, 2026
Commits
c685a86 examples(aicr-demo): add g6e.xlarge L40S environment (ArangoGutierrez)
6f4d0b4 examples(aicr-demo): add minimal single-node SlurmCluster (ArangoGutierrez)
fd56471 docs(guides): add aicr-integration.md skeleton (ArangoGutierrez)
3d38b9c docs(guides): add aicr-integration phase 1 (holodeck provisioning) (ArangoGutierrez)
9cadf9c docs(guides): add aicr-integration phase 2.1 (snapshot) (ArangoGutierrez)
4c65ec6 docs(guides): add aicr-integration phase 2.2 (slurm track) (ArangoGutierrez)
fe7f86e docs(guides): add aicr-integration phase 2.3 (dynamo track) (ArangoGutierrez)
70d4648 docs(guides): add aicr-integration phase 2.4 (validate) (ArangoGutierrez)
2608915 docs(guides): add aicr-integration closing sections (ArangoGutierrez)
151a6bb docs(guides,examples): index the aicr-integration guide + example (ArangoGutierrez)
4ab8c79 examples(aicr-demo): drop slurm-cluster.yaml from v1 scope (ArangoGutierrez)
11e5ccb docs(guides): reduce aicr-integration v1 to provisioning + recipe pre… (ArangoGutierrez)
cfb1bbc docs(guides): quote hyphenated yq key in aicr-integration snapshot ex… (ArangoGutierrez)
636dfbb docs(guides): address PR #819 review feedback (ArangoGutierrez)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,278 @@ | ||
| # Holodeck + AICR: Provisioning and Cluster Snapshot (Preview) | ||
|
|
||
| > **Status: preview.** This guide currently covers the Day-0 half of | ||
| > the holodeck → AICR flow: provisioning a GPU cluster with Holodeck, | ||
| > then capturing a snapshot and generating a recipe with AICR. The | ||
| > end-to-end Slurm and Dynamo deploy paths require upstream changes | ||
| > that are still in flight (see [What's coming](#whats-coming)). | ||
|
|
||
| ## What you'll build | ||
|
|
||
| A single-node AWS `g6e.xlarge` instance (1× NVIDIA L40S), a kubeadm | ||
| Kubernetes cluster on top of it, an AICR snapshot describing that | ||
| cluster, and an AICR recipe matched to the snapshot — all in ~20 | ||
| minutes for about $2 of AWS spend. | ||
|
|
||
| ```mermaid | ||
| flowchart TD | ||
| R[Reader] -->|env.yaml| H[holodeck create --provision] | ||
| H -->|VPC + EC2 + kubeadm + drivers| C[g6e.xlarge + L40S + K8s 1.35] | ||
| C -->|aicr snapshot| S[snapshot.yaml] | ||
| S -->|aicr recipe| RX[recipe.yaml] | ||
| ``` | ||
|
|
||
| The reduced v1 stops at recipe generation. The bundle/deploy/validate | ||
| finale ships once the upstream catalog and platform gaps close. | ||
|
|
||
| ## Prerequisites | ||
|
|
||
| - `holodeck` v0.2.18+ installed (`make build && sudo mv ./bin/holodeck /usr/local/bin/`) | ||
| - `aicr` v0.12.0+ installed (`brew install NVIDIA/aicr/aicr` — see | ||
| [AICR installation](https://github.com/NVIDIA/aicr/blob/main/docs/user/installation.md)) | ||
| - AWS account with credentials in your environment and `g6e` quota in | ||
| `us-west-2` (request via the EC2 service quotas console) | ||
| - `kubectl` and `yq` on your path | ||
| - ~$2 of AWS spend budget (g6e.xlarge is roughly $1.86/hr on-demand | ||
| in `us-west-2`) | ||
|
|
||
| ## Phase 1 — Provision with Holodeck | ||
|
|
||
| ### 1.1 Configure | ||
|
|
||
| Open [`examples/aicr-demo/environment.yaml`](../../examples/aicr-demo/environment.yaml): | ||
|
|
||
| ```yaml | ||
| apiVersion: holodeck.nvidia.com/v1alpha1 | ||
| kind: Environment | ||
| metadata: | ||
| name: aicr-demo-l40s | ||
| spec: | ||
| provider: aws | ||
| auth: | ||
| keyName: <your key name here> | ||
| privateKey: <your key path here> | ||
| instance: | ||
| type: g6e.xlarge # 1x NVIDIA L40S (48 GiB VRAM), 4 vCPU, 32 GiB host RAM | ||
| region: us-west-2 | ||
| os: ubuntu-22.04 | ||
| image: { architecture: x86_64 } | ||
| containerRuntime: { install: true, name: containerd } | ||
| nvidiaContainerToolkit: { install: true } | ||
| nvidiaDriver: { install: true } | ||
| kubernetes: | ||
| install: true | ||
| installer: kubeadm | ||
| version: v1.35.0 | ||
| crictlVersion: v1.35.0 | ||
| ``` | ||
|
|
||
| What matters in this YAML: | ||
|
|
||
| - `provider: aws` + `auth` (`keyName`, `privateKey`) — the only fields you edit. | ||
| - `instance.type: g6e.xlarge` — cheapest cloud SKU with an L40S. | ||
| - `os: ubuntu-22.04` — AMI auto-resolved by region; SSH user auto-detected. | ||
| - `kubernetes.version: v1.35.0` — the current line; pair with a matching | ||
| `crictlVersion`. | ||
| - `containerd` + `nvidiaDriver` + `nvidiaContainerToolkit` — Day 0 | ||
| ends at host-level GPU access. Kubernetes-level GPU resources land | ||
| in Phase 2 once AICR installs the GPU Operator. | ||
|
|
||
| Copy the example into your working directory and fill in your AWS key: | ||
|
|
||
| ```bash | ||
| cp examples/aicr-demo/environment.yaml ./my-env.yaml | ||
| $EDITOR ./my-env.yaml # set auth.keyName and auth.privateKey | ||
| ``` | ||
|
|
||
| ### 1.2 Create the cluster | ||
|
|
||
| ```bash | ||
| holodeck create -f ./my-env.yaml --provision | ||
| ``` | ||
|
|
||
| The `--provision` flag is required: without it, `holodeck create` only | ||
| spins up the EC2 instance and stops there. With it, holodeck creates | ||
| a VPC, a security group, an EC2 instance, then runs the Ansible plays | ||
| that install driver, container runtime, toolkit, and kubeadm. Total | ||
| wall-clock is typically 6–8 minutes for create + provision; longer if | ||
| your AWS account is provisioning a fresh AMI. | ||
|
|
||
| Monitor progress: | ||
|
|
||
| ```bash | ||
| holodeck list | ||
| holodeck status <instance-id> | ||
| ``` | ||
|
|
||
| For a pre-flight check that does not touch AWS: | ||
|
|
||
| ```bash | ||
| holodeck dryrun -f ./my-env.yaml | ||
| ``` | ||
|
|
||
| On success, the instance shows `true` under the `PROVISIONED` column | ||
| of `holodeck list`. Note the instance ID (an 8-char hex string) | ||
| printed at the end. | ||
|
|
||
| ### 1.3 Fetch the kubeconfig | ||
|
|
||
| ```bash | ||
| holodeck get kubeconfig -o ./kubeconfig <instance-id> | ||
| ``` | ||
|
|
||
| Two things to know about the kubeconfig: | ||
|
|
||
| - The flag-then-positional order matters: `-o ./kubeconfig` must | ||
| come before the instance ID. The reverse order is rejected with | ||
| "instance ID is required". | ||
| - The server URL points at the instance's public DNS name (the | ||
| apiserver cert SAN now includes it). `kubectl` from your laptop | ||
| works without `--insecure-skip-tls-verify`. | ||
|
|
||
| ### 1.4 Verify the cluster | ||
|
|
||
| Point `kubectl` at the new cluster and confirm the node is Ready: | ||
|
|
||
| ```bash | ||
| export KUBECONFIG=$PWD/kubeconfig | ||
| kubectl get nodes -o wide | ||
| ``` | ||
|
|
||
| You should see a single node (control-plane and worker on the same | ||
| host) running `v1.35.0`. | ||
|
|
||
| > Note on GPU verification at Day 0: the `nvidia` `RuntimeClass` is | ||
| > not installed by holodeck. A Day-0 `runtimeClassName: nvidia` pod | ||
| > will fail with "RuntimeClass nvidia not found". Kubernetes-level | ||
| > GPU access — both the runtime class and the `nvidia.com/gpu` | ||
| > resource — is the GPU Operator's job, installed by AICR in Phase 2. | ||
|
|
||
| ## Phase 2 — Capture with AICR | ||
|
|
||
| ### 2.1 Snapshot the cluster | ||
|
|
||
| ```bash | ||
| aicr snapshot --output snapshot.yaml | ||
| ``` | ||
|
|
||
| A snapshot is AICR's read of your live cluster — node provider, GPU | ||
| model, kernel, container runtime, OS, K8s server version, installed | ||
| operators. It is the input AICR uses to derive a matching recipe. | ||
|
|
||
| Skim a few key fields: | ||
|
|
||
| ```bash | ||
| yq '.measurements[] | select(.type=="GPU") | .subtypes[] | .data.gpu.model // .data.gpu."product-architecture"' snapshot.yaml | ||
| yq '.measurements[] | select(.type=="Kubernetes") | .subtypes[0].data.server_version' snapshot.yaml | ||
| ``` | ||
|
|
||
| You should see `NVIDIA L40S` with architecture `Ada Lovelace`, and | ||
| `v1.35.0` for Kubernetes. Snapshot capture takes under a minute and | ||
| runs as a Job in your cluster's `default` namespace (the agent pod is | ||
| cleaned up after the snapshot is collected). | ||
|
|
||
| ### 2.2 Generate a recipe | ||
|
|
||
| ```bash | ||
| aicr recipe --snapshot snapshot.yaml \ | ||
| --intent inference --platform dynamo \ | ||
| --output recipe.yaml | ||
| ``` | ||
|
|
||
| The accelerator (L40S here) is inferred from the snapshot — no | ||
| `--accelerator` flag is needed. AICR matches the snapshot against its | ||
| overlay catalog and emits a recipe describing the components it would | ||
| install for the requested intent + platform. Inspect the resulting | ||
| component list: | ||
|
|
||
| ```bash | ||
| yq '.componentRefs[].name' recipe.yaml | ||
| ``` | ||
|
|
||
| On L40S today, you should see ten components in the recipe: | ||
| `cert-manager`, `gpu-operator`, `k8s-ephemeral-storage-metrics`, | ||
| `kai-scheduler`, `kube-prometheus-stack`, `nfd`, `nodewright-operator`, | ||
| `nvidia-dra-driver-gpu`, `nvsentinel`, `prometheus-adapter`. This is | ||
| the `base` overlay plus `monitoring-hpa`. | ||
|
|
||
| > Note: AICR's overlay catalog currently has rich coverage for H100, | ||
| > GB200, and B200 hardware, with L40S coverage still in flight. On | ||
| > L40S today the recipe is the base set — the `dynamo-platform` | ||
| > component is not yet matched. See [What's coming](#whats-coming). | ||
|
|
||
| The snapshot → recipe step demonstrates the matching mechanic that | ||
| makes AICR reproducible: given the same snapshot, you get the same | ||
| recipe, and that recipe is what `aicr bundle` would turn into a | ||
| deployable Helm chart sequence in the full demo. | ||
|
|
||
| ## What's coming | ||
|
|
||
| The full end-to-end demo (Slurm batch job + Dynamo chat-completions | ||
| finale) is gated on three upstream changes: | ||
|
|
||
| - **AICR PR #866 (Slinky/Slurm)** — adds `--platform slurm`. Currently | ||
| on the feature branch `feat/slinky-slurm-operator`, not yet merged | ||
| to `main`. Enables the Phase 2 Slurm track. | ||
| - **AICR overlay catalog — L40 coverage** — `--accelerator l40` today | ||
| matches only the `base` + `monitoring-hpa` overlays, so the inference | ||
| recipe does not include `dynamo-platform`. Tracked upstream. | ||
| - **GPU Operator on kubeadm + holodeck driver** — the `gdrcopy-validation` | ||
| init container of `nvidia-operator-validator` requires the `gdrdrv` | ||
|
Comment on lines +203 to +220 (Member):
The "What's coming" section is a strong addition: naming the three upstream gates (Slurm PR #866, L40 overlay coverage,
||
| kernel module, which is not installed by holodeck's Ansible plays. | ||
| Until either side closes that gap, the bundle deploy stalls on the | ||
| GPU Operator step. | ||
|
|
||
| When these land, the doc grows back the bundle/deploy/validate sections | ||
| and the Slurm + Dynamo tracks of the original design. | ||
|
|
||
| ## Troubleshooting | ||
|
|
||
| **`g6e` quota denied.** New AWS accounts often don't have GPU quota in | ||
| `us-west-2`. Request it via the Service Quotas console: search for | ||
| "Running On-Demand G and VT instances" → request 4 vCPUs minimum. | ||
|
|
||
| **`holodeck create` hangs on the Ansible play.** The driver install | ||
| step can take 4–6 minutes. Watch the holodeck output; if it is still | ||
| making SSH calls, wait. If it errors with "kernel headers not found", | ||
| confirm that the `os` value (`ubuntu-22.04` here) names an Ubuntu | ||
| release with stable NVIDIA driver kernel-module packaging. | ||
|
|
||
| **`holodeck update <id> --reprovision` says "instance ID is required".** | ||
| The flag must precede the positional ID: | ||
| `holodeck update --reprovision <id>`. The CLI parser does not accept | ||
| the reverse order despite what the `--help` examples show. Same | ||
| ordering applies to `holodeck get kubeconfig -o <path> <id>`. | ||
|
|
||
| **`holodeck create` fails with "could not detect public IP".** | ||
| Holodeck derives the security-group ingress CIDR from the caller's | ||
| public IP and aborts rather than opening the SG to the world when | ||
| detection fails. Either set `spec.ingressIpRanges` explicitly in | ||
| your `env.yaml`, or re-run `holodeck create` from a network with a | ||
| routable public IP. | ||
|
|
||
| **`aicr snapshot` fails to read kernel config.** The agent emits a | ||
| non-fatal `failed to read kconfig` warning on holodeck-provisioned | ||
| nodes that don't ship `/proc/config.gz` or kernel headers. The | ||
| snapshot still completes; the missing kconfig data only matters for | ||
| recipes that constrain on specific kernel features. | ||
|
|
||
| ## Next steps + cleanup | ||
|
|
||
| Tear down when you're done: | ||
|
|
||
| ```bash | ||
| holodeck delete <instance-id> | ||
| ``` | ||
|
|
||
| The teardown also removes the security group, subnet, route table, | ||
| internet gateway, and VPC that holodeck created. | ||
|
|
||
| Where to go next: | ||
|
|
||
| - [Multi-node clusters](multinode-clusters.md) — scale beyond | ||
| single-node for real Slurm and inference workloads. | ||
| - AICR variants on managed Kubernetes — see | ||
| [cuj1-eks.md](https://github.com/NVIDIA/aicr/blob/main/demos/cuj1-eks.md) | ||
| and [cuj1-gke.md](https://github.com/NVIDIA/aicr/blob/main/demos/cuj1-gke.md). | ||
| - [AICR component catalog](https://github.com/NVIDIA/aicr/blob/main/docs/user/component-catalog.md) | ||
| — every operator a recipe can install once the L40 overlays land. | ||
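The guide's budget figure can be sanity-checked with quick arithmetic. The sketch below is a hypothetical helper (`estimate_cost` is not part of holodeck or AICR) that multiplies an hourly rate by a run time in minutes. It assumes the guide's approximate on-demand rate of $1.86/hr for `g6e.xlarge` and ignores any other charges, such as storage or data transfer.

```bash
#!/usr/bin/env bash
# Hypothetical helper, not part of holodeck or AICR.
# Usage: estimate_cost <usd_per_hour> <minutes>
# Prints the estimated instance cost in USD, rounded to two decimals.
estimate_cost() {
  awk -v rate="$1" -v minutes="$2" \
    'BEGIN { printf "%.2f\n", rate * minutes / 60 }'
}

estimate_cost 1.86 20   # the guide's ~20 minute walkthrough
estimate_cost 1.86 60   # an instance left running for a full hour
```

At the assumed rate the 20-minute walkthrough comes to roughly $0.62 of instance time, so the ~$2 budget covers about an hour. Running `holodeck delete <instance-id>` promptly is what keeps the spend inside that figure.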
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,36 @@ | ||
| # Holodeck environment for the AICR integration guide | ||
| # (docs/guides/aicr-integration.md). | ||
| # | ||
| # Provisions a single-node g6e.xlarge (1x NVIDIA L40S, 48 GiB VRAM) | ||
| # on AWS, installs kubeadm-based Kubernetes, NVIDIA driver, container | ||
| # toolkit, and containerd. End state: a GPU-ready K8s cluster that | ||
| # AICR can snapshot and generate a recipe against. | ||
|
|
||
| apiVersion: holodeck.nvidia.com/v1alpha1 | ||
| kind: Environment | ||
| metadata: | ||
| name: aicr-demo-l40s | ||
| description: "Day-0 cluster for the AICR integration demo" | ||
| spec: | ||
| provider: aws | ||
| auth: | ||
| keyName: <your key name here> | ||
| privateKey: <your key path here> | ||
| instance: | ||
| type: g6e.xlarge # 1x NVIDIA L40S (48 GiB VRAM), 4 vCPU, 32 GiB host RAM | ||
| region: us-west-2 | ||
| os: ubuntu-22.04 | ||
| image: | ||
| architecture: x86_64 | ||
| containerRuntime: | ||
| install: true | ||
| name: containerd | ||
| nvidiaContainerToolkit: | ||
| install: true | ||
| nvidiaDriver: | ||
| install: true | ||
| kubernetes: | ||
| install: true | ||
| installer: kubeadm | ||
| version: v1.35.0 # AICR requires K8s 1.34+ | ||
| crictlVersion: v1.35.0 |
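Because the example ships with `<your key name here>` and `<your key path here>` placeholders, a small pre-flight check can catch an unedited copy before any AWS resources are created. The sketch below is one possible approach; `check_placeholders` is a hypothetical helper, not a holodeck command, and the placeholder pattern is assumed from the two `auth` fields shown above.

```bash
#!/usr/bin/env bash
# Hypothetical pre-flight helper, not part of the holodeck CLI.
# Returns non-zero if the file still contains "<your ... here>" placeholders,
# printing each offending line with its line number.
check_placeholders() {
  local file="$1"
  if grep -nE '<your [^>]* here>' "$file"; then
    echo "error: fill in the placeholders above before running holodeck" >&2
    return 1
  fi
  echo "ok: no placeholders left in $file"
}
```

A possible use is to gate the create step on it, for example `check_placeholders ./my-env.yaml && holodeck create -f ./my-env.yaml --provision`.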
nit: the "What's coming" section at line 213 talks about `--accelerator l40`, but this command doesn't pass `--accelerator` at all. Worth one sentence here clarifying that the accelerator is inferred from the snapshot; otherwise readers may think they need to add `--accelerator l40` after seeing the L40 note below.
Added a one-sentence clarification in 106cfc4 before the AICR-matching explanation: the accelerator is inferred from the snapshot, so there is no `--accelerator` flag in this command. This keeps the `--accelerator l40` reference in "What's coming" scoped to the underlying overlay-catalog gate.
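One way to observe the overlay-catalog gate discussed in this thread is to compare the generated recipe against the ten components the guide lists for L40S. The sketch below is a hypothetical helper (`missing_components` is not an AICR command); it assumes component names arrive one per line on stdin, which is what the guide's `yq '.componentRefs[].name'` query is expected to print.

```bash
#!/usr/bin/env bash
# Hypothetical helper, not part of AICR. Reads component names on stdin
# (one per line) and prints each expected name that is absent.
# The expected list is the ten components the guide reports for L40S today.
missing_components() {
  local expected="cert-manager gpu-operator k8s-ephemeral-storage-metrics
    kai-scheduler kube-prometheus-stack nfd nodewright-operator
    nvidia-dra-driver-gpu nvsentinel prometheus-adapter"
  local actual name
  actual="$(cat)"
  for name in $expected; do
    # -x matches whole lines only, -F treats the name as a literal string
    printf '%s\n' "$actual" | grep -qxF -- "$name" || echo "$name"
  done
}
```

Piping the guide's query into it, as in `yq '.componentRefs[].name' recipe.yaml | missing_components`, should print nothing when all ten components are present. A separate `grep -qx dynamo-platform` on the same output would show whether the L40 overlays have landed; per the guide, that component is not matched on L40S today.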