feat(tasks): add cluster:gpu task and wire e2e:gpu to use it #547
Open
pimlock (Collaborator) requested changes on Mar 23, 2026 and left a comment:
Thank you, this is going to be very handy for GPU cluster dev.
Could you please replace the separate cluster:gpu task with an env var?
tasks/test.toml
Outdated
  ["e2e:python:gpu"]
  description = "Run Python GPU e2e tests"
- depends = ["python:proto", "cluster"]
+ depends = ["python:proto", "cluster:gpu"]
Collaborator
In general, we prefer not to add new tasks unless really necessary. In this case, we could drop the extra task and set the env var here directly (mise supports this).
Suggested change
- depends = ["python:proto", "cluster:gpu"]
+ depends = ["python:proto", "CLUSTER_GPU=1 cluster"]
tasks/cluster.toml
Outdated
Comment on lines +10 to +14
+ ["cluster:gpu"]
+ description = "Bootstrap or incremental deploy with NVIDIA GPU passthrough enabled"
+ env = { CLUSTER_GPU = "1" }
+ run = "tasks/scripts/cluster.sh"
Collaborator
We can remove this as a separate task and use the env var instead.
Member
Author
This has been removed; the env var is now set in the depends of the e2e:python:gpu task instead.
Pass CLUSTER_GPU=1 inline in e2e:python:gpu's depends so that the cluster is bootstrapped with --gpu when GPU e2e tests are run. Add --gpu flag handling to cluster-bootstrap.sh and default OPENSHELL_E2E_GPU_IMAGE to an empty string so the server resolves the default sandbox image when no override is provided.

Signed-off-by: Evan Lezar <elezar@nvidia.com>
3bd624e to c0f9dcf
Summary
Adds a cluster:gpu mise task that bootstraps the cluster with NVIDIA GPU passthrough enabled (--gpu flag), and updates e2e:python:gpu to depend on it instead of the plain cluster task. Also updates gpu_sandbox_spec in the e2e conftest to default to an empty image string, deferring image resolution to the server.
Related Issue
Changes
- tasks/cluster.toml: add cluster:gpu task with CLUSTER_GPU=1 env var
- tasks/scripts/cluster-bootstrap.sh: pass --gpu to openshell gateway start when CLUSTER_GPU=1
- tasks/test.toml: wire e2e:python:gpu to depend on cluster:gpu instead of cluster
- e2e/python/conftest.py: default GPU sandbox image to "" so the server resolves the configured default; allow override via OPENSHELL_E2E_GPU_IMAGE
Testing
- mise run pre-commit passes
- Unit tests added/updated (not applicable)
- E2E tests added/updated (if applicable)
Checklist
- Architecture docs updated (if applicable)
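
The conftest.py change listed under Changes (default the GPU sandbox image to "" and allow an override via OPENSHELL_E2E_GPU_IMAGE) could look roughly like the sketch below. This is not the code from the PR: only the environment variable name comes from the description, while resolve_gpu_image, sketch_gpu_sandbox_spec, and the dictionary fields are hypothetical names chosen for illustration, since the real shape of gpu_sandbox_spec is not shown here.

```python
import os
from typing import Mapping, Optional

# Env var named in the PR description; everything else here is illustrative.
GPU_IMAGE_ENV = "OPENSHELL_E2E_GPU_IMAGE"


def resolve_gpu_image(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the GPU sandbox image override, or "" when none is set.

    Per the PR description, an empty string leaves image resolution to the
    server, which then uses its configured default.
    """
    env = os.environ if env is None else env
    return env.get(GPU_IMAGE_ENV, "")


def sketch_gpu_sandbox_spec(env: Optional[Mapping[str, str]] = None) -> dict:
    # Hypothetical stand-in for the real gpu_sandbox_spec fixture; the actual
    # fields are an assumption and may differ.
    return {"image": resolve_gpu_image(env), "gpu": True}
```

Passing the environment as a parameter keeps the lookup easy to exercise in a test without mutating os.environ; a real fixture would more likely read os.environ directly.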