To use XPK with GPUs, pass a GPU device type through the --device-type flag.
- Cluster Create (provision reserved capacity):
    # Find your reservations
    gcloud compute reservations list --project=$PROJECT_ID
    # Run cluster create with reservation.
    xpk cluster create \
    --cluster xpk-test --device-type=h100-80gb-8 \
    --num-nodes=2 \
    --reservation=$RESERVATION_ID
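As a rough sizing aid for the create command above, the sketch below estimates how many GPUs a cluster provisions. It assumes that the trailing number of the device type (the 8 in h100-80gb-8) is the GPU count per node; that reading is an assumption, and the helper names gpus_per_node and total_gpus are hypothetical, not part of XPK.

```shell
#!/usr/bin/env bash
# Assumption: the last dash-separated field of the device type is the
# number of GPUs per node (e.g. h100-80gb-8 -> 8). Not an XPK API.
gpus_per_node() {
  local device_type="$1"
  # Strip everything up to and including the last dash.
  echo "${device_type##*-}"
}

# Multiply GPUs per node by the value passed to --num-nodes.
total_gpus() {
  local device_type="$1" num_nodes="$2"
  echo $(( $(gpus_per_node "$device_type") * num_nodes ))
}
```

For the example above, `total_gpus h100-80gb-8 2` prints 16 under that assumption.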
- Cluster Delete (deprovision capacity):
    xpk cluster delete \
    --cluster xpk-test
- Cluster List (see provisioned capacity):
    xpk cluster list
- Cluster Describe (see capacity):
    xpk cluster describe \
    --cluster xpk-test
- Cluster Cacheimage (enables faster start times):
    xpk cluster cacheimage \
    --cluster xpk-test --docker-image gcr.io/your_docker_image \
    --device-type=h100-80gb-8
- Install NVIDIA GPU device drivers:
    # List available driver versions
    gcloud compute ssh $NODE_NAME --command "sudo cos-extensions list"
    # Install the default driver
    gcloud compute ssh $NODE_NAME --command "sudo cos-extensions install gpu"
    # OR install a specific version of the driver
    gcloud compute ssh $NODE_NAME --command "sudo cos-extensions install gpu -- -version=DRIVER_VERSION"
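The driver commands above target a single $NODE_NAME. When a cluster has several GPU nodes, one possible approach is to loop over the node names. The helper below is a hypothetical sketch: install_gpu_driver and the DRY_RUN variable are illustrative names, and it assumes gcloud already has a default project and zone configured. By default it only prints the commands it would run.

```shell
#!/usr/bin/env bash
# Hypothetical helper: install the default GPU driver on each node given
# as an argument. With DRY_RUN=1 (the default) commands are only printed.
install_gpu_driver() {
  local node
  for node in "$@"; do
    if [ "${DRY_RUN:-1}" = "1" ]; then
      echo "gcloud compute ssh $node --command \"sudo cos-extensions install gpu\""
    else
      gcloud compute ssh "$node" --command "sudo cos-extensions install gpu"
    fi
  done
}
```

Calling `install_gpu_driver node-a node-b` prints one gcloud command per node; setting DRY_RUN=0 runs them instead.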
- Run a workload:
    # Submit a workload
    xpk workload create \
    --cluster xpk-test --device-type h100-80gb-8 \
    --workload xpk-test-workload \
    --command="echo hello world"
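Because the cluster name and device type must match between cluster create and workload create, it can help to keep them in variables. The wrapper below is a minimal sketch under that idea: submit_workload, run, CLUSTER, DEVICE_TYPE and DRY_RUN are hypothetical names, and only the xpk flags shown above are used. With DRY_RUN=1 (the default) it prints the command rather than submitting anything.

```shell
#!/usr/bin/env bash
# Hypothetical defaults; override them in the environment as needed.
CLUSTER="${CLUSTER:-xpk-test}"
DEVICE_TYPE="${DEVICE_TYPE:-h100-80gb-8}"
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
# Note: the printed form is for inspection only and is not re-quoted.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Submit a workload with the shared cluster name and device type.
submit_workload() {
  local name="$1" cmd="$2"
  run xpk workload create \
    --cluster "$CLUSTER" --device-type "$DEVICE_TYPE" \
    --workload "$name" \
    --command="$cmd"
}
```

For example, `submit_workload xpk-test-workload "echo hello world"` prints the equivalent of the command above; set DRY_RUN=0 to actually submit it.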
- Workload Delete (delete training job):
    xpk workload delete \
    --workload xpk-test-workload --cluster xpk-test
  This will only delete the xpk-test-workload workload in the xpk-test cluster.
- Workload Delete (delete all training jobs in the cluster):
    xpk workload delete \
    --cluster xpk-test
  This will delete all the workloads in the xpk-test cluster. Deletion will only begin if you type y or yes at the prompt.
- Workload Delete supports filtering. Delete only the jobs that match your criteria.
  - Filter by Job: filter-by-job
      xpk workload delete \
      --cluster xpk-test --filter-by-job=$USER
    This will delete all the workloads in the xpk-test cluster whose names start with $USER. Deletion will only begin if you type y or yes at the prompt.
  - Filter by Status: filter-by-status
      xpk workload delete \
      --cluster xpk-test --filter-by-status=QUEUED
    This will delete all the queued workloads in the xpk-test cluster, that is, those whose status is Admitted or Evicted and that have 0 running VMs. Deletion will only begin if you type y or yes at the prompt. Status can be: EVERYTHING, FINISHED, RUNNING, QUEUED, FAILED, SUCCESSFUL.
  - Filter by Job: