@cjac cjac commented Feb 12, 2025

[gpu] strict driver and cuda version assignment

This change re-applies the changes from PR #1275 and fixes some issues users were experiencing when specifying parameters.

gpu/install_gpu_driver.sh

  • updated supported versions

  • moved all code into functions, which are called at the footer of
    the installer

  • install cuda and driver exclusively from run files

  • extract cuda and driver version from urls if supplied

  • support supplying cuda version as x.y.z instead of just x.y

  • build nccl from source

  • poll dpkg lock status for up to 60 seconds

  • cache build artifacts from kernel driver and nccl

  • use consistent arguments to curl

  • create is_complete and mark_complete functions to allow re-running

  • Tested more CUDA minor versions

  • Printing warnings when the provided combination is known to fail

  • only install build dependencies on build cache miss

  • added optional pytorch install option

  • renamed metadata attribute cert_modulus_md5sum to modulus_md5sum

  • verified that proprietary kernel drivers work with older dataproc images

  • clear dkms key immediately after use

  • cache .run files to GCS to reduce fetches from origin

  • Install nvidia container toolkit and select container runtime

  • tested installer on clusters without GPUs attached

  • fixed a problem with the ops agent not installing by using a venv

  • caching result of nvidia-smi in spark.executor.resource.gpu.discoveryScript

  • suggesting some reasonable defaults in /etc/spark/conf.dist/spark-defaults.conf

  • Installing gcc-12 on ubuntu22 to fix a kernel driver FTBFS (failure to build from source)

  • Hold all NVIDIA-related packages from upgrading unintentionally

  • skipping proxy setup if http-proxy metadata not set

  • added function to check secure-boot and os version compatibility

  • harden sshd config

  • install spark rapids acceleration libraries
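The is_complete/mark_complete re-run guard described above might look roughly like the following sketch. The state directory, the step names, and the install_driver body are illustrative assumptions, not the installer's actual code.

```shell
#!/usr/bin/env bash
# Sketch of an idempotency guard so the installer can be re-run safely.
# STATE_DIR and the step names here are hypothetical placeholders.
STATE_DIR="${STATE_DIR:-${TMPDIR:-/tmp}/gpu-install-state}"

is_complete() {
  # Succeeds only if the named step was previously marked complete.
  [ -f "${STATE_DIR}/$1.done" ]
}

mark_complete() {
  mkdir -p "${STATE_DIR}"
  touch "${STATE_DIR}/$1.done"
}

install_driver() {
  # Skip the expensive work when a prior run already finished it.
  is_complete driver && return 0
  echo "installing driver..."   # stand-in for the real work
  mark_complete driver
}
```

With this pattern each expensive step checks its marker before doing any work, so re-running the whole installer after a partial failure only repeats the steps that never completed.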

gpu/manual-test-runner.sh

  • order commands correctly

gpu/run-bazel-tests.sh

  • do not retry flaky tests

gpu/test_gpu.py

  • clearer test skipping logic
  • added instructions on how to test pytorch
  • remove skip of rocky9 tests
  • There are now three tests run from the verify_instance_spark function
    • Run the SparkPi example with no parameters specified
    • Run the JavaIndexToStringExample with many parameters specified
    • Run the JavaIndexToStringExample with few parameters specified

Roll forward GoogleCloudDataproc#1275

gpu/install_gpu_driver.sh
  * updated supported versions
  * moved all code into functions, which are called at the footer of
    the installer
  * install cuda and driver exclusively from run files
  * extract cuda and driver version from urls if supplied
  * support supplying cuda version as x.y.z instead of just x.y
  * build nccl from source
  * poll dpkg lock status for up to 60 seconds
  * cache build artifacts from kernel driver and nccl
  * use consistent arguments to curl
  * create is_complete and mark_complete functions to allow re-running
  * Tested more CUDA minor versions
  * Printing warnings when the provided combination is known to fail
  * only install build dependencies on build cache miss
  * added optional pytorch install option
  * renamed metadata attribute cert_modulus_md5sum to modulus_md5sum
  * verified that proprietary kernel drivers work with older dataproc images
  * clear dkms key immediately after use
  * cache .run files to GCS to reduce fetches from origin
  * Install nvidia container toolkit and select container runtime
  * tested installer on clusters without GPUs attached
  * fixed a problem with the ops agent not installing by using a venv
  * Older CapacityScheduler does not permit use of GPU resources;
    switch to FairScheduler on 2.0 and below
  * caching result of nvidia-smi in spark.executor.resource.gpu.discoveryScript
  * setting some reasonable defaults in /etc/spark/conf.dist/spark-defaults.conf
  * Installing gcc-12 on ubuntu22 to fix a kernel driver FTBFS (failure to build from source)
  * Hold all NVIDIA-related packages from upgrading unintentionally
  * skipping proxy setup if http-proxy metadata not set
  * added function to check secure-boot and os version compatibility
  * harden sshd config
  * install spark rapids acceleration libraries
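Polling the dpkg lock for up to 60 seconds, as listed above, could be sketched as below. The two-second interval, the lock path parameter, and the use of `fuser` are assumptions about the general approach, not the installer's exact code.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: wait up to a timeout for the dpkg lock to clear.
wait_for_dpkg_lock() {
  local timeout="${1:-60}" lock="${2:-/var/lib/dpkg/lock-frontend}"
  local deadline=$(( $(date +%s) + timeout ))
  # fuser succeeds while some process still holds the lock file open.
  while fuser "${lock}" >/dev/null 2>&1; do
    if [ "$(date +%s)" -ge "${deadline}" ]; then
      echo "dpkg lock still held after ${timeout}s" >&2
      return 1
    fi
    sleep 2
  done
  return 0
}
```

Calling `wait_for_dpkg_lock 60` before each apt/dpkg invocation avoids racing against unattended-upgrades or another package operation that may already be running on a fresh node.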

gpu/manual-test-runner.sh
  * order commands correctly

gpu/run-bazel-tests.sh
  * do not retry flaky tests

gpu/test_gpu.py
  * clearer test skipping logic
  * added instructions on how to test pyspark
  * remove skip of rocky9 tests
@cjac cjac self-assigned this Feb 12, 2025
@cjac cjac marked this pull request as draft February 12, 2025 06:30

cjac commented Feb 12, 2025

/gcbrun


cjac commented Feb 13, 2025

/gcbrun


cjac commented Feb 13, 2025

/gcbrun

gpu/install_gpu_driver.sh

* Do not use fair scheduler for 2.0 clusters
* comment out spark-defaults.conf config options as guidance for tuning
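Commenting out the config options as tuning guidance might look roughly like this. The property names shown are common Spark/RAPIDS tuning knobs used as illustrative examples; they are not necessarily the exact ones the installer writes.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: append commented-out tuning hints rather than
# enforcing values globally in spark-defaults.conf.
append_spark_tuning_hints() {
  local conf="${1:-/etc/spark/conf.dist/spark-defaults.conf}"
  cat >> "${conf}" <<'EOF'
# Suggested GPU tuning starting points -- uncomment and adjust per job:
#spark.executor.resource.gpu.amount=1
#spark.task.resource.gpu.amount=0.25
#spark.rapids.sql.enabled=true
EOF
}
```

Leaving the lines commented shows the customer which knobs exist without imposing application-specific values on every job that runs on the cluster.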

gpu/test_gpu.py

* There are now three tests run from the verify_instance_spark function
  * Run the SparkPi example with no parameters specified
  * Run the JavaIndexToStringExample with many parameters specified
  * Run the JavaIndexToStringExample with few parameters specified
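The three invocations above might look roughly like this sketch. The jar path, example class names, and `--conf` values are assumptions for illustration; the actual harness in test_gpu.py assembles its commands differently.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the three verify_instance_spark checks.
jar="${SPARK_EXAMPLES_JAR:-/usr/lib/spark/examples/jars/spark-examples.jar}"

run_sparkpi_no_params() {
  # 1) SparkPi with no extra parameters specified
  spark-submit --class org.apache.spark.examples.SparkPi "${jar}" 100
}

run_index_to_string_many_params() {
  # 2) JavaIndexToStringExample with many parameters specified
  spark-submit \
    --class org.apache.spark.examples.ml.JavaIndexToStringExample \
    --conf spark.executor.resource.gpu.amount=1 \
    --conf spark.task.resource.gpu.amount=0.25 \
    "${jar}"
}

run_index_to_string_few_params() {
  # 3) JavaIndexToStringExample with few parameters specified
  spark-submit --class org.apache.spark.examples.ml.JavaIndexToStringExample "${jar}"
}
```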

cloudbuild/presubmit.sh

* added a continue statement to skip running all tests
* to be removed before merge

cjac commented Feb 13, 2025

/gcbrun


cjac commented Feb 13, 2025

This change consists of two commits:

The first commit re-applies the changes from PR #1275.

The second commit removes the selection of the Fair scheduler on 2.0 images. Newer releases of Dataproc 2.0 include patches to the Capacity scheduler which allow GPU to be specified as a resource, negating the need for this work-around.

The second commit also replaces the new configuration parameters with comments. This helps the customer know which parameters can be tuned, while not enforcing application-specific configuration globally.

It also adds new tests to exercise the use case which, on failure, caused the build to be rolled back.

@cjac cjac requested review from cnauroth and prince-cs February 13, 2025 02:41

@prince-cs prince-cs left a comment


LGTM


cjac commented Feb 13, 2025

Thank you, Prince! Is there any way you can exercise the internal tests without us having to merge the release and publish to GCS?

@prince-cs

> Thank you, Prince! Is there any way you can exercise the internal tests without us having to merge the release and publish to GCS?

Sure. Let me try that.

@prince-cs

Cluster is up and running.

@cjac cjac marked this pull request as ready for review February 13, 2025 08:38

cjac commented Feb 13, 2025

@jayadeep-jayaraman would you give the second commit a review? Chris reviewed the first commit, but we found that our internal tests were failing, so the change was rolled back. The second commit resolves the issues we were seeing, as confirmed by Prince.


@cnauroth cnauroth left a comment


LGTM, @cjac . I left one comment. Thank you!

@cjac cjac merged commit 7e87522 into GoogleCloudDataproc:master Feb 13, 2025
1 of 2 checks passed

cjac commented Feb 14, 2025

While testing the P4, P100 and A100 today, I found a problem in this change which will adversely affect spark when multiple GPUs are attached. I will open an issue to track the fix. Please do not roll init actions out to prod yet.

@cjac cjac changed the title Roll forward PR #1275 with fixes [gpu] strict driver and cuda version assignment Dec 6, 2025