@cjac cjac commented Feb 12, 2025

[gpu] strict driver and cuda version assignment

This change re-applies the changes from PR #1275 and fixes some issues users were experiencing when specifying parameters.

gpu/install_gpu_driver.sh

  • updated supported versions

  • moved all code into functions, which are called at the footer of
    the installer

  • install cuda and driver exclusively from run files

  • extract cuda and driver version from urls if supplied

  • support supplying cuda version as x.y.z instead of just x.y

  • build nccl from source

  • poll dpkg lock status for up to 60 seconds

  • cache build artifacts from kernel driver and nccl

  • use consistent arguments to curl

  • create is_complete and mark_complete functions to allow re-running

  • Tested more CUDA minor versions

  • Printing warnings when the provided combination is known to fail

  • only install build dependencies on build cache miss

  • added optional pytorch install option

  • renamed metadata attribute cert_modulus_md5sum to modulus_md5sum

  • verified that proprietary kernel drivers work with older dataproc images

  • clear dkms key immediately after use

  • cache .run files to GCS to reduce fetches from origin

  • Install nvidia container toolkit and select container runtime

  • tested installer on clusters without GPUs attached

  • fixed a problem with the ops agent not installing by using a venv

  • caching result of nvidia-smi in spark.executor.resource.gpu.discoveryScript

  • suggesting some reasonable defaults in /etc/spark/conf.dist/spark-defaults.conf

  • Installing gcc-12 on ubuntu22 to fix a kernel driver FTBFS (failure to build from source)

  • Hold all NVIDIA-related packages from upgrading unintentionally

  • skipping proxy setup if http-proxy metadata not set

  • added function to check secure-boot and os version compatibility

  • harden sshd config

  • install spark rapids acceleration libraries
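The is_complete/mark_complete re-run guard described above might look roughly like the following sketch. The state directory, the step names, and the install_driver body are illustrative assumptions, not the installer's actual code.

```shell
#!/usr/bin/env bash
# Sketch of an idempotency guard so the installer can be re-run safely.
# STATE_DIR and the step names here are hypothetical placeholders.
STATE_DIR="${STATE_DIR:-${TMPDIR:-/tmp}/gpu-install-state}"

is_complete() {
  # Succeeds only if the named step was previously marked complete.
  [ -f "${STATE_DIR}/$1.done" ]
}

mark_complete() {
  mkdir -p "${STATE_DIR}"
  touch "${STATE_DIR}/$1.done"
}

install_driver() {
  # Skip the expensive work when a prior run already finished it.
  is_complete driver && return 0
  echo "installing driver..."   # stand-in for the real work
  mark_complete driver
}
```

With this pattern each expensive step checks its marker before doing any work, so re-running the whole installer after a partial failure only repeats the steps that never completed.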

gpu/manual-test-runner.sh

  • order commands correctly

gpu/run-bazel-tests.sh

  • do not retry flaky tests

gpu/test_gpu.py

  • clearer test skipping logic
  • added instructions on how to test pytorch
  • remove skip of rocky9 tests
  • There are now three tests run from the verify_instance_spark function
    • Run the SparkPi example with no parameters specified
    • Run the JavaIndexToStringExample with many parameters specified
    • Run the JavaIndexToStringExample with few parameters specified

Roll forward GoogleCloudDataproc#1275

gpu/install_gpu_driver.sh
  * updated supported versions
  * moved all code into functions, which are called at the footer of
    the installer
  * install cuda and driver exclusively from run files
  * extract cuda and driver version from urls if supplied
  * support supplying cuda version as x.y.z instead of just x.y
  * build nccl from source
  * poll dpkg lock status for up to 60 seconds
  * cache build artifacts from kernel driver and nccl
  * use consistent arguments to curl
  * create is_complete and mark_complete functions to allow re-running
  * Tested more CUDA minor versions
  * Printing warnings when the provided combination is known to fail
  * only install build dependencies on build cache miss
  * added optional pytorch install option
  * renamed metadata attribute cert_modulus_md5sum to modulus_md5sum
  * verified that proprietary kernel drivers work with older dataproc images
  * clear dkms key immediately after use
  * cache .run files to GCS to reduce fetches from origin
  * Install nvidia container toolkit and select container runtime
  * tested installer on clusters without GPUs attached
  * fixed a problem with the ops agent not installing by using a venv
  * Older CapacityScheduler does not permit use of GPU resources;
    switch to FairScheduler on 2.0 and below
  * caching result of nvidia-smi in spark.executor.resource.gpu.discoveryScript
  * setting some reasonable defaults in /etc/spark/conf.dist/spark-defaults.conf
  * Installing gcc-12 on ubuntu22 to fix a kernel driver FTBFS (failure to build from source)
  * Hold all NVIDIA-related packages from upgrading unintentionally
  * skipping proxy setup if http-proxy metadata not set
  * added function to check secure-boot and os version compatibility
  * harden sshd config
  * install spark rapids acceleration libraries
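Polling the dpkg lock for up to 60 seconds, as listed above, could be sketched as below. The two-second interval, the lock path parameter, and the use of `fuser` are assumptions about the general approach, not the installer's exact code.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: wait up to a timeout for the dpkg lock to clear.
wait_for_dpkg_lock() {
  local timeout="${1:-60}" lock="${2:-/var/lib/dpkg/lock-frontend}"
  local deadline=$(( $(date +%s) + timeout ))
  # fuser succeeds while some process still holds the lock file open.
  while fuser "${lock}" >/dev/null 2>&1; do
    if [ "$(date +%s)" -ge "${deadline}" ]; then
      echo "dpkg lock still held after ${timeout}s" >&2
      return 1
    fi
    sleep 2
  done
  return 0
}
```

Calling `wait_for_dpkg_lock 60` before each apt/dpkg invocation avoids racing against unattended-upgrades or another package operation that may already be running on a fresh node.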

gpu/manual-test-runner.sh
  * order commands correctly

gpu/run-bazel-tests.sh
  * do not retry flaky tests

gpu/test_gpu.py
  * clearer test skipping logic
  * added instructions on how to test pyspark
  * remove skip of rocky9 tests
@cjac cjac self-assigned this Feb 12, 2025
@cjac cjac marked this pull request as draft February 12, 2025 06:30

cjac commented Feb 12, 2025

/gcbrun


cjac commented Feb 13, 2025

/gcbrun


cjac commented Feb 13, 2025

/gcbrun

gpu/install_gpu_driver.sh

* Do not use fair scheduler for 2.0 clusters
* comment out spark-defaults.conf config options as guidance for tuning
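Commenting out the config options as tuning guidance might look roughly like this. The property names shown are common Spark/RAPIDS tuning knobs used as illustrative examples; they are not necessarily the exact ones the installer writes.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: append commented-out tuning hints rather than
# enforcing values globally in spark-defaults.conf.
append_spark_tuning_hints() {
  local conf="${1:-/etc/spark/conf.dist/spark-defaults.conf}"
  cat >> "${conf}" <<'EOF'
# Suggested GPU tuning starting points -- uncomment and adjust per job:
#spark.executor.resource.gpu.amount=1
#spark.task.resource.gpu.amount=0.25
#spark.rapids.sql.enabled=true
EOF
}
```

Leaving the lines commented shows the customer which knobs exist without imposing application-specific values on every job that runs on the cluster.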

gpu/test_gpu.py

* There are now three tests run from the verify_instance_spark function
  * Run the SparkPi example with no parameters specified
  * Run the JavaIndexToStringExample with many parameters specified
  * Run the JavaIndexToStringExample with few parameters specified
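The three invocations above might look roughly like this sketch. The jar path, example class names, and `--conf` values are assumptions for illustration; the actual harness in test_gpu.py assembles its commands differently.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the three verify_instance_spark checks.
jar="${SPARK_EXAMPLES_JAR:-/usr/lib/spark/examples/jars/spark-examples.jar}"

run_sparkpi_no_params() {
  # 1) SparkPi with no extra parameters specified
  spark-submit --class org.apache.spark.examples.SparkPi "${jar}" 100
}

run_index_to_string_many_params() {
  # 2) JavaIndexToStringExample with many parameters specified
  spark-submit \
    --class org.apache.spark.examples.ml.JavaIndexToStringExample \
    --conf spark.executor.resource.gpu.amount=1 \
    --conf spark.task.resource.gpu.amount=0.25 \
    "${jar}"
}

run_index_to_string_few_params() {
  # 3) JavaIndexToStringExample with few parameters specified
  spark-submit --class org.apache.spark.examples.ml.JavaIndexToStringExample "${jar}"
}
```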

cloudbuild/presubmit.sh

* added a continue statement to skip running all tests
* to be removed before merge

cjac commented Feb 13, 2025

/gcbrun


cjac commented Feb 13, 2025

This change consists of two commits:

The first commit re-applies the changes from PR #1275.

The second commit removes the selection of the Fair scheduler on 2.0 images. Newer releases of Dataproc 2.0 include patches to the Capacity scheduler which allow GPU to be specified as a resource, negating the need for this work-around.

The second commit also replaces the new configuration parameters with comments. This helps the customer know which parameters can be tuned, while not enforcing application-specific configuration globally.

It also adds new tests to exercise the use case which, on failure, caused the build to be rolled back.

@cjac cjac requested review from cnauroth and prince-cs February 13, 2025 02:41

@prince-cs prince-cs left a comment


LGTM


cjac commented Feb 13, 2025

Thank you, Prince! Is there any way you can exercise the internal tests without us having to merge the release and publish to GCS?

@prince-cs

> Thank you, Prince! Is there any way you can exercise the internal tests without us having to merge the release and publish to GCS?

Sure. Let me try that.

@prince-cs

Cluster is up and running.

@cjac cjac marked this pull request as ready for review February 13, 2025 08:38

cjac commented Feb 13, 2025

@jayadeep-jayaraman would you give the second commit a review? Chris reviewed the first commit, but we found that our internal tests were failing, so the change was rolled back. The second commit resolves the issues we were seeing, as confirmed by Prince.


@cnauroth cnauroth left a comment


LGTM, @cjac . I left one comment. Thank you!

@cjac cjac merged commit 7e87522 into GoogleCloudDataproc:master Feb 13, 2025
1 of 2 checks passed

cjac commented Feb 14, 2025

While testing the P4, P100 and A100 today, I found a problem in this change which will adversely affect spark when multiple GPUs are attached. I will open an issue to track the fix. Please do not roll init actions out to prod yet.

@cjac cjac changed the title Roll forward PR #1275 with fixes [gpu] strict driver and cuda version assignment Dec 6, 2025