[gpu] strict driver and cuda version assignment #1298
Conversation
Roll forward GoogleCloudDataproc#1275

gpu/install_gpu_driver.sh
* updated supported versions
* moved all code into functions, which are called at the footer of the installer
* install cuda and driver exclusively from run files
* extract cuda and driver version from urls if supplied
* support supplying cuda version as x.y.z instead of just x.y
* build nccl from source
* poll dpkg lock status for up to 60 seconds
* cache build artifacts from kernel driver and nccl
* use consistent arguments to curl
* create is_complete and mark_complete functions to allow re-running
* Tested more CUDA minor versions
* Printing warnings when the provided combination is known to fail
* only install build dependencies on build cache miss
* added optional pytorch install option
* renamed metadata attribute cert_modulus_md5sum to modulus_md5sum
* verified that proprietary kernel drivers work with older dataproc images
* clear dkms key immediately after use
* cache .run files to GCS to reduce fetches from origin
* Install nvidia container toolkit and select container runtime
* tested installer on clusters without GPUs attached
* fixed a problem with ops agent not installing; now using venv
* Older CapacityScheduler does not permit use of gpu resources; switch to FairScheduler on 2.0 and below
* caching result of nvidia-smi in spark.executor.resource.gpu.discoveryScript
* setting some reasonable defaults in /etc/spark/conf.dist/spark-defaults.conf
* Installing gcc-12 on ubuntu22 to fix kernel driver FTBFS
* Hold all NVIDIA-related packages from upgrading unintentionally
* skipping proxy setup if http-proxy metadata not set
* added function to check secure-boot and os version compatibility
* harden sshd config
* install spark rapids acceleration libraries

gpu/manual-test-runner.sh
* order commands correctly

gpu/run-bazel-tests.sh
* do not retry flaky tests

gpu/test_gpu.py
* clearer test skipping logic
* added instructions on how to test pyspark
* remove skip of rocky9 tests
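The dpkg-lock polling mentioned in the list above could be sketched roughly as follows; the function name and the exact lock file checked are assumptions, with only the 60-second bound taken from the change summary:

```shell
# Hypothetical sketch: wait up to 60 seconds for the dpkg frontend lock
# to be released before attempting package operations.
wait_for_dpkg_lock() {
  local deadline=$(( $(date +%s) + 60 ))
  # fuser exits non-zero when no process holds the lock file
  while fuser /var/lib/dpkg/lock-frontend >/dev/null 2>&1; do
    if [ "$(date +%s)" -ge "${deadline}" ]; then
      echo "timed out waiting for dpkg lock" >&2
      return 1
    fi
    sleep 1
  done
}
```

A caller would invoke `wait_for_dpkg_lock` immediately before each `apt-get` or `dpkg` command so concurrent startup scripts do not race for the lock.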
/gcbrun

/gcbrun

/gcbrun
gpu/install_gpu_driver.sh
* Do not use fair scheduler for 2.0 clusters
* comment out spark-defaults.conf config options as guidance for tuning

gpu/test_gpu.py
* There are now three tests run from the verify_instance_spark function
  * Run the SparkPi example with no parameters specified
  * Run the JavaIndexToStringExample with many parameters specified
  * Run the JavaIndexToStringExample with few parameters specified

cloudbuild/presubmit.sh
* added a continue to skip run of all tests
* to be removed before merge
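The three invocations described above might look roughly like the following as plain spark-submit commands; the examples-jar path and the specific tuning flags used in the "many parameters" case are assumptions for illustration, not the actual test code:

```shell
# Illustrative sketch of the three Spark jobs exercised by
# verify_instance_spark. EXAMPLES_JAR path is an assumption.
EXAMPLES_JAR="${EXAMPLES_JAR:-/usr/lib/spark/examples/jars/spark-examples.jar}"

run_spark_pi() {
  # 1. SparkPi with no parameters specified
  spark-submit --class org.apache.spark.examples.SparkPi "${EXAMPLES_JAR}"
}

run_index_to_string_many_params() {
  # 2. JavaIndexToStringExample with many parameters specified
  #    (the gpu resource confs here are hypothetical examples)
  spark-submit --class org.apache.spark.examples.ml.JavaIndexToStringExample \
    --conf spark.executor.resource.gpu.amount=1 \
    --conf spark.task.resource.gpu.amount=0.5 \
    "${EXAMPLES_JAR}"
}

run_index_to_string_few_params() {
  # 3. JavaIndexToStringExample with few parameters specified
  spark-submit --class org.apache.spark.examples.ml.JavaIndexToStringExample \
    "${EXAMPLES_JAR}"
}
```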
/gcbrun
This change consists of two commits. The first re-applies the changes from PR #1275. The second removes the selection of the Fair scheduler on 2.0 images: newer releases of Dataproc 2.0 include patches to the Capacity scheduler which allow GPU to be specified as a resource, negating the need for this work-around. The second commit also replaces the new configuration parameters with comments, which helps the customer know which parameters can be tuned without enforcing application-specific configuration globally, and it adds new tests to exercise the use case which, on failure, caused the build to be rolled back.
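As an illustration of the commented-guidance approach, the resulting spark-defaults.conf section might look roughly like the fragment below; these are standard Spark/RAPIDS property names, but the specific values and selection shown here are assumptions, not the literal contents of the change:

```properties
# Tuning guidance only; uncomment and adjust per application:
# spark.executor.resource.gpu.amount=1
# spark.task.resource.gpu.amount=0.25
# spark.executor.resource.gpu.discoveryScript=/usr/lib/spark/scripts/gpu/getGpusResources.sh
# spark.plugins=com.nvidia.spark.SQLPlugin
```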
prince-cs
left a comment
LGTM
Thank you, Prince! Is there any way you can exercise the internal tests without us having to merge the release and publish to GCS?
Sure. Let me try that.
Cluster is up and running.
@jayadeep-jayaraman would you give the second commit a review? Chris reviewed the first commit, but we found that our internal tests were failing, so the change was rolled back. The second commit resolves the issues we were seeing, as confirmed by Prince.
cnauroth
left a comment
LGTM, @cjac. I left one comment. Thank you!
While testing the P4, P100, and A100 today, I found a problem in this change that will adversely affect Spark when multiple GPUs are attached. I will open an issue to track the fix. Please do not roll init actions out to prod yet.
[gpu] strict driver and cuda version assignment
This change re-applies the changes from PR #1275 and fixes some issues users were experiencing when specifying parameters.
gpu/install_gpu_driver.sh
updated supported versions
moved all code into functions, which are called at the footer of the installer
install cuda and driver exclusively from run files
extract cuda and driver version from urls if supplied
support supplying cuda version as x.y.z instead of just x.y
build nccl from source
poll dpkg lock status for up to 60 seconds
cache build artifacts from kernel driver and nccl
use consistent arguments to curl
create is_complete and mark_complete functions to allow re-running
Tested more CUDA minor versions
Printing warnings when combination provided is known to fail
only install build dependencies on build cache miss
added optional pytorch install option
renamed metadata attribute cert_modulus_md5sum to modulus_md5sum
verified that proprietary kernel drivers work with older dataproc images
clear dkms key immediately after use
cache .run files to GCS to reduce fetches from origin
Install nvidia container toolkit and select container runtime
tested installer on clusters without GPUs attached
fixed a problem with ops agent not installing; now using venv
caching result of nvidia-smi in spark.executor.resource.gpu.discoveryScript
suggesting some reasonable defaults in /etc/spark/conf.dist/spark-defaults.conf
Installing gcc-12 on ubuntu22 to fix kernel driver FTBFS
Hold all NVIDIA-related packages from upgrading unintentionally
skipping proxy setup if http-proxy metadata not set
added function to check secure-boot and os version compatibility
harden sshd config
install spark rapids acceleration libraries
gpu/manual-test-runner.sh
order commands correctly
gpu/run-bazel-tests.sh
do not retry flaky tests
gpu/test_gpu.py
clearer test skipping logic
added instructions on how to test pyspark
remove skip of rocky9 tests
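The is_complete/mark_complete re-run guard named above might be sketched like this; the marker directory and file layout are assumptions, not the installer's actual paths:

```shell
# Hypothetical sketch of per-step completion markers that let the
# installer be re-run without repeating finished steps.
MARKER_DIR="${MARKER_DIR:-/var/lib/gpu-install-markers}"

is_complete() {
  # Succeeds when the named step has already been marked complete
  [ -f "${MARKER_DIR}/$1.done" ]
}

mark_complete() {
  mkdir -p "${MARKER_DIR}"
  touch "${MARKER_DIR}/$1.done"
}
```

A step would then begin with `is_complete install-driver && return 0` and end with `mark_complete install-driver`, so a failed run can be retried from where it stopped.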
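Holding NVIDIA-related packages against unintentional upgrades could look roughly like the following on Debian-family images; the package-name pattern and function name are assumptions:

```shell
# Hypothetical sketch: pin installed NVIDIA-related packages so that
# routine apt upgrades cannot replace the driver or toolkit underneath
# a built kernel module.
hold_nvidia_packages() {
  local pkg
  # Enumerate installed package names and hold any that look NVIDIA-related
  for pkg in $(dpkg-query -W -f='${Package}\n' 2>/dev/null \
                 | grep -E '^(nvidia-|cuda-|libnvidia-)'); do
    apt-mark hold "${pkg}"
  done
}
```

The matching `apt-mark unhold` would be needed before any deliberate driver upgrade.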
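Caching the result of nvidia-smi for spark.executor.resource.gpu.discoveryScript might be sketched as below; the JSON shape follows Spark's discovery-script contract, while the cache path and function name are assumptions:

```shell
# Hypothetical sketch: query GPU indexes once, cache the discovery JSON,
# and serve the cached copy on subsequent executor launches.
gpu_discovery() {
  local cache="${1:-/var/run/nvidia-gpu-index.json}"
  if [ ! -s "${cache}" ]; then
    local addrs
    # One quoted address per GPU index, comma-joined: "0","1",...
    addrs=$(nvidia-smi --query-gpu=index --format=csv,noheader \
              | sed 's/.*/"&"/' | paste -s -d, -)
    # Spark expects: {"name":"gpu","addresses":["0","1",...]}
    printf '{"name":"gpu","addresses":[%s]}\n' "${addrs}" > "${cache}"
  fi
  cat "${cache}"
}
```

Avoiding a fresh nvidia-smi invocation on every executor start keeps task launch latency down on busy nodes.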