Commit 6a0578c

Docs(gpu): Comprehensive update to README.md
This commit brings the gpu/README.md documentation in line with the current capabilities and behavior of the install_gpu_driver.sh initialization action. The previous README was significantly outdated.

Key updates include:

- General:
  - Removed "Beta mode" notice.
  - Updated overall description of the script's purpose and scope.
- Usage and Examples:
  - Added "Default Versions and Supported Configurations" section, clarifying default CUDA selection based on Dataproc image version and listing supported OS families.
  - Updated `gcloud` command examples to use current best practices (regionalized bucket paths, common GPU types like T4).
  - Added new examples for specifying custom driver/CUDA URLs and for MIG-enabled cluster creation.
  - Correctly detailed the use of `invocation-type=custom-images` metadata, emphasizing its use by image building tools like `generate_custom_image.py` rather than direct user specification during cluster creation from scratch.
- Metadata Parameters:
  - Completely overhauled and expanded the "Metadata Parameters" section to comprehensively list all currently supported options, including:
    - `cuda-version`, `gpu-driver-version`, `cuda-url`, `gpu-driver-url` (clarifying HTTP/HTTPS URL input for `cuda-url`/`gpu-driver-url`).
    - `gpu-driver-provider`, `cudnn-version`, `nccl-version`.
    - `install-gpu-agent` (noting new default of `true`).
    - `include-pytorch`, `gpu-conda-env`.
    - `container-runtime`, `http-proxy`.
    - Secure Boot signing parameters (`private_secret_name`, etc.).
  - Provided descriptions, potential values, and defaults for each.
- Feature Documentation:
  - Updated "GPU Scheduling in YARN" to reflect current practices.
  - Added information on NVIDIA Container Toolkit installation.
  - Significantly expanded the "Secure Boot and Kernel Module Signing" section with details on MOK management, use of Secret Manager, and the `--no-shielded-secure-boot` workaround.
  - Added "Custom Image Creation" section explaining deferred configuration.
- Verification and Notes:
  - Updated verification steps for drivers, CUDA, and the GPU agent.
  - Added an "Important Notes" section covering:
    - OS and Dataproc image version support.
    - Handling of OS package managers and kernel module building.
    - SSHD hardening.
    - Performance considerations, GCS caching mechanism for artifacts, the benefits of cache pre-warming (including reduced initial run times from up to 150 min to 12-20 min), and the security implication of not needing build tools when cache hits occur.
    - Details about the PyTorch Conda environment.
  - Removed outdated information regarding very old CUDA versions, Dataproc 1.x specific behaviors, and previous GPU agent script handling.
1 parent 40552a7 commit 6a0578c

1 file changed: +145 -269 lines changed
