Commit 6a0578c
Docs(gpu): Comprehensive update to README.md

This commit brings the gpu/README.md documentation in line with the
current capabilities and behavior of the install_gpu_driver.sh
initialization action. The previous README was significantly outdated.

Key updates include:
- General:
  - Removed "Beta mode" notice.
  - Updated overall description of the script's purpose and scope.
- Usage and Examples:
  - Added "Default Versions and Supported Configurations" section,
    clarifying default CUDA selection based on Dataproc image version
    and listing supported OS families.
  - Updated `gcloud` command examples to use current best practices
    (regionalized bucket paths, common GPU types like T4).
  - Added new examples for specifying custom driver/CUDA URLs and for
    MIG-enabled cluster creation.
  - Correctly detailed the use of `invocation-type=custom-images`
    metadata, emphasizing its use by image building tools like
    `generate_custom_image.py` rather than direct user specification
    during cluster creation from scratch.
- Metadata Parameters:
  - Completely overhauled and expanded the "Metadata Parameters" section
    to comprehensively list all currently supported options, including:
    - `cuda-version`, `gpu-driver-version`, `cuda-url`, `gpu-driver-url`
      (clarifying HTTP/HTTPS URL input for `cuda-url`/`gpu-driver-url`).
    - `gpu-driver-provider`, `cudnn-version`, `nccl-version`.
    - `install-gpu-agent` (noting new default of `true`).
    - `include-pytorch`, `gpu-conda-env`.
    - `container-runtime`, `http-proxy`.
    - Secure Boot signing parameters (`private_secret_name`, etc.).
  - Provided descriptions, potential values, and defaults for each.
- Feature Documentation:
  - Updated "GPU Scheduling in YARN" to reflect current practices.
  - Added information on NVIDIA Container Toolkit installation.
  - Significantly expanded the "Secure Boot and Kernel Module Signing"
    section with details on MOK management, use of Secret Manager,
    and the `--no-shielded-secure-boot` workaround.
  - Added "Custom Image Creation" section explaining deferred configuration.
- Verification and Notes:
  - Updated verification steps for drivers, CUDA, and the GPU agent.
  - Added an "Important Notes" section covering:
    - OS and Dataproc image version support.
    - Handling of OS package managers and kernel module building.
    - SSHD hardening.
    - Performance considerations, GCS caching mechanism for artifacts,
      the benefits of cache pre-warming (including reduced initial run
      times from up to 150 min to 12-20 min), and the security
      implication of not needing build tools when cache hits occur.
    - Details about the PyTorch Conda environment.
- Removed outdated information regarding very old CUDA versions,
  Dataproc 1.x specific behaviors, and previous GPU agent script handling.

Parent commit: 40552a7
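
For orientation, here is a minimal sketch of the kind of `gcloud` invocation the updated examples describe (regional bucket path, T4 accelerator, a metadata parameter). It is not copied from the README: the cluster name, region, machine type, and the exact bucket path pattern are assumptions for illustration, and the script only prints the command instead of running it.

```shell
#!/bin/sh
# Sketch only: prints a candidate command for review; it does not call gcloud.
# CLUSTER_NAME and REGION are placeholder values (assumptions, not from the README).
CLUSTER_NAME="${CLUSTER_NAME:-example-gpu-cluster}"
REGION="${REGION:-us-central1}"

# Assumed regional bucket layout for the initialization action.
INIT_ACTION="gs://goog-dataproc-initialization-actions-${REGION}/gpu/install_gpu_driver.sh"

print_create_command() {
  # T4 GPUs attach to N1 machine types, so one is requested explicitly.
  # install-gpu-agent=true is already the default per this commit; it is
  # passed here only to show how a metadata parameter is supplied.
  cat <<EOF
gcloud dataproc clusters create ${CLUSTER_NAME} \\
  --region ${REGION} \\
  --worker-machine-type n1-standard-4 \\
  --worker-accelerator type=nvidia-tesla-t4,count=1 \\
  --initialization-actions ${INIT_ACTION} \\
  --metadata install-gpu-agent=true
EOF
}

print_create_command
```

Once the printed command looks right for the target project, it can be pasted into a shell. Other parameters listed above, such as `cuda-version`, would be appended to `--metadata` as comma-separated `key=value` pairs.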
1 file changed: +145 -269 lines changed