Commit 25df48a
### GPU Initialization Action Enhancements for Secure Boot, Proxy, and Reliability
This update significantly improves the `install_gpu_driver.sh` script and its accompanying documentation, focusing on robust support for complex environments involving Secure Boot and HTTP/S proxies, and increasing overall reliability and maintainability.
**1. `gpu/README.md`:**
* **Comprehensive Documentation for Secure Boot & Proxy:**
    * Added a major section: "Building Custom Images with Secure Boot and Proxy Support". This details the end-to-end process using the `GoogleCloudDataproc/custom-images` repository to create Dataproc images with NVIDIA drivers signed for Secure Boot. It covers environment setup, key management in GCP Secret Manager, Docker builder image creation, and running the image generation process.
    * Added a major section: "Launching a Cluster with the Secure Boot Custom Image". This explains how to use the custom-built images to launch Dataproc clusters with `--shielded-secure-boot`. It includes instructions for private network setups using Google Cloud Secure Web Proxy, leveraging scripts from the `GoogleCloudDataproc/cloud-dataproc` repository for VPC, subnet, and proxy configuration.
    * Includes verification steps for checking driver status, module signatures, and system logs on the cluster nodes.
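
The verification steps mentioned above could look roughly like the following sketch. It is not taken from the README: the function name `check_gpu_node` and the log filter pattern are assumptions made for illustration, and `dmesg` may require root on some images.

```shell
# Hypothetical helper (not part of the README): summarize GPU driver state on a node.
check_gpu_node() {
  local module="${1:-nvidia}" rc=0 signer

  # Driver status: nvidia-smi exits non-zero when it cannot reach the driver.
  if nvidia-smi >/dev/null 2>&1; then
    echo "driver: ok"
  else
    echo "driver: not responding"; rc=1
  fi

  # Module signature: a signed module reports a "signer:" line in modinfo.
  signer="$(modinfo "${module}" 2>/dev/null | grep '^signer:')"
  if [ -n "${signer}" ]; then
    echo "${signer}"
  else
    echo "signer: none"; rc=1
  fi

  # Secure Boot state, when mokutil is installed.
  if command -v mokutil >/dev/null 2>&1; then
    mokutil --sb-state 2>/dev/null || true
  fi

  # Recent kernel log lines mentioning the driver (assumed filter pattern).
  dmesg 2>/dev/null | grep -iE 'nvidia|nvrm' | tail -n 20

  return "${rc}"
}
```

Running `check_gpu_node nvidia` on a worker node prints one line per check and returns non-zero if the driver is unreachable or the module carries no signer.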
**2. `gpu/install_gpu_driver.sh`:**
* **Conda/Mamba Environment (`install_pytorch`):**
    * The package list for the Conda environment now omits the explicit CUDA runtime specification, giving the solver more flexibility based on other dependencies and the `cuda-version` constraint.
    * Mamba is now preferred for faster environment creation, with a fallback to Conda.
    * Implements cache and environment clearing: if the main driver installation is marked complete (the `install_gpu_driver-main` sentinel exists) but the PyTorch environment setup is not (the `pytorch` sentinel is missing), the script purges the GCS cache and the local Conda environment to ensure a clean rebuild.
    * Enhanced error handling for Conda/Mamba environment creation.
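
As a rough illustration of the sentinel check and the Mamba-first fallback described above, consider the sketch below. Only `mark_complete` and the two sentinel names come from the commit message. `SENTINEL_DIR`, `is_complete`, `needs_clean_rebuild`, and `pick_env_tool` are hypothetical names, and the actual purge commands are omitted because the commit message does not give the paths involved.

```shell
# Assumed sentinel location; the real script's directory is not stated in the commit message.
SENTINEL_DIR="${SENTINEL_DIR:-/tmp/gpu-init-sentinels}"

# A step is complete when its sentinel file exists.
is_complete()   { [ -f "${SENTINEL_DIR}/$1" ]; }
mark_complete() { mkdir -p "${SENTINEL_DIR}" && touch "${SENTINEL_DIR}/$1"; }

# True when the driver install finished but the PyTorch environment did not,
# which is the condition for purging the cache and rebuilding from scratch.
needs_clean_rebuild() {
  is_complete install_gpu_driver-main && ! is_complete pytorch
}

# Print the preferred environment tool: mamba if present, otherwise conda.
pick_env_tool() {
  if command -v mamba >/dev/null 2>&1; then
    echo mamba
  elif command -v conda >/dev/null 2>&1; then
    echo conda
  else
    return 1
  fi
}

# Usage sketch:
#   if needs_clean_rebuild; then
#     : # purge the cached archive and local environment here (paths omitted)
#   fi
#   tool="$(pick_env_tool)" || { echo "neither mamba nor conda found" >&2; exit 1; }
```

Keeping the decision in small predicates like these makes the rebuild condition easy to test without touching GCS or Conda.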
* **NVIDIA Driver Handling:**
    * `set_driver_version`: Uses `curl -I` to issue lightweight HEAD requests when checking URLs.
    * `build_driver_from_github`: Caches the open kernel module source tarball from GitHub to GCS. Checks for existing signed and loadable modules to avoid unnecessary rebuilds.
    * `execute_github_driver_build`: Refactored to accept tarball paths as arguments. Removed a `popd` so that directory changes stay balanced with the `pushd` in the caller. Removed a debug echo of the `sign-file` exit code.
    * `modules_install` now runs with `make -j$(nproc)` for parallelization.
    * A post-install verification loop checks `modinfo` output for `signer:` to confirm that the modules are signed.
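
The post-install signature check can be sketched as follows. This is an illustrative reconstruction, not the code from `install_gpu_driver.sh`: the function name `modules_are_signed` is hypothetical, and the module list in the usage comment is an assumption based on the standard NVIDIA kernel module names.

```shell
# Hypothetical helper: succeed only if every named module reports a signer.
modules_are_signed() {
  local mod signer
  for mod in "$@"; do
    # Extract the value of the "signer:" field; empty means unsigned or not found.
    signer="$(modinfo "${mod}" 2>/dev/null | sed -n 's/^signer:[[:space:]]*//p')"
    if [ -z "${signer}" ]; then
      echo "unsigned or missing: ${mod}" >&2
      return 1
    fi
    echo "${mod}: signed by ${signer}"
  done
}

# Usage sketch (module names assumed):
#   modules_are_signed nvidia nvidia-uvm nvidia-modeset nvidia-drm || exit 1
```

Failing on the first unsigned module keeps the error message specific, which matters under Secure Boot because the kernel will refuse to load any module without a trusted signature.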
* **Secure Boot Check:** Changed the script to issue a warning rather than exit if Secure Boot is enabled on unsupported Debian versions (<= 2.1).
* **Completion Sentinel:** Added `mark_complete install_gpu_driver-main` at the end of the `main` function.
* **Proxy Configuration (`set_proxy`):** Conditionally adds `gcloud config set proxy/...` commands if the gcloud SDK version is >= 547.0.0. Corrected `sed` command for DNF proxy configuration.

1 parent ba91cf7 commit 25df48a
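
The version gate described for `set_proxy` might be implemented along these lines. This is a sketch under assumptions: `version_ge` and `configure_gcloud_proxy` are hypothetical names, the comparison relies on GNU `sort -V`, the proxy type is assumed to be `http`, and the host and port are placeholders supplied by the caller.

```shell
# True when version $1 >= version $2 (uses GNU sort -V for version ordering).
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Hypothetical helper: set gcloud proxy properties only on new enough SDKs.
configure_gcloud_proxy() {
  local host="$1" port="$2" min="547.0.0" ver
  # "gcloud --version" prints a line such as "Google Cloud SDK 547.0.0".
  ver="$(gcloud --version 2>/dev/null | awk '/^Google Cloud SDK/ {print $4}')"
  if [ -n "${ver}" ] && version_ge "${ver}" "${min}"; then
    gcloud config set proxy/type http
    gcloud config set proxy/address "${host}"
    gcloud config set proxy/port "${port}"
  else
    echo "gcloud ${ver:-not found}: skipping proxy/... settings" >&2
  fi
}

# Usage sketch (placeholder host and port):
#   configure_gcloud_proxy proxy.example.internal 3128
```

Comparing with `sort -V` avoids the pitfall of plain string comparison, where `1000.0.0` would sort before `547.0.0`.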
2 files changed
Lines changed: 243 additions & 89 deletions