Commit 25df48a

### GPU Initialization Action Enhancements for Secure Boot, Proxy, and Reliability
This update significantly improves the `install_gpu_driver.sh` script and its accompanying documentation, focusing on robust support for complex environments involving Secure Boot and HTTP/S proxies, and increasing overall reliability and maintainability.

**1. `gpu/README.md`:**

* **Comprehensive Documentation for Secure Boot & Proxy:**
  * Added a major section, "Building Custom Images with Secure Boot and Proxy Support". This details the end-to-end process of using the [GoogleCloudDataproc/custom-images](https://github.com/GoogleCloudDataproc/custom-images) repository to create Dataproc images with NVIDIA drivers signed for Secure Boot. It covers environment setup, key management in GCP Secret Manager, Docker builder image creation, and running the image generation process.
  * Added a major section, "Launching a Cluster with the Secure Boot Custom Image". This explains how to use the custom-built images to launch Dataproc clusters with `--shielded-secure-boot`. It includes instructions for private network setups using Google Cloud Secure Web Proxy, leveraging scripts from the [GoogleCloudDataproc/cloud-dataproc](https://github.com/GoogleCloudDataproc/cloud-dataproc) repository for VPC, subnet, and proxy configuration.
  * Includes essential verification steps for checking driver status, module signatures, and system logs on the cluster nodes.

**2. `gpu/install_gpu_driver.sh`:**

* **Conda/Mamba Environment (`install_pytorch`):**
  * The package list for the Conda environment now omits the explicit CUDA runtime specification, giving the solver more flexibility based on other dependencies and the `cuda-version` constraint.
  * Mamba is now used preferentially for faster environment creation, with a fallback to Conda.
  * Implements cache/environment clearing logic: if the main driver installation is marked complete (the `install_gpu_driver-main` sentinel exists) but the PyTorch environment setup is not (the `pytorch` sentinel is missing), the script purges the GCS cache and the local Conda environment to ensure a clean rebuild.
  * Enhanced error handling for Conda/Mamba environment creation.
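The Mamba-preferred creation and the sentinel-driven rebuild described above can be sketched roughly as follows; the sentinel directory, environment name, and cache bucket are illustrative assumptions, not the script's exact values:

```shell
# Illustrative sketch only: sentinel directory, env name, and cache bucket
# are assumptions, not the values used by install_gpu_driver.sh.
sentinel_dir="/var/lib/dataproc-initialization-actions"

# True when the driver install finished but the PyTorch env setup did not.
needs_env_rebuild() {
  [ -f "${sentinel_dir}/install_gpu_driver-main" ] && \
    [ ! -f "${sentinel_dir}/pytorch" ]
}

# Prefer mamba for its faster solver; fall back to conda.
create_env() {
  local solver=conda
  command -v mamba >/dev/null 2>&1 && solver=mamba
  echo "creating pytorch env with ${solver}"
  # "${solver}" create -y -n pytorch pytorch torchvision   # actual creation
}

if needs_env_rebuild; then
  # gsutil -m rm -r "gs://${CACHE_BUCKET}/conda-env-cache"  # purge GCS cache
  rm -rf /opt/conda/envs/pytorch 2>/dev/null || true        # purge local env
  create_env
fi
```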
* **NVIDIA Driver Handling:**
  * `set_driver_version`: uses `curl -I` for lightweight HEAD requests when probing URLs.
  * `build_driver_from_github`: caches the open kernel module source tarball from GitHub to GCS, and checks for existing signed, loadable modules to avoid unnecessary rebuilds.
  * `execute_github_driver_build`: refactored to accept tarball paths as arguments; `popd` removed to balance the `pushd` in the caller; removed a debug echo of the `sign-file` exit code.
  * Added `make -j$(nproc)` to `modules_install` for parallelization.
  * A post-install verification loop checks `modinfo` output for `signer:` to confirm the modules are signed.
* **Secure Boot Check:** The script now issues a warning rather than exiting if Secure Boot is enabled on unsupported Debian-based image versions (<= 2.1).
* **Completion Sentinel:** Added `mark_complete install_gpu_driver-main` at the end of the `main` function.
* **Proxy Configuration (`set_proxy`):** Conditionally adds `gcloud config set proxy/...` commands if the gcloud SDK version is >= 547.0.0. Corrected the `sed` command for DNF proxy configuration.
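The version-gated proxy configuration can be sketched as below; the comparison helper is illustrative, and the specific `proxy/...` properties shown are representative gcloud properties rather than a list taken from the script:

```shell
# Hedged sketch of the >= 547.0.0 gate described above; helper and property
# choices are illustrative, not copied from install_gpu_driver.sh.
version_ge() {
  # True when $1 >= $2 for dotted numeric versions (e.g. 547.0.0).
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$2" ]
}

gcloud_sdk_version="$(gcloud version 2>/dev/null | awk '/Google Cloud SDK/{print $4}')"
if version_ge "${gcloud_sdk_version:-0}" "547.0.0"; then
  gcloud config set proxy/type http
  gcloud config set proxy/address "${PROXY_HOST}"   # PROXY_HOST/PORT assumed
  gcloud config set proxy/port "${PROXY_PORT}"      # to come from metadata
fi
```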
1 parent ba91cf7

2 files changed: 243 additions & 89 deletions

File tree

gpu/README.md (78 additions & 2 deletions)
```diff
@@ -2,8 +2,8 @@
 
 GPUs require special drivers and software which are not pre-installed on
 [Dataproc](https://cloud.google.com/dataproc) clusters by default.
-This initialization action installs GPU driver for NVIDIA GPUs on master and
-worker nodes in a Dataproc cluster.
+This initialization action installs GPU driver for NVIDIA GPUs on -m node(s) and
+-w nodes in a Dataproc cluster.
 
 ## Default versions
```

@@ -252,6 +252,82 @@ not suitable), special considerations apply if Secure Boot is enabled.

or `dmesg` output for errors like "Operation not permitted" or messages
related to signature verification failure.
## Building Custom Images with Secure Boot and Proxy Support

For environments requiring NVIDIA drivers to be signed for Secure Boot, especially when operating behind an HTTP/S proxy, you must first build a custom Dataproc image. This process uses tools from the [GoogleCloudDataproc/custom-images](https://github.com/GoogleCloudDataproc/custom-images) repository, specifically the scripts within the `examples/secure-boot/` directory.
**Base Image:** Typically Dataproc 2.2-debian12 or newer.

**Process Overview:**
1. **Clone `custom-images` Repository:**

   ```bash
   git clone https://github.com/GoogleCloudDataproc/custom-images.git
   cd custom-images
   ```
2. **Configure Build:** Set up `env.json` with your project, network, and bucket details. See `examples/secure-boot/env.json.sample` in the `custom-images` repo.

3. **Prepare Signing Keys:** Ensure Secure Boot signing keys are available in GCP Secret Manager. Use `examples/secure-boot/create-key-pair.sh` from the `custom-images` repo to create and manage these.
4. **Build Docker Image:** Build the builder environment:

   ```bash
   docker build -t dataproc-secure-boot-builder:latest .
   ```

5. **Run Image Generation:** Use `generate_custom_image.py` within the Docker container, typically orchestrated by `examples/secure-boot/pre-init.sh`. The core customization script, `examples/secure-boot/install_gpu_driver.sh`, handles driver installation, proxy setup, and module signing.

   * Refer to the [Secure Boot example documentation](https://github.com/GoogleCloudDataproc/custom-images/tree/master/examples/secure-boot) for detailed `docker run` commands and metadata requirements (proxy settings, secret names, etc.).
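As a rough illustration of the signing-key step above, a key pair could be generated and staged manually as follows; the subject and secret names are placeholders, and `create-key-pair.sh` remains the supported path:

```shell
# Hedged sketch: generate a self-signed module-signing key pair locally.
# Subject and secret names below are placeholder assumptions;
# examples/secure-boot/create-key-pair.sh is the supported flow.
openssl req -new -x509 -newkey rsa:2048 -nodes -days 3650 \
  -subj "/CN=dataproc-secure-boot-module-signing/" \
  -keyout tls.key -outform DER -out tls.der

# Then store both halves in Secret Manager (requires an authenticated gcloud):
#   gcloud secrets create efi-signing-key  --data-file=tls.key
#   gcloud secrets create efi-signing-cert --data-file=tls.der
```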
### Launching a Cluster with the Secure Boot Custom Image

Once you have successfully built a custom image with signed drivers, you can create a Dataproc cluster with Secure Boot enabled.

**Important:** To launch a Dataproc cluster with the `--shielded-secure-boot` flag and have NVIDIA drivers function correctly, you MUST use a custom image created through the process detailed above. Standard Dataproc images do not contain the necessary signed modules.

**Network and Cluster Setup:**

To create the cluster in a private network environment with a Secure Web Proxy, use the scripts from the [GoogleCloudDataproc/cloud-dataproc](https://github.com/GoogleCloudDataproc/cloud-dataproc) repository:
1. **Clone `cloud-dataproc` Repository:**

   ```bash
   git clone https://github.com/GoogleCloudDataproc/cloud-dataproc.git
   cd cloud-dataproc/gcloud
   ```
2. **Configure Environment:**

   * Copy `env.json.sample` to `env.json`.
   * Edit `env.json` with your project details, ensuring you specify the custom image name and any necessary proxy details if you intend to run in a private network. Example:

     ```json
     {
       "PROJECT_ID": "YOUR_GCP_PROJECT_ID",
       "REGION": "us-west4",
       "ZONE": "us-west4-a",
       "BUCKET": "YOUR_STAGING_BUCKET",
       "TEMP_BUCKET": "YOUR_TEMP_BUCKET",
       "CUSTOM_IMAGE_NAME": "YOUR_BUILT_IMAGE_NAME",
       "PURPOSE": "secure-boot-cluster",
       // Add these for a private, proxied environment
       "PRIVATE_RANGE": "10.43.79.0/24",
       "SWP_RANGE": "10.44.79.0/24",
       "SWP_IP": "10.43.79.245",
       "SWP_PORT": "3128",
       "SWP_HOSTNAME": "swp.your-project.example.com"
       // ... other variables as needed
     }
     ```

   * Set `CUSTOM_IMAGE_NAME` to the image you built in the `custom-images` process.
3. **Create the Private Environment and Cluster:**

   This script sets up the VPC, subnets, and Secure Web Proxy, and then creates the Dataproc cluster using the custom image. The `--shielded-secure-boot` flag is handled internally by the scripts when a `CUSTOM_IMAGE_NAME` is provided.

   ```bash
   bash bin/create-dpgce-private
   ```
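To sanity-check what the launch tooling will read, you can parse `env.json` with any strict JSON parser; note that the `//` comment lines shown in the sample are not strict JSON and must be removed first. A minimal, self-contained illustration (file truncated to two keys, parser choice is mine, not the repo's):

```shell
# Hedged illustration: a strict-JSON env.json truncated to two keys, read
# back with python3. The repo's own scripts may parse the file differently.
cat > env.json <<'EOF'
{
  "PROJECT_ID": "YOUR_GCP_PROJECT_ID",
  "CUSTOM_IMAGE_NAME": "YOUR_BUILT_IMAGE_NAME"
}
EOF
python3 -c 'import json; print(json.load(open("env.json"))["CUSTOM_IMAGE_NAME"])'
```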
**Verification:**

1. SSH into the -m node of the created cluster.
2. Check driver status: `sudo nvidia-smi`
3. Verify module signature: `sudo modinfo nvidia | grep signer` (should show your custom CA).
4. Check for errors: `dmesg | grep -iE "Secure Boot|NVRM|nvidia"`
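If you script the checks, the signature test above can be wrapped in a tiny helper; this is a sketch of my own, not part of the initialization action, and the sample input is canned:

```shell
# Hedged helper: pass `modinfo nvidia` output on stdin; exits nonzero when
# no signer line is present.
is_signed() { grep -q '^signer:' ; }

# Example with canned modinfo-style output (real output comes from the node):
printf 'filename: /lib/modules/.../nvidia.ko\nsigner:   My Custom CA\n' \
  | is_signed && echo "module is signed"
```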
### Verification

1. Once the cluster has been created, you can access the Dataproc cluster and
