[feat] : validate additional enabled drivers #2014
Open
Dependencies
Depends on: NVIDIA/gpu-driver-container#529
Depends on: NVIDIA/k8s-device-plugin#1550
Description
Problem
GPU Operator supports deploying multiple driver versions within a single Kubernetes cluster through the use of multiple NvidiaDriver custom resources (CRs). However, despite supporting multiple driver instances, the GPU Operator currently deploys only a single, cluster-wide NVIDIA Container Toolkit DaemonSet and a single NVIDIA Device Plugin DaemonSet.
This architecture introduces a limitation when different NvidiaDriver CRs enable different driver-dependent features, such as GPUDirect Storage (GDS), GDRCopy, or other optional components. Because the Container Toolkit and Device Plugin are deployed once per cluster and configured uniformly, they cannot be tailored to feature differences across driver instances. As a result, nodes whose drivers have different features enabled cannot be supported correctly or independently.
Proposed solution
During reconciliation in the GPU Operator, we will inject additional driver-enablement environment variables into the nvidia-driver container based on the ClusterPolicy or NvidiaDriver CR selected for the node. The driver container will then persist these variables to the host filesystem on which it runs.
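The injection step could look roughly like the following. This is a minimal sketch, not the code in this PR: the `DriverFeatures` struct, the helper names, and the variable names `GDS_ENABLED` and `GDRCOPY_ENABLED` are assumptions made for illustration, and `EnvVar` is a local stand-in for the Kubernetes `corev1.EnvVar` type so the example has no dependencies.

```go
package main

import (
	"fmt"
	"strconv"
)

// EnvVar is a local stand-in for corev1.EnvVar (assumption: only Name and
// Value matter for this sketch).
type EnvVar struct {
	Name  string
	Value string
}

// String renders the variable in KEY=VALUE form.
func (e EnvVar) String() string {
	return fmt.Sprintf("%s=%s", e.Name, e.Value)
}

// DriverFeatures is a hypothetical summary of the additional drivers enabled
// by the ClusterPolicy or NvidiaDriver CR selected for a node.
type DriverFeatures struct {
	GDSEnabled     bool
	GDRCopyEnabled bool
}

// setEnv overwrites an existing variable with the same name or appends a new one.
func setEnv(env []EnvVar, name, value string) []EnvVar {
	for i := range env {
		if env[i].Name == name {
			env[i].Value = value
			return env
		}
	}
	return append(env, EnvVar{Name: name, Value: value})
}

// injectDriverFeatureEnv records each feature's state in the driver
// container's environment. The variable names are illustrative assumptions.
func injectDriverFeatureEnv(env []EnvVar, f DriverFeatures) []EnvVar {
	env = setEnv(env, "GDS_ENABLED", strconv.FormatBool(f.GDSEnabled))
	env = setEnv(env, "GDRCOPY_ENABLED", strconv.FormatBool(f.GDRCopyEnabled))
	return env
}

func main() {
	// A container that already carries a stale value for one feature.
	env := []EnvVar{{Name: "GDS_ENABLED", Value: "false"}}
	env = injectDriverFeatureEnv(env, DriverFeatures{GDSEnabled: true})
	for _, e := range env {
		fmt.Println(e)
	}
}
```

Writing both the enabled and the disabled state explicitly means a feature that is switched off in the CR overwrites any stale value instead of being silently left in place.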
With this mechanism, each node will record a node-local view of enabled additional drivers, accurately reflecting the features configured for that node via its ClusterPolicy or NvidiaDriver CR.
We are updating the GPU Operator's driver validation logic so that it waits until every enabled driver is installed before proceeding.
The NVIDIA Device Plugin is already resilient to missing devices and drivers: it does not crash if a particular device is absent from the node. We are now updating the Device Plugin to always attempt discovery for all supported devices and driver features.
Checklist
make lint
make validate-generated-assets
make validate-modules
Testing
make coverage
Test details: