[feat] : validate additional enabled drivers #2014

rahulait · 2025-12-25T05:26:51Z

Dependencies

Depends on: NVIDIA/gpu-driver-container#529
Depends on: NVIDIA/k8s-device-plugin#1550

Description

Problem

GPU Operator supports deploying multiple driver versions within a single Kubernetes cluster through multiple NvidiaDriver custom resources (CRs). However, it currently deploys only a single, cluster-wide NVIDIA Container Toolkit DaemonSet and a single NVIDIA Device Plugin DaemonSet.
This architecture becomes a limitation when different NvidiaDriver CRs enable different driver-dependent features, such as GPUDirect Storage (GDS), GDRCopy, or other optional components. Because the Container Toolkit and Device Plugin are deployed once per cluster and configured uniformly, they cannot be tailored to feature differences across driver instances. As a result, nodes whose drivers have different features enabled cannot be supported correctly or independently.

Proposed solution

During reconciliation in the GPU Operator, we will inject additional driver-enablement environment variables into the nvidia-driver container based on the ClusterPolicy or NvidiaDriver CR selected for the node. The driver container will then persist these variables to the host filesystem on which it runs.
With this mechanism, each node will record a node-local view of enabled additional drivers, accurately reflecting the features configured for that node via its ClusterPolicy or NvidiaDriver CR.

We are updating the gpu-operator's driver validation logic so that it waits for all enabled drivers to be installed before proceeding.

The NVIDIA device plugin is already resilient to missing devices or drivers and does not crash if a particular device is not present on the node. We are now updating the device plugin to always attempt discovery for all supported devices and driver features.

Checklist

No secrets, sensitive information, or unrelated changes
Lint checks passing (make lint)
Generated assets in-sync (make validate-generated-assets)
Go mod artifacts in-sync (make validate-modules)

Testing

Unit tests (make coverage)
Manual cluster testing (describe below)
N/A or Other (docs, CI config, etc.)

Test details:

Signed-off-by: Rahul Sharma <[email protected]>

copy-pr-bot · 2025-12-25T05:26:56Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

validate additional enabled drivers

4457fca

Signed-off-by: Rahul Sharma <[email protected]>

rahulait marked this pull request as ready for review December 25, 2025 05:42

rahulait requested review from ArangoGutierrez, cdesiniotis, elezar, shivamerla and tariq1890 as code owners December 25, 2025 05:42
