[release-4.21]: fix: add timeout to sysfs writes to prevent daemon hang #1182

Open: zeeke wants to merge 2 commits into openshift:release-4.21 from zeeke:worktree-backport-sysfs-timeout-4.21

@zeeke zeeke commented Mar 24, 2026

Kernel drivers (e.g. i40e) can block indefinitely when writing to sriov_numvfs if the
device is in a bad state. For example, the following error has been hit on an `Intel XXV710` NIC:

```
Feb 16 13:53:01 worker0 kernel: 06c73374b594186: left promiscuous mode
Feb 16 13:53:01 worker0 kernel: i40e 0000:3b:00.0: Setting MAC 5e:28:32:f0:80:20 on VF 1
Feb 16 13:53:02 worker0 kernel: i40e 0000:3b:00.0: Bring down and up the VF interface to make this change effective.
Feb 16 13:53:02 worker0 kernel: i40e 0000:3b:00.0: Unable to configure VFs, other operation is pending.
Feb 16 13:53:02 worker0 kernel: i40e 0000:3b:00.0: Unable to configure VFs, other operation is pending.
Feb 16 13:53:02 worker0 kernel: 152a5b6a3b44739: left promiscuous mode
```

Replace direct `os.WriteFile` calls in `SetSriovNumVfs` with a new `WriteFileWithTimeout` utility that
runs the write in a goroutine and returns a timeout error if the write has not completed within 2 minutes.

Signed-off-by: Andrea Panattoni <apanatto@redhat.com>

zeeke commented Mar 24, 2026

/jira cherrypick OCPBUGS-78767

@openshift-ci openshift-ci Bot requested review from Billy99 and MrSanketkumar March 24, 2026 09:04

openshift-ci Bot commented Mar 24, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: zeeke

@openshift-ci openshift-ci Bot added the approved label Mar 24, 2026

openshift-ci Bot commented Mar 24, 2026

@zeeke: all tests passed!

SchSeba commented Mar 26, 2026

/title OCPBUGS-78767 [release-4.21]: fix: add timeout to sysfs writes to prevent daemon hang