[release-4.21]: fix: add timeout to sysfs writes to prevent daemon hang #1182
zeeke wants to merge 2 commits into
Kernel drivers (e.g. i40e) can block indefinitely on a write to `sriov_numvfs` if the device is in a bad state. For example, the following error has been hit on an `Intel XXV710` NIC:

```
Feb 16 13:53:01 worker0 kernel: 06c73374b594186: left promiscuous mode
Feb 16 13:53:01 worker0 kernel: i40e 0000:3b:00.0: Setting MAC 5e:28:32:f0:80:20 on VF 1
Feb 16 13:53:02 worker0 kernel: i40e 0000:3b:00.0: Bring down and up the VF interface to make this change effective.
Feb 16 13:53:02 worker0 kernel: i40e 0000:3b:00.0: Unable to configure VFs, other operation is pending.
Feb 16 13:53:02 worker0 kernel: i40e 0000:3b:00.0: Unable to configure VFs, other operation is pending.
Feb 16 13:53:02 worker0 kernel: 152a5b6a3b44739: left promiscuous mode
```

Replace direct `os.WriteFile` calls in `SetSriovNumVfs` with a new `WriteFileWithTimeout` utility that runs the write in a goroutine and returns a timeout error after 2 minutes.

Signed-off-by: Andrea Panattoni <apanatto@redhat.com>
/jira cherrypick OCPBUGS-78767
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: zeeke
Signed-off-by: Andrea Panattoni <apanatto@redhat.com>
@zeeke: all tests passed!
/title OCPBUGS-78767 [release-4.21]: fix: add timeout to sysfs writes to prevent daemon hang