I've encountered an error while following the "Quick Start: Single-Node Deployment Guide" on a machine with an older kernel that does not support pidfd_getfd:
Kernel:
Linux zktitan 5.4.0-204-generic #224-Ubuntu SMP Thu Dec 5 13:38:28 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
Error:
Exception: Call to collective_rpc method failed: The kernel on this machine does not support the pidfd_getfd syscall needed to use IPC for CUDA tensors when expandable_segments:True is set. Consider using expandable_segments:False via torch.cuda.memory._set_allocator_settings('expandable_segments:False') for this allocation.
What I tried
Running with the environment variable PYTORCH_CUDA_ALLOC_CONF="expandable_segments:False" had no effect; the same error still occurs.
Probable cause
I found that expandable_segments:True is hardcoded in two files:
roll/platforms/cuda.py (line 41)
mcore_adapter/src/platforms/cuda.py (line 40)
Manually editing these values to False resolves the issue and allows execution to proceed.
Question
Is there a better way to configure this setting (e.g., via an environment variable or a configuration flag) so that I don’t need to modify the source code manually? Thanks!
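For reference, here is a minimal sketch of the kind of change I have in mind. It is purely hypothetical (the function name and default are my own, not from the repo): instead of hardcoding expandable_segments:True, the platform code could parse PYTORCH_CUDA_ALLOC_CONF and fall back to the current default only when the user has not set the option.

```python
import os

def allocator_expandable_segments(default: bool = True) -> bool:
    """Hypothetical helper for roll/platforms/cuda.py: honor the user's
    PYTORCH_CUDA_ALLOC_CONF instead of hardcoding expandable_segments:True.

    PYTORCH_CUDA_ALLOC_CONF is a comma-separated list of key:value pairs,
    e.g. "expandable_segments:False,max_split_size_mb:128".
    """
    conf = os.environ.get("PYTORCH_CUDA_ALLOC_CONF", "")
    for part in conf.split(","):
        key, _, value = part.strip().partition(":")
        if key == "expandable_segments":
            return value.strip().lower() == "true"
    return default  # no user override: keep the current behavior

# The platform code would then build the setting string from this value, e.g.:
# torch.cuda.memory._set_allocator_settings(
#     f"expandable_segments:{allocator_expandable_segments()}")
```

With something like this in place, setting PYTORCH_CUDA_ALLOC_CONF="expandable_segments:False" would work as expected on kernels without pidfd_getfd, and the default behavior would be unchanged for everyone else.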