expandable_segments:True hardcoded in CUDA platform files causes error on older kernel; PYTORCH_CUDA_ALLOC_CONF ignored

I hit an error while following the "Quick Start: Single-Node Deployment Guide" on a machine whose older kernel does not support the `pidfd_getfd` syscall:

**Kernel**:  
`Linux zktitan 5.4.0-204-generic #224-Ubuntu SMP Thu Dec 5 13:38:28 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux`

**Error**:

Exception: Call to collective_rpc method failed: The kernel on this machine does not support the pidfd_getfd syscall needed to use IPC for CUDA tensors when expandable_segments:True is set. Consider using expandable_segments:False via torch.cuda.memory._set_allocator_settings('expandable_segments:False') for this allocation.
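For anyone who wants to check whether their machine is affected: `pidfd_getfd` was added in Linux 5.6, so a 5.4 kernel like the one above does not have it. The snippet below is a rough sketch of a version check; the helper name `kernel_supports_pidfd_getfd` is made up for illustration, and a version comparison is only a heuristic (distribution backports or sandboxing can change the real answer).

```python
import platform
import re


def kernel_supports_pidfd_getfd(release=None):
    """Heuristic: pidfd_getfd was introduced in Linux 5.6.

    `release` is a kernel release string such as "5.4.0-204-generic";
    when omitted, the running kernel's release string is used.
    """
    if release is None:
        release = platform.release()
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        # Unknown format: be conservative and report no support.
        return False
    major, minor = int(match.group(1)), int(match.group(2))
    return (major, minor) >= (5, 6)


if __name__ == "__main__":
    print(platform.release(), kernel_supports_pidfd_getfd())
```

On the kernel reported above, `kernel_supports_pidfd_getfd("5.4.0-204-generic")` evaluates to `False`.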


**What I tried**  
Running with the environment variable `PYTORCH_CUDA_ALLOC_CONF="expandable_segments:False"` had no effect; the same error still occurs.

**Probable cause**  
I found that `expandable_segments:True` is hardcoded in two files:  
- `roll/platforms/cuda.py` (line 41)  
- `mcore_adapter/src/platforms/cuda.py` (line 40)  

Manually editing these files to `False` resolves the issue and allows execution.

**Question**  
Is there a better way to configure this setting (e.g., via an environment variable or a configuration flag) so that I don’t need to modify the source code manually? Thanks!
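One possible shape for such an override, as a sketch only: keep `expandable_segments:True` as the default, but let a user-supplied `PYTORCH_CUDA_ALLOC_CONF` win when it already sets that key. The function name `resolve_alloc_conf` is hypothetical, and I have not checked how the two `cuda.py` files actually apply the setting, so this shows the idea rather than a drop-in patch.

```python
import os


def resolve_alloc_conf(env, default="expandable_segments:True"):
    """Return the allocator config string to use.

    `env` is any mapping shaped like os.environ. If the user already set
    the same key as `default`, their value is kept untouched; otherwise
    the default is appended to whatever they provided.
    """
    user_conf = env.get("PYTORCH_CUDA_ALLOC_CONF", "").strip()
    if not user_conf:
        return default
    # Options are comma-separated "key:value" pairs; collect the keys.
    user_keys = {
        option.split(":", 1)[0].strip()
        for option in user_conf.split(",")
        if option.strip()
    }
    default_key = default.split(":", 1)[0]
    if default_key in user_keys:
        return user_conf
    return f"{user_conf},{default}"


if __name__ == "__main__":
    # Assumption: this would need to run before the first CUDA allocation,
    # since the allocator reads its settings when it initializes.
    print(resolve_alloc_conf(os.environ))
```

With this logic, `PYTORCH_CUDA_ALLOC_CONF="expandable_segments:False"` would be passed through unchanged, while an unset variable would still get the current default.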
