
expandable_segments:True hardcoded in CUDA platform files causes error on older kernel; PYTORCH_CUDA_ALLOC_CONF ignored #408

@cchen-hhz


I encountered an error while following the "Quick Start: Single-Node Deployment Guide" on a machine with an older kernel that does not support the pidfd_getfd syscall:

Kernel:
Linux zktitan 5.4.0-204-generic #224-Ubuntu SMP Thu Dec 5 13:38:28 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

Error:

Exception: Call to collective_rpc method failed: The kernel on this machine does not support the pidfd_getfd syscall needed to use IPC for CUDA tensors when expandable_segments:True is set. Consider using expandable_segments:False via torch.cuda.memory._set_allocator_settings('expandable_segments:False') for this allocation.
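For anyone who wants to check whether their machine is affected: pidfd_getfd was added in Linux 5.6, so a 5.4 kernel like the one above does not have it. Below is a minimal sketch of a version check. The function name `kernel_supports_pidfd_getfd` is made up for illustration, and comparing the release string is only a heuristic (a distribution could backport the syscall, or a sandbox could block it), so treat the result as a hint rather than a guarantee.

```python
import platform
import re


def kernel_supports_pidfd_getfd(release=None):
    """Heuristic: True if the kernel release string is 5.6 or newer.

    `release` defaults to the running kernel, e.g. "5.4.0-204-generic".
    This only inspects the version number; it does not call the syscall.
    """
    release = platform.release() if release is None else release
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        # Unrecognized format: be conservative and report no support.
        return False
    major, minor = int(match.group(1)), int(match.group(2))
    return (major, minor) >= (5, 6)


if __name__ == "__main__":
    print(platform.release(), kernel_supports_pidfd_getfd())
```

On the kernel reported in this issue, `kernel_supports_pidfd_getfd("5.4.0-204-generic")` evaluates to `False`.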

What I tried
Running with the environment variable PYTORCH_CUDA_ALLOC_CONF="expandable_segments:False" had no effect; the same error still occurs.

Probable cause
I found that expandable_segments:True is hardcoded in two files:

  • roll/platforms/cuda.py (line 41)
  • mcore_adapter/src/platforms/cuda.py (line 40)

Manually changing the value to False in both files resolves the issue and allows execution.
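One possible direction, sketched under assumptions: treat expandable_segments:True as a default and let any value the user already set in PYTORCH_CUDA_ALLOC_CONF take precedence. I have not checked how the two platform files actually apply the setting, so the names below (`DEFAULT_ALLOC_CONF`, `split_top_level`, `parse_alloc_conf`, `resolve_alloc_conf`) are hypothetical helpers, not existing project code.

```python
import os

# Assumed project default; the user's environment variable overrides it.
DEFAULT_ALLOC_CONF = {"expandable_segments": "True"}


def split_top_level(value):
    """Split on commas that are not inside [...] (some options take lists)."""
    parts, depth, current = [], 0, []
    for ch in value:
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
        if ch == "," and depth == 0:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    parts.append("".join(current))
    return parts


def parse_alloc_conf(value):
    """Parse 'key1:val1,key2:val2' into a dict, skipping empty entries."""
    conf = {}
    for item in split_top_level(value):
        item = item.strip()
        if not item:
            continue
        key, _, val = item.partition(":")
        conf[key.strip()] = val.strip()
    return conf


def resolve_alloc_conf(env=None):
    """Merge the defaults with the user's PYTORCH_CUDA_ALLOC_CONF (user wins)."""
    env = os.environ if env is None else env
    merged = dict(DEFAULT_ALLOC_CONF)
    merged.update(parse_alloc_conf(env.get("PYTORCH_CUDA_ALLOC_CONF", "")))
    return ",".join(f"{key}:{val}" for key, val in merged.items())


if __name__ == "__main__":
    # Would need to run before the CUDA allocator is initialized.
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = resolve_alloc_conf()
    print(os.environ["PYTORCH_CUDA_ALLOC_CONF"])
```

With this approach, setting PYTORCH_CUDA_ALLOC_CONF="expandable_segments:False" would keep the value at False, while an unset variable would still yield the current default of True.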

Question
Is there a better way to configure this setting (e.g., via an environment variable or a configuration flag) so that I don’t need to modify the source code manually? Thanks!
