
HIP_VISIBLE_DEVICES not set on Linux for ROCm backends, causing multi-GPU ComfyUI backends to fail #1211

@marcushoff

Description

Expected Behavior

When configuring multiple ComfyUI backends with different GPU IDs on Linux with ROCm, expect to run a backend per GPU.

Actual Behavior

The first ComfyUI backend starts, but the second and subsequent ComfyUI backends fail to start with the error:

RuntimeError: No HIP GPUs are available

Steps to Reproduce

Environment

  • OS: Linux (Debian 12)
  • GPU: AMD GPUs with ROCm
  • SwarmUI Version: 0.9.7.4
  • ComfyUI: Self-starting backend with ROCm support

  1. Configure a ComfyUI backend with GPU_ID: 0 - works fine
  2. Configure a second ComfyUI backend with GPU_ID: 1 - fails to start
  3. Check the logs, which show: RuntimeError: No HIP GPUs are available

Debug Logs

16:58:50.608 [Warning] User local requested edit of backend 3.
16:58:50.616 [Info] ComfyUI backend 3 shutting down...
16:58:52.479 [Init] Initializing backend #3 - ComfyUI Self-Starting...
16:58:52.989 [Init] Self-Start ComfyUI-3 on port 7823 is loading...
16:58:54.641 [Warning] [ComfyUI-3/STDERR] Traceback (most recent call last):
16:58:54.641 [Warning] [ComfyUI-3/STDERR]   File "/SwarmUI/dlbackend/ComfyUI/main.py", line 177, in <module>
16:58:54.641 [Warning] [ComfyUI-3/STDERR]     import execution
16:58:54.641 [Warning] [ComfyUI-3/STDERR]   File "/SwarmUI/dlbackend/ComfyUI/execution.py", line 15, in <module>
16:58:54.641 [Warning] [ComfyUI-3/STDERR]     import comfy.model_management
16:58:54.641 [Warning] [ComfyUI-3/STDERR]   File "/SwarmUI/dlbackend/ComfyUI/comfy/model_management.py", line 239, in <module>
16:58:54.641 [Warning] [ComfyUI-3/STDERR]     total_vram = get_total_memory(get_torch_device()) / (1024 * 1024)
16:58:54.641 [Warning] [ComfyUI-3/STDERR]                                   ^^^^^^^^^^^^^^^^^^
16:58:54.641 [Warning] [ComfyUI-3/STDERR]   File "/SwarmUI/dlbackend/ComfyUI/comfy/model_management.py", line 189, in get_torch_device
16:58:54.641 [Warning] [ComfyUI-3/STDERR]     return torch.device(torch.cuda.current_device())
16:58:54.641 [Warning] [ComfyUI-3/STDERR]                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^
16:58:54.641 [Warning] [ComfyUI-3/STDERR]   File "/SwarmUI/dlbackend/ComfyUI/venv/lib/python3.11/site-packages/torch/cuda/__init__.py", line 1150, in current_device
16:58:54.641 [Warning] [ComfyUI-3/STDERR]     _lazy_init()
16:58:54.641 [Warning] [ComfyUI-3/STDERR]   File "/SwarmUI/dlbackend/ComfyUI/venv/lib/python3.11/site-packages/torch/cuda/__init__.py", line 480, in _lazy_init
16:58:54.641 [Warning] [ComfyUI-3/STDERR]     torch._C._cuda_init()
16:58:54.641 [Warning] [ComfyUI-3/STDERR] RuntimeError: No HIP GPUs are available
16:58:55.048 [Info] Self-Start ComfyUI-3 unexpectedly exited (ExitCode=unknown) (if something failed, change setting LogLevel to Debug to see why!)
16:58:55.048 [Info] Self-Start ComfyUI-3 had errors before shutdown:
[ComfyUI-3/STDERR] Set cuda device to: 1
[ComfyUI-3/STDERR] Adding extra search path checkpoints /SwarmUI/Models/Stable-Diffusion
[ComfyUI-3/STDERR] Adding extra search path vae /SwarmUI/Models/VAE
[ComfyUI-3/STDERR] Adding extra search path loras /SwarmUI/Models/Lora
[ComfyUI-3/STDERR] Adding extra search path loras /SwarmUI/Models/LyCORIS
[ComfyUI-3/STDERR] Adding extra search path upscale_models /SwarmUI/Models/ESRGAN
[ComfyUI-3/STDERR] Adding extra search path upscale_models /SwarmUI/Models/RealESRGAN
[ComfyUI-3/STDERR] Adding extra search path upscale_models /SwarmUI/Models/SwinIR
[ComfyUI-3/STDERR] Adding extra search path upscale_models /SwarmUI/Models/upscale-models
[ComfyUI-3/STDERR] Adding extra search path upscale_models /SwarmUI/Models/upscale_models
[ComfyUI-3/STDERR] Adding extra search path embeddings /SwarmUI/Models/Embeddings
[ComfyUI-3/STDERR] Adding extra search path embeddings /SwarmUI/Models/embeddings
[ComfyUI-3/STDERR] Adding extra search path hypernetworks /SwarmUI/Models/hypernetworks
[ComfyUI-3/STDERR] Adding extra search path controlnet /SwarmUI/Models/controlnet
[ComfyUI-3/STDERR] Adding extra search path controlnet /SwarmUI/Models/model_patches
[ComfyUI-3/STDERR] Adding extra search path controlnet /SwarmUI/Models/ControlNet
[ComfyUI-3/STDERR] Adding extra search path model_patches /SwarmUI/Models/controlnet
[ComfyUI-3/STDERR] Adding extra search path model_patches /SwarmUI/Models/model_patches
[ComfyUI-3/STDERR] Adding extra search path model_patches /SwarmUI/Models/ControlNet
[ComfyUI-3/STDERR] Adding extra search path clip /SwarmUI/Models/text_encoders
[ComfyUI-3/STDERR] Adding extra search path clip /SwarmUI/Models/clip
[ComfyUI-3/STDERR] Adding extra search path clip /SwarmUI/Models/CLIP
[ComfyUI-3/STDERR] Adding extra search path clip_vision /SwarmUI/Models/clip_vision
[ComfyUI-3/STDERR] Adding extra search path unet /SwarmUI/Models/unet
[ComfyUI-3/STDERR] Adding extra search path diffusion_models /SwarmUI/Models/diffusion_models
[ComfyUI-3/STDERR] Adding extra search path gligen /SwarmUI/Models/gligen
[ComfyUI-3/STDERR] Adding extra search path ipadapter /SwarmUI/Models/ipadapter
[ComfyUI-3/STDERR] Adding extra search path yolov8 /SwarmUI/Models/yolov8
[ComfyUI-3/STDERR] Adding extra search path tensorrt /SwarmUI/Models/tensorrt
[ComfyUI-3/STDERR] Adding extra search path clipseg /SwarmUI/Models/clipseg
[ComfyUI-3/STDERR] Adding extra search path style_models /SwarmUI/Models/style_models
[ComfyUI-3/STDERR] Adding extra search path latent_upscale_models /SwarmUI/Models/latent_upscale_models
[ComfyUI-3/STDERR] Adding extra search path custom_nodes /SwarmUI/src/BuiltinExtensions/ComfyUIBackend/DLNodes
[ComfyUI-3/STDERR] Adding extra search path custom_nodes /SwarmUI/src/BuiltinExtensions/ComfyUIBackend/ExtraNodes
[ComfyUI-3/STDERR] Checkpoint files will always be loaded safely.
[ComfyUI-3/STDERR] /opt/amdgpu/share/libdrm/amdgpu.ids: No such file or directory
[ComfyUI-3/STDERR] /opt/amdgpu/share/libdrm/amdgpu.ids: No such file or directory
[ComfyUI-3/STDERR] /opt/amdgpu/share/libdrm/amdgpu.ids: No such file or directory
[ComfyUI-3/STDERR] /opt/amdgpu/share/libdrm/amdgpu.ids: No such file or directory
[ComfyUI-3/STDERR] Traceback (most recent call last):
[ComfyUI-3/STDERR]   File "/SwarmUI/dlbackend/ComfyUI/main.py", line 177, in <module>
[ComfyUI-3/STDERR]     import execution
[ComfyUI-3/STDERR]   File "/SwarmUI/dlbackend/ComfyUI/execution.py", line 15, in <module>
[ComfyUI-3/STDERR]     import comfy.model_management
[ComfyUI-3/STDERR]   File "/SwarmUI/dlbackend/ComfyUI/comfy/model_management.py", line 239, in <module>
[ComfyUI-3/STDERR]     total_vram = get_total_memory(get_torch_device()) / (1024 * 1024)
[ComfyUI-3/STDERR]                                   ^^^^^^^^^^^^^^^^^^
[ComfyUI-3/STDERR]   File "/SwarmUI/dlbackend/ComfyUI/comfy/model_management.py", line 189, in get_torch_device
[ComfyUI-3/STDERR]     return torch.device(torch.cuda.current_device())
[ComfyUI-3/STDERR]                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[ComfyUI-3/STDERR]   File "/SwarmUI/dlbackend/ComfyUI/venv/lib/python3.11/site-packages/torch/cuda/__init__.py", line 1150, in current_device
[ComfyUI-3/STDERR]     _lazy_init()
[ComfyUI-3/STDERR]   File "/SwarmUI/dlbackend/ComfyUI/venv/lib/python3.11/site-packages/torch/cuda/__init__.py", line 480, in _lazy_init
[ComfyUI-3/STDERR]     torch._C._cuda_init()
[ComfyUI-3/STDERR] RuntimeError: No HIP GPUs are available

Other

Root Cause

In NetworkBackendUtils.cs, the HIP_VISIBLE_DEVICES environment variable is only set on Windows (lines 374-377), not on Linux. However, ComfyUI with ROCm requires HIP_VISIBLE_DEVICES to be set on Linux as well.

Additionally, when ROCR_VISIBLE_DEVICES is set to a single GPU ID (e.g., ROCR_VISIBLE_DEVICES=1), that GPU becomes visible as device 0 to the process. Therefore, HIP_VISIBLE_DEVICES must be set to 0 (not the original GPU ID) to correctly access the restricted GPU.
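A minimal Python sketch (no GPU or ROCm required; `visible_devices` is a hypothetical model of the runtime's enumeration, not actual ROCm code) illustrating why the restricted GPU must be addressed as device 0:

```python
# Hypothetical model of how the ROCm runtime enumerates devices:
# ROCR_VISIBLE_DEVICES filters the physical GPUs, and the survivors are
# re-indexed from 0; HIP_VISIBLE_DEVICES then selects within that view.

def visible_devices(physical_gpus, rocr_visible, hip_visible):
    # ROCR restriction: keep only the listed physical GPUs, re-indexed from 0
    rocr_view = [physical_gpus[i] for i in rocr_visible]
    # HIP selection applies to the re-indexed view, not to physical IDs
    return [rocr_view[i] for i in hip_visible]

gpus = ["GPU0", "GPU1", "GPU2", "GPU3"]

# ROCR_VISIBLE_DEVICES=1, HIP_VISIBLE_DEVICES=0 -> physical GPU 1
print(visible_devices(gpus, rocr_visible=[1], hip_visible=[0]))  # ['GPU1']

# ROCR_VISIBLE_DEVICES=1, HIP_VISIBLE_DEVICES=1 -> out of range: only one
# device survives the ROCR filter, so HIP index 1 does not exist -- the
# process ends up with no usable device ("No HIP GPUs are available")
try:
    visible_devices(gpus, rocr_visible=[1], hip_visible=[1])
except IndexError:
    print("no device at HIP index 1")
```

This matches the observed behavior: passing the original GPU ID through to HIP_VISIBLE_DEVICES only works for GPU 0, where the physical and re-indexed IDs happen to coincide.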

Current Code

// In NetworkBackendUtils.cs, line ~372-378
PythonLaunchHelper.CleanEnvironmentOfPythonMess(start, $"({nameSimple} launch) ");
start.Environment["CUDA_VISIBLE_DEVICES"] = $"{gpuId}";
if (RuntimeInformation.IsOSPlatform(OSPlatform.Windows))
{
    start.Environment["HIP_VISIBLE_DEVICES"] = $"{gpuId}";
}
start.Environment["ROCR_VISIBLE_DEVICES"] = $"{gpuId}";

Proposed Solution

Set HIP_VISIBLE_DEVICES=0 on Linux when ROCR_VISIBLE_DEVICES restricts to a single GPU, since that GPU becomes device 0 in the process's view:

// In NetworkBackendUtils.cs, line ~372-378
PythonLaunchHelper.CleanEnvironmentOfPythonMess(start, $"({nameSimple} launch) ");
start.Environment["CUDA_VISIBLE_DEVICES"] = $"{gpuId}";
// For ROCm/HIP, when ROCR_VISIBLE_DEVICES restricts to a single GPU, that GPU becomes device 0
// So we need to set HIP_VISIBLE_DEVICES=0 to access it
start.Environment["HIP_VISIBLE_DEVICES"] = RuntimeInformation.IsOSPlatform(OSPlatform.Windows) ? $"{gpuId}" : "0";
start.Environment["ROCR_VISIBLE_DEVICES"] = $"{gpuId}";
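The same remapping can be sanity-checked without touching the C# code by exporting the variables manually before launching ComfyUI (values and the launch command here are illustrative, not taken from SwarmUI's actual launcher):

```shell
# Illustrative manual launch targeting physical GPU 1 on Linux/ROCm:
# ROCR restricts the process to one physical GPU, which it then sees as device 0.
gpu_id=1
export CUDA_VISIBLE_DEVICES="$gpu_id"
export ROCR_VISIBLE_DEVICES="$gpu_id"   # restrict to one physical GPU
export HIP_VISIBLE_DEVICES=0            # the surviving GPU is re-indexed to 0
echo "ROCR=$ROCR_VISIBLE_DEVICES HIP=$HIP_VISIBLE_DEVICES"
# python main.py --port 7823 ...        # then start ComfyUI as usual
```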

Testing

After applying this fix on the server:

  • Backend with GPU_ID: 0 starts successfully ✓
  • Backend with GPU_ID: 1 starts successfully ✓
  • Backend with GPU_ID: 2 starts successfully ✓
  • Backend with GPU_ID: 3 starts successfully ✓

All four backends can now run simultaneously, each using a different GPU.
