Does llama.cpp ACTUALLY support pipeline parallelism? #20252
Replies: 5 comments 24 replies
Yes, it is supported. You can read more about how it works in #6017. If you configure it correctly, prompt processing (PP) performance scales nearly linearly with the number of devices, even for a single request.
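For intuition on why a single request can still scale, here is a back-of-the-envelope sketch. It assumes an idealized model that is not taken from the llama.cpp source: the prompt is split into equally sized micro-batches, every device takes the same time per micro-batch, and transfers between devices are free. The function name `ideal_pipeline_speedup` is made up for this illustration.

```python
def ideal_pipeline_speedup(n_devices: int, n_microbatches: int) -> float:
    """Ideal speedup of a pipelined schedule over a strictly sequential one.

    Assumptions (illustrative only): each device holds one pipeline stage,
    each stage needs one equal time step per micro-batch, and copies
    between devices cost nothing.
    """
    if n_devices < 1 or n_microbatches < 1:
        raise ValueError("n_devices and n_microbatches must be >= 1")
    # Sequential: only one device works at a time, so every micro-batch
    # pays for every stage in turn.
    sequential_steps = n_devices * n_microbatches
    # Pipelined: after the pipeline fills (n_devices - 1 steps), one
    # micro-batch finishes per step.
    pipelined_steps = n_devices + n_microbatches - 1
    return sequential_steps / pipelined_steps


if __name__ == "__main__":
    for m in (1, 4, 16, 64):
        print(f"4 devices, {m:>2} micro-batches: {ideal_pipeline_speedup(4, m):.2f}x")
```

Under these assumptions, a prompt that fits in a single micro-batch gets no speedup at all, and the speedup only approaches the device count when the prompt is split into many more micro-batches than there are devices.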
I just tried this on 4xA40 GPUs, and I can see good scaling.
@marlin-oss Hi! Did you figure it out? I see the same thing on dual RX6800. There is exactly zero difference between Seeing that pipeline parallelism gets disabled automatically when you're low on memory, maybe it's also being disabled by something else, just without logging anything?
According to my tests, ROCm is 20% to 2x faster at prompt processing, while Vulkan is 20% to 3x faster at token generation, and Vulkan's token generation speed degrades far less than ROCm's as the context grows. I use both. Vulkan does not see any improvement from enabling pipeline parallelism either. Now I'm pretty sure that the problem is my old-ass hardware. It makes sense that pipeline parallelism would rely on technologies almost everyone has nowadays. (I can't try it on my host system right now because of... Linux reasons 🥲) But I've also just realized that pipeline parallelism does not support partial CPU offloading: neither layers nor MoE experts. I know we'd be bottlenecked by the CPU anyway, but wouldn't it still improve things quite a bit as long as only a few experts are offloaded?
I can't run 7.2.3 on the host, but I've tried 6.3.2 (with an older rocm-bandwidth-test, ROCm 6.3.2 host/Docker
The log says "llama_context: pipeline parallelism enabled". As far as I can tell, with layer split, it's only "batch parallel" or "pipeline sequential". Based on my understanding of the term "pipeline parallel", a model split between N GPUs should be able to process N concurrent requests "roughly" N times faster than it would without pipelining (minus overhead).
With 2 GPUs and 2 requests (prompt processing): as soon as the first GPU has finished its layers for the first request and would otherwise sit idle, it starts processing the next request, like a pipeline. I do not see this behavior. Only 1 GPU is processing at any time.
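To make the two behaviors concrete, here is a toy sketch of what I mean by "pipeline sequential" versus "pipeline parallel". It only models an idealized schedule (one equal time step per GPU per request, no transfer cost); it is not how llama.cpp schedules work internally, and the names `schedule` and `render` are invented for this illustration.

```python
from typing import List, Optional

Timeline = List[List[Optional[int]]]  # timeline[step][gpu] -> request index or None (idle)


def schedule(n_gpus: int, n_requests: int, pipelined: bool) -> Timeline:
    """Build an idealized timeline for a model split by layer across n_gpus."""
    timeline: Timeline = []
    if pipelined:
        # Request r reaches GPU g at step r + g, so GPUs overlap on different requests.
        for t in range(n_gpus + n_requests - 1):
            step = []
            for g in range(n_gpus):
                r = t - g
                step.append(r if 0 <= r < n_requests else None)
            timeline.append(step)
    else:
        # Each request walks through every GPU before the next one starts.
        for r in range(n_requests):
            for g in range(n_gpus):
                step = [None] * n_gpus
                step[g] = r
                timeline.append(step)
    return timeline


def render(timeline: Timeline) -> str:
    """Draw one row per GPU; 'R<i>' means busy with request i, '--' means idle."""
    rows = []
    for g in range(len(timeline[0])):
        cells = ["--" if step[g] is None else f"R{step[g]}" for step in timeline]
        rows.append(f"GPU{g}: " + " ".join(cells))
    return "\n".join(rows)


if __name__ == "__main__":
    print("what I observe (sequential):")
    print(render(schedule(2, 2, pipelined=False)))
    print("what I expect (pipelined):")
    print(render(schedule(2, 2, pipelined=True)))
```

In this toy model the sequential case takes 4 steps with one GPU busy at a time, while the pipelined case takes 3 steps, with both GPUs busy in the middle step.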
I've tried every combination of flags I can think of. Is there a build flag?
I appreciate any help.