Does llama.cpp ACTUALLY support pipeline parallelism? #20252

marlin-oss · 2026-03-08T23:11:51Z

The log says "llama_context: pipeline parallelism enabled". As far as I can tell, with layer split, it's only "batch parallel" or "pipeline sequential". Based on my understanding of the term "pipeline parallel", a model split between N GPUs should be able to process N concurrent requests with roughly N times the throughput of a single request (minus overhead).

With 2 GPUs and 2 Requests (prompt processing):

Step 1:
- GPU 1: processing request 1
- GPU 2: idle
Step 2:
- GPU 1: processing request 2
- GPU 2: processing request 1
Step 3:
- GPU 1: processing request 1
- GPU 2: processing request 2
...
Last step:
- GPU 1: idle
- GPU 2: processing request 2

Whenever a GPU would otherwise be idle, it starts processing the next request, like a pipeline. I do not see this behavior: only one GPU is processing at any time.
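
The schedule described above can be sketched as a small simulation. This is only an illustration of the general pipelining idea, not llama.cpp's scheduler: it assumes equal-cost stages, treats each request (or chunk of a request) as one work item, and the function name `pipeline_schedule` is hypothetical.

```python
def pipeline_schedule(n_stages, n_items):
    """Return one row per step; each row lists the 1-based item a stage works on, or None if idle."""
    n_steps = n_items + n_stages - 1  # fill, steady state, and drain
    schedule = []
    for step in range(n_steps):
        row = []
        for stage in range(n_stages):
            item = step - stage  # item i reaches stage s at step i + s
            row.append(item + 1 if 0 <= item < n_items else None)
        schedule.append(row)
    return schedule


# Print the 2-GPU, 2-item case in the same layout as the steps listed above.
for step, row in enumerate(pipeline_schedule(2, 2), start=1):
    cells = [
        f"GPU {gpu}: " + ("idle" if item is None else f"item {item}")
        for gpu, item in enumerate(row, start=1)
    ]
    print(f"Step {step}: " + ", ".join(cells))
```

With more items than stages, the idle slots at the start and end become a small fraction of the total, which is where the expected speedup comes from.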

I've tried every combination of flags I can think of. Is there a build flag?

I appreciate any help.

ggerganov · 2026-03-09T09:03:39Z

Maintainer

Yes, it is supported. You can read more about how it works in #6017. If you configure it correctly, the PP performance scales nearly linearly with the number of devices, even for a single request.

marlin-oss Mar 9, 2026
Author

Thanks for the reply.
If I understand the flags correctly, setting batch = ubatch should disable pipeline parallelism?
With 2 GPUs and layer split, I see no performance difference between -ub 512 -b 512 and -ub 512 -b 1024 in llama-server or llama-batched-bench. Does this mean that it isn't being enabled on my system?
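
To make the expectation behind this question concrete, here is a rough model of the best-case gain. It assumes, as the question does, that a logical batch is split into ceil(n_batch / n_ubatch) micro-batches and that only those micro-batches can overlap across devices; it also assumes equal-cost stages and no transfer overhead. The function name is hypothetical, and this is not a description of how llama.cpp actually schedules work.

```python
import math


def ideal_pipeline_speedup(n_batch, n_ubatch, n_devices):
    """Best-case speedup of a pipelined run over running the stages strictly one after another."""
    n_micro = math.ceil(n_batch / n_ubatch)   # micro-batches per logical batch
    sequential_steps = n_micro * n_devices    # every micro-batch visits every device, no overlap
    pipelined_steps = n_micro + n_devices - 1  # fill + steady state + drain
    return sequential_steps / pipelined_steps


print(ideal_pipeline_speedup(512, 512, 2))   # one micro-batch: nothing to overlap
print(ideal_pipeline_speedup(1024, 512, 2))  # two micro-batches on two devices
```

Under these assumptions, -b 512 -ub 512 gives no overlap at all, while -b 1024 -ub 512 could reach at most 4/3 of the sequential throughput on two devices.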

In llama-batched-bench -ub 512 -b 512:

main: n_kv_max = 65536, n_batch = 512, n_ubatch = 512, flash_attn = -1, is_pp_shared = 0, is_tg_separate = 0, n_gpu_layers = 999, n_threads = 32, n_threads_batch = 32

|    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
|  8192 |     32 |    1 |   8224 |   19.837 |   412.96 |    3.483 |     9.19 |   23.320 |   352.66 |
|  8192 |     32 |    2 |  16448 |   38.933 |   420.82 |    3.729 |    17.16 |   42.662 |   385.54 |

-ub 512 -b 1024

main: n_kv_max = 65536, n_batch = 1024, n_ubatch = 512, flash_attn = -1, is_pp_shared = 0, is_tg_separate = 0, n_gpu_layers = 999, n_threads = 32, n_threads_batch = 32

|    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
|  8192 |     32 |    1 |   8224 |   20.005 |   409.51 |    3.434 |     9.32 |   23.438 |   350.88 |
|  8192 |     32 |    2 |  16448 |   39.201 |   417.95 |    3.688 |    17.35 |   42.889 |   383.50 |

This matches the performance I see in llama-server
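
As a quick check on how close the two runs are, the relative change in prompt-processing speed can be computed from the S_PP columns of the two tables above. The numbers are copied from those tables; the variable and function names are only illustrative.

```python
# S_PP (t/s) from the two llama-batched-bench tables, keyed by B (number of parallel sequences).
s_pp_b512 = {1: 412.96, 2: 420.82}   # -ub 512 -b 512
s_pp_b1024 = {1: 409.51, 2: 417.95}  # -ub 512 -b 1024


def relative_change_percent(baseline, candidate):
    """Signed change of candidate relative to baseline, in percent."""
    return (candidate - baseline) / baseline * 100.0


for b in sorted(s_pp_b512):
    change = relative_change_percent(s_pp_b512[b], s_pp_b1024[b])
    print(f"B={b}: {change:+.2f}%")
```

Both differences are below 1% in magnitude, so the larger batch size made no measurable difference in these runs.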

Full log -ub 512 -b 512

./llama-batched-bench --device CUDA1,CUDA3 --model '/mnt/tmpfs/Devstral-Small-2-24B-Instruct-2512-UD-Q8_K_XL.gguf' -ngl 999 -c 65536 -ub 512 -b 512 -npp 8192 -ntg 32 -npl 1,2 -ts 1,1 -sm layer
ggml_cuda_init: found 4 CUDA devices:
  Device 0: NVIDIA GeForce RTX 2080 Ti, compute capability 7.5, VMM: yes
  Device 1: Tesla P40, compute capability 6.1, VMM: yes
  Device 2: Tesla P40, compute capability 6.1, VMM: yes
  Device 3: Tesla P40, compute capability 6.1, VMM: yes
build: 8233 (c5a778891) with GNU 12.3.0 for Linux x86_64
llama_model_load_from_file_impl: using device CUDA1 (Tesla P40) (0000:01:00.0) - 24292 MiB free
llama_model_load_from_file_impl: using device CUDA3 (Tesla P40) (0000:03:00.0) - 24292 MiB free
llama_model_loader: loaded meta data with 54 key-value pairs and 363 tensors from /mnt/tmpfs/Devstral-Small-2-24B-Instruct-2512-UD-Q8_K_XL.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = mistral3
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Devstral-Small-2-24B-Instruct-2512
llama_model_loader: - kv   3:                            general.version str              = 2512
llama_model_loader: - kv   4:                           general.finetune str              = Instruct
llama_model_loader: - kv   5:                           general.basename str              = Devstral-Small-2-24B-Instruct-2512
llama_model_loader: - kv   6:                       general.quantized_by str              = Unsloth
llama_model_loader: - kv   7:                         general.size_label str              = 24B
llama_model_loader: - kv   8:                            general.license str              = apache-2.0
llama_model_loader: - kv   9:                           general.repo_url str              = https://huggingface.co/unsloth
llama_model_loader: - kv  10:                   general.base_model.count u32              = 1
llama_model_loader: - kv  11:                  general.base_model.0.name str              = Devstral Small 2 24B Instruct 2512
llama_model_loader: - kv  12:               general.base_model.0.version str              = 2512
llama_model_loader: - kv  13:          general.base_model.0.organization str              = Mistralai
llama_model_loader: - kv  14:              general.base_model.0.repo_url str              = https://huggingface.co/mistralai/Devs...
llama_model_loader: - kv  15:                               general.tags arr[str,2]       = ["mistral-common", "unsloth"]
llama_model_loader: - kv  16:                       mistral3.block_count u32              = 40
llama_model_loader: - kv  17:                    mistral3.context_length u32              = 393216
llama_model_loader: - kv  18:                  mistral3.embedding_length u32              = 5120
llama_model_loader: - kv  19:               mistral3.feed_forward_length u32              = 32768
llama_model_loader: - kv  20:              mistral3.attention.head_count u32              = 32
llama_model_loader: - kv  21:           mistral3.attention.head_count_kv u32              = 8
llama_model_loader: - kv  22:                    mistral3.rope.freq_base f32              = 100000000.000000
llama_model_loader: - kv  23:  mistral3.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  24:              mistral3.attention.key_length u32              = 128
llama_model_loader: - kv  25:            mistral3.attention.value_length u32              = 128
llama_model_loader: - kv  26:              mistral3.rope.dimension_count u32              = 128
llama_model_loader: - kv  27:                 mistral3.rope.scaling.type str              = yarn
llama_model_loader: - kv  28:               mistral3.rope.scaling.factor f32              = 48.000000
llama_model_loader: - kv  29:       mistral3.rope.scaling.yarn_beta_fast f32              = 32.000000
llama_model_loader: - kv  30:       mistral3.rope.scaling.yarn_beta_slow f32              = 1.000000
llama_model_loader: - kv  31:  mistral3.rope.scaling.yarn_log_multiplier f32              = 1.000000
llama_model_loader: - kv  32: mistral3.rope.scaling.original_context_length u32              = 8192
llama_model_loader: - kv  33:       mistral3.attention.temperature_scale f32              = 0.100000
llama_model_loader: - kv  34:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  35:                         tokenizer.ggml.pre str              = tekken
llama_model_loader: - kv  36:                      tokenizer.ggml.merges arr[str,269443]  = ["Ġ Ġ", "Ġ t", "e r", "i n", "Ġ �...
llama_model_loader: - kv  37:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  38:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  39:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  40:            tokenizer.ggml.padding_token_id u32              = 11
llama_model_loader: - kv  41:                      tokenizer.ggml.tokens arr[str,131072]  = ["<unk>", "<s>", "</s>", "[INST]", "[...
llama_model_loader: - kv  42:                      tokenizer.ggml.scores arr[i32,131072]  = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
llama_model_loader: - kv  43:                  tokenizer.ggml.token_type arr[i32,131072]  = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv  44:                        mistral3.vocab_size u32              = 131072
llama_model_loader: - kv  45:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  46:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  47:                    tokenizer.chat_template str              = {#- Unsloth template fixes #}\n{%- set...
llama_model_loader: - kv  48:               general.quantization_version u32              = 2
llama_model_loader: - kv  49:                          general.file_type u32              = 7
llama_model_loader: - kv  50:                      quantize.imatrix.file str              = Devstral-Small-2-24B-Instruct-2512-GG...
llama_model_loader: - kv  51:                   quantize.imatrix.dataset str              = unsloth_calibration_Devstral-Small-2-...
llama_model_loader: - kv  52:             quantize.imatrix.entries_count u32              = 280
llama_model_loader: - kv  53:              quantize.imatrix.chunks_count u32              = 75
llama_model_loader: - type  f32:   81 tensors
llama_model_loader: - type  f16:   67 tensors
llama_model_loader: - type q8_0:  215 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q8_0
print_info: file size   = 26.99 GiB (9.84 BPW)
load: 0 unused tokens
load: printing all EOG tokens:
load:   - 2 ('</s>')
load: special tokens cache size = 1000
load: token to piece cache size = 0.8498 MB
print_info: arch                  = mistral3
print_info: vocab_only            = 0
print_info: no_alloc              = 0
print_info: n_ctx_train           = 393216
print_info: n_embd                = 5120
print_info: n_embd_inp            = 5120
print_info: n_layer               = 40
print_info: n_head                = 32
print_info: n_head_kv             = 8
print_info: n_rot                 = 128
print_info: n_swa                 = 0
print_info: is_swa_any            = 0
print_info: n_embd_head_k         = 128
print_info: n_embd_head_v         = 128
print_info: n_gqa                 = 4
print_info: n_embd_k_gqa          = 1024
print_info: n_embd_v_gqa          = 1024
print_info: f_norm_eps            = 0.0e+00
print_info: f_norm_rms_eps        = 1.0e-05
print_info: f_clamp_kqv           = 0.0e+00
print_info: f_max_alibi_bias      = 0.0e+00
print_info: f_logit_scale         = 0.0e+00
print_info: f_attn_scale          = 0.0e+00
print_info: n_ff                  = 32768
print_info: n_expert              = 0
print_info: n_expert_used         = 0
print_info: n_expert_groups       = 0
print_info: n_group_used          = 0
print_info: causal attn           = 1
print_info: pooling type          = 0
print_info: rope type             = 0
print_info: rope scaling          = yarn
print_info: freq_base_train       = 100000000.0
print_info: freq_scale_train      = 0.0208333
print_info: n_ctx_orig_yarn       = 8192
print_info: rope_yarn_log_mul     = 1.0000
print_info: rope_finetuned        = unknown
print_info: model type            = 14B
print_info: model params          = 23.57 B
print_info: general.name          = Devstral-Small-2-24B-Instruct-2512
print_info: vocab type            = BPE
print_info: n_vocab               = 131072
print_info: n_merges              = 269443
print_info: BOS token             = 1 '<s>'
print_info: EOS token             = 2 '</s>'
print_info: UNK token             = 0 '<unk>'
print_info: PAD token             = 11 '<pad>'
print_info: LF token              = 1010 'Ċ'
print_info: EOG token             = 2 '</s>'
print_info: max token length      = 150
load_tensors: loading model tensors, this can take a while... (mmap = true, direct_io = false)
load_tensors: offloading output layer to GPU
load_tensors: offloading 39 repeating layers to GPU
load_tensors: offloaded 41/41 layers to GPU
load_tensors:   CPU_Mapped model buffer size =  1280.00 MiB
load_tensors:        CUDA1 model buffer size = 13345.20 MiB
load_tensors:        CUDA3 model buffer size = 13016.07 MiB
.........................................................................................
llama_context: constructing llama_context
llama_context: setting new yarn_attn_factor = 1.0000 (mscale == 1.0, mscale_all_dim = 1.0)
llama_context: n_seq_max     = 2
llama_context: n_ctx         = 65536
llama_context: n_ctx_seq     = 32768
llama_context: n_batch       = 512
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = auto
llama_context: kv_unified    = false
llama_context: freq_base     = 100000000.0
llama_context: freq_scale    = 0.0208333
llama_context: n_ctx_seq (32768) < n_ctx_train (393216) -- the full capacity of the model will not be utilized
llama_context:  CUDA_Host  output buffer size =     1.00 MiB
llama_kv_cache:      CUDA1 KV buffer size =  5376.00 MiB
llama_kv_cache:      CUDA3 KV buffer size =  4864.00 MiB
llama_kv_cache: size = 10240.00 MiB ( 32768 cells,  40 layers,  2/2 seqs), K (f16): 5120.00 MiB, V (f16): 5120.00 MiB
llama_context: pipeline parallelism enabled
sched_reserve: reserving ...
sched_reserve: Flash Attention was auto, set to enabled
sched_reserve:      CUDA1 compute buffer size =   556.05 MiB
sched_reserve:      CUDA3 compute buffer size =   434.05 MiB
sched_reserve:  CUDA_Host compute buffer size =   276.06 MiB
sched_reserve: graph nodes  = 1367
sched_reserve: graph splits = 3
sched_reserve: reserve took 221.00 ms, sched copies = 4

main: n_kv_max = 65536, n_batch = 512, n_ubatch = 512, flash_attn = -1, is_pp_shared = 0, is_tg_separate = 0, n_gpu_layers = 999, n_threads = 32, n_threads_batch = 32

|    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
|  8192 |     32 |    1 |   8224 |   19.837 |   412.96 |    3.483 |     9.19 |   23.320 |   352.66 |
|  8192 |     32 |    2 |  16448 |   38.933 |   420.82 |    3.729 |    17.16 |   42.662 |   385.54 |

llama_perf_context_print:        load time =   10751.65 ms
llama_perf_context_print: prompt eval time =   62784.77 ms / 24656 tokens (    2.55 ms per token,   392.71 tokens per second)
llama_perf_context_print:        eval time =    3482.22 ms /    32 runs   (  108.82 ms per token,     9.19 tokens per second)
llama_perf_context_print:       total time =   76738.57 ms / 24688 tokens
llama_perf_context_print:    graphs reused =          0

Full log -ub 512 -b 1024

./llama-batched-bench --device CUDA1,CUDA3 --model '/mnt/tmpfs/Devstral-Small-2-24B-Instruct-2512-UD-Q8_K_XL.gguf' -ngl 999 -c 65536 -ub 512 -b 1024 -npp 8192 -ntg 32 -npl 1,2 -ts 1,1 -sm layer
ggml_cuda_init: found 4 CUDA devices:
  Device 0: NVIDIA GeForce RTX 2080 Ti, compute capability 7.5, VMM: yes
  Device 1: Tesla P40, compute capability 6.1, VMM: yes
  Device 2: Tesla P40, compute capability 6.1, VMM: yes
  Device 3: Tesla P40, compute capability 6.1, VMM: yes
build: 8233 (c5a778891) with GNU 12.3.0 for Linux x86_64
llama_model_load_from_file_impl: using device CUDA1 (Tesla P40) (0000:01:00.0) - 24292 MiB free
llama_model_load_from_file_impl: using device CUDA3 (Tesla P40) (0000:03:00.0) - 24292 MiB free
llama_model_loader: loaded meta data with 54 key-value pairs and 363 tensors from /mnt/tmpfs/Devstral-Small-2-24B-Instruct-2512-UD-Q8_K_XL.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = mistral3
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Devstral-Small-2-24B-Instruct-2512
llama_model_loader: - kv   3:                            general.version str              = 2512
llama_model_loader: - kv   4:                           general.finetune str              = Instruct
llama_model_loader: - kv   5:                           general.basename str              = Devstral-Small-2-24B-Instruct-2512
llama_model_loader: - kv   6:                       general.quantized_by str              = Unsloth
llama_model_loader: - kv   7:                         general.size_label str              = 24B
llama_model_loader: - kv   8:                            general.license str              = apache-2.0
llama_model_loader: - kv   9:                           general.repo_url str              = https://huggingface.co/unsloth
llama_model_loader: - kv  10:                   general.base_model.count u32              = 1
llama_model_loader: - kv  11:                  general.base_model.0.name str              = Devstral Small 2 24B Instruct 2512
llama_model_loader: - kv  12:               general.base_model.0.version str              = 2512
llama_model_loader: - kv  13:          general.base_model.0.organization str              = Mistralai
llama_model_loader: - kv  14:              general.base_model.0.repo_url str              = https://huggingface.co/mistralai/Devs...
llama_model_loader: - kv  15:                               general.tags arr[str,2]       = ["mistral-common", "unsloth"]
llama_model_loader: - kv  16:                       mistral3.block_count u32              = 40
llama_model_loader: - kv  17:                    mistral3.context_length u32              = 393216
llama_model_loader: - kv  18:                  mistral3.embedding_length u32              = 5120
llama_model_loader: - kv  19:               mistral3.feed_forward_length u32              = 32768
llama_model_loader: - kv  20:              mistral3.attention.head_count u32              = 32
llama_model_loader: - kv  21:           mistral3.attention.head_count_kv u32              = 8
llama_model_loader: - kv  22:                    mistral3.rope.freq_base f32              = 100000000.000000
llama_model_loader: - kv  23:  mistral3.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  24:              mistral3.attention.key_length u32              = 128
llama_model_loader: - kv  25:            mistral3.attention.value_length u32              = 128
llama_model_loader: - kv  26:              mistral3.rope.dimension_count u32              = 128
llama_model_loader: - kv  27:                 mistral3.rope.scaling.type str              = yarn
llama_model_loader: - kv  28:               mistral3.rope.scaling.factor f32              = 48.000000
llama_model_loader: - kv  29:       mistral3.rope.scaling.yarn_beta_fast f32              = 32.000000
llama_model_loader: - kv  30:       mistral3.rope.scaling.yarn_beta_slow f32              = 1.000000
llama_model_loader: - kv  31:  mistral3.rope.scaling.yarn_log_multiplier f32              = 1.000000
llama_model_loader: - kv  32: mistral3.rope.scaling.original_context_length u32              = 8192
llama_model_loader: - kv  33:       mistral3.attention.temperature_scale f32              = 0.100000
llama_model_loader: - kv  34:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  35:                         tokenizer.ggml.pre str              = tekken
llama_model_loader: - kv  36:                      tokenizer.ggml.merges arr[str,269443]  = ["Ġ Ġ", "Ġ t", "e r", "i n", "Ġ �...
llama_model_loader: - kv  37:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  38:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  39:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  40:            tokenizer.ggml.padding_token_id u32              = 11
llama_model_loader: - kv  41:                      tokenizer.ggml.tokens arr[str,131072]  = ["<unk>", "<s>", "</s>", "[INST]", "[...
llama_model_loader: - kv  42:                      tokenizer.ggml.scores arr[i32,131072]  = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
llama_model_loader: - kv  43:                  tokenizer.ggml.token_type arr[i32,131072]  = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv  44:                        mistral3.vocab_size u32              = 131072
llama_model_loader: - kv  45:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  46:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  47:                    tokenizer.chat_template str              = {#- Unsloth template fixes #}\n{%- set...
llama_model_loader: - kv  48:               general.quantization_version u32              = 2
llama_model_loader: - kv  49:                          general.file_type u32              = 7
llama_model_loader: - kv  50:                      quantize.imatrix.file str              = Devstral-Small-2-24B-Instruct-2512-GG...
llama_model_loader: - kv  51:                   quantize.imatrix.dataset str              = unsloth_calibration_Devstral-Small-2-...
llama_model_loader: - kv  52:             quantize.imatrix.entries_count u32              = 280
llama_model_loader: - kv  53:              quantize.imatrix.chunks_count u32              = 75
llama_model_loader: - type  f32:   81 tensors
llama_model_loader: - type  f16:   67 tensors
llama_model_loader: - type q8_0:  215 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q8_0
print_info: file size   = 26.99 GiB (9.84 BPW)
load: 0 unused tokens
load: printing all EOG tokens:
load:   - 2 ('</s>')
load: special tokens cache size = 1000
load: token to piece cache size = 0.8498 MB
print_info: arch                  = mistral3
print_info: vocab_only            = 0
print_info: no_alloc              = 0
print_info: n_ctx_train           = 393216
print_info: n_embd                = 5120
print_info: n_embd_inp            = 5120
print_info: n_layer               = 40
print_info: n_head                = 32
print_info: n_head_kv             = 8
print_info: n_rot                 = 128
print_info: n_swa                 = 0
print_info: is_swa_any            = 0
print_info: n_embd_head_k         = 128
print_info: n_embd_head_v         = 128
print_info: n_gqa                 = 4
print_info: n_embd_k_gqa          = 1024
print_info: n_embd_v_gqa          = 1024
print_info: f_norm_eps            = 0.0e+00
print_info: f_norm_rms_eps        = 1.0e-05
print_info: f_clamp_kqv           = 0.0e+00
print_info: f_max_alibi_bias      = 0.0e+00
print_info: f_logit_scale         = 0.0e+00
print_info: f_attn_scale          = 0.0e+00
print_info: n_ff                  = 32768
print_info: n_expert              = 0
print_info: n_expert_used         = 0
print_info: n_expert_groups       = 0
print_info: n_group_used          = 0
print_info: causal attn           = 1
print_info: pooling type          = 0
print_info: rope type             = 0
print_info: rope scaling          = yarn
print_info: freq_base_train       = 100000000.0
print_info: freq_scale_train      = 0.0208333
print_info: n_ctx_orig_yarn       = 8192
print_info: rope_yarn_log_mul     = 1.0000
print_info: rope_finetuned        = unknown
print_info: model type            = 14B
print_info: model params          = 23.57 B
print_info: general.name          = Devstral-Small-2-24B-Instruct-2512
print_info: vocab type            = BPE
print_info: n_vocab               = 131072
print_info: n_merges              = 269443
print_info: BOS token             = 1 '<s>'
print_info: EOS token             = 2 '</s>'
print_info: UNK token             = 0 '<unk>'
print_info: PAD token             = 11 '<pad>'
print_info: LF token              = 1010 'Ċ'
print_info: EOG token             = 2 '</s>'
print_info: max token length      = 150
load_tensors: loading model tensors, this can take a while... (mmap = true, direct_io = false)
load_tensors: offloading output layer to GPU
load_tensors: offloading 39 repeating layers to GPU
load_tensors: offloaded 41/41 layers to GPU
load_tensors:   CPU_Mapped model buffer size =  1280.00 MiB
load_tensors:        CUDA1 model buffer size = 13345.20 MiB
load_tensors:        CUDA3 model buffer size = 13016.07 MiB
.........................................................................................
llama_context: constructing llama_context
llama_context: setting new yarn_attn_factor = 1.0000 (mscale == 1.0, mscale_all_dim = 1.0)
llama_context: n_seq_max     = 2
llama_context: n_ctx         = 65536
llama_context: n_ctx_seq     = 32768
llama_context: n_batch       = 1024
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = auto
llama_context: kv_unified    = false
llama_context: freq_base     = 100000000.0
llama_context: freq_scale    = 0.0208333
llama_context: n_ctx_seq (32768) < n_ctx_train (393216) -- the full capacity of the model will not be utilized
llama_context:  CUDA_Host  output buffer size =     1.00 MiB
llama_kv_cache:      CUDA1 KV buffer size =  5376.00 MiB
llama_kv_cache:      CUDA3 KV buffer size =  4864.00 MiB
llama_kv_cache: size = 10240.00 MiB ( 32768 cells,  40 layers,  2/2 seqs), K (f16): 5120.00 MiB, V (f16): 5120.00 MiB
llama_context: pipeline parallelism enabled
sched_reserve: reserving ...
sched_reserve: Flash Attention was auto, set to enabled
sched_reserve:      CUDA1 compute buffer size =   556.05 MiB
sched_reserve:      CUDA3 compute buffer size =   434.05 MiB
sched_reserve:  CUDA_Host compute buffer size =   276.06 MiB
sched_reserve: graph nodes  = 1367
sched_reserve: graph splits = 3
sched_reserve: reserve took 222.68 ms, sched copies = 4

main: n_kv_max = 65536, n_batch = 1024, n_ubatch = 512, flash_attn = -1, is_pp_shared = 0, is_tg_separate = 0, n_gpu_layers = 999, n_threads = 32, n_threads_batch = 32

|    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
|  8192 |     32 |    1 |   8224 |   20.005 |   409.51 |    3.434 |     9.32 |   23.438 |   350.88 |
|  8192 |     32 |    2 |  16448 |   39.201 |   417.95 |    3.688 |    17.35 |   42.889 |   383.50 |

llama_perf_context_print:        load time =   10581.30 ms
llama_perf_context_print: prompt eval time =   63174.30 ms / 24656 tokens (    2.56 ms per token,   390.29 tokens per second)
llama_perf_context_print:        eval time =    3433.51 ms /    32 runs   (  107.30 ms per token,     9.32 tokens per second)
llama_perf_context_print:       total time =   76913.99 ms / 24688 tokens
llama_perf_context_print:    graphs reused =          0

ggerganov Mar 10, 2026
Maintainer

Hm, not sure. Try running the llama-bench tests from #6017. If you can't reproduce the results, either something regressed or there is something specific to your system.

marlin-oss Mar 11, 2026
Author

I ran the test and saw essentially no difference. I should be seeing some improvement here, right?
nvtop shows a sawtooth pattern where only one GPU is active at any given time.

cmake -B build -DGGML_CUDA=ON  -DGGML_BLAS=OFF -DLLAMA_CURL=OFF -DGGML_SCHED_MAX_COPIES=8

llama-bench --model '/mnt/tmpfs/Qwen3-14B-UD-Q8_K_XL.gguf' -ngl 999 --device CUDA1,CUDA1/CUDA2,CUDA1/CUDA2/CUDA3 -p 512,1024,2048,4096,8192 -b 8192
ggml_cuda_init: found 4 CUDA devices (Total VRAM: 84348 MiB):
  Device 0: NVIDIA GeForce RTX 2080 Ti, compute capability 7.5, VMM: yes, VRAM: 11011 MiB (10848 MiB free)
  Device 1: Tesla P40, compute capability 6.1, VMM: yes, VRAM: 24445 MiB (24293 MiB free)
  Device 2: Tesla P40, compute capability 6.1, VMM: yes, VRAM: 24445 MiB (24293 MiB free)
  Device 3: Tesla P40, compute capability 6.1, VMM: yes, VRAM: 24445 MiB (24293 MiB free)

| model          |      size |  params | backend | ngl | n_batch | dev               |   test |           t/s |
| -------------- | --------: | ------: | ------- | --: | ------: | ----------------- | -----: | ------------: |
| qwen3 14B Q8_0 | 17.46 GiB | 14.77 B | CUDA    | 999 |    8192 | CUDA1             |  pp512 | 410.94 ± 0.16 |
| qwen3 14B Q8_0 | 17.46 GiB | 14.77 B | CUDA    | 999 |    8192 | CUDA1             | pp1024 | 402.04 ± 0.21 |
| qwen3 14B Q8_0 | 17.46 GiB | 14.77 B | CUDA    | 999 |    8192 | CUDA1             | pp2048 | 385.97 ± 0.31 |
| qwen3 14B Q8_0 | 17.46 GiB | 14.77 B | CUDA    | 999 |    8192 | CUDA1             | pp4096 | 356.10 ± 0.42 |
| qwen3 14B Q8_0 | 17.46 GiB | 14.77 B | CUDA    | 999 |    8192 | CUDA1             | pp8192 | 292.27 ± 3.76 |
| qwen3 14B Q8_0 | 17.46 GiB | 14.77 B | CUDA    | 999 |    8192 | CUDA1             |  tg128 |  15.17 ± 0.03 |
| qwen3 14B Q8_0 | 17.46 GiB | 14.77 B | CUDA    | 999 |    8192 | CUDA1/CUDA2       |  pp512 | 407.75 ± 0.73 |
| qwen3 14B Q8_0 | 17.46 GiB | 14.77 B | CUDA    | 999 |    8192 | CUDA1/CUDA2       | pp1024 | 414.67 ± 0.54 |
| qwen3 14B Q8_0 | 17.46 GiB | 14.77 B | CUDA    | 999 |    8192 | CUDA1/CUDA2       | pp2048 | 408.47 ± 0.45 |
| qwen3 14B Q8_0 | 17.46 GiB | 14.77 B | CUDA    | 999 |    8192 | CUDA1/CUDA2       | pp4096 | 381.33 ± 0.18 |
| qwen3 14B Q8_0 | 17.46 GiB | 14.77 B | CUDA    | 999 |    8192 | CUDA1/CUDA2       | pp8192 | 333.17 ± 0.28 |
| qwen3 14B Q8_0 | 17.46 GiB | 14.77 B | CUDA    | 999 |    8192 | CUDA1/CUDA2       |  tg128 |  15.08 ± 0.03 |
| qwen3 14B Q8_0 | 17.46 GiB | 14.77 B | CUDA    | 999 |    8192 | CUDA1/CUDA2/CUDA3 |  pp512 | 407.75 ± 0.78 |
| qwen3 14B Q8_0 | 17.46 GiB | 14.77 B | CUDA    | 999 |    8192 | CUDA1/CUDA2/CUDA3 | pp1024 | 414.82 ± 0.78 |
| qwen3 14B Q8_0 | 17.46 GiB | 14.77 B | CUDA    | 999 |    8192 | CUDA1/CUDA2/CUDA3 | pp2048 | 407.60 ± 0.34 |
| qwen3 14B Q8_0 | 17.46 GiB | 14.77 B | CUDA    | 999 |    8192 | CUDA1/CUDA2/CUDA3 | pp4096 | 380.85 ± 0.33 |
| qwen3 14B Q8_0 | 17.46 GiB | 14.77 B | CUDA    | 999 |    8192 | CUDA1/CUDA2/CUDA3 | pp8192 | 332.99 ± 0.34 |
| qwen3 14B Q8_0 | 17.46 GiB | 14.77 B | CUDA    | 999 |    8192 | CUDA1/CUDA2/CUDA3 |  tg128 |  15.17 ± 0.02 |

build: 4d99d4508 (8279)

I also tested with a 24B model (on 2 and 3 GPUs) and saw a similar lack of improvement.
Setting CUDA_SCALE_LAUNCH_QUEUES=4x made no difference either, for what it's worth.

gaugarg-nv · 2026-03-11T08:44:32Z

Collaborator

I just tried this on 4xA40 GPUs, and I can see good scaling.

cmake -B build-A40 -DGGML_CUDA=ON
build-A40/bin/llama-bench -m ../models/Qwen3-14B-UD-Q8_K_XL.gguf -ngl 999 --device CUDA1,CUDA1/CUDA2,CUDA1/CUDA2/CUDA3 -p 512,1024,2048,4096,8192 -b 8192
ggml_cuda_init: found 4 CUDA devices (Total VRAM: 181960 MiB):
  Device 0: NVIDIA A40, compute capability 8.6, VMM: yes, VRAM: 45490 MiB (45221 MiB free)
  Device 1: NVIDIA A40, compute capability 8.6, VMM: yes, VRAM: 45490 MiB (45221 MiB free)
  Device 2: NVIDIA A40, compute capability 8.6, VMM: yes, VRAM: 45490 MiB (45221 MiB free)
  Device 3: NVIDIA A40, compute capability 8.6, VMM: yes, VRAM: 45490 MiB (45221 MiB free)
| model                          |       size |     params | backend    | ngl | n_batch | dev          |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | ------------ | --------------: | -------------------: |
| qwen3 14B Q8_0                 |  17.46 GiB |    14.77 B | CUDA       | 999 |    8192 | CUDA1        |           pp512 |      2393.11 ± 57.80 |
| qwen3 14B Q8_0                 |  17.46 GiB |    14.77 B | CUDA       | 999 |    8192 | CUDA1        |          pp1024 |       2286.60 ± 6.00 |
| qwen3 14B Q8_0                 |  17.46 GiB |    14.77 B | CUDA       | 999 |    8192 | CUDA1        |          pp2048 |       2130.31 ± 4.01 |
| qwen3 14B Q8_0                 |  17.46 GiB |    14.77 B | CUDA       | 999 |    8192 | CUDA1        |          pp4096 |       1857.38 ± 2.14 |
| qwen3 14B Q8_0                 |  17.46 GiB |    14.77 B | CUDA       | 999 |    8192 | CUDA1        |          pp8192 |       1462.26 ± 0.80 |
| qwen3 14B Q8_0                 |  17.46 GiB |    14.77 B | CUDA       | 999 |    8192 | CUDA1        |           tg128 |         31.69 ± 0.02 |
| qwen3 14B Q8_0                 |  17.46 GiB |    14.77 B | CUDA       | 999 |    8192 | CUDA1/CUDA2  |           pp512 |      2425.42 ± 17.42 |
| qwen3 14B Q8_0                 |  17.46 GiB |    14.77 B | CUDA       | 999 |    8192 | CUDA1/CUDA2  |          pp1024 |       2915.99 ± 1.91 |
| qwen3 14B Q8_0                 |  17.46 GiB |    14.77 B | CUDA       | 999 |    8192 | CUDA1/CUDA2  |          pp2048 |       3199.68 ± 1.65 |
| qwen3 14B Q8_0                 |  17.46 GiB |    14.77 B | CUDA       | 999 |    8192 | CUDA1/CUDA2  |          pp4096 |       3052.61 ± 0.72 |
| qwen3 14B Q8_0                 |  17.46 GiB |    14.77 B | CUDA       | 999 |    8192 | CUDA1/CUDA2  |          pp8192 |       2546.45 ± 0.33 |
| qwen3 14B Q8_0                 |  17.46 GiB |    14.77 B | CUDA       | 999 |    8192 | CUDA1/CUDA2  |           tg128 |         31.78 ± 0.01 |
| qwen3 14B Q8_0                 |  17.46 GiB |    14.77 B | CUDA       | 999 |    8192 | CUDA1/CUDA2/CUDA3 |           pp512 |      2429.78 ± 11.10 |
| qwen3 14B Q8_0                 |  17.46 GiB |    14.77 B | CUDA       | 999 |    8192 | CUDA1/CUDA2/CUDA3 |          pp1024 |       3255.04 ± 2.40 |
| qwen3 14B Q8_0                 |  17.46 GiB |    14.77 B | CUDA       | 999 |    8192 | CUDA1/CUDA2/CUDA3 |          pp2048 |       3956.43 ± 3.79 |
| qwen3 14B Q8_0                 |  17.46 GiB |    14.77 B | CUDA       | 999 |    8192 | CUDA1/CUDA2/CUDA3 |          pp4096 |       4044.25 ± 0.52 |
| qwen3 14B Q8_0                 |  17.46 GiB |    14.77 B | CUDA       | 999 |    8192 | CUDA1/CUDA2/CUDA3 |          pp8192 |       3530.37 ± 1.05 |
| qwen3 14B Q8_0                 |  17.46 GiB |    14.77 B | CUDA       | 999 |    8192 | CUDA1/CUDA2/CUDA3 |           tg128 |         31.78 ± 0.01 |

build: 5f91b1d5d (8286)
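
To make the scaling in the table above easier to read, here is a small sketch that computes the prompt-processing speedup of the 2-GPU and 3-GPU runs over the single-GPU run. The throughput values are copied from the llama-bench output above; the dictionary layout and the function name are only for illustration.

```python
# Mean prompt-processing throughput (t/s) copied from the llama-bench table above.
# Keys are prompt sizes; values are (1 GPU, 2 GPUs, 3 GPUs).
PP_TPS = {
    512:  (2393.11, 2425.42, 2429.78),
    1024: (2286.60, 2915.99, 3255.04),
    2048: (2130.31, 3199.68, 3956.43),
    4096: (1857.38, 3052.61, 4044.25),
    8192: (1462.26, 2546.45, 3530.37),
}

def speedups(tps_by_prompt):
    """Return {prompt_size: (speedup with 2 GPUs, speedup with 3 GPUs)}."""
    result = {}
    for n_prompt, (one, two, three) in tps_by_prompt.items():
        result[n_prompt] = (round(two / one, 2), round(three / one, 2))
    return result

for n_prompt, (s2, s3) in speedups(PP_TPS).items():
    print(f"pp{n_prompt}: 2 GPUs {s2:.2f}x, 3 GPUs {s3:.2f}x")
```

On these numbers pp512 barely changes (about 1.01x), presumably because a 512-token prompt fits in a single micro-batch and leaves nothing to overlap, while pp8192 reaches about 1.74x on two GPUs and 2.41x on three. Token generation (tg128) stays at about 31.7 t/s in all three configurations.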

2 replies

marlin-oss Mar 11, 2026
Author

Thanks for testing that. It seems to be related to my hardware.
Are you on a dual-socket system by any chance?
Maybe it's P40 related. If someone running P40s could give it a try, that would be great.

I've been looking into P2P, which the P40s supposedly support; it might be causing issues if it's not working properly.
I'll try building with just -DGGML_CUDA=ON, but I doubt that's it.

gaugarg-nv Mar 11, 2026
Collaborator

Yes, this was a dual socket system. But I have tested on single socket systems too in the past, and it worked fine. I don't have access to P40s, so I can't test.

Can you try capturing an Nsight trace for the batch size 8192, pp8192 case and share it here?

dark-penguin · 2026-05-31T17:06:41Z

dark-penguin
May 31, 2026

@marlin-oss Hi! Did you figure it out? I see the same thing on dual RX6800. There is exactly zero difference between -b 1024 -ub 1024 and -b 2048 -ub 1024, or between runs that log "W sched_reserve: compute buffer allocation failed, retrying without pipeline parallelism" and runs that don't.

Since pipeline parallelism gets disabled when you're low on memory, maybe something else disables it as well, but without logging anything?
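
For intuition on why the -b / -ub ratio should matter at all, here is a toy model of a layer-split pipeline. It assumes every micro-batch takes one time unit on every GPU and that transfers are free; it is an idealized sketch, not llama.cpp's scheduler, and the function name is made up for illustration.

```python
import math

def ideal_pipeline_speedup(n_gpus: int, n_batch: int, n_ubatch: int) -> float:
    """Idealized speedup of pipelined over sequential layer-split execution.

    Assumes every micro-batch takes one time unit on every GPU and that
    transfers are free; this is a toy model, not llama.cpp's scheduler.
    """
    n_micro = math.ceil(n_batch / n_ubatch)   # micro-batches per logical batch
    sequential_steps = n_gpus * n_micro       # only one GPU busy at a time
    pipelined_steps = n_gpus + n_micro - 1    # fill the pipeline, then stream
    return sequential_steps / pipelined_steps

print(ideal_pipeline_speedup(2, 1024, 1024))  # one micro-batch: nothing to overlap
print(ideal_pipeline_speedup(2, 2048, 1024))  # 4 sequential steps vs 3 pipelined
print(ideal_pipeline_speedup(2, 2048, 512))   # 8 sequential steps vs 5 pipelined
```

Under this model, -b 1024 -ub 1024 gives no speedup by construction, while -b 2048 -ub 1024 on two GPUs could give at most about 1.33x. Seeing exactly zero difference between the two therefore suggests the overlap is not happening at all, rather than being too small to notice.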

6 replies

dark-penguin Jun 3, 2026

I haven't heard of GPU-level profilers/debuggers before, but I can probably figure out how to capture a log with some help. But I'm on AMD, so my tools would be different I suppose?

marlin-oss Jun 3, 2026
Author

I missed that. I’ll update this thread if I find anything. Are you running in a VM with GPU passthrough? I was going to test whether that was causing the issue.

sredman Jun 3, 2026

FWIW, scaling works well for me on 3x AMD v320. Roughly the same scaling ratio as gaugarg-nv posted above, using the ROCm backend (I haven't tested Vulkan)

... maybe it's disabled by something else as well ...

One example is having any RPC devices active.

Are you running in a vm with GPU pass through by any chance?

This is a good point; I've had trouble getting this working. For AMD GPUs, you can use rocm-bandwidth-test to see which devices can do P2P and the bandwidth between them. Ideally you should see all the devices able to P2P to each other, and able to communicate at roughly the full theoretical maximum PCIe bandwidth expected for your PCIe generation and bus width.

dark-penguin Jun 4, 2026

No passthrough, nothing unusual; the most unusual thing I have is Docker. Is there anything special I have to specify to allow P2P transfers in Docker?

I've built rocm-bandwidth-test in that container, but I don't know what I'm looking for or how to read the output. According to what an AI is telling me, it can't find a valid way to do direct P2P communication.

I've got a GA-X79S-UP5-WIFI mainboard - that's socket 2011, Intel C606 chipset, PCI-Express 3.0. No Infinity Fabric, no resizeable BAR.

# rocm_bandwidth_test plugin --run tb p2p
TransferBench v1.64.00
===============================================================
[Common]                              (Suppress by setting HIDE_ENV=1)
ALWAYS_VALIDATE      =            0 : Validating after all iterations
BLOCK_BYTES          =          256 : Each CU gets a mulitple of 256 bytes to copy
BYTE_OFFSET          =            0 : Using byte offset of 0
CU_MASK              =            0 : All
FILL_COMPRESS        =            0 : Not specified
FILL_PATTERN         =            0 : Element i = ((i * 517) modulo 383 + 31) * (srcBufferIdx + 1)
GFX_BLOCK_ORDER      =            0 : Thread block ordering: Sequential
GFX_BLOCK_SIZE       =          256 : Threadblock size of 256
GFX_SINGLE_TEAM      =            1 : Combining CUs to work across entire data array
GFX_TEMPORAL         =            0 : Not using non-temporal loads/stores
GFX_UNROLL           =            4 : Using GFX unroll factor of 4
GFX_WAVE_ORDER       =            0 : Using GFX wave ordering of Unroll,Wavefront,CU
GFX_WORD_SIZE        =            4 : Using GFX word size of 4 (DWORDx4)
MIN_VAR_SUBEXEC      =            1 : Using at least 1 subexecutor(s) for variable subExec tranfers
MAX_VAR_SUBEXEC      =            0 : Using up to all available subexecutors for variable subExec transfers
NUM_ITERATIONS       =           10 : Running 10  timed iteration(s)
NUM_SUBITERATIONS    =            1 : Running 1 subiterations
NUM_WARMUPS          =            3 : Running 3 warmup iteration(s) per Test
SHOW_ITERATIONS      =            0 : Hiding per-iteration timing
USE_HIP_EVENTS       =            1 : Using HIP events for GFX/DMA Executor timing
USE_HSA_DMA          =            0 : Using hipMemcpyAsync for DMA execution
USE_INTERACTIVE      =            0 : Running in non-interactive mode
USE_SINGLE_STREAM    =            1 : Using single stream per GFX device
VALIDATE_DIRECT      =            0 : Validate GPU destination memory via CPU staging buffer
VALIDATE_SOURCE      =            0 : Do not perform source validation after prep

[P2P Related]
NUM_CPU_DEVICES      =            1 : Using 1 CPUs
NUM_CPU_SE           =            4 : Using 4 CPU threads per Transfer
NUM_GPU_DEVICES      =            2 : Using 2 GPUs
NUM_GPU_SE           =           30 : Using 30 GPU subexecutors/CUs per Transfer
P2P_MODE             =            0 : Running Uni + Bi transfers
USE_FINE_GRAIN       =            0 : Using coarse-grained memory
USE_GPU_DMA          =            0 : Using GPU-GFX as GPU executor
USE_REMOTE_READ      =            0 : Using SRC as executor

Bytes Per Direction 268435456
Unidirectional copy peak bandwidth GB/s [Local read / Remote write] (GPU-Executor: GFX)
 SRC+EXE\DST    CPU 00       GPU 00    GPU 01
Segmentation fault (core dumped)

# USE_HSA_DMA=1 USE_FINE_GRAIN=1 A2A_DIRECT=0 USE_DMA_EXEC=1 rocm_bandwidth_test plugin --run tb a2a
TransferBench v1.64.00
===============================================================
[Common]                              (Suppress by setting HIDE_ENV=1)
ALWAYS_VALIDATE      =            0 : Validating after all iterations
BLOCK_BYTES          =          256 : Each CU gets a mulitple of 256 bytes to copy
BYTE_OFFSET          =            0 : Using byte offset of 0
CU_MASK              =            0 : All
FILL_COMPRESS        =            0 : Not specified
FILL_PATTERN         =            0 : Element i = ((i * 517) modulo 383 + 31) * (srcBufferIdx + 1)
GFX_BLOCK_ORDER      =            0 : Thread block ordering: Sequential
GFX_BLOCK_SIZE       =          256 : Threadblock size of 256
GFX_SINGLE_TEAM      =            1 : Combining CUs to work across entire data array
GFX_TEMPORAL         =            0 : Not using non-temporal loads/stores
GFX_UNROLL           =            2 : Using GFX unroll factor of 2
GFX_WAVE_ORDER       =            0 : Using GFX wave ordering of Unroll,Wavefront,CU
GFX_WORD_SIZE        =            4 : Using GFX word size of 4 (DWORDx4)
MIN_VAR_SUBEXEC      =            1 : Using at least 1 subexecutor(s) for variable subExec tranfers
MAX_VAR_SUBEXEC      =            0 : Using up to all available subexecutors for variable subExec transfers
NUM_ITERATIONS       =           10 : Running 10  timed iteration(s)
NUM_SUBITERATIONS    =            1 : Running 1 subiterations
NUM_WARMUPS          =            3 : Running 3 warmup iteration(s) per Test
SHOW_ITERATIONS      =            0 : Hiding per-iteration timing
USE_HIP_EVENTS       =            1 : Using HIP events for GFX/DMA Executor timing
USE_HSA_DMA          =            1 : Using hsa_amd_async_copy for DMA execution
USE_INTERACTIVE      =            0 : Running in non-interactive mode
USE_SINGLE_STREAM    =            1 : Using single stream per GFX device
VALIDATE_DIRECT      =            0 : Validate GPU destination memory via CPU staging buffer
VALIDATE_SOURCE      =            0 : Do not perform source validation after prep

[AllToAll Related]
A2A_DIRECT           =            0 : Full all-to-all
A2A_LOCAL            =            0 : Exclude local transfers
A2A_MODE             =            0 : Copy
NUM_GPU_DEVICES      =            2 : Using 2 GPUs
NUM_QUEUE_PAIRS      =            0 : Using 0 queue pairs for NIC transfers
NUM_SUB_EXEC         =            8 : Using 8 subexecutors/CUs per Transfer
USE_DMA_EXEC         =            1 : Using DMA executor
USE_FINE_GRAIN       =            1 : Using fine-grained memory
USE_REMOTE_READ      =            0 : Using SRC as executor

GPU-GFX All-To-All benchmark:
==========================
- Copying 268435456 bytes between all pairs of GPUs using 8 CUs (2 Transfers)
Large BAR is not enabled for GPU 0 in BIOS. Large BAR is required to enable multi-gpu data access
Large BAR is not enabled for GPU 1 in BIOS. Large BAR is required to enable multi-gpu data access
Peer access is unavailable between GPU devices 1 to 0.For AMD hardware, check IOMMU configuration
Segmentation fault (core dumped)

So, I guess no direct P2P is possible? But can't data be passed "the normal way", without P2P?

sredman Jun 4, 2026

Is there anything special I have to specify to allow P2P transfers in Docker?

I haven't tried docker so I can't say, but if you post your docker run command or any other configuration, maybe someone else can help you.

But can't data be passed "the normal way", without P2P?

If you haven't already, try the Vulkan build. It is a lot more forgiving. I think it implements layer split through Device->RAM->Device copying. For me the performance is a bit worse than with ROCm, but if it works for you, that might be better than hours of debugging.

If you're able, I'd suggest trying at least rocm-bandwidth-test directly on your host. I was able to find it in my package manager rather than building it. If P2P works there, then you "only" need to figure out why the docker setup isn't behaving. I skipped this step and spent a lot of time in my BIOS, before realizing P2P worked fine on the host and the problem was in my VM setup.

dark-penguin · 2026-06-04T03:36:22Z

dark-penguin
Jun 4, 2026

According to my tests, ROCm is 20% to 2x faster at prompt processing, and Vulkan is 20% to 3x faster at token generation, and its token generation decays very little compared to ROCm as context grows. I use both. Vulkan does not see any improvement from enabling pipeline parallelism either.

Now I'm pretty sure that the problem is my old-ass hardware. It makes sense that it would rely on technologies almost everyone has nowadays. (I can't try it on my host system right now because of... Linux reasons 🥲 )

But I've also just realized that pipeline parallelism does not support partial CPU offloading - not layers, not MoE experts. I know we'd be bottlenecked by the CPU anyway, but wouldn't it still improve things quite a bit as long as only a few experts are offloaded?..

1 reply

sredman Jun 5, 2026

Now I'm pretty sure that the problem is my old-ass hardware. It makes sense that it would rely on technologies almost everyone has nowadays.

I wouldn't blame the hardware until you've had a chance to test on the host OS. PCIe P2P is not new; I'd expect any self-respecting C606 motherboard to support it.

But I've also just realized that pipeline parallelism does not support partial CPU offloading ... but wouldn't it still improve things quite a bit as long as only a few experts are offloaded?

Not everything that could be implemented, is implemented. Now is your chance to shine 🙂

dark-penguin · 2026-06-05T01:32:23Z

dark-penguin
Jun 5, 2026

I can't run 7.2.3 on the host, but I've tried 6.3.2 (with an older rocm-bandwidth-test), and it gave me the same results on the host and in the container. And apparently it says transfer is working fine? I guess the new version simply segfaulted for some reason before it could run the test - which would probably succeed if it didn't?

rocm-bandwidth-test, ROCm 6.3.2 host/Docker

$ sudo /opt/rocm/bin/rocm-bandwidth-test -t
          RocmBandwidthTest Version: 2.6.0
          Launch Command is: /opt/rocm/bin/rocm-bandwidth-test -t

          Device Index:                             0
            Device Type:                            CPU
            Device Name:                            Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz
              Allocatable Memory Size (KB):         65763416

          Device Index:                             1
            Device Type:                            GPU
            Device Name:                            AMD Radeon RX 6800
            Device  BDF:                            07:0.0
            Device UUID:                            GPU-0756742f46e97e82
              Allocatable Memory Size (KB):         16760832
              Allocatable Memory Size (KB):         16760832

          Device Index:                             2
            Device Type:                            GPU
            Device Name:                            AMD Radeon RX 6800
            Device  BDF:                            0a:0.0
            Device UUID:                            GPU-1f3e10e6a509c865
              Allocatable Memory Size (KB):         16760832
              Allocatable Memory Size (KB):         16760832

          Inter-Device Access
          D/D       0         1         2         
          0         1         0         0         
          1         1         1         0         
          2         1         0         1         

          Inter-Device Link Type: P = PCIe, X = xGMI, N/A = Not Applicable
          D/D       0         1         2         
          0         N/A       N/A       N/A       
          1         P         N/A       N/A       
          2         P         N/A       N/A       

          Inter-Device Numa Distance
          D/D       0         1         2         
          0         0         N/A       N/A       
          1         20        0         N/A       
          2         20        N/A       0

$ sudo /opt/rocm/bin/rocm-bandwidth-test
          RocmBandwidthTest Version: 2.6.0
          Launch Command is: /opt/rocm/bin/rocm-bandwidth-test (rocm_bandwidth -a + rocm_bandwidth -A)

          Device: 0,  Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz
          Device: 1,  AMD Radeon RX 6800,  GPU-0756742f46e97e82,  07:0.0
          Device: 2,  AMD Radeon RX 6800,  GPU-1f3e10e6a509c865,  0a:0.0

          Inter-Device Access
          D/D       0         1         2         
          0         1         0         0         
          1         1         1         0         
          2         1         0         1         

          Inter-Device Numa Distance
          D/D       0         1         2         
          0         0         N/A       N/A       
          1         20        0         N/A       
          2         20        N/A       0         

          Unidirectional copy peak bandwidth GB/s
          D/D       0           1           2           
          0         N/A         3.424       12.916      
          1         3.633       636.465     N/A         
          2         14.511      N/A         674.051     

          Bidirectional copy peak bandwidth GB/s
          D/D       0           1           2           
          0         N/A         6.645       26.655      
          1         6.645       N/A         N/A         
          2         26.655      N/A         N/A
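
One way to read the "Inter-Device Access" matrix above is sketched below. It assumes that a 1 in row i, column j means device i can access device j's memory; that reading is an assumption, so check it against the rocm-bandwidth-test documentation. The helper names are made up for illustration.

```python
ACCESS_MATRIX = """\
D/D       0         1         2
0         1         0         0
1         1         1         0
2         1         0         1
"""

def parse_access(text):
    """Parse an 'Inter-Device Access' table into {(row_dev, col_dev): bool}."""
    lines = [line.split() for line in text.strip().splitlines()]
    columns = [int(c) for c in lines[0][1:]]          # skip the 'D/D' label
    access = {}
    for row in lines[1:]:
        src = int(row[0])
        for dst, flag in zip(columns, row[1:]):
            access[(src, dst)] = flag == "1"
    return access

def gpu_pairs_without_p2p(access, gpu_ids):
    """List ordered GPU pairs (a, b), a != b, where a cannot access b."""
    return [(a, b) for a in gpu_ids for b in gpu_ids
            if a != b and not access[(a, b)]]

access = parse_access(ACCESS_MATRIX)
# Device 0 is the CPU in the output above; devices 1 and 2 are the RX 6800s.
print(gpu_pairs_without_p2p(access, gpu_ids=[1, 2]))
```

Under that assumption, both GPU pairs come back as lacking peer access, which is consistent with the N/A entries between devices 1 and 2 in the bandwidth tables. The CPU-to-GPU and GPU-to-CPU copies are the ones that report bandwidth.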

12 replies

dark-penguin Jun 16, 2026

Hmm, models take much longer to load now, but prompt processing... is the same as before. I've made sure to enable pipeline parallelism: I have a fix of my own that makes it fail instead of silently disabling PP when there is not enough memory to allocate for it. And I've double-checked that I'm running it with -b 2048 -ub 512; I see these values dumped in the logs. I've tried two models that fit completely in my VRAM: Qwen3.6-35B-A3B and Qwen3.6-27B.

dark-penguin Jun 16, 2026

I've tried both ROCm and Vulkan, tried your branch as-is to make sure my patches don't mess it up - nope, it's even a little slower than running it without RPC (and without PP).

Looking back at this:

GPU-GFX All-To-All benchmark:
==========================
- Copying 268435456 bytes between all pairs of GPUs using 8 CUs (2 Transfers)
Large BAR is not enabled for GPU 0 in BIOS. Large BAR is required to enable multi-gpu data access
Large BAR is not enabled for GPU 1 in BIOS. Large BAR is required to enable multi-gpu data access
Peer access is unavailable between GPU devices 1 to 0.For AMD hardware, check IOMMU configuration
Segmentation fault (core dumped)

...I guess this is where I got the idea that ReBAR is necessary. But what's that about IOMMU? I have one, but there is no "configuration" in BIOS other than enabling a few things.

But none of that explains why even RPC doesn't help. 😄 I was pretty sure that it would work no matter what. Is there maybe some kind of debug logging? Should I get infinite-verbosity logs?

sredman Jun 19, 2026

models take much longer to load now

Yes. It's not a perfect solution

I have a fix of my own to fail instead of silently disabling it if there was not enough memory to allocate for it.

There are a variety of other reasons which cause PP to be disabled. Are you only terminating for that particular one? Do you see in the logs the positive confirmation that it is enabled? Can you try a smaller model and share the whole command lines you are running for the RPC servers and the bench?

Is there maybe some kind of debug logging?

With --verbose on llama-bench or llama-server, you should get at least some information. Critically, you should see a positive confirmation that pipeline parallelism is enabled (though unfortunately there is no debug logging when it is not enabled, nor why, AFAIK)
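
A minimal sketch of checking a captured log for that confirmation follows. The two substrings are taken from log lines quoted earlier in this thread; the function name and the three status labels are made up for illustration, and other log lines may matter that this does not look for.

```python
PP_ENABLED = "pipeline parallelism enabled"
PP_FALLBACK = "retrying without pipeline parallelism"

def pp_status(log_text: str) -> str:
    """Classify a llama.cpp log as 'enabled', 'fallback', or 'unknown'."""
    saw_enabled = False
    for line in log_text.splitlines():
        if PP_FALLBACK in line:
            return "fallback"   # buffer allocation failed, PP was dropped
        if PP_ENABLED in line:
            saw_enabled = True
    # 'unknown' covers the silent case: nothing is logged when PP is not enabled.
    return "enabled" if saw_enabled else "unknown"

sample = (
    "llama_context: pipeline parallelism enabled\n"
    "W sched_reserve: compute buffer allocation failed, "
    "retrying without pipeline parallelism\n"
)
print(pp_status(sample))
```

The fallback message takes priority over the confirmation because, as the sample shows, both can appear in the same log.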

dark-penguin Jun 19, 2026

models take much longer to load now

Yes. It's not a perfect solution

I just thought it might be useful information - I didn't expect it to have any impact; we're not bottlenecked by anything, local network has unlimited bandwidth, so why would it be slower? Oh, because instead of DMA from page cache to VRAM, we're now sending data through CPU?

I have a fix of my own to fail instead of silently disabling it if there was not enough memory to allocate for it.

There are a variety of other reasons which cause PP to be disabled. Are you only terminating for that particular one? Do you see in the logs the positive confirmation that it is enabled? Can you try a smaller model and share the whole command lines you are running for the RPC servers and the bench?

Is there maybe some kind of debug logging?

With --verbose on llama-bench or llama-server, you should get at least some information. Critically, you should see a positive confirmation that pipeline parallelism is enabled (though unfortunately there is no debug logging when it is not enabled, nor why, AFAIK)

I know that you can get "infinitely verbose logs" with -v, but there is nothing useful there. And yes, you either see that message or you see nothing, which is a problem. That's why I've created my patch: #24205

During startup, llama.cpp tries to initialize PP, checking that all conditions are met, then prints pipeline parallelism is enabled. My patch adds here: "otherwise, throw an error". Then it tries to allocate memory buffers, and if that fails, prints retrying without pipeline parallelism. My patch changes this to "throw an error". This way, if I specify that I want PP, I can be certain that it did not get silently disabled. And I can also disable PP since it doesn't work for me anyway, which will prevent it from attempting to allocate the buffer.
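
The decision logic described above can be sketched roughly as follows. This is not the code from the patch or from llama.cpp: the function, the exception class, and the "auto" value are hypothetical, and only "on" and "off" appear in this thread as values for -pp.

```python
class PipelineParallelismError(RuntimeError):
    """Raised when PP was explicitly requested but cannot be used."""

def resolve_pp(requested, conditions_met, buffers_fit):
    """Decide whether pipeline parallelism ends up enabled.

    requested: "on", "off", or "auto" ("auto" is a hypothetical default)
    conditions_met: whether the startup checks for PP passed
    buffers_fit: whether the larger PP compute buffers could be allocated
    """
    if requested == "off":
        return False                      # skip PP and its buffer allocation
    usable = conditions_met and buffers_fit
    if requested == "on" and not usable:
        # Strict mode: fail loudly rather than fall back silently.
        raise PipelineParallelismError(
            "pipeline parallelism requested but unavailable")
    return usable                         # "auto": fall back quietly

print(resolve_pp("auto", conditions_met=True, buffers_fit=False))
```

The point of the strict branch is that a run with -pp on either uses PP or stops, so a benchmark result can never come from a silently downgraded configuration.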

So yes, it's definitely enabled, and it successfully allocates the buffer. Other than that, there is nothing related in the verbose logs.

Here are my commands:

docker run -it --rm --device /dev/kfd --device /dev/dri --group-add video --privileged --name rpc1 -p 9001:9001 \
    -v /tmpfs:/tmpfs -v /ssd_models:/ssd_models -e CUDA_VISIBLE_DEVICES=0 --entrypoint /opt/llama.cpp/rpc-server \
    llama.cpp:dev-rocm --host 0.0.0.0 --port 9001

docker run -it --rm --device /dev/kfd --device /dev/dri --group-add video --privileged --name rpc2 -p 9002:9002 \
    -v /tmpfs:/tmpfs -v /ssd_models:/ssd_models -e CUDA_VISIBLE_DEVICES=1 --entrypoint /opt/llama.cpp/rpc-server \
    llama.cpp:dev-rocm --host 0.0.0.0 --port 9002

docker run -it --rm --device /dev/kfd --device /dev/dri --group-add video --privileged --name llama.cpp -p 11234:11234 \
    -v /tmpfs:/tmpfs -v /ssd_models:/ssd_models --entrypoint /opt/llama.cpp/llama-server \
    llama.cpp:dev-rocm --host 0.0.0.0 --jinja -cram -1 -ngl all -fit off -lv 4 --models-max 1 \
    --models-preset /ssd_models/params.ini --slot-save-path /tmpfs/ -kvu -np 4 --port 11234 \
    --rpc 172.16.252.1:9001,172.16.252.1:9002 -dev RPC0,RPC1  # Add -pp on to force PP if you have my patch

So, I'm running router mode, which lets me try different models, and also I have automated builds for different Docker images, so I can try ROCm/Vulkan or different branches easily.

The model parameters file /ssd_models/params.ini has my settings for each model, fine-tuned so that each model fits in my VRAM. Smaller model... You want something that fits in your 24 GB VRAM? Let's try this one:

[DeepSeek-R1-Distill-Qwen-14B-11G]
model = /ssd_models/DeepSeek-R1-Distill-Qwen-14B-Q6_K-unsloth.gguf
b = 2048
ub = 512
c = 65536
ctk = q8_0
ctv = q8_0

temp = 0.6
top-p = 0.95
top-k = 20

I'm testing them with my own benchmark: https://github.com/dark-penguin/llm-tools (I've just created a new release so you can download the same version as me without rebuilding). It's not very accurate, but it is very simple, and it lets me test different context lengths at different depths, with real text and a real prompt, not random garbage (like llama-bench, AFAIK). And it tests llama-server directly, or any other OpenAI-compatible endpoint, not a different binary with different flags and conditions.

It simply sends requests of a specified length, estimated roughly (assuming about 4 characters per token). Each run mostly hits cache, so you don't waste time reuploading, and you can see how fast each request processes.
Results are taken from the API output:

Length - prompt length excluding cache hit
Sec - prompt processing seconds (only prompt processing, not token generation!)
Prompt - prompt TPS
Decode - decode TPS

(Append -u http://yourhost:1234/v1 if your URL is not http://localhost:8080/v1, or put URL=http://yourhost:1234/v1 in .llm-bench.env in the same folder as the executable)
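
The length estimate described above (about 4 characters per token) could be sketched like this. It is not the code from llm-bench; the function name and the repeat-then-cut strategy are assumptions for illustration.

```python
CHARS_PER_TOKEN = 4  # rough heuristic mentioned above; real tokenizers vary

def prompt_of_length(source_text: str, target_tokens: int) -> str:
    """Cut or repeat source_text to roughly target_tokens tokens."""
    if not source_text:
        raise ValueError("source_text must not be empty")
    target_chars = target_tokens * CHARS_PER_TOKEN
    repeats = -(-target_chars // len(source_text))  # ceiling division
    return (source_text * repeats)[:target_chars]

print(len(prompt_of_length("Some real text to benchmark with. ", 10000)))
```

Because the estimate is only approximate, the token count the server reports will differ from the requested value.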

My normal run (no RPC, -pp off, -b 512 -ub 512):

$ ./llm-bench -p 10000,20000,30000,40000,50000 -m DeepSeek-R1-Distill-Qwen-14B-11G
| Model                            | Length | Sec |  Prompt | Decode |
|----------------------------------|--------|-----|---------|--------|
| DeepSeek-R1-Distill-Qwen-14B-11G |   8683 |  14 |  606.52 |  27.60 |
| DeepSeek-R1-Distill-Qwen-14B-11G |  10037 |  24 |  415.51 |  24.92 |
| DeepSeek-R1-Distill-Qwen-14B-11G |  11733 |  39 |  299.90 |  21.13 |
| DeepSeek-R1-Distill-Qwen-14B-11G |  11077 |  48 |  231.75 |  19.05 |
| DeepSeek-R1-Distill-Qwen-14B-11G |  10534 |  55 |  190.67 |  17.31 |

Reference run (RPC, -pp on, -b 512 -ub 512):

$ ./llm-bench -p 10000,20000,30000,40000,50000 -m DeepSeek-R1-Distill-Qwen-14B-11G
| Model                            | Length | Sec |  Prompt | Decode |
|----------------------------------|--------|-----|---------|--------|
| DeepSeek-R1-Distill-Qwen-14B-11G |   8683 |  15 |  592.54 |  20.31 |
| DeepSeek-R1-Distill-Qwen-14B-11G |  10037 |  23 |  443.18 |  17.01 |
| DeepSeek-R1-Distill-Qwen-14B-11G |  11733 |  35 |  333.36 |  13.61 |
| DeepSeek-R1-Distill-Qwen-14B-11G |  11077 |  41 |  268.34 |  11.67 |
| DeepSeek-R1-Distill-Qwen-14B-11G |  10534 |  47 |  224.11 |  10.19 |

Test run (RPC, -pp on, -b 2048 -ub 512):

$ ./llm-bench -p 10000,20000,30000,40000,50000 -m DeepSeek-R1-Distill-Qwen-14B-11G
| Model                            | Length | Sec |  Prompt | Decode |
|----------------------------------|--------|-----|---------|--------|
| DeepSeek-R1-Distill-Qwen-14B-11G |   8683 |  15 |  592.78 |  20.24 |
| DeepSeek-R1-Distill-Qwen-14B-11G |  10037 |  23 |  441.95 |  16.98 |
| DeepSeek-R1-Distill-Qwen-14B-11G |  11733 |  35 |  332.34 |  13.66 |
| DeepSeek-R1-Distill-Qwen-14B-11G |  11077 |  41 |  267.34 |  11.70 |
| DeepSeek-R1-Distill-Qwen-14B-11G |  10534 |  47 |  224.10 |  10.20 |

Just to double-check under exactly the same conditions - which you won't be able to do without my patch:
Control run (RPC, -pp off, -b 2048 -ub 512):

$ ./llm-bench -p 10000,20000,30000,40000,50000 -m DeepSeek-R1-Distill-Qwen-14B-11G
| Model                            | Length | Sec |  Prompt | Decode |
|----------------------------------|--------|-----|---------|--------|
| DeepSeek-R1-Distill-Qwen-14B-11G |   8683 |  15 |  592.89 |  20.32 |
| DeepSeek-R1-Distill-Qwen-14B-11G |  10037 |  23 |  441.99 |  17.02 |
| DeepSeek-R1-Distill-Qwen-14B-11G |  11733 |  35 |  332.41 |  13.53 |
| DeepSeek-R1-Distill-Qwen-14B-11G |  11077 |  41 |  267.34 |  11.66 |
| DeepSeek-R1-Distill-Qwen-14B-11G |  10534 |  47 |  224.13 |  10.21 |

Looks like RPC makes prompt processing up to about 18% faster (how?!), but token generation about 25-40% slower, and those effects get more prominent with higher context depth. But all RPC runs are the same: with PP enabled, with PP disabled, and with PP impossible because -b equals -ub.
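
To put numbers on the comparison between the normal (no RPC) run and the RPC reference run, here is a small sketch that computes the per-row percentage change. The throughput values are copied from the two tables above; the variable and function names are only for illustration.

```python
# (prompt t/s, decode t/s) per row, copied from the tables above.
NO_RPC = [(606.52, 27.60), (415.51, 24.92), (299.90, 21.13),
          (231.75, 19.05), (190.67, 17.31)]
RPC = [(592.54, 20.31), (443.18, 17.01), (333.36, 13.61),
       (268.34, 11.67), (224.11, 10.19)]

def percent_change(before: float, after: float) -> float:
    """Signed percentage change from before to after, rounded to 0.1."""
    return round((after - before) / before * 100, 1)

for (p0, d0), (p1, d1) in zip(NO_RPC, RPC):
    print(f"prompt {percent_change(p0, p1):+.1f}%  "
          f"decode {percent_change(d0, d1):+.1f}%")
```

On these tables, prompt throughput with RPC goes from about 2% slower on the first row to about 17.5% faster on the last, and decode throughput is about 26% to 41% slower.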

dark-penguin Jun 19, 2026

I see that with PP disabled, allocated compute buffers are smaller than with it enabled, so at least it's "trying" to get enabled.

Maybe we could trace PP behavior by inserting debug-level log messages at each step? Something is going wrong later on - something that's not checked, or intentionally quietly ignored (similar to enabling PP in the first place).

By the way, how do you stop the RPC server processes? They don't seem to react to any signals.

Replies: 5 comments · 24 replies
