Regression: e0dbec0 (aka #12181) breaks pooled embeddings: mean #12517

Description

@s-u

Name and Version

Affects all llama builds since e0dbec0, tested up to

version: 4941 (ba932df)
built with cc (Ubuntu 13.3.0-6ubuntu2-24.04) 13.3.0 for x86_64-linux-gnu

bug not present in

version: 4879 (f08f4b3)
built with cc (Ubuntu 13.3.0-6ubuntu2-24.04) 13.3.0 for x86_64-linux-gnu

Operating systems

Linux

Which llama.cpp modules do you know to be affected?

libllama (core library)

Command line

# Can be replicated with any model, here using Llama-3.3
# (-b/-c to reduce memory usage, but not relevant to the bug - can use model ctx size)
llama-embedding -m Llama-3.3-70B-Instruct-Q6_K-00001-of-00002.gguf -ngl 90 -b 2048 -c 2048 -p 'hello, world' --pooling mean

Problem description & steps to reproduce

Fails in llm_graph_context::build_pooling with:
llama.cpp/ggml/src/ggml.c:2738: GGML_ASSERT(ggml_can_mul_mat(a, b)) failed

Reproduce with any model using llama-embedding --pooling mean, for example:

llama-embedding -m Llama-3.3-70B-Instruct-Q6_K-00001-of-00002.gguf \
   -ngl 90 -b 2048 -c 2048 -p 'hello, world' --pooling mean

The error is due to a shape mismatch between the inp and inp_mean tensors in llama-graph.cpp@:1626.

Run with additional debug output printing nelements and nrows of inp and inp_mean (labelled imp_mean in the log):

llama_context: KV self size  =  640.00 MiB, K (f16):  320.00 MiB, V (f16):  320.00 MiB
llama_context: pipeline parallelism enabled (n_copies=4)
inp nel = 16777216, nrow = 2048
imp_mean nel = 4194304, nrow = 2048
inp nel = 16777216, nrow = 2048
imp_mean nel = 1, nrow = 1
llama.cpp/ggml/src/ggml.c:2738: GGML_ASSERT(ggml_can_mul_mat(a, b)) failed

The same run with llama 4879 (f08f4b3), i.e., before e0dbec0 (#12181):

llama_init_from_model: KV self size  =  640.00 MiB, K (f16):  320.00 MiB, V (f16):  320.00 MiB
llama_init_from_model:  CUDA_Host  output buffer size =     0.00 MiB
llama_init_from_model: pipeline parallelism enabled (n_copies=4)
inp nel = 16777216, nrow = 2048
imp_mean nel = 4194304, nrow = 2048
inp nel = 8192, nrow = 1
imp_mean nel = 1, nrow = 1
inp nel = 16777216, nrow = 2048
imp_mean nel = 4194304, nrow = 2048
llama_init_from_model:      CUDA0 compute buffer size =  1600.03 MiB
llama_init_from_model:      CUDA1 compute buffer size =  1664.06 MiB
llama_init_from_model:  CUDA_Host compute buffer size =   192.09 MiB
llama_init_from_model: graph nodes  = 2569
llama_init_from_model: graph splits = 3
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
inp nel = 16384, nrow = 2
imp_mean nel = 4, nrow = 2
[...]
batch_decode: n_tokens = 3, n_seq = 1
inp nel = 24576, nrow = 3
imp_mean nel = 9, nrow = 3
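
To make the mismatch concrete, here is a minimal standalone sketch of the shape check, not the actual ggml code. It rests on two assumptions that the logs do not confirm: that mean pooling multiplies the transposed inp by inp_mean, and that the compatibility rule behind ggml_can_mul_mat is "first dimensions equal, higher dimensions broadcastable". The Shape struct and all helper names are hypothetical, and the row length 8192 is inferred from the logs as nel / nrow (16777216 / 2048).

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a tensor: only the four dimension sizes,
// with ne[0] as the innermost (row length) dimension.
struct Shape {
    std::array<int64_t, 4> ne;
};

int64_t nelements(const Shape & s) { return s.ne[0] * s.ne[1] * s.ne[2] * s.ne[3]; }
int64_t nrows(const Shape & s)     { return s.ne[1] * s.ne[2] * s.ne[3]; }

// Swap the first two dimensions.
Shape transpose(const Shape & s) {
    return Shape{ { s.ne[1], s.ne[0], s.ne[2], s.ne[3] } };
}

// Assumed compatibility rule, modeled on the check named in the assert:
// equal first dimension, and b broadcastable over a in dimensions 2 and 3.
bool can_mul_mat(const Shape & a, const Shape & b) {
    return a.ne[0] == b.ne[0] &&
           b.ne[2] % a.ne[2] == 0 &&
           b.ne[3] % a.ne[3] == 0;
}

// Assumed form of the mean-pooling product: mul_mat(transpose(inp), inp_mean).
bool mean_pooling_shapes_ok(const Shape & inp, const Shape & inp_mean) {
    return can_mul_mat(transpose(inp), inp_mean);
}

// Shapes reconstructed from the logs above.
void check_log_shapes() {
    const Shape inp_full { { 8192, 2048, 1, 1 } };  // nel = 16777216, nrow = 2048
    const Shape mean_full{ { 2048, 2048, 1, 1 } };  // nel = 4194304,  nrow = 2048
    const Shape inp_one  { { 8192, 1, 1, 1 } };     // nel = 8192,     nrow = 1
    const Shape mean_one { { 1, 1, 1, 1 } };        // nel = 1,        nrow = 1

    assert(nelements(inp_full) == 16777216 && nrows(inp_full) == 2048);
    assert(nelements(mean_full) == 4194304 && nrows(mean_full) == 2048);

    assert(mean_pooling_shapes_ok(inp_full, mean_full));  // 2048 == 2048
    assert(mean_pooling_shapes_ok(inp_one, mean_one));    // 1 == 1, as logged by 4879
    assert(!mean_pooling_shapes_ok(inp_full, mean_one));  // 2048 != 1, the pair logged before the assert
}
```

Under these assumptions, the only pairing that fails is the one logged immediately before the assert on the affected builds: inp still has 2048 rows while inp_mean has a single element. In 4879 the single-element inp_mean is paired with a one-row inp instead, so the check passes.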

First Bad Commit

e0dbec0

Relevant log output
