llama: add llama_batch_ext#24669
Conversation
There was a problem hiding this comment.
To keep it simpler, we can do the output-related logic in a next PR. I.e. in the first PR, we just introduce the llama_batch_ext and use it to pass the inputs, but we leave the llama_context to handle the output buffers and embedding extractions as it is. Then in the next PR, we will move all the output logic to the batch.
| // Set output = true for the last added token in the batch | ||
| // Returns the batch index (>= 0) | ||
| LLAMA_API bool llama_batch_ext_set_output( | ||
| struct llama_batch_ext * batch, | ||
| int32_t idx, | ||
| bool output_last); |
There was a problem hiding this comment.
The comment is incorrect here - it's not the last added token, but the idx token.
| float * embd_nextn; // used by nextn layers | ||
| llama_pos * pos; // if nullptr, the position will be automatically assigned | ||
| // for M-RoPE models, embedding tokens must have multiple positions per token; text token only requires one single position per token | ||
| llama_seq_id seq_id; |
There was a problem hiding this comment.
We still want to support multiple sequence ids per token.
There was a problem hiding this comment.
hmm tbh I don't quite like passing pointer-to-pointer llama_seq_id **, as it makes the caller do more works. so I'm wondering if we should redesign it to get rid of struct llama_batch_token. my idea now is to have 2 categories of calls:
_add call that returns batch index:
llama_batch_ext_add_token--> add by token IDllama_batch_ext_add_embd--> add by embeddings
then an array of _set that adds more info to the returned batch index:
llama_batch_ext_set_embd_nextn(int32_t idx, float * embd)llama_batch_ext_set_seq_id(int32_t idx, llama_seq_id * seq_id, size_t n_seq)--> can set to multiple sequencesllama_batch_ext_set_posllama_batch_ext_set_output
also, do you think _set_output should be a boolean, or it should be a bit field, for example LLAMA_OUTPUT_NEXTN | LLAMA_OUTPUT_EMBD ?
There was a problem hiding this comment.
so I'm wondering if we should redesign it to get rid of struct llama_batch_token
Yes, API based on the entry index seems quite generic and clean.
llama_batch_ext_add_token --> add by token ID
llama_batch_ext_add_embd --> add by embeddings
I think you can simplify by having single llama_batch_ext_add instead of differentiating token/embd.
also, do you think _set_output should be a boolean, or it should be a bit field, for example LLAMA_OUTPUT_NEXTN | LLAMA_OUTPUT_EMBD ?
I think we probably need per-entry llama_batch_ext_set_output(batch, idx, value);. And then the contents of the outputs likely not have to be per-entry, but for the entire batch:
llama_batch_ext_output_embd (batch, value);
llama_batch_ext_output_embd_nextn(batch, value, masked);
llama_batch_ext_output_layer_inp (batch, value);
...But for now these can remain controlled by the llama_context for now because the llama_context currently will own the output buffers, not the batch.
There was a problem hiding this comment.
I think you can simplify by having single
llama_batch_ext_addinstead of differentiating token/embd.
actually there will be 3 choices for that:
llama_batch_ext_add()that simply returns an idx, and need a separated_set_token(id)or_set_embd(float * embd)llama_batch_ext_add(id), ifid == LLAMA_TOKEN_NULLthen_set_embd(float *)is requiredllama_batch_ext_add(id, float * embd)where either one of two can be set
however, since each input entry in the batch requires at least token ID or token embd to be consider "valid", I think having explicit _add_token and _add_embd will be a better design overall. WDYT ?
There was a problem hiding this comment.
We can have all three:
int32_t llama_batch_ext_add (llama_batch_ext * batch);
int32_t llama_batch_ext_add_token(llama_batch_ext * batch, llama_token id);
int32_t llama_batch_ext_add_embd (llama_batch_ext * batch, llama_embd embd);
Overview
Supersede #11875
Status: early WIP, for discussion only
Additional information
Demo usage:
Requirements