Python Bindings for `llama.cpp`

Personally maintained fork of the original abetlen/llama-cpp-python

Python bindings for @ggerganov's llama.cpp library.

Note

This is my personal fork of the original llama-cpp-python project. The original repository has been largely inactive, and since several of my projects depend on these bindings, I created this fork to:

Provide active maintenance and quick fixes.
Offer pre-built wheels for CPU and CUDA (with a focus on stability and simplicity).
Intentionally limit support to CPU and CUDA only (no Metal, no macOS/ARM-specific builds) for better reliability on common server/desktop setups.
Plan future custom enhancements and features that may diverge significantly from the original.

If you need Metal, macOS, or broader hardware support, please use the original repository: abetlen/llama-cpp-python.

Thanks to Andrei Betlen for the original work!

Note

Documentation is currently shared with the original project: https://llama-cpp-python.readthedocs.io/en/latest. As the project diverges, separate documentation may be created.

This package provides:

Low-level access to C API via ctypes interface.
High-level Python API for text completion
- OpenAI-like API
- LangChain compatibility
- LlamaIndex compatibility
OpenAI compatible web server

Documentation is available at https://llama-cpp-python.readthedocs.io/en/latest.

Installation

Requirements:

Python 3.9+
C compiler
- Linux: GCC or Clang
- Windows: Visual Studio or MinGW (MSYS2)
CMake 3.21+
Git

To install the package, run:

pip install -U "guanaco-py @ git+https://github.com/TheBigEye/guanaco-py.git"

or also:

git clone https://github.com/TheBigEye/guanaco-py --recursive
cd guanaco-py
python -m pip install -U pip
pip install .

To install a specific release, for example v0.5.0:

pip install -U "guanaco-py @ git+https://github.com/TheBigEye/guanaco-py.git@v0.5.0"

This will build llama.cpp from source and install it alongside this python package.

If this fails, add --verbose to the pip install command to see the full CMake build log.

Pre-built Wheel

It is also possible to install a pre-built wheel with basic CPU support.

pip install guanaco-py --extra-index-url https://thebigeye.github.io/guanaco-py/whl/cpu

Note

If pip cannot find a matching wheel, try forcing a binary-only install:

pip install guanaco-py --only-binary=:all: --extra-index-url https://thebigeye.github.io/guanaco-py/whl/cpu/

Installation Configuration

llama.cpp supports a number of hardware acceleration backends to speed up inference as well as backend specific options. See the llama.cpp build docs for a full list.

All llama.cpp cmake build options can be set via the CMAKE_ARGS environment variable or via the --config-settings / -C cli flag during installation.

Environment Variables

# Linux
CMAKE_ARGS="-DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS" \
  pip install "guanaco-py @ git+https://github.com/TheBigEye/guanaco-py.git"

# Windows
$env:CMAKE_ARGS = "-DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS"
pip install "guanaco-py @ git+https://github.com/TheBigEye/guanaco-py.git"

CLI / requirements.txt

They can also be set via pip install -C / --config-settings command and saved to a requirements.txt file:

pip install --upgrade pip # ensure pip is up to date
pip install "guanaco-py @ git+https://github.com/TheBigEye/guanaco-py.git" \
  -C cmake.args="-DGGML_BLAS=ON;-DGGML_BLAS_VENDOR=OpenBLAS"

# requirements.txt

git+https://github.com/TheBigEye/guanaco-py -C cmake.args="-DGGML_BLAS=ON;-DGGML_BLAS_VENDOR=OpenBLAS"
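
The two mechanisms take the same CMake options but separate them differently: in the examples above, CMAKE_ARGS is space-separated while cmake.args is semicolon-separated. A minimal Python sketch of that difference, where to_cmake_args and to_config_setting are hypothetical helper names and not part of guanaco-py:

```python
def to_cmake_args(options: dict) -> str:
    """Render options as a space-separated value for the CMAKE_ARGS variable."""
    return " ".join(f"-D{key}={value}" for key, value in options.items())


def to_config_setting(options: dict) -> str:
    """Render options as a semicolon-separated cmake.args value for pip -C."""
    joined = ";".join(f"-D{key}={value}" for key, value in options.items())
    return f"cmake.args={joined}"


options = {"GGML_BLAS": "ON", "GGML_BLAS_VENDOR": "OpenBLAS"}
print(to_cmake_args(options))
print(to_config_setting(options))
```

Keeping the options in one dictionary may help avoid the two forms drifting apart when both a shell script and a requirements.txt file are maintained.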

Supported Backends

Below are some common backends, their build commands and any additional environment variables required.

OpenBLAS (CPU)

To install with OpenBLAS, pass the GGML_BLAS and GGML_BLAS_VENDOR CMake options through the CMAKE_ARGS environment variable before installing:

CMAKE_ARGS="-DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS" pip install "guanaco-py @ git+https://github.com/TheBigEye/guanaco-py.git"

CUDA

Installing a CUDA-supported version requires the CUDA Toolkit environment to be installed first.

See here: https://developer.nvidia.com/cuda-toolkit-archive

Then pass -DGGML_CUDA=on through the CMAKE_ARGS environment variable before installing:

# Linux
CMAKE_ARGS="-DGGML_CUDA=on" pip install "guanaco-py @ git+https://github.com/TheBigEye/guanaco-py.git"

# Windows
$env:CMAKE_ARGS = "-DGGML_CUDA=on"
pip install "guanaco-py @ git+https://github.com/TheBigEye/guanaco-py.git"

Pre-built Wheel

It is also possible to install a pre-built wheel with CUDA support, as long as your system meets the following requirements:

CUDA Version is 12.1, 12.2, 12.3 or 12.4
Python Version is 3.9, 3.10, 3.11, 3.12 or 3.13

pip install guanaco-py \
  --extra-index-url https://thebigeye.github.io/guanaco-py/whl/<cuda-version>

Where <cuda-version> is one of the following:

cu121: CUDA 12.1
cu122: CUDA 12.2
cu123: CUDA 12.3
cu124: CUDA 12.4

For example, to install the CUDA 12.1 wheel:

pip install guanaco-py --extra-index-url https://thebigeye.github.io/guanaco-py/whl/cu121
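
The mapping from CUDA version to index URL listed above is mechanical, so it can be computed when scripting installs. A small sketch, assuming only the four versions listed; wheel_index_url is a hypothetical helper name:

```python
WHEEL_INDEX = "https://thebigeye.github.io/guanaco-py/whl"
SUPPORTED_CUDA = ("12.1", "12.2", "12.3", "12.4")


def wheel_index_url(cuda_version: str) -> str:
    """Map a CUDA version such as "12.1" to its wheel index URL (tag "cu121")."""
    if cuda_version not in SUPPORTED_CUDA:
        raise ValueError(f"no pre-built wheel listed for CUDA {cuda_version}")
    tag = "cu" + cuda_version.replace(".", "")
    return f"{WHEEL_INDEX}/{tag}"


print(wheel_index_url("12.1"))
```

Rejecting unlisted versions up front gives a clearer error than letting pip silently fall back to a source build.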

HIP (ROCm)

This provides GPU acceleration on HIP-supported AMD GPUs. Make sure to have ROCm installed.

You can download it from your Linux distro's package manager or from here: ROCm Quick Start (Linux).

To install with HIP / ROCm support for AMD cards, pass -DGGML_HIP=ON through the CMAKE_ARGS environment variable before installing:

CMAKE_ARGS="-DGGML_HIP=ON -DGPU_TARGETS=gfx1030" pip install "guanaco-py @ git+https://github.com/TheBigEye/guanaco-py.git"

Note: GPU_TARGETS is optional; omitting it builds the code for all GPUs in the current system.

For more details, see: ggml-org/llama.cpp/build.md#hip

Vulkan

Windows users: Download and install the Vulkan SDK with the default settings.
Linux users: Follow the official LunarG instructions for installing and setting up the Vulkan SDK in the Getting Started with the Linux Tarball Vulkan SDK guide.

To install with Vulkan support, pass -DGGML_VULKAN=on through the CMAKE_ARGS environment variable before installing:

CMAKE_ARGS="-DGGML_VULKAN=on" pip install "guanaco-py @ git+https://github.com/TheBigEye/guanaco-py.git"

SYCL

To install with SYCL support, pass -DGGML_SYCL=on through the CMAKE_ARGS environment variable before installing:

source /opt/intel/oneapi/setvars.sh
CMAKE_ARGS="-DGGML_SYCL=on -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx" pip install "guanaco-py @ git+https://github.com/TheBigEye/guanaco-py.git"

RPC

To install with RPC support, pass -DGGML_RPC=on through the CMAKE_ARGS environment variable before installing:

CMAKE_ARGS="-DGGML_RPC=on" pip install "guanaco-py @ git+https://github.com/TheBigEye/guanaco-py.git"

Windows Notes

Error: Can't find 'nmake' or 'CMAKE_C_COMPILER'

If you run into errors saying that 'nmake' or CMAKE_C_COMPILER cannot be found, you can extract w64devkit as mentioned in the llama.cpp repo and add the compiler paths manually to CMAKE_ARGS before running pip install:

$env:CMAKE_GENERATOR = "MinGW Makefiles"
$env:CMAKE_ARGS = "-DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS -DCMAKE_C_COMPILER=C:/w64devkit/bin/gcc.exe -DCMAKE_CXX_COMPILER=C:/w64devkit/bin/g++.exe"

See the above instructions and set CMAKE_ARGS to the BLAS backend you want to use.

Upgrading and Reinstalling

To upgrade and rebuild guanaco-py, add the --upgrade --force-reinstall --no-cache-dir flags to the pip install command to ensure the package is rebuilt from source.

High-level API

API Reference

The high-level API provides a simple managed interface through the Llama class.

Below is a short example demonstrating how to use the high-level API for basic text completion:

from llama_cpp import Llama

llm = Llama(
      model_path="./models/7B/llama-model.gguf",
      # n_gpu_layers=-1, # Uncomment to use GPU acceleration
      # seed=1337, # Uncomment to set a specific seed
      # n_ctx=2048, # Uncomment to increase the context window
)
output = llm(
      "Q: Name the planets in the solar system? A: ", # Prompt
      max_tokens=32, # Generate up to 32 tokens, set to None to generate up to the end of the context window
      stop=["Q:", "\n"], # Stop generating just before the model would generate a new question
      echo=True # Echo the prompt back in the output
) # Generate a completion, can also call create_completion
print(output)

By default guanaco-py generates completions in an OpenAI compatible format:

{
  "id": "cmpl-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
  "object": "text_completion",
  "created": 1679561337,
  "model": "./models/7B/llama-model.gguf",
  "choices": [
    {
      "text": "Q: Name the planets in the solar system? A: Mercury, Venus, Earth, Mars, Jupiter, Saturn, Uranus, Neptune and Pluto.",
      "index": 0,
      "logprobs": None,
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 14,
    "completion_tokens": 28,
    "total_tokens": 42
  }
}

Text completion is available through the __call__ and create_completion methods of the Llama class.
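
Since the result is a plain dictionary, the generated text and token counts can be read with ordinary indexing. A short sketch using a hand-copied version of the sample response shown above, so it runs without a model:

```python
# Sample response copied from the example output above (not a live call).
output = {
    "object": "text_completion",
    "choices": [
        {
            "text": "Q: Name the planets in the solar system? A: Mercury, Venus, "
                    "Earth, Mars, Jupiter, Saturn, Uranus, Neptune and Pluto.",
            "index": 0,
            "logprobs": None,
            "finish_reason": "stop",
        }
    ],
    "usage": {"prompt_tokens": 14, "completion_tokens": 28, "total_tokens": 42},
}

first_choice = output["choices"][0]
text = first_choice["text"]
# True when generation ended on a stop sequence.
stopped_on_stop_sequence = first_choice["finish_reason"] == "stop"
usage = output["usage"]

print(text)
print(stopped_on_stop_sequence, usage["total_tokens"])
```

Because echo=True was set in the example call, the text field begins with the prompt itself.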

Pulling models from Hugging Face Hub

You can download Llama models in gguf format directly from Hugging Face using the from_pretrained method. You'll need to install the huggingface-hub package to use this feature (pip install huggingface-hub).

llm = Llama.from_pretrained(
    repo_id="Qwen/Qwen2-0.5B-Instruct-GGUF",
    filename="*q8_0.gguf",
    verbose=False
)

By default, from_pretrained downloads the model to the Hugging Face cache directory; you can then manage installed model files with the huggingface-cli tool.
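
The filename argument in the example is a wildcard pattern rather than an exact name. A sketch of how such a pattern selects a file, assuming shell-style matching in the manner of Python's fnmatch module (an assumption about the matching rules, not a statement of the library's internals); the file listing is made up for illustration:

```python
import fnmatch

# Hypothetical file listing for a GGUF repository.
repo_files = [
    "README.md",
    "model-q4_0.gguf",
    "model-q8_0.gguf",
]

# "*" matches any run of characters, so only the q8_0 file is selected.
matches = fnmatch.filter(repo_files, "*q8_0.gguf")
print(matches)
```

A pattern that matches more than one file is ambiguous, so it is safer to make the pattern specific enough to select a single quantization.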

Chat Completion

The high-level API also provides a simple interface for chat completion.

Chat completion requires that the model knows how to format the messages into a single prompt. The Llama class does this using pre-registered chat formats (e.g. chatml, llama-2, gemma, etc.) or by providing a custom chat handler object.

The Llama class formats the messages into a single prompt using the following order of precedence:

Use the chat_handler if provided
Use the chat_format if provided
Use the tokenizer.chat_template from the gguf model's metadata (should work for most new models, older models may not have this)
Otherwise, fall back to the llama-2 chat format
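
The precedence list above can be written as a chain of checks. This is only an illustration of the ordering, not the library's code; select_chat_formatter and its return labels are hypothetical:

```python
from typing import Optional


def select_chat_formatter(
    chat_handler: Optional[object] = None,
    chat_format: Optional[str] = None,
    gguf_chat_template: Optional[str] = None,
) -> str:
    """Return a label naming which source would format the prompt."""
    if chat_handler is not None:
        return "chat_handler"
    if chat_format is not None:
        return f"chat_format:{chat_format}"
    if gguf_chat_template is not None:
        return "tokenizer.chat_template"
    # Nothing was provided and the model carries no template.
    return "chat_format:llama-2"


print(select_chat_formatter(chat_format="chatml"))
print(select_chat_formatter())
```

The practical consequence is that an explicit chat_format overrides a template embedded in the model file, which matters when the embedded template is wrong or outdated.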

Set verbose=True to see the selected chat format.

from llama_cpp import Llama
llm = Llama(
      model_path="path/to/llama-2/llama-model.gguf",
      chat_format="llama-2"
)
llm.create_chat_completion(
      messages = [
          {"role": "system", "content": "You are an assistant who perfectly describes images."},
          {
              "role": "user",
              "content": "Describe this image in detail please."
          }
      ]
)

Chat completion is available through the create_chat_completion method of the Llama class.

For OpenAI API v1 compatibility, you can use the create_chat_completion_openai_v1 method, which returns pydantic models instead of dicts.

JSON and JSON Schema Mode

To constrain chat responses to only valid JSON or a specific JSON Schema use the response_format argument in create_chat_completion.

JSON Mode

The following example will constrain the response to valid JSON strings only.

from llama_cpp import Llama
llm = Llama(model_path="path/to/model.gguf", chat_format="chatml")
llm.create_chat_completion(
    messages=[
        {
            "role": "system",
            "content": "You are a helpful assistant that outputs in JSON.",
        },
        {"role": "user", "content": "Who won the world series in 2020"},
    ],
    response_format={
        "type": "json_object",
    },
    temperature=0.7,
)

JSON Schema Mode

To constrain the response further to a specific JSON Schema add the schema to the schema property of the response_format argument.

from llama_cpp import Llama
llm = Llama(model_path="path/to/model.gguf", chat_format="chatml")
llm.create_chat_completion(
    messages=[
        {
            "role": "system",
            "content": "You are a helpful assistant that outputs in JSON.",
        },
        {"role": "user", "content": "Who won the world series in 2020"},
    ],
    response_format={
        "type": "json_object",
        "schema": {
            "type": "object",
            "properties": {"team_name": {"type": "string"}},
            "required": ["team_name"],
        },
    },
    temperature=0.7,
)
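
Once the response is constrained to a schema, the message content can be parsed with the standard json module. A sketch assuming the OpenAI-style response layout (choices, then message, then content); the response below is hand-written for illustration and parse_json_reply is a hypothetical helper:

```python
import json

# Hand-written sample in the chat completion shape; real output depends on the model.
response = {
    "choices": [
        {
            "message": {
                "role": "assistant",
                "content": '{"team_name": "Los Angeles Dodgers"}',
            }
        }
    ]
}


def parse_json_reply(response: dict, required: list) -> dict:
    """Decode the assistant message as JSON and check the required keys."""
    content = response["choices"][0]["message"]["content"]
    data = json.loads(content)
    missing = [key for key in required if key not in data]
    if missing:
        raise KeyError(f"missing required keys: {missing}")
    return data


reply = parse_json_reply(response, ["team_name"])
print(reply["team_name"])
```

The key check is redundant when the schema is enforced during generation, but it is cheap and guards against a response truncated by max_tokens.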

Function Calling

The high-level API supports OpenAI compatible function and tool calling. This is possible through the functionary pre-trained models chat format or through the generic chatml-function-calling chat format.

from llama_cpp import Llama
llm = Llama(model_path="path/to/chatml/llama-model.gguf", chat_format="chatml-function-calling")
llm.create_chat_completion(
      messages = [
        {
          "role": "system",
          "content": "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. The assistant calls functions with appropriate input when necessary"

        },
        {
          "role": "user",
          "content": "Extract Jason is 25 years old"
        }
      ],
      tools=[{
        "type": "function",
        "function": {
          "name": "UserDetail",
          "parameters": {
            "type": "object",
            "title": "UserDetail",
            "properties": {
              "name": {
                "title": "Name",
                "type": "string"
              },
              "age": {
                "title": "Age",
                "type": "integer"
              }
            },
            "required": [ "name", "age" ]
          }
        }
      }],
      tool_choice={
        "type": "function",
        "function": {
          "name": "UserDetail"
        }
      }
)
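
The call above returns the function arguments as a JSON string inside the assistant message. A sketch of reading them back, assuming the OpenAI tool-call layout (message.tool_calls[i].function.arguments); the response is hand-written for illustration and extract_tool_calls is a hypothetical helper:

```python
import json

# Hand-written sample in the OpenAI tool-call shape; not real model output.
response = {
    "choices": [
        {
            "message": {
                "role": "assistant",
                "content": None,
                "tool_calls": [
                    {
                        "id": "call_0",
                        "type": "function",
                        "function": {
                            "name": "UserDetail",
                            "arguments": '{"name": "Jason", "age": 25}',
                        },
                    }
                ],
            }
        }
    ]
}


def extract_tool_calls(response: dict) -> list:
    """Return (function name, decoded arguments) pairs from the first choice."""
    message = response["choices"][0]["message"]
    calls = []
    for call in message.get("tool_calls") or []:
        function = call["function"]
        calls.append((function["name"], json.loads(function["arguments"])))
    return calls


print(extract_tool_calls(response))
```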

Functionary v2

The various gguf-converted files for this set of models can be found here. Functionary is able to intelligently call functions and also analyze any provided function outputs to generate coherent responses. All v2 models of functionary support parallel function calling. You can provide either functionary-v1 or functionary-v2 for the chat_format when initializing the Llama class.

Due to discrepancies between llama.cpp and HuggingFace's tokenizers, you must provide an HF tokenizer for functionary. The LlamaHFTokenizer class can be initialized and passed into the Llama class; this overrides the default llama.cpp tokenizer used in the Llama class. The tokenizer files are already included in the respective HF repositories hosting the gguf files.

from llama_cpp import Llama
from llama_cpp.llama_tokenizer import LlamaHFTokenizer
llm = Llama.from_pretrained(
  repo_id="meetkai/functionary-small-v2.2-GGUF",
  filename="functionary-small-v2.2.q4_0.gguf",
  chat_format="functionary-v2",
  tokenizer=LlamaHFTokenizer.from_pretrained("meetkai/functionary-small-v2.2-GGUF")
)

NOTE: There is no need to provide the default system messages used in Functionary as they are added automatically in the Functionary chat handler. Thus, the messages should contain just the chat messages and/or system messages that provide additional context for the model (e.g.: datetime, etc.).

Multi-modal Models

guanaco-py supports multi-modal models such as llava1.5, which allow the language model to read information from both text and images.

Below are the supported multi-modal models and their respective chat handlers (Python API) and chat formats (Server API).

| Model | `LlamaChatHandler` | `chat_format` |
|---|---|---|
| llava-v1.5-7b | `Llava15ChatHandler` | `llava-1-5` |
| llava-v1.5-13b | `Llava15ChatHandler` | `llava-1-5` |
| llava-v1.6-34b | `Llava16ChatHandler` | `llava-1-6` |
| moondream2 | `MoondreamChatHandler` | `moondream2` |
| nanollava | `NanollavaChatHandler` | `nanollava` |
| llama-3-vision-alpha | `Llama3VisionAlphaChatHandler` | `llama-3-vision-alpha` |
| minicpm-v-2.6 | `MiniCPMv26ChatHandler` | `minicpm-v-2.6`, `minicpm-v-4.0` |
| minicpm-v-4.5 | `MiniCPMv45ChatHandler` | `minicpm-v-4.5` |
| gemma3 | `Gemma3ChatHandler` | `gemma3` |
| glm4.1v | `GLM41VChatHandler` | `glm4.1v` |
| glm4.6v | `GLM46VChatHandler` | `glm4.6v` |
| granite-docling | `GraniteDoclingChatHandler` | `granite-docling` |
| lfm2-vl | `LFM2VLChatHandler` | `lfm2-vl` |
| qwen2.5-vl | `Qwen25VLChatHandler` | `qwen2.5-vl` |
| qwen3-vl | `Qwen3VLChatHandler` | `qwen3-vl` |

To use one of these models, you'll need a custom chat handler to load the clip model and process the chat messages and images.

from llama_cpp import Llama
from llama_cpp.llama_chat_format import Llava15ChatHandler
chat_handler = Llava15ChatHandler(clip_model_path="path/to/llava/mmproj.bin")
llm = Llama(
  model_path="./path/to/llava/llama-model.gguf",
  chat_handler=chat_handler,
  n_ctx=2048, # n_ctx should be increased to accommodate the image embedding
)
llm.create_chat_completion(
    messages = [
        {"role": "system", "content": "You are an assistant who perfectly describes images."},
        {
            "role": "user",
            "content": [
                {"type" : "text", "text": "What's in this image?"},
                {"type": "image_url", "image_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg" } }
            ]
        }
    ]
)

You can also pull the model from the Hugging Face Hub using the from_pretrained method.

from llama_cpp import Llama
from llama_cpp.llama_chat_format import MoondreamChatHandler

chat_handler = MoondreamChatHandler.from_pretrained(
  repo_id="vikhyatk/moondream2",
  filename="*mmproj*",
)

llm = Llama.from_pretrained(
  repo_id="vikhyatk/moondream2",
  filename="*text-model*",
  chat_handler=chat_handler,
  n_ctx=2048, # n_ctx should be increased to accommodate the image embedding
)

response = llm.create_chat_completion(
    messages = [
        {
            "role": "user",
            "content": [
                {"type" : "text", "text": "What's in this image?"},
                {"type": "image_url", "image_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg" } }

            ]
        }
    ]
)
print(response["choices"][0]["message"]["content"])

Note: Multi-modal models also support tool calling and JSON mode.

Loading a Local Image

Images can be passed as base64 encoded data URIs. The following example demonstrates how to do this.

import base64

def image_to_base64_data_uri(file_path):
    with open(file_path, "rb") as img_file:
        base64_data = base64.b64encode(img_file.read()).decode('utf-8')
        return f"data:image/png;base64,{base64_data}"

# Replace 'file_path.png' with the actual path to your PNG file
file_path = 'file_path.png'
data_uri = image_to_base64_data_uri(file_path)

messages = [
    {"role": "system", "content": "You are an assistant who perfectly describes images."},
    {
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": data_uri }},
            {"type" : "text", "text": "Describe this image in detail please."}
        ]
    }
]
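
The helper above hard-codes image/png in the data URI. A variant that derives the MIME type from the file extension with the standard mimetypes module is sketched below; file_to_data_uri is a hypothetical name, and whether a given chat handler accepts formats other than PNG is an assumption that should be checked against the handler in use:

```python
import base64
import mimetypes


def file_to_data_uri(file_path: str) -> str:
    """Encode an image file as a base64 data URI, guessing the type from its extension."""
    mime_type, _ = mimetypes.guess_type(file_path)
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"cannot determine an image type for {file_path!r}")
    with open(file_path, "rb") as img_file:
        encoded = base64.b64encode(img_file.read()).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"
```

The resulting string can be used as the url value of an image_url content part, in the same way as data_uri in the example above.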

Speculative Decoding

guanaco-py supports speculative decoding which allows the model to generate completions based on a draft model.

The fastest way to use speculative decoding is through the LlamaPromptLookupDecoding class.

Just pass this as a draft model to the Llama class during initialization.

from llama_cpp import Llama
from llama_cpp.llama_speculative import LlamaPromptLookupDecoding

llama = Llama(
    model_path="path/to/model.gguf",
    # num_pred_tokens is the number of tokens to predict.
    # 10 is the default and generally good for GPU; 2 performs better on CPU-only machines.
    draft_model=LlamaPromptLookupDecoding(num_pred_tokens=10),
)
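
The general idea behind prompt lookup is to find an earlier occurrence of the trailing n-gram in the token sequence and propose the tokens that followed it as the draft, which the main model then verifies. The toy sketch below shows that idea with plain Python lists; it is a simplified illustration and not the LlamaPromptLookupDecoding implementation, and prompt_lookup_draft and ngram_size are names invented for the example:

```python
def prompt_lookup_draft(tokens, ngram_size=2, num_pred_tokens=3):
    """Propose draft tokens by matching the trailing n-gram earlier in the sequence."""
    if len(tokens) <= ngram_size:
        return []
    tail = tokens[-ngram_size:]
    # Scan backwards so the most recent earlier match wins; skip the tail itself.
    for start in range(len(tokens) - ngram_size - 1, -1, -1):
        if tokens[start:start + ngram_size] == tail:
            follow_from = start + ngram_size
            return tokens[follow_from:follow_from + num_pred_tokens]
    return []


# The trailing pair [1, 2] also appears at the start, followed by 3, 4, 1.
print(prompt_lookup_draft([1, 2, 3, 4, 1, 2]))
```

This works well when the output repeats spans of the input, as in summarization or code editing, and costs nothing extra when no match is found because the draft is simply empty.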

Embeddings

To generate text embeddings use create_embedding or embed. Note that you must pass embedding=True to the constructor upon model creation for these to work properly.

import llama_cpp

llm = llama_cpp.Llama(model_path="path/to/model.gguf", embedding=True)

embeddings = llm.create_embedding("Hello, world!")

# or create multiple embeddings at once

embeddings = llm.create_embedding(["Hello, world!", "Goodbye, world!"])

There are two primary notions of embeddings in a Transformer-style model: token level and sequence level. Sequence level embeddings are produced by "pooling" token level embeddings together, usually by averaging them or using the first token.

Models that are explicitly geared towards embeddings will usually return sequence level embeddings by default, one for each input string. Non-embedding models such as those designed for text generation will typically return only token level embeddings, one for each token in each sequence. Thus the dimensionality of the return type will be one higher for token level embeddings.

It is possible to control pooling behavior in some cases using the pooling_type flag on model creation. You can ensure token level embeddings from any model using LLAMA_POOLING_TYPE_NONE. The reverse, getting a generation oriented model to yield sequence level embeddings is currently not possible, but you can always do the pooling manually.
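
Manual pooling, as mentioned above, reduces a list of per-token vectors to one vector per sequence. A minimal pure-Python sketch of the two strategies named earlier (averaging and taking the first token); the function names and the sample vectors are illustrative only:

```python
def mean_pool(token_embeddings):
    """Average per-token vectors into a single sequence-level vector."""
    if not token_embeddings:
        raise ValueError("need at least one token embedding")
    count = len(token_embeddings)
    dim = len(token_embeddings[0])
    return [sum(vec[i] for vec in token_embeddings) / count for i in range(dim)]


def first_token_pool(token_embeddings):
    """Use the first token's vector as the sequence-level vector."""
    if not token_embeddings:
        raise ValueError("need at least one token embedding")
    return list(token_embeddings[0])


# Two tokens with two-dimensional embeddings, made up for the example.
token_level = [[1.0, 2.0], [3.0, 4.0]]
print(mean_pool(token_level))
print(first_token_pool(token_level))
```

Which strategy is appropriate depends on how the model was trained, so mean pooling of a generation-oriented model's token embeddings should be treated as a rough approximation.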

Adjusting the Context Window

The context window of the Llama models determines the maximum number of tokens that can be processed at once. By default, this is set to 512 tokens, but can be adjusted based on your requirements.

For instance, if you want to work with larger contexts, you can expand the context window by setting the n_ctx parameter when initializing the Llama object:

llm = Llama(model_path="./models/7B/llama-model.gguf", n_ctx=2048)
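
Because the window covers both the prompt and the generated tokens, a quick budget check can help pick n_ctx and max_tokens. A small arithmetic sketch with hypothetical helper names, using the 512-token default mentioned above:

```python
def fits_in_context(prompt_tokens: int, max_tokens: int, n_ctx: int = 512) -> bool:
    """Check that the prompt plus the requested completion fits in the window."""
    return prompt_tokens + max_tokens <= n_ctx


def remaining_budget(prompt_tokens: int, n_ctx: int = 512) -> int:
    """Number of tokens left for generation after the prompt."""
    return max(n_ctx - prompt_tokens, 0)


# The earlier completion example used a 14-token prompt and max_tokens=32.
print(fits_in_context(14, 32))
print(remaining_budget(14))
```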

OpenAI Compatible Web Server

guanaco-py offers a web server which aims to act as a drop-in replacement for the OpenAI API. This allows you to use llama.cpp compatible models with any OpenAI compatible client (language libraries, services, etc).

To install the server package and get started:

pip install 'git+https://github.com/TheBigEye/guanaco-py.git#egg=guanaco-py[server]'
python3 -m llama_cpp.server --model models/7B/llama-model.gguf

Similar to the Supported Backends section above, you can also install with GPU (CUDA) support like this:

CMAKE_ARGS="-DGGML_CUDA=on" FORCE_CMAKE=1 pip install 'git+https://github.com/TheBigEye/guanaco-py.git#egg=guanaco-py[server]'
python3 -m llama_cpp.server --model models/7B/llama-model.gguf --n_gpu_layers 35

Navigate to http://localhost:8000/docs to see the OpenAPI documentation.

To bind to 0.0.0.0 to enable remote connections, use python3 -m llama_cpp.server --host 0.0.0.0. Similarly, to change the port (default is 8000), use --port.

You probably also want to set the prompt format. For chatml, use

python3 -m llama_cpp.server --model models/7B/llama-model.gguf --chat_format chatml

That will format the prompt according to how the model expects it. You can find the prompt format in the model card. For possible options, see llama_cpp/llama_chat_format.py and look for lines starting with "@register_chat_format".

If you have huggingface-hub installed, you can also use the --hf_model_repo_id flag to load a model from the Hugging Face Hub.

python3 -m llama_cpp.server --hf_model_repo_id Qwen/Qwen2-0.5B-Instruct-GGUF --model '*q8_0.gguf'
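
Once the server is running, any HTTP client can talk to it. The sketch below builds a chat request with the standard library only, assuming the server exposes the OpenAI-style /v1/chat/completions route on the default host and port; build_chat_request is a hypothetical helper, and the network call is left commented out so the snippet runs without a server:

```python
import json
import urllib.request


def build_chat_request(messages, base_url="http://localhost:8000"):
    """Build a POST request for the OpenAI-style chat completions route."""
    payload = json.dumps({"messages": messages}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_chat_request([{"role": "user", "content": "Hello"}])
print(request.full_url)

# With a server running, send the request like this:
# with urllib.request.urlopen(request) as reply:
#     print(json.load(reply)["choices"][0]["message"]["content"])
```

The interactive documentation at /docs lists the routes the running server actually provides, which is the place to confirm the path before relying on it.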

Web Server Features

Docker image

A Docker image is available on GHCR. To run the server:

docker run --rm -it -p 8000:8000 -v /path/to/models:/models -e MODEL=/models/llama-model.gguf ghcr.io/thebigeye/guanaco-py:latest

Docker on Termux (requires root) is currently the only known way to run this on phones; see the termux support issue.

Low-level API

API Reference

The low-level API is a direct ctypes binding to the C API provided by llama.cpp. The entire low-level API can be found in llama_cpp/llama_cpp.py and directly mirrors the C API in llama.h.

Below is a short example demonstrating how to use the low-level API to tokenize a prompt:

import llama_cpp
import ctypes
llama_cpp.llama_backend_init(False) # Must be called once at the start of each program
params = llama_cpp.llama_context_default_params()
# use bytes for char * params
model = llama_cpp.llama_load_model_from_file(b"./models/7b/llama-model.gguf", params)
ctx = llama_cpp.llama_new_context_with_model(model, params)
max_tokens = params.n_ctx
# use ctypes arrays for array params
tokens = (llama_cpp.llama_token * int(max_tokens))()
n_tokens = llama_cpp.llama_tokenize(ctx, b"Q: Name the planets in the solar system? A: ", tokens, max_tokens, llama_cpp.c_bool(True))
llama_cpp.llama_free(ctx)

Check out the examples folder for more examples of using the low-level API.
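
The two conventions noted in the comments above (bytes for char * parameters, ctypes arrays for array parameters) are plain ctypes idioms and can be tried without loading a model. A sketch assuming token ids are 32-bit signed integers; the write loop stands in for a C function filling the buffer:

```python
import ctypes

# Assumption: token ids are 32-bit signed integers.
token_type = ctypes.c_int32
max_tokens = 8

# Multiplying an element type by a length gives an array type;
# calling it allocates a zero-initialised C array.
tokens = (token_type * max_tokens)()

# char * parameters take bytes, so str values are encoded first.
prompt = "Q: Name the planets in the solar system? A: ".encode("utf-8")

# Stand-in for a C function that writes token ids and returns the count.
for i, token_id in enumerate((1, 2, 3)):
    tokens[i] = token_id
n_tokens = 3

# Slice by the returned count to get a Python list of the valid entries.
token_ids = list(tokens[:n_tokens])
print(token_ids)
```

Slicing by the returned count matters because the rest of the buffer still holds zeros, which would otherwise be mistaken for token ids.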

Documentation

Documentation is available via https://llama-cpp-python.readthedocs.io/. If you find any issues with the documentation, please open an issue or submit a PR.

License

This project is licensed under the terms of the MIT license.
