---
layout: default
title: llama.cpp Tutorial
nav_order: 86
has_children: true
---
Run large language models efficiently on your local machine with pure C/C++.
[llama.cpp](https://github.com/ggerganov/llama.cpp) is a pure C/C++ implementation for running LLMs locally. It supports a wide range of models and hardware, from MacBooks to servers, with impressive performance through quantization and optimization.
| Feature | Description |
|---|---|
| Pure C/C++ | No Python dependencies, fast startup |
| Quantization | 2-8 bit quantization for memory efficiency |
| Apple Silicon | Native Metal support for M1/M2/M3 |
| CUDA Support | GPU acceleration on NVIDIA cards |
| CPU Optimized | AVX, AVX2, AVX-512 acceleration |
| Model Support | LLaMA, Mistral, Phi, Qwen, and more |
```mermaid
flowchart TD
    A[GGUF Model File] --> B[llama.cpp]
    B --> C{Hardware}
    C --> D[CPU with AVX]
    C --> E[Apple Metal]
    C --> F[NVIDIA CUDA]
    C --> G[AMD ROCm]
    D --> H[Inference]
    E --> H
    F --> H
    G --> H
    H --> I[Text Generation]

    classDef model fill:#e1f5fe,stroke:#01579b
    classDef engine fill:#f3e5f5,stroke:#4a148c
    classDef hw fill:#fff3e0,stroke:#ef6c00
    classDef output fill:#e8f5e8,stroke:#1b5e20

    class A model
    class B engine
    class C,D,E,F,G hw
    class H,I output
```
- Repository: ggerganov/llama.cpp
- Stars: about 97.3k
- Latest release: b8247 (published 2026-03-09)
- Chapter 1: Getting Started - Building llama.cpp and running your first model
- Chapter 2: Model Formats - Understanding GGUF and quantization
- Chapter 3: CLI Usage - Command-line interface and options
- Chapter 4: Server Mode - Running an OpenAI-compatible API server
- Chapter 5: GPU Acceleration - Metal, CUDA, and ROCm setup
- Chapter 6: Quantization - Converting and quantizing models
- Chapter 7: Advanced Features - Grammar, embedding, and multimodal
- Chapter 8: Integration - Python bindings and production use
- Build llama.cpp for your platform
- Run Models Locally without cloud dependencies
- Quantize Models for memory efficiency
- Use GPU Acceleration for faster inference
- Serve APIs with OpenAI compatibility
- Integrate with Apps via bindings and APIs
- Optimize Performance for your hardware
- C/C++ compiler (gcc, clang, or MSVC)
- CMake 3.14+
- Git
- (Optional) CUDA toolkit for NVIDIA GPUs
- (Optional) Xcode for Apple Metal
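Before building, it can save time to confirm the toolchain is actually installed; a quick sketch (on Windows with MSVC, use the Developer Command Prompt instead):

```shell
# Verify the toolchain before building
cc --version || gcc --version || clang --version
cmake --version   # llama.cpp needs CMake 3.14 or newer
git --version
```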
```bash
# Clone the repository
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

# Build with CMake
cmake -B build
cmake --build build --config Release

# Or use make on Linux/macOS
make -j$(nproc)
```

```bash
# Download a GGUF model (example: Llama 3.1 8B) from Hugging Face
wget https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF/resolve/main/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf
```

```bash
# Basic text generation
./build/bin/llama-cli \
  -m Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
  -p "Explain quantum computing in simple terms:" \
  -n 256

# Interactive chat mode
./build/bin/llama-cli \
  -m Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
  --interactive \
  --color
```
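Beyond the basics, a few other common `llama-cli` flags control context size, threading, and sampling; a sketch with illustrative values:

```shell
# -c: context window size (tokens); -t: CPU threads;
# --temp: sampling temperature; -s: RNG seed for reproducible runs
./build/bin/llama-cli \
  -m Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
  -c 4096 -t 8 --temp 0.7 -s 42 \
  -p "Write a haiku about compilers:"
```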
```bash
# Start OpenAI-compatible server
./build/bin/llama-server \
  -m Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
  --host 0.0.0.0 \
  --port 8080

# Use with curl
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```
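The same endpoint can be scripted from Python with only the standard library. A minimal sketch, assuming the server above is running on localhost:8080 (the `build_payload` and `chat` helpers are illustrative, not part of llama.cpp):

```python
import json
import urllib.request

def build_payload(prompt: str) -> dict:
    """OpenAI-style chat request body; "llama3" is just a label,
    the server answers with whatever model it was started with."""
    return {
        "model": "llama3",
        "messages": [{"role": "user", "content": prompt}],
    }

def chat(prompt: str, base_url: str = "http://localhost:8080") -> str:
    """POST one user message to the OpenAI-compatible endpoint."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    # llama-server mirrors the OpenAI schema: choices[0].message.content
    return body["choices"][0]["message"]["content"]

if __name__ == "__main__":
    print(chat("Hello!"))
```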
```bash
# Metal is enabled by default on macOS
cmake -B build -DLLAMA_METAL=ON
cmake --build build --config Release
```

```bash
# Build with CUDA for NVIDIA GPUs
cmake -B build -DLLAMA_CUDA=ON
cmake --build build --config Release

# Run with GPU layers offloaded
./build/bin/llama-cli -m model.gguf -ngl 35
```

| Quantization | Bits | Size (7B) | Quality | Speed |
|---|---|---|---|---|
| F16 | 16 | 14GB | Best | Slow |
| Q8_0 | 8 | 7GB | Excellent | Good |
| Q5_K_M | 5 | 4.5GB | Great | Fast |
| Q4_K_M | 4 | 4GB | Good | Faster |
| Q3_K_M | 3 | 3GB | Decent | Fastest |
| Q2_K | 2 | 2.5GB | Usable | Fastest |
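The sizes above follow directly from bits per weight: a model with N parameters stored at b bits occupies roughly N·b/8 bytes. A quick back-of-envelope check (`quantized_size_gb` is a hypothetical helper, and the effective bit widths for the K-quants are approximations):

```python
def quantized_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Rough GGUF size: n_params weights at bits_per_weight bits each.
    Ignores quantization scale/metadata overhead, so real files run
    slightly larger."""
    return n_params * bits_per_weight / 8 / 1e9

# 7B parameters at the formats from the table above
for name, bits in [("F16", 16), ("Q8_0", 8), ("Q4_K_M", 4.5), ("Q2_K", 2.6)]:
    print(f"{name}: ~{quantized_size_gb(7e9, bits):.1f} GB")
```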
| Model Family | Architectures |
|---|---|
| Meta | LLaMA 2, LLaMA 3, LLaMA 3.1 |
| Mistral | Mistral 7B, Mixtral 8x7B |
| Microsoft | Phi-2, Phi-3 |
| Alibaba | Qwen, Qwen2 |
| Google | Gemma, Gemma 2 |
| Stability | StableLM |
```python
from llama_cpp import Llama

# Load model
llm = Llama(
    model_path="Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf",
    n_ctx=4096,
    n_gpu_layers=35  # Offload to GPU
)

# Generate text
output = llm(
    "Explain machine learning:",
    max_tokens=256,
    temperature=0.7,
    stop=["\n\n"]
)
print(output["choices"][0]["text"])
```

- Chapters 1-3: Build, model basics, and CLI
- Run models on your local machine
- Chapters 4-6: Server, GPU, and quantization
- Optimize for your hardware
- Chapters 7-8: Advanced features and integration
- Build production applications
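A note on the `n_gpu_layers` value used in the Python example above: each offloaded layer must fit in VRAM, so the right number depends on your GPU. A rough sizing sketch (`max_gpu_layers` is a hypothetical helper; the equal-layer-size model is a simplification that ignores the KV cache growing with context):

```python
def max_gpu_layers(vram_gb: float, n_layers: int, model_gb: float,
                   reserve_gb: float = 1.0) -> int:
    """Estimate how many transformer layers fit in VRAM, assuming
    layers are roughly equal in size and reserving headroom for the
    KV cache and scratch buffers."""
    layer_gb = model_gb / n_layers
    usable = max(vram_gb - reserve_gb, 0.0)
    return min(n_layers, int(usable / layer_gb))

# Llama 3.1 8B at Q4_K_M is ~4.9 GB over 32 layers
print(max_gpu_layers(8.0, 32, 4.9))   # plenty of room: offload everything
print(max_gpu_layers(4.0, 32, 4.9))   # partial offload on a small GPU
```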
Ready to run LLMs locally? Let's begin with Chapter 1: Getting Started!
Generated for Awesome Code Docs
- Start Here: Chapter 1: Getting Started with llama.cpp
- Back to Main Catalog
- Browse A-Z Tutorial Directory
- Search by Intent
- Explore Category Hubs
- Chapter 1: Getting Started with llama.cpp
- Chapter 2: Model Formats and GGUF
- Chapter 3: Command Line Interface
- Chapter 4: Server Mode
- Chapter 5: GPU Acceleration
- Chapter 6: Quantization
- Chapter 7: Advanced Features
- Chapter 8: Integration
Generated by AI Codebase Knowledge Builder