---
layout: default
title: llama.cpp Tutorial
nav_order: 86
has_children: true
---

# llama.cpp Tutorial: Local LLM Inference

Run large language models efficiently on your local machine with pure C/C++.

🦙 Fast, Portable LLM Inference

[GitHub](https://github.com/ggerganov/llama.cpp)


## 🎯 What is llama.cpp?

[llama.cpp](https://github.com/ggerganov/llama.cpp) is a pure C/C++ implementation for running LLMs locally. It supports a wide range of models and hardware, from MacBooks to servers, and achieves strong performance through quantization and hardware-specific optimizations.

### Key Features

| Feature | Description |
|---------|-------------|
| Pure C/C++ | No Python dependencies, fast startup |
| Quantization | 2-8 bit quantization for memory efficiency |
| Apple Silicon | Native Metal support for M1/M2/M3 |
| CUDA Support | GPU acceleration on NVIDIA cards |
| CPU Optimized | AVX, AVX2, AVX-512 acceleration |
| Model Support | LLaMA, Mistral, Phi, Qwen, and more |
```mermaid
flowchart TD
    A[GGUF Model File] --> B[llama.cpp]

    B --> C{Hardware}
    C --> D[CPU with AVX]
    C --> E[Apple Metal]
    C --> F[NVIDIA CUDA]
    C --> G[AMD ROCm]

    D --> H[Inference]
    E --> H
    F --> H
    G --> H

    H --> I[Text Generation]

    classDef model fill:#e1f5fe,stroke:#01579b
    classDef engine fill:#f3e5f5,stroke:#4a148c
    classDef hw fill:#fff3e0,stroke:#ef6c00
    classDef output fill:#e8f5e8,stroke:#1b5e20

    class A model
    class B engine
    class C,D,E,F,G hw
    class H,I output
```


## Tutorial Chapters

1. **Chapter 1: Getting Started** - Building llama.cpp and running your first model
2. **Chapter 2: Model Formats** - Understanding GGUF and quantization
3. **Chapter 3: CLI Usage** - Command-line interface and options
4. **Chapter 4: Server Mode** - Running an OpenAI-compatible API server
5. **Chapter 5: GPU Acceleration** - Metal, CUDA, and ROCm setup
6. **Chapter 6: Quantization** - Converting and quantizing models
7. **Chapter 7: Advanced Features** - Grammar, embedding, and multimodal
8. **Chapter 8: Integration** - Python bindings and production use

## What You'll Learn

- **Build llama.cpp** for your platform
- **Run models locally** without cloud dependencies
- **Quantize models** for memory efficiency
- **Use GPU acceleration** for faster inference
- **Serve APIs** with OpenAI compatibility
- **Integrate with apps** via bindings and APIs
- **Optimize performance** for your hardware

## Prerequisites

- C/C++ compiler (gcc, clang, or MSVC)
- CMake 3.14+
- Git
- (Optional) CUDA toolkit for NVIDIA GPUs
- (Optional) Xcode for Apple Metal

## Quick Start

### Build from Source

```bash
# Clone the repository
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

# Build with CMake
cmake -B build
cmake --build build --config Release

# Or use make (on macOS, replace $(nproc) with $(sysctl -n hw.ncpu))
make -j$(nproc)
```

### Download a Model

```bash
# Download a GGUF model from Hugging Face (example: Llama 3.1 8B Instruct, Q4_K_M)
wget https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF/resolve/main/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf
```
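
If you prefer the Hugging Face CLI over a raw `wget`, the same file can be fetched with `huggingface-cli` — a sketch assuming the `huggingface_hub` package is installed; the repo and file names match the example above:

```bash
# Assumes: pip install -U huggingface_hub
huggingface-cli download bartowski/Meta-Llama-3.1-8B-Instruct-GGUF \
    Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
    --local-dir .
```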

### Run Inference

```bash
# Basic text generation
./build/bin/llama-cli \
    -m Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
    -p "Explain quantum computing in simple terms:" \
    -n 256

# Interactive chat mode
./build/bin/llama-cli \
    -m Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
    --interactive \
    --color
```
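
Beyond `-m`, `-p`, and `-n`, a few flags come up constantly. The values below are illustrative starting points to experiment with, not tuned recommendations:

```bash
#   -n      max tokens to generate
#   -c      context window size (tokens)
#   -t      number of CPU threads
#   --temp  sampling temperature
./build/bin/llama-cli \
    -m Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
    -p "Write a haiku about compilers:" \
    -n 128 -c 4096 -t 8 --temp 0.7
```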

### Server Mode

```bash
# Start an OpenAI-compatible server
./build/bin/llama-server \
    -m Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
    --host 0.0.0.0 \
    --port 8080

# Use with curl
curl http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "llama3",
        "messages": [{"role": "user", "content": "Hello!"}]
    }'
```
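
Because the server speaks the OpenAI chat-completions protocol, existing OpenAI client libraries can talk to it by overriding the base URL. A minimal sketch using the official `openai` Python package (assumes `pip install openai`; the `api_key` value is a placeholder, since llama-server does not require one by default):

```python
from openai import OpenAI

# Point the client at the local llama-server instead of api.openai.com
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="llama3",  # largely informational for a single-model server
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```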

## GPU Acceleration

### Apple Metal (M1/M2/M3)

```bash
# Metal is enabled by default on macOS
cmake -B build -DLLAMA_METAL=ON
cmake --build build --config Release
```

### NVIDIA CUDA

```bash
cmake -B build -DLLAMA_CUDA=ON
cmake --build build --config Release

# Offload 35 model layers to the GPU (-ngl / --n-gpu-layers)
./build/bin/llama-cli -m model.gguf -ngl 35
```

## Quantization Levels

| Quantization | Bits | Size (7B) | Quality | Speed |
|--------------|------|-----------|---------|-------|
| F16 | 16 | 14 GB | Best | Slow |
| Q8_0 | 8 | 7 GB | Excellent | Good |
| Q5_K_M | 5 | 4.5 GB | Great | Fast |
| Q4_K_M | 4 | 4 GB | Good | Faster |
| Q3_K_M | 3 | 3 GB | Decent | Fastest |
| Q2_K | 2 | 2.5 GB | Usable | Fastest |
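
To produce these variants yourself, llama.cpp ships a quantization tool alongside the other binaries. A sketch with hypothetical file names, assuming you already have an F16 GGUF (e.g. produced by the conversion script in the repo); in older builds the binary may be named `quantize` instead of `llama-quantize`:

```bash
# Convert an F16 GGUF to Q4_K_M (file names are placeholders)
./build/bin/llama-quantize \
    Meta-Llama-3.1-8B-Instruct-F16.gguf \
    Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
    Q4_K_M
```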

## Supported Models

| Model Family | Architectures |
|--------------|---------------|
| Meta | LLaMA 2, LLaMA 3, LLaMA 3.1 |
| Mistral | Mistral 7B, Mixtral 8x7B |
| Microsoft | Phi-2, Phi-3 |
| Alibaba | Qwen, Qwen2 |
| Google | Gemma, Gemma 2 |
| Stability | StableLM |

## Python Bindings

```python
# Requires the llama-cpp-python bindings: pip install llama-cpp-python
from llama_cpp import Llama

# Load model
llm = Llama(
    model_path="Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf",
    n_ctx=4096,
    n_gpu_layers=35,  # Offload 35 layers to the GPU
)

# Generate text
output = llm(
    "Explain machine learning:",
    max_tokens=256,
    temperature=0.7,
    stop=["\n\n"],  # Stop at the first blank line
)

print(output["choices"][0]["text"])
```
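
The bindings also expose a chat-style API that mirrors the OpenAI message format. A short sketch reusing the `llm` object loaded above:

```python
# Chat-style completion with the same loaded model
messages = [
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "Explain machine learning in one sentence."},
]
response = llm.create_chat_completion(messages=messages, max_tokens=128)
print(response["choices"][0]["message"]["content"])
```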

## Learning Path

### 🟢 Beginner Track

1. Chapters 1-3: Build, model basics, and CLI
2. Run models on your local machine

### 🟡 Intermediate Track

1. Chapters 4-6: Server, GPU, and quantization
2. Optimize for your hardware

### 🔴 Advanced Track

1. Chapters 7-8: Advanced features and integration
2. Build production applications

Ready to run LLMs locally? Let's begin with Chapter 1: Getting Started!

Generated for Awesome Code Docs


Generated by AI Codebase Knowledge Builder