| layout | title | nav_order | has_children | parent |
|---|---|---|---|---|
| default | llama.cpp Tutorial - Chapter 6: Quantization | 6 | false | llama.cpp Tutorial |
Welcome to Chapter 6: Quantization. In this part of llama.cpp Tutorial: Local LLM Inference, you will build an intuitive mental model first, then move into concrete implementation details and practical production tradeoffs.
Convert and quantize models to reduce memory usage while maintaining quality.
Quantization reduces model precision to save memory and improve speed. This chapter covers converting PyTorch models to GGUF and applying different quantization schemes.
- Memory Reduction: 50-87% less RAM/VRAM required
- Speed Improvement: Faster inference due to smaller data types
- Broader Compatibility: Run larger models on limited hardware
- Cost Efficiency: Lower hardware requirements
| Type | Bits/Weight | Memory (vs F32) | Quality | Speed |
|---|---|---|---|---|
| F32 | 32 | 100% | Excellent | Slowest |
| F16 | 16 | 50% | Very Good | Slow |
| Q8_0 | ~8 | ~25% | Very Good | Good |
| Q6_K | ~6 | ~19% | Good | Fast |
| Q5_K | ~5 | ~16% | Good | Fast |
| Q4_K | ~4 | ~12.5% | Good | Faster |
| Q3_K | ~3 | ~9% | Acceptable | Very Fast |
| Q2_K | ~2 | ~6% | Basic | Fastest |
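To turn these percentages into concrete numbers, you can estimate a quantized file's size from the parameter count and the effective bits per weight. The sketch below is a rough back-of-the-envelope calculation, not a llama.cpp API: real GGUF files come out slightly different because K-quants mix precisions per tensor and the file also stores metadata, and the bits-per-weight figures used here are approximations.

```python
def estimate_model_size_gb(n_params_billion, bits_per_weight):
    """Rough GGUF size estimate: parameter count x bits per weight, in GB."""
    total_bytes = n_params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9

# Example: a 7B-parameter model at a few (approximate) effective bit widths
for name, bpw in [("F16", 16.0), ("Q8_0", 8.5), ("Q4_K", 4.8), ("Q2_K", 2.6)]:
    print(f"{name:>5}: ~{estimate_model_size_gb(7, bpw):.1f} GB")
```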
# Install conversion dependencies
pip install torch transformers sentencepiece protobuf
# Clone llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
# Convert PyTorch model to GGUF (F16 first); newer llama.cpp releases ship this script as convert_hf_to_gguf.py
python convert.py /path/to/pytorch/model \
--outfile model.gguf \
--outtype f16
# Quantize to different formats
./llama-quantize model.gguf model-Q4_K.gguf Q4_K
./llama-quantize model.gguf model-Q5_K.gguf Q5_K
./llama-quantize model.gguf model-Q8_0.gguf Q8_0

llama.cpp supports conversion from:
- Llama: Llama 2, Llama 3, Code Llama, Llama Guard
- Mistral: Mistral 7B, Mixtral 8x7B, Mistral Large
- Phi: Phi-2, Phi-3
- Qwen: Qwen 1.5, Qwen 2
- Gemma: Google's Gemma models
- Other: Falcon, GPT-2, GPT-J, many more
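If you are unsure whether a checkpoint falls into one of these families, you can inspect its configuration before attempting conversion. The snippet below is a minimal sketch; the local directory name `llama-2-7b-chat` is only an example.

```python
from transformers import AutoConfig

# Read the architecture declared in the checkpoint's config.json
# (example path; point this at your local model directory)
config = AutoConfig.from_pretrained("llama-2-7b-chat")
print("Architectures:", config.architectures)      # e.g. ['LlamaForCausalLM']
print("Hidden layers:", config.num_hidden_layers)
print("Vocab size:   ", config.vocab_size)
```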
#!/bin/bash
# convert_and_quantize.sh
model_dir="$1"
output_dir="${2:-quantized_models}"
mkdir -p "$output_dir"
# Convert to GGUF (F16)
echo "Converting to GGUF (F16)..."
python convert.py "$model_dir" \
--outfile "$output_dir/model.gguf" \
--outtype f16
# Generate multiple quantizations
quantizations=("Q2_K" "Q3_K" "Q4_K" "Q5_K" "Q6_K" "Q8_0")
for quant in "${quantizations[@]}"; do
echo "Quantizing to $quant..."
./llama-quantize "$output_dir/model.gguf" \
"$output_dir/model-$quant.gguf" \
"$quant"
done
echo "Conversion complete!"
ls -lh "$output_dir/"# High quality quantization (best for 13B+ models)
./llama-quantize model.gguf model-Q5_K_M.gguf Q5_K_M
# Balanced quality/speed (recommended for most cases)
./llama-quantize model.gguf model-Q4_K_M.gguf Q4_K_M
# Size optimized (noticeable quality loss)
./llama-quantize model.gguf model-Q3_K_L.gguf Q3_K_L
# Ultra compressed (significant quality loss)
./llama-quantize model.gguf model-Q2_K.gguf Q2_K

# Verifying a quantized model from Python (llama-cpp-python bindings)
# Note: quantization itself is done with the llama-quantize tool above;
# the bindings are used here to load the result and spot-check its output.
from llama_cpp import Llama

model = Llama("model-Q4_K.gguf", verbose=False, seed=42)
output = model("Explain recursion in one sentence.", max_tokens=64, temperature=0.1)
print(output["choices"][0]["text"])

#!/bin/bash
# compare_quantizations.sh
model_base="llama-2-7b-chat"
quantizations=("Q2_K" "Q3_K" "Q4_K" "Q5_K" "Q8_0")
test_prompt="Explain the concept of recursion in programming."
echo "Quantization | File Size | Tokens/sec | Quality Score"
echo "-------------|-----------|------------|--------------"
for quant in "${quantizations[@]}"; do
model_file="${model_base}-${quant}.gguf"
if [ -f "$model_file" ]; then
# Get file size
size=$(ls -lh "$model_file" | awk '{print $5}')
# Quick benchmark
tokens_sec=$(./llama-bench -m "$model_file" -p 5 -n 64 -t 4 2>/dev/null | \
grep "tokens/sec" | head -1 | grep -o "[0-9.]*")
# Generate sample output for quality assessment
output=$(./llama-cli -m "$model_file" \
--prompt "$test_prompt" \
--n-predict 100 \
--temp 0.1 \
--seed 42 \
--simple-io 2>/dev/null)
# Simple quality score (word count as proxy)
quality=$(echo "$output" | wc -w)
echo "$quant | $size | ${tokens_sec:-N/A} | $quality"
fi
done

# Calculate perplexity (measure of model quality)
./llama-perplexity -m model-Q4_K.gguf \
--ctx-size 4096 \
--batch-size 512 \
--ubatch-size 512 \
--chunks 100 \
-f wiki.test.raw
# Compare perplexity across quantizations
# Lower perplexity = better quality
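A convenient way to interpret these runs is to express each quantization's perplexity as relative degradation from the F16 baseline. The sketch below assumes you have already collected the numbers with llama-perplexity; the values shown are placeholders, not measurements.

```python
# Perplexity per quantization (placeholders; substitute your own results)
perplexities = {
    "F16":  5.80,
    "Q8_0": 5.81,
    "Q5_K": 5.85,
    "Q4_K": 5.92,
    "Q3_K": 6.15,
    "Q2_K": 7.10,
}

baseline = perplexities["F16"]
for quant, ppl in perplexities.items():
    degradation = (ppl - baseline) / baseline * 100
    print(f"{quant:>5}: PPL {ppl:.2f}  (+{degradation:.1f}% vs F16)")
```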
# Merge base model with LoRA adapter
./llama-export-lora \
--model-base model.gguf \
--model-out model-merged.gguf \
--lora-scaled model-lora.gguf 1.0
# Quantize merged model
./llama-quantize model-merged.gguf model-merged-Q4_K.gguf Q4_K

# Quantize with CPU instruction set in mind
./llama-quantize model.gguf model-CPU.gguf Q4_K
# Test with different CPU features
./llama-cli -m model-CPU.gguf --prompt "Hello" --threads $(nproc)

# GPU-friendly quantization
./llama-quantize model.gguf model-GPU.gguf Q4_K
# Test GPU performance
./llama-cli -m model-GPU.gguf --gpu-layers 35 --prompt "Hello"
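Choosing a value for --gpu-layers mostly comes down to dividing the VRAM you can spare by the per-layer weight size. The helper below is a rough, assumption-heavy sketch (it ignores the KV cache and scratch buffers); the file size, layer count, and reserve figure are illustrative rather than measured.

```python
def estimate_gpu_layers(model_size_gb, n_layers, vram_gb, reserve_gb=1.5):
    """Rough estimate of how many transformer layers fit in VRAM.

    Ignores KV cache and scratch buffers, so treat the result as a starting
    point and reduce it if you hit out-of-memory errors.
    """
    per_layer_gb = model_size_gb / n_layers
    usable_vram_gb = max(vram_gb - reserve_gb, 0)
    return min(n_layers, int(usable_vram_gb / per_layer_gb))

# Example: a ~4.1 GB Q4_K 7B model with 32 layers on an 8 GB GPU
print(estimate_gpu_layers(model_size_gb=4.1, n_layers=32, vram_gb=8))
```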
# Download from HuggingFace (requires token)
huggingface-cli download meta-llama/Llama-2-7b-chat-hf \
--local-dir llama-2-7b-chat
# Convert to GGUF
python convert.py llama-2-7b-chat \
--outfile llama-2-7b-chat.gguf \
--outtype f16
# Quantize
./llama-quantize llama-2-7b-chat.gguf llama-2-7b-chat-Q4_K.gguf Q4_K

# Mistral 7B
python convert.py mistral-7b-instruct \
--outfile mistral-7b.gguf \
--outtype f16
./llama-quantize mistral-7b.gguf mistral-7b-Q4_K.gguf Q4_K

# CodeLlama conversion
python convert.py CodeLlama-7b-Instruct-hf \
--outfile codellama-7b.gguf \
--outtype f16
./llama-quantize codellama-7b.gguf codellama-7b-Q4_K.gguf Q4_K

- Start with F16: Always convert to F16 first, then quantize
- Test Quality: Compare outputs across quantization levels (see the sketch after this list)
- Use Q4_K: Best balance of quality/speed for most use cases
- Avoid Q2_K: Significant quality loss for general use
- Fine-tune Threshold: Test your specific use case at each level
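One way to act on the "Test Quality" advice above is to run the same deterministic prompt against each quantized file and compare the outputs side by side. This is a minimal sketch using the llama-cpp-python bindings; the file names are examples, and the fixed seed plus low temperature keeps the comparison as repeatable as possible.

```python
from llama_cpp import Llama

# Example file names; point these at your own quantized outputs
quant_files = {
    "Q4_K": "model-Q4_K.gguf",
    "Q5_K": "model-Q5_K.gguf",
    "Q8_0": "model-Q8_0.gguf",
}

prompt = "Explain the concept of recursion in programming."

for name, path in quant_files.items():
    llm = Llama(path, verbose=False, seed=42)            # fixed seed for comparability
    out = llm(prompt, max_tokens=100, temperature=0.1)   # low temperature = near-deterministic
    print(f"=== {name} ===")
    print(out["choices"][0]["text"].strip())
    print()
```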
def optimize_quantization_for_hardware(ram_gb, use_case):
"""Recommend quantization based on hardware and use case."""
recommendations = {
"chat": {
4: "Q2_K", # Very limited RAM
8: "Q3_K", # Basic chat
16: "Q4_K", # Good chat quality
24: "Q5_K", # High quality chat
32: "Q6_K" # Maximum quality
},
"code": {
4: "Q3_K", # Basic code completion
8: "Q4_K", # Good code quality
16: "Q5_K", # Excellent code quality
24: "Q6_K", # Maximum code quality
32: "Q8_0"
},
"creative": {
4: "Q3_K", # Basic creative writing
8: "Q4_K", # Good creative quality
16: "Q5_K", # High creative quality
24: "Q6_K", # Excellent creative quality
32: "Q8_0"
}
}
# Find closest RAM match
ram_options = sorted(recommendations[use_case].keys())
closest_ram = min(ram_options, key=lambda x: abs(x - ram_gb))
return recommendations[use_case][closest_ram]
# Usage
ram_gb = 16
use_case = "chat" # "chat", "code", or "creative"
recommended_quant = optimize_quantization_for_hardware(ram_gb, use_case)
print(f"Recommended quantization: {recommended_quant}")# Fast inference quantization
./llama-quantize model.gguf model-fast.gguf Q4_K_S # Small Q4_K
# Balanced quantization
./llama-quantize model.gguf model-balanced.gguf Q4_K_M # Medium Q4_K
# Quality-focused quantization
./llama-quantize model.gguf model-quality.gguf Q5_K_M # Medium Q5_K

Conversion fails:
# Check model compatibility
python convert.py --help
# Try different output types
python convert.py model --outfile model.gguf --outtype f32
# Check model architecture
python -c "from transformers import AutoConfig; print(AutoConfig.from_pretrained('model'))"Quantization errors:
# Ensure input is F16 GGUF
./llama-cli -m model.gguf --verbose-prompt # Check format
# Try different quantization methods
./llama-quantize model.gguf output.gguf Q4_0 # Alternative Q4

Quality loss too high:
# Use higher quantization
./llama-quantize model.gguf model-Q5_K.gguf Q5_K
# Test different K-type variants (Q4_K is offered in S and M variants)
./llama-quantize model.gguf model-Q4_K_S.gguf Q4_K_S # Small
./llama-quantize model.gguf model-Q4_K_M.gguf Q4_K_M # Medium

# Create importance matrix for better quantization
./llama-imatrix \
-m model.gguf \
-f calibration-dataset.txt \
-o imatrix.dat
# Use imatrix for quantization
./llama-quantize \
--imatrix imatrix.dat \
model.gguf \
model-imatrix-Q4_K.gguf \
Q4_K

# Prepare calibration data
# The calibration file should be plain text that is representative of your
# target workload (chat transcripts, code, prose); general-purpose text such
# as a slice of wikitext also works well.
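If you do not already have a calibration file, a small script can assemble one from representative documents you have on hand. This is a minimal sketch under the assumption that your samples live in a local calibration-samples/ directory of .txt files; both that directory name and the output file name are illustrative.

```python
from pathlib import Path

# Concatenate representative text samples into one calibration file
# (directory and file names are examples; adjust to your own data)
sample_dir = Path("calibration-samples")
output_file = Path("calibration-texts.txt")

with output_file.open("w", encoding="utf-8") as out:
    for sample in sorted(sample_dir.glob("*.txt")):
        text = sample.read_text(encoding="utf-8").strip()
        if text:
            out.write(text + "\n\n")

print(f"Wrote {output_file} ({output_file.stat().st_size / 1024:.1f} KiB)")
```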
#!/bin/bash
# full_quantization_pipeline.sh
set -e
model_dir="$1"
output_dir="${2:-quantized}"
calibration_file="$3"
mkdir -p "$output_dir"
echo "=== Converting to GGUF ==="
python convert.py "$model_dir" \
--outfile "$output_dir/model.gguf" \
--outtype f16
echo "=== Generating Importance Matrix ==="
if [ -f "$calibration_file" ]; then
./llama-imatrix \
-m "$output_dir/model.gguf" \
-f "$calibration_file" \
-o "$output_dir/imatrix.dat"
fi
echo "=== Creating Quantizations ==="
quantizations=("Q2_K" "Q3_K" "Q4_K" "Q5_K" "Q6_K" "Q8_0")
for quant in "${quantizations[@]}"; do
echo "Quantizing to $quant..."
if [ -f "$output_dir/imatrix.dat" ]; then
./llama-quantize \
--imatrix "$output_dir/imatrix.dat" \
"$output_dir/model.gguf" \
"$output_dir/model-$quant.gguf" \
"$quant"
else
./llama-quantize \
"$output_dir/model.gguf" \
"$output_dir/model-$quant.gguf" \
"$quant"
fi
done
echo "=== Quantization Complete ==="
ls -lh "$output_dir/"Quantization is essential for practical LLM deployment. It enables running large models on consumer hardware while maintaining good quality. Always test your specific use case at different quantization levels to find the best balance.
Most teams struggle here because the hard part is not writing more code, but deciding clear boundaries between the source model, the GGUF conversion step, and the llama.cpp tooling so behavior stays predictable as complexity grows.
In practical terms, this chapter helps you avoid three common failures:
- coupling core logic too tightly to one implementation path
- missing the handoff boundaries between setup, execution, and validation
- shipping changes without clear rollback or observability strategy
After working through this chapter, you should be able to reason about Chapter 6: Quantization as an operating subsystem inside llama.cpp Tutorial: Local LLM Inference, with explicit contracts for inputs, state transitions, and outputs.
Use the implementation notes around llama-quantize, the Q4_K family, and quality evaluation as your checklist when adapting these patterns to your own repository.
Under the hood, Chapter 6: Quantization usually follows a repeatable control path:
- Context bootstrap: initialize runtime config and prerequisites for the source model.
- Input normalization: shape incoming data so the GGUF conversion receives stable contracts.
- Core execution: run the main logic branch and propagate intermediate state through llama.cpp.
- Policy and safety checks: enforce limits, auth scopes, and failure boundaries.
- Output composition: return canonical result payloads for downstream consumers.
- Operational telemetry: emit logs/metrics needed for debugging and performance tuning.
When debugging, walk this sequence in order and confirm each stage has explicit success/failure conditions.
Use the following upstream sources to verify implementation details while reading this chapter:
- llama.cpp repository (https://github.com/ggerganov/llama.cpp): authoritative reference for the conversion and quantization tools covered in this chapter.
- Awesome Code Docs (github.com): the documentation collection this tutorial belongs to.
Suggested trace strategy:
- search upstream code for model loading and GGUF handling to map concrete implementation paths
- compare docs claims against actual runtime/config code before reusing patterns in production