ggml: add Q2_0 2-bit quantization support (CPU) #40
Pull request overview
Adds end-to-end support for a new 2-bit quantization format Q2_0 (block size 64) across llama.cpp’s quantization pipeline, GGML core type system, CPU backend kernels, and GGUF Python tooling—intended for testing via a draft PR.
Changes:
- Introduces `GGML_TYPE_Q2_0` / `LLAMA_FTYPE_MOSTLY_Q2_0` and wires them through model loading and quantization type selection.
- Implements Q2_0 reference quantize/dequant routines plus CPU dot-product kernels (generic + x86/ARM optimized).
- Adds GGUF Python quant/dequant support and conversion mappings to enable exporting/loading Q2_0 models.
Reviewed changes
Copilot reviewed 19 out of 19 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| tools/quantize/quantize.cpp | Exposes Q2_0 as a quantization option in the CLI tool. |
| src/llama-quant.cpp | Integrates Q2_0 into tensor-type selection, fallback logic, and default ftype mapping. |
| src/llama-model-loader.cpp | Adds ftype name + GGML type → llama ftype mapping for Q2_0. |
| include/llama.h | Adds LLAMA_FTYPE_MOSTLY_Q2_0 enum value. |
| gguf-py/gguf/quants.py | Adds Python quantize/dequant implementation for Q2_0 blocks. |
| gguf-py/gguf/constants.py | Adds GGUF constants for Q2_0 quant/type IDs and size table entry. |
| ggml/src/ggml.c | Registers Q2_0 in GGML type traits and quantize dispatch. |
| ggml/src/ggml-quants.h | Declares Q2_0 quantize/dequant APIs. |
| ggml/src/ggml-quants.c | Implements Q2_0 reference quantize/dequant and chunk quantization support. |
| ggml/src/ggml-cpu/quants.h | Declares CPU-side Q2_0 quantize and dot-product functions. |
| ggml/src/ggml-cpu/quants.c | Adds CPU generic dot-product implementation for Q2_0×Q8_0. |
| ggml/src/ggml-cpu/ops.cpp | Treats Q2_0 as a quantized type in several CPU ops dispatch paths. |
| ggml/src/ggml-cpu/ggml-cpu.c | Adds CPU type traits entry for Q2_0. |
| ggml/src/ggml-cpu/arch/x86/quants.c | Adds x86 implementation of Q2_0×Q8_0 dot product (with AVX-512 VNNI path). |
| ggml/src/ggml-cpu/arch/arm/quants.c | Adds ARM NEON implementation of Q2_0×Q8_0 dot product (fallback to generic). |
| ggml/src/ggml-common.h | Defines Q2_0 block format (block_q2_0, QK2_0). |
| ggml/include/ggml.h | Adds GGML_TYPE_Q2_0 and GGML_FTYPE_MOSTLY_Q2_0 public enums. |
| convert_hf_to_gguf.py | Adds "q2_0" CLI mapping to GGUF file type. |
| conversion/base.py | Maps MOSTLY_Q2_0 to GGMLQuantizationType.Q2_0 during conversion. |
bri-prism left a comment
Review: Add Q2_0 quantization (type definition + CPU backend)
Complete and consistent CPU implementation — all paths (scalar reference, NEON, x86-VNNI, x86 scalar, gguf-py) agree on the same 18-byte / 64-weight layout (2.25 bpw). Two items to resolve before merge:
- Run `test-quantize-fns` and `test-backend-ops`. Registering the type enrolls it in the automated round-trip/dot-product accuracy tests, which use synthetic data. With codes `{0,1,2,3} → {-1,0,+1,+2}·d` and the reference quantizer emitting only `{0,1,2}` (effectively a ternary codebook with a per-64 fp16 scale), reconstruction error on that synthetic data can exceed the per-type tolerances and fail CI. Please confirm both pass, and adjust the per-type threshold if that's intended.
- Validate the x86 path on VNNI hardware. The AVX-512-VNNI kernel is logically correct: it computes `Σ(c−1)·qy` as `dpbusd(c, qy) − dpbusd(1, qy)`, and the bitfield extraction (`×{64,16,4,1} >> 6 & 3`) plus the `packus` + `permute4x64(…, 0xD8)` reordering are sound. However, it hasn't yet run on VNNI silicon (Sapphire Rapids / Zen 4–5). The `#else` scalar fallback is straightforwardly correct.
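For readers following the second item, here is a small scalar sketch in Python of the two tricks it mentions. It is not the PR's kernel: the function names are hypothetical, and it assumes the multiply happens in lanes wide enough that `byte * 64` does not overflow (for example 16-bit lanes) and that codes are stored low bits first.

```python
# Scalar model of the two ideas named in the review; names are hypothetical.
MULTIPLIERS = (64, 16, 4, 1)


def unpack_codes(byte):
    """Extract four 2-bit codes from one byte via multiply, shift, mask.

    (byte * m) >> 6 equals byte >> {0, 2, 4, 6} for m in {64, 16, 4, 1},
    assuming the product does not overflow the lane.
    """
    return [((byte * m) >> 6) & 3 for m in MULTIPLIERS]


def dot_direct(codes, qy):
    """Reference: sum of (c - 1) * qy with signed weights in {-1, 0, +1, +2}."""
    return sum((c - 1) * y for c, y in zip(codes, qy))


def dot_split(codes, qy):
    """Same value as unsigned-code dot product minus a dot product with ones.

    This mirrors dpbusd(c, qy) - dpbusd(1, qy): dpbusd needs an unsigned
    left operand, so the -1 offset is applied as a separate term.
    """
    return sum(c * y for c, y in zip(codes, qy)) - sum(1 * y for y in qy)
```

Checking `dot_direct` against `dot_split` over random int8 inputs is a cheap sanity test for the algebra, but it says nothing about lane ordering after `packus`, which still needs a run on real VNNI hardware.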
A one-line rationale in the description for why this is a new type vs. TQ2_0 (finer, group-64 scaling) would also pre-empt the obvious review question.
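To make the 18-byte / 64-weight layout discussed in this review concrete, the following is a minimal Python sketch of a ternary round trip. Only `QK2_0 = 64`, the fp16 scale, and the `(code − 1)·d` mapping come from the review; the scale choice (`d = max|x|`), the scale-first byte order, the low-bits-first packing, and the function names are assumptions for illustration and may differ from the PR's `block_q2_0`.

```python
import struct

QK2_0 = 64                     # weights per block (from the PR)
BLOCK_BYTES = 2 + QK2_0 // 4   # fp16 scale + 16 bytes of 2-bit codes = 18


def quantize_block(x):
    """Quantize 64 floats to codes {0, 1, 2}, i.e. {-1, 0, +1} * d."""
    assert len(x) == QK2_0
    d = max(abs(v) for v in x)  # assumption: scale is the block's max magnitude
    packed = bytearray(QK2_0 // 4)
    for i, v in enumerate(x):
        q = round(v / d) if d else 0          # nearest of -1, 0, +1
        code = max(-1, min(1, q)) + 1         # shift to unsigned 0..2
        packed[i // 4] |= code << (2 * (i % 4))  # assumption: low bits first
    return struct.pack("<e", d) + bytes(packed)


def dequantize_block(block):
    """Decode an 18-byte block back to 64 floats as (code - 1) * d."""
    (d,) = struct.unpack_from("<e", block, 0)
    out = []
    for i in range(QK2_0):
        code = (block[2 + i // 4] >> (2 * (i % 4))) & 3
        out.append((code - 1) * d)
    return out
```

The size works out to `BLOCK_BYTES * 8 / QK2_0 = 2.25` bits per weight, matching the figure quoted above. Because this quantizer never emits code 3, inputs that are not already ternary lose a lot of information, which is the reason the synthetic-data tolerance tests deserve an explicit check.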
This PR adds Q2_0 (aka symmetric int2) support for CPU. The main motivation is to support the Ternary Bonsai models (1.7B, 4B, 8B) and upcoming models. This PR is CPU only (ARM NEON + generic scalar fallback). Notes:
Speed/Correctness Summary
Models + Evals
More info on the models and working demos can be found below:
Models
Repos:

    pip install -U "huggingface_hub[cli]"
    # F16 reference + Q2_0 group-64 GGUF (swap 1.7B -> 4B / 8B for other sizes)
    hf download prism-ml/Ternary-Bonsai-1.7B-gguf Ternary-Bonsai-1.7B-F16.gguf --local-dir models
    hf download prism-ml/Ternary-Bonsai-1.7B-gguf Ternary-Bonsai-1.7B-Q2_0_g64.gguf --local-dir models

Each repo has three gguf variants:
Testing
Tested on Mac M4 Pro, 48 GB. Two CPU routes: ARM NEON and Generic Scalar Fallback.

Pack to Q2_0 (from F16 GGUF)

    ./build/bin/llama-quantize --pure models/Ternary-Bonsai-1.7B-F16.gguf models/Ternary-Bonsai-1.7B-Q2_0_g64.gguf Q2_0

Speed Benchmarks
Details

    ./build/bin/llama-bench -m <model.gguf> -t 8 -ngl 0 -p 512 -n 128

ARM NEON
Generic Scalar Fallback (1.7B, small

KL Kernel Accuracy Test
Details (Q2_0 g64 vs F16, packed vs unpacked)

    # Step 1: save F16 reference logits
    ./build/bin/llama-perplexity -m models/Ternary-Bonsai-1.7B-F16.gguf \
      -f datasets/wikitext-2-raw/wiki.test.raw -c 512 --chunks 20 \
      --save-all-logits models/f16_logits.bin
    # Step 2: KL Q2_0 vs F16
    ./build/bin/llama-perplexity -m models/Ternary-Bonsai-1.7B-Q2_0_g64.gguf \
      -f datasets/wikitext-2-raw/wiki.test.raw -c 512 --chunks 20 \
      --kl-divergence --kl-divergence-base models/f16_logits.bin

ARM NEON: Q2_0 g64 vs F16, by size
1.7B: full statistics (ARM NEON vs F16)
1.7B: Generic Scalar Fallback vs F16 (matches NEON)

Requirements

DRAFT PR for TESTING.