ggml: add Q2_0 2-bit quantization support (CPU) by khosravipasha · Pull Request #40 · PrismML-Eng/llama.cpp

khosravipasha · 2026-06-10T19:02:06Z

DRAFT PR for TESTING.

Copilot

Pull request overview

Adds end-to-end support for a new 2-bit quantization format Q2_0 (block size 64) across llama.cpp’s quantization pipeline, GGML core type system, CPU backend kernels, and GGUF Python tooling—intended for testing via a draft PR.

Changes:

Introduces GGML_TYPE_Q2_0 / LLAMA_FTYPE_MOSTLY_Q2_0 and wires them through model loading and quantization type selection.
Implements Q2_0 reference quantize/dequant routines plus CPU dot-product kernels (generic + x86/ARM optimized).
Adds GGUF Python quant/dequant support and conversion mappings to enable exporting/loading Q2_0 models.

Reviewed changes

Copilot reviewed 19 out of 19 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| tools/quantize/quantize.cpp | Exposes `Q2_0` as a quantization option in the CLI tool. |
| src/llama-quant.cpp | Integrates Q2_0 into tensor-type selection, fallback logic, and default ftype mapping. |
| src/llama-model-loader.cpp | Adds ftype name + GGML type → llama ftype mapping for Q2_0. |
| include/llama.h | Adds `LLAMA_FTYPE_MOSTLY_Q2_0` enum value. |
| gguf-py/gguf/quants.py | Adds Python quantize/dequant implementation for Q2_0 blocks. |
| gguf-py/gguf/constants.py | Adds GGUF constants for Q2_0 quant/type IDs and size table entry. |
| ggml/src/ggml.c | Registers Q2_0 in GGML type traits and quantize dispatch. |
| ggml/src/ggml-quants.h | Declares Q2_0 quantize/dequant APIs. |
| ggml/src/ggml-quants.c | Implements Q2_0 reference quantize/dequant and chunk quantization support. |
| ggml/src/ggml-cpu/quants.h | Declares CPU-side Q2_0 quantize and dot-product functions. |
| ggml/src/ggml-cpu/quants.c | Adds CPU generic dot-product implementation for Q2_0×Q8_0. |
| ggml/src/ggml-cpu/ops.cpp | Treats Q2_0 as a quantized type in several CPU ops dispatch paths. |
| ggml/src/ggml-cpu/ggml-cpu.c | Adds CPU type traits entry for Q2_0. |
| ggml/src/ggml-cpu/arch/x86/quants.c | Adds x86 implementation of Q2_0×Q8_0 dot product (with AVX-512 VNNI path). |
| ggml/src/ggml-cpu/arch/arm/quants.c | Adds ARM NEON implementation of Q2_0×Q8_0 dot product (fallback to generic). |
| ggml/src/ggml-common.h | Defines Q2_0 block format (`block_q2_0`, `QK2_0`). |
| ggml/include/ggml.h | Adds `GGML_TYPE_Q2_0` and `GGML_FTYPE_MOSTLY_Q2_0` public enums. |
| convert_hf_to_gguf.py | Adds `"q2_0"` CLI mapping to GGUF file type. |
| conversion/base.py | Maps `MOSTLY_Q2_0` to `GGMLQuantizationType.Q2_0` during conversion. |

bri-prism

Review: Add Q2_0 quantization (type definition + CPU backend)

Complete and consistent CPU implementation — all paths (scalar reference, NEON, x86-VNNI, x86 scalar, gguf-py) agree on the same 18-byte / 64-weight layout (2.25 bpw). Two items to resolve before merge:

Run test-quantize-fns and test-backend-ops. Registering the type enrolls it in the automated round-trip/dot-product accuracy tests, which use synthetic data. With codes {0,1,2,3} → {-1,0,+1,+2}·d and the reference quantizer emitting only {0,1,2} (effectively a ternary codebook with a per-64 fp16 scale), reconstruction error on that synthetic data can exceed the per-type tolerances and fail CI. Please confirm both pass — and adjust the per-type threshold if that's intended.
Validate the x86 path on VNNI hardware. The AVX-512-VNNI kernel is logically correct — it computes Σ(c−1)·qy as dpbusd(c, qy) − dpbusd(1, qy), and the bitfield extraction (×{64,16,4,1} >> 6 & 3) plus the packus + permute4x64(…, 0xD8) reordering are sound — but it hasn't yet run on VNNI silicon (Sapphire Rapids / Zen 4–5). The #else scalar fallback is straightforwardly correct.

A one-line rationale in the description for why this is a new type vs. TQ2_0 (finer, group-64 scaling) would also pre-empt the obvious review question.
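
The two algebraic claims in the x86 item can be checked without VNNI hardware. The sketch below is a plain-Python model, not the PR's kernel: `dpbusd` here is a simplified stand-in that sums all unsigned-by-signed byte products at once (the real instruction accumulates four products per 32-bit lane), and `extract_fields` assumes the multiply happens in 16-bit lanes. All function names are hypothetical.

```python
import random


def dpbusd(u8, s8):
    """Simplified stand-in for dpbusd: sum of unsigned-byte * signed-byte products.

    The real instruction accumulates groups of four products per 32-bit lane;
    summing everything at once is enough to check the algebra.
    """
    assert all(0 <= u <= 255 for u in u8)
    assert all(-128 <= s <= 127 for s in s8)
    return sum(u * s for u, s in zip(u8, s8))


def dot_direct(codes, qy):
    # Reference: sum of (c - 1) * qy, with codes c in {0, 1, 2, 3}
    return sum((c - 1) * y for c, y in zip(codes, qy))


def dot_vnni_style(codes, qy):
    # Same sum, split so that the first operand of each product stays unsigned
    return dpbusd(codes, qy) - dpbusd([1] * len(qy), qy)


def extract_fields(byte):
    # Assumption: 16-bit lanes, so byte * m cannot overflow before the shift
    return [(((byte * m) & 0xFFFF) >> 6) & 3 for m in (64, 16, 4, 1)]


rng = random.Random(0)
codes = [rng.randrange(4) for _ in range(64)]
qy = [rng.randrange(-128, 128) for _ in range(64)]
print(dot_direct(codes, qy) == dot_vnni_style(codes, qy))
print(all(extract_fields(b) == [(b >> s) & 3 for s in (0, 2, 4, 6)] for b in range(256)))
```

Both checks print `True`: the subtraction form equals the direct sum for any inputs, and multiplying by 64, 16, 4, 1 before `>> 6 & 3` yields the four 2-bit fields from lowest to highest. This says nothing about the lane reordering (packus + permute), which still needs a run on real hardware.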

Copilot

Pull request overview

Copilot reviewed 20 out of 20 changed files in this pull request and generated 2 comments.

khosravipasha · 2026-06-11T02:01:57Z

This PR adds Q2_0 (aka symmetric int2) support for CPU. The main motivation is to support the Ternary Bonsai models (1.7B, 4B, 8B) and upcoming models. This PR is CPU only (ARM NEON + generic scalar fallback).
This completes the Q1_0, Q2_0, Q4_0, Q8_0 family.
We have the x86, Metal, CUDA, and Vulkan backends ready to submit later.

Notes:

Format: Each group of 64 weights shares one fp16 scale d; weights are packed at 2 bits each with the mapping {0,1,2,3} => {-1,0,+1,+2} * d.
Our models natively support group size 128; however, group size 64 was requested for the official Q2_0 format (see discussion #22019), so this PR uses 64.
We plan to also maintain a sibling group-128 variant (PQ2_0) in our fork, since the 0.125 extra bpw becomes significant on larger models. If you have a cleaner way to do this, please let us know. For future releases we will pack the models into both Q2_0 and PQ2_0 formats.
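
To make the format note concrete, here is a minimal Python sketch of one block, written for illustration only. The block size, the fp16 scale, and the code-to-value mapping come from the description above. The field order (scale first), the bit order inside each byte (lowest bits first), the max-magnitude scale rule, and the function names are all assumptions; the PR's `block_q2_0` layout and reference quantizer may differ.

```python
import struct

QK2_0 = 64                      # weights per block (from the PR description)
BLOCK_BYTES = 2 + QK2_0 // 4    # fp16 scale + 64 two-bit codes = 18 bytes


def bits_per_weight(group_size):
    # 16 bits of scale plus 2 bits per weight, averaged over the group
    return (16 + 2 * group_size) / group_size


def quantize_block(x):
    """Quantize 64 floats into one 18-byte block (illustrative, not the PR's code)."""
    assert len(x) == QK2_0
    # Assumption: the scale is the largest magnitude, rounded to fp16
    d = struct.unpack("<e", struct.pack("<e", max(abs(v) for v in x)))[0]
    inv = 1.0 / d if d else 0.0
    qs = bytearray(QK2_0 // 4)
    for i, v in enumerate(x):
        t = max(-1, min(1, round(v * inv)))   # ternary value in {-1, 0, +1}
        code = t + 1                          # stored code in {0, 1, 2}
        # Assumption: four codes per byte, lowest bits first
        qs[i // 4] |= code << (2 * (i % 4))
    return struct.pack("<e", d) + bytes(qs)


def dequantize_block(block):
    """Expand one block back to 64 floats using {0,1,2,3} => {-1,0,+1,+2} * d."""
    assert len(block) == BLOCK_BYTES
    d = struct.unpack("<e", block[:2])[0]
    qs = block[2:]
    return [(((qs[i // 4] >> (2 * (i % 4))) & 3) - 1) * d for i in range(QK2_0)]


weights = [0.5, -0.5, 0.0, 0.5] * 16
block = quantize_block(weights)
print(len(block), bits_per_weight(64), bits_per_weight(128))
print(dequantize_block(block) == weights)
```

The size arithmetic is independent of the layout assumptions: group 64 costs 2.25 bpw and group 128 costs 2.125 bpw, which is the 0.125 bpw difference mentioned above.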

Speed/Correctness Summary

Speeds: All on Mac M4 Pro (ARM NEON, -t 8, -ngl 0).
Correctness: compares logits from the unpacked model (F16 GGUF) against logits from the packed model (Q2_0 GGUF) on the WikiText-2 test file used in the KL test details.
More details and raw outputs are in the appendix.

| Size | Q2_0 size (vs F16)    | tg128 Q2_0 / F16   | pp512 Q2_0 / F16    | Mean KLD | Same top-1 |
| ---- | --------------------- | ------------------ | ------------------- | -------- | ---------- |
| 1.7B | 461.79 MiB / 3.20 GiB | 117.20 / 48.70 t/s | 170.51 / 200.15 t/s | 0.000204 | 99.392 %   |
| 4B   | 1.05 GiB / 7.49 GiB   | 55.00 / 27.23 t/s  | 68.31 / 80.64 t/s   | 0.000130 | 99.373 %   |
| 8B   | 2.15 GiB / 15.25 GiB  | 30.14 / 14.88 t/s  | 36.36 / 38.98 t/s   | 0.000120 | 99.314 %   |
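
For readers who want the ratios rather than the raw pairs, the following sketch derives them from the summary numbers. The figures are copied from the table; the dictionary layout and the `ratios` helper are illustrative only.

```python
# (Q2_0, F16) pairs copied from the summary table; sizes converted to MiB
results = {
    "1.7B": {"tg128": (117.20, 48.70), "pp512": (170.51, 200.15),
             "size_mib": (461.79, 3.20 * 1024)},
    "4B":   {"tg128": (55.00, 27.23), "pp512": (68.31, 80.64),
             "size_mib": (1.05 * 1024, 7.49 * 1024)},
    "8B":   {"tg128": (30.14, 14.88), "pp512": (36.36, 38.98),
             "size_mib": (2.15 * 1024, 15.25 * 1024)},
}


def ratios(entry):
    """Return Q2_0-over-F16 throughput ratios and the F16-over-Q2_0 size ratio."""
    tg_q, tg_f = entry["tg128"]
    pp_q, pp_f = entry["pp512"]
    size_q, size_f = entry["size_mib"]
    return {
        "tg128_speedup": tg_q / tg_f,
        "pp512_speedup": pp_q / pp_f,
        "size_reduction": size_f / size_q,
    }


for name, entry in results.items():
    r = ratios(entry)
    print(f"{name}: tg128 x{r['tg128_speedup']:.2f}, "
          f"pp512 x{r['pp512_speedup']:.2f}, size /{r['size_reduction']:.2f}")
```

On these numbers, token generation is about 2.0x to 2.4x faster than F16, prompt processing runs at roughly 0.85x to 0.93x of F16 speed, and the files are about 7.1x smaller.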

Models + Evals

More info on the models and working demos can be found below:

Demo Repo: https://github.com/PrismML-Eng/Bonsai-demo
Whitepaper: ternary-bonsai-8b-whitepaper.pdf

Models

Repos:

pip install -U "huggingface_hub[cli]"

# F16 reference + Q2_0 group-64 GGUF (swap 1.7B -> 4B / 8B for other sizes)
hf download prism-ml/Ternary-Bonsai-1.7B-gguf Ternary-Bonsai-1.7B-F16.gguf      --local-dir models
hf download prism-ml/Ternary-Bonsai-1.7B-gguf Ternary-Bonsai-1.7B-Q2_0_g64.gguf --local-dir models

Each repo has three gguf variants:

Q2_0_g64.gguf: the new group-64 format this PR adds. The _g64 suffix is for convenience; it will be renamed to plain Q2_0 once these PRs merge. Use this file with this PR.
Q2_0.gguf: the old Q2_0 from our fork (group 128); it predates the group-64 change and does not load with this PR. Will be deleted/renamed once this PR merges.
PQ2_0.gguf: the sibling format with group size 128 that we keep maintaining in our fork (fork-only, not part of this PR).

Testing

Tested on a Mac M4 Pro (48 GB) over two CPU routes: ARM NEON and the generic scalar fallback. The generic build steers ggml to the portable path (GGML_SYSTEM_ARCH=UNKNOWN), so the NEON file arch/arm/quants.c is not compiled.

Pack to Q2_0 (from F16 GGUF)

./build/bin/llama-quantize --pure models/Ternary-Bonsai-1.7B-F16.gguf models/Ternary-Bonsai-1.7B-Q2_0_g64.gguf Q2_0

llama_model_quantize_impl: model size  =  3280.93 MiB (16.00 BPW)
llama_model_quantize_impl: quant size  =   461.79 MiB (2.25 BPW)

Speed Benchmarks Details

./build/bin/llama-bench -m <model.gguf> -t 8 -ngl 0 -p 512 -n 128

ARM NEON

| model           |       size |     params | backend | threads |  test |            t/s |
| --------------- | ---------: | ---------: | ------- | ------: | ----: | -------------: |
| qwen3 1.7B Q2_0 | 461.79 MiB |     1.72 B | CPU     |       8 | pp512 | 170.51 ± 2.48  |
| qwen3 1.7B Q2_0 | 461.79 MiB |     1.72 B | CPU     |       8 | tg128 | 117.20 ± 0.17  |
| qwen3 1.7B F16  |   3.20 GiB |     1.72 B | CPU     |       8 | pp512 | 200.15 ± 7.18  |
| qwen3 1.7B F16  |   3.20 GiB |     1.72 B | CPU     |       8 | tg128 |  48.70 ± 1.44  |
| qwen3 4B Q2_0   |   1.05 GiB |     4.02 B | CPU     |       8 | pp512 |  68.31 ± 0.30  |
| qwen3 4B Q2_0   |   1.05 GiB |     4.02 B | CPU     |       8 | tg128 |  55.00 ± 0.12  |
| qwen3 4B F16    |   7.49 GiB |     4.02 B | CPU     |       8 | pp512 |  80.64 ± 0.47  |
| qwen3 4B F16    |   7.49 GiB |     4.02 B | CPU     |       8 | tg128 |  27.23 ± 0.26  |
| qwen3 8B Q2_0   |   2.15 GiB |     8.19 B | CPU     |       8 | pp512 |  36.36 ± 0.15  |
| qwen3 8B Q2_0   |   2.15 GiB |     8.19 B | CPU     |       8 | tg128 |  30.14 ± 0.50  |
| qwen3 8B F16    |  15.25 GiB |     8.19 B | CPU     |       8 | pp512 |  38.98 ± 1.01  |
| qwen3 8B F16    |  15.25 GiB |     8.19 B | CPU     |       8 | tg128 |  14.88 ± 0.03  |

Generic Scalar Fallback (1.7B, small -p 16 -n 8)

| model           |       size |     params | backend | threads | test |           t/s |
| --------------- | ---------: | ---------: | ------- | ------: | ---: | ------------: |
| qwen3 1.7B Q2_0 | 461.79 MiB |     1.72 B | CPU     |       8 | pp16 | 24.70 ± 0.32  |
| qwen3 1.7B Q2_0 | 461.79 MiB |     1.72 B | CPU     |       8 |  tg8 | 19.31 ± 0.30  |

KL Kernel Accuracy Test Details (Q2_0 g64 vs F16, packed vs unpacked)

# Step 1: save F16 reference logits
./build/bin/llama-perplexity -m models/Ternary-Bonsai-1.7B-F16.gguf \
  -f datasets/wikitext-2-raw/wiki.test.raw -c 512 --chunks 20 \
  --save-all-logits models/f16_logits.bin

# Step 2: KL Q2_0 vs F16
./build/bin/llama-perplexity -m models/Ternary-Bonsai-1.7B-Q2_0_g64.gguf \
  -f datasets/wikitext-2-raw/wiki.test.raw -c 512 --chunks 20 \
  --kl-divergence --kl-divergence-base models/f16_logits.bin
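
As a reading aid for the statistics reported next, here is a minimal sketch of the two headline metrics, computed per token position from a base (F16) logit vector and a test (Q2_0) logit vector. It follows the standard definitions and is not taken from llama-perplexity's source; the function names are hypothetical.

```python
import math


def log_softmax(logits):
    # Subtract the max first so the exponentials cannot overflow
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - lse for v in logits]


def kl_divergence(base_logits, test_logits):
    """KL(P_base || P_test) over the vocabulary for one token position."""
    lp = log_softmax(base_logits)
    lq = log_softmax(test_logits)
    return sum(math.exp(a) * (a - b) for a, b in zip(lp, lq))


def argmax(values):
    return max(range(len(values)), key=values.__getitem__)


def compare(base_rows, test_rows):
    """Return (mean KLD, percentage of positions with the same top-1 token)."""
    klds = [kl_divergence(b, t) for b, t in zip(base_rows, test_rows)]
    same = sum(argmax(b) == argmax(t) for b, t in zip(base_rows, test_rows))
    return sum(klds) / len(klds), 100.0 * same / len(klds)


base = [[2.0, 1.0, 0.1], [0.3, 0.2, 3.0]]
test = [[1.9, 1.1, 0.1], [0.3, 0.4, 2.9]]
mean_kld, same_top = compare(base, test)
print(f"mean KLD {mean_kld:.6f}, same top-1 {same_top:.1f} %")
```

The exact KL divergence is never negative, so the tiny negative percentiles in the tool output are presumably finite-precision artifacts rather than real negative divergences.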

ARM NEON: Q2_0 g64 vs F16, by size

| Metric      |       1.7B       |        4B        |        8B        |
| ----------- | ---------------- | ---------------- | ---------------- |
| Same top p  | 99.392 ± 0.109 % | 99.373 ± 0.111 % | 99.314 ± 0.116 % |
| Mean KLD    | 0.000204 ± 4e-6  | 0.000130 ± 2e-6  | 0.000120 ± 3e-6  |
| Maximum KLD |        0.008953  |        0.002888  |        0.004727  |

1.7B: full statistics (ARM NEON vs F16)

====== KL divergence statistics ======
Mean    KLD:   0.000204 ±   0.000004
Maximum KLD:   0.008953
99.9%   KLD:   0.003159
99.0%   KLD:   0.001131
95.0%   KLD:   0.000647
90.0%   KLD:   0.000475
10.0%   KLD:   0.000001
 5.0%   KLD:  -0.000000
 1.0%   KLD:  -0.000003
 0.1%   KLD:  -0.000024
Minimum KLD:  -0.000041
Same top p: 99.392 ± 0.109 %

1.7B: Generic Scalar Fallback vs F16 (mean KLD matches NEON)

====== KL divergence statistics ======
Mean    KLD:   0.000203 ±   0.000004
Maximum KLD:   0.003817
99.9%   KLD:   0.002900
99.0%   KLD:   0.001182
95.0%   KLD:   0.000626
90.0%   KLD:   0.000482
10.0%   KLD:   0.000001
 5.0%   KLD:  -0.000000
 1.0%   KLD:  -0.000004
 0.1%   KLD:  -0.000035
Minimum KLD:  -0.000046
Same top p: 99.059 ± 0.135 %

Requirements

I have read and agree with the contributing guidelines: Yes
AI usage disclosure: The NEON paths were generated with AI help, following the Q1_0 NEON path but with Q2_0 logic; correctness was verified manually using the KL test above. We have been using the packed models and they work as expected.

Copilot

Pull request overview

Copilot reviewed 17 out of 17 changed files in this pull request and generated 3 comments.

khosravipasha force-pushed the pr/q2_0-cpu branch from bb17927 to bc716e0 Compare June 10, 2026 19:03

khosravipasha requested review from bri-prism and Copilot June 10, 2026 19:05

Copilot started reviewing on behalf of khosravipasha June 10, 2026 19:05

Copilot AI reviewed Jun 10, 2026

Comment thread gguf-py/gguf/quants.py Outdated

Comment thread ggml/src/ggml-cpu/ggml-cpu.c

bri-prism reviewed Jun 10, 2026

khosravipasha force-pushed the pr/q2_0-cpu branch 2 times, most recently from 7c6c628 to 0f07ba4 Compare June 11, 2026 00:08

khosravipasha requested a review from Copilot June 11, 2026 00:14

Copilot started reviewing on behalf of khosravipasha June 11, 2026 00:14

Copilot AI reviewed Jun 11, 2026

Comment thread tests/test-quantize-fns.cpp

Comment thread tests/test-quantize-fns.cpp

khosravipasha force-pushed the pr/q2_0-cpu branch from 0f07ba4 to a69cff5 Compare June 11, 2026 00:28

Add Q2_0 quantization: type definition and CPU backend

dc7c932

khosravipasha force-pushed the pr/q2_0-cpu branch from a69cff5 to dc7c932 Compare June 11, 2026 00:37

khosravipasha changed the title ~~Add Q2_0 quantization: type definition and CPU backend~~ ggml: add Q2_0 2-bit quantization support (CPU) Jun 11, 2026

khosravipasha requested a review from Copilot June 11, 2026 02:02

Copilot started reviewing on behalf of khosravipasha June 11, 2026 02:02

Copilot AI reviewed Jun 11, 2026

Comment thread ggml/src/ggml-quants.c

Comment thread tests/test-quantize-fns.cpp

Comment thread tests/test-quantize-fns.cpp

khosravipasha wants to merge 1 commit into master from pr/q2_0-cpu

3 participants