ggml: add Q2_0 2-bit quantization support (CPU)#40

Draft
khosravipasha wants to merge 1 commit into master from pr/q2_0-cpu

Conversation

@khosravipasha

Collaborator

DRAFT PR for TESTING.

Copilot AI left a comment

Pull request overview

Adds end-to-end support for a new 2-bit quantization format Q2_0 (block size 64) across llama.cpp’s quantization pipeline, GGML core type system, CPU backend kernels, and GGUF Python tooling—intended for testing via a draft PR.

Changes:

  • Introduces GGML_TYPE_Q2_0 / LLAMA_FTYPE_MOSTLY_Q2_0 and wires them through model loading and quantization type selection.
  • Implements Q2_0 reference quantize/dequant routines plus CPU dot-product kernels (generic + x86/ARM optimized).
  • Adds GGUF Python quant/dequant support and conversion mappings to enable exporting/loading Q2_0 models.

Reviewed changes

Copilot reviewed 19 out of 19 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| tools/quantize/quantize.cpp | Exposes Q2_0 as a quantization option in the CLI tool. |
| src/llama-quant.cpp | Integrates Q2_0 into tensor-type selection, fallback logic, and default ftype mapping. |
| src/llama-model-loader.cpp | Adds ftype name + GGML type → llama ftype mapping for Q2_0. |
| include/llama.h | Adds LLAMA_FTYPE_MOSTLY_Q2_0 enum value. |
| gguf-py/gguf/quants.py | Adds Python quantize/dequant implementation for Q2_0 blocks. |
| gguf-py/gguf/constants.py | Adds GGUF constants for Q2_0 quant/type IDs and size table entry. |
| ggml/src/ggml.c | Registers Q2_0 in GGML type traits and quantize dispatch. |
| ggml/src/ggml-quants.h | Declares Q2_0 quantize/dequant APIs. |
| ggml/src/ggml-quants.c | Implements Q2_0 reference quantize/dequant and chunk quantization support. |
| ggml/src/ggml-cpu/quants.h | Declares CPU-side Q2_0 quantize and dot-product functions. |
| ggml/src/ggml-cpu/quants.c | Adds CPU generic dot-product implementation for Q2_0×Q8_0. |
| ggml/src/ggml-cpu/ops.cpp | Treats Q2_0 as a quantized type in several CPU ops dispatch paths. |
| ggml/src/ggml-cpu/ggml-cpu.c | Adds CPU type traits entry for Q2_0. |
| ggml/src/ggml-cpu/arch/x86/quants.c | Adds x86 implementation of Q2_0×Q8_0 dot product (with AVX-512 VNNI path). |
| ggml/src/ggml-cpu/arch/arm/quants.c | Adds ARM NEON implementation of Q2_0×Q8_0 dot product (fallback to generic). |
| ggml/src/ggml-common.h | Defines Q2_0 block format (block_q2_0, QK2_0). |
| ggml/include/ggml.h | Adds GGML_TYPE_Q2_0 and GGML_FTYPE_MOSTLY_Q2_0 public enums. |
| convert_hf_to_gguf.py | Adds "q2_0" CLI mapping to GGUF file type. |
| conversion/base.py | Maps MOSTLY_Q2_0 to GGMLQuantizationType.Q2_0 during conversion. |

Comment thread gguf-py/gguf/quants.py Outdated
Comment thread ggml/src/ggml-cpu/ggml-cpu.c

@bri-prism bri-prism left a comment

Review: Add Q2_0 quantization (type definition + CPU backend)

Complete and consistent CPU implementation — all paths (scalar reference, NEON, x86-VNNI, x86 scalar, gguf-py) agree on the same 18-byte / 64-weight layout (2.25 bpw). Two items to resolve before merge:

  1. Run test-quantize-fns and test-backend-ops. Registering the type enrolls it in the automated round-trip/dot-product accuracy tests, which use synthetic data. With codes {0,1,2,3} → {-1,0,+1,+2}·d and the reference quantizer emitting only {0,1,2} (effectively a ternary codebook with a per-64 fp16 scale), reconstruction error on that synthetic data can exceed the per-type tolerances and fail CI. Please confirm both pass — and adjust the per-type threshold if that's intended.

  2. Validate the x86 path on VNNI hardware. The AVX-512-VNNI kernel is logically correct — it computes Σ(c−1)·qy as dpbusd(c, qy) − dpbusd(1, qy), and the bitfield extraction (×{64,16,4,1} >> 6 & 3) plus the packus + permute4x64(…, 0xD8) reordering are sound — but it hasn't yet run on VNNI silicon (Sapphire Rapids / Zen 4–5). The #else scalar fallback is straightforwardly correct.
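The two arithmetic facts cited in item 2 can be checked with plain integers. The sketch below is not the PR's kernel: the function names are hypothetical, Python's unbounded integers stand in for SIMD lanes wide enough to hold `byte * 64` without overflow, and the order of the four codes within a byte is an assumption. It only illustrates that multiply-shift-mask recovers the four 2-bit fields, and that Σ(c−1)·qy can be split into an unsigned-by-signed dot product (the shape `dpbusd` computes) minus the sum of `qy`.

```python
import random

def extract_codes(byte):
    """Pull four 2-bit codes out of one byte via multiply, shift, mask."""
    # Multiplying by 64, 16, 4, 1 and then shifting right by 6 is the same
    # as shifting right by 0, 2, 4, 6 (assumes no overflow in the product).
    return [((byte * m) >> 6) & 3 for m in (64, 16, 4, 1)]

def dot_direct(codes, qy):
    # Reference form: decode each code to c - 1 in {-1, 0, +1, +2},
    # then multiply-accumulate against the int8 activations.
    return sum((c - 1) * y for c, y in zip(codes, qy))

def dot_split(codes, qy):
    # Split form: unsigned codes times signed int8 values, minus the dot
    # product of an all-ones vector with qy (which is just sum(qy)).
    return sum(c * y for c, y in zip(codes, qy)) - sum(1 * y for y in qy)

if __name__ == "__main__":
    rng = random.Random(0)
    codes = [rng.randrange(4) for _ in range(64)]
    qy = [rng.randrange(-128, 128) for _ in range(64)]
    print(dot_direct(codes, qy) == dot_split(codes, qy))
```

The split form matters because `dpbusd` takes one unsigned and one signed 8-bit operand, so the raw codes can be fed in directly and the −1 offset corrected afterwards. None of this substitutes for running the kernel on VNNI hardware.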

A one-line rationale in the description for why this is a new type vs. TQ2_0 (finer, group-64 scaling) would also pre-empt the obvious review question.

Copilot AI left a comment

Pull request overview

Copilot reviewed 20 out of 20 changed files in this pull request and generated 2 comments.

Comment thread tests/test-quantize-fns.cpp
Comment thread tests/test-quantize-fns.cpp
@khosravipasha

khosravipasha commented Jun 11, 2026

Collaborator Author

This PR adds Q2_0 (aka symmetric int2) support for CPU. The main motivation is to support the Ternary Bonsai models (1.7B, 4B, 8B) and upcoming models. This PR is CPU-only (ARM NEON + generic scalar fallback).
It completes the Q1_0, Q2_0, Q4_0, Q8_0 family.
We have the x86, Metal, CUDA, and Vulkan backends ready to submit later.

Notes:

  • Format: Each group of 64 weights shares one fp16 scale d; weights are packed at 2 bits each with
    mapping of {0,1,2,3} => {-1,0,+1,+2} * d

  • Our models natively support group size 128; however, it was requested to do group size 64 for the official Q2_0 format (see discussion #22019), so this PR uses 64.

  • We also plan to maintain a sibling group-128 variant (PQ2_0) in our fork, since the
    0.125 extra bpw becomes significant on larger models. If you have a cleaner way to do this, please let us know. For future releases, we will pack the models into both Q2_0 and PQ2_0 formats.
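To make the format notes concrete, here is a minimal Python sketch of one Q2_0 block. Several details are assumptions rather than facts from this PR: the byte layout (fp16 scale first, then 16 code bytes), the order of codes within a byte, the choice of `max(|x|)` as the scale, and all function names. Only the group size of 64, the 2-bit codes, and the mapping {0,1,2,3} => {-1,0,+1,+2} * d come from the description above.

```python
import struct

QK2_0 = 64                      # weights per block (from the PR description)
BLOCK_BYTES = 2 + QK2_0 // 4    # fp16 scale + 2 bits per weight = 18 bytes

def quantize_block(x):
    """Quantize 64 floats into one block (assumed layout: fp16 d, then codes)."""
    assert len(x) == QK2_0
    amax = max(abs(v) for v in x)   # assumed scale choice; must fit in fp16
    # Round-trip the scale through fp16 so quantize and dequantize agree.
    d = struct.unpack("<e", struct.pack("<e", amax))[0]
    inv = 1.0 / d if d else 0.0
    qs = bytearray(QK2_0 // 4)
    for i, v in enumerate(x):
        t = max(-1, min(1, round(v * inv)))   # ternary value in {-1, 0, +1}
        code = t + 1                          # stored code in {0, 1, 2}
        # Assumed packing order: weight 4*j + k sits at bits 2*k of byte j.
        qs[i // 4] |= code << (2 * (i % 4))
    return struct.pack("<e", d) + bytes(qs)

def dequantize_block(block):
    """Decode one block using the mapping code -> (code - 1) * d."""
    assert len(block) == BLOCK_BYTES
    d = struct.unpack("<e", block[:2])[0]
    qs = block[2:]
    return [(((qs[i // 4] >> (2 * (i % 4))) & 3) - 1) * d for i in range(QK2_0)]

def bits_per_weight(group_size):
    # One 16-bit scale per group plus 2 bits per weight.
    return (16 + 2 * group_size) / group_size

if __name__ == "__main__":
    x = [0.5, -0.5, 0.0, 0.5] * 16
    print(dequantize_block(quantize_block(x)) == x)
    print(bits_per_weight(64), bits_per_weight(128))
```

The last function reproduces the arithmetic behind the numbers quoted in this thread: 2.25 bpw for group 64 versus 2.125 bpw for group 128, a difference of 0.125 bpw. The decoder accepts code 3 (+2·d), but this sketch's quantizer never emits it.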

Speed/Correctness Summary

  • Speeds: All on Mac M4 Pro (ARM NEON, -t 8, -ngl 0).
  • Correctness: Logits from the packed model (Q2_0 GGUF) are compared against logits from the unpacked model (F16 GGUF) on the wikitext-2 test set.
    More details and raw outputs are in the Testing section below.

| Size | Q2_0 size (vs F16)    | tg128 Q2_0 / F16   | pp512 Q2_0 / F16    | Mean KLD | Same top-1 |
| ---- | --------------------- | ------------------ | ------------------- | -------- | ---------- |
| 1.7B | 461.79 MiB / 3.20 GiB | 117.20 / 48.70 t/s | 170.51 / 200.15 t/s | 0.000204 | 99.392 %   |
| 4B   | 1.05 GiB / 7.49 GiB   | 55.00 / 27.23 t/s  | 68.31 / 80.64 t/s   | 0.000130 | 99.373 %   |
| 8B   | 2.15 GiB / 15.25 GiB  | 30.14 / 14.88 t/s  | 36.36 / 38.98 t/s   | 0.000120 | 99.314 %   |

Models + Evals

More info on the models and working demos can be found below:

Models

Repos:

pip install -U "huggingface_hub[cli]"

# F16 reference + Q2_0 group-64 GGUF (swap 1.7B -> 4B / 8B for other sizes)
hf download prism-ml/Ternary-Bonsai-1.7B-gguf Ternary-Bonsai-1.7B-F16.gguf      --local-dir models
hf download prism-ml/Ternary-Bonsai-1.7B-gguf Ternary-Bonsai-1.7B-Q2_0_g64.gguf --local-dir models

Each repo has three gguf variants:

  • Q2_0_g64.gguf: the new group-64 format this PR adds. The _g64 suffix is for
    convenience; it will be renamed to plain Q2_0 once these PRs merge. Use this file with this PR.
  • Q2_0.gguf: the old Q2_0 from our fork (group 128); predates the group-64 change
    and does not load with this PR. Will be deleted/renamed once this PR merges.
  • PQ2_0.gguf: the sibling format with group size 128 we keep maintaining in our fork (fork-only,
    not part of this PR).

Testing

Tested on a Mac M4 Pro, 48 GB. Two CPU paths were tested: ARM NEON and the generic scalar fallback.
The generic build was produced by forcing ggml onto the portable path (GGML_SYSTEM_ARCH=UNKNOWN, so the NEON
arch/arm/quants.c is not compiled).

Pack to Q2_0 (from F16 GGUF)
./build/bin/llama-quantize --pure models/Ternary-Bonsai-1.7B-F16.gguf models/Ternary-Bonsai-1.7B-Q2_0_g64.gguf Q2_0
llama_model_quantize_impl: model size  =  3280.93 MiB (16.00 BPW)
llama_model_quantize_impl: quant size  =   461.79 MiB (2.25 BPW)
Speed Benchmarks Details
./build/bin/llama-bench -m <model.gguf> -t 8 -ngl 0 -p 512 -n 128

ARM NEON

| model           |       size |     params | backend | threads |  test |            t/s |
| --------------- | ---------: | ---------: | ------- | ------: | ----: | -------------: |
| qwen3 1.7B Q2_0 | 461.79 MiB |     1.72 B | CPU     |       8 | pp512 | 170.51 ± 2.48  |
| qwen3 1.7B Q2_0 | 461.79 MiB |     1.72 B | CPU     |       8 | tg128 | 117.20 ± 0.17  |
| qwen3 1.7B F16  |   3.20 GiB |     1.72 B | CPU     |       8 | pp512 | 200.15 ± 7.18  |
| qwen3 1.7B F16  |   3.20 GiB |     1.72 B | CPU     |       8 | tg128 |  48.70 ± 1.44  |
| qwen3 4B Q2_0   |   1.05 GiB |     4.02 B | CPU     |       8 | pp512 |  68.31 ± 0.30  |
| qwen3 4B Q2_0   |   1.05 GiB |     4.02 B | CPU     |       8 | tg128 |  55.00 ± 0.12  |
| qwen3 4B F16    |   7.49 GiB |     4.02 B | CPU     |       8 | pp512 |  80.64 ± 0.47  |
| qwen3 4B F16    |   7.49 GiB |     4.02 B | CPU     |       8 | tg128 |  27.23 ± 0.26  |
| qwen3 8B Q2_0   |   2.15 GiB |     8.19 B | CPU     |       8 | pp512 |  36.36 ± 0.15  |
| qwen3 8B Q2_0   |   2.15 GiB |     8.19 B | CPU     |       8 | tg128 |  30.14 ± 0.50  |
| qwen3 8B F16    |  15.25 GiB |     8.19 B | CPU     |       8 | pp512 |  38.98 ± 1.01  |
| qwen3 8B F16    |  15.25 GiB |     8.19 B | CPU     |       8 | tg128 |  14.88 ± 0.03  |

Generic Scalar Fallback (1.7B, small -p 16 -n 8)

| model           |       size |     params | backend | threads | test |           t/s |
| --------------- | ---------: | ---------: | ------- | ------: | ---: | ------------: |
| qwen3 1.7B Q2_0 | 461.79 MiB |     1.72 B | CPU     |       8 | pp16 | 24.70 ± 0.32  |
| qwen3 1.7B Q2_0 | 461.79 MiB |     1.72 B | CPU     |       8 |  tg8 | 19.31 ± 0.30  |
KL Kernel Accuracy Test Details (Q2_0 g64 vs F16, packed vs unpacked)
# Step 1: save F16 reference logits
./build/bin/llama-perplexity -m models/Ternary-Bonsai-1.7B-F16.gguf \
  -f datasets/wikitext-2-raw/wiki.test.raw -c 512 --chunks 20 \
  --save-all-logits models/f16_logits.bin

# Step 2: KL Q2_0 vs F16
./build/bin/llama-perplexity -m models/Ternary-Bonsai-1.7B-Q2_0_g64.gguf \
  -f datasets/wikitext-2-raw/wiki.test.raw -c 512 --chunks 20 \
  --kl-divergence --kl-divergence-base models/f16_logits.bin

ARM NEON: Q2_0 g64 vs F16, by size

| Metric      |       1.7B       |        4B        |        8B        |
| ----------- | ---------------- | ---------------- | ---------------- |
| Same top p  | 99.392 ± 0.109 % | 99.373 ± 0.111 % | 99.314 ± 0.116 % |
| Mean KLD    | 0.000204 ± 4e-6  | 0.000130 ± 2e-6  | 0.000120 ± 3e-6  |
| Maximum KLD |        0.008953  |        0.002888  |        0.004727  |
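
For readers unfamiliar with the two metrics in the table, the sketch below computes them for a single token position from a pair of logit vectors. It illustrates the definitions only and is not llama-perplexity's implementation; the function names are hypothetical.

```python
import math

def log_softmax(logits):
    # Subtract the max before exponentiating for numerical stability.
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - lse for v in logits]

def kl_divergence(base_logits, test_logits):
    """KL(P_base || P_test) for one token position, in nats."""
    lp = log_softmax(base_logits)
    lq = log_softmax(test_logits)
    return sum(math.exp(a) * (a - b) for a, b in zip(lp, lq))

def same_top(base_logits, test_logits):
    """True when both models rank the same token first."""
    def argmax(xs):
        return max(range(len(xs)), key=xs.__getitem__)
    return argmax(base_logits) == argmax(test_logits)

if __name__ == "__main__":
    f16 = [2.0, 1.0, 0.1]        # made-up logits for illustration only
    q2_0 = [1.9, 1.1, 0.1]
    print(kl_divergence(f16, q2_0), same_top(f16, q2_0))
```

"Mean KLD" averages the first quantity over all evaluated positions, and "Same top p" is the fraction of positions where the second is true. KL divergence is non-negative by definition, so the slightly negative percentiles in the full statistics below are presumably numerical artifacts rather than real signal.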

1.7B: full statistics (ARM NEON vs F16)

====== KL divergence statistics ======
Mean    KLD:   0.000204 ±   0.000004
Maximum KLD:   0.008953
99.9%   KLD:   0.003159
99.0%   KLD:   0.001131
95.0%   KLD:   0.000647
90.0%   KLD:   0.000475
10.0%   KLD:   0.000001
 5.0%   KLD:  -0.000000
 1.0%   KLD:  -0.000003
 0.1%   KLD:  -0.000024
Minimum KLD:  -0.000041
Same top p: 99.392 ± 0.109 %

1.7B: Generic Scalar Fallback vs F16 (mean KLD matches NEON)

====== KL divergence statistics ======
Mean    KLD:   0.000203 ±   0.000004
Maximum KLD:   0.003817
99.9%   KLD:   0.002900
99.0%   KLD:   0.001182
95.0%   KLD:   0.000626
90.0%   KLD:   0.000482
10.0%   KLD:   0.000001
 5.0%   KLD:  -0.000000
 1.0%   KLD:  -0.000004
 0.1%   KLD:  -0.000035
Minimum KLD:  -0.000046
Same top p: 99.059 ± 0.135 %

Requirements

  • I have read and agree with the contributing guidelines: Yes
  • AI usage disclosure: The NEON paths were generated with AI help, following the Q1_0 NEON path but with Q2_0 logic. Correctness was verified manually using the KL test above. We have been using the packed models, and they work as expected.

@khosravipasha khosravipasha changed the title Add Q2_0 quantization: type definition and CPU backend ggml: add Q2_0 2-bit quantization support (CPU) Jun 11, 2026
@khosravipasha khosravipasha requested a review from Copilot June 11, 2026 02:02

Copilot AI left a comment

Pull request overview

Copilot reviewed 17 out of 17 changed files in this pull request and generated 3 comments.

Comment thread ggml/src/ggml-quants.c
Comment thread tests/test-quantize-fns.cpp
Comment thread tests/test-quantize-fns.cpp
3 participants