Add GGUF Q1_0 kernel support (MMQ/MMVQ + enum patch) #1
Pull request overview
Adds end-to-end support for GGUF Q1_0 (type id 41) across the CUDA GGUF kernels and Python GGUF loading stack, including a temporary Python-side compat shim for older gguf pip releases.
Changes:
- Introduce `Q1_0` block definition + CUDA dequantization, MMVQ vecdot, and MMQ tiling support.
- Wire `Q1_0` into GGUF matmul / matmul-vec / moe-vec dispatch paths in the CUDA extension.
- Add a Python `gguf` monkey-patch shim and invoke it from config/model-loading code paths so `GGUFReader` can parse type 41.
Reviewed changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| sgl-kernel/csrc/quantization/gguf/vecdotq.cuh | Adds Q1_0 vecdot and MMQ tile load/unpack support by mapping Q1_0 to a signed-byte layout. |
| sgl-kernel/csrc/quantization/gguf/moe_vec.cuh | Adds moe_vec_q1_0_q8_1_cuda launcher for Q1_0. |
| sgl-kernel/csrc/quantization/gguf/mmvq.cuh | Adds mul_mat_vec_q1_0_q8_1_cuda launcher for Q1_0. |
| sgl-kernel/csrc/quantization/gguf/mmq.cuh | Adds ggml_mul_mat_q1_0_q8_1_cuda (MMQ) kernel entrypoint for Q1_0. |
| sgl-kernel/csrc/quantization/gguf/gguf_kernel.cu | Adds dispatch cases for type 41 in matmul-vec / matmul / moe-vec wrappers. |
| sgl-kernel/csrc/quantization/gguf/ggml-common.h | Defines Q1_0 constants and block_q1_0 layout. |
| sgl-kernel/csrc/quantization/gguf/dequantize.cuh | Adds device dequantizer and registration for type 41. |
| python/sglang/srt/utils/hf_transformers/config.py | Ensures the Q1_0 compat shim runs before Transformers reads GGUF. |
| python/sglang/srt/utils/gguf_compat.py | New Python shim to extend gguf with Q1_0 enum/sizes + numpy dequantization. |
| python/sglang/srt/model_loader/weight_utils.py | Applies the shim before using GGUFReader in iterator helpers. |
| python/sglang/srt/model_loader/loader.py | Applies the shim before deriving GGUF↔HF tensor name mappings. |
| python/sglang/srt/layers/quantization/gguf.py | Adds Q1_0 to supported type sets and triggers shim at import time. |

    // MMVQ = mul_mat_vec_q, MMQ = mul_mat_q

    #define VDR_Q1_0_Q8_1_MMVQ 1 // Process one 32-element chunk at a time
    #define VDR_Q1_0_Q8_1_MMQ 4 // Q1_0 has 128 bits (4 ints) per block

VDR_Q1_0_Q8_1_MMQ is defined as 4, but the MMQ path for Q1_0 reuses vec_dot_q8_0_q8_1_mul_mat, which internally processes VDR_Q8_0_Q8_1_MMQ (=8) ints per call. If VDR_Q1_0_Q8_1_MMQ is ever used to drive the MMQ loop step, it would cause overlapping reads and potential out-of-bounds access on the last iteration. Consider removing this macro (since it's currently unused) or setting it to the effective VDR required by the reused dot-product implementation.

Suggested change:

    - #define VDR_Q1_0_Q8_1_MMQ 4 // Q1_0 has 128 bits (4 ints) per block
    + #define VDR_Q1_0_Q8_1_MMQ 8 // Match the effective MMQ VDR of the reused q8_0/q8_1 dot-product implementation
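
To make the overlap concrete, here is a small Python model of the loop rather than the CUDA code itself. It assumes a hypothetical tile of 32 ints and a loop that advances by `step` while each call consumes 8 ints, as the comment describes for `vec_dot_q8_0_q8_1_mul_mat`. The helper names `ints_read` and `audit` are invented for illustration and do not exist in the PR.

```python
# Illustrative model only: the real MMQ loop lives in CUDA and may differ.
VDR_Q8_0_Q8_1_MMQ = 8  # ints consumed per call by the reused q8_0/q8_1 dot product


def ints_read(tile_ints, step, consumed_per_call=VDR_Q8_0_Q8_1_MMQ):
    """Return every int index touched when the loop advances by `step`."""
    touched = []
    for k in range(0, tile_ints, step):
        touched.extend(range(k, k + consumed_per_call))
    return touched


def audit(tile_ints, step):
    """Count indices read more than once and indices past the end of the tile."""
    touched = ints_read(tile_ints, step)
    overlaps = len(touched) - len(set(touched))
    out_of_bounds = sum(1 for i in touched if i >= tile_ints)
    return overlaps, out_of_bounds
```

Under these assumptions, `audit(32, 8)` reports no overlap and no out-of-bounds reads, while `audit(32, 4)` reports 28 re-read indices and 4 indices past the end of the tile, all of them from the last iteration.
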

    DEQUANT_TYPES = STANDARD_QUANT_TYPES | KQUANT_TYPES | IMATRIX_QUANT_TYPES | Q1_0_TYPES
    MMVQ_QUANT_TYPES = STANDARD_QUANT_TYPES | KQUANT_TYPES | IMATRIX_QUANT_TYPES | Q1_0_TYPES
    MMQ_QUANT_TYPES = STANDARD_QUANT_TYPES | KQUANT_TYPES | Q1_0_TYPES

The new Q1_0_TYPES are added to MMQ_QUANT_TYPES, which can make fused_moe_gguf() take the ggml_moe_a8 (MMQ) path for Q1_0 when x.shape[0] > 64. In gguf_kernel.cu, ggml_moe_get_block_size() and ggml_moe_a8() don’t handle type 41, so this will return a block size of 0 and/or skip launching any kernel. Either add Q1_0 support to the MMQ MoE kernels + block-size mapping, or exclude Q1_0 from the MMQ MoE fast path (e.g., keep it MMVQ-only for MoE for now).

Suggested change:

    - DEQUANT_TYPES = STANDARD_QUANT_TYPES | KQUANT_TYPES | IMATRIX_QUANT_TYPES | Q1_0_TYPES
    - MMVQ_QUANT_TYPES = STANDARD_QUANT_TYPES | KQUANT_TYPES | IMATRIX_QUANT_TYPES | Q1_0_TYPES
    - MMQ_QUANT_TYPES = STANDARD_QUANT_TYPES | KQUANT_TYPES | Q1_0_TYPES
    + # NOTE: Q1_0 is intentionally excluded from MMQ_QUANT_TYPES because the MMQ
    + # MoE kernels/block-size mapping do not support it yet. Keep Q1_0 on the
    + # dequant/MMVQ paths until MMQ MoE support is added.
    + DEQUANT_TYPES = STANDARD_QUANT_TYPES | KQUANT_TYPES | IMATRIX_QUANT_TYPES | Q1_0_TYPES
    + MMVQ_QUANT_TYPES = STANDARD_QUANT_TYPES | KQUANT_TYPES | IMATRIX_QUANT_TYPES | Q1_0_TYPES
    + MMQ_QUANT_TYPES = STANDARD_QUANT_TYPES | KQUANT_TYPES
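
The effect of leaving `Q1_0_TYPES` out of `MMQ_QUANT_TYPES` can be sketched as a set-membership check. This is a simplified stand-in, not the actual `fused_moe_gguf()` logic: the selector `pick_moe_path`, the placeholder type ids other than 41, and the fallback order are assumptions, while the 64-token threshold comes from the comment above.

```python
Q1_0 = 41  # type id stated in the PR description

# Placeholder ids standing in for the real type sets (assumption).
STANDARD_QUANT_TYPES = {2, 3, 8}
KQUANT_TYPES = {10, 11, 12}
Q1_0_TYPES = {Q1_0}

MMVQ_QUANT_TYPES = STANDARD_QUANT_TYPES | KQUANT_TYPES | Q1_0_TYPES
MMQ_QUANT_TYPES = STANDARD_QUANT_TYPES | KQUANT_TYPES  # Q1_0 left out on purpose


def pick_moe_path(qtype, num_tokens, mmq_threshold=64):
    """Hypothetical selector: MMQ only for supported types with large batches."""
    if qtype in MMQ_QUANT_TYPES and num_tokens > mmq_threshold:
        return "mmq"
    if qtype in MMVQ_QUANT_TYPES:
        return "mmvq"
    return "dequant"
```

With the exclusion in place, `pick_moe_path(41, 128)` stays on the MMVQ path even for a batch above the threshold, so the unsupported MMQ MoE kernels are never reached for Q1_0.
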

    case 41:
        mul_mat_vec_q1_0_q8_1_cuda<scalar_t>(
            (void*)W.data_ptr(), (void*)quant_X.data_ptr(), (scalar_t*)Y.data_ptr(), col, row, vecs, stream);
        break;

This switch has no default/error handling. If an unsupported type is passed, no kernel will run and Y (allocated with torch::empty) will be returned uninitialized, leading to silent incorrect outputs. Consider adding a default: that raises (e.g., TORCH_CHECK(false, ...)) or otherwise initializes/handles the output for unknown types.

Suggested change:

    -     break;
    +     break;
    + default:
    +     TORCH_CHECK(false, "ggml_mul_mat_vec_a8: unsupported GGUF type: ", type);
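
The failure mode is easier to see in a small Python analogue of the dispatch. This is a sketch, not the extension code: the dispatch table `KERNELS`, the stand-in `launch_q1_0`, and the list used to model an uninitialized `torch::empty` buffer are all assumptions made for illustration.

```python
def launch_q1_0(out):
    """Stand-in for a kernel launch: writes every element of the output."""
    for i in range(len(out)):
        out[i] = 1.0


# Hypothetical dispatch table keyed by GGUF type id (41 = Q1_0 per the PR).
KERNELS = {41: launch_q1_0}


def mul_mat_vec(qtype, n):
    out = [None] * n  # models torch::empty: contents are meaningless until written
    kernel = KERNELS.get(qtype)
    if kernel is None:
        # Analogue of the suggested `default:` branch with TORCH_CHECK(false, ...).
        raise ValueError(f"unsupported GGUF type: {qtype}")
    kernel(out)
    return out
```

Without the explicit check, an unknown type would fall through and return the untouched buffer, which is the silent-garbage case the comment warns about.
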

            padded,
            row,
            stream);
        break;

This switch has no default/error handling. If an unsupported type is passed, no kernel will run and Y (allocated with torch::empty) will be returned uninitialized, leading to silent incorrect outputs. Consider adding a default: that raises (e.g., TORCH_CHECK(false, ...)) or otherwise initializes/handles the output for unknown types.

Suggested change:

    -     break;
    +     break;
    + default:
    +     TORCH_CHECK(false, "Unsupported GGUF quantization type in ggml_mul_mat_a8: ", type);


    case 41:
        moe_vec_q1_0_q8_1_cuda<scalar_t>(
            (void*)W.data_ptr(),
            (void*)quant_X.data_ptr(),
            (scalar_t*)Y.data_ptr(),
            (int*)topk_ids.data_ptr(),
            top_k,
            tokens,
            col,
            row,
            quant_X.stride(0),
            stream);
        break;
    }

This switch has no default/error handling. If an unsupported type is passed, the function will silently return all-zeros (since Y is initialized with torch::zeros) and no kernel will run. Consider adding a default: that raises (e.g., TORCH_CHECK(false, ...)) to avoid masking unsupported quantization types as valid outputs.

    # 0. Add Q1_0 to the GGMLQuantizationType enum so GGUFReader can parse it
    from gguf import GGMLQuantizationType
    import enum

enum is imported but never used here, which will fail the repo’s configured Ruff check (F401 unused import). Please remove the import (or use it if it was intended).

Suggested change:

    - import enum


    def ensure_q1_0_gguf_compat():
        """Register Q1_0 type in the gguf library if not already present."""
        import gguf

New Q1_0 GGUF compatibility shim and dequantization logic is introduced here, but there are no unit tests covering (a) that ensure_q1_0_gguf_compat() makes gguf.GGUFReader accept type 41 and (b) that the numpy dequantization produces the expected ±scale outputs. Since the repo already has GGUF-related unit tests, consider adding a small targeted test that runs the shim against the installed gguf module and validates enum/size registration and dequantization for a tiny synthetic block.
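
A test for point (b) could look roughly like the sketch below, which builds one synthetic block and checks the ±scale outputs. The byte layout is an assumption, since `block_q1_0` is not shown in this review: a little-endian fp16 scale followed by 16 bytes of sign bits, read least-significant bit first, with a set bit meaning `+scale`. Only the 128 bits per block figure comes from the diff. The helpers `make_block` and `dequantize_q1_0_block` are hypothetical and use pure Python instead of the shim's numpy path.

```python
import struct

QK1_0 = 128  # elements per block, from "128 bits (4 ints) per block"


def make_block(scale, set_bits):
    """Pack one synthetic block under the ASSUMED layout (fp16 scale + sign bits)."""
    bits = bytearray(QK1_0 // 8)
    for i in set_bits:
        bits[i // 8] |= 1 << (i % 8)  # LSB-first bit order is an assumption
    return struct.pack("<e", scale) + bytes(bits)


def dequantize_q1_0_block(block):
    """Decode one block: set bit -> +scale, clear bit -> -scale (assumed mapping)."""
    if len(block) != 2 + QK1_0 // 8:
        raise ValueError(f"expected {2 + QK1_0 // 8} bytes, got {len(block)}")
    (scale,) = struct.unpack("<e", block[:2])
    bits = block[2:]
    out = []
    for i in range(QK1_0):
        bit = (bits[i // 8] >> (i % 8)) & 1
        out.append(scale if bit else -scale)
    return out
```

A real test would replace `dequantize_q1_0_block` with the shim's numpy dequantizer and compare its output against values constructed this way, after confirming the actual layout in `ggml-common.h`.
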