Add GGUF Q1_0 kernel support (MMQ/MMVQ + enum patch) by khosravipasha · Pull Request #1 · PrismML-Eng/sglang

khosravipasha · 2026-04-17T23:54:24Z

Draft

Copilot

Pull request overview

Adds end-to-end support for GGUF Q1_0 (type id 41) across the CUDA GGUF kernels and Python GGUF loading stack, including a temporary Python-side compat shim for older gguf pip releases.

Changes:

- Introduce Q1_0 block definition + CUDA dequantization, MMVQ vecdot, and MMQ tiling support.
- Wire Q1_0 into GGUF matmul / matmul-vec / moe-vec dispatch paths in the CUDA extension.
- Add a Python gguf monkey-patch shim and invoke it from config/model-loading code paths so GGUFReader can parse type 41.

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 7 comments.


| File | Description |
| --- | --- |
| sgl-kernel/csrc/quantization/gguf/vecdotq.cuh | Adds Q1_0 vecdot and MMQ tile load/unpack support by mapping Q1_0 to a signed-byte layout. |
| sgl-kernel/csrc/quantization/gguf/moe_vec.cuh | Adds `moe_vec_q1_0_q8_1_cuda` launcher for Q1_0. |
| sgl-kernel/csrc/quantization/gguf/mmvq.cuh | Adds `mul_mat_vec_q1_0_q8_1_cuda` launcher for Q1_0. |
| sgl-kernel/csrc/quantization/gguf/mmq.cuh | Adds `ggml_mul_mat_q1_0_q8_1_cuda` (MMQ) kernel entrypoint for Q1_0. |
| sgl-kernel/csrc/quantization/gguf/gguf_kernel.cu | Adds dispatch cases for type 41 in matmul-vec / matmul / moe-vec wrappers. |
| sgl-kernel/csrc/quantization/gguf/ggml-common.h | Defines Q1_0 constants and `block_q1_0` layout. |
| sgl-kernel/csrc/quantization/gguf/dequantize.cuh | Adds device dequantizer and registration for type 41. |
| python/sglang/srt/utils/hf_transformers/config.py | Ensures the Q1_0 compat shim runs before Transformers reads GGUF. |
| python/sglang/srt/utils/gguf_compat.py | New Python shim to extend `gguf` with Q1_0 enum/sizes + numpy dequantization. |
| python/sglang/srt/model_loader/weight_utils.py | Applies the shim before using `GGUFReader` in iterator helpers. |
| python/sglang/srt/model_loader/loader.py | Applies the shim before deriving GGUF↔HF tensor name mappings. |
| python/sglang/srt/layers/quantization/gguf.py | Adds Q1_0 to supported type sets and triggers shim at import time. |


Copilot · 2026-04-18T00:01:58Z

 // MMVQ = mul_mat_vec_q, MMQ = mul_mat_q

+#define VDR_Q1_0_Q8_1_MMVQ 1  // Process one 32-element chunk at a time
+#define VDR_Q1_0_Q8_1_MMQ  4  // Q1_0 has 128 bits (4 ints) per block


VDR_Q1_0_Q8_1_MMQ is defined as 4, but the MMQ path for Q1_0 reuses vec_dot_q8_0_q8_1_mul_mat, which internally processes VDR_Q8_0_Q8_1_MMQ (=8) ints per call. If VDR_Q1_0_Q8_1_MMQ is ever used to drive the MMQ loop step, it would cause overlapping reads and potential out-of-bounds access on the last iteration. Consider removing this macro (since it's currently unused) or setting it to the effective VDR required by the reused dot-product implementation.

Suggested change

-#define VDR_Q1_0_Q8_1_MMQ 4 // Q1_0 has 128 bits (4 ints) per block
+#define VDR_Q1_0_Q8_1_MMQ 8 // Match the effective MMQ VDR of the reused q8_0/q8_1 dot-product implementation
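The overlap and out-of-bounds concern can be illustrated with a toy model in Python. This is only a sketch of the arithmetic, not the actual MMQ loop: `read_spans` and `check_loop` are hypothetical helpers, and the 16-int tile width is an arbitrary value chosen for illustration.

```python
def read_spans(total_ints, loop_step, ints_per_call):
    """Return the [start, end) int ranges touched when a loop advances by
    `loop_step` while each dot-product call consumes `ints_per_call` ints."""
    return [(start, start + ints_per_call)
            for start in range(0, total_ints, loop_step)]


def check_loop(total_ints, loop_step, ints_per_call):
    """Report whether consecutive calls overlap and whether the last call
    reads past the end of the tile."""
    spans = read_spans(total_ints, loop_step, ints_per_call)
    overlaps = any(prev_end > next_start
                   for (_, prev_end), (next_start, _) in zip(spans, spans[1:]))
    out_of_bounds = spans[-1][1] > total_ints
    return overlaps, out_of_bounds


# Step of 4 (the Q1_0 macro) with a callee that consumes 8 ints per call.
print(check_loop(total_ints=16, loop_step=4, ints_per_call=8))
# Step of 8, matching what the reused q8_0/q8_1 dot product consumes.
print(check_loop(total_ints=16, loop_step=8, ints_per_call=8))
```

In this model, a step smaller than the number of ints consumed per call makes every pair of neighbouring calls overlap and pushes the final call past the tile, which is the failure mode the comment warns about.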

Copilot · 2026-04-18T00:01:59Z

+DEQUANT_TYPES = STANDARD_QUANT_TYPES | KQUANT_TYPES | IMATRIX_QUANT_TYPES | Q1_0_TYPES
+MMVQ_QUANT_TYPES = STANDARD_QUANT_TYPES | KQUANT_TYPES | IMATRIX_QUANT_TYPES | Q1_0_TYPES
+MMQ_QUANT_TYPES = STANDARD_QUANT_TYPES | KQUANT_TYPES | Q1_0_TYPES


The new Q1_0_TYPES are added to MMQ_QUANT_TYPES, which can make fused_moe_gguf() take the ggml_moe_a8 (MMQ) path for Q1_0 when x.shape[0] > 64. In gguf_kernel.cu, ggml_moe_get_block_size() and ggml_moe_a8() don’t handle type 41, so this will return a block size of 0 and/or skip launching any kernel. Either add Q1_0 support to the MMQ MoE kernels + block-size mapping, or exclude Q1_0 from the MMQ MoE fast path (e.g., keep it MMVQ-only for MoE for now).

Suggested change

-DEQUANT_TYPES = STANDARD_QUANT_TYPES | KQUANT_TYPES | IMATRIX_QUANT_TYPES | Q1_0_TYPES
-MMVQ_QUANT_TYPES = STANDARD_QUANT_TYPES | KQUANT_TYPES | IMATRIX_QUANT_TYPES | Q1_0_TYPES
-MMQ_QUANT_TYPES = STANDARD_QUANT_TYPES | KQUANT_TYPES | Q1_0_TYPES
+# NOTE: Q1_0 is intentionally excluded from MMQ_QUANT_TYPES because the MMQ
+# MoE kernels/block-size mapping do not support it yet. Keep Q1_0 on the
+# dequant/MMVQ paths until MMQ MoE support is added.
+DEQUANT_TYPES = STANDARD_QUANT_TYPES | KQUANT_TYPES | IMATRIX_QUANT_TYPES | Q1_0_TYPES
+MMVQ_QUANT_TYPES = STANDARD_QUANT_TYPES | KQUANT_TYPES | IMATRIX_QUANT_TYPES | Q1_0_TYPES
+MMQ_QUANT_TYPES = STANDARD_QUANT_TYPES | KQUANT_TYPES
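To make the routing consequence concrete, here is a minimal sketch of how set membership could steer a MoE call. It is a simplification based only on the description in this comment: `pick_moe_path` is a hypothetical helper, not the real `fused_moe_gguf()` logic, the type sets are trimmed placeholders, and the 64-token threshold is taken from the comment above.

```python
Q1_0 = 41            # type id stated in the PR overview
Q4_0, Q8_0 = 2, 8    # two standard GGML type ids, used here as placeholders

# Trimmed stand-ins for the real sets in gguf.py (assumption for illustration).
STANDARD_QUANT_TYPES = {Q4_0, Q8_0}
Q1_0_TYPES = {Q1_0}


def pick_moe_path(qtype, num_tokens, mmq_types, mmvq_types, threshold=64):
    """Hypothetical router: large batches of MMQ-capable types go to MMQ,
    other MMVQ-capable types go to MMVQ, and everything else is dequantized."""
    if num_tokens > threshold and qtype in mmq_types:
        return "mmq"
    if qtype in mmvq_types:
        return "mmvq"
    return "dequant"


mmvq_types = STANDARD_QUANT_TYPES | Q1_0_TYPES

# As submitted: Q1_0 is in the MMQ set, so a large batch is routed to MMQ.
print(pick_moe_path(Q1_0, 128, STANDARD_QUANT_TYPES | Q1_0_TYPES, mmvq_types))
# As suggested: Q1_0 is left out of the MMQ set and stays on the MMVQ path.
print(pick_moe_path(Q1_0, 128, STANDARD_QUANT_TYPES, mmvq_types))
```

Under these assumptions, dropping Q1_0 from the MMQ set is enough to keep large-batch MoE calls away from kernels that do not handle type 41.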

Copilot · 2026-04-18T00:01:59Z

+      case 41:
+        mul_mat_vec_q1_0_q8_1_cuda<scalar_t>(
+            (void*)W.data_ptr(), (void*)quant_X.data_ptr(), (scalar_t*)Y.data_ptr(), col, row, vecs, stream);
+        break;


This switch has no default/error handling. If an unsupported type is passed, no kernel will run and Y (allocated with torch::empty) will be returned uninitialized, leading to silent incorrect outputs. Consider adding a default: that raises (e.g., TORCH_CHECK(false, ...)) or otherwise initializes/handles the output for unknown types.

Suggested change

-        break;
+        break;
+      default:
+        TORCH_CHECK(false, "ggml_mul_mat_vec_a8: unsupported GGUF type: ", type);

Copilot · 2026-04-18T00:02:00Z

+            padded,
+            row,
+            stream);
+        break;


This switch has no default/error handling. If an unsupported type is passed, no kernel will run and Y (allocated with torch::empty) will be returned uninitialized, leading to silent incorrect outputs. Consider adding a default: that raises (e.g., TORCH_CHECK(false, ...)) or otherwise initializes/handles the output for unknown types.

Suggested change

-        break;
+        break;
+      default:
+        TORCH_CHECK(false, "Unsupported GGUF quantization type in ggml_mul_mat_a8: ", type);

Copilot · 2026-04-18T00:02:00Z

+      case 41:
+        moe_vec_q1_0_q8_1_cuda<scalar_t>(
+            (void*)W.data_ptr(),
+            (void*)quant_X.data_ptr(),
+            (scalar_t*)Y.data_ptr(),
+            (int*)topk_ids.data_ptr(),
+            top_k,
+            tokens,
+            col,
+            row,
+            quant_X.stride(0),
+            stream);
+        break;
    }


This switch has no default/error handling. If an unsupported type is passed, the function will silently return all-zeros (since Y is initialized with torch::zeros) and no kernel will run. Consider adding a default: that raises (e.g., TORCH_CHECK(false, ...)) to avoid masking unsupported quantization types as valid outputs.
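The fail-fast behaviour requested here can be modelled in a few lines of Python. This is an analogy only, not the CUDA extension code: `moe_vec_dispatch`, `fill_ones`, and `KERNELS` are hypothetical names, and the zero-initialised NumPy array stands in for the `torch::zeros` output buffer.

```python
import numpy as np


def fill_ones(y):
    """Stand-in for a real kernel launch: writes into the output buffer."""
    y[...] = 1.0


# Hypothetical dispatch table keyed by GGUF type id.
KERNELS = {41: fill_ones}


def moe_vec_dispatch(qtype, kernels, out_shape):
    """Run the kernel for `qtype`, or raise instead of returning the
    untouched zero buffer when the type has no registered kernel."""
    y = np.zeros(out_shape, dtype=np.float32)
    kernel = kernels.get(qtype)
    if kernel is None:
        # Without this check the caller would receive all zeros and could
        # not tell them apart from a legitimate result.
        raise ValueError(f"unsupported GGUF type: {qtype}")
    kernel(y)
    return y


print(moe_vec_dispatch(41, KERNELS, (2, 3)))
```

Raising on the unknown case mirrors what a `default:` branch with `TORCH_CHECK(false, ...)` would do in the C++ switch.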

Copilot · 2026-04-18T00:02:00Z

+
+    # 0. Add Q1_0 to the GGMLQuantizationType enum so GGUFReader can parse it
+    from gguf import GGMLQuantizationType
+    import enum


enum is imported but never used here, which will fail the repo’s configured Ruff check (F401 unused import). Please remove the import (or use it if it was intended).

Suggested change

-    import enum

Copilot · 2026-04-18T00:02:00Z

+def ensure_q1_0_gguf_compat():
+    """Register Q1_0 type in the gguf library if not already present."""
+    import gguf


New Q1_0 GGUF compatibility shim and dequantization logic is introduced here, but there are no unit tests covering (a) that ensure_q1_0_gguf_compat() makes gguf.GGUFReader accept type 41 and (b) that the numpy dequantization produces the expected ±scale outputs. Since the repo already has GGUF-related unit tests, consider adding a small targeted test that runs the shim against the installed gguf module and validates enum/size registration and dequantization for a tiny synthetic block.
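A test for point (b) might look roughly like the sketch below. It does not call the shim: `dequantize_q1_0_reference` is a hypothetical stand-in for the shim's numpy dequantizer, and the block layout (an fp16 scale followed by 16 bytes of sign bits, least significant bit first, with a set bit meaning `+scale`) is an assumption that would need to be checked against `block_q1_0` in `ggml-common.h`.

```python
import numpy as np

QK1_0 = 128              # assumed elements per block (128 sign bits)
BLOCK_BYTES = 2 + 16     # assumed layout: fp16 scale + 16 bytes of sign bits


def dequantize_q1_0_reference(raw: bytes) -> np.ndarray:
    """Reference dequantizer for the ASSUMED Q1_0 layout described above."""
    data = np.frombuffer(raw, dtype=np.uint8).reshape(-1, BLOCK_BYTES)
    # Reinterpret the first two bytes of each block as one fp16 scale.
    scale = data[:, :2].copy().view(np.float16).astype(np.float32)
    # Expand each byte into 8 bits, lowest bit first (assumed bit order).
    bits = np.unpackbits(data[:, 2:], axis=1, bitorder="little")
    signs = bits.astype(np.float32) * 2.0 - 1.0   # 1 -> +1, 0 -> -1
    return (signs * scale).reshape(-1)


def make_block(scale: float, sign_bytes: bytes) -> bytes:
    """Build one synthetic block in the assumed layout."""
    assert len(sign_bytes) == QK1_0 // 8
    return np.float16(scale).tobytes() + sign_bytes


# One block with scale 0.5 where only the first element has its bit set.
block = make_block(0.5, bytes([0b00000001] + [0] * 15))
values = dequantize_q1_0_reference(block)
print(values[:4])
```

A real test would swap the reference function for the dequantizer registered by `ensure_q1_0_gguf_compat()` and add assertions on the enum and size registration for type 41.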

Add GGUF Q1_0 kernel support (MMQ/MMVQ + enum patch)

1cfe32a

khosravipasha requested a review from Copilot April 17, 2026 23:54

Copilot started reviewing on behalf of khosravipasha April 17, 2026 23:54

Copilot AI reviewed Apr 18, 2026


khosravipasha wants to merge 1 commit into main from prism

2 participants
