This repository was archived by the owner on May 31, 2026. It is now read-only.

Add GGUF Q1_0 kernel support (MMQ/MMVQ + enum patch)#1

Draft
khosravipasha wants to merge 1 commit into main from prism

Conversation

@khosravipasha

Collaborator

Copilot AI left a comment

Pull request overview

Adds end-to-end support for GGUF Q1_0 (type id 41) across the CUDA GGUF kernels and Python GGUF loading stack, including a temporary Python-side compat shim for older gguf pip releases.

Changes:

  • Introduce Q1_0 block definition + CUDA dequantization, MMVQ vecdot, and MMQ tiling support.
  • Wire Q1_0 into GGUF matmul / matmul-vec / moe-vec dispatch paths in the CUDA extension.
  • Add a Python gguf monkey-patch shim and invoke it from config/model-loading code paths so GGUFReader can parse type 41.
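
To see why the reader needs the type registered before parsing, here is a minimal sketch of an idempotent size registration. It uses a stand-in dictionary rather than the real `gguf` module: `QUANT_SIZES`, `register_q1_0`, and `tensor_nbytes` are hypothetical names, and the Q1_0 entry (128 weights and 18 bytes per block) is an assumption, namely a 2-byte fp16 scale plus 16 bytes of packed bits.

```python
# Stand-in size table: type id -> (weights per block, bytes per block).
QUANT_SIZES = {
    0: (1, 4),    # F32
    8: (32, 34),  # Q8_0: 32 int8 weights + 2-byte fp16 scale
}

Q1_0_TYPE_ID = 41        # type id stated in the PR overview
Q1_0_SIZES = (128, 18)   # ASSUMPTION: 16 bytes of bits + 2-byte fp16 scale


def register_q1_0(sizes):
    """Add the Q1_0 entry if it is missing; safe to call repeatedly."""
    sizes.setdefault(Q1_0_TYPE_ID, Q1_0_SIZES)
    return sizes


def tensor_nbytes(sizes, type_id, n_elements):
    """Byte size of a tensor, which a reader needs to slice the file."""
    if type_id not in sizes:
        raise ValueError(f"unknown GGUF tensor type {type_id}")
    block_size, type_size = sizes[type_id]
    if n_elements % block_size:
        raise ValueError("element count is not a multiple of the block size")
    return n_elements // block_size * type_size
```

Calling `register_q1_0` from every entry point that touches a reader, as the PR does with its shim, is cheap because repeated calls leave the table unchanged.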

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 7 comments.

| File | Description |
| --- | --- |
| sgl-kernel/csrc/quantization/gguf/vecdotq.cuh | Adds Q1_0 vecdot and MMQ tile load/unpack support by mapping Q1_0 to a signed-byte layout. |
| sgl-kernel/csrc/quantization/gguf/moe_vec.cuh | Adds moe_vec_q1_0_q8_1_cuda launcher for Q1_0. |
| sgl-kernel/csrc/quantization/gguf/mmvq.cuh | Adds mul_mat_vec_q1_0_q8_1_cuda launcher for Q1_0. |
| sgl-kernel/csrc/quantization/gguf/mmq.cuh | Adds ggml_mul_mat_q1_0_q8_1_cuda (MMQ) kernel entrypoint for Q1_0. |
| sgl-kernel/csrc/quantization/gguf/gguf_kernel.cu | Adds dispatch cases for type 41 in matmul-vec / matmul / moe-vec wrappers. |
| sgl-kernel/csrc/quantization/gguf/ggml-common.h | Defines Q1_0 constants and block_q1_0 layout. |
| sgl-kernel/csrc/quantization/gguf/dequantize.cuh | Adds device dequantizer and registration for type 41. |
| python/sglang/srt/utils/hf_transformers/config.py | Ensures the Q1_0 compat shim runs before Transformers reads GGUF. |
| python/sglang/srt/utils/gguf_compat.py | New Python shim to extend gguf with Q1_0 enum/sizes + numpy dequantization. |
| python/sglang/srt/model_loader/weight_utils.py | Applies the shim before using GGUFReader in iterator helpers. |
| python/sglang/srt/model_loader/loader.py | Applies the shim before deriving GGUF↔HF tensor name mappings. |
| python/sglang/srt/layers/quantization/gguf.py | Adds Q1_0 to supported type sets and triggers shim at import time. |
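
The "signed-byte layout" mentioned for `vecdotq.cuh` can be illustrated in Python. This is a sketch, not the kernel code: `unpack_signs` and `vec_dot_q1_0_q8_1` are hypothetical names, and both the bit order (least-significant bit first) and the sign mapping (bit 1 decodes to +1, bit 0 to -1) are assumptions. Once the bits are expanded to signed bytes, the dot product has the same shape as an int8 by int8 product, which is presumably what allows the Q8_0 path to be reused.

```python
def unpack_signs(packed: bytes):
    """Expand packed sign bits into +1/-1 values (LSB first, ASSUMED)."""
    out = []
    for byte in packed:
        for bit in range(8):
            out.append(1 if (byte >> bit) & 1 else -1)
    return out


def vec_dot_q1_0_q8_1(packed: bytes, d_w: float, q8, d_x: float) -> float:
    """Dot product of one packed weight chunk with int8 activations.

    d_w is the weight scale and d_x the activation scale; both are
    plain floats here instead of fp16 values.
    """
    signs = unpack_signs(packed)
    if len(signs) != len(q8):
        raise ValueError("chunk length mismatch")
    # Integer accumulation first, scaling once at the end.
    int_sum = sum(s * q for s, q in zip(signs, q8))
    return d_w * d_x * int_sum
```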

// MMVQ = mul_mat_vec_q, MMQ = mul_mat_q

#define VDR_Q1_0_Q8_1_MMVQ 1 // Process one 32-element chunk at a time
#define VDR_Q1_0_Q8_1_MMQ 4 // Q1_0 has 128 bits (4 ints) per block

Copilot AI Apr 18, 2026

VDR_Q1_0_Q8_1_MMQ is defined as 4, but the MMQ path for Q1_0 reuses vec_dot_q8_0_q8_1_mul_mat, which internally processes VDR_Q8_0_Q8_1_MMQ (=8) ints per call. If VDR_Q1_0_Q8_1_MMQ is ever used to drive the MMQ loop step, it would cause overlapping reads and potential out-of-bounds access on the last iteration. Consider removing this macro (since it's currently unused) or setting it to the effective VDR required by the reused dot-product implementation.

Suggested change
- #define VDR_Q1_0_Q8_1_MMQ 4 // Q1_0 has 128 bits (4 ints) per block
+ #define VDR_Q1_0_Q8_1_MMQ 8 // Match the effective MMQ VDR of the reused q8_0/q8_1 dot-product implementation
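
The overlap described in this comment can be checked with plain index arithmetic. The sketch below is illustrative Python, not the CUDA loop: `ints_touched` and `summarize` are hypothetical helpers, and the tile width of 16 ints is an arbitrary assumption. It walks a tile with one loop step while every call reads a fixed number of consecutive ints.

```python
def ints_touched(tile_ints, step, ints_per_call):
    """List every int index read when looping with `step` while each
    call reads `ints_per_call` consecutive ints."""
    touched = []
    for start in range(0, tile_ints, step):
        touched.extend(range(start, start + ints_per_call))
    return touched


def summarize(tile_ints, step, ints_per_call):
    """Return (number of repeated reads, number of reads past the tile)."""
    touched = ints_touched(tile_ints, step, ints_per_call)
    overlapping = len(touched) - len(set(touched))
    out_of_bounds = sum(1 for i in touched if i >= tile_ints)
    return overlapping, out_of_bounds


# Step 4 with 8 ints per call: repeated reads and reads past the tile.
mismatched = summarize(16, 4, 8)
# Step 8 with 8 ints per call: each int is read exactly once.
matched = summarize(16, 8, 8)
```

With these assumed numbers, the mismatched loop re-reads 12 ints and reads 4 ints beyond the tile, while the matched loop does neither.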

Copilot uses AI. Check for mistakes.
Comment on lines +148 to +150
DEQUANT_TYPES = STANDARD_QUANT_TYPES | KQUANT_TYPES | IMATRIX_QUANT_TYPES | Q1_0_TYPES
MMVQ_QUANT_TYPES = STANDARD_QUANT_TYPES | KQUANT_TYPES | IMATRIX_QUANT_TYPES | Q1_0_TYPES
MMQ_QUANT_TYPES = STANDARD_QUANT_TYPES | KQUANT_TYPES | Q1_0_TYPES

Copilot AI Apr 18, 2026

The new Q1_0_TYPES are added to MMQ_QUANT_TYPES, which can make fused_moe_gguf() take the ggml_moe_a8 (MMQ) path for Q1_0 when x.shape[0] > 64. In gguf_kernel.cu, ggml_moe_get_block_size() and ggml_moe_a8() don’t handle type 41, so this will return a block size of 0 and/or skip launching any kernel. Either add Q1_0 support to the MMQ MoE kernels + block-size mapping, or exclude Q1_0 from the MMQ MoE fast path (e.g., keep it MMVQ-only for MoE for now).

Suggested change
- DEQUANT_TYPES = STANDARD_QUANT_TYPES | KQUANT_TYPES | IMATRIX_QUANT_TYPES | Q1_0_TYPES
- MMVQ_QUANT_TYPES = STANDARD_QUANT_TYPES | KQUANT_TYPES | IMATRIX_QUANT_TYPES | Q1_0_TYPES
- MMQ_QUANT_TYPES = STANDARD_QUANT_TYPES | KQUANT_TYPES | Q1_0_TYPES
+ # NOTE: Q1_0 is intentionally excluded from MMQ_QUANT_TYPES because the MMQ
+ # MoE kernels/block-size mapping do not support it yet. Keep Q1_0 on the
+ # dequant/MMVQ paths until MMQ MoE support is added.
+ DEQUANT_TYPES = STANDARD_QUANT_TYPES | KQUANT_TYPES | IMATRIX_QUANT_TYPES | Q1_0_TYPES
+ MMVQ_QUANT_TYPES = STANDARD_QUANT_TYPES | KQUANT_TYPES | IMATRIX_QUANT_TYPES | Q1_0_TYPES
+ MMQ_QUANT_TYPES = STANDARD_QUANT_TYPES | KQUANT_TYPES
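
The routing concern can be sketched as follows. Everything here is a simplified stand-in: `choose_moe_path` is a hypothetical function, the type sets hold only a few illustrative ids, and the control flow is an assumption. The only details taken from the comment are the 64-row threshold and the set-membership checks.

```python
Q1_0 = 41
ILLUSTRATIVE_TYPES = {2, 8}  # e.g. Q4_0 and Q8_0; not the real type sets


def choose_moe_path(qtype, num_tokens, mmq_types, mmvq_types):
    """Pick a kernel family from the type sets and the batch size."""
    if qtype in mmq_types and num_tokens > 64:
        return "mmq"
    if qtype in mmvq_types:
        return "mmvq"
    return "dequant"


mmvq_types = ILLUSTRATIVE_TYPES | {Q1_0}

# As written in the PR: Q1_0 is in the MMQ set, so large batches take
# the MMQ MoE path, which the comment says has no case for type 41.
before = choose_moe_path(Q1_0, 128, ILLUSTRATIVE_TYPES | {Q1_0}, mmvq_types)

# With the suggested change: Q1_0 stays on the MMVQ path at any batch size.
after = choose_moe_path(Q1_0, 128, ILLUSTRATIVE_TYPES, mmvq_types)
```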

    case 41:
      mul_mat_vec_q1_0_q8_1_cuda<scalar_t>(
          (void*)W.data_ptr(), (void*)quant_X.data_ptr(), (scalar_t*)Y.data_ptr(), col, row, vecs, stream);
      break;

Copilot AI Apr 18, 2026

This switch has no default/error handling. If an unsupported type is passed, no kernel will run and Y (allocated with torch::empty) will be returned uninitialized, leading to silent incorrect outputs. Consider adding a default: that raises (e.g., TORCH_CHECK(false, ...)) or otherwise initializes/handles the output for unknown types.

Suggested change
-       break;
+       break;
+     default:
+       TORCH_CHECK(false, "ggml_mul_mat_vec_a8: unsupported GGUF type: ", type);

padded,
row,
stream);
break;

Copilot AI Apr 18, 2026

This switch has no default/error handling. If an unsupported type is passed, no kernel will run and Y (allocated with torch::empty) will be returned uninitialized, leading to silent incorrect outputs. Consider adding a default: that raises (e.g., TORCH_CHECK(false, ...)) or otherwise initializes/handles the output for unknown types.

Suggested change
-       break;
+       break;
+     default:
+       TORCH_CHECK(false, "Unsupported GGUF quantization type in ggml_mul_mat_a8: ", type);

Comment on lines +823 to 836
    case 41:
      moe_vec_q1_0_q8_1_cuda<scalar_t>(
          (void*)W.data_ptr(),
          (void*)quant_X.data_ptr(),
          (scalar_t*)Y.data_ptr(),
          (int*)topk_ids.data_ptr(),
          top_k,
          tokens,
          col,
          row,
          quant_X.stride(0),
          stream);
      break;
  }

Copilot AI Apr 18, 2026

This switch has no default/error handling. If an unsupported type is passed, the function will silently return all-zeros (since Y is initialized with torch::zeros) and no kernel will run. Consider adding a default: that raises (e.g., TORCH_CHECK(false, ...)) to avoid masking unsupported quantization types as valid outputs.
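
The failure mode is the same in any language. As a Python analogy (not the extension's code; `moe_vec_dispatch`, `KERNELS`, and `_fake_kernel` are hypothetical), a dispatch with no fallback hands back the zero-initialized buffer untouched, while a fail-fast variant raises instead.

```python
def _fake_kernel(out):
    # Stand-in for a real kernel launch: fills the output buffer in place.
    for i in range(len(out)):
        out[i] = 1.0


KERNELS = {41: _fake_kernel}


def moe_vec_dispatch(qtype, n, strict):
    out = [0.0] * n  # mirrors a zero-initialized output tensor
    kernel = KERNELS.get(qtype)
    if kernel is None:
        if strict:
            # Analogous to a `default:` branch that raises.
            raise ValueError(f"unsupported GGUF type: {qtype}")
        return out  # looks like a valid result, but no kernel ran
    kernel(out)
    return out
```

In the non-strict mode a caller cannot tell an unsupported type apart from a computation whose result happens to be all zeros, which is the masking effect the comment warns about.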

# 0. Add Q1_0 to the GGMLQuantizationType enum so GGUFReader can parse it
from gguf import GGMLQuantizationType
import enum

Copilot AI Apr 18, 2026

enum is imported but never used here, which will fail the repo’s configured Ruff check (F401 unused import). Please remove the import (or use it if it was intended).

Suggested change
- import enum

Comment on lines +35 to +37
def ensure_q1_0_gguf_compat():
    """Register Q1_0 type in the gguf library if not already present."""
    import gguf

Copilot AI Apr 18, 2026

New Q1_0 GGUF compatibility shim and dequantization logic is introduced here, but there are no unit tests covering (a) that ensure_q1_0_gguf_compat() makes gguf.GGUFReader accept type 41 and (b) that the numpy dequantization produces the expected ±scale outputs. Since the repo already has GGUF-related unit tests, consider adding a small targeted test that runs the shim against the installed gguf module and validates enum/size registration and dequantization for a tiny synthetic block.
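
A test along these lines could start from a pure-Python reference for one synthetic block. The layout below is an assumption, not taken from the PR: a little-endian fp16 scale followed by 16 bytes of sign bits, read least-significant bit first, with bit 1 decoding to +scale and bit 0 to -scale. `dequantize_q1_0_block` and `make_block` are hypothetical names, and the real shim is described as using numpy rather than `struct`.

```python
import struct

BLOCK_WEIGHTS = 128                   # 128 bits per block, per the kernel comment
BLOCK_BYTES = 2 + BLOCK_WEIGHTS // 8  # ASSUMED: fp16 scale + packed sign bits


def make_block(scale: float, packed_bits: bytes) -> bytes:
    """Build one synthetic block in the assumed layout."""
    if len(packed_bits) != BLOCK_WEIGHTS // 8:
        raise ValueError("expected 16 bytes of packed bits")
    return struct.pack("<e", scale) + packed_bits


def dequantize_q1_0_block(block: bytes):
    """Decode one block into 128 floats, each +scale or -scale."""
    if len(block) != BLOCK_BYTES:
        raise ValueError(f"expected {BLOCK_BYTES} bytes, got {len(block)}")
    (scale,) = struct.unpack("<e", block[:2])
    out = []
    for byte in block[2:]:
        for bit in range(8):
            out.append(scale if (byte >> bit) & 1 else -scale)
    return out


# Only the first weight has its bit set; all others decode to -scale.
values = dequantize_q1_0_block(make_block(0.5, bytes([0x01] + [0x00] * 15)))
```

A real test would compare the shim's numpy output against a reference like this for a handful of blocks, once the actual `block_q1_0` layout is confirmed against `ggml-common.h`.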


2 participants