Releases: openvinotoolkit/nncf
v3.1.0
- General:
- Migrated `NNCFGraph` from `nx.DiGraph` to `nx.MultiDiGraph` to support models with parallel/multi-edges, enabling correct quantization of models with complex graph structures such as YOLO26 and models like `a = conv(x); return a * a` (#3843).
- Features:
- (OpenVINO) Added the NVFP4 (`f4e2m1`) compression data type in Weight Compression. NVFP4 uses a constant group size of 16 with scales compressed to `f8e4m3` using a second-degree scale (#3967). See the sketch below this list.
- (OpenVINO) Added the `backup_mode` parameter for FP compression formats (MXFP4, MXFP8, FP4, FP8), allowing first/last layers to be compressed with a backup FP format instead of INT8 (#3886).
- (OpenVINO) The RoPE ignored pattern was updated to handle operations without a preceding transpose, as in the Phi-3.5-MoE-instruct model (#3989).
- (PyTorch) Added `TopKMetatype` support for the TorchFX backend, enabling correct graph building for models with TopK operations such as YOLO26 (#3944).
- (PyTorch) Migrated to `torchao` instead of the deprecated `torch.ao` (#3854).
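A minimal sketch of applying the new NVFP4 data type through `nncf.compress_weights()`. The `NVFP4` enum member name is an assumption inferred from the notes above, not confirmed API:

```python
import nncf
import openvino as ov

model = ov.Core().read_model("model.xml")

# NVFP4 weight compression; per the notes above, the group size is fixed
# at 16 internally and scales are stored as f8e4m3.
compressed = nncf.compress_weights(
    model,
    mode=nncf.CompressWeightsMode.NVFP4,  # assumed enum member name
)
```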
- Fixes:
- Improvements:
- (PyTorch) Added lazy import for the `nncf.torch` module to reduce startup import time (#3862).
- Tutorials:
- Post-Training Optimization of Gemma 4 Model
- Post-Training Optimization of Code-specialized LLMs
- Post-Training Optimization of Vision-Language Models (VLMs)
- Post-Training Optimization of MiniCPM-o 4.5 Multimodal Model
- Post-Training Optimization of PaddleOCR-VL/PaddleOCR-VL-1.5 Models
- Post-Training Optimization of RAG pipeline
- Requirements:
v3.0.0
Post-training Quantization:
- Breaking changes:
- Renamed the `nncf.CompressWeightsMode.CB4_F8E4M3` mode option to `nncf.CompressWeightsMode.CB4`.
- General:
- Added the `nncf.prune` API function, which provides a unified interface for pruning algorithms. It is currently available for the PyTorch backend and supports Magnitude Pruning. More details about the new API can be found in the documentation.
- Added the `nncf.build_graph` API function for building an `NNCFGraph` from a model. This API can be used to inspect the graph and define the ignored scope (see the sketch below this list).
- Added documentation about using `nncf.IgnoredScope`.
- Reworked `HWConfig`, which now uses a Python-style definition of the hardware configuration instead of JSON files.
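A sketch of using `nncf.build_graph()` to discover node names before defining an ignored scope. The exact signature (in particular whether the PyTorch backend takes an `example_input` for tracing) is an assumption based on the description above:

```python
import torch
import torchvision
import nncf

model = torchvision.models.resnet18().eval()

# Build the NNCFGraph to inspect node names and types (signature assumed).
graph = nncf.build_graph(model, example_input=torch.ones(1, 3, 224, 224))
for node in graph.get_all_nodes():
    print(node.node_name, node.node_type)

# Use the discovered names to exclude layers from optimization.
ignored_scope = nncf.IgnoredScope(names=["resnet18/Linear[fc]/linear_0"])  # hypothetical name
```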
- Features:
- Added support for models containing MatMul operations with transposed activation inputs in data-free Weight Compression and data-aware AWQ algorithms.
- (OpenVINO) Introduced new experimental compression data type ADAPTIVE_CODEBOOK. This compression type calculates a unique codebook for each MatMul or block of identical MatMuls (for example, all down_proj could have the same codebook). This approach reduces quality degradation in the case of per-channel weight compression. See example.
- (TorchFX) Preview support for the new `compress_pt2e` API has been introduced, enabling quantization of `torch.fx.GraphModule` models with the `OpenVINOQuantizer`. Users can now compress their models in ExecuTorch for the OpenVINO backend via the NNCF `compress_pt2e` API, employing Scale Estimation and AWQ.
- (PyTorch) Added support for linear functions in the Fast Bias Correction algorithm to improve the accuracy of such models after quantization.
- (OpenVINO) Added activation profiler tool to collect and visualize tensor statistics.
- Fixes:
- (ONNX) Fixed the `compress_quantize_weights_transformation()` method by removing names of deleted initializers from graph inputs.
- (ONNX) Fixed incorrect insertion of MatMulNBits nodes.
- Improvements:
- Added support for the compression of 3D weights in the AWQ, Scale Estimation, and GPTQ algorithms. Models with MoE (Mixture of Experts) blocks, such as GPT-OSS-20B and Qwen3-30B-A3B, can now be compressed with data-aware methods.
- Tutorials:
- Post-Training Quantization of YOLO26 OpenVINO Model
- Post-Training Optimization of Wan2.2 Model
- Post-Training Optimization of DeepSeek-OCR Model
- Post-Training Optimization of Z-Image-Turbo Model
- Post-Training Optimization of Qwen-Image Model
- Post-Training Optimization of Qwen3-TTS Model
- Post-Training Optimization of Qwen3-ASR Model
- Post-Training Optimization of Fun-ASR-Nano Model
- Post-Training Optimization of Fun-CosyVoice 3.0 Model
Deprecations/Removals:
- (TensorFlow) Removed support for TensorFlow backend.
- (PyTorch) Removed the legacy `create_compressed_model` API for the PyTorch backend, which was previously marked as deprecated.
- (PyTorch) Removed legacy algorithms for PyTorch that were based on `NNCFNetwork`: NAS, Structural Pruning, AutoML, Knowledge Distillation, Mixed-Precision Quantization, and Movement Sparsity.
Requirements:
- Dropped `jsonschema`, `natsort`, and `pymoo` from dependencies as they are no longer required.
- Updated `numpy` to `>=1.24.0, <2.5.0`.
Acknowledgements
Thanks for contributions from the OpenVINO developer community:
@avolkov-intel @Shehrozkashif @ruro @mostafafaheem
v2.19.0
Post-training Quantization:
- Breaking changes:
- (OpenVINO) The `nncf.CompressWeightsMode.E2M1` mode option is renamed to `nncf.CompressWeightsMode.MXFP4`.
- Features:
- The Histogram Aggregator was introduced: it records the running histogram of tensor values and computes the quantization range that minimizes the L2 norm of the histogram bin quantization error. The Histogram Aggregator improved accuracy metrics for a number of PTQ classification models. It can be activated using `RangeEstimatorParametersSet.HISTOGRAM` through the `AdvancedQuantizationParameters` in `nncf.quantize()` (see the sketch below this list).
- (OpenVINO) Introduced several new compression modes in `nncf.CompressWeightsMode`: `MXFP8`, `FP8`, and `FP4`. These can be used as the `mode` option in `nncf.compress_weights()` to apply the corresponding MXFP8, FP8, or FP4 precisions (experimental).
- The weight compression bitwidth distribution table now also displays the group size value for each compression data type.
- (ONNX) Support for the SmoothQuant algorithm has been added to the ONNX backend for INT8 quantization.
- (ONNX) A new transformation has been added to optimize models by folding `QuantizeLinear` nodes with constant inputs into precomputed, quantized initializers. This behavior is controlled by the `COMPRESS_WEIGHTS` backend parameter in `nncf.quantize()`, which is now enabled (`True`) by default.
- (ONNX) Support has been added for applying the Fast Bias/Bias Correction algorithm to `MatMul` + `Add` subgraphs where one of the inputs to the `Add` operation is a constant. Previously, these cases were skipped because the `MatMul` operation was not recognized as having a bias, preventing the algorithm from being applied.
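A sketch of activating the histogram-based range estimator. The import paths and the `activations_range_estimator_params` field follow NNCF's existing advanced-parameters layout, but treat them as assumptions:

```python
import numpy as np
import openvino as ov
import nncf
from nncf.quantization.advanced_parameters import AdvancedQuantizationParameters
from nncf.quantization.range_estimator import RangeEstimatorParametersSet

model = ov.Core().read_model("model.xml")
calibration_dataset = nncf.Dataset([np.zeros((1, 3, 224, 224), np.float32)])

quantized = nncf.quantize(
    model,
    calibration_dataset,
    advanced_parameters=AdvancedQuantizationParameters(
        # Pick activation ranges that minimize the L2 norm of the
        # per-bin quantization error of the running histogram.
        activations_range_estimator_params=RangeEstimatorParametersSet.HISTOGRAM,
    ),
)
```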
- Fixes:
- Added an ignored pattern for the position embedding layer in the Segment Anything model.
- (ONNX) Fixed incorrect input handling for the `MatMulNBits` operation that previously caused graph breaks.
- (ONNX) Resolved an issue with INT4 weight compression in the `Gemm` operation when `transB=1`.
- Fixed a typo in the `_get_smooth_quant_param_grid()` method reported in #3613.
- Improvements:
- Maximum memory consumption during statistic collection has been reduced by releasing model output memory before the next statistic collection inference call.
- Reduced peak memory footprint for Bias Correction algorithm.
- (OpenVINO) Reduced the time (by up to 3x) and memory (by up to 1.5x) it takes to compress models to the `MXFP4` data type.
- Tutorials:
- Other:
- Refined the handling of layers whose channel size is not divisible by the group size during weight compression. By default, an error is now raised in such cases, and the error message suggests providing a different group size value or using `GroupSizeFallbackMode.ADJUST` to automatically adjust the group size for problematic layers.
Compression-aware training:
- Improvements:
- Optimized `nncf.strip()` for `StripFormat.IN_PLACE`; `example_input` is no longer required.
Deprecations/Removals:
- (TensorFlow) The TensorFlow backend is now deprecated and will be removed in future releases. It is recommended to use PyTorch analogous models for training-aware optimization methods and OpenVINO IR, PyTorch, and ONNX models for post-training optimization methods from NNCF.
- The following experimental NNCF methods are deprecated and will be removed in future releases: NAS, Structural Pruning, AutoML, Knowledge Distillation, Mixed-Precision Quantization, Movement Sparsity.
Requirements:
- Updated PyTorch (2.9.0) and Torchvision (0.24.0) versions.
- Dropped support for Python 3.9.
v2.18.0
Post-training Quantization:
- Features:
- (OpenVINO) Introduced new compression data types CB4_F8E4M3 and CODEBOOK. CB4_F8E4M3 is a fixed codebook with 16 fp8 values based on NF4 data type values. CODEBOOK is an arbitrary user-selectable codebook that can be used to experiment with different data types. Both data types are used for weight compression. The AWQ and scale estimation algorithms are supported for these data types.
- (OpenVINO) Added support for compressing FP8 (f8e4m3 and f8e5m2) weights to 4-bit data types, which is particularly beneficial for models like DeepSeek-R1.
- Added the `group_size_fallback_mode` parameter for advanced weight compression. It controls how nodes that do not support the default group size are handled: by default (`IGNORE`), such nodes are skipped; with `ERROR`, an exception is raised if the channel size is not divisible by the group size; and `ADJUST` attempts to modify the group size so it becomes valid (see the sketch below this list).
- (TorchFX) Added support for external quantizers in the `quantize_pt2e` API, including `XNNPACKQuantizer` and `CoreMLQuantizer`. Users can now quantize their models in ExecuTorch for the XNNPACK and CoreML backends via the NNCF `quantize_pt2e` API, employing SmoothQuant, bias correction algorithms, and a wide range of statistic collectors.
- (ONNX) Added support for data-aware weight compression in the ONNX backend, including the AWQ and Scale Estimation algorithms. Provided an example demonstrating the data-aware weight compression pipeline using the `TinyLlama/TinyLlama-1.1B-Chat-v1.0` model in ONNX format.
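A minimal sketch of selecting a group size fallback mode; that `group_size_fallback_mode` lives under `AdvancedCompressionParameters` and that a `GroupSizeFallbackMode` enum is exported from `nncf` are assumptions based on the wording above:

```python
import openvino as ov
import nncf
from nncf.quantization.advanced_parameters import AdvancedCompressionParameters

model = ov.Core().read_model("llm.xml")

compressed = nncf.compress_weights(
    model,
    mode=nncf.CompressWeightsMode.INT4_SYM,
    group_size=128,
    advanced_parameters=AdvancedCompressionParameters(
        # Shrink the group size for layers whose channel size is not
        # divisible by 128 instead of skipping them (placement assumed).
        group_size_fallback_mode=nncf.GroupSizeFallbackMode.ADJUST,
    ),
)
```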
- Improvements:
- Support for weight compression of models with the Rotary Positional Embedding block.
- Support for weight compression of models with stateful self-attention blocks.
- Tutorials:
Compression-aware training:
- Features:
- (PyTorch) Enhanced initialization for "QAT with absorbable LoRA" using advanced compression methods (AWQ + Scale Estimation). This improvement replaces the previous basic data-free compression approach, enabling QAT to start with a more accurate model baseline and achieve superior final accuracy.
- Improvements:
- (PyTorch) Streamlined "QAT with absorbable LoRA" by removing checkpoint selection based on validation set. This change significantly reduces overall tuning time and maximum allocated memory. While the results on Wikitext are slightly worse, it provides a more efficient and faster tuning pipeline (e.g. reduced from 32 minutes to 25 minutes for SmoLM-1.7B).
- Tutorials:
Deprecations/Removals:
- Removed examples that used the `create_compressed_model` API.
Requirements:
- Updated PyTorch (2.8.0) and Torchvision (0.23.0) versions.
- Require `setuptools>=77` to build the package.
Acknowledgements
Thanks for contributions from the OpenVINO developer community:
@bopeng1234 @jpablomch
v2.17.0
Post-training Quantization:
- General:
- (PyTorch) The `function_hook` module is now the default mechanism for model tracing. It has graduated from experimental status and moved to the core `nncf.torch` namespace.
- Features:
- (OpenVINO, PyTorch, TorchFX) Added 4-bit data-free AWQ (Activation-aware Weight Quantization) based on the per-column magnitudes of the weights making it possible to apply AWQ without a dataset for more accurate compression.
- (OpenVINO) Added support for quantizing the value input of ScaledDotProductAttention for FP8.
- (ONNX) Added support for data-free weight compression using INT4 (INT8) in the ONNX backend, along with an example for LLM weight compression. The example showcases the optimization of the `TinyLlama-1.1B-Chat-v0.3` model in ONNX format using the NNCF weight compression API (see the sketch below this list).
- (ONNX) Added the `BackendParameters.EXTERNAL_DATA_DIR` parameter for the ONNX backend. This parameter specifies the absolute path to the directory where the model's external data files are stored; all external data files must be located in the same directory. It should be used when the model is loaded without external data via `onnx.load("model.onnx", load_external_data=False)` and the external data files are not in the current working directory of the process. It can be omitted if the external data files are located in the current working directory.
- (TorchFX, Experimental) Added support for 4-bit weight compression with the AWQ and Scale Estimation data-aware methods to reduce accuracy loss.
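A sketch of data-free INT4 weight compression for an ONNX model with external weights. Only the general `compress_weights()` flow is taken from the notes above; the wiring of the external-data directory through `backend_params` is an assumption:

```python
import onnx
import nncf
from nncf.quantization.advanced_parameters import AdvancedCompressionParameters

# Load only the graph; the external weight files stay on disk.
model = onnx.load("model.onnx", load_external_data=False)

compressed = nncf.compress_weights(
    model,
    mode=nncf.CompressWeightsMode.INT4_ASYM,
    ratio=1.0,
    group_size=128,
    advanced_parameters=AdvancedCompressionParameters(
        # Hypothetical plumbing for the external-data directory; the real
        # constant is BackendParameters.EXTERNAL_DATA_DIR per the notes above.
        backend_params={"external_data_dir": "/abs/path/to/weights"},
    ),
)
onnx.save(compressed, "model_int4.onnx")
```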
- Fixes:
- (TorchFX, Experimental) To simplify usage, the nncf.torch.disable_patching() context manager has been made redundant and is no longer required (example).
- Fixed BiasCorrection failures with models without a batch dimension.
- Aligned quantile centers for NF4 with OpenVINO implementation.
- Weight compression statistics collection has been fixed to show the data types of ignored weights.
- Improvements:
- (OpenVINO) Added the version of NNCF to rt_info.
- Optimized weight compression for NF4 (up to 10x speed up).
- Support for `transformers>4.52` in `nncf.data.generate_text_data`.
- Tutorials:
- Post-Training Optimization of MiniCPM-o 2.6 Model
- Post-Training Optimization of Qwen2.5-Omni Model
- Post-Training Optimization of InternVideo2 Model
- Post-Training Optimization of OpenVoice2 and MeloTTS Models
- Post-Training Optimization of Flex.2 Model
- Post-Training Optimization of Wan2.1 Model
- Post-Training Optimization of Phi-4-mini Model
- Post-Training Optimization of Torch.FX Stable Diffusion v3 Model
Compression-aware training:
- Features:
- (PyTorch) For downstream tasks, we introduce Quantization-Aware Training (QAT) with absorbable elastic LoRA adapters and neural low-rank search (NLS). This novel weight compression method enhances the accuracy of Large Language Models (LLMs) with int4 weights on downstream tasks, achieving a reduction in accuracy loss during compression compared to the best post-training weight compression technique in NNCF (Scale Estimation + AWQ + GPTQ). The `nncf.compress_weights` API now includes a new `compression_format` option, `nncf.CompressionFormat.FQ_LORA_NLS`. A sample QAT compression pipeline with preview support is available here. Building on our previous work with absorbable LoRA adapters, this new pipeline is specifically designed for downstream tasks. In contrast, the pipeline from the previous release was tailored to enhance general accuracy through knowledge distillation using static rank settings. For a more comprehensive understanding of both approaches, please refer to "Weight-Only Quantization Aware Training with LoRA and NLS" in the "Training-Time Compression Algorithms" section of the main README in the repository.
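A minimal sketch of requesting the new compression format; whether the PyTorch backend needs a calibration `dataset` here, and how dict items are fed to the model, are assumptions:

```python
import nncf
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
example = dict(tokenizer("Hello, world", return_tensors="pt"))

# Wrap int4 weights in FakeQuantize + absorbable LoRA-NLS adapters for QAT.
model = nncf.compress_weights(
    model,
    mode=nncf.CompressWeightsMode.INT4_ASYM,
    compression_format=nncf.CompressionFormat.FQ_LORA_NLS,
    dataset=nncf.Dataset([example]),
)
# ...fine-tune `model` on the downstream task, then strip/export for OpenVINO.
```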
- Fixes:
- (PyTorch) Minimized the disparity in accuracy between the Torch model and its exported OpenVINO equivalent for "Weight-Only Quantization Aware Training with LoRA and NLS".
- Improvements:
- (PyTorch) The evaluation and selection process for the best checkpoint in "QAT + absorbable LoRA" with knowledge distillation has been revised. The tuned Torch model is now evaluated using the validation split of Wikitext, while the final results are measured on the test split with the OpenVINO model. The results table for Wikitext has been updated accordingly and now includes three additional models.
Requirements:
- Updated ONNX Runtime (1.21.1).
- Updated PyTorch (2.7.1) and Torchvision (0.22.1) versions.
- Removed jstyleson from requirements.
v2.16.0
Post-training Quantization:
Features:
- (PyTorch) Added support for 4-bit weight compression with AWQ and Scale Estimation data-aware methods to reduce quality loss.
- (PyTorch, Experimental) Introduced TorchFunctionMode support for MinMax, FastBiasCorrection, SmoothQuant, WeightCompression algorithms.
Fixes:
- Fixed occasional failures of the weights compression algorithm on ARM CPUs.
- Fixed GPTQ failures with per-channel int4 weight compression.
- Fixed weight compression failures for models with fp8 weights.
- (PyTorch, Experimental) Fixed weights compression for float16/bfloat16 models.
- (PyTorch, Experimental) Fixed several memory leak issues: non-detached tensors, extracted modules & graph building with gradients.
Improvements:
- Reduced the run time and peak memory of the mixed precision assignment procedure during weight compression in the OpenVINO backend. Overall compression time reduction in the mixed precision case is about 20-40%; peak memory reduction is about 20%.
- The NNCF hardware config has been extended with the `narrow_range` parameter, enabling more combinations of quantization configurations in the MinMax quantization algorithm.
- (TorchFX, Experimental) Added quantization support for TorchFX models exported with dynamic shapes.
- (TorchFX, Experimental) The constant folding step was removed from the `quantize_pt2e` function and the `transform_for_annotation` method of the `OpenVINOQuantizer` to align with the `torch.ao` quantization implementation.
- Optimized GPTQ algorithm behavior to decrease memory and time consumption by 2.71x and 1.16x, respectively.
- Added general support for optimization of models with FP8 and NF4 weights.
- Disabled applying the overflow fix for non-8-bit quantization.
Tutorials:
- Post-Training Optimization of Gemma3 Model
- Post-Training Optimization of GLM4-V Model
- Post-Training Optimization of Llasa Model
- Post-Training Optimization of YOLOv12 Model
- Post-Training Optimization of Phi-4-multimodal Model
- Post-Training Optimization of Qwen2.5VL Model
- Post-Training Optimization of DeepSeek-VL2 Model
- Post-Training Optimization of FLUX.1 Fill Model
- Post-Training Optimization of olmOCR Model
- Post-Training Optimization of SmolDocling Model
- Post-Training Optimization of SmolVLM2 Model
- Post-Training Optimization of GOT-OCR 2.0 Model
- Post-Training Optimization of LTX-Video Model
- Post-Training Optimization of OuteTTS Model
- Post-Training Optimization of SigLIP2 Model
- Post-Training Optimization of OpenCLIP Model
Compression-aware training:
Features:
- (PyTorch) Introduced a novel weight compression method to significantly improve the accuracy of Large Language Models (LLMs) with int4 weights. Leveraging Quantization-Aware Training (QAT) and absorbable LoRA adapters, this approach can achieve a 2x reduction in accuracy loss during compression compared to the best post-training weight compression technique in NNCF (Scale Estimation + AWQ + GPTQ). The `nncf.compress_weights` API now includes a new `compression_format` option, `nncf.CompressionFormat.FQ_LORA`, for this QAT method; a sample compression pipeline with preview support is available here.
- (PyTorch) Changed the compression modules serialization API: `compressed_model.nncf.get_config` was changed to `nncf.torch.get_config` (see the sketch below). The documentation was updated to use the new API.
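A sketch of the serialization round-trip with the relocated API; the `load_from_config` signature is an assumption:

```python
import torch
import torchvision
import nncf
import nncf.torch

example_input = torch.ones(1, 3, 224, 224)
quantized = nncf.quantize(torchvision.models.resnet18().eval(), nncf.Dataset([example_input]))

# Serialize the compression state alongside the weights.
ckpt = {"state_dict": quantized.state_dict(), "nncf_config": nncf.torch.get_config(quantized)}
torch.save(ckpt, "ckpt.pt")

# Restore the compression modules onto a fresh model instance.
ckpt = torch.load("ckpt.pt")
restored = nncf.torch.load_from_config(
    torchvision.models.resnet18().eval(), ckpt["nncf_config"], example_input
)
restored.load_state_dict(ckpt["state_dict"])
```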
Requirements:
- Updated PyTorch (2.6.0) and Torchvision (0.21.0) versions.
- Updated Transformers (>=4.48.0) version.
- Updated NumPy (<2.3.0) version support.
- Updated NetworkX (<3.5.0) version support.
Acknowledgements
Thanks for contributions from the OpenVINO developer community:
@shumaari
v2.15.0
Post-training Quantization:
Features:
- (TensorFlow) The `nncf.quantize()` method is now the recommended API for Quantization-Aware Training. Please refer to an example for more details about how to use the new approach.
- (TensorFlow) Compression layer placement in the model can now be serialized and restored with new API functions: `nncf.tensorflow.get_config()` and `nncf.tensorflow.load_from_config()`. Please see the documentation on saving/loading of a quantized model for more details.
- (OpenVINO) Added an example with LLM quantization to FP8 precision.
- (TorchFX, Experimental) Preview support for the new `quantize_pt2e` API has been introduced, enabling quantization of `torch.fx.GraphModule` models with the `OpenVINOQuantizer` and `X86InductorQuantizer` quantizers. The `quantize_pt2e` API utilizes MinMax algorithm statistic collectors, as well as the SmoothQuant, BiasCorrection, and FastBiasCorrection Post-Training Quantization algorithms (see the sketch below this list).
- Added unification of scales for the ScaledDotProductAttention operation.
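A minimal sketch of the `quantize_pt2e` flow. The import paths and the way the FX graph is captured are assumptions (NNCF examples of that period used the pre-autograd capture API; a current `torch.export` capture is shown here):

```python
import torch
import torchvision
import nncf
# Assumed import locations for the preview API.
from nncf.experimental.torch.fx import quantize_pt2e, OpenVINOQuantizer

model = torchvision.models.mobilenet_v2().eval()
example_args = (torch.ones(1, 3, 224, 224),)

# Capture the model as a torch.fx.GraphModule.
fx_model = torch.export.export_for_training(model, example_args).module()

quantized = quantize_pt2e(
    fx_model,
    OpenVINOQuantizer(),
    calibration_dataset=nncf.Dataset([example_args[0]]),
)
```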
Fixes:
- (ONNX) Fixed sporadic accuracy issues with the BiasCorrection algorithm.
- (ONNX) Fixed GroupConvolution operation weight quantization, which also improves performance for a number of models.
- Fixed the AccuracyAwareQuantization algorithm to resolve issue #3118.
- Fixed an issue with NNCF usage when a backend framework installation is potentially corrupted.
Improvements:
- (TorchFX, Experimental) Added YoloV11 support.
- (OpenVINO) The performance of the FastBiasCorrection algorithm was improved.
- Significantly faster data-free weight compression for OpenVINO models: INT4 compression is now up to 10x faster, while INT8 compression is up to 3x faster. The larger the model, the greater the time reduction.
- AWQ weight compression is now up to 2x faster, improving overall runtime efficiency.
- Peak memory usage during INT4 data-free weight compression in the OpenVINO backend is reduced by up to 50% for certain models.
Tutorials:
- Post-Training Optimization of GLM-Edge-V Model
- Post-Training Optimization of OmniGen Model
- Post-Training Optimization of Sana Models
- Post-Training Optimization of BGE Models
- Post-Training Optimization of Stable Diffusion Inpainting Model
- Post-Training Optimization of LTX Video Model
- Post-Training Optimization of DeepSeek-R1-Distill Model
- Post-Training Optimization of Janus DeepSeek-LLM-1.3b Model
Deprecations/Removals:
- (TensorFlow) The `nncf.tensorflow.create_compressed_model()` method is now marked as deprecated. Please use the `nncf.quantize()` method for quantization initialization.
Requirements:
- Updated the minimal version for `numpy` (>=1.24.0).
- Removed the `tqdm` dependency.
Acknowledgements
Thanks for contributions from the OpenVINO developer community:
@rk119
@devesh-2002
v2.14.1
v2.14.0
Post-training Quantization:
Features:
- Introduced the optional `backup_mode` parameter in `nncf.compress_weights()` to specify the data type for embeddings, convolutions, and last linear layers during 4-bit weight compression. Available options are `INT8_ASYM` (default), `INT8_SYM`, and `NONE`, which retains the original floating-point precision of the model weights (see the sketch below this list).
- Added the `quantizer_propagation_rule` parameter, providing fine-grained control over quantizer propagation. This advanced option is designed to improve accuracy for models where quantizers with different granularity could be merged to per-tensor, potentially affecting model accuracy.
- Introduced the `nncf.data.generate_text_data` API method that utilizes an LLM to generate data for further data-aware optimization. See the example for details.
- (OpenVINO) Extended support of data-free and data-aware weight compression methods for `nncf.compress_weights()` with NF4 per-channel quantization, which makes compressed LLMs more accurate and faster on NPU.
- (OpenVINO) Introduced a new option, `statistics_path`, to cache and reuse statistics for `nncf.compress_weights()`, reducing the time required to find optimal compression configurations. See the TinyLlama example for details.
- (TorchFX, Experimental) Added support for quantization and weight compression of Torch FX models. The compressed models can be directly executed via `torch.compile(compressed_model, backend="openvino")` (see details here). Added an INT8 quantization example. The list of supported features:
  - INT8 quantization with the SmoothQuant, MinMax, FastBiasCorrection, and BiasCorrection algorithms via `nncf.quantize()`.
  - Data-free INT8, INT4, and mixed-precision weight compression with `nncf.compress_weights()`.
- (PyTorch, Experimental) Added model tracing and execution pre-post hooks based on TorchFunctionMode.
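A sketch combining the new `backup_mode` and `statistics_path` options; placing `statistics_path` under `AdvancedCompressionParameters` is an assumption:

```python
import openvino as ov
import nncf
from nncf.quantization.advanced_parameters import AdvancedCompressionParameters

model = ov.Core().read_model("llm.xml")

compressed = nncf.compress_weights(
    model,
    mode=nncf.CompressWeightsMode.INT4_SYM,
    group_size=64,
    # Keep embeddings, convolutions, and last linear layers in the
    # original floating-point precision instead of INT8.
    backup_mode=nncf.BackupMode.NONE,
    # Cache collected statistics so repeated runs with different
    # compression settings can reuse them (placement assumed).
    advanced_parameters=AdvancedCompressionParameters(statistics_path="stats_cache"),
)
```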
Fixes:
- Resolved an issue with redundant quantizer insertion before elementwise operations, reducing noise introduced by quantization.
- Fixed a type mismatch issue for `nncf.quantize_with_accuracy_control()`.
- Fixed the BiasCorrection algorithm for specific branching cases.
- (OpenVINO) Fixed the GPTQ weight compression method for Stable Diffusion models.
- (OpenVINO) Fixed an issue with the variational statistics processing for `nncf.compress_weights()`.
- (PyTorch, ONNX) The scaled dot product attention pattern quantization setup is aligned with OpenVINO.
Improvements:
- Reduction in peak memory by 30-50% for data-aware `nncf.compress_weights()` with the AWQ, Scale Estimation, LoRA, and mixed-precision algorithms.
- Reduction in compression time by 10-20% for `nncf.compress_weights()` with the AWQ algorithm.
- Aligned behavior for ignored subgraphs between different `networkx` versions.
- Extended ignored patterns with the RoPE block for the `nncf.ModelType.TRANSFORMER` scheme.
- (OpenVINO) Extended the ignored scope for the `nncf.ModelType.TRANSFORMER` scheme with the GroupNorm metatype.
- (ONNX) The SE-block ignored pattern variant for `torchvision` mobilenet_v3 has been extended.
Tutorials:
- Post-Training Optimization of Llama-3.2-11B-Vision Model
- Post-Training Optimization of YOLOv11 Model
- Post-Training Optimization of Whisper in Automatic speech recognition with OpenVINO Generate API
- Post-Training Optimization of Pixtral Model
- Post-Training Optimization of LLM ReAct Agent Model
- Post-Training Optimization of CatVTON Model
- Post-Training Optimization of Stable Diffusion v3 Model in Torch FX Representation
Known issues:
- (ONNX) The `nncf.quantize()` method can generate inaccurate INT8 results for MobileNet models with the BiasCorrection algorithm.
Deprecations/Removals:
- Migrated from `setup.py` to `pyproject.toml` for the build and package configuration, aligning with the Python packaging standards outlined in PEP 517 and PEP 518. Installation through `setup.py` no longer works. There is no impact on installation from PyPI and Conda.
- Removed support for Python 3.8.
- (PyTorch) The `nncf.torch.create_compressed_model()` function has been deprecated.
Requirements:
- Updated ONNX (1.17.0) and ONNXRuntime (1.19.2) versions.
- Updated PyTorch (2.5.1) and Torchvision (0.20.1) versions.
- Updated NumPy (<2.2.0) version support.
- Updated Ultralytics (8.3.22) version.
Acknowledgements
Thanks for contributions from the OpenVINO developer community:
@rk119
@zina-cs
v2.13.0
Post-training Quantization:
Features:
- (OpenVINO) Added support for combining GPTQ with the AWQ and Scale Estimation (SE) algorithms in `nncf.compress_weights()` for more accurate weight compression of LLMs. The following combinations with GPTQ are now supported: AWQ+GPTQ+SE, AWQ+GPTQ, GPTQ+SE, and GPTQ (see the sketch below this list).
- (OpenVINO) Added the LoRA Correction algorithm to further improve the accuracy of int4-compressed models on top of other algorithms (AWQ and Scale Estimation). It can be enabled via the optional `lora_correction` parameter of the `nncf.compress_weights()` API. The algorithm increases compression time and incurs a negligible model size overhead. Refer to the accuracy/footprint trade-off for different int4 compression methods.
- (PyTorch) Added an implementation of the experimental Post-training Activation Pruning algorithm. Refer to Activation Sparsity for details.
- Added a memory monitoring tool for logging the memory that a piece of Python code or a script allocates. Refer to NNCF tools for details.
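A sketch of the algorithm combinations above on an OpenVINO LLM; the boolean flags are existing `compress_weights()` parameters, while the calibration input shape is illustrative:

```python
import numpy as np
import openvino as ov
import nncf

model = ov.Core().read_model("llm.xml")
calibration = nncf.Dataset([np.ones((1, 32), dtype=np.int64)])  # illustrative token IDs

# AWQ + GPTQ + Scale Estimation combination.
compressed = nncf.compress_weights(
    model,
    mode=nncf.CompressWeightsMode.INT4_SYM,
    ratio=1.0,
    group_size=128,
    dataset=calibration,
    awq=True,
    gptq=True,
    scale_estimation=True,
)

# Alternatively, enable LoRA Correction on top of AWQ + Scale Estimation.
compressed = nncf.compress_weights(
    model,
    mode=nncf.CompressWeightsMode.INT4_SYM,
    dataset=calibration,
    awq=True,
    scale_estimation=True,
    lora_correction=True,
)
```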
Fixes:
- (OpenVINO) Fixed the quantization of Convolution and LSTMSequence operations in cases where some inputs are part of a ShapeOf subgraph.
- (OpenVINO) Fixed issue with the FakeConvert duplication for FP8.
- Fixed a Smooth Quant algorithm issue in the case of incorrect shapes.
- Fixed non-deterministic layer-wise scheduling.
Improvements:
- (OpenVINO) Increased hardware-fused pattern coverage.
- Improved progress bar logic during weights compression for more accurate remaining time estimation.
- Extended Scale Estimation bitness range support for `nncf.compress_weights()`.
- Removed extra logging for the algorithm-generated ignored scope.
Tutorials:
- Post-Training Optimization of Flux.1 Model
- Post-Training Optimization of PixArt-α Model
- Post-Training Optimization of InternVL2 Model
- Post-Training Optimization of Qwen2Audio Model
- Post-Training Optimization of NuExtract Model
- Post-Training Optimization of MiniCPM-V2 Model
Compression-aware training:
Fixes:
- (PyTorch) Fixed some scenarios of NNCF patching interfering with `torch.compile`.
Requirements:
- Updated PyTorch (2.4.0) and Torchvision (0.19.0) versions.
Acknowledgements
Thanks for contributions from the OpenVINO developer community:
@rk119