Support vLLM-based Model Quantization with llm_compressor Export #1978
Open
changwangss wants to merge 3 commits into
Signed-off-by: changwangss <chang1.wang@intel.com>
Description
Summary
Adds `--enable_vllm_loading` to load a model through the vLLM engine before quantization, enabling quantization of models that rely on vLLM's kernel infrastructure (e.g., `FusedMoE`, `ColumnParallelLinear`) and exporting them in `llm_compressor` format for direct vLLM inference.
Supported schemes: `W4A16`, `W8A16`, `MXFP8`, `MXFP4`, `NVFP4`
Constraint: Single-GPU only (`tensor_parallel_size=1`); enforced at both CLI and API level.
Key Changes
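For the single-GPU constraint stated in the summary, a guard over the JSON-valued `--vllm_model_kwargs` flag might look like the minimal sketch below. Only the two flag names and `tensor_parallel_size` come from this PR; `build_parser` and `check_vllm_kwargs` are hypothetical helpers for illustration, not the functions in `cli/parser.py` or `cli/main.py`.

```python
import argparse
import json


def build_parser() -> argparse.ArgumentParser:
    # Flag names are taken from the PR summary; defaults are assumptions.
    parser = argparse.ArgumentParser()
    parser.add_argument("--enable_vllm_loading", action="store_true")
    parser.add_argument("--vllm_model_kwargs", type=str, default="{}")
    return parser


def check_vllm_kwargs(raw_json: str) -> dict:
    """Parse the JSON kwargs and reject anything other than TP=1 (hypothetical helper)."""
    kwargs = json.loads(raw_json)
    if not isinstance(kwargs, dict):
        raise ValueError("--vllm_model_kwargs must be a JSON object")
    tp = kwargs.get("tensor_parallel_size", 1)
    if tp != 1:
        raise ValueError(
            f"vLLM loading supports tensor_parallel_size=1 only, got {tp}"
        )
    return kwargs


# Usage: parse the flags, then validate before any model is loaded.
args = build_parser().parse_args(
    ["--enable_vllm_loading", "--vllm_model_kwargs", '{"max_model_len": 2048}']
)
if args.enable_vllm_loading:
    vllm_kwargs = check_vllm_kwargs(args.vllm_model_kwargs)
```

Running the same check at the API entry point as well as in the CLI would cover callers that bypass argument parsing, which is one way to read "enforced at both CLI and API level".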
| File | Change |
| --- | --- |
| `cli/parser.py`, `cli/main.py` | `--enable_vllm_loading` and `--vllm_model_kwargs` (JSON) CLI flags; TP=1 guard |
| `compressors/vllm_mixin.py` (new) | `VLLMMixin`: registers vLLM `LinearBase` layers into `layer_config` |
| `compressors/vllm/vllm_calibrator.py` (new) | `llm.generate()` for block-hook calibration; registers act_max hooks for static activation quantization (NVFP4) |
| `compressors/vllm/linearized_fused_moe.py` (new) | Converts `FusedMoE` (3-D weight tensors) into per-expert `nn.Linear` before quantization |
| `context/model.py` | `linearize_vllm_moe()` in `prepare_model()` |
| `wrapper.py` | `vllm_linear_forward`: replaces vLLM `quant_method.apply` with `F.linear` (valid at TP=1) |
| `export/export_to_llmcompressor/export.py` | `LinearBase` → `nn.Linear` before packing; unfuse `qkv_proj` → `q/k/v_proj` and `gate_up_proj` → `gate/up_proj` after packing |
| `export/export_to_llmcompressor/export_to_fp.py` | |
| `algorithms/quantization/rtn/quantizer.py` | `list(block.named_modules())` snapshot to prevent `RuntimeError` when unfuse mutates `_modules` during iteration |
| `calibration/hooks.py` | `hidden_states` as keyword arg; recognize `LinearBase` as leaf layer (not block) |
| `compressors/data_driven.py` | `int64` tensors (positions) in `input_others`; skip `collect_reference` when vLLM `ForwardContext` is unavailable (safe for `iters=0`) |

Validated Tests
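The `quantizer.py` entry under Key Changes (snapshotting `named_modules()` before unfusing) can be reproduced without vLLM or PyTorch. The sketch below uses a hypothetical `Block` class as a stand-in for a module whose children live in a `_modules` dict; `unfuse_qkv`, `quantize_unsafe`, and `quantize_safe` are illustrative names, not code from this PR.

```python
class Block:
    """Tiny stand-in for a module: children are stored in a `_modules` dict."""

    def __init__(self, children):
        self._modules = dict(children)

    def named_modules(self):
        # Lazy generator over the live dict, so mutation is visible mid-loop.
        for name, module in self._modules.items():
            yield name, module


def unfuse_qkv(block, name):
    """Replace one fused child with three split children (mutates `_modules`)."""
    fused = block._modules.pop(name)
    for part in ("q_proj", "k_proj", "v_proj"):
        block._modules[part] = f"{part} split from {fused}"


def quantize_unsafe(block):
    # Iterates the live dict while unfuse_qkv changes its size.
    for name, _ in block.named_modules():
        if name == "qkv_proj":
            unfuse_qkv(block, name)


def quantize_safe(block):
    # list(...) takes a snapshot first, so later mutation cannot break the loop.
    for name, _ in list(block.named_modules()):
        if name == "qkv_proj":
            unfuse_qkv(block, name)


try:
    quantize_unsafe(Block({"qkv_proj": "fused", "o_proj": "dense"}))
except RuntimeError as err:
    print("unsafe loop failed:", err)

safe_block = Block({"qkv_proj": "fused", "o_proj": "dense"})
quantize_safe(safe_block)
print(list(safe_block._modules))
```

The snapshot costs one extra list per block. In exchange, the loop visits exactly the modules that existed before unfusing, so the newly created `q_proj`, `k_proj`, and `v_proj` entries are not revisited in the same pass.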
Environment: vLLM v0.22.1, A100-80G, single GPU
Quantization
Inference (vLLM)
from vllm import LLM, SamplingParams
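Only the import line of the inference check survives here, so the following is a hedged sketch of what such a check could look like, not the script used for validation. It assumes the exported `config.json` records the method under `quantization_config.quant_method`, as compressed-tensors checkpoints commonly do; `read_quant_method` and `generate` are hypothetical helper names.

```python
import json
from pathlib import Path


def read_quant_method(model_dir):
    """Return quantization_config.quant_method from config.json, or None if absent."""
    config = json.loads((Path(model_dir) / "config.json").read_text())
    quant_config = config.get("quantization_config") or {}
    return quant_config.get("quant_method")


def generate(model_dir, prompts, max_tokens=32):
    """Load an exported checkpoint with vLLM and return one completion per prompt."""
    # Imported lazily because it needs vLLM installed and a GPU available.
    from vllm import LLM, SamplingParams

    # No explicit quantization= argument: vLLM reads it from config.json.
    llm = LLM(model=str(model_dir))
    params = SamplingParams(temperature=0.0, max_tokens=max_tokens)
    outputs = llm.generate(prompts, params)
    return [out.outputs[0].text for out in outputs]
```

Calling `read_quant_method` on the export directory before `generate` gives a cheap way to confirm that the checkpoint advertises its quantization method before paying the cost of loading the model.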
`quantization=compressed-tensors` is loaded automatically from `quantization_config` in `config.json`. All four quantized models loaded and generated correct outputs.
Known Limitations
- TP > 1 raises `ValueError`: wrapper forward and unfuse logic assume TP=1.
- … `--enable_vllm_loading` and raise an early error.
- `iters > 0` (gradient-based tuning) with vLLM loading is not tested; `collect_reference` skips are safe only for `iters=0`.
Type of Change
New feature
Related Issues
Fixes or relates to #1119
Checklist Before Submitting
/azp run Unit-Test-CUDA-AutoRound.