Support vLLM-based Model Quantization with llm_compressor Export#1978

Open
changwangss wants to merge 3 commits into main from wangchang/vllmmixin

@changwangss changwangss commented Jul 1, 2026

Contributor

Description

Summary

Adds --enable_vllm_loading to load a model through the vLLM engine before quantization, enabling quantization of models that rely on vLLM's kernel infrastructure (e.g., FusedMoE, ColumnParallelLinear) and exporting them in llm_compressor format for direct vLLM inference.

Supported schemes: W4A16, W8A16, MXFP8, MXFP4, NVFP4
Constraint: Single-GPU only (tensor_parallel_size=1); enforced at both CLI and API level.
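A minimal sketch of how the CLI-level guard could look, assuming --vllm_model_kwargs carries a JSON object that may contain tensor_parallel_size. The function name parse_vllm_args and the error messages are hypothetical and are not taken from this PR's code:

```python
import argparse
import json


def parse_vllm_args(argv):
    """Parse the two vLLM-related flags and enforce the single-GPU constraint.

    Hypothetical helper for illustration; the PR's real parser may differ.
    """
    parser = argparse.ArgumentParser()
    parser.add_argument("--enable_vllm_loading", action="store_true")
    # Assumption: the kwargs arrive as a JSON object encoded in a string.
    parser.add_argument("--vllm_model_kwargs", type=str, default="{}")
    args = parser.parse_args(argv)

    kwargs = json.loads(args.vllm_model_kwargs)
    if not isinstance(kwargs, dict):
        raise ValueError("--vllm_model_kwargs must be a JSON object")

    # A missing tensor_parallel_size is treated as 1.
    tp = int(kwargs.get("tensor_parallel_size", 1))
    if args.enable_vllm_loading and tp != 1:
        raise ValueError(
            f"--enable_vllm_loading requires tensor_parallel_size=1, got {tp}"
        )
    return args.enable_vllm_loading, kwargs
```

Repeating the same check at the API level means callers that bypass the CLI hit the same early error.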


Key Changes

| Component | Change |
| --- | --- |
| cli/parser.py, cli/main.py | Add --enable_vllm_loading and --vllm_model_kwargs (JSON) CLI flags; TP=1 guard |
| compressors/vllm_mixin.py (new) | VLLMMixin: registers vLLM LinearBase layers into layer_config |
| compressors/vllm/vllm_calibrator.py (new) | Drives llm.generate() for block-hook calibration; registers act_max hooks for static activation quantization (NVFP4) |
| compressors/vllm/linearized_fused_moe.py (new) | Decomposes vLLM FusedMoE (3-D weight tensors) into per-expert nn.Linear before quantization |
| context/model.py | Add vLLM load path; call linearize_vllm_moe() in prepare_model() |
| wrapper.py | vllm_linear_forward: replaces vLLM quant_method.apply with F.linear (valid at TP=1) |
| export/export_to_llmcompressor/export.py | Integer formats: convert LinearBase → nn.Linear before packing; unfuse qkv_proj → q/k/v_proj and gate_up_proj → gate/up_proj after packing |
| export/export_to_llmcompressor/export_to_fp.py | FP formats: same unfuse logic for MXFP8/MXFP4/NVFP4 |
| algorithms/quantization/rtn/quantizer.py | list(block.named_modules()) snapshot to prevent RuntimeError when unfuse mutates _modules during iteration |
| calibration/hooks.py | Pass hidden_states as keyword arg; recognize LinearBase as leaf layer (not block) |
| compressors/data_driven.py | Preserve int64 tensors (positions) in input_others; skip collect_reference when vLLM ForwardContext is unavailable (safe for iters=0) |
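The snapshot fix in the RTN quantizer can be illustrated without PyTorch. The sketch below uses a plain dict as a stand-in for a module's _modules mapping; the function names and string values are hypothetical. Iterating the live view while replacing a fused entry raises RuntimeError, while iterating a list() snapshot does not:

```python
def unfuse_with_snapshot(modules):
    """Replace the fused entry with its parts while iterating over a snapshot."""
    for name, value in list(modules.items()):  # snapshot taken before mutation
        if name == "qkv_proj":
            del modules[name]
            for part in ("q_proj", "k_proj", "v_proj"):
                modules[part] = f"{part} split from {value}"
    return modules


def unfuse_without_snapshot(modules):
    """Same mutation, but over the live view; Python detects the size change."""
    for name, value in modules.items():  # live view of the dict
        if name == "qkv_proj":
            del modules[name]
            for part in ("q_proj", "k_proj", "v_proj"):
                modules[part] = f"{part} split from {value}"
    return modules
```

The snapshot costs one extra list per block, which is negligible next to the quantization work itself.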

Validated Tests

Environment: vLLM v0.22.1, A100-80G, single GPU

Quantization

# W4A16: Qwen3-30B-A3B, Qwen3-4B (dense/MoE, RTN zero-shot)
auto-round \
  --model /dataset/Qwen3-30B-A3B \
  --format llm_compressor \
  --scheme W4A16 \
  --output_dir /tmp/qwen3_30b_w4a16 \
  --enable_vllm_loading \
  --iters 0 \
  --disable_opt_rtn

# W4A16: Qwen3-30B-A3B, Qwen3-4B (dense/MoE, Opt-RTN)
auto-round \
  --model /dataset/Qwen3-30B-A3B \
  --format llm_compressor \
  --scheme W4A16 \
  --output_dir /tmp/qwen3_30b_w4a16 \
  --enable_vllm_loading \
  --iters 0 \
  --nsamples 128

# NVFP4: Qwen3-30B-A3B, Qwen3-4B (dense/MoE, RTN zero-shot, static activation quantization)
auto-round \
  --model /dataset/Qwen3-30B-A3B \
  --format llm_compressor \
  --scheme NVFP4 \
  --output_dir /tmp/qwen3_4b_nvfp4 \
  --enable_vllm_loading \
  --disable_opt_rtn \
  --iters 0 \
  --nsamples 128

# MXFP4: Qwen3-30B-A3B, Qwen3-4B (dense/MoE, RTN zero-shot)
auto-round \
  --model /dataset/Qwen3-4B \
  --format llm_compressor \
  --scheme MXFP4 \
  --output_dir /tmp/qwen3_4b_mxfp4 \
  --enable_vllm_loading \
  --disable_opt_rtn \
  --iters 0

# MXFP8: Qwen3-30B-A3B, Qwen3-4B (dense/MoE, RTN zero-shot)
auto-round \
  --model /dataset/Qwen3-30B-A3B \
  --format llm_compressor \
  --scheme MXFP8 \
  --output_dir /tmp/qwen3_30b_mxfp8 \
  --enable_vllm_loading \
  --disable_opt_rtn \
  --iters 0

Inference (vLLM)
from vllm import LLM, SamplingParams

llm = LLM(
    model="/tmp/qwen3_30b_w4a16/w4g128",  # or any output dir above
    max_model_len=512,
)
outputs = llm.generate(
    ["What is the capital of France?"],
    SamplingParams(temperature=0.0, max_tokens=32),
)
print(outputs[0].outputs[0].text)

No quantization argument is needed: vLLM picks up quantization=compressed-tensors automatically from quantization_config in config.json. All four quantized models (W4A16, NVFP4, MXFP4, MXFP8) loaded and generated correct outputs.
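Before launching vLLM, it can be useful to confirm that an exported directory actually carries a quantization_config. The sketch below reads config.json with the standard library only. The helper name read_quant_method is hypothetical, and the quant_method key with the value compressed-tensors is an assumption about the exported layout rather than something this PR states:

```python
import json
from pathlib import Path


def read_quant_method(model_dir):
    """Return quantization_config.quant_method from config.json, or None.

    Assumption: the export writes a "quant_method" key inside
    "quantization_config"; adjust the key if the real layout differs.
    """
    config = json.loads((Path(model_dir) / "config.json").read_text())
    quant_cfg = config.get("quantization_config")
    if quant_cfg is None:
        return None
    return quant_cfg.get("quant_method")
```

A None result would suggest that the export step did not write the quantization metadata, in which case vLLM would treat the checkpoint as unquantized.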

Known Limitations

  • tensor_parallel_size > 1 is explicitly rejected with a ValueError, because the wrapper forward and unfuse logic assume TP=1.
  • Multimodal and diffusion models conflict with --enable_vllm_loading and raise an early error.
  • iters > 0 (gradient-based tuning) with vLLM loading is not tested; the collect_reference skips are safe only for iters=0.
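To illustrate why the unfuse logic depends on TP=1, here is a toy sketch in pure Python. It assumes the fused qkv weight stores its output rows in the order q, then k, then v, and that with TP=2 each rank holds half of each part. All dimensions and the split_rows helper are hypothetical:

```python
def split_rows(fused_rows, sizes):
    """Split a fused weight (a list of output rows) into consecutive chunks."""
    if sum(sizes) != len(fused_rows):
        raise ValueError(
            f"expected {sum(sizes)} rows for sizes {sizes}, got {len(fused_rows)}"
        )
    parts, start = [], 0
    for size in sizes:
        parts.append(fused_rows[start:start + size])
        start += size
    return parts


# Toy dimensions (illustrative only): 4 query heads, 2 KV heads, head_dim 2.
num_heads, num_kv_heads, head_dim, hidden = 4, 2, 2, 3
q_size, kv_size = num_heads * head_dim, num_kv_heads * head_dim

# At TP=1 the fused weight holds every row, so the full sizes line up.
fused_qkv = [[float(r)] * hidden for r in range(q_size + 2 * kv_size)]
q, k, v = split_rows(fused_qkv, [q_size, kv_size, kv_size])

# Assumed TP=2 layout: one rank holds half of q, half of k, and half of v.
rank0_shard = q[: q_size // 2] + k[: kv_size // 2] + v[: kv_size // 2]
```

Calling split_rows(rank0_shard, [q_size, kv_size, kv_size]) raises ValueError because the shard holds only half the rows, which is the kind of mismatch the TP=1 guard avoids.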

Type of Change

New feature

Related Issues

Fixes or relates to #1119

Checklist Before Submitting

  • My code has been tested locally.
  • Documentation has been updated as needed.
  • New or updated tests are included where applicable.
  • The CUDA CI has passed. You can trigger it by commenting /azp run Unit-Test-CUDA-AutoRound.

changwangss and others added 3 commits July 1, 2026 14:30
Signed-off-by: changwangss <chang1.wang@intel.com>