Support vLLM-based Model Quantization with llm_compressor Export by changwangss · Pull Request #1978 · intel/auto-round

changwangss · 2026-07-01T14:49:43Z

Description

Summary

Adds --enable_vllm_loading to load a model through the vLLM engine before quantization, enabling quantization of models that rely on vLLM's kernel infrastructure (e.g., FusedMoE, ColumnParallelLinear) and exporting them in llm_compressor format for direct vLLM inference.

Supported schemes: W4A16, W8A16, MXFP8, MXFP4, NVFP4
Constraint: Single-GPU only (tensor_parallel_size=1); enforced at both CLI and API level.

Key Changes

| Component | Change |
| --- | --- |
| `cli/parser.py`, `cli/main.py` | Add `--enable_vllm_loading` and `--vllm_model_kwargs` (JSON) CLI flags; TP=1 guard |
| `compressors/vllm_mixin.py` (new) | `VLLMMixin`: registers vLLM `LinearBase` layers into `layer_config` |
| `compressors/vllm/vllm_calibrator.py` (new) | Drives `llm.generate()` for block-hook calibration; registers act_max hooks for static activation quantization (NVFP4) |
| `compressors/vllm/linearized_fused_moe.py` (new) | Decomposes vLLM `FusedMoE` (3-D weight tensors) into per-expert `nn.Linear` before quantization |
| `context/model.py` | Add vLLM load path; call `linearize_vllm_moe()` in `prepare_model()` |
| `wrapper.py` | `vllm_linear_forward`: replaces vLLM `quant_method.apply` with `F.linear` (valid at TP=1) |
| `export/export_to_llmcompressor/export.py` | Integer formats: convert `LinearBase`→`nn.Linear` before packing; unfuse `qkv_proj`→`q/k/v_proj` and `gate_up_proj`→`gate/up_proj` after packing |
| `export/export_to_llmcompressor/export_to_fp.py` | FP formats: same unfuse logic for MXFP8/MXFP4/NVFP4 |
| `algorithms/quantization/rtn/quantizer.py` | `list(block.named_modules())` snapshot to prevent `RuntimeError` when unfuse mutates `_modules` during iteration |
| `calibration/hooks.py` | Pass `hidden_states` as keyword arg; recognize `LinearBase` as leaf layer (not block) |
| `compressors/data_driven.py` | Preserve `int64` tensors (`positions`) in `input_others`; skip `collect_reference` when vLLM `ForwardContext` is unavailable (safe for `iters=0`) |
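
The `quantizer.py` change rests on a general Python rule: adding or removing keys of a dict while iterating over it raises `RuntimeError`, whereas iterating over a `list(...)` snapshot is safe. The sketch below illustrates only that rule. It uses a plain dict as a stand-in for a block's `_modules`, and the function names are hypothetical, not taken from the PR.

```python
def unfuse_with_snapshot(modules):
    """Replace the fused "qkv_proj" entry with q/k/v entries, in place."""
    # list(...) copies the (name, module) pairs first, so the loop walks a
    # fixed snapshot while `modules` itself is free to change.
    for name, module in list(modules.items()):
        if name == "qkv_proj":
            del modules[name]
            for part in ("q_proj", "k_proj", "v_proj"):
                modules[part] = f"{part} (split from {module})"


def unfuse_without_snapshot(modules):
    """Same edit, but iterating the live dict; Python raises RuntimeError."""
    for name, module in modules.items():
        if name == "qkv_proj":
            del modules[name]
            for part in ("q_proj", "k_proj", "v_proj"):
                modules[part] = f"{part} (split from {module})"


# Stand-in for a transformer block's sub-module mapping (illustrative only).
block = {"qkv_proj": "fused", "o_proj": "dense"}
unfuse_with_snapshot(block)
print(list(block))  # ['o_proj', 'q_proj', 'k_proj', 'v_proj']

try:
    unfuse_without_snapshot({"qkv_proj": "fused", "o_proj": "dense"})
except RuntimeError as err:
    print("RuntimeError:", err)
```

The snapshot costs one extra list of references per block, which is negligible next to the weights themselves.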

Validated Tests

Environment: vLLM v0.22.1, A100-80G, single GPU

Quantization

# W4A16: Qwen3-30B-A3B, Qwen3-4B (MoE/dense, RTN zero-shot)
auto-round \
  --model /dataset/Qwen3-30B-A3B \
  --format llm_compressor \
  --scheme W4A16 \
  --output_dir /tmp/qwen3_30b_w4a16 \
  --enable_vllm_loading \
  --iters 0 \
  --disable_opt_rtn

# W4A16: Qwen3-30B-A3B, Qwen3-4B (MoE/dense, Opt-RTN)
auto-round \
  --model /dataset/Qwen3-30B-A3B \
  --format llm_compressor \
  --scheme W4A16 \
  --output_dir /tmp/qwen3_30b_w4a16 \
  --enable_vllm_loading \
  --iters 0 \
  --nsamples 128

# NVFP4: Qwen3-30B-A3B, Qwen3-4B (MoE/dense, RTN zero-shot, static activation quantization)
auto-round \
  --model /dataset/Qwen3-30B-A3B \
  --format llm_compressor \
  --scheme NVFP4 \
  --output_dir /tmp/qwen3_4b_nvfp4 \
  --enable_vllm_loading \
  --disable_opt_rtn \
  --iters 0 \
  --nsamples 128

# MXFP4: Qwen3-30B-A3B, Qwen3-4B (MoE/dense, RTN zero-shot)
auto-round \
  --model /dataset/Qwen3-4B \
  --format llm_compressor \
  --scheme MXFP4 \
  --output_dir /tmp/qwen3_4b_mxfp4 \
  --enable_vllm_loading \
  --disable_opt_rtn \
  --iters 0

# MXFP8: Qwen3-30B-A3B, Qwen3-4B (MoE/dense, RTN zero-shot)
auto-round \
  --model /dataset/Qwen3-30B-A3B \
  --format llm_compressor \
  --scheme MXFP8 \
  --output_dir /tmp/qwen3_30b_mxfp8 \
  --enable_vllm_loading \
  --disable_opt_rtn \
  --iters 0

Inference (vLLM)
from vllm import LLM, SamplingParams

llm = LLM(
    model="/tmp/qwen3_30b_w4a16/w4g128",  # or any output dir above
    max_model_len=512,
)
outputs = llm.generate(
    ["What is the capital of France?"],
    SamplingParams(temperature=0.0, max_tokens=32),
)
print(outputs[0].outputs[0].text)

vLLM selects quantization=compressed-tensors automatically from the quantization_config entry in config.json, so it does not need to be passed to LLM(). All four quantized models loaded and generated correct outputs.
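
Before launching vLLM, it can be useful to confirm that an export directory actually carries this entry. The sketch below is a minimal check, assuming the quantization method is stored under a `quant_method` key inside `quantization_config`. The helper name `read_quant_method` and the temporary directory contents are made up for illustration and are not part of the PR.

```python
import json
import tempfile
from pathlib import Path


def read_quant_method(model_dir):
    """Return quantization_config["quant_method"] from config.json, or None."""
    config = json.loads((Path(model_dir) / "config.json").read_text())
    # Assumption: the method name lives under the "quant_method" key.
    return config.get("quantization_config", {}).get("quant_method")


# Exercise the helper on a stand-in export directory with a made-up config.
with tempfile.TemporaryDirectory() as tmp:
    fake_config = {"quantization_config": {"quant_method": "compressed-tensors"}}
    (Path(tmp) / "config.json").write_text(json.dumps(fake_config))
    print(read_quant_method(tmp))  # compressed-tensors
```

On a real export, the argument would be the directory passed to `LLM(model=...)`; a return value of `None` would suggest that no quantization config was written.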

Known Limitations

tensor_parallel_size > 1 is explicitly rejected with a ValueError, because the wrapper forward and unfuse logic assume TP=1.
Multimodal and diffusion models conflict with --enable_vllm_loading and raise an early error.
iters > 0 (gradient-based tuning) with vLLM loading is not tested; the collect_reference skips are safe only for iters=0.
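
The tensor-parallel restriction can be pictured as a small validation step applied to the `--vllm_model_kwargs` JSON before the engine is built. The following is a minimal sketch under stated assumptions, not the PR's code: the helper name `parse_vllm_model_kwargs` and the error messages are hypothetical, and only the `tensor_parallel_size` key and the `ValueError` come from the description.

```python
import json


def parse_vllm_model_kwargs(raw):
    """Parse the JSON string given to --vllm_model_kwargs and enforce TP=1.

    Hypothetical helper for illustration; not taken from the PR.
    """
    kwargs = json.loads(raw) if raw else {}
    if not isinstance(kwargs, dict):
        raise ValueError("--vllm_model_kwargs must be a JSON object")
    # A missing key is treated as the single-GPU default.
    tp = kwargs.get("tensor_parallel_size", 1)
    if tp > 1:
        raise ValueError(
            f"tensor_parallel_size={tp} is not supported with vLLM loading; use 1"
        )
    return kwargs


print(parse_vllm_model_kwargs('{"max_model_len": 512}'))  # {'max_model_len': 512}

try:
    parse_vllm_model_kwargs('{"tensor_parallel_size": 2}')
except ValueError as err:
    print("rejected:", err)
```

Failing at argument-parsing time keeps the error ahead of the expensive model load, which is the usual reason to place such a guard at both the CLI and the API entry point.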

Type of Change

New feature

Related Issues

Fixes or relates to #1119

Checklist Before Submitting

My code has been tested locally.
Documentation has been updated as needed.
New or updated tests are included where applicable.
The CUDA CI has passed. You can trigger it by commenting /azp run Unit-Test-CUDA-AutoRound.

changwangss and others added 3 commits July 1, 2026 14:30

support vllm based model quantization

6939e02

Signed-off-by: changwangss <chang1.wang@intel.com>

improve nvfp4 moe

9285b86

Signed-off-by: changwangss <chang1.wang@intel.com>

[pre-commit.ci] auto fixes from pre-commit.com hooks

31773b8

for more information, see https://pre-commit.ci

changwangss wants to merge 3 commits into main from wangchang/vllmmixin
