
perf: improve default/fallback backend implementation for blockwise quantization ops #1960

Open: matthewdouglas wants to merge 2 commits into main from improve-default-blockwise-quant-ops


@matthewdouglas
Member

Improves the four ops for blockwise quantization/dequantization on the default/fallback backend. These ops are used when no device-specific implementation is available. On CPU, only one of the four applies (quantize_4bit). On MPS and on PrivateUse1 devices such as HPU, more of them apply.

This does not impact CUDA/ROCm/XPU implementations.

Impacted ops

Op CPU MPS HPU Other PrivateUse1
quantize_4bit ✓ *
quantize_blockwise -
dequantize_blockwise -
dequantize_4bit - ✓ * -

The ops marked with * may affect MPS, pending a planned follow-up PR. They will likely become fallbacks for the kernels hosted on kernels-community, which are only built for macOS 26+ and will fail on macOS 15 or when kernels is not installed.

On impacted devices, these ops are used for:

  • Model loading: quantize_4bit and quantize_blockwise (nested absmax compression)
  • Inference/Training: dequantize_blockwise (nested absmax decompression), dequantize_4bit
  • 8-bit optimizers: quantize_blockwise + dequantize_blockwise for optimizer state
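
To make "nested absmax compression" concrete, here is a rough sketch of the storage arithmetic. It assumes fp32 absmax values, 8-bit codes for the compressed absmax, and a second-level blocksize of 256; those numbers and the helper name `absmax_overhead_bits` are illustrative assumptions, not taken from this PR.

```python
def absmax_overhead_bits(blocksize=64, nested=False, nested_blocksize=256):
    """Bits of absmax metadata stored per quantized weight (illustrative)."""
    if not nested:
        # One fp32 absmax per block of weights.
        return 32 / blocksize
    # One 8-bit code per block, plus one fp32 absmax
    # shared by every `nested_blocksize` blocks.
    return 8 / blocksize + 32 / (blocksize * nested_blocksize)


plain = absmax_overhead_bits()              # fp32 absmax per 64 weights
nested = absmax_overhead_bits(nested=True)  # absmax itself quantized blockwise
print(f"plain: {plain:.3f} bits/weight, nested: {nested:.3f} bits/weight")
```

Under these assumptions the metadata cost drops from 0.5 to about 0.127 bits per weight, which is why quantize_blockwise and dequantize_blockwise sit on the model loading and inference paths for 4-bit models.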

quantize_4bit

This is the only op without a CPU C++ kernel, so improvements apply directly to CPU users quantizing models.

Before: Excessive intermediate memory, causing OOM on large models.
After: Reduced memory usage throughout with better performance. Also fixes a division-by-zero on all-zero weight blocks.

Ryzen 7950X, PyTorch 2.12.0, NF4, blocksize=64, bf16 input

| Shape | Before (time / memory) | After (time / memory) |
|---|---|---|
| 1024x4096 | 15ms / 136MB | 9ms / 30MB |
| 4096x4096 | 236ms / 2112MB | 45ms / 128MB |
| 8192x7168 | 808ms / 7616MB | 155ms / 448MB |
| 28672x8192 | OOM | ~600ms / ~1.8GB |
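
The division-by-zero case can be illustrated with a minimal pure-Python sketch of blockwise 4-bit quantization. This is not the PR's implementation: the uniform 16-level `CODE` table is a stand-in for the real NF4 levels, the high-nibble-first packing order is an assumption, and `quantize_4bit_sketch` is a hypothetical name.

```python
# Stand-in 16-level code table, uniform in [-1, 1]. Real NF4 levels differ.
CODE = [-1.0 + 2.0 * i / 15 for i in range(16)]


def quantize_4bit_sketch(values, blocksize=64):
    """Return (packed_bytes, absmax_per_block) for a flat list of floats."""
    absmax, codes = [], []
    for start in range(0, len(values), blocksize):
        block = values[start:start + blocksize]
        amax = max(abs(v) for v in block)
        absmax.append(amax)
        # Guard: an all-zero block has absmax 0, so v / amax would be 0 / 0.
        scale = amax if amax > 0 else 1.0
        for v in block:
            x = v / scale
            # Nearest code entry to the normalized value.
            codes.append(min(range(16), key=lambda i: abs(CODE[i] - x)))
    if len(codes) % 2:
        codes.append(0)  # pad so the codes pair up into whole bytes
    # Two 4-bit codes per byte, first code in the high nibble (assumed order).
    packed = bytes((codes[i] << 4) | codes[i + 1] for i in range(0, len(codes), 2))
    return packed, absmax


weights = [0.0] * 64 + [1.0, -1.0] * 32   # first block is all zeros
packed, absmax = quantize_4bit_sketch(weights)
```

Without the guard, the first block would raise `ZeroDivisionError` here; in a tensor implementation the same 0 / 0 would instead yield NaN values.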

quantize_blockwise

Before: Excessive intermediate memory.
After: Reduced memory usage throughout, and better performance.

Ryzen 7950X, PyTorch 2.12.0, fp32, blocksize=256

| Shape | Before (time / memory) | After (time / memory) |
|---|---|---|
| 1024x4096 | 193ms / 2GB | 2ms / 9MB |
| 4096x4096 | 3000ms / 32GB | 80ms / 144MB |
| 8192x7168 | OOM | 277ms / 504MB |

CPU data is illustrative of default implementation improvements; C++ kernels handle CPU in practice.
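
One plausible source of large intermediates in a nearest-code lookup is comparing every element against all 256 code entries at once, which materializes an N x 256 array. The sketch below shows one way to avoid that by binary-searching a sorted code table. It is an illustration only and may not match what this PR does; the linear `CODE_8BIT` table and both function names are hypothetical.

```python
from bisect import bisect_left

# Stand-in sorted 256-entry code table, uniform in [-1, 1].
CODE_8BIT = [-1.0 + 2.0 * i / 255 for i in range(256)]


def nearest_code_index(code, x):
    """Index of the entry in sorted `code` closest to x, in O(log n)."""
    pos = bisect_left(code, x)
    if pos == 0:
        return 0
    if pos == len(code):
        return len(code) - 1
    # x lies between code[pos - 1] and code[pos]; pick the closer one.
    return pos if code[pos] - x < x - code[pos - 1] else pos - 1


def quantize_blockwise_sketch(values, code=CODE_8BIT, blocksize=256):
    """Return (uint8 codes as bytes, absmax per block) for a flat list."""
    absmax, out = [], bytearray()
    for start in range(0, len(values), blocksize):
        block = values[start:start + blocksize]
        amax = max(abs(v) for v in block)
        absmax.append(amax)
        scale = amax if amax > 0 else 1.0  # avoid 0 / 0 on all-zero blocks
        out.extend(nearest_code_index(code, v / scale) for v in block)
    return bytes(out), absmax


q, absmax = quantize_blockwise_sketch([0.5, -1.0, 1.0, 0.0], blocksize=4)
```

Each element needs only a handful of comparisons and one output byte, so working memory stays proportional to the input rather than to input size times code size.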


dequantize_blockwise

Before: Unnecessary overhead and excess memory usage.
After: Cleaner implementation with reduced memory, improved throughput.

Ryzen 7950X, PyTorch 2.12.0, fp32, blocksize=256

| Shape | Before (time / memory) | After (time / memory) |
|---|---|---|
| 1024x4096 | 2ms / 12MB | 0.1ms / 4MB |
| 4096x4096 | 33ms / 192MB | 13ms / 64MB |
| 8192x7168 | 109ms / 672MB | 43ms / 224MB |

CPU data is illustrative - separate C++ kernels handle CPU in practice.

MPS: M4, macOS 15, PyTorch 2.12.0, fp32, blocksize=256

| Shape | Before (time / memory) | After (time / memory) |
|---|---|---|
| 4096x4096 | 8.44ms / 128MB | 0.59ms / 32MB |
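
For reference, blockwise dequantization amounts to a table lookup followed by a per-block scale. A minimal pure-Python sketch, assuming a 256-entry code table and one absmax value per block (the linear `CODE_8BIT` table and the function name are hypothetical stand-ins):

```python
# Stand-in 256-entry code table, uniform in [-1, 1].
CODE_8BIT = [-1.0 + 2.0 * i / 255 for i in range(256)]


def dequantize_blockwise_sketch(q, absmax, code=CODE_8BIT, blocksize=256):
    """Map each uint8 code to its table value, scaled by its block's absmax."""
    return [code[b] * absmax[i // blocksize] for i, b in enumerate(q)]


# Two blocks of one element each, with different scales.
restored = dequantize_blockwise_sketch(bytes([0, 255]), [3.0, 0.5], blocksize=1)
```

The only output is one value per input code, so a tensor version of this needs no intermediate larger than the result itself.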

dequantize_4bit

Before: Intermediate computation in lower precision, producing slightly different results
from the C++ and CUDA kernels.
After: Matches C++/CUDA precision. Improved throughput via torch.compile.

Ryzen 7950X, PyTorch 2.12.0, NF4, blocksize=64, bf16 output

| Shape | Before (time / memory) | After (time / memory) |
|---|---|---|
| 1024x4096 | 7ms / 14MB | 0.5ms / 2MB |
| 4096x4096 | 42ms / 224MB | 7ms / 32MB |
| 8192x7168 | 140ms / 784MB | 24ms / 112MB |

CPU data is illustrative - separate C++ kernels handle CPU in practice.
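
As a rough picture of what 4-bit dequantization does, the sketch below unpacks two codes per byte, looks each up in a 16-entry table, and scales by the block's absmax. It is plain Python rather than the PR's torch.compile path; the uniform `CODE` table stands in for the real NF4 levels, and the high-nibble-first order and function name are assumptions.

```python
# Stand-in 16-level code table, uniform in [-1, 1]. Real NF4 levels differ.
CODE = [-1.0 + 2.0 * i / 15 for i in range(16)]


def dequantize_4bit_sketch(packed, absmax, n, blocksize=64):
    """Unpack `n` 4-bit codes from `packed` and rescale them per block."""
    codes = []
    for byte in packed:
        codes.append(byte >> 4)     # high nibble holds the first code (assumed)
        codes.append(byte & 0x0F)   # low nibble holds the second code
    codes = codes[:n]               # drop any padding nibble
    # Python floats are double precision, so the multiply happens at full
    # precision; a tensor version would cast to the output dtype afterwards.
    return [CODE[c] * absmax[i // blocksize] for i, c in enumerate(codes)]


restored = dequantize_4bit_sketch(bytes([0xF0, 0x0F]), [2.0], n=4, blocksize=4)
```

Doing the lookup and scale at higher precision and rounding once at the end is the general pattern behind matching a reference kernel's results.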

@github-actions

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
