perf: improve default/fallback backend implementation for blockwise quantization ops #1960
Open
matthewdouglas wants to merge 2 commits into
Improves the four ops for blockwise quantization/dequantization on the default/fallback backend. These ops are used when no device-specific implementation is available. For CPU, one of these four ops applies (quantize_4bit). On MPS and PrivateUse1 devices like HPU, more of them will apply.

This does not impact CUDA/ROCm/XPU implementations.
Impacted ops
- quantize_4bit
- quantize_blockwise
- dequantize_blockwise
- dequantize_4bit

The ops marked with * can have impact on MPS, pending a planned follow-up PR. These will likely become fallbacks to kernels hosted on kernels-community, which only builds for macOS 26+ and will fail on macOS 15 or without kernels installed.

On impacted devices, these ops are used for:

- quantize_4bit and quantize_blockwise (nested absmax compression)
- dequantize_blockwise (nested absmax decompression), dequantize_4bit
- quantize_blockwise + dequantize_blockwise for optimizer state

quantize_4bit
This is the only op without a CPU C++ kernel, so improvements apply directly to CPU users quantizing models.
Before: Excessive intermediate memory, causing OOM on large models.
After: Reduced memory usage throughout with better performance. Also fixes a division-by-zero on all-zero weight blocks.
Ryzen 7950X, PyTorch 2.12.0, NF4, blocksize=64, bf16 input
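To illustrate the all-zero block case mentioned above, here is a minimal sketch in plain Python (no PyTorch, so it stays self-contained). The function name `blockwise_absmax_normalize` and the choice to substitute 1.0 for a zero absmax are assumptions for illustration; the PR's actual fix may be implemented differently.

```python
def blockwise_absmax_normalize(values, blocksize=64):
    """Split values into blocks and scale each block into [-1, 1].

    Hypothetical helper for illustration; not the bitsandbytes implementation.
    """
    absmax = []
    scaled = []
    for start in range(0, len(values), blocksize):
        block = values[start:start + blocksize]
        m = max(abs(v) for v in block)
        absmax.append(m)
        # An all-zero block has m == 0. Dividing by it would raise
        # ZeroDivisionError here (the tensor equivalent produces NaN),
        # so a neutral divisor is used instead (an assumed guard).
        denom = m if m != 0 else 1.0
        scaled.extend(v / denom for v in block)
    return absmax, scaled
```

With the guard in place, an all-zero block normalizes to zeros and its stored absmax stays 0, so dequantizing it later also yields zeros.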
quantize_blockwise
Before: Excessive intermediate memory.
After: Reduced memory usage throughout, and better performance.
Ryzen 7950X, PyTorch 2.12.0, fp32, blocksize=256
CPU data is illustrative of default implementation improvements; C++ kernels handle CPU in practice.
dequantize_blockwise
Before: Unnecessary overhead and excess memory usage.
After: Cleaner implementation with reduced memory, improved throughput.
Ryzen 7950X, PyTorch 2.12.0, fp32, blocksize=256
CPU data is illustrative - separate C++ kernels handle CPU in practice.
MPS: M4, macOS 15, PyTorch 2.12.0, fp32, blocksize=256
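For readers unfamiliar with how the quantize_blockwise and dequantize_blockwise pair fits together, the following is a rough round-trip sketch in plain Python. It assumes a simple linear 8-bit code (indices 0 to 254 mapped evenly onto [-1, 1]); that mapping and the `_sketch` function names are hypothetical, and the library's real code table is not linear.

```python
def quantize_blockwise_sketch(values, blocksize=256):
    """Return (codes, absmax) using an assumed linear 8-bit code."""
    codes = []
    absmax = []
    for start in range(0, len(values), blocksize):
        block = values[start:start + blocksize]
        m = max(abs(v) for v in block)
        absmax.append(m)
        scale = m if m != 0 else 1.0  # avoid dividing by zero
        # Map each normalized value in [-1, 1] to an integer in [0, 254].
        codes.extend(round(v / scale * 127) + 127 for v in block)
    return codes, absmax


def dequantize_blockwise_sketch(codes, absmax, blocksize=256):
    """Invert the linear code and rescale by the per-block absmax."""
    return [(c - 127) / 127 * absmax[i // blocksize] for i, c in enumerate(codes)]
```

Only one absmax value is stored per block, which is why a smaller blocksize improves accuracy at the cost of more scale metadata.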
dequantize_4bit
Before: Intermediate computation in lower precision, producing slightly different results from the C++ and CUDA kernels.
After: Matches C++/CUDA precision. Improved throughput via torch.compile.
Ryzen 7950X, PyTorch 2.12.0, NF4, blocksize=64, bf16 output
CPU data is illustrative - separate C++ kernels handle CPU in practice.
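As a rough picture of what a 4-bit dequantize step does, the sketch below unpacks two 4-bit indices per byte, looks each up in a 16-entry code table, and rescales by the per-block absmax. Several things here are assumptions for illustration: the `CODE` table is evenly spaced and is not the real NF4 table, the high nibble is taken to hold the first element, and the function name is hypothetical. The arithmetic runs in full-precision Python floats; in a tensor implementation, any cast to a lower-precision output dtype such as bf16 would come last, which is the ordering the "After" behavior above suggests.

```python
# Hypothetical 16-entry code table, evenly spaced over [-1, 1].
# These are NOT the real NF4 values.
CODE = [i / 7.5 - 1.0 for i in range(16)]


def dequantize_4bit_sketch(packed, absmax, blocksize=64):
    """Unpack two 4-bit indices per byte and rescale per block."""
    out = []
    for i, byte in enumerate(packed):
        # Assumed layout: high nibble first, then low nibble.
        for j, idx in enumerate((byte >> 4, byte & 0x0F)):
            block = (2 * i + j) // blocksize
            # Lookup and multiply happen in full precision.
            out.append(CODE[idx] * absmax[block])
    return out
```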