perf: improve default/fallback backend implementation for blockwise quantization ops by matthewdouglas · Pull Request #1960 · bitsandbytes-foundation/bitsandbytes

matthewdouglas · 2026-05-29T18:29:13Z

Improves the four blockwise quantization/dequantization ops on the default/fallback backend. These ops are used when no device-specific implementation is available. On CPU, only one of the four applies (`quantize_4bit`). On MPS and on PrivateUse1 devices such as HPU, more of them apply.

This does not impact CUDA/ROCm/XPU implementations.

Impacted ops

| Op | CPU | MPS | HPU | Other PrivateUse1 |
| --- | --- | --- | --- | --- |
| `quantize_4bit` | ✓ | ✓ * | ✓ | ✓ |
| `quantize_blockwise` | - | ✓ | ✓ | ✓ |
| `dequantize_blockwise` | - | ✓ | ✓ | ✓ |
| `dequantize_4bit` | - | ✓ * | - | ✓ |

The ops marked with * can affect MPS, depending on a planned follow-up PR. They will likely become fallbacks for kernels hosted on kernels-community, which only builds for macOS 26+ and will fail on macOS 15 or when `kernels` is not installed.

On impacted devices, these ops are used for:

- Model loading: `quantize_4bit` and `quantize_blockwise` (nested absmax compression)
- Inference/Training: `dequantize_blockwise` (nested absmax decompression), `dequantize_4bit`
- 8-bit optimizers: `quantize_blockwise` + `dequantize_blockwise` for optimizer state

`quantize_4bit`

This is the only op without a CPU C++ kernel, so improvements apply directly to CPU users quantizing models.

Before: Excessive intermediate memory, causing OOM on large models.
After: Reduced memory usage throughout with better performance. Also fixes a division-by-zero on all-zero weight blocks.

Benchmark setup: Ryzen 7950X, PyTorch 2.12.0, NF4, blocksize=64, bf16 input

| Shape | Before (time / memory) | After (time / memory) |
| --- | --- | --- |
| 1024x4096 | 15ms / 136MB | 9ms / 30MB |
| 4096x4096 | 236ms / 2112MB | 45ms / 128MB |
| 8192x7168 | 808ms / 7616MB | 155ms / 448MB |
| 28672x8192 | OOM | ~600ms / ~1.8GB |
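For readers unfamiliar with the op, here is a minimal, dependency-free sketch of 4-bit blockwise absmax quantization, including a guard for all-zero blocks like the division-by-zero fix mentioned above. It is not the code from this PR: the function names, the evenly spaced 16-level codebook (a stand-in for the real NF4 table), and the high-nibble-first packing order are all assumptions made for illustration.

```python
from bisect import bisect_left

# Toy 16-level codebook on [-1, 1]; a stand-in, NOT the real NF4 table.
CODE = [-1.0 + 2.0 * i / 15 for i in range(16)]


def nearest_code_index(x, code=CODE):
    """Index of the code value closest to x (code must be sorted ascending)."""
    pos = bisect_left(code, x)
    if pos == 0:
        return 0
    if pos == len(code):
        return len(code) - 1
    # Pick whichever neighbour is closer to x.
    return pos if code[pos] - x < x - code[pos - 1] else pos - 1


def quantize_4bit_sketch(values, blocksize=64):
    """Return (packed_bytes, absmax_per_block) for a flat list of floats."""
    absmax, indices = [], []
    for start in range(0, len(values), blocksize):
        block = values[start:start + blocksize]
        scale = max(abs(v) for v in block)
        absmax.append(scale)
        # Guard: an all-zero block has scale 0, so do not divide by it.
        safe = scale if scale > 0.0 else 1.0
        indices.extend(nearest_code_index(v / safe) for v in block)
    if len(indices) % 2:
        indices.append(0)  # pad to an even count before packing
    # Two 4-bit indices per byte, first index in the high nibble (assumed order).
    packed = bytes((hi << 4) | lo for hi, lo in zip(indices[::2], indices[1::2]))
    return packed, absmax
```

The sketch processes one block at a time, so its working set stays at roughly one block regardless of the input size. Whether the PR achieves its memory reduction the same way is not stated in the description.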

`quantize_blockwise`

Before: Excessive intermediate memory.
After: Reduced memory usage throughout, and better performance.

Benchmark setup: Ryzen 7950X, PyTorch 2.12.0, fp32, blocksize=256

| Shape | Before (time / memory) | After (time / memory) |
| --- | --- | --- |
| 1024x4096 | 193ms / 2GB | 2ms / 9MB |
| 4096x4096 | 3000ms / 32GB | 80ms / 144MB |
| 8192x7168 | OOM | 277ms / 504MB |

CPU data is illustrative of default implementation improvements; C++ kernels handle CPU in practice.
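As a rough illustration of what `quantize_blockwise` computes, the sketch below maps each value to one 8-bit index into a sorted codebook and stores one absmax scale per block. It is an assumption-laden toy, not the PR's implementation: `quantize_blockwise_sketch` and the evenly spaced 256-level `CODE` table are hypothetical, and the real library uses its own code tables.

```python
from bisect import bisect_left

# Toy 256-level codebook, evenly spaced on [-1, 1]; a placeholder only.
CODE = [-1.0 + 2.0 * i / 255 for i in range(256)]


def quantize_blockwise_sketch(values, code=CODE, blocksize=256):
    """Return (indices, absmax): one 8-bit index per value, one scale per block."""
    out = bytearray(len(values))
    absmax = []
    for start in range(0, len(values), blocksize):
        block = values[start:start + blocksize]
        scale = max(abs(v) for v in block)
        absmax.append(scale)
        # An all-zero block gets inv = 0, so every value normalizes to 0.
        inv = 1.0 / scale if scale > 0.0 else 0.0
        for offset, v in enumerate(block):
            x = v * inv
            # Binary search in the sorted codebook, then pick the nearer neighbour.
            pos = min(bisect_left(code, x), len(code) - 1)
            if pos > 0 and x - code[pos - 1] <= code[pos] - x:
                pos -= 1
            out[start + offset] = pos
    return bytes(out), absmax
```

A binary search over a sorted codebook needs no per-element table of distances to all 256 code values, which is one general way to keep intermediate memory small. The PR description does not say which technique it uses.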

`dequantize_blockwise`

Before: Unnecessary overhead and excess memory usage.
After: Cleaner implementation with reduced memory, improved throughput.

Benchmark setup: Ryzen 7950X, PyTorch 2.12.0, fp32, blocksize=256

| Shape | Before (time / memory) | After (time / memory) |
| --- | --- | --- |
| 1024x4096 | 2ms / 12MB | 0.1ms / 4MB |
| 4096x4096 | 33ms / 192MB | 13ms / 64MB |
| 8192x7168 | 109ms / 672MB | 43ms / 224MB |

CPU data is illustrative - separate C++ kernels handle CPU in practice.

Benchmark setup (MPS): M4, macOS 15, PyTorch 2.12.0, fp32, blocksize=256

| Shape | Before (time / memory) | After (time / memory) |
| --- | --- | --- |
| 4096x4096 | 8.44ms / 128MB | 0.59ms / 32MB |
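Conceptually, blockwise dequantization is a table lookup followed by a per-block rescale: each output is `code[index] * absmax[block]`. The sketch below shows that in plain Python. The function name and the three-entry codebook in the usage line are hypothetical and only illustrate the indexing.

```python
def dequantize_blockwise_sketch(indices, absmax, code, blocksize=256):
    """Rebuild approximate floats from 8-bit indices and per-block scales."""
    # Element i belongs to block i // blocksize and uses that block's scale.
    return [code[q] * absmax[i // blocksize] for i, q in enumerate(indices)]


# Usage with a tiny, made-up codebook and two blocks of two values each.
toy_code = [-1.0, 0.0, 1.0]
restored = dequantize_blockwise_sketch(bytes([2, 0, 1, 2]), [0.5, 4.0], toy_code, blocksize=2)
```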

`dequantize_4bit`

Before: Intermediate computation in lower precision, producing slightly different results from the C++ and CUDA kernels.
After: Matches C++/CUDA precision. Improved throughput via `torch.compile`.

Benchmark setup: Ryzen 7950X, PyTorch 2.12.0, NF4, blocksize=64, bf16 output

| Shape | Before (time / memory) | After (time / memory) |
| --- | --- | --- |
| 1024x4096 | 7ms / 14MB | 0.5ms / 2MB |
| 4096x4096 | 42ms / 224MB | 7ms / 32MB |
| 8192x7168 | 140ms / 784MB | 24ms / 112MB |

CPU data is illustrative - separate C++ kernels handle CPU in practice.
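To make the data layout concrete, here is a small sketch of 4-bit dequantization: split each byte into two 4-bit indices, look them up in a 16-entry codebook, and scale by the block's absmax. The names, the evenly spaced toy codebook (not the real NF4 table), and the high-nibble-first order are assumptions. The sketch uses ordinary Python floats throughout, so it does not model the intermediate-precision difference described above.

```python
# Toy 16-level codebook on [-1, 1]; a stand-in, NOT the real NF4 table.
CODE = [-1.0 + 2.0 * i / 15 for i in range(16)]


def dequantize_4bit_sketch(packed, absmax, n, code=CODE, blocksize=64):
    """Unpack n values from packed 4-bit indices and rescale them per block."""
    indices = []
    for byte in packed:
        indices.append(byte >> 4)    # high nibble first (assumed order)
        indices.append(byte & 0x0F)  # then the low nibble
    # Drop any padding nibble, then apply the lookup and the per-block scale.
    return [code[q] * absmax[i // blocksize] for i, q in enumerate(indices[:n])]
```

The explicit element count `n` lets the caller discard a trailing padding nibble when the original tensor had an odd number of elements.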

github-actions · 2026-05-29T18:32:43Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

perf: improve default/fallback backend implementation for blockwise quantize/dequantize ops (5a5c6ac)

matthewdouglas added the Cross Platform label May 29, 2026

matthewdouglas added this to the v0.50.0 milestone May 29, 2026

Minor improvement (f1f7eb6)

matthewdouglas wants to merge 2 commits into main from improve-default-blockwise-quant-ops
