[cuda backend] store scale/zero in int4_plain_mm in [N, n_groups] layout by Gasoonjia · Pull Request #20038 · pytorch/executorch

Gasoonjia · 2026-06-04T17:47:00Z

This PR updates int4_plain_mm in the cuda backend to read scale/zero in the transposed [N, n_groups] layout instead of [n_groups, N]. This way, every warp can load both scale and zero from one cache line, instead of 32 cache lines previously.

gemma4-31b decode perf: ~27 token/s -> 37.36 token/s.
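
To make the cache-line claim concrete, here is a minimal sketch that counts the distinct cache lines one warp touches in each layout. It assumes an access pattern the PR does not spell out in full: the 32 lanes of a warp work on the same output row and each lane needs the scale of a different, consecutive quantization group. The 2-byte scale element, the 128-byte cache line, and the sizes N = K = 4096 are illustrative assumptions, and the function names are hypothetical.

```python
# Sketch: count the distinct cache lines one warp touches when its 32 lanes
# each fetch the scale of a different group for the same output row.
# Assumptions (not from the PR): 2-byte scales, 128-byte lines, example sizes.

ELEM_BYTES = 2
LINE_BYTES = 128
LANES = 32


def scale_offsets(layout, row, first_group, N, n_groups):
    """Element offsets read by the 32 lanes of one warp."""
    groups = range(first_group, first_group + LANES)
    if layout == "[n_groups, N]":
        # Old layout: consecutive groups are N elements apart.
        return [g * N + row for g in groups]
    if layout == "[N, n_groups]":
        # New layout: consecutive groups of one row are adjacent.
        return [row * n_groups + g for g in groups]
    raise ValueError(f"unknown layout: {layout}")


def cache_lines_touched(offsets):
    """Number of distinct cache lines covered by the given element offsets."""
    return len({(off * ELEM_BYTES) // LINE_BYTES for off in offsets})


N, K, GROUP_SIZE = 4096, 4096, 32
N_GROUPS = K // GROUP_SIZE  # 128 groups per output row

old_lines = cache_lines_touched(scale_offsets("[n_groups, N]", 5, 0, N, N_GROUPS))
new_lines = cache_lines_touched(scale_offsets("[N, n_groups]", 5, 0, N, N_GROUPS))
print(old_lines, new_lines)
```

Under these assumptions the script prints `32 1`: the stride-N reads land in 32 separate lines, while the transposed layout places the same 32 values in one 64-byte run.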

cc @digantdesai @freddan80 @per @zingo @oscarandersson8218 @mansnils @Sebastian-Larsson @robell @rascani

pytorch-bot · 2026-06-04T17:47:04Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20038

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❌ 4 New Failures, 1 Pending

As of commit 8e404c7 with merge base a79f3e4:

NEW FAILURES - The following jobs have failed:

pull / test-mcu-cortex-m-backend / linux-job (gh)
RuntimeError: Command docker exec -t 09440886de763fdbe97482d41f78186128f7cf9734531c766dcb8f80ccb4713e /exec failed with exit code 1
pull / test-multimodal-linux (gemma3-4b) / linux-job (gh)
RuntimeError: Command docker exec -t 8a3d91fbc1ffa11d3082760a5e079d1ffe657125ae58f981f8be1e275baa5b67 /exec failed with exit code 139
pull / unittest / macos / macos-job (gh)
export/tests/test_target_recipes.py::TestTargetRecipes::test_vit_model
Test CUDA Windows Export and E2E / test-model-cuda-windows-e2e (facebook, dinov2-small-imagenet1k-1-layer, non-quantized) / windows-job (gh)
Process completed with exit code 1.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

mergennachin · 2026-06-05T14:38:04Z

+// Reads scale/zero in the transposed [N, n_groups] layout (transposed AOT at
+// export time). With group_size >= 32, one uint4 (32 weights) maps to exactly


If this is the new contract, can you add runtime validation here for the new packing format and reject old formats?

Also, please add a unit test showing that it successfully rejects an old format.
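
A shape-based check of the kind requested here could look like the sketch below. It is illustrative only: `check_coalesced_layout` and its parameters are hypothetical names, and the check actually added in the PR is not shown in this thread.

```python
# Sketch of a layout check: accept scale/zero_point only in [N, n_groups].
# All names here are hypothetical; this is not the PR's actual validation.

def check_coalesced_layout(scale_shape, zero_shape, N, K, group_size):
    """Return the expected (N, n_groups) shape or raise ValueError."""
    if K % group_size != 0:
        raise ValueError(f"K={K} is not a multiple of group_size={group_size}")
    expected = (N, K // group_size)
    for name, shape in (("scale", scale_shape), ("zero_point", zero_shape)):
        if tuple(shape) != expected:
            raise ValueError(
                f"{name} has shape {tuple(shape)}, expected {expected} "
                "([N, n_groups]); was it packed with the old layout?"
            )
    return expected


# New layout passes.
ok_shape = check_coalesced_layout((8, 4), (8, 4), N=8, K=128, group_size=32)

# Old [n_groups, N] layout is rejected.
old_rejected = False
try:
    check_coalesced_layout((4, 8), (4, 8), N=8, K=128, group_size=32)
except ValueError:
    old_rejected = True
```

A shape check alone cannot tell the two layouts apart when N equals n_groups, which is one reason a distinct tensor type is a stronger guard than shape inspection.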

mergennachin · 2026-06-05T15:01:36Z

+        w = Int4Tensor(
+            qdata=w.qdata,
+            scale=w.scale.t().contiguous(),
+            zero_point=w.zero_point.t().contiguous(),
+            block_size=w.block_size,
+            shape=w.shape,
+            act_pre_scale=w.act_pre_scale,
+            activation_dtype=w.activation_dtype,
+        )
+        module.weight = nn.Parameter(w, requires_grad=False)


The new pack-time/AOT layout is much better than the runtime transpose cache, but I don't think we should represent it as a plain Int4Tensor anymore. Int4Tensor already has a fixed, strong contract.

For example, if an Int4Tensor in the new format is created but is somehow not dispatched to int4_plain_mm and accidentally goes through the regular torchao kernel, this will be an issue. Likewise, int4_dispatch.py globally overrides Int4Tensor F.linear, so any native torchao Int4Tensor that bypasses pack_cuda.py can be interpreted as newly packed.

Instead of changing the fixed contract of Int4Tensor, I'd rather create a new TorchAOBaseTensor subclass in ET and change int4_dispatch accordingly.

wdyt @metascroy @digantdesai
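
The concern above is that layout should be carried by the type, not by convention. A minimal pure-Python sketch of that idea follows. The class and kernel names are hypothetical stand-ins and do not reproduce the torchao or ExecuTorch APIs.

```python
# Sketch: encode the scale layout in the type so that dispatch cannot
# confuse the two formats. Names are hypothetical, not the real classes.
from dataclasses import dataclass


@dataclass(frozen=True)
class PlainInt4Weight:
    scale: list  # nested lists shaped [n_groups][N]


@dataclass(frozen=True)
class CoalescedInt4Weight:
    scale: list  # nested lists shaped [N][n_groups]

    @classmethod
    def from_plain(cls, w):
        # The transpose happens once, at pack time, and only here.
        return cls(scale=[list(col) for col in zip(*w.scale)])


# Dispatch is keyed on the exact type, never on shape.
_KERNELS = {
    PlainInt4Weight: "reference_kernel",
    CoalescedInt4Weight: "coalesced_kernel",
}


def pick_kernel(weight):
    try:
        return _KERNELS[type(weight)]
    except KeyError:
        raise TypeError(f"no int4 kernel for {type(weight).__name__}") from None


plain = PlainInt4Weight(scale=[[1, 2, 3], [4, 5, 6]])
packed = CoalescedInt4Weight.from_plain(plain)
```

With this structure a plain weight can never reach the coalesced kernel, because the only way to obtain the coalesced type is through the packing step that performs the transpose.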

linux-foundation-easycla · 2026-06-08T06:10:40Z

The committers listed above are authorized under a signed CLA.

✅ login: Gasoonjia / name: gasoonjia (08278e7, 19cd76a, d3632f0)

Gasoonjia · 2026-06-08T08:04:10Z

Thanks @mergennachin for your comment. I have introduced a new int4 class, CudaCoalescedInt4Tensor, living in the cuda backend and guarded by tests for mis-dispatch. The PR also updates int4_dispatch.py and pack_cuda.py to support the new class. We can further move the tensor into executorch.extension.llm if mlx and other backends need it in the future.

Gasoonjia · 2026-06-08T08:43:28Z

Also added a runtime check for the layout format.

…decode Coalesce int4 W4A8 decode-matvec scale/zero loads by baking the [N, n_groups] layout into the weight constant at pack time. Introduces CudaCoalescedInt4Tensor (an ExecuTorch-internal subclass) that owns the [n_groups, N] -> [N, n_groups] transpose, registers the int4_plain_mm dispatch on it by type, and adds the coalesced dp4a matvec kernel that reads scale/zero row-for-row with qdata (single coalesced load vs 32 stride-N cache lines). ~29.2 -> 37.4 tok/s on gemma group_size=32. Rebased onto main; INT8 dp4a decode op and the floor_div pass from this branch landed separately and now live in quantize_op_dispatch/.
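
The transpose described in this commit message changes only where scale/zero values live, not what the matvec computes. The reference sketch below checks that on a tiny example. It assumes the common group-wise dequantization w = (q - zero) * scale, which is not stated in this thread, and uses plain Python lists in place of tensors; the function names are hypothetical.

```python
# Sketch: a group-wise dequantized matvec gives the same result with
# scale/zero stored as [n_groups][N] or as [N][n_groups].
# Assumption (not from the PR): w[n][k] = (q[n][k] - zero) * scale.

def matvec_groups_major(q, scale, zero, x, group_size):
    """scale/zero indexed as [group][row] (the old layout)."""
    out = []
    for n, row in enumerate(q):
        acc = 0
        for k, qv in enumerate(row):
            g = k // group_size
            acc += (qv - zero[g][n]) * scale[g][n] * x[k]
        out.append(acc)
    return out


def matvec_rows_major(q, scale_t, zero_t, x, group_size):
    """scale/zero indexed as [row][group], row-for-row with q."""
    out = []
    for n, row in enumerate(q):
        acc = 0
        for k, qv in enumerate(row):
            g = k // group_size
            acc += (qv - zero_t[n][g]) * scale_t[n][g] * x[k]
        out.append(acc)
    return out


def transpose(m):
    return [list(col) for col in zip(*m)]


q = [[1, 2, 3, 4], [5, 6, 7, 8]]  # N=2 rows, K=4 columns
scale = [[1, 2], [3, 4]]          # [n_groups=2][N=2]
zero = [[0, 1], [1, 0]]           # [n_groups=2][N=2]
x = [1, 1, 1, 1]

y_old = matvec_groups_major(q, scale, zero, x, group_size=2)
y_new = matvec_rows_major(q, transpose(scale), transpose(zero), x, group_size=2)
```

Both calls return `[18, 78]` for this input, so a kernel reading the transposed layout can be validated against a reference that reads the original one.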

github-actions · 2026-06-09T06:42:12Z

This PR needs a `release notes:` label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

meta-cla Bot added the CLA Signed label Jun 4, 2026

Gasoonjia changed the title ~~G4 opt coalesced scale~~ [cuda backend] store scale/zero in [N, n_groups] to reduce cache read cost. Jun 4, 2026

Gasoonjia changed the title ~~[cuda backend] store scale/zero in [N, n_groups] to reduce cache read cost.~~ [cuda backend] store scale/zero in int4_plain_mm in [N, n_groups] layout Jun 4, 2026

Gasoonjia marked this pull request as ready for review June 4, 2026 17:58

Gasoonjia requested review from kirklandsign and larryliu0820 as code owners June 4, 2026 17:58

mergennachin requested a review from digantdesai June 4, 2026 18:06

mergennachin reviewed Jun 4, 2026

Comment thread backends/cuda/runtime/shims/int4_plain_mm.cuh Outdated

mergennachin reviewed Jun 4, 2026

Comment thread backends/cuda/runtime/shims/int4_plain_mm.cuh Outdated

digantdesai reviewed Jun 4, 2026

Comment thread backends/cuda/runtime/shims/int4_plain_mm.cuh Outdated

digantdesai reviewed Jun 4, 2026

Comment thread backends/cuda/runtime/shims/int4_plain_mm.cuh Outdated

mergennachin reviewed Jun 5, 2026

mergennachin requested a review from metascroy June 5, 2026 15:02

Gasoonjia force-pushed the g4-int8-decode-op branch from 1a527d2 to 20d021d Compare June 5, 2026 20:00

Gasoonjia force-pushed the g4-opt-coalesced-scale branch from c9bc3fb to df47b27 Compare June 8, 2026 07:44

Gasoonjia force-pushed the g4-opt-coalesced-scale branch from d80b1b6 to 99f20f8 Compare June 8, 2026 08:05

Gasoonjia requested review from JacobSzwejbka, SS-JIA, abhinaykukkadapu, kimishpatel, psiddh, rascani, robert-kalmar and shoumikhin as code owners June 8, 2026 08:05

github-actions Bot added the ciflow/trunk and module: arm labels Jun 8, 2026

Gasoonjia force-pushed the g4-opt-coalesced-scale branch 2 times, most recently from d3632f0 to 6ac4974 Compare June 8, 2026 08:42

mergennachin approved these changes Jun 8, 2026

Base automatically changed from g4-int8-decode-op to main June 9, 2026 04:43

Gasoonjia force-pushed the g4-opt-coalesced-scale branch from 6ac4974 to 8e404c7 Compare June 9, 2026 06:41

Gasoonjia wants to merge 1 commit into main from g4-opt-coalesced-scale

3 participants
