Add moe prefill/ decode with int2/int4/int8 sym /asym and fp8 e4m3 e5m2 by Copilot · Pull Request #1813 · intel/auto-round

Copilot · 2026-05-14T04:27:10Z

Restore the qwen MoE perf benchmark to Qwen/Qwen1.5-MoE-A2.7B and remove the DeepSeek-V2-Lite case from test/test_ark/test_moe_model_perf.py.

Description

Parametrization – qwen entry switched back from the inline "Qwen/Qwen3-30B-A3B" to the shared qwen_moe_name_or_path helper; deepseek-v2-lite entry removed.
Imports – re-add qwen_moe_name_or_path, drop now-unused deepseek_v2_name_or_path.
Docstring – module header updated to reference only Qwen1.5-MoE.

Type of Change

Test

Checklist Before Submitting

My code has been tested locally.
Documentation has been updated as needed.
New or updated tests are included where applicable.
The CUDA CI has passed. You can trigger it by commenting /azp run Unit-Test-CUDA-AutoRound.

a32543254 · 2026-05-18T15:33:53Z

@copilot resolve the merge conflicts in this pull request

Copilot · 2026-05-18T15:36:17Z

> @copilot resolve the merge conflicts in this pull request

Merged origin/main in 430868d. The only conflict was in auto_round_extension/ark/auto_round_kernel/ark.cpp — I kept the new moe_gemm_decode_wrapper from this branch alongside the updated sage_dynamic_quant signature (added bias param) from main.

Copilot

Pull request overview

This PR adds an XPU-optimized MoE decode-phase GEMV kernel (small M per expert) with multiple weight formats, and wires it through the C++/PyTorch extension layer with corresponding unit tests.

Changes:

Added a SYCL decode GEMV kernel supporting FP16/BF16, INT8/INT4/INT2 (sym/asym), and FP8 (E4M3/E5M2) weights.
Exposed the kernel via pybind (moe_gemm_decode) and added a Python wrapper with argument validation.
Added unit tests covering the new decode paths and key validation error cases.
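
As a rough mental model of what a grouped decode GEMV computes, here is a plain-Python reference, not the SYCL kernel. It assumes activation rows are stored contiguously by expert, that `num_tokens_per_expert[e]` is the row count for expert `e`, and that each expert's weight is an `[N, K]` matrix. The name `moe_gemm_decode_ref` and these layout details are illustrative assumptions, not the extension's confirmed API.

```python
from typing import List

Matrix = List[List[float]]


def moe_gemm_decode_ref(
    activations: Matrix,               # [total_tokens, K], rows grouped by expert (assumed)
    weights: List[Matrix],             # [E][N][K], one dense matrix per expert (assumed)
    num_tokens_per_expert: List[int],  # [E], rows consumed by each expert
) -> Matrix:
    """Hypothetical reference: out[t] = W[expert_of(t)] @ activations[t]."""
    if len(weights) != len(num_tokens_per_expert):
        raise ValueError("weights and num_tokens_per_expert must cover the same experts")
    if sum(num_tokens_per_expert) != len(activations):
        raise ValueError("token counts must sum to the number of activation rows")

    out: Matrix = []
    row = 0
    for expert, count in enumerate(num_tokens_per_expert):
        w = weights[expert]
        for _ in range(count):
            x = activations[row]
            # One GEMV per token: dot each weight row (length K) with the token.
            out.append([sum(xk * wk for xk, wk in zip(x, w_row)) for w_row in w])
            row += 1
    return out


if __name__ == "__main__":
    acts = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    w = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0], [2.0, 0.0]]]
    print(moe_gemm_decode_ref(acts, w, [1, 2]))
```

A reference like this is mainly useful as a test oracle: quantized paths can be checked by dequantizing the weights first and comparing against it within a tolerance.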

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| auto_round_extension/ark/test/test_moe.py | Adds decode-path unit tests plus packing/dequant reference helpers for INT2/4/8 and FP8. |
| auto_round_extension/ark/auto_round_kernel/wrapper/include/sycl_tla_moe_decode.hpp | Introduces the new SYCL MoE decode GEMV kernel implementations and dispatch. |
| auto_round_extension/ark/auto_round_kernel/wrapper/include/sycl_tla_common.hpp | Declares the new `moe_gemm_decode` API (but docs currently lag implementation). |
| auto_round_extension/ark/auto_round_kernel/ark.cpp | Includes the new header and binds `moe_gemm_decode` via pybind. |
| auto_round_extension/ark/auto_round_kernel/__init__.py | Adds the `ARK.moe_gemm_decode` Python wrapper and validation logic. |

Comments suppressed due to low confidence (2)

auto_round_extension/ark/auto_round_kernel/__init__.py:871

num_tokens_per_expert is converted to int32/contiguous but its device is not validated. If it’s a CPU tensor, the kernel will treat a host pointer as device memory. Please ensure num_tokens_per_expert is on XPU (and matches activations.device), or move it to XPU explicitly before calling into the extension.

            weights = weights.contiguous()

        if num_tokens_per_expert.dtype != torch.int32:
            num_tokens_per_expert = num_tokens_per_expert.to(torch.int32)
        if not num_tokens_per_expert.is_contiguous():

auto_round_extension/ark/auto_round_kernel/__init__.py:896

group_size is used in modulo/division checks (e.g., K % group_size) without validating group_size > 0. Passing group_size=0 will raise a ZeroDivisionError rather than a clear ValueError. Please add an explicit check that group_size is a positive integer before any modulo/division operations.

            if scales is None:
                raise ValueError("scales is required for FP8 weights")
            if scales.dtype != activations.dtype:
                raise ValueError("scales dtype must match activations dtype")
            if K % group_size != 0:
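
The explicit check asked for in the comment above could look roughly like the sketch below. The helper name `check_group_size` and the exact error messages are hypothetical; the only point is that the positivity test runs before any modulo or division, so `group_size=0` surfaces as a `ValueError` rather than a `ZeroDivisionError`.

```python
def check_group_size(K: int, group_size: int) -> int:
    """Hypothetical helper: validate group_size and return the number of K-groups."""
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(group_size, bool) or not isinstance(group_size, int):
        raise ValueError(f"group_size must be an int, got {type(group_size).__name__}")
    # Validate positivity BEFORE any modulo/division on group_size.
    if group_size <= 0:
        raise ValueError(f"group_size must be a positive integer, got {group_size}")
    if K % group_size != 0:
        raise ValueError(f"K ({K}) must be divisible by group_size ({group_size})")
    return K // group_size


if __name__ == "__main__":
    print(check_group_size(128, 32))
    try:
        check_group_size(128, 0)
    except ValueError as exc:
        print("rejected:", exc)
```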

- test_perf_int8_per_tensor mirrors test_perf_fp8_per_tensor with sym round-nearest-clamp int8 packing and [E] fp32 scales; native(ms) / dpas(ms) columns stay '--' since the Variant A DPAS entry point IS the ARK column for this scheme.
- test_accuracy_int8_per_tensor_dpas mirrors the FP8 counterpart under the standard _TOL_INT8 tolerance.
- Sync README_MOE_PREFILL_PERF.md + _CN.md with a new INT8 per-tensor section (per AGENTS.md CN-docs rule).
- Both tests skip silently when the moe_gemm_prefill_int_dpas pybind symbol is absent.
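
For readers unfamiliar with "sym round-nearest-clamp" packing with one scale per expert, a minimal sketch follows. It is not the test helper from this PR: the function name, the `amax / 127` scale rule, and the `[-127, 127]` clamp range are assumptions chosen for illustration, and Python's `round()` rounds halves to even, which may differ from the actual helper on exact ties.

```python
from typing import List, Tuple


def quantize_int8_sym_per_expert(
    weights: List[List[float]],  # [E][N*K], each expert's weights flattened (assumed)
) -> Tuple[List[List[int]], List[float]]:
    """Hypothetical sketch: symmetric int8 quantization with one scale per expert."""
    q_all: List[List[int]] = []
    scales: List[float] = []
    for w in weights:
        amax = max((abs(v) for v in w), default=0.0)
        # Assumed scale rule; an all-zero expert gets scale 1.0 to avoid dividing by zero.
        scale = amax / 127.0 if amax > 0.0 else 1.0
        # Round to nearest, then clamp into the symmetric int8 range.
        q = [max(-127, min(127, round(v / scale))) for v in w]
        q_all.append(q)
        scales.append(scale)
    return q_all, scales


def dequantize(q: List[int], scale: float) -> List[float]:
    """Inverse mapping used when checking accuracy against a float reference."""
    return [v * scale for v in q]


if __name__ == "__main__":
    q, s = quantize_int8_sym_per_expert([[1.0, -1.0, 0.0, 0.25], [0.0, 0.0]])
    print(q, s)
```

With a symmetric scheme there is no zero-point tensor, so the scales list of length E is the only side data the kernel needs.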

`sycl_tla_moe_prefill_int_dpas.hpp` opens its own `ark::moe_dpas_int` namespace and imports select names from `ark::moe_dpas_fp8`, but `make_moe_tensor` (defined in the FP8 header) was missing from the using-declarations. The three call sites in `MoEGEMM_int` therefore failed to resolve at compile time (unqualified lookup found nothing, and ADL cannot help because the arguments are pointers/ints). Add the missing `using ::ark::moe_dpas_fp8::make_moe_tensor;`.

Removes the host `wait()` between the asum precompute kernel and the INT8 asym DPAS grouped-GEMM submit in `moe_prefill_int_dpas_per_group_dispatch`. The asum event is now threaded into the DPAS submit via `cgh.depends_on(...)` on the same SYCL queue, so:

* device-side ordering is preserved (identical numerical results),
* one host round-trip per moe_gemm_prefill call is eliminated, and
* DPAS launch-prep (template inst, hw query, kernel props, arg marshalling) runs on the host in parallel with asum's device time.

Sym path is byte-identical: the new event parameter on `MoEGEMMLauncher_int` defaults to an empty (completed) event, and the sym launch site keeps its previous argument list.

Refs perf gap in `test_perf_int8[True-dtype0]` where asym dpas trails sym dpas by ~20% -- this closes the host-stall portion of that gap.

Full mainloop-fusion (Lever 2 as originally described) is deferred: it requires either DPAS w8a16 A-fragment lane-layout knowledge or an SLM-atomic accumulator whose contention profile needs to be measured on XPU hardware to confirm it isn't a regression vs. the current well-coalesced pre-pass kernel. Not landing that blind.

Roll back the INT8 asym DPAS path (perf regressed vs. dequant fallback on hardware). Add INT4-sym and INT2-sym prefill paths that upcast the packed weights into an int8_t [E, N, K] view inside the existing dequant workspace and dispatch through the same per-group INT8 DPAS mainloop the S8-sym branch uses, reusing the packed scale tensor unmodified.

chensuyue · 2026-07-01T22:14:43Z

/azp run Unit-Test-CUDA-AutoRound

azure-pipelines · 2026-07-01T22:14:50Z

Azure Pipelines could not run because the pipeline triggers exclude this branch/path.

Add packed-word decoders `decode_int4_octet` / `decode_int2_octet` and switch `launch_dequant_int{4,2}` and their sym→int8 upcast siblings to a one-work-item-per-word fast path (INT4: 4 bytes = 8 K outputs; INT2: 2 bytes = 8 K outputs), amortising packed-byte loads by 4×/2× and scale/zero loads by 4×/2× relative to the previous byte-per-item path.

* No numerics change: arithmetic still runs in fp32 and the octet decoders are thin `#pragma unroll` wrappers over the existing `decode_int4_pair` / `decode_int2_quad`, so results are bit-identical to the scalar path (verified exhaustively for the INT2 uint16 domain).
* Fast path is guarded by `K % 8 == 0 && group_size % 8 == 0` (upcast: only `K % 8 == 0`); the scalar byte-per-item path is retained unchanged as a fallback so short-K unit tests keep passing.
* Decode/GEMV path (`sycl_tla_moe_decode.hpp`), FP8/INT8 kernels, and the shared scalar decoders are untouched; decode↔prefill bit-parity is preserved by construction.
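
To illustrate why an octet decoder built on top of a pair decoder cannot change results, here is a small Python model of the INT4 case. The Python functions only borrow the names of the C++ helpers; the low-nibble-first order, little-endian byte order within the 32-bit word, and two's-complement nibble encoding are assumptions for this sketch and may not match the kernel's actual packing.

```python
from typing import List, Tuple


def _sign_extend_4bit(nibble: int) -> int:
    """Interpret a 4-bit value as two's complement (assumed encoding)."""
    return nibble - 16 if nibble >= 8 else nibble


def decode_int4_pair(byte: int) -> Tuple[int, int]:
    """Scalar path: one packed byte -> two int4 values, low nibble first (assumed)."""
    return _sign_extend_4bit(byte & 0xF), _sign_extend_4bit((byte >> 4) & 0xF)


def decode_int4_octet(word: int) -> List[int]:
    """Fast path: one 32-bit word -> eight int4 values.

    It is just four calls to the pair decoder, so its output is identical to
    decoding the same four bytes one at a time.
    """
    out: List[int] = []
    for i in range(4):  # bytes taken in little-endian order (assumed)
        out.extend(decode_int4_pair((word >> (8 * i)) & 0xFF))
    return out


if __name__ == "__main__":
    print(decode_int4_octet(0x76543210))
    print(decode_int4_octet(0xFEDCBA98))
```

Because the wide decoder only changes how many bytes one work-item loads, not the per-nibble arithmetic, parity with the scalar fallback follows from the structure rather than from testing alone.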

chensuyue · 2026-07-01T23:40:23Z

/azp run Unit-Test-CUDA-AutoRound

azure-pipelines · 2026-07-01T23:40:30Z

Azure Pipelines could not run because the pipeline triggers exclude this branch/path.

chensuyue · 2026-07-02T03:07:24Z

/azp run Unit-Test-CUDA-AutoRound

azure-pipelines · 2026-07-02T03:07:31Z

Azure Pipelines could not run because the pipeline triggers exclude this branch/path.

…del_perf

- Add ``--all-shapes`` pytest CLI flag in ark/test/conftest.py.
- test_moe_prefill_perf.py: default sweep restricted to the 2K rows (4 shapes). ``--all-shapes`` restores the full 12-row matrix; the existing ``--minimax-real-only`` flag still composes.
- test_moe_decode_perf.py: default sweep restricted to bs1 (2 shapes). ``--all-shapes`` re-adds the bs32 rows.
- Delete test/test_ark/test_moe_model_perf.py.
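
A flag of this kind is typically wired up through pytest's `pytest_addoption` hook. The sketch below shows one plausible shape for it, together with a suffix-based filter. The shape labels, the `_2K` suffix, and the `select_shapes` helper are hypothetical and are not taken from the PR's conftest.py.

```python
from typing import List


def pytest_addoption(parser) -> None:
    """conftest.py hook: pytest calls this at startup to register CLI options."""
    parser.addoption(
        "--all-shapes",
        action="store_true",
        default=False,
        help="Run the full shape matrix instead of the reduced default sweep.",
    )


def select_shapes(labels: List[str], all_shapes: bool, default_suffix: str = "_2K") -> List[str]:
    """Hypothetical filter: keep every label, or only those ending in default_suffix.

    Matching with endswith() instead of `in` avoids picking up labels that
    merely contain the suffix somewhere in the middle.
    """
    if all_shapes:
        return list(labels)
    return [label for label in labels if label.endswith(default_suffix)]


if __name__ == "__main__":
    labels = ["real_2K", "real_8K", "synthetic_2K_pad"]
    print(select_shapes(labels, all_shapes=False))
    print(select_shapes(labels, all_shapes=True))
```

Inside a test module the flag value would normally be read with `request.config.getoption("--all-shapes")` and passed to a filter like this before parametrizing.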

chensuyue · 2026-07-03T04:39:03Z

/azp run Unit-Test-CUDA-AutoRound

azure-pipelines · 2026-07-03T04:39:09Z

Azure Pipelines could not run because the pipeline triggers exclude this branch/path.

Copilot AI and others added 2 commits May 14, 2026 04:03

Add XPU MoE decode kernel with INT4 sym/asym and FP16/BF16 baselines

58b0900

Agent-Logs-Url: https://github.com/intel/auto-round/sessions/95841e6d-d5d1-4662-8db0-4dd69690bc28 Co-authored-by: a32543254 <53296245+a32543254@users.noreply.github.com>

Document int4 sign-extension trick

527eede

Agent-Logs-Url: https://github.com/intel/auto-round/sessions/95841e6d-d5d1-4662-8db0-4dd69690bc28 Co-authored-by: a32543254 <53296245+a32543254@users.noreply.github.com>

Copilot AI assigned Copilot and a32543254 May 14, 2026

Copilot created this pull request from a session on behalf of a32543254 May 14, 2026 04:27

[pre-commit.ci] auto fixes from pre-commit.com hooks

78ecc0c

for more information, see https://pre-commit.ci

Copilot started work on behalf of a32543254 May 14, 2026 06:14

Copilot finished work on behalf of a32543254 May 14, 2026 06:30

Copilot AI requested a review from a32543254 May 14, 2026 06:30

Copilot started work on behalf of a32543254 May 14, 2026 06:56

Copilot finished work on behalf of a32543254 May 14, 2026 07:00

Copilot started work on behalf of a32543254 May 14, 2026 07:04

Copilot AI and others added 2 commits May 14, 2026 07:16

Add INT8/INT2/FP8 decode MoE GEMV kernels and tests

5dc9d95

Agent-Logs-Url: https://github.com/intel/auto-round/sessions/91221649-2c90-4404-ae86-3321b1581428 Co-authored-by: a32543254 <53296245+a32543254@users.noreply.github.com>

docs: clarify int2 bit-indexing notation in moe_gemm_decode

f15093a

Agent-Logs-Url: https://github.com/intel/auto-round/sessions/91221649-2c90-4404-ae86-3321b1581428 Co-authored-by: a32543254 <53296245+a32543254@users.noreply.github.com>

Copilot finished work on behalf of a32543254 May 14, 2026 07:20

a32543254 changed the title ~~Add XPU MoE decode kernel (FP16/BF16 + INT4 sym/asym)~~ Add XPU MoE decode kernel (FP16/BF16 + INT2/4/8 sym/asym + FP8) May 15, 2026

a32543254 requested a review from luoyu-intel May 18, 2026 15:33

a32543254 marked this pull request as ready for review May 18, 2026 15:33

Copilot AI review requested due to automatic review settings May 18, 2026 15:33

Copilot started reviewing on behalf of a32543254 May 18, 2026 15:34

Copilot started work on behalf of a32543254 May 18, 2026 15:35

Merge remote-tracking branch 'origin/main' into copilot/add-xpu-moe-d…

430868d

…ecode-implementation # Conflicts: # auto_round_extension/ark/auto_round_kernel/ark.cpp Co-authored-by: a32543254 <53296245+a32543254@users.noreply.github.com>

Copilot finished work on behalf of a32543254 May 18, 2026 15:37

Copilot AI reviewed May 18, 2026

Copilot started work on behalf of a32543254 May 19, 2026 06:19

test: add perf comparison UT — moe_gemm_decode vs default XPU MoE

4395884

Agent-Logs-Url: https://github.com/intel/auto-round/sessions/132db2ab-85c0-45b6-81a7-b9baaa533e5e Co-authored-by: a32543254 <53296245+a32543254@users.noreply.github.com>

Copilot finished work on behalf of a32543254 May 19, 2026 06:23

Copilot started work on behalf of a32543254 May 20, 2026 06:40

Copilot AI added 10 commits July 1, 2026 18:13

test(moe): drop redundant 'int8' from perf header string

8ffbb5c

Header now reads 'INT8 per-expert scale (scales=[E] fp32, ...)' instead of 'INT8 per-expert scale int8 (...)'. Cosmetic fix from code review.

docs: add INT8 per-K-group DPAS section to README (EN + CN)

414b897

address review: drop no-op finally blocks and unused SG_K constant

9616dc8

feat(ark): implement INT8 asym per-K-group DPAS grouped GEMM

062b7a2

Revert "perf(ark): overlap asym asum pre-pass with DPAS launch (Lever…

152daf8

… 1)" This reverts commit 0363ace.

test(ark): populate dpas column in INT4 MoE prefill perf

827b7ff

Copilot AI added 2 commits July 1, 2026 22:20

feat(ark): S4-sym single-pass DPAS mixed-input mainloop

27babb3

Copilot AI added 5 commits July 2, 2026 01:52

test(ark): add dpas-vs-baseline speedup column to MoE prefill perf table

1d3b507

test(ark): drop env-var toggling in prefill perf; restrict to "minima…

3da3d0a

…x real" shapes

Revert "test(ark): drop env-var toggling in prefill perf; restrict to…

88f78a3

… "minimax real" shapes" This reverts commit 3da3d0a.

test(ark): add --minimax-real-only pytest flag to filter prefill perf…

43b1649

… shapes

test(ark): drop base+deq(ms) column from MoE prefill perf table

7cbc91e

Copilot AI added 2 commits July 3, 2026 03:43

review: tighten shape-label filters to endswith to avoid substring fa…

2b972f6

…lse positives

a32543254 and others added 3 commits July 3, 2026 14:40

fix the pre commit

0b5bb18

Signed-off-by: Dong, Bo1 <bo1.dong@intel.com>

Merge branch 'main' into copilot/add-xpu-moe-decode-implementation

7542c9c

Merge branch 'main' into copilot/add-xpu-moe-decode-implementation

2ebe276
