
[Bug]: Triton MXFP4 MoE device capability check < (11, 0) breaks RDNA3.5 (gfx1151) support #40301

@kyuz0

Your current environment

System Info

OS: Linux (e.g., Fedora 43)
Hardware: AMD Strix Halo APU (gfx1151 / RDNA 3.5)
vLLM version: v0.19.2 (and recent nightlies/main)
Model: openai/gpt-oss-20b (or any gpt_oss_mxfp4 quantized MoE model)

🐛 Describe the bug

I am trying to run vLLM on an AMD Strix Halo (gfx1151) using ROCm. The environment is properly configured to compile Triton kernels. Previously, gpt-oss-20b (which initializes using gpt_oss_mxfp4 quantization) worked perfectly fine and used the Triton MXFP4 MoE backend as expected.

However, a recent update explicitly bounded the device_capability checks for the Triton MoE kernels to < (11, 0).

  • In vllm/model_executor/layers/fused_moe/experts/gpt_oss_triton_kernels_moe.py:
    def _supports_current_device() -> bool:
        ...
        return (9, 0) <= (cap.major, cap.minor) < (11, 0)
  • In vllm/model_executor/layers/fused_moe/oracle/mxfp4.py:
    triton_kernels_supported = has_triton_kernels() and (
        9,
        0,
    ) <= current_platform.get_device_capability() < (11, 0)

Because vLLM maps gfx1151 to a device capability of (11, 5), the < (11, 0) check fails for the entire RDNA3/RDNA3.5 family. As a result, the backend oracle rejects the Triton kernels, finds no other MXFP4 backend for ROCm, and crashes with:

NotImplementedError: No MXFP4 MoE backend supports the deployment configuration.
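
To illustrate why (11, 5) falls outside the range, here is a minimal sketch using plain Python tuples. The helper `in_supported_range` is hypothetical and only mirrors the comparison quoted above; it is not vLLM code, and the (11, 5) value is the mapping reported in this issue.

```python
# Hypothetical helper mirroring the bounded check quoted above; not vLLM code.
def in_supported_range(cap, upper=(11, 0)):
    # Python compares tuples lexicographically: major first, then minor.
    return (9, 0) <= cap < upper

gfx1151 = (11, 5)  # capability this issue reports for gfx1151

current = in_supported_range(gfx1151)           # current upper bound (11, 0)
widened = in_supported_range(gfx1151, (12, 0))  # proposed upper bound (12, 0)
print(current, widened)
```

With the current bound the comparison rejects (11, 5) because 5 is not less than 0 once the major versions tie; with the proposed (12, 0) bound it is accepted.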

Could this check please be widened to (9, 0) <= cap < (12, 0) to allow RDNA3 architectures? Or was there a specific hardware-level bug on Blackwell or future architectures that necessitated the hard < (11, 0) upper bound?


Metadata

Labels: bug (Something isn't working), rocm (Related to AMD ROCm)
