Parallelization along K-dimension (Parallel Reduction) for GEMM with small M/N and large K #5629

@seulchanson

Hi OpenBLAS team,

I noticed that zgemm (and other GEMM functions) falls back to single-threaded execution when M and N are small (e.g., 32) but K is extremely large (e.g., 1,000,000).

On my many-core system, this leaves most cores idle. Because K is so large, parallelizing the K-loop (a parallel reduction over partial products) should in theory give a significant speedup. As a workaround, I partition the matrices along K myself and call zgemm on each slice from multiple threads, but the resulting performance is only mediocre.
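For reference, the external K-partitioning described above can be sketched roughly as follows. This is only a minimal illustration of the decomposition, not OpenBLAS code: `partial_gemm` is a hypothetical stand-in for one single-threaded zgemm call on a K-slice, `gemm_split_k` and `num_chunks` are names made up for this sketch, and the matrices are plain Python lists of complex numbers. It relies on the identity C = A_1 B_1 + A_2 B_2 + ... + A_P B_P, where A_p holds a block of columns of A and B_p the matching block of rows of B.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_gemm(A, B, k0, k1):
    """Hypothetical stand-in for one single-threaded zgemm call.

    Returns the M x N partial product of A[:, k0:k1] and B[k0:k1, :].
    """
    m, n = len(A), len(B[0])
    C = [[0j] * n for _ in range(m)]
    for i in range(m):
        row_c = C[i]
        for p in range(k0, k1):
            a = A[i][p]
            row_b = B[p]
            for j in range(n):
                row_c[j] += a * row_b[j]
    return C


def gemm_split_k(A, B, num_chunks):
    """Compute A @ B by splitting K into slices and summing the partials."""
    m, n, k = len(A), len(B[0]), len(B)
    # Nearly equal slice boundaries covering [0, k).
    bounds = [k * c // num_chunks for c in range(num_chunks + 1)]

    # Each worker writes to its own private M x N buffer, so no locking
    # is needed while the partial products are being computed.
    with ThreadPoolExecutor(max_workers=num_chunks) as pool:
        partials = list(
            pool.map(
                lambda c: partial_gemm(A, B, bounds[c], bounds[c + 1]),
                range(num_chunks),
            )
        )

    # Reduction step: sum the private buffers into the final result.
    C = [[0j] * n for _ in range(m)]
    for P in partials:
        for i in range(m):
            for j in range(n):
                C[i][j] += P[i][j]
    return C
```

The extra cost of this scheme is one M x N buffer per slice plus the final summation, which is small when M and N are around 32. Pure-Python threads do not run this arithmetic in parallel, so the sketch shows only the structure. In a real setup each slice would be a zgemm call, presumably with OpenBLAS's own threading limited to one thread per call (for example via `openblas_set_num_threads(1)`) to avoid oversubscribing the cores.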

Questions:

Does OpenBLAS currently support threading along the K-dimension for this shape?

If not, are there any plans to implement parallel reduction for large K?

My Machine Info:

[Two screenshots of machine information; images not preserved]

Thanks!
