feat: optimize MPI communication with non-blocking operations in eigenvalue solvers #7401

Open
laoba657 wants to merge 22 commits into deepmodeling:develop from laoba657:feature/mpi-optimization

@laoba657

Summary

Optimize MPI communication in eigenvalue solvers by replacing blocking MPI calls with non-blocking alternatives.

Changes

New files:

  • source/source_hsolver/mpi_comm_helper.h — MPI request tracker and non-blocking communication helpers
  • source/source_hsolver/test/diago_mpi_test.cpp — 6 MPI unit tests
  • source/source_hsolver/test/diago_mpi_parallel_test.sh — automated multi-process test script

Modified files:

  • diago_david.cpp — non-blocking reduce in cal_elem; single MPI_Ibcast replaces per-band loop in diag_zhegvx
  • diago_dav_subspace.cpp — same optimizations
  • diago_iter_assist.cpp — simultaneous non-blocking reduce for hcc and scc
  • para_linear_transform.cpp — non-blocking send/recv with compute-communication overlap
  • test/CMakeLists.txt — new test target
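
As a rough illustration of the compute-communication overlap listed for para_linear_transform.cpp, the sketch below posts a non-blocking receive and send, does independent local work while they are in flight, and only then waits. It is not the PR's code: `Comm`, `ring_exchange_overlapped`, the ring topology, and the "scale by 2" placeholder computation are all assumptions made for the example.

```cpp
#include <cstddef>
#include <vector>
#ifdef __MPI
#include <mpi.h>
#endif

// Hypothetical wrapper so the same signature works in MPI and serial builds.
struct Comm
{
#ifdef __MPI
    MPI_Comm handle = MPI_COMM_WORLD;
#endif
};

// Pass `send` to the right-hand neighbour on a ring while scaling `local`.
inline void ring_exchange_overlapped(const std::vector<double>& send,
                                     std::vector<double>& recv,
                                     std::vector<double>& local,
                                     const Comm& comm)
{
    recv.resize(send.size());
#ifdef __MPI
    int rank = 0;
    int size = 1;
    MPI_Comm_rank(comm.handle, &rank);
    MPI_Comm_size(comm.handle, &size);
    const int right = (rank + 1) % size;
    const int left = (rank + size - 1) % size;
    const int count = static_cast<int>(send.size());

    MPI_Request reqs[2];
    // Post the receive first, then the send; neither call blocks.
    MPI_Irecv(recv.data(), count, MPI_DOUBLE, left, 0, comm.handle, &reqs[0]);
    MPI_Isend(send.data(), count, MPI_DOUBLE, right, 0, comm.handle, &reqs[1]);
#else
    (void)comm;
    recv = send; // serial fallback: a one-process ring sends to itself
#endif

    // Placeholder local work that touches neither `send` nor `recv`,
    // so it is safe to run while the messages are in flight.
    for (double& x : local)
    {
        x *= 2.0;
    }

#ifdef __MPI
    // `recv` must not be read and `send` must not be modified before this.
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
#endif
}
```

Whether the overlap pays off depends on how much independent work is available between posting the requests and waiting on them, and on whether the MPI library makes progress in the background.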

Key optimizations

| Pattern | Before | After |
|---|---|---|
| Broadcast | N × blocking MPI_Bcast (per band) | 1 × non-blocking MPI_Ibcast (entire block) |
| Reduce | 2 × blocking MPI_Allreduce (serial) | 2 × non-blocking MPI_Iallreduce (concurrent) |
| Linear transform | Blocking send → compute → blocking recv | Non-blocking send + compute (overlapped) + non-blocking recv |

All MPI code is guarded by #ifdef __MPI with no-op fallback for serial builds.

…nvalue solvers

- Add MPIRequestTracker and MPICommHelper for non-blocking MPI patterns
- Replace per-band blocking MPI_Bcast with single MPI_Ibcast in diag_zhegvx
- Replace blocking reduce_pool with non-blocking MPI_Iallreduce in cal_elem
- Add non-blocking send/recv with compute-communication overlap in PLinearTransform
- Add CommStrategy enum with adaptive selection based on problem size
- Add MPI unit tests (correctness, consistency, error handling, performance)
- Add MPI parallel test script for automated multi-process testing
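
The commit message mentions a CommStrategy enum with adaptive selection based on problem size but does not show it. One possible shape, purely as an assumption, is sketched below: the enumerator names, the `choose_strategy` function, and the 64 KiB threshold are invented for illustration and are not taken from the PR.

```cpp
#include <cstddef>

// Hypothetical enumerators; the PR's actual CommStrategy may differ.
enum class CommStrategy
{
    Blocking,
    NonBlocking
};

// Pick a strategy from the message size and process count.
// The default threshold is an arbitrary placeholder, not a measured value.
inline CommStrategy choose_strategy(std::size_t message_bytes,
                                    int nproc,
                                    std::size_t threshold_bytes = 64 * 1024)
{
    // With a single process there is nothing to overlap.
    if (nproc <= 1)
    {
        return CommStrategy::Blocking;
    }
    // Small messages are dominated by per-request overhead.
    return message_bytes >= threshold_bytes ? CommStrategy::NonBlocking
                                            : CommStrategy::Blocking;
}
```

Any real threshold would need to be tuned per machine and interconnect.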
@laoba657 laoba657 force-pushed the feature/mpi-optimization branch from ecf98e8 to 08a605a May 30, 2026 09:43
laoba657 added 21 commits May 30, 2026 17:51
Replace typed wrappers (nbcast_complex, nreduce_pool_complex) with
generic nbcast<T> and nreduce_pool<T> that use mpi_type<T> trait
to select the correct MPI_Datatype. This fixes compilation errors
when template T is double (real-valued instantiation).

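
A trait of the kind this commit message describes could look roughly like the sketch below. Only the name `mpi_type<T>` comes from the commit message; the `get()` member, the set of specializations, and the serial stand-in constants are assumptions made so that the example also compiles without MPI.

```cpp
#include <complex>
#ifdef __MPI
#include <mpi.h>
#else
// Stand-ins so the sketch compiles in a serial build (illustration only).
using MPI_Datatype = int;
const MPI_Datatype MPI_FLOAT = 1;
const MPI_Datatype MPI_DOUBLE = 2;
const MPI_Datatype MPI_CXX_FLOAT_COMPLEX = 3;
const MPI_Datatype MPI_CXX_DOUBLE_COMPLEX = 4;
#endif

// Primary template is declared but not defined, so an unsupported T
// fails at compile time instead of silently picking a wrong datatype.
template <typename T>
struct mpi_type;

template <>
struct mpi_type<float>
{
    static MPI_Datatype get() { return MPI_FLOAT; }
};

template <>
struct mpi_type<double>
{
    static MPI_Datatype get() { return MPI_DOUBLE; }
};

template <>
struct mpi_type<std::complex<float>>
{
    static MPI_Datatype get() { return MPI_CXX_FLOAT_COMPLEX; }
};

template <>
struct mpi_type<std::complex<double>>
{
    static MPI_Datatype get() { return MPI_CXX_DOUBLE_COMPLEX; }
};
```

A generic wrapper templated on T can then pass `mpi_type<T>::get()` as the datatype argument, so the real-valued and complex instantiations share one code path. `get()` is a plain function rather than a `constexpr` value because some MPI implementations define datatype handles as pointers to runtime objects.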
diago_david.cpp accidentally contained the diag_mixed_precision
function and the PrecisionMode dispatch block from the mixed-precision
project. These are now removed; only the MPI non-blocking communication
changes remain.

MPI_Iallreduce + immediate MPI_Waitall is equivalent to blocking
MPI_Allreduce but can deadlock in single-process CI. Replace with
direct blocking calls (MPI_Allreduce, MPI_Bcast) which are simpler
and provably correct.
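
For reference, the two forms this commit message compares can be written side by side. The sketch assumes two real-valued buffers reduced in place over one communicator; the function names and the `Comm` wrapper are hypothetical, and the buffer names hcc and scc are borrowed from the PR description only as labels.

```cpp
#ifdef __MPI
#include <mpi.h>
#endif

// Hypothetical wrapper so the same signature works in MPI and serial builds.
struct Comm
{
#ifdef __MPI
    MPI_Comm handle = MPI_COMM_WORLD;
#endif
};

// Two blocking reductions, one after the other.
inline void reduce_two_blocking(double* hcc, double* scc, int n, const Comm& comm)
{
#ifdef __MPI
    MPI_Allreduce(MPI_IN_PLACE, hcc, n, MPI_DOUBLE, MPI_SUM, comm.handle);
    MPI_Allreduce(MPI_IN_PLACE, scc, n, MPI_DOUBLE, MPI_SUM, comm.handle);
#else
    // Serial build: each buffer already holds the full sum.
    (void)hcc; (void)scc; (void)n; (void)comm;
#endif
}

// Two non-blocking reductions started together, completed by one wait.
// With no work between the start and the wait, the result matches the
// blocking version.
inline void reduce_two_nonblocking(double* hcc, double* scc, int n, const Comm& comm)
{
#ifdef __MPI
    MPI_Request reqs[2];
    MPI_Iallreduce(MPI_IN_PLACE, hcc, n, MPI_DOUBLE, MPI_SUM, comm.handle, &reqs[0]);
    MPI_Iallreduce(MPI_IN_PLACE, scc, n, MPI_DOUBLE, MPI_SUM, comm.handle, &reqs[1]);
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
#else
    // Serial build: each buffer already holds the full sum.
    (void)hcc; (void)scc; (void)n; (void)comm;
#endif
}
```

All ranks must start the two collectives in the same order on the same communicator; the non-blocking form does not relax that requirement.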
@mohanchen
Collaborator

This PR presents a really interesting idea. Could you demonstrate that this optimization improves parallel efficiency? You may use the runtime results of benchmark cases for illustration.

@mohanchen mohanchen added the Diago (Issues related to diagonalization methods) and project_learning labels May 31, 2026
@laoba657
Author

laoba657 commented May 31, 2026

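Performance test results for the non-blocking MPI optimization

Test environment

  • CPU: 4 cores, shared memory
  • MPI: Intel MPI 2021.13
  • Compiler: GCC 11
  • Each configuration was run 50 times and the mean is reported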
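
The "repeat 50 times and average" procedure can be sketched as a small timing helper. This is an assumed harness, not the one used to produce the numbers in this comment; `mean_time_ms` is a hypothetical name.

```cpp
#include <chrono>

// Run `op` `repeats` times and return the mean wall-clock time in ms.
template <typename F>
double mean_time_ms(F&& op, int repeats = 50)
{
    using clock = std::chrono::steady_clock;
    double total_ms = 0.0;
    for (int i = 0; i < repeats; ++i)
    {
        const auto t0 = clock::now();
        op();
        const auto t1 = clock::now();
        total_ms += std::chrono::duration<double, std::milli>(t1 - t0).count();
    }
    return total_ms / repeats;
}
```

In an MPI run it is common to place an `MPI_Barrier` before each timed call so that all ranks start together, and to report the slowest rank rather than rank 0 alone; neither detail is stated in the comment.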

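Measured results

VCC broadcast (per-band Bcast → single Ibcast)

| nband | np=1 blocking | np=1 non-blocking | np=4 blocking | np=4 non-blocking |
|---|---|---|---|---|
| 64 | 0.001 ms | ~0 ms | 0.087 ms | 0.122 ms |
| 128 | 0.003 ms | ~0 ms | 0.189 ms | 0.448 ms |

The speedup at np=1 only reflects the removal of empty function calls; no real multi-process communication is involved. At np≥2, the non-blocking version is actually slower because of the extra overhead of MPI_Request allocation and progress-engine polling.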

Dual Allreduce (serial Allreduce → concurrent Iallreduce)

| nband | np=4 blocking | np=4 non-blocking |
|---|---|---|
| 64 | 0.401 ms | 0.395 ms |
| 128 | 0.818 ms | 0.850 ms |
| 192 | 1.617 ms | 1.771 ms |

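Conclusion

In the current single-node shared-memory environment, blocking MPI is already fast enough. The extra overhead of the non-blocking calls dominates, and no clear gain was observed at the communication level.

The change still has some value:

  1. It removes the per-band broadcast loop in diag_zhegvx(), which makes the code logic clearer
  2. The MPIRequestTracker framework provides a basis for implementing communication-computation overlap later
  3. On a multi-node cluster with real network latency, issuing the Iallreduce calls concurrently may allow latency hiding

If end-to-end speedup data is needed, I suggest running a comparison on an InfiniBand cluster with the Si PW case in tests/performance/.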
