perf(gint): shape-exact bucketing + tile ladder + wide-LDS vbatched GEMM by dzzz2001 · Pull Request #7395 · deepmodeling/abacus-develop

dzzz2001 · 2026-05-29T09:58:55Z

Reminder

Have you linked an issue with this pull request?
Have you added adequate unit tests and/or case tests for your pull request?
Have you noticed possible changes of behavior below or in the linked issue?
Have you explained the changes of codes in core modules of ESolver, HSolver, ElecState, Hamilt, Operator or Psi? (ignore if not applicable)

Linked Issue

No issue. This is a self-contained performance optimization of the GPU gint batched-GEMM path.

What's changed?

Optimizes the GPU gint batched-GEMM path (gemm_{nn,tn}_vbatch, driven from phi_mul_phi / phi_mul_dm) for FP64 on V100/A100-class GPUs. Three commits:

perf(gint): shape-exact bucketing + tile ladder + wide-LDS vbatched GEMM
- phi_operator_gpu: replace the single max-shape vbatch launch with shape-exact bucketing — atom pairs are grouped by (nw1, nw2) via a counting-sort table (pre-enumerated once per batch), so each bucket hands the kernel a scalar (m, n, k) and the tile ladder picks the tightest tile per shape. No cross-species tile waste, no over-launched blocks.
- dgemm_vbatch: scalar (m, n, k) dispatch (drops the per-batchid M/N/K device arrays) feeding a 4×2 (NN) / 4×4 (TN) BLK_{M,N} ladder over {8, 16, 32, 48}.
- gemm_{nn,tn}_vbatch: K-inner shared-memory layout + wide (double2/float4) load inner loop — one 16-byte load feeds VK FMAs per (m, n); PAD keeps the shmem stride 16-byte aligned and spreads the warp's strided reads across banks.
refactor(gint): derive shape-bucket stride from ucell.nwmax — drop the hardcoded NW_MAX = 64 cap (both a magic number and an artificial ceiling that would abort() for nw > 64); size the bucket table exactly to the basis via gint_gpu_vars_->nwmax. The counting-sort tables move to std::vector members allocated once and re-zeroed per call.
refactor(gint): clarify GEMM kernel comments, hoist shape-bucket struct — rewrite kernel comments to describe the actual mechanism (dropping internal "V1/V3/Phase" development shorthand), hoist the duplicated local Bucket struct to a named GemmShapeBucket type, and reuse one buckets_ member vector across both passes. No behavior change.
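
The shape-exact bucketing described above can be sketched on the host as a stable counting sort keyed on (nw1, nw2). This is a minimal illustration, not the code in phi_operator_gpu: `AtomPair`, `ShapeBucket`, `BucketedPairs` and `bucket_by_shape` are hypothetical names, and the real implementation keeps its tables as members rather than locals.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-ins; the real types in phi_operator_gpu may differ.
struct AtomPair { int nw1; int nw2; };

// One bucket = one exact (nw1, nw2) shape and its slice of `order`.
struct ShapeBucket { int nw1; int nw2; int begin; int count; };

struct BucketedPairs {
    std::vector<int> order;            // pair indices, grouped by shape
    std::vector<ShapeBucket> buckets;  // one entry per non-empty shape
};

// Stable counting sort over a dense (nwmax + 1) x (nwmax + 1) key table.
BucketedPairs bucket_by_shape(const std::vector<AtomPair>& pairs, int nwmax)
{
    const int stride = nwmax + 1;  // key = nw1 * stride + nw2
    const int nkeys = stride * stride;
    std::vector<int> counts(static_cast<std::size_t>(nkeys), 0);
    std::vector<int> base(static_cast<std::size_t>(nkeys), 0);

    // Histogram of shapes.
    for (const AtomPair& p : pairs) {
        assert(p.nw1 >= 0 && p.nw1 <= nwmax && p.nw2 >= 0 && p.nw2 <= nwmax);
        ++counts[p.nw1 * stride + p.nw2];
    }

    // Exclusive prefix sum gives each shape its start offset.
    BucketedPairs out;
    int running = 0;
    for (int key = 0; key < nkeys; ++key) {
        base[key] = running;
        if (counts[key] > 0) {
            out.buckets.push_back({key / stride, key % stride, running, counts[key]});
        }
        running += counts[key];
    }

    // Scatter pair indices to their slots; input order is kept within a shape.
    std::vector<int> cursor = base;
    out.order.resize(pairs.size());
    for (int i = 0; i < static_cast<int>(pairs.size()); ++i) {
        const int key = pairs[i].nw1 * stride + pairs[i].nw2;
        out.order[cursor[key]++] = i;
    }
    return out;
}
```

Each bucket then corresponds to one launch with a scalar shape: every pair in `order[begin .. begin + count)` shares the same (nw1, nw2).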

Validation

FP64 15-case GPU benchmark: end-to-end ~1.05× (A800) / ~1.04× (V100), with cal_gint_vl up to ~1.5× and cal_gint_rho up to ~1.65×. Energies and pressures match develop to ~1e-10 on every case. No new unit tests added — the path has no behavioral change; validation is via the existing LCAO case suite plus the GPU benchmark above.

Any changes of core modules? (ignore if not applicable)

No. Changes are confined to source/source_lcao/module_gint/kernel/ (the GPU gint GEMM path). No public API change and no change to numerical results.

Optimize the GPU gint batched-GEMM path (gemm_{nn,tn}_vbatch, driven from phi_mul_phi / phi_mul_dm) for FP64 on V100/A100-class GPUs.

- phi_operator_gpu: replace the single max-shape vbatch launch with shape-exact bucketing. Atom pairs are grouped by (nw1, nw2) via a dense NW_MAX*NW_MAX counting-sort table, pre-enumerated once per batch in set_bgrid_batch, so each bucket hands the kernel a scalar (m, n, k) and the tile ladder picks the tightest tile per shape -- no cross-species tile waste, no over-launched blocks. A guard aborts if any atom nw >= NW_MAX.
- dgemm_vbatch: scalar (m, n, k) dispatch (drops the per-batchid M/N/K device arrays) feeding a 4x2 (NN) / 4x4 (TN) BLK_{M,N} ladder over {8,16,32,48}.
- gemm_{nn,tn}_vbatch: K-inner shared-memory layout + wide (double2/float4) LDS inner loop -- one 16-byte LDS feeds VK FMAs per (m,n); PAD keeps the shmem stride 16-byte aligned and warp access bank-conflict-free.

C accumulators stay double regardless of input type T, preserving the mixed-precision fp64-accumulator fix (deepmodeling#7368); the phi_operator kernel optimizations from deepmodeling#7366 (WantPhi dispatch, single-warp reduce) are retained.

FP64 15-case GPU benchmark: end-to-end ~1.05x (A800) / ~1.04x (V100), with cal_gint_vl up to ~1.5x and cal_gint_rho up to ~1.65x; energies and pressures match develop to ~1e-10 on every case.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
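
To make the tile-ladder idea concrete, here is a rough host-side sketch of picking a tile edge from the {8, 16, 32, 48} rungs named in this commit. The selection rule (smallest rung that covers the dimension, else the largest rung) is an assumption about what "tightest tile" means, and `pick_tile` / `num_tiles` are hypothetical helpers. The actual dispatch in dgemm_vbatch uses separate BLK_M and BLK_N ladders for NN and TN, which this sketch does not model.

```cpp
#include <cassert>

// Rungs named in the commit message.
const int kTileLadder[] = {8, 16, 32, 48};
const int kNumRungs = 4;

// Smallest rung that covers `dim` with a single tile; if no rung is large
// enough, fall back to the largest rung and let several tiles cover it.
inline int pick_tile(int dim)
{
    assert(dim > 0);
    for (int i = 0; i < kNumRungs; ++i) {
        if (dim <= kTileLadder[i]) {
            return kTileLadder[i];
        }
    }
    return kTileLadder[kNumRungs - 1];
}

// Number of tiles of edge `blk` needed to cover `dim` (ceiling division).
inline int num_tiles(int dim, int blk)
{
    assert(dim > 0 && blk > 0);
    return (dim + blk - 1) / blk;
}
```

Under this rule a 13 x 13 pair gets a 16 x 16 tile and exactly one block, whereas a launch sized for a larger species in the same batch would give it a bigger tile or extra blocks that do no useful work.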

…dcoded NW_MAX

The (nw1, nw2) shape-bucketing in phi_mul_phi / phi_mul_dm flattened pairs into a dense table key via `nw1 * NW_MAX + nw2`, with NW_MAX a hardcoded 64. That was both a magic number and an artificial ceiling: a basis with nw > 64 would abort(), and 64 was only a guess at the real max.

The true upper bound is already known to the code as ucell.nwmax (max orbital count over all atom types), exposed via gint_gpu_vars_->nwmax. Use it: set nw_stride_ = nwmax + 1 once in the ctor so the bucket table is sized exactly to the basis -- no cap to maintain.

A runtime stride can't index std::array<int, NW_MAX*NW_MAX>, so the three counting-sort tables (counts / base / cursor) move to mutable std::vector members allocated once and re-zeroed per call. For typical nwmax~25 that's ~676 ints vs the old fixed 4096, so the hot path zeroes less and never reallocates.

The set_bgrid_batch() abort guard becomes a structurally-unreachable assert, since nwmax is by definition the largest nw. Drop now-unused includes (<array>, <cstdio>, <cstdlib>); add <cassert>.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
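
A minimal sketch of the runtime-stride table this commit describes, assuming a hypothetical `ShapeTables` class that holds only the counts table (the real code also keeps base and cursor tables, and its member layout may differ). The table is allocated once in the constructor and re-zeroed with std::fill on each call, so no allocation happens on the hot path.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical holder for one counting-sort table keyed on (nw1, nw2).
class ShapeTables {
public:
    // nwmax is the largest orbital count over all atom types.
    explicit ShapeTables(int nwmax)
        : nw_stride_(nwmax + 1),
          counts_(static_cast<std::size_t>(nw_stride_) * nw_stride_, 0) {}

    // Flatten (nw1, nw2) into a dense key; valid for 0 <= nw <= nwmax.
    int key(int nw1, int nw2) const
    {
        assert(nw1 >= 0 && nw1 < nw_stride_);
        assert(nw2 >= 0 && nw2 < nw_stride_);
        return nw1 * nw_stride_ + nw2;
    }

    // Re-zero in place; the vector keeps its size and its allocation.
    void reset() { std::fill(counts_.begin(), counts_.end(), 0); }

    void add(int nw1, int nw2) { ++counts_[key(nw1, nw2)]; }
    int count(int nw1, int nw2) const { return counts_[key(nw1, nw2)]; }
    std::size_t table_size() const { return counts_.size(); }

private:
    int nw_stride_;            // declared first: counts_ is sized from it
    std::vector<int> counts_;
};
```

With nwmax = 25 the table holds 26 * 26 = 676 entries, matching the figure quoted above.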

Follow-up cleanup on the shape-exact vbatched GEMM path. No behavior change.

- gemm_{nn,tn}_vbatch, dgemm_vbatch, gint_helper: rewrite the kernel comments to describe the actual mechanism (K-inner shared-memory layout, wide vector loads feeding VK FMAs per load, the tile ladder, fp64 cross-item accumulation) and drop the internal "V1/V3/Phase" development shorthand that carried no meaning outside the original work log.
- phi_operator_gpu: the local `Bucket` struct was declared identically inside both phi_mul_phi and phi_mul_dm. Hoist it to a named GemmShapeBucket type and reuse a single buckets_ member vector (cleared, not reallocated) across both, reserved once in the ctor -- one less per-call heap allocation on the hot path.
- phi_operator_gpu: pair_scratch_offset_ is fully overwritten in Pass 1 before Pass 2 reads it, so resize() it instead of assign(..., -1); the -1 sentinel was never observed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
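
The "cleared, not reallocated" reuse of buckets_ can be illustrated with a small sketch. `Bucket` here has made-up fields and `BucketScratch` is a hypothetical wrapper; the point is only that std::vector::clear() drops the elements while keeping the capacity reserved in the constructor, so refilling the vector on the next pass does not touch the heap as long as it stays within that capacity.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical bucket record; the fields of the real GemmShapeBucket are
// not shown in this PR description.
struct Bucket { int m; int n; int k; };

class BucketScratch {
public:
    // Reserve once up front, e.g. for the maximum number of distinct shapes.
    explicit BucketScratch(std::size_t max_buckets) { buckets_.reserve(max_buckets); }

    // Start a new pass: size drops to zero, capacity is left unchanged.
    std::vector<Bucket>& begin_pass()
    {
        buckets_.clear();
        return buckets_;
    }

private:
    std::vector<Bucket> buckets_;
};
```

A caller would invoke `begin_pass()` at the top of each of the two passes and push that pass's buckets into the returned vector.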

dzzz2001 and others added 4 commits May 29, 2026 09:58

Merge branch 'develop' into gemm-ladder-rebased

fbf2e54

dzzz2001 marked this pull request as draft May 29, 2026 10:32

dzzz2001 wants to merge 4 commits into deepmodeling:develop from dzzz2001:gemm-ladder-rebased
