Harden host-side sparse index width and allocation safety#2822
Harden host-side sparse index width and allocation safety#2822LwhJesse wants to merge 2 commits into
Conversation
|
After #2816, I think the follow-up direction needs a policy decision before I move #2822 or the resource-cache draft further. I currently have two related follow-ups:
The next steps depend on which CUDA sparse-backend direction the project prefers. I see four practical routes:
My technical preference is route 1, because it gives the cleanest correctness story and the cleanest future GPU-resident Krylov path. If that is too large as the immediate next step, route 2 is a reasonable limited scope for #2822. Could the maintainers confirm which direction they prefer before I mark #2822 ready or move the #2816-based resource-cache draft into the main project? |
|
I think both are going too deep into the software engineering weeds. |
Harden Host-Side Sparse Index Width and Allocation Safety
Summary
This PR hardens the host-side sparse linear algebra path against integer-width overflow and silent narrowing.
It introduces an explicit 64-bit host sparse-index alias,
and uses it in the host sparse-pattern / CSR metadata path where index width and allocation size matter. It also adds checked multiplication for the main storage-size calculations and checked narrowing at the remaining external boundaries.
This PR does not modify the CUDA backend implementation itself. In particular, it does not change:
Common/src/linear_algebra/CSysMatrixGPU.cuCommon/include/linear_algebra/GPUComms.cuhCUDA matvec correctness remains the responsibility of PR #2816.
Problem
Several sparse linear algebra paths in SU2 currently rely on
unsigned longorintfor sparse metadata, storage-size arithmetic, or boundary conversion.On LP64 systems this often goes unnoticed because
unsigned longis usually 64-bit. On LLP64 systems such as 64-bit Windows,unsigned longis still 32-bit. At sufficiently large local sparse-matrix sizes, that can lead to:nnz * nVar * nEqnFor the main matrix storage, the critical host-side product is:
If that is evaluated in 32-bit arithmetic, the largest representable unsigned value is:
So the first overflowing matrix-scalar count is:
For
double, that means:which is:
32 GiBbinary34.36 GBdecimalThat is the origin of the host-side estimate quoted above.
To connect that number to an actual sparse matrix, take a common square-block case with
nVar = nEqn = 5. Then each nonzero block contributes25matrix scalars, so the first overflowing host-side case is:This is already larger than
2^32 - 1 = 4,294,967,295, so the host-side 32-bit product has crossed the limit. In memory terms, that same case corresponds to:which is again about:
32 GiBbinary34.36 GBdecimalSo the quoted
34.36 GBnumber is not a separate back-of-the-envelope estimate; it is simply the byte size of the first5 x 5double-precision sparse matrix whosennz_blocks * nVar * nEqnproduct no longer fits in 32-bit unsigned arithmetic.Using a rough relation
nnz_blocks ~= N_local * z, withzthe average nonzero-block count per row, this5 x 5threshold corresponds approximately to:17.18 millionlocal points atz = 108.59 millionlocal points atz = 20The same argument gives the corresponding first-overflow cases for larger square blocks:
6 x 6:119,304,648 * 36 = 4,294,967,3287 x 7:87,652,394 * 49 = 4,294,967,306So the threshold is high, but it is still a local sparse-matrix scale that can be reached in large implicit runs.
For reference, the current legacy CUDA kernel in
developcan overflow earlier because it uses a signed 32-bitintmatrix-scalar offset internally. The largest positive signed 32-bit value is:so the first overflowing matrix-scalar count is:
For
double, that means:which is:
16 GiBbinary17.18 GBdecimalThat estimate is specific to the legacy custom CUDA matvec path currently used in
develop.PR #2816 naturally addresses that CUDA-side issue by replacing the old custom matvec kernel, including the legacy signed-
intmatrix-offset path that creates this earlier overflow limit. That is why this PR does not modify the CUDA backend itself.What this PR changes
su2_index_t = std::uint64_tfor the host sparse-index chain that needs it.CSysMatrixsparse metadata,CSysVectorsize/index API,CPastixWrappersparse input handling, and the sparse-pattern-facing geometry accessors required for that propagation.nnz * nVar * nEqnnnz_ilu * nVar * nEqnnPointDomain * nVar * nEqnnumBlk * numVarnumBlkDomain * numVarunsigned long, but range-checks host sparse indices before converting them for the current CUDA upload path.After this change, the targeted host-side sparse-index and allocation path no longer fails by silent wraparound or silent truncation. If a value exceeds the remaining boundary types, the code now fails explicitly through
SU2_MPI::Error(...).Validation
git diff --checkpassed.For numerical validation, I ran the existing six-case correctness harness with a two-way CPU comparison:
develop CPUthis branch CPUCases:
periodic2d_sectorudf_lam_flatplate_sudf_lam_flatplate_mudf_lam_flatplate_ludf_test_11_probes_sudf_test_11_probes_mResult:
this branch CPUmatcheddevelop CPUin all 6 caseshistory.csvmatched exactly in all 6 casesmax_abs_delta = 0.0for every caseFor CUDA, the relevant correctness discussion remains PR #2816, since this branch still uses the legacy pre-#2816 custom CUDA matvec implementation and this PR intentionally does not modify that backend. In particular, the current
develop-side signed-intoffset limit described above is handled naturally by the#2816backend replacement rather than by this PR.Notes on testing
This PR does not add a direct large-memory reproducer for the original host-side overflow. The original failure mode depends on LLP64-style 32-bit
unsigned longarithmetic, and a faithful end-to-end reproduction would require a much larger local sparse structure than is practical for a routine test here.PR Checklist
--warnlevel=3when using meson).pre-commit run --allto format old commits.