test(lammps): skip spin DPA3 empty-subdomain pt2 case#5478
Conversation
Skip the spin DPA3 MPI empty-subdomain test while the with-comm AOTI artifact lacks an empty-local-atom fast path. The skipped test keeps a TODO pointing at the nloc_real == 0 divide-by-zero/SIGFPE follow-up. Authored by OpenClaw (model: custom-chat-jinzhezeng-group/gpt-5.5)
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Repository UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
📝 WalkthroughWalkthroughA pytest skip marker is added to the ChangesTest Skip Marker
Estimated code review effort🎯 1 (Trivial) | ⏱️ ~3 minutes Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## master #5478 +/- ##
==========================================
- Coverage 82.25% 82.25% -0.01%
==========================================
Files 833 833
Lines 89100 89099 -1
Branches 4225 4227 +2
==========================================
- Hits 73290 73289 -1
+ Misses 14518 14517 -1
- Partials 1292 1293 +1 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
wanghan-iapcm
left a comment
There was a problem hiding this comment.
check if pr #5485 solves this issue.
…eepmodeling#5485) ## Problem Multi-rank spin MD can leave a rank with zero real local atoms (`nloc_real == 0`) when atoms migrate to other subdomains. The with-comm AOTI artifact hits an intermittent SIGFPE (integer divide by zero) at runtime in inductor-generated shape arithmetic that uses `nloc` as a divisor. Reproduced on master CI run [`26667802665`](https://github.com/deepmodeling/deepmd-kit/actions/runs/26667802665): ``` Caught signal 8 (Floating point exception: integer divide by zero) 4 forward_lower_with_comm/.../wrapper.so(AOTInductorModel::run_impl+0xf482) ``` Root cause: - The graph was traced with `nloc_min=1` (`serialization.py:362`) and inductor lowered an even stricter `nloc >= 2` runtime-check (visible in the generated `wrapper.cpp`'s `check_input_3`). - That runtime-check is gated by env var `AOTI_RUNTIME_CHECK_INPUTS` (default OFF), so with `nloc = 0` the check is silently bypassed and the compiled graph runs through its own divide-by-zero on shape arithmetic. - Whether the offending divide is actually emitted depends on inductor's code-gen choices, which vary across compiles — hence the intermittent nature. ## Fix Prepend two phantom atoms with empty neighbour lists when `nloc_real == 0` so the AOTI graph runs with `nloc == 2` and never reaches the integer-divide-by-zero path. Phantoms have no neighbours so they contribute zero atomic energy / force / virial, preserving the physically-correct "this rank has no real atoms" result. Key details (all in `source/api_cc/src/DeepSpinPTExpt.cc`): - `dcoord` / `datype` / `dspin` get two zero-valued rows prepended. - `firstneigh_tensor` gets two `-1` rows prepended (no neighbours). - `mapping_tensor` gets two identity entries prepended. - `comm_dict.nlocal` is set to `2` (not the LAMMPS-reported `0`) so `border_op` writes received ghost features past the phantom slots. - Output arrays (`dforce`, `dforce_mag`, `datom_energy`, `datom_virial`) get the phantom prefix stripped before being scattered back to LAMMPS via `select_map`. ## Why phantoms rather than `Dim(min=0)` re-export Bumping the trace constraint to `min=0` would require: 1. auditing every `nloc`-dependent divide in `deepmd/dpmodel/{descriptor,fitting,model}/` and protecting with `xp.maximum(nloc, 1)`; 2. `torch.export` re-emitting compatible guards (currently fails because spin-side shape relationships require `nloc >= 1` to be inferable); 3. inductor cooperating with the relaxed bound (it makes independent specialization choices downstream); 4. re-exporting every `.pt2` archive in `source/tests/infer/`. The phantom approach is a strict superset of correctness and self-contained in one C++ file. The two approaches aren't mutually exclusive — the `min=0` route can land as a follow-up once the dpmodel audit is done. ## Test plan - [x] Local CPU rebuild + `runUnitTests_cc --gtest_filter='*Spin*'`: **42 / 42 spin C++ regression tests pass** (12 TF-backend tests skipped, as expected in the PT-only venv). - [ ] CI: the multi-rank LAMMPS test `test_pair_deepmd_mpi_dpa3_spin_empty_subdomain` should now pass deterministically. Local Python LAMMPS-MPI verification is blocked by a pre-existing OpenMPI/MPICH ABI mismatch in my local venv (the plugin's `ompi_mpi_*` symbols can't resolve against MPICH's `libmpi.so.12`), so end-to-end verification falls to CI. ## Known limitations - The phantom path is structurally inert for `nloc_real > 0` (the `if (phantom_n > 0)` branch never fires), so the common path is unchanged. - If a future inductor version bumps the `nloc` lower-bound to >2, `phantom_n` will need to track that minimum. - This fix is in `DeepSpinPTExpt` only. The corresponding non-spin path in `DeepPotPTExpt` has the same code shape; non-spin DPA3 empty-subdomain currently passes in CI but could regress similarly with a future inductor change. Deferred to a follow-up if observed. - Supersedes deepmodeling#5478 (which proposed skipping the test); this PR fixes the underlying bug instead. <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Bug Fixes** * Prevented crashes and incorrect results when some processes have no local atoms during distributed runs; placeholder (phantom) atoms are ignored in neighbor, force, energy, and per-atom outputs so reported values match real atoms. * **Tests** * Updated tests to pass explicit per-atom parameters and exercise empty-subdomain and multi-rank behaviors. * **Documentation** * Clarified test docstrings and model config comments about per-atom parameter handling. <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Co-authored-by: Han Wang <wang_han@iapcm.ac.cn>
Problem
nloc_real == 0).Change
test_pair_deepmd_mpi_dpa3_spin_empty_subdomain.Notes
git diff --checkpassed.uv run pytest --collect-only -q source/lmp/tests/test_lammps_spin_dpa3_pt2.pyis blocked locally by the existing aarch64 uv dependency conflict betweenpin-jax-cpu(jax==0.10.0) andpin-jax-gpu(jax[cuda12]==0.5.0).Authored by OpenClaw (model: custom-chat-jinzhezeng-group/gpt-5.5)
Summary by CodeRabbit