Skip to content

test(lammps): skip spin DPA3 empty-subdomain pt2 case#5478

Closed
njzjz-bot wants to merge 1 commit into
deepmodeling:masterfrom
njzjz-bothub:test-skip-spin-dpa3-pt2-empty-subdomain
Closed

test(lammps): skip spin DPA3 empty-subdomain pt2 case#5478
njzjz-bot wants to merge 1 commit into
deepmodeling:masterfrom
njzjz-bothub:test-skip-spin-dpa3-pt2-empty-subdomain

Conversation

@njzjz-bot
Copy link
Copy Markdown
Contributor

@njzjz-bot njzjz-bot commented May 30, 2026

Problem

  • The spin DPA3 PT2 MPI empty-subdomain test can fail with a divide-by-zero/SIGFPE when one rank has no real local atoms (nloc_real == 0).
  • This blocks CI while the with-comm AOTI artifact still needs a proper empty-local-atom fast path.

Change

  • Temporarily skip test_pair_deepmd_mpi_dpa3_spin_empty_subdomain.
  • Leave an inline TODO in the skip reason documenting the required follow-up fix.

Notes

  • This is intentionally a narrow unblocker; the test body is kept in place as coverage to re-enable once the fast path lands.
  • Local validation: git diff --check passed. uv run pytest --collect-only -q source/lmp/tests/test_lammps_spin_dpa3_pt2.py is blocked locally by the existing aarch64 uv dependency conflict between pin-jax-cpu (jax==0.10.0) and pin-jax-gpu (jax[cuda12]==0.5.0).

Authored by OpenClaw (model: custom-chat-jinzhezeng-group/gpt-5.5)

Summary by CodeRabbit

  • Tests
    • Disabled a multi-rank MPI test due to a known issue currently under investigation.

Review Change Stack

Skip the spin DPA3 MPI empty-subdomain test while the with-comm AOTI artifact lacks an empty-local-atom fast path. The skipped test keeps a TODO pointing at the nloc_real == 0 divide-by-zero/SIGFPE follow-up.

Authored by OpenClaw (model: custom-chat-jinzhezeng-group/gpt-5.5)
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 30, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 7ea3a518-596a-453a-abb3-33fc3afa2e44

📥 Commits

Reviewing files that changed from the base of the PR and between e679b8d and 2a15337.

📒 Files selected for processing (1)
  • source/lmp/tests/test_lammps_spin_dpa3_pt2.py

📝 Walkthrough

Walkthrough

A pytest skip marker is added to the test_pair_deepmd_mpi_dpa3_spin_empty_subdomain test in the LAMMPS spin model tests, disabling it with an explicit reason citing a divide-by-zero/SIGFPE failure that occurs when a rank has zero local real atoms during multi-subdomain MPI execution.

Changes

Test Skip Marker

Layer / File(s) Summary
Skip empty-subdomain MPI test with divide-by-zero issue
source/lmp/tests/test_lammps_spin_dpa3_pt2.py
The test_pair_deepmd_mpi_dpa3_spin_empty_subdomain test is marked with @pytest.mark.skip() to suppress a SIGFPE crash occurring when nloc_real == 0 on any MPI rank in multi-subdomain execution. A TODO comment documents the issue for future resolution.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

Suggested labels

Python, LAMMPS

Suggested reviewers

  • njzjz
  • iProzd
🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and specifically describes the main change: skipping a particular test case for spin DPA3 empty-subdomain in LAMMPS PT2.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@njzjz njzjz requested a review from wanghan-iapcm May 30, 2026 07:48
@codecov
Copy link
Copy Markdown

codecov Bot commented May 30, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 82.25%. Comparing base (e679b8d) to head (2a15337).
⚠️ Report is 2 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5478      +/-   ##
==========================================
- Coverage   82.25%   82.25%   -0.01%     
==========================================
  Files         833      833              
  Lines       89100    89099       -1     
  Branches     4225     4227       +2     
==========================================
- Hits        73290    73289       -1     
+ Misses      14518    14517       -1     
- Partials     1292     1293       +1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Copy link
Copy Markdown
Collaborator

@wanghan-iapcm wanghan-iapcm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

check if pr #5485 solves this issue.

zhaiwenxi pushed a commit to zhaiwenxi/deepmd-kit that referenced this pull request Jun 4, 2026
…eepmodeling#5485)

## Problem

Multi-rank spin MD can leave a rank with zero real local atoms
(`nloc_real == 0`) when atoms migrate to other subdomains. The with-comm
AOTI artifact hits an intermittent SIGFPE (integer divide by zero) at
runtime in inductor-generated shape arithmetic that uses `nloc` as a
divisor.

Reproduced on master CI run
[`26667802665`](https://github.com/deepmodeling/deepmd-kit/actions/runs/26667802665):
```
Caught signal 8 (Floating point exception: integer divide by zero)
4  forward_lower_with_comm/.../wrapper.so(AOTInductorModel::run_impl+0xf482)
```

Root cause:
- The graph was traced with `nloc_min=1` (`serialization.py:362`) and
inductor lowered an even stricter `nloc >= 2` runtime-check (visible in
the generated `wrapper.cpp`'s `check_input_3`).
- That runtime-check is gated by env var `AOTI_RUNTIME_CHECK_INPUTS`
(default OFF), so with `nloc = 0` the check is silently bypassed and the
compiled graph runs through its own divide-by-zero on shape arithmetic.
- Whether the offending divide is actually emitted depends on inductor's
code-gen choices, which vary across compiles — hence the intermittent
nature.

## Fix

Prepend two phantom atoms with empty neighbour lists when `nloc_real ==
0` so the AOTI graph runs with `nloc == 2` and never reaches the
integer-divide-by-zero path. Phantoms have no neighbours so they
contribute zero atomic energy / force / virial, preserving the
physically-correct "this rank has no real atoms" result.

Key details (all in `source/api_cc/src/DeepSpinPTExpt.cc`):
- `dcoord` / `datype` / `dspin` get two zero-valued rows prepended.
- `firstneigh_tensor` gets two `-1` rows prepended (no neighbours).
- `mapping_tensor` gets two identity entries prepended.
- `comm_dict.nlocal` is set to `2` (not the LAMMPS-reported `0`) so
`border_op` writes received ghost features past the phantom slots.
- Output arrays (`dforce`, `dforce_mag`, `datom_energy`, `datom_virial`)
get the phantom prefix stripped before being scattered back to LAMMPS
via `select_map`.

## Why phantoms rather than `Dim(min=0)` re-export

Bumping the trace constraint to `min=0` would require:
1. auditing every `nloc`-dependent divide in
`deepmd/dpmodel/{descriptor,fitting,model}/` and protecting with
`xp.maximum(nloc, 1)`;
2. `torch.export` re-emitting compatible guards (currently fails because
spin-side shape relationships require `nloc >= 1` to be inferable);
3. inductor cooperating with the relaxed bound (it makes independent
specialization choices downstream);
4. re-exporting every `.pt2` archive in `source/tests/infer/`.

The phantom approach is a strict superset of correctness and
self-contained in one C++ file. The two approaches aren't mutually
exclusive — the `min=0` route can land as a follow-up once the dpmodel
audit is done.

## Test plan

- [x] Local CPU rebuild + `runUnitTests_cc --gtest_filter='*Spin*'`:
**42 / 42 spin C++ regression tests pass** (12 TF-backend tests skipped,
as expected in the PT-only venv).
- [ ] CI: the multi-rank LAMMPS test
`test_pair_deepmd_mpi_dpa3_spin_empty_subdomain` should now pass
deterministically. Local Python LAMMPS-MPI verification is blocked by a
pre-existing OpenMPI/MPICH ABI mismatch in my local venv (the plugin's
`ompi_mpi_*` symbols can't resolve against MPICH's `libmpi.so.12`), so
end-to-end verification falls to CI.

## Known limitations

- The phantom path is structurally inert for `nloc_real > 0` (the `if
(phantom_n > 0)` branch never fires), so the common path is unchanged.
- If a future inductor version bumps the `nloc` lower-bound to >2,
`phantom_n` will need to track that minimum.
- This fix is in `DeepSpinPTExpt` only. The corresponding non-spin path
in `DeepPotPTExpt` has the same code shape; non-spin DPA3
empty-subdomain currently passes in CI but could regress similarly with
a future inductor change. Deferred to a follow-up if observed.
- Supersedes deepmodeling#5478 (which proposed skipping the test); this PR fixes the
underlying bug instead.


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Bug Fixes**
* Prevented crashes and incorrect results when some processes have no
local atoms during distributed runs; placeholder (phantom) atoms are
ignored in neighbor, force, energy, and per-atom outputs so reported
values match real atoms.
* **Tests**
* Updated tests to pass explicit per-atom parameters and exercise
empty-subdomain and multi-rank behaviors.
* **Documentation**
* Clarified test docstrings and model config comments about per-atom
parameter handling.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: Han Wang <wang_han@iapcm.ac.cn>
@njzjz njzjz closed this Jun 4, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants