fix(train): allow zero-step training with bias adjustment by njzjz-bot · Pull Request #5477 · deepmodeling/deepmd-kit

njzjz-bot · 2026-05-30T06:14:20Z

Problem

numb_steps=0 is a valid no-optimization path that should save the initial checkpoint.
When change_bias_after_training is enabled, the post-training bias adjustment still ran after zero steps and evaluated learning-rate/checkpoint metadata at step -1.

Change

Skip post-training bias adjustment unless at least one training step has run.
Keep the existing zero-step initial checkpoint save path for both PyTorch and Paddle backends.
Add PT/PD regression tests that run zero-step training with change_bias_after_training=true and verify the saved *-0 checkpoint metadata.
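
The guard described above can be sketched in isolation. This is a minimal illustration, not the deepmd-kit code: `ToyTrainer`, `saved_steps`, and `bias_adjusted` are hypothetical names, and the checkpoint and bias logic are placeholders. Only the condition `num_steps > start_step` combined with the rank-0 check is taken from this PR.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTrainer:
    """Hypothetical stand-in for a trainer; not the deepmd-kit Trainer."""

    num_steps: int
    start_step: int = 0
    rank: int = 0
    change_bias_after_training: bool = True
    saved_steps: list = field(default_factory=list)
    bias_adjusted: bool = False

    def run(self) -> None:
        steps_run = 0
        for _ in range(self.start_step, self.num_steps):
            steps_run += 1  # a real trainer would take an optimizer step here

        # Record a checkpoint labelled with the last completed step (0 if none ran).
        self.saved_steps.append(self.start_step + steps_run)

        # Guard: adjust the bias only on rank 0 and only if the loop advanced.
        if (
            self.change_bias_after_training
            and self.rank == 0
            and self.num_steps > self.start_step
        ):
            self.bias_adjusted = True  # placeholder for the real adjustment


zero = ToyTrainer(num_steps=0)
zero.run()
three = ToyTrainer(num_steps=3)
three.run()
```

With `num_steps=0` the loop body never executes, so the toy trainer records only the step-0 checkpoint and leaves the bias untouched; with three steps the adjustment runs as before.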

Notes

python3 -m pytest ... could not run in this workspace because pytest is not installed in the available Python environment.
uvx ruff check deepmd/pd/train/training.py deepmd/pt/train/training.py source/tests/pd/test_training.py source/tests/pt/test_training.py passed.
uvx ruff format --check deepmd/pd/train/training.py deepmd/pt/train/training.py source/tests/pd/test_training.py source/tests/pt/test_training.py passed.
Closes #4988 (Runtime Error when Step is 0).

Authored by OpenClaw (model: custom-chat-jinzhezeng-group/gpt-5.5)

Summary by CodeRabbit

Bug Fixes
- Prevented unintended bias-adjustment during zero-step PyTorch training so the initial checkpoint is created and recorded correctly.
Refactor
- Clarified the post-training bias-adjustment conditional in Paddle for readability (no behavior change).
Tests
- Added tests for zero-step training with bias-adjustment enabled for both Paddle and PyTorch, verifying initial checkpoint creation and training metadata.

Skip post-training bias adjustment when no training step has run, so zero-step jobs can keep the existing initial-checkpoint behavior without evaluating step -1 learning rates. Authored by OpenClaw (model: custom-chat-jinzhezeng-group/gpt-5.5)

coderabbitai · 2026-05-30T06:17:50Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: f8a752d1-f3f6-40b1-9843-8e4062b15342

📥 Commits

Reviewing files that changed from the base of the PR and between d27334c and 3d7168f.

📒 Files selected for processing (2)

source/tests/pd/test_training.py
source/tests/pt/test_training.py

🚧 Files skipped from review as they are similar to previous changes (2)

source/tests/pt/test_training.py
source/tests/pd/test_training.py

📝 Walkthrough

This PR adds a step-count guard to the PyTorch trainer's post-training bias-change block, reformats the corresponding Paddle trainer conditional, and adds tests for zero-step training that verify initial checkpoint creation and metadata.

Changes

Trainer Zero-Step Guard and Test Coverage

Layer / File(s)	Summary
PyTorch trainer step-count guard for bias-change block `deepmd/pt/train/training.py`	The `change_bias_after_training` conditional block at the end of `Trainer.run()` now requires `self.num_steps > self.start_step` in addition to the rank-0 check, preventing execution when the training loop performed zero steps.
Paddle trainer post-training block formatting `deepmd/pd/train/training.py`	Reformatted the `change_bias_after_training` conditional in Paddle's `Trainer.run()` to span multiple lines, preserving the identical rank-0 check logic.
Zero-step training test coverage (Paddle) `source/tests/pd/test_training.py`	Adds `import paddle` and `test_zero_step_with_change_bias_saves_initial_checkpoint` which runs zero-step training with `change_bias_after_training=True`, asserts `trainer.save_ckpt-0.pd` is created and matches `trainer.latest_model`, and verifies `_extra_state.train_infos.step == 0` and `lr == 0.0`.
Zero-step training test coverage (PyTorch) `source/tests/pt/test_training.py`	Adds `test_zero_step_with_change_bias_saves_initial_checkpoint` which runs zero-step training with `change_bias_after_training=True`, asserts `trainer.save_ckpt-0.pt` is created and is the latest model, checks the `checkpoint` pointer file, and verifies `model._extra_state.train_infos.step == 0` and `lr == 0.0`.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested reviewers

Chengqian-Zhang
iProzd

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately summarizes the main change: enabling zero-step training with bias adjustment, which is the core functional fix in this PR.
Linked Issues check	✅ Passed	The PR addresses issue `#4988` by allowing zero-step training with bias adjustment; it skips post-training bias adjustment when no steps have run and saves the initial checkpoint correctly.
Out of Scope Changes check	✅ Passed	All changes directly address the zero-step training issue: logic updates to Trainer.run() in both backends and regression tests verifying the fix align with issue `#4988` requirements.

coderabbitai

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@source/tests/pd/test_training.py`:
- Around line 167-181: This test method
test_zero_step_with_change_bias_saves_initial_checkpoint runs training and needs
a 60s timeout to prevent CI hangs; add a pytest timeout decorator to the method
(e.g. `@pytest.mark.timeout(60)`) and ensure pytest is imported in the test module
so the decorator is available; locate the method by name in the test class in
test_training.py and place the decorator immediately above the def to enforce
the <=60s limit.
- Line 176: The assertion compares a Path object to a raw string
(Path("model.ckpt-0.pd") vs Path("checkpoint").read_text()), causing spurious
failures; change the test to compare Path to Path by wrapping the read text as a
Path and stripping whitespace/newline: replace the RHS with
Path(Path("checkpoint").read_text().strip()) so the assertion becomes
self.assertEqual(Path("model.ckpt-0.pd"),
Path(Path("checkpoint").read_text().strip())). This ensures both sides are Path
objects and ignores trailing newlines.

In `@source/tests/pt/test_training.py`:
- Around line 266-282: Add the 60s timeout decorator to the test function by
annotating test_zero_step_with_change_bias_saves_initial_checkpoint with
`@TRAINING_TEST_TIMEOUT` (place the decorator immediately above the def). If
TRAINING_TEST_TIMEOUT is not in scope in that module, import it where other test
helpers are imported so the symbol is available before use; keep the rest of the
test unchanged.
- Line 275: The assertion mixes a Path object and a raw string; change the
comparison so both sides use the same type and strip any newline: replace the
RHS Path("checkpoint").read_text() with
Path(Path("checkpoint").read_text().strip()) (or alternatively compare
str(Path("model.ckpt-0.pt")) to Path("checkpoint").read_text().strip()) so the
call in self.assertEqual compares two strings or two Path objects consistently.
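
Both assertion findings above come down to the same Python behavior: a `pathlib.Path` never compares equal to a `str`, and text read from a file may carry a trailing newline. A minimal sketch follows, assuming the pointer file holds the checkpoint name followed by a newline (the actual file contents are not shown in this thread):

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    pointer = Path(tmp) / "checkpoint"
    # Assumed contents: the checkpoint file name plus a trailing newline.
    pointer.write_text("model.ckpt-0.pt\n")
    raw = pointer.read_text()

expected = Path("model.ckpt-0.pt")

# Path vs. str: always unequal, regardless of the text.
mixed_equal = expected == raw

# Either convert the stripped text to a Path ...
path_equal = expected == Path(raw.strip())

# ... or compare both sides as strings.
str_equal = str(expected) == raw.strip()
```

Either of the last two forms gives a type-consistent comparison; the first form fails even when the file content is correct.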

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: c57bc0f6-4dcf-4067-87e2-99022da10b56

📥 Commits

Reviewing files that changed from the base of the PR and between e679b8d and ef84d6c.

📒 Files selected for processing (4)

deepmd/pd/train/training.py
deepmd/pt/train/training.py
source/tests/pd/test_training.py
source/tests/pt/test_training.py

Copilot

Pull request overview

This PR fixes zero-step training when change_bias_after_training is enabled for PyTorch and Paddle, ensuring the initial checkpoint path remains valid without running post-training bias adjustment.

Changes:

Adds a num_steps > start_step guard before bias adjustment in PT/PD trainers.
Adds regression tests for zero-step training with bias adjustment enabled.
Verifies saved checkpoint metadata reports step=0 and lr=0.0.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

File	Description
`deepmd/pt/train/training.py`	Skips PT bias adjustment when no training step ran.
`deepmd/pd/train/training.py`	Skips Paddle bias adjustment when no training step ran.
`source/tests/pt/test_training.py`	Adds PT regression coverage for zero-step checkpoint save.
`source/tests/pd/test_training.py`	Adds Paddle regression coverage for zero-step checkpoint save.

+
+        self.assertEqual(Path("model.ckpt-0.pt"), trainer.latest_model)
+        self.assertTrue(Path("model.ckpt-0.pt").exists())
+        self.assertEqual(Path("model.ckpt-0.pt"), Path("checkpoint").read_text())


+
+        self.assertEqual(Path("model.ckpt-0.pd"), trainer.latest_model)
+        self.assertTrue(Path("model.ckpt-0.pd").exists())
+        self.assertEqual(Path("model.ckpt-0.pd"), Path("checkpoint").read_text())


codecov · 2026-05-30T07:02:38Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 81.24%. Comparing base (e679b8d) to head (3d7168f).
⚠️ Report is 2 commits behind head on master.

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #5477      +/-   ##
==========================================
- Coverage   82.25%   81.24%   -1.01%     
==========================================
  Files         833      868      +35     
  Lines       89100    96358    +7258     
  Branches     4225     4235      +10     
==========================================
+ Hits        73290    78287    +4997     
- Misses      14518    16771    +2253     
- Partials     1292     1300       +8
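
The percentages in the diff can be reproduced from the Hits and Lines rows. The sketch below assumes coverage is hits divided by lines and that the displayed value is truncated to two decimals; the truncation is an inference from these particular numbers, not something the report states. `coverage_hundredths` is a hypothetical helper name.

```python
def coverage_hundredths(hits: int, lines: int) -> int:
    """Coverage in hundredths of a percent, truncated rather than rounded."""
    return hits * 10_000 // lines


# Figures copied from the Hits and Lines rows of the coverage diff.
base = coverage_hundredths(73290, 89100)  # base (master) column
head = coverage_hundredths(78287, 96358)  # head (#5477) column
delta = head - base

print(f"base={base / 100:.2f}% head={head / 100:.2f}% delta={delta / 100:+.2f}%")
```

In each column, Hits, Misses, and Partials also sum to the Lines total.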

Compare checkpoint pointers as paths and add timeout guards to zero-step training regression tests. Authored by OpenClaw (model: custom-chat-jinzhezeng-group/gpt-5.5)

njzjz-bot · 2026-05-30T07:55:44Z

Thanks, fixed in 631039c:

compare the checkpoint pointer as a Path after stripping the file content
add timeout guards to the zero-step training regression tests

Validation:

uvx ruff check deepmd/pd/train/training.py deepmd/pt/train/training.py source/tests/pd/test_training.py source/tests/pt/test_training.py
uvx ruff format --check deepmd/pd/train/training.py deepmd/pt/train/training.py source/tests/pd/test_training.py source/tests/pt/test_training.py

Authored by OpenClaw (model: custom-chat-jinzhezeng-group/gpt-5.5)

This reverts commit 631039c.

Compare checkpoint pointers as paths without adding timeout guards, since the regression covers the zero-step no-op path. Authored by OpenClaw (model: custom-chat-jinzhezeng-group/gpt-5.5)

njzjz-bot · 2026-05-30T08:05:40Z

Update: I kept the checkpoint pointer assertion fix but intentionally removed the added timeout guards in d27334c.

This regression covers numb_steps=0, so it verifies the no-op path and should not enter the training loop. A training-test timeout is useful for tests that actually run optimization, but it adds noise here, especially for the Paddle test file where no timeout helper existed.

Authored by OpenClaw (model: custom-chat-jinzhezeng-group/gpt-5.5)

Build the expected zero-step checkpoint path from trainer.save_ckpt so the regression follows each test fixture's configured checkpoint prefix. Authored by OpenClaw (model: custom-chat-jinzhezeng-group/gpt-5.5)

njzjz-bot · 2026-05-30T09:58:23Z

Fixed the failing tests in 3d7168f.

The fixtures configure training.save_ckpt as model, so the zero-step checkpoint is model-0.{pt,pd}, not model.ckpt-0.{pt,pd}. The tests now derive the expected path from trainer.save_ckpt and still verify that the checkpoint pointer and saved metadata are correct.
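
The naming mismatch can be illustrated with a small helper. This is a sketch under the assumption that checkpoint files are named `<save_ckpt>-<step><suffix>`, which is what the file names quoted above imply; `expected_checkpoint` is a hypothetical helper, not a function from deepmd-kit or from the tests.

```python
from pathlib import Path


def expected_checkpoint(save_ckpt: str, step: int, suffix: str) -> Path:
    """Hypothetical helper: build '<save_ckpt>-<step><suffix>' as a Path."""
    return Path(f"{save_ckpt}-{step}{suffix}")


# With the fixtures' prefix "model", the zero-step files are model-0.pt / model-0.pd.
pt_path = expected_checkpoint("model", 0, ".pt")
pd_path = expected_checkpoint("model", 0, ".pd")

# A hard-coded "model.ckpt" prefix yields a different name, hence the earlier failures.
hardcoded = expected_checkpoint("model.ckpt", 0, ".pt")
```

Deriving the expected name from the configured prefix keeps the assertion valid if a fixture changes `save_ckpt`.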

Validation:

uvx ruff check deepmd/pd/train/training.py deepmd/pt/train/training.py source/tests/pd/test_training.py source/tests/pt/test_training.py
uvx ruff format --check deepmd/pd/train/training.py deepmd/pt/train/training.py source/tests/pd/test_training.py source/tests/pt/test_training.py

Authored by OpenClaw (model: custom-chat-jinzhezeng-group/gpt-5.5)

dosubot Bot added the bug label May 30, 2026

github-actions Bot added the Python label May 30, 2026

coderabbitai Bot reviewed May 30, 2026

Comment thread source/tests/pd/test_training.py

Comment thread source/tests/pd/test_training.py Outdated

Comment thread source/tests/pt/test_training.py

Comment thread source/tests/pt/test_training.py Outdated

njzjz requested a review from Copilot May 30, 2026 06:50

Copilot started reviewing on behalf of njzjz May 30, 2026 06:50

Copilot AI reviewed May 30, 2026

test(train): fix zero-step checkpoint assertions

631039c

Compare checkpoint pointers as paths and add timeout guards to zero-step training regression tests. Authored by OpenClaw (model: custom-chat-jinzhezeng-group/gpt-5.5)

njzjz-bot added 2 commits May 30, 2026 08:03

Revert "test(train): fix zero-step checkpoint assertions"

5da1412

This reverts commit 631039c.

test(train): fix zero-step checkpoint assertions

d27334c

Compare checkpoint pointers as paths without adding timeout guards, since the regression covers the zero-step no-op path. Authored by OpenClaw (model: custom-chat-jinzhezeng-group/gpt-5.5)

test(train): use configured zero-step checkpoint path

3d7168f

Build the expected zero-step checkpoint path from trainer.save_ckpt so the regression follows each test fixture's configured checkpoint prefix. Authored by OpenClaw (model: custom-chat-jinzhezeng-group/gpt-5.5)

njzjz-bot wants to merge 5 commits into deepmodeling:master from njzjz-bothub:fix-4988-zero-step-training
