
fix(train): allow zero-step training with bias adjustment #5477

Open
njzjz-bot wants to merge 5 commits into deepmodeling:master from njzjz-bothub:fix-4988-zero-step-training

njzjz-bot (Contributor) commented May 30, 2026

Problem

  • numb_steps=0 is a valid no-optimization path that should save the initial checkpoint.
  • When change_bias_after_training is enabled, the post-training bias adjustment still ran after zero steps and evaluated learning-rate/checkpoint metadata at step -1.

Change

  • Skip post-training bias adjustment unless at least one training step has run.
  • Keep the existing zero-step initial checkpoint save path for both PyTorch and Paddle backends.
  • Add PT/PD regression tests that run zero-step training with change_bias_after_training=true and verify the saved *-0 checkpoint metadata.
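
The guard described above can be illustrated with a minimal sketch. `ToyTrainer` is a hypothetical stand-in written for this example, not the DeePMD-kit `Trainer`; only the condition `num_steps > start_step` and the rank-0 check are taken from this PR, and the recorded event strings are purely illustrative.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTrainer:
    """Hypothetical stand-in for a trainer; not the DeePMD-kit class."""

    num_steps: int
    start_step: int = 0
    rank: int = 0
    change_bias_after_training: bool = True
    events: list = field(default_factory=list)

    def run(self) -> None:
        # The loop body never executes when num_steps == start_step.
        for step in range(self.start_step, self.num_steps):
            self.events.append(f"train step {step}")

        if self.num_steps == self.start_step:
            # Zero-step path: only the initial checkpoint is saved.
            self.events.append(f"save checkpoint at step {self.start_step}")

        # Guard: adjust the bias only if at least one step has run.
        if (
            self.change_bias_after_training
            and self.rank == 0
            and self.num_steps > self.start_step
        ):
            self.events.append(
                f"adjust bias, save checkpoint at step {self.num_steps}"
            )


zero = ToyTrainer(num_steps=0)
zero.run()

two = ToyTrainer(num_steps=2)
two.run()
```

With `num_steps=0` the sketch records only the initial checkpoint save; with `num_steps=2` it records two training steps followed by the bias adjustment.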

Notes

  • python3 -m pytest ... could not run in this workspace because pytest is not installed in the available Python environment.
  • uvx ruff check deepmd/pd/train/training.py deepmd/pt/train/training.py source/tests/pd/test_training.py source/tests/pt/test_training.py passed.
  • uvx ruff format --check deepmd/pd/train/training.py deepmd/pt/train/training.py source/tests/pd/test_training.py source/tests/pt/test_training.py passed.
  • Closes #4988 ("Runtime Error when Step is 0").

Authored by OpenClaw (model: custom-chat-jinzhezeng-group/gpt-5.5)

Summary by CodeRabbit

  • Bug Fixes

    • Prevented unintended bias-adjustment during zero-step PyTorch training so the initial checkpoint is created and recorded correctly.
  • Refactor

    • Clarified the post-training bias-adjustment conditional in Paddle for readability (no behavior change).
  • Tests

    • Added tests for zero-step training with bias-adjustment enabled for both Paddle and PyTorch, verifying initial checkpoint creation and training metadata.


Skip post-training bias adjustment when no training step has run, so zero-step jobs can keep the existing initial-checkpoint behavior without evaluating step -1 learning rates.

Authored by OpenClaw (model: custom-chat-jinzhezeng-group/gpt-5.5)
dosubot Bot added the bug label May 30, 2026

coderabbitai Bot (Contributor) commented May 30, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: f8a752d1-f3f6-40b1-9843-8e4062b15342

📥 Commits

Reviewing files that changed from the base of the PR and between d27334c and 3d7168f.

📒 Files selected for processing (2)
  • source/tests/pd/test_training.py
  • source/tests/pt/test_training.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • source/tests/pt/test_training.py
  • source/tests/pd/test_training.py

📝 Walkthrough


This PR adds a step-count guard to the PyTorch trainer's post-training bias-change block, reformats the corresponding Paddle trainer conditional, and adds tests for zero-step training that verify initial checkpoint creation and metadata.

Changes

Trainer Zero-Step Guard and Test Coverage

| Layer / File(s) | Summary |
| --- | --- |
| PyTorch trainer step-count guard for bias-change block: deepmd/pt/train/training.py | The change_bias_after_training conditional block at the end of Trainer.run() now requires self.num_steps > self.start_step in addition to the rank-0 check, preventing execution when the training loop performed zero steps. |
| Paddle trainer post-training block formatting: deepmd/pd/train/training.py | Reformatted the change_bias_after_training conditional in Paddle's Trainer.run() to span multiple lines, preserving the identical rank-0 check logic. |
| Zero-step training test coverage (Paddle): source/tests/pd/test_training.py | Adds import paddle and test_zero_step_with_change_bias_saves_initial_checkpoint which runs zero-step training with change_bias_after_training=True, asserts trainer.save_ckpt-0.pd is created and matches trainer.latest_model, and verifies _extra_state.train_infos.step == 0 and lr == 0.0. |
| Zero-step training test coverage (PyTorch): source/tests/pt/test_training.py | Adds test_zero_step_with_change_bias_saves_initial_checkpoint which runs zero-step training with change_bias_after_training=True, asserts trainer.save_ckpt-0.pt is created and is the latest model, checks the checkpoint pointer file, and verifies model._extra_state.train_infos.step == 0 and lr == 0.0. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested reviewers

  • Chengqian-Zhang
  • iProzd
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarizes the main change: enabling zero-step training with bias adjustment, which is the core functional fix in this PR. |
| Linked Issues check | ✅ Passed | The PR addresses issue #4988 by allowing zero-step training with bias adjustment; it skips post-training bias adjustment when no steps have run and saves the initial checkpoint correctly. |
| Out of Scope Changes check | ✅ Passed | All changes directly address the zero-step training issue: logic updates to Trainer.run() in both backends and regression tests verifying the fix align with issue #4988 requirements. |



coderabbitai Bot (Contributor) left a comment

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@source/tests/pd/test_training.py`:
- Around line 167-181: This test method
test_zero_step_with_change_bias_saves_initial_checkpoint runs training and needs
a 60s timeout to prevent CI hangs; add a pytest timeout decorator to the method
(e.g. `@pytest.mark.timeout(60)`) and ensure pytest is imported in the test module
so the decorator is available; locate the method by name in the test class in
test_training.py and place the decorator immediately above the def to enforce
the <=60s limit.
- Line 176: The assertion compares a Path object to a raw string
(Path("model.ckpt-0.pd") vs Path("checkpoint").read_text()), causing spurious
failures; change the test to compare Path to Path by wrapping the read text as a
Path and stripping whitespace/newline: replace the RHS with
Path(Path("checkpoint").read_text().strip()) so the assertion becomes
self.assertEqual(Path("model.ckpt-0.pd"),
Path(Path("checkpoint").read_text().strip())). This ensures both sides are Path
objects and ignores trailing newlines.

In `@source/tests/pt/test_training.py`:
- Around line 266-282: Add the 60s timeout decorator to the test function by
annotating test_zero_step_with_change_bias_saves_initial_checkpoint with
`@TRAINING_TEST_TIMEOUT` (place the decorator immediately above the def). If
TRAINING_TEST_TIMEOUT is not in scope in that module, import it where other test
helpers are imported so the symbol is available before use; keep the rest of the
test unchanged.
- Line 275: The assertion mixes a Path object and a raw string; change the
comparison so both sides use the same type and strip any newline: replace the
RHS Path("checkpoint").read_text() with
Path(Path("checkpoint").read_text().strip()) (or alternatively compare
str(Path("model.ckpt-0.pt")) to Path("checkpoint").read_text().strip()) so the
call in self.assertEqual compares two strings or two Path objects consistently.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: c57bc0f6-4dcf-4067-87e2-99022da10b56

📥 Commits

Reviewing files that changed from the base of the PR and between e679b8d and ef84d6c.

📒 Files selected for processing (4)
  • deepmd/pd/train/training.py
  • deepmd/pt/train/training.py
  • source/tests/pd/test_training.py
  • source/tests/pt/test_training.py

Comment thread source/tests/pd/test_training.py
Comment thread source/tests/pd/test_training.py Outdated
Comment thread source/tests/pt/test_training.py
Comment thread source/tests/pt/test_training.py Outdated
Copilot AI (Contributor) left a comment

Pull request overview

This PR fixes zero-step training when change_bias_after_training is enabled for PyTorch and Paddle, ensuring the initial checkpoint path remains valid without running post-training bias adjustment.

Changes:

  • Adds a num_steps > start_step guard before bias adjustment in PT/PD trainers.
  • Adds regression tests for zero-step training with bias adjustment enabled.
  • Verifies saved checkpoint metadata reports step=0 and lr=0.0.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| deepmd/pt/train/training.py | Skips PT bias adjustment when no training step ran. |
| deepmd/pd/train/training.py | Skips Paddle bias adjustment when no training step ran. |
| source/tests/pt/test_training.py | Adds PT regression coverage for zero-step checkpoint save. |
| source/tests/pd/test_training.py | Adds Paddle regression coverage for zero-step checkpoint save. |


Comment thread source/tests/pt/test_training.py Outdated

self.assertEqual(Path("model.ckpt-0.pt"), trainer.latest_model)
self.assertTrue(Path("model.ckpt-0.pt").exists())
self.assertEqual(Path("model.ckpt-0.pt"), Path("checkpoint").read_text())
Comment thread source/tests/pd/test_training.py Outdated

self.assertEqual(Path("model.ckpt-0.pd"), trainer.latest_model)
self.assertTrue(Path("model.ckpt-0.pd").exists())
self.assertEqual(Path("model.ckpt-0.pd"), Path("checkpoint").read_text())
codecov Bot commented May 30, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 81.24%. Comparing base (e679b8d) to head (3d7168f).
⚠️ Report is 2 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5477      +/-   ##
==========================================
- Coverage   82.25%   81.24%   -1.01%     
==========================================
  Files         833      868      +35     
  Lines       89100    96358    +7258     
  Branches     4225     4235      +10     
==========================================
+ Hits        73290    78287    +4997     
- Misses      14518    16771    +2253     
- Partials     1292     1300       +8     


Compare checkpoint pointers as paths and add timeout guards to zero-step training regression tests.

Authored by OpenClaw (model: custom-chat-jinzhezeng-group/gpt-5.5)
@njzjz-bot (Contributor, Author) commented:

Thanks, fixed in 631039c:

  • compare the checkpoint pointer as a Path after stripping the file content
  • add timeout guards to the zero-step training regression tests
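
As a rough illustration of why the pointer comparison needed fixing: a `pathlib.Path` never compares equal to a plain `str`, so the original assertion could not pass regardless of the file content. The sketch below writes its own pointer file in a temporary directory; the file name `model-0.pt` and the trailing newline are assumptions made for the example, not a statement about what the trainer writes.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    pointer = Path(tmp) / "checkpoint"
    # Assumed content for illustration: a file name plus a trailing newline.
    pointer.write_text("model-0.pt\n")

    raw = pointer.read_text()
    expected = Path("model-0.pt")

    # Path vs. str: the comparison is False even if the text matches.
    mixed_equal = expected == raw

    # Path vs. Path, after stripping surrounding whitespace.
    fixed_equal = expected == Path(raw.strip())
```

Here `mixed_equal` is `False` and `fixed_equal` is `True`, which is the behavior the updated assertion relies on.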

Validation:

  • uvx ruff check deepmd/pd/train/training.py deepmd/pt/train/training.py source/tests/pd/test_training.py source/tests/pt/test_training.py
  • uvx ruff format --check deepmd/pd/train/training.py deepmd/pt/train/training.py source/tests/pd/test_training.py source/tests/pt/test_training.py

Authored by OpenClaw (model: custom-chat-jinzhezeng-group/gpt-5.5)

njzjz-bot added 2 commits May 30, 2026 08:03
Compare checkpoint pointers as paths without adding timeout guards, since the regression covers the zero-step no-op path.

Authored by OpenClaw (model: custom-chat-jinzhezeng-group/gpt-5.5)
@njzjz-bot (Contributor, Author) commented:

Update: I kept the checkpoint pointer assertion fix but intentionally removed the added timeout guards in d27334c.

This regression covers numb_steps=0, so it verifies the no-op path and should not enter the training loop. A training-test timeout is useful for tests that actually run optimization, but it adds noise here, especially for the Paddle test file where no timeout helper existed.

Authored by OpenClaw (model: custom-chat-jinzhezeng-group/gpt-5.5)

Build the expected zero-step checkpoint path from trainer.save_ckpt so the regression follows each test fixture's configured checkpoint prefix.

Authored by OpenClaw (model: custom-chat-jinzhezeng-group/gpt-5.5)
@njzjz-bot (Contributor, Author) commented:

Fixed the failing tests in 3d7168f.

The fixtures configure training.save_ckpt as model, so the zero-step checkpoint is model-0.{pt,pd}, not model.ckpt-0.{pt,pd}. The tests now derive the expected path from trainer.save_ckpt and still verify that the checkpoint pointer and saved metadata are correct.
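
A minimal sketch of deriving the expected path from the configured prefix, assuming the naming pattern `<save_ckpt>-<step><suffix>` implied by the file names above. `zero_step_checkpoint` is a hypothetical helper for illustration and does not exist in the test files.

```python
from pathlib import Path


def zero_step_checkpoint(save_ckpt: str, suffix: str) -> Path:
    """Hypothetical helper: build '<prefix>-0<suffix>' from the configured prefix."""
    # String formatting is used instead of Path.with_suffix so that a prefix
    # containing a dot (for example "model.ckpt") is kept intact.
    return Path(f"{save_ckpt}-0{suffix}")


# Prefix "model", as configured by the test fixtures.
pt_path = zero_step_checkpoint("model", ".pt")
pd_path = zero_step_checkpoint("model", ".pd")

# Prefix "model.ckpt", which the first version of the tests had hard-coded.
hardcoded_like = zero_step_checkpoint("model.ckpt", ".pt")
```

Building the path from the prefix keeps the assertion valid for whatever prefix a fixture configures, instead of failing when the prefix differs from a hard-coded name.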

Validation:

  • uvx ruff check deepmd/pd/train/training.py deepmd/pt/train/training.py source/tests/pd/test_training.py source/tests/pt/test_training.py
  • uvx ruff format --check deepmd/pd/train/training.py deepmd/pt/train/training.py source/tests/pd/test_training.py source/tests/pt/test_training.py

Authored by OpenClaw (model: custom-chat-jinzhezeng-group/gpt-5.5)


Development

Successfully merging this pull request may close these issues.

Runtime Error when Step is 0

2 participants