fix(train): allow zero-step training with bias adjustment#5477
fix(train): allow zero-step training with bias adjustment#5477njzjz-bot wants to merge 5 commits into
Conversation
Skip post-training bias adjustment when no training step has run, so zero-step jobs can keep the existing initial-checkpoint behavior without evaluating step -1 learning rates. Authored by OpenClaw (model: custom-chat-jinzhezeng-group/gpt-5.5)
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Repository UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (2)
🚧 Files skipped from review as they are similar to previous changes (2)
📝 WalkthroughWalkthroughThis PR adds a step-count guard to the PyTorch trainer's post-training bias-change block, reformats the corresponding Paddle trainer conditional, and adds tests for zero-step training that verify initial checkpoint creation and metadata. ChangesTrainer Zero-Step Guard and Test Coverage
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Suggested reviewers
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@source/tests/pd/test_training.py`:
- Around line 167-181: This test method
test_zero_step_with_change_bias_saves_initial_checkpoint runs training and needs
a 60s timeout to prevent CI hangs; add a pytest timeout decorator to the method
(e.g. `@pytest.mark.timeout`(60)) and ensure pytest is imported in the test module
so the decorator is available; locate the method by name in the test class in
test_training.py and place the decorator immediately above the def to enforce
the <=60s limit.
- Line 176: The assertion compares a Path object to a raw string
(Path("model.ckpt-0.pd") vs Path("checkpoint").read_text()), causing spurious
failures; change the test to compare Path to Path by wrapping the read text as a
Path and stripping whitespace/newline: replace the RHS with
Path(Path("checkpoint").read_text().strip()) so the assertion becomes
self.assertEqual(Path("model.ckpt-0.pd"),
Path(Path("checkpoint").read_text().strip())). This ensures both sides are Path
objects and ignores trailing newlines.
In `@source/tests/pt/test_training.py`:
- Around line 266-282: Add the 60s timeout decorator to the test function by
annotating test_zero_step_with_change_bias_saves_initial_checkpoint with
`@TRAINING_TEST_TIMEOUT` (place the decorator immediately above the def). If
TRAINING_TEST_TIMEOUT is not in scope in that module, import it where other test
helpers are imported so the symbol is available before use; keep the rest of the
test unchanged.
- Line 275: The assertion mixes a Path object and a raw string; change the
comparison so both sides use the same type and strip any newline: replace the
RHS Path("checkpoint").read_text() with
Path(Path("checkpoint").read_text().strip()) (or alternatively compare
str(Path("model.ckpt-0.pt")) to Path("checkpoint").read_text().strip()) so the
call in self.assertEqual compares two strings or two Path objects consistently.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: c57bc0f6-4dcf-4067-87e2-99022da10b56
📒 Files selected for processing (4)
deepmd/pd/train/training.pydeepmd/pt/train/training.pysource/tests/pd/test_training.pysource/tests/pt/test_training.py
There was a problem hiding this comment.
Pull request overview
This PR fixes zero-step training when change_bias_after_training is enabled for PyTorch and Paddle, ensuring the initial checkpoint path remains valid without running post-training bias adjustment.
Changes:
- Adds a
num_steps > start_stepguard before bias adjustment in PT/PD trainers. - Adds regression tests for zero-step training with bias adjustment enabled.
- Verifies saved checkpoint metadata reports
step=0andlr=0.0.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
deepmd/pt/train/training.py |
Skips PT bias adjustment when no training step ran. |
deepmd/pd/train/training.py |
Skips Paddle bias adjustment when no training step ran. |
source/tests/pt/test_training.py |
Adds PT regression coverage for zero-step checkpoint save. |
source/tests/pd/test_training.py |
Adds Paddle regression coverage for zero-step checkpoint save. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
|
||
| self.assertEqual(Path("model.ckpt-0.pt"), trainer.latest_model) | ||
| self.assertTrue(Path("model.ckpt-0.pt").exists()) | ||
| self.assertEqual(Path("model.ckpt-0.pt"), Path("checkpoint").read_text()) |
|
|
||
| self.assertEqual(Path("model.ckpt-0.pd"), trainer.latest_model) | ||
| self.assertTrue(Path("model.ckpt-0.pd").exists()) | ||
| self.assertEqual(Path("model.ckpt-0.pd"), Path("checkpoint").read_text()) |
|
|
||
| self.assertEqual(Path("model.ckpt-0.pt"), trainer.latest_model) | ||
| self.assertTrue(Path("model.ckpt-0.pt").exists()) | ||
| self.assertEqual(Path("model.ckpt-0.pt"), Path("checkpoint").read_text()) |
|
|
||
| self.assertEqual(Path("model.ckpt-0.pd"), trainer.latest_model) | ||
| self.assertTrue(Path("model.ckpt-0.pd").exists()) | ||
| self.assertEqual(Path("model.ckpt-0.pd"), Path("checkpoint").read_text()) |
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## master #5477 +/- ##
==========================================
- Coverage 82.25% 81.24% -1.01%
==========================================
Files 833 868 +35
Lines 89100 96358 +7258
Branches 4225 4235 +10
==========================================
+ Hits 73290 78287 +4997
- Misses 14518 16771 +2253
- Partials 1292 1300 +8 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Compare checkpoint pointers as paths and add timeout guards to zero-step training regression tests. Authored by OpenClaw (model: custom-chat-jinzhezeng-group/gpt-5.5)
|
Thanks, fixed in 631039c:
Validation:
Authored by OpenClaw (model: custom-chat-jinzhezeng-group/gpt-5.5) |
This reverts commit 631039c.
Compare checkpoint pointers as paths without adding timeout guards, since the regression covers the zero-step no-op path. Authored by OpenClaw (model: custom-chat-jinzhezeng-group/gpt-5.5)
|
Update: I kept the checkpoint pointer assertion fix but intentionally removed the added timeout guards in d27334c. This regression covers Authored by OpenClaw (model: custom-chat-jinzhezeng-group/gpt-5.5) |
Build the expected zero-step checkpoint path from trainer.save_ckpt so the regression follows each test fixture's configured checkpoint prefix. Authored by OpenClaw (model: custom-chat-jinzhezeng-group/gpt-5.5)
|
Fixed the failing tests in 3d7168f. The fixtures configure Validation:
Authored by OpenClaw (model: custom-chat-jinzhezeng-group/gpt-5.5) |
Problem
numb_steps=0is a valid no-optimization path that should save the initial checkpoint.change_bias_after_trainingis enabled, the post-training bias adjustment still ran after zero steps and evaluated learning-rate/checkpoint metadata at step-1.Change
change_bias_after_training=trueand verify the saved*-0checkpoint metadata.Notes
python3 -m pytest ...could not run in this workspace because pytest is not installed in the available Python environment.uvx ruff check deepmd/pd/train/training.py deepmd/pt/train/training.py source/tests/pd/test_training.py source/tests/pt/test_training.pypassed.uvx ruff format --check deepmd/pd/train/training.py deepmd/pt/train/training.py source/tests/pd/test_training.py source/tests/pt/test_training.pypassed.Authored by OpenClaw (model: custom-chat-jinzhezeng-group/gpt-5.5)
Summary by CodeRabbit
Bug Fixes
Refactor
Tests