Conversation
Pull request overview
Adds a new warmup-stable-decay (WSD) learning-rate scheduler to the backend-agnostic BaseLR registry, wires it into argument checking and backend exports, and expands cross-backend test coverage to validate behavior and consistency.
Changes:
- Implement `LearningRateWSD` (`type="wsd"`) with warmup support plus configurable decay modes (`inverse_linear`, `cosine`, `linear`).
- Extend CLI/config arg validation to accept and validate WSD-specific parameters.
- Add/extend tests across universal, TF, PT, PD, and consistency suites for WSD behavior.
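For readers unfamiliar with the schedule shape, the three WSD phases (linear warmup, constant plateau, decay to `stop_lr`) can be sketched in plain NumPy. This is a minimal illustration, not the deepmd implementation; the parameter names simply mirror the ones mentioned in this PR, and the `inverse_linear` branch is one plausible 1/x-style interpolation that may differ from the actual formula:

```python
import numpy as np

def wsd_lr(step, start_lr=1e-3, stop_lr=1e-5, num_steps=10000,
           warmup_steps=1000, decay_phase_ratio=0.1, decay_type="linear"):
    """Illustrative warmup-stable-decay schedule (not the deepmd code)."""
    decay_steps = max(1, int(decay_phase_ratio * num_steps))
    stable_end = num_steps - decay_steps  # step where the decay phase starts
    if step < warmup_steps:
        # linear warmup from 0 up to start_lr
        return start_lr * step / warmup_steps
    if step < stable_end:
        # stable phase: hold start_lr constant
        return start_lr
    # decay phase: normalized progress tau in [0, 1]
    tau = min(1.0, (step - stable_end) / decay_steps)
    if decay_type == "linear":
        return start_lr + (stop_lr - start_lr) * tau
    if decay_type == "cosine":
        return stop_lr + (start_lr - stop_lr) * 0.5 * (1.0 + np.cos(np.pi * tau))
    if decay_type == "inverse_linear":
        # 1/x-style interpolation between start_lr and stop_lr (assumed form)
        return 1.0 / ((1.0 - tau) / start_lr + tau / stop_lr)
    raise ValueError(f"unknown decay_type {decay_type!r}")
```

With the defaults above, steps 0-999 warm up toward `1e-3`, steps 1000-8999 hold it, and steps 9000-10000 decay toward `1e-5`, which also explains why the consistency tests sample a late step to hit the decay phase.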
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| `deepmd/dpmodel/utils/learning_rate.py` | Adds the `LearningRateWSD` scheduler implementation and registers it under `BaseLR`. |
| `deepmd/utils/argcheck.py` | Adds WSD-specific validation and exposes WSD arguments via the lr args plugin registry. |
| `deepmd/pt/utils/learning_rate.py` | Re-exports `LearningRateWSD` for the PyTorch backend utilities. |
| `deepmd/pd/utils/learning_rate.py` | Re-exports `LearningRateWSD` for the Paddle backend utilities. |
| `source/tests/universal/dpmodel/utils/test_learning_rate.py` | Adds extensive unit tests for WSD (modes, warmup, array input, boundary conditions). |
| `source/tests/tf/test_lr.py` | Adds TF wrapper build/value tests for WSD (default + cosine). |
| `source/tests/pt/test_lr.py` | Adds PT-side WSD curve tests for all decay modes. |
| `source/tests/pd/test_lr.py` | Adds PD-side WSD curve tests for all decay modes. |
| `source/tests/consistent/test_learning_rate.py` | Extends consistency parameterization to include WSD and adjusts the sampling step to hit the decay phase. |
📝 Walkthrough
Adds a new plugin-registered learning rate schedule.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 1
🧹 Nitpick comments (2)
source/tests/tf/test_lr.py (1)
106-140: These assertions don't exercise the TF graph path.
`lr_schedule.value()` just delegates to `base_lr.value()`, so both new tests still pass if `LearningRateSchedule.build()` is wrong for `type="wsd"`. Please evaluate the tensor returned by `build()` at a mid-decay step and compare that result instead.
🧪 Suggested coverage improvement

```diff
-        global_step = tf.constant(0, dtype=tf.int64)
-        lr_schedule.build(global_step, num_steps=10000)
+        g = tf.Graph()
+        with g.as_default():
+            global_step = tf.placeholder(shape=[], dtype=tf.int64)
+            lr_tensor = lr_schedule.build(global_step, num_steps=10000)
         self.assertIsInstance(lr_schedule.base_lr, LearningRateWSD)
-        np.testing.assert_allclose(
-            lr_schedule.value(9500), lr_schedule.base_lr.value(9500), rtol=1e-10
-        )
+        with tf.Session(graph=g) as sess:
+            tensor_value = sess.run(lr_tensor, feed_dict={global_step: 9500})
+            np.testing.assert_allclose(
+                tensor_value, lr_schedule.base_lr.value(9500), rtol=1e-10
+            )
```

Apply the same pattern to the cosine WSD variant as well.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@source/tests/tf/test_lr.py` around lines 106 - 140: The tests are only exercising the Python path because lr_schedule.value(9500) passes a plain int; update both tests to pass a TF tensor so the built TF graph is evaluated: after calling LearningRateSchedule.build(global_step, num_steps=10000), call lr_schedule.value(tf.constant(9500, dtype=tf.int64)) and compare it to lr_schedule.base_lr.value(tf.constant(9500, dtype=tf.int64)) (and do the same change in test_wsd_cosine_build_and_value) to ensure LearningRateSchedule.build, base_lr and LearningRateWSD TF-paths are exercised.

source/tests/universal/dpmodel/utils/test_learning_rate.py (1)
142-165: Consider splitting mixed validation assertions into separate tests.
`test_invalid_decay_phase_ratio` also checks invalid `decay_type`; splitting improves failure localization.
♻️ Suggested test split

```diff
-    def test_invalid_decay_phase_ratio(self) -> None:
-        """Test invalid WSD decay_phase_ratio values."""
+    def test_invalid_decay_phase_ratio(self) -> None:
+        """Test invalid WSD decay_phase_ratio values."""
         with self.assertRaises(ValueError):
             LearningRateWSD(
                 start_lr=1e-3,
                 stop_lr=1e-5,
                 num_steps=10000,
                 decay_phase_ratio=0.0,
             )
         with self.assertRaises(ValueError):
             LearningRateWSD(
                 start_lr=1e-3,
                 stop_lr=1e-5,
                 num_steps=10000,
                 decay_phase_ratio=1.1,
             )
+
+    def test_invalid_decay_type(self) -> None:
+        """Test invalid WSD decay_type values."""
         with self.assertRaises(ValueError):
             LearningRateWSD(
                 start_lr=1e-3,
                 stop_lr=1e-5,
                 num_steps=10000,
                 decay_type="bad_mode",
             )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@source/tests/universal/dpmodel/utils/test_learning_rate.py` around lines 142 - 165, test_invalid_decay_phase_ratio currently asserts multiple unrelated invalid cases (invalid decay_phase_ratio and invalid decay_type) in one test; split it into separate tests so each assertion checks a single validation rule. Create one test (e.g., test_invalid_decay_phase_ratio_values) that calls LearningRateWSD with decay_phase_ratio=0.0 and decay_phase_ratio=1.1 and asserts ValueError, and another test (e.g., test_invalid_decay_type) that calls LearningRateWSD with decay_type="bad_mode" and asserts ValueError; keep function/class names LearningRateWSD, decay_phase_ratio, and decay_type to locate the code. Ensure each new test has a clear name and only one responsibility for better failure localization.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@deepmd/dpmodel/utils/learning_rate.py`:
- Around line 514-525: The current int(self.decay_phase_ratio * self.num_steps)
can produce 0 or exceed post-warmup steps; change the computation of
decay_phase_steps to derive from and clamp against decay_num_steps instead of
raising: compute e.g. desired = max(0, int(round(self.decay_phase_ratio *
self.num_steps))) (or 0 if you prefer), then set self.decay_phase_steps =
min(desired, self.decay_num_steps) and if self.decay_phase_steps == 0 and
self.decay_num_steps > 0 set it to 1 (or otherwise handle the zero-decay case),
replacing the two ValueError checks; update code around decay_phase_steps,
decay_phase_ratio, num_steps and decay_num_steps to use this clamped value so
short runs or heavy-warmup configs don’t raise at runtime.
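The clamp-and-bump logic this prompt describes can be sketched as a standalone helper. This is illustrative only, not the actual patch; `decay_num_steps` here stands for the post-warmup step budget, as in the prompt:

```python
def clamp_decay_phase_steps(decay_phase_ratio, num_steps, decay_num_steps):
    """Clamp instead of raising, per the suggestion above (illustrative sketch)."""
    desired = max(0, int(round(decay_phase_ratio * num_steps)))
    # never exceed the steps actually available after warmup
    decay_phase_steps = min(desired, decay_num_steps)
    if decay_phase_steps == 0 and decay_num_steps > 0:
        # guarantee at least one decay step when there is room for decay
        decay_phase_steps = 1
    return decay_phase_steps
```

With this shape, short runs or heavy-warmup configs silently get a shorter decay phase instead of raising at runtime.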
---
Nitpick comments:
In `@source/tests/tf/test_lr.py`:
- Around line 106-140: The tests are only exercising the Python path because
lr_schedule.value(9500) passes a plain int; update both tests to pass a TF
tensor so the built TF graph is evaluated: after calling
LearningRateSchedule.build(global_step, num_steps=10000), call
lr_schedule.value(tf.constant(9500, dtype=tf.int64)) and compare it to
lr_schedule.base_lr.value(tf.constant(9500, dtype=tf.int64)) (and do the same
change in test_wsd_cosine_build_and_value) to ensure LearningRateSchedule.build,
base_lr and LearningRateWSD TF-paths are exercised.
In `@source/tests/universal/dpmodel/utils/test_learning_rate.py`:
- Around line 142-165: test_invalid_decay_phase_ratio currently asserts multiple
unrelated invalid cases (invalid decay_phase_ratio and invalid decay_type) in
one test; split it into separate tests so each assertion checks a single
validation rule. Create one test (e.g., test_invalid_decay_phase_ratio_values)
that calls LearningRateWSD with decay_phase_ratio=0.0 and decay_phase_ratio=1.1
and asserts ValueError, and another test (e.g., test_invalid_decay_type) that
calls LearningRateWSD with decay_type="bad_mode" and asserts ValueError; keep
function/class names LearningRateWSD, decay_phase_ratio, and decay_type to
locate the code. Ensure each new test has a clear name and only one
responsibility for better failure localization.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: a283f5a5-f445-4ee4-858e-4f089ee4287d
📒 Files selected for processing (9)
- deepmd/dpmodel/utils/learning_rate.py
- deepmd/pd/utils/learning_rate.py
- deepmd/pt/utils/learning_rate.py
- deepmd/utils/argcheck.py
- source/tests/consistent/test_learning_rate.py
- source/tests/pd/test_lr.py
- source/tests/pt/test_lr.py
- source/tests/tf/test_lr.py
- source/tests/universal/dpmodel/utils/test_learning_rate.py
Codecov Report
❌ Patch coverage is
Additional details and impacted files

```diff
@@ Coverage Diff @@
##           master    #5326      +/-   ##
==========================================
+ Coverage   82.29%   82.32%   +0.02%
==========================================
  Files         775      777       +2
  Lines       77627    77877     +250
  Branches     3675     3678       +3
==========================================
+ Hits        63887    64112     +225
- Misses      12566    12590      +24
- Partials     1174     1175       +1
```

☔ View full report in Codecov by Sentry.
♻️ Duplicate comments (1)
deepmd/dpmodel/utils/learning_rate.py (1)
514-518: ⚠️ Potential issue | 🟡 Minor
Handle edge case when `decay_num_steps` is zero.
The clamping logic at lines 515-516 ensures `decay_phase_steps >= 1`, but when `decay_num_steps = 0` (i.e., `num_steps == warmup_steps`), this results in `stable_steps = -1`. While the `_decay_value` logic handles this gracefully by returning `stop_lr` immediately, having a negative `stable_steps` is semantically incorrect.
Consider adding early validation or adjusting the clamping:
🛡️ Suggested fix
```diff
         # === Derive stable and decay phase lengths ===
         self.decay_phase_ratio = decay_phase_ratio
         self.decay_type = decay_type
+        # When decay_num_steps is 0, there's no room for decay
+        if self.decay_num_steps == 0:
+            self.decay_phase_steps = 0
+            self.stable_steps = 0
+            return
         # Clamp decay_phase_steps to valid range [1, decay_num_steps]
         self.decay_phase_steps = max(
             1, min(int(self.decay_phase_ratio * self.num_steps), self.decay_num_steps)
         )
         self.stable_steps = self.decay_num_steps - self.decay_phase_steps
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@deepmd/dpmodel/utils/learning_rate.py` around lines 514 - 518, The current clamping forces decay_phase_steps >= 1 which yields stable_steps negative when decay_num_steps == 0; update the logic so decay_phase_steps is clamped to the range [0, decay_num_steps] (allowing zero) or explicitly handle the decay_num_steps == 0 case by setting decay_phase_steps = 0 and stable_steps = 0 and/or validating decay_num_steps >= 0; adjust the block that sets self.decay_phase_steps and self.stable_steps (and any callers like _decay_value) to rely on non-negative stable_steps and ensure no negative values are produced.
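The early-return variant described here can be sketched as a small helper (illustrative, not the actual patch; the real code sets these as attributes in `__init__`):

```python
def derive_phase_lengths(decay_phase_ratio, num_steps, decay_num_steps):
    """Sketch of the suggested fix: handle decay_num_steps == 0 explicitly."""
    if decay_num_steps == 0:
        # no room for decay at all: both the decay and stable phases are empty
        return 0, 0
    decay_phase_steps = max(
        1, min(int(decay_phase_ratio * num_steps), decay_num_steps)
    )
    stable_steps = decay_num_steps - decay_phase_steps  # now always >= 0
    return decay_phase_steps, stable_steps
```

The key property is that `stable_steps` can no longer go negative in the `num_steps == warmup_steps` corner case.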
🧹 Nitpick comments (1)
deepmd/dpmodel/utils/learning_rate.py (1)
559-562: Redundant `xp.asarray` wrapping.
Since `tau` is already an array with `step_dtype`, the expression `xp.pi * tau` will produce an array of the same dtype. The `xp.asarray(..., dtype=step_dtype)` call is redundant.
♻️ Suggested simplification
```diff
         elif self.decay_type == "cosine":
             decay_lr = stop_lr + (start_lr - stop_lr) * 0.5 * (
-                one + xp.cos(xp.asarray(xp.pi * tau, dtype=step_dtype))
+                one + xp.cos(xp.pi * tau)
             )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@deepmd/dpmodel/utils/learning_rate.py` around lines 559 - 562, In the cosine decay branch (where self.decay_type == "cosine") the expression xp.asarray(xp.pi * tau, dtype=step_dtype) is redundant because xp.pi * tau already yields an array with the correct dtype; replace that subexpression with xp.pi * tau so decay_lr is computed as stop_lr + (start_lr - stop_lr) * 0.5 * (one + xp.cos(xp.pi * tau)), keeping references to decay_lr, start_lr, stop_lr, tau, xp.cos and step_dtype unchanged.
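The dtype-preservation claim is easy to check with NumPy directly (assuming the `xp` namespace follows NumPy's promotion rules for Python scalars, as NumPy itself does):

```python
import numpy as np

# tau stands in for the decay-progress array from the snippet above
tau = np.asarray([0.0, 0.5, 1.0], dtype=np.float32)

# Multiplying by the Python float np.pi preserves the array dtype,
# so wrapping the product in asarray(..., dtype=...) is a no-op here.
product = np.pi * tau
print(product.dtype)  # float32
```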
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@deepmd/dpmodel/utils/learning_rate.py`:
- Around line 514-518: The current clamping forces decay_phase_steps >= 1 which
yields stable_steps negative when decay_num_steps == 0; update the logic so
decay_phase_steps is clamped to the range [0, decay_num_steps] (allowing zero)
or explicitly handle the decay_num_steps == 0 case by setting decay_phase_steps
= 0 and stable_steps = 0 and/or validating decay_num_steps >= 0; adjust the
block that sets self.decay_phase_steps and self.stable_steps (and any callers
like _decay_value) to rely on non-negative stable_steps and ensure no negative
values are produced.
---
Nitpick comments:
In `@deepmd/dpmodel/utils/learning_rate.py`:
- Around line 559-562: In the cosine decay branch (where self.decay_type ==
"cosine") the expression xp.asarray(xp.pi * tau, dtype=step_dtype) is redundant
because xp.pi * tau already yields an array with the correct dtype; replace that
subexpression with xp.pi * tau so decay_lr is computed as stop_lr + (start_lr -
stop_lr) * 0.5 * (one + xp.cos(xp.pi * tau)), keeping references to decay_lr,
start_lr, stop_lr, tau, xp.cos and step_dtype unchanged.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 1363b61c-f8fb-4804-b834-420b56c86723
📒 Files selected for processing (2)
- deepmd/dpmodel/utils/learning_rate.py
- source/tests/universal/dpmodel/utils/test_learning_rate.py
🚧 Files skipped from review as they are similar to previous changes (1)
- source/tests/universal/dpmodel/utils/test_learning_rate.py
The doc will be added through PR #5276
Summary by CodeRabbit
New Features
Tests