
fix: Validate fp16.loss_scale is finite and non-negative#7889

Merged
tohtana merged 12 commits into deepspeedai:master from nathon-lee:fix_issue_7852
Mar 13, 2026

Conversation

Contributor

@nathon-lee nathon-lee commented Mar 6, 2026

Validate fp16.loss_scale is finite and non-negative

Add a Pydantic field validator to DeepSpeedFP16Config to reject NaN/inf/-inf and negative values for fp16.loss_scale (while keeping 0 as dynamic loss scaling). This prevents invalid configs from silently initializing and causing NaNs during training.
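The described check can be sketched as a minimal Pydantic v2 model. This is illustrative only; the merged code lives inside DeepSpeed's config module and may differ in details such as error messages and base class:

```python
# Sketch of the validation this PR describes (not the exact merged code):
# reject NaN/inf/-inf and negative values, keep 0 for dynamic loss scaling.
import math

from pydantic import BaseModel, field_validator


class DeepSpeedFP16Config(BaseModel):  # stand-in for the real config class
    loss_scale: float = 0.0  # 0 selects dynamic loss scaling

    @field_validator("loss_scale")
    @classmethod
    def _validate_loss_scale(cls, v: float) -> float:
        if not math.isfinite(v):
            raise ValueError("fp16.loss_scale must be finite (not inf/-inf/nan)")
        if v < 0:
            raise ValueError("fp16.loss_scale must be >= 0 (0 enables dynamic loss scaling)")
        return v
```

With this in place, an invalid config such as `{"fp16": {"loss_scale": float("inf")}}` fails at parse time with a `ValidationError` instead of producing NaNs mid-training.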

Test:
Run pytest -q tests/unit/runtime/test_precision_config_loss_scale.py

Result:

root@72170d0458e9:/home/DeepSpeed_woo# pytest -q tests/unit/runtime/test_precision_config_loss_scale.py
=================================================================== test session starts ===================================================================
platform linux -- Python 3.11.10, pytest-8.3.5, pluggy-1.6.0 -- /usr/bin/python
cachedir: .pytest_cache
Using --randomly-seed=1526199052
rootdir: /home/DeepSpeed_woo/tests
configfile: pytest.ini
plugins: xdist-3.8.0, randomly-4.0.1, forked-1.6.0, anyio-4.6.0
collected 10 items

tests/unit/runtime/test_precision_config_loss_scale.py::test_fp16_loss_scale_accepts_valid_values[3] PASSED                                         [ 10%]
tests/unit/runtime/test_precision_config_loss_scale.py::test_fp16_loss_scale_accepts_valid_values[0] PASSED                                         [ 20%]
tests/unit/runtime/test_precision_config_loss_scale.py::test_fp16_loss_scale_rejects_invalid_values[inf] PASSED                                     [ 30%]
tests/unit/runtime/test_precision_config_loss_scale.py::test_fp16_loss_scale_accepts_valid_values[1] PASSED                                         [ 40%]
tests/unit/runtime/test_precision_config_loss_scale.py::test_fp16_loss_scale_rejects_invalid_values[nan] PASSED                                     [ 50%]
tests/unit/runtime/test_precision_config_loss_scale.py::test_fp16_loss_scale_accepts_valid_values[2.0] PASSED                                       [ 60%]
tests/unit/runtime/test_precision_config_loss_scale.py::test_fp16_loss_scale_rejects_invalid_values[True] PASSED                                    [ 70%]
tests/unit/runtime/test_precision_config_loss_scale.py::test_fp16_loss_scale_invalid_type_has_clear_error[loss_scale0] PASSED                       [ 80%]
tests/unit/runtime/test_precision_config_loss_scale.py::test_fp16_loss_scale_rejects_invalid_values[-1] PASSED                                      [ 90%]
tests/unit/runtime/test_precision_config_loss_scale.py::test_fp16_loss_scale_invalid_type_has_clear_error[loss_scale1] PASSED                       [100%]

(30 durations < 1s hidden.  Use -vv to show these durations.)
============================================================= 10 passed, 16 warnings in 4.18s =============================================================

Fix issue #7852


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f0059a795a


Loss scaling value. Default value of 0 means dynamic loss scaling instead of static loss scale.
"""

@field_validator("loss_scale")

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2: Run the loss_scale validator before type coercion

This validator is declared with the default mode="after", so Pydantic will coerce inputs to float first; as a result, the new isinstance(v, bool) guard never triggers because true/false become 1.0/0.0 before _validate_loss_scale runs. In configs that set fp16.loss_scale to a boolean, the value is still silently accepted, which defeats the stated validation goal and can unexpectedly switch to static scaling (true -> 1.0).
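The coercion behavior Codex describes can be reproduced with a minimal model (the class name `AfterModeConfig` is illustrative, not from the PR):

```python
# Minimal repro of the reported issue: with the default mode="after",
# Pydantic coerces True -> 1.0 before the validator runs, so an
# isinstance(v, bool) guard can never fire.
from pydantic import BaseModel, field_validator


class AfterModeConfig(BaseModel):
    loss_scale: float = 0.0

    @field_validator("loss_scale")  # default mode="after": runs post-coercion
    @classmethod
    def _reject_bool(cls, v):
        if isinstance(v, bool):  # never true here: v is already a float
            raise ValueError("fp16.loss_scale must be a number, not bool")
        return v


cfg = AfterModeConfig(loss_scale=True)  # silently accepted as static scale 1.0
```

Declaring the validator with `mode="before"` hands it the raw input instead, which is what the fix below does.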


Collaborator


I think this comment makes sense. Can you address it? @nathon-lee

Contributor Author


Thanks — I agree this comment makes sense. I’ll address it and push an update shortly. @tohtana

Contributor Author


Thanks for the review — addressed: the loss_scale validator now runs with mode="before" (so bools are rejected prior to coercion) and I added unit tests for (-1, inf, nan, True).

@nathon-lee nathon-lee changed the title Validate fp16.loss_scale is finite and non-negative fix: Validate fp16.loss_scale is finite and non-negative Mar 6, 2026
@PKUWZP PKUWZP self-requested a review March 6, 2026 21:17
Collaborator

@PKUWZP PKUWZP left a comment


Switch to mode="before" and add some tests.

"""

@field_validator("loss_scale")
@classmethod
Collaborator


  • Consider using mode="before" for the entire validator rather than splitting into two validators. A single
    mode="before" validator can handle both the bool check and the finite/negative checks:

    @field_validator("loss_scale", mode="before")
    @classmethod
    def _validate_loss_scale(cls, v):
        if isinstance(v, bool):
            raise ValueError("fp16.loss_scale must be a number, not bool")
        v = float(v)
        if not math.isfinite(v):
            raise ValueError("fp16.loss_scale must be a finite number (not inf/-inf/nan)")
        if v < 0:
            raise ValueError("fp16.loss_scale must be >= 0 (0 enables dynamic loss scaling)")
        return v

  • Test coverage: There are no tests included. A few unit tests in tests/unit/runtime/ asserting that invalid loss_scale values (-1, float('inf'), float('nan'), True) raise ValidationError would strengthen this PR and prevent regressions.

The existing pattern in the repo uses DeepSpeedFP16Config(loss_scale=...) directly, which makes such tests straightforward.
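Tests in that style could look roughly like the sketch below. The config class here is a self-contained stand-in so the snippet runs on its own; the real tests would import DeepSpeed's actual `DeepSpeedFP16Config`:

```python
# Hypothetical unit tests in the style the reviewer suggests; the stand-in
# model mirrors the validator under discussion, not the merged DeepSpeed code.
import math

import pytest
from pydantic import BaseModel, ValidationError, field_validator


class DeepSpeedFP16Config(BaseModel):  # stand-in for deepspeed's config model
    loss_scale: float = 0.0

    @field_validator("loss_scale", mode="before")
    @classmethod
    def _validate_loss_scale(cls, v):
        if isinstance(v, bool):
            raise ValueError("fp16.loss_scale must be a number, not bool")
        v = float(v)
        if not math.isfinite(v):
            raise ValueError("fp16.loss_scale must be a finite number")
        if v < 0:
            raise ValueError("fp16.loss_scale must be >= 0")
        return v


@pytest.mark.parametrize("loss_scale", [-1, float("inf"), float("nan"), True])
def test_fp16_loss_scale_rejects_invalid_values(loss_scale):
    with pytest.raises(ValidationError):
        DeepSpeedFP16Config(loss_scale=loss_scale)


@pytest.mark.parametrize("loss_scale", [0, 1, 2.0, 3])
def test_fp16_loss_scale_accepts_valid_values(loss_scale):
    assert DeepSpeedFP16Config(loss_scale=loss_scale).loss_scale == float(loss_scale)
```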

Contributor Author


Thanks — good suggestion. I’ll consolidate into a single mode="before" validator and add unit tests (e.g. -1, inf, nan, True -> ValidationError) using DeepSpeedFP16Config(loss_scale=...). I’ll push an update shortly. @PKUWZP

Contributor Author


Thanks for the review — addressed: the loss_scale validator now runs with mode="before" (so bools are rejected prior to coercion) and I added unit tests for (-1, inf, nan, True).

Collaborator

@tohtana tohtana Mar 8, 2026


@nathon-lee If we pass an invalid value like [] and {}, won't float() raise TypeError now?
The current master raises Pydantic's ValidationError for these, which is clearer than a raw TypeError.

Contributor Author


@tohtana Thanks for catching this.

I added a try/except (TypeError, ValueError) around float(v) in the mode="before" validator so invalid types (e.g. [], {}) are converted into a clear ValueError, which Pydantic wraps as ValidationError (instead of surfacing a raw TypeError). I also added unit tests covering [] and {} to prevent regressions.
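The final shape of the validator the author describes would be roughly as follows (a sketch assembled from the discussion; exact messages and the surrounding class in the merged code may differ):

```python
# Sketch of the final mode="before" validator: rejects bools pre-coercion,
# converts float() failures for types like [] / {} into ValueError so
# Pydantic reports a ValidationError rather than a raw TypeError.
import math

from pydantic import BaseModel, field_validator


class DeepSpeedFP16Config(BaseModel):  # stand-in for the real config class
    loss_scale: float = 0.0  # 0 selects dynamic loss scaling

    @field_validator("loss_scale", mode="before")
    @classmethod
    def _validate_loss_scale(cls, v):
        if isinstance(v, bool):  # runs pre-coercion, so bools are still bools
            raise ValueError("fp16.loss_scale must be a number, not bool")
        try:
            v = float(v)
        except (TypeError, ValueError) as err:
            raise ValueError(f"fp16.loss_scale must be a number, got {type(v).__name__}") from err
        if not math.isfinite(v):
            raise ValueError("fp16.loss_scale must be a finite number (not inf/-inf/nan)")
        if v < 0:
            raise ValueError("fp16.loss_scale must be >= 0 (0 enables dynamic loss scaling)")
        return v
```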

@nathon-lee nathon-lee force-pushed the fix_issue_7852 branch 2 times, most recently from f0059a7 to 3ead20d on March 7, 2026 03:20
Signed-off-by: nathon-lee <leejianwoo@gmail.com>
Collaborator

@tohtana tohtana left a comment


Thank you for the fix! @nathon-lee

@tohtana tohtana enabled auto-merge (squash) March 13, 2026 00:15
@tohtana
Collaborator

tohtana commented Mar 13, 2026

@PKUWZP Can you confirm that your request has been met? We need your confirmation to merge this PR.

@PKUWZP PKUWZP self-requested a review March 13, 2026 01:11
Collaborator

@PKUWZP PKUWZP left a comment


The changes look good to me.

@tohtana tohtana merged commit 63eeb11 into deepspeedai:master Mar 13, 2026
1 check passed