
fix: Validate fp16.loss_scale is finite and non-negative#7889

Merged
tohtana merged 12 commits into deepspeedai:master from nathon-lee:fix_issue_7852
Mar 13, 2026

Conversation

Contributor

@nathon-lee nathon-lee commented Mar 6, 2026

Validate fp16.loss_scale is finite and non-negative

Add a Pydantic field validator to DeepSpeedFP16Config to reject NaN/inf/-inf and negative values for fp16.loss_scale (while keeping 0 as dynamic loss scaling). This prevents invalid configs from silently initializing and causing NaNs during training.
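The described check can be sketched as a minimal Pydantic v2 model. This is illustrative only; the merged code lives inside DeepSpeed's config module and may differ in details such as error messages and base class:

```python
# Sketch of the validation this PR describes (not the exact merged code):
# reject NaN/inf/-inf and negative values, keep 0 for dynamic loss scaling.
import math

from pydantic import BaseModel, field_validator


class DeepSpeedFP16Config(BaseModel):  # stand-in for the real config class
    loss_scale: float = 0.0  # 0 selects dynamic loss scaling

    @field_validator("loss_scale")
    @classmethod
    def _validate_loss_scale(cls, v: float) -> float:
        if not math.isfinite(v):
            raise ValueError("fp16.loss_scale must be finite (not inf/-inf/nan)")
        if v < 0:
            raise ValueError("fp16.loss_scale must be >= 0 (0 enables dynamic loss scaling)")
        return v
```

With this in place, an invalid config such as `{"fp16": {"loss_scale": float("inf")}}` fails at parse time with a `ValidationError` instead of producing NaNs mid-training.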

Test:
Run pytest -q tests/unit/runtime/test_precision_config_loss_scale.py

Result:

root@72170d0458e9:/home/DeepSpeed_woo# pytest -q tests/unit/runtime/test_precision_config_loss_scale.py
=================================================================== test session starts ===================================================================
platform linux -- Python 3.11.10, pytest-8.3.5, pluggy-1.6.0 -- /usr/bin/python
cachedir: .pytest_cache
Using --randomly-seed=1526199052
rootdir: /home/DeepSpeed_woo/tests
configfile: pytest.ini
plugins: xdist-3.8.0, randomly-4.0.1, forked-1.6.0, anyio-4.6.0
collected 10 items

tests/unit/runtime/test_precision_config_loss_scale.py::test_fp16_loss_scale_accepts_valid_values[3] PASSED                                         [ 10%]
tests/unit/runtime/test_precision_config_loss_scale.py::test_fp16_loss_scale_accepts_valid_values[0] PASSED                                         [ 20%]
tests/unit/runtime/test_precision_config_loss_scale.py::test_fp16_loss_scale_rejects_invalid_values[inf] PASSED                                     [ 30%]
tests/unit/runtime/test_precision_config_loss_scale.py::test_fp16_loss_scale_accepts_valid_values[1] PASSED                                         [ 40%]
tests/unit/runtime/test_precision_config_loss_scale.py::test_fp16_loss_scale_rejects_invalid_values[nan] PASSED                                     [ 50%]
tests/unit/runtime/test_precision_config_loss_scale.py::test_fp16_loss_scale_accepts_valid_values[2.0] PASSED                                       [ 60%]
tests/unit/runtime/test_precision_config_loss_scale.py::test_fp16_loss_scale_rejects_invalid_values[True] PASSED                                    [ 70%]
tests/unit/runtime/test_precision_config_loss_scale.py::test_fp16_loss_scale_invalid_type_has_clear_error[loss_scale0] PASSED                       [ 80%]
tests/unit/runtime/test_precision_config_loss_scale.py::test_fp16_loss_scale_rejects_invalid_values[-1] PASSED                                      [ 90%]
tests/unit/runtime/test_precision_config_loss_scale.py::test_fp16_loss_scale_invalid_type_has_clear_error[loss_scale1] PASSED                       [100%]

(30 durations < 1s hidden.  Use -vv to show these durations.)
============================================================= 10 passed, 16 warnings in 4.18s =============================================================

Fix issue #7852


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f0059a795a


Loss scaling value. Default value of 0 means dynamic loss scaling instead of static loss scale.
"""

@field_validator("loss_scale")

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2: Run the loss_scale validator before type coercion

This validator is declared with the default mode="after", so Pydantic will coerce inputs to float first; as a result, the new isinstance(v, bool) guard never triggers because true/false become 1.0/0.0 before _validate_loss_scale runs. In configs that set fp16.loss_scale to a boolean, the value is still silently accepted, which defeats the stated validation goal and can unexpectedly switch to static scaling (true -> 1.0).
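The coercion behavior Codex describes can be reproduced with a minimal model (the class name `AfterModeConfig` is illustrative, not from the PR):

```python
# Minimal repro of the reported issue: with the default mode="after",
# Pydantic coerces True -> 1.0 before the validator runs, so an
# isinstance(v, bool) guard can never fire.
from pydantic import BaseModel, field_validator


class AfterModeConfig(BaseModel):
    loss_scale: float = 0.0

    @field_validator("loss_scale")  # default mode="after": runs post-coercion
    @classmethod
    def _reject_bool(cls, v):
        if isinstance(v, bool):  # never true here: v is already a float
            raise ValueError("fp16.loss_scale must be a number, not bool")
        return v


cfg = AfterModeConfig(loss_scale=True)  # silently accepted as static scale 1.0
```

Declaring the validator with `mode="before"` hands it the raw input instead, which is what the fix below does.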


Collaborator


I think this comment makes sense. Can you address it? @nathon-lee

Contributor Author


Thanks — I agree this comment makes sense. I’ll address it and push an update shortly. @tohtana

Contributor Author


Thanks for the review — addressed: the loss_scale validator now runs with mode="before" (so bools are rejected prior to coercion) and I added unit tests for (-1, inf, nan, True).

@nathon-lee nathon-lee changed the title Validate fp16.loss_scale is finite and non-negative fix: Validate fp16.loss_scale is finite and non-negative Mar 6, 2026
@PKUWZP PKUWZP self-requested a review March 6, 2026 21:17
Collaborator

@PKUWZP PKUWZP left a comment


Switch to mode="before" and add some tests.

"""

@field_validator("loss_scale")
@classmethod
Collaborator


  • Consider using mode="before" for the entire validator rather than splitting into two validators. A single
    mode="before" validator can handle both the bool check and the finite/negative checks:

    @field_validator("loss_scale", mode="before")
    @classmethod
    def _validate_loss_scale(cls, v):
        if isinstance(v, bool):
            raise ValueError("fp16.loss_scale must be a number, not bool")
        v = float(v)
        if not math.isfinite(v):
            raise ValueError("fp16.loss_scale must be a finite number (not inf/-inf/nan)")
        if v < 0:
            raise ValueError("fp16.loss_scale must be >= 0 (0 enables dynamic loss scaling)")
        return v

  • Test coverage: There are no tests included. A few unit tests in tests/unit/runtime/ asserting that invalid loss_scale values (-1, float('inf'), float('nan'), True) raise ValidationError would strengthen this PR and prevent regressions.

The existing pattern in the repo uses DeepSpeedFP16Config(loss_scale=...) directly, which makes such tests straightforward.
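Tests in that style could look roughly like the sketch below. The config class here is a self-contained stand-in so the snippet runs on its own; the real tests would import DeepSpeed's actual `DeepSpeedFP16Config`:

```python
# Hypothetical unit tests in the style the reviewer suggests; the stand-in
# model mirrors the validator under discussion, not the merged DeepSpeed code.
import math

import pytest
from pydantic import BaseModel, ValidationError, field_validator


class DeepSpeedFP16Config(BaseModel):  # stand-in for deepspeed's config model
    loss_scale: float = 0.0

    @field_validator("loss_scale", mode="before")
    @classmethod
    def _validate_loss_scale(cls, v):
        if isinstance(v, bool):
            raise ValueError("fp16.loss_scale must be a number, not bool")
        v = float(v)
        if not math.isfinite(v):
            raise ValueError("fp16.loss_scale must be a finite number")
        if v < 0:
            raise ValueError("fp16.loss_scale must be >= 0")
        return v


@pytest.mark.parametrize("loss_scale", [-1, float("inf"), float("nan"), True])
def test_fp16_loss_scale_rejects_invalid_values(loss_scale):
    with pytest.raises(ValidationError):
        DeepSpeedFP16Config(loss_scale=loss_scale)


@pytest.mark.parametrize("loss_scale", [0, 1, 2.0, 3])
def test_fp16_loss_scale_accepts_valid_values(loss_scale):
    assert DeepSpeedFP16Config(loss_scale=loss_scale).loss_scale == float(loss_scale)
```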

Contributor Author


Thanks — good suggestion. I’ll consolidate into a single mode="before" validator and add unit tests (e.g. -1, inf, nan, True -> ValidationError) using DeepSpeedFP16Config(loss_scale=...). I’ll push an update shortly. @PKUWZP

Contributor Author


Thanks for the review — addressed: the loss_scale validator now runs with mode="before" (so bools are rejected prior to coercion) and I added unit tests for (-1, inf, nan, True).

Collaborator

@tohtana tohtana Mar 8, 2026


@nathon-lee If we pass an invalid value like [] and {}, won't float() raise TypeError now?
The current master raises Pydantic's ValidationError for these, which is clearer than a raw TypeError.

Contributor Author


@tohtana Thanks for catching this.

I added a try/except (TypeError, ValueError) around float(v) in the mode="before" validator so invalid types (e.g. [], {}) are converted into a clear ValueError, which Pydantic wraps as ValidationError (instead of surfacing a raw TypeError). I also added unit tests covering [] and {} to prevent regressions.
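The final shape of the validator the author describes would be roughly as follows (a sketch assembled from the discussion; exact messages and the surrounding class in the merged code may differ):

```python
# Sketch of the final mode="before" validator: rejects bools pre-coercion,
# converts float() failures for types like [] / {} into ValueError so
# Pydantic reports a ValidationError rather than a raw TypeError.
import math

from pydantic import BaseModel, field_validator


class DeepSpeedFP16Config(BaseModel):  # stand-in for the real config class
    loss_scale: float = 0.0  # 0 selects dynamic loss scaling

    @field_validator("loss_scale", mode="before")
    @classmethod
    def _validate_loss_scale(cls, v):
        if isinstance(v, bool):  # runs pre-coercion, so bools are still bools
            raise ValueError("fp16.loss_scale must be a number, not bool")
        try:
            v = float(v)
        except (TypeError, ValueError) as err:
            raise ValueError(f"fp16.loss_scale must be a number, got {type(v).__name__}") from err
        if not math.isfinite(v):
            raise ValueError("fp16.loss_scale must be a finite number (not inf/-inf/nan)")
        if v < 0:
            raise ValueError("fp16.loss_scale must be >= 0 (0 enables dynamic loss scaling)")
        return v
```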

@nathon-lee nathon-lee force-pushed the fix_issue_7852 branch 2 times, most recently from f0059a7 to 3ead20d on March 7, 2026 03:20
Signed-off-by: nathon-lee <leejianwoo@gmail.com>
Collaborator

@tohtana tohtana left a comment


Thank you for the fix! @nathon-lee

@tohtana tohtana enabled auto-merge (squash) March 13, 2026 00:15
@tohtana
Collaborator

tohtana commented Mar 13, 2026

@PKUWZP Can you confirm that your request has been met? We need your confirmation to merge this PR.

@PKUWZP PKUWZP self-requested a review March 13, 2026 01:11
Collaborator

@PKUWZP PKUWZP left a comment


The changes look good to me.

@tohtana tohtana merged commit 63eeb11 into deepspeedai:master Mar 13, 2026
1 check passed