[DeepSeek v3] Add grad mask and update MLA init#3864
Conversation
Codecov Report ✅ All modified and coverable lines are covered by tests.
This Pull Request introduces two stability-focused improvements for DeepSeek-V3 training: a per-token gradient mask and a constant-standard-deviation initialization for MLA projections. These changes are well-integrated into the existing configuration and layer structure, with comprehensive unit tests provided for the new gradient masking utility.
🔍 General Feedback
- Stability: The opt-in gradient mask is a robust defensive mechanism against gradient overflows in bf16, particularly useful for large-scale training.
- Precision: Using `float32` for the RMS calculation in the gradient mask is a good practice to maintain precision.
- Initialization: The specialized initialization for MLA projections correctly follows the DeepSeek-V3 architecture's requirements.
- Testing: The new unit tests cover edge cases and dtype preservation effectively.
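The constant-std initialization praised above can be sketched roughly as follows. This is a hypothetical illustration, not the PR's actual code: the function name `mla_init`, its signature, and the use of a plain (untruncated) normal are all assumptions; only the N(0, std) draw and the 1/sqrt(2 * num_decoder_layers) output-projection scaling come from the PR description.

```python
import math

import jax
import jax.numpy as jnp


def mla_init(key, shape, std, is_output_proj=False, num_decoder_layers=1):
    """Hypothetical sketch: draw weights from N(0, std); scale the output
    projection's std by 1/sqrt(2 * num_decoder_layers), per the PR description."""
    if is_output_proj:
        std = std / math.sqrt(2 * num_decoder_layers)
    return std * jax.random.normal(key, shape, dtype=jnp.float32)
```

With a large enough weight matrix, the sample standard deviation should sit close to the configured `std`, and close to `std / sqrt(2 * num_decoder_layers)` for the output projection (61 layers is used below purely as an example value).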
Description
When training DeepSeek-V3 671B with `load_balance_loss_weight > 0`, training deterministically NaNs on step 1 (per-layer MLA backward gradients overflow bf16). This PR adds two opt-in fixes that default to no-op:
- Gradient mask (`grad_mask_threshold`): zeros tokens whose feature-axis backward RMS exceeds the threshold.
- MLA init (`mla_init_std`): initializes MLA projections from N(0, std), with the output projection scaled by 1/sqrt(2 * num_decoder_layers). Has no effect when loading a checkpoint.

The DeepSeek-V3 model config defaults these to 0.001 and 100.0.
Tests
Unit tests for the gradient masking utility are in `tests/unit/grad_mask_utils_test.py`.

Checklist
Before submitting this PR, please make sure (put X in square brackets):
- [ ] `gemini-review` label.