
Support option to skip the optimizer for training step#3490

Open
RissyRan wants to merge 1 commit into main from skip_optimizer

Conversation

@RissyRan
Collaborator

@RissyRan RissyRan commented Mar 24, 2026

Description

This PR introduces a mechanism to skip training steps during severe loss or gradient anomalies (b/489540436). Reference implementation at OLMo-core.

  • Add configs in base.yml & types.py
  • Implement skip_step_on_spikes as an optax.GradientTransformationExtraArgs wrapper
  • Integrate the optimizer update into the training loop

Tests

  • Add a unit test in tests/unit/optimizers_test.py
  • End-to-end training functional comparison: link

Checklist

Before submitting this PR, please make sure (put X in square brackets):

  • I have performed a self-review of my code. For an optional AI review, add the gemini-review label.
  • I have necessary comments in my code, particularly in hard-to-understand areas.
  • I have run end-to-end tests and provided workload links above if applicable.
  • I have made or will make corresponding changes to the doc if needed, including adding new documentation pages to the relevant Table of Contents (toctree directive) as explained in our documentation.

@RissyRan RissyRan force-pushed the skip_optimizer branch 3 times, most recently from 54b4b54 to 8682ecf Compare March 24, 2026 04:25
@codecov

codecov bot commented Mar 24, 2026

Codecov Report

❌ Patch coverage is 87.03704% with 7 lines in your changes missing coverage. Please review.

Files with missing lines                 | Patch % | Lines
src/maxtext/trainers/pre_train/train.py  | 28.57%  | 4 Missing and 1 partial ⚠️
src/maxtext/optimizers/optimizers.py     | 95.74%  | 1 Missing and 1 partial ⚠️


@github-actions

🤖 Hi @RissyRan, I've received your request, and I'm working on it now! You can track my progress in the logs for more details.


@github-actions github-actions bot left a comment


📋 Review Summary

This PR successfully implements an optimizer wrapper to skip training steps during severe loss or gradient anomalies, effectively porting the OLMo-core logic to MaxText using JAX. The core logic elegantly computes rolling statistics and appropriately bypasses the inner optimizer during a spike to prevent momentum poisoning.

🔍 General Feedback

  • JAX Idioms: The usage of jax.lax.cond to defer and conditionalize the inner optimizer step is very cleanly implemented.
  • Resilience: Added a few critical suggestions to explicitly handle NaN or Inf loss cases. Preventing buffer poisoning and explicitly skipping on non-finite metrics will make this logic foolproof against catastrophic anomalies.
  • Kwargs Forwarding: Recommended using .pop() on **extra_args to ensure consumed arguments like loss aren't passed downstream, guaranteeing better compatibility with inner optimizers.

