
Conversation

@tohtana (Collaborator) commented Jan 17, 2026

Run all tests on the AWS infra. Not scheduled; it needs to be launched manually. (Reopened from this repo, not a fork, for testing.)
aws-torch-latest now runs full tests for debugging. Once it works, we will revert it and add aws-torch-latest-full.

@tohtana force-pushed the tohtana/add_full_test_workflow branch from f909e3e to 9b556a5 on January 17, 2026 at 07:29
1. Fix bf16 checkpoint saving/loading with zero_stage=0:
   - Remove the incorrect `self.bfloat16_enabled()` from the `zero_optimizer_state`
     condition in both `_save_checkpoint` and `_load_checkpoint`
   - bf16+stage0 uses the FP16_Optimizer wrapper and should use the same
     checkpoint path as fp16+stage0, not the ZeRO checkpoint path

2. Fix the muon test to allow an untested optimizer with ZeRO:
   - Add `zero_allow_untested_optimizer: True` to the config (see the sketch
     after this list)
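A minimal sketch of the two changes, assuming the checkpoint condition has the shape shown in the comments (the exact surrounding engine.py code is not reproduced here) and using a hypothetical minimal config fragment for the muon test:

```python
# Sketch only -- illustrative, not the exact DeepSpeed source.

# Fix 1: drop the bf16 check from the checkpoint-path condition so that
# bf16 + zero_stage=0 takes the same (non-ZeRO) path as fp16 + stage 0.
#   before (assumed shape): zero_optimizer_state = self.zero_optimization() or self.bfloat16_enabled()
#   after:                  zero_optimizer_state = self.zero_optimization()

# Fix 2: let ZeRO accept an optimizer it has not been validated against
# (hypothetical minimal config fragment for the muon test).
ds_config = {
    "train_batch_size": 8,
    "zero_optimization": {"stage": 1},
    "zero_allow_untested_optimizer": True,
}
```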

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
The test_aio.py tests hang during async I/O operations in the CI
environment. Skip them for now to allow other tests to run.

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Skip additional ops tests that hang in CI environment:
- unit/ops/lion: CPU lion optimizer tests hang
- unit/ops/adagrad: CPU adagrad tests may hang
- unit/ops/accelerators: transformer forward/backward tests may hang

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
The test was incorrectly using groups._get_sequence_parallel_group()
which looks for a global mpu that was never set. Fixed to use the
mpu object returned by register_with_transformers().

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Add unit/runtime/pipe to the ignore list - these tests timeout
after 600s in the CI environment.

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Add unit/ops/adam to the ignore list - test_cpu_adam.py tests
timeout after 600s in the CI environment.

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Add unit/runtime/zenflow to the ignore list - these distributed
tests timeout after 600s in the CI environment.

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Add unit/checkpoint/test_pipeline.py to the ignore list - these
tests timeout after 600s in the CI environment.

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Now that hanging tests are skipped, try running with more parallelism
to speed up the test suite. Removed -x flag to continue on failures.

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
- Add a DS_DISABLE_REUSE_DIST_ENV environment variable check to
  tests/unit/common.py to allow disabling reuse_dist_env at runtime
  (see the sketch after this list)
- This prevents pool worker cleanup hangs that occur when distributed
  environment is initialized in reused pool workers
- Update aws-torch-latest.yml and aws-torch-latest-full.yml:
  - Mount /mnt/aio for O_DIRECT support (async I/O tests)
  - Set DS_DISABLE_REUSE_DIST_ENV=1
  - Add --basetemp=/mnt/aio to pytest commands
  - Remove --ignore=unit/ops/aio (aio tests now enabled)
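A rough sketch of the check described in the first bullet; the helper name and default handling are assumptions, only the DS_DISABLE_REUSE_DIST_ENV variable and the reuse_dist_env flag come from the message above:

```python
import os

def resolve_reuse_dist_env(default: bool) -> bool:
    """Hypothetical helper: DS_DISABLE_REUSE_DIST_ENV=1 forces reuse_dist_env off
    so distributed state is never left behind in reused pytest pool workers."""
    if os.environ.get("DS_DISABLE_REUSE_DIST_ENV", "0") == "1":
        return False
    return default
```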

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Run full test suite (except nvme tests) with DS_DISABLE_REUSE_DIST_ENV=1
to validate that disabling reuse_dist_env prevents pool worker hangs.

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
pytest --basetemp expects to create the directory, but /mnt/aio is a
mount point that already exists. Use /mnt/aio/pytest subdirectory instead.

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
In PipelineEngine._exec_backward_pass(), for non-last stages (Stage 0),
torch.autograd.backward() was called directly without setting
_running_engine_backward=True. This caused the post-backward hook
(_backward_post_hook) to raise a RuntimeError when needs_scaler=True
because it incorrectly detected that backward() was called without
proper loss scaling.

The exception raised inside the callback caused the process to hang,
which in turn caused NCCL collective operations to deadlock while
waiting for all ranks.

Fix by setting _running_engine_backward=True before calling backward()
for non-last stages, and resetting it in a finally block.

Also update to use the new tensor.backward(gradient) API style instead
of torch.autograd.backward(), which properly integrates with DeepSpeed's
hooks and loss scaling for non-scalar backward.

Fixes pipeline checkpoint tests timing out with ZeRO stage 1.
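A hedged sketch of the fixed non-last-stage branch; apart from `_running_engine_backward` and the `tensor.backward(gradient=...)` call style named above, the names here are illustrative rather than DeepSpeed's exact ones:

```python
import torch

def run_non_last_stage_backward(engine, outputs: torch.Tensor,
                                grad_tensors: torch.Tensor) -> None:
    """Illustrative stand-in for the non-last-stage path of _exec_backward_pass()."""
    engine._running_engine_backward = True   # tell _backward_post_hook the engine
    try:                                     # itself initiated this backward pass
        # tensor.backward(gradient=...) instead of torch.autograd.backward(),
        # so loss-scaling hooks see a properly initiated non-scalar backward.
        outputs.backward(gradient=grad_tensors)
    finally:
        engine._running_engine_backward = False
```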

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Several bugs were causing test_bf16_optimizer_fragments to fail:

1. DDP_BFLOAT16 constant collision with BFLOAT16
   - Both were set to "bf16", causing BF16_Optimizer to never be selected
   - Changed DDP_BFLOAT16 to "ddp_bf16" to differentiate

2. Missing attributes in BF16_Optimizer
   - Added custom_loss_scaler, external_loss_scale, torch_autocast_gradscaler
   - These are required by base_optimizer.py's needs_scaler() and scale_if_loss()

3. scale_if_loss() assumed loss_scaler always exists
   - Added hasattr check before calling loss_scaler.scale_loss()

4. Test config missing grad_accum_dtype
   - Added data_types.grad_accum_dtype=fp32 to ensure BF16_Optimizer is used
   - Without this, FP16_Optimizer is used which doesn't support tensor fragment APIs

5. Added DS_DISABLE_REUSE_DIST_ENV support in tests/unit/common.py
   - Allows disabling reuse_dist_env via environment variable for CI
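A condensed sketch of fixes 1-4 above; the constant values, attribute names, and config keys are taken from the message, while the surrounding class structure is simplified:

```python
# 1. Differentiate the DDP bf16 constant from the plain bf16 one.
BFLOAT16 = "bf16"
DDP_BFLOAT16 = "ddp_bf16"  # was also "bf16", so BF16_Optimizer was never selected

# 2 + 3. BF16_Optimizer gains the attributes expected by base_optimizer.py,
# and scale_if_loss() guards against a missing loss_scaler (simplified class).
class Bf16OptimizerSketch:
    def __init__(self):
        self.custom_loss_scaler = False
        self.external_loss_scale = None
        self.torch_autocast_gradscaler = None

    def scale_if_loss(self, loss):
        if hasattr(self, "loss_scaler"):        # guard added by this commit
            return self.loss_scaler.scale_loss(loss)
        return loss

# 4. Test config that actually selects BF16_Optimizer.
ds_config = {
    "bf16": {"enabled": True},
    "data_types": {"grad_accum_dtype": "fp32"},
}
```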

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Empty parameters (numel=0) cause issues in gradient allreduce when
using flatten/unflatten operations. The unflatten operation fails
with shape mismatches because empty tensors can't be properly
reconstructed from a flattened buffer.

This fix skips empty parameters in _get_gradients_for_reduction()
since they contribute nothing to gradient reduction anyway.

Fixes test_onebit.py::TestOneBitLambEmptyParameters::test
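A minimal sketch of the skip, assuming the method collects per-parameter gradients into a list before flattening (function and variable names are illustrative):

```python
import torch

def gradients_for_reduction(params):
    """Illustrative version of _get_gradients_for_reduction(): parameters with
    numel() == 0 are skipped so the flatten/unflatten round-trip keeps its shapes."""
    grads = []
    for p in params:
        if p.numel() == 0:      # empty parameters contribute nothing to reduction
            continue
        grads.append(p.grad if p.grad is not None else torch.zeros_like(p))
    return grads
```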

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
tohtana and others added 18 commits January 17, 2026 20:41
The test_ds_initialize.py::TestOptimizerImplementation test was missing
the configuration (None, 'bf16', 'fp32') from its is_supported dict.

This configuration (bf16 model with fp32 gradient accumulation, no ZeRO)
is actually supported by DeepSpeed and uses FP16_Optimizer in bf16 mode.
The test incorrectly expected NotImplementedError to be raised.
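For illustration only, assuming the test's is_supported dict is keyed by the (zero_stage, model_dtype, grad_accum_dtype) tuples the message refers to:

```python
is_supported = {}  # hypothetical stand-in for the dict in test_ds_initialize.py
# bf16 weights + fp32 gradient accumulation + no ZeRO is supported (FP16_Optimizer
# running in bf16 mode), so the test should not expect NotImplementedError here.
is_supported[(None, 'bf16', 'fp32')] = True
```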

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Increase pytest parallelism from 4 to 8 workers now that most tests
are stable. This should reduce overall test execution time.

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
The test_user_args.py multi-node tests use the pdsh launcher, which requires an
SSH server running on localhost. Skip these tests in CI since the container
doesn't have SSH configured.
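One way such a skip could look; the condition below (checking for pdsh on PATH) is an assumption, not necessarily the exact check the commit adds:

```python
import shutil
import pytest

# Hypothetical marker: skip pdsh-launcher multi-node tests when the CI host has
# no pdsh/SSH setup (the container in this workflow has no sshd configured).
requires_pdsh = pytest.mark.skipif(
    shutil.which("pdsh") is None,
    reason="pdsh launcher requires a local SSH setup not present in CI",
)
```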

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
…full sequential tests

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Run testRowParallel 5 times with different seeds to identify whether the failure
is seed-dependent. Then run the full sequential test suite 3 times with different
seeds to check for flaky tests.

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
…tics

- Add -m sequential flag to actually run the sequential-marked tests
- Replace torch.allclose with torch.testing.assert_close for better error
  messages showing actual numerical differences when tests fail
- Remove unused debug-rowparallel.yml workflow

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
testColumnParallel[True-2] failed in run 2 - update it to use
torch.testing.assert_close for better error messages showing actual
numerical differences.

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Replace torch.allclose() with torch.testing.assert_close() and add rtol
parameter for proper floating-point comparisons in testRowParallel and
testColumnParallel tests.

The tests were failing intermittently in CI because they only used absolute
tolerance (atol=1e-2) without relative tolerance. Adding rtol=1e-2 allows
for proper numerical comparisons where value magnitudes vary.

Also restore normal workflow execution (remove debug steps).
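A small example of the comparison style the tests now use; the tolerances mirror the ones named above:

```python
import torch

out = torch.randn(4, 8)
expected = out + 1e-4 * torch.randn_like(out)

# assert_close reports the max absolute and relative differences on failure,
# and rtol=1e-2 tolerates proportional error where value magnitudes vary.
torch.testing.assert_close(out, expected, rtol=1e-2, atol=1e-2)
```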

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
`EvoformerAttnBuilder` has some problems that preclude compiling the
extension in several scenarios (e.g., an [isolated conda environment with the
CUDA toolchain](aqlaboratory/openfold-3#34), or
lack of hardware in the system) and that break some standard DeepSpeed
configurations of target capabilities.

*Changes*

- Fix evoformer CUTLASS detection:
  - Allow skipping it, which is useful when CUTLASS is already correctly set up
    (e.g., in a conda environment with CUTLASS and the CUDA toolchain)
  - Fix the misleading use of the deprecated nvidia-cutlass PyPI package by
    actually using the provided bindings, while discouraging this route since
    [these bindings are not maintained
    anymore](NVIDIA/cutlass#2119)

- Fix evoformer compilation when no GPU is present:
  - This is handled correctly and more generally by
    builder.compute_capability_args
  - Allow cross-compilation on systems without a GPU
  - Allow compilation against all available virtual architectures and
    binary outputs
  - See e.g., #5308

- Make all these changes configurable and explicit through documented
  environment variables

Tested in all scenarios.

---------

Signed-off-by: Santi Villalba <sdvillal@gmail.com>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
- Install CUTLASS v3.5.1 and set CUTLASS_PATH
- Run only Evoformer tests to validate CUTLASS integration
- Verify Evoformer op compatibility after DeepSpeed install
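A sketch of the validation step, assuming CUTLASS is checked out under /opt/cutlass (an assumed install location) and querying the builder through the public deepspeed.ops.op_builder namespace:

```python
import os

os.environ["CUTLASS_PATH"] = "/opt/cutlass"  # assumed v3.5.1 checkout location

from deepspeed.ops.op_builder import EvoformerAttnBuilder

# is_compatible() reports whether the Evoformer extension can be built in this
# environment (CUTLASS headers found, CUDA toolchain and GPU requirements met).
print(EvoformerAttnBuilder().is_compatible())
```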

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
- Install CUTLASS v3.5.1 and set CUTLASS_PATH for Evoformer op
- Mark test_DS4Sci_EvoformerAttention as sequential to avoid CUDA fork issues
- Restore full test workflow with all dependencies

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
@tohtana (Collaborator, Author) commented Jan 18, 2026

Confirmed the full tests pass after cherry-picking fixes from unmerged PRs. Closing this PR, since #7795 is open as a cleaner solution.

@tohtana closed this on Jan 18, 2026