
Conversation

@tohtana (Collaborator) commented Jan 17, 2026

Run all tests on the AWS infra. Not scheduled; it needs to be launched manually. (Reopened from this repo, not a fork, for testing.)
aws-torch-latest now runs full tests for debugging. Once it works, we will revert it and add aws-torch-latest-full.

@tohtana force-pushed the tohtana/add_full_test_workflow branch from f909e3e to 9b556a5 on January 17, 2026 at 07:29
1. Fix bf16 checkpoint saving/loading with zero_stage=0:
   - Remove the incorrect `self.bfloat16_enabled()` from the `zero_optimizer_state`
     condition in both `_save_checkpoint` and `_load_checkpoint`
   - bf16+stage0 uses the FP16_Optimizer wrapper and should use the same
     checkpoint path as fp16+stage0, not the ZeRO checkpoint path

2. Fix the muon test to allow an untested optimizer with ZeRO:
   - Add `zero_allow_untested_optimizer: True` to the config (see the sketch
     after this list)
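A minimal sketch of the two changes, assuming the checkpoint condition has the shape shown in the comments (the exact surrounding engine.py code is not reproduced here) and using a hypothetical minimal config fragment for the muon test:

```python
# Sketch only -- illustrative, not the exact DeepSpeed source.

# Fix 1: drop the bf16 check from the checkpoint-path condition so that
# bf16 + zero_stage=0 takes the same (non-ZeRO) path as fp16 + stage 0.
#   before (assumed shape): zero_optimizer_state = self.zero_optimization() or self.bfloat16_enabled()
#   after:                  zero_optimizer_state = self.zero_optimization()

# Fix 2: let ZeRO accept an optimizer it has not been validated against
# (hypothetical minimal config fragment for the muon test).
ds_config = {
    "train_batch_size": 8,
    "zero_optimization": {"stage": 1},
    "zero_allow_untested_optimizer": True,
}
```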

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
The test_aio.py tests hang during async I/O operations in the CI
environment. Skip them for now to allow other tests to run.

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Skip additional ops tests that hang in CI environment:
- unit/ops/lion: CPU lion optimizer tests hang
- unit/ops/adagrad: CPU adagrad tests may hang
- unit/ops/accelerators: transformer forward/backward tests may hang

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
The test was incorrectly using groups._get_sequence_parallel_group()
which looks for a global mpu that was never set. Fixed to use the
mpu object returned by register_with_transformers().

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Add unit/runtime/pipe to the ignore list - these tests timeout
after 600s in the CI environment.

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Add unit/ops/adam to the ignore list - test_cpu_adam.py tests
timeout after 600s in the CI environment.

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Add unit/runtime/zenflow to the ignore list - these distributed
tests timeout after 600s in the CI environment.

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Add unit/checkpoint/test_pipeline.py to the ignore list - these
tests timeout after 600s in the CI environment.

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Now that hanging tests are skipped, try running with more parallelism
to speed up the test suite. Removed -x flag to continue on failures.

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
- Add a DS_DISABLE_REUSE_DIST_ENV environment variable check to
  tests/unit/common.py to allow disabling reuse_dist_env at runtime
  (see the sketch after this list)
- This prevents pool worker cleanup hangs that occur when distributed
  environment is initialized in reused pool workers
- Update aws-torch-latest.yml and aws-torch-latest-full.yml:
  - Mount /mnt/aio for O_DIRECT support (async I/O tests)
  - Set DS_DISABLE_REUSE_DIST_ENV=1
  - Add --basetemp=/mnt/aio to pytest commands
  - Remove --ignore=unit/ops/aio (aio tests now enabled)
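A rough sketch of the check described in the first bullet; the helper name and default handling are assumptions, only the DS_DISABLE_REUSE_DIST_ENV variable and the reuse_dist_env flag come from the message above:

```python
import os

def resolve_reuse_dist_env(default: bool) -> bool:
    """Hypothetical helper: DS_DISABLE_REUSE_DIST_ENV=1 forces reuse_dist_env off
    so distributed state is never left behind in reused pytest pool workers."""
    if os.environ.get("DS_DISABLE_REUSE_DIST_ENV", "0") == "1":
        return False
    return default
```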

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Run full test suite (except nvme tests) with DS_DISABLE_REUSE_DIST_ENV=1
to validate that disabling reuse_dist_env prevents pool worker hangs.

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
pytest --basetemp expects to create the directory, but /mnt/aio is a
mount point that already exists. Use /mnt/aio/pytest subdirectory instead.

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
In PipelineEngine._exec_backward_pass(), for non-last stages (Stage 0),
torch.autograd.backward() was called directly without setting
_running_engine_backward=True. This caused the post-backward hook
(_backward_post_hook) to raise a RuntimeError when needs_scaler=True
because it incorrectly detected that backward() was called without
proper loss scaling.

The exception raised inside the callback caused the process to hang,
which in turn caused NCCL collective operations to deadlock while
waiting for all ranks.

Fix by setting _running_engine_backward=True before calling backward()
for non-last stages, and resetting it in a finally block.

Also update to use the new tensor.backward(gradient) API style instead
of torch.autograd.backward(), which properly integrates with DeepSpeed's
hooks and loss scaling for non-scalar backward.

Fixes pipeline checkpoint tests timing out with ZeRO stage 1.
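A hedged sketch of the fixed non-last-stage branch; apart from `_running_engine_backward` and the `tensor.backward(gradient=...)` call style named above, the names here are illustrative rather than DeepSpeed's exact ones:

```python
import torch

def run_non_last_stage_backward(engine, outputs: torch.Tensor,
                                grad_tensors: torch.Tensor) -> None:
    """Illustrative stand-in for the non-last-stage path of _exec_backward_pass()."""
    engine._running_engine_backward = True   # tell _backward_post_hook the engine
    try:                                     # itself initiated this backward pass
        # tensor.backward(gradient=...) instead of torch.autograd.backward(),
        # so loss-scaling hooks see a properly initiated non-scalar backward.
        outputs.backward(gradient=grad_tensors)
    finally:
        engine._running_engine_backward = False
```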

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Several bugs were causing test_bf16_optimizer_fragments to fail:

1. DDP_BFLOAT16 constant collision with BFLOAT16
   - Both were set to "bf16", causing BF16_Optimizer to never be selected
   - Changed DDP_BFLOAT16 to "ddp_bf16" to differentiate

2. Missing attributes in BF16_Optimizer
   - Added custom_loss_scaler, external_loss_scale, torch_autocast_gradscaler
   - These are required by base_optimizer.py's needs_scaler() and scale_if_loss()

3. scale_if_loss() assumed loss_scaler always exists
   - Added hasattr check before calling loss_scaler.scale_loss()

4. Test config missing grad_accum_dtype
   - Added data_types.grad_accum_dtype=fp32 to ensure BF16_Optimizer is used
   - Without this, FP16_Optimizer is used which doesn't support tensor fragment APIs

5. Added DS_DISABLE_REUSE_DIST_ENV support in tests/unit/common.py
   - Allows disabling reuse_dist_env via environment variable for CI
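A condensed sketch of fixes 1-4 above; the constant values, attribute names, and config keys are taken from the message, while the surrounding class structure is simplified:

```python
# 1. Differentiate the DDP bf16 constant from the plain bf16 one.
BFLOAT16 = "bf16"
DDP_BFLOAT16 = "ddp_bf16"  # was also "bf16", so BF16_Optimizer was never selected

# 2 + 3. BF16_Optimizer gains the attributes expected by base_optimizer.py,
# and scale_if_loss() guards against a missing loss_scaler (simplified class).
class Bf16OptimizerSketch:
    def __init__(self):
        self.custom_loss_scaler = False
        self.external_loss_scale = None
        self.torch_autocast_gradscaler = None

    def scale_if_loss(self, loss):
        if hasattr(self, "loss_scaler"):        # guard added by this commit
            return self.loss_scaler.scale_loss(loss)
        return loss

# 4. Test config that actually selects BF16_Optimizer.
ds_config = {
    "bf16": {"enabled": True},
    "data_types": {"grad_accum_dtype": "fp32"},
}
```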

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Empty parameters (numel=0) cause issues in gradient allreduce when
using flatten/unflatten operations. The unflatten operation fails
with shape mismatches because empty tensors can't be properly
reconstructed from a flattened buffer.

This fix skips empty parameters in _get_gradients_for_reduction()
since they contribute nothing to gradient reduction anyway.

Fixes test_onebit.py::TestOneBitLambEmptyParameters::test
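A minimal sketch of the skip, assuming the method collects per-parameter gradients into a list before flattening (function and variable names are illustrative):

```python
import torch

def gradients_for_reduction(params):
    """Illustrative version of _get_gradients_for_reduction(): parameters with
    numel() == 0 are skipped so the flatten/unflatten round-trip keeps its shapes."""
    grads = []
    for p in params:
        if p.numel() == 0:      # empty parameters contribute nothing to reduction
            continue
        grads.append(p.grad if p.grad is not None else torch.zeros_like(p))
    return grads
```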

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
tohtana and others added 18 commits January 17, 2026 20:41
The test_ds_initialize.py::TestOptimizerImplementation test was missing
the configuration (None, 'bf16', 'fp32') from its is_supported dict.

This configuration (bf16 model with fp32 gradient accumulation, no ZeRO)
is actually supported by DeepSpeed and uses FP16_Optimizer in bf16 mode.
The test incorrectly expected NotImplementedError to be raised.
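For illustration only, assuming the test's is_supported dict is keyed by the (zero_stage, model_dtype, grad_accum_dtype) tuples the message refers to:

```python
is_supported = {}  # hypothetical stand-in for the dict in test_ds_initialize.py
# bf16 weights + fp32 gradient accumulation + no ZeRO is supported (FP16_Optimizer
# running in bf16 mode), so the test should not expect NotImplementedError here.
is_supported[(None, 'bf16', 'fp32')] = True
```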

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Increase pytest parallelism from 4 to 8 workers now that most tests
are stable. This should reduce overall test execution time.

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
The test_user_args.py multi-node tests use the pdsh launcher, which requires an
SSH server running on localhost. Skip these tests in CI since the container
doesn't have SSH configured.
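One way such a skip could look; the condition below (checking for pdsh on PATH) is an assumption, not necessarily the exact check the commit adds:

```python
import shutil
import pytest

# Hypothetical marker: skip pdsh-launcher multi-node tests when the CI host has
# no pdsh/SSH setup (the container in this workflow has no sshd configured).
requires_pdsh = pytest.mark.skipif(
    shutil.which("pdsh") is None,
    reason="pdsh launcher requires a local SSH setup not present in CI",
)
```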

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
…full sequential tests

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Run testRowParallel 5 times with different seeds to identify whether the failure
is seed-dependent. Then run the full sequential test suite 3 times with different
seeds to check for flaky tests.

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
…tics

- Add -m sequential flag to actually run the sequential-marked tests
- Replace torch.allclose with torch.testing.assert_close for better error
  messages showing actual numerical differences when tests fail
- Remove unused debug-rowparallel.yml workflow

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
testColumnParallel[True-2] failed in run 2 - update it to use
torch.testing.assert_close for better error messages showing actual
numerical differences.

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Replace torch.allclose() with torch.testing.assert_close() and add rtol
parameter for proper floating-point comparisons in testRowParallel and
testColumnParallel tests.

The tests were failing intermittently in CI because they only used absolute
tolerance (atol=1e-2) without relative tolerance. Adding rtol=1e-2 allows
for proper numerical comparisons where value magnitudes vary.

Also restore normal workflow execution (remove debug steps).
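A small example of the comparison style the tests now use; the tolerances mirror the ones named above:

```python
import torch

out = torch.randn(4, 8)
expected = out + 1e-4 * torch.randn_like(out)

# assert_close reports the max absolute and relative differences on failure,
# and rtol=1e-2 tolerates proportional error where value magnitudes vary.
torch.testing.assert_close(out, expected, rtol=1e-2, atol=1e-2)
```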

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
`EvoformerAttnBuilder` has some problems that preclude compiling the
extension in several scenarios (e.g., an [isolated conda environment with the
CUDA toolchain](aqlaboratory/openfold-3#34), or
lack of hardware in the system) and that break some standard DeepSpeed
configurations of target capabilities.

*Changes*

- Fix evoformer CUTLASS detection:
  - Allow skipping it, which is useful when CUTLASS is already correctly set up
    (e.g., in a conda environment with CUTLASS and the CUDA toolchain)
  - Fix the misleading use of the deprecated nvidia-cutlass PyPI package by
    actually using the provided bindings, while discouraging this route since
    [these bindings are not maintained
    anymore](NVIDIA/cutlass#2119)

- Fix evoformer compilation when no GPU is present:
  - This is handled correctly and more generally by
    builder.compute_capability_args
  - Allow cross-compilation on systems without a GPU
  - Allow compilation against all available virtual architectures and
    binary outputs
  - See e.g., #5308

- Make all these changes configurable and explicit through documented
  environment variables

Tested in all scenarios.

---------

Signed-off-by: Santi Villalba <sdvillal@gmail.com>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
- Install CUTLASS v3.5.1 and set CUTLASS_PATH
- Run only Evoformer tests to validate CUTLASS integration
- Verify Evoformer op compatibility after DeepSpeed install
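A sketch of the validation step, assuming CUTLASS is checked out under /opt/cutlass (an assumed install location) and querying the builder through the public deepspeed.ops.op_builder namespace:

```python
import os

os.environ["CUTLASS_PATH"] = "/opt/cutlass"  # assumed v3.5.1 checkout location

from deepspeed.ops.op_builder import EvoformerAttnBuilder

# is_compatible() reports whether the Evoformer extension can be built in this
# environment (CUTLASS headers found, CUDA toolchain and GPU requirements met).
print(EvoformerAttnBuilder().is_compatible())
```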

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
- Install CUTLASS v3.5.1 and set CUTLASS_PATH for Evoformer op
- Mark test_DS4Sci_EvoformerAttention as sequential to avoid CUDA fork issues
- Restore full test workflow with all dependencies

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
@tohtana (Collaborator, Author) commented Jan 18, 2026

Confirmed the full tests pass after cherry-picking fixes from unmerged PRs. Closing this PR, since #7795 is open as a cleaner solution.

@tohtana closed this on Jan 18, 2026