Add workflow to run full tests #7783
Closed
Conversation
f909e3e to 9b556a5
1. Fix bf16 checkpoint saving/loading with zero_stage=0:
   - Remove the incorrect `self.bfloat16_enabled()` from the `zero_optimizer_state` condition in both `_save_checkpoint` and `_load_checkpoint`
   - bf16 + stage 0 uses the FP16_Optimizer wrapper and should take the same checkpoint path as fp16 + stage 0, not the ZeRO checkpoint path
2. Fix the muon test to allow an untested optimizer with ZeRO:
   - Add `zero_allow_untested_optimizer: True` to the config
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
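A minimal sketch of the muon test config change described above. Only `zero_allow_untested_optimizer` comes from the commit text; the batch size, optimizer name, and other values are illustrative placeholders:

```python
# Illustrative DeepSpeed config for the muon test; values other than
# zero_allow_untested_optimizer are placeholders, not the actual test config.
ds_config = {
    "train_batch_size": 8,
    "optimizer": {"type": "Muon", "params": {"lr": 1e-3}},  # optimizer name/params assumed
    "zero_optimization": {"stage": 1},
    "bf16": {"enabled": True},
    # Let ZeRO wrap an optimizer that is not on DeepSpeed's tested list:
    "zero_allow_untested_optimizer": True,
}
```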
The test_aio.py tests hang during async I/O operations in the CI environment. Skip them for now to allow other tests to run. Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Skip additional ops tests that hang in the CI environment:
- unit/ops/lion: CPU lion optimizer tests hang
- unit/ops/adagrad: CPU adagrad tests may hang
- unit/ops/accelerators: transformer forward/backward tests may hang

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
The test was incorrectly using groups._get_sequence_parallel_group() which looks for a global mpu that was never set. Fixed to use the mpu object returned by register_with_transformers(). Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Add unit/runtime/pipe to the ignore list - these tests time out after 600s in the CI environment. Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Add unit/ops/adam to the ignore list - the test_cpu_adam.py tests time out after 600s in the CI environment. Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Add unit/runtime/zenflow to the ignore list - these distributed tests time out after 600s in the CI environment. Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Add unit/checkpoint/test_pipeline.py to the ignore list - these tests time out after 600s in the CI environment. Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Now that the hanging tests are skipped, try running with more parallelism to speed up the test suite. Remove the -x flag to continue on failures. Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
- Add a DS_DISABLE_REUSE_DIST_ENV environment variable check to tests/unit/common.py to allow disabling reuse_dist_env at runtime
- This prevents the pool worker cleanup hangs that occur when the distributed environment is initialized in reused pool workers
- Update aws-torch-latest.yml and aws-torch-latest-full.yml:
  - Mount /mnt/aio for O_DIRECT support (async I/O tests)
  - Set DS_DISABLE_REUSE_DIST_ENV=1
  - Add --basetemp=/mnt/aio to the pytest commands
  - Remove --ignore=unit/ops/aio (the aio tests are now enabled)

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
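A minimal sketch of the environment-variable check, assuming it sits next to the reuse_dist_env handling in tests/unit/common.py; the default value shown is an assumption:

```python
import os

# Sketch: honor DS_DISABLE_REUSE_DIST_ENV when deciding whether to reuse the
# distributed environment across tests (defaulting to reuse is an assumption here).
reuse_dist_env = True
if os.environ.get("DS_DISABLE_REUSE_DIST_ENV", "0") == "1":
    # Reused pool workers that already initialized dist can hang during cleanup,
    # so CI sets this variable to force a fresh environment per test.
    reuse_dist_env = False
```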
Run full test suite (except nvme tests) with DS_DISABLE_REUSE_DIST_ENV=1 to validate that disabling reuse_dist_env prevents pool worker hangs. Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
pytest --basetemp expects to create the directory, but /mnt/aio is a mount point that already exists. Use /mnt/aio/pytest subdirectory instead. Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
In PipelineEngine._exec_backward_pass(), for non-last stages (Stage 0), torch.autograd.backward() was called directly without setting _running_engine_backward=True. This caused the post-backward hook (_backward_post_hook) to raise a RuntimeError when needs_scaler=True because it incorrectly detected that backward() was called without proper loss scaling. The exception raised inside the callback caused the process to hang, which in turn caused NCCL collective operations to deadlock while waiting for all ranks.

Fix by setting _running_engine_backward=True before calling backward() for non-last stages, and resetting it in a finally block. Also update to use the new tensor.backward(gradient) API style instead of torch.autograd.backward(), which properly integrates with DeepSpeed's hooks and loss scaling for non-scalar backward.

Fixes pipeline checkpoint tests timing out with ZeRO stage 1.

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
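A hedged sketch of the control flow this describes. The flag and hook names come from the commit text; the function wrapper and the `engine`, `outputs`, and `grad_tensors` names are placeholders, not the actual DeepSpeed code:

```python
import torch

def _exec_backward_sketch(engine, outputs: torch.Tensor, grad_tensors: torch.Tensor) -> None:
    # Mark the backward as engine-driven so _backward_post_hook does not misreport
    # a missing loss scale, and always clear the flag afterward.
    engine._running_engine_backward = True
    try:
        # tensor.backward(gradient) rather than torch.autograd.backward(...), so the
        # non-scalar backward goes through DeepSpeed's hooks and loss scaling.
        outputs.backward(grad_tensors)
    finally:
        engine._running_engine_backward = False
```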
Several bugs were causing test_bf16_optimizer_fragments to fail:

1. DDP_BFLOAT16 constant collision with BFLOAT16
   - Both were set to "bf16", causing BF16_Optimizer to never be selected
   - Changed DDP_BFLOAT16 to "ddp_bf16" to differentiate
2. Missing attributes in BF16_Optimizer
   - Added custom_loss_scaler, external_loss_scale, torch_autocast_gradscaler
   - These are required by base_optimizer.py's needs_scaler() and scale_if_loss()
3. scale_if_loss() assumed loss_scaler always exists
   - Added a hasattr check before calling loss_scaler.scale_loss()
4. Test config missing grad_accum_dtype
   - Added data_types.grad_accum_dtype=fp32 to ensure BF16_Optimizer is used
   - Without this, FP16_Optimizer is used, which doesn't support the tensor fragment APIs
5. Added DS_DISABLE_REUSE_DIST_ENV support in tests/unit/common.py
   - Allows disabling reuse_dist_env via environment variable for CI

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
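An illustrative sketch of the constant change (item 1) and the guard (item 3); the class wrapper is a placeholder and the real logic lives in the optimizer modules named above:

```python
# Constants (sketch): the two values must differ so optimizer selection can tell them apart.
BFLOAT16 = "bf16"
DDP_BFLOAT16 = "ddp_bf16"  # previously also "bf16", so BF16_Optimizer was never selected


class _OptimizerSketch:
    """Illustration of the item-3 guard; the actual check is in base_optimizer.py."""

    loss_scaler = None  # BF16_Optimizer may not carry a loss scaler

    def scale_if_loss(self, loss):
        # Only scale when a loss scaler actually exists.
        if hasattr(self, "loss_scaler") and self.loss_scaler is not None:
            return self.loss_scaler.scale_loss(loss)
        return loss
```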
Empty parameters (numel=0) cause issues in gradient allreduce when using flatten/unflatten operations. The unflatten operation fails with shape mismatches because empty tensors can't be properly reconstructed from a flattened buffer. This fix skips empty parameters in _get_gradients_for_reduction() since they contribute nothing to gradient reduction anyway.

Fixes test_onebit.py::TestOneBitLambEmptyParameters::test

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
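A minimal sketch of the skip described above, assuming a simple loop over parameters; the function wrapper is a placeholder for the real _get_gradients_for_reduction():

```python
def _gradients_for_reduction_sketch(params):
    # Collect gradients for allreduce, skipping empty parameters.
    grads = []
    for param in params:
        # Empty parameters (numel == 0) cannot be reconstructed by unflatten from a
        # flattened buffer and contribute nothing to the allreduce, so skip them.
        if param.numel() == 0:
            continue
        grads.append(param.grad)
    return grads
```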
The test_ds_initialize.py::TestOptimizerImplementation test was missing the configuration (None, 'bf16', 'fp32') from its is_supported dict. This configuration (bf16 model with fp32 gradient accumulation, no ZeRO) is actually supported by DeepSpeed and uses FP16_Optimizer in bf16 mode. The test incorrectly expected NotImplementedError to be raised. Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
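A hypothetical one-line sketch of the test change, following the (zero_stage, model_dtype, grad_accum_dtype) key pattern the commit describes; the dict setup is illustrative:

```python
# Sketch of the added entry, keyed as (zero_stage, model_dtype, grad_accum_dtype).
is_supported = {}
# bf16 model + fp32 gradient accumulation with no ZeRO is supported (FP16_Optimizer
# in bf16 mode), so the test no longer expects NotImplementedError for it.
is_supported[(None, 'bf16', 'fp32')] = True
```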
Increase pytest parallelism from 4 to 8 workers now that most tests are stable. This should reduce overall test execution time. Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
The test_user_args.py multi-node tests use the pdsh launcher, which requires an SSH server running on localhost. Skip these tests in CI since the container doesn't have SSH configured. Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
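One way to express the skip in pytest, as a sketch only; the commit just states that the tests are skipped in CI, so the SSH availability check shown here is an assumption:

```python
import shutil
import pytest

# Skip the pdsh multi-node launcher tests when no local SSH server is available,
# as in the CI container described above (detection method is illustrative).
pytestmark = pytest.mark.skipif(
    shutil.which("sshd") is None,
    reason="pdsh launcher requires an SSH server on localhost",
)
```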
…full sequential tests Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Run testRowParallel 5 times with different seeds to identify if failure is seed-dependent. Then run full sequential test suite 3 times with different seeds to check for flaky tests. Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
…tics

- Add the `-m sequential` flag to actually run the sequential-marked tests
- Replace torch.allclose with torch.testing.assert_close for better error messages showing the actual numerical differences when tests fail
- Remove the unused debug-rowparallel.yml workflow

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
testColumnParallel[True-2] failed in run 2 - update it to use torch.testing.assert_close for better error messages showing actual numerical differences. Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Replace torch.allclose() with torch.testing.assert_close() and add rtol parameter for proper floating-point comparisons in testRowParallel and testColumnParallel tests. The tests were failing intermittently in CI because they only used absolute tolerance (atol=1e-2) without relative tolerance. Adding rtol=1e-2 allows for proper numerical comparisons where value magnitudes vary. Also restore normal workflow execution (remove debug steps). Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
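A sketch of the comparison change; the tensors here are placeholders for the parallel and reference outputs compared in testRowParallel/testColumnParallel:

```python
import torch

actual = torch.randn(4, 8)   # placeholder for the parallel-layer output
expected = actual.clone()    # placeholder for the reference output

# Before: absolute tolerance only, and no diagnostic output on failure.
# assert torch.allclose(actual, expected, atol=1e-2)

# After: assert_close reports the actual numerical differences and accepts both tolerances.
torch.testing.assert_close(actual, expected, rtol=1e-2, atol=1e-2)
```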
`EvoformerAttnBuilder` has some problems which preclude compiling the extension in several scenarios (e.g., [isolated conda environment with cuda toolchain](aqlaboratory/openfold-3#34), lack of hardware in the system) and break some standard DeepSpeed configurations of target capabilities.

*Changes*
- Fix evoformer CUTLASS detection:
  - Allow skipping it, useful when CUTLASS is already correctly set up (e.g., in a conda environment with CUTLASS and the CUDA toolchain)
  - Fix the misleading use of the deprecated nvidia-cutlass pypi package by actually using the provided bindings, while discouraging this route since [these bindings are not maintained anymore](NVIDIA/cutlass#2119)
- Fix evoformer compilation when no GPU is present:
  - this is handled correctly and more generally by builder.compute_capability_args
  - allow cross-compilation on systems without a GPU
  - allow compilation against all available virtual architectures and binary outputs
  - see e.g., #5308
- Make all these changes configurable and explicit through documented environment variables

Tested in all scenarios.

---------

Signed-off-by: Santi Villalba <sdvillal@gmail.com>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
- Install CUTLASS v3.5.1 and set CUTLASS_PATH
- Run only the Evoformer tests to validate the CUTLASS integration
- Verify Evoformer op compatibility after the DeepSpeed install

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
- Install CUTLASS v3.5.1 and set CUTLASS_PATH for the Evoformer op
- Mark test_DS4Sci_EvoformerAttention as sequential to avoid CUDA fork issues
- Restore the full test workflow with all dependencies

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
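A sketch of the sequential marking, assuming the same pytest marker selected earlier with `-m sequential`; the test body is elided and the exact test layout is an assumption:

```python
import pytest

# Run this test only in the sequential pass (selected with `-m sequential`),
# so it never shares a forked worker with other CUDA tests.
@pytest.mark.sequential
def test_DS4Sci_EvoformerAttention():
    ...
```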
Confirmed the full tests pass after cherry-picking fixes from unmerged PRs. Closing, since #7795 is open as a cleaner solution.
Runs all tests on the AWS infra. Not scheduled; it must be launched manually. (Reopened from this repo, not a fork, for testing.)
`aws-torch-latest` now runs the full tests for debugging. Once it works, we will revert it and add `aws-torch-latest-full`.