Pull requests: awslabs/awsome-distributed-training

Pull requests list

torchtitan: replace conda env with venv and pin all versions
#1074 opened Apr 29, 2026 by KeitaW Collaborator Draft
4 tasks done
FSDP: pin nccl-tests base, bump torch to cu130, log TFLOPS/MFU
#1073 opened Apr 29, 2026 by KeitaW Collaborator Draft
6 of 8 tasks
nemo: bump to nemo:26.02 and sync Slurm + Kubernetes Dockerfiles
#1072 opened Apr 29, 2026 by KeitaW Collaborator Draft
4 of 6 tasks
megatron-lm: bump to NGC pytorch:26.02 and add Llama 3 8B sbatch
#1071 opened Apr 29, 2026 by KeitaW Collaborator Draft
5 tasks done
feat: add DETR-ResNet50 object detection fine-tuning test case
#1068 opened Apr 28, 2026 by aravneelaws Contributor
7 tasks done
Adding a megatron-bridge sample
#1065 opened Apr 27, 2026 by allela-roy Contributor
Bump transformers from 4.48.0 to 5.0.0rc3 in /3.test_cases/pytorch/nvrx (labels: dependencies, python)
#1057 opened Apr 8, 2026 by dependabot Bot
Add veRL GRPO training recipe for gpt-oss-20b on g5.12xlarge
#1054 opened Apr 4, 2026 by nkumaraws Contributor
Bump requests from 2.32.3 to 2.33.0 in /3.test_cases/pytorch/nvrx (labels: dependencies, python)
#1036 opened Mar 25, 2026 by dependabot Bot
Add V-JEPA 2 (Meta FAIR) distributed training test case
#1035 opened Mar 23, 2026 by paragao Contributor
Add DeepSpeed CI regression tests for QLoRA and GPT-103B
#1029 opened Mar 20, 2026 by paragao Contributor
fix: overhaul CI workflows for FSDP regression tests
#1024 opened Mar 17, 2026 by paragao Contributor
Updating hyperpod-elastic-agent (HPEA) to v1.1.2 to support torch v2.6+
#1022 opened Mar 13, 2026 by aravneelaws Contributor
7 tasks done
docs: comprehensive instance hardware profiles (16 families)
#1021 opened Mar 13, 2026 by KeitaW Collaborator Draft
4 tasks
Add OSMO AMR Navigation test case
#1018 opened Mar 12, 2026 by KeitaW Collaborator
1 of 3 tasks
Add NeMo RL GRPO training with fault tolerance (NVRx) on EKS
#1010 opened Mar 9, 2026 by dmvevents Contributor
6 tasks
Add LeRobot pi0-FAST DROID multi-node training test case
#1003 opened Feb 26, 2026 by KeitaW Collaborator Draft
7 tasks
Updating CF stack for GB200 local zone deployments
#968 opened Feb 17, 2026 by KeitaW Collaborator