Merge latest changes from main to 'Documentation' branch #192
Open
rsareddy0329 wants to merge 190 commits into documentation
Co-authored-by: adishaa <adishaa@amazon.com>
… with minor improvements and bug fixes (#137)
… with minor improvements and bug fixes. (#139)
…ception count data (#140)
* manual release v3.0.1
… regionalized HMA URI (#141)
* Add unique time string to integ test * Update syntax
* Training CLI & SDK: example notebook and README update * Update training cli example notebook --------- Co-authored-by: Roja Reddy Sareddy <rsareddy@amazon.com>
* Update inference SDK examples * Update readme
* Enable Hyperpod telemetry * Enable Hyperpod telemetry * Enable Hyperpod telemetry * Enable Hyperpod telemetry * Enable Hyperpod telemetry * Enable Hyperpod telemetry * CLI: Enable Telemetry * CLI: Enable Telemetry --------- Co-authored-by: Roja Reddy Sareddy <rsareddy@amazon.com>
* update help text to avoid truncation * update volume flag to support hostPath and pvc, before e2e testing * clean up and e2e working * Minor updates after PR * update * Added unit tests for volume, all cli unit tests passed
Co-authored-by: pintaoz <pintaoz@amazon.com>
* Update inference config and integ tests * Update integ tests for new canaries
* Manual release v3.0.2 * Update changelog --------- Co-authored-by: pintaoz <pintaoz@amazon.com>
* Update readme for volume flag * Add schema pattern check to pytorch-job template, unit test added, all test passed locally
…8s (#138)

* Add k8s version validation check between server and client version according to the supported versioning constraints by k8s
* Fix unit test cases
* Move regex to a constant.

**Description**
- Removed an integration test case that was being mocked.
- Moved a regex to a constant.

**Testing Done**
Unit test cases pass. No changes were made to integration test cases, and they should not be affected.

* Add k8s version validation check between server and client version according to the supported versioning constraints by k8s
* Add ref link for version compatibility constraints

**Description**
Added a link to k8s documentation mentioning the constraints that govern the version compatibility of client and server.

**Testing Done**
No breaking changes.

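
The version check described in this commit could look roughly like the sketch below. It is not the code from #138: the names `VERSION_PATTERN`, `MAX_MINOR_SKEW`, `parse_major_minor`, and `is_compatible` are hypothetical, and the allowed skew of one minor version is an assumption taken from the Kubernetes version skew policy for kubectl, not from the commit itself.

```python
import re
from typing import Tuple

# Hypothetical module-level constant (the commit mentions moving a regex to a
# constant). Matches the leading "major.minor" of strings such as "v1.29.3",
# "1.29+", or "v1.30.0-eks-abc123".
VERSION_PATTERN = re.compile(r"^v?(\d+)\.(\d+)")

# Assumption: the client may differ from the server by at most one minor version.
MAX_MINOR_SKEW = 1


def parse_major_minor(version: str) -> Tuple[int, int]:
    """Extract (major, minor) from a Kubernetes version string."""
    match = VERSION_PATTERN.match(version.strip())
    if match is None:
        raise ValueError(f"Unrecognized Kubernetes version: {version!r}")
    return int(match.group(1)), int(match.group(2))


def is_compatible(client_version: str, server_version: str,
                  max_minor_skew: int = MAX_MINOR_SKEW) -> bool:
    """Return True when client and server are within the allowed minor skew."""
    client_major, client_minor = parse_major_minor(client_version)
    server_major, server_minor = parse_major_minor(server_version)
    if client_major != server_major:
        return False
    return abs(client_minor - server_minor) <= max_minor_skew
```

A caller would typically warn rather than fail when `is_compatible` returns `False`, since a skewed client often still works for basic operations.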
* Fix SDK training test: Add wait time before refresh * Fix training tests in canaries
…189) Co-authored-by: pintaoz <pintaoz@amazon.com>

* Update documentation-with-new-changes branch with latest changes from main (#190)
* Fix training test (#184)
* Fix SDK training test: Add wait time before refresh
* Fix training tests in canaries
* Update logging information for submitting and deleting training job (#189)
Co-authored-by: pintaoz <pintaoz@amazon.com>
---------
Co-authored-by: Zhaoqi <zhaoqiwang.baruch@gmail.com>
Co-authored-by: pintaoz-aws <167920275+pintaoz-aws@users.noreply.github.com>
Co-authored-by: pintaoz <pintaoz@amazon.com>
* Documentation Fixes (#191)
Co-authored-by: Roja Reddy Sareddy <rsareddy@amazon.com>
* update documentation with new changes branch with latest changes (#194)
* Fix training test (#184)
* Fix SDK training test: Add wait time before refresh
* Fix training tests in canaries
* Update logging information for submitting and deleting training job (#189)
Co-authored-by: pintaoz <pintaoz@amazon.com>
---------
Co-authored-by: Zhaoqi <zhaoqiwang.baruch@gmail.com>
Co-authored-by: pintaoz-aws <167920275+pintaoz-aws@users.noreply.github.com>
Co-authored-by: pintaoz <pintaoz@amazon.com>
* Documentation Fixes (#195)
* Documentation Fixes
* Documentation Fixes
---------
Co-authored-by: Roja Reddy Sareddy <rsareddy@amazon.com>
* Documentation Fixes (#197)
* Documentation Fixes
* Documentation Fixes
* Documentation Fixes
* Documentation Fixes
---------
Co-authored-by: Roja Reddy Sareddy <rsareddy@amazon.com>
* Documentation Fixes (#198)
* Documentation Fixes
* Documentation Fixes
* Documentation Fixes
* Documentation Fixes
* Documentation Fixes
---------
Co-authored-by: Roja Reddy Sareddy <rsareddy@amazon.com>
* Documentation fixes (#199)
* Documentation Fixes
* Documentation Fixes
* Documentation Fixes
* Documentation Fixes
* Documentation Fixes
* Documentation Fixes
---------
Co-authored-by: Roja Reddy Sareddy <rsareddy@amazon.com>
---------
Co-authored-by: Zhaoqi <zhaoqiwang.baruch@gmail.com>
Co-authored-by: pintaoz-aws <167920275+pintaoz-aws@users.noreply.github.com>
Co-authored-by: pintaoz <pintaoz@amazon.com>
Co-authored-by: Roja Reddy Sareddy <rsareddy@amazon.com>

…s to view SDK config code (#188) Co-authored-by: Mohamed Zeidan <zeidmo@amazon.com>
* Add instance type support for ml.p6e-gb200.36xlarge Updated support for ml.p6-b200.48xlarge as well * Add ml.p6e-gb200.36xlarge to efa plugin
…holder value (#206) Co-authored-by: Mohamed Zeidan <zeidmo@amazon.com>
* inferenceendpointconfigs remove tab to fix syntax * comment * Add addOn nodeAffinity CRD and helm version update * Add Migration README * Separate migration into follow up PR --------- Co-authored-by: Xuan Lu <xua@amazon.com>
* inferenceendpointconfigs remove tab to fix syntax * comment * Add addOn nodeAffinity CRD and helm version update * Add Migration README * Separate migration into follow up PR * Add Migration README --------- Co-authored-by: Xuan Lu <xua@amazon.com>
…rPodHelmChart (#375) * inferenceendpointconfigs remove tab to fix syntax * comment * Add addOn nodeAffinity CRD and helm version update * Add Migration README * Separate migration into follow up PR * Add Migration README * Update hyperpod-inference-operator to 2.0.0 in HyperPodHelmChart --------- Co-authored-by: Xuan Lu <xua@amazon.com>
Co-authored-by: Chad Chiang <chadchc@amazon.com>

* feat: Implement elastic training cli arguments (#273)
* feat: Implement elastic training cli arguments
* Add elastic training unified config and unit test
* Add graceful shutdown and scaling timeout to cli args
* Revert "feat: Implement elastic training cli arguments (#273)"
  This reverts commit 18428ef.
* Remove space CLI error traceback
---------
Co-authored-by: Sophia <yungwenh@amazon.com>
Co-authored-by: Molly He <mollyhe@amazon.com>
---
Implement port forwarding for space (#312)
---
Implement MIG profile validation for spaces (#315)

* Update CHANGELOG for version 3.7.0 Added new features, enhancements, and bug fixes for v3.7.0. * chore: bump version for release --------- Co-authored-by: Syed Jafri <syedjfr@amazon.com>
Co-authored-by: Farhan Tejani <8650465+FarhanTejani@users.noreply.github.com>

…with feature improvements and bug fixes. (#388)

Features
* Enhanced EFA monitoring with error counter tracking for improved network health visibility

Bug Fixes
* Marked Xid 163 as a warning-only error instead of requiring immediate reboot
* Added handling for Nvidia GPU Xid 94 errors (ROBUST_CHANNEL_CONTAINED_ERROR) as a new fault category with no action triggering on Kubernetes platforms

Co-authored-by: Amanuel Taddesse <amantad@amazon.com>

Co-authored-by: Ubuntu <ubuntu@ip-172-31-5-101.us-west-2.compute.internal>

Add g7e instance types to values.yaml:
- nvidia-device-plugin nodeAffinity: all 6 g7e sizes
- aws-efa-k8s-device-plugin supportedInstanceLabels: 4 EFA-capable g7e sizes (8xlarge, 12xlarge, 24xlarge, 48xlarge)

Add MIG configuration profiles for RTX PRO 6000 Blackwell (g7e):
- all-1g.24gb: 4x 1g.24gb instances
- all-2g.48gb: 2x 2g.48gb instances
- all-4g.96gb: 1x 4g.96gb instance (full GPU)
- mixed-2-1g.24gb-1-2g.48gb: 2x 1g.24gb + 1x 2g.48gb

Ref: https://docs.nvidia.com/datacenter/tesla/mig-user-guide/supported-mig-profiles.html

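
As a sanity check on the four profiles listed in this commit, the sketch below adds up the compute slices and memory each layout consumes. In MIG profile names, `Ng.Mgb` denotes N compute slices and M GB of memory. The dictionary layout and the names `MIG_PROFILES`, `SLICE_PATTERN`, and `profile_totals` are illustrative assumptions, not the format of the actual configuration file in the PR.

```python
import re
from typing import Dict, Tuple

# Profile name -> {MIG slice type: number of instances}, as listed in the commit message.
MIG_PROFILES: Dict[str, Dict[str, int]] = {
    "all-1g.24gb": {"1g.24gb": 4},
    "all-2g.48gb": {"2g.48gb": 2},
    "all-4g.96gb": {"4g.96gb": 1},
    "mixed-2-1g.24gb-1-2g.48gb": {"1g.24gb": 2, "2g.48gb": 1},
}

# "<N>g.<M>gb": N compute slices, M GB of GPU memory.
SLICE_PATTERN = re.compile(r"^(\d+)g\.(\d+)gb$")


def profile_totals(layout: Dict[str, int]) -> Tuple[int, int]:
    """Return (total compute slices, total memory in GB) used by a layout."""
    compute = 0
    memory_gb = 0
    for slice_name, count in layout.items():
        match = SLICE_PATTERN.match(slice_name)
        if match is None:
            raise ValueError(f"Unrecognized MIG slice name: {slice_name!r}")
        compute += int(match.group(1)) * count
        memory_gb += int(match.group(2)) * count
    return compute, memory_gb


totals = {name: profile_totals(layout) for name, layout in MIG_PROFILES.items()}
```

Every profile in the commit totals 4 compute slices and 96 GB, matching the `4g.96gb` full-GPU profile, so none of the layouts oversubscribes the GPU.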
- INSTANCE_RESOURCES: add all 6 g7e sizes with cpu/gpu/memory/efa specs
- INSTANCE_TYPE_MIG_PROFILES: add g7e MIG profiles (1g.24gb, 2g.48gb, 4g.96gb)
- HyperpodInstanceType enum: add 6 g7e entries

Add ml.g7e.{2,4,8,12,24,48}xlarge to the health-monitoring-agent
DaemonSet node affinity allowlist so the agent runs on g7e instances.
Part of g7e instance type onboarding for HyperPod.
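
The shorthand `ml.g7e.{2,4,8,12,24,48}xlarge` expands to six instance type names. A minimal sketch of that expansion follows; the names `G7E_SIZES`, `EFA_CAPABLE_SIZES`, and `g7e_instance_types` are hypothetical, and the EFA subset is taken from the values.yaml commit message above. The exact label format used by each plugin is not stated in the commit messages, so the `ml.` prefix here follows only the health-monitoring-agent description.

```python
from typing import Iterable, List

# Sizes named in the commit messages for the g7e family.
G7E_SIZES = (2, 4, 8, 12, 24, 48)
# Subset described as EFA-capable in the values.yaml commit.
EFA_CAPABLE_SIZES = (8, 12, 24, 48)


def g7e_instance_types(sizes: Iterable[int] = G7E_SIZES) -> List[str]:
    """Expand size multipliers into ml.g7e.<N>xlarge instance type names."""
    return [f"ml.g7e.{size}xlarge" for size in sizes]


all_types = g7e_instance_types()
efa_types = g7e_instance_types(EFA_CAPABLE_SIZES)
```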
PR Approval Steps
For Requester
For Reviewer
For Requester section to double check each item.