test(RHOAIENG-57445): added E2E test for RayCluster Autoscaling #1077
kryanbeane wants to merge 2 commits into
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by:
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
Codecov Report
✅ All modified and coverable lines are covered by tests.

@@           Coverage Diff           @@
##             main    #1077   +/-   ##
=======================================
  Coverage   96.61%   96.61%
=======================================
  Files          23       23
  Lines        2306     2306
=======================================
  Hits         2228     2228
  Misses         78       78
/hold
@kryanbeane I am getting resource issues on a standard PSI cluster. One of the workers is stuck pending (for this step: Also, from the race condition mentioned in another commit, the test got stuck on this: and eventually timed out:
…esource pressure Co-authored-by: Cursor <cursoragent@cursor.com>
Issue link
https://redhat.atlassian.net/browse/RHOAIENG-57445
What changes have been made
Add E2E tests for Ray in-tree autoscaling (non-Kueue path):
- tests/e2e/autoscaling_raycluster_sdk_kind_test.py: KinD lifecycle test (scale up + scale down)
- tests/e2e/autoscaling_raycluster_sdk_oauth_test.py: OpenShift/OAuth lifecycle test (scale up + scale down)
- tests/e2e/autoscaling_load.py: Ray workload script that creates CPU-bound tasks to trigger autoscaling
- tests/e2e/support.py: Added wait_for_worker_count() and run_autoscaling_load_in_head_pod() helpers

Both tests create an autoscaling-enabled RayCluster without Kueue resources, verify scale-up under load, then verify scale-down after idle timeout.
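
For reviewers who want a feel for the polling pattern behind a helper like wait_for_worker_count(), here is a minimal sketch. It is not the code from this PR: the signature, the `get_worker_count` and `condition` parameters, and the `replay` stand-in are all assumptions made for illustration. In a real test, `get_worker_count` would presumably count the RayCluster's worker pods; here it is simulated so the sketch runs without a cluster.

```python
import time
from typing import Callable, Sequence


def wait_for_worker_count(
    get_worker_count: Callable[[], int],   # hypothetical: returns the current worker count
    condition: Callable[[int], bool],      # hypothetical: e.g. lambda n: n >= 2
    timeout: float = 300.0,
    interval: float = 5.0,
) -> int:
    """Poll until condition(count) is true and return that count.

    Raises TimeoutError if the condition is still false once the deadline passes.
    """
    deadline = time.monotonic() + timeout
    while True:
        count = get_worker_count()
        if condition(count):
            return count
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"worker count was still {count} after {timeout} seconds"
            )
        time.sleep(interval)


def replay(values: Sequence[int]) -> Callable[[], int]:
    """Stand-in for a cluster query: yields each value once, then repeats the last."""
    remaining = list(values)

    def _get() -> int:
        return remaining.pop(0) if len(remaining) > 1 else remaining[0]

    return _get


# Simulated lifecycle: scale up under load, then scale back down when idle.
scaled_up = wait_for_worker_count(replay([1, 1, 2]), lambda n: n >= 2, timeout=5, interval=0)
scaled_down = wait_for_worker_count(replay([3, 2, 1]), lambda n: n == 1, timeout=5, interval=0)
print(f"scaled up to {scaled_up}, then back down to {scaled_down}")
```

Passing the condition as a callable lets the same helper cover both the scale-up check (>= 2) and the scale-down check (== 1). A timeout sized for the cluster under test matters here, since a worker stuck pending would otherwise block the test indefinitely.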
Verification steps
KinD:
    kubectl config use-context kind-kind
    cd tests/e2e
    poetry run pytest -vv -s autoscaling_raycluster_sdk_kind_test.py -m kind

OpenShift:
Expected: cluster scales from min_workers=1 to >=2 under load, then back to 1 after idle timeout.

Checks