test(RHOAIENG-57445): added E2E test for RayCluster Autoscaling#1077

Open
kryanbeane wants to merge 2 commits into project-codeflare:main from kryanbeane:RHOAIENG-57445

Conversation

@kryanbeane
Contributor

@kryanbeane kryanbeane commented May 7, 2026

Issue link

https://redhat.atlassian.net/browse/RHOAIENG-57445

What changes have been made

Add E2E tests for Ray in-tree autoscaling (non-Kueue path):

  • tests/e2e/autoscaling_raycluster_sdk_kind_test.py — KinD lifecycle test (scale up + scale down)
  • tests/e2e/autoscaling_raycluster_sdk_oauth_test.py — OpenShift/OAuth lifecycle test (scale up + scale down)
  • tests/e2e/autoscaling_load.py — Ray workload script that creates CPU-bound tasks to trigger autoscaling
  • tests/e2e/support.py — Added wait_for_worker_count() and run_autoscaling_load_in_head_pod() helpers
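
The signatures of the new helpers are not shown in this description. As a rough illustration only, a polling helper in the spirit of wait_for_worker_count() might look like the sketch below. The name poll_worker_count, its parameters, and the injected get_worker_count callable are all hypothetical and are not taken from tests/e2e/support.py.

```python
import time
from typing import Callable


def poll_worker_count(
    get_worker_count: Callable[[], int],   # hypothetical: returns current number of worker pods
    predicate: Callable[[int], bool],      # condition to wait for, e.g. lambda n: n >= 2
    timeout: float = 600.0,                # seconds before giving up (assumed default)
    interval: float = 5.0,                 # seconds between polls (assumed default)
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> int:
    """Poll until predicate(count) holds and return that count.

    Raises TimeoutError if the deadline passes first.
    """
    deadline = clock() + timeout
    while True:
        count = get_worker_count()
        if predicate(count):
            return count
        if clock() >= deadline:
            raise TimeoutError(
                f"worker count was still {count} after {timeout} seconds"
            )
        sleep(interval)
```

Injecting sleep and clock keeps the sketch testable without a cluster; a real helper would more likely query the Kubernetes API for worker pods directly.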

Both tests create an autoscaling-enabled RayCluster without Kueue resources, verify scale-up under load, then verify scale-down after the idle timeout.

Verification steps

KinD:

kubectl config use-context kind-kind
cd tests/e2e
poetry run pytest -vv -s autoscaling_raycluster_sdk_kind_test.py -m kind

OpenShift:

oc login <cluster>
cd tests/e2e
poetry run pytest -vv -s autoscaling_raycluster_sdk_oauth_test.py -m openshift

Expected: cluster scales from min_workers=1 to >=2 under load, then back to 1 after idle timeout.
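
The expected lifecycle can be stated as a check over a series of observed worker counts. The function below is a minimal sketch under that framing; saw_scale_up_then_down and its parameters are hypothetical names for illustration and are not part of the tests in this PR.

```python
from typing import Iterable


def saw_scale_up_then_down(
    counts: Iterable[int],
    min_workers: int = 1,   # baseline worker count from the expected result above
    scaled_up: int = 2,     # threshold that counts as "scaled up under load"
) -> bool:
    """Return True if counts reach >= scaled_up and later return to min_workers."""
    reached_peak = False
    for n in counts:
        if not reached_peak:
            # Still waiting for the scale-up phase.
            reached_peak = n >= scaled_up
        elif n == min_workers:
            # Scale-up was seen earlier and the cluster is back at its baseline.
            return True
    return False
```

For example, the sequence 1, 1, 2, 2, 1 satisfies the check, while 1, 2, 2 does not because the cluster never returns to one worker.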

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • Testing is not required for this change

@openshift-ci openshift-ci Bot requested review from Fiona-Waters and szaher May 7, 2026 15:10
@openshift-ci
Contributor

openshift-ci Bot commented May 7, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign kryanbeane for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@kryanbeane kryanbeane changed the title from "sptest(RHOAIENG-57445): added E2E test for RayCluster Autoscaling" to "test(RHOAIENG-57445): added E2E test for RayCluster Autoscaling" May 7, 2026
@codecov

codecov Bot commented May 7, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 96.61%. Comparing base (112357e) to head (b940332).
⚠️ Report is 7 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1077   +/-   ##
=======================================
  Coverage   96.61%   96.61%           
=======================================
  Files          23       23           
  Lines        2306     2306           
=======================================
  Hits         2228     2228           
  Misses         78       78           

@kryanbeane
Contributor Author

/hold

@openshift-ci openshift-ci Bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 7, 2026
Comment thread tests/e2e/support.py Outdated
Comment thread tests/e2e/autoscaling_raycluster_sdk_oauth_test.py Outdated
@pawelpaszki
Contributor

@kryanbeane I am getting resource issues on a standard PSI cluster. One of the workers is stuck pending (at this step: (autoscaler +2s) Resized to 4 CPUs, 0 GPUs.) with the following info:

0/6 nodes are available: 1 node(s) were unschedulable, 2 Insufficient memory, 3 node(s) had untolerated taint {node-role.kubernetes.io/master: }. no new claims to deallocate, preemption: 0/6 nodes are available: 2 No preemption victims found for incoming pod, 4 Preemption is not helpful for scheduling.

Also, because of the race condition mentioned in another commit, the test got stuck on this:

(autoscaler +2s) Tip: use `ray status` to view detailed cluster status. To disable these messages, set RAY_SCHEDULER_EVENTS=0.
(autoscaler +2s) Adding 1 node(s) of type small-group-autoscale-iontx.
(autoscaler +2s) Resized to 4 CPUs, 0 GPUs.
(autoscaler +4m5s) Removing 1 nodes of type small-group-autoscale-iontx (idle).
(autoscaler +4m5s) Resized to 3 CPUs, 0 GPUs.

and eventually timed out:

FAILED

…esource pressure

Co-authored-by: Cursor <cursoragent@cursor.com>

Labels

do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command.

2 participants