
Conversation

@michael-johnston (Member) commented Jan 14, 2026

This PR adds a new stopper, BayesianDifferenceStopper, to the ray_tune operator.

This stopper enables stopping sampling when the mean difference between two metrics is determined to be above or below a threshold with a target certainty.

Use cases include determining if a model is drifting, i.e. whether the mean difference is above or below a tolerance threshold given new data, and detecting performance regressions, i.e. whether, given a new software version, performance has changed substantially (beyond a threshold).
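A minimal sketch of the stop test, for illustration only: it assumes Ray Tune's Stopper interface and a Student-t posterior for the mean difference via scipy. The class name, constructor arguments, and the use of the signed (rather than absolute) difference are illustrative choices, not necessarily what this PR implements.

```python
# Illustrative sketch only, not the PR's implementation.
import numpy as np
from scipy import stats
from ray.tune.stopper import Stopper


class BayesianDifferenceStopperSketch(Stopper):
    """Stop once P(mean difference above/below threshold) reaches a target probability."""

    def __init__(self, metric_a, metric_b, threshold, target_probability=0.95, min_samples=10):
        self.metric_a = metric_a
        self.metric_b = metric_b
        self.threshold = threshold
        self.target_probability = target_probability
        self.min_samples = min_samples
        self._diffs = []

    def __call__(self, trial_id, result):
        # Each trial result is assumed to report both metrics; record their difference.
        if self.metric_a in result and self.metric_b in result:
            self._diffs.append(result[self.metric_a] - result[self.metric_b])
        return self.stop_all()

    def stop_all(self):
        if len(self._diffs) < self.min_samples:
            return False
        diffs = np.asarray(self._diffs, dtype=float)
        # Posterior of the mean difference under a flat prior: Student-t centred on the sample mean.
        scale = max(stats.sem(diffs), 1e-12)
        posterior = stats.t(df=len(diffs) - 1, loc=diffs.mean(), scale=scale)
        p_above = posterior.sf(self.threshold)   # P(mean difference > threshold)
        p_below = posterior.cdf(self.threshold)  # P(mean difference < threshold)
        # Stop once either side of the threshold is resolved with the target certainty.
        return max(p_above, p_below) >= self.target_probability
```

Whether the actual implementation uses the signed or absolute difference, and exactly how it pools per-trial values, is defined by the code in this PR; the sketch only illustrates the Bayesian stop test described above.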

If an operation that created a nested op was interrupted during the nested op, the relationship between the parent and child would not be captured.

This change fixes this.
Stopper which stops when the difference between two metrics is greater/less than a threshold with some probability

e.g. stop if the difference between perf_version_a and perf_version_b is < 10 tokens/sec with 95% probability

also add tests
@AlessandroPomponio changed the title from "feat: difference stopper" to "feat(ray_tune): difference stopper" Jan 14, 2026
@AlessandroPomponio (Member) left a comment


I will need to check again tomorrow, but these are some issues I see

@michael-johnston (Member, Author)

@AlessandroPomponio hold off on review as I noticed a more fundamental issue I need to fix

will convert to draft and ping you when ready

@michael-johnston marked this pull request as draft January 14, 2026 18:45
michael-johnston and others added 2 commits January 15, 2026 14:38
Deprecated mode parameter

Removes cases where the condition specified by the user is known to never be satisfied but the run did not stop, e.g. 95% probability that the mean difference > 10, but the mean difference is 2: this would not stop even when the probability that the mean difference < 10 is 1.0.
Signed-off-by: Michael Johnston <66301584+michael-johnston@users.noreply.github.com>
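As a concrete illustration of the case described in the commit message above (the posterior shape is assumed purely for illustration): if the user asks for 95% probability that the mean difference exceeds 10, but the posterior concentrates around a mean difference of 2, the requested condition can never be satisfied, while the complementary probability that the difference is below 10 is essentially 1, so the run can stop.

```python
# Hypothetical posterior matching the example in the commit message:
# mean difference around 2, user condition "mean difference > 10 with 95% probability".
from scipy import stats

posterior = stats.norm(loc=2.0, scale=0.5)      # assumed posterior for the mean difference
p_above = posterior.sf(10.0)                    # ~0.0: requested condition will never be met
p_below = posterior.cdf(10.0)                   # ~1.0: the complementary condition is certain
should_stop = max(p_above, p_below) >= 0.95     # stop instead of exhausting the trial budget
print(p_above, p_below, should_stop)            # ~0.0 ~1.0 True
```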
@DRL-NextGen (Member) commented Jan 15, 2026

Checks Summary

Last run: 2026-01-16T11:23:13.286Z

Code Risk Analyzer vulnerability scan found 2 vulnerabilities:

| Severity | Identifier | Package | Details | Fix |
| --- | --- | --- | --- | --- |
| 🔷 Medium | CVE-2026-22773 | vllm (vllm:0.11.1) | vLLM is vulnerable to DoS in Idefics3 vision models via image payload with ambiguous dimensions (GHSA-grg2-63fw-f2qr) | 0.12.0 |
| ◻ Unknown | CVE-2025-53000 | nbconvert (nbconvert:7.16.6->ado-core:1.3.3) | nbconvert has an uncontrolled search path that leads to unauthorized code execution on Windows (GHSA-xm59-rqc7-hhvf) | >7.16.6 |

Mend Unified Agent vulnerability scan found 1 vulnerability:

| Severity | Identifier | Package | Details | Fix |
| --- | --- | --- | --- | --- |
| 🔺 High | CVE-2025-53000 | nbconvert-7.16.6-py3-none-any.whl | The nbconvert tool, jupyter nbconvert, converts Jupyter notebooks to various other formats via Jinja templates. Versions of nbconvert up to and including 7.16.6 on Windows have a vulnerability in which converting a notebook containing SVG output to a PDF results in unauthorized code execution. Specifically, a third party can create an "inkscape.bat" file that defines a Windows batch script, capable of arbitrary code execution. When a user runs "jupyter nbconvert --to pdf" on a notebook containing SVG output on a Windows platform from this directory, the "inkscape.bat" file is run unexpectedly. As of time of publication, no known patches exist. | Not Available |

michael-johnston and others added 6 commits January 15, 2026 19:05
Now allows setting whether the identifier is in target or observed format. The default is either (the existing behaviour).
MeasurementSpace.propertyWithIdentifierInSpace
…erved format

target is default, keeping existing behaviour
@michael-johnston marked this pull request as ready for review January 15, 2026 19:33
michael-johnston and others added 4 commits January 16, 2026 10:12
Co-authored-by: Alessandro Pomponio <10339005+AlessandroPomponio@users.noreply.github.com>
Signed-off-by: Michael Johnston <66301584+michael-johnston@users.noreply.github.com>
Co-authored-by: Alessandro Pomponio <10339005+AlessandroPomponio@users.noreply.github.com>
Signed-off-by: Michael Johnston <66301584+michael-johnston@users.noreply.github.com>
Co-authored-by: Alessandro Pomponio <10339005+AlessandroPomponio@users.noreply.github.com>
Signed-off-by: Michael Johnston <66301584+michael-johnston@users.noreply.github.com>
Co-authored-by: Alessandro Pomponio <10339005+AlessandroPomponio@users.noreply.github.com>
Signed-off-by: Michael Johnston <66301584+michael-johnston@users.noreply.github.com>
Co-authored-by: Alessandro Pomponio <10339005+AlessandroPomponio@users.noreply.github.com>
Signed-off-by: Michael Johnston <66301584+michael-johnston@users.noreply.github.com>
@michael-johnston (Member, Author) commented Jan 16, 2026

@VassilisVassiliadis here is a suggested operation for testing with sfttrainer. Change fields as desired.

metadata:
  description: "Perform latin hypercube sampling with difference stopper for space using sfttrainer lora benchmark experiment"
  name: "lhc-difference-sfttrainer-lora"
operation:
  module:
    operatorName: "ray_tune"
    operationType: "search"
  parameters:
    runtimeConfig:
      stop:
      - name: "BayesianMetricDifferenceStopper"
        keywordParams:
          metric_a: "finetune_lora_benchmark-v1.0.0-fms_hf_tuning_version.2.6.0-dataset_tokens_per_second_per_gpu"  # v1 measurement
          metric_b: "finetune_lora_benchmark-v1.0.0-fms_hf_tuning_version.3.0.0-dataset_tokens_per_second_per_gpu"  # v2 measurement
          threshold: 100                  # Stop when we know |v1-v2| > or < 100 with target probability
          target_probability: 0.95        # 95% confidence
          min_samples: 10                 # Wait for 10 trials minimum
    orchestratorConfig:
      metric_format: "observed" # We need to use observed property value as the target property id is the same for both experiment versions
    tuneConfig:
      metric: "finetune_lora_benchmark-v1.0.0-fms_hf_tuning_version.2.6.0-dataset_tokens_per_second_per_gpu" #ray tune needs primary metric to track
      max_concurrent_trials: 1 # This is set for debugging. Increase if you want multiple measurements at once.
      mode: min 
      num_samples: 32
      search_alg:
        name: lhu_sampler
spaces:
  - space-60b5c0-12e5dd
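In other words: Ray Tune will draw up to 32 Latin hypercube samples one at a time (max_concurrent_trials: 1), and the BayesianMetricDifferenceStopper can end the run early once at least 10 trials have reported and it is at least 95% certain whether the mean tokens-per-second-per-GPU difference between the fms_hf_tuning 2.6.0 and 3.0.0 measurements is above or below 100.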

config.py: import typing
operator.py: import Literal from typing
Suppresses errors in the IDE if the plugin is not installed
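Presumably the change is along these lines (illustrative only; the actual diff may differ): using only standard-library typing imports for annotations, so IDEs and type checkers can resolve them even when the optional plugin packages are not installed locally.

```python
# Illustrative sketch of the imports described in the commit messages above.
import typing                      # config.py: reference types via typing.*
from typing import Literal         # operator.py: e.g. a Literal-typed option

# Hypothetical alias for illustration; a Literal type lets IDEs validate the
# allowed values without needing the ray_tune plugin installed locally.
MetricFormat = Literal["target", "observed"]
```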
@michael-johnston (Member, Author)

@VassilisVassiliadis we will wait for your update confirming that the above YAML works before merging.

@VassilisVassiliadis (Member)

I'll reply here when my test is over.
