Add output validation to benchmarks for regression testing (Issue #267) #295

Draft
Copilot wants to merge 2 commits into master from copilot/issue-267-benchmark-validation

Copilot AI commented Apr 15, 2026

Regression tests currently verify only that benchmark functions execute without errors; they do not check that the output is correct. This PR adds a validate_output hook to each benchmark, which the regression framework calls after each successful invocation.

Interface

  • Added optional validate_output(input_config, output) -> bool to BenchmarkModuleInterface (defaults to True for backwards compatibility)
  • Added Benchmark.validate_output() method that delegates to the benchmark's input.py if the function is defined
  • regression.py now calls benchmark.validate_output(input_config, ret.output.get("result", {})) after each successful trigger invocation; failures are logged and mark the test as failed

Validators receive the function handler's full return value, a dict containing result and measurement keys, so each validator reads the benchmark's own result from output['result'].
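The delegation described above could look roughly like the sketch below. It is an illustration only: the Benchmark class here is a hypothetical stand-in that models nothing but the hook, `_input_module` and `check_invocation` are assumed names, and the actual implementation in this PR may differ.

```python
from types import SimpleNamespace
from typing import Any, Optional


class Benchmark:
    """Hypothetical stand-in for the real Benchmark class; only the hook is modelled."""

    def __init__(self, input_module: Optional[Any]) -> None:
        # Assumption: input_module represents the benchmark's loaded input.py module.
        self._input_module = input_module

    def validate_output(self, input_config: dict, output: dict) -> bool:
        # Look up the optional hook; benchmarks that do not define it pass by default.
        validator = getattr(self._input_module, "validate_output", None)
        if validator is None:
            return True
        return bool(validator(input_config, output))


def check_invocation(benchmark: Benchmark, input_config: dict, invocation_output: dict) -> bool:
    # Mirrors the call described above: the handler's return value sits under "result".
    return benchmark.validate_output(input_config, invocation_output.get("result", {}))


# A benchmark whose input.py has no validate_output still passes.
legacy = Benchmark(SimpleNamespace())
print(legacy.validate_output({}, {}))
```

Looking the hook up with getattr keeps existing benchmarks working unchanged, which matches the backwards-compatibility goal stated in the first bullet.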

Per-benchmark validators added to input.py

| Benchmark | Validation logic |
| --- | --- |
| 010.sleep | result == input['sleep'] (exact match) |
| 110.dynamic-html | Non-empty HTML string containing the username |
| 120.uploader | Non-empty storage key, URL echoed back correctly |
| 130.crud-api | GET returns expected fields; PUT returns {} |
| 210.thumbnailer, 220.video-processing, 504.dna-visualisation | Non-empty storage key in result |
| 311.compression | Result key ends with .zip |
| 411.image-recognition | Non-empty class label string and non-negative integer index |
| 501.graph-pagerank | Float in [0.0, 1.0] |
| 502.graph-mst | Result is not None |
| 503.graph-bfs | Non-empty list/tuple |
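As a rough illustration of how two of the rules listed above might be written, the sketch below covers 501.graph-pagerank and 503.graph-bfs. It assumes the value to check sits directly under output['result']; the actual payload shapes and validator bodies in this PR are not shown here and may differ. The function names are hypothetical, since in each benchmark's input.py the hook would be named validate_output.

```python
def validate_pagerank(input_config: dict, output: dict) -> bool:
    # Rule from the list above: a float in [0.0, 1.0].
    # Assumption: the score is stored directly under "result".
    value = output.get("result")
    return isinstance(value, float) and 0.0 <= value <= 1.0


def validate_bfs(input_config: dict, output: dict) -> bool:
    # Rule from the list above: a non-empty list or tuple.
    # Assumption: the traversal result is stored directly under "result".
    value = output.get("result")
    return isinstance(value, (list, tuple)) and len(value) > 0


print(validate_pagerank({}, {"result": 0.25}))
print(validate_bfs({}, {"result": []}))
```

Checking the type before the range or length keeps a malformed payload (for example a string where a float is expected) from raising inside the validator; it simply fails validation.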

Example

# benchmarks/000.microbenchmarks/010.sleep/input.py
def validate_output(input_config: dict, output: dict) -> bool:
    return output.get('result') == input_config.get('sleep')
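
To show how the 010.sleep validator behaves, the snippet below repeats it and calls it with made-up payloads. The sample values and the contents of the measurement dict are assumptions for illustration, not data from a real run.

```python
def validate_output(input_config: dict, output: dict) -> bool:
    # Same check as the 010.sleep example: the result must equal the requested sleep.
    return output.get('result') == input_config.get('sleep')


# Hypothetical handler return values with result and measurement keys.
matching = {'result': 2, 'measurement': {}}
mismatched = {'result': 5, 'measurement': {}}

print(validate_output({'sleep': 2}, matching))    # True
print(validate_output({'sleep': 2}, mismatched))  # False
```

Because both sides use .get, a payload with no result key compared against a config with no sleep key evaluates None == None and passes, so callers may want to make sure the config always carries sleep.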
