[FEAT] mcp server for correctness verification and benchmarking by mohammedahmed18 · Pull Request #2147 · codeflash-ai/codeflash

mohammedahmed18 · 2026-05-08T15:44:18Z

Example e2e workflow (on bubble_sort.py)

============================================================
MCP WORKFLOW TEST
============================================================

[1] Running behavioral tests (baseline)...
    run_id:      e2e-test-b08abaaf-ca92-448e-ad8d-16be8ec7d1b4
    total_tests: 3
    passed:      3
    failed:      0
    runtime_ns:  1901084104

[2] Running benchmark (baseline, original code)...
    run_id:         e2e-test-b08abaaf-ca92-448e-ad8d-16be8ec7d1b4-bench-baseline
    total_runtime:  1820472624ns
    loops_executed: 3

[3] Applying optimization to bubble_sort.py...
    module: /home/mohammed/Work/codeflash/code_to_optimize/bubble_sort.py (optimized in-place)

[4] Running behavioral tests (candidate)...
    run_id:      e2e-test-b08abaaf-ca92-448e-ad8d-16be8ec7d1b4-candidate
    total_tests: 3
    passed:      3
    failed:      0
    runtime_ns:  47862

[5] Comparing baseline vs candidate...
    equivalent:     True
    total_compared: 3

[6] Running benchmark (candidate, optimized code)...
    run_id:         e2e-test-b08abaaf-ca92-448e-ad8d-16be8ec7d1b4-bench-candidate
    total_runtime:  29319ns
    loops_executed: 10
    speedup_x:      62091.907x
    speedup_pct:    6209090.7%
    baseline_ns:    1820472624
    candidate_ns:   29319

============================================================
RESULTS SUMMARY
============================================================
  Baseline:   3/3 passed
  Candidate:  3/3 passed
  Equivalent: True
  Benchmark baseline:  1820.47ms
  Benchmark candidate: 0.03ms
  Speedup:    62091.907x
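
The speedup figures in the log can be reproduced from the two benchmark runtimes. The sketch below is inferred from the reported numbers, not taken from the PR's code: it assumes `speedup_x` is the ratio of baseline to candidate runtime and `speedup_pct` is that ratio minus one, expressed as a percentage. The `speedup` function name is hypothetical.

```python
def speedup(baseline_ns: int, candidate_ns: int) -> tuple[float, float]:
    """Return (speedup_x, speedup_pct) for two runtimes in nanoseconds.

    Assumed formulas (inferred from the log, not from the PR source):
      speedup_x   = baseline / candidate
      speedup_pct = (speedup_x - 1) * 100
    """
    if candidate_ns <= 0:
        raise ValueError("candidate_ns must be positive")
    ratio = baseline_ns / candidate_ns
    return ratio, (ratio - 1.0) * 100.0


# Runtimes taken from steps [2] and [6] of the log above.
x, pct = speedup(1_820_472_624, 29_319)
print(f"speedup_x:   {x:.3f}x")
print(f"speedup_pct: {pct:.1f}%")
```

With the logged runtimes this yields the same `62091.907x` and `6209090.7%` values shown in step [6].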

I'm planning to integrate this with a coding agent (such as claude-code). The workflow should look like this:

1- The agent decides to optimize a function.
2- It generates one or more test files.
3- It calls codeflash-mcp to run the test files; the MCP server stores the results in a SQLite database (the path is passed via an environment variable).
4- It calls codeflash-mcp to benchmark the original code using the agent-generated tests.
5- The agent applies the optimization (replaces the code in the main file(s)).
6- It calls the behavioral test tool again, then calls the comparator tool to compare the two sets of test results. The comparator knows what to compare from the run ID: the agent can provide one, or the MCP server generates a random UUID and returns it to the agent.
7- The agent keeps fixing the optimized code until step 6 passes.
8- It calls the codeflash-mcp benchmarking tool on the new code to make sure it is faster.

The agent keeps iterating on this loop.
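
Steps 3 and 6 can be sketched as a small SQLite-backed store with a comparator keyed on run IDs. This is a minimal illustration under stated assumptions, not the PR's implementation: the table layout, the function names (`open_store`, `record_run`, `compare_runs`), and the `EXAMPLE_RESULTS_DB` environment variable are all hypothetical, since the PR only says that the database path is passed via env and that a random UUID is generated when the agent supplies no run ID.

```python
import json
import os
import sqlite3
import uuid

# Hypothetical variable name; the PR only states the path comes from the environment.
DB_PATH = os.environ.get("EXAMPLE_RESULTS_DB", ":memory:")


def open_store(path=DB_PATH):
    """Open the results database and create the (assumed) schema if needed."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS test_results ("
        "run_id TEXT, test_id TEXT, passed INTEGER, return_value TEXT, "
        "PRIMARY KEY (run_id, test_id))"
    )
    return conn


def record_run(conn, results, run_id=None):
    """Store one behavioral run and return its run ID.

    `results` maps test_id -> (passed, return_value). If the caller gives
    no run ID, a random UUID-based one is generated and handed back.
    """
    run_id = run_id or f"run-{uuid.uuid4()}"
    rows = [
        (run_id, test_id, int(passed), json.dumps(value))
        for test_id, (passed, value) in results.items()
    ]
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR REPLACE INTO test_results VALUES (?, ?, ?, ?)", rows
        )
    return run_id


def compare_runs(conn, baseline_id, candidate_id):
    """Compare two stored runs test by test (pass status and return value)."""

    def load(run_id):
        cur = conn.execute(
            "SELECT test_id, passed, return_value FROM test_results "
            "WHERE run_id = ?",
            (run_id,),
        )
        return {test_id: (passed, value) for test_id, passed, value in cur}

    base, cand = load(baseline_id), load(candidate_id)
    all_tests = base.keys() | cand.keys()
    mismatches = sorted(t for t in all_tests if base.get(t) != cand.get(t))
    return {
        "equivalent": not mismatches,
        "total_compared": len(all_tests),
        "mismatches": mismatches,
    }


if __name__ == "__main__":
    conn = open_store(":memory:")
    baseline = record_run(conn, {"test_sort": (True, [1, 2, 3])})
    candidate = record_run(
        conn, {"test_sort": (True, [1, 2, 3])}, run_id=f"{baseline}-candidate"
    )
    print(compare_runs(conn, baseline, candidate))
```

Keying every row on the run ID is what lets the agent re-run step 6 repeatedly: each attempt overwrites or adds a candidate run, and the comparator only needs the two IDs to decide whether the optimized code still behaves like the original.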

add mcp server for behavioural correctness and benchmarking

c884d99

mohammedahmed18 requested a review from KRRT7 as a code owner May 8, 2026 15:44

mohammedahmed18 requested a review from a team May 8, 2026 15:55

mohammedahmed18 wants to merge 1 commit into main from feat/mcp-server
