[FEAT] mcp server for correctness verification and benchmarking #2147

Open
mohammedahmed18 wants to merge 1 commit into main from feat/mcp-server
mohammedahmed18 commented May 8, 2026

Example e2e workflow (on bubble_sort.py)

============================================================
MCP WORKFLOW TEST
============================================================

[1] Running behavioral tests (baseline)...
    run_id:      e2e-test-b08abaaf-ca92-448e-ad8d-16be8ec7d1b4
    total_tests: 3
    passed:      3
    failed:      0
    runtime_ns:  1901084104

[2] Running benchmark (baseline, original code)...
    run_id:         e2e-test-b08abaaf-ca92-448e-ad8d-16be8ec7d1b4-bench-baseline
    total_runtime:  1820472624ns
    loops_executed: 3

[3] Applying optimization to bubble_sort.py...
    module: /home/mohammed/Work/codeflash/code_to_optimize/bubble_sort.py (optimized in-place)

[4] Running behavioral tests (candidate)...
    run_id:      e2e-test-b08abaaf-ca92-448e-ad8d-16be8ec7d1b4-candidate
    total_tests: 3
    passed:      3
    failed:      0
    runtime_ns:  47862

[5] Comparing baseline vs candidate...
    equivalent:     True
    total_compared: 3

[6] Running benchmark (candidate, optimized code)...
    run_id:         e2e-test-b08abaaf-ca92-448e-ad8d-16be8ec7d1b4-bench-candidate
    total_runtime:  29319ns
    loops_executed: 10
    speedup_x:      62091.907x
    speedup_pct:    6209090.7%
    baseline_ns:    1820472624
    candidate_ns:   29319

============================================================
RESULTS SUMMARY
============================================================
  Baseline:   3/3 passed
  Candidate:  3/3 passed
  Equivalent: True
  Benchmark baseline:  1820.47ms
  Benchmark candidate: 0.03ms
  Speedup:    62091.907x
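
The `speedup_x` and `speedup_pct` values in the log are consistent with the usual ratio and relative-improvement formulas. The sketch below reproduces them from `baseline_ns` and `candidate_ns`; the function name `speedup` is hypothetical, and the formula is an assumption inferred from the numbers rather than taken from the PR's code.

```python
def speedup(baseline_ns: int, candidate_ns: int) -> tuple[float, float]:
    """Return (speedup factor, speedup percent) for two runtimes in nanoseconds."""
    if candidate_ns <= 0:
        raise ValueError("candidate_ns must be positive")
    factor = baseline_ns / candidate_ns   # how many times faster the candidate is
    percent = (factor - 1) * 100          # relative improvement over the baseline
    return factor, percent


# Values copied from steps [2] and [6] of the log above.
factor, percent = speedup(1820472624, 29319)
print(f"speedup_x:   {factor:.3f}x")    # 62091.907x
print(f"speedup_pct: {percent:.1f}%")   # 6209090.7%
```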

The plan is to integrate this with a coding agent (such as claude-code). The workflow should look like this:

1- The agent decides to optimize a function.
2- It generates one or more test files.
3- It calls codeflash-mcp to run the test files; the MCP server stores the results in a SQLite database (the path is passed via an environment variable).
4- It calls codeflash-mcp to benchmark the original code using the agent-generated tests.
5- The agent applies the optimization (replaces the code in the main file(s)).
6- It calls the behavioral test tool again, then calls the comparator tool to compare the test results. The comparator knows what to compare based on the run-id, which the agent can provide; otherwise the MCP server generates a random UUID and returns it to the agent.
7- The agent keeps fixing the optimized code until step 6 passes.
8- It calls the codeflash-mcp benchmarking tool on the new code to make sure it is faster.
The agent then keeps iterating.
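
Steps 3 and 6 could be sketched roughly as follows: test results are stored in SQLite keyed by run-id, and a comparator loads two runs and checks that they match. Everything here is hypothetical and for illustration only: the `results` table, the functions `open_db`, `record_run`, and `compare_runs`, and the result format are assumptions, not the actual codeflash-mcp schema or tool API. The `equivalent` and `total_compared` keys only mirror the fields shown in the log.

```python
import sqlite3
import uuid


def open_db(path=":memory:"):
    """Open the results database (hypothetical schema: one row per test per run)."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS results ("
        "run_id TEXT, test_name TEXT, passed INTEGER, output TEXT, "
        "PRIMARY KEY (run_id, test_name))"
    )
    return conn


def record_run(conn, results, run_id=None):
    """Store {test_name: (passed, output)} and return the run-id used."""
    run_id = run_id or str(uuid.uuid4())  # generate an id when the caller gives none
    conn.executemany(
        "INSERT OR REPLACE INTO results VALUES (?, ?, ?, ?)",
        [(run_id, name, int(passed), repr(output))
         for name, (passed, output) in results.items()],
    )
    conn.commit()
    return run_id


def load_run(conn, run_id):
    """Return {test_name: (passed, output)} for one run."""
    rows = conn.execute(
        "SELECT test_name, passed, output FROM results WHERE run_id = ?", (run_id,)
    )
    return {name: (passed, output) for name, passed, output in rows}


def compare_runs(conn, baseline_id, candidate_id):
    """Compare two stored runs test by test (pass/fail status and output)."""
    baseline = load_run(conn, baseline_id)
    candidate = load_run(conn, candidate_id)
    mismatched = sorted(
        name for name in baseline.keys() | candidate.keys()
        if baseline.get(name) != candidate.get(name)
    )
    return {
        "equivalent": not mismatched,
        "total_compared": len(baseline),
        "mismatched": mismatched,
    }
```

An agent loop would call `record_run` once for the baseline and once after each candidate edit, then repeat step 7 until `compare_runs(...)["equivalent"]` is true before moving on to the benchmark in step 8.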

mohammedahmed18 requested a review from KRRT7 as a code owner May 8, 2026 15:44
mohammedahmed18 requested a review from a team May 8, 2026 15:55