-
Notifications
You must be signed in to change notification settings - Fork 6
blog: Qiskit IVR functional validation #47
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open
ajbozarth
wants to merge
1
commit into
generative-computing:main
Choose a base branch
from
ajbozarth:docs/qiskit-ivr-functional-validation
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
+246
−0
Open
Changes from all commits
Commits
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,246 @@ | ||
| --- | ||
| title: "From Linting to Tests: Doubling Functional Correctness in Qiskit IVR" | ||
| date: "2026-05-27" | ||
| author: "Alex Bozarth" | ||
| excerpt: "Wiring functional tests into Mellea's Instruct-Validate-Repair loop nearly doubled functional correctness on the Qiskit Human Eval benchmark, on top of what static validation already provided." | ||
| tags: ["qiskit", "IVR", "validation", "benchmark", "LLM", "code-generation"] | ||
| --- | ||
|
|
||
| In our [previous post](/blogs/qiskit-ivr-code-validation/) we showed how Mellea's | ||
| Instruct-Validate-Repair (IVR) pattern, paired with `flake8-qiskit-migration`, | ||
| catches deprecated Qiskit APIs in LLM-generated code automatically. With a | ||
| Qiskit-specialized model that piece works well: across 302 benchmark runs, the | ||
| linter accepted the model's output 100% of the time. Static validation does | ||
| real work, and the repair loop is a reliable safety net for migration rules. | ||
|
|
||
| That left an obvious next question: a linter can tell us the imports are | ||
| modern and the API names exist, but it cannot tell us whether the code does | ||
| what was asked. What happens when the validator inside the IVR loop is a | ||
| functional test instead of a static check? | ||
|
|
||
| We had a clean way to answer that. The | ||
| [Qiskit Human Eval (QHE)](https://github.com/qiskit-community/qiskit-human-eval) | ||
| dataset ships a `check()` function with each problem: a small unit test that | ||
| runs the generated code and asserts on its output. We wired `check()` into | ||
| Mellea's IVR loop as a second `Requirement`, alongside the existing QKT | ||
| linter, and reran the benchmark. Same model, same prompts, same repair | ||
| budget, two validators instead of one. | ||
|
|
||
| Functional correctness on the same benchmark went from **27.8%** | ||
| to **50.3%** — a +22.5pp jump on a benchmark the model was already saturating | ||
| from a linting standpoint. This post is about how that integration was wired | ||
| together, why dynamic feedback unlocked it, and what it does and doesn't | ||
| generalize to. | ||
|
|
||
| ## What "linter pass" was and wasn't telling us | ||
|
|
||
| The first post left one finding implicit that's worth pulling out, because it | ||
| sets up everything below. The Qiskit-specialized model used in our benchmark | ||
| (`hf.co/Qiskit/mistral-small-3.2-24b-qiskit-GGUF`) passed the QKT linter on | ||
| 100% of runs. If you stopped there, you'd conclude the IVR loop had nothing | ||
| left to do. | ||
|
|
||
| When we ran QHE's per-problem `check()` functions over the same outputs as a | ||
| post-hoc analysis step, only **27.8%** of those QKT-clean programs actually | ||
| behaved correctly. The remaining 72% imported the right modules, called | ||
| non-deprecated functions, and still produced the wrong answer, raised an | ||
| exception, or built a circuit with the wrong shape. | ||
|
|
||
| This is the gap a linter cannot see. `flake8-qiskit-migration` parses the AST | ||
| and checks for deprecated symbols. It cannot detect a 5-qubit circuit when | ||
| the prompt asked for 3, a missing measurement, or a control flow bug. | ||
| Catching those requires running the code. | ||
|
|
||
| ## Wiring `check()` into the loop | ||
|
|
||
| The Mellea side of this turned out to be small. The example's | ||
| `generate_validated_qiskit_code()` function already accepted a list of | ||
| requirements, but the QKT rule was hardcoded inside it. To inject a | ||
| per-problem functional test from the benchmark harness without forking | ||
| the example, we added an | ||
| [`extra_requirements`](https://github.com/generative-computing/mellea/blob/365f8631/docs/examples/instruct_validate_repair/qiskit_code_validation/qiskit_code_validation.py#L77) | ||
| parameter: | ||
|
|
||
| ```python | ||
| def generate_validated_qiskit_code( | ||
| m: MelleaSession, | ||
| prompt: str, | ||
| strategy: MultiTurnStrategy | RepairTemplateStrategy, | ||
| *, | ||
| system_prompt: str | None = None, | ||
| grounding_context: dict[str, str] | None = None, | ||
| extra_requirements: list[Requirement] | None = None, | ||
| ) -> tuple[str, bool, int]: | ||
| code_candidate = m.instruct( | ||
| prompt, | ||
| requirements=[ | ||
| req( | ||
| "Code must pass Qiskit migration validation (QKT rules)", | ||
| validation_fn=simple_validate(validate_qiskit_migration), | ||
| ), | ||
| *(extra_requirements or []), | ||
| ], | ||
| strategy=strategy, | ||
| return_sampling_results=True, | ||
| ) | ||
| # ... | ||
| ``` | ||
|
|
||
| With that in place, the benchmark builds a `check()` validator from each | ||
| prompt's per-problem test and passes it through: | ||
|
|
||
| ```python | ||
| def make_qhe_check_validator(check_fn: str, entry_point: str): | ||
| def validator(code: str) -> tuple[bool, str]: | ||
| try: | ||
| namespace: dict = {} | ||
| exec(code, namespace) | ||
| fn = namespace[entry_point] | ||
| exec(check_fn, namespace) | ||
| namespace["check"](fn) | ||
| return True, "" | ||
| except AssertionError as e: | ||
| return False, f"check() assertion failed: {e}" | ||
| except Exception as e: | ||
| return False, f"check() raised {type(e).__name__}: {e}" | ||
| return validator | ||
|
|
||
| extra = [ | ||
| req( | ||
| "Code must pass QHE check() function", | ||
| validation_fn=simple_validate( | ||
| make_qhe_check_validator(check_fn, entry_point) | ||
| ), | ||
| ), | ||
| ] | ||
|
|
||
| code, success, attempts = generate_validated_qiskit_code( | ||
| m, prompt, strategy, extra_requirements=extra, | ||
| ) | ||
| ``` | ||
|
|
||
| Mellea handles the rest: a run is `success: true` only if both validators | ||
| pass, and any failure (static or dynamic) feeds its message back into the | ||
| repair prompt for the next attempt. | ||
|
|
||
| ## What changed | ||
|
|
||
| Same Qiskit-specialized model, same 302 runs (151 prompts × 2 strategies), | ||
| same 10-attempt repair budget. The only change is that the IVR loop can now | ||
| see test failures while it still has attempts left, instead of after the fact. | ||
|
|
||
| | Validator inside the loop | Pass rate | | ||
| | --- | --- | | ||
| | QKT only | 84/302 = **27.8%** | | ||
| | QKT + `check()` | 152/302 = **50.3%** | | ||
|
|
||
| The lift held across difficulty tiers, not just the easy ones: | ||
|
|
||
| | Difficulty | QKT only | QKT + tests | Lift | | ||
| | --- | --- | --- | --- | | ||
| | basic | 41.0% | 61.4% | +20.4pp | | ||
| | intermediate | 14.7% | 39.6% | +24.9pp | | ||
| | difficult | 0% | 20.0% | +20.0pp | | ||
|
|
||
| The "difficult" row is the most striking. With QKT alone, no run on a | ||
| difficult prompt was functionally correct. With dynamic feedback in the | ||
| loop, two of ten passed, including a for-loop circuit construction problem | ||
| (`qiskitHumanEval/150`) that one strategy solved on the first attempt and | ||
| the other reached after nine repair iterations. | ||
|
|
||
| The repair loop also activated more often, exactly as you'd hope. In the | ||
| QKT-only run, only 3.6% of passes needed any repair at all; the loop was | ||
| mostly idle. With `check()` in the loop, **21% of passes required at least | ||
| one repair attempt**, and every failure exhausted the full 10-attempt | ||
| budget. The model wasn't giving up; it was working, and sometimes that work | ||
| paid off late in the budget. | ||
|
|
||
| ## Why feedback shape matters more than feedback presence | ||
|
|
||
| Wiring a dynamic validator into the loop is necessary but not | ||
| sufficient. A `check()` that fails with a bare `AssertionError` and no | ||
| message gives the repair prompt nothing to work with. The model sees "your | ||
| code failed a test" and is left to guess which one and how. | ||
|
|
||
| That's what QHE assertions still look like today. An | ||
| [open PR](https://github.com/qiskit-community/qiskit-human-eval/pull/88) | ||
| adds f-string messages to every assertion, and we benchmarked against that PR: | ||
|
|
||
| ```python | ||
| # Before | ||
| assert result.num_qubits == 3 | ||
|
|
||
| # After | ||
| assert result.num_qubits == 3, f"Expected 3 qubits, got {result.num_qubits}" | ||
| ``` | ||
|
|
||
| Those messages are what the repair loop consumes. When the validator says | ||
| "Expected 3 qubits, got 5," the next instruction includes that text, and | ||
| the model has a concrete thing to fix. This is the same property that made | ||
| QKT useful in the first post: the validator names the violation in words | ||
| the model can act on. Pass/fail without context is not enough feedback to | ||
| repair against, regardless of whether the validator is static or | ||
| dynamic. | ||
|
|
||
| ## What transfers, and what doesn't | ||
|
|
||
| The QHE `check()` functions are not a generic Qiskit validator. Each one is | ||
| a hand-written test for a specific benchmark problem: it knows the function | ||
| name to call, the inputs to pass, and the properties to assert. They exist | ||
| because QHE is an evaluation dataset, not a Qiskit toolkit, and there is no | ||
| equivalent "check any Qiskit code for correctness" tool. The numbers above | ||
| shouldn't be read as drop-in deployment results. | ||
|
|
||
| What does transfer is the pattern. Many real codebases already ship | ||
| something `check()`-shaped: a pytest suite, a smoke-test script, an | ||
| import-and-run sanity check, a runtime assertion, a domain-specific | ||
| validator. Anything that runs the generated code and returns human-readable | ||
| failure text can be plugged into Mellea's IVR loop the same way `check()` | ||
| was. Adding dynamic signal to a static-only loop produces substantial gains, | ||
| as our benchmark above shows. What you plug in is up to | ||
| your project. | ||
|
|
||
| ## Strategy choice flipped under richer feedback | ||
|
|
||
| We tested two of Mellea's sampling strategies in our benchmarks: | ||
| `RepairTemplateStrategy`, which appends each validation failure to a single | ||
| growing instruction, and `MultiTurnStrategy`, which adds each failure as a | ||
| new turn in the conversation history. Which one to use depends on what kind of | ||
| feedback the loop is working with, and adding `check()` flipped that for us. | ||
|
|
||
| In our QKT-only baseline, `MultiTurnStrategy` had a small edge. With | ||
| dynamic feedback added, the picture flipped: `MultiTurnStrategy` passed | ||
| 92% of its successes on the first attempt but rescued only 6 cases through | ||
| repair, while `RepairTemplateStrategy` passed 66% on the first attempt and | ||
| **rescued 26 cases through sustained repair**, some at 7, 9, and 10 | ||
| attempts. Both strategies reached similar overall pass rates (51% vs | ||
| 49.7%), but they got there differently. | ||
|
|
||
| When failure feedback is rich enough to keep iterating on, accumulating | ||
| that feedback into a single growing prompt (RepairTemplate) seems to help | ||
| the model converge on hard cases. When the failure signal is thin, | ||
| conversational turn structure (MultiTurn) is steadier. If your validator | ||
| returns specific, descriptive errors, RepairTemplate is the better choice on | ||
| hard problems. | ||
|
|
||
| ## Takeaways | ||
|
|
||
| Static validation in the IVR loop is a strong floor. For Qiskit migration | ||
| specifically, it gets a specialized model to 100% lint compliance with | ||
| minimal effort, and that's a genuinely useful property to deploy. But the | ||
| ceiling on functional correctness is set by what the validator can see, and | ||
| a linter cannot see behavior. | ||
|
|
||
| When you can put a dynamic validator into the loop, even a narrow one | ||
| that only covers part of your code's intent, the lift in our benchmark was | ||
| ~22 percentage points across all difficulty tiers, including problems that | ||
| were previously unreachable. The integration cost was a list of | ||
| `Requirement` objects. | ||
|
|
||
| The full benchmark code, results, and analysis are in my | ||
| [toolbox repo](https://github.com/ajbozarth/toolbox/tree/main/mellea/qiskit_code_validation/benchmarking#phase-4--check-as-live-ivr-validator-v3-qiskit-model-lsf). | ||
| The Qiskit IVR example is in the | ||
| [Mellea repo](https://github.com/generative-computing/mellea/tree/main/docs/examples/instruct_validate_repair/qiskit_code_validation). | ||
| If your codebase has a test suite, a runtime check, or any validator that | ||
| returns human-readable failure text, it's worth ten minutes to see what | ||
| plugging it into the repair loop does for you. | ||
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The
namespace["check"]key is hardcoded. QHE's per-problem tests all happen to be namedcheck, so this works for the benchmark as written. The prose later in the post invites readers to plug in their own test suites, and anyone who does will hit aKeyErrorunless their function is also namedcheck. One inline note in the snippet would save them the debug session.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code snippet is actually just a cleaned up snippet from the benchmark script I ran. It's not meant to be an example of what to do, but an insight into what I did.
I would argue that by adding this comment I introduce an expectation in the reader that this is meant to be copied and used.