Skip to content
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
246 changes: 246 additions & 0 deletions content/blogs/qiskit-ivr-functional-validation.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,246 @@
---
title: "From Linting to Tests: Doubling Functional Correctness in Qiskit IVR"
date: "2026-05-27"
author: "Alex Bozarth"
excerpt: "Wiring functional tests into Mellea's Instruct-Validate-Repair loop nearly doubled functional correctness on the Qiskit Human Eval benchmark, on top of what static validation already provided."
tags: ["qiskit", "IVR", "validation", "benchmark", "LLM", "code-generation"]
---

In our [previous post](/blogs/qiskit-ivr-code-validation/) we showed how Mellea's
Instruct-Validate-Repair (IVR) pattern, paired with `flake8-qiskit-migration`,
catches deprecated Qiskit APIs in LLM-generated code automatically. With a
Qiskit-specialized model that piece works well: across 302 benchmark runs, the
linter accepted the model's output 100% of the time. Static validation does
real work, and the repair loop is a reliable safety net for migration rules.

That left an obvious next question: a linter can tell us the imports are
modern and the API names exist, but it cannot tell us whether the code does
what was asked. What happens when the validator inside the IVR loop is a
functional test instead of a static check?

We had a clean way to answer that. The
[Qiskit Human Eval (QHE)](https://github.com/qiskit-community/qiskit-human-eval)
dataset ships a `check()` function with each problem: a small unit test that
runs the generated code and asserts on its output. We wired `check()` into
Mellea's IVR loop as a second `Requirement`, alongside the existing QKT
linter, and reran the benchmark. Same model, same prompts, same repair
budget, two validators instead of one.

Functional correctness on the same benchmark went from **27.8%**
to **50.3%** — a +22.5pp jump on a benchmark the model was already saturating
from a linting standpoint. This post is about how that integration was wired
together, why dynamic feedback unlocked it, and what it does and doesn't
generalize to.

## What "linter pass" was and wasn't telling us

The first post left one finding implicit that's worth pulling out, because it
sets up everything below. The Qiskit-specialized model used in our benchmark
(`hf.co/Qiskit/mistral-small-3.2-24b-qiskit-GGUF`) passed the QKT linter on
100% of runs. If you stopped there, you'd conclude the IVR loop had nothing
left to do.

When we ran QHE's per-problem `check()` functions over the same outputs as a
post-hoc analysis step, only **27.8%** of those QKT-clean programs actually
behaved correctly. The remaining 72% imported the right modules, called
non-deprecated functions, and still produced the wrong answer, raised an
exception, or built a circuit with the wrong shape.

This is the gap a linter cannot see. `flake8-qiskit-migration` parses the AST
and checks for deprecated symbols. It cannot detect a 5-qubit circuit when
the prompt asked for 3, a missing measurement, or a control flow bug.
Catching those requires running the code.

## Wiring `check()` into the loop

The Mellea side of this turned out to be small. The example's
`generate_validated_qiskit_code()` function already accepted a list of
requirements, but the QKT rule was hardcoded inside it. To inject a
per-problem functional test from the benchmark harness without forking
the example, we added an
[`extra_requirements`](https://github.com/generative-computing/mellea/blob/365f8631/docs/examples/instruct_validate_repair/qiskit_code_validation/qiskit_code_validation.py#L77)
parameter:

```python
def generate_validated_qiskit_code(
m: MelleaSession,
prompt: str,
strategy: MultiTurnStrategy | RepairTemplateStrategy,
*,
system_prompt: str | None = None,
grounding_context: dict[str, str] | None = None,
extra_requirements: list[Requirement] | None = None,
) -> tuple[str, bool, int]:
code_candidate = m.instruct(
prompt,
requirements=[
req(
"Code must pass Qiskit migration validation (QKT rules)",
validation_fn=simple_validate(validate_qiskit_migration),
),
*(extra_requirements or []),
],
strategy=strategy,
return_sampling_results=True,
)
# ...
```

With that in place, the benchmark builds a `check()` validator from each
prompt's per-problem test and passes it through:

```python
def make_qhe_check_validator(check_fn: str, entry_point: str):
def validator(code: str) -> tuple[bool, str]:
try:
namespace: dict = {}
exec(code, namespace)
fn = namespace[entry_point]
exec(check_fn, namespace)
namespace["check"](fn)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The namespace["check"] key is hardcoded. QHE's per-problem tests all happen to be named check, so this works for the benchmark as written. The prose later in the post invites readers to plug in their own test suites, and anyone who does will hit a KeyError unless their function is also named check. One inline note in the snippet would save them the debug session.

Suggested change
namespace["check"](fn)
namespace["check"](fn) # QHE tests are named "check"; change to match yours

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code snippet is actually just a cleaned up snippet from the benchmark script I ran. It's not meant to be an example of what to do, but an insight into what I did.

I would argue that by adding this comment I introduce an expectation in the reader that this is meant to be copied and used.

return True, ""
except AssertionError as e:
return False, f"check() assertion failed: {e}"
except Exception as e:
return False, f"check() raised {type(e).__name__}: {e}"
return validator

extra = [
req(
"Code must pass QHE check() function",
validation_fn=simple_validate(
make_qhe_check_validator(check_fn, entry_point)
),
),
]

code, success, attempts = generate_validated_qiskit_code(
m, prompt, strategy, extra_requirements=extra,
)
```

Mellea handles the rest: a run is `success: true` only if both validators
pass, and any failure (static or dynamic) feeds its message back into the
repair prompt for the next attempt.

## What changed

Same Qiskit-specialized model, same 302 runs (151 prompts × 2 strategies),
same 10-attempt repair budget. The only change is that the IVR loop can now
see test failures while it still has attempts left, instead of after the fact.

| Validator inside the loop | Pass rate |
| --- | --- |
| QKT only | 84/302 = **27.8%** |
| QKT + `check()` | 152/302 = **50.3%** |

The lift held across difficulty tiers, not just the easy ones:

| Difficulty | QKT only | QKT + tests | Lift |
| --- | --- | --- | --- |
| basic | 41.0% | 61.4% | +20.4pp |
| intermediate | 14.7% | 39.6% | +24.9pp |
| difficult | 0% | 20.0% | +20.0pp |

The "difficult" row is the most striking. With QKT alone, no run on a
difficult prompt was functionally correct. With dynamic feedback in the
loop, two of ten passed, including a for-loop circuit construction problem
(`qiskitHumanEval/150`) that one strategy solved on the first attempt and
the other reached after nine repair iterations.

The repair loop also activated more often, exactly as you'd hope. In the
QKT-only run, only 3.6% of passes needed any repair at all; the loop was
mostly idle. With `check()` in the loop, **21% of passes required at least
one repair attempt**, and every failure exhausted the full 10-attempt
budget. The model wasn't giving up; it was working, and sometimes that work
paid off late in the budget.

## Why feedback shape matters more than feedback presence

Wiring a dynamic validator into the loop is necessary but not
sufficient. A `check()` that fails with a bare `AssertionError` and no
message gives the repair prompt nothing to work with. The model sees "your
code failed a test" and is left to guess which one and how.

That's what QHE assertions still look like today. An
[open PR](https://github.com/qiskit-community/qiskit-human-eval/pull/88)
adds f-string messages to every assertion, and we benchmarked against that PR:

```python
# Before
assert result.num_qubits == 3

# After
assert result.num_qubits == 3, f"Expected 3 qubits, got {result.num_qubits}"
```

Those messages are what the repair loop consumes. When the validator says
"Expected 3 qubits, got 5," the next instruction includes that text, and
the model has a concrete thing to fix. This is the same property that made
QKT useful in the first post: the validator names the violation in words
the model can act on. Pass/fail without context is not enough feedback to
repair against, regardless of whether the validator is static or
dynamic.

## What transfers, and what doesn't

The QHE `check()` functions are not a generic Qiskit validator. Each one is
a hand-written test for a specific benchmark problem: it knows the function
name to call, the inputs to pass, and the properties to assert. They exist
because QHE is an evaluation dataset, not a Qiskit toolkit, and there is no
equivalent "check any Qiskit code for correctness" tool. The numbers above
shouldn't be read as drop-in deployment results.

What does transfer is the pattern. Many real codebases already ship
something `check()`-shaped: a pytest suite, a smoke-test script, an
import-and-run sanity check, a runtime assertion, a domain-specific
validator. Anything that runs the generated code and returns human-readable
failure text can be plugged into Mellea's IVR loop the same way `check()`
was. Adding dynamic signal to a static-only loop produces substantial gains,
as our benchmark above shows. What you plug in is up to
your project.

## Strategy choice flipped under richer feedback

We tested two of Mellea's sampling strategies in our benchmarks:
`RepairTemplateStrategy`, which appends each validation failure to a single
growing instruction, and `MultiTurnStrategy`, which adds each failure as a
new turn in the conversation history. Which one to use depends on what kind of
feedback the loop is working with, and adding `check()` flipped that for us.

In our QKT-only baseline, `MultiTurnStrategy` had a small edge. With
dynamic feedback added, the picture flipped: `MultiTurnStrategy` passed
92% of its successes on the first attempt but rescued only 6 cases through
repair, while `RepairTemplateStrategy` passed 66% on the first attempt and
**rescued 26 cases through sustained repair**, some at 7, 9, and 10
attempts. Both strategies reached similar overall pass rates (51% vs
49.7%), but they got there differently.

When failure feedback is rich enough to keep iterating on, accumulating
that feedback into a single growing prompt (RepairTemplate) seems to help
the model converge on hard cases. When the failure signal is thin,
conversational turn structure (MultiTurn) is steadier. If your validator
returns specific, descriptive errors, RepairTemplate is the better choice on
hard problems.

## Takeaways

Static validation in the IVR loop is a strong floor. For Qiskit migration
specifically, it gets a specialized model to 100% lint compliance with
minimal effort, and that's a genuinely useful property to deploy. But the
ceiling on functional correctness is set by what the validator can see, and
a linter cannot see behavior.

When you can put a dynamic validator into the loop, even a narrow one
that only covers part of your code's intent, the lift in our benchmark was
~22 percentage points across all difficulty tiers, including problems that
were previously unreachable. The integration cost was a list of
`Requirement` objects.

The full benchmark code, results, and analysis are in my
[toolbox repo](https://github.com/ajbozarth/toolbox/tree/main/mellea/qiskit_code_validation/benchmarking#phase-4--check-as-live-ivr-validator-v3-qiskit-model-lsf).
The Qiskit IVR example is in the
[Mellea repo](https://github.com/generative-computing/mellea/tree/main/docs/examples/instruct_validate_repair/qiskit_code_validation).
If your codebase has a test suite, a runtime check, or any validator that
returns human-readable failure text, it's worth ten minutes to see what
plugging it into the repair loop does for you.
Loading