Name
mpo_unswapping_cpu_laptop_56_1917
Circuit
peaked_circuit_P9_Hqap_56x1917
Value
100
Method
MPO + Unswapping (CPU)
Method proof
Peaked MPO Solver: CPU-Oriented Implementation
(arXiv:2604.21908)
Overview & Enhancements
This is a CPU-oriented implementation of the midpoint MPO + greedy-unswapping strategy from Kremer & Dupuis (as applied in IBM's GPU submission #106). While the high-level approach is the same, this repository extends it with several optional features optimized for the CPU regime:
- Symmetric (Both-Sided) Unswapping: Evaluates swap candidates from the left, the right, and from both sides simultaneously, keeping the variant that maximally reduces the bond dimension.
- Additional Optimizations:
  - Cached SWAP-layer MPO construction
  - Full parity-swap probe reuse
  - Route-aware unswap candidate selection
  - Faster Sabre-based local rerouting
  - Fail-fast stall guard
- Production Configuration: The verified production path keeps the unswap-select-mode at "bond" with the pass order set to "both, left, right".
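The selection rule described in the list above can be sketched roughly as follows. This is only an illustration, not the repository's code: the function name `select_unswap_variant`, the dictionary of per-pass bond dimensions, and the tie-breaking rule (earlier entry in the pass order wins) are all assumptions made for the example.

```python
from typing import Dict, Optional, Sequence

# Pass order quoted in the text; everything else here is hypothetical.
PASS_ORDER = ("both", "left", "right")


def select_unswap_variant(
    bond_dims: Dict[str, int],
    current_bond: int,
    pass_order: Sequence[str] = PASS_ORDER,
) -> Optional[str]:
    """Return the pass whose result has the smallest bond dimension.

    Returns None when no pass improves on `current_bond`. Ties are resolved
    in favour of the earlier entry in `pass_order` (an assumption).
    """
    best_name: Optional[str] = None
    best_bond = current_bond
    for name in pass_order:
        bond = bond_dims.get(name)
        # Strict "<" keeps the earlier pass when two passes tie.
        if bond is not None and bond < best_bond:
            best_name, best_bond = name, bond
    return best_name


if __name__ == "__main__":
    # "both" and "left" tie at 48, so the earlier pass ("both") is kept.
    print(select_unswap_variant({"left": 48, "right": 52, "both": 48}, 64))
```

Scanning in pass order with a strict comparison makes the result deterministic when several variants reach the same bond dimension.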
Repository Artifacts
See BENCHMARKS.md in the repository for the full per-machine data table. The run folders under runs/ contain the following artifacts for each measurement:
- summary.json
- stats.csv
- samples.tsv
- plot.png
- samples.png
Benchmarks
Verified-clean runs executed at --cutoff 0.0006.
| Apple Silicon SKU | Model Number | Execution Time | Peak counts | #2 bitstring counts |
| --- | --- | --- | --- | --- |
| M5 Pro | T6050 | 734.2 s | 92/1000 | 10/1000 |
| M4 | T8132 | 747.2 s | 96/1000 | 9/1000 |
| M2 Pro | T6020 | 973.0 s | 46/1000 | 28/1000 |
| M1 Max | T6000 | 1339.5 s | 36/1000 | 6/1000 |
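As a rough illustration of how the "Peak counts" and "#2 bitstring counts" columns could be derived, the sketch below tallies a list of sampled bitstrings and reports the two most frequent ones. It assumes the samples are already loaded as a list of strings; the layout of samples.tsv is not described here, so no file parsing is attempted, and the helper name `top_two_counts` is hypothetical.

```python
from collections import Counter
from typing import List, Tuple


def top_two_counts(samples: List[str]) -> List[Tuple[str, int]]:
    """Return the (bitstring, count) pairs of the two most frequent samples."""
    return Counter(samples).most_common(2)


if __name__ == "__main__":
    # Toy data only: 8 samples rather than the 1000 used in the benchmarks.
    toy = ["0101"] * 5 + ["1100"] * 2 + ["0011"]
    for bitstring, count in top_two_counts(toy):
        print(f"{bitstring}: {count}/{len(toy)}")
```

A large gap between the first and second count is what indicates a clear peak.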
Run Validation:
All four benchmark runs finished with the following values:
- last_work_consumed = 1885
- termination_reason = completed
- matches_expected_bitstring = true
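A check like the one above could be automated along these lines. This is a minimal sketch: it assumes the three fields are available as keys of a dictionary (for example, one loaded from a run's summary.json, although the text does not confirm which artifact holds them), and the name `failed_checks` is hypothetical.

```python
from typing import Any, Dict, List

# Target values quoted in the text.
EXPECTED: Dict[str, Any] = {
    "last_work_consumed": 1885,
    "termination_reason": "completed",
    "matches_expected_bitstring": True,
}


def failed_checks(summary: Dict[str, Any]) -> List[str]:
    """Return the names of fields that are missing or differ from EXPECTED."""
    return [key for key, want in EXPECTED.items() if summary.get(key) != want]


if __name__ == "__main__":
    run = dict(EXPECTED, termination_reason="stalled")
    print(failed_checks(run))  # lists the fields that did not match
```

An empty list means the run satisfies all three conditions.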
Comparison to the Original GPU Baseline (#106)
The original Kremer & Dupuis result on this exact circuit (submission #106, verified) used the same MPO + unswapping method on a single datacenter GPU. The headline improvement is the compute class. The same simulation runs here on a consumer laptop CPU with no GPU:
| Implementation | Hardware | Runtime | Peak counts |
| --- | --- | --- | --- |
| Kremer & Dupuis (#106) | 1× Nvidia A100 80 GB GPU | 4059 s | ~100/1000 |
| This work (M5 Pro) | Apple M5 Pro laptop CPU | 734.2 s | 92/1000 |
Wall-clock time is also ~5.5× shorter (4059 s / 734.2 s ≈ 5.5).
📝 Note on Gate Counts
The initial QASM circuit consists of 1,917 rzz and 3,890 u gates. During preprocessing, Qiskit's Collect2qBlocks and ConsolidateBlocks passes fuse these losslessly into 1,885 generic 2q-unitary blocks before MPO compression begins.
Progress is reported in these consolidated-block units (0..1885), directly matching the "Total 2q Unitaries Consumed" convention established in the Kremer & Dupuis reference notebook.
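To illustrate why consolidation yields fewer blocks than there are rzz gates, here is a simplified pure-Python stand-in for block collection. It is not Qiskit's Collect2qBlocks or ConsolidateBlocks implementation, and the gate representation (name, qubit tuple) is an assumption for the example. The idea is that consecutive gates acting on the same qubit pair, with no intervening gate pulling either qubit into another pair, end up in one block.

```python
from typing import Dict, List, Tuple

Gate = Tuple[str, Tuple[int, ...]]  # (gate name, qubits it acts on)


def collect_2q_blocks(gates: List[Gate]) -> List[List[Gate]]:
    """Greedily group gates into blocks that each act on one qubit pair."""
    blocks: List[List[Gate]] = []
    owner: Dict[int, int] = {}           # qubit -> index of its latest block
    pending: Dict[int, List[Gate]] = {}  # 1q gates seen before any block
    for gate in gates:
        _, qubits = gate
        if len(qubits) == 1:
            q = qubits[0]
            if q in owner:
                blocks[owner[q]].append(gate)  # absorb into latest block
            else:
                pending.setdefault(q, []).append(gate)
        elif len(qubits) == 2:
            a, b = qubits
            if a in owner and owner.get(b) == owner[a]:
                blocks[owner[a]].append(gate)  # same pair, still adjacent
            else:
                # Start a new block, absorbing any leading 1q gates.
                blocks.append(pending.pop(a, []) + pending.pop(b, []) + [gate])
                owner[a] = owner[b] = len(blocks) - 1
        else:
            raise ValueError("only 1q and 2q gates are supported")
    return blocks  # 1q gates on qubits never touched by a 2q gate are dropped


if __name__ == "__main__":
    circuit = [("rzz", (0, 1)), ("u", (0,)), ("rzz", (0, 1)),
               ("rzz", (1, 2)), ("u", (1,)), ("rzz", (0, 1))]
    print(len(collect_2q_blocks(circuit)))  # 4 rzz gates, fewer blocks
```

In this toy circuit the first two rzz gates on qubits (0, 1) share a block, while the last one cannot join it because qubit 1 was used by the (1, 2) block in between.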
Authors
Alexey Galda
Institutions
Moderna
Quantum runtime (seconds)
No response
Classical runtime (seconds)
734
Compute resources (quantum)
No response
Compute resources (classical)
Apple M5 Pro (T6050) laptop, single Python process, Apple Accelerate BLAS
Notes
12 min on a consumer Apple Silicon laptop CPU