MPO + Unswapping on a laptop CPU for peaked_circuit_P9_Hqap_56x1917 #153

@alexgalda-m

Name

mpo_unswapping_cpu_laptop_56_1917

Circuit

peaked_circuit_P9_Hqap_56x1917

Value

100

Method

MPO + Unswapping (CPU)

Method proof

Peaked MPO Solver: CPU-Oriented Implementation

Overview & Enhancements

This is a CPU-oriented implementation of the midpoint MPO + greedy-unswapping strategy from Kremer & Dupuis (as applied in IBM's GPU submission #106). While the high-level approach is the same, this repository extends it with several optional features optimized for the CPU regime:

  • Symmetric (Both-Sided) Unswapping: Evaluates swap candidates from the left, the right, and from both sides simultaneously, keeping the variant that maximally reduces the bond dimension.
  • Additional Optimizations:
    • Cached SWAP-layer MPO construction
    • Full parity-swap probe reuse
    • Route-aware unswap candidate selection
    • Faster Sabre-based local rerouting
    • Fail-fast stall guard
  • Production Configuration: The verified production path keeps the unswap-select-mode at "bond" with the pass order set to "both, left, right".
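The variant selection described above can be illustrated on a toy two-qubit operator. This is a minimal sketch, not the repository's code: `bond_dimension`, `unswap_variants`, `select_unswap`, and the `tol` threshold are hypothetical names introduced here. The bond dimension is taken to be the operator Schmidt rank across the single cut, and the assumption is that the pass order "both, left, right" acts as a tie-breaker, so an earlier variant wins when two variants reach the same bond dimension.

```python
import numpy as np

# Two-qubit SWAP gate in the computational basis.
SWAP = np.array([[1, 0, 0, 0],
                 [0, 0, 1, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1]], dtype=complex)


def bond_dimension(op, tol=1e-10):
    """Operator Schmidt rank of a 4x4 operator across the qubit-0 | qubit-1 cut.

    `tol` is a hypothetical singular-value threshold for this sketch only.
    """
    # Regroup indices (i1, i2, j1, j2) -> (i1, j1 | i2, j2) and count singular values.
    t = op.reshape(2, 2, 2, 2).transpose(0, 2, 1, 3).reshape(4, 4)
    s = np.linalg.svd(t, compute_uv=False)
    return int(np.sum(s > tol))


def unswap_variants(op):
    """Candidate operators after applying a SWAP on the left, right, or both sides."""
    return {
        "both": SWAP @ op @ SWAP,
        "left": SWAP @ op,
        "right": op @ SWAP,
    }


def select_unswap(op, order=("both", "left", "right")):
    """Return (variant_name, bond_dim); variant_name is None if nothing improves."""
    best_name, best_dim = None, bond_dimension(op)
    variants = unswap_variants(op)
    for name in order:
        dim = bond_dimension(variants[name])
        if dim < best_dim:  # strict comparison: earlier entries in `order` win ties
            best_name, best_dim = name, dim
    return best_name, best_dim


# Toy input: a product operator hidden behind one SWAP.
H = np.array([[1, 1], [1, -1]], dtype=complex) / np.sqrt(2)
S = np.array([[1, 0], [0, 1j]], dtype=complex)
hidden = SWAP @ np.kron(H, S)

print(bond_dimension(hidden))  # 4
print(select_unswap(hidden))   # ('left', 1)
```

In this toy case the left and right variants both reduce the rank from 4 to 1, while the both-sided variant leaves it at 4, so the selection falls through to "left".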

Repository Artifacts

See BENCHMARKS.md in the repository for the full per-machine data table. The run folders under runs/ contain the following artifacts for each measurement:

  • summary.json
  • stats.csv
  • samples.tsv
  • plot.png
  • samples.png

Benchmarks

All runs below are verified-clean and were executed with --cutoff 0.0006.

| Apple Silicon SKU | Model Number | Execution Time | Peak counts | #2 bitstring counts |
|---|---|---|---|---|
| M5 Pro | T6050 | 734.2 s | 92/1000 | 10/1000 |
| M4 | T8132 | 747.2 s | 96/1000 | 9/1000 |
| M2 Pro | T6020 | 973.0 s | 46/1000 | 28/1000 |
| M1 Max | T6000 | 1339.5 s | 36/1000 | 6/1000 |

Run Validation:
All four benchmark runs successfully hit the following target states:

  • last_work_consumed = 1885
  • termination_reason = completed
  • matches_expected_bitstring = true
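A check along these lines can be scripted against each run's summary. The sketch below assumes the three fields appear as top-level keys of summary.json, which is a guess about the file layout rather than something stated here; `validate_summary` and `EXPECTED` are hypothetical names.

```python
import json

# Target values taken from the list above.
EXPECTED = {
    "last_work_consumed": 1885,
    "termination_reason": "completed",
    "matches_expected_bitstring": True,
}


def validate_summary(summary):
    """Return a list of mismatch messages; an empty list means the run passed."""
    problems = []
    for key, want in EXPECTED.items():
        got = summary.get(key)  # a missing key shows up as None
        if got != want:
            problems.append(f"{key}: expected {want!r}, got {got!r}")
    return problems


# Inline stand-in for the contents of one run's summary.json (assumed layout).
example = json.loads(
    '{"last_work_consumed": 1885,'
    ' "termination_reason": "completed",'
    ' "matches_expected_bitstring": true}'
)
print(validate_summary(example))  # []
```

For a real run, the inline string would be replaced by the parsed contents of that run's summary.json.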

Comparison to the Original GPU Baseline (#106)

The original Kremer & Dupuis result on this exact circuit (submission #106, verified) used the same MPO + unswapping method on a single datacenter GPU. The headline improvement is the compute class: the same simulation runs here on a consumer laptop CPU with no GPU.

| Implementation | Hardware | Runtime | Peak counts |
|---|---|---|---|
| Kremer & Dupuis (#106) | 1× Nvidia A100 80 GB GPU | 4059 s | ~100/1000 |
| This work (M5 Pro) | Apple M5 Pro laptop CPU | 734.2 s | 92/1000 |

Wall-clock time is also about 5.5× shorter (4059 s vs. 734.2 s).
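For reference, the ratio can be recomputed from the reported runtimes. The snippet below uses only the numbers quoted in this issue; the variable names are illustrative.

```python
# Runtimes in seconds, as reported in this issue.
baseline_s = 4059.0  # Kremer & Dupuis (#106), 1x Nvidia A100 80 GB
cpu_runs_s = {
    "M5 Pro": 734.2,
    "M4": 747.2,
    "M2 Pro": 973.0,
    "M1 Max": 1339.5,
}

# Speedup of each laptop run relative to the GPU baseline.
speedups = {chip: baseline_s / t for chip, t in cpu_runs_s.items()}

for chip, factor in speedups.items():
    print(f"{chip}: {factor:.2f}x")
```

The first line printed is `M5 Pro: 5.53x`, which matches the rounded ~5.5× figure.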


📝 Note on Gate Counts

The initial QASM circuit consists of 1,917 rzz and 3,890 u gates. During preprocessing, Qiskit's Collect2qBlocks and ConsolidateBlocks passes fuse these losslessly into 1,885 generic 2q-unitary blocks before MPO compression begins.

Progress is reported in these consolidated-block units (0..1885), directly matching the "Total 2q Unitaries Consumed" convention established in the Kremer & Dupuis reference notebook.
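To make "lossless fusion" concrete, the sketch below multiplies single-qubit u gates and one rzz gate into a single 4×4 unitary and checks that the fused block acts exactly like the gate sequence. It is a plain NumPy illustration of what a consolidated block represents, not a reproduction of Qiskit's passes. The gate angles are arbitrary, and the tensor-product ordering is a convention of this sketch rather than a statement about Qiskit's qubit ordering.

```python
import numpy as np


def u_gate(theta, phi, lam):
    """Generic single-qubit gate u(theta, phi, lambda) in the OpenQASM convention."""
    return np.array([
        [np.cos(theta / 2), -np.exp(1j * lam) * np.sin(theta / 2)],
        [np.exp(1j * phi) * np.sin(theta / 2),
         np.exp(1j * (phi + lam)) * np.cos(theta / 2)],
    ])


def rzz_gate(theta):
    """rzz(theta) = exp(-i * theta / 2 * Z (x) Z), which is diagonal."""
    zz_eigenvalues = np.array([1, -1, -1, 1])
    return np.diag(np.exp(-0.5j * theta * zz_eigenvalues))


# One block: u gates on both qubits, an rzz, then u gates again (arbitrary angles).
pre = np.kron(u_gate(0.3, 0.1, 0.2), u_gate(1.1, 0.4, 0.5))
entangler = rzz_gate(0.9)
post = np.kron(u_gate(0.7, 0.9, 0.6), u_gate(0.2, 0.8, 1.3))

# Fusing is just the matrix product, applied right to left in circuit order.
fused = post @ entangler @ pre

# Compare against applying the gates one at a time to |00>.
state = np.zeros(4, dtype=complex)
state[0] = 1.0
step_by_step = post @ (entangler @ (pre @ state))

print(np.allclose(fused @ fused.conj().T, np.eye(4)))  # fused block is unitary
print(np.allclose(fused @ state, step_by_step))        # same action as the sequence
```

Because the fusion is an exact matrix product, no approximation enters at this stage. Truncation only happens later, during MPO compression.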

Authors

Alexey Galda

Institutions

Moderna

Quantum runtime (seconds)

No response

Classical runtime (seconds)

734

Compute resources (quantum)

No response

Compute resources (classical)

Apple M5 Pro (T6050) laptop, single Python process, Apple Accelerate BLAS

Notes

12 min on a consumer Apple Silicon laptop CPU
