perf: parallelize n_batches=5 IBM solve_fermion calls (subprocess pool) #43

@thc1006

Context

run_hi_nqs_sqd does n_batches=5 (default) IBM solve_fermion calls sequentially per iteration (hi_nqs_sqd.py:530-583). Each call uses pyscf's internal Davidson with all 12 OMP threads, taking ~6 min @ batch=5000 or ~12 min @ batch=8000.

Per iter: 5 × 6 min = 30 min just for IBM diag. Dominates wall time.

Proposal

Run 5 batches in parallel via subprocess pool:

  • 5 Python subprocesses, each gets ~2 CPU cores
  • Single Davidson per subprocess (2 threads)
  • Wall per call: ~12-18 min (2-3× slower than the 6 min 12-thread call, due to fewer threads)
  • 5 in parallel: ~12-18 min total (vs 30 min sequential)

Expected speedup: 30-40% per iter (= 40-60% reduction in IBM diag wall, from 30 min to 12-18 min)
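
The diag-wall estimate can be reproduced with a few lines of arithmetic. This is a minimal sketch that uses only the timings quoted in this issue (6 min per 12-thread call, 5 batches); the 2× and 3× slowdown factors for 2-thread calls are assumptions, not measurements, and `diag_wall_minutes` is a hypothetical helper.

```python
def diag_wall_minutes(n_batches, minutes_per_call, slowdown=1.0, parallel=False):
    """Wall time of the IBM diag step for one iteration, in minutes."""
    per_call = minutes_per_call * slowdown
    # When all batches run concurrently, wall time is one (slower) call.
    return per_call if parallel else n_batches * per_call


sequential = diag_wall_minutes(5, 6.0)                            # 12 threads, one call at a time
best = diag_wall_minutes(5, 6.0, slowdown=2.0, parallel=True)     # assumed 2x slowdown
worst = diag_wall_minutes(5, 6.0, slowdown=3.0, parallel=True)    # assumed 3x slowdown

reduction_best = 1 - best / sequential
reduction_worst = 1 - worst / sequential
print(f"sequential: {sequential:.0f} min, parallel: {best:.0f}-{worst:.0f} min")
print(f"diag wall reduction: {reduction_worst:.0%}-{reduction_best:.0%}")
```

With these inputs the diag wall drops from 30 min to 12-18 min, a 40-60% reduction. The real figure depends on how Davidson scales between 2 and 12 threads, which still needs to be measured.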

Why this is hard

  1. IPC for hamiltonian + integrals: the molecular integrals must be serialized or placed in shared memory. Could pickle and pass them as subprocess args, or use a shared mmap.
  2. Result merging: each subprocess returns (e_b, sci_state, occs_b). Need to gather via Pipe / file / mp.Queue.
  3. GPU contention: the H200 GPU is shared with NQS training. This is probably fine, since IBM solve_fermion is CPU-only and the subprocesses should not need the GPU.
  4. Memory: each subprocess loads pyscf + qiskit_addon_sqd + integrals. 5× 1-2 GB = 5-10 GB. H200 node has 1.9 TB, fine.
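
For point 1, one option is `multiprocessing.shared_memory` from the standard library. The sketch below is an illustration, not the project's code: `publish_array` and `attach_array` are hypothetical helpers, and `n_orb = 26` assumes 52 qubits map to 26 spatial orbitals. The parent publishes an array once, and each worker attaches to it by name instead of receiving a pickled copy.

```python
import numpy as np
from multiprocessing import shared_memory


def publish_array(arr):
    """Copy arr into a new shared-memory block; return (shm, descriptor)."""
    arr = np.ascontiguousarray(arr)
    shm = shared_memory.SharedMemory(create=True, size=arr.nbytes)
    view = np.ndarray(arr.shape, dtype=arr.dtype, buffer=shm.buf)
    view[...] = arr
    # The descriptor is small and cheap to pickle into each worker.
    return shm, (shm.name, arr.shape, arr.dtype.str)


def attach_array(descriptor):
    """Attach to a published block by name (intended to run inside a worker)."""
    name, shape, dtype = descriptor
    shm = shared_memory.SharedMemory(name=name)
    arr = np.ndarray(shape, dtype=np.dtype(dtype), buffer=shm.buf)
    return shm, arr  # keep shm referenced for as long as arr is in use


n_orb = 26  # assumption: 52 qubits = 2 spin-orbitals per spatial orbital
eri = np.zeros((n_orb,) * 4)
eri[0, 0, 0, 0] = 1.5
eri_nbytes = eri.nbytes

owner, desc = publish_array(eri)
worker_shm, eri_view = attach_array(desc)  # same process here, for demonstration only
roundtrip_ok = bool(np.array_equal(eri_view, eri))

# Drop array views before closing, then free the block from the owner side.
del eri_view
worker_shm.close()
owner.close()
owner.unlink()
```

Under the 26-orbital assumption a dense float64 `eri` is about 3.5 MiB, so plain pickling through `pool.submit` may already be cheap enough; shared memory would matter more for larger active spaces or if the integrals are re-sent every iteration.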

Expected impact

  • 30 min/iter → 18 min/iter on 52Q at batch=5k
  • 30-iter job: ~6 hours saved (15h → ~9h)
  • Largest single speedup available without changing algorithm

Risk

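  • Per-call accuracy: 2-thread vs 12-thread Davidson should converge to the same energy for a given seed, though thread count can change floating-point summation order, so agreement is expected within the convergence tolerance rather than bit-for-bit. Needs verification with a fixed-seed test.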
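  • Subprocess crash: if one batch fails, the others should still succeed. The code already has `if not math.isfinite(e_b): continue`, which covers a batch that returns a non-finite energy; an exception raised by `f.result()` (including a worker process dying) would need its own handling.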
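
A minimal sketch of failure-tolerant gathering, assuming each worker returns a tuple whose first element is the energy. `gather_finite_results` and `_fake_solver` are hypothetical names; the demo uses a thread pool and a stand-in solver so it runs anywhere, but `ProcessPoolExecutor` exposes the same `submit` interface.

```python
import math
from concurrent.futures import ThreadPoolExecutor, as_completed


def gather_finite_results(executor, fn, batches):
    """Run fn on each batch and keep only results with a finite energy.

    Returns (results, failures): results maps batch index to fn's return
    value, failures maps batch index to a short reason string.
    """
    futures = {executor.submit(fn, batch): i for i, batch in enumerate(batches)}
    results, failures = {}, {}
    for fut in as_completed(futures):
        idx = futures[fut]
        try:
            out = fut.result()
        except Exception as exc:  # the worker raised, or the pool broke
            failures[idx] = repr(exc)
            continue
        if not math.isfinite(out[0]):  # same guard as the existing loop
            failures[idx] = "non-finite energy"
            continue
        results[idx] = out
    return results, failures


def _fake_solver(batch):
    # Hypothetical stand-in for the real solver, used only for this demo.
    if batch == "crash":
        raise RuntimeError("solver failed")
    if batch == "nan":
        return (float("nan"), None, None)
    return (-1.0 * len(batch), None, None)


with ThreadPoolExecutor(max_workers=3) as pool:
    results, failures = gather_finite_results(pool, _fake_solver, ["abc", "crash", "nan"])
```

One caveat: with a process pool, a worker that dies abruptly (for example, killed by the OOM killer) marks the whole pool as broken, so the remaining pending futures also raise. They land in `failures` here, but those batches would have to be resubmitted on a fresh pool.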

Implementation sketch

```python
from concurrent.futures import ProcessPoolExecutor
import os


def _run_one_batch(batch_configs_np, hcore, eri, spin_sq, n_orb, n_qubits):
    # Subprocess entry point; must be top-level so it can be pickled.
    # Cap threads before the heavy imports below load the OpenMP runtime.
    os.environ['OMP_NUM_THREADS'] = '2'
    from qiskit_addon_sqd.fermion import solve_fermion
    from qvartools._utils.formatting.bitstring_format import configs_to_ibm_format
    ibm_data = configs_to_ibm_format(batch_configs_np, n_orb, n_qubits)
    e_b, sci_state, occs_b, _ = solve_fermion(ibm_data, hcore=hcore, eri=eri, spin_sq=spin_sq)
    return e_b, sci_state, occs_b


# In the run_hi_nqs_sqd batch loop:
with ProcessPoolExecutor(max_workers=cfg.n_batches) as pool:
    futures = [
        pool.submit(_run_one_batch, batch_np, hcore, eri, spin_sq, n_orb, n_qubits)
        for batch_np in batch_configs_list
    ]
    results = [f.result() for f in futures]
```

Out of scope

  • Distributed-memory parallelism across nodes (MPI)
  • Async / overlapping with NQS train (separate issue)

Effort estimate

  • 2-3 days dev + testing
  • High value: enables 52Q jobs to fit in 12h time limit
