Context
`run_hi_nqs_sqd` makes `n_batches=5` (default) IBM `solve_fermion` calls sequentially per iteration (hi_nqs_sqd.py:530-583). Each call uses pyscf's internal Davidson with all 12 OMP threads and takes ~6 min @ batch=5000 or ~12 min @ batch=8000.
Per iter: 5 × 6 min = 30 min just for IBM diag, which dominates wall time.
Proposal
Run 5 batches in parallel via subprocess pool:
- 5 Python subprocesses, each gets ~2 CPU cores
- Single Davidson per subprocess (2 threads)
- Wall per call: ~12-18 min (2-3× slower per call than the current ~6 min, due to fewer threads)
- 5 in parallel: ~12-18 min total (vs 30 min sequential)
Expected speedup: 30-40% per iter (from a 40-60% reduction in IBM diag wall: 30 min → 12-18 min)
Why this is hard
- IPC for Hamiltonian + integrals: the molecular integrals must be serialized or placed in shared memory. Options: pickle them and pass as subprocess args, or use a shared mmap.
- Result merging: each subprocess returns `(e_b, sci_state, occs_b)`. These need to be gathered via Pipe / file / `mp.Queue` (or via the futures of a process pool).
- GPU contention: the H200 GPU is shared and NQS training uses it. Subprocesses would contend only if they also needed the GPU; this is probably fine since IBM `solve_fermion` is CPU-only.
- Memory: each subprocess loads pyscf + qiskit_addon_sqd + integrals. 5× 1-2 GB = 5-10 GB. H200 node has 1.9 TB, fine.
Expected impact
- 30 min/iter → 18 min/iter on 52Q at batch=5k
- 30-iter job: ~6 hours saved (15h → ~9h)
- Largest single speedup available without changing algorithm
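The arithmetic behind these figures can be checked with a small helper. This is only a back-of-the-envelope sketch: `ibm_diag_estimate` is a hypothetical name, and it counts the IBM diag stage alone, ignoring NQS training and other per-iteration work.

```python
def ibm_diag_estimate(n_batches, seq_min_per_call, par_min_per_call, n_iters):
    """Return (sequential_h, parallel_h, saved_h) for the IBM diag stage only."""
    seq_per_iter = n_batches * seq_min_per_call  # calls run back to back
    par_per_iter = par_min_per_call              # all calls overlap in time
    seq_h = seq_per_iter * n_iters / 60
    par_h = par_per_iter * n_iters / 60
    return seq_h, par_h, seq_h - par_h


# 5 batches, 6 min per call today, 18 min per call with 2 threads, 30 iterations.
seq_h, par_h, saved_h = ibm_diag_estimate(5, 6, 18, 30)
print(f"{seq_h:.0f} h -> {par_h:.0f} h, saved {saved_h:.0f} h")
```

With the 18 min/iter figure this gives 15 h → 9 h, i.e. 6 h saved. The optimistic 12 min per call would give 15 h → 6 h.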
Risk
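- Per-call accuracy: 2-thread and 12-thread Davidson should converge to the same energy (deterministic given seed). This needs verification with a fixed-seed test.
- Subprocess crash: if one batch fails, the others should still succeed. The code already has `if not math.isfinite(e_b): continue`, which covers a batch that returns a non-finite energy. It does not cover a worker that raises or dies: `f.result()` re-raises the worker's exception, and an abruptly killed worker breaks the whole `ProcessPoolExecutor` (`BrokenProcessPool`), so failures need to be caught per future.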
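A possible way to keep the existing `math.isfinite` check as the single failure path is to map every failed future to a NaN energy when gathering. The sketch below assumes that approach; `gather_batch_results` and `_toy_solver` are hypothetical names, and a thread pool is used only so the demo runs without pickling concerns.

```python
import math
from concurrent.futures import ThreadPoolExecutor, as_completed


def gather_batch_results(executor, fn, batches):
    """Submit one task per batch; map any failure to a NaN energy."""
    futures = {executor.submit(fn, b): i for i, b in enumerate(batches)}
    results = [None] * len(batches)
    for fut in as_completed(futures):
        idx = futures[fut]
        try:
            results[idx] = fut.result()
        except Exception:  # also catches BrokenProcessPool for a dead worker
            results[idx] = (math.nan, None, None)
    return results


def _toy_solver(batch):
    """Hypothetical stand-in for _run_one_batch; fails on an empty batch."""
    if not batch:
        raise ValueError("empty batch")
    return (-float(len(batch)), "state", "occs")


# Threads are used here only for the demo; the proposal itself uses processes.
with ThreadPoolExecutor(max_workers=3) as pool:
    results = gather_batch_results(pool, _toy_solver, [[1, 2], [], [3]])

# Same filter as the existing batch loop: skip non-finite energies.
good = [r for r in results if math.isfinite(r[0])]
```

Note that with a process pool, one abruptly killed worker fails all pending futures, so this pattern keeps the iteration alive but can still lose every outstanding batch for that iteration.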
Implementation sketch
```python
import os
from concurrent.futures import ProcessPoolExecutor


def _run_one_batch(batch_configs_np, hcore, eri, spin_sq, n_orb, n_qubits):
    # Subprocess entry point: must be a top-level function so it can be pickled.
    # The env var only takes effect if set before the OpenMP runtime loads in
    # this process, so the thread count is also set through pyscf below.
    os.environ["OMP_NUM_THREADS"] = "2"
    from pyscf import lib
    from qiskit_addon_sqd.fermion import solve_fermion
    from qvartools._utils.formatting.bitstring_format import configs_to_ibm_format

    lib.num_threads(2)
    ibm_data = configs_to_ibm_format(batch_configs_np, n_orb, n_qubits)
    e_b, sci_state, occs_b, _ = solve_fermion(ibm_data, hcore=hcore, eri=eri, spin_sq=spin_sq)
    return e_b, sci_state, occs_b


# In the run_hi_nqs_sqd batch loop:
with ProcessPoolExecutor(max_workers=cfg.n_batches) as pool:
    futures = [
        pool.submit(_run_one_batch, batch_np, hcore, eri, spin_sq, n_orb, n_qubits)
        for batch_np in batch_configs_list
    ]
    results = [f.result() for f in futures]
```
Out of scope
- Distributed-memory parallelism across nodes (MPI)
- Async / overlapping with NQS train (separate issue)
Effort estimate
- 2-3 days dev + testing
- High value: enables 52Q jobs to fit in 12h time limit