perf: parallelize n_batches=5 IBM solve_fermion calls (subprocess pool)

## Context

`run_hi_nqs_sqd` makes `n_batches=5` (default) IBM `solve_fermion` calls **sequentially** per iteration (`hi_nqs_sqd.py:530-583`). Each call runs pyscf's internal Davidson on all 12 OMP threads and takes ~6 min @ batch=5000 or ~12 min @ batch=8000.

Per iter: 5 × 6 min = **30 min just for IBM diag**. Dominates wall time.

## Proposal

Run 5 batches in parallel via subprocess pool:
- 5 Python subprocesses, each gets ~2 CPU cores
- Single Davidson per subprocess (2 threads)
- Wall per call: ~12-18 min (2-3× slower per call due to fewer threads, vs ~6 min at 12 threads)
- 5 in parallel: ~12-18 min total (vs 30 min sequential)

**Expected speedup: 30-40% per iter (= 40-60% reduction in IBM diag wall, 30 min → 12-18 min)**
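
The estimate above can be reproduced with a few lines of arithmetic. This is a minimal sketch, not a measurement: `parallel_diag_estimate` is a hypothetical helper, and it assumes all batches run concurrently, finish at about the same time, and slow down by a uniform factor when given fewer threads.

```python
def parallel_diag_estimate(seq_call_min, n_batches, slowdown):
    """Return (sequential_min, parallel_min, fractional_reduction) per iteration.

    Assumes every batch runs at the same time and takes `slowdown` times
    longer than a call that has all threads to itself.
    """
    sequential = n_batches * seq_call_min
    parallel = seq_call_min * slowdown  # one round of concurrent calls
    return sequential, parallel, 1.0 - parallel / sequential


# 6 min per call at batch=5000, 5 batches, 2x and 3x per-call slowdown.
for slowdown in (2.0, 3.0):
    seq, par, cut = parallel_diag_estimate(6.0, 5, slowdown)
    print(f"slowdown {slowdown:.0f}x: {seq:.0f} min -> {par:.0f} min ({cut:.0%} less)")
```

The break-even point in this model is a per-call slowdown equal to `n_batches`: if dropping to 2 threads made each call 5× slower, the parallel version would gain nothing.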

## Why this is hard

1. **IPC for hamiltonian + integrals**: the molecular integrals must reach each subprocess, either serialized or through shared memory. Could pickle and pass them as task arguments, or use a shared mmap.
2. **Result merging**: each subprocess returns (e_b, sci_state, occs_b). Need to gather via Pipe / file / mp.Queue.
3. **GPU contention**: the H200 GPU is shared and NQS training uses it. Probably fine, since IBM solve_fermion is CPU-only and the subprocesses should not need the GPU.
4. **Memory**: each subprocess loads pyscf + qiskit_addon_sqd + integrals. 5× 1-2 GB = 5-10 GB. H200 node has 1.9 TB, fine.
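
For item 1, one option is `multiprocessing.shared_memory`, so each worker maps the integrals instead of receiving its own pickled copy. The sketch below is illustrative only: `share_array` and `attach_array` are hypothetical helper names, and it assumes `hcore` and `eri` are dense NumPy arrays. Whether this is worth the complexity depends on their size; if 52 qubits corresponds to 26 spatial orbitals and `eri` is a dense float64 4-index array, it is only about 26^4 × 8 B ≈ 3.7 MB, in which case plain pickling through `pool.submit` may already be cheap enough.

```python
from multiprocessing import shared_memory

import numpy as np


def share_array(arr):
    """Copy arr into a new shared-memory block; return (shm, spec).

    The caller keeps shm alive and calls shm.close() and shm.unlink()
    once all workers are done.
    """
    shm = shared_memory.SharedMemory(create=True, size=arr.nbytes)
    view = np.ndarray(arr.shape, dtype=arr.dtype, buffer=shm.buf)
    view[...] = arr
    # spec is a small picklable tuple that can be passed to workers.
    return shm, (shm.name, arr.shape, arr.dtype.str)


def attach_array(spec):
    """Attach to the block described by spec; return (shm, view).

    The view is only valid while shm stays open, so keep both around
    and drop the view before calling shm.close().
    """
    name, shape, dtype = spec
    shm = shared_memory.SharedMemory(name=name)
    return shm, np.ndarray(shape, dtype=np.dtype(dtype), buffer=shm.buf)
```

The parent would call `share_array(eri)` once, pass the returned spec to each task, and have the worker call `attach_array(spec)` before handing the view to the solver.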

## Expected impact

- 30 min/iter → 18 min/iter on 52Q at batch=5k
- 30-iter job: ~6 hours saved (15h → ~9h)
- **Largest single speedup available without changing algorithm**

## Risk

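- Per-call accuracy: 2-thread vs 12-thread Davidson should converge to the same energy within solver tolerance (deterministic given seed, though a different thread count can change floating-point summation order, so bitwise-identical results are not guaranteed). Needs verification with a fixed-seed test.
- Subprocess crash: if one batch fails, the others should still succeed. The code already has `if not math.isfinite(e_b): continue`, which covers a batch that returns a non-finite energy. An exception raised inside a worker would instead surface at `f.result()` and needs its own handling.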
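
The fixed-seed check could compare per-batch energies from a 12-thread reference run against a 2-thread run. A minimal sketch follows; `energies_match` and `mismatched_batches` are hypothetical helpers, and the `1e-8` Hartree tolerance is an assumption that should be set from the Davidson convergence threshold actually in use.

```python
import math


def energies_match(e_ref, e_test, atol=1e-8):
    """True if both energies are finite and agree within atol (Hartree)."""
    if not (math.isfinite(e_ref) and math.isfinite(e_test)):
        return False
    return math.isclose(e_ref, e_test, rel_tol=0.0, abs_tol=atol)


def mismatched_batches(ref_energies, test_energies, atol=1e-8):
    """Return the indices of batches whose energies disagree."""
    if len(ref_energies) != len(test_energies):
        raise ValueError("runs must cover the same number of batches")
    return [
        i
        for i, (e_ref, e_test) in enumerate(zip(ref_energies, test_energies))
        if not energies_match(e_ref, e_test, atol)
    ]
```

An empty list from `mismatched_batches` means every batch agreed within tolerance; a non-finite energy on either side counts as a mismatch.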

## Implementation sketch

```python
import os
from concurrent.futures import ProcessPoolExecutor


def _run_one_batch(batch_configs_np, hcore, eri, spin_sq, n_orb, n_qubits):
    # Subprocess entry point; must be top-level so it can be pickled.
    # OMP_NUM_THREADS only takes effect if it is set before the OpenMP
    # runtime loads, which is not guaranteed here (e.g. a forked worker
    # inherits the parent's runtime), so also cap threads via pyscf.
    os.environ["OMP_NUM_THREADS"] = "2"
    from pyscf import lib
    from qiskit_addon_sqd.fermion import solve_fermion
    from qvartools._utils.formatting.bitstring_format import configs_to_ibm_format

    lib.num_threads(2)  # runtime OpenMP thread cap for pyscf's Davidson
    ibm_data = configs_to_ibm_format(batch_configs_np, n_orb, n_qubits)
    e_b, sci_state, occs_b, _ = solve_fermion(ibm_data, hcore=hcore, eri=eri, spin_sq=spin_sq)
    return e_b, sci_state, occs_b


# In run_hi_nqs_sqd batch loop:
with ProcessPoolExecutor(max_workers=cfg.n_batches) as pool:
    futures = [
        pool.submit(_run_one_batch, batch_np, hcore, eri, spin_sq, n_orb, n_qubits)
        for batch_np in batch_configs_list
    ]
    # Results come back in submission order; an exception in a worker re-raises here.
    results = [f.result() for f in futures]
```

## Out of scope

- Distributed-memory parallelism across nodes (MPI)
- Async / overlapping with NQS train (separate issue)

## Effort estimate

- 2-3 days dev + testing
- High value: enables 52Q jobs to fit in 12h time limit
