Summary
Two out of three pipeline scripts in experiments/pipelines/07_iterative_nqs_dci/ fail end-to-end with a ValueError at the run_hi_nqs_sqd / run_hi_nqs_skqd call site, on any molecule (including the smallest, H2). The bug has existed on main for a while; nothing appears to exercise these scripts in CI.
Affected scripts
experiments/pipelines/07_iterative_nqs_dci/iter_nqs_dci_sqd.py
experiments/pipelines/07_iterative_nqs_dci/iter_nqs_dci_krylov_classical.py
The 3rd script in the group (iter_nqs_dci_krylov_quantum.py) does not use initial_basis and is unaffected.
Reproduction
On current main:
python experiments/pipelines/07_iterative_nqs_dci/iter_nqs_dci_sqd.py h2 --device cpu
Expected: runs to completion, prints final energy.
Actual:
ValueError: initial_basis must be integer or bool dtype (binary occupations), got torch.float32
at src/qvartools/methods/nqs/hi_nqs_sqd.py:377 (validation in run_hi_nqs_sqd)
The traceback ends inside the validation block that rejects float-dtype initial_basis tensors.
Root cause
FlowGuidedKrylovPipeline.extract_and_select_basis() (src/qvartools/pipeline.py:441) returns a float32 tensor because it clones self.trainer.accumulated_basis, which is kept in float (for gradient tracking during NF training). The values happen to be in {0.0, 1.0}, but the dtype is wrong for the initial_basis contract.
run_hi_nqs_sqd(initial_basis=...) and run_hi_nqs_skqd(initial_basis=...) strictly require initial_basis.dtype to be integer or bool; float32 is rejected with the ValueError above.
The 007 scripts wire pipeline.extract_and_select_basis() directly into run_hi_nqs_sqd(initial_basis=...) with no cast:
# 07_iterative_nqs_dci/iter_nqs_dci_sqd.py:162, 182
basis = pipeline.extract_and_select_basis()
...
nqs_result = run_hi_nqs_sqd(
    hamiltonian, mol_info, config=sqd_config, initial_basis=basis  # <-- float32
)
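The mismatch can be reproduced in isolation, without the pipeline. The following is a minimal sketch that assumes only PyTorch. `check_initial_basis` is a hypothetical stand-in for the validation inside run_hi_nqs_sqd, not the library's actual code, and the tensor is a hand-written substitute for the output of extract_and_select_basis().

```python
import torch


def check_initial_basis(basis: torch.Tensor) -> None:
    """Hypothetical stand-in for the dtype check; not the qvartools implementation."""
    if basis.dtype.is_floating_point or basis.dtype.is_complex:
        raise ValueError(
            "initial_basis must be integer or bool dtype (binary occupations), "
            f"got {basis.dtype}"
        )


# Hand-written substitute for the extracted basis: binary values, float32 dtype.
basis = torch.tensor([[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]])

try:
    check_initial_basis(basis)
    message = None
except ValueError as exc:
    message = str(exc)  # same wording as the error reported above

# The same values pass once the dtype is integer.
check_initial_basis(basis.long())
```

The values are identical in both calls; only the dtype decides whether the check passes.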
Why 008 works (parallel pattern, no bug)
The sister scripts in 08_iterative_nqs_dci_pt2/ do the same extract_and_select_basis but pipe the result through expand_basis_via_connections() before run_hi_nqs_sqd, and that function happens to cast the tensor to long internally. 007 doesn't have that intermediate step, so it hits the raw dtype mismatch.
Fix
Cast the basis tensor with .long() at script level before passing it as initial_basis. See PR (to be linked) for the patch: 2 files, 2-line changes each plus explanatory comments.
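As an illustration of why the cast is safe, here is a small sketch assuming only PyTorch. The tensor is a hand-written stand-in for the return value of pipeline.extract_and_select_basis(), and the variable names are illustrative rather than taken from the patch.

```python
import torch

# Stand-in for the float32 tensor returned by pipeline.extract_and_select_basis().
basis = torch.tensor([[1.0, 1.0, 0.0, 0.0], [1.0, 0.0, 1.0, 0.0]])

# float32 -> int64. The conversion is exact because every entry is 0.0 or 1.0.
basis_long = basis.long()

# Casting back reproduces the original tensor, so no information is lost.
round_trip_ok = torch.equal(basis_long.to(basis.dtype), basis)
```

In the scripts, the cast result is what would be passed as initial_basis in place of the raw float tensor.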
Verification after fix
Both scripts on H2/CPU:
| Script | Wall time | Error | Chem.Acc |
|---|---|---|---|
| iter_nqs_dci_sqd.py | 11.88 s | 0.0 mHa | YES |
| iter_nqs_dci_krylov_classical.py | 9.32 s | 0.0 mHa | YES |
Related architectural concern (out of scope)
The deeper issue is that extract_and_select_basis() returns float32 at a public API boundary, while every downstream runner that accepts initial_basis expects integer/bool. Two cleaner long-term fixes, either of which would eliminate this class of bug permanently:
- Library fix: cast to long at the end of extract_and_select_basis(). This is semantically correct since the values are binary occupations, and low risk since 008 and similar pipelines only use the tensor for indexing.
- Lenient validator: have validate_initial_basis accept float dtype when all values are exactly {0,1} and auto-cast. More forgiving but may hide future misuse.
This issue tracks the script-level quick fix. A follow-up issue could track the architectural cleanup.
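For the follow-up discussion, the lenient-validator option might look roughly like the sketch below. It assumes only PyTorch; `coerce_initial_basis` is a hypothetical name and signature, not the existing validate_initial_basis, whose real interface is not shown in this issue.

```python
import torch


def coerce_initial_basis(basis: torch.Tensor) -> torch.Tensor:
    """Hypothetical lenient check: pass integer/bool through, cast exact-binary floats."""
    dtype = basis.dtype
    if not (dtype.is_floating_point or dtype.is_complex):
        return basis  # integer or bool: already satisfies the contract
    if dtype.is_floating_point and bool(((basis == 0) | (basis == 1)).all()):
        return basis.long()  # every entry is exactly 0.0 or 1.0, so the cast is exact
    raise ValueError(
        "initial_basis must be integer or bool dtype (binary occupations), "
        f"got {dtype} with non-binary values"
    )


accepted = coerce_initial_basis(torch.tensor([[1.0, 0.0], [0.0, 1.0]]))

try:
    coerce_initial_basis(torch.tensor([[0.5, 1.0]]))
    rejected = False
except ValueError:
    rejected = True
```

The trade-off noted above still applies: a float tensor that happens to be binary is accepted silently, so a caller passing the wrong object could go unnoticed.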
Discovery context
Surfaced during end-to-end smoke testing of PR #39 (refactor/pipeline-catalog, 3-digit catalog rename + 010-013 method-as-pipeline entries). Verified via stash-checkout-unstash test that the bug exists on main at the same point — not caused by that PR.