bug: 07_iterative_nqs_dci/iter_nqs_dci_{sqd,krylov_classical}.py fail with initial_basis dtype ValueError #40

@thc1006

Summary

Two of the three pipeline scripts in experiments/pipelines/07_iterative_nqs_dci/ fail end-to-end with a ValueError at the run_hi_nqs_sqd / run_hi_nqs_skqd call site, on any molecule (including the smallest, H2). The bug has been present on main for a while, and nothing in CI appears to exercise these scripts.

Affected scripts

  • experiments/pipelines/07_iterative_nqs_dci/iter_nqs_dci_sqd.py
  • experiments/pipelines/07_iterative_nqs_dci/iter_nqs_dci_krylov_classical.py

The third script in the group (iter_nqs_dci_krylov_quantum.py) does not use initial_basis and is unaffected.

Reproduction

On current main:

python experiments/pipelines/07_iterative_nqs_dci/iter_nqs_dci_sqd.py h2 --device cpu

Expected: runs to completion, prints final energy.

Actual:

ValueError: initial_basis must be integer or bool dtype (binary occupations), got torch.float32
  at src/qvartools/methods/nqs/hi_nqs_sqd.py:377  (validation in run_hi_nqs_sqd)

The traceback ends inside the validation block that rejects float-dtype initial_basis tensors.

Root cause

FlowGuidedKrylovPipeline.extract_and_select_basis() (src/qvartools/pipeline.py:441) returns a float32 tensor because it clones self.trainer.accumulated_basis, which is stored as float (for gradient tracking during NF training). The values happen to be in {0.0, 1.0}, but the dtype violates the initial_basis contract.

run_hi_nqs_sqd(initial_basis=...) and run_hi_nqs_skqd(initial_basis=...) strictly validate initial_basis.dtype to be integer or bool — float32 is rejected with the ValueError above.
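The rejection can be reproduced in isolation. The following is a minimal sketch, assuming the check only inspects the dtype; check_initial_basis_dtype is a hypothetical stand-in, not the actual code in hi_nqs_sqd.py, and the real validation may test more than this.

```python
import torch


def check_initial_basis_dtype(initial_basis: torch.Tensor) -> None:
    """Hypothetical stand-in for the dtype validation in run_hi_nqs_sqd."""
    dtype = initial_basis.dtype
    # Bool and integer dtypes pass; floating-point and complex dtypes are rejected.
    if dtype.is_floating_point or dtype.is_complex:
        raise ValueError(
            "initial_basis must be integer or bool dtype (binary occupations), "
            f"got {dtype}"
        )


# A float32 tensor holding only 0.0 and 1.0, like the one the pipeline returns.
float_basis = torch.tensor(
    [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]], dtype=torch.float32
)

try:
    check_initial_basis_dtype(float_basis)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```

The values are valid occupations, yet the check fails purely on dtype, which is why a cast alone is enough to resolve it.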

The 007 scripts (07_iterative_nqs_dci/ on main) pass the result of pipeline.extract_and_select_basis() directly to run_hi_nqs_sqd(initial_basis=...) with no cast:

# 07_iterative_nqs_dci/iter_nqs_dci_sqd.py:162, 182
basis = pipeline.extract_and_select_basis()
...
nqs_result = run_hi_nqs_sqd(
    hamiltonian, mol_info, config=sqd_config, initial_basis=basis  # <-- float32
)

Why 008 works (parallel pattern, no bug)

The sister scripts in 08_iterative_nqs_dci_pt2/ make the same extract_and_select_basis() call but pass the result through expand_basis_via_connections() before run_hi_nqs_sqd, and that function happens to cast the tensor to long internally. 007 has no such intermediate step, so it hits the raw dtype mismatch.

Fix

Apply a script-level .long() cast to the basis tensor before passing it as initial_basis. See PR (to be linked) for the patch: 2 files, a 2-line change in each plus explanatory comments.
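As an illustration of the cast (not the patch itself), here is a minimal sketch that uses a hand-built float32 tensor in place of the value returned by pipeline.extract_and_select_basis(). The binary-value guard is an added assumption and is not described as part of the patch.

```python
import torch

# Stand-in for the tensor returned by pipeline.extract_and_select_basis():
# float32 storage, but every entry is exactly 0.0 or 1.0.
basis = torch.tensor(
    [[1.0, 1.0, 0.0, 0.0], [1.0, 0.0, 1.0, 0.0]], dtype=torch.float32
)

# Optional guard (an assumption, not part of the described patch):
# .long() truncates toward zero, so a stray 0.999 would silently become 0.
assert bool(torch.all((basis == 0) | (basis == 1))), "basis is not binary"

# The script-level fix: cast to int64 before passing as initial_basis.
initial_basis = basis.long()

print(initial_basis.dtype)
```

The guard is cheap for small bases and makes the cast lossless by construction; without it, the cast still satisfies the dtype check but could hide non-binary values.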

Verification after fix

Both scripts on H2/CPU:

| Script | Wall time | Error | Chemical accuracy |
|---|---|---|---|
| iter_nqs_dci_sqd.py | 11.88 s | 0.0 mHa | YES |
| iter_nqs_dci_krylov_classical.py | 9.32 s | 0.0 mHa | YES |
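Since nothing in CI currently exercises these scripts, the manual check above could be automated. The following is a rough sketch only: smoke_command and run_smoke are hypothetical helpers, the timeout is arbitrary, and it assumes iter_nqs_dci_krylov_classical.py accepts the same `h2 --device cpu` arguments as the reproduction command.

```python
import subprocess
import sys
from pathlib import Path

# Directory and script names as listed in this issue.
PIPELINE_DIR = Path("experiments/pipelines/07_iterative_nqs_dci")
SCRIPTS = ["iter_nqs_dci_sqd.py", "iter_nqs_dci_krylov_classical.py"]


def smoke_command(script: str) -> list[str]:
    """Build the H2/CPU command line for one pipeline script."""
    return [sys.executable, str(PIPELINE_DIR / script), "h2", "--device", "cpu"]


def run_smoke(script: str, timeout_s: int = 300) -> subprocess.CompletedProcess:
    """Run one script from the repository root and capture its output."""
    return subprocess.run(
        smoke_command(script), capture_output=True, text=True, timeout=timeout_s
    )


if __name__ == "__main__":
    for script in SCRIPTS:
        result = run_smoke(script)
        # A non-zero exit code (e.g. the ValueError above) fails the smoke test.
        status = "ok" if result.returncode == 0 else "FAILED"
        print(f"{script}: {status}")
```

Wrapping run_smoke in a parametrized test would let the same two script names drive a CI job.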

Related architectural concern (out of scope)

The deeper issue is that extract_and_select_basis() returns float32 at a public API boundary, while every downstream runner that accepts initial_basis expects integer/bool. Two cleaner long-term fixes, either of which would eliminate this class of bug permanently:

  1. Library fix: cast to long at the end of extract_and_select_basis(). This is semantically correct since the values are binary occupations, and low risk since 008 and similar pipelines only use the tensor for indexing.
  2. Lenient validator: have validate_initial_basis accept float dtype when all values are exactly {0,1} and auto-cast. More forgiving, but it may hide future misuse.
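To make the second option concrete, here is a minimal sketch of a lenient check. coerce_initial_basis is a hypothetical name, and its behaviour (pass integer/bool through, cast exact-binary floats, reject everything else) is one possible reading of the proposal, not the existing validate_initial_basis.

```python
import torch


def coerce_initial_basis(initial_basis: torch.Tensor) -> torch.Tensor:
    """Hypothetical lenient validator: auto-cast floats that are exactly binary."""
    dtype = initial_basis.dtype
    if dtype.is_complex:
        raise ValueError(f"initial_basis must not be complex, got {dtype}")
    if dtype.is_floating_point:
        # Accept only exact 0.0 / 1.0 entries; NaN compares unequal to both.
        is_binary = torch.all((initial_basis == 0) | (initial_basis == 1))
        if not bool(is_binary):
            raise ValueError(
                "float initial_basis must contain only 0.0 and 1.0 to be auto-cast"
            )
        return initial_basis.long()
    # Integer and bool tensors already satisfy the contract; return unchanged.
    return initial_basis


casted = coerce_initial_basis(torch.tensor([[1.0, 0.0], [0.0, 1.0]]))
print(casted.dtype)
```

The trade-off noted in the list shows up directly: a caller passing float data never sees an error as long as the values are binary, so a dtype mistake upstream goes unnoticed.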

This issue tracks the script-level quick fix. A follow-up issue could track the architectural cleanup.

Discovery context

Surfaced during end-to-end smoke testing of PR #39 (refactor/pipeline-catalog, 3-digit catalog rename + 010-013 method-as-pipeline entries). Verified via stash-checkout-unstash test that the bug exists on main at the same point — not caused by that PR.
