Mix batch 0429 #5439
📝 Walkthrough

This PR implements mixed-batch LMDB training for DeepMD-kit's PyTorch backend, enabling frames with differing atom counts in single batches. New flat-batch forward paths flatten per-atom tensors across frames, precompute graph structures (neighbor lists, edge/angle indices), and reorganize model inference around flattened atom representations. The LMDB collator flattens atom-wise data and optionally precomputes graph metadata. Loss normalization, data loading, and training infrastructure are updated to detect and handle mixed batches via batch/ptr bookkeeping.

Changes: Mixed-batch LMDB Training Pipeline
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks: ✅ 3 passed | ❌ 2 failed (1 warning, 1 inconclusive)
Actionable comments posted: 14
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
deepmd/pt/loss/ener.py (1)
412-424: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Generalized-force loss still assumes a single `natoms` for every frame.

This branch reshapes `force` and `drdq` with `natoms * 3`, which only works for uniform-size batches. In mixed-batch mode those tensors are flattened across frames, so enabling generalized-force loss here will mis-shape the data or crash.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@deepmd/pt/loss/ener.py` around lines 412 - 424, The generalized-force branch in ener.py assumes a single natoms by reshaping force_pred, force_label and drdq with natoms * 3, which breaks for mixed-size batches; instead, use frame-aware shapes: do not flatten all frames into one dimension—reshape or iterate using per-frame atom counts (or keep forces as (nframes, natoms_i, 3) and drdq as (nframes, natoms_i*3, n_gcoord)) and compute gen_force and gen_force_label per-frame (or via batched operations that use the original nframes dimension) so tensor dims align; also apply find_drdq/pref_gf per-frame. Locate symbols: has_gf, drdq, find_drdq, pref_gf, force_pred, force_label, force_reshape_nframes, force_label_reshape_nframes, drdq_reshape, gen_force, gen_force_label and update the reshaping/Einstein-summation to operate over the nframes dimension (or loop over frames) instead of using natoms * 3 across the whole batch.
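A minimal sketch of a frame-aware variant, assuming flattened `[total_atoms * 3]` force tensors and a per-frame list of `drdq` matrices; all names and the plain per-frame loop are illustrative, not the repository's implementation:

```python
import torch

def gen_force_loss_mixed(force_pred, force_label, drdq_list, ptr):
    # Accumulate the generalized-force loss frame by frame, so each
    # frame's own natoms_i drives the reshape instead of a global natoms.
    nframes = ptr.numel() - 1
    loss = force_pred.new_zeros(())
    for i in range(nframes):
        start, end = int(ptr[i]), int(ptr[i + 1])
        f_p = force_pred[start * 3 : end * 3]   # [natoms_i * 3]
        f_l = force_label[start * 3 : end * 3]
        drdq = drdq_list[i]                     # [natoms_i * 3, n_gcoord]
        gen_force = torch.einsum("i,ij->j", f_p, drdq)
        gen_force_label = torch.einsum("i,ij->j", f_l, drdq)
        loss = loss + torch.square(gen_force - gen_force_label).mean()
    return loss / nframes
```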
🧹 Nitpick comments (13)
deepmd/pt/model/descriptor/repflows.py (1)
840-874: 💤 Low value

Make the disjoint-frame invariant explicit.

The synthetic batch=1 wrap is only correct because `RepFlowLayer.forward` is strictly per-atom + per-neighbor, and because `nlist`/`a_nlist` here never cross frame boundaries (the LMDB collator builds per-frame extended atoms and offsets neighbor indices accordingly, then maps back via per-frame `mapping`). A brief comment near this block stating that invariant — and noting that any future RepFlowLayer change introducing global ops over the batch axis would silently mix frames — will save a future debugging session.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@deepmd/pt/model/descriptor/repflows.py` around lines 840 - 874, Add a short explanatory comment above the synthetic batch=1 wrapping that states the disjoint-frame invariant: explain that the temporary batching of tensors like node_ebd_batched, edge_ebd_batched, h2_batched, angle_ebd_batched, nlist_batched, a_nlist_batched, etc. is safe because RepFlowLayer.forward operates strictly per-atom and per-neighbor and the LMDB collator guarantees nlist/a_nlist never reference atoms across different frames (mapping/offsets keep frames disjoint); also warn that if RepFlowLayer.forward (or any layer in self.layers) later introduces global ops over the batch axis the synthetic batch will mix frames and break correctness, so such changes must preserve the per-frame isolation or remove this batching trick.

deepmd/pt/utils/nlist.py (2)
101-170: 🏗️ Heavy lift

Per-frame Python loop with `.item()` calls forces a CPU/GPU sync each iteration.

For large `batch_size` (e.g., 128 in `lmdb_baseline.json`), this loop incurs `2 * batch_size` device syncs from `int(ptr[...].item())` plus per-frame kernel launches for `extend_coord_with_ghosts_with_images` and `build_neighbor_list`. Even though it runs in dataloader workers, this can dominate end-to-end throughput when frames have small atom counts. Two relatively cheap mitigations (the first is sketched after the prompt below):

- Pull `ptr` to CPU once before the loop (`ptr_cpu = ptr.cpu().tolist()`) and index lists directly, eliminating per-iteration syncs.
- If the data loader ultimately runs on CPU tensors (typical for collate_fn), the per-frame work is largely free of device sync but the loop still costs Python overhead — consider a vectorized batched ghost-extension that operates on a padded tensor and then re-flattens, mirroring the existing `extend_input_and_build_neighbor_list_with_images`.

Not blocking, but worth measuring before this lands as the default mixed-batch path.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@deepmd/pt/utils/nlist.py` around lines 101 - 170, The loop is causing repeated device syncs via int(ptr[frame_idx].item()) and per-frame kernel overhead in extend_coord_with_ghosts_with_images and build_neighbor_list; fix by materializing ptr on CPU once (e.g., ptr_cpu = ptr.cpu().tolist()) and use ptr_cpu[frame_idx]/ptr_cpu[frame_idx+1] for start_idx/end_idx to eliminate .item() calls, and where possible call normalize_coord/extend_coord_with_ghosts_with_images/build_neighbor_list on already-CPU tensors or add a batched/padded variant (similar to extend_input_and_build_neighbor_list_with_images) to process frames in a vectorized way to reduce Python per-frame overhead.
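A minimal sketch of the first mitigation, assuming `ptr` is the usual `[nframes + 1]` offset tensor (loop body elided; names follow the review, not verified against the file):

```python
# Pull frame offsets to host memory once, instead of calling
# int(ptr[i].item()) twice per iteration (each .item() is a device sync).
ptr_cpu = ptr.cpu().tolist()
for frame_idx in range(len(ptr_cpu) - 1):
    start_idx = ptr_cpu[frame_idx]
    end_idx = ptr_cpu[frame_idx + 1]
    frame_coord = coord[start_idx:end_idx]
    # ... per-frame ghost extension and neighbor-list build, unchanged ...
```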
262-281: 💤 Low value

Document the "first nloc extended atoms per frame are local" assumption.

`get_central_ext_index` relies on the layout produced by `_extend_coord_with_ghosts_impl`, where shift indices are sorted by L2 norm and `(0,0,0)` is therefore the first image — making the first `nloc` extended atoms in each frame the local ones. This invariant is implicit; a one-line comment will save a future maintainer who changes the ghost layout from silently breaking this helper.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@deepmd/pt/utils/nlist.py` around lines 262 - 281, Add a one-line doc comment inside get_central_ext_index stating the invariant it relies on: that the extended layout produced by _extend_coord_with_ghosts_impl places the unshifted image (shift (0,0,0)) first for each frame and thus the first nloc extended atoms per frame are the local atoms; mention that shift indices are sorted by L2 norm so the (0,0,0) image is first and changing that ordering will break get_central_ext_index.

.gitignore (1)
81-81: 💤 Low value

`deepmd-kit/` ignore pattern is overly broad.

Ignoring a directory named after the project itself can mask legitimate content if the repo or a checkout is ever placed under a `deepmd-kit/` subdirectory (e.g., nested workspaces, tooling that clones into a project-named folder). Consider narrowing or scoping the rule to only the artifact path you intend to ignore.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.gitignore at line 81, The .gitignore entry "deepmd-kit/" is too broad and may unintentionally ignore legitimately nested repos; update the ignore rule that currently uses deepmd-kit/ to a more specific pattern (for example scope it to the repo root by changing to "/deepmd-kit/" or better yet target only build/dist/artifact paths like "/deepmd-kit/build", "/deepmd-kit/dist", or specific filenames/patterns such as "deepmd-kit/*.egg-info") so only the intended artifact directories/files are ignored; locate the deepmd-kit/ entry in the .gitignore and replace it with one or more narrow, explicit patterns.

deepmd/pt/model/task/invar_fitting.py (2)
299-315: 💤 Low value

Atom-wise output detection is shape-heuristic; consider being explicit.

The current rule un-pads any tensor whose first two dims happen to be `(nframes, max_nloc)` and leaves everything else untouched. That works for the current outputs (`var_name`, `middle_output`), but a future output that returns frame-level shape `(nframes, max_nloc)` would be silently un-padded, and any per-atom output whose first axis is something other than `nframes` would be silently passed through. Consider keying off `output_def()` (which already declares atomic vs. frame-level variables) instead of tensor shape.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@deepmd/pt/model/task/invar_fitting.py` around lines 299 - 315, The tensor un-padding currently uses a shape heuristic in the loop that builds result_flat; instead use the model's declared output types from output_def() to decide which keys are atom-wise. Replace the shape-based condition in the for key, value in result_batch loop with a lookup like output_def()[key].is_atomic (or equivalent in your model API) and only apply valid_atom_mask to values for keys declared atomic; leave frame-level or other outputs untouched and handle non-tensor values as before (keep variable names result_batch, result_flat, valid_atom_mask, and call output_def() to determine atomic vs frame-level).
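A sketch of keying off the declared definitions; the `var_defs` accessor and `atomic` attribute are assumptions about the output-definition API, flagged in the comments:

```python
# Illustrative only: classify keys from the declared output definition.
# `var_defs` / `.atomic` are assumed accessor names, not verified here.
atomic_keys = {
    name
    for name, vdef in self.output_def().var_defs.items()
    if getattr(vdef, "atomic", False)
}
for key, value in result_batch.items():
    if isinstance(value, torch.Tensor) and key in atomic_keys:
        result_flat[key] = value[valid_atom_mask]  # un-pad atom-wise outputs
    else:
        result_flat[key] = value  # frame-level / non-tensor: pass through
```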
259-264: 💤 Low value

Padding `atype = -1` silently wraps to the last type in PyTorch indexing.

Per-type lookups inside `self.forward` (`bias_atom_e` / case embedding / etc.) use `tensor[atype]`, where `-1` does not error out — it wraps to the last type. Outputs are masked away later, so this does not corrupt the result, but it does waste compute on the padding rows and can interact unexpectedly with any branch that special-cases negative types. Using a valid sentinel like `0` is safer and semantically clearer; the `valid_atom_mask` continues to discard the rows.

♻️ Suggested change

```diff
-        atype_batch = torch.full(
-            (nframes, max_nloc),
-            -1,
-            dtype=atype.dtype,
-            device=device,
-        )
+        # Use 0 as a safe dummy type for padding rows; outputs are discarded
+        # by valid_atom_mask below so the value is not observable.
+        atype_batch = torch.zeros(
+            (nframes, max_nloc),
+            dtype=atype.dtype,
+            device=device,
+        )
```

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@deepmd/pt/model/task/invar_fitting.py` around lines 259 - 264, The padding value for atype_batch is set to -1 which causes PyTorch indexing (e.g., in self.forward when doing per-type lookups like bias_atom_e[atype] or embeddings) to wrap to the last type; change the sentinel to a valid non-negative index (e.g., 0 or a dedicated PAD type index) when creating atype_batch so lookups won't accidentally reference real types, and keep using valid_atom_mask to discard those rows downstream; update places that assume negative sentinel behavior to use the new PAD index (refer to atype_batch, self.forward, bias_atom_e, and valid_atom_mask).

deepmd/pt/model/network/graph_utils_flat.py (1)
41-89: 💤 Low value

LGTM.

Edge index construction (`n2e_index`/`n_ext2e_index`) and the angle index mapping via `edge_lookup` are consistent with the consumers in `repflows.py` and the dynamic-selection path. The `a_nlist_mask_3d` logic correctly requires both neighbors of an angle to be valid before producing a triplet. The single CPU sync via `n_edge = ... .item()` is necessary for `torch.arange`.

One small optional suggestion: an `assert a_sel <= nnei` near the top would surface mis-configured inputs as a clear error rather than a downstream broadcast failure (`DescrptDPA3` already enforces `e_sel >= a_sel`, so this is mostly defensive).

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@deepmd/pt/model/network/graph_utils_flat.py` around lines 41 - 89, Add a defensive check to guard against mis-configured inputs by asserting that a_sel (from a_nlist_mask.shape[1]) does not exceed nnei (from nlist_flat.shape[1]) near the top of the function; insert something like an assert or raise ValueError after computing nnei and a_sel so that functions using n2e_index, edge_lookup_a, and the a_nlist_mask_3d broadcasts fail fast with a clear message instead of producing downstream broadcast errors.

deepmd/pt/model/model/make_model.py (3)
520-599: 💤 Low value

`forward_common_flat` is a pure passthrough to `forward_common_flat_native`.

The wrapper adds no behavior — same parameters, same return — and just forwards every argument. Two near-identical method definitions (with duplicated docstrings) is a maintenance hazard and adds friction for readers tracing the call. Either (a) inline `forward_common_flat_native` into `forward_common_flat` and drop the `_native` variant, or (b) document a clear reason for the split (e.g. JIT export boundary, planned non-native variant) so the duplication is justified.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@deepmd/pt/model/model/make_model.py` around lines 520 - 599, forward_common_flat is a pure passthrough to forward_common_flat_native with duplicated docstrings, creating maintenance overhead; either remove the redundant wrapper by merging forward_common_flat_native into forward_common_flat (rename/remove the _native variant and keep a single method and docstring) or add a concise comment/docstring to forward_common_flat explaining the explicit reason for the split (e.g., JIT/export boundary or future divergence). Update any call sites that reference forward_common_flat_native to use the single chosen method name (forward_common_flat) and ensure tests/type hints still pass.
296-309: 💤 Low value

Move the helper import to module level and tidy the validation order.

Two minor suggestions:

- `from deepmd.pt.utils.nlist import rebuild_extended_coord_from_flat_graph` is imported every call inside the hot forward path (line 296). Promote it to the existing nlist import block at the top of the file.
- The `requires_grad_(True)` work on `coord`/`box` at lines 280-282 happens before the field-presence validation at lines 283-309. If the validation raises, the clone+detach is wasted. Reorder so the validation runs first.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@deepmd/pt/model/model/make_model.py` around lines 296 - 309, Move the helper import for rebuild_extended_coord_from_flat_graph out of the hot forward path and add it to the module-level nlist import block with the other deepmd.pt.utils.nlist imports so it is imported once at load time; also change the validation order inside the flat mixed-batch branch so you check for presence of the required graph fields (mapping, extended_batch, extended_image, etc.) and raise the RuntimeError before mutating tensors, and only call coord.requires_grad_(True) / box.requires_grad_(True) (and any clone+detach) after the validation passes to avoid wasted work when validation fails.
311-314: 💤 Low value

Redundant assertions after the explicit None-check.

Lines 283-295 already verify `extended_atype`, `extended_batch`, `mapping`, `nlist`, `nlist_ext`, etc. are not None (otherwise the `else` raises), so the four `assert` statements at lines 311-314 are redundant at runtime. If they exist solely for type-narrowing under static checkers, a comment to that effect would help future readers. Otherwise they can be removed.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@deepmd/pt/model/model/make_model.py` around lines 311 - 314, The four assert statements for extended_atype, extended_batch, mapping, and nlist are redundant because the preceding explicit None-check/else already guarantees they are not None; either remove those assert lines or replace them with a short explanatory comment indicating they exist purely for static type-narrowing (for tools like mypy) so future readers understand their purpose; locate the checks around the make_model function where extended_atype, extended_batch, mapping, and nlist are validated and update accordingly.

deepmd/pt/train/training.py (1)
1690-1723: 💤 Low value

Minor structural cleanup in `get_data` for mixed-batch handling.

The current loop skips `batch`/`ptr` (lines 1697-1698) only to transfer them separately at lines 1722-1723. Since they are tensors, they would be handled correctly by the generic `not isinstance(..., list)` branch at line 1699-1701. You can drop the explicit skip and the post-loop transfer, then just extend `input_keys` when `is_mixed_batch` is true:

♻️ Proposed simplification

```diff
         for key in batch_data.keys():
             if key == "sid" or key == "fid" or "find_" in key:
                 continue
-            # Skip batch and ptr for now, will handle them separately
-            elif key == "batch" or key == "ptr":
-                continue
             elif not isinstance(batch_data[key], list):
                 if batch_data[key] is not None:
                     batch_data[key] = batch_data[key].to(DEVICE, non_blocking=True)
@@
         # Mixed-nloc LMDB batches include precomputed flat-graph tensors.
         if is_mixed_batch:
             input_keys = input_keys + list(_FLAT_GRAPH_INPUT_KEYS)
-            batch_data["batch"] = batch_data["batch"].to(DEVICE, non_blocking=True)
-            batch_data["ptr"] = batch_data["ptr"].to(DEVICE, non_blocking=True)
```

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@deepmd/pt/train/training.py` around lines 1690 - 1723, In get_data, remove the special-case skipping of "batch" and "ptr" inside the batch_data loop and delete the separate post-loop device transfers for batch_data["batch"] and batch_data["ptr"]; instead let the existing non-list tensor branch handle them (they'll be moved to DEVICE automatically), and only keep the is_mixed_batch detection so you can extend input_keys with list(_FLAT_GRAPH_INPUT_KEYS) when is_mixed_batch is true (symbols: get_data, batch_data, is_mixed_batch, input_keys, _FLAT_GRAPH_INPUT_KEYS, DEVICE).

deepmd/pt/model/model/ener_model.py (1)
92-176: ⚡ Quick win

Extract the shared output-assembly block to avoid maintaining two near-identical paths.

Lines 117-143 (flat path) and 153-174 (regular path) build `model_predict` in essentially the same way (`atom_energy`/`energy`, conditional `force`/`dforce`, virials, `mask`, `hessian`). Two parallel implementations are a maintenance hazard — future output keys or shape adjustments must be mirrored in both, and any drift becomes silent. Consider extracting a small helper like `_assemble_energy_predict(model_ret, do_atomic_virial)` and calling it from both branches, with the flat branch additionally adding `model_predict["updated_coord"] += coord` in the no-fitting-net case.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@deepmd/pt/model/model/ener_model.py` around lines 92 - 176, The two branches in forward (flat and regular) duplicate the same output-assembly logic for model_predict; extract that into a helper method (e.g. _assemble_energy_predict(model_ret, do_atomic_virial)) that encapsulates setting "atom_energy", "energy", choosing "force" vs "dforce" using do_grad_r("energy"), adding "virial" and "atom_virial" via do_grad_c("energy") and do_atomic_virial, including "mask" and "hessian" when present/when _hessian_enabled; then call this helper from both the flat-path and regular-path after receiving model_ret, and keep the flat/regular-specific behavior (the no-fitting-net case that does model_predict["updated_coord"] += coord) in the respective branch only. Ensure the helper references self.do_grad_r, self.do_grad_c, self._hessian_enabled and self.get_fitting_net() behavior remains consistent.
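A sketch of the extracted helper; the output keys follow deepmd's `energy_derv_*` naming referenced elsewhere in this review, but the exact key set and any squeezing should be copied from the two existing branches:

```python
def _assemble_energy_predict(self, model_ret, do_atomic_virial):
    # Shared output assembly for the flat and regular paths.
    model_predict = {
        "atom_energy": model_ret["energy"],
        "energy": model_ret["energy_redu"],
    }
    if self.do_grad_r("energy"):
        model_predict["force"] = model_ret["energy_derv_r"]
    else:
        model_predict["dforce"] = model_ret["dforce"]
    if self.do_grad_c("energy"):
        model_predict["virial"] = model_ret["energy_derv_c_redu"]
        if do_atomic_virial:
            model_predict["atom_virial"] = model_ret["energy_derv_c"]
    if "mask" in model_ret:
        model_predict["mask"] = model_ret["mask"]
    if self._hessian_enabled and "hessian" in model_ret:
        model_predict["hessian"] = model_ret["hessian"]
    return model_predict
```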
deepmd/pt/train/wrapper.py (1)

237-247: 💤 Low value

`natoms` semantics now diverge between flat and regular batches; document the contract.

For regular batches `atype.dim() > 1` and `natoms = atype.shape[-1]` is per-frame nloc, but for mixed batches `atype` is 1-D and `natoms = atype.shape[0]` is the total flattened atom count across all frames. Downstream loss code must distinguish these by inspecting `ptr` in `input_dict` to normalize correctly — that's a fragile, implicit contract spread across `wrapper.py`, `ener.py`, and `make_model.py`.

Consider either (a) renaming/splitting the argument so callees see `total_atoms` vs `nloc_per_frame` explicitly, or (b) adding a docstring/inline note here that clearly states the dual semantics and that the loss must branch on `ptr`. The current inline comment on lines 237-238 helps but does not surface the downstream consumer expectation.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@deepmd/pt/train/wrapper.py` around lines 237 - 247, The natoms value currently has dual semantics (per-frame nloc when atype.dim()>1 vs total flattened atom count when 1-D) which is implicit and fragile; fix by making the contract explicit: compute both nloc_per_frame (e.g., atype.shape[-1] when atype.dim()>1 else None) and total_atoms (atype.numel() or atype.shape[0]) in wrapper.py and change the call to self.loss[task_key](input_dict, self.model[task_key], label, nloc_per_frame=nloc_per_frame, total_atoms=total_atoms, learning_rate=cur_lr), then update all loss implementations (ener.py) and any factory in make_model.py to accept these new named args (or provide backward-compatible handling) so downstream code no longer must infer semantics from ptr.
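A small sketch of option (a); the keyword names are suggestions, not existing API:

```python
# Make both semantics explicit before calling into the loss.
if atype.dim() > 1:
    nloc_per_frame = atype.shape[-1]  # uniform batches: per-frame nloc
else:
    nloc_per_frame = None             # mixed batches: no single nloc
total_atoms = atype.numel()           # well-defined in both layouts

model_pred, loss, more_loss = self.loss[task_key](
    input_dict,
    self.model[task_key],
    label,
    nloc_per_frame=nloc_per_frame,    # hypothetical kwargs
    total_atoms=total_atoms,
    learning_rate=cur_lr,
)
```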
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In @.gitignore:
- Line 78: The .gitignore entry "test_mptraj/" conflicts with tracked files
committed in this PR (test_mptraj/lmdb_baseline.json and
test_mptraj/lmdb_mixed_batch.json); fix by either removing the directory-level
ignore line "test_mptraj/" or by adding explicit negation patterns for the
tracked files (e.g., "!test_mptraj/lmdb_baseline.json" and
"!test_mptraj/lmdb_mixed_batch.json") so future edits to those config files are
not hidden and git add won't require -f; update .gitignore accordingly and
ensure the two committed files remain tracked.
In `@deepmd/pt/loss/ener.py`:
- Around line 228-240: The mixed-batch normalization currently computes a single
scalar atom_norm via atom_norm = mean(1.0 / natoms_per_frame) which misweights
frames with different atom counts; instead compute per-frame normalizers and
apply them before reducing l2_ener_loss and l2_virial_loss: when is_mixed_batch
(detect via ptr and natoms_per_frame derived from ptr), build per-frame
atom_norms = 1.0 / natoms_per_frame.float() and multiply each frame's loss by
its corresponding atom_norm before taking the final mean/sum (or perform a
weighted reduction using these per-frame weights) so that l2_ener_loss and
l2_virial_loss are normalized per-frame correctly rather than using a single
averaged scalar.
In `@deepmd/pt/model/atomic_model/dp_atomic_model.py`:
- Around line 378-390: The flat-path call to descriptor.forward_flat currently
passes fparam unmodified and thus skips the fallback logic used in
forward_atomic; update the flat-path before calling descriptor.forward_flat (the
block around descriptor.forward_flat) to mirror forward_atomic's behavior: if
self.add_chg_spin_ebd (or the equivalent flag) is True and fparam is None,
compute the default via self.get_default_fparam(...) using the same inputs/shape
logic used in forward_atomic and assign it to fparam, then pass that fparam into
descriptor.forward_flat so mixed-batch runs use the same default frame
parameters.
In `@deepmd/pt/model/model/__init__.py`:
- Around line 90-98: The hybrid-descriptor logic in
_set_default_descriptor_init_seed (and the analogous block later) always derives
child seeds from the passed-in default `seed` instead of the effective parent
config seed; update the logic to first determine the parent_seed =
params.get("seed", seed) (i.e., use params["seed"] when present, otherwise fall
back to the incoming seed) and then call child_seed(parent_seed, idx) when
recursing into each hybrid child (preserving the existing
isinstance(descriptor_params, dict) check and recursive call to
_set_default_descriptor_init_seed).
In `@deepmd/pt/model/model/make_model.py`:
- Around line 497-518: The flat-path branch in make_model.py quietly skips
atomic virial when do_grad_c("energy") is true and do_atomic_virial is true (the
pass under if do_atomic_virial), causing fit_ret to miss
atom_virial/energy_derv_c keys and silently breaking callers (e.g.,
EnergyModel.forward reading model_ret["energy_derv_c"]); update the branch in
the block handling energy_derv_c_redu so that if do_atomic_virial is true you
either (A) compute and populate the appropriate atomic virial / energy_derv_c
entries in fit_ret consistent with the reduced-to-atom shape used elsewhere (use
energy_redu, energy_derv_c_redu and ptr to derive per-atom contributions), or if
that is nontrivial, (B) raise a clear NotImplementedError (or emit a one-time
logged warning and raise) referencing do_atomic_virial and the flat-path so
callers know the feature is unsupported; ensure the error/warning mentions
make_model.make_model (or the surrounding function) and that fit_ret must
include the keys model_predict/EnergyModel.forward expects (e.g.,
"energy_derv_c" / atom_virial) to avoid silent failures.
- Around line 217-352: The flat forward path forward_common_flat_native is
missing the same input/output precision normalization used in forward_common;
call the model's _input_type_cast on incoming tensors (coord, box, fparam,
aparam, atype, batch, mapping, extended_* inputs as appropriate) at the start so
coord/box are promoted to global_pt_float_precision (and energy-related inputs
to global_pt_ener_float_precision if required) before doing
clone().detach().requires_grad_(True) or rebuilding extended_coord, and after
computing model_predict_lower run _output_type_cast to promote outputs
(especially energy_redu) back to global_pt_ener_float_precision; ensure you
reference the existing helpers _input_type_cast and _output_type_cast and keep
their casting behavior identical to forward_common.
In `@deepmd/pt/train/training.py`:
- Around line 264-299: When _data.mixed_batch is true but the model descriptor
lacks repflows, raise an immediate, clear error instead of constructing a
dataloader with graph_config=None; inside the mixed_batch branch (where
descriptor is obtained from model_for_graph.atomic_model.descriptor and
make_lmdb_mixed_batch_collate is called), check hasattr(descriptor, "repflows")
and if False raise a ValueError/RuntimeError with a message like
"mixed_batch=True requires a flat-graph capable descriptor with 'repflows' (used
by make_lmdb_mixed_batch_collate and
EnergyModel.forward/forward_common_flat_native) — disable mixed_batch or provide
a compatible descriptor." This ensures failure occurs at dataloader construction
rather than later in EnergyModel.forward.
In `@deepmd/pt/utils/lmdb_dataset.py`:
- Around line 42-49: The __all__ export list is not alphabetically sorted which
triggers Ruff RUF022; reorder the list items in the __all__ variable so they are
lexicographically sorted (e.g., "_collate_lmdb_batch",
"_collate_lmdb_mixed_batch", "LmdbDataset", "LmdbTestData", "is_lmdb",
"make_lmdb_mixed_batch_collate") while preserving the existing string entries,
quotes, and list structure so the symbol names (LmdbDataset, LmdbTestData,
_collate_lmdb_batch, _collate_lmdb_mixed_batch, make_lmdb_mixed_batch_collate,
is_lmdb) remain unchanged.
- Around line 247-255: Mixed batching currently replaces the shuffle-based
DataLoader and ignores computed block weights, so when mixed_batch is True you
must construct _inner_dataloader to honor _block_targets computed by
compute_block_targets() rather than using DataLoader(..., shuffle=True). Replace
the plain shuffle=True approach in the mixed_batch branch by creating and
passing a sampler that uses _block_targets (e.g.,
torch.utils.data.WeightedRandomSampler or an equivalent custom sampler) to
DataLoader, keep num_workers=0 and collate_fn=_collate_lmdb_mixed_batch, and
ensure _inner_dataloader is assigned the DataLoader configured with that sampler
so auto_prob_style/block weighting still affects sampling.
- Around line 51-61: The whitelisted atom-wise keys in
_ATOMWISE_MIXED_BATCH_KEYS are missing supported labels (e.g. "atom_ener",
"drdq"), causing those fields to be handled by collate_tensor_fn and break
mixed-batch logic; update the frozenset declaration named
_ATOMWISE_MIXED_BATCH_KEYS to include all atom-wise labels used elsewhere (at
minimum "atom_ener" and "drdq") and make the same addition to the duplicate
whitelist further down (the other frozenset at lines ~99-104) so mixed-batch
flattening consistently handles these fields instead of falling back to
collate_tensor_fn.
In `@deepmd/pt/utils/nlist.py`:
- Around line 246-259: The code currently inverts atom_cell (atom_cell =
cell[extended_batch]) per extended atom which is wasteful; instead compute the
inverse once per frame by calling torch.linalg.inv_ex on cell (after cell =
box.reshape(-1,3,3)) and then index the resulting rec_cell with extended_batch
(i.e., rec_cell_full, _ = torch.linalg.inv_ex(cell); rec_cell =
rec_cell_full[extended_batch]) so subsequent einsum uses the indexed per-atom
inverses without repeating the matrix inversion.
In `@test_mixed_batch.sh`:
- Around line 15-19: The script prints "Check mixed_batch_train.log for details"
but never writes that file; fix by redirecting the training command's
stdout/stderr to mixed_batch_train.log (e.g., change the dp --pt train
test_mptraj/lmdb_mixed_batch.json --skip-neighbor-stat invocation to append or
write both stdout and stderr into mixed_batch_train.log) or, if you prefer not
to create a log, update the echo lines to not reference mixed_batch_train.log;
modify the dp invocation or the echo accordingly to keep the message accurate.
In `@test_mptraj/lmdb_baseline.json`:
- Around line 195-205: The baseline config currently hardcodes
developer-specific absolute paths in the keys stat_file, training_data.systems,
and validation_data.systems which breaks portability; change these values to
reusable placeholders or environment-variable expansions (e.g. ${DATASET_PATH})
or point them to a small example dataset shipped in the repo (mirror other
example configs), update any code that loads the config to resolve those env
vars, and add a short note in the config or README explaining how
test_mixed_batch.sh sets or substitutes those variables before running.
In `@test_mptraj/lmdb_mixed_batch.json`:
- Around line 74-84: The fixture uses hard-coded absolute paths in stat_file and
LMDB system entries (stat_file, training_data.systems, validation_data.systems)
which prevents CI portability; update the test_mptraj/lmdb_mixed_batch.json
fixture to reference repo-managed test data or injected paths by replacing
absolute paths with relative paths inside the repo (e.g., a test-data directory)
or with placeholders/env variables read by the test runner, and ensure the test
harness supplies those fixtures before running so the config uses
repo-controlled files rather than machine-specific absolute paths.
---
Outside diff comments:
In `@deepmd/pt/loss/ener.py`:
- Around line 412-424: The generalized-force branch in ener.py assumes a single
natoms by reshaping force_pred, force_label and drdq with natoms * 3, which
breaks for mixed-size batches; instead, use frame-aware shapes: do not flatten
all frames into one dimension—reshape or iterate using per-frame atom counts (or
keep forces as (nframes, natoms_i, 3) and drdq as (nframes, natoms_i*3,
n_gcoord)) and compute gen_force and gen_force_label per-frame (or via batched
operations that use the original nframes dimension) so tensor dims align; also
apply find_drdq/pref_gf per-frame. Locate symbols: has_gf, drdq, find_drdq,
pref_gf, force_pred, force_label, force_reshape_nframes,
force_label_reshape_nframes, drdq_reshape, gen_force, gen_force_label and update
the reshaping/Einstein-summation to operate over the nframes dimension (or loop
over frames) instead of using natoms * 3 across the whole batch.
---
Nitpick comments:
In @.gitignore:
- Line 81: The .gitignore entry "deepmd-kit/" is too broad and may
unintentionally ignore legitimately nested repos; update the ignore rule that
currently uses deepmd-kit/ to a more specific pattern (for example scope it to
the repo root by changing to "/deepmd-kit/" or better yet target only
build/dist/artifact paths like "/deepmd-kit/build", "/deepmd-kit/dist", or
specific filenames/patterns such as "deepmd-kit/*.egg-info") so only the
intended artifact directories/files are ignored; locate the deepmd-kit/ entry in
the .gitignore and replace it with one or more narrow, explicit patterns.
In `@deepmd/pt/model/descriptor/repflows.py`:
- Around line 840-874: Add a short explanatory comment above the synthetic
batch=1 wrapping that states the disjoint-frame invariant: explain that the
temporary batching of tensors like node_ebd_batched, edge_ebd_batched,
h2_batched, angle_ebd_batched, nlist_batched, a_nlist_batched, etc. is safe
because RepFlowLayer.forward operates strictly per-atom and per-neighbor and the
LMDB collator guarantees nlist/a_nlist never reference atoms across different
frames (mapping/offsets keep frames disjoint); also warn that if
RepFlowLayer.forward (or any layer in self.layers) later introduces global ops
over the batch axis the synthetic batch will mix frames and break correctness,
so such changes must preserve the per-frame isolation or remove this batching
trick.
In `@deepmd/pt/model/model/ener_model.py`:
- Around line 92-176: The two branches in forward (flat and regular) duplicate
the same output-assembly logic for model_predict; extract that into a helper
method (e.g. _assemble_energy_predict(model_ret, do_atomic_virial)) that
encapsulates setting "atom_energy", "energy", choosing "force" vs "dforce" using
do_grad_r("energy"), adding "virial" and "atom_virial" via do_grad_c("energy")
and do_atomic_virial, including "mask" and "hessian" when present/when
_hessian_enabled; then call this helper from both the flat-path and regular-path
after receiving model_ret, and keep the flat/regular-specific behavior (the
no-fitting-net case that does model_predict["updated_coord"] += coord) in the
respective branch only. Ensure the helper references self.do_grad_r,
self.do_grad_c, self._hessian_enabled and self.get_fitting_net() behavior
remains consistent.
In `@deepmd/pt/model/model/make_model.py`:
- Around line 520-599: forward_common_flat is a pure passthrough to
forward_common_flat_native with duplicated docstrings, creating maintenance
overhead; either remove the redundant wrapper by merging
forward_common_flat_native into forward_common_flat (rename/remove the _native
variant and keep a single method and docstring) or add a concise
comment/docstring to forward_common_flat explaining the explicit reason for the
split (e.g., JIT/export boundary or future divergence). Update any call sites
that reference forward_common_flat_native to use the single chosen method name
(forward_common_flat) and ensure tests/type hints still pass.
- Around line 296-309: Move the helper import for
rebuild_extended_coord_from_flat_graph out of the hot forward path and add it to
the module-level nlist import block with the other deepmd.pt.utils.nlist imports
so it is imported once at load time; also change the validation order inside the
flat mixed-batch branch so you check for presence of the required graph fields
(mapping, extended_batch, extended_image, etc.) and raise the RuntimeError
before mutating tensors, and only call coord.requires_grad_(True) /
box.requires_grad_(True) (and any clone+detach) after the validation passes to
avoid wasted work when validation fails.
- Around line 311-314: The four assert statements for extended_atype,
extended_batch, mapping, and nlist are redundant because the preceding explicit
None-check/else already guarantees they are not None; either remove those assert
lines or replace them with a short explanatory comment indicating they exist
purely for static type-narrowing (for tools like mypy) so future readers
understand their purpose; locate the checks around the make_model function where
extended_atype, extended_batch, mapping, and nlist are validated and update
accordingly.
In `@deepmd/pt/model/network/graph_utils_flat.py`:
- Around line 41-89: Add a defensive check to guard against mis-configured
inputs by asserting that a_sel (from a_nlist_mask.shape[1]) does not exceed nnei
(from nlist_flat.shape[1]) near the top of the function; insert something like
an assert or raise ValueError after computing nnei and a_sel so that functions
using n2e_index, edge_lookup_a, and the a_nlist_mask_3d broadcasts fail fast
with a clear message instead of producing downstream broadcast errors.
In `@deepmd/pt/model/task/invar_fitting.py`:
- Around line 299-315: The tensor un-padding currently uses a shape heuristic in
the loop that builds result_flat; instead use the model's declared output types
from output_def() to decide which keys are atom-wise. Replace the shape-based
condition in the for key, value in result_batch loop with a lookup like
output_def()[key].is_atomic (or equivalent in your model API) and only apply
valid_atom_mask to values for keys declared atomic; leave frame-level or other
outputs untouched and handle non-tensor values as before (keep variable names
result_batch, result_flat, valid_atom_mask, and call output_def() to determine
atomic vs frame-level).
- Around line 259-264: The padding value for atype_batch is set to -1 which
causes PyTorch indexing (e.g., in self.forward when doing per-type lookups like
bias_atom_e[atype] or embeddings) to wrap to the last type; change the sentinel
to a valid non-negative index (e.g., 0 or a dedicated PAD type index) when
creating atype_batch so lookups won't accidentally reference real types, and
keep using valid_atom_mask to discard those rows downstream; update places that
assume negative sentinel behavior to use the new PAD index (refer to
atype_batch, self.forward, bias_atom_e, and valid_atom_mask).
In `@deepmd/pt/train/training.py`:
- Around line 1690-1723: In get_data, remove the special-case skipping of
"batch" and "ptr" inside the batch_data loop and delete the separate post-loop
device transfers for batch_data["batch"] and batch_data["ptr"]; instead let the
existing non-list tensor branch handle them (they'll be moved to DEVICE
automatically), and only keep the is_mixed_batch detection so you can extend
input_keys with list(_FLAT_GRAPH_INPUT_KEYS) when is_mixed_batch is true
(symbols: get_data, batch_data, is_mixed_batch, input_keys,
_FLAT_GRAPH_INPUT_KEYS, DEVICE).
In `@deepmd/pt/train/wrapper.py`:
- Around line 237-247: The natoms value currently has dual semantics (per-frame
nloc when atype.dim()>1 vs total flattened atom count when 1-D) which is
implicit and fragile; fix by making the contract explicit: compute both
nloc_per_frame (e.g., atype.shape[-1] when atype.dim()>1 else None) and
total_atoms (atype.numel() or atype.shape[0]) in wrapper.py and change the call
to self.loss[task_key](input_dict, self.model[task_key], label,
nloc_per_frame=nloc_per_frame, total_atoms=total_atoms, learning_rate=cur_lr),
then update all loss implementations (ener.py) and any factory in make_model.py
to accept these new named args (or provide backward-compatible handling) so
downstream code no longer must infer semantics from ptr.
In `@deepmd/pt/utils/nlist.py`:
- Around line 101-170: The loop is causing repeated device syncs via
int(ptr[frame_idx].item()) and per-frame kernel overhead in
extend_coord_with_ghosts_with_images and build_neighbor_list; fix by
materializing ptr on CPU once (e.g., ptr_cpu = ptr.cpu().tolist()) and use
ptr_cpu[frame_idx]/ptr_cpu[frame_idx+1] for start_idx/end_idx to eliminate
.item() calls, and where possible call
normalize_coord/extend_coord_with_ghosts_with_images/build_neighbor_list on
already-CPU tensors or add a batched/padded variant (similar to
extend_input_and_build_neighbor_list_with_images) to process frames in a
vectorized way to reduce Python per-frame overhead.
- Around line 262-281: Add a one-line doc comment inside get_central_ext_index
stating the invariant it relies on: that the extended layout produced by
_extend_coord_with_ghosts_impl places the unshifted image (shift (0,0,0)) first
for each frame and thus the first nloc extended atoms per frame are the local
atoms; mention that shift indices are sorted by L2 norm so the (0,0,0) image is
first and changing that ordering will break get_central_ext_index.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: b6ad87fb-4157-40de-8f34-510d8232bbef
📒 Files selected for processing (21)

- .gitignore
- deepmd/pt/entrypoints/main.py
- deepmd/pt/loss/ener.py
- deepmd/pt/model/atomic_model/dp_atomic_model.py
- deepmd/pt/model/descriptor/dpa3.py
- deepmd/pt/model/descriptor/env_mat.py
- deepmd/pt/model/descriptor/repflows.py
- deepmd/pt/model/model/__init__.py
- deepmd/pt/model/model/ener_model.py
- deepmd/pt/model/model/make_model.py
- deepmd/pt/model/network/graph_utils_flat.py
- deepmd/pt/model/task/invar_fitting.py
- deepmd/pt/train/training.py
- deepmd/pt/train/wrapper.py
- deepmd/pt/utils/lmdb_dataset.py
- deepmd/pt/utils/nlist.py
- deepmd/utils/argcheck.py
- source/tests/pt/test_lmdb_dataloader.py
- test_mixed_batch.sh
- test_mptraj/lmdb_baseline.json
- test_mptraj/lmdb_mixed_batch.json
```diff
 system/
 temp/
+test_mptraj/
```
Ignoring test_mptraj/ conflicts with tracked files in this PR.
This PR commits test_mptraj/lmdb_baseline.json and test_mptraj/lmdb_mixed_batch.json under test_mptraj/, but the new test_mptraj/ ignore rule will hide any future additions/edits in that directory and forces git add -f. Either drop the directory-level ignore or add explicit negations for the tracked configs.
🛠️ Suggested fix

```diff
 temp/
-test_mptraj/
+test_mptraj/*
+!test_mptraj/*.json
+!test_mptraj/.gitignore
 pkl/
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```diff
-test_mptraj/
+test_mptraj/*
+!test_mptraj/*.json
+!test_mptraj/.gitignore
```
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In @.gitignore at line 78, The .gitignore entry "test_mptraj/" conflicts with
tracked files committed in this PR (test_mptraj/lmdb_baseline.json and
test_mptraj/lmdb_mixed_batch.json); fix by either removing the directory-level
ignore line "test_mptraj/" or by adding explicit negation patterns for the
tracked files (e.g., "!test_mptraj/lmdb_baseline.json" and
"!test_mptraj/lmdb_mixed_batch.json") so future edits to those config files are
not hidden and git add won't require -f; update .gitignore accordingly and
ensure the two committed files remain tracked.
```python
        # Detect mixed batch format
        is_mixed_batch = "ptr" in input_dict and input_dict["ptr"] is not None

        # For mixed batch, compute per-frame atom_norm and average
        if is_mixed_batch:
            ptr = input_dict["ptr"]
            nframes = ptr.numel() - 1
            # Compute natoms for each frame
            natoms_per_frame = ptr[1:] - ptr[:-1]  # [nframes]
            # Average atom_norm across frames
            atom_norm = torch.mean(1.0 / natoms_per_frame.float())
        else:
            atom_norm = 1.0 / natoms
```
This mixed-batch normalization misweights frames with different atom counts.
l2_ener_loss/l2_virial_loss are reduced over frames first, then multiplied by mean(1 / N_i). That is not the same as applying each frame’s own 1 / N_i normalization before the reduction, so mixed batches with small and large frames get the wrong loss weighting.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@deepmd/pt/loss/ener.py` around lines 228 - 240, The mixed-batch normalization
currently computes a single scalar atom_norm via atom_norm = mean(1.0 /
natoms_per_frame) which misweights frames with different atom counts; instead
compute per-frame normalizers and apply them before reducing l2_ener_loss and
l2_virial_loss: when is_mixed_batch (detect via ptr and natoms_per_frame derived
from ptr), build per-frame atom_norms = 1.0 / natoms_per_frame.float() and
multiply each frame's loss by its corresponding atom_norm before taking the
final mean/sum (or perform a weighted reduction using these per-frame weights)
so that l2_ener_loss and l2_virial_loss are normalized per-frame correctly
rather than using a single averaged scalar.
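A minimal sketch of the per-frame weighted reduction, assuming frame-level squared errors of shape `[nframes]` right before reduction (variable names illustrative):

```python
# Weight each frame's error by its own 1/N_i before reducing,
# rather than reducing first and scaling by mean(1/N_i).
natoms_per_frame = (ptr[1:] - ptr[:-1]).float()      # [nframes]
atom_norms = 1.0 / natoms_per_frame                  # [nframes]
ener_sq_err = torch.square(energy_pred - energy_label).reshape(-1)
l2_ener_loss = torch.mean(ener_sq_err * atom_norms)  # per-frame normalized
```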
```python
        if self.do_grad_r() or self.do_grad_c():
            extended_coord.requires_grad_(True)

        # Descriptor and fitting both consume the flat atom layout.
        descriptor_out = self.descriptor.forward_flat(
            extended_coord,
            extended_atype,
            extended_batch,
            nlist,
            mapping,
            batch,
            ptr,
            fparam=fparam if self.add_chg_spin_ebd else None,
```
Mirror the default-fparam fallback in the flat path.
forward_atomic() synthesizes get_default_fparam() when the fitting net expects frame parameters and the caller omits fparam. This flat path passes None straight into descriptor.forward_flat(), so mixed-batch runs lose that fallback and will diverge from the non-flat path for models using default charge/spin frame params.
Suggested fix

```diff
         if self.do_grad_r() or self.do_grad_c():
             extended_coord.requires_grad_(True)
+        if (
+            hasattr(self.fitting_net, "get_dim_fparam")
+            and self.fitting_net.get_dim_fparam() > 0
+            and fparam is None
+        ):
+            default_fparam_tensor = self.fitting_net.get_default_fparam()
+            assert default_fparam_tensor is not None
+            fparam_input_for_des = torch.tile(
+                default_fparam_tensor.unsqueeze(0), [ptr.numel() - 1, 1]
+            )
+        else:
+            fparam_input_for_des = fparam
+
         # Descriptor and fitting both consume the flat atom layout.
         descriptor_out = self.descriptor.forward_flat(
             extended_coord,
@@
-            fparam=fparam if self.add_chg_spin_ebd else None,
+            fparam=fparam_input_for_des if self.add_chg_spin_ebd else None,
             central_ext_index=central_ext_index,
```

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@deepmd/pt/model/atomic_model/dp_atomic_model.py` around lines 378 - 390, The
flat-path call to descriptor.forward_flat currently passes fparam unmodified and
thus skips the fallback logic used in forward_atomic; update the flat-path
before calling descriptor.forward_flat (the block around
descriptor.forward_flat) to mirror forward_atomic's behavior: if
self.add_chg_spin_ebd (or the equivalent flag) is True and fparam is None,
compute the default via self.get_default_fparam(...) using the same inputs/shape
logic used in forward_atomic and assign it to fparam, then pass that fparam into
descriptor.forward_flat so mixed-batch runs use the same default frame
parameters.
```python
def _set_default_descriptor_init_seed(
    params: dict[str, Any], seed: int | list[int]
) -> None:
    if params.get("type") == "hybrid":
        for idx, descriptor_params in enumerate(params.get("list", [])):
            if isinstance(descriptor_params, dict):
                _set_default_descriptor_init_seed(
                    descriptor_params, child_seed(seed, idx)
                )
```
Hybrid descriptor child seeds ignore an explicit parent seed.
In hybrid mode, Line 97 derives child seeds from the passed default (DEFAULT_DESCRIPTOR_INIT_SEED) rather than the effective parent config seed. If users set descriptor.seed, it won’t drive child seed derivation.
Suggested fix

```diff
 def _set_default_descriptor_init_seed(
     params: dict[str, Any], seed: int | list[int]
 ) -> None:
+    _set_default_init_seed(params, seed)
+    parent_seed = params["seed"]
     if params.get("type") == "hybrid":
         for idx, descriptor_params in enumerate(params.get("list", [])):
             if isinstance(descriptor_params, dict):
                 _set_default_descriptor_init_seed(
-                    descriptor_params, child_seed(seed, idx)
+                    descriptor_params, child_seed(parent_seed, idx)
                 )
         return
-    _set_default_init_seed(params, seed)
```

Also applies to: 111-113

🤖 Prompt for AI Agents
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@deepmd/pt/model/model/__init__.py` around lines 90 - 98, The
hybrid-descriptor logic in _set_default_descriptor_init_seed (and the analogous
block later) always derives child seeds from the passed-in default `seed`
instead of the effective parent config seed; update the logic to first determine
the parent_seed = params.get("seed", seed) (i.e., use params["seed"] when
present, otherwise fall back to the incoming seed) and then call
child_seed(parent_seed, idx) when recursing into each hybrid child (preserving
the existing isinstance(descriptor_params, dict) check and recursive call to
_set_default_descriptor_init_seed).
```python
    def forward_common_flat_native(
        self,
        coord: torch.Tensor,
        atype: torch.Tensor,
        batch: torch.Tensor,
        ptr: torch.Tensor,
        box: torch.Tensor | None = None,
        fparam: torch.Tensor | None = None,
        aparam: torch.Tensor | None = None,
        do_atomic_virial: bool = False,
        extended_atype: torch.Tensor | None = None,
        extended_batch: torch.Tensor | None = None,
        extended_image: torch.Tensor | None = None,
        extended_ptr: torch.Tensor | None = None,
        mapping: torch.Tensor | None = None,
        central_ext_index: torch.Tensor | None = None,
        nlist: torch.Tensor | None = None,
        nlist_ext: torch.Tensor | None = None,
        a_nlist: torch.Tensor | None = None,
        a_nlist_ext: torch.Tensor | None = None,
        nlist_mask: torch.Tensor | None = None,
        a_nlist_mask: torch.Tensor | None = None,
        edge_index: torch.Tensor | None = None,
        angle_index: torch.Tensor | None = None,
    ) -> dict[str, torch.Tensor]:
        """Forward pass for mixed-nloc batches with a precomputed flat graph.

        This path consumes graph tensors prepared by the LMDB collate function
        and keeps atom-wise values flattened across frames.

        Parameters
        ----------
        coord
            Flattened atomic coordinates with shape [total_atoms, 3].
        atype
            Flattened atomic types with shape [total_atoms].
        batch
            Atom-to-frame assignment with shape [total_atoms].
        ptr
            Frame boundaries with shape [nframes + 1].
        box
            Simulation boxes with shape [nframes, 9].
        fparam
            Frame parameters with shape [nframes, ndf].
        aparam
            Flattened atomic parameters with shape [total_atoms, nda].
        do_atomic_virial
            Whether to calculate atomic virial.

        Returns
        -------
        model_predict : dict[str, torch.Tensor]
            Model predictions with flat format:
            - atomwise outputs: [total_atoms, ...]
            - frame-wise outputs: [nframes, ...]

        Notes
        -----
        The precomputed graph fields are required for this path; missing
        fields are treated as a data pipeline error.
        """
        # Enable gradient tracking for coord and box if needed
        if self.do_grad_r("energy"):
            coord = coord.clone().detach().requires_grad_(True)
        if self.do_grad_c("energy") and box is not None:
            box = box.clone().detach().requires_grad_(True)
        if (
            extended_atype is not None
            and extended_batch is not None
            and extended_image is not None
            and mapping is not None
            and nlist is not None
            and nlist_ext is not None
            and a_nlist is not None
            and a_nlist_ext is not None
            and nlist_mask is not None
            and a_nlist_mask is not None
            and central_ext_index is not None
        ):
            from deepmd.pt.utils.nlist import rebuild_extended_coord_from_flat_graph

            extended_coord = rebuild_extended_coord_from_flat_graph(
                coord,
                box,
                mapping,
                extended_batch,
                extended_image,
            )
        else:
            raise RuntimeError(
                "Flat mixed-batch forward requires precomputed graph fields from "
                "the LMDB collate_fn."
            )
        # Pass flat extended coordinates directly to the atomic model.
        assert extended_atype is not None
        assert extended_batch is not None
        assert mapping is not None
        assert nlist is not None
        model_predict_lower = self.forward_common_lower_flat(
            extended_coord,
            extended_atype,
            extended_batch,
            nlist,
            mapping,
            batch,
            ptr,
            do_atomic_virial=do_atomic_virial,
            fparam=fparam,
            aparam=aparam,
            extended_ptr=extended_ptr,
            central_ext_index=central_ext_index,
            nlist_ext=nlist_ext,
            a_nlist=a_nlist,
            a_nlist_ext=a_nlist_ext,
            nlist_mask=nlist_mask,
            a_nlist_mask=a_nlist_mask,
            edge_index=edge_index,
            angle_index=angle_index,
        )

        # Compute derivatives if needed
        if self.do_grad_r("energy") or self.do_grad_c("energy"):
            model_predict_lower = self._compute_derivatives_flat(
                model_predict_lower,
                extended_coord,
                extended_atype,
                extended_batch,
                coord,
                atype,
                batch,
                ptr,
                box,
                do_atomic_virial,
            )

        return model_predict_lower
```
🧩 Analysis chain
🏁 Script executed:

```bash
#!/bin/bash
# Verify whether _input_type_cast / _output_type_cast are referenced anywhere
# in the new flat path or its callees.
rg -nP --type=py -C2 '_(input|output)_type_cast' deepmd/pt/model/
```

Repository: deepmodeling/deepmd-kit
Length of output: 2157

🏁 Script executed:

```bash
#!/bin/bash
# Find forward_common_lower_flat and _compute_derivatives_flat implementations
cd deepmd/pt/model/model && cat -n make_model.py | sed -n '353,700p' | head -100
```

Repository: deepmodeling/deepmd-kit
Length of output: 4742

🏁 Script executed:

```bash
#!/bin/bash
# Read _compute_derivatives_flat implementation
cd deepmd/pt/model/model && cat -n make_model.py | sed -n '438,550p'
```

Repository: deepmodeling/deepmd-kit
Length of output: 5611

🏁 Script executed:

```bash
#!/bin/bash
# Read forward_common_flat and forward_common_lower implementations for comparison
cd deepmd/pt/model/model && cat -n make_model.py | sed -n '520,600p' && echo "---" && cat -n make_model.py | sed -n '150,220p'
```

Repository: deepmodeling/deepmd-kit
Length of output: 6816

🏁 Script executed:

```bash
#!/bin/bash
# Read _input_type_cast and _output_type_cast implementations
cd deepmd/pt/model/model && cat -n make_model.py | sed -n '711,800p'
```

Repository: deepmodeling/deepmd-kit
Length of output: 4361
Add input/output precision casting to flat path for numeric consistency with forward_common.
forward_common normalizes input precision via _input_type_cast (line 171) and casts outputs back via _output_type_cast (line 214). forward_common_flat_native skips both, so autograd runs at the input's native precision instead of global_pt_float_precision, and outputs (particularly energy_redu) are never promoted to global_pt_ener_float_precision. This breaks numeric consistency: if coord/box arrive in a non-canonical precision, box.clone().detach().requires_grad_(True) at line 282 can create a dtype mismatch between the cloned box and the extended_coord rebuilt from the original-precision inputs. Add precision casting at the flat path's entry and exit to match the regular path.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@deepmd/pt/model/model/make_model.py` around lines 217 - 352, The flat forward
path forward_common_flat_native is missing the same input/output precision
normalization used in forward_common; call the model's _input_type_cast on
incoming tensors (coord, box, fparam, aparam, atype, batch, mapping, extended_*
inputs as appropriate) at the start so coord/box are promoted to
global_pt_float_precision (and energy-related inputs to
global_pt_ener_float_precision if required) before doing
clone().detach().requires_grad_(True) or rebuilding extended_coord, and after
computing model_predict_lower run _output_type_cast to promote outputs
(especially energy_redu) back to global_pt_ener_float_precision; ensure you
reference the existing helpers _input_type_cast and _output_type_cast and keep
their casting behavior identical to forward_common.
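A minimal sketch of that change, assuming _input_type_cast returns the cast tensors plus the original input precision and _output_type_cast accepts that precision back, as they do for forward_common; _forward_flat_impl is a hypothetical placeholder for the current body of the flat path:

```python
# Sketch only, not the actual patch. _input_type_cast / _output_type_cast are the
# existing helpers in make_model.py; _forward_flat_impl is a hypothetical
# placeholder standing in for the current body of forward_common_flat_native.
def forward_common_flat_native(self, coord, atype, box=None, fparam=None, aparam=None, **graph_kwargs):
    # Promote inputs to the canonical model precision first, so the
    # requires_grad_ clones and the rebuilt extended_coord share one dtype.
    cc, bb, fp, ap, input_prec = self._input_type_cast(coord, box=box, fparam=fparam, aparam=aparam)
    model_predict = self._forward_flat_impl(cc, atype, bb, fp, ap, **graph_kwargs)
    # Cast outputs (notably the reduced energy) back, mirroring forward_common.
    return self._output_type_cast(model_predict, input_prec)
```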
| if mixed_batch: | ||
| with torch.device("cpu"): | ||
| self._inner_dataloader = DataLoader( | ||
| self, | ||
| batch_size=self._reader.batch_size, | ||
| shuffle=True, | ||
| num_workers=0, | ||
| collate_fn=_collate_lmdb_mixed_batch, | ||
| ) |
auto_prob stops affecting sampling when mixed_batch=True.
compute_block_targets() still runs above, but this branch switches to a plain DataLoader(..., shuffle=True) and never consumes _block_targets. Any LMDB config relying on auto_prob_style/block weighting will silently change its training distribution once mixed batching is enabled.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@deepmd/pt/utils/lmdb_dataset.py` around lines 247 - 255, Mixed batching
currently replaces the shuffle-based DataLoader and ignores computed block
weights, so when mixed_batch is True you must construct _inner_dataloader to
honor _block_targets computed by compute_block_targets() rather than using
DataLoader(..., shuffle=True). Replace the plain shuffle=True approach in the
mixed_batch branch by creating and passing a sampler that uses _block_targets
(e.g., torch.utils.data.WeightedRandomSampler or an equivalent custom sampler)
to DataLoader, keep num_workers=0 and collate_fn=_collate_lmdb_mixed_batch, and
ensure _inner_dataloader is assigned the DataLoader configured with that sampler
so auto_prob_style/block weighting still affects sampling.
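One way to keep the weighting, sketched under the assumption that self._block_targets can be expanded into one sampling weight per dataset index (its exact layout is not shown in this diff):

```python
# Sketch only: `weights` is assumed to be a per-sample weight vector derived
# from self._block_targets; the rest mirrors the mixed_batch branch quoted above.
import torch
from torch.utils.data import DataLoader, WeightedRandomSampler

if mixed_batch:
    weights = torch.as_tensor(self._block_targets, dtype=torch.double)
    sampler = WeightedRandomSampler(weights, num_samples=len(weights), replacement=True)
    with torch.device("cpu"):
        self._inner_dataloader = DataLoader(
            self,
            batch_size=self._reader.batch_size,
            sampler=sampler,  # replaces shuffle=True so block weighting still applies
            num_workers=0,
            collate_fn=_collate_lmdb_mixed_batch,
        )
```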
| if box is None: | ||
| return coord[mapping] | ||
| cell = box.reshape(-1, 3, 3) | ||
| atom_cell = cell[extended_batch] | ||
| rec_cell, _ = torch.linalg.inv_ex(atom_cell) | ||
| coord_inter = torch.einsum("ni,nij->nj", coord[mapping], rec_cell) | ||
| coord_wrapped = torch.einsum( | ||
| "ni,nij->nj", | ||
| torch.remainder(coord_inter, 1.0), | ||
| atom_cell, | ||
| ) | ||
| image = extended_image.to(dtype=box.dtype, device=box.device) | ||
| shift_vec = torch.einsum("ni,nij->nj", image, atom_cell) | ||
| return coord_wrapped + shift_vec |
Invert the periodic cell once per frame, not once per extended atom.
atom_cell = cell[extended_batch] produces an [n_ext, 3, 3] tensor of repeated per-frame matrices, and torch.linalg.inv_ex(atom_cell) then inverts each row independently. This is called on every forward pass for periodic systems and scales with the number of extended atoms (tens of thousands). Inverting per-frame and indexing into the result preserves the same output and avoids the redundant work.
⚡ Suggested fix
     if box is None:
         return coord[mapping]
     cell = box.reshape(-1, 3, 3)
-    atom_cell = cell[extended_batch]
-    rec_cell, _ = torch.linalg.inv_ex(atom_cell)
+    rec_cell_per_frame, _ = torch.linalg.inv_ex(cell)
+    atom_cell = cell[extended_batch]
+    rec_cell = rec_cell_per_frame[extended_batch]
     coord_inter = torch.einsum("ni,nij->nj", coord[mapping], rec_cell)
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@deepmd/pt/utils/nlist.py` around lines 246 - 259, The code currently inverts
atom_cell (atom_cell = cell[extended_batch]) per extended atom which is
wasteful; instead compute the inverse once per frame by calling
torch.linalg.inv_ex on cell (after cell = box.reshape(-1,3,3)) and then index
the resulting rec_cell with extended_batch (i.e., rec_cell_full, _ =
torch.linalg.inv_ex(cell); rec_cell = rec_cell_full[extended_batch]) so
subsequent einsum uses the indexed per-atom inverses without repeating the
matrix inversion.
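A standalone check of the equivalence claim, on synthetic well-conditioned cells:

```python
import torch

nframes, n_ext = 4, 10
cell = torch.eye(3).repeat(nframes, 1, 1) + 0.1 * torch.rand(nframes, 3, 3)
extended_batch = torch.randint(0, nframes, (n_ext,))

per_atom_inv, _ = torch.linalg.inv_ex(cell[extended_batch])  # current: one inverse per extended atom
per_frame_inv, _ = torch.linalg.inv_ex(cell)                 # proposed: one inverse per frame
assert torch.allclose(per_atom_inv, per_frame_inv[extended_batch], atol=1e-6)
```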
| dp --pt train test_mptraj/lmdb_mixed_batch.json --skip-neighbor-stat | ||
|
| echo "" | ||
| echo "=== Training completed ===" | ||
| echo "Check mixed_batch_train.log for details" |
The script never creates mixed_batch_train.log.
The final message points users to a log file that this script does not write. Either redirect the training output there or update the message.
Suggested fix
-dp --pt train test_mptraj/lmdb_mixed_batch.json --skip-neighbor-stat
+dp --pt train test_mptraj/lmdb_mixed_batch.json --skip-neighbor-stat \
+ > mixed_batch_train.log 2>&1🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@test_mixed_batch.sh` around lines 15 - 19, The script prints "Check
mixed_batch_train.log for details" but never writes that file; fix by
redirecting the training command's stdout/stderr to mixed_batch_train.log (e.g.,
change the dp --pt train test_mptraj/lmdb_mixed_batch.json --skip-neighbor-stat
invocation to append or write both stdout and stderr into mixed_batch_train.log)
or, if you prefer not to create a log, update the echo lines to not reference
mixed_batch_train.log; modify the dp invocation or the echo accordingly to keep
the message accurate.
| "stat_file": "./MP_traj_v024_alldata_mixu.hdf5", | ||
| "training_data": { | ||
| "systems": "/mnt/data_nas/zhangd/data/lmdb_data/mptraj_v024.lmdb", | ||
| "batch_size": "auto:128", | ||
| "_comment": "that's all" | ||
| }, | ||
| "validation_data": { | ||
| "systems": "/mnt/data_nas/zhangd/data/lmdb_data/wbm.lmdb", | ||
| "batch_size": 1, | ||
| "_comment": "that's all" | ||
| }, |
Baseline config contains developer-specific absolute paths.
stat_file, training_data.systems, and validation_data.systems point to a specific machine (/mnt/data_nas/zhangd/...) and a local ./MP_traj_v024_alldata_mixu.hdf5. This makes the baseline non-reproducible for any other user/CI. Consider parameterizing via env vars / placeholders, or pointing to a small example dataset that ships with the repo (similar to other example configs), and documenting how test_mixed_batch.sh resolves these.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@test_mptraj/lmdb_baseline.json` around lines 195 - 205, The baseline config
currently hardcodes developer-specific absolute paths in the keys stat_file,
training_data.systems, and validation_data.systems which breaks portability;
change these values to reusable placeholders or environment-variable expansions
(e.g. ${DATASET_PATH}) or point them to a small example dataset shipped in the
repo (mirror other example configs), update any code that loads the config to
resolve those env vars, and add a short note in the config or README explaining
how test_mixed_batch.sh sets or substitutes those variables before running.
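If the env-var route is taken, a hypothetical loader along these lines could do the substitution; none of these names exist in the repo today:

```python
# Hypothetical helper: expand ${VAR} placeholders so entries like
# "${LMDB_DATA}/mptraj_v024.lmdb" resolve per machine or CI runner.
import json
import os
import string

def load_config_with_env(path: str) -> dict:
    with open(path) as f:
        raw = f.read()
    # substitutes ${VAR} from the environment; raises KeyError if a var is unset
    return json.loads(string.Template(raw).substitute(os.environ))
```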
| "stat_file": "/aisi/mnt/data_nas/liwentao/devel_workspace/deepmd-kit-lmdb/MP_traj_v024_alldata_mixu.hdf5", | ||
| "training_data": { | ||
| "systems": "/mnt/data_nas/zhangd/data/lmdb_data/mptraj_v024.lmdb", | ||
| "batch_size": 8, | ||
| "mixed_batch": true, | ||
| "_comment": "Enable mixed_batch mode for testing, disable shuffle for reproducibility" | ||
| }, | ||
| "validation_data": { | ||
| "systems": "/mnt/data_nas/zhangd/data/lmdb_data/wbm.lmdb", | ||
| "batch_size": 8, | ||
| "mixed_batch": true |
This test fixture is tied to one machine’s filesystem.
stat_file and both LMDB systems entries are hard-coded absolute paths, so the new script/config cannot run in CI or on another checkout. Please switch this fixture to repo-managed test data or make the paths injectable.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@test_mptraj/lmdb_mixed_batch.json` around lines 74 - 84, The fixture uses
hard-coded absolute paths in stat_file and LMDB system entries (stat_file,
training_data.systems, validation_data.systems) which prevents CI portability;
update the test_mptraj/lmdb_mixed_batch.json fixture to reference repo-managed
test data or injected paths by replacing absolute paths with relative paths
inside the repo (e.g., a test-data directory) or with placeholders/env variables
read by the test runner, and ensure the test harness supplies those fixtures
before running so the config uses repo-controlled files rather than
machine-specific absolute paths.
Add mixed-nloc LMDB batching for PyTorch training
Summary
This PR adds optional `mixed_batch` support to the PyTorch LMDB training and validation data path.

When `mixed_batch` is enabled, one LMDB batch may contain frames with different atom counts / `nloc` values. Atom-wise fields are flattened across the batch, frame ownership is tracked with `batch` and `ptr`, and the model runs through a flat forward path without padding.

The default behavior is unchanged. `mixed_batch` is `false` by default, so LMDB batches still use the existing same-`nloc` sampler unless the option is explicitly enabled.

Motivation

LMDB batches previously required all frames in a batch to have the same `nloc`. This can make batching inefficient for datasets that mix systems with different atom counts. Mixed batching lets those frames train together while keeping atom indexing explicit in a flattened graph layout.

Main Changes

- Add `mixed_batch` to PyTorch LMDB `training_data` and `validation_data`.
- Accept `mix_batch` as an alias for the same option.
- In `mixed_batch=true` mode, interpret `batch_size` as the number of frames/systems in one mixed batch.
- Track `batch` and `ptr` for frame ownership (see the sketch after this list).
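To make the bookkeeping concrete, here is a small self-contained illustration of the `batch`/`ptr` convention described above; the values are synthetic, and only the names `batch`, `ptr`, and `nloc` come from this PR.

```python
import torch

natoms = torch.tensor([3, 5, 2])  # per-frame nloc in one mixed batch
ptr = torch.cat([torch.zeros(1, dtype=torch.long), natoms.cumsum(0)])
# ptr -> tensor([0, 3, 8, 10]); frame i owns flattened atoms ptr[i]:ptr[i+1]
batch = torch.repeat_interleave(torch.arange(len(natoms)), natoms)
# batch -> tensor([0, 0, 0, 1, 1, 1, 1, 1, 2, 2]); frame index of each atom

atom_energy = torch.rand(int(ptr[-1]))  # a flattened per-atom quantity
frame_energy = torch.zeros(len(natoms)).index_add_(0, batch, atom_energy)
```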
Usage

```json
{
  "training": {
    "training_data": {
      "systems": "path/to/train.lmdb",
      "batch_size": 8,
      "mixed_batch": true
    },
    "validation_data": {
      "systems": "path/to/valid.lmdb",
      "batch_size": 8,
      "mixed_batch": true
    }
  }
}
```

With `mixed_batch=true`, `batch_size` controls how many frames/systems are merged into one flattened batch.

With `mixed_batch=false`, the existing same-`nloc` LMDB batching behavior is preserved.

Tests

- `batch`/`ptr` layout and flat graph precomputation.
- `mix_batch` config alias.
- `test_mixed_batch.sh` for end-to-end training coverage.

Scope
Summary by CodeRabbit
New Features
- `mixed_batch` configuration option for training and validation datasets.

Improvements