[stacked 3/3, multi-datapipes] darcy flow multi dataset #1507
coreyjadams wants to merge 6 commits into NVIDIA:main
Conversation
…ms; add in-memory numpy reader for small single-file datasets like Darcy flow
Greptile Summary
This PR introduces a …
Important Files Changed
Last reviewed commit: 0fc3987
    ref = torch.norm(target.reshape(B, -1), dim=1)
    return torch.mean(diff / ref)
Division by zero in RelativeL2Loss
ref is the per-sample L2 norm of the target. If any sample in the batch has an all-zero (or near-zero) target, ref[i] == 0 and diff[i] / ref[i] produces inf or nan, corrupting the loss and gradients for the entire batch.
Consider clamping the denominator:
    ref = torch.norm(target.reshape(B, -1), dim=1)
-   return torch.mean(diff / ref)
+   return torch.mean(diff / ref.clamp(min=1e-8))
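The clamped denominator can be sketched as a standalone function. This is an illustrative reconstruction, not the PR's actual RelativeL2Loss class; the function name and the 1e-8 floor follow the suggestion above.

```python
import torch

# Illustrative sketch of a relative L2 loss with a clamped denominator.
# The name relative_l2_loss is hypothetical; the PR likely wraps this
# in an nn.Module.
def relative_l2_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    B = target.shape[0]
    diff = torch.norm((pred - target).reshape(B, -1), dim=1)
    ref = torch.norm(target.reshape(B, -1), dim=1)
    # clamp guards against all-zero targets producing inf/nan
    return torch.mean(diff / ref.clamp(min=1e-8))

# The second sample's target is deliberately all-zero to exercise the guard.
target = torch.zeros(2, 4)
target[0] = torch.ones(4)
pred = torch.ones(2, 4)
loss = relative_l2_loss(pred, target)
```

With the clamp, the zero-target sample contributes a large but finite term instead of nan, so the batch loss and its gradients stay usable.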
    train_l2_err_sq += metrics["l2_err_sq"]
    train_l2_ref_sq += metrics["l2_ref_sq"]
    train_n += b
    train_sample = (x, y, pred)
Stale computation graph held in memory
train_sample = (x, y, pred) stores a reference to pred, which still carries the autograd graph from the current iteration. This keeps the full computation graph alive until the next iteration overwrites it; for large models it can roughly double peak GPU memory during training.
Consider detaching before storing:
-   train_sample = (x, y, pred)
+   train_sample = (x, y, pred.detach())
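The effect of detaching can be verified in isolation. A minimal sketch (tensor shapes and names are illustrative, not from the PR):

```python
import torch

# pred carries a computation graph because it is derived from a
# requires_grad tensor; detach() returns a view with no graph reference.
x = torch.randn(4, 3, requires_grad=True)
pred = (x * 2).sum(dim=1)   # grad_fn is set: graph is alive
sample = pred.detach()      # grad_fn is None: safe to store long-term
```

Storing `sample` instead of `pred` lets autograd free the intermediate buffers as soon as backward() completes, rather than holding them until the stored tuple is overwritten.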
    val_loss_sum = _zero.clone()
    val_l2_err_sq = _zero.clone()
    val_l2_ref_sq = _zero.clone()
Validation accumulators reuse stale _zero from training
val_loss_sum = _zero.clone() (and the other accumulators) clone the _zero tensor created at the top of the epoch loop (line 277). Since _zero is torch.tensor(0.0, device=dist.device) with requires_grad=False, the .clone() call produces a fresh zero, so this is technically correct. But _zero was already cloned into train_loss_sum etc. earlier in the epoch, and relying on a shared tensor that is cloned in multiple places is easy to misread. Constructing fresh torch.tensor(0.0, ...) accumulators here would make the intent explicit.
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
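The suggested alternative is straightforward: build each accumulator directly rather than cloning a shared zero. A sketch with the device hardcoded to CPU for illustration (the PR uses dist.device):

```python
import torch

# Fresh per-phase accumulators: no shared _zero tensor to reason about.
device = torch.device("cpu")  # the PR would use dist.device here
val_loss_sum = torch.tensor(0.0, device=device)
val_l2_err_sq = torch.tensor(0.0, device=device)
val_l2_ref_sq = torch.tensor(0.0, device=device)

# In-place accumulation works the same as with the cloned version.
val_loss_sum += 1.5
```

The behavior is identical to `_zero.clone()`; the gain is purely readability, since each accumulator's initial value is visible at its definition.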
[stacked]
PR 3/3
Will be rebased after the first two PRs merge; includes the final Darcy example with multi-dataset support and a runnable example.
PhysicsNeMo Pull Request
Description
Checklist
Dependencies
Review Process
All PRs are reviewed by the PhysicsNeMo team before merging.
Depending on which files are changed, GitHub may automatically assign a maintainer for review.
We are also testing AI-based code review tools (e.g., Greptile), which may add automated comments with a confidence score.
This score reflects the AI's assessment of merge readiness; it is not a qualitative judgment of your work, nor an indication that the PR will be accepted or rejected.
AI-generated feedback should be reviewed critically for usefulness.
You are not required to respond to every AI comment, but they are intended to help both authors and reviewers.
Please react to Greptile comments with 👍 or 👎 to provide feedback on their accuracy.