commit 6a6629f1c4f444e20859ab905ba2d950d66bbb0b
Merge: 2fb3d047 3670381b
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Mon May 11 14:42:49 2026 -0400
Merge branch 'main' into psharpe/GLOBE-dev
commit 2fb3d0473e8d66536c04e2b9b78965a8c5a7d3bf
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Mon May 11 12:02:10 2026 -0400
Allows CPU-based tree build, if desired
commit ada167d9c52b45487dafdba3e6d2872c1ac925fd
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Mon May 11 11:24:36 2026 -0400
bump tensordict version
commit 34e14584b2092ab49c724b75a861989d66910f35
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Mon May 11 11:17:12 2026 -0400
Makes `psutil` a core dependency
commit 7e31b1c5e4e033e50c380f594ae206d3115596b6
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Mon May 11 01:36:17 2026 -0400
Optimize indexing in _expand_dual_leaf_hits to reduce GPU synchronization by consolidating boolean mask operations into single nonzero calls, improving performance in the ClusterTree implementation.
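An illustrative sketch of the pattern this commit describes (the real `_expand_dual_leaf_hits` differs; names and masks here are hypothetical). On CUDA, `tensor.nonzero()` — and boolean-mask indexing, which calls it internally — is a host-device synchronization point, because the output size is data-dependent. Stacking the masks lets one `nonzero` call replace two:

```python
import torch

def split_hits_naive(hits, is_leaf_a, is_leaf_b):
    # Two separate boolean-index ops -> two nonzero calls -> two GPU syncs.
    a_only = hits[is_leaf_a & ~is_leaf_b]
    b_only = hits[~is_leaf_a & is_leaf_b]
    return a_only, b_only

def split_hits_fused(hits, is_leaf_a, is_leaf_b):
    a_mask = is_leaf_a & ~is_leaf_b
    b_mask = ~is_leaf_a & is_leaf_b
    # One nonzero over the stacked masks: torch.nonzero returns indices in
    # row-major order, so all a-hits precede all b-hits in `idx`.
    which, idx = torch.stack([a_mask, b_mask]).nonzero(as_tuple=True)
    n_a = int((which == 0).sum())  # single device-to-host transfer for the split point
    return hits[idx[:n_a]], hits[idx[n_a:]]
```
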
commit ebb2b0a611e57e6804b7f3a8cfad0294da68f38b
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Mon May 11 05:35:46 2026 +0000
Refactor gradient checkpointing in BarnesHutKernel to apply in both training and evaluation phases. This change improves memory efficiency and addresses a TensorDict bug related to batch size propagation during tracing.
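A minimal sketch of applying activation checkpointing unconditionally in both training and evaluation, as this commit describes (`kernel_fn` and the wrapper are illustrative, not the actual `BarnesHutKernel` API). With `use_reentrant=False`, `torch.utils.checkpoint` also runs cleanly under `no_grad`, where it reduces to a plain call:

```python
import torch
from torch.utils.checkpoint import checkpoint

def run_kernel(kernel_fn, x: torch.Tensor) -> torch.Tensor:
    # Checkpoint in train AND eval; use_reentrant=False supports
    # no-grad execution and is the tracing-friendly variant.
    return checkpoint(kernel_fn, x, use_reentrant=False)
```
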
commit ce4ab79d649c674d24165aaedccf953984b23feb
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Sun May 10 18:29:41 2026 +0000
Adds chunking support on CPU too.
commit 2b326c1ddb498cb7358be78c794360f6a9f73348
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Sun May 10 14:14:58 2026 -0400
Refactor chunk memory budget calculation to use integer arithmetic and introduce integer ceiling division function to enhance compatibility with torch.compile. This change prevents crashes related to symbolic float handling during chunk size derivation.
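The integer ceiling division this commit introduces can be sketched as below (function and parameter names are illustrative). Under `torch.compile`, sizes trace as symbolic integers, and routing them through `math.ceil` of a float can break tracing; pure integer arithmetic stays symbolic-friendly:

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division without a float round-trip."""
    return -(-a // b)

def num_chunks(n_points: int, bytes_per_point: int, memory_budget_bytes: int) -> int:
    # Derive chunk count entirely in integer arithmetic.
    points_per_chunk = max(1, memory_budget_bytes // bytes_per_point)
    return ceil_div(n_points, points_per_chunk)
```
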
commit fce4b188a930f3431c44fd686a807c44a9c332ab
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Sun May 10 13:48:04 2026 -0400
eliminates graph break
commit c5cff850b6d00122d866c225d354e9166c3f2462
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Sun May 10 16:26:25 2026 +0000
fixes smoothing_radius graph break using buffer.
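A sketch of the buffer approach this commit describes (class and attribute names are illustrative): calling `torch.tensor(...)` inside `forward` to materialize a Python scalar is a graph-break hazard under `torch.compile`, whereas registering the value once as a buffer at init keeps `forward` traceable and moves the tensor with the module's device/dtype:

```python
import torch

class Kernel(torch.nn.Module):
    def __init__(self, smoothing_radius: float):
        super().__init__()
        # Materialize once at construction, not per-forward-call.
        self.register_buffer("smoothing_radius", torch.tensor(smoothing_radius))

    def forward(self, r2: torch.Tensor) -> torch.Tensor:
        # Plain tensor op on a buffer; no tensor construction in the hot path.
        return r2 + self.smoothing_radius ** 2
```
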
commit 3fb11547a2fa2b59143efed608e1004fd470fd9f
Merge: 835ea142 aa6d2208
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Sun May 10 15:36:50 2026 +0000
Merge branch 'psharpe/GLOBE-dev-perf' of https://github.com/peterdsharpe/physicsnemo into psharpe/GLOBE-dev-perf
commit aa6d220864ae24a81e68273aa25a3af1ca60c387
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Sun May 10 11:36:06 2026 -0400
Enhance numerical stability in ClusterTree and BarnesHutKernel by switching cumulative sum operations to double precision (fp64) to prevent catastrophic cancellation in small-leaf scenarios. Added tests to validate precision at scale for both implementations, ensuring consistency between fp32 and fp64 results.
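The precision issue this commit addresses can be sketched as follows (a generic helper, not the repo's actual code): a long fp32 cumulative sum of small per-leaf contributions loses low-order bits once the running total dwarfs each increment; accumulating in fp64 and casting back keeps the fp32 output consistent with the fp64 reference:

```python
import torch

def stable_cumsum(x: torch.Tensor) -> torch.Tensor:
    # Accumulate in double precision to avoid catastrophic cancellation,
    # then cast back to the input dtype.
    return x.double().cumsum(dim=-1).to(x.dtype)
```
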
commit 835ea142a27961d2a3af9609d7c911ac5ff9bbd0
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Sat May 9 05:53:52 2026 +0000
further simplification to smoothing_radius
commit ccf4d85c179617557349e44b8ae70b0cae069a71
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Sat May 9 05:27:28 2026 +0000
removes torch.tensor() break in smoothing_radius
commit de1b8c93a2603bd1c228e71dd76ea436f351371a
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Sat May 9 04:49:02 2026 +0000
Large refactor to reduce GPU syncs in field_kernel and cluster_tree; plus tests
commit 8d5e6f2604be7852a909830ad392b84d0914dfa5
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Sat May 9 02:14:40 2026 +0000
Removes GPU syncs during chunking
commit b0f1db4410ea591a5b90ad15eec0ac323e76bff4
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Sat May 9 01:35:22 2026 +0000
pipe through gradient checkpointing option, still default to true
commit 9d5515c0539ad05ac162be8c57e8c10da9e75495
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Sat May 9 01:32:01 2026 +0000
Removes GradScaler
commit 57f2c6ce2dfb405b9a5c704aabb1b355fd1f8f03
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Sat May 9 01:29:49 2026 +0000
env var setting
commit 59e4b6c94292ce20c8641e7e65417de270c7fae3
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Fri May 8 20:07:04 2026 +0000
Squashed commit of the following:
commit 6bb41cc5ff9a79e97478b1a1e478245599929c7e
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Fri May 8 19:29:35 2026 +0000
Add validation for data ranks in GLOBE model
- Introduced `validate_data_contains_ranks` function to ensure that incoming data contains all declared leaves with matching ranks, enhancing error handling for missing or mismatched data.
- Updated the GLOBE model to utilize this validation during the forward pass, ensuring robustness against incorrect input configurations.
- Added comprehensive tests to verify the behavior of the model with respect to extra and missing data keys, as well as rank mismatches.
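A hypothetical sketch of the validation described above: the function name comes from the commit, but the signature, the rank convention (`ndim` per leaf), and the error types are assumptions for illustration:

```python
import torch

def validate_data_contains_ranks(data, declared_ranks: dict) -> None:
    """Check that `data` contains every declared leaf with the declared rank."""
    for key, rank in declared_ranks.items():
        if key not in data.keys():
            raise KeyError(f"Missing declared leaf: {key!r}")
        if data[key].ndim != rank:
            raise ValueError(
                f"Rank mismatch for {key!r}: expected {rank}, got {data[key].ndim}"
            )
```
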
commit 97b1d3e9fffec31af513dcc5df724b94ae51f9ee
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Fri May 8 19:10:54 2026 +0000
fixes normal handling
commit fdca989212fced6cabdf9da5eaebea9d3dd203c4
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Fri May 8 17:18:21 2026 +0000
fixes issue where the precision setting was casting the data in addition to the model, rather than only the model internals
commit 80c76380c8b4ca158e5cbbb21f53140e5df77377
Merge: 96b17855 c8f21f16
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Fri May 8 17:14:06 2026 +0000
Merge branch 'psharpe/add-GLOBE-to-unified-recipe' of https://github.com/peterdsharpe/physicsnemo into psharpe/add-GLOBE-to-unified-recipe
commit 96b178550b819cce9b99e758e353ebb9e9d1f409
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Fri May 8 17:08:34 2026 +0000
autocast minor fixes
commit c8f21f1680dce763347099fab0a040aec03fbd3d
Merge: a0a2bc70 5f2940f9
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Fri May 8 12:30:27 2026 -0400
Merge branch 'main' into psharpe/add-GLOBE-to-unified-recipe
commit 2b5106d70fce298d77727b63afd3a8835a07c8e5
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Fri May 8 15:28:53 2026 +0000
sync to latest run
commit a0a2bc70b8a7157a788bd33fff07567e9ceec475
Merge: e4c6b52f 121e0b31
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Fri May 8 10:55:54 2026 -0400
Merge branch 'main' into psharpe/add-GLOBE-to-unified-recipe
commit e4c6b52f00de81bb50f4f753180d1f299494f54e
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Fri May 8 10:54:55 2026 -0400
Squashed commit of the following:
commit 95cfcd5e653cb691d034b8a1c7fc05ef6f843938
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Fri May 8 10:52:56 2026 -0400
pre-commit formatting
commit 881c2f67954a52b8441c4c682d9d9fc5648ae874
Merge: 224eef57 121e0b31
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Fri May 8 10:52:22 2026 -0400
Merge branch 'main' into psharpe/update-unified-recipe
commit 224eef57dab884308c72e11b2f0d273528083ddf
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Fri May 8 10:50:50 2026 -0400
Adds tests
commit 2a191773f9354660c09db0feb04a6c9ceb56f6d5
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Thu May 7 23:39:56 2026 -0400
Refactor freestream scale calculations to use the `freestream_scales` function for consistency and precision. Update docstring for TensorBoard logging function to clarify usage and improve readability.
commit f677107cd7793542d99b52634281925e18b105e3
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Thu May 7 23:38:27 2026 -0400
Cleanup pass
commit f9ac280ab8644f0a653859f92d2369789bb72d51
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Thu May 7 23:29:13 2026 -0400
inlines function
commit f8773b2b2a24a993954472a54a031b024fc33a08
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Thu May 7 23:25:51 2026 -0400
strips out apply_to_tensordict_mesh (now dead code; easily a one-liner with td.apply())
commit 705e46bd2d556df6500a903aa7c654f2590fb276
Merge: d6b96f6c e33028cd
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Fri May 8 03:15:23 2026 +0000
Merge branch 'psharpe/update-unified-recipe' of https://github.com/peterdsharpe/physicsnemo into psharpe/update-unified-recipe
commit e33028cd47c94b8b0c010abc331fb2113069e5ff
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Thu May 7 23:13:29 2026 -0400
Switches to be tensordict-native
commit e302f3c0515f34edae293c5fdf1c2ef6ec048076
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Thu May 7 23:10:08 2026 -0400
more conversions to "associations"
commit d705007ba22c0cc5ae2284aea6f89c7b6666d9d5
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Thu May 7 23:07:19 2026 -0400
Standardize on the name "association"
commit 9c5c5282623fa9b43269ba1c739051b01fa7717e
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Thu May 7 23:01:40 2026 -0400
pipes through new imports
commit 0b7268926a2fd43d45ace9917efb10cb09576804
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Thu May 7 23:00:59 2026 -0400
docs
commit eb9807d023685e2e7613115a71629ad781f90dc0
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Thu May 7 23:00:17 2026 -0400
adds docstring
commit 51c7d2ceb38b8f36fa5bc521eaa345d6a8570051
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Thu May 7 22:55:27 2026 -0400
Migrates functions per review
commit 8a1aea97e58caee97278a89f9d0734aeb9b9d610
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Thu May 7 22:52:57 2026 -0400
Enhance error handling in normalize_output_to_tensordict to raise explicit ValueError for all-None tensor outputs, improving clarity on model misconfiguration.
commit 3cf398188221317f2b71604f638386db02c76a95
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Thu May 7 22:52:23 2026 -0400
fixes forward_kwargs bool coercion
commit d6b96f6c0119ffe7d9a0392cb986637b473df49e
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Fri May 8 02:49:03 2026 +0000
syncs with new name
commit 1e86dae3a1aeb4b2f3cbb62e89356a1c23637403
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Fri May 8 02:45:12 2026 +0000
Update surface pattern in drivaer_ml_surface.yaml to reflect vehicle boundaries
commit 121e0b31e8ee4bd796f70f0ce96e9c313972c03f
Author: Corey adams <6619961+coreyjadams@users.noreply.github.com>
Date: Thu May 7 15:54:35 2026 -0500
Testing and validating ci with data from huggingface (#1601)
* Testing and validating ci with data from huggingface
* Update ci data download action
* Remove get data command
* Make sure the singleton states clear out to fix tests.
commit bb0e19a45476f650aef4fcda5179cb19aca26c84
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Thu May 7 15:58:55 2026 -0400
Partial merge from psharpe/add-GLOBE-to-unified-recipe
commit a44697351dbad6e2d9feeb711408a95e4abee6bd
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Fri May 8 10:49:04 2026 -0400
sync tests
commit 10beae03c2ee04dd844cdbe89257ef30bfa5c66c
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Fri May 8 10:48:37 2026 -0400
readme updates
commit 26a91d0f5655b07399808d7af9021b6eb0fa62c0
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Thu May 7 23:54:54 2026 -0400
Squashed commit of the following:
commit 2a191773f9354660c09db0feb04a6c9ceb56f6d5
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Thu May 7 23:39:56 2026 -0400
Refactor freestream scale calculations to use the `freestream_scales` function for consistency and precision. Update docstring for TensorBoard logging function to clarify usage and improve readability.
commit f677107cd7793542d99b52634281925e18b105e3
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Thu May 7 23:38:27 2026 -0400
Cleanup pass
commit f9ac280ab8644f0a653859f92d2369789bb72d51
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Thu May 7 23:29:13 2026 -0400
inlines function
commit f8773b2b2a24a993954472a54a031b024fc33a08
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Thu May 7 23:25:51 2026 -0400
strips out apply_to_tensordict_mesh (now dead code; easily a one-liner with td.apply())
commit 705e46bd2d556df6500a903aa7c654f2590fb276
Merge: d6b96f6c e33028cd
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Fri May 8 03:15:23 2026 +0000
Merge branch 'psharpe/update-unified-recipe' of https://github.com/peterdsharpe/physicsnemo into psharpe/update-unified-recipe
commit e33028cd47c94b8b0c010abc331fb2113069e5ff
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Thu May 7 23:13:29 2026 -0400
Switches to be tensordict-native
commit e302f3c0515f34edae293c5fdf1c2ef6ec048076
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Thu May 7 23:10:08 2026 -0400
more conversions to "associations"
commit d705007ba22c0cc5ae2284aea6f89c7b6666d9d5
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Thu May 7 23:07:19 2026 -0400
Standardize on the name "association"
commit 9c5c5282623fa9b43269ba1c739051b01fa7717e
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Thu May 7 23:01:40 2026 -0400
pipes through new imports
commit 0b7268926a2fd43d45ace9917efb10cb09576804
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Thu May 7 23:00:59 2026 -0400
docs
commit eb9807d023685e2e7613115a71629ad781f90dc0
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Thu May 7 23:00:17 2026 -0400
adds docstring
commit 51c7d2ceb38b8f36fa5bc521eaa345d6a8570051
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Thu May 7 22:55:27 2026 -0400
Migrates functions per review
commit 8a1aea97e58caee97278a89f9d0734aeb9b9d610
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Thu May 7 22:52:57 2026 -0400
Enhance error handling in normalize_output_to_tensordict to raise explicit ValueError for all-None tensor outputs, improving clarity on model misconfiguration.
commit 3cf398188221317f2b71604f638386db02c76a95
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Thu May 7 22:52:23 2026 -0400
fixes forward_kwargs bool coercion
commit d6b96f6c0119ffe7d9a0392cb986637b473df49e
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Fri May 8 02:49:03 2026 +0000
syncs with new name
commit 1e86dae3a1aeb4b2f3cbb62e89356a1c23637403
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Fri May 8 02:45:12 2026 +0000
Update surface pattern in drivaer_ml_surface.yaml to reflect vehicle boundaries
commit bb0e19a45476f650aef4fcda5179cb19aca26c84
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Thu May 7 15:58:55 2026 -0400
Partial merge from psharpe/add-GLOBE-to-unified-recipe
commit a513e024635a4fd89437c524141ae31b8c772be9
Merge: 8e98aaee b15dae48
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Thu May 7 15:53:49 2026 -0400
Merge branch 'main' into psharpe/add-GLOBE-to-unified-recipe
commit 8e98aaee30cf9e753d209287db28b2647aec4880
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Wed May 6 19:44:43 2026 -0400
formatting
commit 596f527a1d38e680f80d7db37990c1eb819af984
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Wed May 6 19:06:43 2026 -0400
edits comments
commit ceda0b3757218fb347af3330eb2791419980e388
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Wed May 6 13:51:14 2026 -0400
type hints
commit dfafc47c5a7d06dc23173e7e071c87de285c0a4c
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Wed May 6 13:38:40 2026 -0400
simplifications
commit 49827b15fedf68a9cd1828ec234775b550e6fcec
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Wed May 6 13:20:03 2026 -0400
Adds progress
commit 4756955f21a82b2be309d05d170a8149575c21c5
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Wed May 6 10:55:38 2026 -0400
review pass
commit e6b7345c5e70bcf5bc867f5467d32c113c4b301a
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Wed May 6 09:17:46 2026 -0400
simplifications throughout
commit 9e0c1e86f12e0f6beb49742aec01159be45deeac
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Wed May 6 08:27:10 2026 -0400
Notes domino is WIP
commit 8c2374a4fdbd30981233f74048aa71dacbbd5c4d
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Wed May 6 08:26:58 2026 -0400
Notes DoMINO is WIP
commit 9982de0205903857e2c7cf63f52e0a8c8f630c2e
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Wed May 6 08:26:46 2026 -0400
add nontrivial default
commit 66c3c3a361ef8af2e94a2d32a1bc38005f42683b
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Wed May 6 08:26:16 2026 -0400
dead code sweep
commit bb4eb8ee381169b44e7ad58ff875b9da135ad6c2
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Tue May 5 17:51:58 2026 -0400
formatting
commit cd9a95a3787ca516d9c0dc6aa41a24715f7aa80d
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Tue May 5 17:51:46 2026 -0400
cleanup
commit abba05443c2a6230f9c1329adeebb700f677eac2
Merge: 2ee96eb4 b5f075a9
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Tue May 5 17:44:01 2026 -0400
Merge remote-tracking branch 'origin/psharpe/DomainMesh-apply-fix' into psharpe/add-GLOBE-to-unified-recipe
commit b5f075a9366d471066bd6a3fa9a878352a29d19c
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Tue May 5 17:39:50 2026 -0400
renames DomainMesh.apply to DomainMesh.apply_to_meshes, to mitigate shadowing.
commit 2ee96eb40c9dc73a0adaaaf6a4a88333088d02b0
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Tue May 5 16:50:42 2026 -0400
adds tests
commit 64352f1e50b0bada49a061027a0f8c4e1ebf23b0
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Tue May 5 15:59:38 2026 -0400
Refactor external aerodynamics recipe to use TensorDict interface
- Updated README.md to reflect changes in output normalization and loss/metric calculators, transitioning from dict-based to TensorDict-based structures.
- Modified training configuration files to standardize boundary naming as `boundaries.vehicle` across surface and volume datasets.
- Adjusted dataset YAML files to ensure consistent boundary definitions and removed outdated comments.
- Enhanced collate functions to support TensorDict, ensuring proper batch dimension handling for targets.
- Updated loss and metric calculators to operate on TensorDict inputs, improving compatibility and performance.
- Revised tests to validate TensorDict functionality and ensure expected behavior across various configurations.
commit 3b8760985a8ccf45a47679ad8fc8db948cb95237
Merge: c7b27c09 70701e18
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Tue May 5 15:27:27 2026 -0400
Merge branch 'main' into psharpe/add-GLOBE-to-unified-recipe
commit c7b27c093a00c3bf2bf4c6cbc1b3ad2a0cc829bb
Merge: 03d91803 64fecb91
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Tue May 5 11:05:57 2026 -0400
Merge branch 'psharpe/add-GLOBE-drivaerml-standalone' into psharpe/add-GLOBE-to-unified-recipe
commit 03d9180347a68bf1071efee03bf90ef7b0aad883
Merge: 4f79a781 cbb1bf16
Author: Mehdi Ataei <ataei8@gmail.com>
Date: Mon May 4 16:22:50 2026 -0400
Merge branch 'main' into psharpe/add-GLOBE-to-unified-recipe
commit 4f79a781ceb87868a7f9f6b1723e4a2fea6951ae
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Mon May 4 16:21:38 2026 -0400
Adds first cut of updating recipe to support GLOBE, and to be Mesh-native.
commit 4e33dc0b2c2be47ee098fa5b738d2f7cce00b16c
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Mon May 4 10:44:27 2026 -0400
import sort
commit 80b760788e617482bddb62f8b0d8301c921584b8
Merge: 5716659f 260ddfb1
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Fri May 8 19:54:07 2026 +0000
Merge branch 'main' into psharpe/GLOBE-dev
commit 5716659f3209a9b4c5e35ffa076320081ffb0163
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Fri May 8 12:40:27 2026 -0400
fixes old renames
commit 73312c90befd8567e3c9bcd9827b8907769ea994
Merge: 76f2a747 fb402c99
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Wed May 6 13:24:25 2026 -0400
Merge branch 'main' into psharpe/add-GLOBE-3D-BarnesHut
commit 76f2a74752f66c6d41b2f032cdd29836036ef209
Merge: bbaad544 8cbd6cb6
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Tue May 5 12:26:44 2026 -0400
Merge branch 'main' into psharpe/add-GLOBE-3D-BarnesHut
commit bbaad544f6ba6de16d0005906f9f34a9673b9c89
Merge: 560f5ff7 64fecb91
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Tue May 5 11:15:08 2026 -0400
Merge branch 'psharpe/add-GLOBE-drivaerml-standalone' into psharpe/add-GLOBE-3D-BarnesHut
commit 64fecb919d16758e087c2551f4b15ef79297b73c
Merge: 87c64c22 9f279dbc
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Tue May 5 11:04:51 2026 -0400
Merge branch 'main' into psharpe/add-GLOBE-drivaerml-standalone
commit 87c64c228869d731173fb0e29e9fd1a2d8c7dfa6
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Tue May 5 11:02:19 2026 -0400
Adds an edge case guard
commit 2958dcea5e0a7b6ba4dd797f56aa20f57c52d783
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Tue May 5 10:43:59 2026 -0400
docs on sample IDs
commit a1b8337a56fdd950796978aedc11d2e6bfe01f2c
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Tue May 5 10:42:27 2026 -0400
fixes patience edge case
commit 29813c2f450029cf8147790582076509dcd7708b
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Tue May 5 10:09:23 2026 -0400
fix old arg name
commit c4da8f27ae6321e5da50f497423fb4e5bd4b29f7
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Tue May 5 10:00:53 2026 -0400
Add expected training behavior section to drivaer README
- Included details on reference hardware, wall time per epoch, peak VRAM usage, and training/validation loss metrics at various epochs.
- This addition aims to provide users with a benchmark for evaluating their training runs.
commit f9ac3c633d86c4da588913f8b151afbb8d71a766
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Tue May 5 09:51:54 2026 -0400
Integrate Profiler into training scripts for enhanced performance monitoring
- Added Profiler import and initialization in train.py for both airfrans and drivaer examples.
- Updated profiling logic to enable and configure the profiler based on conditions.
- Adjusted the training loop to ensure profiler steps are called correctly.
This enhancement allows for better performance tracking during model training.
commit 733e5a0f456b46d35c73ae8859cd6b18f027ae64
Merge: 5f78ba6e 93d13424
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Tue May 5 09:13:47 2026 -0400
Merge branch 'main' into psharpe/add-GLOBE-drivaerml-standalone
commit 5f78ba6e877d4a944be4093fd210a390c42bf30e
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Mon May 4 18:59:09 2026 -0400
adds dataset link
commit 560f5ff743c4abdd41b50283bacd3340a153f5f7
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Sun May 3 21:16:06 2026 -0400
Squashed commit of the following:
commit 91a942b6e0e75951907d2a21b49d78a3c837e78f
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Sun May 3 21:12:32 2026 -0400
Adds Greptile minor fixes
commit b24f9b6e99e2abed1e9b6f530a108cbe1a99ca44
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Sun May 3 20:56:24 2026 -0400
Back-merges dataset interrogate fix
commit 6ddfb5ae569e59b7872184fe0f4838b3039a3735
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Sun May 3 20:43:01 2026 -0400
Removes accidentally-committed benchmarks; these will come later
commit 9fa0b5d2744471d5b33baf7a6927ec19f2aa0153
Merge: 3e67057c 4c52a453
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Sun May 3 20:37:57 2026 -0400
Merge branch 'main' into psharpe/add-GLOBE-drivaerml-standalone
commit 3e67057c596760d9aecf0505b5f9de19e98cc9bd
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Sun May 3 20:35:32 2026 -0400
Partial merge from add-GLOBE-3D-BarnesHut
commit 4c52a4534c87290a4e0f9149c2942cbe097f95ad
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Sun May 3 16:38:28 2026 -0400
Synchronizes `GLOBE` model progress for 26.05 (#1595)
* Migrate cached_dataset.py
* verified model arch new features (self_regularization_beta)
* minor formatting syncs
* Adds nonregression testing
* Adds compile_logging utilities and prefetching utilities
* Adds self to pade.py codeowners
* Syncs AirFRANS updates
* corrects a docstring
* Strips out broken ram caching
* Adds helpful error messages
* Adds helpful error messages
* docs
* Refactor compile logging in training script
- Removed the CompileDiagnosticsCollector and replaced it with a new utility function, silence_compile_logs_on_non_zero_ranks, to suppress non-error logs from torch.compile on all ranks except rank 0.
- Updated the training script to call this new function, improving log clarity during distributed training.
- Adjusted logging levels for the globe logger to ensure proper diagnostics are captured only during the first launch.
* Enhance DataLoader worker configuration for distributed training
- Updated the logic for auto-computing `num_workers` in the `AirFRANSDataSet` class to consider CPU affinity and local world size, improving efficiency in distributed environments.
- Adjusted logging to provide detailed information about the computed `num_workers`, including CPU count and GPU visibility.
- Modified the run script comments to reflect the new method of calculating `num_workers`, ensuring clarity on process-level parallelism.
* Partial merge from add-GLOBE-3D-BarnesHut
commit 4cb586ab3750c2a79fa23c39a6929ccf62fcbc53
Merge: 645701fc ed855da4
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Sun May 3 15:23:50 2026 -0400
Merge branch 'main' into psharpe/add-GLOBE-model-progress
commit 645701fc7af2fd0ba8ffc97da9b9c8698d9dec64
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Sun May 3 15:06:55 2026 -0400
Partial merge from add-GLOBE-3D-BarnesHut
commit 15d7913a1623f5f9afe61e0257007898babf4a72
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Sun May 3 13:45:06 2026 -0400
Enhance DataLoader worker configuration for distributed training
- Updated the logic for auto-computing `num_workers` in the `AirFRANSDataSet` class to consider CPU affinity and local world size, improving efficiency in distributed environments.
- Adjusted logging to provide detailed information about the computed `num_workers`, including CPU count and GPU visibility.
- Modified the run script comments to reflect the new method of calculating `num_workers`, ensuring clarity on process-level parallelism.
commit 65675a42e1b4617f088fd079f999bec3efe3a4b0
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Fri May 1 17:53:47 2026 -0400
Refactor compile logging in training script
- Removed the CompileDiagnosticsCollector and replaced it with a new utility function, silence_compile_logs_on_non_zero_ranks, to suppress non-error logs from torch.compile on all ranks except rank 0.
- Updated the training script to call this new function, improving log clarity during distributed training.
- Adjusted logging levels for the globe logger to ensure proper diagnostics are captured only during the first launch.
commit 948da86713ea492c599d14e9f55981ab1e3d7549
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Fri May 1 15:40:31 2026 -0400
docs
commit ed855da4116d87174a17db0911413648f383cef3
Author: Charlelie Laurent <84199758+CharlelieLrt@users.noreply.github.com>
Date: Thu Apr 30 20:44:37 2026 -0700
Implements Predictor specialization for multi-diffusion (#1573)
* Implements Predictor specialization for multi-diffusion
Signed-off-by: Charlelie Laurent <claurent@nvidia.com>
* Compile denoiser in multi-diffusion sampling compile tests
Compiling the predictor instance directly was producing divergent results
under torch 2.10 in the sample() loop (euler cases only). Follow the same
pattern as test_samplers.py::TestSampleCompile and compile the denoiser
closure instead — tracing through it still verifies that the predictor's
__call__ path is compile-compatible.
Signed-off-by: Charlelie Laurent <claurent@nvidia.com>
* Avoid fullgraph compile in multi-diffusion sampling test
torch 2.10 Dynamo crashes with Fatal Python error: Aborted when tracing
the nested MultiDiffusionPredictor -> MultiDiffusionModel2D call chain
inside sample() with fullgraph=True. Allow graph breaks here; the
predictor compile contract is still tested in isolation by
test_multi_diffusion_predictor.py::TestCompile.
Signed-off-by: Charlelie Laurent <claurent@nvidia.com>
* Flatten MultiDiffusionPredictor hot path for torch.compile
Dispatch on pos_embd presence and model_kwargs is now resolved once at
__init__ into a specialized closure, so __call__ is branch-free and the
no-kwargs path avoids ** expansion. This keeps fullgraph=True compile
cleanly traceable under torch 2.10 (which was hitting a Dynamo abort on
the nested MultiDiffusionPredictor -> MultiDiffusionModel2D call chain
when the denoiser closure was compiled in the sample() loop).
Signed-off-by: Charlelie Laurent <claurent@nvidia.com>
* Loosen torch.compile euler check in multi-diffusion sampling tests
Reverts the two earlier CI-fix attempts (compile-denoiser switch, predictor
hot-path flatten) since neither actually fixed the divergence. The
underlying issue is an upstream torch>=2.10 Dynamo bug: euler + compiled
MultiDiffusionPredictor produces numerically divergent results. Heun works,
predictor compiles correctly in isolation. For euler we now assert only
shape + isfinite until the upstream bug is resolved.
Signed-off-by: Charlelie Laurent <claurent@nvidia.com>
* Force contiguous t_cur/t_next in Euler solvers
sample() passes t.expand(B) (a stride-0 non-contiguous tensor) into
solver.step(). HeunSolver already forces .contiguous() on both tensors to
prevent torch.compile from specializing on the stride pattern of the first
call and then either mis-firing guards or silently recompiling on
subsequent calls with different underlying storage.
EulerSolver and EDMStochasticEulerSolver had no such guard, which was a
latent bug exposed by torch 2.10 (stricter stride tracking) in the
multi-diffusion compiled sample loop — producing 90%+ element divergence
vs eager on the first call and a Dynamo abort on the second call. Apply
the same fix uniformly across all four solver steps.
Also revert the temporary loosened euler assertion in
test_multi_diffusion_sampling.py now that the real fix is in place.
Signed-off-by: Charlelie Laurent <claurent@nvidia.com>
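The stride guard described in this bullet can be sketched generically (step shape and denoiser signature are illustrative, not the actual solver API): `t.expand(B)` is a stride-0 view, and a compiled step can specialize on that stride pattern; forcing `.contiguous()` normalizes the layout before any tracing-sensitive work:

```python
import torch

def euler_step(x, denoiser, t_cur, t_next):
    # Normalize possibly stride-0 expanded tensors (e.g. t.expand(B))
    # so torch.compile does not specialize on their stride pattern.
    t_cur = t_cur.contiguous()
    t_next = t_next.contiguous()
    d = (x - denoiser(x, t_cur)) / t_cur.view(-1, 1)
    return x + (t_next - t_cur).view(-1, 1) * d
```
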
* Drop dead is_compiling guard and inherit from Predictor in MultiDiffusionPredictor
Signed-off-by: Charlelie Laurent <claurent@nvidia.com>
* Narrow _patching type and tighten multi-diffusion tests
Move the _patching None check out of the is_compiling guard in
MultiDiffusionModel2D so the type checker narrows self._patching
to RandomPatching2D | GridPatching2D for the rest of each method,
and route fuse/reset_patch_indices through isinstance.
Streamline TestConstructor to only exercise the public contract
(.fuse, .model, setter round-trip) and drop assertions on private
caches. Compile the denoiser instead of the predictor in
TestMultiDiffusionSampleCompile and add TestMultiDiffusionFullSamplerCompile
mirroring test_samplers.py.
Signed-off-by: Charlelie Laurent <claurent@nvidia.com>
* Force contiguous pos_embd before patching
pos_embd.unsqueeze(0).expand(B, -1, -1, -1) produces a stride-0 view
(all B copies share storage). Passing this through nn.ReflectionPad2d
and F.unfold inside image_batching triggers a glibc heap corruption
on torch 2.10 (CI, not locally on torch 2.8) when the first non-regression
posembd_sin test runs. Same class of fix as the earlier euler solver.
Signed-off-by: Charlelie Laurent <claurent@nvidia.com>
* Use functional F.pad in image_batching
Instantiating torch.nn.ReflectionPad2d inside image_batching on every
call creates a fresh nn.Module each time, which torch.compile / AOT
autograd struggles to trace cleanly under fullgraph=True on torch 2.10.
Switch to torch.nn.functional.pad which is a plain functional call and
traces without allocating a module. Same result semantically.
Signed-off-by: Charlelie Laurent <claurent@nvidia.com>
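The swap described in this bullet is a minimal sketch: `F.pad(..., mode="reflect")` is semantically equivalent to `torch.nn.ReflectionPad2d`, but it is a plain functional call, so no fresh `nn.Module` is allocated per invocation:

```python
import torch
import torch.nn.functional as F

def pad_reflect(x: torch.Tensor, pad: int) -> torch.Tensor:
    # Equivalent to torch.nn.ReflectionPad2d(pad)(x), without building a module.
    return F.pad(x, (pad, pad, pad, pad), mode="reflect")
```
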
* Replace einops.rearrange with native torch reshape+permute
einops.rearrange goes through a pattern-matched lowering path that
torch.compile / inductor on torch 2.10 handles fragilely in the
image_batching / image_fuse hot paths. The underlying transform is a
plain view + permute + view, so express it directly: this gives inductor
a straightforward sequence of ops to trace, and drops the einops
dependency from this module.
Signed-off-by: Charlelie Laurent <claurent@nvidia.com>
* Materialise returned tensors in multi-diffusion fuse path
Under torch.compile / inductor on torch 2.10, a compiled sample() call
through MultiDiffusionPredictor was returning a tensor whose metadata
was valid but whose data pointer was dangling (use-after-free) — the
caller SIGABRTed on the first read of the tensor data. Add .contiguous()
at the two boundaries that returned a view: image_fuse returns
x_folded[...] / overlap_count[...], and MultiDiffusionModel2D.forward
returns the (possibly fused) inner-model output. Forcing fresh storage
on each boundary prevents the returned tensor from aliasing a buffer
whose lifetime ends with the compiled frame.
Signed-off-by: Charlelie Laurent <claurent@nvidia.com>
* Use clone instead of contiguous at fuse boundary
The second torch.compile call of a fused MultiDiffusionPredictor was
segfaulting (SIGSEGV) while the first succeeded. .contiguous() is a
no-op when the tensor is already contiguous, so inductor could still
see the returned tensor as aliasing an internal buffer across calls.
.clone() always allocates fresh storage, so successive compiled calls
get independent outputs. Also drop the redundant .contiguous() added
earlier in MultiDiffusionModel2D.forward now that image_fuse owns that
boundary.
Signed-off-by: Charlelie Laurent <claurent@nvidia.com>
* Revert speculative fuse-boundary copies and xfail full-sampler compile on torch>=2.10
Revert commits 3dfcdb51, 746518ff and a007c469 (native-torch rearrange
in image_batching/image_fuse, .contiguous() on returned tensors, .clone()
at fuse boundary) since they did not resolve the torch 2.10 inductor
codegen segfault in TestMultiDiffusionFullSamplerCompile. Keep commits
7e1db11c (pos_embd .contiguous() for the glibc heap corruption in
posembd_sin non-regression tests) and feb0d9e4 (ReflectionPad2d → F.pad).
Gate TestMultiDiffusionFullSamplerCompile with xfail(run=False) when
torch>=2.10 so the SIGSEGV does not bring down the pytest process.
TestMultiDiffusionSampleCompile (per-step denoiser compile) still runs.
Signed-off-by: Charlelie Laurent <claurent@nvidia.com>
* Minor updates to predictor.py
Signed-off-by: Charlelie Laurent <claurent@nvidia.com>
* Drop redundant _patching_type and add test-time-only docstring warning
Signed-off-by: Charlelie Laurent <claurent@nvidia.com>
---------
Signed-off-by: Charlelie Laurent <claurent@nvidia.com>
commit 16a336f9ac267ffa6b2fee8e02d0d21dd104007a
Author: Peter Harrington <48932392+pzharrington@users.noreply.github.com>
Date: Thu Apr 30 00:03:24 2026 -0700
FSDP optimizer state channels last fix (#1597)
* Fix channels last FSDP optimizer state load bug
* lint
* Catch use_orig_params=True case
commit 845906f4d0846061b516749d0f1ee450c16aca91
Author: Peter Harrington <48932392+pzharrington@users.noreply.github.com>
Date: Wed Apr 29 23:59:29 2026 -0700
Add HealDA dataloader protocols and init recipe (#1555)
* Add healda protocols and loaders to experimental
* Cleanup and address imports
* Update precommit for examples tests
* integrate restartable sampler, other updates, migrate tests
* move imports, cleanup
* ruff check fix
* skip prefetch on CPU
* Rename to local_platform
* Revert precommit change
* greptile feedback
* Migrate CSVs and deps to example
* lockfile fix
commit 91a942b6e0e75951907d2a21b49d78a3c837e78f
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Sun May 3 21:12:32 2026 -0400
Adds Greptile minor fixes
commit b24f9b6e99e2abed1e9b6f530a108cbe1a99ca44
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Sun May 3 20:56:24 2026 -0400
Back-merges dataset interrogate fix
commit f03f7218c02cfbd517d176ffc722accdbf343501
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Sun May 3 20:54:30 2026 -0400
Refactor DrivAerMLSample class to inherit from TensorClass and clean up unused imports. This change enhances type handling and maintains code clarity.
commit 6ddfb5ae569e59b7872184fe0f4838b3039a3735
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Sun May 3 20:43:01 2026 -0400
Removes accidentally-committed benchmarks; these will come later
commit 9fa0b5d2744471d5b33baf7a6927ec19f2aa0153
Merge: 3e67057c 4c52a453
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Sun May 3 20:37:57 2026 -0400
Merge branch 'main' into psharpe/add-GLOBE-drivaerml-standalone
commit 3e67057c596760d9aecf0505b5f9de19e98cc9bd
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Sun May 3 20:35:32 2026 -0400
Partial merge from add-GLOBE-3D-BarnesHut
commit 9d582ad73a7204797c859bf0fa195bee63c20064
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Sun May 3 22:38:53 2026 +0000
default to expand far targets
commit a297fb9f3e332a442649a09c0621915089c43abe
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Sun May 3 22:35:13 2026 +0000
sync with airfrans
commit 0c0c0280fd86f499403c0171fa371eeada4b4e38
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Sun May 3 22:33:34 2026 +0000
sync with airfrans
commit 4cb586ab3750c2a79fa23c39a6929ccf62fcbc53
Merge: 645701fc ed855da4
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Sun May 3 15:23:50 2026 -0400
Merge branch 'main' into psharpe/add-GLOBE-model-progress
commit 645701fc7af2fd0ba8ffc97da9b9c8698d9dec64
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Sun May 3 15:06:55 2026 -0400
Partial merge from add-GLOBE-3D-BarnesHut
commit b0fb9d299046fb65c24e781433d849f3c644b58a
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Sun May 3 19:00:02 2026 +0000
Squashed commit of the following:
commit 15d7913a1623f5f9afe61e0257007898babf4a72
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Sun May 3 13:45:06 2026 -0400
Enhance DataLoader worker configuration for distributed training
- Updated the logic for auto-computing `num_workers` in the `AirFRANSDataSet` class to consider CPU affinity and local world size, improving efficiency in distributed environments.
- Adjusted logging to provide detailed information about the computed `num_workers`, including CPU count and GPU visibility.
- Modified the run script comments to reflect the new method of calculating `num_workers`, ensuring clarity on process-level parallelism.
commit 65675a42e1b4617f088fd079f999bec3efe3a4b0
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Fri May 1 17:53:47 2026 -0400
Refactor compile logging in training script
- Removed the CompileDiagnosticsCollector and replaced it with a new utility function, silence_compile_logs_on_non_zero_ranks, to suppress non-error logs from torch.compile on all ranks except rank 0.
- Updated the training script to call this new function, improving log clarity during distributed training.
- Adjusted logging levels for the globe logger to ensure proper diagnostics are captured only during the first launch.
commit 948da86713ea492c599d14e9f55981ab1e3d7549
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Fri May 1 15:40:31 2026 -0400
docs
commit 4f2254d0de5991a835f5617111ac11ce588efff4
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Sun May 3 18:42:52 2026 +0000
Refactor Kernel class in field_kernel.py by removing the add_semantics method and optimizing the output processing. The changes streamline the handling of tensor semantics and improve performance by constructing per-field outputs in a single pass, addressing potential issues with PyTorch's tensor operations.
commit 459e3c0a3e63ae21a08c05c639665ff96c19a47e
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Sun May 3 17:47:39 2026 +0000
sync onto td 0.12.2
commit 15d7913a1623f5f9afe61e0257007898babf4a72
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Sun May 3 13:45:06 2026 -0400
Enhance DataLoader worker configuration for distributed training
- Updated the logic for auto-computing `num_workers` in the `AirFRANSDataSet` class to consider CPU affinity and local world size, improving efficiency in distributed environments.
- Adjusted logging to provide detailed information about the computed `num_workers`, including CPU count and GPU visibility.
- Modified the run script comments to reflect the new method of calculating `num_workers`, ensuring clarity on process-level parallelism.
commit 65675a42e1b4617f088fd079f999bec3efe3a4b0
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Fri May 1 17:53:47 2026 -0400
Refactor compile logging in training script
- Removed the CompileDiagnosticsCollector and replaced it with a new utility function, silence_compile_logs_on_non_zero_ranks, to suppress non-error logs from torch.compile on all ranks except rank 0.
- Updated the training script to call this new function, improving log clarity during distributed training.
- Adjusted logging levels for the globe logger to ensure proper diagnostics are captured only during the first launch.
commit 948da86713ea492c599d14e9f55981ab1e3d7549
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Fri May 1 15:40:31 2026 -0400
docs
commit 7992acc211470de1e21d446a9659a4fb70901a21
Merge: dd801555 aec5a3aa
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Wed Apr 29 16:11:47 2026 -0400
Merge branch 'psharpe/add-GLOBE-model-progress' into psharpe/add-GLOBE-3D-BarnesHut
commit aec5a3aa5dfae3717c5daf97b24a196ed056286a
Merge: 973bdd53 645fb0aa
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Wed Apr 29 16:10:53 2026 -0400
Merge branch 'main' into psharpe/add-GLOBE-model-progress
commit 973bdd53c4cee8b575e75172085f33bf7435a485
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Wed Apr 29 15:57:31 2026 -0400
Adds helpful error messages
commit 65a9c54418be92af19324d66062e53406a6b1137
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Wed Apr 29 15:56:58 2026 -0400
Adds helpful error messages
commit 8bde77a319a64744c2fc6028a58b7fd43726a5d8
Merge: 9a8f056c a8a0739a
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Wed Apr 29 10:00:32 2026 -0400
Merge branch 'main' into psharpe/add-GLOBE-model-progress
commit dd8015555b4a5d368a816e29a79d1d5e283bb921
Merge: 9d63b124 9a8f056c
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Mon Apr 27 17:13:46 2026 -0400
Merge branch 'psharpe/add-GLOBE-model-progress' into psharpe/add-GLOBE-3D-BarnesHut
commit 9a8f056cde5589601568fd2ed784f0f515e79e4e
Merge: 61f5cc00 fc088545
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Mon Apr 27 17:13:26 2026 -0400
Merge branch 'main' into psharpe/add-GLOBE-model-progress
commit 9d63b124c33074ec5447afc76389ba78d55b9791
Merge: 56db224a 61f5cc00
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Mon Apr 27 16:49:27 2026 -0400
Merge psharpe/add-GLOBE-model-progress into psharpe/add-GLOBE-3D-BarnesHut
Brings in the corrected Pade docstring, the AirFRANSSample TensorClass-
inheritance refactor, and the removal of the broken RAM caching path.
Note: git's recursive auto-merge of cached_dataset.py duplicated the
'sample_paths is empty' validation block (line aligner confused by both
branches removing self.use_ram_caching above it). One copy was removed
during conflict resolution. Resulting blob is byte-identical to MERGE_HEAD.
commit 61f5cc00b27e848a6a48efc2c03d52affdf29cc9
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Mon Apr 27 16:27:06 2026 -0400
Strips out broken ram caching
commit dd331c9e0aeadd520fa5aa282f11483cedd61875
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Mon Apr 27 15:25:09 2026 -0400
corrects a docstring
commit 7d4a113d5312a8170a7d9f7374f7f83aa9bcec6a
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Mon Apr 27 12:05:29 2026 -0400
Syncs AirFRANS updates
commit ebaa00ea1f1cb05bb651bc5b95bcd8039cd02e55
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Mon Apr 27 12:00:05 2026 -0400
Adds self to pade.py codeowners
commit fd7d4695a6384551c5903f6c707ec61bb3a06088
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Mon Apr 27 11:55:32 2026 -0400
Adds compile_logging utilities and prefetching utilities
commit 783b849378313126a01e0012dc5c8ea7f882f7ed
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Mon Apr 27 11:50:48 2026 -0400
Adds nonregression testing
commit d1209c0225d4c75c8b3551ea636bea0dc052ec0e
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Mon Apr 27 11:47:49 2026 -0400
minor formatting syncs
commit 83cd333133cc8ced126abdd54f2bbd282965bf45
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Mon Apr 27 11:47:33 2026 -0400
verified model arch new features (self_regularization_beta)
commit f69179b5442226b3bb6e4317f03c92f3d8010995
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Mon Apr 27 11:45:59 2026 -0400
Migrate cached_dataset.py
commit 56db224aed395535ab0b790babb9846d78d0bcff
Merge: 096d3913 3c396d43
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Mon Apr 27 11:12:59 2026 -0400
Merge branch 'main' into psharpe/add-GLOBE-3D-BarnesHut
commit 096d3913862726bb4210edb5e3c1b7e78b1b0a8d
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Fri Apr 24 15:28:29 2026 +0000
default to 128x3
commit 69db12e4ed2046f016f8b8e00460a8594684df46
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Fri Apr 24 15:27:32 2026 +0000
forward expand far targets
commit 86757d757ca801870cd5d01787b165621b475a04
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Fri Apr 24 15:26:43 2026 +0000
removes smooth_pade
commit 34dcff5cc4e77a0410e9a274d29024044df8afdf
Merge: ce0dbb85 8a02bced
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Fri Apr 24 15:22:55 2026 +0000
Merge branch 'main' into psharpe/add-GLOBE-3D-BarnesHut
commit ce0dbb8500f3d085e2ac193d744308a6b27fef08
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Fri Apr 24 15:21:42 2026 +0000
syncs changes to train.py
commit af594ccc9b84d91d143f1085af86dd5efff45451
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Fri Apr 24 15:20:10 2026 +0000
Adds `prediction_chunk_size` parameter to GLOBE model for memory-efficient evaluation. This allows chunked processing of prediction points, improving performance when handling large datasets. Updates documentation to reflect new parameter and its usage.
commit 62e870e79a7ded5444de303b81d0b60a302181b9
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Fri Apr 24 15:01:01 2026 +0000
formatting
commit b15a414d44826ed7d918b43bb43c430e7de2b66b
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Fri Apr 24 15:00:33 2026 +0000
formatting
commit 002cc44d2729746304499075d4fd8880fdc7be94
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Fri Apr 24 14:59:32 2026 +0000
adds new params to airfrans
commit 99955a114de94731e43d1c0d25c24e24c305949f
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Fri Apr 24 14:51:07 2026 +0000
formatting
commit fc4761d8187863d10e0475a01a51bcded8224886
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Fri Apr 24 14:49:21 2026 +0000
Adds in more pade kernel capabilities
commit fd68c8f624caaed8fa186bfa6a9c35b355f9a88d
Merge: 6a8ab360 59321561
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Tue Apr 21 09:50:44 2026 -0400
Merge branch 'main' into psharpe/add-GLOBE-3D-BarnesHut
commit 6a8ab36035c2640013fae55fdb55cae57b46ac0f
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Mon Apr 20 16:25:00 2026 +0000
switch airfrans to theta 0
commit 7dfb8cafc1bfef610d396bfd5e0995f2732cabd0
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Mon Apr 20 15:12:26 2026 +0000
formatting
commit 57e06a48ad13c5ef1d30a37b3d77612210f7fc82
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Mon Apr 20 15:11:29 2026 +0000
formatting
commit ceeee52be6c448b342cb4bffd5a471e39b6eae74
Merge: 625e43e2 8b1e3e8b
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Wed Apr 8 11:42:27 2026 -0400
Merge branch 'psharpe/add-GLOBE-3D-BarnesHut' of https://github.com/peterdsharpe/physicsnemo into psharpe/add-GLOBE-3D-BarnesHut
commit 625e43e24da748556927df9701795a64103279b2
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Mon Apr 6 14:38:40 2026 -0400
Update theta effect visualization script to enhance configuration and error evaluation
- Increased the number of source points from 20 to 60 and grid resolution from 128 to 256 for improved accuracy.
- Refactored mode configurations to use tuples for better clarity and added new labels and colors for modes.
- Updated error evaluation logic to accommodate the new mode structure, ensuring consistent handling across all configurations.
- Enhanced plotting logic to reflect changes in mode handling and improve visualization clarity.
commit 8b1e3e8b23bfccc86a5f9f650c1ef1b4d0c2b984
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Mon Apr 6 18:27:17 2026 +0000
shutdown earlier
commit cf58894e0f78e622231b14d51a6232bc621cb7a0
Merge: 21a99804 f124d964
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Wed Apr 1 14:32:13 2026 +0000
Merge branch 'psharpe/add-GLOBE-3D-BarnesHut' of https://github.com/peterdsharpe/physicsnemo into psharpe/add-GLOBE-3D-BarnesHut
commit 21a99804427a75f2eea06c1b4a0b4043be88f522
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Wed Apr 1 14:31:58 2026 +0000
Refactor hyperparameter logging in GLOBE examples
- Removed reliance on `inspect` for constructor parameters in `log_hyperparameters`.
- Updated to use canonical constructor arguments from `model._args["__args__"]`, improving robustness and avoiding silent drops of untracked parameters.
commit ca805ad45240e8ba51cd8d6880af5b3e677ce5f7
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Wed Apr 1 14:18:49 2026 +0000
Adds a backwards-compat self-regularization-beta parameter.
commit 06482befa2c6d7f47a108a09e682303253096dd8
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Wed Apr 1 14:18:18 2026 +0000
Adds benchmark_accuracy
commit f124d964ff1c57ccc7d7e6015e70dae69f66325b
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Tue Mar 31 22:06:11 2026 -0400
Refactor 3D visualization setup in theta effect scripts
- Updated the figure creation process in `theta_effect.py` and `theta_effect_nonuniform_strength.py` to use `p.figure3d` for improved aspect ratio handling.
- Removed redundant box aspect setting to streamline the visualization code.
commit 7436d7026c522cabbca2f3d0f11962848bb6997a
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Tue Mar 31 22:04:27 2026 -0400
Add expand_far_targets option to ClusterTree and kernel classes
- Introduced an optional parameter `expand_far_targets` in the `ClusterTree`, `BarnesHutKernel`, and `MultiscaleKernel` classes to allow for the expansion of far-field target nodes to individual points. This change enhances the accuracy of interactions by eliminating target-side centroid approximations at the cost of additional kernel evaluations.
- Updated relevant documentation to reflect the new parameter and its implications on performance and accuracy.
commit c5c942af35f33a86b35e9070d87a3352f6b18568
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Tue Mar 31 21:51:20 2026 -0400
Add visualization script for the effect of the Barnes-Hut theta parameter on GLOBE's field kernel
- Introduced a new Python script that generates multiple visualizations, including ClusterTree spatial hierarchies and kernel scalar field comparisons.
- The script evaluates the impact of varying theta values on computational cost and approximation errors, enhancing understanding of the Barnes-Hut algorithm's performance.
commit a631ef89047d93a76a875fce7ff56814cd1d8b47
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Tue Mar 31 20:44:17 2026 -0400
Adds visualization scripts
commit 379c02197d9dbdc5fdc8c0825276f863bc329b5f
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Thu Mar 26 15:22:27 2026 +0000
docs
commit c6e0b9de2c7c461b3b0dcd01ccc924d9f29938a0
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Thu Mar 26 15:19:44 2026 +0000
Update MLflow launch scripts to specify ports for UI instances
- Added port specifications to the MLflow UI launch commands in both `mlflow_launch.sh` scripts for `airfrans` and `drivaer`, ensuring that each instance runs on a unique port (5001 for airfrans and 5002 for drivaer).
commit bf43029c1588c93a7b96ffd3224425cdbcbf48ac
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Wed Mar 25 19:21:31 2026 -0400
Adds Mesh dimensionality annotations for readability.
commit 3dac89498109d1684bbbf744e78c0a0512cfdbdf
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Tue Mar 24 16:34:36 2026 +0000
Moves benchmarks into explicitly resource-focused benchmarking, to distinguish them from accuracy benchmarks.
commit 07ce3efb1f3231e1877cc4ba149918d17b99adf3
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Mon Mar 23 00:45:11 2026 +0000
Disable bytecode-compile; let this happen lazily
commit 0f7ac796387cf2a3ca7dec2cb1976b8b993bbc41
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Sun Mar 22 16:16:14 2026 -0400
Refactor dataclass usage to tensorclass in ClusterTree
- Replaced `@dataclass` with `@tensorclass` for `DualInteractionPlan` and `SourceAggregates` to enhance performance and compatibility with tensor operations.
- This change aligns with the ongoing optimization efforts in the GLOBE model's data handling.
commit 1f5d7656d8815b6de72f834a998e3f34179eb55e
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Sun Mar 22 20:14:37 2026 +0000
Optimize source scalar and vector gathering in BarnesHutKernel
- Refactored the gathering of source scalars and vectors to reduce GPU kernel calls, improving performance.
- Implemented pre-flattening of source scalars and vectors using `concatenate_leaves`, followed by efficient indexing.
- Adjusted the handling of gathered vectors to ensure compatibility with the feature engineering pipeline in `_evaluate_interactions`.
commit 6dd835845b4edae2f22ab04b392040900f07cd37
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Sun Mar 22 12:38:04 2026 -0400
EXPERIMENTAL changes that disable lazy compilation from GLOBE kernels
commit 7948120728cc8604fe58e79dbcc957704aec8ab0
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Sun Mar 22 16:31:33 2026 +0000
Enhance CompileDiagnosticsCollector for improved logging of graph breaks
- Introduced a new internal class `_BreakRecord` to encapsulate details of graph break events, including the reason, user caller, and raw message.
- Updated the `filter` method to parse and store detailed information about graph breaks and recompiles, improving the clarity of logged messages.
- Added a new method `_extract_user_caller` to identify the nearest user code caller from the traceback, enhancing the context provided in logs.
- Modified the `summary` method to display user caller information alongside graph break reasons, aiding in debugging and analysis.
commit f223f790633c5a55db5c8dbeb8a88dcc3f3c3941
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Sun Mar 22 15:49:49 2026 +0000
Add non-regression tests for GLOBE model
- Introduced a new test suite in `test_nonregression.py` to validate the GLOBE model's forward pass against saved reference outputs, ensuring numerical consistency.
- Implemented functionality to generate and save reference data on first run, with subsequent tests comparing outputs to this reference.
- Added a gradient flow test to verify that gradients propagate correctly through the model during training.
- Created a binary reference output file `globe_nonregression_output.pth` for comparison in tests.
commit 851ce5f2d85d8e430a2b137d46020163493e83bc
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Sun Mar 22 15:47:58 2026 +0000
save every 1
commit 9f4d49f8037030c3fda1beb2a0e8465f6764b9a2
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Sun Mar 22 11:34:56 2026 -0400
Add validation method to DualInteractionPlan and corresponding tests
- Implemented a `validate` method in the `DualInteractionPlan` class to check for internal consistency, including shape pairing, non-negativity, and bounds for broadcast tensors.
- Added unit tests for the `validate` method to ensure correct behavior, including checks for valid plans, shape mismatches, out-of-bounds errors, and negative counts.
- Integrated validation call in the `find_dual_interaction_pairs` method to enforce checks during execution.
commit 53f6a1a9ccd46b306f91e547b5d6290cfb1308f2
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Sun Mar 22 11:09:07 2026 -0400
Add fixture to disable torch.compile during tests
- Introduced a new pytest fixture `_disable_torch_compile` to prevent per-test compilation overhead by disabling torch.compile.
- This change ensures that tests run with zero compilation cost while maintaining identical numerical behavior.
commit 652b0c28be48bbe70cbb39629fa89bf0b57a656a
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Sun Mar 22 01:22:53 2026 -0400
Remove unused total parameter from _ragged_arange call in BarnesHutKernel class
commit b42927132780a65457cd8cbb73a35eb56b5f9aa8
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Sun Mar 22 05:02:43 2026 +0000
Add leaf node ID and segment ID attributes to ClusterTree
- Introduced `leaf_node_ids` and `leaf_seg_ids` to optimize source aggregation by precomputing leaf indices during tree construction.
- Updated relevant methods to utilize these new attributes, improving performance and maintaining compatibility with `torch.compile`.
- Enhanced documentation to reflect the new properties and their usage.
commit 0ad3271344bb9aaa11f87d6f629d3564707aafa4
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Sun Mar 22 04:57:32 2026 +0000
sets a default LR of 1e-2.
commit 71ab33296b3ea74e34597d3ef54de95f3942ee95
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Sun Mar 22 04:56:46 2026 +0000
Expands out kwarg splat, which was causing a torch.compile graph break.
commit 56607fc99dfcaf43e0aaaa905f39c1d88ddf26ee
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Sun Mar 22 04:52:10 2026 +0000
Add precomputed leaf field consistency tests for ClusterTree
- Introduced tests for verifying leaf_node_ids and leaf_seg_ids against expected values.
- Added checks for empty trees and single-point trees to ensure correct behavior.
- Utilized _ragged_arange for leaf_seg_ids computation consistency.
commit fd65d17815d9b6d6ed5cc707b0278065ea52c9fe
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Sun Mar 22 04:51:54 2026 +0000
Fixes ragged_arange to allow for counts of 0 while still preserving compile compatibility
commit 082a9c7fc65c5343a9e5669689fd8ce30450215c
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Sat Mar 21 22:00:26 2026 +0000
Tweaks compile diagnostics + reduced log verbosity
commit e704d4bb245316ee4c3e498a2c5aea8fe04481c2
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Sat Mar 21 21:32:28 2026 +0000
Fixes LR scaling from Moonshot paper
commit 5ab066bb1de1d07ce007e1741c632772978ab149
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Sat Mar 21 21:30:56 2026 +0000
update paths
commit 713674cb6b540262bb7c7a5e52dbe309c2b4aa01
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Sat Mar 21 21:30:24 2026 +0000
update paths
commit 179bf7ed75fb0d27c406f36d33bfff3e96edd23f
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Sat Mar 21 21:29:22 2026 +0000
Moves a lot of non-GLOBE-specific code to experimental.utils
commit 59139e4ced1d956f8914edc0a15325c546916ef6
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Sat Mar 21 21:10:21 2026 +0000
Cleans up verbose logs
commit d7a6b3015dabd13ce8524841c5a4423fd22e639f
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Sat Mar 21 21:03:15 2026 +0000
Adds CompileDiagnosticsCollector
commit eb8d912ece968656fdd039a91e01914c71e4337a
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Sat Mar 21 21:00:19 2026 +0000
Filters warnings by default
commit d479c39f96689ab6df25dce39c1a9bfddf218d39
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Sat Mar 21 20:57:42 2026 +0000
Fixes a graph break in ragged
commit fea587889b934526d3a067eebb9990260205dfde
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Sat Mar 21 20:40:50 2026 +0000
reverts erroneous rdvz changes
commit c76b78dec29d2f3d416d70b90db7490b7ed3797f
Author: Peter Sharpe <peterdsharpe@gmail.com>
Date: Sat Mar 21 19:41:25 2026 +0000
Add `patience_steps` parameter to training scripts for dynamic learning rate adjustment. Update `ReduceLROnPlateau` scheduler to use calculated patience based on training data size. Refactor world-size learning rate adjustment for consistency across different GPU configurati…
PhysicsNeMo Pull Request
This PR is a focused performance pass on the experimental GLOBE model and its training scripts. No model architecture changes, no checkpoint compat changes; all numerics are unchanged at fp32-epsilon precision (a couple of internal aggregations now happen in fp64 - details in §5).
Net results on 16x B200s, DrivAerML training with standalone recipe: per-sample time (load, forward, backward, step) goes from ~7.8 seconds to ~5.5 seconds (-30%).
TL;DR
- GPU↔CPU sync points in the hot path (`mem_get_info`, `.item()`, `if X.any():`, `tensor[bool_mask]`, …). Most have been eliminated.
- `O(tree_depth)` Python loops are gone. Per-node source aggregation and per-node strength accumulation are now single `cumsum` + range-subtract ops, exploiting the morton-sorted source ordering.
- The `BarnesHutKernel` scatter is packed: one `index_add_` per phase instead of one `scatter_add_` per phase per output field. This more than halves the backward-pass scatter cost on DrivAerML, where `indexing_backward` was ~23% of total GPU time per step.
- `GradScaler` is gone from the training scripts (it's a no-op with bf16 autocast and forces a per-step inf/NaN-check sync).
- The allocator uses `expandable_segments:True` to avoid synchronizing `cudaMalloc`/`cudaFree` round-trips under the chunked kernel workload.
- A couple of internal aggregations now happen in fp64 (relevant at DrivAerML scale: N=1M, coords ~5 m, `leaf_size=1`).
- Remaining `torch.compile` graph breaks are fixed, mostly as a side-effect of the changes above.
- A `tree_build_device` knob (`"cpu"`/`"cuda"`/`None`) covers the small-problem regime where CUDA-launch latency dominates tree work.
Why these changes matter (hot path context)
A GLOBE training step is dominated by two things:
1. Geometry prep: build a `ClusterTree` per BC type, then run a dual-tree Barnes-Hut traversal for every (src BC, dst BC) pair to produce a `DualInteractionPlan` describing which target/source point pairs interact in the near, far, and mixed-rank phases.
2. `BarnesHutKernel` evaluation: apply MLPs to those pairs and `scatter_add` the results back to per-target buffers, four times (one for each near/far phase), once per multiscale branch, once per communication hyperlayer.

Profiling showed both phases were CPU-bound on launch/sync overhead rather than GPU-bound on the actual MLP and indexing work. The changes below attack that bottleneck.
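(An aside, not part of the PR: PyTorch ships a sync-debug mode that surfaces exactly these hidden synchronizations while profiling a hot path.)

```python
import torch

# Flag unintended GPU->CPU synchronizations while profiling a training step.
# "warn" logs each synchronizing call (e.g. .item(), nonzero, mem_get_info);
# "error" raises instead. Only has an effect on a CUDA build of PyTorch.
if torch.cuda.is_available():
    torch.cuda.set_sync_debug_mode("warn")
    # ... run one training step here, then restore:
    torch.cuda.set_sync_debug_mode("default")
```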
1. Eliminating GPU↔CPU synchronization
Each entry below removes a sync point that profiling identified as significant. These are the same change in spirit (avoid materializing `int(tensor)` or branching on tensor contents in the hot path), applied to several different call sites.

1.1
`_auto_chunk_size`: `mem_get_info` → cached `total_memory`

`BarnesHutKernel._auto_chunk_size` ran on every kernel evaluation and called `torch.cuda.mem_get_info(device)`, which is a synchronizing driver query. Across the four phases × multiple branches × multiple hyperlayers, this was ~60 syncs per training step. Replaced with a cached lookup of total device memory and a fixed-fraction budget (25%). The same swap is applied to
`MultiscaleKernel.forward`'s branch-checkpoint decision.

The trade is "exactly fits in currently-free memory" → "fits in 25% of total device memory". Fine on 100+ GB devices (e.g. GB200) where the non-kernel resident state is well under the remaining 75%.
1.2 Smoothing radius: Python float → registered buffer
`Kernel._evaluate_interactions` was rebuilding a tensor on every forward pass; the radius is now stored as a registered buffer and used directly with TensorDict broadcasting.
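A minimal sketch of the buffer pattern (toy module, not the repo's `Kernel`; the actual class broadcasts through TensorDict):

```python
import torch
from torch import nn


class SmoothedKernel(nn.Module):
    """Toy kernel that keeps its smoothing radius as a registered buffer."""

    def __init__(self, smoothing_radius: float = 0.1):
        super().__init__()
        # A buffer moves with .to(device)/.half() and is a stable graph input
        # under torch.compile, so there is no per-forward tensor construction.
        self.register_buffer("smoothing_radius", torch.tensor(smoothing_radius))

    def forward(self, r2: torch.Tensor) -> torch.Tensor:
        # Softened inverse-distance weight; broadcasting handles any batch shape.
        return 1.0 / (r2 + self.smoothing_radius**2)
```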
Eager: skips a tensor allocation and a `.apply(lambda)` per forward. Compile: closes the `torch.compile` graph break.

1.3 Cluster tree construction: per-level sync → end-of-loop compaction
`ClusterTree.from_points` used to do a boolean-mask compaction of its segment arrays on every level of every tree. That's a CPU-GPU sync per level × per tree (the in-code comment notes ~16 levels × 4 trees × 28 samples for DrivAerML). The loop now accumulates
`(seg_node_ids, seg_starts, seg_sizes, is_leaf_seg)` per level and pays one boolean compaction after the loop.

The previous helpers
`_fill_leaf_aabbs` and `_fill_leaf_total_areas` also each called `int(leaf_sizes.sum())`; they're merged into a single `_fill_leaf_aggregates` that shares the `_ragged_arange` mapping between AABB and area aggregation, removing the redundant `.sum()` sync.

1.4 Dual-tree traversal: ~12
`.any()`-gated indexings → one compaction

`ClusterTree.find_dual_interaction_pairs` is the hot loop of geometry prep. The previous body did 12 separate `if X.any():` early-exits and `tensor[bool_mask]` selections per iteration, each one a sync.
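In miniature, the two styles look like this (toy tensors and illustrative function names; the real traversal state is much richer):

```python
import torch


def classify_eager_exits(vals: torch.Tensor) -> torch.Tensor:
    # Before-style: branch on tensor contents, then boolean-index.
    # Each .any() and each vals[mask] forces a GPU->CPU sync in eager mode.
    big = vals > 0.5
    if big.any():        # sync: the host must read the boolean
        return vals[big]  # sync: nonzero() sizes the output
    return vals.new_empty(0)


def classify_single_compaction(vals: torch.Tensor) -> torch.Tensor:
    # After-style: compute all candidate slots with validity masks, stack,
    # and pay a single compaction -- one sync, no data-dependent branches.
    candidates = torch.stack([vals, vals * 2])  # all potential slots
    valid = torch.stack([vals > 0.5, torch.zeros_like(vals, dtype=torch.bool)])
    return candidates[valid]  # the one nonzero/sync
```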
case_T_only,case_S_only,case_both), computes all eight potential child slots with validity masks, stacks them, and pays one boolean compaction per iteration:Far-field outputs are similarly accumulated unfiltered into per-iteration lists with a parallel
is_farvalidity mask, then compacted ONCE at the very end of the traversal.1.5 Leaf-pair expansion: dedup
nonzerocallsIn
_expand_dual_leaf_hits,tensor[bool_mask]lowers to a synchronizingaten::nonzerothat sizes the output. When several indexings share the same mask, the previous code paid one sync per indexing. They're now deduped:The function also drops a pile of "early return on empty input" branches that were each a sync; the downstream ops (
scatter,bincount,repeat_interleave,_ragged_arange) all handle zero-element inputs correctly.1.6 Traversal loop bound:
max_depth.item()→bit_lengthThe dual-traversal loop bound was
i.e. two
.item()calls before the loop starts. Replaced with a static upper bound that needs no GPU read:n_sourcesis a Pythonint(a.shape[0]lookup), not a tensor.1.7 Tree-build wrapper:
no_grad+ minor cleanupsGLOBE._build_trees_and_plansand_build_prediction_plansnow run insidetorch.no_grad()(the outputs are integer indices and a non-grad-tracked area divisor, so autograd bookkeeping on dozens of tensor ops was wasted) and inline the per-plan debug logging into the build loop instead of re-iterating afterward.2. Prefix-sum aggregation: replacing Python loops over tree levels
A cluster tree's Morton-sorted source ordering has a key property: every node covers a contiguous range of sources, `[node_range_start, node_range_start + node_range_count)`. So any node-subtree sum is a prefix-sum range subtract: `prefix[end] - prefix[start]`. This kills two `O(tree_depth)` Python loops that were previously the dominant cost in their respective functions.

### 2.1 `ClusterTree.compute_source_aggregates`

The previous implementation ran `for level_ids in reversed(depth_levels):`, propagating centroids and per-source features bottom-up. Combined cost in profiling: ~2 s per training step of mixed CPU (level-loop overhead) and GPU (kernel launches per level) time.
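A toy sketch of the cumsum + range-subtract pattern that replaces the loop. The function and argument names are assumptions; the real code also applies this across TensorDict leaves:

```python
import torch


def subtree_sums(values_sorted: torch.Tensor,
                 node_start: torch.Tensor,
                 node_count: torch.Tensor) -> torch.Tensor:
    """Per-node subtree sums over Morton-sorted sources.

    values_sorted: (n_sources, d) features in Morton order.
    node_start, node_count: (n_nodes,) contiguous source range of each node.
    One cumsum replaces a bottom-up loop over tree levels.
    """
    n, d = values_sorted.shape
    # Exclusive prefix sums, accumulated in fp64 (see §5) so that
    # prefix[end] - prefix[start] stays accurate for deep ranges.
    prefix = torch.zeros(n + 1, d, dtype=torch.float64)
    prefix[1:] = values_sorted.to(torch.float64).cumsum(dim=0)
    out = prefix[node_start + node_count] - prefix[node_start]
    return out.to(values_sorted.dtype)
```

Every node's sum comes out of the same two gathers, regardless of its depth in the tree.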
It is replaced with a single `cumsum` + range-subtract over the Morton-sorted sources. Per-source feature aggregation gets the same treatment via an inner `_aggregate_via_prefix_sum`, applied across TensorDict leaves. Trailing feature dims are flattened so `cumsum` sees a single feature axis (avoiding a per-feature kernel chain inside `cumsum`).

The `internal_level_ids`/`internal_level_offsets`/`leaf_node_ids`/`leaf_seg_ids` tensors stored on `ClusterTree` purely to support the old bottom-up loop are dropped.

### 2.2 `BarnesHutKernel._compute_node_strengths`

Same story, on a single scalar field (`source_strengths`). The old implementation did leaf-summing via scatter then a bottom-up Python loop; the new one is one `cumsum` + range-subtract.

## 3. Packed scatter for backward-pass efficiency
`BarnesHutKernel` scatters into per-target output buffers four times per forward (near, far-node, near-far, far-near), once per output field. In profiling, `indexing_backward_kernel_stride_1` (the backward of `scatter_add_`) was ~1 s out of ~4.5 s total GPU time per training step (~23%), the largest single GPU kernel by time.

Two changes in `BarnesHutKernel.forward`:

- One `(n_targets, total_features)` buffer instead of per-field buffers.
- `index_add_` instead of `scatter_add_` for the per-target accumulation.

The pack/scatter helper is shared across all four phases.
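A sketch of the pack → `index_add_` → unpack pattern, with a plain dict standing in for the TensorDict plumbing (names are illustrative):

```python
import torch


def pack_and_accumulate(target_idx: torch.Tensor,
                        per_field_vals: dict,
                        widths: dict,
                        n_targets: int) -> dict:
    """Accumulate per-target contributions for all fields in one pass.

    per_field_vals: field -> (n_hits, width) contributions.
    widths: canonical-ordered field -> feature width.
    One index_add_ into a packed buffer replaces one scatter_add_ per field,
    so the backward pass runs one indexing kernel instead of one per field.
    """
    packed = torch.cat([per_field_vals[k] for k in widths], dim=-1)
    buf = torch.zeros(n_targets, sum(widths.values()), dtype=packed.dtype)
    buf.index_add_(0, target_idx, packed)  # single accumulation kernel
    # Unpack: slice the packed buffer back into per-field views.
    out, col = {}, 0
    for k, w in widths.items():
        out[k] = buf[:, col:col + w]
        col += w
    return out
```

The unpacked tensors are views into the packed buffer, so no extra copies are made on the way out.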
The counterpart `_unpack_buf` slices back into a per-field TensorDict. The canonical-ordered field list and per-key feature widths are cached as `Kernel._output_packing`.

Net effect on DrivAerML (`C_p` 1 ch + `C_f` 3 ch = 4 fields, 4 phases): 16 `scatter_add_` calls → 4 `index_add_` calls per kernel evaluation.

## 4. Memory management
### 4.1 CUDA allocator: `expandable_segments:True`

Both `run.sh` scripts now export `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True`. The chunked kernel evaluations stress the cache enough that the default segment allocator falls back to synchronous `cudaMalloc`/`cudaFree` round-trips; `expandable_segments` lets the allocator grow existing segments instead.

### 4.2 Gradient checkpointing applies in both train and eval
The four `if self.training and self.use_gradient_checkpointing` branches across `BarnesHutKernel.forward` are hoisted into one helper. The `self.use_gradient_checkpointing` flag is also piped through `MultiscaleKernel` → `Kernel` → `BarnesHutKernel` and exposed as a constructor kwarg on `GLOBE` (default `True`).

### 4.3 Packed output buffer
`BarnesHutKernel.forward` used to lazily allocate per-field, per-phase output buffers (an `output_bufs[k] = torch.zeros(...)` if-key-not-seen pattern); they are now allocated up-front. This composes with §3 (packed scatter) and lets us drop the `if n_near == 0 and n_nf == 0 and n_fn == 0 and n_far_nodes == 0` empty-result short-circuit at the top of `forward`: an all-zero `packed_buf` unpacks to all-zero per-field tensors, which is the correct result anyway.

## 5. Numerical stability via fp64 cumsum
The cumsum-based aggregations in §2 are done in `float64` and cast back to `source_points.dtype` at the end, for the reasons documented in the in-code comment. This is listed under "performance" because it is a prerequisite for §2: without it the prefix-sum trick can't be used.
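A small demonstration of why the fp64 accumulation matters: in fp32, the prefix sums grow large enough that subtracting two of them loses the small per-node range sum.

```python
import torch

torch.manual_seed(0)
x = torch.rand(100_000)  # positive values; prefix sums reach ~5e4

start, end = 90_000, 90_010          # a short node range deep in the array
exact = x[start:end].double().sum()  # reference subtree sum (~5)

prefix32 = x.cumsum(0)               # inclusive prefix sums, fp32
prefix64 = x.double().cumsum(0)      # same, fp64

# Range sum over [start, end) as prefix[end-1] - prefix[start-1].
err32 = (prefix32[end - 1] - prefix32[start - 1] - exact).abs()
err64 = (prefix64[end - 1] - prefix64[start - 1] - exact).abs()
# At a running-total magnitude of ~5e4, fp32 rounds each accumulation step
# to a few 1e-3 absolute, which swamps a range sum of ~5; fp64 keeps the
# subtraction accurate.
```

This is why the prefix-sum trick of §2 is only safe once the accumulation runs in `float64`.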
## 6. `torch.compile` compatibility

Most of the syncs above (§1) and the packed-scatter rewrite (§3) also fix `torch.compile` graph breaks. A few changes here are specifically for `torch.compile`:

- `_ceil_div` for chunk sizing instead of `math.ceil(a / b)`. Under `specialize_float=False`, a Python float gets traced as an unbacked symbolic float that propagates into the chunk size and crashes Dynamo at the `checkpoint` boundary in `_gather_and_evaluate`. Pure-integer ceiling division (`-(-a // b)`) sidesteps this.
- `_CHUNK_MEMORY_BUDGET_PERCENT: Final[int] = 25` instead of e.g. `0.25`, for the same reason.
- `@torch.compiler.disable` on the device-memory helpers and `_auto_chunk_size`, so Dynamo doesn't try to evaluate `torch.cuda.get_device_properties` as a graph constant (which crashes on CPU-only hosts and produces unbacked symbolic ints on CUDA hosts).
- `view(-1, 1)` → `reshape(-1, 1)` in `GLOBE.forward`'s output calibration loop, for robustness when upstream ops produce non-contiguous tensors.

The `hierarchical_acceleration.md` design doc is updated to reflect the new "static fraction of total device memory" chunk-size strategy.

## 7. Training scripts
Both `examples/cfd/external_aerodynamics/globe/airfrans/train.py` and `.../drivaer/train.py` have the same three changes.

### 7.1 Removed `GradScaler`

`GradScaler` is only useful for fp16 autocast (to combat gradient underflow). The training scripts use `torch.bfloat16` autocast, which has the same dynamic range as fp32, so there is no underflow to protect against. With `enabled=amp`, `GradScaler` still forces a per-step inf/NaN check sync on the gradients, for nothing. (Removed from checkpoint save/load plumbing too.)

### 7.2 New `tree_build_device` knob

Exposed as a `Literal["cpu", "cuda"] | None` argument and forwarded to `GLOBE.__init__`, which transfers boundary centroids/areas to `build_device` for `from_points` + `find_dual_interaction_pairs`, then moves the resulting `ClusterTree` and `DualInteractionPlan` (both `@tensorclass`, so one `.to()` call each) back to the input's device at the end of `_build_trees_and_plans`/`_build_prediction_plans`.

The default `None` preserves prior behavior (build on the input's device). `"cpu"` can win for small problems (~a few thousand boundary cells) where CUDA launch latency + `cudaStreamSynchronize` round-trips dominate the actual tree work.

### 7.3 CUDA allocator env var (`run.sh`)

See §4.1.
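A minimal sketch of the §7.2 build-then-move pattern. The function name and the stand-in "build" are assumptions; the real code constructs a `ClusterTree` and `DualInteractionPlan` and moves each back with one `.to()`:

```python
from typing import Optional

import torch


def build_plan(points: torch.Tensor, build_device: Optional[str] = None) -> torch.Tensor:
    """Build integer traversal metadata on `build_device`, then move it back.

    For small problems (~a few thousand cells), building on "cpu" avoids
    per-step CUDA launch latency and sync round-trips; the result returns
    to the caller's device in a single transfer.
    """
    out_device = points.device
    if build_device is not None:
        points = points.to(build_device)
    # Stand-in for from_points + find_dual_interaction_pairs: a sort order
    # is the kind of integer metadata the real build produces.
    order = torch.argsort(points.norm(dim=-1))
    return order.to(out_device)  # one hop back to the input's device
```

The key property is that the caller sees the same result on the same device regardless of where the build ran.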
## 8. Dependencies (`pyproject.toml`)

- `tensordict>=0.12.2` (was `>=0.10.0`): bumped alongside the `BarnesHutKernel` refactor; the codebase exercises TensorDict broadcasting and batch-size propagation more heavily than before (see §4.2 for one such case in `_maybe_checkpointed_evaluate`), and this update brings better `torch.compile` support.
- `psutil>=6.0.0`: used by `_device_total_memory_bytes` for the CPU branch (`psutil.virtual_memory().total`). The CPU build path is debug-only; production runs on CUDA, where the CUDA branch is used.

## What's intentionally not in this PR
`.mdlus` checkpoints load unmodified. The new `GLOBE` kwargs (`use_gradient_checkpointing`, `tree_build_device`) both have defaults matching the prior behavior, so this is a non-breaking change for downstream users.

## Checklist
## Dependencies
## Review Process
All PRs are reviewed by the PhysicsNeMo team before merging.
Depending on which files are changed, GitHub may automatically assign a maintainer for review.
We are also testing AI-based code review tools (e.g., Greptile), which may add automated comments with a confidence score.
This score reflects the AI’s assessment of merge readiness and is not a qualitative judgment of your work, nor is
it an indication that the PR will be accepted / rejected.
AI-generated feedback should be reviewed critically for usefulness.
You are not required to respond to every AI comment, but they are intended to help both authors and reviewers.
Please react to Greptile comments with 👍 or 👎 to provide feedback on their accuracy.