@janosh
Member

@janosh janosh commented Oct 24, 2025

detect_lostruns takes 25 min to run on our current LaunchPad and looks highly optimizable:

  1. lines 1351-1361 query each potentially lost FW individually instead of batch fetching (O(N) queries, where N = number of fireworks)
  2. line 1357 makes another individual query for each non-lost launch to check its state (nested O(M) queries, where M = number of launches)
  3. lines 1378-1388 iterate through all RUNNING fireworks and make one query per FW to check whether its launches are FIZZLED/COMPLETED

Example: for 10k RUNNING jobs and 1k lost runs, this requires 11,000+ DB queries (one per RUNNING FW plus at least one per lost run).

MongoDB has supported batch queries with $in for a long time and $lookup in aggregation pipelines since 3.2, so this can be done much faster:

  1. Batch fetch all relevant FireWorks with find({"fw_id": {"$in": fw_ids}})
  2. Collect all launch IDs, then batch fetch their states in one query
  3. Use an aggregation pipeline with $lookup to find inconsistent FireWorks server-side instead of running N queries client-side

This reduces the operation from 11,000+ queries to on the order of 10 batch queries plus 1 aggregation, which should give a big speedup and is backwards-compatible.
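The difference between the per-document and batched patterns can be sketched as follows. This is a minimal illustration, not the code in this PR: `CountingCollection` is a hypothetical in-memory stand-in for a pymongo collection (it only understands equality and `$in`), and the field names `fw_id`, `launches`, `launch_id` and `state` are assumptions about the document layout. It also assumes each launch belongs to a single FireWork.

```python
from collections import defaultdict


class CountingCollection:
    """Hypothetical in-memory stand-in for a pymongo collection.

    Counts find() calls and supports only equality and {"$in": [...]}
    matching on top-level fields.
    """

    def __init__(self, docs):
        self.docs = list(docs)
        self.n_queries = 0

    def find(self, query):
        self.n_queries += 1
        return [doc for doc in self.docs if self._matches(doc, query)]

    @staticmethod
    def _matches(doc, query):
        for key, cond in query.items():
            if isinstance(cond, dict) and "$in" in cond:
                if doc.get(key) not in cond["$in"]:
                    return False
            elif doc.get(key) != cond:
                return False
        return True


def launch_states_one_by_one(fireworks, launches, fw_ids):
    """N+1 pattern: one query per FireWork, then one query per launch."""
    states = defaultdict(dict)
    for fw_id in fw_ids:
        for fw in fireworks.find({"fw_id": fw_id}):
            for launch_id in fw["launches"]:
                for launch in launches.find({"launch_id": launch_id}):
                    states[fw_id][launch_id] = launch["state"]
    return dict(states)


def launch_states_batched(fireworks, launches, fw_ids):
    """Batch pattern: two $in queries, however many fw_ids are passed."""
    fw_docs = fireworks.find({"fw_id": {"$in": list(fw_ids)}})
    # Map each launch ID back to its FireWork (assumes no shared launches).
    launch_to_fw = {lid: fw["fw_id"] for fw in fw_docs for lid in fw["launches"]}
    states = defaultdict(dict)
    for launch in launches.find({"launch_id": {"$in": list(launch_to_fw)}}):
        fw_id = launch_to_fw[launch["launch_id"]]
        states[fw_id][launch["launch_id"]] = launch["state"]
    return dict(states)
```

Both functions return the same mapping of fw_id to launch states; only the number of round trips differs. With a real pymongo collection the same `find` queries should apply, since `$in` is a standard query operator.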

Replace O(N) individual database queries with batch operations and MongoDB
aggregation pipelines, achieving 280-877x speedup for large deployments.

- Batch fetch FireWorks and launch states in single queries
- Use aggregation pipeline with $lookup for inconsistent FW detection
- Add performance tests demonstrating improvements
- Update CLI to show detection summary
- Fully backwards compatible

Resolves performance issues with large databases containing thousands of
lost/inconsistent runs.
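For the $lookup bullet in the commit message above, a pipeline along these lines is one way to find RUNNING FireWorks that have a FIZZLED or COMPLETED launch in a single server-side pass. It is a sketch under assumptions, not necessarily the pipeline merged here: the helper name `inconsistent_fw_pipeline`, the collection name `launches`, and the fields `state`, `launches`, `launch_id` and `fw_id` are all assumed, and matching an array `localField` without `$unwind` may need a more recent server than the one that introduced `$lookup`.

```python
def inconsistent_fw_pipeline(fw_state="RUNNING", done_states=("FIZZLED", "COMPLETED")):
    """Build an aggregation pipeline for the fireworks collection (hypothetical helper)."""
    return [
        # Only consider FireWorks the database still believes are running.
        {"$match": {"state": fw_state}},
        # Join each FireWork's launch IDs against the launches collection.
        {
            "$lookup": {
                "from": "launches",
                "localField": "launches",
                "foreignField": "launch_id",
                "as": "launch_docs",
            }
        },
        # Keep FireWorks with at least one launch that has already finished.
        {"$match": {"launch_docs.state": {"$in": list(done_states)}}},
        # Return only the fw_id to keep the result small.
        {"$project": {"_id": 0, "fw_id": 1}},
    ]


# Usage with pymongo, assuming `db` is a connected Database object:
# inconsistent = [d["fw_id"] for d in db.fireworks.aggregate(inconsistent_fw_pipeline())]
```

Building the pipeline as plain data keeps it easy to unit test without a running MongoDB instance.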
@janosh
Member Author

janosh commented Oct 24, 2025

anecdotally, this PR brings the run time from 25min down to 25sec

time lpad detect_lostruns
> 0.67s user 0.10s system 3% cpu 24.530 total

- Replace nested Process classes with module-level helper functions to fix
  pickling errors with spawn context on macOS
- Add _LaunchPadCallable wrapper class for picklable LaunchPad sharing
  across multiprocessing.BaseManager
- Register LaunchPad proxy on client side before connecting to DataServer
- Fix AuthenticationTest user creation conflicts with cleanup logic
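The pickling issue behind the first bullet can be reproduced in a few lines. With the spawn start method, objects handed to a child process are pickled, and pickle stores classes and functions by qualified name, so anything defined inside another function cannot be looked up again. The names below (`module_level_worker`, `make_nested_worker`, `is_picklable`) are made up for illustration and are not the helpers added in this commit.

```python
import pickle


def module_level_worker(x):
    """Defined at module top level, so pickle can reference it by name."""
    return x * 2


def make_nested_worker():
    """Return a class defined inside a function body."""

    class NestedWorker:
        def run(self, x):
            return x * 2

    return NestedWorker


def is_picklable(obj):
    """Return True if pickle.dumps(obj) succeeds, False otherwise."""
    try:
        pickle.dumps(obj)
    except (AttributeError, pickle.PicklingError, TypeError):
        return False
    return True


if __name__ == "__main__":
    # The nested class has "<locals>" in its qualified name and cannot be pickled.
    print(is_picklable(make_nested_worker()))
    # The module-level function is pickled by reference to its importable name.
    print(is_picklable(module_level_worker))
```

Moving the worker logic to module level, as the commit describes, avoids the problem without depending on the fork start method.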
@janosh janosh force-pushed the speedup-detect-lostruns branch from e4fc6a1 to 2c904ed Compare October 25, 2025 12:00
@computron computron merged commit e688721 into materialsproject:main Oct 28, 2025
4 checks passed
@computron
Member

Thanks!

@janosh
Member Author

janosh commented Oct 28, 2025

thanks for merging! if you have time for another one: #555

@janosh janosh deleted the speedup-detect-lostruns branch October 28, 2025 19:31