perf(pm): prefetch lockfile installs with rayon workers#2974
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces a prefetching mechanism for registry tarballs and refactors package extraction and cloning to use synchronous operations within a Rayon worker pool, including macOS-specific clonefile support. Feedback focuses on improving error handling by treating directory cleanup failures as fatal, removing redundant cleanup logic, and avoiding the use of catch_unwind for worker panics. Additionally, the use of std::thread::sleep within Rayon threads was flagged for potential performance impact.
| let result = std::panic::catch_unwind(std::panic::AssertUnwindSafe(|| { | ||
| clone_package_from_cache_sync( | ||
| &job.spec.package.name, | ||
| &job.spec.package.version, | ||
| &job.spec.package.tarball_url, | ||
| &job.cache_path, | ||
| &job.spec.target, | ||
| ) | ||
| .map_err(|e| format!("{e:#}")) | ||
| })) | ||
| .unwrap_or_else(|_| Err("install clone worker panicked".to_string())); |
There was a problem hiding this comment.
The use of std::panic::catch_unwind to convert a panic into a recoverable Err violates the general rule that panics should be treated as unrecoverable bugs. If a worker panics, it indicates a bug that should not be handled as a transient error. Removing the catch-unwind logic allows the panic to propagate naturally, which is preferred for unrecoverable issues.
let result = clone_package_from_cache_sync(
&job.spec.package.name,
&job.spec.package.version,
&job.spec.package.tarball_url,
&job.cache_path,
&job.spec.target,
)
.map_err(|e| format!("{e:#}"));References
- Do not implement recovery logic for panics. Panics should be treated as unrecoverable bugs that need to be fixed, not as transient, recoverable errors.
|
|
||
| for delay in std::iter::once(std::time::Duration::ZERO).chain(create_retry_strategy()) { | ||
| if !delay.is_zero() { | ||
| std::thread::sleep(delay); |
There was a problem hiding this comment.
Using std::thread::sleep inside a Rayon worker thread is generally discouraged as it blocks the worker from executing other tasks in the pool. While retries are necessary for system calls like clonefile, consider if this blocking behavior is acceptable given the configured clone_limit or if a more asynchronous approach is possible.
| if dst.try_exists()? | ||
| && let Err(e) = std::fs::remove_dir_all(dst) | ||
| { | ||
| tracing::warn!("Failed to clean target directory {}: {}", dst.display(), e); | ||
| } |
There was a problem hiding this comment.
If std::fs::remove_dir_all fails, the subsequent clone_dir_native_sync is highly likely to fail as well (especially on macOS where clonefile requires the destination to not exist). This error should be treated as fatal rather than just a warning to ensure the system doesn't proceed with an inconsistent state.
if dst.try_exists()? {
std::fs::remove_dir_all(dst)
.with_context(|| format!("Failed to clean target directory {}", dst.display()))?;
}| if dst.try_exists()? { | ||
| if validate_name_version_sync(dst, name, version) { | ||
| return Ok(false); | ||
| } | ||
| if let Err(e) = std::fs::remove_dir_all(dst) { | ||
| tracing::warn!("Failed to clean target directory {}: {}", dst.display(), e); | ||
| } | ||
| } |
There was a problem hiding this comment.
This call to remove_dir_all is redundant because clone_sync (called on line 492) also attempts to remove the destination directory if it exists. Additionally, as noted in clone_sync, failure to clean the directory should be a fatal error.
if dst.try_exists()? && validate_name_version_sync(dst, name, version) {
return Ok(false);
}
📊 pm-bench-phases ·
|
| PM | wall | ±σ | user | sys | RSS | pgMinor |
|---|---|---|---|---|---|---|
| bun | 7.57s | 0.06s | 10.41s | 6.30s | 608M | 293.2K |
| utoo-next | 7.15s | 0.21s | 10.81s | 8.13s | 1.01G | 121.4K |
| utoo-npm | 7.74s | 1.08s | 10.99s | 8.19s | 1.00G | 123.3K |
| utoo | 6.81s | 0.19s | 11.50s | 7.67s | 956M | 149.0K |
| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
|---|---|---|---|---|---|---|---|
| bun | 23.0K | 18.1K | 1.20G | 7M | 1.89G | 1.76G | 1M |
| utoo-next | 146.0K | 112.1K | 1.17G | 5M | 1.73G | 1.72G | 2M |
| utoo-npm | 155.0K | 115.6K | 1.17G | 6M | 1.73G | 1.72G | 2M |
| utoo | 96.4K | 55.2K | 1.18G | 7M | 1.73G | 1.72G | 3M |
p1_resolve
| PM | wall | ±σ | user | sys | RSS | pgMinor |
|---|---|---|---|---|---|---|
| bun | 2.31s | 0.06s | 4.30s | 0.81s | 497M | 178.5K |
| utoo-next | 3.02s | 0.04s | 5.58s | 1.53s | 621M | 83.2K |
| utoo-npm | 3.08s | 0.02s | 5.70s | 1.56s | 612M | 73.9K |
| utoo | 2.33s | 0.02s | 6.20s | 1.14s | 650M | 120.5K |
| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
|---|---|---|---|---|---|---|---|
| bun | 13.3K | 4.3K | 203M | 3M | 108M | - | 1M |
| utoo-next | 88.6K | 91.5K | 201M | 3M | 7M | 3M | 2M |
| utoo-npm | 91.7K | 98.5K | 202M | 3M | 7M | 3M | 2M |
| utoo | 17.7K | 20.1K | 204M | 3M | 7M | 3M | 2M |
p3_cold_install
| PM | wall | ±σ | user | sys | RSS | pgMinor |
|---|---|---|---|---|---|---|
| bun | 5.33s | 0.06s | 5.91s | 6.19s | 580M | 193.2K |
| utoo-next | 5.73s | 0.16s | 5.04s | 6.98s | 486M | 62.8K |
| utoo-npm | 5.51s | 0.05s | 5.05s | 6.95s | 491M | 60.3K |
| utoo | 4.80s | 0.26s | 4.80s | 6.67s | 482M | 55.7K |
| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
|---|---|---|---|---|---|---|---|
| bun | 10.4K | 7.7K | 1.00G | 4M | 1.77G | 1.77G | 1M |
| utoo-next | 125.9K | 50.4K | 999M | 4M | 1.72G | 1.72G | 2M |
| utoo-npm | 118.1K | 55.0K | 999M | 3M | 1.72G | 1.72G | 2M |
| utoo | 80.5K | 45.6K | 999M | 3M | 1.72G | 1.72G | 2M |
p4_warm_link
| PM | wall | ±σ | user | sys | RSS | pgMinor |
|---|---|---|---|---|---|---|
| bun | 2.09s | 0.03s | 0.15s | 1.19s | 137M | 33.3K |
| utoo-next | 1.89s | 0.03s | 0.50s | 2.37s | 79M | 18.5K |
| utoo-npm | 1.72s | 0.08s | 0.52s | 2.35s | 80M | 18.9K |
| utoo | 1.95s | 0.49s | 0.37s | 2.09s | 50M | 11.0K |
| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
|---|---|---|---|---|---|---|---|
| bun | 254 | 12 | 5M | 25K | 1.93G | 1.74G | 1M |
| utoo-next | 44.1K | 19.9K | 3K | 10K | 1.72G | 1.72G | 2M |
| utoo-npm | 43.8K | 20.2K | 4K | 8K | 1.72G | 1.72G | 2M |
| utoo | 18.4K | 11.2K | 2K | 5K | 1.72G | 1.72G | 2M |
npmmirror.com: no output captured.
Summary
Hypothesis
#2973 has the strongest p4 wall/ctx, but p3 is unstable. #2972 has stronger p3 ctx and stable p3 wall, but weaker p4 than #2973. This branch checks whether prefetch + extract workers can keep #2973 p4 while recovering #2972 p3 ctx.
Validation
Benchmark
GHA linux/npmjs, run 26020431477:
Conclusion: p4 keeps the #2973 low-ctx class, but p3 does not recover the stable #2972 result and has high wall-time variance. Keep this as the combination baseline while testing #2975/#2976/#2977 for p4 cache-hit and p3 critical-path queueing.