perf(pm): run install io workers on rayon#2972
Conversation
There was a problem hiding this comment.
Code Review
This pull request refactors the installation scheduler to offload I/O-heavy tasks like package extraction and cloning to a synchronous worker pool using Rayon. It introduces synchronous versions of several utility functions and adds macOS-specific clonefile support with retry logic. Feedback focuses on removing inappropriate panic recovery logic using catch_unwind, simplifying redundant directory creation in the cloner to avoid logic errors, and optimizing JSON parsing to improve efficiency and error reporting.
| let result = std::panic::catch_unwind(std::panic::AssertUnwindSafe(|| { | ||
| clone_package_from_cache_sync( | ||
| &job.spec.package.name, | ||
| &job.spec.package.version, | ||
| &job.spec.package.tarball_url, | ||
| &job.cache_path, | ||
| &job.spec.target, | ||
| ) | ||
| .map_err(|e| format!("{e:#}")) | ||
| })) | ||
| .unwrap_or_else(|_| Err("install clone worker panicked".to_string())); | ||
| let _ = done_tx.send(OpDone::Clone { target, result }); |
There was a problem hiding this comment.
The use of std::panic::catch_unwind to handle panics in the worker thread violates the general rule against implementing recovery logic for panics. Panics should be treated as unrecoverable bugs that indicate a logic error, rather than transient errors to be caught and converted into a result.
let result = clone_package_from_cache_sync(
&job.spec.package.name,
&job.spec.package.version,
&job.spec.package.tarball_url,
&job.cache_path,
&job.spec.target,
)
.map_err(|e| format!("{e:#}"));
let _ = done_tx.send(OpDone::Clone { target, result });References
- Do not implement recovery logic for panics. Panics should be treated as unrecoverable bugs that need to be fixed, not as transient, recoverable errors.
| let mut created_dirs = HashSet::new(); | ||
| for dir in &dirs { | ||
| if created_dirs.insert(dir.clone()) | ||
| && let Err(e) = fs::create_dir_all(dir) | ||
| && e.kind() != io::ErrorKind::AlreadyExists | ||
| { | ||
| return Err(e).with_context(|| err_msg.clone()); | ||
| } | ||
| } |
There was a problem hiding this comment.
The directory creation logic in Phase 2 is redundant and contains a potential logic error.
- Redundancy:
std::fs::create_dir_allis idempotent and returnsOk(())if the directory already exists. Sincecollect_entriesvisits each directory exactly once in a top-down order, theHashSetis unnecessary. - Logic Error: The check
e.kind() != io::ErrorKind::AlreadyExistsis problematic.create_dir_allonly returns anAlreadyExistserror if a file exists at the path where a directory is expected. In such a case, the operation should fail because subsequent file writes into that path will inevitably fail. By skipping the error, the code allows the process to continue in an invalid state.
for dir in &dirs {
fs::create_dir_all(dir).with_context(|| err_msg.clone())?;
}| match serde_json::from_str(&content) { | ||
| Ok(v) => Ok(v), | ||
| Err(original_err) => match serde_json::from_str::<serde_json::Value>(&content) { | ||
| Ok(value) => serde_json::from_value(value) | ||
| .with_context(|| format!("Failed to deserialize {pkg_path:?}")), | ||
| Err(_) => { | ||
| Err(original_err).with_context(|| format!("Failed to parse JSON from {pkg_path:?}")) | ||
| } | ||
| }, | ||
| } |
There was a problem hiding this comment.
The double-parsing logic in load_package_json_sync is inefficient and potentially degrades error reporting. serde_json::from_str::<T> is already capable of parsing the JSON and validating it against the target type T. If the JSON is valid but doesn't match the schema, it provides a descriptive error including line and column information. Falling back to serde_json::Value and then using from_value performs the same validation again but loses the original source location information in the resulting error.
serde_json::from_str(&content).with_context(|| format!("Failed to parse JSON from {pkg_path:?}"))|
Benchmark readout for run 26018170215 (npmjs, label-triggered):
Conclusion: clone+extract-on-rayon is not ready to fold from this single run. p3 is clearly better, but p4 wall regressed in this run even though ctx stayed at the good clone-worker level. I would re-run N=2 before deciding whether the extract worker should join the clone-worker change; otherwise keep only the clone-worker candidate. |
|
N=2 readout for run 26019249657 (npmjs, label-triggered):
Updated conclusion: clone+extract-on-rayon is stable for p3 and no longer shows the p4 wall regression from N=1. Its main value is p3 ctx. #2973 still has the better p4 and better first-run p3 wall, so I would wait for #2973 N=2 before deciding whether to fold this directly or test a combined branch (#2973 + extract worker). |
📊 pm-bench-phases ·
|
| PM | wall | ±σ | user | sys | RSS | pgMinor |
|---|---|---|---|---|---|---|
| bun | 9.25s | 0.20s | 10.55s | 10.09s | 769M | 338.7K |
| utoo-next | 7.93s | 0.47s | 10.57s | 12.08s | 982M | 127.6K |
| utoo-npm | 8.07s | 0.21s | 10.69s | 12.01s | 949M | 127.4K |
| utoo | 8.12s | 1.52s | 11.13s | 11.56s | 934M | 134.7K |
| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
|---|---|---|---|---|---|---|---|
| bun | 15.2K | 18.4K | 1.20G | 6M | 1.86G | 1.74G | 1M |
| utoo-next | 123.7K | 84.9K | 1.17G | 5M | 1.73G | 1.72G | 2M |
| utoo-npm | 124.1K | 86.4K | 1.17G | 5M | 1.73G | 1.72G | 2M |
| utoo | 96.8K | 53.5K | 1.18G | 7M | 1.73G | 1.72G | 3M |
p1_resolve
| PM | wall | ±σ | user | sys | RSS | pgMinor |
|---|---|---|---|---|---|---|
| bun | 1.91s | 0.04s | 4.07s | 1.10s | 516M | 160.9K |
| utoo-next | 3.02s | 0.06s | 5.34s | 2.22s | 625M | 91.9K |
| utoo-npm | 3.07s | 0.15s | 5.38s | 2.15s | 604M | 77.6K |
| utoo | 2.34s | 0.01s | 6.01s | 1.64s | 658M | 123.0K |
| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
|---|---|---|---|---|---|---|---|
| bun | 8.1K | 4.9K | 203M | 3M | 108M | - | 1M |
| utoo-next | 73.4K | 89.0K | 201M | 2M | 7M | 3M | 2M |
| utoo-npm | 73.6K | 90.6K | 201M | 2M | 7M | 3M | 2M |
| utoo | 14.1K | 19.5K | 204M | 3M | 7M | 3M | 2M |
p3_cold_install
| PM | wall | ±σ | user | sys | RSS | pgMinor |
|---|---|---|---|---|---|---|
| bun | 6.80s | 0.24s | 6.31s | 9.74s | 650M | 211.7K |
| utoo-next | 7.00s | 1.46s | 5.20s | 10.70s | 519M | 62.6K |
| utoo-npm | 6.98s | 1.82s | 5.25s | 10.71s | 525M | 66.3K |
| utoo | 5.48s | 0.40s | 4.92s | 10.05s | 501M | 63.3K |
| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
|---|---|---|---|---|---|---|---|
| bun | 4.0K | 7.2K | 1.00G | 3M | 1.77G | 1.77G | 1M |
| utoo-next | 113.6K | 49.1K | 999M | 3M | 1.72G | 1.72G | 2M |
| utoo-npm | 120.2K | 50.5K | 999M | 3M | 1.72G | 1.72G | 2M |
| utoo | 73.6K | 43.2K | 998M | 2M | 1.72G | 1.72G | 2M |
p4_warm_link
| PM | wall | ±σ | user | sys | RSS | pgMinor |
|---|---|---|---|---|---|---|
| bun | 3.51s | 0.00s | 0.17s | 2.50s | 135M | 31.7K |
| utoo-next | 2.23s | 0.05s | 0.50s | 3.76s | 79M | 18.7K |
| utoo-npm | 2.17s | 0.03s | 0.48s | 3.78s | 81M | 18.5K |
| utoo | 1.86s | 0.04s | 0.41s | 3.47s | 53M | 11.2K |
| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
|---|---|---|---|---|---|---|---|
| bun | 194 | 21 | 109K | 29K | 1.88G | 1.77G | 1M |
| utoo-next | 38.0K | 17.4K | 405K | 28K | 1.72G | 1.72G | 2M |
| utoo-npm | 40.2K | 17.3K | 404K | 14K | 1.72G | 1.72G | 2M |
| utoo | 21.6K | 13.7K | 400K | 24K | 1.73G | 1.72G | 2M |
npmmirror.com: no output captured.
Summary
Validation
Benchmark
N=1 run 26018170215: p3 positive, p4 ctx positive but wall noisy.
N=2 run 26019249657 (npmjs):
Conclusion: stable p3 ctx win; p4 preserves clone-worker gains. Compare against #2973 before folding, because #2973 currently has better p4 and better p3 wall.