[rsz] repair_design: lib+odb screen + defer in-loop parasitic flush #10326
oharboe wants to merge 1 commit into The-OpenROAD-Project:master
Two independent changes to RepairDesign that together cut wall time
on the per-driver loop by an order of magnitude on a large
confidential ASAP7 design (~2.7M flat instances), with end-of-run
buffer/resize counts matching the unmodified path within
0.003%-0.03% drift:
1. Cheap lib + odb HPWL screen at the top of repairDriver.
For drivers where Penfield-Rubinstein closed-form upper bounds
(Elmore tau scaled by the existing slew_rc_factor_) prove that
the net cannot violate slew or cap limits, skip the full STA
path: ensureWireParasitic, findDelays, checkSlew, checkCap, and
makeBufferedNet. The bound is sound by construction (HPWL is a
lower bound on Steiner length, total cap and 2.2*R_total*C_total
are both upper bounds), so a "safe" verdict is exact, not
heuristic. Per-LibertyCell cap_limit and per-LibertyPort
slew_limit are cached to keep the screen ~50-100 ns per net.
On the reproducer ~78-84% of drivers are screened safe.
The screen also short-circuits makeBufferedNet for drivers that
pass cap/slew but have no wire-length limit, which by itself
removes a per-driver Steiner-tree build that the existing path
does unconditionally.
2. Defer the post-resize updateParasitics() flush.
In the existing inner loop, repairDriverSlew (a cell resize)
was followed by estimate_parasitics_->updateParasitics(), which
walks every invalidated net's fanin and inserts every reachable
vertex into Search::invalid_arrivals_/invalid_requireds_ (a
std::set<Vertex*>). On the reproducer's long tail, perf record
showed ~14% of CPU spent in those tree-set operations and ~25%
in the dbNetwork id/RTTI dispatch driving Steiner re-extraction
for nets that the next iteration would visit anyway.
We replace the global flush with a targeted
ensureWireParasitic(drvr_pin, drvr_net) so the local recheck
sees fresh parasitics for THIS driver, while the other
invalidated nets remain queued for on-demand refresh when
their own drivers are processed later in the level-ordered
pass. The IncrementalParasiticsGuard destructor still does a
single final flush at scope exit.
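
As a rough illustration of why a targeted refresh can do less work than a global flush, the toy model below counts parasitic refreshes for both strategies. `ParasiticsCache`, `simulate`, and the rule that a resize dirties the driver's own net plus the next `fanout` nets are hypothetical simplifications, not the OpenROAD or OpenSTA API, and the model ignores the arrival/required invalidation cost described above.

```cpp
#include <cassert>
#include <cstddef>
#include <set>

// Hypothetical toy cache for illustration; not the EstimateParasitics API.
class ParasiticsCache
{
 public:
  // Mark a net's parasitics as stale.
  void invalidate(int net) { dirty_.insert(net); }

  // Global flush: refresh every stale net.
  void updateAll()
  {
    refreshes_ += dirty_.size();
    dirty_.clear();
  }

  // Targeted refresh: refresh one net, and only if it is stale.
  void ensure(int net)
  {
    if (dirty_.erase(net) > 0) {
      ++refreshes_;
    }
  }

  std::size_t refreshes() const { return refreshes_; }

 private:
  std::set<int> dirty_;
  std::size_t refreshes_ = 0;
};

// Visit drivers 0..n-1 in order. Assumption: resizing driver i dirties its
// own net and the nets of the next `fanout` drivers, all visited later.
std::size_t simulate(int n, int fanout, bool defer)
{
  assert(n >= 0 && fanout >= 0);
  ParasiticsCache cache;
  for (int i = 0; i < n; ++i) {
    for (int j = i; j <= i + fanout && j < n; ++j) {
      cache.invalidate(j);
    }
    if (defer) {
      cache.ensure(i);  // refresh only the net being rechecked
    } else {
      cache.updateAll();  // refresh everything that is stale
    }
  }
  cache.updateAll();  // single final flush, as at guard scope exit
  return cache.refreshes();
}
```

In this model, `simulate(1000, 4, false)` performs 4990 refreshes while `simulate(1000, 4, true)` performs 1000, because nets that a later iteration visits anyway are refreshed once instead of repeatedly.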
Measured on cut-down versions of the reproducer (screen enabled):

  size                  repair_design wall (s)
                        before    after    speedup
  12% (~318k drvrs)       1494      170       8.8x
  25% (~661k drvrs)       8989      668      13.5x

The speedup ratio grows with design size because the cascade that is
now deferred is what scales super-linearly. The full design previously
could not finish in any reasonable time on a 30 GB host (memory
pressure compounding the algorithmic slowdown), so the before/after
comparison above uses cut-down sizes at which the unmodified path
did finish.
All eight rsz repair_design / repair_slew / repair_cap /
repair_fanout regression tests pass byte-identical to the .ok
files. repair_design3-tcl_test (the tristate / N^2 stress test)
drops from 189s to 113s as a side effect.
Verbose-only diagnostic line "[screen] bucket .. safe; rej cap=..
slew=.. other=.." with a deterministic est-design-mem column is
emitted alongside the existing progress table; non-verbose runs
are unchanged.
Knobs (compile-time constants): k_steiner_ub_ = 1.2 (Hwang-style
Steiner upper-bound multiplier on HPWL), k_screen_safety_ = 0.0
(the existing slew_rc_factor_ already carries 10% modeling
pessimism). Both can be tuned upward if a future workload shows
buffer-count drift above a few percent.
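
The screen from item 1 and the role of the two knobs can be sketched as follows. This is a minimal sketch under stated assumptions, not the PR's `screenNetSafe`: the `Point` and `NetInfo` types, the inclusion of a driver resistance in `R_total`, and the function signature are all hypothetical. Only the 1.2 HPWL multiplier and the `2.2 * R_total * C_total` slew estimate are taken from the description above.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical stand-ins for illustration; these are not OpenROAD types.
struct Point
{
  double x;  // microns
  double y;  // microns
};

struct NetInfo
{
  std::vector<Point> pins;  // driver and load pin locations
  double pin_cap_total;     // sum of load pin capacitances (F)
  double drvr_res;          // assumed driver output resistance (ohm)
  double wire_res_per_um;   // wire resistance (ohm/um)
  double wire_cap_per_um;   // wire capacitance (F/um)
};

// Half-perimeter wire length of the bounding box around all pins.
double hpwl(const std::vector<Point>& pins)
{
  assert(!pins.empty());
  double min_x = pins[0].x, max_x = pins[0].x;
  double min_y = pins[0].y, max_y = pins[0].y;
  for (const Point& p : pins) {
    min_x = std::min(min_x, p.x);
    max_x = std::max(max_x, p.x);
    min_y = std::min(min_y, p.y);
    max_y = std::max(max_y, p.y);
  }
  return (max_x - min_x) + (max_y - min_y);
}

// Returns true only when the pessimistic cap and slew estimates both stay
// within their limits; any other outcome defers to the full STA path.
bool screenNetSafe(const NetInfo& net,
                   double cap_limit,
                   double slew_limit,
                   double steiner_ub = 1.2,      // multiplier on HPWL
                   double slew_rc_factor = 2.2)  // scales R_total * C_total
{
  const double wire_len = steiner_ub * hpwl(net.pins);
  const double c_total = net.pin_cap_total + wire_len * net.wire_cap_per_um;
  if (c_total > cap_limit) {
    return false;
  }
  const double r_total = net.drvr_res + wire_len * net.wire_res_per_um;
  const double slew_est = slew_rc_factor * r_total * c_total;
  return slew_est <= slew_limit;
}
```

Raising `steiner_ub` lengthens the assumed wire, which raises both estimates and makes the screen reject more nets.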
Signed-off-by: Øyvind Harboe <oyvind.harboe@zylin.com>
clang-tidy review says "All clean, LGTM! 👍"
Code Review
This pull request introduces a fast screening mechanism (screenNetSafe) in RepairDesign to skip expensive STA-based repair checks for nets that are provably safe based on HPWL and library pin capacitances. It also includes optimizations to parasitic updates and Steiner tree construction to reduce CPU overhead during the repair process. The review feedback suggests using higher precision double literals in margin calculations for capacitance and slew limits to avoid potential precision loss.
    ++screen_rej_no_lib_;  // No lib limit known: defer to STA.
    return false;
    }
    cap_limit *= (1.0f - static_cast<float>(cap_margin_) / 100.0f);
To maintain precision, especially since cap_margin_ is a double, consider performing the calculation with double literals. This ensures that the division and subtraction are done with higher precision before being applied to the float cap_limit.
Similarly for slew_limit on line 289.
Suggested change:
    - cap_limit *= (1.0f - static_cast<float>(cap_margin_) / 100.0f);
    + cap_limit *= (1.0 - cap_margin_ / 100.0);
    ++screen_rej_no_lib_;  // No slew limit known: defer to STA.
    return false;
    }
    slew_limit *= (1.0f - static_cast<float>(slew_margin_) / 100.0f);
To maintain precision, especially since slew_margin_ is a double, consider performing the calculation with double literals. This ensures that the division and subtraction are done with higher precision before being applied to the float slew_limit.
Similarly for cap_limit on line 192.
Suggested change:
    - slew_limit *= (1.0f - static_cast<float>(slew_margin_) / 100.0f);
    + slew_limit *= (1.0 - slew_margin_ / 100.0);
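
To gauge how much the two forms in the suggestion can differ, the sketch below applies a percentage margin once entirely in `float` and once in `double` with a single narrowing at the end. The helper names `applyMarginFloat`, `applyMarginDouble`, and `relDiff` are hypothetical and exist only for this comparison.

```cpp
#include <cassert>
#include <cmath>

// Margin applied entirely in float, mirroring the reviewed line.
float applyMarginFloat(float limit, double margin_pct)
{
  assert(margin_pct >= 0.0 && margin_pct <= 100.0);
  return limit * (1.0f - static_cast<float>(margin_pct) / 100.0f);
}

// Margin computed in double and narrowed to float once, mirroring the
// suggested `limit *= (1.0 - margin / 100.0)` on a float limit.
float applyMarginDouble(float limit, double margin_pct)
{
  assert(margin_pct >= 0.0 && margin_pct <= 100.0);
  return static_cast<float>(limit * (1.0 - margin_pct / 100.0));
}

// Relative difference between the two results.
double relDiff(float limit, double margin_pct)
{
  const double a = applyMarginFloat(limit, margin_pct);
  const double b = applyMarginDouble(limit, margin_pct);
  if (b == 0.0) {
    return std::fabs(a - b);
  }
  return std::fabs(a - b) / std::fabs(b);
}
```

For margins well below 100% the two results agree to within a few float rounding steps, so the change affects the last bits of the limit rather than the screen's verdict in typical cases; margins close to 100% lose more relative precision in the float form because of cancellation in `1.0f - x`.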
@precisionmoon @maliberty Is this a good idea? Is Claude onto something here?
@maliberty I'm out of my depth here, I've linked to this in a feature request.
Claude thought this was a good idea to speed things up... Thoughts?
Summary
Two independent changes to RepairDesign that together cut wall time on the per-driver loop by an order of magnitude on a large confidential ASAP7 design (~2.7 M flat instances), with end-of-run buffer/resize counts matching the unmodified path within 0.003%–0.03% drift.
1. Cheap lib + odb HPWL screen at the top of repairDriver
For drivers where Penfield–Rubinstein closed-form upper bounds (Elmore τ scaled by the existing slew_rc_factor_) prove that the net cannot violate slew or cap limits, skip the full STA path: ensureWireParasitic, findDelays, checkSlew, checkCap, and makeBufferedNet. The bound is sound by construction (HPWL is a lower bound on Steiner length; total cap and 2.2 · R_total · C_total are both upper bounds), so a "safe" verdict is exact, not heuristic. Per-LibertyCell cap_limit and per-LibertyPort slew_limit are cached to keep the screen ~50–100 ns per net.
On the reproducer ~78–84% of drivers are screened safe.
The screen also short-circuits makeBufferedNet for drivers that pass cap/slew but have no wire-length limit, which by itself removes a per-driver Steiner-tree build that the existing path does unconditionally.
2. Defer the post-resize updateParasitics() flush
In the existing inner loop, repairDriverSlew (a cell resize) was followed by estimate_parasitics_->updateParasitics(), which walks every invalidated net's fanin and inserts every reachable vertex into Search::invalid_arrivals_/invalid_requireds_ (a std::set<Vertex*>). On the reproducer's long tail, perf record showed ~14 % of CPU spent in those tree-set operations and ~25 % in the dbNetwork id/RTTI dispatch driving Steiner re-extraction for nets that the next iteration would visit anyway.
We replace the global flush with a targeted ensureWireParasitic(drvr_pin, drvr_net) so the local recheck sees fresh parasitics for this driver, while the other invalidated nets remain queued for on-demand refresh when their own drivers are processed later in the level-ordered pass. The IncrementalParasiticsGuard destructor still does a single final flush at scope exit.
Measured speedup
Cut-down versions of the reproducer were produced by deleting the tail of the instance list (keeping the first N % of dbInst index range; nets are not deleted, so dangling nets are common; same workload structure for before/after, not a synthetic benchmark). Screen + makeBufferedNet short-circuit are present in both rows; deferred updateParasitics is the v6 difference.
The speedup ratio grows with design size because the deferred cascade is what scales super-linearly.
50 % "after" end-state: 538 455 buffers, 46 068 resized, 82.1 % screen-safe, peak RSS 9.55 GB. Soundness intact at every size sampled.
repair_design3-tcl_test (the tristate / N² stress test) drops from 189 s to 113 s as a side effect.
Test plan
bazelisk test //src/rsz/test:repair_design{1..5}-tcl_test //src/rsz/test:repair_slew1-tcl_test //src/rsz/test:repair_cap1-tcl_test //src/rsz/test:repair_fanout1-tcl_test
All 8 pass byte-identical to .ok
Tunable knobs (compile-time)
k_steiner_ub_ = 1.2f: Hwang-style Steiner upper-bound multiplier on HPWL (industry typical 1.5; tightened here based on empirical buffer-count match)
k_screen_safety_ = 0.0f: extra screen safety margin (the existing slew_rc_factor_ already carries 10 % modeling pessimism, so none added)
Both can be raised if a future workload shows buffer-count drift above a few percent.
Out of scope (deliberate)
-hier modes)
invalid_requireds_ with unordered_set upstream of OpenSTA (orthogonal; would compound this PR's gains by removing the log N from the residual tree-set inserts)