[rsz] repair_design: lib+odb screen + defer in-loop parasitic flush #10326

Closed
oharboe wants to merge 1 commit into The-OpenROAD-Project:master from oharboe:repair-design-screen-and-defer-flush

Conversation

@oharboe
Collaborator

@oharboe oharboe commented May 4, 2026

Claude thought this was a good idea to speed things up... Thoughts?

Summary

Two independent changes to RepairDesign that together cut wall time on the per-driver loop by an order of magnitude on a large confidential ASAP7 design (~2.7 M flat instances), with end-of-run buffer/resize counts matching the unmodified path within 0.003%–0.03% drift.

1. Cheap lib + odb HPWL screen at the top of repairDriver

For drivers where Penfield–Rubinstein closed-form upper bounds (Elmore τ scaled by the existing slew_rc_factor_) prove that the net cannot violate slew or cap limits, skip the full STA path: ensureWireParasitic, findDelays, checkSlew, checkCap, and makeBufferedNet. The bound is sound by construction (HPWL is a lower bound on Steiner length; total cap and 2.2 · R_total · C_total are both upper bounds), so a "safe" verdict is exact, not heuristic. Per-LibertyCell cap_limit and per-LibertyPort slew_limit are cached to keep the screen ~50–100 ns per net.

On the reproducer ~78–84% of drivers are screened safe.

The screen also short-circuits makeBufferedNet for drivers that pass cap/slew but have no wire-length limit, which by itself removes a per-driver Steiner-tree build that the existing path does unconditionally.

2. Defer the post-resize updateParasitics() flush

In the existing inner loop, repairDriverSlew (a cell resize) was followed by estimate_parasitics_->updateParasitics(), which walks every invalidated net's fanin and inserts every reachable vertex into Search::invalid_arrivals_ / invalid_requireds_ (a std::set<Vertex*>). On the reproducer's long tail, perf record showed ~14 % of CPU spent in those tree-set operations and ~25 % in the dbNetwork id/RTTI dispatch driving Steiner re-extraction for nets that the next iteration would visit anyway.

We replace the global flush with a targeted ensureWireParasitic(drvr_pin, drvr_net) so the local recheck sees fresh parasitics for this driver, while the other invalidated nets remain queued for on-demand refresh when their own drivers are processed later in the level-ordered pass. The IncrementalParasiticsGuard destructor still does a single final flush at scope exit.
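
To make the difference between a global flush and a targeted refresh concrete, here is a toy model. It is a sketch under stated assumptions, not OpenROAD or OpenSTA code: `ParasiticsCache`, `countRefreshes`, and the integer net ids are all hypothetical, every driver is assumed to be resized, and each resize is assumed to dirty the driver's own net plus `fanin` nets that are visited later in the pass. It counts only net re-extractions and does not model the cost of inserting vertices into the STA invalid sets.

```cpp
#include <cstddef>
#include <unordered_set>

// Toy stand-in for a parasitics cache keyed by net id (hypothetical).
class ParasiticsCache
{
 public:
  void invalidate(int net) { dirty_.insert(net); }

  // Global flush: re-extract every dirty net right now.
  void flushAll()
  {
    refresh_count_ += dirty_.size();
    dirty_.clear();
  }

  // Targeted refresh: re-extract one net, and only if it is dirty.
  void ensureFresh(int net)
  {
    if (dirty_.erase(net) > 0) {
      ++refresh_count_;
    }
  }

  std::size_t refreshCount() const { return refresh_count_; }

 private:
  std::unordered_set<int> dirty_;
  std::size_t refresh_count_ = 0;
};

// Visit drivers 0..num_drivers-1 in order and return the total number of
// net re-extractions. With defer == false, every resize is followed by a
// global flush; with defer == true, only the current driver's net is
// refreshed and the rest wait until their own driver is visited.
std::size_t countRefreshes(int num_drivers, int fanin, bool defer)
{
  ParasiticsCache cache;
  for (int d = 0; d < num_drivers; ++d) {
    cache.ensureFresh(d);  // on-demand refresh before checking this driver

    cache.invalidate(d);  // the resize dirties the driver's own net...
    for (int k = 1; k <= fanin && d + k < num_drivers; ++k) {
      cache.invalidate(d + k);  // ...and fanin nets visited later
    }

    if (defer) {
      cache.ensureFresh(d);  // local recheck sees fresh parasitics
    } else {
      cache.flushAll();
    }
  }
  cache.flushAll();  // single final flush at scope exit
  return cache.refreshCount();
}
```

In this model, `countRefreshes(4, 2, false)` performs 9 re-extractions and `countRefreshes(4, 2, true)` performs 7; the gap widens as `fanin` grows, because the eager variant re-extracts the same downstream nets once per upstream resize.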

Measured speedup

Cut-down versions of the reproducer were produced by deleting the tail of the instance list (keeping the first N % of dbInst index range; nets are not deleted, so dangling nets are common — same workload structure for before/after, not a synthetic benchmark). Screen + makeBufferedNet short-circuit are present in both rows; deferred updateParasitics is the v6 difference.

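| size | drivers | repair_design wall (before) | repair_design wall (after) | speedup | buffer-count drift |
|------|---------|-----------------------------|----------------------------|---------|--------------------|
| 12 % | 318 039 | 1 494 s | 170 s | 8.8× | +0.003 % |
| 25 % | 661 314 | 8 989 s | 668 s | 13.5× | +0.003 % |
| 50 % | 1 326 456 | killed at 95 % done after 4 h on a 30 GB host (compounding swap) | 1 025 s | >14× | n/a (no clean before baseline) |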

The speedup ratio grows with design size because the deferred cascade is what scales super-linearly.
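
As a rough check on the scaling claim, the two rows with a clean baseline can be turned into an empirical exponent p for wall time ∝ drivers^p. The helpers below are hypothetical names introduced for illustration, and two sample points give only an indicative slope, not a fit.

```cpp
#include <cmath>

// Wall-time ratio between the unmodified and modified runs.
double speedup(double before_s, double after_s)
{
  return before_s / after_s;
}

// Empirical exponent p in time ~ drivers^p, from two sample points
// (log-log slope between point a and point b).
double scalingExponent(double drivers_a,
                       double time_a,
                       double drivers_b,
                       double time_b)
{
  return std::log(time_b / time_a) / std::log(drivers_b / drivers_a);
}
```

Fed with the 12 % and 25 % rows, `scalingExponent(318039, 1494, 661314, 8989)` comes out at roughly 2.5 for the unmodified path and `scalingExponent(318039, 170, 661314, 668)` at roughly 1.9 for the modified one. On these two points the "before" loop grows faster with driver count than the "after" loop, which is why the speedup ratio widens from 8.8× to 13.5×.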

50 % "after" end-state: 538 455 buffers, 46 068 resized, 82.1 % screen-safe, peak RSS 9.55 GB. Soundness intact at every size sampled.

repair_design3-tcl_test (the tristate / N² stress test) drops from 189 s to 113 s as a side effect.

Test plan

  • bazelisk test //src/rsz/test:repair_design{1..5}-tcl_test //src/rsz/test:repair_slew1-tcl_test //src/rsz/test:repair_cap1-tcl_test //src/rsz/test:repair_fanout1-tcl_test — all 8 pass byte-identical to .ok
  • Buffer/Resize/Nets-repaired counts match the unmodified path within 0.003%–0.03 % on the 12 %, 25 % and 50 % cut-down reproducer
  • Full upstream rsz regression on a clean tree (recommend running before merge)
  • Wider regression sweeps (other PDKs, other designs) — open question for reviewers

Tunable knobs (compile-time)

  • k_steiner_ub_ = 1.2f — Hwang-style Steiner upper-bound multiplier on HPWL (industry typical 1.5; tightened here based on empirical buffer-count match)
  • k_screen_safety_ = 0.0f — extra screen safety margin (the existing slew_rc_factor_ already carries 10 % modeling pessimism, so none added)

Both can be raised if a future workload shows buffer-count drift above a few percent.

Out of scope (deliberate)

  • WNS-stagnation gate (territory of PR #10284, "rsz: update failing test golden on PR 10248"; different code path)
  • Multi-threaded driver loop
  • Hierarchical-mode dbNetwork dispatch overhead (separate, larger lift; this PR works in both flat and -hier modes)
  • Replacing invalid_requireds_ with unordered_set upstream of OpenSTA (orthogonal; would compound this PR's gains by removing the log N from the residual tree-set inserts)

Two independent changes to RepairDesign that together cut wall time
on the per-driver loop by an order of magnitude on a large
confidential ASAP7 design (~2.7M flat instances), with end-of-run
buffer/resize counts matching the unmodified path within
0.003%-0.03% drift:

1. Cheap lib + odb HPWL screen at the top of repairDriver.

   For drivers where Penfield-Rubinstein closed-form upper bounds
   (Elmore tau scaled by the existing slew_rc_factor_) prove that
   the net cannot violate slew or cap limits, skip the full STA
   path: ensureWireParasitic, findDelays, checkSlew, checkCap, and
   makeBufferedNet. The bound is sound by construction (HPWL is a
   lower bound on Steiner length, total cap and 2.2*R_total*C_total
   are both upper bounds), so a "safe" verdict is exact, not
   heuristic. Per-LibertyCell cap_limit and per-LibertyPort
   slew_limit are cached to keep the screen ~50-100 ns per net.

   On the reproducer ~78-84% of drivers are screened safe.

   The screen also short-circuits makeBufferedNet for drivers that
   pass cap/slew but have no wire-length limit, which by itself
   removes a per-driver Steiner-tree build that the existing path
   does unconditionally.

2. Defer the post-resize updateParasitics() flush.

   In the existing inner loop, repairDriverSlew (a cell resize)
   was followed by estimate_parasitics_->updateParasitics(), which
   walks every invalidated net's fanin and inserts every reachable
   vertex into Search::invalid_arrivals_/invalid_requireds_ (a
   std::set<Vertex*>). On the reproducer's long tail, perf record
   showed ~14% of CPU spent in those tree-set operations and ~25%
   in the dbNetwork id/RTTI dispatch driving Steiner re-extraction
   for nets that the next iteration would visit anyway.

   We replace the global flush with a targeted
   ensureWireParasitic(drvr_pin, drvr_net) so the local recheck
   sees fresh parasitics for THIS driver, while the other
   invalidated nets remain queued for on-demand refresh when
   their own drivers are processed later in the level-ordered
   pass. The IncrementalParasiticsGuard destructor still does a
   single final flush at scope exit.

Measured on cut-down versions of the reproducer (screen enabled):

   size                repair_design wall (s)
                       before     after     speedup
   12% (~318k drvrs)    1494       170       8.8x
   25% (~661k drvrs)    8989       668      13.5x

The speedup ratio grows with design size because the deferred
cascade is what scales super-linearly. The full design previously
could not finish in any reasonable time on a 30 GB host (memory
pressure compounding the algorithmic slowdown); with the change
applied, runs at sizes that did finish for a clean before/after
comparison show the multiplicative speedup above.

All eight rsz repair_design / repair_slew / repair_cap /
repair_fanout regression tests pass byte-identical to the .ok
files. repair_design3-tcl_test (the tristate / N^2 stress test)
drops from 189s to 113s as a side effect.

Verbose-only diagnostic line "[screen] bucket .. safe; rej cap=..
slew=.. other=.." with a deterministic est-design-mem column is
emitted alongside the existing progress table; non-verbose runs
are unchanged.

Knobs (compile-time constants): k_steiner_ub_ = 1.2 (Hwang-style
Steiner upper-bound multiplier on HPWL), k_screen_safety_ = 0.0
(the existing slew_rc_factor_ already carries 10% modeling
pessimism). Both can be tuned upward if a future workload shows
buffer-count drift above a few percent.

Signed-off-by: Øyvind Harboe <oyvind.harboe@zylin.com>
@github-actions github-actions Bot added the size/M label May 4, 2026
@github-actions
Contributor

github-actions Bot commented May 4, 2026

clang-tidy review says "All clean, LGTM! 👍"

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces a fast screening mechanism (screenNetSafe) in RepairDesign to skip expensive STA-based repair checks for nets that are provably safe based on HPWL and library pin capacitances. It also includes optimizations to parasitic updates and Steiner tree construction to reduce CPU overhead during the repair process. The review feedback suggests using higher precision double literals in margin calculations for capacitance and slew limits to avoid potential precision loss.

++screen_rej_no_lib_; // No lib limit known: defer to STA.
return false;
}
cap_limit *= (1.0f - static_cast<float>(cap_margin_) / 100.0f);

medium

To maintain precision, especially since cap_margin_ is a double, consider performing the calculation with double literals. This ensures that the division and subtraction are done with higher precision before being applied to the float cap_limit.

Similarly for slew_limit on line 289.

Suggested change
- cap_limit *= (1.0f - static_cast<float>(cap_margin_) / 100.0f);
+ cap_limit *= (1.0 - cap_margin_ / 100.0);
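
To gauge how much the two forms can differ, the following sketch evaluates both and reports the relative gap. The function names are hypothetical; the only assumption carried over from the review comment is that the limit is a float and the margin is a double percentage.

```cpp
#include <cmath>

// Original form: every operand is converted to float before the arithmetic.
float scaleFloat(float limit, double margin_pct)
{
  return limit * (1.0f - static_cast<float>(margin_pct) / 100.0f);
}

// Suggested form: arithmetic in double, rounded to float once at the end.
float scaleDouble(float limit, double margin_pct)
{
  return static_cast<float>(limit * (1.0 - margin_pct / 100.0));
}

// Relative difference between the two results.
double relativeGap(float limit, double margin_pct)
{
  const double a = scaleFloat(limit, margin_pct);
  const double b = scaleDouble(limit, margin_pct);
  return std::fabs(a - b) / std::fabs(b);
}
```

Because the result is stored back into a float either way, the gap is bounded by a few float rounding steps (on the order of 1e-7 relative), so the suggestion mainly affects readability and intermediate rounding rather than the screen's verdicts.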

++screen_rej_no_lib_; // No slew limit known: defer to STA.
return false;
}
slew_limit *= (1.0f - static_cast<float>(slew_margin_) / 100.0f);

medium

To maintain precision, especially since slew_margin_ is a double, consider performing the calculation with double literals. This ensures that the division and subtraction are done with higher precision before being applied to the float slew_limit.

Similarly for cap_limit on line 192.

Suggested change
- slew_limit *= (1.0f - static_cast<float>(slew_margin_) / 100.0f);
+ slew_limit *= (1.0 - slew_margin_ / 100.0);

@oharboe oharboe requested review from maliberty and removed request for maliberty May 4, 2026 10:35
@oharboe
Collaborator Author

oharboe commented May 4, 2026

@precisionmoon @maliberty Is this a good idea? Is Claude onto something here?

@oharboe oharboe marked this pull request as ready for review May 4, 2026 10:50
@oharboe oharboe requested review from dsengupta0628, maliberty and precisionmoon and removed request for dsengupta0628 and precisionmoon May 4, 2026 10:50
@oharboe
Collaborator Author

oharboe commented May 4, 2026

@maliberty I'm out of my depth here, I've linked to this in a feature request.

@oharboe oharboe closed this May 4, 2026