rsz: bail repair_timing on obviously-futile designs#10248
rsz: bail repair_timing on obviously-futile designs#10248oharboe wants to merge 8 commits intoThe-OpenROAD-Project:masterfrom
Conversation
Adds a deterministic WNS-stagnation gate to RepairSetup::terminateProgress(). Best-so-far WNS is sampled every pass into a 200-pass ring buffer; if the best observed value has not improved by max(1 ps, 0.5% * |initial_wns|) over a full window, the gate returns true. Combined with the existing two-consecutive-termination rule this aborts the phase after ~1200 passes of no WNS movement on designs where tiny TNS twitches would otherwise keep the optimizer running forever. Motivated by automated architectural-exploration flows where the .sdc is held at an ambitious target while RTL evolves. Without this, a WNS gap of hundreds of ps grinds repair_timing for hours producing no useful movement and forcing users to guess SETUP_SLACK_MARGIN values to stop it. The gate fires only when no reasonable user would disagree that further effort is futile (1 ps absolute floor keeps tape-out grind untouched; the TNS fix-rate gate still owns termination near closure). On trip, a single loud INFO log (RSZ-0234/235/236) names the best-effort WNS and notes this is probably an exploration run. No new Tcl flag, no new ORFS env var - hardcoded conservative defaults. New test test/orfs/hopeless/ synthesizes 8 parallel 8-deep arithmetic pipelines on asap7 with a 50 ps clock and asserts both that the flow completes and that the gate log message appears (sh_test + grep). A check_same idempotent test guards the determinism property. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Øyvind Harboe <oyvind.harboe@zylin.com>
There was a problem hiding this comment.
Code Review
This pull request introduces a WNS-stagnation gate to the timing repair logic to abort optimization phases when Worst Negative Slack fails to improve over a set number of iterations. The implementation includes a ring buffer for WNS history, methods for tracking and reporting stagnation, and a new regression test for infeasible designs. Review feedback identifies a unit mismatch in the stagnation tolerance constant and suggests replacing a magic number in the iteration logic with a named constant.
- RepairSetup.hh: include <cstddef> for size_t and <string> for std::string (include-cleaner warnings from clang-tidy CI bot). - RepairSetup.hh / RepairSetup.cc: name the stagnation-gate warmup threshold wns_stagnation_warmup_iterations_ instead of the bare 1000 literal, matching the style of the other tunables in this class. Not applied: gemini-code-assist's suggestion to change wns_stagnation_abs_tol_ from 1.0e-12f to 1.0e-3f. sta::Slack is measured in seconds (see src/sta/include/sta/Units.hh:73 "Sta internal units are always seconds, ..."), so 1.0e-12 s is 1 ps, which is the intended value. 1.0e-3 s would be 1 ms and would disable the gate. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Øyvind Harboe <oyvind.harboe@zylin.com>
|
clang-tidy review says "All clean, LGTM! 👍" |
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces a WNS-stagnation gate to the repair_timing process to prevent excessive iterations on infeasible designs where TNS might fluctuate but WNS remains stuck. The implementation uses a ring buffer to track WNS history and terminates progress if improvements fall below a deterministic threshold. A new regression test, "hopeless," is added to verify this behavior. Feedback includes replacing a non-ASCII character in log messages for better terminal compatibility and using std::max_element to improve code readability.
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Øyvind Harboe <oyvind.harboe@zylin.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Øyvind Harboe <oyvind.harboe@zylin.com>
|
clang-tidy review says "All clean, LGTM! 👍" |
Wrap long sta::Slack initialization line that clang-format flagged on PR The-OpenROAD-Project#10248 after the gemini-code-assist suggestion was applied via the GitHub UI. Signed-off-by: Øyvind Harboe <oyvind.harboe@zylin.com>
The sliding-window check (window_best - window_oldest over 200 passes) mis-classified plateaus as futility: any real design whose WNS drops dramatically early then flat-lines at a topology-bound floor while TNS keeps improving (aes, clone_flat, repair_fanout*) tripped the gate and aborted repair_timing too early, leaving worse max_cap/max_slew slack and different .ok-file output. Replace with best_wns_ever vs initial_wns_: only fire when WNS has effectively never moved from its starting value, which is the real signature of an obviously-futile run. Also bump rel_tol 0.5% -> 5%. Empirically aes improves WNS by ~50% of initial, clone_flat by ~95%, repair_fanout by ~90%; the hopeless.v synthetic only moves WNS by ~2% because buffer insertion chips something off even a grossly-over-clocked design. 5% sits in the gap. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Øyvind Harboe <oyvind.harboe@zylin.com>
|
clang-tidy review says "All clean, LGTM! 👍" |
|
use case: DSE should terminate quickly on lost causes without a human in the loop |
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces a WNS-stagnation gate to the repair_timing process to prevent excessive iterations on designs where timing closure is infeasible. The gate monitors Worst Negative Slack (WNS) and terminates the repair phase if improvement falls below a defined threshold after a warmup period. The changes include new tracking variables, a reset mechanism for each repair phase, and detailed logging when stagnation is detected. Additionally, a new regression test named "hopeless" has been added to verify that the flow correctly identifies and bails on pathological designs. I have no feedback to provide.
|
It might be better to wait for #10223 to land ; @jhkim-pii opinion? |
|
I think merging this PR first would be better. Otherwise this should be rebuilt on top of the new Resizer architecture by author, which looks more difficult. |
|
I am curious about ORFS desgin QoR impact by this PR. |
I think none, it passes. There are no obviously futile design configurations in ORFS designs. I had to add an obviously futile configuration in this PR to test. |
|
From my point of view it would be nice to have this merged, even if it is rewritten in the future. It captures a use-case that is important to us. Presumably the test is what I really care about, the current code is just a stop-gap that reduces the urgency and the test case is what protects the use-case. |
|
There is no need for a full orfs run to just test rsz. This should be a unit test of a repair_setup command. |
How do I write that so it isn't a "working as implemented" test? |
|
@jhkim-pii Please give me some guidance on writing a good test here. I'll pass it along to claude 🤦 |
|
@oharboe OK. I'll try to make the test case and share it. I think sharing the test case would be easier than explaining how to write such test. |
I asked Claude to make something... Should be done soon... |
Tiny TCL unit test exercising the WNS-stagnation gate added in this PR without scraping log strings. It builds a flat block of 1200 register-to-register pairs (DFF_X2 driver -> DFF_X2 sink, no logic in between), holds the cell library to _X1 / _X2 sizes, and sets a 1 ps clock period so that no SizeUpMove / BufferMove / pin-swap / clone / split-load / unbuffer move can improve WNS. The endpoint count is chosen so the inner-loop counter passes the gate's warmup before the legacy phase has visited every endpoint. Without the gate the same .tcl runs to iteration 1200 (one futile pass per endpoint); with the gate it exits at iteration 1002 and the .ok diverges, so the test fails without the fix and passes with it. Replaces the orfs-based test/orfs/hopeless gate-coverage check that maliberty asked to convert into a repair_setup-only unit test. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Øyvind Harboe <oyvind.harboe@zylin.com>
Signed-off-by: Øyvind Harboe <oyvind.harboe@zylin.com> # Conflicts: # src/rsz/test/CMakeLists.txt
|
@jhkim-pii I have no idea what Claude did there :-) It is short and it doesn't use verilog as input... |
|
clang-tidy review says "All clean, LGTM! 👍" |
|
@maliberty @jhkim-pii I don't know what's failing here, but I don't think it is related to this PR. I just merged with origin/master. |
Looks good. |
I added the failing test case Currently, I found that merge of the PR #10141 changed the QoR slightly. I'll open a new PR to rebase the failing test |
|
@jhkim-pii Thank you!
Please close this PR when you open a new one. Thank you for nursing it through 😌 |
|
@oharboe OK. By the way, I wonder if you wish to keep the |
No particular reason, please remove it. I just think that this feature should have a test that protects the use case :-) |
--> I cannot find any ring buffer in the code. It looks like the code implementation is different from what you intended. |
|
Matt rebased the |
| const std::string wns_msg = wnsStagnationReport(opto_iteration); | ||
| if (!wns_msg.empty()) { | ||
| logger_->info(RSZ, 236, "{}", wns_msg); | ||
| } |
There was a problem hiding this comment.
logger_->info(...) call in every caller site.
It would be better to move the logger_->info(...) call within wnsStagnationReport().
Then wnsStagnationReport() does not require the argument and string return.
|
@oharboe I implemented the ring buffer here in the new PR including this PR. |
Thank you! |
Adds a deterministic WNS-stagnation gate to RepairSetup::terminateProgress(). Best-so-far WNS is sampled every pass into a 200-pass ring buffer; if the best observed value has not improved by max(1 ps, 0.5% * |initial_wns|) over a full window, the gate returns true. Combined with the existing two-consecutive-termination rule this aborts the phase after ~1200 passes of no WNS movement on designs where tiny TNS twitches would otherwise keep the optimizer running forever.
Motivated by automated architectural-exploration flows where the .sdc is held at an ambitious target while RTL evolves. Without this, a WNS gap of hundreds of ps grinds repair_timing for hours producing no useful movement and forcing users to guess SETUP_SLACK_MARGIN values to stop it. The gate fires only when no reasonable user would disagree that further effort is futile (1 ps absolute floor keeps tape-out grind untouched; the TNS fix-rate gate still owns termination near closure).
On trip, a single loud INFO log (RSZ-0234/235/236) names the best-effort WNS and notes this is probably an exploration run. No new Tcl flag, no new ORFS env var - hardcoded conservative defaults.
New test test/orfs/hopeless/ synthesizes 8 parallel 8-deep arithmetic pipelines on asap7 with a 50 ps clock and asserts both that the flow completes and that the gate log message appears (sh_test + grep). A check_same idempotent test guards the determinism property.