Skip to content

rsz: bail repair_timing on obviously-futile designs#10248

Closed
oharboe wants to merge 8 commits intoThe-OpenROAD-Project:masterfrom
oharboe:end-hopeless-repair
Closed

rsz: bail repair_timing on obviously-futile designs#10248
oharboe wants to merge 8 commits intoThe-OpenROAD-Project:masterfrom
oharboe:end-hopeless-repair

Conversation

@oharboe
Copy link
Copy Markdown
Collaborator

@oharboe oharboe commented Apr 24, 2026

Adds a deterministic WNS-stagnation gate to RepairSetup::terminateProgress(). Best-so-far WNS is sampled every pass into a 200-pass ring buffer; if the best observed value has not improved by max(1 ps, 0.5% * |initial_wns|) over a full window, the gate returns true. Combined with the existing two-consecutive-termination rule this aborts the phase after ~1200 passes of no WNS movement on designs where tiny TNS twitches would otherwise keep the optimizer running forever.

Motivated by automated architectural-exploration flows where the .sdc is held at an ambitious target while RTL evolves. Without this, a WNS gap of hundreds of ps grinds repair_timing for hours producing no useful movement and forcing users to guess SETUP_SLACK_MARGIN values to stop it. The gate fires only when no reasonable user would disagree that further effort is futile (1 ps absolute floor keeps tape-out grind untouched; the TNS fix-rate gate still owns termination near closure).

On trip, a single loud INFO log (RSZ-0234/235/236) names the best-effort WNS and notes this is probably an exploration run. No new Tcl flag, no new ORFS env var - hardcoded conservative defaults.

New test test/orfs/hopeless/ synthesizes 8 parallel 8-deep arithmetic pipelines on asap7 with a 50 ps clock and asserts both that the flow completes and that the gate log message appears (sh_test + grep). A check_same idempotent test guards the determinism property.

Adds a deterministic WNS-stagnation gate to RepairSetup::terminateProgress().
Best-so-far WNS is sampled every pass into a 200-pass ring buffer; if the
best observed value has not improved by max(1 ps, 0.5% * |initial_wns|)
over a full window, the gate returns true. Combined with the existing
two-consecutive-termination rule this aborts the phase after ~1200 passes
of no WNS movement on designs where tiny TNS twitches would otherwise keep
the optimizer running forever.

Motivated by automated architectural-exploration flows where the .sdc is
held at an ambitious target while RTL evolves. Without this, a WNS gap of
hundreds of ps grinds repair_timing for hours producing no useful movement
and forcing users to guess SETUP_SLACK_MARGIN values to stop it. The gate
fires only when no reasonable user would disagree that further effort is
futile (1 ps absolute floor keeps tape-out grind untouched; the TNS
fix-rate gate still owns termination near closure).

On trip, a single loud INFO log (RSZ-0234/235/236) names the best-effort
WNS and notes this is probably an exploration run. No new Tcl flag, no
new ORFS env var - hardcoded conservative defaults.

New test test/orfs/hopeless/ synthesizes 8 parallel 8-deep arithmetic
pipelines on asap7 with a 50 ps clock and asserts both that the flow
completes and that the gate log message appears (sh_test + grep). A
check_same idempotent test guards the determinism property.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Øyvind Harboe <oyvind.harboe@zylin.com>
Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clang-tidy made some suggestions

Comment thread src/rsz/src/RepairSetup.hh
Comment thread src/rsz/src/RepairSetup.hh Outdated
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a WNS-stagnation gate to the timing repair logic to abort optimization phases when Worst Negative Slack fails to improve over a set number of iterations. The implementation includes a ring buffer for WNS history, methods for tracking and reporting stagnation, and a new regression test for infeasible designs. Review feedback identifies a unit mismatch in the stagnation tolerance constant and suggests replacing a magic number in the iteration logic with a named constant.

Comment thread src/rsz/src/RepairSetup.hh
Comment thread src/rsz/src/RepairSetup.cc Outdated
- RepairSetup.hh: include <cstddef> for size_t and <string> for
  std::string (include-cleaner warnings from clang-tidy CI bot).
- RepairSetup.hh / RepairSetup.cc: name the stagnation-gate warmup
  threshold wns_stagnation_warmup_iterations_ instead of the bare 1000
  literal, matching the style of the other tunables in this class.

Not applied: gemini-code-assist's suggestion to change
wns_stagnation_abs_tol_ from 1.0e-12f to 1.0e-3f. sta::Slack is
measured in seconds (see src/sta/include/sta/Units.hh:73 "Sta internal
units are always seconds, ..."), so 1.0e-12 s is 1 ps, which is the
intended value. 1.0e-3 s would be 1 ms and would disable the gate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Øyvind Harboe <oyvind.harboe@zylin.com>
@github-actions
Copy link
Copy Markdown
Contributor

clang-tidy review says "All clean, LGTM! 👍"

@oharboe
Copy link
Copy Markdown
Collaborator Author

oharboe commented Apr 24, 2026

/gemini review

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a WNS-stagnation gate to the repair_timing process to prevent excessive iterations on infeasible designs where TNS might fluctuate but WNS remains stuck. The implementation uses a ring buffer to track WNS history and terminates progress if improvements fall below a deterministic threshold. A new regression test, "hopeless," is added to verify this behavior. Feedback includes replacing a non-ASCII character in log messages for better terminal compatibility and using std::max_element to improve code readability.

Comment thread src/rsz/src/RepairSetup.cc Outdated
Comment thread src/rsz/src/RepairSetup.cc Outdated
oharboe and others added 2 commits April 24, 2026 12:13
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Øyvind Harboe <oyvind.harboe@zylin.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Øyvind Harboe <oyvind.harboe@zylin.com>
@github-actions
Copy link
Copy Markdown
Contributor

clang-tidy review says "All clean, LGTM! 👍"

Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clang-tidy made some suggestions

Comment thread src/rsz/src/RepairSetup.cc Outdated
Wrap long sta::Slack initialization line that clang-format flagged on
PR The-OpenROAD-Project#10248 after the gemini-code-assist suggestion was applied via the
GitHub UI.

Signed-off-by: Øyvind Harboe <oyvind.harboe@zylin.com>
Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clang-tidy made some suggestions

Comment thread src/rsz/src/RepairSetup.cc Outdated
The sliding-window check (window_best - window_oldest over 200 passes)
mis-classified plateaus as futility: any real design whose WNS drops
dramatically early then flat-lines at a topology-bound floor while TNS
keeps improving (aes, clone_flat, repair_fanout*) tripped the gate and
aborted repair_timing too early, leaving worse max_cap/max_slew slack
and different .ok-file output.

Replace with best_wns_ever vs initial_wns_: only fire when WNS has
effectively never moved from its starting value, which is the real
signature of an obviously-futile run.

Also bump rel_tol 0.5% -> 5%. Empirically aes improves WNS by ~50% of
initial, clone_flat by ~95%, repair_fanout by ~90%; the hopeless.v
synthetic only moves WNS by ~2% because buffer insertion chips
something off even a grossly-over-clocked design. 5% sits in the gap.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Øyvind Harboe <oyvind.harboe@zylin.com>
@github-actions
Copy link
Copy Markdown
Contributor

clang-tidy review says "All clean, LGTM! 👍"

@oharboe oharboe requested a review from maliberty April 24, 2026 16:50
@oharboe
Copy link
Copy Markdown
Collaborator Author

oharboe commented Apr 24, 2026

use case: DSE should terminate quickly on lost causes without a human in the loop

@oharboe
Copy link
Copy Markdown
Collaborator Author

oharboe commented Apr 25, 2026

/gemini review

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a WNS-stagnation gate to the repair_timing process to prevent excessive iterations on designs where timing closure is infeasible. The gate monitors Worst Negative Slack (WNS) and terminates the repair phase if improvement falls below a defined threshold after a warmup period. The changes include new tracking variables, a reset mechanism for each repair phase, and detailed logging when stagnation is detected. Additionally, a new regression test named "hopeless" has been added to verify that the flow correctly identifies and bails on pathological designs. I have no feedback to provide.

@maliberty
Copy link
Copy Markdown
Member

It might be better to wait for #10223 to land ; @jhkim-pii opinion?

@maliberty maliberty requested a review from jhkim-pii April 26, 2026 00:02
@jhkim-pii
Copy link
Copy Markdown
Contributor

I think merging this PR first would be better. Otherwise this should be rebuilt on top of the new Resizer architecture by author, which looks more difficult.

@jhkim-pii
Copy link
Copy Markdown
Contributor

I am curious about ORFS desgin QoR impact by this PR.

@oharboe
Copy link
Copy Markdown
Collaborator Author

oharboe commented Apr 26, 2026

I am curious about ORFS desgin QoR impact by this PR.

I think none, it passes. There are no obviously futile design configurations in ORFS designs.

I had to add an obviously futile configuration in this PR to test.

@oharboe
Copy link
Copy Markdown
Collaborator Author

oharboe commented Apr 27, 2026

From my point of view it would be nice to have this merged, even if it is rewritten in the future. It captures a use-case that is important to us.

Presumably the test is what I really care about, the current code is just a stop-gap that reduces the urgency and the test case is what protects the use-case.

@maliberty
Copy link
Copy Markdown
Member

There is no need for a full orfs run to just test rsz. This should be a unit test of a repair_setup command.

@oharboe
Copy link
Copy Markdown
Collaborator Author

oharboe commented Apr 27, 2026

There is no need for a full orfs run to just test rsz. This should be a unit test of a repair_setup command.

How do I write that so it isn't a "working as implemented" test?

@oharboe
Copy link
Copy Markdown
Collaborator Author

oharboe commented Apr 28, 2026

@jhkim-pii Please give me some guidance on writing a good test here. I'll pass it along to claude 🤦

@jhkim-pii
Copy link
Copy Markdown
Contributor

@oharboe OK. I'll try to make the test case and share it. I think sharing the test case would be easier than explaining how to write such test.

@oharboe
Copy link
Copy Markdown
Collaborator Author

oharboe commented Apr 28, 2026

@oharboe OK. I'll try to make the test case and share it. I think sharing the test case would be easier than explaining how to write such test.

I asked Claude to make something... Should be done soon...

oharboe and others added 2 commits April 28, 2026 10:04
Tiny TCL unit test exercising the WNS-stagnation gate added in this PR
without scraping log strings.  It builds a flat block of 1200
register-to-register pairs (DFF_X2 driver -> DFF_X2 sink, no logic in
between), holds the cell library to _X1 / _X2 sizes, and sets a 1 ps
clock period so that no SizeUpMove / BufferMove / pin-swap / clone /
split-load / unbuffer move can improve WNS.  The endpoint count is
chosen so the inner-loop counter passes the gate's warmup before the
legacy phase has visited every endpoint.

Without the gate the same .tcl runs to iteration 1200 (one futile pass
per endpoint); with the gate it exits at iteration 1002 and the .ok
diverges, so the test fails without the fix and passes with it.

Replaces the orfs-based test/orfs/hopeless gate-coverage check that
maliberty asked to convert into a repair_setup-only unit test.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Øyvind Harboe <oyvind.harboe@zylin.com>
Signed-off-by: Øyvind Harboe <oyvind.harboe@zylin.com>

# Conflicts:
#	src/rsz/test/CMakeLists.txt
@oharboe
Copy link
Copy Markdown
Collaborator Author

oharboe commented Apr 28, 2026

@jhkim-pii I have no idea what Claude did there :-) It is short and it doesn't use verilog as input...

@github-actions
Copy link
Copy Markdown
Contributor

clang-tidy review says "All clean, LGTM! 👍"

@oharboe
Copy link
Copy Markdown
Collaborator Author

oharboe commented Apr 28, 2026

@maliberty @jhkim-pii I don't know what's failing here, but I don't think it is related to this PR. I just merged with origin/master.

@jhkim-pii
Copy link
Copy Markdown
Contributor

@jhkim-pii I have no idea what Claude did there :-) It is short and it doesn't use verilog as input...

Looks good.
I generated a test case that is similar to yours.

@jhkim-pii
Copy link
Copy Markdown
Contributor

@maliberty @jhkim-pii I don't know what's failing here, but I don't think it is related to this PR. I just merged with origin/master.

I added the failing test case repair_hold_multi_output_load.tcl in another PR #10236.

Currently, repair_hold_multi_output_load.tcl is failing in the HEAD of OR master.

I found that merge of the PR #10141 changed the QoR slightly.

I'll open a new PR to rebase the failing test repair_hold_multi_output_load.tcl.

@oharboe
Copy link
Copy Markdown
Collaborator Author

oharboe commented Apr 28, 2026

@jhkim-pii Thank you!

I'll open a new PR to rebase the failing test repair_hold_multi_output_load.tcl.

Please close this PR when you open a new one. Thank you for nursing it through 😌

@jhkim-pii
Copy link
Copy Markdown
Contributor

jhkim-pii commented Apr 28, 2026

@oharboe OK. By the way, test/orfs/hopeless test case should be removed because it is redundant (you added a new rsz test case), right?

I wonder if you wish to keep the orfs/hopeless test case for any reason.

@oharboe
Copy link
Copy Markdown
Collaborator Author

oharboe commented Apr 28, 2026

@oharboe OK. By the way, test/orfs/hopeless test case should be removed because it is redundant (you added a new rsz test case), right?

I wonder if you wish to keep the orfs/hopeless test case for any reason.

No particular reason, please remove it. I just think that this feature should have a test that protects the use case :-)

@jhkim-pii
Copy link
Copy Markdown
Contributor

@oharboe

PR description: Adds a deterministic WNS-stagnation gate to RepairSetup::terminateProgress(). Best-so-far WNS is sampled every pass into a 200-pass ring buffer; if the best observed value has not improved by max(1 ps, 0.5% * |initial_wns|) over a full window, the gate returns true.

--> I cannot find any ring buffer in the code. It looks like the code implementation is different from what you intended.

@jhkim-pii
Copy link
Copy Markdown
Contributor

Matt rebased the repair_hold_multi_output_load.tcl regression, so there will be no issue if you pull origin/master again.

const std::string wns_msg = wnsStagnationReport(opto_iteration);
if (!wns_msg.empty()) {
logger_->info(RSZ, 236, "{}", wns_msg);
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

logger_->info(...) call in every caller site.
It would be better to move the logger_->info(...) call within wnsStagnationReport().
Then wnsStagnationReport() does not require the argument and string return.

@jhkim-pii
Copy link
Copy Markdown
Contributor

@oharboe I implemented the ring buffer here in the new PR including this PR.

@oharboe
Copy link
Copy Markdown
Collaborator Author

oharboe commented Apr 29, 2026

@oharboe I implemented the ring buffer here in the new PR including this PR.

Thank you!

@oharboe oharboe closed this Apr 29, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants