Context
From PR #154. With the real wd_in-fixed FakeRAM and the corrected single-clock + generated-clock-feedthrough SDC, litepci-asap7 closes setup (reg-to-reg slack MET, +0.758 ns at a 3.6 ns clock period) but does not close hold: ~2.2 ns CTS clock skew and ~11,000 hold violations.
Root cause
The fixed LiteX RTL clocks 20 DMA/TLP FakeRAM macros from a single small clock source, the pcie_us/user_clk pin of a blackbox macro. Those 20 sink macros occupy 59% (at util 60) to 74% (at util 75) of the core area, so CTS must snake a deep buffer chain across the die around the macro blockages:
- launch clock → one FakeRAM: 3 buffers, 0.21 ns
- capture clock → another FakeRAM: ~25 buffers, 2.94 ns
- Result: ~2.2–2.7 ns launch/capture skew, hence ~11k hold violations
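To make the skew-to-hold link concrete, here is a small sketch of the standard hold check using the two insertion delays above. The minimum data-path delays and the zero hold time are hypothetical values chosen for illustration, not numbers from this run. Any path between these two sinks whose minimum data delay is below the ~2.7 ns skew comes out negative.

```python
# Clock insertion delays quoted above, in ns (extreme launch and capture sinks).
LAUNCH_INSERTION_NS = 0.21   # 3 buffers to one FakeRAM
CAPTURE_INSERTION_NS = 2.94  # ~25 buffers to another FakeRAM


def hold_slack(launch_clk_ns, capture_clk_ns, data_min_ns, hold_time_ns=0.0):
    """Standard hold check: new data must not arrive before the capture edge
    plus the hold requirement.

    slack = (launch clock arrival + min data delay) - (capture clock arrival + hold)
    """
    return (launch_clk_ns + data_min_ns) - (capture_clk_ns + hold_time_ns)


skew_ns = CAPTURE_INSERTION_NS - LAUNCH_INSERTION_NS
print(f"launch/capture skew: {skew_ns:.2f} ns")

# Hypothetical minimum data-path delays (ns); hold time assumed 0 for simplicity.
for data_min_ns in (0.1, 0.5, 1.0, 2.0, 3.0):
    slack = hold_slack(LAUNCH_INSERTION_NS, CAPTURE_INSERTION_NS, data_min_ns)
    status = "MET" if slack >= 0 else "VIOLATED"
    print(f"min data delay {data_min_ns:.1f} ns -> hold slack {slack:+.2f} ns ({status})")
```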
In-scope levers tried (all fail)
- Utilization 60→75: core area −20% and setup TNS −75%, but skew only drops 2.7→2.2 ns and the hold-violation count stays flat; the skew is roughly independent of die size.
- Macro placement: tried one tight block (overflows the die height), a balanced perimeter ring (overflows the usable edge length), and two bands with a central clock stripe (the bands consume the full die, leaving no room for the stripe). All hit the same area/perimeter wall: at any utilization that meets the area goal, the macros are too dense to sit compactly around the single clock-source pin while leaving room for a balanced central clock spine.
- SDC: the single-clock + generated-clock-feedthrough fixes removed the spurious −2 ns WNS but do not change the physical skew.
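The utilization numbers can be sanity-checked with simple arithmetic. This sketch assumes the placed instance area is fixed (so core area scales as 1/util) and, for the per-side figure, that the die keeps its aspect ratio. Under those assumptions, going from util 60 to 75 shrinks the core by 20% but each side by only ~11%. The 59% macro share at util 60 also predicts ~74% at util 75, matching the quoted figure: the macros are a fixed block of area that fills a growing share of a smaller die.

```python
import math

# Core-area fractions occupied by the 20 FakeRAM macros, as quoted above.
MACRO_FRACTION = {0.60: 0.59, 0.75: 0.74}


def core_area_ratio(util_from, util_to):
    """Relative core area after a utilization change.

    Assumes the placed instance area is fixed, so core area ~ instance_area / util.
    """
    return util_from / util_to


area_ratio = core_area_ratio(0.60, 0.75)  # 0.8, i.e. core area -20%
side_ratio = math.sqrt(area_ratio)        # per-side ratio, assuming a fixed aspect ratio

# Macro area does not shrink, so its share of the core grows by 1 / area_ratio.
predicted_fraction_75 = MACRO_FRACTION[0.60] / area_ratio

print(f"core area ratio:       {area_ratio:.3f}")
print(f"die side ratio:        {side_ratio:.3f}")
print(f"predicted macro share: {predicted_fraction_75:.4f} (quoted: {MACRO_FRACTION[0.75]})")
```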
SKIP_INCREMENTAL_REPAIR=1 (ODB-1200 workaround) also disables post-GRT hold repair, but data-side hold buffering cannot absorb a ~2.2 ns structural skew anyway.
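A rough sense of why data-side hold buffering is the wrong tool here: the sketch counts the delay stages needed to cover a hold deficit. The 20 ps per-stage delay is a hypothetical placeholder, not a measured ASAP7 value, and real repair may use larger delay cells. Violations also vary in depth, so multiplying by the full violation count gives only a loose upper bound.

```python
# Hypothetical per-stage delay of a hold-fix buffer, in picoseconds.
# This is NOT a measured ASAP7 number; substitute the real delay-cell value.
HOLD_BUF_DELAY_PS = 20


def buffers_needed(hold_slack_ps, buf_delay_ps=HOLD_BUF_DELAY_PS):
    """Delay stages needed on one data path to bring negative hold slack to >= 0."""
    if hold_slack_ps >= 0:
        return 0
    deficit_ps = -hold_slack_ps
    return -(-deficit_ps // buf_delay_ps)  # integer ceiling division


per_path = buffers_needed(-2200)  # one endpoint short by the full ~2.2 ns skew
upper_bound = per_path * 11_000   # loose bound: if every quoted violation were that deep
print(per_path, upper_bound)
```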
Why not just fix it
Closing hold would require modifying the RTL (adding the clock buffer/tree a real SoC integrator would place around the PCIe hard-IP user clock), which the HighTide benchmark charter forbids because RTL is a fixed input. So asap7's current deliverable is: setup closes, and hold is characterized as structurally skew-bound, which is a legitimate benchmark result.
Possible future directions (need discussion)
- Useful-skew / CTS clustering knobs: investigate whether ORFS CTS can be told to build a balanced multi-root tree to the FakeRAM clk pins (e.g. CTS_* clustering, per-sink-group roots). Lower confidence given the physical spread, but not yet exhausted.
- Macro-pin clock source modeling: the clock root being a blackbox-macro output pin may prevent CTS from rooting/buffering optimally; explore an SDC/PRE_CTS_TCL approach that gives CTS a better insertion point.
- Charter exception: if a single inserted clock buffer/tree at the pcie_us/user_clk boundary is considered integration glue (not core RTL), a minimal netlist-level CTS hint could be permitted.
- Accept + document (current state): keep the characterization; revisit if OpenROAD CTS gains better macro-aware balancing upstream.
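To illustrate the per-sink-group-roots idea, here is a sketch that partitions FakeRAM clock pins into k groups with k-medians on Manhattan distance and compares each group's worst leaf span against a single corner root. The pin coordinates and the source location are hypothetical. The algorithm is a generic stand-in, not OpenROAD's own sink-clustering algorithm. It estimates only the leaf spans: the trunk from pcie_us/user_clk to each group root would still need to be delay-matched, and that is where the macro blockages bite.

```python
from statistics import median
from typing import List, Tuple

Point = Tuple[float, float]


def manhattan(a: Point, b: Point) -> float:
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def assign(sinks: List[Point], roots: List[Point]) -> List[List[Point]]:
    """Attach every sink to its nearest root (Manhattan distance, ties -> lowest index)."""
    groups: List[List[Point]] = [[] for _ in roots]
    for s in sinks:
        nearest = min(range(len(roots)), key=lambda i: manhattan(s, roots[i]))
        groups[nearest].append(s)
    return groups


def cluster_sinks(sinks: List[Point], k: int, iters: int = 20):
    """k-medians: the coordinate-wise median minimizes total Manhattan distance."""
    roots = [sinks[i * len(sinks) // k] for i in range(k)]  # deterministic seeds
    for _ in range(iters):
        groups = assign(sinks, roots)
        new_roots = [
            (median(p[0] for p in g), median(p[1] for p in g)) if g else roots[i]
            for i, g in enumerate(groups)
        ]
        if new_roots == roots:
            break
        roots = new_roots
    return roots, assign(sinks, roots)


def worst_span(root: Point, group: List[Point]) -> float:
    return max((manhattan(root, s) for s in group), default=0)


# Hypothetical geometry: 20 FakeRAM clk pins on a 5 x 4 grid, clock source at a corner.
SINKS = [(x, y) for x in (100, 300, 500, 700, 900) for y in (100, 400, 700, 1000)]
SOURCE = (0, 0)

single = worst_span(SOURCE, SINKS)
roots, groups = cluster_sinks(SINKS, k=4)
multi = max(worst_span(r, g) for r, g in zip(roots, groups))
print(f"single root: worst span {single}")
print(f"4 roots {roots}: worst span {multi}")
```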
See designs/src/litepci/DECISIONS.md ("Per-platform PPA + the structural hold-skew limit") for the full analysis.