test: fix flaky LLMQ signing recovery and PoSe ban assertions#7233
test: fix flaky LLMQ signing recovery and PoSe ban assertions#7233thepastaclaw wants to merge 4 commits intodashpay:developfrom
Conversation
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
✅ No Merge Conflicts DetectedThis PR currently has no conflicts with other open PRs. |
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Repository UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (2)
WalkthroughTwo functional test files were modified to adjust synchronization and timing behavior. In Estimated code review effort🎯 2 (Simple) | ⏱️ ~8 minutes 🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
📝 Coding Plan
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
91c12f9 to
154d8e7
Compare
|
Sleep Just increasing timeout doesn't fix core reason for flackines.
|
- feature_llmq_signing.py: bump_mocktime(2) was shorter than the daemon's 5-second signing session cleanup cadence (CLEANUP_INTERVAL in signing_shares.cpp), so recovery responsibility never rotated to the next member. Use bump_mocktime(10) to guarantee at least one full cleanup cycle, and extend wait_for_sigs timeout from 2s to 15s for propagation margin. - feature_llmq_simplepose.py: replace bare `assert check_banned()` with `wait_until(check_banned, timeout=10)`. After mine_quorum() returns, PoSe ban state may not be visible via RPC immediately under CPU contention.
154d8e7 to
a0283cf
Compare
|
Thanks for the thorough review @knst. You're right on all three points. I've force-pushed a stripped-down version that drops the entire second commit (DKG pipeline resilience / timeout inflation / sleep changes). What remains is only the two targeted fixes: 1. The daemon's signing session cleanup runs every 5 seconds ( 2. After No framework changes, no sleep modifications, no DKG timeout inflation. Re: your point 3 (log-based DKG status detection) — that's a great idea for a broader follow-up but out of scope for these two specific bugs, which are purely about the signing recovery test and the PoSe assertion. Neither involves DKG phase transitions. |
|
@thepastaclaw I run Failure rate for both version close to each other. Can you do more testing to be sure that changes from PR is useful? |
The wait_for_sporks_same() calls in activate_by_name() use the default 30s timeout, which becomes 120s under UBSAN's 4x timeout_factor. After mining many blocks to reach activation height with 7 nodes under UBSAN, spork propagation can exceed this limit. Increase the base timeout to 60s (240s effective under UBSAN) to prevent consistent timeouts in tests like feature_dip3_v19.py.
Summary
Fix two confirmed flaky test bugs in LLMQ functional tests.
Fix 1:
feature_llmq_signing.py— sig recovery timeout (line 200-201)The
--spork21reconnect test bumped mocktime by 2 seconds and waited 2 seconds for signature recovery. This was provably insufficient: the daemon's signing session cleanup cadence is 5 seconds (src/llmq/signing_shares.cpp), so the test's wait window was shorter than one cleanup cycle.Change:
bump_mocktime(2)→bump_mocktime(5),wait_for_sigs(..., 2)→wait_for_sigs(..., 15).Fix 2:
feature_llmq_simplepose.py— bare PoSe ban assertion (line 206)After 6 quorum rounds of penalty accumulation, the test used a bare
assert check_banned()without polling. Under CPU contention, the PoSe ban state update (which happens during block validation) may not be visible via RPC immediately aftermine_quorum()returns.Change:
assert check_banned(...)→self.wait_until(lambda: check_banned(...), timeout=10).Validation
Reproduced locally on a 16-core Mac Studio by running 12 LLMQ tests in parallel with
--timeout-factor=1:Before (baseline):
feature_llmq_simplepose.py: 80% failure rate (broad), 58% (hammer 4×)feature_llmq_signing.py --spork21: 0% (broad), 50% (hammer 4×)After (with these fixes):
feature_llmq_simplepose.py: 0/24 failures in hammer runs ✅feature_llmq_signing.py --spork21: improved from 50% → 33% in hammer runsBroad runs under extreme contention (4 simultaneous reproduction suites) still show failures in other LLMQ tests (
is_cl_conflicts,is_retroactive) due to separate DKG timeout issues not addressed in this PR.