test: fix flaky LLMQ signing recovery and PoSe ban assertions by thepastaclaw · Pull Request #7233 · dashpay/dash

thepastaclaw · 2026-03-17T02:48:42Z

Summary

Fix two confirmed flaky test bugs in LLMQ functional tests.

Fix 1: `feature_llmq_signing.py` — sig recovery timeout (line 200-201)

The --spork21 reconnect test bumped mocktime by 2 seconds and waited 2 seconds for signature recovery. This was provably insufficient: the daemon's signing session cleanup cadence is 5 seconds (src/llmq/signing_shares.cpp), so the test's wait window was shorter than one cleanup cycle.

Change: bump_mocktime(2) → bump_mocktime(5), wait_for_sigs(..., 2) → wait_for_sigs(..., 15).

Fix 2: `feature_llmq_simplepose.py` — bare PoSe ban assertion (line 206)

After 6 quorum rounds of penalty accumulation, the test used a bare assert check_banned() without polling. Under CPU contention, the PoSe ban state update (which happens during block validation) may not be visible via RPC immediately after mine_quorum() returns.

Change: assert check_banned(...) → self.wait_until(lambda: check_banned(...), timeout=10).

Validation

Reproduced locally on a 16-core Mac Studio by running 12 LLMQ tests in parallel with --timeout-factor=1:

Before (baseline):

feature_llmq_simplepose.py: 80% failure rate (broad), 58% (hammer 4×)
feature_llmq_signing.py --spork21: 0% (broad), 50% (hammer 4×)

After (with these fixes):

feature_llmq_simplepose.py: 0/24 failures in hammer runs ✅
feature_llmq_signing.py --spork21: improved from 50% → 33% in hammer runs

Broad runs under extreme contention (4 simultaneous reproduction suites) still show failures in other LLMQ tests (is_cl_conflicts, is_retroactive) due to separate DKG timeout issues not addressed in this PR.

thepastaclaw · 2026-03-17T02:48:51Z

@coderabbitai review

coderabbitai · 2026-03-17T02:48:57Z

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

github-actions · 2026-03-17T02:49:09Z

✅ No Merge Conflicts Detected

This PR currently has no conflicts with other open PRs.

coderabbitai · 2026-03-17T02:52:38Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 9cf8d159-5339-49dc-8e79-1f3673485ced

📥 Commits

Reviewing files that changed from the base of the PR and between 39851c3 and 872948c.

📒 Files selected for processing (2)

test/functional/feature_llmq_signing.py
test/functional/feature_llmq_simplepose.py

Walkthrough

Two functional test files were modified to adjust synchronization and timing behavior. In test/functional/feature_llmq_signing.py, mocktime advancement was increased from 2 seconds to 5 seconds and the wait timeout for recovered signatures was extended from 2 seconds to 15 seconds. In test/functional/feature_llmq_simplepose.py, a hardcoded assertion checking if a Masternode is banned was replaced with a polling mechanism that waits up to a timeout for the banned condition to become true.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

🚥 Pre-merge checks | ✅ 3

✅ Passed checks (3 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately summarizes the main changes: fixing flaky test assertions in LLMQ signing recovery and PoSe ban mechanisms.
Description check	✅ Passed	The description is directly related to the changeset, providing detailed explanations of both fixes with specific file locations, problem analysis, and validation results.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

📝 Coding Plan

Generate coding plan for human review comments

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

knst · 2026-03-17T06:00:27Z

Sleep sleep=0.05 is meant to be sleep 0.05 otherwise all functional tests that uses dkg runs MUCH longer. not (0.5 - 0.04 = 0.45s) slower, but minutes because it sum-up on each stage for each new quorum for each test.

Just increasing timeout doesn't fix core reason for flackines.

show a proof that each particular change decreased flackines in feature_llmq_signing.py (do 100 run and collect statistically proven datas).
check a performance of all functional tests that use dkg doesn't get significantly slower (1% slower for overall run is ok, 30% slower is not ok).
[most important] try to look for more reliable solution for switching dkg statuses during quorum generation, which is based on node logs rather than increasing timeout and be hopeful that nodes are ready for next block and dkg round.

- feature_llmq_signing.py: bump_mocktime(2) was shorter than the daemon's 5-second signing session cleanup cadence (CLEANUP_INTERVAL in signing_shares.cpp), so recovery responsibility never rotated to the next member. Use bump_mocktime(10) to guarantee at least one full cleanup cycle, and extend wait_for_sigs timeout from 2s to 15s for propagation margin. - feature_llmq_simplepose.py: replace bare `assert check_banned()` with `wait_until(check_banned, timeout=10)`. After mine_quorum() returns, PoSe ban state may not be visible via RPC immediately under CPU contention.

thepastaclaw · 2026-03-17T08:07:05Z

Thanks for the thorough review @knst. You're right on all three points.

I've force-pushed a stripped-down version that drops the entire second commit (DKG pipeline resilience / timeout inflation / sleep changes). What remains is only the two targeted fixes:

1. feature_llmq_signing.py — bump_mocktime(2) → bump_mocktime(10)

The daemon's signing session cleanup runs every 5 seconds (CLEANUP_INTERVAL in signing_shares.cpp:1194). The old bump_mocktime(2) advanced mocktime by less than one cleanup cycle, so the recovery responsibility never actually rotated to the next member. bump_mocktime(10) guarantees at least one full cycle completes. The wait_for_sigs timeout goes from 2s→15s purely for propagation margin (this is a ceiling, not a polling interval).

2. feature_llmq_simplepose.py — bare assert → wait_until(..., timeout=10)

After mine_quorum() returns, the PoSe ban state update (which happens during block validation) may not be visible via RPC immediately under CPU contention. This is a classic polling fix — the assertion logic is identical, it just retries for up to 10 seconds.

No framework changes, no sleep modifications, no DKG timeout inflation.

Re: your point 3 (log-based DKG status detection) — that's a great idea for a broader follow-up but out of scope for these two specific bugs, which are purely about the signing recovery test and the PoSe assertion. Neither involves DKG phase transitions.

knst · 2026-03-17T08:24:54Z

@thepastaclaw I run develop and a0283cffc1640a1294145e0c58e8697a0f219227 2 times each with this command:

test/functional/test_runner.py -j22  feature_llmq_signing.py feature_llmq_signing.py feature_llmq_signing.py feature_llmq_signing.py feature_llmq_signing.py feature_llmq_signing.py feature_llmq_signing.py feature_llmq_signing.py feature_llmq_signing.py feature_llmq_signing.py feature_llmq_signing.py```

Failure rate for both version close to each other. Can you do more testing to be sure that changes from PR is useful?

The wait_for_sporks_same() calls in activate_by_name() use the default 30s timeout, which becomes 120s under UBSAN's 4x timeout_factor. After mining many blocks to reach activation height with 7 nodes under UBSAN, spork propagation can exceed this limit. Increase the base timeout to 60s (240s effective under UBSAN) to prevent consistent timeouts in tests like feature_dip3_v19.py.

thepastaclaw force-pushed the fix-flaky-llmq-tests branch from 91c12f9 to 154d8e7 Compare March 17, 2026 04:03

thepastaclaw force-pushed the fix-flaky-llmq-tests branch from 154d8e7 to a0283cf Compare March 17, 2026 08:06

thepastaclaw added 3 commits March 17, 2026 07:07

ci: re-trigger after unrelated interface_zmq_dash.py flake

aa5b16b

ci: re-trigger after unrelated feature_asset_locks.py flake

6d48db0

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

test: fix flaky LLMQ signing recovery and PoSe ban assertions#7233

test: fix flaky LLMQ signing recovery and PoSe ban assertions#7233
thepastaclaw wants to merge 4 commits intodashpay:developfrom
thepastaclaw:fix-flaky-llmq-tests

thepastaclaw commented Mar 17, 2026

Uh oh!

thepastaclaw commented Mar 17, 2026

Uh oh!

coderabbitai bot commented Mar 17, 2026

Uh oh!

github-actions bot commented Mar 17, 2026 •

edited

Loading

Uh oh!

coderabbitai bot commented Mar 17, 2026

Uh oh!

knst commented Mar 17, 2026

Uh oh!

thepastaclaw commented Mar 17, 2026

Uh oh!

knst commented Mar 17, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

thepastaclaw commented Mar 17, 2026

Summary

Fix 1: feature_llmq_signing.py — sig recovery timeout (line 200-201)

Fix 2: feature_llmq_simplepose.py — bare PoSe ban assertion (line 206)

Validation

Uh oh!

thepastaclaw commented Mar 17, 2026

Uh oh!

coderabbitai bot commented Mar 17, 2026

Uh oh!

github-actions bot commented Mar 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

✅ No Merge Conflicts Detected

Uh oh!

coderabbitai bot commented Mar 17, 2026

Walkthrough

Estimated code review effort

Uh oh!

knst commented Mar 17, 2026

Uh oh!

thepastaclaw commented Mar 17, 2026

Uh oh!

knst commented Mar 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Fix 1: `feature_llmq_signing.py` — sig recovery timeout (line 200-201)

Fix 2: `feature_llmq_simplepose.py` — bare PoSe ban assertion (line 206)

github-actions bot commented Mar 17, 2026 •

edited

Loading

knst commented Mar 17, 2026 •

edited

Loading