Skip to content

test: fix flaky LLMQ signing recovery and PoSe ban assertions#7233

Draft
thepastaclaw wants to merge 4 commits intodashpay:developfrom
thepastaclaw:fix-flaky-llmq-tests
Draft

test: fix flaky LLMQ signing recovery and PoSe ban assertions#7233
thepastaclaw wants to merge 4 commits intodashpay:developfrom
thepastaclaw:fix-flaky-llmq-tests

Conversation

@thepastaclaw
Copy link

Summary

Fix two confirmed flaky test bugs in LLMQ functional tests.

Fix 1: feature_llmq_signing.py — sig recovery timeout (line 200-201)

The --spork21 reconnect test bumped mocktime by 2 seconds and waited 2 seconds for signature recovery. This was provably insufficient: the daemon's signing session cleanup cadence is 5 seconds (src/llmq/signing_shares.cpp), so the test's wait window was shorter than one cleanup cycle.

Change: bump_mocktime(2)bump_mocktime(5), wait_for_sigs(..., 2)wait_for_sigs(..., 15).

Fix 2: feature_llmq_simplepose.py — bare PoSe ban assertion (line 206)

After 6 quorum rounds of penalty accumulation, the test used a bare assert check_banned() without polling. Under CPU contention, the PoSe ban state update (which happens during block validation) may not be visible via RPC immediately after mine_quorum() returns.

Change: assert check_banned(...)self.wait_until(lambda: check_banned(...), timeout=10).

Validation

Reproduced locally on a 16-core Mac Studio by running 12 LLMQ tests in parallel with --timeout-factor=1:

Before (baseline):

  • feature_llmq_simplepose.py: 80% failure rate (broad), 58% (hammer 4×)
  • feature_llmq_signing.py --spork21: 0% (broad), 50% (hammer 4×)

After (with these fixes):

  • feature_llmq_simplepose.py: 0/24 failures in hammer runs ✅
  • feature_llmq_signing.py --spork21: improved from 50% → 33% in hammer runs

Broad runs under extreme contention (4 simultaneous reproduction suites) still show failures in other LLMQ tests (is_cl_conflicts, is_retroactive) due to separate DKG timeout issues not addressed in this PR.

@thepastaclaw
Copy link
Author

@coderabbitai review

@coderabbitai
Copy link

coderabbitai bot commented Mar 17, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@github-actions
Copy link

github-actions bot commented Mar 17, 2026

✅ No Merge Conflicts Detected

This PR currently has no conflicts with other open PRs.

@coderabbitai
Copy link

coderabbitai bot commented Mar 17, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 9cf8d159-5339-49dc-8e79-1f3673485ced

📥 Commits

Reviewing files that changed from the base of the PR and between 39851c3 and 872948c.

📒 Files selected for processing (2)
  • test/functional/feature_llmq_signing.py
  • test/functional/feature_llmq_simplepose.py

Walkthrough

Two functional test files were modified to adjust synchronization and timing behavior. In test/functional/feature_llmq_signing.py, mocktime advancement was increased from 2 seconds to 5 seconds and the wait timeout for recovered signatures was extended from 2 seconds to 15 seconds. In test/functional/feature_llmq_simplepose.py, a hardcoded assertion checking if a Masternode is banned was replaced with a polling mechanism that waits up to a timeout for the banned condition to become true.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately summarizes the main changes: fixing flaky test assertions in LLMQ signing recovery and PoSe ban mechanisms.
Description check ✅ Passed The description is directly related to the changeset, providing detailed explanations of both fixes with specific file locations, problem analysis, and validation results.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
📝 Coding Plan
  • Generate coding plan for human review comments

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@thepastaclaw thepastaclaw force-pushed the fix-flaky-llmq-tests branch from 91c12f9 to 154d8e7 Compare March 17, 2026 04:03
@knst
Copy link
Collaborator

knst commented Mar 17, 2026

Sleep sleep=0.05 is meant to be sleep 0.05 otherwise all functional tests that uses dkg runs MUCH longer. not (0.5 - 0.04 = 0.45s) slower, but minutes because it sum-up on each stage for each new quorum for each test.

Just increasing timeout doesn't fix core reason for flackines.

  1. show a proof that each particular change decreased flackines in feature_llmq_signing.py (do 100 run and collect statistically proven datas).
  2. check a performance of all functional tests that use dkg doesn't get significantly slower (1% slower for overall run is ok, 30% slower is not ok).
  3. [most important] try to look for more reliable solution for switching dkg statuses during quorum generation, which is based on node logs rather than increasing timeout and be hopeful that nodes are ready for next block and dkg round.

- feature_llmq_signing.py: bump_mocktime(2) was shorter than the
  daemon's 5-second signing session cleanup cadence (CLEANUP_INTERVAL
  in signing_shares.cpp), so recovery responsibility never rotated to
  the next member. Use bump_mocktime(10) to guarantee at least one full
  cleanup cycle, and extend wait_for_sigs timeout from 2s to 15s for
  propagation margin.

- feature_llmq_simplepose.py: replace bare `assert check_banned()`
  with `wait_until(check_banned, timeout=10)`. After mine_quorum()
  returns, PoSe ban state may not be visible via RPC immediately under
  CPU contention.
@thepastaclaw thepastaclaw force-pushed the fix-flaky-llmq-tests branch from 154d8e7 to a0283cf Compare March 17, 2026 08:06
@thepastaclaw
Copy link
Author

Thanks for the thorough review @knst. You're right on all three points.

I've force-pushed a stripped-down version that drops the entire second commit (DKG pipeline resilience / timeout inflation / sleep changes). What remains is only the two targeted fixes:

1. feature_llmq_signing.pybump_mocktime(2)bump_mocktime(10)

The daemon's signing session cleanup runs every 5 seconds (CLEANUP_INTERVAL in signing_shares.cpp:1194). The old bump_mocktime(2) advanced mocktime by less than one cleanup cycle, so the recovery responsibility never actually rotated to the next member. bump_mocktime(10) guarantees at least one full cycle completes. The wait_for_sigs timeout goes from 2s→15s purely for propagation margin (this is a ceiling, not a polling interval).

2. feature_llmq_simplepose.py — bare assertwait_until(..., timeout=10)

After mine_quorum() returns, the PoSe ban state update (which happens during block validation) may not be visible via RPC immediately under CPU contention. This is a classic polling fix — the assertion logic is identical, it just retries for up to 10 seconds.

No framework changes, no sleep modifications, no DKG timeout inflation.

Re: your point 3 (log-based DKG status detection) — that's a great idea for a broader follow-up but out of scope for these two specific bugs, which are purely about the signing recovery test and the PoSe assertion. Neither involves DKG phase transitions.

@knst
Copy link
Collaborator

knst commented Mar 17, 2026

@thepastaclaw I run develop and a0283cffc1640a1294145e0c58e8697a0f219227 2 times each with this command:

test/functional/test_runner.py -j22  feature_llmq_signing.py feature_llmq_signing.py feature_llmq_signing.py feature_llmq_signing.py feature_llmq_signing.py feature_llmq_signing.py feature_llmq_signing.py feature_llmq_signing.py feature_llmq_signing.py feature_llmq_signing.py feature_llmq_signing.py```

Failure rate for both version close to each other. Can you do more testing to be sure that changes from PR is useful?

The wait_for_sporks_same() calls in activate_by_name() use the default
30s timeout, which becomes 120s under UBSAN's 4x timeout_factor. After
mining many blocks to reach activation height with 7 nodes under UBSAN,
spork propagation can exceed this limit. Increase the base timeout to
60s (240s effective under UBSAN) to prevent consistent timeouts in tests
like feature_dip3_v19.py.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants