
Fix intermittent GenerationIdTest.testMultiRS race condition on RS-to-RS topology #622

Merged
vharseko merged 4 commits into master from copilot/add-get-num-rss-method on Apr 3, 2026


Copilot AI commented Apr 2, 2026

testMultiRS fails intermittently with replServer2 reporting generation ID -1 instead of 48. waitForDomainsOnAllReplicationServers() only verified that the domain objects existed, not that the RS-to-RS connections were actually established, so generation ID propagation had not completed when the assertions ran.

Changes

  • ReplicationServerDomain — add getNumRSs() convenience method returning connectedRSs.size()

  • waitForDomainsOnAllReplicationServers() — strengthen the wait condition to assert each RS has ≥ 2 connected peer RSs (via getNumRSs()), guaranteeing the full 3-RS mesh is up before proceeding; increase timeout 30 s → 60 s

  • waitForStableGenerationId() — increase timeout 60 s → 120 s for slow CI runners

// Now waits for actual RS-to-RS connectivity, not just domain object existence
assertTrue(d1.getNumRSs() >= 2,
    "replServer1 not yet connected to other RSs, got " + d1.getNumRSs());
assertTrue(d2.getNumRSs() >= 2,
    "replServer2 not yet connected to other RSs, got " + d2.getNumRSs());
assertTrue(d3.getNumRSs() >= 2,
    "replServer3 not yet connected to other RSs, got " + d3.getNumRSs());
Original prompt

Problem

The GenerationIdTest.testMultiRS test is failing intermittently in CI with:

org.assertj.core.api.SoftAssertionError:
1) [in replServer2] expected:<[48]L> but was:<[-1]L>

The failure trace is:

GenerationIdTest.testMultiRS:927->waitForStableGenerationId:1017->assertGenIdEquals:1033

Root cause

The waitForDomainsOnAllReplicationServers() method at lines 993-1009 of GenerationIdTest.java only checks that the ReplicationServerDomain object exists on each replication server; it does not verify that the RS-to-RS connections are fully established. As a result, when waitForStableGenerationId(EMPTY_DN_GENID) is called on line 927, the generation ID has not yet propagated from RS1 to RS2 because the inter-RS connections are still being set up.
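
To make the race concrete, here is a toy model of it. Nothing in it is OpenDJ code: ToyDomain, onPeerConnected, weakReady and strongReady are hypothetical names, and the assumption that a domain learns its generation ID when a peer connection completes is a simplification of the behavior described above.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Toy model of the race; none of these types are OpenDJ classes. */
class TopologyRaceSketch {
    /** Hypothetical stand-in for a per-server replication domain. */
    static final class ToyDomain {
        private final Map<Integer, String> connectedPeers = new ConcurrentHashMap<>();
        private volatile long generationId = -1L; // -1 models "not yet known"

        /** Models a peer connection completing and carrying the peer's generation ID. */
        void onPeerConnected(int peerId, long peerGenerationId) {
            connectedPeers.put(peerId, "peer-" + peerId);
            if (generationId == -1L) {
                generationId = peerGenerationId;
            }
        }

        int peerCount() { return connectedPeers.size(); }
        long generationId() { return generationId; }
    }

    /** The weak condition: the domain object merely exists. */
    static boolean weakReady(ToyDomain d) {
        return d != null;
    }

    /** The stronger condition: the domain also sees the expected number of peers. */
    static boolean strongReady(ToyDomain d, int expectedPeers) {
        return d != null && d.peerCount() >= expectedPeers;
    }

    public static void main(String[] args) {
        ToyDomain rs2 = new ToyDomain();
        // Domain exists, but no peer has connected yet: prints "true false -1"
        System.out.println(weakReady(rs2) + " " + strongReady(rs2, 2) + " " + rs2.generationId());
        rs2.onPeerConnected(1, 48L);
        rs2.onPeerConnected(3, 48L);
        // Both peers connected: prints "true true 48"
        System.out.println(weakReady(rs2) + " " + strongReady(rs2, 2) + " " + rs2.generationId());
    }
}
```

In this model the weak check passes in a state where the generation ID is still -1, which matches the shape of the observed failure; the stronger check only passes once both peers are present.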

Required changes

1. Add getNumRSs() method to ReplicationServerDomain

File: opendj-server-legacy/src/main/java/org/opends/server/replication/server/ReplicationServerDomain.java

Add a new public method getNumRSs() that returns the number of connected replication servers. The class already has a connectedRSs field (ConcurrentHashMap<Integer, ReplicationServerHandler>) at line 125, and a getConnectedRSs() method at line 2134 that returns the map. Add a convenience method:

/**
 * Returns the number of replication servers currently connected to this domain.
 * @return The number of connected RSs
 */
public int getNumRSs()
{
    return connectedRSs.size();
}

Place this method right after the existing getConnectedRSs() method (after line 2137).
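
As a rough illustration of what such an accessor reports, the sketch below wraps a ConcurrentHashMap in a hypothetical PeerCountSketch class; the class, its connect/disconnect methods and the String values are placeholders, not part of ReplicationServerDomain. Note that ConcurrentHashMap.size() reflects a transient state while other threads are updating the map, which is acceptable for a test condition that is polled repeatedly.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class PeerCountSketch {
    // Hypothetical stand-in for the connectedRSs field; values are placeholder strings.
    private final Map<Integer, String> connectedPeers = new ConcurrentHashMap<>();

    void connect(int serverId) { connectedPeers.put(serverId, "handler-" + serverId); }

    void disconnect(int serverId) { connectedPeers.remove(serverId); }

    /** Convenience accessor with the same shape as getNumRSs(). */
    int peerCount() { return connectedPeers.size(); }

    public static void main(String[] args) {
        PeerCountSketch domain = new PeerCountSketch();
        domain.connect(2);
        domain.connect(3);
        System.out.println(domain.peerCount()); // prints 2
        domain.disconnect(3);
        System.out.println(domain.peerCount()); // prints 1
    }
}
```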

2. Strengthen waitForDomainsOnAllReplicationServers() in the test

File: opendj-server-legacy/src/test/java/org/opends/server/replication/GenerationIdTest.java

Modify the waitForDomainsOnAllReplicationServers() method (lines 993-1009) to:

  1. Increase the timeout from 30 seconds to 60 seconds
  2. In addition to checking that domains exist, also verify that each RS's domain has at least 2 connected peer RSs using the new getNumRSs() method

The updated method should look like:

private void waitForDomainsOnAllReplicationServers() throws Exception
{
    TestTimer timer = new TestTimer.Builder()
      .maxSleep(60, SECONDS)
      .sleepTimes(100, MILLISECONDS)
      .toTimer();
    timer.repeatUntilSuccess(new CallableVoid()
    {
      @Override
      public void call() throws Exception
      {
        ReplicationServerDomain d1 = replServer1.getReplicationServerDomain(baseDN);
        assertNotNull(d1, "domain missing on replServer1");
        ReplicationServerDomain d2 = replServer2.getReplicationServerDomain(baseDN);
        assertNotNull(d2, "domain missing on replServer2");
        ReplicationServerDomain d3 = replServer3.getReplicationServerDomain(baseDN);
        assertNotNull(d3, "domain missing on replServer3");
        // Ensure RS-to-RS topology connections are fully established
        assertTrue(d1.getNumRSs() >= 2,
            "replServer1 not yet connected to other RSs, got " + d1.getNumRSs());
        assertTrue(d2.getNumRSs() >= 2,
            "replServer2 not yet connected to other RSs, got " + d2.getNumRSs());
        assertTrue(d3.getNumRSs() >= 2,
            "replServer3 not yet connected to other RSs, got " + d3.getNumRSs());
      }
    });
}
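
The method above relies on the timer re-running the assertions until they all pass. A minimal, self-contained sketch of that polling pattern follows. It is not the project's TestTimer: RetryUntilSuccessSketch, Check and the method signature are hypothetical, and the retry-on-AssertionError behavior is an assumption about how such a helper would typically work.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical retry helper; not the project's TestTimer. */
class RetryUntilSuccessSketch {
    /** A check that signals "not ready yet" by throwing AssertionError. */
    interface Check {
        void run() throws Exception;
    }

    /**
     * Re-runs the check until it passes or the deadline expires.
     * Returns the number of attempts; rethrows the last AssertionError on timeout.
     */
    static int repeatUntilSuccess(Check check, long maxWaitMillis, long sleepMillis)
            throws Exception {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(maxWaitMillis);
        int attempts = 0;
        while (true) {
            attempts++;
            try {
                check.run();
                return attempts;
            } catch (AssertionError e) {
                if (System.nanoTime() - deadline >= 0) {
                    throw e; // deadline passed: surface the last failure
                }
                Thread.sleep(sleepMillis);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        AtomicInteger connectedPeers = new AtomicInteger(0);
        int attempts = repeatUntilSuccess(() -> {
            // Simulate one more peer connecting on every poll.
            int n = connectedPeers.incrementAndGet();
            if (n < 2) {
                throw new AssertionError("not yet connected to other RSs, got " + n);
            }
        }, 1000, 10);
        System.out.println("succeeded after " + attempts + " attempts"); // 2 attempts
    }
}
```

With this pattern, a longer maximum wait only costs time when the condition never becomes true; a passing run returns as soon as the check succeeds.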

3. Increase timeout in waitForStableGenerationId()

File: opendj-server-legacy/src/test/java/org/opends/server/replication/GenerationIdTest.java

At line 1013, increase the maxSleep from 60 seconds to 120 seconds to account for slow CI environments:

private void waitForStableGenerationId(final long expectedGenId) throws Exception
{
    TestTimer timer = new TestTimer.Builder()
      .maxSleep(120, SECONDS)
      .sleepTimes(100, MILLISECONDS)
      .toTimer();
    ...
}

Files to modify

  • opendj-server-legacy/src/main/java/org/opends/server/replication/server/ReplicationServerDomain.java — add getNumRSs() method after line 2137
  • opendj-server-legacy/src/test/java/org/opends/server/replication/GenerationIdTest.java — update waitForDomainsOnAllReplicationServers() (lines 993-1009) and waitForStableGenerationId() (line 1013)

The following is the prior conversation context from the user's chat exploration (may be truncated):

User: Please find a solution for this failing job. Use the logs, job definition, and any referenced files where the failure occurred. Keep your response focused on the solution and include code suggestions when appropriate.

Assistant:

I'll start by fetching the job logs and the workflow definition to understand the failure.

Assistant: ## Failing Test Analysis

The build fails due to a single test failure in GenerationIdTest.testMultiRS:

org.assertj.core.api.SoftAssertionError:
1) [in replServer2] expected:<[48]L> but was:<[-1]L>

Root Cause

The failure occurs at [line 927 → 1017 → 1033](https://github.com/OpenIdentityPlatform/OpenDJ/blob/930998062a181860ef20b4d3b081f696ae97ecb9/opendj-s...

This pull request was created from Copilot chat.

Base automatically changed from copilot/update-github-actions-versions to master April 2, 2026 13:39
…gthen domain wait, increase timeouts

Agent-Logs-Url: https://github.com/OpenIdentityPlatform/OpenDJ/sessions/fcf7e435-1019-477d-9285-b3b182966527

Co-authored-by: vharseko <6818498+vharseko@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Add getNumRSs method to ReplicationServerDomain" to "Fix intermittent GenerationIdTest.testMultiRS race condition on RS-to-RS topology" Apr 2, 2026
Copilot AI requested a review from vharseko April 2, 2026 13:46
vharseko marked this pull request as ready for review April 3, 2026 06:08
vharseko merged commit 124603f into master Apr 3, 2026
17 checks passed
vharseko deleted the copilot/add-get-num-rss-method branch April 3, 2026 10:10