Fix Windows service premature exit due to transient lock file false negatives #623

Merged
vharseko merged 4 commits into copilot/fix-dostopapplication-infinite-loop from copilot/fix-opendj-windows-service-exit
Apr 2, 2026

Copilot AI commented Apr 2, 2026

The OpenDJ Windows service exits prematurely when isServerRunning() returns a false negative (for example, during JVM GC pressure or heavy I/O after a large ldapsearch). The service process is then already gone when net stop is issued, so net stop fails with exit code 2 ("service not running").

Changes in service.c

  • Retry logic in serviceMain monitoring loop: A single isServerRunning() == FALSE no longer immediately triggers SERVICE_STOPPED. The code now retries 3× with 2s between each attempt before concluding the server has actually stopped. If any retry finds the server running, the monitoring loop continues normally.

    // Before: one false-negative → immediate SERVICE_STOPPED + break
    else
    {
      DWORD state;
      BOOL success = getServiceStatus(serviceName, &state);
      if (!(success && ((state == SERVICE_STOPPED) || (state == SERVICE_STOP_PENDING))))
      {
        _serviceCurStatus = SERVICE_STOPPED;
        updateServiceStatus(..., ERROR_SERVICE_SPECIFIC_ERROR, ...);
        reportLogEvent(EVENTLOG_ERROR_TYPE, WIN_EVENT_ID_SERVER_STOPPED_OUTSIDE_SCM, ...);
      }
      break;
    }
    
    // After: 3 retries × 2s before concluding server stopped
    else
    {
      int retryCount = 3;
      BOOL confirmedStopped = TRUE;
      while (retryCount > 0)
      {
        retryCount--;
        Sleep(2000);
        code = isServerRunning(&running, TRUE);
        if (code == SERVICE_RETURN_OK && running) { confirmedStopped = FALSE; break; }
      }
      if (confirmedStopped)
      {
        // ... report SERVICE_STOPPED and break
      }
      // else: continue monitoring
    }
  • Increased stop timeout in doStopApplication: nTries raised from 10 to 30, extending the graceful shutdown wait from ~23s to ~63s for heavily loaded servers.
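
The two quoted waits (~23s for 10 tries, ~63s for 30 tries) are consistent with a simple linear model, wait = overhead + nTries × interval. That model is an assumption, not something the PR states, and every name in the sketch below is hypothetical. Under that assumption, the figures imply a 2s interval per try and a 3s fixed overhead:

```c
#include <assert.h>

/* Hypothetical linear model of the shutdown wait, in seconds:
 *   wait = overhead + tries * interval
 * The PR only gives the end points (10 tries ~ 23s, 30 tries ~ 63s);
 * the model itself is an assumption made for illustration. */
typedef struct {
    int overhead_s;  /* fixed cost independent of the number of tries */
    int interval_s;  /* pause attributed to each try */
} wait_model;

/* Fit the model to two (tries, wait) observations. */
static wait_model fit_wait_model(int tries_a, int wait_a,
                                 int tries_b, int wait_b)
{
    wait_model m;
    assert(tries_a != tries_b);  /* two distinct points are required */
    m.interval_s = (wait_b - wait_a) / (tries_b - tries_a);
    m.overhead_s = wait_a - tries_a * m.interval_s;
    return m;
}

/* Predicted total wait for a given number of tries. */
static int total_wait_s(wait_model m, int tries)
{
    return m.overhead_s + tries * m.interval_s;
}
```

Calling fit_wait_model(10, 23, 30, 63) and then total_wait_s with other try counts gives a rough way to estimate how a different nTries value would change the wait, as long as the linear assumption holds.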

Original prompt

Problem

The Windows CI job is failing because the OpenDJ Windows service exits prematurely. The root cause is in opendj-server-legacy/src/build-tools/windows/service.c.

See failing job: https://github.com/OpenIdentityPlatform/OpenDJ/actions/runs/23903213518/job/69705010918

The error from the CI logs:

The OpenDJ Server service was started successfully.
10000                              ← ldapsearch verified 10000 entries
System error 109 has occurred.     ← pipe has been ended
The pipe has been ended.
...
net stop 'OpenDJ Server' failed with exit code 2

The service starts fine, ldapsearch returns the expected 10000 results, but then the service process dies on its own before net stop can be issued. Exit code 2 from net stop means the service doesn't exist or isn't running.

Root Cause 1: Service exits prematurely (false-negative lock check)

In service.c, the serviceMain function (around lines 1174–1271) has a monitoring loop that periodically calls isServerRunning() to check if the Java server process still holds the lock file at \locks\server.lock.

The problem is in the else branch around lines 1242–1268: when isServerRunning returns running = FALSE even once, the code immediately concludes the server has stopped, reports SERVICE_STOPPED with ERROR_SERVICE_SPECIFIC_ERROR, and exits the loop. This causes the Windows service process (opendj_service.exe) to exit.

However, the lock file check can produce transient false negatives — for example, during heavy I/O after ldapsearch finishes processing 10,000 entries, or during JVM GC pressure. A single false-negative check should NOT cause the service to terminate.

Fix

Add retry logic in the else branch (around lines 1242–1268) so that when isServerRunning returns running = FALSE, the code retries a few times (e.g., 3 retries with a 2-second sleep between each) before concluding the server has actually stopped. Only if the server is confirmed stopped after all retries should the code report SERVICE_STOPPED and break out of the loop. If any retry finds the server running, continue the monitoring loop as normal.

Here is the current code that needs to be changed (around lines 1242-1268):

        else
        {
          // Check current Status
          DWORD state;
          BOOL success = getServiceStatus(serviceName, &state);
          if (!(success &&
               ((state == SERVICE_STOPPED) ||
                (state == SERVICE_STOP_PENDING))))
          {
            WORD argCount = 1;
            const char *argc[] = {_instanceDir};
            _serviceCurStatus = SERVICE_STOPPED;
            debug("checking in serviceMain serviceHandler: service stopped with error.");

            updateServiceStatus (
              _serviceCurStatus,
              ERROR_SERVICE_SPECIFIC_ERROR,
              -1,
              CHECKPOINT_NO_ONGOING_OPERATION,
              TIMEOUT_NONE,
              _serviceStatusHandle);
            reportLogEvent(
              EVENTLOG_ERROR_TYPE,
              WIN_EVENT_ID_SERVER_STOPPED_OUTSIDE_SCM,
              argCount, argc);
          }
          break;
        }

Replace it with something like:

        else
        {
          // Server appears not running - retry a few times before concluding
          // it has actually stopped (the lock file check can be transient)
          int retryCount = 3;
          BOOL confirmedStopped = TRUE;
          while (retryCount > 0)
          {
            Sleep(2000);
            retryCount--;
            code = isServerRunning(&running, TRUE);
            if (code == SERVICE_RETURN_OK && running)
            {
              confirmedStopped = FALSE;
              break;
            }
          }
          
          if (confirmedStopped)
          {
            // Check current Status
            DWORD state;
            BOOL success = getServiceStatus(serviceName, &state);
            if (!(success &&
                 ((state == SERVICE_STOPPED) ||
                  (state == SERVICE_STOP_PENDING))))
            {
              WORD argCount = 1;
              const char *argc[] = {_instanceDir};
              _serviceCurStatus = SERVICE_STOPPED;
              debug("checking in serviceMain serviceHandler: service stopped with error.");

              updateServiceStatus (
                _serviceCurStatus,
                ERROR_SERVICE_SPECIFIC_ERROR,
                -1,
                CHECKPOINT_NO_ONGOING_OPERATION,
                TIMEOUT_NONE,
                _serviceStatusHandle);
              reportLogEvent(
                EVENTLOG_ERROR_TYPE,
                WIN_EVENT_ID_SERVER_STOPPED_OUTSIDE_SCM,
                argCount, argc);
            }
            break;
          }
          // else: server is actually still running, continue monitoring
        }
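
The retry decision can also be exercised in isolation, without the Windows service APIs. The following is a minimal portable sketch, not the code in service.c: probe_fn and delay_fn are hypothetical stand-ins for isServerRunning and Sleep, and the scripted probe exists only to replay a fixed sequence of results.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical probe: returns true when the check itself succeeded and
 * stores the observed state in *running (stands in for isServerRunning). */
typedef bool (*probe_fn)(bool *running, void *ctx);
/* Hypothetical delay hook (stands in for Sleep on Windows). */
typedef void (*delay_fn)(unsigned ms, void *ctx);

/* Re-probe up to `retries` times, pausing `delay_ms` before each attempt.
 * Returns true only if no attempt saw the server running. */
static bool confirm_stopped(probe_fn probe, delay_fn delay, void *ctx,
                            int retries, unsigned delay_ms)
{
    assert(probe != NULL && retries >= 0);
    while (retries-- > 0) {
        bool running = false;
        if (delay != NULL)
            delay(delay_ms, ctx);
        if (probe(&running, ctx) && running)
            return false;  /* earlier result was transient: keep monitoring */
    }
    return true;           /* every retry agreed: treat as stopped */
}

/* Scripted probe for demonstration: replays a fixed list of results. */
typedef struct {
    const bool *results;   /* result returned by each successive probe */
    int count;             /* number of entries in results */
    int next;              /* index of the next result to return */
    unsigned slept_ms;     /* total delay requested so far */
} script;

static bool scripted_probe(bool *running, void *ctx)
{
    script *s = (script *)ctx;
    *running = (s->next < s->count) ? s->results[s->next] : false;
    s->next++;
    return true;           /* the check itself always succeeds here */
}

static void counting_delay(unsigned ms, void *ctx)
{
    ((script *)ctx)->slept_ms += ms;  /* record instead of sleeping */
}
```

With 3 retries and a 2000 ms delay, a server that really has stopped is reported after at most 6 seconds of extra waiting, while a probe that recovers on the second retry returns to monitoring after 4 seconds.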

Root Cause 2: Stop timeout too short

In the doStopApplication function (around lines 702–761), after issuing stop-ds.bat, the code only retri...

This pull request was created from Copilot chat.

Copilot AI and others added 3 commits April 2, 2026 16:39
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: vharseko <6818498+vharseko@users.noreply.github.com>
Removed pull_request branch specification from build workflow.
- Add retry logic (3 retries × 2s) in serviceMain before declaring server
  stopped to avoid false negatives from transient lock file check failures
- Increase doStopApplication nTries from 10 to 30 for heavier loaded servers

Agent-Logs-Url: https://github.com/OpenIdentityPlatform/OpenDJ/sessions/b618397b-6f5a-4c9b-8bec-3bb14df3e3e3

Co-authored-by: vharseko <6818498+vharseko@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Fix OpenDJ Windows service premature exit issue" to "Fix Windows service premature exit due to transient lock file false negatives" Apr 2, 2026
Copilot AI requested a review from vharseko April 2, 2026 17:07
vharseko changed the base branch from master to copilot/fix-dostopapplication-infinite-loop April 2, 2026 17:08
vharseko marked this pull request as ready for review April 2, 2026 19:46
vharseko merged commit 85d40f0 into copilot/fix-dostopapplication-infinite-loop Apr 2, 2026
15 checks passed
vharseko deleted the copilot/fix-opendj-windows-service-exit branch April 2, 2026 19:46