Fix Windows service premature exit due to transient lock file false negatives #623
Merged
vharseko merged 4 commits into copilot/fix-dostopapplication-infinite-loop on Apr 2, 2026
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: vharseko <6818498+vharseko@users.noreply.github.com>
Removed pull_request branch specification from build workflow.
- Add retry logic (3 retries × 2s) in serviceMain before declaring server stopped to avoid false negatives from transient lock file check failures
- Increase doStopApplication nTries from 10 to 30 for heavier loaded servers

Agent-Logs-Url: https://github.com/OpenIdentityPlatform/OpenDJ/sessions/b618397b-6f5a-4c9b-8bec-3bb14df3e3e3
Co-authored-by: vharseko <6818498+vharseko@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Fix OpenDJ Windows service premature exit issue" to "Fix Windows service premature exit due to transient lock file false negatives" on Apr 2, 2026
Merged commit 85d40f0 into copilot/fix-dostopapplication-infinite-loop
15 checks passed
The OpenDJ Windows service exits prematurely when `isServerRunning()` returns a false negative (e.g., during JVM GC pressure or heavy I/O after a large `ldapsearch`), causing `net stop` to fail with exit code 2 ("service not running").

Changes in `service.c`

- Retry logic in `serviceMain` monitoring loop: A single `isServerRunning() == FALSE` no longer immediately triggers `SERVICE_STOPPED`. The code now retries 3× with 2s between each attempt before concluding the server has actually stopped. If any retry finds the server running, the monitoring loop continues normally.
- Increased stop timeout in `doStopApplication`: `nTries` raised from `10` to `30`, extending the graceful shutdown wait from ~23s to ~63s for heavily loaded servers.

Original prompt
Problem
The Windows CI job is failing because the OpenDJ Windows service exits prematurely. The root cause is in `opendj-server-legacy/src/build-tools/windows/service.c`.

See failing job: https://github.com/OpenIdentityPlatform/OpenDJ/actions/runs/23903213518/job/69705010918
The error from the CI logs:
The service starts fine, `ldapsearch` returns the expected 10000 results, but then the service process dies on its own before `net stop` can be issued. Exit code 2 from `net stop` means the service doesn't exist or isn't running.

Root Cause 1: Service exits prematurely (false-negative lock check)
In `service.c`, the `serviceMain` function (around lines 1174–1271) has a monitoring loop that periodically calls `isServerRunning()` to check if the Java server process still holds the lock file at `\locks\server.lock`.

The problem is in the `else` branch around lines 1242–1268: when `isServerRunning` returns `running = FALSE` even once, the code immediately concludes the server has stopped, reports `SERVICE_STOPPED` with `ERROR_SERVICE_SPECIFIC_ERROR`, and exits the loop. This causes the Windows service process (`opendj_service.exe`) to exit.

However, the lock file check can produce transient false negatives, for example during heavy I/O after `ldapsearch` finishes processing 10,000 entries, or during JVM GC pressure. A single false-negative check should NOT cause the service to terminate.

Fix
Add retry logic in the `else` branch (around lines 1242–1268) so that when `isServerRunning` returns `running = FALSE`, the code retries a few times (e.g., 3 retries with a 2-second sleep between each) before concluding the server has actually stopped. Only if the server is confirmed stopped after all retries should the code report `SERVICE_STOPPED` and break out of the loop. If any retry finds the server running, continue the monitoring loop as normal.

Here is the current code that needs to be changed (around lines 1242-1268):
Replace it with something like:
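A minimal, portable sketch of the retry idea described above, not the code merged in this pull request. The names `confirm_stopped`, `probe_fn`, `sleep_fn`, and `flaky_probe` are hypothetical and exist only for illustration; in `service.c` the probe would presumably wrap `isServerRunning` and the sleep hook would wrap the Win32 `Sleep` call.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical probe type: returns true if the server appears to be running. */
typedef bool (*probe_fn)(void *ctx);
/* Hypothetical sleep hook, so the sketch does not depend on <windows.h>. */
typedef void (*sleep_fn)(unsigned int millis);

/*
 * Call this after a first "not running" result.
 * Re-checks up to `retries` times, sleeping `delay_ms` before each re-check.
 * Returns true only if every re-check also reports "not running";
 * returns false as soon as any re-check sees the server running.
 */
static bool confirm_stopped(probe_fn probe, void *ctx,
                            int retries, unsigned int delay_ms,
                            sleep_fn do_sleep)
{
    for (int i = 0; i < retries; i++) {
        if (do_sleep != NULL) {
            do_sleep(delay_ms);
        }
        if (probe(ctx)) {
            return false; /* transient false negative: keep monitoring */
        }
    }
    return true; /* confirmed stopped after all retries */
}

/* Demo probe (assumption): fails `failures_left` times, then reports running. */
struct flaky {
    int failures_left;
    int calls;
};

static bool flaky_probe(void *ctx)
{
    struct flaky *f = (struct flaky *)ctx;
    f->calls++;
    if (f->failures_left > 0) {
        f->failures_left--;
        return false;
    }
    return true;
}
```

With the values suggested in the prompt, the monitoring loop would call something like `confirm_stopped(probe, ctx, 3, 2000, sleep_hook)` and report `SERVICE_STOPPED` only when it returns true.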
Root Cause 2: Stop timeout too short
In the `doStopApplication` function (around lines 702–761), after issuing `stop-ds.bat`, the code only retri...

This pull request was created from Copilot chat.
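For the stop-timeout change under Root Cause 2, the effect of raising the try budget can be illustrated with a bounded polling loop. This is a hedged sketch only: `wait_for_stop`, `running_fn`, `pause_fn`, and `slow_stop_running` are hypothetical names, and the polling interval is a caller-supplied assumption rather than a value taken from `service.c`.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical probe type: returns true while the server is still running. */
typedef bool (*running_fn)(void *ctx);
/* Hypothetical sleep hook, so the sketch stays platform-independent. */
typedef void (*pause_fn)(unsigned int millis);

/*
 * Checks the probe up to `n_tries` times, pausing `interval_ms` after each
 * check that still sees the server running.
 * Returns true if the server was seen stopped within the budget,
 * false if it was still running on every check.
 */
static bool wait_for_stop(running_fn is_running, void *ctx,
                          int n_tries, unsigned int interval_ms,
                          pause_fn do_pause)
{
    for (int i = 0; i < n_tries; i++) {
        if (!is_running(ctx)) {
            return true;
        }
        if (do_pause != NULL) {
            do_pause(interval_ms);
        }
    }
    return false;
}

/* Demo probe (assumption): the server needs this many checks to shut down. */
struct slow_stop {
    int checks_until_stopped;
};

static bool slow_stop_running(void *ctx)
{
    struct slow_stop *s = (struct slow_stop *)ctx;
    if (s->checks_until_stopped > 0) {
        s->checks_until_stopped--;
        return true;
    }
    return false;
}
```

A server that needs 20 checks to shut down is reported as still running with a budget of 10 tries, but is seen stopped with a budget of 30, which is the kind of heavily loaded case the larger `nTries` value targets.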