fix: retry loop for Windows Service start race condition (issue #259) #609

Open
Copilot wants to merge 15 commits into master from copilot/fix-windows-service-start-issue

Conversation

Contributor

Copilot AI commented Mar 31, 2026

  • Add Portions Copyright 2026 3A Systems, LLC. to the copyright header of service.c
  • Merge branch copilot/add-windows-native-executables
  • Replace opendj-server-legacy/lib Windows executables with freshly built binaries from CI artifact windows-exe-11 (run #23839486071):
    • launcher_administrator.exe (80 KB → 155 KB)
    • opendj_service.exe (144 KB → 166 KB)
    • winlauncher.exe (82 KB → 154 KB)
Original prompt

Problem

Issue: #259

OpenDJ 4.5 fails to start as a Windows Service on Windows 11 21H2. The service stops a few seconds after starting with the error:

ERROR: serviceMain: doStartApplication() failed

Root Cause

In opendj-server-legacy/src/build-tools/windows/service.c, the doStartApplication() function has a race condition. When start-ds.bat is invoked with --windowsNetStart, the following sequence occurs:

  1. opendj_service.exe launches start-ds.bat --windowsNetStart
  2. start-ds.bat runs the --checkStartability check, which returns code 102 (START_AS_DETACH_CALLED_FROM_WINDOWS_SERVICE)
  3. winlauncher.exe spawns the Java DirectoryServer process in the background (background=1) and immediately returns the PID
  4. start-ds.bat waits for server.starting file to be deleted, then does a --checkStartability check and exits with code 0
  5. opendj_service.exe sees waitForProcess succeeded with exitCode=0, so both createOk and waitOk are TRUE
  6. The bug: In the if (createOk && waitOk) branch (around line 566), the code checks isServerRunning() only once — no retry loop
  7. The Java process hasn't yet acquired the exclusive lock on server.lock, so isServerRunning() returns FALSE
  8. The service is marked as failed

In contrast, the else if (createOk) branch (when the wait times out) does have a retry loop with up to 100 tries and 5-second sleep intervals. The createOk && waitOk branch incorrectly assumes that if start-ds.bat exited successfully, the Java server must already hold the lock — but on Windows 11 (and potentially other systems), the JVM startup is slow enough that this assumption fails.
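
The difference between the two branches can be illustrated with a small, portable simulation. This is only a sketch: FakeServer, fake_is_running, check_once and check_with_retries are hypothetical names invented for the illustration, not functions from service.c, and the "lock" is modeled as a counter of polls rather than a real file lock.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the Java server: it "acquires" server.lock
 * only after a given number of polls have been made. */
typedef struct {
    int polls_until_locked;
} FakeServer;

/* Hypothetical stand-in for isServerRunning(): reports whether the
 * simulated lock is held at the moment of the poll. */
static bool fake_is_running(FakeServer *s) {
    if (s->polls_until_locked > 0) {
        s->polls_until_locked--;
        return false;
    }
    return true;
}

/* Behaviour of the old createOk && waitOk branch: one check, no retry. */
static bool check_once(FakeServer *s) {
    return fake_is_running(s);
}

/* Behaviour of the retry approach: poll until the lock is seen or the
 * tries run out. The real code sleeps 5 seconds between attempts; the
 * sleep is omitted here so the sketch runs instantly. */
static bool check_with_retries(FakeServer *s, int n_tries) {
    bool running = false;
    while (n_tries > 0 && !running) {
        n_tries--;
        running = fake_is_running(s);
    }
    return running;
}
```

With a simulated server that needs two polls before it holds the lock, check_once reports "not running" while check_with_retries with 100 tries reports "running", which mirrors the failure described in steps 6 to 8.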

Required Fix

Replace the single isServerRunning() check in the if (createOk && waitOk) branch with a retry loop similar to the one in the else if (createOk) branch. This ensures the service controller waits for the Java process to actually acquire the server lock before declaring failure.

The fix should be in file opendj-server-legacy/src/build-tools/windows/service.c in the doStartApplication() function.

Find this block (approximately lines 563-580):

  if (createOk && waitOk)
    {
    BOOL running;
      // Just check once if the server is running or not: since the wait
      // wait was successful, if the server is getting the lock, it already
      // got it.
    isServerRunning(&running, TRUE);
    if (running)
      {
        returnValue = SERVICE_RETURN_OK;
        debug("doStartApplication: server running.");
      }
      else
      {
        returnValue = SERVICE_RETURN_ERROR;
        debug("doStartApplication: server not running.");
      }
  }

And replace it with:

  if (createOk && waitOk)
    {
      // The batch file process completed successfully, but the Java server
      // process may not have acquired the lock file yet (especially on
      // Windows 11 where JVM startup can be slower). Retry with a loop
      // similar to the else-if branch below.
      // See: https://github.com/OpenIdentityPlatform/OpenDJ/issues/259
      const DWORD DEFAULT_TRIES = 100;
      int nTries = DEFAULT_TRIES;
      char * nTriesEnv = getenv("OPENDJ_WINDOWS_SERVICE_START_NTRIES");
      BOOL running = FALSE;
      if (nTriesEnv != NULL)
      {
        debug("OPENDJ_WINDOWS_SERVICE_START_NTRIES env var set to %s", nTriesEnv);
        nTries = (int)strtol(nTriesEnv, (char **)NULL, 10);
        if (nTries <= 0)
        {
          nTries = DEFAULT_TRIES;
        }
      }
      else
      {
        debug("OPENDJ_WINDOWS_SERVICE_START_NTRIES is not set.  Using default %d tries.", nTries);
      }

      while ((nTries > 0) && !running)
      {
        nTries--;
        if (isServerRunning(&running, TRUE) != SERVICE_RETURN_OK)
        {
          break;
        }
        if (!running)
        {
          debug("Sleeping for 5 seconds to allow the process to get the lock.  %d tries remaining.",
              nTries);
          Sleep(5000);
        }
      }
      if (running)
      {
        returnValue = SERVICE_RETURN_OK;
        debug("doStartApplication: server running.");
      }
      else
      {
        returnValue = SERVICE_RETURN_ERROR;
        debug("doStartApplication: server not running.");
      }
  }

This change makes both branches of doStartApplication() consistently use a retry loop, controlled by the same OPENDJ_WINDOWS_SERVICE_START_NTRIES environment variable.
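
Since both branches would now parse the same environment variable, the parsing could be factored into one helper. The following is a minimal sketch under that assumption; parse_tries, tries_from_env and worst_case_sleep_ms are hypothetical names, not part of service.c. Unlike the snippet above, it also falls back to the default when the value contains no digits or exceeds INT_MAX.

```c
#include <limits.h>
#include <stdlib.h>

/* Hypothetical helper: turn a raw environment value into a try count,
 * falling back to default_tries when the value is missing, not a
 * number, not positive, or too large for an int. */
static int parse_tries(const char *value, int default_tries) {
    char *end = NULL;
    long parsed;

    if (value == NULL) {
        return default_tries;
    }
    parsed = strtol(value, &end, 10);
    if (end == value || parsed <= 0 || parsed > INT_MAX) {
        return default_tries;
    }
    return (int)parsed;
}

/* Hypothetical wrapper: read the variable by name, e.g.
 * tries_from_env("OPENDJ_WINDOWS_SERVICE_START_NTRIES", 100). */
static int tries_from_env(const char *name, int default_tries) {
    return parse_tries(getenv(name), default_tries);
}

/* Upper bound on the time spent sleeping in the retry loop: every
 * failed try is followed by one sleep, including the last one. */
static long worst_case_sleep_ms(int n_tries, int sleep_ms) {
    return (long)n_tries * (long)sleep_ms;
}
```

With the defaults from the snippet above (100 tries, 5000 ms per sleep), the loop can sleep for up to 500000 ms, about 8 minutes 20 seconds, before the service is reported as failed.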

The following is the prior conversation context from the user's chat exploration (may be truncated):

User: analyze this

Assistant:

Let me analyze this issue by examining the relevant code in the repository.

Assistant: ## Analysis of Issue #259: OpenDJ 4.5 does not start as a Windows Service on Windows 11

Root Cause

The problem is a race condition between the server's Java process and the native Windows service controller (`opendj_servic...

This pull request was created from Copilot chat.

@vharseko vharseko linked an issue Mar 31, 2026 that may be closed by this pull request
@vharseko vharseko changed the title from "[WIP] Fix OpenDJ service start failure on Windows 11 21H2" to "[WIP] Fix race condition in doStartApplication for Windows service startup" Mar 31, 2026
…fix Windows Service startup race condition (#259)

Agent-Logs-Url: https://github.com/OpenIdentityPlatform/OpenDJ/sessions/bda703f7-83cc-4cf1-816a-65b26d4b8917

Co-authored-by: vharseko <6818498+vharseko@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Fix race condition in doStartApplication for Windows service startup" to "fix: retry loop for Windows Service start race condition (issue #259)" Mar 31, 2026
Copilot AI requested a review from vharseko March 31, 2026 17:13
Copilot AI and others added 6 commits March 31, 2026 18:20
…429e1a9-dbd5-4b36-b891-14c7551bd9c5' into copilot/add-windows-native-executables

# Conflicts:
#	.github/workflows/build.yml

Co-authored-by: vharseko <6818498+vharseko@users.noreply.github.com>
Added git status command to the build process.
Member

vharseko commented Apr 1, 2026

@copilot merge copilot/add-windows-native-executables

Co-authored-by: vharseko <6818498+vharseko@users.noreply.github.com>
Contributor Author

Copilot AI commented Apr 1, 2026

@copilot merge copilot/add-windows-native-executables

Merged copilot/add-windows-native-executables into this branch in commit 838b64b.

@vharseko vharseko marked this pull request as ready for review April 1, 2026 16:48
@vharseko vharseko requested a review from maximthomas April 1, 2026 16:48
Member

vharseko commented Apr 1, 2026

@copilot merge from master

Development

Successfully merging this pull request may close these issues.

OpenDJ 4.5 not Starting as Windows Service on Windows 11 21H2

2 participants