Skip to content

fix: semaphore release submitted to a dead executor#2374

Merged
triceo merged 2 commits into
TimefoldAI:mainfrom
triceo:flake
Jun 16, 2026
Merged

fix: semaphore release submitted to a dead executor#2374
triceo merged 2 commits into
TimefoldAI:mainfrom
triceo:flake

Conversation

@triceo

@triceo triceo commented Jun 15, 2026

Copy link
Copy Markdown
Collaborator

I ran the original test 100_000 times, it failed zero times.
Claude's summary follows:


ConsumerSupport uses 3 semaphores to gate consumeFinalBestSolution(). Each is acquired before a task runs, and released after the task completes via
whenCompleteAsync(release, consumerExecutor).

whenCompleteAsync(action, executor) does not run action inline. It submits action as a new task to executor when the prior future completes. Two-step:

  1. Prior task finishes → future completes
  2. Java calls executor.execute(releaseTask)

Between steps 1 and 2, executor can be shut down. shutdownNow() rejects new submissions and cancels queued ones. RejectedExecutionException → release task
never runs → semaphore stays at 0.

Who shuts the executor down?

ConsumerSupport.close() — called from DefaultSolverJob.close() when a solver job is cleaned up. Its finally block calls shutdownConsumerExecutor()
unconditionally, even if acquireAll() threw InterruptedException. If the close-thread is interrupted mid-acquire(), the executor dies with semaphores still
at 0 and their release tasks still queued.

Race that produces the hang:

  1. Consumer thread finishes start-job task → future F completes
  2. Concurrently: close-thread interrupted in acquireAll() → finally → shutdownNow()
  3. whenCompleteAsync tries executor.execute(startSolverJobConsumption.release()) → rejected
  4. Solver thread later calls consumeFinalBestSolution() → acquireAll() → startSolverJobConsumption.acquire() → blocks forever

Why whenComplete fixes it:

whenComplete(action) (no executor) runs action synchronously on the thread that completed the future — here, the consumer executor's own thread, immediately
after the task finishes, before returning to the executor's work loop. No second submission, no queue, nothing to reject. The release is atomic with the
task's completion from the executor's perspective.

Executor shutdown can only happen after acquireAll() succeeds in consumeFinalBestSolution, which is after all releases have run. The release is now
unreachable by any shutdown path.

ConsumerSupport uses 3 semaphores to gate consumeFinalBestSolution(). Each is acquired before a task runs, and released after the task completes via
  whenCompleteAsync(release, consumerExecutor).

  whenCompleteAsync(action, executor) does not run action inline. It submits action as a new task to executor when the prior future completes. Two-step:
  1. Prior task finishes → future completes
  2. Java calls executor.execute(releaseTask)

  Between steps 1 and 2, executor can be shut down. shutdownNow() rejects new submissions and cancels queued ones. RejectedExecutionException → release task
  never runs → semaphore stays at 0.

  Who shuts the executor down?

  ConsumerSupport.close() — called from DefaultSolverJob.close() when a solver job is cleaned up. Its finally block calls shutdownConsumerExecutor()
  unconditionally, even if acquireAll() threw InterruptedException. If the close-thread is interrupted mid-acquire(), the executor dies with semaphores still
  at 0 and their release tasks still queued.

  Race that produces the hang:

  1. Consumer thread finishes start-job task → future F completes
  2. Concurrently: close-thread interrupted in acquireAll() → finally → shutdownNow()
  3. whenCompleteAsync tries executor.execute(startSolverJobConsumption.release()) → rejected
  4. Solver thread later calls consumeFinalBestSolution() → acquireAll() → startSolverJobConsumption.acquire() → blocks forever

  Why whenComplete fixes it:

  whenComplete(action) (no executor) runs action synchronously on the thread that completed the future — here, the consumer executor's own thread, immediately
  after the task finishes, before returning to the executor's work loop. No second submission, no queue, nothing to reject. The release is atomic with the
  task's completion from the executor's perspective.

  Executor shutdown can only happen after acquireAll() succeeds in consumeFinalBestSolution, which is after all releases have run. The release is now
  unreachable by any shutdown path.
@triceo triceo linked an issue Jun 15, 2026 that may be closed by this pull request
@triceo triceo self-assigned this Jun 15, 2026
@triceo triceo added this to the v2.3.0 milestone Jun 15, 2026
@triceo triceo requested a review from zepfred June 15, 2026 06:04
@triceo triceo marked this pull request as ready for review June 15, 2026 08:36
Copilot AI review requested due to automatic review settings June 15, 2026 08:36

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR prevents solver event-consumption semaphores from getting stuck when the consumer executor is shut down by ensuring semaphore releases run inline on the completing thread (instead of being resubmitted to an executor that may already be terminated).

Changes:

  • Replace whenCompleteAsync(..., consumerExecutor) with whenComplete(...) for intermediate best-solution consumption to avoid rejected-release tasks.
  • Apply the same whenComplete(...) change for first-initialized solution consumption semaphore release.
  • Apply the same whenComplete(...) change for start-solver-job consumption semaphore release.

Comment thread core/src/main/java/ai/timefold/solver/core/impl/solver/ConsumerSupport.java Outdated
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings June 15, 2026 08:42

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.

@sonarqubecloud

Copy link
Copy Markdown

@zepfred

zepfred commented Jun 15, 2026

Copy link
Copy Markdown
Contributor

Were you able to reproduce the issue before the fix?

@triceo

triceo commented Jun 15, 2026

Copy link
Copy Markdown
Collaborator Author

Were you able to reproduce the issue before the fix?

The issue was failing our CI builds couple times a day.

@triceo triceo merged commit 2e9d73a into TimefoldAI:main Jun 16, 2026
19 checks passed
@triceo triceo deleted the flake branch June 16, 2026 04:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

SolverManagerTest.submitMoreProblemsThanCpus_allGetSolved() is flaky

3 participants