Skip to content

fix(ipc): retry proc acquisition when all in-flight spawns fail#5874

Merged
longcw merged 2 commits into
mainfrom
longc/proc-pool-acquire-retry
Jun 1, 2026
Merged

fix(ipc): retry proc acquisition when all in-flight spawns fail#5874
longcw merged 2 commits into
mainfrom
longc/proc-pool-acquire-retry

Conversation

@longcw

@longcw longcw commented May 28, 2026

Copy link
Copy Markdown
Contributor

Summary

Fixes #5868. When every worker process fails to initialize, launch_job previously hung forever on _warmed_proc_queue.get() because nothing was ever put on the queue, and the 3-attempt retry loop was unreachable.

Split the responsibility: _acquire_proc now owns the wait-for-a-warmed-process loop. It races queue.get() against every in-flight spawn task and only retries once all in-flight spawns settle without producing a proc — so peer spawns still in flight don't burn a retry attempt. After MAX_ACQUIRE_ATTEMPTS such cycles, it raises a RuntimeError. launch_job keeps its own 3-attempt budget for post-acquire launch failures (the original retry semantics, untouched).

Alternative considered

#5871 uses None sentinels on _warmed_proc_queue to unblock waiters when a spawn fails. Two issues with that approach:

  1. It wakes too eagerly — a sentinel is pushed on each individual spawn failure, even though peer spawns may still succeed, burning a retry attempt.
  2. The retry doesn't trigger a fresh spawn — on loop-back the queue isn't empty (it has more None sentinels), so launch_job consumes another sentinel without ever requesting a new process. MAX_ATTEMPTS=3 becomes "fail 3 sentinels and give up".

Previously, when every worker process failed to initialize (e.g. on a
resource-constrained host where cold imports exceed `initialize_timeout`),
`launch_job` hung forever on `_warmed_proc_queue.get()` because nothing was
ever put on the queue. The 3-attempt retry loop was unreachable.

Split the responsibility: `_acquire_proc` now owns the wait-for-a-warmed-
process loop. It races `queue.get()` against every in-flight spawn task and
only retries once all in-flight spawns settle without producing a proc — so
peer spawns still in flight don't burn a retry attempt. After
MAX_ACQUIRE_ATTEMPTS such cycles, it raises. `launch_job` keeps its own
3-attempt budget for post-acquire launch failures.

Fixes #5868.
@chenghao-mou chenghao-mou requested a review from a team May 28, 2026 02:47

@devin-ai-integration devin-ai-integration Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.

View in Devin Review to see 4 additional findings.

Open in Devin Review

@chenghao-mou chenghao-mou left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. One small nit.

Comment thread tests/test_ipc.py Outdated
Call pool.start() so the test exercises launch_job the way it runs in
production, and close the pool in a finally block so the background
_main_task doesn't leak past teardown.
@longcw longcw merged commit d008e67 into main Jun 1, 2026
23 checks passed
@longcw longcw deleted the longc/proc-pool-acquire-retry branch June 1, 2026 06:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Bug: jobs stall indefinitely when process initialization times out (ProcPool._warmed_proc_queue.get() has no timeout)

2 participants