fix(ipc): retry proc acquisition when all in-flight spawns fail by longcw · Pull Request #5874 · livekit/agents

longcw · 2026-05-28T02:47:41Z

Summary

Fixes #5868. When every worker process fails to initialize, launch_job previously hung forever on _warmed_proc_queue.get() because nothing was ever put on the queue, and the 3-attempt retry loop was unreachable.

Split the responsibility: _acquire_proc now owns the wait-for-a-warmed-process loop. It races queue.get() against every in-flight spawn task and only retries once all in-flight spawns settle without producing a proc — so peer spawns still in flight don't burn a retry attempt. After MAX_ACQUIRE_ATTEMPTS such cycles, it raises a RuntimeError. launch_job keeps its own 3-attempt budget for post-acquire launch failures (the original retry semantics, untouched).
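The acquire loop described above can be sketched roughly as follows. This is an illustrative model, not the code from the PR: `ProcPoolSketch`, `_request_spawn`, `_spawn_and_enqueue`, `_in_flight`, `make_flaky_spawn`, and `demo` are hypothetical names, and the value `3` for `MAX_ACQUIRE_ATTEMPTS` is an assumption (the description names the constant but not its value).

```python
import asyncio

MAX_ACQUIRE_ATTEMPTS = 3  # assumed value; the PR names the constant, not its value


class ProcPoolSketch:
    """Hypothetical stand-in for the pool; only the acquire path is modeled."""

    def __init__(self, spawn_fn):
        self._spawn_fn = spawn_fn  # async callable: returns a proc or raises
        self._warmed_proc_queue = asyncio.Queue()
        self._spawn_tasks = set()

    def _in_flight(self):
        return {t for t in self._spawn_tasks if not t.done()}

    async def _spawn_and_enqueue(self):
        try:
            proc = await self._spawn_fn()
        except Exception:
            return  # failed init: nothing reaches the queue
        self._warmed_proc_queue.put_nowait(proc)

    def _request_spawn(self):
        task = asyncio.create_task(self._spawn_and_enqueue())
        self._spawn_tasks.add(task)
        task.add_done_callback(self._spawn_tasks.discard)

    async def _acquire_proc(self):
        for _ in range(MAX_ACQUIRE_ATTEMPTS):
            try:
                return self._warmed_proc_queue.get_nowait()  # already warmed
            except asyncio.QueueEmpty:
                pass
            if not self._in_flight():
                self._request_spawn()  # each retry asks for a fresh process
            get_task = asyncio.create_task(self._warmed_proc_queue.get())
            try:
                while not get_task.done():
                    in_flight = self._in_flight()
                    if not in_flight:
                        break  # every spawn settled without producing a proc
                    # Race the queue against all spawns still in flight.
                    await asyncio.wait(
                        {get_task, *in_flight}, return_when=asyncio.FIRST_COMPLETED
                    )
                if get_task.done():
                    return get_task.result()
            finally:
                if not get_task.done():
                    get_task.cancel()
        raise RuntimeError("no process became available")


def make_flaky_spawn(failures):
    """Hypothetical helper: fail `failures` times, then succeed."""
    calls = {"n": 0}

    async def spawn():
        calls["n"] += 1
        await asyncio.sleep(0)
        if calls["n"] <= failures:
            raise RuntimeError("init failed")
        return f"proc-{calls['n']}"

    return spawn, calls


async def demo(failures):
    spawn, calls = make_flaky_spawn(failures)
    pool = ProcPoolSketch(spawn)
    try:
        return await pool._acquire_proc(), calls["n"]
    except RuntimeError:
        return None, calls["n"]
```

In this sketch, `asyncio.run(demo(2))` acquires a process on the third cycle, while `asyncio.run(demo(5))` exhausts the budget and raises instead of hanging. A single failed spawn only ends the cycle when no other spawn is still pending.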

Alternative considered

#5871 uses None sentinels on _warmed_proc_queue to unblock waiters when a spawn fails. Two issues with that approach:

1. It wakes too eagerly: a sentinel is pushed on each individual spawn failure, even though peer spawns may still succeed, burning a retry attempt.
2. The retry doesn't trigger a fresh spawn: on loop-back the queue isn't empty (it has more None sentinels), so launch_job consumes another sentinel without ever requesting a new process. MAX_ATTEMPTS=3 becomes "fail 3 sentinels and give up".
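Both issues can be reproduced with a toy consumer. The sketch below is a simplified model of the sentinel design as described here, not the code from #5871; `launch_with_sentinels`, `request_spawn`, and `demo` are hypothetical names.

```python
import asyncio

MAX_ATTEMPTS = 3


async def launch_with_sentinels(queue, request_spawn):
    """Hypothetical consumer: None on the queue means 'a spawn failed'."""
    for attempt in range(1, MAX_ATTEMPTS + 1):
        if queue.empty():
            request_spawn()  # only reached when nothing at all is queued
        proc = await queue.get()
        if proc is not None:
            return proc, attempt
        # A sentinel was consumed: one attempt is gone, no new spawn requested.
    raise RuntimeError("gave up")


async def demo(items):
    queue = asyncio.Queue()
    for item in items:
        queue.put_nowait(item)
    spawn_requests = []
    try:
        result = await launch_with_sentinels(
            queue, lambda: spawn_requests.append(1)
        )
    except RuntimeError:
        result = None
    return result, len(spawn_requests)
```

With `[None, None, "proc"]` on the queue, the waiter succeeds only on its last attempt even though a peer spawn was healthy all along. With three sentinels it gives up having requested zero new processes.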

Previously, when every worker process failed to initialize (e.g. on a resource-constrained host where cold imports exceed `initialize_timeout`), `launch_job` hung forever on `_warmed_proc_queue.get()` because nothing was ever put on the queue. The 3-attempt retry loop was unreachable.

Split the responsibility: `_acquire_proc` now owns the wait-for-a-warmed-process loop. It races `queue.get()` against every in-flight spawn task and only retries once all in-flight spawns settle without producing a proc, so peer spawns still in flight don't burn a retry attempt. After `MAX_ACQUIRE_ATTEMPTS` such cycles, it raises. `launch_job` keeps its own 3-attempt budget for post-acquire launch failures.

Fixes #5868.

devin-ai-integration

✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.

View in Devin Review to see 4 additional findings.

chenghao-mou

LGTM. One small nit.

Call pool.start() so the test exercises launch_job the way it runs in production, and close the pool in a finally block so the background _main_task doesn't leak past teardown.

chenghao-mou requested a review from a team May 28, 2026 02:47

longcw mentioned this pull request May 28, 2026

fix: unblock jobs after process init failure #5871

Closed

devin-ai-integration Bot reviewed May 28, 2026

chenghao-mou approved these changes May 29, 2026

Comment thread tests/test_ipc.py Outdated

test(ipc): start pool before launching job in spawn-failure test

e4ff792

Call pool.start() so the test exercises launch_job the way it runs in production, and close the pool in a finally block so the background _main_task doesn't leak past teardown.
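The start-then-close-in-`finally` shape could look like the sketch below. `FakePool` is a hypothetical stand-in, not the real pool class: its `aclose` name, its synchronous `start()`, and the simulated failure in `launch_job` are all assumptions made for illustration.

```python
import asyncio


class FakePool:
    """Hypothetical pool; only start/close and the background task are modeled."""

    def __init__(self):
        self._main_task = None

    def start(self):
        # Assumed to be synchronous here purely for brevity.
        self._main_task = asyncio.create_task(self._main())

    async def _main(self):
        await asyncio.Event().wait()  # runs until cancelled

    async def launch_job(self, job):
        if self._main_task is None:
            raise RuntimeError("pool not started")
        raise RuntimeError("all spawns failed")  # simulated failure path

    async def aclose(self):
        if self._main_task is not None:
            self._main_task.cancel()
            await asyncio.gather(self._main_task, return_exceptions=True)


async def check_spawn_failure():
    pool = FakePool()
    pool.start()  # exercise launch_job with the background task running
    error = None
    try:
        try:
            await pool.launch_job("job")
        except RuntimeError as exc:
            error = str(exc)
    finally:
        await pool.aclose()  # the background task must not outlive the test
    return error, pool._main_task.done()
```

Closing in `finally` means the background task is cancelled even if an assertion inside the `try` fails, which is the leak the commit message refers to.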

longcw merged commit d008e67 into main Jun 1, 2026
23 checks passed

longcw deleted the longc/proc-pool-acquire-retry branch June 1, 2026 06:57

rosetta-livekit-bot Bot mentioned this pull request Jun 1, 2026

fix(ipc): retry proc acquisition when spawns fail livekit/agents-js#1669

Closed

longcw merged 2 commits into main from longc/proc-pool-acquire-retry
