feat(orchestrator): drain sandboxes during shutdown by wj-e2b · Pull Request #3005 · e2b-dev/infra

wj-e2b · 2026-06-13T02:16:44Z

Introduce a shared draingate.Gate (counter plus notification channel) and use it in the sandbox factory and gRPC server to reject new sandbox starts while draining, wait for in-flight starts, and drain or force-stop live sandboxes before closers run. Forced shutdown preserves buffered close errors on context cancellation and avoids duplicate final-pass errors. Adds utils.WaitGroupWait to wait on a WaitGroup with context cancellation.

The graceful drain phase is bounded by SHUTDOWN_DRAIN_TIMEOUT; when it expires the drain escalates to a forced sandbox shutdown. By default the drain waits forever, until sandboxes exit on their own or a force-stop API call empties the node.
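As an illustration of the "counter plus notification channel" idea, here is a minimal sketch of such a gate. It is not the PR's draingate implementation: `Enter`, `Idle`, and `ErrDraining` are hypothetical names chosen for this example, and only `Gate`, `StartDraining`, and a `Done` channel are mentioned in the PR text.

```go
package main

import (
	"errors"
	"sync"
)

// ErrDraining is a hypothetical sentinel; the PR's real error may differ.
var ErrDraining = errors.New("gate is draining")

// Gate counts admitted operations and rejects new ones once draining starts.
type Gate struct {
	mu       sync.Mutex
	active   int
	draining bool
	drainCh  chan struct{} // closed when draining starts
	idleCh   chan struct{} // closed once draining and active == 0
}

func NewGate() *Gate {
	return &Gate{drainCh: make(chan struct{}), idleCh: make(chan struct{})}
}

// Enter admits one operation and returns its release func, or ErrDraining.
func (g *Gate) Enter() (release func(), err error) {
	g.mu.Lock()
	defer g.mu.Unlock()
	if g.draining {
		return nil, ErrDraining
	}
	g.active++
	var once sync.Once // makes release safe to call more than once
	return func() { once.Do(g.leave) }, nil
}

func (g *Gate) leave() {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.active--
	if g.draining && g.active == 0 {
		close(g.idleCh)
	}
}

// StartDraining blocks new entries. It is idempotent.
func (g *Gate) StartDraining() {
	g.mu.Lock()
	defer g.mu.Unlock()
	if g.draining {
		return
	}
	g.draining = true
	close(g.drainCh)
	if g.active == 0 {
		close(g.idleCh)
	}
}

// Done is closed when draining starts; Idle is closed when the last
// admitted operation has released after draining started.
func (g *Gate) Done() <-chan struct{} { return g.drainCh }
func (g *Gate) Idle() <-chan struct{} { return g.idleCh }
```

A shutdown path would call `StartDraining` and then select on `Idle()` against a timeout or context, while handlers call `Enter` and defer the returned release function.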

cursor · 2026-06-13T02:16:50Z

PR Summary

High Risk
Changes the core node shutdown and sandbox lifecycle path (drain, force-close, gate ordering); mis-tuning could leave workloads running or kill sandboxes abruptly during rolling deploys.

Overview
Orchestrator shutdown now rejects new sandbox and template-build starts, waits for in-flight work, and empties live sandboxes before tearing down services. A reusable draingate counts admitted operations and blocks new ones once draining starts; the sandbox factory gate is held across whole Create/Checkpoint RPCs so checkpoint’s nested resume is not cut off mid-drain, while fresh starts get gRPC Unavailable. Graceful drain uses SHUTDOWN_DRAIN_TIMEOUT when set (otherwise it can wait indefinitely); on timeout or FORCE_STOP, template builds and sandboxes escalate to forced cancellation/close with parallel force-stop passes for late-arriving lifecycles. Template creates use a separate drain gate; deletes and layer upload init stay available during drain. Cleanup and rollbacks use contexts that ignore parent cancel but keep deadlines so shutdown work can finish under a bounded drain window.

Reviewed by Cursor Bugbot for commit b2e186e.

gemini-code-assist

Code Review

Compilation errors exist in packages/orchestrator/pkg/server/main.go and packages/orchestrator/pkg/draingate/gate_test.go where wg.Go() is called on a sync.WaitGroup which does not have a Go method. In packages/orchestrator/pkg/server/main.go, calling context.WithoutCancel(ctx) strips the configured shutdown deadline, which can cause the orchestrator to hang indefinitely if a sandbox cleanup hangs. Additionally, the spin-loop using runtime.Gosched() in packages/shared/pkg/utils/waitgroup.go is flaky and can cause premature context cancellation errors under heavy CPU load.

gemini-code-assist · 2026-06-13T02:18:26Z

+	for range 10 {
+		select {
+		case <-done:
+			return nil
+		default:
+			runtime.Gosched()
+		}
+	}
+
+	select {
+	case <-ctx.Done():
+		return fmt.Errorf("waiting for wait group: %w", ctx.Err())
+	case <-done:
+		return nil
+	}


The spin-loop using runtime.Gosched() is flaky and does not guarantee that the goroutine closing done will be scheduled and executed within 10 iterations. Under heavy CPU load or thread starvation, the loop can easily fall through to the select block and return a context error even if the WaitGroup was already completed, making the associated test fragile. You should use a standard select block to wait for either the WaitGroup completion or context cancellation.

select {
case <-done:
	return nil
case <-ctx.Done():
	return fmt.Errorf("waiting for wait group: %w", ctx.Err())
}

codecov · 2026-06-13T02:18:27Z

❌ 1 Tests Failed:

Tests completed	Failed	Passed	Skipped
3000	1	2999	8

View the top 1 failed test(s) by shortest run time

github.com/e2b-dev/infra/tests/integration/internal/tests/api/metrics::TestSandboxMetrics

Stack Traces | 15.6s run time

=== RUN   TestSandboxMetrics
=== PAUSE TestSandboxMetrics
=== CONT  TestSandboxMetrics
    sandbox_metrics_test.go:26: 
        	Error Trace:	.../api/metrics/sandbox_metrics_test.go:26
        	Error:      	Condition never satisfied
        	Test:       	TestSandboxMetrics
        	Messages:   	sandbox metrics not available in time
--- FAIL: TestSandboxMetrics (15.58s)

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5d6b2e7976

chatgpt-codex-connector · 2026-06-15T20:03:56Z

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Credits must be used to enable repository wide code reviews.

Gate new template build, delete, and layer upload requests behind the shared drain gate, cancel or await in-flight builds via the build cache, and split ServerStore shutdown into Wait (graceful) and ForceStop so callers choose the escalation explicitly. If the graceful template drain is cut short (for example by SHUTDOWN_DRAIN_TIMEOUT), shutdown escalates to a bounded template force stop instead of abandoning in-flight builds.

The sandbox server kept its own drain gate in addition to the shared factory start gate, and drained them in sequence (server gate, then factory gate) so an admitted checkpoint's internal resume would not be rejected when the factory drained first. Collapse to a single factory gate.

Create/Checkpoint now enter the factory gate at the handler boundary and mark the context with WithHeldStartGate so the nested ResumeSandbox does not re-enter (or get rejected by) the gate. This keeps the checkpoint's remove-then-resume atomic with respect to drain without a second gate. DrainSandboxes / ForceStopSandboxes wait only on the factory gate, and the starting-limit refresher stops on the factory's drain Done channel.

Note: not built/tested locally (cgo userfaultfd can't cross-compile from darwin and Docker is unavailable here); needs linux CI.

TemplateBuildDelete was gated by the drain gate, so once a node entered drain it rejected deletes with Unavailable. Delete is the cancel/kill path (it fails a running build then removes artifacts), so it must keep working while draining. Ungate delete and stop tracking it on the build wait group; adding to the wg after a graceful drain's wg.Wait has started would trip the sync.WaitGroup misuse panic. In-flight delete RPCs are drained by grpcServer.GracefulStop during shutdown instead.

InitLayerFileUpload only mints a signed upload URL and checks existence in shared, content-addressed build storage; the orchestrator is not in the upload data path. A call landing on a draining node is therefore harmless: the client's upload to storage is unaffected by the node draining, and the cached layer is usable by a build on any node.

Drop the rejectIfDraining guard (and the now-unused helper) so the upload-init is not rejected mid-drain. Any in-flight RPC is drained by the gRPC server's GracefulStop during shutdown.

Note: not built/tested locally (cgo userfaultfd can't cross-compile from darwin and Docker is unavailable here); needs linux CI.

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

cursor · 2026-06-19T00:01:15Z

+
+				err = forceStopTemplateBuilds()
+			}
+		}


Factory drain precedes template wait

High Severity

Graceful shutdown calls orchestratorService.StartDraining on the shared sandbox factory before tmpl.Wait finishes in-flight template builds. Template build steps call CreateSandbox/ResumeSandbox on that same factory without WithHeldStartGate, so mid-build layer work gets ErrFactoryDraining while shutdown still expects builds to complete gracefully.

wj-e2b requested review from ValentaTomas, dobrac and jakubno as code owners June 13, 2026 02:16

cla-bot Bot added the cla-signed label Jun 13, 2026

cursor Bot reviewed Jun 13, 2026

Comment thread packages/orchestrator/pkg/server/sandboxes.go Outdated

gemini-code-assist Bot reviewed Jun 13, 2026

chatgpt-codex-connector Bot reviewed Jun 13, 2026

Comment thread packages/orchestrator/pkg/factories/run.go Outdated

wj-e2b force-pushed the wj-orch-fix-4 branch from 5d6b2e7 to 4394912 Compare June 15, 2026 20:03

cursor Bot reviewed Jun 15, 2026

Comment thread packages/orchestrator/pkg/factories/run.go

wj-e2b force-pushed the wj-orch-fix-4 branch from 4394912 to 1aa9982 Compare June 15, 2026 21:19

cursor Bot reviewed Jun 15, 2026

Comment thread packages/orchestrator/pkg/server/main.go Outdated

wj-e2b force-pushed the wj-orch-fix-4 branch from 1aa9982 to cfed9fc Compare June 16, 2026 23:51

wj-e2b force-pushed the wj-orch-fix-4 branch from e9c61e1 to 36da8ba Compare June 17, 2026 00:08

wj-e2b added 9 commits June 18, 2026 16:49

refactor(orchestrator): rename template operation gate

c147a22

refactor(orchestrator): simplify forced sandbox drain

bac5790

fixup

16ad561

fix(orchestrator): preserve cleanup deadlines

b2e186e

wj-e2b force-pushed the wj-orch-fix-4 branch from 36da8ba to b2e186e Compare June 18, 2026 23:59

cursor Bot reviewed Jun 19, 2026
