fix(orch): prevent NBD dispatch read-loop stall on WRITE_ZEROES (behind flag)#3048
kalyazin wants to merge 3 commits into
PR Summary: Medium Risk. Reviewed by Cursor Bugbot for commit fce9507.
❌ 2 Tests Failed
Code Review
In packages/orchestrator/pkg/sandbox/nbd/dispatch.go, cmdWriteZeroes does not handle context cancellation when executing the backend operation. If the backend WriteZeroesAt call blocks indefinitely during shutdown, the goroutine will hang and prevent the orchestrator from draining pending responses cleanly, leading to a shutdown deadlock.
performWriteZeroes := func() error {
	var respErr uint32
	if _, err := d.prov.WriteZeroesAt(int64(cmdFrom), cmdLength); err != nil {
		respErr = 1
		logger.L().Error(ctx, "nbd backend write-zeroes failed",
			zap.Error(err),
			zap.String("nbd_provider", d.provName),
			zap.Uint64("nbd_handle", cmdHandle),
			zap.Uint64("nbd_offset", cmdFrom),
			zap.Int64("nbd_length", cmdLength),
		)
	}

	return d.writeResponse(respErr, cmdHandle, nil)
}
Unlike cmdRead and cmdWrite, cmdWriteZeroes does not handle context cancellation when executing the backend operation. If the backend WriteZeroesAt call blocks indefinitely during shutdown, the goroutine will hang and prevent the orchestrator from draining pending responses cleanly, leading to a shutdown deadlock. Spawning a goroutine for the backend call and selecting on ctx.Done allows the dispatcher to unblock and exit gracefully.
performWriteZeroes := func() error {
	errchan := make(chan error, 1)
	go func() {
		_, err := d.prov.WriteZeroesAt(int64(cmdFrom), cmdLength)
		errchan <- err
	}()
	var writeErr error
	select {
	case <-ctx.Done():
		writeErr = ctx.Err()
	case err := <-errchan:
		writeErr = err
	}
	var respErr uint32
	if writeErr != nil {
		respErr = 1
		logger.L().Error(ctx, "nbd backend write-zeroes failed",
			zap.Error(writeErr),
			zap.String("nbd_provider", d.provName),
			zap.Uint64("nbd_handle", cmdHandle),
			zap.Uint64("nbd_offset", cmdFrom),
			zap.Int64("nbd_length", cmdLength),
		)
	}
	return d.writeResponse(respErr, cmdHandle, nil)
}
Good catch — fixed. cmdWriteZeroes now runs WriteZeroesAt in a goroutine and selects on ctx, mirroring cmdRead/cmdWrite, so a blocked backend call can't hang the goroutine or the pending-response drain on shutdown. Folded into the refactor commit (performWriteZeroes).
195a974 to e835f18
Add an asyncWriteZeroes field to Dispatch and a corresponding NewDispatch parameter, and split cmdWriteZeroes into a shared performWriteZeroes closure that can run either inline or in a goroutine. All callers pass false, so behavior is unchanged: WRITE_ZEROES/TRIM still run inline on the read loop. This prepares for gating the async path behind a feature flag.

Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
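The inline-or-goroutine split described in this commit message could look roughly like the sketch below. It is not the PR's code: the dispatcher type, the pending wait group, and the onErr callback are hypothetical stand-ins, and only the asyncWriteZeroes field name comes from the commit message.

```go
package main

import "sync"

// dispatcher is a hypothetical stand-in for the PR's Dispatch type.
type dispatcher struct {
	asyncWriteZeroes bool           // field name taken from the commit message
	pending          sync.WaitGroup // assumed: tracks in-flight async handlers
}

// handleWriteZeroes runs perform inline when the flag is off, or in a
// goroutine when it is on. In the async case the caller returns immediately
// and any error is reported through onErr instead of the return value.
func (d *dispatcher) handleWriteZeroes(perform func() error, onErr func(error)) error {
	if !d.asyncWriteZeroes {
		// Inline: the caller (the read loop) waits for perform to finish.
		return perform()
	}
	d.pending.Add(1)
	go func() {
		defer d.pending.Done()
		if err := perform(); err != nil {
			onErr(err)
		}
	}()
	return nil
}
```

With the flag off the closure still runs on the caller's goroutine, which matches the "behavior is unchanged" claim in the message; the wait group is one way a shutdown path might wait for outstanding handlers.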
Add the nbd-async-write-zeroes feature flag (default false), evaluate it per device mount, and pass it to NewDispatch. When enabled, WRITE_ZEROES and TRIM are handled in a goroutine instead of inline on the dispatch read loop.

Inline handling shares writeLock with the asynchronous read/write reply writers. If a reply write blocks on a full socket send buffer while holding writeLock, a WRITE_ZEROES processed on the read loop blocks acquiring the lock; the loop then stops draining the socket and the kernel times out the NBD connection, returning EIO to the guest and crashing Firecracker on the next flush. Moving the work off the loop keeps the socket drained. Disabled by default for a gradual rollout.

Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Drive the real Dispatch loop over an in-memory socket whose reply Write blocks, simulating a full kernel send buffer. With a READ reply blocked while holding writeLock, dispatch a WRITE_ZEROES and probe with a fresh READ:
- inline (flag off): the loop stalls and stops serving the probe
- async (flag on): the loop keeps serving the probe

Both modes recover once the blocked reply drains, matching the transient, self-clearing incidents this fix targets.

Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
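One way to build a socket whose reply Write blocks is a small test double like the one below. This is an assumption about the shape of such a helper, not the test's actual implementation; blockingWriter and all of its methods are hypothetical names.

```go
package main

import "sync"

// blockingWriter is a hypothetical test double: every Write blocks until
// Release is called, imitating a socket whose send buffer is full.
type blockingWriter struct {
	mu          sync.Mutex
	buf         []byte
	entered     chan struct{} // closed when the first Write starts
	release     chan struct{} // closed by Release to unblock writes
	enterOnce   sync.Once
	releaseOnce sync.Once
}

func newBlockingWriter() *blockingWriter {
	return &blockingWriter{
		entered: make(chan struct{}),
		release: make(chan struct{}),
	}
}

// Write signals that a writer has arrived, then blocks until released.
func (w *blockingWriter) Write(p []byte) (int, error) {
	w.enterOnce.Do(func() { close(w.entered) })
	<-w.release
	w.mu.Lock()
	defer w.mu.Unlock()
	w.buf = append(w.buf, p...)
	return len(p), nil
}

// Release lets blocked and future writes proceed; it is safe to call twice.
func (w *blockingWriter) Release() {
	w.releaseOnce.Do(func() { close(w.release) })
}

// Bytes returns a copy of everything written so far.
func (w *blockingWriter) Bytes() []byte {
	w.mu.Lock()
	defer w.mu.Unlock()
	return append([]byte(nil), w.buf...)
}
```

A test can wait on entered to know that a reply writer is parked inside Write (and therefore holding whatever lock it took first), issue further requests, and then call Release to check that everything recovers.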
b0ac16d to fce9507
Why
The NBD dispatcher can stall its per-connection read loop.
Each connection is served by a single goroutine that reads requests off the socket (Dispatch.Handle). READ and WRITE replies are written from spawned goroutines via writeResponse, which holds a shared writeLock across the socket write (and the socket has no write deadline). WRITE_ZEROES and TRIM, however, are handled inline on the read loop and also go through writeResponse.
If a reply writer is blocked inside writeResponse while holding writeLock (e.g. a large reply whose socket write blocks because the peer is momentarily not draining the socket), an inline WRITE_ZEROES/TRIM on the read loop blocks acquiring writeLock. The read loop then stops reading the socket until the lock is released. Since that loop is the only thing draining the connection, no further requests are processed while it is blocked, and the kernel NBD client can time out the connection.
What
- nbd-async-write-zeroes feature flag (disabled by default). When enabled, WRITE_ZEROES/TRIM are handled in a goroutine like cmdRead/cmdWrite, so the read loop never acquires writeLock and a blocked reply writer cannot stall it.
- The WRITE_ZEROES backend call is context-aware (run in a goroutine, select on ctx), matching cmdRead/cmdWrite, so it cannot hang the pending-response drain on shutdown.
Validation
Regression test TestDispatchWriteZeroesReadLoopStall drives the real dispatch loop with a reply writer blocked while holding writeLock, then dispatches a WRITE_ZEROES and probes with a fresh READ:
- inline (flag off): the loop stalls and stops serving the probe
- async (flag on): the loop keeps serving the probe
Passes under -race.
go test ./pkg/sandbox/nbd/... (incl. -race), featureflags tests, golangci-lint, go vet, gofmt all clean.
Making WRITE_ZEROES async is consistent with the existing async read/write design: NBD allows out-of-order replies, the guest orders dependent I/O via completions (the reply is sent only after WriteZeroesAt completes), and the overlay cache serializes concurrent writes via its own mutex.
Out of scope (follow-ups)
- Add a write deadline to writeResponse to bound any reply-write stall.
- […] (ReleaseDevice without WithInfiniteRetry() on error paths).
🤖 Generated with Claude Code