Skip to content

feat(envd): bound stalled transfers with idle write/read timeouts#3047

Open
mishushakov wants to merge 3 commits into
mainfrom
mishushakov/envd-write-idle-timeout
Open

feat(envd): bound stalled transfers with idle write/read timeouts#3047
mishushakov wants to merge 3 commits into
mainfrom
mishushakov/envd-write-idle-timeout

Conversation

@mishushakov

Copy link
Copy Markdown
Member

envd serves long-lived streams over HTTP/1.1, so a peer that stalls mid-transfer can pin a handler — and the process/file behind it — indefinitely. This adds a netlimit package with a per-write idle timeout (applied globally via a listener wrapper, since a write deadline only arms on genuine TCP backpressure and is safe everywhere) and a per-read idle timeout scoped to file upload (PostFiles), where "no bytes for N seconds" is pathological — unlike StreamInput/process streams where a quiet gap is normal. It also sets ReadHeaderTimeout for slowloris protection (header read only, never bodies or streams) and bumps the envd version to 0.6.7. Both timeouts reset per chunk, so slow-but-progressing transfers survive and only true stalls are reclaimed.

🤖 Generated with Claude Code

Add a per-write idle timeout (netlimit.WriteTimeoutListener) so a client that
stops reading can't pin a streaming handler via TCP backpressure, and a
per-read idle timeout scoped to file upload (PostFiles) so a client that stops
sending mid-upload can't hold the open file open forever. Also set
ReadHeaderTimeout for slowloris protection. Bump envd version to 0.6.7.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@cursor

cursor Bot commented Jun 19, 2026

Copy link
Copy Markdown

PR Summary

Medium Risk
Changes global HTTP connection behavior for all envd traffic; mis-tuned idle limits could drop slow-but-valid transfers or affect connection reuse if body restoration were wrong (the PR explicitly guards smuggling).

Overview
envd can now reclaim connections when a peer stops sending or reading mid-transfer instead of pinning upload handlers or long-lived streams indefinitely. A new netlimit layer applies a 60s per-chunk write idle deadline on every accepted TCP connection (listener wrapper, with CloseWrite forwarded), while ReadTimeout and WriteTimeout stay at zero so normal idle streams are not cut off by absolute server timeouts. ReadHeaderTimeout (60s) is added for slowloris-style header stalls only.

File PostFiles wraps the raw request body with a matching per-read idle deadline via ResponseController, then restores the original r.Body and clears the read deadline on exit so net/http connection reuse and early-close detection stay correct. Version bumps to 0.6.7.

Reviewed by Cursor Bugbot for commit fffee4b. Bugbot is set up for automated code reviews on this repo. Configure here.

@codecov

codecov Bot commented Jun 19, 2026

Copy link
Copy Markdown

❌ 1 Tests Failed:

Tests completed Failed Passed Skipped
2971 1 2970 8
View the top 1 failed test(s) by shortest run time
github.com/e2b-dev/infra/tests/integration/internal/tests/api/templates::TestUpdateTemplateWithInvalidAPIKey
Stack Traces | 158s run time
=== RUN   TestUpdateTemplateWithInvalidAPIKey
=== PAUSE TestUpdateTemplateWithInvalidAPIKey
=== CONT  TestUpdateTemplateWithInvalidAPIKey
    template_update_test.go:107: Build failed: {<nil> build was cancelled <nil>}
--- FAIL: TestUpdateTemplateWithInvalidAPIKey (158.02s)

To view more test analytics, go to the Test Analytics Dashboard
📋 Got 3 mins? Take this short survey to help us improve Test Analytics.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

Wrapping the accepted connection in writeTimeoutConn prevents http.Server from asserting the underlying connection to *net.TCPConn, which bypasses the server's default TCP keep-alive configuration and disables keep-alives entirely for all incoming connections. To prevent this, TCP keep-alives should be enabled on the raw connection before wrapping it.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread packages/envd/internal/netlimit/writedeadline.go
@mishushakov

Copy link
Copy Markdown
Member Author

@codex review
@claude review

@mishushakov

Copy link
Copy Markdown
Member Author

Thanks @gemini-code-assist — I looked into this and it doesn't apply on our Go version (1.26). TCP keep-alive isn't affected by wrapping the accepted conn:

  • http.Server never sets SO_KEEPALIVE itself. Its only *net.TCPConn assertions are for sendfile/ReadFrom and Hijacker — not keep-alive. Server.doKeepAlives() is HTTP-level connection reuse, unrelated to TCP keep-alive and unaffected by the wrapper.
  • TCP keep-alive is enabled by the net package at accept time, inside the underlying *net.TCPListener.Accept(), before writeTimeoutListener ever wraps the conn (net/tcpsock_posix.gonewTCPConn, which enables keep-alive when ListenConfig.KeepAlive == 0, i.e. the zero value). It's a socket option on the fd, so wrapping the Go object can't disable it.
  • http.Server.ListenAndServe itself just calls net.Listen("tcp", addr) with the zero-value ListenConfig. Our (&net.ListenConfig{}).Listen(...) is exactly equivalent, so keep-alive behavior is identical to before this change.

The tcpKeepAliveListener that ListenAndServe used to install was removed once ListenConfig started enabling keep-alives, so there's nothing for the wrapper to bypass here.

(One real effect of wrapping — losing the sendfile fast path because we deliberately don't implement io.ReaderFrom — is intentional: a single sendfile deadline would span an entire download and kill a legitimate large-but-progressing transfer. See the comment in netlimit/writedeadline.go.)

@gemini-code-assist

Copy link
Copy Markdown
Contributor

That is a very clear and thorough explanation. You are correct that net.ListenConfig enables keep-alives at the socket level during Accept(), so wrapping the resulting net.Conn does not affect the underlying socket options. Your decision to sacrifice the sendfile fast path in favor of per-chunk deadline granularity is a sound trade-off for preventing stalled transfers in this context. The implementation looks solid.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1caa8ba011

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread packages/envd/internal/api/upload.go Outdated
Comment thread packages/envd/internal/netlimit/writedeadline.go
Address two issues from review of the idle-timeout change:

- Upload: restore r.Body to net/http's native *body before PostFiles returns.
  The server type-asserts r.Body to *body for keep-alive reuse / early-close
  accounting; leaving the IdleReadCloser wrapper in place defeated that and
  could let leftover bytes from an aborted upload be parsed as the next request
  (smuggling) on a reused connection. Also clear the read deadline on return.

- writeTimeoutConn: forward CloseWrite to the underlying conn so net/http can
  half-close (send FIN, keep reading) before a full close when replying with an
  error mid-upload, instead of an RST that can truncate the response.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@mishushakov

Copy link
Copy Markdown
Member Author

Thanks @codex — both findings were real and are fixed in 689620e.

1. Body reuse / smuggling (upload.go): confirmed. shouldReuseConnection()closedRequestBodyEarly() type-asserts w.req.Body.(*body); replacing r.Body with the IdleReadCloser wrapper made that assertion fail, so the server couldn't detect an aborted large upload's undrained bytes and would reuse the connection. Fixed by restoring r.Body to the original *body in a defer before PostFiles returns (and clearing the read deadline). The wrapper is now only in place while the handler reads; net/http does its own early-close/reuse accounting on the native type, exactly as before.

2. CloseWrite (writedeadline.go): confirmed. closeWriteAndWait() checks c.rwc.(closeWriter), and embedding only net.Conn hid *net.TCPConn.CloseWrite. Added a CloseWrite method that forwards to the underlying conn (returns errors.ErrUnsupported otherwise), so the server can half-close before a full close on mid-upload error replies. (ReaderFrom/sendfile stays intentionally unforwarded — a single sendfile deadline would span a whole download.)

A full regression test needs a live keep-alive server since the bug only manifests on connection reuse with a real *body; the wrapper-level behavior (CloseWrite forwarding) has unit tests.

@chatgpt-codex-connector

Copy link
Copy Markdown

To use Codex here, create a Codex account and connect to github.

The restore (not the deadline clear) is what prevents request smuggling:
shouldReuseConnection -> closedRequestBodyEarly type-asserts r.Body to *body,
so a leftover wrapper hides an early-closed/undrained upload and the conn gets
reused. Verified empirically. Attribute the deadline clear as hygiene only.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant