mcp: retry transient errors on SSE reconnect #1027

Open
aditya-786 wants to merge 1 commit into modelcontextprotocol:main from aditya-786:fix/streamable-sse-transient-retry

@aditya-786

Summary

Completes the remaining work for #683 that #723 called out: "we should also retry transient errors in handleSSE."

When an SSE stream is interrupted and the client reconnects, streamableClientConn.handleSSE passed every checkResponse error to c.fail, permanently breaking the session. However, checkResponse intentionally wraps transient HTTP statuses (429, 502, 503, 504) in jsonrpc2.ErrRejected so that they do not break the connection. The POST path already honors this (if !errors.Is(err, jsonrpc2.ErrRejected) { c.fail(err) }), but the SSE reconnect path did not. As a result, a transient 503 during a reconnect (server restart, load-balancer hiccup, load spike) poisoned the whole session, which is the failure mode reported in #683.
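For reviewers less familiar with the pattern, here is a minimal standalone sketch of the wrap-and-check idea. It is not the SDK's code: errRejected stands in for jsonrpc2.ErrRejected, and checkStatus and shouldFail are hypothetical names. How checkResponse treats statuses outside the four transient ones is assumed here to be a plain, unwrapped error.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// errRejected is a local stand-in for jsonrpc2.ErrRejected (an assumption for
// this sketch; the real sentinel lives in the SDK).
var errRejected = errors.New("rejected")

// checkStatus is a hypothetical analogue of checkResponse: transient statuses
// are wrapped in errRejected, other non-2xx statuses are returned unwrapped.
func checkStatus(code int) error {
	if code >= 200 && code < 300 {
		return nil
	}
	err := fmt.Errorf("unexpected status: %d %s", code, http.StatusText(code))
	switch code {
	case http.StatusTooManyRequests, http.StatusBadGateway,
		http.StatusServiceUnavailable, http.StatusGatewayTimeout:
		return fmt.Errorf("%w: %v", errRejected, err)
	}
	return err
}

// shouldFail reports whether err should permanently break the session.
// This mirrors the POST path's check: only non-rejected errors are fatal.
func shouldFail(err error) bool {
	return err != nil && !errors.Is(err, errRejected)
}

func main() {
	for _, code := range []int{200, 404, 429, 503} {
		err := checkStatus(code)
		fmt.Printf("%d: err=%v fatal=%t\n", code, err, shouldFail(err))
	}
}
```

The point of the sketch is that the caller decides fatality with errors.Is rather than by inspecting the status code again, so both the POST path and the reconnect path can share one classification.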

Change

In the reconnect loop, when checkResponse returns an ErrRejected-wrapped (transient) error, retry the reconnect instead of failing. Each retry counts against the existing no-progress retry budget (#679), so a persistently unavailable server still eventually gives up rather than looping forever. Non-transient errors are unchanged.
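The shape of the change can be sketched roughly as follows. This is an illustration under assumptions, not the actual handleSSE loop: reconnect, flaky, and the budget parameter are hypothetical, errRejected stands in for jsonrpc2.ErrRejected, and the backoff delays and progress-based reset of the real retry budget are left out.

```go
package main

import (
	"errors"
	"fmt"
)

// errRejected is a local stand-in for jsonrpc2.ErrRejected.
var errRejected = errors.New("rejected")

// reconnect calls attempt until it succeeds, returns a non-transient error,
// or the budget is used up. It returns the number of attempts made.
func reconnect(budget int, attempt func() error) (int, error) {
	lastErr := errors.New("no attempts made")
	for n := 1; n <= budget; n++ {
		err := attempt()
		if err == nil {
			return n, nil
		}
		if !errors.Is(err, errRejected) {
			// Non-transient: unchanged behavior, the caller fails the session.
			return n, err
		}
		// Transient: spend one unit of budget and try again.
		lastErr = err
	}
	return budget, fmt.Errorf("reconnect budget of %d exhausted: %w", budget, lastErr)
}

// flaky returns an attempt function that fails transiently for the first
// `failures` calls and succeeds afterwards (a hypothetical test helper).
func flaky(failures int) func() error {
	calls := 0
	return func() error {
		calls++
		if calls <= failures {
			return fmt.Errorf("%w: 503 Service Unavailable", errRejected)
		}
		return nil
	}
}

func main() {
	n, err := reconnect(5, flaky(1)) // one transient failure, then success
	fmt.Println(n, err)
	n, err = reconnect(3, flaky(10)) // persistently unavailable server
	fmt.Println(n, err)
}
```

Because every transient failure consumes budget, the second call in main terminates with an error instead of looping forever, which is the property the existing no-progress budget is meant to preserve.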

Test

TestStreamableClientReconnectTransientErrors reproduces the bug with the existing fakeStreamableServer harness: a POST tool call returns an SSE stream that emits a resumable event then ends without the response (forcing a reconnect); the reconnect returns 503 once, then the real response. The test fails before this change (... Service Unavailable (session should survive)) and passes after.
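To illustrate the "503 once, then the real response" behavior in isolation, here is a small sketch built on net/http/httptest. It does not use the fakeStreamableServer harness from the test; newFlakyServer and statuses are hypothetical helpers, and the SSE payload is a placeholder.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// newFlakyServer returns a test server whose first `failures` requests get a
// 503 and whose later requests get a 200 with a small SSE body.
func newFlakyServer(failures int32) *httptest.Server {
	var seen atomic.Int32
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if seen.Add(1) <= failures {
			http.Error(w, "try again", http.StatusServiceUnavailable)
			return
		}
		w.Header().Set("Content-Type", "text/event-stream")
		fmt.Fprint(w, "id: 1\ndata: {\"jsonrpc\":\"2.0\",\"id\":1,\"result\":{}}\n\n")
	}))
}

// statuses issues n sequential GET requests and returns the observed codes.
func statuses(url string, n int) ([]int, error) {
	var codes []int
	for i := 0; i < n; i++ {
		resp, err := http.Get(url)
		if err != nil {
			return codes, err
		}
		resp.Body.Close()
		codes = append(codes, resp.StatusCode)
	}
	return codes, nil
}

func main() {
	srv := newFlakyServer(1)
	defer srv.Close()
	codes, err := statuses(srv.URL, 2)
	fmt.Println(codes, err)
}
```

A client that treats the first status as fatal never sees the second response, which is the pre-change behavior the test captures.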

Verified locally: go test ./..., go test -race ./..., go vet ./..., gofmt -l, and staticcheck all clean.

Scope note

connectStandaloneSSE has the same unconditional c.fail on a transient status. The standalone GET stream is optional (§2.2.3), so I scoped this PR to handleSSE (the case #723 named). Happy to also fix the standalone path here or in a follow-up — whichever you prefer.

For #683

When an interrupted SSE stream is reconnected, a transient HTTP status
(429, 502, 503, 504) returned by the reconnect was passed to c.fail,
permanently breaking the session. This is the same failure mode fixed
for the POST path in modelcontextprotocol#723, which explicitly left the handleSSE case as
follow-up work.

Retry the reconnect when checkResponse returns an error wrapped with
jsonrpc2.ErrRejected, instead of failing the connection. Each retry
counts against the existing no-progress budget, so a persistently
unavailable server still eventually gives up (modelcontextprotocol#679).

For modelcontextprotocol#683