mcp: retry transient errors on SSE reconnect#1027
Open
aditya-786 wants to merge 1 commit into
Open
Conversation
When an interrupted SSE stream is reconnected, a transient HTTP status (429, 502, 503, 504) returned by the reconnect was passed to c.fail, permanently breaking the session. This is the same failure mode fixed for the POST path in modelcontextprotocol#723, which explicitly left the handleSSE case as follow-up work. Retry the reconnect when checkResponse returns an error wrapped with jsonrpc2.ErrRejected, instead of failing the connection. Each retry counts against the existing no-progress budget, so a persistently unavailable server still eventually gives up (modelcontextprotocol#679). For modelcontextprotocol#683
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Completes the remaining work for #683 that #723 called out: "we should also retry transient errors in handleSSE."
When an SSE stream is interrupted and the client reconnects,
streamableClientConn.handleSSEpassed anycheckResponseerror toc.fail, permanently breaking the session. ButcheckResponseintentionally wraps transient HTTP statuses (429, 502, 503, 504) injsonrpc2.ErrRejectedso they should not break the connection — the POST path already honors this (if !errors.Is(err, jsonrpc2.ErrRejected) { c.fail(err) }), while the SSE reconnect path did not. As a result a transient 503 during a reconnect (server restart, load-balancer hiccup, load spike) poisoned the whole session — the failure mode reported in #683.Change
In the reconnect loop, when
checkResponsereturns anErrRejected-wrapped (transient) error, retry the reconnect instead of failing. Each retry counts against the existing no-progress retry budget (#679), so a persistently unavailable server still eventually gives up rather than looping forever. Non-transient errors are unchanged.Test
TestStreamableClientReconnectTransientErrorsreproduces the bug with the existingfakeStreamableServerharness: a POST tool call returns an SSE stream that emits a resumable event then ends without the response (forcing a reconnect); the reconnect returns 503 once, then the real response. The test fails before this change (... Service Unavailable (session should survive)) and passes after.Verified locally:
go test ./...,go test -race ./...,go vet ./...,gofmt -l, andstaticcheckall clean.Scope note
connectStandaloneSSEhas the same unconditionalc.failon a transient status. The standalone GET stream is optional (§2.2.3), so I scoped this PR tohandleSSE(the case #723 named). Happy to also fix the standalone path here or in a follow-up — whichever you prefer.For #683