feat(acp): emit agent_error session/update when LLM call fails after retries #26352
Open
truenorth-lj wants to merge 2 commits into
…or propagation

Phase 1B of the LLM error propagation refactor. This is the TS mirror of the Python contract merged in tn-mono PR anomalyco#721 (commit 884bcf71).

Why a new file instead of an ambient .d.ts augmentation: the SDK's SessionUpdate is a closed `type` alias (discriminated union), not an `interface`. TypeScript declaration merging only works on interfaces, so we cannot extend SessionUpdate via a `.d.ts` patch. The clean TS-native approach is a local extended type that consumers opt into.

Adds:
- LLMErrorType: closed string union of 6 categories (budget, rate_limit, provider_unavailable, context_overflow, auth, unknown)
- LLMErrorPayload: snake_case wire shape mirroring Python LLMErrorPayload; retryable is on-the-wire explicit even though derivable from type
- AgentErrorUpdate: { sessionUpdate: "agent_error"; error; stopReason? }
- SessionUpdateWithAgentError: SessionUpdate | AgentErrorUpdate (Phase 4 emit site at session/processor.ts halt() will use this)
- isRetriable(type): TS mirror of the Python is_retriable() classifier
- isAgentErrorUpdate(value): type guard for narrowing unknown frames

12 unit tests cover: classifier per type, vocabulary stability, type-guard positive/negative paths, JSON round-trip, compile-time discriminated union narrowing.

Spec: specs/20260508-llm-error-propagation/spec.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
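For readers without the diff open, the types listed in this commit message might look roughly like the sketch below. The type and function names come from the commit message. The one-member `SessionUpdate` stand-in, the `summarize` helper, the `string` type of `stopReason`, and the exact set of retriable categories are assumptions made for illustration, not the merged definitions.

```typescript
// Sketch only: names follow the commit message; field optionality and the
// retriable set are assumed.
const LLM_ERROR_TYPES = [
  "budget",
  "rate_limit",
  "provider_unavailable",
  "context_overflow",
  "auth",
  "unknown",
] as const

type LLMErrorType = (typeof LLM_ERROR_TYPES)[number]

// snake_case on the wire, mirroring the Python payload.
interface LLMErrorPayload {
  type: LLMErrorType
  retryable: boolean // explicit on the wire even though derivable from `type`
  retry_after_seconds?: number
  reset_at_epoch_ms?: number
}

interface AgentErrorUpdate {
  sessionUpdate: "agent_error"
  error: LLMErrorPayload
  stopReason?: string // assumed to be a plain string here
}

// Hypothetical one-member stand-in for the SDK's closed SessionUpdate union.
type SessionUpdate = { sessionUpdate: "agent_message_chunk"; text: string }

type SessionUpdateWithAgentError = SessionUpdate | AgentErrorUpdate

// Assumed classification: only transient categories count as retriable.
function isRetriable(type: LLMErrorType): boolean {
  return type === "rate_limit" || type === "provider_unavailable"
}

// Type guard for frames of unknown shape, e.g. freshly parsed JSON.
function isAgentErrorUpdate(value: unknown): value is AgentErrorUpdate {
  if (typeof value !== "object" || value === null) return false
  const update = value as Record<string, unknown>
  if (update.sessionUpdate !== "agent_error") return false
  const error = update.error
  if (typeof error !== "object" || error === null) return false
  const payload = error as Record<string, unknown>
  return (
    typeof payload.type === "string" &&
    (LLM_ERROR_TYPES as readonly string[]).includes(payload.type) &&
    typeof payload.retryable === "boolean"
  )
}

// The `sessionUpdate` discriminant narrows the union without casts.
function summarize(update: SessionUpdateWithAgentError): string {
  if (update.sessionUpdate === "agent_error") {
    return `error:${update.error.type}:${update.error.retryable}`
  }
  return `chunk:${update.text}`
}
```

A consumer that opts into the extended union would typically run `isAgentErrorUpdate` on a parsed frame first and then rely on the discriminant for narrowing, as `summarize` does.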
…retries

When the retry policy in session/retry.ts is exhausted, halt() updates internal state (ctx.assistantMessage.error, Bus.Session.Event.Error, EventV2.SessionEvent.Step.Failed.Sync) but never emits any ACP frame to the connected client. From an ACP client's perspective the turn is silently stuck: no stopReason, no error notification, no final session/update. The only signal is whatever timeout the client imposes locally.

Add a session.error case in acp/agent.ts handleEvent() that translates the SDK error variant into an LLMErrorPayload and emits the agent_error session/update kind with stopReason: "error". The payload prefers headers set by an upstream classifying proxy (x-llm-error-type / x-llm-error-retryable / x-llm-error-reset-at / retry-after) over status-code heuristics.

ContextOverflowError is intentionally NOT emitted: halt() routes that variant into in-process compaction; the turn continues on a smaller context window rather than ending in an error state.

Tests cover the SDK→LLMErrorPayload mapping for every error variant (APIError with classification headers, APIError status-code fallback, ContextOverflowError, ProviderAuthError, explicit retryable header override) and the integration path that pushes a session.error event through the agent's event subscription and asserts the resulting agent_error session/update.

Builds on anomalyco#26306 (which adds the agent_error SessionUpdate kind + LLMErrorPayload type definitions). Closes anomalyco#26350.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
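The emission rules in this commit message (emit exactly one `agent_error` update, skip `ContextOverflowError`, ignore unknown sessions) can be sketched as a small standalone function. This is not the code in `acp/agent.ts`: `handleSessionError`, `ClientConnection`, `AgentErrorNotification`, the `knownSessions` set, and the injected `toPayload` mapper are all hypothetical names with reduced shapes, chosen so the sketch runs on its own.

```typescript
// Hypothetical, reduced shapes; the real SDK event, error, and connection
// types are richer than this.
type SDKErrorName = "APIError" | "ContextOverflowError" | "ProviderAuthError"

interface SessionErrorEvent {
  type: "session.error"
  properties: { sessionID?: string; error?: { name: SDKErrorName } }
}

interface ErrorPayload {
  type: string
  retryable: boolean
}

interface AgentErrorNotification {
  sessionId: string
  update: { sessionUpdate: "agent_error"; error: ErrorPayload; stopReason: "error" }
}

interface ClientConnection {
  sessionUpdate(params: AgentErrorNotification): Promise<void>
}

// Returns true when a frame was sent, false when the event was skipped.
function handleSessionError(
  event: SessionErrorEvent,
  knownSessions: ReadonlySet<string>,
  connection: ClientConnection,
  toPayload: (error: { name: SDKErrorName }) => ErrorPayload,
): boolean {
  const { sessionID, error } = event.properties
  if (!sessionID || !error) return false
  // Defensive: never forward an error for a session this agent does not own.
  if (!knownSessions.has(sessionID)) return false
  // Compaction handles this variant in-process, so the turn is not over.
  if (error.name === "ContextOverflowError") return false

  // Fire and forget, logging delivery failures instead of throwing.
  connection
    .sessionUpdate({
      sessionId: sessionID,
      update: { sessionUpdate: "agent_error", error: toPayload(error), stopReason: "error" },
    })
    .catch((err) => console.error("failed to send agent_error update", err))
  return true
}
```

Returning a boolean is only a convenience for the sketch; it makes the three paths (emit, compaction skip, unknown-session skip) easy to assert against a fake connection.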
Issue for this PR
Closes #24494
Prior report: #24494 by @hancengiz, the original ACP-side observation with source-level evidence pointing to where the failure stays internal in `session/processor.ts`. This PR is the wiring change on the agent-loop side.

`halt()` never emits an ACP frame when the LLM call fails after retries are exhausted. The turn appears silently stuck to ACP clients.

Stacks on #26306, which adds the `agent_error` `SessionUpdate` kind and the `LLMErrorPayload` shape this PR uses. Diffs from that PR are included here until it merges; rebase will be clean.

Type of change
What does this PR do?
Context: When the retry policy in `session/retry.ts` is exhausted, `session/processor.ts` `halt()` updates internal state (`ctx.assistantMessage.error`, `Bus.Session.Event.Error`, `EventV2.SessionEvent.Step.Failed.Sync`) but never emits any ACP `session/update` notification or other frame to the connected client. From an ACP client's perspective the turn is silently stuck:
- no `stopReason`
- no error notification
- no final `session/update`

Symptom: send `session/prompt` to a backend that returns a non-retriable failure (or a `429`/`502` storm that exhausts the retry budget). The client receives some `agent_message_chunk`s (or zero, for an immediate failure), then nothing. The session is indistinguishable from "agent hung."

Fix: Add a `case "session.error":` branch in `packages/opencode/src/acp/agent.ts` `handleEvent()` that translates the SDK error variant into an `LLMErrorPayload` (defined in `acp/agent-error.ts` from #26306) and emits the new `agent_error` `session/update` kind with `stopReason: "error"`.

The payload extraction prefers headers set by a classifying upstream proxy:
- `x-llm-error-type`: `budget` | `rate_limit` | `provider_unavailable` | `context_overflow` | `auth` | `unknown`
- `x-llm-error-retryable`: `"true"` / `"false"` (overrides type-derived `retryable`)
- `x-llm-error-reset-at`: epoch ms, populates `reset_at_epoch_ms`
- `retry-after`: seconds, populates `retry_after_seconds`

Falls back to status-code heuristics when headers are absent (`401` → `auth`, `5xx` → `provider_unavailable`, else → `unknown`). `ProviderAuthError` and `ContextOverflowError` SDK variants are mapped to their typed forms by name.

`ContextOverflowError` is intentionally not emitted as a turn-ending error: `halt()` routes that variant into in-process compaction, and the turn continues on a smaller context window. Surfacing it as `agent_error` would race with the compaction path and produce a confusing FE state.

The existing event subscription in `acp/agent.ts` already receives `session.error` events via `Bus.subscribeAll()` → SDK SSE `/event` stream → `runEventSubscription` (the same channel that delivers `permission.asked`, `message.part.updated`, etc.); there was simply no `case` for it. No bus / SDK / event-stream wiring changes were required.

How did you verify your code works?
`packages/opencode/test/acp/halt-emits-agent-error.test.ts` (new), 10 tests covering:

`llmErrorPayloadFromSDK` unit tests (pure mapping):
- `APIError` with `x-llm-error-type=budget` headers → typed `budget` payload, `retryable=false`, `reset_at_epoch_ms` populated
- `APIError` with `x-llm-error-type=rate_limit` + `retry-after` → `retry_after_seconds` populated
- `APIError` no headers, status `503` → `provider_unavailable` retriable
- `APIError` no headers, status `401` → `auth` non-retriable
- `ContextOverflowError` variant → `context_overflow` non-retriable
- `ProviderAuthError` variant → `auth` non-retriable
- `x-llm-error-retryable: "false"` overrides type-derived retryable

Integration tests (event push → `connection.sessionUpdate` assertion via `createFakeAgent` mirroring `event-subscription.test.ts`):
- `APIError` with classification headers → exactly one `agent_error` `session/update` emitted, payload fields preserved, `stopReason: "error"`
- `ContextOverflowError` → no `agent_error` emit (compaction path)
- `session.error` with unknown sessionID → no emit (defensive, no cross-session leak)

Screenshots / recordings
N/A: purely backend / ACP wire change. Client-side rendering of the `agent_error` frame is downstream consumer work.

Checklist
- … `case` per event type, `connection.sessionUpdate(...).catch(log)` per the prevailing pattern in `handleEvent`)
- … `packages/opencode/src/acp/{agent.ts,agent-error.ts}` plus the new test file
- … `agent_error` kind is additive and is the subject of feat(acp): add AgentErrorUpdate session/update kind for typed LLM error propagation #26306)
- … `bun run typecheck` clean
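
The header-first payload extraction described under "What does this PR do?" can be sketched as follows. This is a minimal illustration, not the PR's `llmErrorPayloadFromSDK`: `APIErrorLike`, `payloadFromAPIError`, and the helper names are hypothetical, headers are assumed to arrive with lower-cased keys, the retriable set is assumed, and only the numeric form of `retry-after` is handled.

```typescript
const LLM_ERROR_TYPES = [
  "budget",
  "rate_limit",
  "provider_unavailable",
  "context_overflow",
  "auth",
  "unknown",
] as const
type LLMErrorType = (typeof LLM_ERROR_TYPES)[number]

interface LLMErrorPayload {
  type: LLMErrorType
  retryable: boolean
  retry_after_seconds?: number
  reset_at_epoch_ms?: number
}

// Hypothetical reduced view of an SDK APIError: status plus lower-cased headers.
interface APIErrorLike {
  statusCode?: number
  responseHeaders?: Record<string, string>
}

function isLLMErrorType(value: unknown): value is LLMErrorType {
  return typeof value === "string" && (LLM_ERROR_TYPES as readonly string[]).includes(value)
}

// Assumed classification: only transient categories count as retriable.
function isRetriable(type: LLMErrorType): boolean {
  return type === "rate_limit" || type === "provider_unavailable"
}

// Parses a header value as a finite number; anything else is treated as absent.
function parseNumber(raw: string | undefined): number | undefined {
  if (raw === undefined || raw.trim() === "") return undefined
  const parsed = Number(raw)
  return Number.isFinite(parsed) ? parsed : undefined
}

// Fallback used only when the proxy supplied no usable x-llm-error-type.
function classifyByStatus(status: number | undefined): LLMErrorType {
  if (status === 401) return "auth"
  if (status !== undefined && status >= 500 && status <= 599) return "provider_unavailable"
  return "unknown"
}

function payloadFromAPIError(err: APIErrorLike): LLMErrorPayload {
  const headers = err.responseHeaders ?? {}
  const headerType = headers["x-llm-error-type"]
  const type = isLLMErrorType(headerType) ? headerType : classifyByStatus(err.statusCode)

  // An explicit header wins over the value derived from the type.
  const override = headers["x-llm-error-retryable"]
  const retryable = override === "true" ? true : override === "false" ? false : isRetriable(type)

  const payload: LLMErrorPayload = { type, retryable }
  const retryAfter = parseNumber(headers["retry-after"])
  if (retryAfter !== undefined) payload.retry_after_seconds = retryAfter
  const resetAt = parseNumber(headers["x-llm-error-reset-at"])
  if (resetAt !== undefined) payload.reset_at_epoch_ms = resetAt
  return payload
}
```

Under these assumptions, a `503` with no headers maps to `provider_unavailable` with `retryable: true`, while the same error carrying `x-llm-error-retryable: "false"` keeps its type but reports `retryable: false`.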