Skip to content

feat(acp): emit agent_error session/update when LLM call fails after retries#26352

Open
truenorth-lj wants to merge 2 commits into
anomalyco:devfrom
truenorth-lj:feat/halt-emits-agent-error
Open

feat(acp): emit agent_error session/update when LLM call fails after retries#26352
truenorth-lj wants to merge 2 commits into
anomalyco:devfrom
truenorth-lj:feat/halt-emits-agent-error

Conversation

@truenorth-lj
Copy link
Copy Markdown

@truenorth-lj truenorth-lj commented May 8, 2026

Issue for this PR

Closes #24494

Prior report: #24494 by @hancengiz — the original ACP-side observation with source-level evidence pointing to where the failure stays internal in session/processor.ts. This PR is the wiring change on the agent-loop side. — halt() never emits an ACP frame when the LLM call fails after retries are exhausted. The turn appears silently stuck to ACP clients.

Stacks on #26306, which adds the agent_error SessionUpdate kind and the LLMErrorPayload shape this PR uses. Diffs from that PR are included here until it merges; rebase will be clean.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement / new feature (non-breaking change which adds functionality)

What does this PR do?

Context: When the retry policy in session/retry.ts is exhausted, session/processor.ts halt() updates internal state (ctx.assistantMessage.error, Bus.Session.Event.Error, EventV2.SessionEvent.Step.Failed.Sync) but never emits any ACP session/update notification or other frame to the connected client. From an ACP client's perspective the turn is silently stuck:

  • No stopReason
  • No error notification
  • No final session/update

Symptom: send session/prompt to a backend that returns a non-retriable failure (or a 429/502 storm that exhausts the retry budget). Client emits some agent_message_chunks (or zero, for an immediate failure), then nothing. The session looks indistinguishable from "agent hung."

Fix: Add a case "session.error": branch in packages/opencode/src/acp/agent.ts handleEvent() that translates the SDK error variant into an LLMErrorPayload (defined in acp/agent-error.ts from #26306) and emits the new agent_error session/update kind with stopReason: "error".

The payload extraction prefers headers set by a classifying upstream proxy:

  • x-llm-error-type: budget | rate_limit | provider_unavailable | context_overflow | auth | unknown
  • x-llm-error-retryable: "true" / "false" (overrides type-derived retryable)
  • x-llm-error-reset-at: epoch ms — populates reset_at_epoch_ms
  • retry-after: seconds — populates retry_after_seconds

Falls back to status-code heuristics when headers are absent (401auth, 5xxprovider_unavailable, else → unknown). ProviderAuthError and ContextOverflowError SDK variants are mapped to their typed forms by name.

ContextOverflowError is intentionally not emitted as a turn-ending error — halt() routes that variant into in-process compaction; the turn continues on a smaller context window. Surfacing it as agent_error would race with the compaction path and produce a confusing FE state.

The existing event subscription in acp/agent.ts already receives session.error events via Bus.subscribeAll() → SDK SSE /event stream → runEventSubscription (the same channel that delivers permission.asked, message.part.updated, etc.) — there was simply no case for it. No bus / SDK / event-stream wiring changes were required.

How did you verify your code works?

packages/opencode/test/acp/halt-emits-agent-error.test.ts (new) — 10 tests covering:

llmErrorPayloadFromSDK unit tests (pure mapping):

  • APIError with x-llm-error-type=budget headers → typed budget payload, retryable=false, reset_at_epoch_ms populated
  • APIError with x-llm-error-type=rate_limit + retry-afterretry_after_seconds populated
  • APIError no headers, status 503provider_unavailable retriable
  • APIError no headers, status 401auth non-retriable
  • ContextOverflowError variant → context_overflow non-retriable
  • ProviderAuthError variant → auth non-retriable
  • Explicit x-llm-error-retryable: "false" overrides type-derived retryable

Integration tests (event push → connection.sessionUpdate assertion via createFakeAgent mirroring event-subscription.test.ts):

  • APIError with classification headers → exactly one agent_error session/update emitted, payload fields preserved, stopReason: "error"
  • ContextOverflowError → no agent_error emit (compaction path)
  • session.error with unknown sessionID → no emit (defensive, no cross-session leak)
$ bun test test/acp/halt-emits-agent-error.test.ts
 10 pass
 0 fail
 23 expect() calls
Ran 10 tests across 1 file. [4.89s]

$ bun test test/acp/
 33 pass
 0 fail
 79 expect() calls
Ran 33 tests across 4 files. [1.60s]

$ bun run typecheck
 Tasks:    12 successful, 12 total

Screenshots / recordings

N/A — purely backend / ACP wire change. Client-side rendering of the agent_error frame is downstream consumer work.

Checklist

  • All tests pass locally
  • Code follows the existing event-handling style (case per event type, connection.sessionUpdate(...).catch(log) per the prevailing pattern in handleEvent)
  • Diff confined to packages/opencode/src/acp/{agent.ts,agent-error.ts} plus the new test file
  • No new dependencies, no schema migrations, no protocol breaking changes (the new agent_error kind is additive and is the subject of feat(acp): add AgentErrorUpdate session/update kind for typed LLM error propagation #26306)
  • bun run typecheck clean

truenorth-lj and others added 2 commits May 8, 2026 15:53
…or propagation

Phase 1B of the LLM error propagation refactor. This is the TS mirror
of the Python contract merged in tn-mono PR anomalyco#721 (commit 884bcf71).

Why a new file instead of an ambient .d.ts augmentation: the SDK's
SessionUpdate is a closed `type` alias (discriminated union), not an
`interface`. TypeScript declaration merging only works on interfaces,
so we cannot extend SessionUpdate via a `.d.ts` patch. The clean
TS-native approach is a local extended type that consumers opt into.

Adds:
- LLMErrorType: closed string union of 6 categories (budget, rate_limit,
  provider_unavailable, context_overflow, auth, unknown)
- LLMErrorPayload: snake_case wire shape mirroring Python LLMErrorPayload;
  retryable is on-the-wire explicit even though derivable from type
- AgentErrorUpdate: { sessionUpdate: "agent_error"; error; stopReason? }
- SessionUpdateWithAgentError: SessionUpdate | AgentErrorUpdate (Phase 4
  emit site at session/processor.ts halt() will use this)
- isRetriable(type): TS mirror of the Python is_retriable() classifier
- isAgentErrorUpdate(value): type guard for narrowing unknown frames

12 unit tests cover: classifier per type, vocabulary stability, type-guard
positive/negative paths, JSON round-trip, compile-time discriminated
union narrowing.

Spec: specs/20260508-llm-error-propagation/spec.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…retries

When the retry policy in session/retry.ts is exhausted, halt() updates
internal state (ctx.assistantMessage.error, Bus.Session.Event.Error,
EventV2.SessionEvent.Step.Failed.Sync) but never emits any ACP frame
to the connected client. From an ACP client's perspective the turn is
silently stuck — no stopReason, no error notification, no final
session/update — and the only signal is whatever timeout the client
imposes locally.

Add a session.error case in acp/agent.ts handleEvent() that translates
the SDK error variant into an LLMErrorPayload and emits the
agent_error session/update kind with stopReason: "error". The payload
prefers headers set by an upstream classifying proxy
(x-llm-error-type / x-llm-error-retryable / x-llm-error-reset-at /
retry-after) over status-code heuristics.

ContextOverflowError is intentionally NOT emitted — halt() routes that
variant into in-process compaction; the turn continues on a smaller
context window rather than ending in an error state.

Tests cover the SDK→LLMErrorPayload mapping for every error variant
(APIError with classification headers, APIError status-code fallback,
ContextOverflowError, ProviderAuthError, explicit retryable header
override) and the integration path that pushes a session.error event
through the agent's event subscription and asserts the resulting
agent_error session/update.

Builds on anomalyco#26306 (which adds the agent_error SessionUpdate kind +
LLMErrorPayload type definitions). Closes anomalyco#26350.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 8, 2026

The following comment was made by an LLM, it may be inaccurate:

Based on my search, PR #26306 is a related PR but not a duplicate:

The other PRs found (26292, 26192, 19116, 24369, 18443) address different aspects of retry/error handling but are not duplicates of the current PR.

Conclusion: No duplicate PRs found. The only related PR is #26306 (a required dependency that this PR builds upon).

@truenorth-lj truenorth-lj changed the title feat(acp): emit agent_error session/update when LLM call fails after retries (closes #26350) feat(acp): emit agent_error session/update when LLM call fails after retries May 8, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

ACP adapter returns end_turn even when assistant message has internal error

1 participant