Skip to content

Non-atomic final assistant persist / activeStreamId clear can replay the response on refresh #845

@PentaTea

Description

@PentaTea

Summary

In apps/web/app/workflows/chat.ts, the current success path on main persists the final assistant snapshot before clearing chats.activeStreamId. Those are two separate DB writes with no ordering guarantee beyond control flow:

  • the DB already contains the final assistant message
  • activeStreamId still points at the finishing workflow run

That intermediate state is observable by the client on refresh and produces a user-visible duplicate-response bug. It also leaves a silent-loss path because persistAssistantMessage catches DB errors and continues.

Simplified current success path:

await persistAssistantMessage(options.chatId, pendingAssistantResponse);

// auto-commit / auto-PR / sandbox state updates may run here
// and extend the window significantly

if (didUpdateGitData) {
  await persistAssistantMessage(options.chatId, pendingAssistantResponse);
}

await Promise.all([
  clearActiveStream(options.chatId, workflowRunId),
  sendFinish(writable).then(() => closeStream(writable)),
]);

Failure mode 1 — Duplicate final response on refresh

Between the last successful assistant persist and clearActiveStream(...):

  • the page loads initialMessages from DB, including the final assistant message
  • chat.activeStreamId is still non-null, so the client resumes /api/chat/{id}/stream
  • the durable workflow stream replays the run on top of the already-hydrated assistant message
  • on the AI SDK version used here, at minimum reasoning parts are reconciled duplicates on replay (see Workflow stream resume can replay OpenAI reasoning rs_* items and poison chat history #545); in practice the full final response visibly renders a second time

Failure mode 2 — Final assistant state can be silently lost

persistAssistantMessage(...) currently catches DB errors and only logs them.

If the final assistant persist fails, but the workflow still proceeds to clearActiveStream(...) and sends finish, the chat can end up in this state:

  • no final assistant snapshot in DB
  • activeStreamId already cleared
  • the client already saw the response live and transitioned to a finished state

After a hard refresh, the final assistant response is missing from durable chat history and there is no active stream left to resume.

Reproduction

Natural path (no code changes)

  1. Enable autoCommitEnabled and/or autoCreatePrEnabled on a chat with repo context.
  2. In chat.ts the final assistant persist runs before the auto-commit / auto-PR steps, while clearActiveStream(...) runs after them — the window is the full duration of those steps (seconds to tens of seconds depending on repo size and network).
  3. Send a prompt, wait until the assistant text is visible, then hard refresh during the auto-commit / auto-PR phase.
  4. Observe the same assistant response rendered a second time on top of the DB-hydrated message.

Deterministic repro

  1. Insert await new Promise(r => setTimeout(r, 3000)); immediately before the final Promise.all([ clearActiveStream(...), sendFinish(...), ... ]) in the workflow success path.
  2. Send a prompt and wait until the assistant response is visible.
  3. Hard refresh during the sleep.
  4. Same duplicate rendering as above, reliably.

Silent-loss path

Force the final upsertChatMessageScoped(...) write to fail (e.g. inject a thrown error in persistAssistantMessage). The workflow still clears the slot and sends finish, and the final assistant message is not durable.

Why ordering alone doesn't fix this

Swapping the order just moves the bad window:

  • persist first, clear later: duplicate-on-refresh window
  • clear first, persist later: missing-message / silent-loss window if persist fails

The invariant that matters is:

  • once the final assistant snapshot is durable, that run should no longer be resumable
  • while the run is still resumable, the final assistant snapshot should not yet be durable

Only a single DB transaction closes that gap.

Proposed fix

Add a helper that atomically upserts the final assistant snapshot and clears the active stream slot with CAS semantics:

export async function upsertChatMessageAndClearActiveStream(
  data: NewChatMessage,
  workflowRunId: string,
): Promise<UpsertChatMessageScopedResult> {
  return db.transaction(async (tx) => {
    // existing scoped upsert logic...

    await tx
      .update(chats)
      .set({ activeStreamId: null })
      .where(
        and(
          eq(chats.id, data.chatId),
          eq(chats.activeStreamId, workflowRunId),
        ),
      );

    return result;
  });
}

Then in the workflow:

  • keep activeStreamId claimed until the last point the assistant message can still change (i.e. after any auto-commit / auto-PR data-part updates)
  • replace the final persistAssistantMessage(...) + clearActiveStream(...) pair with the transactional helper
  • after that transaction commits, emit finish and close the stream
  • keep bare clearActiveStream(...) only as a fallback for abort / error paths where there is no final assistant snapshot to persist

This also makes the silent-loss path benign: if the transaction fails, activeStreamId stays set and the client can resume on refresh instead of landing in a "no message, no stream" terminal state.

Relation to existing issues

This is narrower than #526 (rethink long-running stream replay) and orthogonal to #545 (reasoning-part duplication on resume):

A single DB transaction is sufficient and independent of the transport-level decisions in #526 / #545.

Notes

Observed and verified locally on main. Happy to turn the fix into a PR if this aligns with your preferred direction.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions