Skip to content

Guard WebSocket lifecycle with connection generations#739

Merged
shanselman merged 5 commits into
openclaw:mainfrom
christineyan4:fix-696-websocket-generation-guard
Jun 21, 2026
Merged

Guard WebSocket lifecycle with connection generations#739
shanselman merged 5 commits into
openclaw:mainfrom
christineyan4:fix-696-websocket-generation-guard

Conversation

@christineyan4

@christineyan4 christineyan4 commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

Fixes #696.

Summary

  • add a per-connection generation guard so stale connect/listen work cannot act on newer WebSocket connections
  • make listen loops operate on their captured socket and skip stale message, disconnect, error, and reconnect handling
  • add a loopback WebSocket regression test for overlapping connects

Validation

  • .\build.ps1
  • dotnet test .\tests\OpenClaw.Shared.Tests\OpenClaw.Shared.Tests.csproj --no-restore
  • dotnet test .\tests\OpenClaw.Tray.Tests\OpenClaw.Tray.Tests.csproj --no-restore
  • dotnet test .\tests\OpenClaw.Connection.Tests\OpenClaw.Connection.Tests.csproj --no-restore
  • dotnet test .\tests\OpenClaw.Shared.Tests\OpenClaw.Shared.Tests.csproj --no-restore --filter "FullyQualifiedName~WebSocketClientBaseTests"
  • git diff --check

Review

  • Ran Hanselman-style dual-model adversarial review and addressed the actionable findings: stale text message processing and test flake risk.

Proof

WebSocketClientBase tests passing

christineyan4 and others added 2 commits June 10, 2026 12:27
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@clawsweeper

clawsweeper Bot commented Jun 10, 2026

Copy link
Copy Markdown

Codex review: needs changes before merge. Reviewed June 21, 2026, 6:35 PM ET / 22:35 UTC.

Summary
The branch adds connection-generation ownership checks to shared WebSocket connect/listen/reconnect handling and adds loopback regression tests for overlapping connects and reconnect takeover.

Reproducibility: yes. source-level. A failed current ConnectAsync can start ReconnectWithBackoffAsync() without owner context, then a later successful ConnectAsync can be disposed when the stale backoff resumes.

Review metrics: 2 noteworthy metrics.

  • Patch surface: 2 files changed, +465/-26. The production change is concentrated in one shared WebSocket base that affects both operator and node connections.
  • Proof sources: 1 terminal screenshot, 2 reporter live-validation comments. The central self-hosted WSS fix has useful real-world support, while the remaining blocker is a source-level edge case.

Root-cause cluster
Relationship: fixed_by_candidate
Canonical: #696
Summary: This PR is the focused candidate fix for the open self-hosted WSS reconnect/re-approval loop; broader lifecycle PRs overlap but are not a safe canonical replacement for this narrow work.

Members:

Proposal only: this assessment does not dispatch repair, suppress jobs, mutate sibling items, close, or merge anything.

Merge readiness
Overall: 🦪 silver shellfish
Proof: 🦞 diamond lobster
Patch quality: 🦪 silver shellfish
Result: blocked by patch quality or review findings.

Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch.

Rank-up moves:

  • Scope failed-connect reconnect backoff to the initiating socket/generation.
  • [P2] Add a loopback regression where a failed connect starts backoff and a newer connect succeeds before the stale loop resumes.
  • [P2] Update validation after the fix with the required build, shared tests, tray tests, and targeted WebSocketClientBase test run.

Risk before merge

  • [P1] A failed current connect can still start an unowned reconnect backoff loop; if a newer connect succeeds before the delay expires, the stale loop can dispose or replace the newer operator or node socket.
  • [P1] The PR changes shared WebSocket lifecycle behavior used by both operator and node clients, so the repair needs focused tests for both default compatibility and stale-generation takeover cases before merge.

Maintainer options:

  1. Carry owner context through failed-connect backoff (recommended)
    Pass or otherwise retain the initiating socket and generation for failed-connect reconnect starts, abort stale backoff before shared-state mutation, and add a regression where a newer connect wins while the failed connect is waiting.
  2. Pause broader lifecycle merge
    If maintainers do not want another lifecycle repair on this branch, pause merge and keep the WebSocketBase lifecycle work tracked with the existing related lifecycle PR discussion.
Copy recommended automerge instruction
@clawsweeper automerge

Special instructions:
Carry socket/generation ownership through failed-connect reconnect backoff and any other unowned reconnect kickoff in the shared WebSocket client, abort superseded loops before disposing or replacing `_webSocket`, and add a focused loopback regression where a failed connect enters backoff and a newer `ConnectAsync` succeeds before the stale loop resumes.

Next step before merge

  • [P2] A narrow source-level blocker remains in one shared reconnect path, with clear files and regression-test shape for an automated repair attempt.

Security
Cleared: The diff changes two C# source/test files and introduces no concrete credential, authorization, dependency, workflow, or supply-chain concern.

Review findings

  • [P1] Scope failed-connect reconnects to their generation — src/OpenClaw.Shared/WebSocketClientBase.cs:196
Review details

Best possible solution:

Land the focused generation guard only after every reconnect kickoff is generation-aware and targeted WebSocketClientBase tests cover listener, failed-connect, and closing-socket takeover cases.

Do we have a high-confidence way to reproduce the issue?

Yes, source-level. A failed current ConnectAsync can start ReconnectWithBackoffAsync() without owner context, then a later successful ConnectAsync can be disposed when the stale backoff resumes.

Is this the best way to solve the issue?

No, not yet. Generation guarding is the right ownership boundary, but the fix needs to carry that ownership through every reconnect kickoff, including failed-connect backoff.

Full review comments:

  • [P1] Scope failed-connect reconnects to their generation — src/OpenClaw.Shared/WebSocketClientBase.cs:196
    The new reconnect owner checks only work when a socket/generation is passed in. This catch path still starts ReconnectWithBackoffAsync() with no owner, so if a later ConnectAsync succeeds during the delay the stale loop is still allowed to dispose or replace the newer _webSocket. Carry the failed connect's socket/generation into backoff, and add the failed-connect takeover regression.
    Confidence: 0.88

Overall correctness: patch is incorrect
Overall confidence: 0.88

AGENTS.md: found and applied where relevant.

Codex review notes: model internal, reasoning high; reviewed against ae61c99bd858.

Label changes

Label justifications:

  • P2: This is a bounded but real connection-stability fix for a self-hosted WSS workflow, not a confirmed repository-wide outage.
  • merge-risk: 🚨 availability: The current diff can still leave stale reconnect backoff able to close or replace a newer healthy WebSocket connection.
  • rating: 🦪 silver shellfish: Overall readiness is 🦪 silver shellfish; proof is 🦞 diamond lobster and patch quality is 🦪 silver shellfish.
  • status: ⏳ waiting on author: ClawSweeper has contributor-facing work open and is waiting for author action. Sufficient (live_output): The PR has a terminal screenshot for targeted tests and linked live reporter output showing the branch fixed the self-hosted WSS reconnect/re-approval loop during forced drops and a cold restart.
  • proof: sufficient: Contributor real behavior proof is sufficient. The PR has a terminal screenshot for targeted tests and linked live reporter output showing the branch fixed the self-hosted WSS reconnect/re-approval loop during forced drops and a cold restart.
Evidence reviewed

Acceptance criteria:

  • [P1] ./build.ps1.
  • [P1] dotnet test ./tests/OpenClaw.Shared.Tests/OpenClaw.Shared.Tests.csproj --no-restore.
  • [P1] dotnet test ./tests/OpenClaw.Tray.Tests/OpenClaw.Tray.Tests.csproj --no-restore.
  • [P1] dotnet test ./tests/OpenClaw.Connection.Tests/OpenClaw.Connection.Tests.csproj --no-restore.
  • [P1] dotnet test ./tests/OpenClaw.Shared.Tests/OpenClaw.Shared.Tests.csproj --no-restore --filter "FullyQualifiedName~WebSocketClientBaseTests".

What I checked:

  • PR head adds generation ownership: The head branch increments _connectionGeneration, captures the socket in ConnectAsync, passes socket/generation into ListenForMessagesAsync, and checks IsCurrentConnection before stale listen work handles messages, disconnects, errors, or reconnects. (src/OpenClaw.Shared/WebSocketClientBase.cs:115, 6670bee4f94e)
  • Remaining unowned reconnect path: In the catch path for a current connect failure, the branch still calls ReconnectWithBackoffAsync() with no expected socket/generation. Because expectedSocket is null, IsReconnectOwner treats the loop as owner even if a newer open socket appears during the delay, allowing stale backoff to dispose or replace that newer socket. (src/OpenClaw.Shared/WebSocketClientBase.cs:196, 6670bee4f94e)
  • Targeted tests cover adjacent races: The branch adds loopback tests for stale connect/listener takeover, stale reconnect backoff not disposing a newer connection, failed retry continuation, and closing-socket recovery, but it does not cover failed initial connect entering backoff before a newer connect succeeds. (tests/OpenClaw.Shared.Tests/WebSocketClientBaseTests.cs:284, 6670bee4f94e)
  • Current main still lacks the generation guard: The checked-out main branch still has the original mutable _webSocket connect/listen/reconnect flow with no _connectionGeneration, so this PR is not obsolete on main. (src/OpenClaw.Shared/WebSocketClientBase.cs:106, ae61c99bd858)
  • Related issue has live proof for the central fix: The linked report says this PR was built and run against a self-hosted WSS gateway over Tailscale, with forced transient drops and a cold node restart producing zero re-approval requests, policy violations, or rate limits after the generation-guard change.
  • Repository policy and architecture context applied: AGENTS.md directs connection changes through the connection architecture docs; those docs place WebSocketClientBase in OpenClaw.Shared as the low-level WebSocket transport under GatewayConnectionManager-owned lifecycle behavior. (docs/CONNECTION_ARCHITECTURE.md:17, ae61c99bd858)

Likely related people:

  • shanselman: Commit 76f7811 extracted WebSocketClientBase, and later history plus the latest PR head commit touch the same shared WebSocket lifecycle area. (role: introduced shared base and recent area contributor; confidence: high; commits: 76f7811a1448, d23f8ca50013, 6670bee4f94e; files: src/OpenClaw.Shared/WebSocketClientBase.cs, tests/OpenClaw.Shared.Tests/WebSocketClientBaseTests.cs)
  • Sytone: Commit 1d83639 changed reconnect hardening and adjacent gateway lifecycle behavior in WebSocketClientBase and OpenClawGatewayClient. (role: reconnect hardening contributor; confidence: high; commits: 1d836390e9ee; files: src/OpenClaw.Shared/WebSocketClientBase.cs, src/OpenClaw.Shared/OpenClawGatewayClient.cs)
  • RBrid: Recent main history shows commit 753828f editing exception handling in the shared WebSocket reconnect/listen path, and PR history shows related broader lifecycle work in this same area. (role: recent shared WebSocket contributor; confidence: medium; commits: 753828f63e96; files: src/OpenClaw.Shared/WebSocketClientBase.cs, src/OpenClaw.Shared/OpenClawGatewayClient.cs)
  • ranjeshj: Merged PR Harden gateway connection and pairing flows #741 and recent history cover gateway connection, pairing, stale-client, and token-recovery flows adjacent to this shared WebSocket fix. (role: adjacent connection lifecycle contributor; confidence: high; commits: 5d14408f3947, 9d304c110c0f; files: src/OpenClaw.Connection/GatewayConnectionManager.cs, src/OpenClaw.Connection/NodeConnector.cs, src/OpenClaw.Shared/OpenClawGatewayClient.cs)
What the crustacean ranks mean
  • 🦀 challenger crab: rare, exceptional readiness with strong proof, clean implementation, and convincing validation.
  • 🦞 diamond lobster: very strong readiness with only minor maintainer review expected.
  • 🐚 platinum hermit: good normal PR, likely mergeable with ordinary maintainer review.
  • 🦐 gold shrimp: useful signal, but proof or patch confidence is still limited.
  • 🦪 silver shellfish: thin signal; proof, validation, or implementation needs work.
  • 🧂 unranked krab: not merge-ready because proof is missing/unusable or there are serious correctness or safety concerns.
  • 🌊 off-meta tidepool: rating does not apply to this item.

Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics.

How this review workflow works
  • ClawSweeper keeps one durable marker-backed review comment per issue or PR.
  • Re-runs edit this comment so the latest verdict, findings, and automation markers stay together instead of adding duplicate bot comments.
  • A fresh review can be triggered by eligible @clawsweeper re-review comments, exact-item GitHub events, scheduled/background review runs, or manual workflow dispatch.
  • PR/issue authors and users with repository write access can comment @clawsweeper re-review or @clawsweeper re-run on an open PR or issue to request a fresh review only.
  • Maintainers can also comment @clawsweeper review to request a fresh review only.
  • Fresh-review commands do not start repair, autofix, rebase, CI repair, or automerge.
  • Maintainer-only repair and merge flows require explicit commands such as @clawsweeper autofix, @clawsweeper automerge, @clawsweeper fix ci, or @clawsweeper address review.
  • Maintainers can comment @clawsweeper explain to ask for more context, or @clawsweeper stop to stop active automation.

@clawsweeper clawsweeper Bot added proof: sufficient Contributor real behavior proof is sufficient. rating: 🦪 silver shellfish Thin PR readiness signal; proof, validation, or implementation needs work. status: ⏳ waiting on author ClawSweeper has contributor-facing work open and is waiting for author action. P2 Normal priority bug or improvement with limited blast radius. merge-risk: 🚨 availability 🚨 Merging this PR could cause crashes, hangs, restart loops, stalls, or process outages. labels Jun 10, 2026
christineyan4 and others added 2 commits June 10, 2026 15:35
Bind listener-initiated reconnect loops to the socket generation that requested them so a stale loop cannot dispose or replace a newer successful connection after backoff delay.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@christineyan4

Copy link
Copy Markdown
Contributor Author

@clawsweeper re-review

@clawsweeper

clawsweeper Bot commented Jun 10, 2026

Copy link
Copy Markdown

🦞🧹
ClawSweeper re-review requested.

I asked ClawSweeper to review this item again.
Action: item re-review queued (workflow sweep.yml, event repository_dispatch).
Result: the existing ClawSweeper review comment will be edited in place when the review finishes.

Guard reconnect loops so failed retries and closing current sockets do not strand the websocket client while still avoiding active newer sockets.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

@shanselman shanselman left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maintainer validation looks good. I added and validated two small reconnect-loop hardening tests/fixes so stale generation guards do not strand reconnects after failed retries or current close handshakes.

@shanselman shanselman merged commit 17d0637 into openclaw:main Jun 21, 2026
12 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

merge-risk: 🚨 availability 🚨 Merging this PR could cause crashes, hangs, restart loops, stalls, or process outages. P2 Normal priority bug or improvement with limited blast radius. proof: sufficient Contributor real behavior proof is sufficient. rating: 🦪 silver shellfish Thin PR readiness signal; proof, validation, or implementation needs work. status: ⏳ waiting on author ClawSweeper has contributor-facing work open and is waiting for author action.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Reconnect / "re-approval, repair" loop connecting to a pre-existing (self-hosted) gateway over wss

2 participants