Skip to content

Harden tray<->gateway keep-alive and reconnect lifecycle#627

Closed
RBrid wants to merge 2 commits into
openclaw:mainfrom
RBrid:user/rbrid/KeepAliveFixes2
Closed

Harden tray<->gateway keep-alive and reconnect lifecycle#627
RBrid wants to merge 2 commits into
openclaw:mainfrom
RBrid:user/rbrid/KeepAliveFixes2

Conversation

@RBrid

@RBrid RBrid commented Jun 2, 2026

Copy link
Copy Markdown
Contributor

WebSocketClientBase:

  • Add _disposing flag distinct from _disposed; set before OnDisposing, promote to _disposed after, so teardown callbacks see a stable state.
  • Gate ConnectAsync, ReconnectWithBackoffAsync, listener-finally (CAS/event/reconnect-kickoff), and reconnect kickoff sites on (!_disposed && !_disposing) to prevent post-Dispose work.
  • Re-check disposal after OnConnectedAsync and before spawning the listener Task so a Dispose racing the connect path does not leak a background listener.
  • CAS-clear + Abort() + Dispose() any ClientWebSocket installed into _webSocket if disposal wins the race during ConnectAsync; mirror in the orphan-clean else-if branch.
  • Abort() before Dispose() on owned sockets so peers see a clean RST instead of an unsent CLOSE.
  • Volatile reads on shutdown flags in HeartbeatLoopAsync to avoid cache staleness across cores.
  • Catch ObjectDisposedException narrowly in SendRawAsync and CloseWebSocketAsync so torn-down sockets do not surface as errors.
  • Guard OnDisposing with try/catch so a throwing subclass cannot skip later cleanup steps.
  • Per-event try/catch wrappers around status/error event raises so a throwing subscriber cannot block teardown.

OpenClawGatewayClient:

  • Apply matching reconnect/teardown hygiene around the keep-alive and heartbeat paths so connection state stays consistent across forced disconnects.

Tests:

  • Relax reconnect-backoff log assertion to tolerate jitter in the delay value (still asserts attempt number).

Validation:

  • ./build.ps1 clean (0/0)
  • Shared.Tests: 2045 passed / 29 skipped
  • Tray.Tests: 877 passed
  • Manual: tray launched from net10.0-windows10.0.22621.0 build

@clawsweeper

clawsweeper Bot commented Jun 2, 2026

Copy link
Copy Markdown

Codex review: needs real behavior proof before merge. Reviewed June 18, 2026, 8:19 PM ET / 00:19 UTC.

Summary
The PR changes shared WebSocket connection/disposal/reconnect handling, adds an operator gateway application heartbeat, and updates one shared WebSocket test file.

Reproducibility: Source-reproducible, but not live-reproduced in this review. The linked issue provides a real reconnect/re-approval scenario, and the heartbeat finding follows from the PR source path where missing req/ping responses abort the socket.

Review metrics: 2 noteworthy metrics.

  • Patch surface: 3 files, +605/-99. The PR changes shared connection lifecycle code plus tests, so small mistakes can affect both operator and node WebSocket behavior.
  • Merge state: draft, conflicting. The final merged behavior cannot be reviewed until the branch is refreshed against current main.

Merge readiness
Overall: 🧂 unranked krab
Proof: 🧂 unranked krab
Patch quality: 🦪 silver shellfish
Result: blocked until real behavior proof is added.

Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch.

Rank-up moves:

  • [P1] Add redacted terminal output, logs, or a short recording showing stable reconnect/keep-alive behavior against a real gateway after the change.
  • Make the application heartbeat safe for gateways that do not support tray-initiated ping.
  • Refresh the branch against current main and resolve the conflict without dropping current connection and pairing fixes.

Proof guidance:

  • [P1] Needs real behavior proof before merge: The PR body lists validation commands and a tray launch note, but it does not include inspectable after-fix proof of the reconnect/keep-alive behavior in a real setup. After adding proof, update the PR body; ClawSweeper should re-review automatically. If it does not, the PR author or someone with repository write access can comment @clawsweeper re-review.

Risk before merge

  • [P1] The new operator application heartbeat can create a reconnect loop for existing gateways that do not answer tray-initiated req/ping; current local docs/tests do not prove that method is a supported operator RPC.
  • [P1] The PR is draft and conflicting against current main, so the branch must be refreshed while preserving current pairing-scope and merged connection hardening from PR Harden gateway connection and pairing flows #741.
  • [P1] The PR body lists build/test/manual claims, but it does not provide inspectable after-fix logs, terminal output, or a recording of the reconnect/keep-alive behavior against a real gateway.

Maintainer options:

  1. Make the heartbeat protocol-compatible (recommended)
    Use a documented/supported liveness RPC, capability-detect support, or keep the new application heartbeat disabled when a gateway cannot prove it answers the request.
  2. Refresh and re-review the branch
    Resolve the current main conflict and show the merge result preserves current pairing scopes and the merged connection fixes from PR Harden gateway connection and pairing flows #741.
  3. Pause the broader rewrite
    If maintainers only want the narrower generation-guard fix now, pause or close this broader keep-alive/disposal rewrite and keep that work tracked separately.

Next step before merge

  • [P1] The remaining blockers are contributor proof, branch refresh, and maintainer review of the heartbeat protocol compatibility rather than a safe automated repair lane.

Security
Cleared: No concrete security or supply-chain concern was found in the three-file C# lifecycle/test patch.

Review findings

  • [P1] Gate the application heartbeat on a supported RPC — src/OpenClaw.Shared/OpenClawGatewayClient.cs:208
Review details

Best possible solution:

Refresh on current main, preserve the merged connection/pairing fixes, make the heartbeat compatible with unsupported gateways or a documented health RPC, add focused coverage, and attach redacted real-gateway proof.

Do we have a high-confidence way to reproduce the issue?

Source-reproducible, but not live-reproduced in this review. The linked issue provides a real reconnect/re-approval scenario, and the heartbeat finding follows from the PR source path where missing req/ping responses abort the socket.

Is this the best way to solve the issue?

No, not yet. The lifecycle ownership direction is plausible, but the default application heartbeat needs a compatibility-safe contract and the branch needs current-main refresh plus real behavior proof.

Full review comments:

  • [P1] Gate the application heartbeat on a supported RPC — src/OpenClaw.Shared/OpenClawGatewayClient.cs:208
    This enables the heartbeat for every gateway connection, and SendApplicationPingAsync sends req/ping before aborting the socket on timeout. I could not find a current operator-gateway contract for tray-initiated ping; an existing self-hosted gateway that does not answer that method can be forcibly reconnected every heartbeat interval. Use a documented RPC, capability detection, or leave the application heartbeat off for unsupported gateways.
    Confidence: 0.82

Overall correctness: patch is incorrect
Overall confidence: 0.78

AGENTS.md: found and applied where relevant.

Codex review notes: model internal, reasoning high; reviewed against c0514bd2d026.

Label changes

Label changes:

  • add rating: 🧂 unranked krab: Overall readiness is 🧂 unranked krab; proof is 🧂 unranked krab and patch quality is 🦪 silver shellfish.
  • add status: 📣 needs proof: The PR needs real behavior proof before ClawSweeper can clear the contributor ask. Needs real behavior proof before merge: The PR body lists validation commands and a tray launch note, but it does not include inspectable after-fix proof of the reconnect/keep-alive behavior in a real setup. After adding proof, update the PR body; ClawSweeper should re-review automatically. If it does not, the PR author or someone with repository write access can comment @clawsweeper re-review.
  • remove rating: 🌊 off-meta tidepool: Current PR rating is rating: 🧂 unranked krab, so this older rating label is no longer current.

Label justifications:

  • P2: The PR targets a bounded but important connection-stability problem rather than a confirmed repository-wide outage.
  • merge-risk: 🚨 compatibility: The branch can change behavior for existing gateways by requiring a tray-initiated heartbeat response that is not established as a supported operator RPC.
  • merge-risk: 🚨 availability: If the heartbeat assumption is wrong, users can see forced reconnects or reconnect loops after merge.
  • rating: 🧂 unranked krab: Overall readiness is 🧂 unranked krab; proof is 🧂 unranked krab and patch quality is 🦪 silver shellfish.
  • status: 📣 needs proof: The PR needs real behavior proof before ClawSweeper can clear the contributor ask. Needs real behavior proof before merge: The PR body lists validation commands and a tray launch note, but it does not include inspectable after-fix proof of the reconnect/keep-alive behavior in a real setup. After adding proof, update the PR body; ClawSweeper should re-review automatically. If it does not, the PR author or someone with repository write access can comment @clawsweeper re-review.
Evidence reviewed

What I checked:

Likely related people:

  • ranjeshj: Merged PR Harden gateway connection and pairing flows #741 and recent history touch gateway connection, pairing, stale-client, and token-recovery flows adjacent to this PR. (role: recent connection and pairing contributor; confidence: high; commits: ea36b12f9e4c; files: src/OpenClaw.Connection/GatewayConnectionManager.cs, src/OpenClaw.Shared/OpenClawGatewayClient.cs, src/OpenClaw.Shared/WindowsNodeClient.cs)
  • RBrid: Current main history shows recent changes in OpenClawGatewayClient/WebSocketClientBase, and the same person authored this PR, so they have direct area context beyond this branch. (role: recent shared gateway client contributor; confidence: high; commits: 753828f63e96, 5505a85da7df; files: src/OpenClaw.Shared/WebSocketClientBase.cs, src/OpenClaw.Shared/OpenClawGatewayClient.cs)
  • shanselman: Recent main history includes WebSocketClientBase hardening/annotation work, and related issue history ties earlier lifecycle work to this same shared base. (role: shared WebSocket area contributor; confidence: medium; commits: d23f8ca50013; files: src/OpenClaw.Shared/WebSocketClientBase.cs, tests/OpenClaw.Shared.Tests/WebSocketClientBaseTests.cs)
What the crustacean ranks mean
  • 🦀 challenger crab: rare, exceptional readiness with strong proof, clean implementation, and convincing validation.
  • 🦞 diamond lobster: very strong readiness with only minor maintainer review expected.
  • 🐚 platinum hermit: good normal PR, likely mergeable with ordinary maintainer review.
  • 🦐 gold shrimp: useful signal, but proof or patch confidence is still limited.
  • 🦪 silver shellfish: thin signal; proof, validation, or implementation needs work.
  • 🧂 unranked krab: not merge-ready because proof is missing/unusable or there are serious correctness or safety concerns.
  • 🌊 off-meta tidepool: rating does not apply to this item.

Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics.

How this review workflow works
  • ClawSweeper keeps one durable marker-backed review comment per issue or PR.
  • Re-runs edit this comment so the latest verdict, findings, and automation markers stay together instead of adding duplicate bot comments.
  • A fresh review can be triggered by eligible @clawsweeper re-review comments, exact-item GitHub events, scheduled/background review runs, or manual workflow dispatch.
  • PR/issue authors and users with repository write access can comment @clawsweeper re-review or @clawsweeper re-run on an open PR or issue to request a fresh review only.
  • Maintainers can also comment @clawsweeper review to request a fresh review only.
  • Fresh-review commands do not start repair, autofix, rebase, CI repair, or automerge.
  • Maintainer-only repair and merge flows require explicit commands such as @clawsweeper autofix, @clawsweeper automerge, @clawsweeper fix ci, or @clawsweeper address review.
  • Maintainers can comment @clawsweeper explain to ask for more context, or @clawsweeper stop to stop active automation.

@clawsweeper clawsweeper Bot added rating: 🧂 unranked krab Not merge-ready due to missing proof or serious correctness/safety concerns. status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask. P2 Normal priority bug or improvement with limited blast radius. merge-risk: 🚨 compatibility 🚨 Merging this PR could break existing users, config, migrations, defaults, or upgrades. merge-risk: 🚨 availability 🚨 Merging this PR could cause crashes, hangs, restart loops, stalls, or process outages. labels Jun 2, 2026
WebSocketClientBase:
- Add _disposing flag distinct from _disposed; set before OnDisposing,
  promote to _disposed after, so teardown callbacks see a stable state.
- Gate ConnectAsync, ReconnectWithBackoffAsync, listener-finally
  (CAS/event/reconnect-kickoff), and reconnect kickoff sites on
  (!_disposed && !_disposing) to prevent post-Dispose work.
- Re-check disposal after OnConnectedAsync and before spawning the
  listener Task so a Dispose racing the connect path does not leak a
  background listener.
- CAS-clear + Abort() + Dispose() any ClientWebSocket installed into
  _webSocket if disposal wins the race during ConnectAsync; mirror in
  the orphan-clean else-if branch.
- Abort() before Dispose() on owned sockets so peers see a clean RST
  instead of an unsent CLOSE.
- Volatile reads on shutdown flags in HeartbeatLoopAsync to avoid
  cache staleness across cores.
- Catch ObjectDisposedException narrowly in SendRawAsync and
  CloseWebSocketAsync so torn-down sockets do not surface as errors.
- Guard OnDisposing with try/catch so a throwing subclass cannot skip
  later cleanup steps.
- Per-event try/catch wrappers around status/error event raises so a
  throwing subscriber cannot block teardown.

OpenClawGatewayClient:
- Apply matching reconnect/teardown hygiene around the keep-alive and
  heartbeat paths so connection state stays consistent across forced
  disconnects.

Tests:
- Relax reconnect-backoff log assertion to tolerate jitter in the
  delay value (still asserts attempt number).

Validation:
- ./build.ps1 clean (0/0)
- Shared.Tests: 2045 passed / 29 skipped
- Tray.Tests:   877 passed
- Manual: tray launched from net10.0-windows10.0.22621.0 build

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@RBrid RBrid force-pushed the user/rbrid/KeepAliveFixes2 branch from b42fb20 to 1f48af3 Compare June 2, 2026 02:07
@RBrid RBrid marked this pull request as ready for review June 2, 2026 02:20
No code changes; previous push 1f48af3 addressed the Dispose() race finding.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@RBrid RBrid marked this pull request as draft June 4, 2026 23:29
@clawsweeper clawsweeper Bot added rating: 🌊 off-meta tidepool PR readiness rating does not apply to this item. and removed rating: 🧂 unranked krab Not merge-ready due to missing proof or serious correctness/safety concerns. status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask. labels Jun 4, 2026
@clawsweeper clawsweeper Bot added rating: 🧂 unranked krab Not merge-ready due to missing proof or serious correctness/safety concerns. status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask. and removed rating: 🌊 off-meta tidepool PR readiness rating does not apply to this item. labels Jun 19, 2026
@shanselman shanselman closed this Jun 25, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

merge-risk: 🚨 availability 🚨 Merging this PR could cause crashes, hangs, restart loops, stalls, or process outages. merge-risk: 🚨 compatibility 🚨 Merging this PR could break existing users, config, migrations, defaults, or upgrades. P2 Normal priority bug or improvement with limited blast radius. rating: 🧂 unranked krab Not merge-ready due to missing proof or serious correctness/safety concerns. status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants