[NemoClaw][Sandbox][Recovery] Pod restart leaves OpenClaw gateway and dashboard port-forward dead; recovery is a side-effect of nemoclaw connect #2042

@davidglogan

Description

When a NemoClaw sandbox pod is recreated (for example after a hostAliases CR edit and kubectl delete pod to trigger reconciliation), two long-running pieces do not come back on the new pod:

  1. The in-sandbox OpenClaw gateway (the WebSocket/HTTP service that the dashboard connects to on 127.0.0.1:18789 inside the sandbox). The old process was on the destroyed pod; the new pod has sleep infinity as its command and does not re-run nemoclaw-start.
  2. The host-side SSH port-forward that openshell forward set up at onboard time to bridge 127.0.0.1:18789 on the OpenShell host into the sandbox pod. It was an -L 127.0.0.1:18789:127.0.0.1:18789 -f SSH session; it dies with the old pod and is not re-created.

Net result: the dashboard and any clients tunnelling through the host return connection refused, even though openclaw agent calls still work (they spawn the gateway on-demand per call).

Environment

  • NemoClaw v0.0.17, OpenShell v0.0.26
  • Host: Linux x86_64, MicroK8s
  • Trigger: docker exec openshell-cluster-nemoclaw kubectl -n openshell delete pod my-assistant

Reproduction

  1. Onboard a sandbox (nemoclaw onboard --resume). Confirm the dashboard at http://127.0.0.1:18789/ is reachable (via SSH tunnel if remote) and that ss -lnt | grep 18789 inside the sandbox shows a listener.
  2. Edit the Sandbox CR (for example, add a hostAliases entry) via kubectl patch; there is no first-class CLI.
  3. Force reconcile by deleting the pod: kubectl -n openshell delete pod <name>.
  4. Wait for the operator to recreate the pod and reach Running.
  5. Try the dashboard URL: connection refused.
  6. openshell sandbox exec -n <name> --no-tty -- ss -lnt | grep 18789: empty.
  7. ps -ef | grep '18789' on the OpenShell host: port-forward is gone.
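Steps 5 to 7 can also be checked from a script rather than by hand. A minimal sketch in Python, assuming only that the dashboard is served over TCP on 127.0.0.1:18789 as described above; `port_is_open` is a hypothetical helper name and not part of NemoClaw or OpenShell:

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers ConnectionRefusedError and connection timeouts alike.
        return False


if __name__ == "__main__":
    # 18789 is the dashboard port from this issue.
    if port_is_open("127.0.0.1", 18789):
        state = "reachable"
    else:
        state = "connection refused or unreachable"
    print(f"dashboard on 127.0.0.1:18789: {state}")
```

Run on the OpenShell host, this only probes the host-side end of the forward. It cannot tell whether the gateway or the forward is the missing piece, so step 6 is still needed to separate the two.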

Self-recovery exists, but is inside connect

Running nemoclaw <name> connect prints:

  OpenClaw gateway is not running inside the sandbox (sandbox likely restarted).
  Recovering...
→ Found forward on sandbox 'my-assistant'
✓ Stopped forward of port 18789 for sandbox my-assistant
✓ Forwarding port 18789 to sandbox my-assistant in the background
  ✓ OpenClaw gateway restarted inside sandbox.
  ✓ Dashboard port forward re-established.

That is the exact recovery an operator needs after a pod reconcile, but it only runs as a side-effect of connect, which is an interactive command. It is not straightforward to script ("open a shell so services restart, then exit"), and operators who do not run connect after a reconcile never trigger the recovery at all.

Expected behaviour

One of:

  1. Auto-recovery by the sandbox supervisor: when the pod comes up, re-run nemoclaw-start, re-bind the dashboard port, and re-establish the port-forward from the OpenShell host. This matches the behaviour users expect from pods rolled via normal Kubernetes operations.
  2. First-class idempotent command: nemoclaw <name> services-recover (or services-restart), exposing the same logic that already lives inside connect without the side-effect of opening a shell. Safe to run at any time; a no-op if services are already up.
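The idempotency requirement in option 2 can be sketched as a probe-then-start loop. This is an illustration in Python, not the logic inside connect, which is not shown in this issue. `Service` and `services_recover` are hypothetical names, and the probes and start actions are placeholders for whatever connect already does internally:

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Service:
    """One recoverable piece: a health probe plus a start action."""
    name: str
    is_up: Callable[[], bool]
    start: Callable[[], None]


def services_recover(services: List[Service]) -> List[str]:
    """Start every service whose probe fails; return the names that were started.

    Services are handled in list order, so the gateway should come before
    the forward that depends on it. When every probe passes, nothing is
    started and the call is a no-op.
    """
    started = []
    for svc in services:
        if not svc.is_up():
            svc.start()
            started.append(svc.name)
    return started
```

Because each start action is guarded by its own probe, the command can be run from cron or a post-reconcile hook without restarting a healthy gateway or forward.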

Actual behaviour

Services stay dead on the new pod. The operator needs to know that connect performs the recovery, or to run the nemoclaw-start / openshell forward start sequence manually.

Impact

  • CR-edit-driven pod reconciles leave the dashboard without a listener on 127.0.0.1:18789.
  • Operators not using the TUI (remote dashboard, RDP, API clients) see connection refused with no log message indicating the cause.
  • Adds a manual step to the post-pod-restart sequence.

Additional observation: openshell forward list already exposes the failure

openshell forward list prints the forward's health status:

SANDBOX       BIND       PORT   PID      STATUS
my-assistant  127.0.0.1  18789  1803950  dead

So openshell detects that the tunnel is gone (the PID is checked on each list). The detection is present; an automatic remediation is not. The same healthcheck that colours STATUS red could trigger an automatic forward start or at minimum log a warning to the gateway logs.
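Until remediation is built in, the dead status can be picked up by an external script. A minimal sketch in Python that parses the table printed above. It assumes the columns are whitespace-separated with no spaces inside a cell, and it strips ANSI colour codes in case STATUS is coloured when captured. `parse_forward_list` and `dead_forwards` are hypothetical helper names, and `SAMPLE` is the output quoted in this issue:

```python
import re
from typing import Dict, List

_ANSI = re.compile(r"\x1b\[[0-9;]*m")


def parse_forward_list(output: str) -> List[Dict[str, str]]:
    """Parse the whitespace-separated table into one dict per row.

    The first non-empty line is treated as the header. Lines whose cell
    count does not match the header are skipped.
    """
    clean = _ANSI.sub("", output)
    lines = [ln for ln in clean.splitlines() if ln.strip()]
    if not lines:
        return []
    header = [h.lower() for h in lines[0].split()]
    rows = []
    for ln in lines[1:]:
        cells = ln.split()
        if len(cells) == len(header):
            rows.append(dict(zip(header, cells)))
    return rows


def dead_forwards(output: str) -> List[Dict[str, str]]:
    """Return only the rows whose STATUS column reads 'dead'."""
    return [r for r in parse_forward_list(output) if r.get("status") == "dead"]


SAMPLE = """\
SANDBOX      BIND      PORT     PID        STATUS
my-assistant 127.0.0.1 18789    1803950    dead
"""

if __name__ == "__main__":
    for fwd in dead_forwards(SAMPLE):
        print(f"forward for {fwd['sandbox']} on port {fwd['port']} is dead (pid {fwd['pid']})")
```

A wrapper could feed the real command output into `dead_forwards` and restart each entry it returns. A machine-readable output mode on openshell forward list would make this parsing unnecessary, but the issue does not say whether one exists.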

A related orphan state exists: if an operator runs openshell forward start 18789 <name> without --background, the command blocks on the foreground SSH tunnel. When the operator's parent SSH session drops (for example, the operator logs out of the host shell), the tunnel's SSH process is orphaned: it is still bound to port 18789, but openshell forward list no longer tracks it. ss -tlnp | grep 18789 shows it; openshell forward list reports "No active forwards." The port continues to serve but no supervisor is tracking it, and openshell forward stop responds with "No active forward found."
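The orphan condition amounts to "an ssh process is listening on a port that openshell forward list does not report". A minimal sketch of that comparison in Python, assuming the usual iproute2 column layout for ss -tlnp (State, Recv-Q, Send-Q, Local Address:Port, Peer Address:Port, Process). The helper names are hypothetical, and the sample line and its PID are made up for illustration, not captured output:

```python
import re
from typing import List, NamedTuple, Set


class Listener(NamedTuple):
    port: int
    process: str
    pid: int


# Matches the first entry of the Process column, e.g. users:(("ssh",pid=4242,fd=5))
_PROC = re.compile(r'users:\(\("([^"]+)",pid=(\d+)')


def parse_ss_listeners(ss_output: str) -> List[Listener]:
    """Extract (port, process, pid) from the LISTEN rows of `ss -tlnp` output."""
    listeners = []
    for line in ss_output.splitlines():
        cells = line.split()
        if len(cells) < 4 or cells[0] != "LISTEN":
            continue
        # The local address is the 4th column; the port follows the last colon.
        port_text = cells[3].rsplit(":", 1)[-1]
        proc = _PROC.search(line)
        if port_text.isdigit() and proc:
            listeners.append(Listener(int(port_text), proc.group(1), int(proc.group(2))))
    return listeners


def orphaned_ssh_listeners(ss_output: str, tracked_ports: Set[int]) -> List[Listener]:
    """Return ssh listeners whose port is absent from the tracked forwards."""
    return [
        lsn for lsn in parse_ss_listeners(ss_output)
        if lsn.process == "ssh" and lsn.port not in tracked_ports
    ]


# Illustrative input only; the PIDs are invented.
SS_SAMPLE = """\
State  Recv-Q Send-Q Local Address:Port Peer Address:Port Process
LISTEN 0      128        127.0.0.1:18789      0.0.0.0:*     users:(("ssh",pid=4242,fd=5))
LISTEN 0      128          0.0.0.0:22         0.0.0.0:*     users:(("sshd",pid=900,fd=3))
"""

if __name__ == "__main__":
    # An empty tracked set corresponds to "No active forwards."
    for lsn in orphaned_ssh_listeners(SS_SAMPLE, tracked_ports=set()):
        print(f"orphaned ssh tunnel on port {lsn.port} (pid {lsn.pid})")
```

The Process column is only populated for processes the caller is allowed to inspect, so this check may need to run as the user that owns the tunnel or as root.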

Suggested fix

  1. Short term: add nemoclaw <name> services-recover as a public, idempotent command that runs the existing connect recovery logic without the interactive shell. Safe to cron or script.
  2. Medium term: have openshell forward supervise the forwards it creates. When dead status is detected on the next list (or on a periodic healthcheck), automatically restart. A --no-auto-restart flag for opt-out is fine; the default should be to keep the forward alive.
  3. Medium term: warn or error when openshell forward start is invoked without --background over a non-interactive connection (for example, a detached SSH session). The current behaviour exits quietly when the parent SSH drops and leaves an orphan process.
  4. Longer term: have the sandbox supervisor re-run nemoclaw-start automatically when the pod comes up, and have host-side services detect dead tunnels and re-establish them without operator intervention.

Notes

Discovered on 2026-04-17 alongside the hostAliases bug and the proxy-DNS bug. Together, these leave the "change pod spec" workflow with four gaps: the CR patch has no CLI, the reconcile has no CLI, services do not auto-restart, and the recovery path only runs inside connect.

Metadata

Labels: Integration: OpenClaw, NemoClaw CLI, OpenShell, bug