[NemoClaw][Sandbox][Recovery] Pod restart leaves OpenClaw gateway and dashboard port-forward dead; recovery is a side-effect of nemoclaw connect #2042


When a NemoClaw sandbox pod is recreated (for example after a `hostAliases` CR edit and `kubectl delete pod` to trigger reconciliation), two long-running pieces do not come back on the new pod:

1. The in-sandbox **OpenClaw gateway** (the WebSocket/HTTP service that the dashboard connects to on `127.0.0.1:18789` inside the sandbox). The old process was on the destroyed pod; the new pod has `sleep infinity` as its command and does not re-run `nemoclaw-start`.
2. The host-side **SSH port-forward** that `openshell forward` set up at onboard time to bridge `127.0.0.1:18789` on the OpenShell host into the sandbox pod. It was an `-L 127.0.0.1:18789:127.0.0.1:18789 -f` SSH session; it dies with the old pod and is not re-created.

Net result: the dashboard and any clients tunnelling through the host get `connection refused`, even though `openclaw agent` calls still work (they spawn the gateway on demand, per call).

### Environment

- NemoClaw v0.0.17, OpenShell v0.0.26
- Host: Linux x86_64, MicroK8s
- Trigger: `docker exec openshell-cluster-nemoclaw kubectl -n openshell delete pod my-assistant`

### Reproduction

1. Onboard a sandbox (`nemoclaw onboard --resume`). Confirm the dashboard at `http://127.0.0.1:18789/` is reachable (via SSH tunnel if remote) and that `ss -lnt | grep 18789` inside the sandbox shows a listener.
2. Edit the Sandbox CR (for example, add a `hostAliases` entry) via `kubectl patch`; there is no first-class CLI.
3. Force reconcile by deleting the pod: `kubectl -n openshell delete pod <name>`.
4. Wait for the operator to recreate the pod and reach `Running`.
5. Try the dashboard URL: connection refused.
6. `openshell sandbox exec -n <name> --no-tty -- ss -lnt | grep 18789`: empty.
7. `ps -ef | grep '18789'` on the OpenShell host: port-forward is gone.
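
Steps 6 and 7 can be folded into a pair of checks. The following is a minimal sketch, not part of either CLI: the function names (`has_listener`, `sandbox_gateway_up`, `host_forward_up`) are hypothetical, and the `openshell sandbox exec` invocation is copied from step 6. Matching on the port column of `ss -lnt` avoids the false positives a bare `grep 18789` can produce.

```shell
# Reads `ss -lnt` output on stdin; succeeds if a LISTEN socket is bound to the port.
has_listener() {
  # Field 4 is "Local Address:Port"; the port is whatever follows the last colon.
  awk -v port="$1" '
    $1 == "LISTEN" { n = split($4, a, ":"); if (a[n] == port) found = 1 }
    END { exit !found }'
}

# Succeeds if the gateway is listening inside the sandbox (step 6).
sandbox_gateway_up() {
  openshell sandbox exec -n "$1" --no-tty -- ss -lnt | has_listener "${2:-18789}"
}

# Succeeds if something on the OpenShell host is bound to the forwarded port.
# This looks at the listening socket rather than the process list used in step 7.
host_forward_up() {
  ss -lnt | has_listener "${1:-18789}"
}
```

After a pod recreate, `sandbox_gateway_up my-assistant` and `host_forward_up` would both be expected to fail until recovery has run.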

### Self-recovery exists, but is inside `connect`

Running `nemoclaw <name> connect` prints:

```
  OpenClaw gateway is not running inside the sandbox (sandbox likely restarted).
  Recovering...
→ Found forward on sandbox 'my-assistant'
✓ Stopped forward of port 18789 for sandbox my-assistant
✓ Forwarding port 18789 to sandbox my-assistant in the background
  ✓ OpenClaw gateway restarted inside sandbox.
  ✓ Dashboard port forward re-established.
```

That is exactly the recovery an operator needs after a pod reconcile, but it only runs as a side-effect of `connect`, which is an interactive command. It is not straightforward to script ("open a shell so services restart, then exit"), and operators who do not run `connect` after a reconcile never trigger the recovery.

### Expected behaviour

One of:

1. **Auto-recovery by the sandbox supervisor:** when the pod comes up, re-run `nemoclaw-start`, re-bind the dashboard port, and re-establish the port-forward from the OpenShell host. This matches the behaviour users expect from pods rolled via normal Kubernetes operations.
2. **First-class idempotent command:** `nemoclaw <name> services-recover` (or `services-restart`), exposing the same logic that already lives inside `connect` without the side-effect of opening a shell. Safe to run at any time; a no-op if services are already up.
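
To illustrate what "idempotent" means for option 2, here is a rough host-side sketch of the check-then-act shape. It is not the logic inside `connect`, which is not shown in this issue. Assumptions: `services_recover` and its helpers are hypothetical names; `nemoclaw-start` is on the sandbox's `PATH` and returns once the gateway is launched; `openshell forward stop` takes the same positional arguments as `forward start`; and any STATUS other than `dead` counts as healthy.

```shell
PORT="${PORT:-18789}"

# Succeeds if the gateway is listening on $PORT inside the sandbox.
gateway_up() {
  openshell sandbox exec -n "$1" --no-tty -- ss -lnt |
    awk -v p="$PORT" '$1 == "LISTEN" && $4 ~ (":" p "$") { up = 1 } END { exit !up }'
}

# Succeeds if `openshell forward list` has a row for this sandbox and port
# whose STATUS is not `dead` (assumption: every other status is healthy).
forward_up() {
  openshell forward list |
    awk -v s="$1" -v p="$PORT" '$1 == s && $3 == p && $5 != "dead" { ok = 1 } END { exit !ok }'
}

# Hypothetical stand-in for `nemoclaw <name> services-recover`:
# acts only on the pieces that are down, so a second run is a no-op.
services_recover() {
  name="$1"
  if ! gateway_up "$name"; then
    # Assumption: nemoclaw-start returns after launching the gateway.
    openshell sandbox exec -n "$name" --no-tty -- nemoclaw-start
  fi
  if ! forward_up "$name"; then
    # Assumption: `forward stop` accepts the same arguments as `forward start`.
    openshell forward stop "$PORT" "$name" || true
    openshell forward start "$PORT" "$name" --background
  fi
}
```

Because each step is guarded by its own health check, `services_recover my-assistant` could be run from cron or a post-reconcile hook without restarting services that are already up.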

### Actual behaviour

Services stay dead on the new pod. The operator needs to know that `connect` performs the recovery, or to run the `nemoclaw-start` / `openshell forward start` sequence manually.

### Impact

- CR-edit-driven pod reconciles leave the dashboard without a listener on `127.0.0.1:18789`.
- Operators not using the TUI (remote dashboard, RDP, API clients) see `connection refused` with no log message indicating the cause.
- Adds an additional manual step to the post-pod-restart sequence.

### Additional observation: `openshell forward list` already exposes the failure

`openshell forward list` prints the forward's health status:

```
SANDBOX      BIND      PORT     PID        STATUS
my-assistant 127.0.0.1 18789    1803950    dead
```

So OpenShell already detects that the tunnel is gone (the PID is checked on each `list`), but nothing acts on that detection. The same health check that colours STATUS red could trigger an automatic `forward start`, or at minimum log a warning to the gateway logs.
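
Until something like that exists, the `list` output above is regular enough to drive an external watchdog. A minimal sketch, assuming the five-column layout shown above is stable, and that `openshell forward stop` takes the same positional arguments as `forward start` (an assumption; only the `start` form appears in this issue). The function names are hypothetical.

```shell
# Reads `openshell forward list` output on stdin and prints "<port> <sandbox>"
# for every row whose STATUS column is `dead`.
dead_forwards() {
  awk 'NR > 1 && $5 == "dead" { print $3, $1 }'
}

# Restarts every dead forward in the background.
restart_dead_forwards() {
  openshell forward list | dead_forwards | while read -r port name; do
    echo "restarting dead forward: port ${port}, sandbox ${name}" >&2
    # stdin is redirected so the child commands cannot swallow the loop's input.
    openshell forward stop "$port" "$name" </dev/null || true
    openshell forward start "$port" "$name" --background </dev/null
  done
}
```

Note that this only revives the host-side tunnel; the gateway inside the sandbox still has to be restarted separately.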

A related orphan state exists: if an operator runs `openshell forward start 18789 <name>` without `--background`, the command blocks on the foreground SSH tunnel. If the operator's own session then ends (for example, they log out of the host shell), the tunnel's SSH process is orphaned: it is still bound to port 18789 but no longer tracked by `openshell forward list`. `ss -tlnp | grep 18789` shows it, while `openshell forward list` reports "No active forwards." The port continues to serve, but no supervisor is tracking it, and `openshell forward stop` responds with "No active forward found."
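
The orphan case can be detected by comparing the two views described above: a listener exists on the host, but `openshell forward list` has no row for the port. A sketch under those assumptions; `is_orphaned_forward` and `orphan_pid` are hypothetical helper names, and `ss -p` only reports the owning process when run with sufficient privileges.

```shell
# Succeeds when something listens on the port on the host
# but `openshell forward list` has no row for that port.
is_orphaned_forward() {
  port="$1"
  ss -tlnp |
    awk -v p="$port" '$1 == "LISTEN" && $4 ~ (":" p "$") { up = 1 } END { exit !up }' ||
    return 1
  if openshell forward list | awk -v p="$port" '$3 == p { t = 1 } END { exit !t }'; then
    return 1   # openshell still tracks a forward on this port
  fi
  return 0
}

# Prints the PID that `ss -tlnp` attributes to the listener, if visible.
orphan_pid() {
  ss -tlnp |
    awk -v p="$1" '$1 == "LISTEN" && $4 ~ (":" p "$")' |
    sed -n 's/.*pid=\([0-9][0-9]*\).*/\1/p' |
    head -n 1
}
```

Whether to kill the reported PID and re-create the forward with `--background` is left to the operator; the sketch only identifies the process.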

### Suggested fix

1. **Short term:** add `nemoclaw <name> services-recover` as a public, idempotent command that runs the existing `connect` recovery logic without the interactive shell. Safe to cron or script.
2. **Medium term:** have `openshell forward` supervise the forwards it creates. When `dead` status is detected on the next `list` (or on a periodic healthcheck), automatically restart. A `--no-auto-restart` flag for opt-out is fine; the default should be to keep the forward alive.
3. **Medium term:** warn or error when `openshell forward start` is invoked without `--background` over a non-interactive connection (for example, a detached SSH session). Currently the command exits quietly when the parent SSH session drops, leaving an orphaned tunnel process behind.
4. **Longer term:** have the sandbox supervisor re-run `nemoclaw-start` automatically when the pod comes up, and have host-side services detect dead tunnels and re-establish them without operator intervention.

### Notes

Discovered on 2026-04-17 alongside the hostAliases bug and the proxy-DNS bug. Together, they mean the "change pod spec" workflow has four gaps: the CR patch has no CLI, the reconcile has no CLI, services do not auto-restart, and the recovery path only runs inside `connect`.

