When a NemoClaw sandbox pod is recreated (for example after a hostAliases CR edit and kubectl delete pod to trigger reconciliation), two long-running pieces do not come back on the new pod:
- The in-sandbox OpenClaw gateway (the WebSocket/HTTP service that the dashboard connects to on
127.0.0.1:18789 inside the sandbox). The old process was on the destroyed pod; the new pod has sleep infinity as its command and does not re-run nemoclaw-start.
- The host-side SSH port-forward that
openshell forward set up at onboard time to bridge 127.0.0.1:18789 on the OpenShell host into the sandbox pod. It was an -L 127.0.0.1:18789:127.0.0.1:18789 -f SSH session; it dies with the old pod and is not re-created.
Net result: the dashboard and any clients tunnelling through the host return connection refused, even though openclaw agent calls still work (they spawn the gateway on-demand per call).
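The symptom is a plain TCP-level refusal, which distinguishes "no listener" from a hung gateway. A minimal probe sketch (the port number is taken from this report; the helper name is illustrative and not part of either tool):

```python
import socket

def dashboard_listening(host: str = "127.0.0.1", port: int = 18789,
                        timeout: float = 1.0) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except (ConnectionRefusedError, OSError):
        return False
```

Run against the host after a reconcile, this returns False until the forward and the in-sandbox gateway are both re-established.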
Environment
- NemoClaw v0.0.17, OpenShell v0.0.26
- Host: Linux x86_64, MicroK8s
- Trigger:
docker exec openshell-cluster-nemoclaw kubectl -n openshell delete pod my-assistant
Reproduction
- Onboard a sandbox (
nemoclaw onboard --resume). Confirm the dashboard at http://127.0.0.1:18789/ is reachable (via SSH tunnel if remote) and that ss -lnt | grep 18789 inside the sandbox shows a listener.
- Edit the Sandbox CR (for example, add a
hostAliases entry) via kubectl patch; there is no first-class CLI.
- Force reconcile by deleting the pod:
kubectl -n openshell delete pod <name>.
- Wait for the operator to recreate the pod and reach
Running.
- Try the dashboard URL: connection refused.
openshell sandbox exec -n <name> --no-tty -- ss -lnt | grep 18789: empty.
ps -ef | grep '18789' on the OpenShell host: port-forward is gone.
Self-recovery exists, but is inside connect
Running nemoclaw <name> connect prints:
OpenClaw gateway is not running inside the sandbox (sandbox likely restarted).
Recovering...
→ Found forward on sandbox 'my-assistant'
✓ Stopped forward of port 18789 for sandbox my-assistant
✓ Forwarding port 18789 to sandbox my-assistant in the background
✓ OpenClaw gateway restarted inside sandbox.
✓ Dashboard port forward re-established.
That is the exact recovery an operator needs after a pod reconcile, but it only runs as a side-effect of connect, which is an interactive command. It is not straightforward to script ("open a shell so services restart, then exit"), and operators who do not run connect after a reconcile will not observe the recovery.
Expected behaviour
One of:
- Auto-recovery by the sandbox supervisor: when the pod comes up, re-run
nemoclaw-start, re-bind the dashboard port, and re-establish the port-forward from the OpenShell host. This matches the behaviour users expect from pods rolled via normal Kubernetes operations.
- First-class idempotent command:
nemoclaw <name> services-recover (or services-restart), exposing the same logic that already lives inside connect without the side-effect of opening a shell. Safe to run at any time; a no-op if services are already up.
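The "safe to run at any time; a no-op if services are already up" contract is the key property of the proposed command. A minimal sketch of that idempotent shape, with the actual probe and restart steps stubbed as callables (nothing here is real nemoclaw API):

```python
from typing import Callable

def recover_services(is_up: Callable[[], bool],
                     restart: Callable[[], None]) -> bool:
    """Idempotent recovery: act only when the probe fails.

    Returns True if a restart was performed, False for the no-op path."""
    if is_up():
        return False   # services healthy: nothing to do, safe to re-run
    restart()          # e.g. re-run nemoclaw-start and re-establish the forward
    if not is_up():
        raise RuntimeError("services did not come back after restart")
    return True
```

Because the no-op path is cheap and side-effect-free, a command with this shape could be scripted or cron'd after every reconcile without risk.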
Actual behaviour
Services stay dead on the new pod. The operator needs to know that connect performs the recovery, or to run the nemoclaw-start / openshell forward start sequence manually.
Impact
- CR-edit-driven pod reconciles leave the dashboard without a listener on
127.0.0.1:18789.
- Operators not using the TUI (remote dashboard, RDP, API clients) see
connection refused with no log message indicating the cause.
- Adds an additional manual step to the post-pod-restart sequence.
Additional observation: openshell forward list already exposes the failure
openshell forward list prints the forward's health status:
SANDBOX BIND PORT PID STATUS
my-assistant 127.0.0.1 18789 1803950 dead
So openshell detects that the tunnel is gone (the PID is checked on each list). The detection is present; an automatic remediation is not. The same healthcheck that colours STATUS red could trigger an automatic forward start or at minimum log a warning to the gateway logs.
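The dead status implies the health check is a simple PID liveness probe. A sketch of that check using the standard signal-0 technique (this is an assumption about the mechanism, not openshell's actual code):

```python
import os

def pid_alive(pid: int) -> bool:
    """Probe a PID with signal 0: tests existence without delivering a signal."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False   # no such process: the forward is dead
    except PermissionError:
        return True    # process exists but is owned by another user
    return True
```

A check this cheap could run on a timer rather than only on `list`, which is what the auto-restart suggestion below amounts to.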
A related orphan state exists: if an operator runs openshell forward start 18789 <name> without --background, the command blocks on the foreground SSH tunnel. When that SSH session drops (for example, the operator logs out of the host shell), the SSH process is orphaned: still bound to port 18789, but no longer tracked by openshell forward list. ss -tlnp | grep 18789 shows it; openshell forward list reports "No active forwards." The port continues to serve, but no supervisor is tracking it, and openshell forward stop responds with "No active forward found."
Suggested fix
- Short term: add
nemoclaw <name> services-recover as a public, idempotent command that runs the existing connect recovery logic without the interactive shell. Safe to cron or script.
- Medium term: have
openshell forward supervise the forwards it creates. When dead status is detected on the next list (or on a periodic healthcheck), automatically restart. A --no-auto-restart flag for opt-out is fine; the default should be to keep the forward alive.
- Medium term: warn or error when
openshell forward start is invoked without --background over a non-interactive connection (for example, a detached SSH session). The current behaviour exits quietly when the parent SSH drops and leaves an orphan process.
- Longer term: have the sandbox supervisor re-run
nemoclaw-start automatically when the pod comes up, and have host-side services detect dead tunnels and re-establish them without operator intervention.
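The medium-term supervision item reduces to a probe-and-restart loop. A minimal sketch with the forward-specific probe and restart injected as callables (the loop shape is illustrative, not openshell code; a real supervisor would run unbounded and back off after repeated failures):

```python
import time
from typing import Callable

def supervise(alive: Callable[[], bool],
              restart: Callable[[], None],
              checks: int,
              interval: float = 5.0) -> int:
    """Run a bounded number of health checks, restarting on each dead probe.

    Returns the number of restarts performed."""
    restarts = 0
    for _ in range(checks):
        if not alive():
            restart()
            restarts += 1
        time.sleep(interval)
    return restarts
```

With `alive` as the PID probe and `restart` as the forward re-establishment, the default behaviour requested above (keep the forward alive unless `--no-auto-restart` is given) falls out of this loop directly.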
Notes
Discovered on 2026-04-17 alongside the hostAliases bug and the proxy-DNS bug. Taken together, the "change pod spec" workflow has four gaps: the CR patch has no CLI, the reconcile has no CLI, services do not auto-restart, and the recovery path only runs inside connect.