feat(cocoonset): restore hibernated agents from :hibernate on (re)create#14

Merged
CMGS merged 4 commits into main from feat/restore-from-hibernate-producer
Jul 1, 2026
@CMGS CMGS commented Jul 1, 2026

Contributor

What

Revives the restore-from-hibernate consumer in vk-cocoon (which had no producer). When the operator (re)creates a pod for a currently-hibernated agent, it stamps vm.cocoonstack.io/restore-from-hibernate so the new node restores the VM from its :hibernate snapshot instead of booting fresh.
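A minimal sketch of what stamping the annotation could look like. The annotation key is the one named above; the `podMeta` type, the helper names, and the value `"true"` are assumptions for illustration, and the real setter (MarkRestoreFromHibernate in cocoon-common) may differ.

```go
package main

import "fmt"

// Annotation key as named in this PR. The value "true" used below is an assumption.
const restoreFromHibernateKey = "vm.cocoonstack.io/restore-from-hibernate"

// podMeta is a hypothetical stand-in for a pod's object metadata.
type podMeta struct {
	Name        string
	Annotations map[string]string
}

// markRestore stamps the annotation, allocating the map when it is nil.
func markRestore(p *podMeta) {
	if p.Annotations == nil {
		p.Annotations = make(map[string]string)
	}
	p.Annotations[restoreFromHibernateKey] = "true"
}

// wantsRestore is the kind of check a consumer could run before booting the VM.
// Reading from a nil map is safe in Go and yields the zero value.
func wantsRestore(p *podMeta) bool {
	return p.Annotations[restoreFromHibernateKey] == "true"
}

func main() {
	pod := &podMeta{Name: "agent-0"}
	fmt.Println(wantsRestore(pod)) // false: a fresh pod boots clean
	markRestore(pod)
	fmt.Println(wantsRestore(pod)) // true: the node should restore from the snapshot
}
```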

Why

Without a producer, a hibernated agent whose pod moves nodes (drain/failure) boots clean; a subsequent re-hibernate then overwrites the real snapshot → silent state loss. This closes that gap (the annotation's doc comment already specced "Written by the operator on the rebuilt pod").

How

Predicate = intent ∧ existence:

  • intent: a CocoonHibernation CR in phase Hibernated/Waking, or Spec.Suspend.
  • existence: Registry.HasManifest(vmName, hibernate) — the same lookup vk runs at wake, so it never flags without a snapshot to restore, and fails closed on a probe error.

Wired at the 3 create sites (main / sub / suspend-main); extracted createMainAgent to keep Reconcile under the gocyclo budget.

Depends on

cocoonstack/cocoon-common#5 (MarkRestoreFromHibernate); go.mod currently pins the commit that adds the setter. Bump it to cocoon-common main once #5 merges.

Known follow-ups (out of scope)

  • A sub deleted during whole-set suspend isn't recreated until unsuspend (no CR then).
  • A desire=Hibernate agent recreated mid-hibernation restores then re-hibernates (wasteful but preserves state).

Tests: markRestoreIfHibernated (4 cases + nil-registry), restorableFromHibernateByCR (phase filter).
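For orientation, the phase filter that the second test covers might look something like the sketch below. The `hibernation` struct, its field names, and the non-restorable phase names used in `main` are hypothetical; only the Hibernated and Waking phases come from this PR.

```go
package main

import "fmt"

// hibernation is a hypothetical, trimmed-down view of a CocoonHibernation CR.
type hibernation struct {
	PodName string
	Phase   string
}

// podsRestorable returns the set of pod names whose CR is in a phase that
// signals restore intent (Hibernated or Waking). All other phases are skipped.
func podsRestorable(crs []hibernation) map[string]struct{} {
	out := make(map[string]struct{})
	for _, cr := range crs {
		switch cr.Phase {
		case "Hibernated", "Waking":
			out[cr.PodName] = struct{}{}
		}
	}
	return out
}

func main() {
	crs := []hibernation{
		{PodName: "agent-0", Phase: "Hibernated"},
		{PodName: "agent-1", Phase: "Waking"},
		{PodName: "agent-2", Phase: "Hibernating"}, // assumed phase name
		{PodName: "agent-3", Phase: ""},            // no phase recorded yet
	}
	set := podsRestorable(crs)
	_, ok0 := set["agent-0"]
	_, ok2 := set["agent-2"]
	fmt.Println(len(set), ok0, ok2)
}
```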

CMGS added 4 commits July 2, 2026 00:43
Revives the restore-from-hibernate consumer (vk-cocoon) that had no producer:
when the operator (re)creates a pod for a currently-hibernated agent, stamp
vm.cocoonstack.io/restore-from-hibernate so the new node restores the VM from
its :hibernate snapshot instead of booting fresh. Without it, a hibernated agent
whose pod moves nodes boots clean and a later re-hibernate overwrites the real
snapshot — silent state loss.

Predicate = intent AND existence: a CocoonHibernation CR in phase
Hibernated/Waking (or Spec.Suspend) AND Registry.HasManifest(vmName, hibernate).
The registry probe is the same lookup vk runs at wake, so it never flags without
a snapshot to restore, and it fails closed on a probe error. Wired at the three
create sites (main/sub/suspend-main); extracted createMainAgent to keep Reconcile
under the complexity budget.

Follow-ups (out of scope): a sub deleted during whole-set suspend isn't recreated
until unsuspend (no CR then); a desire=Hibernate agent recreated mid-hibernation
restores then re-hibernates (wasteful but preserves state).

- extract hibernationPodNames (shared by podsRestorableByCR + podsHibernatedByCR)
  and hasHibernateSnapshot (shared with allOwnedPodsHibernated)
- bind logger at the top of markRestoreIfHibernated; pass a literal true from the
  suspend path (intent is unconditional there); rename restorableFromHibernateByCR
  to the noun-led podsRestorableByCR; drop the non-ASCII AND symbol from the doc

ensureToolboxes built and created toolbox pods directly, bypassing the producer
the main/sub/suspend-main paths use. Managed toolboxes are VM-backed and
hibernatable (per-CR or whole-set suspend), so a hibernated toolbox that goes
terminal or drifts would cold-boot on recreate and a later hibernate would then
overwrite the real :hibernate snapshot — the same state loss the producer
prevents elsewhere. Compute podsRestorableByCR once and stamp each rebuilt
toolbox, mirroring ensureSubAgents. Regression test included.
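The "compute once, stamp each rebuilt toolbox" shape described in this commit could be sketched as below. The `pod` type, `rebuildToolboxes`, the `hasSnapshot` callback, and the annotation value `"true"` are illustrative assumptions, not the actual ensureToolboxes code.

```go
package main

import "fmt"

const restoreKey = "vm.cocoonstack.io/restore-from-hibernate"

// pod is a hypothetical stand-in for a built toolbox pod.
type pod struct {
	Name        string
	Annotations map[string]string
}

// rebuildToolboxes builds one pod per name. The restorable set is computed by
// the caller once, outside the loop; each rebuilt pod is stamped only when its
// name is in that set and a snapshot exists for it.
func rebuildToolboxes(names []string, restorable map[string]bool, hasSnapshot func(string) bool) []pod {
	pods := make([]pod, 0, len(names))
	for _, n := range names {
		p := pod{Name: n, Annotations: map[string]string{}}
		if restorable[n] && hasSnapshot(n) {
			p.Annotations[restoreKey] = "true" // assumed value
		}
		pods = append(pods, p)
	}
	return pods
}

func main() {
	restorable := map[string]bool{"toolbox-a": true, "toolbox-c": true}
	snapshots := map[string]bool{"toolbox-a": true}
	hasSnapshot := func(name string) bool { return snapshots[name] }

	for _, p := range rebuildToolboxes([]string{"toolbox-a", "toolbox-b", "toolbox-c"}, restorable, hasSnapshot) {
		_, stamped := p.Annotations[restoreKey]
		fmt.Println(p.Name, stamped)
	}
}
```

In this sketch only toolbox-a is stamped: toolbox-b has no restore intent, and toolbox-c has intent but no snapshot.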
@CMGS CMGS merged commit 5485764 into main Jul 1, 2026
2 checks passed
@CMGS CMGS deleted the feat/restore-from-hibernate-producer branch July 1, 2026 18:12