feat(cocoonset): restore hibernated agents from :hibernate on (re)create #14
Merged
Revives the restore-from-hibernate consumer (vk-cocoon) that had no producer: when the operator (re)creates a pod for a currently-hibernated agent, stamp vm.cocoonstack.io/restore-from-hibernate so the new node restores the VM from its :hibernate snapshot instead of booting fresh. Without it, a hibernated agent whose pod moves nodes boots clean and a later re-hibernate overwrites the real snapshot, which is silent state loss.

Predicate = intent AND existence: a CocoonHibernation CR in phase Hibernated/Waking (or Spec.Suspend) AND Registry.HasManifest(vmName, hibernate). The registry probe is the same lookup vk runs at wake, so it never flags without a snapshot to restore, and it fails closed on a probe error.

Wired at the three create sites (main/sub/suspend-main); extracted createMainAgent to keep Reconcile under the complexity budget.

Follow-ups (out of scope): a sub deleted during whole-set suspend isn't recreated until unsuspend (no CR then); a desire=Hibernate agent recreated mid-hibernation restores then re-hibernates (wasteful but preserves state).
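The predicate above can be sketched in a few lines of Go. This is a minimal illustration, not the operator's actual code: the ManifestProber interface, fakeRegistry, the PhaseActive constant, and the function names are hypothetical, and only the HasManifest name and the Hibernated/Waking phases come from the description. "Fails closed" is read here as "do not stamp", and a nil registry is assumed to behave the same way.

```go
package main

import (
	"errors"
	"fmt"
)

// Phase mirrors the CocoonHibernation phases named in the description.
type Phase string

const (
	PhaseHibernated Phase = "Hibernated"
	PhaseWaking     Phase = "Waking"
	PhaseActive     Phase = "Active" // hypothetical: stands for any phase without restore intent
)

// ManifestProber is a hypothetical stand-in for the registry client.
type ManifestProber interface {
	HasManifest(vmName, tag string) (bool, error)
}

// errProbe simulates a registry that cannot be reached.
var errProbe = errors.New("registry unreachable")

// fakeRegistry is an in-memory prober used only for this sketch.
type fakeRegistry struct {
	manifests map[string]bool
	err       error
}

func (f fakeRegistry) HasManifest(vmName, tag string) (bool, error) {
	if f.err != nil {
		return false, f.err
	}
	return f.manifests[vmName+":"+tag], nil
}

// wantsRestore is the intent half: a hibernation CR in phase
// Hibernated/Waking, or a whole-set suspend.
func wantsRestore(phase Phase, suspend bool) bool {
	return suspend || phase == PhaseHibernated || phase == PhaseWaking
}

// shouldRestore combines intent with existence of the :hibernate snapshot.
// A probe error or a nil registry yields false, so nothing is stamped
// (assumed reading of "fails closed").
func shouldRestore(reg ManifestProber, vmName string, intent bool) bool {
	if !intent || reg == nil {
		return false
	}
	ok, err := reg.HasManifest(vmName, "hibernate")
	if err != nil {
		return false
	}
	return ok
}

func main() {
	reg := fakeRegistry{manifests: map[string]bool{"agent-0:hibernate": true}}
	fmt.Println(shouldRestore(reg, "agent-0", wantsRestore(PhaseHibernated, false)))
	fmt.Println(shouldRestore(reg, "agent-1", wantsRestore(PhaseWaking, false)))
	fmt.Println(shouldRestore(fakeRegistry{err: errProbe}, "agent-0", true))
}
```

Keeping intent and existence as separate checks lets the suspend path pass a literal true for intent while still going through the same registry probe.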
- extract hibernationPodNames (shared by podsRestorableByCR + podsHibernatedByCR) and hasHibernateSnapshot (shared with allOwnedPodsHibernated)
- bind logger at the top of markRestoreIfHibernated; pass a literal true from the suspend path (intent is unconditional there); rename restorableFromHibernateByCR to the noun-led podsRestorableByCR; drop the non-ASCII AND symbol from the doc
ensureToolboxes built and created toolbox pods directly, bypassing the producer the main/sub/suspend-main paths use. Managed toolboxes are VM-backed and hibernatable (per-CR or whole-set suspend), so a hibernated toolbox that goes terminal or drifts would cold-boot on recreate and a later hibernate would then overwrite the real :hibernate snapshot — the same state loss the producer prevents elsewhere. Compute podsRestorableByCR once and stamp each rebuilt toolbox, mirroring ensureSubAgents. Regression test included.
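The "compute once, stamp each rebuilt pod" shape described above might look roughly like the following. This is a sketch under stated assumptions: the pod struct, stampRestorable, the map-of-names form of the restorable set, and the annotation value "true" are all hypothetical; only the annotation key comes from the PR text.

```go
package main

import "fmt"

// restoreAnnotation is the key named in the PR description.
const restoreAnnotation = "vm.cocoonstack.io/restore-from-hibernate"

// pod is a hypothetical, trimmed-down stand-in for a Kubernetes Pod.
type pod struct {
	Name        string
	Annotations map[string]string
}

// stampRestorable marks every pod whose name is in the restorable set and
// returns how many pods were stamped. The set is computed once by the caller
// so the loop does no per-pod lookups against the API server or registry.
func stampRestorable(pods []*pod, restorable map[string]bool) int {
	stamped := 0
	for _, p := range pods {
		if !restorable[p.Name] {
			continue
		}
		if p.Annotations == nil {
			p.Annotations = map[string]string{}
		}
		p.Annotations[restoreAnnotation] = "true" // value is an assumption
		stamped++
	}
	return stamped
}

func main() {
	restorable := map[string]bool{"toolbox-a": true} // computed once per reconcile
	pods := []*pod{{Name: "toolbox-a"}, {Name: "toolbox-b"}}
	n := stampRestorable(pods, restorable)
	fmt.Println(n, pods[0].Annotations[restoreAnnotation], len(pods[1].Annotations))
}
```

Routing toolbox pods through the same stamping step as the other create sites is what keeps a rebuilt toolbox from cold-booting over its snapshot.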
What
Revives the restore-from-hibernate consumer in vk-cocoon (which had no producer). When the operator (re)creates a pod for a currently-hibernated agent, it stamps vm.cocoonstack.io/restore-from-hibernate so the new node restores the VM from its :hibernate snapshot instead of booting fresh.

Why
Without a producer, a hibernated agent whose pod moves nodes (drain/failure) boots clean; a subsequent re-hibernate then overwrites the real snapshot, causing silent state loss. This closes that gap (the annotation's doc comment already specced "Written by the operator on the rebuilt pod").

How
Predicate = intent ∧ existence:
- Intent: a CocoonHibernation CR in phase Hibernated/Waking, or Spec.Suspend.
- Existence: Registry.HasManifest(vmName, hibernate), the same lookup vk runs at wake, so it never flags without a snapshot to restore, and fails closed on a probe error.
Wired at the 3 create sites (main / sub / suspend-main); extracted createMainAgent to keep Reconcile under the gocyclo budget.

Depends on
cocoonstack/cocoon-common#5 (MarkRestoreFromHibernate); go.mod pins the setter commit. Rebump to common main once #5 merges.

Known follow-ups (out of scope)
- A sub deleted during whole-set suspend isn't recreated until unsuspend (no CR then).
- A desire=Hibernate agent recreated mid-hibernation restores then re-hibernates (wasteful but preserves state).

Tests: markRestoreIfHibernated (4 cases + nil-registry), restorableFromHibernateByCR (phase filter).