EvenScheduler: opt-in round-robin rebalance onto returning idle supervisors#8778
Open
mwkang wants to merge 1 commit into
Open
EvenScheduler: opt-in round-robin rebalance onto returning idle supervisors#8778mwkang wants to merge 1 commit into
mwkang wants to merge 1 commit into
Conversation
EvenScheduler/DefaultScheduler do not move workers onto a supervisor
that returns to service after maintenance: the topology already has
its desired worker count spread across the surviving supervisors, so
the returned supervisor sits at used=0 until an operator rebalances by
hand. Add an opt-in, binary-trigger pass that relocates workers onto
such idle supervisors, round-robin across topologies, in a single
scheduling round. The feature is disabled by default, so existing
clusters see no behavior change.
needsScheduling is deliberately left untouched. The new trigger lives
in Cluster.hasIdleSupervisorReusableBy and is reached only from
EvenScheduler.redistributeOntoIdleSupervisors, which runs at the top of
scheduleTopologiesEvenly and DefaultScheduler.defaultSchedule.
ResourceAwareScheduler (needsSchedulingRas) and the multitenant pools
keep their existing needsScheduling behavior and never enter the new
path, so the feature is scoped to EvenScheduler/DefaultScheduler (and
the leftover topologies IsolationScheduler delegates to them) only.
The trigger is binary -- it fires only when at least one stable,
non-blacklisted supervisor has zero used slots and the topology is not
already on it -- so an "almost balanced" cluster never moves. Each
topology contributes at most one worker per round-robin iteration, so
the returned supervisor ends up hosting workers from several topologies
(preserving the per-supervisor workload diversity a fresh submission
has) instead of letting the first scheduled topology grab the whole
idle capacity. Per-topology relocations in one round are capped at
floor(numWorkers / nonBlacklistedSupervisorCount) * idleSupervisorCount,
tightened further by max.free.per.topology when positive. Workers are
pulled from the supervisor where the topology has the most workers
(ties broken by supervisor id, lexicographically), never draining one
below a single worker, and each pulled worker is placed directly onto
an idle slot so the regular sortSlots/interleave pass cannot drop it
back into the just-vacated slot.
- DaemonConfig / conf/defaults.yaml (dot-only keys):
nimbus.even.rebalance.idle.supervisor.enabled (false)
nimbus.even.rebalance.max.free.per.topology (0 = unbounded)
nimbus.even.rebalance.idle.supervisor.min.stable.rounds (3)
- Cluster: new hasIdleSupervisorReusableBy (trigger) plus
isIdleSupervisorAvailableForEvenRebalance and
hasMinimumIdleSupervisorStability (eligibility + uptime guard,
uptime >= min.stable.rounds * supervisor.monitor.frequency.secs)
that skips a just-returned, possibly-flapping supervisor. All gated
by the enabled flag; needsScheduling itself is unchanged.
- SupervisorDetails.uptimeSecs surfaced from SupervisorInfo so the
uptime guard can be evaluated; legacy constructors default it to
Long.MAX_VALUE (always stable) to leave existing callers unchanged.
- EvenScheduler.redistributeOntoIdleSupervisors returns immediately
when the feature is disabled, so a default (disabled) cluster does
no per-scheduling-round supervisor scanning.
- Add TestEvenSchedulerIdleSupervisor covering the trigger, the
per-topology drain cap, single-worker no-op, one-round even
distribution, round-robin sharing across topologies, the uptime
flap guard, deterministic donor tie-break, blacklist handling, the
DefaultScheduler leftover-subset path, and the IsolationScheduler
interaction (idle non-isolated target only; a reserved host stays
out even when its isolated topology is down).
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What is the purpose of the change
EvenScheduler(and thereforeDefaultScheduler) does not move workers onto a supervisor that returns to service after maintenance. The topology's desired worker count is already satisfied across the surviving supervisors, soneedsSchedulingreports nothing to do and the returned supervisor sits atused = 0until an operator manually rebalances or restarts every affected topology.This PR adds an opt-in, binary-trigger pass to
EvenSchedulerthat relocates already-assigned workers onto such idle supervisors, round-robin across topologies, in a single scheduling round. It is disabled by default, so existing clusters see no behavior change. Implements the proposal in #8590 and folds in the review feedback from that thread.How it works
The trigger and the relocation live entirely on the
EvenSchedulerpath;Cluster.needsSchedulingis intentionally left unchanged (see Scope below).Cluster.hasIdleSupervisorReusableBy(topology)returns true only when at least one stable, non-blacklisted supervisor has zero used slots and the topology is not already on it. Because the check is binary (a supervisor either has zero used slots or it does not), it never fires for an "almost balanced" cluster, so no time-based cooldown is needed.floor(numWorkers / nonBlacklistedSupervisorCount) * idleSupervisorCountworkers per round, tightened further bynimbus.even.rebalance.max.free.per.topologywhen positive. A topology whose budget computes to0(typicallynumWorkers < supervisorCount) is skipped entirely — this is also what stops a single-worker topology from ping-ponging.EvenScheduler.redistributeOntoIdleSupervisorswalks the eligible topologies (ordered by id) and moves at most one worker per topology per iteration until the idle slots are exhausted. A single returning supervisor therefore ends up hosting workers from several topologies, preserving the per-supervisor workload diversity a fresh submission has, instead of letting the first scheduled topology grab the entire idle capacity.cluster.freeSlot()+cluster.assign(), bypassing the regularsortSlots/interleave pass that would otherwise drop some of them straight back into the just-vacated slots.Nimbuspropagates the result the usual way: it diffs the resulting assignments against the existing ones and pushes the delta, so the relocation takes effect even thoughneedsSchedulingis untouched.Scope: RAS, Multitenant, Isolation
The feature is scoped to
EvenScheduler/DefaultScheduler(and the leftover topologiesIsolationSchedulerdelegates to them).Cluster.needsSchedulingis deliberately not modified — the new logic lives in three newClustermethods (hasIdleSupervisorReusableBy,isIdleSupervisorAvailableForEvenRebalance,hasMinimumIdleSupervisorStability) reached only fromEvenScheduler.redistributeOntoIdleSupervisors. This keeps any scheduler that consultsneedsSchedulingfrom picking up a surprise "needs rescheduling" signal.Call-path audit:
EvenScheduler.scheduleTopologiesEvenlyneedsScheduling(unchanged)redistributeOntoIdleSupervisorsdirectly; gated by the default-off flagDefaultScheduler.defaultScheduleneedsScheduling(unchanged)IsolationSchedulerDefaultScheduler.defaultScheduleResourceAwareSchedulerneedsSchedulingRas(unchanged)MultitenantScheduler(DefaultPool/IsolatedPool)cluster.needsScheduling(unchanged)redistribute;needsSchedulingitself is unmodifiedIn words: RAS is intentionally out of scope — it uses
needsSchedulingRasand a different placement engine; a parallel mechanism, if wanted, belongs in a follow-up. Multitenant pools do callneedsScheduling, but since that method is unchanged they are unaffected. Isolation: bothhasIdleSupervisorReusableByandredistributeOntoIdleSupervisorsskip blacklisted supervisors, andIsolationSchedulerrepresents a reserved host by blacklisting it before delegating leftovers — so an isolated host can never be a donor or a target, including the case where its isolated topology is down and the reserved host looks idle.Configuration
All keys are dot-only, matching Storm's convention.
nimbus.even.rebalance.idle.supervisor.enabledfalsenimbus.even.rebalance.max.free.per.topology00= unbounded; the even-distribution budget applies)nimbus.even.rebalance.idle.supervisor.min.stable.rounds30disables the guardThe flap guard keeps workers off a supervisor that has only just returned and may still be flapping on a slow JVM startup or a transient network blip. A supervisor is eligible only once it has been up for at least
min.stable.rounds * supervisor.monitor.frequency.secs(≈9s with the defaults). It reusesSupervisorInfo.uptime_secs, surfaced ontoSupervisorDetails.When you would NOT want to enable this
A relocation is a worker JVM restart: brief tuple replay, JIT re-warmup, and possible windowed/stateful bolt state churn. Keep this off for:
min.stable.rounds, or leave the feature off);Blast radius
In one scheduling pass the simultaneous worker-restart count is
min(idle_slots, eligible_topologies), with each topology's contribution capped atfloor(numWorkers / nonBlacklistedSupervisorCount) * idleSupervisorCount(tightened bymax.free.per.topology). Because every relocation consumes one idle slot, the total per pass is hard-bounded by the returning supervisor's free-slot count.Worked example: one returning supervisor with 8 slots in a 50-topology cluster → 8 simultaneous worker restarts across 8 topologies in one pass, not 50.
A cluster-wide ceiling (
nimbus.even.rebalance.max.relocations.per.round) was considered but not added: the per-topology cap plus the natural idle-slot ceiling already bound the disruption, and an extra knob would only let operators throttle below "fill the returned supervisor in one pass." Happy to add it if reviewers prefer an explicit cluster-wide cap.How was this change tested
New
TestEvenSchedulerIdleSupervisor(storm-server), 16 cases:needsScheduling/needsSchedulingRaspaths staying unaffected;max.free.per.topology;DefaultSchedulerleftover-subset path;IsolationSchedulerinteraction (idle non-isolated target only; reserved host stays out even when its isolated topology is down).Backward compatibility
Default-off with no API removals — only additions. When disabled,
redistributeOntoIdleSupervisorsreturns before scanning any supervisor, so a cluster that has not opted in does no extra per-round work. ExistingSupervisorDetailsconstructors defaultuptimeSecstoLong.MAX_VALUE(always "stable"), leaving every existing caller unchanged.Closes #8590