EvenScheduler: opt-in round-robin rebalance onto returning idle supervisors by mwkang · Pull Request #8778 · apache/storm

mwkang · 2026-06-01T10:48:02Z

What is the purpose of the change

EvenScheduler (and therefore DefaultScheduler) does not move workers onto a supervisor that returns to service after maintenance. The topology's desired worker count is already satisfied across the surviving supervisors, so needsScheduling reports nothing to do and the returned supervisor sits at used = 0 until an operator manually rebalances or restarts every affected topology.

This PR adds an opt-in, binary-trigger pass to EvenScheduler that relocates already-assigned workers onto such idle supervisors, round-robin across topologies, in a single scheduling round. It is disabled by default, so existing clusters see no behavior change. Implements the proposal in #8590 and folds in the review feedback from that thread.

How it works

The trigger and the relocation live entirely on the EvenScheduler path; Cluster.needsScheduling is intentionally left unchanged (see Scope below).

- Binary trigger: Cluster.hasIdleSupervisorReusableBy(topology) returns true only when at least one stable, non-blacklisted supervisor has zero used slots and the topology is not already on it. Because the check is binary (a supervisor either has zero used slots or it does not), it never fires for an "almost balanced" cluster, so no time-based cooldown is needed.
- Per-topology budget: each topology may relocate at most floor(numWorkers / nonBlacklistedSupervisorCount) * idleSupervisorCount workers per round, tightened further by nimbus.even.rebalance.max.free.per.topology when positive. A topology whose budget computes to 0 (typically numWorkers < nonBlacklistedSupervisorCount) is skipped entirely, which is also what stops a single-worker topology from ping-ponging.
- Round-robin relocation: EvenScheduler.redistributeOntoIdleSupervisors walks the eligible topologies (ordered by id) and moves at most one worker per topology per iteration until the idle slots or the budgets are exhausted. A single returning supervisor therefore ends up hosting workers from several topologies, preserving the per-supervisor workload diversity a fresh submission has, instead of letting the first scheduled topology grab the entire idle capacity.
- Deterministic donor selection: a worker is pulled from the supervisor where the topology currently has the most workers, with ties broken by supervisor id in lexicographic order. The source supervisor is never drained below one worker for that topology, so it cannot itself become the next round's idle supervisor.
- Direct placement: each pulled worker's executors are assigned directly onto an idle slot via cluster.freeSlot() + cluster.assign(), bypassing the regular sortSlots/interleave pass that would otherwise drop some of them straight back into the just-vacated slots.
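
The steps above can be sketched as a small standalone model. This is an illustration only, not the code in this PR: IdleRebalanceSketch, budget, pickDonor and redistribute are hypothetical names, placements are plain worker counts per supervisor rather than Storm's Cluster/WorkerSlot objects, and a single idle target supervisor is assumed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Illustrative model only: these names are hypothetical, not Storm APIs. */
final class IdleRebalanceSketch {

    /** floor(numWorkers / nonBlacklistedSupervisors) * idleSupervisors, tightened by a positive cap. */
    static int budget(int numWorkers, int nonBlacklistedSupervisors,
                      int idleSupervisors, int maxFreePerTopology) {
        if (nonBlacklistedSupervisors <= 0) {
            return 0;
        }
        int even = (numWorkers / nonBlacklistedSupervisors) * idleSupervisors; // int division floors
        return maxFreePerTopology > 0 ? Math.min(even, maxFreePerTopology) : even;
    }

    /** Donor = supervisor holding the most workers of the topology (at least 2), smallest id on ties. */
    static String pickDonor(TreeMap<String, Integer> workersBySupervisor, String target) {
        String best = null;
        int bestCount = 1; // a donor keeps one worker, so it must hold more than one
        for (Map.Entry<String, Integer> e : workersBySupervisor.entrySet()) { // ascending id order
            if (!e.getKey().equals(target) && e.getValue() > bestCount) {
                best = e.getKey(); // strict '>' keeps the smaller id on ties
                bestCount = e.getValue();
            }
        }
        return best;
    }

    /**
     * placement maps topologyId -> (supervisorId -> worker count) and is updated in place.
     * Returns the moves in order, formatted as "topology:donor->target".
     */
    static List<String> redistribute(TreeMap<String, TreeMap<String, Integer>> placement,
                                     Map<String, Integer> budgets, String target, int idleSlots) {
        List<String> moves = new ArrayList<>();
        Map<String, Integer> remaining = new TreeMap<>(budgets);
        boolean moved = true;
        while (idleSlots > 0 && moved) {
            moved = false;
            for (String topo : placement.keySet()) { // TreeMap iterates topologies by id
                if (idleSlots == 0) {
                    break;
                }
                if (remaining.getOrDefault(topo, 0) <= 0) {
                    continue; // budget of 0 (or used up): topology is skipped
                }
                TreeMap<String, Integer> bySupervisor = placement.get(topo);
                String donor = pickDonor(bySupervisor, target);
                if (donor == null) {
                    continue; // no supervisor can give up a worker without draining to zero
                }
                bySupervisor.merge(donor, -1, Integer::sum);
                bySupervisor.merge(target, 1, Integer::sum);
                remaining.merge(topo, -1, Integer::sum);
                idleSlots--;
                moves.add(topo + ":" + donor + "->" + target); // one worker per topology per iteration
                moved = true;
            }
        }
        return moves;
    }
}
```

As a usage example under these assumptions: with topo-a at 3 workers on each of s1 and s2, topo-b at 2 workers on each, three non-blacklisted supervisors and a returning s3 with 4 free slots, the budgets are 2 and 1, and the moves come out as topo-a:s1->s3, topo-b:s1->s3, topo-a:s2->s3. One slot on s3 stays free because both budgets are exhausted.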

Nimbus propagates the result the usual way: it diffs the resulting assignments against the existing ones and pushes the delta, so the relocation takes effect even though needsScheduling is untouched.

Scope: RAS, Multitenant, Isolation

The feature is scoped to EvenScheduler/DefaultScheduler (and the leftover topologies IsolationScheduler delegates to them). Cluster.needsScheduling is deliberately not modified — the new logic lives in three new Cluster methods (hasIdleSupervisorReusableBy, isIdleSupervisorAvailableForEvenRebalance, hasMinimumIdleSupervisorStability) reached only from EvenScheduler.redistributeOntoIdleSupervisors. This keeps any scheduler that consults needsScheduling from picking up a surprise "needs rescheduling" signal.

Call-path audit:

| Caller | Trigger it uses | Reaches the new idle-rebalance path? |
| --- | --- | --- |
| `EvenScheduler.scheduleTopologiesEvenly` | `needsScheduling` (unchanged) | Yes: calls `redistributeOntoIdleSupervisors` directly; gated by the default-off flag |
| `DefaultScheduler.defaultSchedule` | `needsScheduling` (unchanged) | Yes: same, gated by the flag |
| `IsolationScheduler` | delegates leftover (non-isolated) topologies to `DefaultScheduler.defaultSchedule` | Reached but neutralized: isolated hosts are blacklisted, so they are never a donor or a target |
| `ResourceAwareScheduler` | `needsSchedulingRas` (unchanged) | No: never calls Even/Default scheduler |
| `MultitenantScheduler` (`DefaultPool` / `IsolatedPool`) | calls `cluster.needsScheduling` (unchanged) | No: uses its own pools and never reaches `redistribute`; `needsScheduling` itself is unmodified |

In words: RAS is intentionally out of scope — it uses needsSchedulingRas and a different placement engine; a parallel mechanism, if wanted, belongs in a follow-up. Multitenant pools do call needsScheduling, but since that method is unchanged they are unaffected. Isolation: both hasIdleSupervisorReusableBy and redistributeOntoIdleSupervisors skip blacklisted supervisors, and IsolationScheduler represents a reserved host by blacklisting it before delegating leftovers — so an isolated host can never be a donor or a target, including the case where its isolated topology is down and the reserved host looks idle.

Configuration

All keys are dot-only, matching Storm's convention.

| Key | Type | Default | Purpose |
| --- | --- | --- | --- |
| `nimbus.even.rebalance.idle.supervisor.enabled` | boolean | `false` | Master switch (opt-in) |
| `nimbus.even.rebalance.max.free.per.topology` | int | `0` | Optional per-topology upper bound per round (`0` = unbounded; the even-distribution budget applies) |
| `nimbus.even.rebalance.idle.supervisor.min.stable.rounds` | int | `3` | Flap guard; `0` disables the guard |

The flap guard keeps workers off a supervisor that has only just returned and may still be flapping on a slow JVM startup or a transient network blip. A supervisor is eligible only once it has been up for at least min.stable.rounds * supervisor.monitor.frequency.secs (≈9s with the defaults). It reuses SupervisorInfo.uptime_secs, surfaced onto SupervisorDetails.
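
A minimal sketch of the eligibility check, assuming the configuration is available as a plain map. IdleSupervisorEligibilitySketch, isStable and isIdleTarget are hypothetical names, not the Cluster methods added by this PR. The fallback of 3 for supervisor.monitor.frequency.secs is taken from the "≈9s with the defaults" figure above.

```java
import java.util.Map;

/** Illustrative model only: class and method names are hypothetical, not Storm APIs. */
final class IdleSupervisorEligibilitySketch {
    static final String MIN_STABLE_ROUNDS =
        "nimbus.even.rebalance.idle.supervisor.min.stable.rounds";
    static final String MONITOR_FREQ_SECS = "supervisor.monitor.frequency.secs";

    /** Flap guard: uptime must reach minStableRounds * monitor frequency; 0 rounds disables it. */
    static boolean isStable(long uptimeSecs, Map<String, Object> conf) {
        long rounds = ((Number) conf.getOrDefault(MIN_STABLE_ROUNDS, 3)).longValue();
        long freqSecs = ((Number) conf.getOrDefault(MONITOR_FREQ_SECS, 3)).longValue();
        return rounds <= 0 || uptimeSecs >= rounds * freqSecs;
    }

    /** A relocation target must be non-blacklisted, completely unused, and past the flap guard. */
    static boolean isIdleTarget(int usedSlots, boolean blacklisted,
                                long uptimeSecs, Map<String, Object> conf) {
        return !blacklisted && usedSlots == 0 && isStable(uptimeSecs, conf);
    }
}
```

With an empty map (defaults), a supervisor that has been up for 8 seconds is rejected and one that has been up for 9 seconds is accepted, which matches the below-threshold and at-threshold cases.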

When you would NOT want to enable this

A relocation is a worker JVM restart: brief tuple replay, JIT re-warmup, and possible windowed/stateful bolt state churn. Keep this off for:

- topologies with windowed or stateful bolts that pay a non-trivial replay/restore cost;
- latency-sensitive topologies that cannot absorb JIT re-warmup;
- clusters whose supervisors flap (raise min.stable.rounds, or leave the feature off);
- RAS users (no effect; see Scope).

Blast radius

In one scheduling pass the number of simultaneous worker restarts is at most min(idle_slots, sum of the per-topology budgets), where each topology's budget is floor(numWorkers / nonBlacklistedSupervisorCount) * idleSupervisorCount (tightened by max.free.per.topology). When eligible topologies outnumber idle slots, the round-robin order spreads those restarts one per topology, giving min(idle_slots, eligible_topologies). Because every relocation consumes one idle slot, the total per pass is hard-bounded by the returning supervisor's free-slot count.

Worked example: one returning supervisor with 8 slots in a 50-topology cluster → 8 simultaneous worker restarts across 8 topologies in one pass, not 50.
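
The arithmetic behind the worked example can be checked with a short sketch. BlastRadiusSketch and its methods are hypothetical, and the per-topology budget of 2 used below is an assumed input, since the example does not state worker or supervisor counts.

```java
import java.util.List;

/** Illustrative arithmetic only: names are hypothetical, not Storm APIs. */
final class BlastRadiusSketch {

    /** Upper bound on relocations in one pass: the idle slots, or the summed budgets if smaller. */
    static int maxRelocations(int idleSlots, List<Integer> budgets) {
        long total = 0;
        for (int b : budgets) {
            total += b;
        }
        return (int) Math.min(idleSlots, total);
    }

    /** Round-robin takes one worker per topology per iteration, so at most this many are touched. */
    static int maxTopologiesTouched(int idleSlots, List<Integer> budgets) {
        int eligible = 0;
        for (int b : budgets) {
            if (b > 0) {
                eligible++; // a budget of 0 means the topology is skipped
            }
        }
        return Math.min(idleSlots, eligible);
    }
}
```

For 50 topologies with an assumed budget of 2 each and 8 idle slots, both methods return 8: eight restarts spread over eight topologies. With only two eligible topologies of budget 4 each, the restart bound is still 8 but only two topologies are touched.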

A cluster-wide ceiling (nimbus.even.rebalance.max.relocations.per.round) was considered but not added: the per-topology cap plus the natural idle-slot ceiling already bound the disruption, and an extra knob would only let operators throttle below "fill the returned supervisor in one pass." Happy to add it if reviewers prefer an explicit cluster-wide cap.

How was this change tested

New TestEvenSchedulerIdleSupervisor (storm-server), 16 cases:

- disabled-by-default no-op and the binary trigger;
- the generic needsScheduling / needsSchedulingRas paths staying unaffected;
- the per-topology drain cap and max.free.per.topology;
- drain-to-zero protection and the single-worker no-op;
- one-round even distribution and round-robin sharing across topologies;
- the uptime flap guard (below-threshold → no move, at-threshold → move);
- deterministic donor tie-break by supervisor id;
- blacklisted idle supervisor excluded as a target;
- the DefaultScheduler leftover-subset path;
- the IsolationScheduler interaction (idle non-isolated target only; reserved host stays out even when its isolated topology is down).

Backward compatibility

Default-off with no API removals — only additions. When disabled, redistributeOntoIdleSupervisors returns before scanning any supervisor, so a cluster that has not opted in does no extra per-round work. Existing SupervisorDetails constructors default uptimeSecs to Long.MAX_VALUE (always "stable"), leaving every existing caller unchanged.
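
The constructor defaulting can be pictured with a stripped-down stand-in. SupervisorDetailsSketch is hypothetical and carries only an id and an uptime; the real SupervisorDetails has more fields and constructors than shown here.

```java
/** Illustrative stand-in only: not the real SupervisorDetails class. */
final class SupervisorDetailsSketch {
    private final String id;
    private final long uptimeSecs;

    /** Legacy-style constructor: no uptime supplied, so the supervisor always counts as stable. */
    SupervisorDetailsSketch(String id) {
        this(id, Long.MAX_VALUE);
    }

    /** New-style constructor: uptime is passed through from the supervisor's reported info. */
    SupervisorDetailsSketch(String id, long uptimeSecs) {
        this.id = id;
        this.uptimeSecs = uptimeSecs;
    }

    String getId() {
        return id;
    }

    long getUptimeSecs() {
        return uptimeSecs;
    }
}
```

Because Long.MAX_VALUE exceeds any min.stable.rounds * supervisor.monitor.frequency.secs threshold, callers that never supply an uptime pass the flap guard unconditionally.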

Closes #8590

EvenScheduler/DefaultScheduler do not move workers onto a supervisor that returns to service after maintenance: the topology already has its desired worker count spread across the surviving supervisors, so the returned supervisor sits at used=0 until an operator rebalances by hand.

Add an opt-in, binary-trigger pass that relocates workers onto such idle supervisors, round-robin across topologies, in a single scheduling round. The feature is disabled by default, so existing clusters see no behavior change.

needsScheduling is deliberately left untouched. The new trigger lives in Cluster.hasIdleSupervisorReusableBy and is reached only from EvenScheduler.redistributeOntoIdleSupervisors, which runs at the top of scheduleTopologiesEvenly and DefaultScheduler.defaultSchedule. ResourceAwareScheduler (needsSchedulingRas) and the multitenant pools keep their existing needsScheduling behavior and never enter the new path, so the feature is scoped to EvenScheduler/DefaultScheduler (and the leftover topologies IsolationScheduler delegates to them) only.

The trigger is binary -- it fires only when at least one stable, non-blacklisted supervisor has zero used slots and the topology is not already on it -- so an "almost balanced" cluster never moves. Each topology contributes at most one worker per round-robin iteration, so the returned supervisor ends up hosting workers from several topologies (preserving the per-supervisor workload diversity a fresh submission has) instead of letting the first scheduled topology grab the whole idle capacity. Per-topology relocations in one round are capped at floor(numWorkers / nonBlacklistedSupervisorCount) * idleSupervisorCount, tightened further by max.free.per.topology when positive.

Workers are pulled from the supervisor where the topology has the most workers (ties broken by supervisor id, lexicographically), never draining one below a single worker, and each pulled worker is placed directly onto an idle slot so the regular sortSlots/interleave pass cannot drop it back into the just-vacated slot.

- DaemonConfig / conf/defaults.yaml (dot-only keys):
  nimbus.even.rebalance.idle.supervisor.enabled (false)
  nimbus.even.rebalance.max.free.per.topology (0 = unbounded)
  nimbus.even.rebalance.idle.supervisor.min.stable.rounds (3)
- Cluster: new hasIdleSupervisorReusableBy (trigger) plus isIdleSupervisorAvailableForEvenRebalance and hasMinimumIdleSupervisorStability (eligibility + uptime guard, uptime >= min.stable.rounds * supervisor.monitor.frequency.secs) that skips a just-returned, possibly-flapping supervisor. All gated by the enabled flag; needsScheduling itself is unchanged.
- SupervisorDetails.uptimeSecs surfaced from SupervisorInfo so the uptime guard can be evaluated; legacy constructors default it to Long.MAX_VALUE (always stable) to leave existing callers unchanged.
- EvenScheduler.redistributeOntoIdleSupervisors returns immediately when the feature is disabled, so a default (disabled) cluster does no per-scheduling-round supervisor scanning.
- Add TestEvenSchedulerIdleSupervisor covering the trigger, the per-topology drain cap, single-worker no-op, one-round even distribution, round-robin sharing across topologies, the uptime flap guard, deterministic donor tie-break, blacklist handling, the DefaultScheduler leftover-subset path, and the IsolationScheduler interaction (idle non-isolated target only; a reserved host stays out even when its isolated topology is down).

mwkang mentioned this pull request Jun 1, 2026

[PROPOSAL] EvenScheduler: opt-in round-robin rebalance to populate returning idle supervisors #8590

Open

mwkang wants to merge 1 commit into apache:master from mwkang:8590-even-rebalance-on-idle-supervisor
