EvenScheduler: opt-in round-robin rebalance onto returning idle supervisors #8778

Open
mwkang wants to merge 1 commit into apache:master from mwkang:8590-even-rebalance-on-idle-supervisor

@mwkang mwkang commented Jun 1, 2026

What is the purpose of the change

EvenScheduler (and therefore DefaultScheduler) does not move workers onto a supervisor that returns to service after maintenance. The topology's desired worker count is already satisfied across the surviving supervisors, so needsScheduling reports nothing to do and the returned supervisor sits at used = 0 until an operator manually rebalances or restarts every affected topology.

This PR adds an opt-in, binary-trigger pass to EvenScheduler that relocates already-assigned workers onto such idle supervisors, round-robin across topologies, in a single scheduling round. It is disabled by default, so existing clusters see no behavior change. Implements the proposal in #8590 and folds in the review feedback from that thread.

How it works

The trigger and the relocation live entirely on the EvenScheduler path; Cluster.needsScheduling is intentionally left unchanged (see Scope below).

  1. Binary trigger: Cluster.hasIdleSupervisorReusableBy(topology) returns true only when at least one stable, non-blacklisted supervisor has zero used slots and the topology is not already on it. Because the check is binary (a supervisor either has zero used slots or it does not), it never fires for an "almost balanced" cluster, so no time-based cooldown is needed.
  2. Per-topology budget — each topology may relocate at most floor(numWorkers / nonBlacklistedSupervisorCount) * idleSupervisorCount workers per round, tightened further by nimbus.even.rebalance.max.free.per.topology when positive. A topology whose budget computes to 0 (typically numWorkers < supervisorCount) is skipped entirely — this is also what stops a single-worker topology from ping-ponging.
  3. Round-robin relocation: EvenScheduler.redistributeOntoIdleSupervisors walks the eligible topologies (ordered by id) and moves at most one worker per topology per iteration until the idle slots are exhausted. A single returning supervisor therefore ends up hosting workers from several topologies, preserving the per-supervisor workload diversity a fresh submission has, instead of letting the first scheduled topology grab the entire idle capacity.
  4. Deterministic donor selection — a worker is pulled from the supervisor where the topology currently has the most workers (measured by worker count), ties broken by supervisor id, lexicographically. The source supervisor is never drained below one worker for that topology, so it cannot itself become the next round's idle supervisor.
  5. Direct placement — each pulled worker's executors are assigned directly onto an idle slot via cluster.freeSlot() + cluster.assign(), bypassing the regular sortSlots/interleave pass that would otherwise drop some of them straight back into the just-vacated slots.
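The round-robin and donor-selection steps above can be pictured with a small, self-contained model. This is only a sketch under stated assumptions, not the code in this PR: `RoundRobinSketch`, `pickDonor`, and `redistribute` are hypothetical names, placements are reduced to per-supervisor worker counts, per-topology budgets are passed in precomputed, and the tie-break is assumed to favor the lexicographically smallest supervisor id.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.TreeMap;

/** Simplified model of the round-robin pass; these are not Storm's actual classes. */
final class RoundRobinSketch {

    /**
     * Picks the donor for one topology: the supervisor holding the most of its
     * workers. Ties go to the smallest supervisor id (an assumption). A
     * supervisor holding a single worker is never chosen, so no donor is
     * drained to zero.
     */
    static Optional<String> pickDonor(Map<String, Integer> workersBySupervisor) {
        String best = null;
        int bestCount = 1; // a donor must hold more than one worker
        for (Map.Entry<String, Integer> e
                : new TreeMap<String, Integer>(workersBySupervisor).entrySet()) {
            if (e.getValue() > bestCount) { // strict '>' keeps the smaller id on ties
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return Optional.ofNullable(best);
    }

    /**
     * Moves at most one worker per topology per iteration onto idleSupervisor
     * until its free slots, or every topology's budget, run out.
     * Returns the moves as "topology:donor" strings in the order they were made.
     */
    static List<String> redistribute(TreeMap<String, Map<String, Integer>> placement,
                                     Map<String, Integer> budget,
                                     String idleSupervisor,
                                     int freeSlots) {
        List<String> moves = new ArrayList<>();
        Map<String, Integer> remaining = new TreeMap<String, Integer>(budget);
        boolean movedThisIteration = true;
        while (freeSlots > 0 && movedThisIteration) {
            movedThisIteration = false;
            // TreeMap iteration visits topologies ordered by id.
            for (Map.Entry<String, Map<String, Integer>> t : placement.entrySet()) {
                if (freeSlots == 0) {
                    break;
                }
                String topo = t.getKey();
                if (remaining.getOrDefault(topo, 0) <= 0) {
                    continue; // budget of 0: topology is skipped entirely
                }
                Map<String, Integer> bySupervisor = t.getValue();
                // The idle supervisor itself is never a donor.
                Map<String, Integer> candidates = new TreeMap<String, Integer>(bySupervisor);
                candidates.remove(idleSupervisor);
                Optional<String> donor = pickDonor(candidates);
                if (!donor.isPresent()) {
                    continue;
                }
                bySupervisor.merge(donor.get(), -1, Integer::sum);
                bySupervisor.merge(idleSupervisor, 1, Integer::sum);
                remaining.merge(topo, -1, Integer::sum);
                freeSlots--;
                moves.add(topo + ":" + donor.get());
                movedThisIteration = true;
            }
        }
        return moves;
    }
}
```

With topo-a at 3 + 3 workers on sup1/sup2 (budget 2), topo-b at 2 + 1 (budget 1), and an idle sup3 with 4 free slots, this model relocates topo-a from sup1, topo-b from sup1, then topo-a from sup2, and leaves one slot unused because both budgets are spent.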

Nimbus propagates the result the usual way: it diffs the resulting assignments against the existing ones and pushes the delta, so the relocation takes effect even though needsScheduling is untouched.

Scope: RAS, Multitenant, Isolation

The feature is scoped to EvenScheduler/DefaultScheduler (and the leftover topologies IsolationScheduler delegates to them). Cluster.needsScheduling is deliberately not modified — the new logic lives in three new Cluster methods (hasIdleSupervisorReusableBy, isIdleSupervisorAvailableForEvenRebalance, hasMinimumIdleSupervisorStability) reached only from EvenScheduler.redistributeOntoIdleSupervisors. This keeps any scheduler that consults needsScheduling from picking up a surprise "needs rescheduling" signal.

Call-path audit:

| Caller | Trigger it uses | Reaches the new idle-rebalance path? |
| --- | --- | --- |
| EvenScheduler.scheduleTopologiesEvenly | needsScheduling (unchanged) | Yes: calls redistributeOntoIdleSupervisors directly; gated by the default-off flag |
| DefaultScheduler.defaultSchedule | needsScheduling (unchanged) | Yes: same, gated by the flag |
| IsolationScheduler | delegates leftover (non-isolated) topologies to DefaultScheduler.defaultSchedule | Reached but neutralized: isolated hosts are blacklisted, so they are never a donor or a target |
| ResourceAwareScheduler | needsSchedulingRas (unchanged) | No: never calls Even/Default scheduler |
| MultitenantScheduler (DefaultPool / IsolatedPool) | calls cluster.needsScheduling (unchanged) | No: uses its own pools and never reaches redistribute; needsScheduling itself is unmodified |

In words: RAS is intentionally out of scope — it uses needsSchedulingRas and a different placement engine; a parallel mechanism, if wanted, belongs in a follow-up. Multitenant pools do call needsScheduling, but since that method is unchanged they are unaffected. Isolation: both hasIdleSupervisorReusableBy and redistributeOntoIdleSupervisors skip blacklisted supervisors, and IsolationScheduler represents a reserved host by blacklisting it before delegating leftovers — so an isolated host can never be a donor or a target, including the case where its isolated topology is down and the reserved host looks idle.
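To make the blacklist argument concrete, the eligibility check can be modeled as a plain filter. This is an illustrative sketch, not the PR's `Cluster` code: `IdleTriggerSketch`, its `Supervisor` stand-in, and `hasReusableIdleSupervisor` are hypothetical, and stability is reduced to a precomputed boolean.

```java
import java.util.List;
import java.util.Set;

/** Illustrative model of the binary trigger; names here are hypothetical. */
final class IdleTriggerSketch {

    /** Minimal stand-in for a supervisor as the trigger sees it. */
    static final class Supervisor {
        final String id;
        final int usedSlots;
        final boolean blacklisted; // an isolated (reserved) host is represented this way
        final boolean stable;      // result of the uptime flap guard

        Supervisor(String id, int usedSlots, boolean blacklisted, boolean stable) {
            this.id = id;
            this.usedSlots = usedSlots;
            this.blacklisted = blacklisted;
            this.stable = stable;
        }
    }

    /**
     * True only if some stable, non-blacklisted supervisor has zero used slots
     * and the topology has no worker on it yet.
     */
    static boolean hasReusableIdleSupervisor(List<Supervisor> supervisors,
                                             Set<String> supervisorsHostingTopology) {
        for (Supervisor s : supervisors) {
            if (!s.blacklisted
                    && s.stable
                    && s.usedSlots == 0
                    && !supervisorsHostingTopology.contains(s.id)) {
                return true;
            }
        }
        return false;
    }
}
```

In this model a reserved host whose isolated topology is down has `usedSlots == 0` but is still blacklisted, so it never satisfies the trigger.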

Configuration

All keys are dot-only, matching Storm's convention.

| Key | Type | Default | Purpose |
| --- | --- | --- | --- |
| nimbus.even.rebalance.idle.supervisor.enabled | boolean | false | Master switch (opt-in) |
| nimbus.even.rebalance.max.free.per.topology | int | 0 | Optional per-topology upper bound per round (0 = unbounded; the even-distribution budget applies) |
| nimbus.even.rebalance.idle.supervisor.min.stable.rounds | int | 3 | Flap guard; 0 disables the guard |
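As an illustration, opting in might look like the fragment below. The key names come from the table above; the values are examples only, and placing them in the Nimbus host's storm.yaml is an assumption based on how other nimbus.* settings are usually overridden.

```yaml
# storm.yaml on the Nimbus host (illustrative values, not recommendations)
nimbus.even.rebalance.idle.supervisor.enabled: true
# Cap each topology at 2 relocated workers per round (0 would leave only the even-distribution budget)
nimbus.even.rebalance.max.free.per.topology: 2
# Keep the default flap guard
nimbus.even.rebalance.idle.supervisor.min.stable.rounds: 3
```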

The flap guard keeps workers off a supervisor that has only just returned and may still be flapping on a slow JVM startup or a transient network blip. A supervisor is eligible only once it has been up for at least min.stable.rounds * supervisor.monitor.frequency.secs (≈9s with the defaults). It reuses SupervisorInfo.uptime_secs, surfaced onto SupervisorDetails.
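The guard's arithmetic can be sketched as a single comparison. This is a hedged illustration, not the PR's `hasMinimumIdleSupervisorStability`: the class and method names are hypothetical, and the values are passed in directly rather than read from the cluster configuration.

```java
/** Sketch of the uptime flap guard; names are hypothetical. */
final class FlapGuardSketch {

    /**
     * A supervisor counts as stable once its uptime reaches
     * minStableRounds * monitorFrequencySecs. A value of 0 for
     * minStableRounds disables the guard.
     */
    static boolean isStable(long uptimeSecs, int minStableRounds, int monitorFrequencySecs) {
        if (minStableRounds <= 0) {
            return true; // guard disabled
        }
        long requiredSecs = (long) minStableRounds * monitorFrequencySecs;
        return uptimeSecs >= requiredSecs;
    }
}
```

With 3 rounds and a 3-second monitor frequency the threshold is 9 seconds: 8 seconds of uptime fails the check and 9 seconds passes. An uptime of `Long.MAX_VALUE`, the default for the existing `SupervisorDetails` constructors, always passes.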

When you would NOT want to enable this

A relocation is a worker JVM restart: brief tuple replay, JIT re-warmup, and possible windowed/stateful bolt state churn. Keep this off for:

  • topologies with windowed or stateful bolts that pay a non-trivial replay/restore cost;
  • latency-sensitive topologies sensitive to JIT re-warmup;
  • clusters whose supervisors flap (raise min.stable.rounds, or leave the feature off);
  • RAS users (no effect — see Scope).

Blast radius

In one scheduling pass the simultaneous worker-restart count is min(idle_slots, eligible_topologies), with each topology's contribution capped at floor(numWorkers / nonBlacklistedSupervisorCount) * idleSupervisorCount (tightened by max.free.per.topology). Because every relocation consumes one idle slot, the total per pass is hard-bounded by the returning supervisor's free-slot count.

Worked example: one returning supervisor with 8 slots in a 50-topology cluster → 8 simultaneous worker restarts across 8 topologies in one pass, not 50.
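The bound can be checked with a few lines of arithmetic. The sketch below only encodes the two formulas as stated in this section; `BlastRadiusSketch` and its methods are hypothetical, and the worker and supervisor counts used to exercise it are made-up inputs, since the worked example does not give them.

```java
/** Arithmetic behind the blast-radius bound; names and inputs are illustrative. */
final class BlastRadiusSketch {

    /**
     * floor(numWorkers / nonBlacklistedSupervisorCount) * idleSupervisorCount,
     * tightened by maxFreePerTopology when that setting is positive.
     */
    static int perTopologyBudget(int numWorkers,
                                 int nonBlacklistedSupervisorCount,
                                 int idleSupervisorCount,
                                 int maxFreePerTopology) {
        if (nonBlacklistedSupervisorCount <= 0) {
            return 0;
        }
        // Integer division is floor division for non-negative operands.
        int budget = (numWorkers / nonBlacklistedSupervisorCount) * idleSupervisorCount;
        return maxFreePerTopology > 0 ? Math.min(budget, maxFreePerTopology) : budget;
    }

    /** min(idle_slots, eligible_topologies), where eligible means budget > 0. */
    static int simultaneousRestarts(int idleSlots, int[] budgets) {
        int eligible = 0;
        for (int b : budgets) {
            if (b > 0) {
                eligible++;
            }
        }
        return Math.min(idleSlots, eligible);
    }
}
```

For instance, assuming 50 topologies of 10 workers each on 10 non-blacklisted supervisors with one idle supervisor, every budget is 1, all 50 topologies are eligible, and an 8-slot supervisor yields 8 restarts. A single-worker topology on the same cluster gets a budget of 0 and is skipped.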

A cluster-wide ceiling (nimbus.even.rebalance.max.relocations.per.round) was considered but not added: the per-topology cap plus the natural idle-slot ceiling already bound the disruption, and an extra knob would only let operators throttle below "fill the returned supervisor in one pass." Happy to add it if reviewers prefer an explicit cluster-wide cap.

How was this change tested

New TestEvenSchedulerIdleSupervisor (storm-server), 16 cases:

  • disabled-by-default no-op and the binary trigger;
  • the generic needsScheduling / needsSchedulingRas paths staying unaffected;
  • the per-topology drain cap and max.free.per.topology;
  • drain-to-zero protection and the single-worker no-op;
  • one-round even distribution and round-robin sharing across topologies;
  • the uptime flap guard (below-threshold → no move, at-threshold → move);
  • deterministic donor tie-break by supervisor id;
  • blacklisted idle supervisor excluded as a target;
  • the DefaultScheduler leftover-subset path;
  • the IsolationScheduler interaction (idle non-isolated target only; reserved host stays out even when its isolated topology is down).

Backward compatibility

Default-off with no API removals — only additions. When disabled, redistributeOntoIdleSupervisors returns before scanning any supervisor, so a cluster that has not opted in does no extra per-round work. Existing SupervisorDetails constructors default uptimeSecs to Long.MAX_VALUE (always "stable"), leaving every existing caller unchanged.

Closes #8590

EvenScheduler/DefaultScheduler do not move workers onto a supervisor
that returns to service after maintenance: the topology already has
its desired worker count spread across the surviving supervisors, so
the returned supervisor sits at used=0 until an operator rebalances by
hand. Add an opt-in, binary-trigger pass that relocates workers onto
such idle supervisors, round-robin across topologies, in a single
scheduling round. The feature is disabled by default, so existing
clusters see no behavior change.

needsScheduling is deliberately left untouched. The new trigger lives
in Cluster.hasIdleSupervisorReusableBy and is reached only from
EvenScheduler.redistributeOntoIdleSupervisors, which runs at the top of
scheduleTopologiesEvenly and DefaultScheduler.defaultSchedule.
ResourceAwareScheduler (needsSchedulingRas) and the multitenant pools
keep their existing needsScheduling behavior and never enter the new
path, so the feature is scoped to EvenScheduler/DefaultScheduler (and
the leftover topologies IsolationScheduler delegates to them) only.

The trigger is binary -- it fires only when at least one stable,
non-blacklisted supervisor has zero used slots and the topology is not
already on it -- so an "almost balanced" cluster never moves. Each
topology contributes at most one worker per round-robin iteration, so
the returned supervisor ends up hosting workers from several topologies
(preserving the per-supervisor workload diversity a fresh submission
has) instead of letting the first scheduled topology grab the whole
idle capacity. Per-topology relocations in one round are capped at
floor(numWorkers / nonBlacklistedSupervisorCount) * idleSupervisorCount,
tightened further by max.free.per.topology when positive. Workers are
pulled from the supervisor where the topology has the most workers
(ties broken by supervisor id, lexicographically), never draining one
below a single worker, and each pulled worker is placed directly onto
an idle slot so the regular sortSlots/interleave pass cannot drop it
back into the just-vacated slot.

  - DaemonConfig / conf/defaults.yaml (dot-only keys):
      nimbus.even.rebalance.idle.supervisor.enabled (false)
      nimbus.even.rebalance.max.free.per.topology (0 = unbounded)
      nimbus.even.rebalance.idle.supervisor.min.stable.rounds (3)
  - Cluster: new hasIdleSupervisorReusableBy (trigger) plus
    isIdleSupervisorAvailableForEvenRebalance and
    hasMinimumIdleSupervisorStability (eligibility + uptime guard,
    uptime >= min.stable.rounds * supervisor.monitor.frequency.secs),
    which skip a just-returned, possibly-flapping supervisor. All gated
    by the enabled flag; needsScheduling itself is unchanged.
  - SupervisorDetails.uptimeSecs surfaced from SupervisorInfo so the
    uptime guard can be evaluated; legacy constructors default it to
    Long.MAX_VALUE (always stable) to leave existing callers unchanged.
  - EvenScheduler.redistributeOntoIdleSupervisors returns immediately
    when the feature is disabled, so a default (disabled) cluster does
    no per-scheduling-round supervisor scanning.
  - Add TestEvenSchedulerIdleSupervisor covering the trigger, the
    per-topology drain cap, single-worker no-op, one-round even
    distribution, round-robin sharing across topologies, the uptime
    flap guard, deterministic donor tie-break, blacklist handling, the
    DefaultScheduler leftover-subset path, and the IsolationScheduler
    interaction (idle non-isolated target only; a reserved host stays
    out even when its isolated topology is down).
