A locality-first virtual thread scheduler for the JVM. Virtual threads that work together run on the same carrier thread, reducing context switches and improving cache locality.
You want locality. The default ForkJoinPool scatters virtual threads across carriers randomly. When a Netty handler spawns a VT for blocking work, it runs on a different thread; when it posts back, that's a wakeup and a cache miss. This scheduler keeps related work on the same carrier — fewer context switches, better cache hit rates, less CPU wasted on handoffs. See Netty with NIO or without Netty.
You need native transports. Kernel I/O calls (epoll_wait, io_uring_enter) pin the virtual thread to its carrier. On ForkJoinPool, each native poller permanently occupies a shared carrier thread. This scheduler gives each native poller its own dedicated carrier, so pinning doesn't starve the rest of the system. See Netty with native transport.
You want control. The scheduler exposes a simple API to register your own I/O pollers, get per-carrier thread factories, and decide exactly which virtual threads share a carrier. No black-box scheduling decisions — you control the topology. See writing a custom pinned poller.
This scheduler doesn't replace ForkJoinPool — it runs alongside it. Thread.ofVirtual() still creates virtual threads on the default FJP, and third-party libraries that create their own virtual threads are unaffected. You choose which work runs on which scheduler:
var group = new VirtualIoNativePollerEventLoopGroup(EpollIoHandler.newFactory());
var schedulerFactory = group.vThreadFactory();
schedulerFactory.newThread(() -> {
// scope with our factory → forked tasks stay on the same carrier
try (var scope = StructuredTaskScope.open(allSuccessfulOrThrow(),
cf -> cf.withThreadFactory(schedulerFactory))) {
scope.fork(() -> fetchFromCache());
scope.join();
}
// scope without factory → forked tasks run on default FJP
try (var scope = StructuredTaskScope.open(allSuccessfulOrThrow())) {
scope.fork(() -> callThirdPartyLibrary());
scope.join();
}
}).start();Split topology (default) — 8 OS threads for 4 cores, every arrow is a handoff:
graph LR
subgraph Netty I/O threads
IO0[IO thread 0]
IO1[IO thread 1]
IO2[IO thread 2]
IO3[IO thread 3]
end
subgraph ForkJoinPool
FJ0[FJP carrier 0]
FJ1[FJP carrier 1]
FJ2[FJP carrier 2]
FJ3[FJP carrier 3]
end
IO0 -- handoff --> FJ0
IO1 -- handoff --> FJ1
IO2 -- handoff --> FJ2
IO3 -- handoff --> FJ3
FJ0 -. wakeup .-> IO0
FJ1 -. wakeup .-> IO1
FJ2 -. wakeup .-> IO2
FJ3 -. wakeup .-> IO3
Unified topology (this scheduler) — 4 OS threads, no cross-pool handoffs:
graph TB
subgraph EventLoopSchedulerGroup
subgraph Carrier 0
P0[I/O poller VT]
V0[handler VTs]
end
subgraph Carrier 1
P1[I/O poller VT]
V1[handler VTs]
end
subgraph Carrier 2
P2[I/O poller VT]
V2[handler VTs]
end
subgraph Carrier 3
P3[I/O poller VT]
V3[handler VTs]
end
end
This scheduler runs I/O and virtual threads on the same carrier threads. No oversubscription, no cross-pool handoffs, and native pollers get their own dedicated carrier.
For the full analysis with benchmarks, see the talk at Devoxx and slides.
- A carrier-affinity scheduler (
EventLoopSchedulerGroup) — a global pool of permanent carrier threads, each with its own MPSC queue. Virtual threads created from a carrier's factory have affinity to that carrier. - Netty integration — drop-in event loop groups that run Netty I/O on the scheduler's carriers, so handler-spawned virtual threads stay on the same carrier as the event loop that received the request.
| Transport | Class | Pinned poller? |
|---|---|---|
| NIO | VirtualIoNioPollerEventLoopGroup |
Yes — priority poller, anti-steal; NIO parks via Loom (carrier freed) |
| EPOLL / IO_URING | VirtualIoNativePollerEventLoopGroup |
Yes — one per carrier, kernel I/O pins carrier |
| NIO (lightweight) | VirtualIoEventLoopGroup |
No — event loop as regular VT, no anti-steal |
| No Netty | EventLoopSchedulerGroup |
Optional — registerPinnedPoller() |
EventLoopSchedulerGroup (global singleton, N carriers)
├── EventLoopScheduler[0]
│ ├── carrier thread (permanent, daemon)
│ ├── MPSC run queue
│ └── virtualThreadFactory() → VTs with affinity to this carrier
├── EventLoopScheduler[1]
│ └── ...
└── EventLoopScheduler[N-1]
Carrier threads run forever (like ForkJoinPool workers). They drain the run queue and execute virtual thread continuations. When idle, they park.
Virtual threads created via a scheduler's virtualThreadFactory() are scheduled on that carrier. When they block, they park (freeing the carrier); when they resume, the continuation goes back into the same carrier's run queue.
A carrier can optionally host a pinned poller — a long-running virtual thread dedicated to kernel I/O (epoll_wait, io_uring_enter). It runs as a VT (not directly on the carrier thread) to avoid deadlocking the carrier: all carrier-to-VT coordination goes through lock-free structures. The scheduler coordinates preemption: when external VTs have work queued, the poller yields; when nothing is pending, the poller can block in kernel I/O.
- A Loom-enabled JDK (Java 27+, builds.shipilev.net or build from openjdk/loom)
- JVM flag:
-Djdk.virtualThreadScheduler.implClass=io.netty.loom.scheduler.NettyScheduler - Maven 3.6+
Use VirtualIoNioPollerEventLoopGroup. NIO parks via Loom (carrier is freed), so the event loop doesn't pin the carrier. The poller registration gives the event loop priority scheduling and prevents it from being stolen when work stealing is enabled.
var group = new VirtualIoNioPollerEventLoopGroup(NioIoHandler.newFactory());
new ServerBootstrap()
.group(group)
.channel(NioServerSocketChannel.class)
.childHandler(/* ... */)
.bind(8080).sync();For a lightweight alternative without anti-steal guarantees, use VirtualIoEventLoopGroup:
var group = new VirtualIoEventLoopGroup(4, NioIoHandler.newFactory());Use VirtualIoNativePollerEventLoopGroup. It registers a pinned poller on each carrier — a virtual thread that runs the Netty event loop and does kernel I/O (epoll_wait, io_uring_enter) with affinity to that carrier.
var group = new VirtualIoNativePollerEventLoopGroup(EpollIoHandler.newFactory());
new ServerBootstrap()
.group(group)
.channel(EpollServerSocketChannel.class)
.childHandler(new ChannelInitializer<SocketChannel>() {
@Override
protected void initChannel(SocketChannel ch) {
ch.pipeline().addLast(new MyHandler(group));
}
})
.bind(8080).sync();Spawn virtual threads from a handler — they run on the same carrier as the event loop:
class MyHandler extends SimpleChannelInboundHandler<FullHttpRequest> {
private final VirtualIoNativePollerEventLoopGroup group;
@Override
protected void channelRead0(ChannelHandlerContext ctx, FullHttpRequest req) {
group.vThreadFactory().newThread(() -> {
// blocking work here — same carrier as the event loop
byte[] result = blockingHttpCall();
ctx.channel().eventLoop().execute(() -> writeResponse(ctx, result));
}).start();
}
}Use the scheduler directly. No Netty dependency needed — just the bootstrap module.
var group = EventLoopSchedulerGroup.instance();
var scheduler = group.scheduler(0);
// all VTs from this factory have affinity to carrier 0
ThreadFactory tf = scheduler.virtualThreadFactory();
tf.newThread(() -> {
// runs on carrier 0
doWork();
}).start();Round-robin across all carriers:
var group = EventLoopSchedulerGroup.instance();
for (int i = 0; i < tasks; i++) {
var scheduler = group.scheduler(i % group.size());
scheduler.virtualThreadFactory().newThread(() -> doWork()).start();
}Register your own I/O poller on a carrier via registerPinnedPoller. The poller runs as a virtual thread with affinity to the carrier (not directly on the carrier thread — this avoids deadlocks by keeping all coordination lock-free). The returned CompletionStage completes when the poller exits and the slot is freed.
var scheduler = EventLoopSchedulerGroup.instance().scheduler(0);
CompletionStage<Void> termination = scheduler.registerPinnedPoller(
() -> {}, // spinning — never sleeps, wakeup is a no-op
() -> {
while (!shutdown) {
int events = doPollNonBlocking();
scheduler.maybeYield(events > 0); // report I/O activity
processTasks();
scheduler.maybeYield(events > 0);
}
}
);That's the simplest correct poller — non-blocking poll, yield between phases, no wakeup coordination needed. The scheduler handles the rest.
A pinned poller has three responsibilities:
-
Yield CPU time and report I/O activity. Call
maybeYield(hadIoWork)between phases. The boolean argument tells the scheduler whether the poller processed I/O events since the last call:hadIoWork=true: the poller has I/O to process. The scheduler only yields for local VT preemption — no work stealing, because the carrier is doing useful work.hadIoWork=false: the poller is idle. The scheduler may attempt to steal work from an overloaded sibling (if work stealing is enabled), helping reduce tail latency when load is uneven.
What
hadIoWorkcontrols is the steal path: whenfalse, the scheduler may steal from an overloaded sibling; whentrue, stealing is suppressed because the carrier is doing useful I/O work. Passingtrueincorrectly would prevent the carrier from ever stealing, while passingfalseincorrectly would make the carrier steal when it should be doing I/O. Get this right — it's the feedback loop between the poller and the scheduler.A spinning poller (one that never blocks) will never enter PARKED state, so the nSearching directed-steal protocol cannot target it. But it participates in the pull path —
tryStealinginmaybeYieldprobes siblings whenhadIoWork=false. It can also be stolen FROM via the directed path (another carrier wakes an idle sibling to steal from this carrier's queue viasignalWorkFor). -
Terminate. The poller slot is freed when the body
Runnablereturns (via try-finally internally). The returnedCompletionStagecompletes after cleanup, so callers can wait for the slot to be available again. -
Handle wakeup signals. The wakeup
Runnableis called from any thread when the scheduler needs to interrupt the poller's blocking I/O (e.g., eventfd write for epoll/io_uring). It must be thread-safe and idempotent. The scheduler only calls it after CAS'ing the carrier state — never spuriously.- Non-blocking poller (spin-poll): use a no-op wakeup (
() -> {}) — the poller never sleeps, so there is nothing to interrupt. - Blocking poller (native transport): track whether you're inside the blocking syscall (e.g. a volatile flag set before
epoll_wait()and cleared after). When the wakeup fires, send the interrupt signal (e.g. eventfd write) if the flag is set. The wakeup is aRunnable— it doesn't return a value. - Loom-friendly poller (NIO): use a no-op wakeup — NIO's
select()parks via Loom, freeing the carrier. The carrier parks in its scheduler loop and is woken viaLockSupport.unpark(), not via the poller wakeup. SeeVirtualIoNioPollerEventLoopGroup.
- Non-blocking poller (spin-poll): use a no-op wakeup (
-
Never miss a wakeup — if you choose to block. The simple poller above never blocks, so it doesn't need wakeup coordination. But if you want to block in kernel I/O when idle (to save CPU), the blocking path introduces a coordination problem.
The blocking decision must be made inside your transport — the transport must advertise it's about to sleep (store a flag) before checking
canParkPoller()(a load), with a StoreLoad barrier between the two so the load cannot slip before the advertisement. This is the Seastar sleep/wakeup pattern: the symmetric store-barrier-load on both producer and consumer sides ensures at least one side always sees the other's store.This is how the native transport integration works:
canParkPoller()is injected into the ManualIoEventLoop via override, the transport advertises sleep via a volatile write (pollerRunning.set(false)) before readingcanParkPoller(), and the wakeup sends an eventfd write to interrupt the blocked poller (see netty#15922):var pollerRunning = new AtomicBoolean(false); var eventLoop = new ManualIoEventLoop(parent, null, handlerFactory) { @Override public boolean canBlock() { return scheduler.canParkPoller(); } }; scheduler.registerPinnedPoller( () -> eventLoop.wakeup(), // wakeup: interrupt blocking I/O () -> { pollerRunning.set(true); boolean canBlock = false; while (!eventLoop.isShuttingDown()) { int events; if (canBlock && scheduler.tryParkPoller()) { pollerRunning.set(false); try { events = eventLoop.run(MAX_WAIT_NS, YIELD_NS); } finally { pollerRunning.set(true); scheduler.unpark(); } } else { events = eventLoop.runNow(YIELD_NS); } boolean hadVtWork = scheduler.maybeYield(events > 0); canBlock = events == 0 && !hadVtWork; } } );
A blocking poller must follow three steps:
tryParkPoller()— transitions the carrier to PARKED. If it returnsfalse, do a non-blocking poll instead.canParkPoller()(viacanBlock()) — called by the transport right before the actual blocking syscall. Verifies the carrier is still PARKED and no work arrived. BetweentryParkPoller()and the blocking call, the poller VT may be transiently descheduled (e.g. contended lock), during which the carrier loop can reset PARKED to RUNNING. IfcanParkPoller()returnsfalse, the transport must skip the blocking syscall and do a non-blocking poll — the poller loop will callunpark()and retry on the next iteration.unpark()— called immediately after waking from blocking I/O. Resets the carrier state and handles any pending directed-steal signals.
Beyond the state check, the wakeup signal itself must not be lost. Two approaches:
Permit-based (lock-free): The transport's wakeup is sticky — if called before the blocking call starts, the blocking call returns immediately. Examples:
eventfd(stays readable until consumed),Selector.wakeup()(sets a flag),LockSupport.unpark()(stores a permit). This is what Netty's transports use.Lock-based (rendezvous): The
canParkPoller()check and the blocking wait happen inside a lock shared withwakeup(). The signal cannot slip between the check and the wait. Note:Condition.signal()is not sticky — if it arrives beforeCondition.await(), it's lost. The queue-empty check must be inside the locked region.For background on why this coordination is subtle, see:
- Seastar's memory barrier approach — the symmetric store-barrier-load pattern between producer and consumer
- Mechanical-sympathy discussion — why
Condition.signal()deadlocks when it arrives beforeCondition.await(), and why permit-based mechanisms (LockSupport.park/unpark) don't have this problem - Viktor Klang's actor — the atomic-flag-with-recheck pattern
Additional constraints:
canParkPoller()verifies the carrier is still PARKED and no work is pending. It is a snapshot — never cache the result.- One poller per carrier.
registerPinnedPollerthrows if a poller is already registered.
For a deeper look at the store-barrier-load protocol, JCStress proofs that the guard prevents missed wakeups (and that removing it causes 94% signal loss), and the BlockingPollGuard utility that encapsulates it, see concurrency-tests/README.md.
See PERFORMANCE.md for latency/throughput analysis, idle spin
trade-offs, work stealing tuning, and FJP comparison with benchmark commands.
| Property | Default | Description |
|---|---|---|
io.netty.loom.schedulers |
availableProcessors() |
Number of carrier threads |
io.netty.loom.yield.us |
50 |
Yield duration in microseconds |
io.netty.loom.idleSpins |
0 |
Idle spin iterations before blocking. When >0, the poller spins this many iterations (calling Thread.onSpinWait()) before entering the blocking I/O path. Prevents batch formation at the cost of higher CPU usage below saturation. The effective spin duration depends on the transport: each iteration includes a non-blocking I/O poll — measured at ~0.42µs for epoll (syscall) and ~0.05µs for io_uring (shared-memory CQ peek) under continuous spinning. So 256 spins ≈ 108µs with epoll but only ~13µs with io_uring. The same spin count is not interchangeable across transports. |
io.netty.loom.resumed.continuations |
1024 |
Initial MPSC queue capacity |
io.netty.loom.workstealing.enabled |
false |
Enable work stealing (experimental) |
Work stealing allows idle carriers to help overloaded siblings by taking virtual threads from their run queues. It is opt-in and designed around a locality-first principle.
Enable with -Dio.netty.loom.workstealing.enabled=true.
Each virtual thread has a home carrier — the carrier whose factory created it. The VT's I/O (sockets, channels) is registered on the home carrier's event loop. Running the VT on its home carrier means I/O completions arrive on the same carrier with no cross-carrier wakeups, no cache misses, and no eventfd writes.
Stealing violates locality. A stolen VT runs on a foreign carrier. When it
completes and calls channel.writeAndFlush(), the write goes to the home carrier's
event loop task queue — a cross-carrier hop. When the VT yields or blocks, its
continuation routes back to the home carrier's MPSC queue (non-sticky stealing).
This is why stealing is reluctant — only steal when idle, only from siblings
with queued work. The design is topology-aware: when LinuxCarrierTopology is
enabled, carriers are grouped by LLC cluster and siblings are ordered by proximity.
The protocol is a hybrid of ideas from three systems, adapted for our unique constraint: carriers have a primary job (I/O + own queue), and work stealing is opt-in.
Go runtime (src/runtime/proc.go):
- The
nmspinningcounter andwakep()protocol inspired our per-clusternSearchingcounter with CAS 0→1 admission control. - Go's dual-sided
#StoreLoadbarrier protocol (submitters checknmspinning, spinners re-check all queues after decrementing) ensures signals are never lost. OurtryStartSearcher+signalWorkForfollows the same principle. - Go allows up to
GOMAXPROCS/2concurrent spinners; we cap at 1 per cluster. Go's workers have no primary job — spinning is their idle state. Our carriers should be doing I/O, not scanning. - Known issues in Go's protocol: #43997 (non-spinning Ms spin uselessly — 12% wall-time reduction when fixed), #13527 (Gosched causes futex thrashing — 5% CPU wasted on futex calls).
- See: Scalable Go Scheduler Design by Dmitry Vyukov.
ForkJoinPool (java.util.concurrent.ForkJoinPool):
- FJP uses a Treiber stack (LIFO) for idle workers — O(1) CAS pop to wake
the most-recently-parked worker (cache-warm). We use an
IdleCarrierTrackerbitmap instead, which supports topology-aware waking (wakeFirstIdlecan prefer nearby carriers). - FJP has no nSearching counter — all idle workers scan simultaneously with randomized exhaustive strides. Contention is managed by natural backoff and deactivation, not admission control. This works because FJP workers exist only to steal; our carriers have I/O to do.
- FJP gates steal-signal propagation:
signalWork()after a steal fires only whenprevSrc != src(first steal from a new queue) AND the queue still has items. For VT tasks (noUserHelp() != 0), propagation is unconditional. OursignalWorkForafter a steal is currently unconditional. - FJP annotates hot fields with
@Contendedfor false-sharing protection:ctl(the idle stack + counters), and WorkQueue'sphase,source,parking,stackPred,top,nsteals. OurClusterState.nSearchinghas no padding — a known gap. - References: Chase and Lev (SPAA 2005), Michael-Saraswat-Vechev (PPoPP 2009), Le-Pop-Cohen-Nardelli (PPoPP 2013).
Danny Thomas's cluster-aware experiments:
- virtual-threads-cluster-aware
on the loom-dev mailing list
(September 2024). His
CHOOSE_TWOplacement strategy — power-of-2 choices for selecting the least loaded cluster — delivered ~29% more throughput than the default FJP scheduler on an AMD EPYC 9R14 (CCX-clustered L3 caches).
The naive approach — wake an idle thread every time a goroutine/VT is readied — leads to excessive thread parking/unparking. Go's runtime documents this explicitly:
Three rejected approaches that would work badly: [...] (3) Unpark an additional thread whenever we ready a goroutine and there is an idle P, but don't do handoff. This would lead to excessive thread parking/unparking as the additional threads will instantly park without discovering any work to do.
Go's solution: maintain at least one spinning thread (searching for work). If a
spinner already exists (nmspinning > 0), don't wake another — the existing spinner
will discover the work. When the last spinner finds work and stops spinning, it wakes
a replacement. This "smooths out unjustified spikes of thread unparking, but at the
same time guarantees eventual maximal CPU parallelism utilization."
Our nSearching counter serves the same purpose: prevent wakeup storms. If a search
is already in flight (CAS 0→1 succeeded), additional signalWorkFor calls are
suppressed. When the searcher completes, the chain propagates if the victim still has
work. One searcher at a time, sequential recruitment — no thundering herd.
The trade-off: with nSearching=1 per cluster, only one victim gets push-path help
at a time. If multiple carriers in the same cluster are overloaded, redistribution
is serialized. Whether this matters depends on how often multiple victims coexist —
we don't have data on this yet. A direct signal: how often tryStartSearcher fails
(CAS 0→1 returns false) while a different carrier needs help.
All three systems must answer: how many threads should actively search for stealable work? Too few delays redistribution; too many wastes CPU on fruitless scanning.
| Aspect | This scheduler | Go runtime | ForkJoinPool |
|---|---|---|---|
| Push path limit | nSearching CAS(0→1) — 1 per cluster |
wakep() CAS(0→1) — 1 at a time |
signalWork() pops Treiber stack — 1 at a time |
| Ongoing searchers | Strictly 0 or 1 | Dynamic: 2*nmspinning < busy_procs |
No limit — all idle workers scan |
| Chain propagation | signalWorkFor(victim) after steal |
"Last spinner wakes next" via wakep() |
signalWork() after steal if prevSrc != src |
| Scan strategy | Power-of-2 random (O(1), 2 siblings) | Coprime enumeration of ALL Ps, 4 passes | Coprime stride over ALL queues |
| Directed steal | Yes — victim encoded in SEARCHING+id |
No — spinners scan all queues | No — workers scan all queues |
| Pull path (separate) | tryStealing in maybeYield — no limit |
None — all stealing via findrunnable |
None — all stealing via runWorker scan |
Key finding: Go's wakep() is also CAS(0→1). It wakes exactly one spinner,
just like our tryStartSearcher. The difference is how spinners persist: Go's
findrunnable() allows an M to become spinning if 2*nmspinning < (gomaxprocs - npidle) — a soft, racy inequality check, not a CAS. Multiple Ms can pass this
check simultaneously, so nmspinning can briefly exceed the target. Our counter
is strictly 0 or 1 at all times. (Go source:
wakep CAS(0,1),
findrunnable spinning check.)
FJP has no limit at all. All idle workers scan simultaneously with randomized
coprime strides that visit every queue. Workers back off naturally via deactivate()
spin + awaitWork() park. No counter tracks how many are scanning. (FJP source at
Loom dc6f316e036: signalWork L1871,
scan: loop L1986,
deactivate L2042.)
Why our approach differs: our carriers have a dedicated I/O loop — each
carrier owns a pinned poller VT bound to a specific epoll/io_uring fd. Stealing
competes with I/O polling for carrier time. Go's Ms also do I/O polling, but
via a shared netpoller (netpoll() in findrunnable()) — any M can poll
it and it returns ready goroutines from all fds. There is no per-M epoll fd.
FJP workers don't poll at all (Netty I/O runs on separate platform threads).
In both Go and FJP, scanning for steals is a natural part of the idle path,
not competing with a dedicated per-thread I/O responsibility.
We use two independent paths:
- Push (nSearching=1): directed steal from a known victim. The searcher knows exactly where to look — no scanning needed. Chain propagation recruits the next searcher sequentially. O(N) recruitment for N idle carriers, but O(1) per steal.
- Pull (tryStealing, no limit): every idle poller iteration probes siblings via power-of-2. This runs independently of nSearching — unlimited concurrent probes, same as FJP in spirit.
The total steal activity is: 1 directed + N pull probes per idle cycle (where N =
idle pollers with !hadIoWork). This is comparable to or more permissive than Go
(which subjects ALL stealing to the nmspinning check) — just structured differently.
Trade-offs:
| Pro | Con | |
|---|---|---|
| nSearching=1 | Low contention on victim's ticket lock; directed steal = no wasted scans | Multiple overloaded carriers in same cluster: only one gets help at a time via push path |
| Go nmspinning | Dynamic limit scales with load; dual-sided barrier prevents missed signals | Known issues: #43997 (12% CPU wasted on useless spinning), #13527 (futex thrashing) |
| FJP unlimited | Fastest convergence — all idle workers scan simultaneously | Highest contention on victim queues; mitigated by @Contended padding and LIFO stack |
Open question: should nSearching be raised to allow multiple directed steals
simultaneously (e.g., two overloaded carriers each recruiting a helper)? The current
pull path compensates, but the push path's sequential chain is a bottleneck when
multiple victims exist. Benchmarking at 16+ carriers with uneven load would reveal
whether this matters in practice.
The protocol has two paths: push (directed, from submissions) and pull (speculative, from idle pollers).
Push path — nSearching + directed steal:
When a carrier has work submitted via execute() and the carrier is already
running (not just woken):
execute(task):
runQueue.offer(task)
if (submitter != carrierThread):
if (submitter.runningScheduler != this):
wakeup() // wake target from blocking I/O
else if (WORK_STEALING_ENABLED):
signalWork() // recruit a sibling to help
signalWork() → signalWorkFor(victim):
tryStartSearcher()— CAS 0→1 on per-clusternSearching. Fails if a search is already in progress (at most 1 concurrent searcher per cluster).wakeFirstIdle(group, victim)— scan idle bitmap for a PARKED carrier. If found:wakeupAsSearcher(victim)— CAS(PARKED → SEARCHING+victim.id)LockSupport.unpark+ poller eventfd write.
- If no idle carrier:
stoppedSearching()(release the nSearching token).
The woken carrier processes SEARCHING+victim.id via handleSearchWake:
- Steal one task from the encoded victim.
stoppedSearching()— release the nSearching token.- If the victim still has work:
signalWorkFor(victim)— recruit the next searcher (chain propagation). - Run the stolen task (inline from carrier loop, queued from poller VT).
Pull path — speculative stealing from maybeYield:
When the pinned poller has no I/O work:
maybeYield(hadIoWork=false):
if (hasRunnableContinuations()):
Thread.yield() // drain local work first
else:
tryStealing(false) // probe siblings
tryStealing uses power-of-2 random probing: pick 2 random siblings,
choose the one with more queued work (worthStealing() — non-empty queue).
After a successful steal, signalWorkFor(victim) fires if the victim still
has work — connecting the pull path to the push protocol.
Carrier state machine:
tryPark() wakeup()
RUNNING(0) ───────→ PARKED(1) ───────→ RUNNING(0)
│ ↑
│ wakeupAsSearcher() │
└──→ SEARCHING(2+id) ────────┘
│ unparkImpl()
└── handleSearchWake()
→ steal from victim
→ stoppedSearching()
Invariant: every tryStartSearcher() success (CAS 0→1) is matched by
exactly one stoppedSearching() decrement — either from signalWorkFor
(no idle carrier found) or from handleSearchWake (steal processed).
The carrier loop eagerly checks for pending SEARCHING state at the top of each iteration, ensuring directed steals are never delayed when the carrier has local work to drain.
Virtual thread continuations arrive in execute() from three sources:
External submissions (from a different carrier or a platform thread):
- The submitter's
runningScheduler()differs from the target →wakeup()fires to interrupt the target's blocking I/O or unpark its carrier. - If
wakeup()failed (carrier was not sleeping) →signalWork()recruits an idle sibling via the nSearching protocol.
Internal submissions (a VT running on this carrier submits to the same carrier):
- The submitter's
runningScheduler()matches → no self-wakeup needed. signalWork()still fires — covers the case where a VT monopolizes a carrier and spawns many children. An idle sibling is recruited to help.
Yield / re-enqueue (onContinue path, runs on the carrier thread itself):
- Skipped entirely — the carrier thread is actively draining, no wakeup or help needed. The carrier will pick up the re-enqueued continuation on its next drain cycle.
The carrier uses a Viktor Klang mini-actor pattern adapted for park/unpark with directed-steal signals:
tryPark(): CAS(RUNNING→PARKED) + markIdle in bitmap + re-check for work. If work arrived between drain and CAS, rolls back atomically. The rollback handles concurrentwakeupAsSearcherCAS (TOCTOU viagetAndSet).unpark()/unparkFromCarrier(): markActive +getAndSet(RUNNING). If wakeState was SEARCHING, processeshandleSearchWake— steals from the directed victim.unparkFromCarrier()runs the stolen task inline;unpark()queues it (for poller VT context where inline execution is not possible).
Victim selection uses power-of-2 random probing: pick 2 random siblings, choose the better candidate. This gives near-optimal selection with O(1) cost.
From a scalability perspective (Neil Gunther's Universal Scalability Law), the coherency penalty grows with the number of parties contending on shared state. Scanning all N siblings means N volatile reads × M concurrent stealers = O(N×M) coherency traffic. Power-of-2 reduces it to O(1) per attempt — two reads, no shared counter. ForkJoinPool avoids scanning with a Treiber stack (O(1) CAS pop), but requires a shared atomic. Power-of-2 has zero shared contention.
The MPSC queue (jctools MpscUnboundedArrayQueue) is single-consumer by design.
Work stealing introduces a second consumer (the stealer). A ticket lock
coordinates access with asymmetric paths:
Carrier (owner) — acquireConsumer() / releaseConsumer(ticket):
uses XADD (atomic increment) on consumerTicket, then spins on consumerServing
until its ticket is served. Wait-free in the common case (no stealer active).
When a stealer is mid-poll, the carrier spins briefly via Thread.onSpinWait().
Stealer — tryStealOne():
uses CAS on consumerTicket. If the carrier or another stealer holds the lock,
the CAS fails immediately and the stealer gives up (returns null). No spinning,
no blocking — best-effort only.
This is intentionally biased toward the carrier: the owner always succeeds
(XADD), the stealer gives up on contention (CAS failure). The carrier's drain
throughput is never degraded by steal attempts. An isEmpty() fast-path before
acquireConsumer() avoids the XADD entirely on empty polls.
Both the carrier and the stealer consume from the same MPSC queue head (FIFO). The oldest virtual thread continuation is processed first, regardless of whether it's drained locally or stolen.
This differs from classical work stealing (Cilk, ForkJoinPool in compute mode):
| Runtime | Local execution | Steal order | Why |
|---|---|---|---|
| Cilk / FJP (compute) | LIFO (newest first) | FIFO (oldest first) | Cache locality: newest task has hot data in L1. Stealer takes cold oldest task — minimizes cache interference. Optimal for recursive fork-join parallelism. |
| FJP with Loom (asyncMode) | FIFO | FIFO | Fairness: virtual threads are I/O-bound continuations, not recursive tasks. FIFO prevents starvation of VTs waiting to resume after I/O. |
| Go runtime | FIFO (256-slot ring) + 1-slot LIFO (runnext) |
FIFO | Hybrid: runnext gives the most recent goroutine priority (producer-consumer locality), everything else is FIFO for fairness. |
| This scheduler | FIFO (MPSC queue) | FIFO (same queue) | Same rationale as Loom: VTs are I/O continuations. FIFO ensures fair drain order. The MPSC queue is single-ended — no deque, no LIFO option. |
Go's runnext slot is notable: the goroutine that just unblocked (e.g., channel
receive) runs immediately before the FIFO queue. This gives producer-consumer
chains low latency. Our scheduler achieves similar behavior through the pinned
poller's maybeYield(hadIoWork) — when a VT completes and its continuation arrives, the
poller yields immediately to drain it.
Enable the io.netty.loom.WorkSteal JFR event to trace steals:
jcmd <pid> JFR.start settings=netty-loom.jfc
Each event records:
virtualThread: the stolen VTsourceCarrier/stealerCarrier: which carrier lost/gained the VTsourceQueueDepth: how deep the victim's queue wasfromCarrierLoop:trueif stolen from the carrier loop,falseif stolen from the pinned poller viamaybeYielddirected:trueif stolen via nSearching directed steal (handleSearchWake),falseif stolen via randomtryStealingprobe
- Balanced load (uniform connection distribution): minimal steals. Both carriers drain their own queues. No latency improvement, no regression.
- Uneven load (hot connections, burst patterns): idle carriers help overloaded siblings. Tail latency improves for VTs that would otherwise wait on a stuck carrier.
- Max throughput: near-zero overhead when enabled. The
isEmpty()fast-path in the ticket lock avoids atomic operations on empty polls. The nSearching CAS fires from submissions and when the carrier loop finds remaining work after a drain iteration.
Pins each carrier to its own CPU core and groups carriers by shared L3 cache. This eliminates thread migrations and keeps steal coordination within the same cache domain.
-Dio.netty.loom.topology=io.netty.loom.topology.LinuxCarrierTopology
--enable-native-access=ALL-UNNAMED
| Property | Values | Effect |
|---|---|---|
io.netty.loom.topology |
io.netty.loom.topology.LinuxCarrierTopology or absent |
Absent (default): carriers float, no pinning. Set: each carrier pinned to a core, clusters discovered from hardware. |
io.netty.loom.workstealing.scope |
GLOBAL (default) or CLUSTER_LOCAL |
GLOBAL: steal from any carrier. CLUSTER_LOCAL: steal only within the same LLC cluster — avoids cross-CCD/NUMA coherency traffic. |
When enabled, the scheduler:
- Reads the process affinity mask (
sched_getaffinity) to discover available CPUs - Pins each carrier to one CPU via
sched_setaffinity(one carrier per core) - Reads
/sys/devices/system/cpu/cpu{N}/cache/to discover LLC clusters - Groups carriers that share an L3 cache into clusters
Carrier thread names reflect the topology: carrier-0-cluster0-core2,
carrier-1-cluster0-core3. On an AMD Ryzen 9 7950X (2 CCDs):
L3 cluster 0: CPUs 0-7, 16-23 (8 cores + 8 SMT threads)
L3 cluster 1: CPUs 8-15, 24-31 (8 cores + 8 SMT threads)
If there are more carriers than available CPUs, extra carriers float with a
warning logged. If FFM or sched_setaffinity is unavailable (e.g., containers
without native access), the feature degrades gracefully — carriers float and
all land in a single cluster.
Without pinning, carriers are regular threads that the Linux scheduler (EEVDF)
can migrate between cores. When a carrier blocks briefly in epoll_wait and
wakes, EEVDF may place it on a different core — or worse, co-place two carriers
on the same core while another core sits idle. This is the
"Wasted Cores" problem
(Lozi et al., EuroSys 2016).
The effect on our scheduler: without pinning, each carrier cycles through the I/O poll → VT drain loop 247K times in 10 seconds. With pinning: 632K times (2.6x more responsive). Each poll finds fewer events (1.6 vs 4.0), and the VT queue stays nearly empty (avg depth 1 vs 8). The carrier reacts to each event promptly instead of waking up to find accumulated batches.
From PERFORMANCE.md at 50K req/s, 2 carriers on CPUs 2,3:
| Metric | No pinning | With topology |
|---|---|---|
| p50 latency | 3.80ms | 1.16ms |
| IO cycles/10s | 247K | 632K (2.6x) |
| IO events/cycle | 4.0 | 1.6 |
| VT queue depth (avg) | 8 | 1 |
The N+1 cores trick (3 cores for 2 carriers, no pinning) achieves similar results (p50 1.13ms) by giving the Master-Poller, GC, and JIT their own core — eliminating preemption without explicit pinning. See the reproducing section for exact commands.
Topology and work stealing are independent features that compose well:
| Configuration | p50 (50K, 2 cores) | What it provides |
|---|---|---|
| No WS, no topology | 3.80ms | Baseline |
| WS only | 2.26ms | Idle carriers help overloaded siblings |
| Topology only | 1.13ms | Pinning eliminates preemption, keeps queues short |
| WS + topology | 1.13ms | Both — pinning dominates at 2 cores |
With topology enabled, work stealing's benefit is less visible at small scale (2 cores) because pinning already eliminates the preemption problem that causes queue buildup. At larger scale (16+ carriers across 2 LLC clusters), work stealing redistributes load within a cluster when one carrier is saturated while siblings are idle.
When both are enabled and stealScope is CLUSTER_LOCAL:
- Each LLC cluster gets its own
ClusterState— independentnSearchingcounter and idle bitmap. Two clusters can signal and steal in parallel with zero cross-cluster contention. - The
siblingsarray for each carrier contains only carriers in the same cluster.tryStealingpower-of-2 probes andwakeFirstIdlescans are cluster-local — no cross-LLC traffic for steal coordination. - Stolen tasks may still route back to the home carrier's event loop on write (cross-carrier hop), but the steal itself stays within the LLC.
| Deployment | Recommendation |
|---|---|
| 1-4 cores, single CCD | Topology optional. Pinning helps if CPU-bound (see PERFORMANCE.md). |
| 8+ cores, single CCD/socket | Enable topology. Pinning eliminates migrations and keeps queues short. |
| Multi-CCD/NUMA (AMD EPYC, Ryzen 7000+) | Enable topology + CLUSTER_LOCAL scope. Each LLC cluster steals independently — no cross-CCD coherency traffic. |
| Containers with CPU limits | Topology works if --enable-native-access is allowed. Falls back gracefully if sched_setaffinity is denied (carriers float, single cluster). |
| N+1 cores trick | Alternative to pinning: give carriers one extra core (--server-cpuset 2,3,4 for 2 carriers). The extra core absorbs GC/JIT/Master-Poller without explicit pinning. See PERFORMANCE.md. |
| Module | Description |
|---|---|
netty-virtualthread-bootstrap |
Scheduling core (io.netty.loom.scheduler): NettyScheduler, EventLoopScheduler, EventLoopSchedulerGroup, JFR events. Must be on the system classloader. |
netty-virtualthread-core |
Netty integration layer (io.netty.loom): VirtualIoNativePollerEventLoopGroup, VirtualIoNioPollerEventLoopGroup, VirtualIoEventLoopGroup. |
example-echo |
Self-contained HTTP server demonstrating blocking VT handlers, structured concurrency, and carrier affinity. |
All scheduling state (carrier pool singleton, ScopedValue, instanceof checks) lives in bootstrap, loaded once by the system classloader. This prevents per-classloader duplication in app servers and fat JAR deployments.
The JDK loads the scheduler via Class.forName(cn, true, ClassLoader.getSystemClassLoader()) during VirtualThread.<clinit>. This means the bootstrap JAR must be visible to the system classloader, not just the application classloader.
If the bootstrap JAR is not on the system classpath (e.g., packaged inside a Spring Boot fat JAR without MRJAR, or only in an app server's per-deployment classloader), the JDK cannot find NettyScheduler and the scheduler does not activate. Any attempt to use the scheduler API (EventLoopSchedulerGroup.instance(), VirtualIoNativePollerEventLoopGroup, etc.) throws IllegalStateException — there is no silent fallback.
The topology JAR has the same constraint: -Dio.netty.loom.topology=<classname> loads the class via Class.forName from the bootstrap module, which uses the system classloader. The topology implementation must be on the system classpath alongside the bootstrap JAR.
Add bootstrap and core JARs to the classpath. Everything works:
java --enable-preview \
-Djdk.virtualThreadScheduler.implClass=io.netty.loom.scheduler.NettyScheduler \
-cp "bootstrap.jar:core.jar:netty-all.jar:app.jar" \
com.example.MainSpring Boot's LaunchedClassLoader loads app classes from BOOT-INF/classes/. The system classloader cannot see them. The solution: unpack bootstrap classes into the Multi-Release JAR layer (META-INF/versions/27/), and exclude bootstrap from BOOT-INF/lib/ so it's not loaded twice.
<!-- Mark as Multi-Release -->
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-jar-plugin</artifactId>
<configuration>
<archive>
<manifestEntries>
<Multi-Release>true</Multi-Release>
</manifestEntries>
</archive>
</configuration>
</plugin>
<!-- Unpack bootstrap into META-INF/versions/27/ -->
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-dependency-plugin</artifactId>
<executions>
<execution>
<id>unpack-bootstrap-scheduler</id>
<phase>prepare-package</phase>
<goals><goal>unpack</goal></goals>
<configuration>
<artifactItems>
<artifactItem>
<groupId>io.netty.loom</groupId>
<artifactId>netty-virtualthread-bootstrap</artifactId>
<version>${netty-loom.version}</version>
<type>jar</type>
<includes>io/netty/loom/scheduler/**</includes>
<outputDirectory>${project.build.outputDirectory}/META-INF/versions/${java.version}</outputDirectory>
</artifactItem>
</artifactItems>
</configuration>
</execution>
</executions>
</plugin>
<!-- Exclude bootstrap from BOOT-INF/lib/ -->
<plugin>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-maven-plugin</artifactId>
<configuration>
<excludes>
<exclude>
<groupId>io.netty.loom</groupId>
<artifactId>netty-virtualthread-bootstrap</artifactId>
</exclude>
</excludes>
</configuration>
</plugin>Verified on Quarkus 3.21.3 with fast-jar (default packaging). The bootstrap JAR goes on -Xbootclasspath/a: and the Maven dependency must be <scope>provided</scope> to avoid duplication into Quarkus's lib/:
<dependency>
<groupId>io.netty.loom</groupId>
<artifactId>netty-virtualthread-bootstrap</artifactId>
<version>${netty-loom.version}</version>
<scope>provided</scope>
</dependency>java --enable-preview \
-Xbootclasspath/a:/path/to/netty-virtualthread-bootstrap.jar \
-Djdk.virtualThreadScheduler.implClass=io.netty.loom.scheduler.NettyScheduler \
-jar target/quarkus-app/quarkus-run.jarLegacy-jar (quarkus.package.jar.type=legacy-jar): uses a flat system classpath — no -Xbootclasspath/a: needed, but the provided scope still applies.
Verified on OpenLiberty 26.0.0.5 (which uses Netty 4.2.12 internally). Place the bootstrap JAR on -Xbootclasspath/a: and add the scheduler flag to jvm.options:
--enable-preview
-Xbootclasspath/a:/path/to/netty-virtualthread-bootstrap.jar
-Djdk.virtualThreadScheduler.implClass=io.netty.loom.scheduler.NettyScheduler
The core JAR stays inside the application deployment (WAR/EAR). Multiple deployments sharing the same JVM share a single carrier pool.
| Deployment | System CL sees bootstrap? | Action needed |
|---|---|---|
Flat classpath (java -cp) |
Yes | None |
| Spring Boot fat JAR | No | MRJAR unpack (see above) |
| Quarkus | No | Bootstrap on -Xbootclasspath/a:, dependency provided |
| OpenLiberty (WAR/EAR) | No | Bootstrap on -Xbootclasspath/a: |
The easiest way to get started is the provided dev container (.devcontainer/), which uses shipilev/openjdk:loom.
Works with VS Code (Dev Containers extension) and IntelliJ IDEA (File > Remote Development > Dev Containers).
export JAVA_HOME=/path/to/loom-jdk
mvn clean installApache License 2.0 — see LICENSE.
Credit: dreamlike-ocean for identifying and fixing the fat-JAR classloader issue.