perf: fold LSB-test `i32.and X 1` into `i32.ctz` in boolean contexts #8562

ggreif wants to merge 4 commits into WebAssembly:main

Conversation
Add ggreif/binaryen (branch gabor/lsb-if-ctz-flake) as a flake input, exposing a patched wasm-opt that folds LSB-test `i32.and X 1` patterns into `i32.ctz` (WebAssembly/binaryen#8562). Apply it to the non-debug RTS variants in installPhase, yielding ~0.2% instruction count reductions in GC-heavy benchmarks. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Interesting. I worry this is not always faster, though: AND usually has a cost of 1, while TZCNT often has 2: https://www.agner.org/optimize/instruction_tables.pdf Perhaps check what LLVM does here? They likely reasoned about this thoroughly.

I have answered a similar question here. The …
I agree it might be cleaner in a way. I also agree that VMs could alter what they emit, as you wrote in the linked issue. However, if this would regress performance on major VMs right now, we'd want to wait for them to fix that before landing anything.

Even if JIT compilers start optimizing similarly to wasmtime, it still won't solve the performance issue in, for example, runtimes with interpreters (some smart-contract runtimes, embedded-oriented ones like wasm3, etc.). If such an optimization is to be done at all, in my opinion it should only apply when optimizing for size (-Os).

That crossed my mind too. I'll submit a revision soon.
Fold `i32.and X 1; if T E` into `i32.ctz X; if E T` An if-else conditioned on `(i32.and X (i32.const 1))` tests the LSB of X. Since `i32.ctz X == 0` iff the LSB of X is set, we can replace the condition with `i32.ctz X` and swap the branches — saving one instruction. Handles the constant on either side (left or right of `and`). Relates to: WebAssembly#5752 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…an context In boolean contexts (if, br_if, select), `eqz(and X 1)` and `ctz X` have the same truthiness: both are truthy iff LSB(X) == 0. Replacing eqz+and with ctz saves one instruction and covers the primary pattern from WebAssembly#5752: i32.const 1; i32.and; i32.eqz; br_if N ==> i32.ctz; br_if N This fires via `optimizeBoolean`, so it covers `if`, `br_if`, and `select` conditions in one place. Observed ~26–105 hits across Motoko RTS variants. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Per WebAssembly#8562 review (kripken, MaxGraey): the `(if (i32.and X 1) ...)` and `eqz(and X 1)` → `i32.ctz X` rewrites save one instruction (a byte) but TZCNT can cost 1-2 cycles more than AND on common JIT VMs (Agner Fog tables), and JIT-less interpreters (wasm3, smart-contract runtimes) lack a fast path for ctz at all. The byte-saving is unambiguously the win we want under shrink modes; under speed modes the AND form stays. Restrict both folds to `getPassOptions().shrinkLevel >= 1` — fires under -Os and -Oz, no-ops everywhere else. Test rewritten with two RUN lines (DEFAULT + SHRINK prefixes) so both directions are asserted: the fold suppresses cleanly under the default --optimize-instructions invocation, and fires as before when --shrink-level=1 is added.
Pushed a revision: the LSB→ctz fold is now gated on …
tlively left a comment
It seems we could do something similar when there is no i32.eqz in the input as well:
(i32.and X (i32.const 1))
=>
(i32.eqz (i32.ctz X))
This would save one byte and would often be further optimized by removing the i32.eqz and flipping the branches in the next iteration.
In fact, I think we could optimize only the pattern without i32.eqz in the input and depend on the existing optimizations to remove the outer i32.eqz to cover all the cases this PR already covers.
Let's add tests for select as well.
// win we want under shrink modes; under speed modes the AND form
// stays. See WebAssembly/binaryen#8562.
if (auto* binary = curr->condition->dynCast<Binary>()) {
  if (binary->op == AndInt32 && getPassOptions().shrinkLevel >= 1) {
How about making this >= 2, i.e., only in -Oz? -Os is meant to be a good balance between size and speed, and without more data I'm not sure how balanced this is. -Oz is "size at all costs".
I don't think so: `(i32.and X (i32.const 1))` would often feed conditionals, and the proposed transform would unlock ripple effects. This is not only space-saving but also time-saving.
Added …

Even if such an expression is one byte larger compared to …
Summary
An if-else conditioned on `(i32.and X (i32.const 1))` tests the least significant bit of X. Since `i32.ctz X == 0` iff the LSB of X is set, we can replace the condition with `i32.ctz X` and swap the branches — saving one instruction. The second commit extends this to the primary pattern from the issue — `eqz(and X 1)` as a boolean condition (used in `br_if`, `if`, `select`) — handled in `optimizeBoolean` so all three sites benefit from one insertion.

- Handles the constant on either side (left or right of `and`)
- `visitIf`: `(and X 1); if T E` → `(ctz X); if E T`
- `optimizeBoolean`: `eqz(and X 1)` → `ctz X` — covers the typical `br_if (eqz (and X 1))` pattern

Motivation

Filed in #5752. The Motoko compiler already implements this in its own peephole optimizer (`instrList.ml`); the goal is to bring it to `wasm-opt` so that hand-written Wasm (e.g. the Motoko RTS, written in Rust) benefits too.

The `optimizeBoolean` rule alone fires 26–105 times across the three Motoko RTS variants (`mo-rts-eop`, `mo-rts-incremental`, `mo-rts-non-incremental`), targeting the `is_skewed`/`is_scalar` pointer-tagging checks in the GC hot path.

Applying `wasm-opt --optimize-instructions` to the Motoko RTS and running the benchmark suite shows the following gross effects (the submitted optimisation is a contributing factor alongside other rules triggered in the same pass):

[Benchmark table: heap-32 (GC-heavy, runs 1 and 2), heap-64 (runs 1 and 2), bignum, candid-subtype-cost; numeric columns not preserved in this extract.]

The GC-heavy heap benchmarks benefit most, consistent with the `is_skewed` check firing frequently during pointer traversal.

Test plan

- `test/lit/passes/optimize-instructions-lsb-if.wast` covers `if` (const left and right) and `br_if (eqz (and X 1))`
- asserts `i32.ctz` in the output

🤖 Generated with Claude Code