perf: fold LSB-test i32.and X 1 into i32.ctz in boolean contexts #8562

Open
ggreif wants to merge 4 commits into WebAssembly:main from ggreif:gabor/lsb-if-ctz

Conversation

@ggreif ggreif commented Apr 1, 2026

Summary

An if-else conditioned on (i32.and X (i32.const 1)) tests the least significant bit of X. Since i32.ctz X == 0 iff the LSB of X is set, we can replace the condition with i32.ctz X and swap the branches — saving one instruction.

The second commit extends this to the primary pattern from the issue — eqz(and X 1) as a boolean condition (used in br_if, if, select) — handled in optimizeBoolean so all three sites benefit from one insertion.

  • Handles the constant on either side (left or right of and)
  • visitIf: (and X 1); if T E → (ctz X); if E T
  • optimizeBoolean: eqz(and X 1) → ctz X, which covers the typical br_if (eqz (and X 1)) pattern
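
For anyone who wants to sanity-check the two rules above outside of Binaryen, here is a minimal C++ sketch. It is not code from the patch: `ctz32`, `ifAnd`, `ifCtz`, `sameTruthiness` and `checkLsbFold` are hypothetical helper names, and `ctz32` only models the Wasm convention that `i32.ctz` of 0 is 32.

```cpp
#include <cassert>
#include <cstdint>

// Model of Wasm's i32.ctz: count of trailing zero bits, 32 when x == 0.
inline std::uint32_t ctz32(std::uint32_t x) {
  if (x == 0) {
    return 32;
  }
  std::uint32_t n = 0;
  while ((x & 1u) == 0) {
    x >>= 1;
    ++n;
  }
  return n;
}

// Original shape: (if (i32.and X (i32.const 1)) T E)
inline int ifAnd(std::uint32_t x, int t, int e) { return (x & 1u) ? t : e; }

// Rewritten shape: (if (i32.ctz X) E T), i.e. the arms are swapped.
inline int ifCtz(std::uint32_t x, int t, int e) { return ctz32(x) ? e : t; }

// Boolean-context rule: eqz(and X 1) and ctz X are truthy for the same X.
inline bool sameTruthiness(std::uint32_t x) {
  bool eqzAnd = (x & 1u) == 0;
  bool ctz = ctz32(x) != 0;
  return eqzAnd == ctz;
}

// Spot-check a few edge cases and a small contiguous range.
inline void checkLsbFold() {
  const std::uint32_t edges[] = {0u, 1u, 2u, 0x80000000u, 0xFFFFFFFEu,
                                 0xFFFFFFFFu};
  for (std::uint32_t x : edges) {
    assert(ifAnd(x, 10, 20) == ifCtz(x, 10, 20));
    assert(sameTruthiness(x));
  }
  for (std::uint32_t x = 0; x < 4096; ++x) {
    assert(ifAnd(x, 10, 20) == ifCtz(x, 10, 20));
    assert(sameTruthiness(x));
  }
}
```

The X == 0 case is the one worth looking at: the LSB is clear, `ctz32` yields 32 (truthy), and the swapped arms still select E.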

Motivation

Filed in #5752. The Motoko compiler already implements this in its own peephole optimizer (instrList.ml); the goal is to bring it to wasm-opt so that Wasm which does not pass through that optimizer (e.g. the Motoko RTS, written in Rust) benefits too.

The optimizeBoolean rule alone fires 26–105 times across the three Motoko RTS variants (mo-rts-eop, mo-rts-incremental, mo-rts-non-incremental), targeting the is_skewed/is_scalar pointer-tagging checks in the GC hot path.

Applying wasm-opt --optimize-instructions to the Motoko RTS and running the benchmark suite shows the following gross effects (the submitted optimisation is a contributing factor alongside other rules triggered in the same pass):

| Benchmark | Before | After | Δ |
|---|---|---|---|
| heap-32 (GC-heavy, run 1) | 1,153,792,735 instr | 1,151,398,207 instr | −2,394,528 (−0.21%) |
| heap-32 (run 2) | 1,256,407,315 instr | 1,253,408,059 instr | −2,999,256 (−0.24%) |
| heap-64 (run 1) | 1,324,057,357 instr | 1,321,855,449 instr | −2,201,908 (−0.17%) |
| heap-64 (run 2) | 1,295,845,087 instr | 1,293,744,743 instr | −2,100,344 (−0.16%) |
| bignum | 2,504,499 cycles | 2,504,383 cycles | −116 |
| candid-subtype-cost | 1,115,011 cycles | 1,114,823 cycles | −188 |

The GC-heavy heap benchmarks benefit most, consistent with the is_skewed check firing frequently during pointer traversal.
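
The percentages in the Δ column follow from the raw counts as (before − after) / before. A small C++ sketch of that arithmetic, using the heap rows as inputs; `reductionPercent` and `checkTableRows` are hypothetical names introduced only for this illustration.

```cpp
#include <cassert>
#include <cstdint>

// Relative reduction in percent: (before - after) / before * 100.
inline double reductionPercent(std::uint64_t before, std::uint64_t after) {
  return 100.0 * static_cast<double>(before - after) /
         static_cast<double>(before);
}

// Recompute the four heap rows and compare against the rounded table values.
inline void checkTableRows() {
  double heap32Run1 = reductionPercent(1153792735ULL, 1151398207ULL);
  double heap32Run2 = reductionPercent(1256407315ULL, 1253408059ULL);
  double heap64Run1 = reductionPercent(1324057357ULL, 1321855449ULL);
  double heap64Run2 = reductionPercent(1295845087ULL, 1293744743ULL);
  assert(heap32Run1 > 0.205 && heap32Run1 < 0.215); // table: 0.21%
  assert(heap32Run2 > 0.235 && heap32Run2 < 0.245); // table: 0.24%
  assert(heap64Run1 > 0.165 && heap64Run1 < 0.175); // table: 0.17%
  assert(heap64Run2 > 0.155 && heap64Run2 < 0.165); // table: 0.16%
}
```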

Test plan

  • New lit test test/lit/passes/optimize-instructions-lsb-if.wast covers if (const left and right) and br_if (eqz (and X 1))
  • All three test cases produce i32.ctz in the output

🤖 Generated with Claude Code

@ggreif ggreif requested a review from a team as a code owner April 1, 2026 09:39
@ggreif ggreif requested review from tlively and removed request for a team April 1, 2026 09:39
@ggreif ggreif changed the title from "perf(OptimizeInstructions): fold i32.and X 1; if T E into i32.ctz X; if E T" to "perf: fold LSB-test i32.and X 1 into i32.ctz in boolean contexts" Apr 1, 2026
ggreif added a commit to caffeinelabs/motoko that referenced this pull request Apr 1, 2026
Add ggreif/binaryen (branch gabor/lsb-if-ctz-flake) as a flake input,
exposing a patched wasm-opt that folds LSB-test `i32.and X 1` patterns
into `i32.ctz` (WebAssembly/binaryen#8562). Apply it to the non-debug
RTS variants in installPhase, yielding ~0.2% instruction count reductions
in GC-heavy benchmarks.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@kripken
Member

kripken commented Apr 1, 2026

Interesting. I worry this is not always faster, though: AND usually has a cost of 1, while TZCNT often has 2: https://www.agner.org/optimize/instruction_tables.pdf

Perhaps check what LLVM does here? They likely reasoned about this thoroughly.

@ggreif
Author

ggreif commented Apr 5, 2026

> Interesting. I worry this is not always faster, though: AND usually has a cost of 1, while TZCNT often has 2: https://www.agner.org/optimize/instruction_tables.pdf
>
> Perhaps check what LLVM does here? They likely reasoned about this thoroughly.

I have answered a similar question here.

The i32.ctz approach is also semantically cleaner: it captures the "is LSB set?" intent more directly than and 1; eqz.

@kripken
Member

kripken commented Apr 6, 2026

I agree it might be cleaner in a way. I also agree that VMs could alter what they emit, as you wrote in the linked issue. However, if this would regress performance on major VMs right now, we'd want to wait for them to fix that before landing anything.

@MaxGraey
Contributor

MaxGraey commented Apr 7, 2026

Even if JIT compilers start optimizing this the way wasmtime does, that still won't solve the performance issue for runtimes that use interpreters (some smart-contract runtimes, embedded-oriented engines such as wasm3, etc.). If such an optimization is to be done at all, in my opinion it should only apply when optimizing for size (-Os).

@ggreif
Author

ggreif commented Apr 7, 2026

> Even if JIT compilers start optimizing this the way wasmtime does, that still won't solve the performance issue for runtimes that use interpreters (some smart-contract runtimes, embedded-oriented engines such as wasm3, etc.). If such an optimization is to be done at all, in my opinion it should only apply when optimizing for size (-Os).

That crossed my mind too. I'll submit a revision soon.

ggreif and others added 2 commits May 8, 2026 23:55
…X; if E T`

An if-else conditioned on `(i32.and X (i32.const 1))` tests the LSB of X.
Since `i32.ctz X == 0` iff the LSB of X is set, we can replace the condition
with `i32.ctz X` and swap the branches — saving one instruction.

Handles the constant on either side (left or right of `and`).

Relates to: WebAssembly#5752

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…an context

In boolean contexts (if, br_if, select), `eqz(and X 1)` and `ctz X` have
the same truthiness: both are truthy iff LSB(X) == 0. Replacing eqz+and
with ctz saves one instruction and covers the primary pattern from
WebAssembly#5752:

  i32.const 1; i32.and; i32.eqz; br_if N  ==>  i32.ctz; br_if N

This fires via `optimizeBoolean`, so it covers `if`, `br_if`, and `select`
conditions in one place. Observed ~26–105 hits across Motoko RTS variants.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@ggreif ggreif force-pushed the gabor/lsb-if-ctz branch from 575cd27 to 5de41e5 Compare May 8, 2026 21:57
ggreif added a commit to ggreif/binaryen that referenced this pull request May 8, 2026
Per WebAssembly#8562 review (kripken, MaxGraey): the
`(if (i32.and X 1) ...)` and `eqz(and X 1)` → `i32.ctz X` rewrites
save one instruction (a byte) but TZCNT can cost 1-2 cycles more
than AND on common JIT VMs (Agner Fog tables), and JIT-less
interpreters (wasm3, smart-contract runtimes) lack a fast path
for ctz at all. The byte-saving is unambiguously the win we want
under shrink modes; under speed modes the AND form stays.

Restrict both folds to `getPassOptions().shrinkLevel >= 1` —
fires under -Os and -Oz, no-ops everywhere else.

Test rewritten with two RUN lines (DEFAULT + SHRINK prefixes) so
both directions are asserted: the fold suppresses cleanly under
the default --optimize-instructions invocation, and fires as
before when --shrink-level=1 is added.
@ggreif
Author

ggreif commented May 8, 2026

Pushed a revision: the LSB→ctz fold is now gated on getPassOptions().shrinkLevel >= 1, so it fires under -Os / -Oz only and stays out of the default and speed-optimised pipelines. The lit test was split into DEFAULT (fold suppressed) and SHRINK (fold fires) prefixes.
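
The gating described above boils down to a single predicate on the shrink level. A self-contained C++ sketch follows; `PassOptionsSketch` is a hypothetical stand-in that models only the `shrinkLevel` field mentioned in this thread, and the level-to-flag mapping in the comments is taken from the discussion here, not from Binaryen's headers.

```cpp
#include <cassert>

// Hypothetical stand-in for the pass options; only shrinkLevel is modelled.
struct PassOptionsSketch {
  int shrinkLevel = 0; // 0 = default, 1 = -Os, 2 = -Oz (as used in this PR)
};

// True when the LSB-to-ctz fold may fire under the revised gating.
inline bool lsbFoldEnabled(const PassOptionsSketch& options) {
  return options.shrinkLevel >= 1;
}

// Mirrors the DEFAULT / SHRINK split of the lit test.
inline void checkGating() {
  PassOptionsSketch defaults;
  assert(!lsbFoldEnabled(defaults)); // DEFAULT: fold suppressed

  PassOptionsSketch shrink;
  shrink.shrinkLevel = 1;
  assert(lsbFoldEnabled(shrink)); // SHRINK: fold fires
}
```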

Member

@tlively tlively left a comment

It seems we could do something similar when there is no i32.eqz in the input as well:

(i32.and X (i32.const 1))
=>
(i32.eqz (i32.ctz X))

This would save one byte and would often be further optimized by removing the i32.eqz and flipping the branches in the next iteration.

In fact, I think we could optimize only the pattern without i32.eqz in the input and depend on the existing optimizations to remove the outer i32.eqz to cover all the cases this PR already covers.
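
To make the one-byte claim concrete, here is a rough C++ sketch that writes out the bytes each form adds after X in the binary format and checks that both forms produce the same i32 value (not just the same truthiness). The opcode constants are the single-byte encodings from the Wasm core spec; the helper names (`encodeAndOne`, `encodeEqzCtz`, `trailingZeros32`, `andOne`, `eqzCtz`) are hypothetical and exist only for this illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Single-byte Wasm opcodes used by the two forms.
constexpr std::uint8_t kI32Const = 0x41;
constexpr std::uint8_t kI32Eqz = 0x45;
constexpr std::uint8_t kI32Ctz = 0x68;
constexpr std::uint8_t kI32And = 0x71;

// Bytes emitted after X for (i32.and X (i32.const 1)):
// i32.const, LEB128-encoded 1, i32.and.
inline std::vector<std::uint8_t> encodeAndOne() {
  return {kI32Const, 0x01, kI32And};
}

// Bytes emitted after X for (i32.eqz (i32.ctz X)): i32.ctz, i32.eqz.
inline std::vector<std::uint8_t> encodeEqzCtz() { return {kI32Ctz, kI32Eqz}; }

// Model of i32.ctz: trailing zero count, 32 when x == 0.
inline std::uint32_t trailingZeros32(std::uint32_t x) {
  if (x == 0) {
    return 32;
  }
  std::uint32_t n = 0;
  while ((x & 1u) == 0) {
    x >>= 1;
    ++n;
  }
  return n;
}

// Value of the input form.
inline std::uint32_t andOne(std::uint32_t x) { return x & 1u; }

// Value of the proposed output form.
inline std::uint32_t eqzCtz(std::uint32_t x) {
  return trailingZeros32(x) == 0 ? 1u : 0u;
}

inline void checkEqzCtzForm() {
  assert(encodeAndOne().size() == encodeEqzCtz().size() + 1);
  for (std::uint32_t x = 0; x < 1024; ++x) {
    assert(andOne(x) == eqzCtz(x));
  }
}
```

Because the two forms agree on the value itself, this rewrite would not need a boolean context to be valid, which is what lets the existing eqz-removal rules take over afterwards.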

Member

Let's add tests for select as well.

// win we want under shrink modes; under speed modes the AND form
// stays. See WebAssembly/binaryen#8562.
if (auto* binary = curr->condition->dynCast<Binary>()) {
  if (binary->op == AndInt32 && getPassOptions().shrinkLevel >= 1) {
Member

How about making this >= 2, i.e., only in -Oz? -Os is meant to be a good balance between size and speed, and without more data I'm not sure how balanced this is. -Oz is "size at all costs".

Author

I don't think so: (i32.and X (i32.const 1)) often feeds conditionals, and the proposed transform would unlock ripple effects there. That saves time as well as space.

ggreif added 2 commits May 9, 2026 01:22
@ggreif ggreif force-pushed the gabor/lsb-if-ctz branch from 5de41e5 to 784ed83 Compare May 8, 2026 23:22
@ggreif
Author

ggreif commented May 8, 2026

Added select coverage in the same lit test — two cases (constant on the right, constant on the left of the AND), non-constant arms so an unrelated select c1 c0 P → P simplification doesn't eat the select before the boolean fold runs. DEFAULT keeps the AND form (with the eqz-arm-swap that was already there); SHRINK collapses to ctz.

@MaxGraey
Contributor

MaxGraey commented May 9, 2026

> (i32.and X (i32.const 1))

Even if such an expression is one byte larger than (i32.eqz (i32.ctz X)), it is a predicate, so it is fairly stable and repetitive. That means compression will handle it very well, since a repeated byte sequence is dictionary-friendly. Is it even worth it, considering that interpreters and single-tier compilers consistently end up with a heavier instruction?

4 participants