Add autotuning #247

Draft
AntonOresten wants to merge 15 commits into main from ao/autotune

Collaborator

@AntonOresten AntonOresten commented Jun 3, 2026

Still experimental, but improves upon and supersedes #95, and incorporates #95 (comment).

Introduces an @autotune macro with some ideas from #40, which uses a space argument for the search space:

using cuTile.Experimental: @autotune

@autotune(
    space=[(tile=16, occupancy=1), (tile=32, occupancy=2)],
    blocks=cld(N, $tile),
    vadd(a, b, c, ct.Constant($tile))
)

The $ syntax marks a name that is taken from a config in the search space. The occupancy entry is automatically detected as an entry hint.

Some rough edges that I'm unsure about are:

Ports the autotune work from the autotune branch onto main's
launch infrastructure (`cuTile.cufunction` / `TileKernel`,
`cuTileconvert`, `Constant` unwrapping in `unwrap_argtypes`).

Lives in `src/Experimental.jl` (the original was under `src/Experimental.jl`
+ `ext/autotune/autotune.jl` behind a CUDA weakdep; on main, `cuTile.launch`
is in `src/` and depends on `CUDACore` directly, so the extension boundary
is no longer needed).

What this drops vs. the autotune branch:
- `ext/CUDAExt.jl` — subsumed by `src/launch.jl` on main.
- The `_SCOPED_INF_CACHE`/`create_inf_cache` scoped inference plumbing —
  caching is now handled by `CompilerCaching` inside `cufunction`.
- Manual `emit_function!` calls in `precompile_cfg` — `cufunction` does
  the full compile+link in one shot, so precompile is just a `cufunction`
  call per cfg.

API preserved:
- `Experimental.autotune_launch(f, space, grid_fn, args_fn; ...)`
- `Experimental.clear_autotune_cache(; kernel=nothing, key=nothing)`
- `Experimental.CartesianSpace`, `FixedSpace`, `AbstractSearchSpace`
- All tuning preset / verify / setup / launch_args_fn knobs.

Tests: all 31 cases in `test/device/autotune.jl` pass.

`emit_tile!` and everything below it (inference, structured IR, tile-IR
emission) mutates shared state — `CacheView` entries, `CuTileResults`
fields, the inference cache, and CompilerCaching's per-CI const_entries
vector — none of which is thread-safe. Without this, concurrent
`cufunction` calls (e.g. autotuning's precompile fan-out across
`Threads.@spawn` workers) silently race.

Two brief acquisitions:
- `ensure_compiled` inside `compile` (a lookup hit contends only briefly)
- `emit_tile!` inside `emit_binary!`
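
The locking pattern described here can be sketched in isolation. This is a minimal sketch, not cuTile's implementation: `COMPILE_LOCK`, `COMPILE_CACHE`, `slow_external_step`, and `compile_once!` are hypothetical names, a `Dict` stands in for the shared compiler state, and a `sleep` stands in for the subprocess call that runs without the lock.

```julia
# Hypothetical stand-ins; none of these names come from cuTile.
const COMPILE_LOCK = ReentrantLock()
const COMPILE_CACHE = Dict{Int,String}()

# Stand-in for slow work that is safe to run without holding the lock.
slow_external_step(cfg::Int) = (sleep(0.01); "binary-$cfg")

function compile_once!(cfg::Int)
    # First brief acquisition: look up the shared cache.
    hit = lock(COMPILE_LOCK) do
        get(COMPILE_CACHE, cfg, nothing)
    end
    hit === nothing || return hit

    # The slow step runs unlocked, so concurrent callers can overlap here.
    result = slow_external_step(cfg)

    # Second brief acquisition: publish the result; the first writer wins.
    return lock(COMPILE_LOCK) do
        get!(COMPILE_CACHE, cfg, result)
    end
end

# Fan out across tasks, as a precompile sweep might.
tasks = [Threads.@spawn compile_once!(cfg % 4) for cfg in 1:16]
results = fetch.(tasks)
```

Every mutation of the shared `Dict` happens under the lock, while the expensive step stays outside it, so the lock is held only for short lookups and writes.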

The tileiras subprocess still runs unlocked, so concurrent compiles can
still overlap their shell-outs — the original rationale behind the
autotune branch's `EMIT_TILE_LOCK`.

Adds `_SCOPED_INF_CACHE::ScopedValue` to the cuTile interpreter so a
caller can opt into reusing one `CC.InferenceCache` across many
`cuTileInterpreter(cache)` constructions. Autotuning's
`find_or_tune` now wraps its precompile+measure pass in
`with(_SCOPED_INF_CACHE => fresh)`, so the 25-config sweep over a
single kernel shares inference results across all const-seeded
variants instead of paying the slow paths once per config.
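
A minimal sketch of the opt-in sharing pattern, assuming Julia 1.11 or newer for `Base.ScopedValues`. The names `SCOPED_CACHE`, `current_cache`, and `infer!` are hypothetical, and a plain `Dict` stands in for `CC.InferenceCache`; this is not the interpreter's actual code.

```julia
using Base.ScopedValues: ScopedValue, with  # assumes Julia 1.11+

# Hypothetical stand-in: a Dict plays the role of the inference cache.
# Constructed without a value, so it is unassigned unless a caller opts in.
const SCOPED_CACHE = ScopedValue{Dict{Symbol,Int}}()

# Use the scoped cache when one is set; otherwise allocate a fresh one.
function current_cache()
    scoped = Base.ScopedValues.get(SCOPED_CACHE)  # `nothing` or `Some(cache)`
    return scoped === nothing ? Dict{Symbol,Int}() : something(scoped)
end

# Stand-in for expensive per-config work that memoizes into the cache.
function infer!(name::Symbol)
    cache = current_cache()
    value = get!(cache, name) do
        length(String(name))
    end
    return value, cache
end

# Outside `with`: every call gets its own fresh cache.
_, c1 = infer!(:vadd)
_, c2 = infer!(:vadd)

# Inside `with`: all calls share one cache for the extent of the block.
shared = Dict{Symbol,Int}()
with(SCOPED_CACHE => shared) do
    for _ in 1:25
        infer!(:vadd)
    end
end
```

Because the value is scoped rather than global, callers that never enter `with` keep the allocate-per-construction behavior.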

Microbenchmark on a TileArray load with `order=(1,2)` (the original
motivating case), 25 configs:

  Fresh:  9.7 ms/cfg
  Shared: 2.4 ms/cfg   (~4x speedup)

The ScopedValue is unassigned by default, so non-autotune callers
(plain `@cuda backend=cuTile`) behave exactly as before — each
`cuTileInterpreter` allocates its own fresh cache.

Shorthand for `CUDACore.@cuda backend=cuTile …`. Module references
are interpolated as values (`$CUDACore`, `$cuTile`) rather than
symbols, so callers don't need `using CUDACore` (or even `using cuTile`)
at the call site — the expanded form points directly at the module
objects.

Verified working from a module with only `using cuTile`:

  module NoSetup
      using cuTile
      function run(a, b, c)
          cuTile.@Cutile blocks=cld(length(a), 16) vadd(a, b, c)
      end
  end

`@autotune` is a thin surface over `autotune_launch`. `$X` inside
`blocks=` or the kernel-call args is rewritten to `cfg.X` (the macro
intercepts `Expr(:$, :X)` nodes before lowering rejects them):

    @autotune(
        key   = (eltype(A), size(A, 2)),
        space = (TILE_M=(64, 128), TILE_N=(64, 128), occupancy=(1, 2, 4)),
        blocks = (cld(M, $TILE_M), cld(N, $TILE_N)),
        matmul(A, B, C, Constant($TILE_M), Constant($TILE_N))
    )

`space=` accepts a NamedTuple literal (→ CartesianSpace), a Vector of
NamedTuples (→ FixedSpace), or any `AbstractSearchSpace` (pass-through —
use this for `CartesianSpace(constraint; ...)`). Kernel kwargs aren't
supported (rejected at expansion); pass values positionally via
`Constant(...)`.
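
The `$X` to `cfg.X` rewrite can be illustrated with a small expression walker. This is a sketch under assumptions, not the `@autotune` implementation: `rewrite_dollar` and `@cfg_closure` are hypothetical names, and the real macro does considerably more (kwarg parsing, space handling, validation).

```julia
# Literals and symbols pass through unchanged.
rewrite_dollar(ex, cfg::Symbol) = ex

# Replace every `$name` node with `cfg.name`, recursing through nested expressions.
function rewrite_dollar(ex::Expr, cfg::Symbol)
    if ex.head === :$ && length(ex.args) == 1 && ex.args[1] isa Symbol
        return Expr(:., cfg, QuoteNode(ex.args[1]))  # builds `cfg.name`
    end
    return Expr(ex.head, (rewrite_dollar(a, cfg) for a in ex.args)...)
end

# Hypothetical macro: turn its argument into a `cfg -> ...` closure.
# The macro sees the `$` nodes before lowering would reject them.
macro cfg_closure(body)
    cfg = gensym(:cfg)
    return esc(:($cfg -> $(rewrite_dollar(body, cfg))))
end

N = 100
grid = @cfg_closure cld(N, $tile)
blocks = grid((tile = 16, occupancy = 1))  # evaluates cld(100, 16)
```

Emitting a closure over `cfg` is what lets the same `blocks=` expression be evaluated once per candidate config.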

`autotune_launch` now also accepts `num_ctas`/`occupancy` as **static**
kwargs (applied uniformly to every cfg, e.g. for `ByTarget(...)`
per-arch dispatch). They may not coexist with a same-named axis in
`space`:
- The macro flags the conflict at expansion time when `space` is a
  literal NamedTuple.
- `autotune_launch` flags it at run time otherwise (opaque spaces).

Cache key now includes the static hints, so cfgs tuned under different
`num_ctas`/`occupancy` settings are kept separate.
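
A run-time version of the conflict check and the extended cache key might look like the following. Both `check_hint_conflict` and `cache_key` are hypothetical helpers written for illustration, and the sketch only handles a NamedTuple `space`; the actual checks in `autotune_launch` may differ.

```julia
# Hypothetical helper: reject a static hint that is also an axis of the space.
function check_hint_conflict(space::NamedTuple; num_ctas=nothing, occupancy=nothing)
    for (name, value) in pairs((; num_ctas, occupancy))
        if value !== nothing && haskey(space, name)
            throw(ArgumentError("`$name` is both a static kwarg and an axis in `space`"))
        end
    end
    return nothing
end

# Hypothetical cache key: the static hints are part of the key, so results
# tuned under different hint settings never collide.
cache_key(key; num_ctas=nothing, occupancy=nothing) = (key, num_ctas, occupancy)

space = (TILE_M = (64, 128), TILE_N = (64, 128), occupancy = (1, 2, 4))
check_hint_conflict(space; num_ctas = 2)       # fine: `num_ctas` is not an axis
# check_hint_conflict(space; occupancy = 2)    # would throw an ArgumentError
```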

Tests: 48 cases pass (31 existing + 17 covering static hints, the macro,
$X interpolation in tuple-blocks, NT vs Vector space, required-kwarg
errors, and the macro-time + run-time conflict errors).

`find_or_tune`'s final write block already handles the case where two
threads race into tuning the same key: only one wins the
`per_kernel[arg_key] = candidate` write; the other reads the winner's
entry. The original autotune branch returned `cache_hit=true` for the
race-loser; my port dropped that signal and hardcoded `false`, so the
result NamedTuple misreported provenance in races.
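
The intended behavior can be sketched as a locked check-then-write that reports whether the entry was already present. `TUNE_LOCK`, `PER_KERNEL`, and `publish!` are hypothetical names; the real `find_or_tune` write block is not shown in this PR text.

```julia
# Hypothetical stand-ins for the per-kernel result table and its lock.
const TUNE_LOCK = ReentrantLock()
const PER_KERNEL = Dict{Any,Any}()

# Return the stored entry plus whether it was already there.
function publish!(arg_key, candidate)
    lock(TUNE_LOCK) do
        if haskey(PER_KERNEL, arg_key)
            # Race-loser: another task already wrote; report a cache hit.
            return (entry = PER_KERNEL[arg_key], cache_hit = true)
        end
        # Race-winner: this task's candidate becomes the entry.
        PER_KERNEL[arg_key] = candidate
        return (entry = candidate, cache_hit = false)
    end
end

# Several tasks finish tuning the same key at about the same time.
tasks = [Threads.@spawn publish!(:k, (tile = 16 * i,)) for i in 1:4]
outcomes = fetch.(tasks)
```

All callers receive the same entry, and exactly one of them reports `cache_hit = false`.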

Cosmetic, since the right entry is still returned, but worth fixing now
that the audit has found it.

Replace the two-phase all-compile-then-all-measure flow with a
producer-consumer pipeline: compile workers push each finished cfg
onto a Channel, the master task pulls them off in arrival order and
times them on the GPU. The first cfg's measurement starts the moment
that cfg's tileiras subprocess returns, overlapping with the remaining
cfgs still compiling in the background.

Master is the consumer by design — `eval_cfg`'s `CUDACore.synchronize`
/ `@elapsed` rely on task-local CUDA state, and we want the timed
context to match the caller's. Producer sub-tasks (which only run
`cufunction` / load the CUBIN) tolerate fresh task-local state.

Falls back to `measure_candidates` (serial compile+measure) when
`precompile_workers=0` or only one cfg is in the search space.

Cancellation: master sets `cancelled[]=true` on `InterruptException`,
drains the channel so producers don't block on `put!`, then waits for
the producer driver to wind down.
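
The shape of this pipeline can be sketched with plain tasks and a `Channel`. Everything here is a stand-in: `compile_cfg`, `measure_cfg`, and `pipeline` are hypothetical, `sleep` replaces compilation, and the sketch cancels on any error rather than only on `InterruptException`.

```julia
# Hypothetical stand-ins for the compile and measure steps.
compile_cfg(cfg) = (sleep(0.01 * cfg.tile / 16); cfg)  # pretend to compile
measure_cfg(compiled) = 1.0 / compiled.tile            # pretend to time a launch

function pipeline(cfgs; workers::Int = 2)
    pending = Channel{Any}(length(cfgs))
    foreach(c -> put!(pending, c), cfgs)
    close(pending)                      # producers stop once it is drained
    finished = Channel{Any}(1)          # small buffer: producers feel back-pressure
    cancelled = Threads.Atomic{Bool}(false)

    # Producers: compile in the background, push each cfg as it finishes.
    producers = map(1:workers) do _
        Threads.@spawn for cfg in pending
            cancelled[] && break
            put!(finished, compile_cfg(cfg))
        end
    end
    # Driver: close the output channel once every producer is done.
    driver = Threads.@spawn begin
        foreach(wait, producers)
        close(finished)
    end

    # Consumer: the calling task measures in arrival order.
    record = Any[]
    try
        for compiled in finished
            push!(record, compiled => measure_cfg(compiled))
        end
    catch
        cancelled[] = true
        foreach(_ -> nothing, finished)  # drain so producers never block on put!
        rethrow()
    finally
        wait(driver)
    end
    return record
end

cfgs = [(tile = t, occupancy = o) for t in (16, 32) for o in (1, 2)]
record = pipeline(cfgs)
sort!(record, by = last)     # record is in completion order, so sort before picking
best = first(first(record))
```

Keeping the consumer on the calling task mirrors the design choice stated above: only the producers run on spawned tasks.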

Record is now in completion order rather than trial order — the only
behavior change visible to callers. Refinement (`sort!(record, by=last)`)
doesn't care; tests don't check ordering.

Sanity check on a 16-cfg sweep with 8 threads: ~2× faster wall-time than
`precompile_workers=0` (combined effect of compile parallelism +
pipelining).

- `key_fn` removed. `arg_key` was always built eagerly inside
  `autotune_launch` (no laziness benefit), so `key_fn=f` and
  `key=f()` were identical from the caller's perspective. Use `key=`.

- `grid` and `args` (formerly `grid_fn`/`args_fn`) now accept either a
  `cfg -> value` callable OR a plain value (wrapped in `Returns(...)`).
  Lets direct `autotune_launch` callers pass `cld(n, 16)` and
  `(a, b, c, Constant(16))` without writing trivial closures. The
  `@autotune` macro keeps emitting closures (because `$X` may appear).

- `launch_args_fn` renamed to `launch_args` (same fn-or-value treatment).

- Strip the `::Union{Nothing, Function}=nothing` annotations from kwargs
  throughout — they pinned the type for no reason and added noise.
  Implicit `Any` with `=nothing` does the same job.
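
The fn-or-value treatment can be sketched with `Returns` (available since Julia 1.7). `as_cfg_fn` is a hypothetical helper, and testing `isa Function` is an assumption of this sketch; the PR does not say how callables are detected.

```julia
# Hypothetical helper: accept either a `cfg -> value` callable or a plain value.
# Assumption: anything that is not a `Function` is treated as a constant.
as_cfg_fn(x) = x isa Function ? x : Returns(x)

cfg = (tile = 32, occupancy = 2)

grid_fixed = as_cfg_fn(cld(100, 16))               # plain value, wrapped in Returns
grid_tuned = as_cfg_fn(c -> cld(100, c.tile))      # callable, passed through

fixed_blocks = grid_fixed(cfg)
tuned_blocks = grid_tuned(cfg)
```

After normalization both forms are called the same way, so the tuning loop does not need to branch on which one the caller supplied.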

Tests: drop the `key_fn` case (3 assertions), add a "literal grid/args"
case. Net 2388 total pass; autotune set goes 48 → 47.

A CartesianSpace axis with only `(nothing,)` adds the same field to
every cfg with value `nothing` — same outcome as omitting the axis,
since `hints_from_cfg` already falls through `hasproperty`. Was
present in the original autotune branch too; just noise.

Keep `occupancy=(nothing, 2)` in the CartesianSpace testset (tunes
between "no hint" and 2, which is meaningful) and keep the explicit
`nothing` slots on the `configs` Vector (FixedSpace requires uniform
NT shape across elements; cfg 1's nothings make the shape match
cfgs 2/3 which carry real hint values).
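
To see why a `(nothing,)` axis is equivalent to omitting it, here is a sketch of a Cartesian expansion over NamedTuple axes. `expand_axes` and `hint` are hypothetical helpers; `hint` only imitates the `hasproperty` fall-through attributed to `hints_from_cfg`, and `CartesianSpace` itself is not used.

```julia
# Hypothetical helper: expand a NamedTuple of axes into one NamedTuple per combination.
function expand_axes(axis_values::NamedTuple)
    names = keys(axis_values)
    combos = Iterators.product(values(axis_values)...)
    return vec([NamedTuple{names}(vals) for vals in combos])
end

# Stand-in for reading an optional hint: a missing field and `nothing` behave alike.
hint(cfg, name::Symbol) = hasproperty(cfg, name) ? getproperty(cfg, name) : nothing

with_axis    = expand_axes((tile = (16, 32), occupancy = (nothing,)))
without_axis = expand_axes((tile = (16, 32),))
mixed        = expand_axes((tile = (16, 32), occupancy = (nothing, 2)))
```

`with_axis` and `without_axis` yield the same hints for every cfg, while `mixed` doubles the number of cfgs and tunes between no hint and 2.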