Add autotuning by AntonOresten · Pull Request #247 · JuliaGPU/cuTile.jl

AntonOresten · 2026-06-03T16:20:57Z

Still experimental, but improves upon and supersedes #95, and includes the change from #95 (comment).

Introduces an @autotune macro with some ideas from #40, which uses a space argument for the search space:

using cuTile.Experimental: @autotune

@autotune(
    space=[(tile=16, occupancy=1), (tile=32, occupancy=2)],
    blocks=cld(N, $tile),
    vadd(a, b, c, ct.Constant($tile))
)

The $ syntax means that the name comes from a config in the search space. occupancy is automatically detected as an entry hint.
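As a rough illustration of what tuning over such a space amounts to, the sketch below walks a vector of NamedTuple configs, computes the per-config block count the way `blocks=cld(N, $tile)` would, and keeps the cheapest candidate. `fake_cost` and `pick_best` are hypothetical names, and the cost is made up so the example runs without a GPU; it is not how cuTile.jl measures candidates.

```julia
# Hypothetical stand-in for a timed kernel launch: returns a made-up cost
# so the example runs anywhere. A real tuner would time the launch instead.
fake_cost(cfg, blocks) = blocks / cfg.occupancy + cfg.tile / 64

function pick_best(space, N)
    best_cfg, best_cost = nothing, Inf
    for cfg in space
        blocks = cld(N, cfg.tile)      # per-config value of `cld(N, $tile)`
        cost = fake_cost(cfg, blocks)
        if cost < best_cost
            best_cfg, best_cost = cfg, cost
        end
    end
    return best_cfg
end

space = [(tile=16, occupancy=1), (tile=32, occupancy=2)]
best = pick_best(space, 1024)
```

With this made-up cost the second config wins, because it halves the block count and doubles the occupancy.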

Some rough edges that I'm unsure about are:

- Workarounds to avoid persistent caching of candidates while iterating through the search space.
- _SCOPED_INF_CACHE as a targeted solution to a specific problem. See Add autotuning (experimental) #95 (comment).

Ports the autotune work from the autotune branch onto main's launch infrastructure (`cuTile.cufunction` / `TileKernel`, `cuTileconvert`, `Constant` unwrapping in `unwrap_argtypes`). Lives in `src/Experimental.jl` (the original was under `src/Experimental.jl` + `ext/autotune/autotune.jl` behind a CUDA weakdep; on main, `cuTile.launch` is in `src/` and depends on `CUDACore` directly, so the extension boundary is no longer needed).

What this drops vs. the autotune branch:
- `ext/CUDAExt.jl`: subsumed by `src/launch.jl` on main.
- The `_SCOPED_INF_CACHE`/`create_inf_cache` scoped inference plumbing: caching is now handled by `CompilerCaching` inside `cufunction`.
- Manual `emit_function!` calls in `precompile_cfg`: `cufunction` does the full compile+link in one shot, so precompile is just a `cufunction` call per cfg.

API preserved:
- `Experimental.autotune_launch(f, space, grid_fn, args_fn; ...)`
- `Experimental.clear_autotune_cache(; kernel=nothing, key=nothing)`
- `Experimental.CartesianSpace`, `FixedSpace`, `AbstractSearchSpace`
- All tuning preset / verify / setup / launch_args_fn knobs.

Tests: all 31 cases in `test/device/autotune.jl` pass.

`emit_tile!` and everything below it (inference, structured IR, tile-IR emission) mutates shared state (`CacheView` entries, `CuTileResults` fields, the inference cache, and CompilerCaching's per-CI const_entries vector), none of which is thread-safe. Without a lock, concurrent `cufunction` calls (e.g. autotuning's precompile fan-out across `Threads.@spawn` workers) silently race.

Two acquisitions, brief each:
- `ensure_compiled` inside `compile` (lookup hit just briefly contends)
- `emit_tile!` inside `emit_binary!`

The tileiras subprocess still runs unlocked, so concurrent compiles can still overlap their shell-outs, which was the original rationale behind the autotune branch's `EMIT_TILE_LOCK`.
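The locking pattern described above can be sketched in plain Julia as follows. This is a minimal illustration under assumptions: `compile_cfg`, the `Dict` cache, and the lock are hypothetical stand-ins, not the cuTile.jl internals. Only the step that mutates shared state is serialized; slow independent work stays outside the lock.

```julia
function compile_cfg(cache::Dict{Int,String}, lk::ReentrantLock, id::Int)
    # Critical section: only the part that mutates shared state is locked.
    lock(lk) do
        get!(cache, id) do
            "ir-for-cfg-$id"
        end
    end
    # Slow, independent work (e.g. an external assembler process) would run
    # here, outside the lock, so concurrent compiles can still overlap.
    return id
end

cache = Dict{Int,String}()
lk = ReentrantLock()
# 64 concurrent requests that map onto 8 distinct cache keys (0 through 7).
tasks = [Threads.@spawn compile_cfg(cache, lk, i % 8) for i in 1:64]
ids = fetch.(tasks)
```

Without the `lock` call, the concurrent `get!` calls on the same `Dict` would be a data race, since `Dict` is not thread-safe.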

Adds `_SCOPED_INF_CACHE::ScopedValue` to the cuTile interpreter so a caller can opt into reusing one `CC.InferenceCache` across many `cuTileInterpreter(cache)` constructions. Autotuning's `find_or_tune` now wraps its precompile+measure pass in `with(_SCOPED_INF_CACHE => fresh)`, so the 25-config sweep over a single kernel shares inference results across all const-seeded variants instead of paying the slow paths once per config.

Microbenchmark on a TileArray load with `order=(1,2)` (the original motivating case), 25 configs:
- Fresh: 9.7 ms/cfg
- Shared: 2.4 ms/cfg (~4x speedup)

The ScopedValue is unassigned by default, so non-autotune callers (plain `@cuda backend=cuTile`) behave exactly as before: each `cuTileInterpreter` allocates its own fresh cache.
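The opt-in behaviour can be illustrated with `Base.ScopedValues` alone. The sketch assumes Julia 1.11 or newer; `SHARED_CACHE`, `new_interpreter_cache`, and `infer!` are hypothetical names, and a `Dict` stands in for the inference cache.

```julia
using Base.ScopedValues   # assumes Julia 1.11 or newer

# Hypothetical stand-in for `_SCOPED_INF_CACHE`: unassigned by default.
const SHARED_CACHE = ScopedValue{Dict{Symbol,Int}}()

# Each "interpreter" reuses the scoped cache if one is set, else makes its own.
function new_interpreter_cache()
    scoped = ScopedValues.get(SHARED_CACHE)   # `nothing` or `Some(cache)`
    return scoped === nothing ? Dict{Symbol,Int}() : something(scoped)
end

# Pretend the `do` block is the slow inference path.
infer!(cache, name::Symbol) = get!(cache, name) do
    length(String(name))
end

# Default: two constructions give two independent caches.
a = new_interpreter_cache(); infer!(a, :load)
b = new_interpreter_cache()
isolated = !haskey(b, :load)

# Opt in: every construction inside `with` sees the same cache.
fresh = Dict{Symbol,Int}()
shared = with(SHARED_CACHE => fresh) do
    c = new_interpreter_cache(); infer!(c, :load)
    d = new_interpreter_cache()
    haskey(d, :load)
end
```

Because the scoped value is only bound for the dynamic extent of the `with` block, code outside it keeps the default behaviour.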

@cuda

Shorthand for `CUDACore.@cuda backend=cuTile …`. Module references are interpolated as values (`$CUDACore`, `$cuTile`) rather than symbols, so callers don't need `using CUDACore` (or even `using cuTile`) at the call site: the expanded form points directly at the module objects.

Verified working from a module with only `using cuTile`:

module NoSetup
using cuTile
function run(a, b, c)
    cuTile.@Cutile blocks=cld(length(a), 16) vadd(a, b, c)
end
end
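The module-as-value trick can be shown without CUDA. In the sketch below, `Backend`, `Front`, and `@launch` are hypothetical stand-ins: the macro escapes its whole expansion, so a plain `Backend.launch` symbol would be looked up at the call site, whereas splicing the module object itself with `$(Backend)` works even when the caller never imported `Backend`.

```julia
module Backend
# Hypothetical launcher: returns the block count and the call result.
launch(f, args...; blocks) = (blocks, f(args...))
end

module Front
import ..Backend
macro launch(kw, call)
    (Meta.isexpr(kw, :(=)) && kw.args[1] === :blocks) || error("expected blocks=...")
    Meta.isexpr(call, :call) || error("expected a function call")
    f, args = call.args[1], call.args[2:end]
    # `$(Backend)` splices the module object into the expansion, so the
    # call site does not need `Backend` in scope.
    esc(:($(Backend).launch($f, $(args...); blocks=$(kw.args[2]))))
end
end

module NoSetup
using ..Front: @launch          # note: `Backend` is never imported here
add(a, b) = a + b
run_demo(a, b) = @launch blocks=cld(length(a), 16) add(a, b)
end

out = NoSetup.run_demo([1, 2], [3, 4])
```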

@autotune

`@autotune` is a thin surface over `autotune_launch`. `$X` inside `blocks=` or the kernel-call args is rewritten to `cfg.X` (the macro intercepts `Expr(:$, :X)` nodes before lowering rejects them):

@autotune(
    key = (eltype(A), size(A, 2)),
    space = (TILE_M=(64, 128), TILE_N=(64, 128), occupancy=(1, 2, 4)),
    blocks = (cld(M, $TILE_M), cld(N, $TILE_N)),
    matmul(A, B, C, Constant($TILE_M), Constant($TILE_N))
)

`space=` accepts a NamedTuple literal (→ CartesianSpace), a Vector of NamedTuples (→ FixedSpace), or any `AbstractSearchSpace` (pass-through; use this for `CartesianSpace(constraint; ...)`). Kernel kwargs aren't supported (rejected at expansion); pass values positionally via `Constant(...)`.

`autotune_launch` now also accepts `num_ctas`/`occupancy` as **static** kwargs (applied uniformly to every cfg, e.g. for `ByTarget(...)` per-arch dispatch). They may not coexist with a same-named axis in `space`:
- The macro flags the conflict at expansion time when `space` is a literal NamedTuple.
- `autotune_launch` flags it at run time otherwise (opaque spaces).

Cache key now includes the static hints, so cfgs tuned under different `num_ctas`/`occupancy` settings are kept separate.

Tests: 48 cases pass (31 existing + 17 covering static hints, the macro, $X interpolation in tuple-blocks, NT vs Vector space, required-kwarg errors, and the macro-time + run-time conflict errors).
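The `$X` to `cfg.X` rewrite can be sketched as a small expression walk. `rewrite_dollar` and `@cfg_closure` are hypothetical names for illustration, not the PR's implementation; the point is only that a macro sees `Expr(:$, :X)` nodes before lowering does and can replace them.

```julia
# Replace every `$name` node with `cfg.name`, recursing through the expression.
rewrite_dollar(x, cfg::Symbol) = x     # literals, symbols, line numbers
function rewrite_dollar(ex::Expr, cfg::Symbol)
    if ex.head === :$ && length(ex.args) == 1 && ex.args[1] isa Symbol
        return Expr(:., cfg, QuoteNode(ex.args[1]))   # builds `cfg.name`
    end
    return Expr(ex.head, (rewrite_dollar(a, cfg) for a in ex.args)...)
end

# Hypothetical helper: turn an expression containing `$X` into a closure over `cfg`.
macro cfg_closure(ex)
    body = rewrite_dollar(ex, :cfg)
    return esc(:(cfg -> $body))
end

M, N = 256, 512
grid = @cfg_closure (cld(M, $TILE_M), cld(N, $TILE_N))
blocks = grid((TILE_M=64, TILE_N=128))
```

Emitting a closure keeps the grid expression lazy, so it can be re-evaluated for each config in the search space.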

`find_or_tune`'s final write block already handles the case where two threads race into tuning the same key: only one wins the `per_kernel[arg_key] = candidate` write, the other reads the winner's entry. The original autotune branch returned `cache_hit=true` for the race-loser; my port dropped that signal and hardcoded `false`, so the result NT misreported provenance in races. Cosmetic — the right entry is still returned — but worth fixing while the audit found it.
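The race-loser case can be illustrated with a lock-guarded write that reports whether the caller's own candidate was stored. This is a sketch under assumptions: `store_result!` is a hypothetical name, and the use of a `ReentrantLock` here is an assumption about how the final write is serialized.

```julia
function store_result!(per_kernel::Dict, lk::ReentrantLock, arg_key, candidate)
    lock(lk) do
        if haskey(per_kernel, arg_key)
            # Another task finished tuning first: hand back its entry and
            # report that this caller did not produce it.
            return (entry=per_kernel[arg_key], cache_hit=true)
        end
        per_kernel[arg_key] = candidate
        return (entry=candidate, cache_hit=false)
    end
end

per_kernel = Dict{Any,Any}()
lk = ReentrantLock()
# Four tasks race to store a result for the same key.
tasks = [Threads.@spawn store_result!(per_kernel, lk, :key, (tile=16 * i,)) for i in 1:4]
results = fetch.(tasks)
```

Every caller gets the same entry back, and exactly one of them sees `cache_hit=false`.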

Replace the two-phase all-compile-then-all-measure flow with a producer-consumer pipeline: compile workers push each finished cfg onto a Channel, the master task pulls them off in arrival order and times them on the GPU. The first cfg's measurement starts the moment that cfg's tileiras subprocess returns, overlapping with the remaining cfgs still compiling in the background.

Master is the consumer by design: `eval_cfg`'s `CUDACore.synchronize` / `@elapsed` rely on task-local CUDA state, and we want the timed context to match the caller's. Producer sub-tasks (which only run `cufunction` / load the CUBIN) tolerate fresh task-local state.

Falls back to `measure_candidates` (serial compile+measure) when `precompile_workers=0` or only one cfg is in the search space.

Cancellation: master sets `cancelled[]=true` on `InterruptException`, drains the channel so producers don't block on `put!`, then waits for the producer driver to wind down.

Record is now in completion order rather than trial order, which is the only behavior change visible to callers. Refinement (`sort!(record, by=last)`) doesn't care; tests don't check ordering.

Sanity check on a 16-cfg sweep with 8 threads: ~2× wall-time vs. `precompile_workers=0` (combined effect of compile parallelism + pipelining).
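The producer-consumer shape described above can be sketched with a `Channel` and spawned tasks. `fake_compile`, `fake_measure`, and `tune_pipelined` are hypothetical stand-ins (a sleep replaces the compiler, a made-up number replaces GPU timing), and cancellation handling is left out for brevity.

```julia
# Hypothetical stand-ins: sleep instead of compiling, made-up time instead of
# timing a GPU launch.
fake_compile(cfg) = (sleep(0.01 * cfg.tile / 16); cfg)
fake_measure(cfg) = 1.0 / cfg.tile

function tune_pipelined(space)
    finished = Channel{eltype(space)}(length(space))
    # Producer driver: one task per cfg, each pushes its cfg when it is done.
    producer = Threads.@spawn begin
        @sync for cfg in space
            Threads.@spawn put!(finished, fake_compile(cfg))
        end
        close(finished)          # lets the consumer loop terminate
    end
    # Consumer: the calling task measures cfgs in arrival order.
    record = Tuple{eltype(space),Float64}[]
    for cfg in finished
        push!(record, (cfg, fake_measure(cfg)))
    end
    wait(producer)
    sort!(record, by=last)       # arrival order no longer matters after this
    return record
end

space = [(tile=64,), (tile=16,), (tile=32,)]
record = tune_pipelined(space)
```

Measurement of an early-finishing cfg starts while the others are still "compiling", and the final sort makes the result independent of arrival order.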

- `key_fn` removed. `arg_key` was always built eagerly inside `autotune_launch` (no laziness benefit), so `key_fn=f` and `key=f()` were identical from the caller's perspective. Use `key=`.
- `grid` and `args` (formerly `grid_fn`/`args_fn`) now accept either a `cfg -> value` callable OR a plain value (wrapped in `Returns(...)`). Lets direct `autotune_launch` callers pass `cld(n, 16)` and `(a, b, c, Constant(16))` without writing trivial closures. The `@autotune` macro keeps emitting closures (because `$X` may appear).
- `launch_args_fn` renamed to `launch_args` (same fn-or-value treatment).
- Strip the `::Union{Nothing, Function}=nothing` annotations from kwargs throughout: they pinned the type for no reason and added noise. Implicit `Any` with `=nothing` does the same job.

Tests: drop the `key_fn` case (3 assertions), add a "literal grid/args" case. Net 2388 total pass; autotune set goes 48 → 47.
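The fn-or-value treatment can be sketched with `Base.Returns` (available since Julia 1.7). `as_callable` is a hypothetical helper, and detecting callables with `isa Function` is an assumption; the PR may use a different check.

```julia
# Assumption: anything that is not a `Function` is treated as a plain value.
as_callable(x) = x isa Function ? x : Returns(x)

n = 100
grid_fn  = as_callable(cfg -> cld(n, cfg.tile))   # closure, passed through
grid_val = as_callable(cld(n, 16))                # plain value, wrapped
args_val = as_callable((1, 2, 3))                 # plain tuple, wrapped

cfg = (tile=32,)
outputs = (grid_fn(cfg), grid_val(cfg), args_val(cfg))
```

After normalization the tuner can call every knob as `knob(cfg)` without caring which form the caller used.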

A CartesianSpace axis with only `(nothing,)` adds the same field to every cfg with value `nothing` — same outcome as omitting the axis, since `hints_from_cfg` already falls through `hasproperty`. Was present in the original autotune branch too; just noise. Keep `occupancy=(nothing, 2)` in the CartesianSpace testset (tunes between "no hint" and 2, which is meaningful) and keep the explicit `nothing` slots on the `configs` Vector (FixedSpace requires uniform NT shape across elements; cfg 1's nothings make the shape match cfgs 2/3 which carry real hint values).
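The redundancy can be seen in a small Cartesian-expansion sketch. `expand` and `hint` are hypothetical helpers, and reading a hint through `hasproperty` with a `nothing` fallback is an assumption modeled on the description of `hints_from_cfg` above.

```julia
# One cfg per combination of axis values.
function expand(axis_values::NamedTuple)
    names = keys(axis_values)
    combos = Iterators.product(values(axis_values)...)
    return vec([NamedTuple{names}(vals) for vals in combos])
end

# Assumption: a missing field and a `nothing` field both mean "no hint".
hint(cfg, name::Symbol) = hasproperty(cfg, name) ? getproperty(cfg, name) : nothing

with_axis    = expand((tile=(16, 32), occupancy=(nothing,)))
without_axis = expand((tile=(16, 32),))
mixed        = expand((tile=(16, 32), occupancy=(nothing, 2)))
```

`with_axis` and `without_axis` have the same number of configs and yield the same hints, while `mixed` doubles the space and actually varies the hint.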

AntonOresten added 15 commits June 3, 2026 21:18

Move to src/experimental/

16a2aa4

Refactor experimental autotuning internals

13a316b

Keep autotune candidate compiles temporary

0d0c628

Fix TileCacheKey call

f60154d

Retain context

215fece

cleanup

4090aca

AntonOresten force-pushed the ao/autotune branch from 7c2b9bf to 4090aca on June 3, 2026 19:18

AntonOresten wants to merge 15 commits into main from ao/autotune
