Add autotuning #247
Draft
AntonOresten wants to merge 15 commits into
Ports the autotune work from the autotune branch onto main's launch infrastructure (`cuTile.cufunction` / `TileKernel`, `cuTileconvert`, `Constant` unwrapping in `unwrap_argtypes`). Lives in `src/Experimental.jl` (the original was under `src/Experimental.jl` + `ext/autotune/autotune.jl` behind a CUDA weakdep; on main, `cuTile.launch` is in `src/` and depends on `CUDACore` directly, so the extension boundary is no longer needed).

What this drops vs. the autotune branch:
- `ext/CUDAExt.jl`: subsumed by `src/launch.jl` on main.
- The `_SCOPED_INF_CACHE`/`create_inf_cache` scoped inference plumbing: caching is now handled by `CompilerCaching` inside `cufunction`.
- Manual `emit_function!` calls in `precompile_cfg`: `cufunction` does the full compile+link in one shot, so precompile is just a `cufunction` call per cfg.

API preserved:
- `Experimental.autotune_launch(f, space, grid_fn, args_fn; ...)`
- `Experimental.clear_autotune_cache(; kernel=nothing, key=nothing)`
- `Experimental.CartesianSpace`, `FixedSpace`, `AbstractSearchSpace`
- All tuning preset / verify / setup / launch_args_fn knobs.

Tests: all 31 cases in `test/device/autotune.jl` pass.
`emit_tile!` and everything below it (inference, structured IR, tile-IR emission) mutates shared state (`CacheView` entries, `CuTileResults` fields, the inference cache, and CompilerCaching's per-CI const_entries vector), none of which is thread-safe. Without this, concurrent `cufunction` calls (e.g. autotuning's precompile fan-out across `Threads.@spawn` workers) silently race.

Two acquisitions, brief each:
- `ensure_compiled` inside `compile` (lookup hit just briefly contends)
- `emit_tile!` inside `emit_binary!`

The tileiras subprocess still runs unlocked, so concurrent compiles can still overlap their shell-outs, which was the original rationale behind the autotune branch's `EMIT_TILE_LOCK`.
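The locking pattern described above can be sketched in plain Julia. This is only an illustration under assumptions: `COMPILE_LOCK`, `COMPILE_STATE`, and `compile_cfg!` are hypothetical names, not cuTile internals, and the `sleep` stands in for the unlocked subprocess call.

```julia
# Hypothetical stand-ins; none of these names come from cuTile.
const COMPILE_LOCK = ReentrantLock()
const COMPILE_STATE = Dict{Int,String}()

function compile_cfg!(cfg::Int)
    # Short critical section: only the shared-state mutation is locked.
    entry = lock(COMPILE_LOCK) do
        get!(COMPILE_STATE, cfg) do
            "ir-for-cfg-$cfg"   # stands in for inference + IR emission
        end
    end
    # Slow external work runs outside the lock, so concurrent
    # callers can overlap here.
    sleep(0.01)
    return entry
end

# Fan out over tasks, as an autotuner's precompile step might.
tasks = [Threads.@spawn compile_cfg!(mod1(i, 4)) for i in 1:16]
results = fetch.(tasks)
```

Keeping the critical section to the cache mutation is what lets the slow part overlap; holding the lock across the `sleep` would serialize every caller.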
Adds `_SCOPED_INF_CACHE::ScopedValue` to the cuTile interpreter so a caller can opt into reusing one `CC.InferenceCache` across many `cuTileInterpreter(cache)` constructions. Autotuning's `find_or_tune` now wraps its precompile+measure pass in `with(_SCOPED_INF_CACHE => fresh)`, so the 25-config sweep over a single kernel shares inference results across all const-seeded variants instead of paying the slow paths once per config. Microbenchmark on a TileArray load with `order=(1,2)` (the original motivating case), 25 configs: Fresh: 9.7 ms/cfg Shared: 2.4 ms/cfg (~4x speedup) The ScopedValue is unassigned by default, so non-autotune callers (plain `@cuda backend=cuTile`) behave exactly as before — each `cuTileInterpreter` allocates its own fresh cache.
Shorthand for `CUDACore.@cuda backend=cuTile …`.

Module references are interpolated as values (`$CUDACore`, `$cuTile`) rather than symbols, so callers don't need `using CUDACore` (or even `using cuTile`) at the call site: the expanded form points directly at the module objects.

Verified working from a module with only `using cuTile`:

    module NoSetup
    using cuTile
    function run(a, b, c)
        cuTile.@Cutile blocks=cld(length(a), 16) vadd(a, b, c)
    end
    end
`@autotune` is a thin surface over `autotune_launch`. `$X` inside
`blocks=` or the kernel-call args is rewritten to `cfg.X` (the macro
intercepts `Expr(:$, :X)` nodes before lowering rejects them):
    @autotune(
        key = (eltype(A), size(A, 2)),
        space = (TILE_M=(64, 128), TILE_N=(64, 128), occupancy=(1, 2, 4)),
        blocks = (cld(M, $TILE_M), cld(N, $TILE_N)),
        matmul(A, B, C, Constant($TILE_M), Constant($TILE_N))
    )
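The `$X` to `cfg.X` rewrite can be sketched as an expression walk. This is a hedged illustration rather than the macro's actual implementation: `rewrite_interp` is a hypothetical helper, and `Meta.parse` is used here only to obtain the same `Expr(:$, :X)` nodes a macro receives in its arguments.

```julia
# Hypothetical helper: replace `$X` nodes with `cfg.X` field accesses.
rewrite_interp(x, cfg::Symbol) = x   # literals and symbols pass through
function rewrite_interp(ex::Expr, cfg::Symbol)
    if ex.head === :$ && length(ex.args) == 1 && ex.args[1] isa Symbol
        return Expr(:., cfg, QuoteNode(ex.args[1]))
    end
    return Expr(ex.head, (rewrite_interp(a, cfg) for a in ex.args)...)
end

# Parsing a string keeps `$TILE_M` as an unevaluated `Expr(:$, :TILE_M)`.
blocks_ex = Meta.parse("(cld(M, \$TILE_M), cld(N, \$TILE_N))")
rewritten = rewrite_interp(blocks_ex, :cfg)

# Wrapping the result gives the `cfg -> ...` closure expression
# a macro could emit in place of the original argument.
closure_ex = Expr(:->, :cfg, rewritten)
```

Doing the rewrite inside the macro matters because a bare `$` outside a quote is rejected during lowering, so the nodes have to be replaced before the expression is returned.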
`space=` accepts a NamedTuple literal (→ CartesianSpace), a Vector of
NamedTuples (→ FixedSpace), or any `AbstractSearchSpace` (pass-through —
use this for `CartesianSpace(constraint; ...)`). Kernel kwargs aren't
supported (rejected at expansion); pass values positionally via
`Constant(...)`.
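To make the NamedTuple form concrete, here is a sketch of how a tuple of axes can expand into one config per combination. `expand_axes` is a hypothetical helper and not the package's `CartesianSpace`; the constraint is shown as a plain `filter`, which is an assumption about what a constrained space amounts to.

```julia
# Hypothetical helper: expand a NamedTuple of axes into every
# combination, one NamedTuple per config.
function expand_axes(axs::NamedTuple{names}) where {names}
    combos = Iterators.product(values(axs)...)
    return vec([NamedTuple{names}(vals) for vals in combos])
end

space = (TILE_M = (64, 128), TILE_N = (64, 128), occupancy = (1, 2, 4))
cfgs = expand_axes(space)          # 2 * 2 * 3 = 12 configs

# A constraint just drops combinations that should not be tried.
constrained = filter(c -> c.TILE_M >= c.TILE_N, cfgs)
```

Every element has the same field names, which is also the property an explicit Vector of NamedTuples needs to satisfy.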
`autotune_launch` now also accepts `num_ctas`/`occupancy` as **static**
kwargs (applied uniformly to every cfg, e.g. for `ByTarget(...)`
per-arch dispatch). They may not coexist with a same-named axis in
`space`:
- The macro flags the conflict at expansion time when `space` is a
literal NamedTuple.
- `autotune_launch` flags it at run time otherwise (opaque spaces).
Cache key now includes the static hints, so cfgs tuned under different
`num_ctas`/`occupancy` settings are kept separate.
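The run-time conflict check and the extended cache key might look roughly like this. Both `check_static_hints` and `cache_key` are hypothetical names, and the key layout is an assumption chosen only to show that differing hints produce differing keys.

```julia
# Hypothetical sketch: reject a static hint that also appears as an axis.
function check_static_hints(cfgs, hints::NamedTuple)
    for name in keys(hints), cfg in cfgs
        hints[name] === nothing && continue   # hint not set, nothing to check
        if hasproperty(cfg, name)
            throw(ArgumentError(
                "`$name` is both a static kwarg and a search-space axis"))
        end
    end
    return nothing
end

# Assumed key layout: the user key paired with the static hints.
cache_key(key, hints::NamedTuple) = (key, hints)

ok_cfgs = [(TILE_M = 64,), (TILE_M = 128,)]
check_static_hints(ok_cfgs, (num_ctas = nothing, occupancy = 2))

k1 = cache_key((Float32, 128), (num_ctas = nothing, occupancy = 1))
k2 = cache_key((Float32, 128), (num_ctas = nothing, occupancy = 2))
```

The macro-time variant can run the same comparison on the literal's field names, which is why it only applies when `space` is written out as a NamedTuple.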
Tests: 48 cases pass (31 existing + 17 covering static hints, the macro,
$X interpolation in tuple-blocks, NT vs Vector space, required-kwarg
errors, and the macro-time + run-time conflict errors).
`find_or_tune`'s final write block already handles the case where two threads race into tuning the same key: only one wins the `per_kernel[arg_key] = candidate` write, the other reads the winner's entry. The original autotune branch returned `cache_hit=true` for the race-loser; my port dropped that signal and hardcoded `false`, so the result NT misreported provenance in races. Cosmetic — the right entry is still returned — but worth fixing while the audit found it.
Replace the two-phase all-compile-then-all-measure flow with a producer-consumer pipeline: compile workers push each finished cfg onto a Channel, and the master task pulls them off in arrival order and times them on the GPU. The first cfg's measurement starts the moment that cfg's tileiras subprocess returns, overlapping with the remaining cfgs still compiling in the background.

Master is the consumer by design: `eval_cfg`'s `CUDACore.synchronize` / `@elapsed` rely on task-local CUDA state, and we want the timed context to match the caller's. Producer sub-tasks (which only run `cufunction` / load the CUBIN) tolerate fresh task-local state.

Falls back to `measure_candidates` (serial compile+measure) when `precompile_workers=0` or only one cfg is in the search space.

Cancellation: master sets `cancelled[]=true` on `InterruptException`, drains the channel so producers don't block on `put!`, then waits for the producer driver to wind down.

Record is now in completion order rather than trial order, which is the only behavior change visible to callers. Refinement (`sort!(record, by=last)`) doesn't care; tests don't check ordering.

Sanity check on a 16-cfg sweep with 8 threads: ~2× wall-time improvement vs. `precompile_workers=0` (combined effect of compile parallelism + pipelining).
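The pipeline shape can be sketched with a `Channel` and spawned producers. This is a simplified illustration under assumptions: `compile`, `measure`, and `pipeline` are hypothetical stand-ins, the work is simulated with `sleep`, and the cancellation path catches any exception rather than only `InterruptException`.

```julia
# Hypothetical stand-ins for the real compile and timing steps.
compile(cfg) = (sleep(0.005 * cfg); "binary-$cfg")
measure(bin) = length(bin)

function pipeline(cfgs; workers = 4)
    ready = Channel{Tuple{Int,String}}(length(cfgs))
    cancelled = Threads.Atomic{Bool}(false)
    chunk_len = max(1, cld(length(cfgs), workers))

    # Producer driver: fan compiles out over worker tasks, then close
    # the channel so the consumer loop terminates.
    producer = Threads.@spawn begin
        try
            @sync for chunk in Iterators.partition(cfgs, chunk_len)
                Threads.@spawn for cfg in chunk
                    cancelled[] && break
                    put!(ready, (cfg, compile(cfg)))
                end
            end
        finally
            close(ready)
        end
    end

    # Consumer runs on the calling task, so timing uses the caller's state.
    record = Pair{Int,Int}[]
    try
        for (cfg, bin) in ready          # arrival order, not trial order
            push!(record, cfg => measure(bin))
        end
    catch
        cancelled[] = true
        foreach(_ -> nothing, ready)     # drain so producers never block
        rethrow()
    finally
        wait(producer)
    end
    return record
end

record = pipeline(collect(1:8))
```

Measuring on the calling task while only the compiles are spawned is the design point: the first measurement can start as soon as one compile finishes, without moving the timed work onto a different task.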
- `key_fn` removed. `arg_key` was always built eagerly inside
`autotune_launch` (no laziness benefit), so `key_fn=f` and
`key=f()` were identical from the caller's perspective. Use `key=`.
- `grid` and `args` (formerly `grid_fn`/`args_fn`) now accept either a
`cfg -> value` callable OR a plain value (wrapped in `Returns(...)`).
Lets direct `autotune_launch` callers pass `cld(n, 16)` and
`(a, b, c, Constant(16))` without writing trivial closures. The
`@autotune` macro keeps emitting closures (because `$X` may appear).
- `launch_args_fn` renamed to `launch_args` (same fn-or-value treatment).
- Strip the `::Union{Nothing, Function}=nothing` annotations from kwargs
throughout — they pinned the type for no reason and added noise.
Implicit `Any` with `=nothing` does the same job.
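The fn-or-value treatment for `grid`, `args`, and `launch_args` can be sketched with `Returns` (available since Julia 1.7). `as_cfg_fn` is a hypothetical helper name; dispatching on `Function` is an assumption, and it means callable structs that are not `<: Function` would be wrapped as plain values.

```julia
# Hypothetical helper: accept either a `cfg -> value` callable or a value.
as_cfg_fn(f::Function) = f
as_cfg_fn(x) = Returns(x)        # constant function that ignores `cfg`

grid = as_cfg_fn(cld(1000, 16))                    # plain value
args = as_cfg_fn(cfg -> (cfg.TILE, 2 * cfg.TILE))  # depends on the config

cfg = (TILE = 16,)
grid_value = grid(cfg)
args_value = args(cfg)
```

After normalization every call site can invoke `grid(cfg)` uniformly, which is why the macro can keep emitting closures without a separate code path.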
Tests: drop the `key_fn` case (3 assertions), add a "literal grid/args"
case. Net 2388 total pass; autotune set goes 48 → 47.
A CartesianSpace axis with only `(nothing,)` adds the same field to every cfg with value `nothing` — same outcome as omitting the axis, since `hints_from_cfg` already falls through `hasproperty`. Was present in the original autotune branch too; just noise. Keep `occupancy=(nothing, 2)` in the CartesianSpace testset (tunes between "no hint" and 2, which is meaningful) and keep the explicit `nothing` slots on the `configs` Vector (FixedSpace requires uniform NT shape across elements; cfg 1's nothings make the shape match cfgs 2/3 which carry real hint values).
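The equivalence claimed above follows from a `hasproperty` fall-through. The following is a sketch with a hypothetical `hint_from_cfg` helper, not the actual `hints_from_cfg`:

```julia
# Hypothetical helper: a missing field and an explicit `nothing`
# both resolve to "no hint".
hint_from_cfg(cfg, name::Symbol) =
    hasproperty(cfg, name) ? getproperty(cfg, name) : nothing

without_axis = (TILE_M = 64,)
with_nothing = (TILE_M = 64, occupancy = nothing)
with_value   = (TILE_M = 64, occupancy = 2)
```

Both `without_axis` and `with_nothing` yield `nothing` for `:occupancy`, so an axis containing only `(nothing,)` adds no information, while `(nothing, 2)` still distinguishes two real choices.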
Still experimental, but improves upon and supersedes #95, including adding #95 (comment).
Introduces an `@autotune` macro with some ideas from #40, which uses a `space` argument for the search space and `$` syntax to mean that the name comes from a config in the search space. `occupancy` is automatically detected as an entry hint.

Some rough edges that I'm unsure about are:
- `_SCOPED_INF_CACHE` as a targeted solution to a specific problem. See Add autotuning (experimental) #95 (comment)