Make constant memory opt-in, spill large statics to global memory #217
Conversation
This looks like it was code that wasn't deleted after the refactor in Rust-GPU@decda87
This isn't fully correct: ideally we'd keep track of what we have put into constant memory and, when it fills up, spill, instead of only spilling when a static is big. But this is materially better than what is there now (a runtime error). An argument can be made to just _always_ use global memory so we don't have to worry about getting the packing right. Fixes Rust-GPU#208. See also the debugging and discussion in Rust-GPU#216.
I've decided to default to not using constant memory, with an opt-in flag. Later, when we are smarter, we can flip the flag by default and/or make it a no-op. I tested this with your Vast script (thanks again!) and it should work. I was playing around with putting everything in globals previously, and that must have snuck in when testing the previous commits!
Let me give it one final test and then we can move on to the next ones, because I think they will be quick (if you have time, that is). sha2 (#207): the entire thing won't compile, with a trap; I bet it is something small and silly (potentially on my end), and this one may actually make all of this worthwhile by improving performance. Not sure if you saw, but 100,000,000 hashes/sec on an RTX 5090 is even better than I could come up with natively: https://github.com/brandonros/vanity_finder_cpp. rand_xoshiro (#203): there's a chance this ed25519 problem was related and it'll fix it, but I doubt it; I also think it'll be something small.
We could hypothetically make a manual GitHub Actions trigger with a Vast API token as a secret that stands up a $0.15/hr instance, runs CI against it (tests/examples/whatever), and then tears it down. I'd be willing to put some cycles into that if you'd like, but I'm not sure if the benefit is there for you.
    AddressSpace(4)
    if !self.codegen_args.use_constant_memory_space {
        // We aren't using constant memory, so put the instance in global memory.
        AddressSpace(1)
nit: could we make consts somewhere that represent this 0 1 2 3 4 stuff better for easier readability?
Yeah, we have one in cuda_std and I didn't want to duplicate it. A followup should add a rustc_codegen_nvvm-types crate that cuda_std and rustc_codegen_nvvm could share (rust-gpu has a "-types" crate for just this reason). rust-gpu also has rspirv for SPIR-V-specific info encoded in Rust types, so perhaps it should be something like rcuda? We could move the CUDA error code mapping out of cust into it as well 🤔
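To make the suggestion above concrete, here is a minimal sketch of what such shared named constants could look like. This is illustrative only (the `AddressSpace` newtype and constant names are assumptions, not the actual cuda_std API); the numeric values follow the NVVM IR address-space convention (0 = generic, 1 = global, 3 = shared, 4 = constant, 5 = local).

```rust
// Hypothetical shared-types sketch: named NVVM address spaces instead of
// bare `AddressSpace(4)` literals scattered through the codegen.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
pub struct AddressSpace(pub u32);

impl AddressSpace {
    pub const GENERIC: AddressSpace = AddressSpace(0);
    pub const GLOBAL: AddressSpace = AddressSpace(1);
    pub const SHARED: AddressSpace = AddressSpace(3);
    pub const CONSTANT: AddressSpace = AddressSpace(4);
    pub const LOCAL: AddressSpace = AddressSpace(5);
}

fn main() {
    // Reads much better at the call site than a bare `AddressSpace(4)`.
    let space = AddressSpace::CONSTANT;
    assert_eq!(space, AddressSpace(4));
    println!("constant memory is address space {}", space.0);
}
```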
We have GPUs sponsored by modal.com. I just haven't had a chance to get it all working (they didn't have simple ssh access, but it looks like maybe they do now? #202).
OK. Not sure if you saw: https://github.com/brandonros/ed25519-vanity-rs/blob/master/.github/workflows/ci.yaml. We can build on the runner (as you know and also have) and then run. I could help split the Vast script to not include the build step if needed. Let's merge if you're ready; this is awesome. Thanks again, hope you've enjoyed working together so far. I'll retest the smaller, simpler Xoshiro RNG issue after this; I have a feeling it might be fixed by this.
Yeah, we build containers as well: https://github.com/Rust-GPU/Rust-CUDA/blob/main/.github/workflows/container_images.yml. That is the idea for the modal stuff: build on Actions, push up to Modal, and run.
#218 for improving usage of constant memory space.
…st-GPU#217)

* Allow address spaces to propagate to LLVM

  This looks like it was code that wasn't deleted after the refactor in Rust-GPU@decda87

* Spill large statics from constant to global memory

  This isn't fully correct, as ideally we'd keep track of what we have put into constant memory and spill when it fills up, instead of only spilling when a static is big. But this is materially better than what is there (a runtime error). An argument can be made to just _always_ use global memory so we don't have to worry about getting the packing right. Fixes Rust-GPU#208. See also the debugging and discussion in Rust-GPU#216.

* Add `--use-constant-memory-space` flag, off by default

* Make it clear that `#[cuda_std::address_space(constant)]` still works
…ation)
Before this commit, mir-importer's static-translation path called
`ensure_zero_initializer`, which only accepted statics whose raw
bytes were all zero and silently discarded `alloc.provenance.ptrs`
(the side table of cross-static pointer fixups). The emitted PTX
contained a single zero-bodied pointer slot for the outer ref-static
and nothing at all for the inner data static — every
`pub static X: &T = &INNER` shape (curve25519-dalek's
`ED25519_BASEPOINT_TABLE`, sha2's K constants, k256's affine
generator, etc.) faulted on hardware with a clean null deref.
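The failing shape described above can be reduced to a two-line repro (names are illustrative; this mirrors the INNER/OUTER pair used in the PTX example later in this commit message):

```rust
// Minimal illustration of the static-through-reference shape: before the
// fix, the inner allocation's bytes and the pointer fixup linking OUTER
// to INNER were both dropped, so the kernel dereferenced a null pointer.
static INNER: [u64; 4] = [1, 2, 3, 4];
pub static OUTER: &[u64; 4] = &INNER;

fn main() {
    // On a correct backend, reading through OUTER must see INNER's bytes.
    assert_eq!(OUTER[0], 1);
    assert_eq!(OUTER[3], 4);
    println!("OUTER correctly points at INNER");
}
```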
Replaced with `compute_static_initializer` (extracts bytes +
relocations from the alloc) and `collect_reachable_statics` (walks
the transitive closure of referenced statics so every reachable
static gets emitted as its own MirGlobalAllocOp).
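The transitive-closure walk can be sketched as a plain BFS over static references. This is a simplified model, not the actual `collect_reachable_statics` implementation: `refs_of` stands in for the real "relocations of this allocation" lookup, and statics are modeled as string keys.

```rust
use std::collections::{HashSet, VecDeque};

// Hypothetical sketch: starting from the statics a kernel references
// directly, follow each static's own relocations until no new statics
// appear, so every reachable static gets its own emitted global.
fn collect_reachable_statics(
    roots: &[&'static str],
    refs_of: impl Fn(&str) -> Vec<&'static str>,
) -> Vec<&'static str> {
    let mut seen: HashSet<&str> = HashSet::new();
    let mut queue: VecDeque<&'static str> = roots.iter().copied().collect();
    let mut out = Vec::new();
    while let Some(s) = queue.pop_front() {
        if seen.insert(s) {
            out.push(s);
            queue.extend(refs_of(s)); // enqueue statics this one points at
        }
    }
    out
}

fn main() {
    // OUTER -> INNER, INNER -> (nothing): both must be emitted.
    let reachable = collect_reachable_statics(&["OUTER"], |s| {
        if s == "OUTER" { vec!["INNER"] } else { vec![] }
    });
    assert_eq!(reachable, vec!["OUTER", "INNER"]);
}
```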
Plumbing:
* `MirGlobalAllocOp` gains `initializer_bytes` (hex-encoded body)
and `initializer_relocations` (`OFF:KEY,OFF:KEY` cross-static
refs) attributes. `crates/dialect-mir/src/ops/memory.rs`.
* `GlobalOp` mirrors them as `llvm_initializer_bytes` /
`llvm_initializer_relocations`, plus an `llvm_global_source_key`
sidecar so the exporter can resolve source-level static names
back to the synthetic `__device_global_N` symbol. mir-lower
forwards all three when creating the LLVM global.
* `export_global` in `crates/dialect-llvm/src/export.rs` now emits:
- `[N x i8] c"\01\00..."` when there are no relocations (the
plain bytes case, e.g. `static INNER: [u64; 4] = [...]`)
- `<{ ... }> <{ ... }>` packed-struct interleaving byte runs
with `addrspacecast (ptr addrspace(1) @target to ptr)` slots
when relocations are present (the
`pub static X: &T = &INNER` case)
Falls back to `zeroinitializer` only when both attrs are absent.
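The `OFF:KEY,OFF:KEY` relocation attribute mentioned above pairs a byte offset inside the initializer with the source key of the static it points at. A hedged sketch of the round-trip (the function names and error handling here are illustrative, not the actual dialect code):

```rust
// Encode a list of (byte offset, target static key) pairs into the
// comma-separated `OFF:KEY` attribute string, and parse it back.
fn encode_relocations(relocs: &[(usize, &str)]) -> String {
    relocs
        .iter()
        .map(|(off, key)| format!("{off}:{key}"))
        .collect::<Vec<_>>()
        .join(",")
}

fn decode_relocations(s: &str) -> Vec<(usize, String)> {
    s.split(',')
        .filter(|e| !e.is_empty())
        .map(|e| {
            let (off, key) = e.split_once(':').expect("expected OFF:KEY entry");
            (off.parse().expect("numeric offset"), key.to_string())
        })
        .collect()
}

fn main() {
    // A ref-static whose single pointer slot (at offset 0) targets INNER.
    let encoded = encode_relocations(&[(0, "INNER")]);
    assert_eq!(encoded, "0:INNER");
    assert_eq!(decode_relocations(&encoded), vec![(0, "INNER".to_string())]);
}
```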
A new pre-pass `build_source_key_map` walks all GlobalOps to
build a `source_key -> llvm_name` map before any export starts.
Secondary (transitively-reachable) statics' MirGlobalAllocOps are
inserted at the front of the current block to keep the kernel's
own terminator the last op in its block — they're hoisted to
module scope by mir-lower regardless of where they sit in the
kernel's MIR.
Post-fix PTX for `static INNER: [u64; 4] = [1,2,3,4]` +
`static OUTER: &[u64; 4] = &INNER`:
.visible .global .align 8 .b8 __device_global_0[32] = {1,0,0,...4,...};
.visible .global .align 8 .u64 __device_global_1[1] = {generic(__device_global_0)};
Sweep: 88 pass, 6 fail (the README-documented codegen-time
known-failures). No regressions.
This is the cuda-oxide counterpart to Rust-CUDA PR 217's static-
placement work (cf. Rust-GPU/rust-cuda#217),
which surfaced from the same downstream consumer
(`~/vanity-miner-rs/`'s ed25519 path).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
I've decided to default to not using constant memory, with an opt-in flag. Later, when we are smarter, we can flip the flag by default and/or make it a no-op. One can still annotate code with `#[cuda_std::address_space(constant)]` to place it manually; this flag only affects automatic placement by the codegen backend.

Using this flag / turning on constant memory can blow up, as the constant memory placement logic isn't fully correct. Ideally we'd keep track of what we have put into constant memory and spill when it fills up, instead of only spilling when a static is too large on its own. We'll also probably want some packing strategy controlled by the user. For example, if you have one large static and many small ones, you might want the small ones to all be in constant memory, or just the big one, depending on your workload. We need some design work around this, and the design shouldn't require code to be annotated, to support third-party non-GPU-aware libraries.
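The "track what's committed and spill when full" bookkeeping described above could be as simple as a running byte counter. This is an illustrative sketch under stated assumptions, not the backend's actual logic: the 64 KiB budget matches the usual CUDA constant-bank size, and the greedy first-fit policy is a placeholder for the real packing strategy the text says still needs design work.

```rust
// Illustrative tracker: commit statics to constant memory until the bank
// would overflow, then spill to global memory, instead of only spilling
// statics that are individually large.
const CONSTANT_BANK_BYTES: usize = 64 * 1024;

#[derive(Debug, PartialEq)]
enum Placement {
    Constant,
    Global,
}

struct ConstantMemoryTracker {
    used: usize,
}

impl ConstantMemoryTracker {
    fn new() -> Self {
        Self { used: 0 }
    }

    fn place(&mut self, size: usize) -> Placement {
        if self.used + size <= CONSTANT_BANK_BYTES {
            self.used += size;
            Placement::Constant
        } else {
            Placement::Global // spill: the bank is (or would be) full
        }
    }
}

fn main() {
    let mut tracker = ConstantMemoryTracker::new();
    assert_eq!(tracker.place(60 * 1024), Placement::Constant);
    // This 8 KiB static no longer fits even though it is small on its own.
    assert_eq!(tracker.place(8 * 1024), Placement::Global);
}
```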
But this is materially better than what is there (a runtime error).
Fixes #208.
See also the debugging and discussion in
#216