Make constant memory opt-in, spill large statics to global memory #217
Conversation
This looks like it was code that wasn't deleted after the refactor in Rust-GPU@decda87
This isn't fully correct: ideally we'd keep track of what we have put into constant memory and, when it fills up, spill, instead of only spilling when a static is big. But this is materially better than what is there now (a runtime error). An argument can be made to just _always_ use global memory so we don't have to worry about getting the packing right. Fixes Rust-GPU#208. See also the debugging and discussion in Rust-GPU#216.
I've decided to default to not using constant memory, with an opt-in flag. Later, when we are smarter, we can flip the flag by default and/or make it a no-op. I tested this with your Vast script (thanks again!) and it should work. I was playing around with putting everything in globals previously, and that must have snuck in when testing the previous commits!
Let me give it one final test and then we can move on to the next ones, because I think they will be quick (if you have time, that is). sha2 (#207): the entire thing won't compile, with a trap; I bet it is something small and silly (potentially on my end), and this one may actually make all of this worthwhile by improving performance. Not sure if you saw, but 100,000,000 hashes/sec on an RTX 5090 is even better than I could come up with natively: https://github.com/brandonros/vanity_finder_cpp. rand_xoshiro (#203): there's a chance this ed25519 problem was related and it'll fix it, but I doubt it; I also think it'll be something small.
We could hypothetically make a manual GitHub Actions trigger with a Vast API token as a secret that stands up a $0.15/hr instance, runs CI against it (tests/examples/whatever), and then tears it down. I'd be willing to put some cycles into that if you'd like, but I'm not sure if the benefit is there for you.
    AddressSpace(4)
    if !self.codegen_args.use_constant_memory_space {
        // We aren't using constant memory, so put the instance in global memory.
        AddressSpace(1)
nit: could we make consts somewhere that represent this 0 1 2 3 4 stuff better for easier readability?
Yeah, we have one in cuda_std and I didn't want to duplicate it. A followup should add a rustc_codegen_nvvm-types crate that cuda_std and rustc_codegen_nvvm could share (rust-gpu has a "-types" crate for just this reason). rust-gpu also has rspirv for SPIR-V-specific info encoded in Rust types, so perhaps it should be something like rcuda? We could move the CUDA error code mapping out of cust into it as well 🤔
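To make the suggestion above concrete, here is a minimal sketch of what such shared named constants could look like. This is illustrative only (the `AddressSpace` newtype and constant names are assumptions, not the actual cuda_std API); the numeric values follow the NVVM IR address-space convention (0 = generic, 1 = global, 3 = shared, 4 = constant, 5 = local).

```rust
// Hypothetical shared-types sketch: named NVVM address spaces instead of
// bare `AddressSpace(4)` literals scattered through the codegen.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
pub struct AddressSpace(pub u32);

impl AddressSpace {
    pub const GENERIC: AddressSpace = AddressSpace(0);
    pub const GLOBAL: AddressSpace = AddressSpace(1);
    pub const SHARED: AddressSpace = AddressSpace(3);
    pub const CONSTANT: AddressSpace = AddressSpace(4);
    pub const LOCAL: AddressSpace = AddressSpace(5);
}

fn main() {
    // Reads much better at the call site than a bare `AddressSpace(4)`.
    let space = AddressSpace::CONSTANT;
    assert_eq!(space, AddressSpace(4));
    println!("constant memory is address space {}", space.0);
}
```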
We have GPUs sponsored by modal.com. I just haven't had a chance to get it all working (they didn't have simple ssh access, but it looks like maybe they do now? #202).
OK. Not sure if you saw: https://github.com/brandonros/ed25519-vanity-rs/blob/master/.github/workflows/ci.yaml. We can build on the runner (as you know and also have) and then run. I could help split the Vast script to not include the build step if needed. Let's merge if you're ready; this is awesome. Thanks again, hope you've enjoyed working together so far. I'll retest the smaller, simpler Xoshiro RNG issue after this; I have a feeling it might be fixed by this.
Yeah, we build containers as well: https://github.com/Rust-GPU/Rust-CUDA/blob/main/.github/workflows/container_images.yml. That is the idea for the modal stuff: build on Actions, push up to Modal, and run.
#218 for improving usage of constant memory space.
…st-GPU#217)

* Allow address spaces to propagate to LLVM

  This looks like it was code that wasn't deleted after the refactor in Rust-GPU@decda87

* Spill large statics from constant to global memory

  This isn't fully correct, as ideally we'd keep track of what we have put into constant memory and spill when it fills up, instead of only spilling when a static is big. But this is materially better than what is there (a runtime error). An argument can be made to just _always_ use global memory so we don't have to worry about getting the packing right. Fixes Rust-GPU#208. See also the debugging and discussion in Rust-GPU#216.

* Add `--use-constant-memory-space` flag, off by default

* Make it clear that `#[cuda_std::address_space(constant)]` still works
…ation)
Before this commit, mir-importer's static-translation path called
`ensure_zero_initializer`, which only accepted statics whose raw
bytes were all zero and silently discarded `alloc.provenance.ptrs`
(the side table of cross-static pointer fixups). The emitted PTX
contained a single zero-bodied pointer slot for the outer ref-static
and nothing at all for the inner data static — every
`pub static X: &T = &INNER` shape (curve25519-dalek's
`ED25519_BASEPOINT_TABLE`, sha2's K constants, k256's affine
generator, etc.) faulted on hardware with a clean null deref.
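The failing shape described above can be reduced to a two-line repro (names are illustrative; this mirrors the INNER/OUTER pair used in the PTX example later in this commit message):

```rust
// Minimal illustration of the static-through-reference shape: before the
// fix, the inner allocation's bytes and the pointer fixup linking OUTER
// to INNER were both dropped, so the kernel dereferenced a null pointer.
static INNER: [u64; 4] = [1, 2, 3, 4];
pub static OUTER: &[u64; 4] = &INNER;

fn main() {
    // On a correct backend, reading through OUTER must see INNER's bytes.
    assert_eq!(OUTER[0], 1);
    assert_eq!(OUTER[3], 4);
    println!("OUTER correctly points at INNER");
}
```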
Replaced with `compute_static_initializer` (extracts bytes +
relocations from the alloc) and `collect_reachable_statics` (walks
the transitive closure of referenced statics so every reachable
static gets emitted as its own MirGlobalAllocOp).
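The transitive-closure walk can be sketched as a plain BFS over static references. This is a simplified model, not the actual `collect_reachable_statics` implementation: `refs_of` stands in for the real "relocations of this allocation" lookup, and statics are modeled as string keys.

```rust
use std::collections::{HashSet, VecDeque};

// Hypothetical sketch: starting from the statics a kernel references
// directly, follow each static's own relocations until no new statics
// appear, so every reachable static gets its own emitted global.
fn collect_reachable_statics(
    roots: &[&'static str],
    refs_of: impl Fn(&str) -> Vec<&'static str>,
) -> Vec<&'static str> {
    let mut seen: HashSet<&str> = HashSet::new();
    let mut queue: VecDeque<&'static str> = roots.iter().copied().collect();
    let mut out = Vec::new();
    while let Some(s) = queue.pop_front() {
        if seen.insert(s) {
            out.push(s);
            queue.extend(refs_of(s)); // enqueue statics this one points at
        }
    }
    out
}

fn main() {
    // OUTER -> INNER, INNER -> (nothing): both must be emitted.
    let reachable = collect_reachable_statics(&["OUTER"], |s| {
        if s == "OUTER" { vec!["INNER"] } else { vec![] }
    });
    assert_eq!(reachable, vec!["OUTER", "INNER"]);
}
```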
Plumbing:
* `MirGlobalAllocOp` gains `initializer_bytes` (hex-encoded body)
and `initializer_relocations` (`OFF:KEY,OFF:KEY` cross-static
refs) attributes. `crates/dialect-mir/src/ops/memory.rs`.
* `GlobalOp` mirrors them as `llvm_initializer_bytes` /
`llvm_initializer_relocations`, plus an `llvm_global_source_key`
sidecar so the exporter can resolve source-level static names
back to the synthetic `__device_global_N` symbol. mir-lower
forwards all three when creating the LLVM global.
* `export_global` in `crates/dialect-llvm/src/export.rs` now emits:
- `[N x i8] c"\01\00..."` when there are no relocations (the
plain bytes case, e.g. `static INNER: [u64; 4] = [...]`)
- `<{ ... }> <{ ... }>` packed-struct interleaving byte runs
with `addrspacecast (ptr addrspace(1) @target to ptr)` slots
when relocations are present (the
`pub static X: &T = &INNER` case)
Falls back to `zeroinitializer` only when both attrs are absent.
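The `OFF:KEY,OFF:KEY` relocation attribute mentioned above pairs a byte offset inside the initializer with the source key of the static it points at. A hedged sketch of the round-trip (the function names and error handling here are illustrative, not the actual dialect code):

```rust
// Encode a list of (byte offset, target static key) pairs into the
// comma-separated `OFF:KEY` attribute string, and parse it back.
fn encode_relocations(relocs: &[(usize, &str)]) -> String {
    relocs
        .iter()
        .map(|(off, key)| format!("{off}:{key}"))
        .collect::<Vec<_>>()
        .join(",")
}

fn decode_relocations(s: &str) -> Vec<(usize, String)> {
    s.split(',')
        .filter(|e| !e.is_empty())
        .map(|e| {
            let (off, key) = e.split_once(':').expect("expected OFF:KEY entry");
            (off.parse().expect("numeric offset"), key.to_string())
        })
        .collect()
}

fn main() {
    // A ref-static whose single pointer slot (at offset 0) targets INNER.
    let encoded = encode_relocations(&[(0, "INNER")]);
    assert_eq!(encoded, "0:INNER");
    assert_eq!(decode_relocations(&encoded), vec![(0, "INNER".to_string())]);
}
```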
A new pre-pass `build_source_key_map` walks all GlobalOps to
build a `source_key -> llvm_name` map before any export starts.
Secondary (transitively-reachable) statics' MirGlobalAllocOps are
inserted at the front of the current block to keep the kernel's
own terminator the last op in its block — they're hoisted to
module scope by mir-lower regardless of where they sit in the
kernel's MIR.
Post-fix PTX for `static INNER: [u64; 4] = [1,2,3,4]` +
`static OUTER: &[u64; 4] = &INNER`:
.visible .global .align 8 .b8 __device_global_0[32] = {1,0,0,...4,...};
.visible .global .align 8 .u64 __device_global_1[1] = {generic(__device_global_0)};
Sweep: 88 pass, 6 fail (the README-documented codegen-time
known-failures). No regressions.
This is the cuda-oxide counterpart to Rust-CUDA PR 217's static-
placement work (cf. Rust-GPU/rust-cuda#217),
which surfaced from the same downstream consumer
(`~/vanity-miner-rs/`'s ed25519 path).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
I've decided to default to not using constant memory, with an opt-in flag. Later, when we are smarter, we can flip the flag by default and/or make it a no-op. One can still annotate code with `#[cuda_std::address_space(constant)]` to place it manually; this flag only affects automatic placement by the codegen backend.

Using this flag / turning on constant memory can blow up, as the constant memory placement logic isn't fully correct. Ideally we'd keep track of what we have put into constant memory and spill when it fills up, instead of only spilling when a static is too large on its own. We'll also probably want some packing strategy controlled by the user. For example, if you have one large static and many small ones, you might want the small ones to all be in constant memory, or just the big one, depending on your workload. We need some design work around this, and the design shouldn't require code to be annotated, to support third-party non-GPU-aware libraries.
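The "track what's committed and spill when full" bookkeeping described above could be as simple as a running byte counter. This is an illustrative sketch under stated assumptions, not the backend's actual logic: the 64 KiB budget matches the usual CUDA constant-bank size, and the greedy first-fit policy is a placeholder for the real packing strategy the text says still needs design work.

```rust
// Illustrative tracker: commit statics to constant memory until the bank
// would overflow, then spill to global memory, instead of only spilling
// statics that are individually large.
const CONSTANT_BANK_BYTES: usize = 64 * 1024;

#[derive(Debug, PartialEq)]
enum Placement {
    Constant,
    Global,
}

struct ConstantMemoryTracker {
    used: usize,
}

impl ConstantMemoryTracker {
    fn new() -> Self {
        Self { used: 0 }
    }

    fn place(&mut self, size: usize) -> Placement {
        if self.used + size <= CONSTANT_BANK_BYTES {
            self.used += size;
            Placement::Constant
        } else {
            Placement::Global // spill: the bank is (or would be) full
        }
    }
}

fn main() {
    let mut tracker = ConstantMemoryTracker::new();
    assert_eq!(tracker.place(60 * 1024), Placement::Constant);
    // This 8 KiB static no longer fits even though it is small on its own.
    assert_eq!(tracker.place(8 * 1024), Placement::Global);
}
```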
But this is materially better than what is there (a runtime error).
Fixes #208.
See also the debugging and discussion in
#216