Offload intrinsic #147936
    }
    pub fn from_ty<'tcx>(tcx: TyCtxt<'tcx>, ty: Ty<'tcx>) -> Self {
        OffloadMetadata { payload_size: get_payload_size(tcx, ty), mode: TransferKind::Both }
If you already have the code here, I would add a small check for `&` or by-value arguments (which imply mode ToGPU) vs. `&mut` (which implies Both).
In the future we hope to analyze the `&` or by-value case further: if we never read from the argument before writing to it, we could use a new mode 4, which allocates directly on the GPU.
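The check suggested above could look roughly like the following standalone sketch. It does not use rustc's real `Ty` API: `ArgPassing` and `transfer_kind_for` are hypothetical names for illustration, and the `ToGpu` variant is an assumption that mirrors the "ToGPU" mode named in the comment (only `TransferKind::Both` appears in the diff).

```rust
/// How an argument reaches the kernel (hypothetical classification,
/// standing in for an inspection of the argument's `Ty`).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ArgPassing {
    SharedRef, // `&T`
    ByValue,   // `T`
    MutRef,    // `&mut T`
}

/// Direction in which data is moved. `Both` is taken from the diff;
/// `ToGpu` is an assumed name for the "ToGPU" mode from the review.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum TransferKind {
    ToGpu,
    Both,
}

/// Read-only or by-value arguments only need a host-to-device copy;
/// mutable references must also be copied back after the kernel runs.
fn transfer_kind_for(arg: ArgPassing) -> TransferKind {
    match arg {
        ArgPassing::SharedRef | ArgPassing::ByValue => TransferKind::ToGpu,
        ArgPassing::MutRef => TransferKind::Both,
    }
}
```

The "mode 4" idea (allocating directly on the GPU when the argument is never read before being written) would need an extra analysis and is not modeled here.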
☔ The latest upstream changes (presumably #148507) made this pull request unmergeable. Please resolve the merge conflicts.
☔ The latest upstream changes (presumably #148721) made this pull request unmergeable. Please resolve the merge conflicts.
The rustc-dev-guide subtree was changed. If this PR only touches the dev guide, consider submitting a PR directly to rust-lang/rustc-dev-guide; otherwise, thank you for updating the dev guide with your changes.
Oh, I wanted to submit these yesterday.
    let mut builder = SBuilder::build(cx, kernel_call_bb);
    let types = cx.func_params_types(cx.get_type_of_global(called));
    let mut builder = SBuilder::build(cx, bb);
Can you add a FIXME, so we can get rid of this? I don't think it should be a permanent solution. I'm also somewhat confused about why they are unused.
I had a similar issue with autodiff. I think that, because the intrinsic is lowered relatively early in the compilation pipeline, it goes through more LLVM opt passes. Since there isn't yet any information that these globals will be used by the offloading feature, LLVM internalizes them (I tried to prevent them from being optimized away by changing the linkage, but they always appeared as internal) and then removes them as unused internal variables.
As I understand it, in the first version of codegen, where this was done in fat LTO, LLVM was already aware of what would actually happen and didn't modify them.
We launch all LLVM passes at once via an LLVM PassManager, so it shouldn't change. But as long as it works, we can postpone investigations till later, the PR is enough of an improvement.
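To make the hypothesis in this thread concrete, here is a toy model of the two steps described above: internalizing symbols that are not exported, then dropping internal globals that nothing references. This is only a sketch of the reasoning, not LLVM's actual pass implementation; `Global`, `internalize`, and `remove_dead_globals` are hypothetical names, and whether this is the real cause is still open per the comment above.

```rust
use std::collections::HashSet;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Linkage {
    External,
    Internal,
}

#[derive(Debug, Clone)]
struct Global {
    name: &'static str,
    linkage: Linkage,
}

/// Step 1: every global that is not explicitly exported becomes internal.
fn internalize(globals: &mut [Global], exported: &HashSet<&str>) {
    for g in globals.iter_mut() {
        if !exported.contains(g.name) {
            g.linkage = Linkage::Internal;
        }
    }
}

/// Step 2: internal globals without any reference are dropped;
/// external globals are always kept because other modules may use them.
fn remove_dead_globals(globals: Vec<Global>, referenced: &HashSet<&str>) -> Vec<Global> {
    globals
        .into_iter()
        .filter(|g| g.linkage == Linkage::External || referenced.contains(g.name))
        .collect()
}
```

In this model, a global emitted for offloading that is neither exported nor referenced at the time the passes run disappears, which matches the symptom described in the thread.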
        &target_symbol,
    );
    let bb = unsafe { llvm::LLVMGetInsertBlock(bx.llbuilder) };
It seems like you get a bb from a builder, just to then create a builder out of the bb inside of gen_call_handling, right? Can you directly pass the builder?
Right now I can't pass the builder directly, because generating globals from `bx.cx` forces `'tcx` to outlive `'ll` (and honestly I haven't been able to fix that without changing too much code), so I added a hacky fix until we know where to generate the globals (I've added a FIXME).
If you think I should leave that as a TODO, let me know.
    // Step 0)
    // %struct.__tgt_bin_desc = type { i32, ptr, ptr, ptr }
    // %6 = alloca %struct.__tgt_bin_desc, align 8
    unsafe { llvm::LLVMRustPositionBuilderPastAllocas(builder.llbuilder, main_fn) };
Now that you create / reuse a builder, are you sure that the tgt_bin_desc would get an alloca in the right position in the first bb (and that it's not just working in the test by coincidence)?
E.g.
    fn main() {
        if condition {
        } else {
        }
        core::intrinsics::offload(args);
    }
If the builder is on the intrinsic, the alloca wouldn't land where it should (at the beginning).
The idea (or at least how I'd imagined it) is that when expanding the future macro, the wrapper function should always contain only the intrinsic, so we can generate all the logic sequentially.
If you mean that it needs to be at the beginning of the first bb of the program, just let me know and I'll change that.
Yes, allocas should all be together at the beginning, so moving the builder via LLVMRustPositionBuilderPastAllocas (and putting the builder back into the old place) would be the way to go.
It might work if you put them elsewhere, but LLVM opt passes don't really expect that, so we're likely to miss out on some optimizations.
I agree that we should later distinguish better between the kernel launch intrinsic and the globals that are somewhat independent of the number of kernel launches.
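The save / reposition / restore pattern described above can be sketched on a toy IR. This does not call LLVM or `LLVMRustPositionBuilderPastAllocas`; `Function`, `Builder`, `InsertPoint`, and `alloca_in_entry` are hypothetical names used only to show where the alloca should land and that the builder ends up back at its old position.

```rust
/// Toy instruction: either an alloca or anything else.
#[derive(Debug, Clone, PartialEq)]
enum Inst {
    Alloca(&'static str),
    Other(&'static str),
}

/// Toy function: a list of basic blocks; block 0 is the entry block.
struct Function {
    blocks: Vec<Vec<Inst>>,
}

/// Insertion point: a block index plus a position inside that block.
#[derive(Debug, Clone, Copy, PartialEq)]
struct InsertPoint {
    block: usize,
    index: usize,
}

struct Builder {
    at: InsertPoint,
}

impl Builder {
    /// Insert at the current position and advance past the new instruction.
    fn insert(&mut self, f: &mut Function, inst: Inst) {
        f.blocks[self.at.block].insert(self.at.index, inst);
        self.at.index += 1;
    }

    /// Emit an alloca in the entry block, directly after the existing
    /// allocas, then put the builder back where it was.
    fn alloca_in_entry(&mut self, f: &mut Function, name: &'static str) {
        let saved = self.at;
        let past_allocas = f.blocks[0]
            .iter()
            .take_while(|i| matches!(i, Inst::Alloca(_)))
            .count();
        self.at = InsertPoint { block: 0, index: past_allocas };
        self.insert(f, Inst::Alloca(name));
        // If the old position was in the entry block at or after the new
        // alloca, it shifted by one instruction.
        self.at = if saved.block == 0 && saved.index >= past_allocas {
            InsertPoint { block: 0, index: saved.index + 1 }
        } else {
            saved
        };
    }
}
```

With the builder positioned in a later block (for example at the intrinsic call after an `if`/`else`), `alloca_in_entry` still places the alloca in the entry block, and subsequent instructions such as the memset continue to be emitted at the original location.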
    let a5 = builder.direct_alloca(tgt_kernel_decl, Align::EIGHT, "kernel_args");
    // Step 1)
    unsafe { llvm::LLVMRustPositionBefore(builder.llbuilder, kernel_call) };
Same question for the position of the memset without repositioning the builder.
        );
    }
    fn codegen_offload<'ll, 'tcx>(
Can you add some docs to this function?
☔ The latest upstream changes (presumably #148151) made this pull request unmergeable. Please resolve the merge conflicts.
This PR was rebased onto a different main commit. Here's a range-diff highlighting what actually changed. Rebasing is a normal part of keeping PRs up to date, so no action is needed; this note is just to help reviewers.
This removes a good number of the hacks from my first MVP; further improvements can land in a follow-up PR. @bors r+
Offload intrinsic
This PR implements the minimal mechanisms required to run a small subset of arbitrary offload kernels without relying on hardcoded names or metadata.
- `offload(kernel, (..args))`: an intrinsic that generates the necessary host-side LLVM-IR code.
- `rustc_offload_kernel`: a builtin attribute that marks device kernels to be handled appropriately.
Example usage (pseudocode):
```rust
fn kernel(x: *mut [f64; 128]) {
core::intrinsics::offload(kernel_1, (x,))
}
#[cfg(target_os = "linux")]
extern "C" {
pub fn kernel_1(array_b: *mut [f64; 128]);
}
#[cfg(not(target_os = "linux"))]
#[rustc_offload_kernel]
extern "gpu-kernel" fn kernel_1(x: *mut [f64; 128]) {
unsafe { (*x)[0] = 21.0 };
}
```
Rollup of 9 pull requests
Successful merges:
- #147936 (Offload intrinsic)
- #148358 (Fix some issues around `rustc_public`)
- #148452 (Mangle symbols with a mangled name close to PDB limits with v0 instead of legacy mangling to avoid linker errors)
- #148751 (Build gnullvm toolchains on Windows natively)
- #148951 (rustc_target: aarch64: Remove deprecated FEAT_TME)
- #149173 (Use rust rather than LLVM target features in the target spec)
- #149307 (Deny const auto traits)
- #149312 (Mark riscv64gc-unknown-linux-musl as tier 2 target)
- #149341 (Add `Copy` to some AST enums.)
r? `@ghost`
`@rustbot` modify labels: rollup
Rollup of 12 pull requests
Successful merges:
- #147936 (Offload intrinsic)
- #148358 (Fix some issues around `rustc_public`)
- #148452 (Mangle symbols with a mangled name close to PDB limits with v0 instead of legacy mangling to avoid linker errors)
- #148751 (Build gnullvm toolchains on Windows natively)
- #148951 (rustc_target: aarch64: Remove deprecated FEAT_TME)
- #149149 ([rustdoc] misc search index cleanups)
- #149173 (Use rust rather than LLVM target features in the target spec)
- #149307 (Deny const auto traits)
- #149312 (Mark riscv64gc-unknown-linux-musl as tier 2 target)
- #149317 (Handle inline asm in has_ffi_unwind_calls)
- #149326 (Remove unused `Clone` derive on `DelayedLint`)
- #149341 (Add `Copy` to some AST enums.)
r? `@ghost`
`@rustbot` modify labels: rollup
Rollup merge of #147936 - Sa4dUs:offload-intrinsic, r=ZuseZ4
Perf results from rollup: Some regressions in the
    // For now we only support up to 10 kernels named kernel_0 ... kernel_9, a follow-up PR is
    // introducing a proper offload intrinsic to solve this limitation.
Looks like this comment is now outdated as of this PR.
Thanks. We have a few follow-up PRs where I'll add it.
@Zalathar We add metadata and do work if and only if you set We also introduce one more intrinsic, so I guess the match arm against intrinsics gets slightly larger in