Conversation

@Sa4dUs
Contributor

@Sa4dUs Sa4dUs commented Oct 21, 2025

This PR implements the minimal mechanisms required to run a small subset of arbitrary offload kernels without relying on hardcoded names or metadata.

  • offload(kernel, (..args)): an intrinsic that generates the necessary host-side LLVM-IR code.
  • rustc_offload_kernel: a builtin attribute that marks device kernels to be handled appropriately.

Example usage (pseudocode):

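// Host-side wrapper: the intrinsic generates the host LLVM-IR that launches `kernel_1`.
fn kernel(x: *mut [f64; 128]) {
    core::intrinsics::offload(kernel_1, (x,))
}

// Host build (Linux in this example): only the kernel's declaration is visible.
#[cfg(target_os = "linux")]
extern "C" {
    pub fn kernel_1(x: *mut [f64; 128]);
}

// Device build: the kernel body, marked so the compiler handles it as an offload kernel.
#[cfg(not(target_os = "linux"))]
#[rustc_offload_kernel]
extern "gpu-kernel" fn kernel_1(x: *mut [f64; 128]) {
    unsafe { (*x)[0] = 21.0 };
}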

@rustbot rustbot added the A-LLVM, S-waiting-on-author, T-compiler, and T-libs labels Oct 21, 2025
@ZuseZ4 ZuseZ4 self-assigned this Oct 21, 2025
@Sa4dUs Sa4dUs force-pushed the offload-intrinsic branch from 9118683 to 23722aa Compare October 21, 2025 19:45
@ZuseZ4 ZuseZ4 added the F-gpu_offload `#![feature(gpu_offload)]` label Oct 22, 2025
}

pub fn from_ty<'tcx>(tcx: TyCtxt<'tcx>, ty: Ty<'tcx>) -> Self {
OffloadMetadata { payload_size: get_payload_size(tcx, ty), mode: TransferKind::Both }
Member

If you already have the code here, I would add a small check: & or by-value implies mode ToGPU, while &mut implies Both.

In the future we would hope to analyze the & or by-value case further: if we never read from the argument before writing to it, we could use a new mode 4, which allocates directly on the GPU.
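
A minimal sketch of the suggested check, under stated assumptions: `ArgKind` is a hypothetical stand-in for inspecting the real `Ty<'tcx>`, and `TransferKind` here models only the two modes named in this comment (the `ToGpu` spelling is an assumption). It is not the PR's implementation.

```rust
/// Hypothetical stand-in for how a kernel argument is passed.
/// The real code would inspect a `Ty<'tcx>`; this enum only models
/// the three cases mentioned in the review comment.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ArgKind {
    SharedRef, // `&T`
    MutRef,    // `&mut T`
    ByValue,   // `T`
}

/// Simplified transfer modes (assumed names, not the compiler's full enum).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum TransferKind {
    /// Copy host -> device only.
    ToGpu,
    /// Copy host -> device before the launch and device -> host afterwards.
    Both,
}

/// Pick the transfer mode from the way the argument is passed.
fn transfer_kind_for(arg: ArgKind) -> TransferKind {
    match arg {
        // The kernel cannot write through these, so nothing needs copying back.
        ArgKind::SharedRef | ArgKind::ByValue => TransferKind::ToGpu,
        // The kernel may mutate the data, so it has to travel both ways.
        ArgKind::MutRef => TransferKind::Both,
    }
}
```

With this shape, `from_ty` could pass the result of such a check as `mode` instead of always using `TransferKind::Both`.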

@ZuseZ4 ZuseZ4 mentioned this pull request Oct 24, 2025
5 tasks
@bors
Collaborator

bors commented Nov 5, 2025

☔ The latest upstream changes (presumably #148507) made this pull request unmergeable. Please resolve the merge conflicts.

@Sa4dUs Sa4dUs force-pushed the offload-intrinsic branch from e0fd7be to 97a8e96 Compare November 7, 2025 15:37
@bors
Collaborator

bors commented Nov 9, 2025

☔ The latest upstream changes (presumably #148721) made this pull request unmergeable. Please resolve the merge conflicts.

@rustbot rustbot added the A-attributes Area: Attributes (`#[…]`, `#![…]`) label Nov 11, 2025
@Sa4dUs Sa4dUs marked this pull request as ready for review November 16, 2025 10:27
@rustbot
Collaborator

rustbot commented Nov 16, 2025

Some changes occurred to the intrinsics. Make sure the CTFE / Miri interpreter
gets adapted for the changes, if necessary.

cc @rust-lang/miri, @RalfJung, @oli-obk, @lcnr

@rustbot rustbot added the S-waiting-on-review label and removed the S-waiting-on-author label Nov 16, 2025
@rustbot
Collaborator

rustbot commented Nov 17, 2025

The rustc-dev-guide subtree was changed. If this PR only touches the dev guide, consider submitting a PR directly to rust-lang/rustc-dev-guide; otherwise, thank you for updating the dev guide with your changes.

cc @BoxyUwU, @jieyouxu, @Kobzol, @tshepang

@rustbot rustbot added the A-rustc-dev-guide Area: rustc-dev-guide label Nov 17, 2025
Member

@ZuseZ4 ZuseZ4 left a comment

Oh, I wanted to submit these yesterday.

let mut builder = SBuilder::build(cx, kernel_call_bb);

let types = cx.func_params_types(cx.get_type_of_global(called));
let mut builder = SBuilder::build(cx, bb);
Member

Can you add a FIXME, so we can get rid of this? I don't think it should be a permanent solution. I'm also somewhat confused about why they are unused.

Contributor Author

I had a similar issue with autodiff. I think that, because the intrinsic is lowered relatively early in the compilation pipeline, it goes through more LLVM opt passes. Since nothing yet tells LLVM that these globals will be used by the offloading feature, it internalizes them (I tried to prevent that by changing the linkage, but they always appeared as internal) and then removes them as unused internal variables.

Contributor Author

@Sa4dUs Sa4dUs Nov 18, 2025

My understanding is that in the first version of the codegen, where this is done during fat LTO, LLVM is already aware of what will actually happen and doesn't modify them.

Member

We launch all LLVM passes at once via an LLVM PassManager, so it shouldn't change. But as long as it works, we can postpone the investigation until later; the PR is enough of an improvement.

&target_symbol,
);

let bb = unsafe { llvm::LLVMGetInsertBlock(bx.llbuilder) };
Member

It seems like you get a bb from a builder, just to then create a builder out of the bb inside of gen_call_handling, right? Can you directly pass the builder?

Contributor Author

Right now I can't pass the builder directly, because generating globals from bx.cx forces 'tcx to outlive 'll (and honestly I haven't been able to fix that without changing too much code). So I added a hacky workaround until we know where to generate the globals, and marked it with a FIXME.

Contributor Author

If you think I should leave that as a TODO, let me know.

// Step 0)
// %struct.__tgt_bin_desc = type { i32, ptr, ptr, ptr }
// %6 = alloca %struct.__tgt_bin_desc, align 8
unsafe { llvm::LLVMRustPositionBuilderPastAllocas(builder.llbuilder, main_fn) };
Member

Now that you create / reuse a builder, are you sure that the tgt_bin_desc would get an alloca in the right position in the first bb (and that it's not just working in the test by coincidence)?

E.g.

fn main() {
    if condition {
        // ...
    } else {
        // ...
    }
    core::intrinsics::offload(kernel, args);
}

If the builder is positioned at the intrinsic call, the alloca wouldn't land where it should (at the beginning of the first bb).

Contributor Author

The idea (or at least how I'd imagined it) is that when the future macro is expanded, the wrapper function should always contain only the intrinsic, so we can generate all the logic sequentially.

If you mean that it needs to be at the beginning of the first bb of the program, just let me know and I'll change that.

Member

Yes, allocas should all be together at the beginning, so moving the builder via LLVMRustPositionBuilderPastAllocas (and putting the builder back into the old place) would be the way to go.
It might work if you put them elsewhere, but LLVM opt passes don't really expect that, so we're likely to miss out on some optimizations.

I agree that we should later distinguish better between the kernel launch intrinsic and the globals that are somewhat independent of the number of kernel launches.
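
To illustrate the "move past the allocas, emit, then move back" pattern discussed above, here is a toy model. It makes no LLVM calls: `Inst`, `Function`, and `Builder` are hypothetical simplifications, and `position_past_allocas` only mimics what `LLVMRustPositionBuilderPastAllocas` is used for in this thread.

```rust
/// Toy instruction set; real LLVM IR has far more.
#[derive(Debug, Clone, PartialEq, Eq)]
enum Inst {
    Alloca(String),
    Other(String),
}

/// Toy function: a list of basic blocks, each a list of instructions.
/// Block 0 is the entry block.
struct Function {
    blocks: Vec<Vec<Inst>>,
}

/// Toy builder holding an insertion point (block index, instruction index).
struct Builder {
    block: usize,
    index: usize,
}

impl Builder {
    /// Insert at the current position and advance past the new instruction.
    fn insert(&mut self, f: &mut Function, inst: Inst) {
        f.blocks[self.block].insert(self.index, inst);
        self.index += 1;
    }

    /// Move to just after the leading allocas of the entry block.
    fn position_past_allocas(&mut self, f: &Function) {
        self.block = 0;
        self.index = f.blocks[0]
            .iter()
            .take_while(|i| matches!(i, Inst::Alloca(_)))
            .count();
    }

    /// Emit an alloca in the entry block, then restore the old position.
    fn alloca_in_entry(&mut self, f: &mut Function, name: &str) {
        let (saved_block, saved_index) = (self.block, self.index);
        self.position_past_allocas(f);
        let inserted_at = self.index;
        self.insert(f, Inst::Alloca(name.to_string()));
        // If the saved position was in the entry block at or after the
        // insertion point, the instruction it pointed at moved down by one.
        let shift = usize::from(saved_block == 0 && saved_index >= inserted_at);
        self.block = saved_block;
        self.index = saved_index + shift;
    }
}
```

With a builder sitting in a later block (for example, at the intrinsic call), `alloca_in_entry` places the new alloca next to the existing ones in block 0, and subsequent inserts continue at the original position.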

let a5 = builder.direct_alloca(tgt_kernel_decl, Align::EIGHT, "kernel_args");

// Step 1)
unsafe { llvm::LLVMRustPositionBefore(builder.llbuilder, kernel_call) };
Member

Same question for the position of the memset without repositioning the builder.

);
}

fn codegen_offload<'ll, 'tcx>(
Member

Can you add some docs to this function?

@bors
Collaborator

bors commented Nov 18, 2025

☔ The latest upstream changes (presumably #148151) made this pull request unmergeable. Please resolve the merge conflicts.

@rustbot
Collaborator

rustbot commented Nov 22, 2025

This PR was rebased onto a different main commit. Here's a range-diff highlighting what actually changed.

Rebasing is a normal part of keeping PRs up to date, so no action is needed—this note is just to help reviewers.

@Sa4dUs Sa4dUs force-pushed the offload-intrinsic branch 2 times, most recently from c39e4e5 to ce5970c Compare November 22, 2025 18:19
@ZuseZ4
Member

ZuseZ4 commented Nov 25, 2025

This removes a good amount of the hacks from my first MVP; further improvements can land in a follow-up PR.
Verified to work on an MI 250X.

@bors r+

@bors
Collaborator

bors commented Nov 25, 2025

📌 Commit f39ec47 has been approved by ZuseZ4

It is now in the queue for this repository.

@bors bors added the S-waiting-on-bors label and removed the S-waiting-on-review label Nov 25, 2025
jhpratt added a commit to jhpratt/rust that referenced this pull request Nov 26, 2025
Offload intrinsic

bors added a commit that referenced this pull request Nov 26, 2025
Rollup of 9 pull requests

Successful merges:

 - #147936 (Offload intrinsic)
 - #148358 (Fix some issues around `rustc_public`)
 - #148452 (Mangle symbols with a mangled name close to PDB limits with v0 instead of legacy mangling to avoid linker errors)
 - #148751 (Build gnullvm toolchains on Windows natively)
 - #148951 (rustc_target: aarch64: Remove deprecated FEAT_TME)
 - #149173 (Use rust rather than LLVM target features in the target spec)
 - #149307 (Deny const auto traits)
 - #149312 (Mark riscv64gc-unknown-linux-musl as tier 2 target)
 - #149341 (Add `Copy` to some AST enums.)

r? `@ghost`
`@rustbot` modify labels: rollup
Zalathar added a commit to Zalathar/rust that referenced this pull request Nov 26, 2025
Offload intrinsic

bors added a commit that referenced this pull request Nov 26, 2025
Rollup of 12 pull requests

Successful merges:

 - #147936 (Offload intrinsic)
 - #148358 (Fix some issues around `rustc_public`)
 - #148452 (Mangle symbols with a mangled name close to PDB limits with v0 instead of legacy mangling to avoid linker errors)
 - #148751 (Build gnullvm toolchains on Windows natively)
 - #148951 (rustc_target: aarch64: Remove deprecated FEAT_TME)
 - #149149 ([rustdoc] misc search index cleanups)
 - #149173 (Use rust rather than LLVM target features in the target spec)
 - #149307 (Deny const auto traits)
 - #149312 (Mark riscv64gc-unknown-linux-musl as tier 2 target)
 - #149317 (Handle inline asm in has_ffi_unwind_calls)
 - #149326 (Remove unused `Clone` derive on `DelayedLint`)
 - #149341 (Add `Copy` to some AST enums.)

r? `@ghost`
`@rustbot` modify labels: rollup
@bors bors merged commit 2b150f2 into rust-lang:main Nov 26, 2025
11 checks passed
@rustbot rustbot added this to the 1.93.0 milestone Nov 26, 2025
rust-timer added a commit that referenced this pull request Nov 26, 2025
Rollup merge of #147936 - Sa4dUs:offload-intrinsic, r=ZuseZ4

Offload intrinsic

@Zalathar
Member

Perf results from rollup:

Some regressions in the large-workspace secondary benchmark, perhaps due to the additional metadata.

Comment on lines 707 to 708
// For now we only support up to 10 kernels named kernel_0 ... kernel_9, a follow-up PR is
// introducing a proper offload intrinsic to solve this limitation.
Member

Looks like this comment is now outdated as of this PR.

Member

Thanks. We have a few follow-up PRs where I'll add it.

github-actions bot pushed a commit to rust-lang/rustc-dev-guide that referenced this pull request Nov 27, 2025
Rollup of 12 pull requests

Successful merges:

 - rust-lang/rust#147936 (Offload intrinsic)
 - rust-lang/rust#148358 (Fix some issues around `rustc_public`)
 - rust-lang/rust#148452 (Mangle symbols with a mangled name close to PDB limits with v0 instead of legacy mangling to avoid linker errors)
 - rust-lang/rust#148751 (Build gnullvm toolchains on Windows natively)
 - rust-lang/rust#148951 (rustc_target: aarch64: Remove deprecated FEAT_TME)
 - rust-lang/rust#149149 ([rustdoc] misc search index cleanups)
 - rust-lang/rust#149173 (Use rust rather than LLVM target features in the target spec)
 - rust-lang/rust#149307 (Deny const auto traits)
 - rust-lang/rust#149312 (Mark riscv64gc-unknown-linux-musl as tier 2 target)
 - rust-lang/rust#149317 (Handle inline asm in has_ffi_unwind_calls)
 - rust-lang/rust#149326 (Remove unused `Clone` derive on `DelayedLint`)
 - rust-lang/rust#149341 (Add `Copy` to some AST enums.)

r? `@ghost`
`@rustbot` modify labels: rollup
@ZuseZ4
Member

ZuseZ4 commented Nov 27, 2025

@Zalathar We add metadata and do work if and only if you set the -Zoffload=Enable flag, which isn't done in that benchmark:
if cgcx.target_is_like_gpu && config.offload.contains(&config::Offload::Enable)

We also introduce one more intrinsic, so I guess the match on intrinsics in compiler/rustc_codegen_llvm/src/intrinsic.rs gets slightly larger. Maybe large-workspace uses a lot of intrinsics, so the slightly larger code size for that match causes a perf impact?
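
The gating condition quoted in this comment can be modeled in isolation as follows. This is a sketch under assumptions: `Config` and `Offload` are trimmed-down, hypothetical stand-ins for the compiler's types, and `should_run_offload` is an illustrative helper name, not a function from the PR.

```rust
/// Simplified stand-in for the `-Zoffload` options; only `Enable`
/// appears in this thread.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Offload {
    Enable,
}

/// Hypothetical, trimmed-down config carrying only the field the check reads.
struct Config {
    offload: Vec<Offload>,
}

/// Mirrors the quoted condition: offload work happens only for GPU-like
/// targets and only when `-Zoffload=Enable` was requested.
fn should_run_offload(target_is_like_gpu: bool, config: &Config) -> bool {
    target_is_like_gpu && config.offload.contains(&Offload::Enable)
}
```

A build that never passes the flag has an empty `offload` list, so the check is false and none of the offload-specific work runs.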

Labels

A-attributes, A-LLVM, A-rustc-dev-guide, F-gpu_offload, S-waiting-on-bors, T-compiler, T-libs

7 participants