Examples: Ported the AsyncAPI example of CUDA samples #297
Merged: LegNeato merged 11 commits into Rust-GPU:main from madhav-madhusoodanan:add_cuda_examples on Nov 12, 2025.
Changes shown below are from the first 5 of 11 commits.
All 11 commits are by madhav-madhusoodanan:

1. c3fffba feat: ported the AsyncAPI sample of CUDA examples
2. 4d3b181 fix: spelling errors
3. 3ac6cc5 chore: format code
4. 071df49 chore: move async_api example to the samples/introduction/ subdirectory
5. e422369 feat: add README.md for the samples/ subdirectory
6. 5b6d246 fix: replace manual increment operation with +=
7. 49fea30 chore: remove the drop-specific code, since Rust automatically drops …
8. cf47dcd chore: remove the clippy annotations from the kernel-side code
9. 00135f5 chore: make the context creation code more ergonomic by using `quick_…`
10. f016d39 fix: add SAFETY message to kernel and remove unnecessary imports
11. 98fc439 feat: update ci_windows.yml to update PATH to expose CUDA codegen bac…
**samples/README.md** (new file):

```markdown
# Rust-Cuda Samples

These are the Rust-Cuda ports of the samples from Nvidia's [cuda-samples](https://github.com/NVIDIA/cuda-samples/tree/master/Samples) repository.

1. Chapter 0: [Introduction](https://github.com/Rust-GPU/rust-cuda/samples/introduction)
```
**samples/introduction/README.md** (new file):

```markdown
# Chapter 0: Introduction

## [asyncAPI](https://github.com/Rust-GPU/rust-cuda/samples/introduction/async_api)

This example demonstrates two key capabilities of CUDA events: measuring GPU execution time and enabling concurrent CPU-GPU operations.

1. Events are recorded at specific points within a CUDA stream to mark the beginning and end of GPU operations.
2. Because CUDA stream operations execute asynchronously, the CPU remains free to perform other work while the GPU processes tasks (including memory transfers between host and device).
3. The CPU can query these events to check whether the GPU has finished its work, allowing coordination between the two processors without blocking the CPU.
```
**samples/introduction/async_api/Cargo.toml** (new file):

```toml
[package]
name = "async_api"
version = "0.1.0"
edition = "2024"

[dependencies]
cust = { path = "../../../crates/cust" }
nanorand = "0.7"

[build-dependencies]
cuda_builder = { workspace = true, default-features = false }
```
**samples/introduction/async_api/build.rs** (new file):

```rust
use std::env;
use std::path;

use cuda_builder::CudaBuilder;

fn main() {
    println!("cargo::rerun-if-changed=build.rs");
    println!("cargo::rerun-if-changed=kernels");

    let out_path = path::PathBuf::from(env::var("OUT_DIR").unwrap());
    let manifest_dir = path::PathBuf::from(env::var("CARGO_MANIFEST_DIR").unwrap());

    CudaBuilder::new(manifest_dir.join("kernels"))
        .copy_to(out_path.join("kernels.ptx"))
        .build()
        .unwrap();
}
```
**samples/introduction/async_api/kernels/Cargo.toml** (new file):

```toml
[package]
name = "async_api-kernels"
version = "0.1.0"
edition = "2024"

[dependencies]
cuda_std = { path = "../../../../crates/cuda_std" }

[lib]
crate-type = ["cdylib", "rlib"]
```
**samples/introduction/async_api/kernels/src/lib.rs** (new file):

```rust
use cuda_std::prelude::*;

#[kernel]
#[allow(improper_ctypes_definitions, clippy::missing_safety_doc)]
pub unsafe fn increment(g_data: *mut u32, inc_value: u32) {
    // This can also be obtained directly as
    //
    //     let idx: usize = cuda_std::thread::index() as usize;
    let idx: usize = (cuda_std::thread::block_dim().x * cuda_std::thread::block_idx().x
        + cuda_std::thread::thread_idx().x) as usize;

    let elem: &mut u32 = unsafe { &mut *g_data.add(idx) };
    *elem += inc_value;
}
```
**samples/introduction/async_api/src/main.rs** (new file):

```rust
use cust::context::Context;
use cust::device::Device;
use cust::event::{Event, EventFlags};
use cust::function::{BlockSize, GridSize};
use cust::memory::{AsyncCopyDestination, DeviceBuffer, LockedBuffer};
use cust::module::Module;
use cust::prelude::EventStatus;
use cust::stream::{Stream, StreamFlags};
use cust::{CudaFlags, launch};
use std::time::Instant;

static PTX: &str = include_str!(concat!(env!("OUT_DIR"), "/kernels.ptx"));

fn correct_output(data: &[u32], x: u32) -> bool {
    let not_matching_element = data.iter().enumerate().find(|&(_, &elem)| elem != x);

    match not_matching_element {
        Some((index, elem)) => println!("Error! data[{index}] = {elem}, ref = {x}"),
        None => println!("All elements of the array match the value!"),
    }

    not_matching_element.is_none()
}

fn main() -> Result<(), cust::error::CudaError> {
    cust::init(CudaFlags::empty()).expect("Couldn't initialize CUDA environment!");

    let device = Device::get_device(0).expect("Couldn't find a CUDA-capable device!");

    println!("Device Name: {}", device.name().unwrap());

    // Set up the context, load the module, and create a stream to run kernels in.
    let _ctx = Context::new(device).expect("Context couldn't be created!");
    let module = Module::from_ptx(PTX, &[]).expect("Module couldn't be init!");
    let increment = module
        .get_function("increment")
        .expect("Kernel function not found!");
    let stream = Stream::new(StreamFlags::NON_BLOCKING, None).expect("Stream couldn't be init!");

    const N: usize = 16 * 1024 * 1024;
    let value = 26;

    let blocks = BlockSize::xy(512, 1);
    let grids = GridSize::xy((N / (blocks.x as usize)).try_into().unwrap(), 1);

    let start_event = Event::new(EventFlags::DEFAULT)?;
    let stop_event = Event::new(EventFlags::DEFAULT)?;

    // Create buffers for the data on the host side.
    // These should ideally be page-locked for efficiency.
    let mut host_a = LockedBuffer::new(&0u32, N).expect("host array couldn't be initialized!");
    let mut device_a =
        DeviceBuffer::from_slice(&[u32::MAX; N]).expect("device array couldn't be initialized!");

    start_event
        .record(&stream)
        .expect("Failed to record start_event in the CUDA stream!");
    let start = Instant::now();

    // SAFETY: until the stop_event is triggered:
    // 1. `host_a` is not being modified.
    // 2. Neither `device_a` nor `host_a` is deallocated.
    // 3. Until `stop_event.query()` yields `EventStatus::Ready`, `device_a` is not
    //    involved in any operation other than those enqueued in the stream.
    unsafe {
        device_a
            .async_copy_from(&host_a, &stream)
            .expect("Could not copy from host to device!");
    }

    // SAFETY: number of threads * number of blocks = total number of elements,
    // so no thread indexes out of bounds.
    unsafe {
        let result = launch!(increment<<<grids, blocks, 0, stream>>>(
            device_a.as_device_ptr(),
            value
        ));
        result.expect("The `increment` kernel failed to launch!");
    }

    // SAFETY: until the stop_event is triggered:
    // 1. `device_a` is not being modified.
    // 2. Neither `device_a` nor `host_a` is deallocated.
    // 3. From this point, until `stop_event.query()` yields `EventStatus::Ready`,
    //    `host_a` is not involved in any other operation.
    unsafe {
        device_a
            .async_copy_to(&mut host_a, &stream)
            .expect("Could not copy from device to host!");
    }

    stop_event
        .record(&stream)
        .expect("Failed to record stop_event in the CUDA stream!");
    let cpu_time: u128 = start.elapsed().as_micros();

    let mut counter: u64 = 0;
    while stop_event.query() != Ok(EventStatus::Ready) {
        counter += 1;
    }

    let gpu_time: u128 = stop_event
        .elapsed(&start_event)
        .expect("Failed to calculate duration of GPU operations!")
        .as_micros();

    println!("Time spent executing by the GPU: {gpu_time} microseconds");
    println!("Time spent by CPU in CUDA calls: {cpu_time} microseconds");
    println!("CPU executed {counter} iterations while waiting for GPU to finish.");

    assert!(correct_output(host_a.as_slice(), value));

    // Synchronize the stream as a safety measure before tearing down.
    stream.synchronize().expect("Stream couldn't synchronize!");

    // Events and buffers can be safely dropped now.
    match Event::drop(start_event) {
        Ok(()) => println!("Successfully destroyed start_event"),
        Err((cuda_error, _event)) => {
            println!("Failed to destroy start_event: {cuda_error:?}");
        }
    }

    match Event::drop(stop_event) {
        Ok(()) => println!("Successfully destroyed stop_event"),
        Err((cuda_error, _event)) => {
            println!("Failed to destroy stop_event: {cuda_error:?}");
        }
    }

    DeviceBuffer::drop(device_a).expect("Couldn't drop device array!");
    LockedBuffer::drop(host_a).expect("Couldn't drop host array!");

    println!("test PASSED");
    Ok(())
}
```