Commit ecaaefb

Add kernelretsnoop and threadhist tools for CUDA kernel profiling
- Introduced `kernelretsnoop`, an eBPF-based tool to trace CUDA kernel thread exit timestamps, providing insights into thread execution times and performance bottlenecks.
- Implemented `threadhist`, a tool to analyze per-thread execution counts in CUDA kernels, helping to identify load imbalances and optimize thread configurations.
- Added example CUDA application (`vec_add.cu`) for testing both tools, demonstrating their usage in real scenarios.
- Created README files for both tools, detailing their functionality, usage, and examples.
- Included necessary build files and configurations for compiling and running the tools.
1 parent 12a4f1e commit ecaaefb
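
The commit message describes `threadhist` as reporting per-thread execution counts to expose load imbalance. As a rough illustration of how such counts could be interpreted, here is a minimal Python sketch. It assumes the counts are available as a plain list with one integer per GPU thread; that input format, the function name `summarize_thread_counts`, and the sample numbers are all hypothetical and are not taken from the tool's actual output.

```python
from statistics import mean


def summarize_thread_counts(counts):
    """Summarize per-thread execution counts.

    `counts` is a hypothetical input: one integer per GPU thread,
    indexed by thread id. This is not threadhist's real output format.
    """
    if not counts:
        raise ValueError("counts must not be empty")

    avg = mean(counts)
    busiest = max(counts)
    # Threads that never executed the traced kernel code.
    idle = [tid for tid, c in enumerate(counts) if c == 0]
    # Imbalance ratio: 1.0 means every thread did the same amount of work;
    # larger values mean the busiest thread did more than the average.
    ratio = busiest / avg if avg > 0 else float("inf")
    return {"mean": avg, "max": busiest, "idle_threads": idle, "imbalance": ratio}


# Made-up counts for illustration: four busy threads and one idle thread.
sample = [10, 10, 10, 10, 0]
summary = summarize_thread_counts(sample)
print(summary)
```

A ratio well above 1.0, or a non-empty list of idle threads, would suggest that the launch configuration leaves some threads with little or no work.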

15 files changed: +441 -357 lines changed

example/gpu/README.md

Lines changed: 2 additions & 2 deletions
@@ -48,8 +48,8 @@ The GPU support is built on the `nv_attach_impl` system (`attach/nv_attach_impl/
 Complete working examples with full source code, build instructions, and READMEs are available on GitHub:

 - **[cuda-counter](https://github.com/eunomia-bpf/bpftime/tree/master/example/gpu/cuda-counter)**: Basic probe/retprobe with timing measurements
-- **[cuda-counter-gpu-array](https://github.com/eunomia-bpf/bpftime/tree/master/example/gpu/cuda-counter-gpu-array)**: Per-thread counters using GPU array maps
-- **[cuda-counter-gpu-ringbuf](https://github.com/eunomia-bpf/bpftime/tree/master/example/gpu/cuda-counter-gpu-ringbuf)**: Event streaming with ringbuf maps
+- **[kernelretsnoop](https://github.com/eunomia-bpf/bpftime/tree/master/example/gpu/kernelretsnoop)**: Captures per-thread exit timestamps to detect thread divergence, memory access patterns, and warp scheduling issues
+- **[threadhist](https://github.com/eunomia-bpf/bpftime/tree/master/example/gpu/threadhist)**: Per-thread execution histogram using GPU array maps to detect workload imbalance
 - **[rocm-counter](https://github.com/eunomia-bpf/bpftime/tree/master/example/gpu/rocm-counter)**: AMD GPU instrumentation (experimental)

 Each example includes CUDA/ROCm application source, eBPF probe programs, Makefile, and detailed usage instructions.

example/gpu/cuda-counter-gpu-array/README.md

Lines changed: 0 additions & 141 deletions
This file was deleted.

example/gpu/cuda-counter-gpu-ringbuf/README.md

Lines changed: 0 additions & 141 deletions
This file was deleted.

example/gpu/cuda-counter-gpu-ringbuf/.gitignore renamed to example/gpu/kernelretsnoop/.gitignore

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
-/cuda_probe
+/kernelretsnoop
 /.output
 /victim*
 /vec_add.cpp

example/gpu/cuda-counter-gpu-array/Makefile renamed to example/gpu/kernelretsnoop/Makefile

Lines changed: 1 addition & 1 deletion
@@ -21,7 +21,7 @@ INCLUDES := -I$(OUTPUT) -I../../../third_party/libbpf/include/uapi -I$(dir $(VML
 CFLAGS := -g -Wall
 ALL_LDFLAGS := $(LDFLAGS) $(EXTRA_LDFLAGS)

-APPS = cuda_probe # minimal minimal_legacy uprobe kprobe fentry usdt sockfilter tc ksyscall
+APPS = kernelretsnoop # minimal minimal_legacy uprobe kprobe fentry usdt sockfilter tc ksyscall

 CARGO ?= $(shell which cargo)
 ifeq ($(strip $(CARGO)),)
