Problem statement
The BLAS level-1 `axpby` kernel does not yet have a CuTe backend implementation in this project. In the README BLAS table, the CuTe column for `axpby` is empty, preventing complete cross-backend coverage for this fundamental fused vector update.
Without a CuTe `axpby` kernel:
- users cannot see how $y = \alpha x + \beta y$ maps onto CuTe primitives and memory abstractions,
- there is no CuTe baseline for performance comparison against the PyTorch and Triton `axpby` kernels,
- higher-level CuTe-based examples lack a standard `axpby` building block.
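For reference, the fused update computes, elementwise, $y_i \leftarrow \alpha x_i + \beta y_i$, with both scalings applied in a single pass over the vectors. A minimal host-side sketch of these semantics (the name `axpby_reference` is illustrative; the project's actual reference is the Python implementation):

```cpp
#include <cstddef>

// Reference semantics of axpby: y[i] = alpha * x[i] + beta * y[i].
// y is updated in place in one fused pass.
void axpby_reference(float alpha, const float* x, float beta, float* y,
                     std::size_t n) {
  for (std::size_t i = 0; i < n; ++i) {
    y[i] = alpha * x[i] + beta * y[i];
  }
}
```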
Proposed solution
Implement an `axpby` kernel using CuTe (CUTLASS/CuTe-style) constructs, matching the mathematical semantics of the Python reference and fitting the project’s backend conventions.
Concretely:
- Add a CuTe-based `axpby` kernel in the appropriate CuTe backend directory (once established in the codebase); a device-side sketch follows after this list.
- Implement the operation $y = \alpha x + \beta y$ for 1D vectors with attention to memory access patterns and performance.
- Mirror the public API structure of other backends so callers can dispatch to CuTe `axpby` in a uniform way.
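A minimal device-side sketch in the style of the CuTe tutorials (the kernel name, template parameters, and one-element-per-thread partitioning are assumptions, not a settled design):

```cpp
#include <cute/tensor.hpp>

// Hypothetical CuTe axpby kernel: computes y = alpha * x + beta * y.
// TensorX / TensorY are rank-1 cute::Tensor views over global memory,
// passed by value (they are lightweight pointer + layout wrappers).
template <class TensorX, class TensorY, class T>
__global__ void axpby_kernel(T alpha, TensorX x, T beta, TensorY y) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < cute::size(y)) {              // guard the tail block
    y(i) = alpha * x(i) + beta * y(i);  // fused read-modify-write on y
  }
}
```

A tuned version would likely partition the tensors per thread block and use vectorized copies for wide, coalesced accesses; the sketch keeps one element per thread for clarity.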
Alternatives considered
Alternatives such as omitting a CuTe `axpby` kernel or relying on other backends for this operation would:
- reduce the educational value of comparing CuTe with PyTorch and Triton on a fused vector update,
- leave the CuTe column for `axpby` in the README incomplete,
- limit CuTe’s role as a first-class backend in `kernel-course`.
Implementation details
- Decide the precise file layout and build integration for CuTe kernels in the repository.
- Implement the `axpby` kernel using CuTe primitives that are idiomatic for 1D vector operations.
- Ensure numerical equivalence with the Python reference and consistency with other backend semantics.
- Integrate with future tests and benchmarks for `axpby`; a host-side dispatch sketch follows below.
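As a sketch of the host-side dispatch that would mirror the other backends (this assumes the `axpby_kernel` sketched above; the entry-point name and launch configuration are illustrative):

```cpp
#include <cute/tensor.hpp>

using namespace cute;

// Hypothetical host-side entry point mirroring the other backends' call
// shape; d_x and d_y are device pointers to vectors of length n.
void axpby(float alpha, const float* d_x, float beta, float* d_y, int n) {
  // Wrap the raw pointers in rank-1 CuTe tensors with a contiguous layout.
  Tensor x = make_tensor(make_gmem_ptr(d_x), make_layout(make_shape(n)));
  Tensor y = make_tensor(make_gmem_ptr(d_y), make_layout(make_shape(n)));

  int threads = 256;
  int blocks  = (n + threads - 1) / threads;
  axpby_kernel<<<blocks, threads>>>(alpha, x, beta, y);
}
```

Numerical equivalence can then be checked by copying `d_y` back after the launch and comparing against the Python reference on the same inputs.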
Use case
The CuTe `axpby` kernel will:
- illustrate how a fused BLAS-1 operation is realized in CuTe,
- enable detailed performance comparisons and tuning across backends,
- serve as a foundational building block for more complex CuTe-based kernels.
Related work
- CuTe/CUTLASS examples of vector and fused operations.
- BLAS `axpby` in other GPU math libraries.
Additional context
This issue complements the `axpby` Python, PyTorch, and Triton feature requests and helps complete the CuTe column in the README BLAS table.