
[FEATURE REQUEST] axpby CuTe kernel implementation #13

@LoserCheems

Description

Problem statement

The BLAS level-1 axpby kernel does not yet have a CuTe backend implementation in this project. In the README BLAS table, the CuTe column for axpby is empty, leaving cross-backend coverage incomplete for this fundamental fused vector update.

Without a CuTe axpby kernel:

  • users cannot see how $y = \alpha x + \beta y$ maps onto CuTe primitives and memory abstractions,
  • there is no CuTe baseline for performance comparison against PyTorch and Triton axpby,
  • higher-level CuTe-based examples lack a standard axpby building block.

Proposed solution

Implement an axpby kernel using CuTe (the tensor/layout library shipped with CUTLASS), matching the mathematical semantics of the Python reference and fitting the project’s backend conventions.

Concretely:

  • Add a CuTe-based axpby kernel in the appropriate CuTe backend directory (once established in the codebase).
  • Implement the operation $y = \alpha x + \beta y$ for 1D vectors with attention to memory access patterns and performance (a minimal sketch follows this list).
  • Mirror the public API structure of other backends so callers can dispatch to CuTe axpby in a uniform way.
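
A minimal sketch of what such a kernel could look like, assuming a grid-stride launch and a hypothetical name `axpby_cute_kernel` (the file layout and public API are still to be decided, per the implementation details below):

```cpp
// A minimal sketch, not the project's final API: the kernel name and launch
// scheme are assumptions. Requires the CUTLASS headers for CuTe.
#include <cute/tensor.hpp>

template <class T>
__global__ void axpby_cute_kernel(T alpha, T const* x, T beta, T* y, int n)
{
  using namespace cute;

  // View the raw device pointers as 1-D CuTe tensors over global memory;
  // the default layout for a 1-D shape is contiguous (stride 1).
  Tensor gX = make_tensor(make_gmem_ptr(x), make_shape(n));
  Tensor gY = make_tensor(make_gmem_ptr(y), make_shape(n));

  // Grid-stride loop: accesses stay coalesced and one launch covers any n.
  for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
       i += gridDim.x * blockDim.x) {
    gY(i) = alpha * gX(i) + beta * gY(i);
  }
}
```

A tuned version would likely tile the vectors with CuTe's `local_tile`/`local_partition` and use vectorized copies, but the elementwise form above already pins down the semantics.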

Alternatives considered

Alternatives such as omitting CuTe axpby or relying on other backends for this operation would:

  • reduce the educational value of comparing CuTe with PyTorch and Triton on a fused vector update,
  • leave the CuTe column incomplete for axpby in the README,
  • limit CuTe’s role as a first-class backend in kernel-course.

Implementation details

  • Decide the precise file layout and build integration for CuTe kernels in the repository.
  • Implement the axpby kernel using CuTe primitives that are idiomatic for 1D vector operations.
  • Ensure numerical equivalence with the Python reference and consistency with other backend semantics.
  • Integrate with future tests and benchmarks for axpby (a host-side verification sketch follows this list).
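
As a stand-in for those future tests, a host-side check could launch the kernel sketch above and compare against a scalar reference; everything here (sizes, tolerance handling, the `axpby_cute_kernel` name and launch configuration) is an assumption for illustration:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

int main()
{
  const int n = 1 << 20;
  const float alpha = 2.0f, beta = 0.5f;

  std::vector<float> hx(n), hy(n), ref(n);
  for (int i = 0; i < n; ++i) { hx[i] = 0.001f * i; hy[i] = 1.0f - 0.001f * i; }

  // Scalar reference, mirroring the Python definition: y = alpha*x + beta*y.
  for (int i = 0; i < n; ++i) ref[i] = alpha * hx[i] + beta * hy[i];

  float *dx, *dy;
  cudaMalloc(&dx, n * sizeof(float));
  cudaMalloc(&dy, n * sizeof(float));
  cudaMemcpy(dx, hx.data(), n * sizeof(float), cudaMemcpyHostToDevice);
  cudaMemcpy(dy, hy.data(), n * sizeof(float), cudaMemcpyHostToDevice);

  // Hypothetical launch configuration for the axpby_cute_kernel sketch above.
  const int threads = 256;
  const int blocks  = (n + threads - 1) / threads;
  axpby_cute_kernel<<<blocks, threads>>>(alpha, dx, beta, dy, n);
  cudaMemcpy(hy.data(), dy, n * sizeof(float), cudaMemcpyDeviceToHost);

  // Elementwise max-error check against the reference.
  float max_err = 0.0f;
  for (int i = 0; i < n; ++i)
    max_err = std::max(max_err, std::abs(hy[i] - ref[i]));
  std::printf("max abs error: %g\n", max_err);

  cudaFree(dx);
  cudaFree(dy);
  return 0;
}
```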

Use case

The CuTe axpby kernel will:

  • illustrate how a fused BLAS-1 operation is realized in CuTe,
  • enable detailed performance comparisons and tuning across backends,
  • serve as a foundational building block for more complex CuTe-based kernels.

Related work

  • CuTe/CUTLASS examples of vector and fused operations.
  • BLAS axpby in other GPU math libraries.

Additional context

This issue complements the axpby Python, PyTorch, and Triton feature requests and helps complete the CuTe column in the README BLAS table.
