[FEATURE REQUEST] geru CuTe kernel implementation #28

@LoserCheems

Problem statement

The BLAS level-2 geru kernel (general rank-1 update) does not yet have a CuTe backend implementation in this project. The README BLAS table has an empty CuTe cell for geru, so cross-backend coverage for this operation is incomplete.

Without a CuTe geru kernel:

  • users cannot learn how rank-1 updates are expressed using CuTe’s layout and tiling abstractions,
  • there is no CuTe performance baseline to compare against PyTorch and Triton geru implementations,
  • CuTe-based higher-level examples lack a standard rank-1 update primitive.

Proposed solution

Implement a CuTe-based geru kernel that matches the Python reference semantics and fits within the project’s backend structure.

Concretely:

  • Add a CuTe geru kernel in the appropriate CuTe backend directory, implementing $A \leftarrow A + \alpha x y^\top$, where $A$ is $m \times n$, $x$ has length $m$, and $y$ has length $n$.
  • Use CuTe primitives to describe matrix layout, vector access, and thread scheduling for rank-1 updates.
  • Align the public API with other backends to allow uniform dispatch.
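
For reference, the semantics the CuTe kernel would need to match can be sketched in dependency-free Python. This is only an illustration: the name `geru_reference` and the list-of-lists representation are assumptions made here, not the project's actual reference API.

```python
def geru_reference(A, x, y, alpha=1.0):
    """In-place rank-1 update A <- A + alpha * x * y^T (no conjugation).

    Hypothetical helper: A is a list of m rows, each of length n;
    x has length m and y has length n.
    """
    m, n = len(x), len(y)
    if len(A) != m or any(len(row) != n for row in A):
        raise ValueError("A must have shape (len(x), len(y))")
    for i in range(m):
        scaled = alpha * x[i]          # hoist alpha * x[i] out of the inner loop
        row = A[i]
        for j in range(n):
            row[j] += scaled * y[j]    # A[i][j] += alpha * x[i] * y[j]
    return A


A = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]
geru_reference(A, x=[1.0, 2.0], y=[1.0, 0.0, -1.0], alpha=2.0)
print(A)  # [[3.0, 2.0, 1.0], [8.0, 5.0, 2.0]]
```

Every element of `A` is read and written exactly once and the vectors are reused along rows and columns, which is why the operation is typically memory-bound and a useful showcase for layout and tiling choices.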

Alternatives considered

Alternatives such as omitting CuTe geru or reusing other backends would:

  • reduce the educational impact of comparing CuTe to PyTorch/Triton on BLAS-2 operations,
  • leave the CuTe column incomplete in the README BLAS table,
  • limit CuTe’s role as a first-class backend.

Implementation details

  • Establish file layout and build rules for CuTe kernels.
  • Implement geru using CuTe abstractions for rank-1 updates over 2D layouts.
  • Ensure numerical equivalence with the Python reference.
  • Integrate with planned tests and benchmarks for geru.
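
The tiling and equivalence points above can be prototyped on the host before any GPU code exists. The sketch below is plain Python, not CuTe: it walks a flat row-major buffer tile by tile, with each tile standing in for one thread block, and compares the result with an untiled loop. The function names, the tile sizes, and the `lda` padding are all hypothetical choices for illustration.

```python
import math


def geru_tiled(a, m, n, lda, x, y, alpha, tile_m=2, tile_n=4):
    """Rank-1 update on a flat row-major buffer, visited tile by tile.

    Element (i, j) lives at offset i * lda + j. Each (tile_i, tile_j)
    pair stands in for one thread block; clamping the loop bounds plays
    the role of predication on partial edge tiles.
    """
    for tile_i in range(math.ceil(m / tile_m)):
        for tile_j in range(math.ceil(n / tile_n)):
            i_end = min((tile_i + 1) * tile_m, m)
            j_end = min((tile_j + 1) * tile_n, n)
            for i in range(tile_i * tile_m, i_end):
                for j in range(tile_j * tile_n, j_end):
                    a[i * lda + j] += alpha * x[i] * y[j]
    return a


def geru_naive(a, m, n, lda, x, y, alpha):
    """Untiled baseline with the same layout, used as the reference."""
    for i in range(m):
        for j in range(n):
            a[i * lda + j] += alpha * x[i] * y[j]
    return a


# 3 x 5 matrix stored with padding (lda = 6); neither dimension divides the tile.
m, n, lda = 3, 5, 6
base = [float(k) for k in range(m * lda)]
x = [0.5, -1.0, 2.0]
y = [1.0, 2.0, 3.0, 4.0, 5.0]

tiled = geru_tiled(list(base), m, n, lda, x, y, alpha=0.25)
naive = geru_naive(list(base), m, n, lda, x, y, alpha=0.25)

# Tolerance-based comparison, as one would use against a GPU result.
assert all(math.isclose(t, r, rel_tol=1e-12, abs_tol=1e-12)
           for t, r in zip(tiled, naive))
# Padding elements (column index n in each row) must stay untouched.
assert all(tiled[i * lda + n] == base[i * lda + n] for i in range(m))
```

In an actual CuTe kernel the offset arithmetic and tile bounds would presumably be expressed through layouts (shape and stride pairs) and tilers rather than hand-written index math, but the edge-tile and padding checks shown here would carry over to the planned tests.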

Use case

The CuTe geru kernel will:

  • demonstrate rank-1 updates in CuTe,
  • enable detailed performance comparisons across backends,
  • serve as a building block for more complex CuTe-based kernels.

Related work

  • CuTe/CUTLASS examples of rank-1 updates.
  • Standard BLAS geru implementations.

Additional context

This issue complements the geru Python/PyTorch/Triton feature requests and contributes to full CuTe coverage of BLAS-2 operations.
