
[FEATURE REQUEST] gemm CuTe kernel implementation #33

@LoserCheems

Description

Problem statement

The BLAS level-3 gemm kernel currently has no CuTe backend implementation in this project. The gemm row of the README BLAS table has an empty CuTe entry, even though GEMM is the primary workload that CuTe/CUTLASS is designed to optimize.

Without a CuTe gemm kernel:

  • users cannot see how GEMM is expressed using CuTe’s layout and tiling abstractions,
  • there is no CuTe GEMM baseline for performance comparison with PyTorch and Triton,
  • CuTe-based examples are missing the most important building block for many workloads.

Proposed solution

Implement a CuTe-based gemm kernel that matches the Python reference semantics and aligns with the project’s backend structure.

Concretely:

  • Add a CuTe gemm kernel in the appropriate CuTe backend directory, implementing $C = \alpha A B + \beta C$.
  • Use CuTe primitives to describe matrix layouts, threadblock tiling, and memory movement (a rough structural sketch follows this list).
  • Align the public API with other backends so callers can dispatch to CuTe GEMM uniformly.
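
As a rough illustration of the layout/tiling bullet above, here is a minimal structural sketch in the style of the CUTLASS CuTe tutorial kernels. The kernel name `gemm_cute_sketch`, the row-major layouts, and the 128x128x8 tile sizes are placeholder assumptions rather than a committed design, and the shared-memory copy / MMA pipeline is elided.

```cpp
#include <cute/tensor.hpp>

// Structural sketch only: wrap the global matrices in CuTe tensors and carve out
// per-threadblock tiles. Names, layouts, and tile sizes are illustrative assumptions.
template <class TA, class TB, class TC>
__global__ void gemm_cute_sketch(int M, int N, int K,
                                 TC alpha, TA const* A, TB const* B,
                                 TC beta,  TC* C)
{
    using namespace cute;

    // Global tensors with explicit shape/stride layouts:
    // A is (M, K) row-major, B is (K, N) row-major viewed as (N, K), C is (M, N) row-major.
    Tensor mA = make_tensor(make_gmem_ptr(A),
                            make_layout(make_shape(M, K), make_stride(K, Int<1>{})));
    Tensor mB = make_tensor(make_gmem_ptr(B),
                            make_layout(make_shape(N, K), make_stride(Int<1>{}, N)));
    Tensor mC = make_tensor(make_gmem_ptr(C),
                            make_layout(make_shape(M, N), make_stride(N, Int<1>{})));

    // Threadblock tile sizes (illustrative values only).
    auto bM = Int<128>{};
    auto bN = Int<128>{};
    auto bK = Int<8>{};

    // This block's tiles: gA is (bM, bK, k_tiles), gB is (bN, bK, k_tiles), gC is (bM, bN).
    Tensor gA = local_tile(mA, make_shape(bM, bK), make_coord(blockIdx.x, _));
    Tensor gB = local_tile(mB, make_shape(bN, bK), make_coord(blockIdx.y, _));
    Tensor gC = local_tile(mC, make_shape(bM, bN), make_coord(blockIdx.x, blockIdx.y));

    // A full kernel would stage gA/gB through shared memory, partition them across
    // threads, accumulate over the k_tiles mode, and write back alpha * acc + beta * gC.
}
```

Whether the final kernel uses this explicit per-mode tiler or the tutorial's projected `local_tile` form is an implementation detail to settle during review.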

Alternatives considered

Alternatives such as omitting CuTe gemm or relying on other backends would:

  • significantly reduce the educational and practical value of including CuTe as a backend,
  • leave the CuTe column incomplete in the README BLAS table for the most important BLAS-3 kernel,
  • limit opportunities to demonstrate high-performance GEMM implementation details in CuTe.

Implementation details

  • Establish file layout and build integration for CuTe kernels.
  • Implement GEMM using CuTe abstractions tuned for GPU execution, potentially leveraging CUTLASS patterns.
  • Ensure numerical equivalence with the Python reference and harmonize with PyTorch/Triton semantics (one possible check is sketched after this list).
  • Integrate with planned tests and benchmarks for GEMM.
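
For the numerical-equivalence item above, a test could compare the kernel output against a plain host-side reference loop. The helper names, tolerances, and row-major storage convention below are assumptions for illustration, not the project's test API.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Plain reference for C = alpha * A * B + beta * C with row-major A (MxK), B (KxN), C (MxN).
void gemm_reference(int M, int N, int K, float alpha,
                    const std::vector<float>& A, const std::vector<float>& B,
                    float beta, std::vector<float>& C)
{
    for (int m = 0; m < M; ++m) {
        for (int n = 0; n < N; ++n) {
            float acc = 0.0f;
            for (int k = 0; k < K; ++k)
                acc += A[m * K + k] * B[k * N + n];
            C[m * N + n] = alpha * acc + beta * C[m * N + n];
        }
    }
}

// Elementwise tolerance check between the reference result and the device result
// copied back to the host; tolerances would be tightened per dtype in real tests.
bool allclose(const std::vector<float>& ref, const std::vector<float>& got,
              float atol = 1e-3f, float rtol = 1e-3f)
{
    for (std::size_t i = 0; i < ref.size(); ++i)
        if (std::fabs(ref[i] - got[i]) > atol + rtol * std::fabs(ref[i]))
            return false;
    return true;
}
```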

Use case

The CuTe gemm kernel will:

  • demonstrate a high-performance GEMM implementation in CuTe,
  • enable rich performance comparisons across backends,
  • act as a cornerstone for Transformer modules and other large-scale linear algebra workloads.

Related work

  • CuTe/CUTLASS GEMM examples and reference kernels.
  • Standard BLAS gemm implementations.

Additional context

This issue complements the gemm Python/PyTorch/Triton feature requests and aims to make CuTe a first-class backend for the project’s most important BLAS-3 operation.

Metadata

Assignees
No one assigned

Labels
No labels

Type
No type

Projects
No projects

Milestone
No milestone

Relationships
None yet

Development
No branches or pull requests
