Problem statement
The BLAS level-3 gemm kernel has no CuTe backend implementation in this project. The README BLAS table lists an empty CuTe column for gemm, even though GEMM is the primary workload that CuTe/CUTLASS is designed to optimize.
Without a CuTe gemm kernel:
- users cannot see how GEMM is expressed using CuTe’s layout and tiling abstractions,
- there is no CuTe GEMM baseline for performance comparison with PyTorch and Triton,
- CuTe-based examples are missing the most important building block for many workloads.
Proposed solution
Implement a CuTe-based gemm kernel that matches the Python reference semantics and aligns with the project’s backend structure.
Concretely:
- Add a CuTe gemm kernel in the appropriate CuTe backend directory, implementing $C = \alpha A B + \beta C$.
- Use CuTe primitives to describe matrix layouts, threadblock tiling, and memory movement.
- Align the public API with other backends so callers can dispatch to CuTe GEMM uniformly.
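For concreteness, the semantics the CuTe kernel would need to match can be sketched in plain Python. This is only an illustration: the function name `gemm_reference`, the argument order, and the row-major list-of-lists representation are assumptions made here, not the project's actual Python reference API.

```python
def gemm_reference(alpha, a, b, beta, c):
    """Return alpha * (A @ B) + beta * C as a new matrix.

    Hypothetical helper: A is m x k, B is k x n, C is m x n,
    all stored as row-major lists of lists of floats.
    """
    m, k = len(a), len(a[0])
    n = len(b[0])
    if len(b) != k or len(c) != m or any(len(row) != n for row in c):
        raise ValueError("shape mismatch: expected A (m x k), B (k x n), C (m x n)")

    out = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            # Dot product of row i of A with column j of B.
            acc = 0.0
            for p in range(k):
                acc += a[i][p] * b[p][j]
            # Scale the product and blend in the existing C entry.
            out[i][j] = alpha * acc + beta * c[i][j]
    return out


a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
c = [[1.0, 1.0], [1.0, 1.0]]
print(gemm_reference(2.0, a, b, 3.0, c))  # [[41.0, 47.0], [89.0, 103.0]]
```

One point the backends would have to agree on is the `beta == 0` case: this sketch still multiplies `C` by zero, whereas reference BLAS treats `C` as write-only when `beta` is zero, so NaN or Inf values already in `C` are handled differently.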
Alternatives considered
Alternatives such as omitting CuTe gemm or relying on other backends would:
- significantly reduce the educational and practical value of including CuTe as a backend,
- leave the CuTe column incomplete in the README BLAS table for the most important BLAS-3 kernel,
- limit opportunities to demonstrate high-performance GEMM implementation details in CuTe.
Implementation details
- Establish file layout and build integration for CuTe kernels.
- Implement GEMM using CuTe abstractions tuned for GPU execution, potentially leveraging CUTLASS patterns.
- Ensure numerical equivalence with the Python reference and harmonize with PyTorch/Triton semantics.
- Integrate with planned tests and benchmarks for GEMM.
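The tiling structure and the equivalence check mentioned above can be sketched on the CPU in plain Python. This is not CuTe code and makes several assumptions: the names `gemm_tiled` and `matrices_close`, the tile sizes, and the tolerances are all hypothetical. The loop nest only mirrors the usual GPU decomposition (one output tile per threadblock, a main loop over K tiles, then an epilogue applying alpha and beta).

```python
import math


def gemm_tiled(alpha, a, b, beta, c, tile_m=2, tile_n=2, tile_k=2):
    """Tiled alpha * (A @ B) + beta * C on row-major lists of lists."""
    m, k, n = len(a), len(b), len(b[0])
    out = [[0.0] * n for _ in range(m)]
    for i0 in range(0, m, tile_m):        # output tile rows ("threadblock" y)
        for j0 in range(0, n, tile_n):    # output tile cols ("threadblock" x)
            i1, j1 = min(i0 + tile_m, m), min(j0 + tile_n, n)
            # Accumulator tile, analogous to per-block register storage.
            acc = [[0.0] * (j1 - j0) for _ in range(i1 - i0)]
            for p0 in range(0, k, tile_k):  # main loop over K tiles
                p1 = min(p0 + tile_k, k)
                for i in range(i0, i1):
                    for j in range(j0, j1):
                        for p in range(p0, p1):
                            acc[i - i0][j - j0] += a[i][p] * b[p][j]
            # Epilogue: scale the accumulator and blend with C.
            for i in range(i0, i1):
                for j in range(j0, j1):
                    out[i][j] = alpha * acc[i - i0][j - j0] + beta * c[i][j]
    return out


def matrices_close(x, y, rel_tol=1e-6, abs_tol=1e-9):
    """Element-wise tolerance check, since summation order may differ."""
    return len(x) == len(y) and all(
        len(rx) == len(ry)
        and all(math.isclose(u, v, rel_tol=rel_tol, abs_tol=abs_tol)
                for u, v in zip(rx, ry))
        for rx, ry in zip(x, y)
    )
```

A tolerance-based comparison is used rather than exact equality because a tiled kernel generally accumulates partial sums in a different order than a naive reference, so bitwise-identical floating-point results should not be expected across backends.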
Use case
The CuTe gemm kernel will:
- demonstrate a high-performance GEMM implementation in CuTe,
- enable rich performance comparisons across backends,
- act as a cornerstone for Transformer modules and other large-scale linear algebra workloads.
Related work
- CuTe/CUTLASS GEMM examples and reference kernels.
- Standard BLAS gemm implementations.
Additional context
This issue complements the gemm Python/PyTorch/Triton feature requests and aims to make CuTe a first-class backend for the project’s most important BLAS-3 operation.