v0.8: Mat-Muls on Nvidia Hopper and Blackwell
This release answers a few questions:
- CUTLASS vs cuBLAS performance: which to choose?
- How did `MMA` instructions change with Hopper H100?
- How did they change again with Blackwell B200?
Minor
- Add: Warp-Group Binary MMA (d6daf3a)
- Add: Larger `m64n256k8` WGMMA variant (3e3530e)
- Add: Warp-Group Async kernels (6cc7e34)
- Add: `f64` MMA PTX variant (ae450e5)
- Add: CuTe draft (fdea727)
- Add: CUTLASS placeholders (b1ab93d)
- Add: Hopper `sm90a` PTX kernels (4bcf74a)
Patch
- Improve: `CUresult` error handling (d74d430)
- Improve: Logging CUDA errors (953a696)
- Fix: Synchronize TCs (494ba52)
- Improve: Impossible `%tid` condition against NVCC (8a9c9c5)
- Make: Temporarily block CUTLASS (df1b39c)
- Improve: Cleaner PTX code (71dea0c)
- Improve: Avoid NVCC-specific features (3d65c7f)
- Fix: Re-creating a CUDA stream (e831650)
- Make: Compile in parallel by default (8e671c6)
- Make: Separate host-only code (f751fbf)
- Docs: Counter-intuitive PTX facts (822fa2f)
- Docs: H200 vs MI 300X vs GB200 specs (cc36bcd)
- Make: CUTLASS dependency (f272c40)
- Fix: Synchronize cuBLAS for profiling (4077f26)
- Docs: Blackwell tensor cores (ec35b35)
- Fix: Missing `_Float16` in NVCC, use `half` (71cadca)
- Improve: Same size range for GEMM (d914fce)
- Fix: Different output size for `cublasGemmEx` (304c880)