v0.8: Mat-Muls on Nvidia Hopper and Blackwell

@ashvardanian released this 07 Feb 21:08

This release answers a few questions:

  • CUTLASS vs cuBLAS performance: which to choose?
  • How did MMA instructions change with Hopper H100?
  • How did they change again with Blackwell B200?

Minor

  • Add: Warp-Group Binary MMA (d6daf3a)
  • Add: Larger m64n256k8 WGMMA variant (3e3530e)
  • Add: Warp-Group Async kernels (6cc7e34)
  • Add: f64 MMA PTX variant (ae450e5)
  • Add: CuTe draft (fdea727)
  • Add: CUTLASS placeholders (b1ab93d)
  • Add: Hopper sm90a PTX kernels (4bcf74a)
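
For readers unfamiliar with shape names such as m64n256k8: in PTX matrix instructions the name encodes the M x N x K tile dimensions, so a single m64n256k8 instruction covers 64 * 256 * 8 multiply-accumulates. The sketch below is only an illustration of that arithmetic in host-side C++; the `mma_shape_t` type and `parse_mma_shape` function are hypothetical names and are not taken from this release.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string_view>

// Hypothetical helper type for illustration; not part of the released code.
struct mma_shape_t {
    std::uint64_t m, n, k;
    // Multiply-accumulate operations in one M x N x K tile.
    constexpr std::uint64_t macs() const { return m * n * k; }
    // Counting each multiply-accumulate as two floating-point operations.
    constexpr std::uint64_t flops() const { return 2 * macs(); }
};

// Parses names like "m64n256k8"; returns std::nullopt on malformed input.
inline std::optional<mma_shape_t> parse_mma_shape(std::string_view name) {
    std::uint64_t dims[3] = {0, 0, 0};
    char const tags[3] = {'m', 'n', 'k'};
    std::size_t pos = 0;
    for (int i = 0; i < 3; ++i) {
        // Each dimension starts with its letter tag, in m, n, k order.
        if (pos >= name.size() || name[pos] != tags[i]) return std::nullopt;
        ++pos;
        std::size_t digits = 0;
        while (pos < name.size() && name[pos] >= '0' && name[pos] <= '9') {
            dims[i] = dims[i] * 10 + static_cast<std::uint64_t>(name[pos] - '0');
            ++pos;
            ++digits;
        }
        // A tag with no digits after it is malformed.
        if (digits == 0) return std::nullopt;
    }
    // Reject trailing characters after the K dimension.
    if (pos != name.size()) return std::nullopt;
    return mma_shape_t {dims[0], dims[1], dims[2]};
}
```

Calling `parse_mma_shape("m64n256k8")` yields M = 64, N = 256, K = 8, which is 131072 multiply-accumulates per instruction under the counting convention above.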

Patch

  • Improve: CUresult error handling (d74d430)
  • Improve: Logging CUDA errors (953a696)
  • Fix: Synchronize TCs (494ba52)
  • Improve: Impossible %tid condition against NVCC (8a9c9c5)
  • Make: Temporarily block CUTLASS (df1b39c)
  • Improve: Cleaner PTX code (71dea0c)
  • Improve: Avoid NVCC-specific features (3d65c7f)
  • Fix: Re-creating a CUDA stream (e831650)
  • Make: Compile in parallel by default (8e671c6)
  • Make: Separate host-only code (f751fbf)
  • Docs: Counter-intuitive PTX facts (822fa2f)
  • Docs: H200 vs MI 300X vs GB200 specs (cc36bcd)
  • Make: CUTLASS dependency (f272c40)
  • Fix: Synchronize cuBLAS for profiling (4077f26)
  • Docs: Blackwell tensor cores (ec35b35)
  • Fix: Missing _Float16 in NVCC, use half (71cadca)
  • Improve: Same size range for GEMM (d914fce)
  • Fix: Different output size for cublasGemmEx (304c880)