`torchao/prototype/moe_training/README.md`
To reproduce this benchmark on a B200 GPU machine, run the benchmark command with:

- torchao: `0.14.0+gitc7b8e13da`
- torch: `2.10.0a0+gitf6de195`

### Roofline Performance Analysis

The following roofline plots provide roofline analysis and benchmark results for the items below (illustrative sketches of the roofline math and per-group scaling follow the list):

1. **Net Speedup vs () Size** - MXFP8 vs BF16 for forward + backward pass
2. **2D Quantization + Block Format Kernels** - Bandwidth utilization for input quantization and per-group scale conversion to blocked format
3. **3D Quantization + Block Format Kernels** - Bandwidth utilization for weight quantization and per-group scale conversion to blocked format
4. **Grouped GEMM Kernel Speedup** - MXFP8 over BF16 for 2D/3D and 2D/2D GEMM operations
5. **Kernel Breakdown** - Stacked bar chart showing actual measured times for each kernel component (quantization, conversion to blocked format, GEMM) across the forward, backward input, and backward weight passes
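
As background for plots 1 and 4, the sketch below shows how a standard roofline estimate can be formed: a kernel's runtime is bounded below by both its compute time at peak throughput and its data-movement time at peak bandwidth. The peak TFLOP/s and TB/s values, the GEMM shape, and the byte counts are illustrative placeholders (not measured B200 specs), and this is not the script that generated the plots above.

```python
from dataclasses import dataclass


@dataclass
class DevicePeaks:
    tflops: float  # peak compute for the dtype in use, in TFLOP/s (placeholder)
    tbps: float    # peak HBM bandwidth, in TB/s (placeholder)


def roofline_time_s(flops: float, bytes_moved: float, dev: DevicePeaks) -> float:
    """Roofline lower bound: the kernel can't run faster than either limit."""
    compute_s = flops / (dev.tflops * 1e12)
    memory_s = bytes_moved / (dev.tbps * 1e12)
    return max(compute_s, memory_s)


# One grouped-GEMM-sized problem: (M, K) @ (K, N) -> (M, N)
M, K, N = 16384, 5120, 8192
flops = 2 * M * K * N

bf16 = DevicePeaks(tflops=2000.0, tbps=8.0)   # placeholder peaks
mxfp8 = DevicePeaks(tflops=4000.0, tbps=8.0)  # placeholder peaks

bf16_bytes = 2 * (M * K + K * N + M * N)   # bf16 = 2 bytes/element
mxfp8_bytes = (M * K + K * N) + 2 * M * N  # fp8 inputs, bf16 output; scales ignored

est_speedup = roofline_time_s(flops, bf16_bytes, bf16) / roofline_time_s(
    flops, mxfp8_bytes, mxfp8
)
print(f"estimated net speedup for this shape: {est_speedup:.2f}x")
```

With these placeholder peaks the shape is compute-bound, so the estimated speedup is simply the ratio of the two peak throughputs; smaller or skinnier shapes shift the estimate toward the bandwidth-bound regime.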
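
Plots 2 and 3 concern the quantization and scale-conversion kernels. As a rough illustration of what "per-group" scaling means here, the snippet below quantizes 32-element groups to `torch.float8_e4m3fn` with a shared power-of-two scale per group. It is plain PyTorch for illustration only: it is not torchao's fused kernel, it uses a simplified scale-selection rule rather than the exact OCP MX rounding rule, and it omits the conversion of scales to the blocked layout that the plots measure.

```python
import torch

BLOCK_SIZE = 32
F8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0


def mx_quantize_rowwise(x: torch.Tensor):
    """Quantize the last dim in groups of 32, each group sharing a power-of-two scale."""
    assert x.shape[-1] % BLOCK_SIZE == 0
    blocks = x.reshape(*x.shape[:-1], -1, BLOCK_SIZE)
    amax = blocks.abs().amax(dim=-1, keepdim=True).float()
    # Pick scale = 2^ceil(log2(amax / F8_MAX)) so each scaled group fits in e4m3 range.
    scale = torch.exp2(torch.ceil(torch.log2(amax / F8_MAX)))
    scale = torch.clamp(scale, min=torch.finfo(torch.float32).tiny)  # guard all-zero groups
    q = (blocks.float() / scale).to(torch.float8_e4m3fn)
    return q.reshape(x.shape), scale.squeeze(-1)


# Example: quantize a (tokens, hidden) activation tensor.
x = torch.randn(128, 4096)
q, scales = mx_quantize_rowwise(x)
print(q.dtype, q.shape, scales.shape)  # torch.float8_e4m3fn, (128, 4096), (128, 128)
```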
These benchmarks were generated on **November 26, 2025** and will be updated with every change that affects performance.

Next steps for optimization:

* Improve 2D-2D MXFP8 grouped GEMM CUTLASS kernel performance (used for computing wgrad), which currently produces much lower speedups than the 2D-3D case (used for computing output and dgrad); the two operand layouts are sketched below.
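
For context on the bullet above, the reference below sketches the two operand layouts in plain PyTorch, assuming token groups are stored contiguously along the first dim and that `offs` (a name used only for this sketch) holds each expert group's end offset. It is not the MXFP8 CUTLASS kernel; it only shows why the output/dgrad GEMMs are 2D-3D while the wgrad GEMM is 2D-2D with a 3D result.

```python
import torch


def grouped_mm_2d3d(x: torch.Tensor, w: torch.Tensor, offs: torch.Tensor) -> torch.Tensor:
    """Output/dgrad-style GEMM: 2D tokens (T, K) x 3D expert weights (E, K, N) -> (T, N)."""
    out = torch.empty(x.shape[0], w.shape[-1], dtype=x.dtype, device=x.device)
    start = 0
    for e, end in enumerate(offs.tolist()):
        out[start:end] = x[start:end] @ w[e]
        start = end
    return out


def grouped_mm_2d2d(x: torch.Tensor, grad_out: torch.Tensor, offs: torch.Tensor) -> torch.Tensor:
    """Wgrad-style GEMM: two 2D tensors grouped along the token dim -> 3D (E, K, N)."""
    num_experts = offs.numel()
    grad_w = torch.zeros(
        num_experts, x.shape[1], grad_out.shape[1], dtype=x.dtype, device=x.device
    )
    start = 0
    for e, end in enumerate(offs.tolist()):
        grad_w[e] = x[start:end].T @ grad_out[start:end]
        start = end
    return grad_w


# Example: 3 experts with uneven token groups.
x = torch.randn(96, 256)                             # (T, K)
w = torch.randn(3, 256, 512)                         # (E, K, N)
offs = torch.tensor([32, 80, 96])                    # end offset of each expert's tokens
y = grouped_mm_2d3d(x, w, offs)                      # (96, 512)
gw = grouped_mm_2d2d(x, torch.randn(96, 512), offs)  # (3, 256, 512)
print(y.shape, gw.shape)
```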