Commit c7d0b58

Browse files
[mxfp8 moe training] update readme with rooflines and benchmarks (#3399)
1 parent 8d4a921 commit c7d0b58

3 files changed (+35 −0 lines changed)
torchao/prototype/moe_training/README.md

@@ -187,6 +187,41 @@ To reproduce this benchmark, on a B200 GPU machine, run the following command:

- torchao: `0.14.0+gitc7b8e13da`
- torch: `2.10.0a0+gitf6de195`

### Roofline Performance Analysis

The roofline plots below provide analysis and benchmarks for the following:

1. **Net Speedup vs () Size** - MXFP8 vs BF16 for forward + backward pass
2. **2D Quantization + Block Format Kernels** - Bandwidth utilization for input quantization and per-group scale conversion to blocked format
3. **3D Quantization + Block Format Kernels** - Bandwidth utilization for weight quantization and per-group scale conversion to blocked format
4. **Grouped GEMM Kernel Speedup** - MXFP8 over BF16 for 2D/3D and 2D/2D GEMM operations
5. **Kernel Breakdown** - Stacked bar chart showing actual measured times for each kernel component (quantization, conversion to blocked format, GEMM) across forward, backward input, and backward weight passes

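Items 2-3 concern MXFP8 quantization: per the OCP Microscaling (MX) spec, each group of 32 elements shares one power-of-two (E8M0-style) scale, and elements are stored in FP8 e4m3 (max magnitude 448). The NumPy sketch below illustrates only the per-group scaling math, not the fused kernels benchmarked here; the spec's exact exponent-selection rule also differs slightly from the simple choice used in this sketch.

```python
import numpy as np

GROUP_SIZE = 32   # MX block size per the OCP Microscaling spec
E4M3_MAX = 448.0  # largest finite fp8 e4m3 magnitude

def quantize_mx_groups(x: np.ndarray):
    """Per-group power-of-two scaling along the last dim (illustrative only)."""
    g = x.reshape(-1, GROUP_SIZE)
    amax = np.abs(g).max(axis=1, keepdims=True)
    # Pick a power-of-two scale so each group's max fits in e4m3 range.
    # (The OCP spec's exponent rule differs slightly; this choice avoids clipping.)
    exp = np.ceil(np.log2(np.maximum(amax, 1e-30) / E4M3_MAX))
    scale = np.exp2(exp)   # E8M0-style scale: a pure power of two
    q = g / scale          # a real kernel would now cast q to fp8 e4m3
    return q, scale

x = np.random.randn(4, 64).astype(np.float32)
q, scale = quantize_mx_groups(x)
# Power-of-two scaling is exact, so dequantization round-trips (pre-cast):
assert np.allclose(q * scale, x.reshape(-1, GROUP_SIZE))
assert np.abs(q).max() <= E4M3_MAX * (1 + 1e-6)  # tolerance for float rounding
```

On GPU, the e4m3 cast and the conversion of per-group scales to the blocked layout are fused into the quantization kernels whose bandwidth utilization the plots above measure.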
These benchmarks were generated on **November 26, 2025** and will be updated with every change that affects performance.

Next steps for optimization:
* Improve 2D-2D MXFP8 grouped GEMM CUTLASS kernel performance (used for computing wgrad), which currently produces much lower speedups than the 2D-3D case (used for computing output and dgrad).

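A roofline estimate of the kind plotted here bounds each kernel's time by the larger of its memory time and compute time at device peaks. Below is a toy sketch of that estimate for a single M x K x N GEMM, MXFP8 vs BF16. The peak bandwidth/FLOPS values and the simplified byte counts are placeholder assumptions, not B200 specs, and the real `roofline_unified.py` also models quantization overhead and grouped shapes.

```python
# Toy roofline: kernel_time = max(bytes_moved / peak_bw, flops / peak_flops).
# All peak values are illustrative placeholders, NOT real B200 specs.
PEAK_BW = 6.0e12      # bytes/s, assumed HBM bandwidth
PEAK_BF16 = 1.0e15    # flop/s at bf16 (assumed)
PEAK_MXFP8 = 2.0e15   # flop/s at fp8, assumed 2x bf16

def gemm_time_s(M, K, N, in_bytes_per_elem, peak_flops):
    flops = 2 * M * K * N
    # Read both operands, write a bf16 (2-byte) output; scale tensors ignored.
    mem_bytes = in_bytes_per_elem * (M * K + K * N) + 2 * M * N
    return max(mem_bytes / PEAK_BW, flops / peak_flops)

def est_speedup(M, K, N):
    t_bf16 = gemm_time_s(M, K, N, 2, PEAK_BF16)    # bf16 operands
    t_mxfp8 = gemm_time_s(M, K, N, 1, PEAK_MXFP8)  # fp8 operands
    return t_bf16 / t_mxfp8

for M in (1024, 16384, 131072):  # K/N chosen to match the Llama4 shapes
    print(f"M={M}: estimated GEMM speedup {est_speedup(M, 5120, 8192):.2f}x")
```

The net speedup curve in the first plot extends this kind of per-kernel estimate across all kernels in the forward and backward passes.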
#### Llama4 Shapes (K=5120, N=8192, G=8)

![Llama Rooflines](../../../benchmarks/prototype/moe_training/mxfp8/llama_rooflines.png)

**Command to reproduce:**

```bash
cd benchmarks/prototype/moe_training/mxfp8
python roofline_unified.py --K=5120 --N=8192 --G=8 --power_limit_percent=100 --breakdown_M=131072 --plot_file=llama_rooflines.png
```

#### DeepSeek V3 Shapes (K=7168, N=2048, G=8)

![DeepSeek V3 Rooflines](../../../benchmarks/prototype/moe_training/mxfp8/dsv3_rooflines.png)

**Command to reproduce:**

```bash
cd benchmarks/prototype/moe_training/mxfp8
python roofline_unified.py --K=7168 --N=2048 --G=8 --power_limit_percent=100 --breakdown_M=131072 --plot_file=dsv3_rooflines.png
```

## Benchmark: single MoE layer forward + backward pass

| Model | total_M | N | K | bf16 time (ms) | mxfp8 time (ms) | speedup |
