CUDA: Accelerate MXFP4 table lookup using __byte_perm (#15451)
#230
| Job | Run time |
|---|---|
| | 7m 17s |
| | 13m 7s |
| | 57m 41s |
| | 11m 21s |
| | 10m 16s |
| | 7m 27s |
| | 15m 28s |
| | 3m 30s |
| | 4m 21s |
| | 6m 21s |
| | 18m 17s |
| | 11m 1s |
| | 7m 3s |
| | 9m 27s |
| | 5m 12s |
| | 4m 40s |
| | 9m 47s |
| | 8m 35s |
| | 2m 22s |
| | 14m 23s |
| | 5m 48s |
| | 9m 45s |
| | 2m 43s |
| | 19m 21s |
| | 48m 14s |
| | 9m 18s |
| | 5m 38s |
| | 3m 22s |
| | 7m 59s |
| | 13m 45s |
| | 7m 7s |
| | 1m 55s |
| | 1m 54s |
| | 20m 6s |
| | 9m 31s |
| | 21m 21s |
| | 3m 23s |
| | 13m 10s |
| | 4m 46s |
| | 4m 56s |
| | 7m 35s |
| | 3m 7s |
| Total | 7h 32m 20s |