I don't have a benchmark from before we added CPU FP8, but this change
restores a fair bit (if not all) of the lost performance:
- The innermost `d` loop now does only `dot += qv[d] * kvp[d];`
- All FP8 work has been hoisted into the Qq/Kh precomputation loops, which are O(D * H * Tc + D * kv) instead of O(D * H * Tc * kv).
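
For illustration, here is a minimal sketch of that hoisting pattern. It is not the code from this commit. `decode_fp8` is a hypothetical stand-in (not a real FP8 format), and the dimensions are collapsed to `Tq` query rows (roughly H * Tc) and `Tk` key rows (roughly kv). The function names and scratch buffers `qf`/`kf` are likewise assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for FP8 decoding. This is NOT a real FP8 format,
   just a cheap byte-to-float mapping so the sketch is self-contained. */
static float decode_fp8(uint8_t x) {
    return ((float)x - 128.0f) * 0.125f;
}

/* Before: both operands are decoded inside the innermost loop,
   so the decode work is O(D * Tq * Tk). */
static void scores_naive(const uint8_t *q, const uint8_t *k,
                         size_t Tq, size_t Tk, size_t D, float *out) {
    assert(q && k && out);
    for (size_t i = 0; i < Tq; i++) {
        for (size_t j = 0; j < Tk; j++) {
            float dot = 0.0f;
            for (size_t d = 0; d < D; d++)
                dot += decode_fp8(q[i * D + d]) * decode_fp8(k[j * D + d]);
            out[i * Tk + j] = dot;
        }
    }
}

/* After: each element is decoded once into a float scratch buffer,
   so the decode work is O(D * Tq + D * Tk) and the innermost loop
   is a plain float multiply-add. */
static void scores_hoisted(const uint8_t *q, const uint8_t *k,
                           size_t Tq, size_t Tk, size_t D,
                           float *qf, float *kf, float *out) {
    assert(q && k && qf && kf && out);
    for (size_t n = 0; n < Tq * D; n++) qf[n] = decode_fp8(q[n]);
    for (size_t n = 0; n < Tk * D; n++) kf[n] = decode_fp8(k[n]);

    for (size_t i = 0; i < Tq; i++) {
        const float *qv = qf + i * D;
        for (size_t j = 0; j < Tk; j++) {
            const float *kvp = kf + j * D;
            float dot = 0.0f;
            for (size_t d = 0; d < D; d++)
                dot += qv[d] * kvp[d];
            out[i * Tk + j] = dot;
        }
    }
}
```

Both functions accumulate in the same order, so they should produce identical results. The hoisted version trades two float scratch buffers for the removed per-iteration decodes.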