@bssrdf commented Dec 4, 2025

The additional condition added by #17332 is too strict and excludes some legitimate cases, e.g., test_cpy(GGML_TYPE_BF16, GGML_TYPE_BF16, {4352, 1, 9216, 1}, {1, 2, 0, 3}, {0, 0, 0, 0}). This PR loosens that check a bit and adds another permuted case that is also a transpose.
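To see why the excluded case is effectively a transpose, the sketch below applies the ggml_permute convention (source dimension i becomes dimension axes[i] of the result) to the two shapes mentioned in this PR and then ignores dimensions of extent 1. The names `View4`, `make_contiguous`, `permute`, and `is_effective_2d_transpose` are hypothetical helpers for illustration only; the predicate is an assumption about what "also a transpose" means here, not the condition actually used in the CUDA copy path.

```cpp
#include <array>
#include <cstdint>
#include <vector>

// Shape and strides of a 4-D tensor, ggml-style: dimension 0 varies fastest.
// Strides are counted in elements here, not bytes (a simplification).
struct View4 {
    std::array<int64_t, 4> ne; // extent of each dimension
    std::array<int64_t, 4> nb; // stride of each dimension, in elements
};

// Build a contiguous view for the given shape.
View4 make_contiguous(const std::array<int64_t, 4> & ne) {
    View4 v{ne, {}};
    v.nb[0] = 1;
    for (int i = 1; i < 4; ++i) {
        v.nb[i] = v.nb[i - 1] * v.ne[i - 1];
    }
    return v;
}

// Mirrors the ggml_permute convention: source dimension i becomes
// dimension axes[i] of the result (extent and stride move together).
View4 permute(const View4 & src, const std::array<int, 4> & axes) {
    View4 dst{};
    for (int i = 0; i < 4; ++i) {
        dst.ne[axes[i]] = src.ne[i];
        dst.nb[axes[i]] = src.nb[i];
    }
    return dst;
}

// Hypothetical predicate (NOT the PR's actual check): after dropping
// dimensions of extent 1, is the view a plain 2-D transpose of a
// contiguous matrix?
bool is_effective_2d_transpose(const View4 & v) {
    std::vector<int> dims;
    for (int i = 0; i < 4; ++i) {
        if (v.ne[i] > 1) {
            dims.push_back(i);
        }
    }
    if (dims.size() != 2) {
        return false;
    }
    const int a = dims[0]; // lower-numbered surviving dimension
    const int b = dims[1]; // higher-numbered surviving dimension
    // In a transposed matrix the higher dimension is the unit-stride one.
    return v.nb[b] == 1 && v.nb[a] == v.ne[b];
}
```

Under these assumptions, `permute(make_contiguous({4352, 1, 9216, 1}), {1, 2, 0, 3})` gives extents [9216, 4352, 1, 1] with element strides [4352, 1, ...], and `permute(make_contiguous({21504, 4352, 1, 1}), {2, 0, 1, 3})` gives extents [4352, 1, 21504, 1] with unit stride on dimension 2. Both satisfy the predicate, while the unpermuted contiguous views do not.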

Master

CPY(type_src=bf16,type_dst=bf16,ne=[4352,1,9216,1],permute_src=[1,2,0,3],permute_dst=[0,0,0,0],_src_transpose=0):   1935 runs -   567.07 us/run -   156672 kB/run -  264.09 GB/s
CPY(type_src=f32,type_dst=f32,ne=[4352,1,9216,1],permute_src=[1,2,0,3],permute_dst=[0,0,0,0],_src_transpose=0):     1080 runs -   946.85 us/run -   313344 kB/run -  317.06 GB/s
CPY(type_src=bf16,type_dst=bf16,ne=[21504,4352,1,1],permute_src=[2,0,1,3],permute_dst=[0,0,0,0],_src_transpose=0):   828 runs -  1280.62 us/run -   365568 kB/run -  273.72 GB/s

This PR

CPY(type_src=bf16,type_dst=bf16,ne=[4352,1,9216,1],permute_src=[1,2,0,3],permute_dst=[0,0,0,0],_src_transpose=0):   5590 runs -   185.73 us/run -   156672 kB/run -  806.36 GB/s
CPY(type_src=f32,type_dst=f32,ne=[4352,1,9216,1],permute_src=[1,2,0,3],permute_dst=[0,0,0,0],_src_transpose=0):     2592 runs -   389.38 us/run -   313344 kB/run -  771.01 GB/s
CPY(type_src=bf16,type_dst=bf16,ne=[21504,4352,1,1],permute_src=[2,0,1,3],permute_dst=[0,0,0,0],_src_transpose=0):  2208 runs -   460.16 us/run -   365568 kB/run -  761.74 GB/s
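For a quick comparison of the two runs, the per-run times can be turned into speedup ratios. The snippet below is only a convenience sketch: `CpyTiming`, `kTimings`, and `speedup` are hypothetical names, and the numbers are copied from the Master and This PR outputs above.

```cpp
// Per-run copy times in microseconds, taken from the benchmark output above.
struct CpyTiming {
    double master_us; // time per run on master
    double pr_us;     // time per run with this PR
};

const CpyTiming kTimings[] = {
    {567.07, 185.73},  // bf16, ne=[4352,1,9216,1],  permute_src=[1,2,0,3]
    {946.85, 389.38},  // f32,  ne=[4352,1,9216,1],  permute_src=[1,2,0,3]
    {1280.62, 460.16}, // bf16, ne=[21504,4352,1,1], permute_src=[2,0,1,3]
};

// Speedup of this PR over master for one test case (higher is better).
inline double speedup(const CpyTiming & t) {
    return t.master_us / t.pr_us;
}
```

With these figures the three cases come out at roughly 3.05x, 2.43x, and 2.78x faster than master.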

@bssrdf requested a review from ggerganov as a code owner on December 4, 2025, 13:52
@github-actions bot added the labels testing (Everything test related), Nvidia GPU (Issues specific to Nvidia GPUs), and ggml (changes relating to the ggml tensor library for machine learning) on Dec 4, 2025