Perplexity vs Size Graphs for the recent quants (DeepSeek-V3.1-Terminus, DeepSeek-R1, Qwen3-Coder, Kimi-K2, Chimera, etc.) #715
9 comments · 75 replies
-
@magikRUKKOLA Thank you for these graphs, very useful! Can one do something to improve discoverability? I personally find it a bit hard to tell which point corresponds to which quantization.
-
Thanks @magikRUKKOLA for putting these together. It's always interesting to see which quantization types perform well on some of these big models. I just added a few more data points to my DeepSeek-V3.1 collection. The IQ4_KSS is doing unreasonably well again, right around 4.0 BPW. I went back and re-read this earlier discussion on QAT and IQ4_KS here: #359 (comment), and I'm speculating wildly that it could have something to do with ~4.0 BPW being a "sweet spot" in the size vs. perplexity trade-off curve (a rough way to quantify that is sketched after the JSON below).
JSON data:
[
{
"name": "BF16",
"ppl": "3.3469 +/- 0.01936",
"size": 1250.084,
"bpw": 16.003,
"legend": "pure"
},
{
"name": "Q8_0",
"ppl": "3.3473 +/- 0.01935",
"size": 664.295,
"bpw": 8.504,
"legend": "pure",
"skip": true
},
{
"name": "IQ5_K",
"ppl": "3.3550 +/- 0.01942",
"size": 465.075,
"bpw": 5.944,
"legend": "ubergarm"
},
{
"name": "IQ4_K",
"ppl": "3.3715 +/- 0.01956",
"size": 384.765,
"bpw": 4.925,
"legend": "ubergarm",
"comment": ""
},
{
"name": "IQ4_KS",
"ppl": "3.3806 +/- 0.01966",
"size": 363.151,
"bpw": 4.649,
"legend": "ubergarm",
"comment": ""
},
{
"name": "Q4_0",
"ppl": "3.4277 +/- 0.02000",
"size": 352.096,
"bpw": 4.507,
"legend": "pure",
"comment": "q4_K embd, q6_K head"
},
{
"name": "IQ4_KSS",
"ppl": "3.3887 +/- 0.01968",
"size": 325.088,
"bpw": 4.162,
"legend": "ubergarm",
"comment": ""
},
{
"name": "smol-IQ4_KSS",
"ppl": "3.3898 +/- 0.01964",
"size": 318.745,
"bpw": 4.080,
"legend": "ubergarm",
"comment": ""
},
{
"name": "IQ3_K",
"ppl": "3.4260 +/- 0.01995",
"size": 293.177,
"bpw": 3.753,
"legend": "ubergarm",
"comment": "PR624 ik/quantization_tweaks"
},
{
"name": "IQ3_KS",
"ppl": "3.4534 +/- 0.02019",
"size": 277.397,
"bpw": 3.551,
"legend": "ubergarm",
"comment": "PR624 ik/quantization_tweaks"
},
{
"name": "IQ2_KL",
"ppl": "3.6312 +/- 0.02161",
"size": 231.206,
"bpw": 2.960,
"legend": "ubergarm",
"comment": "PR624 ik/quantization_tweaks"
},
{
"name": "IQ2_KT",
"ppl": "3.8109 +/- 0.02294",
"size": 204.592,
"bpw": 2.619,
"legend": "ubergarm",
"comment": "PR624 ik/quantization_tweaks + PR to fix KT quantization"
},
{
"name": "IQ2_KS",
"ppl": "3.9583 +/- 0.02433",
"size": 193.144,
"bpw": 2.472,
"legend": "ubergarm",
"comment": "PR624 ik/quantization_tweaks"
},
{
"name": "IQ1_KT",
"ppl": "4.3987 +/- 0.02786",
"size": 154.968,
"bpw": 1.984,
"legend": "ubergarm",
"comment": ""
},
{
"name": "IQ1_S",
"ppl": "5.3113 +/- 0.03507",
"size": 133.610,
"bpw": 1.710,
"legend": "ubergarm",
"comment": ""
}
]
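To put a number on that "sweet spot" hunch, here is a minimal Python sketch (my own illustration, not the script used for the published graphs) that parses the JSON above and prints the perplexity paid per GiB saved relative to BF16; the file name is a placeholder:

import json

# assumes the JSON array above was saved as v31_data.json (hypothetical name)
with open("v31_data.json") as f:
    data = json.load(f)

def ppl(e):
    # "ppl" fields are strings like "3.3469 +/- 0.01936"; take the central value
    return float(e["ppl"].split()[0])

base = next(e for e in data if e["name"] == "BF16")  # full-precision reference
for e in sorted(data, key=lambda e: e["bpw"]):
    if e["name"] == "BF16":
        continue
    dppl = ppl(e) - ppl(base)        # perplexity paid vs BF16
    dgib = base["size"] - e["size"]  # GiB saved vs BF16
    print(f"{e['name']:>14}  {e['bpw']:6.3f} bpw  +{dppl:.4f} ppl  "
          f"-{dgib:6.1f} GiB  ({1000 * dppl / dgib:.3f} mPPL/GiB)")

If the last column has a local minimum near 4.0 BPW, that would support the sweet-spot reading.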
-
Adding my test result: Kimi-K2-Instruct-UD-Q3_K_XL: PPL = 3.2330 +/- 0.01668
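For readers wondering what the "+/-" means: figures like this are the exponential of the mean per-token negative log-likelihood over the evaluation text, with the uncertainty obtained by propagating the standard error of that mean. A schematic sketch (not the actual llama.cpp implementation; logprobs is a hypothetical list of per-token log-probabilities):

import math

def perplexity(logprobs):
    # mean negative log-likelihood over the evaluation tokens
    n = len(logprobs)
    nll = -sum(logprobs) / n
    ppl = math.exp(nll)
    # sample variance of the per-token NLL, propagated through exp()
    var = sum((-lp - nll) ** 2 for lp in logprobs) / (n - 1)
    return ppl, ppl * math.sqrt(var / n)  # (estimate, +/- term)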
-
I tried to calculate perplexity for Kimi-K2-Instruct-0905-THIREUS-IQ3_K-SPECIAL_SPLIT and got very bad results: PPL 2.7851 at 3.4325 bpw? Seems like something is very wrong. I made sure all the split files are valid. Anyway, honestly, it's unlikely I will be using Kimi-K2 personally. The DeepSeek-V3.1-Terminus just got released. :)
-
I was thinking about an automated tool that finds the optimal quant given input parameters such as RAM/VRAM limits, prefill/decode speed, max context length, and perplexity, so that everyone could find the exact quant they want with very little effort. A sketch of the core selection logic follows.
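A minimal sketch of that selection logic, assuming the per-quant JSON format used in the DATA SOURCES below; the file name and the fixed overhead allowance are placeholders, and the speed/context constraints are left out since those have to be benchmarked per machine:

import json

def pick_quant(entries, budget_gib, overhead_gib=24.0):
    """Lowest-perplexity quant whose weights plus a rough allowance for
    KV cache and compute buffers fit the memory budget. overhead_gib is
    a made-up placeholder; the real figure depends on context length,
    cache type, and batch size."""
    fits = [e for e in entries if e["size"] + overhead_gib <= budget_gib]
    return min(fits, key=lambda e: e["ppl"]) if fits else None

with open("terminus.json") as f:  # one DATA SOURCES block saved to a file
    entries = json.load(f)["data"]

best = pick_quant(entries, budget_gib=256.0)  # e.g. 192 GB RAM + 64 GB VRAM
print(best["name"] if best else "nothing fits")

A real tool would additionally filter on measured prefill/decode speeds and on the KV-cache footprint of the requested context length.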
-
Qwen3-Coder added. |
-
@magikRUKKOLA: Amazing work! Could you please, if it's not too much of a hassle, add the sizes you collected for your graphs to your DATA SOURCES JSON for readability? I invested in a new mobo/CPU set with 192 GB DDR5 (plus my existing 64 GB of VRAM), and your data is very helpful for figuring out roughly which quant I'll need to run those big models.
-
So what should I stick to for speed? I have been going by <200 GB, yet that seems not to be working so well after trying a bunch of GLM REAP quants: despite using less memory, many REAP quants were slower than the full model. Even though I have 4x3090, my PP speeds are between 100 and 200 t/s with a smaller batch like 1024 + RTR. And yes, using a high batch size can artificially pump up that number, but it will absolutely decimate your latency on small prompts. When I load, I only use 32k context and fill the rest of the GPUs with layers or pieces of the experts. Yet an IQ3_XXS pruned quant (130 GB) is slower than Q4_K or Q3_K_XL (160-180 GB). I'd try them all, but I don't have the bandwidth to keep downloading, so I can't get a feel for it.
-
Related to #715 (reply in thread): sweep-bench with 4k batches on THIREUS-R1-3.5652bpw, 124k ctx, two RTX 3090s. The prefill is about 150 t/s; I have no idea why you're getting only 40-something.
-
GRAPHS: (perplexity vs. size plots, one per model in the DATA SOURCES below)
DATA SOURCES:
{ "title": "DeepSeek-V3.1-Terminus (671B) Quantization Analysis", "subtitle": "Lower perplexity = Better performance", "model_parameters": 671000000000, "data": [ {"name": "IQ1_S", "bpw": 1.745, "ppl": 5.4829, "size": 134.45, "url": "https://huggingface.co/ubergarm/DeepSeek-V3.1-Terminus-GGUF/tree/main/IQ1_S"}, {"name": "IQ1_KT", "bpw": 1.987, "ppl": 4.5310, "size": 154.61, "url": "https://huggingface.co/ubergarm/DeepSeek-V3.1-Terminus-GGUF/tree/main/IQ1_KT"}, {"name": "IQ2_KS", "bpw": 2.472, "ppl": 4.0280, "size": 190.56, "url": "https://huggingface.co/ubergarm/DeepSeek-V3.1-Terminus-GGUF/tree/main/IQ2_KS"}, {"name": "IQ2_KL", "bpw": 2.962, "ppl": 3.7112, "size": 228.54, "url": "https://huggingface.co/ubergarm/DeepSeek-V3.1-Terminus-GGUF/tree/main/IQ2_KL"}, {"name": "IQ3_KS", "bpw": 3.545, "ppl": 3.5174, "size": 276.89, "url": "https://huggingface.co/ubergarm/DeepSeek-V3.1-Terminus-GGUF/tree/main/IQ3_KS"}, {"name": "IQ3_K", "bpw": 3.724, "ppl": 3.4781, "size": 290.56, "url": "https://huggingface.co/ubergarm/DeepSeek-V3.1-Terminus-GGUF/tree/main/IQ3_K"}, {"name": "smol-IQ4_KSS", "bpw": 4.080, "ppl": 3.4445, "size": 317.96, "url": "https://huggingface.co/ubergarm/DeepSeek-V3.1-Terminus-GGUF/tree/main/smol-IQ4_KSS"}, {"name": "IQ4_K", "bpw": 4.896, "ppl": 3.4198, "size": 380.57, "url": "https://huggingface.co/ubergarm/DeepSeek-V3.1-Terminus-GGUF/tree/main/IQ4_K"}, {"name": "smol-IQ5_KS", "bpw": 5.339, "ppl": 3.4059, "size": 416.27, "url": "https://huggingface.co/ubergarm/DeepSeek-V3.1-Terminus-GGUF/tree/main/smol-IQ5_KS"}, {"name": "THIREUS-5.4498bpw-R4", "bpw": 5.4498, "ppl": 3.3961, "size": 426.07, "url": "https://github.com/ikawrakow/ik_llama.cpp/discussions/715#discussioncomment-14579570"}, {"name": "IQ5_K", "bpw": 5.941, "ppl": 3.4000, "size": 462.87, "url": "https://huggingface.co/ubergarm/DeepSeek-V3.1-Terminus-GGUF/tree/main/IQ5_K"}, {"name": "THIREUS-6.2212bpw", "bpw": 6.2212, "ppl": 3.3949, "size": 485.07, "url": "https://github.com/ikawrakow/ik_llama.cpp/discussions/715#discussioncomment-14554951"}, {"name": "Q8_0", "bpw": 8.504, "ppl": 3.3929, "size": 660.30, "url": "https://huggingface.co/ubergarm/DeepSeek-V3.1-Terminus-GGUF/tree/main/Q8_0"} ] } { "title": "DeepSeek-R1-0528 (671B) Quantization Analysis", "subtitle": "Lower perplexity = Better performance", "model_parameters": 671000000000, "data": [ {"name": "IQ1_S_R4", "bpw": 1.664, "ppl": 4.8831, "size": 129.53, "url": "https://huggingface.co/ubergarm/DeepSeek-R1-0528-GGUF/tree/main/IQ1_S_R4"}, {"name": "THIREUS-1.9364", "bpw": 1.9364, "ppl": 4.3533, "size": 150.75, "url": "https://github.com/Thireus/GGUF-Tool-Suite/blob/main/recipe_examples/DeepSeek-R1-0528.THIREUS-1.9364bpw-4.3533ppl.151GB-GGUF_14GB-GPU_203GB-CPU.3c88ec6_9fd615d.recipe"}, {"name": "IQ2_KT", "bpw": 2.514, "ppl": 3.6378, "size": 197.70, "url": null}, {"name": "THIREUS-2.7840", "bpw": 2.7840, "ppl": 3.4341, "size": 216.55, "url": "https://github.com/Thireus/GGUF-Tool-Suite/blob/main/recipe_examples/DeepSeek-R1-0528.THIREUS-2.7840bpw-3.4341ppl.217GB-GGUF_14GB-GPU_203GB-CPU.3c88ec6_02247be.recipe"}, {"name": "IQ2_K_R4", "bpw": 2.799, "ppl": 3.5069, "size": 217.76, "url": "https://huggingface.co/ubergarm/DeepSeek-R1-0528-GGUF/tree/main/IQ2_K_R4"}, {"name": "JWNoctis/R1-0528/IQ2_KL", "bpw": 2.930, "ppl": 3.4379, "size": 227.64, "url": "https://forum.level1techs.com/t/deepseek-deep-dive-r1-at-home/225826/354"}, {"name": "UD_Q2_K_XL", "bpw": 2.994, "ppl": 3.5278, "size": 232.02, "url": "https://huggingface.co/unsloth/DeepSeek-R1-0528-GGUF/tree/main/UD-Q2_K_XL"}, 
{"name": "THIREUS-3.1027", "bpw": 3.1027, "ppl": 3.3372, "size": 240.96, "url": "https://github.com/Thireus/GGUF-Tool-Suite/blob/main/recipe_examples/DeepSeek-R1-0528.THIREUS-3.1027bpw-3.3372ppl.242GB-GGUF_11GB-GPU_231GB-CPU.3c88ec6_adc8101.recipe"}, {"name": "THIREUS-3.1446", "bpw": 3.1446, "ppl": 3.3257, "size": 244.39, "url": "https://github.com/Thireus/GGUF-Tool-Suite/blob/main/recipe_examples/DeepSeek-R1-0528.THIREUS-3.1446bpw-3.3257ppl.246GB-GGUF_15GB-GPU_231GB-CPU.3c88ec6_7d1efe1.recipe"}, {"name": "THIREUS-3.1447", "bpw": 3.1447, "ppl": 3.3269, "size": 244.40, "url": "https://github.com/Thireus/GGUF-Tool-Suite/blob/main/recipe_examples/DeepSeek-R1-0528.THIREUS-3.1447bpw-3.3269ppl.246GB-GGUF_15GB-GPU_231GB-CPU.3c88ec6_4b1254a.recipe"}, {"name": "THIREUS-3.1525", "bpw": 3.1525, "ppl": 3.3251, "size": 245.07, "url": "https://github.com/Thireus/GGUF-Tool-Suite/blob/main/recipe_examples/DeepSeek-R1-0528.THIREUS-3.1525bpw-3.3251ppl.246GB-GGUF_15GB-GPU_231GB-CPU.3c88ec6_5a3fc0f.recipe"}, {"name": "THIREUS-3.1740", "bpw": 3.1740, "ppl": 3.3253, "size": 246.76, "url": "https://github.com/Thireus/GGUF-Tool-Suite/blob/main/recipe_examples/DeepSeek-R1-0528.THIREUS-3.1740bpw-3.3253ppl.248GB-GGUF_17GB-GPU_231GB-CPU.3c88ec6_6cf3a72.recipe"}, {"name": "THIREUS-3.1858", "bpw": 3.1858, "ppl": 3.3261, "size": 247.60, "url": "https://github.com/Thireus/GGUF-Tool-Suite/blob/main/recipe_examples/DeepSeek-R1-0528.THIREUS-3.1858bpw-3.3261ppl.249GB-GGUF_18GB-GPU_231GB-CPU.3c88ec6_027b7ff.recipe"}, {"name": "THIREUS-3.2564", "bpw": 3.2564, "ppl": 3.2985, "size": 253.18, "url": "https://github.com/Thireus/GGUF-Tool-Suite/blob/main/recipe_examples/DeepSeek-R1-0528.THIREUS-3.2564bpw-3.2985ppl.254GB-GGUF_15GB-GPU_239GB-CPU.3c88ec6_7c0be1e.recipe"}, {"name": "IQ3_KT", "bpw": 3.483, "ppl": 3.3056, "size": 267.63, "url": "https://huggingface.co/ubergarm/DeepSeek-R1-0528-GGUF/tree/main/IQ3_KT"}, {"name": "THIREUS-3.5652", "bpw": 3.5652, "ppl": 3.2734, "size": 284.90, "url": "https://github.com/Thireus/GGUF-Tool-Suite/blob/main/recipe_examples/DeepSeek-R1-0528.THIREUS-3.5652bpw-3.2734ppl.278GB-GGUF_14GB-GPU_264GB-CPU.3c88ec6_9b5660b.recipe"}, {"name": "IQ3_KS", "bpw": 3.598, "ppl": 3.2991, "size": 287.54, "url": "https://huggingface.co/ubergarm/DeepSeek-R1-0528-GGUF/tree/main/IQ3_KS"}, {"name": "THIREUS-3.6766", "bpw": 3.6766, "ppl": 3.2741, "size": 293.80, "url": "https://github.com/ikawrakow/ik_llama.cpp/discussions/477#discussioncomment-13781700"}, {"name": "IQ3_K_R4", "bpw": 3.847, "ppl": 3.2730, "size": 306.52, "url": "https://huggingface.co/ubergarm/DeepSeek-R1-0528-GGUF/tree/main/IQ3_K_R4"}, {"name": "THIREUS-3.976", "bpw": 3.976, "ppl": 3.2452, "size": 315.18, "url": "https://github.com/ikawrakow/ik_llama.cpp/discussions/477#discussioncomment-13798329"}, {"name": "IQ4_XS (unsloth)", "bpw": 4.2683, "ppl": 3.2598, "size": 337.03, "url": "https://huggingface.co/unsloth/DeepSeek-R1-0528-GGUF/tree/main/IQ4_XS"}, {"name": "q4_0", "bpw": 4.508, "ppl": 3.2895, "size": 356.27, "url": "https://huggingface.co/unsloth/DeepSeek-R1-0528-GGUF/tree/main/Q4_0"}, {"name": "UD_Q4_K_XL", "bpw": 4.578, "ppl": 3.2483, "size": 361.92, "url": "https://huggingface.co/unsloth/DeepSeek-R1-0528-GGUF/tree/main/UD-Q4_K_XL"}, {"name": "IQ4_KS_R4", "bpw": 4.701, "ppl": 3.2286, "size": 371.94, "url": "https://huggingface.co/ubergarm/DeepSeek-R1-0528-GGUF/tree/main/IQ4_KS_R4"}, {"name":"THIREUS-5.0601","bpw":5.0601,"ppl":3.2223,"size": 397.12, "url":"https://github.com/ikawrakow/ik_llama.cpp/discussions/715#discussioncomment-14625973"}, 
{"name": "DQ4_K_R4", "bpw": 5.289, "ppl": 3.2276, "size": 415.04, "url": "https://huggingface.co/anikifoss/DeepSeek-R1-0528-DQ4_K_R4"}, {"name": "THIREUS-6.2218", "bpw": 6.2218, "ppl": 3.2240, "size": 486.97, "url": "https://github.com/ikawrakow/ik_llama.cpp/discussions/477#discussioncomment-13781560"}, {"name": "THIREUS-6.4296", "bpw": 6.4296, "ppl": 3.2231, "size": 503.65, "url": "https://github.com/ikawrakow/ik_llama.cpp/discussions/718#discussioncomment-14193821"}, {"name": "THIREUS-6.5522", "bpw": 6.5522, "ppl": 3.2227, "size": 512.60, "url": "https://github.com/ikawrakow/ik_llama.cpp/discussions/718#discussioncomment-14193821"}, {"name": "Q8_0", "bpw": 8.5259260, "ppl": 3.2130, "size": 664.33, "url": "https://huggingface.co/unsloth/DeepSeek-R1-0528-GGUF/tree/main/Q8_0"} ] } { "title": "DeepSeek-V3.1 (671B) Quantization Analysis", "subtitle": "Lower perplexity = Better performance", "model_parameters": 671000000000, "data": [ { "name": "IQ1_S", "bpw": 1.710, "ppl": 5.3113, "size": 132.84, "url": "https://huggingface.co/ubergarm/DeepSeek-V3.1-GGUF/tree/main/IQ1_S" }, { "name": "IQ1_KT", "bpw": 1.984, "ppl": 4.3987, "size": 153.62, "url": "https://huggingface.co/ubergarm/DeepSeek-V3.1-GGUF/tree/main/IQ1_KT" }, { "name": "IQ2_KS", "bpw": 2.472, "ppl": 3.9583, "size": 190.56, "url": "https://huggingface.co/ubergarm/DeepSeek-V3.1-GGUF/tree/main/IQ2_KS" }, { "name": "IQ2_KT", "bpw": 2.619, "ppl": 3.8109, "size": 201.47, "url": "" }, { "name": "IQ2_KL", "bpw": 2.960, "ppl": 3.6312, "size": 225.58, "url": "https://huggingface.co/ubergarm/DeepSeek-V3.1-GGUF/tree/main/IQ2_KL" }, { "name": "IQ3_KS", "bpw": 3.551, "ppl": 3.4534, "size": 273.06, "url": "https://huggingface.co/ubergarm/DeepSeek-V3.1-GGUF/tree/main/IQ3_KS" }, { "name": "IQ3_K", "bpw": 3.753, "ppl": 3.4260, "size": 287.09, "url": "https://huggingface.co/ubergarm/DeepSeek-V3.1-GGUF/tree/main/IQ3_K" }, { "name": "smol-IQ4_KSS", "bpw": 4.080, "ppl": 3.3898, "size": 317.96, "url": "" }, { "name": "IQ4_KSS", "bpw": 4.162, "ppl": 3.3887, "size": 325.03, "url": "https://huggingface.co/ubergarm/DeepSeek-V3.1-GGUF/tree/main/IQ4_KSS" }, { "name": "Q4_0", "bpw": 4.507, "ppl": 3.4277, "size": 355.86, "url": "" }, { "name": "UD-Q4_K_XL", "bpw": 4.507, "ppl": 3.4013, "size": 355.86, "url": "https://huggingface.co/unsloth/DeepSeek-V3.1-GGUF/tree/main/UD-Q4_K_XL" }, { "name": "IQ4_KS", "bpw": 4.649, "ppl": 3.3806, "size": 367.52, "url": "https://huggingface.co/ubergarm/DeepSeek-V3.1-GGUF/tree/main/IQ4_KS" }, { "name": "IQ4_K", "bpw": 4.925, "ppl": 3.3715, "size": 389.94, "url": "https://huggingface.co/ubergarm/DeepSeek-V3.1-GGUF/tree/main/IQ4_K" }, { "name": "IQ5_K", "bpw": 5.944, "ppl": 3.3550, "size": 462.98, "url": "https://huggingface.co/ubergarm/DeepSeek-V3.1-GGUF/tree/main/IQ5_K" }, { "name": "Q8_0", "bpw": 8.504, "ppl": 3.3473, "size": 660.30, "url": "https://huggingface.co/ubergarm/DeepSeek-V3.1-GGUF/tree/main/Q8_0" }, { "name": "BF16", "bpw": 16.003, "ppl": 3.3469, "size": 1257.33, "url": "" } ] } { "title": "DeepSeek-TNG-R1T2-Chimera (671B) Quantization Analysis", "subtitle": "Lower perplexity = Better performance", "model_parameters": 671000000000, "data": [ {"name": "IQ1_S", "bpw": 1.699, "ppl": 4.9878, "size": 132.25, "url": "https://huggingface.co/ubergarm/DeepSeek-TNG-R1T2-Chimera-GGUF/tree/main/IQ1_S"}, {"name": "THIREUS-1.6693", "bpw": 1.6693, "ppl": 4.9676, "size": 130.58, "url": "https://github.com/ikawrakow/ik_llama.cpp/discussions/477#discussioncomment-13883488"}, {"name": "THIREUS-1.7067", "bpw": 1.7067, "ppl": 4.9199, "size": 
132.75, "url": "https://github.com/ikawrakow/ik_llama.cpp/discussions/477#discussioncomment-13914222"}, {"name": "THIREUS-2.0622", "bpw": 2.0622, "ppl": 4.0622, "size": 159.84, "url": "https://github.com/ikawrakow/ik_llama.cpp/discussions/477#discussioncomment-13914222"}, {"name": "IQ2_XSS", "bpw": 2.168, "ppl": 4.0078, "size": 168.55, "url": "https://huggingface.co/ubergarm/DeepSeek-TNG-R1T2-Chimera-GGUF/tree/main/IQ2_XSS"}, {"name": "IQ2_KT", "bpw": 2.188, "ppl": 3.8887, "size": 170.29, "url": "https://huggingface.co/ubergarm/DeepSeek-TNG-R1T2-Chimera-GGUF/tree/main/IQ2_KT"}, {"name": "THIREUS-2.5961", "bpw": 2.5961, "ppl": 3.6768, "size": 204.39, "url": "https://github.com/ikawrakow/ik_llama.cpp/discussions/477#discussioncomment-13883488"}, {"name": "IQ2_KS", "bpw": 2.602, "ppl": 3.6254, "size": 204.91, "url": "https://huggingface.co/ubergarm/DeepSeek-TNG-R1T2-Chimera-GGUF/tree/main/IQ2_KS"}, {"name": "THIREUS-2.6261", "bpw": 2.6261, "ppl": 3.5627, "size": 207.12, "url": "https://github.com/ikawrakow/ik_llama.cpp/discussions/477#discussioncomment-13914222"}, {"name": "THIREUS-3.5753", "bpw": 3.5753, "ppl": 3.3187, "size": 280.66, "url": "https://github.com/ikawrakow/ik_llama.cpp/discussions/477#discussioncomment-13883488"}, {"name": "THIREUS-3.5858", "bpw": 3.5858, "ppl": 3.3063, "size": 281.55, "url": "https://github.com/ikawrakow/ik_llama.cpp/discussions/477#discussioncomment-13914222"}, {"name": "IQ3_KS", "bpw": 3.598, "ppl": 3.3167, "size": 282.60, "url": "https://huggingface.co/ubergarm/DeepSeek-TNG-R1T2-Chimera-GGUF/tree/main/IQ3_KS"} ] } { "title": "Kimi-K2-Instruct-0905 (1026B) Quantization Analysis", "subtitle": "Lower perplexity = Better performance", "model_parameters": 1026000000000, "data": [ {"name": "smol-IQ1_KT", "bpw": 1.832, "ppl": 4.2224, "size": 227.95, "url": "https://huggingface.co/ubergarm/Kimi-K2-Instruct-0905-GGUF/tree/main/smol-IQ1_KT"}, {"name": "smol-IQ2_KS", "bpw": 2.261, "ppl": 3.4977, "size": 281.47, "url": "https://huggingface.co/ubergarm/Kimi-K2-Instruct-0905-GGUF/tree/main/smol-IQ2_KS"}, {"name": "IQ2_KS", "bpw": 2.425, "ppl": 3.2478, "size": 303.61, "url": "https://huggingface.co/ubergarm/Kimi-K2-Instruct-0905-GGUF/tree/main/IQ2_KS"}, {"name": "smol-IQ2_KL", "bpw": 2.755, "ppl": 2.9294, "size": 342.81, "url": "https://huggingface.co/ubergarm/Kimi-K2-Instruct-0905-GGUF/tree/main/smol-IQ2_KL"}, {"name": "IQ2_KL", "bpw": 3.000, "ppl": 2.7993, "size": 371.48, "url": "https://huggingface.co/ubergarm/Kimi-K2-Instruct-0905-GGUF/tree/main/IQ2_KL"}, {"name": "smol-IQ3_KS", "bpw": 3.249, "ppl": 2.5902, "size": 401.87, "url": "https://huggingface.co/ubergarm/Kimi-K2-Instruct-0905-GGUF/tree/main/smol-IQ3_KS"}, {"name": "IQ3_KS", "bpw": 3.520, "ppl": 2.5640, "size": 431.87, "url": "https://huggingface.co/ubergarm/Kimi-K2-Instruct-0905-GGUF/tree/main/IQ3_KS"}, {"name": "UD-Q3_K_XL", "bpw": 3.521, "ppl": 2.6706, "size": 432.02, "url": "https://huggingface.co/unsloth/Kimi-K2-Instruct-0905-GGUF/tree/main/UD-Q3_K_XL"}, {"name": "THIREUS-4.0285", "bpw": 4.034, "ppl": 2.493, "size": 494.61, "url": "https://github.com/ikawrakow/ik_llama.cpp/discussions/715#discussioncomment-14485602"}, {"name": "smol-IQ4_KSS", "bpw": 4.059, "ppl": 2.5185, "size": 498.63, "url": "https://huggingface.co/ubergarm/Kimi-K2-Instruct-0905-GGUF/tree/main/smol-IQ4_KSS"}, {"name": "IQ4_KS", "bpw": 4.633, "ppl": 2.4641, "size": 567.88, "url": "https://huggingface.co/ubergarm/Kimi-K2-Instruct-0905-GGUF/tree/main/IQ4_KS"}, {"name": "smol-IQ5_KS", "bpw": 5.295, "ppl": 2.4526, "size": 651.80, "url": 
"https://huggingface.co/ubergarm/Kimi-K2-Instruct-0905-GGUF/tree/main/smol-IQ5_KS"} ] } { "title": "GLM-4.6 Quantization Analysis", "subtitle": "Lower perplexity = Better performance", "model_parameters": 357000000000, "data": [ {"name": "smol-IQ4_KSS", "bpw": 4.090, "ppl": 3.5911, "size": 169.82, "url": "https://huggingface.co/ubergarm/GLM-4.6-GGUF/tree/main/smol-IQ4_KSS"}, {"name": "smol-IQ1_KT", "bpw": 1.948, "ppl": 5.9034, "size": 82.31, "url": "https://huggingface.co/ubergarm/GLM-4.6-GGUF/tree/main/smol-IQ1_KT"}, {"name": "smol-IQ2_KS", "bpw": 2.359, "ppl": 5.2760, "size": 98.97, "url": "https://huggingface.co/ubergarm/GLM-4.6-GGUF/tree/main/smol-IQ2_KS"}, {"name": "IQ2_KL", "bpw": 3.070, "ppl": 4.1456, "size": 129.13, "url": "https://huggingface.co/ubergarm/GLM-4.6-GGUF/tree/main/IQ2_KL"}, {"name": "IQ3_KS", "bpw": 3.573, "ppl": 3.6427, "size": 150.98, "url": "https://huggingface.co/ubergarm/GLM-4.6-GGUF/tree/main/IQ3_KS"}, {"name": "IQ4_KS", "bpw": 4.646, "ppl": 3.5309, "size": 196.27, "url": "https://huggingface.co/ubergarm/GLM-4.6-GGUF/tree/main/IQ4_KS"}, {"name": "IQ4_K", "bpw": 5.001, "ppl": 3.4758, "size": 210.95, "url": "https://huggingface.co/ubergarm/GLM-4.6-GGUF/tree/main/IQ4_K"}, {"name": "THIREUS-5.5774bpw", "bpw": 5.5774, "ppl": 3.4486, "size": 234.10, "url": "https://github.com/ikawrakow/ik_llama.cpp/discussions/715#discussioncomment-14572398"}, {"name": "UD-Q5_K_XL(unsloth)", "bpw": 5.6471, "ppl": 3.4807, "size": 238.97, "url": "https://huggingface.co/unsloth/GLM-4.6-GGUF/tree/main/UD-Q5_K_XL"}, {"name": "IQ5_K", "bpw": 5.997, "ppl": 3.4428, "size": 254.10, "url": "https://huggingface.co/ubergarm/GLM-4.6-GGUF/tree/main/IQ5_K"}, {"name": "Q8_0", "bpw": 8.505, "ppl": 3.4471, "size": 359.26, "url": "https://huggingface.co/ubergarm/GLM-4.6-GGUF/tree/main/Q8_0"}, {"name": "BF16", "bpw": 16.003, "ppl": 3.4454, "size": 672.09, "url": "https://huggingface.co/ubergarm/GLM-4.6-GGUF"} ] } { "title": "Qwen3-Coder-480B-A35B-Instruct Quantization Analysis", "subtitle": "Lower perplexity = Better performance", "model_parameters": 480000000000, "data": [ {"name": "IQ1_KT", "bpw": 1.945, "ppl": 6.3370, "size": 108.26, "url": "https://huggingface.co/ubergarm/Qwen3-Coder-480B-A35B-Instruct-GGUF/tree/main/IQ1_KT"}, {"name": "IQ2_KS", "bpw": 2.578, "ppl": 5.6658, "size": 142.90, "url": "https://huggingface.co/ubergarm/Qwen3-Coder-480B-A35B-Instruct-GGUF/tree/main/IQ2_KS"}, {"name": "IQ2_K", "bpw": 2.588, "ppl": 5.6578, "size": 143.49, "url": "https://huggingface.co/ubergarm/Qwen3-Coder-480B-A35B-Instruct-GGUF/tree/main/IQ2_K"}, {"name": "IQ2_KL", "bpw": 3.034, "ppl": 5.4113, "size": 169.76, "url": "https://huggingface.co/ubergarm/Qwen3-Coder-480B-A35B-Instruct-GGUF/tree/main/IQ2_KL"}, {"name": "IQ3_K", "bpw": 3.865, "ppl": 5.1808, "size": 214.77, "url": "https://huggingface.co/ubergarm/Qwen3-Coder-480B-A35B-Instruct-GGUF/tree/main/IQ3_K"}, {"name": "IQ4_KSS", "bpw": 4.180, "ppl": 5.1579, "size": 236.73, "url": "https://huggingface.co/ubergarm/Qwen3-Coder-480B-A35B-Instruct-GGUF/tree/main/IQ4_KSS"}, {"name": "IQ4_K", "bpw": 4.885, "ppl": 5.1257, "size": 276.05, "url": "https://huggingface.co/ubergarm/Qwen3-Coder-480B-A35B-Instruct-GGUF/tree/main/IQ4_K"}, {"name": "THIREUS-5.1546bpw", "bpw": 5.1546, "ppl": 5.1057, "size": 289.53, "url": "https://github.com/ikawrakow/ik_llama.cpp/discussions/715#discussioncomment-14670424"}, {"name": "IQ5_K", "bpw": 5.900, "ppl": 5.1073, "size": 334.84, "url": "https://huggingface.co/ubergarm/Qwen3-Coder-480B-A35B-Instruct-GGUF/tree/main/IQ5_K"}, {"name": 
"Q8_0", "bpw": 8.503, "ppl": 5.0975, "size": 480.12, "url": "https://huggingface.co/ubergarm/Qwen3-Coder-480B-A35B-Instruct-GGUF/tree/main/Q8_0"} ] }CODE: #477 (comment)