Add GLM 4.6 support #814
Conversation
Ahh thanks for opening this, as otherwise I was gonna do it. I'll test it shortly; downloading the new GLM-4.6 safetensors now and about to convert. I'll report back here within a few hours.
See the workaround here, otherwise it will fail to convert from .safetensors! ggml-org/llama.cpp#16361 (comment) Also, per a discussion here: https://huggingface.co/steampunque/GLM-4.5-Air-Hybrid-GGUF/discussions/1#68dd13c2744d0e48b5723179 I think we can convert layer 92 to IQ1_S and shave off a gigabyte or two, since it's unused anyway.

So those nextn layers for MTP are marked as unused. If you want to quantize them very small like that anyway, be careful, as it might throw an error depending on the imatrix data, so you might have to work around that. I'm converting safetensors to GGUF now using mainline llama.cpp, so I will be able to test your PR soon! 🤞
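For reference, a rough sketch of the kind of per-tensor override being discussed, assuming ik_llama.cpp's llama-quantize accepts --custom-q "regex=type" rules; the regex, file names, and base quant type are illustrative placeholders, not taken from this thread:

```bash
# Illustrative sketch, not a verified recipe: quantize normally but force the
# unused blk.92 (nextn / MTP) routed-expert tensors down to IQ1_S.
./build/bin/llama-quantize \
    --custom-q "blk\.92\.ffn_.*_exps\.weight=iq1_s" \
    GLM-4.6-BF16.gguf \
    GLM-4.6-Q4_K_M.gguf \
    Q4_K_M
```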
I see. Makes sense. Since they're not loaded, leaving them at a reasonable quant seems like a good idea. For what it's worth, they didn't throw an error quantizing to IQ1_S.
So far so good, using your PR to calculate the imatrix from the full bf16 currently:

```bash
model=/mnt/data/models/ubergarm/GLM-4.6-GGUF/GLM-160x19B-4.6-BF16-00001-of-00015.gguf
numactl -N 1 -m 1 \
./build/bin/llama-imatrix \
-m "$model" \
-fa \
--no-fused-up-gate \
-f ubergarm-imatrix-calibration-corpus-v02.txt \
-o /mnt/data/models/ubergarm/GLM-4.6-GGUF/imatrix-GLM-4.6-BF16.dat \
--verbosity 1 \
--layer-similarity \
--seed 1337 \
--ctx-size 512 \
-ub 4096 -b 4096 \
--numa numactl \
--threads 128 \
--threads-batch 192 \
--no-mmap
```

```
system_info: n_threads = 128 (n_threads_batch = 192) / 768 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
compute_imatrix: tokenizing the input ..
compute_imatrix: tokenization took 1050.35 ms
compute_imatrix: computing over 814 chunks with batch_size 512
compute_imatrix: 13.13 seconds per pass - ETA 2 hours 58.10 minutes
======================================= HAVE_FANCY_SIMD is defined
[1]17.7095,[2]6.8624,[3]4.4259,[4]3.1997,[5]2.5997,[6]2.2235,[7]2.0004,[8]1.8473,[9]1.8407,
save_imatrix: entry ' blk.48.ffn_gate_exps.weight' has partial data (99.38%) 1 out of 160 experts are missing data Storing **but be aware**
save_imatrix: entry ' blk.48.ffn_down_exps.weight' has partial data (99.38%) 1 out of 160 experts are missing data Storing **but be aware**
save_imatrix: entry ' blk.48.ffn_up_exps.weight' has partial data (99.38%) 1 out of 160 experts are missing data Storing **but be aware**
save_imatrix: stored collected data after 10 chunks in /mnt/data/models/ubergarm/GLM-4.6-GGUF/imatrix-GLM-4.6-BF16.dat
[10]1.7624,[11]1.8774,[12]1.9641,[13]2.0361,[14]2.0966,[15]1.9987,[16]1.9195,[17]1.8662,[18]1.8114,[19]1.7560,
```

Seems like a well-behaved model in terms of routed experts having enough imatrix data, so that is good. I'll confirm the imatrix has all the tensors (except MTP / layer 92, of course) and upload it, then continue quantizing, testing, and releasing. So far so good! Thanks!
The imatrix, run off the bf16, is available here: https://huggingface.co/ubergarm/GLM-4.6-GGUF/blob/main/imatrix-GLM-4.6-BF16.dat It seems to have everything and I am using it to quantize now:

```
load_imatrix: imatrix dataset='ubergarm-imatrix-calibration-corpus-v02.txt'
load_imatrix: loaded 1001 importance matrix entries from /mnt/data/models/ubergarm/GLM-4.6-GGUF/imatrix-GLM-4.6-BF16.dat computed on 814 chunks
prepare_imatrix: have 1001 importance matrix entries
```
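For completeness, a sketch of what that quantization step might look like with the freshly computed imatrix; the output name, quant type, and thread count are placeholders, and only the imatrix and bf16 paths come from the thread:

```bash
# Illustrative only: feed the computed imatrix into llama-quantize.
./build/bin/llama-quantize \
    --imatrix /mnt/data/models/ubergarm/GLM-4.6-GGUF/imatrix-GLM-4.6-BF16.dat \
    /mnt/data/models/ubergarm/GLM-4.6-GGUF/GLM-160x19B-4.6-BF16-00001-of-00015.gguf \
    /mnt/data/models/ubergarm/GLM-4.6-GGUF/GLM-4.6-IQ2_KS.gguf \
    IQ2_KS 128
```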
Still get a curious message when offloading regarding those tensors. Model works though.
When GLM-4.5 support was added, a new concept of unused tensors was introduced for the MTP (nextn) layers. So the strategy was to go ahead and quantize them without any imatrix data, include them in the GGUF files, but mark them as unused. So the tensors only take up space on disk, but are not loaded into RAM/VRAM at inference time. In the future, if MTP support is added, these quants will already have some data available there without needing to re-download the entire model.
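One way to confirm those tensors are still physically present in the file, even though the loader skips them, is to dump the GGUF tensor list; this assumes the gguf-dump script from llama.cpp's gguf Python package, and the file path is a placeholder:

```bash
# Illustrative check: list the nextn / blk.92 tensors that sit in the GGUF on disk
# but are skipped at load time. gguf-dump ships with the gguf Python package
# (pip install gguf); the script name may differ between versions.
gguf-dump /path/to/GLM-4.6-IQ2_KS.gguf | grep -E 'blk\.92\.|nextn'
```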
I remember that part, but there were new tensors for 4.6. None of the 4.5 tensors have the second one.
I just tested GLM-4.6-smol-IQ2_KS on my local rig and see similar messages, e.g.:

```
Tensor blk.92.ffn_gate_exps.weight buffer type overriden to CPU
model has unused tensor blk.92.ffn_gate_exps.weight (size = 344555520 bytes) -- ignoring
Tensor blk.92.ffn_down_exps.weight buffer type overriden to CPU
model has unused tensor blk.92.ffn_down_exps.weight (size = 345702400 bytes) -- ignoring
Tensor blk.92.ffn_up_exps.weight buffer type overriden to CPU
model has unused tensor blk.92.ffn_up_exps.weight (size = 344555520 bytes) -- ignoring
```

Going back and looking at GLM-4.5, it does have routed exps layers on blk 92 as well. My impression is the logging is just the tensor override being applied before the unused check kicks in. Perhaps you were not running with the same overrides before? Anyway, it seems okay to me, but if you think there is an issue let me know. Thanks!
I am using my exact same -ot overrides for 4.6 as I had for 4.5. It's definitely new. Not sure why it tries to override these layers but not the other unused ones. I only bring it up in case there is some place the ignore is skipped in the code, leading to a possible bug, i.e. they get loaded into sysram and then unused.
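For readers following along, this is the kind of -ot / --override-tensor rule being discussed, assuming the usual "regex=buffer" syntax; the regex and model path are illustrative, not the exact override used above:

```bash
# Illustrative: keep all routed-expert tensors in system RAM while offloading the
# rest of the layers to the GPU. Regex and model path are placeholders.
./build/bin/llama-server \
    -m GLM-4.6-smol-IQ2_KS.gguf \
    -ngl 99 \
    -ot "ffn_.*_exps\.weight=CPU"
```

A rule like this also matches the blk.92 expert tensors, which would explain the "buffer type overriden to CPU" line immediately followed by the "unused ... ignoring" line.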
Has anyone figured out how to enable thinking when using the OpenAI-compatible API? Thinking works fine on the web UI, but not via the chat API, where I always get a reply with no thinking.
That's strange. I had to set the reasoning budget to 0, otherwise it would think in chat completions. On text completions it's a matter of using the correct preset.
Thanks @Ph0rk0z, you are right! It is enabled by default; for some reason my system prompt disables it 😳, so now I'm trying to figure out which part of my system prompt triggers this! I don't have any /no_think or similar...

Edit: Looks like the model decides on its own that it shouldn't think... especially when the system prompt is quite long and detailed about answer formatting and how to process information.
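For isolating this, a bare-bones request against the OpenAI-compatible endpoint makes it easy to compare behaviour with and without the long system prompt; the port and prompt text here are placeholders:

```bash
# Minimal chat-completions request to check whether the reply contains a <think>
# block. Swap the system message for the long formatting prompt to compare.
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [
          {"role": "system", "content": "You are a helpful assistant."},
          {"role": "user", "content": "Briefly, why is the sky blue?"}
        ]
      }'
```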
Not sure if related, but I've heard the default (GGUF) chat template is bugged; unsloth fixed it in their upload: https://huggingface.co/unsloth/GLM-4.6-GGUF
Fixes: #812
By marking a few tensors as optional. GLM 4.6 loads and runs inference for me, if you wish to test: https://huggingface.co/Downtown-Case/GLM-4.6-128GB-RAM-IK-GGUF
Plucked from: ggml-org/llama.cpp#16359
Random aside, but this might be a good opportunity to make sure the MTP (multi-token prediction) tensors are never loaded into RAM, since they are not used anyway? Their size is on the order of ~2 GB quantized, I think, which is significant when squeezing in a >3bpw quant.