convert: support Mistral 3 Large MoE #17730
base: master
Conversation
So far so good with this, in a couple of hours I will be able to test generation.
Seems to work and produce coherent results!
This PR still needs to be cleaned up before it is ready for review 😅
```python
# remap hparams from Mistral MoE format to DeepseekV2 format
# we do it this way to be able to reuse the DeepseekV2Model set_gguf_parameters logic
```
Somewhat ugly but an acceptable trade-off.
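To make the trade-off concrete, here is a minimal, self-contained sketch of what such an hparams remap can look like. It is not the PR's actual code, and the key names on both sides are assumptions for illustration only:

```python
# Hypothetical sketch only; the real key names used by the PR may differ.
# Idea: rename Mistral MoE hparams to their DeepseekV2 equivalents so that the
# existing DeepseekV2Model set_gguf_parameters logic can be reused unchanged.
def remap_mistral_moe_to_deepseek(hparams: dict) -> dict:
    """Return a copy of hparams with Mistral-style keys renamed to DeepseekV2-style keys."""
    rename = {
        # assumed Mistral key   -> assumed DeepseekV2 key
        "num_experts":          "n_routed_experts",
        "expert_hidden_dim":    "moe_intermediate_size",
        "experts_per_token":    "num_experts_per_tok",
    }
    remapped = dict(hparams)
    for src, dst in rename.items():
        if src in remapped:
            remapped[dst] = remapped.pop(src)
    return remapped
```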
@ngxson Thank you so much for this. I've tried your Q4_K_M, it seems to be working just fine. Is there any other setting or change needed for the conversion?
It disappeared?? 👀 I can re-upload if necessary I guess .. Only difference is using
Yeah, I've used the mistral format. Then I guess I have a corrupted bf16 version (I cannot think of anything else): https://github.com/csabakecskemeti/ministral-3_dequantizer_fp8-bf16
I can see it here: https://huggingface.co/mistralai/Mistral-Large-3-675B-Instruct-2512-BF16
You're right, they just removed it from the collection (if it was ever there :p), that's where I was looking. My bad.
It looks like @ngxson forgot |
CISC left a comment:
@csabakecskemeti This should work.
```python
name = name.replace(".qscale_act", ".input_scale")
if name.endswith(".qscale_weight"):
    name = name.replace(".qscale_weight", ".weight_scale")
if ".experts." in name:
```
| if ".experts." in name: | |
| if ".wkv_b." in name: | |
| name = name.replace(".wkv_b.", ".kv_b_proj.") | |
| if ".experts." in name: |
This change gave me:
ValueError: Can not map tensor 'layers.32.attention.k_b_proj.weight'
Yeah, you need the changes below as well (cannot be applied directly because GitHub's "new experience" is useless).
Working so far with the other change included :)
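Putting the review comments above together, the rename step could be collected into one small helper. A minimal sketch, where only the replace rules come from the snippets above and the function itself is hypothetical:

```python
# Hypothetical helper, not the PR's actual code: apply the tensor-name rewrites
# discussed in this thread before handing the name to the DeepseekV2 tensor mapping.
def remap_tensor_name(name: str) -> str:
    # scale tensors get the names the existing mapping expects
    name = name.replace(".qscale_act", ".input_scale")
    if name.endswith(".qscale_weight"):
        name = name.replace(".qscale_weight", ".weight_scale")
    # the joint kv projection is named wkv_b in the Mistral checkpoint
    if ".wkv_b." in name:
        name = name.replace(".wkv_b.", ".kv_b_proj.")
    return name
```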
```python
MODEL_TENSOR.ATTN_KV_B: (
    "model.layers.{bid}.self_attn.kv_b_proj",  # deepseek2
    "layers.{bid}.attention.wkv_b",            # mistral-large
),

MODEL_TENSOR.ATTN_K_B: (
    "model.layers.{bid}.self_attn.k_b_proj",   # deepseek2
),

MODEL_TENSOR.ATTN_V_B: (
    "model.layers.{bid}.self_attn.v_b_proj",   # deepseek2
),
```
Suggested change:

```python
MODEL_TENSOR.ATTN_KV_B: (
    "model.layers.{bid}.self_attn.kv_b_proj",  # deepseek2
),
MODEL_TENSOR.ATTN_K_B: (
    "model.layers.{bid}.self_attn.k_b_proj",   # deepseek2
    "layers.{bid}.attention.k_b_proj",         # mistral-large
),
MODEL_TENSOR.ATTN_V_B: (
    "model.layers.{bid}.self_attn.v_b_proj",   # deepseek2
    "layers.{bid}.attention.v_b_proj",         # mistral-large
),
```
GitHub will mess up the diff here, but you get the gist.
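For readers not familiar with gguf-py's tensor map, the entries above are per-block name templates. The following stand-alone snippet is only a rough approximation of that lookup (the real one lives in gguf-py's TensorNameMap; the template dict and block count here are simplified assumptions), showing why the tensor from the earlier ValueError resolves once the mistral-large patterns are added:

```python
# Rough illustration only: expand the {bid} templates and match a checkpoint
# tensor name against them, returning a GGUF-style block tensor name.
templates = {
    "attn_k_b": ("model.layers.{bid}.self_attn.k_b_proj", "layers.{bid}.attention.k_b_proj"),
    "attn_v_b": ("model.layers.{bid}.self_attn.v_b_proj", "layers.{bid}.attention.v_b_proj"),
}

def lookup(name: str, n_blocks: int) -> str | None:
    """Resolve a checkpoint tensor name to a GGUF block tensor name, or None if unmapped."""
    base = name.removesuffix(".weight")
    for gguf_suffix, patterns in templates.items():
        for pattern in patterns:
            for bid in range(n_blocks):
                if base == pattern.format(bid=bid):
                    return f"blk.{bid}.{gguf_suffix}"
    return None

# The tensor from the ValueError above now matches (n_blocks is a placeholder):
print(lookup("layers.32.attention.k_b_proj.weight", n_blocks=64))  # -> blk.32.attn_k_b
```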
Ah I guess I needed this one too
Hmm yeah, I didn't notice that the changes were overwritten by the git merge. Thanks!
(feel free to ping me when these changes are OK to be added)
It's working so far @ngxson, but I can wait until I have a quant I can run and confirm with that first.
WIP, the code is quite ugly for now, but I just want to get it to work.
Remember to convert with the --mistral-format argument, as the weights are not yet transformers-compatible.
Output: the F16 weights are 1.35 terabytes, the Q8_0 weights are 716 GB, and I don't have enough hardware to test it. Edit: thanks @bartowski1182 for testing it!
NOTE: this PR only covers the conversion to GGUF; the C++ code is still missing the Llama 4 scaling to work, but that will be another PR.
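For reference, a conversion invocation along these lines should work, assuming the usual convert_hf_to_gguf.py entry point; the local model path and output file name are placeholders, and --outtype/--outfile are the converter's standard options (only --mistral-format is specific to this checkpoint format):

```sh
# Example invocation; paths and output type are placeholders.
python convert_hf_to_gguf.py /path/to/Mistral-Large-3-675B-Instruct-2512-BF16 \
    --mistral-format \
    --outtype f16 \
    --outfile mistral-large-3-f16.gguf
```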