Releases: turboderp-org/exllamav3
0.0.17
- Fix Mistral3 implementation (supports Ministral models now)
- Fix for REAPed models with arbitrary number of experts
- Various other fixes
Full Changelog: v0.0.16...v0.0.17
0.0.16
- Fix regression breaking tensor-parallel inference
- Allow TP text-model to work with vision tower
Full Changelog: v0.0.15...v0.0.16
0.0.15
- Support Glm4vForConditionalGeneration
- Support Glm4vMoeForConditionalGeneration
- Fix some tokenizer issues
- QoL improvements
Full Changelog: v0.0.14...v0.0.15
0.0.14
- Fix small regression in Gemma and Mistral vision towers
Full Changelog: v0.0.13...v0.0.14
0.0.13
- Support Qwen3-VL and Qwen3-VL MoE
- Minor bugfixes
Full Changelog: v0.0.12...v0.0.13
0.0.12
- Support MiniMaxM2ForCausalLM
- Graphs (reduce CPU overhead)
- Misc. optimizations
- Allow loading FP8 tensors (for quantization only, converted to FP16 on-the-fly)
- Fix some bugs
Full Changelog: v0.0.11...v0.0.12
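As an illustration of the FP8 loading note in 0.0.12, here is a minimal sketch of widening FP8 values to FP16. It assumes the E4M3 "fn" layout (1 sign bit, 4 exponent bits with bias 7, 3 mantissa bits, no infinities), which the release note does not specify. The function names are hypothetical and this is not exllamav3's own code, which may do the conversion quite differently.

```python
import math
import struct


def fp8_e4m3fn_to_float(byte: int) -> float:
    """Decode one FP8 byte, assuming the E4M3 'fn' layout (an assumption)."""
    sign = -1.0 if byte & 0x80 else 1.0
    exponent = (byte >> 3) & 0x0F
    mantissa = byte & 0x07
    if exponent == 0x0F and mantissa == 0x07:
        # The only NaN encoding in this layout; there are no infinities.
        return math.nan
    if exponent == 0:
        # Subnormal: no implicit leading 1, fixed scale of 2^-6.
        return sign * (mantissa / 8.0) * 2.0 ** -6
    # Normal: implicit leading 1, exponent bias of 7.
    return sign * (1.0 + mantissa / 8.0) * 2.0 ** (exponent - 7)


def fp8_bytes_to_fp16_bytes(raw: bytes) -> bytes:
    """Widen a buffer of FP8 bytes to little-endian FP16 (hypothetical helper)."""
    values = [fp8_e4m3fn_to_float(b) for b in raw]
    # struct's 'e' format is IEEE 754 half precision.
    return struct.pack(f"<{len(values)}e", *values)


if __name__ == "__main__":
    print(fp8_e4m3fn_to_float(0x38))
    print(fp8_bytes_to_fp16_bytes(bytes([0x38, 0xC0])).hex())
```

Under this layout every finite FP8 value fits exactly in FP16, so the widening step itself loses no precision.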
0.0.11
- Fix issue with TP loading of models quantized with v0.0.9 or later
Full Changelog: v0.0.10...v0.0.11
0.0.10
- Fix issue preventing AsyncGenerator from working with new requeue option
Full Changelog: v0.0.9...v0.0.10
0.0.9
- Lock MCG and MUL1 multipliers, no longer flag as experimental
- Switch to MCG codebook by default for new models (use --codebook 3inst for previous default)
- Add more calibration data
- Increase default calibration size to 250 rows (use --cal_rows 100 for previous default)
- Fix quantized cache for bsz > 1
- Fix kernel selection on A100
- A few more TP-related fixes
Full Changelog: v0.0.8...v0.0.9
0.0.8
- New GEMM kernel tuning scheme
- Fix banned strings regression
- Fix some memory leaks
- Fix potential stack overflow in cache defrag
Full Changelog: v0.0.7...v0.0.8