Releases · turboderp-org/exllamav3

07 Dec 16:51

0.0.17 Latest


Fix Mistral3 implementation (supports Ministral models now)
Fix for REAPed models with arbitrary number of experts
Various other fixes

Full Changelog: v0.0.16...v0.0.17

Assets 27

exllamav3-0.0.17+cu128.torch2.7.0-cp310-cp310-linux_x86_64.whl

sha256:6b3e12f8e72dbbd4dba37bfd0d703fc84262ec6a7e977bdf721042975f763f74

140 MB 2025-12-07T17:17:12Z
exllamav3-0.0.17+cu128.torch2.7.0-cp310-cp310-win_amd64.whl

sha256:fb2f89115aeafd2b8784a13920bb5f41c432ed5168559c9a077e7d01f9ae3c3f

128 MB 2025-12-07T17:51:39Z
exllamav3-0.0.17+cu128.torch2.7.0-cp311-cp311-linux_x86_64.whl

sha256:f09dffdefd919165571f44922edda5da62de176a9f1e966eb40f091e6362ac73

140 MB 2025-12-07T17:16:05Z
exllamav3-0.0.17+cu128.torch2.7.0-cp311-cp311-win_amd64.whl

sha256:93679dc0786b2b17a917e310a27509a460a520f593f5d3dc5b336c884d186fbc

128 MB 2025-12-07T17:18:41Z
exllamav3-0.0.17+cu128.torch2.7.0-cp312-cp312-linux_x86_64.whl

sha256:fefddd4945c97fbd236c58d3c64524edc8a236741fba4c8f78ae82b0b2a20a2e

140 MB 2025-12-07T17:17:19Z
exllamav3-0.0.17+cu128.torch2.7.0-cp312-cp312-win_amd64.whl

sha256:97b8469ce27fe284c37773275b26c796e606f5a26278c23ac2e767ae392b0ed8

128 MB 2025-12-07T17:18:34Z
exllamav3-0.0.17+cu128.torch2.7.0-cp313-cp313-linux_x86_64.whl

sha256:9d5572a13f1b7ae74cdfbdf46105882aa0cb3b014e59b9c84b7ec96565240bd0

140 MB 2025-12-07T17:16:22Z
exllamav3-0.0.17+cu128.torch2.7.0-cp313-cp313-win_amd64.whl

sha256:3d3143a5ced87658a485cb9bc70b774b8a6ef276e0ff752dac728e1cc0718276

128 MB 2025-12-07T17:20:39Z
exllamav3-0.0.17+cu128.torch2.8.0-cp310-cp310-linux_x86_64.whl

sha256:51d2b848b38857f5f2a275b8616b6d43ac668919f5968054cb13118668be215d

140 MB 2025-12-07T17:19:14Z
exllamav3-0.0.17+cu128.torch2.8.0-cp310-cp310-win_amd64.whl

sha256:1afc4e6d2115ab25d1d1d2bde3a973fcd226f47498c91223d14516dfce98992e

128 MB 2025-12-07T17:51:16Z
Source code (zip)

2025-12-07T16:47:20Z
Source code (tar.gz)

2025-12-07T16:47:20Z
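
Each wheel above is listed with a `sha256:` digest, so a download can be checked against it before installing. The following is a minimal sketch using only the Python standard library; the helper names `sha256_of` and `matches_listed_digest` are illustrative and not part of exllamav3.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps calling f.read() until it returns b"" (EOF)
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_digest(path, listed):
    """Compare a file against a digest written as on the release page ("sha256:<hex>")."""
    expected = listed.strip().lower()
    if expected.startswith("sha256:"):
        expected = expected[len("sha256:"):]
    return sha256_of(path) == expected
```

For example, `matches_listed_digest("exllamav3-0.0.17+cu128.torch2.7.0-cp312-cp312-linux_x86_64.whl", "sha256:fefddd4945c97fbd236c58d3c64524edc8a236741fba4c8f78ae82b0b2a20a2e")` should return `True` for an intact download of that asset, assuming the file was saved under its original name in the current directory.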

25 Nov 16:57

0.0.16

Fix regression breaking tensor-parallel inference
Allow TP text-model to work with vision tower

Full Changelog: v0.0.15...v0.0.16

Assets 19

16 Nov 12:55

0.0.15

Support Glm4vForConditionalGeneration
Support Glm4vMoeForConditionalGeneration
Fix some tokenizer issues
Quality-of-life improvements

Full Changelog: v0.0.14...v0.0.15

Assets 19

10 Nov 00:38

0.0.14

Fix small regression in Gemma and Mistral vision towers

Full Changelog: v0.0.13...v0.0.14

Assets 19

09 Nov 22:04

0.0.13

Support Qwen3-VL and Qwen3-VL MoE
Minor bugfixes

Full Changelog: v0.0.12...v0.0.13

Assets 19

01 Nov 17:27

0.0.12

Support MiniMaxM2ForCausalLM
Graphs (reduce CPU overhead)
Misc. optimizations
Allow loading FP8 tensors (for quantization only, converted to FP16 on-the-fly)
Fix some bugs

Full Changelog: v0.0.11...v0.0.12

Assets 19

17 Oct 15:35

0.0.11

Fix issue with TP loading of models quantized with v0.0.9 or later

Full Changelog: v0.0.10...v0.0.11

Assets 19

15 Oct 12:51

0.0.10

Fix issue preventing AsyncGenerator from working with new requeue option

Full Changelog: v0.0.9...v0.0.10

Assets 19

13 Oct 21:42

0.0.9

Lock MCG and MUL1 multipliers, no longer flag as experimental
Switch to MCG codebook by default for new models (use --codebook 3inst for previous default)
Add more calibration data
Increase default calibration size to 250 rows (use --cal_rows 100 for previous default)
Fix quantized cache for batch size > 1
Fix kernel selection on A100
A few more TP-related fixes

Full Changelog: v0.0.8...v0.0.9

Assets 19

09 Oct 22:12

0.0.8

New GEMM kernel tuning scheme
Fix banned strings regression
Fix some memory leaks
Fix potential stack overflow in cache defrag

Full Changelog: v0.0.7...v0.0.8

Assets 19