Releases · ggml-org/llama.cpp

29 Nov 14:27

7d2add5

b7197

sycl : support to malloc memory on device more than 4GB, update the d…

Assets 20

29 Nov 14:23

github-actions

b7196

f698a79

b7196

ggml: replace hwcap with riscv_hwprobe for RVV detection (#17567)

Signed-off-by: Wang Yang <yangwang@iscas.ac.cn>

Assets 20

29 Nov 09:01

github-actions

b7195

47a268e

b7195

Vulkan: MMVQ Integer Dot K-Quant and MUL_MAT_ID support (#16900)

* vulkan: split mul_mmq_funcs for mul_mat_vecq use

* add mxfp4 mmvq

* add q2_k mmvq

* add q3_k mmvq

* add q4_k and q5_k mmvq

* add q6_k mmvq

* handle 4x4 quants per mmvq thread

* enable MUL_MAT_ID mmvq support

* enable subgroup optimizations for mul_mat_vec_id shaders

* device tuning

* request prealloc_y sync after quantization

* fix indentation

* fix llvmpipe test failures

* fix mul_mat_id mmvq condition

* fix unused variable warning

Assets 20

29 Nov 08:30

github-actions

b7194

59d8d4e

b7194

vulkan: improve topk perf for large k, fix overflow in unit tests (#1…

Assets 20

28 Nov 20:21

github-actions

b7192

03914c7

b7192

common : move all common_chat_parse_* to chat-parser.cpp. (#17481)

Assets 20

28 Nov 19:57

github-actions

b7191

3ce7a65

b7191

server: fix: /metrics endpoint returning JSON-escaped Prometheus form…

Assets 20

28 Nov 18:39

github-actions

b7190

e072b20

b7190

ggml : add GGML_SCHED_NO_REALLOC option to disable reallocations in g…

Assets 20

28 Nov 17:42

github-actions

b7189

c6f7a42

b7189

[MUSA] enable fp16/fast_fp16/bf16_mma on PH1 (#17551)

* [MUSA] enable fp16/fast_fp16/bf16_mma on PH1

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* Update ggml/src/ggml-cuda/fattn-vec.cuh

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* Update ggml/src/ggml-cuda/fattn-vec.cuh

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* Update ggml/src/ggml-cuda/fattn-tile.cuh

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* Address review comments

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

---------

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

Assets 20

28 Nov 15:19

github-actions

b7188

2e7ef98

b7188

ggml-cuda: add stricter checking for fusion (#17568)

* ggml-cuda: make conditions for fusion more explicit

* ggml-cuda: remove size check as std::equal already does it

Assets 20

28 Nov 14:09

github-actions

b7187

ddf9f94

b7187

server : add Anthropic Messages API support (#17570)

* server : add Anthropic Messages API support

* remove -@pytest.mark.slow from tool calling/jinja tests

* server : remove unused code and slow/skip on test_anthropic_vision_base64_with_multimodal_model in test_anthropic_api.py

* server : removed redundant n field logic in anthropic_params_from_json

* server : use single error object instead of error_array in streaming response handler for /v1/chat/completions and use unordered_set instead of set in to_json_anthropic_stream()

* server : refactor Anthropic API to use OAI conversion

* make sure basic test always go first

* clean up

* clean up api key check, add test

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>

Assets 20

Releases: ggml-org/llama.cpp

b7197

Uh oh!

b7196

Uh oh!

b7195

Uh oh!

b7194

Uh oh!

b7192

Uh oh!

b7191

Uh oh!

b7190

Uh oh!

b7189

Uh oh!

b7188

Uh oh!

b7187

Uh oh!