server : add Anthropic Messages API support #1012
Conversation
LGTM, but I see this comment in the mainline PR. Is the consensus that this is valuable?
@ikawrakow: Just merge it. It is very nice.
While it's possible to use a proxy that maps Anthropic's API to the OpenAI API, I still think native Anthropic API support is valuable. Anthropic provides a standardized mechanism for transmitting reasoning segments, whereas OpenAI's Chat Completions API has no standardized equivalent. For models that depend on interleaved reasoning (e.g. minimax-m2 and kimi-k2) this difference matters. Leveraging Anthropic's native Messages API can provide better model behavior and, in some cases, improved performance. It also enables claude-code to work with ik_llama.cpp out of the box, without requiring manual setup of additional services.
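For reference, a minimal sketch of what the Anthropic-style exchange looks like, assuming the server exposes Anthropic's `/v1/messages` route locally (the URL, port, model name, and thinking budget below are placeholder assumptions; reasoning-block support varies by model and template):

```python
import requests

# Hypothetical local endpoint; the actual route and port depend on the server setup.
URL = "http://localhost:8080/v1/messages"

payload = {
    "model": "minimax-m2",  # placeholder model name
    "max_tokens": 1024,
    # Anthropic's standardized switch for requesting reasoning ("thinking") blocks.
    "thinking": {"type": "enabled", "budget_tokens": 512},
    "messages": [
        {"role": "user", "content": "Why is the sky blue?"},
    ],
}

resp = requests.post(URL, json=payload).json()

# "content" is a list of typed blocks; reasoning arrives as a first-class
# "thinking" block rather than an ad-hoc vendor-extension field.
for block in resp["content"]:
    if block["type"] == "thinking":
        print("[reasoning]", block["thinking"])
    elif block["type"] == "text":
        print("[answer]", block["text"])
```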
@hksdpc255: But have you seen this? npm for the CLI tool? The npm that is getting hacked almost every week? https://www.npmjs.com/package/@kilocode/cli?activeTab=dependencies Are you sure it's a good idea?
I saw RooCode just today. I do believe absolutely anything will be better than RooCode. :)
But what is the consensus regarding that? Is it actually necessary to send the reasoning content back? I have the impression that it might be redundant. The downside is that prompt caching will not work as fast, but the context will be optimized (compressed) without the reasoning.
At least GLM-4.5, Minimax-M2, and Kimi-K2-Thinking benefit from this. For Minimax-M2 and Kimi-K2-Thinking, the official documentation explicitly states that interleaved reasoning support is required; without it, the model's performance degrades. This makes Anthropic-style reasoning blocks more than just a convenience, since they're part of the intended interface for these models. Edit: the gpt-oss series also needs reasoning content sent back.
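For illustration, this is roughly what "sending reasoning back" means on the wire in Anthropic's block format (a hand-written sketch; the tool name and block contents are made up):

```python
# Multi-turn history that preserves the assistant's reasoning block.
# With interleaved-reasoning models, dropping the "thinking" block from
# prior assistant turns is what degrades performance.
messages = [
    {"role": "user", "content": "List the files in /tmp."},
    {
        "role": "assistant",
        "content": [
            # Reasoning is echoed back verbatim as a typed block...
            {"type": "thinking", "thinking": "The user wants a directory listing; call the ls tool."},
            # ...interleaved with the tool call it motivated.
            {"type": "tool_use", "id": "call_1", "name": "ls", "input": {"path": "/tmp"}},
        ],
    },
    {
        "role": "user",
        "content": [
            {"type": "tool_result", "tool_use_id": "call_1", "content": "a.txt  b.txt"},
        ],
    },
]
```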
Regarding context length: based on the official chat templates for GLM-4.5, Minimax-M2, and Kimi-K2-Thinking, all three models automatically strip out past reasoning blocks, preserving only the reasoning associated with the current user request (a rough sketch of this stripping follows after this exchange), so the additional context overhead is actually modest. Also, context management is typically handled by the client-side AI agent, not the model itself. For example, claude-code does not send the full history back to the server every time; it performs its own closed-source "smart" context selection to determine what needs to be included. Thus, reasoning content handling should not materially increase the true context load.
I didn't get it. They're doing what? It's the server side we are talking about, right? [EDIT]:
They do this in their chat templates, so ik_llama.cpp/llama.cpp picks it up automatically.
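To make the stripping concrete, here is a rough Python analogue of what those chat templates do at render time. It is not the actual Jinja code, and the exact per-model rules differ (e.g. how tool-result turns are counted), so treat `strip_past_reasoning` and `is_real_user_turn` as illustrative names only:

```python
def is_real_user_turn(msg):
    """A user turn that carries actual user text, not just tool results."""
    content = msg["content"]
    if isinstance(content, list):
        return any(block.get("type") != "tool_result" for block in content)
    return True  # plain-string user content


def strip_past_reasoning(messages):
    """Drop "thinking" blocks from assistant turns that precede the last
    real user request, keeping only the current round's reasoning.
    Assumes the history contains at least one real user turn."""
    last_user = max(i for i, m in enumerate(messages)
                    if m["role"] == "user" and is_real_user_turn(m))
    out = []
    for i, msg in enumerate(messages):
        if msg["role"] == "assistant" and i < last_user and isinstance(msg["content"], list):
            msg = {**msg,
                   "content": [b for b in msg["content"] if b.get("type") != "thinking"]}
        out.append(msg)
    return out
```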
How is Anthropic at handling media vs. OpenAI? That may be a useful bit too. IIRC their handling of assistant prefill and continuations was better. I've never heard of current models reusing old reasoning beyond something new that's being tried.
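For anyone unfamiliar with assistant prefill: in the Messages API a conversation may end with a partial assistant turn, and the model continues from exactly that text. A tiny sketch (contents made up):

```python
# Assistant "prefill": the history ends with a partial assistant message,
# and the model's reply is forced to start with that prefix.
messages = [
    {"role": "user", "content": "Return the answer as JSON."},
    {"role": "assistant", "content": "{\"answer\":"},
]
```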
@magikRUKKOLA Realistically though, only GPT-5.1 High cuts it. Gemini 3 can only do brand-new frontend projects. The guy at the Cursor forum also commented that you shouldn't let Gemini 3 touch an existing GPT-5.1 code base because it can't do sheet. Sadly this patch is already merged in mainline, but not yet here in ik_llama. Ironic.

Ported from ggml-org/llama.cpp#17425
Closes #1010