Conversation

@hksdpc255
Contributor

@hksdpc255 hksdpc255 commented Nov 26, 2025

Ported from ggml-org/llama.cpp#17425

Closes #1010

@hksdpc255 hksdpc255 marked this pull request as ready for review November 26, 2025 09:09
@hksdpc255
Contributor Author

Tested with Qwen3-Coder-30B:
[screenshot]

@ikawrakow
Owner

LGTM, but I see this comment in the mainline PR.

Is the consensus that this is valuable?

@whatever1983

@ikawrakow:
yeah, it's very valuable. You shouldn't refer to mainline commentators because they don't get it. I recently dumped Kilocode for Claude Code. Code quality is about 20% higher. Prompts and reasoning are more concise. You should try it. Right now, if you have to use Claude Code with OpenAI endpoints, you have to use Claude Code Router from GLM. It is very hard to configure to use as a openai->cc proxy. (Took me 2 days to figure out)

Just merge it. It is very nice.

@hksdpc255
Contributor Author

While it’s possible to use a proxy that maps Anthropic’s API to the OpenAI API, I still think native Anthropic API support is valuable. Anthropic provides a standardized mechanism for transmitting reasoning segments, whereas OpenAI’s reasoning_content field is not consistently implemented across clients (OpenAI’s official Codex even uses a different field for reasoning). As a result, Anthropic clients reliably send reasoning blocks back to the server, but OpenAI-style clients often drop them.

For models that depend on interleaved reasoning (e.g. Minimax-M2 and Kimi-K2), this difference matters. Leveraging Anthropic’s native Messages API can produce better model behavior and, in some cases, improved performance. It also lets claude-code work with ik_llama.cpp out of the box, without requiring manual setup of additional services.
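To make the difference concrete, here is a minimal Python sketch (not code from this PR: the thinking/text content-block shapes follow the public Anthropic Messages schema, reasoning_content is the common OpenAI-style extension field, and the helper function is hypothetical):

```python
# Anthropic Messages API: reasoning is a first-class "thinking" content
# block inside the assistant message, so clients replay it verbatim on
# the next turn.
anthropic_turn = {
    "role": "assistant",
    "content": [
        {"type": "thinking", "thinking": "Consider edge cases first..."},
        {"type": "text", "text": "Here is the fix."},
    ],
}

# OpenAI-style: reasoning rides in a non-standard side field that many
# clients simply drop when they resend the history.
openai_turn = {
    "role": "assistant",
    "content": "Here is the fix.",
    "reasoning_content": "Consider edge cases first...",  # often lost
}

def echoed_reasoning(turn):
    """Return the reasoning a typical client would send back, if any."""
    if isinstance(turn.get("content"), list):
        for block in turn["content"]:
            if block.get("type") == "thinking":
                return block["thinking"]
    # OpenAI-style clients frequently strip fields they don't know about.
    return None

print(echoed_reasoning(anthropic_turn))  # reasoning survives the round trip
print(echoed_reasoning(openai_turn))     # None: reasoning was dropped
```

The point is structural: in the Anthropic schema the reasoning is inside the message content a client must echo back, while reasoning_content is an optional sibling field a client can silently discard.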

@whatever1983

@hksdpc255:
I am on your side. IK will eventually get it, since he's technical, unlike mainline, which is turning into an illegal CANN/Vulkan crappy-hardware framework playground.

@magikRUKKOLA

@whatever1983

I recently dumped Kilocode for Claude Code.

But have you seen this?

npm install -g @kilocode/cli

npm for the cli tool? The npm which is getting hacked almost every week?
There are more than 100 dependencies.

https://www.npmjs.com/package/@kilocode/cli?activeTab=dependencies

Are you sure it's a good idea?

Code quality is about 20% higher. Prompts and reasoning are more concise.

I looked at RooCode just today. I do believe absolutely anything would be better than RooCode. :)

@magikRUKKOLA

@hksdpc255

As a result, Anthropic clients reliably send reasoning blocks back to the server, but OpenAI-style clients often drop them.

But what is the consensus regarding that? Is it actually necessary to send the reasoning content back? My impression is that it might be redundant. The downside is that prompt caching will not be as effective, but the context gets optimized (compressed) without the reasoning.

@hksdpc255
Contributor Author

hksdpc255 commented Nov 27, 2025

Is it actually necessary to send the reasoning content back?

At least GLM-4.5, Minimax-M2, and Kimi-K2-Thinking benefit from this. For Minimax-M2 and Kimi-K2-Thinking, the official documentation explicitly states that interleaved reasoning support is required; without it, the model’s performance degrades. This makes Anthropic-style reasoning blocks more than just a convenience, since they’re part of the intended interface for these models.

Edit: the gpt-oss series also needs reasoning content sent back.
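As a rough illustration of what "interleaved reasoning" means on the wire, here is an assumed multi-turn history in Anthropic Messages form (block types follow the public schema; the run_tests tool and the exact texts are made up):

```python
# A client replaying this history keeps the assistant's "thinking" block
# interleaved between the user turn and the tool call, which is what the
# models named above are said to expect.
history = [
    {"role": "user", "content": [
        {"type": "text", "text": "Run the tests and fix any failures."},
    ]},
    {"role": "assistant", "content": [
        {"type": "thinking", "thinking": "I should run the test suite first."},
        {"type": "tool_use", "id": "t1", "name": "run_tests", "input": {}},
    ]},
    {"role": "user", "content": [
        {"type": "tool_result", "tool_use_id": "t1", "content": "2 failures"},
    ]},
    # The next assistant turn is generated with the earlier thinking block
    # still present in context, instead of being silently stripped.
]

def has_interleaved_reasoning(msgs):
    """True if any replayed message still carries a thinking block."""
    return any(
        block.get("type") == "thinking"
        for m in msgs if isinstance(m.get("content"), list)
        for block in m["content"]
    )

assert has_interleaved_reasoning(history)
```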

@hksdpc255
Contributor Author

Regarding context length: based on the official chat templates for GLM-4.5, Minimax-M2, and Kimi-K2-Thinking, all three models automatically strip out past reasoning blocks, preserving only the reasoning associated with the current user request. So the additional context overhead is actually quite small.

Also, context management is typically handled by the client-side AI agent, not the model itself. For example, claude-code does not send the full history back to the server every time; it performs its own closed-source “smart” context selection to determine what needs to be included. Thus, reasoning-content handling should not materially increase the true context load.
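A minimal sketch of the stripping behavior described above, assuming the templates keep reasoning only for assistant turns after the last user message (the field name reasoning_content and the helper are illustrative, not the models' actual Jinja templates):

```python
def strip_stale_reasoning(messages):
    """Drop reasoning from assistant turns that precede the last user turn."""
    last_user = max(
        (i for i, m in enumerate(messages) if m["role"] == "user"),
        default=-1,
    )
    out = []
    for i, m in enumerate(messages):
        m = dict(m)  # don't mutate the caller's history
        if m["role"] == "assistant" and i < last_user:
            m.pop("reasoning_content", None)
        out.append(m)
    return out

msgs = [
    {"role": "user", "content": "q1"},
    {"role": "assistant", "content": "a1", "reasoning_content": "r1"},
    {"role": "user", "content": "q2"},
    {"role": "assistant", "content": "a2", "reasoning_content": "r2"},
]
cleaned = strip_stale_reasoning(msgs)
# "r1" (a past turn's reasoning) is dropped; "r2" (the reasoning tied to
# the current request) is kept, so context growth stays bounded.
```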

@magikRUKKOLA

magikRUKKOLA commented Nov 27, 2025

@hksdpc255

all three models automatically strip out past reasoning blocks,

I didn't get it. They're doing what? It's the server side we are talking about, right?

[EDIT]:

Also, context management is typically handled by the client-side AI agent, not the model itself.

So basically it's not the models that are stripping the reasoning_content etc., but the ... agents that go with them?

@hksdpc255
Contributor Author

I didn't get it. They're doing what? It's the server side we are talking about, right?

They do this in their chat templates, so ik_llama.cpp/llama.cpp picks it up automatically.

@Ph0rk0z

Ph0rk0z commented Nov 28, 2025

How is Anthropic at handling media vs. OpenAI? That may be a useful bit too. IIRC their handling of assistant prefill and continues was better. I've never heard of current models reusing old reasoning, beyond something new that's being tried.

@whatever1983

@magikRUKKOLA
Sorry, just saw this. Been busy the last two days. The difference between KiloCode and Claude Code is not npm. After using both extensively, you don't discover Kilocode's shortcomings until you fire very complex queries at it. Kilo typically errors out after 5-10 complex turns; it depends on the model not to crap out after 10 turns. Error correction with open-source Chinese models is so much worse on Kilocode. If it errors out on a $1/M Chinese open-source model on OpenRouter, it costs $0.50, which is OK. If you start using $10-$20/M heavy lifters, each error crap-out is now $5 out the window. Claude Code typically doesn't error out because it's built around $75/M models, so waste is nonexistent.

Realistically though, only GPT-5.1 High cuts it. Gemini 3 can only do brand-new frontend projects. The guy on the Cursor forum also commented that you shouldn't let Gemini 3 touch an existing GPT-5.1 code base because it can't do sheet.

Sadly this patch is already merged in mainline, but not yet here in ik_llama. Ironic.

@ikawrakow ikawrakow merged commit e7ecdb8 into ikawrakow:main Nov 29, 2025


Development

Successfully merging this pull request may close these issues.

Feature Request: Port upstream Anthropic Messages API support

5 participants