Conversation

@hksdpc255
Contributor

@hksdpc255 hksdpc255 commented Nov 26, 2025

Ported from ggml-org/llama.cpp#17425

Closes #1010

@hksdpc255 hksdpc255 marked this pull request as ready for review November 26, 2025 09:09
@hksdpc255
Contributor Author

Tested with Qwen3-Coder-30B:
[screenshot]

@ikawrakow
Owner

LGTM, but I see this comment in the mainline PR.

Is the consensus that this is valuable?

@whatever1983

@ikawrakow:
yeah, it's very valuable. You shouldn't refer to mainline commentators because they don't get it. I recently dumped Kilocode for Claude Code. Code quality is about 20% higher. Prompts and reasoning are more concise. You should try it. Right now, if you have to use Claude Code with OpenAI endpoints, you have to use Claude Code Router from GLM. It is very hard to configure to use as a openai->cc proxy. (Took me 2 days to figure out)

Just merge it. It is very nice.

@hksdpc255
Contributor Author

While it’s possible to use a proxy that maps Anthropic’s API to the OpenAI API, I still think native Anthropic API support is valuable. Anthropic provides a standardized mechanism for transmitting reasoning segments, whereas OpenAI’s reasoning_content field is not consistently implemented across clients (OpenAI’s official Codex even uses a different field for reasoning). As a result, Anthropic clients reliably send reasoning blocks back to the server, but OpenAI-style clients often drop them.

For models that depend on interleaved reasoning (e.g. Minimax-M2 and Kimi-K2), this difference matters. Leveraging Anthropic’s native Messages API can produce better model behavior and, in some cases, improved performance. It also lets claude-code work with ik_llama.cpp out of the box, without requiring manual setup of additional services.
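To make the difference concrete, here is a minimal Python sketch (not code from this PR: the thinking/text content-block shapes follow the public Anthropic Messages schema, reasoning_content is the common OpenAI-style extension field, and the helper function is hypothetical):

```python
# Anthropic Messages API: reasoning is a first-class "thinking" content
# block inside the assistant message, so clients replay it verbatim on
# the next turn.
anthropic_turn = {
    "role": "assistant",
    "content": [
        {"type": "thinking", "thinking": "Consider edge cases first..."},
        {"type": "text", "text": "Here is the fix."},
    ],
}

# OpenAI-style: reasoning rides in a non-standard side field that many
# clients simply drop when they resend the history.
openai_turn = {
    "role": "assistant",
    "content": "Here is the fix.",
    "reasoning_content": "Consider edge cases first...",  # often lost
}

def echoed_reasoning(turn):
    """Return the reasoning a typical client would send back, if any."""
    if isinstance(turn.get("content"), list):
        for block in turn["content"]:
            if block.get("type") == "thinking":
                return block["thinking"]
    # OpenAI-style clients frequently strip fields they don't know about.
    return None

print(echoed_reasoning(anthropic_turn))  # reasoning survives the round trip
print(echoed_reasoning(openai_turn))     # None: reasoning was dropped
```

The point is structural: in the Anthropic schema the reasoning is inside the message content a client must echo back, while reasoning_content is an optional sibling field a client can silently discard.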

@whatever1983

@hksdpc255:
I am on your side. IK will eventually get it, since he's technical, unlike mainline, which is turning into an illegal CANN/Vulkan crappy-hardware framework playground.

@magikRUKKOLA

@whatever1983

I recently dumped Kilocode for Claude Code.

But have you seen this?

npm install -g @kilocode/cli

npm for the cli tool? The npm which is getting hacked almost every week?
There are more than 100 dependencies.

https://www.npmjs.com/package/@kilocode/cli?activeTab=dependencies

Are you sure it's a good idea?

Code quality is about 20% higher. Prompts and reasoning are more concise.

I looked at RooCode just today. I do believe absolutely anything would be better than RooCode. :)

@magikRUKKOLA

@hksdpc255

As a result, Anthropic clients reliably send reasoning blocks back to the server, but OpenAI-style clients often drop them.

But what is the consensus regarding that? Is it actually necessary to send the reasoning content back? My impression is that it might be redundant. The downside is that prompt caching will not be as effective, but the context gets optimized (compressed) without the reasoning.

@hksdpc255
Contributor Author

hksdpc255 commented Nov 27, 2025

Is it actually necessary to send the reasoning content back?

At least GLM-4.5, Minimax-M2, and Kimi-K2-Thinking benefit from this. For Minimax-M2 and Kimi-K2-Thinking, the official documentation explicitly states that interleaved reasoning support is required; without it, the model’s performance degrades. This makes Anthropic-style reasoning blocks more than just a convenience, since they’re part of the intended interface for these models.

Edit: the gpt-oss series also needs reasoning content sent back.
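As a rough illustration of what "interleaved reasoning" means on the wire, here is an assumed multi-turn history in Anthropic Messages form (block types follow the public schema; the run_tests tool and the exact texts are made up):

```python
# A client replaying this history keeps the assistant's "thinking" block
# interleaved between the user turn and the tool call, which is what the
# models named above are said to expect.
history = [
    {"role": "user", "content": [
        {"type": "text", "text": "Run the tests and fix any failures."},
    ]},
    {"role": "assistant", "content": [
        {"type": "thinking", "thinking": "I should run the test suite first."},
        {"type": "tool_use", "id": "t1", "name": "run_tests", "input": {}},
    ]},
    {"role": "user", "content": [
        {"type": "tool_result", "tool_use_id": "t1", "content": "2 failures"},
    ]},
    # The next assistant turn is generated with the earlier thinking block
    # still present in context, instead of being silently stripped.
]

def has_interleaved_reasoning(msgs):
    """True if any replayed message still carries a thinking block."""
    return any(
        block.get("type") == "thinking"
        for m in msgs if isinstance(m.get("content"), list)
        for block in m["content"]
    )

assert has_interleaved_reasoning(history)
```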

@hksdpc255
Contributor Author

Regarding context length: based on the official chat templates for GLM-4.5, Minimax-M2, and Kimi-K2-Thinking, all three models automatically strip out past reasoning blocks, preserving only the reasoning associated with the current user request. So the additional context overhead is actually quite small.

Also, context management is typically handled by the client-side AI agent, not the model itself. For example, claude-code does not send the full history back to the server every time; it performs its own closed-source “smart” context selection to determine what needs to be included. Thus, reasoning-content handling should not materially increase the true context load.
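A minimal sketch of the stripping behavior described above, assuming the templates keep reasoning only for assistant turns after the last user message (the field name reasoning_content and the helper are illustrative, not the models' actual Jinja templates):

```python
def strip_stale_reasoning(messages):
    """Drop reasoning from assistant turns that precede the last user turn."""
    last_user = max(
        (i for i, m in enumerate(messages) if m["role"] == "user"),
        default=-1,
    )
    out = []
    for i, m in enumerate(messages):
        m = dict(m)  # don't mutate the caller's history
        if m["role"] == "assistant" and i < last_user:
            m.pop("reasoning_content", None)
        out.append(m)
    return out

msgs = [
    {"role": "user", "content": "q1"},
    {"role": "assistant", "content": "a1", "reasoning_content": "r1"},
    {"role": "user", "content": "q2"},
    {"role": "assistant", "content": "a2", "reasoning_content": "r2"},
]
cleaned = strip_stale_reasoning(msgs)
# "r1" (a past turn's reasoning) is dropped; "r2" (the reasoning tied to
# the current request) is kept, so context growth stays bounded.
```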

@magikRUKKOLA

magikRUKKOLA commented Nov 27, 2025

@hksdpc255

all three models automatically strip out past reasoning blocks,

I didn't get it. They're doing what? It's the server side we are talking about, right?

[EDIT]:

Also, context management is typically handled by the client-side AI agent, not the model itself.

So basically it's not the models that are stripping the reasoning_content etc., but the ... agents that go with them?

@hksdpc255
Contributor Author

I didn't get it. They're doing what? It's the server side we are talking about, right?

They do this in their chat templates, so ik_llama.cpp/llama.cpp picks it up automatically.

@Ph0rk0z

Ph0rk0z commented Nov 28, 2025

How is Anthropic at handling media vs. OpenAI? That may be a useful bit too. IIRC their handling of assistant prefill and continues was better. I've never heard of current models reusing old reasoning, beyond something new that's being tried.

@whatever1983

@magikRUKKOLA
Sorry, just saw this. Been busy the last two days. The difference between KiloCode and Claude Code is not npm. After using both extensively, you don't discover Kilocode's shortcomings until you fire very complex queries at it. Kilo typically errors out after 5-10 complex turns; it depends on the model not to crap out after 10 turns. Error correction with open-source Chinese models is so much worse on Kilocode. If it errors out on a $1/M Chinese open-source model on OpenRouter, it costs $0.50, which is OK. If you start using $10-$20/M heavy lifters, each error crap-out is now $5 out the window. Claude Code typically doesn't error out because it's built around $75/M models, so waste is nonexistent.

Realistically though, only GPT-5.1 High cuts it. Gemini 3 can only do brand-new frontend projects. The guy on the Cursor forum also commented that you shouldn't let Gemini 3 touch an existing GPT-5.1 code base because it can't do sheet.

Sadly this patch is already merged in mainline, but not yet here in ik_llama. Ironic.

@ikawrakow ikawrakow merged commit e7ecdb8 into ikawrakow:main Nov 29, 2025


Development

Successfully merging this pull request may close these issues.

Feature Request: Port upstream Anthropic Messages API support

5 participants