
Conversation


@ngxson ngxson commented Dec 3, 2025

Make sure to read the contributing guidelines before submitting a PR

Summary by CodeRabbit

  • Bug Fixes

    • Improved response consistency by ensuring message content is always included in streaming responses.
  • Refactor

    • Streamlined internal state management for server operations to enforce proper state initialization before response serialization.



coderabbitai bot commented Dec 3, 2025

Walkthrough

This change refactors internal state management by removing public chat message tracking from server_slot and introducing a new task_result_state struct with virtual update hooks in result classes. State updates are now coordinated through server_response_reader::set_states() and applied via polymorphic update() methods before serialization.
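The polymorphic update pattern described in this walkthrough can be sketched as below. This is a simplified, hypothetical reconstruction for illustration: the type and member names mirror the PR but are not the real llama.cpp definitions.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Per-task streaming state lives outside the result objects.
struct task_result_state {
    std::string generated_text; // accumulated across partial results
};

struct server_task_result {
    virtual ~server_task_result() = default;
    // default no-op: result kinds without chat state (metrics, embeddings) skip it
    virtual void update(task_result_state & /*state*/) {}
};

struct server_task_result_cmpl_partial : server_task_result {
    std::string content;    // this chunk's text
    bool is_updated = false;
    void update(task_result_state & state) override {
        state.generated_text += content; // accumulate the streamed chunk
        is_updated = true;
    }
};

// drive a stream of chunks through the polymorphic hook
inline std::string run_stream(const std::vector<std::string> & chunks) {
    task_result_state state;
    for (const auto & c : chunks) {
        server_task_result_cmpl_partial r;
        r.content = c;
        server_task_result & base = r;
        base.update(state); // virtual dispatch, as the reader would do before serialization
    }
    return state.generated_text;
}
```

The key point is that the state object outlives any single result, so partial chunks can accumulate into it across calls.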

Changes

Cohort / File(s) Summary
State Tracking Infrastructure
tools/server/server-task.h, tools/server/server-task.cpp
Introduced task_result_state struct to encapsulate per-task streaming state (diffs, chat message, generated text, tool call IDs). Added virtual update() hook to server_task_result base struct and implemented overrides in server_task_result_cmpl_final and server_task_result_cmpl_partial with is_updated flags. Implemented task_result_state::update_chat_msg() method to append text, parse messages, assign tool-call IDs, and compute diffs.
Public API Refactoring
tools/server/server-context.cpp
Removed public members chat_msg, chat_format, and generated_tool_call_ids from server_slot struct. Removed public update_chat_msg() method. Shifts chat state management from public to internal tracking.
Response Queue Integration
tools/server/server-queue.h, tools/server/server-queue.cpp
Added a server_response struct and a std::vector<task_result_state> states member to server_response_reader. Implemented the set_states() method, and applied states in next() (with bounds validation) to update results before the stop check.
Serialization Updates
tools/server/server-task.cpp
Added runtime assertions enforcing update() called prior to JSON serialization in both cmpl_final and cmpl_partial. Modified non-OpenAI-compat and OpenAI-compat JSON serialization to always include content and tokens, removing stream-based conditional exclusion.
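The Response Queue Integration row above can be sketched as follows; the reader applies the matching state to each result by index before the caller's stop check. Names follow the PR loosely; fake_result and the exact signatures are assumptions for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

struct task_result_state { int updates = 0; };

struct fake_result {
    std::size_t index = 0;
    bool is_updated = false;
    void update(task_result_state & s) { s.updates++; is_updated = true; }
};

struct response_reader {
    std::vector<task_result_state> states;

    void set_states(std::vector<task_result_state> && s) { states = std::move(s); }

    // apply per-task state; a no-op when no states were installed
    fake_result next(fake_result r) {
        if (!states.empty()) {
            assert(r.index < states.size() && "result index out of range"); // fail fast on mismatch
            r.update(states[r.index]);
        }
        return r;
    }
};
```

The empty-states no-op is what keeps non-completion task kinds (embeddings, rerank, metrics) unaffected.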

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • Attention areas:
    • Verify state application order in server_response_reader::next() occurs before is_stop() check and with proper bounds validation
    • Confirm all server_task_result subclasses properly implement update() override and set is_updated flag
    • Review assertion enforcement for serialization paths to ensure update() is guaranteed called in all code paths
    • Validate that the task_result_state initialization correctly mirrors the removed server_slot members at each usage site
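The assertion contract flagged in the attention areas above can be sketched like this. These are simplified stand-ins, not the actual server_task types; the real code uses GGML_ASSERT rather than the standard assert shown here.

```cpp
#include <cassert>
#include <string>

struct task_result_state { std::string text; };

struct cmpl_final {
    std::string content;
    bool is_updated = false;

    void update(task_result_state & state) {
        state.text += content;
        is_updated = true;
    }

    std::string to_json() const {
        // mirrors GGML_ASSERT(is_updated && "update() must be called before to_json()")
        assert(is_updated && "update() must be called before to_json()");
        return "{\"content\":\"" + content + "\"}"; // content always included, no stream-based exclusion
    }
};
```

Any serialization path that bypasses update() trips the assertion immediately instead of silently emitting stale or missing content.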

Poem

🐰 We've tucked the chat state deep inside,
Where public eyes cannot collide,
With virtual hooks to coordinate the flow,
State updates dance before we show!
No more exposures at the seam—
Just clean encapsulation's dream ✨

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings, 1 inconclusive)
  • Description check: ⚠️ Warning. The PR description only contains the template reminder and lacks substantive details about the changes, objectives, motivation, or impact. Resolution: add a detailed description explaining the refactoring of message diffing logic, the rationale for moving state management, and any breaking changes or migration notes for reviewers.
  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 25.00%, below the required 80.00% threshold. Resolution: run @coderabbitai generate docstrings to improve coverage.
  • Title check: ❓ Inconclusive. The title 'Xsn/server improve msg diff' is too generic to convey the primary change. Resolution: rewrite the title to describe the main change, such as 'Refactor server message diffing: move chat state from server_slot to task_result_state' or 'Improve server message diff handling with task-level state management'.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch xsn/server_improve_msg_diff

Comment @coderabbitai help to get the list of available commands and usage tips.


ngxson commented Dec 3, 2025

@coderabbitai find the bug here


coderabbitai bot commented Dec 3, 2025

Tip

For best results, initiate chat on the files or code changes.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Fetch PR details and diff
gh pr view 49 --repo ngxson/llama.cpp --json title,body,files,commits

# Get the diff for this PR
gh pr diff 49 --repo ngxson/llama.cpp

Length of output: 15968


🏁 Script executed:

#!/bin/bash
# Search for where the index field is set in server_task_result_cmpl_final and server_task_result_cmpl_partial
rg -nP -A5 -B5 '\bindex\s*=' tools/server/server-context.cpp | head -100

# Look for how results are created and how index is set
rg -nP 'server_task_result_cmpl_(final|partial)' tools/server/server-context.cpp -A10 | head -150

# Check the definition of index in server_task_result
rg -nP '^\s*int\s+index' tools/server/server-task.h -B2 -A2

Length of output: 7522


@ngxson I found two bugs in this PR:

1. Critical: Use-after-move bug (tools/server/server-context.cpp, lines 2607-2608)

tasks.push_back(std::move(task));                     // line 2607
states.emplace_back(task.params.oaicompat_chat_syntax); // line 2608

After std::move(task) on line 2607, the task object is in a moved-from state. Accessing task.params.oaicompat_chat_syntax on line 2608 is undefined behavior.

Fix: Access the parameter before moving the task:

states.emplace_back(task.params.oaicompat_chat_syntax);
tasks.push_back(std::move(task));

2. Minor: Redundant ternary operator (tools/server/server-task.cpp, line 717)

if (!new_msg.empty()) {
    new_msg.set_tool_call_ids(generated_tool_call_ids, gen_tool_call_id);
    chat_msg = new_msg;
    diffs = common_chat_msg_diff::compute_diffs(msg_prv_copy, new_msg.empty() ? msg_prv_copy : new_msg);
}

Inside the if (!new_msg.empty()) block, the ternary new_msg.empty() ? msg_prv_copy : new_msg will always evaluate to new_msg. This should be simplified to:

diffs = common_chat_msg_diff::compute_diffs(msg_prv_copy, new_msg);

The first bug is critical and will cause unpredictable behavior at runtime! 🐰
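A minimal reproduction of the fix for bug 1, using simplified stand-in types (params_t, task_t, and build are hypothetical, not the PR's actual code): capture the value first, then move the task.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

struct params_t { std::string oaicompat_chat_syntax; };
struct task_t   { params_t params; };

struct build_out {
    std::vector<task_t> tasks;
    std::vector<std::string> states;
};

inline build_out build(std::vector<task_t> input) {
    build_out out;
    for (auto & task : input) {
        // correct order: read the field first, then move the task;
        // reversing these two lines reads from a moved-from object
        out.states.emplace_back(task.params.oaicompat_chat_syntax);
        out.tasks.push_back(std::move(task));
    }
    return out;
}
```

With the order reversed, the moved-from string is left in a valid but unspecified state, so the stored syntax would typically come back empty, which is a bug even where it stops short of formal undefined behavior.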


🧠 Learnings used
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 14
File: examples/lookahead/lookahead.cpp:97-98
Timestamp: 2025-03-13T22:56:47.417Z
Learning: ngxson prefers to prioritize simplification in some cases and defer adding error handling to a later time, as indicated by their response to a suggestion about adding error checks for llama_decode_ext calls.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tools/server/server-context.cpp (1)

607-608: Critical: Use-after-move undefined behavior.

At line 607, task is moved into the tasks vector. Then at line 608, task.params.oaicompat_chat_syntax is accessed from the moved-from object. Accessing members of a moved-from object is undefined behavior and can lead to unpredictable results.

Apply this diff to fix the issue by capturing the value before the move:

+        auto chat_syntax = task.params.oaicompat_chat_syntax;
         tasks.push_back(std::move(task));
-        states.emplace_back(task.params.oaicompat_chat_syntax);
+        states.emplace_back(chat_syntax);
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e7c2cf1 and 0af6362.

📒 Files selected for processing (5)
  • common/chat.cpp (2 hunks)
  • common/chat.h (1 hunks)
  • tools/server/server-context.cpp (6 hunks)
  • tools/server/server-task.cpp (1 hunks)
  • tools/server/server-task.h (6 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.{cpp,h,hpp}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.{cpp,h,hpp}: Always format C++ code using git clang-format before committing, following .clang-format configuration (4-space indentation, 120 column limit, braces on same line for functions, pointer/reference alignment in middle)
Minimize external dependencies; avoid adding new external dependencies unless absolutely necessary

Files:

  • common/chat.h
  • common/chat.cpp
  • tools/server/server-task.h
  • tools/server/server-task.cpp
  • tools/server/server-context.cpp
**/*.{cpp,h,hpp,py}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Ensure cross-platform compatibility by testing code changes on Linux, macOS, and Windows when possible

Files:

  • common/chat.h
  • common/chat.cpp
  • tools/server/server-task.h
  • tools/server/server-task.cpp
  • tools/server/server-context.cpp
🧠 Learnings (1)
📓 Common learnings
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 14
File: examples/lookahead/lookahead.cpp:97-98
Timestamp: 2025-03-13T22:56:47.417Z
Learning: ngxson prefers to prioritize simplification in some cases and defer adding error handling to a later time, as indicated by their response to a suggestion about adding error checks for llama_decode_ext calls.
🧬 Code graph analysis (2)
tools/server/server-task.h (2)
tools/server/server-task.cpp (4)
  • update_chat_msg (703-720)
  • update_chat_msg (703-706)
  • update (1453-1493)
  • update (1453-1453)
common/chat-peg-parser.h (1)
  • content (14-14)
tools/server/server-task.cpp (3)
common/chat-parser.cpp (4)
  • common_chat_parse (1402-1485)
  • common_chat_parse (1402-1402)
  • common_chat_parse (1487-1509)
  • common_chat_parse (1487-1487)
tools/server/server-common.cpp (2)
  • gen_tool_call_id (83-85)
  • gen_tool_call_id (83-83)
common/chat.cpp (2)
  • compute_diffs (88-135)
  • compute_diffs (88-88)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (19)
  • GitHub Check: ubuntu-24-cmake-vulkan
  • GitHub Check: ubuntu-latest-cmake-cuda
  • GitHub Check: ubuntu-24-cmake-vulkan-deb
  • GitHub Check: ubuntu-24-cmake-webgpu
  • GitHub Check: windows-latest-cmake-sycl
  • GitHub Check: macOS-latest-cmake-visionos
  • GitHub Check: ubuntu-latest-cmake-rpc
  • GitHub Check: ubuntu-22-cmake-musa
  • GitHub Check: ubuntu-cpu-cmake (arm64, ubuntu-22.04-arm)
  • GitHub Check: android-build
  • GitHub Check: ubuntu-latest-cmake-sanitizer (THREAD, Debug)
  • GitHub Check: ubuntu-22-cmake-hip
  • GitHub Check: ubuntu-latest-cmake-sanitizer (ADDRESS, Debug)
  • GitHub Check: ubuntu-latest-cmake-sanitizer (UNDEFINED, Debug)
  • GitHub Check: macOS-latest-cmake-arm64
  • GitHub Check: macOS-latest-cmake-arm64-webgpu
  • GitHub Check: ubuntu-latest-llguidance
  • GitHub Check: macOS-latest-cmake-x64
  • GitHub Check: server-windows
🔇 Additional comments (6)
common/chat.cpp (1)

88-134: LGTM - Parameter renaming is consistent throughout.

The rename from previous_msg/new_msg to msg_prv/msg_new is applied consistently. The diff computation logic remains correct: it compares reasoning content, content, and tool calls between the previous and new message states.

common/chat.h (1)

80-80: LGTM - Header matches implementation.

The function signature declaration correctly mirrors the implementation in chat.cpp.

tools/server/server-task.h (4)

164-181: LGTM - Well-structured state tracking for streaming.

The task_result_state struct cleanly encapsulates per-task streaming state (diffs, chat message, generated text, tool call IDs). The constructor properly initializes the syntax member.


197-199: LGTM - Virtual update hook enables polymorphic state updates.

The default no-op implementation is appropriate for result types that don't need state updates.


272-274: LGTM - Final result update correctly passes is_partial=false.

The final result correctly updates the state with the complete content and populates both oaicompat_msg and oaicompat_msg_diffs.


321-323: LGTM - Partial result update correctly passes is_partial=true.

The partial result correctly updates only the diffs, allowing incremental streaming.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

♻️ Duplicate comments (1)
tools/server/server-context.cpp (1)

2659-2659: State update before JSON conversion is correct.

The pattern of updating state before JSON conversion is consistent with the non-streaming path. The same bounds safety consideration from the previous comment applies here.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 30f7637 and 2f2774c.

📒 Files selected for processing (1)
  • tools/server/server-context.cpp (6 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.{cpp,h,hpp}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.{cpp,h,hpp}: Always format C++ code using git clang-format before committing, following .clang-format configuration (4-space indentation, 120 column limit, braces on same line for functions, pointer/reference alignment in middle)
Minimize external dependencies; avoid adding new external dependencies unless absolutely necessary

Files:

  • tools/server/server-context.cpp
**/*.{cpp,h,hpp,py}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Ensure cross-platform compatibility by testing code changes on Linux, macOS, and Windows when possible

Files:

  • tools/server/server-context.cpp
🧠 Learnings (1)
📓 Common learnings
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 14
File: examples/lookahead/lookahead.cpp:97-98
Timestamp: 2025-03-13T22:56:47.417Z
Learning: ngxson prefers to prioritize simplification in some cases and defer adding error handling to a later time, as indicated by their response to a suggestion about adding error checks for llama_decode_ext calls.
🧬 Code graph analysis (1)
tools/server/server-context.cpp (1)
tools/server/server-common.cpp (6)
  • format_anthropic_sse (1485-1506)
  • format_anthropic_sse (1485-1485)
  • format_oai_sse (1466-1483)
  • format_oai_sse (1466-1466)
  • format_error_response (16-58)
  • format_error_response (16-16)
🔇 Additional comments (2)
tools/server/server-context.cpp (2)

1293-1300: LGTM! Clear handling of streaming vs non-streaming modes.

The conditional logic correctly differentiates between streaming and non-streaming modes. In streaming mode, content and tokens have already been sent in partial chunks, so they're appropriately cleared here. The comment clarifies the intent well.


2672-2750: Excellent refactoring with improved error handling and state management.

The streaming lambda has been significantly improved:

  • State ownership: Moving states into the lambda (line 2672) correctly transfers ownership for the lifetime of the streaming response
  • Error handling: The try-catch block (lines 2684-2749) gracefully handles exceptions and sends properly formatted SSE error responses
  • Consistent state updates: State is updated before JSON conversion (line 2731), maintaining consistency with non-streaming path
  • Clear control flow: Early returns with debug logging make the logic easy to follow
  • Format-aware errors: The format_error lambda respects res_type for Anthropic vs OAI compatibility


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (1)
tools/server/server-task.cpp (1)

704-721: Redundant ternary in compute_diffs call

Inside if (!new_msg.empty()), the expression

diffs = common_chat_msg_diff::compute_diffs(msg_prv_copy,
        new_msg.empty() ? msg_prv_copy : new_msg);

always selects new_msg, so the ternary adds noise without changing behavior. You can simplify to:

-        diffs = common_chat_msg_diff::compute_diffs(msg_prv_copy, new_msg.empty() ? msg_prv_copy : new_msg);
+        diffs = common_chat_msg_diff::compute_diffs(msg_prv_copy, new_msg);
🧹 Nitpick comments (1)
tools/server/server-task.h (1)

164-181: task_result_state design is sound; diffs member appears unused

The state container (syntax, accumulated text, chat_msg, tool-call IDs) matches how update_chat_msg is used later. The diffs member in the struct itself is not referenced anywhere; since all callers pass an external std::vector<common_chat_msg_diff> &, you can drop the member to avoid confusion unless you plan to use it shortly.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e89b80d and ba2af58.

📒 Files selected for processing (5)
  • tools/server/server-context.cpp (5 hunks)
  • tools/server/server-queue.cpp (2 hunks)
  • tools/server/server-queue.h (3 hunks)
  • tools/server/server-task.cpp (5 hunks)
  • tools/server/server-task.h (6 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.{cpp,h,hpp}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.{cpp,h,hpp}: Always format C++ code using git clang-format before committing, following .clang-format configuration (4-space indentation, 120 column limit, braces on same line for functions, pointer/reference alignment in middle)
Minimize external dependencies; avoid adding new external dependencies unless absolutely necessary

Files:

  • tools/server/server-queue.cpp
  • tools/server/server-task.cpp
  • tools/server/server-context.cpp
  • tools/server/server-task.h
  • tools/server/server-queue.h
**/*.{cpp,h,hpp,py}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Ensure cross-platform compatibility by testing code changes on Linux, macOS, and Windows when possible

Files:

  • tools/server/server-queue.cpp
  • tools/server/server-task.cpp
  • tools/server/server-context.cpp
  • tools/server/server-task.h
  • tools/server/server-queue.h
🧠 Learnings (1)
📓 Common learnings
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 14
File: examples/lookahead/lookahead.cpp:97-98
Timestamp: 2025-03-13T22:56:47.417Z
Learning: ngxson prefers to prioritize simplification in some cases and defer adding error handling to a later time, as indicated by their response to a suggestion about adding error checks for llama_decode_ext calls.
🧬 Code graph analysis (4)
tools/server/server-task.cpp (3)
common/chat-parser.cpp (4)
  • common_chat_parse (1402-1485)
  • common_chat_parse (1402-1402)
  • common_chat_parse (1487-1509)
  • common_chat_parse (1487-1487)
tools/server/server-common.cpp (2)
  • gen_tool_call_id (83-85)
  • gen_tool_call_id (83-83)
common/chat.cpp (2)
  • compute_diffs (88-135)
  • compute_diffs (88-88)
tools/server/server-context.cpp (1)
tools/server/server-common.cpp (6)
  • format_anthropic_sse (1485-1506)
  • format_anthropic_sse (1485-1485)
  • format_oai_sse (1466-1483)
  • format_oai_sse (1466-1466)
  • format_error_response (16-58)
  • format_error_response (16-16)
tools/server/server-task.h (1)
tools/server/server-task.cpp (4)
  • update_chat_msg (704-721)
  • update_chat_msg (704-707)
  • update (1455-1495)
  • update (1455-1455)
tools/server/server-queue.h (1)
tools/server/server-queue.cpp (2)
  • set_states (274-276)
  • set_states (274-274)
🔇 Additional comments (11)
tools/server/server-task.h (3)

197-199: Base update(task_result_state &) no-op is appropriate

Leaving the base implementation empty and documenting that only completion result types use it keeps the hierarchy flexible without impacting other result kinds (embeddings, metrics, etc.). No changes needed here.


258-277: Final completion result update() correctly wires per-task state

Using task_result_state::update_chat_msg(content, /*is_partial=*/false, oaicompat_msg_diffs) and caching into oaicompat_msg plus an is_updated flag matches the later GGML_ASSERT contract in to_json(). This ensures all JSON paths have an up-to-date parsed message and diffs before serialization.


311-327: Partial completion result update() correctly accumulates streaming diffs

Marking is_updated and routing each chunk’s content through state.update_chat_msg(content, /*is_partial=*/true, oaicompat_msg_diffs) matches the intended “diffs per chunk” design and keeps state ownership outside the result object.

tools/server/server-queue.cpp (1)

274-276: State application via set_states and next() is consistent and guarded

Storing states by move and, in next(), mapping result->get_index() into states[idx] with GGML_ASSERT(idx >= 0 && idx < states.size()) gives a clear 1:1 task-to-state contract and fail-fast behavior on mismatch, while remaining a no-op when states is empty (embeddings, rerank, metrics, etc.). This is a clean integration point for the new per-task state mechanism.

Also applies to: 305-310

tools/server/server-queue.h (3)

10-11: Queue-level comments improve discoverability

The brief description that server_queue is normally used via server_response_reader makes the intended usage clear and matches how routes are structured.


72-73: server_response documentation matches actual usage

Clarifying that server_response_reader is the typical consumer for server_response aligns with the implementation and keeps the lower-level API clearly marked as internal-ish.


127-130: Per-task states storage on reader is well-scoped

Tracking std::vector<task_result_state> states only on server_response_reader, with a dedicated set_states entry point and limited to “streaming completions” via comments, keeps the new stateful behavior localized and avoids polluting server_queue/server_response with extra coupling.

Also applies to: 138-138

tools/server/server-task.cpp (2)

567-581: to_json() now requires prior update(); contract is satisfied by current callers

Adding GGML_ASSERT(is_updated && "update() must be called before to_json()") to both final and partial completion results enforces that per-task state has been applied. Given that all HTTP completion paths now go through server_response_reader::next(), which calls result->update(state) before any to_json() call, this is consistent and should be safe for current usages.

Also applies to: 978-992


583-605: Always including content/tokens/text in JSON matches PR goal

Non-OAI and OAI completion JSON now unconditionally include content, tokens, and text fields irrespective of streaming mode. This simplifies clients that previously had to special-case stream vs non-stream payloads and aligns with the PR’s “JSON always includes content/tokens” objective.

Also applies to: 608-650

tools/server/server-context.cpp (2)

577-596: Per-task state construction fixes prior use-after-move and aligns indices

Building std::vector<task_result_state> states; alongside tasks, reserving to inputs.size(), and pushing task.params.oaicompat_chat_syntax into states before tasks.push_back(std::move(task)); both removes the earlier use-after-move bug and guarantees that states[i] corresponds to task.index == i. Passing the whole vector into rd.set_states(std::move(states)); cleanly wires this into server_response_reader.

Also applies to: 613-619


661-682: Streaming SSE path correctly composes with stateful results and error handling

For streaming completions, you now:

  • Obtain first_result via rd.next(should_stop) (which applies state), then serialize once to first_result_json.
  • Seed the HTTP response body with formatted SSE from that JSON.
  • Use res->next to flush the initial chunk, then repeatedly call rd.next(should_stop) to get subsequent results, handling should_stop(), errors (formatted via format_error), and protocol-specific termination ([DONE] for OpenAI-style, final events for Anthropic).

This keeps the transport layer oblivious to result internals while respecting the new update()/to_json() contract and providing clear termination semantics.

Also applies to: 671-748

@ngxson ngxson force-pushed the xsn/server_improve_msg_diff branch from aad20a6 to 1c30c28 on December 4, 2025 at 13:13

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (1)
tools/server/server-task.cpp (1)

704-721: Redundant ternary in update_chat_msg() can be simplified.

Inside if (!new_msg.empty()), the expression

diffs = common_chat_msg_diff::compute_diffs(msg_prv_copy, new_msg.empty() ? msg_prv_copy : new_msg);

always passes new_msg as the second argument, since new_msg.empty() is already known to be false. This can be simplified to:

-        diffs = common_chat_msg_diff::compute_diffs(msg_prv_copy, new_msg.empty() ? msg_prv_copy : new_msg);
+        diffs = common_chat_msg_diff::compute_diffs(msg_prv_copy, new_msg);

to remove dead conditional logic.

🧹 Nitpick comments (5)
tools/server/server-queue.cpp (1)

274-276: State application wiring looks correct; document/update the coupling to get_index()/update().

The new set_states() + next() logic cleanly applies per-index task_result_state to results and defensively asserts the index. The only thing worth noting is the implicit contract: once states is non-empty for a given server_response_reader, every non-error result it sees must (a) implement get_index() consistently with the task indices used to build states, and (b) provide a sensible update(task_result_state &) implementation. It may be helpful to mention this in a comment near set_states() or in task_result_state’s docs so future result types don’t accidentally violate the assumption.

Also applies to: 305-310

tools/server/server-task.cpp (1)

568-581: Assert before to_json() is fine; ensure all call sites go through update().

The GGML_ASSERT(is_updated && "update() must be called before to_json()"); guards are a good invariant for cmpl_final and cmpl_partial, given update() is now the only place that wires in task_result_state (chat msg + diffs). Please double‑check (or grep) that no code path constructs these result types and calls to_json() without going through server_response_reader::next() (or otherwise calling update()), especially any older/direct uses of queue_results.recv(...) that might predate server_response_reader.

Also applies to: 979-992

tools/server/server-context.cpp (1)

1287-1300: Stream-mode cmpl_final now omits repeated content/tokens; verify client expectations.

For streaming requests you now zero out res->content and res->tokens in send_final_response, relying on the last partial chunk to have already carried the final text/tokens. This avoids duplication but changes the shape of the last native-format response: content/tokens will now be present but empty instead of repeating data. Please confirm this matches how existing clients consume the stream (especially any that were only looking at the final chunk).

tools/server/server-task.h (2)

164-181: task_result_state encapsulation aligns well with streaming needs.

Bundling diffs, oaicompat_chat_syntax, chat_msg, generated_text, and generated_tool_call_ids into task_result_state gives a clear per-task home for streaming state. The non-explicit ctor from common_chat_syntax is what enables the convenient states.push_back(task.params.oaicompat_chat_syntax); call in server-context.cpp, which is reasonable here—just be aware that it’s an implicit conversion point if more constructors are ever added.
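The implicit-conversion point noted above can be sketched with simplified stand-in types (the real common_chat_syntax has more fields): a non-explicit single-argument constructor lets a syntax value convert to task_result_state directly at the push_back call site.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

struct common_chat_syntax { std::string format; };

struct task_result_state {
    common_chat_syntax syntax;
    // deliberately non-explicit: enables states.push_back(task.params.oaicompat_chat_syntax)
    task_result_state(common_chat_syntax s) : syntax(std::move(s)) {}
};
```

Marking the constructor explicit would force callers to write states.push_back(task_result_state(syn)), which is the trade-off the comment is hinting at if more constructors are ever added.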


183-202: Result update(task_result_state &) hook is consistent; rely on it for all chat/diff wiring.

Adding a virtual update(task_result_state &) to server_task_result with a no-op default, and overriding it only in server_task_result_cmpl_final and _partial, keeps non-chat results untouched while giving a clean extension point for chat state/diffs. The new members:

  • oaicompat_msg, oaicompat_msg_diffs, is_updated in cmpl_final.
  • oaicompat_msg_diffs, is_updated in cmpl_partial.

combined with the corresponding overrides:

cmpl_final::update:  oaicompat_msg = state.update_chat_msg(content, false, oaicompat_msg_diffs);
cmpl_partial::update: state.update_chat_msg(content, true,  oaicompat_msg_diffs);

form a coherent contract for server_response_reader::next() to invoke once per result before any to_json() calls. This design looks solid; just ensure any future result types that participate in streaming diffs also override update() instead of duplicating parsing logic.

Also applies to: 228-262, 291-328

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between aad20a6 and 1c30c28.

📒 Files selected for processing (5)
  • tools/server/server-context.cpp (5 hunks)
  • tools/server/server-queue.cpp (2 hunks)
  • tools/server/server-queue.h (3 hunks)
  • tools/server/server-task.cpp (5 hunks)
  • tools/server/server-task.h (6 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.{cpp,h,hpp}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.{cpp,h,hpp}: Always format C++ code using git clang-format before committing, following .clang-format configuration (4-space indentation, 120 column limit, braces on same line for functions, pointer/reference alignment in middle)
Minimize external dependencies; avoid adding new external dependencies unless absolutely necessary

Files:

  • tools/server/server-task.cpp
  • tools/server/server-task.h
  • tools/server/server-queue.cpp
  • tools/server/server-queue.h
  • tools/server/server-context.cpp
**/*.{cpp,h,hpp,py}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Ensure cross-platform compatibility by testing code changes on Linux, macOS, and Windows when possible

Files:

  • tools/server/server-task.cpp
  • tools/server/server-task.h
  • tools/server/server-queue.cpp
  • tools/server/server-queue.h
  • tools/server/server-context.cpp
🧠 Learnings (1)
📓 Common learnings
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 14
File: examples/lookahead/lookahead.cpp:97-98
Timestamp: 2025-03-13T22:56:47.417Z
Learning: ngxson prefers to prioritize simplification in some cases and defer adding error handling to a later time, as indicated by their response to a suggestion about adding error checks for llama_decode_ext calls.
🧬 Code graph analysis (3)
tools/server/server-task.cpp (3)
common/chat-parser.cpp (4)
  • common_chat_parse (1402-1485)
  • common_chat_parse (1402-1402)
  • common_chat_parse (1487-1509)
  • common_chat_parse (1487-1487)
tools/server/server-common.cpp (2)
  • gen_tool_call_id (83-85)
  • gen_tool_call_id (83-83)
common/chat.cpp (2)
  • compute_diffs (88-135)
  • compute_diffs (88-88)
tools/server/server-task.h (2)
tools/server/server-task.cpp (4)
  • update_chat_msg (704-721)
  • update_chat_msg (704-707)
  • update (1455-1495)
  • update (1455-1455)
tools/server/server-context.cpp (1)
  • state (241-243)
tools/server/server-context.cpp (1)
tools/server/server-common.cpp (6)
  • format_anthropic_sse (1485-1506)
  • format_anthropic_sse (1485-1485)
  • format_oai_sse (1466-1483)
  • format_oai_sse (1466-1466)
  • format_error_response (16-58)
  • format_error_response (16-16)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: server-windows
  • GitHub Check: check-requirements
  • GitHub Check: pyright type-check
🔇 Additional comments (3)
tools/server/server-context.cpp (2)

2577-2619: Per-task state setup and use-after-move fix look correct.

The new states vector of task_result_state is built in lockstep with tasks:

  • tasks.reserve(inputs.size()); / states.reserve(inputs.size());
  • Within the loop: task.index = i; ... states.push_back(task.params.oaicompat_chat_syntax); tasks.push_back(std::move(task));
  • Afterward: rd.set_states(std::move(states));.

This both fixes the prior use-after-move (state captured before std::move(task)) and ensures result->get_index() maps cleanly into states[idx] in server_response_reader::next(). Just keep in mind that if you ever reuse handle_completions_impl for non-chat responses but still call set_states(), those result types must implement get_index()/update() consistently with this indexing.


2662-2747: Streaming lambda refactor improves robustness; capture lifetime is acceptable.

The updated next lambda:

  • Flushes the precomputed first chunk once from res_this->data.
  • Uses rd.has_next() + rd.next(should_stop) to drive streaming, honoring should_stop() both before and during result retrieval.
  • Emits SSE-formatted error events via the local format_error helper for both OpenAI-like and Anthropic streams.
  • Catches exceptions, maps them to format_error_response(...), and terminates the stream cleanly.

The capture of &should_stop is okay as long as the underlying server_http_req::should_stop outlives the generator (which matches the existing pattern). Overall this is a solid incremental hardening of the streaming path.

tools/server/server-queue.h (1)

10-12: Header-level design and documentation for queues/responses look good.

The header changes clearly separate responsibilities:

  • server_queue for task management.
  • server_response for result fan-in.
  • server_response_reader as the recommended high-level API, now with an explicit states vector and set_states(...) for streaming chat state.

The comments around “only used by streaming completions” help constrain expectations; implementation in the .cpp file aligns with these declarations.

Also applies to: 72-114, 127-139

