
Misc. bug: Large tokens produce incorrect error messages due to integer overflow. #17463

@ylwango613

Description


Name and Version

./build/bin/llama-cli --version
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU (AMD EPYC 9654 96-Core Processor)
load_backend: failed to find ggml_backend_init in /data/ylwang/Projects/llama.cpp/build/bin/libggml-cpu.so
version: 7139 (923ae3c)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04.2) 11.4.0 for x86_64-linux-gnu

Operating systems

Linux

Which llama.cpp modules do you know to be affected?

llama-server

Command line

./llama-server -m 0.gguf

Problem description & steps to reproduce

PoC

import requests

a = "\n" * 2147483648  # a single message of 2^31 newline characters (just past INT32_MAX)
resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "messages": [
            {"role": "user", "content": a}
        ],
        "max_tokens": 20
    }
)
print(resp.json())

Running this Python script reproduces the issue (note that it builds a request body of roughly 2 GiB).

Displayed Result

You can see that llama-server returns:

python3 1.py
{'error': {'code': 500, 'message': 'this custom template is not supported, try using --jinja', 'type': 'server_error'}}

The message claims the custom template is not supported, but the real cause is that the content is so long that the size of the formatted chat overflows a 32-bit integer.

gdb Debugging

Thread 6 "llama-server" hit Breakpoint 1, common_chat_templates_apply_legacy (tmpls=0x5030002a8c20, inputs=...) at /data/ylwang/Projects/llama.cpp/common/chat.cpp:3392
3392        int32_t res = llama_chat_apply_template(src.c_str(), chat.data(), chat.size(), inputs.add_generation_prompt, buf.data(), buf.size());
(gdb) p buf.size()
$2 = 2684354565
(gdb) p res
$3 = -2147483648
(gdb) n
3395        if (res < 0) {
(gdb) n
3398            throw std::runtime_error("this custom template is not supported, try using --jinja");
(gdb) n

Root Cause Analysis

In:

int32_t res = llama_chat_apply_template(src.c_str(), chat.data(), chat.size(), inputs.add_generation_prompt, buf.data(), buf.size());

an int32_t is used to store the return value of llama_chat_apply_template:

    int32_t res = llm_chat_apply_template(detected_tmpl, chat_vec, formatted_chat, add_ass);
    ...
    return res;

Inside llm_chat_apply_template:

return dest.size();

The actual value returned is dest.size(), whose type is size_t. Once the formatted chat exceeds INT32_MAX bytes, storing that size in an int32_t wraps it to a negative number, and a negative return value is treated by the callers as an error. The value should therefore be carried in an int64_t.
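A minimal standalone sketch (not llama.cpp code) of the narrowing that gdb shows above: a size_t just past INT32_MAX becomes negative when stored in an int32_t.

#include <cstdint>
#include <cstdio>

int main() {
    // Size taken from the gdb session above: 2^31 bytes of formatted chat.
    size_t formatted_size = 2147483648ULL;
    // Narrowing to int32_t wraps modulo 2^32 on typical platforms
    // (implementation-defined before C++20), giving INT32_MIN here.
    int32_t res = (int32_t) formatted_size;
    printf("size_t value: %zu\n", formatted_size);
    printf("as int32_t  : %d\n", res);  // prints -2147483648, which callers treat as an error
    return 0;
}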

Fix Suggestion

For llama_chat_apply_template, defined at ./src/llama.cpp:336:

int32_t llama_chat_apply_template(
                              const char * tmpl,
         const struct llama_chat_message * chat,
                                  size_t   n_msg,
                                    bool   add_ass,
                                    char * buf,
                                 int32_t   length) {

Two places need to be changed:

  • Change the return type from int32_t to int64_t, so that genuine template errors can be distinguished from the false template errors caused by integer overflow.
  • Change int32_t length to int64_t length, because the argument passed in is buf.size(), which is an unsigned size_t. Realistic buffers will not exceed the int64_t range, so widening keeps the length representable while still allowing negative values to signal errors. (A sketch of the widened signature follows this list.)
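
A sketch of the widened signature (illustrative only, not a final patch; the matching declaration at ./include/llama.h:1079 would change the same way):

int64_t llama_chat_apply_template(
                              const char * tmpl,
         const struct llama_chat_message * chat,
                                  size_t   n_msg,
                                    bool   add_ass,
                                    char * buf,
                                 int64_t   length);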

In ./common/chat.cpp:3392:

int32_t res = llama_chat_apply_template(src.c_str(), chat.data(), chat.size(), inputs.add_generation_prompt, buf.data(), buf.size());

res should be stored in an int64_t, so the returned size cannot wrap for any realistic input.
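
A hedged sketch of the adjusted call site (any surrounding buffer-resize logic is omitted):

int64_t res = llama_chat_apply_template(src.c_str(), chat.data(), chat.size(),
                                        inputs.add_generation_prompt, buf.data(), buf.size());
if (res < 0) {
    // A negative result now really means an unsupported template,
    // not a formatted size that wrapped past INT32_MAX.
    throw std::runtime_error("this custom template is not supported, try using --jinja");
}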

For llm_chat_apply_template, defined at ./src/llama-chat.cpp:225:

int32_t llm_chat_apply_template(
    llm_chat_template tmpl,
    const std::vector<const llama_chat_message *> & chat,
    std::string & dest, bool add_ass) {

Change its return type to int64_t as well, so that the final return dest.size(); is no longer narrowed.
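
An illustrative sketch of the widened helper, assuming the per-template formatting into dest stays unchanged:

int64_t llm_chat_apply_template(
    llm_chat_template tmpl,
    const std::vector<const llama_chat_message *> & chat,
    std::string & dest, bool add_ass) {
    // ... existing per-template formatting into dest is unchanged ...
    return dest.size(); // size_t widens into int64_t without wrapping for realistic sizes
}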

Use grep to ensure the fix does not affect other code

In the source tree (excluding tests and examples), run:

grep -Rn "llama_chat_apply_template" .

The only occurrences found are:

./include/llama.h:1071:    /// NOTE: This function does not use a jinja parser. It only support a pre-defined list of template. See more: https://github.com/ggml-org/llama.cpp/wiki/Templates-supported-by-llama_chat_apply_template
./include/llama.h:1079:    LLAMA_API int32_t llama_chat_apply_template(
./common/chat.cpp:456:    const int res = llama_chat_apply_template(tmpl.c_str(), chat, 1, true, nullptr, 0);
./common/chat.cpp:3357:// Legacy template route (adhoc C++ implementation of known templates), forward to llama_chat_apply_template.
./common/chat.cpp:3392:    int32_t res = llama_chat_apply_template(src.c_str(), chat.data(), chat.size(), inputs.add_generation_prompt, buf.data(), buf.size());
./common/chat.cpp:3404:        res = llama_chat_apply_template(src.c_str(), chat.data(), chat.size(), inputs.add_generation_prompt, buf.data(), buf.size());
./src/llama.cpp:336:int32_t llama_chat_apply_template(

Here:

  • ./include/llama.h contains the public declaration
  • ./common/chat.cpp contains the call sites where res is stored
  • ./src/llama.cpp:336 is the function definition

So, on top of the definition and callers already covered above, the only additional change is to update the types in the declaration in llama.h.

Using:

grep -Rn "llm_chat_apply_template" .

The only results are:

./src/llama-chat.h:66:int32_t llm_chat_apply_template(
./src/llama.cpp:357:    int32_t res = llm_chat_apply_template(detected_tmpl, chat_vec, formatted_chat, add_ass);
./src/llama-chat.cpp:225:int32_t llm_chat_apply_template(

So we only need to change:

  • In ./src/llama.cpp:357, change the type of res to int64_t.
  • Change the return type of llm_chat_apply_template in ./src/llama-chat.cpp and ./src/llama-chat.h to int64_t (sketched below).
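
For completeness, the matching changes at those two remaining locations could look like this (illustrative sketch):

// src/llama-chat.h:66
int64_t llm_chat_apply_template(
    llm_chat_template tmpl,
    const std::vector<const llama_chat_message *> & chat,
    std::string & dest, bool add_ass);

// src/llama.cpp:357
int64_t res = llm_chat_apply_template(detected_tmpl, chat_vec, formatted_chat, add_ass);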

First Bad Commit

No response

Relevant log output


    Labels

    bug (Something isn't working), good first issue (Good for newcomers), low severity (Used to report low severity bugs in llama.cpp, e.g. cosmetic issues, non-critical UI glitches), server
