
Eval bug: Crash on llama_decode: failed to decode when using universal assisted generation #17480

@wallentri88

Description


Name and Version

96ac5a2

Operating systems

Linux

GGML backends

CUDA

Hardware

6x3090

Models

Main:
https://huggingface.co/bartowski/TheDrummer_Precog-123B-v1-GGUF?show_file_info=TheDrummer_Precog-123B-v1-Q6_K%2FTheDrummer_Precog-123B-v1-Q6_K-00001-of-00003.gguf

Draft:
https://huggingface.co/bartowski/TheDrummer_Precog-24B-v1-GGUF?show_file_info=TheDrummer_Precog-24B-v1-Q3_K_XL.gguf

Problem description & steps to reproduce

The models' tokenizers are not compatible, so universal assisted generation (UAG) should come into play. However, it often produces the errors and crashes shown in the logs below (I don't know why, but the Q3_K_XL draft quant triggers them frequently). I'm using SillyTavern and simply doing multiple swipes of the first message. For example, the log shows init: invalid token[1] = 49250, and token 49250 is "<th" (the start of the thinking token) in the draft model's tokenizer, while the main model's maximum token ID is 32767. So it seems the draft-tokenized text mistakenly ends up in the main model's batch?
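
To illustrate the hypothesis, here is a minimal sketch (not the actual llama.cpp code) of the kind of vocabulary-bounds check that appears to be failing. The token id 49250 and the vocabulary size 32768 come from the logs below; the function and variable names are made up for illustration:

#include <cstdint>
#include <cstdio>
#include <vector>

// llama.cpp-style token id
using llama_token = int32_t;

// Every token id in a batch must lie inside the target model's vocabulary;
// otherwise batch initialization fails, matching
// "init: invalid token[...]" / "decode: failed to initialize batch" in the logs.
static bool batch_tokens_valid(const std::vector<llama_token> & tokens, int32_t n_vocab) {
    for (size_t i = 0; i < tokens.size(); ++i) {
        if (tokens[i] < 0 || tokens[i] >= n_vocab) {
            std::fprintf(stderr, "init: invalid token[%zu] = %d\n", i, tokens[i]);
            return false;
        }
    }
    return true;
}

int main() {
    // Token 49250 exists only in the draft (24B) tokenizer; the main (123B)
    // model's vocabulary ends at id 32767, so the check fails as in the log.
    const std::vector<llama_token> batch = { 1, 49250 };
    return batch_tokens_valid(batch, /*n_vocab =*/ 32768) ? 0 : 1;
}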

I'm launching it like this:
./llama-server -m "/TheDrummer_Precog-123B-v1-GGUF/TheDrummer_Precog-123B-v1-Q6_K/TheDrummer_Precog-123B-v1-Q6_K-00001-of-00003.gguf" -dev CUDA0,CUDA1,CUDA2,CUDA3,CUDA4 -ts 18,17,17,18,19 -sm layer -c 35000 -b 2048 -ub 2048 -ngl 89 -t 7 -fa auto --no-mmap --no-webui --port 5001 -md "/TheDrummer_Precog-24B-v1-GGUF/TheDrummer_Precog-24B-v1-Q3_K_XL.gguf" -ngld 99 -devd CUDA5

First Bad Commit

No response

Relevant log output

slot update_slots: id  0 | task 97 | accepted 0/16 draft tokens, new n_tokens = 1321
srv  update_slots: run slots completed
que    start_loop: waiting for new tasks
que    start_loop: processing new tasks
que    start_loop: processing task, id = 98
que    start_loop: update slots
srv  update_slots: posting NEXT_RESPONSE
que          post: new task, id = 99, front = 0
slot update_slots: id  0 | task 97 | slot decode token, n_ctx = 35072, n_tokens = 1322, truncated = 0
srv  update_slots: decoding batch, n_tokens = 1
clear_adapter_lora: call
set_embeddings: value = 0
init: the tokens of sequence 0 in the input batch have inconsistent sequence positions:
 - the last position stored in the memory module of the context (i.e. the KV cache) for sequence 0 is X = 1319
 - the tokens for sequence 0 in the input batch have a starting position of Y = 1321
 it is required that the sequence positions remain consecutive: Y = X + 1
decode: failed to initialize batch
llama_decode: failed to decode, ret = -1
srv  update_slots: Invalid input batch. i = 0, n_batch = 2048, ret = -1
srv    send_error: task id = 97, error: Invalid input batch.
srv          send: sending result for task id = 97
srv          send: task id = 97 pushed to result queue
slot      release: id  0 | task 97 | stop processing: n_tokens = 1322, truncated = 0
slot   clear_slot: id  0 | task -1 | clearing slot with 1322 tokens


OR

slot launch_slot_: id  0 | task -1 | sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist 
slot launch_slot_: id  0 | task 394 | processing task
slot update_slots: id  0 | task 394 | new prompt, n_ctx_slot = 35072, n_keep = 0, task.n_tokens = 1320
slot update_slots: id  0 | task 394 | need to evaluate at least 1 token for each active slot (n_past = 1320, task.n_tokens() = 1320)
slot update_slots: id  0 | task 394 | n_past was set to 1319
slot update_slots: id  0 | task 394 | n_tokens = 1319, memory_seq_rm [1319, end)
slot update_slots: id  0 | task 394 | prompt processing progress, n_tokens = 1320, batch.n_tokens = 1, progress = 1.000000
slot update_slots: id  0 | task 394 | prompt done, n_tokens = 1320, batch.n_tokens = 1
init: invalid token[1] = 49250
decode: failed to initialize batch
llama_decode: failed to decode, ret = -1
init: the tokens of sequence 0 in the input batch have inconsistent sequence positions:
 - the last position stored in the memory module of the context (i.e. the KV cache) for sequence 0 is X = 1319
 - the tokens for sequence 0 in the input batch have a starting position of Y = 1321
 it is required that the sequence positions remain consecutive: Y = X + 1
decode: failed to initialize batch
llama_decode: failed to decode, ret = -1
srv  update_slots: Invalid input batch. i = 0, n_batch = 2048, ret = -1
srv    send_error: task id = 394, error: Invalid input batch.
slot      release: id  0 | task 394 | stop processing: n_tokens = 1322, truncated = 0
slot   clear_slot: id  0 | task -1 | clearing slot with 1322 tokens
srv          stop: cancel task, id_task = 394
srv  log_server_r: request: POST /completion 192.168.XXX.XXX 200
srv  update_slots: all slots are idle
slot get_availabl: id  0 | task -1 | selected slot by LRU, t_last = 19962994937
slot launch_slot_: id  0 | task -1 | sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist 
slot launch_slot_: id  0 | task 398 | processing task
slot update_slots: id  0 | task 398 | new prompt, n_ctx_slot = 35072, n_keep = 0, task.n_tokens = 1320
slot update_slots: id  0 | task 398 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id  0 | task 398 | prompt processing progress, n_tokens = 1320, batch.n_tokens = 1320, progress = 1.000000
slot update_slots: id  0 | task 398 | prompt done, n_tokens = 1320, batch.n_tokens = 1320
init: invalid token[1] = 49250
decode: failed to initialize batch
llama_decode: failed to decode, ret = -1
THEN IT CRASHED
