
Eval bug: Crash on llama_decode: failed to decode when using universal assisted generation #17480

@wallentri88

Description


Name and Version

96ac5a2

Operating systems

Linux

GGML backends

CUDA

Hardware

6x3090

Models

Main:
https://huggingface.co/bartowski/TheDrummer_Precog-123B-v1-GGUF?show_file_info=TheDrummer_Precog-123B-v1-Q6_K%2FTheDrummer_Precog-123B-v1-Q6_K-00001-of-00003.gguf

Draft:
https://huggingface.co/bartowski/TheDrummer_Precog-24B-v1-GGUF?show_file_info=TheDrummer_Precog-24B-v1-Q3_K_XL.gguf

Problem description & steps to reproduce

The models' tokenizers are not compatible, so universal assisted generation (UAG) should come into play. However, it often produces the errors and crashes shown in the logs below (I don't know why, but the Q3_K_XL draft quant triggers them frequently). I'm using SillyTavern and simply doing multiple swipes of the first message. For example, the log shows init: invalid token[1] = 49250, and token 49250 is "<th" (the start of the thinking token) in the draft model's tokenizer, while the main model's maximum token ID is 32767. So it seems the draft-tokenized text mistakenly ends up in the main model's batch?
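
To illustrate the hypothesis, here is a minimal sketch (not the actual llama.cpp code) of the kind of vocabulary-bounds check that appears to be failing. The token id 49250 and the vocabulary size 32768 come from the logs below; the function and variable names are made up for illustration:

#include <cstdint>
#include <cstdio>
#include <vector>

// llama.cpp-style token id
using llama_token = int32_t;

// Every token id in a batch must lie inside the target model's vocabulary;
// otherwise batch initialization fails, matching
// "init: invalid token[...]" / "decode: failed to initialize batch" in the logs.
static bool batch_tokens_valid(const std::vector<llama_token> & tokens, int32_t n_vocab) {
    for (size_t i = 0; i < tokens.size(); ++i) {
        if (tokens[i] < 0 || tokens[i] >= n_vocab) {
            std::fprintf(stderr, "init: invalid token[%zu] = %d\n", i, tokens[i]);
            return false;
        }
    }
    return true;
}

int main() {
    // Token 49250 exists only in the draft (24B) tokenizer; the main (123B)
    // model's vocabulary ends at id 32767, so the check fails as in the log.
    const std::vector<llama_token> batch = { 1, 49250 };
    return batch_tokens_valid(batch, /*n_vocab =*/ 32768) ? 0 : 1;
}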

I'm launching it like this:
./llama-server -m "/TheDrummer_Precog-123B-v1-GGUF/TheDrummer_Precog-123B-v1-Q6_K/TheDrummer_Precog-123B-v1-Q6_K-00001-of-00003.gguf" -dev CUDA0,CUDA1,CUDA2,CUDA3,CUDA4 -ts 18,17,17,18,19 -sm layer -c 35000 -b 2048 -ub 2048 -ngl 89 -t 7 -fa auto --no-mmap --no-webui --port 5001 -md "/TheDrummer_Precog-24B-v1-GGUF/TheDrummer_Precog-24B-v1-Q3_K_XL.gguf" -ngld 99 -devd CUDA5

First Bad Commit

No response

Relevant log output

slot update_slots: id  0 | task 97 | accepted 0/16 draft tokens, new n_tokens = 1321
srv  update_slots: run slots completed
que    start_loop: waiting for new tasks
que    start_loop: processing new tasks
que    start_loop: processing task, id = 98
que    start_loop: update slots
srv  update_slots: posting NEXT_RESPONSE
que          post: new task, id = 99, front = 0
slot update_slots: id  0 | task 97 | slot decode token, n_ctx = 35072, n_tokens = 1322, truncated = 0
srv  update_slots: decoding batch, n_tokens = 1
clear_adapter_lora: call
set_embeddings: value = 0
init: the tokens of sequence 0 in the input batch have inconsistent sequence positions:
 - the last position stored in the memory module of the context (i.e. the KV cache) for sequence 0 is X = 1319
 - the tokens for sequence 0 in the input batch have a starting position of Y = 1321
 it is required that the sequence positions remain consecutive: Y = X + 1
decode: failed to initialize batch
llama_decode: failed to decode, ret = -1
srv  update_slots: Invalid input batch. i = 0, n_batch = 2048, ret = -1
srv    send_error: task id = 97, error: Invalid input batch.
srv          send: sending result for task id = 97
srv          send: task id = 97 pushed to result queue
slot      release: id  0 | task 97 | stop processing: n_tokens = 1322, truncated = 0
slot   clear_slot: id  0 | task -1 | clearing slot with 1322 tokens


OR

slot launch_slot_: id  0 | task -1 | sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist 
slot launch_slot_: id  0 | task 394 | processing task
slot update_slots: id  0 | task 394 | new prompt, n_ctx_slot = 35072, n_keep = 0, task.n_tokens = 1320
slot update_slots: id  0 | task 394 | need to evaluate at least 1 token for each active slot (n_past = 1320, task.n_tokens() = 1320)
slot update_slots: id  0 | task 394 | n_past was set to 1319
slot update_slots: id  0 | task 394 | n_tokens = 1319, memory_seq_rm [1319, end)
slot update_slots: id  0 | task 394 | prompt processing progress, n_tokens = 1320, batch.n_tokens = 1, progress = 1.000000
slot update_slots: id  0 | task 394 | prompt done, n_tokens = 1320, batch.n_tokens = 1
init: invalid token[1] = 49250
decode: failed to initialize batch
llama_decode: failed to decode, ret = -1
init: the tokens of sequence 0 in the input batch have inconsistent sequence positions:
 - the last position stored in the memory module of the context (i.e. the KV cache) for sequence 0 is X = 1319
 - the tokens for sequence 0 in the input batch have a starting position of Y = 1321
 it is required that the sequence positions remain consecutive: Y = X + 1
decode: failed to initialize batch
llama_decode: failed to decode, ret = -1
srv  update_slots: Invalid input batch. i = 0, n_batch = 2048, ret = -1
srv    send_error: task id = 394, error: Invalid input batch.
slot      release: id  0 | task 394 | stop processing: n_tokens = 1322, truncated = 0
slot   clear_slot: id  0 | task -1 | clearing slot with 1322 tokens
srv          stop: cancel task, id_task = 394
srv  log_server_r: request: POST /completion 192.168.XXX.XXX 200
srv  update_slots: all slots are idle
slot get_availabl: id  0 | task -1 | selected slot by LRU, t_last = 19962994937
slot launch_slot_: id  0 | task -1 | sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist 
slot launch_slot_: id  0 | task 398 | processing task
slot update_slots: id  0 | task 398 | new prompt, n_ctx_slot = 35072, n_keep = 0, task.n_tokens = 1320
slot update_slots: id  0 | task 398 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id  0 | task 398 | prompt processing progress, n_tokens = 1320, batch.n_tokens = 1320, progress = 1.000000
slot update_slots: id  0 | task 398 | prompt done, n_tokens = 1320, batch.n_tokens = 1320
init: invalid token[1] = 49250
decode: failed to initialize batch
llama_decode: failed to decode, ret = -1
THEN IT CRASHED
