Name and Version
Operating systems
Linux
GGML backends
CUDA
Hardware
6x3090
Models
Problem description & steps to reproduce
The models' tokenizers are not compatible, so UAG should come into play. But it often produces the errors and crashes shown in the logs below (I don't know why, but Q3_K_XL triggers them often). I'm using SillyTavern and just doing multiple swipes of the first message. For example, I looked into the log line "init: invalid token[1] = 49250": token 49250 is "<th" (the start of the thinking token) in the draft model's tokenizer, while the main model's maximum token ID is 32767. So somehow a string tokenized with the draft model's vocabulary mistakenly ends up in the main model?
I'm launching it like this:
./llama-server -m "/TheDrummer_Precog-123B-v1-GGUF/TheDrummer_Precog-123B-v1-Q6_K/TheDrummer_Precog-123B-v1-Q6_K-00001-of-00003.gguf" -dev CUDA0,CUDA1,CUDA2,CUDA3,CUDA4 -ts 18,17,17,18,19 -sm layer -c 35000 -b 2048 -ub 2048 -ngl 89 -t 7 -fa auto --no-mmap --no-webui --port 5001 -md "/TheDrummer_Precog-24B-v1-GGUF/TheDrummer_Precog-24B-v1-Q3_K_XL.gguf" -ngld 99 -devd CUDA5
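
To illustrate the suspected failure mode described above, here is a minimal sketch (hypothetical code, not the actual llama.cpp speculative-decoding implementation) of a guard that would reject a draft whose token IDs are not valid for the target model's vocabulary. The helper name draft_fits_target_vocab is made up; the llama_model_get_vocab / llama_vocab_n_tokens calls are from the public llama.h API.

    #include <cstdio>
    #include <vector>
    #include "llama.h"

    // Hypothetical guard: check every drafted token against the target
    // (main) model's vocabulary size before it is handed to the main model.
    // In this report, token 49250 from the draft tokenizer would fail the
    // check against a target vocabulary of 32768 IDs.
    static bool draft_fits_target_vocab(const llama_model * target_model,
                                        const std::vector<llama_token> & draft) {
        const llama_vocab * vocab   = llama_model_get_vocab(target_model);
        const int32_t       n_vocab = llama_vocab_n_tokens(vocab);
        for (size_t i = 0; i < draft.size(); ++i) {
            if (draft[i] < 0 || draft[i] >= n_vocab) {
                fprintf(stderr, "draft token[%zu] = %d outside target vocab (n_vocab = %d)\n",
                        i, (int) draft[i], n_vocab);
                return false; // fall back to plain decoding instead of failing the batch
            }
        }
        return true;
    }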
First Bad Commit
No response
Relevant log output
slot update_slots: id 0 | task 97 | accepted 0/16 draft tokens, new n_tokens = 1321
srv update_slots: run slots completed
que start_loop: waiting for new tasks
que start_loop: processing new tasks
que start_loop: processing task, id = 98
que start_loop: update slots
srv update_slots: posting NEXT_RESPONSE
que post: new task, id = 99, front = 0
slot update_slots: id 0 | task 97 | slot decode token, n_ctx = 35072, n_tokens = 1322, truncated = 0
srv update_slots: decoding batch, n_tokens = 1
clear_adapter_lora: call
set_embeddings: value = 0
init: the tokens of sequence 0 in the input batch have inconsistent sequence positions:
- the last position stored in the memory module of the context (i.e. the KV cache) for sequence 0 is X = 1319
- the tokens for sequence 0 in the input batch have a starting position of Y = 1321
it is required that the sequence positions remain consecutive: Y = X + 1
decode: failed to initialize batch
llama_decode: failed to decode, ret = -1
srv update_slots: Invalid input batch. i = 0, n_batch = 2048, ret = -1
srv send_error: task id = 97, error: Invalid input batch.
srv send: sending result for task id = 97
srv send: task id = 97 pushed to result queue
slot release: id 0 | task 97 | stop processing: n_tokens = 1322, truncated = 0
slot clear_slot: id 0 | task -1 | clearing slot with 1322 tokens
OR
slot launch_slot_: id 0 | task -1 | sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
slot launch_slot_: id 0 | task 394 | processing task
slot update_slots: id 0 | task 394 | new prompt, n_ctx_slot = 35072, n_keep = 0, task.n_tokens = 1320
slot update_slots: id 0 | task 394 | need to evaluate at least 1 token for each active slot (n_past = 1320, task.n_tokens() = 1320)
slot update_slots: id 0 | task 394 | n_past was set to 1319
slot update_slots: id 0 | task 394 | n_tokens = 1319, memory_seq_rm [1319, end)
slot update_slots: id 0 | task 394 | prompt processing progress, n_tokens = 1320, batch.n_tokens = 1, progress = 1.000000
slot update_slots: id 0 | task 394 | prompt done, n_tokens = 1320, batch.n_tokens = 1
init: invalid token[1] = 49250
decode: failed to initialize batch
llama_decode: failed to decode, ret = -1
init: the tokens of sequence 0 in the input batch have inconsistent sequence positions:
- the last position stored in the memory module of the context (i.e. the KV cache) for sequence 0 is X = 1319
- the tokens for sequence 0 in the input batch have a starting position of Y = 1321
it is required that the sequence positions remain consecutive: Y = X + 1
decode: failed to initialize batch
llama_decode: failed to decode, ret = -1
srv update_slots: Invalid input batch. i = 0, n_batch = 2048, ret = -1
srv send_error: task id = 394, error: Invalid input batch.
slot release: id 0 | task 394 | stop processing: n_tokens = 1322, truncated = 0
slot clear_slot: id 0 | task -1 | clearing slot with 1322 tokens
srv stop: cancel task, id_task = 394
srv log_server_r: request: POST /completion 192.168.XXX.XXX 200
srv update_slots: all slots are idle
slot get_availabl: id 0 | task -1 | selected slot by LRU, t_last = 19962994937
slot launch_slot_: id 0 | task -1 | sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
slot launch_slot_: id 0 | task 398 | processing task
slot update_slots: id 0 | task 398 | new prompt, n_ctx_slot = 35072, n_keep = 0, task.n_tokens = 1320
slot update_slots: id 0 | task 398 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id 0 | task 398 | prompt processing progress, n_tokens = 1320, batch.n_tokens = 1320, progress = 1.000000
slot update_slots: id 0 | task 398 | prompt done, n_tokens = 1320, batch.n_tokens = 1320
init: invalid token[1] = 49250
decode: failed to initialize batch
llama_decode: failed to decode, ret = -1
THEN IT CRASHED
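
For reference, the "inconsistent sequence positions" error above is the decoder enforcing that batch positions continue exactly where the KV cache left off. A minimal illustration of that invariant (illustrative code, not the llama.cpp implementation), using the X/Y names from the log:

    #include <cstdint>
    #include <cstdio>

    // The rule from the log: for each sequence, the first position in the
    // new batch (Y) must equal the last position already stored in the
    // KV cache (X) plus 1.
    static bool positions_consecutive(int64_t x_last_cached, int64_t y_batch_start) {
        return y_batch_start == x_last_cached + 1;
    }

    int main() {
        // Values from the log: X = 1319, Y = 1321 -> position 1320 was
        // skipped, so the batch is rejected and llama_decode returns -1.
        printf("valid = %d\n", positions_consecutive(1319, 1321)); // valid = 0
        return 0;
    }

In the log, the slot's token count (n_tokens = 1321/1322) runs ahead of the cache's last stored position (1319), which would be consistent with the invalid draft token never making it into the cache.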