
Conversation

Contributor

@sts07142 sts07142 commented Nov 27, 2025

Purpose

fixes: #29563

Fix tokenizer resolution so that the --tokenizer argument is optional when using a GGUF model (local and remote).

As-Is: vllm serve <gguf_model> --tokenizer <tokenizer>
To-Be: vllm serve <gguf_model> (--tokenizer optional)
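For illustration only, here is a minimal Python sketch (not vLLM's actual implementation) of how a spec such as unsloth/Qwen3-0.6B-GGUF:IQ1_S could be resolved to a concrete .gguf file via huggingface_hub; the helper name resolve_gguf_file and the filename-matching rule are assumptions.

# Hypothetical sketch: resolve a "repo:QUANT" spec to a local .gguf file.
from huggingface_hub import hf_hub_download, list_repo_files

def resolve_gguf_file(model_spec: str) -> str:
    repo_id, _, quant = model_spec.partition(":")
    gguf_files = [f for f in list_repo_files(repo_id) if f.endswith(".gguf")]
    if quant:
        # Keep only files whose name mentions the requested quantization type.
        gguf_files = [f for f in gguf_files if quant.lower() in f.lower()]
    if not gguf_files:
        raise FileNotFoundError(f"No matching .gguf file found in {repo_id!r}")
    # Download the first match (or reuse the local cache) and return its path.
    return hf_hub_download(repo_id=repo_id, filename=gguf_files[0])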

Test Plan

vllm serve unsloth/Qwen3-0.6B-GGUF:IQ1_S

Test Result

vllm serve unsloth/Qwen3-0.6B-GGUF:IQ1_S
(APIServer pid=77097) INFO 11-27 14:25:21 [api_server.py:2056] vLLM API server version 0.11.2.dev296+gd9d342d21.d20251127
(APIServer pid=77097) INFO 11-27 14:25:21 [utils.py:253] non-default args: {'model_tag': 'unsloth/Qwen3-0.6B-GGUF:IQ1_S', 'model': 'unsloth/Qwen3-0.6B-GGUF:IQ1_S'}
(APIServer pid=77097) INFO 11-27 14:25:22 [model.py:623] Resolved architecture: Qwen3ForCausalLM
(APIServer pid=77097) INFO 11-27 14:25:22 [model.py:1732] Using max model len 40960
(APIServer pid=77097) INFO 11-27 14:25:22 [scheduler.py:207] Chunked prefill is enabled with max_num_batched_tokens=8192.
(EngineCore_DP0 pid=77312) INFO 11-27 14:25:39 [core.py:93] Initializing a V1 LLM engine (v0.11.2.dev296+gd9d342d21.d20251127) with config: model='unsloth/Qwen3-0.6B-GGUF:IQ1_S', speculative_config=None, tokenizer='unsloth/Qwen3-0.6B-GGUF:IQ1_S', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=40960, download_dir=None, load_format=gguf, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=gguf, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=unsloth/Qwen3-0.6B-GGUF:IQ1_S, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer'], 'compile_mm_encoder': False, 'compile_sizes': [], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>}, 'local_cache_dir': None}
(EngineCore_DP0 pid=77312) INFO 11-27 14:25:39 [parallel_state.py:1219] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://172.16.1.13:38149 backend=nccl
(EngineCore_DP0 pid=77312) INFO 11-27 14:25:39 [parallel_state.py:1427] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank 0
(EngineCore_DP0 pid=77312) INFO 11-27 14:25:39 [gpu_model_runner.py:3412] Starting to load model unsloth/Qwen3-0.6B-GGUF:IQ1_S...
(EngineCore_DP0 pid=77312) INFO 11-27 14:25:49 [cuda.py:416] Using FLASH_ATTN attention backend out of potential backends: ['FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION']
(EngineCore_DP0 pid=77312) INFO 11-27 14:25:54 [gpu_model_runner.py:3494] Model loading took 0.2154 GiB memory and 13.978488 seconds
(EngineCore_DP0 pid=77312) INFO 11-27 14:25:57 [backends.py:655] Using cache directory: /home/name/.cache/vllm/torch_compile_cache/d364c824cb/rank_0_0/backbone for vLLM's torch.compile
(EngineCore_DP0 pid=77312) INFO 11-27 14:25:57 [backends.py:715] Dynamo bytecode transform time: 3.00 s
(EngineCore_DP0 pid=77312) INFO 11-27 14:26:00 [backends.py:216] Directly load the compiled graph(s) for dynamic shape from the cache, took 3.184 s
(EngineCore_DP0 pid=77312) INFO 11-27 14:26:01 [monitor.py:34] torch.compile takes 6.19 s in total
(EngineCore_DP0 pid=77312) INFO 11-27 14:26:02 [gpu_worker.py:349] Available KV cache memory: 65.28 GiB
(EngineCore_DP0 pid=77312) INFO 11-27 14:26:02 [kv_cache_utils.py:1286] GPU KV cache size: 611,136 tokens
(EngineCore_DP0 pid=77312) INFO 11-27 14:26:02 [kv_cache_utils.py:1291] Maximum concurrency for 40,960 tokens per request: 14.92x
(EngineCore_DP0 pid=77312) 2025-11-27 14:26:02,535 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
(EngineCore_DP0 pid=77312) 2025-11-27 14:26:02,548 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|█████████| 51/51 [00:01<00:00, 33.69it/s]
Capturing CUDA graphs (decode, FULL): 100%|████████████████████████████| 51/51 [00:01<00:00, 46.53it/s]
(EngineCore_DP0 pid=77312) INFO 11-27 14:26:05 [gpu_model_runner.py:4411] Graph capturing finished in 3 secs, took 0.71 GiB
(EngineCore_DP0 pid=77312) INFO 11-27 14:26:05 [core.py:253] init engine (profile, create kv cache, warmup model) took 11.36 seconds
(APIServer pid=77097) INFO 11-27 14:26:18 [api_server.py:1804] Supported tasks: ['generate']
(APIServer pid=77097) INFO 11-27 14:26:19 [api_server.py:2134] Starting vLLM API server 0 on http://0.0.0.0:8000
(APIServer pid=77097) INFO 11-27 14:26:19 [launcher.py:38] Available routes are:
(APIServer pid=77097) INFO 11-27 14:26:19 [launcher.py:46] Route: /openapi.json, Methods: HEAD, GET
(APIServer pid=77097) INFO 11-27 14:26:19 [launcher.py:46] Route: /docs, Methods: HEAD, GET
(APIServer pid=77097) INFO 11-27 14:26:19 [launcher.py:46] Route: /docs/oauth2-redirect, Methods: HEAD, GET
(APIServer pid=77097) INFO 11-27 14:26:19 [launcher.py:46] Route: /redoc, Methods: HEAD, GET
(APIServer pid=77097) INFO 11-27 14:26:19 [launcher.py:46] Route: /health, Methods: GET
(APIServer pid=77097) INFO 11-27 14:26:19 [launcher.py:46] Route: /load, Methods: GET
(APIServer pid=77097) INFO 11-27 14:26:19 [launcher.py:46] Route: /pause, Methods: POST
(APIServer pid=77097) INFO 11-27 14:26:19 [launcher.py:46] Route: /resume, Methods: POST
(APIServer pid=77097) INFO 11-27 14:26:19 [launcher.py:46] Route: /is_paused, Methods: GET
(APIServer pid=77097) INFO 11-27 14:26:19 [launcher.py:46] Route: /tokenize, Methods: POST
(APIServer pid=77097) INFO 11-27 14:26:19 [launcher.py:46] Route: /detokenize, Methods: POST
(APIServer pid=77097) INFO 11-27 14:26:19 [launcher.py:46] Route: /v1/models, Methods: GET
(APIServer pid=77097) INFO 11-27 14:26:19 [launcher.py:46] Route: /version, Methods: GET
(APIServer pid=77097) INFO 11-27 14:26:19 [launcher.py:46] Route: /v1/responses, Methods: POST
(APIServer pid=77097) INFO 11-27 14:26:19 [launcher.py:46] Route: /v1/responses/{response_id}, Methods: GET
(APIServer pid=77097) INFO 11-27 14:26:19 [launcher.py:46] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=77097) INFO 11-27 14:26:19 [launcher.py:46] Route: /v1/messages, Methods: POST
(APIServer pid=77097) INFO 11-27 14:26:19 [launcher.py:46] Route: /v1/chat/completions, Methods: POST
(APIServer pid=77097) INFO 11-27 14:26:19 [launcher.py:46] Route: /v1/completions, Methods: POST
(APIServer pid=77097) INFO 11-27 14:26:19 [launcher.py:46] Route: /v1/embeddings, Methods: POST
(APIServer pid=77097) INFO 11-27 14:26:19 [launcher.py:46] Route: /pooling, Methods: POST
(APIServer pid=77097) INFO 11-27 14:26:19 [launcher.py:46] Route: /classify, Methods: POST
(APIServer pid=77097) INFO 11-27 14:26:19 [launcher.py:46] Route: /score, Methods: POST
(APIServer pid=77097) INFO 11-27 14:26:19 [launcher.py:46] Route: /v1/score, Methods: POST
(APIServer pid=77097) INFO 11-27 14:26:19 [launcher.py:46] Route: /v1/audio/transcriptions, Methods: POST
(APIServer pid=77097) INFO 11-27 14:26:19 [launcher.py:46] Route: /v1/audio/translations, Methods: POST
(APIServer pid=77097) INFO 11-27 14:26:19 [launcher.py:46] Route: /rerank, Methods: POST
(APIServer pid=77097) INFO 11-27 14:26:19 [launcher.py:46] Route: /v1/rerank, Methods: POST
(APIServer pid=77097) INFO 11-27 14:26:19 [launcher.py:46] Route: /v2/rerank, Methods: POST
(APIServer pid=77097) INFO 11-27 14:26:19 [launcher.py:46] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=77097) INFO 11-27 14:26:19 [launcher.py:46] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=77097) INFO 11-27 14:26:19 [launcher.py:46] Route: /inference/v1/generate, Methods: POST
(APIServer pid=77097) INFO 11-27 14:26:19 [launcher.py:46] Route: /ping, Methods: GET
(APIServer pid=77097) INFO 11-27 14:26:19 [launcher.py:46] Route: /ping, Methods: POST
(APIServer pid=77097) INFO 11-27 14:26:19 [launcher.py:46] Route: /invocations, Methods: POST
(APIServer pid=77097) INFO 11-27 14:26:19 [launcher.py:46] Route: /metrics, Methods: GET
(APIServer pid=77097) INFO:     Started server process [77097]
(APIServer pid=77097) INFO:     Waiting for application startup.
(APIServer pid=77097) INFO:     Application startup complete.
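
Once the server above is up, a quick sanity check can be run against the OpenAI-compatible endpoint shown in the log (port 8000, route /v1/completions). This is a minimal sketch assuming the openai Python client is installed; the placeholder API key is arbitrary since no key is configured.

# Minimal client-side check against the server started above.
from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="EMPTY")
completion = client.completions.create(
    model="unsloth/Qwen3-0.6B-GGUF:IQ1_S",
    prompt="Hello, my name is",
    max_tokens=16,
)
print(completion.choices[0].text)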

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing a test command.
  • The test results, such as pasting a before/after results comparison or e2e results.
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: Injae Ryou <injaeryou@gmail.com>
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request improves the experience of loading GGUF models by making the --tokenizer argument optional. The changes automatically detect the correct GGUF file from a HuggingFace repository based on the quantization type. The implementation is mostly solid, but I've identified two high-severity issues: one is overly broad exception handling, which could hide bugs, and the other is a potential silent failure if the GGUF tokenizer file isn't found, which could lead to incorrect model behavior. Addressing these will make the new feature more robust.
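
As a rough illustration of the pattern being requested here (the helper and exception choices below are assumptions, not the PR's actual code): catch only the Hub errors that are expected and raise loudly when no tokenizer file is found, rather than falling back silently.

# Sketch: narrow exception handling and an explicit failure path.
from huggingface_hub.utils import EntryNotFoundError, RepositoryNotFoundError

def locate_gguf_tokenizer(repo_id: str, list_gguf_files) -> str:
    try:
        candidates = list_gguf_files(repo_id)
    except (RepositoryNotFoundError, EntryNotFoundError) as err:
        # Narrow exception types instead of a bare `except Exception`.
        raise ValueError(f"Could not list GGUF files for {repo_id!r}") from err
    if not candidates:
        # Surface the problem instead of silently using a default tokenizer.
        raise FileNotFoundError(f"No GGUF tokenizer file found in {repo_id!r}")
    return candidates[0]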

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Injae Ryou <injaeryou@gmail.com>

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Signed-off-by: Injae Ryou <injaeryou@gmail.com>
@sts07142 sts07142 changed the title from "feat: improve loading tokenizer when loadling gguf models" to "[Feature] Improve loading tokenizer when loadling gguf models" Nov 27, 2025
Contributor Author

sts07142 commented Nov 27, 2025

@Isotr0py
Review it please!

@sts07142 sts07142 changed the title from "[Feature] Improve loading tokenizer when loadling gguf models" to "[Feature] Improve tokenizer loading when loading GGUF models" Nov 27, 2025
@sts07142 sts07142 changed the title from "[Feature] Improve tokenizer loading when loading GGUF models" to "[BugFix] Optional tokenizer argument when loading GGUF models" Nov 27, 2025
@Isotr0py Isotr0py self-assigned this Nov 27, 2025
- list_filtered_repo_files

Signed-off-by: Injae Ryou <injaeryou@gmail.com>
Signed-off-by: Injae Ryou <injaeryou@gmail.com>
@sts07142 sts07142 force-pushed the feat/improve-tokenizer-remote-gguf branch from cbd7d7b to ac41103 on November 27, 2025 14:07
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Member

@Isotr0py Isotr0py left a comment


LGTM now! Thanks for fixing this!

@Isotr0py Isotr0py enabled auto-merge (squash) November 27, 2025 14:39
@github-actions github-actions bot added the ready label (ONLY add when PR is ready to merge/full CI is needed) Nov 27, 2025
@Isotr0py Isotr0py merged commit 0840abd into vllm-project:main Nov 27, 2025
49 checks passed
