
Conversation

Contributor

@sts07142 sts07142 commented Nov 27, 2025

Purpose

fixes: #29563

Fix tokenizer resolution so that the --tokenizer argument is optional when using a GGUF model (local and remote).

As-Is: vllm serve <gguf_model> --tokenizer <tokenizer>
To-Be: vllm serve <gguf_model> (--tokenizer optional)
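For illustration only, here is a minimal Python sketch (not vLLM's actual implementation) of how a spec such as unsloth/Qwen3-0.6B-GGUF:IQ1_S could be resolved to a concrete .gguf file via huggingface_hub; the helper name resolve_gguf_file and the filename-matching rule are assumptions.

# Hypothetical sketch: resolve a "repo:QUANT" spec to a local .gguf file.
from huggingface_hub import hf_hub_download, list_repo_files

def resolve_gguf_file(model_spec: str) -> str:
    repo_id, _, quant = model_spec.partition(":")
    gguf_files = [f for f in list_repo_files(repo_id) if f.endswith(".gguf")]
    if quant:
        # Keep only files whose name mentions the requested quantization type.
        gguf_files = [f for f in gguf_files if quant.lower() in f.lower()]
    if not gguf_files:
        raise FileNotFoundError(f"No matching .gguf file found in {repo_id!r}")
    # Download the first match (or reuse the local cache) and return its path.
    return hf_hub_download(repo_id=repo_id, filename=gguf_files[0])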

Test Plan

vllm serve unsloth/Qwen3-0.6B-GGUF:IQ1_S

Test Result

vllm serve unsloth/Qwen3-0.6B-GGUF:IQ1_S
(APIServer pid=77097) INFO 11-27 14:25:21 [api_server.py:2056] vLLM API server version 0.11.2.dev296+gd9d342d21.d20251127
(APIServer pid=77097) INFO 11-27 14:25:21 [utils.py:253] non-default args: {'model_tag': 'unsloth/Qwen3-0.6B-GGUF:IQ1_S', 'model': 'unsloth/Qwen3-0.6B-GGUF:IQ1_S'}
(APIServer pid=77097) INFO 11-27 14:25:22 [model.py:623] Resolved architecture: Qwen3ForCausalLM
(APIServer pid=77097) INFO 11-27 14:25:22 [model.py:1732] Using max model len 40960
(APIServer pid=77097) INFO 11-27 14:25:22 [scheduler.py:207] Chunked prefill is enabled with max_num_batched_tokens=8192.
(EngineCore_DP0 pid=77312) INFO 11-27 14:25:39 [core.py:93] Initializing a V1 LLM engine (v0.11.2.dev296+gd9d342d21.d20251127) with config: model='unsloth/Qwen3-0.6B-GGUF:IQ1_S', speculative_config=None, tokenizer='unsloth/Qwen3-0.6B-GGUF:IQ1_S', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=40960, download_dir=None, load_format=gguf, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=gguf, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=unsloth/Qwen3-0.6B-GGUF:IQ1_S, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer'], 'compile_mm_encoder': False, 'compile_sizes': [], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>}, 'local_cache_dir': None}
(EngineCore_DP0 pid=77312) INFO 11-27 14:25:39 [parallel_state.py:1219] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://172.16.1.13:38149 backend=nccl
(EngineCore_DP0 pid=77312) INFO 11-27 14:25:39 [parallel_state.py:1427] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank 0
(EngineCore_DP0 pid=77312) INFO 11-27 14:25:39 [gpu_model_runner.py:3412] Starting to load model unsloth/Qwen3-0.6B-GGUF:IQ1_S...
(EngineCore_DP0 pid=77312) INFO 11-27 14:25:49 [cuda.py:416] Using FLASH_ATTN attention backend out of potential backends: ['FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION']
(EngineCore_DP0 pid=77312) INFO 11-27 14:25:54 [gpu_model_runner.py:3494] Model loading took 0.2154 GiB memory and 13.978488 seconds
(EngineCore_DP0 pid=77312) INFO 11-27 14:25:57 [backends.py:655] Using cache directory: /home/name/.cache/vllm/torch_compile_cache/d364c824cb/rank_0_0/backbone for vLLM's torch.compile
(EngineCore_DP0 pid=77312) INFO 11-27 14:25:57 [backends.py:715] Dynamo bytecode transform time: 3.00 s
(EngineCore_DP0 pid=77312) INFO 11-27 14:26:00 [backends.py:216] Directly load the compiled graph(s) for dynamic shape from the cache, took 3.184 s
(EngineCore_DP0 pid=77312) INFO 11-27 14:26:01 [monitor.py:34] torch.compile takes 6.19 s in total
(EngineCore_DP0 pid=77312) INFO 11-27 14:26:02 [gpu_worker.py:349] Available KV cache memory: 65.28 GiB
(EngineCore_DP0 pid=77312) INFO 11-27 14:26:02 [kv_cache_utils.py:1286] GPU KV cache size: 611,136 tokens
(EngineCore_DP0 pid=77312) INFO 11-27 14:26:02 [kv_cache_utils.py:1291] Maximum concurrency for 40,960 tokens per request: 14.92x
(EngineCore_DP0 pid=77312) 2025-11-27 14:26:02,535 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
(EngineCore_DP0 pid=77312) 2025-11-27 14:26:02,548 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|█████████| 51/51 [00:01<00:00, 33.69it/s]
Capturing CUDA graphs (decode, FULL): 100%|████████████████████████████| 51/51 [00:01<00:00, 46.53it/s]
(EngineCore_DP0 pid=77312) INFO 11-27 14:26:05 [gpu_model_runner.py:4411] Graph capturing finished in 3 secs, took 0.71 GiB
(EngineCore_DP0 pid=77312) INFO 11-27 14:26:05 [core.py:253] init engine (profile, create kv cache, warmup model) took 11.36 seconds
(APIServer pid=77097) INFO 11-27 14:26:18 [api_server.py:1804] Supported tasks: ['generate']
(APIServer pid=77097) INFO 11-27 14:26:19 [api_server.py:2134] Starting vLLM API server 0 on http://0.0.0.0:8000
(APIServer pid=77097) INFO 11-27 14:26:19 [launcher.py:38] Available routes are:
(APIServer pid=77097) INFO 11-27 14:26:19 [launcher.py:46] Route: /openapi.json, Methods: HEAD, GET
(APIServer pid=77097) INFO 11-27 14:26:19 [launcher.py:46] Route: /docs, Methods: HEAD, GET
(APIServer pid=77097) INFO 11-27 14:26:19 [launcher.py:46] Route: /docs/oauth2-redirect, Methods: HEAD, GET
(APIServer pid=77097) INFO 11-27 14:26:19 [launcher.py:46] Route: /redoc, Methods: HEAD, GET
(APIServer pid=77097) INFO 11-27 14:26:19 [launcher.py:46] Route: /health, Methods: GET
(APIServer pid=77097) INFO 11-27 14:26:19 [launcher.py:46] Route: /load, Methods: GET
(APIServer pid=77097) INFO 11-27 14:26:19 [launcher.py:46] Route: /pause, Methods: POST
(APIServer pid=77097) INFO 11-27 14:26:19 [launcher.py:46] Route: /resume, Methods: POST
(APIServer pid=77097) INFO 11-27 14:26:19 [launcher.py:46] Route: /is_paused, Methods: GET
(APIServer pid=77097) INFO 11-27 14:26:19 [launcher.py:46] Route: /tokenize, Methods: POST
(APIServer pid=77097) INFO 11-27 14:26:19 [launcher.py:46] Route: /detokenize, Methods: POST
(APIServer pid=77097) INFO 11-27 14:26:19 [launcher.py:46] Route: /v1/models, Methods: GET
(APIServer pid=77097) INFO 11-27 14:26:19 [launcher.py:46] Route: /version, Methods: GET
(APIServer pid=77097) INFO 11-27 14:26:19 [launcher.py:46] Route: /v1/responses, Methods: POST
(APIServer pid=77097) INFO 11-27 14:26:19 [launcher.py:46] Route: /v1/responses/{response_id}, Methods: GET
(APIServer pid=77097) INFO 11-27 14:26:19 [launcher.py:46] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=77097) INFO 11-27 14:26:19 [launcher.py:46] Route: /v1/messages, Methods: POST
(APIServer pid=77097) INFO 11-27 14:26:19 [launcher.py:46] Route: /v1/chat/completions, Methods: POST
(APIServer pid=77097) INFO 11-27 14:26:19 [launcher.py:46] Route: /v1/completions, Methods: POST
(APIServer pid=77097) INFO 11-27 14:26:19 [launcher.py:46] Route: /v1/embeddings, Methods: POST
(APIServer pid=77097) INFO 11-27 14:26:19 [launcher.py:46] Route: /pooling, Methods: POST
(APIServer pid=77097) INFO 11-27 14:26:19 [launcher.py:46] Route: /classify, Methods: POST
(APIServer pid=77097) INFO 11-27 14:26:19 [launcher.py:46] Route: /score, Methods: POST
(APIServer pid=77097) INFO 11-27 14:26:19 [launcher.py:46] Route: /v1/score, Methods: POST
(APIServer pid=77097) INFO 11-27 14:26:19 [launcher.py:46] Route: /v1/audio/transcriptions, Methods: POST
(APIServer pid=77097) INFO 11-27 14:26:19 [launcher.py:46] Route: /v1/audio/translations, Methods: POST
(APIServer pid=77097) INFO 11-27 14:26:19 [launcher.py:46] Route: /rerank, Methods: POST
(APIServer pid=77097) INFO 11-27 14:26:19 [launcher.py:46] Route: /v1/rerank, Methods: POST
(APIServer pid=77097) INFO 11-27 14:26:19 [launcher.py:46] Route: /v2/rerank, Methods: POST
(APIServer pid=77097) INFO 11-27 14:26:19 [launcher.py:46] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=77097) INFO 11-27 14:26:19 [launcher.py:46] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=77097) INFO 11-27 14:26:19 [launcher.py:46] Route: /inference/v1/generate, Methods: POST
(APIServer pid=77097) INFO 11-27 14:26:19 [launcher.py:46] Route: /ping, Methods: GET
(APIServer pid=77097) INFO 11-27 14:26:19 [launcher.py:46] Route: /ping, Methods: POST
(APIServer pid=77097) INFO 11-27 14:26:19 [launcher.py:46] Route: /invocations, Methods: POST
(APIServer pid=77097) INFO 11-27 14:26:19 [launcher.py:46] Route: /metrics, Methods: GET
(APIServer pid=77097) INFO:     Started server process [77097]
(APIServer pid=77097) INFO:     Waiting for application startup.
(APIServer pid=77097) INFO:     Application startup complete.
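
Once the server above is up, a quick sanity check can be run against the OpenAI-compatible endpoint shown in the log (port 8000, route /v1/completions). This is a minimal sketch assuming the openai Python client is installed; the placeholder API key is arbitrary since no key is configured.

# Minimal client-side check against the server started above.
from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="EMPTY")
completion = client.completions.create(
    model="unsloth/Qwen3-0.6B-GGUF:IQ1_S",
    prompt="Hello, my name is",
    max_tokens=16,
)
print(completion.choices[0].text)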

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing a test command.
  • The test results, such as pasting a before/after results comparison or e2e results.
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: Injae Ryou <injaeryou@gmail.com>
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request improves the experience of loading GGUF models by making the --tokenizer argument optional. The changes automatically detect the correct GGUF file from a HuggingFace repository based on the quantization type. The implementation is mostly solid, but I've identified two high-severity issues: one is overly broad exception handling, which could hide bugs, and the other is a potential silent failure if the GGUF tokenizer file isn't found, which could lead to incorrect model behavior. Addressing these will make the new feature more robust.
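
As a rough illustration of the pattern being requested here (the helper and exception choices below are assumptions, not the PR's actual code): catch only the Hub errors that are expected and raise loudly when no tokenizer file is found, rather than falling back silently.

# Sketch: narrow exception handling and an explicit failure path.
from huggingface_hub.utils import EntryNotFoundError, RepositoryNotFoundError

def locate_gguf_tokenizer(repo_id: str, list_gguf_files) -> str:
    try:
        candidates = list_gguf_files(repo_id)
    except (RepositoryNotFoundError, EntryNotFoundError) as err:
        # Narrow exception types instead of a bare `except Exception`.
        raise ValueError(f"Could not list GGUF files for {repo_id!r}") from err
    if not candidates:
        # Surface the problem instead of silently using a default tokenizer.
        raise FileNotFoundError(f"No GGUF tokenizer file found in {repo_id!r}")
    return candidates[0]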

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Injae Ryou <injaeryou@gmail.com>

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Signed-off-by: Injae Ryou <injaeryou@gmail.com>
@sts07142 sts07142 changed the title from "feat: improve loading tokenizer when loadling gguf models" to "[Feature] Improve loading tokenizer when loadling gguf models" Nov 27, 2025
Contributor Author

sts07142 commented Nov 27, 2025

@Isotr0py
Review it please!

@sts07142 sts07142 changed the title from "[Feature] Improve loading tokenizer when loadling gguf models" to "[Feature] Improve tokenizer loading when loading GGUF models" Nov 27, 2025
@sts07142 sts07142 changed the title from "[Feature] Improve tokenizer loading when loading GGUF models" to "[BugFix] Optional tokenizer argument when loading GGUF models" Nov 27, 2025
@Isotr0py Isotr0py self-assigned this Nov 27, 2025
- list_filtered_repo_files

Signed-off-by: Injae Ryou <injaeryou@gmail.com>
Signed-off-by: Injae Ryou <injaeryou@gmail.com>
@sts07142 sts07142 force-pushed the feat/improve-tokenizer-remote-gguf branch from cbd7d7b to ac41103 on November 27, 2025 14:07
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Member

@Isotr0py Isotr0py left a comment


LGTM now! Thanks for fixing this!

@Isotr0py Isotr0py enabled auto-merge (squash) November 27, 2025 14:39
@github-actions github-actions bot added the ready label (ONLY add when PR is ready to merge/full CI is needed) Nov 27, 2025
@Isotr0py Isotr0py merged commit 0840abd into vllm-project:main Nov 27, 2025
49 checks passed
