System Info
Environment:
- Docker image: ghcr.io/huggingface/text-generation-inference:latest
- GPU: Single Nvidia RTX 4090
- Model: Qwen2.5-VL-3B-Instruct
Information
- Docker
- The CLI directly
Tasks
- An officially supported command
- My own modifications
Reproduction
When I load only the base model, everything runs successfully. However, when I try to load a LoRA adapter on top of it, the shard fails to start with the following error:
2025-10-10T04:56:41.424770Z INFO text_generation_launcher: Args {
model_id: "/data/models/Qwen/Qwen2.5-VL-3B-Instruct/",
revision: None,
validation_workers: 2,
sharded: None,
num_shard: None,
quantize: None,
speculate: None,
dtype: None,
kv_cache_dtype: None,
trust_remote_code: false,
max_concurrent_requests: 128,
max_best_of: 2,
max_stop_sequences: 4,
max_top_n_tokens: 5,
max_input_tokens: None,
max_input_length: None,
max_total_tokens: None,
waiting_served_ratio: 0.3,
max_batch_prefill_tokens: None,
max_batch_total_tokens: None,
max_waiting_tokens: 20,
max_batch_size: None,
cuda_graphs: None,
hostname: "45deee43948c",
port: 80,
prometheus_port: 9000,
shard_uds_path: "/tmp/text-generation-server",
master_addr: "localhost",
master_port: 29500,
huggingface_hub_cache: None,
weights_cache_override: None,
disable_custom_kernels: false,
cuda_memory_fraction: 1.0,
rope_scaling: None,
rope_factor: None,
json_output: false,
otlp_endpoint: None,
otlp_service_name: "text-generation-inference.router",
cors_allow_origin: [],
api_key: None,
watermark_gamma: None,
watermark_delta: None,
ngrok: false,
ngrok_authtoken: None,
ngrok_edge: None,
tokenizer_config_path: None,
disable_grammar_support: false,
env: false,
max_client_batch_size: 4,
lora_adapters: Some(
"/data/output/Qwen2.5-VL-3B-Instruct-LoRA-Oceanarium/checkpoint-103",
),
usage_stats: On,
payload_limit: 2000000,
enable_prefill_logprobs: false,
graceful_termination_timeout: 90,
}
2025-10-10T04:56:42.094905Z INFO text_generation_launcher: Disabling prefix caching because of VLM model
2025-10-10T04:56:42.094914Z INFO text_generation_launcher: Using attention flashinfer - Prefix caching 0
2025-10-10T04:56:42.124502Z WARN text_generation_launcher: Unkown compute for card nvidia-geforce-rtx-4090
2025-10-10T04:56:42.139062Z INFO text_generation_launcher: Default `max_batch_prefill_tokens` to 10000
2025-10-10T04:56:42.139075Z INFO text_generation_launcher: Using default cuda graphs [1, 2, 4, 8, 16, 32]
2025-10-10T04:56:42.139154Z INFO download: text_generation_launcher: Starting check and download process for /data/models/Qwen/Qwen2.5-VL-3B-Instruct/
2025-10-10T04:56:44.131476Z INFO text_generation_launcher: Files are already present on the host. Skipping download.
2025-10-10T04:56:44.548817Z INFO download: text_generation_launcher: Successfully downloaded weights for /data/models/Qwen/Qwen2.5-VL-3B-Instruct/
2025-10-10T04:56:44.548877Z INFO download: text_generation_launcher: Starting check and download process for /data/output/Qwen2.5-VL-3B-Instruct-LoRA-Oceanarium/checkpoint-103
2025-10-10T04:56:46.526146Z INFO text_generation_launcher: Files are already present on the host. Skipping download.
2025-10-10T04:56:46.956859Z INFO download: text_generation_launcher: Successfully downloaded weights for /data/output/Qwen2.5-VL-3B-Instruct-LoRA-Oceanarium/checkpoint-103
2025-10-10T04:56:46.957039Z INFO shard-manager: text_generation_launcher: Starting shard rank=0
2025-10-10T04:56:49.016998Z INFO text_generation_launcher: Using prefix caching = False
2025-10-10T04:56:49.017018Z INFO text_generation_launcher: Using Attention = flashinfer
2025-10-10T04:56:52.520365Z WARN text_generation_launcher: LoRA adapters enabled (experimental feature).
2025-10-10T04:56:52.520383Z WARN text_generation_launcher: LoRA adapters incompatible with CUDA Graphs. Disabling CUDA Graphs.
2025-10-10T04:56:55.241607Z INFO text_generation_launcher: Using prefill chunking = False
2025-10-10T04:56:55.311710Z ERROR text_generation_launcher: Error when initializing model
Traceback (most recent call last):
File "/usr/src/.venv/bin/text-generation-server", line 10, in <module>
sys.exit(app())
File "/usr/src/.venv/lib/python3.11/site-packages/typer/main.py", line 323, in __call__
return get_command(self)(*args, **kwargs)
File "/usr/src/.venv/lib/python3.11/site-packages/click/core.py", line 1161, in __call__
return self.main(*args, **kwargs)
File "/usr/src/.venv/lib/python3.11/site-packages/typer/core.py", line 740, in main
return _main(
File "/usr/src/.venv/lib/python3.11/site-packages/typer/core.py", line 195, in _main
rv = self.invoke(ctx)
File "/usr/src/.venv/lib/python3.11/site-packages/click/core.py", line 1697, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/usr/src/.venv/lib/python3.11/site-packages/click/core.py", line 1443, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/usr/src/.venv/lib/python3.11/site-packages/click/core.py", line 788, in invoke
return __callback(*args, **kwargs)
File "/usr/src/.venv/lib/python3.11/site-packages/typer/main.py", line 698, in wrapper
return callback(**use_params)
File "/usr/src/server/text_generation_server/cli.py", line 119, in serve
server.serve(
File "/usr/src/server/text_generation_server/server.py", line 313, in serve
asyncio.run(
File "/root/.local/share/uv/python/cpython-3.11.11-linux-x86_64-gnu/lib/python3.11/asyncio/runners.py", line 190, in run
return runner.run(main)
File "/root/.local/share/uv/python/cpython-3.11.11-linux-x86_64-gnu/lib/python3.11/asyncio/runners.py", line 118, in run
return self._loop.run_until_complete(task)
File "/root/.local/share/uv/python/cpython-3.11.11-linux-x86_64-gnu/lib/python3.11/asyncio/base_events.py", line 641, in run_until_complete
self.run_forever()
File "/root/.local/share/uv/python/cpython-3.11.11-linux-x86_64-gnu/lib/python3.11/asyncio/base_events.py", line 608, in run_forever
self._run_once()
File "/root/.local/share/uv/python/cpython-3.11.11-linux-x86_64-gnu/lib/python3.11/asyncio/base_events.py", line 1936, in _run_once
handle._run()
File "/root/.local/share/uv/python/cpython-3.11.11-linux-x86_64-gnu/lib/python3.11/asyncio/events.py", line 84, in _run
self._context.run(self._callback, *self._args)
> File "/usr/src/server/text_generation_server/server.py", line 266, in serve_inner
model = get_model_with_lora_adapters(
File "/usr/src/server/text_generation_server/models/__init__.py", line 1830, in get_model_with_lora_adapters
target_to_layer = build_layer_weight_lookup(model.model)
File "/usr/src/server/text_generation_server/utils/adapter.py", line 307, in build_layer_weight_lookup
m = model.text_model.model
File "/usr/src/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1940, in __getattr__
raise AttributeError(
AttributeError: 'Qwen2Model' object has no attribute 'model'
2025-10-10T04:56:56.472363Z ERROR shard-manager: text_generation_launcher: Shard complete standard error output:
2025-10-10 04:56:47.736 | INFO | text_generation_server.utils.import_utils:<module>:76 - Detected system cuda
/usr/src/.venv/lib/python3.11/site-packages/mamba_ssm/ops/selective_scan_interface.py:158: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
@custom_fwd
/usr/src/.venv/lib/python3.11/site-packages/mamba_ssm/ops/selective_scan_interface.py:231: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
@custom_bwd
/usr/src/.venv/lib/python3.11/site-packages/mamba_ssm/ops/triton/layernorm.py:507: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
@custom_fwd
/usr/src/.venv/lib/python3.11/site-packages/mamba_ssm/ops/triton/layernorm.py:566: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
@custom_bwd
Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
You are using a model of type qwen2_5_vl to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
╭───────────────────── Traceback (most recent call last) ──────────────────────╮
│ /usr/src/server/text_generation_server/cli.py:119 in serve │
│ │
│ 116 │ │ raise RuntimeError( │
│ 117 │ │ │ "Only 1 can be set between `dtype` and `quantize`, as they │
│ 118 │ │ ) │
│ ❱ 119 │ server.serve( │
│ 120 │ │ model_id, │
│ 121 │ │ lora_adapters, │
│ 122 │ │ revision, │
│ │
│ ╭───────────────────────────────── locals ─────────────────────────────────╮ │
│ │ dtype = None │ │
│ │ json_output = True │ │
│ │ kv_cache_dtype = None │ │
│ │ logger_level = 'INFO' │ │
│ │ lora_adapters = [ │ │
│ │ │ AdapterInfo( │ │
│ │ │ │ │ │
│ │ id='/data/output/Qwen2.5-VL-3B-Instruct-LoRA-Oceana… │ │
│ │ │ │ path=None, │ │
│ │ │ │ revision=None │ │
│ │ │ ) │ │
│ │ ] │ │
│ │ max_input_tokens = None │ │
│ │ model_id = '/data/models/Qwen/Qwen2.5-VL-3B-Instruct/' │ │
│ │ otlp_endpoint = None │ │
│ │ otlp_service_name = 'text-generation-inference.router' │ │
│ │ quantize = None │ │
│ │ revision = None │ │
│ │ server = <module 'text_generation_server.server' from │ │
│ │ '/usr/src/server/text_generation_server/server.py'> │ │
│ │ sharded = False │ │
│ │ speculate = None │ │
│ │ trust_remote_code = False │ │
│ │ uds_path = PosixPath('/tmp/text-generation-server') │ │
│ ╰──────────────────────────────────────────────────────────────────────────╯ │
│ │
│ /usr/src/server/text_generation_server/server.py:313 in serve │
│ │
│ 310 │ │ while signal_handler.KEEP_PROCESSING: │
│ 311 │ │ │ await asyncio.sleep(0.5) │
│ 312 │ │
│ ❱ 313 │ asyncio.run( │
│ 314 │ │ serve_inner( │
│ 315 │ │ │ model_id, │
│ 316 │ │ │ lora_adapters, │
│ │
│ ╭───────────────────────────────── locals ─────────────────────────────────╮ │
│ │ dtype = None │ │
│ │ kv_cache_dtype = None │ │
│ │ lora_adapters = [ │ │
│ │ │ AdapterInfo( │ │
│ │ │ │ │ │
│ │ id='/data/output/Qwen2.5-VL-3B-Instruct-LoRA-Oceana… │ │
│ │ │ │ path=None, │ │
│ │ │ │ revision=None │ │
│ │ │ ) │ │
│ │ ] │ │
│ │ max_input_tokens = None │ │
│ │ model_id = '/data/models/Qwen/Qwen2.5-VL-3B-Instruct/' │ │
│ │ quantize = None │ │
│ │ revision = None │ │
│ │ sharded = False │ │
│ │ speculate = None │ │
│ │ trust_remote_code = False │ │
│ │ uds_path = PosixPath('/tmp/text-generation-server') │ │
│ ╰──────────────────────────────────────────────────────────────────────────╯ │
│ │
│ /root/.local/share/uv/python/cpython-3.11.11-linux-x86_64-gnu/lib/python3.11 │
│ /asyncio/runners.py:190 in run │
│ │
│ 187 │ │ │ "asyncio.run() cannot be called from a running event loop" │
│ 188 │ │
│ 189 │ with Runner(debug=debug) as runner: │
│ ❱ 190 │ │ return runner.run(main) │
│ 191 │
│ 192 │
│ 193 def _cancel_all_tasks(loop): │
│ │
│ ╭───────────────────────────────── locals ─────────────────────────────────╮ │
│ │ debug = None │ │
│ │ main = <coroutine object serve.<locals>.serve_inner at 0x74dbb4a7cf70> │ │
│ │ runner = <asyncio.runners.Runner object at 0x74dbb5f23290> │ │
│ ╰──────────────────────────────────────────────────────────────────────────╯ │
│ │
│ /root/.local/share/uv/python/cpython-3.11.11-linux-x86_64-gnu/lib/python3.11 │
│ /asyncio/runners.py:118 in run │
│ │
│ 115 │ │ │
│ 116 │ │ self._interrupt_count = 0 │
│ 117 │ │ try: │
│ ❱ 118 │ │ │ return self._loop.run_until_complete(task) │
│ 119 │ │ except exceptions.CancelledError: │
│ 120 │ │ │ if self._interrupt_count > 0: │
│ 121 │ │ │ │ uncancel = getattr(task, "uncancel", None) │
│ │
│ ╭───────────────────────────────── locals ─────────────────────────────────╮ │
│ │ context = <_contextvars.Context object at 0x74dbb5e07140> │ │
│ │ coro = <coroutine object serve.<locals>.serve_inner at │ │
│ │ 0x74dbb4a7cf70> │ │
│ │ self = <asyncio.runners.Runner object at 0x74dbb5f23290> │ │
│ │ sigint_handler = functools.partial(<bound method Runner._on_sigint of │ │
│ │ <asyncio.runners.Runner object at 0x74dbb5f23290>>, │ │
│ │ main_task=<Task finished name='Task-1' │ │
│ │ coro=<serve.<locals>.serve_inner() done, defined at │ │
│ │ /usr/src/server/text_generation_server/server.py:242> │ │
│ │ exception=AttributeError("'Qwen2Model' object has no │ │
│ │ attribute 'model'")>) │ │
│ │ task = <Task finished name='Task-1' │ │
│ │ coro=<serve.<locals>.serve_inner() done, defined at │ │
│ │ /usr/src/server/text_generation_server/server.py:242> │ │
│ │ exception=AttributeError("'Qwen2Model' object has no │ │
│ │ attribute 'model'")> │ │
│ ╰──────────────────────────────────────────────────────────────────────────╯ │
│ │
│ /root/.local/share/uv/python/cpython-3.11.11-linux-x86_64-gnu/lib/python3.11 │
│ /asyncio/base_events.py:654 in run_until_complete │
│ │
│ 651 │ │ if not future.done(): │
│ 652 │ │ │ raise RuntimeError('Event loop stopped before Future comp │
│ 653 │ │ │
│ ❱ 654 │ │ return future.result() │
│ 655 │ │
│ 656 │ def stop(self): │
│ 657 │ │ """Stop running the event loop. │
│ │
│ ╭───────────────────────────────── locals ─────────────────────────────────╮ │
│ │ future = <Task finished name='Task-1' │ │
│ │ coro=<serve.<locals>.serve_inner() done, defined at │ │
│ │ /usr/src/server/text_generation_server/server.py:242> │ │
│ │ exception=AttributeError("'Qwen2Model' object has no │ │
│ │ attribute 'model'")> │ │
│ │ new_task = False │ │
│ │ self = <_UnixSelectorEventLoop running=False closed=True │ │
│ │ debug=False> │ │
│ ╰──────────────────────────────────────────────────────────────────────────╯ │
│ │
│ /usr/src/server/text_generation_server/server.py:266 in serve_inner │
│ │
│ 263 │ │ │ server_urls = [local_url] │
│ 264 │ │ │
│ 265 │ │ try: │
│ ❱ 266 │ │ │ model = get_model_with_lora_adapters( │
│ 267 │ │ │ │ model_id, │
│ 268 │ │ │ │ lora_adapters, │
│ 269 │ │ │ │ revision, │
│ │
│ ╭───────────────────────────────── locals ─────────────────────────────────╮ │
│ │ adapter_to_index = {} │ │
│ │ dtype = None │ │
│ │ kv_cache_dtype = None │ │
│ │ local_url = 'unix:///tmp/text-generation-server-0' │ │
│ │ lora_adapters = [ │ │
│ │ │ AdapterInfo( │ │
│ │ │ │ │ │
│ │ id='/data/output/Qwen2.5-VL-3B-Instruct-LoRA-Oce… │ │
│ │ │ │ path=None, │ │
│ │ │ │ revision=None │ │
│ │ │ ) │ │
│ │ ] │ │
│ │ max_input_tokens = None │ │
│ │ model_id = '/data/models/Qwen/Qwen2.5-VL-3B-Instruct/' │ │
│ │ quantize = None │ │
│ │ revision = None │ │
│ │ server_urls = ['unix:///tmp/text-generation-server-0'] │ │
│ │ sharded = False │ │
│ │ speculate = None │ │
│ │ trust_remote_code = False │ │
│ │ uds_path = PosixPath('/tmp/text-generation-server') │ │
│ │ unix_socket_template = 'unix://{}-{}' │ │
│ ╰──────────────────────────────────────────────────────────────────────────╯ │
│ │
│ /usr/src/server/text_generation_server/models/__init__.py:1830 in │
│ get_model_with_lora_adapters │
│ │
│ 1827 │ ) │
│ 1828 │ │
│ 1829 │ if len(lora_adapters) > 0: │
│ ❱ 1830 │ │ target_to_layer = build_layer_weight_lookup(model.model) │
│ 1831 │ │ │
│ 1832 │ │ for index, adapter in enumerate(lora_adapters): │
│ 1833 │ │ │ # The AdapterParameters object allows for merging multipl │
│ │
│ ╭───────────────────────────────── locals ─────────────────────────────────╮ │
│ │ adapter_to_index = {} │ │
│ │ dtype = None │ │
│ │ kv_cache_dtype = None │ │
│ │ lora_adapter_ids = [ │ │
│ │ │ │ │
│ │ '/data/output/Qwen2.5-VL-3B-Instruct-LoRA-Oceanariu… │ │
│ │ ] │ │
│ │ lora_adapters = [ │ │
│ │ │ AdapterInfo( │ │
│ │ │ │ │ │
│ │ id='/data/output/Qwen2.5-VL-3B-Instruct-LoRA-Oceana… │ │
│ │ │ │ path=None, │ │
│ │ │ │ revision=None │ │
│ │ │ ) │ │
│ │ ] │ │
│ │ max_input_tokens = None │ │
│ │ model = <text_generation_server.models.vlm_causal_lm.VlmCau… │ │
│ │ object at 0x74dbb5f15b90> │ │
│ │ model_id = '/data/models/Qwen/Qwen2.5-VL-3B-Instruct/' │ │
│ │ quantize = None │ │
│ │ revision = None │ │
│ │ sharded = False │ │
│ │ speculate = None │ │
│ │ trust_remote_code = False │ │
│ ╰──────────────────────────────────────────────────────────────────────────╯ │
│ │
│ /usr/src/server/text_generation_server/utils/adapter.py:307 in │
│ build_layer_weight_lookup │
│ │
│ 304 │ if hasattr(model, "language_model"): │
│ 305 │ │ m = model.language_model.model │
│ 306 │ elif hasattr(model, "text_model"): │
│ ❱ 307 │ │ m = model.text_model.model │
│ 308 │ else: │
│ 309 │ │ m = model.model │
│ 310 │
│ │
│ ╭───────────────────────────────── locals ─────────────────────────────────╮ │
│ │ model = Qwen2_5VLForConditionalGeneration( │ │
│ │ (embed_tokens): TensorParallelEmbedding() │ │
│ │ (visual): Qwen2_5VisionModel( │ │
│ │ │ (patch_embedding): Conv3d(3, 1280, kernel_size=(2, 14, 14), │ │
│ │ stride=(2, 14, 14), bias=False) │ │
│ │ │ (blocks): ModuleList( │ │
│ │ │ (0-31): 32 x Qwen2_5VLVisionBlock( │ │
│ │ │ │ (attn): Qwen2_5VLAttention( │ │
│ │ │ │ (qkv): TensorParallelColumnLinear( │ │
│ │ │ │ │ (linear): FastLinear() │ │
│ │ │ │ ) │ │
│ │ │ │ (proj): TensorParallelRowLinear( │ │
│ │ │ │ │ (linear): FastLinear() │ │
│ │ │ │ ) │ │
│ │ │ │ ) │ │
│ │ │ │ (norm1): FastRMSNorm() │ │
│ │ │ │ (norm2): FastRMSNorm() │ │
│ │ │ │ (mlp): Qwen2_5VLVisionMLP( │ │
│ │ │ │ (activation_fn): SiLU() │ │
│ │ │ │ (up): TensorParallelColumnLinear( │ │
│ │ │ │ │ (linear): FastLinear() │ │
│ │ │ │ ) │ │
│ │ │ │ (gate): TensorParallelColumnLinear( │ │
│ │ │ │ │ (linear): FastLinear() │ │
│ │ │ │ ) │ │
│ │ │ │ (down): TensorParallelRowLinear( │ │
│ │ │ │ │ (linear): FastLinear() │ │
│ │ │ │ ) │ │
│ │ │ │ ) │ │
│ │ │ ) │ │
│ │ │ ) │ │
│ │ │ (merger): Qwen2_5VLPatchMerger( │ │
│ │ │ (patch_merger_ln_q): FastRMSNorm() │ │
│ │ │ (fc1): TensorParallelColumnLinear( │ │
│ │ │ │ (linear): FastLinear() │ │
│ │ │ ) │ │
│ │ │ (fc2): TensorParallelRowLinear( │ │
│ │ │ │ (linear): FastLinear() │ │
│ │ │ ) │ │
│ │ │ ) │ │
│ │ ) │ │
│ │ (text_model): Qwen2Model( │ │
│ │ │ (layers): ModuleList( │ │
│ │ │ (0-35): 36 x Qwen2Layer( │ │
│ │ │ │ (self_attn): Qwen2Attention( │ │
│ │ │ │ (rotary_emb): │ │
│ │ RotaryPositionEmbeddingMultimodalSections() │ │
│ │ │ │ (query_key_value): TensorParallelMultiAdapterLinear( │ │
│ │ │ │ │ (base_layer): TensorParallelColumnLinear( │ │
│ │ │ │ │ (linear): FastLinear() │ │
│ │ │ │ │ ) │ │
│ │ │ │ ) │ │
│ │ │ │ (o_proj): TensorParallelAdapterRowLinear( │ │
│ │ │ │ │ (base_layer): TensorParallelRowLinear( │ │
│ │ │ │ │ (linear): FastLinear() │ │
│ │ │ │ │ ) │ │
│ │ │ │ ) │ │
│ │ │ │ ) │ │
│ │ │ │ (mlp): Qwen2MLP( │ │
│ │ │ │ (act): SiLU() │ │
│ │ │ │ (gate_up_proj): TensorParallelMultiAdapterLinear( │ │
│ │ │ │ │ (base_layer): TensorParallelColumnLinear( │ │
│ │ │ │ │ (linear): FastLinear() │ │
│ │ │ │ │ ) │ │
│ │ │ │ ) │ │
│ │ │ │ (down_proj): TensorParallelAdapterRowLinear( │ │
│ │ │ │ │ (base_layer): TensorParallelRowLinear( │ │
│ │ │ │ │ (linear): FastLinear() │ │
│ │ │ │ │ ) │ │
│ │ │ │ ) │ │
│ │ │ │ ) │ │
│ │ │ │ (input_layernorm): FastRMSNorm() │ │
│ │ │ │ (post_attention_layernorm): FastRMSNorm() │ │
│ │ │ ) │ │
│ │ │ ) │ │
│ │ │ (norm): FastRMSNorm() │ │
│ │ ) │ │
│ │ (lm_head): SpeculativeHead( │ │
│ │ │ (head): TensorParallelHead( │ │
│ │ │ (linear): FastLinear() │ │
│ │ │ ) │ │
│ │ ) │ │
│ │ ) │ │
│ ╰──────────────────────────────────────────────────────────────────────────╯ │
│ │
│ /usr/src/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py:1940 │
│ in __getattr__ │
│ │
│ 1937 │ │ │ modules = self.__dict__["_modules"] │
│ 1938 │ │ │ if name in modules: │
│ 1939 │ │ │ │ return modules[name] │
│ ❱ 1940 │ │ raise AttributeError( │
│ 1941 │ │ │ f"'{type(self).__name__}' object has no attribute '{name} │
│ 1942 │ │ ) │
│ 1943 │
│ │
│ ╭───────────────────────────────── locals ─────────────────────────────────╮ │
│ │ _buffers = {} │ │
│ │ _parameters = {} │ │
│ │ modules = { │ │
│ │ │ 'layers': ModuleList( │ │
│ │ (0-35): 36 x Qwen2Layer( │ │
│ │ │ (self_attn): Qwen2Attention( │ │
│ │ │ (rotary_emb): │ │
│ │ RotaryPositionEmbeddingMultimodalSections() │ │
│ │ │ (query_key_value): TensorParallelMultiAdapterLinear( │ │
│ │ │ │ (base_layer): TensorParallelColumnLinear( │ │
│ │ │ │ (linear): FastLinear() │ │
│ │ │ │ ) │ │
│ │ │ ) │ │
│ │ │ (o_proj): TensorParallelAdapterRowLinear( │ │
│ │ │ │ (base_layer): TensorParallelRowLinear( │ │
│ │ │ │ (linear): FastLinear() │ │
│ │ │ │ ) │ │
│ │ │ ) │ │
│ │ │ ) │ │
│ │ │ (mlp): Qwen2MLP( │ │
│ │ │ (act): SiLU() │ │
│ │ │ (gate_up_proj): TensorParallelMultiAdapterLinear( │ │
│ │ │ │ (base_layer): TensorParallelColumnLinear( │ │
│ │ │ │ (linear): FastLinear() │ │
│ │ │ │ ) │ │
│ │ │ ) │ │
│ │ │ (down_proj): TensorParallelAdapterRowLinear( │ │
│ │ │ │ (base_layer): TensorParallelRowLinear( │ │
│ │ │ │ (linear): FastLinear() │ │
│ │ │ │ ) │ │
│ │ │ ) │ │
│ │ │ ) │ │
│ │ │ (input_layernorm): FastRMSNorm() │ │
│ │ │ (post_attention_layernorm): FastRMSNorm() │ │
│ │ ) │ │
│ │ ), │ │
│ │ │ 'norm': FastRMSNorm() │ │
│ │ } │ │
│ │ name = 'model' │ │
│ │ self = Qwen2Model( │ │
│ │ (layers): ModuleList( │ │
│ │ │ (0-35): 36 x Qwen2Layer( │ │
│ │ │ (self_attn): Qwen2Attention( │ │
│ │ │ │ (rotary_emb): │ │
│ │ RotaryPositionEmbeddingMultimodalSections() │ │
│ │ │ │ (query_key_value): │ │
│ │ TensorParallelMultiAdapterLinear( │ │
│ │ │ │ (base_layer): TensorParallelColumnLinear( │ │
│ │ │ │ │ (linear): FastLinear() │ │
│ │ │ │ ) │ │
│ │ │ │ ) │ │
│ │ │ │ (o_proj): TensorParallelAdapterRowLinear( │ │
│ │ │ │ (base_layer): TensorParallelRowLinear( │ │
│ │ │ │ │ (linear): FastLinear() │ │
│ │ │ │ ) │ │
│ │ │ │ ) │ │
│ │ │ ) │ │
│ │ │ (mlp): Qwen2MLP( │ │
│ │ │ │ (act): SiLU() │ │
│ │ │ │ (gate_up_proj): TensorParallelMultiAdapterLinear( │ │
│ │ │ │ (base_layer): TensorParallelColumnLinear( │ │
│ │ │ │ │ (linear): FastLinear() │ │
│ │ │ │ ) │ │
│ │ │ │ ) │ │
│ │ │ │ (down_proj): TensorParallelAdapterRowLinear( │ │
│ │ │ │ (base_layer): TensorParallelRowLinear( │ │
│ │ │ │ │ (linear): FastLinear() │ │
│ │ │ │ ) │ │
│ │ │ │ ) │ │
│ │ │ ) │ │
│ │ │ (input_layernorm): FastRMSNorm() │ │
│ │ │ (post_attention_layernorm): FastRMSNorm() │ │
│ │ │ ) │ │
│ │ ) │ │
│ │ (norm): FastRMSNorm() │ │
│ │ ) │ │
│ ╰──────────────────────────────────────────────────────────────────────────╯ │
╰──────────────────────────────────────────────────────────────────────────────╯
AttributeError: 'Qwen2Model' object has no attribute 'model' rank=0
Error: ShardCannotStart
2025-10-10T04:56:56.565882Z ERROR text_generation_launcher: Shard 0 failed to start
2025-10-10T04:56:56.565887Z INFO text_generation_launcher: Shutting down shards
Here is the command I used:
sudo docker run --gpus all --shm-size 16g -p 8080:80 \
-v $PWD/output:/data/output \
-v /mnt/nvme_ssd/models:/data/models \
ghcr.io/huggingface/text-generation-inference \
--model-id "/data/models/Qwen/Qwen2.5-VL-3B-Instruct/" \
--lora-adapters "/data/output/Qwen2.5-VL-3B-Instruct-LoRA-Oceanarium/checkpoint-103"Expected behavior
The server should start and serve requests with the LoRA adapter loaded, just as it does with the base model alone.
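For context, the traceback points at build_layer_weight_lookup in server/text_generation_server/utils/adapter.py: it takes the `model.text_model.model` branch, but according to the locals dump, Qwen2_5VLForConditionalGeneration's `text_model` attribute is already the inner Qwen2Model (it exposes `layers` and `norm`, not a nested `model`). A minimal sketch of a guard that would avoid the AttributeError, assuming that mismatch is the only issue (untested, the helper name is hypothetical, just to illustrate the idea):

# Hypothetical helper mirroring the branch logic in build_layer_weight_lookup.
# Assumption: for Qwen2.5-VL, `model.text_model` already is the Qwen2Model
# (no nested `.model`), so fall back to it directly instead of failing.
def unwrap_text_model(model):
    if hasattr(model, "language_model"):
        return model.language_model.model
    if hasattr(model, "text_model"):
        inner = model.text_model
        return inner.model if hasattr(inner, "model") else inner
    return model.model

If that reading is correct, either adjusting this branch or wrapping text_model so it exposes a `.model` attribute should let the LoRA weight lookup proceed for Qwen2.5-VL.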