[Bug] Inference with the DeepSeek V3 Q4_K_M quantized model on Ascend 310P fails with "call hccl api failed" and "Failed to allocate memory" #1536

@fanzetian

Description

Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.
  • 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
  • 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/kvcache-ai/ktransformers/discussions. Otherwise, it will be closed.
  • 5. To help the community, I will use Chinese/English or attach a Chinese/English translation if using another language. Non-Chinese/English content without translation may be closed.

Describe the bug

Running python /home/ktransformers-main/ktransformers/server/main.py with the DeepSeek-V3 Q4_K_M GGUF model fails with the following error:
2025-10-29 03:01:27,790 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
found flashinfer
/usr/local/lib64/python3.11/site-packages/torch_npu/contrib/transfer_to_npu.py:298: ImportWarning:


The torch.Tensor.cuda and torch.nn.Module.cuda are replaced with torch.Tensor.npu and torch.nn.Module.npu now..
The torch.cuda.DoubleTensor is replaced with torch.npu.FloatTensor cause the double type is not supported now..
The backend in torch.distributed.init_process_group set to hccl now..
The torch.cuda.* and torch.cuda.amp.* are replaced with torch.npu.* and torch.npu.amp.* now..
The device parameters have been replaced with npu in the function below:
torch.logspace, torch.randint, torch.hann_window, torch.rand, torch.full_like, torch.ones_like, torch.rand_like, torch.randperm, torch.arange, torch.frombuffer, torch.normal, torch._empty_per_channel_affine_quantized, torch.empty_strided, torch.empty_like, torch.scalar_tensor, torch.tril_indices, torch.bartlett_window, torch.ones, torch.sparse_coo_tensor, torch.randn, torch.kaiser_window, torch.tensor, torch.triu_indices, torch.as_tensor, torch.zeros, torch.randint_like, torch.full, torch.eye, torch._sparse_csr_tensor_unsafe, torch.empty, torch._sparse_coo_tensor_unsafe, torch.blackman_window, torch.zeros_like, torch.range, torch.sparse_csr_tensor, torch.randn_like, torch.from_file, torch._cudnn_init_dropout_state, torch._empty_affine_quantized, torch.linspace, torch.hamming_window, torch.empty_quantized, torch._pin_memory, torch.autocast, torch.load, torch.set_default_device, torch.Tensor.new_empty, torch.Tensor.new_empty_strided, torch.Tensor.new_full, torch.Tensor.new_ones, torch.Tensor.new_tensor, torch.Tensor.new_zeros, torch.Tensor.to, torch.Tensor.pin_memory, torch.nn.Module.to, torch.nn.Module.to_empty


warnings.warn(msg, ImportWarning)
/usr/local/lib64/python3.11/site-packages/torch_npu/contrib/transfer_to_npu.py:255: RuntimeWarning: torch.jit.script and torch.jit.script_method will be disabled by transfer_to_npu, which currently does not support them, if you need to enable them, please do not use transfer_to_npu.
warnings.warn(msg, RuntimeWarning)
/usr/local/lib64/python3.11/site-packages/ktransformers/server/api/ollama/completions.py:257: PydanticDeprecatedSince20: Support for class-based config is deprecated, use ConfigDict instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.12/migration/
class OllamaShowResponse(BaseModel):
The module name (originally ) is not a valid Python identifier. Please rename the original module to avoid import issues.
The module name (originally ) is not a valid Python identifier. Please rename the original module to avoid import issues.
flash_attn not found, flashinfer unit test needed it. If you are using balance serve, ignore this.
set start method
Connected to server at tcp://localhost:41617
2025-10-29 03:01:36,526 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
found flashinfer
/usr/local/lib64/python3.11/site-packages/torch_npu/contrib/transfer_to_npu.py:298: ImportWarning:


The torch.Tensor.cuda and torch.nn.Module.cuda are replaced with torch.Tensor.npu and torch.nn.Module.npu now..
The torch.cuda.DoubleTensor is replaced with torch.npu.FloatTensor cause the double type is not supported now..
The backend in torch.distributed.init_process_group set to hccl now..
The torch.cuda.* and torch.cuda.amp.* are replaced with torch.npu.* and torch.npu.amp.* now..
The device parameters have been replaced with npu in the function below:
torch.logspace, torch.randint, torch.hann_window, torch.rand, torch.full_like, torch.ones_like, torch.rand_like, torch.randperm, torch.arange, torch.frombuffer, torch.normal, torch._empty_per_channel_affine_quantized, torch.empty_strided, torch.empty_like, torch.scalar_tensor, torch.tril_indices, torch.bartlett_window, torch.ones, torch.sparse_coo_tensor, torch.randn, torch.kaiser_window, torch.tensor, torch.triu_indices, torch.as_tensor, torch.zeros, torch.randint_like, torch.full, torch.eye, torch._sparse_csr_tensor_unsafe, torch.empty, torch._sparse_coo_tensor_unsafe, torch.blackman_window, torch.zeros_like, torch.range, torch.sparse_csr_tensor, torch.randn_like, torch.from_file, torch._cudnn_init_dropout_state, torch._empty_affine_quantized, torch.linspace, torch.hamming_window, torch.empty_quantized, torch._pin_memory, torch.autocast, torch.load, torch.set_default_device, torch.Tensor.new_empty, torch.Tensor.new_empty_strided, torch.Tensor.new_full, torch.Tensor.new_ones, torch.Tensor.new_tensor, torch.Tensor.new_zeros, torch.Tensor.to, torch.Tensor.pin_memory, torch.nn.Module.to, torch.nn.Module.to_empty


warnings.warn(msg, ImportWarning)
/usr/local/lib64/python3.11/site-packages/torch_npu/contrib/transfer_to_npu.py:255: RuntimeWarning: torch.jit.script and torch.jit.script_method will be disabled by transfer_to_npu, which currently does not support them, if you need to enable them, please do not use transfer_to_npu.
warnings.warn(msg, RuntimeWarning)
/usr/local/lib64/python3.11/site-packages/ktransformers/server/api/ollama/completions.py:257: PydanticDeprecatedSince20: Support for class-based config is deprecated, use ConfigDict instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.12/migration/
class OllamaShowResponse(BaseModel):
flash_attn not found, flashinfer unit test needed it. If you are using balance serve, ignore this.
start method already set to spawn
2025-10-29 03:01:48,232 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
found flashinfer
/usr/local/lib64/python3.11/site-packages/torch_npu/contrib/transfer_to_npu.py:298: ImportWarning:


The torch.Tensor.cuda and torch.nn.Module.cuda are replaced with torch.Tensor.npu and torch.nn.Module.npu now..
The torch.cuda.DoubleTensor is replaced with torch.npu.FloatTensor cause the double type is not supported now..
The backend in torch.distributed.init_process_group set to hccl now..
The torch.cuda.* and torch.cuda.amp.* are replaced with torch.npu.* and torch.npu.amp.* now..
The device parameters have been replaced with npu in the function below:
torch.logspace, torch.randint, torch.hann_window, torch.rand, torch.full_like, torch.ones_like, torch.rand_like, torch.randperm, torch.arange, torch.frombuffer, torch.normal, torch._empty_per_channel_affine_quantized, torch.empty_strided, torch.empty_like, torch.scalar_tensor, torch.tril_indices, torch.bartlett_window, torch.ones, torch.sparse_coo_tensor, torch.randn, torch.kaiser_window, torch.tensor, torch.triu_indices, torch.as_tensor, torch.zeros, torch.randint_like, torch.full, torch.eye, torch._sparse_csr_tensor_unsafe, torch.empty, torch._sparse_coo_tensor_unsafe, torch.blackman_window, torch.zeros_like, torch.range, torch.sparse_csr_tensor, torch.randn_like, torch.from_file, torch._cudnn_init_dropout_state, torch._empty_affine_quantized, torch.linspace, torch.hamming_window, torch.empty_quantized, torch._pin_memory, torch.autocast, torch.load, torch.set_default_device, torch.Tensor.new_empty, torch.Tensor.new_empty_strided, torch.Tensor.new_full, torch.Tensor.new_ones, torch.Tensor.new_tensor, torch.Tensor.new_zeros, torch.Tensor.to, torch.Tensor.pin_memory, torch.nn.Module.to, torch.nn.Module.to_empty


warnings.warn(msg, ImportWarning)
/usr/local/lib64/python3.11/site-packages/torch_npu/contrib/transfer_to_npu.py:255: RuntimeWarning: torch.jit.script and torch.jit.script_method will be disabled by transfer_to_npu, which currently does not support them, if you need to enable them, please do not use transfer_to_npu.
warnings.warn(msg, RuntimeWarning)
/usr/local/lib64/python3.11/site-packages/ktransformers/server/api/ollama/completions.py:257: PydanticDeprecatedSince20: Support for class-based config is deprecated, use ConfigDict instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.12/migration/
class OllamaShowResponse(BaseModel):
start to init process group ------rank is 0, world_size is 1
[W1029 03:01:51.812052080 socket.cpp:752] [c10d] The client socket cannot be initialized to connect to [localhost]:31777 (errno: 97 - Address family not supported by protocol).
init process group success ------rank is 0, world_size is 1
Connected to server at tcp://localhost:41617
args.architectures: DeepSeek-Coder-V2-Instruct
The module name (originally ) is not a valid Python identifier. Please rename the original module to avoid import issues.
The module name (originally ) is not a valid Python identifier. Please rename the original module to avoid import issues.
/usr/lib64/python3.11/tempfile.py:904: ResourceWarning: Implicitly cleaning up <TemporaryDirectory '/tmp/tmpf2t8y3jt'>
_warnings.warn(warn_message, ResourceWarning)
sys:1: DeprecationWarning: builtin type swigvarlink has no module attribute
Process SpawnProcess-1:
Traceback (most recent call last):
  File "/usr/lib64/python3.11/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib64/python3.11/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib64/python3.11/site-packages/ktransformers/server/backend/interfaces/balance_serve.py", line 403, in run_engine
    engine = Engine(args, token_queue, broadcast_endpoint, kvcache_event)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib64/python3.11/site-packages/ktransformers/server/backend/interfaces/balance_serve.py", line 255, in __init__
    torch.distributed.barrier(group=tp_group)
  File "/usr/local/lib64/python3.11/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib64/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 4159, in barrier
    work = group.barrier(opts=opts)
           ^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: create_config:build/CMakeFiles/torch_npu.dir/compiler_depend.ts:148 HCCL function error: hcclCommInitRootInfoConfig(numRanks, &rootInfo, rank, config, &(comm->hcclComm)), error code is 2
[ERROR] 2025-10-29-03:02:35 (PID:501951, Device:0, RankID:0) ERR02200 DIST call hccl api failed.
EL0004: [PID: 501951] 2025-10-29-03:02:35.297.548 Failed to allocate memory.
Possible Cause: Available memory is insufficient.
Solution: Close applications not in use.
TraceBack (most recent call last):
Failed to allocate resource[DeviceMemory] with info [size:32]. Reason: Memory resources are exhausted.

/usr/lib64/python3.11/multiprocessing/process.py:330: ResourceWarning: Unclosed socket <zmq.Socket(zmq.PUB) at 0xfffd27733460>
traceback.print_exc()
/usr/lib64/python3.11/multiprocessing/process.py:330: ResourceWarning: Unclosed context <zmq.Context() at 0xfffcea209c10>
traceback.print_exc()
sys:1: DeprecationWarning: builtin type swigvarlink has no module attribute
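
For triage, the fields that identify this failure can be pulled out of the log mechanically, since most of the output above is repeated import warnings. The following is a minimal sketch, not part of ktransformers or torch_npu: `summarize_hccl_failure` is a hypothetical helper, and its regular expressions are assumptions based only on the log lines quoted in this issue.

```python
import re

# Excerpt copied from the log above; a real run would read the full log file.
LOG_EXCERPT = """\
RuntimeError: create_config:build/CMakeFiles/torch_npu.dir/compiler_depend.ts:148 HCCL function error: hcclCommInitRootInfoConfig(numRanks, &rootInfo, rank, config, &(comm->hcclComm)), error code is 2
[ERROR] 2025-10-29-03:02:35 (PID:501951, Device:0, RankID:0) ERR02200 DIST call hccl api failed.
EL0004: [PID: 501951] 2025-10-29-03:02:35.297.548 Failed to allocate memory.
Failed to allocate resource[DeviceMemory] with info [size:32]. Reason: Memory resources are exhausted.
"""


def summarize_hccl_failure(log_text):
    """Collect the identifying fields of the failure (hypothetical helper)."""
    summary = {}

    # Name of the HCCL call that failed and the numeric code it returned.
    m = re.search(r"HCCL function error: (\w+)\(", log_text)
    summary["hccl_call"] = m.group(1) if m else None
    m = re.search(r"error code is (\d+)", log_text)
    summary["hccl_error_code"] = int(m.group(1)) if m else None

    # Device and rank that reported the error.
    m = re.search(r"\(PID:(\d+), Device:(\d+), RankID:(\d+)\)", log_text)
    summary["device"] = int(m.group(2)) if m else None
    summary["rank"] = int(m.group(3)) if m else None

    # Resource type and requested size; the log does not state the unit.
    m = re.search(r"resource\[(\w+)\] with info \[size:(\d+)\]", log_text)
    summary["resource"] = m.group(1) if m else None
    summary["requested_size"] = int(m.group(2)) if m else None

    # Ascend error identifiers such as ERR02200 and EL0004, in log order.
    summary["ascend_codes"] = re.findall(r"\b(ERR\d{5}|EL\d{4})\b", log_text)
    return summary


if __name__ == "__main__":
    for key, value in summarize_hccl_failure(LOG_EXCERPT).items():
        print(f"{key}: {value}")
```

On the excerpt, this reports that `hcclCommInitRootInfoConfig` returned error code 2 on device 0, rank 0, while requesting a `DeviceMemory` resource of size 32.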

Reproduction

python /home/ktransformers-main/ktransformers/server/main.py --model_path /models/deepseek/deepseek-v3-config/ --gguf_path /models/deepseek/deepseek-v3-gguf/ --cpu_infer 120 --optimize_config_path /home/ktransformers-main/ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat-npu.yaml --backend_type balance_serve --port 31444 --architectures KDeepseekV3ForCausalLM --max_new_tokens 128 --max_batch_size 4 --use_cuda_graph --tp 1

Environment

Installed RPMs:
Ascend-cann-toolkit-8.2.RC1-linux.aarch64
Ascend-cann-nnal-8.2.RC1-linux.aarch64
Ascend-cann-kernels-310p-8.2.RC1-linux.aarch64

Key package versions installed via pip:
ktransformers 0.3.2+npu2.5.1.post1torch25aarch64
torch 2.5.1
torch-npu 2.5.1.post1
torchaudio 2.5.1
torchvision 0.20.1
transformers 4.57.1

NPU and driver information:
npu-smi info
+--------------------------------------------------------------------------------------------------------+
| npu-smi 25.2.2                                   Version: 25.2.2                                       |
+-------------------------------+-----------------+------------------------------------------------------+
| NPU     Name                  | Health          | Power(W)     Temp(C)           Hugepages-Usage(page) |
| Chip    Device                | Bus-Id          | AICore(%)    Memory-Usage(MB)                        |
+===============================+=================+======================================================+
| 1       310P3                 | OK              | NA           47                0     / 0             |
| 0       0                     | 0000:01:00.0    | 0            1872 / 23047                            |
+===============================+=================+======================================================+
+-------------------------------+-----------------+------------------------------------------------------+
| NPU     Chip                  | Process id      | Process name             | Process memory(MB)        |
+===============================+=================+======================================================+
| No running processes found in NPU 1                                                                    |
+===============================+=================+======================================================+
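
To compare the reported device memory against the "Memory resources are exhausted" message, the Memory-Usage(MB) pair can be read from the chip row of the snapshot. This is a rough sketch under stated assumptions: `parse_memory_usage` is a hypothetical helper, and it assumes the chip row is the one carrying a PCI bus id and that the last "used / total" pair on that row is Memory-Usage(MB), as in the output above.

```python
import re

# Chip row copied from the npu-smi output above (whitespace does not matter).
NPU_SMI_ROW = "| 0 0 | 0000:01:00.0 | 0 1872 / 23047 |"

# A PCI bus id such as 0000:01:00.0 marks the chip row.
BUS_ID = re.compile(r"[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-9a-fA-F]")
# A "used / total" pair of integers.
USAGE = re.compile(r"(\d+)\s*/\s*(\d+)")


def parse_memory_usage(npu_smi_text):
    """Return (used_mb, total_mb) from the first chip row, or None."""
    for line in npu_smi_text.splitlines():
        if BUS_ID.search(line):
            pairs = USAGE.findall(line)
            if pairs:
                used_mb, total_mb = map(int, pairs[-1])
                return used_mb, total_mb
    return None


if __name__ == "__main__":
    used_mb, total_mb = parse_memory_usage(NPU_SMI_ROW)
    print(f"used={used_mb} MB, total={total_mb} MB, free={total_mb - used_mb} MB")
```

For this snapshot the helper yields 1872 MB used out of 23047 MB. The snapshot alone does not show how much device memory was free at the moment `hcclCommInitRootInfoConfig` failed, so it is a data point for triage rather than a diagnosis.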
