[Bug] Inference with the DeepSeek V3 Q4_K_M quantized model on Ascend 310P fails with "call hccl api failed" and "Failed to allocate memory" #1536

### Checklist

- [ ] 1. I have searched related issues but cannot get the expected help.
- [ ] 2. The bug has not been fixed in the latest version.
- [ ] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
- [ ] 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/kvcache-ai/ktransformers/discussions. Otherwise, it will be closed.
- [ ] 5. To help the community, I will use Chinese/English or attach a Chinese/English translation if using another language. Non-Chinese/English content without translation may be closed.

### Describe the bug

Running `python /home/ktransformers-main/ktransformers/server/main.py` with the DeepSeek-V3 Q4_K_M GGUF model fails with the following error:
2025-10-29 03:01:27,790 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
found flashinfer
/usr/local/lib64/python3.11/site-packages/torch_npu/contrib/transfer_to_npu.py:298: ImportWarning:
*************************************************************************************************************
The torch.Tensor.cuda and torch.nn.Module.cuda are replaced with torch.Tensor.npu and torch.nn.Module.npu now..
The torch.cuda.DoubleTensor is replaced with torch.npu.FloatTensor cause the double type is not supported now..
The backend in torch.distributed.init_process_group set to hccl now..
The torch.cuda.* and torch.cuda.amp.* are replaced with torch.npu.* and torch.npu.amp.* now..
The device parameters have been replaced with npu in the function below:
torch.logspace, torch.randint, torch.hann_window, torch.rand, torch.full_like, torch.ones_like, torch.rand_like, torch.randperm, torch.arange, torch.frombuffer, torch.normal, torch._empty_per_channel_affine_quantized, torch.empty_strided, torch.empty_like, torch.scalar_tensor, torch.tril_indices, torch.bartlett_window, torch.ones, torch.sparse_coo_tensor, torch.randn, torch.kaiser_window, torch.tensor, torch.triu_indices, torch.as_tensor, torch.zeros, torch.randint_like, torch.full, torch.eye, torch._sparse_csr_tensor_unsafe, torch.empty, torch._sparse_coo_tensor_unsafe, torch.blackman_window, torch.zeros_like, torch.range, torch.sparse_csr_tensor, torch.randn_like, torch.from_file, torch._cudnn_init_dropout_state, torch._empty_affine_quantized, torch.linspace, torch.hamming_window, torch.empty_quantized, torch._pin_memory, torch.autocast, torch.load, torch.set_default_device, torch.Tensor.new_empty, torch.Tensor.new_empty_strided, torch.Tensor.new_full, torch.Tensor.new_ones, torch.Tensor.new_tensor, torch.Tensor.new_zeros, torch.Tensor.to, torch.Tensor.pin_memory, torch.nn.Module.to, torch.nn.Module.to_empty
*************************************************************************************************************

warnings.warn(msg, ImportWarning)
/usr/local/lib64/python3.11/site-packages/torch_npu/contrib/transfer_to_npu.py:255: RuntimeWarning: torch.jit.script and torch.jit.script_method will be disabled by transfer_to_npu, which currently does not support them, if you need to enable them, please do not use transfer_to_npu.
warnings.warn(msg, RuntimeWarning)
/usr/local/lib64/python3.11/site-packages/ktransformers/server/api/ollama/completions.py:257: PydanticDeprecatedSince20: Support for class-based config is deprecated, use ConfigDict instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.12/migration/
class OllamaShowResponse(BaseModel):
The module name (originally ) is not a valid Python identifier. Please rename the original module to avoid import issues.
The module name (originally ) is not a valid Python identifier. Please rename the original module to avoid import issues.
flash_attn not found, flashinfer unit test needed it. If you are using balance serve, ignore this.
set start method
Connected to server at tcp://localhost:41617
2025-10-29 03:01:36,526 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
found flashinfer
/usr/local/lib64/python3.11/site-packages/torch_npu/contrib/transfer_to_npu.py:298: ImportWarning:
*************************************************************************************************************
The torch.Tensor.cuda and torch.nn.Module.cuda are replaced with torch.Tensor.npu and torch.nn.Module.npu now..
The torch.cuda.DoubleTensor is replaced with torch.npu.FloatTensor cause the double type is not supported now..
The backend in torch.distributed.init_process_group set to hccl now..
The torch.cuda.* and torch.cuda.amp.* are replaced with torch.npu.* and torch.npu.amp.* now..
The device parameters have been replaced with npu in the function below:
torch.logspace, torch.randint, torch.hann_window, torch.rand, torch.full_like, torch.ones_like, torch.rand_like, torch.randperm, torch.arange, torch.frombuffer, torch.normal, torch._empty_per_channel_affine_quantized, torch.empty_strided, torch.empty_like, torch.scalar_tensor, torch.tril_indices, torch.bartlett_window, torch.ones, torch.sparse_coo_tensor, torch.randn, torch.kaiser_window, torch.tensor, torch.triu_indices, torch.as_tensor, torch.zeros, torch.randint_like, torch.full, torch.eye, torch._sparse_csr_tensor_unsafe, torch.empty, torch._sparse_coo_tensor_unsafe, torch.blackman_window, torch.zeros_like, torch.range, torch.sparse_csr_tensor, torch.randn_like, torch.from_file, torch._cudnn_init_dropout_state, torch._empty_affine_quantized, torch.linspace, torch.hamming_window, torch.empty_quantized, torch._pin_memory, torch.autocast, torch.load, torch.set_default_device, torch.Tensor.new_empty, torch.Tensor.new_empty_strided, torch.Tensor.new_full, torch.Tensor.new_ones, torch.Tensor.new_tensor, torch.Tensor.new_zeros, torch.Tensor.to, torch.Tensor.pin_memory, torch.nn.Module.to, torch.nn.Module.to_empty
*************************************************************************************************************

warnings.warn(msg, ImportWarning)
/usr/local/lib64/python3.11/site-packages/torch_npu/contrib/transfer_to_npu.py:255: RuntimeWarning: torch.jit.script and torch.jit.script_method will be disabled by transfer_to_npu, which currently does not support them, if you need to enable them, please do not use transfer_to_npu.
warnings.warn(msg, RuntimeWarning)
/usr/local/lib64/python3.11/site-packages/ktransformers/server/api/ollama/completions.py:257: PydanticDeprecatedSince20: Support for class-based config is deprecated, use ConfigDict instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.12/migration/
class OllamaShowResponse(BaseModel):
flash_attn not found, flashinfer unit test needed it. If you are using balance serve, ignore this.
start method already set to spawn
2025-10-29 03:01:48,232 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
found flashinfer
/usr/local/lib64/python3.11/site-packages/torch_npu/contrib/transfer_to_npu.py:298: ImportWarning:
*************************************************************************************************************
The torch.Tensor.cuda and torch.nn.Module.cuda are replaced with torch.Tensor.npu and torch.nn.Module.npu now..
The torch.cuda.DoubleTensor is replaced with torch.npu.FloatTensor cause the double type is not supported now..
The backend in torch.distributed.init_process_group set to hccl now..
The torch.cuda.* and torch.cuda.amp.* are replaced with torch.npu.* and torch.npu.amp.* now..
The device parameters have been replaced with npu in the function below:
torch.logspace, torch.randint, torch.hann_window, torch.rand, torch.full_like, torch.ones_like, torch.rand_like, torch.randperm, torch.arange, torch.frombuffer, torch.normal, torch._empty_per_channel_affine_quantized, torch.empty_strided, torch.empty_like, torch.scalar_tensor, torch.tril_indices, torch.bartlett_window, torch.ones, torch.sparse_coo_tensor, torch.randn, torch.kaiser_window, torch.tensor, torch.triu_indices, torch.as_tensor, torch.zeros, torch.randint_like, torch.full, torch.eye, torch._sparse_csr_tensor_unsafe, torch.empty, torch._sparse_coo_tensor_unsafe, torch.blackman_window, torch.zeros_like, torch.range, torch.sparse_csr_tensor, torch.randn_like, torch.from_file, torch._cudnn_init_dropout_state, torch._empty_affine_quantized, torch.linspace, torch.hamming_window, torch.empty_quantized, torch._pin_memory, torch.autocast, torch.load, torch.set_default_device, torch.Tensor.new_empty, torch.Tensor.new_empty_strided, torch.Tensor.new_full, torch.Tensor.new_ones, torch.Tensor.new_tensor, torch.Tensor.new_zeros, torch.Tensor.to, torch.Tensor.pin_memory, torch.nn.Module.to, torch.nn.Module.to_empty
*************************************************************************************************************

warnings.warn(msg, ImportWarning)
/usr/local/lib64/python3.11/site-packages/torch_npu/contrib/transfer_to_npu.py:255: RuntimeWarning: torch.jit.script and torch.jit.script_method will be disabled by transfer_to_npu, which currently does not support them, if you need to enable them, please do not use transfer_to_npu.
warnings.warn(msg, RuntimeWarning)
/usr/local/lib64/python3.11/site-packages/ktransformers/server/api/ollama/completions.py:257: PydanticDeprecatedSince20: Support for class-based config is deprecated, use ConfigDict instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.12/migration/
class OllamaShowResponse(BaseModel):
start to init process group ------rank is 0, world_size is 1
[W1029 03:01:51.812052080 socket.cpp:752] [c10d] The client socket cannot be initialized to connect to [localhost]:31777 (errno: 97 - Address family not supported by protocol).
init process group success ------rank is 0, world_size is 1
Connected to server at tcp://localhost:41617
args.architectures: DeepSeek-Coder-V2-Instruct
The module name (originally ) is not a valid Python identifier. Please rename the original module to avoid import issues.
The module name (originally ) is not a valid Python identifier. Please rename the original module to avoid import issues.
/usr/lib64/python3.11/tempfile.py:904: ResourceWarning: Implicitly cleaning up <TemporaryDirectory '/tmp/tmpf2t8y3jt'>
_warnings.warn(warn_message, ResourceWarning)
sys:1: DeprecationWarning: builtin type swigvarlink has no module attribute
Process SpawnProcess-1:
Traceback (most recent call last):
File "/usr/lib64/python3.11/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/usr/lib64/python3.11/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/lib64/python3.11/site-packages/ktransformers/server/backend/interfaces/balance_serve.py", line 403, in run_engine
engine = Engine(args, token_queue, broadcast_endpoint, kvcache_event)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib64/python3.11/site-packages/ktransformers/server/backend/interfaces/balance_serve.py", line 255, in __init__
torch.distributed.barrier(group=tp_group)
File "/usr/local/lib64/python3.11/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib64/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 4159, in barrier
work = group.barrier(opts=opts)
^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: create_config:build/CMakeFiles/torch_npu.dir/compiler_depend.ts:148 HCCL function error: hcclCommInitRootInfoConfig(numRanks, &rootInfo, rank, config, &(comm->hcclComm)), error code is 2
[ERROR] 2025-10-29-03:02:35 (PID:501951, Device:0, RankID:0) ERR02200 DIST call hccl api failed.
EL0004: [PID: 501951] 2025-10-29-03:02:35.297.548 Failed to allocate memory.
Possible Cause: Available memory is insufficient.
Solution: Close applications not in use.
TraceBack (most recent call last):
Failed to allocate resource[DeviceMemory] with info [size:32]. Reason: Memory resources are exhausted.

/usr/lib64/python3.11/multiprocessing/process.py:330: ResourceWarning: Unclosed socket <zmq.Socket(zmq.PUB) at 0xfffd27733460>
traceback.print_exc()
/usr/lib64/python3.11/multiprocessing/process.py:330: ResourceWarning: Unclosed context <zmq.Context() at 0xfffcea209c10>
traceback.print_exc()
sys:1: DeprecationWarning: builtin type swigvarlink has no module attribute
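
For triage, it may help to reduce a log like the one above to the failing HCCL call and the Ascend error codes. The following is a minimal sketch using only the Python standard library. `summarize_hccl_failure` is a hypothetical helper, and the regular expressions are assumptions fitted to the lines in this report, not a documented log format.

```python
import re

# Patterns fitted to the log lines in this report (an assumption, not a spec).
HCCL_CALL = re.compile(r"HCCL function error: (\w+)\(.*error code is (\d+)")
ASCEND_CODE = re.compile(r"\b(ERR\d{5}|EL\d{4})\b")
ALLOC_FAIL = re.compile(
    r"Failed to allocate resource\[(\w+)\] with info \[size:(\d+)\]"
)


def summarize_hccl_failure(log_text: str) -> dict:
    """Extract the failing HCCL call, Ascend error codes and failed allocation."""
    summary = {
        "hccl_call": None,
        "hccl_error_code": None,
        "ascend_codes": [],
        "resource": None,
        "size": None,
    }
    call = HCCL_CALL.search(log_text)
    if call:
        summary["hccl_call"] = call.group(1)
        summary["hccl_error_code"] = int(call.group(2))
    # Keep first-seen order while dropping repeated codes.
    summary["ascend_codes"] = list(dict.fromkeys(ASCEND_CODE.findall(log_text)))
    alloc = ALLOC_FAIL.search(log_text)
    if alloc:
        summary["resource"] = alloc.group(1)
        summary["size"] = int(alloc.group(2))
    return summary


# Sample lines copied from the log in this report.
SAMPLE = """\
RuntimeError: create_config:build/CMakeFiles/torch_npu.dir/compiler_depend.ts:148 HCCL function error: hcclCommInitRootInfoConfig(numRanks, &rootInfo, rank, config, &(comm->hcclComm)), error code is 2
[ERROR] 2025-10-29-03:02:35 (PID:501951, Device:0, RankID:0) ERR02200 DIST call hccl api failed.
EL0004: [PID: 501951] 2025-10-29-03:02:35.297.548 Failed to allocate memory.
Failed to allocate resource[DeviceMemory] with info [size:32]. Reason: Memory resources are exhausted.
"""

if __name__ == "__main__":
    print(summarize_hccl_failure(SAMPLE))
```

On the sample lines, this reports `hcclCommInitRootInfoConfig` as the failing call, which the traceback reaches through `torch.distributed.barrier(group=tp_group)`.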

### Reproduction

python /home/ktransformers-main/ktransformers/server/main.py --model_path /models/deepseek/deepseek-v3-config/ --gguf_path /models/deepseek/deepseek-v3-gguf/ --cpu_infer 120 --optimize_config_path /home/ktransformers-main/ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat-npu.yaml --backend_type balance_serve --port 31444 --architectures KDeepseekV3ForCausalLM --max_new_tokens 128 --max_batch_size 4 --use_cuda_graph --tp 1
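
Because the command above is a single long line, a flag-by-flag view can make it easier to compare against other setups. This is a minimal sketch using only the standard library; `parse_flags` is a hypothetical helper and does not reflect how ktransformers parses its own arguments.

```python
import shlex

# The reproduction command from this report, split across lines for readability.
COMMAND = (
    "python /home/ktransformers-main/ktransformers/server/main.py "
    "--model_path /models/deepseek/deepseek-v3-config/ "
    "--gguf_path /models/deepseek/deepseek-v3-gguf/ "
    "--cpu_infer 120 "
    "--optimize_config_path /home/ktransformers-main/ktransformers/optimize/"
    "optimize_rules/DeepSeek-V3-Chat-npu.yaml "
    "--backend_type balance_serve --port 31444 "
    "--architectures KDeepseekV3ForCausalLM "
    "--max_new_tokens 128 --max_batch_size 4 --use_cuda_graph --tp 1"
)


def parse_flags(command: str) -> dict:
    """Map each --flag to its value, or to True when no value follows it."""
    tokens = shlex.split(command)
    flags = {}
    i = 0
    while i < len(tokens):
        token = tokens[i]
        if token.startswith("--"):
            following = tokens[i + 1] if i + 1 < len(tokens) else None
            if following is not None and not following.startswith("--"):
                flags[token[2:]] = following
                i += 2
                continue
            flags[token[2:]] = True  # bare switch such as --use_cuda_graph
        i += 1
    return flags


if __name__ == "__main__":
    for name, value in parse_flags(COMMAND).items():
        print(f"{name:22} {value}")
```

The listing shows that the run uses a single tensor-parallel rank (`--tp 1`), which matches the `world_size is 1` lines in the log.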

### Environment

Installed RPM packages:
Ascend-cann-toolkit-8.2.RC1-linux.aarch64
Ascend-cann-nnal-8.2.RC1-linux.aarch64
Ascend-cann-kernels-310p-8.2.RC1-linux.aarch64

Key package versions installed via pip:
ktransformers 0.3.2+npu2.5.1.post1torch25aarch64
torch 2.5.1
torch-npu 2.5.1.post1
torchaudio 2.5.1
torchvision 0.20.1
transformers 4.57.1
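
A version list like the one above can be collected programmatically, so that later reports use the same format. This is a minimal sketch built on the standard-library `importlib.metadata`; `installed_versions` is a hypothetical helper, and the package names are taken from the list above.

```python
from importlib import metadata

# Distribution names as listed in this report.
PACKAGES = [
    "ktransformers",
    "torch",
    "torch-npu",
    "torchaudio",
    "torchvision",
    "transformers",
]


def installed_versions(packages):
    """Return {name: version string, or None when the package is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions(PACKAGES).items():
        print(f"{name} {version or 'not installed'}")
```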

NPU and driver information:
npu-smi info
+--------------------------------------------------------------------------------------------------------+
| npu-smi 25.2.2                                   Version: 25.2.2                                       |
+-------------------------------+-----------------+------------------------------------------------------+
| NPU     Name                  | Health          | Power(W)     Temp(C)           Hugepages-Usage(page) |
| Chip    Device                | Bus-Id          | AICore(%)    Memory-Usage(MB)                        |
+===============================+=================+======================================================+
| 1       310P3                 | OK              | NA           47                0     / 0             |
| 0       0                     | 0000:01:00.0    | 0            1872 / 23047                            |
+===============================+=================+======================================================+
+-------------------------------+-----------------+------------------------------------------------------+
| NPU     Chip                  | Process id      | Process name             | Process memory(MB)        |
+===============================+=================+======================================================+
| No running processes found in NPU 1                                                                    |
+===============================+=================+======================================================+
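
To put the "Failed to allocate memory" message in context, the memory figures can be read out of the `npu-smi info` text above. This is a minimal sketch using only the standard library; `chip_memory` is a hypothetical helper, and the regular expression is an assumption fitted to the chip row in this report, not a stable `npu-smi` output format.

```python
import re

# Matches a chip row such as: | 0 0 | 0000:01:00.0 | 0 1872 / 23047 |
# Whitespace is matched loosely so aligned and flattened output both work.
CHIP_ROW = re.compile(
    r"\|\s*(?P<chip>\d+)\s+(?P<device>\d+)\s*"
    r"\|\s*(?P<bus_id>[0-9A-Fa-f]{4}:[0-9A-Fa-f]{2}:[0-9A-Fa-f]{2}\.[0-9A-Fa-f])\s*"
    r"\|\s*(?P<aicore>\d+|NA)\s+(?P<used>\d+)\s*/\s*(?P<total>\d+)\s*\|"
)


def chip_memory(npu_smi_text: str) -> list:
    """Return used, total and free memory in MB for each chip row found."""
    rows = []
    for match in CHIP_ROW.finditer(npu_smi_text):
        used, total = int(match["used"]), int(match["total"])
        rows.append(
            {
                "bus_id": match["bus_id"],
                "used_mb": used,
                "total_mb": total,
                "free_mb": total - used,
            }
        )
    return rows


# The two data rows from the npu-smi output in this report.
SAMPLE = """\
| 1 310P3 | OK | NA 47 0 / 0 |
| 0 0 | 0000:01:00.0 | 0 1872 / 23047 |
"""

if __name__ == "__main__":
    for row in chip_memory(SAMPLE):
        print(row)
```

For the output above, this gives 21175 MB not in use at the time `npu-smi info` was run.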
