[P/D]Add a heartbeat mechanism to PD separation #4071

wangxiaoteng888 · 2025-11-08T08:36:21Z

What this PR does / why we need it?

Add a heartbeat mechanism to PD separation and clean up historical link information based on heartbeat status.

Does this PR introduce any user-facing change?

No

How was this patch tested?

By CI.

vLLM version: v0.11.0
vLLM main: vllm-project/vllm@83f478b

github-actions · 2025-11-08T08:36:30Z

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

A PR should do only one thing; smaller PRs enable faster reviews.
Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
Write the commit message by filling in the PR description, so that reviewers and future developers can understand the change.

If CI fails, you can run the linting and testing checks locally according to the Contributing and Testing guides.

gemini-code-assist

Code Review

This pull request introduces a heartbeat mechanism for peer-to-peer communication in the PD separation setup, which is a valuable addition for improving the robustness of the distributed system. The implementation in HeartbeatMonitor is a good starting point, but it has some critical flaws. My review focuses on improving the reliability and debuggability of this new mechanism. Specifically, I've pointed out several places where exceptions are silently ignored, which can hide serious problems. I've also identified some dead code and a suspicious error handling block that could be masking a logic bug. Addressing these points will make the heartbeat feature much more robust and maintainable.

gemini-code-assist · 2025-11-08T08:38:32Z

vllm_ascend/distributed/mooncake_layerwise_connector.py

+                        try:
+                            sock.send_multipart((identity, b"", HEARTBEAT_ACK))
+                        except Exception:
+                            pass


Silently swallowing all exceptions with except Exception: pass is dangerous. If sending the heartbeat acknowledgment fails, the HeartbeatMonitor on the other side will not receive a reply and may incorrectly assume this peer is down. This could lead to connection issues and failures that are very hard to debug. The exception should be logged to provide visibility into network or socket problems.

Suggested change

-                        try:
-                            sock.send_multipart((identity, b"", HEARTBEAT_ACK))
-                        except Exception:
-                            pass
+                        try:
+                            sock.send_multipart((identity, b"", HEARTBEAT_ACK))
+                        except Exception as e:
+                            logger.warning(f"Failed to send heartbeat ACK to {identity!r}: {e}")
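The logging suggestion above can be tried in isolation with a minimal sketch like the one below. It assumes a stand-in socket class instead of a real ZeroMQ socket; `FailingSocket`, `send_heartbeat_ack`, and the value of `HEARTBEAT_ACK` are hypothetical names and values chosen for illustration, not taken from the PR.

```python
import logging

logger = logging.getLogger("heartbeat_sketch")

# Hypothetical payload; the actual constant's value is not shown in the PR excerpt.
HEARTBEAT_ACK = b"HB_ACK"


class FailingSocket:
    """Stand-in for a router-style socket whose send always fails."""

    def send_multipart(self, frames):
        raise OSError("peer unreachable")


def send_heartbeat_ack(sock, identity: bytes) -> bool:
    """Send an ACK; log the failure instead of hiding it.

    Returns True if the send call completed, False if it raised.
    """
    try:
        sock.send_multipart((identity, b"", HEARTBEAT_ACK))
        return True
    except Exception as e:
        # The warning makes a missing ACK traceable from this side's logs.
        logger.warning("Failed to send heartbeat ACK to %r: %s", identity, e)
        return False
```

Returning a boolean is an extra design choice in this sketch so that a caller (or a test) can react to the failure; the suggestion in the review only asks for the log line.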

gemini-code-assist · 2025-11-08T08:38:32Z

vllm_ascend/distributed/mooncake_layerwise_connector.py

+                except Exception:
+                    # send error: keep _alive as is; next round will try again
+                    pass


Silently swallowing all exceptions with except Exception: pass is dangerous. It can hide critical issues in the heartbeat mechanism, such as configuration errors or network problems, making debugging extremely difficult. If sock.send fails consistently, it will never be reported, and the peer might be considered alive by other parts of the system while it's not being pinged. You should at least log the exception to provide visibility into potential problems.

                except Exception as e:
                    # send error: keep _alive as is; next round will try again
                    host, port = key
                    logger.warning(f"[HB->] failed to send ping to {host}:{port}: {e}")
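One possible way to make consistent send failures stand out, beyond a single warning per attempt, is to count consecutive failures per target and raise the log level once they pile up. The sketch below is an illustration under assumptions: `PingSender`, `send_failures`, `max_quiet_failures`, and `HEARTBEAT_PING` are hypothetical names and do not appear in the PR.

```python
import logging
from collections import defaultdict

logger = logging.getLogger("heartbeat_sketch")

# Hypothetical payload; the PR's actual ping message is not shown.
HEARTBEAT_PING = b"HB_PING"


class PingSender:
    """Tracks consecutive ping send failures per (host, port) target."""

    def __init__(self, max_quiet_failures: int = 3):
        self.max_quiet_failures = max_quiet_failures
        self.send_failures = defaultdict(int)

    def ping(self, sock, key) -> bool:
        host, port = key
        try:
            sock.send(HEARTBEAT_PING)
        except Exception as e:
            self.send_failures[key] += 1
            count = self.send_failures[key]
            # Escalate from WARNING to ERROR once failures repeat.
            level = (logging.ERROR
                     if count >= self.max_quiet_failures else logging.WARNING)
            logger.log(level,
                       "[HB->] failed to send ping to %s:%s (%d in a row): %s",
                       host, port, count, e)
            return False
        # A successful send resets the streak for this target.
        self.send_failures[key] = 0
        return True
```

Whether a failed send should also influence liveness is a design decision for the PR author; this sketch only makes the failure visible.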

gemini-code-assist · 2025-11-08T08:38:32Z

vllm_ascend/distributed/mooncake_layerwise_connector.py

+        self._status: dict[tuple[str, int], bool] = {}
+        self._counters: dict[tuple[str, int], dict[str, int]] = {}
+        self._down_threshold = (3 * self.poll_interval + self.timeout)


There are a couple of issues with member variables here:

The _status dictionary is initialized here and written to in add_target, but its value is never read. This is dead code and should be removed.

_down_threshold is calculated but never used. The is_alive method re-calculates the same value (3 * self.poll_interval + self.timeout) instead of using this member. This is redundant and could lead to inconsistencies.

I recommend removing _status and using _down_threshold in is_alive.
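A minimal sketch of that recommendation follows: the threshold is computed once in the constructor and `is_alive` reads it. How the real `HeartbeatMonitor` tracks liveness is not shown in the excerpt, so the `_last_seen` timestamps, the `record_ack` method, and the optional `now` parameter are assumptions made only to keep the example self-contained.

```python
import time


class HeartbeatMonitorSketch:
    """Illustrative only; not the HeartbeatMonitor from the PR."""

    def __init__(self, poll_interval: float = 1.0, timeout: float = 0.5):
        self.poll_interval = poll_interval
        self.timeout = timeout
        # Computed once; is_alive reuses it instead of repeating the formula.
        self._down_threshold = 3 * self.poll_interval + self.timeout
        # Hypothetical bookkeeping: time of the last ACK per (host, port).
        self._last_seen: dict[tuple[str, int], float] = {}

    def record_ack(self, key, now=None) -> None:
        self._last_seen[key] = time.monotonic() if now is None else now

    def is_alive(self, key, now=None) -> bool:
        last = self._last_seen.get(key)
        if last is None:
            # Never heard from this peer.
            return False
        now = time.monotonic() if now is None else now
        return (now - last) <= self._down_threshold
```

With a single source for the threshold, changing the formula later cannot leave `is_alive` and the constructor out of sync.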

gemini-code-assist · 2025-11-08T08:38:32Z

vllm_ascend/distributed/mooncake_layerwise_connector.py

+                    try:
+                        self._counters[key]["sent"] += 1
+                    except KeyError:
+                        self._counters[key] = {
+                            "sent": 1,
+                            "recv": 0,
+                            "timeout": 0
+                        }


This try...except KeyError block is suspicious. The _counters dictionary is populated in add_target under the same lock as _targets. Since the run loop iterates over a snapshot of _targets.items(), every key from the snapshot should already exist in _counters. A KeyError here would indicate a serious logic bug or a race condition. Instead of defensively handling it, it would be better to assert that the key exists or remove this handler if the error is indeed impossible. This kind of defensive coding can mask underlying problems.

self._counters[key]["sent"] += 1
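The invariant described above can be sketched as follows, assuming targets and counters are always registered together under one lock. `TargetRegistrySketch` and `tick` are hypothetical names; the endpoint string format is also an assumption.

```python
import threading


class TargetRegistrySketch:
    """Illustrative only: counters are created together with targets."""

    def __init__(self):
        self._lock = threading.Lock()
        self._targets: dict[tuple[str, int], str] = {}
        self._counters: dict[tuple[str, int], dict[str, int]] = {}

    def add_target(self, host: str, port: int) -> None:
        key = (host, port)
        with self._lock:
            # Both dictionaries are updated under the same lock, so a key
            # can never be visible in _targets without its counters.
            self._targets[key] = f"tcp://{host}:{port}"
            self._counters.setdefault(
                key, {"sent": 0, "recv": 0, "timeout": 0})

    def tick(self) -> None:
        """One loop iteration: count one ping per registered target."""
        with self._lock:
            snapshot = list(self._targets.items())
        for key, _endpoint in snapshot:
            with self._lock:
                # Fail loudly if the invariant is ever broken.
                assert key in self._counters, f"no counters for {key}"
                self._counters[key]["sent"] += 1
```

If targets can also be removed while the loop is iterating over its snapshot, a missing key would be a legitimate case rather than a bug, and the loop would need to handle it explicitly; this sketch has no removal path.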

LCAIZJ · 2025-11-10T03:06:57Z

I think in the PD-separation scenario, heartbeat management should be handled by the upper-layer proxy. The proxy should use heartbeats as a basis when routing requests to either P or D. So, is the reason we're making P and D aware of heartbeats here because, due to layer-wise execution, their connection establishment takes relatively long, and thus they need to be heartbeat-aware?

wangxiaoteng888 · 2025-11-10T09:16:10Z

> I think in the PD-separation scenario, heartbeat management should be handled by the upper-layer proxy. The proxy should use heartbeats as a basis when routing requests to either P or D. So, is the reason we're making P and D aware of heartbeats here because, due to layer-wise execution, their connection establishment takes relatively long, and thus they need to be heartbeat-aware?

The main reason we do this is to clean up the connector's historical link information. The connector persists the base_address and port information for each link. With heartbeats added here, we can detect a disconnection and then clean up that persisted information.
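As a rough sketch of the cleanup described here, assuming the persisted link information is a dictionary keyed by (host, port) and that liveness is available as a callable: `prune_dead_links`, the `links` layout, and the `is_alive` callback are all hypothetical and only illustrate the idea.

```python
from typing import Callable

Key = tuple[str, int]


def prune_dead_links(links: dict[Key, dict],
                     is_alive: Callable[[Key], bool]) -> list[Key]:
    """Remove persisted entries whose peer is no longer alive.

    Returns the keys that were removed, so the caller can log them.
    """
    # Collect first, then delete, to avoid mutating the dict while iterating.
    dead = [key for key in links if not is_alive(key)]
    for key in dead:
        del links[key]
    return dead
```

A caller could run this after each heartbeat round, passing the monitor's liveness check as `is_alive`.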

Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>

wangxiaoteng888 changed the title ~~Add a heartbeat mechanism to PD separation~~ [P/D]Add a heartbeat mechanism to PD separation Nov 8, 2025

gemini-code-assist bot reviewed Nov 8, 2025


wangxiaoteng888 added 4 commits November 11, 2025 19:20

add_pingpong

ef02364

Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>

change every 3 min done

da1719f

Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>

fix_bug

bc11d34

Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>

add_clean

00ff3e6

Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>

wangxiaoteng888 force-pushed the add_pingpong branch 3 times, most recently from dbf5cf3 to c50bde1 Compare November 11, 2025 11:42

rebase

21292b6

Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>

wangxiaoteng888 force-pushed the add_pingpong branch from c50bde1 to 21292b6 Compare November 11, 2025 11:49
