Skip to content

Conversation

@wangxiaoteng888
Copy link
Contributor

@wangxiaoteng888 wangxiaoteng888 commented Nov 8, 2025

What this PR does / why we need it?

Add a heartbeat mechanism to PD separation and clean up historical link information based on heartbeat status.

Does this PR introduce any user-facing change?

No

How was this patch tested?

By ci

@github-actions
Copy link

github-actions bot commented Nov 8, 2025

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:‌‌

  • A PR should do only one thing, smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests ‌to ensure it works and is not broken by other future PRs.
  • Write the commit message by fulfilling the PR description to help reviewer and future developers understand.

If CI fails, you can run linting and testing checks locally according Contributing and Testing.

@wangxiaoteng888 wangxiaoteng888 changed the title Add a heartbeat mechanism to PD separation [P/D]Add a heartbeat mechanism to PD separation Nov 8, 2025
Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a heartbeat mechanism for peer-to-peer communication in the PD separation setup, which is a valuable addition for improving the robustness of the distributed system. The implementation in HeartbeatMonitor is a good starting point, but it has some critical flaws. My review focuses on improving the reliability and debuggability of this new mechanism. Specifically, I've pointed out several places where exceptions are silently ignored, which can hide serious problems. I've also identified some dead code and a suspicious error handling block that could be masking a logic bug. Addressing these points will make the heartbeat feature much more robust and maintainable.

Comment on lines +329 to +336
try:
sock.send_multipart((identity, b"", HEARTBEAT_ACK))
except Exception:
pass
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

Silently swallowing all exceptions with except Exception: pass is dangerous. If sending the heartbeat acknowledgment fails, the HeartbeatMonitor on the other side will not receive a reply and may incorrectly assume this peer is down. This could lead to connection issues and failures that are very hard to debug. The exception should be logged to provide visibility into network or socket problems.

Suggested change
try:
sock.send_multipart((identity, b"", HEARTBEAT_ACK))
except Exception:
pass
try:
sock.send_multipart((identity, b"", HEARTBEAT_ACK))
except Exception as e:
logger.warning(f"Failed to send heartbeat ACK to {identity!r}: {e}")

Comment on lines +429 to +451
except Exception:
# send error: keep _alive as is; next round will try again
pass
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

Silently swallowing all exceptions with except Exception: pass is dangerous. It can hide critical issues in the heartbeat mechanism, such as configuration errors or network problems, making debugging extremely difficult. If sock.send fails consistently, it will never be reported, and the peer might be considered alive by other parts of the system while it's not being pinged. You should at least log the exception to provide visibility into potential problems.

                except Exception as e:
                    # send error: keep _alive as is; next round will try again
                    host, port = key
                    logger.warning(f"[HB->] failed to send ping to {host}:{port}: {e}")

Comment on lines 360 to 362
self._status: dict[tuple[str, int], bool] = {}
self._counters: dict[tuple[str, int], dict[str, int]] = {}
self._down_threshold = (3 * self.poll_interval + self.timeout)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

There are a couple of issues with member variables here:

  1. The _status dictionary is initialized here and written to in add_target, but its value is never read. This is dead code and should be removed.
  2. _down_threshold is calculated but never used. The is_alive method re-calculates the same value (3 * self.poll_interval + self.timeout) instead of using this member. This is redundant and could lead to inconsistencies.

I recommend removing _status and using _down_threshold in is_alive.

Comment on lines +417 to +444
try:
self._counters[key]["sent"] += 1
except KeyError:
self._counters[key] = {
"sent": 1,
"recv": 0,
"timeout": 0
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

This try...except KeyError block is suspicious. The _counters dictionary is populated in add_target under the same lock as _targets. Since the run loop iterates over a snapshot of _targets.items(), every key from the snapshot should already exist in _counters. A KeyError here would indicate a serious logic bug or a race condition. Instead of defensively handling it, it would be better to assert that the key exists or remove this handler if the error is indeed impossible. This kind of defensive coding can mask underlying problems.

                    self._counters[key]["sent"] += 1

@LCAIZJ
Copy link
Contributor

LCAIZJ commented Nov 10, 2025

I think in the PD-separation scenario, heartbeat management should be handled by the upper-layer proxy. The proxy should use heartbeats as a basis when routing requests to either P or D. So, is the reason we're making P and D aware of heartbeats here because, due to layer-wise execution, their connection establishment takes relatively long, and thus they need to be heartbeat-aware?

@wangxiaoteng888
Copy link
Contributor Author

I think in the PD-separation scenario, heartbeat management should be handled by the upper-layer proxy. The proxy should use heartbeats as a basis when routing requests to either P or D. So, is the reason we're making P and D aware of heartbeats here because, due to layer-wise execution, their connection establishment takes relatively long, and thus they need to be heartbeat-aware?

The main reason we do this is to clean up the connector’s historical link information. In the connector, the base_address and port information are persisted. By adding heartbeats here, we can detect disconnections and then conveniently clean up the above persisted information afterward.

Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>
Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>
Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>
Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>
@wangxiaoteng888 wangxiaoteng888 force-pushed the add_pingpong branch 3 times, most recently from dbf5cf3 to c50bde1 Compare November 11, 2025 11:42
Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants