-
Notifications
You must be signed in to change notification settings - Fork 617
[P/D]Add a heartbeat mechanism to PD separation #4071
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according Contributing and Testing. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces a heartbeat mechanism for peer-to-peer communication in the PD separation setup, which is a valuable addition for improving the robustness of the distributed system. The implementation in HeartbeatMonitor is a good starting point, but it has some critical flaws. My review focuses on improving the reliability and debuggability of this new mechanism. Specifically, I've pointed out several places where exceptions are silently ignored, which can hide serious problems. I've also identified some dead code and a suspicious error handling block that could be masking a logic bug. Addressing these points will make the heartbeat feature much more robust and maintainable.
| try: | ||
| sock.send_multipart((identity, b"", HEARTBEAT_ACK)) | ||
| except Exception: | ||
| pass |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Silently swallowing all exceptions with except Exception: pass is dangerous. If sending the heartbeat acknowledgment fails, the HeartbeatMonitor on the other side will not receive a reply and may incorrectly assume this peer is down. This could lead to connection issues and failures that are very hard to debug. The exception should be logged to provide visibility into network or socket problems.
| try: | |
| sock.send_multipart((identity, b"", HEARTBEAT_ACK)) | |
| except Exception: | |
| pass | |
| try: | |
| sock.send_multipart((identity, b"", HEARTBEAT_ACK)) | |
| except Exception as e: | |
| logger.warning(f"Failed to send heartbeat ACK to {identity!r}: {e}") |
| except Exception: | ||
| # send error: keep _alive as is; next round will try again | ||
| pass |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Silently swallowing all exceptions with except Exception: pass is dangerous. It can hide critical issues in the heartbeat mechanism, such as configuration errors or network problems, making debugging extremely difficult. If sock.send fails consistently, it will never be reported, and the peer might be considered alive by other parts of the system while it's not being pinged. You should at least log the exception to provide visibility into potential problems.
except Exception as e:
# send error: keep _alive as is; next round will try again
host, port = key
logger.warning(f"[HB->] failed to send ping to {host}:{port}: {e}")| self._status: dict[tuple[str, int], bool] = {} | ||
| self._counters: dict[tuple[str, int], dict[str, int]] = {} | ||
| self._down_threshold = (3 * self.poll_interval + self.timeout) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are a couple of issues with member variables here:
- The
_statusdictionary is initialized here and written to inadd_target, but its value is never read. This is dead code and should be removed. _down_thresholdis calculated but never used. Theis_alivemethod re-calculates the same value(3 * self.poll_interval + self.timeout)instead of using this member. This is redundant and could lead to inconsistencies.
I recommend removing _status and using _down_threshold in is_alive.
| try: | ||
| self._counters[key]["sent"] += 1 | ||
| except KeyError: | ||
| self._counters[key] = { | ||
| "sent": 1, | ||
| "recv": 0, | ||
| "timeout": 0 | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This try...except KeyError block is suspicious. The _counters dictionary is populated in add_target under the same lock as _targets. Since the run loop iterates over a snapshot of _targets.items(), every key from the snapshot should already exist in _counters. A KeyError here would indicate a serious logic bug or a race condition. Instead of defensively handling it, it would be better to assert that the key exists or remove this handler if the error is indeed impossible. This kind of defensive coding can mask underlying problems.
self._counters[key]["sent"] += 1|
I think in the PD-separation scenario, heartbeat management should be handled by the upper-layer proxy. The proxy should use heartbeats as a basis when routing requests to either P or D. So, is the reason we're making P and D aware of heartbeats here because, due to layer-wise execution, their connection establishment takes relatively long, and thus they need to be heartbeat-aware? |
The main reason we do this is to clean up the connector’s historical link information. In the connector, the base_address and port information are persisted. By adding heartbeats here, we can detect disconnections and then conveniently clean up the above persisted information afterward. |
Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>
Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>
dbf5cf3 to
c50bde1
Compare
c50bde1 to
21292b6
Compare
What this PR does / why we need it?
Add a heartbeat mechanism to PD separation and clean up historical link information based on heartbeat status.
Does this PR introduce any user-facing change?
No
How was this patch tested?
By ci