[Engine] apiserver&engine exit when work failed to start #6322
CSWYF3634076 wants to merge 2 commits into PaddlePaddle:develop
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## develop #6322 +/- ##
==========================================
Coverage ? 68.44%
==========================================
Files ? 391
Lines ? 52173
Branches ? 8130
==========================================
Hits ? 35708
Misses ? 13855
Partials ? 2610
Pull request overview
This PR fixes an issue where the apiserver and engine do not exit properly when the worker process fails to start. Instead, they get stuck in an infinite loop in the _stop_profile method, waiting for a signal that never arrives because the worker has already crashed.
Changes:
- Added a worker process health check in the _stop_profile method
- Raises RuntimeError when worker process termination is detected
- Implemented the same fix in both files for consistency
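The health check listed above can be sketched outside FastDeploy as follows. This is a minimal illustration under stated assumptions, not the PR's actual code: `wait_for_signal`, `signal_value`, the polling interval, and the stand-in worker command are hypothetical, a plain list replaces `get_profile_block_num_signal.value`, and `worker_proc` is assumed to behave like a `subprocess.Popen` object (the diff only shows that it has a `poll()` method).

```python
import subprocess
import sys
import time


def wait_for_signal(signal_value, worker_proc, poll_interval=0.01):
    """Block until signal_value[0] is non-zero; raise if the worker has exited."""
    while signal_value[0] == 0:
        # Popen.poll() returns None while the child runs, else its return code.
        if worker_proc is not None and worker_proc.poll() is not None:
            raise RuntimeError(
                "Worker process failed to start. "
                "Please check log/workerlog.* for details."
            )
        # Assumption: a short sleep between checks to avoid a busy loop.
        time.sleep(poll_interval)


# Hypothetical stand-in worker that exits immediately with a non-zero status.
worker = subprocess.Popen([sys.executable, "-c", "import sys; sys.exit(3)"])
signal_value = [0]  # stand-in for get_profile_block_num_signal.value

try:
    wait_for_signal(signal_value, worker)
    outcome = "signal received"
except RuntimeError as exc:
    outcome = f"aborted: {exc}"
print(outcome)
```

Without the `poll()` check, the loop in this sketch would never end, because nothing sets `signal_value[0]` once the worker is gone.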
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| fastdeploy/engine/engine.py | Added a worker process liveness check to the wait loop in the _stop_profile method to prevent waiting forever when the worker fails |
| fastdeploy/engine/common_engine.py | Added the same worker process liveness check to the _stop_profile method for consistency with engine.py |
    while self.get_profile_block_num_signal.value[0] == 0:
        if hasattr(self, "worker_proc") and self.worker_proc is not None:
            if self.worker_proc.poll() is not None:
                raise RuntimeError("Worker process failed to start." "Please check log/workerlog.* for details.")

The error message is missing a space between the two string literals. Add one so that the message reads "Worker process failed to start. Please check log/workerlog.* for details."

Suggested change:

    - raise RuntimeError("Worker process failed to start." "Please check log/workerlog.* for details.")
    + raise RuntimeError(
    +     "Worker process failed to start. Please check log/workerlog.* for details."
    + )
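The missing space comes from Python's implicit concatenation of adjacent string literals, which joins them with nothing in between. A small standalone sketch (the variable names `broken` and `fixed` are illustrative only):

```python
# Adjacent string literals are joined at compile time with no separator.
broken = "Worker process failed to start." "Please check log/workerlog.* for details."

# A trailing space inside the first literal restores the sentence break.
fixed = "Worker process failed to start. " "Please check log/workerlog.* for details."

print(broken)
print(fixed)
```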
    if self.worker_proc.poll() is not None:
        raise RuntimeError("Worker process failed to start." "Please check log/workerlog.* for details.")

Consider including the process exit code in the error message to give more context about why the worker process failed. The return value of self.worker_proc.poll() is the exit code, so it can be added to the message to aid debugging.

Suggested change:

    - if self.worker_proc.poll() is not None:
    -     raise RuntimeError("Worker process failed to start." "Please check log/workerlog.* for details.")
    + exit_code = self.worker_proc.poll()
    + if exit_code is not None:
    +     raise RuntimeError(
    +         f"Worker process failed to start (exit code: {exit_code}). "
    +         "Please check log/workerlog.* for details."
    +     )
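As a rough illustration of what the exit code can convey, the sketch below turns a return code into a short description. It assumes `worker_proc` is a `subprocess.Popen`-like object; `describe_exit` is a hypothetical helper and is not part of the PR. With `subprocess.Popen` on POSIX, a negative return code -N means the child was terminated by signal N.

```python
import signal
import subprocess
import sys


def describe_exit(returncode):
    """Turn a Popen return code into a short human-readable description."""
    if returncode is None:
        return "still running"
    if returncode < 0:
        # On POSIX, -N means the child was terminated by signal N.
        try:
            name = signal.Signals(-returncode).name
        except ValueError:
            name = f"signal {-returncode}"
        return f"killed by {name}"
    return f"exit code {returncode}"


# Hypothetical stand-in worker that exits with status 2.
proc = subprocess.Popen([sys.executable, "-c", "import sys; sys.exit(2)"])
proc.wait()
summary = describe_exit(proc.poll())
print(summary)
```

Calling `poll()` once and reusing the result, as the suggestion does, also avoids reading the process state twice.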
    if self.worker_proc.poll() is not None:
        raise RuntimeError("Worker process failed to start." "Please check log/workerlog.* for details.")

Consider including the process exit code in the error message to give more context about why the worker process failed. The return value of self.worker_proc.poll() is the exit code, so it can be added to the message to aid debugging.

Suggested change:

    - if self.worker_proc.poll() is not None:
    -     raise RuntimeError("Worker process failed to start." "Please check log/workerlog.* for details.")
    + exit_code = self.worker_proc.poll()
    + if exit_code is not None:
    +     raise RuntimeError(
    +         f"Worker process failed to start (exit code: {exit_code}). "
    +         "Please check log/workerlog.* for details."
    +     )
    while self.get_profile_block_num_signal.value[0] == 0:
        if hasattr(self, "worker_proc") and self.worker_proc is not None:
            if self.worker_proc.poll() is not None:
                raise RuntimeError("Worker process failed to start." "Please check log/workerlog.* for details.")

The error message is missing a space between the two string literals. Add one so that the message reads "Worker process failed to start. Please check log/workerlog.* for details."

Suggested change:

    - raise RuntimeError("Worker process failed to start." "Please check log/workerlog.* for details.")
    + raise RuntimeError("Worker process failed to start. " "Please check log/workerlog.* for details.")
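@Jiang-Jia-Jun The main purpose of this PR is to let the apiserver&engine exit when the worker fails to start during startup. Without it, the apiserver&engine do not exit and stay stuck at while self.get_profile_block_num_signal.val until Ctrl+C is pressed. I tested both worker 8 and swap-space; in both cases the apiserver&engine now exit when the worker fails to start.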
Motivation
If the worker fails to start during startup, the apiserver&engine do not exit. They keep waiting in the loop while self.get_profile_block_num_signal.value[0] == 0:, but because the worker has already crashed, the signal never arrives.
Modifications
Add a check for whether the worker process is still running.
Usage or Command
Make the worker fail at startup: introduce an error into the worker code, or launch on an accelerator card that is already in use.
Accuracy Tests
no