[Engine] apiserver&engine exit when work failed to start#6322

Open
CSWYF3634076 wants to merge 2 commits into PaddlePaddle:develop from CSWYF3634076:apiserver-exit

Conversation

CSWYF3634076 (Collaborator) commented Feb 3, 2026

Motivation

If the worker fails to start, the apiserver&engine currently do not exit. They keep spinning in the wait loop (while self.get_profile_block_num_signal.value[0] == 0:), and because the worker has already crashed, the signal never arrives.

Modifications

Add a check for whether the worker process is still alive.
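
The idea can be sketched as a wait loop that also polls the child process. This is a minimal illustration, not the code in this PR: wait_for_ready, is_ready and poll_interval are hypothetical names, and a subprocess.Popen handle stands in for worker_proc.

```python
import subprocess
import sys
import time


def wait_for_ready(is_ready, worker_proc, poll_interval=0.01):
    """Block until is_ready() returns True, or raise if the worker has exited.

    is_ready and poll_interval are hypothetical stand-ins for the shared
    signal check and the sleep used by a real wait loop.
    """
    while not is_ready():
        # Popen.poll() returns None while the child is still running.
        if worker_proc is not None and worker_proc.poll() is not None:
            raise RuntimeError(
                "Worker process failed to start. "
                "Please check log/workerlog.* for details."
            )
        time.sleep(poll_interval)


def demo():
    # A child that exits immediately simulates a worker that failed to start.
    dead_worker = subprocess.Popen([sys.executable, "-c", "import sys; sys.exit(1)"])
    try:
        wait_for_ready(lambda: False, dead_worker)
    except RuntimeError as exc:
        return str(exc)
    finally:
        dead_worker.wait()
    return None
```

Without the poll() check, the loop in demo() would never return, because is_ready never becomes true once the child is gone.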

Usage or Command

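To reproduce, make the worker fail at startup: introduce an arbitrary error into the worker code, or launch on a card that is already in use.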

Accuracy Tests

no

Checklist

  • Add at least one tag to the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code, run pre-commit before commit.
  • Add unit tests. Please write the reason in this PR if no unit tests.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

paddle-bot bot commented Feb 3, 2026

Thanks for your contribution!

codecov-commenter commented Feb 3, 2026

Codecov Report

❌ Patch coverage is 0% with 6 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@1c0a2b0). Learn more about missing BASE report.

Files with missing lines               Patch %   Lines
fastdeploy/engine/common_engine.py     0.00%     3 Missing ⚠️
fastdeploy/engine/engine.py            0.00%     1 Missing and 2 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             develop    #6322   +/-   ##
==========================================
  Coverage           ?   68.44%           
==========================================
  Files              ?      391           
  Lines              ?    52173           
  Branches           ?     8130           
==========================================
  Hits               ?    35708           
  Misses             ?    13855           
  Partials           ?     2610           
Flag   Coverage Δ
GPU    68.44% <0.00%> (?)

Jiang-Jia-Jun (Collaborator) commented

  • This seems to cover only the api-server main process. Would there be a problem when workers is set to 8?
  • There is also the cache_transfer process (when CPU cache is enabled).

Copilot AI (Contributor) left a comment

Pull request overview

This PR fixes an issue where the apiserver and engine do not exit when the worker process fails to start. Instead, they get stuck in the wait loop of the _stop_profile method, waiting for a signal that never arrives because the worker has already crashed.

Changes:

  • Added a worker process health check in the _stop_profile method
  • Raises RuntimeError when the worker process is found to have terminated
  • Implemented the same fix in both files for consistency

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

File: fastdeploy/engine/engine.py
Description: Added a worker process liveness check to the wait loop of the _stop_profile method to prevent waiting forever when the worker fails.

File: fastdeploy/engine/common_engine.py
Description: Added the same worker process liveness check to the _stop_profile method for consistency with engine.py.

while self.get_profile_block_num_signal.value[0] == 0:
    if hasattr(self, "worker_proc") and self.worker_proc is not None:
        if self.worker_proc.poll() is not None:
            raise RuntimeError("Worker process failed to start." "Please check log/workerlog.* for details.")

Copilot AI commented Feb 3, 2026

The two adjacent string literals are concatenated with no space between them, so the message reads "Worker process failed to start.Please check log/workerlog.* for details." Add a space so that it becomes "Worker process failed to start. Please check log/workerlog.* for details."

Suggested change
- raise RuntimeError("Worker process failed to start." "Please check log/workerlog.* for details.")
+ raise RuntimeError(
+     "Worker process failed to start. Please check log/workerlog.* for details."
+ )
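
The behaviour behind this comment is Python's implicit concatenation of adjacent string literals, which joins them with nothing in between. A short illustration (the variable names broken and fixed are only for this example):

```python
# Adjacent string literals are joined at compile time with nothing in between.
broken = "Worker process failed to start." "Please check log/workerlog.* for details."
fixed = "Worker process failed to start. " "Please check log/workerlog.* for details."

print(broken)
print(fixed)
```

Either a trailing space inside the first literal or a single merged literal produces the intended message.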

Comment on lines +697 to +698
if self.worker_proc.poll() is not None:
    raise RuntimeError("Worker process failed to start." "Please check log/workerlog.* for details.")

Copilot AI commented Feb 3, 2026

Consider including the process exit code in the error message to give more context about why the worker process failed. self.worker_proc.poll() returns the exit code, so its return value can be stored and reused in the message.

Suggested change
- if self.worker_proc.poll() is not None:
-     raise RuntimeError("Worker process failed to start." "Please check log/workerlog.* for details.")
+ exit_code = self.worker_proc.poll()
+ if exit_code is not None:
+     raise RuntimeError(
+         f"Worker process failed to start (exit code: {exit_code}). "
+         "Please check log/workerlog.* for details."
+     )
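
As a hedged extension of this suggestion, the exit code can also be made easier to read: on POSIX, subprocess reports a negative return code -N when the child was killed by signal N. The helpers describe_exit and check_worker below are hypothetical and are not part of this PR.

```python
import signal
import subprocess
import sys


def describe_exit(returncode):
    """Turn a Popen return code into a short human-readable string."""
    if returncode is None:
        return "still running"
    if returncode < 0:
        # On POSIX, a negative value -N means the child was killed by signal N.
        try:
            name = signal.Signals(-returncode).name
        except ValueError:
            name = "unknown signal"
        return f"killed by {name} ({-returncode})"
    return f"exit code {returncode}"


def check_worker(worker_proc):
    # Call poll() once and reuse the result for both the test and the message.
    exit_code = worker_proc.poll()
    if exit_code is not None:
        raise RuntimeError(
            f"Worker process failed to start ({describe_exit(exit_code)}). "
            "Please check log/workerlog.* for details."
        )
```

Storing the result of poll() in a variable avoids calling it twice and keeps the condition and the message consistent.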

Comment on lines +2024 to +2025
if self.worker_proc.poll() is not None:
    raise RuntimeError("Worker process failed to start." "Please check log/workerlog.* for details.")

Copilot AI commented Feb 3, 2026

Consider including the process exit code in the error message to give more context about why the worker process failed. self.worker_proc.poll() returns the exit code, so its return value can be stored and reused in the message.

Suggested change
- if self.worker_proc.poll() is not None:
-     raise RuntimeError("Worker process failed to start." "Please check log/workerlog.* for details.")
+ exit_code = self.worker_proc.poll()
+ if exit_code is not None:
+     raise RuntimeError(
+         f"Worker process failed to start (exit code: {exit_code}). "
+         "Please check log/workerlog.* for details."
+     )

while self.get_profile_block_num_signal.value[0] == 0:
    if hasattr(self, "worker_proc") and self.worker_proc is not None:
        if self.worker_proc.poll() is not None:
            raise RuntimeError("Worker process failed to start." "Please check log/workerlog.* for details.")

Copilot AI commented Feb 3, 2026

The two adjacent string literals are concatenated with no space between them, so the message reads "Worker process failed to start.Please check log/workerlog.* for details." Add a space so that it becomes "Worker process failed to start. Please check log/workerlog.* for details."

Suggested change
- raise RuntimeError("Worker process failed to start." "Please check log/workerlog.* for details.")
+ raise RuntimeError("Worker process failed to start. " "Please check log/workerlog.* for details.")

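CSWYF3634076 (Collaborator, Author) commented Feb 5, 2026

  • This seems to cover only the api-server main process. Would there be a problem when workers is set to 8?
  • There is also the cache_transfer process (when CPU cache is enabled).

@Jiang-Jia-Jun This PR is mainly meant to let the apiserver&engine exit if the worker fails right at startup. Without it, the apiserver&engine never exit and stay stuck at while self.get_profile_block_num_signal.val until Ctrl+C is pressed.

I tested both workers set to 8 and swap-space. In both cases the apiserver&engine exit when the worker fails to start.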
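
For readers following the question about several child processes, the same poll() check generalizes to a collection of handles. The sketch below is an illustration under stated assumptions, not what this PR implements: find_exited is a hypothetical helper, and the labels "worker" and "cache_transfer" are only example keys for subprocess.Popen objects.

```python
import subprocess
import sys


def find_exited(procs):
    """Return {name: returncode} for every child in procs that has exited.

    procs maps a hypothetical label (e.g. "worker", "cache_transfer")
    to a subprocess.Popen handle.
    """
    exited = {}
    for name, proc in procs.items():
        code = proc.poll()
        if code is not None:
            exited[name] = code
    return exited


def demo():
    ok = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(30)"])
    bad = subprocess.Popen([sys.executable, "-c", "import sys; sys.exit(2)"])
    bad.wait()  # make sure the failing child has finished before polling
    try:
        return find_exited({"worker": ok, "cache_transfer": bad})
    finally:
        ok.kill()
        ok.wait()
```

A wait loop could call such a helper on each iteration and raise as soon as the returned dictionary is non-empty.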
