[Engine] apiserver&engine exit when work failed to start by CSWYF3634076 · Pull Request #6322 · PaddlePaddle/FastDeploy

CSWYF3634076 · 2026-02-03T03:56:10Z

Motivation

Currently, if a worker fails to start, the apiserver and engine do not exit. They keep waiting in the loop `while self.get_profile_block_num_signal.value[0] == 0:`, but because the worker has already crashed, the signal never arrives.

Modifications

Add a check for whether the worker process still exists while waiting for the signal.
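The idea can be sketched outside FastDeploy as follows. This is a minimal illustration, not the engine's actual code: `wait_for_signal`, the list-based `signal`, and the `poll_interval` sleep are assumptions made for the example, and the worker is simulated by a short-lived Python subprocess. Only the `poll()` check and the error message mirror what the PR describes.

```python
import subprocess
import sys
import time


def wait_for_signal(signal, worker_proc, poll_interval=0.01):
    """Wait until signal[0] is non-zero, failing fast if the worker has exited.

    `signal` stands in for get_profile_block_num_signal.value (hypothetical here).
    """
    while signal[0] == 0:
        # Popen.poll() returns None while the child is running,
        # and its exit code once it has terminated.
        if worker_proc is not None and worker_proc.poll() is not None:
            raise RuntimeError(
                "Worker process failed to start. Please check log/workerlog.* for details."
            )
        time.sleep(poll_interval)
    return signal[0]


# Simulate a worker that crashes during startup and never sets the signal.
worker = subprocess.Popen([sys.executable, "-c", "import sys; sys.exit(3)"])
signal = [0]
try:
    wait_for_signal(signal, worker)
    outcome = "ready"
except RuntimeError as exc:
    outcome = str(exc)
print(outcome)
```

Without the `poll()` check, the loop above would spin forever, which is the hang described in the Motivation section.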

Usage or Command

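To reproduce, make the worker fail at startup: introduce an arbitrary error in the worker code, or launch on a card (device) that is already occupied.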

Accuracy Tests

no

Checklist

Add at least a tag in the PR title.
- Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
- You can add new tags based on the PR content, but the semantics must be clear.
Format your code, run pre-commit before commit.
Add unit tests. Please write the reason in this PR if no unit tests.
Provide accuracy results.
If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

paddle-bot · 2026-02-03T03:56:16Z

Thanks for your contribution!

codecov-commenter · 2026-02-03T06:02:16Z

Codecov Report

❌ Patch coverage is 0% with 6 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@1c0a2b0). Learn more about missing BASE report.

Files with missing lines	Patch %	Lines
fastdeploy/engine/common_engine.py	0.00%	3 Missing ⚠️
fastdeploy/engine/engine.py	0.00%	1 Missing and 2 partials ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             develop    #6322   +/-   ##
==========================================
  Coverage           ?   68.44%           
==========================================
  Files              ?      391           
  Lines              ?    52173           
  Branches           ?     8130           
==========================================
  Hits               ?    35708           
  Misses             ?    13855           
  Partials           ?     2610

Flag	Coverage Δ
GPU	`68.44% <0.00%> (?)`


Jiang-Jia-Jun · 2026-02-03T07:12:25Z

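This appears to cover only the api-server main process. Would there be a problem when workers is set to 8?
There is also the cache_transfer process (when CPU cache is enabled).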

Copilot

Pull request overview

This PR fixes an issue where the apiserver and engine do not exit properly when the worker process fails to start. Instead, they get stuck in an infinite loop in the `_stop_profile` method, waiting for a signal that will never arrive because the worker has already crashed.

Changes:

- Added a worker process health check in the `_stop_profile` method
- Raises `RuntimeError` when worker process termination is detected
- Implemented the same fix in both files for consistency

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

File	Description
fastdeploy/engine/engine.py	Added a worker process liveness check in the wait loop of `_stop_profile` to prevent infinite waiting when the worker fails
fastdeploy/engine/common_engine.py	Added the same worker process liveness check in `_stop_profile` to stay consistent with engine.py

Copilot · 2026-02-03T07:16:57Z

fastdeploy/engine/common_engine.py

        while self.get_profile_block_num_signal.value[0] == 0:
+            if hasattr(self, "worker_proc") and self.worker_proc is not None:
+                if self.worker_proc.poll() is not None:
+                    raise RuntimeError("Worker process failed to start." "Please check log/workerlog.* for details.")


The two adjacent string literals are concatenated without a space, so the message reads "Worker process failed to start.Please check log/workerlog.* for details." Add a space so that it becomes "Worker process failed to start. Please check log/workerlog.* for details."

Suggested change

-                    raise RuntimeError("Worker process failed to start." "Please check log/workerlog.* for details.")
+                    raise RuntimeError(
+                        "Worker process failed to start. Please check log/workerlog.* for details."
+                    )
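Python joins adjacent string literals at compile time without inserting any separator, which is why the original message runs together. A minimal illustration, using the two message fragments from the diff:

```python
# Adjacent string literals are concatenated exactly as written.
broken = "Worker process failed to start." "Please check log/workerlog.* for details."
# A trailing space in the first literal restores the sentence break.
fixed = "Worker process failed to start. " "Please check log/workerlog.* for details."

print(broken)
print(fixed)
```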

Copilot · 2026-02-03T07:16:58Z

fastdeploy/engine/engine.py

+                if self.worker_proc.poll() is not None:
+                    raise RuntimeError("Worker process failed to start." "Please check log/workerlog.* for details.")


Consider including the process exit code in the error message, which gives more context about why the worker process failed. `self.worker_proc.poll()` returns the exit code once the process has terminated, so its return value can be reused in the message.

Suggested change

-                if self.worker_proc.poll() is not None:
-                    raise RuntimeError("Worker process failed to start." "Please check log/workerlog.* for details.")
+                exit_code = self.worker_proc.poll()
+                if exit_code is not None:
+                    raise RuntimeError(
+                        f"Worker process failed to start (exit code: {exit_code}). "
+                        "Please check log/workerlog.* for details."
+                    )
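For reference, a small sketch of how the value returned by `poll()` could be interpreted, assuming `worker_proc` is a `subprocess.Popen` object (the PR does not state its type). The helper name `describe_exit` is hypothetical. On POSIX, `Popen` reports a negative return code `-N` when the child was terminated by signal `N`.

```python
import subprocess
import sys


def describe_exit(exit_code):
    """Turn a Popen.poll() result into a short human-readable description."""
    if exit_code is None:
        return "still running"
    if exit_code < 0:
        # POSIX only: a negative value means the child was killed by a signal.
        return f"killed by signal {-exit_code}"
    return f"exited with code {exit_code}"


proc = subprocess.Popen([sys.executable, "-c", "raise SystemExit(7)"])
proc.wait()
message = describe_exit(proc.poll())
print(message)
```

Calling `poll()` once and storing the result, as the suggestion does, also avoids reading the status twice.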

Copilot · 2026-02-03T07:16:58Z

fastdeploy/engine/common_engine.py

+                if self.worker_proc.poll() is not None:
+                    raise RuntimeError("Worker process failed to start." "Please check log/workerlog.* for details.")


Consider including the process exit code in the error message, which gives more context about why the worker process failed. `self.worker_proc.poll()` returns the exit code once the process has terminated, so its return value can be reused in the message.

Suggested change

-                if self.worker_proc.poll() is not None:
-                    raise RuntimeError("Worker process failed to start." "Please check log/workerlog.* for details.")
+                exit_code = self.worker_proc.poll()
+                if exit_code is not None:
+                    raise RuntimeError(
+                        f"Worker process failed to start (exit code: {exit_code}). "
+                        "Please check log/workerlog.* for details."
+                    )

Copilot · 2026-02-03T07:16:58Z

fastdeploy/engine/engine.py

        while self.get_profile_block_num_signal.value[0] == 0:
+            if hasattr(self, "worker_proc") and self.worker_proc is not None:
+                if self.worker_proc.poll() is not None:
+                    raise RuntimeError("Worker process failed to start." "Please check log/workerlog.* for details.")


The two adjacent string literals are concatenated without a space, so the message reads "Worker process failed to start.Please check log/workerlog.* for details." Add a space so that it becomes "Worker process failed to start. Please check log/workerlog.* for details."

Suggested change

-                    raise RuntimeError("Worker process failed to start." "Please check log/workerlog.* for details.")
+                    raise RuntimeError("Worker process failed to start. " "Please check log/workerlog.* for details.")

CSWYF3634076 · 2026-02-05T09:18:16Z

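> This appears to cover only the api-server main process. Would there be a problem when workers is set to 8?
>
> There is also the cache_transfer process (when CPU cache is enabled).

@Jiang-Jia-Jun The main goal of this PR is that, if the worker fails to start during startup, the apiserver and engine can exit. Otherwise they do not exit and stay stuck in while self.get_profile_block_num_signal.val, and a Ctrl+C is needed to stop them.

I tested both workers set to 8 and swap-space; in both cases the apiserver and engine exit when the worker fails to start.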
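If the liveness check ever needs to cover more than one child process (for example several workers plus a cache-transfer process, as raised in the review question), one possible shape is a helper that polls each tracked process. This is only a sketch under assumptions: `first_dead_process` and the process names are hypothetical and do not come from FastDeploy, and the children are simulated with plain Python subprocesses.

```python
import subprocess
import sys


def first_dead_process(procs):
    """Return (name, exit_code) for the first tracked process that has exited, else None."""
    for name, proc in procs.items():
        exit_code = proc.poll()
        if exit_code is not None:
            return name, exit_code
    return None


# Hypothetical process table: one healthy child and one that fails at startup.
procs = {
    "worker_0": subprocess.Popen([sys.executable, "-c", "import time; time.sleep(30)"]),
    "cache_transfer": subprocess.Popen([sys.executable, "-c", "raise SystemExit(1)"]),
}
procs["cache_transfer"].wait()  # make the demo deterministic

dead = first_dead_process(procs)
print(dead)

# Clean up the healthy child so the demo does not leave a process behind.
procs["worker_0"].terminate()
procs["worker_0"].wait()
```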

[Engine] apiserver&engine exit when work failed to start

13d99d4

CSWYF3634076 temporarily deployed to Metax_ci February 3, 2026 03:56 — with GitHub Actions Inactive

Jiang-Jia-Jun requested a review from Copilot February 3, 2026 07:12

Copilot started reviewing on behalf of Jiang-Jia-Jun February 3, 2026 07:12

Copilot AI reviewed Feb 3, 2026

Merge branch 'develop' into apiserver-exit

376acab

CSWYF3634076 temporarily deployed to Metax_ci February 5, 2026 11:55 — with GitHub Actions Inactive

CSWYF3634076 wants to merge 2 commits into PaddlePaddle:develop from CSWYF3634076:apiserver-exit


3 participants
