[FDConfig] Enable FD_ENABLE_E2W_TENSOR_CONVERT and FD_ENGINE_TASK_QUEUE_WITH_SHM by default by sunlei1024 · Pull Request #7746 · PaddlePaddle/FastDeploy

sunlei1024 · 2026-05-07T14:26:24Z

Motivation

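Enable FD_ENABLE_E2W_TENSOR_CONVERT and FD_ENGINE_TASK_QUEUE_WITH_SHM by default to make Engine-to-Worker tensor transfer more efficient and to improve the communication performance of the shared-memory (SHM) engine task queue.

In large-model inference scenarios, this reduces serialization/deserialization overhead, which improves throughput and latency.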

Modifications

fastdeploy/envs.py
- Change the default of FD_ENABLE_E2W_TENSOR_CONVERT from 0 to 1
- Change the default of FD_ENGINE_TASK_QUEUE_WITH_SHM from 0 to 1
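
To illustrate what flipping a default means in practice, here is a minimal sketch of an environment-variable lookup with a fallback value. The `read_flag` helper and the `NEW_DEFAULTS` table are hypothetical and are not FastDeploy's actual `envs.py` code; only the two variable names and the old and new defaults come from this PR.

```python
import os

# New defaults described in this PR (both were 0 before the change).
NEW_DEFAULTS = {
    "FD_ENABLE_E2W_TENSOR_CONVERT": 1,
    "FD_ENGINE_TASK_QUEUE_WITH_SHM": 1,
}


def read_flag(name, environ=None):
    """Return the integer value of `name`, falling back to its default.

    Hypothetical helper for illustration only. `environ` lets callers
    pass a plain dict instead of the real process environment.
    """
    environ = os.environ if environ is None else environ
    return int(environ.get(name, NEW_DEFAULTS[name]))
```

With an empty environment the lookup returns 1 (feature on); setting the variable to `0` explicitly restores the old behavior.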

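Behavior change notes:

When the environment variables are not set explicitly, the optimizations above are now enabled by default.
To keep the old behavior, set them manually:
- FD_ENABLE_E2W_TENSOR_CONVERT=0
- FD_ENGINE_TASK_QUEUE_WITH_SHM=0
In container environments, make sure /dev/shm has enough space (at least 1 GB is recommended, depending on model size).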
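
The /dev/shm space recommendation can be checked before starting the service. The following is a minimal sketch using the standard library's `shutil.disk_usage`; the helper names and the 1 GiB threshold constant are assumptions for illustration and are not part of FastDeploy.

```python
import shutil

GIB = 1024 ** 3  # 1 GiB, matching the "at least 1 GB" recommendation


def shm_free_bytes(path="/dev/shm"):
    """Return the free space, in bytes, of the filesystem holding `path`."""
    return shutil.disk_usage(path).free


def has_enough_shm(min_bytes=GIB, path="/dev/shm"):
    """Return True when at least `min_bytes` are free under `path`."""
    return shm_free_bytes(path) >= min_bytes


# Example (on a Linux host or inside the container):
#   if not has_enough_shm():
#       print("Warning: /dev/shm has less than 1 GiB free")
```

The right threshold depends on model size, as the note above says, so `min_bytes` is left as a parameter.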

Usage or Command

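No extra configuration is needed by default; the change takes effect automatically after upgrading.

To turn the features off, use the environment variables: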

export FD_ENABLE_E2W_TENSOR_CONVERT=0
export FD_ENGINE_TASK_QUEUE_WITH_SHM=0
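
When the service is launched from a Python script rather than a shell, the same opt-out can be applied to the child process only. This is a sketch under the assumption that the flags are read from the child's environment at startup; `legacy_env` and `run_with_legacy_flags` are hypothetical helper names.

```python
import os
import subprocess

# Values that restore the pre-PR behavior for both switches.
LEGACY_FLAGS = {
    "FD_ENABLE_E2W_TENSOR_CONVERT": "0",
    "FD_ENGINE_TASK_QUEUE_WITH_SHM": "0",
}


def legacy_env(base=None):
    """Return a copy of `base` (default: os.environ) with both flags set to 0."""
    env = dict(os.environ if base is None else base)
    env.update(LEGACY_FLAGS)
    return env


def run_with_legacy_flags(cmd):
    """Run `cmd` (a list of arguments) with both flags disabled."""
    return subprocess.run(
        cmd, env=legacy_env(), capture_output=True, text=True, check=True
    )
```

Copying the environment instead of mutating `os.environ` keeps the override scoped to the launched process.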

Docker example (configuring shared memory):

docker run --shm-size=1g ...

Accuracy Tests

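This change only adjusts environment variable defaults and does not touch model computation logic.

Validation results:

✅ Functional validation: service startup and the inference flow work normally
✅ Performance validation: E2W tensor convert and the SHM queue work normally
✅ Consistency validation: with the switches turned off (set to 0), results match the previous version (no accuracy difference)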

Checklist

Add at least one tag in PR title (e.g., [FDConfig])
Code formatted and pre-commit passed
Unit tests added
- Reason: this change only alters default configuration and adds no new logic paths
Accuracy results provided
Backward compatibility considered (can be rolled back via environment variables)
Not a Cherry-Pick PR / OR follows Cherry-Pick rules if applicable

paddle-bot · 2026-05-07T14:26:31Z

Thanks for your contribution!

PaddlePaddle-bot · 2026-05-07T15:06:09Z

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-05-08 17:15:57

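The CI report is generated from the following code (refreshed every 30 minutes):

PR commit: cc5a7b5
Merge base: 5e3185f (branch: develop)
View full diff
CI details

1 Task overview

⚠️ 1 Required task has failed and 6 Required tasks are still running; wait for CI to finish before merging.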

| Total runs (reruns) | Total tasks | ✅ Passed | ❌ Failed | ⏳ Running | ⏸️ Pending | Skipped |
|---|---|---|---|---|---|---|
| 36(0) | 36 | 26 | 2 | 7 | 1 | 0 |

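2 Task status summary

2.1 Required tasks: 3/10 passed

Required tasks block merging; failures should be handled first.

| Status | Task | Duration | Root cause | Suggested fix | Log | Rerun |
|---|---|---|---|---|---|---|
| ❌ | `Approval` | 7s | PR issue: modifying envs.py requires FastDeploy RD approval | Request approval from the FastDeploy RDs (jiangjiajun and 3 others) | Job | - |
| ⏳ | `xpu_4cards_case_test / run_xpu_4cards_cases` | - | Running | - | - | - |
| ⏳ | `xpu_8cards_case_test / run_xpu_8cards_cases` | - | Running | - | - | - |
| ⏳ | `Extracted partial CE model tasks to run in CI. / run_ce_cases` | - | Running | - | - | - |
| ⏳ | `Run Base Tests / base_tests` | - | Running | - | - | - |
| ⏳ | `Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage` | - | Running | - | - | - |
| ⏳ | `Run Four Cards Tests / run_4_cards_tests` | - | Running | - | - | - |
| ✅ | The remaining 3 required tasks passed | - | - | - | - | - |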

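2.2 Optional tasks: 23/26 passed

Optional tasks do not block merging; failures are informational only.

| Status | Task | Duration | Log | Rerun |
|---|---|---|---|---|
| ❌ | `Trigger Jenkins for PR` | 12m42s | Job | - |
| ⏳ | `Run iluvatar Tests / run_iluvatar_cases` | - | - | - |
| ⏸️ | `CI_HPU` | - | - | - |
| ✅ | The remaining 23 optional tasks passed | - | - | - |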

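3 Failure details (required only)

Approval: PR issue (confidence: high)

Approval

Status: ❌ Failed
Error type: PR issue
Confidence: high
Root cause summary: the PR modifies fastdeploy/envs.py, which requires approval from a FastDeploy RD member
Analyzer: generic analysis (fallback)

Root cause details:
This PR modifies fastdeploy/envs.py, a protected file that requires approval from at least one FastDeploy RD member. The CI script scripts/check_approval.sh detected that the required approval is missing and exited with exit code 6.

Key log: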

0. You must have one FastDeploy RD (Jiang-Jia-Jun(jiangjiajun), yuanlehome(liuyuanle), rainyfly(chenjian26), Wanglongzhi2001(wanglongzhi)) approval for modifying [fastdeploy/envs.py].
There are 1 approved errors.
##[error]Process completed with exit code 6.

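Suggested fix:

Ask any one of the following FastDeploy RD members to approve this PR: @jiangjiajun, @liuyuanle, @chenjian26, @wanglongzhi

Suggested fix summary: ask a FastDeploy RD (jiangjiajun or one of the others) to approve this PR

Related change: fastdeploy/envs.py (modified in this PR, which triggers the approval check)

Link: View log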

codecov-commenter · 2026-05-07T15:54:22Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
⚠️ Please upload report for BASE (develop@5e3185f). Learn more about missing BASE report.

Additional details and impacted files

@@            Coverage Diff             @@
##             develop    #7746   +/-   ##
==========================================
  Coverage           ?   72.35%           
==========================================
  Files              ?      396           
  Lines              ?    55577           
  Branches           ?     8688           
==========================================
  Hits               ?    40213           
  Misses             ?    12611           
  Partials           ?     2753
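
As a quick arithmetic check of the figures in this report, the line counts add up and the coverage ratio can be recomputed from them. The snippet below only uses the numbers shown in the diff; how Codecov rounds or truncates the last displayed digit is not stated in the report.

```python
# Figures copied from the Codecov coverage diff.
hits, misses, partials, lines = 40213, 12611, 2753, 55577

# Each tracked line is counted once as a hit, a miss, or a partial.
assert hits + misses + partials == lines

# Coverage is the share of fully hit lines.
coverage = hits / lines
print(f"coverage = {coverage:.4%}")
```

The ratio comes out at roughly 72.355%, in line with the reported 72.35% up to the treatment of the last digit.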

| Flag | Coverage Δ |
|---|---|
| GPU | `72.35% <ø> (?)` |

Flags with carried forward coverage won't be shown. Click here to find out more.


PaddlePaddle-bot

🤖 Paddle-CI-Agent | pr_review | 2026-05-08 17:04:35

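📋 Review summary

PR overview: enables the two environment variables FD_ENABLE_E2W_TENSOR_CONVERT and FD_ENGINE_TASK_QUEUE_WITH_SHM by default to improve Engine-to-Worker tensor transfer and SHM task queue performance, and adds /dev/shm cleanup logic to the test utilities.
Scope of change: fastdeploy/envs.py and several test helper utilities under tests/
Impact tags: [FDConfig] [CI]

📝 PR convention check

The title uses the official tag [FDConfig] and is correctly formatted; the PR description contains every required section (Motivation / Modifications / Usage or Command / Accuracy Tests / Checklist) and follows the template structure. ✓

Issues

| Severity | File | Summary |
|---|---|---|
| 🟡 Suggestion | `tests/ci_use/Qwen2-7B-Instruct_offline/test_Qwen2-7B-Instruct_offline.py:98` | Replacing the port health check with `time.sleep(2)` makes engine readiness detection unreliable |
| ❓ Question | `tests/e2e/utils/serving_utils.py:122` | `subprocess.run` does not check the return code, so an `rm` failure passes silently |

Overall assessment

The core change (the default adjustment in envs.py) is clear and concise, and comes with a backward-compatibility note. Two points on the test side deserve attention: with the port-wait logic commented out, engine readiness detection relies on a fixed sleep(2), which risks intermittent instability; and the exception handling around the /dev/shm cleanup does not catch subprocess failures and should be tightened. Overall, the PR can be merged.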

PaddlePaddle-bot · 2026-05-08T09:06:53Z

-                    f"Model engine did not start within {MAX_WAIT_SECONDS} seconds on port {FD_ENGINE_QUEUE_PORT}"
-                )
-            time.sleep(1)
+        time.sleep(2)


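🟡 Suggestion: Replacing the port health check with time.sleep(2) makes engine readiness detection unreliable.

With FD_ENGINE_TASK_QUEUE_WITH_SHM=1 the engine no longer listens on a TCP port, so the original port check stops working. However, a fixed 2-second sleep cannot guarantee that the engine is fully ready, and in a heavily loaded CI environment it can easily cause intermittent test failures (flaky tests).

Consider adding a readiness check suited to SHM mode (for example, polling for the creation of the SHM file or semaphore), or at least lengthening the wait and guarding it with a MAX_WAIT_SECONDS timeout, with a comment explaining why the old port check no longer applies:

    # In SHM mode the engine does not listen on a TCP port; wait for SHM readiness instead
    wait_start = time.time()
    while not shm_is_ready():  # add an SHM readiness check here
        if time.time() - wait_start > MAX_WAIT_SECONDS:
            pytest.fail(f"Engine did not start within {MAX_WAIT_SECONDS}s")
        time.sleep(1)

If no SHM readiness probe is available yet, replace time.sleep(2) with a more conservative fixed value and leave a TODO comment.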
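
The polling-with-timeout pattern suggested in this comment can be factored into a small reusable helper. This is a minimal sketch, not code from the PR: `wait_until` is a hypothetical name, and the readiness predicate is left to the caller because the source does not say what the engine creates in shared memory when it is ready.

```python
import time


def wait_until(is_ready, timeout_s, interval_s=1.0):
    """Poll `is_ready()` until it returns True or `timeout_s` elapses.

    Returns True if the condition was met in time, False on timeout.
    Uses a monotonic clock so wall-clock adjustments do not affect it.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if is_ready():
            return True
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False
        # Never sleep past the deadline.
        time.sleep(min(interval_s, remaining))
```

In a test, the call might look like `assert wait_until(probe, MAX_WAIT_SECONDS)`, where `probe` is whatever SHM readiness check turns out to be appropriate (for example, checking that an expected file exists, which is an assumption here).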

PaddlePaddle-bot · 2026-05-08T09:06:54Z


+    # 清理/dev/shm中的临时文件
+    try:
+        subprocess.run("rm -rf /dev/shm/*", shell=True)


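❓ Question: subprocess.run does not check the return code, so an rm failure passes silently.

subprocess.run("rm -rf /dev/shm/*", shell=True) does not raise a Python exception even when the rm command itself fails (for example, because of insufficient permissions), so the except Exception branch is never triggered by an rm failure and a failed cleanup is silently ignored. The same pattern also appears in deploy.py, test_eblite_serving.py, and model_loader/utils.py.

Consider checking returncode explicitly:

    try:
        result = subprocess.run("rm -rf /dev/shm/*", shell=True)
        if result.returncode != 0:
            print(f"Warning: /dev/shm cleanup returned code {result.returncode}")
        else:
            print("Successfully cleaned up /dev/shm.")
    except Exception as e:
        print(f"Failed to cleanup /dev/shm: {e}")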
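
An alternative to inspecting `returncode` by hand is `check=True`, which makes `subprocess.run` raise `subprocess.CalledProcessError` on a non-zero exit status so that the failure actually reaches an `except` branch. The sketch below shows that behavior with a generic command; `run_checked` is a hypothetical helper, and it takes an argument list rather than a shell string.

```python
import subprocess


def run_checked(cmd):
    """Run `cmd` (a list of arguments); return True on success, False on failure.

    With check=True a non-zero exit status raises CalledProcessError,
    so command failures are no longer silently ignored.
    """
    try:
        subprocess.run(cmd, check=True, capture_output=True, text=True)
    except subprocess.CalledProcessError as exc:
        print(f"Warning: command exited with code {exc.returncode}: {exc.stderr.strip()}")
        return False
    except OSError as exc:
        # Raised when the executable cannot be launched at all.
        print(f"Failed to launch command: {exc}")
        return False
    return True
```

Whether a failed cleanup should fail the test or only log a warning is a design choice for the test suite; the helper returns a boolean so the caller can decide.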

[test] Stop server with /dev/shm cleanup

dbb1c5c

sunlei1024 had a problem deploying to Metax_ci May 7, 2026 14:26 — with GitHub Actions Failure

sunlei1024 changed the title ~~[test] Stop server with /dev/shm cleanup~~ [FDConfig] Enable FD_ENABLE_E2W_TENSOR_CONVERT and FD_ENGINE_TASK_QUEUE_WITH_SHM by default May 7, 2026


cleanup shm by clean_ports

cc5a7b5

sunlei1024 had a problem deploying to Metax_ci May 8, 2026 08:55 — with GitHub Actions Failure

PaddlePaddle-bot reviewed May 8, 2026

sunlei1024 wants to merge 2 commits into PaddlePaddle:develop from sunlei1024:feat/default-enable-shm

3 participants
