Skip to content

[FDConfig] 默认开启 FD_ENABLE_E2W_TENSOR_CONVERT 和 FD_ENGINE_TASK_QUEUE_WITH_SHM#7746

Open
sunlei1024 wants to merge 2 commits intoPaddlePaddle:developfrom
sunlei1024:feat/default-enable-shm
Open

[FDConfig] 默认开启 FD_ENABLE_E2W_TENSOR_CONVERT 和 FD_ENGINE_TASK_QUEUE_WITH_SHM#7746
sunlei1024 wants to merge 2 commits intoPaddlePaddle:developfrom
sunlei1024:feat/default-enable-shm

Conversation

@sunlei1024
Copy link
Copy Markdown
Collaborator

Motivation

默认开启 FD_ENABLE_E2W_TENSOR_CONVERTFD_ENGINE_TASK_QUEUE_WITH_SHM
以提升 Engine-to-Worker 的张量传递效率,以及引擎任务队列基于共享内存(SHM)的通信性能。

该优化在大模型推理场景下可以减少序列化/反序列化开销,提高吞吐和延迟表现。

Modifications

  • fastdeploy/envs.py
    • FD_ENABLE_E2W_TENSOR_CONVERT 默认值由 0 改为 1
    • FD_ENGINE_TASK_QUEUE_WITH_SHM 默认值由 0 改为 1

行为变更说明

  • 未显式设置环境变量时,将默认启用上述优化能力
  • 如需保持旧行为,可手动设置:
    • FD_ENABLE_E2W_TENSOR_CONVERT=0
    • FD_ENGINE_TASK_QUEUE_WITH_SHM=0
  • 在容器环境中需确保 /dev/shm 空间充足(建议 ≥ 1GB,视模型规模而定)

Usage or Command

默认无需额外配置,升级后自动生效。

如需关闭相关功能,可通过环境变量控制:

export FD_ENABLE_E2W_TENSOR_CONVERT=0
export FD_ENGINE_TASK_QUEUE_WITH_SHM=0

Docker 使用示例(配置共享内存):

docker run --shm-size=1g ...

Accuracy Tests

本次修改仅涉及环境变量默认值调整,不涉及模型计算逻辑变更。

验证结果:

  • ✅ 功能验证:服务启动、推理流程正常
  • ✅ 性能验证:E2W tensor convert 与 SHM queue 正常工作
  • ✅ 一致性验证:关闭开关(设为0)后结果与旧版本一致(无精度差异)

Checklist

  • Add at least one tag in PR title (e.g., [FDConfig])
  • Code formatted and pre-commit passed
  • Unit tests added
    • 原因:本次修改仅为默认配置变更,无新增逻辑路径
  • Accuracy results provided
  • Backward compatibility considered(可通过环境变量回退)
  • Not a Cherry-Pick PR / OR follows Cherry-Pick rules if applicable

@paddle-bot
Copy link
Copy Markdown

paddle-bot Bot commented May 7, 2026

Thanks for your contribution!

@sunlei1024 sunlei1024 changed the title [test] Stop server with /dev/shm cleanup [FDConfig] 默认开启 FD_ENABLE_E2W_TENSOR_CONVERT 和 FD_ENGINE_TASK_QUEUE_WITH_SHM May 7, 2026
PaddlePaddle-bot

This comment was marked as outdated.

@PaddlePaddle-bot
Copy link
Copy Markdown

PaddlePaddle-bot commented May 7, 2026

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-05-08 17:15:57

CI报告基于以下代码生成(30分钟更新一次):


1 任务总览

⚠️1 个 Required 任务失败6 个 Required 任务运行中,请等待 CI 完成后再合并。

总执行(rerun次数) 总任务 ✅ 通过 ❌ 失败 ⏳ 运行中 ⏸️ 等待中 跳过
36(0) 36 26 2 7 1 0

2 任务状态汇总

2.1 Required任务 : 3/10 通过

必选任务阻塞合并,失败需优先处理。

状态 任务 耗时 根因 修复建议 日志 重跑
Approval 7s PR问题:修改envs.py需FastDeploy RD审批 请jiangjiajun等4位RD审批PR Job -
xpu_4cards_case_test / run_xpu_4cards_cases - 运行中 - - -
xpu_8cards_case_test / run_xpu_8cards_cases - 运行中 - - -
Extracted partial CE model tasks to run in CI. / run_ce_cases - 运行中 - - -
Run Base Tests / base_tests - 运行中 - - -
Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage - 运行中 - - -
Run Four Cards Tests / run_4_cards_tests - 运行中 - - -
其余 3 个必选任务通过 - - - - -

2.2 可选任务 — 23/26 通过

可选任务不阻塞合并,失败仅供参考。

状态 任务 耗时 日志 重跑
Trigger Jenkins for PR 12m42s Job -
Run iluvatar Tests / run_iluvatar_cases - - -
⏸️ CI_HPU - - -
其余 23 个可选任务通过 - - -

3 失败详情(仅 required)

Approval — PR问题(置信度: 高)

Approval

  • 状态: ❌ 失败
  • 错误类型: PR问题
  • 置信度: 高
  • 根因摘要: PR修改了fastdeploy/envs.py,需FastDeploy RD成员审批
  • 分析器: 通用分析(fallback)

根因详情:
本次 PR 修改了 fastdeploy/envs.py 文件,该文件属于受保护文件,需要至少一位 FastDeploy RD 成员进行审批方可通过。CI 脚本 scripts/check_approval.sh 检测到尚未获得所需审批,以 exit code 6 退出。

关键日志:

0. You must have one FastDeploy RD (Jiang-Jia-Jun(jiangjiajun), yuanlehome(liuyuanle), rainyfly(chenjian26), Wanglongzhi2001(wanglongzhi)) approval for modifying [fastdeploy/envs.py].
There are 1 approved errors.
##[error]Process completed with exit code 6.

修复建议:

  1. 请以下任一 FastDeploy RD 成员审批本 PR:@jiangjiajun、@liuyuanle、@chenjian26、@wanglongzhi

修复建议摘要: 请FastDeploy RD(jiangjiajun等)审批本PR

关联变更: fastdeploy/envs.py(PR 中修改了该文件,触发审批检查)

链接: 查看日志

@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented May 7, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
⚠️ Please upload report for BASE (develop@5e3185f). Learn more about missing BASE report.

Additional details and impacted files
@@            Coverage Diff             @@
##             develop    #7746   +/-   ##
==========================================
  Coverage           ?   72.35%           
==========================================
  Files              ?      396           
  Lines              ?    55577           
  Branches           ?     8688           
==========================================
  Hits               ?    40213           
  Misses             ?    12611           
  Partials           ?     2753           
Flag Coverage Δ
GPU 72.35% <ø> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Copy Markdown

@PaddlePaddle-bot PaddlePaddle-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 Paddle-CI-Agent | pr_review | 2026-05-08 17:04:35

📋 Review 摘要

PR 概述:默认开启 FD_ENABLE_E2W_TENSOR_CONVERTFD_ENGINE_TASK_QUEUE_WITH_SHM 两个环境变量,提升 Engine-to-Worker 张量传递与 SHM 任务队列性能,并在测试工具中增加 /dev/shm 清理逻辑。
变更范围fastdeploy/envs.pytests/ 多处测试辅助工具
影响面 Tag[FDConfig] [CI]

📝 PR 规范检查

标题使用官方 Tag [FDConfig],格式合规;PR 描述已包含所有必填 section(Motivation / Modifications / Usage or Command / Accuracy Tests / Checklist),结构符合模板要求。✓

问题

级别 文件 概述
🟡 建议 tests/ci_use/Qwen2-7B-Instruct_offline/test_Qwen2-7B-Instruct_offline.py:98 time.sleep(2) 替代端口健康检查,引擎就绪检测不可靠
❓ 疑问 tests/e2e/utils/serving_utils.py:122 subprocess.run 未检查返回码,rm 失败会静默通过

总体评价

核心变更(envs.py 默认值调整)逻辑清晰,改动简洁,有向后兼容说明。测试侧有两处值得关注:端口等待逻辑被注释后,引擎就绪检测依赖固定 sleep(2) 存在偶发不稳定风险;/dev/shm 清理的异常捕获对 subprocess 失败无效,建议完善,整体可合入。

f"Model engine did not start within {MAX_WAIT_SECONDS} seconds on port {FD_ENGINE_QUEUE_PORT}"
)
time.sleep(1)
time.sleep(2)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 建议time.sleep(2) 替代端口健康检查,引擎就绪检测不可靠。

FD_ENGINE_TASK_QUEUE_WITH_SHM=1 后,引擎不再监听 TCP 端口,原端口检查失效——但固定睡眠 2 秒无法保证引擎已完全就绪,在负载较重的 CI 环境中容易引发偶发性测试失败(flaky test)。

建议补充适合 SHM 模式的就绪检测机制(如轮询 SHM 文件/信号量是否创建),或至少延长等待时间并加 MAX_WAIT_SECONDS 超时保护,并在注释中说明为何旧的端口检查不再适用:

# SHM 模式下引擎不监听 TCP 端口,改为等待 SHM 就绪
wait_start = time.time()
while not shm_is_ready():  # 补充 SHM 就绪检测
    if time.time() - wait_start > MAX_WAIT_SECONDS:
        pytest.fail(f"Engine did not start within {MAX_WAIT_SECONDS}s")
    time.sleep(1)

如果暂时没有可用的 SHM 就绪探针,建议将 time.sleep(2) 替换为更保险的固定值并留下 TODO 注释。


# 清理/dev/shm中的临时文件
try:
subprocess.run("rm -rf /dev/shm/*", shell=True)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❓ 疑问 subprocess.run 未检查返回码,rm 失败会静默通过。

subprocess.run("rm -rf /dev/shm/*", shell=True) 即使 rm 命令本身报错(如权限不足),也不会抛出 Python 异常,except Exception 分支永远不会执行,导致清理失败被无声忽略。相同模式在 deploy.pytest_eblite_serving.pymodel_loader/utils.py 中也存在。

建议添加显式检查 returncode:

try:
    result = subprocess.run("rm -rf /dev/shm/*", shell=True)
    if result.returncode != 0:
        print(f"Warning: /dev/shm cleanup returned code {result.returncode}")
    else:
        print("Successfully cleaned up /dev/shm.")
except Exception as e:
    print(f"Failed to cleanup /dev/shm: {e}")

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants