Skip to content

[XPU] [CI] Fix xpu ci bug#7014

Open
plusNew001 wants to merge 9 commits intoPaddlePaddle:developfrom
plusNew001:ci-update-0325
Open

[XPU] [CI] Fix xpu ci bug#7014
plusNew001 wants to merge 9 commits intoPaddlePaddle:developfrom
plusNew001:ci-update-0325

Conversation

@plusNew001
Copy link
Collaborator

Motivation

💡 If this PR is a Cherry Pick, the PR title needs to follow the format by adding the [Cherry-Pick] label at the very beginning and appending the original PR ID at the end. For example, [Cherry-Pick][CI] Add check trigger and logic(#5191)

💡 如若此PR是Cherry Pick,PR标题需遵循格式,在最开始加上[Cherry-Pick]标签,以及最后面加上原PR ID,例如[Cherry-Pick][CI] Add check trigger and logic(#5191)

Modifications

Usage or Command

Accuracy Tests

Checklist

  • Add at least a tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code, run pre-commit before commit.
  • Add unit tests. Please write the reason in this PR if no unit tests.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

Copilot AI review requested due to automatic review settings March 25, 2026 11:42
@paddle-bot
Copy link

paddle-bot bot commented Mar 25, 2026

Thanks for your contribution!

@paddle-bot paddle-bot bot added the contributor External developers label Mar 25, 2026
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

该 PR 旨在修复 XPU CI 场景下日志难以获取/排查的问题:在 pytest 执行过程中按用例归档关键日志,并在 GitHub Actions 中将归档目录作为 artifact 上传,便于失败后回溯。

Changes:

  • tests/xpu_ci/conftest.py 新增 pytest hook,将 log* 目录/文件与 server.log 归档到 case_logs/<case>/
  • 在 4 卡/8 卡 XPU CI workflow 中创建 case_logs、调整权限,并无条件上传 artifact。
  • workflow 中通过 set +e 捕获 pytest 退出码,保证日志归档/上传步骤仍会执行。

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File Description
tests/xpu_ci/conftest.py 增加用例级日志归档逻辑与 pytest hook
.github/workflows/_xpu_8cards_case_test.yml 创建/授权 case_logs 并上传 8 卡用例日志 artifact
.github/workflows/_xpu_4cards_case_test.yml 创建/授权 case_logs 并上传 4 卡用例日志 artifact

# 复制所有 log* 目录
for entry in glob.glob("log*"):
if os.path.isdir(entry):
shutil.copytree(entry, os.path.join(dest_dir, entry), dirs_exist_ok=True)
Copy link

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这里使用了 shutil.copytree(..., dirs_exist_ok=True)。该参数仅在 Python 3.8+ 可用,而仓库 python_requires 为 >=3.7,会导致在 Python 3.7 环境直接 TypeError。建议改为兼容实现(例如:若目标目录已存在则先删除/清空后再 copytree,或手动遍历复制),避免依赖 dirs_exist_ok。

Suggested change
shutil.copytree(entry, os.path.join(dest_dir, entry), dirs_exist_ok=True)
dest_path = os.path.join(dest_dir, entry)
if os.path.exists(dest_path):
shutil.rmtree(dest_path)
shutil.copytree(entry, dest_path)

Copilot uses AI. Check for mistakes.
Comment on lines +496 to +508
@pytest.hookimpl(hookwrapper=True, trylast=True)
def pytest_runtest_makereport(item, call):
"""每个测试阶段结束后归档日志(仅在 call 阶段后执行)"""
outcome = yield
report = outcome.get_result()

if report.when == "call":
# 使用测试文件名(不含 .py)作为归档目录名
test_file = os.path.basename(item.fspath)
test_name = os.path.splitext(test_file)[0]
try:
_archive_case_logs(test_name)
except Exception as e:
Copy link

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

当前 hook 在每个用例的 call 阶段都会归档日志,并且遍历打印归档目录下所有文件。XPU CI 用例数量较多时,这会显著增加 I/O 和 CI 日志输出量,导致执行变慢、artifact 体积膨胀。建议仅在失败时(report.failed)或通过环境变量/pytest 选项开关启用归档与文件清单输出。

Copilot uses AI. Check for mistakes.
Comment on lines +503 to +506
# 使用测试文件名(不含 .py)作为归档目录名
test_file = os.path.basename(item.fspath)
test_name = os.path.splitext(test_file)[0]
try:
Copy link

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

归档目录名目前仅使用测试文件名(不含 .py)。如果同一文件内存在多个测试函数/参数化用例,日志会被反复覆盖或混合,定位问题会变困难。建议使用 item.nodeid(做路径/非法字符替换)或包含测试函数名的更细粒度标识作为目录名。

Copilot uses AI. Check for mistakes.
@plusNew001
Copy link
Collaborator Author

/skip-ci ci_iluvatar
/skip-ci ci_hpu
/skip-ci build_gpu

Copilot AI review requested due to automatic review settings March 26, 2026 03:10
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

Comment on lines +465 to +468
# ============ 日志归档 pytest hook ============


def _archive_case_logs(test_name):
Copy link

Copilot AI Mar 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR 描述目前基本是模板占位,缺少“为什么要改/解决了什么问题/如何验证”的信息。建议补充至少:触发的 XPU CI bug 现象、根因、改动点,以及如何在 CI/本地复现与验证(例如失败时需要哪些日志、上传 artifact 的预期)。

Copilot uses AI. Check for mistakes.

if report.when == "call":
# 使用测试文件名(不含 .py)作为归档目录名
test_file = os.path.basename(item.fspath)
Copy link

Copilot AI Mar 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这里通过 item.fspath 获取测试文件名。item.fspath 在较新的 pytest 版本中已被弃用(趋向使用 item.path / pathlib.Path),未来升级 pytest 可能导致属性不存在,从而让日志归档 hook 失效。建议改用 pytest 推荐的路径属性(并在需要时兼容旧版本)。

Suggested change
test_file = os.path.basename(item.fspath)
# Prefer pytest's newer `item.path` API and fall back to `item.fspath` for older versions
test_path = getattr(item, "path", None)
if test_path is None:
test_path = getattr(item, "fspath", "")
test_file = os.path.basename(str(test_path))

Copilot uses AI. Check for mistakes.

- name: Upload case logs
if: always()
uses: actions/upload-artifact@v6
Copy link

Copilot AI Mar 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

该 workflow 里使用 actions/upload-artifact@v6,但仓库内同类 XPU workflow(例如 _xpu_8cards_case_test.yml)使用的是 @v4。建议统一到同一主版本(并确认所选版本在当前 GitHub Actions 上可用),避免不同 workflow 行为不一致或因版本不可用导致上传步骤失败。

Suggested change
uses: actions/upload-artifact@v6
uses: actions/upload-artifact@v4

Copilot uses AI. Check for mistakes.
Copilot AI review requested due to automatic review settings March 26, 2026 06:32
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.

Comment on lines +35 to +36
CASE_LOGS_DIR = os.path.join(os.getcwd(), "case_logs")

Copy link

Copilot AI Mar 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR 描述目前基本是模板空白(Motivation/Modifications/Usage/Tests 未填写)。建议补充:该 CI bug 的具体表现、根因、修复点,以及如何在本地/CI 复现与验证(例如期望产出的 case_logs 结构与上传 artifact 名称)。

Copilot uses AI. Check for mistakes.
else
echo "============================4卡cases测试失败,请检查日志!============================"
exit $exit_code
exit 1
Copy link

Copilot AI Mar 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这里在捕获到 pytest 失败后固定 exit 1 会丢失真实的 pytest 退出码(例如用于区分超时/内部错误/用例失败等),也和 8cards workflow 的处理方式不一致。建议保留并 exit $exit_code,让上游能拿到准确失败原因。

Suggested change
exit 1
exit "$exit_code"

Copilot uses AI. Check for mistakes.
Copilot AI review requested due to automatic review settings March 26, 2026 07:15
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.

Comment on lines +496 to +501
if report.when == "call":
# 使用测试文件名(不含 .py)作为归档目录名
test_file = os.path.basename(item.fspath)
test_name = os.path.splitext(test_file)[0]
try:
_archive_case_logs(test_name)
Copy link

Copilot AI Mar 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

当前 hook 在每个用例 call 阶段都会归档日志(无论成功/失败),可能导致 CI 额外 I/O 和产物体积显著增加。若目的是排查失败,建议仅在 report.failed(或 report.outcome != "passed")时再触发归档。

Copilot uses AI. Check for mistakes.
Comment on lines +500 to +503
try:
_archive_case_logs(test_name)
except Exception:
pass
Copy link

Copilot AI Mar 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这里捕获了所有异常后直接 pass,如果归档失败会被静默吞掉,后续排障很难定位原因。建议至少打印 warning/异常信息(或用 pytestterminalreporter/logging 记录),并考虑只捕获预期异常类型。

Copilot uses AI. Check for mistakes.
Comment on lines 182 to 188
echo "============================开始运行pytest测试============================"
export PYTHONPATH=/workspace/FastDeploy/
export PYTHONPATH=$(pwd)/tests/xpu_ci:$PYTHONPATH
mkdir -p case_logs
set +e
python -m pytest -v -s --tb=short tests/xpu_ci/8cards_cases/
exit_code=$?
Copy link

Copilot AI Mar 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR 描述目前基本未填写(Motivation/Modifications/Usage/Command/Tests 等为空),难以确认修复的具体问题、复现/验证方式以及为何需要这些改动。建议补充:1) 具体 CI bug 现象/日志;2) 本次改动如何修复;3) 如何在本地或 CI 中验证(命令/期望结果)。

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

contributor External developers

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants