Conversation
|
Thanks for your contribution! |
There was a problem hiding this comment.
Pull request overview
该 PR 旨在修复 XPU CI 场景下日志难以获取/排查的问题:在 pytest 执行过程中按用例归档关键日志,并在 GitHub Actions 中将归档目录作为 artifact 上传,便于失败后回溯。
Changes:
- 在
tests/xpu_ci/conftest.py新增 pytest hook,将log*目录/文件与server.log归档到case_logs/<case>/。 - 在 4 卡/8 卡 XPU CI workflow 中创建
case_logs、调整权限,并无条件上传 artifact。 - workflow 中通过
set +e捕获 pytest 退出码,保证日志归档/上传步骤仍会执行。
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| tests/xpu_ci/conftest.py | 增加用例级日志归档逻辑与 pytest hook |
| .github/workflows/_xpu_8cards_case_test.yml | 创建/授权 case_logs 并上传 8 卡用例日志 artifact |
| .github/workflows/_xpu_4cards_case_test.yml | 创建/授权 case_logs 并上传 4 卡用例日志 artifact |
| # 复制所有 log* 目录 | ||
| for entry in glob.glob("log*"): | ||
| if os.path.isdir(entry): | ||
| shutil.copytree(entry, os.path.join(dest_dir, entry), dirs_exist_ok=True) |
There was a problem hiding this comment.
这里使用了 shutil.copytree(..., dirs_exist_ok=True)。该参数仅在 Python 3.8+ 可用,而仓库 python_requires 为 >=3.7,会导致在 Python 3.7 环境直接 TypeError。建议改为兼容实现(例如:若目标目录已存在则先删除/清空后再 copytree,或手动遍历复制),避免依赖 dirs_exist_ok。
| shutil.copytree(entry, os.path.join(dest_dir, entry), dirs_exist_ok=True) | |
| dest_path = os.path.join(dest_dir, entry) | |
| if os.path.exists(dest_path): | |
| shutil.rmtree(dest_path) | |
| shutil.copytree(entry, dest_path) |
tests/xpu_ci/conftest.py
Outdated
| @pytest.hookimpl(hookwrapper=True, trylast=True) | ||
| def pytest_runtest_makereport(item, call): | ||
| """每个测试阶段结束后归档日志(仅在 call 阶段后执行)""" | ||
| outcome = yield | ||
| report = outcome.get_result() | ||
|
|
||
| if report.when == "call": | ||
| # 使用测试文件名(不含 .py)作为归档目录名 | ||
| test_file = os.path.basename(item.fspath) | ||
| test_name = os.path.splitext(test_file)[0] | ||
| try: | ||
| _archive_case_logs(test_name) | ||
| except Exception as e: |
There was a problem hiding this comment.
当前 hook 在每个用例的 call 阶段都会归档日志,并且遍历打印归档目录下所有文件。XPU CI 用例数量较多时,这会显著增加 I/O 和 CI 日志输出量,导致执行变慢、artifact 体积膨胀。建议仅在失败时(report.failed)或通过环境变量/pytest 选项开关启用归档与文件清单输出。
| # 使用测试文件名(不含 .py)作为归档目录名 | ||
| test_file = os.path.basename(item.fspath) | ||
| test_name = os.path.splitext(test_file)[0] | ||
| try: |
There was a problem hiding this comment.
归档目录名目前仅使用测试文件名(不含 .py)。如果同一文件内存在多个测试函数/参数化用例,日志会被反复覆盖或混合,定位问题会变困难。建议使用 item.nodeid(做路径/非法字符替换)或包含测试函数名的更细粒度标识作为目录名。
|
/skip-ci ci_iluvatar |
| # ============ 日志归档 pytest hook ============ | ||
|
|
||
|
|
||
| def _archive_case_logs(test_name): |
There was a problem hiding this comment.
PR 描述目前基本是模板占位,缺少“为什么要改/解决了什么问题/如何验证”的信息。建议补充至少:触发的 XPU CI bug 现象、根因、改动点,以及如何在 CI/本地复现与验证(例如失败时需要哪些日志、上传 artifact 的预期)。
|
|
||
| if report.when == "call": | ||
| # 使用测试文件名(不含 .py)作为归档目录名 | ||
| test_file = os.path.basename(item.fspath) |
There was a problem hiding this comment.
这里通过 item.fspath 获取测试文件名。item.fspath 在较新的 pytest 版本中已被弃用(趋向使用 item.path / pathlib.Path),未来升级 pytest 可能导致属性不存在,从而让日志归档 hook 失效。建议改用 pytest 推荐的路径属性(并在需要时兼容旧版本)。
| test_file = os.path.basename(item.fspath) | |
| # Prefer pytest's newer `item.path` API and fall back to `item.fspath` for older versions | |
| test_path = getattr(item, "path", None) | |
| if test_path is None: | |
| test_path = getattr(item, "fspath", "") | |
| test_file = os.path.basename(str(test_path)) |
|
|
||
| - name: Upload case logs | ||
| if: always() | ||
| uses: actions/upload-artifact@v6 |
There was a problem hiding this comment.
该 workflow 里使用 actions/upload-artifact@v6,但仓库内同类 XPU workflow(例如 _xpu_8cards_case_test.yml)使用的是 @v4。建议统一到同一主版本(并确认所选版本在当前 GitHub Actions 上可用),避免不同 workflow 行为不一致或因版本不可用导致上传步骤失败。
| uses: actions/upload-artifact@v6 | |
| uses: actions/upload-artifact@v4 |
| CASE_LOGS_DIR = os.path.join(os.getcwd(), "case_logs") | ||
|
|
There was a problem hiding this comment.
PR 描述目前基本是模板空白(Motivation/Modifications/Usage/Tests 未填写)。建议补充:该 CI bug 的具体表现、根因、修复点,以及如何在本地/CI 复现与验证(例如期望产出的 case_logs 结构与上传 artifact 名称)。
| else | ||
| echo "============================4卡cases测试失败,请检查日志!============================" | ||
| exit $exit_code | ||
| exit 1 |
There was a problem hiding this comment.
这里在捕获到 pytest 失败后固定 exit 1 会丢失真实的 pytest 退出码(例如用于区分超时/内部错误/用例失败等),也和 8cards workflow 的处理方式不一致。建议保留并 exit $exit_code,让上游能拿到准确失败原因。
| exit 1 | |
| exit "$exit_code" |
| if report.when == "call": | ||
| # 使用测试文件名(不含 .py)作为归档目录名 | ||
| test_file = os.path.basename(item.fspath) | ||
| test_name = os.path.splitext(test_file)[0] | ||
| try: | ||
| _archive_case_logs(test_name) |
There was a problem hiding this comment.
当前 hook 在每个用例 call 阶段都会归档日志(无论成功/失败),可能导致 CI 额外 I/O 和产物体积显著增加。若目的是排查失败,建议仅在 report.failed(或 report.outcome != "passed")时再触发归档。
| try: | ||
| _archive_case_logs(test_name) | ||
| except Exception: | ||
| pass |
There was a problem hiding this comment.
这里捕获了所有异常后直接 pass,如果归档失败会被静默吞掉,后续排障很难定位原因。建议至少打印 warning/异常信息(或用 pytest 的 terminalreporter/logging 记录),并考虑只捕获预期异常类型。
| echo "============================开始运行pytest测试============================" | ||
| export PYTHONPATH=/workspace/FastDeploy/ | ||
| export PYTHONPATH=$(pwd)/tests/xpu_ci:$PYTHONPATH | ||
| mkdir -p case_logs | ||
| set +e | ||
| python -m pytest -v -s --tb=short tests/xpu_ci/8cards_cases/ | ||
| exit_code=$? |
There was a problem hiding this comment.
PR 描述目前基本未填写(Motivation/Modifications/Usage/Command/Tests 等为空),难以确认修复的具体问题、复现/验证方式以及为何需要这些改动。建议补充:1) 具体 CI bug 现象/日志;2) 本次改动如何修复;3) 如何在本地或 CI 中验证(命令/期望结果)。
Motivation
Modifications
Usage or Command
Accuracy Tests
Checklist
[FDConfig],[APIServer],[Engine],[Scheduler],[PD Disaggregation],[Executor],[Graph Optimization],[Speculative Decoding],[RL],[Models],[Quantization],[Loader],[OP],[KVCache],[DataProcessor],[BugFix],[Docs],[CI],[Optimization],[Feature],[Benchmark],[Others],[XPU],[HPU],[GCU],[DCU],[Iluvatar],[Metax]]pre-commitbefore commit.releasebranch, make sure the PR has been submitted to thedevelopbranch, then cherry-pick it to thereleasebranch with the[Cherry-Pick]PR tag.