16 changes: 16 additions & 0 deletions .github/workflows/_xpu_4cards_case_test.yml
@@ -193,13 +193,29 @@ jobs:
echo "============================开始运行pytest测试============================"
export PYTHONPATH=/workspace/FastDeploy/
export PYTHONPATH=$(pwd)/tests/xpu_ci:$PYTHONPATH
mkdir -p case_logs
set +e
python -m pytest -v -s --tb=short tests/xpu_ci/4cards_cases/
exit_code=$?
set -e

# Adjust case_logs permissions so the runner user outside Docker can read and upload the logs
chmod -R a+rX case_logs/ 2>/dev/null || true

if [ $exit_code -eq 0 ]; then
echo "============================4卡cases测试通过!============================"
exit $exit_code
else
echo "============================4卡cases测试失败,请检查日志!============================"
exit $exit_code
fi
'

- name: Upload case logs
if: always()
uses: actions/upload-artifact@v6
Copilot AI Mar 26, 2026

This workflow uses actions/upload-artifact@v6, whereas comparable XPU workflows in the repository (e.g. _xpu_8cards_case_test.yml) use @v4. Consider standardizing on a single major version (and confirming that the chosen version is available on GitHub Actions) to avoid inconsistent behavior across workflows or a failed upload step caused by an unavailable version.

Suggested change
- uses: actions/upload-artifact@v6
+ uses: actions/upload-artifact@v4

with:
name: xpu-4cards-case-logs
path: FastDeploy/case_logs/
retention-days: 7
if-no-files-found: ignore
15 changes: 15 additions & 0 deletions .github/workflows/_xpu_8cards_case_test.yml
@@ -182,8 +182,14 @@ jobs:
echo "============================开始运行pytest测试============================"
export PYTHONPATH=/workspace/FastDeploy/
export PYTHONPATH=$(pwd)/tests/xpu_ci:$PYTHONPATH
mkdir -p case_logs
set +e
python -m pytest -v -s --tb=short tests/xpu_ci/8cards_cases/
exit_code=$?
Comment on lines 182 to 188
Copilot AI Mar 26, 2026

The PR description is currently mostly unfilled (Motivation/Modifications/Usage/Command/Tests are empty), which makes it hard to confirm the specific problem being fixed, how to reproduce and verify it, and why these changes are needed. Please add: 1) the concrete CI bug symptoms and logs; 2) how this change fixes them; 3) how to verify the fix locally or in CI (commands and expected results).

set -e

# Adjust case_logs permissions so the runner user outside Docker can read and upload the logs
chmod -R a+rX case_logs/ 2>/dev/null || true

if [ $exit_code -eq 0 ]; then
echo "============================8卡cases测试通过!============================"
@@ -192,3 +198,12 @@ jobs:
exit $exit_code
fi
'

- name: Upload case logs
if: always()
uses: actions/upload-artifact@v6
with:
name: xpu-8cards-case-logs
path: FastDeploy/case_logs/
retention-days: 7
if-no-files-found: ignore
2 changes: 1 addition & 1 deletion tests/xpu_ci/8cards_cases/test_pd_21b_ep4tp1.py
@@ -109,7 +109,7 @@ def print_pd_logs_on_failure():
log_dirs = ["log_router", "log_prefill", "log_decode"]

for log_dir in log_dirs:
- nohup_path = os.path.join(log_dir, "log_0/worklog.0")
+ nohup_path = os.path.join(log_dir, "log_0/workerlog.0")
if os.path.exists(nohup_path):
print(f"\n========== {nohup_path} ==========")
with open(nohup_path, "r") as f:
2 changes: 1 addition & 1 deletion tests/xpu_ci/8cards_cases/test_pd_21b_ep4tp4.py
@@ -109,7 +109,7 @@ def print_pd_logs_on_failure():
log_dirs = ["log_router", "log_prefill", "log_decode"]

for log_dir in log_dirs:
- nohup_path = os.path.join(log_dir, "log_0/worklog.0")
+ nohup_path = os.path.join(log_dir, "log_0/workerlog.0")
if os.path.exists(nohup_path):
print(f"\n========== {nohup_path} ==========")
with open(nohup_path, "r") as f:
2 changes: 1 addition & 1 deletion tests/xpu_ci/8cards_cases/test_pd_21b_ep4tp4_cudagraph.py
@@ -109,7 +109,7 @@ def print_pd_logs_on_failure():
log_dirs = ["log_router", "log_prefill", "log_decode"]

for log_dir in log_dirs:
- nohup_path = os.path.join(log_dir, "log_0/worklog.0")
+ nohup_path = os.path.join(log_dir, "log_0/workerlog.0")
if os.path.exists(nohup_path):
print(f"\n========== {nohup_path} ==========")
with open(nohup_path, "r") as f:
2 changes: 1 addition & 1 deletion tests/xpu_ci/8cards_cases/test_pd_p_tp4ep4_d_tp1ep4.py
@@ -110,7 +110,7 @@ def print_pd_logs_on_failure():
log_dirs = ["log_router", "log_prefill", "log_decode"]

for log_dir in log_dirs:
- nohup_path = os.path.join(log_dir, "log_0/worklog.0")
+ nohup_path = os.path.join(log_dir, "log_0/workerlog.0")
if os.path.exists(nohup_path):
print(f"\n========== {nohup_path} ==========")
with open(nohup_path, "r") as f:
42 changes: 42 additions & 0 deletions tests/xpu_ci/conftest.py
@@ -23,6 +23,7 @@
4. Environment configuration - set XPU-related environment variables
"""

import glob
import json
import os
import shutil
@@ -31,6 +32,8 @@

import pytest

CASE_LOGS_DIR = os.path.join(os.getcwd(), "case_logs")

Comment on lines +35 to +36
Copilot AI Mar 26, 2026

The PR description is currently mostly the blank template (Motivation/Modifications/Usage/Tests are not filled in). Please add: the concrete symptoms of this CI bug, its root cause, the fix, and how to reproduce and verify it locally or in CI (for example, the expected case_logs structure and the names of the uploaded artifacts).


def get_xpu_id():
"""Get the XPU_ID environment variable"""
@@ -457,3 +460,42 @@ def setup_logprobs_zmq_env():
os.environ[key] = value
print(f"设置环境变量: {key}={value}")
return original_values


# ============ Log archiving pytest hook ============


def _archive_case_logs(test_name):
Comment on lines +465 to +468
Copilot AI Mar 26, 2026

The PR description is currently mostly template placeholders and lacks information on why the change is needed, what problem it solves, and how to verify it. Please add at least: the symptoms of the XPU CI bug, its root cause, the changes made, and how to reproduce and verify it in CI or locally (for example, which logs are needed on failure and what to expect from the artifact upload).

"""
Copy every directory whose name starts with log, plus server.log, from the current working directory into case_logs/{test_name}/
"""
dest_dir = os.path.join(CASE_LOGS_DIR, test_name)
os.makedirs(dest_dir, exist_ok=True)

# Copy all log* directories
for entry in glob.glob("log*"):
if os.path.isdir(entry):
shutil.copytree(entry, os.path.join(dest_dir, entry), dirs_exist_ok=True)
Copilot AI Mar 25, 2026

This uses shutil.copytree(..., dirs_exist_ok=True). That parameter is only available in Python 3.8+, while the repository's python_requires is >=3.7, so the call raises a TypeError on Python 3.7. Consider a compatible implementation (for example, delete or empty the destination directory if it already exists before calling copytree, or copy by manual traversal) to avoid depending on dirs_exist_ok.

Suggested change
- shutil.copytree(entry, os.path.join(dest_dir, entry), dirs_exist_ok=True)
+ dest_path = os.path.join(dest_dir, entry)
+ if os.path.exists(dest_path):
+     shutil.rmtree(dest_path)
+ shutil.copytree(entry, dest_path)

elif os.path.isfile(entry):
# Handle server.log and other files starting with log
Copilot AI Mar 26, 2026

The comment "Handle server.log and other files starting with log" does not match the actual logic: this branch handles regular files matched by log* (e.g. workerlog), while server.log is copied by a separate branch below. Consider updating the comment so it does not mislead future maintainers.

Suggested change
- # Handle server.log and other files starting with log
+ # Copy regular files starting with "log" (e.g. workerlog); server.log is handled separately below

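To make the mismatch concrete, here is a minimal sketch of which names a `log*` pattern selects. It assumes `fnmatchcase` is a fair stand-in for how `glob.glob("log*")` matches entry names in the current directory. Most of the names appear elsewhere in this PR; `log.txt` is a hypothetical file added for illustration.

```python
from fnmatch import fnmatchcase

# Illustrative entry names; "log.txt" is hypothetical, the rest appear in this PR.
names = ["log_router", "log_prefill", "server.log", "workerlog.0", "log.txt"]

# A "log*" pattern only selects names that START with "log".
matched = [n for n in names if fnmatchcase(n, "log*")]
print(matched)
```

Under this pattern `server.log` is not selected, which is why the code needs the separate `server.log` copy further down. A name such as `workerlog.0` also falls outside the match when it sits at the top level.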
shutil.copy2(entry, os.path.join(dest_dir, entry))
Comment on lines +475 to +481
Copilot AI Mar 26, 2026

This uses shutil.copytree(..., dirs_exist_ok=True), a parameter supported only in Python 3.8+. The repository's setup.py declares python_requires>=3.7, so running xpu_ci on Python 3.7 raises a TypeError. Consider a compatible approach (for example, delete or empty the destination directory if it exists before calling copytree, or copy recursively by hand to get a "merge" effect) to avoid depending on dirs_exist_ok.

Suggested change
- # Copy all log* directories
- for entry in glob.glob("log*"):
-     if os.path.isdir(entry):
-         shutil.copytree(entry, os.path.join(dest_dir, entry), dirs_exist_ok=True)
-     elif os.path.isfile(entry):
-         # Handle server.log and other files starting with log
-         shutil.copy2(entry, os.path.join(dest_dir, entry))
+ # Copy all log* directories or files
+ for entry in glob.glob("log*"):
+     dest_path = os.path.join(dest_dir, entry)
+     if os.path.isdir(entry):
+         # Python 3.7 does not support dirs_exist_ok, so handle existing dirs manually
+         if os.path.exists(dest_path):
+             shutil.rmtree(dest_path)
+         shutil.copytree(entry, dest_path)
+     elif os.path.isfile(entry):
+         # Handle server.log and other files starting with log
+         shutil.copy2(entry, dest_path)

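The suggestion above replaces an existing destination outright. If the "merge" behavior of dirs_exist_ok is actually wanted, a manual recursive copy is one possible alternative. The sketch below is only an illustration: `copytree_merge` is a hypothetical helper name, not part of this PR or of `shutil`, and it does not treat symlinks specially.

```python
import os
import shutil


def copytree_merge(src, dst):
    """Recursively copy src into dst, merging with whatever dst already holds.

    Hypothetical helper: avoids shutil.copytree(dirs_exist_ok=True), which
    requires Python 3.8+. Files with the same relative path are overwritten;
    other files already present in dst are left untouched.
    """
    os.makedirs(dst, exist_ok=True)
    for name in os.listdir(src):
        src_path = os.path.join(src, name)
        dst_path = os.path.join(dst, name)
        if os.path.isdir(src_path):
            # Descend into subdirectories instead of failing when dst_path exists.
            copytree_merge(src_path, dst_path)
        else:
            shutil.copy2(src_path, dst_path)
```

Compared with the delete-then-copy variant, this keeps logs that an earlier test in the same file already archived, at the cost of possibly mixing old and new files.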

# Handle server.log separately (it does not start with log but is still a key log)
if os.path.exists("server.log") and not os.path.exists(os.path.join(dest_dir, "server.log")):
shutil.copy2("server.log", os.path.join(dest_dir, "server.log"))


@pytest.hookimpl(hookwrapper=True, trylast=True)
def pytest_runtest_makereport(item, call):
"""Archive logs after each test phase finishes (runs only after the call phase)"""
outcome = yield
report = outcome.get_result()

if report.when == "call":
# Use the test file name (without .py) as the archive directory name
test_file = os.path.basename(item.fspath)
Copilot AI Mar 26, 2026

The test file name is obtained via item.fspath here. item.fspath is deprecated in newer pytest versions (in favor of item.path / pathlib.Path), so a future pytest upgrade could remove the attribute and break the log-archiving hook. Consider switching to the path attribute pytest recommends (with a fallback for older versions if needed).

Suggested change
- test_file = os.path.basename(item.fspath)
+ # Prefer pytest's newer `item.path` API and fall back to `item.fspath` for older versions
+ test_path = getattr(item, "path", None)
+ if test_path is None:
+     test_path = getattr(item, "fspath", "")
+ test_file = os.path.basename(str(test_path))

test_name = os.path.splitext(test_file)[0]
try:
Comment on lines +495 to +498
Copilot AI Mar 25, 2026

The archive directory name currently uses only the test file name (without .py). If one file contains several test functions or parametrized cases, their logs will be repeatedly overwritten or mixed together, which makes problems harder to pinpoint. Consider using item.nodeid (with path separators and illegal characters replaced) or another finer-grained identifier that includes the test function name as the directory name.

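One possible way to derive a directory name from a node ID is sketched below. `nodeid_to_dirname` is a hypothetical helper and the set of allowed characters is an assumption; the example node ID uses a real file from this PR but an invented test function name and parameter.

```python
import re


def nodeid_to_dirname(nodeid):
    """Turn a pytest node ID into a name usable as a single directory component.

    Hypothetical helper: every run of characters outside [A-Za-z0-9._-]
    (path separators, "::", brackets, spaces, ...) collapses into one "_".
    """
    return re.sub(r"[^A-Za-z0-9._-]+", "_", nodeid).strip("_")


# "test_case[param-1]" is an invented test name used only for illustration.
print(nodeid_to_dirname("tests/xpu_ci/8cards_cases/test_pd_21b_ep4tp1.py::test_case[param-1]"))
```

In the hook this would replace the `test_name` derived from the file name, for example `_archive_case_logs(nodeid_to_dirname(item.nodeid))`, so that each test function and parameter set gets its own directory.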
_archive_case_logs(test_name)
Comment on lines +494 to +499
Copilot AI Mar 26, 2026

The hook currently archives logs after the call phase of every test case, whether it passed or failed, which can noticeably increase CI I/O and artifact size. If the goal is to debug failures, consider archiving only when report.failed (or report.outcome != "passed").

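A minimal sketch of that condition, kept separate from pytest so it can be run on its own. `should_archive` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins that only mimic the `when` and `failed` attributes of pytest's `TestReport`.

```python
from types import SimpleNamespace


def should_archive(report):
    """Return True only for a failed call phase.

    `report` is assumed to expose `when` and `failed`, as pytest's TestReport does.
    """
    return report.when == "call" and report.failed


# Stand-in reports for illustration; in the real hook, `report` comes from outcome.get_result().
passed = SimpleNamespace(when="call", failed=False)
failed = SimpleNamespace(when="call", failed=True)
setup_error = SimpleNamespace(when="setup", failed=True)

print(should_archive(passed), should_archive(failed), should_archive(setup_error))
```

As written, a failure in the setup phase (for instance a fixture that cannot start the server) is not archived. Whether to include setup failures is a design choice for the PR author, and dropping the `when == "call"` check would cover them.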
except Exception:
pass
Comment on lines +498 to +501
Copilot AI Mar 26, 2026

All exceptions are caught here and then discarded with pass, so an archiving failure is silently swallowed and its cause is hard to track down later. Consider at least printing a warning or the exception details (or recording them through pytest's terminal reporter or logging), and catching only the expected exception types.

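A sketch of what narrower, non-silent handling could look like, assuming the expected failures are filesystem errors. `archive_safely` and the logger name are hypothetical; `archive_fn` stands for `_archive_case_logs` from this PR.

```python
import logging

logger = logging.getLogger("xpu_ci.archive")  # hypothetical logger name


def archive_safely(archive_fn, test_name):
    """Call archive_fn(test_name) and log a warning on filesystem errors.

    Only OSError is caught (shutil.Error is a subclass of it), so unexpected
    bugs such as a TypeError still surface instead of being swallowed.
    Returns True on success, False if archiving failed.
    """
    try:
        archive_fn(test_name)
        return True
    except OSError as exc:
        logger.warning("Failed to archive logs for %s: %s", test_name, exc)
        return False
```

Archiving problems then stay non-fatal for the test run while still leaving a trace in the captured log output.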