Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
171 changes: 171 additions & 0 deletions .ci/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,171 @@
# .ci — CI 镜像与流水线

本目录管理 CI 所用的 Docker 镜像构建与测试流水线执行。

## 目录结构

```
.ci/
├── config.yaml # 统一配置(registry、镜像、job 定义)
├── build.py # 镜像构建脚本
├── run.py # CI 流水线执行脚本
├── README.md
└── images/
├── nvidia/Dockerfile # NVIDIA 平台镜像
└── ascend/Dockerfile # 昇腾平台镜像
```

## 前置依赖

- Docker
- Python 3.10+
- pyyaml (`pip install pyyaml`)

## 配置文件 `config.yaml`

```yaml
repo:
url: https://github.com/InfiniTensor/InfiniOps.git
branch: master

registry:
url: "" # Harbor 地址,本地开发时留空
project: infiniops
credentials_env: REGISTRY_TOKEN

images:
nvidia:
dockerfile: .ci/images/nvidia/
build_args:
BASE_IMAGE: nvcr.io/nvidia/pytorch:24.10-py3
ascend:
dockerfile: .ci/images/ascend/
build_args:
BASE_IMAGE: ascendhub.huawei.com/public-ascendhub/ascend-pytorch:24.0.0
private_sdk:
source: "${PRIVATE_SDK_URL}"

jobs:
nvidia_gpu:
image: stable # stable | latest | 具体 commit hash
platform: nvidia
resources:
gpu_ids: "0" # GPU 设备 ID,如 "0" "0,2" "all"
gpu_type: A100
memory: 32GB
timeout: 3600
setup: pip install .[dev]
stages:
- name: test
run: pytest tests/ -v --tb=short --junitxml=/workspace/test-results.xml
```

- **`registry.url`** 为空时镜像仅保存在本地,tag 格式为 `<project>-ci/<platform>:<tag>`。
- **`images.<platform>.build_args`** 会作为 `--build-arg` 传入 `docker build`。
- **`jobs.<name>.image`** 支持 `stable`、`latest` 或具体 commit hash。
- **`resources.gpu_ids`** 指定 GPU 设备 ID,支持 `"0"`、`"0,2"`、`"all"` 等格式,映射为 `docker run --gpus "device=..."`。也可保留 `gpu_count` 按数量分配。

## 镜像构建 `build.py`

```bash
python .ci/build.py [options]
```

| 参数 | 默认值 | 说明 |
|---|---|---|
| `--platform` | `all` | 构建平台:`nvidia`、`ascend` 或 `all` |
| `--commit` | `HEAD` | 用于镜像 tag 的 git ref |
| `--push` | — | 构建后推送到 registry |
| `--force` | — | 跳过变更检测,强制构建 |
| `--dry-run` | — | 仅打印命令,不执行 |
| `--config` | `.ci/config.yaml` | 配置文件路径 |

### 示例

```bash
# 构建 nvidia 镜像(自动检测 Dockerfile 变更,无变更则跳过)
python .ci/build.py --platform nvidia

# 强制构建
python .ci/build.py --platform nvidia --force

# 构建全部平台并推送到 registry
python .ci/build.py --push --force

# 预览实际执行的 docker 命令
python .ci/build.py --platform nvidia --force --dry-run
```

### 构建流程

1. 通过 `git diff HEAD~1` 检测 Dockerfile 目录是否有变更(`--force` 跳过此步)
2. `docker build` 构建镜像,同时打 `<commit-hash>` 和 `latest` 两个 tag
3. 自动透传宿主机的 `http_proxy`/`https_proxy`/`no_proxy` 到构建容器
4. 若指定 `--push`,将两个 tag 推送到 registry

### 产物

| Tag | 说明 |
|---|---|
| `infiniops-ci/<platform>:<commit-hash>` | 精确追溯到某次构建 |
| `infiniops-ci/<platform>:latest` | 最近一次构建 |

## 流水线执行 `run.py`

```bash
python .ci/run.py [options]
```

| 参数 | 默认值 | 说明 |
|---|---|---|
| `--job` | 配置中第一个 job | 要执行的 job 名称 |
| `--branch` | `config.yaml` 中的 `repo.branch` | 覆盖克隆分支 |
| `--stage` | 全部 | 仅运行指定 stage |
| `--image-tag` | job 中的 `image` 字段 | 覆盖镜像版本 |
| `--gpu-id` | config 中的 `gpu_ids` | GPU 设备 ID,如 `0`、`0,2`、`all` |
| `--dry-run` | — | 仅打印 docker 命令,不执行 |
| `--config` | `.ci/config.yaml` | 配置文件路径 |

### 示例

```bash
# 运行默认 job
python .ci/run.py

# 指定分支和镜像版本
python .ci/run.py --branch feature-xxx --image-tag latest

# 只用 GPU 0 运行
python .ci/run.py --gpu-id 0

# 用 GPU 0 和 2 运行
python .ci/run.py --gpu-id 0,2

# 使用全部 GPU
python .ci/run.py --gpu-id all

# 只跑 test stage
python .ci/run.py --stage test

# 预览 docker 命令
python .ci/run.py --dry-run
```

### 执行流程

1. 解析 job 配置,拉取对应镜像
2. `docker run` 启动容器(自动挂载 GPU、限制内存)
3. 容器内 `git clone` → `checkout` → 执行 `setup` 命令
4. 依次执行各 stage,汇总结果

## 代理配置

如果网络环境需要代理,在宿主机设置环境变量后即可:

```bash
export http_proxy=http://localhost:9991
export https_proxy=http://localhost:9991
```

- **`build.py`** 会自动透传代理到 `docker build`(通过 `--build-arg` + `--network host`)。
- **`run.py`** 使用 `--network host`,容器内可直接访问宿主机代理。
210 changes: 210 additions & 0 deletions .ci/build.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,210 @@
#!/usr/bin/env python3
"""CI image builder: detect changes, build, tag, and optionally push Docker images."""

import argparse
import json
import os
import subprocess
import sys
from pathlib import Path

try:
import yaml
except ImportError:
print(
"error: pyyaml is required. Install with: pip install pyyaml", file=sys.stderr
)
sys.exit(1)


def load_config(path):
with open(path, encoding="utf-8") as f:
return yaml.safe_load(f)


def get_git_commit(ref="HEAD"):
result = subprocess.run(
["git", "rev-parse", "--short", ref],
capture_output=True,
text=True,
)
if result.returncode != 0:
print(f"error: failed to get commit hash for `{ref}`", file=sys.stderr)
sys.exit(1)

return result.stdout.strip()


def has_dockerfile_changed(dockerfile_dir, base_ref="HEAD~1"):
"""Check if any file under `dockerfile_dir` changed since `base_ref`."""
result = subprocess.run(
["git", "diff", "--name-only", base_ref, "--", dockerfile_dir],
capture_output=True,
text=True,
)

return bool(result.stdout.strip())


def build_image_tag(registry_url, project, platform, tag):
if registry_url:
return f"{registry_url}/{project}/{platform}:{tag}"

return f"{project}-ci/{platform}:{tag}"


def build_image(platform, platform_cfg, registry_cfg, commit, push, dry_run):
"""Build a single platform image. Returns True on success."""
registry_url = registry_cfg.get("url", "")
project = registry_cfg.get("project", "infiniops")
dockerfile_dir = platform_cfg["dockerfile"]

commit_tag = build_image_tag(registry_url, project, platform, commit)
latest_tag = build_image_tag(registry_url, project, platform, "latest")

build_args_cfg = platform_cfg.get("build_args", {})
build_cmd = ["docker", "build", "--network", "host"]
for key, value in build_args_cfg.items():
build_cmd.extend(["--build-arg", f"{key}={value}"])

for proxy_var in ("http_proxy", "https_proxy", "no_proxy"):
proxy_val = os.environ.get(proxy_var) or os.environ.get(proxy_var.upper())
if proxy_val:
build_cmd.extend(["--build-arg", f"{proxy_var}={proxy_val}"])

private_sdk = platform_cfg.get("private_sdk", {})
if private_sdk:
sdk_url = private_sdk.get("source", "")
if sdk_url.startswith("${") and sdk_url.endswith("}"):
env_var = sdk_url[2:-1]
sdk_url = os.environ.get(env_var, "")
if sdk_url:
build_cmd.extend(["--build-arg", f"PRIVATE_SDK_URL={sdk_url}"])

build_cmd.extend(["-t", commit_tag, "-t", latest_tag, dockerfile_dir])

if dry_run:
print(f"[dry-run] {' '.join(build_cmd)}")
if push:
print(f"[dry-run] docker push {commit_tag}")
print(f"[dry-run] docker push {latest_tag}")

return True

print(f"==> building {platform}: {commit_tag}", file=sys.stderr)
result = subprocess.run(build_cmd)
if result.returncode != 0:
error = {
"stage": "build",
"platform": platform,
"tag": commit_tag,
"exit_code": result.returncode,
}
print(json.dumps(error), file=sys.stderr)

return False

if push:
for tag in (commit_tag, latest_tag):
print(f"==> pushing {tag}", file=sys.stderr)
push_result = subprocess.run(["docker", "push", tag])
if push_result.returncode != 0:
error = {
"stage": "push",
"platform": platform,
"tag": tag,
"exit_code": push_result.returncode,
}
print(json.dumps(error), file=sys.stderr)

return False

return True


def main():
parser = argparse.ArgumentParser(description="Build CI Docker images")
parser.add_argument(
"--platform",
type=str,
default="all",
help="Platform to build: nvidia, ascend, or all (default: all)",
)
parser.add_argument(
"--config",
type=Path,
default=Path(__file__).resolve().parent / "config.yaml",
help="Path to config.yaml",
)
parser.add_argument(
"--commit",
type=str,
default="HEAD",
help="Git ref for tagging the image (default: HEAD)",
)
parser.add_argument(
"--push",
action="store_true",
help="Push images to registry after building",
)
parser.add_argument(
"--force",
action="store_true",
help="Skip change detection and force build",
)
parser.add_argument(
"--dry-run",
action="store_true",
help="Print commands without executing",
)
args = parser.parse_args()

config = load_config(args.config)
registry_cfg = config.get("registry", {})
images_cfg = config.get("images", {})

if not images_cfg:
print("error: no `images` section in config", file=sys.stderr)
sys.exit(1)

if args.platform == "all":
platforms = list(images_cfg.keys())
else:
if args.platform not in images_cfg:
print(
f"error: platform `{args.platform}` not found in config",
file=sys.stderr,
)
sys.exit(1)
platforms = [args.platform]

commit = get_git_commit(args.commit)
failed = False

for platform in platforms:
platform_cfg = images_cfg[platform]
dockerfile_dir = platform_cfg["dockerfile"]

if not Path(dockerfile_dir).is_dir():
print(
f"warning: dockerfile directory `{dockerfile_dir}` does not exist, skipping {platform}",
file=sys.stderr,
)
continue

if not args.force and not has_dockerfile_changed(dockerfile_dir):
print(f"==> {platform}: no changes detected, skipping", file=sys.stderr)
continue

ok = build_image(
platform, platform_cfg, registry_cfg, commit, args.push, args.dry_run
)
if not ok:
failed = True

if failed:
sys.exit(1)


if __name__ == "__main__":
main()
Loading