
[atom-vllm-benchmark] Change AW execution logic from one server one job to one server multi jobs #1005

Open
junyyang-amd wants to merge 5 commits into ROCm:main from junyyang-amd:change_AW_execution_logic

@junyyang-amd
Contributor

Motivation

Change the AW execution logic from one server per job to one server shared by multiple jobs.

junyyang-amd and others added 2 commits June 1, 2026 14:23
Co-authored-by: root <root@hjbog-srdc-15.amd.com>
Co-authored-by: root <root@hjbog-srdc-15.amd.com>
@zejunchen-zejun
Collaborator

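  1. Reduced isolation (the main trade-off): if the resident server hits an OOM or crashes midway, every remaining case for that model fails. Previously each case had its own server, so cases could not affect one another. Suggestion: order the cases so that heavy workloads (large ISL, high concurrency) run last, so the server is not brought down early and the lightweight cases are not affected.
  2. vllm.log is shared by all cases: during diagnostics, cp copies the same server log to every RESULT_PREFIX-*.vllm.log, so the contents are identical. When tracking down one specific failing case, the log is not case-granular and can be misleading. Consider writing a separator marker to the log before and after each case.
  3. overall_status keeps only the exit code of the last failure: this is fine for pass/fail, but when several cases fail it reflects only the last code. When debugging, look at the ::warning:: lines instead.
  4. Wall-clock time of a single job grows significantly: one job now runs all cases of a model serially. Both the job and the client step use timeout-minutes: 240; if an AW model has many cases, confirm that 240 minutes is enough.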
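Suggestions 1 to 3 could fit together in a single driver loop, roughly as sketched below. This is a minimal illustration, not the PR's actual workflow, which is not shown in this conversation. `Case`, `load_key`, `run_cases`, the marker format, and the use of ISL times concurrency as a load proxy are all assumptions made for the example. It also assumes the server writes `vllm.log` in append mode, so markers written by the driver land between the server's own lines.

```python
import subprocess
from dataclasses import dataclass
from pathlib import Path
from typing import Callable


@dataclass
class Case:
    """One benchmark case (hypothetical structure for this sketch)."""
    name: str
    isl: int              # input sequence length
    concurrency: int
    command: list[str]    # client command to run against the shared server


def load_key(case: Case) -> tuple[int, int]:
    # Assumed load proxy: ISL * concurrency, with ISL as a tie-breaker.
    return (case.isl * case.concurrency, case.isl)


def _run_subprocess(command: list[str]) -> int:
    return subprocess.run(command).returncode


def run_cases(
    cases: list[Case],
    server_log: Path,
    run: Callable[[list[str]], int] = _run_subprocess,
) -> dict[str, int]:
    """Run cases lightest-first against one server; return all failures."""
    failures: dict[str, int] = {}
    for case in sorted(cases, key=load_key):
        # Marker before the case, so the shared log can be split per case.
        with server_log.open("a") as log:
            log.write(f"===== BEGIN CASE {case.name} =====\n")
        exit_code = run(case.command)
        with server_log.open("a") as log:
            log.write(f"===== END CASE {case.name} exit={exit_code} =====\n")
        if exit_code != 0:
            # Keep every failing exit code, not only the last one.
            failures[case.name] = exit_code
            print(f"::warning::case {case.name} failed with exit code {exit_code}")
    return failures
```

The job's overall status can then be derived from whether `failures` is empty, while the per-case exit codes remain available for the summary. The `run` parameter exists only so the loop can be exercised without launching real clients.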

…m#1004)

* [atom-vllm benchmark] refine model case name (ROCm#995)

Co-authored-by: root <root@hjbog-srdc-15.amd.com>

* Remove qkv 256 tok limitation

---------

Co-authored-by: junyyang-amd <junyyang@amd.com>
Co-authored-by: root <root@hjbog-srdc-15.amd.com>
@junyyang-amd
Contributor Author

Hi @zejunchen-zejun, I have made the changes you suggested; please review again.
The server/client timeout is now set to 360, which is the upper limit for timeout-minutes in GitHub Actions.

zejunchen-zejun previously approved these changes Jun 1, 2026
@junyyang-amd force-pushed the change_AW_execution_logic branch from 7e8fe16 to 04cb828 on June 1, 2026 09:45
3 participants