
vLLM multi-GPU inference issue #18

@BarryAlllen

Description

My command:

vllm serve TeleChat2-7B \
  --trust-remote-code \
  --max-model-len 2000 \
  --tensor-parallel-size 2 \
  --dtype float16 --port 10000

After running it, startup hangs at one step and the model never finishes loading:

INFO 11-27 02:16:22 api_server.py:495] vLLM API server version 0.6.1.post2
INFO 11-27 02:16:22 api_server.py:496] args: Namespace(model_tag='TeleChat2-7B', config='', host=None, port=10000,

......

INFO 11-27 02:16:26 custom_cache_manager.py:17] Setting Triton cache manager to: vllm.triton_utils.custom_cache_manager:CustomCacheManager
INFO 11-27 02:16:26 selector.py:217] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 11-27 02:16:26 selector.py:116] Using XFormers backend.
(VllmWorkerProcess pid=23694) INFO 11-27 02:16:26 selector.py:217] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
(VllmWorkerProcess pid=23694) INFO 11-27 02:16:26 selector.py:116] Using XFormers backend.
/opt/conda/envs/telechat/lib/python3.11/site-packages/xformers/ops/fmha/flash.py:211: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
  @torch.library.impl_abstract("xformers_flash::flash_fwd")
/opt/conda/envs/telechat/lib/python3.11/site-packages/xformers/ops/fmha/flash.py:344: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
  @torch.library.impl_abstract("xformers_flash::flash_bwd")
(VllmWorkerProcess pid=23694) /opt/conda/envs/telechat/lib/python3.11/site-packages/xformers/ops/fmha/flash.py:211: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
(VllmWorkerProcess pid=23694)   @torch.library.impl_abstract("xformers_flash::flash_fwd")
(VllmWorkerProcess pid=23694) /opt/conda/envs/telechat/lib/python3.11/site-packages/xformers/ops/fmha/flash.py:344: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
(VllmWorkerProcess pid=23694)   @torch.library.impl_abstract("xformers_flash::flash_bwd")
(VllmWorkerProcess pid=23694) INFO 11-27 02:16:27 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=23694) INFO 11-27 02:16:27 utils.py:981] Found nccl from library libnccl.so.2
INFO 11-27 02:16:27 utils.py:981] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=23694) INFO 11-27 02:16:27 pynccl.py:63] vLLM is using nccl==2.20.5
INFO 11-27 02:16:27 pynccl.py:63] vLLM is using nccl==2.20.5 <------- it stays stuck at this step and never proceeds

Each GPU has only loaded a little over 400 MB of memory.
[Screenshot: 微信截图_20241127102238]

I don't know what the problem is at the moment. I'd like to know whether any of my arguments are wrong, or whether there is some configuration I still need to change.

P.S. Single-GPU loading and inference works fine.
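(For context: a hang right after the `pynccl.py` "vLLM is using nccl==2.20.5" line during tensor-parallel startup usually points at NCCL inter-GPU communication rather than the `vllm serve` arguments themselves. The following is only a diagnostic sketch, not a confirmed fix: it re-runs the same command with standard NCCL debug logging enabled, peer-to-peer transfers disabled, and vLLM's custom all-reduce turned off to narrow down where the hang occurs. The GPU indices in `CUDA_VISIBLE_DEVICES` are assumptions about this setup.)

```bash
# Diagnostic re-run: verbose NCCL logging plus P2P disabled, to check whether
# the hang comes from GPU peer-to-peer communication (a common culprit on
# PCIe-only or virtualized machines). Adjust the GPU indices as needed.
export NCCL_DEBUG=INFO           # print NCCL init/transport details to the log
export NCCL_P2P_DISABLE=1        # assumption: force NCCL off P2P onto shared-memory/net transport
export CUDA_VISIBLE_DEVICES=0,1  # assumption: the two GPUs used for tensor parallelism

vllm serve TeleChat2-7B \
  --trust-remote-code \
  --max-model-len 2000 \
  --tensor-parallel-size 2 \
  --dtype float16 --port 10000 \
  --disable-custom-all-reduce    # skip vLLM's custom all-reduce kernel while debugging
```

If the server then starts, the NCCL transport is the likely issue; `nvidia-smi topo -m` can show how the two cards are connected.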
