@baominghelly

Description

Add a vLLM test script that benchmarks generation throughput across device groups and tensor-parallel sizes.
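For reviewers, here is a minimal sketch of the benchmark loop conceptually, assuming vLLM's `LLM`/`SamplingParams` API; the function name, the filler-token prompt construction, and the result dict layout are illustrative and may differ from the actual script. Physical GPUs are selected by setting `CUDA_VISIBLE_DEVICES` (e.g. `"5,6,7,8"`) before the engine is created, and throughput is reported as generated tokens divided by wall-clock time.

```python
import time
from vllm import LLM, SamplingParams

def run_benchmark(model="Qwen/Qwen1.5-7B-Chat", tp_size=4,
                  input_lens=(32, 64, 128), output_lens=(32, 64, 128, 256)):
    # Engine init; CUDA_VISIBLE_DEVICES should already point at the device group.
    llm = LLM(model=model, tensor_parallel_size=tp_size,
              trust_remote_code=True, max_model_len=32768)
    tokenizer = llm.get_tokenizer()
    results = []
    for n_in in input_lens:
        # Build a prompt of roughly n_in tokens by repeating a single filler token
        # (illustrative; the real script may construct prompts differently).
        prompt = tokenizer.decode(tokenizer.encode("hello")[:1] * n_in)
        for n_out in output_lens:
            params = SamplingParams(max_tokens=n_out, ignore_eos=True)
            start = time.perf_counter()
            out = llm.generate([prompt], params)[0]
            elapsed = time.perf_counter() - start
            n_gen = len(out.outputs[0].token_ids)
            results.append({
                "prompt_tokens": len(out.prompt_token_ids),
                "output_tokens": n_gen,
                "max_steps_requested": n_out,
                "total_time_s": f"{elapsed:.4f}",
                "tokens_per_sec": f"{n_gen / elapsed:.4f}",
                "tp_size": tp_size,
            })
    return results
```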

Test evidence

========================= Starting test for device group: "5,6,7,8" =========================
Configuration: Tensor Parallel Size = 4, physical GPU IDs in use = [5, 6, 7, 8]
Initializing vLLM engine...
INFO 07-24 16:49:50 [config.py:841] This model supports multiple tasks: {'embed', 'generate', 'classify', 'reward'}. Defaulting to 'generate'.
INFO 07-24 16:49:50 [config.py:1472] Using max model len 32768
INFO 07-24 16:49:50 [config.py:2285] Chunked prefill is enabled with max_num_batched_tokens=8192.
INFO 07-24 16:49:51 [core.py:526] Waiting for init message from front-end.
INFO 07-24 16:49:51 [core.py:69] Initializing a V1 LLM engine (v0.9.2) with config: model='Qwen/Qwen1.5-7B-Chat', speculative_config=None, tokenizer='Qwen/Qwen1.5-7B-Chat', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=4, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=Qwen/Qwen1.5-7B-Chat, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level":3,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"max_capture_size":512,"local_cache_dir":null}
WARNING 07-24 16:49:52 [multiproc_worker_utils.py:307] Reducing Torch parallelism from 64 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO 07-24 16:49:52 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0, 1, 2, 3], buffer_handle=(4, 16777216, 10, 'psm_aae1c6ff'), local_subscribe_addr='ipc:///tmp/56cc05a1-704e-4317-b761-c2a60e31f0fe', remote_subscribe_addr=None, remote_addr_ipv6=False)
(VllmWorker rank=0 pid=63702) INFO 07-24 16:49:53 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_776ccc6f'), local_subscribe_addr='ipc:///tmp/e5c547a6-3543-450f-941f-e8eadd2f57ca', remote_subscribe_addr=None, remote_addr_ipv6=False)
(VllmWorker rank=3 pid=63705) INFO 07-24 16:49:53 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_30f11e89'), local_subscribe_addr='ipc:///tmp/c00b6b16-6a09-413d-b558-0a178cf46f02', remote_subscribe_addr=None, remote_addr_ipv6=False)
(VllmWorker rank=2 pid=63704) INFO 07-24 16:49:53 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_c7dfe9c5'), local_subscribe_addr='ipc:///tmp/825e36ab-22ce-4e42-9038-db37c86453df', remote_subscribe_addr=None, remote_addr_ipv6=False)
(VllmWorker rank=1 pid=63703) INFO 07-24 16:49:53 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_2a426ce5'), local_subscribe_addr='ipc:///tmp/3b37f985-ddfe-40b8-a086-d8e3e9e6dd32', remote_subscribe_addr=None, remote_addr_ipv6=False)
(VllmWorker rank=1 pid=63703) INFO 07-24 16:49:54 [__init__.py:1152] Found nccl from library libnccl.so.2
(VllmWorker rank=1 pid=63703) INFO 07-24 16:49:54 [pynccl.py:70] vLLM is using nccl==2.26.2
(VllmWorker rank=3 pid=63705) INFO 07-24 16:49:54 [__init__.py:1152] Found nccl from library libnccl.so.2
(VllmWorker rank=3 pid=63705) INFO 07-24 16:49:54 [pynccl.py:70] vLLM is using nccl==2.26.2
(VllmWorker rank=2 pid=63704) INFO 07-24 16:49:54 [__init__.py:1152] Found nccl from library libnccl.so.2
(VllmWorker rank=2 pid=63704) INFO 07-24 16:49:54 [pynccl.py:70] vLLM is using nccl==2.26.2
(VllmWorker rank=0 pid=63702) INFO 07-24 16:49:54 [__init__.py:1152] Found nccl from library libnccl.so.2
(VllmWorker rank=0 pid=63702) INFO 07-24 16:49:54 [pynccl.py:70] vLLM is using nccl==2.26.2
(VllmWorker rank=0 pid=63702) WARNING 07-24 16:49:54 [custom_all_reduce.py:137] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorker rank=3 pid=63705) WARNING 07-24 16:49:54 [custom_all_reduce.py:137] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorker rank=1 pid=63703) WARNING 07-24 16:49:54 [custom_all_reduce.py:137] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorker rank=2 pid=63704) WARNING 07-24 16:49:54 [custom_all_reduce.py:137] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorker rank=0 pid=63702) INFO 07-24 16:49:54 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[1, 2, 3], buffer_handle=(3, 4194304, 6, 'psm_fe252df6'), local_subscribe_addr='ipc:///tmp/93ccc31d-7577-4fc3-940b-588a09623aa5', remote_subscribe_addr=None, remote_addr_ipv6=False)
(VllmWorker rank=1 pid=63703) INFO 07-24 16:49:54 [parallel_state.py:1076] rank 1 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 1, EP rank 1
(VllmWorker rank=2 pid=63704) INFO 07-24 16:49:54 [parallel_state.py:1076] rank 2 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 2, EP rank 2
(VllmWorker rank=0 pid=63702) INFO 07-24 16:49:54 [parallel_state.py:1076] rank 0 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
(VllmWorker rank=3 pid=63705) INFO 07-24 16:49:54 [parallel_state.py:1076] rank 3 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 3, EP rank 3
(VllmWorker rank=1 pid=63703) WARNING 07-24 16:49:54 [topk_topp_sampler.py:59] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
(VllmWorker rank=2 pid=63704) WARNING 07-24 16:49:54 [topk_topp_sampler.py:59] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
(VllmWorker rank=0 pid=63702) WARNING 07-24 16:49:54 [topk_topp_sampler.py:59] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
(VllmWorker rank=3 pid=63705) WARNING 07-24 16:49:54 [topk_topp_sampler.py:59] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
(VllmWorker rank=1 pid=63703) INFO 07-24 16:49:54 [gpu_model_runner.py:1770] Starting to load model Qwen/Qwen1.5-7B-Chat...
(VllmWorker rank=2 pid=63704) INFO 07-24 16:49:54 [gpu_model_runner.py:1770] Starting to load model Qwen/Qwen1.5-7B-Chat...
(VllmWorker rank=3 pid=63705) INFO 07-24 16:49:54 [gpu_model_runner.py:1770] Starting to load model Qwen/Qwen1.5-7B-Chat...
(VllmWorker rank=0 pid=63702) INFO 07-24 16:49:54 [gpu_model_runner.py:1770] Starting to load model Qwen/Qwen1.5-7B-Chat...
(VllmWorker rank=1 pid=63703) INFO 07-24 16:49:55 [gpu_model_runner.py:1775] Loading model from scratch...
(VllmWorker rank=2 pid=63704) INFO 07-24 16:49:55 [gpu_model_runner.py:1775] Loading model from scratch...
(VllmWorker rank=3 pid=63705) INFO 07-24 16:49:55 [gpu_model_runner.py:1775] Loading model from scratch...
(VllmWorker rank=0 pid=63702) INFO 07-24 16:49:55 [gpu_model_runner.py:1775] Loading model from scratch...
(VllmWorker rank=1 pid=63703) INFO 07-24 16:49:55 [cuda.py:284] Using Flash Attention backend on V1 engine.
(VllmWorker rank=3 pid=63705) INFO 07-24 16:49:55 [cuda.py:284] Using Flash Attention backend on V1 engine.
(VllmWorker rank=2 pid=63704) INFO 07-24 16:49:55 [cuda.py:284] Using Flash Attention backend on V1 engine.
(VllmWorker rank=0 pid=63702) INFO 07-24 16:49:55 [cuda.py:284] Using Flash Attention backend on V1 engine.
(VllmWorker rank=1 pid=63703) INFO 07-24 16:49:56 [weight_utils.py:292] Using model weights format ['*.safetensors']
(VllmWorker rank=2 pid=63704) INFO 07-24 16:49:56 [weight_utils.py:292] Using model weights format ['*.safetensors']
(VllmWorker rank=3 pid=63705) INFO 07-24 16:49:56 [weight_utils.py:292] Using model weights format ['*.safetensors']
(VllmWorker rank=0 pid=63702) INFO 07-24 16:49:56 [weight_utils.py:292] Using model weights format ['*.safetensors']
(VllmWorker rank=1 pid=63703) INFO 07-24 16:49:58 [default_loader.py:272] Loading weights took 1.08 seconds
Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
(VllmWorker rank=1 pid=63703) INFO 07-24 16:49:58 [gpu_model_runner.py:1801] Model loading took 3.6507 GiB and 2.870689 seconds
Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:00<00:00,  3.93it/s]
(VllmWorker rank=3 pid=63705) INFO 07-24 16:49:58 [default_loader.py:272] Loading weights took 1.05 seconds
Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:00<00:00,  4.07it/s]
Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:00<00:00,  4.24it/s]
(VllmWorker rank=3 pid=63705) INFO 07-24 16:49:59 [gpu_model_runner.py:1801] Model loading took 3.6507 GiB and 3.568055 seconds
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:00<00:00,  4.15it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:00<00:00,  4.14it/s]
(VllmWorker rank=0 pid=63702)
(VllmWorker rank=0 pid=63702) INFO 07-24 16:49:59 [default_loader.py:272] Loading weights took 0.99 seconds
(VllmWorker rank=0 pid=63702) INFO 07-24 16:49:59 [gpu_model_runner.py:1801] Model loading took 3.6507 GiB and 4.180786 seconds
(VllmWorker rank=2 pid=63704) INFO 07-24 16:50:00 [default_loader.py:272] Loading weights took 1.03 seconds
(VllmWorker rank=2 pid=63704) INFO 07-24 16:50:00 [gpu_model_runner.py:1801] Model loading took 3.6507 GiB and 5.015395 seconds
(VllmWorker rank=3 pid=63705) INFO 07-24 16:50:08 [backends.py:508] Using cache directory: /home/libaoming/.cache/vllm/torch_compile_cache/ba084e5b9a/rank_3_0/backbone for vLLM's torch.compile
(VllmWorker rank=3 pid=63705) INFO 07-24 16:50:08 [backends.py:519] Dynamo bytecode transform time: 7.76 s
(VllmWorker rank=2 pid=63704) INFO 07-24 16:50:08 [backends.py:508] Using cache directory: /home/libaoming/.cache/vllm/torch_compile_cache/ba084e5b9a/rank_2_0/backbone for vLLM's torch.compile
(VllmWorker rank=2 pid=63704) INFO 07-24 16:50:08 [backends.py:519] Dynamo bytecode transform time: 7.85 s
(VllmWorker rank=0 pid=63702) INFO 07-24 16:50:08 [backends.py:508] Using cache directory: /home/libaoming/.cache/vllm/torch_compile_cache/ba084e5b9a/rank_0_0/backbone for vLLM's torch.compile
(VllmWorker rank=0 pid=63702) INFO 07-24 16:50:08 [backends.py:519] Dynamo bytecode transform time: 7.90 s
(VllmWorker rank=1 pid=63703) INFO 07-24 16:50:08 [backends.py:508] Using cache directory: /home/libaoming/.cache/vllm/torch_compile_cache/ba084e5b9a/rank_1_0/backbone for vLLM's torch.compile
(VllmWorker rank=1 pid=63703) INFO 07-24 16:50:08 [backends.py:519] Dynamo bytecode transform time: 8.01 s
(VllmWorker rank=3 pid=63705) INFO 07-24 16:50:14 [backends.py:155] Directly load the compiled graph(s) for shape None from the cache, took 5.499 s
(VllmWorker rank=0 pid=63702) INFO 07-24 16:50:15 [backends.py:155] Directly load the compiled graph(s) for shape None from the cache, took 5.527 s
(VllmWorker rank=2 pid=63704) INFO 07-24 16:50:15 [backends.py:155] Directly load the compiled graph(s) for shape None from the cache, took 5.554 s
(VllmWorker rank=1 pid=63703) INFO 07-24 16:50:15 [backends.py:155] Directly load the compiled graph(s) for shape None from the cache, took 5.665 s
(VllmWorker rank=3 pid=63705) INFO 07-24 16:50:16 [monitor.py:34] torch.compile takes 7.76 s in total
(VllmWorker rank=2 pid=63704) INFO 07-24 16:50:16 [monitor.py:34] torch.compile takes 7.85 s in total
(VllmWorker rank=0 pid=63702) INFO 07-24 16:50:16 [monitor.py:34] torch.compile takes 7.90 s in total
(VllmWorker rank=1 pid=63703) INFO 07-24 16:50:16 [monitor.py:34] torch.compile takes 8.01 s in total
(VllmWorker rank=0 pid=63702) INFO 07-24 16:50:18 [gpu_worker.py:232] Available KV cache memory: 65.83 GiB
(VllmWorker rank=3 pid=63705) INFO 07-24 16:50:18 [gpu_worker.py:232] Available KV cache memory: 65.90 GiB
(VllmWorker rank=2 pid=63704) INFO 07-24 16:50:18 [gpu_worker.py:232] Available KV cache memory: 65.82 GiB
(VllmWorker rank=1 pid=63703) INFO 07-24 16:50:18 [gpu_worker.py:232] Available KV cache memory: 65.82 GiB
INFO 07-24 16:50:19 [kv_cache_utils.py:716] GPU KV cache size: 539,296 tokens
INFO 07-24 16:50:19 [kv_cache_utils.py:720] Maximum concurrency for 32,768 tokens per request: 16.46x
INFO 07-24 16:50:19 [kv_cache_utils.py:716] GPU KV cache size: 539,168 tokens
INFO 07-24 16:50:19 [kv_cache_utils.py:720] Maximum concurrency for 32,768 tokens per request: 16.45x
INFO 07-24 16:50:19 [kv_cache_utils.py:716] GPU KV cache size: 539,168 tokens
INFO 07-24 16:50:19 [kv_cache_utils.py:720] Maximum concurrency for 32,768 tokens per request: 16.45x
INFO 07-24 16:50:19 [kv_cache_utils.py:716] GPU KV cache size: 539,808 tokens
INFO 07-24 16:50:19 [kv_cache_utils.py:720] Maximum concurrency for 32,768 tokens per request: 16.47x
Capturing CUDA graph shapes: 100%|██████████| 67/67 [00:23<00:00,  2.86it/s]
(VllmWorker rank=0 pid=63702) INFO 07-24 16:50:42 [gpu_model_runner.py:2326] Graph capturing finished in 23 secs, took 0.68 GiB
(VllmWorker rank=3 pid=63705) INFO 07-24 16:50:42 [gpu_model_runner.py:2326] Graph capturing finished in 23 secs, took 0.68 GiB
(VllmWorker rank=1 pid=63703) INFO 07-24 16:50:42 [gpu_model_runner.py:2326] Graph capturing finished in 23 secs, took 0.68 GiB
(VllmWorker rank=2 pid=63704) INFO 07-24 16:50:42 [gpu_model_runner.py:2326] Graph capturing finished in 23 secs, took 0.68 GiB
INFO 07-24 16:50:42 [core.py:172] init engine (profile, create kv cache, warmup model) took 42.22 seconds
vLLM engine initialization complete.

---------- Testing: Input_Tokens: 32, Max_Output_Tokens: 32 ----------
Performance data extracted successfully:
{
  "prompt_tokens": 32,
  "output_tokens": 32,
  "max_steps_requested": 32,
  "total_time_s": "0.2365",
  "tokens_per_sec": "135.3185",
  "device_ids": "5,6,7,8",
  "tp_size": 4
}
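(For reference, tokens_per_sec is consistent with output_tokens divided by total wall-clock time: 32 / 0.2365 s ≈ 135.3 tok/s, matching the figure above.)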

---------- Testing: Input_Tokens: 32, Max_Output_Tokens: 64 ----------
Performance data extracted successfully:
{
  "prompt_tokens": 32,
  "output_tokens": 64,
  "max_steps_requested": 64,
  "total_time_s": "0.4307",
  "tokens_per_sec": "148.5950",
  "device_ids": "5,6,7,8",
  "tp_size": 4
}

---------- Testing: Input_Tokens: 32, Max_Output_Tokens: 128 ----------
Performance data extracted successfully:
{
  "prompt_tokens": 32,
  "output_tokens": 128,
  "max_steps_requested": 128,
  "total_time_s": "0.8598",
  "tokens_per_sec": "148.8731",
  "device_ids": "5,6,7,8",
  "tp_size": 4
}

---------- Testing: Input_Tokens: 32, Max_Output_Tokens: 256 ----------
Performance data extracted successfully:
{
  "prompt_tokens": 32,
  "output_tokens": 256,
  "max_steps_requested": 256,
  "total_time_s": "1.7156",
  "tokens_per_sec": "149.2194",
  "device_ids": "5,6,7,8",
  "tp_size": 4
}

---------- Testing: Input_Tokens: 64, Max_Output_Tokens: 32 ----------
Performance data extracted successfully:
{
  "prompt_tokens": 64,
  "output_tokens": 32,
  "max_steps_requested": 32,
  "total_time_s": "0.2254",
  "tokens_per_sec": "141.9831",
  "device_ids": "5,6,7,8",
  "tp_size": 4
}

---------- Testing: Input_Tokens: 64, Max_Output_Tokens: 64 ----------
Performance data extracted successfully:
{
  "prompt_tokens": 64,
  "output_tokens": 64,
  "max_steps_requested": 64,
  "total_time_s": "0.4310",
  "tokens_per_sec": "148.4888",
  "device_ids": "5,6,7,8",
  "tp_size": 4
}

---------- Testing: Input_Tokens: 64, Max_Output_Tokens: 128 ----------
Performance data extracted successfully:
{
  "prompt_tokens": 64,
  "output_tokens": 128,
  "max_steps_requested": 128,
  "total_time_s": "0.8588",
  "tokens_per_sec": "149.0535",
  "device_ids": "5,6,7,8",
  "tp_size": 4
}

---------- Testing: Input_Tokens: 64, Max_Output_Tokens: 256 ----------
Performance data extracted successfully:
{
  "prompt_tokens": 64,
  "output_tokens": 256,
  "max_steps_requested": 256,
  "total_time_s": "1.7198",
  "tokens_per_sec": "148.8587",
  "device_ids": "5,6,7,8",
  "tp_size": 4
}

---------- Testing: Input_Tokens: 128, Max_Output_Tokens: 32 ----------
Performance data extracted successfully:
{
  "prompt_tokens": 128,
  "output_tokens": 32,
  "max_steps_requested": 32,
  "total_time_s": "0.2314",
  "tokens_per_sec": "138.3145",
  "device_ids": "5,6,7,8",
  "tp_size": 4
}

---------- Testing: Input_Tokens: 128, Max_Output_Tokens: 64 ----------
Performance data extracted successfully:
{
  "prompt_tokens": 128,
  "output_tokens": 64,
  "max_steps_requested": 64,
  "total_time_s": "0.4380",
  "tokens_per_sec": "146.1052",
  "device_ids": "5,6,7,8",
  "tp_size": 4
}

---------- Testing: Input_Tokens: 128, Max_Output_Tokens: 128 ----------
Performance data extracted successfully:
{
  "prompt_tokens": 128,
  "output_tokens": 128,
  "max_steps_requested": 128,
  "total_time_s": "0.8693",
  "tokens_per_sec": "147.2475",
  "device_ids": "5,6,7,8",
  "tp_size": 4
}

---------- Testing: Input_Tokens: 128, Max_Output_Tokens: 256 ----------
Performance data extracted successfully:
{
  "prompt_tokens": 128,
  "output_tokens": 256,
  "max_steps_requested": 256,
  "total_time_s": "1.7418",
  "tokens_per_sec": "146.9708",
  "device_ids": "5,6,7,8",
  "tp_size": 4
}

Device group "5,6,7,8" test complete; releasing resources...
Resources released.


============================== Summary of All Test Results ==============================
device_ids  tp_size  prompt_tokens  output_tokens  tokens_per_sec  total_time_s  max_steps_requested
         5        1             32             32         86.2150        0.3712                   32
         5        1             32             64         88.5342        0.7229                   64
         5        1             32            128         88.5935        1.4448                  128
         5        1             32            256         88.2604        2.9005                  256
         5        1             64             32         87.9751        0.3637                   32
         5        1             64             64         88.5647        0.7226                   64
         5        1             64            128         88.4177        1.4477                  128
         5        1             64            256         88.1409        2.9044                  256
         5        1            128             32         87.5079        0.3657                   32
         5        1            128             64         87.9410        0.7278                   64
         5        1            128            128         88.1341        1.4523                  128
         5        1            128            256         87.9278        2.9115                  256
       5,6        2             32             32        132.0248        0.2424                   32
       5,6        2             32             64        137.3694        0.4659                   64
       5,6        2             32            128        137.6692        0.9298                  128
       5,6        2             32            256        137.0516        1.8679                  256
       5,6        2             64             32        135.7837        0.2357                   32
       5,6        2             64             64        137.7889        0.4645                   64
       5,6        2             64            128        137.3587        0.9319                  128
       5,6        2             64            256        136.5429        1.8749                  256
       5,6        2            128             32        132.6739        0.2412                   32
       5,6        2            128             64        135.9552        0.4707                   64
       5,6        2            128            128        136.6829        0.9365                  128
       5,6        2            128            256        119.6714        2.1392                  256
   5,6,7,8        4             32             32        135.3185        0.2365                   32
   5,6,7,8        4             32             64        148.5950        0.4307                   64
   5,6,7,8        4             32            128        148.8731        0.8598                  128
   5,6,7,8        4             32            256        149.2194        1.7156                  256
   5,6,7,8        4             64             32        141.9831        0.2254                   32
   5,6,7,8        4             64             64        148.4888        0.4310                   64
   5,6,7,8        4             64            128        149.0535        0.8588                  128
   5,6,7,8        4             64            256        148.8587        1.7198                  256
   5,6,7,8        4            128             32        138.3145        0.2314                   32
   5,6,7,8        4            128             64        146.1052        0.4380                   64
   5,6,7,8        4            128            128        147.2475        0.8693                  128
   5,6,7,8        4            128            256        146.9708        1.7418                  256

Test results saved to: vllm_benchmark_results_multi_group.csv

Performance chart saved successfully to: vllm_benchmark_plot.png
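As a side note, the summary CSV can be re-plotted offline along these lines; a rough sketch using pandas/matplotlib, where the column names come from the summary table above and the output filename is made up here so as not to clash with the script's own plot:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Load the benchmark summary produced by the test script.
df = pd.read_csv("vllm_benchmark_results_multi_group.csv")

# One throughput curve per (device group, TP size), averaged over prompt lengths.
for (dev, tp), grp in df.groupby(["device_ids", "tp_size"]):
    mean_tps = grp.groupby("output_tokens")["tokens_per_sec"].mean()
    plt.plot(mean_tps.index, mean_tps.values, marker="o", label=f"GPUs {dev} (TP={tp})")

plt.xlabel("max output tokens")
plt.ylabel("tokens / s")
plt.legend()
plt.savefig("vllm_benchmark_replot.png")  # hypothetical filename for the offline re-plot
```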

baominghelly self-assigned this on Jul 28, 2025