Skip to content

[Bug] lmdeploy_dlinfer/ascend API Server流式输出问题 #4209

@pengjiang80

Description

@pengjiang80

Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.
  • 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.

Describe the bug

使用lmdeploy_dlinfer/ascend:a2-latest image在910B上运行api server时,不支持stream output. lmdeploy serve api_server --help也没有看到是否启用流式的相关设置参数。

Reproduction

按照文档方式在NPU使用容器启动api_server

docker run -it --net=host crpi-4crprmm5baj1v8iv.cn-hangzhou.personal.cr.aliyuncs.com/lmdeploy_dlinfer/ascend:a2-latest \
bash -i -c "lmdeploy serve api_server --backend pytorch --device ascend qwen/qwen3-0.6b --server-port 40001  --model-name qwen3-0.6b"

使用以下命令进行测试

curl -N -X POST http://localhost:40001/v1/chat/completions   \
       -H "Content-Type: application/json"   \
       -d '{ 
              "model": "qwen3-0.6b",
              "messages": [{"role": "user", "content": "介绍一下你自己"}],
              "stream": true
           }'

尽管设置了 stream=true,内容仍然是生成完成后一次性返回,而非流式输出。

相同测试在vllm或者openmmlab/lmdeploy:v0.10.2-cu12 image下正常

Environment

root@HW-Ascend-1723723:/# lmdeploy check_env
sys.platform: linux
Python: 3.11.13 (main, Nov 20 2025, 16:02:27) [GCC 11.4.0]
CUDA available: False
MUSA available: False
numpy_random_seed: 2147483648
GCC: gcc (Ubuntu 11.4.0-1ubuntu1~22.04.2) 11.4.0
PyTorch: 2.8.0+cpu
PyTorch compiling details: PyTorch built with:
  - GCC 13.3
  - C++ Version: 201703
  - Intel(R) MKL-DNN v3.7.1 (Git Hash 8d263e693366ef8db40acc569cc7d8edf644556d)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: DEFAULT
  - Build settings: BLAS_INFO=open, BUILD_TYPE=Release, COMMIT_SHA=a1cb3cc05d46d198467bebbb6e8fba50a325d4e7, CXX_COMPILER=/opt/rh/gcc-toolset-13/root/usr/bin/c++, CXX_FLAGS=-ffunction-sections -fdata-sections -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOCUPTI -DLIBKINETO_NOROCTRACER -DLIBKINETO_NOXPUPTI=ON -DUSE_PYTORCH_QNNPACK -DAT_BUILD_ARM_VEC256_WITH_SLEEF -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -DC10_NODEPRECATED -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=range-loop-construct -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-unknown-pragmas -Wno-unused-parameter -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=old-style-cast -faligned-new -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-dangling-reference -Wno-error=dangling-reference -Wno-stringop-overflow, LAPACK_INFO=open, TORCH_VERSION=2.8.0, USE_CUDA=OFF, USE_CUDNN=OFF, USE_CUSPARSELT=OFF, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_GLOO=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=OFF, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, USE_ROCM_KERNEL_ASSERT=OFF, USE_XCCL=OFF, USE_XPU=OFF, 

TorchVision: 0.23.0
LMDeploy: 0.11.0+
transformers: 4.57.3
fastapi: 0.123.8
pydantic: 2.12.5
triton: Not Found

Error traceback

Metadata

Metadata

Assignees

Labels

No labels
No labels

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions