99 changes: 96 additions & 3 deletions demos/continuous_batching/README.md
@@ -16,7 +16,7 @@ ovms_demos_continuous_batching_accuracy
```

This demo shows how to deploy LLM models in the OpenVINO Model Server using continuous batching and paged attention algorithms.
Text generation use case is exposed via the OpenAI API `chat/completions`, `completions` and `responses` endpoints.
That makes it easy to use and efficient, especially on Intel® Xeon® processors and Intel® Arc™ GPUs.

> **Note:** This demo was tested on 4th - 6th generation Intel® Xeon® Scalable Processors and Intel® Core Ultra Series processors on Ubuntu 24 and Windows 11.
@@ -72,7 +72,7 @@ curl http://localhost:8000/v3/models

## Request Generation

The model exposes the `chat/completions`, `completions` and `responses` endpoints, with and without stream capabilities.
The chat endpoint is expected to be used for scenarios where the conversation context is passed by the client and the model prompt is created by the server based on the jinja model template.
The completions endpoint should be used to pass the prompt directly from the client and for models without a jinja template.
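
As a quick illustration of the difference between the two endpoints, here is a minimal sketch using the OpenAI Python client (assuming the model below is already deployed and `pip3 install openai` has been run):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v3", api_key="unused")

# chat/completions: the client sends structured messages and the server
# renders the final prompt from the model's jinja chat template.
chat = client.chat.completions.create(
    model="Qwen/Qwen3-30B-A3B-Instruct-2507",
    messages=[{"role": "user", "content": "What is OpenVINO?"}],
    max_tokens=30,
)
print(chat.choices[0].message.content)

# completions: the client passes the raw prompt itself; no template is applied.
completion = client.completions.create(
    model="Qwen/Qwen3-30B-A3B-Instruct-2507",
    prompt="What is OpenVINO?",
    max_tokens=30,
)
print(completion.choices[0].text)
```

Demonstrated here is the model `Qwen/Qwen3-30B-A3B-Instruct-2507` in int4 precision. It has chat capability, so the `chat/completions` endpoint will be employed: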

@@ -147,9 +147,76 @@ curl -s http://localhost:8000/v3/chat/completions -H "Content-Type: application/
:::


### Unary calls via Responses API using cURL

::::{tab-set}

:::{tab-item} Linux
```bash
curl http://localhost:8000/v3/responses \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3-8B-Instruct",
    "max_output_tokens": 30,
    "input": "What is OpenVINO?"
  }' | jq .
```
:::

:::{tab-item} Windows
Windows PowerShell
```powershell
(Invoke-WebRequest -Uri "http://localhost:8000/v3/responses" `
-Method POST `
-Headers @{ "Content-Type" = "application/json" } `
-Body '{"model": "meta-llama/Meta-Llama-3-8B-Instruct", "max_output_tokens": 30, "input": "What is OpenVINO?"}').Content
```

Windows Command Prompt
```bat
curl -s http://localhost:8000/v3/responses -H "Content-Type: application/json" -d "{\"model\": \"meta-llama/Meta-Llama-3-8B-Instruct\", \"max_output_tokens\": 30, \"input\": \"What is OpenVINO?\"}"
```
:::

::::

:::{dropdown} Expected Response
```json
{
  "id": "resp-1724405400",
  "object": "response",
  "created_at": 1724405400,
  "model": "meta-llama/Meta-Llama-3-8B-Instruct",
  "status": "completed",
  "output": [
    {
      "id": "msg-0",
      "type": "message",
      "role": "assistant",
      "status": "completed",
      "content": [
        {
          "type": "output_text",
          "text": "OpenVINO is an open-source software framework developed by Intel for optimizing and deploying computer vision, machine learning, and deep learning models on various devices,",
          "annotations": []
        }
      ]
    }
  ],
  "usage": {
    "input_tokens": 27,
    "input_tokens_details": { "cached_tokens": 0 },
    "output_tokens": 30,
    "output_tokens_details": { "reasoning_tokens": 0 },
    "total_tokens": 57
  }
}
```
:::

### OpenAI Python package

The `chat/completions`, `completions` and `responses` endpoints are compatible with the OpenAI client, so it can easily be used to generate text, including in streaming mode:

Install the client library:
```console
@@ -261,6 +328,31 @@ So, **6 = 3**.
```
:::

:::{tab-item} Responses
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v3",
    api_key="unused"
)

stream = client.responses.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    input="Say this is a test",
    stream=True,
)
for event in stream:
    if event.type == "response.output_text.delta":
        print(event.delta, end="", flush=True)
```

Output:
```
It looks like you're testing me!
```
:::

::::
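
For unary (non-streaming) calls, the same client can be used without the `stream` flag. A minimal sketch, assuming the same deployment as above; `output_text` is the OpenAI client helper that concatenates the text parts of the response:

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v3",
    api_key="unused"
)

# Without stream=True, the complete response object is returned at once.
response = client.responses.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    input="What is OpenVINO?",
    max_output_tokens=30,
)
print(response.output_text)
```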

## Check how to use AI agents with MCP servers and language models
@@ -299,5 +391,6 @@ Check the [guide of using lm-evaluation-harness](./accuracy/README.md)
- [Official OpenVINO LLM models in HuggingFace](https://huggingface.co/collections/OpenVINO/llm)
- [Chat Completions API](../../docs/model_server_rest_api_chat.md)
- [Completions API](../../docs/model_server_rest_api_completions.md)
- [Responses API](../../docs/model_server_rest_api_responses.md)
- [Writing client code](../../docs/clients_genai.md)
- [LLM calculator reference](../../docs/llm/reference.md)
96 changes: 94 additions & 2 deletions demos/continuous_batching/accuracy/gorilla.patch
@@ -1,8 +1,15 @@
diff --git a/berkeley-function-call-leaderboard/bfcl_eval/constants/model_config.py b/berkeley-function-call-leaderboard/bfcl_eval/constants/model_config.py
index bb625d2..64c01de 100644
--- a/berkeley-function-call-leaderboard/bfcl_eval/constants/model_config.py
+++ b/berkeley-function-call-leaderboard/bfcl_eval/constants/model_config.py
@@ -24,6 +24,6 @@ from bfcl_eval.model_handler.api_inference.openai_completion import (
OpenAICompletionsHandler,
)
from bfcl_eval.model_handler.api_inference.openai_response import OpenAIResponsesHandler
from bfcl_eval.model_handler.api_inference.qwen import (
QwenAgentNoThinkHandler,
QwenAgentThinkHandler,
@@ -2153,6 +2154,42 @@ third_party_inference_model_map = {
is_fc_model=True,
underscore_to_dot=True,
),
@@ -29,6 +36,18 @@ index bb625d2..7204adb 100644
+ output_price=None,
+ is_fc_model=True,
+ underscore_to_dot=True,
+ ),
+ "ovms-model-responses": ModelConfig(
+ model_name="ovms-model-responses",
+ display_name="ovms-model-responses",
+ url="http://localhost:8000/v3",
+ org="ovms",
+ license="apache-2.0",
+ model_handler=OpenAIResponsesHandler,
+ input_price=None,
+ output_price=None,
+ is_fc_model=True,
+ underscore_to_dot=True,
+ ),
}

@@ -60,6 +79,79 @@ index 357584f..e45e12c 100644
"store": False,
}

diff --git a/berkeley-function-call-leaderboard/bfcl_eval/model_handler/api_inference/openai_response.py b/berkeley-function-call-leaderboard/bfcl_eval/model_handler/api_inference/openai_response.py
index 0953fdd..7f6919f 100644
--- a/berkeley-function-call-leaderboard/bfcl_eval/model_handler/api_inference/openai_response.py
+++ b/berkeley-function-call-leaderboard/bfcl_eval/model_handler/api_inference/openai_response.py
@@ -38,10 +38,10 @@ class OpenAIResponsesHandler(BaseHandler):

kwargs = {}

- if api_key := os.getenv("OPENAI_API_KEY"):
+ if api_key := os.getenv("OPENAI_API_KEY","unused"):
kwargs["api_key"] = api_key

- if base_url := os.getenv("OPENAI_BASE_URL"):
+ if base_url := os.getenv("OPENAI_BASE_URL","http://localhost:8000/v3"):
kwargs["base_url"] = base_url

if headers_env := os.getenv("OPENAI_DEFAULT_HEADERS"):
@@ -99,25 +99,12 @@ class OpenAIResponsesHandler(BaseHandler):
kwargs = {
"input": message,
"model": self.model_name,
- "store": False,
- "include": ["reasoning.encrypted_content"],
- "reasoning": {"summary": "auto"},
"temperature": self.temperature,
+ "max_output_tokens": 2048,
+ "tool_choice": os.getenv("TOOL_CHOICE", "auto"),
+ "extra_body": {"chat_template_kwargs": json.loads(os.getenv("CHAT_TEMPLATE_KWARGS", "{}"))},
}

- # OpenAI reasoning models don't support temperature parameter
- if (
- "o3" in self.model_name
- or "o4-mini" in self.model_name
- or "gpt-5" in self.model_name
- ):
- del kwargs["temperature"]
-
- # Non-reasoning models don't support reasoning parameter
- else:
- del kwargs["reasoning"]
- del kwargs["include"]
-
if len(tools) > 0:
kwargs["tools"] = tools

@@ -218,25 +205,10 @@ class OpenAIResponsesHandler(BaseHandler):
kwargs = {
"input": inference_data["message"],
"model": self.model_name,
- "store": False,
- "include": ["reasoning.encrypted_content"],
- "reasoning": {"summary": "auto"},
"temperature": self.temperature,
+ "extra_body": {"chat_template_kwargs": json.loads(os.getenv("CHAT_TEMPLATE_KWARGS", "{}"))},
}

- # OpenAI reasoning models don't support temperature parameter
- if (
- "o3" in self.model_name
- or "o4-mini" in self.model_name
- or "gpt-5" in self.model_name
- ):
- del kwargs["temperature"]
-
- # Non-reasoning models don't support reasoning parameter
- else:
- del kwargs["reasoning"]
- del kwargs["include"]
-
return self.generate_with_backoff(**kwargs)

def _pre_query_processing_prompting(self, test_entry: dict) -> dict:
diff --git a/berkeley-function-call-leaderboard/bfcl_eval/model_handler/api_inference/qwen.py b/berkeley-function-call-leaderboard/bfcl_eval/model_handler/api_inference/qwen.py
index 10f1a08..50890c7 100644
--- a/berkeley-function-call-leaderboard/bfcl_eval/model_handler/api_inference/qwen.py
108 changes: 105 additions & 3 deletions demos/continuous_batching/vlm/README.md
@@ -9,7 +9,7 @@ ovms_demos_vlm_npu
```

This demo shows how to deploy Vision Language Models in the OpenVINO Model Server.
Text generation use case is exposed via the OpenAI API `chat/completions` and `responses` endpoints.

> **Note:** This demo was tested on 4th - 6th generation Intel® Xeon® Scalable Processors, Intel® Arc™ GPU Series and Intel® Core Ultra Series processors on Ubuntu 24, RedHat 9 and Windows 11.

@@ -119,6 +119,65 @@ curl http://localhost:8000/v3/chat/completions -H "Content-Type: application/js
```
:::

:::{dropdown} **Unary call with cURL using Responses API**
**Note**: Using URLs in requests requires the `--allowed_media_domains` parameter, described [here](../../../docs/parameters.md).

```bash
curl http://localhost:8000/v3/responses \
  -H "Content-Type: application/json" \
  -d '{
    "model": "OpenGVLab/InternVL2-2B",
    "input": [
      {
        "role": "user",
        "content": [
          {
            "type": "input_text",
            "text": "Describe what is on the picture."
          },
          {
            "type": "input_image",
            "image_url": "https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/releases/2025/3/demos/common/static/images/zebra.jpeg"
          }
        ]
      }
    ],
    "max_output_tokens": 100
  }'
```
```json
{
  "id": "resp-1741731554",
  "object": "response",
  "created_at": 1741731554,
  "model": "OpenGVLab/InternVL2-2B",
  "status": "completed",
  "output": [
    {
      "id": "msg-0",
      "type": "message",
      "role": "assistant",
      "status": "completed",
      "content": [
        {
          "type": "output_text",
          "text": "The picture features a zebra standing in a grassy plain. Zebras are known for their distinctive black and white striped patterns, which help them blend in for camouflage purposes.",
          "annotations": []
        }
      ]
    }
  ],
  "usage": {
    "input_tokens": 19,
    "input_tokens_details": { "cached_tokens": 0 },
    "output_tokens": 83,
    "output_tokens_details": { "reasoning_tokens": 0 },
    "total_tokens": 102
  }
}
```
:::

:::{dropdown} **Unary call with Python requests library**

```console
@@ -177,9 +236,9 @@ print(response.text)
}
```
:::
:::{dropdown} **Streaming request with OpenAI client using chat/completions**

The `chat/completions` and `responses` endpoints are compatible with the OpenAI client, so it can easily be used to generate text, including in streaming mode:

Install the client library:
```console
@@ -223,6 +282,48 @@ The picture features a zebra standing in a grassy area. The zebra is characteriz

:::

:::{dropdown} **Streaming request with OpenAI client via Responses API**

```console
pip3 install openai
```
```python
from openai import OpenAI
import base64

base_url = 'http://localhost:8000/v3'
model_name = "OpenGVLab/InternVL2-2B"

client = OpenAI(api_key='unused', base_url=base_url)

def convert_image(path):
    # Encode the image file as base64 so it can be embedded in the request payload.
    with open(path, 'rb') as file:
        return base64.b64encode(file.read()).decode("utf-8")

stream = client.responses.create(
    model=model_name,
    input=[
        {
            "role": "user",
            "content": [
                {"type": "input_text", "text": "Describe what is on the picture."},
                {"type": "input_image", "image_url": f"data:image/jpeg;base64,{convert_image('zebra.jpeg')}"}
            ]
        }
    ],
    stream=True,
)
for event in stream:
    if event.type == "response.output_text.delta":
        print(event.delta, end="", flush=True)
```

Output:
```
The picture features a zebra standing in a grassy area. The zebra is characterized by its distinctive black and white striped pattern, which covers its entire body, including its legs, neck, and head. Zebras have small, rounded ears and a long, flowing tail. The background appears to be a natural grassy habitat, typical of a savanna or plain.
```

:::
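
:::{dropdown} **Unary call with OpenAI client via Responses API**

A minimal non-streaming sketch with the same client, assuming the model above is deployed and a local `zebra.jpeg` is available; `output_text` is the OpenAI client helper that concatenates the text parts of the response:

```python
from openai import OpenAI
import base64

client = OpenAI(api_key='unused', base_url='http://localhost:8000/v3')

# Encode the local image as base64 so it can be embedded in the request payload.
with open('zebra.jpeg', 'rb') as file:
    image_b64 = base64.b64encode(file.read()).decode("utf-8")

response = client.responses.create(
    model="OpenGVLab/InternVL2-2B",
    input=[
        {
            "role": "user",
            "content": [
                {"type": "input_text", "text": "Describe what is on the picture."},
                {"type": "input_image", "image_url": f"data:image/jpeg;base64,{image_b64}"}
            ]
        }
    ],
    max_output_tokens=100,
)
print(response.output_text)
```
:::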

## Testing the model accuracy over serving API

@@ -237,5 +338,6 @@ Check [VLM usage with NPU acceleration](../../vlm_npu/README.md)
- [Export models to OpenVINO format](../common/export_models/README.md)
- [Supported VLM models](https://openvinotoolkit.github.io/openvino.genai/docs/supported-models/#visual-language-models-vlms)
- [Chat Completions API](../../../docs/model_server_rest_api_chat.md)
- [Responses API](../../../docs/model_server_rest_api_responses.md)
- [Writing client code](../../../docs/clients_genai.md)
- [LLM calculator reference](../../../docs/llm/reference.md)