6 changes: 3 additions & 3 deletions docs/model_server_rest_api_chat.md
@@ -235,11 +235,12 @@ Some parameters, especially related to sampling (like `temperature`, `top_p` etc
|-------|----------|----------|----------|---------|-----|
| temperature | ✅ | ✅ | ✅ | float (default: `1.0`) | The value is used to modulate token probabilities for multinomial sampling. It enables multinomial sampling when set to `> 0.0`. |
| top_p | ✅ | ✅ | ✅ | float (default: `1.0`) | Controls the cumulative probability of the top tokens to consider. Must be in (0, 1]. Set to 1 to consider all tokens. |
| top_k | ✅ | ❌ | ✅ | int (default: all tokens) | Controls the number of top tokens to consider. Set to empty or -1 to consider all tokens. |
| min_p | ✅ | ❌ | ✅ | float (default: `0.0`) | Minimum probability threshold relative to the most likely token. Tokens with probability below `min_p` × the top token probability are filtered out. `0.0` (default) disables the filter. Typical values: `0.05`–`0.1`. Must be in `[0.0, 1.0)`. |
| top_k | ✅ | ❌ | ✅ | int (default: `40`) | Controls the number of top tokens to consider. Set to empty or -1 to consider all tokens. |
| repetition_penalty | ✅ | ❌ | ✅ | float (default: `1.0`) | Penalizes new tokens based on whether they appear in the prompt and the generated text so far. Values > `1.0` encourage the model to use new tokens, while values < `1.0` encourage the model to repeat tokens. `1.0` means no penalty. |
| frequency_penalty | ✅ | ✅ | ✅ | float (default: `0.0`) | Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim. |
| presence_penalty | ✅ | ✅ | ✅ | float (default: `0.0`) | Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics. |
| seed | ✅ | ✅ | ✅ | integer (default: `0`) | Random seed to use for the generation. |
| seed | ✅ | ✅ | ✅ | integer (default: random) | Random seed for generation in range `[0, 4294967295]`. Omit to use a random seed (non-deterministic). Set explicitly to get reproducible output. Note: `rng_seed` set in `generation_config.json` is not honoured for multinomial sampling — only a per-request seed is applied. |

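For illustration, a chat request exercising the newly documented `min_p`, `top_k`, and `seed` parameters might look like the sketch below (the endpoint path follows the `http://localhost/v3/...` pattern used in these docs; the model name is a placeholder). With `min_p: 0.05`, if the most likely token has probability 0.8, tokens with probability below 0.04 are filtered out; the explicit `seed` makes the sampled output reproducible across identical requests.

```bash
# Hypothetical request; replace <model_name> with your deployed model.
curl http://localhost/v3/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "<model_name>",
    "messages": [{"role": "user", "content": "Write a haiku about autumn."}],
    "temperature": 0.8,
    "top_k": 40,
    "min_p": 0.05,
    "seed": 42
  }'
```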
#### Speculative decoding specific

@@ -275,7 +276,6 @@ If any of those parameters is not specified and request is made to Prompt Lookup
- functions

#### Unsupported params from vLLM:
- min_p
- use_beam_search (**In OpenVINO Model Server just simply increase _best_of_ param to enable beam search**)
- early_stopping
- stop_token_ids
6 changes: 3 additions & 3 deletions docs/model_server_rest_api_completions.md
@@ -76,11 +76,12 @@ curl http://localhost/v3/completions \
|-------|----------|----------|----------|---------|-----|
| temperature | ✅ | ✅ | ✅ | float (default: `1.0`) | The value is used to modulate token probabilities for multinomial sampling. It enables multinomial sampling when set to `> 0.0`. |
| top_p | ✅ | ✅ | ✅ | float (default: `1.0`) | Controls the cumulative probability of the top tokens to consider. Must be in (0, 1]. Set to 1 to consider all tokens. |
| top_k | ✅ | ❌ | ✅ | int (default: all tokens) | Controls the number of top tokens to consider. Set to empty or -1 to consider all tokens. |
| min_p | ✅ | ❌ | ✅ | float (default: `0.0`) | Minimum probability threshold relative to the most likely token. Tokens with probability below `min_p` × the top token probability are filtered out. `0.0` (default) disables the filter. Typical values: `0.05`–`0.1`. Must be in `[0.0, 1.0)`. |
| top_k | ✅ | ❌ | ✅ | int (default: `40`) | Controls the number of top tokens to consider. Set to empty or -1 to consider all tokens. |
| repetition_penalty | ✅ | ❌ | ✅ | float (default: `1.0`) | Penalizes new tokens based on whether they appear in the prompt and the generated text so far. Values > `1.0` encourage the model to use new tokens, while values < `1.0` encourage the model to repeat tokens. `1.0` means no penalty. |
| frequency_penalty | ✅ | ✅ | ✅ | float (default: `0.0`) | Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim. |
| presence_penalty | ✅ | ✅ | ✅ | float (default: `0.0`) | Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics. |
| seed | ✅ | ✅ | ✅ | integer (default: `0`) | Random seed to use for the generation. |
| seed | ✅ | ✅ | ✅ | integer (default: random) | Random seed for generation in range `[0, 4294967295]`. Omit to use a random seed (non-deterministic). Set explicitly to get reproducible output. Note: `rng_seed` set in `generation_config.json` is not honoured for multinomial sampling — only a per-request seed is applied. |

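As a sketch of the new seed semantics on the completions endpoint (placeholder model name): omitting `seed` now produces a different sample on every call, while repeating the request with the same explicit `seed` should reproduce the same text.

```bash
# Hypothetical request; sending it twice with the same seed should yield identical completions.
curl http://localhost/v3/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "<model_name>",
    "prompt": "The capital of France is",
    "temperature": 0.7,
    "seed": 123
  }'
```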
#### Speculative decoding specific

@@ -106,7 +107,6 @@ Note that below parameters are valid only for prompt lookup pipeline. Add `"prom


#### Unsupported params from vLLM:
- min_p
- use_beam_search (**In OpenVINO Model Server just simply increase _best_of_ param to enable beam search**)
- early_stopping
- stop_token_ids
5 changes: 3 additions & 2 deletions docs/model_server_rest_api_responses.md
@@ -120,11 +120,12 @@ curl http://localhost/v3/responses \
|-------|----------|----------|---------|-----|
| temperature | ✅ | ✅ | float (default: `1.0`) | The value is used to modulate token probabilities for multinomial sampling. It enables multinomial sampling when set to `> 0.0`. |
| top_p | ✅ | ✅ | float (default: `1.0`) | Controls the cumulative probability of the top tokens to consider. Must be in (0, 1]. Set to 1 to consider all tokens. |
| top_k | ✅ | ❌ | int (default: all tokens) | Controls the number of top tokens to consider. Set to empty or -1 to consider all tokens. |
| min_p | ✅ | ❌ | float (default: `0.0`) | Minimum probability threshold relative to the most likely token. Tokens with probability below `min_p` × the top token probability are filtered out. `0.0` (default) disables the filter. Typical values: `0.05`–`0.1`. Must be in `[0.0, 1.0)`. |
| top_k | ✅ | ❌ | int (default: `40`) | Controls the number of top tokens to consider. Set to empty or -1 to consider all tokens. |
| repetition_penalty | ✅ | ❌ | float (default: `1.0`) | Penalizes new tokens based on whether they appear in the prompt and the generated text so far. Values > `1.0` encourage the model to use new tokens, while values < `1.0` encourage the model to repeat tokens. `1.0` means no penalty. |
| frequency_penalty | ✅ | ❌ | float (default: `0.0`) | Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim. |
| presence_penalty | ✅ | ❌ | float (default: `0.0`) | Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics. |
| seed | ✅ | ❌ | integer (default: `0`) | Random seed to use for the generation. |
| seed | ✅ | ❌ | integer (default: random) | Random seed for generation in range `[0, 4294967295]`. Omit to use a random seed (non-deterministic). Set explicitly to get reproducible output. Note: `rng_seed` set in `generation_config.json` is not honoured for multinomial sampling — only a per-request seed is applied. |

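A hedged sketch for the Responses API endpoint: the `input` field follows the OpenAI Responses API convention and is an assumption here, as is the placeholder model name; adjust to your deployment.

```bash
# Hypothetical request; body shape and model name are assumptions, not confirmed by these docs.
curl http://localhost/v3/responses \
  -H "Content-Type: application/json" \
  -d '{
    "model": "<model_name>",
    "input": "Summarize the benefits of min_p sampling in one sentence.",
    "temperature": 0.9,
    "min_p": 0.1,
    "seed": 7
  }'
```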
#### Speculative decoding specific

25 changes: 20 additions & 5 deletions src/llm/apis/openai_api_handler.cpp
@@ -740,21 +740,36 @@ absl::Status OpenAIApiHandler::parseCommonPart(std::optional<uint32_t> maxTokens
return absl::InvalidArgumentError("top_p out of range(0.0, 1.0)");
}

// min_p: float; optional - defaults to 0 (disabled)
// Extension, unsupported by OpenAI API, however supported by vLLM and GenAI
it = doc.FindMember("min_p");
if (it != doc.MemberEnd() && !it->value.IsNull()) {
if (!it->value.IsDouble() && !it->value.IsInt())
return absl::InvalidArgumentError("min_p is not a valid number");
const float minPValue = static_cast<float>(it->value.GetDouble());
if (minPValue < 0.0f || minPValue >= 1.0f)
return absl::InvalidArgumentError("min_p out of range [0.0, 1.0)");
request.minP = minPValue;
}

// top_k: int; optional - defaults to 0
// Extension, unsupported by OpenAI API, however supported by vLLM and CB lib
// Extension, unsupported by OpenAI API, however supported by vLLM and GenAI
it = doc.FindMember("top_k");
if (it != doc.MemberEnd() && !it->value.IsNull()) {
if (!it->value.IsInt())
return absl::InvalidArgumentError("top_k is not an integer");
request.topK = it->value.GetInt();
}

// seed: int; optional - defaults to 0 (not set)
// seed: uint32; optional - omit to use a random seed
it = doc.FindMember("seed");
if (it != doc.MemberEnd() && !it->value.IsNull()) {
if (!it->value.IsUint())
return absl::InvalidArgumentError("seed is not an unsigned integer");
request.seed = it->value.GetUint();
if (!it->value.IsInt() && !it->value.IsUint() && !it->value.IsInt64() && !it->value.IsUint64())
return absl::InvalidArgumentError("seed is not an integer");
const int64_t raw = it->value.IsUint64() ? static_cast<int64_t>(it->value.GetUint64()) : it->value.GetInt64();
if (raw < 0 || raw > static_cast<int64_t>(std::numeric_limits<uint32_t>::max()))
return absl::InvalidArgumentError("seed out of range [0, 4294967295]");
request.seed = static_cast<uint32_t>(raw);
}

// stop: string or array; optional - defaults to null (not set)
3 changes: 2 additions & 1 deletion src/llm/apis/openai_request.hpp
@@ -57,8 +57,9 @@ struct OpenAIRequest {
// Multinomial decoding specific
std::optional<float> temperature{std::nullopt};
std::optional<float> topP{std::nullopt};
std::optional<float> minP{std::nullopt};
std::optional<int> topK{std::nullopt};
std::optional<int> seed{std::nullopt};
std::optional<uint32_t> seed{std::nullopt};
std::optional<float> frequencyPenalty{std::nullopt};
std::optional<float> presencePenalty{std::nullopt};
std::optional<float> repetitionPenalty{std::nullopt};
4 changes: 4 additions & 0 deletions src/llm/apis/openai_responses.cpp
@@ -406,6 +406,10 @@ void OpenAIResponsesHandler::serializeCommonResponseParameters(Writer<StringBuff
writer.String("top_p");
writer.Double(static_cast<double>(request.topP.value()));
}
if (request.minP.has_value()) {
writer.String("min_p");
writer.Double(static_cast<double>(request.minP.value()));
Collaborator: @mzegla please keep in mind that we should always return min_p, even if the client did not specify it.
@michalkulakowski here we add another param that doesn't follow the API correctly.

@michalkulakowski (Collaborator, May 8, 2026): It's a little different than top_p, for example, because I think min_p is not part of the OpenAI API. But I agree that it would be consistent to return the values of all the generation parameters that OVMS supports in the Responses API response.
}
writer.String("truncation");
writer.String("disabled");
// TODO: user not supported
23 changes: 23 additions & 0 deletions src/llm/io_processing/base_generation_config_builder.cpp
@@ -16,6 +16,7 @@

#include "../../logging.hpp"
#include <limits>
#include <random>
#include <string>
#include <openvino/genai/generation_config.hpp>
#include "base_generation_config_builder.hpp"
@@ -121,6 +122,8 @@ void BaseGenerationConfigBuilder::parseConfigFromRequest(const OpenAIRequest& re
config.top_k = request.topK.value();
if (request.topP.has_value())
config.top_p = request.topP.value();
if (request.minP.has_value())
config.min_p = request.minP.value();
if (request.seed.has_value())
config.rng_seed = request.seed.value();
if (request.stop.has_value())
@@ -133,6 +136,26 @@ void BaseGenerationConfigBuilder::parseConfigFromRequest(const OpenAIRequest& re
config.presence_penalty = request.presencePenalty.value();
config.do_sample = config.temperature > 0.0f && config.num_beams == 1;

// Apply multinomial sampling defaults when not explicitly set
if (config.do_sample) {
if (!request.topK.has_value() && config.top_k == std::numeric_limits<size_t>::max()) {
config.top_k = 40;
SPDLOG_LOGGER_DEBUG(llm_calculator_logger, "Defaulting top_k to 40 for multinomial sampling.");
}
// Use random seed for multinomial sampling to ensure non-deterministic behavior by default.
// Note: rng_seed from generation_config.json is not honoured — only an explicit per-request
// seed produces deterministic output.
// Use a thread_local mt19937 seeded once via std::random_device to avoid per-request overhead.
if (!request.seed.has_value()) {
static thread_local std::mt19937 rng{std::random_device{}()};
size_t seed = 0;
while (seed == 0)
seed = rng();
config.rng_seed = seed;
SPDLOG_LOGGER_DEBUG(llm_calculator_logger, "Randomizing rng_seed for multinomial sampling: {}.", config.rng_seed);
}
}

if (request.logprobschat || request.logprobs)
config.logprobs = 1;
// Assisted decoding specific
34 changes: 34 additions & 0 deletions src/test/llm/llmnode_test.cpp
@@ -3252,6 +3252,40 @@ TEST_P(LLMHttpParametersValidationTest, topKInvalid) {
ovms::StatusCode::MEDIAPIPE_EXECUTION_ERROR);
}

TEST_P(LLMHttpParametersValidationTest, minPValid) {
auto params = GetParam();
std::string requestBody = validRequestBodyWithParameter(params.modelName, "min_p", "0.05");

ASSERT_EQ(
handler->dispatchToProcessor(endpointChatCompletions, requestBody, &response, comp, responseComponents, writer, multiPartParser),
ovms::StatusCode::OK);

requestBody = validRequestBodyWithParameter(params.modelName, "min_p", "0");

ASSERT_EQ(
handler->dispatchToProcessor(endpointChatCompletions, requestBody, &response, comp, responseComponents, writer, multiPartParser),
ovms::StatusCode::OK);
}

TEST_P(LLMHttpParametersValidationTest, minPInvalid) {
auto params = GetParam();
std::string requestBody = validRequestBodyWithParameter(params.modelName, "min_p", "\"INVALID\"");

ASSERT_EQ(
handler->dispatchToProcessor(endpointChatCompletions, requestBody, &response, comp, responseComponents, writer, multiPartParser),
ovms::StatusCode::MEDIAPIPE_EXECUTION_ERROR);
}

TEST_P(LLMHttpParametersValidationTest, minPOutOfRange) {
auto params = GetParam();
// min_p must be in [0.0, 1.0) — value of 1.0 is out of range
std::string requestBody = validRequestBodyWithParameter(params.modelName, "min_p", "1.0");

ASSERT_EQ(
handler->dispatchToProcessor(endpointChatCompletions, requestBody, &response, comp, responseComponents, writer, multiPartParser),
ovms::StatusCode::MEDIAPIPE_EXECUTION_ERROR);
}

TEST_P(LLMHttpParametersValidationTest, seedValid) {
auto params = GetParam();
std::string requestBody = validRequestBodyWithParameter(params.modelName, "seed", "1");