Conversation
Pull request overview
Updates the OpenVINO GenAI `GenerationConfig` request-to-config mapping so that multinomial sampling uses new defaults (top-k filtering and non-deterministic seeding) when the client does not explicitly provide those parameters.
Changes:
- When `do_sample` is enabled, default `top_k` to `40` if not provided in the request.
- When `do_sample` is enabled, randomize `rng_seed` if not provided in the request and the current seed is `0`.
- Add the `<random>` include to support seed randomization.
Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.
Comments suppressed due to low confidence (2)
src/llm/apis/openai_request.hpp:64
`OpenAIRequest::seed` is typed as `std::optional<int>`, but the API parser treats `seed` as an unsigned integer and the generation config uses `0` as a sentinel for "not set". Using a signed type here risks overflow/negative values and complicates the sentinel semantics. Prefer an unsigned type (`uint32_t`/`uint64_t`/`size_t`) consistent with the parsed JSON and `GenerationConfig::rng_seed`.
```cpp
// Multinomial decoding specific
std::optional<float> temperature{std::nullopt};
std::optional<float> topP{std::nullopt};
std::optional<float> minP{std::nullopt};
std::optional<int> topK{std::nullopt};
std::optional<int> seed{std::nullopt};
std::optional<float> frequencyPenalty{std::nullopt};
std::optional<float> presencePenalty{std::nullopt};
```
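A one-line sketch of the change Copilot is suggesting here; the `uint32_t` choice follows the later thread decision to keep a uint32 seed and is otherwise an assumption:

```cpp
// Sketch of the suggestion, not the merged code: an unsigned type matches the
// parsed JSON and GenerationConfig::rng_seed, and std::nullopt alone marks
// "not set", so no 0 sentinel is needed.
std::optional<uint32_t> seed{std::nullopt};
```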
src/llm/apis/openai_api_handler.cpp:761
`top_k` currently accepts any JSON int (including negative values other than -1). Because `top_k` is later assigned into `GenerationConfig::top_k` (`size_t`), values like `-2` will wrap to a huge positive number and won't behave like "all tokens". If only `-1` is meant to disable the filter (per docs), validate `top_k >= -1` here and reject other negative values.
```cpp
// top_k: int; optional - defaults to 0
// Extension, unsupported by OpenAI API, however supported by vLLM and CB lib
it = doc.FindMember("top_k");
if (it != doc.MemberEnd() && !it->value.IsNull()) {
    if (!it->value.IsInt())
        return absl::InvalidArgumentError("top_k is not an integer");
    request.topK = it->value.GetInt();
}
```
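A hedged sketch of the validation Copilot suggests, reusing `doc`, `it`, and `request` as they appear in the snippet above:

```cpp
// Sketch only: accept -1 (disable the filter) or non-negative values, and
// reject other negatives before they can wrap in the later size_t assignment.
it = doc.FindMember("top_k");
if (it != doc.MemberEnd() && !it->value.IsNull()) {
    if (!it->value.IsInt())
        return absl::InvalidArgumentError("top_k is not an integer");
    const int topK = it->value.GetInt();
    if (topK < -1)
        return absl::InvalidArgumentError("top_k must be -1 or a non-negative integer");
    request.topK = topK;
}
```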
```diff
 | frequency_penalty | ✅ | ✅ | ✅ | float (default: `0.0`) | Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim. |
 | presence_penalty | ✅ | ✅ | ✅ | float (default: `0.0`) | Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics. |
-| seed | ✅ | ✅ | ✅ | integer (default: `0`) | Random seed to use for the generation. |
+| seed | ✅ | ✅ | ✅ | integer (default: random) | Random seed to use for the generation. Set explicitly to get reproducible results. |
```
This is described in the "Multinomial sampling specific" section.
```diff
 #### Multinomial sampling specific
 | Param | OpenVINO Model Server | OpenAI /responses API | Type | Description |
 |-------|----------|----------|---------|-----|
 | temperature | ✅ | ✅ | float (default: `1.0`) | The value is used to modulate token probabilities for multinomial sampling. It enables multinomial sampling when set to `> 0.0`. |
 | top_p | ✅ | ✅ | float (default: `1.0`) | Controls the cumulative probability of the top tokens to consider. Must be in (0, 1]. Set to 1 to consider all tokens. |
-| top_k | ✅ | ❌ | int (default: all tokens) | Controls the number of top tokens to consider. Set to empty or -1 to consider all tokens. |
+| min_p | ✅ | ❌ | float (default: `0.0`) | Minimum probability threshold relative to the most likely token. Tokens with probability below `min_p * p_max` are filtered out. `0.0` (default) disables the filter. Typical values: `0.05`–`0.1`. Must be in `[0.0, 1.0)`. |
+| top_k | ✅ | ❌ | int (default: `40`) | Controls the number of top tokens to consider. Set to empty or -1 to consider all tokens. |
 | repetition_penalty | ✅ | ❌ | float (default: `1.0`) | Penalizes new tokens based on whether they appear in the prompt and the generated text so far. Values > `1.0` encourage the model to use new tokens, while values < `1.0` encourage the model to repeat tokens. `1.0` means no penalty. |
 | frequency_penalty | ✅ | ❌ | float (default: `0.0`) | Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim. |
 | presence_penalty | ✅ | ❌ | float (default: `0.0`) | Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics. |
-| seed | ✅ | ❌ | integer (default: `0`) | Random seed to use for the generation. |
+| seed | ✅ | ❌ | integer (default: random) | Random seed to use for the generation. Set explicitly to get reproducible results. |
```
This is described in the "Multinomial sampling specific" section. Active multinomial sampling is an assumption here.
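The quoted table defines `min_p` as filtering out tokens below `min_p * p_max`. For readers skimming the thread, a small standalone sketch of that rule (illustrative only, not the OVMS sampler code):

```cpp
#include <algorithm>
#include <vector>

// Illustrative only: zero out tokens whose probability falls below
// min_p * p_max; a real sampler would renormalize before drawing.
std::vector<float> applyMinP(std::vector<float> probs, float minP) {
    const float pMax = *std::max_element(probs.begin(), probs.end());
    const float threshold = minP * pMax;
    for (float& p : probs)
        if (p < threshold)
            p = 0.0f;
    return probs;
}
```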
```cpp
if (request.seed.has_value())
    config.rng_seed = request.seed.value();
```
Switched to an int64 representation, accepting negative seeds as specified in the OpenAI API:
https://developers.openai.com/api/reference/resources/chat/subresources/completions/methods/create#(resource)%20chat.completions%20%3E%20(method)%20create%20%3E%20(params)%200.non_streaming%20%3E%20(param)%20seed%20%3E%20(schema)
Actually, let me revert that and keep the uint32 type, with -1 as a special value.
```cpp
}
if (request.minP.has_value()) {
    writer.String("min_p");
    writer.Double(static_cast<double>(request.minP.value()));
```
@mzegla please keep in mind that we should always return min_p, even if the client did not specify it.
@michalkulakowski here we add another param that doesn't follow the API correctly.
It's a little different from top_p, for example, because I think min_p is not part of the OpenAI API. But I agree it would be consistent to return the values of all the generation parameters that OVMS supports in the Responses API response.
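A minimal sketch of what "always return min_p" could look like, reusing the writer calls quoted above; the `0.0` fallback (the documented "filter disabled" default) is an assumption:

```cpp
// Sketch only: serialize min_p unconditionally, falling back to 0.0 when the
// client did not provide a value.
writer.String("min_p");
writer.Double(static_cast<double>(request.minP.value_or(0.0f)));
```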
```cpp
// Apply multinomial sampling defaults when not explicitly set
if (config.do_sample) {
    if (!request.topK.has_value() && config.top_k == std::numeric_limits<size_t>::max()) {
        config.top_k = 40;
        SPDLOG_LOGGER_DEBUG(llm_calculator_logger, "Defaulting top_k to 40 for multinomial sampling.");
    }

    // Use random seed for multinomial sampling to ensure non-deterministic behavior by default.
    // Note: rng_seed from generation_config.json is not honoured — only an explicit per-request
    // seed produces deterministic output.
    // Use a thread_local mt19937 seeded once via std::random_device to avoid per-request overhead.
    if (!request.seed.has_value()) {
        static thread_local std::mt19937 rng{std::random_device{}()};
        size_t seed = 0;
        while (seed == 0)
            seed = rng();
        config.rng_seed = seed;
        SPDLOG_LOGGER_DEBUG(llm_calculator_logger, "Randomizing rng_seed for multinomial sampling: {}.", config.rng_seed);
    }
```
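A note on the quoted logic: the `while (seed == 0)` loop deliberately skips zero, which matches `0` being used as the "seed not provided" sentinel in this mapping (see the overview bullets at the top).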
```diff
 // seed: uint32; optional - omit to use a random seed
 it = doc.FindMember("seed");
 if (it != doc.MemberEnd() && !it->value.IsNull()) {
-    if (!it->value.IsUint())
-        return absl::InvalidArgumentError("seed is not an unsigned integer");
-    request.seed = it->value.GetUint();
+    if (!it->value.IsInt() && !it->value.IsUint() && !it->value.IsInt64() && !it->value.IsUint64())
+        return absl::InvalidArgumentError("seed is not an integer");
+    const int64_t raw = it->value.IsUint64() ? static_cast<int64_t>(it->value.GetUint64()) : it->value.GetInt64();
+    if (raw < 0 || raw > static_cast<int64_t>(std::numeric_limits<uint32_t>::max()))
+        return absl::InvalidArgumentError("seed out of range [0, 4294967295]");
+    request.seed = static_cast<uint32_t>(raw);
```
```cpp
TEST_P(LLMHttpParametersValidationTest, minPOutOfRange) {
    auto params = GetParam();
    // min_p must be in [0.0, 1.0) — value of 1.0 is out of range
    std::string requestBody = validRequestBodyWithParameter(params.modelName, "min_p", "1.0");

    ASSERT_EQ(
        handler->dispatchToProcessor(endpointChatCompletions, requestBody, &response, comp, responseComponents, writer, multiPartParser),
        ovms::StatusCode::MEDIAPIPE_EXECUTION_ERROR);
```
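For symmetry, a hypothetical companion test (not part of the PR) checking that an in-range value is accepted; the `minPInRange` name and `ovms::StatusCode::OK` as the success code are assumptions:

```cpp
// Hypothetical sketch mirroring the quoted test: a min_p inside [0.0, 1.0)
// should pass validation. StatusCode::OK as the success code is an assumption.
TEST_P(LLMHttpParametersValidationTest, minPInRange) {
    auto params = GetParam();
    std::string requestBody = validRequestBodyWithParameter(params.modelName, "min_p", "0.05");

    ASSERT_EQ(
        handler->dispatchToProcessor(endpointChatCompletions, requestBody, &response, comp, responseComponents, writer, multiPartParser),
        ovms::StatusCode::OK);
}
```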
Introducing new default behavior for multinomial sampling when `temperature > 0` (see the overview bullets at the top). Additionally, adding support for the `min_p` sampling option.