Conversation
Pull request overview
Updates the OpenVINO GenAI `GenerationConfig` request-to-config mapping so that multinomial sampling uses new defaults (top-k filtering and non-deterministic seeding) when the client does not explicitly provide those parameters.
Changes:
- When `do_sample` is enabled, default `top_k` to `40` if not provided in the request.
- When `do_sample` is enabled, randomize `rng_seed` if not provided in the request and the current seed is `0`.
- Add the `<random>` include to support seed randomization.
Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.
Comments suppressed due to low confidence (2)
src/llm/apis/openai_request.hpp:64
`OpenAIRequest::seed` is typed as `std::optional<int>`, but the API parser treats `seed` as an unsigned integer and the generation config uses `0` as a sentinel for "not set". Using a signed type here risks overflow/negative values and complicates the sentinel semantics. Prefer an unsigned type (`uint32_t`/`uint64_t`/`size_t`) consistent with the parsed JSON and `GenerationConfig::rng_seed`.
```cpp
// Multinomial decoding specific
std::optional<float> temperature{std::nullopt};
std::optional<float> topP{std::nullopt};
std::optional<float> minP{std::nullopt};
std::optional<int> topK{std::nullopt};
std::optional<int> seed{std::nullopt};
std::optional<float> frequencyPenalty{std::nullopt};
std::optional<float> presencePenalty{std::nullopt};
```
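A one-line sketch of the change Copilot is suggesting here; the `uint32_t` choice follows the later thread decision to keep a uint32 seed and is otherwise an assumption:

```cpp
// Sketch of the suggestion, not the merged code: an unsigned type matches the
// parsed JSON and GenerationConfig::rng_seed, and std::nullopt alone marks
// "not set", so no 0 sentinel is needed.
std::optional<uint32_t> seed{std::nullopt};
```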
src/llm/apis/openai_api_handler.cpp:761
`top_k` currently accepts any JSON int (including negative values other than -1). Because `top_k` is later assigned into `GenerationConfig::top_k` (`size_t`), values like `-2` will wrap to a huge positive number and won't behave like "all tokens". If only `-1` is meant to disable the filter (per docs), validate `top_k >= -1` here and reject other negative values.
```cpp
// top_k: int; optional - defaults to 0
// Extension, unsupported by OpenAI API, however supported by vLLM and CB lib
it = doc.FindMember("top_k");
if (it != doc.MemberEnd() && !it->value.IsNull()) {
    if (!it->value.IsInt())
        return absl::InvalidArgumentError("top_k is not an integer");
    request.topK = it->value.GetInt();
}
```
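A hedged sketch of the validation Copilot suggests, reusing `doc`, `it`, and `request` as they appear in the snippet above:

```cpp
// Sketch only: accept -1 (disable the filter) or non-negative values, and
// reject other negatives before they can wrap in the later size_t assignment.
it = doc.FindMember("top_k");
if (it != doc.MemberEnd() && !it->value.IsNull()) {
    if (!it->value.IsInt())
        return absl::InvalidArgumentError("top_k is not an integer");
    const int topK = it->value.GetInt();
    if (topK < -1)
        return absl::InvalidArgumentError("top_k must be -1 or a non-negative integer");
    request.topK = topK;
}
```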
```diff
 | frequency_penalty | ✅ | ✅ | ✅ | float (default: `0.0`) | Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim. |
 | presence_penalty | ✅ | ✅ | ✅ | float (default: `0.0`) | Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics. |
-| seed | ✅ | ✅ | ✅ | integer (default: `0`) | Random seed to use for the generation. |
+| seed | ✅ | ✅ | ✅ | integer (default: random) | Random seed to use for the generation. Set explicitly to get reproducible results. |
```
This is described in the "Multinomial sampling specific" section.
```diff
 #### Multinomial sampling specific
 | Param | OpenVINO Model Server | OpenAI /responses API | Type | Description |
 |-------|----------|----------|---------|-----|
 | temperature | ✅ | ✅ | float (default: `1.0`) | The value is used to modulate token probabilities for multinomial sampling. It enables multinomial sampling when set to `> 0.0`. |
 | top_p | ✅ | ✅ | float (default: `1.0`) | Controls the cumulative probability of the top tokens to consider. Must be in (0, 1]. Set to 1 to consider all tokens. |
-| top_k | ✅ | ❌ | int (default: all tokens) | Controls the number of top tokens to consider. Set to empty or -1 to consider all tokens. |
+| min_p | ✅ | ❌ | float (default: `0.0`) | Minimum probability threshold relative to the most likely token. Tokens with probability below `min_p * p_max` are filtered out. `0.0` (default) disables the filter. Typical values: `0.05`–`0.1`. Must be in `[0.0, 1.0)`. |
+| top_k | ✅ | ❌ | int (default: `40`) | Controls the number of top tokens to consider. Set to empty or -1 to consider all tokens. |
 | repetition_penalty | ✅ | ❌ | float (default: `1.0`) | Penalizes new tokens based on whether they appear in the prompt and the generated text so far. Values > `1.0` encourage the model to use new tokens, while values < `1.0` encourage the model to repeat tokens. `1.0` means no penalty. |
 | frequency_penalty | ✅ | ❌ | float (default: `0.0`) | Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim. |
 | presence_penalty | ✅ | ❌ | float (default: `0.0`) | Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics. |
-| seed | ✅ | ❌ | integer (default: `0`) | Random seed to use for the generation. |
+| seed | ✅ | ❌ | integer (default: random) | Random seed to use for the generation. Set explicitly to get reproducible results. |
```
This is described in the "Multinomial sampling specific" section. Active multinomial sampling is an assumption here.
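The quoted table defines `min_p` as filtering out tokens below `min_p * p_max`. For readers skimming the thread, a small standalone sketch of that rule (illustrative only, not the OVMS sampler code):

```cpp
#include <algorithm>
#include <vector>

// Illustrative only: zero out tokens whose probability falls below
// min_p * p_max; a real sampler would renormalize before drawing.
std::vector<float> applyMinP(std::vector<float> probs, float minP) {
    const float pMax = *std::max_element(probs.begin(), probs.end());
    const float threshold = minP * pMax;
    for (float& p : probs)
        if (p < threshold)
            p = 0.0f;
    return probs;
}
```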
```cpp
if (request.seed.has_value())
    config.rng_seed = request.seed.value();
```
Switched to an int64 representation, accepting negative seeds as specified in the OpenAI API:
https://developers.openai.com/api/reference/resources/chat/subresources/completions/methods/create#(resource)%20chat.completions%20%3E%20(method)%20create%20%3E%20(params)%200.non_streaming%20%3E%20(param)%20seed%20%3E%20(schema)
Actually, let me revert that and keep the uint32 type, with -1 as a special value.
```cpp
}
if (request.minP.has_value()) {
    writer.String("min_p");
    writer.Double(static_cast<double>(request.minP.value()));
```
@mzegla please keep in mind that we should always return min_p, even if the client did not specify it.
@michalkulakowski here we add another param that doesn't follow the API correctly.
It's a little different from top_p, for example, because I think min_p is not part of the OpenAI API. But I agree it would be consistent to return the values of all the generation parameters that OVMS supports in the Responses API response.
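A minimal sketch of what "always return min_p" could look like, reusing the writer calls quoted above; the `0.0` fallback (the documented "filter disabled" default) is an assumption:

```cpp
// Sketch only: serialize min_p unconditionally, falling back to 0.0 when the
// client did not provide a value.
writer.String("min_p");
writer.Double(static_cast<double>(request.minP.value_or(0.0f)));
```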
```cpp
// Apply multinomial sampling defaults when not explicitly set
if (config.do_sample) {
    if (!request.topK.has_value() && config.top_k == std::numeric_limits<size_t>::max()) {
        config.top_k = 40;
        SPDLOG_LOGGER_DEBUG(llm_calculator_logger, "Defaulting top_k to 40 for multinomial sampling.");
    }

    // Use random seed for multinomial sampling to ensure non-deterministic behavior by default.
    // Note: rng_seed from generation_config.json is not honoured — only an explicit per-request
    // seed produces deterministic output.
    // Use a thread_local mt19937 seeded once via std::random_device to avoid per-request overhead.
    if (!request.seed.has_value()) {
        static thread_local std::mt19937 rng{std::random_device{}()};
        size_t seed = 0;
        while (seed == 0)
            seed = rng();
        config.rng_seed = seed;
        SPDLOG_LOGGER_DEBUG(llm_calculator_logger, "Randomizing rng_seed for multinomial sampling: {}.", config.rng_seed);
    }
```
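A note on the quoted logic: the `while (seed == 0)` loop deliberately skips zero, which matches `0` being used as the "seed not provided" sentinel in this mapping (see the overview bullets at the top).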
```diff
 // seed: uint32; optional - omit to use a random seed
 it = doc.FindMember("seed");
 if (it != doc.MemberEnd() && !it->value.IsNull()) {
-    if (!it->value.IsUint())
-        return absl::InvalidArgumentError("seed is not an unsigned integer");
-    request.seed = it->value.GetUint();
+    if (!it->value.IsInt() && !it->value.IsUint() && !it->value.IsInt64() && !it->value.IsUint64())
+        return absl::InvalidArgumentError("seed is not an integer");
+    const int64_t raw = it->value.IsUint64() ? static_cast<int64_t>(it->value.GetUint64()) : it->value.GetInt64();
+    if (raw < 0 || raw > static_cast<int64_t>(std::numeric_limits<uint32_t>::max()))
+        return absl::InvalidArgumentError("seed out of range [0, 4294967295]");
+    request.seed = static_cast<uint32_t>(raw);
```
```cpp
TEST_P(LLMHttpParametersValidationTest, minPOutOfRange) {
    auto params = GetParam();
    // min_p must be in [0.0, 1.0) — value of 1.0 is out of range
    std::string requestBody = validRequestBodyWithParameter(params.modelName, "min_p", "1.0");

    ASSERT_EQ(
        handler->dispatchToProcessor(endpointChatCompletions, requestBody, &response, comp, responseComponents, writer, multiPartParser),
        ovms::StatusCode::MEDIAPIPE_EXECUTION_ERROR);
```
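For symmetry, a hypothetical companion test (not part of the PR) checking that an in-range value is accepted; the `minPInRange` name and `ovms::StatusCode::OK` as the success code are assumptions:

```cpp
// Hypothetical sketch mirroring the quoted test: a min_p inside [0.0, 1.0)
// should pass validation. StatusCode::OK as the success code is an assumption.
TEST_P(LLMHttpParametersValidationTest, minPInRange) {
    auto params = GetParam();
    std::string requestBody = validRequestBodyWithParameter(params.modelName, "min_p", "0.05");

    ASSERT_EQ(
        handler->dispatchToProcessor(endpointChatCompletions, requestBody, &response, comp, responseComponents, writer, multiPartParser),
        ovms::StatusCode::OK);
}
```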
Introducing new default behavior for multinomial sampling when `temperature > 0` (see the overview bullets at the top). Additionally, adding support for the `min_p` sampling option.