Multinomial defaults change + min_p support #4194

Open · mzegla wants to merge 8 commits into main from multinomial_defaults_change

Conversation

mzegla (Collaborator) commented May 8, 2026

Introducing new default behavior for multinomial sampling. When temperature > 0:

  • default top_k is set to 40 (previously inactive: all tokens were considered)
  • generation is randomized by default: a random seed is used for every request unless one is specified. The new default is non-deterministic, whereas it used to be deterministic (the same hardcoded seed for every request).

Additionally, this PR adds support for the min_p sampling option.
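
For reference, min_p keeps only tokens whose probability is at least `min_p * p_max`, where p_max is the probability of the most likely token (the semantics documented in the parameter tables quoted below). A minimal illustrative sketch of that filtering rule; the helper name and shape are hypothetical, not this PR's implementation:

```cpp
#include <algorithm>
#include <vector>

// Hypothetical helper illustrating min_p semantics: drop tokens whose
// probability falls below min_p * p_max, where p_max is the probability
// of the most likely token.
std::vector<float> applyMinP(std::vector<float> probs, float minP) {
    if (minP <= 0.0f || probs.empty())
        return probs;  // 0.0 disables the filter (the documented default)
    const float pMax = *std::max_element(probs.begin(), probs.end());
    const float threshold = minP * pMax;
    for (auto& p : probs)
        if (p < threshold)
            p = 0.0f;  // filtered out; a real sampler would renormalize
    return probs;
}
```

For example, with min_p = 0.05 and a top-token probability of 0.6, any token with probability below 0.03 is dropped.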

Copilot AI review requested due to automatic review settings May 8, 2026 12:06
mzegla requested review from dkalinowski and dtrawins May 8, 2026 12:07
Copilot AI (Contributor) left a comment

Pull request overview

Updates the OpenVINO GenAI GenerationConfig request-to-config mapping so that multinomial sampling uses the new defaults (top-k filtering and non-deterministic seeding) when the client does not explicitly provide those parameters.

Changes:

  • When do_sample is enabled, default top_k to 40 if not provided in the request.
  • When do_sample is enabled, randomize rng_seed if not provided in the request and the current seed is 0.
  • Add <random> include to support seed randomization.
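
In practical terms, a client that wants the previous deterministic, all-tokens behavior can still request it explicitly. A hypothetical request body, using the field semantics from the docs tables quoted below (`seed` set explicitly for reproducibility, `top_k: -1` to consider all tokens; the model name is a placeholder):

```cpp
// Hypothetical chat-completions request body: an explicit seed restores
// reproducible output, and top_k = -1 restores "consider all tokens".
const char* requestBody = R"({
    "model": "my-model",
    "messages": [{"role": "user", "content": "Hello"}],
    "temperature": 0.8,
    "seed": 42,
    "top_k": -1
})";
```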

  • Comment thread: src/llm/io_processing/base_generation_config_builder.cpp (Outdated)
  • Comment thread: src/llm/io_processing/base_generation_config_builder.cpp
  • Comment thread: src/llm/io_processing/base_generation_config_builder.cpp
mzegla changed the title from "Multinomial defaults change - random seed + top_k=40" to "Multinomial defaults change + min_p support" May 8, 2026
mzegla requested a review from Copilot May 8, 2026 12:30
Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.

Comments suppressed due to low confidence (2)

src/llm/apis/openai_request.hpp:64

  • OpenAIRequest::seed is typed as std::optional<int>, but the API parser treats seed as an unsigned integer and the generation config uses 0 as a sentinel for “not set”. Using a signed type here risks overflow/negative values and complicates the sentinel semantics. Prefer an unsigned type (uint32_t/uint64_t/size_t) consistent with the parsed JSON and GenerationConfig::rng_seed.
```cpp
    // Multinomial decoding specific
    std::optional<float> temperature{std::nullopt};
    std::optional<float> topP{std::nullopt};
    std::optional<float> minP{std::nullopt};
    std::optional<int> topK{std::nullopt};
    std::optional<int> seed{std::nullopt};
    std::optional<float> frequencyPenalty{std::nullopt};
    std::optional<float> presencePenalty{std::nullopt};
```
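
A minimal sketch of the suggested fix, assuming the rest of the struct stays as above (this is the reviewer's proposal, not the PR's final code):

```cpp
    std::optional<uint32_t> seed{std::nullopt};  // unsigned, consistent with GenerationConfig::rng_seed
```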

src/llm/apis/openai_api_handler.cpp:761

  • top_k currently accepts any JSON int (including negative values other than -1). Because top_k is later assigned into GenerationConfig::top_k (size_t), values like -2 will wrap to a huge positive number and won’t behave like “all tokens”. If only -1 is meant to disable the filter (per docs), validate top_k >= -1 here and reject other negative values.
```cpp
    // top_k: int; optional - defaults to 0
    // Extension, unsupported by OpenAI API, however supported by vLLM and CB lib
    it = doc.FindMember("top_k");
    if (it != doc.MemberEnd() && !it->value.IsNull()) {
        if (!it->value.IsInt())
            return absl::InvalidArgumentError("top_k is not an integer");
        request.topK = it->value.GetInt();
    }
```
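
A sketch of the validation the reviewer is suggesting, rejecting negative values other than -1 before assignment (illustrative only, built from the parsing code above):

```cpp
    it = doc.FindMember("top_k");
    if (it != doc.MemberEnd() && !it->value.IsNull()) {
        if (!it->value.IsInt())
            return absl::InvalidArgumentError("top_k is not an integer");
        const int topK = it->value.GetInt();
        if (topK < -1)  // only -1 may disable the filter; other negatives would wrap in size_t
            return absl::InvalidArgumentError("top_k must be >= -1");
        request.topK = topK;
    }
```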

Comment thread docs/model_server_rest_api_chat.md Outdated
Comment on lines +233 to +243
```diff
 | frequency_penalty | ✅ | ✅ | ✅ | float (default: `0.0`) | Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim. |
 | presence_penalty | ✅ | ✅ | ✅ | float (default: `0.0`) | Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics. |
-| seed | ✅ | ✅ | ✅ | integer (default: `0`) | Random seed to use for the generation. |
+| seed | ✅ | ✅ | ✅ | integer (default: random) | Random seed to use for the generation. Set explicitly to get reproducible results. |
```
mzegla (Collaborator, Author) replied:

This is described in the "Multinomial sampling specific" section.

Comment on lines +74 to +84
```diff
 | frequency_penalty | ✅ | ✅ | ✅ | float (default: `0.0`) | Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim. |
 | presence_penalty | ✅ | ✅ | ✅ | float (default: `0.0`) | Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics. |
-| seed | ✅ | ✅ | ✅ | integer (default: `0`) | Random seed to use for the generation. |
+| seed | ✅ | ✅ | ✅ | integer (default: random) | Random seed to use for the generation. Set explicitly to get reproducible results. |
```
A collaborator replied:

good point

mzegla (Collaborator, Author) replied:

This is described in the "Multinomial sampling specific" section. Active multinomial sampling is an assumption here.

Comment on lines 118 to 129
```diff
 #### Multinomial sampling specific
 | Param | OpenVINO Model Server | OpenAI /responses API | Type | Description |
 |-------|----------|----------|---------|-----|
 | temperature | ✅ | ✅ | float (default: `1.0`) | The value is used to modulate token probabilities for multinomial sampling. It enables multinomial sampling when set to `> 0.0`. |
 | top_p | ✅ | ✅ | float (default: `1.0`) | Controls the cumulative probability of the top tokens to consider. Must be in (0, 1]. Set to 1 to consider all tokens. |
-| top_k | ✅ | ❌ | int (default: all tokens) | Controls the number of top tokens to consider. Set to empty or -1 to consider all tokens. |
+| min_p | ✅ | ❌ | float (default: `0.0`) | Minimum probability threshold relative to the most likely token. Tokens with probability below `min_p * p_max` are filtered out. `0.0` (default) disables the filter. Typical values: `0.05`–`0.1`. Must be in `[0.0, 1.0)`. |
+| top_k | ✅ | ❌ | int (default: `40`) | Controls the number of top tokens to consider. Set to empty or -1 to consider all tokens. |
 | repetition_penalty | ✅ | ❌ | float (default: `1.0`) | Penalizes new tokens based on whether they appear in the prompt and the generated text so far. Values > `1.0` encourage the model to use new tokens, while values < `1.0` encourage the model to repeat tokens. `1.0` means no penalty. |
 | frequency_penalty | ✅ | ❌ | float (default: `0.0`) | Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim. |
 | presence_penalty | ✅ | ❌ | float (default: `0.0`) | Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics. |
-| seed | ✅ | ❌ | integer (default: `0`) | Random seed to use for the generation. |
+| seed | ✅ | ❌ | integer (default: random) | Random seed to use for the generation. Set explicitly to get reproducible results. |
```

mzegla (Collaborator, Author) replied:

This is described in the "Multinomial sampling specific" section. Active multinomial sampling is an assumption here.

Comment on lines 127 to 128
```cpp
if (request.seed.has_value())
    config.rng_seed = request.seed.value();
```
A collaborator asked:

is it an issue? @mzegla


mzegla (Collaborator, Author) replied:


Actually, let me revert that and keep the uint32 type, with -1 being a special value.

```cpp
}
if (request.minP.has_value()) {
    writer.String("min_p");
    writer.Double(static_cast<double>(request.minP.value()));
```
A collaborator commented:

@mzegla please keep in mind that we should always return min_p, even if the client did not specify it.
@michalkulakowski here we add another param that doesn't follow the API correctly.

michalkulakowski (Collaborator) commented May 8, 2026:

It's a little bit different than top_p, for example, because I think min_p is not part of the OpenAI API. But I agree that it would be consistent to return the values of all the generation parameters that OVMS supports in the Responses API response.
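
A minimal sketch of what the reviewers suggest, assuming the writer shown above and the documented `0.0` default for min_p (a hypothetical fix, not code from this PR):

```cpp
// Always emit min_p, falling back to the server default (0.0) when the
// client omitted it, so the response reflects the effective configuration.
writer.String("min_p");
writer.Double(static_cast<double>(request.minP.value_or(0.0f)));
```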

Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.

  • Comment thread: src/llm/apis/openai_api_handler.cpp (Outdated)
  • Comment thread: src/llm/apis/openai_api_handler.cpp (Outdated)
Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.

Comment on lines +139 to +144
```cpp
// Apply multinomial sampling defaults when not explicitly set
if (config.do_sample) {
    // numeric_limits<size_t>::max() is the "not set" value for top_k,
    // i.e. the previous "consider all tokens" behavior
    if (!request.topK.has_value() && config.top_k == std::numeric_limits<size_t>::max()) {
        config.top_k = 40;
        SPDLOG_LOGGER_DEBUG(llm_calculator_logger, "Defaulting top_k to 40 for multinomial sampling.");
    }
```
Comment on lines +145 to +156
```cpp
    // Use random seed for multinomial sampling to ensure non-deterministic behavior by default.
    // Note: rng_seed from generation_config.json is not honoured — only an explicit per-request
    // seed produces deterministic output.
    // Use a thread_local mt19937 seeded once via std::random_device to avoid per-request overhead.
    if (!request.seed.has_value()) {
        static thread_local std::mt19937 rng{std::random_device{}()};
        size_t seed = 0;
        while (seed == 0)  // skip 0: it is the sentinel meaning "seed not set"
            seed = rng();
        config.rng_seed = seed;
        SPDLOG_LOGGER_DEBUG(llm_calculator_logger, "Randomizing rng_seed for multinomial sampling: {}.", config.rng_seed);
    }
```
Comment on lines +764 to +772
```diff
 // seed: uint32; optional - omit to use a random seed
 it = doc.FindMember("seed");
 if (it != doc.MemberEnd() && !it->value.IsNull()) {
-    if (!it->value.IsUint())
-        return absl::InvalidArgumentError("seed is not an unsigned integer");
-    request.seed = it->value.GetUint();
+    if (!it->value.IsInt() && !it->value.IsUint() && !it->value.IsInt64() && !it->value.IsUint64())
+        return absl::InvalidArgumentError("seed is not an integer");
+    const int64_t raw = it->value.IsUint64() ? static_cast<int64_t>(it->value.GetUint64()) : it->value.GetInt64();
+    if (raw < 0 || raw > static_cast<int64_t>(std::numeric_limits<uint32_t>::max()))
+        return absl::InvalidArgumentError("seed out of range [0, 4294967295]");
+    request.seed = static_cast<uint32_t>(raw);
```
Comment on lines +3279 to +3286
```cpp
TEST_P(LLMHttpParametersValidationTest, minPOutOfRange) {
    auto params = GetParam();
    // min_p must be in [0.0, 1.0) — value of 1.0 is out of range
    std::string requestBody = validRequestBodyWithParameter(params.modelName, "min_p", "1.0");

    ASSERT_EQ(
        handler->dispatchToProcessor(endpointChatCompletions, requestBody, &response, comp, responseComponents, writer, multiPartParser),
        ovms::StatusCode::MEDIAPIPE_EXECUTION_ERROR);
```
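
For contrast, a hypothetical in-range counterpart, assuming the same test helpers and that a valid min_p dispatches successfully (the expected OK status code is an assumption, not taken from this PR):

```cpp
TEST_P(LLMHttpParametersValidationTest, minPInRange) {
    auto params = GetParam();
    // min_p = 0.05 lies within the documented [0.0, 1.0) range and should be accepted
    std::string requestBody = validRequestBodyWithParameter(params.modelName, "min_p", "0.05");

    ASSERT_EQ(
        handler->dispatchToProcessor(endpointChatCompletions, requestBody, &response, comp, responseComponents, writer, multiPartParser),
        ovms::StatusCode::OK);
}
```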
