Conversation
Force-pushed from ab0aa5b to 1c329a4
```cpp
#pragma warning(push)
#pragma warning(disable : 6313)
#include <rapidjson/writer.h>
```
You can probably use the technique suggested by @atobiszei: hide these headers in a port directory and include them from there. Keep the pragmas inside those wrapper headers, so you don't need them here.
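A minimal sketch of that wrapper approach (the file name and location are assumptions, not part of the PR):

```cpp
// Hypothetical port header, e.g. src/port/rapidjson_writer.hpp (name assumed):
// keeps the MSVC analysis-warning suppression in one place so call sites can
// include this header instead of repeating the pragmas.
#pragma once
#ifdef _MSC_VER
#pragma warning(push)
#pragma warning(disable : 6313)
#endif
#include <rapidjson/writer.h>
#ifdef _MSC_VER
#pragma warning(pop)
#endif
```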
```cpp
TEST_F(Gemma4OutputParserTest, ParseToolCallWithListOfStringsAsArgument) {
    std::string inputWithProperClosure = "<|tool_call>call:generate_DNA_sequence{length:100,preferences:[<|\"|>G<|\"|>,<|\"|>C<|\"|>]}<tool_call|>";
```
Is it possible that the model generates spaces between these keywords? Even if it is not supposed to, it can happen when the model has low accuracy or subtle accuracy issues. How will this parser react to that?
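For illustration, a hypothetical regression test probing that case (the fixture name comes from the suite above; the spaced input variant and the assertion strategy are assumptions):

```cpp
TEST_F(Gemma4OutputParserTest, ToleratesSpacesBetweenKeywords) {
    // Same payload as the list-of-strings test, but with stray spaces the
    // model might emit under accuracy degradation (input format assumed).
    std::string inputWithSpaces = "<|tool_call> call: generate_DNA_sequence {length: 100} <tool_call|>";
    // ... run the parser as in the sibling tests and assert whether the tool
    // call is still recognized or the text falls back to plain content.
}
```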
I am not sure; I didn't encounter any accuracy issues.
Pull request overview
This PR adds Gemma4 tool-call parsing support to OVMS LLM I/O processing and adjusts the VLM legacy unary path to preserve/handle special tokens by reusing the streaming text collection logic (per CVS-184756). It also updates model-prep scripts and introduces a dedicated Gemma4 output parser test suite.
Changes:
- Added a new `Gemma4ToolParser` and wired it into `OutputParser` selection.
- Modified VLM legacy unary response preparation to reconstruct the full text from streamer callbacks (to retain special tokens) and updated OpenAI API handler interfaces accordingly.
- Added Gemma4 tokenizer/model preparation steps and extensive unit tests for Gemma4 tool-call parsing (unary + streaming scenarios).
Reviewed changes
Copilot reviewed 17 out of 17 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| windows_prepare_llm_models.bat | Adds Gemma4 tokenizer download step for Windows test model prep. |
| prepare_llm_models.sh | Adds Gemma4 tokenizer conversion/download for Linux test model prep. |
| src/test/llm/output_parsers/gemma4_output_parser_test.cpp | New unit tests covering Gemma4 tool-call parsing and streaming chunk behavior. |
| src/test/http_openai_handler_test.cpp | Updates tests to match the new VLM unary serialization API signature. |
| src/llm/visual_language_model/legacy/servable.cpp | Implements unary “streaming-style” text accumulation to preserve special tokens; passes full text into unary serialization. |
| src/llm/io_processing/utils.hpp | Adds new utility declarations used by Gemma4 parsing/serialization. |
| src/llm/io_processing/utils.cpp | Implements JSON argument writing and delimiter search respecting nested structures/quotes. |
| src/llm/io_processing/output_parser.cpp | Registers gemma4 tool parser type. |
| src/llm/io_processing/gemma4/tool_parser.hpp | New Gemma4 tool parser interface + streaming state machine declarations. |
| src/llm/io_processing/gemma4/tool_parser.cpp | New Gemma4 tool parser implementation (unary + streaming). |
| src/llm/BUILD | Adds Bazel targets/deps for Gemma4 parser and RapidJSON usage in utils. |
| src/llm/apis/openai_api_handler.hpp | Updates VLM unary serialization interface to accept explicit textResponse. |
| src/llm/apis/openai_completions.hpp | Updates VLM unary serialization signature. |
| src/llm/apis/openai_completions.cpp | Uses provided textResponse for VLM unary serialization. |
| src/llm/apis/openai_responses.hpp | Updates VLM unary serialization signature. |
| src/llm/apis/openai_responses.cpp | Uses provided textResponse for VLM unary serialization. |
| spelling-whitelist.txt | Adds the new Gemma4 test file to the spelling whitelist. |
Comments suppressed due to low confidence (2)
src/llm/apis/openai_completions.cpp:497
- `serializeUnaryResponse(VLMDecodedResults&, textResponse)` only emits a choice when `textResponse` is non-empty. If the model legitimately generates an empty string, this returns an empty `choices` array, which is not a valid OpenAI-style response (a single choice with empty content/tool_calls should still be returned). Consider always creating one choice and using `textResponse` as-is (even when empty).
```cpp
if (!textResponse.empty()) {
    SPDLOG_LOGGER_TRACE(llm_calculator_logger, "Generated text: {}", textResponse);
    // Workaround to use OVMS unary parsers: get tokens from string
    // This way we have detokenized text from GenAI and calculate tokens, to further convert back to text again, in parseOutputIfNeeded...
    auto generatedTokens = encodeTextToTokens(textResponse);
    SPDLOG_LOGGER_TRACE(llm_calculator_logger, "Generated tokens: {}", generatedTokens);
    ParsedOutput parsedOutput = parseOutputIfNeeded(generatedTokens);
    jsonResponse.StartObject();
    // finish_reason: "stop" in regular scenario, "tool_calls" if output contains tool calls
    auto finishReason = mapFinishReason(ov::genai::GenerationFinishReason::STOP, !parsedOutput.toolCalls.empty());
    jsonResponse.FinishReason(finishReason.value_or("unknown"));
    // index: integer; Choice index, only n=1 supported anyway
    jsonResponse.Index(index++);
    // TODO: logprobs: object/null; Log probability information for the choice.
    if (endpoint == Endpoint::CHAT_COMPLETIONS) {
        jsonResponse.MessageObject(parsedOutput);
    } else if (endpoint == Endpoint::COMPLETIONS) {
        jsonResponse.Text(parsedOutput);
    }
    // finish message object
    jsonResponse.EndObject();
}
// finish choices array
jsonResponse.EndArray();
```
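A minimal sketch of the suggested change (not the PR's code): hoist the choice emission out of the non-empty guard so the `choices` array always contains one entry.

```cpp
// Sketch only; identifiers reused from the snippet above, behavior assumed.
SPDLOG_LOGGER_TRACE(llm_calculator_logger, "Generated text: {}", textResponse);
auto generatedTokens = encodeTextToTokens(textResponse);  // may be empty
ParsedOutput parsedOutput = parseOutputIfNeeded(generatedTokens);
jsonResponse.StartObject();
auto finishReason = mapFinishReason(ov::genai::GenerationFinishReason::STOP, !parsedOutput.toolCalls.empty());
jsonResponse.FinishReason(finishReason.value_or("unknown"));
jsonResponse.Index(index++);
if (endpoint == Endpoint::CHAT_COMPLETIONS) {
    jsonResponse.MessageObject(parsedOutput);  // empty content still yields a valid choice
} else if (endpoint == Endpoint::COMPLETIONS) {
    jsonResponse.Text(parsedOutput);
}
jsonResponse.EndObject();
jsonResponse.EndArray();  // choices now always has exactly one element
```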
src/llm/apis/openai_responses.cpp:676
- `serializeUnaryResponse(VLMDecodedResults&, textResponse)` skips adding any `ParsedOutput` when `textResponse` is empty, so `serializeUnaryResponseImpl()` is called with an empty vector. If the model output is legitimately empty, this likely produces a response without any output items/content, which is invalid for the Responses API. Consider emitting a single empty `ParsedOutput` (or otherwise ensuring one output item exists) even when `textResponse` is empty.
```cpp
std::vector<ParsedOutput> parsedOutputs;
if (!textResponse.empty()) {
    if (outputParser != nullptr) {
        // Same workaround as in chat completions
        auto generatedTokens = encodeTextToTokens(textResponse);
        parsedOutputs.push_back(parseOutputIfNeeded(generatedTokens));
    } else {
        // Fast path: no output parser, use decoded text directly.
        ParsedOutput output;
        output.content = textResponse;
        parsedOutputs.push_back(std::move(output));
    }
}
return serializeUnaryResponseImpl(parsedOutputs);
```
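Correspondingly, a sketch of the fallback Copilot suggests (an assumption, not merged code): guarantee at least one output item even for an empty `textResponse`.

```cpp
std::vector<ParsedOutput> parsedOutputs;
if (!textResponse.empty() && outputParser != nullptr) {
    auto generatedTokens = encodeTextToTokens(textResponse);
    parsedOutputs.push_back(parseOutputIfNeeded(generatedTokens));
} else {
    ParsedOutput output;
    output.content = textResponse;  // possibly empty, but one output item always exists
    parsedOutputs.push_back(std::move(output));
}
return serializeUnaryResponseImpl(parsedOutputs);
```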
```cpp
this->streamingPosition = toolCallEndTagPos + TOOL_CALL_END_TAG.length();
this->currentState = State::AfterToolCall;
SPDLOG_LOGGER_TRACE(llm_calculator_logger, "Detected end of tool call at position: {}, returning to content state", toolCallEndTagPos);
```
```cpp
std::string Gemma4ToolParser::parseArrayParameter(std::string argumentStr) {
    size_t pos = 1;
    std::string parsedArguments = "[";

    while (pos != std::string::npos) {
        size_t stringStartPos = argumentStr.find(TOOL_ARGS_STRING_INDICATOR, pos);
        if (stringStartPos == std::string::npos) {
            break;
        }
        stringStartPos += TOOL_ARGS_STRING_INDICATOR.size();
        size_t stringEndPos = argumentStr.find(TOOL_ARGS_STRING_INDICATOR, stringStartPos);
        if (stringEndPos == std::string::npos) {
            break;
        }

        std::string originalStr = argumentStr.substr(stringStartPos, stringEndPos - stringStartPos);
        size_t quotePos = 0;
        while ((quotePos = originalStr.find('\"', quotePos)) != std::string::npos) {
            originalStr.insert(quotePos, "\\");
            quotePos += 2;
        }
        parsedArguments += "\"" + originalStr + "\",";

        pos = stringEndPos + TOOL_ARGS_STRING_INDICATOR.size() + 1;
    }

    parsedArguments.back() = ']';

    return parsedArguments;
}
```
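For reference, an illustrative trace under the delimiter seen in the tests above (assuming `TOOL_ARGS_STRING_INDICATOR` is `<|"|>`):

```cpp
// parseArrayParameter("[<|\"|>G<|\"|>,<|\"|>C<|\"|>]")  ->  "[\"G\",\"C\"]"
// The comma appended after the last element is overwritten by ']' in the
// final parsedArguments.back() = ']' step.
```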
```cpp
void writeArgumentOfAnyType(const rapidjson::Value& arg, rapidjson::Writer<rapidjson::StringBuffer>& writer) {
    if (arg.IsString()) {
        writer.String(arg.GetString());
    } else if (arg.IsInt64()) {
        writer.Int64(arg.GetInt64());
    } else if (arg.IsDouble()) {
        writer.Double(arg.GetDouble());
    } else if (arg.IsBool()) {
        writer.Bool(arg.GetBool());
    } else if (arg.IsArray()) {
        writer.StartArray();
        for (auto& elem : arg.GetArray()) {
            writeArgumentOfAnyType(elem, writer);
        }
        writer.EndArray();
    } else if (arg.IsObject()) {
        writer.StartObject();
        for (auto it = arg.MemberBegin(); it != arg.MemberEnd(); ++it) {
            writer.Key(it->name.GetString());
            writeArgumentOfAnyType(it->value, writer);
        }
        writer.EndObject();
    } else {
        writer.String("");
    }
}
```
```cpp
std::optional<rapidjson::Document> Gemma4ToolParser::parseChunk(const std::string& chunk, ov::genai::GenerationFinishReason finishReason) {
    if (chunk.empty()) {
        return std::nullopt;
    }

    this->streamingContent += chunk;

    if (parseNewContent()) {
        if (this->currentState == State::ToolCallParameters) {
            return BaseOutputParser::wrapFirstDelta(this->toolCall.name, toolCallIndex);
        }
        if (this->currentState == State::ToolCallEnded) {
            return wrapDeltaArgs(this->toolCall.arguments, toolCallIndex);
        }
        if (this->currentState == State::Content) {
            size_t contentEnd = this->streamingContent.find(TOOL_CALL_START_TAG, this->streamingPosition);
            std::string content;
            if (contentEnd != std::string::npos) {
                content = this->streamingContent.substr(this->streamingPosition, contentEnd - this->streamingPosition);
            } else {
                content = this->streamingContent.substr(this->streamingPosition);
            }
            this->streamingPosition += content.size();
            if (!content.empty()) {
                return wrapDeltaContent(content);
            }
        }
        if (this->currentState == State::AfterToolCall) {
            this->currentState = State::Content;
        }
    }
```
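A hypothetical chunk-by-chunk trace (chunk boundaries and payloads are invented; the exact point where the name delta fires depends on `parseNewContent()`, which is not shown in this excerpt):

```cpp
// parseChunk("Hello ")                  -> content delta "Hello "
// parseChunk("<|tool_call>call:get_")   -> std::nullopt (tool name incomplete)
// parseChunk("weather{")                -> first delta carrying the tool name
// parseChunk("location:<|\"|>Paris<|\"|>}<tool_call|>")
//                                       -> delta carrying the serialized arguments
```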
Force-pushed from dea7d61 to c6f3835
```cpp
size_t endPos = std::find(generatedTokens.begin(), generatedTokens.end(), reasoningEndTokenId) - generatedTokens.begin();

if (startPos != std::string::npos && endPos != std::string::npos && startPos < endPos) {
    size_t reasoningStart = startPos + 3;  // deleting "<|channel>thought\n"
```
Ensure these tokens were really the ones you assumed; otherwise we might not detect issues early.
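A hedged sketch of that validation (identifiers reused from the snippet above; the marker string and the `skip_special_tokens(false)` decode are assumptions):

```cpp
// Decode the tokens we are about to strip and verify they really are the
// reasoning-start marker; warn and bail out instead of silently mis-stripping.
std::string prefix = tokenizer.decode(
    std::vector<int64_t>(generatedTokens.begin() + startPos, generatedTokens.begin() + startPos + 3),
    ov::genai::skip_special_tokens(false));
if (prefix != "<|channel>thought\n") {
    SPDLOG_LOGGER_WARN(llm_calculator_logger, "Unexpected reasoning prefix: {}", prefix);
    return;  // leave parsedOutput unchanged
}
```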
```cpp
    // Remove reasoning from content
    std::string contentWithoutReasoning = tokenizer.decode(std::vector<int64_t>(generatedTokens.begin() + endPos + 1, generatedTokens.end()), ov::genai::skip_special_tokens(true));
    parsedOutput.content = contentWithoutReasoning;
}
```
What if there was no reasoning? What will take care of content parsing?
Probably the tool parser will take care of that.
```cpp
Gemma4ReasoningParser() = delete;
explicit Gemma4ReasoningParser(ov::genai::Tokenizer& tokenizer) :
    Qwen3ReasoningParser(tokenizer) {}
void parse(ParsedOutput& parsedOutput, const std::vector<int64_t>& generatedTokens) override;
```
Are we missing support for parseChunk? Does it work with streaming?
```cpp
std::string originalStr = argumentStr.substr(stringStartPos, stringEndPos - stringStartPos);
size_t quotePos = 0;
while ((quotePos = originalStr.find('\"', quotePos)) != std::string::npos) {
```
Please check whether we need multi-level escaping, like we used to need in the qwen3-coder parser.
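For comparison, a hypothetical two-level escape (a sketch of the concern, not the qwen3-coder code): escape backslashes as well as quotes, so values that already contain `\"` survive re-serialization.

```cpp
std::string escaped;
escaped.reserve(originalStr.size());
for (char c : originalStr) {
    if (c == '\\' || c == '"') {
        escaped += '\\';  // escape both the escape character and quotes
    }
    escaped += c;
}
parsedArguments += "\"" + escaped + "\",";
```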
```cpp
// If the tag is shorter than the buffer, we check:
// a) if the tag is a substring of the buffer (tag is fully matched)
// b) if the buffer and tag overlap (part of the tag is matched)
// In the first case we return FOUND_COMPLETE, in the second FOUND_INCOMPLETE.
```
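A minimal self-contained sketch of that check (the FOUND_COMPLETE/FOUND_INCOMPLETE names come from the comment above; everything else is assumed):

```cpp
#include <algorithm>
#include <string>

enum class TagMatch { NOT_FOUND, FOUND_INCOMPLETE, FOUND_COMPLETE };

TagMatch findTag(const std::string& buffer, const std::string& tag) {
    if (buffer.find(tag) != std::string::npos) {
        return TagMatch::FOUND_COMPLETE;  // case a): tag fully inside the buffer
    }
    // case b): some prefix of the tag overlaps the end of the buffer
    size_t maxOverlap = std::min(buffer.size(), tag.size() - 1);
    for (size_t len = maxOverlap; len > 0; --len) {
        if (buffer.compare(buffer.size() - len, len, tag, 0, len) == 0) {
            return TagMatch::FOUND_INCOMPLETE;  // wait for more chunks
        }
    }
    return TagMatch::NOT_FOUND;
}
```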
```cpp
// Using the appropriate parser based on the current processing phase.
// A call to this method should return either the result from parserContentChunk, parseToolCallChunk, or parseReasoningChunk when we can determine the phase,
// or std::nullopt when we are waiting for more chunks to determine whether we should switch phase or not.
// Note that the mentioned methods do not take the chunk as an argument; they read it from streamOutputCache and are responsible for clearing the cache,
```
```cpp
ASSERT_EQ(apiHandler->parseRequest(maxTokensLimit, bestOfLimit, maxModelLength), absl::OkStatus());

std::string toolCallContent = "I will call a tool.<tool_call>{\"name\":\"get_weather\",\"arguments\":{\"location\":\"Paris\"}}</tool_call>";
```
This is some content from rebasing main; I will remove it.
```cpp
}

// TODO(mzegla): Usage of streaming flow here due to GenAI generate limitations.
// This diverges from the general flow - we should have unified systematic approach.
```
Keep the comment. Why are you removing isDisconnected?
```cpp
int singleQuoteDepth = 0;

for (size_t i = startPos; i < str.length(); ++i) {
    if (bracketDepth == 0 && braceDepth == 0 && quoteDepth == 0 && singleQuoteDepth == 0 &&
```
Is it also used by other parsers? We should also run BFCL for the other parsers that use it so we are sure there is no regression (I think LFM uses it?).
🛠 Summary
CVS-184756
Enabling the gemma4 model; changing the VLM pipeline to accept special tokens in streaming.
🧪 Checklist