
Gemma4 parser#4179

Open
przepeck wants to merge 17 commits into main from przepeck/gemma_parser

Conversation


@przepeck przepeck commented May 5, 2026

🛠 Summary

CVS-184756
Enables the Gemma4 model and changes the VLM pipeline to accept special tokens in streaming.

🧪 Checklist

  • Unit tests added.
  • Documentation updated.
  • Change follows security best practices.

@przepeck przepeck force-pushed the przepeck/gemma_parser branch from ab0aa5b to 1c329a4 on May 6, 2026 09:53
Comment thread src/llm/io_processing/gemma4/tool_parser.cpp Outdated

#pragma warning(push)
#pragma warning(disable : 6313)
#include <rapidjson/writer.h>
Collaborator


You can probably use the technique suggested by @atobiszei: hide these headers in the port directory and include them from there. Keep the pragmas inside those wrapper headers, so you don't need them here.
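As a sketch, the suggestion could look like a small wrapper header along these lines (the path and the `_MSC_VER` guards are hypothetical, not part of this PR):

```cpp
// Hypothetical wrapper header, e.g. src/port/rapidjson_writer.hpp.
// Includers pull in this header instead of <rapidjson/writer.h>, so the
// pragma suppression lives in exactly one place.
#pragma once

#if defined(_MSC_VER)
#pragma warning(push)
#pragma warning(disable : 6313)
#endif

#include <rapidjson/writer.h>

#if defined(_MSC_VER)
#pragma warning(pop)
#endif
```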

}

TEST_F(Gemma4OutputParserTest, ParseToolCallWithListOfStringsAsArgument) {
std::string inputWithProperClosure = "<|tool_call>call:generate_DNA_sequence{length:100,preferences:[<|\"|>G<|\"|>,<|\"|>C<|\"|>]}<tool_call|>";

@dkalinowski dkalinowski May 6, 2026


Is it possible that the model generates spaces between these keywords? Even if not by design, it can happen when the model has low accuracy or subtle accuracy issues. How will this parser react to that?

Collaborator Author


I'm not sure; I didn't encounter any accuracy issues.

Comment thread src/test/llm/output_parsers/gemma4_output_parser_test.cpp Outdated
@przepeck przepeck marked this pull request as ready for review May 6, 2026 12:07
Copilot AI review requested due to automatic review settings May 6, 2026 12:07
Comment thread src/test/llm/output_parsers/gemma4_output_parser_test.cpp Outdated
}

TEST_F(Gemma4OutputParserTest, ParseToolCallOutputWithThreeToolCalls) {
std::string inputWithProperClosure = "<|tool_call>call:example_tool{arg1:<|\"|>value1<|\"|>,arg2:42}<tool_call|>"
Collaborator


Did you test whether the model is capable of generating multiple tool calls (so-called "parallel tool calling")? If so, is this behavior observed or from the docs?
I'm just worried whether it is always a single tool call between <tool_call> tags, or whether it sometimes generates all of them in one (we have seen such models).
Also, do we support all whitespace between tools: space/newline/\r?

Collaborator Author


I observed that the model can generate multiple tool calls. It doesn't use special separators; the next function just starts with `call:`,
e.g. call:solve{equation:<|"|>2(x + 5) = 13<|"|>}call:sort{array:[<|"|>hello, it's me<|"|>, <|"|>hi<|"|>]}
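A minimal sketch of how such concatenated calls could be split (hypothetical helper, not the PR's implementation; it assumes `{`/`}` appear only as argument delimiters, since string values in the observed output are wrapped in `<|"|>` markers rather than raw braces):

```cpp
#include <cassert>
#include <string>
#include <vector>

// Split "call:a{...}call:b{...}" into individual "call:..." segments.
// A new segment starts at every "call:" seen at brace depth 0.
std::vector<std::string> splitConcatenatedCalls(const std::string& s) {
    std::vector<std::string> calls;
    size_t start = std::string::npos;
    int depth = 0;
    for (size_t i = 0; i < s.size(); ++i) {
        if (depth == 0 && s.compare(i, 5, "call:") == 0) {
            if (start != std::string::npos)
                calls.push_back(s.substr(start, i - start));
            start = i;
        }
        if (s[i] == '{')
            ++depth;
        else if (s[i] == '}')
            --depth;
    }
    if (start != std::string::npos)
        calls.push_back(s.substr(start));  // last (or only) call
    return calls;
}
```

A `call:` occurring inside an argument is at depth > 0 and therefore ignored; the sketch would break only if a string value contained an unbalanced raw brace.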


@dkalinowski dkalinowski May 7, 2026


I think we are missing such a unit test.

Contributor

Copilot AI left a comment


Pull request overview

This PR adds Gemma4 tool-call parsing support to OVMS LLM I/O processing and adjusts the VLM legacy unary path to preserve/handle special tokens by reusing the streaming text collection logic (per CVS-184756). It also updates model-prep scripts and introduces a dedicated Gemma4 output parser test suite.

Changes:

  • Added a new Gemma4ToolParser and wired it into OutputParser selection.
  • Modified VLM legacy unary response preparation to reconstruct the full text from streamer callbacks (to retain special tokens) and updated OpenAI API handler interfaces accordingly.
  • Added Gemma4 tokenizer/model preparation steps and extensive unit tests for Gemma4 tool-call parsing (unary + streaming scenarios).

Reviewed changes

Copilot reviewed 17 out of 17 changed files in this pull request and generated 5 comments.

File Description
windows_prepare_llm_models.bat Adds Gemma4 tokenizer download step for Windows test model prep.
prepare_llm_models.sh Adds Gemma4 tokenizer conversion/download for Linux test model prep.
src/test/llm/output_parsers/gemma4_output_parser_test.cpp New unit tests covering Gemma4 tool-call parsing and streaming chunk behavior.
src/test/http_openai_handler_test.cpp Updates tests to match the new VLM unary serialization API signature.
src/llm/visual_language_model/legacy/servable.cpp Implements unary “streaming-style” text accumulation to preserve special tokens; passes full text into unary serialization.
src/llm/io_processing/utils.hpp Adds new utility declarations used by Gemma4 parsing/serialization.
src/llm/io_processing/utils.cpp Implements JSON argument writing and delimiter search respecting nested structures/quotes.
src/llm/io_processing/output_parser.cpp Registers gemma4 tool parser type.
src/llm/io_processing/gemma4/tool_parser.hpp New Gemma4 tool parser interface + streaming state machine declarations.
src/llm/io_processing/gemma4/tool_parser.cpp New Gemma4 tool parser implementation (unary + streaming).
src/llm/BUILD Adds Bazel targets/deps for Gemma4 parser and RapidJSON usage in utils.
src/llm/apis/openai_api_handler.hpp Updates VLM unary serialization interface to accept explicit textResponse.
src/llm/apis/openai_completions.hpp Updates VLM unary serialization signature.
src/llm/apis/openai_completions.cpp Uses provided textResponse for VLM unary serialization.
src/llm/apis/openai_responses.hpp Updates VLM unary serialization signature.
src/llm/apis/openai_responses.cpp Uses provided textResponse for VLM unary serialization.
spelling-whitelist.txt Adds the new Gemma4 test file to the whitelist.
Comments suppressed due to low confidence (2)

src/llm/apis/openai_completions.cpp:497

  • serializeUnaryResponse(VLMDecodedResults&, textResponse) only emits a choice when textResponse is non-empty. If the model legitimately generates an empty string, this returns an empty choices array, which is not a valid OpenAI-style response (a single choice with empty content/tool_calls should still be returned). Consider always creating one choice and using textResponse as-is (even when empty).
    if (!textResponse.empty()) {
        SPDLOG_LOGGER_TRACE(llm_calculator_logger, "Generated text: {}", textResponse);

        // Workaround to use OVMS unary parsers: get tokens from string
        // This way we have detokenized text from GenAI and calculate tokens, to further convert back to text again, in parseOutputIfNeeded...
        auto generatedTokens = encodeTextToTokens(textResponse);

        SPDLOG_LOGGER_TRACE(llm_calculator_logger, "Generated tokens: {}", generatedTokens);
        ParsedOutput parsedOutput = parseOutputIfNeeded(generatedTokens);
        jsonResponse.StartObject();
        // finish_reason: "stop" in regular scenario, "tool_calls" if output contains tool calls
        auto finishReason = mapFinishReason(ov::genai::GenerationFinishReason::STOP, !parsedOutput.toolCalls.empty());
        jsonResponse.FinishReason(finishReason.value_or("unknown"));
        // index: integer; Choice index, only n=1 supported anyway
        jsonResponse.Index(index++);
        // TODO: logprobs: object/null; Log probability information for the choice.

        if (endpoint == Endpoint::CHAT_COMPLETIONS) {
            jsonResponse.MessageObject(parsedOutput);
        } else if (endpoint == Endpoint::COMPLETIONS) {
            jsonResponse.Text(parsedOutput);
        }

        // finish message object
        jsonResponse.EndObject();
    }
    // finish choices array
    jsonResponse.EndArray();

src/llm/apis/openai_responses.cpp:676

  • serializeUnaryResponse(VLMDecodedResults&, textResponse) skips adding any ParsedOutput when textResponse is empty, so serializeUnaryResponseImpl() is called with an empty vector. If the model output is legitimately empty, this likely produces a response without any output items/content, which is invalid for the Responses API. Consider emitting a single empty ParsedOutput (or otherwise ensuring one output item exists) even when textResponse is empty.
    std::vector<ParsedOutput> parsedOutputs;
    if (!textResponse.empty()) {
        if (outputParser != nullptr) {
            // Same workaround as in chat completions
            auto generatedTokens = encodeTextToTokens(textResponse);
            parsedOutputs.push_back(parseOutputIfNeeded(generatedTokens));
        } else {
            // Fast path: no output parser, use decoded text directly.
            ParsedOutput output;
            output.content = textResponse;
            parsedOutputs.push_back(std::move(output));
        }
    }
    return serializeUnaryResponseImpl(parsedOutputs);

Comment on lines +314 to +316
this->streamingPosition = toolCallEndTagPos + TOOL_CALL_END_TAG.length();
this->currentState = State::AfterToolCall;
SPDLOG_LOGGER_TRACE(llm_calculator_logger, "Detected end of tool call at position: {}, returning to content state", toolCallEndTagPos);
Comment on lines +39 to +67
std::string Gemma4ToolParser::parseArrayParameter(std::string argumentStr) {
    size_t pos = 1;
    std::string parsedArguments = "[";

    while (pos != std::string::npos) {
        size_t stringStartPos = argumentStr.find(TOOL_ARGS_STRING_INDICATOR, pos);
        if (stringStartPos == std::string::npos) {
            break;
        }
        stringStartPos += TOOL_ARGS_STRING_INDICATOR.size();
        size_t stringEndPos = argumentStr.find(TOOL_ARGS_STRING_INDICATOR, stringStartPos);
        if (stringEndPos == std::string::npos) {
            break;
        }

        std::string originalStr = argumentStr.substr(stringStartPos, stringEndPos - stringStartPos);
        size_t quotePos = 0;
        while ((quotePos = originalStr.find('\"', quotePos)) != std::string::npos) {
            originalStr.insert(quotePos, "\\");
            quotePos += 2;
        }
        parsedArguments += "\"" + originalStr + "\",";

        pos = stringEndPos + TOOL_ARGS_STRING_INDICATOR.size() + 1;
    }

    parsedArguments.back() = ']';

    return parsedArguments;
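For comparison, here is a self-contained sketch of the same transformation (hypothetical names, stdlib only) that also keeps an empty array valid; note that `parsedArguments.back() = ']'` in the excerpt above turns `"["` into `"]"` when no `<|"|>` strings are found:

```cpp
#include <string>

// Convert Gemma4's array syntax, e.g. [<|"|>G<|"|>,<|"|>C<|"|>],
// into a JSON array string such as ["G","C"].
std::string gemmaArrayToJson(const std::string& arg) {
    static const std::string kQuote = "<|\"|>";  // Gemma4 string indicator
    std::string out = "[";
    size_t pos = 0;
    while (true) {
        size_t begin = arg.find(kQuote, pos);
        if (begin == std::string::npos)
            break;
        begin += kQuote.size();
        size_t end = arg.find(kQuote, begin);
        if (end == std::string::npos)
            break;
        std::string item = arg.substr(begin, end - begin);
        // Escape embedded double quotes so the output stays valid JSON.
        for (size_t q = 0; (q = item.find('"', q)) != std::string::npos; q += 2)
            item.insert(q, "\\");
        out += "\"" + item + "\",";
        pos = end + kQuote.size();
    }
    if (out.back() == ',')
        out.back() = ']';  // replace trailing comma
    else
        out += ']';        // no elements: emit "[]", not "]"
    return out;
}
```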
Comment on lines +478 to +479
break;
SPDLOG_LOGGER_TRACE(llm_calculator_logger, "No more tool strings found in the decoded string: {}", toolCallStr);
Comment on lines +38 to +62
void writeArgumentOfAnyType(const rapidjson::Value& arg, rapidjson::Writer<rapidjson::StringBuffer>& writer) {
    if (arg.IsString()) {
        writer.String(arg.GetString());
    } else if (arg.IsInt64()) {
        writer.Int64(arg.GetInt64());
    } else if (arg.IsDouble()) {
        writer.Double(arg.GetDouble());
    } else if (arg.IsBool()) {
        writer.Bool(arg.GetBool());
    } else if (arg.IsArray()) {
        writer.StartArray();
        for (auto& elem : arg.GetArray()) {
            writeArgumentOfAnyType(elem, writer);
        }
        writer.EndArray();
    } else if (arg.IsObject()) {
        writer.StartObject();
        for (auto it = arg.MemberBegin(); it != arg.MemberEnd(); ++it) {
            writer.Key(it->name.GetString());
            writeArgumentOfAnyType(it->value, writer);
        }
        writer.EndObject();
    } else {
        writer.String("");
    }
Comment on lines +356 to +386
std::optional<rapidjson::Document> Gemma4ToolParser::parseChunk(const std::string& chunk, ov::genai::GenerationFinishReason finishReason) {
    if (chunk.empty()) {
        return std::nullopt;
    }

    this->streamingContent += chunk;

    if (parseNewContent()) {
        if (this->currentState == State::ToolCallParameters) {
            return BaseOutputParser::wrapFirstDelta(this->toolCall.name, toolCallIndex);
        }
        if (this->currentState == State::ToolCallEnded) {
            return wrapDeltaArgs(this->toolCall.arguments, toolCallIndex);
        }
        if (this->currentState == State::Content) {
            size_t contentEnd = this->streamingContent.find(TOOL_CALL_START_TAG, this->streamingPosition);
            std::string content;
            if (contentEnd != std::string::npos) {
                content = this->streamingContent.substr(this->streamingPosition, contentEnd - this->streamingPosition);
            } else {
                content = this->streamingContent.substr(this->streamingPosition);
            }
            this->streamingPosition += content.size();
            if (!content.empty()) {
                return wrapDeltaContent(content);
            }
        }
        if (this->currentState == State::AfterToolCall) {
            this->currentState = State::Content;
        }
    }
std::vector<int64_t> generatedTokens(generatedTensor.data<int64_t>(), generatedTensor.data<int64_t>() + generatedTensor.get_size());
ParsedOutput parsedOutput = outputParserWithRegularToolParsing->parse(generatedTokens, true);
ASSERT_EQ(parsedOutput.toolCalls.size(), 1);
EXPECT_EQ(parsedOutput.toolCalls[0].name, "no_args_tool");
Collaborator


Suggested change:
- EXPECT_EQ(parsedOutput.toolCalls[0].name, "no_args_tool");
+ EXPECT_EQ(parsedOutput.toolCalls[0].name, "no_args_tool");
+ EXPECT_EQ(parsedOutput.toolCalls[0].arguments, "");

std::vector<std::tuple<std::string, ov::genai::GenerationFinishReason, std::optional<std::string>>> chunkToDeltaVec{
    {"SOME_CONTENT", ov::genai::GenerationFinishReason::NONE, R"({"delta":{"content":"SOME_CONTENT"}})"},
    {"MORE_CONTENT<|tool_call>", ov::genai::GenerationFinishReason::NONE, R"({"delta":{"content":"MORE_CONTENT"}})"},
    {"call:sort", ov::genai::GenerationFinishReason::NONE, std::nullopt},


Do we need tests for whitespace between keywords? Do we treat such requests as valid?


// Tests with special characters
TEST_F(Gemma4OutputParserTest, ParseToolCallWithStringArgumentsContainingComparison) {
std::string input = R"x(<|tool_call>call:search{query:<|"|>price >= 100, (sale)<|"|>,limit:5}<tool_call|>)x";


Why is it not supported? Isn't it normal for whitespace to be part of a string? And likewise `>`, `<`, `=`?

}

TEST_F(Gemma4OutputParserTest, ParseToolCallWithStringArgumentsContainingSpecialCharacters) {
std::string impl = "import package\nimport package2\n\ndef func(a, b):\n\td={\"python\": \"dict\"}\n\tl = [\"list \\\"with escaped text\\\"\", 123, []]\n\treturn f\"formatted {a} and {b}\"";


Did you test whether the model really escapes everything? I remember @atobiszei working on the qwen-coder parser; additional work was needed there to support nested `\`.

Comment thread prepare_llm_models.sh Outdated