2 changes: 1 addition & 1 deletion docs/model_server_rest_api_chat.md
@@ -221,6 +221,7 @@ Some parameters, especially related to sampling (like `temperature`, `top_p` etc
| tool_choice | ✅ | ✅ | ✅ | string or object | Controls which (if any) tool is called by the model. `none` means the model will not call any tool and instead generates a message. `auto` means the model can pick between generating a message or calling one or more tools. `required` means that the model should call at least one tool. Specifying a particular tool via `{"type": "function", "function": {"name": "my_function"}}` forces the model to call that tool. See [OpenAI API reference](https://platform.openai.com/docs/api-reference/chat/create#chat-create-tool_choice) for more details. |
| response_format | ✅ | ✅ | ✅ | object | An object specifying the format that the model must output. Setting to `{ "type": "json_schema", "json_schema": {...} }` enables Structured Outputs which ensures the model will match your supplied JSON schema according to the [OpenAI reference](https://platform.openai.com/docs/api-reference/chat/create#chat-create-response_format). Learn more in the [Structured Outputs demo](../demos/continuous_batching/structured_output/README.md). Additionally, `response_format` can accept the [XGrammar structural tags format](https://github.com/mlc-ai/xgrammar/blob/main/docs/tutorials/structural_tag.md#format-types) (not part of the OpenAI API). For example: `{ "type": "const_string", "value": "Hello World!" }`. **Note** that if the model server fails to process the format, the request will still be processed, but the format will not be imposed. |
| chat_template_kwargs | ✅ | ❌ | ✅ | object | Enables passing additional parameters to the chat template engine. Example: `{"enable_thinking": false}`. Note that values like `messages`, `eos_token`, `bos_token` etc. are provided natively to the template engine, so including them in `chat_template_kwargs` will cause an error. |
| skip_special_tokens | ✅ | ❌ | ✅ | bool (default: `true`) | Whether to remove special tokens (e.g. `<\|endoftext\|>`, `<\|im_end\|>`) from the generated output. Set to `false` to include them, which is useful when the model uses special tokens to encode structured information (e.g. bounding boxes, reasoning markers). When `false`, any tool or reasoning parser configured on the endpoint is silently disabled for the request, so the raw token stream is returned. This option works with most detokenizers exported with OpenVINO Tokenizers 2024.5 or later, unless they are based on custom ops. |
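
For illustration, a minimal sketch of a chat request that keeps special tokens in the output, using the OpenAI Python client against the OVMS endpoint (the model name, port, and API key below are placeholders, not values taken from this change):

```python
from openai import OpenAI

# OVMS exposes the OpenAI-compatible API under the /v3 prefix; adjust base_url to your server.
client = OpenAI(base_url="http://localhost:8000/v3", api_key="unused")

response = client.chat.completions.create(
    model="OpenGVLab/InternVL2-2B",  # hypothetical served model name
    messages=[{"role": "user", "content": "Locate the cat in the image."}],
    extra_body={"skip_special_tokens": False},  # keep e.g. bounding-box or reasoning markers
)
print(response.choices[0].message.content)
```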

#### Beam search sampling specific
| Param | OpenVINO Model Server | OpenAI /chat/completions API | vLLM Serving Sampling Params | Type | Description |
@@ -281,7 +282,6 @@ If any of those parameters is not specified and request is made to Prompt Lookup
- min_tokens
- prompt_logprobs
- detokenize
- skip_special_tokens
- spaces_between_special_tokens
- logits_processors
- truncate_prompt_tokens
2 changes: 1 addition & 1 deletion docs/model_server_rest_api_completions.md
@@ -62,6 +62,7 @@ curl http://localhost/v3/completions \
| include_stop_str_in_output | ✅ | ❌ | ✅ | bool (default: `false` if `stream=false`, `true` if `stream=true`) | Whether to include the matched stop string in the output. Setting it to `false` when `stream=true` is an invalid configuration and will result in an error. |
| logprobs | ⚠️ | ✅ | ✅ | integer (optional) | Include the log probabilities of the returned output tokens. **_In stream mode logprobs are not returned. Only the value 1 is accepted, which returns the log probability of the chosen token._** |
| echo | ✅ | ✅ | ✅ | boolean (optional) | Echo back the prompt in addition to the completion |
| skip_special_tokens | ✅ | ❌ | ✅ | bool (default: `true`) | Whether to remove special tokens (e.g. `<\|endoftext\|>`, `<\|im_end\|>`) from the generated output. Set to `false` to include them, which is useful when the model uses special tokens to encode structured information. This option works with most detokenizers exported with OpenVINO Tokenizers 2024.5 or later, unless they are based on custom ops. |
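
A similar sketch for the completions endpoint over plain HTTP, again with placeholder model name and port:

```python
import requests

payload = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",  # hypothetical served model name
    "prompt": "The capital of France is",
    "max_tokens": 16,
    "skip_special_tokens": False,  # keep special tokens such as <|endoftext|> in the returned text
}
# OVMS exposes the OpenAI-compatible completions endpoint under the /v3 prefix.
response = requests.post("http://localhost:8000/v3/completions", json=payload)
print(response.json()["choices"][0]["text"])
```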

#### Beam search sampling specific
| Param | OpenVINO Model Server | OpenAI /completions API | vLLM Serving Sampling Params | Type | Description |
@@ -112,7 +113,6 @@ Note that below parameters are valid only for prompt lookup pipeline. Add `"prom
- min_tokens
- prompt_logprobs
- detokenize
- skip_special_tokens
- spaces_between_special_tokens
- logits_processors
- truncate_prompt_tokens
1 change: 1 addition & 0 deletions docs/model_server_rest_api_responses.md
@@ -105,6 +105,7 @@ curl http://localhost/v3/responses \
| tool_choice | ✅ | ✅ | string or object (optional) | Controls which (if any) tool is called by the model. `none` means the model will not call any tool and instead generates a message. `auto` means the model can pick between generating a message or calling one or more tools. `required` means that the model should call at least one tool. Specifying a particular function via `{"type": "function", "function": {"name": "my_function"}}` forces the model to call that tool. |
| reasoning | ⚠️ | ✅ | object (optional) | Configuration for reasoning/thinking mode. The `effort` field accepts `"low"`, `"medium"`, or `"high"` — any value enables thinking mode (`enable_thinking: true` is injected into chat template kwargs). The `summary` field is accepted but ignored. |
| chat_template_kwargs | ✅ | ❌ | object (optional) | Additional keyword arguments passed to the chat template. When `reasoning` is also provided, `enable_thinking: true` is merged into these kwargs. |
| skip_special_tokens | ✅ | ❌ | bool (default: `true`) | Whether to remove special tokens (e.g. `<\|endoftext\|>`, `<\|im_end\|>`) from the generated output. Set to `false` to include them, which is useful when the model uses special tokens to encode structured information (e.g. bounding boxes, reasoning markers). When `false`, any tool or reasoning parser configured on the endpoint is silently disabled for the request, so the raw token stream is returned. This option works with most detokenizers exported with OpenVINO Tokenizers 2024.5 or later, unless they are based on custom ops. |
| stream_options | ❌ | ❌ | | Not supported in Responses API. Usage statistics are always included in the `response.completed` event. |

#### Beam search sampling specific
13 changes: 12 additions & 1 deletion src/llm/apis/openai_api_handler.cpp
@@ -530,7 +530,7 @@ ParsedOutput OpenAIApiHandler::parseOutputIfNeeded(const std::vector<int64_t>& g
OVMS_PROFILE_FUNCTION();
ParsedOutput parsedOutput;
if ((endpoint != Endpoint::CHAT_COMPLETIONS && endpoint != Endpoint::RESPONSES) || outputParser == nullptr) {
parsedOutput.content = this->tokenizer.decode(generatedIds);
parsedOutput.content = this->tokenizer.decode(generatedIds, ov::genai::skip_special_tokens(request.skipSpecialTokens));
} else {
parsedOutput = outputParser->parse(generatedIds, this->areToolsAvailable());
}
@@ -853,6 +853,17 @@ absl::Status OpenAIApiHandler::parseCommonPart(std::optional<uint32_t> maxTokens
if (maxNgramSizeItHasValue) {
request.maxNgramSize = maxNgramSizeIt->value.GetUint();
}

it = doc.FindMember("skip_special_tokens");
if (it != doc.MemberEnd() && !it->value.IsNull()) {
if (!it->value.IsBool())
return absl::InvalidArgumentError("skip_special_tokens is not a bool");
request.skipSpecialTokens = it->value.GetBool();
}
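// Keeping special tokens implies returning raw model output, so disable any tool/reasoning parser for this request.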
if (!request.skipSpecialTokens && outputParser != nullptr) {
outputParser.reset();
}

request.maxModelLength = maxModelLength;

// TODO: logit_bias
2 changes: 1 addition & 1 deletion src/llm/apis/openai_api_handler.hpp
@@ -164,7 +164,7 @@ class OpenAIApiHandler {
// Serialization - pure virtual, each handler produces its own response format
virtual std::string serializeUnaryResponse(const std::vector<ov::genai::GenerationOutput>& generationOutputs) = 0;
virtual std::string serializeUnaryResponse(ov::genai::EncodedResults& results) = 0;
virtual std::string serializeUnaryResponse(ov::genai::VLMDecodedResults& results) = 0;
virtual std::string serializeUnaryResponse(ov::genai::VLMDecodedResults& results, const std::string& textResponse) = 0;
virtual std::string serializeStreamingChunk(const std::string& chunkResponse, ov::genai::GenerationFinishReason finishReason) = 0;
virtual std::string serializeStreamingUsageChunk() = 0;
virtual std::string serializeStreamingHandshakeChunk() = 0;
17 changes: 8 additions & 9 deletions src/llm/apis/openai_completions.cpp
@@ -315,7 +315,7 @@ std::string OpenAIChatCompletionsHandler::serializeUnaryResponse(const std::vect
jsonResponse.StartArray("content");

for (int i = 0; i < generationOutput.generated_ids.size(); i++) {
std::string token = tokenizer.decode(std::vector<int64_t>({generationOutput.generated_ids[i]}));
std::string token = tokenizer.decode(std::vector<int64_t>({generationOutput.generated_ids[i]}), ov::genai::skip_special_tokens(this->request.skipSpecialTokens));
float logprob = generationOutput.generated_log_probs[i];
jsonResponse.LogprobObject(token, logprob);
}
@@ -324,7 +324,7 @@ std::string OpenAIChatCompletionsHandler::serializeUnaryResponse(const std::vect
if (endpoint == Endpoint::COMPLETIONS) {
jsonResponse.StartArray("tokens");
for (int i = 0; i < generationOutput.generated_ids.size(); i++) {
std::string token = tokenizer.decode(std::vector<int64_t>({generationOutput.generated_ids[i]}));
std::string token = tokenizer.decode(std::vector<int64_t>({generationOutput.generated_ids[i]}), ov::genai::skip_special_tokens(this->request.skipSpecialTokens));
jsonResponse.String(token);
}
jsonResponse.EndArray();
@@ -339,7 +339,7 @@ std::string OpenAIChatCompletionsHandler::serializeUnaryResponse(const std::vect
jsonResponse.StartArray("top_logprobs");
for (int i = 0; i < generationOutput.generated_ids.size(); i++) {
jsonResponse.StartObject();
std::string token = tokenizer.decode(std::vector<int64_t>({generationOutput.generated_ids[i]}));
std::string token = tokenizer.decode(std::vector<int64_t>({generationOutput.generated_ids[i]}), ov::genai::skip_special_tokens(this->request.skipSpecialTokens));
float logprob = generationOutput.generated_log_probs[i];
jsonResponse.Logprob(token, logprob);
jsonResponse.EndObject();
@@ -351,7 +351,7 @@ std::string OpenAIChatCompletionsHandler::serializeUnaryResponse(const std::vect
if (i == 0) {
jsonResponse.TextOffsetValue(0);
} else {
std::string textBeforeToken = tokenizer.decode(std::vector<int64_t>({generationOutput.generated_ids.begin(), generationOutput.generated_ids.begin() + i}));
std::string textBeforeToken = tokenizer.decode(std::vector<int64_t>({generationOutput.generated_ids.begin(), generationOutput.generated_ids.begin() + i}), ov::genai::skip_special_tokens(this->request.skipSpecialTokens));
jsonResponse.TextOffsetValue(textBeforeToken.size());
}
}
@@ -458,7 +458,7 @@ std::string OpenAIChatCompletionsHandler::serializeUnaryResponse(ov::genai::Enco
return jsonResponse.ToString();
}

std::string OpenAIChatCompletionsHandler::serializeUnaryResponse(ov::genai::VLMDecodedResults& results) {
std::string OpenAIChatCompletionsHandler::serializeUnaryResponse(ov::genai::VLMDecodedResults& results, const std::string& textResponse) {
OVMS_PROFILE_FUNCTION();
usage.promptTokens = results.perf_metrics.get_num_input_tokens();
usage.completionTokens = results.perf_metrics.get_num_generated_tokens();
@@ -470,13 +470,12 @@ std::string OpenAIChatCompletionsHandler::serializeUnaryResponse(ov::genai::VLMD
jsonResponse.StartArray("choices");
int index = 0;

for (int i = 0; i < results.texts.size(); i++) {
const std::string& text = results.texts[i];
SPDLOG_LOGGER_TRACE(llm_calculator_logger, "Generated text: {}", text);
if (!textResponse.empty()) {
SPDLOG_LOGGER_TRACE(llm_calculator_logger, "Generated text: {}", textResponse);

// Workaround to use OVMS unary parsers: get tokens from string
// This way we take the detokenized text from GenAI and re-encode it into tokens, which are converted back to text again in parseOutputIfNeeded...
auto generatedTokens = encodeTextToTokens(text);
auto generatedTokens = encodeTextToTokens(textResponse);

SPDLOG_LOGGER_TRACE(llm_calculator_logger, "Generated tokens: {}", generatedTokens);
ParsedOutput parsedOutput = parseOutputIfNeeded(generatedTokens);
2 changes: 1 addition & 1 deletion src/llm/apis/openai_completions.hpp
@@ -39,7 +39,7 @@ class OpenAIChatCompletionsHandler : public OpenAIApiHandler {

std::string serializeUnaryResponse(const std::vector<ov::genai::GenerationOutput>& generationOutputs) override;
std::string serializeUnaryResponse(ov::genai::EncodedResults& results) override;
std::string serializeUnaryResponse(ov::genai::VLMDecodedResults& results) override;
std::string serializeUnaryResponse(ov::genai::VLMDecodedResults& results, const std::string& textResponse) override;
std::string serializeStreamingChunk(const std::string& chunkResponse, ov::genai::GenerationFinishReason finishReason) override;
std::string serializeStreamingUsageChunk() override;
std::string serializeStreamingHandshakeChunk() override;
2 changes: 2 additions & 0 deletions src/llm/apis/openai_request.hpp
@@ -81,6 +81,8 @@ struct OpenAIRequest {
// Holds value for tool_choice field as described in https://platform.openai.com/docs/api-reference/chat/create#chat_create-tool_choice
std::string toolChoice;

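// Whether special tokens are removed during detokenization; maps to the skip_special_tokens request field (default: true)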
bool skipSpecialTokens{true};

OpenAIRequest() = default;
~OpenAIRequest() = default;
};
8 changes: 4 additions & 4 deletions src/llm/apis/openai_responses.cpp
@@ -655,21 +655,21 @@ std::string OpenAIResponsesHandler::serializeUnaryResponse(ov::genai::EncodedRes
return serializeUnaryResponseImpl(parsedOutputs);
}

std::string OpenAIResponsesHandler::serializeUnaryResponse(ov::genai::VLMDecodedResults& results) {
std::string OpenAIResponsesHandler::serializeUnaryResponse(ov::genai::VLMDecodedResults& results, const std::string& textResponse) {
OVMS_PROFILE_FUNCTION();
usage.promptTokens = results.perf_metrics.get_num_input_tokens();
usage.completionTokens = results.perf_metrics.get_num_generated_tokens();
// Usage is already correctly set from perf_metrics above — no need for updateUsage.
std::vector<ParsedOutput> parsedOutputs;
for (const std::string& text : results.texts) {
if (!textResponse.empty()) {
if (outputParser != nullptr) {
// Same workaround as in chat completions
auto generatedTokens = encodeTextToTokens(text);
auto generatedTokens = encodeTextToTokens(textResponse);
parsedOutputs.push_back(parseOutputIfNeeded(generatedTokens));
} else {
// Fast path: no output parser, use decoded text directly.
ParsedOutput output;
output.content = text;
output.content = textResponse;
parsedOutputs.push_back(std::move(output));
}
}
2 changes: 1 addition & 1 deletion src/llm/apis/openai_responses.hpp
@@ -97,7 +97,7 @@ class OpenAIResponsesHandler : public OpenAIApiHandler {

std::string serializeUnaryResponse(const std::vector<ov::genai::GenerationOutput>& generationOutputs) override;
std::string serializeUnaryResponse(ov::genai::EncodedResults& results) override;
std::string serializeUnaryResponse(ov::genai::VLMDecodedResults& results) override;
std::string serializeUnaryResponse(ov::genai::VLMDecodedResults& results, const std::string& textResponse) override;
std::string serializeStreamingChunk(const std::string& chunkResponse, ov::genai::GenerationFinishReason finishReason) override;
std::string serializeStreamingUsageChunk() override;
std::string serializeStreamingHandshakeChunk() override;
5 changes: 3 additions & 2 deletions src/llm/language_model/legacy/servable.cpp
@@ -114,8 +114,9 @@ absl::Status LegacyServable::parseRequest(std::shared_ptr<GenAiServableExecution
};
ov::AnyMap streamerConfig;
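// Keep special tokens in the streamed text when a parser requires them or when the request disabled skip_special_tokens.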
if (legacyExecutionContext->apiHandler->isStream() &&
legacyExecutionContext->apiHandler->getOutputParser() != nullptr &&
(legacyExecutionContext->apiHandler->getOutputParser()->requiresStreamingWithSpecialTokens())) {
((legacyExecutionContext->apiHandler->getOutputParser() != nullptr &&
legacyExecutionContext->apiHandler->getOutputParser()->requiresStreamingWithSpecialTokens()) ||
!legacyExecutionContext->apiHandler->getRequest().skipSpecialTokens)) {
streamerConfig.insert(ov::genai::skip_special_tokens(false));
}
legacyExecutionContext->textStreamer = std::make_shared<ov::genai::TextStreamer>(getProperties()->tokenizer, callback, streamerConfig);
5 changes: 3 additions & 2 deletions src/llm/servable.cpp
@@ -146,8 +146,9 @@ absl::Status GenAiServable::parseRequest(std::shared_ptr<GenAiServableExecutionC
return ov::genai::StreamingStatus::RUNNING;
};
ov::AnyMap streamerConfig;
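// Keep special tokens in the streamed text when a parser requires them or when the request disabled skip_special_tokens.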
if (executionContext->apiHandler->getOutputParser() != nullptr &&
(executionContext->apiHandler->getOutputParser()->requiresStreamingWithSpecialTokens())) {
if ((executionContext->apiHandler->getOutputParser() != nullptr &&
executionContext->apiHandler->getOutputParser()->requiresStreamingWithSpecialTokens()) ||
!executionContext->apiHandler->getRequest().skipSpecialTokens) {
streamerConfig.insert(ov::genai::skip_special_tokens(false));
}
executionContext->textStreamer = std::make_shared<ov::genai::TextStreamer>(getProperties()->tokenizer, callback, streamerConfig);