2 changes: 1 addition & 1 deletion docs/model_server_rest_api_chat.md
@@ -221,6 +221,7 @@ Some parameters, especially related to sampling (like `temperature`, `top_p` etc
| tool_choice | ✅ | ✅ | ✅ | string or object | Controls which (if any) tool is called by the model. `none` means the model will not call any tool and instead generates a message. `auto` means the model can pick between generating a message or calling one or more tools. `required` means that the model should call at least one tool. Specifying a particular tool via `{"type": "function", "function": {"name": "my_function"}}` forces the model to call that tool. See [OpenAI API reference](https://platform.openai.com/docs/api-reference/chat/create#chat-create-tool_choice) for more details. |
| response_format | ✅ | ✅ | ✅ | object | An object specifying the format that the model must output. Setting to `{ "type": "json_schema", "json_schema": {...} }` enables Structured Outputs which ensures the model will match your supplied JSON schema according to the [OpenAI reference](https://platform.openai.com/docs/api-reference/chat/create#chat-create-response_format). Learn more in the [Structured Outputs demo](../demos/continuous_batching/structured_output/README.md). Additionally, `response_format` can accept the [XGrammar structural tags format](https://github.com/mlc-ai/xgrammar/blob/main/docs/tutorials/structural_tag.md#format-types) (not part of the OpenAI API). For example: `{ "type": "const_string", "value": "Hello World!" }`. **Note** that if the model server fails to process the format, the request will still be processed, but the format will not be imposed. |
| chat_template_kwargs | ✅ | ❌ | ✅ | object | Enables passing additional parameters to the chat template engine. Example: `{"enable_thinking": false}`. Note that values like `messages`, `eos_token`, `bos_token` etc. are provided natively to the template engine, so including them in `chat_template_kwargs` will cause an error. |
| skip_special_tokens | ✅ | ❌ | ✅ | bool (default: `true`) | Whether to remove special tokens (e.g. `<\|endoftext\|>`, `<\|im_end\|>`) from the generated output. Set to `false` to include them, which is useful when the model uses special tokens to encode structured information (e.g. bounding boxes, reasoning markers). When `false`, any tool or reasoning parser configured on the endpoint is silently disabled for the request, so the raw token stream is returned. This option works with most detokenizers exported with OpenVINO Tokenizers 2024.5 or later, unless they are based on custom ops. |
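
For illustration, a minimal sketch of a chat request that keeps special tokens in the output, using the OpenAI Python client against the OVMS endpoint (the model name, port, and API key below are placeholders, not values taken from this change):

```python
from openai import OpenAI

# OVMS exposes the OpenAI-compatible API under the /v3 prefix; adjust base_url to your server.
client = OpenAI(base_url="http://localhost:8000/v3", api_key="unused")

response = client.chat.completions.create(
    model="OpenGVLab/InternVL2-2B",  # hypothetical served model name
    messages=[{"role": "user", "content": "Locate the cat in the image."}],
    extra_body={"skip_special_tokens": False},  # keep e.g. bounding-box or reasoning markers
)
print(response.choices[0].message.content)
```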

#### Beam search sampling specific
| Param | OpenVINO Model Server | OpenAI /chat/completions API | vLLM Serving Sampling Params | Type | Description |
@@ -281,7 +282,6 @@ If any of those parameters is not specified and request is made to Prompt Lookup
- min_tokens
- prompt_logprobs
- detokenize
- skip_special_tokens
- spaces_between_special_tokens
- logits_processors
- truncate_prompt_tokens
2 changes: 1 addition & 1 deletion docs/model_server_rest_api_completions.md
@@ -62,6 +62,7 @@ curl http://localhost/v3/completions \
| include_stop_str_in_output | ✅ | ❌ | ✅ | bool (default: `false` if `stream=false`, `true` if `stream=true`) | Whether to include the matched stop string in the output. Setting it to `false` when `stream=true` is an invalid configuration and will result in an error. |
| logprobs | ⚠️ | ✅ | ✅ | integer (optional) | Include the log probabilities of the returned output tokens. **_In stream mode logprobs are not returned. Only the value 1 is accepted, which returns the log probability of the chosen token._** |
| echo | ✅ | ✅ | ✅ | boolean (optional) | Echo back the prompt in addition to the completion |
| skip_special_tokens | ✅ | ❌ | ✅ | bool (default: `true`) | Whether to remove special tokens (e.g. `<\|endoftext\|>`, `<\|im_end\|>`) from the generated output. Set to `false` to include them, which is useful when the model uses special tokens to encode structured information. This option works with most detokenizers exported with OpenVINO Tokenizers 2024.5 or later, unless they are based on custom ops. |
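
A similar sketch for the completions endpoint over plain HTTP, again with placeholder model name and port:

```python
import requests

payload = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",  # hypothetical served model name
    "prompt": "The capital of France is",
    "max_tokens": 16,
    "skip_special_tokens": False,  # keep special tokens such as <|endoftext|> in the returned text
}
# OVMS exposes the OpenAI-compatible completions endpoint under the /v3 prefix.
response = requests.post("http://localhost:8000/v3/completions", json=payload)
print(response.json()["choices"][0]["text"])
```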

#### Beam search sampling specific
| Param | OpenVINO Model Server | OpenAI /completions API | vLLM Serving Sampling Params | Type | Description |
@@ -112,7 +113,6 @@ Note that below parameters are valid only for prompt lookup pipeline. Add `"prom
- min_tokens
- prompt_logprobs
- detokenize
- skip_special_tokens
- spaces_between_special_tokens
- logits_processors
- truncate_prompt_tokens
1 change: 1 addition & 0 deletions docs/model_server_rest_api_responses.md
@@ -105,6 +105,7 @@ curl http://localhost/v3/responses \
| tool_choice | ✅ | ✅ | string or object (optional) | Controls which (if any) tool is called by the model. `none` means the model will not call any tool and instead generates a message. `auto` means the model can pick between generating a message or calling one or more tools. `required` means that the model should call at least one tool. Specifying a particular function via `{"type": "function", "function": {"name": "my_function"}}` forces the model to call that tool. |
| reasoning | ⚠️ | ✅ | object (optional) | Configuration for reasoning/thinking mode. The `effort` field accepts `"low"`, `"medium"`, or `"high"` — any value enables thinking mode (`enable_thinking: true` is injected into chat template kwargs). The `summary` field is accepted but ignored. |
| chat_template_kwargs | ✅ | ❌ | object (optional) | Additional keyword arguments passed to the chat template. When `reasoning` is also provided, `enable_thinking: true` is merged into these kwargs. |
| skip_special_tokens | ✅ | ❌ | bool (default: `true`) | Whether to remove special tokens (e.g. `<\|endoftext\|>`, `<\|im_end\|>`) from the generated output. Set to `false` to include them, which is useful when the model uses special tokens to encode structured information (e.g. bounding boxes, reasoning markers). When `false`, any tool or reasoning parser configured on the endpoint is silently disabled for the request, so the raw token stream is returned. This option works with most detokenizers exported with OpenVINO Tokenizers 2024.5 or later, unless they are based on custom ops. |
| stream_options | ❌ | ❌ | | Not supported in Responses API. Usage statistics are always included in the `response.completed` event. |

#### Beam search sampling specific
13 changes: 12 additions & 1 deletion src/llm/apis/openai_api_handler.cpp
@@ -530,7 +530,7 @@ ParsedOutput OpenAIApiHandler::parseOutputIfNeeded(const std::vector<int64_t>& g
OVMS_PROFILE_FUNCTION();
ParsedOutput parsedOutput;
if ((endpoint != Endpoint::CHAT_COMPLETIONS && endpoint != Endpoint::RESPONSES) || outputParser == nullptr) {
parsedOutput.content = this->tokenizer.decode(generatedIds);
parsedOutput.content = this->tokenizer.decode(generatedIds, ov::genai::skip_special_tokens(request.skipSpecialTokens));
} else {
parsedOutput = outputParser->parse(generatedIds, this->areToolsAvailable());
}
@@ -853,6 +853,17 @@ absl::Status OpenAIApiHandler::parseCommonPart(std::optional<uint32_t> maxTokens
if (maxNgramSizeItHasValue) {
request.maxNgramSize = maxNgramSizeIt->value.GetUint();
}

it = doc.FindMember("skip_special_tokens");
if (it != doc.MemberEnd() && !it->value.IsNull()) {
if (!it->value.IsBool())
return absl::InvalidArgumentError("skip_special_tokens is not a bool");
request.skipSpecialTokens = it->value.GetBool();
}
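// Keeping special tokens implies returning raw model output, so disable any tool/reasoning parser for this request.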
if (!request.skipSpecialTokens && outputParser != nullptr) {
outputParser.reset();
}

request.maxModelLength = maxModelLength;

// TODO: logit_bias
2 changes: 1 addition & 1 deletion src/llm/apis/openai_api_handler.hpp
@@ -164,7 +164,7 @@ class OpenAIApiHandler {
// Serialization - pure virtual, each handler produces its own response format
virtual std::string serializeUnaryResponse(const std::vector<ov::genai::GenerationOutput>& generationOutputs) = 0;
virtual std::string serializeUnaryResponse(ov::genai::EncodedResults& results) = 0;
virtual std::string serializeUnaryResponse(ov::genai::VLMDecodedResults& results) = 0;
virtual std::string serializeUnaryResponse(ov::genai::VLMDecodedResults& results, const std::string& textResponse) = 0;
virtual std::string serializeStreamingChunk(const std::string& chunkResponse, ov::genai::GenerationFinishReason finishReason) = 0;
virtual std::string serializeStreamingUsageChunk() = 0;
virtual std::string serializeStreamingHandshakeChunk() = 0;
17 changes: 8 additions & 9 deletions src/llm/apis/openai_completions.cpp
@@ -315,7 +315,7 @@ std::string OpenAIChatCompletionsHandler::serializeUnaryResponse(const std::vect
jsonResponse.StartArray("content");

for (int i = 0; i < generationOutput.generated_ids.size(); i++) {
std::string token = tokenizer.decode(std::vector<int64_t>({generationOutput.generated_ids[i]}));
std::string token = tokenizer.decode(std::vector<int64_t>({generationOutput.generated_ids[i]}), ov::genai::skip_special_tokens(this->request.skipSpecialTokens));
float logprob = generationOutput.generated_log_probs[i];
jsonResponse.LogprobObject(token, logprob);
}
@@ -324,7 +324,7 @@ std::string OpenAIChatCompletionsHandler::serializeUnaryResponse(const std::vect
if (endpoint == Endpoint::COMPLETIONS) {
jsonResponse.StartArray("tokens");
for (int i = 0; i < generationOutput.generated_ids.size(); i++) {
std::string token = tokenizer.decode(std::vector<int64_t>({generationOutput.generated_ids[i]}));
std::string token = tokenizer.decode(std::vector<int64_t>({generationOutput.generated_ids[i]}), ov::genai::skip_special_tokens(this->request.skipSpecialTokens));
jsonResponse.String(token);
}
jsonResponse.EndArray();
@@ -339,7 +339,7 @@ std::string OpenAIChatCompletionsHandler::serializeUnaryResponse(const std::vect
jsonResponse.StartArray("top_logprobs");
for (int i = 0; i < generationOutput.generated_ids.size(); i++) {
jsonResponse.StartObject();
std::string token = tokenizer.decode(std::vector<int64_t>({generationOutput.generated_ids[i]}));
std::string token = tokenizer.decode(std::vector<int64_t>({generationOutput.generated_ids[i]}), ov::genai::skip_special_tokens(this->request.skipSpecialTokens));
float logprob = generationOutput.generated_log_probs[i];
jsonResponse.Logprob(token, logprob);
jsonResponse.EndObject();
@@ -351,7 +351,7 @@ std::string OpenAIChatCompletionsHandler::serializeUnaryResponse(const std::vect
if (i == 0) {
jsonResponse.TextOffsetValue(0);
} else {
std::string textBeforeToken = tokenizer.decode(std::vector<int64_t>({generationOutput.generated_ids.begin(), generationOutput.generated_ids.begin() + i}));
std::string textBeforeToken = tokenizer.decode(std::vector<int64_t>({generationOutput.generated_ids.begin(), generationOutput.generated_ids.begin() + i}), ov::genai::skip_special_tokens(this->request.skipSpecialTokens));
jsonResponse.TextOffsetValue(textBeforeToken.size());
}
}
@@ -458,7 +458,7 @@ std::string OpenAIChatCompletionsHandler::serializeUnaryResponse(ov::genai::Enco
return jsonResponse.ToString();
}

std::string OpenAIChatCompletionsHandler::serializeUnaryResponse(ov::genai::VLMDecodedResults& results) {
std::string OpenAIChatCompletionsHandler::serializeUnaryResponse(ov::genai::VLMDecodedResults& results, const std::string& textResponse) {
OVMS_PROFILE_FUNCTION();
usage.promptTokens = results.perf_metrics.get_num_input_tokens();
usage.completionTokens = results.perf_metrics.get_num_generated_tokens();
@@ -470,13 +470,12 @@ std::string OpenAIChatCompletionsHandler::serializeUnaryResponse(ov::genai::VLMD
jsonResponse.StartArray("choices");
int index = 0;

for (int i = 0; i < results.texts.size(); i++) {
const std::string& text = results.texts[i];
SPDLOG_LOGGER_TRACE(llm_calculator_logger, "Generated text: {}", text);
if (!textResponse.empty()) {
SPDLOG_LOGGER_TRACE(llm_calculator_logger, "Generated text: {}", textResponse);

// Workaround to use OVMS unary parsers: get tokens from string
// This way we take the detokenized text from GenAI and re-encode it into tokens, which are converted back to text again in parseOutputIfNeeded...
auto generatedTokens = encodeTextToTokens(text);
auto generatedTokens = encodeTextToTokens(textResponse);

SPDLOG_LOGGER_TRACE(llm_calculator_logger, "Generated tokens: {}", generatedTokens);
ParsedOutput parsedOutput = parseOutputIfNeeded(generatedTokens);
2 changes: 1 addition & 1 deletion src/llm/apis/openai_completions.hpp
@@ -39,7 +39,7 @@ class OpenAIChatCompletionsHandler : public OpenAIApiHandler {

std::string serializeUnaryResponse(const std::vector<ov::genai::GenerationOutput>& generationOutputs) override;
std::string serializeUnaryResponse(ov::genai::EncodedResults& results) override;
std::string serializeUnaryResponse(ov::genai::VLMDecodedResults& results) override;
std::string serializeUnaryResponse(ov::genai::VLMDecodedResults& results, const std::string& textResponse) override;
std::string serializeStreamingChunk(const std::string& chunkResponse, ov::genai::GenerationFinishReason finishReason) override;
std::string serializeStreamingUsageChunk() override;
std::string serializeStreamingHandshakeChunk() override;
2 changes: 2 additions & 0 deletions src/llm/apis/openai_request.hpp
@@ -81,6 +81,8 @@ struct OpenAIRequest {
// Holds value for tool_choice field as described in https://platform.openai.com/docs/api-reference/chat/create#chat_create-tool_choice
std::string toolChoice;

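// Whether special tokens are removed during detokenization; maps to the skip_special_tokens request field (default: true)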
bool skipSpecialTokens{true};

OpenAIRequest() = default;
~OpenAIRequest() = default;
};
8 changes: 4 additions & 4 deletions src/llm/apis/openai_responses.cpp
@@ -655,21 +655,21 @@ std::string OpenAIResponsesHandler::serializeUnaryResponse(ov::genai::EncodedRes
return serializeUnaryResponseImpl(parsedOutputs);
}

std::string OpenAIResponsesHandler::serializeUnaryResponse(ov::genai::VLMDecodedResults& results) {
std::string OpenAIResponsesHandler::serializeUnaryResponse(ov::genai::VLMDecodedResults& results, const std::string& textResponse) {
OVMS_PROFILE_FUNCTION();
usage.promptTokens = results.perf_metrics.get_num_input_tokens();
usage.completionTokens = results.perf_metrics.get_num_generated_tokens();
// Usage is already correctly set from perf_metrics above — no need for updateUsage.
std::vector<ParsedOutput> parsedOutputs;
for (const std::string& text : results.texts) {
if (!textResponse.empty()) {
if (outputParser != nullptr) {
// Same workaround as in chat completions
auto generatedTokens = encodeTextToTokens(text);
auto generatedTokens = encodeTextToTokens(textResponse);
parsedOutputs.push_back(parseOutputIfNeeded(generatedTokens));
} else {
// Fast path: no output parser, use decoded text directly.
ParsedOutput output;
output.content = text;
output.content = textResponse;
parsedOutputs.push_back(std::move(output));
}
}
2 changes: 1 addition & 1 deletion src/llm/apis/openai_responses.hpp
@@ -97,7 +97,7 @@ class OpenAIResponsesHandler : public OpenAIApiHandler {

std::string serializeUnaryResponse(const std::vector<ov::genai::GenerationOutput>& generationOutputs) override;
std::string serializeUnaryResponse(ov::genai::EncodedResults& results) override;
std::string serializeUnaryResponse(ov::genai::VLMDecodedResults& results) override;
std::string serializeUnaryResponse(ov::genai::VLMDecodedResults& results, const std::string& textResponse) override;
std::string serializeStreamingChunk(const std::string& chunkResponse, ov::genai::GenerationFinishReason finishReason) override;
std::string serializeStreamingUsageChunk() override;
std::string serializeStreamingHandshakeChunk() override;
5 changes: 3 additions & 2 deletions src/llm/language_model/legacy/servable.cpp
@@ -114,8 +114,9 @@ absl::Status LegacyServable::parseRequest(std::shared_ptr<GenAiServableExecution
};
ov::AnyMap streamerConfig;
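// Keep special tokens in the streamed text when a parser requires them or when the request disabled skip_special_tokens.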
if (legacyExecutionContext->apiHandler->isStream() &&
legacyExecutionContext->apiHandler->getOutputParser() != nullptr &&
(legacyExecutionContext->apiHandler->getOutputParser()->requiresStreamingWithSpecialTokens())) {
((legacyExecutionContext->apiHandler->getOutputParser() != nullptr &&
legacyExecutionContext->apiHandler->getOutputParser()->requiresStreamingWithSpecialTokens()) ||
!legacyExecutionContext->apiHandler->getRequest().skipSpecialTokens)) {
streamerConfig.insert(ov::genai::skip_special_tokens(false));
}
legacyExecutionContext->textStreamer = std::make_shared<ov::genai::TextStreamer>(getProperties()->tokenizer, callback, streamerConfig);
5 changes: 3 additions & 2 deletions src/llm/servable.cpp
@@ -146,8 +146,9 @@ absl::Status GenAiServable::parseRequest(std::shared_ptr<GenAiServableExecutionC
return ov::genai::StreamingStatus::RUNNING;
};
ov::AnyMap streamerConfig;
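// Keep special tokens in the streamed text when a parser requires them or when the request disabled skip_special_tokens.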
if (executionContext->apiHandler->getOutputParser() != nullptr &&
(executionContext->apiHandler->getOutputParser()->requiresStreamingWithSpecialTokens())) {
if ((executionContext->apiHandler->getOutputParser() != nullptr &&
executionContext->apiHandler->getOutputParser()->requiresStreamingWithSpecialTokens()) ||
!executionContext->apiHandler->getRequest().skipSpecialTokens) {
streamerConfig.insert(ov::genai::skip_special_tokens(false));
}
executionContext->textStreamer = std::make_shared<ov::genai::TextStreamer>(getProperties()->tokenizer, callback, streamerConfig);