
[llm][2/4] Echo-gated special-token filtering and EOS metadata merge #19534

Open

seyeong-han wants to merge 2 commits into pytorch:main from seyeong-han:chat-runner-token-filter

Conversation

@seyeong-han
Contributor

Summary

Part 2 of the chat-template support stack split out of #16987 per @kirklandsign's request.

This PR adds two runner-behavior changes to TextLLMRunner that affect all users:

  1. Echo-gated special-token filtering (so chat-template tokens don't leak into clean output)
  2. EOS metadata merge (instead of clearing the tokenizer's primary EOS token)

Stack overview

PR             Subject
1/4 (#19533)   Library + tests
2/4 (this PR)  TextLLMRunner echo gating + EOS merge
3/4            Python bindings + Python LlamaRunner integration
4/4            llama_main CLI flags + chat_formatter wrapper + universal Jinja docs

What this PR adds

Echo-gated special-token filtering (text_llm_runner.cpp)

Adds is_special_token() with a small kKnownSpecialTokens set covering Llama 3.x, Gemma, and generic <s> / </s> / <pad> / <unk> tokens, plus a regex-style match for Llama-format <|...|> tokens.

wrapped_callback now suppresses these from the printed stream only when GenerationConfig.echo == false. When echo == true, raw model output (including chat-template tokens) is emitted unchanged — this preserves backward compatibility for users who explicitly want to see raw tokens.

if (config.echo || !is_special_token(piece)) {
  llm::safe_printf(piece.c_str());
  fflush(stdout);
}
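
For illustration, the helper could look roughly like the sketch below. The exact token set and matching logic in text_llm_runner.cpp may differ; the token strings and the prefix/suffix check here are illustrative, not the PR's verbatim code.

// Illustrative sketch only; the actual kKnownSpecialTokens contents and
// matching logic in this PR may differ.
#include <string>
#include <unordered_set>

static bool is_special_token(const std::string& piece) {
  // Known special tokens for Gemma and generic SentencePiece-style models.
  static const std::unordered_set<std::string> kKnownSpecialTokens = {
      "<s>", "</s>", "<pad>", "<unk>",
      "<start_of_turn>", "<end_of_turn>"};
  if (kKnownSpecialTokens.count(piece) > 0) {
    return true;
  }
  // Llama-format special tokens have the shape <|...|>, e.g. <|eot_id|>.
  return piece.size() >= 4 && piece.compare(0, 2, "<|") == 0 &&
      piece.compare(piece.size() - 2, 2, "|>") == 0;
}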

EOS metadata merge (llm_runner_helper.cpp)

get_eos_ids() now merges the tokenizer's primary eos_tok() with any additional EOS IDs the model metadata exports under kEosIds, instead of clear()-ing the set when metadata is present.

This is the correct behavior for HF-tokenizer models (e.g. Llama 3.x) where eos_tok() = <|end_of_text|> but the model also wants <|eot_id|> as a stop token. Also logs the primary EOS token and only logs metadata IDs that are newly inserted.
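
A rough sketch of the merge semantics is shown below. The real get_eos_ids() reads kEosIds from the model metadata and the primary EOS from the tokenizer, so its signature differs; the helper name and parameters here are simplified for illustration.

// Sketch of merge-instead-of-clear; not the actual get_eos_ids() signature.
#include <cstdint>
#include <unordered_set>
#include <vector>

std::unordered_set<uint64_t> merge_eos_ids(
    uint64_t primary_eos, // tokenizer->eos_tok(), e.g. <|end_of_text|>
    const std::vector<uint64_t>& metadata_eos_ids) { // kEosIds, e.g. <|eot_id|>
  // Start from the tokenizer's primary EOS instead of clearing the set.
  std::unordered_set<uint64_t> eos_ids = {primary_eos};
  for (uint64_t id : metadata_eos_ids) {
    // insert() is a no-op for duplicates, so only genuinely new IDs are
    // added (and, per the PR, only newly inserted IDs are logged).
    eos_ids.insert(id);
  }
  return eos_ids;
}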

Why this is split out

These are runner-behavior changes that affect ALL TextLLMRunner users, not just the new chat-template path. They deserve focused review for:

  • Backward-compat impact (echo gating)
  • EOS-set semantics (merge vs. clear)

Test Plan

  • Existing TextLLMRunner tests still pass
  • Verify special tokens filtered when --echo=false (clean output)
  • Verify special tokens emitted when --echo=true (raw output)
  • Verify EOS set contains both tokenizer primary EOS and model-metadata EOS IDs

Depends on

  • PR-A: #19533 (only for stack ordering; this PR has no #include or symbol dependency on the JinjaChatFormatter library)

Original PR

Splitting #16987 into 4 reviewable PRs.

cc @kirklandsign @larryliu0820 @metascroy

@pytorch-bot

pytorch-bot Bot commented May 13, 2026

See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/19534. Links to docs will display an error until the docs builds have completed. There is 1 currently active SEV; if your PR is affected, please review it.

As of commit 0b6a51d with merge base 2ea50ac: 103 new failures, 1 cancelled job (please retry), 1 unrelated failure (likely flakiness present on trunk), and 6 unclassified failures. DrCI could not classify the latter because the workflow did not run on the merge base; those failures may be pre-existing on trunk or introduced by this PR.
This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla Bot added the CLA Signed label (managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed) on May 13, 2026
@github-actions

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

Foundation PR for the chat-template support stack. Adds the Jinja2Cpp-based
JinjaChatFormatter, its supporting chat types, embedded Llama3/Llama3.2/Gemma3
templates, build glue (CMake/Buck), and a focused C++ unit-test suite.
This PR is reviewable in isolation — it has no behavior change for any
existing runner; downstream PRs (B/C/D) plug it in.

This is part 1 of a 4-PR stack split out of pytorch#16987 per reviewer request:

  1/4 (this PR)  Library + tests
  2/4            TextLLMRunner echo-gated special-token filter + EOS merge
  3/4            Python bindings + Python LlamaRunner integration
  4/4            llama_main CLI flags + chat_formatter wrapper + docs

What this PR adds
-----------------
* extension/llm/chat_template/{chat_templates.h, BUCK, CMakeLists.txt,
  targets.bzl} — embedded Llama3/Llama3.2/Gemma3 templates and the
  ChatTemplateType enum + ModelTokens. The CMake file pulls in
  Jinja2Cpp 1.3.2 via FetchContent, with SUPPORT_REGEX_LOOKAHEAD set BEFORE
  FetchContent_MakeAvailable so it propagates correctly, plus header
  staging for nonstd headers that some Jinja2Cpp installations omit.
  Installs chat_templates.h so SDK consumers can include it.
* extension/llm/runner/{chat_types.h, jinja_chat_formatter.{h,cpp}} — the
  Universal Jinja chat formatter that supports any HuggingFace / vLLM
  chat template, not just the embedded ones. Loadable via fromTemplate
  (built-in), fromString (any string), or fromFile (any .jinja file).
  formatConversation injects vLLM/HuggingFace-standard params (tools=[],
  tool_choice=None, date_string, chat_template_kwargs) so any template
  that references those variables renders correctly. A usage sketch
  follows this list.
* normalizeTemplate handles vLLM/HF template quirks for Jinja2Cpp:
  notably, 'not tools is none' maps to 'tools' (truthy check), preserving
  the intent of 'tools is not none' for empty-list defaults.
* extension/llm/runner/{CMakeLists.txt, targets.bzl} — link
  extension_llm_runner against jinja2cpp (PRIVATE) and define
  EXECUTORCH_USE_JINJA2CPP.
* extension/llm/runner/test/{test_jinja_chat_formatter.cpp, CMakeLists.txt,
  targets.bzl, BUCK} — unit tests covering Llama3 / Llama3.2 / Gemma3
  embedded templates, parseChatTemplateType (case-insensitive), and
  three universal-Jinja regression tests:
    - generic HuggingFace-style template (proves it's not Llama-specific)
    - tools-aware template (validates the tools=[] default)
    - 'not tools is none' normalization regression test
* CMakeLists.txt — adds add_subdirectory(extension/llm/chat_template)
  guarded by EXECUTORCH_BUILD_EXTENSION_LLM_RUNNER.
* shim_et/xplat/executorch/build/build_variables.bzl — adds
  jinja_chat_formatter.cpp to the runner sources.
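
For illustration, a rough usage sketch of the formatter. The method names
fromTemplate / fromString / fromFile / formatConversation come from the list
above, but the namespace, the Message and ChatTemplateType spellings, and the
exact signatures shown here are placeholders rather than the final API.

  // Hypothetical usage only: types, enum values, and return types are
  // placeholders, not the library's verified signatures.
  #include <executorch/extension/llm/runner/jinja_chat_formatter.h>

  std::string render_prompt() {
    // fromTemplate loads an embedded template; fromString / fromFile accept
    // any HuggingFace / vLLM chat template.
    auto formatter = JinjaChatFormatter::fromTemplate(ChatTemplateType::Llama3);

    // formatConversation injects tools=[], tool_choice=None, date_string and
    // chat_template_kwargs, so tool-aware templates render without errors.
    std::vector<Message> conversation = {
        {"system", "You are a helpful assistant."},
        {"user", "Summarize this PR."}};
    return formatter.formatConversation(conversation);
  }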

Notes
-----
* No behavior change for existing TextLLMRunner / MultimodalRunner users:
  the formatter is opt-in, only invoked when downstream code calls it.
* Sample vLLM templates are NOT checked in (per reviewer feedback);
  documentation in the follow-up CLI PR points users to vLLM's examples
  directory and HuggingFace tokenizer_config.json files.

Original PR (full stack): pytorch#16987
Part 2 of the chat-template support stack split out of pytorch#16987.

What this PR adds
-----------------
* extension/llm/runner/text_llm_runner.cpp: Add 'is_special_token()'
  with a small kKnownSpecialTokens set covering Llama 3.x, Gemma, and
  generic <s>/</s>/<pad>/<unk> tokens, plus a regex-style match for
  Llama-format <|...|> tokens. wrapped_callback now suppresses these
  from the printed stream when GenerationConfig.echo == false. When
  echo == true, raw model output (including chat-template tokens) is
  emitted unchanged - this preserves backward compatibility for users
  who explicitly want to see raw tokens.

* extension/llm/runner/llm_runner_helper.cpp: get_eos_ids() now MERGES
  the tokenizer's primary eos_tok() with any additional EOS IDs the
  model metadata exports under kEosIds, instead of clearing the set
  when metadata is present. This is correct for HF-tokenizer models
  (e.g. Llama 3.x) where eos_tok() = <|end_of_text|> but the model
  also wants <|eot_id|> as a stop token. Also logs the primary EOS
  token and only logs metadata IDs that are newly inserted.

Why this is split out
---------------------
These are runner-behavior changes that affect ALL TextLLMRunner users,
not just the new chat-template path. They deserve focused review for
backward-compat impact (echo gating) and EOS-set semantics (merge vs
clear).

Depends on: PR-A (extension/llm/chat_template/* + JinjaChatFormatter
            library) — only for stack ordering; this PR has no
            include or symbol dependency on that library.

Original PR (full stack): pytorch#16987
@seyeong-han force-pushed the chat-runner-token-filter branch from 0a20e9a to 0b6a51d on May 13, 2026 05:04
