Wire multimodal prefillXxx() to C++ runner->prefill(), remove buffer #17727
kirklandsign wants to merge 7 commits into main from
Conversation
Fix prompt double-prefill in the image generate() overload: the method called prefillPrompt(prompt), which appends to the JNI buffer, then passed the same prompt to native generate(), which appends it again on the multimodal path, corrupting KV cache state.

Also fix all 5 null-input checks in JNI append methods that incorrectly returned Error::EndOfMethod (which means "method execution finished") instead of Error::InvalidArgument.

Replace the deep copy of prefill_inputs_ with std::move to avoid an unnecessary allocation for large image/audio inputs.
Previously, prefillPrompt() only appended to a JNI-side buffer
that was never consumed on the text-only code path. This was a
silent no-op: the buffered data sat there forever while generate()
passed its own prompt directly to the runner.
Now for text-only models (MODEL_TYPE_CATEGORY_LLM), prefillPrompt()
calls runner_->prefill() directly, which runs real model computation
and populates the KV cache. This enables chat history reload:
```java
module.prefillPrompt("system: ..."); // fills KV cache
module.prefillPrompt("user: hello"); // fills KV cache
module.generate("user: new q", cb);  // generates from pos_
```
Add prefill() to IRunner with a default NotSupported return so
vendor runners (QNN, MediaTek) are unaffected. Mark the existing
TextLLMRunner::prefill() as override.
Also clear prefill_inputs_ in resetContext() so stale multimodal
buffer data doesn't persist across context resets.
The method now invokes real C++ prefill for text-only models, so "append" no longer describes what it does.
Each prefillXxx() JNI method now calls multi_modal_runner_->prefill() directly instead of appending to the prefill_inputs_ buffer. This means prefillImages() actually runs the vision encoder, prefillAudio() runs the audio encoder, etc. — real model computation at call time.

The generate() multimodal path no longer drains the buffer; it only passes its prompt parameter to multi_modal_runner_->generate().

Remove the prefill_inputs_ member entirely — no buffer exists anywhere. Rename all append_xxx methods to prefill_xxx to reflect the new behavior.
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/17727
Note: Links to docs will display an error until the docs builds have been completed.
❌ 9 New Failures as of commit 7b477ad with merge base a29539d. The following jobs have failed:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Pull request overview
This PR refactors the multimodal input handling in the Android JNI layer to remove the buffering approach and enable immediate execution of encoders. Instead of collecting inputs in a prefill_inputs_ buffer and processing them during generate(), each prefillXxx() method now directly calls the C++ runner's prefill() method to execute the corresponding encoder (vision, audio, etc.) immediately. The PR also adds a virtual prefill() method to the IRunner interface and marks the TextLLMRunner implementation with the override keyword.
Changes:
- Removed `prefill_inputs_` buffer and changed all `append_xxx` methods to `prefill_xxx` methods that execute encoders immediately
- Added virtual `prefill()` method to `IRunner` interface with default implementation returning `Error::NotSupported`
- Updated Java and C++ method names from `appendXxx` to `prefillXxx` throughout the Android layer
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 12 comments.
| File | Description |
|---|---|
| extension/llm/runner/irunner.h | Added virtual prefill() method to base IRunner interface with documentation and default implementation |
| extension/llm/runner/text_llm_runner.h | Added override keyword to existing prefill() method declaration |
| extension/android/jni/jni_layer_llama.cpp | Removed prefill_inputs_ buffer, renamed all append_xxx to prefill_xxx, and changed implementation to call runner prefill() immediately; updated generate() to no longer drain buffer |
| extension/android/executorch_android/src/main/java/org/pytorch/executorch/extension/llm/LlmModule.java | Renamed native method declarations from appendXxx to prefillXxx to match C++ changes |
```diff
-  prefill_inputs_.emplace_back(llm::MultimodalInput{std::move(audio)});
+  std::vector<llm::MultimodalInput> inputs;
+  inputs.emplace_back(llm::MultimodalInput{std::move(audio)});
+  auto err = multi_modal_runner_->prefill(inputs);
```
Missing null check for multi_modal_runner_ before dereferencing. If multi_modal_runner_ is null, this will cause a null pointer dereference. Add a check similar to prefill_text_input(), which guards the multi_modal_runner_ usage.
```cpp
  auto err = multi_modal_runner_->prefill(inputs);
  return static_cast<jint>(err);
}
return 0;
```
Inconsistent error handling: this method returns 0 (success) when neither runner_ nor multi_modal_runner_ is available, instead of returning an error code. This silently succeeds when it should fail, making debugging difficult. Consider returning Error::InvalidState or Error::NotSupported to indicate that no runner is available.

Suggested change:
```diff
-  return 0;
+  return static_cast<jint>(Error::InvalidState);
```
```cpp
  return static_cast<jint>(err);
}
return 0;
```
Inconsistent error handling: this method returns 0 (success) when the array size is 0, instead of returning an error code. This silently succeeds when called with empty input, which may not be the intended behavior. Consider returning Error::InvalidArgument for empty arrays to match the null check above.
```cpp
  auto err = multi_modal_runner_->prefill(inputs);
  return static_cast<jint>(err);
}
return 0;
```
Inconsistent error handling: this method returns 0 (success) when the array size is 0, instead of returning an error code. This silently succeeds when called with empty input, which may not be the intended behavior. Consider returning Error::InvalidArgument for empty arrays to match the null check above.
```diff
-      llm::MultimodalInput{std::move(image_runner)});
+  std::vector<llm::MultimodalInput> inputs;
+  inputs.emplace_back(llm::MultimodalInput{std::move(image_runner)});
+  auto err = multi_modal_runner_->prefill(inputs);
```
Missing null check for multi_modal_runner_ before dereferencing. If multi_modal_runner_ is null, this will cause a null pointer dereference. Add a check similar to prefill_text_input(), which guards the multi_modal_runner_ usage.
Make MultimodalRunner extend IRunner so both text-only and multimodal runners share the same interface. This enables the JNI layer to use a single IRunner pointer instead of branching on model_type_category_.

Key changes:
- IRunner::prefill() now takes vector<MultimodalInput> as the primary virtual method; the string overload becomes a non-virtual convenience
- MultimodalRunner extends IRunner with generate(string) and prefill(vector, config) overrides
- TextLLMRunner::generate() supports empty prompt for prefill-then-generate workflows via prefill_next_token_ state
- MultimodalRunner::prefill() fixes BOS handling when first input is non-text (prepends BOS token via metadata lookup)
- JNI layer simplified from dual runner + model_type branching to single runner_ pointer with uniform dispatch
Test plan
CI