Wire multimodal prefillXxx() to C++ runner->prefill(), remove buffer #17727

Closed

kirklandsign wants to merge 7 commits into main from
android/wire-multimodal-prefill-to-runner

Conversation

@kirklandsign
Contributor

Summary

Each prefillXxx() JNI method now calls multi_modal_runner_->prefill()
directly instead of appending to prefill_inputs_ buffer. This means
prefillImages() actually runs the vision encoder, prefillAudio() runs
the audio encoder, etc. — real model computation at call time.

The generate() multimodal path no longer drains the buffer; it only
passes its prompt parameter to multi_modal_runner_->generate().

Remove prefill_inputs_ member entirely — no buffer exists anywhere.
Rename all append_xxx methods to prefill_xxx to reflect the new
behavior.

Test plan

CI

Fix prompt double-prefill in the image generate() overload:
the method called prefillPrompt(prompt), which appends to the
JNI buffer, then passed the same prompt to native generate(),
which appends it again on the multimodal path, corrupting KV
cache state.

Also fix all 5 null-input checks in JNI append methods that
incorrectly returned Error::EndOfMethod (which signals "method
execution finished") instead of Error::InvalidArgument.

Replace deep copy of prefill_inputs_ with std::move to avoid
unnecessary allocation for large image/audio inputs.

Previously, prefillPrompt() only appended to a JNI-side buffer
that was never consumed on the text-only code path. This was a
silent no-op: the buffered data sat there forever, while generate()
passed its own prompt directly to the runner.

Now for text-only models (MODEL_TYPE_CATEGORY_LLM), prefillPrompt()
calls runner_->prefill() directly, which runs real model computation
and populates the KV cache. This enables chat history reload:

  module.prefillPrompt("system: ...");  // fills KV cache
  module.prefillPrompt("user: hello");  // fills KV cache
  module.generate("user: new q", cb);   // generates from pos_

Add prefill() to IRunner with a default NotSupported return so
vendor runners (QNN, MediaTek) are unaffected. Mark the existing
TextLLMRunner::prefill() as override.

Also clear prefill_inputs_ in resetContext() so stale multimodal
buffer data doesn't persist across context resets.

The method now invokes real C++ prefill for text-only models,
so "append" no longer describes what it does.
@pytorch-bot

pytorch-bot Bot commented Feb 26, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/17727

Note: Links to docs will display an error until the docs builds have been completed.

❌ 9 New Failures

As of commit 7b477ad with merge base a29539d (image):

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla Bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Feb 26, 2026
@github-actions

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

@kirklandsign kirklandsign marked this pull request as ready for review February 26, 2026 00:42
Copilot AI review requested due to automatic review settings February 26, 2026 00:42
Contributor

Copilot AI left a comment


Pull request overview

This PR refactors the multimodal input handling in the Android JNI layer to remove the buffering approach and enable immediate execution of encoders. Instead of collecting inputs in a prefill_inputs_ buffer and processing them during generate(), each prefillXxx() method now directly calls the C++ runner's prefill() method to execute the corresponding encoder (vision, audio, etc.) immediately. The PR also adds a virtual prefill() method to the IRunner interface and marks the TextLLMRunner implementation with the override keyword.

Changes:

  • Removed prefill_inputs_ buffer and changed all append_xxx methods to prefill_xxx methods that execute encoders immediately
  • Added virtual prefill() method to IRunner interface with default implementation returning Error::NotSupported
  • Updated Java and C++ method names from appendXxx to prefillXxx throughout the Android layer

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 12 comments.

File Description

  • extension/llm/runner/irunner.h: Added virtual prefill() method to base IRunner interface with documentation and default implementation
  • extension/llm/runner/text_llm_runner.h: Added override keyword to existing prefill() method declaration
  • extension/android/jni/jni_layer_llama.cpp: Removed prefill_inputs_ buffer, renamed all append_xxx to prefill_xxx, and changed implementation to call runner prefill() immediately; updated generate() to no longer drain buffer
  • extension/android/executorch_android/src/main/java/org/pytorch/executorch/extension/llm/LlmModule.java: Renamed native method declarations from appendXxx to prefillXxx to match C++ changes


prefill_inputs_.emplace_back(llm::MultimodalInput{std::move(audio)});
std::vector<llm::MultimodalInput> inputs;
inputs.emplace_back(llm::MultimodalInput{std::move(audio)});
auto err = multi_modal_runner_->prefill(inputs);

Copilot AI Feb 26, 2026

Missing null check for multi_modal_runner_ before dereferencing. If multi_modal_runner_ is null, this will cause a null pointer dereference. Add a check similar to prefill_text_input() which guards the multi_modal_runner_ usage.
prefill_inputs_.emplace_back(llm::MultimodalInput{std::move(audio)});
std::vector<llm::MultimodalInput> inputs;
inputs.emplace_back(llm::MultimodalInput{std::move(audio)});
auto err = multi_modal_runner_->prefill(inputs);

Copilot AI Feb 26, 2026

Missing null check for multi_modal_runner_ before dereferencing. If multi_modal_runner_ is null, this will cause a null pointer dereference. Add a check similar to prefill_text_input() which guards the multi_modal_runner_ usage.
auto err = multi_modal_runner_->prefill(inputs);
return static_cast<jint>(err);
}
return 0;

Copilot AI Feb 26, 2026

Inconsistent error handling: this method returns 0 (success) when neither runner_ nor multi_modal_runner_ is available, instead of returning an error code. This silently succeeds when it should fail, making debugging difficult. Consider returning Error::InvalidState or Error::NotSupported to indicate that no runner is available.

Suggested change
return 0;
return static_cast<jint>(Error::InvalidState);
return static_cast<jint>(err);
}

return 0;

Copilot AI Feb 26, 2026

Inconsistent error handling: this method returns 0 (success) when the array size is 0, instead of returning an error code. This silently succeeds when called with empty input, which may not be the intended behavior. Consider returning Error::InvalidArgument for empty arrays to match the null check above.
return static_cast<jint>(err);
}

return 0;

Copilot AI Feb 26, 2026

Inconsistent error handling: this method returns 0 (success) when the array size is 0, instead of returning an error code. This silently succeeds when called with empty input, which may not be the intended behavior. Consider returning Error::InvalidArgument for empty arrays to match the null check above.
auto err = multi_modal_runner_->prefill(inputs);
return static_cast<jint>(err);
}
return 0;

Copilot AI Feb 26, 2026

Inconsistent error handling: this method returns 0 (success) when the array size is 0, instead of returning an error code. This silently succeeds when called with empty input, which may not be the intended behavior. Consider returning Error::InvalidArgument for empty arrays to match the null check above.
auto err = multi_modal_runner_->prefill(inputs);
return static_cast<jint>(err);
}
return 0;

Copilot AI Feb 26, 2026

Inconsistent error handling: this method returns 0 (success) when the array size is 0, instead of returning an error code. This silently succeeds when called with empty input, which may not be the intended behavior. Consider returning Error::InvalidArgument for empty arrays to match the null check above.
auto err = multi_modal_runner_->prefill(inputs);
return static_cast<jint>(err);
}
return 0;

Copilot AI Feb 26, 2026

Inconsistent error handling: this method returns 0 (success) when the array size is 0, instead of returning an error code. This silently succeeds when called with empty input, which may not be the intended behavior. Consider returning Error::InvalidArgument for empty arrays to match the null check above.
llm::MultimodalInput{std::move(image_runner)});
std::vector<llm::MultimodalInput> inputs;
inputs.emplace_back(llm::MultimodalInput{std::move(image_runner)});
auto err = multi_modal_runner_->prefill(inputs);

Copilot AI Feb 26, 2026

Missing null check for multi_modal_runner_ before dereferencing. If multi_modal_runner_ is null, this will cause a null pointer dereference. Add a check similar to prefill_text_input() which guards the multi_modal_runner_ usage.
llm::MultimodalInput{std::move(image_runner)});
std::vector<llm::MultimodalInput> inputs;
inputs.emplace_back(llm::MultimodalInput{std::move(image_runner)});
auto err = multi_modal_runner_->prefill(inputs);

Copilot AI Feb 26, 2026

Missing null check for multi_modal_runner_ before dereferencing. If multi_modal_runner_ is null, this will cause a null pointer dereference. Add a check similar to prefill_text_input() which guards the multi_modal_runner_ usage.
Make MultimodalRunner extend IRunner so both text-only and multimodal
runners share the same interface. This enables the JNI layer to use a
single IRunner pointer instead of branching on model_type_category_.

Key changes:
- IRunner::prefill() now takes vector<MultimodalInput> as the primary
  virtual method; the string overload becomes a non-virtual convenience
- MultimodalRunner extends IRunner with generate(string) and
  prefill(vector, config) overrides
- TextLLMRunner::generate() supports empty prompt for prefill-then-
  generate workflows via prefill_next_token_ state
- MultimodalRunner::prefill() fixes BOS handling when first input is
  non-text (prepends BOS token via metadata lookup)
- JNI layer simplified from dual runner + model_type branching to
  single runner_ pointer with uniform dispatch
@kirklandsign kirklandsign deleted the android/wire-multimodal-prefill-to-runner branch February 27, 2026 07:12

Labels

CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.


2 participants