Wire multimodal prefillXxx() to C++ runner->prefill(), remove buffer #17727
kirklandsign wants to merge 7 commits into main from
Conversation
Fix prompt double-prefill in the image generate() overload: the method called prefillPrompt(prompt), which appends to the JNI buffer, then passed the same prompt to native generate(), which appends it again on the multimodal path, corrupting KV cache state.

Also fix all 5 null-input checks in JNI append methods that incorrectly returned Error::EndOfMethod (which means "method execution finished") instead of Error::InvalidArgument.

Replace the deep copy of prefill_inputs_ with std::move to avoid an unnecessary allocation for large image/audio inputs.
Previously, prefillPrompt() only appended to a JNI-side buffer
that was never consumed on the text-only code path. This was a
silent no-op: the buffered data sat there forever while generate()
passed its own prompt directly to the runner.
Now for text-only models (MODEL_TYPE_CATEGORY_LLM), prefillPrompt()
calls runner_->prefill() directly, which runs real model computation
and populates the KV cache. This enables chat history reload:
```java
module.prefillPrompt("system: ..."); // fills KV cache
module.prefillPrompt("user: hello"); // fills KV cache
module.generate("user: new q", cb);  // generates from pos_
```
Add prefill() to IRunner with a default NotSupported return so
vendor runners (QNN, MediaTek) are unaffected. Mark the existing
TextLLMRunner::prefill() as override.
Also clear prefill_inputs_ in resetContext() so stale multimodal
buffer data doesn't persist across context resets.
The method now invokes real C++ prefill for text-only models, so "append" no longer describes what it does.
Each prefillXxx() JNI method now calls multi_modal_runner_->prefill() directly instead of appending to the prefill_inputs_ buffer. This means prefillImages() actually runs the vision encoder, prefillAudio() runs the audio encoder, etc. — real model computation at call time.

The generate() multimodal path no longer drains the buffer; it only passes its prompt parameter to multi_modal_runner_->generate().

Remove the prefill_inputs_ member entirely — no buffer exists anywhere. Rename all append_xxx methods to prefill_xxx to reflect the new behavior.
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/17727
Note: Links to docs will display an error until the docs builds have been completed.
❌ 9 New Failures as of commit 7b477ad with merge base a29539d. The following jobs have failed:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Pull request overview
This PR refactors the multimodal input handling in the Android JNI layer to remove the buffering approach and enable immediate execution of encoders. Instead of collecting inputs in a prefill_inputs_ buffer and processing them during generate(), each prefillXxx() method now directly calls the C++ runner's prefill() method to execute the corresponding encoder (vision, audio, etc.) immediately. The PR also adds a virtual prefill() method to the IRunner interface and marks the TextLLMRunner implementation with the override keyword.
Changes:
- Removed `prefill_inputs_` buffer and changed all `append_xxx` methods to `prefill_xxx` methods that execute encoders immediately
- Added virtual `prefill()` method to `IRunner` interface with default implementation returning `Error::NotSupported`
- Updated Java and C++ method names from `appendXxx` to `prefillXxx` throughout the Android layer
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 12 comments.
| File | Description |
|---|---|
| extension/llm/runner/irunner.h | Added virtual prefill() method to base IRunner interface with documentation and default implementation |
| extension/llm/runner/text_llm_runner.h | Added override keyword to existing prefill() method declaration |
| extension/android/jni/jni_layer_llama.cpp | Removed prefill_inputs_ buffer, renamed all append_xxx to prefill_xxx, and changed implementation to call runner prefill() immediately; updated generate() to no longer drain buffer |
| extension/android/executorch_android/src/main/java/org/pytorch/executorch/extension/llm/LlmModule.java | Renamed native method declarations from appendXxx to prefillXxx to match C++ changes |
```diff
-  prefill_inputs_.emplace_back(llm::MultimodalInput{std::move(audio)});
+  std::vector<llm::MultimodalInput> inputs;
+  inputs.emplace_back(llm::MultimodalInput{std::move(audio)});
+  auto err = multi_modal_runner_->prefill(inputs);
```
Missing null check for multi_modal_runner_ before dereferencing. If multi_modal_runner_ is null, this will cause a null pointer dereference. Add a check similar to prefill_text_input(), which guards the multi_modal_runner_ usage.
```cpp
  auto err = multi_modal_runner_->prefill(inputs);
  return static_cast<jint>(err);
}
return 0;
```
Inconsistent error handling: this method returns 0 (success) when neither runner_ nor multi_modal_runner_ is available, instead of returning an error code. This silently succeeds when it should fail, making debugging difficult. Consider returning Error::InvalidState or Error::NotSupported to indicate that no runner is available.

Suggested change:
```diff
-  return 0;
+  return static_cast<jint>(Error::InvalidState);
```
```cpp
  return static_cast<jint>(err);
}
return 0;
```
Inconsistent error handling: this method returns 0 (success) when the array size is 0, instead of returning an error code. This silently succeeds when called with empty input, which may not be the intended behavior. Consider returning Error::InvalidArgument for empty arrays to match the null check above.
```cpp
  auto err = multi_modal_runner_->prefill(inputs);
  return static_cast<jint>(err);
}
return 0;
```
Inconsistent error handling: this method returns 0 (success) when the array size is 0, instead of returning an error code. This silently succeeds when called with empty input, which may not be the intended behavior. Consider returning Error::InvalidArgument for empty arrays to match the null check above.
```diff
-      llm::MultimodalInput{std::move(image_runner)});
+  std::vector<llm::MultimodalInput> inputs;
+  inputs.emplace_back(llm::MultimodalInput{std::move(image_runner)});
+  auto err = multi_modal_runner_->prefill(inputs);
```
Missing null check for multi_modal_runner_ before dereferencing. If multi_modal_runner_ is null, this will cause a null pointer dereference. Add a check similar to prefill_text_input(), which guards the multi_modal_runner_ usage.
Make MultimodalRunner extend IRunner so both text-only and multimodal runners share the same interface. This enables the JNI layer to use a single IRunner pointer instead of branching on model_type_category_.

Key changes:
- IRunner::prefill() now takes vector<MultimodalInput> as the primary virtual method; the string overload becomes a non-virtual convenience
- MultimodalRunner extends IRunner with generate(string) and prefill(vector, config) overrides
- TextLLMRunner::generate() supports empty prompt for prefill-then-generate workflows via prefill_next_token_ state
- MultimodalRunner::prefill() fixes BOS handling when first input is non-text (prepends BOS token via metadata lookup)
- JNI layer simplified from dual runner + model_type branching to single runner_ pointer with uniform dispatch
Test plan
CI