Add baremetal RISC-V smoke tests (rv32, rv64) #4
Draft
luhenry wants to merge 103 commits into
Differential Revision: D105973185 Pull Request resolved: pytorch#19736
Add model tests for currently unsupported models:
- yolo11
- wav2letter
- silero_vad

cc @digantdesai @freddan80 @per @zingo @oscarandersson8218 @mansnils @Sebastian-Larsson @robell @rascani

Signed-off-by: Adrian Lundell <adrian.lundell@arm.com>
Differential Revision: D102880053 Pull Request resolved: pytorch#19211
Differential Revision: D106123930 Pull Request resolved: pytorch#19742
pytorch#19746) pytorch#18476 clone version due to bot crash
…ackend (pytorch#19747) clone pytorch#18477 due to bot crash
clone pytorch#18728 due to bot crash
This was referenced May 23, 2026
Differential Revision: D106162684 Pull Request resolved: pytorch#19749
### Summary Add tests verifying correct support for add.tensor by the Neutron backend using the new Neutron MLIR flow. ### Test plan Unit tests provided. cc @robert-kalmar
…#19752) Differential Revision: D106254596 Pull Request resolved: pytorch#19752
Treat BUCK and TARGETS files as build metadata in the Arm pre-push license check so they do not need copyright headers. Signed-off-by: Per Held <per.held@arm.com> Change-Id: I4b3bbd1e03ba4b9c38fd06225156344985f0cc70
### Summary Add tests verifying correct support for sub.tensor by the Neutron backend using the new Neutron MLIR flow. ### Test plan Unit tests provided. cc @robert-kalmar @JakeStevens @digantdesai @rascani
…opy (pytorch#19751)

Follow-up to pytorch#17097, which added BF16 support to the TOSA GATHER op. `aten.index_select` and `aten.unfold_copy` both lower via TOSA GATHER, but their support checks were not updated at the time.

In both decompositions (`DecomposeIndexSelectToGatherPass()` and `DecomposeUnfoldToGatherPass()`), the bf16 values tensor flows through dtype-agnostic reshape ops and `tosa.GATHER`, which accepts `BF16`. The support check was the only blocker.

| Op                  | bf16 before | bf16 after |
|---------------------|:-----------:|:----------:|
| `aten.gather`       | ✅ | ✅ |
| `aten.index.Tensor` | ✅ | ✅ |
| `aten.slice_copy`   | ✅ | ✅ |
| `aten.index_select` | ❌ | ✅ |
| `aten.unfold_copy`  | ❌ | ✅ |

Changes:
- `index_select_support.py`, `unfold_copy_support.py`: extend float branch to include `bfloat16`; add bf16 extension guard; update rejection message.
- `test_index_select.py`, `test_unfold_copy.py`: add isolated `_tosa_FP_bf16` test functions using `TosaPipelineFP(..., tosa_extensions=["bf16"])`.

### Test plan

`test_index_select_tosa_FP_bf16` and `test_unfold_copy_tosa_FP_bf16` exercise the bf16 path end-to-end through `TosaPipelineFP` with the bf16 extension enabled, following the same pattern as the existing `test_slice_tensor_tosa_FP_bf16` from pytorch#17492
This is done for conv, depthwise conv, transpose conv, and bmm.
Add scratch tensors to the operator signatures, which are then
assigned exir.memory.alloc. These allocs are automatically memory
planned by ExecuTorch.
Introduce `required_cmsis_buffer_size`, which computes the buffer
size from node properties + the Cortex-M configuration.
The function uses functions registered by target in
backends/cortex_m/passes/scratch_buffer_sizes.py.
This is used to set the size of the allocs in ConvertToCortexMPass.
Finally, modify the kernels to use the new scratch tensor instead
of allocating temporary memory. Add a new macro
CORTEX_M_ENABLE_RUNTIME_CHECKS
to do a safety check that the AOT-computed buffer size is equal to the
buffer size computed at runtime. Use this when testing.
cc @psiddh @AdrianLundell @digantdesai @rascani @freddan80 @per @zingo
@oscarandersson8218 @mansnils @Sebastian-Larsson @robell
---------
Signed-off-by: Erik Lundell <erik.lundell@arm.com>
Co-authored-by: Måns Nilsson <mans.nilsson@arm.com>
…es (pytorch#19146)

### Summary

To enable GPU backend support in the Llama runner, refactoring is required because the dtypes of kv_cache, attention_mask, and logits are currently hardcoded, preventing floating‑point models from running. This PR focuses on removing the hardcoded dtypes for them.

#### Key changes

- Remove template parameter `<typename T>` from KVManager, LhdTokenGenerator, MultimodalPromptProcessor, and related runner classes
- Detect kv_cache and attention_mask dtypes dynamically from MethodMeta at construction time instead of compile-time bitwidth detection
- Switch to std::byte* pointer arithmetic with getDtypeSize() for all buffer offsets; add fill_mask() helper for multi-dtype attention mask filling
- Update spec_prop pass for custom llama op for sharding case greater than 1

### Test plan

```
python backends/qualcomm/tests/test_qnn_delegate.py -k TestExampleLLMScript.test_llama_stories_110m --model SM8650 --build_folder /local/mnt/workspace/chenweng/executorch/executorch/build-android --device acfa9311 --executorch_root . --artifact_dir ./stories_110m_pte_size --llama_artifacts . --use_fp16
```

<img width="1977" height="468" alt="image" src="https://github.com/user-attachments/assets/8bf3bffa-9b9f-4655-9cbc-b20127c2468a" />

cc @cccclai @cbilgin @abhinaykukkadapu
Summary: Pull Request resolved: pytorch#19764 Reviewed By: kirklandsign Differential Revision: D106332819
As documented at https://vkdoc.net/man/VkDataGraphPipelineSessionBindPointRequirementARM, the `sType` of VkDataGraphPipelineSessionBindPointRequirementARM should always be set to VK_STRUCTURE_TYPE_DATA_GRAPH_PIPELINE_SESSION_BIND_POINT_REQUIREMENT_ARM.

cc @digantdesai @freddan80 @per @zingo @oscarandersson8218 @mansnils @Sebastian-Larsson @robell @rascani

Signed-off-by: Erik Lundell <erik.lundell@arm.com>
Enable CPPCHECK for Cortex-M sources and headers. The Cortex-M kernels are registered through generated wrappers, so cppcheck cannot see direct call sites for the exported *_out entry points and reports them as unused. Keep narrow unusedFunction suppressions for those registration-visible functions. The scratch buffer context header is linted as a standalone header but currently exposes helper API without in-tree call sites, so suppress unusedFunction at file scope there instead of dropping Cortex-M header coverage. Keep the quantize and dequantize context parameters non-const to match the generated kernel ABI; changing them to const changes the mangled symbols used by registration. Signed-off-by: Per Held <per.held@arm.com> Change-Id: I3bcb6e5d3f125ae400005d1b033b24a07eb7924f
### Summary It relates to pytorch#18833. It doesn't add Yolo on baremetal, but it at least makes sure that it works using Portable Kernels and XNNPACK backends. ### Test plan It's only adding a model to CI, so the CI is the test plan.
Convert BenchmarkActivity, BenchmarkMetric, LlmBenchmark, LlmModelRunner, and ModelRunner from Java to Kotlin. Differential Revision: D106195816
…rch#19731)

### Summary

Extend the Cortex-M cross-CPU build pipeline to Armv6-M by patching two upstream issues that block the Corstone-300 target source and the CMSIS Cortex DFP from building for `cortex-m0plus`:

* `core_platform/0003-*.patch` guards the `HardFault_Handler` in `targets/corstone-300/target.cpp`. The handler uses an `ite eq` IT-block in inline asm and dereferences the SCB CFSR/BFAR/MMFAR fault-status registers; both are Armv7-M / Armv8-M Mainline only. The patch wraps the rich handler in `__ARM_ARCH_7M__ / 7EM / 8M_MAIN / 8_1M_MAIN` and falls back to a minimal stub on Armv6-M / Armv8-M Baseline (M0/M0+/M23).
* `core_software/0002-*.patch` fixes `cmsis.cmake`'s handling of the M0+ device. The Cortex DFP names the device directory and headers `ARMCM0plus` (lowercase suffix), while the device sources (`startup_ARMCM0plus.c`, `system_ARMCM0plus.c`) gate their implementations on the `ARMCM0P` preprocessor macro (three different spellings). The previous `string(TOUPPER ...)` produced `ARMCM0PLUS`: the include path lookup failed and the source files hit their `#error device not specified!` guard. Override `ARM_CPU` to `ARMCM0plus` for the directory + filename and introduce a separate `CMSIS_DEVICE_CPU_DEFINE` set to `ARMCM0P` for the cmsis_startup and cmsis_system compile-definitions; all other cores still drive both paths from the uppercased default.

Both patches are layered via the existing `patch_repo` mechanism; the `corstone_utils.cmake` TODO is updated so the deletion plan for 0002 and 0003 is documented together.

### Test Plan

Locally validated end-to-end on the Corstone-300 FVP with the `qadd` model: `cortex-m0plus` build links a runner that includes `startup_ARMCM0plus.c` / `system_ARMCM0plus.c` and the patched `target.cpp`, and the FVP run prints `TEST: BundleIO index[0] Test_result: PASS` with all error stats zero. The bundled `libcmsis-nn.a` reports `Tag_CPU_arch: v6S-M` and `Tag_THUMB_ISA_use: Thumb-1` with zero DSP / MVE / saturating instructions, confirming the scalar code path was exercised.

Authored with Claude.

cc @digantdesai @freddan80 @per @zingo @oscarandersson8218 @mansnils @Sebastian-Larsson @robell
Differential Revision: D106026285 Pull Request resolved: pytorch#19734
Differential Revision: D106394605 Pull Request resolved: pytorch#19775
pytorch#19772) … Registration ### Summary Docs improvement. ### Test plan Docs only. cc @robert-kalmar @JakeStevens @digantdesai @rascani
Re-upload with BUCK changes. Share TOSA RESIZE parameter validation between upsample support checks and fake RESIZE lowering so invalid nearest and bilinear resize parameters are rejected before delegation. Change-Id: I57c267aca96d733879ae90329267e44adce399c6 cc @digantdesai @freddan80 @per @zingo @oscarandersson8218 @mansnils @Sebastian-Larsson @robell @rascani Signed-off-by: Per Held <per.held@arm.com>
Differential Revision: D106408368 Pull Request resolved: pytorch#19783
### Summary

In pytorch#19651, I added a global seed for pytest runs. This was intended to reduce random tolerance flakes, but didn't actually do so in practice: the parallel test runners don't guarantee any ordering, so random state is unstable between runs. I've updated it to set the seed per-test. This should make the random state independent of test execution order.
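To illustrate why per-test seeding removes the ordering dependence, here is a minimal sketch using only the standard library. The helper names (`seed_for_test`, `run_test`) and the idea of deriving the seed from the test identifier are assumptions for illustration; the actual change may simply reset a fixed seed before each test.

```python
import hashlib
import random


def seed_for_test(test_id: str, base_seed: int = 0) -> int:
    """Derive a stable 32-bit seed from a test identifier (hypothetical helper)."""
    digest = hashlib.sha256(f"{base_seed}:{test_id}".encode()).digest()
    return int.from_bytes(digest[:4], "little")


def run_test(test_id: str) -> float:
    """Stand-in for a test body that draws random data after per-test seeding."""
    random.seed(seed_for_test(test_id))
    return random.random()


# Run the same "tests" in two different orders; each test sees the same
# random value regardless of what ran before it.
forward = {t: run_test(t) for t in ["test_a", "test_b", "test_c"]}
backward = {t: run_test(t) for t in ["test_c", "test_b", "test_a"]}
print(forward == backward)
```

With a single global seed set once per session, the value drawn by `test_b` would depend on how many random draws earlier tests consumed, which is exactly what an unordered parallel runner makes unpredictable.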
…h#19839)

Add ArmPass.should_run_pass() as a reusable early-exit hook before call() starts the normal ExportPass retracing path. The default hook returns true, preserving existing behavior for ArmPass subclasses.

Introduce ArmOpTargetedPass for passes that only transform a known set of operator targets. It implements should_run_pass() by scanning the current graph and nested GraphModules for matching target operators. If no matching target operator is found, the pass returns an unmodified PassResult.

For passes that already gate transformations with allowed_to_transform(), allow the target pre-scan to apply the same check before deciding whether the pass needs to run. This avoids running TFA passes when all matching target nodes are marked as disallowed.

The should_run_pass() hook and ArmOpTargetedPass pre-scan avoid rebuilding graphs for decomposition and rewrite passes that cannot affect the current graph. The speedup is most visible on large models.

Single-run paired benchmarks on Arm backend model tests across FP32, INT, VGF no-quant, and VGF quant variants:

| Model       | E2E avg | Pass-manager avg |
|-------------|--------:|-----------------:|
| T5-small    | +30.5%  | +47.5% |
| DeepLabV3   | +12.9%  | +49.8% |
| Wav2Letter  | +16.9%  | +51.2% |
| InceptionV3 | +22.2%  | +46.5% |
| MobileNetV2 | +22.2%  | +52.5% |
| MobileNetV3 | +29.9%  | +54.6% |

Model rows are unweighted averages over successful variants.

Unweighted average across 23 successful model/target variants:
- E2E speedup: +22.4%
- Pass-manager speedup: +50.5%

Change-Id: Iaa09638473a1d6d1e2ce98f5a0e3fc3a14378143

cc @digantdesai @freddan80 @per @zingo @oscarandersson8218 @mansnils @Sebastian-Larsson @robell @rascani

Signed-off-by: Yufeng Shi <yufeng.shi@arm.com>
Co-authored-by: Erik Lundell <erik.lundell@arm.com>
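The early-exit idea in this commit can be sketched with a toy graph model. Everything below (`Node`, `Graph`, `OpTargetedPass`, `RewriteFoo`, the op names) is a hypothetical stand-in and not the real `ArmPass` / `ArmOpTargetedPass` code, which operates on torch.fx graphs; it only shows the shape of a pre-scan that covers nested graphs and honours `allowed_to_transform()`.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    target: str
    subgraph: Optional["Graph"] = None  # stands in for a nested GraphModule


@dataclass
class Graph:
    nodes: list = field(default_factory=list)


class OpTargetedPass:
    """Toy analogue of an op-targeted pass (assumption, not the real class)."""

    targets: frozenset = frozenset()

    def allowed_to_transform(self, node: Node) -> bool:
        return True

    def should_run_pass(self, graph: Graph) -> bool:
        # Pre-scan: look for at least one transformable target node,
        # descending into nested graphs.
        for node in graph.nodes:
            if node.target in self.targets and self.allowed_to_transform(node):
                return True
            if node.subgraph is not None and self.should_run_pass(node.subgraph):
                return True
        return False

    def call(self, graph: Graph):
        if not self.should_run_pass(graph):
            return graph, False  # unmodified result, no graph rebuild
        return self.transform(graph), True

    def transform(self, graph: Graph) -> Graph:
        raise NotImplementedError


class RewriteFoo(OpTargetedPass):
    targets = frozenset({"aten.foo"})

    def transform(self, graph: Graph) -> Graph:
        # The "expensive" path: build a new graph with the op rewritten.
        return Graph(
            [Node("aten.bar" if n.target == "aten.foo" else n.target, n.subgraph)
             for n in graph.nodes]
        )


no_match = Graph([Node("aten.add"), Node("aten.mul")])
nested = Graph([Node("cond", subgraph=Graph([Node("aten.foo")]))])

result, modified = RewriteFoo().call(no_match)
print(result is no_match, modified)
print(RewriteFoo().should_run_pass(nested))
```

The scan is linear in the number of nodes, so it is cheap compared with rebuilding the graph, which is where the pass-manager savings reported in the table would come from.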
- Export & lower the smollm2 via extensions/llm/export_llm
- Build the arm_executor_runner application
- Fix the propagation of select_ops_list in the CMakeLists.txt
- Test the application runs on FVP in fast mode

Signed-off-by: George Gekov <george.gekov@arm.com>
Change-Id: I8acd87c2f5c3e6b5b189bb987ceccfe4877e2254
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Summary: Currently, __builtin_FUNCTION is used opportunistically if it exists. However, for heavily templated code, this results in extremely long strings, which add to .rodata and can be wasteful on embedded targets. This commit adds an override which uses the shorter __FUNCTION__ even if __builtin_FUNCTION exists, and exposes it as a BUCK constraint. Integration into CMake is intentionally left out for now. Differential Revision: D106668077
…ytorch#19834) Summary: The current approach uses __FILE__ and opportunistically trims it if the utility is available. However, the long name is still stored in .rodata, which can cost memory on embedded platforms. Instead, first try __FILE_NAME__. Differential Revision: D106587633
Summary: ghstack 0.15.0 changed the header URL in PR bodies from `Stack from [ghstack](https://github.com/ezyang/ghstack)` to `Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.15.0)`. The exact string match in `propose_ghstack_orig_pr.py` no longer matched, causing every ghstack_land workflow run to fail since May 14. Use `startswith("Stack from [ghstack]")` instead to be resilient to URL changes. Test Plan: Verified the new pattern matches both the old format (`https://github.com/ezyang/ghstack`) and the new format (`https://github.com/ezyang/ghstack/tree/0.15.0`). This PR was authored with the help of Claude. Reviewers:
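A small sketch of the matching change, using the two header strings quoted in the summary. The function names and the sample PR body are made up for illustration, and the "exact" check below is only an assumed approximation of what `propose_ghstack_orig_pr.py` did before the fix.

```python
OLD_HEADER = "Stack from [ghstack](https://github.com/ezyang/ghstack)"
NEW_HEADER = "Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.15.0)"


def matches_exact_old_header(body: str) -> bool:
    """Assumed shape of the old check: the first line must equal the old header."""
    return body.splitlines()[0] == OLD_HEADER


def is_ghstack_body(body: str) -> bool:
    """The prefix check from the fix: tolerant of changes to the URL."""
    return body.startswith("Stack from [ghstack]")


# Illustrative PR bodies; the stack entries are placeholders.
old_body = OLD_HEADER + "\n* #2\n* #1"
new_body = NEW_HEADER + "\n* #2\n* #1"

print(matches_exact_old_header(old_body), matches_exact_old_header(new_body))
print(is_ghstack_body(old_body), is_ghstack_body(new_body))
```

The prefix stops before the URL, so any future change to the link target leaves the check unaffected.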
Pull Request resolved: pytorch#19867 Some environments preserve stale failure state when tests are reported through unittest skip results. This switches currently disabled Vulkan delegate coverage to a local decorator so those tests stay discoverable, log their disabled reason, and produce an executed result. ghstack-source-id: 387629544 @exported-using-ghexport Differential Revision: [D106732141](https://our.internmc.facebook.com/intern/diff/D106732141/)
Applies the same disabled-test treatment as the prior diffs in this stack to the devtools inspector tests. Some test runners preserve stale failure state when tests report through unittest skip results, so this replaces the conditionally disabled coverage with a local decorator that keeps the tests discoverable, logs their disabled reason, and produces an executed result. Adds a disable_if decorator that mirrors unittest.skipIf (evaluating the condition at decoration time) and converts the three Windows-gated test cases to use it. Differential Revision: [D106736354](https://our.internmc.facebook.com/intern/diff/D106736354/) ghstack-source-id: 387629542 Pull-Request: pytorch#19874
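A decorator along these lines could look like the sketch below. This is a guess at the shape, not the code from the diff: the name `disable_if` comes from the commit message, while the logger name, the wrapper body, and `DemoTest` are assumptions.

```python
import functools
import io
import logging
import unittest

logger = logging.getLogger("disabled_tests")


def disable_if(condition: bool, reason: str):
    """Sketch: like unittest.skipIf, but the test reports an executed result."""

    def decorator(test_fn):
        if not condition:
            return test_fn  # condition was evaluated at decoration time

        @functools.wraps(test_fn)
        def disabled(*args, **kwargs):
            # Log the reason and return normally, so the runner records a
            # pass rather than a skip.
            logger.warning("%s is disabled: %s", test_fn.__name__, reason)

        return disabled

    return decorator


class DemoTest(unittest.TestCase):
    @disable_if(True, "illustrative reason")
    def test_disabled(self):
        self.fail("body must not run while disabled")

    @disable_if(False, "never applies")
    def test_enabled(self):
        self.assertEqual(1 + 1, 2)


suite = unittest.defaultTestLoader.loadTestsFromTestCase(DemoTest)
result = unittest.TextTestRunner(stream=io.StringIO()).run(suite)
print(result.testsRun, len(result.skipped), result.wasSuccessful())
```

Both tests stay discoverable and count as run, and none is reported through the skip channel, which is the property the commit message says some runners need.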
Summary: AOTI tests (llama3_2_vision and select extension/llm tests) hang indefinitely on macOS CI runners after the PyTorch 2.12 pin update. The hang is in native C/C++ code (inductor compilation / dlopen), which prevents faulthandler from producing a traceback. Diagnosis is ongoing in pytorch#19886. Skip the affected tests and bump the macOS job timeout from the default 90 to 120 minutes to add margin (observed completion at ~79 min with skips applied). Co-Authored-By: Claude <noreply@anthropic.com>
Differential Revision: D106710218 Pull Request resolved: pytorch#19860
Differential Revision: D105728156 Pull Request resolved: pytorch#19726
Add TurboQuant TQ4 KV cache to the MLX backend, exposed on gemma4_31b
via --turboquant. Compresses full-attention KV cache from bf16 to a
4-bit codebook + per-vector norms, letting Gemma 4 31B-IT scale to very
long contexts. Sliding-window layers are unchanged.
What's in the PR
New cache subclass:
- backends/mlx/llm/turboquant_cache.py: MLXTurboQuantKVCache,
a drop-in subclass of TurboQuantKVCache.
Three custom ops + Metal kernels:
- mlx::tq4_compress (model_ops/tq4_compress.py): bucketize +
cast(uint8) + nibble-pack in one kernel.
- mlx::tq_norm (model_ops/tq_norm.py): L2 norm with simd_sum
cross-lane reduction in fp32 registers; bf16 in / bf16 out.
- mlx::tq_dequant (model_ops/tq_dequant.py): unpack + centroid
gather + multiply-by-norm in one kernel.
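As a rough illustration of what the three ops compute together, here is a pure-Python sketch. The 16-entry uniform codebook, the normalise-before-bucketize step, and the low-nibble-first packing order are all assumptions; the real codebook, layout, and Metal kernels are not described here.

```python
import bisect
import math

# Hypothetical 16-entry codebook over [-1, 1]; not the real TQ4 centroids.
CENTROIDS = [(-15 + 2 * i) / 16 for i in range(16)]
# 15 bucket boundaries, each midway between neighbouring centroids.
BOUNDARIES = [(a + b) / 2 for a, b in zip(CENTROIDS, CENTROIDS[1:])]


def l2_norm(vec):
    """Per-vector L2 norm (the role of tq_norm)."""
    return math.sqrt(sum(x * x for x in vec))


def compress(vec):
    """Bucketize + nibble-pack (the role of tq4_compress). Needs even length."""
    norm = l2_norm(vec)
    scale = norm if norm > 0 else 1.0
    idx = [bisect.bisect_right(BOUNDARIES, x / scale) for x in vec]
    # Two 4-bit indices per byte; low nibble first (assumed order).
    packed = bytes(idx[i] | (idx[i + 1] << 4) for i in range(0, len(idx), 2))
    return packed, norm


def dequant(packed, norm):
    """Unpack + centroid gather + multiply-by-norm (the role of tq_dequant)."""
    out = []
    for byte in packed:
        out.append(CENTROIDS[byte & 0x0F] * norm)
        out.append(CENTROIDS[byte >> 4] * norm)
    return out


vec = [3.0, -4.0, 0.0, 1.0]
packed, norm = compress(vec)
restored = dequant(packed, norm)
print(len(packed), [round(x, 3) for x in restored])
```

Four values are stored in two bytes plus one norm, and with this uniform codebook the per-element reconstruction error is bounded by half a bucket width times the norm.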
Per-op tests:
- test_tq4_compress.py, test_tq_norm.py, test_tq_dequant.py
Wiring:
- examples/models/gemma4_31b/mlx_source_transformations.py:
- examples/models/gemma4_31b/export.py: --turboquant CLI flag
- examples/models/gemma4_31b/README.md: TurboQuant subsection.
Perf on M4 Max with 64 GB RAM:
```
2K prompt:
bf16 cache: prefill 189.7 tok/s, decode 17.4 tok/s
TurboQuant cache: prefill 187.7 tok/s, decode 16.9 tok/s
8K prompt:
bf16 cache: prefill 170.0 tok/s, decode 17.1 tok/s
TurboQuant cache: prefill 166.0 tok/s, decode 11.9 tok/s
```
For TQ, max context length is set to 64K. On bf16 cache, max context
length is 10K.
TODO: why does decode slow more for TQ than bf16?
Summary: Add `fuse()` implementations to the remaining Cadence `QuantizationPattern` subclasses:
- `MaxPool2dPattern`, `MaxPool2dWithoutIndicesPattern`: order-preserving pool on quantized values
- `ReluBasePattern` (inherited by `ReluPattern0`/`1`): relu with requantization
- `ConvReluBasePattern` (inherited by `Conv1d`/`2dReluPattern0`/`1`): conv+relu fusion with `anchor_ops()` override to match only the conv op
- `SoftmaxPattern`: softmax with dummy mask/pos tensors and fake_mode metadata
- `MixedW8A32LinearPattern`: weight-only quantized linear (no input/output quant)
- `MixedW8A32ConvPattern`: weight-only quantized conv1d with NCL→NLC permutation
- `MixedW8A32GruPattern`: weight-only quantized GRU with 4 dequantized params

Reviewed By: DrJessop

Differential Revision: D105728177
…19728) Summary: Both and Cadence now use the shared `QuantFusionPass` from `compiler_funcs.py`.
- `QuantFusionPass` in `compiler_funcs.py` iterates patterns, matches `anchor_ops()`, calls `fuse()` on each match, with debug logging and dead code elimination
- Cadence: `compiler.py` now uses `QuantFusionPass` instead of the old `QuantFusion` isinstance switch
- Removed Cadence `compiler` target's dep on `:fusion_pass` (no longer imported)

Reviewed By: DrJessop

Differential Revision: D105728219
Differential Revision: D106957459 Pull Request resolved: pytorch#19903
Add the possibility to convert torch.nn.Linear modules to MXFP format. The feature works by replacing all torch.nn.Linear submodules inside a graph with a custom MXFP counterpart: `MXFPLinearOp`.

A new user API called `to_mxfp` has been added to enable this feature (located in backends/arm/ao_ext/mxfp.py). The API is tagged as experimental for now.

An eager CPU and fake implementation is added to the new custom op, but lowering it to TOSA is handled in a later patch.

To summarize, this patch enables the following flow:

```python
m = MyModule()
to_mxfp(m, MXFPOpConfig())
m.forward(x)
```

Signed-off-by: Martin Lindström <Martin.Lindstroem@arm.com>
Co-authored-by: Sebastian Larsson <sebastian.larsson@arm.com>
### Summary

Enables testing the Neutron delegate with int data, created by quantizing generated float data, and with the input and output quantization nodes removed. This turns the model into its int variant.

### Test plan

Tests provided.

cc @robert-kalmar
…h#19803) ### Summary Added support for `aten.slice` using new Neutron flow. ### Test plan tests can be manually run using `pytest -c /dev/null backends/nxp/tests/` cc @robert-kalmar @JakeStevens @digantdesai @rascani @MartinPavella @roman-janik-nxp @jirioc @irtrukhina @StrycekSimon
…19890) ### Summary cppcheck's unusedFunction is a whole-program check, but lintrunner analyzes files individually. Functions defined in headers are used by the .cpp files that include them, but cppcheck only sees the header in isolation and falsely reports them as never used. Suppress the check for .h/.hpp files while keeping it active for .cpp. Authored with assistance from Claude.
### Summary

Add a docker build image based on Ubuntu 26.04 with gcc 15. It's necessary for the baremetal RISC-V use case, since `libstdc++-riscv64-unknown-elf-picolibc` is only available starting with Ubuntu 26.04. It also makes sure that `gcc-riscv64-unknown-elf` is at least gcc 14, which has support for RVV.

### Test plan

It will be used by the baremetal testing on RISC-V.

Relates to pytorch#18991 pytorch#19666
6661a84 to 7cc42fe
Cross-compiles with riscv64-unknown-elf + picolibc, embeds the .bpte into
the ELF, and runs under qemu-system-riscv{32,64} -machine virt with
semihosting carrying stdout and exit status. Same bundled-IO PASS criterion
as the existing linux runs.
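For orientation, a run of this kind could be driven roughly as sketched below. The helper names and the ELF path are hypothetical, the exact QEMU options used by the CI job are not stated here (the ones shown are standard QEMU options chosen as a plausible set), and the PASS marker is borrowed from the bundled-IO output quoted in an earlier commit message as an assumption.

```python
def qemu_command(xlen: int, elf_path: str) -> list[str]:
    """Build an argv for a bare-metal run on the QEMU 'virt' machine (sketch)."""
    if xlen not in (32, 64):
        raise ValueError("xlen must be 32 or 64")
    return [
        f"qemu-system-riscv{xlen}",
        "-machine", "virt",
        "-nographic",
        "-bios", "none",
        # Semihosting carries stdout and the exit status back to the host.
        "-semihosting-config", "enable=on,target=native",
        "-kernel", elf_path,
    ]


# Assumed marker, taken from the bundled-IO output format quoted elsewhere.
PASS_MARKER = "Test_result: PASS"


def run_passed(stdout: str, exit_status: int) -> bool:
    """Treat a run as passing only if it exits cleanly and prints the marker."""
    return exit_status == 0 and PASS_MARKER in stdout


cmd = qemu_command(64, "build/runner.elf")  # hypothetical path
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, capture_output=True, text=True)` and its `stdout` and `returncode` fed to `run_passed`.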
7cc42fe to 00d0173
sentencepiece fails to compile on GCC 15 due to missing #include <cstdint>
Summary
Add baremetal RISC-V testing on CI for rv32 and rv64.
Test plan
It's only testing on CI, no new code really, so CI is the testing.
Will submit to https://github.com/pytorch/executorch once pytorch#19741 is merged