Add baremetal RISC-V smoke tests (rv32, rv64)#4

Draft
luhenry wants to merge 103 commits into riscv-testing-models from riscv-testing-baremetal

@luhenry luhenry commented May 23, 2026

Summary

Add baremetal RISC-V testing on CI for rv32 and rv64.

Test plan

It's only testing on CI, no new code really, so CI is the testing.

Will submit to https://github.com/pytorch/executorch once pytorch#19741 is merged.

metascroy and others added 7 commits May 22, 2026 19:20
Differential Revision: D105973185

Pull Request resolved: pytorch#19736
Add model tests for models that are not currently supported:
- yolo11
- wav2letter
- silero_vad

cc @digantdesai @freddan80 @per @zingo @oscarandersson8218 @mansnils
@Sebastian-Larsson @robell @rascani

Signed-off-by: Adrian Lundell <adrian.lundell@arm.com>
Differential Revision: D102880053

Pull Request resolved: pytorch#19211
Differential Revision: D106123930

Pull Request resolved: pytorch#19742
kirklandsign and others added 11 commits May 23, 2026 18:50
Differential Revision: D106162684

Pull Request resolved: pytorch#19749
### Summary
Add tests verifying correct support for add.tensor by the Neutron
backend using the new Neutron MLIR flow.

### Test plan
Unit tests provided.

cc @robert-kalmar
Treat BUCK and TARGETS files as build metadata in the Arm
pre-push license check so they do not need copyright headers.

Signed-off-by: Per Held <per.held@arm.com>
Change-Id: I4b3bbd1e03ba4b9c38fd06225156344985f0cc70
### Summary
Add tests verifying correct support for sub.tensor by the Neutron
backend using the new Neutron MLIR flow.

### Test plan
Unit tests provided.


cc @robert-kalmar @JakeStevens @digantdesai @rascani
…opy (pytorch#19751)

Follow-up to pytorch#17097, which added BF16 support to the TOSA GATHER op.
`aten.index_select` and `aten.unfold_copy` both lower via TOSA GATHER
but their support checks were not updated at the time.

In both decompositions (`DecomposeIndexSelectToGatherPass()` and
`DecomposeUnfoldToGatherPass()`),
the bf16 values tensor flows through dtype-agnostic reshape ops and
`tosa.GATHER`, which accepts `BF16`.
The support check was the only blocker.

| Op                  | bf16 before | bf16 after |
|---------------------|:-----------:|:----------:|
| `aten.gather`       | ✅          | ✅         |
| `aten.index.Tensor` | ✅          | ✅         |
| `aten.slice_copy`   | ✅          | ✅         |
| `aten.index_select` | ❌          | ✅         |
| `aten.unfold_copy`  | ❌          | ✅         |

Changes:
- `index_select_support.py`, `unfold_copy_support.py`: extend float branch
  to include `bfloat16`; add bf16 extension guard; update rejection message.
- `test_index_select.py`, `test_unfold_copy.py`: add isolated
  `_tosa_FP_bf16` test functions using
  `TosaPipelineFP(..., tosa_extensions=["bf16"])`.

### Test plan

`test_index_select_tosa_FP_bf16` and `test_unfold_copy_tosa_FP_bf16`
exercise the bf16 path end-to-end through `TosaPipelineFP` with the bf16
extension enabled, following the same pattern as the existing
`test_slice_tensor_tosa_FP_bf16` from pytorch#17492.
This is done for conv, depthwise conv, transpose conv, and bmm.

Add scratch tensors to the operator signatures, which are then
assigned exir.memory.alloc. These allocs are automatically memory
planned by ExecuTorch.
    
Introduce `required_cmsis_buffer_size`, which computes the buffer
size from node properties and the Cortex-M configuration.
The function uses functions registered by target in
backends/cortex_m/passes/scratch_buffer_sizes.py.
This is used to set the size of the allocs in ConvertToCortexMPass.

Finally, modify the kernels to use the new scratch tensor instead
of allocating temporary memory. Add a new macro,
CORTEX_M_ENABLE_RUNTIME_CHECKS,
to check that the AOT-computed buffer size equals the
buffer size computed at runtime. Use this when testing.


cc @psiddh @AdrianLundell @digantdesai @rascani @freddan80 @per @zingo
@oscarandersson8218 @mansnils @Sebastian-Larsson @robell

---------

Signed-off-by: Erik Lundell <erik.lundell@arm.com>
Co-authored-by: Måns Nilsson <mans.nilsson@arm.com>
…es (pytorch#19146)

### Summary
To enable GPU backend support in the Llama runner, refactoring is
required because the dtypes of kv_cache, attention_mask, and logits are
currently hardcoded, preventing floating-point models from running.
This PR focuses on removing those hardcoded dtypes.

#### Key changes
- Remove template parameter <typename T> from KVManager, LhdTokenGenerator,
  MultimodalPromptProcessor, and related runner classes
- Detect kv_cache and attention_mask dtypes dynamically from MethodMeta at
  construction time instead of compile-time bitwidth detection
- Switch to std::byte* pointer arithmetic with getDtypeSize() for all buffer
  offsets; add fill_mask() helper for multi-dtype attention mask filling
- Update spec_prop pass for custom llama op for sharding case greater than 1


### Test plan
```
python backends/qualcomm/tests/test_qnn_delegate.py -k TestExampleLLMScript.test_llama_stories_110m --model SM8650 --build_folder /local/mnt/workspace/chenweng/executorch/executorch/build-android  --device acfa9311 --executorch_root . --artifact_dir ./stories_110m_pte_size --llama_artifacts . --use_fp16
```
<img width="1977" height="468" alt="image"
src="https://github.com/user-attachments/assets/8bf3bffa-9b9f-4655-9cbc-b20127c2468a"
/>


cc @cccclai @cbilgin @abhinaykukkadapu
Summary: Pull Request resolved:
pytorch#19764

Reviewed By: kirklandsign

Differential Revision: D106332819
As documented at
https://vkdoc.net/man/VkDataGraphPipelineSessionBindPointRequirementARM,
the `sType` of VkDataGraphPipelineSessionBindPointRequirementARM should always
be set to
VK_STRUCTURE_TYPE_DATA_GRAPH_PIPELINE_SESSION_BIND_POINT_REQUIREMENT_ARM.

cc @digantdesai @freddan80 @per @zingo @oscarandersson8218 @mansnils
@Sebastian-Larsson @robell @rascani

Signed-off-by: Erik Lundell <erik.lundell@arm.com>
Enable CPPCHECK for Cortex-M sources and headers. The Cortex-M kernels
are registered through generated wrappers, so cppcheck cannot see
direct call sites for the exported *_out entry points and reports them
as unused. Keep narrow unusedFunction suppressions for those
registration-visible functions.

The scratch buffer context header is linted as a standalone header but
currently exposes helper API without in-tree call sites, so suppress
unusedFunction at file scope there instead of dropping Cortex-M header
coverage.

Keep the quantize and dequantize context parameters non-const to match
the generated kernel ABI; changing them to const changes the mangled
symbols used by registration.

Signed-off-by: Per Held <per.held@arm.com>

Change-Id: I3bcb6e5d3f125ae400005d1b033b24a07eb7924f
luhenry and others added 9 commits May 26, 2026 09:14
### Summary

It relates to pytorch#18833. It
doesn't add Yolo on baremetal, but it at least makes sure that it works
using Portable Kernels and XNNPACK backends.

### Test plan

It's only adding a model to CI, so the CI is the test plan.
Convert BenchmarkActivity, BenchmarkMetric, LlmBenchmark,
LlmModelRunner, and ModelRunner from Java to Kotlin.

Differential Revision: D106195816
…rch#19731)

### Summary 
Extend the Cortex-M cross-CPU build pipeline to Armv6-M by patching two
upstream issues that block the Corstone-300 target source and the CMSIS
Cortex DFP from building for `cortex-m0plus`:

* `core_platform/0003-*.patch` guards the `HardFault_Handler` in
`targets/corstone-300/target.cpp`. The handler uses an `ite eq` IT-block
in inline asm and dereferences the SCB CFSR/BFAR/MMFAR fault-status
registers; both are Armv7-M / Armv8-M Mainline only. The patch wraps the
rich handler in `__ARM_ARCH_7M__ / 7EM / 8M_MAIN / 8_1M_MAIN` and falls
back to a minimal stub on Armv6-M / Armv8-M Baseline (M0/M0+/M23).

* `core_software/0002-*.patch` fixes `cmsis.cmake`'s handling of the M0+
device. The Cortex DFP names the device directory and headers
`ARMCM0plus` (lowercase suffix), while the device sources
(`startup_ARMCM0plus.c`, `system_ARMCM0plus.c`) gate their
implementations on the `ARMCM0P` preprocessor macro — three different
spellings. The previous `string(TOUPPER ...)` produced `ARMCM0PLUS`: the
include path lookup failed and the source files hit their `#error device
not specified!` guard. Override `ARM_CPU` to `ARMCM0plus` for the
directory + filename and introduce a separate `CMSIS_DEVICE_CPU_DEFINE`
set to `ARMCM0P` for the cmsis_startup and cmsis_system
compile-definitions; all other cores still drive both paths from the
uppercased default.

Both patches are layered via the existing `patch_repo` mechanism; the
`corstone_utils.cmake` TODO is updated so the deletion plan for 0002 and
0003 is documented together.

### Test Plan
Locally validated end-to-end on the Corstone-300 FVP with the `qadd`
model: `cortex-m0plus` build links a runner that includes
`startup_ARMCM0plus.c` / `system_ARMCM0plus.c` and the patched
`target.cpp`, and the FVP run prints
`TEST: BundleIO index[0] Test_result: PASS` with all error stats zero.
The bundled `libcmsis-nn.a` reports `Tag_CPU_arch: v6S-M` and
`Tag_THUMB_ISA_use: Thumb-1` with zero DSP / MVE / saturating
instructions, confirming the scalar code path was exercised.

Authored with Claude.

cc @digantdesai @freddan80 @per @zingo @oscarandersson8218 @mansnils
@Sebastian-Larsson @robell
Differential Revision: D106026285

Pull Request resolved: pytorch#19734
Differential Revision: D106394605

Pull Request resolved: pytorch#19775
Re-upload with BUCK changes.

Share TOSA RESIZE parameter validation between upsample support checks
and fake RESIZE lowering so invalid nearest and bilinear resize
parameters are rejected before delegation.


Change-Id: I57c267aca96d733879ae90329267e44adce399c6


cc @digantdesai @freddan80 @per @zingo @oscarandersson8218 @mansnils
@Sebastian-Larsson @robell @rascani

Signed-off-by: Per Held <per.held@arm.com>
Differential Revision: D106408368

Pull Request resolved: pytorch#19783
### Summary
In pytorch#19651, I added a global
seed for pytest runs. This was intended to reduce random tolerance
flakes, but didn't actually do so in practice. This is because the
parallel test runners don't guarantee any ordering, so random state is
unstable between runs.

I've updated it to set the seed per-test. This should hopefully make the
random state independent of test execution order.
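
A minimal sketch of the per-test seeding idea described above, not the code from the PR. The helper names `seed_for_test` and `run_with_seed` are hypothetical, and deriving the seed from a CRC of the test identifier is an assumption made for illustration.

```python
import random
import zlib


def seed_for_test(test_id: str, base_seed: int = 0) -> int:
    """Derive a stable seed from a test identifier (hypothetical helper)."""
    # zlib.crc32 is deterministic across processes, unlike the built-in hash().
    return (base_seed + zlib.crc32(test_id.encode("utf-8"))) % (2**32)


def run_with_seed(test_id: str) -> float:
    """Stand-in for a test body: seed first, then draw one random value."""
    random.seed(seed_for_test(test_id))
    return random.random()
```

In a pytest suite, the same idea would typically live in an autouse fixture that reads `request.node.nodeid` and seeds each random number generator the tests use before the test body runs, so the values a test sees no longer depend on which tests ran before it on the same worker.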
YufengShi-dudu and others added 20 commits May 29, 2026 10:05
…h#19839)

Add ArmPass.should_run_pass() as a reusable early-exit hook before
  call() starts the normal ExportPass retracing path. The default hook
  returns true, preserving existing behavior for ArmPass subclasses.

  Introduce ArmOpTargetedPass for passes that only transform a known
  set of operator targets. It implements should_run_pass() by scanning
  the current graph and nested GraphModules for matching target
  operators. If no matching target operator is found, the pass returns
  an unmodified PassResult.

  For passes that already gate transformations with
  allowed_to_transform(), allow the target pre-scan to apply the same
  check before deciding whether the pass needs to run. This avoids
  running TFA passes when all matching target nodes are marked as
  disallowed.

  The should_run_pass() hook and ArmOpTargetedPass pre-scan avoid
  rebuilding graphs for decomposition and rewrite passes that cannot
  affect the current graph. The speedup is most visible on large models.
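
  The early-exit hook can be illustrated with a toy model. This is a sketch
  under stated assumptions: `Node`, `Graph`, and `OpTargetedPass` below are
  hypothetical stand-ins, not the ExecuTorch `ArmPass` or `ArmOpTargetedPass`
  classes, and the `rebuilds` counter only exists to show when the expensive
  path is taken.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    target: str
    allowed: bool = True  # stand-in for a per-node "may transform" flag


@dataclass
class Graph:
    nodes: list = field(default_factory=list)
    subgraphs: list = field(default_factory=list)  # stand-in for nested GraphModules


class OpTargetedPass:
    """Toy pass that only rewrites nodes whose target is in `targets`."""

    targets = frozenset()

    def __init__(self):
        self.rebuilds = 0  # counts how often the expensive path ran

    def allowed_to_transform(self, node: Node) -> bool:
        return node.allowed

    def should_run_pass(self, graph: Graph) -> bool:
        # Cheap pre-scan: look for at least one transformable target node.
        for node in graph.nodes:
            if node.target in self.targets and self.allowed_to_transform(node):
                return True
        return any(self.should_run_pass(sub) for sub in graph.subgraphs)

    def call(self, graph: Graph):
        if not self.should_run_pass(graph):
            return graph, False  # unmodified result, nothing rebuilt
        self.rebuilds += 1  # the real pass would retrace the graph here
        return graph, True
```

  The pre-scan is linear in the number of nodes, so skipping a pass costs far
  less than rebuilding a graph it would have left unchanged.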

  Single-run paired benchmarks on Arm backend model tests
  across FP32, INT, VGF no-quant, and VGF quant variants:

  | Model       | E2E avg | Pass-manager avg |
  |-------------|--------:|-----------------:|
  | T5-small    | +30.5%  | +47.5%           |
  | DeepLabV3   | +12.9%  | +49.8%           |
  | Wav2Letter  | +16.9%  | +51.2%           |
  | InceptionV3 | +22.2%  | +46.5%           |
  | MobileNetV2 | +22.2%  | +52.5%           |
  | MobileNetV3 | +29.9%  | +54.6%           |

  Model rows are unweighted averages over successful variants.
  Unweighted average across 23 successful model/target variants:
  E2E speedup: +22.4%
  Pass-manager speedup: +50.5%

Change-Id: Iaa09638473a1d6d1e2ce98f5a0e3fc3a14378143


cc @digantdesai @freddan80 @per @zingo @oscarandersson8218 @mansnils
@Sebastian-Larsson @robell @rascani

Signed-off-by: Yufeng Shi <yufeng.shi@arm.com>
Co-authored-by: Erik Lundell <erik.lundell@arm.com>
- Export & lower the smollm2 via extensions/llm/export_llm
- Build the arm_executor_runner application
- Fix the propagation of select_ops_list in the CMakeLists.txt
- Test the application runs on FVP in fast mode

Signed-off-by: George Gekov <george.gekov@arm.com>
Change-Id: I8acd87c2f5c3e6b5b189bb987ceccfe4877e2254
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Summary:
Currently, __builtin_FUNCTION is used opportunistically if it exists.

However, for heavily templated code, this results in extremely long
strings that add to .rodata, which can be wasteful on embedded targets.

This commit adds an override that uses the shorter __FUNCTION__ even if
__builtin_FUNCTION exists, and exposes it as a BUCK constraint.

Integration into CMake is intentionally left out for now.

Differential Revision: D106668077
…ytorch#19834)

Summary:

The current approach uses __FILE__ and opportunistically trims it if the
utility is available.

However, the long name is still stored in .rodata, which can cost
memory on embedded platforms.

Instead, first try __FILE_NAME__.

Differential Revision: D106587633
Summary:

ghstack 0.15.0 changed the header URL in PR bodies from
`Stack from [ghstack](https://github.com/ezyang/ghstack)` to
`Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.15.0)`.

The exact string match in `propose_ghstack_orig_pr.py` no longer matched,
causing every ghstack_land workflow run to fail since May 14. Use
`startswith("Stack from [ghstack]")` instead to be resilient to URL changes.

Test Plan:

Verified the new pattern matches both the old format
(`https://github.com/ezyang/ghstack`) and the new format
(`https://github.com/ezyang/ghstack/tree/0.15.0`).
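
The prefix match can be sketched as follows. The function name `is_ghstack_pr_body` and applying the check to the whole PR body are assumptions for illustration; only the `startswith("Stack from [ghstack]")` test comes from the description above.

```python
GHSTACK_HEADER_PREFIX = "Stack from [ghstack]"


def is_ghstack_pr_body(body: str) -> bool:
    """Return True if the PR body starts with the ghstack header.

    Matching only the prefix ignores the URL in parentheses, so a
    versioned link such as .../tree/0.15.0 is accepted as well.
    """
    return body.startswith(GHSTACK_HEADER_PREFIX)
```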

This PR was authored with the help of Claude.

Reviewers:
Pull Request resolved: pytorch#19867

Some environments preserve stale failure state when tests are reported through unittest skip results. This switches currently disabled Vulkan delegate coverage to a local decorator so those tests stay discoverable, log their disabled reason, and produce an executed result.

ghstack-source-id: 387629544
@exported-using-ghexport

Differential Revision: [D106732141](https://our.internmc.facebook.com/intern/diff/D106732141/)
Applies the same disabled-test treatment as the prior diffs in this stack to the devtools inspector tests. Some test runners preserve stale failure state when tests report through unittest skip results, so this replaces the conditionally disabled coverage with a local decorator that keeps the tests discoverable, logs their disabled reason, and produces an executed result.

Adds a disable_if decorator that mirrors unittest.skipIf (evaluating the condition at decoration time) and converts the three Windows-gated test cases to use it.
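
A minimal sketch of what such a decorator could look like, assuming a `(condition, reason)` signature like that of `unittest.skipIf`; the decorator in the actual diff may differ in its signature and logging.

```python
import functools
import logging

logger = logging.getLogger(__name__)


def disable_if(condition: bool, reason: str):
    """Replace the test with a logging no-op when `condition` is true.

    The condition is evaluated once, at decoration time. A disabled test
    still runs and returns normally, so runners record an executed result
    rather than a skip.
    """

    def decorator(test_func):
        if not condition:
            return test_func  # leave enabled tests untouched

        @functools.wraps(test_func)
        def wrapper(*args, **kwargs):
            logger.warning("Test %s is disabled: %s", test_func.__name__, reason)

        return wrapper

    return decorator
```

Usage would mirror `unittest.skipIf`, for example `@disable_if(sys.platform == "win32", "not supported on Windows")` above a test method.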

Differential Revision: [D106736354](https://our.internmc.facebook.com/intern/diff/D106736354/)


ghstack-source-id: 387629542
Pull-Request: pytorch#19874
Summary:
AOTI tests (llama3_2_vision and select extension/llm tests) hang
indefinitely on macOS CI runners after the PyTorch 2.12 pin update.
The hang is in native C/C++ code (inductor compilation / dlopen),
which prevents faulthandler from producing a traceback. Diagnosis
is ongoing in pytorch#19886.

Skip the affected tests and bump the macOS job timeout from the
default 90 to 120 minutes to add margin (observed completion at
~79 min with skips applied).

Co-Authored-By: Claude <noreply@anthropic.com>
Differential Revision: D106710218

Pull Request resolved: pytorch#19860
Differential Revision: D105728156

Pull Request resolved: pytorch#19726
Add TurboQuant TQ4 KV cache to the MLX backend, exposed on gemma4_31b
via --turboquant. Compresses full-attention KV cache from bf16 to a
4-bit codebook + per-vector norms, letting Gemma 4 31B-IT scale to very
long contexts. Sliding-window layers are unchanged.

What's in the PR

  New cache subclass:
    - backends/mlx/llm/turboquant_cache.py: MLXTurboQuantKVCache,
      a drop-in subclass of TurboQuantKVCache.

  Three custom ops + Metal kernels:
    - mlx::tq4_compress (model_ops/tq4_compress.py): bucketize +
      cast(uint8) + nibble-pack in one kernel.
    - mlx::tq_norm (model_ops/tq_norm.py): L2 norm with simd_sum
      cross-lane reduction in fp32 registers; bf16 in / bf16 out.
    - mlx::tq_dequant (model_ops/tq_dequant.py): unpack + centroid
      gather + multiply-by-norm in one kernel.

  Per-op tests:
    - test_tq4_compress.py, test_tq_norm.py, test_tq_dequant.py

  Wiring:
    - examples/models/gemma4_31b/mlx_source_transformations.py:
    - examples/models/gemma4_31b/export.py: --turboquant CLI flag
    - examples/models/gemma4_31b/README.md: TurboQuant subsection.
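
The data flow of the three ops listed above can be sketched in plain Python. This is an illustration only, not the Metal kernels: it assumes the vector is normalized before bucketizing, that the low nibble holds the even-indexed element, and that `boundaries` (15 sorted values) and `centroids` (16 values) are supplied by the caller. Those details are not stated in the PR description.

```python
import bisect
import math


def tq_norm(vec):
    """L2 norm of one vector."""
    return math.sqrt(sum(x * x for x in vec))


def tq4_compress(unit_vec, boundaries):
    """Bucketize each value into a 4-bit index and pack two indices per byte."""
    if len(unit_vec) % 2 != 0:
        raise ValueError("vector length must be even to nibble-pack")
    # 15 sorted boundaries split the real line into 16 buckets (indices 0..15).
    idx = [bisect.bisect_right(boundaries, x) for x in unit_vec]
    return bytes(idx[i] | (idx[i + 1] << 4) for i in range(0, len(idx), 2))


def tq_dequant(packed, norm, centroids):
    """Unpack nibbles, gather centroids, and rescale by the stored norm."""
    out = []
    for byte in packed:
        out.append(centroids[byte & 0x0F] * norm)
        out.append(centroids[byte >> 4] * norm)
    return out
```

Each pair of values is stored in one byte, plus one norm per vector.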
    
Perf on M4 Max with 64 GB RAM:

```
 2K prompt:
    bf16 cache:        prefill 189.7 tok/s,  decode 17.4 tok/s
    TurboQuant cache:  prefill 187.7 tok/s,  decode 16.9 tok/s

  8K prompt:
    bf16 cache:        prefill 170.0 tok/s,  decode 17.1 tok/s
    TurboQuant cache:  prefill 166.0 tok/s,  decode 11.9 tok/s
```   

For TQ, max context length is set to 64K. On bf16 cache, max context
length is 10K.

TODO: why does decode slow more for TQ than bf16?
Summary:

Add `fuse()` implementations to the remaining Cadence
`QuantizationPattern` subclasses:

- `MaxPool2dPattern`, `MaxPool2dWithoutIndicesPattern` —
order-preserving pool on quantized values
- `ReluBasePattern` (inherited by `ReluPattern0`/`1`) — relu with
requantization
- `ConvReluBasePattern` (inherited by `Conv1d`/`2dReluPattern0`/`1`) —
conv+relu fusion with `anchor_ops()` override to match only the conv op
- `SoftmaxPattern` — softmax with dummy mask/pos tensors and fake_mode
metadata
- `MixedW8A32LinearPattern` — weight-only quantized linear (no
input/output quant)
- `MixedW8A32ConvPattern` — weight-only quantized conv1d with NCL→NLC
permutation
- `MixedW8A32GruPattern` — weight-only quantized GRU with 4 dequantized
params

Reviewed By: DrJessop

Differential Revision: D105728177
…19728)

Summary:

Both and Cadence now use the shared `QuantFusionPass` from
`compiler_funcs.py`.

- `QuantFusionPass` in `compiler_funcs.py` iterates patterns, matches
`anchor_ops()`, calls `fuse()` on each match, with debug logging and
dead code elimination
- Cadence: `compiler.py` now uses `QuantFusionPass` instead of the old
`QuantFusion` isinstance switch
- Removed Cadence `compiler` target's dep on `:fusion_pass` (no longer
imported)

Reviewed By: DrJessop

Differential Revision: D105728219
Differential Revision: D106957459

Pull Request resolved: pytorch#19903
Add the possibility to convert torch.nn.Linear modules to MXFP format.
The feature works by replacing all torch.nn.Linear submodules inside a
graph by a custom implemented MXFP counterpart: `MXFPLinearOp`.

A new user API called `to_mxfp` has been added to enable this feature
(located in backends/arm/ao_ext/mxfp.py). The API is tagged as
experimental for now.

An eager CPU implementation and a fake implementation are added for the new
custom op, but lowering it to TOSA is handled in a later patch. To summarize,
this patch enables the following flow:

```python
m = MyModule()

to_mxfp(m, MXFPOpConfig())

m.forward(x)
```

Signed-off-by: Martin Lindström <Martin.Lindstroem@arm.com>
Co-authored-by: Sebastian Larsson <sebastian.larsson@arm.com>
### Summary
Enables testing the Neutron delegate with int data created by quantizing
generated float data, with the input and output quantization nodes removed.
This turns the model into its int variant.

### Test plan
Tests provided.


cc @robert-kalmar
…h#19803)

### Summary

Added support for `aten.slice` using new Neutron flow.

### Test plan

Tests can be run manually using `pytest -c /dev/null
backends/nxp/tests/`

cc @robert-kalmar @JakeStevens @digantdesai @rascani @MartinPavella
@roman-janik-nxp @jirioc @irtrukhina @StrycekSimon
…19890)

### Summary
cppcheck's unusedFunction is a whole-program check, but lintrunner
analyzes files individually. Functions defined in headers are used by
the .cpp files that include them, but cppcheck only sees the header in
isolation and falsely reports them as never used. Suppress the check for
.h/.hpp files while keeping it active for .cpp.
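
One way to express the per-extension suppression, as a sketch rather than the actual lintrunner adapter: the helper name `cppcheck_args` and the choice of `--enable=all` are assumptions, while `--suppress=unusedFunction` is a standard cppcheck option.

```python
from pathlib import Path

HEADER_SUFFIXES = {".h", ".hpp"}


def cppcheck_args(filename: str) -> list:
    """Build a cppcheck command line for a single file (hypothetical helper)."""
    args = ["cppcheck", "--enable=all"]
    if Path(filename).suffix.lower() in HEADER_SUFFIXES:
        # A header analyzed alone has no visible callers, so unusedFunction
        # would fire for every function it defines.
        args.append("--suppress=unusedFunction")
    args.append(filename)
    return args
```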

Authored with assistance from Claude.
### Summary

Add a docker build image based on Ubuntu 26.04 with gcc 15. It's
necessary for the baremetal RISC-V use case since
`libstdc++-riscv64-unknown-elf-picolibc` is only available starting
with Ubuntu 26.04. It also makes sure that `gcc-riscv64-unknown-elf` is
gcc 14 or newer, which supports RVV.

### Test plan

It will be used by the baremetal testing on RISC-V.

Relates to pytorch#18991
pytorch#19666
@luhenry luhenry force-pushed the riscv-testing-baremetal branch from 6661a84 to 7cc42fe Compare June 1, 2026 16:42
Cross-compiles with riscv64-unknown-elf + picolibc, embeds the .bpte into
the ELF, and runs under qemu-system-riscv{32,64} -machine virt with
semihosting carrying stdout and exit status. Same bundled-IO PASS criterion
as the existing linux runs.
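
A sketch of how such a QEMU invocation might be assembled. The helper `qemu_command` is hypothetical and the exact flags used by the CI scripts are not shown in this PR description; the options below (`-machine virt`, `-nographic`, `-bios none`, `-semihosting-config`, `-kernel`) are standard QEMU options chosen as an assumption.

```python
def qemu_command(elf_path: str, xlen: int) -> list:
    """Build a qemu-system-riscv{32,64} command line for a baremetal ELF."""
    if xlen not in (32, 64):
        raise ValueError("xlen must be 32 or 64")
    return [
        f"qemu-system-riscv{xlen}",
        "-machine", "virt",
        "-nographic",
        "-bios", "none",  # assumption: the ELF boots without firmware
        # Semihosting lets the guest write to the host's stdout and pass
        # back an exit status when it terminates.
        "-semihosting-config", "enable=on,target=native",
        "-kernel", elf_path,
    ]
```

The resulting list can be passed to `subprocess.run`, with the process exit code and the bundled-IO PASS line in stdout deciding the test result.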
@luhenry luhenry force-pushed the riscv-testing-baremetal branch from 7cc42fe to 00d0173 Compare June 1, 2026 16:59