- LLM library
- Prerequisites
- Quick start
- Cross Compilation for Android
- To build an executable benchmark binary
- Supported Platforms
- Configuration options
- Contributions
- Troubleshooting
- Trademarks
- License
This repo is designed for building an Arm® KleidiAI™ enabled LLM library using the CMake build system. It provides an abstraction over the different Machine Learning frameworks/backends that Arm® KleidiAI™ kernels have been integrated into. Currently, it supports the llama.cpp, mediapipe, onnxruntime-genai, and MNN backends. The backend library (selected at the CMake configuration stage) is wrapped by this project's thin C++ layer, which can be used directly for testing and evaluation. JNI bindings are also provided for developers targeting Android™-based applications.
- A Linux®-based operating system is recommended (this repo is tested on Ubuntu® 22.04.4 LTS)
- An Android™ or Linux® device with an Arm® CPU is recommended as a deployment target, but this library can be built for any native machine.
- CMake 3.28 or above installed
- Python 3.9 or above installed; Python is used to download test resources and models
- Android™ NDK (if building for Android™). The minimum recommended version is 29.0.14206865; it can be downloaded from here.
- Building on macOS requires the Xcode Command Line Tools, Android Studio installed and configured (NDK and CMake as above), and Clang (tested with 16.0.0)
- Bazelisk or Bazel 7.4.1 to build mediapipe backend
- Aarch64 GNU toolchain (version 14.1 or later) if cross-compiling from a Linux® based system, which can be downloaded from here
- Java Development Kit, required for building the JNI wrapper library necessary to use this module in an Android/Java application.
- Create a Hugging Face account and obtain a Hugging Face access token.
The project can be built and LLM tests exercised by simply running the following commands on supported platforms:
cmake --preset=native -B build
cmake --build ./build
ctest --test-dir ./build

The commands above will use the default LLM framework (llama.cpp) and download a small number of LLM models. The tests exercise both vision and text queries. See LlmTest.cpp & LlmTestJNI.java for details.
The ctest --test-dir ./build command above should produce results similar to those given below (timings may vary):
Internal ctest changing into directory: /home/user/llm/build
Test project /home/user/llm/build
Start 1: llm-cpp-ctest
1/2 Test #1: llm-cpp-ctest .................... Passed 4.16 sec
Start 2: llama-jni-ctest
2/2 Test #2: llama-jni-ctest .................. Passed 3.25 sec
100% tests passed, 0 tests failed out of 2

Cross compilation is also supported, allowing the project to build binaries targeting an OS/CPU architecture different from that of the host/build machine. For example, it is possible to build the project on a Linux x86_64 platform and produce binaries for Android™:
export NDK_PATH=/home/username/ndk
cmake --preset=x-android-aarch64 -B build
cmake --build ./build

However, the binaries need to be uploaded to an Android™ device to exercise the tests. See the section below for additional cross-compilation options.
To build a standalone benchmark binary, add the configuration option -DBUILD_BENCHMARK=ON to any of the build commands above. For example:
On AArch64:
cmake -B build --preset=native -DCPU_ARCH=Armv8.2_4 -DBUILD_BENCHMARK=ON
cmake --build ./build

The supported build platforms and CMake presets matrix is given below. The CMake presets (aka build targets) are listed in the first column and the build platforms in the first row. For example, native builds have been tested on Linux-x86_64, Linux-aarch64 & macOS-aarch64, while x-android-aarch64 builds (targeting Android™ devices running on aarch64) have only been tested on Linux-x86_64 & macOS-aarch64.
| cmake-preset / Host Platform | Linux-x86_64 | Linux-aarch64 | macOS-aarch64 | Android™ |
|---|---|---|---|---|
| native | ✅ | ✅ * | ✅ | - |
| x-android-aarch64 | ✅ | - | ✅ | - |
| x-linux-aarch64 | ✅ | ✅ † | - | - |
* Linux-aarch64 requires the CPU_ARCH build flag when selecting llama.cpp.

† Use the 'native' preset.
Configuration options are divided into two parts. The first part (covered in this section) is the overall project configuration. The second part covers configuration options relating to the specific LLM framework being used, e.g. llama.cpp, ONNX or MediaPipe; these are covered in the sections that follow.
Configuration options can be used with the CMake presets. For example, aarch64 CPU hardware acceleration can be disabled by setting USE_KLEIDIAI=OFF, which is useful when measuring the performance uplift due to Arm CPU hardware acceleration:
cmake --preset=native -B build -DUSE_KLEIDIAI=OFF
cmake --build ./build
ctest --test-dir ./build

LLM_FRAMEWORK can be used to select the LLM framework, e.g.
cmake --preset=native -B build -DLLM_FRAMEWORK=onnxruntime-genai
cmake --build ./build
ctest --test-dir ./build

Details of configurable build options are given below:
| Flag name | Default | Values | Description |
|---|---|---|---|
| LLM_FRAMEWORK | llama.cpp | llama.cpp / mediapipe / onnxruntime-genai / mnn | Specifies the backend framework to be used. |
| BUILD_DEBUG | OFF | ON/OFF | If set to ON a debug build is configured. |
| ENABLE_STREAMLINE | OFF | ON/OFF | Enables Arm Streamline timeline annotations for analyzing LLM initialization, encode, decode, and control-path performance. |
| BUILD_LLM_TESTING | ON | ON/OFF | Builds the project's functional tests when ON. |
| BUILD_BENCHMARK | OFF | ON/OFF | Builds the framework's benchmark binaries and arm-llm-bench-cli for the project when ON. |
| BUILD_JNI_LIB | ON | ON/OFF | Builds the JNI bindings for the project. |
| LOG_LEVEL | INFO/DEBUG | DEBUG, INFO, WARN & ERROR | For BUILD_DEBUG=OFF the default value is INFO. For BUILD_DEBUG=ON, the default value is DEBUG. |
| USE_KLEIDIAI | ON | ON/OFF | Build the project with KLEIDIAI CPU optimizations; if set to OFF, optimizations are turned off. |
| CPU_ARCH | Not defined | Armv8.2_1, Armv8.2_2, Armv8.2_3, Armv8.2_4, Armv8.2_5, Armv8.6_1, Armv9.0_1_1, armv9.2_1_1, armv9.2_2_1 | Sets the target ISA architecture (AArch64) to ensure SVE is not enabled when LLM_FRAMEWORK=llama.cpp (issue affects aarch64 only). |
| GGML_METAL | OFF | ON/OFF | macOS specific. Enables Apple Metal backend in ggml for GPU acceleration (Apple Silicon only). |
| GGML_BLAS | OFF | ON/OFF | macOS specific. Enables Accelerate/BLAS backend in ggml for CPU-optimized linear algebra kernels. |
Currently there are issues with the SVE integration in the llama.cpp backend on aarch64. To ensure this feature is not enabled, we enforce using one of our provided CPU_ARCH presets, which guarantee that the compiler flags do not enable SVE at build time. The table below gives the mapping of our preset CPU_ARCH flags to some common CPU feature flag sets. Other permutations are also supported and can be tailored accordingly. If you intend to use specific features, you must ensure your CPU implements them (e.g. i8mm, which was optional in Armv8.2). Compilers also need to support any chosen features.
| CPU_ARCH | C/C++ compiler flags |
|---|---|
| Armv8.2_1 | -march=armv8.2-a+dotprod |
| Armv8.2_2 | -march=armv8.2-a+dotprod+fp16 |
| Armv8.2_3 | -march=armv8.2-a+dotprod+fp16+sve |
| Armv8.2_4 | -march=armv8.2-a+dotprod+i8mm |
| Armv8.2_5 | -march=armv8.2-a+dotprod+i8mm+sve+sme |
| Armv8.6_1 | -march=armv8.6-a+dotprod+fp16+i8mm |
| Armv9.0_1_1 | -march=armv8.6-a+dotprod+fp16+i8mm+nosve |
| *armv9.2_1_1 | -march=armv9.2-a+dotprod+fp16+nosve+i8mm+sme |
| *armv9.2_2_1 | -march=armv9.2-a+dotprod+fp16+nosve+i8mm+sme |
- Note: Different capitalisation for v9.2 presets.
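As a quick reference, the preset-to-flags mapping in the table above can be expressed as a small shell helper. This is illustrative only — it is not part of this repo; the actual mapping lives in the project's CMake configuration:

```shell
# Illustrative helper: maps the documented CPU_ARCH presets to the
# compiler flags they select (taken from the table above).
cpu_arch_flags() {
  case "$1" in
    Armv8.2_1)   echo "-march=armv8.2-a+dotprod" ;;
    Armv8.2_2)   echo "-march=armv8.2-a+dotprod+fp16" ;;
    Armv8.2_3)   echo "-march=armv8.2-a+dotprod+fp16+sve" ;;
    Armv8.2_4)   echo "-march=armv8.2-a+dotprod+i8mm" ;;
    Armv8.2_5)   echo "-march=armv8.2-a+dotprod+i8mm+sve+sme" ;;
    Armv8.6_1)   echo "-march=armv8.6-a+dotprod+fp16+i8mm" ;;
    Armv9.0_1_1) echo "-march=armv8.6-a+dotprod+fp16+i8mm+nosve" ;;
    armv9.2_1_1) echo "-march=armv9.2-a+dotprod+fp16+nosve+i8mm+sme" ;;
    armv9.2_2_1) echo "-march=armv9.2-a+dotprod+fp16+nosve+i8mm+sme" ;;
    *) echo "unknown CPU_ARCH preset: $1" >&2; return 1 ;;
  esac
}
cpu_arch_flags Armv8.2_4   # prints -march=armv8.2-a+dotprod+i8mm
```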
NOTE: If you need a specific version of Java, set the path in the JAVA_HOME environment variable:

export JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64

A failure to locate "jni.h" occurs if a compatible JDK is not on the system path. If you want to experiment with the repository without JNI libs, turn the BUILD_JNI_LIB option off by configuring with -DBUILD_JNI_LIB=OFF.
- DOWNLOADS_LOCK_TIMEOUT: A timeout value in seconds indicating how long to wait for a lock when downloading resources. This is a one-time download that the CMake configuration will initiate unless it has already been run by the user directly or by a prior CMake configuration. The lock prevents multiple CMake configuration processes running in parallel from downloading files to the same location.
- LLM_LOG_LEVEL: Chooses the logging level ("ERROR", "WARN", "INFO", "DEBUG"). If LLM_LOG_LEVEL is not provided, it is inferred from the CMake build type.
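The default-selection rule for the logging level can be sketched as follows. This is only an illustration of the behaviour described above (an explicit LLM_LOG_LEVEL wins; otherwise debug builds default to DEBUG and release builds to INFO), not code from the repo:

```shell
# Hypothetical helper mirroring the documented log-level defaults.
infer_log_level() {
  explicit="$1"     # value passed as -DLLM_LOG_LEVEL=..., may be empty
  build_debug="$2"  # value of BUILD_DEBUG (ON/OFF)
  if [ -n "$explicit" ]; then
    echo "$explicit"
  elif [ "$build_debug" = "ON" ]; then
    echo "DEBUG"
  else
    echo "INFO"
  fi
}
infer_log_level "" OFF   # prints INFO
```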
There are different conditional options for different frameworks.
For llama.cpp as the framework, these configuration parameters can be set:
- LLAMA_SRC_DIR: Source directory path that will be populated by the CMake configuration.
- LLAMA_GIT_URL: Git URL to clone the sources from.
- LLAMA_GIT_SHA: Git SHA for checkout.
- LLAMA_BUILD_COMMON: Build llama's dependency Common; enabled by default.
- LLAMA_CURL: Enable HTTP transport via libcurl for remote models or features requiring network communication; disabled by default.
When using onnxruntime-genai, the onnxruntime dependency will be built from source. To customize
the versions of both onnxruntime and onnxruntime-genai, the following configuration parameters
can be used:
onnxruntime:
- ONNXRUNTIME_SRC_DIR: Source directory path that will be populated by the CMake configuration.
- ONNXRUNTIME_GIT_URL: Git URL to clone the sources from.
- ONNXRUNTIME_GIT_TAG: Git SHA for checkout.
onnxruntime-genai:
- ONNXRT_GENAI_SRC_DIR: Source directory path that will be populated by the CMake configuration.
- ONNXRT_GENAI_GIT_URL: Git URL to clone the sources from.
- ONNXRT_GENAI_GIT_TAG: Git SHA for checkout.
NOTE: This repository has been tested with onnxruntime version v1.24.2 and onnxruntime-genai version v0.12.0.
For customising the mediapipe framework, the following parameters can be used:
- MEDIAPIPE_SRC_DIR: Source directory path that will be populated by the CMake configuration.
- MEDIAPIPE_GIT_URL: Git URL to clone the sources from.
- MEDIAPIPE_GIT_TAG: Git SHA for checkout.
Building mediapipe for aarch64 on an x86_64 Linux based system requires downloading the Aarch64 GNU toolchain from here; the following configuration flag needs to be provided when building:
- BASE_PATH: Provides the top-level directory of the aarch64 GNU toolchain; if not provided, the build script will download the latest Arm GNU toolchain for cross-compilation.
NOTE: Support for mediapipe is experimental and the current focus is to support the Android™ platform. Please note that the latest Arm GNU Toolchain version (14.3) may depend on libraries present in Ubuntu® 24.04.4 LTS when cross-compiling.
Support for macOS® and Windows has not been added in this release.
For customising the MNN framework, the following parameters can be used:
- MNN_SRC_DIR: Source directory path that will be populated by the CMake configuration.
- MNN_GIT_URL: Git URL to clone the sources from.
- MNN_GIT_TAG: Git SHA for checkout.
NOTE: This repository has been tested with MNN version v3.3.0.
NOTE (KleidiAI™): Although MNN can be built with USE_KLEIDIAI defined, the current MNN implementation does not fully enable KleidiAI™ optimizations at runtime. This limitation is due to the current MNN runtime initialization logic and will be resolved once full support is implemented upstream in MNN.
When targeting the llama.cpp LLM backend and Android (--preset=x-android-aarch64), BUILD_SHARED_LIBS=ON is automatically configured. This ensures the build generates shared libraries, allowing the optimal hardware accelerated libraries to be loaded for the particular device at runtime.
| Framework / Backend | Supported Models | Licenses |
|---|---|---|
| llama.cpp | phi-2, qwen-2-VL, llama-3.2-1B | mit, apache-2.0, Llama-3.2-1B |
| onnxruntime-genai | phi4-mini-instruct | mit |
| mediapipe | gemma-2B | Gemma |
| mnn | qwen-2.5-VL, llama-3.2-1B | apache-2.0, Llama-3.2-1B |
This project uses the phi-2 model as its default network for llama.cpp framework.
The model is distributed using the Q4_0 quantization format, which is highly recommended as it
delivers effective inference times by striking a balance between computational efficiency and model performance.
- You can access the model from Hugging Face.
- The default model configuration is declared in the requirements.json file.
However, any model supported by the backend library could be used.
NOTE: Currently only Q4_0 models are accelerated by Arm® KleidiAI™ kernels in
llama.cpp.
The llama.cpp backend also supports multimodal (image + text) inference in this project.
What you need
- A compatible text model (GGUF).
- A matching vision projection (mmproj) file (GGUF) for your chosen text model
How to enable

Use these fields in your configuration file:
- llmModelName — text model (GGUF)
- llmMmProjModelName — vision projection (GGUF) for multimodal
- isvision — set "true" to enable multimodal
If "isvision" is set to "true", a valid llmMmProjModelName is required; omitting an "image" in the query runs the backend in text-only mode.
You can find an example of multimodal settings in llamaVisionConfig-qwen2-vl-2B.json.
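Put together, a multimodal configuration file might look like the sketch below. The field names come from the list above, but the file names are placeholders — this is not the contents of the shipped llamaVisionConfig-qwen2-vl-2B.json:

```json
{
  "llmModelName": "<text-model>.gguf",
  "llmMmProjModelName": "<matching-mmproj>.gguf",
  "isvision": "true"
}
```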
This project uses the Phi-4-mini-instruct-onnx as its default network for onnxruntime-genai framework.
The model is distributed using int4 quantization format with the block size: 32, which is highly recommended as it
delivers effective inference times by striking a balance between computational efficiency and model performance.
- You can access the model from Hugging Face.
- The default model configuration is declared in the requirements.json file.
However, any model supported by the backend library could be used.
To use an ONNX model with this framework, the following files are required:
- genai_config.json: Configuration file
- model_name.onnx: ONNX model
- model_name.onnx.data: ONNX model data
- tokenizer.json: Tokenizer file
- tokenizer_config.json: Tokenizer config file
These files are essential for loading and running ONNX models effectively.
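As an illustrative aid (not part of this repo), a small shell check can confirm that the fixed-name files from the list above are present before pointing the framework at a model directory; the .onnx files carry the model's own name, so they are matched by extension:

```shell
# Hypothetical pre-flight check for an onnxruntime-genai model directory.
check_onnx_model_dir() {
  dir="$1"
  status=0
  for f in genai_config.json tokenizer.json tokenizer_config.json; do
    [ -e "$dir/$f" ] || { echo "missing: $f"; status=1; }
  done
  # model files are named after the model itself, so match by extension
  set -- "$dir"/*.onnx
  [ -e "$1" ] || { echo "missing: *.onnx model file"; status=1; }
  return $status
}
# Example invocation (path assumed, matching the benchmark section below):
# check_onnx_model_dir resources_downloaded/models/onnxruntime-genai/phi-4-mini/
```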
NOTE: Currently only int4 and block size 32 models are accelerated by Arm® KleidiAI™ kernels in
onnxruntime-genai.
To use the Gemma 2B model, add your Hugging Face access token to the build environment after accepting the Gemma license.
export HF_TOKEN=<your hugging-face access token>

or append the following lines to your ~/.netrc file:
machine huggingface.co
login <your-username-or-email>
password <your-huggingface-access-token>
Ensure the .netrc file is secured with the correct permissions.
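For example, the file can be created and locked down like this. It is shown against a temporary example path so nothing is overwritten; for real use, write to ~/.netrc instead, and note the token value is a placeholder:

```shell
# Sketch: create the credentials file and restrict its permissions.
NETRC="${TMPDIR:-/tmp}/netrc.example"   # use "$HOME/.netrc" for real use
cat > "$NETRC" <<'EOF'
machine huggingface.co
login my-hf-username
password hf_xxxxxxxxxxxxxxxx
EOF
chmod 600 "$NETRC"      # owner read/write only
stat -c '%a' "$NETRC"   # prints 600 on Linux (use 'stat -f %Lp' on macOS)
```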
Alternatively, you can quantize other models listed in conversion colab from Hugging Face to TensorFlow Lite™ (.tflite) format. Copy the resulting 4-bit models to resources_downloaded/models/mediapipe.
It is recommended to use mediapipe python package version 0.10.15 for stable conversion to 4-bit models.
This project uses the Llama 3.2 1B model as its default network for the MNN framework. The model is distributed using the 4-bit quantization format, which is highly recommended as it delivers efficient inference performance while maintaining strong text generation quality on Arm® CPUs.
- You can access the text model from Hugging Face
- The model configuration is declared in the requirements.json file.
However, any model supported by the MNN backend library can be used.
To use an MNN model with this framework, the following files are required:
- config.json: Model configuration file
- llm.mnn: Main MNN model file
- llm.mnn.json: Model metadata file generated by the MNN conversion process
- llm.mnn.weight: Model weight file (used when weights are stored separately)
- llm_config.json: Model-specific configuration file
- tokenizer.txt: Tokenizer definition file
- embeddings_bf16.bin: (optional) Used by some models that store embeddings separately. If this file exists, download it; otherwise, embeddings are already included in the main weights.
These files are essential for loading and running MNN models effectively.
The MNN backend also supports multimodal (image + text) inference in this project.
- You can access the vision model from Hugging Face
What you need
- visual.mnn: Vision model metadata file generated by the MNN conversion process
- visual.mnn.weight: Vision model weight file (used when weights are stored separately)
NOTE: The MNN backend determines whether multimodal mode is active from the is_visual field inside the model's llm_config.json.
You can find an example multimodal configuration in mnnVisionConfig-qwen2.5-3B.json.
To build for an aarch64 Linux system:
cmake -B build --preset=native -DCPU_ARCH=Armv8.2_5
cmake --build ./build

Once built, a standalone application can be executed to measure performance.
If FEAT_SME is available on the deployment target, the environment variable GGML_KLEIDIAI_SME can be used to toggle the use of SME kernels during execution for llama.cpp. For example:
GGML_KLEIDIAI_SME=1 ./build/bin/llama-cli -m resources_downloaded/models/llama.cpp/model.gguf -t 1 -p "What is a car?"

To run without invoking SME kernels, set GGML_KLEIDIAI_SME=0 during execution:

GGML_KLEIDIAI_SME=0 ./build/bin/llama-cli -m resources_downloaded/models/llama.cpp/model.gguf -t 1 -p "What is a car?"

NOTE: In some cases, it may be desirable to build a statically linked executable. For the llama.cpp backend this can be done by adding these configuration parameters to the CMake command for Clang or GNU toolchains:
-DCMAKE_EXE_LINKER_FLAGS="-static" \
-DGGML_OPENMP=OFF
To build for the CPU backend on macOS®, you can use the native CMake toolchain.
cmake -B build --preset=native
cmake --build ./build

NOTE: If you need a specific version of Java, set the path in the JAVA_HOME environment variable:

export JAVA_HOME=$(/usr/libexec/java_home)
Once built, a standalone application can be executed to measure performance. If FEAT_SME is available on the deployment target, the environment variable GGML_KLEIDIAI_SME can be used to toggle the use of SME kernels during execution for llama.cpp. For example:
GGML_KLEIDIAI_SME=1 ./build/bin/llama-cli -m resources_downloaded/models/llama.cpp/model.gguf -t 1 -p "What is a car?"

To run without invoking SME kernels, set GGML_KLEIDIAI_SME=0 during execution:

GGML_KLEIDIAI_SME=0 ./build/bin/llama-cli -m resources_downloaded/models/llama.cpp/model.gguf -t 1 -p "What is a car?"

You can run either executable from the command line and add your own prompt, for example:
./build/bin/llama-cli -m resources_downloaded/models/llama.cpp/phi-2/phi2_Q4_model.gguf --prompt "What is the capital of France"
More information can be found at llama.cpp/examples/main/README.md on how this executable can be run.
You can run model_benchmark executable from command line:
./build/bin/model_benchmark -i resources_downloaded/models/onnxruntime-genai/phi-4-mini/
More information can be found at onnxruntime-genai/benchmark/c/readme.md on how this executable can be run.
You can run llm_bench executable from command line:
./build/bin/llm_bench -m resources_downloaded/models/mnn/llama-3.2-1b/config.json -t 4 -p 128 -n 64
The Arm LLM Benchmark tool (arm-llm-bench-cli) is a framework-agnostic, standalone executable designed to measure both prompt-processing and token-generation performance across all supported LLM backends.
Supported Frameworks
- llama.cpp
- onnxruntime-genai
- MNN
- mediapipe
Instead of writing your own prompts or relying on framework-specific benchmarking tools, arm-llm-bench-cli provides a unified benchmarking pipeline. It automatically detects the backend specified in the LLM configuration file and benchmarks it consistently. The tool repeatedly runs the LLM prompt-processing and token-generation operations and reports timing and throughput metrics in a standardized format.
NOTE: To build arm-llm-bench-cli, enable the benchmarking flag in CMake by setting -DBUILD_BENCHMARK=ON.
Measures
- Encode time and encode tokens/s
- Decode time and decode tokens/s
- Time-to-first-token (TTFT)
- Total latency per iteration
- Supports warm-up iterations (ignored in statistics)
Usage
./build/bin/arm-llm-bench-cli \
--model <model_path> | -m <model_path> \
--input <tokens> | -i <tokens> \
--output <tokens> | -o <tokens> \
--threads <num_threads> | -t <num_threads> \
--iterations <num_iterations> | -n <num_iterations> \
[ --context <tokens> | -c <tokens> ] \
[ --json-output <path> | -j <path> ] \
[ --warmup <warmup_iterations> | -w <warmup_iterations> ]
NOTE: On-device execution requires that arm-llm-bench-cli and its backend shared libraries reside in the same directory. Builds using GGML_OPENMP=ON additionally require libomp.so to be placed in that directory as well.
Example
./build/bin/arm-llm-bench-cli \
-m ./resources_downloaded/models/llama.cpp/llama-3.2-1b/Llama-3.2-1B-Instruct-Q4_0.gguf \
-i 128 \
-o 64 \
-c 2048 \
-t 4 \
-n 3 \
-w 1
Terminal Output:
INFO : Running 1 warmup iteration(s) (results ignored)...
=== ARM LLM Benchmark ===
Parameters:
model_path : ./resources_downloaded/models/llama.cpp/llama-3.2-1b/Llama-3.2-1B-Instruct-Q4_0.gguf
num_input_tokens : 128
num_output_tokens : 64
context_size : 2048
num_threads : 4
num_iterations : 3
num_warmup : 1
======= Results =========
| Framework | Threads | Test | Performance |
| ------------------ | ------- | ------ | -------------------------- |
| llama.cpp | 5 | pp128 | 204.149 ± 4.316 (t/s) |
| llama.cpp | 5 | tg64 | 48.029 ± 0.080 (t/s) |
| llama.cpp | 5 | TTFT | 648.401 ± 13.798 (ms) |
| llama.cpp | 5 | Total | 1959.827 ± 14.433 (ms) |
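The results table is plain text, so it can be post-processed with standard tools. For example, this small awk sketch (not part of the tool) pulls the mean tokens/s figure for a given test row, using the sample rows from above:

```shell
# Illustrative post-processing: extract the mean value from the
# Performance column for a named test (e.g. pp128, tg64).
extract_tps() {   # $1: test name; table on stdin
  awk -F'|' -v t="$1" \
    '$4 ~ t { gsub(/^[ \t]+|[ \t]+$/, "", $5); split($5, a, " "); print a[1] }'
}
extract_tps pp128 <<'EOF'   # prints 204.149
| Framework | Threads | Test | Performance |
| ------------------ | ------- | ------ | -------------------------- |
| llama.cpp | 5 | pp128 | 204.149 ± 4.316 (t/s) |
| llama.cpp | 5 | tg64 | 48.029 ± 0.080 (t/s) |
EOF
```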
For a list of common errors and their fixes, see TROUBLESHOOTING.md.
The LLM-Runner welcomes contributions. For more details on contributing to the repo, please see the contributors guide.
- Arm® and KleidiAI™ are registered trademarks or trademarks of Arm Limited (or its subsidiaries) in the US and/or elsewhere.
- Android™ and TensorFlow Lite™ are trademarks of Google LLC.
- macOS® is a trademark of Apple Inc.
This project is distributed under the software licenses in the LICENSES directory. The licenses of supported models can be seen in the Supported Models section.