CUDA matrix multiplication benchmarking on Jetson Orin Nano. Four implementations, three power modes, five matrix sizes. 99.5% mathematical validation. C++/CUDA and Python.
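A minimal sketch of the kind of measurement such a benchmark performs, written in PyTorch rather than the repo's C++/CUDA kernels; the matrix sizes, iteration count, and validation tolerance are illustrative, not the project's actual values.

```python
# Time an n x n FP32 matmul with CUDA events and validate against a CPU
# reference — a PyTorch stand-in for a hand-written CUDA benchmark loop.
import torch

def bench_matmul(n: int, iters: int = 20) -> float:
    """Return mean kernel time in ms for an n x n FP32 matmul."""
    a = torch.randn(n, n, device="cuda")
    b = torch.randn(n, n, device="cuda")
    _ = a @ b                      # warm-up (cuBLAS init)
    torch.cuda.synchronize()

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        c = a @ b
    end.record()
    torch.cuda.synchronize()

    # Validate against a CPU reference within a loose FP32 tolerance.
    ref = a.cpu() @ b.cpu()
    assert torch.allclose(c.cpu(), ref, rtol=1e-3, atol=1e-3)
    return start.elapsed_time(end) / iters

if __name__ == "__main__":
    for n in (256, 512, 1024, 2048, 4096):  # illustrative sizes
        print(f"{n}x{n}: {bench_matmul(n):.3f} ms")
```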
Run a 2-min local benchmark → predict how long your AI job will take on a cloud GPU (T4/V100/A100). No guessing, no wasted money.
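One plausible way such a predictor could work is to scale a locally measured runtime by the ratio of peak throughputs; the TFLOPS figures below are approximate datasheet numbers and the linear-scaling assumption is a simplification, not the project's actual model.

```python
# Hypothetical prediction step: measure a short fixed workload locally,
# then scale by published peak-FP16 throughput ratios.
PEAK_FP16_TFLOPS = {"T4": 65.0, "V100": 125.0, "A100": 312.0}  # approx.

def predict_cloud_seconds(local_seconds: float, local_tflops: float) -> dict:
    """Scale a locally measured runtime to each cloud GPU by peak-FLOPS ratio."""
    return {
        gpu: local_seconds * local_tflops / peak
        for gpu, peak in PEAK_FP16_TFLOPS.items()
    }

# e.g. a job that took 120 s locally on a ~20 TFLOPS card:
print(predict_cloud_seconds(120.0, 20.0))
```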
Standalone LLM inference benchmarking pipelines for AMD GPUs using ROCm, vLLM, and MAD, plus data visualization scripts.
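A minimal offline-throughput sketch with vLLM's Python API (`LLM` and `SamplingParams` are real vLLM classes; the model name and prompt set are placeholders). vLLM ships ROCm builds as well as CUDA ones, so the same script runs on AMD GPUs.

```python
# Measure offline generation throughput (tokens/s) with vLLM.
import time
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # placeholder model
params = SamplingParams(max_tokens=128, temperature=0.0)
prompts = ["Explain GPU benchmarking in one sentence."] * 32

t0 = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - t0

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated} tokens in {elapsed:.1f} s -> {generated / elapsed:.1f} tok/s")
```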
🔍 Analyze CUDA matrix multiplication performance and power consumption on NVIDIA Jetson Orin Nano across multiple implementations and settings.
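Power measurement on Jetson boards is typically done by sampling `tegrastats` while a kernel runs. The sketch below assumes a rail named `VDD_IN` reported as `NNNNmW/NNNNmW` (instantaneous/average), which matches typical Jetson output but varies by model; check your device's actual output before relying on it.

```python
# Hedged sketch: sample board power by parsing tegrastats lines.
import re
import subprocess

RAIL = re.compile(r"VDD_IN (\d+)mW/(\d+)mW")  # assumed field format

def sample_power(samples: int = 10) -> list[int]:
    """Collect `samples` instantaneous VDD_IN readings (mW) from tegrastats."""
    proc = subprocess.Popen(
        ["tegrastats", "--interval", "500"],  # one line every 500 ms
        stdout=subprocess.PIPE, text=True,
    )
    readings = []
    try:
        for line in proc.stdout:
            m = RAIL.search(line)
            if m:
                readings.append(int(m.group(1)))
            if len(readings) >= samples:
                break
    finally:
        proc.terminate()
    return readings

print(sample_power())
```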
benchHUB is a Python-based project to parse, aggregate, and visualize system and performance benchmarks. It includes a Streamlit dashboard to display and compare results.
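A minimal Streamlit viewer in the spirit of benchHUB; the `results/*.json` layout and the `name`/`score` fields are hypothetical, not benchHUB's actual schema.

```python
# dashboard.py — run with: streamlit run dashboard.py
import json
from pathlib import Path

import pandas as pd
import streamlit as st

st.title("Benchmark results")

# Assume one JSON object per run: {"name": ..., "score": ...} (hypothetical).
rows = [json.loads(p.read_text()) for p in Path("results").glob("*.json")]
df = pd.DataFrame(rows)

st.dataframe(df)                              # raw table
st.bar_chart(df.set_index("name")["score"])   # quick comparison
```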
One-shot script to audit GPU, CUDA, PyTorch, CPU, and disk performance before debugging a slow or broken ML environment.
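A sketch of the kind of checks such an audit performs; the specific checks, sizes, and output format here are illustrative rather than the script's actual contents.

```python
# Quick environment audit: PyTorch/CUDA visibility, a GPU sanity timing,
# and free disk space.
import shutil
import time

import torch

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))
    # GPU sanity check: time one medium matmul.
    x = torch.randn(4096, 4096, device="cuda")
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    _ = x @ x
    torch.cuda.synchronize()
    print(f"4096x4096 matmul: {(time.perf_counter() - t0) * 1e3:.1f} ms")

total, used, free = shutil.disk_usage("/")
print(f"disk: {free / 1e9:.1f} GB free of {total / 1e9:.1f} GB")
```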
GPT-2 (124M) fixed-work distributed training benchmark on NYU BigPurple (Slurm), scaling from 1 to 8× V100 across two nodes with DeepSpeed ZeRO-1 and FP16/AMP. Includes a reproducible harness that writes training_metrics.json, RUN_COMPLETE.txt, and launcher metadata per run, plus NCCL topology/log artifacts and Nsight Systems traces and summaries (NVTX + NCCL ranges).
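A hedged sketch of the per-run artifact convention the harness describes: a ZeRO-1 + FP16 DeepSpeed config plus `training_metrics.json` and `RUN_COMPLETE.txt` markers. The `fp16` and `zero_optimization` keys are standard DeepSpeed config fields; the directory layout and metric names are hypothetical.

```python
# Write a ZeRO-1/FP16 config and per-run completion artifacts.
import json
from pathlib import Path

ds_config = {
    "train_batch_size": 32,             # illustrative
    "fp16": {"enabled": True},          # standard DeepSpeed key
    "zero_optimization": {"stage": 1},  # ZeRO-1
}

run_dir = Path("runs/v100x8")           # hypothetical layout
run_dir.mkdir(parents=True, exist_ok=True)
(run_dir / "ds_config.json").write_text(json.dumps(ds_config, indent=2))

metrics = {"tokens_per_sec": 0.0, "world_size": 8}  # filled in by training
(run_dir / "training_metrics.json").write_text(json.dumps(metrics, indent=2))
(run_dir / "RUN_COMPLETE.txt").write_text("ok\n")   # completion marker
```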