Benchmark: Micro benchmark - Add float datatype support and other refinements to GPU Stream #769
WenqingLan1 wants to merge 11 commits into microsoft:main from …
Conversation
Codecov Report ✅ All modified and coverable lines are covered by tests.

```diff
@@           Coverage Diff           @@
##             main     #769   +/-   ##
=======================================
  Coverage   85.70%   85.70%
=======================================
  Files         102      102
  Lines        7703     7704    +1
=======================================
+ Hits         6602     6603    +1
  Misses       1101     1101
```

Flags with carried forward coverage won't be shown.
Pull request overview
Updates the GPU STREAM microbenchmark to support runtime-selectable FP32/FP64 execution and improve GPU memory bandwidth utilization, while aligning SuperBench integration (CLI, output tags, docs, and tests) to the new behavior.
Changes:
- Add `--data_type <float|double>` to select FP32/FP64 at runtime and propagate it through the Python benchmark wrapper and unit tests.
- Refactor CUDA kernels to use 128-bit vectorized accesses (`double2`/`float4`) and move template kernel implementations into a header for cross-TU instantiation.
- Adjust execution/output to a single visible GPU (device 0 via `CUDA_VISIBLE_DEVICES`) and update metric/tag formats (removing `gpu_id`), plus docs/examples/test-log updates.
Reviewed changes
Copilot reviewed 11 out of 13 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| `tests/data/gpu_stream.log` | Updates golden log output to include data type and new tag format (no `gpu_id`). |
| `tests/benchmarks/micro_benchmarks/test_gpu_stream.py` | Extends command-generation assertions to include `--data_type` (currently only covers `double`). |
| `superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream_utils.hpp` | Removes NUMA/GPU iteration fields from args and adds `Opts::data_type`. |
| `superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream_utils.cpp` | Adds CLI parsing/printing for `--data_type`. |
| `superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream_main.cpp` | New entry point replacing the previous main file. |
| `superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream_kernels.hpp` | Introduces vector-type mapping and templated kernel definitions (128-bit loads/stores). |
| `superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream_kernels.cu` | Keeps a CUDA compilation unit and moves template implementations to the header. |
| `superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream.hpp` | Expands bench-args variant to support `float` and `double`. |
| `superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream.cu` | Uses local NUMA allocation, enforces 16B/thread sizing, launches templated vectorized kernels, updates tag format, and runs only CUDA device 0. |
| `superbench/benchmarks/micro_benchmarks/gpu_stream/CMakeLists.txt` | Switches target sources to the new `gpu_stream_main.cpp`. |
| `superbench/benchmarks/micro_benchmarks/gpu_stream.py` | Adds `--data_type` argument and forwards it to the binary. |
| `examples/benchmarks/gpu_stream.py` | Updates example invocation to include `--data_type double`. |
| `docs/user-tutorial/benchmarks/micro-benchmarks.md` | Updates gpu-stream metric patterns to include `(double \|` … |
Pull request overview
Copilot reviewed 12 out of 14 changed files in this pull request and generated 2 comments.
```cpp
args->num_loops = opts_.num_loops;
args->size = opts_.size;
args->check_data = opts_.check_data;
bench_args_.emplace_back(std::move(args));
```
Can we make it a template function?
```cpp
// Set the NUMA node
ret = numa_run_on_node(curr_args->numa_id);
if (ret != 0) {
```
numa_alloc_local allocates on the calling thread's current NUMA node at allocation time, which is correct. But if the OS later migrates the thread to a remote NUMA node, remote memory accesses introduce latency. Should we pin the thread to a fixed NUMA node?
```cpp
#if defined(__HIP_PLATFORM_HCC__) || defined(__HCC__) || defined(__HIPCC__)
    v = *p;
#else
    if constexpr (std::is_same<T, float>::value) {
```
Should we handle the "unsupported type" case here and in other similar places?