Adding average usec cost for timestamp collection on GPU#309
Open
gilbertlee-amd wants to merge 1 commit into
Open
Adding average usec cost for timestamp collection on GPU#309gilbertlee-amd wants to merge 1 commit into
gilbertlee-amd wants to merge 1 commit into
Conversation
Contributor
There was a problem hiding this comment.
Pull request overview
This PR extends the wallclock preset to empirically estimate the GPU-side cost of collecting wallclock timestamps by running an additional kernel that reads timestamps in a tight loop and reports an averaged per-timestamp cost in microseconds.
Changes:
- Added a new
GetTimestampCostGPU kernel to measure timestamp-collection overhead. - Introduced
NUM_TIMESTAMPSenv var and collected per-iteration cost data alongside existing timestamp results. - Extended the results table to include a new “TS cost(usec)” column.
Comments suppressed due to low confidence (2)
src/client/Presets/WallClock.hpp:39
- In GetTimestampCost,
tempis uninitialized when numTimestamps==0 (the loop doesn't execute), but it is still read inif (temp != 0), which is undefined behavior. Initializetempor restructure the kernel so it doesn't read an uninitialized value when NUM_TIMESTAMPS=0.
uint64_t temp;
for (int i = 0; i < numTimestamps; i++) {
temp = GetTimestamp();
}
auto stop = GetTimestamp();
// temp will never be 0, but query to ensure that compiler doesn't optimize out the loop
if (temp != 0) {
cycleCount[blockIdx.x] = (stop - start);
src/client/Presets/WallClock.hpp:40
- GetTimestampCost only writes
cycleCount[blockIdx.x]whentemp != 0. If that condition ever fails (or if NUM_TIMESTAMPS=0 is handled differently later),cycleCountretains stale data from the previous use of the timestamps buffer and will be reported as a valid cost. Prefer writing an explicit value unconditionally (or explicitly setting cycleCount to 0 when the loop is skipped).
// temp will never be 0, but query to ensure that compiler doesn't optimize out the loop
if (temp != 0) {
cycleCount[blockIdx.x] = (stop - start);
}
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| // Check for env var consistency across ranks | ||
| IS_UNIFORM(numGpuDevices, "NUM_GPU_DEVICES"); | ||
| IS_UNIFORM(ev.numIterations, "NUM_ITERATIONS"); | ||
| IS_UNIFORM(ev.numIterations, "NUM_TIMESTAMPS"); |
Comment on lines
+284
to
285
| double overallAvgUsecPerTimsetamp = 0; | ||
|
|
Comment on lines
+31
to
+35
| uint64_t temp; | ||
| for (int i = 0; i < numTimestamps; i++) { | ||
| temp = GetTimestamp(); | ||
| } | ||
| auto stop = GetTimestamp(); |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Motivation
Collecting a GPU wallclock timestamp itself takes time.
This PR adds a kernel to the
wallclockpreset that tries to empirically calculate this time by collecting timestamps in a loop within a kernel then reporting the final averaged time.