Skip to content

Adding average usec cost for timestamp collection on GPU#309

Open
gilbertlee-amd wants to merge 1 commit into
ROCm:candidatefrom
gilbertlee-amd:WallClockUpgrade
Open

Adding average usec cost for timestamp collection on GPU#309
gilbertlee-amd wants to merge 1 commit into
ROCm:candidatefrom
gilbertlee-amd:WallClockUpgrade

Conversation

@gilbertlee-amd
Copy link
Copy Markdown
Collaborator

Motivation

Collecting a GPU wallclock timestamp itself takes time.
This PR adds a kernel to the wallclock preset that tries to empirically calculate this time by collecting timestamps in a loop within a kernel then reporting the final averaged time.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR extends the wallclock preset to empirically estimate the GPU-side cost of collecting wallclock timestamps by running an additional kernel that reads timestamps in a tight loop and reports an averaged per-timestamp cost in microseconds.

Changes:

  • Added a new GetTimestampCost GPU kernel to measure timestamp-collection overhead.
  • Introduced NUM_TIMESTAMPS env var and collected per-iteration cost data alongside existing timestamp results.
  • Extended the results table to include a new “TS cost(usec)” column.
Comments suppressed due to low confidence (2)

src/client/Presets/WallClock.hpp:39

  • In GetTimestampCost, temp is uninitialized when numTimestamps==0 (the loop doesn't execute), but it is still read in if (temp != 0), which is undefined behavior. Initialize temp or restructure the kernel so it doesn't read an uninitialized value when NUM_TIMESTAMPS=0.
  uint64_t temp;
  for (int i = 0; i < numTimestamps; i++) {
    temp = GetTimestamp();
  }
  auto stop = GetTimestamp();

  // temp will never be 0, but query to ensure that compiler doesn't optimize out the loop
  if (temp != 0) {
    cycleCount[blockIdx.x] = (stop - start);

src/client/Presets/WallClock.hpp:40

  • GetTimestampCost only writes cycleCount[blockIdx.x] when temp != 0. If that condition ever fails (or if NUM_TIMESTAMPS=0 is handled differently later), cycleCount retains stale data from the previous use of the timestamps buffer and will be reported as a valid cost. Prefer writing an explicit value unconditionally (or explicitly setting cycleCount to 0 when the loop is skipped).
  // temp will never be 0, but query to ensure that compiler doesn't optimize out the loop
  if (temp != 0) {
    cycleCount[blockIdx.x] = (stop - start);
  }

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

// Check for env var consistency across ranks
IS_UNIFORM(numGpuDevices, "NUM_GPU_DEVICES");
IS_UNIFORM(ev.numIterations, "NUM_ITERATIONS");
IS_UNIFORM(ev.numIterations, "NUM_TIMESTAMPS");
Comment on lines +284 to 285
double overallAvgUsecPerTimsetamp = 0;

Comment on lines +31 to +35
uint64_t temp;
for (int i = 0; i < numTimestamps; i++) {
temp = GetTimestamp();
}
auto stop = GetTimestamp();
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants