perf: Expand Benchmarks vs Upstream OpenTelemetry & CI Regression #1500

Merged: JacksonWeber merged 5 commits into microsoft:main from JacksonWeber:perf/expand-tests-and-regression-gate on May 22, 2026

JacksonWeber commented May 22, 2026

Adds four new perf scenarios so we can measure overhead of this package against equivalent upstream OpenTelemetry calls:

  • AzureMonitorSpanTest / AzureMonitorLogTest (useAzureMonitor + direct OTel API)

  • OtelSpanTest / OtelLogTest (plain @opentelemetry/sdk-trace-base & sdk-logs reference, informational only)

Introduces a deterministic benchmark runner (bench.mjs + runBenchmarks.mjs) that bypasses the @azure-tools/test-perf worker pool, runs each scenario in a fresh Node child process to avoid OTel global-state contamination, and emits structured JSON with median/mean/stdev across N samples.
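As a rough illustration of the per-scenario summary described above, the sketch below reduces a list of ops/s samples to median, mean, and standard deviation. The function name `summarize` and the output field names are hypothetical, and the use of the sample (n - 1) standard deviation is an assumption; the actual JSON shape emitted by runBenchmarks.mjs is not shown in this PR description.

```javascript
// Hypothetical helper, not the PR's code: summarizes ops/s samples for one scenario.
function summarize(samples) {
  if (samples.length === 0) {
    throw new Error("need at least one sample");
  }
  // Sort a copy numerically so the caller's array is left untouched.
  const sorted = [...samples].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  const median =
    sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
  const mean = samples.reduce((sum, x) => sum + x, 0) / samples.length;
  // Assumption: sample standard deviation (divide by n - 1); 0 for a single sample.
  const variance =
    samples.length > 1
      ? samples.reduce((sum, x) => sum + (x - mean) ** 2, 0) / (samples.length - 1)
      : 0;
  return { samples: samples.length, median, mean, stdev: Math.sqrt(variance) };
}

// One slow outlier (e.g. a GC pause) drags the mean down but leaves the median alone.
const summary = summarize([12000, 11800, 4000]);
console.log(JSON.stringify(summary));
```

In this made-up run the median stays at 11800 ops/s while the mean falls to roughly 9267 ops/s, which is the kind of single-run outlier the median is meant to absorb.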

Adds .github/workflows/performance.yml: packs both PR and base branch as tarballs via npm pack, installs each in turn under the PR's perf harness, runs the benchmark suite, and fails the job (blocking merge when set as a required check) if any gating scenario regresses beyond the configured threshold. Posts a sticky PR comment with the comparison table.

Regression limits

The gate is driven by PERF_REGRESSION_THRESHOLD (percent, set in .github/workflows/performance.yml).

  • Default threshold: 15%. A gating scenario fails the build only when its median throughput (ops/s) drops by more than 15% relative to the base branch. Anything from 0% down to -15% (inclusive) is treated as within acceptable noise and passes.
  • The threshold is compared against the median ops/s across samples (not mean), to reduce sensitivity to single-run outliers / GC jitter.
  • Improvements (positive Δ%) never fail the gate, regardless of magnitude.
  • Only scenarios marked tier: "gating" can fail the build:
    • TrackDependencyTest (gating)
    • TrackTraceTest (gating)
    • AzureMonitorSpanTest (gating)
    • AzureMonitorLogTest (gating)
    • OtelSpanTest / OtelLogTest are informational only — they are reported in the PR comment for like-for-like comparison against upstream OpenTelemetry but never block merge, since regressions there are not owned by this repo.
  • The threshold can be tightened or loosened per-branch by editing PERF_REGRESSION_THRESHOLD in the workflow env (e.g. set it to 10 for a stricter 10% gate); no code change to the runner is required.
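The rules above can be sketched as a small pure function. This is a minimal illustration, not the logic in comparePerf.mjs: the function name `evaluateGate`, the input field names (`baselineMedian`, `candidateMedian`), and the ops/s figures are all hypothetical. Only the tier names, the scenario names, and the 15% default come from the description.

```javascript
// Hypothetical sketch of the regression gate; names and data shapes are assumptions.
function evaluateGate(scenarios, thresholdPercent = 15) {
  const rows = scenarios.map((s) => {
    // Negative delta = throughput dropped relative to the base branch.
    const deltaPercent =
      ((s.candidateMedian - s.baselineMedian) / s.baselineMedian) * 100;
    // Strict comparison: a drop of exactly the threshold still passes.
    const regressed = deltaPercent < -thresholdPercent;
    // Informational scenarios are reported but can never fail the build.
    return {
      name: s.name,
      tier: s.tier,
      deltaPercent,
      failed: s.tier === "gating" && regressed,
    };
  });
  return { rows, exitCode: rows.some((r) => r.failed) ? 1 : 0 };
}

// Made-up medians, for illustration only.
const result = evaluateGate([
  { name: "TrackTraceTest", tier: "gating", baselineMedian: 100000, candidateMedian: 86000 },
  { name: "AzureMonitorLogTest", tier: "gating", baselineMedian: 12000, candidateMedian: 10000 },
  { name: "OtelSpanTest", tier: "informational", baselineMedian: 50000, candidateMedian: 30000 },
]);
for (const row of result.rows) {
  console.log(`${row.name}: ${row.deltaPercent.toFixed(1)}% ${row.failed ? "FAIL" : "ok"}`);
}
console.log(`exit code: ${result.exitCode}`);
```

Here the 14% drop passes, the roughly 16.7% drop in a gating scenario fails the run, and the 40% drop in an informational scenario is reported without affecting the exit code.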

… gate

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@JacksonWeber JacksonWeber requested a review from Copilot May 22, 2026 02:36
@JacksonWeber JacksonWeber changed the title from "perf: expand benchmarks vs upstream OpenTelemetry + add CI regression…" to "perf: Expand Benchmarks vs Upstream OpenTelemetry & CI Regression" May 22, 2026
Copilot AI left a comment

Pull request overview

This PR expands the test/performanceTests harness to add new OpenTelemetry-based span/log scenarios, introduces a deterministic multi-scenario benchmark runner (per-scenario child process + JSON output), and adds a GitHub Actions workflow to compare candidate vs baseline performance and gate PRs on regressions.

Changes:

  • Add four new perf scenarios: AzureMonitorSpanTest/AzureMonitorLogTest (gating) and OtelSpanTest/OtelLogTest (informational baseline).
  • Add bench.mjs, runBenchmarks.mjs, and comparePerf.mjs to run isolated benchmarks and produce/compare structured results.
  • Add .github/workflows/performance.yml to run baseline vs candidate benchmarks on PRs and post a sticky comparison comment.

Reviewed changes

Copilot reviewed 12 out of 13 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| test/performanceTests/test/otelSpan.spec.ts | New upstream OTel span baseline perf scenario. |
| test/performanceTests/test/otelLog.spec.ts | New upstream OTel log baseline perf scenario. |
| test/performanceTests/test/azureMonitorSpan.spec.ts | New useAzureMonitor() + OTel span perf scenario (gating). |
| test/performanceTests/test/azureMonitorLog.spec.ts | New useAzureMonitor() + OTel log perf scenario (gating). |
| test/performanceTests/test/index.spec.ts | Registers new scenarios and updates perf output capture + telemetry reporting. |
| test/performanceTests/test/appInsightsShim.spec.ts | Makes shim startup idempotent across multiple test instantiations. |
| test/performanceTests/bench.mjs | Single-scenario deterministic benchmark runner (no worker pool). |
| test/performanceTests/runBenchmarks.mjs | Multi-scenario runner: per-scenario child process + sampling + JSON summary. |
| test/performanceTests/comparePerf.mjs | Compares baseline vs candidate JSON and returns a gating exit code + Markdown. |
| test/performanceTests/README.md | Documents scenarios, tiers, manual runs, and regression CI behavior. |
| test/performanceTests/package.json | Updates perf harness deps and adds benchmark/compare scripts. |
| test/performanceTests/package-lock.json | Lockfile updates for new/updated perf harness dependencies. |
| .github/workflows/performance.yml | Adds PR perf regression workflow (pack+install both versions, benchmark, compare, comment). |
Files not reviewed (1)
  • test/performanceTests/package-lock.json: Language not supported

Comment thread .github/workflows/performance.yml
Comment thread test/performanceTests/package.json Outdated
Comment thread test/performanceTests/test/index.spec.ts Outdated
Comment thread test/performanceTests/runBenchmarks.mjs
Comment thread .github/workflows/performance.yml
JacksonWeber and others added 3 commits May 21, 2026 21:36
Fixes CI: add explicit npm run build of the perf harness before running benchmarks; previous run died with ERR_MODULE_NOT_FOUND because dist-esm was never produced.

Review feedback:

- workflow: switch perf harness install to npm ci; add --no-package-lock to tarball installs so the lockfile is not rewritten mid-run

- AzureMonitor scenarios: acquire @opentelemetry/api and @opentelemetry/api-logs via createRequire resolved from the installed applicationinsights, so the Tracer/Logger we benchmark is backed by the SAME api / api-logs instance that useAzureMonitor() mutated (otherwise a duplicate hoisted copy at the harness level would yield a no-op proxy and we'd silently measure nothing)

- OTel reference scenarios: use provider.getTracer / provider.getLogger directly instead of going through the global registry, eliminating dual-instance concerns for these

- index.spec.ts: capture console.log with rest args + util.format so multi-arg / non-string calls are formatted the same way Node would print them

- runBenchmarks.mjs: propagate child.error and child.signal in failure messages (spawnSync status can be null on spawn error or signal exit)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
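To illustrate the runBenchmarks.mjs point in the commit above: the object returned by Node's `child_process.spawnSync` carries `status`, `signal`, and `error`, and `status` is `null` both when the process could not be spawned and when it was terminated by a signal. A minimal sketch of a failure message that covers all three cases follows; the function name `describeChildFailure` and the message wording are hypothetical, not taken from the PR.

```javascript
// Hypothetical helper: turns a spawnSync-style result into a failure message,
// or returns null when the child exited cleanly.
function describeChildFailure(scenario, child) {
  if (child.error) {
    // Spawn itself failed (e.g. ENOENT); status is null in this case.
    return `${scenario}: failed to spawn (${child.error.message})`;
  }
  if (child.signal) {
    // Child was terminated by a signal; status is null here too.
    return `${scenario}: terminated by signal ${child.signal}`;
  }
  if (child.status !== 0) {
    return `${scenario}: exited with code ${child.status}`;
  }
  return null;
}

// Plain objects stand in for real spawnSync results.
console.log(describeChildFailure("OtelLogTest", { status: null, signal: "SIGKILL" }));
console.log(describeChildFailure("OtelLogTest", { status: 1, signal: null }));
```

Checking `error` and `signal` before `status` avoids a message like "exited with code null", which hides why the child actually stopped.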
Previous run cancelled at 25min job timeout: candidate ran 12min (5 samples x 6 scenarios x ~24s/sample on CI runners; AzureMonitor scenarios pay a 5-10s SDK init cost per fresh child process), baseline got 12min in before cancellation.

Cut samples 5 -> 3, duration 8s -> 5s, warmup 2s -> 1s. New estimate: ~5min per side, ~12min total with install/build. Bumped job timeout 25 -> 40 min for safety margin.

Median of 3 samples is still robust to a single outlier (the main source of CI flake), and 5s is enough sustained measurement time for even the slowest scenario (AzureMonitorLog at ~12k ops/s yields ~60k ops per sample).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
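The time budget in the commit above is simple multiplication, spelled out here as a quick check. The function and field names are hypothetical. The inputs are the figures quoted in the message (5 samples, 6 scenarios, about 24 s per sample, about 12k ops/s over a 5 s window).

```javascript
// Hypothetical helper: wall-clock seconds for one side (baseline or candidate).
function estimateSideSeconds({ samples, scenarios, secondsPerSample }) {
  return samples * scenarios * secondsPerSample;
}

// Original settings quoted in the commit message.
const oldSideSeconds = estimateSideSeconds({ samples: 5, scenarios: 6, secondsPerSample: 24 });
console.log(`old settings: ${oldSideSeconds / 60} min per side`);

// Slowest scenario at ~12k ops/s over the new 5 s measurement window.
const opsPerSample = 12000 * 5;
console.log(`ops per 5 s sample: ${opsPerSample}`);
```

At 12 minutes per side, baseline plus candidate already comes to 24 minutes before install and build, which is consistent with the run being cancelled at the 25-minute timeout.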
…api-logs

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>