Add benchmark pipeline with Rust-native A/B validation #912

Open
YuanyuanTian-hh wants to merge 41 commits into main from user/tianyuanyuan/benchmark-regression

Conversation


@YuanyuanTian-hh YuanyuanTian-hh commented Apr 7, 2026

Add benchmark regression pipeline with Rust-native A/B validation

Summary

Adds an automated benchmark regression pipeline to GitHub Actions. This PR builds on PR #900 (benchmark A/B test framework) by implementing the Regression trait for disk-index benchmarks and wiring it into CI workflows. The pipeline builds and searches two public 100K ANN datasets, compares before/after performance, and validates against configurable tolerances, all in typed Rust code with no Python dependencies.

What's Changed

Benchmark Runner (from PR #900)

  • Rust-native A/B test framework in diskann-benchmark-runner
  • Regression trait, tolerance matching, check run CLI subcommand

Disk-Index Regression Support (this PR)

| File | Description |
| --- | --- |
| diskann-benchmark/src/backend/disk_index/benchmarks.rs | Implements the Regression trait for DiskIndex<T> with typed before/after comparison of 7 metrics |
| diskann-benchmark/src/backend/disk_index/search.rs | Adds Deserialize to DiskSearchStats, DiskSearchResult |
| diskann-benchmark/src/backend/disk_index/build.rs | Adds Deserialize to DiskBuildStats; exposes build_time_seconds() |
| diskann-benchmark/perf_test_inputs/disk-index-tolerances.json | Tolerance config: 10% build/QPS, 1% recall/IOs/comparisons, 15% latency |
| diskann-benchmark/perf_test_inputs/wikipedia-100K-disk-index.json | Benchmark config: 768-dim, inner_product, search_list=[200] |
| diskann-benchmark/perf_test_inputs/openai-100K-disk-index.json | Benchmark config: 1536-dim, squared_l2, SQ_1_2.0, search_list=[200] |
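The shape of the tolerance file is not shown in this description. As a rough sketch only, based on the thresholds quoted here (the field names and layout are assumptions, not the actual schema of disk-index-tolerances.json):

```json
{
  "build_time": 0.10,
  "qps": 0.10,
  "recall": 0.01,
  "mean_ios": 0.01,
  "mean_comparisons": 0.01,
  "mean_latency": 0.15,
  "p95_latency": 0.15
}
```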

CI Workflows

| File | Description |
| --- | --- |
| .github/workflows/benchmarks.yml | PR regression workflow; triggers on pull_request to main, validates with cargo ... check run |
| .github/workflows/benchmarks-aa.yml | Daily A/A stability test (main vs main); opens a GitHub issue on failure |
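As an illustration, a pull_request trigger with path filters (mentioned later in the PR #857 comparison) typically looks like the following; the specific path globs here are assumptions, not the workflow's actual filters:

```yaml
on:
  pull_request:
    branches: [main]
    paths:
      - "diskann-benchmark/**"
      - "diskann-benchmark-runner/**"
      - ".github/workflows/benchmarks.yml"
```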

How It Works

  1. Checkout both current branch and baseline (defaults to main)
  2. Download datasets from big-ann-benchmarks v0.4.0
  3. Build & search disk index on both branches
  4. Validate with cargo run -p diskann-benchmark --features disk-index --release -- check run --tolerances ... --before ... --after ...
  5. Upload JSON artifacts for 30-day retention

Regression Checks

The Regression trait implementation compares 7 metrics between before and after runs:

| Metric | Direction | Tolerance | Rationale |
| --- | --- | --- | --- |
| build_time | Lower is better | 10% | CPU-bound, moderate noise |
| QPS | Higher is better | 10% | CPU-bound, moderate noise |
| recall | Higher is better | 1% | Algorithmic, near-deterministic |
| mean_ios | Lower is better | 1% | Algorithmic, near-deterministic |
| mean_comparisons | Lower is better | 1% | Algorithmic, near-deterministic |
| mean_latency | Lower is better | 15% | Timing, noisy on shared runners |
| p95_latency | Lower is better | 15% | Timing, noisy on shared runners |

All checks use relative_change(), which handles a zero baseline properly: it returns an error rather than silently reporting a 0% change.
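The direction-aware check and the zero-baseline error can be sketched in plain Rust. These signatures are illustrative only; the real helpers live in diskann-benchmark-runner/src/utils/num.rs and the disk-index benchmarks module:

```rust
/// Relative change from `before` to `after`, as a fraction.
/// Errors on a non-positive baseline instead of reporting 0%.
fn relative_change(before: f64, after: f64) -> Result<f64, String> {
    if before <= 0.0 {
        return Err("before must be > 0".to_string());
    }
    Ok((after - before) / before)
}

/// Direction-aware pass/fail: for higher-is-better metrics a negative
/// change beyond the tolerance is a regression; for lower-is-better
/// metrics a positive change is.
fn passes(before: f64, after: f64, tolerance: f64, higher_is_better: bool) -> Result<bool, String> {
    let change = relative_change(before, after)?;
    Ok(if higher_is_better {
        -change <= tolerance
    } else {
        change <= tolerance
    })
}

fn main() {
    // QPS (higher is better, 10% tolerance): a 5% drop passes, a 20% drop fails.
    assert_eq!(passes(1000.0, 950.0, 0.10, true), Ok(true));
    assert_eq!(passes(1000.0, 800.0, 0.10, true), Ok(false));
    // A zero baseline is an error, never a silent 0% change.
    assert!(relative_change(0.0, 5.0).is_err());
    println!("ok");
}
```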

Datasets

| Dataset | Dimensions | Distance | Vectors | Queries | search_list |
| --- | --- | --- | --- | --- | --- |
| Wikipedia-100K | 768 | inner_product | 100K | 5,000 | 200 |
| OpenAI ArXiv-100K | 1,536 | squared_l2 | 100K | 20,000 | 200 |

Improvements over PR #857

This PR addresses all blocking review comments from PR #857:

| Concern | PR #857 (Python) | PR #912 (Rust) |
| --- | --- | --- |
| Phantom thresholds / silent success | Orphaned categories silently skipped | Runner errors if tolerances don't match inputs |
| Bad direction values pass silently | Falls through to else branch | Type-safe, no string directions |
| Division by zero masked as 0% | Returns 0 when baseline = 0 | Returns error ("before must be > 0") |
| Missing fields default to 0 | .get(field, 0) | Rust deserialization fails on missing fields |
| Manual trigger only | workflow_dispatch only | pull_request trigger with path filters |
| CI time ~70 min | search_list=2000 | search_list=200, ~10 min per job |
| Hardcoded Rust version | rust_stable: "1.92" | toolchain: stable, reads rust-toolchain.toml |
| Python dependencies | benchmark_validate.py | Zero Python, pure Rust |
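The missing-fields contrast above comes from serde's derive(Deserialize), which by default errors when a required (non-Option) field is absent. A stdlib-only sketch of that strict-parsing behavior, with hypothetical field names (the real structs are DiskSearchStats and friends):

```rust
use std::collections::HashMap;

// Strict parsing errors on a missing field instead of defaulting to 0,
// the way the old Python `.get(field, 0)` did.
#[derive(Debug, PartialEq)]
struct SearchStats {
    qps: f64,
    recall: f64,
}

fn parse_stats(fields: &HashMap<&str, f64>) -> Result<SearchStats, String> {
    let get = |name: &str| {
        fields
            .get(name)
            .copied()
            .ok_or_else(|| format!("missing field `{name}`"))
    };
    Ok(SearchStats { qps: get("qps")?, recall: get("recall")? })
}

fn main() {
    // A complete record parses fine...
    let full = HashMap::from([("qps", 1200.0), ("recall", 0.97)]);
    assert!(parse_stats(&full).is_ok());

    // ...but a missing `recall` is a hard error, never a silent 0.
    let partial = HashMap::from([("qps", 1200.0)]);
    assert!(parse_stats(&partial).is_err());
    println!("ok");
}
```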

Yuanyuan Tian (from Dev Box) added 30 commits March 19, 2026 16:08
- Add benchmarks.yml workflow using workflow_dispatch, comparing current
  branch against a configurable baseline ref
- Add compare_disk_index_json_output.py to diff benchmark crate JSON outputs
  into a CSV suitable for benchmark_result_parse.py
- Add benchmark_result_parse.py for validating results and posting PR comments
- Add wikipedia-100K-disk-index.json benchmark config using the public
  Wikipedia-100K dataset from big-ann-benchmarks (100K Cohere embeddings,
  768-dim, cosine distance) to replace internal ADO datasets
…or ADO mimir-enron, not applicable to public datasets on GitHub runners. Threshold calibration tracked in PBI.
Replaces the previous 3-step pipeline (JSON → CSV → Markdown → validate)
with a single script that reads both JSONs directly, compares metrics,
writes Markdown to step summary, checks thresholds, and posts PR comments.

Removed:
- compare_disk_index_json_output.py (JSON diff → CSV)
- csv_to_markdown.py (CSV → Markdown)
- benchmark_result_parse.py (CSV → threshold check)

Also removes pip install csvtomd/numpy/scipy; all scripts now use stdlib only.
…ersion, fix missing-field handling, clean up orphaned thresholds, switch data source to BAB v0.4.0
Yuanyuan Tian (from Dev Box) and others added 9 commits April 2, 2026 12:12
…pipeline' into user/tianyuanyuan/benchmark-regression
- Add Deserialize to DiskIndexStats, DiskSearchStats, DiskSearchResult, DiskBuildStats
- Implement Regression trait for DiskIndex<T> with typed before/after comparison
- Add DiskIndexTolerance type with configurable thresholds for 7 metrics
- Create disk-index-tolerances.json (10% build/QPS, 1% recall/IOs/comps, 15% latency)
- Switch registration from register() to register_regression()
- Replace Python benchmark_validate.py with Rust-native check run in both workflows
- Delete benchmark_validate.py (no longer needed)
@YuanyuanTian-hh YuanyuanTian-hh changed the title User/tianyuanyuan/benchmark regression Add benchmark regression pipeline with Rust-native A/B validation Apr 7, 2026
@YuanyuanTian-hh YuanyuanTian-hh marked this pull request as ready for review April 7, 2026 08:31
@YuanyuanTian-hh YuanyuanTian-hh requested review from a team and Copilot April 7, 2026 08:31
@YuanyuanTian-hh YuanyuanTian-hh changed the title Add benchmark regression pipeline with Rust-native A/B validation Add benchmark pipeline with Rust-native A/B validation Apr 7, 2026

Copilot AI left a comment


Pull request overview

Adds a Rust-native benchmark regression pipeline that compares “before vs after” benchmark outputs in CI using typed deserialization + a Regression trait, removing the need for external Python validation.

Changes:

  • Implement regression checking for disk-index benchmarks (tolerance type + metric comparisons) and make disk-index build/search stats deserializable for A/B comparison.
  • Extend diskann-benchmark-runner with check CLI subcommands and internal tolerance matching/dispatch + numerous regression/UX tests.
  • Add GitHub Actions workflows and benchmark/tolerance JSON inputs to run and validate regressions on PRs and via daily A/A runs.

Reviewed changes

Copilot reviewed 111 out of 151 changed files in this pull request and generated 7 comments.

Show a summary per file
File Description
diskann-benchmark/src/backend/disk_index/search.rs Adds Deserialize to disk search stats/results for typed A/B validation.
diskann-benchmark/src/backend/disk_index/build.rs Adds Deserialize and exposes build time in seconds for regression checks.
diskann-benchmark/src/backend/disk_index/benchmarks.rs Registers disk-index benchmarks as regression-capable and implements metric-based regression checking + tolerance input type.
diskann-benchmark/perf_test_inputs/wikipedia-100K-disk-index.json Adds Wikipedia-100K disk-index benchmark configuration for CI runs.
diskann-benchmark/perf_test_inputs/openai-100K-disk-index.json Adds OpenAI ArXiv-100K disk-index benchmark configuration for CI runs.
diskann-benchmark/perf_test_inputs/disk-index-tolerances.json Adds default tolerance thresholds for disk-index regression validation.
diskann-benchmark-simd/src/lib.rs Wires SIMD benchmarks into regression framework + adds tolerance type and regression check implementation + tests.
diskann-benchmark-simd/src/bin.rs Updates SIMD binary tests to include check verify invocation.
diskann-benchmark-simd/examples/tolerance.json Adds example tolerance file for SIMD regression checks.
diskann-benchmark-runner/tests/regression/check-verify-4/tolerances.json Adds regression UX fixture for incompatible input/tolerance tag error.
diskann-benchmark-runner/tests/regression/check-verify-4/stdout.txt Expected output for check-verify-4.
diskann-benchmark-runner/tests/regression/check-verify-4/stdin.txt Command script for check-verify-4.
diskann-benchmark-runner/tests/regression/check-verify-4/README.md Describes the check-verify-4 scenario.
diskann-benchmark-runner/tests/regression/check-verify-4/input.json Input fixture for check-verify-4.
diskann-benchmark-runner/tests/regression/check-verify-3/tolerances.json Adds regression UX fixture for ambiguous/uncovered/orphaned tolerance matching.
diskann-benchmark-runner/tests/regression/check-verify-3/stdout.txt Expected output for check-verify-3.
diskann-benchmark-runner/tests/regression/check-verify-3/stdin.txt Command script for check-verify-3.
diskann-benchmark-runner/tests/regression/check-verify-3/input.json Input fixture for check-verify-3.
diskann-benchmark-runner/tests/regression/check-verify-2/tolerances.json Adds regression UX fixture for “no matching benchmark” during verify.
diskann-benchmark-runner/tests/regression/check-verify-2/stdout.txt Expected output for check-verify-2.
diskann-benchmark-runner/tests/regression/check-verify-2/stdin.txt Command script for check-verify-2.
diskann-benchmark-runner/tests/regression/check-verify-2/README.md Describes the check-verify-2 scenario.
diskann-benchmark-runner/tests/regression/check-verify-2/input.json Input fixture for check-verify-2.
diskann-benchmark-runner/tests/regression/check-verify-1/tolerances.json Adds regression UX fixture for unrecognized tolerance tag.
diskann-benchmark-runner/tests/regression/check-verify-1/stdout.txt Expected output for check-verify-1.
diskann-benchmark-runner/tests/regression/check-verify-1/stdin.txt Command script for check-verify-1.
diskann-benchmark-runner/tests/regression/check-verify-1/README.md Describes the check-verify-1 scenario.
diskann-benchmark-runner/tests/regression/check-verify-1/input.json Input fixture for check-verify-1.
diskann-benchmark-runner/tests/regression/check-verify-0/tolerances.json Adds regression UX fixture for successful verify (no stdout).
diskann-benchmark-runner/tests/regression/check-verify-0/stdin.txt Command script for check-verify-0.
diskann-benchmark-runner/tests/regression/check-verify-0/README.md Describes the check-verify-0 scenario.
diskann-benchmark-runner/tests/regression/check-verify-0/input.json Input fixture for check-verify-0.
diskann-benchmark-runner/tests/regression/check-tolerances-2/stdout.txt Expected output for requesting a nonexistent tolerance kind.
diskann-benchmark-runner/tests/regression/check-tolerances-2/stdin.txt Command script for check-tolerances-2.
diskann-benchmark-runner/tests/regression/check-tolerances-2/README.md Describes the check-tolerances-2 scenario.
diskann-benchmark-runner/tests/regression/check-tolerances-1/stdout.txt Expected output for describing a specific tolerance kind.
diskann-benchmark-runner/tests/regression/check-tolerances-1/stdin.txt Command script for check-tolerances-1.
diskann-benchmark-runner/tests/regression/check-tolerances-1/README.md Describes the check-tolerances-1 scenario.
diskann-benchmark-runner/tests/regression/check-tolerances-0/stdout.txt Expected output for listing all tolerance kinds.
diskann-benchmark-runner/tests/regression/check-tolerances-0/stdin.txt Command script for check-tolerances-0.
diskann-benchmark-runner/tests/regression/check-tolerances-0/README.md Describes the check-tolerances-0 scenario.
diskann-benchmark-runner/tests/regression/check-skeleton-0/stdout.txt Expected output for tolerance skeleton printing.
diskann-benchmark-runner/tests/regression/check-skeleton-0/stdin.txt Command script for check-skeleton-0.
diskann-benchmark-runner/tests/regression/check-skeleton-0/README.md Describes the check-skeleton-0 scenario.
diskann-benchmark-runner/tests/regression/check-run-pass-0/tolerances.json Adds regression UX fixture for successful check run execution.
diskann-benchmark-runner/tests/regression/check-run-pass-0/stdout.txt Expected output for successful check run.
diskann-benchmark-runner/tests/regression/check-run-pass-0/stdin.txt Command script for pass-case check run.
diskann-benchmark-runner/tests/regression/check-run-pass-0/README.md Describes the check-run-pass-0 scenario.
diskann-benchmark-runner/tests/regression/check-run-pass-0/output.json Output fixture used as both before/after.
diskann-benchmark-runner/tests/regression/check-run-pass-0/input.json Input fixture for pass-case run.
diskann-benchmark-runner/tests/regression/check-run-pass-0/checks.json Expected JSON output from pass-case checks.
diskann-benchmark-runner/tests/regression/check-run-fail-0/tolerances.json Adds regression UX fixture for a failing check run result.
diskann-benchmark-runner/tests/regression/check-run-fail-0/stdout.txt Expected output for failing check run.
diskann-benchmark-runner/tests/regression/check-run-fail-0/stdin.txt Command script for fail-case check run.
diskann-benchmark-runner/tests/regression/check-run-fail-0/README.md Describes the check-run-fail-0 scenario.
diskann-benchmark-runner/tests/regression/check-run-fail-0/output.json Output fixture for fail-case run.
diskann-benchmark-runner/tests/regression/check-run-fail-0/input.json Input fixture for fail-case run.
diskann-benchmark-runner/tests/regression/check-run-fail-0/checks.json Expected JSON output from fail-case checks.
diskann-benchmark-runner/tests/regression/check-run-error-3/tolerances.json Adds regression UX fixture for before/after schema drift error reporting.
diskann-benchmark-runner/tests/regression/check-run-error-3/stdout.txt Expected output for schema drift error.
diskann-benchmark-runner/tests/regression/check-run-error-3/stdin.txt Command script for schema drift error.
diskann-benchmark-runner/tests/regression/check-run-error-3/regression_input.json Regression input fixture to force schema mismatch.
diskann-benchmark-runner/tests/regression/check-run-error-3/README.md Describes the check-run-error-3 scenario.
diskann-benchmark-runner/tests/regression/check-run-error-3/output.json Output fixture used to trigger schema mismatch.
diskann-benchmark-runner/tests/regression/check-run-error-3/input.json Input fixture used to generate output.json.
diskann-benchmark-runner/tests/regression/check-run-error-3/checks.json Expected JSON output from error-case checks.
diskann-benchmark-runner/tests/regression/check-run-error-2/tolerances.json Adds regression UX fixture for “input drift” dispatch failure in check run.
diskann-benchmark-runner/tests/regression/check-run-error-2/stdout.txt Expected output for input drift dispatch failure.
diskann-benchmark-runner/tests/regression/check-run-error-2/stdin.txt Command script for input drift dispatch failure.
diskann-benchmark-runner/tests/regression/check-run-error-2/regression_input.json Regression input fixture that drifts to unsupported type.
diskann-benchmark-runner/tests/regression/check-run-error-2/README.md Describes the check-run-error-2 scenario.
diskann-benchmark-runner/tests/regression/check-run-error-2/output.json Output fixture for error-case run.
diskann-benchmark-runner/tests/regression/check-run-error-2/input.json Input fixture for error-case run.
diskann-benchmark-runner/tests/regression/check-run-error-1/tolerances.json Adds regression UX fixture for before/after length mismatch.
diskann-benchmark-runner/tests/regression/check-run-error-1/stdout.txt Expected output for length mismatch.
diskann-benchmark-runner/tests/regression/check-run-error-1/stdin.txt Command script for length mismatch.
diskann-benchmark-runner/tests/regression/check-run-error-1/regression_input.json Regression input fixture with different job count.
diskann-benchmark-runner/tests/regression/check-run-error-1/README.md Describes the check-run-error-1 scenario.
diskann-benchmark-runner/tests/regression/check-run-error-1/output.json Output fixture with mismatched job count.
diskann-benchmark-runner/tests/regression/check-run-error-1/input.json Input fixture used to generate output.json.
diskann-benchmark-runner/tests/regression/check-run-error-0/tolerances.json Adds regression UX fixture for infrastructure error propagation.
diskann-benchmark-runner/tests/regression/check-run-error-0/stdout.txt Expected output for infrastructure errors.
diskann-benchmark-runner/tests/regression/check-run-error-0/stdin.txt Command script for infrastructure errors.
diskann-benchmark-runner/tests/regression/check-run-error-0/README.md Describes the check-run-error-0 scenario.
diskann-benchmark-runner/tests/regression/check-run-error-0/output.json Output fixture for infrastructure errors.
diskann-benchmark-runner/tests/regression/check-run-error-0/input.json Input fixture for infrastructure errors.
diskann-benchmark-runner/tests/regression/check-run-error-0/checks.json Expected JSON output from error-case checks.
diskann-benchmark-runner/tests/benchmark/test-success-1/stdout.txt Adds expected output for run --dry-run success.
diskann-benchmark-runner/tests/benchmark/test-success-1/stdin.txt Adds command script for run --dry-run.
diskann-benchmark-runner/tests/benchmark/test-success-1/README.md Describes the dry-run behavior expectation.
diskann-benchmark-runner/tests/benchmark/test-success-1/input.json Input fixture for dry-run test.
diskann-benchmark-runner/tests/benchmark/test-success-0/stdout.txt Updates expected stdout for successful run output text changes.
diskann-benchmark-runner/tests/benchmark/test-success-0/stdin.txt Adds command script for benchmark success test.
diskann-benchmark-runner/tests/benchmark/test-success-0/README.md Describes benchmark success test.
diskann-benchmark-runner/tests/benchmark/test-success-0/output.json Adds expected output.json for benchmark success test.
diskann-benchmark-runner/tests/benchmark/test-success-0/input.json Adds input fixture for benchmark success test.
diskann-benchmark-runner/tests/benchmark/test-overload-0/stdout.txt Adds expected output for overload/dispatch scoring test.
diskann-benchmark-runner/tests/benchmark/test-overload-0/stdin.txt Adds command script for overload test.
diskann-benchmark-runner/tests/benchmark/test-overload-0/README.md Describes overload/dispatch selection behavior.
diskann-benchmark-runner/tests/benchmark/test-overload-0/output.json Adds expected output.json for overload test.
diskann-benchmark-runner/tests/benchmark/test-overload-0/input.json Adds input fixture for overload test.
diskann-benchmark-runner/tests/benchmark/test-mismatch-1/stdout.txt Adds expected diagnostics for mismatch description paths.
diskann-benchmark-runner/tests/benchmark/test-mismatch-1/stdin.txt Adds command script for mismatch test.
diskann-benchmark-runner/tests/benchmark/test-mismatch-1/README.md Describes mismatch diagnostics scenario.
diskann-benchmark-runner/tests/benchmark/test-mismatch-1/input.json Adds input fixture for mismatch test.
diskann-benchmark-runner/tests/benchmark/test-mismatch-0/stdout.txt Adds expected diagnostics for “closest matches” reporting.
diskann-benchmark-runner/tests/benchmark/test-mismatch-0/stdin.txt Adds command script for mismatch test.
diskann-benchmark-runner/tests/benchmark/test-mismatch-0/README.md Describes mismatch “closest matches” behavior.
diskann-benchmark-runner/tests/benchmark/test-mismatch-0/input.json Adds input fixture for mismatch test.
diskann-benchmark-runner/tests/benchmark/test-deserialization-error-0/stdout.txt Adds expected output for input deserialization error reporting.
diskann-benchmark-runner/tests/benchmark/test-deserialization-error-0/stdin.txt Adds command script for deserialization error test.
diskann-benchmark-runner/tests/benchmark/test-deserialization-error-0/README.md Describes deserialization error behavior.
diskann-benchmark-runner/tests/benchmark/test-deserialization-error-0/input.json Adds input fixture with invalid enum value.
diskann-benchmark-runner/tests/benchmark/test-4/stdout.txt Updates benchmark listing output to include new simple bench.
diskann-benchmark-runner/tests/benchmark/test-4/stdin.txt Adds command script for benchmark listing.
diskann-benchmark-runner/tests/benchmark/test-4/README.md Describes benchmark listing test.
diskann-benchmark-runner/tests/benchmark/test-3/stdout.txt Adds expected output for describing a specific input kind.
diskann-benchmark-runner/tests/benchmark/test-3/stdin.txt Adds command script for input describe test.
diskann-benchmark-runner/tests/benchmark/test-3/README.md Describes input describe test.
diskann-benchmark-runner/tests/benchmark/test-2/stdout.txt Adds expected output for describing a specific input kind.
diskann-benchmark-runner/tests/benchmark/test-2/stdin.txt Adds command script for input describe test.
diskann-benchmark-runner/tests/benchmark/test-2/README.md Describes input describe test.
diskann-benchmark-runner/tests/benchmark/test-1/stdout.txt Adds expected output for listing available input kinds.
diskann-benchmark-runner/tests/benchmark/test-1/stdin.txt Adds command script for input listing.
diskann-benchmark-runner/tests/benchmark/test-1/README.md Describes input listing test.
diskann-benchmark-runner/tests/benchmark/test-0/stdout.txt Adds expected output for skeleton input printing.
diskann-benchmark-runner/tests/benchmark/test-0/stdin.txt Adds command script for skeleton test.
diskann-benchmark-runner/tests/benchmark/test-0/README.md Describes skeleton test.
diskann-benchmark-runner/src/ux.rs Adds scrub_path helper and improves backtrace stripping logic for deterministic test output.
diskann-benchmark-runner/src/utils/percentiles.rs Adds minimum percentile field and marks Percentiles non-exhaustive.
diskann-benchmark-runner/src/utils/num.rs Adds constrained numeric deserialization + relative_change helper for regression checks.
diskann-benchmark-runner/src/utils/mod.rs Exposes new num utilities module.
diskann-benchmark-runner/src/utils/fmt.rs Adds clippy expectation annotation for bounds-checked panic.
diskann-benchmark-runner/src/test/typed.rs Refactors test benches and adds regression-capable typed benchmark checks.
diskann-benchmark-runner/src/test/mod.rs Centralizes registration of test inputs/benchmarks including regression variants.
diskann-benchmark-runner/src/test/dim.rs Adds dimensional test benchmarks including a non-regression “simple bench”.
diskann-benchmark-runner/src/result.rs Adds RawResult loader for reuse in regression checking pipeline.
diskann-benchmark-runner/src/registry.rs Extends registry with regression benchmark registration + tolerance discovery.
diskann-benchmark-runner/src/lib.rs Exposes benchmark module publicly and adds internal module plumbing.
diskann-benchmark-runner/src/jobs.rs Refactors job loading/parsing and improves error messages + exposes raw job accessors.
diskann-benchmark-runner/src/internal/regression.rs Implements tolerance parsing, subset matching, regression job assembly, and execution reporting.
diskann-benchmark-runner/src/internal/mod.rs Adds shared load_from_disk helper and internal module structure.
diskann-benchmark-runner/src/input.rs Adds const INSTANCE for Input wrapper to support regression tolerance typing.
diskann-benchmark-runner/src/checker.rs Adds clippy expectation annotation for internal tag invariants.
diskann-benchmark-runner/src/benchmark.rs Introduces Regression trait + internal object-safe regression plumbing for the runner.
diskann-benchmark-runner/src/app.rs Adds check subcommands (skeleton/tolerances/verify/run) and upgrades UX test harness.
diskann-benchmark-runner/Cargo.toml Adjusts clippy lint configuration for unwrap/expect/panic, etc.
diskann-benchmark-runner/.clippy.toml Allows unwrap/expect/panic in tests for this crate.
.github/workflows/benchmarks.yml Adds PR-triggered and manual benchmark regression workflow for two datasets with Rust-native validation.
.github/workflows/benchmarks-aa.yml Adds daily scheduled A/A stability workflow and issue creation on failure.
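The `Regression` trait introduced in `diskann-benchmark-runner/src/benchmark.rs` is not shown in this diff excerpt. As a rough sketch of the idea it describes — a typed before/after comparison validated against a tolerance — something like the following could work (all names and signatures here are hypothetical, not the PR's actual API):

```rust
// Hypothetical sketch of a typed regression check; the real trait in
// diskann-benchmark-runner almost certainly differs in shape.
trait Regression {
    type Stats;
    /// Compare baseline and candidate stats; true means "within tolerance".
    fn check(&self, before: &Self::Stats, after: &Self::Stats, tolerance: f64) -> bool;
}

struct QpsCheck;

impl Regression for QpsCheck {
    type Stats = f64; // queries per second

    fn check(&self, before: &f64, after: &f64, tolerance: f64) -> bool {
        // Higher is better: only a drop beyond the tolerance fails.
        // (Assumes a non-zero baseline; real code would guard that.)
        (before - after) / before <= tolerance
    }
}

fn main() {
    let check = QpsCheck;
    // A 5% QPS drop passes a 10% tolerance; a 20% drop fails it.
    assert!(check.check(&1000.0, &950.0, 0.10));
    assert!(!check.check(&1000.0, &800.0, 0.10));
}
```

The associated `Stats` type is what makes the comparison "typed" in the sense the PR description uses: each benchmark backend compares its own statistics structure rather than an untyped metric map.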


Comment on lines +223 to +228
/// Aggregated result of a disk-index regression check.
#[derive(Debug, Serialize)]
struct DiskIndexCheckResult {
search_l: u32,
comparisons: Vec<MetricComparison>,
}
Copilot AI Apr 7, 2026

DiskIndexCheckResult stores a single search_l value, but comparisons aggregates metrics across all search_results_per_l entries. If the input contains multiple search_l values, the rendered output will be ambiguous/misleading because individual rows don't indicate which search_l they correspond to. Consider grouping comparisons per search_l (e.g., Vec<PerSearchLResult>) or include search_l on each MetricComparison row and print it.
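The grouping the reviewer suggests could look roughly like this (illustrative names and fields only, not the PR's actual types):

```rust
// Sketch of grouping metric comparisons per search_l so each rendered
// row is unambiguous. Field set is illustrative.
#[derive(Debug)]
struct MetricComparison {
    metric: &'static str,
    before: f64,
    after: f64,
}

#[derive(Debug)]
struct PerSearchLResult {
    search_l: u32,
    comparisons: Vec<MetricComparison>,
}

#[derive(Debug)]
struct DiskIndexCheckResult {
    per_search_l: Vec<PerSearchLResult>,
}

fn main() {
    let result = DiskIndexCheckResult {
        per_search_l: vec![
            PerSearchLResult {
                search_l: 100,
                comparisons: vec![MetricComparison { metric: "qps", before: 1000.0, after: 980.0 }],
            },
            PerSearchLResult {
                search_l: 200,
                comparisons: vec![MetricComparison { metric: "qps", before: 800.0, after: 805.0 }],
            },
        ],
    };
    // Every printed row now carries its own search_l.
    for group in &result.per_search_l {
        for c in &group.comparisons {
            println!("L={} {}: {} -> {}", group.search_l, c.metric, c.before, c.after);
        }
    }
}
```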
Comment on lines +299 to +304
// Flip before/after so that a decrease becomes a positive relative_change
let (change_pct, remark, metric_passed) = match relative_change(before, after) {
Ok(change) => {
// For higher-is-better, a negative change is a regression
let ok = -change <= tolerance.get();
if !ok {
Copilot AI Apr 7, 2026

The comment says the code flips before/after so that a decrease becomes a positive relative_change, but the call is still relative_change(before, after). Either update the comment to reflect the actual sign convention (negative = regression for higher-is-better) or change the computation to match the comment to avoid future confusion.
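The sign convention under discussion can be made concrete with a small sketch. Assuming `relative_change` computes `(after - before) / before` (an assumption inferred from the quoted snippet, not verified against the PR's `utils/num.rs`), a drop in a higher-is-better metric yields a negative change, which the quoted code negates before comparing to the tolerance:

```rust
// Minimal sketch of the sign convention: relative_change is assumed to be
// (after - before) / before, so a regression in a higher-is-better metric
// shows up as a negative change.
fn relative_change(before: f64, after: f64) -> Result<f64, &'static str> {
    if before == 0.0 {
        return Err("baseline value is zero");
    }
    Ok((after - before) / before)
}

fn higher_is_better_ok(before: f64, after: f64, tolerance: f64) -> bool {
    match relative_change(before, after) {
        // Negate so that a drop becomes a positive magnitude to compare
        // against the tolerance, mirroring `-change <= tolerance.get()`.
        Ok(change) => -change <= tolerance,
        Err(_) => false,
    }
}

fn main() {
    // QPS drops 5% under a 10% tolerance: passes.
    assert!(higher_is_better_ok(1000.0, 950.0, 0.10));
    // QPS drops 20%: regression.
    assert!(!higher_is_better_ok(1000.0, 800.0, 0.10));
}
```

Nothing is actually "flipped" here, which is the reviewer's point: the comment should describe the negation of the change, not a swap of the arguments.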
Comment on lines 13 to 17
#[derive(Debug, Clone, Copy, Serialize, Deserialize, PartialEq)]
#[non_exhaustive]
pub struct Percentiles<T> {
pub minimum: T,
pub mean: f64,
Copilot AI Apr 7, 2026

Percentiles&lt;T&gt; is a public type and this change adds a new required public field (minimum). Even with #[non_exhaustive], downstream users constructing Percentiles { ... } will break unless they update, so this is a semver-impacting API change. If you want to avoid forcing downstream changes, consider keeping the struct shape stable and exposing the minimum via an accessor/newtype, or make construction go through a constructor function.
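The constructor-function approach the reviewer mentions could be sketched as follows (the field set is illustrative; the real `Percentiles<T>` has more fields than shown):

```rust
// Sketch: funnel construction through a function so that adding fields later
// does not break downstream struct-literal construction.
#[derive(Debug, Clone, Copy, PartialEq)]
#[non_exhaustive]
pub struct Percentiles {
    pub minimum: f64,
    pub mean: f64,
    pub p99: f64,
}

impl Percentiles {
    // New fields can be added later without breaking existing callers,
    // as long as they get defaults here or a new constructor variant.
    pub fn new(minimum: f64, mean: f64, p99: f64) -> Self {
        Self { minimum, mean, p99 }
    }
}

fn main() {
    let p = Percentiles::new(0.1, 1.5, 9.9);
    assert_eq!(p.minimum, 0.1);
}
```

Note that `#[non_exhaustive]` already blocks struct-literal construction outside the defining crate, so the breakage the reviewer describes applies to in-crate and pre-existing literal users; a constructor makes the upgrade path uniform either way.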
Comment on lines +91 to +102
#[test]
fn check_verify() {
let input_path = PathBuf::from(env!("CARGO_MANIFEST_DIR"))
.join("examples")
.join("simd-scalar.json");
let tolerance_path = PathBuf::from(env!("CARGO_MANIFEST_DIR"))
.join("examples")
.join("tolerance.json");

let stdout = run_check_test(&input_path, &tolerance_path);
println!("stdout = {}", stdout);
}
Copilot AI Apr 7, 2026

check_verify currently just prints the verifier output and has no assertions, so it will pass even if verification fails or produces unexpected output. Consider asserting that run_check_test(...) returns an empty string (or whatever the expected success output is), and avoid println! noise in tests unless the assertion fails.
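The assertion the reviewer asks for could look like this. The stand-in `run_check_test` below and the empty-stdout-on-success contract are assumptions inferred from the quoted test, not verified against the real harness:

```rust
// Stand-in for the real test harness helper; assumed to return captured
// stdout, with a passing verification producing no output.
fn run_check_test(_input: &str, _tolerance: &str) -> String {
    String::new()
}

fn main() {
    let stdout = run_check_test("simd-scalar.json", "tolerance.json");
    // Fail loudly, surfacing the captured output only when the check fails,
    // instead of unconditionally printing it.
    assert!(stdout.is_empty(), "unexpected verifier output: {stdout}");
}
```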
Comment on lines +71 to +74
- name: Checkout baseline (${{ inputs.baseline_ref || 'main' }})
uses: actions/checkout@v4
with:
ref: ${{ inputs.baseline_ref || 'main' }}
Copilot AI Apr 7, 2026

The workflow is triggered by both workflow_dispatch and pull_request, but it references ${{ inputs.baseline_ref }} when checking out the baseline. The inputs context is only defined for workflow_dispatch, so this can fail on PR runs. Prefer using github.event.inputs.baseline_ref (which safely resolves to null on non-dispatch events) or an explicit conditional on github.event_name with a main default.

Suggested change
- name: Checkout baseline (${{ inputs.baseline_ref || 'main' }})
uses: actions/checkout@v4
with:
ref: ${{ inputs.baseline_ref || 'main' }}
- name: Checkout baseline (${{ github.event.inputs.baseline_ref || 'main' }})
uses: actions/checkout@v4
with:
ref: ${{ github.event.inputs.baseline_ref || 'main' }}
# DiskANN Benchmarks Workflow
#
# This workflow runs macro benchmarks comparing the current branch against a baseline.
# It is manually triggered and requires a baseline reference (branch, tag, or commit).
Copilot AI Apr 7, 2026

Top-of-file comments say the workflow is "manually triggered" and "requires a baseline reference", but the workflow now also runs on pull_request (where the baseline is implicitly main). Updating these comments will prevent future confusion about how/when the workflow runs.

Suggested change
# It is manually triggered and requires a baseline reference (branch, tag, or commit).
# It can be triggered manually with a baseline reference (branch, tag, or commit),
# or automatically on pull requests to `main`, where the baseline is implicitly `main`.

permissions:
contents: read
pull-requests: write # Required for posting PR comments
Copilot AI Apr 7, 2026

pull-requests: write permission is declared, but this workflow doesn't appear to post PR comments or otherwise write to the PR. Consider dropping it (or scoping it to the specific step that needs it) to follow least-privilege, especially since this runs on pull_request.

Suggested change
pull-requests: write # Required for posting PR comments
@codecov-commenter

codecov-commenter commented Apr 7, 2026

Codecov Report

❌ Patch coverage is 93.73434% with 75 lines in your changes missing coverage. Please review.
✅ Project coverage is 89.40%. Comparing base (0ced23d) to head (c9ebf6c).
⚠️ Report is 2 commits behind head on main.

Files with missing lines Patch % Lines
diskann-benchmark-runner/src/app.rs 86.02% 19 Missing ⚠️
diskann-benchmark-simd/src/lib.rs 93.47% 12 Missing ⚠️
diskann-benchmark-runner/src/benchmark.rs 89.42% 11 Missing ⚠️
diskann-benchmark-runner/src/test/dim.rs 87.91% 11 Missing ⚠️
diskann-benchmark-runner/src/registry.rs 84.61% 10 Missing ⚠️
...iskann-benchmark-runner/src/internal/regression.rs 97.70% 9 Missing ⚠️
diskann-benchmark-runner/src/jobs.rs 91.66% 1 Missing ⚠️
diskann-benchmark-runner/src/test/typed.rs 97.61% 1 Missing ⚠️
diskann-benchmark-runner/src/ux.rs 95.65% 1 Missing ⚠️
Additional details and impacted files


@@            Coverage Diff             @@
##             main     #912      +/-   ##
==========================================
+ Coverage   89.31%   89.40%   +0.09%     
==========================================
  Files         445      449       +4     
  Lines       84095    85057     +962     
==========================================
+ Hits        75113    76049     +936     
- Misses       8982     9008      +26     
Flag Coverage Δ
miri 89.40% <93.73%> (+0.09%) ⬆️
unittests 89.25% <93.73%> (+0.09%) ⬆️

Flags with carried forward coverage won't be shown.

Files with missing lines Coverage Δ
diskann-benchmark-runner/src/checker.rs 72.66% <ø> (ø)
diskann-benchmark-runner/src/input.rs 78.37% <100.00%> (ø)
diskann-benchmark-runner/src/internal/mod.rs 100.00% <100.00%> (ø)
diskann-benchmark-runner/src/result.rs 97.84% <100.00%> (+0.03%) ⬆️
diskann-benchmark-runner/src/test/mod.rs 100.00% <100.00%> (ø)
diskann-benchmark-runner/src/utils/fmt.rs 97.50% <ø> (ø)
diskann-benchmark-runner/src/utils/num.rs 100.00% <100.00%> (ø)
diskann-benchmark-runner/src/utils/percentiles.rs 100.00% <100.00%> (ø)
diskann-benchmark-simd/src/bin.rs 87.93% <100.00%> (+6.35%) ⬆️
diskann-benchmark-runner/src/jobs.rs 96.82% <91.66%> (+0.33%) ⬆️
... and 8 more

... and 1 file with indirect coverage changes



3 participants