Add benchmark pipeline with Rust-native A/B validation #912

Open
YuanyuanTian-hh wants to merge 41 commits into main from user/tianyuanyuan/benchmark-regression

Conversation


@YuanyuanTian-hh YuanyuanTian-hh commented Apr 7, 2026

Add benchmark regression pipeline with Rust-native A/B validation

Summary

Adds an automated benchmark regression pipeline to GitHub Actions. This PR builds on PR #900 (benchmark A/B test framework) by implementing the Regression trait for disk-index benchmarks and wiring it into CI workflows. The pipeline builds and searches two public 100K ANN datasets, compares before/after performance, and validates against configurable tolerances, all in typed Rust code with no Python dependencies.

What's Changed

Benchmark Runner (from PR #900)

  • Rust-native A/B test framework in diskann-benchmark-runner
  • Regression trait, tolerance matching, check run CLI subcommand

Disk-Index Regression Support (this PR)

| File | Description |
| --- | --- |
| diskann-benchmark/src/backend/disk_index/benchmarks.rs | Implements the Regression trait for DiskIndex<T> with typed before/after comparison of 7 metrics |
| diskann-benchmark/src/backend/disk_index/search.rs | Adds Deserialize to DiskSearchStats, DiskSearchResult |
| diskann-benchmark/src/backend/disk_index/build.rs | Adds Deserialize to DiskBuildStats; exposes build_time_seconds() |
| diskann-benchmark/perf_test_inputs/disk-index-tolerances.json | Tolerance config: 10% build/QPS, 1% recall/IOs/comparisons, 15% latency |
| diskann-benchmark/perf_test_inputs/wikipedia-100K-disk-index.json | Benchmark config: 768-dim, inner_product, search_list=[200] |
| diskann-benchmark/perf_test_inputs/openai-100K-disk-index.json | Benchmark config: 1536-dim, squared_l2, SQ_1_2.0, search_list=[200] |
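The shape of the tolerance file is not shown in this description. As a rough sketch only, based on the thresholds quoted here (the field names and layout are assumptions, not the actual schema of disk-index-tolerances.json):

```json
{
  "build_time": 0.10,
  "qps": 0.10,
  "recall": 0.01,
  "mean_ios": 0.01,
  "mean_comparisons": 0.01,
  "mean_latency": 0.15,
  "p95_latency": 0.15
}
```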

CI Workflows

| File | Description |
| --- | --- |
| .github/workflows/benchmarks.yml | PR regression workflow; triggers on pull_request to main, validates with cargo ... check run |
| .github/workflows/benchmarks-aa.yml | Daily A/A stability test (main vs main); opens a GitHub issue on failure |
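As an illustration, a pull_request trigger with path filters (mentioned later in the PR #857 comparison) typically looks like the following; the specific path globs here are assumptions, not the workflow's actual filters:

```yaml
on:
  pull_request:
    branches: [main]
    paths:
      - "diskann-benchmark/**"
      - "diskann-benchmark-runner/**"
      - ".github/workflows/benchmarks.yml"
```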

How It Works

  1. Checkout both current branch and baseline (defaults to main)
  2. Download datasets from big-ann-benchmarks v0.4.0
  3. Build & search disk index on both branches
  4. Validate with cargo run -p diskann-benchmark --features disk-index --release -- check run --tolerances ... --before ... --after ...
  5. Upload JSON artifacts for 30-day retention

Regression Checks

The Regression trait implementation compares 7 metrics between before and after runs:

| Metric | Direction | Tolerance | Rationale |
| --- | --- | --- | --- |
| build_time | Lower is better | 10% | CPU-bound, moderate noise |
| QPS | Higher is better | 10% | CPU-bound, moderate noise |
| recall | Higher is better | 1% | Algorithmic, near-deterministic |
| mean_ios | Lower is better | 1% | Algorithmic, near-deterministic |
| mean_comparisons | Lower is better | 1% | Algorithmic, near-deterministic |
| mean_latency | Lower is better | 15% | Timing, noisy on shared runners |
| p95_latency | Lower is better | 15% | Timing, noisy on shared runners |

All checks use relative_change(), which handles a zero baseline properly: it returns an error rather than silently reporting a 0% change.
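The direction-aware check and the zero-baseline error can be sketched in plain Rust. These signatures are illustrative only; the real helpers live in diskann-benchmark-runner/src/utils/num.rs and the disk-index benchmarks module:

```rust
/// Relative change from `before` to `after`, as a fraction.
/// Errors on a non-positive baseline instead of reporting 0%.
fn relative_change(before: f64, after: f64) -> Result<f64, String> {
    if before <= 0.0 {
        return Err("before must be > 0".to_string());
    }
    Ok((after - before) / before)
}

/// Direction-aware pass/fail: for higher-is-better metrics a negative
/// change beyond the tolerance is a regression; for lower-is-better
/// metrics a positive change is.
fn passes(before: f64, after: f64, tolerance: f64, higher_is_better: bool) -> Result<bool, String> {
    let change = relative_change(before, after)?;
    Ok(if higher_is_better {
        -change <= tolerance
    } else {
        change <= tolerance
    })
}

fn main() {
    // QPS (higher is better, 10% tolerance): a 5% drop passes, a 20% drop fails.
    assert_eq!(passes(1000.0, 950.0, 0.10, true), Ok(true));
    assert_eq!(passes(1000.0, 800.0, 0.10, true), Ok(false));
    // A zero baseline is an error, never a silent 0% change.
    assert!(relative_change(0.0, 5.0).is_err());
    println!("ok");
}
```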

Datasets

| Dataset | Dimensions | Distance | Vectors | Queries | search_list |
| --- | --- | --- | --- | --- | --- |
| Wikipedia-100K | 768 | inner_product | 100K | 5,000 | 200 |
| OpenAI ArXiv-100K | 1,536 | squared_l2 | 100K | 20,000 | 200 |

Improvements over PR #857

This PR addresses all blocking review comments from PR #857:

| Concern | PR #857 (Python) | PR #912 (Rust) |
| --- | --- | --- |
| Phantom thresholds / silent success | Orphaned categories silently skipped | Runner errors if tolerances don't match inputs |
| Bad direction values pass silently | Falls through to else branch | Type-safe, no string directions |
| Division by zero masked as 0% | Returns 0 when baseline = 0 | Returns error ("before must be > 0") |
| Missing fields default to 0 | .get(field, 0) | Rust deserialization fails on missing fields |
| Manual trigger only | workflow_dispatch only | pull_request trigger with path filters |
| CI time ~70 min | search_list=2000 | search_list=200, ~10 min per job |
| Hardcoded Rust version | rust_stable: "1.92" | toolchain: stable, reads rust-toolchain.toml |
| Python dependencies | benchmark_validate.py | Zero Python, pure Rust |
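The missing-fields contrast above comes from serde's derive(Deserialize), which by default errors when a required (non-Option) field is absent. A stdlib-only sketch of that strict-parsing behavior, with hypothetical field names (the real structs are DiskSearchStats and friends):

```rust
use std::collections::HashMap;

// Strict parsing errors on a missing field instead of defaulting to 0,
// the way the old Python `.get(field, 0)` did.
#[derive(Debug, PartialEq)]
struct SearchStats {
    qps: f64,
    recall: f64,
}

fn parse_stats(fields: &HashMap<&str, f64>) -> Result<SearchStats, String> {
    let get = |name: &str| {
        fields
            .get(name)
            .copied()
            .ok_or_else(|| format!("missing field `{name}`"))
    };
    Ok(SearchStats { qps: get("qps")?, recall: get("recall")? })
}

fn main() {
    // A complete record parses fine...
    let full = HashMap::from([("qps", 1200.0), ("recall", 0.97)]);
    assert!(parse_stats(&full).is_ok());

    // ...but a missing `recall` is a hard error, never a silent 0.
    let partial = HashMap::from([("qps", 1200.0)]);
    assert!(parse_stats(&partial).is_err());
    println!("ok");
}
```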

Yuanyuan Tian (from Dev Box) added 30 commits March 19, 2026 16:08
- Add benchmarks.yml workflow using workflow_dispatch, comparing current
  branch against a configurable baseline ref
- Add compare_disk_index_json_output.py to diff benchmark crate JSON outputs
  into a CSV suitable for benchmark_result_parse.py
- Add benchmark_result_parse.py for validating results and posting PR comments
- Add wikipedia-100K-disk-index.json benchmark config using the public
  Wikipedia-100K dataset from big-ann-benchmarks (100K Cohere embeddings,
  768-dim, cosine distance) to replace internal ADO datasets
…or ADO mimir-enron, not applicable to public datasets on GitHub runners. Threshold calibration tracked in PBI.
Replaces the previous 3-step pipeline (JSON → CSV → Markdown → validate)
with a single script that reads both JSONs directly, compares metrics,
writes Markdown to step summary, checks thresholds, and posts PR comments.

Removed:
- compare_disk_index_json_output.py (JSON diff → CSV)
- csv_to_markdown.py (CSV → Markdown)
- benchmark_result_parse.py (CSV → threshold check)

Also removes pip install csvtomd/numpy/scipy; all scripts now use stdlib only.
…ersion, fix missing-field handling, clean up orphaned thresholds, switch data source to BAB v0.4.0
Yuanyuan Tian (from Dev Box) and others added 9 commits April 2, 2026 12:12
…pipeline' into user/tianyuanyuan/benchmark-regression
- Add Deserialize to DiskIndexStats, DiskSearchStats, DiskSearchResult, DiskBuildStats
- Implement Regression trait for DiskIndex<T> with typed before/after comparison
- Add DiskIndexTolerance type with configurable thresholds for 7 metrics
- Create disk-index-tolerances.json (10% build/QPS, 1% recall/IOs/comps, 15% latency)
- Switch registration from register() to register_regression()
- Replace Python benchmark_validate.py with Rust-native check run in both workflows
- Delete benchmark_validate.py (no longer needed)
@YuanyuanTian-hh YuanyuanTian-hh changed the title User/tianyuanyuan/benchmark regression Add benchmark regression pipeline with Rust-native A/B validation Apr 7, 2026
@YuanyuanTian-hh YuanyuanTian-hh marked this pull request as ready for review April 7, 2026 08:31
@YuanyuanTian-hh YuanyuanTian-hh requested review from a team and Copilot April 7, 2026 08:31
@YuanyuanTian-hh YuanyuanTian-hh changed the title Add benchmark regression pipeline with Rust-native A/B validation Add benchmark pipeline with Rust-native A/B validation Apr 7, 2026

Copilot AI left a comment


Pull request overview

Adds a Rust-native benchmark regression pipeline that compares “before vs after” benchmark outputs in CI using typed deserialization + a Regression trait, removing the need for external Python validation.

Changes:

  • Implement regression checking for disk-index benchmarks (tolerance type + metric comparisons) and make disk-index build/search stats deserializable for A/B comparison.
  • Extend diskann-benchmark-runner with check CLI subcommands and internal tolerance matching/dispatch + numerous regression/UX tests.
  • Add GitHub Actions workflows and benchmark/tolerance JSON inputs to run and validate regressions on PRs and via daily A/A runs.

Reviewed changes

Copilot reviewed 111 out of 151 changed files in this pull request and generated 7 comments.

Show a summary per file
File Description
diskann-benchmark/src/backend/disk_index/search.rs Adds Deserialize to disk search stats/results for typed A/B validation.
diskann-benchmark/src/backend/disk_index/build.rs Adds Deserialize and exposes build time in seconds for regression checks.
diskann-benchmark/src/backend/disk_index/benchmarks.rs Registers disk-index benchmarks as regression-capable and implements metric-based regression checking + tolerance input type.
diskann-benchmark/perf_test_inputs/wikipedia-100K-disk-index.json Adds Wikipedia-100K disk-index benchmark configuration for CI runs.
diskann-benchmark/perf_test_inputs/openai-100K-disk-index.json Adds OpenAI ArXiv-100K disk-index benchmark configuration for CI runs.
diskann-benchmark/perf_test_inputs/disk-index-tolerances.json Adds default tolerance thresholds for disk-index regression validation.
diskann-benchmark-simd/src/lib.rs Wires SIMD benchmarks into regression framework + adds tolerance type and regression check implementation + tests.
diskann-benchmark-simd/src/bin.rs Updates SIMD binary tests to include check verify invocation.
diskann-benchmark-simd/examples/tolerance.json Adds example tolerance file for SIMD regression checks.
diskann-benchmark-runner/tests/regression/check-verify-4/tolerances.json Adds regression UX fixture for incompatible input/tolerance tag error.
diskann-benchmark-runner/tests/regression/check-verify-4/stdout.txt Expected output for check-verify-4.
diskann-benchmark-runner/tests/regression/check-verify-4/stdin.txt Command script for check-verify-4.
diskann-benchmark-runner/tests/regression/check-verify-4/README.md Describes the check-verify-4 scenario.
diskann-benchmark-runner/tests/regression/check-verify-4/input.json Input fixture for check-verify-4.
diskann-benchmark-runner/tests/regression/check-verify-3/tolerances.json Adds regression UX fixture for ambiguous/uncovered/orphaned tolerance matching.
diskann-benchmark-runner/tests/regression/check-verify-3/stdout.txt Expected output for check-verify-3.
diskann-benchmark-runner/tests/regression/check-verify-3/stdin.txt Command script for check-verify-3.
diskann-benchmark-runner/tests/regression/check-verify-3/input.json Input fixture for check-verify-3.
diskann-benchmark-runner/tests/regression/check-verify-2/tolerances.json Adds regression UX fixture for “no matching benchmark” during verify.
diskann-benchmark-runner/tests/regression/check-verify-2/stdout.txt Expected output for check-verify-2.
diskann-benchmark-runner/tests/regression/check-verify-2/stdin.txt Command script for check-verify-2.
diskann-benchmark-runner/tests/regression/check-verify-2/README.md Describes the check-verify-2 scenario.
diskann-benchmark-runner/tests/regression/check-verify-2/input.json Input fixture for check-verify-2.
diskann-benchmark-runner/tests/regression/check-verify-1/tolerances.json Adds regression UX fixture for unrecognized tolerance tag.
diskann-benchmark-runner/tests/regression/check-verify-1/stdout.txt Expected output for check-verify-1.
diskann-benchmark-runner/tests/regression/check-verify-1/stdin.txt Command script for check-verify-1.
diskann-benchmark-runner/tests/regression/check-verify-1/README.md Describes the check-verify-1 scenario.
diskann-benchmark-runner/tests/regression/check-verify-1/input.json Input fixture for check-verify-1.
diskann-benchmark-runner/tests/regression/check-verify-0/tolerances.json Adds regression UX fixture for successful verify (no stdout).
diskann-benchmark-runner/tests/regression/check-verify-0/stdin.txt Command script for check-verify-0.
diskann-benchmark-runner/tests/regression/check-verify-0/README.md Describes the check-verify-0 scenario.
diskann-benchmark-runner/tests/regression/check-verify-0/input.json Input fixture for check-verify-0.
diskann-benchmark-runner/tests/regression/check-tolerances-2/stdout.txt Expected output for requesting a nonexistent tolerance kind.
diskann-benchmark-runner/tests/regression/check-tolerances-2/stdin.txt Command script for check-tolerances-2.
diskann-benchmark-runner/tests/regression/check-tolerances-2/README.md Describes the check-tolerances-2 scenario.
diskann-benchmark-runner/tests/regression/check-tolerances-1/stdout.txt Expected output for describing a specific tolerance kind.
diskann-benchmark-runner/tests/regression/check-tolerances-1/stdin.txt Command script for check-tolerances-1.
diskann-benchmark-runner/tests/regression/check-tolerances-1/README.md Describes the check-tolerances-1 scenario.
diskann-benchmark-runner/tests/regression/check-tolerances-0/stdout.txt Expected output for listing all tolerance kinds.
diskann-benchmark-runner/tests/regression/check-tolerances-0/stdin.txt Command script for check-tolerances-0.
diskann-benchmark-runner/tests/regression/check-tolerances-0/README.md Describes the check-tolerances-0 scenario.
diskann-benchmark-runner/tests/regression/check-skeleton-0/stdout.txt Expected output for tolerance skeleton printing.
diskann-benchmark-runner/tests/regression/check-skeleton-0/stdin.txt Command script for check-skeleton-0.
diskann-benchmark-runner/tests/regression/check-skeleton-0/README.md Describes the check-skeleton-0 scenario.
diskann-benchmark-runner/tests/regression/check-run-pass-0/tolerances.json Adds regression UX fixture for successful check run execution.
diskann-benchmark-runner/tests/regression/check-run-pass-0/stdout.txt Expected output for successful check run.
diskann-benchmark-runner/tests/regression/check-run-pass-0/stdin.txt Command script for pass-case check run.
diskann-benchmark-runner/tests/regression/check-run-pass-0/README.md Describes the check-run-pass-0 scenario.
diskann-benchmark-runner/tests/regression/check-run-pass-0/output.json Output fixture used as both before/after.
diskann-benchmark-runner/tests/regression/check-run-pass-0/input.json Input fixture for pass-case run.
diskann-benchmark-runner/tests/regression/check-run-pass-0/checks.json Expected JSON output from pass-case checks.
diskann-benchmark-runner/tests/regression/check-run-fail-0/tolerances.json Adds regression UX fixture for a failing check run result.
diskann-benchmark-runner/tests/regression/check-run-fail-0/stdout.txt Expected output for failing check run.
diskann-benchmark-runner/tests/regression/check-run-fail-0/stdin.txt Command script for fail-case check run.
diskann-benchmark-runner/tests/regression/check-run-fail-0/README.md Describes the check-run-fail-0 scenario.
diskann-benchmark-runner/tests/regression/check-run-fail-0/output.json Output fixture for fail-case run.
diskann-benchmark-runner/tests/regression/check-run-fail-0/input.json Input fixture for fail-case run.
diskann-benchmark-runner/tests/regression/check-run-fail-0/checks.json Expected JSON output from fail-case checks.
diskann-benchmark-runner/tests/regression/check-run-error-3/tolerances.json Adds regression UX fixture for before/after schema drift error reporting.
diskann-benchmark-runner/tests/regression/check-run-error-3/stdout.txt Expected output for schema drift error.
diskann-benchmark-runner/tests/regression/check-run-error-3/stdin.txt Command script for schema drift error.
diskann-benchmark-runner/tests/regression/check-run-error-3/regression_input.json Regression input fixture to force schema mismatch.
diskann-benchmark-runner/tests/regression/check-run-error-3/README.md Describes the check-run-error-3 scenario.
diskann-benchmark-runner/tests/regression/check-run-error-3/output.json Output fixture used to trigger schema mismatch.
diskann-benchmark-runner/tests/regression/check-run-error-3/input.json Input fixture used to generate output.json.
diskann-benchmark-runner/tests/regression/check-run-error-3/checks.json Expected JSON output from error-case checks.
diskann-benchmark-runner/tests/regression/check-run-error-2/tolerances.json Adds regression UX fixture for “input drift” dispatch failure in check run.
diskann-benchmark-runner/tests/regression/check-run-error-2/stdout.txt Expected output for input drift dispatch failure.
diskann-benchmark-runner/tests/regression/check-run-error-2/stdin.txt Command script for input drift dispatch failure.
diskann-benchmark-runner/tests/regression/check-run-error-2/regression_input.json Regression input fixture that drifts to unsupported type.
diskann-benchmark-runner/tests/regression/check-run-error-2/README.md Describes the check-run-error-2 scenario.
diskann-benchmark-runner/tests/regression/check-run-error-2/output.json Output fixture for error-case run.
diskann-benchmark-runner/tests/regression/check-run-error-2/input.json Input fixture for error-case run.
diskann-benchmark-runner/tests/regression/check-run-error-1/tolerances.json Adds regression UX fixture for before/after length mismatch.
diskann-benchmark-runner/tests/regression/check-run-error-1/stdout.txt Expected output for length mismatch.
diskann-benchmark-runner/tests/regression/check-run-error-1/stdin.txt Command script for length mismatch.
diskann-benchmark-runner/tests/regression/check-run-error-1/regression_input.json Regression input fixture with different job count.
diskann-benchmark-runner/tests/regression/check-run-error-1/README.md Describes the check-run-error-1 scenario.
diskann-benchmark-runner/tests/regression/check-run-error-1/output.json Output fixture with mismatched job count.
diskann-benchmark-runner/tests/regression/check-run-error-1/input.json Input fixture used to generate output.json.
diskann-benchmark-runner/tests/regression/check-run-error-0/tolerances.json Adds regression UX fixture for infrastructure error propagation.
diskann-benchmark-runner/tests/regression/check-run-error-0/stdout.txt Expected output for infrastructure errors.
diskann-benchmark-runner/tests/regression/check-run-error-0/stdin.txt Command script for infrastructure errors.
diskann-benchmark-runner/tests/regression/check-run-error-0/README.md Describes the check-run-error-0 scenario.
diskann-benchmark-runner/tests/regression/check-run-error-0/output.json Output fixture for infrastructure errors.
diskann-benchmark-runner/tests/regression/check-run-error-0/input.json Input fixture for infrastructure errors.
diskann-benchmark-runner/tests/regression/check-run-error-0/checks.json Expected JSON output from error-case checks.
diskann-benchmark-runner/tests/benchmark/test-success-1/stdout.txt Adds expected output for run --dry-run success.
diskann-benchmark-runner/tests/benchmark/test-success-1/stdin.txt Adds command script for run --dry-run.
diskann-benchmark-runner/tests/benchmark/test-success-1/README.md Describes the dry-run behavior expectation.
diskann-benchmark-runner/tests/benchmark/test-success-1/input.json Input fixture for dry-run test.
diskann-benchmark-runner/tests/benchmark/test-success-0/stdout.txt Updates expected stdout for successful run output text changes.
diskann-benchmark-runner/tests/benchmark/test-success-0/stdin.txt Adds command script for benchmark success test.
diskann-benchmark-runner/tests/benchmark/test-success-0/README.md Describes benchmark success test.
diskann-benchmark-runner/tests/benchmark/test-success-0/output.json Adds expected output.json for benchmark success test.
diskann-benchmark-runner/tests/benchmark/test-success-0/input.json Adds input fixture for benchmark success test.
diskann-benchmark-runner/tests/benchmark/test-overload-0/stdout.txt Adds expected output for overload/dispatch scoring test.
diskann-benchmark-runner/tests/benchmark/test-overload-0/stdin.txt Adds command script for overload test.
diskann-benchmark-runner/tests/benchmark/test-overload-0/README.md Describes overload/dispatch selection behavior.
diskann-benchmark-runner/tests/benchmark/test-overload-0/output.json Adds expected output.json for overload test.
diskann-benchmark-runner/tests/benchmark/test-overload-0/input.json Adds input fixture for overload test.
diskann-benchmark-runner/tests/benchmark/test-mismatch-1/stdout.txt Adds expected diagnostics for mismatch description paths.
diskann-benchmark-runner/tests/benchmark/test-mismatch-1/stdin.txt Adds command script for mismatch test.
diskann-benchmark-runner/tests/benchmark/test-mismatch-1/README.md Describes mismatch diagnostics scenario.
diskann-benchmark-runner/tests/benchmark/test-mismatch-1/input.json Adds input fixture for mismatch test.
diskann-benchmark-runner/tests/benchmark/test-mismatch-0/stdout.txt Adds expected diagnostics for “closest matches” reporting.
diskann-benchmark-runner/tests/benchmark/test-mismatch-0/stdin.txt Adds command script for mismatch test.
diskann-benchmark-runner/tests/benchmark/test-mismatch-0/README.md Describes mismatch “closest matches” behavior.
diskann-benchmark-runner/tests/benchmark/test-mismatch-0/input.json Adds input fixture for mismatch test.
diskann-benchmark-runner/tests/benchmark/test-deserialization-error-0/stdout.txt Adds expected output for input deserialization error reporting.
diskann-benchmark-runner/tests/benchmark/test-deserialization-error-0/stdin.txt Adds command script for deserialization error test.
diskann-benchmark-runner/tests/benchmark/test-deserialization-error-0/README.md Describes deserialization error behavior.
diskann-benchmark-runner/tests/benchmark/test-deserialization-error-0/input.json Adds input fixture with invalid enum value.
diskann-benchmark-runner/tests/benchmark/test-4/stdout.txt Updates benchmark listing output to include new simple bench.
diskann-benchmark-runner/tests/benchmark/test-4/stdin.txt Adds command script for benchmark listing.
diskann-benchmark-runner/tests/benchmark/test-4/README.md Describes benchmark listing test.
diskann-benchmark-runner/tests/benchmark/test-3/stdout.txt Adds expected output for describing a specific input kind.
diskann-benchmark-runner/tests/benchmark/test-3/stdin.txt Adds command script for input describe test.
diskann-benchmark-runner/tests/benchmark/test-3/README.md Describes input describe test.
diskann-benchmark-runner/tests/benchmark/test-2/stdout.txt Adds expected output for describing a specific input kind.
diskann-benchmark-runner/tests/benchmark/test-2/stdin.txt Adds command script for input describe test.
diskann-benchmark-runner/tests/benchmark/test-2/README.md Describes input describe test.
diskann-benchmark-runner/tests/benchmark/test-1/stdout.txt Adds expected output for listing available input kinds.
diskann-benchmark-runner/tests/benchmark/test-1/stdin.txt Adds command script for input listing.
diskann-benchmark-runner/tests/benchmark/test-1/README.md Describes input listing test.
diskann-benchmark-runner/tests/benchmark/test-0/stdout.txt Adds expected output for skeleton input printing.
diskann-benchmark-runner/tests/benchmark/test-0/stdin.txt Adds command script for skeleton test.
diskann-benchmark-runner/tests/benchmark/test-0/README.md Describes skeleton test.
diskann-benchmark-runner/src/ux.rs Adds scrub_path helper and improves backtrace stripping logic for deterministic test output.
diskann-benchmark-runner/src/utils/percentiles.rs Adds minimum percentile field and marks Percentiles non-exhaustive.
diskann-benchmark-runner/src/utils/num.rs Adds constrained numeric deserialization + relative_change helper for regression checks.
diskann-benchmark-runner/src/utils/mod.rs Exposes new num utilities module.
diskann-benchmark-runner/src/utils/fmt.rs Adds clippy expectation annotation for bounds-checked panic.
diskann-benchmark-runner/src/test/typed.rs Refactors test benches and adds regression-capable typed benchmark checks.
diskann-benchmark-runner/src/test/mod.rs Centralizes registration of test inputs/benchmarks including regression variants.
diskann-benchmark-runner/src/test/dim.rs Adds dimensional test benchmarks including a non-regression “simple bench”.
diskann-benchmark-runner/src/result.rs Adds RawResult loader for reuse in regression checking pipeline.
diskann-benchmark-runner/src/registry.rs Extends registry with regression benchmark registration + tolerance discovery.
diskann-benchmark-runner/src/lib.rs Exposes benchmark module publicly and adds internal module plumbing.
diskann-benchmark-runner/src/jobs.rs Refactors job loading/parsing and improves error messages + exposes raw job accessors.
diskann-benchmark-runner/src/internal/regression.rs Implements tolerance parsing, subset matching, regression job assembly, and execution reporting.
diskann-benchmark-runner/src/internal/mod.rs Adds shared load_from_disk helper and internal module structure.
diskann-benchmark-runner/src/input.rs Adds const INSTANCE for Input wrapper to support regression tolerance typing.
diskann-benchmark-runner/src/checker.rs Adds clippy expectation annotation for internal tag invariants.
diskann-benchmark-runner/src/benchmark.rs Introduces Regression trait + internal object-safe regression plumbing for the runner.
diskann-benchmark-runner/src/app.rs Adds check subcommands (skeleton/tolerances/verify/run) and upgrades UX test harness.
diskann-benchmark-runner/Cargo.toml Adjusts clippy lint configuration for unwrap/expect/panic, etc.
diskann-benchmark-runner/.clippy.toml Allows unwrap/expect/panic in tests for this crate.
.github/workflows/benchmarks.yml Adds PR-triggered and manual benchmark regression workflow for two datasets with Rust-native validation.
.github/workflows/benchmarks-aa.yml Adds daily scheduled A/A stability workflow and issue creation on failure.
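The `Regression` trait introduced in `diskann-benchmark-runner/src/benchmark.rs` is not shown in this diff excerpt. As a rough sketch of the idea it describes — a typed before/after comparison validated against a tolerance — something like the following could work (all names and signatures here are hypothetical, not the PR's actual API):

```rust
// Hypothetical sketch of a typed regression check; the real trait in
// diskann-benchmark-runner almost certainly differs in shape.
trait Regression {
    type Stats;
    /// Compare baseline and candidate stats; true means "within tolerance".
    fn check(&self, before: &Self::Stats, after: &Self::Stats, tolerance: f64) -> bool;
}

struct QpsCheck;

impl Regression for QpsCheck {
    type Stats = f64; // queries per second

    fn check(&self, before: &f64, after: &f64, tolerance: f64) -> bool {
        // Higher is better: only a drop beyond the tolerance fails.
        // (Assumes a non-zero baseline; real code would guard that.)
        (before - after) / before <= tolerance
    }
}

fn main() {
    let check = QpsCheck;
    // A 5% QPS drop passes a 10% tolerance; a 20% drop fails it.
    assert!(check.check(&1000.0, &950.0, 0.10));
    assert!(!check.check(&1000.0, &800.0, 0.10));
}
```

The associated `Stats` type is what makes the comparison "typed" in the sense the PR description uses: each benchmark backend compares its own statistics structure rather than an untyped metric map.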


Comment on lines +223 to +228
/// Aggregated result of a disk-index regression check.
#[derive(Debug, Serialize)]
struct DiskIndexCheckResult {
search_l: u32,
comparisons: Vec<MetricComparison>,
}
Copilot AI Apr 7, 2026

DiskIndexCheckResult stores a single search_l value, but comparisons aggregates metrics across all search_results_per_l entries. If the input contains multiple search_l values, the rendered output will be ambiguous/misleading because individual rows don't indicate which search_l they correspond to. Consider grouping comparisons per search_l (e.g., Vec<PerSearchLResult>) or include search_l on each MetricComparison row and print it.
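The grouping the reviewer suggests could look roughly like this (illustrative names and fields only, not the PR's actual types):

```rust
// Sketch of grouping metric comparisons per search_l so each rendered
// row is unambiguous. Field set is illustrative.
#[derive(Debug)]
struct MetricComparison {
    metric: &'static str,
    before: f64,
    after: f64,
}

#[derive(Debug)]
struct PerSearchLResult {
    search_l: u32,
    comparisons: Vec<MetricComparison>,
}

#[derive(Debug)]
struct DiskIndexCheckResult {
    per_search_l: Vec<PerSearchLResult>,
}

fn main() {
    let result = DiskIndexCheckResult {
        per_search_l: vec![
            PerSearchLResult {
                search_l: 100,
                comparisons: vec![MetricComparison { metric: "qps", before: 1000.0, after: 980.0 }],
            },
            PerSearchLResult {
                search_l: 200,
                comparisons: vec![MetricComparison { metric: "qps", before: 800.0, after: 805.0 }],
            },
        ],
    };
    // Every printed row now carries its own search_l.
    for group in &result.per_search_l {
        for c in &group.comparisons {
            println!("L={} {}: {} -> {}", group.search_l, c.metric, c.before, c.after);
        }
    }
}
```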
Comment on lines +299 to +304
// Flip before/after so that a decrease becomes a positive relative_change
let (change_pct, remark, metric_passed) = match relative_change(before, after) {
Ok(change) => {
// For higher-is-better, a negative change is a regression
let ok = -change <= tolerance.get();
if !ok {
Copilot AI Apr 7, 2026

The comment says the code flips before/after so that a decrease becomes a positive relative_change, but the call is still relative_change(before, after). Either update the comment to reflect the actual sign convention (negative = regression for higher-is-better) or change the computation to match the comment to avoid future confusion.
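The sign convention under discussion can be made concrete with a small sketch. Assuming `relative_change` computes `(after - before) / before` (an assumption inferred from the quoted snippet, not verified against the PR's `utils/num.rs`), a drop in a higher-is-better metric yields a negative change, which the quoted code negates before comparing to the tolerance:

```rust
// Minimal sketch of the sign convention: relative_change is assumed to be
// (after - before) / before, so a regression in a higher-is-better metric
// shows up as a negative change.
fn relative_change(before: f64, after: f64) -> Result<f64, &'static str> {
    if before == 0.0 {
        return Err("baseline value is zero");
    }
    Ok((after - before) / before)
}

fn higher_is_better_ok(before: f64, after: f64, tolerance: f64) -> bool {
    match relative_change(before, after) {
        // Negate so that a drop becomes a positive magnitude to compare
        // against the tolerance, mirroring `-change <= tolerance.get()`.
        Ok(change) => -change <= tolerance,
        Err(_) => false,
    }
}

fn main() {
    // QPS drops 5% under a 10% tolerance: passes.
    assert!(higher_is_better_ok(1000.0, 950.0, 0.10));
    // QPS drops 20%: regression.
    assert!(!higher_is_better_ok(1000.0, 800.0, 0.10));
}
```

Nothing is actually "flipped" here, which is the reviewer's point: the comment should describe the negation of the change, not a swap of the arguments.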
Comment on lines 13 to 17
#[derive(Debug, Clone, Copy, Serialize, Deserialize, PartialEq)]
#[non_exhaustive]
pub struct Percentiles<T> {
pub minimum: T,
pub mean: f64,
Copilot AI Apr 7, 2026

Percentiles&lt;T&gt; is a public type and this change adds a new required public field (minimum). Even with #[non_exhaustive], downstream users constructing Percentiles { ... } will break unless they update, so this is a semver-impacting API change. If you want to avoid forcing downstream changes, consider keeping the struct shape stable and exposing the minimum via an accessor/newtype, or make construction go through a constructor function.
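The constructor-function approach the reviewer mentions could be sketched as follows (the field set is illustrative; the real `Percentiles<T>` has more fields than shown):

```rust
// Sketch: funnel construction through a function so that adding fields later
// does not break downstream struct-literal construction.
#[derive(Debug, Clone, Copy, PartialEq)]
#[non_exhaustive]
pub struct Percentiles {
    pub minimum: f64,
    pub mean: f64,
    pub p99: f64,
}

impl Percentiles {
    // New fields can be added later without breaking existing callers,
    // as long as they get defaults here or a new constructor variant.
    pub fn new(minimum: f64, mean: f64, p99: f64) -> Self {
        Self { minimum, mean, p99 }
    }
}

fn main() {
    let p = Percentiles::new(0.1, 1.5, 9.9);
    assert_eq!(p.minimum, 0.1);
}
```

Note that `#[non_exhaustive]` already blocks struct-literal construction outside the defining crate, so the breakage the reviewer describes applies to in-crate and pre-existing literal users; a constructor makes the upgrade path uniform either way.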
Comment on lines +91 to +102
#[test]
fn check_verify() {
let input_path = PathBuf::from(env!("CARGO_MANIFEST_DIR"))
.join("examples")
.join("simd-scalar.json");
let tolerance_path = PathBuf::from(env!("CARGO_MANIFEST_DIR"))
.join("examples")
.join("tolerance.json");

let stdout = run_check_test(&input_path, &tolerance_path);
println!("stdout = {}", stdout);
}
Copilot AI Apr 7, 2026

check_verify currently just prints the verifier output and has no assertions, so it will pass even if verification fails or produces unexpected output. Consider asserting that run_check_test(...) returns an empty string (or whatever the expected success output is), and avoid println! noise in tests unless the assertion fails.
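The assertion the reviewer asks for could look like this. The stand-in `run_check_test` below and the empty-stdout-on-success contract are assumptions inferred from the quoted test, not verified against the real harness:

```rust
// Stand-in for the real test harness helper; assumed to return captured
// stdout, with a passing verification producing no output.
fn run_check_test(_input: &str, _tolerance: &str) -> String {
    String::new()
}

fn main() {
    let stdout = run_check_test("simd-scalar.json", "tolerance.json");
    // Fail loudly, surfacing the captured output only when the check fails,
    // instead of unconditionally printing it.
    assert!(stdout.is_empty(), "unexpected verifier output: {stdout}");
}
```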
Comment on lines +71 to +74
- name: Checkout baseline (${{ inputs.baseline_ref || 'main' }})
uses: actions/checkout@v4
with:
ref: ${{ inputs.baseline_ref || 'main' }}
Copilot AI Apr 7, 2026

The workflow is triggered by both workflow_dispatch and pull_request, but it references ${{ inputs.baseline_ref }} when checking out the baseline. The inputs context is only defined for workflow_dispatch, so this can fail on PR runs. Prefer using github.event.inputs.baseline_ref (which safely resolves to null on non-dispatch events) or an explicit conditional on github.event_name with a main default.

Suggested change
- name: Checkout baseline (${{ inputs.baseline_ref || 'main' }})
uses: actions/checkout@v4
with:
ref: ${{ inputs.baseline_ref || 'main' }}
- name: Checkout baseline (${{ github.event.inputs.baseline_ref || 'main' }})
uses: actions/checkout@v4
with:
ref: ${{ github.event.inputs.baseline_ref || 'main' }}
# DiskANN Benchmarks Workflow
#
# This workflow runs macro benchmarks comparing the current branch against a baseline.
# It is manually triggered and requires a baseline reference (branch, tag, or commit).
Copilot AI Apr 7, 2026

Top-of-file comments say the workflow is "manually triggered" and "requires a baseline reference", but the workflow now also runs on pull_request (where the baseline is implicitly main). Updating these comments will prevent future confusion about how/when the workflow runs.

Suggested change
# It is manually triggered and requires a baseline reference (branch, tag, or commit).
# It can be triggered manually with a baseline reference (branch, tag, or commit),
# or automatically on pull requests to `main`, where the baseline is implicitly `main`.

permissions:
contents: read
pull-requests: write # Required for posting PR comments
Copilot AI Apr 7, 2026

pull-requests: write permission is declared, but this workflow doesn't appear to post PR comments or otherwise write to the PR. Consider dropping it (or scoping it to the specific step that needs it) to follow least-privilege, especially since this runs on pull_request.

Suggested change
pull-requests: write # Required for posting PR comments
@codecov-commenter

codecov-commenter commented Apr 7, 2026

Codecov Report

❌ Patch coverage is 93.73434% with 75 lines in your changes missing coverage. Please review.
✅ Project coverage is 89.40%. Comparing base (0ced23d) to head (c9ebf6c).
⚠️ Report is 2 commits behind head on main.

Files with missing lines Patch % Lines
diskann-benchmark-runner/src/app.rs 86.02% 19 Missing ⚠️
diskann-benchmark-simd/src/lib.rs 93.47% 12 Missing ⚠️
diskann-benchmark-runner/src/benchmark.rs 89.42% 11 Missing ⚠️
diskann-benchmark-runner/src/test/dim.rs 87.91% 11 Missing ⚠️
diskann-benchmark-runner/src/registry.rs 84.61% 10 Missing ⚠️
...iskann-benchmark-runner/src/internal/regression.rs 97.70% 9 Missing ⚠️
diskann-benchmark-runner/src/jobs.rs 91.66% 1 Missing ⚠️
diskann-benchmark-runner/src/test/typed.rs 97.61% 1 Missing ⚠️
diskann-benchmark-runner/src/ux.rs 95.65% 1 Missing ⚠️
Additional details and impacted files


@@            Coverage Diff             @@
##             main     #912      +/-   ##
==========================================
+ Coverage   89.31%   89.40%   +0.09%     
==========================================
  Files         445      449       +4     
  Lines       84095    85057     +962     
==========================================
+ Hits        75113    76049     +936     
- Misses       8982     9008      +26     
Flag Coverage Δ
miri 89.40% <93.73%> (+0.09%) ⬆️
unittests 89.25% <93.73%> (+0.09%) ⬆️

Flags with carried forward coverage won't be shown.

Files with missing lines Coverage Δ
diskann-benchmark-runner/src/checker.rs 72.66% <ø> (ø)
diskann-benchmark-runner/src/input.rs 78.37% <100.00%> (ø)
diskann-benchmark-runner/src/internal/mod.rs 100.00% <100.00%> (ø)
diskann-benchmark-runner/src/result.rs 97.84% <100.00%> (+0.03%) ⬆️
diskann-benchmark-runner/src/test/mod.rs 100.00% <100.00%> (ø)
diskann-benchmark-runner/src/utils/fmt.rs 97.50% <ø> (ø)
diskann-benchmark-runner/src/utils/num.rs 100.00% <100.00%> (ø)
diskann-benchmark-runner/src/utils/percentiles.rs 100.00% <100.00%> (ø)
diskann-benchmark-simd/src/bin.rs 87.93% <100.00%> (+6.35%) ⬆️
diskann-benchmark-runner/src/jobs.rs 96.82% <91.66%> (+0.33%) ⬆️
... and 8 more

... and 1 file with indirect coverage changes



3 participants