Add benchmark pipeline with Rust-native A/B validation #912

YuanyuanTian-hh wants to merge 41 commits into main from
Conversation
- Add benchmarks.yml workflow using workflow_dispatch, comparing the current branch against a configurable baseline ref
- Add compare_disk_index_json_output.py to diff benchmark crate JSON outputs into a CSV suitable for benchmark_result_parse.py
- Add benchmark_result_parse.py for validating results and posting PR comments
- Add wikipedia-100K-disk-index.json benchmark config using the public Wikipedia-100K dataset from big-ann-benchmarks (100K Cohere embeddings, 768-dim, cosine distance) to replace internal ADO datasets
…or ADO mimir-enron, not applicable to public datasets on GitHub runners. Threshold calibration tracked in PBI.
…-normalized, metric is inner product)
…p, not cosine similarity)
Replaces the previous 3-step pipeline (JSON → CSV → Markdown → validate) with a single script that reads both JSONs directly, compares metrics, writes Markdown to the step summary, checks thresholds, and posts PR comments. Removed:
- compare_disk_index_json_output.py (JSON diff → CSV)
- csv_to_markdown.py (CSV → Markdown)
- benchmark_result_parse.py (CSV threshold check)
Also removes the pip install of csvtomd/numpy/scipy; all scripts now use stdlib only.
…ersion, fix missing-field handling, clean up orphaned thresholds, switch data source to BAB v0.4.0
…pipeline' into user/tianyuanyuan/benchmark-regression
- Add Deserialize to DiskIndexStats, DiskSearchStats, DiskSearchResult, DiskBuildStats
- Implement Regression trait for DiskIndex<T> with typed before/after comparison
- Add DiskIndexTolerance type with configurable thresholds for 7 metrics
- Create disk-index-tolerances.json (10% build/QPS, 1% recall/IOs/comps, 15% latency)
- Switch registration from register() to register_regression()
- Replace Python benchmark_validate.py with a Rust-native check run in both workflows
- Delete benchmark_validate.py (no longer needed)
@YuanyuanTian-hh please read the following Contributor License Agreement (CLA). If you agree with the CLA, please reply with the following information.
Pull request overview
Adds a Rust-native benchmark regression pipeline that compares “before vs after” benchmark outputs in CI using typed deserialization + a Regression trait, removing the need for external Python validation.
Changes:
- Implement regression checking for disk-index benchmarks (tolerance type + metric comparisons) and make disk-index build/search stats deserializable for A/B comparison.
- Extend diskann-benchmark-runner with check CLI subcommands and internal tolerance matching/dispatch, plus numerous regression/UX tests.
- Add GitHub Actions workflows and benchmark/tolerance JSON inputs to run and validate regressions on PRs and via daily A/A runs.
Reviewed changes
Copilot reviewed 111 out of 151 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| diskann-benchmark/src/backend/disk_index/search.rs | Adds Deserialize to disk search stats/results for typed A/B validation. |
| diskann-benchmark/src/backend/disk_index/build.rs | Adds Deserialize and exposes build time in seconds for regression checks. |
| diskann-benchmark/src/backend/disk_index/benchmarks.rs | Registers disk-index benchmarks as regression-capable and implements metric-based regression checking + tolerance input type. |
| diskann-benchmark/perf_test_inputs/wikipedia-100K-disk-index.json | Adds Wikipedia-100K disk-index benchmark configuration for CI runs. |
| diskann-benchmark/perf_test_inputs/openai-100K-disk-index.json | Adds OpenAI ArXiv-100K disk-index benchmark configuration for CI runs. |
| diskann-benchmark/perf_test_inputs/disk-index-tolerances.json | Adds default tolerance thresholds for disk-index regression validation. |
| diskann-benchmark-simd/src/lib.rs | Wires SIMD benchmarks into regression framework + adds tolerance type and regression check implementation + tests. |
| diskann-benchmark-simd/src/bin.rs | Updates SIMD binary tests to include check verify invocation. |
| diskann-benchmark-simd/examples/tolerance.json | Adds example tolerance file for SIMD regression checks. |
| diskann-benchmark-runner/tests/regression/check-verify-4/tolerances.json | Adds regression UX fixture for incompatible input/tolerance tag error. |
| diskann-benchmark-runner/tests/regression/check-verify-4/stdout.txt | Expected output for check-verify-4. |
| diskann-benchmark-runner/tests/regression/check-verify-4/stdin.txt | Command script for check-verify-4. |
| diskann-benchmark-runner/tests/regression/check-verify-4/README.md | Describes the check-verify-4 scenario. |
| diskann-benchmark-runner/tests/regression/check-verify-4/input.json | Input fixture for check-verify-4. |
| diskann-benchmark-runner/tests/regression/check-verify-3/tolerances.json | Adds regression UX fixture for ambiguous/uncovered/orphaned tolerance matching. |
| diskann-benchmark-runner/tests/regression/check-verify-3/stdout.txt | Expected output for check-verify-3. |
| diskann-benchmark-runner/tests/regression/check-verify-3/stdin.txt | Command script for check-verify-3. |
| diskann-benchmark-runner/tests/regression/check-verify-3/input.json | Input fixture for check-verify-3. |
| diskann-benchmark-runner/tests/regression/check-verify-2/tolerances.json | Adds regression UX fixture for “no matching benchmark” during verify. |
| diskann-benchmark-runner/tests/regression/check-verify-2/stdout.txt | Expected output for check-verify-2. |
| diskann-benchmark-runner/tests/regression/check-verify-2/stdin.txt | Command script for check-verify-2. |
| diskann-benchmark-runner/tests/regression/check-verify-2/README.md | Describes the check-verify-2 scenario. |
| diskann-benchmark-runner/tests/regression/check-verify-2/input.json | Input fixture for check-verify-2. |
| diskann-benchmark-runner/tests/regression/check-verify-1/tolerances.json | Adds regression UX fixture for unrecognized tolerance tag. |
| diskann-benchmark-runner/tests/regression/check-verify-1/stdout.txt | Expected output for check-verify-1. |
| diskann-benchmark-runner/tests/regression/check-verify-1/stdin.txt | Command script for check-verify-1. |
| diskann-benchmark-runner/tests/regression/check-verify-1/README.md | Describes the check-verify-1 scenario. |
| diskann-benchmark-runner/tests/regression/check-verify-1/input.json | Input fixture for check-verify-1. |
| diskann-benchmark-runner/tests/regression/check-verify-0/tolerances.json | Adds regression UX fixture for successful verify (no stdout). |
| diskann-benchmark-runner/tests/regression/check-verify-0/stdin.txt | Command script for check-verify-0. |
| diskann-benchmark-runner/tests/regression/check-verify-0/README.md | Describes the check-verify-0 scenario. |
| diskann-benchmark-runner/tests/regression/check-verify-0/input.json | Input fixture for check-verify-0. |
| diskann-benchmark-runner/tests/regression/check-tolerances-2/stdout.txt | Expected output for requesting a nonexistent tolerance kind. |
| diskann-benchmark-runner/tests/regression/check-tolerances-2/stdin.txt | Command script for check-tolerances-2. |
| diskann-benchmark-runner/tests/regression/check-tolerances-2/README.md | Describes the check-tolerances-2 scenario. |
| diskann-benchmark-runner/tests/regression/check-tolerances-1/stdout.txt | Expected output for describing a specific tolerance kind. |
| diskann-benchmark-runner/tests/regression/check-tolerances-1/stdin.txt | Command script for check-tolerances-1. |
| diskann-benchmark-runner/tests/regression/check-tolerances-1/README.md | Describes the check-tolerances-1 scenario. |
| diskann-benchmark-runner/tests/regression/check-tolerances-0/stdout.txt | Expected output for listing all tolerance kinds. |
| diskann-benchmark-runner/tests/regression/check-tolerances-0/stdin.txt | Command script for check-tolerances-0. |
| diskann-benchmark-runner/tests/regression/check-tolerances-0/README.md | Describes the check-tolerances-0 scenario. |
| diskann-benchmark-runner/tests/regression/check-skeleton-0/stdout.txt | Expected output for tolerance skeleton printing. |
| diskann-benchmark-runner/tests/regression/check-skeleton-0/stdin.txt | Command script for check-skeleton-0. |
| diskann-benchmark-runner/tests/regression/check-skeleton-0/README.md | Describes the check-skeleton-0 scenario. |
| diskann-benchmark-runner/tests/regression/check-run-pass-0/tolerances.json | Adds regression UX fixture for successful check run execution. |
| diskann-benchmark-runner/tests/regression/check-run-pass-0/stdout.txt | Expected output for successful check run. |
| diskann-benchmark-runner/tests/regression/check-run-pass-0/stdin.txt | Command script for pass-case check run. |
| diskann-benchmark-runner/tests/regression/check-run-pass-0/README.md | Describes the check-run-pass-0 scenario. |
| diskann-benchmark-runner/tests/regression/check-run-pass-0/output.json | Output fixture used as both before/after. |
| diskann-benchmark-runner/tests/regression/check-run-pass-0/input.json | Input fixture for pass-case run. |
| diskann-benchmark-runner/tests/regression/check-run-pass-0/checks.json | Expected JSON output from pass-case checks. |
| diskann-benchmark-runner/tests/regression/check-run-fail-0/tolerances.json | Adds regression UX fixture for a failing check run result. |
| diskann-benchmark-runner/tests/regression/check-run-fail-0/stdout.txt | Expected output for failing check run. |
| diskann-benchmark-runner/tests/regression/check-run-fail-0/stdin.txt | Command script for fail-case check run. |
| diskann-benchmark-runner/tests/regression/check-run-fail-0/README.md | Describes the check-run-fail-0 scenario. |
| diskann-benchmark-runner/tests/regression/check-run-fail-0/output.json | Output fixture for fail-case run. |
| diskann-benchmark-runner/tests/regression/check-run-fail-0/input.json | Input fixture for fail-case run. |
| diskann-benchmark-runner/tests/regression/check-run-fail-0/checks.json | Expected JSON output from fail-case checks. |
| diskann-benchmark-runner/tests/regression/check-run-error-3/tolerances.json | Adds regression UX fixture for before/after schema drift error reporting. |
| diskann-benchmark-runner/tests/regression/check-run-error-3/stdout.txt | Expected output for schema drift error. |
| diskann-benchmark-runner/tests/regression/check-run-error-3/stdin.txt | Command script for schema drift error. |
| diskann-benchmark-runner/tests/regression/check-run-error-3/regression_input.json | Regression input fixture to force schema mismatch. |
| diskann-benchmark-runner/tests/regression/check-run-error-3/README.md | Describes the check-run-error-3 scenario. |
| diskann-benchmark-runner/tests/regression/check-run-error-3/output.json | Output fixture used to trigger schema mismatch. |
| diskann-benchmark-runner/tests/regression/check-run-error-3/input.json | Input fixture used to generate output.json. |
| diskann-benchmark-runner/tests/regression/check-run-error-3/checks.json | Expected JSON output from error-case checks. |
| diskann-benchmark-runner/tests/regression/check-run-error-2/tolerances.json | Adds regression UX fixture for “input drift” dispatch failure in check run. |
| diskann-benchmark-runner/tests/regression/check-run-error-2/stdout.txt | Expected output for input drift dispatch failure. |
| diskann-benchmark-runner/tests/regression/check-run-error-2/stdin.txt | Command script for input drift dispatch failure. |
| diskann-benchmark-runner/tests/regression/check-run-error-2/regression_input.json | Regression input fixture that drifts to unsupported type. |
| diskann-benchmark-runner/tests/regression/check-run-error-2/README.md | Describes the check-run-error-2 scenario. |
| diskann-benchmark-runner/tests/regression/check-run-error-2/output.json | Output fixture for error-case run. |
| diskann-benchmark-runner/tests/regression/check-run-error-2/input.json | Input fixture for error-case run. |
| diskann-benchmark-runner/tests/regression/check-run-error-1/tolerances.json | Adds regression UX fixture for before/after length mismatch. |
| diskann-benchmark-runner/tests/regression/check-run-error-1/stdout.txt | Expected output for length mismatch. |
| diskann-benchmark-runner/tests/regression/check-run-error-1/stdin.txt | Command script for length mismatch. |
| diskann-benchmark-runner/tests/regression/check-run-error-1/regression_input.json | Regression input fixture with different job count. |
| diskann-benchmark-runner/tests/regression/check-run-error-1/README.md | Describes the check-run-error-1 scenario. |
| diskann-benchmark-runner/tests/regression/check-run-error-1/output.json | Output fixture with mismatched job count. |
| diskann-benchmark-runner/tests/regression/check-run-error-1/input.json | Input fixture used to generate output.json. |
| diskann-benchmark-runner/tests/regression/check-run-error-0/tolerances.json | Adds regression UX fixture for infrastructure error propagation. |
| diskann-benchmark-runner/tests/regression/check-run-error-0/stdout.txt | Expected output for infrastructure errors. |
| diskann-benchmark-runner/tests/regression/check-run-error-0/stdin.txt | Command script for infrastructure errors. |
| diskann-benchmark-runner/tests/regression/check-run-error-0/README.md | Describes the check-run-error-0 scenario. |
| diskann-benchmark-runner/tests/regression/check-run-error-0/output.json | Output fixture for infrastructure errors. |
| diskann-benchmark-runner/tests/regression/check-run-error-0/input.json | Input fixture for infrastructure errors. |
| diskann-benchmark-runner/tests/regression/check-run-error-0/checks.json | Expected JSON output from error-case checks. |
| diskann-benchmark-runner/tests/benchmark/test-success-1/stdout.txt | Adds expected output for run --dry-run success. |
| diskann-benchmark-runner/tests/benchmark/test-success-1/stdin.txt | Adds command script for run --dry-run. |
| diskann-benchmark-runner/tests/benchmark/test-success-1/README.md | Describes the dry-run behavior expectation. |
| diskann-benchmark-runner/tests/benchmark/test-success-1/input.json | Input fixture for dry-run test. |
| diskann-benchmark-runner/tests/benchmark/test-success-0/stdout.txt | Updates expected stdout for successful run output text changes. |
| diskann-benchmark-runner/tests/benchmark/test-success-0/stdin.txt | Adds command script for benchmark success test. |
| diskann-benchmark-runner/tests/benchmark/test-success-0/README.md | Describes benchmark success test. |
| diskann-benchmark-runner/tests/benchmark/test-success-0/output.json | Adds expected output.json for benchmark success test. |
| diskann-benchmark-runner/tests/benchmark/test-success-0/input.json | Adds input fixture for benchmark success test. |
| diskann-benchmark-runner/tests/benchmark/test-overload-0/stdout.txt | Adds expected output for overload/dispatch scoring test. |
| diskann-benchmark-runner/tests/benchmark/test-overload-0/stdin.txt | Adds command script for overload test. |
| diskann-benchmark-runner/tests/benchmark/test-overload-0/README.md | Describes overload/dispatch selection behavior. |
| diskann-benchmark-runner/tests/benchmark/test-overload-0/output.json | Adds expected output.json for overload test. |
| diskann-benchmark-runner/tests/benchmark/test-overload-0/input.json | Adds input fixture for overload test. |
| diskann-benchmark-runner/tests/benchmark/test-mismatch-1/stdout.txt | Adds expected diagnostics for mismatch description paths. |
| diskann-benchmark-runner/tests/benchmark/test-mismatch-1/stdin.txt | Adds command script for mismatch test. |
| diskann-benchmark-runner/tests/benchmark/test-mismatch-1/README.md | Describes mismatch diagnostics scenario. |
| diskann-benchmark-runner/tests/benchmark/test-mismatch-1/input.json | Adds input fixture for mismatch test. |
| diskann-benchmark-runner/tests/benchmark/test-mismatch-0/stdout.txt | Adds expected diagnostics for “closest matches” reporting. |
| diskann-benchmark-runner/tests/benchmark/test-mismatch-0/stdin.txt | Adds command script for mismatch test. |
| diskann-benchmark-runner/tests/benchmark/test-mismatch-0/README.md | Describes mismatch “closest matches” behavior. |
| diskann-benchmark-runner/tests/benchmark/test-mismatch-0/input.json | Adds input fixture for mismatch test. |
| diskann-benchmark-runner/tests/benchmark/test-deserialization-error-0/stdout.txt | Adds expected output for input deserialization error reporting. |
| diskann-benchmark-runner/tests/benchmark/test-deserialization-error-0/stdin.txt | Adds command script for deserialization error test. |
| diskann-benchmark-runner/tests/benchmark/test-deserialization-error-0/README.md | Describes deserialization error behavior. |
| diskann-benchmark-runner/tests/benchmark/test-deserialization-error-0/input.json | Adds input fixture with invalid enum value. |
| diskann-benchmark-runner/tests/benchmark/test-4/stdout.txt | Updates benchmark listing output to include new simple bench. |
| diskann-benchmark-runner/tests/benchmark/test-4/stdin.txt | Adds command script for benchmark listing. |
| diskann-benchmark-runner/tests/benchmark/test-4/README.md | Describes benchmark listing test. |
| diskann-benchmark-runner/tests/benchmark/test-3/stdout.txt | Adds expected output for describing a specific input kind. |
| diskann-benchmark-runner/tests/benchmark/test-3/stdin.txt | Adds command script for input describe test. |
| diskann-benchmark-runner/tests/benchmark/test-3/README.md | Describes input describe test. |
| diskann-benchmark-runner/tests/benchmark/test-2/stdout.txt | Adds expected output for describing a specific input kind. |
| diskann-benchmark-runner/tests/benchmark/test-2/stdin.txt | Adds command script for input describe test. |
| diskann-benchmark-runner/tests/benchmark/test-2/README.md | Describes input describe test. |
| diskann-benchmark-runner/tests/benchmark/test-1/stdout.txt | Adds expected output for listing available input kinds. |
| diskann-benchmark-runner/tests/benchmark/test-1/stdin.txt | Adds command script for input listing. |
| diskann-benchmark-runner/tests/benchmark/test-1/README.md | Describes input listing test. |
| diskann-benchmark-runner/tests/benchmark/test-0/stdout.txt | Adds expected output for skeleton input printing. |
| diskann-benchmark-runner/tests/benchmark/test-0/stdin.txt | Adds command script for skeleton test. |
| diskann-benchmark-runner/tests/benchmark/test-0/README.md | Describes skeleton test. |
| diskann-benchmark-runner/src/ux.rs | Adds scrub_path helper and improves backtrace stripping logic for deterministic test output. |
| diskann-benchmark-runner/src/utils/percentiles.rs | Adds minimum percentile field and marks Percentiles non-exhaustive. |
| diskann-benchmark-runner/src/utils/num.rs | Adds constrained numeric deserialization + relative_change helper for regression checks. |
| diskann-benchmark-runner/src/utils/mod.rs | Exposes new num utilities module. |
| diskann-benchmark-runner/src/utils/fmt.rs | Adds clippy expectation annotation for bounds-checked panic. |
| diskann-benchmark-runner/src/test/typed.rs | Refactors test benches and adds regression-capable typed benchmark checks. |
| diskann-benchmark-runner/src/test/mod.rs | Centralizes registration of test inputs/benchmarks including regression variants. |
| diskann-benchmark-runner/src/test/dim.rs | Adds dimensional test benchmarks including a non-regression “simple bench”. |
| diskann-benchmark-runner/src/result.rs | Adds RawResult loader for reuse in regression checking pipeline. |
| diskann-benchmark-runner/src/registry.rs | Extends registry with regression benchmark registration + tolerance discovery. |
| diskann-benchmark-runner/src/lib.rs | Exposes benchmark module publicly and adds internal module plumbing. |
| diskann-benchmark-runner/src/jobs.rs | Refactors job loading/parsing and improves error messages + exposes raw job accessors. |
| diskann-benchmark-runner/src/internal/regression.rs | Implements tolerance parsing, subset matching, regression job assembly, and execution reporting. |
| diskann-benchmark-runner/src/internal/mod.rs | Adds shared load_from_disk helper and internal module structure. |
| diskann-benchmark-runner/src/input.rs | Adds const INSTANCE for Input wrapper to support regression tolerance typing. |
| diskann-benchmark-runner/src/checker.rs | Adds clippy expectation annotation for internal tag invariants. |
| diskann-benchmark-runner/src/benchmark.rs | Introduces Regression trait + internal object-safe regression plumbing for the runner. |
| diskann-benchmark-runner/src/app.rs | Adds check subcommands (skeleton/tolerances/verify/run) and upgrades UX test harness. |
| diskann-benchmark-runner/Cargo.toml | Adjusts clippy lint configuration for unwrap/expect/panic, etc. |
| diskann-benchmark-runner/.clippy.toml | Allows unwrap/expect/panic in tests for this crate. |
| .github/workflows/benchmarks.yml | Adds PR-triggered and manual benchmark regression workflow for two datasets with Rust-native validation. |
| .github/workflows/benchmarks-aa.yml | Adds daily scheduled A/A stability workflow and issue creation on failure. |
```rust
/// Aggregated result of a disk-index regression check.
#[derive(Debug, Serialize)]
struct DiskIndexCheckResult {
    search_l: u32,
    comparisons: Vec<MetricComparison>,
}
```
DiskIndexCheckResult stores a single search_l value, but comparisons aggregates metrics across all search_results_per_l entries. If the input contains multiple search_l values, the rendered output will be ambiguous/misleading because individual rows don't indicate which search_l they correspond to. Consider grouping comparisons per search_l (e.g., Vec<PerSearchLResult>) or include search_l on each MetricComparison row and print it.
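The suggested grouping could look like this minimal sketch; the type and field names (`PerSearchLResult`, `metric`, and so on) are illustrative stand-ins, not the PR's actual API:

```rust
// Hypothetical sketch: group flat (search_l, comparison) rows so each
// rendered section is unambiguous about which search_l it belongs to.
#[derive(Debug)]
struct MetricComparison {
    metric: &'static str,
    before: f64,
    after: f64,
}

#[derive(Debug)]
struct PerSearchLResult {
    search_l: u32,
    comparisons: Vec<MetricComparison>,
}

fn group_by_search_l(rows: Vec<(u32, MetricComparison)>) -> Vec<PerSearchLResult> {
    let mut out: Vec<PerSearchLResult> = Vec::new();
    for (search_l, cmp) in rows {
        // Append to an existing group for this search_l, or start a new one.
        match out.iter_mut().find(|g| g.search_l == search_l) {
            Some(group) => group.comparisons.push(cmp),
            None => out.push(PerSearchLResult { search_l, comparisons: vec![cmp] }),
        }
    }
    out
}

fn main() {
    let rows = vec![
        (10, MetricComparison { metric: "recall", before: 0.95, after: 0.94 }),
        (10, MetricComparison { metric: "qps", before: 1000.0, after: 990.0 }),
        (20, MetricComparison { metric: "recall", before: 0.97, after: 0.97 }),
    ];
    let grouped = group_by_search_l(rows);
    assert_eq!(grouped.len(), 2);
    assert_eq!(grouped[0].comparisons.len(), 2);
    println!("{grouped:?}");
}
```

With this shape, the renderer can emit one table (or one heading) per `search_l`, so no row is ambiguous about its configuration.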
```rust
// Flip before/after so that a decrease becomes a positive relative_change
let (change_pct, remark, metric_passed) = match relative_change(before, after) {
    Ok(change) => {
        // For higher-is-better, a negative change is a regression
        let ok = -change <= tolerance.get();
        if !ok {
```
The comment says the code flips before/after so that a decrease becomes a positive relative_change, but the call is still relative_change(before, after). Either update the comment to reflect the actual sign convention (negative = regression for higher-is-better) or change the computation to match the comment to avoid future confusion.
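The sign convention the code actually implements can be sketched as follows; `higher_is_better_passes` is a hypothetical helper name, and `relative_change` is assumed to be `(after - before) / before`:

```rust
// For a higher-is-better metric, a regression arrives as a NEGATIVE
// relative change; the check negates it before comparing against the
// (positive) tolerance. No flipping of before/after is involved.
fn higher_is_better_passes(relative_change: f64, tolerance: f64) -> bool {
    // A 5% drop arrives as relative_change = -0.05, so `-change <= tol`
    // rejects drops larger than the tolerance.
    -relative_change <= tolerance
}

fn main() {
    assert!(higher_is_better_passes(-0.05, 0.10)); // 5% drop, 10% tolerance: pass
    assert!(!higher_is_better_passes(-0.20, 0.10)); // 20% drop: fail
    assert!(higher_is_better_passes(0.30, 0.10)); // improvements always pass
    println!("sign convention ok");
}
```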
```rust
#[derive(Debug, Clone, Copy, Serialize, Deserialize, PartialEq)]
#[non_exhaustive]
pub struct Percentiles<T> {
    pub minimum: T,
    pub mean: f64,
```
Percentiles<T> is a public type and this change adds a new required public field (minimum). Even with #[non_exhaustive], downstream users constructing Percentiles { ... } will break unless they update, so this is a semver-impacting API change. If you want to avoid forcing downstream changes, consider keeping the struct shape stable and exposing the minimum via an accessor/newtype, or make construction go through a constructor function.
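The constructor-function approach could look like this sketch (non-generic for brevity; `from_sorted` is an illustrative name, not the crate's API):

```rust
// With #[non_exhaustive], downstream crates cannot use a struct literal
// anyway, so routing construction through one function means adding a
// field later only changes this function, not every caller.
#[derive(Debug, Clone, Copy, PartialEq)]
#[non_exhaustive]
pub struct Percentiles {
    pub minimum: f64,
    pub mean: f64,
}

impl Percentiles {
    /// Build from an already-sorted, non-empty sample slice.
    pub fn from_sorted(samples: &[f64]) -> Option<Self> {
        let minimum = *samples.first()?;
        let mean = samples.iter().sum::<f64>() / samples.len() as f64;
        Some(Percentiles { minimum, mean })
    }
}

fn main() {
    let p = Percentiles::from_sorted(&[1.0, 2.0, 3.0]).unwrap();
    assert_eq!(p.minimum, 1.0);
    assert_eq!(p.mean, 2.0);
    println!("{p:?}");
}
```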
```rust
#[test]
fn check_verify() {
    let input_path = PathBuf::from(env!("CARGO_MANIFEST_DIR"))
        .join("examples")
        .join("simd-scalar.json");
    let tolerance_path = PathBuf::from(env!("CARGO_MANIFEST_DIR"))
        .join("examples")
        .join("tolerance.json");

    let stdout = run_check_test(&input_path, &tolerance_path);
    println!("stdout = {}", stdout);
}
```
check_verify currently just prints the verifier output and has no assertions, so it will pass even if verification fails or produces unexpected output. Consider asserting that run_check_test(...) returns an empty string (or whatever the expected success output is), and avoid println! noise in tests unless the assertion fails.
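A fixed version might look like this sketch, where `run_check_test` is a stand-in for the PR's helper (here simulated as a successful verify, which the check-verify-0 fixture suggests produces no output):

```rust
use std::path::Path;

// Stand-in for the real helper: a successful `check verify` is silent.
fn run_check_test(_input: &Path, _tolerances: &Path) -> String {
    String::new()
}

fn main() {
    let stdout = run_check_test(
        Path::new("examples/simd-scalar.json"),
        Path::new("examples/tolerance.json"),
    );
    // Assert on the output instead of printing it unconditionally; the
    // captured output only surfaces in the failure message.
    assert!(stdout.is_empty(), "unexpected verifier output: {stdout}");
}
```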
```yaml
- name: Checkout baseline (${{ inputs.baseline_ref || 'main' }})
  uses: actions/checkout@v4
  with:
    ref: ${{ inputs.baseline_ref || 'main' }}
```
The workflow is triggered by both workflow_dispatch and pull_request, but it references ${{ inputs.baseline_ref }} when checking out the baseline. The inputs context is only defined for workflow_dispatch, so this can fail on PR runs. Prefer using github.event.inputs.baseline_ref (which safely resolves to null on non-dispatch events) or an explicit conditional on github.event_name with a main default.
Suggested change:

```yaml
- name: Checkout baseline (${{ github.event.inputs.baseline_ref || 'main' }})
  uses: actions/checkout@v4
  with:
    ref: ${{ github.event.inputs.baseline_ref || 'main' }}
```
```yaml
# DiskANN Benchmarks Workflow
#
# This workflow runs macro benchmarks comparing the current branch against a baseline.
# It is manually triggered and requires a baseline reference (branch, tag, or commit).
```
Top-of-file comments say the workflow is "manually triggered" and "requires a baseline reference", but the workflow now also runs on pull_request (where the baseline is implicitly main). Updating these comments will prevent future confusion about how/when the workflow runs.
Suggested change:

```yaml
# It can be triggered manually with a baseline reference (branch, tag, or commit),
# or automatically on pull requests to `main`, where the baseline is implicitly `main`.
```
|
|
```yaml
permissions:
  contents: read
  pull-requests: write # Required for posting PR comments
```
pull-requests: write permission is declared, but this workflow doesn't appear to post PR comments or otherwise write to the PR. Consider dropping it (or scoping it to the specific step that needs it) to follow least-privilege, especially since this runs on pull_request.
Suggested change (remove the line):

```yaml
pull-requests: write # Required for posting PR comments
```
Codecov Report

❌ Patch coverage is

Additional details and impacted files:

```diff
@@            Coverage Diff             @@
##             main     #912      +/-   ##
==========================================
+ Coverage   89.31%   89.40%   +0.09%
==========================================
  Files         445      449       +4
  Lines       84095    85057     +962
==========================================
+ Hits        75113    76049     +936
- Misses       8982     9008      +26
```

Flags with carried forward coverage won't be shown.
Add benchmark regression pipeline with Rust-native A/B validation
Summary
Adds an automated benchmark regression pipeline to GitHub Actions. This PR builds on PR #900 (benchmark A/B test framework) by implementing the Regression trait for disk-index benchmarks and wiring it into CI workflows. The pipeline builds and searches two public 100K ANN datasets, compares before/after performance, and validates against configurable tolerances, all in typed Rust code with no Python dependencies.

What's Changed
Benchmark Runner (from PR #900)
- diskann-benchmark-runner: Regression trait, tolerance matching, check run CLI subcommand

Disk-Index Regression Support (this PR)
- diskann-benchmark/src/backend/disk_index/benchmarks.rs: implements the Regression trait for DiskIndex<T> with typed before/after comparison of 7 metrics
- diskann-benchmark/src/backend/disk_index/search.rs: adds Deserialize to DiskSearchStats and DiskSearchResult
- diskann-benchmark/src/backend/disk_index/build.rs: adds Deserialize to DiskBuildStats, exposes build_time_seconds()
- diskann-benchmark/perf_test_inputs/disk-index-tolerances.json
- diskann-benchmark/perf_test_inputs/wikipedia-100K-disk-index.json
- diskann-benchmark/perf_test_inputs/openai-100K-disk-index.json

CI Workflows
- .github/workflows/benchmarks.yml: runs on pull_request to main, validates with cargo ... check run
- .github/workflows/benchmarks-aa.yml: daily scheduled A/A stability workflow

How It Works
- Benchmarks run on both the baseline (main) and the current branch
- Results are validated with: cargo run -p diskann-benchmark --features disk-index --release -- check run --tolerances ... --before ... --after ...

Regression Checks
The Regression trait implementation compares 7 metrics between before and after runs. All checks use relative_change(), which properly handles a zero baseline (it returns an error, not 0%).

Datasets
Improvements over PR #857
This PR addresses all blocking review comments from PR #857:
- .get(field, 0)
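The zero-baseline behavior described under Regression Checks can be sketched as follows; the signature is an assumption for illustration, not the crate's actual API:

```rust
// A zero baseline makes relative change undefined, so return an error
// instead of silently reporting 0% (which would mask a regression).
fn relative_change(before: f64, after: f64) -> Result<f64, String> {
    if before == 0.0 {
        return Err(format!(
            "baseline is zero (after = {after}); relative change undefined"
        ));
    }
    Ok((after - before) / before)
}

fn main() {
    assert!(relative_change(0.0, 5.0).is_err());
    assert!((relative_change(100.0, 110.0).unwrap() - 0.10).abs() < 1e-12);
    println!("zero-baseline handled");
}
```

This is the contrast with the Python `.get(field, 0)` pattern: a missing or zero value surfaces as an explicit error rather than a silently passing 0% change.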