Skip to content

Per-batch clip threshold leaks across batches in _highly_variable_pearson_residuals#674

Open
Arshammik wants to merge 2 commits into
scverse:mainfrom
Arshammik:fix/hvg-pearson-clip-per-batch
Open

Per-batch clip threshold leaks across batches in _highly_variable_pearson_residuals#674
Arshammik wants to merge 2 commits into
scverse:mainfrom
Arshammik:fix/hvg-pearson-clip-per-batch

Conversation

@Arshammik
Copy link
Copy Markdown
Contributor

In _highly_variable_pearson_residuals, the default clip threshold is computed inside the per-batch loop guarded by if clip is None. Because the assignment writes back to the clip parameter itself, the guard is only true on the first iteration; every subsequent batch reuses the first batch's threshold instead of computing its own.

Lause, Berens & Kobak (2021) define the clip as sqrt(N) on the cell count entering the residual computation, so with batch_key the correct value is sqrt(n_cells_in_batch). On a 5000/200 split, the 200-cell batch silently inherits sqrt(5000) ≈ 70.7 instead of sqrt(200) ≈ 14.1. Residuals in the small batch that should be clipped are not, its per-gene variance is inflated, and HVG ranking shifts. The output also depends on np.unique's alphabetical ordering of batch labels, so renaming "A" to "B" (and vice versa) on otherwise identical data changes the HVG list — not the kind of reproducibility surprise one wants buried in a preprocessing default.

The fix introduces a loop-local clip_batch so the per-batch default is recomputed each iteration; the user-facing clip parameter is left alone. Two tests cover it. test_pearson_residuals_batch_clip_scoped_per_batch monkeypatches the csc_hvg_res / dense_hvg_res bindings and asserts that two distinct clip values reach the kernel across batches — on the buggy code both calls receive the same value. test_pearson_residuals_batch_order_invariant swaps AB on identical data and checks that residual_variances matches bit-for-bit. Both fail on the unpatched code; the rest of the pearson + HVG suite (58 tests) still passes. Verified on an H100.

scanpy ports this routine almost verbatim and carries the same bug — I have a parallel PR open at scverse/scanpy#4138 with the same one-line scoping fix and equivalent CPU-side tests. Happy to land them together, or to hold this one until the scanpy side merges, whichever you prefer.

Arshammik and others added 2 commits May 23, 2026 13:49
`_highly_variable_pearson_residuals` reassigned the function-scoped
`clip` parameter inside the per-batch loop via `if clip is None: clip =
cp.sqrt(n, ...)`. After the first batch, `clip is None` was False, so
every subsequent batch reused the first batch's threshold instead of
computing its own `sqrt(n_cells_in_batch)` per Lause-Berens-Kobak 2021.

For a 5000/200 split with `clip=None`, the small batch silently inherited
`sqrt(5000) ≈ 70.7` instead of using its own `sqrt(200) ≈ 14.1`. Residuals
in the small batch that should be clipped were not, inflating their
per-gene variance and shifting HVG ranking. The result also depended on
which batch label was alphabetically first (since `np.unique` sorts), so
renaming "A"↔"B" silently changed the HVG output.

The fix introduces a loop-local `clip_batch` variable so the per-batch
default is recomputed each iteration. The user-facing `clip` parameter
is untouched.

Two new tests in `tests/test_hvg.py`:
- `test_pearson_residuals_batch_clip_scoped_per_batch` monkeypatches the
  underlying `_pr_cuda.csc_hvg_res` / `dense_hvg_res` bindings and
  asserts that two distinct clip values reach the kernel for a 5000/200
  split.
- `test_pearson_residuals_batch_order_invariant` swaps batch labels A↔B
  on identical data and asserts that `residual_variances` is unchanged.

Both fail on the unpatched code and pass on the fix; the full 58-test
pearson HVG subset of `tests/test_hvg.py` still passes.

scanpy ports the same routine and has the same bug — see scverse/scanpy#4138.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant