Per-batch clip threshold leaks across batches in _highly_variable_pearson_residuals#674
Open
Arshammik wants to merge 2 commits into
Open
Per-batch clip threshold leaks across batches in _highly_variable_pearson_residuals#674Arshammik wants to merge 2 commits into
_highly_variable_pearson_residuals#674Arshammik wants to merge 2 commits into
Conversation
`_highly_variable_pearson_residuals` reassigned the function-scoped `clip` parameter inside the per-batch loop via `if clip is None: clip = cp.sqrt(n, ...)`. After the first batch, `clip is None` was False, so every subsequent batch reused the first batch's threshold instead of computing its own `sqrt(n_cells_in_batch)` per Lause-Berens-Kobak 2021. For a 5000/200 split with `clip=None`, the small batch silently inherited `sqrt(5000) ≈ 70.7` instead of using its own `sqrt(200) ≈ 14.1`. Residuals in the small batch that should be clipped were not, inflating their per-gene variance and shifting HVG ranking. The result also depended on which batch label was alphabetically first (since `np.unique` sorts), so renaming "A"↔"B" silently changed the HVG output. The fix introduces a loop-local `clip_batch` variable so the per-batch default is recomputed each iteration. The user-facing `clip` parameter is untouched. Two new tests in `tests/test_hvg.py`: - `test_pearson_residuals_batch_clip_scoped_per_batch` monkeypatches the underlying `_pr_cuda.csc_hvg_res` / `dense_hvg_res` bindings and asserts that two distinct clip values reach the kernel for a 5000/200 split. - `test_pearson_residuals_batch_order_invariant` swaps batch labels A↔B on identical data and asserts that `residual_variances` is unchanged. Both fail on the unpatched code and pass on the fix; the full 58-test pearson HVG subset of `tests/test_hvg.py` still passes. scanpy ports the same routine and has the same bug — see scverse/scanpy#4138.
for more information, see https://pre-commit.ci
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
In
_highly_variable_pearson_residuals, the default clip threshold is computed inside the per-batch loop guarded byif clip is None. Because the assignment writes back to theclipparameter itself, the guard is only true on the first iteration; every subsequent batch reuses the first batch's threshold instead of computing its own.Lause, Berens & Kobak (2021) define the clip as
sqrt(N)on the cell count entering the residual computation, so withbatch_keythe correct value issqrt(n_cells_in_batch). On a 5000/200 split, the 200-cell batch silently inheritssqrt(5000) ≈ 70.7instead ofsqrt(200) ≈ 14.1. Residuals in the small batch that should be clipped are not, its per-gene variance is inflated, and HVG ranking shifts. The output also depends onnp.unique's alphabetical ordering of batch labels, so renaming"A"to"B"(and vice versa) on otherwise identical data changes the HVG list — not the kind of reproducibility surprise one wants buried in a preprocessing default.The fix introduces a loop-local
clip_batchso the per-batch default is recomputed each iteration; the user-facingclipparameter is left alone. Two tests cover it.test_pearson_residuals_batch_clip_scoped_per_batchmonkeypatches thecsc_hvg_res/dense_hvg_resbindings and asserts that two distinct clip values reach the kernel across batches — on the buggy code both calls receive the same value.test_pearson_residuals_batch_order_invariantswapsA↔Bon identical data and checks thatresidual_variancesmatches bit-for-bit. Both fail on the unpatched code; the rest of the pearson + HVG suite (58 tests) still passes. Verified on an H100.scanpy ports this routine almost verbatim and carries the same bug — I have a parallel PR open at scverse/scanpy#4138 with the same one-line scoping fix and equivalent CPU-side tests. Happy to land them together, or to hold this one until the scanpy side merges, whichever you prefer.