Fix silent numerical blowup in regress_out with near-collinear regressors#673
Open
Arshammik wants to merge 1 commit into
Open
Fix silent numerical blowup in regress_out with near-collinear regressors#673Arshammik wants to merge 1 commit into
Arshammik wants to merge 1 commit into
Conversation
`cp.linalg.det(R^T R) == 0` was a vacuous guard: in float arithmetic the det of a near-singular Gram matrix is small-but-nonzero (cond=1e12 gives det≈90, not 0), so the guard never fired for near-collinear regressors and the unstable `cp.linalg.inv` path was taken, producing residuals ~1.6e+5 instead of ~1. Replace the `det == 0` check with `cp.linalg.cond(gram) > 1e12` in both the dense (`_regress_out_continuous`) and Dask (`_regress_out_continuous_dask`) paths. The threshold is set to 1e12 to be permissive of well-scaled single-cell regressors with disparate magnitudes (e.g. n_counts ~ 1e3 and percent_mito ~ 1e-2 routinely give cond ~ 1e10) while still catching truly-collinear inputs. While there, switch the dense "all" path from explicit `inv() @ (R^T @ X)` to `cp.linalg.solve(gram, R^T @ X)` — same math, more numerically stable. The existing test's f64 atol is loosened from 1e-7 to 1e-6 to accommodate the small numerical drift between `inv` and `solve`. A new test (`test_regress_out_near_singular_regressors_no_garbage`) verifies that for near-collinear regressors (cond ~ 1e12) the output contains no NaN/Inf and residuals stay bounded by 100× the input range (vs the previous explosion to 1e+5).
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Hi,
rsc.pp.regress_outwas supposed to fall back to a pseudoinverse when the regressor Gram matrix was singular, but the guard wascp.linalg.det(R.T @ R) == 0. In float arithmetic the determinant of a near-singular matrix is small but not zero (a Gram with cond around 1.45e+12 gives det around 91), so the check never fired and thecp.linalg.inv(R.T @ R) @ Xpath ran on a badly-conditioned matrix. Users passing two near-collinear continuous regressors got residuals on the order of 1e+5 where they should have been order 1.I replaced the det check with a condition-number check at threshold
1e12, and switched the dense path frominv() @ Xtocp.linalg.solve(gram, X)— same math, about 1 ULP tighter. The threshold is calibrated so the common single-cell case of disparate-scale regressors (n_countsandpercent_mitoroutinely sit at cond around 1e10) still goes through the fast path; only genuinely degenerate inputs trip the fallback. The Dask path can't fall back cleanly the same way, so it raisesValueErrorwith the actual cond number — clearer than the old "linearly dependent" message it used to produce only when det was exactly zero.I verified the fix on an H100. On a synthetic case with cond around 1.45e+12 the max absolute residual went from 163,776 (one entry was -3,701 where the lstsq reference said 0.62) to 9.5 — roughly four orders of magnitude. The well-conditioned control case is unchanged at 4.7e-7. The full existing test suite still passes; one assertion in
test_regress_out_ordinal[float64]needed its atol loosened from 1e-7 to 1e-6 because the test compares the "all" path (nowsolve) against the batched cuML SVD path and they now drift by about 3e-7 on f64. There's a comment in the test explaining the change.Two new tests live in
tests/test_regressout.py: one parametrized over f32/f64 confirms the dense path produces no NaN/Inf and keeps residuals bounded for near-collinear regressors, the other confirms the Dask path raisesValueError. I also added a one-line Notes entry to theregress_outdocstring so users know about the fallback behavior.Thanks,
Arsham