
Performance: significant speedup and memory usage improvement for PCA().run() #5352

Open
jberg5 wants to merge 4 commits into MDAnalysis:develop from jberg5:pca-perf

Conversation


@jberg5 jberg5 commented Mar 29, 2026

Fixes #5351

Changes made in this Pull Request:

  • Build the covariance matrix in batches of 500 frames using vanilla NumPy matmul, rather than accumulating a dot product on every frame. The batched multiply goes through a much better optimized OpenBLAS GEMM routine, and this is where the bulk of the performance improvement comes from.
  • Switch from np.linalg.eig to np.linalg.eigh to take advantage of the covariance matrix being symmetric and real-valued by construction. This is where the bulk of the memory usage improvement comes from: the eigh LAPACK routine allocates a smaller workspace.
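The two changes can be sketched as follows. This is a toy illustration, not the PR's actual diff: the array sizes and variable names (`coords`, `batch`, `cov`) are invented, with `coords` standing in for the mean-centered, flattened atom positions.

```python
import numpy as np

# Toy stand-in for mean-centered, flattened positions (frames x 3*n_atoms).
rng = np.random.default_rng(0)
n_frames, n_dims = 1200, 30
coords = rng.standard_normal((n_frames, n_dims))
coords -= coords.mean(axis=0)

# Change 1: accumulate the covariance in batches with one GEMM per batch,
# instead of one outer product + full-matrix add per frame.
batch = 500
cov = np.zeros((n_dims, n_dims))
for start in range(0, n_frames, batch):
    X = coords[start:start + batch]
    cov += X.T @ X
cov /= n_frames - 1

# Change 2: eigh exploits the symmetry of cov. It is cheaper than eig,
# allocates a smaller LAPACK workspace, and returns real eigenvalues in
# ascending order (eig gives no ordering guarantee).
eigvals, eigvecs = np.linalg.eigh(cov)

# Sanity check against the naive per-frame accumulation:
naive = sum(np.outer(x, x) for x in coords) / (n_frames - 1)
assert np.allclose(cov, naive)
```

The batched and per-frame accumulations are mathematically identical; only the order of floating-point summation differs.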

Results from running a PCA analysis on the AdK equilibrium test data on a GCloud c2d-standard-4 (note: these results are out of date! I switched from a single covariance matrix multiplication to a batched approach; see the table after this one):


  ┌──────────┬───────┬────────┬───────┬─────────┬───────────────┬──────────────┐
  │  Select  │ Atoms │ Before │ After │ Speedup │ Memory Before │ Memory After │
  ├──────────┼───────┼────────┼───────┼─────────┼───────────────┼──────────────┤
  │ CA       │ 214   │ 5.2s   │ 3.9s  │ 1.3x    │ 13 MB         │ 27 MB        │
  ├──────────┼───────┼────────┼───────┼─────────┼───────────────┼──────────────┤
  │ Backbone │ 855   │ 39.0s  │ 6.4s  │ 6.1x    │ 201 MB        │ 183 MB       │
  ├──────────┼───────┼────────┼───────┼─────────┼───────────────┼──────────────┤
  │ All-atom │ 3341  │ 880s   │ 84s   │ 10.5x   │ 3067 MB       │ 1854 MB      │
  └──────────┴───────┴────────┴───────┴─────────┴───────────────┴──────────────┘

Here's the gist I used (heads up: Claude wrote it): https://gist.github.com/jberg5/522fd1892f02453145e59e90e5a41e8b

After adding batching to the covariance matrix computation and running https://gist.github.com/jberg5/ea8fd54d7ed63894527ea812e9aa98ca to compare performance at different numbers of frames, these are the new results:

  All-atom (3,341 atoms) on x86 c2d-standard-4:

  ┌──────────────┬─────────┬───────┬─────────┬────────────┬───────────┐
  │    Frames    │ Before  │ After │ Speedup │ Mem Before │ Mem After │
  ├──────────────┼─────────┼───────┼─────────┼────────────┼───────────┤
  │ 4,187 (1x)   │ 849s    │ 78s   │ 10.9x   │ 3,067 MB   │ 1,572 MB  │
  ├──────────────┼─────────┼───────┼─────────┼────────────┼───────────┤
  │ 8,374 (2x)   │ 1,437s  │ 85s   │ 16.9x   │ 3,067 MB   │ 1,572 MB  │
  ├──────────────┼─────────┼───────┼─────────┼────────────┼───────────┤
  │ 20,935 (5x)  │ 3,190s  │ 105s  │ 30.4x   │ 3,068 MB   │ 1,573 MB  │
  ├──────────────┼─────────┼───────┼─────────┼────────────┼───────────┤
  │ 41,870 (10x) │ skipped │ 137s  │ —       │ —          │ 1,574 MB  │
  └──────────────┴─────────┴───────┴─────────┴────────────┴───────────┘

  Backbone (855 atoms):

  ┌──────────────┬────────┬───────┬─────────┬────────────┬───────────┐
  │    Frames    │ Before │ After │ Speedup │ Mem Before │ Mem After │
  ├──────────────┼────────┼───────┼─────────┼────────────┼───────────┤
  │ 4,187 (1x)   │ 32s    │ 2.1s  │ 15.2x   │ 201 MB     │ 110 MB    │
  ├──────────────┼────────┼───────┼─────────┼────────────┼───────────┤
  │ 8,374 (2x)   │ 53s    │ 2.9s  │ 18.3x   │ 201 MB     │ 111 MB    │
  ├──────────────┼────────┼───────┼─────────┼────────────┼───────────┤
  │ 20,935 (5x)  │ 125s   │ 5.2s  │ 24.0x   │ 202 MB     │ 112 MB    │
  ├──────────────┼────────┼───────┼─────────┼────────────┼───────────┤
  │ 41,870 (10x) │ 243s   │ 9.0s  │ 27.0x   │ 204 MB     │ 113 MB    │
  └──────────────┴────────┴───────┴─────────┴────────────┴───────────┘

LLM / AI generated code disclosure

LLMs or other AI-powered tools (beyond simple IDE use cases) were used in this contribution: yes

First pass of this was done by Claude (with opus 4.6). I've modified it a little bit to remove some unnecessary changes, improve variable names, etc. I also used Claude extensively to write benchmarking scripts.

PR Checklist

  • Issue raised/referenced?
  • Tests updated/added? No - I think existing coverage is pretty good already, and this is a performance improvement
  • Documentation updated/added?
  • package/CHANGELOG file updated?
  • Is your name in package/AUTHORS? (If it is not, add it!)
  • LLM/AI disclosure was updated.

Developers Certificate of Origin

I certify that I can submit this code contribution as described in the Developer Certificate of Origin, under the MDAnalysis LICENSE.


📚 Documentation preview 📚: https://mdanalysis--5352.org.readthedocs.build/en/5352/

@codecov

codecov bot commented Mar 29, 2026

Codecov Report

❌ Patch coverage is 85.71429% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 93.83%. Comparing base (c32fe45) to head (44f3340).
⚠️ Report is 4 commits behind head on develop.

Files with missing lines Patch % Lines
package/MDAnalysis/analysis/pca.py 85.71% 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #5352      +/-   ##
===========================================
- Coverage    93.84%   93.83%   -0.02%     
===========================================
  Files          182      182              
  Lines        22492    22502      +10     
  Branches      3199     3201       +2     
===========================================
+ Hits         21107    21114       +7     
- Misses         923      925       +2     
- Partials       462      463       +1     


Member

@BradyAJohnston BradyAJohnston left a comment


I'm going to block this until we have some core dev discussion about it.

Our AI Policy is currently quite restrictive on AI usage. It's this way to help us deal with the wave of slop PRs that we get, and would currently exclude this PR.

Given the performance improvements, limited scope, and demonstrated understanding, I would argue this is worthy of an exception, but that is a different discussion that we need to have.

@jberg5
Author

jberg5 commented Mar 30, 2026

@BradyAJohnston thanks for the context, totally understand. Just to give a little more background on me and how I have been using Claude: I work in quant finance, where I've picked up some practical experience with using the scientific computing stack to try to pull signal out of noise. My thesis with Claude is that if I use it as a translator, it can help me map my experience onto different domains (perhaps with a little more benefit to society than my day job lol). Turns out PCA is good for more than just looking for factors driving equity prices!

The workflow I'm using looks something like 0) clone a cool project like MDAnalysis 1) find or write suitable benchmarks of realistic workflows 2) tell Claude "here are some patterns I want you to look for" (python loops that can be vectorized, linear algebra operations that don't fully exploit the problem structure, etc), and then 3) sift through the slop to find legit improvements.

I try really hard to stick to small diffs that I can completely stand behind. For this particular PR I ended up rewriting most of what Claude proposed anyway, and I think the end result is pretty much identical to what I would have done by hand, the only difference is that AI helped me find the opportunity and iterate much more quickly than I otherwise would have.

Anyway, I know that there's a lot of garbage out there these days, and it's gotta be a lot of work to wade through it, so I'll respect whatever decision you make. Really appreciate your time!

else:
    x = self._atoms.positions.ravel()
    x -= self._xmean
    self.cov += np.dot(x[:, np.newaxis], x[:, np.newaxis].T)
Member


One reason for building the covariance matrix incrementally is that it ensures bounded memory usage. In your approach, memory consumption grows linearly with trajectory length because you copy all coordinates into memory, so a long enough trajectory will eventually exhaust it. The test trajectory is very short; real trajectories are easily 1,000 to 100,000 times longer.

If there's really an advantage to multiplying huge matrices then perhaps a batched approach is a good middle ground.
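One shape such a batched middle ground could take is sketched below. This is a hypothetical illustration, not MDAnalysis API: the class, its attributes, and the toy sizes are all invented. Frames arrive one at a time (as in a per-frame analysis loop), get buffered, and are flushed into the covariance with a single GEMM once the buffer fills, so peak memory is bounded by the batch size rather than the trajectory length.

```python
import numpy as np

class BatchedCov:
    """Accumulate a covariance matrix from streamed frames in fixed-size batches."""

    def __init__(self, n_dims, batch=500):
        self._buf = np.empty((batch, n_dims))  # memory bounded by batch size
        self._n = 0
        self.cov = np.zeros((n_dims, n_dims))

    def add_frame(self, x):
        """Buffer one mean-centered frame; flush when the buffer fills."""
        self._buf[self._n] = x
        self._n += 1
        if self._n == len(self._buf):
            self.flush()

    def flush(self):
        """Fold the buffered frames into cov with one GEMM."""
        if self._n:
            X = self._buf[:self._n]
            self.cov += X.T @ X  # fused multiply-accumulate via BLAS
            self._n = 0

# Toy usage: 1234 frames of 9 coordinates (3 atoms * xyz), mean-centered.
rng = np.random.default_rng(1)
frames = rng.standard_normal((1234, 9))
frames -= frames.mean(axis=0)

acc = BatchedCov(n_dims=9, batch=500)
for x in frames:
    acc.add_frame(x)
acc.flush()  # drain the final partial batch

assert np.allclose(acc.cov, frames.T @ frames)
```

The explicit final `flush()` is what makes this compatible with streaming: nothing about the accumulator needs to know the trajectory length in advance.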

Member


Some manual benchmarking:

import numpy as np
import MDAnalysis as mda; from MDAnalysis.tests import datafiles as data
u = mda.Universe(data.PSF, data.DCD)
x = u.atoms.positions.ravel()

shows that simply replacing the np.dot call on transposed views with np.outer already helps:

>>> %timeit np.dot(x[:, np.newaxis], x[:, np.newaxis].T)
61.4 ms ± 466 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)

>>> %timeit np.outer(x, x)
35.4 ms ± 249 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)

However, the in-place addition update is itself slow:

%%timeit cov = np.zeros((len(x), len(x))); xx = np.outer(x, x)
    cov += xx
97.5 ms ± 3.08 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

so it makes sense to do the addition as part of the matrix multiplication.

Batching is probably the way to go (but that makes streaming more difficult).
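As a toy check of the equivalence the timings above rely on (sizes here are made up for illustration): a per-frame outer-product loop and one GEMM over the stacked frames produce the same accumulator, but the GEMM form fuses the multiply and the accumulation inside optimized BLAS code.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((200, 30))  # 200 frames, 30 coordinates

# Per-frame form: one outer product plus a full-matrix add per frame.
per_frame = np.zeros((30, 30))
for x in X:
    per_frame += np.outer(x, x)

# Fused form: a single GEMM over the whole batch.
fused = X.T @ X

assert np.allclose(per_frame, fused)
```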

Author

@jberg5 jberg5 Mar 31, 2026


Good call about batching! And thanks for the context on realistic problem sizes, this is where my complete inexperience in this field really shows 🥲

I did some parameter tuning and settled on a batch size of 500 frames; this seems to capture most of the original speedup while keeping peak memory below what it used to be.

To benchmark additional frames, I basically just tiled the adk data together (not going for realism here, just size haha). You can see the script I used here: https://gist.github.com/jberg5/ea8fd54d7ed63894527ea812e9aa98ca

Results comparing before (what's on main) with my current branch:

  All-atom (3,341 atoms) on x86 c2d-standard-4:

  ┌──────────────┬─────────┬───────┬─────────┬────────────┬───────────┐
  │    Frames    │ Before  │ After │ Speedup │ Mem Before │ Mem After │
  ├──────────────┼─────────┼───────┼─────────┼────────────┼───────────┤
  │ 4,187 (1x)   │ 849s    │ 78s   │ 10.9x   │ 3,067 MB   │ 1,572 MB  │
  ├──────────────┼─────────┼───────┼─────────┼────────────┼───────────┤
  │ 8,374 (2x)   │ 1,437s  │ 85s   │ 16.9x   │ 3,067 MB   │ 1,572 MB  │
  ├──────────────┼─────────┼───────┼─────────┼────────────┼───────────┤
  │ 20,935 (5x)  │ 3,190s  │ 105s  │ 30.4x   │ 3,068 MB   │ 1,573 MB  │
  ├──────────────┼─────────┼───────┼─────────┼────────────┼───────────┤
  │ 41,870 (10x) │ skipped │ 137s  │ —       │ —          │ 1,574 MB  │
  └──────────────┴─────────┴───────┴─────────┴────────────┴───────────┘

  Backbone (855 atoms):

  ┌──────────────┬────────┬───────┬─────────┬────────────┬───────────┐
  │    Frames    │ Before │ After │ Speedup │ Mem Before │ Mem After │
  ├──────────────┼────────┼───────┼─────────┼────────────┼───────────┤
  │ 4,187 (1x)   │ 32s    │ 2.1s  │ 15.2x   │ 201 MB     │ 110 MB    │
  ├──────────────┼────────┼───────┼─────────┼────────────┼───────────┤
  │ 8,374 (2x)   │ 53s    │ 2.9s  │ 18.3x   │ 201 MB     │ 111 MB    │
  ├──────────────┼────────┼───────┼─────────┼────────────┼───────────┤
  │ 20,935 (5x)  │ 125s   │ 5.2s  │ 24.0x   │ 202 MB     │ 112 MB    │
  ├──────────────┼────────┼───────┼─────────┼────────────┼───────────┤
  │ 41,870 (10x) │ 243s   │ 9.0s  │ 27.0x   │ 204 MB     │ 113 MB    │
  └──────────────┴────────┴───────┴─────────┴────────────┴───────────┘

I'm pretty excited by how the magnitude of the speedup scales with more frames.

Author

@jberg5 jberg5 Mar 31, 2026


And for completeness, here's the story if you restrict the selection to just the Cα (alpha-carbon) atoms:

  ┌──────────────┬──────────┬─────────┬─────────┬────────────┬───────────┐
  │    Frames    │ Baseline │ Patched │ Speedup │ Mem Before │ Mem After │
  ├──────────────┼──────────┼─────────┼─────────┼────────────┼───────────┤
  │ 4,187 (1x)   │ 1.4s     │ 0.4s    │ 3.5x    │ 13 MB      │ 9 MB      │
  ├──────────────┼──────────┼─────────┼─────────┼────────────┼───────────┤
  │ 41,870 (10x) │ 12.6s    │ 3.2s    │ 3.9x    │ 15 MB      │ 11 MB     │
  └──────────────┴──────────┴─────────┴─────────┴────────────┴───────────┘

Not quite as much juice to squeeze here but still an improvement.



Development

Successfully merging this pull request may close these issues.

Performance improvement: PCA could be much faster
