perf: replace jmzml JAXB parser with StAX-based mzML reader#6
perf: replace jmzml JAXB parser with StAX-based mzML reader#6
Conversation
|
Important Review skippedAuto reviews are disabled on base/target branches other than the default branch. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
Why more hits? |
|
What do you mean by more hits? I think when you pass ms2 level filter you manage to search only on the spectra that is ms2; this dataset is a TMT MS3 which means that if you don't pass the threshold you get also some hits in ms3. |
|
I misread the table |
There was a problem hiding this comment.
Pull request overview
This PR replaces the previous jmzML/JAXB-based mzML reading path with a custom StAX streaming parser and wires it through SpectraAccessor as the default mzML backend, alongside broad removal of mzXML-related code paths and dependencies.
Changes:
- Added
StaxMzMLParserplus mzML iterator/map adapters and integrated them intoSpectraAccessorfor mzML access. - Removed jmzML/jmzreader (and related converter/adapter classes) from the build and codebase.
- Removed/disabled mzXML support across CLI/scripts/utilities and tests (format constants, parameter validation, helper tools).
Reviewed changes
Copilot reviewed 71 out of 72 changed files in this pull request and generated 7 comments.
Show a summary per file
| File | Description |
|---|---|
| src/test/java/msgfplus/TestStaxMzMLParser.java | New unit tests validating StAX mzML parsing, indexing, caching, and decoding. |
| src/test/java/msgfplus/TestScoring.java | Switch log suppression from MzMLAdapter to StaxMzMLParser. |
| src/test/java/msgfplus/TestParsers.java | Removes mzXML reading test case. |
| src/test/java/msgfplus/TestMSGFPlus.java | Switch log suppression from MzMLAdapter to StaxMzMLParser. |
| src/main/java/edu/ucsd/msjava/mzml/StaxMzMLParser.java | New StAX mzML parser with index build + preload/cache + binary decoding. |
| src/main/java/edu/ucsd/msjava/mzml/StaxMzMLSpectraMap.java | New SpectrumAccessorBySpecIndex implementation backed by StaxMzMLParser. |
| src/main/java/edu/ucsd/msjava/mzml/StaxMzMLSpectraIterator.java | New sequential iterator wrapper with MS-level filtering + polarity warnings. |
| src/main/java/edu/ucsd/msjava/msutil/SpectraAccessor.java | Wires new StAX mzML backend; removes old jmzML adapter usage. |
| src/main/java/edu/ucsd/msjava/msutil/Spectrum.java | Updates spectrum format inference to no longer map .mzXML to a supported format. |
| src/main/java/edu/ucsd/msjava/msutil/SpecFileFormat.java | Removes MZXML constant and drops it from supported format list. |
| src/main/java/edu/ucsd/msjava/params/ParamManager.java | Stops accepting mzXML in spectrum file parameters. |
| src/main/java/edu/ucsd/msjava/ui/ScoringParamGen.java | Switch log suppression from MzMLAdapter to StaxMzMLParser. |
| src/main/java/edu/ucsd/msjava/ui/MzIDToTsv.java | Switch log suppression from MzMLAdapter to StaxMzMLParser. |
| src/main/java/edu/ucsd/msjava/ui/MSGFPlus.java | Switch log suppression from MzMLAdapter to StaxMzMLParser. |
| src/main/java/edu/ucsd/msjava/mzid/MzIDParser.java | Switch log suppression from MzMLAdapter to StaxMzMLParser. |
| src/main/java/edu/ucsd/msjava/ui/PRMSpecGen.java | Removes mzXML option handling in argument parsing / spectrum reading. |
| src/main/java/edu/ucsd/msjava/ui/MSProfile.java | Removes mzXML option handling in format detection and iteration. |
| src/main/java/edu/ucsd/msjava/ui/MSDictionary.java | Removes mzXML branch when selecting spectrum iterator backend. |
| src/main/java/edu/ucsd/msjava/scripts/SpecFileValidator.java | Removes mzXML branch when selecting spectrum iterator backend. |
| src/main/java/edu/ucsd/msjava/scripts/SelectSpectra.java | Switches from MzXMLSpectraMap to SpectraAccessor (still targets .mzXML). |
| src/main/java/edu/ucsd/msjava/scripts/MergeSpectra.java | Disables mzXML merge entry point; refactors internal (disabled) path to SpectraAccessor. |
| src/main/java/edu/ucsd/msjava/scripts/CountSpectra.java | Replaces mzXML counting implementation with UnsupportedOperationException. |
| src/main/java/edu/ucsd/msjava/misc/PreprocessSpec.java | Removes mzXML option handling and mzXML iterator path. |
| src/main/java/edu/ucsd/msjava/misc/PhosAnalysis.java | Removes mzXML branch and disables mzXML-dependent method body. |
| src/main/java/edu/ucsd/msjava/misc/IPRGStudy.java | Disables mzXML-dependent logic path. |
| src/main/java/edu/ucsd/msjava/misc/HeckWhole.java | Replaces some mzXML access with SpectraAccessor; disables mzXML conversion method. |
| src/main/java/edu/ucsd/msjava/misc/HeckRevision.java | Replaces mzXML iterator/map usage with SpectraAccessor. |
| src/main/java/edu/ucsd/msjava/misc/HeckPercolator.java | Disables mzXML-dependent logic path. |
| src/main/java/edu/ucsd/msjava/misc/HCDCIDETD.java | Disables mzXML-dependent logic paths. |
| src/main/java/edu/ucsd/msjava/misc/ControlNew.java | Disables mzXML-dependent logic path. |
| src/main/java/edu/ucsd/msjava/misc/Clauser.java | Disables mzXML-dependent logic path. |
| src/main/java/edu/ucsd/msjava/misc/Chores.java | Disables mzXML-dependent logic paths. |
| src/main/java/edu/ucsd/msjava/misc/CIDETDPairs.java | Removes mzXML counting branch (mgf-only). |
| src/main/java/edu/ucsd/msjava/misc/AnnotatedSpecGenerator.java | Removes mzXML map branch (mgf-only). |
| src/main/java/edu/ucsd/msjava/ui/MSGFLib.java | Replaces mzXML handling with “no longer supported” message + continue. |
| src/main/java/edu/ucsd/msjava/ui/MSGF.java | Replaces mzXML handling with “no longer supported” message + continue. |
| src/main/java/edu/ucsd/msjava/parser/PSMList.java | Removes mzXML branch when selecting SpectrumAccessorBySpecIndex backend. |
| src/main/java/edu/ucsd/msjava/parser/SortedSpectraIterator.java | Removes mzXML demo main() content. |
| src/main/java/edu/ucsd/msjava/msutil/SpecKey.java | Removes mzXML demo main() content and mzXML iterator import. |
| src/main/java/edu/ucsd/msjava/mzml/MzMLAdapter.java | Deleted (jmzML-based mzML adapter removed). |
| src/main/java/edu/ucsd/msjava/mzml/MzMLSpectraMap.java | Deleted (jmzML-based spectra map removed). |
| src/main/java/edu/ucsd/msjava/mzml/MzMLSpectraIterator.java | Deleted (jmzML-based iterator removed). |
| src/main/java/edu/ucsd/msjava/mzml/SpectrumConverter.java | Deleted (jmzML Spectrum conversion removed). |
| src/main/java/edu/ucsd/msjava/parser/MzXMLSpectraMap.java | Deleted (mzXML random access removed). |
| src/main/java/edu/ucsd/msjava/parser/MzXMLSpectraIterator.java | Deleted (mzXML iterator removed). |
| src/main/java/edu/ucsd/msjava/parser/MzXMLToMgfConverter.java | Deleted (mzXML->mgf conversion removed). |
| src/main/java/edu/ucsd/msjava/misc/MzXMLToMgf.java | Deleted (mzXML->mgf tool removed). |
| src/main/java/edu/ucsd/msjava/msutil/test/MzXMLSpectraMapTest.java | Deleted (mzXML test removed). |
| src/main/java/edu/ucsd/msjava/msutil/test/MzXMLSpectraIteratorTest.java | Deleted (mzXML test removed). |
| src/main/java/edu/ucsd/msjava/scripts/AgilentCyclicSpecPreProcess.java | Replaces mzXML logic with UnsupportedOperationException. |
| src/main/java/org/systemsbiology/jrap/stax/* | Deleted legacy jrap stax mzXML/mzML parsing/indexing infrastructure. |
| pom.xml | Removes jmzml/jmzreader dependencies and updates shading notes accordingly. |
Comments suppressed due to low confidence (2)
src/main/java/edu/ucsd/msjava/msutil/SpectraAccessor.java:95
- SpectraAccessor can be constructed with an unrecognized extension (e.g., now that .mzXML is no longer in SpecFileFormat), which makes specFormat null. getSpecMap()/getSpecItr() then dereference specFormat (e.g., specFormat.getPSIName()) and can throw a NullPointerException. Consider validating specFormat in the constructor and throwing an IllegalArgumentException with a clear message, or providing a safe fallback/error path before using specFormat.
public SpectraAccessor(File specFile) {
this(specFile, SpecFileFormat.getSpecFileFormat(specFile.getName()));
}
/**
* Constructor that accepts a file and a file format
*
* @param specFile
* @param specFormat
*/
public SpectraAccessor(File specFile, SpecFileFormat specFormat) {
this.specFile = specFile;
this.specFormat = specFormat;
this.spectrumParser = null;
}
/**
* Set the MS level range for spectrum filtering (both inclusive).
*
* @param minMSLevel minimum MS level to consider (inclusive).
* @param maxMSLevel maximum MS level to consider (inclusive).
*/
public void setMSLevelRange(int minMSLevel, int maxMSLevel) {
this.minMSLevel = minMSLevel;
this.maxMSLevel = maxMSLevel;
}
public SpectrumAccessorBySpecIndex getSpecMap() {
if (specMap == null) {
if (specFormat == SpecFileFormat.MZML) {
if (staxParser == null) {
try {
staxParser = new StaxMzMLParser(specFile);
} catch (Exception e) {
throw new RuntimeException("Failed to parse mzML file: " + specFile.getAbsolutePath(), e);
}
}
specMap = new StaxMzMLSpectraMap(staxParser, minMSLevel, maxMSLevel);
} else if (specFormat == SpecFileFormat.DTA_TXT)
specMap = new PNNLSpectraMap(specFile.getPath());
else {
SpectrumParser parser = null;
if (specFormat == SpecFileFormat.MGF)
parser = new MgfSpectrumParser();
else if (specFormat == SpecFileFormat.MS2)
parser = new MS2SpectrumParser();
else if (specFormat == SpecFileFormat.PKL)
parser = new PklSpectrumParser();
else
return null;
spectrumParser = parser;
specMap = new SpectraMap(specFile.getPath(), parser);
}
}
if (specMap == null) {
System.out.println("No spectra were found");
System.out.println("File: " + specFile.getAbsolutePath());
System.out.println("Format: " + specFormat.getPSIName());
}
src/main/java/edu/ucsd/msjava/params/ParamManager.java:406
- This change removes mzXML from the accepted spectrum file formats for CLI parameter parsing. The PR description focuses on replacing the mzML parser; if mzXML removal is intended, it should be called out explicitly (and ideally split into a separate PR) since it’s a breaking change for existing workflows.
public void addSpecFileParam(boolean isOptional) {
FileParameter specFileParam = new FileParameter(ParamNameEnum.SPECTRUM_FILE);
if (isOptional) {
specFileParam.setAsOptional();
}
specFileParam.addFileFormat(SpecFileFormat.MZML);
specFileParam.addFileFormat(SpecFileFormat.MGF);
specFileParam.addFileFormat(SpecFileFormat.MS2);
specFileParam.addFileFormat(SpecFileFormat.PKL);
specFileParam.addFileFormat(SpecFileFormat.DTA_TXT);
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| @Override | ||
| public Spectrum next() { | ||
| Spectrum cur = currentSpectrum; | ||
| currentSpectrum = delegate.hasNext() ? delegate.next() : null; | ||
| if (currentSpectrum == null) hasNext = false; | ||
|
|
||
| if (cur.getScanPolarity() == Spectrum.Polarity.NEGATIVE) { | ||
| warnNegativePolarity(cur); | ||
| } | ||
| return cur; |
There was a problem hiding this comment.
StaxMzMLSpectraIterator.next() does not guard against being called when hasNext() is false; currentSpectrum can be null and cur.getScanPolarity() will throw a NullPointerException. Iterator.next() should throw NoSuchElementException when exhausted.
| public Spectrum getSpectrumBySpecIndex(int specIndex) { | ||
| Spectrum cached = cache.get(specIndex); | ||
| if (cached != null) return cached; | ||
|
|
||
| if (!allLoaded) { | ||
| try { | ||
| preloadAllSpectra(); | ||
| } catch (Exception e) { | ||
| System.err.println("Error preloading spectra: " + e.getMessage()); | ||
| return null; | ||
| } | ||
| return cache.get(specIndex); | ||
| } |
There was a problem hiding this comment.
StaxMzMLParser.getSpectrumBySpecIndex() triggers a full-file preload on the first cache miss, even when specIndex is invalid/not present in the index. This can turn a simple “not found” lookup into an expensive full parse. Consider checking indexBySpecIdx.containsKey(specIndex) (or range) up front and returning null immediately for unknown indices.
|
So when the MS2 level is specified, does the number of identifications increase than original codes for the TMT MS3 datasets? Interesting |
|
When the msgf+ search by mistake MS3, then it inflates the PSMs and reduce Ids. This is a big problem we have when we are searching because by default comet is in quantms charge 2; and for msgf+ we have to provide the fragmentation mode. |
Replace the jmzml JAXB-based MzMLUnmarshaller with a lightweight StAX streaming parser that extracts only the 11 fields MSGF+ needs. The new parser builds a spectrum index in a single pass, then preloads all spectra into memory on first random access, eliminating repeated XML parsing during the search phase. Benchmark (TMT 1.1GB mzML, target-decoy, 4 threads): - Wall time: 957s -> 853s (-10.9%) - PSMs at 1% FDR: 6,936 (unchanged) - Score completeness: 100% (unchanged) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…port - Remove jmzml (JAXB-based mzML parser) dependency from pom.xml - Delete old jmzml-dependent classes: MzMLAdapter, MzMLSpectraMap, MzMLSpectraIterator, SpectrumConverter - Add referenceableParamGroupRef resolution to StaxMzMLParser: builds a map of param groups during index pass, resolves refs during spectrum parsing (critical for files that define polarity, MS level, etc. in referenceable groups) - Move turnOffLogs() utility to StaxMzMLParser, update all callers - Keep fastutil dependency (needed by jmzidml at runtime) JAR size reduced from 39.5MB to 38MB. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
jmzReader (uk.ac.ebi.pride.tools:jmzreader:2.0.6) had zero imports anywhere in the codebase — a dead dependency from earlier development. All spectrum file format parsing uses custom implementations: mzML (StaxMzMLParser), mzXML (embedded jrap/stax), MGF/MS2/PKL (custom parsers). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Remove mzXML file format support entirely: - Delete embedded jrap/stax library (20 files, ~5,800 lines) - Delete MzXMLSpectraMap, MzXMLSpectraIterator, MzXMLToMgfConverter - Delete MzXMLToMgf utility and mzXML test resources (38MB) - Remove MZXML from SpecFileFormat enum, SpectraAccessor, ParamManager - Update misc/scripts/ui classes to remove mzXML code paths mzXML is a legacy format superseded by mzML. Users with mzXML files can convert to mzML using msconvert (ProteoWizard). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
- StaxMzMLParser: use ConcurrentHashMap for thread-safe spectrum cache, fix class-level doc (preload-all, not bounded LRU), check index before preloading, propagate exceptions instead of returning null - StaxMzMLSpectraIterator: throw NoSuchElementException when exhausted - SpectraAccessor: throw exception instead of System.exit(-1), validate specFormat is non-null in constructor - SelectSpectra: update stale .mzXML reference to .mzML - pom.xml: fix duplicate <manifest>, remove stale comments, note fastutil is required by jmzidentml at runtime Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Write search results directly to TSV from in-memory objects, bypassing mzIdentML serialization. Output is column-identical to MzIDToTsv (verified by diff on test.mgf search). This avoids generating large .mzid files when only TSV is needed downstream (e.g. OpenMS MSGFPlusAdapter, Percolator). - New DirectTSVWriter class with same score/protein/mod logic as MZIdentMLGen but streaming tab-delimited output - New -outputFormat parameter: 0=mzid (default), 1=tsv, 2=both - Includes fixed + variable mods, MGF Title column, decoy filtering - Backwards compatible: default remains mzid Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When -addFeatures 1 is used with -outputFormat tsv, the TSV now includes all PSMFeatureFinder columns needed for Percolator: ExplainedIonCurrentRatio, NTermIonCurrentRatio, CTermIonCurrentRatio, MS2IonCurrent, MS1IonCurrent, IsolationWindowEfficiency, NumMatchedMainIons, and all error statistics (MeanError/StdevError for All and Top7, both absolute and relative). These features were previously only available as UserParams in mzid and were not extracted by OpenMS's addMSGFFeatures() — now they are directly accessible as TSV columns. The peptide modification format (M+15.995) is already compatible with OpenMS MSGFPlusAdapter's modifySequence_() converter which transforms it to bracket notation M[+15.995] for AASequence. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace ConvertToMgf-based tests (class removed in PR #7) with StaxMzMLParser and SpectraAccessor mzML parsing tests. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
abe1e38 to
3eb2966
Compare
Summary
MzMLUnmarshallerwith a custom StAX streaming parser (StaxMzMLParser)Architecture
Wired into
SpectraAccessor.javaas drop-in replacement forMzMLSpectraMap/MzMLSpectraIterator.Files Changed (5 files)
Source (2 new, 1 modified):
StaxMzMLParser.java— Core StAX parser with index, preload, base64/zlib decodingStaxMzMLSpectraMap.java— SpectrumAccessorBySpecIndex implementationStaxMzMLSpectraIterator.java— Sequential iterator with MS level filteringSpectraAccessor.java— Wire StAX classes, remove jmzml dependency for spectrum accessTests (1 new):
TestStaxMzMLParser.java— 18 tests covering parsing, indexing, caching, iteration, binary decodingBenchmark (TMT 1.1GB mzML, target-decoy, 4 threads)
Test plan
mvn testpasses (155 tests, 0 failures)🤖 Generated with Claude Code