U.S. Department of Energy

Pacific Northwest National Laboratory

MSPathFinder: An Open Source Proteoform Identification and Quantification Tool for Top-Down Proteomics


Top-down proteomics provides valuable information on biologically active intact protein forms (i.e., proteoforms). However, analyzing top-down mass spectrometry (MS) data remains challenging because (1) spectra are complex, owing to the many possible fragments, their potentially large size, and their multiple charge states, and (2) the number of possible proteoforms is large, owing to, e.g., post-translational modifications (PTMs). We present MSPathFinder, a new open source database search tool for top-down proteomics. MSPathFinder addresses these challenges using a new deconvolution algorithm, ProMex, and a graph-based search algorithm based on the sequence graph. MSPathFinder's results can be viewed with our new interactive visualization tool, LcMsSpectator. Using several datasets from complex samples, we show that MSPathFinder and LcMsSpectator greatly facilitate top-down proteomics.



MSPathFinder works like a typical database search tool for bottom-up proteomics: it takes a spectrum file, protein sequences, and modifications as input and reports proteoform spectrum matches (PrSMs) and their scores. It first runs ProMex to identify plausible precursor masses of all MS/MS spectra. Next, for each MS/MS spectrum, it finds the proteoform with the most fragment ion matches by database searching and reports false discovery rates (FDRs) using the target-decoy approach. LcMsSpectator takes the spectrum file, ProMex results, and MSPathFinder results as input and displays LC-MS features with their abundance values, as well as extracted ion chromatograms and spectrum annotations of identified proteoforms. It also allows users to modify mischaracterized proteoforms and export the results.
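The target-decoy FDR step mentioned above can be illustrated with a minimal Python sketch. It is not MSPathFinder's code: the function name `score_cutoff_at_fdr`, the `(score, is_decoy)` input layout, and the simple decoys/targets estimator are assumptions made for illustration only.

```python
from typing import List, Optional, Tuple


def score_cutoff_at_fdr(
    prsms: List[Tuple[float, bool]], max_fdr: float = 0.01
) -> Optional[float]:
    """Return the lowest target score at which the estimated FDR <= max_fdr.

    prsms holds (score, is_decoy) pairs; a higher score means a better match.
    FDR is estimated as (#decoys) / (#targets) among matches scoring at or
    above the candidate cutoff (an assumed, commonly used estimator).
    """
    ranked = sorted(prsms, key=lambda p: p[0], reverse=True)
    targets = 0
    decoys = 0
    cutoff = None
    for score, is_decoy in ranked:
        if is_decoy:
            decoys += 1
            continue
        targets += 1
        # Accept this target score as a cutoff if the running FDR is in range.
        if decoys / targets <= max_fdr:
            cutoff = score
    return cutoff


# Hypothetical scores: three target matches and one decoy match.
example = [(10.0, False), (9.0, False), (8.0, True), (7.0, False)]
strict_cutoff = score_cutoff_at_fdr(example, max_fdr=0.01)
loose_cutoff = score_cutoff_at_fdr(example, max_fdr=0.5)
```

With the strict threshold only the matches ranked above the first decoy are kept; relaxing the threshold admits lower-scoring matches.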


Preliminary Data 

A challenge in top-down proteomics is how to explore a huge search space of proteoforms. Two popular tools, ProSightPC and MS-Align+, address this issue differently. ProSightPC restricts the search space to known proteoforms. While this enables accurate and sensitive characterization, it is only applicable to organisms with well-annotated genomes. In contrast, MS-Align+ allows mass shifts for any protein residue in the database. Although this “blind” search approach is valuable for discovery of unknown PTMs and mutations, it often reports excessive false positives. MSPathFinder fills the gap between ProSightPC and MS-Align+ by considering all proteoforms derived by applying the specified modifications to the input protein sequences. For this to work, it is essential to cope with the exponentially growing number of proteoforms, particularly proteoforms that differ only in the placement of PTMs, which drastically increase the search space. MSPathFinder addresses this problem efficiently using the sequence graph, which simultaneously scores all proteoforms derived from the same sequence.
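To see why sharing work across proteoforms that differ only in PTM placement matters, the following Python sketch compares the number of distinct proteoforms with the number of nodes in a simplified graph. This is an illustration under stated assumptions, not MSPathFinder's sequence graph: here a node is assumed to be a pair (prefix length, combination of modifications applied so far), and `sequence_graph_stats`, `mods_by_residue`, and `max_mods` are hypothetical names.

```python
from collections import defaultdict
from typing import Dict, Sequence, Tuple


def sequence_graph_stats(
    sequence: str, mods_by_residue: Dict[str, Sequence[str]], max_mods: int
) -> Tuple[int, int]:
    """Return (number of proteoforms, number of graph nodes).

    Each layer maps a modification combination (sorted tuple of names) to the
    number of distinct placements (paths) that lead to it. Placements that
    yield the same combination at the same prefix share one node.
    """
    layer = {(): 1}  # empty prefix, no modifications
    node_count = 1
    for residue in sequence:
        next_layer = defaultdict(int)
        for combo, n_paths in layer.items():
            # Option 1: leave this residue unmodified.
            next_layer[combo] += n_paths
            # Option 2: apply one allowed modification, if the budget permits.
            if len(combo) < max_mods:
                for mod in mods_by_residue.get(residue, ()):
                    new_combo = tuple(sorted(combo + (mod,)))
                    next_layer[new_combo] += n_paths
        layer = dict(next_layer)
        node_count += len(layer)
    return sum(layer.values()), node_count


# Hypothetical input: 20 methionines, each optionally oxidized, up to 4 PTMs.
proteoforms, nodes = sequence_graph_stats("M" * 20, {"M": ["Oxidation"]}, 4)
```

In this toy case the proteoform count grows combinatorially with the number of modifiable sites, while the node count grows only linearly with sequence length for a fixed modification budget, which is the kind of saving a shared graph representation is meant to provide.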


For benchmarking, we used 4 datasets from Shewanella oneidensis cell lysates (103,142 spectra, denoted Shewanella) and 10 datasets from human ovarian cancer tissues (68,711 spectra, denoted Human). We first ran MSPathFinder and MS-Align+ and compared the number of identified PrSMs at 1% FDR. For MSPathFinder, acetylation of the protein N-terminus, oxidation of Met, and dehydration of Cys were considered as dynamic modifications. For MS-Align+, up to 2 blind modifications were allowed per sequence, and obvious false identifications were filtered out using an in-house script. For both the Shewanella and Human datasets, MSPathFinder identified significantly more proteoforms than MS-Align+. The difference was much larger for proteoforms with masses over 25,000 Da, where MS-Align+ missed most proteoforms identified by MSPathFinder. Furthermore, unlike with MS-Align+, a majority of the proteoforms identified by MSPathFinder did not require further manual validation. In terms of running time, MSPathFinder was 3 to 6 times faster than MS-Align+.


Novel Aspect 

A new proteoform identification and quantification tool greatly facilitates top-down proteomics data analysis.
