U.S. Department of Energy

Pacific Northwest National Laboratory

Flexible Library Assisted SearcH (FLASH): a hybrid library/database search engine

Introduction

Library searches are an efficient method for identifying peptide/spectrum matches (PSMs), and their restricted search space gives them better sensitivity than database search algorithms. One traditional drawback of library searches is the perceived incompleteness of libraries; however, as more data become available in public repositories, this is becoming less of a problem. Unfortunately, larger libraries also increase the computation time of library searches. A final challenge is the calculation of false-discovery rates (FDR) and related confidence metrics. Although several methods have been proposed, none has gained widespread use. We propose a hybrid library/database search engine that leverages the prior knowledge contained in a library and the strong computational methods for FDR calculation in database searches.

Method

Libraries are built from mzID and mzML files, using highly confident PSM identifications (q<0.0001) across numerous organisms and experimental designs. Each tandem mass spectrum is limited to its 20 most abundant peaks to improve the specificity of matching to query spectra. We use a novel set overlap algorithm, called the Blazing Signature Filter (BSF), to identify shared peaks between query and library spectra. This search does not need to use precursor m/z and can therefore identify spectra from related peptides in addition to spectra from identical peptides. Candidate PSMs are then scored with the generating function from MSGF+ to obtain a spectrum E-value.
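The peak reduction and shared-peak comparison described above can be illustrated with a minimal Python sketch. It is not the FLASH or BSF implementation: the function names, the (m/z, intensity) tuple layout, and the 1.0 m/z bin width are assumptions chosen for illustration, and plain set intersection stands in for the actual overlap algorithm.

```python
from typing import Iterable, List, Set, Tuple

Peak = Tuple[float, float]  # (m/z, intensity); this layout is an assumption


def top_n_peaks(peaks: Iterable[Peak], n: int = 20) -> List[Peak]:
    """Keep the n most abundant peaks, returned in ascending m/z order."""
    strongest = sorted(peaks, key=lambda p: p[1], reverse=True)[:n]
    return sorted(strongest, key=lambda p: p[0])


def peak_bins(peaks: Iterable[Peak], bin_width: float = 1.0) -> Set[int]:
    """Map each peak to an integer m/z bin (bin_width is a hypothetical tolerance)."""
    return {int(mz / bin_width) for mz, _ in peaks}


def shared_peak_count(query: Iterable[Peak], library: Iterable[Peak],
                      n: int = 20, bin_width: float = 1.0) -> int:
    """Count m/z bins present in both spectra after top-n peak reduction."""
    query_bins = peak_bins(top_n_peaks(query, n), bin_width)
    library_bins = peak_bins(top_n_peaks(library, n), bin_width)
    return len(query_bins & library_bins)


query = [(100.2, 10.0), (200.5, 50.0), (300.7, 5.0)]
library = [(100.6, 3.0), (200.1, 9.0), (450.0, 7.0)]
print(shared_peak_count(query, library))
```

Because the comparison uses fragment peaks only, nothing in this sketch consults the precursor m/z, which mirrors the property that lets the search match spectra from related peptides.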

Preliminary Data

As the number of publicly accessible mass spectra grows, coverage of the proteome continues to increase. Public datasets now exceed 200 TB and contain billions of spectra from hundreds of organisms. This deepening coverage of observable peptides increases the success of library search algorithms for peptide identification. Using these data efficiently for library searches requires rapid algorithms to compare query spectra against the ever-expanding library.
We devised the FLASH algorithm to quickly identify candidate PSMs based on the overlap of fragment ions in MS/MS spectra. We use a novel set overlap algorithm (BSF) that is O(log n) and therefore scales well as data sizes increase. Matching on fragment ions increases the stringency of matching and therefore reduces the number of candidate PSMs, which is essential when libraries potentially contain millions of peptides. Using a low m/z cutoff, we show that 99.9% of all non-matches between query and library spectra (i.e., false positives) have zero or one shared peak, so the shared-peak count is an effective filter for candidate PSMs.
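As a rough illustration of filtering on shared-peak count, the sketch below packs each spectrum's peak bins into a bit set and keeps library entries that share at least two peaks with the query. The threshold of two follows from the observation that nearly all non-matches share zero or one peak. Everything else is assumed: the names are hypothetical, the bit-set encoding is one plausible representation rather than the BSF data structure, and the linear scan over the library does not reproduce the O(log n) behavior reported for BSF.

```python
from typing import Dict, Iterable, List


def signature(bins: Iterable[int]) -> int:
    """Pack integer peak bins into a single Python int used as a bit set."""
    sig = 0
    for b in bins:
        sig |= 1 << b
    return sig


def shared_peaks(sig_a: int, sig_b: int) -> int:
    """Count the bits (peak bins) set in both signatures."""
    return bin(sig_a & sig_b).count("1")


def candidate_matches(query_sig: int, library: Dict[str, int],
                      min_shared: int = 2) -> List[str]:
    """Return library entries sharing at least min_shared peaks with the query.

    This is a plain linear scan for illustration only.
    """
    return [name for name, sig in library.items()
            if shared_peaks(query_sig, sig) >= min_shared]


# Hypothetical library entries keyed by made-up identifiers.
library_sigs = {
    "entry_a": signature([100, 200, 300, 999]),
    "entry_b": signature([100, 555]),
    "entry_c": signature([1, 2, 3]),
}
query_sig = signature([100, 200, 300, 400])
print(candidate_matches(query_sig, library_sigs))
```

Only the entries that survive this step would be passed on for the more expensive scoring, which is the point of using the shared-peak count as a pre-filter.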
Candidate PSMs that pass the BSF filter are sent to an inference engine that performs additional quality checks between the query and library spectra. Candidates are then sent to MSGF+ to determine the spectrum E-value, a computationally robust metric for data quality. Importantly, this method does not depend on database size or decoy hits and is a statistically rigorous measurement of probability. By using both library search (as a filter) and database search (for spectrum probability scoring), we are able to search vast databases on computationally reasonable time scales and identify peptides from complex multi-organism samples.

Novel Aspect

Hybrid library/database search engine; expandable biodiversity spectral library
