U.S. Department of Energy

Pacific Northwest National Laboratory

A Flexible Learning Infrastructure for Proteomics

Introduction 

Developing a scoring model for MS/MS sequence-spectrum matches has typically been treated as part of the software development process, and the model is rarely a user-serviceable component. Models are usually created from scratch and hard-coded for each project or fragmentation method, which hinders adaptation and requires significant source code modifications to process new data types. In addition, advances such as increased resolving power and new dissociation methods produce new types of spectra in which different fragment ions (e.g., those arising from neutral and side-chain losses) can be considered to improve scoring. Here we introduce the Flexible Learning Infrastructure for Proteomics (FLIP). As in machine learning, FLIP treats training as an ongoing process, allowing rapid customization over time.

Methods 

FLIP accepts as input raw MS/MS data, true-positive sequence matches, and true-negative sequence matches, with spectra and identifications both in community-standard data formats. A classifier weights fragment ion features, such as mass error, isotopic fit, and intensity, so that they best separate the true-positive and true-negative training data. By default, FLIP supports both logistic regression and support vector machine models through the Accord.NET machine learning framework. Cross-validation-based feature reduction is used to select relevant fragment ions. The trained model is written to a tab-separated file, which serves as input for MS/MS scoring in a search engine. FLIP also produces figures such as score histograms and ROC curves for evaluating the performance of the trained model.
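
The training step described above can be illustrated with a minimal sketch. It is not FLIP's code: FLIP uses Accord.NET, whereas this example fits a plain gradient-descent logistic regression in Python's standard library. The feature names, the toy values, and the two-column layout of the tab-separated output are all assumptions made for illustration.

```python
import csv
import io
import math

# Hypothetical feature names; the abstract lists mass error, isotopic fit, and intensity.
FEATURES = ["mass_error_ppm", "isotopic_fit", "log_intensity"]


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict(x, weights, bias):
    """Probability that a feature vector belongs to the true-positive class."""
    return sigmoid(bias + sum(w * v for w, v in zip(weights, x)))


def train_logistic(rows, labels, epochs=2000, lr=0.05):
    """Batch gradient descent on the mean logistic loss."""
    weights = [0.0] * len(rows[0])
    bias = 0.0
    n = len(rows)
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(rows, labels):
            err = predict(x, weights, bias) - y
            for j, v in enumerate(x):
                grad_w[j] += err * v
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def model_to_tsv(weights, bias):
    """Serialize the model as tab-separated text (assumed layout)."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerow(["feature", "weight"])
    writer.writerow(["(intercept)", f"{bias:.4f}"])
    for name, w in zip(FEATURES, weights):
        writer.writerow([name, f"{w:.4f}"])
    return buf.getvalue()


# Toy training data (invented): one row per fragment ion match.
true_pos = [[1.0, 0.95, 5.0], [2.0, 0.90, 4.5], [0.5, 0.97, 5.5], [1.5, 0.92, 4.8]]
true_neg = [[8.0, 0.40, 2.0], [9.5, 0.55, 2.5], [7.0, 0.30, 1.5], [6.5, 0.50, 3.0]]

weights, bias = train_logistic(true_pos + true_neg, [1] * 4 + [0] * 4)
print(model_to_tsv(weights, bias))
```

A search engine could then read the tab-separated weights and score a candidate match as the weighted sum of its fragment ion features.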

Preliminary Data 

The FLIP framework uses a modular design built on the dependency injection software design pattern. It is divided into five customizable core modules: parsing, which reads raw spectra and identifications; pre-processing, which performs deconvolution and spectrum filtering; modelling, which selects features from the data for training; learning, which runs a machine learning classifier; and cross-validation. This design allows users to substitute any of these modules with their own code without recompiling the FLIP framework, and provides flexibility for new implementations such as support for their own data formats, custom training features, learning models, or cross-validation metrics.
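
As a rough illustration of the dependency injection idea, the sketch below wires five interchangeable stages into one pipeline object and then swaps a single stage. All names here (`Pipeline`, `parse_lines`, `drop_weak_peaks`, and so on) are hypothetical, the stages are deliberately trivial, and the `validate` stage scores the training data rather than performing real cross-validation. FLIP itself is a .NET framework, so this Python version only mirrors the pattern, not the actual interfaces.

```python
from dataclasses import dataclass, replace
from typing import Callable, List, Tuple

Peak = Tuple[float, float]   # (m/z, intensity)
Spectrum = List[Peak]
Model = List[float]          # one weight per feature


@dataclass(frozen=True)
class Pipeline:
    """Five injected stages, mirroring the module list in the text."""
    parse: Callable[[str], List[Spectrum]]
    preprocess: Callable[[Spectrum], Spectrum]
    extract: Callable[[Spectrum], List[float]]
    learn: Callable[[List[List[float]], List[int]], Model]
    validate: Callable[[Model, List[List[float]], List[int]], float]

    def run(self, text: str, labels: List[int]) -> Tuple[Model, float]:
        spectra = [self.preprocess(s) for s in self.parse(text)]
        features = [self.extract(s) for s in spectra]
        model = self.learn(features, labels)
        return model, self.validate(model, features, labels)


def parse_lines(text):
    # Toy format: one spectrum per line, peaks written as "mz:intensity".
    return [[tuple(map(float, p.split(":"))) for p in line.split()]
            for line in text.strip().splitlines()]


def drop_weak_peaks(spectrum, floor=10.0):
    # Stand-in for spectrum filtering: discard peaks below an intensity floor.
    return [(mz, i) for mz, i in spectrum if i >= floor]


def peak_count_and_total(spectrum):
    return [float(len(spectrum)), sum(i for _, i in spectrum)]


def class_mean_difference(features, labels):
    # Toy learner: weight = mean(positives) - mean(negatives) per feature.
    pos = [f for f, y in zip(features, labels) if y == 1]
    neg = [f for f, y in zip(features, labels) if y == 0]
    return [sum(p) / len(pos) - sum(n) / len(neg)
            for p, n in zip(zip(*pos), zip(*neg))]


def accuracy(model, features, labels):
    # Stand-in metric: fraction of rows on the correct side of the mean score.
    scores = [sum(w * v for w, v in zip(model, f)) for f in features]
    cut = sum(scores) / len(scores)
    return sum((s > cut) == bool(y) for s, y in zip(scores, labels)) / len(labels)


default = Pipeline(parse_lines, drop_weak_peaks, peak_count_and_total,
                   class_mean_difference, accuracy)

text = """
100.0:50 200.0:80 300.0:5
150.0:60 250.0:70 350.0:40
110.0:4 210.0:12
120.0:3 220.0:6
"""
labels = [1, 1, 0, 0]
model, score = default.run(text, labels)

# Swap only the feature-extraction stage; the other four stages are untouched.
custom = replace(default, extract=lambda s: [float(len(s))])
custom_model, custom_score = custom.run(text, labels)
```

Because each stage is passed in rather than constructed inside the pipeline, replacing one stage needs no change to the pipeline class itself, which is the property the modular design relies on.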

 

FLIP does not require the user to manually determine which types of fragment ions to train on. It starts with a very large set of candidate fragment ions and performs multiple rounds of 10-fold cross-validation, with each round reducing the number of fragment ions used for training. The area under the ROC curve is calculated in each round to determine the optimal set of fragment ions.
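
The reduction loop described above can be sketched as follows. Several details are assumptions, since the abstract does not specify them: the toy learner (difference in match rates), the rule of dropping the ion type with the smallest absolute weight each round, the candidate ion list, and the simulated data. Only the overall shape (repeated 10-fold cross-validation, one AUC per round, best round wins) follows the text.

```python
import random

ION_TYPES = ["b", "y", "a", "b-H2O", "y-NH3", "c"]  # illustrative candidates only


def auc(pos_scores, neg_scores):
    """Area under the ROC curve, computed as the Mann-Whitney pair statistic."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))


def fit_weights(pos, neg, ions):
    """Toy learner: match rate in positives minus match rate in negatives."""
    return {i: sum(r[i] for r in pos) / len(pos) - sum(r[i] for r in neg) / len(neg)
            for i in ions}


def score(row, weights):
    return sum(w * row[i] for i, w in weights.items())


def cross_validated_auc(pos, neg, ions, k=10):
    """Mean held-out AUC over k folds; each fold keeps both classes."""
    aucs = []
    for fold in range(k):
        test_pos, test_neg = pos[fold::k], neg[fold::k]
        train_pos = [r for j, r in enumerate(pos) if j % k != fold]
        train_neg = [r for j, r in enumerate(neg) if j % k != fold]
        w = fit_weights(train_pos, train_neg, ions)
        aucs.append(auc([score(r, w) for r in test_pos],
                        [score(r, w) for r in test_neg]))
    return sum(aucs) / k


def reduce_features(pos, neg, ions, k=10):
    """One cross-validation round per ion-set size; returns best round and history."""
    history = []
    ions = list(ions)
    while ions:
        history.append((cross_validated_auc(pos, neg, ions, k), tuple(ions)))
        w = fit_weights(pos, neg, ions)
        # Assumed elimination rule: drop the least informative ion type.
        ions.remove(min(ions, key=lambda i: abs(w[i])))
    return max(history, key=lambda h: h[0]), history


# Simulated matches: each row records which ion types were observed (1) or not (0).
rng = random.Random(7)


def simulate(rates, n=40):
    return [{i: int(rng.random() < rates[i]) for i in ION_TYPES} for _ in range(n)]


pos = simulate({i: 0.8 if i in ("b", "y") else 0.3 for i in ION_TYPES})
neg = simulate({i: 0.3 for i in ION_TYPES})

(best_auc, best_ions), history = reduce_features(pos, neg, ION_TYPES)
print(best_ions, round(best_auc, 3))
```

Evaluating AUC on held-out folds rather than on the training data is what keeps the selected ion set from simply growing with the number of candidates.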

 

Because our goal is to make this framework flexible across various types of data, we used it to train scoring models on spectra of peptides and intact proteins acquired with both higher-energy collisional dissociation (HCD) and ultraviolet photodissociation (UVPD). We acquired bottom-up and top-down MS/MS spectra on a Thermo Q Exactive for six bacterial organisms, which yielded over 100,000 unique peptides for training and testing the scoring algorithms. The scoring models were evaluated on a HeLa lysate sample using our database search tool, MSPathFinder. We observed that FLIP was able to define effective models with significant differences in the numbers and types of fragment ions selected for each of these data types, increasing the number of confident identifications (<1% false discovery rate) by at least 14% compared to a peak-counting scoring model.

 

Novel aspects 

FLIP is a modular, user-serviceable proteomics framework for developing and optimizing MS/MS scoring models to process different fragmentation spectra.
