U.S. Department of Energy

Pacific Northwest National Laboratory



Computational biology research depends on publicly available data for training and testing new algorithms. Researchers at the Pacific Northwest National Laboratory have participated in hundreds of collaborative projects involving mass spectrometry-based proteomic analysis of more than 300 species or distinct environmental communities. We announce the deposition of proteomics data from 112 microbial organisms representing 15 phyla into public third-party repositories (Table 1). Instrument data and analysis files are available at ProteomeXchange via the MassIVE partner repository under the identifiers PXD001860 and MSV000079053. All the data have been analyzed and organized in a uniform manner to make analysis and reuse easier. The combined deposit totals 13 TB, comprising 35,162 mass spectrometry files and their associated analysis files. In total, the library contains >70 million spectra identified at q < 0.0001, with 3 million peptides from 230,000 proteins.

As part of the analysis, we cross-referenced protein identifications to KEGG functional annotation where possible. When the Library is viewed as a whole, annotated biological pathways are broadly covered by the identified proteins. For example, the reference ‘cysteine and methionine metabolism’ pathway as defined by KEGG consists of 81 orthologous genes participating in 73 reactions. As expected, not all orthologs are annotated in every genome; Cellulomonas flavigena, for example, has only 23 of the 81 genes. By searching all MS/MS data against standard RefSeq databases, we find that 21 of the 23 Cellulomonas genes (91%) were observed in the MS/MS data. Across all organisms in the Library, the median coverage of the cysteine and methionine metabolism pathway is 89%. A summary of the coverage of every KEGG pathway for each organism is presented in Supplemental Table 1. Using KEGG pathway categories, we determined the median coverage of all functionally classified proteins (Figure). For example, across all 13 pathways for amino acid metabolism, the median coverage over the entire Library is 89%. Similarly high coverage is seen for most KEGG pathway categories, including 82% for lipid metabolism and 83% for vitamin and cofactor metabolism.
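The coverage figures above are simple ratios: the number of annotated pathway genes observed by MS/MS divided by the number annotated in that genome, with a median then taken across organisms. A minimal sketch of that arithmetic follows. The function names and gene identifiers are hypothetical placeholders rather than part of the deposited analysis, and skipping genomes with no annotation for a pathway is an assumption.

```python
from statistics import median


def pathway_coverage(annotated_genes, observed_genes):
    """Fraction of a genome's annotated pathway genes that were observed by MS/MS.

    Returns None when the pathway has no annotated genes in this genome.
    """
    annotated = set(annotated_genes)
    if not annotated:
        return None
    return len(annotated & set(observed_genes)) / len(annotated)


def median_coverage(coverages):
    """Median coverage across organisms, skipping genomes that lack the pathway."""
    values = [c for c in coverages if c is not None]
    return median(values) if values else None


# Placeholder identifiers standing in for the 23 annotated Cellulomonas genes,
# 21 of which were observed.
annotated = [f"gene_{i}" for i in range(23)]
observed = annotated[:21]
coverage = pathway_coverage(annotated, observed)
print(f"{coverage:.1%}")  # 21/23, reported as 91% in the text
```

Applying `median_coverage` to the per-organism values for one pathway gives the kind of Library-wide summary reported here.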


To maximize utility and ease of access, the data described in this publication have been uploaded to ProteomeXchange with accession PXD001860 via MassIVE. On MassIVE (identifier MSV000079053), each organism’s data are located in a separate folder containing both raw and processed data. Each dataset is associated with a file describing the peptides that were identified from the spectra. This file was created using the MSGF+ algorithm, version v9979. Searches included oxidized methionine as an optional post-translational modification and specified partial trypsin specificity. For experiments that used iodoacetamide as an alkylating agent, the static modification (C+57) was also added. Precursor and fragment mass tolerances were set according to the resolving power of the mass analyzer. The output of MSGF+ is stored in the community standard mzIdentML format, which describes the peptide/spectrum match (PSM), search parameters and scoring details.
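The modification settings described above can be summarized as a small piece of selection logic: methionine oxidation is always an optional modification, and the static cysteine modification is added only for iodoacetamide-treated samples. The sketch below is illustrative only. The `Modification` type and `search_modifications` function are hypothetical and do not reproduce the MSGF+ configuration format; the mass shifts are the standard monoisotopic values for oxidation and carbamidomethylation.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Modification:
    name: str
    residue: str
    mass_shift: float  # monoisotopic mass shift in Da
    fixed: bool        # True = static, False = optional (variable)


def search_modifications(alkylated_with_iodoacetamide: bool) -> List[Modification]:
    """Build the modification list for one experiment (hypothetical helper)."""
    # Oxidized methionine is searched as an optional modification in every case.
    mods = [Modification("Oxidation", "M", 15.994915, fixed=False)]
    if alkylated_with_iodoacetamide:
        # Static C+57 only when iodoacetamide was the alkylating agent.
        mods.append(Modification("Carbamidomethyl", "C", 57.021464, fixed=True))
    return mods


for mod in search_modifications(alkylated_with_iodoacetamide=True):
    kind = "static" if mod.fixed else "optional"
    print(f"{mod.residue}+{mod.mass_shift:.3f} ({mod.name}, {kind})")
```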

We created a spectrum library for each microbial organism using BiblioSpec. Peptide/spectrum matches were filtered to retain only high-quality matches (MSGF+ q-value < 0.0001). In aggregate, the 112 organisms had 70,455,991 spectra passing this cutoff (with 1951 false hits and an estimated FDR of 2e-5). This strict filtering is necessary to control false positives when creating very large libraries. The libraries, stored as .blib files, are also available on the MassIVE repository.
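The filtering step amounts to keeping PSMs below a q-value cutoff and then estimating the error rate among those that remain. The sketch below uses toy records to show the idea. The `PSM` fields and helper names are hypothetical, and treating "false hits" as decoy matches divided by the number of passing PSMs is an assumption about how such an estimate is formed, not a description of the pipeline used for the Library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PSM:
    spectrum_id: str
    peptide: str
    q_value: float
    is_decoy: bool


def filter_psms(psms, q_cutoff=1e-4):
    """Keep only PSMs whose q-value is strictly below the cutoff."""
    return [p for p in psms if p.q_value < q_cutoff]


def estimated_fdr(passing):
    """Fraction of passing PSMs that are decoy hits (assumed estimator)."""
    if not passing:
        return 0.0
    decoys = sum(p.is_decoy for p in passing)
    return decoys / len(passing)


# Toy data, not taken from the deposited files.
psms = [
    PSM("s1", "PEPTIDEK", 1e-6, False),
    PSM("s2", "SAMPLER", 5e-5, False),
    PSM("s3", "DECOYK", 9e-5, True),
    PSM("s4", "LOWQK", 1e-3, False),
]
passing = filter_psms(psms)
print(len(passing), estimated_fdr(passing))
```

With tens of millions of spectra, even a small per-match error rate adds up to many false entries, which is why a cutoff this strict is applied before building the libraries.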

Sam Payne