U.S. Department of Energy

Pacific Northwest National Laboratory

Prokaryotic Proteogenomics

Genome annotations are often insufficient: they consist mostly of predictions based on DNA sequence rather than experimental evidence. Proteogenomics is the use of experimental proteomics data to re-evaluate protein sequence predictions. The peptides identified in mass spectrometry experiments are compared to the public genome annotation, and any novel coding region is noted.

For ease of access, proteomics results are provided in GFF format with peptides mapped onto their DNA coordinates. Files can be downloaded and displayed in a genome viewer, e.g. NCBI Genome Workbench, Artemis, or CLC Genomics Workbench.
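For readers who prefer to inspect the files programmatically rather than in a viewer, the following is a minimal sketch of reading the standard nine tab-separated GFF columns in Python. The names Feature and parse_gff, and the two example lines, are made up for illustration; the exact feature types and attributes used in the project files are not described here and should be checked against a downloaded file.

```python
from typing import Iterable, Iterator, NamedTuple


class Feature(NamedTuple):
    """One GFF record (hypothetical helper type, not part of the project)."""
    seqid: str
    source: str
    type: str
    start: int   # 1-based, inclusive nucleotide coordinate
    end: int     # 1-based, inclusive nucleotide coordinate
    strand: str
    attributes: str


def parse_gff(lines: Iterable[str]) -> Iterator[Feature]:
    """Yield a Feature for every data line; skip comments and blank lines."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 9:
            continue  # not a well-formed GFF data line
        seqid, source, ftype, start, end, _score, strand, _phase, attrs = cols
        yield Feature(seqid, source, ftype, int(start), int(end), strand, attrs)


# Made-up input, only to show the shape of the data.
example_lines = [
    "##gff-version 3",
    "example_chromosome\tproteomics\tpeptide\t100\t138\t.\t+\t.\tID=peptide_1",
]
features = list(parse_gff(example_lines))
```

With a real file, the same function can be given an open file handle, since iterating over a file yields its lines.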


Proteogenomics methods are explained in detail in the publication Proteogenomic Analysis of Bacteria and Archaea (alternatively, download from PLoS One). Briefly, tandem mass spectra were searched with Inspect [PMID:16013882] against a translation of the genome and subsequently rescored with PepNovo [PMID:15858974] and MSGF [PMID:18597511]. We downloaded genomic DNA from RefSeq and translated all six frames to generate a protein database. Each stop-to-stop open reading frame (ORF) was included regardless of coding potential. We added decoy records by shuffling each ORF and concatenating the shuffled sequences to the database. Significant peptide/spectrum matches (PSMs) were those with a p-value of 1e-10 or better, which led to a peptide-level false discovery rate of ~0.3%. All confident peptides were mapped onto their genomic location (nucleotide coordinates) and grouped into sets within an ORF. We then applied five ORF filters. First, we removed low-complexity peptides and peptides more than 750 bp from the next in-frame peptide. Next, we removed ORFs that lack a uniquely mapping peptide or a fully tryptic peptide. Finally, we required two peptides per protein.
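The database construction step can be illustrated with a short Python sketch: translate all six frames, split each translation at stop codons to obtain stop-to-stop ORFs, and create one shuffled decoy per ORF. This is not the code used in the study. The function names, the min_len parameter, and the fixed random seed are assumptions made for the example, and segments at the ends of the sequence are bounded by the sequence end rather than by a stop codon.

```python
import random
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna: str) -> str:
    """Translate complete codons; unknown codons (e.g. containing N) become X."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frame_orfs(dna: str, min_len: int = 1):
    """Return (strand, frame, protein) for every stop-to-stop segment."""
    dna = dna.upper()
    orfs = []
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            for segment in translate(seq[frame:]).split("*"):
                if len(segment) >= min_len:
                    orfs.append((strand, frame, segment))
    return orfs


def shuffled_decoy(protein: str, rng: random.Random) -> str:
    """Decoy with the same residue composition in random order."""
    residues = list(protein)
    rng.shuffle(residues)
    return "".join(residues)


# Toy input, not real genomic sequence.
rng = random.Random(0)
targets = six_frame_orfs("ATGAAATAGGGC", min_len=2)
database = [p for _, _, p in targets] + [shuffled_decoy(p, rng) for _, _, p in targets]
```

Shuffling preserves the length and amino acid composition of each ORF, so decoy matches give an estimate of how often a random sequence of similar makeup would score above the threshold.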

Peptides found in the GFF files are from ORFs that meet all of the above criteria.
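As a rough illustration of how the five criteria combine, here is a Python sketch. Only the 750 bp distance and the two-peptide minimum come from the text above. Everything else is an assumption made for the example: the Peptide fields, the low-complexity rule (one residue making up more than half of the peptide), the simplified definition of fully tryptic (preceded by K or R or by the ORF start, and ending in K or R), and the reading of the distance rule as "some other peptide in the same ORF lies within 750 bp".

```python
from collections import Counter
from dataclasses import dataclass
from typing import List

MAX_GAP_BP = 750  # distance threshold stated in the text


@dataclass(frozen=True)
class Peptide:
    """Hypothetical record for one confident peptide mapped to an ORF."""
    sequence: str
    start: int          # nucleotide start, 1-based inclusive
    end: int            # nucleotide end, 1-based inclusive
    prev_residue: str   # residue before the peptide in the ORF, "" if none
    n_locations: int    # number of genomic locations the sequence maps to


def is_low_complexity(seq: str, max_fraction: float = 0.5) -> bool:
    # Assumed rule: a single residue dominates the peptide.
    return Counter(seq).most_common(1)[0][1] / len(seq) > max_fraction


def is_fully_tryptic(p: Peptide) -> bool:
    # Simplified: ignores the proline rule and protein C-terminal peptides.
    return p.prev_residue in ("", "K", "R") and p.sequence[-1] in "KR"


def gap_bp(a: Peptide, b: Peptide) -> int:
    """Nucleotides between two peptides; 0 if they overlap or touch."""
    return max(0, max(a.start, b.start) - min(a.end, b.end) - 1)


def filter_orf(peptides: List[Peptide]) -> List[Peptide]:
    """Return the surviving peptides, or [] if the ORF fails any filter."""
    kept = [p for p in peptides if not is_low_complexity(p.sequence)]
    kept = [p for p in kept
            if any(o is not p and gap_bp(p, o) <= MAX_GAP_BP for o in kept)]
    passes = (len({p.sequence for p in kept}) >= 2          # two peptides
              and any(p.n_locations == 1 for p in kept)     # unique mapping
              and any(is_fully_tryptic(p) for p in kept))   # fully tryptic
    return kept if passes else []
```

The peptide-level filters run first so that the ORF-level checks only count peptides that survived them; the original pipeline may order or define these steps differently.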


Peptides mapping to regions of the genome that lack a protein annotation represent either novel genes or 5' extensions of current genes. For example, at 3.376 Mb in the B. anthracis Sterne genome lies the BAS3403 gene, which encodes the small spore protein Tlp. Many peptides from this protein were discovered in our dataset. In the open reading frame directly upstream we found other peptides that are not part of any currently annotated protein. A BLAST search of this ORF reveals homology to another spore coat protein from B. cereus; the ORF itself is not currently annotated in any anthrax genome.
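Finding such peptides amounts to an interval comparison between the mapped peptides and the annotated genes. The Python sketch below flags peptides that do not overlap any annotated gene on the same strand. The function names and all coordinates are invented for the example and do not correspond to the B. anthracis locus described above; deciding whether a flagged peptide indicates a novel gene or a 5' extension would additionally require checking its frame and position relative to the neighboring gene.

```python
from typing import List, Tuple

# (start, end, strand) with 1-based inclusive nucleotide coordinates.
Interval = Tuple[int, int, str]


def overlaps(a: Interval, b: Interval) -> bool:
    """True if both intervals are on the same strand and share a base."""
    return a[2] == b[2] and a[0] <= b[1] and b[0] <= a[1]


def unannotated_peptides(peptides: List[Interval],
                         genes: List[Interval]) -> List[Interval]:
    """Peptides that fall outside every annotated gene on their strand."""
    return [p for p in peptides if not any(overlaps(p, g) for g in genes)]


# Invented coordinates: one annotated gene and three mapped peptides.
genes = [(1000, 1300, "+")]
peptides = [
    (1050, 1080, "+"),  # inside the annotated gene
    (900, 930, "+"),    # upstream of the gene, same strand
    (1100, 1130, "-"),  # same region, opposite strand
]
candidates = unannotated_peptides(peptides, genes)
```

The linear scan is adequate for a sketch; for a whole genome, sorting the gene intervals and using a binary search or an interval tree would scale better.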



There are multiple levels of data and analysis for this project. We have uploaded to PeptideAtlas the two that are easy to download and interpret.

  • Mapped peptides, in GFF format
  • Peptide/spectrum results

See the Repository Link below for the data files. The raw MS/MS instrument data are also available on request (the files are very large).

Samuel Payne
Pacific Northwest National Laboratory