U.S. Department of Energy

Pacific Northwest National Laboratory

MS-GF+

MS-GF+ (aka MSGF+ or MSGFPlus) performs peptide identification by scoring MS/MS spectra against peptides derived from a protein sequence database. It supports the HUPO PSI standard input file (mzML) and saves results in the mzIdentML format, though results can easily be transformed to TSV. ProteomeXchange supports Complete data submissions using MS-GF+ search results.

Description: 

MS-GF+ identifies peptides in LC-MS/MS datasets. It is optimized for a variety of spectral types, i.e., combinations of fragmentation method, instrument, enzyme, and experimental protocol. It supports a variety of input file formats, including mzML, mzXML, Mascot Generic File (mgf), MS2 files, Micromass Peak List files (pkl), and Concatenated DTA files (_dta.txt).

Version: 

v2017.01.13, released January 13, 2017

Software Instructions: 

MSGF+ Syntax

Usage: java -Xmx3500M -jar MSGFPlus.jar
	-s SpectrumFile (*.mzML, *.mzXML, *.mgf, *.ms2, *.pkl or *_dta.txt)
	   Spectra should be centroided. Profile spectra will be ignored.
	-d DatabaseFile (*.fasta or *.fa)
	[-o OutputFile (*.mzid)] (Default: [SpectrumFileName].mzid)
	[-t PrecursorMassTolerance] (e.g. 2.5Da, 20ppm or 0.5Da,2.5Da, Default: 20ppm)
	   Use comma to set asymmetric values. E.g. "-t 0.5Da,2.5Da" will set 0.5Da to the minus (expMass<theoMass) and 2.5Da to plus (expMass>theoMass)
	[-ti IsotopeErrorRange] (Range of allowed isotope peak errors, Default:0,1)
	   Takes into account the error introduced by choosing a non-monoisotopic peak for fragmentation.
	   The combination of -t and -ti determines the precursor mass tolerance.
	   E.g. "-t 20ppm -ti -1,2" tests abs(exp-calc-n*1.00335Da)<20ppm for n=-1, 0, 1, 2.
	[-thread NumThreads] (Number of concurrent threads to be executed, Default: Number of available cores)
	[-tda 0/1] (0: don't search decoy database (Default), 1: search decoy database)
	[-m FragmentMethodID] (0: As written in the spectrum or CID if no info (Default), 1: CID, 2: ETD, 3: HCD)
	[-inst MS2DetectorID] (0: Low-res LCQ/LTQ (Default), 1: Orbitrap/FTICR, 2: TOF, 3: Q-Exactive)
	[-e EnzymeID] (0: unspecific cleavage, 1: Trypsin (Default), 2: Chymotrypsin, 3: Lys-C, 4: Lys-N, 5: glutamyl endopeptidase, 6: Arg-C, 7: Asp-N, 8: alphaLP, 9: no cleavage)
	[-protocol ProtocolID] (0: Automatic (Default), 1: Phosphorylation, 2: iTRAQ, 3: iTRAQPhospho, 4: TMT, 5: Standard)
	[-ntt 0/1/2] (Number of Tolerable Termini, Default: 2)
	   E.g. For trypsin, 0: non-tryptic, 1: semi-tryptic, 2: fully-tryptic peptides only.
	[-mod ModificationFileName] (Modification file, Default: standard amino acids with fixed C+57)
	[-minLength MinPepLength] (Minimum peptide length to consider, Default: 6)
	[-maxLength MaxPepLength] (Maximum peptide length to consider, Default: 40)
	[-minCharge MinCharge] (Minimum precursor charge to consider if charges are not specified in the spectrum file, Default: 2)
	[-maxCharge MaxCharge] (Maximum precursor charge to consider if charges are not specified in the spectrum file, Default: 3)
	[-n NumMatchesPerSpec] (Number of matches per spectrum to be reported, Default: 1)
	[-addFeatures 0/1] (0: output basic scores only (Default), 1: output additional features)
	[-ccm ChargeCarrierMass] (Mass of charge carrier, Default: mass of proton (1.00727649))
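The interaction of -t and -ti described in the usage text can be illustrated with a small Python sketch. It is not MS-GF+ code: the function name is hypothetical, and converting the ppm tolerance to Da relative to the theoretical mass is an assumption made for illustration.

```python
ISOTOPE_SPACING_DA = 1.00335  # isotope spacing quoted in the usage text


def within_precursor_tolerance(exp_mass, calc_mass, tol_ppm=20.0, isotope_range=(0, 1)):
    """Return True if exp_mass matches calc_mass for some isotope error n.

    Mirrors the documented test abs(exp - calc - n*1.00335Da) < tolerance
    for every n in the inclusive isotope error range.
    """
    lowest, highest = isotope_range
    for n in range(lowest, highest + 1):
        error_da = exp_mass - calc_mass - n * ISOTOPE_SPACING_DA
        # Assumption: the ppm tolerance is taken relative to the theoretical mass.
        if abs(error_da) < tol_ppm * 1e-6 * calc_mass:
            return True
    return False
```

For example, with "-t 20ppm -ti -1,2" a precursor measured two isotope peaks above a 2000 Da theoretical mass still passes, whereas with the default range 0,1 it does not.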

Examples

High-precision:

java -Xmx3500M -jar MSGFPlus.jar -s test.mzXML -d IPI_human_3.79.fasta -t 20ppm -ti -1,2 -ntt 2 -tda 1 -o testMSGFPlus.mzid -mod Mods.txt

Low-precision

java -Xmx3500M -jar MSGFPlus.jar -s test.mzXML -d IPI_human_3.79.fasta -t 0.5Da,2.5Da -ntt 2 -tda 1 -o testMSGFPlus.mzid -mod Mods.txt
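When searches are launched from a script, the command line can be assembled programmatically. The following Python sketch rebuilds the high-precision example above; the helper name build_msgfplus_command is hypothetical, and it only uses flags listed in the usage text.

```python
def build_msgfplus_command(spectrum_file, database_file, *, jar="MSGFPlus.jar",
                           max_heap="3500M", **options):
    """Build an MS-GF+ argument list; each keyword becomes '-keyword value'."""
    cmd = ["java", f"-Xmx{max_heap}", "-jar", jar,
           "-s", spectrum_file, "-d", database_file]
    for flag, value in options.items():
        cmd += [f"-{flag}", str(value)]
    return cmd


cmd = build_msgfplus_command("test.mzXML", "IPI_human_3.79.fasta",
                             t="20ppm", ti="-1,2", ntt=2, tda=1,
                             o="testMSGFPlus.mzid", mod="Mods.txt")
print(" ".join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True), which avoids shell quoting problems for paths that contain spaces.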

MSGF+ Parameters

  • -s SpectrumFile (*.mzML, *.mzXML, *.mgf, *.ms2, *.pkl or *_dta.txt) - Required
    • Spectrum file name. Currently, MS-GF+ supports the following file formats: mzML, mzXML, mgf, ms2, pkl and _dta.txt.
    • We recommend using mzML whenever possible.
  • -d DatabaseFile (*.fasta or *.fa) - Required
    • Path to the protein database file. If the database file does not have auxiliary index files (*.canno, *.cnlcp, *.csarr, and *.cseq), MS-GF+ will create them.
    • When "-tda 1" option is used, the database specified here must contain only target protein sequences.
    • If multiple MS-GF+ processes access the same database file, it is strongly recommended to index the database prior to the database search by running BuildSA (see below).
  • -o OutputFile (*.mzid)
    • Filename where the output (mzIdentML 1.1 format) will be written.
    • File extension must be "mzid" (case sensitive).
    • By default, the output file name will be "[SpectrumFileName].mzid".
    • E.g. for the input spectrum file "test.mzML", the output will be written to "test.mzid" if this parameter is not specified.
  • -t ParentMassTolerance (Default: 20ppm)
    • Parent mass tolerance in Da or ppm. There must be no space between the number and the unit, e.g. 2.5Da or 20ppm.
    • To set asymmetric tolerances, use a comma to separate the left (experimental mass < theoretical mass) and right (experimental mass > theoretical mass) tolerances, e.g. 0.5Da,2.5Da.
    • It is recommended to use a tight tolerance rather than a loose tolerance (e.g. for Orbitrap data, 10 or 20ppm usually identifies more spectra than 50ppm).
  • -ti IsotopeErrorRange (Default: 0,1)
    • Takes into account the error introduced by choosing a non-monoisotopic peak for fragmentation.
    • If the parent mass tolerance is equal to or larger than 0.5Da or 500ppm, this parameter will be ignored.
    • The combination of -t and -ti determines the precursor mass tolerance.
    • E.g. "-t 20ppm -ti -1,2" tests abs(exp-calc-n*1.00335Da)<20ppm for n=-1, 0, 1, 2.
  • -thread NumOfThreads (Number of concurrent threads to be executed, Default: Number of available cores)
    • Number of concurrent threads to be executed together.
    • Default value is the number of available logical cores (e.g. 8 for quad-core processor with hyper-threading support).
  • -tda 0/1 (0: don't search decoy database (default), 1: search decoy database to compute FDR)
    • Indicates whether to search the decoy database or not.
    • If 0, the decoy database is not searched.
    • If 1, FDRs are computed based on the target-decoy approach (i.e. reversed database is appended to the target database and MS-GF+ searches the combined database)
      • FDR(t) = #(DecoyPSMs with score equal or above t) / #(TargetPSMs with score equal or above t).
      • PSM: Peptide-Spectrum Match
      • -log(SpecProb) is used as the score to compute FDR.
    • If -tda 1 is specified, MS-GF+ automatically creates a combined target/reversed database file (DBFileName.revConcat.fasta). Thus, when specifying "-d" parameter, DatabaseFile must contain only target proteins.
  • -m FragmentationMethodID (0: as written in the spectrum or CID if no info (Default), 1: CID, 2: ETD, 3: HCD, 4: Merge spectra from the same precursor)
    • Fragmentation method identifier (used to determine the scoring model).
    • If the identifier is 0 and fragmentation method is written in the spectrum file (e.g. mzML files), MS-GF+ will recognize the fragmentation method and use a relevant scoring model.
    • If the identifier is 0 and there is no fragmentation method information in the spectrum (e.g. mgf files), CID model will be used by default.
    • If the identifier is non-zero and the spectrum has fragmentation method information, only the spectra that match with the identifier will be processed.
    • If the identifier is non-zero and the spectrum has no fragmentation method information, MS-GF+ will process all spectra assuming the specified fragmentation method.
    • If the identifier is 4, MS/MS spectra from the same precursor ion (e.g. CID/ETD pairs, CID/HCD/ETD triplets) will be merged and the "merged" spectrum will be used for searching instead of individual spectra. See Kim et al., MCP 2010 for details.
  • -inst InstrumentID (0: Low-res LCQ/LTQ (Default for CID and ETD), 1: High-res LTQ (Default for HCD), 2: TOF, 3: Q-Exactive)
    • Identifier of the instrument to generate MS/MS spectra (used to determine the scoring model).
    • For "hybrid" spectra with high-precision MS1 and low-precision MS2, use 0.
    • For usual low-precision instruments (e.g. Thermo LTQ), use 0.
    • If MS/MS fragment ion peaks are of high-precision (e.g. tolerance = 10ppm), use 2.
    • For TOF instruments, use 2.
    • For Q-Exactive HCD spectra, use 3.
    • For other HCD spectra, use 1.
  • -e EnzymeID (Default: 1)
    • Enzyme identifier. Trypsin (1) will be used by default.
    • 0: unspecific cleavage, 1: Trypsin (default), 2: Chymotrypsin, 3: Lys-C, 4: Lys-N, 5: glutamyl endopeptidase (Glu-C), 6: Arg-C, 7: Asp-N, 8: alphaLP, 9: no cleavage
    • Use 9 for peptidomics studies
  • -protocol ProtocolID (Default: 0)
    • Protocol identifier. Protocols are used to enable scoring parameters for enriched and/or labeled samples.
    • 0: No protocol (Default)
    • 1: Phosphorylation: for phosphopeptide enriched samples
    • 2: iTRAQ: for iTRAQ-labeled samples
    • 3: iTRAQPhospho: for phosphopeptide enriched and iTRAQ-labeled samples
  • -ntt 0/1/2 (Number of tolerable (tryptic) termini, Default: 2)
    • This parameter is used to apply the enzyme cleavage specificity rule when searching the database.
    • Specifies the minimum number of termini matching the enzyme specificity rule.
      • For example, for trypsin, K.ACDEFGHR.C (NTT=2), G.ACDEFGHR.C (NTT=1), K.ACDEFGHI.C (NTT=1) and G.ACDEFGHI.C (NTT=0).
      • '-ntt 2' will search for fully tryptic peptides only.
    • By default, -ntt 2 will be used. Using -ntt 1 (or 0) will make the search significantly slower.
  • -mod ModificationFile (Default: standard amino acids with fixed C+57)
    • Modification file name. ModificationFile contains the modifications to be considered in the search.
    • If the -mod option is not specified, standard amino acids with fixed carbamidomethylation of C (C+57) will be used.
    • Download an example modification file.
  • -minLength MinPepLength (Default: 6)
    • Minimum length of the peptide to be considered.
  • -maxLength MaxPepLength (Default: 40)
    • Maximum length of the peptide to be considered.
  • -minCharge MinPrecursorCharge (Default: 2)
    • Minimum precursor charge to consider. This parameter is used only for spectra with no charge.
  • -maxCharge MaxPrecursorCharge (Default: 3)
    • Maximum precursor charge to consider. This parameter is used only for spectra with no charge.
  • -n NumMatchesPerSpec (Default: 1)
    • Number of peptide matches per spectrum to report.
    • Expected false discovery rates (EFDRs) will be reported only when this value is 1.
  • -addFeatures 0/1
    • If 0, only basic scores are reported.
    • If 1, the following features are reported
      • MS2IonCurrent: Summed intensity of all product ions
      • ExplainedIonCurrentRatio: Summed intensity of all matched product ions (e.g. b, b-H2O, y, etc.) divided by MS2IonCurrent
      • NTermIonCurrentRatio: Summed intensity of all matched prefix ions (e.g. b, b-H2O, etc.) divided by MS2IonCurrent
      • CTermIonCurrentRatio: Summed intensity of all matched suffix ions (e.g. y, y-H2O, etc.) divided by MS2IonCurrent
  • -showQValue 0/1
    • If 0, QValue and PepQValue are not reported.
    • If 1, QValue (PSM-level Q-value) and PepQValue (peptide-level Q-value) are reported (Default).
    • This parameter is ignored when "-tda 0".
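The -ntt rule described above can be illustrated with a short Python sketch that counts tryptic termini for a peptide written as Pre.PEPTIDE.Post. It is a simplification, not the MS-GF+ implementation: it assumes trypsin cleaves after K or R with no further exceptions, unmodified peptide strings, and "-" as a hypothetical marker for a protein terminus.

```python
TRYPSIN_RESIDUES = ("K", "R")


def count_tryptic_termini(annotated_peptide):
    """Count termini (0, 1 or 2) that follow the trypsin rule.

    annotated_peptide uses the Pre.PEPTIDE.Post notation, e.g. 'K.ACDEFGHR.C'.
    """
    pre, peptide, post = annotated_peptide.split(".")
    # N-terminus: the preceding residue must be K or R ('-' marker is an assumption).
    n_term_ok = pre in TRYPSIN_RESIDUES or pre == "-"
    # C-terminus: the last peptide residue must be K or R ('-' marker is an assumption).
    c_term_ok = peptide[-1] in TRYPSIN_RESIDUES or post == "-"
    return int(n_term_ok) + int(c_term_ok)


def passes_ntt(annotated_peptide, ntt):
    """True if the peptide has at least ntt termini matching the rule."""
    return count_tryptic_termini(annotated_peptide) >= ntt
```

With -ntt 2 only peptides such as K.ACDEFGHR.C pass; -ntt 1 also admits peptides with a single matching terminus, which enlarges the search space and explains the slower search.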

MS-GF+ output

MS-GF+ outputs results as an mzIdentML (version 1.1) file. See http://www.psidev.info/mzidentml/ for details on the mzIdentML format. For every PSM, MS-GF+ reports the following scores:

  • MS-GF:RawScore: MS-GF+ raw score of the peptide-spectrum match
  • MS-GF:DeNovoScore: the score of the optimal scoring peptide for the spectrum (not necessarily in the database) (MS-GF:RawScore <= MS-GF:DeNovoScore)
  • MS-GF:SpecEValue: spectral E-value (spectrum level E-value) of the peptide-spectrum match - the lower the better
  • MS-GF:EValue: database level E-value (expected number of peptides in a random database having equal or better scores than the PSM score) - the lower the better
  • MS-GF:QValue
    • PSM-level Q-value estimated using the target-decoy approach.
    • MS-GF:QValue is computed solely based on MS-GF:SpecEValue.
  • MS-GF:PepQValue
    • Peptide-level Q-value estimated using the target-decoy approach.
    • Reported only if "-tda 1" is specified.
    • If multiple spectra are matched to the same peptide, only the best scoring PSM (lowest SpecProb) is retained. After that, MS-GF:PepQValue is calculated as #DecoyPSMs>s / #TargetPSMs>s among the retained PSMs. This approximates the Q-value of the set of unique peptides. In the MS-GF+ output, the same PepQValue value is given to all PSMs sharing the peptide. So, even a low-quality PSM may get a low PepQValue (if it has a high-quality "sibling" PSM sharing the peptide). Note that this should not be used to count the number of identified PSMs.
  • MS-GF+ output (*.mzid) can be converted into the tsv format using MzIDToTsv.
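The target-decoy FDR formula given for "-tda 1" and the peptide-level filtering described for MS-GF:PepQValue can be sketched in Python as follows. This is an illustration under stated assumptions rather than the MS-GF+ implementation: scores are taken to be -log(SpecProb) so that higher is better, "equal or above" is used for both levels, and returning 0.0 when no target passes the threshold is a convention chosen here.

```python
def fdr_at_threshold(psms, t):
    """FDR(t) = #decoy PSMs with score >= t / #target PSMs with score >= t.

    psms is a list of (score, is_decoy) tuples.
    """
    decoys = sum(1 for score, is_decoy in psms if is_decoy and score >= t)
    targets = sum(1 for score, is_decoy in psms if not is_decoy and score >= t)
    # Convention chosen for this sketch: report 0.0 when no target passes.
    return decoys / targets if targets else 0.0


def best_psm_per_peptide(psms):
    """Keep only the best-scoring PSM for each peptide.

    psms is a list of (peptide, score, is_decoy) tuples; the result is a
    list of (score, is_decoy) tuples usable with fdr_at_threshold.
    """
    best = {}
    for peptide, score, is_decoy in psms:
        if peptide not in best or score > best[peptide][0]:
            best[peptide] = (score, is_decoy)
    return list(best.values())
```

Calling fdr_at_threshold on all PSMs gives a PSM-level estimate, while calling it on the output of best_psm_per_peptide gives the peptide-level estimate, which is why a low-quality PSM can inherit a low PepQValue from a high-quality sibling.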

MzIDToTsv

Converts MS-GF+ output (.mzid) into the tsv format (.tsv).  The MzidToTsvConverter.exe (option 1) can convert the .mzid file faster than option 2 (Java based).  In addition, MzidToTsvConverter.exe uses less memory.

Option 1: MzidToTsvConverter.exe

Usage: MzidToTsvConverter -mzid:"mzid path" [-tsv:"tsv output path"] [-unroll|-u] [-showDecoy|-sd]

MzidToTsvConverter Parameters

  • -mzid:path
    • Path to the .mzid or .mzid.gz file.  If the path has spaces, it must be in quotes.
  • -tsv:path
    • Path to the tsv file to be written. If not specified, will be created in the same location as the .mzid file.
  • -unroll or -u
    • Signifies that results should be unrolled: one line per unique peptide/protein combination in each spectrum identification
  • -showDecoy or -sd
    • Signifies that decoy results should be included in the output .tsv file.
    • Decoy results have protein names that start with XXX_

Option 2: MzIDToTsv Java module

Usage: java -Xmx3500M -cp MSGFPlus.jar edu.ucsd.msjava.ui.MzIDToTsv
    -i MzIDFile (MS-GF+ output file (*.mzid))
    [-o TSVFile] (TSV output file (*.tsv) (Default: MzIDFileName.tsv))
    [-showQValue 0/1] (0: do not show Q-values, 1: show Q-values (Default))
    [-showDecoy 0/1] (0: do not show decoy PSMs (Default), 1: show decoy PSMs)
    [-unroll 0/1] (0: merge shared peptides (Default), 1: unroll shared peptides)

MzIDToTsv Parameters

  • -i MzIDFile
    • Path to the MS-GF+ result file (*.mzid)
  • -o TSVFile
    • Path to the tsv output file (*.tsv)
    • If not specified, for input MyFile.mzid, the output will be MyFile.tsv.
  • -showQValue 0/1
    • If 0, QValue and PepQValue are not reported.
    • If 1, QValue and PepQValue are reported (Default).
  • -showDecoy 0/1
    • If 0, decoy PSMs will not be reported (Default).
    • If 1, decoy PSMs will be reported.
  • -unroll 0/1
    • This parameter controls the output format for shared peptides (peptides matched to multiple proteins).
    • When "-unroll 0" (Default), a PSM matched to a shared peptide will be printed as a single line.
      • Peptide column does not show neighboring amino acids (e.g. IGAYLFVDMAHVAGLIAAGVYPNPVPHAHVVTSTTHK).
      • Protein column shows all proteins in a single line.
      • Example: MyProtein(pre=K,post=T);MyProteinIsoform(pre=K,post=T)
      • Download example file
    • When "-unroll 1", a PSM matched to a shared peptide will be printed in multiple lines.
      • Peptide column shows neighboring amino acids (e.g. K.IGAYLFVDMAHVAGLIAAGVYPNPVPHAHVVTSTTHK.T).
      • Different peptide-protein matches are printed in different lines.
      • Download example file
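The relationship between the two -unroll formats can be sketched in Python by expanding a "-unroll 0" Protein column into the per-protein rows that "-unroll 1" would print. The function name is hypothetical, and the parser assumes that entries are separated by ";" and follow the Name(pre=X,post=Y) pattern shown in the example above; protein names containing ";" are not handled.

```python
import re

# One protein entry of the merged Protein column, e.g. 'MyProtein(pre=K,post=T)'.
_PROTEIN_ENTRY = re.compile(r"^(?P<name>.+)\(pre=(?P<pre>[^,]*),post=(?P<post>[^)]*)\)$")


def unroll_shared_peptide(peptide, protein_column):
    """Return (protein, 'Pre.PEPTIDE.Post') pairs, one per protein entry."""
    rows = []
    for entry in protein_column.split(";"):
        match = _PROTEIN_ENTRY.match(entry.strip())
        if match is None:
            raise ValueError(f"Unrecognized protein entry: {entry!r}")
        annotated = f"{match['pre']}.{peptide}.{match['post']}"
        rows.append((match["name"], annotated))
    return rows
```

For the example above, the two proteins yield two rows that share the annotated peptide K.IGAYLFVDMAHVAGLIAAGVYPNPVPHAHVVTSTTHK.T.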

BuildSA

Index a protein database for fast searching.

Usage: java -Xmx3500M -cp MSGFPlus.jar edu.ucsd.msjava.msdbsearch.BuildSA 
	-d DatabaseFile (*.fasta or *.fa)
	[-tda 0/1/2] (0: target only, 1: target-decoy database only, 2: both)

BuildSA Parameters

  • -d DbPath
    • Name of a protein database (*.fasta or *.fa)
    • Database file must end with ".fasta" or ".fa".
  • -tda 0/1/2
    • If 0, only "DatabaseFile" will be indexed.
    • If 1, a new database file (*.revConcat.fasta) will be generated by appending reversed proteins. This forward-reverse database will be indexed.
    • If 2, both the original database and the forward-reverse database file will be indexed.

BuildSA creates a suffix array of the protein database. For an input database file DBFileName.fasta, BuildSA will generate 4 auxiliary files (DBFileName.canno, DBFileName.cnlcp, DBFileName.csarr, DBFileName.cseq). It needs to be executed only once per database file.
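Before starting several MS-GF+ processes against the same database, a script can check whether the auxiliary files already exist. The Python sketch below derives the four expected file names from the database path; the function names are hypothetical, and it assumes the index files sit next to the FASTA file with only the extension changed, as the naming above suggests.

```python
from pathlib import Path

# Extensions of the auxiliary index files listed above.
INDEX_EXTENSIONS = (".canno", ".cnlcp", ".csarr", ".cseq")


def expected_index_files(database_file):
    """Return the paths of the four auxiliary files for a FASTA database."""
    db = Path(database_file)
    # with_suffix replaces only the final extension (.fasta or .fa).
    return [db.with_suffix(ext) for ext in INDEX_EXTENSIONS]


def missing_index_files(database_file):
    """Return the expected auxiliary files that do not exist on disk."""
    return [path for path in expected_index_files(database_file) if not path.exists()]
```

If missing_index_files returns a non-empty list, running BuildSA once before the parallel searches avoids several processes trying to build the index at the same time.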

Publications

Source code

Acknowledgment

All publications that utilize this software should provide appropriate acknowledgement to PNNL and the OMICS.PNL.GOV website. However, if the software is extended or modified, then any subsequent publications should include a more extensive statement, as shown in the Readme file for the given application or on the website that more fully describes the application.

 

Disclaimer

These programs are primarily designed to run on Windows machines. Please use them at your own risk. This material was prepared as an account of work sponsored by an agency of the United States Government. Neither the United States Government nor the United States Department of Energy, nor Battelle, nor any of their employees, makes any warranty, express or implied, or assumes any legal liability or responsibility for the accuracy, completeness, or usefulness of any information, apparatus, product, or process disclosed, or represents that its use would not infringe privately owned rights.

Portions of this research were supported by the NIH National Center for Research Resources (Grant RR018522), the W.R. Wiley Environmental Molecular Science Laboratory (a national scientific user facility sponsored by the U.S. Department of Energy's Office of Biological and Environmental Research and located at PNNL), and the National Institute of Allergy and Infectious Diseases (NIH/DHHS through interagency agreement Y1-AI-4894-01). PNNL is operated by Battelle Memorial Institute for the U.S. Department of Energy under contract DE-AC05-76RL01830.

We would like your feedback about the usefulness of the tools and information provided by the Resource. Your suggestions on how to increase their value to you will be appreciated. Please e-mail any comments to proteomics@pnl.gov
