crux percolator
Usage:
crux percolator [options] <search results>
Description:
Percolator is a semi-supervised learning algorithm that dynamically learns to separate target from decoy peptide-spectrum matches (PSMs). The algorithm is described in this article:
Lukas Käll, Jesse Canterbury, Jason Weston, William Stafford Noble and Michael J. MacCoss. "Semi-supervised learning for peptide identification from shotgun proteomics datasets." Nature Methods. 4(11):923-925, 2007.
Percolator requires as input two collections of PSMs, one set derived from matching observed spectra against real ("target") peptides, and a second derived from matching the same spectra against "decoy" peptides. The output consists of ranked lists of PSMs, peptides and proteins. Peptides and proteins are assigned two types of statistical confidence estimates: q-values and posterior error probabilities.
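To illustrate what a q-value means in a target-decoy setting, here is a minimal Python sketch. It is a simplified illustration, not Percolator's actual estimator: it assumes equally sized target and decoy collections, does not estimate pi_0, and the function name `target_decoy_qvalues` is hypothetical.

```python
def target_decoy_qvalues(scored_psms):
    """Return (score, is_decoy, q_value) triples, best score first.

    scored_psms: iterable of (score, is_decoy) pairs, where a higher
    score means a better match. Simplified illustration only.
    """
    ranked = sorted(scored_psms, key=lambda p: p[0], reverse=True)

    # Walk down the ranked list and estimate the FDR at each threshold
    # as (decoys accepted so far) / (targets accepted so far).
    fdrs = []
    targets = decoys = 0
    for _, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        fdrs.append(min(1.0, decoys / max(targets, 1)))

    # The q-value of a PSM is the smallest FDR at which it is still
    # accepted, i.e. the running minimum taken from the bottom up.
    q_values = []
    running_min = 1.0
    for fdr in reversed(fdrs):
        running_min = min(running_min, fdr)
        q_values.append(running_min)
    q_values.reverse()

    return [(score, is_decoy, q)
            for (score, is_decoy), q in zip(ranked, q_values)]


example = [(5.0, False), (4.0, False), (3.0, True), (2.0, False), (1.0, True)]
for score, is_decoy, q in target_decoy_qvalues(example):
    print(score, "decoy" if is_decoy else "target", round(q, 3))
```

A posterior error probability, by contrast, describes the chance that one specific match is incorrect, and is not covered by this sketch.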
The features used by Percolator to represent each PSM are summarized here.
Percolator also includes code from Fido, which performs protein-level inference. The Fido algorithm is described in this article:
Oliver Serang, Michael J. MacCoss and William Stafford Noble. "Efficient marginalization to compute protein posterior probabilities from shotgun mass spectrometry data." Journal of Proteome Research. 9(10):5346-5357, 2010.
Crux includes code from Percolator. Crux Percolator differs from the stand-alone version of Percolator in the following respects:
- In addition to the native Percolator XML file format, Crux Percolator supports additional input file formats (SQT, PepXML, tab-delimited text) and output file formats (PepXML, mzIdentML, tab-delimited text).
- To maintain consistency with the rest of the Crux commands, Crux Percolator uses different parameter syntax than the stand-alone version of Percolator.
- Like the rest of the Crux commands, Crux Percolator writes its files to an output directory, logs all standard error messages to a log file, and is capable of reading parameters from a parameter file.
Input:
<search results> – A collection of target and decoy peptide-spectrum matches (PSMs). Input may be in one of six formats: pin.xml, SQT, PepXML, Crux tab-delimited text, a list of files (when list-of-files=T), or a tab-delimited table of features (when feature-in-file=T); see below for details. Note that if the input is provided as SQT, PepXML or Crux tab-delimited text, then a pin.xml file will be generated in the Percolator output directory prior to execution.
Decoy PSMs can be provided to Percolator in two ways: either as a separate file or embedded within the same file as the target PSMs. Percolator will first search for decoy PSMs in a separate file. The decoy file name is constructed from the target file name by replacing "target" with "decoy". For example, if search.target.txt is provided as input, then Percolator will search for a corresponding file named search.decoy.txt. If no decoy file is found, then Percolator will assume that the given input file contains a mix of target and decoy PSMs. Within this file, decoys are identified using a prefix (specified via --decoy-prefix) on the protein name.
Output:
Percolator produces the following files in the crux-output directory:
- percolator.target.pout.xml: an XML file containing all of the Percolator results, defined according to this schema.
- percolator.target.proteins.txt: a tab-delimited file containing the target protein matches. See here for a list of the fields.
- percolator.decoy.proteins.txt: a tab-delimited file containing the decoy protein matches. See here for a list of the fields.
- percolator.target.peptides.txt: a tab-delimited file containing the target peptide matches. See here for a list of the fields.
- percolator.decoy.peptides.txt: a tab-delimited file containing the decoy peptide matches. See here for a list of the fields.
- percolator.target.psms.txt: a tab-delimited file containing the target PSMs. See here for a list of the fields.
- percolator.decoy.psms.txt: a tab-delimited file containing the decoy PSMs. See here for a list of the fields.
- percolator.params.txt: a file containing the name and value of all parameters for the current operation. Not all parameters in the file may have been used in the operation. The resulting file can be used with the --parameter-file option for other crux programs.
- percolator.pep.xml: a file containing the PSMs in pepXML format. This file can be used as input to some of the tools in the Trans-Proteomic Pipeline.
- percolator.mzid: a file containing the protein, peptide, and spectrum matches in mzIdentML format.
- percolator.log.txt: a log file containing a copy of all messages that were printed to standard error.
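The tab-delimited output files listed above can be read with any standard tab-separated parser. The following Python sketch filters rows by a q-value threshold. The column names used here ("scan", "q-value", "sequence") are placeholders chosen for illustration, not the actual Crux field names; consult the field list for each file to find the real q-value column. The function name `rows_at_or_below` is likewise hypothetical.

```python
import csv
import io


def rows_at_or_below(handle, q_column, threshold=0.01):
    """Return rows of a tab-delimited results file whose value in
    q_column is at or below the threshold."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [row for row in reader if float(row[q_column]) <= threshold]


# Tiny in-memory stand-in for a results file.
# The column names are hypothetical, not the real Crux headers.
example = io.StringIO(
    "scan\tq-value\tsequence\n"
    "101\t0.002\tPEPTIDEK\n"
    "102\t0.150\tSAMPLER\n"
)
accepted = rows_at_or_below(example, q_column="q-value")
print([row["scan"] for row in accepted])
```

For a real run, the same function could be given a handle opened on a file such as crux-output/percolator.target.psms.txt (opened with newline=""), together with that file's actual q-value column name.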
Options:
Percolator options:
--c-pos <float> – Penalty for mistakes made on positive examples. If this value is not specified, then it is set via cross validation over the values {0.1, 1, 10}, selecting the value that yields the largest number of PSMs identified at the q-value threshold set via the --test-fdr parameter.
--c-neg <float> – Penalty for mistakes made on negative examples. This parameter requires that --c-pos is set explicitly; otherwise, --c-neg will have no effect. If not specified, then this value is set by cross validation over {0.1, 1, 10}.
--train-fdr <float> – False discovery rate threshold to define positive examples in training. Default = 0.01.
--test-fdr <float> – False discovery rate threshold used in selecting hyperparameters during internal cross-validation and for reporting the final results. Default = 0.01.
--maxiter <int> – Maximum number of iterations for training. Default = 10.
--train-ratio <float> – Fraction of the negative data set to be used as the training set when only one negative set is provided. The remaining examples will be used as the test set. Default = 0.6.
--default-direction <int> – In its initial round of training, Percolator uses one feature to induce a ranking of PSMs. By default, Percolator will select the feature that produces the largest set of target PSMs at a specified FDR threshold (cf. --train-fdr). This option allows the user to specify which feature is used for the initial ranking, using the name as a string from this table. The name can be preceded by a hyphen (e.g., "-XCorr") to indicate that a lower value is better.
--unitnorm T|F – Use unit normalization (i.e., linearly rescale each PSM's feature vector to have a Euclidean length of 1), instead of standard deviation normalization. Default = F.
--test-each-iteration T|F – Measure performance on the test set at each iteration. Default = F.
--static-override T|F – By default, Percolator will examine the learned weights for each feature, and if the weights appear to be problematic, then Percolator will discard the learned weights and instead employ a previously trained, static score vector. This switch allows this error checking to be overridden. Default = F.
--seed <int> – Set the seed of the random number generator. Default = 1.
--klammer T|F – Use retention time features calculated as in "Improving tandem mass spectrum identification using peptide retention time prediction across diverse chromatography conditions" by Klammer AA, Yi X, MacCoss MJ and Noble WS. (Analytical Chemistry. 2007 Aug 15;79(16):6111-8.)
--list-of-files T|F – Specify that the search results are provided as lists of files, rather than as individual files. Default = F.
--only-psms T|F – Do not remove redundant peptides; keep all PSMs and exclude peptide-level probabilities.
--top-match <int> – Specify the maximum number of matches to consider for each spectrum. Note that this option is ignored when the input is in pin.xml format. Default = 5.
Fido options:
--protein T|F – Output protein-level probabilities. If this option is not set, then none of the options below may be used. Default = F.
--alpha <float> – Specify the probability with which a present protein emits an associated peptide. Set by grid search (see the --deepness parameter) if not specified.
--beta <float> – Specify the probability of the creation of a peptide from noise. Set by grid search (see the --deepness parameter) if not specified.
--gamma <float> – Specify the prior probability that a protein is present in the sample. Set by grid search (see the --deepness parameter) if not specified.
--allow-protein-group T|F – Treat ties as if they were one protein. Default = F.
--protein-level-pi0 T|F – Use pi_0 value when calculating empirical q-values. Default = F.
--empirical-protein-q T|F – Output empirical q-values (from target-decoy analysis). Default = F.
--group-proteins T|F – Proteins with the same probabilities will be grouped. Default = F.
--no-prune-proteins T|F – Peptides with low scores will not be pruned before calculating protein probabilities. Default = F.
--deepness <0|1|2|3> – Set depth of the grid search for alpha, beta and gamma estimation. Default = 3. The values considered, for each possible value of the --deepness parameter, are as follows:
- 0: alpha = {0.01, 0.04, 0.09, 0.16, 0.25, 0.36, 0.5}; beta = {0.0, 0.01, 0.15, 0.025, 0.035, 0.05, 0.1}; gamma = {0.1, 0.25, 0.5, 0.75}.
- 1: alpha = {0.01, 0.04, 0.09, 0.16, 0.25, 0.36}; beta = {0.0, 0.01, 0.15, 0.025, 0.035, 0.05}; gamma = {0.1, 0.25, 0.5}.
- 2: alpha = {0.01, 0.04, 0.16, 0.25, 0.36}; beta = {0.0, 0.01, 0.15, 0.030, 0.05}; gamma = {0.1, 0.5}.
- 3: alpha = {0.01, 0.04, 0.16, 0.25, 0.36}; beta = {0.0, 0.01, 0.15, 0.030, 0.05}; gamma = {0.5}.
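To make the cost of each --deepness level concrete, the Python sketch below enumerates the candidate (alpha, beta, gamma) triples for the grids listed above, with the values copied verbatim from that list. It assumes that the search considers every combination of the three value sets (a full Cartesian product); the documentation does not state this explicitly. The names `DEEPNESS_GRIDS` and `grid_points` are hypothetical.

```python
from itertools import product

# Values copied verbatim from the list above.
DEEPNESS_GRIDS = {
    0: {"alpha": [0.01, 0.04, 0.09, 0.16, 0.25, 0.36, 0.5],
        "beta": [0.0, 0.01, 0.15, 0.025, 0.035, 0.05, 0.1],
        "gamma": [0.1, 0.25, 0.5, 0.75]},
    1: {"alpha": [0.01, 0.04, 0.09, 0.16, 0.25, 0.36],
        "beta": [0.0, 0.01, 0.15, 0.025, 0.035, 0.05],
        "gamma": [0.1, 0.25, 0.5]},
    2: {"alpha": [0.01, 0.04, 0.16, 0.25, 0.36],
        "beta": [0.0, 0.01, 0.15, 0.030, 0.05],
        "gamma": [0.1, 0.5]},
    3: {"alpha": [0.01, 0.04, 0.16, 0.25, 0.36],
        "beta": [0.0, 0.01, 0.15, 0.030, 0.05],
        "gamma": [0.5]},
}


def grid_points(deepness):
    """Return every (alpha, beta, gamma) triple for a deepness level,
    assuming a full Cartesian product of the three value sets."""
    grid = DEEPNESS_GRIDS[deepness]
    return list(product(grid["alpha"], grid["beta"], grid["gamma"]))


for level in sorted(DEEPNESS_GRIDS):
    print(level, len(grid_points(level)))
```

Under that assumption, level 0 yields 196 candidate triples, level 1 yields 108, level 2 yields 50, and the default level 3 yields 25, so lower --deepness values correspond to a finer and more expensive search.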
Input and output options:
--feature-file T|F – Output the computed features in tab-delimited text format to a file named "percolator.feature.txt". Default = F.
--feature-in-file T|F – When set to T, the <search results> argument should be a tab-delimited file, in which the first row is a header, and each subsequent row is a PSM. The fields should be identifier, label (1 = target, -1 = decoy), feature1, ..., featureN, peptide, proteins. Default = F.
--decoy-xml-output T|F – Include decoys (PSMs, peptides and/or proteins) in the XML output. Default = F.
--decoy-prefix <string> – Specifies the prefix of the protein names that indicates a decoy. Default = "decoy_".
--output-weights T|F – Output final weights to a file named "percolator.weights.txt". Default = F.
--input-weights <string> – Read initial weights from the given file (one per line). Default = F.
--fileroot <string> – The fileroot string will be added as a prefix to all output file names. Default = none.
--output-dir <filename> – The name of the directory where output files will be created. Default = crux-output.
--overwrite T|F – Replace existing files if true (T) or fail when trying to overwrite a file if false (F). Default = F.
--txt-output T|F – Output tab-delimited results files to the output directory. Default = T.
--pout-output T|F – Output a Percolator pout.xml format results file to the output directory. Default = F.
--mzid-output T|F – Output an mzIdentML results file to the output directory. Default = F.
--pepxml-output T|F – Output a pepXML results file to the output directory. Default = F.
--parameter-file <filename> – A file containing command-line or additional parameters. See the parameter documentation page for details.
--verbosity <int> – Specify the verbosity of the current processes. Each level prints the following messages, including all those at lower verbosity levels: 0-fatal errors, 10-non-fatal errors, 20-warnings, 30-information on the progress of execution, 40-more progress information, 50-debug info, 60-detailed debug info. Default = 30.
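As an illustration of the table layout expected when feature-in-file=T (a header row, then one row per PSM with identifier, label, features, peptide and proteins), here is a minimal Python sketch that writes such a table. The header names, feature names and PSM values are made up for the example, and the function name `write_feature_table` is hypothetical. How multiple proteins should be delimited within the proteins field is not specified above, so the sketch passes that field through as a single caller-supplied string.

```python
import csv
import io


def write_feature_table(handle, feature_names, psms):
    """Write PSMs as a tab-delimited feature table.

    Each PSM is a dict with keys "id", "is_decoy", "features",
    "peptide" and "proteins". The label column is 1 for targets
    and -1 for decoys.
    """
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    # Header names are illustrative only.
    writer.writerow(["id", "label", *feature_names, "peptide", "proteins"])
    for psm in psms:
        if len(psm["features"]) != len(feature_names):
            raise ValueError("feature count does not match header")
        label = -1 if psm["is_decoy"] else 1
        writer.writerow([psm["id"], label, *psm["features"],
                         psm["peptide"], psm["proteins"]])


# Made-up PSMs with two hypothetical features.
psms = [
    {"id": "psm1", "is_decoy": False, "features": [2.5, 0.3],
     "peptide": "PEPTIDEK", "proteins": "protA"},
    {"id": "psm2", "is_decoy": True, "features": [1.1, 0.05],
     "peptide": "KEDITPEP", "proteins": "decoy_protA"},
]
buffer = io.StringIO()
write_feature_table(buffer, ["xcorr", "delta_cn"], psms)
print(buffer.getvalue(), end="")
```

Writing to an in-memory buffer keeps the example self-contained; for real use the handle would be a file opened for writing with newline="".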