crux calibrate-scores
Usage:
crux calibrate-scores [options] <search results> <column name>
Description:
Given a collection of scored peptide-spectrum matches (PSMs), estimate two statistical confidence measures for each: a q-value and a posterior error probability (PEP).
q-value
The q-value is analogous to a p-value but incorporates false discovery rate multiple testing correction. The q-value associated with a score threshold T is defined as the minimal false discovery rate at which a score of T is deemed significant. In this setting, the q-value accounts for the fact that we are analyzing a large collection of PSMs.
To estimate q-values, calibrate-scores searches the input directory for a corresponding set of decoy PSMs. The false discovery rate associated with a given score is estimated as the number of decoy scores above the threshold divided by the number of target scores above the threshold, multiplied by the ratio of the total number of targets to total number of decoys. This methodology is described in the following article:
Lukas Käll, John D. Storey, Michael J. MacCoss and William Stafford Noble. "Assigning significance to peptides identified by tandem mass spectrometry using decoy databases." Journal of Proteome Research. 7(1):29-34, 2008.
Note that calibrate-scores does not (yet) estimate the percentage of incorrect targets, as described in the above article. Hence, the method implemented here as "decoy q-values" is analogous to the "Simple FDR" procedure shown in Figure 4A of the above article.
In each case, the estimated FDRs are converted to q-values by ranking the PSMs by score and then taking, for each PSM, the minimum of the current FDR and all of the FDRs below it in the ranked list.
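The decoy-based procedure described above can be sketched in a few lines of Python. This is an illustrative sketch, not the Crux implementation: the function name decoy_q_values is hypothetical, higher scores are assumed to be better, scores "at or above" the threshold are counted so that each PSM counts itself, and the estimated FDR is capped at 1.0.

```python
from bisect import bisect_left


def decoy_q_values(target_scores, decoy_scores):
    """Return one q-value per target score (assumes higher score = better)."""
    if not target_scores or not decoy_scores:
        raise ValueError("need at least one target and one decoy score")

    targets_sorted = sorted(target_scores)
    decoys_sorted = sorted(decoy_scores)
    n_targets = len(targets_sorted)
    n_decoys = len(decoys_sorted)
    ratio = n_targets / n_decoys  # total targets / total decoys

    def fdr_at(threshold):
        # Counts of scores at or above the threshold (an assumption; the
        # text says "above", but the PSM itself must be counted).
        t_above = n_targets - bisect_left(targets_sorted, threshold)
        d_above = n_decoys - bisect_left(decoys_sorted, threshold)
        # Capping at 1.0 is an assumption made for this sketch.
        return min(1.0, d_above / t_above * ratio)

    # Walk from the worst score to the best, carrying the running minimum,
    # so each PSM gets the minimum of its own FDR and all FDRs below it
    # in the ranked list.
    q_by_score = {}
    running_min = 1.0
    for score in sorted(set(target_scores)):
        running_min = min(running_min, fdr_at(score))
        q_by_score[score] = running_min

    return [q_by_score[score] for score in target_scores]


# Made-up scores, for illustration only.
q_values = decoy_q_values([10, 9, 8, 7], [8.5, 6, 5, 4])
```

The running minimum guarantees that a better score never receives a larger q-value than a worse one, even when the raw FDR estimate fluctuates along the ranked list.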
Posterior error probability
Unlike the q-value, which is calculated with respect to the collection of PSMs with scores above a specified threshold, the PEP (also known in the literature as the "local FDR") is calculated with respect to a single score. The PEP is the probability that a particular PSM is incorrect. Crux's PEPs are estimated using the methodology described in this article:
Lukas Käll, John Storey and William Stafford Noble. "Non-parametric estimation of posterior error probabilities associated with peptides identified by tandem mass spectrometry." Bioinformatics (Proceedings of the ECCB). 24(16):i42-i48, 2008.
A primer on multiple testing correction can be found here:
William Stafford Noble. "How does multiple testing correction work?" Nature Biotechnology. 27(12):1135-1137, 2009.
A discussion of q-values versus posterior error probabilities is provided in this article:
Lukas Käll, John D. Storey, Michael J. MacCoss and William Stafford Noble. "Posterior error probabilities and false discovery rates: two sides of the same coin." Journal of Proteome Research. 7(1):40-44, 2008.
Input:
- <search results> – A collection of target and decoy peptide-spectrum matches (PSMs) in Crux tab-delimited text format. Decoy PSMs can be provided in two ways: either as a separate file or embedded within the same file as the target PSMs. Crux will first search for decoy PSMs in a separate file. The decoy file name is constructed from the target file name by replacing "target" with "decoy". For example, if tide-search.target.txt is provided as input, then Crux will search for a corresponding file named tide-search.decoy.txt. If no decoy file is found, then Crux will assume that the given input file contains a mix of target and decoy PSMs. Within this file, decoys are identified using a prefix (specified via --decoy-prefix) on the protein name.
- <column name> – The name of the column from which to extract the score.
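The decoy lookup rules above can be illustrated with a short Python sketch. The helper names (decoy_path_for, locate_decoys, is_decoy) are hypothetical and do not come from Crux. The sketch assumes that only the first occurrence of "target" in the file name is replaced, which the text above does not specify.

```python
import os


def decoy_path_for(target_path):
    """Build the decoy file name by replacing "target" with "decoy".

    Only the file name is changed, not the directory. Replacing just the
    first occurrence is an assumption of this sketch.
    """
    directory, name = os.path.split(target_path)
    return os.path.join(directory, name.replace("target", "decoy", 1))


def locate_decoys(target_path, exists=os.path.exists):
    """Decide whether decoys come from a separate file or are embedded."""
    candidate = decoy_path_for(target_path)
    if candidate != target_path and exists(candidate):
        return ("separate", candidate)
    # No decoy file found: the input is assumed to mix targets and decoys.
    return ("embedded", target_path)


def is_decoy(protein_name, decoy_prefix="decoy_"):
    """In a mixed file, a decoy is recognized by its protein name prefix."""
    return protein_name.startswith(decoy_prefix)


print(decoy_path_for("tide-search.target.txt"))
```

The exists argument is there only so the lookup can be tried without touching the file system; by default it checks the real file system.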
Output:
The program writes files to the folder crux-output by default. The name of the output folder can be set by the user using the --output-dir option. The following files will be created:
- calibrate-scores.target.txt: a tab-delimited text file containing the PSMs. See txt file format for a list of the fields. The file will contain two additional columns, named "<column name> q-value" and "<column name> PEP" where "<column name>" is provided on the command line.
- calibrate-scores.log.txt: a log file containing a copy of all messages that were printed to stderr.
- calibrate-scores.params.txt: a file containing the name and value of all parameters/options for the current operation. Not all parameters in the file may have been used in the operation. The resulting file can be used with the --parameter-file option for other crux programs.
Options:
Miscellaneous
--decoy-prefix <string> – Specifies what protein name prefix is used to indicate a decoy. Default = "decoy_".
--pi-zero <value> – The estimated proportion of target scores that are drawn according to the null distribution. Default = 1.0.
Input and output
--fileroot <string> – The fileroot string will be added as a prefix to all output file names. Default = none.
--output-dir <filename> – The name of the directory where output files will be created. Default = crux-output.
--overwrite <T|F> – Replace existing files if true (T) or fail when trying to overwrite a file if false (F). Default = F.
--parameter-file <filename> – A file containing command-line or additional parameters. See the parameter documentation page for details.
--verbosity <0-100> – Specify the verbosity of the current processes. Each level prints the following messages, including all those at lower verbosity levels: 0-fatal errors, 10-non-fatal errors, 20-warnings, 30-information on the progress of execution, 40-more progress information, 50-debug info, 60-detailed debug info. Default = 30.
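After a run, the "<column name> q-value" column in calibrate-scores.target.txt can be used to select confident PSMs. The following Python sketch shows one way to do this with the standard csv module. The function name confident_psms, the 0.01 threshold, the "scan" and "xcorr score" column names, and the sample rows are all illustrative assumptions rather than output from a real run.

```python
import csv
import io


def confident_psms(lines, column_name, max_q=0.01):
    """Return the rows whose "<column name> q-value" is at most max_q.

    `lines` is any iterable of text lines, such as an open file.
    """
    q_column = f"{column_name} q-value"
    reader = csv.DictReader(lines, delimiter="\t")
    return [row for row in reader if float(row[q_column]) <= max_q]


# Made-up sample in the tab-delimited layout, for illustration only.
sample = (
    "scan\txcorr score\txcorr score q-value\txcorr score PEP\n"
    "1\t3.2\t0.001\t0.002\n"
    "2\t1.1\t0.2\t0.6\n"
)
kept = confident_psms(io.StringIO(sample), "xcorr score")

# With a real run and the default output folder, the call might look like:
# with open("crux-output/calibrate-scores.target.txt", newline="") as handle:
#     kept = confident_psms(handle, "xcorr score")
```

Passing the score column name as an argument mirrors the command line, where the same name determines the names of the two added columns.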