crux search-for-matches
Usage:
crux search-for-matches [options] <ms2 input filename> <protein input>
Description:
This command searches a protein database with a set of spectra. For each spectrum, the precursor mass is computed from either the measured precursor singly charged mass (m+h) or the mass-to-charge (m/z) and an assumed charge. Candidate peptides whose mass lies within a specified range of the precursor mass are identified. These candidate peptides are scored with the SEQUEST® XCorr, and the top-ranking matches for each spectrum are reported.
An optional p-value may be computed for each spectrum based on the distribution of scores for that spectrum (Aaron A. Klammer, Christopher Y. Park and William Stafford Noble. "Statistical calibration of the SEQUEST XCorr function." Journal of Proteome Research. 8(4):2106-2113, 2009).
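The precursor-mass calculation and candidate selection described above can be sketched in a few lines of Python. This is an illustration only, not Crux's code: the function names are hypothetical, the proton mass constant is an assumption (the value Crux uses internally may differ slightly), and the window arithmetic follows the descriptions of the precursor-window and precursor-window-type parameters in this document.

```python
PROTON_MASS = 1.00727646688  # Da; assumed value, Crux's own constant may differ


def neutral_mass(mz=None, charge=None, mh=None):
    """Neutral precursor mass from m+h if given, otherwise from m/z and charge."""
    if mh is not None:
        return mh - PROTON_MASS
    return (mz - PROTON_MASS) * charge


def mass_window(mz, charge, window, window_type="mass"):
    """Return (low, high) neutral-mass bounds for candidate peptides."""
    mass = neutral_mass(mz=mz, charge=charge)
    if window_type == "mass":
        return mass - window, mass + window
    if window_type == "mz":
        # Widen in m/z space, then convert both edges to neutral mass.
        return (neutral_mass(mz=mz - window, charge=charge),
                neutral_mass(mz=mz + window, charge=charge))
    if window_type == "ppm":
        # Bounds as given in the precursor-window-type description.
        return mass / (1.0 + window / 1e6), mass / (1.0 - window / 1e6)
    raise ValueError(f"unknown window type: {window_type}")


# Example: a 2+ precursor at 800 m/z with the default 3.0 mass window.
low, high = mass_window(mz=800.0, charge=2, window=3.0)
print(f"candidate peptide masses between {low:.3f} and {high:.3f} Da")
```

A candidate peptide would then be kept only if its computed mass falls between the two bounds; how Crux retrieves such peptides from a FASTA file or index is not shown here.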
The input protein database may either be in FASTA format or it may be a binary index created by crux create-index. Using an index will typically yield much faster search speeds.
Modifications: Crux handles two types of modifications: static and variable. Static modifications are a change of mass applied to a given amino acid in every peptide in which it occurs. By default, a static modification of +57 Da to cysteine (C) is applied. Variable modifications allow peptides to be generated with and without a mass change to a given amino acid. Crux handles variable modifications as follows. The user specifies an allowed set of amino acid modifications, using the options mod, cmod, and nmod, which are described below. Before any search is performed, Crux generates an exhaustive list of all possible combinations of amino acid modifications that could be applied to a peptide. Subsequently, for each spectrum, Crux performs one search for each possible combination of modifications, including no modifications. For example, if the precursor m/z for a spectrum is 800 Th, the charge state is 2+, and Crux is considering a modification of +79 Da, then the neutral precursor mass is about 1598 Da and Crux will retrieve from the database all candidate peptides whose unmodified mass is close to 1598 - 79 = 1519 Da. The candidate peptide list is then updated to remove any peptides that cannot be modified (because they contain no modifiable amino acids) and expanded to include all possible modified forms of each candidate. These candidates are scored as usual and the top n candidate peptides are added to a composite, sorted list of peptides. Finally, after all modifications have been searched, Crux reports for the current spectrum the top m peptides from the composite list.
Input:
- <ms2> – The name of the file from which to parse the MS2 spectra. Any file format supported by ProteoWizard is accepted, with the exception of vendor formats.
- <protein-database> – The name of the file in FASTA format or the directory containing a protein index from which to retrieve proteins and peptides.
Output:
The program writes files to the folder crux-output by default. The name of the output folder can be set by the user using the --output-dir option. The following files will be created:
- search.params.txt: a file containing the name and value of all parameters/options for the current operation. Not all parameters in the file may have been used in the operation. The resulting file can be used with the --parameter-file option for other crux programs.
- search.pin.xml: a file containing the PSMs in PIN XML format (see the schema). This file can be used as input to crux percolator.
- search.target.txt: a tab-delimited text file containing the PSMs. See the txt file format documentation for a list of the fields. This file can be used as input to crux percolator, crux compute-q-values, and crux q-ranker.
- search.target.pep.xml: a file containing the PSMs in pepXML format. See the pepXML file format documentation for further reference. This file can be used as input to some of the tools in the Trans-Proteomic Pipeline.
- search.log.txt: a log file containing a copy of all messages that were printed to stderr.
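As a rough illustration of working with the tab-delimited output, the sketch below loads a PSM file with Python's standard csv module. It assumes the file begins with a header row of column names; no particular column names are assumed, and the function names are hypothetical.

```python
import csv


def read_psms(path):
    """Read a tab-delimited PSM file into a list of dicts keyed by column name."""
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


def summarize(path):
    """Return (number of PSM rows, list of column names) as a quick sanity check."""
    rows = read_psms(path)
    columns = list(rows[0].keys()) if rows else []
    return len(rows), columns
```

With the default output folder, a call such as summarize("crux-output/search.target.txt") would report how many PSMs were written and which fields are present, which is a convenient check before passing the file to a post-processing command.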
If decoys are enabled using --num-decoys-per-target, then files called search.decoy.txt and search.decoy.pep.xml are also produced.
Options:
--fileroot <string>
– The fileroot string will be added as a prefix to all output file names. Default = none.
--output-dir <filename>
– The name of the directory where output files will be created. Default = crux-output.
--overwrite T|F
– Replace existing files if true (T) or fail when trying to overwrite a file if false (F). Default = F.
--seed <int>
– Set the seed of the random number generator. Default = 1.
--num-decoys-per-target <int>
– Specify the number of decoy peptides to search for every target peptide searched. Control where the decoys are returned (to what files) with --decoy-location. At least one decoy set (in its own file) is required to run crux percolator in a subsequent crux run. Default = 1.
--decoys none|reverse|protein-shuffle|peptide-shuffle
– Include a decoy version of every peptide by shuffling or reversing the target sequence. Use 'reverse' to reverse each protein sequence, 'protein-shuffle' to shuffle each protein sequence, or 'peptide-shuffle' to shuffle the sequence between enzyme cleavage sites, leaving the termini in place. Use 'none' for no decoys. Default = peptide-shuffle.
--decoy-location target-file|one-decoy-file|separate-decoy-files
– File(s) in which decoy results are returned. Only applies when num-decoys-per-target is not zero. Use 'target-file' to mix target and decoy PSMs in one file. Use 'one-decoy-file' to print target PSMs to one file and all decoys to a separate file. Use 'separate-decoy-files' to print one .txt file for each decoy set. (crux percolator accepts up to two search.decoy.txt files. crux q-ranker accepts only one search.decoy.txt file.) Default = separate-decoy-files.
--compute-sp T|F
– Compute the preliminary score Sp for all candidate peptides. This option is recommended if results are to be analyzed by percolator, q-ranker or barista. Note that, if sqt-output is enabled, then compute-sp is automatically enabled and cannot be overridden. Default = F.
--compute-p-values T|F
– Estimate the parameters of the score distribution for each spectrum and compute a p-value for each PSM. The score distribution parameters are estimated only from target PSM scores. The same parameters will be used to compute p-values for the decoy PSMs. This option can be used in conjunction with crux compute-q-values. Default = F.
--spectrum-parser pwiz|mstoolkit
– Specify the parser to use for reading in MS/MS spectra. The default ProteoWizard parser should be able to read the MS/MS file formats that ProteoWizard supports. The alternative is the MSToolkit parser. If the ProteoWizard parser fails to read your files properly, you may want to try the MSToolkit parser instead. Default = pwiz.
--spectrum-min-mz <float>
– The lowest spectrum m/z to search in the ms2 file. Default = 0.0.
--spectrum-max-mz <float>
– The highest spectrum m/z to search in the ms2 file. Default = no maximum.
--spectrum-charge 1|2|3|all
– The spectrum charges to search. With 'all', every spectrum will be searched, and spectra with multiple charge states will be searched once at each charge state. With 1, 2, or 3, only spectra with that charge will be searched. Default = all.
--max-ion-charge 1|2|3|peptide
– Predict fragment ions up to this charge state. The integer options ('1', '2', and '3') specify a fixed maximum charge state. The 'peptide' option indicates that the ions should range up to the charge state of the peptide itself minus 1. Thus, a 3+ peptide would have fragment ions of 1+ and 2+. One exception: a 1+ peptide always has fragment ions of 1+. Default = peptide.
--scan-number <int>|<int>-<int>
– A single scan number or a range of numbers to be searched. A range should be specified as 'first-last', which will include scans 'first' and 'last'. Default = search all spectra.
--mz-bin-width <float>
– Before calculation of the XCorr score, the m/z axes of the observed and theoretical spectra are discretized. This parameter specifies the size of each bin. The exact formula is floor((x/mz-bin-width) + 1.0 - mz-bin-offset), where x is the observed m/z value. By default, the mz-bin-width is 1.0005079 Da when searching using monoisotopic mass and 1.0011413 Da with average mass.
--mz-bin-offset <float>
– In the discretization of the m/z axes of the observed and theoretical spectra, this parameter specifies the location of the left edge of the first bin, relative to mass = 0 (i.e., mz-bin-offset = 0.xx means the left edge of the first bin will be located at +0.xx Da). The parameter must lie in the range 0 ≤ mz-bin-offset ≤ 1. Default = 0.40.
--parameter-file <filename>
– A file containing command-line or additional parameters. See the parameter documentation page for details. Default = no parameter file.
--verbosity <0-100>
– Specify the verbosity of the current processes. Each level prints the following messages, including all those at lower verbosity levels: 0-fatal errors, 10-non-fatal errors, 20-warnings, 30-information on the progress of execution, 40-more progress information, 50-debug info, 60-detailed debug info. Default = 30.
--txt-output <T|F>
– Output a tab-delimited results file to the output directory. Default = T.
--sqt-output <T|F>
– Output a SQT results file to the output directory. Default = F.
--pepxml-output <T|F>
– Output a pepXML results file to the output directory. Default = F.
--mzid-output <T|F>
– Output an mzIdentML results file to the output directory. Default = F.
--pinxml-output <T|F>
– Output a PIN XML results file to the output directory. Default = F.
Parameter file options:
min-peaks <int>
– The minimum number of peaks a spectrum must have in order to be searched. Default = 20.
fragment-mass average|mono
– Which isotopes to use in calculating fragment ion mass (average, mono). Default = mono.
use-flanking-peaks T|F
– Turn on or off the peaks flanking the b/y ions. For crux search-for-matches, default = F; for crux search-for-xlinks, default = T.
precursor-window <float>
– Tolerance used for matching peptides to spectra. Peptides must be within +/- 'precursor-window' of the spectrum value. The units of the precursor window depend upon precursor-window-type. Default = 3.0.
precursor-window-type mass|mz|ppm
– Specify the units for the window that is used to select peptides around the precursor mass location (mass, mz, ppm). The magnitude of the window is defined by the precursor-window option, and candidate peptides must fall within this window. For the mass window-type, the spectrum precursor m+h value is converted to mass, and the window is defined as that mass ± precursor-window. If the m+h value is not available, then the mass is calculated from the precursor m/z and provided charge. The peptide mass is computed as the sum of the average amino acid masses plus 18 Da for the terminal H and OH groups. The mz window-type calculates the window as spectrum precursor m/z ± precursor-window and then converts the resulting m/z range to the peptide mass range using the precursor charge. For the parts-per-million (ppm) window-type, the spectrum mass is calculated as in the mass type. The lower bound of the mass window is then defined as the spectrum mass / (1.0 + (precursor-window / 1000000)) and the upper bound is defined as spectrum mass / (1.0 - (precursor-window / 1000000)). Default = mass.
top-match <int>
– The number of PSMs per spectrum written to the output files. Default = 5.
mod <mass change>:<aa list>:<max per peptide>:<prevents cleavage>:<prevents cross-link>
– Consider modifications on any amino acid in aa list with at most max-per-peptide in one peptide. This parameter may be included with different values multiple times so long as the total number of mod, cmod, and nmod parameters does not exceed 11. The prevents cleavage and prevents cross-link fields are optional T/F arguments describing whether the modification prevents enzymatic cleavage or cross-linking, respectively. The same modifications must be given for any post-search process (crux compute-q-values, crux q-ranker, crux percolator). Default = no variable modifications.
cmod <mass change>:<max distance from protein C-terminus>
– Consider modifications on the C-terminus of any peptide whose C-terminus is no more than max-distance residues from the protein C-terminus. Use -1 to consider the C-terminus of all peptides regardless of position in the protein. This parameter may be included with different values multiple times so long as the total number of mod, cmod, and nmod parameters does not exceed 11. The same modifications must be given for any post-search process (crux compute-q-values, crux q-ranker, crux percolator). Default = no C-terminal modifications.
nmod <mass change>:<max distance from protein N-terminus>
– Consider modifications on the N-terminus of any peptide whose N-terminus is no more than max-distance residues from the protein N-terminus. Use -1 to consider the N-terminus of all peptides regardless of position in the protein. This parameter may be included with different values multiple times so long as the total number of mod, cmod, and nmod parameters does not exceed 11. The same modifications must be given for any post-search process (crux compute-q-values, crux q-ranker, crux percolator). Default = no N-terminal modifications.
cmod-fixed <mass change>
– Add a modification of the given mass change to the C-terminus of every peptide.
nmod-fixed <mass change>
– Add a modification of the given mass change to the N-terminus of every peptide.
max-mods <int>
– The maximum number of modifications that can be applied to a single peptide. Default = no limit.
max-aas-modified <int>
– The maximum number of modified amino acids that can appear in one peptide. Each amino acid can be modified multiple times. Default = no limit.
mod-mass-format mod-only|total|separate
– Specify how sequence modifications are reported in various output files. Each modification is reported as a number enclosed in square brackets following the modified residue; however, the number may correspond to one of three different masses: (1) 'mod-only' reports the value of the mass shift induced by the modification; (2) 'total' reports the mass of the residue with the modification (residue mass plus modification mass); (3) 'separate' is the same as 'mod-only', but multiple modifications to a single amino acid are reported as a comma-separated list of values. For example, suppose amino acid D has an unmodified mass of 115 as well as two modifications of masses +14 and +2. In this case, the amino acid would be reported as D[16] with 'mod-only', D[131] with 'total', and D[14,2] with 'separate'.
precision <int>
– Set the precision (number of significant digits) for scores written to text files. Default = 8.
print-search-progress <int>
– Show search progress by printing every n spectra searched. Set to 0 to show no search progress. Default = 1000.
NOTE: the parameters from min-mass onward are also used when creating an index and must be compatible with any index used.
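The three reporting styles of the mod-mass-format option can be reproduced with a short sketch. The function below is hypothetical and only mirrors the worked example for amino acid D given above; it is not taken from Crux, and its number formatting (Python's general format) is an assumption.

```python
def format_modified_residue(residue, residue_mass, mod_masses, mass_format="mod-only"):
    """Annotate one residue with its modification masses in square brackets."""
    if mass_format == "mod-only":
        # Sum of all mass shifts applied to this residue.
        label = f"{sum(mod_masses):g}"
    elif mass_format == "total":
        # Residue mass plus all mass shifts.
        label = f"{residue_mass + sum(mod_masses):g}"
    elif mass_format == "separate":
        # Each mass shift listed individually, comma-separated.
        label = ",".join(f"{mass:g}" for mass in mod_masses)
    else:
        raise ValueError(f"unknown mod-mass-format: {mass_format}")
    return f"{residue}[{label}]"


# Amino acid D (mass 115) carrying modifications of +14 and +2.
for style in ("mod-only", "total", "separate"):
    print(style, format_modified_residue("D", 115, [14, 2], style))
```

For the example residue this yields D[16], D[131], and D[14,2], matching the values stated in the option description.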
min-mass <float>
– The minimum neutral mass of the peptides to place in the index. Default = 200.
max-mass <float>
– The maximum neutral mass of the peptides to place in the index. Default = 7200.
min-length <int>
– The minimum length of the peptides to place in the index. Default = 6.
max-length <int>
– The maximum length of the peptides to place in the index. Default = 50.
--enzyme trypsin|trypsin/p|chymotrypsin|elastase|clostripain|cyanogen-bromide|iodosobenzoate|proline-endopeptidase|staph-protease|asp-n|lys-c|lys-n|arg-c|glu-c|pepsin-a|elastase-trypsin-chymotrypsin|no-enzyme
– Enzyme to use for in silico digestion of protein sequences. Used in conjunction with the digestion and missed-cleavages options. Use 'no-enzyme' for non-specific digestion. Digestion rules are as follows: enzyme name [cuts after one of these residues]|{but not before one of these residues}. trypsin [RK]|{P}, trypsin/p [RK]|[], elastase [ALIV]|{P}, chymotrypsin [FWYL]|{P}, clostripain [R]|[], cyanogen-bromide [M]|[], iodosobenzoate [W]|[], proline-endopeptidase [P]|[], staph-protease [E]|[], elastase-trypsin-chymotrypsin [ALIVKRWFY]|{P}, asp-n []|[D], lys-c [K]|{P}, lys-n []|[K], arg-c [R]|{P}, glu-c [DE]|{P}, pepsin-a [FL]|{P}. Default = trypsin.
custom-enzyme <residues before cleavage>|<residues after cleavage>
– Specify rules for in silico digestion of protein sequences. Overrides the enzyme option. Two lists of residues are given, each enclosed in square brackets or curly braces, and separated by a |. The first list contains residues required or prohibited before the cleavage site, and the second list contains residues required or prohibited after the cleavage site. If the residues are required for digestion, they are enclosed in square brackets, '[' and ']'. If the residues prevent digestion, they are enclosed in curly braces, '{' and '}'. Use X to indicate all residues. For example, trypsin cuts after R or K but not before P, which is represented as [RK]|{P}. AspN cuts after any residue but only before D, which is represented as [X]|[D].
digestion full-digest|partial-digest
– Degree of digestion used to generate peptides. Either both ends (full-digest) or at least one end (partial-digest) of a peptide must conform to enzyme specificity rules. Used in conjunction with the enzyme or custom-enzyme option when enzyme is not set to 'no-enzyme'. Default = full-digest.
missed-cleavages <int>
– Include in the index peptides containing up to <int> missed cleavage sites. Default = 0.
isotopic-mass average|mono
– Specify the type of isotopic masses to use when calculating the peptide mass. Default = average.
<A-Z> <float>
– Specify static modifications. This is a mass change applied to the given amino acid (in single-letter code, A through Z) for every peptide in which it occurs. Use the mod option for generating peptides both with and without the mass change. Default: C=57.
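The digestion rule notation used by the enzyme and custom-enzyme options can be illustrated with a small sketch. It assumes a full digest with no missed cleavages and ignores the min-length, max-length, and mass limits; the function names are hypothetical and this is not how Crux itself implements digestion. An empty list (as in asp-n []|[D]) is treated as "no constraint", and X as "all residues".

```python
def make_matcher(side):
    """Return a predicate for one side of a rule such as '[RK]' or '{P}'."""
    required = side[0] == "["      # '[..]' lists required residues, '{..}' prohibited ones
    residues = side[1:-1]
    if residues == "":             # empty list: no constraint on this side
        return lambda aa: True
    if residues == "X":            # X stands for all residues
        return lambda aa: required
    return lambda aa: (aa in residues) == required


def digest(protein, rule):
    """Split a protein at every site allowed by a rule like '[RK]|{P}'."""
    before, after = rule.split("|")
    ok_before, ok_after = make_matcher(before), make_matcher(after)
    peptides, start = [], 0
    for i in range(1, len(protein)):
        # A cleavage site lies between residue i-1 and residue i.
        if ok_before(protein[i - 1]) and ok_after(protein[i]):
            peptides.append(protein[start:i])
            start = i
    peptides.append(protein[start:])
    return peptides


print(digest("MKAPRPLKDE", "[RK]|{P}"))  # trypsin rule
print(digest("MKAPRPLKDE", "[X]|[D]"))   # AspN rule from the custom-enzyme example
```

With the trypsin rule the toy sequence is cut after K and after the final K, but not after the R that precedes P; with the AspN rule it is cut only before D.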