crux barista
Description:
Barista is a protein identification algorithm that combines two different tasks—peptide-spectrum match (PSM) verification and protein inference—into a single learning algorithm. The program requires three inputs: a set of MS2 spectra, a protein database, and the results of searching the spectra against the database. Barista produces as output three ranked lists of proteins, peptides and PSMs, based on how likely the proteins and peptides are to be present in the sample and how likely the PSMs are to be correct. Barista can jointly analyze the results of multiple shotgun proteomics experiments, corresponding to different experiments or replicate runs.
Barista uses a machine learning strategy that requires that the database search be carried out on target and decoy proteins. The searches may be carried out on a concatenated database or, using the --separate-searches option, separate target and decoy databases. The crux create-index command can be used to generate a decoy database.
Barista assigns two types of statistical confidence estimates, q-values and posterior error probabilities, to identified PSMs, peptides and proteins. For more information about these values, see the documentation for calibrate-scores.
Usage:
crux barista [options] <protein-database> <spectra> <search results>
Required Inputs:
- protein-database – The program requires the FASTA format protein database files against which the search was performed. The protein database input may be a concatenated database or separate target and decoy databases. In either case, Barista distinguishes between target and decoy proteins based on the presence of a decoy prefix on the sequence identifiers (see the --decoy-prefix option, below).
- spectra – The fragmentation spectra must be provided in MS2 format.
- search results – Barista recognizes search results in the tab-delimited text format produced by Crux.
Each of the three required arguments can be provided in three different ways: (1) as a single file, (2) as a text file containing a list of filenames, one per line, or (3) as a directory containing multiple files. File types are identified based on the filename extension: ".fa", ".fasta" or ".fsa" for FASTA files, ".ms2" for MS2 files and ".txt" for tab-delimited text files or lists of filenames. Note that the input mode for spectra and for search results must be the same; i.e., if you provide a list of files for the spectra, then you must also provide a list of files containing your search results. This mode is specified using the --list-of-files option, described below.
Output:
The program writes files to the folder crux-output by default. The name of the output folder can be set by the user using the --output-dir option. The following files will be created:
- barista.xml: an XML file that contains four main parts:
- Proteins
- Subset Proteins
- Peptides
- PSMs
- barista.target.proteins.txt: a tab-delimited file containing a ranked list of groups of indistinguishable target proteins, with the associated Barista scores and q-values and the peptides that contributed to the identification of each protein group.
- barista.target.subset-proteins.txt: a tab-delimited file containing groups of indistinguishable proteins, which constitute a subset of some group in the barista.target.proteins.txt file in terms of the peptides identified in these proteins.
- barista.target.peptides.txt: a tab-delimited file containing a ranked list of target peptides with the associated Barista scores and q-values.
- barista.target.psm.txt: a tab-delimited file containing a ranked list of target peptide-spectrum matches with the associated Barista scores and q-values.
- barista.log.txt: a file where the program reports its progress.
- barista.params.txt: a file with the values of all the options given to the current run.
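After a run, it can be useful to confirm which of the files listed above were actually written. The following is a minimal sketch, not part of Barista itself: the function name check_barista_output is hypothetical, it assumes that no --fileroot prefix was used, and it assumes that the first line of the protein table is a header row, which the description above does not state.

```shell
#!/bin/sh
# check_barista_output [DIR]
# Report which of the documented Barista output files exist in DIR.
# DIR defaults to crux-output, the documented default output folder.
check_barista_output() {
  outdir=${1:-crux-output}

  # File names as listed in the Output section (no --fileroot prefix assumed).
  for f in barista.xml \
           barista.target.proteins.txt \
           barista.target.subset-proteins.txt \
           barista.target.peptides.txt \
           barista.target.psm.txt \
           barista.log.txt \
           barista.params.txt; do
    if [ -f "$outdir/$f" ]; then
      printf 'found   %s\n' "$f"
    else
      printf 'missing %s\n' "$f"
    fi
  done

  # Assumption: the first line of the protein table holds column names.
  # Print them one per line so the tab-delimited layout is easy to read.
  proteins="$outdir/barista.target.proteins.txt"
  if [ -f "$proteins" ]; then
    head -n 1 "$proteins" | tr '\t' '\n'
  fi
}

check_barista_output "${1:-crux-output}"
```

Passing a different directory as the first argument covers runs that used --output-dir.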
Options:
- --enzyme trypsin|chymotrypsin|elastase – The enzyme used to digest the proteins in the experiment. Default = trypsin.
- --decoy-prefix <string> – Specifies the prefix of the protein names that indicates a decoy. Default = rand_.
- --optimization <string> – Specifies whether to do optimization at the protein, peptide or psm level. Default = protein.
- --spectrum-parser pwiz|mstoolkit – Specify the parser to use for reading in MS/MS spectra. The default, the ProteoWizard parser, should be able to read the supported MS/MS file formats. The alternative is the MSToolkit parser. If the ProteoWizard parser fails to read your files properly, you may want to try the MSToolkit parser instead. Default = pwiz.
- --separate-searches <search results> – This option indicates that the target and decoy searches were run separately, rather than using a concatenated database. In this case, Barista will assume that the database search results provided as a required argument are from the target database search. This option then allows the user to specify the location of the decoy search results. Like the required arguments, these search results can be provided as a single file, a list of files or a directory. However, the choice (file, list or directory) must be consistent for the spectrum files and the target and decoy search results. Also, if the spectra and the search results are provided in directories, then Barista will use the spectrum filename (<name>.ms2) to identify corresponding target and decoy search results with names <name>*.target.txt and <name>*.decoy.txt. Note that the decoy database can be provided as part of the required <protein-database> argument.
- --fileroot <string> – The fileroot string will be added as a prefix to all output file names. Default = none.
- --output-dir <directory> – The name of the directory where output files will be created. Default = crux-output.
- --overwrite <T|F> – This option applies when the output directory specified for the run already exists. If set to T, Barista will overwrite the contents of that directory. Default = F.
- --skip-cleanup <T|F> – Barista analysis begins with a preprocessing step that creates a set of lookup tables, which are then used during training. Normally, these lookup tables are deleted at the end of the Barista analysis, but setting this option to T prevents the deletion of these tables. Subsequently, the Barista analysis can be repeated more efficiently by specifying the --re-run option (see below). Default = F.
- --re-run <directory> – Re-run a previous Barista analysis using a previously computed set of lookup tables. For this option to work, the --skip-cleanup option must have been set to T when Barista was run the first time.
- --use-spec-features <T|F> – Barista uses an enriched feature set derived from the spectra. Default = T.
- --parameter-file <filename> – A file containing command-line or additional parameters. See the parameter documentation page for details. Default = no parameter file.
- --feature-file <T|F> – Optional file into which PSM features are printed. Default = F.
- --list-of-files <T|F> – Specify that the spectra and search results are provided as lists of files, rather than as individual files. When the spectrum files and the database search results files are provided via a file listing, Barista assumes that the order of the spectrum files matches the order of the search result files. Alternatively, when the spectrum files and search results files are provided via directories, Barista will search for pairs of files with the same root name but different extensions (".ms2" and ".txt"). Default = F.
- --verbosity <0-100> – Specify the verbosity of the current processes. Each level prints the following messages, including all those at lower verbosity levels: 0-fatal errors, 10-non-fatal errors, 20-warnings, 30-information on the progress of execution, 40-more progress information, 50-debug info, 60-detailed debug info. Default = 30.
- --txt-output <T|F> – Output a tab-delimited results file to the output directory. Default = T.
- --pepxml-output <T|F> – Output a pepXML results file to the output directory. Default = F.
Selected Examples of Use:
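- Multiple Runs via File Lists (sketch):
When several runs are analyzed jointly, the --list-of-files mode expects one list of spectrum files and one list of search result files, with the two lists in matching order. The following is a minimal sketch rather than an official example. It assumes a hypothetical layout with spectra/<name>.ms2 and results/<name>.txt, creates two empty placeholder runs so that it can be executed on its own, and only prints the resulting barista command instead of running it.

```shell
#!/bin/sh
# Hypothetical demo layout: spectra/<name>.ms2 and results/<name>.txt.
# The empty placeholder files stand in for real spectra and search results.
workdir=$(mktemp -d)
mkdir -p "$workdir/spectra" "$workdir/results"
for name in run1 run2; do
  : > "$workdir/spectra/$name.ms2"
  : > "$workdir/results/$name.txt"
done

# Write both lists in a single loop, so that line i of the spectra list
# and line i of the results list always refer to the same run.
: > "$workdir/spectra-list.txt"
: > "$workdir/results-list.txt"
for ms2 in "$workdir"/spectra/*.ms2; do
  name=$(basename "$ms2" .ms2)
  echo "$ms2" >> "$workdir/spectra-list.txt"
  echo "$workdir/results/$name.txt" >> "$workdir/results-list.txt"
done

# Print the call that would use the two lists (not executed here).
# target-decoy.fasta is a placeholder for the database used in the search.
echo "crux barista --list-of-files T target-decoy.fasta" \
     "$workdir/spectra-list.txt $workdir/results-list.txt"
```

Deriving both lists from the same loop avoids relying on two separate directory listings happening to sort the same way.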
- Concatenated Search (Using Comet and/or Tide):
1) Generate a decoy protein fasta:
crux create-index --decoys protein-shuffle proteins.fasta temp-index
2) Create concatenated fasta:
cat proteins.fasta ./temp-index/proteins-random.fasta > target-decoy.fasta
3) Run Comet on the concatenated fasta, with Comet's own decoy generation turned off:
crux comet --decoy_search 0 spectra.mzXML.gz target-decoy.fasta
4) Run tide-search on the concatenated fasta, with decoy generation in tide-index turned off:
crux tide-index --decoy-format none target-decoy.fasta tide-index
crux tide-search spectra.mzXML.gz tide-index
5) If needed, convert spectra to ms2:
crux get-ms2-spectrum spectra.mzXML.gz > spectra.ms2
6) Run barista:
crux barista target-decoy.fasta spectra.ms2 crux-output/comet.target.txt
crux barista target-decoy.fasta spectra.ms2 crux-output/tide-search.txt
- Separate Search:
Using Comet:
crux create-index --decoys protein-shuffle proteins.fasta temp-index
cat proteins.fasta ./temp-index/proteins-random.fasta > target-decoy.fasta
crux comet --output-dir target-search --decoy_search 0 spectra.mzXML.gz proteins.fasta
crux comet --output-dir decoy-search --decoy_search 0 spectra.mzXML.gz ./temp-index/proteins-random.fasta
crux get-ms2-spectrum spectra.mzXML.gz > spectra.ms2
crux barista --separate-searches decoy-search/comet.target.txt target-decoy.fasta spectra.ms2 target-search/comet.target.txt
Using Tide-Search:
crux create-index --decoys protein-shuffle proteins.fasta temp-index
cat proteins.fasta ./temp-index/proteins-random.fasta > target-decoy.fasta
crux tide-index --decoy-format none proteins.fasta target-index
crux tide-index --decoy-format none ./temp-index/proteins-random.fasta decoy-index
crux tide-search --output-dir target-search spectra.mzXML.gz target-index
crux tide-search --output-dir decoy-search spectra.mzXML.gz decoy-index
crux get-ms2-spectrum spectra.mzXML.gz > spectra.ms2
crux barista --separate-searches decoy-search/tide-search.txt target-decoy.fasta spectra.ms2 target-search/tide-search.txt
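- Checking Directory Input for Separate Searches (sketch):
When spectra and search results are given as directories together with --separate-searches, Barista pairs each <name>.ms2 with search results named <name>*.target.txt and <name>*.decoy.txt. The helper below is a minimal sketch for checking that convention before a run. The function name check_pairs is hypothetical, and it assumes that target and decoy results sit in one results directory, which the option description does not specify.

```shell
#!/bin/sh
# check_pairs SPECTRA_DIR RESULTS_DIR
# For each <name>.ms2 in SPECTRA_DIR, report whether files matching
# <name>*.target.txt and <name>*.decoy.txt exist in RESULTS_DIR.
# Returns non-zero if any spectrum file lacks a target or decoy match.
check_pairs() {
  spectra_dir=$1
  results_dir=$2
  rc=0
  for ms2 in "$spectra_dir"/*.ms2; do
    [ -e "$ms2" ] || continue   # the glob matched no .ms2 files
    name=$(basename "$ms2" .ms2)
    for kind in target decoy; do
      # Expand the pattern into the positional parameters; if nothing
      # matches, $1 remains the literal pattern and the -e test fails.
      set -- "$results_dir/$name"*".$kind.txt"
      if [ -e "$1" ]; then
        echo "ok      $name ($kind): $1"
      else
        echo "MISSING $name ($kind)"
        rc=1
      fi
    done
  done
  return $rc
}
```

A call such as check_pairs spectra-dir results-dir (both directory names are placeholders) prints one line per spectrum file and search type, and its exit status can gate the subsequent barista call.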