crux barista

Description:

Barista is a protein identification algorithm that combines two different tasks—peptide-spectrum match (PSM) verification and protein inference—into a single learning algorithm. The program requires two inputs: the results of a database search and a database of proteins. Barista produces as output a ranking of proteins based on how likely they are to be present in the sample. Barista also re-ranks the peptide identifications to more accurately distinguish between correct and incorrect identifications. Barista uses a machine learning strategy that requires that the database search be carried out on target and decoy proteins.

Usage:

crux barista [options] <protein database folder> <sqt folder> <ms2 folder>

crux barista [options] <list of protein database files> <list of sqt files> <list of ms2 files>

crux barista [options] <protein database file> <sqt file> <ms2 file>

Required Input:

  • protein database – The program requires the protein database files against which the search was performed. The protein database input may be a concatenated database or separate target and decoy databases. In both cases, the program distinguishes target from decoy proteins by the decoy prefix on the protein name. The prefix can be changed with the --decoy-prefix option; the default is "random_".

    The input can be a directory containing the database files, a text file listing the protein database files, or a single database file. If the protein database input has the suffix "fasta", the program assumes a single-file input. Otherwise, it checks whether the input is a directory or a text file.
  • psm input – Barista accepts PSM input in sqt format. The PSM input can be a single file with the suffix "sqt", in which case the corresponding ms2 file must be given as the third required argument. Alternatively, the PSM input can be a text file listing all the sqt files, in which case the third required argument must be a text file listing the corresponding ms2 files.

    Finally, the PSM input can be a directory in which all the database search results are located. In this case, the program collects all the files with the suffix "sqt" in the directory and analyzes them jointly. As it finds each sqt file, it searches the ms2 input directory for an ms2 file with the same name, but with the suffix "sqt" replaced by the suffix "ms2".

    Multiple files can be the results of many different experiments or of duplicate runs. They can also be the results of separate target and decoy searches. All such files in the given directory are analyzed jointly.

  • ms2 input – The ms2 input can be a single file with the suffix "ms2", a text file listing all the ms2 files, or a directory containing all the ms2 files that correspond to the PSM input given in the second argument.
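
The input rules above can be sketched in a few lines of Python. This is an illustration only, not Barista's actual code: the function names are hypothetical, the assumption that a list file holds one path per line is not stated in this document, and raising an error for a missing ms2 file is a choice made for the sketch.

```python
from pathlib import Path
from typing import List, Tuple


def classify_input(path: str, single_suffix: str) -> str:
    """Return 'file', 'directory', or 'list' for one required argument.

    single_suffix is "fasta", "sqt", or "ms2", depending on the argument.
    """
    p = Path(path)
    if p.name.endswith(single_suffix):
        return "file"          # suffix match: treated as a single-file input
    if p.is_dir():
        return "directory"     # all matching files inside are used
    return "list"              # assumed: a text file naming the input files


def pair_sqt_with_ms2(sqt_dir: str, ms2_dir: str) -> List[Tuple[Path, Path]]:
    """Pair each *.sqt file with the ms2 file of the same name."""
    pairs = []
    for sqt in sorted(Path(sqt_dir).glob("*.sqt")):
        # Same file name, with the suffix "sqt" replaced by "ms2".
        ms2 = Path(ms2_dir) / sqt.with_suffix(".ms2").name
        if not ms2.is_file():
            # Assumption for this sketch: a missing partner is an error.
            raise FileNotFoundError(f"no ms2 file found for {sqt.name}")
        pairs.append((sqt, ms2))
    return pairs


def is_decoy(protein_name: str, decoy_prefix: str = "random_") -> bool:
    """A protein counts as a decoy when its name starts with the decoy prefix."""
    return protein_name.startswith(decoy_prefix)
```

Running a check like pair_sqt_with_ms2 before starting an analysis is a cheap way to confirm that every sqt file in a directory has its ms2 counterpart.
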

Output:

    The program writes files to the folder crux-output by default. The name of the output folder can be changed with the --output-dir option. The following files will be created:

  • barista.target.html: an HTML file that summarizes the protein ranking in a human-readable format. Sets of indistinguishable proteins are grouped together, and their IDs are printed along with the Barista q-value. Along with each protein group is a list of the corresponding peptides. The information about each peptide includes its amino acid sequence, scan and charge of the peptide, its position in the protein, the q-value and score assigned by Barista. Following is a portion of a sample output file, showing the identification of a pair of proteins:

    q-value  score  ID        Description
    0.00     11.25  foo123    Phosphotyrosine glutamate transporter
                    foobar73  Phosphotyrosine glutamate transporter (N-terminal domain)

      Peptides matched:
      EAMPK     scan=000230  charge=2  20-24  q=0.07  score=1.27
      YRMLK     scan=004870  charge=2  27-31  q=0.09  score=1.05
      NLMMRPPK  scan=006790  charge=3  54-61  q=0.59  score=-0.09

  • barista.target.proteins.txt: a tab-delimited file containing a ranked list of groups of indistinguishable target proteins, with the associated Barista scores and q-values and the peptides that contributed to the identification of each protein group. The following columns are included: proteins in group, Barista score, decoy q-value, peptides with scan and charge (for example, EAMPK-001285.2). The semantics of these columns are explained in this document.
  • barista.target.peptides.txt: a tab-delimited file containing a ranked list of target peptides with the associated Barista scores and q-values. The following columns are included: peptide, scan of spectrum in the peptide-spectrum match, charge of spectrum in the peptide-spectrum match, Barista score, decoy q-value. The semantics of these columns are explained in this document.
  • barista.target.psm.txt: a tab-delimited file containing a ranked list of target peptide-spectrum matches with the associated Barista scores and q-values. The following columns are included: scan of spectrum in the peptide-spectrum match, charge of spectrum in the peptide-spectrum match, peptide in the peptide-spectrum match, Barista score, decoy q-value, the filename where each psm was found. The semantics of these columns are explained in this document.
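
As a rough sketch of how the tab-delimited output might be consumed, the Python below reads barista.target.psm.txt by column position and keeps the PSMs under a q-value threshold. The key names in PSM_COLUMNS are labels chosen for this sketch and follow the column order listed above; whether the file starts with a header line is an assumption, so it is exposed as a parameter. The threshold value is arbitrary.

```python
import csv
import io
from typing import Dict, List

# Column order as listed above; the key names are labels for this sketch,
# not necessarily the header strings used in the file itself.
PSM_COLUMNS = ["scan", "charge", "peptide", "score", "q_value", "filename"]


def read_psms(handle, has_header: bool = True) -> List[Dict[str, str]]:
    """Read tab-delimited PSM rows into dictionaries keyed by PSM_COLUMNS."""
    reader = csv.reader(handle, delimiter="\t")
    if has_header:
        next(reader, None)  # assumed: the first line names the columns
    return [dict(zip(PSM_COLUMNS, row)) for row in reader if row]


def filter_by_q_value(psms: List[Dict[str, str]],
                      threshold: float = 0.01) -> List[Dict[str, str]]:
    """Keep only the PSMs whose q-value is at or below the threshold."""
    return [p for p in psms if float(p["q_value"]) <= threshold]


# Made-up rows for illustration; a real run would open the output file instead.
example = io.StringIO(
    "scan\tcharge\tpeptide\tscore\tq-value\tfile\n"
    "230\t2\tEAMPK\t1.27\t0.07\trun1.sqt\n"
    "679\t3\tNLMMRPPK\t-0.09\t0.59\trun1.sqt\n"
)
accepted = filter_by_q_value(read_psms(example), threshold=0.1)
print([p["peptide"] for p in accepted])
```

The same reader could be adapted to the peptide and protein files by changing the list of column labels to match the columns described for each file.
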

Options:
  • --enzyme trypsin|chymotrypsin|elastase – The enzyme used to digest the proteins in the experiment. Default = trypsin.
  • --decoy-prefix <string> – The prefix of the protein name that indicates a decoy. Default = random_.
  • --fileroot <string> – The fileroot string will be added as a prefix to all output file names. Default = none.
  • --output-dir <filename> – The name of the directory where output files will be created. Default = crux-output.
  • --skip-cleanup <T/F> – Barista first preprocesses the data and builds lookup tables, which are then used during training. Setting this option to T keeps the tables after the analysis, so that a later run can go straight to training by specifying the --dir-with-tables option (see below). Default = F.
  • --dir-with-tables <filename> – The name of the directory where the lookup tables with preprocessed data are located. Default = none.
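
The sketch below assembles a command line from the options and required arguments documented above, following the argument order given under Usage. It only builds and prints the command; it does not run it. The helper name barista_command and the file names are hypothetical.

```python
import shlex
from typing import List, Optional


def barista_command(database: str, sqt: str, ms2: str,
                    decoy_prefix: Optional[str] = None,
                    output_dir: Optional[str] = None,
                    fileroot: Optional[str] = None,
                    skip_cleanup: bool = False) -> List[str]:
    """Build an argument list: options first, then the three required inputs."""
    cmd = ["crux", "barista"]
    if decoy_prefix is not None:
        cmd += ["--decoy-prefix", decoy_prefix]
    if output_dir is not None:
        cmd += ["--output-dir", output_dir]
    if fileroot is not None:
        cmd += ["--fileroot", fileroot]
    if skip_cleanup:
        cmd += ["--skip-cleanup", "T"]  # keep the lookup tables after the run
    cmd += [database, sqt, ms2]
    return cmd


# Hypothetical file names, used only to show the shape of the command.
cmd = barista_command("yeast.fasta", "run1.sqt", "run1.ms2",
                      decoy_prefix="random_",
                      output_dir="barista-out",
                      skip_cleanup=True)
print(shlex.join(cmd))
```

Keeping the command as a list of arguments rather than one string avoids quoting problems if it is later passed to subprocess.run.
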
