phred.doc - CodonCode Corporation

File: PHRED.DOC

*|***************************************************************************|*
*|                                                                           |*
*|   Program: phred                                                          |*
*|   Version: 0.000925.c                                                     |*
*|                                                                           |*
*|   Copyright (C) 1993-2000 by Phil Green and Brent Ewing.                  |*
*|   All rights reserved.                                                    |*
*|                                                                           |*
*|   This software is a beta-test version of the phred package.              |*
*|   It should not be redistributed or used for any commercial               |*
*|   purpose, including commercially funded sequencing, without              |*
*|   written permission from the author and the University of                |*
*|   Washington.                                                             |*
*|                                                                           |*
*|   This software is provided ``AS IS'' and any express or                  |*
*|   implied warranties, including, but not limited to, the                  |*
*|   implied warranties of merchantability and fitness for a                 |*
*|   particular purpose, are disclaimed.  In no event shall                  |*
*|   the authors or the University of Washington be liable for               |*
*|   any direct, indirect, incidental, special, exemplary, or                |*
*|   consequential damages (including, but not limited to,                   |*
*|   procurement of substitute goods or services; loss of use,               |*
*|   data, or profits; or business interruption) however caused              |*
*|   and on any theory of liability, whether in contract, strict             |*
*|   liability, or tort (including negligence or otherwise)                  |*
*|   arising in any way out of the use of this software, even                |*
*|   if advised of the possibility of such damage.                           |*
*|                                                                           |*
*|   Portions of the code benefit from ideas due to Dave Ficenec,            |*
*|   LaDeana Hillier, Mike Wendl, and Tim Gleeson.  These are                |*
*|   indicated in the relevant source files.                                 |*
*|                                                                           |*
*|***************************************************************************|*


PHRED Documentation
-------------------

1. Introduction.

   Phred reads DNA sequencer trace data, calls bases, assigns quality
   values to the bases, and writes the base calls and quality values to
   output files.  Phred can read trace data from SCF files, ABI model 373
   and 377 DNA sequencer chromatogram files, and MegaBACE ESD chromatogram
   files.  It automatically detects the file format and whether the
   chromat file was compressed by gzip or UNIX compress.  After calling
   bases, phred writes the sequences to files in FASTA format, the
   format suitable for XBAP, PHD format, or SCF format.  Quality
   values for the bases are written to FASTA format files or PHD files,
   which can be used by the phrap sequence assembly program in order to
   increase the accuracy of the assembled sequence.

   Significant differences in this release

   New

     - add X86_GCC_LINUX definition. Defining this (in the supplied
       Makefile) when compiling phred on x86 Linux machines using
       the GCC compiler causes phred to call bases and assign quality
       values that are identical to those called and assigned when
       phred runs on IEEE conforming UNIX machines. See the INSTALL
       and Makefiles in this distribution and the Linux system file
       /usr/include/fpu_control.h for additional information.

     - add LI-COR quality value lookup tables for chromatograms created
       by the LI-COR sequencing machine. See the section below entitled
       LI-COR Data.

   Modified

     - set `source' string, which is written in the output SCF files,
       based on machine type inferred from primer ID and phredpar.dat
       rather than based on the chromatogram type. The possible `source'
       values are

         "ABI 373A or 377"
         "MegaBACE"
         "ABI 3700"
         "LI-COR 4000"
         "unknown"

     - add `-trim_fasta', `-trim_scf', and `-trim_phd' options, which
       cause phred to write trimmed sequence and quality values to
       FASTA, SCF, and PHD files, respectively. Also added `-trim_out',
       which causes phred to write trimmed sequence and quality values
       to all output files (FASTA, SCF, and PHD) except .poly files.
       This is the same as using `-trim_fasta', `-trim_scf', and
       `-trim_phd' in combination. See the section on `Sequence
       Trimming' below.
       NOTE: `-trim_phd' affects the values in the `TRIM' field of
       the PHD file comment block.

     - describe trimming in greater detail in the section `Sequence
       Trimming'.

     - add protection against buffer overflows in code


2. Acknowledgements.

   Phred benefits from ideas developed by LaDeana Hillier, Mike Wendl,
   Dave Ficenec, Tim Gleeson, Alan Blanchard, and Richard Mott.
   

3. Algorithms.

   Phred uses simple Fourier methods to examine the four base traces in
   the region surrounding each point in the data set in order to predict
   a series of evenly spaced peak locations.  That is, it determines
   where the peaks would be centered if there were no compressions,
   dropouts, or other factors shifting the peaks from their "true"
   locations.

   Next phred examines each trace to find the centers of the actual, or
   observed, peaks and the areas of these peaks relative to their neighbors.
   The peaks are detected independently along each of the four traces so
   many peaks overlap.  A dynamic programming algorithm is used to match
   the observed peaks detected in the second step with the predicted peak
   locations found in the first step.

   Phred evaluates the trace surrounding each called base using four or
   five quality value parameters to quantify the trace quality.  It
   uses a quality value lookup table to assign the corresponding quality
   value.  The quality value is related to the base call error probability
   by the formula

     QV = - 10 * log_10( P_e )

     where P_e is the probability that the base call is an error.
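
   As a worked illustration of the formula above, the following Python
   sketch converts between quality values and error probabilities.  The
   function names are illustrative only and are not part of phred.

```python
import math


def error_prob_to_qv(p_e):
    """Return QV = -10 * log10(P_e) for an error probability 0 < p_e <= 1."""
    if not 0.0 < p_e <= 1.0:
        raise ValueError("p_e must be in the interval (0, 1]")
    return -10.0 * math.log10(p_e)


def qv_to_error_prob(qv):
    """Invert the formula: P_e = 10 ** (-QV / 10)."""
    return 10.0 ** (-qv / 10.0)


if __name__ == "__main__":
    # Print the error probability implied by a few quality values.
    for qv in (10, 20, 30, 40):
        print(qv, qv_to_error_prob(qv))
```

   For example, a quality value of 20 corresponds to an error
   probability of 0.01, and an error probability of 0.05 corresponds to
   a quality value of about 13.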

   Phred uses data from a chemistry parameter file called 'phredpar.dat'
   in order to identify dye primer data.  For dye primer data, phred
   identifies loop/stem sequence motifs that tend to result in
   CC and GG merged peak compressions.  It reduces the quality values
   of potential merged peaks and splits those peaks that have certain
   trace characteristics indicative of merged CC and GG peaks.  In
   addition, the chemistry and dye information are passed to phrap.


Please note: the following instructions for building and 
installing do not apply to the software distributed by
CodonCode. Please follow the specific installation
instructions provided by CodonCode. 

4. Building and installing.

   The INSTALL file describes the steps for building and installing
   phred.

   Copy the phred parameter file, called 'phredpar.dat', to a
   directory that is accessible by phred users and set the environment
   variable 'PHRED_PARAMETER_FILE' to the full path name of the file.
   For example, if you copy 'phredpar.dat' to '/usr/local/etc/PhredPar'
   and you are using the C shell then issue the command

     % setenv PHRED_PARAMETER_FILE /usr/local/etc/PhredPar/phredpar.dat

   It is most convenient to set the environment variable in the system-
   wide shell startup (cshrc or equivalent) file.

   You can rename the phred parameter file but the PHRED_PARAMETER_FILE
   environment variable must reflect the new name.

   With Windows NT you give the command

     % set PHRED_PARAMETER_FILE=\usr\local\etc\PhredPar\phredpar.dat

   in the DOS command window in which you will run phred.

   Note: if you compile phred on a SUN Solaris OS using the BSD C
         compiler in the directory `/usr/ucb', you will find that
         the `-id' command line option fails (phred reports that it
         cannot read files, and it prints the name of each file it
         fails to read; however, the name it prints lacks the first
         few characters of the true name of the file). If this occurs,
         recompile phred using either the optional C compiler in the
         directory /opt/SUNWspro/bin or the GNU C compiler.


5. Running phred.

   Phred uses command line options to control input, processing, and
   output.  The command line options are delimited by a dash, "-".


   The command line options are

   Input Options
   -------------
   
   -id <directory name>         Read and process files in <directory name>.

   -if <file name>              Read and process files listed in the file
                                <file name>.  Each line in <file name> must
                                specify a valid path to a single input file.

   -zd <directory name>         Location of compression program.  If -zd is
                                omitted, phred uses the current path to search
                                for the compression program.

   -zt <directory name>         Directory where chromat is uncompressed. If
                                -zt is omitted, phred uses /usr/tmp.  When
                                phred processes a compressed file, it 
                                uncompresses the chromat into this temporary
                                directory before it reads the file.  It
                                subsequently deletes the uncompressed file in
                                the temporary directory.


   Processing Options
   ------------------

   -nocall                      Disable phred base calling and set the
                                current sequence to the ABI base calls
                                that are read from the input file.  By
                                default, the current sequence is set
                                to the phred base calls.  This affects
                                the base trimming and output options.

   -trim <enzyme sequence>      Perform sequence trimming on the current
                                sequence.  Bases are trimmed from the start
                                and end of the sequence on the basis of
                                trace quality.  In addition, <enzyme sequence>
                                specifies a base sequence that is used
                                to trim bases off the start of the current
                                sequence.  You can specify a NULL enzyme
                                sequence using empty double quotes, "".
                                See the note below on the effect of using
                                the trim option.

   -trim_alt <enzyme sequence>  Perform sequence trimming on the current
                                sequence.  Bases are trimmed from the start
                                and end of the sequence on the basis of
                                trace quality.  Specifically, for each base,
                                the phred error probability is subtracted
                                from the default value of 0.05 (or the value
                                set using the `-trim_cutoff' option), and the
                                resulting values are summed to find the
                                maximum scoring subsequence.  Furthermore,
                                the subsequence must have a minimum number
                                of bases.  In addition, <enzyme sequence>
                                specifies a base sequence that is used to
                                trim bases off the start of the current
                                sequence. You can specify a NULL enzyme
                                sequence using empty double quotes, "".

   -trim_cutoff <value>         Set trimming error probability for the
                                `-trim_alt' option and the trimming points
                                written in the phd files. The default value
                                is 0.05.

   -trim_fasta                  Trim sequences written to sequence and
                                quality value FASTA files. Set trimming
                                information in the FASTA headers to reflect
                                the high quality of the sequence, and append
                                the string `trimmed' to the header.

   -trim_scf                    Trim sequence, quality values, and base
                                locations written to SCF file. Append the
                                string `trimmed' to the comments.

   -trim_phd                    Trim sequence, quality values, and base
                                locations written to PHD files. Also set the
                                first and last high quality base locations
                                specified in the `TRIM' comment field to
                                the numbers of the first and last bases of
                                the trimmed sequence (the first base in the
                                sequence is base number zero). Finally set
                                the error probability cutoff value in the
                                `TRIM' comment field to -1.00 to indicate that
                                the sequence is trimmed, and that the trim
                                points may be unrelated to the error
                                probability cutoff value.

   -trim_out                    Trim information in the FASTA, SCF, and
                                PHD output files. This is equivalent to
                                specifying `-trim_fasta', `-trim_scf',
                                and `-trim_phd' on the command line.

   -nonorm                      Disable phred trace normalization.  This
                                option is not recommended unless the base
                                caller fails due to huge noise peaks
                                extending over a large region at the start
                                of the trace, as is characteristic of some
                                dye terminator reactions.

   -nosplit                     Disable compressed peak splitting.  By
                                default, phred identifies and splits
                                C and G peaks that may be a merged pair
                                of peaks.  Phred searches for compression
                                prone loop/stem sequence motifs and
                                attempts to confirm a compression using
                                characteristics of the trace, primarily
                                the size of the candidate peak.

   -nocmpqv                     Force phred to use the four parameter quality
                                values.  By default, phred uses five parameter
                                quality values for dye primer data (only) in
                                order to reduce the quality values of merged
                                CC and GG peaks.  (Phred uses the four
                                parameter quality values for dye terminator
                                chemistry data automatically.  If phred cannot
                                determine the chemistry, it uses the four
                                parameter quality values.)

   -ceilqv <ceil_qv>            Specifies a maximum quality value assigned
                                to bases.  Bases with quality value parameters
                                that correspond to quality values greater
                                than <ceil_qv> are assigned the value
                                <ceil_qv>.

   -beg_pred <trace_point>      Specifies the trace point at which to begin the
                                peak prediction. This point should be in a
                                region of `good' trace where the peak spacing
                                is even and representative of the peak spacing
                                throughout the trace. In addition the peaks
                                should be large and the noise low in the
                                region, and the value of <trace_point> must not
                                be within 100 points of the trace ends.
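
   The `-trim_alt' rule described above (subtract each base's error
   probability from the cutoff, then keep the maximum scoring
   subsequence) can be sketched in Python as follows.  This is only an
   illustration of the stated rule, not phred's source code.  The
   function name is hypothetical, and the `min_bases' default of 20 is
   an assumption, because the required minimum number of bases is not
   given in this document.

```python
def trim_points(error_probs, cutoff=0.05, min_bases=20):
    """Return (first, last) zero-based indices of the best subsequence.

    Each base scores (cutoff - error probability).  The contiguous run
    with the largest total score is kept.  Returns None when no run has
    a positive score or when the best run is shorter than min_bases
    (min_bases=20 is an assumed value, not taken from phred).
    """
    best_sum = 0.0
    best = None
    run_sum = 0.0
    run_start = 0
    for i, p in enumerate(error_probs):
        run_sum += cutoff - p
        if run_sum <= 0.0:
            # A run that has dropped to zero or below cannot help any
            # later subsequence, so start a new run after this base.
            run_sum = 0.0
            run_start = i + 1
        elif run_sum > best_sum:
            best_sum = run_sum
            best = (run_start, i)
    if best is None or best[1] - best[0] + 1 < min_bases:
        return None
    return best


if __name__ == "__main__":
    probs = [0.5, 0.5, 0.001, 0.001, 0.001, 0.5]
    print(trim_points(probs, min_bases=3))
```

   With the six error probabilities shown and `min_bases' set to 3, the
   sketch keeps bases 2 through 4, counting the first base as base
   number zero.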


   Output Options
   --------------

   -st fasta                    Set the output sequence file format
                                to FASTA. (Default.) Trimming options
                                affect the FASTA file; see the Notes
                                below for more information.

   -st xbap                     Set the output sequence file format
                                to XBAP.

   -s                           Write sequence output files with the
                                names obtained by appending ".seq" to
                                the names of the input files, and store
                                them in the directory where phred is
                                running.

   -s <file name>               Write a sequence output file with the
                                name <file name>.
                                This option is valid for a single input
                                file only.

   -sd <directory name>         Write sequence output files with the
                                names obtained by appending ".seq" to
                                the names of the input files, and write
                                them in the directory <directory name>.

   -sa <file name>              Write a sequence output file in FASTA
                                format with the name <file name>.  The
                                file contains the base calls of all the
                                reads processed in this run of phred.

   -qt fasta                    Set the output quality file format
                                to FASTA. Trimming options affect the
                                FASTA file; see the Notes below for
                                more information.

   -qt xbap                     Set the output quality file format
                                to XBAP.  Trimmed off base quality
                                values are omitted.

   -qt mix                      Set the output quality file format
                                to FASTA. Base quality values for
                                all bases are written (including those
                                for trimmed off bases).

   -q                           Write quality output files with the
                                names obtained by appending ".qual" to
                                the names of the input files, and store
                                them in the directory where phred is
                                running.
                                This option is valid for FASTA format
                                output files only.

   -q <file name>               Write a quality output file with the
                                name <file name>.
                                This option is valid for a single input
                                file and a FASTA format output file only.

   -qd <directory name>         Write quality output files with the
                                names obtained by appending ".qual" to
                                the names of the input files, and store
                                them in the directory <directory name>.

   -qa <file name>              Write a quality output file in FASTA
                                format with the name <file name>.  The
                                file contains the quality values of all the
                                reads processed in this run of phred.

   -qr <file name>              Write a histogram of the number of high
                                quality bases per read.  This is
                                meaningful when phred processes more
                                than one read.

   -c                           Write SCF files with the trace data,
                                the base calls of the current sequences,
                                and the positions of the base calls.  The
                                SCF files have the names of the input
                                files (phred will refuse to write the SCF
                                file if you ask it to write the SCF file
                                in the directory in which the input file
                                resides).

   -c <file name>               Write an SCF file with the trace data,
                                the base calls of the current sequence,
                                and the positions of the base calls.
                                The SCF file has the name <file name>.
                                This option is valid for a single input
                                file only.

   -cd <directory name>         Write SCF files with the trace data,
                                the base calls of the current sequences,
                                and the positions of the base calls.
                                The SCF files are written in the directory
                                <directory name> and have the same names
                                as the input files.

   -cp <number of bytes>        Store SCF trace data as 1 or 2 byte values.
                                Defaults to 1 when the maximum trace value is
                                less than 256, or to 2 when the maximum
                                trace value is greater than or equal to 256.
                                This is the trace precision.

   -cs                          Always scale traces before writing them to
                                an SCF output file. This ensures that the
                                largest trace value has the largest value
                                that can be stored in the SCF file. When the
                                file trace precision is `1', the maximum
                                value is 255, and when the precision is 2,
                                the maximum value is 65535. Without this
                                option, phred does not scale the trace unless
                                (a) the trace was read from an ESD file or
                                (b) the maximum trace value exceeds the value
                                that can be stored in the SCF file at the
                                precision used. Trace scaling ensures the
                                maximum digital resolution for a given
                                storage precision but it will make a
                                uniformly low level trace appear to be a
                                high level trace.

   -p                           Write a PHD file, which is used by the
                                consed editor to display bases.  A PHD
                                file contains a set of comments used by
                                consed for maintaining consistency between
                                the chromat file, the .ace file and
                                the PHD file, and it contains base data
                                as triples consisting of the base call,
                                quality, and position.  Phred always
                                writes the first version of the PHD
                                file for a read, which has the name
                                <filename>.phd.1.  When a read is edited
                                using consed, a new version of the phd is
                                written by consed, for example, the second
                                version has the name <filename>.phd.2.  With
                                the -p option, <filename> is the name of the
                                input file.

   -p <filename>                Write a PHD file with the name <filename>.phd.1.
                                This option is valid for processing a single
                                input file.

   -pd <directory name>         Write PHD files in directory <directory name>.
                                The PHD files have the names <filename>.phd.1
                                where <filename> is the name of the input file.

   -d                           Write a data file that is used for detecting
                                polymorphic bases.  The file has the
                                name <filename>.poly where <filename> is the
                                name of the input file.  The first line of
                                the file consists of the sequence name, the
                                smallest amplitude normalization factor, and
                                the amplitude normalization factors for the
                                A, C, G, and T traces.  One line for each
                                called base follows the header line.  The
                                information on each line consists of the
                                called base, the position of the called base,
                                the area of the called peak, the relative area
                                of the called peak, the uncalled base, the
                                position of the uncalled base, the area of the
                                uncalled base, the relative area of the
                                uncalled base, and the amplitudes of the four
                                traces at the position of the called base.

   -dd <directory name>         Write polymorphism data files in directory
                                <directory name>.  The files have the names
                                <filename>.poly where <filename> is the name
                                of the input file.
              
   -raw <sequence name>         Write <sequence name> in the header of
                                the sequence output file and the quality
                                output file.
                                By default, the name of the input file
                                is written in the headers of these files.
                                This option is valid for a single input
                                file only.

   -log                         Make phred append a log entry describing
                                the processing run in the file "phred.log".
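
   The trace precision and scaling rules given for the `-cp' and `-cs'
   options above can be illustrated with a short Python sketch.  The
   function names are hypothetical, trace values are assumed to be
   non-negative integers, and the rounding rule is an assumption; this
   document does not describe the arithmetic phred itself uses.

```python
def max_storable(precision):
    """Largest trace value that fits in an SCF file at this precision."""
    if precision == 1:
        return 255
    if precision == 2:
        return 65535
    raise ValueError("precision must be 1 or 2")


def default_precision(trace):
    """Default precision: 1 byte if the maximum value is below 256, else 2."""
    return 1 if max(trace) < 256 else 2


def scale_trace(trace, precision):
    """Scale so the largest trace value becomes the largest storable value.

    Uses integer round-half-up arithmetic (an assumption of this sketch).
    """
    peak = max(trace)
    if peak <= 0:
        return list(trace)
    limit = max_storable(precision)
    return [(v * limit + peak // 2) // peak for v in trace]


if __name__ == "__main__":
    low_level = [0, 20, 100]
    print(default_precision(low_level), scale_trace(low_level, 1))
```

   Note that the low level trace in the example is stretched to the full
   one byte range, which is the effect the `-cs' description warns about.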



   Miscellaneous
   -------------

   -v  <n>                      Verbose operation. You can control the level of
                                verbosity with <n>, which ranges from 1 to 63.

   -tags                        Label common output with tags in order to
                                facilitate output parsing.

   -h, -help                    Display a command line option summary.

   -doc                         Display phred documentation.

   -V                           Display phred version.                                




   Examples
   --------

   If you plan to use phred base calls and base quality information as
   input to the phrap assembly program and to the consed finishing
   program, simply follow the documentation supplied with consed and
   then type:

   phredPhrap
   
   (with no arguments)

   If you intend to use consed, you *MUST* use this perl script.  Failure
   to use this script will result in many consed features not working
   correctly, including consed's autofinish function, user-defined
   consensus tags, tagging ALU and other repeats, and tagging vector
   sequence.  Use the phredPhrap perl script.

   An outline of the important processing steps performed by the script
   follows.

   Let us say you want to call bases from the chromat files in
   subdirectory "chromat_dir", use phrap to assemble the contigs, and
   run consed to edit/examine the contigs.  In this case you must ask
   phred to create "phd" output files, which are required by consed.

   The script runs phred with the options

     % phred -id chromat_dir -pd phd_dir

   which causes phred to read the chromat files in "chromat_dir" and
   write the "phd" files to "phd_dir".  Next it makes FASTA files
   from the "phd" files by running the phd2fasta program.
   For example,

     % phd2fasta -id phd_dir -os seqs_fasta -oq seqs_fasta.screen.qual

   Subsequently it screens out the vector in the sequences in
   "seqs_fasta" using cross_match:

     % cross_match seqs_fasta vector.seq -minmatch 12 -minscore 20 -screen > screen.out

   which generates the screened sequence file "seqs_fasta.screen".

   It runs phrap to perform the sequence assembly as follows:

     % phrap seqs_fasta.screen -new_ace > phrap.out

   Phrap writes the assembled contigs to the file
   "seqs_fasta.screen.contigs", and creates a .ace file that can be
   used for importing the assembly to xbap, consed, or ace-mbly for
   editing.
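
   For readers who want to see the four steps above in one place, the
   following Python sketch assembles the same command lines and can run
   them in order.  It restates the outline only.  It is not a
   replacement for the phredPhrap script, which remains the route to use
   with consed.  The function names are hypothetical, and the programs
   are assumed to be on the search path.

```python
import subprocess


def pipeline_commands(chromat_dir="chromat_dir", phd_dir="phd_dir",
                      fasta="seqs_fasta", vector="vector.seq"):
    """Return the steps as (argv, stdout_file) pairs; stdout_file may be None."""
    screened = fasta + ".screen"
    return [
        (["phred", "-id", chromat_dir, "-pd", phd_dir], None),
        (["phd2fasta", "-id", phd_dir, "-os", fasta,
          "-oq", screened + ".qual"], None),
        (["cross_match", fasta, vector, "-minmatch", "12",
          "-minscore", "20", "-screen"], "screen.out"),
        (["phrap", screened, "-new_ace"], "phrap.out"),
    ]


def run_pipeline(steps):
    """Run each step in order, redirecting stdout where a file is named."""
    for argv, out in steps:
        if out is None:
            subprocess.run(argv, check=True)
        else:
            with open(out, "w") as handle:
                subprocess.run(argv, check=True, stdout=handle)


if __name__ == "__main__":
    # Print the command lines without running anything.
    for argv, out in pipeline_commands():
        print(" ".join(argv) + (" > " + out if out else ""))
```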

   As another example, again you want to process the chromat files
   in subdirectory "chromat_dir",  but now you want phred to write
   the base calls to a FASTA file named "seqs_fasta" and the base
   quality values to "seqs_fasta.qual".  In this case you run phred
   with the options

     % phred -id chromat_dir -sa seqs_fasta -qa seqs_fasta.qual

   We recommend that you not use the trim option.  Inaccurate bases
   called near the ends of the traces will not interfere with proper
   phrap assembly.

   Refer to the file "phrap.doc", which is part of the phrap
   distribution, for information on cross_match and phrap.


   Return values
   -------------

   Phred returns 0 for successful processing and for file read errors. It
   returns -1 for processing errors and file write errors.

   Phred continues processing on file read and write errors but halts on
   serious processing errors.
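
   A wrapper script might interpret these return values as in the Python
   sketch below.  The function name is hypothetical.  The sketch relies
   on the fact that UNIX systems report only the low 8 bits of a return
   value, so a return of -1 is normally observed as exit status 255.

```python
def describe_exit_status(status):
    """Map a phred exit status to the meaning given in this section."""
    if status == 0:
        return "success, or only file read errors occurred"
    if status & 0xFF == 0xFF:
        # -1 and 255 have the same low 8 bits.
        return "processing error or file write error"
    return "status not described in this section: %d" % status


if __name__ == "__main__":
    for status in (0, 255):
        print(status, describe_exit_status(status))
```

   Because phred returns 0 even when some input files could not be
   read, a status of 0 does not by itself show that every chromatogram
   was processed.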


6. Phred parameter file


   Phred reads the `primer ID' information in the chromatogram and it
   tries to find the same name in the phred parameter file, which is
   described in the `Building and installing' section above.  If it
   succeeds, the phredpar.dat entry for the `primer ID' identifies the
   sequencing reaction chemistry (primer or terminator) and the type of
   dye.  If it cannot find the `primer ID' information in the
   chromatogram, it reports

     no dye primer ID in chromat yyyy

   where yyyy is the chromatogram name. If it cannot find the `primer ID'
   name in phredpar.dat (or it cannot find the phredpar.dat file), it
   reports

     unknown chemistry (xxxx) in chromat yyyy
     add a line of the form
     "xxxx"    <chemistry>      <dye type>      <machine type>
     to the file zzzz
     type `phred -doc' for more information

   where xxxx is the `primer ID' and yyyy is the chromatogram name.
   Add the indicated line to phredpar.dat.

   Phred reads the `PHRED_PARAMETER_FILE' environment variable in order
   to find the phredpar.dat file.  If this is not set on your system,
   phred reports

     warning: 'PHRED_PARAMETER_FILE' environment variable not set:
               unable to identify chemistry and dye
               type `phred -doc' for more information

   If the `PHRED_PARAMETER_FILE' environment variable is set incorrectly,
   that is, phred cannot find the phredpar.dat file there or the file is
   not valid, phred reports

     readParamFile: warning: unable to open file zzzz
       warning: processing without phred parameters

   where zzzz is the value of the PHRED_PARAMETER_FILE environment
   variable.  Phred still processes the chromatograms but, as explained
   above, it repeats a warning for each chromatogram that it could not
   read the parameter file.  To fix this, set PHRED_PARAMETER_FILE to
   the name of a valid parameter file.
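
   The two PHRED_PARAMETER_FILE failure cases can be checked before
   phred is started.  The following is a minimal sketch, not part of
   phred; the function name `check_parameter_file' and its return
   strings are hypothetical.  It tests only that the variable is set
   and names a readable file, not that the file contents are valid.

```python
import os


def check_parameter_file(environ=None):
    """Return 'unset', 'unreadable', or 'ok' for PHRED_PARAMETER_FILE."""
    env = os.environ if environ is None else environ
    path = env.get("PHRED_PARAMETER_FILE")
    if not path:
        # Corresponds to the "environment variable not set" warning.
        return "unset"
    if not os.path.isfile(path) or not os.access(path, os.R_OK):
        # Corresponds to the "unable to open file" warning.
        return "unreadable"
    return "ok"
```

   Called with no argument, the function inspects the environment of
   the current process.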

   In these three cases phred processes the chromatogram but it uses the
   default (ABI) four-parameter quality values, does not try to split
   compression peaks, and reports the chemistry and dye types to phrap
   as `unknown'.

   If you use a `primer ID' for your reactions that is not in phredpar.dat,
   you can add the `primer ID' name to phredpar.dat.  You will need to know
   the `primer ID' name as it is stored in the chromatograms, the chemistry
   type (primer or terminator), the dye name, and the type of sequencing
   machine.  Use a text editor to add `primer ID' entries to phredpar.dat.
   The phredpar.dat file itself contains additional information about
   the form of its entries.

   The columns in phredpar.dat have the form

   column      value name
   ------      ----------
   1           primer identification string
   2           chemistry
   3           dye
   4           sequencing machine type

   where the column values are separated by spaces or horizontal tabs.

   The values phred recognizes are

   value name                 values
   ----------                 ------
   primer id. string          primer name enclosed in double quotes
   chemistry                  primer, terminator
   dye                        rhodamine, d-rhodamine, big-dye,
                              energy-transfer, bodipy
   sequencing machine type    ABI_373_377, MolDyn_MegaBACE, ABI_3700


   NOTE: the `MegaBACE Mobility File' entry in the phredpar.dat file
         specifies `unknown' chemistry, rather than `primer' or
         `terminator' because some early MegaBACE software wrote
         `MegaBACE Mobility File' for the `primer ID' string in both
         primer and terminator chemistry ABD files. You may want to
         change this value if you process exclusively primer or
         terminator chemistry MegaBACE data; however, you must
         remember to change it if you decide to process different
         chemistry data from the MegaBACE later.
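
   The four-column layout described above can be checked with a short
   script before a new entry is added.  This is a sketch only and does
   not reproduce phred's own parser; `parse_phredpar_entry' is a
   hypothetical helper, and the entry lines in the usage note are made
   up for illustration.  It handles one entry line at a time and
   reports values that fall outside the lists above (such as the
   `unknown' chemistry mentioned in the NOTE) instead of rejecting
   them.

```python
import shlex

# Values taken from the table of recognized values above.
CHEMISTRIES = {"primer", "terminator"}
DYES = {"rhodamine", "d-rhodamine", "big-dye", "energy-transfer", "bodipy"}
MACHINES = {"ABI_373_377", "MolDyn_MegaBACE", "ABI_3700"}


def parse_phredpar_entry(line):
    """Split one entry line into its four columns.

    The primer ID is enclosed in double quotes and may contain spaces,
    so shlex.split is used instead of a plain whitespace split.
    """
    fields = shlex.split(line)
    if len(fields) != 4:
        raise ValueError("expected 4 columns, found %d" % len(fields))
    primer_id, chemistry, dye, machine = fields
    unlisted = []
    if chemistry not in CHEMISTRIES:
        unlisted.append("chemistry")
    if dye not in DYES:
        unlisted.append("dye")
    if machine not in MACHINES:
        unlisted.append("machine")
    return {
        "primer_id": primer_id,
        "chemistry": chemistry,
        "dye": dye,
        "machine": machine,
        "unlisted": unlisted,  # columns whose value is not in the lists
    }
```

   For example, the made-up line

     "ET Primer"    primer    energy-transfer    MolDyn_MegaBACE

   parses with an empty `unlisted' list, while a line whose chemistry
   is `unknown' is reported with "chemistry" in that list.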



7. Notes


   Sequence Trimming
   -----------------

   First, a warning: do not trim sequences that phrap will assemble.  We
   cautiously introduce trimming capabilities in phred to allow
   identification of the high quality region of reads, and to permit
   trimming off low quality segments of reads that are not destined for
   a phrap assembly.

   Phred uses a number of different algorithms to calculate trimming
   information. The algorithm used and its effect depend on the output
   file and the trimming-related command line options.

   The phd output file always contains trimming information in the
   header. Phred calculates this trimming information using a modified
   Mott algorithm (it does not trim off vector sequence so the trimming
   information identifies the entire high quality segment of the read).
   The trimming information appears in the phd file header in the form

   TRIM: <n1> <n2> <r1>

   where <n1> is the first high quality base (where the first base in
   the sequence is number zero) and <n2> is the last high quality base.
   <r1> is the error probability cutoff value used to calculate the
   trim points. The command line option `-trim_cutoff' affects the
   phd file trimming information by setting the error probability cutoff
   value used to calculate the base scores. If the sequence has fewer
   than 20 high quality bases, the values <n1> and <n2> are set to -1.
   If the `-trim_phd' or `-trim_out' option is used, <n1> and <n2>
   are set to the numbers of the first and last bases in the trimmed
   sequence (so <n1> is always zero), and <r1> is set to -1.00 to
   indicate that the sequence is trimmed and that the error probability
   cutoff value may be unrelated to the trim points.
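
   The TRIM field is easy to read from a script.  The sketch below is
   one possible reader, not code from phred; `parse_trim_line' is a
   hypothetical name and the numbers in the usage note are invented.
   It assumes the field appears on a single line in exactly the form
   shown above.

```python
def parse_trim_line(line):
    """Parse a `TRIM: <n1> <n2> <r1>' line from a phd file header."""
    tag, n1, n2, r1 = line.split()
    if tag != "TRIM:":
        raise ValueError("not a TRIM line: %r" % line)
    first, last, cutoff = int(n1), int(n2), float(r1)
    if first == -1 and last == -1:
        # Fewer than 20 high quality bases were found.
        length = 0
    else:
        # Base numbers start at zero and both ends are inclusive.
        length = last - first + 1
    return {
        "first": first,
        "last": last,
        "cutoff": cutoff,
        "length": length,
        # A cutoff of -1.00 marks a sequence that is already trimmed.
        "already_trimmed": cutoff < 0.0,
    }
```

   For example, parse_trim_line("TRIM: 15 562 0.0500") gives a high
   quality segment of 548 bases, from base 15 through base 562.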

   The sequence, quality value, SCF, and PHD output files can be
   affected by the trimming-related command line options. (Sequence
   and quality value files are those created using the -s, -sa, -sd,
   -q, -qa, and -qd options; SCF files are created using the -c and
   -cd options; and PHD files are created using the -p and -pd
   options.) When phred runs without trimming-related options set,
   it does not calculate trimming values for the sequence, quality
   value, and SCF output files (and it does not `trim' the values
   stored in them).

   The `-trim' and `-trim_alt' options select the trimming algorithm
   used to calculate the trimming information used in the sequence,
   quality value, and SCF output files. The algorithm used for the
   `-trim' option is based directly on characteristics of the trace.
   It predates phred and phred quality values. The algorithm used for
   the `-trim_alt' option is based on the modified Mott algorithm: it
   uses the base error probabilities calculated from the phred quality
   values and the error probability cutoff (the cutoff can be adjusted
   using the `-trim_cutoff' option). We believe that the `-trim' option
   tends to be more conservative than the `-trim_alt' option, that is,
   it `trims off' more bases, so we recommend using the `-trim_alt'
   algorithm. Both the `-trim' and `-trim_alt' options take an argument
   consisting of a vector sequence. If the argument is "" (an empty
   string), phred finds the high quality segment of the read. If the
   argument is not empty, and phred finds the beginning of the vector
   sequence within the first 100 bases of the read, phred sets the left
   trim point to remove the vector sequence as well as low quality bases.

   Selecting either `-trim' or `-trim_alt' causes phred to determine
   trimming information and to modify the sequence, quality value, and
   SCF files as follows.

     The FASTA sequence header contains trimming information
     but the sequence is unaffected. The header has the form

     >chromat_name   1323     15    548  ABI

     where the sequence name immediately follows the header
     delimiter, which is ">", the first integer is the number
     of bases called by phred, the second integer is the
     number of bases `trimmed off' the beginning of the
     sequence, the third integer is the number of bases
     `remaining following trimming', and the string describes
     the type of input file.

     The XBAP-type sequence header contains trimming
     information, and the low quality bases are commented out.

     For quality value file type option `-qt fasta' (default),
     the FASTA quality value header contains the same trimming
     information as in the FASTA sequence header and the
     quality values of the `trimmed off' bases are set to zero.

     For quality value file type option `-qt xbap', phred
     writes an XBAP-type sequence header with trimming
     information followed by the quality values of the bases
     remaining after trimming on subsequent lines.

     For quality value file type option `-qt mix', phred
     writes a FASTA quality value header with the same
     trimming information as in the FASTA sequence header
     followed by the quality values of all bases (without
     trimming).

     The SCF file contains trimming information in the header,
     and the sequence, quality values, and trace locations of
     the called peaks are unaffected. The left clip is the
     number of bases to trim off the left end of the sequence
     and the right clip is the number of bases to trim off
     the right end.
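
   The trimming fields in the FASTA header can be turned into base
   positions with a little arithmetic.  The sketch below shows one way
   to do this; it is not code from phred and `parse_trim_header' is a
   hypothetical name.  It assumes the header fields are separated by
   white space and keeps any fields after the three integers (the
   input file type, and `trimmed' if present) as an uninterpreted list.

```python
def parse_trim_header(header):
    """Parse a FASTA header of the form `>name called left kept type'."""
    if not header.startswith(">"):
        raise ValueError("not a FASTA header: %r" % header)
    fields = header[1:].split()
    if len(fields) < 4:
        raise ValueError("header has no trimming information")
    name = fields[0]
    called, left, kept = (int(value) for value in fields[1:4])
    return {
        "name": name,
        "called": called,                        # bases called by phred
        "left_trimmed": left,                    # bases trimmed off the start
        "kept": kept,                            # bases remaining after trimming
        "right_trimmed": called - left - kept,   # bases trimmed off the end
        "hq_first": left,                        # first kept base, counting from 0
        "hq_last": left + kept - 1,              # last kept base, counting from 0
        "tags": fields[4:],                      # e.g. the input file type
    }
```

   For the example header above, 15 bases are trimmed off the
   beginning and 548 remain, so 1323 - 15 - 548 = 760 bases are
   trimmed off the end, and the retained bases are numbers 15 through
   562 when the first base is numbered zero.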
     
   When the `-trim_fasta' or `-trim_out' option is used with the `-trim'
   or `-trim_alt' (and -s, -sa, -sd, -q, -qa, or -qd) option, phred
   writes the trimmed sequence to the sequence FASTA file and trimmed
   quality values to the quality value FASTA file; that is, it writes
   only the high quality bases and the corresponding quality values. In
   addition, it appends the string `trimmed' to the FASTA headers and
   the trimming information in the header indicates that no (additional)
   bases are to be trimmed off. The option `-trim_fasta' is invalid with
   the `-qt xbap' and `-qt mix' options.

   When the `-trim_scf' or `-trim_out' option is used with the `-trim'
   or `-trim_alt' (and -c or -cd) option, phred writes the trimmed
   sequence, trimmed quality value, and trimmed called peak locations to
   the SCF output file. In addition, it appends the string `trimmed' to
   the comment field and the left and right clip values are set to zero.

   When the `-trim_phd' or `-trim_out' option is used with the `-trim'
   or `-trim_alt' (and -p or -pd) option, phred writes the trimmed
   sequence, trimmed quality value, and trimmed called peak locations to
   the PHD output file. In addition, when it writes the `TRIM' field
   in the comment block (at the beginning of the file), it sets the
   values for the first and last high quality bases to the numbers of
   the first and last bases of the trimmed sequence (where the first
   base is number zero), and it sets the error probability cutoff value
   to -1.00. Setting the cutoff value to -1.00 indicates that the
   sequence is trimmed, and that the trim points may be unrelated to the
   error probability cutoff value.

   The modified Mott trimming algorithm, which is used to calculate the
   trimming information for the `-trim_alt' option and the phd files,
   uses base error probabilities calculated from the phred quality
   values. For each base it subtracts the base error probability from an
   error probability cutoff value (0.05 by default, and changed using
   the `-trim_cutoff' option) to form the base score, so a base scores
   positive only if its error probability is below the cutoff. Then it
   finds the highest scoring segment of the sequence, where the segment
   score is the sum of the base scores in the segment (the score can
   have non-negative values only). The algorithm requires a minimum
   segment length, which is set to 20 bases.
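
   The description above can be sketched as a single pass over the
   quality values.  This is an illustration of the idea, not phred's
   source code: the function names are hypothetical, the error
   probability is taken as 10^(-q/10) for quality value q, and the
   minimum length is assumed to be applied to the best segment after
   it has been found.

```python
def error_probability(quality):
    """Convert a phred quality value to an error probability."""
    return 10.0 ** (-quality / 10.0)


def mott_trim(qualities, cutoff=0.05, min_length=20):
    """Return (first, last) of the highest scoring segment, or (-1, -1).

    Base numbers start at zero and both ends are inclusive.
    """
    best_score = 0.0
    best_first, best_last = -1, -1
    run_score = 0.0
    run_first = 0
    for i, quality in enumerate(qualities):
        # Base score: cutoff minus the base error probability.
        run_score += cutoff - error_probability(quality)
        if run_score <= 0.0:
            # The running score is never negative; restart after base i.
            run_score = 0.0
            run_first = i + 1
        elif run_score > best_score:
            best_score = run_score
            best_first, best_last = run_first, i
    if best_first < 0 or best_last - best_first + 1 < min_length:
        return (-1, -1)
    return (best_first, best_last)
```

   For a read with 10 bases of quality 5, then 30 bases of quality 40,
   then 10 more bases of quality 5, the function returns (10, 39).  A
   read of only 10 high quality bases returns (-1, -1) because the
   best segment is shorter than the minimum length.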


   ESD Files
   ---------

   Phred reads processed MegaBACE ESD files.  It cannot read the raw
   ESD files.  It is important that you identify the dye chemistry
   correctly when you run the MegaBACE base caller so that phred can
   assign the right base to each trace. (This is important with ABI
   data too.)

   In order to obtain the best phred quality value accuracy with
   MegaBACE data, phred must use the quality value lookup tables
   designed for this data.  Phred identifies the sequencing machine
   by reading the `primer ID' string in the chromatogram and matching
   it with an entry in the phredpar.dat file.  The matching entry
   lists the chemistry, dye, and sequencing machine types. For example,
   a `primer ID' string of the form `ET Primer' identifies a
   chromatogram as ET dye primer data generated on a MegaBACE
   sequencing machine. You can check that phred interprets the
   `primer ID' string correctly by using the `-v 63' option to have
   phred write diagnostic information to the screen.


   LI-COR Data
   -----------

   Band Spread Ratio (BSR)

   Phred reads SCF files created by the LI-COR gel processing software
   and has quality value lookup tables calibrated for traces processed
   with a Band Spread Ratio (BSR) of 2.2. The LI-COR software writes a
   `primer ID' string in the SCF file that indicates the BSR value
   used in the trace processing, which for BSR=2.2 is 
   `DyePrimer{LI-COR_IR_2.2}'. Accordingly, the phredpar.dat file in
   this distribution has an entry with this string, which enables
   phred to recognize LI-COR traces processed with BSR=2.2, and to
   use the quality value lookup table designed for this LI-COR data.
   Phred has a quality value lookup table for data processed with
   BSR=2.2 only, so the quality values for LI-COR traces processed
   with other BSR values will have reduced accuracy.


8. References

   Brent Ewing, LaDeana Hillier, Michael C. Wendl, and Phil Green.
   Base-calling of automated sequencer traces using phred. I. Accuracy
   assessment. 1998. Genome Research 8:175-185.

   Brent Ewing and Phil Green.
   Base-calling of automated sequencer traces using phred. II. Error
   probabilities. 1998. Genome Research 8:186-194.


End: PHRED.DOC