Sequences in each project were re-called using the program PHRED (B. Ewing & P. Green (1998), Genome Research 8: 186-194), and compared to fully or partially "finished" sequences using CROSS_MATCH (Phil Green, submitted for publication). CROSS_MATCH performs a local alignment using the same algorithms as PHRAP, and excludes high-error regions at the beginning and end of sequences from the alignment. To establish identical inclusion criteria for all projects, and to eliminate distortions from mis-aligned or mis-assembled reads, sequences that had fewer than 175 bases with a PHRED quality score of at least 30, or an error rate of 5% or higher in the aligned region, were excluded from the analysis.
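The inclusion criteria can be expressed as a simple per-read filter. The following is a minimal sketch, not the code used in this analysis: the function name and its inputs (a list of per-base PHRED qualities, plus an error count and length for the aligned region, assumed to have been extracted from CROSS_MATCH output beforehand) are hypothetical.

```python
from typing import Sequence

# Thresholds taken from the text above.
MIN_HIGH_QUALITY_BASES = 175
HIGH_QUALITY_THRESHOLD = 30
MAX_ERROR_RATE = 0.05  # reads at or above this rate are excluded


def passes_inclusion_criteria(
    qualities: Sequence[int],
    aligned_errors: int,
    aligned_length: int,
) -> bool:
    """Return True if a read would be kept under the stated criteria.

    qualities:      per-base PHRED quality scores for the read
    aligned_errors: number of discrepancies in the aligned region
                    (assumed to be parsed from the alignment elsewhere)
    aligned_length: number of bases in the aligned region
    """
    high_quality = sum(1 for q in qualities if q >= HIGH_QUALITY_THRESHOLD)
    if high_quality < MIN_HIGH_QUALITY_BASES:
        return False
    if aligned_length <= 0:
        # No aligned region, so no error rate can be computed; treat as excluded.
        return False
    return aligned_errors / aligned_length < MAX_ERROR_RATE
```

Under these assumptions, a read with exactly 175 bases at quality 30 or above and an aligned error rate just under 5% is kept, while a read at exactly 5% is dropped.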
Actual error rates were calculated from CROSS_MATCH output. Predicted error rates were obtained by converting the PHRED qualities in the aligned regions to error probabilities, and the two were compared either in 50-base windows or over the entire aligned region.
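The conversion from PHRED quality to error probability follows the standard definition q = -10 log10(p), so p = 10^(-q/10). The sketch below illustrates how a predicted error rate could be computed per window or per region as the mean of these probabilities. The function names are hypothetical, and the use of non-overlapping windows with a shorter final window is an assumption; the text does not specify how the windows were laid out.

```python
from typing import List, Sequence


def quality_to_error_probability(q: float) -> float:
    """Convert a PHRED quality score to an error probability: p = 10^(-q/10)."""
    return 10 ** (-q / 10)


def predicted_error_rate(qualities: Sequence[float]) -> float:
    """Mean predicted error probability over a stretch of bases."""
    if not qualities:
        raise ValueError("qualities must not be empty")
    return sum(quality_to_error_probability(q) for q in qualities) / len(qualities)


def windowed_predicted_error_rates(
    qualities: Sequence[float], window: int = 50
) -> List[float]:
    """Predicted error rate for each consecutive window of bases.

    Assumption: windows do not overlap, and a final partial window is kept.
    """
    return [
        predicted_error_rate(qualities[start:start + window])
        for start in range(0, len(qualities), window)
    ]
```

For example, a quality of 30 corresponds to a predicted error probability of 0.001, or one error per 1000 bases. The per-window values can then be set against the actual error rate observed in the same window of the alignment.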
To optimize the image processing parameters for use with PHRED, this analysis was repeated on versions of the same data set generated with different settings for the image processing steps.
Data sets generated by four-color fluorescent (ABD) sequencers were kindly supplied by leading sequencing laboratories, as described previously (see: P. Richterich (1998), Genome Research 8: 251-259). For comparison to LI-COR generated data, the highest-quality four-color data set ("Project A") was primarily used.
Data sets generated on LI-COR DNA sequencers were kindly supplied by Drs. P. Brottier, H. Crespeau and P. Wincker from the GENOSCOPE National Sequencing Center (B.P. 191, 2 rue Gaston Cremieux, 91006 EVRY Cedex, France) and Dr. B. Fartman from MWG-Biotech GmbH (Anzinger Strasse 7, D-85560 Ebersberg, Germany).