
Introduction

In many DNA sequencing projects, the read length and accuracy of "raw" DNA sequence data are of major importance. Longer reads and lower error rates can reduce the cost of consumables and labor, facilitate correct assembly of repeats, reduce the number of contigs, and simplify finishing.

We characterized the raw data accuracy of two large-scale sequencing projects that used LI-COR sequencers, and compared the results to six previously studied projects that used four-color (ABD) sequencers. The parameters analyzed were average alignable (usable) read length, total error rates, error rates as a function of position within the reads, and the distribution of "high quality" base counts. Key results are presented in Figures 5a, 5b, 6, and 7, and in Tables 2 and 3.

Table 2. Comparison of ABD and LI-COR data.

                                     ABD data set     LI-COR data set 1    LI-COR data set 2
                                     ("Project A")    (MWG Biotech)        (Genoscope)

Aligned read length                  910              1,050                1,093
Actual error rate                    3.12%            1.98%                1.77%
Bases/set with PHRED quality > 29    491              708                  732

 

Table 3. PHRED vs. LI-COR base caller accuracy.

Data set                    Parameter analyzed     PHRED      LI-COR / CodonCode
                                                              (no ambiguities)

Data set 1 (MWG Biotech)    Aligned read length    1,050      1,093
                            Actual error rate      1.98%      1.58%

Data set 2 (Genoscope)      Aligned read length    1,093      1,173
                            Actual error rate      1.77%      1.54%

 

Figure 6a. Data accuracy comparison.

Figure 6b. Local error rate comparison.

Figure 7. "Very high quality" base count distribution.

 

Benchmarking Results Summary

Both data sets that were generated exclusively on LI-COR sequencers showed significantly lower average error rates and longer alignable read lengths than any of the six ABD-generated data sets. The average aligned read length for the LI-COR projects ranged from 1,050 to 1,173 bases (depending on the project and base calling program; see Tables 2 and 3). The best ABD-generated project had an average aligned read length of 910 bases. The average error rate in this ABD project was 3.12%; the average error rate in the LI-COR projects was 36% to 50% lower, ranging from 1.54% to 1.98% (see Tables 2 and 3).
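The relative error reduction quoted above can be re-derived from the tabulated error rates. The following is a minimal Python sketch, not part of the original analysis; the function and variable names are hypothetical, and the numbers are copied from Tables 2 and 3.

```python
# Error rates (in percent) copied from Tables 2 and 3.
ABD_PROJECT_A_ERROR = 3.12
LICOR_ERRORS = {
    "data set 1, PHRED": 1.98,
    "data set 2, PHRED": 1.77,
    "data set 1, LI-COR / CodonCode": 1.58,
    "data set 2, LI-COR / CodonCode": 1.54,
}


def relative_reduction(baseline, value):
    """Return how much lower `value` is than `baseline`, in percent."""
    return (baseline - value) / baseline * 100.0


# Reduction of each LI-COR error rate relative to the best ABD project.
reductions = {
    name: relative_reduction(ABD_PROJECT_A_ERROR, error)
    for name, error in LICOR_ERRORS.items()
}

for name, percent in reductions.items():
    print(f"{name}: {percent:.1f}% lower than ABD project A")
```

The smallest and largest values come out at roughly 36.5% and 50.6%, in line with the range given in the text.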

As Figure 6 demonstrates, the highly accurate region (>99% accurate, <1% error) was significantly longer for LI-COR generated sequences (about 900 bases) than for ABD generated sequences (approximately 550 bases). A similar trend can be seen in another commonly used measure of sequence quality, the number of bases with PHRED quality values of at least 30: for ABD project A, the "very high quality" base count distribution has a maximum near 550, while the distribution for LI-COR project 2 has its maximum at 850. In other words, LI-COR generated sequences in these projects had about 50% more highly accurate bases (with an estimated error rate of 1:1,000 or less) than the best ABD project in our study. That ABD project, in turn, had 15%-50% more "very high quality" bases, as well as a substantially longer aligned read length, than the other ABD projects studied.
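The link between a PHRED quality value of 30 and an estimated error rate of 1:1,000 follows from the standard definition Q = -10 * log10(P), where P is the estimated error probability. The sketch below illustrates that conversion and re-derives the "about 50% more" figure from the Table 2 counts. The helper names are hypothetical. For integer quality values, "PHRED quality > 29" in Table 2 is the same threshold as "at least 30" in the text.

```python
import math


def phred_to_error_probability(q):
    """Convert a PHRED quality value Q to an estimated error probability."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p):
    """Convert an estimated error probability to a PHRED quality value."""
    return -10 * math.log10(p)


def percent_more(value, baseline):
    """Return how much larger `value` is than `baseline`, in percent."""
    return (value / baseline - 1) * 100.0


# Counts of bases with PHRED quality > 29, copied from Table 2.
HIGH_QUALITY_BASES = {
    "ABD project A": 491,
    "LI-COR data set 1": 708,
    "LI-COR data set 2": 732,
}

print(f"Q30 error probability: {phred_to_error_probability(30):.4f}")
baseline = HIGH_QUALITY_BASES["ABD project A"]
for name in ("LI-COR data set 1", "LI-COR data set 2"):
    gain = percent_more(HIGH_QUALITY_BASES[name], baseline)
    print(f"{name}: {gain:.1f}% more very high quality bases")
```

With the Table 2 counts, the two LI-COR data sets come out at roughly 44% and 49% more "very high quality" bases than ABD project A.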

The higher data quality observed for LI-COR sequencer data is likely due to several factors. First, longer gels lead to better electrophoretic separation, a trend that can also be seen for ABD-generated data in Figure 6b (compare project D, which used exclusively "short" gels, to project A, which used exclusively "long" gels). Second, the four-color detection employed in ABD sequencers requires a mathematical color separation, which could reduce signal-to-noise ratios and thus lower data quality. Third, the performance of the base calling software also affects the results: the new version of Base ImagIR software gives significantly higher accuracy and longer read lengths than previous versions, and also performed slightly better than PHRED on the two projects studied.