Sequence Trimming With PHRED

 PHRED can automatically remove low-quality base calls from the start and end of DNA sequences, a process called "trimming" or "clipping". CodonCode's programs MacPHRED and InterPhace support trimming with PHRED. This page gives a brief discussion about trimming with PHRED.

 The effect of trimming is easiest to understand by looking at trimmed and untrimmed sequence traces with a program like TraceViewer (please visit http://www.codoncode.com/TraceViewer/ for more information). A typical sequence has "junk" base calls at its beginning, as shown in the picture below:

The picture above shows the start of an untrimmed sequence. The PHRED quality scores are shown as red, blue, or black numbers above the traces. The same trace trimmed with the default setting (0.05) is shown in the picture below:

Trimmed with a more stringent setting of 0.01, additional bases are lopped off the beginning, as shown below:

Compared to the 5% setting, an additional 8 bases with quality scores between 9 and 20, corresponding to error probabilities of about 13% to 1%, have been trimmed.
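
The relationship between quality scores and error probabilities follows the standard PHRED definition, Q = -10 * log10(P). The short Python sketch below is only an illustration of that formula; the function names are made up for this example and are not part of PHRED, MacPHRED, or InterPhace.

```python
import math


def phred_to_error_prob(q):
    """Convert a PHRED quality score to an estimated error probability."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Convert an error probability back to a PHRED quality score."""
    return -10 * math.log10(p)


if __name__ == "__main__":
    # Print the estimated error probability for a few quality scores.
    for q in (9, 10, 13, 20, 30):
        print(f"Q{q}: {phred_to_error_prob(q):.2%}")
```

A quality score of 20 corresponds to the 0.01 cutoff, and a score of about 13 corresponds to the default 0.05 cutoff.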

When to use trimming (and when not to!)

When generating trimmed output files, you will lose bases at the start and end of sequences, so trimming should be used with care. In general, you should NOT use trimming if you are using PHRAP for assembly of large-scale sequencing projects, for example BAC shotgun projects.

However, trimming sequences can make sense if you:

If you plan to generate trimmed sequences, you may want to first experiment with different cutoff scores to see which setting works best for you.

What if you first generated clipped SCF files, and later want to get back the entire sequence? If you have a copy of the original ABI (or SCF) trace around, you can simply use that file. Or you can just run MacPHRED again on the (original or trimmed) chromatogram files, and generate new, untrimmed SCF files.

More about trimming

Curious about how the trimming is done? InterPhace and MacPHRED use the "-trim_alt" option in PHRED. The original PHRED documentation describes the algorithm as follows:

The modified Mott trimming algorithm, which is used to calculate the trimming information for the `-trim_alt' option and the phd files, uses base error probabilities calculated from the phred quality values. For each base it subtracts the base error probability from an error probability cutoff value (0.05 by default, and changed using the `-trim_cutoff' option) to form the base score. Then it finds the highest scoring segment of the sequence where the segment score is the sum of the segment base scores (the score can have non-negative values only). The algorithm requires a minimum segment length, which is set to 20 bases.
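
The description above can be sketched in a few lines of Python. This is a minimal illustration written from the quoted text, not PHRED's actual source code: the function name mott_trim, its parameters, and the way the minimum length is enforced (discarding the result if the best segment is too short) are assumptions made for this example.

```python
def mott_trim(quals, cutoff=0.05, min_len=20):
    """Return (start, end) of the highest-scoring segment as a half-open
    range of base indices, or None if no segment of min_len bases is found.

    quals is a list of PHRED quality scores, one per base call.
    """
    best_score = 0.0
    best = None
    run_score = 0.0
    run_start = 0
    for i, q in enumerate(quals):
        # Base score: cutoff minus the base's estimated error probability.
        run_score += cutoff - 10 ** (-q / 10)
        if run_score < 0:
            # The segment score may not go negative: start a new segment.
            run_score = 0.0
            run_start = i + 1
        elif run_score > best_score:
            best_score = run_score
            best = (run_start, i + 1)
    # Assumption: a best segment shorter than min_len means nothing is kept.
    if best is None or best[1] - best[0] < min_len:
        return None
    return best


if __name__ == "__main__":
    # Hypothetical read: 5 mediocre bases (Q15) followed by 30 good ones (Q40).
    quals = [15] * 5 + [40] * 30
    print(mott_trim(quals, cutoff=0.05))  # (0, 35)
    print(mott_trim(quals, cutoff=0.01))  # (5, 35)
```

With these made-up scores, the default cutoff keeps all 35 bases, while the more stringent 0.01 cutoff drops the five leading Q15 bases, because their error probability of about 3% gives them a negative base score.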

A somewhat simplified way of describing the algorithm is as follows: PHRED finds the longest fragment in a sequence where the estimated error rate is below the cutoff value (which you can set in the "Error rate" text box in the "Base Calling" dialog). The trimmed sequence cannot be extended at either end without adding segments that have an error rate above the cutoff value. However, small segments within the trimmed sequence may have higher error rates; sequencing artifacts like stops and compressions are an example.

The average estimated error rate of the trimmed sequence will generally be lower than the cutoff value. Statistically, the actual error rates of trimmed sequences will also be lower than the cutoff. However, any single sequence (or small number of sequences) may have an actual error rate that is higher, due to statistical sampling errors. (For example, if 10 bases each have a 10% error probability, then on average 9 of the 10 will have an actual error rate of 0%, while one of the 10 will have a 100% actual error rate.)
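
The arithmetic behind the example in parentheses can be checked with a small Python sketch. The function names are made up for this illustration, and the last function assumes that base-call errors are independent of each other, which is a simplification.

```python
def expected_errors(quals):
    """Expected number of wrong base calls: the sum of the error probabilities."""
    return sum(10 ** (-q / 10) for q in quals)


def mean_error_rate(quals):
    """Average estimated error rate over a stretch of bases."""
    return expected_errors(quals) / len(quals)


def prob_no_errors(quals):
    """Chance that every base call is correct, assuming independent errors."""
    result = 1.0
    for q in quals:
        result *= 1 - 10 ** (-q / 10)
    return result


if __name__ == "__main__":
    ten_bases = [10] * 10  # ten bases at Q10, i.e. 10% error probability each
    print(round(expected_errors(ten_bases), 3))  # 1.0
    print(round(mean_error_rate(ten_bases), 3))  # 0.1
    print(round(prob_no_errors(ten_bases), 3))   # 0.349
```

On average one of the ten bases is wrong, but under the independence assumption about a third of such ten-base stretches contain no error at all, and others contain two or more. This is why the actual error rate of a single sequence can differ from the estimate.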

Finally, the trimming requires that the trimmed sequence is at least 20 bases long. Any sequence that is shorter after trimming will be completely trimmed, with no bases left over. The reason for this is that even traces that are clearly artifacts and do not contain any real sequence will often have short stretches (5-10 bases) where the base call quality is high. By enforcing the minimum length requirement, PHRED makes sure that garbage is always identified as such when trimming is used.

 


© Copyright 2000 CodonCode Corporation. All rights reserved.