Fine Tuning Sequence Assembly in CodonCode Aligner
This page explains how to adjust parameters for sequence assembly in CodonCode Aligner. We will start with a brief overview of how CodonCode Aligner assembles sequences, and then look at suggested parameter changes for two specific workflows:
- Including short oligonucleotides, for example primers, in assemblies
- Assembling sequences with large deletions, for example cDNA to genomic DNA
We will also look in detail at all assembly settings, from algorithm choices to match scoring parameters.
How does CodonCode Aligner assemble sequences?
CodonCode Aligner's algorithms for sequence assembly are simple greedy algorithms with the following basic approach:
1. Find potential overlaps between sequence pairs by looking for shared "words", with a settable word length (default: 12 bases).
2. Sort all pairwise matches based on number of shared words and overlap length.
3. Go through this list of pairwise overlaps and align the two sequences; if a sequence is already part of a contig, use the contig. The pairwise alignment is generated using a banded dynamic programming method (Needleman-Wunsch or Smith-Waterman algorithm).
4. Evaluate the pairwise alignment to see if it meets minimum stringency parameters like percent identity, overlap length, and alignment score. Aligned pairs that satisfy the stringency requirements are merged into a contig, for which a consensus sequence is calculated.
5. Repeat steps 3 and 4 until all sequences have been merged, or all pairwise mergers identified in step 1 have been tried, or many successive failures indicate that further mergers are unlikely.
The result of the assembly can be one or more contigs, unless no pairwise alignments meet the match criteria. Sequences that cannot be merged with other sequences remain unassembled.
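The overlap search and sorting at the start of this process can be illustrated with a short Python sketch. This is not CodonCode Aligner's code; the function names `words` and `candidate_pairs` are hypothetical, and the sketch ranks pairs only by the number of shared words, ignoring overlap length and reverse complements.

```python
from itertools import combinations


def words(seq, word_length=12):
    """Return the set of all substrings ("words") of the given length."""
    return {seq[i:i + word_length] for i in range(len(seq) - word_length + 1)}


def candidate_pairs(reads, word_length=12):
    """Rank read pairs by the number of words they share, most first.

    `reads` maps a read name to its sequence. Pairs without any shared
    word are dropped, so they would never be aligned at all.
    """
    word_sets = {name: words(seq, word_length) for name, seq in reads.items()}
    ranked = []
    for a, b in combinations(sorted(reads), 2):
        shared = len(word_sets[a] & word_sets[b])
        if shared > 0:
            ranked.append((shared, a, b))
    ranked.sort(key=lambda item: -item[0])
    return ranked


reads = {
    "r1": "ACGTACGTTAGCCGATAGGCTA",
    "r2": "TAGCCGATAGGCTATTTGACCA",
    "r3": "GGGGGGGGGGGGGGGGGGGG",
}
for shared, a, b in candidate_pairs(reads):
    print(a, b, shared)
```

Here r1 and r2 overlap by 14 bases and therefore share words, while r3 shares no word with either read and would remain unassembled.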
How do I include primers in assemblies?
To include primers or other oligonucleotides in assemblies, it can be necessary to change two parameters:
- Minimum overlap length: the default value of 25 means oligos shorter than 25 bases would be excluded; if you have shorter primers, reduce this to the length of the shortest primer.
- Minimum score: the default minimum alignment score of 20 means that only primers with a perfect match over at least 20 bases will be assembled. For shorter primers, or primers that contain ambiguity bases or mismatches, you need to reduce the minimum score.
A third parameter that occasionally may need to be changed is the Word length. The default value of 12 allows for the assembly of oligonucleotides that have a perfect match of at least 12 bases. For oligos that are shorter, or have multiple ambiguities or mismatches, you may need to reduce the word length accordingly.
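As a rough guide to choosing these three values, the following Python sketch derives them from a list of primers. It is only an illustration: `suggest_primer_settings` is a hypothetical helper, it assumes ungapped primer matches scored with the default match score (+1) and mismatch penalty (-2), and it does not model ambiguity bases.

```python
def suggest_primer_settings(primers, expected_mismatches=0,
                            match_score=1, mismatch_penalty=2,
                            default_word_length=12):
    """Suggest thresholds that would not exclude the shortest primer.

    Assumption: each primer aligns without gaps and with at most
    `expected_mismatches` mismatches over its full length.
    """
    shortest = min(len(p) for p in primers)
    matches = shortest - expected_mismatches
    lowest_score = matches * match_score - expected_mismatches * mismatch_penalty
    # The matching bases are split into at most (mismatches + 1) runs, so
    # the longest perfect run is at least this long (ceiling division).
    guaranteed_run = -(-matches // (expected_mismatches + 1))
    return {
        "min_overlap_length": shortest,
        "min_score": lowest_score,
        "word_length": min(default_word_length, guaranteed_run),
    }


primers = ["ACGTACGTACGTACGTAC", "ACGTACGTACGTACGTACGTAC"]
print(suggest_primer_settings(primers, expected_mismatches=1))
```

The word length returned is a worst-case value; if the mismatch sits near one end of the primer, a longer word length would still find the match.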
How do I assemble cDNA and genomic DNA?
To assemble sequences where some sequences have large insertions or deletions, for example cDNA and genomic DNA, you need to change the Algorithm to Large gap alignments.
The default global alignment algorithm will not align sequences with large gaps because
- it uses a banded alignment for faster assembly speeds, which restricts the number of gaps considered to (roughly) the bandwidth (default: 30); and
- the alignment scores are likely to be too low due to gap penalties.
Tip: when aligning cDNA sequences to genomic DNA, it often makes sense to declare the genomic DNA as the reference sequence, and to use Align to reference instead of Assemble. The parameters for alignments to a reference sequence are very similar to the Assembly parameters, but are set in the Alignment preferences.
All sequence assembly settings explained
You can customize how Aligner assembles sequences in the Preferences dialog. To display the Preferences dialog:
- On macOS, select Preferences in the CodonCode Aligner menu, or press Command-comma
- On Windows, select Edit → Preferences, or press Alt-Enter
This opens the preferences dialog shown in the screen shot below. To change the assembly settings, click on Assembly in the left panel.
Assembly Algorithm
The Algorithm pulldown lets you choose how CodonCode Aligner compares sequences during assembly, with the following options:
- End to end alignment (default): When this algorithm is used, alignments always include the entire sequences. When using this algorithm, it is important that samples have been end clipped (and possibly also vector trimmed).
- Local alignments: When this algorithm is used, Aligner uses local alignments (this method is also used when assembling using Phrap). This means the start and the end of sequences are not necessarily included in the alignment; the alignment stops when the alignment score would no longer improve. This can be due to too many discrepancies in low quality sequence near the ends, or due to unremoved vector sequences. The resulting unaligned ("dangling") ends are shown on gray background in the contig view, base view, and trace view.
- Large gap alignments: This algorithm is typically used when aligning cDNA to genomic DNA. It allows for large gaps in between alignments, without penalizing the large gaps. The large gap algorithm can also be useful when analyzing samples with large insertions or deletions.
Minimum Percent Identity
This is the minimum percentage of identical bases in the aligned region. The default parameter of 70% is relatively relaxed; you may want to use a more stringent setting for your projects, especially if you used end clipping before the alignment.
Be careful about setting this value to 100%: only samples that fully match each other in the overlapping regions will be assembled; samples with even a single discrepancy will not be aligned.
Minimum Overlap Length
This is the minimum length of the aligned region. If the aligned region is shorter than the value you set here (with 25 being the default), alignments will be rejected, and samples will remain in the "Unassembled Samples" folder.
To include primers or oligonucleotides shorter than 25 bases in assemblies, reduce this parameter to the length of the shortest oligo.
Minimum Alignment Score
This parameter is similar to the "Minimum Overlap Length", but it takes discrepancies into account. Scores are scaled so that a match gives a score of 1: each matching base in the aligned region adds 1 to the score. With the default settings, each mismatch subtracts 2, and a single base insertion or deletion subtracts 5 (3 for the gap introduction penalty plus 2 for the gap penalty); each additional gap in the same run subtracts a further 2.
In general, your minimum alignment score should be lower than your minimum overlap length to allow for some level of discrepancies between the sequences.
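To make the scoring arithmetic concrete, here is a small Python sketch that scores an already aligned pair of sequences with the default values described above. The function `alignment_score` is hypothetical and written only for illustration; gaps are represented by `-` characters.

```python
def alignment_score(aligned_a, aligned_b, match=1, mismatch=-2,
                    gap=-2, first_gap=-3):
    """Score two gapped, equal-length alignment rows.

    Every gap position costs `gap`; the first position of each gap run
    additionally costs `first_gap`.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    score = 0
    gap_in = None  # which row the current gap run is in: "a", "b", or None
    for x, y in zip(aligned_a, aligned_b):
        if x == "-" or y == "-":
            row = "a" if x == "-" else "b"
            score += gap
            if row != gap_in:
                score += first_gap  # a new gap run starts here
            gap_in = row
        else:
            score += match if x == y else mismatch
            gap_in = None
    return score


print(alignment_score("ACGTACGT", "ACGTACGT"))    # 8 matches
print(alignment_score("ACGTACGT", "ACGAACGT"))    # 7 matches, 1 mismatch
print(alignment_score("ACGTACGT", "ACG-ACGT"))    # 7 matches, 1 single-base gap
print(alignment_score("ACGTTACGT", "ACG--ACGT"))  # 7 matches, 1 two-base gap
```

The four calls give scores of 8, 5, 2, and 0, which shows how quickly a few discrepancies pull the score below the overlap length.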
Maximum Unaligned End Overlap
Note: this parameter does not apply when the End to end alignment (global alignment) algorithm is selected.
After doing an alignment, Aligner looks at unaligned ("dangling") ends of both reads, and can reject alignments when the dangling ends are too long.
Specifically, Aligner calculates the relative amount of unaligned sequence that could have been aligned by dividing the overlapping bases in the unaligned ends by the length of the shorter sequence. If this relative unaligned length is higher than the percentage set for the maximum unaligned end overlap, the alignment is rejected, and the two sequences or contigs are not merged.
You may need to adjust this value, depending on the kind of project you are doing. If you are aligning cDNA sequences to genomic DNA, use values of or near 100%, since large stretches of exons may be unaligned. But if you expect your samples to match end-to-end, and have pre-processed your sequences with end clipping and vector trimming, you can use lower values to reduce the chance that different copies of repeats will be incorrectly assembled together.
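One possible reading of this calculation is sketched below in Python. The interpretation is an assumption: on each side of the aligned region, the "overlapping bases in the unaligned ends" are taken to be the shorter of the two dangling ends, since only that many bases face each other. The function names and the coordinate convention (0-based start, end-exclusive) are hypothetical.

```python
def unaligned_end_fraction(len_a, start_a, end_a, len_b, start_b, end_b):
    """Fraction of the shorter read that dangles opposite the other read.

    start/end give the aligned region within each read
    (0-based start, end-exclusive).
    """
    left = min(start_a, start_b)                # dangling bases facing each other on the left
    right = min(len_a - end_a, len_b - end_b)   # and on the right
    return (left + right) / min(len_a, len_b)


def passes_unaligned_end_check(fraction, max_percent):
    """Accept the alignment if the unaligned overlap is within the limit."""
    return 100 * fraction <= max_percent


# Read A: 500 bases, aligned from 100 to 400; read B: 400 bases, aligned from 50 to 350.
fraction = unaligned_end_fraction(500, 100, 400, 400, 50, 350)
print(fraction)
print(passes_unaligned_end_check(fraction, max_percent=20))
```

In this example 50 bases on each side could have been aligned but were not, giving 100 / 400 = 25% of the shorter read; the alignment would be rejected at a limit of 20% and accepted at 100%.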
Bandwidth (Maximum Gap Size)
CodonCode Aligner uses a banded implementation of local and global alignment algorithms. Instead of searching a full two-dimensional matrix to generate the alignment, only matches near the diagonal of the matrix are evaluated. This can dramatically increase assembly speeds and reduce memory requirements.
The Bandwidth parameter controls how far to the sides of the diagonal will be searched; somewhat simplified, this is the maximum total number of insertions or deletions one sequence can have relative to the other sequence. If one sample has an insertion or deletion that is larger than the Bandwidth, the alignment will typically stop at the insertion/deletion, and the rest of the sample will be unaligned. If the insertion or deletion is shorter than the bandwidth, the alignment will continue after introducing the necessary number of gaps in one sequence, as long as the aligned parts after the gaps are long enough (with the exact length required depending on the scoring parameters).
The bandwidth parameter has an impact on the assembly speed: larger values mean slower assemblies. For large projects, you may want to reduce the bandwidth value; for projects where you know that you have larger insertions and deletions, you may want to increase it. Note, however, that increasing the bandwidth will typically not be enough to extend alignments through very large gaps, like large introns; such alignments require changing the Algorithm to Large gap alignments.
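The effect of the band can be seen in a minimal banded global alignment, sketched here in Python. This is a simplified illustration and not Aligner's implementation: `banded_global_score` is a hypothetical function, it uses a single linear gap penalty instead of the separate first gap penalty, and it returns only the score.

```python
def banded_global_score(a, b, bandwidth, match=1, mismatch=-2, gap=-2):
    """Global alignment score using only cells with |i - j| <= bandwidth.

    Returns None when the end of the matrix lies outside the band, that is,
    when the sequences differ in length by more than the bandwidth.
    """
    n, m = len(a), len(b)
    if abs(n - m) > bandwidth:
        return None
    # Row 0: only leading gaps in `a`, limited to the band.
    prev = {j: j * gap for j in range(0, min(m, bandwidth) + 1)}
    for i in range(1, n + 1):
        cur = {}
        for j in range(max(0, i - bandwidth), min(m, i + bandwidth) + 1):
            if j == 0:
                cur[j] = i * gap  # only leading gaps in `b`
                continue
            # The diagonal neighbour is always inside the band.
            step = match if a[i - 1] == b[j - 1] else mismatch
            best = prev[j - 1] + step
            if j - 1 in cur:   # gap in `a`
                best = max(best, cur[j - 1] + gap)
            if j in prev:      # gap in `b`
                best = max(best, prev[j] + gap)
            cur[j] = best
        prev = cur
    return prev[m]


full = "ACGTACGTAC"
one_deletion = "ACGTCGTAC"        # one base missing relative to `full`
five_insertion = "ACGTATTTTTCGTAC"  # five extra bases relative to `full`

print(banded_global_score(full, one_deletion, bandwidth=1))
print(banded_global_score(full, one_deletion, bandwidth=0))
print(banded_global_score(full, five_insertion, bandwidth=3))
```

Each row touches at most 2 × bandwidth + 1 cells, which is where the speed and memory savings come from; the price is that an insertion or deletion wider than the band cannot be bridged.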
Word Length
The Word length parameter determines the size of "words" that CodonCode Aligner uses when looking for potential overlaps between sequences. Only sequence pairs that have perfect matches of at least this length will be considered for merging.
If you are trying to assemble sequences with high error or mutation rates, reducing the word length may help to get samples aligned. For large projects or projects with many repeat sequences, larger numbers may give faster assemblies and better results.
Maximum Successive Failures
This parameter is only relevant for larger assemblies with hundreds or thousands of reads. It can be used to limit how long Aligner will try to merge contigs for very large projects, with larger numbers meaning more tries and longer assembly times, but potentially fewer contigs.
Match scoring
The last four alignment parameters determine how matches are scored:
- the Match score is used when two aligned nucleotides are identical
- the Mismatch penalty is used when two aligned base calls are different
- the Gap penalty is used when one of the two sequences has an insertion or deletion relative to the other sequence
- the Additional first gap penalty is added at the first inserted or deleted base. For single base deletions, the penalty score will be the sum of the gap penalty and the additional first gap penalty; for additional deleted bases (multiple gaps in a row), the penalty will be just the gap penalty.
You can change the scoring within limits (scores from 1 to 19 for matches, and penalties of -1 to -19). In general, we suggest that only experts change the match scores and penalties.
Restoring default parameters
To restore the default parameters, click on the "Defaults" button near the bottom. This will reset all parameters to the choices shown in the screen shot above.