CodonCode Corporation
Better Software for DNA Sequencing

Fine Tuning Sequence Assembly in CodonCode Aligner

This page explains how to adjust parameters for sequence assembly in CodonCode Aligner. We will start with a brief overview of how CodonCode Aligner assembles sequences, and then look at suggested parameter changes for two specific workflows: including primers in assemblies, and assembling cDNA with genomic DNA.

We will also look in detail at all assembly settings, from algorithm choices to match scoring parameters.

How does CodonCode Aligner assemble sequences?

CodonCode Aligner's algorithms for sequence assembly are simple greedy algorithms with the following basic approach:

  1. Find potential overlaps between sequence pairs by looking for shared "words", with a settable word length (default: 12 bases).
  2. Sort all pairwise matches based on number of shared words and overlap length.
  3. Go through this list of pairwise overlaps and align the two sequences; if a sequence is already part of a contig, use the contig. The pairwise alignment is generated using a banded dynamic programming method (Needleman-Wunsch or Smith-Waterman algorithm).
  4. Evaluate the pairwise alignment to see if it meets minimum stringency parameters like percent identity, overlap length, and alignment score. Aligned pairs that satisfy the stringency requirements are merged into a contig, for which a consensus sequence is calculated.
  5. Repeat steps 3 and 4 until all sequences have been merged, or all pairwise mergers identified in step 1 have been tried, or many successive failures indicate that further mergers are unlikely.

The result of the assembly can be one or more contigs, unless no pairwise alignments that meet the match criteria are found. Sequences that cannot be merged with other sequences remain unassembled.
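Steps 1 and 2 above can be illustrated with a small Python sketch. It is not CodonCode Aligner's actual code: the function names are hypothetical, it sorts by shared word count only (not overlap length), and it ignores reverse-complement matches.

```python
from collections import defaultdict
from itertools import combinations


def shared_word_counts(reads, word_length=12):
    """Count distinct words of `word_length` bases shared by each pair of reads.

    `reads` maps a read name to its sequence (hypothetical input format).
    """
    # Map each word to the set of read names that contain it.
    index = defaultdict(set)
    for name, seq in reads.items():
        for i in range(len(seq) - word_length + 1):
            index[seq[i:i + word_length]].add(name)

    # Every word seen in two or more reads adds one shared word per pair.
    counts = defaultdict(int)
    for names in index.values():
        for pair in combinations(sorted(names), 2):
            counts[pair] += 1
    return dict(counts)


def candidate_pairs(reads, word_length=12):
    """Return read pairs sorted so the most promising overlaps come first."""
    counts = shared_word_counts(reads, word_length)
    return sorted(counts, key=lambda pair: counts[pair], reverse=True)
```

Pairs that share no word of the chosen length never appear in the candidate list, which is why reads without a sufficiently long perfect match are never tried in step 3.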

How do I include primers in assemblies?

To include primers or other oligonucleotides in assemblies, it can be necessary to change two parameters:

A third parameter that occasionally may need to be changed is the Word length. The default value of 12 allows for the assembly of oligonucleotides that have a perfect match of at least 12 bases. For oligos that are shorter, or have multiple ambiguities or mismatches, you may need to reduce the word length accordingly.

How do I assemble cDNA and genomic DNA?

To assemble sequences where some sequences have large insertions or deletions, for example cDNA and genomic DNA, you need to change the Algorithm to Large gap alignments.

The default global alignment algorithm will not align sequences with large gaps because

  1. it uses a banded alignment for faster assembly speeds, which restricts the number of gaps considered to (roughly) the bandwidth (default: 30); and
  2. the alignment scores are likely to be too low due to gap penalties.

Tip: when aligning cDNA sequences to genomic DNA, it often makes sense to declare the genomic DNA as the reference sequence, and to use Align to reference instead of Assemble. The parameters for alignments to a reference sequence are very similar to the Assembly parameters, but are set in the Alignment preferences.

All sequence assembly settings explained

You can customize how Aligner assembles sequences in the Preferences dialog. To display the Preferences dialog:

This opens the preferences dialog shown in the screen shot below. To change the assembly settings, click on Assembly in the left panel.

Preference dialog showing assembly settings

Assembly Algorithm

The Algorithm pulldown lets you choose how CodonCode Aligner compares sequences during assembly, with the following options:

Minimum Percent Identity

This is the minimum percentage of identical bases in the aligned region. The default parameter of 70% is relatively relaxed; you may want to use a more stringent setting for your projects, especially if you used end clipping before the alignment.

Be careful about setting this value to 100%: only samples that fully match each other in the overlapping regions will be assembled; samples with even a single discrepancy will not be aligned.
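As a rough illustration of why 100% is so strict, the following Python sketch computes percent identity from two aligned rows. The function name is hypothetical, and counting gap columns as non-identical positions is an assumption; the page does not say how Aligner treats gaps in this calculation.

```python
def percent_identity(aligned_a, aligned_b):
    """Percentage of alignment columns in which both rows carry the same base.

    Both rows must have equal length; '-' marks a gap. Gap columns count
    as non-identical here, which is an assumption of this sketch.
    """
    if len(aligned_a) != len(aligned_b) or not aligned_a:
        raise ValueError("aligned rows must be non-empty and of equal length")
    identical = sum(
        1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-"
    )
    return 100.0 * identical / len(aligned_a)
```

With this definition, a single mismatch in a 400-base overlap already drops the identity to 99.75%, so the pair would fail a 100% threshold.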

Minimum Overlap Length

This is the minimum length of the aligned region. If the aligned region is shorter than the value you set here (default: 25), the alignment is rejected, and the samples remain in the "Unassembled Samples" folder.

To include primers or oligonucleotides shorter than 25 bases in assemblies, reduce this parameter to the length of the shortest oligo.

Minimum Alignment Score

This parameter is similar to the "Minimum Overlap Length", but it takes discrepancies into account. Scores are scaled so that a match gives a score of 1: each matching base in the aligned region adds +1 to the score. With the default settings, each mismatch subtracts 2, and a single-base insertion or deletion subtracts 5 (a gap introduction penalty of -3 plus a gap penalty of -2; each additional gap in the same run subtracts another 2).

In general, your minimum alignment score should be lower than your minimum overlap length to allow for some level of discrepancies between the sequences.
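The scoring rules can be written out as a short Python sketch. The function and parameter names are hypothetical; the default values are the ones stated above (match +1, mismatch -2, gap introduction -3, gap -2).

```python
def alignment_score(aligned_a, aligned_b, match=1, mismatch=-2,
                    gap_open=-3, gap_extend=-2):
    """Score two gapped alignment rows of equal length ('-' marks a gap)."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned rows must have the same length")
    score = 0
    previous_gap_row = None  # which row held the gap in the previous column
    for base_a, base_b in zip(aligned_a, aligned_b):
        if base_a == "-" or base_b == "-":
            gap_row = "a" if base_a == "-" else "b"
            if gap_row != previous_gap_row:
                score += gap_open  # charged once per run of gaps
            score += gap_extend    # charged for every gap position
            previous_gap_row = gap_row
        else:
            score += match if base_a == base_b else mismatch
            previous_gap_row = None
    return score
```

For example, a 25-base overlap with one mismatch scores 24 - 2 = 22, so a minimum alignment score of 25 would reject it even though it meets a minimum overlap length of 25.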

Maximum Unaligned End Overlap

Note: this parameter does not apply when the global alignment algorithm is selected.

After doing an alignment, Aligner looks at unaligned ("dangling") ends of both reads, and can reject alignments when the dangling ends are too long.

Specifically, Aligner calculates the relative amount of unaligned sequence that could have been aligned by dividing the overlapping bases in the unaligned ends by the length of the shorter sequence. If this relative unaligned length is higher than the percentage set for the maximum unaligned end overlap, the alignment is rejected, and the two sequences or contigs are not merged.

You may need to adjust this value, depending on the kind of project you are doing. If you align cDNA sequences to genomic DNA, use values of or near 100%, since large stretches of exons may be unaligned. But if you expect your samples to match end-to-end, and pre-processed your sequences with end clipping and vector trimming, you can use lower values to reduce the chance that different copies of repeats will be incorrectly assembled together.
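One way to read the calculation described above is sketched below in Python. The function names and the coordinate convention are hypothetical, and taking the "overlapping bases" on each side as the shorter of the two dangling ends is an interpretation, not a documented formula.

```python
def unaligned_end_overlap(len_a, start_a, end_a, len_b, start_b, end_b):
    """Percent of the shorter sequence left unaligned although it overlaps.

    start/end are 0-based, end-exclusive coordinates of the aligned region
    within each sequence.
    """
    # On each side, only as many bases as the shorter dangling end
    # could have been aligned against the other sequence.
    left = min(start_a, start_b)
    right = min(len_a - end_a, len_b - end_b)
    return 100.0 * (left + right) / min(len_a, len_b)


def passes_end_overlap_check(overlap_percent, max_percent):
    """Alignments are rejected only when the value exceeds the maximum."""
    return overlap_percent <= max_percent
```

Under this reading, a clean end-to-end overlap (the end of one read aligned to the start of the other) has no overlapping dangling ends and yields 0%, while an alignment that covers only the middle of both reads yields a high percentage.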

Bandwidth (Maximum Gap Size)

CodonCode Aligner uses a banded implementation of local and global alignment algorithms for faster assembly speeds and to reduce memory requirements during alignment. Instead of searching a full two-dimensional matrix to generate the alignment, only matches near the diagonal of the matrix are evaluated. This can dramatically increase assembly speeds and reduce memory requirements.

The Bandwidth parameter controls how far to either side of the diagonal the search extends; somewhat simplified, this is the maximum total number of insertions or deletions one sequence can have relative to the other sequence. If one sample has an insertion or deletion that is larger than the Bandwidth, the alignment will typically stop at the insertion/deletion, and the rest of the sample will be unaligned. If the insertion or deletion is shorter than the bandwidth, the alignment will continue after introducing the necessary number of gaps in one sequence, as long as the aligned parts after the gaps are long enough (with the exact length required depending on the scoring parameters).

The bandwidth parameter has an impact on the assembly speed: larger values mean slower assemblies. For large projects, you may want to reduce the bandwidth value; for projects where you know that you have larger insertions and deletions, you may want to increase it. Note, however, that increasing the bandwidth will typically not be enough to extend alignments through very large gaps, like large introns; such alignments require changing the algorithm to Large gap alignments.
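The idea of a banded alignment can be illustrated with a minimal Python sketch of a banded global (Needleman-Wunsch style) score calculation. This is not CodonCode Aligner's implementation: the function name is hypothetical, it uses a single linear gap penalty instead of separate gap introduction and gap penalties, and it returns only the score, not the alignment.

```python
def banded_global_score(a, b, bandwidth=30, match=1, mismatch=-2, gap=-2):
    """Global alignment score using only cells within `bandwidth` of the diagonal.

    Returns None when the length difference alone exceeds the band.
    """
    n, m = len(a), len(b)
    if abs(n - m) > bandwidth:
        return None  # the final cell (n, m) lies outside the band
    neg_inf = float("-inf")
    # prev holds row i-1; cells outside the band keep the value -inf.
    # (A real implementation would store only the band, not full rows.)
    prev = [neg_inf] * (m + 1)
    for j in range(min(m, bandwidth) + 1):
        prev[j] = j * gap
    for i in range(1, n + 1):
        curr = [neg_inf] * (m + 1)
        lo = max(0, i - bandwidth)
        hi = min(m, i + bandwidth)
        for j in range(lo, hi + 1):
            if j == 0:
                curr[j] = i * gap
                continue
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # gap in b
            left = curr[j - 1] + gap  # gap in a
            curr[j] = max(diag, up, left)
        prev = curr
    return prev[m]
```

Each row evaluates at most 2 x bandwidth + 1 cells, so the work grows with the sequence length times the bandwidth rather than with the product of both sequence lengths. In this global sketch, an insertion or deletion longer than the band makes the alignment impossible.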

Word Length

The Word length parameter determines the size of "words" that CodonCode Aligner uses when looking for potential overlaps between sequences. Only sequence pairs that have perfect matches of at least this length will be considered for merging.

If you are trying to assemble sequences with high error or mutation rates, reducing the word length may help to get samples aligned. For large projects or projects with many repeat sequences, larger numbers may give faster assemblies and better results.
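To judge whether a given word length can find a particular overlap at all, it may help to look at the longest perfect match between two sequences. The Python sketch below uses a standard longest-common-substring calculation; the function names are hypothetical and reverse-complement matches are ignored.

```python
def longest_perfect_match(seq_a, seq_b):
    """Length of the longest run of bases that both sequences share exactly."""
    best = 0
    # prev[j] = length of the common run ending at seq_a[i-2] and seq_b[j-1]
    prev = [0] * (len(seq_b) + 1)
    for i in range(1, len(seq_a) + 1):
        curr = [0] * (len(seq_b) + 1)
        for j in range(1, len(seq_b) + 1):
            if seq_a[i - 1] == seq_b[j - 1]:
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best


def can_be_considered(seq_a, seq_b, word_length=12):
    """True if the pair shares at least one perfect word of `word_length`."""
    return longest_perfect_match(seq_a, seq_b) >= word_length
```

For instance, a 20-base primer with a single mismatch near its middle has a longest perfect match of roughly 10 bases, so it would be skipped at the default word length of 12 but considered at a word length of 10.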

Maximum Successive Failures

This parameter is only relevant for larger assemblies with hundreds or thousands of reads. It can be used to limit how long Aligner will try to merge contigs for very large projects, with larger numbers meaning more tries and longer assembly times, but potentially fewer contigs.

Match scoring

The last four alignment parameters determine how matches are scored:

You can change the scoring within limits (scores from 1 to 19 for matches, and penalties of -1 to -19). In general, we suggest that only experts change the match scores and penalties.

Restoring default parameters

To restore the default parameters, click on the "Defaults" button near the bottom. This will reset all parameters to the choices shown in the screen shot above.