How to Align Multiple Sequences with CodonCode Aligner
This page explains how to perform multiple sequence alignments in CodonCode Aligner using Clustal Omega, MUSCLE, and Aligner’s built-in algorithm. CodonCode Aligner offers several features not found in other programs that can simplify creating and reviewing multi-sequence alignments. These include:
- the ability to automatically flip sequences to the correct orientation before alignment, and
- the option to align contigs directly, keeping the connection to the underlying sequence chromatograms or text sequences.
What is multiple sequence alignment?
Multiple sequence alignment (MSA) is the process of aligning three or more protein or nucleic acid sequences to identify regions of similarity. By analyzing patterns of similarity and variation in the alignments, researchers can uncover functional constraints, conserved structures, and evolutionary relationships.
Multiple sequence alignment is computationally challenging, especially as the number and length of sequences increase. Because exact solutions are computationally impractical, alignment programs often use heuristics to efficiently approximate optimal results, balancing speed, accuracy, and scalability. Different tools use different strategies and may therefore yield different results.
CodonCode Aligner offers several widely used algorithms for multiple sequence alignment, including MUSCLE and Clustal Omega, as well as a fast built-in option for simpler cases. In addition, CodonCode Aligner supports multiple options for the related task of aligning multiple sequences to a reference sequence.
How do I align sequences with Clustal Omega?
Clustal Omega is fast and scalable, which makes it suitable for aligning large numbers of sequences. To align sequences with Clustal Omega in CodonCode Aligner, select the sequences or contigs to align, and then choose Contig → Align with Clustal.
Example data download: multi-sequence-align.fasta.gz
Note: To use this dataset, create a new project in CodonCode Aligner, and drag and drop the downloaded file onto the project window.
Clustal expects all sequences to be in the same orientation. To avoid problems from sequences that are in the wrong orientation, CodonCode Aligner examines the orientation of all sequences before starting the Clustal alignment, and reverse-complements sequences or contigs as needed.
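The orientation check described above can be illustrated with a small Python sketch. This is not CodonCode Aligner's actual implementation; it is a simplified illustration that assumes the first sequence is in the desired orientation and compares shared k-mers in both orientations. The function names and the default k-mer size are hypothetical.

```python
from typing import List, Set

# Translation table for complementing DNA bases (other characters are left as-is).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def kmers(seq: str, k: int) -> Set[str]:
    """Return the set of all k-mers (upper case) found in seq."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def orient_like_first(seqs: List[str], k: int = 8) -> List[str]:
    """Flip each sequence if its reverse complement shares more k-mers
    with the first sequence than its forward strand does."""
    if not seqs:
        return []
    reference = kmers(seqs[0], k)
    oriented = [seqs[0]]
    for seq in seqs[1:]:
        flipped = reverse_complement(seq)
        forward_hits = len(kmers(seq, k) & reference)
        reverse_hits = len(kmers(flipped, k) & reference)
        oriented.append(flipped if reverse_hits > forward_hits else seq)
    return oriented
```

A k-mer comparison like this is only a rough heuristic; it can fail for short, highly divergent, or palindrome-rich sequences, which is one reason a dedicated orientation step inside the alignment tool is convenient.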
When the alignment is completed, the resulting contig will be imported. Double-click on the aligned contig to open the contig view window:
The contig view shows an overview of the aligned sequences on top, which can be used for quick navigation. The bottom panel shows the aligned bases and allows manual editing of the alignment.
How do I align sequences with MUSCLE?
MUSCLE is another popular program for multiple sequence alignment. MUSCLE may produce more accurate alignments than Clustal Omega in some cases, for example when sequences have large deletions or insertions.
To align sequences with MUSCLE in CodonCode Aligner, select the sequences or contigs to align, and then choose Contig → Align with Muscle.
For the example data, MUSCLE produces an alignment that is almost identical to the Clustal alignment, except for the order of the sequences:
In the screenshot above, the difference table is shown instead of the overview on top, and in the aligned base view below, bases that match the consensus are masked (shown as dots). These two CodonCode Aligner features can be useful when analyzing discrepancies between aligned sequences.
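To make the idea of masking consensus matches concrete, here is a minimal Python sketch. It assumes the alignment is available as a list of equal-length gapped strings and uses a simple majority-vote consensus. Both the function names and the consensus rule are assumptions for illustration, not a description of how CodonCode Aligner computes its consensus.

```python
from collections import Counter
from typing import List


def consensus(rows: List[str]) -> str:
    """Majority-vote consensus of equal-length aligned rows.
    Ties are resolved by whichever symbol appears first in the column."""
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*rows)
    )


def mask_matches(rows: List[str], mask_char: str = ".") -> List[str]:
    """Replace every base that matches the consensus with mask_char,
    so that only discrepancies remain visible."""
    cons = consensus(rows)
    return [
        "".join(mask_char if base == c else base for base, c in zip(row, cons))
        for row in rows
    ]
```

For the rows `ACGT-A`, `ACGTTA`, and `ACCTTA`, only the gap in the first row and the `C` in the third row remain unmasked.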
How do I align sequences with Aligner’s built-in algorithm?
Aligner includes a built-in algorithm for quickly aligning multiple sequences. While it does not perform true iterative refinement, it can be useful in certain cases:
- for quick comparisons and checks
- when some sequences contain large deletions (for example when aligning cDNA to genomic DNA)
- when sequences contain low-quality sequence at the start and/or end, but removal by manual or automatic trimming is not feasible or too tedious (in this case, using the local alignment algorithm can be helpful)
To align sequences with CodonCode Aligner's built-in algorithm, select the sequences or contigs to align, and then choose Contig → Assemble.
Because assembly uses a greedy algorithm that merges sequences successively, without any iterative refinement, the sequences in the resulting contig are not ordered as they are when using Clustal or MUSCLE. In this simple example, the alignment is otherwise identical:
In the screenshot above, a different method ("box") to highlight bases that do not match the consensus sequence was chosen in the "Highlighting" preferences.
To order the sequences by similarity, you can use CodonCode Aligner's Build Tree ... function in the Contig menu:
After generating a tree, an additional contig editing option becomes available: you can right-click on tree branches and choose Remove Selected Branch to remove single sequences, or split a contig into two parts by similarity.
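Ordering sequences by similarity relies on some measure of pairwise distance between aligned rows. The sketch below shows one common and simple measure, the p-distance (the fraction of differing positions among columns where neither sequence has a gap), and uses it to sort sequences relative to a chosen one. It is an illustration under stated assumptions only: the function names are hypothetical, and the distance measure and tree-building method that CodonCode Aligner uses are not specified here.

```python
from typing import Dict, List


def p_distance(a: str, b: str) -> float:
    """Fraction of mismatching positions between two aligned rows,
    ignoring columns where either row has a gap ('-')."""
    if len(a) != len(b):
        raise ValueError("aligned rows must have the same length")
    compared = 0
    mismatches = 0
    for x, y in zip(a.upper(), b.upper()):
        if x == "-" or y == "-":
            continue
        compared += 1
        if x != y:
            mismatches += 1
    return mismatches / compared if compared else 0.0


def order_by_similarity(alignment: Dict[str, str], anchor: str) -> List[str]:
    """Return sequence names sorted by increasing p-distance to the anchor.
    The sort is stable, so equally distant names keep their input order."""
    reference = alignment[anchor]
    return sorted(alignment, key=lambda name: p_distance(reference, alignment[name]))
```

A real tree-building step would use the full matrix of pairwise distances rather than distances to a single anchor; the single-anchor sort is only meant to show how aligned rows translate into a similarity ordering.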
How do I align contigs directly?
Generating multiple sequence alignments is often a multi-step process:
- First, several sequences for a region of interest, for example a gene, are assembled into separate contigs for each clone, isolate, or patient, and "consensus" sequences for the assembled contigs are generated.
- Next, the consensus sequences for the contigs are aligned to each other.
- Differences in the multiple sequence alignment are checked; in Sanger sequencing projects, this includes going back to the original sequence chromatograms to differentiate between sequencing errors and real discrepancies.
CodonCode Aligner can greatly simplify this process by supporting the direct alignment of contigs, which keeps the connection to the underlying sequence traces. In contrast, other sequence analysis programs typically align only copies of the contig's consensus sequences. This leads to data duplication, and makes the checking of discrepancies tedious and error-prone.
Example data download: multi-contig-align.zip
Note: To use this dataset, unpack the downloaded ZIP file, and double-click on the file multi-contig-align.ccap to open the project in CodonCode Aligner.
This example data set contains 23 contigs, each consisting of a forward and reverse Sanger sequence:
To align these 23 contigs directly to each other with Clustal Omega, select the 23 contigs, and then choose Contig → Align with Clustal. CodonCode Aligner will use the consensus sequences of the contigs to create an alignment with Clustal, and then import the alignment results, creating a "contig of contigs" that maintains the links to the underlying sequence traces.
Double-click on CtgComparison1 to open the contig view for the aligned contigs. You may notice that this alignment is reverse-complemented relative to the alignments shown before; select Edit → Reverse Complement to flip the alignment, then scroll to the start of the contig.
At base 53 in the contig alignment, 22 of the 23 sequences have a C, and only one contig (MEM) has a T. To check whether this is due to a real discrepancy or the result of a sequencing error, double-click on this base in the contigs MEM and ZT to open the Sanger chromatograms at this place in the contigs:
The traces of the two contigs show that this is indeed a real mutation. If the discrepancy had been due to a sequencing error instead, you could edit the sequence traces directly, or you could edit the base in the aligned contigs, which would change the bases in all underlying chromatograms. Go ahead and try it: you can use Edit → Undo (or the keyboard shortcut Control-Z / Command-Z) to undo the edit.
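The column check described above (most contigs agree, one differs) can also be expressed programmatically, for example when working with an exported alignment. The following Python sketch assumes the alignment is available as a mapping from contig name to gapped consensus string; the function name and the data layout are hypothetical, and the sketch does not use or replace any CodonCode Aligner feature.

```python
from collections import Counter
from typing import Dict, List, Tuple


def find_discrepancies(
    alignment: Dict[str, str],
) -> List[Tuple[int, str, Dict[str, str]]]:
    """List alignment columns where at least one row differs from the majority.

    Returns tuples of (1-based position, majority base, {name: differing base}).
    Ties for the majority base are resolved by first appearance in the column.
    """
    names = list(alignment)
    rows = [alignment[name].upper() for name in names]
    if len({len(row) for row in rows}) > 1:
        raise ValueError("all aligned rows must have the same length")

    report = []
    for position, column in enumerate(zip(*rows), start=1):
        majority = Counter(column).most_common(1)[0][0]
        outliers = {
            name: base for name, base in zip(names, column) if base != majority
        }
        if outliers:
            report.append((position, majority, outliers))
    return report
```

A list like this only points to candidate positions. Deciding whether a discrepancy is a real mutation or a sequencing error still requires looking at the chromatograms, as described above.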
Can I use reference sequences in multi-sequence alignments?
While biologists often use the term "multiple sequence alignment" for the alignment of sequences to each other as described above, there are several other biomedical workflows that include the alignment of multiple sequences to a known reference sequence. Two common examples are:
- Mutation detection projects, where sequences from patients or isolates are compared to a "wild type" sequence
- Reference-guided assemblies and alignments of NGS data.
CodonCode Aligner offers tools for such workflows through the Contig → Align to Reference function, and the Tools → Align with Bowtie2 function, respectively. For additional information about how to use CodonCode Aligner for mutation detection, check the mutation detection overview page and the How to Detect Point Mutations guide.
Even in "standard" multiple sequence alignments, it can sometimes make sense to use a reference sequence, for example when samples for one isolate do not overlap, or to ensure that contigs are oriented correctly. For additional information, please check the Using Reference Sequences tutorial.