CodonCode Corporation
Better Software for DNA Sequencing

How to Assemble Sequences with CodonCode Aligner

This guide shows how to assemble sequences in CodonCode Aligner. It covers basic sequence assembly and introduces several advanced assembly options.

What is sequence assembly?

Sequence assembly combines DNA sequence fragments into a larger sequence by identifying overlaps between sample sequences. Samples that can be joined together are put into "contigs". Joins may fail because samples do not share overlaps that are long enough with other samples or contigs, or because the overlap contains too many discrepancies. Any sequences that cannot be put into contigs remain unassembled.
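To make the idea of joining samples by their overlaps concrete, here is a minimal sketch in Python. It is not CodonCode Aligner's algorithm: it only finds exact overlaps between the end of one sequence and the start of another, while real assemblers tolerate discrepancies and gaps. The function names and the min_len parameter are assumptions made for this illustration.

```python
from typing import Optional


def longest_overlap(left: str, right: str, min_len: int = 3) -> int:
    """Length of the longest exact overlap between the end of `left`
    and the start of `right`, or 0 if it is shorter than `min_len`."""
    max_len = min(len(left), len(right))
    # Try the longest possible overlap first, then shrink.
    for n in range(max_len, min_len - 1, -1):
        if left.endswith(right[:n]):
            return n
    return 0


def join(left: str, right: str, min_len: int = 3) -> Optional[str]:
    """Merge two sequences into one if they overlap; otherwise return None
    (the sequences would stay unassembled)."""
    n = longest_overlap(left, right, min_len)
    if n == 0:
        return None
    return left + right[n:]


if __name__ == "__main__":
    print(join("ACGTTGCA", "TGCAGGT"))  # the two reads share "TGCA"
    print(join("AAAA", "CCCC"))         # no overlap, so no join
```

Raising min_len in this sketch makes joins fail more often, which mirrors the case described above where overlaps are not long enough.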

This tutorial explains the various tools in CodonCode Aligner to assemble Sanger sequences and similar text sequences. The assembly of next generation sequencing (NGS) data will be covered separately.

How do I assemble sequences in CodonCode Aligner?

Before assembling, open an existing project in CodonCode Aligner, or create a new project, and add the sequences you want to assemble (for example by dragging the Sanger sequence files onto the project window).

To assemble sequences:

The result of the assembly can be one or more contigs, plus samples that remain in the "Unassembled Samples" folder. If none of the samples can be joined, no contigs will be formed.

You can assemble any mix of unassembled samples and previously assembled contigs. Contigs are assembled as they are; they are not dissolved first. However, the consensus sequence of contigs may change where new samples are added.

Example data

We will use a project that is included with CodonCode Aligner for this tutorial: the Example1 project in the Example Files folder in the folder where CodonCode Aligner is installed (usually C:\Program Files\CodonCode Aligner\ on Windows, or /Applications/CodonCode Aligner/ on macOS).

For this data set, CodonCode Aligner will notice that the samples contain low-quality sequence at the end of the reads, and therefore show the following warning dialog:

Warning dialog suggesting to clip sequences before assembly

We will learn how to automatically clip sequences before assembly in the next section; for now, click on the Assemble button.
CodonCode Aligner will briefly display a progress dialog, and then show the contigs in the project view:

Project view showing that assembly resulted in 2 contigs

The information area at the bottom shows that the assembly took 0.44 seconds and resulted in two contigs. Clicking in this area opens the status history dialog which contains additional information:

Status history dialog showing reasons why contigs were not merged

The status history shows why the two resulting contigs were not merged into one contig: the percent identity in the overlap region (58.9%) was lower than the required minimum identity of 70%. This was caused by the unclipped low-quality sequence at the end of the samples.
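The identity check described here can be illustrated with a short Python sketch. It assumes the two overlapping regions have already been aligned to equal length, and it counts a position as identical only when both characters match. How CodonCode Aligner treats gaps and ambiguous bases when computing percent identity is not specified in this tutorial, so this is only an approximation. The function names are hypothetical; the 70% default mirrors the minimum identity mentioned above.

```python
def percent_identity(a: str, b: str) -> float:
    """Percentage of positions that match between two aligned,
    equal-length overlap strings."""
    if len(a) != len(b) or not a:
        raise ValueError("overlap strings must be non-empty and equally long")
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return 100.0 * matches / len(a)


def can_merge(a: str, b: str, min_identity: float = 70.0) -> bool:
    """True if the overlap is similar enough to allow a join."""
    return percent_identity(a, b) >= min_identity


if __name__ == "__main__":
    clean = ("ACGTACGTAC", "ACGTACGTAC")
    noisy = ("ACGTACGTAC", "ACGTTTTTTT")  # low-quality tail full of miscalls
    print(percent_identity(*clean), can_merge(*clean))
    print(percent_identity(*noisy), can_merge(*noisy))
```

Miscalled bases in an unclipped, low-quality tail lower the identity of the overlap, which is how a join like the one in this example can fall below the threshold.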

In the next section, we will use Assemble with Preprocessing to fix this problem by automatically clipping low-quality sequence at the ends before assembly. First, though, let us dissolve the two contigs by selecting them in the project view, and then choosing Contig → Unassemble (or the Unassemble button in the toolbar).

How do I trim sequences automatically?

For Sanger sequences that contain low-quality sequence at the ends, end clipping can give better assemblies with fewer contigs and discrepant bases. CodonCode Aligner can automatically clip sequences with base-specific quality scores, either as a separate step, or when assembling.
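As an illustration of what quality-based end clipping does, the following Python sketch keeps the stretch of a read whose quality scores, after subtracting a cutoff, add up to the largest total (a maximum-sum segment). This is a common general approach to quality trimming and is not necessarily the method CodonCode Aligner uses. The function name and the cutoff of 20 are assumptions for this example.

```python
from typing import List, Tuple


def clip_ends(quals: List[int], cutoff: int = 20) -> Tuple[int, int]:
    """Return (start, end) of the region to keep, as a Python slice.

    Each base contributes (quality - cutoff); the kept region is the
    contiguous segment with the largest positive total. Returns (0, 0)
    if no base is above the cutoff.
    """
    best_sum = 0
    best = (0, 0)
    run_sum = 0
    run_start = 0
    for i, q in enumerate(quals):
        run_sum += q - cutoff
        if run_sum <= 0:
            # The running segment is not worth keeping; restart after this base.
            run_sum = 0
            run_start = i + 1
        elif run_sum > best_sum:
            best_sum = run_sum
            best = (run_start, i + 1)
    return best


if __name__ == "__main__":
    bases = "NNACGTNN"
    quals = [5, 8, 30, 40, 35, 38, 10, 6]
    start, end = clip_ends(quals)
    print(bases[start:end])  # the low-quality bases at both ends are removed
```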

To automatically clip ends and then assemble sequences:

This will show a dialog where you can select how CodonCode Aligner should preprocess samples before assembly:

Dialog to choose preprocessing options

In the "Preprocess" tab, select the Preprocess unassembled samples and the Clip ends checkboxes, then press Assemble.

CodonCode Aligner will clip ends of all unassembled samples in your selection, and then assemble the clipped sequences. With the low-quality sequences at the ends removed, this assembly results in a single contig:

Project view showing that assembly after end clipping resulted in one contig

Double-click on the contig to open the contig view:

Contig view showing overview and aligned bases

The upper part of the contig view shows an overview of the samples in the contig, and is useful for navigating. The lower half shows the aligned bases.

Navigate to base 590, where one sequence has a discrepancy, and double-click on the consensus base to open the trace view for the sequences at this position:

Trace view showing aligned Sanger chromatograms

You can edit sequences in the base view or the contig view, but note that editing is not required here, since CodonCode Aligner takes the highest quality base as the "Consensus" base (after also considering confirmation on the other strand).
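The idea of taking the highest quality base as the consensus can be sketched in a few lines of Python. This simplified example looks at a single alignment column and ignores the confirmation on the other strand mentioned above, so it is not a reimplementation of CodonCode Aligner's consensus method. The function name and the (base, quality) input format are assumptions.

```python
from typing import List, Tuple


def consensus_base(calls: List[Tuple[str, int]]) -> str:
    """Pick the base with the highest quality score in one alignment column.

    `calls` holds one (base, quality) pair per sample covering the column.
    """
    if not calls:
        raise ValueError("at least one base call is required")
    best_base, _best_quality = max(calls, key=lambda call: call[1])
    return best_base


if __name__ == "__main__":
    # Two samples call A with high quality, one calls G with low quality.
    column = [("A", 45), ("G", 12), ("A", 38)]
    print(consensus_base(column))
```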

How do I assemble sequences in groups (by name)?

In sequencing projects where the same region is studied in multiple different isolates (clones, species, patients, or similar), CodonCode Aligner can automatically group sequences based on their names, and generate separate contigs for each isolate. This requires that sequences are named consistently in a way that allows for the automatic identification of the isolate (group) a sequence belongs to.

To assemble sequences in groups based on their name, select the sequences you want to assemble, and choose Contig → Advanced Assembly → Assemble in Groups. For detailed instructions, check the How to Assemble in Groups tutorial.
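To show why consistent naming matters, here is a small Python sketch that groups sample names by the part before a delimiter. The naming scheme (isolate name, underscore, primer direction) and the function name are hypothetical, and the rules that CodonCode Aligner offers for defining name parts are described in the separate tutorial, not here.

```python
from collections import defaultdict
from typing import Dict, List


def group_by_name(names: List[str], delimiter: str = "_") -> Dict[str, List[str]]:
    """Group sample names by the text before the first delimiter."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for name in names:
        isolate = name.split(delimiter, 1)[0]
        groups[isolate].append(name)
    return dict(groups)


if __name__ == "__main__":
    samples = ["iso1_F.ab1", "iso1_R.ab1", "iso2_F.ab1", "iso2_R.ab1"]
    for isolate, members in group_by_name(samples).items():
        # Each group would be assembled into its own contig.
        print(isolate, members)
```

With inconsistent names (for example "iso1-F.ab1" mixed with "iso1_R.ab1"), the two reads from the same isolate would land in different groups in this sketch, which is the kind of problem consistent naming avoids.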

How do I compare contigs?

Many sequencing workflows include first generating contigs for multiple clones or isolates, and then comparing the contigs to each other. CodonCode Aligner can make this complicated process easy by supporting the direct alignment of contigs, which keeps the connection to the underlying sequence traces.

To align contigs to each other, select the contigs (and optionally also unassembled sequences), and choose Contig → Advanced Assembly → Compare Contigs. This will show the "Compare Contigs" dialog:

Compare Contigs dialog

You can choose between different algorithms to generate the contig alignment: Clustal Omega, MUSCLE, MACSE, or the built-in assembly algorithm. If your selection contains unassembled sequences, you can also choose preprocessing options for these in the "Preprocess" tab.

The result of comparing contigs will be one or more new "contigs of contigs", named "CtgComparison1" or similar. You can double-click on such a contig to view the alignment of the contigs and samples in it, and double-click on bases in any of the aligned contigs to bring up the sequences or chromatograms of that contig. This allows for rapid verification and editing of differences between the aligned contigs.

How do I reassemble contigs?

Occasionally, you may want to reassemble the samples in one or more contigs, for example to test other assembly options, or after extensive editing. To reassemble contigs, simply select the contig(s), and choose Contig → Advanced Assembly → Assemble from Scratch. This will first unassemble the selected contigs, and then reassemble them with the current assembly settings.

How do I assemble sequences with PHRAP?

CodonCode Aligner makes it easy to assemble sequences with PHRAP, a sequence assembly program that played a major role during the Human Genome Project. PHRAP pioneered the use of many algorithmic ideas that have since become standard, for example the use of base-specific quality scores to generate accurate consensus sequences.

Because many of the ideas initially implemented in PHRAP are now standard in other assembly tools, using PHRAP to generate assemblies is advantageous only in certain special cases, for example in BAC-sized shotgun sequencing projects and for comparison to other assemblers. For general use and casual users, the "unaligned ends" that PHRAP's local alignment algorithm can generate can be confusing.

To assemble sequences with PHRAP, select the samples, and then choose Contig → Advanced Assembly → Assemble with Phrap.
Please note that using PHRAP is free only for academic use; commercial users have to purchase a separate license to use PHRAP.