How to Assemble Sequences in Groups with CodonCode Aligner
This tutorial shows how to assemble sequences in groups, based on their names, in CodonCode Aligner.
How does assemble in groups (by name) work?
To assemble sequences in groups based on their names, CodonCode Aligner splits the name of each selected sample into name parts and groups the sequences by one name part that you select. The sequences in each group are then assembled, which can result in one or more contigs per group; contigs are named after the name part used to group the samples.
This means that the samples must be named consistently, in a way that allows CodonCode Aligner to group the samples. For illustration, look at the sample names in this example project:
In this project, the samples have the following name parts:
- Gene name (CFTR), followed by an underscore
- Exon (ex19), followed by an underscore
- Patient identifier (AB, AMS, etc.), followed by a period
- Direction (F or R)
If necessary, you can rename samples directly in the project view, or in the Sample Information dialog, which is available through the Sample menu or from the popup menu shown when you right-click a sample.
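The grouping idea described above can be sketched in a few lines of Python. This is only an illustration of splitting names at delimiters and grouping by one name part, not CodonCode Aligner's actual implementation; the function names `split_name` and `group_by_part` are hypothetical, and the sample list is an illustrative subset that follows the naming scheme of the example project.

```python
from collections import defaultdict


def split_name(name):
    """Split a name such as 'CFTR_ex19_DW.F' into its four name parts.

    Assumes the scheme gene_exon_patient.direction described above.
    """
    gene, exon, rest = name.split("_", 2)        # first two parts end in "_"
    patient, direction = rest.rsplit(".", 1)     # third part ends in "."
    return {"gene": gene, "exon": exon,
            "patient": patient, "direction": direction}


def group_by_part(names, part):
    """Group sample names by the chosen name part (for example 'patient')."""
    groups = defaultdict(list)
    for name in names:
        groups[split_name(name)[part]].append(name)
    return dict(groups)


# Illustrative sample names that follow the example project's pattern.
samples = ["CFTR_ex19_AB.F", "CFTR_ex19_AB.R",
           "CFTR_ex19_DW.F", "CFTR_ex19_DW.R"]
groups = group_by_part(samples, "patient")
for patient, members in groups.items():
    print(patient, members)
```

Each resulting group corresponds to one set of sequences that would be assembled together, with the grouping key (here the patient identifier) serving as the contig name.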
Example data download: assemble-by-name.zip
Note: To use this dataset, unpack the downloaded ZIP file, and open the "assemble-by-name.ccap" project.
How do I assemble sequences in groups (by name)?
To assemble sequences in groups, import the sequences into a project in CodonCode Aligner, and select the sequences you want to assemble (you can also select the "Unassembled Samples" folder to select all samples in it).
Then, choose Contig → Advanced Assembly → Assemble in Groups...
This will show the "Assemble in Groups" dialog shown below. If you have never used "Assemble in Groups" before, you will be prompted to define the name parts first:
Defining how sample names are interpreted
Clicking on Define names... will open a new dialog where you can specify how sample names should be broken into parts, and what the meanings of these parts are. CodonCode Aligner will analyze your sample names to make an initial suggestion on how the names can be parsed:
In this example project, CodonCode Aligner recognized that the first two name parts end in underscores, and the third name part ends in a period.
Defining delimiters
Some of the most commonly used delimiters are pre-defined, but you can also add new delimiters by clicking on the Define delimiters... button:
One useful option for defining name parts is to use a pre-defined length. Advanced users can also define regular expression patterns to be used for parsing; this can be useful if your samples have a complicated but consistent naming scheme (tip: if you are not a regular-expression expert, an AI assistant can help you work out a suitable pattern).
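To give a feel for what a regular expression for this naming scheme might look like, here is a minimal sketch using Python's `re` module. The pattern and the names `NAME_PATTERN` and `parse_name` are assumptions made for illustration; the regular-expression syntax accepted in CodonCode Aligner's dialog may differ from Python's, so treat this as a way to test a pattern idea rather than something to paste in unchanged.

```python
import re

# Hypothetical pattern for names like "CFTR_ex19_DW.F":
# gene and exon end in "_", the patient ID ends in ".", direction is F or R.
NAME_PATTERN = re.compile(
    r"^(?P<gene>[^_]+)_(?P<exon>[^_]+)_(?P<patient>[^.]+)\.(?P<direction>[FR])$"
)


def parse_name(name):
    """Return the name parts as a dict, or None if the name does not match."""
    match = NAME_PATTERN.match(name)
    return match.groupdict() if match else None


print(parse_name("CFTR_ex19_DW.F"))
print(parse_name("inconsistent-name"))  # does not follow the scheme
```

Names that do not match come back as `None`, which is a quick way to spot samples that need renaming before they can be grouped.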
In our example, the automatically detected delimiters were correct, so we can just edit the meanings that the name parts have. Here's what the final result should look like:
Previewing name parts
You can check how names in the project are parsed by pressing the Preview... button, which opens up a new window:
Close the "Name Parts Preview" and "Sample Name Options" dialogs, and then click the Assemble button in the "Assemble in Groups" dialog to start the assembly.
Assembly results
When the assembly is finished, the assembled contigs for the 6 patients will be shown in the project view:
To have a quick look at the results, double-click on the "DW" folder to bring up the contig view. In the contig view, navigate to base 56, and double-click on the red consensus base to bring up the traces at this position:
You can see that the discrepancy here is caused by a background peak in the CFTR_ex19_DW.F sample. This part of the sequence has low quality, as indicated by the green background. By default, CodonCode Aligner uses the higher-quality base from the R sample as the consensus base at this position, so there is no need to edit the F sequence. Of course, you could still edit it by selecting the base and pressing the G key.
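The idea of letting the higher-quality base win can be illustrated with a small Python sketch. This is a deliberately simplified picture, not CodonCode Aligner's consensus algorithm; the function name `consensus_base`, the base calls, and the quality scores below are all hypothetical values chosen for illustration.

```python
def consensus_base(calls):
    """Pick the base with the highest quality score.

    `calls` is a list of (base, quality) tuples, one per read covering
    the position. If several reads share the top score, the first wins.
    """
    base, _quality = max(calls, key=lambda call: call[1])
    return base


# Hypothetical calls at one position: a low-quality forward read that
# disagrees with a high-quality reverse read.
forward_call = ("A", 12)   # assumed low-quality call from the F sample
reverse_call = ("G", 48)   # assumed high-quality call from the R sample

print(consensus_base([forward_call, reverse_call]))
```

With these assumed values the reverse read determines the consensus, which mirrors why the forward sequence does not need to be edited in the example above.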
A logical next step would be to align the contigs to each other to see differences between the patient sequences, using Contig → Align with Clustal or Contig → Advanced Assembly → Compare Contigs. This would generate a multi-sequence alignment of the contigs (a "contig of contigs"), which preserves the connection to the underlying traces, thereby allowing quick verification of any discrepancies.