Converting Ensembl Compara gene tree DNA alignment to corresponding amino acid alignment

Converting Ensembl Compara gene tree DNA alignment to corresponding amino acid alignment

We are searching data for your request:

Forums and discussions:
Manuals and reference books:
Data from registers:
Wait the end of the search in all databases.
Upon completion, a link will appear to access the found materials.

I have Ensembl compara gene tree alignments (Compara.gene_trees.57.fasta.gz downloaded from in nucleotide format. According to documentation, it says that the file "contains the peptide alignment for every genetree in fasta format".

I was wondering what might be a handy tool to obtain the corresponding amino acid alignments from the file.



As a general rule, both within and without the world of bioinformatics, public FTP sites contain README files tat explain what each file offered by the FTP server contains. The file README.protein_trees states:


contains the peptide alignment for every genetree in emf alignment format


contains the peptide alignment for every genetree in fasta format

This means that both Compara.gene_trees.57.emf.gz and Compara.gene_trees.57.fasta.gz contain the protein alignments. I had a quick look at the files and it looks likeCompara.gene_trees.57.fasta.gzactually contains nucleototide sequences butCompara.gene_trees.57.emf.gzdoes indeed contain a protein alignment:


So, to answer your question, the file you want is the.emf.gzone.

Getting Started in Gene Orthology and Functional Analysis

Copyright: © 2010 Fang et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Funding: We acknowledge support from the NIH and from the AL Williams Professorship funds. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.


A phylogenetic tree is an estimate of the relationships among taxa (or sequences) and their hypothetical common ancestors ( Nei and Kumar 2000 Felsenstein 2004 Hall 2011). Today most phylogenetic trees are built from molecular data: DNA or protein sequences. Originally, the purpose of most molecular phylogenetic trees was to estimate the relationships among the species represented by those sequences, but today the purposes have expanded to include understanding the relationships among the sequences themselves without regard to the host species, inferring the functions of genes that have not been studied experimentally ( Hall et al. 2009), and elucidating mechanisms that lead to microbial outbreaks ( Hall and Barlow 2006) among many others. Building a phylogenetic tree requires four distinct steps: (Step 1) identify and acquire a set of homologous DNA or protein sequences, (Step 2) align those sequences, (Step 3) estimate a tree from the aligned sequences, and (Step 4) present that tree in such a way as to clearly convey the relevant information to others.

Typically you would use your favorite web browser to identify and download the homologous sequences from a national database such as GenBank, then one of several alignment programs to align the sequences, followed by one of many possible phylogenetic programs to estimate the tree, and finally, a program to draw the tree for exploration and publication. Each program would have its own interface and its own required file format, forcing you to interconvert files as you moved information from one program to another. It is no wonder that phylogenetic analysis is sometimes considered intimidating!

MEGA5 ( Tamura et al. 2011) is an integrated program that carries out all four steps in a single environment, with a single user interface eliminating the need for interconverting file formats. At the same time, MEGA5 is sufficiently flexible to permit using other programs for particular steps if that is desired. MEGA5 is, thus, particularly well suited for those who are less familiar with estimating phylogenetic trees.

Step 1: Acquiring the Sequences

Ironically, the first step is the most intellectually demanding, but it often receives the least attention. If not done well, the tree will be invalid or impossible to interpret or both. If done wisely, the remaining steps are easy, essentially mechanical, operations that will result in a robust meaningful tree.

Often, the investigator is interested in a particular gene or protein that has been the subject of investigation and wishes to determine the relationship of that gene or protein to its homologs. The word “homologs” is key here. The most basic assumption of phylogenetic analysis is that all the sequences on a tree are homologous, that is, descended from a common ancestor. Alignment programs will align sequences, homologous or not. All tree-building programs will make a tree from that alignment. However, if the sequences are not actually descended from a common ancestor, the tree will be meaningless and may quite well be misleading. The most reliable way to identify sequences that are homologous to the sequence of interest is to do a Basic Local Alignment Search Tool (BLAST) search ( Altschul et al. 1997) using the sequence of interest as a query.

Step 1.1

When you start MEGA5, it opens the main MEGA5 window. From the Align menu choose Do Blast Search. MEGA5 opens its own browser window to show a nucleotide BLAST page from National Center for Biotechnology Information (NCBI). There is a set of five tabs near the top of that page (blastn, blastp, blastx, tblastn, and tblastx). By default the blastn (Standard Nucleotide BLAST) tab is selected. If your sequence is that of a protein click the blastp tab to show the Standard Protein BLAST page.

Note that NCBI frequently changes the appearance of the BLAST page, so it may differ in some details from that described here.

There is a large text box (Enter accession number … ) where you enter the sequence of interest. You can paste the query sequence directly into that box. However, if your query sequence is already itself in one of the databases, you can paste its accession number or gi number. If your DNA sequence is part of a genome sequence, you can enter the genome's accession number then, in the boxes to the right (Query subrange) enter the range of bases that constitute your sequence. (You really do not want to use a several megabase sequence as your query!)

The middle section of the page allows you to choose the databases that will be searched and to constrain that search if you so desire. The default is Nucleotide collection (nr/nt), but the drop-down text box with triangle allows you to choose among a large number of alternatives, for example, Human Genomic or NCBI genomes.

The optional Organisms text box allows you to limit your search to a particular organism or to exclude a particular organism. For instance, if your sequence is from humans you might want to exclude Humans from the search, so that you do not pick up a lot of human variants when you are really interested in homologs in other species. To include more organisms click the little + sign next to the options box.

The Exclude option allows you to exclude, for instance, environmental samples.

Step 1.2: Which BLAST Algorithm to Use?

The bottom section of the page allows you to choose the particular variant of BLAST that best suits your purposes. For nucleotides, the choices are megablast for highly similar sequences, discontiguous megablast for more dissimilar sequences, or blastn for somewhat similar sequences. The default is blastn, but if you are only interested in identifying closely related homologs tick megablast. This is the first choice that really demands some thought. The sequences that will be on your tree are very much determined by the choice you make at this point.

At the very bottom of the page click the BLAST button to start the search do not tick the “show results in a new window” box. A results window will appear, possibly with a graphic illustrating domains that have been identified, typically with a statement similar to “this page will be automatically updated in 5 seconds.” Eventually, the final results window will appear. The top panel summarizes the properties of the query sequences and a description of the database that was searched. Below that is a graphic that illustrates the alignment for the top 100 “hits” (sequences identified by the search). Scroll down below that to see the list of sequences producing significant alignment scores. For each sequence, there is an Accession number (a clickable link), a description, a Max Score (also a clickable link), a total score, a Query coverage, and E value and a Max ident. You use that information to decide which of those sequences to add to your alignment and thus to include on your tree.

The description helps decide whether you are interested in that particular sequence. There may be several sequences from the same species do you want all of those or perhaps only one representative of a species—or even of a genus? If you are possibly interested in that sequence look at Query coverage. Are you interested in a homolog that only aligns with 69% of the query? If not, ignore that sequence and move on. Are you interested in a sequence that is 100% identical to your query? If you are only interested in more distantly related homologs, you may not be. If you want the most inclusive tree possible, you may be. You must decide there is no algorithm that can tell you what to include.

If you decide that you are interested in a hit sequence, click the “Max score” link to take you down to the series of alignments. What you see depends on whether your query was a DNA sequence or a protein sequence.

Step 1.3: DNA Sequences

The alignment of the query to the hit begins with a link to sequence file via its gi and accession numbers. If that link is to a genome sequence, or even to a large file that includes sequences of several genes, you will not want to include the entire sequence in your alignment. There are two ways to deal with the issue. 1) Look at the alignment itself and note the range of nucleotides in the subject. Be sure to notice whether the query aligns with the subject sequence itself (Strand = plus/plus) or with its complement (Strand = plus/minus). Click the link to bring up the sequence file. At the top right click the triangle in the gray Change region shown box, then enter the first and last nucleotides of the range, then click the Update View button. In the gray Customize view region, below, tick the Show sequence box, and if Strand = plus/minus also tick the Show reverse complement box, then click the Update View button. Finally, click the Add to Alignment button (a red cross) near the top of the window. (2) If your query is a coding sequence or is some other notable feature you may see Features in this part of subject sequence : just below the sequence description with a link to the feature. Click that feature link to bring up the sequence file already showing the region of interest. Check to be sure whether the sequence shown is the reverse complement of the query, and if it is tick the Show reverse complement box in the Customize view region, update the view, then click the Add to Alignment button (a red cross) near the top of the window.

Step 1.31. When you click the Add to Alignment button, MEGA5's Alignment Explorer window opens and the sequence is added to that window. After adding a sequence to the Alignment Explorer use the back arrow in the BLAST window to return to the list of homologous sequences and add another sequence of interest.

Step 1.4: Protein Sequences

The main difference from nucleotide searches is that you may see accession number links to several protein sequence files. These all have the same amino acid sequence, although their underlying coding sequences may differ. Click any one of the links to bring up the protein sequence file, then click the Add to Alignment button.

You may find that all the hits that are returned from your search are from very closely related organisms that is, if your query was an Escherichia coli protein, all the hits may be from E. coli, Salmonella, and closely related species. If the hits all show a high maximum identity and you are pretty sure the sequence occurs in more distantly related sequences you have probably come up against the default maximum of 100 target sequences. Repeat the search, but before you click the BLAST button to start the search notice that immediately below that button is a cryptic line “+ Algorithm Parameters.” Click the plus sign to reveal another section of the BLAST setup page. Set the Max Target Sequences to a larger value and repeat the search. You may also want to exclude some closely related species in the Choose Search Set section above. Enter a taxon, for example, E. coli, in the box and tick the Exclude box. If you want to exclude more than one species click the plus sign to the right of Exclude to add another field. You can exclude up to 20 species.

When you try to return to the list of hits you may get a page that says “How Embarrassing! Error: −400 Cache Miss.” Click the circular arrow next to the Add to Alignment button. You will be sent to the main BLAST page but do not despair. At the top right of that page is a Your Recent Results section. The top link in the list is your most recent search. Just click that link to get back to your results.

When you have added all the sequences that you want to, just close the MEGA5 browser window.

In the Alignment Editor window save the alignment by choosing Save Session from the Data menu. I like to use a name such as Myfile_unaligned just to remind myself that the sequences have not been aligned. The file will have the extension .mas.

Step 1.5: Alternatives to MEGA5 for Identifying and Acquiring Sequences

Step 1.51. You can access NCBI BLAST through any web browser that NCBI supports at In the Basic BLAST section click the nucleotide blast or protein blast link to get to the page identical to that described earlier. Everything is the same as when using MEGA5's browser except that you cannot click a convenient button to add the sequences to the Alignment Editor.

Step 1.52. Open a new file in a text editor. You can use MEGA5's built in text editor by choosing Edit a Text File from the File menu. That editor has several functions for editing molecular sequences, including reverse complementing and converting to several common formats including Fasta. Alternatively, use Notepad for Windows or TextWrangler for Mac ( Save the file with a meaningful name with the extension.fasta, for example, myfile.fasta. Do not use Microsoft Word, Word Pad, TextEdit (Mac), or another word processor!

Step 1.53. When you have identified the sequence that you want to add and clicked the link to take you the page for that sequence file, adjust the Region Shown and Customize View if necessary. Notice the Display Settings link near the top left of the page. The default setting is GenBank (full). Change that to Fasta (text), select everything, copy it then paste into the text editor file. As you add sequences to the file, it is convenient, but not necessary, to leave blank lines between the sequences.

Identifying and acquiring sequences is discussed in more detail in Chapter 3 of Phylogenetic Trees Made Easy, 4th edition (PTME4) ( Hall 2011).

The next section explains how to import those sequences into MEGA5's alignment editor.

Step 2: Aligning the Sequences

If the Alignment Explorer window is not already open, in MEGA5's main window choose Open a File/Session from the File menu. Choose the MEGA5 alignment file (.mas) or the sequence file (.fasta) that you saved in Step 1. In the resulting dialog choose Align.

The Alignment Explorer shows a name for each sequence at the left, followed by the sequence, with colored residues. Typically the name is very long. That name is what will eventually appear on the tree, and long names are generally undesirable. This is the time to edit those names, in fact it is the only practical time to edit the names, so do not miss the opportunity. Simply double click each name and change it to something more suitable.

If your sequence is DNA you will see two tabs: DNA Sequences and Translated Protein Sequences. The DNA sequences tab is chosen by default. Click the Translated Protein Sequences tab to see the corresponding protein sequence.

Step 2.1

Now is the time to align the sequences. Two alignment methods are provided: ClustalW ( Thompson et al. 1994) and MUSCLE ( Edgar 2004a, 2004b). Either can be used, but in general MUSCLE is preferable. In the tool bar, near the top of the window, Clustal alignment is symbolized by the W button, and MUSCLE by an arm with clenched fist to “show a muscle.” Click one of those buttons or choose Clustal or Muscle from the Alignment menu. If your sequence is DNA you will see two choices: Align DNA and Align Codons. If your sequence is a DNA coding sequence it is very important to choose Align Codons. That will ensure that the sequences are aligned by codons, a much more realistic approach than direct alignment of the DNA sequences because that avoids introducing gaps into positions that would result in frame shifts in the real sequences.

Step 2.2

Choosing an alignment method opens a settings window for that method. For MUSCLE, I recommend that you accept the default settings. For ClustalW, the default settings are fine for DNA, but for proteins, I recommend changing the Multiple Alignment Gap Opening penalty to 3 and the Multiple Alignment Gap Extension penalty to 1.8.

Step 2.3

Click the OK button to start the alignment process. Depending on the number of sequences involved and the method you chose, alignment may take anywhere from a few seconds to a few hours. When the alignment is complete Save the session. I like to save the aligned sequences under a different name, thus if my original file was Myfile_unaligned.mas, I would save the aligned sequence as just Myfile.mas.

Step 2.4

MEGA5 cannot use the .mas file directly to estimate a phylogenetic tree, so you must also choose Export Alignment from the Data menu and export the file in MEGA5 format where it will get a .meg extension. You will be asked to input a title for the data. You can leave the title blank if you wish, but it is helpful to add some sort of title that is meaningful to you. If it is an alignment of DNA sequences you will also be asked whether they are coding sequences.

Alignment is discussed in more detail in Chapter 4 of PTME4 ( Hall 2011).

Step 2.5: An Alternative to Aligning with MEGA5

Once the alignment is complete, you will see that gaps have been introduced into the sequences. Those gaps represent historical insertions or deletions, and their purpose is to bring homologous sites into alignment in the same column. It should be appreciated that just as a phylogenetic tree is an “estimate” of relationships among sequences, an alignment is just an estimate of the positions of historical insertions and deletions. The quality of the alignment can affect the quality of a phylogenetic tree, but MEGA5 offers no way to judge the quality of the alignment. The web-based program Guidance ( provides five different methods of alignment, but more importantly, it evaluates the quality of the alignment and identifies regions and sequences that contribute to reducing the quality of the alignment. Discussion of Guidance ( Penn et al. 2010) is beyond the scope of this article, but the topic is covered in detail in Chapter 12 of PTME4 ( Hall 2011).

Guidance requires that the unaligned sequences are provided in a file in Fasta format. See Hall (2011) for a detailed description of the Fasta format. If you downloaded the sequences through your favorite web browser and saved them as a .fasta file that file can be used as the input for Guidance. If you used MEGA5 to download the sequences into the Alignment Explorer you can export the unaligned sequences in FASTA format by choosing Export Alignment from the Data menu, then choosing FASTA format. If you forgot to keep the unaligned sequences you can select all the sequences (Control-A), then choose Delete Gaps from the Edit menu before you export the sequences in FASTA format.

Step 3: Estimate the Tree

There are several widely used methods for estimating phylogenetic trees (Neighbor Joining, UPGMA Maximum Parsimony, Bayesian Inference, and Maximum Likelihood [ML]), but this article will deal with only one: ML.

Step 3.1

In MEGA5's main window choose Open a File/Session from the File menu and open the .meg file that you saved in Step 2.

Step 3.2

ML uses a variety of substitution models to correct for multiple changes at the same site during the evolutionary history of the sequences. The number of models and their variants can be absolutely bewildering, but MEGA5 provides a feature that chooses the best model for you. From the Models menu choose Find Best DNA/Protein Models (ML) … . A preferences dialog will appear, but you are safe enough accepting the default setting. Click the Compute button to start the run. Models can take quite awhile to consider all the available models, but a progress bar shows how things are coming along.

When complete a window appears that lists the models in order of preference. Note the preferred model, then estimate the tree using that model. For the examples below, the WAG + G + I model was the best.

Step 3.3

From the Phylogeny menu choose Construct/Test Maximum Likelihood Tree … . A preferences dialog similar to that in figure 1 will appear.


Rhodopsin molecular evolution

When estimating rhodopsin’s evolutionary rate via the simple M0 model, in which a single dN/dS ratio is estimated for the entire sequence (see Materials and Methods), the sequence was found to have evolved, as expected, predominantly under strong purifying selection (dN/dS = 0.045, log-likelihood (lnL) = −12788). Thus, in general, novel non-synonymous nucleotide substitutions in this gene tend to be removed by natural selection. However, positive selection is expected to occur at individual, adaptive sites, which the M0 model cannot be used to detect. We thus performed two tests of positive selection using site-substitution models, which allow each site in the sequence to have evolved at an independent rate. Positive selection is inferred by comparing two nested models in which the alternative model includes sites with dN/dS greater than 1, and significance is determined via a likelihood-ratio test with the p-value derived from the χ 2 distribution. These tests, if significant, would largely provide evidence of sites that have undergone positive selection throughout the phylogenetic divergence. Both tests of pervasive positive selection failed to produce significant evidence (models M1a vs. M2a: M1a log-likelihood (lnL) = −12672, M2a lnL = −12672, degrees of freedom (d.f.) = 2, p = 1 models M7 vs M8: M7 lnL = −12364, M8 lnL = −12364, d.f. = 2, p = 1). Clearly, results of pervasive positive selection should not be used and the detailed analysis has to concentrate in very specific positions within the protein/DNA sequence.

We next tested the hypothesis that rhodopsin underwent episodic positive selection specifically during therian divergence via a test using a branch-site model (Fig. 1). This test allows for sites to have undergone positive selection on a specific branch of the phylogeny and significance is determined similarly to the previous tests. Rhodopsin was found to have significant evidence of positive selection on the branch leading to the therian mammals (null lnL = −12734, alternative lnL = −12731, d.f. = 1, p = 0.014). The three sites identified as having the highest posterior probability of having been targets of positive selection were (bovine coordinates and posterior probabilities given): M13F (Prob = 0.995), R225Q (Prob = 0.982) and S346A (Prob = 0.888). These three positions (the first two with strong statistical significance) clearly acquired some relevant function that resulted in being positively selected in the basal branch leading to therian mammals.

Electrophoretic and spectroscopic characterization of the rhodopsin substitutions

These three specific sites (13, 225, and 346 in the bovine opsin background), determined by the statistical analysis, were chosen for experimental characterization due to their high posterior probability of being positively selected. While site 346 does not meet the canonical probability cut-off of 0.95, its location in a region known to be physiologically relevant prompted us to include it in the experimental analysis. The ancestral mutations F13M, Q225R and A346S were constructed in the bovine opsin gene by site directed mutagenesis. These amino acids are located at the intradiscal N-terminal domain (F13), the cytoplasmic end of transmembrane helix V (Q225), and the C-terminal tail of the photoreceptor protein opsin (A346), respectively (Fig. 2).

Sites mutated in the present study, 13, 225 and 346 are circled in red.

We used electrophoretic analysis in order to determine the glycosylation and oligomerization state of the mutants which are important structural determinants of their functionality. To this end, the recombinant mutated proteins, F13M, Q225R and A346S, were expressed in COS-1 cells, immunopurified and subsequently analyzed by SDS-PAGE. The electrophoretic pattern of Q225R and A346S mutants was very similar to that of the WT (Fig. 3, left panel) showing the characteristic trailing smear typical of rhodopsin expressed in COS-1 cells and attributed to heterogeneous glycosylation 13 . However, the F13M mutant showed a clearly altered pattern, with a series of discrete bands and the appearance of lower bands below the opsin main band (at about 40 KDa) that can be attributed to non-glycosylated 14 or truncated opsin species 15 .

Left panel . ROS (rhodopsin from rod outer segments), WT rhodopsin and Q225R, A346S and F13M mutants are indicated in the corresponding lanes. All the mutants show a similar electrophoretic behaviour as the WT except for F13M which shows an altered pattern consistent with altered glycosylation. Right-panel. Western blot of immunopurified protein samples detected with Rho-1D4 monoclonal antibody. WT rhodopsin, F13M in the N2C/D282C background and F13M rhodopsin. Note that a 28 kDa band is clearly detectable in the F13M mutant lane.

One of the main adaptively relevant properties of rhodopsins is their light absorption capacity. Thus, the spectral behaviour of the purified proteins was analyzed by UV-vis spectroscopy and its light absorption properties were determined in their dark-adapted state (Figs 4 and 5C). Wild-type (WT) rhodopsin showed the characteristic visible band at 500 nm and the mutants Q225R and A346S showed visible bands at the same wavelength (Fig. 4). These two mutants showed similar levels of chromophore regeneration with retinal than WT rhodopsin as judged by their A280 nm/A500 nm ratios (see Table 1). The photobleaching and acidification spectra were determined immediately after illumination (with light of λ > 495 nm) and after acidification respectively. Upon illumination, both mutants showed a typical absorbance band at 380 nm, corresponding to the active Meta II conformation. Subsequent acidification of the samples shifted the absorption maximum from 380 nm to 440 nm which corresponds to the reprotonation of the Schiff base nitrogen. Thus, we find a WT-like behaviour for the Q225R and A346S mutants in the photobleaching and acidification assays (Fig. 4, insert), suggesting that these amino acid changes did not alter the pathway of photointermediates leading to the activated receptor.

UV-vis spectra of WT, Q225R and A346S in dark. Insets show the corresponding dark (λmax = 498 nm), photobleached (λmax = 380 nm), and acidified spectra (λmax = 440 nm). Note that the mutants show photobleaching and acidification behaviours analogous to those of WT rhodopsin.

Transfected cells with WT rhodopsin (A) and F13M mutant (B) were analyzed by fluorescence microscopy. The blue colour corresponds with the nucleus of the cells, and opsins are labelled in green. (C) UV-vis absorption spectra of F13M in the dark (top panel) showing no chromophore regeneration in the visible region. When the mutation is obtained in the background of the N2C/D282C double mutant, then chromophore regeneration can be rescued to WT levels (lower panel). Inset, photobleaching and acidification spectra of the rescued mutant.

A specific behaviour was observed in the case of the F13M mutation, at the N-terminal domain of the receptor, which did not show detectable chromophore regeneration as detected by the lack of absorbance in the visible region (Fig. 5C, upper panel). This lack of chromophore regeneration ability could reflect protein misfolding. It is known that misfolded opsins are retained in the endoplasmic reticulum or can form intracellular inclusions due to a failure in the intracellular transport to the plasma membrane. Thus, we analyzed the subcellular localization of the F13M mutant, expressed in COS-1 cells, and compared it to that of WT rhodopsin in order to confirm structural misfolding of this mutant. A clearly different pattern was observed in the two cases with WT opsin being trafficked to the plasma membrane (Fig. 5A), whereas F13M appeared not to be localized effectively to the plasma membrane, and formed intracellular inclusions with higher frequency, in a pattern consistent with protein misfolding (Fig. 5B).

Rescue of chromophore regeneration for the F13M mutant

It was of interest to find out if the misfolded phenotype for the F13M could be rescued by means of an experimental strategy. Therefore, pharmacological chaperone rescue was carried out with the F13M mutant. For this, COS-1 cells transfected with this mutant gene where incubated in the presence of 9-cis-retinal. Previous studies showed that defective N-terminal mutants supplied with 11-cis-retinal or 9-cis-retinal, during protein biosynthesis, could recover WT-like rhodopsin chromophore regeneration levels 16,17 . In the case of F13M we could not get any detectable chromophore regeneration for this mutant using this strategy (supplementary Figure S1).

We hypothesized that the inability of the F13M mutant to bind retinal was due to the fact that this mutation, at the N-terminal domain of the receptor, could be destabilizing the protein conformation, thus affecting receptor folding and altering at the same time glycosylation at the proximal N15 residue 18 . In order to stabilize the structure, we introduced the F13M mutation in the background of the N2C/D282C double mutant that forms a disulphide bond between Cys2 and Cys282 increasing opsin stability 19 . By using this strategy we could recover full chromophore regeneration for F13M mutant to a similar extent to that of WT rhodopsin (Fig. 5C, lower panel). Furthermore, Western blot analysis of F13M showed a clear distinctive lower band at approximately 28 KDa that was not detected when the mutant was obtained in the Cys2/Cys282 background (Fig. 3 right panel). In this latter case a pattern similar to that of WT could be observed, consistent with the rescue observed for chromophore regeneration.

In order to rule out that the retinal could be binding to other Lys residues 20 in the F13M mutant (other than the natural K296 at transmembrane helix 7) we constructed the quadruple mutant F13M/N2C/D282C/K296G where the site of retinal attachment was eliminated by the K296G mutation 21 . We could not get any chromophore formation for this mutant indicating that retinal was binding to the native K296 in the rescued triple mutant (Supplementary Figure S2).

Conformational stability and functionality of WT and mutant opsins

One of the important aspects underlying the functionality of rhodopsin in visual perception is the structural stability of both the dark-adapted and the illuminated photoactivated states. Specific amino acid substitutions can have a profound impact on the stability of the protein so it is of interest to determine their thermal and chemical stabilities in the dark and also the stability of the activated Meta II state. The influence of the mutations on the specific function, i.e. G-protein activation, is also a relevant parameter that can shed light on the importance of a given amino acid position in the molecular evolution of the protein.

Dark-state chemical stability

We determined the hydroxylamine reactivity of WT and the mutants in the dark state. Hydroxylamine cannot access the compact WT rhodopsin binding pocket in the dark state but if the conformation becomes more open, as in the case of mutation, then it can enter the binding pocket forming a retinal oxime with 11-cis-retinal 22 . Thus, hydroxylamine is used to measure the chemical stability of rhodopsin in the dark as an indirect measure of the accessibility of the Schiff base linkage under these conditions. WT has a high stability towards hydroxylamine in the dark (Table 1) indicating that the retinal Schiff base is not accessible under these conditions. Both Q225R and A346S mutants show a slightly increased sensitivity towards hydroxylamine in the dark (Fig. 6A) that would reflect a less compact structure around the Schiff base linkage in the chromophore binding pocket.

(A) Chemical stability in the presence of 50 mM hydroxylamine. The decrease in absorbance at the visible λmax was measured over time. WT rhodopsin ( ● ), Q225R ( ▾ ) and A346S ( ○ ) (B). Relative Gt activation initial rate. Error bars represent S.E. in both panels.

Dark-state thermal stability

Another assay used to assess the stability in the dark state, is to follow the decay of the visible band at 48 °C. At this temperature the Q225R mutant showed similar thermal bleaching kinetics as the rhodopsin WT, while the A346S mutant showed slightly faster bleaching kinetics (Table 1). On the other hand, the mutant F13M/N2C/D282C has a high stability as expected from the stabilizing effect of the additional engineered disulfide bond (t1/2 > 120 min at this temperature) as previously described 23 .

Meta II stability

Meta II decay was determined, in real time, by monitoring Trp fluorescence increase upon rhodopsin photoactivation. Our data showed a similar decay time for the Q225R and A346S mutants when compared to WT (Table 1).

Design and Implementation

Model simplifications

Biological cases of disrupted reading frames are rare (e.g. in programmed frameshift mutations or pseudogenes) but sequencing errors that lead to apparent frameshifts are much more frequent. Such frameshifts occur through indels that are not multiples of three when one or two consecutive nucleotides are either deleted or inserted. To distinguish these kinds of frameshifts, we respectively denote as those induced by deletions, and by those induced by insertions. There are two main differences between our solution and other pairwise coding sequence algorithms (e.g. [23], [24], [26]). Firstly, our objective function is only based on sequence AA translations and secondly it ignores events. These two approximations allow us to extend our pairwise algorithm to MSA.

As mentioned in the introduction, Hein [23] and Pedersen et al [25] proposed defining the overall cost of the alignment as the sum of the costs of the two alignments. One can argue that the NT level is at least partly taken into account within classical AA substitution matrices such as PAM [41] or Blosum [42]. Using summation also raises the question of the relative importance of these two information levels in the alignment process since, as mentioned by the authors [25], other cost combinations could also be used. Hence, following the three-step strategy, we prefer to consider only the AA alignment cost which has the advantage of simplicity resulting in a faster solution.

Pairwise alignment algorithm accounting for frameshifts [24], [25], [27] explicitly model events (those representing the presence of one or two extra nucleotides in a sequence). Representing such events in the output alignment require either to remove the corresponding extra nucleotides from the sequence or to display it as a partial codon (e.g. “! ! C”) facing a “ghost” codon in the other sequence (“! ! !”) that is neither a real gap nor codon. None of these solutions is adapted to the classical strategy used to extend pairwise alignment algorithm to MSA (this strategy, based on alignment of alignments, is detailed at the end of this section). Removing the extra nucleotides prevents questioning this choice afterwards. Meanwhile, using a ghost codon (“! ! !”) is problematic, especially for correctly evaluating the costs of gap opening/closing when aligning two alignments. Indeed these costs are efficiently estimated based on the local configuration of gap and non-gap characters but since a ghost codon is neither one nor the other the standard solutions (e.g. [43], [44]) no longer work. This difficulty to handle events is certainly the main reason for which previous pairwise solutions have never been extended to MSA. Note that ignoring is not so dramatic since they can always be explained as a event in the concerned sequence facing a codon deletion in others (e.g. “! ! C” facing “– – –”). This is a practical approximation with little, if any, impact when only two sequences are aligned. In the case of MSA, this approach overpenalizes events (by adding deletions to other sequences), but it does not seem to have a major impact in practice. We acknowledge that an exact handling of events would be preferable. Yet, as none have been found since Hein seminal work published in 1994, we think that it is time to consider approximate solutions to extend his pairwise model to a useful MSA tool.

Defining the objective function of pairwise alignments containing frameshifts and stop codons

An alignment of two sequences , can be seen as a transformation process to turn into as illustrated in Fig. 5. Once a cost is associated with each elementary transformation (changing one letter into another, inserting/removing letters), the overall cost of the transformation process associated with an alignment can be computed by simply summing up the cost of its elementary transformations. An optimal alignment is then one with the minimum total transformation cost. To obtain a biologically meaningful alignment, the various elementary costs must be carefully chosen. The cost of turning one amino acid X into another Y depends on their physicochemical properties and is denoted as . The cost of an insertion/deletion of AAs is generally defined as where is a high value penalizing gap opening while is a smaller value penalizing gap extension. This reflects the fact that indels are rare events (compared to substitutions) and that longer indels are even rarer. Note that this kind of gap cost is independent of the symbols that are inserted or deleted.

This alignment describes a way to transform into by deleting the E, inserting an I after the first M, changing the last M into an N, and deleting the two final I.

As explained above, our objective function only considers the AA alignment cost. From this point of view, it is sufficient to define the transformation cost related at the AA level to the two additional symbols used to represent frameshifting indels (“!”) and stop codons (“*”). Note that the probability of observing a frameshift or a stop codon in a sequence is relatively independent of what is observed in other sequences at the same site. The way to account for them is thus similar to the way indels are classically accounted for. Note that this is more than a coincidence for frameshift symbols since they indeed represent improbable indels of one or two nucleotides. The presence of “!” in front of any symbol is thus penalized with a high cost denoted as . Similarly, the presence of “*” in front of any symbol has also a high cost denoted as . As a consequence, the presence of a “*” facing a “!” has a total cost of .

Finally, stop codons appearing at the end of a sequence should not be penalized whereas frameshifting indels at sequence extremities must not be penalized more than other indels. From an algorithmic point of view, this is taken into account in our program in a way similar to indel costs that are generally handled to avoid penalizing those appearing at sequence ends.

Finding the optimal alignment of two coding sequences with frameshifts and stop codons

Our solution, as most existing pairwise alignment methods of molecular sequences, is an improvement on the classical “Needleman-Wunsch” algorithm [45]–[47]. We thus start by recalling its basis. Having a sequence , we denote its length, and the subsequence of comprised between its and characters. Note that is thus the character of and that, by convention, is the empty sequence (“”) if or . The first key observation is that the optimal alignment of two sequences can easily be deduced from the optimal alignments of the two sequences shortened by at most one character. More precisely, being the optimal alignment between two sequences and and its cost , the overall cost of an optimal alignment between the two sequences can be recursively computed using the following formula (as long as and ): (1)

The recursion stops when at least one sequence is empty. An efficient solution for this recursive problem is to store each sub-problem solution. This only requires memory space while saving exponential computation time. The cost of each sub-problem solution is stored in a two-dimensional array of size × that we denote such that . The first row and column of correspond to alignment containing an empty sequence with straightforward costs, e.g. . Once the first row and the first column are initiated, other cells of are considered in a left/right, top/down order. Hence each value of can be computed in constant time using the recursive formula (1) that relies on the three sub-problem costs stored in , and . The last computed value ( ) is the cost of an optimal alignment of and . An optimal alignment can be obtained from the filled array by using a backtracking algorithm. This algorithm starts from the last entry of (i.e. ) and determines which of its three neighbors has been used to obtain its optimal value. If the value comes from the left, it indicates an insertion of the last character of from the top, it is a deletion of this character and from the diagonal, it is a substitution between the last two characters of , . The algorithm then moves to the corresponding neighbor and the same process is repeated until the top left of the array is reached.

As we are looking for an alignment that takes into account the AA translation of the NT sequences, we need to introduce a new notation to link these two sequence levels. We will use to denote the raw translation of a nucleotidic sequence into AAs. This raw translation is realized using the first reading-frame, incomplete codons are converted into “!” and stop codons are converted into “*” without interrupting the translation. Considering two protein-coding nucleotide sequences without frameshifts and , the array used to align and can be viewed as a compression of the corresponding array that would have been used to align and . Indeed, each row (resp. column) of represents three rows (resp. columns) of . An alignment equivalent to the one produced by backtracking can thus be obtained using given that only movements corresponding to an AA substitution, insertion, or deletion are considered. These restrictions lead to considering only cells and to estimating their values based on the following formula (as long as and ): where and .

Considering frameshift possibilities is a generalization of this approach where all cells are considered and their values are estimated using all cells inside the square neighborhood delimited by , , and . This 4×4 square thus defines 15 neighbor cells of (Fig. 6). During the backtracking process, all movements from toward these 15 neighbors are considered. Three of them correspond to classical AA translations, while the 12 others induce 1 or 2 frameshifts. Fig. 7 shows the site alignments corresponding to these 15 possible movements. The resulting pairwise algorithm of two coding DNA sequences with respect to a frameshift and stop codon aware NT/AA model is detailed in Algorithm S1. Note that in this algorithm, values of are accessed through a “get_C(i,j)” method that returns when and are valid indices, and otherwise. The advantage is that the value does not interfere with the search for a minimum value, so that only the needs to be initialized while other cells in the three first rows (and columns) are handled like any others.

Like for classical Needleman-Wunsch, an array is used to store the cost of an optimal alignment between prefixes of ( = ATTTCGAAATG) and prefixes of ( = ATCGAGATG). The AA translations of those sequences are used to detect STOP codons and to evaluate codon substitutions based on their AA translations. The value of each cell is computed using 15 nearby cells. For instance, the bold cell value is computed based on its 15 colored neighbors. Among those 15 cells, some induced frameshifts in one or both sequences (see Fig. 7 for details). For instance cells marked with a “0” cause no frameshift, those marked by “1” cause a frameshift for but not for . The optimal path (indicated by arrows) is determined using a backtracking process similar to the classical one, except that 15 possible moves are now considered. The alignment corresponding to this arrow path is depicted in the dashed box.

Suppose that the backtracking process has led to the bold cell. The next movement will go from this cell toward one of its 15 colored neighbors and one site will be added to the alignment constructed by the backtracking process. The site to be added is indicated for each cell.

This dynamic programming algorithm is described using constant gap costs, i.e. the cost of an indel of size is just . The implemented version is extended to handle the more realistic affine gap costs where the cost of an indel is . This is done by using three matrices , and containing the optimal costs of partial alignment ending, respectively, by an Insertion, a Deletion or a Match/Substitution (e.g. [48]).

Since for each cell we consider 15 neighbors instead of the three considered in the standard Needleman-Wunsch algorithm, our approach is, theoretically, five times slower. Having a fast pairwise algorithm and a valid alignment representation, we can now apply classical MSA strategy based on this NT/AA model accounting for frameshifts and stop codons.

Multiple alignment of protein-coding nucleotide sequences using an NT/AA model accounting for frameshifts and stop codons

A multiple alignment of sequences ,…, induces a pairwise alignment for any pair of sequences , ( ) obtained by removing from all other sequences and those sites that have a gap for both and . The cost of a multiple alignment is often defined as the sum of the cost of the pairwise alignment it induces. This criterion is called the sum-of-pairs (SP) score. Having two alignments and on disjoint sets of sequences and , a variant of the dynamic programming algorithm used for two sequences allows an alignment of to be found, among those inducing and , that has the lowest SP score. In this variant, a substitution cost is computed to reflect the sum-of-pairs criterion, i.e. it is a sum of elementary substitution costs for transforming AAs (resp. NTs) present in into those present in . Gap extension costs can also be easily derived from the number of sequences included in both alignments, plus the gap frequencies of any of their sites. The only real difficulty is to correctly estimate the exact cost of gap creation that should be added to the SP score when considering an insertion/deletion event. Although this number can be computed exactly [44], the much easier way to compute “pessimistic gap count” estimation proposed by Altschul [43] appears to produce MSA of good quality [49].

The MSA produced by MACSE uses a progressive alignment strategy to obtain an initial draft MSA that is subsequently refined. Variants of this widespread strategy are used, for instance, by ClustalW [12], Muscle [15] and OPAL [49]. The influences of each step variant (such as the method used to measure sequence similarity) are extensively analyzed in the OPAL paper [49] and we considered its conclusions when designing MACSE. In particular, following their conclusions, we fixed the substitution matrix at BLOSUM62 [42]. The MSA strategy used in MACSE is obviously not the core of the present paper since we use the classical approach to extend our original pairwise alignment of coding sequences into a useful MSA. However, we briefly describe it below to explain the choice of our main variants.

Firstly, all pairwise sequence similarities are estimated based on the frequencies of their nucleotide k-mers, i.e. their sub-sequences of k nucleotides [50]. Those similarities are used to infer a dichotomic rooted guide tree using the UPGMA distance method [51]. By using UPGMA, the goal is clearly not to infer a phylogeny of the sequences but rather to build a guide tree that groups similar sequences, which must be aligned first [49]. The leaves of this tree are associated with the sequences to be aligned, whereas its internal nodes are associated with the MSA of the sequences included in the corresponding clade. The internal nodes are then processed bottom up, and the alignment of a node is obtained by aligning the previously computed alignments of its two descendants. Note that, following the conclusions of the OPAL paper, we choose to “align alignments” using the pessimistic gap count, as detailed in [48], rather than to align profiles, which is often the case e.g. [12], [15]. Since the profiles only consider the character frequencies of each site, they are less time and space consuming but do not contain enough information to compute gap cost according to the “pessimistic gap count”. The resulting MSA of the root node is then used as our initial draft of the desired MSA. We then use the classical 2-cut refinement strategy to improve it. This strategy consists of partitioning the current solution into two sub-alignments that are subsequently re-aligned. The resulting MSA replaces the previous one if its SP score is improved. This 2-cut refinement strategy also uses the guide tree: it iteratively considers each clade of the guide tree and splits the current global alignment so that one of the two sub-alignments contains the exact sequence of the clade concerned. Once all clades have been tested, a new guide tree is inferred using UPGMA based on sequence similarity estimated according to the sequence normalized contributions to the SP score of the current MSA [49]. Note that if the guide tree changes, some new 2-cut refinements will be tested. The refinement process stops when no more improvements are found, or when the maximum number of refinement iterations is reached.

Availability, main features, and future directions

The MACSE program is distributed as an open source Java file executable with available source code. Since it is written in Java, MACSE is provided as a single jar file that works on every standard operating system (Windows, Linux, Mac OS). Once downloaded, it can be launched using the basic command line instruction e.g., “java -jar MACSE.jar -i my_seq.fasta -o my_output_prefix” (in the absence of any parameters, MACSE will print some help describing its options and providing some command line examples.) This allows to easily integrate MACSE in a bioinformatics pipeline. MACSE can also be used via a web interface at:

Main features and options of MACSE

MACSE takes input sequences in the FASTA format and provides as output two alignments of those sequences in the same format (one at the NT level and one at the AA level). The name of the input file and the basename to be used for the two output alignments are the only compulsory parameters of MACSE. One can easily define two sets of sequences that use different frameshift and stop codon costs by splitting sequences to be aligned into two different input files. This allows standard use cases to be handled when one wants to align either protein coding DNA sequences with pseudogenized ones, or curated sequences from public databases with sequences resulting from the raw output of new generation high-throughput sequencing technologies. The alignments outputted by MACSE can be examined using the SEAVIEW program [52], [53] which has a well suited codon view option.

The parameter values for gap opening/extension costs strongly influence the alignment produced by any MSA approach. Despite all efforts to design an automatic strategy to adjust these costs, the results obtained with such adjusted parameters are still disappointing compared to those that could have been obtained by the same MSA method if the true parameters were known [49]. The MACSE documentation includes some guidelines to choose cost penalties associated with gap opening/extension and with frameshift and internal stop codon occurrences for the most common usages – e.g. alignment of (pseudo)genes. Note also that since the user can provide an initial alignment that MACSE will use as a starting point for its 2-cut refinement strategy, one can rapidly test different parameter sets.

MACSE also integrates the alternative genetic codes, and provides options to specify the default genetic code to be used and/or to specify different codes to be used depending on sequence names. For the latter option, MACSE relies on a separate option file compatible with the one used by TranslatorX.

Future directions

Future works include further optimization to speed up the program and the development of a more elaborated penalty model to take into account, for instance, the fact that frameshifts are more frequent within homopolymer portions of sequences. We also work on handling untranslated regions (UTR) that can appear at the beginning and/or end of the EST sequences. This can be done by adapting our algorithm to allow local alignment together with identification of start and stop codons at their extremities. Finally, we plan to collaborate with the SEAVIEW developer team to provide MACSE as a SEAVIEW plug-in.

Protein-coding model building

The model-building phase involves the alignment of protein, cDNA, EST and RNA-seq sequences to the genome assembly. The methods used in this phase depend on the input data available at the time of annotation. Input datasets are selected taking provenance into account, with same-species data preferred over data from other species, and with annotated sequences preferred over computed sequences. The final output of this section of the genebuild is a collection of databases that contain sequence alignments and a large set of potential protein-coding transcript models.

Targeted pipeline

The Targeted (same-species) pipeline uses same-species protein sequences to first identify the rough genomic location of protein-coding genes, and then to produce coding models using GeneWise (68). This two-step method aims to speed up the process by reducing the search space made available to GeneWise to a subsection of the genome, which has similarity to the protein sequence being aligned.

Same-species protein sequences are downloaded from UniProt and RefSeq (69), with the aim of restricting these to a set of high-confidence input sequences. For UniProt, we download only Swiss-Prot and TrEMBL protein sequences labeled as PE level 1 and PE level 2. In the case of RefSeq, we download sequences with ‘NP’ and 𠆊P’ accessions, which are the annotated protein sequences. RefSeq computed protein sequences including the ‘XP’ accessions are not downloaded. The combined set of downloaded UniProt and RefSeq protein sequences form the input for the Targeted pipeline.

We locate the approximate genomic location of transcripts by aligning protein sequences to the genome using Pmatch (R. Durbin, unpublished software) with a threshold of ‘-T 14’. This threshold indicates the number of consecutive amino acids that must exactly match the genomic DNA, and is an efficient method for aligning proteins when they have high identity to the genome. It is important not to lose too many same-species input sequences at this early stage of the genebuild process. Thus, if Pmatch does not align all input proteins, we then align the remaining protein sequences using Exonerate (70).

Every Pmatch hit will correspond to translated exonic sequence. Pmatch hits from each input protein sequence are grouped along the lengths of genomic sequences, using the module [also referred to as a Runnable (56)] BestPmatch, so that the genomic range of the hits roughly corresponds to the location of the input protein’s transcript. The genomic range identified by BestPmatch is extended by 200 kb in both directions and the DNA sequence for this region is passed to GeneWise, along with the original input protein sequence. GeneWise aligns the protein sequence to the DNA using a splice-aware algorithm and generates a protein-coding transcript model as output.

For human, mouse and selected other species, we run GeneWise at least twice across the genome: a first time requiring consensus splicing and a second to allow nonconsensus splice sites. While consensus splicing is more common than nonconsensus splicing, the second run of GeneWise provides flexibility for those coding models with real nonconsensus splice sites and permits alignment of the protein sequence in regions where there are genomic sequence errors. Some models produced by GeneWise contain small 𠆏rameshift introns’ of 1, 2, 4 or 5਋p long where errors, insertions or deletions in the genomic sequence would otherwise introduce translation frameshifts. When translated off the genomic sequence, the coding sequence for these models is more likely to be full length, which is particularly useful in lower quality draft genomes.

In Curwen et al. (48), we described passing ‘MiniSeqs’ to GeneWise. However, we no longer use this approach. We now use 𠆏ullSeqs’ that include all genomic sequence from the first to last Pmatch alignments intronic genomic sequence is no longer removed. This FullSeq method is possible due to increased computational resources and optimization of the GeneWise program. It is preferred because it allows GeneWise to search the full genomic sequence and to correctly place short exons, while genomic sequences for short exons were not always present in the MiniSeqs.

In addition to GeneWise, we also use Exonerate’s cdna2genome tool (70) to generate protein-coding gene models. This is achieved by downloading cDNA sequences that have a coding sequence (CDS) range annotated in the INSDC record cDNA sequences without an annotated CDS in the INSDC record are not used in this step. Combined alignment of a cDNA and its annotated CDS by Exonerate has the advantage of adding untranslated regions (UTRs) to the protein-coding models in one step, and of ensuring that the correct UTR is added to a coding model. This step is only run for the handful of species that have large numbers of annotated protein-cDNA pairings. As Exonerate produces models whose translation include stop codons, we search each of the resulting models and remove those with more than one internal stop. For models with only a single internal stop codon, a small frameshift intron is introduced in its place.

From the multiple GeneWise and Exonerate methods described above, each original protein sequence may have produced multiple coding transcript models at one location, with slightly different exon structures and translated sequences, depending on the degree to which the protein sequence matches the genome. In order to identify the model whose translation most closely matches the input sequence, the translation from each of these models is aligned back to the original protein sequence by the BestTargeted module, using Exonerate’s �ine:local’ model. This is a local alignment that uses the affine gap penalty, similar to the Smith–Waterman–Gotoh algorithm (71). For each original protein sequence, the Ensembl model producing the highest Exonerate score is selected to be the final output for the Targeted pipeline.

Similarity pipeline

As with the Targeted pipeline, the aim of the Similarity pipeline is to identify the rough genomic location of protein-coding transcripts and then to produce coding models using GeneWise. Unlike the Targeted pipeline, which restricts its input to only same-species proteins, the Similarity pipeline takes as input UniProt proteins from a wide range of species. This approach is especially useful for species that do not have many same-species proteins suitable for use in the Targeted pipeline such as elephant or anole lizard, but is less so for well-described species with many proteins in UniProt, such as human and mouse.

The method for reducing the genomic search space passed to GeneWise is different in the Similarity pipeline compared to the Targeted pipeline. Instead of using Pmatch to identify the rough placement of protein sequences, we use the UniProt BLAST results produced in the raw compute pipeline. Although BLAST requires more compute resource than Pmatch to run, it is more tolerant of the sequence mismatches that typically occur when aligning proteins from the broad range of species used in the Similarity pipeline.

The UniProt BLAST results are first classified across three axes according to the information provided by UniProt: by PE level, by source (Swiss-Prot or TrEMBL) and by taxonomy. This division of UniProt subsets allows us to prioritize the reviewed protein sequences that are more closely related to the species being annotated.

UniProt proteins that mapped to a Genscan peptide sequence during the raw computes step are then aligned to the full genomic sequence underlying the Genscan model, again using BLAST. This step allows hits to be identified outside of the Genscan exons. It is these results that define the regions on which GeneWise is subsequently run.

The output of the Similarity pipeline is a set of models, based on protein sequences from a variety of species, which supplements the models already generated by the Targeted pipeline.

RNA-seq pipeline

With the rapid adoption of high-throughput transcriptome sequencing (i.e. RNA-seq) as an experimental method, the amount of available transcribed sequence data is increasing dramatically (72). The quality of such sequence data is expected to continue to increase over the next few years, making it a valuable resource in the gene annotation process.

The main difficulty in using short reads for gene annotation is that the full length of an mRNA is not represented in one contiguous sequence. These short sequences must be combined to generate longer transcript models without full knowledge of the splicing pattern of the exons in each expressed isoform. The paired reads provide more informative alignments than single reads because reads that align as a pair have a higher confidence level of being aligned correctly (73). It is also possible to take the expected insert size for paired reads into account when validating their alignments. Stranded reads are particularly useful for cases in which transcripts overlap on opposite strands, and assignment of a read to the correct strand can be ambiguous, although for un-stranded reads, a transcript’s strand can normally be determined from the direction of splice sites. Most of the RNA-seq data with which we have worked have been paired-end reads of 50 bases or longer, generated by Illumina machines.

Because short read data do not allow the confident construction of full-length splicing models, the Ensembl RNA-seq pipeline is usually configured to produce only one transcript model per gene as output. This conservative approach aims to prevent the introduction of false transcript structures that result from incorrectly combining exons and introns along the length of a model.

RNA-seq-based models are produced from a two-step alignment process with only minor modifications to that described by Collins et al. (74). Firstly, raw reads are now aligned to the genome using BWA (75). These alignments are collapsed to create alignment blocks that roughly correspond to transcribed exons. Read pairing information is then used to group putative exons into approximate transcript structures called proto-transcripts. In the second alignment step, the reads that were partially aligned by BWA are extracted and aligned to the proto-transcripts, or more commonly to the underlying genomic sequence, using Exonerate. Exonerate is splice-aware, providing alignments that allow us to infer introns. Finding clear exon–intron junctions is a challenge when the raw reads have been sequenced from a mixture of fully processed and partially processed transcripts reads sequenced from retained intronic sequence can lead to the annotation of one long, false exon that should have been annotated as one intron surrounded by two exons. These false exons are removed when detected they are identified by searching within the genomic range of each putative exon for evidence of spliced reads. The result of the Exonerate alignment step is a set of spliced alignments representing canonical and noncanonical introns. Transcript models are created by combining the transcribed regions from the proto-transcripts with the observed (intronic) spliced alignments to create all possible transcript isoforms indicated by the aligned data. We usually configure the system to only keep the isoform with the most read support across its splice junctions and exons.

Read length and depth of coverage are both important when identifying introns. When read coverage is high, it is more likely that the set of raw reads contains sequences that can be aligned across an intron. When reads are longer, it is more likely they will span an intron. Having reads that align across every intron in a transcript makes it possible for us to build a complete transcript model. If the coverage is very low, some splice boundaries may not be covered by a read in the raw data set. Without read support, these introns will not be generated in the Exonerate step, which can result in fragmented models or models with retained introns.

The RNA-seq pipeline produces both protein-coding and noncoding transcript models. The final step in this process is to BLAST UniProt PE 1 and PE 2 proteins against the set of RNA-seq models so as to identify the protein-coding transcript models. Our standard thresholds for the UniProt alignments are 80% identity and 80% coverage of the sequences.

For the reads from each input sample, and for the merged set of reads from all samples, the output of the RNA-seq pipeline includes an indexed BAM file of the reads aligned by BWA, a set of intron features produced by aligning intron-spanning reads with Exonerate, and a set of transcript models. These data can be viewed as separate tissue tracks in the Ensembl browser. They can also be obtained through a programmatic interface.

Transcript models are produced separately for each of the tissue samples, as well as for the merged set. Transcript models from a single tissue input sample are often more fragmented than transcript models from the merged set. (The data in the merged set are deeper, and this allows more splice junctions to be detected and therefore more consecutive exons to be joined to produce longer models.) For this reason, transcript models resulting from typically only the merged set of reads are used for incorporating into the final gene set.

Intron features from the set of merged reads are used later on in the annotation process by the TranscriptConsensus module to filter Similarity models (described below). Transcript models from the set of merged reads may be used for adding UTRs to Targeted and Similarity models, and may also be included as part of the main gene set during the LayerAnnotation pipeline (also described below).

Ortholog recovery pipeline

In preparing a set of preliminary transcript models produced by the model-building pipelines, comparative data may be used for both assessing the completeness of the transcript set and for supplementing the transcript set where appropriate. Transcript structures may be absent from a preliminary set for a number of reasons, most commonly because the genomic sequence is missing from the assembly or because the Targeted and Similarity pipelines did not produce a model. For the latter case, it may still be possible to annotate models using our ortholog recovery pipeline. The RNA-seq pipeline described above will also identify genes not found by the Targeted and Similarity pipelines, and so use of the ortholog recovery pipeline has become less common since RNA-seq data became more widely available.

The OrthologueEvaluator module was developed to identify and annotate additional transcript models based on orthology. OrthologueEvaluator takes as input the preliminary transcript set with the gene sets from at least two well-annotated species, usually human and mouse. A set of orthology predictions is generated by best reciprocal BLAST hits across the input sets. These predictions are then used to fill in gaps and to supplement truncated models. In both cases, the Ensembl protein sequence of an ortholog from one of the well-annotated species is selected for alignment, with Exonerate, to the genome being annotated. When Exonerate generates a good alignment the resulting model is added to the preliminary transcript set.

Projection pipeline

The Targeted and Similarity steps rely on the alignment of complete protein sequences to the genome sequence. This method is unsuitable for low-coverage fragmented assemblies where missing genomic sequence, mis-orientations and misplacements occur more frequently than in the higher quality draft genome assemblies. In fragmented assemblies many genes will be represented only partially (or not at all) in the assembly, and many others (particularly those genes with large genomic extent) will be found in pieces, distributed across more than one scaffold.

In order to improve gene annotation on species with fragmented assemblies, we developed a methodology that relies on a whole genome alignment (WGA) to an annotated reference genome—usually the human genome. This method was used, as follows, to annotate all of the low-coverage mammal genomes produced by the 29 Mammals Project (76). For each of the low coverage target genomes, the whole-genome alignment between the human genome and target was generated using BLASTz (77). The resulting set of local alignments was linked into chains using axtTools (78). A custom filter was then applied to ensure that each base pair in the target genome aligned to no more than one position in the human genome. The WGA block underlying each annotated gene structure in the human genome was used as a guide to bring together scaffolds from the target species and join them into longer ‘GeneScaffolds’ ( Figure 3 ) that could contain complete gene structures. The inferred GeneScaffolds created a virtual assembly on top of the target species’ primary assembly. Genes from the human genome were then ‘projected’ (copied) down on to the target genome. In regions where the WGA implied that the target assembly was missing genomic sequence containing an internal exon, the projected exon was placed on the gap sequence. This resulted in a string of Xs corresponding in length to the projected translation. The creation of GeneScaffolds altered the set of toplevel sequences that were initially loaded into the Ensembl database, so the raw compute analyses were run across the new GeneScaffolds. This method of altering the toplevel sequences is no longer used because it would hinder navigation between Ensembl and other genome browsers such as UCSC and NCBI.

Projection of human FGF10 to alpaca. The FGF10 gene in alpaca was annotated by aligning the human and alpaca assemblies using BLASTz, and then projecting (copying) the human gene onto the alpaca genome. A novel structure, GeneScaffold_2975, was generated in the alpaca assembly by bringing together the shorter scaffolds that aligned to the human region containing the FGF10 gene.

This method of whole-genome alignment and projection of annotation from the human genome to the target assembly was also applied to higher primates. However, the creation of GeneScaffolds was unnecessary because the primate assemblies were of better quality or were created using order and orientation information from the human assembly.

Extending protein-coding models into their UTRs

Protein-coding models generated from protein-to-genome alignments in the Targeted, Similarity and Ortholog recovery pipelines will not have UTRs annotated. Targeted models produced by Exonerate’s cdna2genome model, on the other hand, do not require UTR extension because they are based on the alignment of cDNA and will already have UTRs annotated.

Models made from RNA-seq, cDNA or EST sequences can be used to add UTRs to the coding models. We have already described the RNA-seq pipeline and how these models are generated. For cDNAs, models are generated by aligning the cDNA sequences to the softmasked genome using Exonerate. ESTs are aligned in the same way as cDNAs, and these alignments are collapsed into models using the EST2genes or TranscriptCoalescer modules. These two modules combine spliced EST alignments into longer transcript structures.

The variable quality of EST data, which often come from multiple labs using different protocols, makes the sequences difficult to incorporate into an annotation system that expects data to be of a consistently high quality. We do not use EST models for UTR addition unless a species has a large number of EST sequences and very little cDNA or RNA-seq data.

The UTR_Builder module traverses each toplevel sequence and identifies protein-coding models that are overlapped by RNA-seq, cDNA or EST models. When the start and end boundaries of the first intron of a protein-coding model are matched by an RNA-seq, cDNA or EST structure, this sequence evidence can be used to add a UTR at the 5-prime end. The same rule applies to the last intron of a protein-coding model when adding the 3-prime UTR. For single-exon transcripts, the exon start and end must lie within the corresponding sequence evidence in order to add a UTR. When a translation does not start with a Methionine, the UTR is searched upstream of the CDS for the first in-frame Methionine. Similarly, when a translation does not end in a stop codon the UTR is searched up to 150 bases downstream of the CDS for the first in-frame stop codon.

CAGE (79) and paired-end tags (ditags) (80) provide information on the transcription start and end positions. We have adapted our UTR pipeline to make use of these data so as to define UTR boundaries more precisely. The genomic locations of CAGE tags and ditags are compared against the cDNA models, which allows scoring of each potential pairing of protein model to cDNA. The UTR_Builder module prioritizes the cDNA model with the most CAGE and ditag support. This has been applied in human and mouse where deep sequencing data are available.

The output of the UTR_Builder step is an updated set of protein-coding transcript models that have been extended to include UTRs where evidence was available ( Figure 4 ). The cDNA and EST models are used in filtering steps later on and are also displayed on the website along with the ESTgenes.

Sample transcript models with supporting evidence for untranslated regions (UTRs). This figure shows sample transcript models from HAVANA (yellow) and Ensembl (red) aligned with supporting evidence from cDNAs (green), ESTs (purple) and proteins (orange). Darker colors in the alignments correspond with exons. Unfilled boxes at the ends of the transcripts represent UTRs. Support for the UTRs comes from the aligned cDNAs and ESTs but not from the proteins.

Special types of protein-coding genes

The protein-coding gene annotation process described above creates high quality gene models throughout most of the genome. The annotation process relies on aligning protein sequences to the genome and is suitable for most protein-coding genes.

There are certain types of protein-coding genes, however, where the above approach is not suitable. These include Immunoglobulin/T-cell receptor genes and selenoproteins. We have developed separate approaches to improve annotation for both such cases.

Immunoglobulins and T-cell receptors

The Immunoglobulin/T-cell receptor clusters are difficult to annotate because the underlying genomic region undergoes somatic recombination. This process of genome rearrangement combines multiple genes from the cluster—known as Variable (V), Constant (C), Diverse (D) and Joining (J) genes𠅋y excising the intervening DNA. This generates a functional immunoglobulin gene sequence that encodes a complete immunoglobulin/T-cell receptor.

We aim to annotate the individual V, D, J and C genes. However, many records of proteins in UniProt and cDNAs in ENA are full-length products of transcripts expressed after the associated V(D)J somatic recombination events. Each of these records contains sequence for multiple genes, which would need to be separated to generate the correct annotation.

The V, D, J and C gene boundaries are often incorrectly predicted when aligned back to the un-rearranged reference genome using a spliced-alignment program such as GeneWise or Exonerate. This is because the junctions are not generated by the standard splicing machinery, and therefore do not display the standard splicing signals.

Annotation for T-cell receptors and immunoglobulin genes has been improved for human and mouse by collaborating with other annotators who contribute to the International Immunogenetics information system (IMGT) (81). This database contains annotations of individual genes on RNA and genomic DNA reference entries. The IMGT genes are aligned to the genome using Exonerate and are then merged with our gene annotations. Existing transcript models that overlap at the exon level with the aligned IMGT genes are removed.


Selenocysteines are encoded by UGA, one of the three codons responsible for translation termination. To represent these codons as encoding selenocysteines instead of stop codons, we align UniProt records with the ‘SEL_CYS’ tag to the genome using Exonerate. The stop codons at the relevant positions specified by these records are then replaced with selenocysteine residues.

Orthology is a key evolutionary concept in many areas of genomic research. It provides a framework for subjects as diverse as the evolution of genomes, gene functions, cellular networks and functional genome annotation. Although orthologous proteins usually perform equivalent functions in different species, establishing true orthologous relationships requires a phylogenetic approach, which combines both trees and graphs (networks) using reliable species phylogeny and available genomic data from more than two species, and an insight into the processes of molecular evolution. Here, we evaluate the available bioinformatics tools and provide a set of guidelines to aid researchers in choosing the most appropriate tool for any situation.

We use cookies to help provide and enhance our service and tailor content and ads. By continuing you agree to the use of cookies .


The WFDC1 gene is frequently down-regulated or lost in prostate cancer, and the encoded protein, ps20, has been implicated in epithelial cell behaviour and angiogenesis. However, ps20 remains largely uncharacterised with respect to its structure and interacting partners. This study characterised the evolution, functionality and structural characteristics of WFDC1/ps20 using phylogenetic reconstruction and other computational approaches. Bayesian phylogenetic analyses suggested that ps20 appeared in a common ancestor of deuterostomes-protostomes. The rate of evolutionary change within the coding regions of vertebrate WFDC1 genes and the synteny conservation in mammals differed from that of other vertebrate clades, indicating a possible functional diversity of ps20 homologues. A gene set enrichment analysis of the genes around WFDC1 (conserved synteny) showed functional relationships between the WFDC1, CDH13, CRISPLD2, IRF8 and TFPI2 genes. The molecular evolution of ps20 has been driven by purifying selection, particularly in the segments corresponding to exons 3 and 4, which encode the most conserved regions of the protein. A co-evolution analysis showed that residues within these regions co-vary with each other during the evolution of ps20. These results show that the regions corresponding to exons 3 and 4 are ps20-specific structure-function modules. Homology modelling of the exon 2-encoded polypeptide and subsequent dynamics calculus using a Gaussian network model showed that residues with high conformational flexibility are part of a loop region involved in protein-protein recognition, given the similarity with other serine protease inhibitors. Residues C96, R94, L105, and C66 are critical for the integrity and functionality of this ps20 region.


If two sequences in an alignment share a common ancestor, mismatches can be interpreted as point mutations and gaps as indels (that is, insertion or deletion mutations) introduced in one or both lineages in the time since they diverged from one another. In sequence alignments of proteins, the degree of similarity between amino acids occupying a particular position in the sequence can be interpreted as a rough measure of how conserved a particular region or sequence motif is among lineages. The absence of substitutions, or the presence of only very conservative substitutions (that is, the substitution of amino acids whose side chains have similar biochemical properties) in a particular region of the sequence, suggest [3] that this region has structural or functional importance. Although DNA and RNA nucleotide bases are more similar to each other than are amino acids, the conservation of base pairs can indicate a similar functional or structural role.

Very short or very similar sequences can be aligned by hand. However, most interesting problems require the alignment of lengthy, highly variable or extremely numerous sequences that cannot be aligned solely by human effort. Instead, human knowledge is applied in constructing algorithms to produce high-quality sequence alignments, and occasionally in adjusting the final results to reflect patterns that are difficult to represent algorithmically (especially in the case of nucleotide sequences). Computational approaches to sequence alignment generally fall into two categories: global alignments and local alignments. Calculating a global alignment is a form of global optimization that "forces" the alignment to span the entire length of all query sequences. By contrast, local alignments identify regions of similarity within long sequences that are often widely divergent overall. Local alignments are often preferable, but can be more difficult to calculate because of the additional challenge of identifying the regions of similarity. [4] A variety of computational algorithms have been applied to the sequence alignment problem. These include slow but formally correct methods like dynamic programming. These also include efficient, heuristic algorithms or probabilistic methods designed for large-scale database search, that do not guarantee to find best matches.

Alignments are commonly represented both graphically and in text format. In almost all sequence alignment representations, sequences are written in rows arranged so that aligned residues appear in successive columns. In text formats, aligned columns containing identical or similar characters are indicated with a system of conservation symbols. As in the image above, an asterisk or pipe symbol is used to show identity between two columns other less common symbols include a colon for conservative substitutions and a period for semiconservative substitutions. Many sequence visualization programs also use color to display information about the properties of the individual sequence elements in DNA and RNA sequences, this equates to assigning each nucleotide its own color. In protein alignments, such as the one in the image above, color is often used to indicate amino acid properties to aid in judging the conservation of a given amino acid substitution. For multiple sequences the last row in each column is often the consensus sequence determined by the alignment the consensus sequence is also often represented in graphical format with a sequence logo in which the size of each nucleotide or amino acid letter corresponds to its degree of conservation. [5]

Sequence alignments can be stored in a wide variety of text-based file formats, many of which were originally developed in conjunction with a specific alignment program or implementation. Most web-based tools allow a limited number of input and output formats, such as FASTA format and GenBank format and the output is not easily editable. Several conversion programs that provide graphical and/or command line interfaces are available [ dead link ] , such as READSEQ and EMBOSS. There are also several programming packages which provide this conversion functionality, such as BioPython, BioRuby and BioPerl. The SAM/BAM files use the CIGAR (Compact Idiosyncratic Gapped Alignment Report) string format to represent an alignment of a sequence to a reference by encoding a sequence of events (e.g. match/mismatch, insertions, deletions). [6]

CIGAR Format Edit

CIGAR: 2S5M2D2M where:
2S = 2 soft clipping (could be mismatches, or a read longer than the matched sequence)
5M = 5 matches or mismatches
2D = 2 deletions
2M = 2 matches or mismatches

The original CIGAR format from the exonerate alignment program did not distinguish between mismatches or matches with the M character.

The SAMv1 spec document defines newer CIGAR codes. In most cases it is preferred to use the '=' and 'X' characters to denote matches or mismatches rather than the older 'M' character, which is ambiguous.

  • “Consumes query” and “consumes reference” indicate whether the CIGAR operation causes the alignment to step along the query sequence and the reference sequence respectively.
  • H can only be present as the first and/or last operation.
  • S may only have H operations between them and the ends of the CIGAR string.
  • For mRNA-to-genome alignment, an N operation represents an intron. For other types of alignments, the interpretation of N is not defined.
  • Sum of lengths of the M/I/S/=/X operations shall equal the length of SEQ

Global alignments, which attempt to align every residue in every sequence, are most useful when the sequences in the query set are similar and of roughly equal size. (This does not mean global alignments cannot start and/or end in gaps.) A general global alignment technique is the Needleman–Wunsch algorithm, which is based on dynamic programming. Local alignments are more useful for dissimilar sequences that are suspected to contain regions of similarity or similar sequence motifs within their larger sequence context. The Smith–Waterman algorithm is a general local alignment method based on the same dynamic programming scheme but with additional choices to start and end at any place. [4]

Hybrid methods, known as semi-global or "glocal" (short for global-local) methods, search for the best possible partial alignment of the two sequences (in other words, a combination of one or both starts and one or both ends is stated to be aligned). This can be especially useful when the downstream part of one sequence overlaps with the upstream part of the other sequence. In this case, neither global nor local alignment is entirely appropriate: a global alignment would attempt to force the alignment to extend beyond the region of overlap, while a local alignment might not fully cover the region of overlap. [7] Another case where semi-global alignment is useful is when one sequence is short (for example a gene sequence) and the other is very long (for example a chromosome sequence). In that case, the short sequence should be globally (fully) aligned but only a local (partial) alignment is desired for the long sequence.

Fast expansion of genetic data challenges speed of current DNA sequence alignment algorithms. Essential needs for an efficient and accurate method for DNA variant discovery demand innovative approaches for parallel processing in real time. Optical computing approaches have been suggested as promising alternatives to the current electrical implementations, yet their applicability remains to be tested [1].

Pairwise sequence alignment methods are used to find the best-matching piecewise (local or global) alignments of two query sequences. Pairwise alignments can only be used between two sequences at a time, but they are efficient to calculate and are often used for methods that do not require extreme precision (such as searching a database for sequences with high similarity to a query). The three primary methods of producing pairwise alignments are dot-matrix methods, dynamic programming, and word methods [1] however, multiple sequence alignment techniques can also align pairs of sequences. Although each method has its individual strengths and weaknesses, all three pairwise methods have difficulty with highly repetitive sequences of low information content - especially where the number of repetitions differ in the two sequences to be aligned.

Maximal unique match Edit

One way of quantifying the utility of a given pairwise alignment is the 'maximal unique match' (MUM), or the longest subsequence that occurs in both query sequences. Longer MUM sequences typically reflect closer relatedness. [8] in the multiple sequence alignment of genomes in computational biology. Identification of MUMs and other potential anchors, is the first step in larger alignment systems such as MUMmer. Anchors are the areas between two genomes where they are highly similar. To understand what a MUM is we can break down each word in the acronym. Match implies that the substring occurs in both sequences to be aligned. Unique means that the substring occurs only once in each sequence. Finally, maximal states that the substring is not part of another larger string that fulfills both prior requirements. The idea behind this, is that long sequences that match exactly and occur only once in each genome are almost certainly part of the global alignment.

  • it is maximal, that is, it cannot be extended on either end without incurring a mismatch and
  • it is unique in both sequences" [9]

Dot-matrix methods Edit

The dot-matrix approach, which implicitly produces a family of alignments for individual sequence regions, is qualitative and conceptually simple, though time-consuming to analyze on a large scale. In the absence of noise, it can be easy to visually identify certain sequence features—such as insertions, deletions, repeats, or inverted repeats—from a dot-matrix plot. To construct a dot-matrix plot, the two sequences are written along the top row and leftmost column of a two-dimensional matrix and a dot is placed at any point where the characters in the appropriate columns match—this is a typical recurrence plot. Some implementations vary the size or intensity of the dot depending on the degree of similarity of the two characters, to accommodate conservative substitutions. The dot plots of very closely related sequences will appear as a single line along the matrix's main diagonal.

Problems with dot plots as an information display technique include: noise, lack of clarity, non-intuitiveness, difficulty extracting match summary statistics and match positions on the two sequences. There is also much wasted space where the match data is inherently duplicated across the diagonal and most of the actual area of the plot is taken up by either empty space or noise, and, finally, dot-plots are limited to two sequences. None of these limitations apply to Miropeats alignment diagrams but they have their own particular flaws.

Dot plots can also be used to assess repetitiveness in a single sequence. A sequence can be plotted against itself and regions that share significant similarities will appear as lines off the main diagonal. This effect can occur when a protein consists of multiple similar structural domains.

Dynamic programming Edit

The technique of dynamic programming can be applied to produce global alignments via the Needleman-Wunsch algorithm, and local alignments via the Smith-Waterman algorithm. In typical usage, protein alignments use a substitution matrix to assign scores to amino-acid matches or mismatches, and a gap penalty for matching an amino acid in one sequence to a gap in the other. DNA and RNA alignments may use a scoring matrix, but in practice often simply assign a positive match score, a negative mismatch score, and a negative gap penalty. (In standard dynamic programming, the score of each amino acid position is independent of the identity of its neighbors, and therefore base stacking effects are not taken into account. However, it is possible to account for such effects by modifying the algorithm.) A common extension to standard linear gap costs, is the usage of two different gap penalties for opening a gap and for extending a gap. Typically the former is much larger than the latter, e.g. -10 for gap open and -2 for gap extension. Thus, the number of gaps in an alignment is usually reduced and residues and gaps are kept together, which typically makes more biological sense. The Gotoh algorithm implements affine gap costs by using three matrices.

Dynamic programming can be useful in aligning nucleotide to protein sequences, a task complicated by the need to take into account frameshift mutations (usually insertions or deletions). The framesearch method produces a series of global or local pairwise alignments between a query nucleotide sequence and a search set of protein sequences, or vice versa. Its ability to evaluate frameshifts offset by an arbitrary number of nucleotides makes the method useful for sequences containing large numbers of indels, which can be very difficult to align with more efficient heuristic methods. In practice, the method requires large amounts of computing power or a system whose architecture is specialized for dynamic programming. The BLAST and EMBOSS suites provide basic tools for creating translated alignments (though some of these approaches take advantage of side-effects of sequence searching capabilities of the tools). More general methods are available from open-source software such as GeneWise.

The dynamic programming method is guaranteed to find an optimal alignment given a particular scoring function however, identifying a good scoring function is often an empirical rather than a theoretical matter. Although dynamic programming is extensible to more than two sequences, it is prohibitively slow for large numbers of sequences or extremely long sequences.

Word methods Edit

Word methods, also known as k-tuple methods, are heuristic methods that are not guaranteed to find an optimal alignment solution, but are significantly more efficient than dynamic programming. These methods are especially useful in large-scale database searches where it is understood that a large proportion of the candidate sequences will have essentially no significant match with the query sequence. Word methods are best known for their implementation in the database search tools FASTA and the BLAST family. [1] Word methods identify a series of short, nonoverlapping subsequences ("words") in the query sequence that are then matched to candidate database sequences. The relative positions of the word in the two sequences being compared are subtracted to obtain an offset this will indicate a region of alignment if multiple distinct words produce the same offset. Only if this region is detected do these methods apply more sensitive alignment criteria thus, many unnecessary comparisons with sequences of no appreciable similarity are eliminated.

In the FASTA method, the user defines a value k to use as the word length with which to search the database. The method is slower but more sensitive at lower values of k, which are also preferred for searches involving a very short query sequence. The BLAST family of search methods provides a number of algorithms optimized for particular types of queries, such as searching for distantly related sequence matches. BLAST was developed to provide a faster alternative to FASTA without sacrificing much accuracy like FASTA, BLAST uses a word search of length k, but evaluates only the most significant word matches, rather than every word match as does FASTA. Most BLAST implementations use a fixed default word length that is optimized for the query and database type, and that is changed only under special circumstances, such as when searching with repetitive or very short query sequences. Implementations can be found via a number of web portals, such as EMBL FASTA and NCBI BLAST.

Multiple sequence alignment is an extension of pairwise alignment to incorporate more than two sequences at a time. Multiple alignment methods try to align all of the sequences in a given query set. Multiple alignments are often used in identifying conserved sequence regions across a group of sequences hypothesized to be evolutionarily related. Such conserved sequence motifs can be used in conjunction with structural and mechanistic information to locate the catalytic active sites of enzymes. Alignments are also used to aid in establishing evolutionary relationships by constructing phylogenetic trees. Multiple sequence alignments are computationally difficult to produce and most formulations of the problem lead to NP-complete combinatorial optimization problems. [10] [11] Nevertheless, the utility of these alignments in bioinformatics has led to the development of a variety of methods suitable for aligning three or more sequences.

Dynamic programming Edit

The technique of dynamic programming is theoretically applicable to any number of sequences however, because it is computationally expensive in both time and memory, it is rarely used for more than three or four sequences in its most basic form. This method requires constructing the n-dimensional equivalent of the sequence matrix formed from two sequences, where n is the number of sequences in the query. Standard dynamic programming is first used on all pairs of query sequences and then the "alignment space" is filled in by considering possible matches or gaps at intermediate positions, eventually constructing an alignment essentially between each two-sequence alignment. Although this technique is computationally expensive, its guarantee of a global optimum solution is useful in cases where only a few sequences need to be aligned accurately. One method for reducing the computational demands of dynamic programming, which relies on the "sum of pairs" objective function, has been implemented in the MSA software package. [12]

Progressive methods Edit

Progressive, hierarchical, or tree methods generate a multiple sequence alignment by first aligning the most similar sequences and then adding successively less related sequences or groups to the alignment until the entire query set has been incorporated into the solution. The initial tree describing the sequence relatedness is based on pairwise comparisons that may include heuristic pairwise alignment methods similar to FASTA. Progressive alignment results are dependent on the choice of "most related" sequences and thus can be sensitive to inaccuracies in the initial pairwise alignments. Most progressive multiple sequence alignment methods additionally weight the sequences in the query set according to their relatedness, which reduces the likelihood of making a poor choice of initial sequences and thus improves alignment accuracy.

Many variations of the Clustal progressive implementation [13] [14] [15] are used for multiple sequence alignment, phylogenetic tree construction, and as input for protein structure prediction. A slower but more accurate variant of the progressive method is known as T-Coffee. [16]

Iterative methods Edit

Iterative methods attempt to improve on the heavy dependence on the accuracy of the initial pairwise alignments, which is the weak point of the progressive methods. Iterative methods optimize an objective function based on a selected alignment scoring method by assigning an initial global alignment and then realigning sequence subsets. The realigned subsets are then themselves aligned to produce the next iteration's multiple sequence alignment. Various ways of selecting the sequence subgroups and objective function are reviewed in. [17]

Motif finding Edit

Motif finding, also known as profile analysis, constructs global multiple sequence alignments that attempt to align short conserved sequence motifs among the sequences in the query set. This is usually done by first constructing a general global multiple sequence alignment, after which the highly conserved regions are isolated and used to construct a set of profile matrices. The profile matrix for each conserved region is arranged like a scoring matrix but its frequency counts for each amino acid or nucleotide at each position are derived from the conserved region's character distribution rather than from a more general empirical distribution. The profile matrices are then used to search other sequences for occurrences of the motif they characterize. In cases where the original data set contained a small number of sequences, or only highly related sequences, pseudocounts are added to normalize the character distributions represented in the motif.

Techniques inspired by computer science Edit

A variety of general optimization algorithms commonly used in computer science have also been applied to the multiple sequence alignment problem. Hidden Markov models have been used to produce probability scores for a family of possible multiple sequence alignments for a given query set although early HMM-based methods produced underwhelming performance, later applications have found them especially effective in detecting remotely related sequences because they are less susceptible to noise created by conservative or semiconservative substitutions. [18] Genetic algorithms and simulated annealing have also been used in optimizing multiple sequence alignment scores as judged by a scoring function like the sum-of-pairs method. More complete details and software packages can be found in the main article multiple sequence alignment.

The Burrows–Wheeler transform has been successfully applied to fast short read alignment in popular tools such as Bowtie and BWA. See FM-index.

Structural alignments, which are usually specific to protein and sometimes RNA sequences, use information about the secondary and tertiary structure of the protein or RNA molecule to aid in aligning the sequences. These methods can be used for two or more sequences and typically produce local alignments however, because they depend on the availability of structural information, they can only be used for sequences whose corresponding structures are known (usually through X-ray crystallography or NMR spectroscopy). Because both protein and RNA structure is more evolutionarily conserved than sequence, [19] structural alignments can be more reliable between sequences that are very distantly related and that have diverged so extensively that sequence comparison cannot reliably detect their similarity.

Structural alignments are used as the "gold standard" in evaluating alignments for homology-based protein structure prediction [20] because they explicitly align regions of the protein sequence that are structurally similar rather than relying exclusively on sequence information. However, clearly structural alignments cannot be used in structure prediction because at least one sequence in the query set is the target to be modeled, for which the structure is not known. It has been shown that, given the structural alignment between a target and a template sequence, highly accurate models of the target protein sequence can be produced a major stumbling block in homology-based structure prediction is the production of structurally accurate alignments given only sequence information. [20]


The DALI method, or distance matrix alignment, is a fragment-based method for constructing structural alignments based on contact similarity patterns between successive hexapeptides in the query sequences. [21] It can generate pairwise or multiple alignments and identify a query sequence's structural neighbors in the Protein Data Bank (PDB). It has been used to construct the FSSP structural alignment database (Fold classification based on Structure-Structure alignment of Proteins, or Families of Structurally Similar Proteins). A DALI webserver can be accessed at DALI and the FSSP is located at The Dali Database.


SSAP (sequential structure alignment program) is a dynamic programming-based method of structural alignment that uses atom-to-atom vectors in structure space as comparison points. It has been extended since its original description to include multiple as well as pairwise alignments, [22] and has been used in the construction of the CATH (Class, Architecture, Topology, Homology) hierarchical database classification of protein folds. [23] The CATH database can be accessed at CATH Protein Structure Classification.

Combinatorial extension Edit

The combinatorial extension method of structural alignment generates a pairwise structural alignment by using local geometry to align short fragments of the two proteins being analyzed and then assembles these fragments into a larger alignment. [24] Based on measures such as rigid-body root mean square distance, residue distances, local secondary structure, and surrounding environmental features such as residue neighbor hydrophobicity, local alignments called "aligned fragment pairs" are generated and used to build a similarity matrix representing all possible structural alignments within predefined cutoff criteria. A path from one protein structure state to the other is then traced through the matrix by extending the growing alignment one fragment at a time. The optimal such path defines the combinatorial-extension alignment. A web-based server implementing the method and providing a database of pairwise alignments of structures in the Protein Data Bank is located at the Combinatorial Extension website.

Phylogenetics and sequence alignment are closely related fields due to the shared necessity of evaluating sequence relatedness. [25] The field of phylogenetics makes extensive use of sequence alignments in the construction and interpretation of phylogenetic trees, which are used to classify the evolutionary relationships between homologous genes represented in the genomes of divergent species. The degree to which sequences in a query set differ is qualitatively related to the sequences' evolutionary distance from one another. Roughly speaking, high sequence identity suggests that the sequences in question have a comparatively young most recent common ancestor, while low identity suggests that the divergence is more ancient. This approximation, which reflects the "molecular clock" hypothesis that a roughly constant rate of evolutionary change can be used to extrapolate the elapsed time since two genes first diverged (that is, the coalescence time), assumes that the effects of mutation and selection are constant across sequence lineages. Therefore, it does not account for possible difference among organisms or species in the rates of DNA repair or the possible functional conservation of specific regions in a sequence. (In the case of nucleotide sequences, the molecular clock hypothesis in its most basic form also discounts the difference in acceptance rates between silent mutations that do not alter the meaning of a given codon and other mutations that result in a different amino acid being incorporated into the protein). More statistically accurate methods allow the evolutionary rate on each branch of the phylogenetic tree to vary, thus producing better estimates of coalescence times for genes.

Progressive multiple alignment techniques produce a phylogenetic tree by necessity because they incorporate sequences into the growing alignment in order of relatedness. Other techniques that assemble multiple sequence alignments and phylogenetic trees score and sort trees first and calculate a multiple sequence alignment from the highest-scoring tree. Commonly used methods of phylogenetic tree construction are mainly heuristic because the problem of selecting the optimal tree, like the problem of selecting the optimal multiple sequence alignment, is NP-hard. [26]

Assessment of significance Edit

Sequence alignments are useful in bioinformatics for identifying sequence similarity, producing phylogenetic trees, and developing homology models of protein structures. However, the biological relevance of sequence alignments is not always clear. Alignments are often assumed to reflect a degree of evolutionary change between sequences descended from a common ancestor however, it is formally possible that convergent evolution can occur to produce apparent similarity between proteins that are evolutionarily unrelated but perform similar functions and have similar structures.

In database searches such as BLAST, statistical methods can determine the likelihood of a particular alignment between sequences or sequence regions arising by chance given the size and composition of the database being searched. These values can vary significantly depending on the search space. In particular, the likelihood of finding a given alignment by chance increases if the database consists only of sequences from the same organism as the query sequence. Repetitive sequences in the database or query can also distort both the search results and the assessment of statistical significance BLAST automatically filters such repetitive sequences in the query to avoid apparent hits that are statistical artifacts.

Methods of statistical significance estimation for gapped sequence alignments are available in the literature. [25] [27] [28] [29] [30] [31] [32] [33]

Assessment of credibility Edit

Statistical significance indicates the probability that an alignment of a given quality could arise by chance, but does not indicate how much superior a given alignment is to alternative alignments of the same sequences. Measures of alignment credibility indicate the extent to which the best scoring alignments for a given pair of sequences are substantially similar. Methods of alignment credibility estimation for gapped sequence alignments are available in the literature. [34]

Scoring functions Edit

The choice of a scoring function that reflects biological or statistical observations about known sequences is important to producing good alignments. Protein sequences are frequently aligned using substitution matrices that reflect the probabilities of given character-to-character substitutions. A series of matrices called PAM matrices (Point Accepted Mutation matrices, originally defined by Margaret Dayhoff and sometimes referred to as "Dayhoff matrices") explicitly encode evolutionary approximations regarding the rates and probabilities of particular amino acid mutations. Another common series of scoring matrices, known as BLOSUM (Blocks Substitution Matrix), encodes empirically derived substitution probabilities. Variants of both types of matrices are used to detect sequences with differing levels of divergence, thus allowing users of BLAST or FASTA to restrict searches to more closely related matches or expand to detect more divergent sequences. Gap penalties account for the introduction of a gap - on the evolutionary model, an insertion or deletion mutation - in both nucleotide and protein sequences, and therefore the penalty values should be proportional to the expected rate of such mutations. The quality of the alignments produced therefore depends on the quality of the scoring function.

It can be very useful and instructive to try the same alignment several times with different choices for scoring matrix and/or gap penalty values and compare the results. Regions where the solution is weak or non-unique can often be identified by observing which regions of the alignment are robust to variations in alignment parameters.

Sequenced RNA, such as expressed sequence tags and full-length mRNAs, can be aligned to a sequenced genome to find where there are genes and get information about alternative splicing [35] and RNA editing. [36] Sequence alignment is also a part of genome assembly, where sequences are aligned to find overlap so that contigs (long stretches of sequence) can be formed. [37] Another use is SNP analysis, where sequences from different individuals are aligned to find single basepairs that are often different in a population. [38]

The methods used for biological sequence alignment have also found applications in other fields, most notably in natural language processing and in social sciences, where the Needleman-Wunsch algorithm is usually referred to as Optimal matching. [39] Techniques that generate the set of elements from which words will be selected in natural-language generation algorithms have borrowed multiple sequence alignment techniques from bioinformatics to produce linguistic versions of computer-generated mathematical proofs. [40] In the field of historical and comparative linguistics, sequence alignment has been used to partially automate the comparative method by which linguists traditionally reconstruct languages. [41] Business and marketing research has also applied multiple sequence alignment techniques in analyzing series of purchases over time. [42]

A more complete list of available software categorized by algorithm and alignment type is available at sequence alignment software, but common software tools used for general sequence alignment tasks include ClustalW2 [43] and T-coffee [44] for alignment, and BLAST [45] and FASTA3x [46] for database searching. Commercial tools such as DNASTAR Lasergene, Geneious, and PatternHunter are also available. Tools annotated as performing sequence alignment are listed in the registry.

Alignment algorithms and software can be directly compared to one another using a standardized set of benchmark reference multiple sequence alignments known as BAliBASE. [47] The data set consists of structural alignments, which can be considered a standard against which purely sequence-based methods are compared. The relative performance of many common alignment methods on frequently encountered alignment problems has been tabulated and selected results published online at BAliBASE. [48] [49] A comprehensive list of BAliBASE scores for many (currently 12) different alignment tools can be computed within the protein workbench STRAP. [50]


ABO Histo-blood Groups and Cancer Laboratory, Cancer Genetics and Epigenetics Program, Institut de Medicina Predictiva i Personalitzada del Càncer (IMPPC), Campus Can Ruti, Badalona, Catalonia, Spain

Fumiichiro Yamamoto, Emili Cid & Miyako Yamamoto

Division of Population Genetics, National Institute of Genetics, Mishima, Japan

IBE - Institute of Evolutionary Biology (UPF-CSIC), Universitat Pompeu Fabra, Barcelona, Catalonia, Spain

Laboratoire d'Immunogénétique Moléculaire (LIMT, EA3034), Faculté de Médecine Purpan, Université Paul Sabatier, (Université de Toulouse III), Toulouse, France

Watch the video: GoldCyto DNA Tutorial (December 2022).