Phylogeny based on marker genes

Overview

Teaching: 10 min
Exercises: 50 min
Questions
  • Can we use marker genes to infer phylogeny and taxonomy?

Objectives
  • Make a phylogeny based on terminase large subunit

As we discussed this morning, we can use the evolutionary history of certain genes to address questions about the evolution and function of viruses. Although there is no universal marker gene shared by all viruses, many viral lineages share one or more marker genes that can be used to assess the relationships between the viruses carrying them. Our samples were metaviromes from the human gut. Like many biomes, the human gut virome contains many bacteriophages, and we can use the gene encoding the large subunit of the terminase (TerL) to study how they are related. TerL, present in all members of the Caudovirales, pumps the genome into an empty procapsid shell during virus maturation. It carries both enzymatic activities needed for packaging in these viruses: an adenosine triphosphatase (ATPase) that powers DNA translocation and an endonuclease that cleaves the concatemeric genome at both the initiation and the completion of genome packaging.

Activate environment

# Download, create and activate environment
$ wget https://raw.githubusercontent.com/MGXlab/Viromics-Workshop-MGX/gh-pages/code/day4/day4_phyl_env.txt
$ conda create --name day4_phyl --file day4_phyl_env.txt
$ conda activate day4_phyl

TerL sequences from bins and database

Use the script get_terl_genes.py to gather the large terminases annotated in the bins. This script looks at the annotation table generated on day 3, selects the proteins annotated as terminase or large terminase subunit, and gets their sequences from the FASTA file with all the proteins. Have a look at the help message to see the parameters you need. Save the results in bins_terl.faa.

# Download the script and run it to get the terminases from the bins
$ wget https://raw.githubusercontent.com/MGXlab/Viromics-Workshop-MGX/gh-pages/code/day4/get_terl_genes.py
$ python get_terl_genes.py ...
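The selection step described above can be sketched roughly as follows. This is not the workshop script, whose parameters are listed in its help message. The tab-separated layout, the column names `protein_id` and `annotation`, the exclusion of annotations mentioning "small", and all function names are assumptions made for illustration.

```python
import csv


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def select_terl_ids(table_lines):
    """Return the IDs of proteins whose annotation mentions a terminase.

    Assumed table layout: tab-separated, with a header row containing
    the hypothetical columns 'protein_id' and 'annotation'.
    """
    ids = set()
    for row in csv.DictReader(table_lines, delimiter="\t"):
        annotation = row["annotation"].lower()
        # Assumption: annotations mentioning 'small' are small subunits.
        if "terminase" in annotation and "small" not in annotation:
            ids.add(row["protein_id"])
    return ids


def extract_terl(table_lines, fasta_lines):
    """Return (header, sequence) pairs for the selected terminases."""
    wanted = select_terl_ids(table_lines)
    return [
        (header, seq)
        for header, seq in read_fasta(fasta_lines)
        # The sequence ID is taken as the first word of the FASTA header.
        if header.split()[0] in wanted
    ]
```

Both functions accept any iterable of lines, so open file handles for the annotation table and the protein FASTA file can be passed in directly.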

As a reference set we will use the TerL sequences found in the ICTV database. Since this database does not contain Crassvirales (aka crassphages) yet, we will supplement it with a set of representative Crassvirales sequences. Download these sequences and merge them with the TerL sequences you just extracted from the bins.

# Download reference set
$ wget https://raw.githubusercontent.com/MGXlab/Viromics-Workshop-MGX/gh-pages/data/day_4/ictv_crass_terl.faa

# Merge bins and reference sets
$ cat bins_terl.faa ictv_crass_terl.faa > bins_ictv_crass_terl.faa
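Before aligning, it can help to check what ended up in the merged file. The sketch below is a hypothetical helper, not part of the workshop code. It counts the sequences, reports the longest one, and lists identifiers that occur more than once, since duplicated identifiers tend to cause trouble in the later tree steps.

```python
from collections import Counter


def fasta_summary(lines):
    """Summarise an iterable of FASTA lines.

    Returns a dict with the number of sequences, the longest sequence
    length, and a sorted list of identifiers seen more than once.
    """
    ids, lengths = [], []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            # The identifier is taken as the first word of the header.
            words = line[1:].split()
            ids.append(words[0] if words else "")
            lengths.append(0)
        elif line and lengths:
            lengths[-1] += len(line)
    duplicates = sorted(name for name, n in Counter(ids).items() if n > 1)
    return {
        "n_sequences": len(ids),
        "max_length": max(lengths, default=0),
        "duplicates": duplicates,
    }
```

For example, `fasta_summary(open("bins_ictv_crass_terl.faa"))` would summarise the merged file. The number of sequences and their lengths are also useful to know when choosing an alignment strategy.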

Multiple sequence alignment

Align the sequences using mafft. Check the manual for more information.

Alignment algorithm

Have a look at the different algorithms available in MAFFT. Which one do you think best fits our data? Can you use one of the most accurate ones?

Once MAFFT has finished, use trimal to remove positions in the alignment with gaps in more than 50% of the sequences. Check the help message to learn more about the parameters.
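To make the trimming criterion concrete, here is a minimal sketch of the idea in plain Python. It is not how trimal is implemented and does not replace it. The function name and the dict-based alignment representation are assumptions made for illustration. A column is dropped when the fraction of sequences with a gap at that position is above the threshold.

```python
def trim_gappy_columns(alignment, max_gap_fraction=0.5):
    """Drop alignment columns with too many gaps.

    alignment: dict mapping sequence name to aligned sequence
               (all sequences must have the same length).
    A column is removed when MORE than max_gap_fraction of the
    sequences have a gap ('-') at that position.
    """
    seqs = list(alignment.values())
    if not seqs:
        return {}
    length = len(seqs[0])
    if any(len(seq) != length for seq in seqs):
        raise ValueError("aligned sequences must all have the same length")

    n_seqs = len(seqs)
    keep = [
        i
        for i in range(length)
        if sum(seq[i] == "-" for seq in seqs) / n_seqs <= max_gap_fraction
    ]
    return {
        name: "".join(seq[i] for i in keep)
        for name, seq in alignment.items()
    }
```

Note that a column with gaps in exactly half of the sequences is kept, because the criterion is "more than 50%".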

Infer the TerL phylogeny

Use fasttree to infer the TerL phylogeny from the trimmed multiple sequence alignment. Once it has finished, upload the tree to iTOL. Add the taxonomic annotation (itol_ictv_crass_colors.txt) in the Datasets section for a better understanding of the tree.

# Download reference set annotation
$ wget https://raw.githubusercontent.com/MGXlab/Viromics-Workshop-MGX/gh-pages/data/day_4/itol_ictv_crass_colors.txt

Bins in the tree

  • Where do the terminases from our bins fall?
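Besides looking at the tree in the viewer, one way to keep track of the bin sequences is to list which leaves of the tree come from the bins and which come from the reference set. The sketch below assumes a simple Newick string with unquoted leaf names, and that the bin identifiers are known (for example, the IDs in bins_terl.faa). The function names are hypothetical.

```python
import re

# A leaf name directly follows '(' or ','. Labels that follow ')'
# are internal node labels (such as support values) and are skipped.
# Assumption: names are unquoted and contain no spaces or Newick symbols.
LEAF_PATTERN = re.compile(r"[(,]\s*([^\s(),:;]+)")


def leaf_names(newick):
    """Return the leaf names of a Newick string, in order of appearance."""
    return LEAF_PATTERN.findall(newick)


def split_leaves(newick, bin_ids):
    """Split the leaves into (from_bins, from_reference) lists."""
    leaves = leaf_names(newick)
    from_bins = [name for name in leaves if name in bin_ids]
    from_reference = [name for name in leaves if name not in bin_ids]
    return from_bins, from_reference
```

Leaves that sit next to each other in the Newick string are not necessarily close relatives, so this only tells you which sequences to look for; their placement still has to be read from the tree itself.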

Viral families in the tree

  • Do the families cluster, and can you explain why?

Check this taxonomy proposal from ICTV and Turner et al., 2021

Key Points