Phylogeny based on marker genes
Overview
Teaching: 10 min
Exercises: 50 min
Questions
Can we use marker genes to infer phylogeny and taxonomy?
Objectives
Make a phylogeny based on terminase large subunit
As we discussed this morning, we can use the evolutionary history of certain genes to address questions about the evolution and function of viruses. Although there are no universal marker genes shared by all viruses, many viral lineages share one or more marker genes that can be used to assess relationships between the viruses carrying them. Our samples are metaviromes from the human gut. Like many biomes, the human gut virome contains many bacteriophages, and we can use the gene encoding the large subunit of the terminase (TerL) to study how they are related. TerL, present in all members of the Caudovirales, pumps the genome into an empty procapsid shell during virus maturation. It carries both enzymatic activities needed for packaging in these viruses: an adenosine triphosphatase (ATPase) that powers DNA translocation, and an endonuclease that cleaves the concatemeric genome at both initiation and completion of genome packaging.
Activate environment
# Download, create and activate environment
$ wget https://raw.githubusercontent.com/MGXlab/Viromics-Workshop-MGX/gh-pages/code/day4/day4_phyl_env.txt
$ conda create --name day4_phyl --file day4_phyl_env.txt
$ conda activate day4_phyl
TerL sequences from bins and database
Use the script get_terl_genes.py to gather the large terminases annotated in the bins. This script looks at the annotation table generated on day 3, selects the proteins annotated as terminase or large terminase subunit, and gets their sequences from the FASTA file with all the proteins. Have a look at the help message to see the parameters you need. Save the results in bins_terl.faa.
# Download the script and run it to get the terminases from the bins
$ wget https://raw.githubusercontent.com/MGXlab/Viromics-Workshop-MGX/gh-pages/code/day4/get_terl_genes.py
$ python get_terl_genes.py ...
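To make the logic of this step concrete, here is a minimal sketch of the two operations described above: picking protein identifiers by annotation, then pulling their sequences out of a FASTA file. It is not the workshop script itself. The column names `protein_id` and `annotation`, the tab-separated layout, and the toy identifiers are all assumptions made for illustration, so check your own day 3 table before reusing any of it.

```python
import csv
import io

# Annotations treated as TerL, following the lesson text (compared in lowercase).
TERL_LABELS = {"terminase", "large terminase subunit"}


def read_fasta(handle):
    """Yield (identifier, sequence) pairs from a FASTA file handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the first word of the header as the identifier.
            name, chunks = line[1:].split()[0], []
        elif line and name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def select_terl_ids(table_handle, id_col="protein_id", annot_col="annotation"):
    """Return the identifiers whose annotation is one of TERL_LABELS.

    The column names are hypothetical; adapt them to the real table.
    """
    reader = csv.DictReader(table_handle, delimiter="\t")
    return {
        row[id_col]
        for row in reader
        if row[annot_col].strip().lower() in TERL_LABELS
    }


# Toy inputs standing in for the annotation table and the protein FASTA.
table = (
    "protein_id\tannotation\n"
    "bin_01_p1\tlarge terminase subunit\n"
    "bin_01_p2\tportal protein\n"
    "bin_02_p7\tTerminase\n"
)
fasta = ">bin_01_p1 some description\nMKT\nLLA\n>bin_01_p2\nMSS\n>bin_02_p7\nMAAK\n"

wanted = select_terl_ids(io.StringIO(table))
terl = [(n, s) for n, s in read_fasta(io.StringIO(fasta)) if n in wanted]

for name, seq in terl:
    print(f">{name}\n{seq}")
```

Matching on exact labels is a deliberately strict choice. A looser substring match on "terminase" would also pick up small terminase subunits, which you would not want in a TerL tree.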
As a reference set we will use the TerL sequences found in the ICTV database. Since this database does not contain Crassvirales (also known as crassphages) yet, we will supplement it with a set of representative Crassvirales sequences. Download these sequences and merge them with the TerL sequences you just extracted from the bins.
# Download reference set
$ wget https://raw.githubusercontent.com/MGXlab/Viromics-Workshop-MGX/gh-pages/data/day_4/ictv_crass_terl.faa
# merge bins and reference sets
$ cat bins_terl.faa ictv_crass_terl.faa > bins_ictv_crass_terl.faa
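After concatenating two FASTA files it can be worth checking that no sequence identifier occurs twice, since repeated names can confuse alignment and tree-building tools. The following is a small sketch of such a check, run here on made-up header lines. The helper names are hypothetical and not part of the workshop code.

```python
from collections import Counter


def fasta_ids(lines):
    """Return the identifier (first word of each header) for every FASTA record."""
    return [
        line[1:].split()[0]
        for line in lines
        if line.startswith(">") and line[1:].strip()
    ]


def duplicated_ids(ids):
    """Return the identifiers that occur more than once, sorted alphabetically."""
    return sorted(name for name, count in Counter(ids).items() if count > 1)


# Toy lines standing in for the merged FASTA file.
example = [">bin_01_p1", "MKT", ">ref_A", "MAA", ">ref_A", "MAA"]
ids = fasta_ids(example)
print(len(ids), "sequences; duplicated:", duplicated_ids(ids))
```

To run it on the real data, pass an open file handle for bins_ictv_crass_terl.faa to `fasta_ids` instead of the toy list.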
Multiple sequence alignment
Align the sequences using mafft. Check the manual for more information.
Alignment algorithm
Have a look at the different algorithms available with MAFFT. Which one do you think best fits our data? Can you use one of the most accurate?
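The choice of algorithm depends largely on how many sequences you have and how long they are. The MAFFT manual describes its most accurate iterative methods as suited to small datasets (on the order of a couple of hundred sequences) and its progressive methods as suited to larger ones. Treat those figures as a rule of thumb to confirm in the manual. The sketch below only summarizes a FASTA file so you can make that call. It does not run MAFFT, and the function name is hypothetical.

```python
def summarize_fasta(lines):
    """Return the number of sequences and their length range for FASTA lines."""
    lengths = []
    current = None  # residues counted so far for the record being read
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return {
        "n_sequences": len(lengths),
        "min_length": min(lengths, default=0),
        "max_length": max(lengths, default=0),
    }


# Toy lines standing in for the merged FASTA file.
example = [">a", "MKTLL", "AA", ">b", "MK", ">c", "MKTLLAAGG"]
summary = summarize_fasta(example)
print(summary)
```

For the real dataset, pass an open file handle for bins_ictv_crass_terl.faa and compare the counts with the recommendations in the MAFFT manual.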
Once MAFFT has finished, use trimal to remove positions in the alignment with gaps in more than 50% of the sequences. Check the help message to learn more about the parameters.
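To illustrate what this filtering does, here is a minimal pure-Python sketch of gap-based column trimming on a toy alignment. It shows the idea only and is not a replacement for trimal. The function name and the example sequences are made up for this illustration.

```python
def trim_gappy_columns(aligned, max_gap_fraction=0.5):
    """Drop alignment columns whose gap fraction exceeds max_gap_fraction.

    `aligned` maps sequence names to aligned sequences of equal length.
    A column with exactly max_gap_fraction gaps is kept.
    """
    seqs = list(aligned.values())
    if not seqs:
        return {}
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all aligned sequences must have the same length")
    keep = [
        i
        for i in range(length)
        if sum(s[i] == "-" for s in seqs) / len(seqs) <= max_gap_fraction
    ]
    return {name: "".join(seq[i] for i in keep) for name, seq in aligned.items()}


# Toy alignment: columns 3 and 6 (1-based) are gaps in three of four sequences.
alignment = {
    "seq1": "MK-TL-A",
    "seq2": "MKATL--",
    "seq3": "M--TLGA",
    "seq4": "MK-T--A",
}
trimmed = trim_gappy_columns(alignment)
for name, seq in trimmed.items():
    print(name, seq)
```

Columns that are mostly gaps carry little phylogenetic signal and often reflect insertions in only a few sequences, which is why they are removed before tree inference.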
Infer the TerL phylogeny
Use fasttree to infer the TerL phylogeny from the multiple sequence alignment. Once finished, upload the tree to iToL. Add taxonomic annotation (itol_ictv_crass_colors.txt) in the Datasets section for a better understanding of the tree.
# Download reference set annotation
$ wget https://raw.githubusercontent.com/MGXlab/Viromics-Workshop-MGX/gh-pages/data/day_4/itol_ictv_crass_colors.txt
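When exploring the tree, it helps to know which leaf labels come from your bins rather than from the reference set. The sketch below extracts leaf names from a Newick string with a regular expression and separates them by comparing against a set of reference identifiers. It assumes unquoted leaf names without spaces or Newick special characters. The tree string and identifiers are invented for illustration.

```python
import re


def leaf_names(newick):
    """Return leaf labels from a Newick string, in order of appearance.

    A leaf label directly follows "(" or ",". Support values and branch
    lengths follow ")" or ":", so they are not matched.
    """
    return re.findall(r"[(,]\s*([^\s(),:;]+)", newick)


# Toy tree with support values, standing in for the inferred TerL tree.
tree = (
    "((bin_01_p1:0.1,ref_A:0.2)0.95:0.05,"
    "(ref_B:0.3,bin_02_p7:0.1)0.80:0.02,ref_C:0.4);"
)
# Hypothetical reference identifiers; in practice, read them from the
# headers of ictv_crass_terl.faa.
reference_ids = {"ref_A", "ref_B", "ref_C"}

leaves = leaf_names(tree)
bin_leaves = [name for name in leaves if name not in reference_ids]
print("leaves:", leaves)
print("from bins:", bin_leaves)
```

With the list of bin-derived leaves in hand, you can look those labels up in the tree viewer and see which reference clades they sit in.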
Bins in the tree
- Where do the terminases from our bins fall?
Viral families in the tree
- Do the families cluster, and can you explain why?
Check this taxonomy proposal from ICTV and Turner et al., 2021
Key Points