Viral taxonomy and phylogeny II
Overview
Teaching: 240 min
Exercises: 0 minObjectives
Run vConTACT3 to work out the taxonomy of your viruses
Interpret this taxonomy
Compare with relatedness of viruses based on the terL gene
What kind of phages exist in your dataset?
For this task, we will first build a phylogenetic tree using the terL marker gene, and second build a gene-sharing network using vConTACT3. Lastly, we will compare the taxonomic assignments given by vConTACT3 and the terL tree for a few viruses.
TerL phylogenetic tree
The terminase large subunit is a widely used marker gene for phages. This protein packages the DNA into an empty pro-capsid. It has two parts: 1) an ATP driven motor for viral DNA translocation and 2) a endonuclease to cleave the viral DNA to signal the end of the packaging reaction, once the capsid is full. TerL is quite essential in the assembly of full icosahedral capsids, and therefore makes a good marker gene for the Caudoviricetes order. Consider also that some phages do not require, and therefore, do not encode a terL gene.
Basics of making a tree
- Start with amino acid sequences in a fasta file
- Make a multiple sequence alignment (MSA) - this identifies biologically informative positions for tree inference
- INSPECT your MSA - you might need to trim poorly aligned columns
- Build a tree - there are several tree building algorithms - we will use maximum likelihood
Exercise - Make a terL phylogenetic tree with your sequences
For this exercise, we will use the ICTV reference phages. We have extracted the terL amino acid sequences from the ICTV reference phages using pharokka, and pre-built a multiple sequence alignment. You will have to add your sequences to this alignment using mafft and then trim the alignments using trimAl and build a tree using FastTree 2
- Get a fasta file of your terL amino acid sequences (output by Pharokka) and locate the pre-built MSA
# location of the terL sequences from pharokka output ./1.4_annotation/10_pharokka/terL.faa # location of the reference pre-built MSA /vast/groups/VEO/shared_data/Viromics2025_workspace/data/alignments/terL_ICTV_ref_phage_alignment.fasta # copy the terL alignment to your own directory mkdir -p viromics/data/alignments # make a directory if it doesn't exist cp /vast/groups/VEO/shared_data/Viromics2025_workspace/data/alignments/terL_ICTV_ref_phage_alignment.fasta viromics/data/alignments
- Add your terL sequences to the reference alignment using mafft. You need the following command and it will take ~22min
/home/groups/VEO/tools/mafft/v7.505/bin/mafft --reorder --keeplength --addfragments contigs_terL.fasta reference_alignment.fasta > new_MSA.fasta
- Trim your alignment using TrimAl (-gappyout)
trimal -in new_MSA.fasta -out new_MSA_trimmed.fasta -gappyout
- Build a tree using FastTree. Please include a graphic of tree in your lab books with your own contigs labeled or highlighted somehow.
/home/groups/VEO/tools/fastTreeMP/v2.1.11/FastTreeMP -lg -pseudo < new_MSA_trimmed.fasta > new_MSA_trimmed.tree # -lg : Uses the LG model. Note: the LG model is an updated model that describes how likely it is that one amino acid is replaced by another based on patterns found in real protein families. The default model for Fasttree is the JTT - which is older but similar model. # -pseudo : Adds pseudo counts (for gappy alignments)
sbatch script, from terL sequences to tree
#!/bin/bash #SBATCH --tasks=1 #SBATCH --cpus-per-task=10 #SBATCH --partition=short,standard,interactive #SBATCH --mem=50G #SBATCH --time=2:00:00 #SBATCH --job-name=terL_tree #SBATCH --output=./2.2_taxonomy/10_terL_tree/terL.slurm.out.%j #SBATCH --error=./2.2_taxonomy/10_terL_tree/terL.slurm.err.%j # Run mafft to add the sequences mafft="/home/groups/VEO/tools/mafft/v7.505/bin/mafft" new_seqs="../1.4_gene_annotation/20_pharokka/20_meta_hmm_phanotate/terL.faa" # have to add sequences to untrimmed alignment, since mafft doesn't know what was trimmed $mafft --add $new_seqs --reorder viromics/data/alignments/terL_ICTV_ref_phage_alignment.fasta > ./10_terL_tree/terL_MSA.fasta # Trim the alignment using trimAl trimal="/home/groups/VEO/tools/trimal/v1.5.0/trimal/source/trimal" $trimal -in ./10_terL_tree/terL_MSA.fasta -out ./10_terL_tree/terL_MSA_trimmed.fasta -gappyout # Build a tree using fasttree fastTreeMP="/home/groups/VEO/tools/fastTreeMP/v2.1.11/FastTreeMP" $fastTreeMP -lg -pseudo < ./10_terL_tree/terL_MSA_trimmed.fasta > ./10_terL_tree/terL_MSA_trimmed.tree
TerL tree outputs
Phylogenetic trees come in several output types but here the above commands produce a tree in a newick format. This is a text file and you can open it with any text editor to see what it looks like. It takes the shape: (A:0.1, B:0.2, (C:0.3, D:0.4)E:0.5)F;
This .tree
file can be opened with dendroscope (if you downloaded it locally) or iTOL or another tree-visualization software.
The final terL tree might look something like this:
The full terL tree will have 5000+ leaves, which might be difficult to load into tree-viewers. Instead, we will post-process the tree using the ete3
library in Python. Two scripts are available for this and you can choose either one to produce the output tree to add to your lab books. The locations of the tree-pruning scripts will be in your python_scripts directory python_scripts/collapse_non_target_clades.py
Prune the tree
# trim_tree_to_500_neighbors.py
User inputs:
- Newick
.tree
file - A user-selected viral contig of interest
What the script does: Prunes the terL tree around your selected contig by selecting 500 of the closest neighbours (based on branch lengths)
Script outputs:
- A pruned tree with only 500 leaves
- An itol annotation file
Usage
python3 trim_tree_to_500_neighbors.py terL_MSA_trimmed.tree contig_name terl_contigXX_pruned.tree contigXX_itol_annotation.txt
Collapse tree branches
# collapse_non_target_clades.py
User inputs:
- Newick
.tree
file - A file
user_leaves.txt
containing a list of contig ids of interest like below# user_leaves.txt contig_0001_CDS_0001 contig_0002_CDS_0005 contig_0003_CDS_0023
What the script does: Collapses clades in the tree that are over 100 leaves and do not contain your contigs. Note: these clades are no longer available to view in the collapsed tree.
Script outputs:
- A tree with collapsed clades
- An itol annotation file for the user-selected contigs
Usage
python3 collapse_non_target_clades.py terL_MSA_trimmed.tree user_leaves.txt terl_collapsed.tree collapsed_itol_annotation.txt
Exercise - Post-process your tree
- Use the above scripts to either prune or collapse your tree outputs. SEE: Exercise question #10 below - you might want to plot this ‘pet contig’ here
sbatch script to post-process tree
#!/bin/bash #SBATCH --tasks=1 #SBATCH --cpus-per-task=10 #SBATCH --partition=short,standard,interactive #SBATCH --mem=50G #SBATCH --time=2:00:00 #SBATCH --job-name=tree_postprocess #SBATCH --output=./2.2_taxonomy/10_terL_tree/tree_postprocess.slurm.out.%j #SBATCH --error=./2.2_taxonomy/10_terL_tree/tree_postprocess.slurm.err.%j # activate the python virtual environment with the packages we need source ../py3env/bin/activate # change your contig name here contig_name=contig_0001_CDS_0001 # prune the tree based on your selected contig python3 python_scripts/trim_tree_to_500_neighbors.py ./2.2_taxonomy/10_terL_tree/terL_MSA_trimmed.tree $contig_name ./2.2_taxonomy/10_terL_tree/terl_${contig_name}_pruned.tree ./2.2_taxonomy/10_terL_tree/${contig_name}_itol_annotation.txt # collapse tree branches python3 python_scripts/collapse_non_target_clades.py ./2.2_taxonomy/10_terL_tree/terL_MSA_trimmed.tree ./2.2_taxonomy/10_terL_tree/user_leaves.txt ./2.2_taxonomy/10_terL_tree/terl_collapsed.tree ./2.2_taxonomy/10_terL_tree/collapsed_itol_annotation.txt
We will visualize the tree together on iTOL with a demo :)
vConTACT3
vConTACT3 has an underlying assumption that the fraction of shared genes between two viruses represents their evolutionary relationship. The vConTACT3 gene-sharing network closely correlates with the ICTV taxonomy.
Exercise - run vConTACT3
- Run vcontact3 on your virome dataset
You will need the following commands for the conda environment and database
source /vast/groups/VEO/tools/miniconda3_2024/etc/profile.d/conda.sh conda activate vcontact3_3.0.5-0 # Need to use with db path? vcontact3 --db-path /veodata/03/databases/vcontact3/v20250721/
See your options:
vcontact3 run --help
Example command
Note: Take care to request enough memory (RAM) for processing your data by adjusting the –mem=xxG command in your sbatch script. vConTACT3 loads a full viral network into a graph
vcontact3 run --nucleotide virus_genomes.fasta --output output_directory --db-path /veodata/03/databases/vcontact3/v20250721/ -e cytoscape # other commands you might use # -t : threads # --distance-metric : {VirClust,SqRoot,Shorter,Jaccard} # --min-shared-genes : minimum number of shared genes for two genomes to be connected # -e : exporting formats/files
sbatch script to run vcontact3
#!/bin/bash #SBATCH --tasks=1 #SBATCH --cpus-per-task=10 #SBATCH --partition=short,standard,interactive #SBATCH --mem=50G #SBATCH --time=2:00:00 #SBATCH --job-name=vcontact3 #SBATCH --output=./2.2_taxonomy/20_vcontact_results/vcontact3.slurm.out.%j #SBATCH --error=./2.2_taxonomy/20_vcontact_results/vcontact3.slurm.err.%j # activate conda environment source /vast/groups/VEO/tools/miniconda3_2024/etc/profile.d/conda.sh conda activate vcontact3_3.0.5-0 database='/veodata/03/databases/vcontact3/v20250721/' assembly='./1.3_virus_identification/40_results_filter_contigs/assembly.fasta' outdir='./2.2_taxonomy/20_vcontact_results' #options # -e : how do you want the output formatted? See the vcontact3 help for more options vcontact3 run --nucleotide $assembly --output $outdir --db-path $database -e cytoscape
Output hints
final_assignments.csv
contains your taxonomic assignments
.cyjs
files are network files for Cytoscape
.pkl.gz
files are compressed pickle files (read with Python-pandas)
Exercise - comparing your taxonomic annotations
General Questions
- What percentage of your total viral contigs were classified by vConTACT3?
- How can you judge the accuracy of your taxonomic classifications?
- What is the maximum taxonomic distance between viruses that are still connected in the vConTACT3 graph?
Pick one pet viral contig that has a terL gene and also a taxonomic classification by vConTACT3
- What is the taxonomy classifcation of your pet contig from vConTACT3 and the terL tree?
- How many genes are shared with known reference viruses? (vConTACT3 results)
- If your classifications are/were different between the terL tree and the vConTACT3 - which classification would you trust more?
Optional/Group Discussions
- Explore the cytoscape network visualization and/or terL tree together
- What do long branches on your terL tree mean?
- Where are the contigs assigned to long branches in the network from vConTACT3?
- What are the limitations of using the terL tree? and could you overcome these limitations?
Hint on querying pickle files using python
import pandas as pd # read in the shared genes pickle file # this pickle file contains genes shared at even 30% identity - there are other files for 40, 50 ,60 and 70% identity df_3 = pd.read_pickle("HMMprofile.0.3_SqRoot_shared_genes.pkl.gz") df_3 # check how many genes contig_518 shares with the other contigs df_3_contig_518 = df_3.loc[df_3['contig_518'] > 0, 'contig_518'] df_3_contig_518
Key Points
VConTACT3 is used to classify viral taxonomy of sequences based on the number of shared genes
Marker genes such as the terminase large subunit (terL) can also be used to judge how related viruses are and in some cases classify the taxonomy as lower taxa ranks