Viral taxonomy and phylogeny II

Overview

Teaching: 240 min
Exercises: 0 min
Objectives
  • Run vConTACT3 to work out the taxonomy of your viruses

  • Interpret this taxonomy

  • Compare with relatedness of viruses based on the terL gene

What kind of phages exist in your dataset?

For this task, we will first build a phylogenetic tree using the terL marker gene, and second build a gene-sharing network using vConTACT3. Lastly, we will compare the taxonomic assignments given by vConTACT3 and the terL tree for a few viruses.

TerL phylogenetic tree

The terminase large subunit is a widely used marker gene for phages. This protein packages the DNA into an empty pro-capsid. It has two parts: 1) an ATP driven motor for viral DNA translocation and 2) a endonuclease to cleave the viral DNA to signal the end of the packaging reaction, once the capsid is full. TerL is quite essential in the assembly of full icosahedral capsids, and therefore makes a good marker gene for the Caudoviricetes order. Consider also that some phages do not require, and therefore, do not encode a terL gene.

Basics of making a tree

  1. Start with amino acid sequences in a fasta file
  2. Make a multiple sequence alignment (MSA) - this identifies biologically informative positions for tree inference
  3. INSPECT your MSA - you might need to trim poorly aligned columns
  4. Build a tree - there are several tree building algorithms - we will use maximum likelihood

Exercise - Make a terL phylogenetic tree with your sequences

For this exercise, we will use the ICTV reference phages. We have extracted the terL amino acid sequences from the ICTV reference phages using pharokka, and pre-built a multiple sequence alignment. You will have to add your sequences to this alignment using mafft and then trim the alignments using trimAl and build a tree using FastTree 2

  1. Get a fasta file of your terL amino acid sequences (output by Pharokka) and locate the pre-built MSA
# location of the terL sequences from pharokka output
./1.4_annotation/10_pharokka/terL.faa

# location of the reference pre-built MSA
/vast/groups/VEO/shared_data/Viromics2025_workspace/data/alignments/terL_ICTV_ref_phage_alignment.fasta

# copy the terL alignment to your own directory
mkdir -p viromics/data/alignments   # make a directory if it doesn't exist
cp /vast/groups/VEO/shared_data/Viromics2025_workspace/data/alignments/terL_ICTV_ref_phage_alignment.fasta viromics/data/alignments
  1. Add your terL sequences to the reference alignment using mafft. You need the following command and it will take ~22min
/home/groups/VEO/tools/mafft/v7.505/bin/mafft --reorder --keeplength --addfragments contigs_terL.fasta  reference_alignment.fasta >  new_MSA.fasta
  1. Trim your alignment using TrimAl (-gappyout)
trimal -in new_MSA.fasta -out new_MSA_trimmed.fasta -gappyout
  1. Build a tree using FastTree. Please include a graphic of tree in your lab books with your own contigs labeled or highlighted somehow.
/home/groups/VEO/tools/fastTreeMP/v2.1.11/FastTreeMP -lg -pseudo < new_MSA_trimmed.fasta > new_MSA_trimmed.tree

# -lg : Uses the LG model. Note: the LG model is an updated model that describes how likely it is that one amino acid is replaced by another based on patterns found in real protein families. The default model for Fasttree is the JTT - which is older but similar model.
# -pseudo : Adds pseudo counts (for gappy alignments)

sbatch script, from terL sequences to tree

#!/bin/bash
#SBATCH --tasks=1
#SBATCH --cpus-per-task=10
#SBATCH --partition=short,standard,interactive
#SBATCH --mem=50G
#SBATCH --time=2:00:00
#SBATCH --job-name=terL_tree
#SBATCH --output=./2.2_taxonomy/10_terL_tree/terL.slurm.out.%j
#SBATCH --error=./2.2_taxonomy/10_terL_tree/terL.slurm.err.%j

# Run mafft to add the sequences
mafft="/home/groups/VEO/tools/mafft/v7.505/bin/mafft"
new_seqs="../1.4_gene_annotation/20_pharokka/20_meta_hmm_phanotate/terL.faa"

# have to add sequences to untrimmed alignment, since mafft doesn't know what was trimmed
$mafft --add $new_seqs --reorder viromics/data/alignments/terL_ICTV_ref_phage_alignment.fasta > ./10_terL_tree/terL_MSA.fasta

# Trim the alignment using trimAl
trimal="/home/groups/VEO/tools/trimal/v1.5.0/trimal/source/trimal"
$trimal -in ./10_terL_tree/terL_MSA.fasta -out ./10_terL_tree/terL_MSA_trimmed.fasta -gappyout

# Build a tree using fasttree
fastTreeMP="/home/groups/VEO/tools/fastTreeMP/v2.1.11/FastTreeMP"
$fastTreeMP -lg -pseudo < ./10_terL_tree/terL_MSA_trimmed.fasta > ./10_terL_tree/terL_MSA_trimmed.tree

TerL tree outputs

Phylogenetic trees come in several output types but here the above commands produce a tree in a newick format. This is a text file and you can open it with any text editor to see what it looks like. It takes the shape: (A:0.1, B:0.2, (C:0.3, D:0.4)E:0.5)F;

This .tree file can be opened with dendroscope (if you downloaded it locally) or iTOL or another tree-visualization software.

The final terL tree might look something like this:

example_unrooted_terL

The full terL tree will have 5000+ leaves, which might be difficult to load into tree-viewers. Instead, we will post-process the tree using the ete3 library in Python. Two scripts are available for this and you can choose either one to produce the output tree to add to your lab books. The locations of the tree-pruning scripts will be in your python_scripts directory python_scripts/collapse_non_target_clades.py

Prune the tree

# trim_tree_to_500_neighbors.py

User inputs:

  1. Newick .tree file
  2. A user-selected viral contig of interest

What the script does: Prunes the terL tree around your selected contig by selecting 500 of the closest neighbours (based on branch lengths)

Script outputs:

  1. A pruned tree with only 500 leaves
  2. An itol annotation file

Usage

python3 trim_tree_to_500_neighbors.py terL_MSA_trimmed.tree contig_name terl_contigXX_pruned.tree contigXX_itol_annotation.txt

Collapse tree branches

# collapse_non_target_clades.py

User inputs:

  1. Newick .tree file
  2. A file user_leaves.txt containing a list of contig ids of interest like below
    # user_leaves.txt
    contig_0001_CDS_0001
    contig_0002_CDS_0005
    contig_0003_CDS_0023
    

What the script does: Collapses clades in the tree that are over 100 leaves and do not contain your contigs. Note: these clades are no longer available to view in the collapsed tree.

Script outputs:

  1. A tree with collapsed clades
  2. An itol annotation file for the user-selected contigs

Usage

python3 collapse_non_target_clades.py terL_MSA_trimmed.tree user_leaves.txt terl_collapsed.tree collapsed_itol_annotation.txt

Exercise - Post-process your tree

  1. Use the above scripts to either prune or collapse your tree outputs. SEE: Exercise question #10 below - you might want to plot this ‘pet contig’ here

sbatch script to post-process tree

#!/bin/bash
#SBATCH --tasks=1
#SBATCH --cpus-per-task=10
#SBATCH --partition=short,standard,interactive
#SBATCH --mem=50G
#SBATCH --time=2:00:00
#SBATCH --job-name=tree_postprocess
#SBATCH --output=./2.2_taxonomy/10_terL_tree/tree_postprocess.slurm.out.%j
#SBATCH --error=./2.2_taxonomy/10_terL_tree/tree_postprocess.slurm.err.%j

# activate the python virtual environment with the packages we need
source ../py3env/bin/activate

# change your contig name here
contig_name=contig_0001_CDS_0001

# prune the tree based on your selected contig
python3 python_scripts/trim_tree_to_500_neighbors.py ./2.2_taxonomy/10_terL_tree/terL_MSA_trimmed.tree $contig_name ./2.2_taxonomy/10_terL_tree/terl_${contig_name}_pruned.tree ./2.2_taxonomy/10_terL_tree/${contig_name}_itol_annotation.txt

# collapse tree branches
python3 python_scripts/collapse_non_target_clades.py ./2.2_taxonomy/10_terL_tree/terL_MSA_trimmed.tree ./2.2_taxonomy/10_terL_tree/user_leaves.txt ./2.2_taxonomy/10_terL_tree/terl_collapsed.tree ./2.2_taxonomy/10_terL_tree/collapsed_itol_annotation.txt

We will visualize the tree together on iTOL with a demo :)

vConTACT3

vConTACT3 has an underlying assumption that the fraction of shared genes between two viruses represents their evolutionary relationship. The vConTACT3 gene-sharing network closely correlates with the ICTV taxonomy.

Exercise - run vConTACT3

  1. Run vcontact3 on your virome dataset

You will need the following commands for the conda environment and database

source /vast/groups/VEO/tools/miniconda3_2024/etc/profile.d/conda.sh
conda activate vcontact3_3.0.5-0

# Need to use with db path? 
vcontact3 --db-path /veodata/03/databases/vcontact3/v20250721/

See your options:

vcontact3 run --help

Example command

Note: Take care to request enough memory (RAM) for processing your data by adjusting the –mem=xxG command in your sbatch script. vConTACT3 loads a full viral network into a graph

vcontact3 run --nucleotide virus_genomes.fasta --output output_directory --db-path /veodata/03/databases/vcontact3/v20250721/ -e cytoscape

# other commands you might use
# -t : threads
# --distance-metric : {VirClust,SqRoot,Shorter,Jaccard}
# --min-shared-genes : minimum number of shared genes for two genomes to be connected
# -e : exporting formats/files

sbatch script to run vcontact3

#!/bin/bash
#SBATCH --tasks=1
#SBATCH --cpus-per-task=10
#SBATCH --partition=short,standard,interactive
#SBATCH --mem=50G
#SBATCH --time=2:00:00
#SBATCH --job-name=vcontact3
#SBATCH --output=./2.2_taxonomy/20_vcontact_results/vcontact3.slurm.out.%j
#SBATCH --error=./2.2_taxonomy/20_vcontact_results/vcontact3.slurm.err.%j

# activate conda environment
source /vast/groups/VEO/tools/miniconda3_2024/etc/profile.d/conda.sh
conda activate vcontact3_3.0.5-0

database='/veodata/03/databases/vcontact3/v20250721/'
assembly='./1.3_virus_identification/40_results_filter_contigs/assembly.fasta'
outdir='./2.2_taxonomy/20_vcontact_results'

#options
# -e : how do you want the output formatted? See the vcontact3 help for more options

vcontact3 run --nucleotide $assembly --output $outdir --db-path $database -e cytoscape

Output hints

final_assignments.csv contains your taxonomic assignments .cyjs files are network files for Cytoscape .pkl.gz files are compressed pickle files (read with Python-pandas)

Exercise - comparing your taxonomic annotations

General Questions

  1. What percentage of your total viral contigs were classified by vConTACT3?
  2. How can you judge the accuracy of your taxonomic classifications?
  3. What is the maximum taxonomic distance between viruses that are still connected in the vConTACT3 graph?

Pick one pet viral contig that has a terL gene and also a taxonomic classification by vConTACT3

  1. What is the taxonomy classifcation of your pet contig from vConTACT3 and the terL tree?
  2. How many genes are shared with known reference viruses? (vConTACT3 results)
  3. If your classifications are/were different between the terL tree and the vConTACT3 - which classification would you trust more?

Optional/Group Discussions

  1. Explore the cytoscape network visualization and/or terL tree together
  2. What do long branches on your terL tree mean?
  3. Where are the contigs assigned to long branches in the network from vConTACT3?
  4. What are the limitations of using the terL tree? and could you overcome these limitations?

Hint on querying pickle files using python

import pandas as pd

# read in the shared genes pickle file
# this pickle file contains genes shared at even 30% identity - there are other files for 40, 50 ,60 and 70% identity
df_3 = pd.read_pickle("HMMprofile.0.3_SqRoot_shared_genes.pkl.gz")
df_3
# check how many genes contig_518 shares with the other contigs
df_3_contig_518 = df_3.loc[df_3['contig_518'] > 0, 'contig_518']
df_3_contig_518

Key Points

  • VConTACT3 is used to classify viral taxonomy of sequences based on the number of shared genes

  • Marker genes such as the terminase large subunit (terL) can also be used to judge how related viruses are and in some cases classify the taxonomy as lower taxa ranks