Viral taxonomy and phylogeny II

Overview

Teaching: 240 min
Exercises: 0 min

Objectives

Run vConTACT3 to work out the taxonomy of your viruses

Interpret this taxonomy

What kind of phages exist in your dataset?

For this task, we will use vConTACT3 and geNomad.

vConTACT3

vConTACT3 has an underlying assumption that the fraction of shared genes between two viruses represents their evolutionary relationship. Thus, the vConTACT3 gene-sharing network closely correlates with the ICTV taxonomy.

Exercise - run vConTACT3

Run vcontact3 on your virome dataset

You will need the following commands for the conda environment and database

source /vast/groups/VEO/tools/anaconda3/etc/profile.d/conda.sh && conda activate mamba_20231101_python_3.9 

# Need to use with db path? 
vcontact3 --db-path /work/groups/VEO/databases/vcontact3/v20240513

See your options:

vcontact3 run --help

Example command

Note: Take care to have enough memory (RAM) for processing. vConTACT3 loads a full viral network into a graph

vcontact3 run --nucleotide virus_genomes.fasta --output output_directory --db-path /work/groups/VEO/databases/vcontact3/v20231101/ -e cytoscape,tree

# other commands you might use
# -t : threads
# --distance-metric : {VirClust,SqRoot,Shorter,Jaccard}
# --min-shared-genes : minimum number of shared genes for two genomes to be connected
# -e : exporting formats/files

sbatch script to run vcontact3

#!/bin/bash
#SBATCH --tasks=1
#SBATCH --cpus-per-task=10
#SBATCH --partition=short,standard,interactive
#SBATCH --mem=50G
#SBATCH --time=2:00:00
#SBATCH --job-name=vcontact3
#SBATCH --output=./2.2_taxonomy/10_vcontact_results/vcontact3.slurm.out.%j
#SBATCH --error=./2.2_taxonomy/10_vcontact_results/vcontact3.slurm.err.%j

# activate conda environment
source /vast/groups/VEO/tools/anaconda3/etc/profile.d/conda.sh && conda activate mamba_20231101_python_3.9

database='/work/groups/VEO/databases/vcontact3/v20231101/'
assembly='./1.3_virus_identification/40_results_filter_contigs/assembly.fasta'
outdir='./2.2_taxonomy/10_vcontact_results'

#options
# -e : how do you want the output formatted? See the vcontact3 help for more options

vcontact3 run --nucleotide $assembly --output $outdir --db-path $database -e cytoscape

Output hints

final_assignments.csv contains your taxonomic assignments .cyjs files are network files for Cytoscape .pkl.gz files are compressed pickle files (read with Python-pandas)

geNomad

geNomad has a different way of assigning taxonomy. It uses a marker-gene-to-taxonomy database. It first identifies marker genes on the contigs, classifies the taxonomy for each gene, and then consolidates it using weights. Read it here. From this pipeline, we will only use the annotate module - which includes functional annotation and taxonomic classification. geNomad can also identify and classify proviruses (integrated viruses) and plasmids.

Exercise - run geNomad

Run geNomad on your virome dataset

The conda environment you will require:
source /vast/groups/VEO/tools/anaconda3/etc/profile.d/conda.sh

conda activate genNomad_v20230721
We will only be using the annotate module from the geNomad pipeline. The basic command for this:
genomad annotate [OPTIONS] INPUT OUTPUT DATABASE

sbatch script to run genomad

#!/bin/bash
#SBATCH --tasks=1
#SBATCH --cpus-per-task=10
#SBATCH --partition=short,standard,interactive
#SBATCH --mem=50G
#SBATCH --time=2:00:00
#SBATCH --job-name=genomad
#SBATCH --output=./2.2_taxonomy/20_genomad_results/genomad.slurm.out.%j
#SBATCH --error=./2.2_taxonomy/20_genomad_results/genomad.slurm.err.%j

# activate conda environment
source /vast/groups/VEO/tools/anaconda3/etc/profile.d/conda.sh && conda activate genNomad_v20230721

database='/work/groups/VEO/databases/geNomad/v1.3'
assembly='./1.3_virus_identification/40_results_filter_contigs/assembly.fasta'
outdir='./2.2_taxonomy/20_genomad_results'

#options

genomad annotate --threads 10 $assembly $outdir $database

Outputs from genomad

phage_contigs_taxonomy.tsv : taxonomic classifications phage_contigs_genes.tsv : functional annotations

One of the advantages of running geNomad is that it also outputs functional annotations as part of its “annotate” pipeline module. These functions are assigned differently to how pharokka assigns them using a marker dataset of 227,897 profiles.

Exercise - comparing your taxonomic annotations

What percentage of your viruses were classified by each tool?

How can you judge the accuracy of your taxonomic classifications?

How many classifications agree/disagree?

What is the maximum taxonomic distance between viruses that are still connected in this graph?

What are the limitations of using these tools?

Pick one of the viral contigs annotated by vConTACT3

How many genes are shared with known reference viruses?

Which genes are shared with the other known viruses?

Visualize your virus on the network. see homework exercises here.

Which viruses are closely related? Which are not?

Hint on querying pickle files using python

import pandas as pd

# read in the shared genes pickle file
# this pickle file contains genes shared at even 30% identity - there are other files for 40, 50 ,60 and 70% identity
df_3 = pd.read_pickle("HMMprofile.0.3_SqRoot_shared_genes.pkl.gz")
df_3
# check how many genes contig_518 shares with the other contigs
df_3_contig_518 = df_3.loc[df_3['contig_518'] > 0, 'contig_518']
df_3_contig_518

Next steps - getting all your data together

Now that we have results from many different tools and programs, we can build a table to compare and correlate them. You can build this table however you would like - using excel or python etc… If you are using python, I recommend using pandas df.merge or df.join methods for this task. You will likely need to use it multiple times. Here’s a sample of the type of table you need to build:

Contig	Contig length (bp)	Coverage	Completeness	Contamination	Num Genes	Num Genes	Num Genes	Num Genes	Num Genes	Num Genes	Num Genes	Num Genes	Num Genes	Taxonomy	Taxonomy
		total			CheckV	CheckV	CheckV	Phannotate	Phannotate	Phannotate	GeNomad	GeNomad	GeNomad	vConTACT3	GeNomad
					Total	Viral	Bacterial	Total	Viral	Bacterial	Total	Viral	Bacterial
Contig_1	1000	97	0.3	50	45	2	50	47	2	50	47	2	Viruses;Duplodnaviria;Heunggongvirae;Uroviricota;Caudoviricetes	Viruses;Duplodnaviria;Heunggongvirae;Uroviricota;Caudoviricetes	unknown

py script for merging all the output tables from all the tools!

# %%
import pandas as pd
import functools as ft

assembly_filename = "assembly_info.txt"
jaeger_filename = "assembly_default_jaeger.tsv"
checkv_filename = "quality_summary.tsv"
pharokka_filename = "pharokka_cds_functions.tsv"
genomad_filename = "assembly_genes.tsv"
genomad_taxonomy_filename = "assembly_taxonomy.tsv"
vcontact_filename = "final_assignments.csv"
out_filename = "viral_contigs_all_outputs_merged.csv"


# %%
df_jaeger = pd.read_csv(jaeger_filename, header=0, sep='\t',
                       usecols=[0,2,16]).set_index('contig_id')

df_jaeger = df_jaeger.loc[(df_jaeger['prediction'] == 'phage')]

print(df_jaeger)

# %%
df_assembly = pd.read_csv(assembly_filename, header=0, sep='\t',
                         usecols=[0,1,2,3],
                         names=['contig_id','length','coverage','is_circular']).set_index('contig_id')
print(df_assembly)

# %%
df_checkv = pd.read_csv(checkv_filename, header=0, sep='\t',
                       usecols=[0,4,6,7,9,11],
                       names=['contig_id','checkv_total_genes','checkv_viral_genes','checkv_quality','completeness','contamination']).set_index('contig_id')

df_checkv.fillna(0, inplace=True)
# print(df_checkv)

df_checkv = df_checkv.loc[(df_checkv['completeness'] > 50) & (df_checkv['contamination'] < 5)]

print(df_checkv)

# %%
df_pharokka = pd.read_csv(pharokka_filename, header=0, sep='\t')

df_pharokka = df_pharokka.pivot(index='contig',columns='Description')
df_pharokka.columns = df_pharokka.columns.get_level_values(1)

#df_pharokka = df_pharokka[['Count']]
df_pharokka = df_pharokka[['CDS','CRISPRs','DNA, RNA and nucleotide metabolism','connector', 'head and packaging','integration and excision', 'lysis','moron, auxiliary metabolic gene and host takeover','other','tRNAs','tail','transcription regulation']]

df_pharokka.rename(columns={'CDS':'phanotate_total_genes'}, inplace=True)


print(df_pharokka)

# %%
# genomad genes

df_genomad_genes = pd.read_csv(genomad_filename, header=0, sep='\t',
                      usecols=[0,8],
                      names=['contig_id_cds','marker'])
#print(df_genomad_genes)

# split the column, join to df and add a column name
df_genomad_genes = df_genomad_genes.join(df_genomad_genes['contig_id_cds'].str.rsplit('_',n=1, expand=True).add_prefix('A'))

# rename the columns
df_genomad_genes.rename(columns={'A0':'contig_id','A1':'cds'}, inplace=True)
# print(df_genomad_genes)

# get total genes
df_genomad_genes_counts = df_genomad_genes.groupby(by='contig_id').count()
# print(df_genomad_genes_counts)

# get viral genes base on the marker column containing "VV"
df_genomad_genes['marker'].fillna('bla',inplace=True)
df_genomad_vv_counts = df_genomad_genes[df_genomad_genes['marker'].str.contains('VV')].groupby(by='contig_id').count()
print(df_genomad_vv_counts)

# merge the gene counts and marker gene counts and rename
df_final_genomad = pd.DataFrame.join(df_genomad_genes_counts[['cds']],df_genomad_vv_counts[['marker']],on='contig_id', how='left')
df_final_genomad.rename(columns={'cds':'genomad_total_genes','marker':'genomad_viral_genes'}, inplace=True)

print(df_final_genomad)

# %%
df_rafah = pd.read_csv("rafah_1_Host_Predictions_gtdb.csv", header=0,
                      usecols=[1,2,3],
                      names=['contig_id','rafah_predicted_host','rafah_prediction_score']).set_index('contig_id')
print(df_rafah)


# %%
# vcontact3 taxonomy

df_vcontact = pd.read_csv(vcontact_filename,header=0)
df_vcontact = df_vcontact[df_vcontact['GenomeName'].str.contains('contig')]

cols = ['realm (prediction)','phylum (prediction)','class (prediction)','order (prediction)','family (prediction)','subfamily (prediction)','genus (prediction)']
df_vcontact['taxonomy'] = df_vcontact[cols].apply(lambda row: ';'.join(row.values.astype(str)), axis=1)

df_vcontact_final=df_vcontact[['GenomeName','taxonomy']]
df_vcontact_final.rename(columns={'GenomeName':'contig_id','taxonomy':'vcontact_taxonomy'}, inplace=True)
df_vcontact_final.set_index('contig_id',inplace=True)

print(df_vcontact_final)

# %%
# genomad taxonomy

df_genomad_taxa = pd.read_csv("assembly_taxonomy.tsv",sep='\t',header=0,
                             usecols=[0,2,3,4], names=['contig_id','genomad_agreement','genomad_taxid','genomad_taxonomy']).set_index('contig_id')

print(df_genomad_taxa)

# %%
# merging all the files together

dfs_list = [df_assembly,df_pharokka,df_final_genomad, df_rafah,df_vcontact_final,df_genomad_taxa]
df_jaeger_checkv = df_checkv.join(df_jaeger, how='inner')
print(df_jaeger_checkv)


df_others = ft.reduce(lambda left, right: pd.merge(left,right,how='left',left_index=True,right_index=True), dfs_list)

df_final = df_jaeger_checkv.join(df_others,how='left')

print(df_final)

# %%
df_final.to_csv(out_filename)

sbatch script for submitting the merging script

#!/bin/bash
#SBATCH --tasks=1
#SBATCH --cpus-per-task=2
#SBATCH --partition=short
#SBATCH --mem=5G
#SBATCH --time=00:10:00
#SBATCH --job-name=merge
#SBATCH --output=2.2_taxonomy/30_merge_summary/merge.slurm.%j.out
#SBATCH --error=2.2_taxonomy/30_merge_summary/merge.slurm.%j.err

# activate the python virtual environment with the packages we need
source ./py3env/bin/activate


python3 ./python_scripts/taxonomy/merge_summary.py

# deactivate the environment
deactivate

Key Points

VConTACT3 and geNomad are two programs used to classify viral taxonomy of sequences using differing strategies

VConTACT3 uses gene-sharing networks, while GeNomad uses marker genes

Both methods have strong agreement with the ICTV classifications

previous episode

Viromics2024

next episode