Viral taxonomy and phylogeny II
Overview
Teaching: 240 min
Exercises: 0 minObjectives
Run vConTACT3 to work out the taxonomy of your viruses
Interpret this taxonomy
What kind of phages exist in your dataset?
For this task, we will use vConTACT3 and geNomad.
vConTACT3
vConTACT3 has an underlying assumption that the fraction of shared genes between two viruses represents their evolutionary relationship. Thus, the vConTACT3 gene-sharing network closely correlates with the ICTV taxonomy.
Exercise - run vConTACT3
- Run vcontact3 on your virome dataset
You will need the following commands for the conda environment and database
source /vast/groups/VEO/tools/anaconda3/etc/profile.d/conda.sh && conda activate mamba_20231101_python_3.9 # Need to use with db path? vcontact3 --db-path /work/groups/VEO/databases/vcontact3/v20240513
See your options:
vcontact3 run --help
Example command
Note: Take care to have enough memory (RAM) for processing. vConTACT3 loads a full viral network into a graph
vcontact3 run --nucleotide virus_genomes.fasta --output output_directory --db-path /work/groups/VEO/databases/vcontact3/v20231101/ -e cytoscape,tree # other commands you might use # -t : threads # --distance-metric : {VirClust,SqRoot,Shorter,Jaccard} # --min-shared-genes : minimum number of shared genes for two genomes to be connected # -e : exporting formats/files
sbatch script to run vcontact3
#!/bin/bash #SBATCH --tasks=1 #SBATCH --cpus-per-task=10 #SBATCH --partition=short,standard,interactive #SBATCH --mem=50G #SBATCH --time=2:00:00 #SBATCH --job-name=vcontact3 #SBATCH --output=./2.2_taxonomy/10_vcontact_results/vcontact3.slurm.out.%j #SBATCH --error=./2.2_taxonomy/10_vcontact_results/vcontact3.slurm.err.%j # activate conda environment source /vast/groups/VEO/tools/anaconda3/etc/profile.d/conda.sh && conda activate mamba_20231101_python_3.9 database='/work/groups/VEO/databases/vcontact3/v20231101/' assembly='./1.3_virus_identification/40_results_filter_contigs/assembly.fasta' outdir='./2.2_taxonomy/10_vcontact_results' #options # -e : how do you want the output formatted? See the vcontact3 help for more options vcontact3 run --nucleotide $assembly --output $outdir --db-path $database -e cytoscape
Output hints
final_assignments.csv
contains your taxonomic assignments
.cyjs
files are network files for Cytoscape
.pkl.gz
files are compressed pickle files (read with Python-pandas)
geNomad
geNomad has a different way of assigning taxonomy. It uses a marker-gene-to-taxonomy database. It first identifies marker genes on the contigs, classifies the taxonomy for each gene, and then consolidates it using weights.
Read it here.
From this pipeline, we will only use the annotate
module - which includes functional annotation and taxonomic classification. geNomad can also identify and classify proviruses (integrated viruses) and plasmids.
Exercise - run geNomad
- Run geNomad on your virome dataset
The conda environment you will require:
source /vast/groups/VEO/tools/anaconda3/etc/profile.d/conda.sh conda activate genNomad_v20230721
We will only be using the
annotate
module from the geNomad pipeline. The basic command for this:genomad annotate [OPTIONS] INPUT OUTPUT DATABASE
sbatch script to run genomad
#!/bin/bash #SBATCH --tasks=1 #SBATCH --cpus-per-task=10 #SBATCH --partition=short,standard,interactive #SBATCH --mem=50G #SBATCH --time=2:00:00 #SBATCH --job-name=genomad #SBATCH --output=./2.2_taxonomy/20_genomad_results/genomad.slurm.out.%j #SBATCH --error=./2.2_taxonomy/20_genomad_results/genomad.slurm.err.%j # activate conda environment source /vast/groups/VEO/tools/anaconda3/etc/profile.d/conda.sh && conda activate genNomad_v20230721 database='/work/groups/VEO/databases/geNomad/v1.3' assembly='./1.3_virus_identification/40_results_filter_contigs/assembly.fasta' outdir='./2.2_taxonomy/20_genomad_results' #options genomad annotate --threads 10 $assembly $outdir $database
Outputs from genomad
phage_contigs_taxonomy.tsv
: taxonomic classifications
phage_contigs_genes.tsv
: functional annotations
One of the advantages of running geNomad is that it also outputs functional annotations as part of its “annotate” pipeline module. These functions are assigned differently to how pharokka assigns them using a marker dataset of 227,897 profiles.
Exercise - comparing your taxonomic annotations
- What percentage of your viruses were classified by each tool?
- How can you judge the accuracy of your taxonomic classifications?
- How many classifications agree/disagree?
- What is the maximum taxonomic distance between viruses that are still connected in this graph?
- What are the limitations of using these tools?
Pick one of the viral contigs annotated by vConTACT3
- How many genes are shared with known reference viruses?
- Which genes are shared with the other known viruses?
- Visualize your virus on the network. see homework exercises here.
- Which viruses are closely related? Which are not?
Hint on querying pickle files using python
import pandas as pd # read in the shared genes pickle file # this pickle file contains genes shared at even 30% identity - there are other files for 40, 50 ,60 and 70% identity df_3 = pd.read_pickle("HMMprofile.0.3_SqRoot_shared_genes.pkl.gz") df_3 # check how many genes contig_518 shares with the other contigs df_3_contig_518 = df_3.loc[df_3['contig_518'] > 0, 'contig_518'] df_3_contig_518
Next steps - getting all your data together
Now that we have results from many different tools and programs, we can build a table to compare and correlate them. You can build this table however you would like - using excel or python etc… If you are using python, I recommend using pandas df.merge
or df.join
methods for this task. You will likely need to use it multiple times. Here’s a sample of the type of table you need to build:
Contig | Contig length (bp) | Coverage | Completeness | Contamination | Num Genes | Num Genes | Num Genes | Num Genes | Num Genes | Num Genes | Num Genes | Num Genes | Num Genes | Taxonomy | Taxonomy | Host Prediction |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
total | CheckV | CheckV | CheckV | Phannotate | Phannotate | Phannotate | GeNomad | GeNomad | GeNomad | vConTACT3 | GeNomad | |||||
Total | Viral | Bacterial | Total | Viral | Bacterial | Total | Viral | Bacterial | ||||||||
Contig_1 | 1000 | 97 | 0.3 | 50 | 45 | 2 | 50 | 47 | 2 | 50 | 47 | 2 | Viruses;Duplodnaviria;Heunggongvirae;Uroviricota;Caudoviricetes | Viruses;Duplodnaviria;Heunggongvirae;Uroviricota;Caudoviricetes | unknown |
py script for merging all the output tables from all the tools!
# %% import pandas as pd import functools as ft assembly_filename = "assembly_info.txt" jaeger_filename = "assembly_default_jaeger.tsv" checkv_filename = "quality_summary.tsv" pharokka_filename = "pharokka_cds_functions.tsv" genomad_filename = "assembly_genes.tsv" genomad_taxonomy_filename = "assembly_taxonomy.tsv" vcontact_filename = "final_assignments.csv" out_filename = "viral_contigs_all_outputs_merged.csv" # %% df_jaeger = pd.read_csv(jaeger_filename, header=0, sep='\t', usecols=[0,2,16]).set_index('contig_id') df_jaeger = df_jaeger.loc[(df_jaeger['prediction'] == 'phage')] print(df_jaeger) # %% df_assembly = pd.read_csv(assembly_filename, header=0, sep='\t', usecols=[0,1,2,3], names=['contig_id','length','coverage','is_circular']).set_index('contig_id') print(df_assembly) # %% df_checkv = pd.read_csv(checkv_filename, header=0, sep='\t', usecols=[0,4,6,7,9,11], names=['contig_id','checkv_total_genes','checkv_viral_genes','checkv_quality','completeness','contamination']).set_index('contig_id') df_checkv.fillna(0, inplace=True) # print(df_checkv) df_checkv = df_checkv.loc[(df_checkv['completeness'] > 50) & (df_checkv['contamination'] < 5)] print(df_checkv) # %% df_pharokka = pd.read_csv(pharokka_filename, header=0, sep='\t') df_pharokka = df_pharokka.pivot(index='contig',columns='Description') df_pharokka.columns = df_pharokka.columns.get_level_values(1) #df_pharokka = df_pharokka[['Count']] df_pharokka = df_pharokka[['CDS','CRISPRs','DNA, RNA and nucleotide metabolism','connector', 'head and packaging','integration and excision', 'lysis','moron, auxiliary metabolic gene and host takeover','other','tRNAs','tail','transcription regulation']] df_pharokka.rename(columns={'CDS':'phanotate_total_genes'}, inplace=True) print(df_pharokka) # %% # genomad genes df_genomad_genes = pd.read_csv(genomad_filename, header=0, sep='\t', usecols=[0,8], names=['contig_id_cds','marker']) #print(df_genomad_genes) # split the column, join to df and add a column name df_genomad_genes = df_genomad_genes.join(df_genomad_genes['contig_id_cds'].str.rsplit('_',n=1, expand=True).add_prefix('A')) # rename the columns df_genomad_genes.rename(columns={'A0':'contig_id','A1':'cds'}, inplace=True) # print(df_genomad_genes) # get total genes df_genomad_genes_counts = df_genomad_genes.groupby(by='contig_id').count() # print(df_genomad_genes_counts) # get viral genes base on the marker column containing "VV" df_genomad_genes['marker'].fillna('bla',inplace=True) df_genomad_vv_counts = df_genomad_genes[df_genomad_genes['marker'].str.contains('VV')].groupby(by='contig_id').count() print(df_genomad_vv_counts) # merge the gene counts and marker gene counts and rename df_final_genomad = pd.DataFrame.join(df_genomad_genes_counts[['cds']],df_genomad_vv_counts[['marker']],on='contig_id', how='left') df_final_genomad.rename(columns={'cds':'genomad_total_genes','marker':'genomad_viral_genes'}, inplace=True) print(df_final_genomad) # %% df_rafah = pd.read_csv("rafah_1_Host_Predictions_gtdb.csv", header=0, usecols=[1,2,3], names=['contig_id','rafah_predicted_host','rafah_prediction_score']).set_index('contig_id') print(df_rafah) # %% # vcontact3 taxonomy df_vcontact = pd.read_csv(vcontact_filename,header=0) df_vcontact = df_vcontact[df_vcontact['GenomeName'].str.contains('contig')] cols = ['realm (prediction)','phylum (prediction)','class (prediction)','order (prediction)','family (prediction)','subfamily (prediction)','genus (prediction)'] df_vcontact['taxonomy'] = df_vcontact[cols].apply(lambda row: ';'.join(row.values.astype(str)), axis=1) df_vcontact_final=df_vcontact[['GenomeName','taxonomy']] df_vcontact_final.rename(columns={'GenomeName':'contig_id','taxonomy':'vcontact_taxonomy'}, inplace=True) df_vcontact_final.set_index('contig_id',inplace=True) print(df_vcontact_final) # %% # genomad taxonomy df_genomad_taxa = pd.read_csv("assembly_taxonomy.tsv",sep='\t',header=0, usecols=[0,2,3,4], names=['contig_id','genomad_agreement','genomad_taxid','genomad_taxonomy']).set_index('contig_id') print(df_genomad_taxa) # %% # merging all the files together dfs_list = [df_assembly,df_pharokka,df_final_genomad, df_rafah,df_vcontact_final,df_genomad_taxa] df_jaeger_checkv = df_checkv.join(df_jaeger, how='inner') print(df_jaeger_checkv) df_others = ft.reduce(lambda left, right: pd.merge(left,right,how='left',left_index=True,right_index=True), dfs_list) df_final = df_jaeger_checkv.join(df_others,how='left') print(df_final) # %% df_final.to_csv(out_filename)
sbatch script for submitting the merging script
#!/bin/bash #SBATCH --tasks=1 #SBATCH --cpus-per-task=2 #SBATCH --partition=short #SBATCH --mem=5G #SBATCH --time=00:10:00 #SBATCH --job-name=merge #SBATCH --output=2.2_taxonomy/30_merge_summary/merge.slurm.%j.out #SBATCH --error=2.2_taxonomy/30_merge_summary/merge.slurm.%j.err # activate the python virtual environment with the packages we need source ./py3env/bin/activate python3 ./python_scripts/taxonomy/merge_summary.py # deactivate the environment deactivate
Key Points
VConTACT3 and geNomad are two programs used to classify viral taxonomy of sequences using differing strategies
VConTACT3 uses gene-sharing networks, while GeNomad uses marker genes
Both methods have strong agreement with the ICTV classifications