Host Prediction II
Overview
Teaching: 0 min
Exercises: 240 minObjectives
Run RaFAH to predict a host genus for each contig.
Run BLASTn to find homologies between metagenome and metavirome.
Host prediction
Today, we will use the tool RaFAH to predict a genus of a putative host for our contigs. The tool predicts proteins in each viral sequence and then assigns them to a set of orthologous groups. The annotated proteins are then used to predict a host genus with a pretrained random forest model. This model matches the set of proteins with host proteins it was trained on.
After running RaFAH, we will compare the resulting prediction with the bacterial metagenome you briefly analyzed in one of last weeks homeworks. The bacterial metagenome was taxonomically annotated using GTDB. We will get into taxonomic classification of viruses tomorrow, for today it is important to know that there are different methods of taxonomic classification. RaFAH uses the NCBI taxonomy, so we will have to translate the output of RaFAH into the GTDB taxonomy to be able to compare the predictions with our bacterial metagenome.
Next, we will use BLAST as an alternative way of linking viruses to their hosts. With BLAST, we will align our viral contigs to putative bacterial and archael genomes (Metagenome-Assembled-Genomes, MAGS) to find sequence matches between the viruses and potential hosts.
Finally, we will combine the results from RaFAH and BLAST to compare the two methods and see how well they agree.
RaFAH
Exercise - Use RaFAH to predict hosts for our contigs
RaFAH requires a single file for each contig to run. You first have to write a python script which separates the combined assembly into single files. You can use the package biopython installed in your virtual environment for this:
import os, sys from Bio import SeqIO # define file paths from the arguments ... # loop through the records in the combined assembly with open(assembly_path) as handle: for record in SeqIO.parse(handle, "fasta"): # set a filename per record out_fasta = os.path.join(out_dir, f"{record.id}.fasta") # write the record to the file with open(out_fasta, "w") as fout: SeqIO.write([record], fout, "fasta")
Next, you will write a python script, which translates the host genus predictions from the NCBI taxonomy into the GTDB taxonomy. You can copy the translation table from this location on draco:
$ cp /work/groups/VEO/shared_data/Viromics2024Workspace/2.1_host_prediction/ncbi_to_gtdb.csv ./2.1_host_prediction/10_rafah/ncbi_to_gtdb.csv
There are hundreds of thousands of taxa in both taxonomy schemes, so we need to efficiently map one onto the other. Here are some hints:
import os, sys import pandas as pd # Extract the genus part of the NCBI taxonomy. # This should correspond to RaFAH's output def get_ncbi_genus(s): for ss in s.split(";"): if ss.startswith("g__"): return ss[3:] return "none" # Extract the complete taxonomic classification up to genus from GTDB def get_gtdb_upto_genus(s): return s.split(";s__")[0] # set or get filenames from sys.argv and read tables with pandas ... # create a dictionary for efficient mapping between the two taxonomies translation_dict = {get_ncbi_genus(r["ncbi_taxonomy"]):get_gtdb_upto_genus(r["gtdb_taxonomy"]) for i, r in translate_df.iterrows()} # use the dictionary to translate the predictions in the RaFAH output table ...
You can run the scripts in the same sbatch script as RaFAH. Remember to source and deactivate our python virtual environment accordingly. Here you can find a description of the parameters you can pass to RaFAH (the page is a bit hard to read). The tool is programmed in perl and you can find it here on draco:
# activate the conda environment with the dependencies RaFAH requires source /vast/groups/VEO/tools/miniconda3_2024/etc/profile.d/conda.sh && conda activate perl_v5.32.1 # set a variable to call the RaFAH script rafah='/home/groups/VEO/tools/rafah/RaFAH.pl' # create an output folder or use the one you set for the slurm logs: mkdir 10_results_hostprediction_rafah # run RaFAH with the appropriate file paths and arguments perl "$rafah" --predict --genomes_dir 10_results_hostprediction_rafah/split_contigs --extension .fasta
RaFAH is computationally expensive and can use multiple threads. We recommend you set the following sbatch parameters:
- #SBATCH –cpus-per-task=32
- #SBATCH –partition=standard,short
- #SBATCH –mem=50G
RaFAH uses the random forest model to compute a probability for all host genuses included in its training. You can find these probabilities in the output file
*Host_Predictions.tsv
. The file*Seq_Info_Prediction.tsv
contains per contig the genus with the highest probability.
- How many contigs have a probability score larger than your chosen cutoff?
- Does the probability score relate to the completeness of the contigs? (qualitative answer)
python script for splitting the assembly into separate files
import os, sys from Bio import SeqIO def main(): # define file paths from the arguments assembly_path = os.path.abspath(sys.argv[1]) assert assembly_path.endswith(".fasta") # set an output directory and create it if it does not exist out_dir = os.path.abspath(sys.argv[2]) if not os.path.exists(out_dir): os.makedirs(out_dir) # loop through the records in the combined assembly with open(assembly_path) as handle: for record in SeqIO.parse(handle, "fasta"): # set a filename per record out_fasta = os.path.join(out_dir, f"{record.id}.fasta") # write the record to the file with open(out_fasta, "w") as fout: SeqIO.write([record], fout, "fasta") if __name__ == "__main__": main()
python script for translating NCBI genus IDs to GTDB
import os, sys import pandas as pd # Extract the genus part of the NCBI taxonomy. # This should correspond to RaFAH's output def get_ncbi_genus(s): for ss in s.split(";"): if ss.startswith("g__"): return ss[3:] return "none" # Extract the complete taxonomic classification up to genus from GTDB def get_gtdb_upto_genus(s): return s.split(";s__")[0] def main(): # set the filename of the rafah output file containing the best prediction per contig rafah_filename = os.path.abspath(sys.argv[1]) assert rafah_filename.endswith(".tsv") # set the filename of the translation table with NCBI and GTDB taxonomic IDs translation_filename = os.path.abspath(sys.argv[2]) assert translation_filename.endswith(".csv") # set the output filename out_filename = os.path.abspath(sys.argv[3]) assert out_filename.endswith(".csv") # read the input tables using pandas translate_df = pd.read_csv(translation_filename) rafah_df = pd.read_csv(rafah_filename, sep='\t') # we are only interested in the first 3 columns rafah_df = rafah_df.iloc[:, :3] # create a dictionary for efficient matching of NCBI to GTDB translation_dict = {get_ncbi_genus(r["ncbi_taxonomy"]):get_gtdb_upto_genus(r["gtdb_taxonomy"]) for i, r in translate_df.iterrows()} # translate all NCBI genus predictions to GTDB. # We don't manipulate the table while iterating over it because this could lead to problems. translated_hosts = [] for i, row in rafah_df.iterrows(): translated_hosts.append(translation_dict[row["Predicted_Host"]] if row["Predicted_Host"] in translation_dict else "no_translation") # add the translated predictions to the dataframe and save the table. rafah_df["Predicted_Host"] = translated_hosts rafah_df.to_csv(out_filename,index=False) if __name__ == "__main__": main()
sbatch script for host prediction with RaFAH
#!/bin/bash #SBATCH --tasks=1 #SBATCH --cpus-per-task=32 #SBATCH --partition=short,standard #SBATCH --mem=50G #SBATCH --time=02:00:00 #SBATCH --job-name=rafah #SBATCH --output=./2.1_host_prediction/10_rafah/rafah.slurm.%j.out #SBATCH --error=./2.1_host_prediction/10_rafah/rafah.slurm.%j.err assembly='./1.3_virus_identification/40_results_filter_contigs/assembly.fasta' contigs='./2.1_host_prediction/10_rafah/split_contigs' # RaFAH expects each genome in a separate file. Activate our virtual environment # and run a python script to split our filtered assembly into single files source ./py3env/bin/activate # The script requires the assmbly path and a directory for outputting the contigs python ./python_scripts/host_prediction/10_split_assembly_file.py $assembly $contigs # deactivate the python environment, just to be sure not to cause problems with the conda environment RaFAH needs deactivate # activate the conda environment with the dependencies RaFAH requires source /vast/groups/VEO/tools/miniconda3_2024/etc/profile.d/conda.sh && conda activate perl_v5.32.1 # set a variable to call the RaFAH script rafah='/home/groups/VEO/tools/rafah/RaFAH.pl' # rafah parameters (https://gensoft.pasteur.fr/docs/RaFAH/0.3/) # --predict: run the pipeline for predicting hosts # --genomes_dir: the directory with the separate files for each contig # --extension: the extension of the contig files # --file_prefix: can specify an output dir here and "run name" (rafah_1) here perl "$rafah" --predict --genomes_dir $contigs --extension .fasta --file_prefix ./2.1_host_prediction/10_rafah/rafah_1 # deactivate RaFAH's conda environment conda deactivate # Our python script for translating the Taxonomy needs our python environment again source ./py3env/bin/activate # set the path to the table with the translation between NCBI and GTDB and to RaFAH output translation='./2.1_host_prediction/10_rafah/ncbi_to_gtdb.csv' rafahtable='./2.1_host_prediction/10_rafah/rafah_1_Host_Predictions.tsv' # run our script with the appropriate file names as parameters python3 ./python_scripts/host_prediction/20_translate_taxonomy.py $rafahtable $translation ./2.1_host_prediction/10_rafah/rafah_1_Host_Predictions_gtdb.csv deactivate
Host Prediction using Blastn
While RaFAH uses a database of reference bacterial/archaeal genomes to find hosts, these bacteria/archaea may or may not be in your samples. One of the simplest ways to look for potential hosts is to do a quick blastn. Viral genomes often contain fragments of host DNA, for example an auxiliary metabolic gene. Conversely, viral DNA can also be found in host genomes; e.g. prophages, CRISPR spacers, fragments of viral genes, tRNAs etc.
For this task, we assembled metagnome-assembled-genomes (MAGs or bins) of the prokaryotic community from the same three samples given to you. This resulted in 110 MAGs. However, not all of these are complete and many are contaminated. After filtering for a completeness >= 50% and a contamination <= 6%, 19 “good MAGs” remained.
These “good MAGs” were concatenated and are located /work/groups/VEO/shared_data/Viromics2024Workspace/test_run/2.1_host_prediction/20_blastn/good_bins_all.fa
. These MAGs will be the “database” against which we align our viral contigs (query sequences)
You also require the GTDB taxonomic classifications for the bins. These are located /work/groups/VEO/shared_data/Viromics2024Workspace/2.3_bacterial_binning/40_gtdbtk/results/gtdbtk.bac120.summary.tsv
Exercise - copy the MAGs and the gtdb bin classification over
Please copy this fasta file into your own directory
# make a new directory for this analysis mkdir ./2.1_host_prediction/20_blastn/ # copy over the good MAGs cp /work/groups/VEO/shared_data/Viromics2024Workspace/test_run/2.1_host_prediction/20_blastn/good_bins_all.fa ./2.1_host_prediction/20_blastn/ # copy over the taxonomic classification for the MAGs cp /work/groups/VEO/shared_data/Viromics2024Workspace/2.3_bacterial_binning/40_gtdbtk/results/gtdbtk.bac120.summary.tsv ./2.1_host_prediction/20_blastn/gtdbtk.bac120.summary.tsv
Next, we will create a custom blast database using our MAGs fasta file and use blastn (nucleotide-nucleotide) to do the alignment
Exercise - makeblastdb and blastn
Create an sbatch script for the makeblastdb and blastn
- makeblastdb is located at
/home/groups/VEO/tools/blast/v2.14.0/bin/makeblastdb
- blastn is located at
/home/groups/VEO/tools/blast/v2.14.0/bin/blastn
- A conda environment is NOT required for this tool!
The basic usage for the makeblastdb and blastn are the following
NOTE: the database you create in makeblastdb has to be fed into blastn
# make a custom database makeblastdb -in mags_file.fasta -out mags_file.fasta -dbtype nucl # run blastn with the file.fasta database blastn -db mags_file.fasta -query filtered_cross_assembly.fasta -out viral_contigs_vs_bins_blastn.out -num_threads 32 -outfmt 6
Resources for blastn can vary! Large databases require a lot of computational power and will take a long time but luckily our task is very small. We recommend the following resources
#SBATCH –cpus-per-task=32
#SBATCH –partition=short,standard
#SBATCH –mem=5G
#SBATCH –time=01:00:00
-outfmt 6
: is an output format that is tabular. Your output contains the following 12 columns in THIS order
query
, subject
, pid
, aln_length
, mismatch
, gapopen
, qstart
, qend
, sstart
, send
, evalue
, bitscore
query
: query id - this is your viral contig
subject
: subject id - this is the bin and contig it matches
pid
: percent id of the match
aln_length
: alignment length
mismatch
: number of mismatches
gapopen
: number of gaps
qstart
, qend
: start and end positions of query sequence
sstart
, send
: start and end positions of subject sequence
evalue
: expect value (the lower, the better)
bitscore
: a score based on gaps, mismatches, % id and alignment length
More resources for blast output formats here and blast manual
sbatch script for blast tasks
#!/bin/bash #SBATCH --tasks=1 #SBATCH --cpus-per-task=32 #SBATCH --partition=short,standard #SBATCH --mem=5G #SBATCH --time=01:00:00 #SBATCH --job-name=blastn #SBATCH --output=./2.1_host_prediction/20_blastn/blastn.slurm.%j.out #SBATCH --error=./2.1_host_prediction/20_blastn/blastn.slurm.%j.err # the blast tools makeblastdb='/home/groups/VEO/tools/blast/v2.14.0/bin/makeblastdb' blastn='/home/groups/VEO/tools/blast/v2.14.0/bin/blastn' # the query (filtered cross assembly) and subject (MAGs fasta) assembly='./1.3_virus_identification/40_results_filter_contigs/assembly.fasta' bins_fasta='./2.1_host_prediction/20_blastn/good_bins_all.fa' $makeblastdb -in $bins_fasta -out $bins_fasta -dbtype nucl $blastn -db $bins_fasta -query $assembly -out ./2.1_host_prediction/20_blastn/viral_contigs_vs_bins_blastn.out -num_threads 32 -outfmt 6
Exercise - analyzing blast results
- How many total hits did you get from blastn?
- How many of these hits would you trust?
- How many unique viral contigs were you able to get host predictions for from blastn?
- Was there cases where 1 viral contig hit multiple MAGs?
a) which interaction would you then trust?
Combining results for blast and RaFAH
Now we can write a python script to combine the results from RaFAH and blastn to compare the host predictions from different method. You might use pandas dataframes and pd.join or pd.merge to write this.
You will need the following 3 files:
- the blast output:
./2.1_host_prediction/20_blastn/viral_contigs_vs_bins_blastn.out
file - the taxonomic classifications for the MAGs:
./2.1_host_prediction/20_blastn/gtdbtk.bac120.summary.tsv
- the rafah results with the converted taxonomy:
./2.1_host_prediction/10_rafah/rafah_1_Host_Predictions_gtdb.csv
py script for merging the blastn and rafah results
import pandas as pd import sys # read in the files blast = sys.argv[1] taxa = sys.argv[2] rafah = sys.argv[3] out = sys.argv[4] df_rafah = pd.read_csv(rafah, header=0) print(df_rafah) df_bins_gtdb = pd.read_csv(taxa, sep='\t' ,header=0) print(df_bins_gtdb) df_blast = pd.read_csv(blast, sep='\t', header=None, names=['query', 'subject', 'pid', 'aln_length', 'mismatch', 'gapopen', 'qstart', 'qend', 'sstart', 'send', 'evalue', 'bitscore']) print(df_blast.head(n=10)) # add MAG taxonomy to the blast results df_blast['bin'] = df_blast['subject'].str.split('.fa').str[0] print(df_blast.head(n=10)) df_blast_gtdb_merged = df_blast.merge(df_bins_gtdb[['user_genome','classification']], how='left', left_on='bin', right_on='user_genome') print(df_blast_gtdb_merged.head(n=10)) # merge blast results with rafah df_blast_gtdb_rafah_merged = df_blast_gtdb_merged.merge(df_rafah,how='left',left_on='query',right_on='Sequence',suffixes=(None,'_rafah')) df_blast_gtdb_rafah_merged # select relevant columns df_blast_gtdb_rafah_merged=df_blast_gtdb_rafah_merged[['query', 'subject', 'pid', 'aln_length', 'mismatch', 'gapopen', 'qstart', 'qend', 'sstart', 'send', 'evalue', 'bitscore', 'bin', 'classification','Predicted_Host', 'Winner_Score']] # rename the columns to meaningful headers df_blast_gtdb_rafah_merged = df_blast_gtdb_rafah_merged.rename(columns = {'query':'viral_contig', 'subject':'host_bin_contig','bin':'host_bin','classification':'blast_predicted_host','Predicted_Host':'rafah_predicted_host', 'Winner_Score':'rafah_score'}) print(df_blast_gtdb_rafah_merged) # write out to a csv df_blast_gtdb_rafah_merged.to_csv(out,index=False)
sbatch script for blast tasks
#!/bin/bash #SBATCH --tasks=1 #SBATCH --cpus-per-task=10 #SBATCH --partition=short #SBATCH --mem=5G #SBATCH --time=01:00:00 #SBATCH --job-name=combine #SBATCH --output=./2.1_host_prediction/30_combined_results/combine.slurm.%j.out #SBATCH --error=./2.1_host_prediction/30_combined_results/combine.slurm.%j.err # Our python script for combining blastn and rafah needs our python environment again source ./py3env/bin/activate # file paths blast='./2.1_host_prediction/20_blastn/viral_contigs_vs_bins_blastn.out' gtdb='./2.1_host_prediction/20_blastn/gtdbtk.bac120.summary.tsv' rafah='./2.1_host_prediction/10_rafah/rafah_1_Host_Predictions_gtdb.csv' output='./2.1_host_prediction/30_combined_results/combined_blast_rafah_host_predictions.csv' python3 ./python_scripts/host_prediction/30_combine_host_prediction.py $blast $gtdb $rafah $output
Exercise - comparing RaFAH and blast results
- For how many host predictions, do RaFAH and blastn agree and upto which taxonomic level?
- Are you able identify CRISPRs or integrated phage genomes?
a) Explain your decision
b) How might you confirm your CRISPR/prophage identification?- Pick one example where you would trust RaFAH predictions more and another where you would trust blast predictions more and explain why.
Key Points
RaFAH uses a random forest model to predict hosts to the genus level for phages
RaFAH returns a probability for each host genus it can predict.
Blastn between the viral and prokaryotic fractions can be used as a quick method for host prediction
Machine learning and classical host prediction methods have a trade-off between recall and sensitivity