Host Prediction II

Overview

Teaching: 0 min
Exercises: 240 min
Objectives
  • Run RaFAH to predict a host genus for each contig.

  • Run BLASTn to find homologies between metagenome and metavirome.

Host prediction

Today, we will use the tool RaFAH to predict a genus of a putative host for our contigs. The tool predicts proteins in each viral sequence and then assigns them to a set of orthologous groups. The annotated proteins are then used to predict a host genus with a pretrained random forest model. This model matches the set of proteins with host proteins it was trained on.

After running RaFAH, we will compare the resulting prediction with the bacterial metagenome you briefly analyzed in one of last weeks homeworks. The bacterial metagenome was taxonomically annotated using GTDB. We will get into taxonomic classification of viruses tomorrow, for today it is important to know that there are different methods of taxonomic classification. RaFAH uses the NCBI taxonomy, so we will have to translate the output of RaFAH into the GTDB taxonomy to be able to compare the predictions with our bacterial metagenome.

Next, we will use BLAST as an alternative way of linking viruses to their hosts. With BLAST, we will align our viral contigs to putative bacterial and archael genomes (Metagenome-Assembled-Genomes, MAGS) to find sequence matches between the viruses and potential hosts.

Finally, we will combine the results from RaFAH and BLAST to compare the two methods and see how well they agree.

RaFAH

Exercise - Use RaFAH to predict hosts for our contigs

RaFAH requires a single file for each contig to run. You first have to write a python script which separates the combined assembly into single files. You can use the package biopython installed in your virtual environment for this:

import os, sys
from Bio import SeqIO

# define file paths from the arguments
...

# loop through the records in the combined assembly
with open(assembly_path) as handle:
    for record in SeqIO.parse(handle, "fasta"):
        # set a filename per record
        out_fasta = os.path.join(out_dir, f"{record.id}.fasta")
        # write the record to the file
        with open(out_fasta, "w") as fout:
            SeqIO.write([record], fout, "fasta")

Next, you will write a python script, which translates the host genus predictions from the NCBI taxonomy into the GTDB taxonomy. You can copy the translation table from this location on draco:

$ cp /work/groups/VEO/shared_data/Viromics2024Workspace/2.1_host_prediction/ncbi_to_gtdb.csv ./2.1_host_prediction/10_rafah/ncbi_to_gtdb.csv

There are hundreds of thousands of taxa in both taxonomy schemes, so we need to efficiently map one onto the other. Here are some hints:

import os, sys
import pandas as pd

# Extract the genus part of the NCBI taxonomy.
# This should correspond to RaFAH's output
def get_ncbi_genus(s):
    for ss in s.split(";"):
        if ss.startswith("g__"): return ss[3:]
    return "none"

# Extract the complete taxonomic classification up to genus from GTDB
def get_gtdb_upto_genus(s):
    return s.split(";s__")[0]

# set or get filenames from sys.argv and read tables with pandas
...

# create a dictionary for efficient mapping between the two taxonomies
translation_dict = {get_ncbi_genus(r["ncbi_taxonomy"]):get_gtdb_upto_genus(r["gtdb_taxonomy"]) for i, r in translate_df.iterrows()}

# use the dictionary to translate the predictions in the RaFAH output table
...

You can run the scripts in the same sbatch script as RaFAH. Remember to source and deactivate our python virtual environment accordingly. Here you can find a description of the parameters you can pass to RaFAH (the page is a bit hard to read). The tool is programmed in perl and you can find it here on draco:

# activate the conda environment with the dependencies RaFAH requires
source /vast/groups/VEO/tools/miniconda3_2024/etc/profile.d/conda.sh && conda activate perl_v5.32.1

# set a variable to call the RaFAH script
rafah='/home/groups/VEO/tools/rafah/RaFAH.pl'

# create an output folder or use the one you set for the slurm logs:
mkdir 10_results_hostprediction_rafah

# run RaFAH with the appropriate file paths and arguments
perl "$rafah" --predict --genomes_dir 10_results_hostprediction_rafah/split_contigs --extension .fasta

RaFAH is computationally expensive and can use multiple threads. We recommend you set the following sbatch parameters:

  • #SBATCH –cpus-per-task=32
  • #SBATCH –partition=standard,short
  • #SBATCH –mem=50G

RaFAH uses the random forest model to compute a probability for all host genuses included in its training. You can find these probabilities in the output file *Host_Predictions.tsv. The file *Seq_Info_Prediction.tsv contains per contig the genus with the highest probability.

  1. How many contigs have a probability score larger than your chosen cutoff?
  2. Does the probability score relate to the completeness of the contigs? (qualitative answer)

python script for splitting the assembly into separate files

import os, sys
from Bio import SeqIO

def main():

    # define file paths from the arguments
    assembly_path = os.path.abspath(sys.argv[1])
    assert assembly_path.endswith(".fasta")

    # set an output directory and create it if it does not exist
    out_dir = os.path.abspath(sys.argv[2])
    if not os.path.exists(out_dir): os.makedirs(out_dir)

    # loop through the records in the combined assembly
    with open(assembly_path) as handle:
        for record in SeqIO.parse(handle, "fasta"):
            # set a filename per record
            out_fasta = os.path.join(out_dir, f"{record.id}.fasta")
            # write the record to the file
            with open(out_fasta, "w") as fout:
                SeqIO.write([record], fout, "fasta")

if __name__ == "__main__":
    main()

python script for translating NCBI genus IDs to GTDB

import os, sys
import pandas as pd

# Extract the genus part of the NCBI taxonomy.
# This should correspond to RaFAH's output
def get_ncbi_genus(s):
    for ss in s.split(";"):
        if ss.startswith("g__"): return ss[3:]
    return "none"

# Extract the complete taxonomic classification up to genus from GTDB
def get_gtdb_upto_genus(s):
    return s.split(";s__")[0]

def main():

    # set the filename of the rafah output file containing the best prediction per contig
    rafah_filename = os.path.abspath(sys.argv[1])
    assert rafah_filename.endswith(".tsv")

    # set the filename of the translation table with NCBI and GTDB taxonomic IDs
    translation_filename = os.path.abspath(sys.argv[2])
    assert translation_filename.endswith(".csv")
    
    # set the output filename
    out_filename = os.path.abspath(sys.argv[3])
    assert out_filename.endswith(".csv")

    # read the input tables using pandas
    translate_df = pd.read_csv(translation_filename)
    rafah_df = pd.read_csv(rafah_filename, sep='\t')

    # we are only interested in the first 3 columns
    rafah_df = rafah_df.iloc[:, :3]

    # create a dictionary for efficient matching of NCBI to GTDB
    translation_dict = {get_ncbi_genus(r["ncbi_taxonomy"]):get_gtdb_upto_genus(r["gtdb_taxonomy"]) for i, r in translate_df.iterrows()}

    # translate all NCBI genus predictions to GTDB.
    # We don't manipulate the table while iterating over it because this could lead to problems.
    translated_hosts = []
    for i, row in rafah_df.iterrows():
        translated_hosts.append(translation_dict[row["Predicted_Host"]] if row["Predicted_Host"] in translation_dict else "no_translation") 

    # add the translated predictions to the dataframe and save the table.
    rafah_df["Predicted_Host"] = translated_hosts
    rafah_df.to_csv(out_filename,index=False)

if __name__ == "__main__":
    main()

sbatch script for host prediction with RaFAH

#!/bin/bash
#SBATCH --tasks=1
#SBATCH --cpus-per-task=32
#SBATCH --partition=short,standard
#SBATCH --mem=50G
#SBATCH --time=02:00:00
#SBATCH --job-name=rafah
#SBATCH --output=./2.1_host_prediction/10_rafah/rafah.slurm.%j.out
#SBATCH --error=./2.1_host_prediction/10_rafah/rafah.slurm.%j.err

assembly='./1.3_virus_identification/40_results_filter_contigs/assembly.fasta'
contigs='./2.1_host_prediction/10_rafah/split_contigs'

# RaFAH expects each genome in a separate file. Activate our virtual environment
# and run a python script to split our filtered assembly into single files
source ./py3env/bin/activate

# The script requires the assmbly path and a directory for outputting the contigs
python ./python_scripts/host_prediction/10_split_assembly_file.py $assembly $contigs

# deactivate the python environment, just to be sure not to cause problems with the conda environment RaFAH needs
deactivate

# activate the conda environment with the dependencies RaFAH requires
source /vast/groups/VEO/tools/miniconda3_2024/etc/profile.d/conda.sh && conda activate perl_v5.32.1

# set a variable to call the RaFAH script
rafah='/home/groups/VEO/tools/rafah/RaFAH.pl'

# rafah parameters (https://gensoft.pasteur.fr/docs/RaFAH/0.3/)
# --predict: run the pipeline for predicting hosts
# --genomes_dir: the directory with the separate files for each contig
# --extension: the extension of the contig files
# --file_prefix: can specify an output dir here and "run name" (rafah_1) here

perl "$rafah" --predict --genomes_dir $contigs --extension .fasta --file_prefix ./2.1_host_prediction/10_rafah/rafah_1

# deactivate RaFAH's conda environment
conda deactivate

# Our python script for translating the Taxonomy needs our python environment again
source ./py3env/bin/activate

# set the path to the table with the translation between NCBI and GTDB and to RaFAH output
translation='./2.1_host_prediction/10_rafah/ncbi_to_gtdb.csv'
rafahtable='./2.1_host_prediction/10_rafah/rafah_1_Host_Predictions.tsv'

# run our script with the appropriate file names as parameters
python3 ./python_scripts/host_prediction/20_translate_taxonomy.py $rafahtable $translation ./2.1_host_prediction/10_rafah/rafah_1_Host_Predictions_gtdb.csv

deactivate

Host Prediction using Blastn

While RaFAH uses a database of reference bacterial/archaeal genomes to find hosts, these bacteria/archaea may or may not be in your samples. One of the simplest ways to look for potential hosts is to do a quick blastn. Viral genomes often contain fragments of host DNA, for example an auxiliary metabolic gene. Conversely, viral DNA can also be found in host genomes; e.g. prophages, CRISPR spacers, fragments of viral genes, tRNAs etc.

For this task, we assembled metagnome-assembled-genomes (MAGs or bins) of the prokaryotic community from the same three samples given to you. This resulted in 110 MAGs. However, not all of these are complete and many are contaminated. After filtering for a completeness >= 50% and a contamination <= 6%, 19 “good MAGs” remained.

These “good MAGs” were concatenated and are located /work/groups/VEO/shared_data/Viromics2024Workspace/test_run/2.1_host_prediction/20_blastn/good_bins_all.fa. These MAGs will be the “database” against which we align our viral contigs (query sequences)

You also require the GTDB taxonomic classifications for the bins. These are located /work/groups/VEO/shared_data/Viromics2024Workspace/2.3_bacterial_binning/40_gtdbtk/results/gtdbtk.bac120.summary.tsv

Exercise - copy the MAGs and the gtdb bin classification over

Please copy this fasta file into your own directory

# make a new directory for this analysis
mkdir ./2.1_host_prediction/20_blastn/

# copy over the good MAGs
cp /work/groups/VEO/shared_data/Viromics2024Workspace/test_run/2.1_host_prediction/20_blastn/good_bins_all.fa ./2.1_host_prediction/20_blastn/

# copy over the taxonomic classification for the MAGs
cp /work/groups/VEO/shared_data/Viromics2024Workspace/2.3_bacterial_binning/40_gtdbtk/results/gtdbtk.bac120.summary.tsv ./2.1_host_prediction/20_blastn/gtdbtk.bac120.summary.tsv

Next, we will create a custom blast database using our MAGs fasta file and use blastn (nucleotide-nucleotide) to do the alignment

Exercise - makeblastdb and blastn

Create an sbatch script for the makeblastdb and blastn

  • makeblastdb is located at /home/groups/VEO/tools/blast/v2.14.0/bin/makeblastdb
  • blastn is located at /home/groups/VEO/tools/blast/v2.14.0/bin/blastn
  • A conda environment is NOT required for this tool!

The basic usage for the makeblastdb and blastn are the following

NOTE: the database you create in makeblastdb has to be fed into blastn

# make a custom database
makeblastdb -in mags_file.fasta -out  mags_file.fasta -dbtype nucl

# run blastn with the file.fasta database
blastn -db mags_file.fasta -query filtered_cross_assembly.fasta -out viral_contigs_vs_bins_blastn.out -num_threads 32 -outfmt 6

Resources for blastn can vary! Large databases require a lot of computational power and will take a long time but luckily our task is very small. We recommend the following resources
#SBATCH –cpus-per-task=32
#SBATCH –partition=short,standard
#SBATCH –mem=5G
#SBATCH –time=01:00:00

-outfmt 6 : is an output format that is tabular. Your output contains the following 12 columns in THIS order

query, subject, pid, aln_length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore

query : query id - this is your viral contig
subject : subject id - this is the bin and contig it matches
pid : percent id of the match
aln_length : alignment length
mismatch : number of mismatches
gapopen : number of gaps
qstart, qend : start and end positions of query sequence
sstart, send : start and end positions of subject sequence
evalue : expect value (the lower, the better)
bitscore : a score based on gaps, mismatches, % id and alignment length

More resources for blast output formats here and blast manual

sbatch script for blast tasks

#!/bin/bash
#SBATCH --tasks=1
#SBATCH --cpus-per-task=32
#SBATCH --partition=short,standard
#SBATCH --mem=5G
#SBATCH --time=01:00:00
#SBATCH --job-name=blastn
#SBATCH --output=./2.1_host_prediction/20_blastn/blastn.slurm.%j.out
#SBATCH --error=./2.1_host_prediction/20_blastn/blastn.slurm.%j.err

# the blast tools
makeblastdb='/home/groups/VEO/tools/blast/v2.14.0/bin/makeblastdb'
blastn='/home/groups/VEO/tools/blast/v2.14.0/bin/blastn'

# the query (filtered cross assembly) and subject (MAGs fasta)
assembly='./1.3_virus_identification/40_results_filter_contigs/assembly.fasta'
bins_fasta='./2.1_host_prediction/20_blastn/good_bins_all.fa'

$makeblastdb -in $bins_fasta -out $bins_fasta -dbtype nucl

$blastn -db $bins_fasta -query $assembly -out ./2.1_host_prediction/20_blastn/viral_contigs_vs_bins_blastn.out -num_threads 32 -outfmt 6 

Exercise - analyzing blast results

  1. How many total hits did you get from blastn?
  2. How many of these hits would you trust?
  3. How many unique viral contigs were you able to get host predictions for from blastn?
  4. Was there cases where 1 viral contig hit multiple MAGs?
    a) which interaction would you then trust?

Combining results for blast and RaFAH

Now we can write a python script to combine the results from RaFAH and blastn to compare the host predictions from different method. You might use pandas dataframes and pd.join or pd.merge to write this.

You will need the following 3 files:

py script for merging the blastn and rafah results

import pandas as pd
import sys

# read in the files
blast = sys.argv[1]
taxa = sys.argv[2]
rafah = sys.argv[3]
out = sys.argv[4]

df_rafah = pd.read_csv(rafah, header=0)
print(df_rafah)

df_bins_gtdb = pd.read_csv(taxa, sep='\t' ,header=0)
print(df_bins_gtdb)

df_blast = pd.read_csv(blast, sep='\t', header=None,
                      names=['query', 'subject', 'pid', 'aln_length', 'mismatch', 'gapopen', 'qstart', 'qend', 'sstart', 'send', 'evalue', 'bitscore'])
print(df_blast.head(n=10))


# add MAG taxonomy to the blast results
df_blast['bin'] = df_blast['subject'].str.split('.fa').str[0]
print(df_blast.head(n=10))

df_blast_gtdb_merged = df_blast.merge(df_bins_gtdb[['user_genome','classification']], how='left', left_on='bin', right_on='user_genome')
print(df_blast_gtdb_merged.head(n=10))

# merge blast results with rafah
df_blast_gtdb_rafah_merged = df_blast_gtdb_merged.merge(df_rafah,how='left',left_on='query',right_on='Sequence',suffixes=(None,'_rafah'))
df_blast_gtdb_rafah_merged

# select relevant columns
df_blast_gtdb_rafah_merged=df_blast_gtdb_rafah_merged[['query', 'subject', 'pid', 'aln_length', 'mismatch', 'gapopen', 'qstart', 'qend', 'sstart', 'send', 'evalue', 'bitscore', 'bin',	'classification','Predicted_Host', 'Winner_Score']]

# rename the columns to meaningful headers
df_blast_gtdb_rafah_merged = df_blast_gtdb_rafah_merged.rename(columns = {'query':'viral_contig', 'subject':'host_bin_contig','bin':'host_bin','classification':'blast_predicted_host','Predicted_Host':'rafah_predicted_host', 'Winner_Score':'rafah_score'})

print(df_blast_gtdb_rafah_merged)

# write out to a csv
df_blast_gtdb_rafah_merged.to_csv(out,index=False)

sbatch script for blast tasks

#!/bin/bash
#SBATCH --tasks=1
#SBATCH --cpus-per-task=10
#SBATCH --partition=short
#SBATCH --mem=5G
#SBATCH --time=01:00:00
#SBATCH --job-name=combine
#SBATCH --output=./2.1_host_prediction/30_combined_results/combine.slurm.%j.out
#SBATCH --error=./2.1_host_prediction/30_combined_results/combine.slurm.%j.err

# Our python script for combining blastn and rafah needs our python environment again
source ./py3env/bin/activate

# file paths
blast='./2.1_host_prediction/20_blastn/viral_contigs_vs_bins_blastn.out'
gtdb='./2.1_host_prediction/20_blastn/gtdbtk.bac120.summary.tsv'
rafah='./2.1_host_prediction/10_rafah/rafah_1_Host_Predictions_gtdb.csv'
output='./2.1_host_prediction/30_combined_results/combined_blast_rafah_host_predictions.csv'

python3 ./python_scripts/host_prediction/30_combine_host_prediction.py $blast $gtdb $rafah $output

Exercise - comparing RaFAH and blast results

  1. For how many host predictions, do RaFAH and blastn agree and upto which taxonomic level?
  2. Are you able identify CRISPRs or integrated phage genomes?
    a) Explain your decision
    b) How might you confirm your CRISPR/prophage identification?
  3. Pick one example where you would trust RaFAH predictions more and another where you would trust blast predictions more and explain why.

Key Points

  • RaFAH uses a random forest model to predict hosts to the genus level for phages

  • RaFAH returns a probability for each host genus it can predict.

  • Blastn between the viral and prokaryotic fractions can be used as a quick method for host prediction

  • Machine learning and classical host prediction methods have a trade-off between recall and sensitivity