Identifying viral contigs II

Overview

Teaching: 0 min
Exercises: 180 min
Objectives
  • Select medium-complete and low-contaminated contigs

  • Identify phage contigs in assembled viromes and metagenomes

  • Interpret Jaeger’s output

  • Select viral contigs for further analysis

  • Assess assembly completeness

  • Understand differences between viromes and metagenomes

In this section, we will identify viral sequences among our assembled contigs. This morning, you studied how Wu et al. benchmarked nine bioinformatic virus identification tools. As many of those tools are quite slow, we will use a different tool that was developed by the VEO-MGX Groups: Jaeger. Jaeger was developed after the Wu et al. benchmarking study, so we will test how well it performs ourselves by running it on the cross-assembled virome contig dataset, and also on contigs from total community metagenomes from the same Unterwarnow estuary samples. We will compare the percentage of detected contigs to the numbers reported in the Wu et al. paper.

Document your activities and answer the questions below in your lab book. Do not forget to cite all relevant sources in your work.

Identify viral contigs

We will start to identify viral contigs with Jaeger.

Challenge

  1. Viromes and metagenomes are obtained in the wet lab through different filtration steps. Smaller filters remove larger particles such as eukaryotes and prokaryotes. Where do you expect the most hits from Jaeger (virome or metagenome) and why?

  2. Why is it important to use tools like Jaeger on viromes?

Note that Jaeger should require a bit more than 1 GB for the sbatch job, so make sure to allocate at least 2 GB of memory. Jaeger can run on CPU nodes, but it’s speed is optimal when run on GPU nodes. In the parameter --partition of the sbatch script, you could add gpu,short to allocate the job to a GPU node.

Exercise

Run Jaeger for cross-assembly:

  • Read and interpret the output following Jaeger’s webpage
module load nvidia/cuda/12.1.0
source /vast/groups/VEO/tools/miniconda3_2024/etc/profile.d/conda.sh
conda activate jaeger_dev
python3 /home/groups/VEO/tools/jaeger/v1.1.30a0/Jaeger/bin/jaeger run -i <file> -o <output_file>

Inspect the output files After running Jaeger, answer the questions below.

head <output>

sbatch script for jaeger

>#!/bin/bash
#SBATCH --tasks=1
#SBATCH --cpus-per-task=10
#SBATCH --partition=standard
#SBATCH --mem=5G
#SBATCH --time=2:00:00
#SBATCH --job-name=jaeger
#SBATCH --output=./1.3_virus_identification/10_jaeger/jaeger.slurm.out.%j
#SBATCH --error=./1.3_virus_identification/10_jaeger/jaeger.slurm.err.%j

#The most updated version of Jaeger is from 20240806

module load nvidia/cuda/12.1.0
source /vast/groups/VEO/tools/miniconda3_2024/etc/profile.d/conda.sh 
conda activate jaeger_dev

# from here on, we will just use the cross-assembly for our analyses

assembly='./1.2_assembly/10_results_assembly_flye/cross_assembly/assembly.fasta'
outdir='./1.3_virus_identification/10_jaeger'
jaeger='python3 /home/groups/VEO/tools/jaeger/v1.1.30a0/Jaeger/bin/jaeger run'

mkdir -p $outdir

$jaeger -i $assembly -o $outdir/results_jaeger

Questions

  1. Do the results corroborate your expectations?
  2. Why are not all contigs in the virome identified as viral contigs?
  3. Take the longest contig from the dataset and BLAST it using blastn. What are the top hits? Are they expected?

Optional: How many viral contigs are there in the virome?

bash command for getting the number of phage contigs

cut -f 3 <file> | sort | uniq -c

To get a command to get the sequence of a contigID, you coul go to “Extract contig fasta” of the tutorial of Gene Calling and Functional Annotation II.

Estimating genome completeness

There are tools to assess the completeness of bacterial and viral genome sequences. For viruses we use CheckV. The tool identifies genes on the query sequence and compares them to a database of viral and bacterial marker genes. Each taxonomic group of bacteria or phages, e.g. a species or a family, has certain marker genes on its genome. So based on the number and types of marker genes, CheckV can figure out to which taxon a query contig belongs and estimate how much of the genome it represents (estimated completeness). (We will improve the taxonomic annotation in the “Viral Taxonomy and Phylogeny” section next week.) CheckV also checks for unexpected marker genes on the sequence (estimated contamination), and whether part of a phage contig likely represents bacterial sequence, in which case the fragment could be part of a host genome with an integrated prophage.

# create a folder for the assessment (or let sbatch create it when you assign the output and error log files)
$ mkdir 30_results_assessment_checkv

# activate the conda environment containing the checkv installation
$ source /vast/groups/VEO/tools/anaconda3/etc/profile.d/conda.sh && conda activate checkv_v1.0.1

# run checkV on both assemblies
$ checkv end_to_end ...

sbatch script for running checkV

#!/bin/bash
#SBATCH --tasks=1
#SBATCH --cpus-per-task=22
#SBATCH --partition=standard
#SBATCH --mem=20G
#SBATCH --time=02:30:00
#SBATCH --job-name=assessment_checkv
#SBATCH --output=./1.3_virus_identification/20_results_assessment_checkv/assessment_checkv.slurm.%j.out
#SBATCH --error=./1.3_virus_identification/20_results_assessment_checkv/assessment_checkv.slurm.%j.err

# run CheckV to assess the completeness of single-contig virus genomes.
# First, activate the conda environment which holds the CheckV installation on draco:
source /vast/groups/VEO/tools/anaconda3/etc/profile.d/conda.sh && conda activate checkv_v1.0.1

# CheckV parameters (https://bitbucket.org/berkeleylab/checkv/src/master/#markdown-header-running-checkv)
# checkv end-to-end runs the CheckV pipeline from end to end :). It expects an input fasta file 
# with the assembly and an output path.
# -t: threads

# assigning variables for readability

database='/work/groups/VEO/databases/checkv/v1.5'
outdir='./1.3_virus_identification/20_results_assessment_checkv'
cross_assembly='./1.2_assembly/10_results_assembly_flye/cross_assembly/assembly.fasta'
single_assembly='./1.2_assembly/20_results_assessment_vclust/single_assemblies/assembly.fasta'

checkv end_to_end -t 20 -d $database $cross_assembly $outdir/cross_assembly
checkv end_to_end -t 20 -d $database $single_assembly $outdir/single_assemblies

Go through the CheckV results

CheckV produces many output files, and also saves files for the intermediate steps of the tools that are used to find viral marker genes (diamond and hmmsearch). A summary of all results can be found in the file quality_summary.tsv. Open the file and familiarize yourself with the information presented in the table. On CheckV’s website, you can find information about the output in the sections “How it works” and “Output files”.

  1. How are completeness and length related? (qualitative answer, name examples)
  2. What are reasons for contigs with low completeness to appear in the assembly?
  3. How do completeness and contig coverage relate? (qualitative answer, name examples)

Filter contigs

Exercise - select high-quality phage contigs

Use Jaeger’s predictions (‘phage’ and ‘non-phage’) and CheckV’s estimates of completeness and contamination to separate phage from non-phage contigs and reduce our assembly to high-quality contigs of high completeness. If you are an experienced programmer, write (a) script(s) for that. If you are not, use the solutions below. For the solutions below, we used file assembly_default_jaeger.tsv, however, you could also use assembly_default_phages_jaeger.tsv. The results of CheckV are located in the file quality_summary.tsv located in CheckV’s output folder. As a rule of thumb, you could keep all contigs with completeness >50% and contamination <5%. These values could be changed depending on the data and on the project. Note that filtering for low completeness can remove some conserved regions.

To read the tabular files, you can use python’s pandas package. If you didn’t do this yesterday, install it in your virtual environment now.

Then, you can load a .tsv file and select content from it like this:

import pandas as pd

# the .tsv format separates cells by a tab ('\t')
jaeger_df = pd.read_csv(jaeger_results_path, sep='\t')

# create a python set of contig names that stick to the if-statement in the loop
jaeger_selection = {row['contig_id'] for index, row in jaeger_df.iterrows() if row['prediction'] == 'phage'}

# do the same for the checkv results and use set operations to get the contigs selected by both tools
joint_selection = jaeger_selection.intersection(checkv_selection)

# modify the code from yesterday (rename and filter contigs) to go through the assembly and save contigs in the joint selection
...

python script for selecting viral contigs

import os, sys
import pandas as pd
from Bio import SeqIO

def main():

    assembly_path = os.path.abspath(sys.argv[1])
    assert assembly_path.endswith(".fasta")

    jaeger_results_path = os.path.abspath(sys.argv[2])
    assert jaeger_results_path.endswith(".tsv")
    
    checkv_results_path = os.path.abspath(sys.argv[3])
    assert checkv_results_path.endswith(".tsv")
    
    out_fasta = os.path.abspath(sys.argv[4])
    assert out_fasta.endswith(".fasta")

    # read the tsv files as pandas dataframes
    jaeger_df = pd.read_csv(jaeger_results_path, sep='\t')
    checkv_df = pd.read_csv(checkv_results_path, sep='\t')
	
    # collect the sets of contigs which stick to our selection cutoffs
    jaeger_selection = {row['contig_id'] for index, row in jaeger_df.iterrows() if row['prediction'] == 'phage'}
    checkv_selection = {row['contig_id'] for index, row in checkv_df.iterrows() if row['completeness'] > 50 and row['contamination'] < 5}
	
    # use set operation union to get the contigs in the jaeger set AND in the checkv set
    joint_selection = jaeger_selection.intersection(checkv_selection)
	
    # print some numbers
    print(f"Number of input contigs: {len(jaeger_df.index)}, selected by jaeger: {len(jaeger_selection)}, selected by checkv: {len(checkv_selection)}, joint selection: {len(joint_selection)}")
	
    # define list of records to keep and fill it by comparing the contig id of each record to the joint set of selected contigs
    out_records = []
    with open(assembly_path) as handle:
        for record in SeqIO.parse(handle, "fasta"):
            if record.id in joint_selection: out_records.append(record)
    
    # write the selected records into a new file
    with open(out_fasta, "w") as fout:
        SeqIO.write(out_records, fout, "fasta")

if __name__ == "__main__":
    main()

sbatch script for submitting the python script

#!/bin/bash
#SBATCH --tasks=1
#SBATCH --cpus-per-task=2
#SBATCH --partition=short
#SBATCH --mem=1G
#SBATCH --time=00:30:00
#SBATCH --job-name=filter_contigs
#SBATCH --output=./1.3_virus_identification/40_results_filter_contigs/filter_contigs.slurm.%j.out
#SBATCH --error=./1.3_virus_identification/40_results_filter_contigs/filter_contigs.slurm.%j.err

# activate the python virtual environment with the packages we need
source ./py3env/bin/activate

# in this sbatch script, its not necessarry to create the directory,
# we already told sbatch to create it for the log files.
# mkdir -p 40_results_filter_contigs

# In this solution, our script takes the assembly and the files
# 'complete_contigs_default_jaeger.tsv' from jaeger and 
# 'quality_summary.tsv' from CheckV as an input. Set them as variables
# for readability
cross_assembly='./1.2_assembly/10_results_assembly_flye/cross_assembly/assembly.fasta'
jaegerresults='./1.3_virus_identification/10_jaeger/results_jaeger/assembly/assembly_default_jaeger.tsv'
checkvresults='./1.3_virus_identification/20_results_assessment_checkv/cross_assembly/quality_summary.tsv'
outdir='./1.3_virus_identification/40_results_filter_contigs'

# run our script for filtering contigs based on the output of jaeger
# and CheckV as well as the output path. 
python ./python_scripts/identify/40_filter_contigs.py $cross_assembly $jaegerresults $checkvresults $outdir/assembly.fasta

# deactivate the environment
deactivate

Compare the results

CheckV and jaeger follow different approaches and we expect their outputs not to match perfectly. Estimate how different their predictions of the contigs viralness are.

  1. At the selected cutoffs, how many contigs get chosen by jaeger, how many by CheckV?
  2. How many Jaeger hits were annotated by CheckV as having high-quality?

To find how many Jaeger hits were annotated by CheckV as high quality:

cut -f1 assembly_default_phages_jaeger.tsv > phage_contigs_jaegar.list

grep -w -Ff phage_contigs_jaegar.list  20_results_assessment_checkv/cross_assembly/quality_summary.tsv | grep "High-" | wc -l

Key Points

  • Filtering contigs by completeness and contamination is crucial to obtain an informative dataset

  • Tools like Jaeger classify your contigs, enabling you to understand your samples

  • No wet-lab or dry-lab technique is perfect. Filtering non-viral contigs from your data improves its quality, helping you obtain better results