Course introduction
Overview
Teaching: 60 min
Exercises: 0 min
Objectives
Understand course scope and evaluation criteria
Start and plan an effective lab book
Log in to Draco
Viromics
Metagenomics of viruses, often referred to as Viromics, is the study of viral communities in a sample or environment through the direct sequencing and analysis of their genetic material. Because there is no need to isolate or culture the viruses, the biases associated with isolation are avoided and the complete viral community can be analyzed, allowing the identification and characterization of both known and novel viruses within a given ecosystem. In this module, you will learn how to analyse viromics data with computational approaches.
Data
We will be using viromics sequencing data from Dr. Xiu Jia, a Postdoc in the VEO Group. This data was generated by sequencing samples from the Unterwarnow, an estuary near Rostock. The samples were filtered through a 0.22 µm filter (commonly used to enrich for viral particles) and then sequenced on an Oxford Nanopore Technologies flow cell (long-read sequencing).
Theory
The theoretical parts of the course are covered in the mornings. This includes reading relevant papers, watching video lectures, and discussion of the concepts and tools. Please write down any questions and discussion points about the theory in your Lab Notebook (see below).
Hands-on
The practical parts of the course are covered in the afternoons. You will use different bioinformatics tools, compare and interpret the results, and make conclusions about the sampled viruses. We will be available to help and guide you during this time. Please document how you performed the analyses, and write down any questions and discussion points in your Lab Notebook (see below). Solutions for each step are provided. If necessary, you can use them to move to the next steps.
Homework
For some of the tutorials, we will assign some homework (in lieu of the missed day on Friday 20 September). This is usually a visualization exercise. It will be up to you to complete this in your own time and to include it in your Lab Notebook and presentations.
Presentation
You will give a final presentation on Friday 27 September. This is where you can show off your understanding of the material and share what interesting things you have identified in the data.
Final evaluation
Your final grade is based on an evaluation of three factors.
- Your professional performance (active participation, asking questions, helping others): 30%
- Your Lab Notebook (approach to address hands-on questions/exercises, tidiness/reproducibility, notes/questions/discussion points): 30%
- The presentation of your final project: 40%
Lab Notebook
Documenting your work is crucial in Computational Biology/Bioinformatics. This way you can make sure your work is reproducible, document your commands to later find back what you did, prepare a first draft of text for other documents (e.g. a Methods section in a report or paper), and collect information which you can send to colleagues. Make sure that it is tidy, commented, and clearly written, so that others can easily understand it, including your future self.
You are required to write a Lab Notebook in markdown for this module, which will count as part of your evaluation. We recommend that you start a GitHub repository for the course and write your lab book there.
More details on your lab notebook
Access to Draco
Draco is a high-performance cluster created and maintained by the Universitätsrechenzentrum. It is available for members of Thuringian Universities. To log in, you can use ssh (replace <fsuid> by your FSU login):
```bash
ssh <fsuid>@login1.draco.uni-jena.de
```
Terminal or ssh client
More details on access to Draco
Submitting jobs on Draco
When you log in to Draco, you are on the “login node”, which is not the place to run programs or heavier scripts. You MUST be on a “compute node” to run scripts and programs.
You can either request resources from a compute node and run programs interactively, or submit a job to the job scheduler, which then sends it to a compute node to complete.
See here for what slurm architecture looks like
You can either request resources from a node for an “interactive” shell or you can submit via sbatch
To get resources - see here
To submit a job, you have to write a script (e.g. `my_slurm_script.sh`) and then submit it with `sbatch my_slurm_script.sh`. Detailed information on creating these scripts, including descriptions, can be found in the “extras” here.
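If you have not written a slurm script before, the hypothetical skeleton below shows the general shape; the resource values and the `short` partition name are just examples borrowed from the scripts used later in this course:

```bash
#!/bin/bash
#SBATCH --tasks=1                      # number of tasks
#SBATCH --cpus-per-task=4              # CPUs requested for this job (example value)
#SBATCH --mem=1G                       # memory requested (example value)
#SBATCH --time=00:30:00                # wall-time limit, HH:MM:SS (example value)
#SBATCH --partition=short              # queue to submit to (example partition)
#SBATCH --job-name=my_job
#SBATCH --output=my_job.slurm.out.%j   # %j is replaced by the job ID
#SBATCH --error=my_job.slurm.err.%j

# your actual commands go here
echo "Hello from $(hostname)"
```

Submit it with `sbatch my_slurm_script.sh` and monitor it with `squeue -u <fsuid>`.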
Key Points
In this course, you will analyze viral sequence data
You need to keep a lab notebook
You need to be able to access the HPC Draco
Introduction to viromics
Overview
Teaching: 60 min
Exercises: 60 min
Objectives
Summarize metagenomics and viromics
Understand what bacteriophages are and how they fit into microbial communities
Compare and contrast microbial and viral diversity
Video on viral metagenomics
Below is a lecture video introducing the concepts of metagenomics, microbial dark matter, and viromics. The case study covered in the video is about crAssphage, a bacteriophage that was first discovered using bioinformatics, by reanalyzing human gut viromics data with the cross-assembly approach. The prevalence and implications of crAssphage are also discussed.
Exercise
Watch the lecture below and write down at least 3 questions and/or discussion points about it.
- Click on the image to see lecture video “Viral metagenomics: predicting phage-microbe interactions in the gut” by Prof Bas E. Dutilh (34 minutes):
Additional reading: Virus Bioinformatics (Pappas et al. 2021)
Exercise - mindmap
Create a mindmap summarizing the following points from the video. A mindmap is a brainstorming graphic. You can get creative using PowerPoint or draw it on paper, take a picture, and upload it into your Lab Notebook.
- What is metagenomics?
- What is viromics?
- What are bacteriophages and how prevalent and abundant are they?
- What are the roles of phages in microbial communities?
- More generally, how do they influence ecology and the environment?
- Explain at least three ways that phages can have an impact on bacteria.
- More specifically, how do they influence their host (lytic/temperate lifestyles, HGT, etc.)?
- Name at least three differences between the evolution of viruses and that of cellular organisms.
Key Points
Viromics is the study of viruses, and in our case bacteriophages, using next-generation sequencing technologies
Bacteriophages are diverse and ubiquitous across all biomes
Bacteriophages have large implications for their environment, including the human gut
Getting to know your files
Overview
Teaching: 120 min
Exercises: 0 min
Objectives
Interpret different file formats used in bioinformatics
Utilize bash commands to look into sequence data
Get a basic understanding of what sequence data looks like
Work with sequence data in Python or R to create basic plots
Navigating your filesystem space
First, let’s copy the sequence data to your own home folders.
Exercise - get your sequence data
The data is in the folder `/work/groups/VEO/shared_data/Viromics2024Workspace/data/sequences/viral_metagenome`:

```bash
# make a ./data/sequences directory if it doesn't already exist IN YOUR OWN HOME DIRECTORY
mkdir -p viromics/data/sequences

# go into your sequences directory
cd viromics/data/sequences

# copy over the 3 barcode files you need
cp /work/groups/VEO/shared_data/Viromics2024Workspace/data/sequences/viral_metagenome/full_barcode*.fastq.gz ./
```
Exercise - check out your files
- Take a look at the file structure and files. We have given you a subset of viromics data in `./data/sequences`. Some helpful commands are listed below. Hint: for gzipped files you need to `zcat` them first, e.g. `zcat file.fastq.gz | head -10`.

```bash
# list directory contents
ls
ll

# path to working directory
pwd

# print first 10 lines
zcat file.fastq.gz | head -10

# print last 10 lines
zcat file.fastq.gz | tail -10

# open file on terminal (press 'q' to exit less)
less
more

# count number of lines
wc -l file.txt

# search for "@" inside file
grep "@" file.fastq

# or for gzipped files
zgrep "@" file.fastq.gz | head -10

# count number of sequence headers in a fasta file
grep -c ">" file.fasta
zgrep -c ">" file.fasta.gz
```
See more useful commands and one-liners here
When creating new directories, please name them with a number and a meaningful header. An example is here.
Understanding bioinformatics file formats
It is always a good idea to check the contents of your data files and make regular “sanity checks” to see if you understand everything. The first video covers different file formats commonly used in bioinformatics, namely FASTA, FASTQ, BAM, SAM, VCF, BCF, GFF, GTF, and BED files (9 minutes).
The second video covers different types of sequence files, including fasta and fastq, phred scores, files containing metadata (tsv files) and compressed files (11 minutes).
Exercise
- What information is contained in sequencing files?
  a) Choose 3 file formats and describe them.
- Choose the barcode62.fastq.gz file in your data/sequences/ folder.
  a) How many lines does this file have?
  b) How many sequences are present in the file?
- Print the first 5 lines and the last 5 lines in the terminal.
  Hint: for gzipped files you need to `zcat` them first, e.g. `zcat file.fastq.gz | head -10`. You don't need to print this output into your lab book.
  Use `zcat barcode62.fastq.gz | cut -b -100 | head -10` to print the first 100 characters of the first 10 lines.
  a) What differences do you observe between these lines?
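For the question about the number of sequences, it helps to know that a FASTQ record always occupies exactly four lines (header, sequence, separator, quality string), so the read count follows from the line count. A minimal sketch, using the shell tools introduced above:

```bash
# count the lines in the gzipped fastq file
zcat barcode62.fastq.gz | wc -l

# each read takes exactly 4 lines, so the number of reads is the line count divided by 4
echo $(( $(zcat barcode62.fastq.gz | wc -l) / 4 ))
```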
Key Points
Sequence data and information about genes and genomes can come in many different formats
The most common file formats are FASTA (nucleotide and amino acid), FASTQ, SAM/BAM, GenBank, GFF, and TSV files
Sequencing Quality Control
Overview
Teaching: 180 min
Exercises: 0 min
Objectives
Understand the value of sequencing quality control
Discuss how low quality reads affect genomic analysis
Perform quality check on long-read data by creating plots
Interpret quality check results
Perform quality control on reads
Every step of a metagenomics/viromics analysis pipeline needs quality control and sanity checks.
- What data am I working with?
- What is the quality of the data?
- Are my results what I expected them to be? Why?/Why not?
Here, we will assess the quality of DNA sequencing.
Sequencing quality control can include:
- Removing barcodes (i.e. oligonucleotides that were artificially added to identify sequences from a particular sample)
- Removing sequencing adapters (i.e. oligonucleotides that were artificially added to facilitate the sequencing process)
- Removing low-quality nucleotides (sequencing quality often drops towards the end of the reads)
- Filtering out reads based on quality scores (some reads are derived from faulty clusters or pores, and are best ignored)
Note that the Nanopore sequencing basecallers (Guppy/Dorado) already do quality control for us, but we will perform an extra check anyway.
Exercise
Discuss with your classmates and TAs:
- Why do we need quality control?
- What is the impact of including low-quality reads in downstream analyses?
- What are some metrics for assessing sequencing quality?
Assessing Sequencing Quality
We will use NanoPlot to assess the quality of our reads. NanoPlot is designed for long reads and uses information from the sequence files as well as the metadata files to produce plots for evaluating read quality.
Exercise
- Run NanoPlot on your reads
- Assess the sequencing quality for each sample using default parameters
- Use the fastq files and summary.txt located in `./data/sequences/` to generate different types of quality plots (see the NanoPlot GitHub)
- Play around with the parameters such as colours and types of plots to generate

```bash
# to run nanoplot - you will need to source and activate the following conda environment:
source /vast/groups/VEO/tools/anaconda3/etc/profile.d/conda.sh && conda activate nanoplot_v1.41.3

# complete the command by filling in the '?' placeholders
NanoPlot -t ? --plots ? --color ? --fastq barcode1.fastq.gz -o output_dir/barcode1
```
Resources
Information about Nanopore sequencing quality and Phred scores
How to write a for-loop to loop through your files
sbatch script for NanoPlot
```bash
#!/bin/bash
#SBATCH --tasks=1
#SBATCH --cpus-per-task=10
#SBATCH --partition=short,standard,interactive
#SBATCH --mem=1G
#SBATCH --time=2:00:00
#SBATCH --job-name=sequencing_QC
#SBATCH --output=1.1_QC/10_nanoplot/nanoplot.slurm.out.%j
#SBATCH --error=1.1_QC/10_nanoplot/nanoplot.slurm.err.%j

# First, we check the quality of the reads using nanoplot
# activate the conda environment containing nanoplot
source /vast/groups/VEO/tools/anaconda3/etc/profile.d/conda.sh && conda activate nanoplot_v1.41.3

datadir="./data/sequences"
outdir="./1.1_QC/10_nanoplot"

# Run nanoplot on 3 fastq files in a for loop
# -t : threads (cpus)
# --plots : types of plots to produce, see gallery: https://gigabaseorgigabyte.wordpress.com/2017/06/01/example-gallery-of-nanoplot/
# --N50 : draw the N50 vertical line on the plots
# --color : color for the plots
# -o : output directory
# to see other options for colors and prefilters, run : NanoPlot -h
for fn in $datadir/full_barcode6*.fastq.gz
do
  base_f=$(basename $fn .fastq.gz)  # select just the first part of the name
  NanoPlot -t 10 --plots dot --N50 --color slateblue --fastq $fn -o $outdir/${base_f}/
done
```
This is an example of a plot you might get from NanoPlot. It is important to look at such a plot to estimate how much data will be lost when you filter by read quality or length.
- The top plot shows frequency of read lengths
- The large main plot shows how the read quality changes by read length
- The right plot shows the frequency of read qualities
Exercise
- How does the sequencing quality of the 3 samples compare to each other? Specify your answer with metrics from the NanoPlot results.
- Do we need to remove any reads? Why (not)?
- What does Q12 mean? How much data will be lost from each sample if we filter at Q12?
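As a hint for the last question: Phred scores encode the per-base error probability as Q = -10·log10(P), i.e. P = 10^(-Q/10). A quick sanity check in Python:

```python
# Phred quality: Q = -10 * log10(P)  =>  P = 10 ** (-Q / 10)
for q in (10, 12, 20):
    p = 10 ** (-q / 10)
    print(f"Q{q}: error probability {p:.4f}, i.e. {(1 - p) * 100:.2f}% base call accuracy")
```

So a read with mean quality Q12 has an estimated error probability of about 6.3% per base.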
Filtering Reads
Many phages naturally have low-complexity regions in their genomes (e.g. ACACACACAC). Nanopore sequencing errors are biased towards low-complexity regions, either by falsely generating them or by exacerbating them. This can create artifacts in the assembled viral genome sequences. High-quality reads can significantly improve assemblies, particularly for error-prone long reads.
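To see what such a low-complexity region looks like in practice, here is a small illustrative sketch; the repeat pattern and the threshold of four repeats are arbitrary choices for the example, not part of any tool used in this course:

```python
import re

# toy read containing a dinucleotide repeat like the one mentioned above
seq = "TTGACACACACACACGGATTAGC"

# flag runs of four or more consecutive "AC" dinucleotides
for match in re.finditer(r"(?:AC){4,}", seq):
    print(f"low-complexity stretch '{match.group()}' at positions {match.start()}-{match.end()}")
```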
Exercise - filter your reads
- We will use Chopper to filter our reads based on quality scores and length. Build an sbatch script for chopper based on the following basic usage example. Use a for-loop that processes all your reads files. Use the following filtering criteria:
- minimum quality 14
- minimum length 1000 bp
- maximum length 40000 bp
```bash
# activate the conda environment
source /vast/groups/VEO/tools/anaconda3/etc/profile.d/conda.sh && conda activate chopper_v0.5.0

# basic usage for filtering with minimum quality 10
gunzip -c reads.fastq.gz | chopper -q 10 | gzip > filtered_reads.fastq.gz

# check the help for the other options
chopper -h
```
Chopper sbatch example
```bash
#!/bin/bash
#SBATCH --tasks=1
#SBATCH --cpus-per-task=10
#SBATCH --partition=short,standard
#SBATCH --mem=1G
#SBATCH --time=2:00:00
#SBATCH --job-name=sequencing_QC
#SBATCH --output=1.1_QC/20_chopper/chopper.slurm.out.%j
#SBATCH --error=1.1_QC/20_chopper/chopper.slurm.err.%j

# activate the conda environment containing chopper
source /vast/groups/VEO/tools/anaconda3/etc/profile.d/conda.sh && conda activate chopper_v0.5.0

datadir="./data/sequences"
outdir="1.1_QC/20_chopper"

# gunzip : unzip the reads
# -q : minimum read quality to keep
# -l : minimum read length to keep
# --maxlength : maximum read length to keep
# gzip : rezip the filtered reads
for fn in $datadir/*.fastq.gz
do
  base_f=$(basename $fn .fastq.gz)  # select just the first part of the name
  gunzip -c $fn | chopper -q 14 -l 1000 --maxlength 40000 | gzip > $outdir/${base_f}_filtered_reads.fastq.gz
done
```
Exercise
- How many reads do you have left after filtering? What percentage of the original reads were filtered?
- What is the average quality? Is this before or after filtering?
- Optional: Run Nanoplot again to show the difference in the read statistics
GC Content
GC content, the fraction of bases that are G or C ((G + C) / (A + T + G + C)), is another interesting metric for evaluating your sequencing data. There is no “wrong” GC content; rather, this measure tells us something about the nucleic acid diversity of our metagenomes/viromes. GC content is generally consistent along the length of a genome, so the GC content of sequencing reads from a single genome follows a bell curve. Regions with outlier values often represent non-self regions, for example a horizontally transferred gene, an inserted prophage, or a mobile genetic element. GC content is also generally consistent between genomes of similar strains.
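As a toy example, here is the GC content of a short sequence, computed by hand and with the Biopython helper used later in this lesson:

```python
from Bio.SeqUtils import gc_fraction

seq = "ATGGCCTA"

# by hand: (G + C) / (A + T + G + C)
gc_manual = (seq.count("G") + seq.count("C")) / len(seq)
print(gc_manual * 100)         # 50.0

# Biopython equivalent
print(gc_fraction(seq) * 100)  # 50.0
```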
To compute and visualize the GC content of the reads, we will use Python. We will need Python throughout the course and will make use of a virtual environment. Please set up a virtual environment according to this tutorial. The tutorial also shows you how to install any packages you might need for this analysis.
Exercise - GC content plots
Copy over the T4 phage genome:

```bash
cp /work/groups/VEO/shared_data/Viromics2024Workspace/data/sequences/T4_SRR29341083.fastq.gz ./data/sequences
```

- What is the GC content profile of your sequenced reads?
  a) Plot the distribution of GC % in each read as a density plot (histogram).
  - Use the unfiltered reads
  - Install these useful Python packages: biopython, numpy, pandas, matplotlib
- Interpret the distributions:
  - Are these histograms what you expected?
  - How would the GC content of a metagenome differ from these viromes?
  - Describe the difference in GC content between our viromes and E. coli phage T4 (`/work/groups/VEO/shared_data/Viromics2024Workspace/data/sequences/T4_SRR29341083.fastq.gz`).
  - What interpretations can you make about the samples based on these plots?
To make these GC_content plots, you will probably need the following python libraries and their functions:
- Bio.SeqUtils: gc_fraction
- Bio.SeqIO: parse
- pandas: DataFrame, transform
- numpy: arange
- gzip
- sys
- matplotlib: hist
python script to make GC_content plots
To install the libraries
```bash
source ./py3env/bin/activate
pip install biopython pandas numpy matplotlib
```
```python
#!/usr/bin/env python3

# import libraries into your script
from Bio.SeqUtils import gc_fraction
from Bio import SeqIO
import pandas as pd
import numpy as np
import gzip, sys
import matplotlib.pyplot as plt

fastq_dir = sys.argv[1]
T4_path = sys.argv[2]

# read in the fastq.gz files and compute the GC content of each read
gc_62 = {}
with gzip.open(f"{fastq_dir}/full_barcode62_filtered_reads.fastq.gz", 'rt') as f:
    for record in SeqIO.parse(f, "fastq"):
        gc_62[record.id] = gc_fraction(record.seq) * 100
print("gc_62... done")

gc_63 = {}
with gzip.open(f"{fastq_dir}/full_barcode63_filtered_reads.fastq.gz", 'rt') as f:
    for record in SeqIO.parse(f, "fastq"):
        gc_63[record.id] = gc_fraction(record.seq) * 100
print("gc_63... done")

gc_64 = {}
with gzip.open(f"{fastq_dir}/full_barcode64_filtered_reads.fastq.gz", 'rt') as f:
    for record in SeqIO.parse(f, "fastq"):
        gc_64[record.id] = gc_fraction(record.seq) * 100
print("gc_64... done")

gc_T4 = {}
with gzip.open(T4_path, 'rt') as f:
    for record in SeqIO.parse(f, "fastq"):
        gc_T4[record.id] = gc_fraction(record.seq) * 100
print("gc_T4... done")

# convert to pandas dataframes (one column per sample)
# For your samples
df_62 = pd.DataFrame([gc_62]).T
df_63 = pd.DataFrame([gc_63]).T
df_64 = pd.DataFrame([gc_64]).T
# For T4 phage
df_T4 = pd.DataFrame([gc_T4]).T

# make a plotting function
def make_plot(axs):
    # We can set the number of bins with the *bins* keyword argument.
    n_bins = 150
    ax1 = axs[0]
    ax1.hist(df_62, bins=n_bins)
    ax1.set_title('% GC content')
    ax1.set_ylabel('Barcode 62')
    ax1.set_xlim(0, 100)
    ax1.set_xticks(np.arange(0, 100, step=10))
    ax2 = axs[1]
    ax2.hist(df_63, bins=n_bins)
    ax2.set_ylabel('Barcode 63')
    ax3 = axs[2]
    ax3.hist(df_64, bins=n_bins)
    ax3.set_ylabel('Barcode 64')
    ax4 = axs[3]
    ax4.hist(df_T4, bins=n_bins)
    ax4.set_ylabel('T4 Phage')

# Plot:
fig, axs = plt.subplots(4, 1, sharex=True, tight_layout=True)
make_plot(axs)
plt.show()

# save your plot
fig.savefig('GC_Content.png', dpi=150)
```
Run sbatch for gc plots
```bash
#!/bin/bash
#SBATCH --tasks=1
#SBATCH --cpus-per-task=1
#SBATCH --partition=short
#SBATCH --mem=1G
#SBATCH --time=00:30:00
#SBATCH --job-name=gc_plots
#SBATCH --output=1.1_QC/30_gc_plots/gc_plot.slurm.out.%j
#SBATCH --error=1.1_QC/30_gc_plots/gc_plot.slurm.err.%j

# activate the py3env
source ./py3env/bin/activate

# first argument: fastq files directory path
# second argument: T4 file path
python3 ./python_scripts/qc/gc_content.py ./1.1_QC/20_chopper ./data/sequences/T4_SRR29341083.fastq.gz

deactivate
```
Key Points
Sequencing quality control is an important first evaluation of our data
NanoPlot can be used to evaluate read quality of long reads
Chopper can be used to filter reads based on quality and/or length
The GC content profile of a metagenome or virome is composed of a mix of bell curves
Assembly lecture
Overview
Teaching: 60 min
Exercises: 30 min
Objectives
Watch the lecture videos and read about assembly algorithms
Sequence assembly is the reconstruction of long contiguous sequences (called contigs or scaffolds, see video below) from short sequencing reads. Before 2014, a common approach in metagenomics was to compare the short sequencing reads to the genomes of known organisms in the database (and some studies today still take this approach). However, this only works if the organisms in the database are closely related to the ones in the metagenomic sample. Recall that most of the sequences in a metavirome are unknown (“viral dark matter”), meaning that they yield no matches when compared to the reference database. Because of this, we need database-independent approaches to reconstruct new viral sequences. As sequencing technology and bioinformatic tools improved, sequence assembly enabled the recovery of longer sequences from metagenomic data. Having a longer sequence means having more information to classify it, so using metagenome assembly helps to characterize complex communities.
Video on sequence assembly
Watch the lecture video “Assembly strategies for genomics and metagenomics”. It will introduce reference-guided and de-novo assembly of genomic and metagenomic sequences (56 minutes):
Discussion
Watch the lecture video below and write down at least 3 questions and/or discussion points about it.
- Click on the image to see lecture video “Assembly strategies for genomics and metagenomics” by Prof Bas E. Dutilh (56 minutes):
Questions
- Would you use DBG (De-Bruijn Graph) or OLC (Overlap-Layout-Consensus) to assemble a dataset consisting of one billion short sequencing reads?
- What are the strengths and weaknesses of reference-guided assembly and de novo assembly?
- Would you use reference-guided or de novo assembly to assemble the genome of a model organism to discover mutations that occurred during an evolutionary experiment?
- Would you use reference-guided or de novo assembly to determine the genome sequence of an unknown organism?
- Why does metagenome assembly generally yield shorter contigs than genome assembly?
Additional reading: Computational Biology: Genomes, Networks, Evolution MIT course 6.047/6.878 (Prof. Manolis Kellis). This book is part of a course on Computational Biology and contains several topics that are relevant for Bioinformatics.
Read and summarize
Read the following sections and summarize shortly (less than half a page per section) their key points in your lab book:
- “5.2 Genome Assembly I: Overlap-Layout-Consensus Approach” and “5.3 Genome Assembly II: String graph methods” (pages 93 to 102).
Key Points
Sequence assembly can be used to assemble genomes from reads
Metagenome assembly generally yields shorter contigs than genome assembly
Assembly and cross-assembly
Overview
Teaching: 30 min
Exercises: 180 min
Objectives
Assemble the metavirome from 3 samples
In this lesson, you will assemble the metavirome in two different ways using the tool Flye introduced in this article. Flye was designed to work well with noisy long reads and for metagenomic samples. Since you work on Draco, everything even slightly computationally expensive will be run through slurm. Please organize all the following steps in one or more sbatch scripts, as you learned yesterday. Since every tool needs different resources, it is recommended to have a single script per tool. All code snippets presented here assume that you put them in an adequate sbatch script. The necessary resources are mentioned in the comments or descriptions.
Cross-assembly
In a cross-assembly, reads from multiple samples are combined and assembled together. This allows for the discovery of shared sequence elements between the samples. If a virus (or other sequence element) is present in several samples, its sequencing reads from the different samples may be assembled together in one contig. Of course, whether this happens depends on how similar the sequence elements are, and on the stringency settings of the assembly algorithm. To figure out which contigs are present in which sample, we can map the sequencing reads from each sample to the contigs and estimate the abundance by measuring the depth of coverage, i.e. the average number of reads covering each position of the contig. (Note that this can also be done with contigs that were assembled from a single metagenome.)
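As a minimal sketch with made-up numbers, the depth of coverage of a contig can be estimated from an alignment like this:

```python
# depth of coverage = total number of read bases aligned to a contig / contig length
mapped_bases = 150_000  # hypothetical: sum of aligned read bases on this contig
contig_length = 30_000  # hypothetical: length of the contig

depth = mapped_bases / contig_length
print(f"average depth: {depth:.1f}x")  # 5.0x: on average, every position is covered by 5 reads
```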
Exercise - Use flye to assemble a metagenome
You will perform a cross-assembly of the samples you started working with yesterday. To this end, you will first concatenate the sample files and then run Flye on the merged file. Gzipped files can be concatenated just like text files with the command line tool cat. Flye is installed in a conda environment on Draco.
```bash
# activate conda environment with flye installation on Draco
source /vast/groups/VEO/tools/miniconda3_2024/etc/profile.d/conda.sh && conda activate flye_v2.9.2

# merge the sequences
cat /path/to/*.fastq.gz > /path/to/all_samples.fastq.gz

# create a folder for the output cross-assembly
mkdir -p 10_results_assembly_flye/cross_assembly

# complete the flye command. This is the computationally expensive part
# and profits from many cores (30 is a good number). The used memory should
# not exceed 20GB of RAM. Think about the parameters you can pass to flye listed
# in the link below.
flye --nano-raw /path/to/all_samples.fastq.gz --meta --out-dir 10_results_assembly_flye/cross_assembly
```
Here you can find an overview of the possible parameters. Flye can be used for single-organism as well as metagenomic assemblies. Your sequences were generated on the Nanopore MinION platform and filtered to contain only high-quality reads. You can pass an estimate of the size of your metagenome to flye, i.e., the combined length of the assembled contigs. It is difficult to predict this accurately if you do not know what is in your samples (as in our case). How would you very roughly estimate this number?
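One rough approach, sketched below, is to divide the total number of sequenced bases by the average depth of coverage you expect per genome; both the file path and the expected depth are hypothetical example values:

```python
import gzip
from Bio import SeqIO

# sum all sequenced bases in the merged file (path as used in the exercise above)
total_bases = 0
with gzip.open("all_samples.fastq.gz", "rt") as handle:
    for record in SeqIO.parse(handle, "fastq"):
        total_bases += len(record.seq)

expected_depth = 50  # assumed average per-genome coverage; a pure guess, adjust to your expectation
print(f"rough metagenome size estimate: {total_bases / expected_depth / 1e6:.1f} Mb")
```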
After you have finalized your sbatch script with the resource assignments and the completed commands, you can either run it directly, or continue to expand it with the commands for the second approach described below. If you run it now, remember you can check the output of your script in the slurm log files which you can set with the sbatch parameters at the beginning of your script. Check both files. Often, the error output contains more than just errors, and this file is the more informative one.
```bash
#SBATCH --output=10_results_assembly_flye/assembly_flye.slurm.%j.out
#SBATCH --error=10_results_assembly_flye/assembly_flye.slurm.%j.err
```
Flye creates an assembly in multiple steps. You can read through the Flye log file that you defined with the `#SBATCH --error=` parameter. Afterward, you can find a list of all assembled contigs with additional information in the file `assembly_info.txt`, located in the Flye output folder.
- What are the longest and shortest contig lengths?
- What is the range of depth (coverage) values?
- How many circular contigs were assembled? Do you think this percentage is high or low?
- How does your guess about the size of the metagenome (above) compare to the actual size?
Separate assemblies
This part is recommended but optional. If you cannot finish it in the suggested time, move on to the next section “Assessing assembly quality”, which will be important for the coming days.
The second approach consists of performing separate assemblies for each sample and merging the resulting contigs by similarity in the end. Note that if a species is present in several samples, this final set will contain multiple contigs representing the same sequence, each coming from one sample. Because of this, we will further de-replicate the final contigs to get representative sequences.
Exercise - Use flye to assemble each sample individually
To run separate assemblies, you can adapt the flye command used for the cross-assembly to take each sample file individually and output the assemblies in separate folders in 10_results_assembly_flye. For this you can run a for loop over the respective files. The following expects the sample files to be called barcodeN.fastq.gz with N in (62, 63, 64):
```bash
# The individual assemblies need less memory than the cross-assembly,
# but you can still use the same resources as before.
for barcode in $(seq 62 64)
do
  flye ... /path/to/barcode$barcode.fastq.gz ...
done
```
Here, the command seq generates a sequence of integer numbers between its two arguments. Once the assemblies have finished, you will combine the contigs generated for each sample into a single file. Since the generated contigs are only assigned numbers by flye (not necessarily sequential), the same names will be present in each assembly. We have to rename them according to the sample they originate from for all contigs to have unique names. We can do this using Python and the Biopython package. Biopython provides many tools for the analysis of sequencing data, including tools for parsing and writing .fasta files. You can find the documentation on these here.
To use this Python package on Draco, use the virtual environment you set up yesterday. If you did not already install the `biopython` package yesterday, do so now. You can now write a Python script that reads your contigs, changes their names, and writes a new file with the results using Biopython. The basic structure of the script could look like this:

```python
from Bio import SeqIO
import os, sys

# read the arguments passed to your script. They are stored as a list in sys.argv.
# sys.argv contains the parameters passed to the python binary, so sys.argv[0]
# always contains the file name of your script. The arguments passed to the script
# then start at sys.argv[1]
assembly = os.path.abspath(sys.argv[1])

# open a file handle using the "with open(x) as y" syntax. This ensures the file is
# closed properly after the code block.
with open(assembly) as file_handle:
    for record in SeqIO.parse(file_handle, "fasta"):
        # do something with the sequence name
        # write the records with the new names
        ...
```
This script should not be computationally expensive, but we will execute it from within an sbatch script anyway. We can also combine it with the script or scripts you have already written for the assemblies themselves. In this case, it is important to deactivate the conda environment holding the flye installation before activating the environment you just created.
python script for renaming contigs
```python
import os, sys
from Bio import SeqIO

def main():
    # access parameters passed to your script
    assembly = os.path.abspath(sys.argv[1])
    # throw an error if the statement behind assert is not true
    assert assembly.endswith(".fasta")
    out_fasta = os.path.abspath(sys.argv[2])
    assert out_fasta.endswith(".fasta")
    sample_id = sys.argv[3]

    # modify the names of the scaffolds and store them in the to_write list
    to_write = list()
    with open(assembly) as handle:
        for record in SeqIO.parse(handle, "fasta"):
            # make sure not to rename twice if the script is run several times
            if record.id.startswith(sample_id):
                continue
            # adjust the id of the record object
            record.id = f"{sample_id}_{record.id}"
            # append the object to the list for writing back to a file later
            to_write.append(record)

    # write records in to_write to a .fasta file
    with open(out_fasta, "w") as fout:
        SeqIO.write(to_write, fout, "fasta")

if __name__ == "__main__":
    main()
```
sbatch script for assemblies
```bash
#!/bin/bash
#SBATCH --tasks=1
#SBATCH --cpus-per-task=32
#SBATCH --partition=short,standard,interactive
#SBATCH --mem=20G
#SBATCH --time=01:00:00
#SBATCH --job-name=assembly_flye
#SBATCH --output=./1.2_assembly/10_results_assembly_flye/assembly_flye.slurm.%j.out
#SBATCH --error=./1.2_assembly/10_results_assembly_flye/assembly_flye.slurm.%j.err

# run flye in metagenomic mode for de-novo assembly of viral contigs
# First, activate the conda environment which holds the flye installation on draco:
source /vast/groups/VEO/tools/miniconda3_2024/etc/profile.d/conda.sh && conda activate flye_v2.9.2

# set a data directory holding your samples in .fastq.gz format
datadir="1.1_QC/20_chopper"

# create a merged fastq.gz file by concatenating three samples:
cat $datadir/barcode*_filtered_reads.fastq.gz > $datadir/merged_filtered.fastq.gz

# set a directory for outputting results
outdir="./1.2_assembly/10_results_assembly_flye/cross_assembly"

# create a folder for the cross-assembly
mkdir -p $outdir

# flye parameters (https://gensoft.pasteur.fr/docs/Flye/2.9/USAGE.html)
# --nano-raw: tells flye about the input data. --nano-hq is also possible but results in a smaller assembly (Why?).
# --meta: for metagenomes with uneven coverage
# --genome-size: estimated size for metagenome assembly (mg size)
# -t: threads
flye --nano-raw $datadir/merged_filtered.fastq.gz --meta --genome-size 30m --out-dir $outdir -t 30

# single assemblies
# Run flye on all samples sequentially, save the results in separate folders named like the samples:
outdir="./1.2_assembly/10_results_assembly_flye/single_assemblies"
mkdir -p $outdir

for barcode in $(seq 62 64)
do
  flye --nano-raw $datadir/barcode${barcode}_filtered_reads.fastq.gz --meta --genome-size 10m --out-dir $outdir/barcode$barcode -t 30
done

# close the conda environment
conda deactivate

# activate your virtual environment
source ./py3env/bin/activate

# rename the contigs in the generated single assemblies with a python script.
# It expects three parameters: first the input file name, second the output file name, and third
# the sample name to add to each contig name in the respective fasta file.
for barcode in $(seq 62 64)
do
  python3 python_scripts/assembly/10_run_rename_scaffolds.py $outdir/barcode$barcode/assembly.fasta $outdir/barcode$barcode.fasta barcode$barcode
done

deactivate

cat $outdir/barcode*.fasta > $outdir/assembly.fasta
```
Detecting similar sequences within and between assemblies
Some samples will contain the same virus strains. This will lead to the assembly of the same sequence several times. To de-replicate the scaffolds of the single assemblies, we will cluster them at 95% Average Nucleotide Identity (ANI) over 85% of the length of the shorter sequence. These cutoffs are often used to cluster viral genomes at the species rank. This can be done with the tool vClust and results both in clustering complete viral genomes at the species level and in clustering genome fragments together with very similar, longer sequences.
Exercise - use vClust to find similar contigs
vClust can also output cluster representatives, which are the longest sequences within each cluster. Use this mode to get the information we need to dereplicate our merged single assemblies. Also apply vClust to the cross-assembly; this will give us more information on the assembled sequences. You can play with the similarity cutoffs (and metrics) used for clustering to see how they affect the results. vClust needs to align all sequences to each other and can make heavy use of parallelization. Remember to put this step into an sbatch script again and assign around 30 cores and 20 GB of RAM.
```bash
# vClust is a python script and can be run by simply calling it on draco
# you have to run it with python 3.9
vclust='python3.9 /home/groups/VEO/tools/vclust/v1.0.3/vclust.py'

$vclust prefilter -i 10_results_assembly_flye/single_assemblies/assembly.fasta ...
$vclust align ...
$vclust cluster ...
```
After vClust has run successfully, you need to write a Python script that filters the assembly according to the information provided by vClust. To parse a tabular file in CSV or TSV format (comma- or tab-separated values), the Python package pandas can be used. If it is not yet in your virtual environment, install it now.
Now you can write a Python script that filters the assemblies corresponding to the output of vClust:
```python
from Bio import SeqIO
import pandas as pd
import os, sys

# read the tsv file generated by vClust
cluster_df = pd.read_csv('path/to/the/file.tsv', sep='\t')

with open(assembly) as file_handle:
    for record in SeqIO.parse(file_handle, "fasta"):
        # check if the contig is in the set of representatives
        # write the records which passed the test
        ...
```
Python script for filtering similar contigs
```python
import os, sys
import pandas as pd
from Bio import SeqIO

def main():
    assembly_filename = os.path.abspath(sys.argv[1])
    assert assembly_filename.endswith(".fasta")
    out_filename = os.path.abspath(sys.argv[2])
    assert out_filename.endswith(".fasta")
    cluster_reps_filename = os.path.abspath(sys.argv[3])
    assert cluster_reps_filename.endswith(".tsv")

    # read the cluster representatives table and collect their ids in a set
    cluster_reps_df = pd.read_csv(cluster_reps_filename, sep='\t')
    cluster_reps_set = set(cluster_reps_df["cluster"])

    # keep only the records whose id is among the representatives
    to_write = list()
    with open(assembly_filename) as handle:
        for record in SeqIO.parse(handle, "fasta"):
            if record.id in cluster_reps_set:
                to_write.append(record)

    # write records in to_write to .fasta file
    with open(out_filename, "w") as fout:
        SeqIO.write(to_write, fout, "fasta")

if __name__ == "__main__":
    main()
```
sbatch script for dereplication
```bash
#!/bin/bash
#SBATCH --tasks=1
#SBATCH --cpus-per-task=32
#SBATCH --partition=short,standard,interactive
#SBATCH --mem=2G
#SBATCH --time=01:00:00
#SBATCH --job-name=assessment_vclust
#SBATCH --output=./1.2_assembly/20_results_assessment_vclust/assessment_vclust.slurm.%j.out
#SBATCH --error=./1.2_assembly/20_results_assessment_vclust/assessment_vclust.slurm.%j.err

vclust='python3.9 /home/groups/VEO/tools/vclust/v1.0.3/vclust.py'

# cross-assembly
indir='./1.2_assembly/10_results_assembly_flye/cross_assembly'
outdir='./1.2_assembly/20_results_assessment_vclust/cross_assembly'
mkdir -p $outdir

$vclust prefilter -i $indir/assembly.fasta -o $outdir/fltr.txt
$vclust align -i $indir/assembly.fasta -o $outdir/ani.tsv -t 30 --filter $outdir/fltr.txt
$vclust cluster -i $outdir/ani.tsv -o $outdir/clusters_tani_90.tsv --ids $outdir/ani.ids.tsv --metric tani --tani 0.90
$vclust cluster -i $outdir/ani.tsv -o $outdir/clusters_tani_70.tsv --ids $outdir/ani.ids.tsv --metric tani --tani 0.70
$vclust cluster -i $outdir/ani.tsv -o $outdir/clusters_ani_90.tsv --ids $outdir/ani.ids.tsv --metric ani --ani 0.90
$vclust cluster -i $outdir/ani.tsv -o $outdir/clusterreps.tsv --ids $outdir/ani.ids.tsv --algorithm uclust --metric ani --ani 0.95 --cov 0.85 --out-repr

source ./py3env/bin/activate
python3 ./python_scripts/assembly/20_run_filter_representatives.py $indir/assembly.fasta $outdir/assembly.fasta $outdir/clusterreps.tsv
deactivate
echo Done running python script for cross-assembly...

# single assemblies
indir='./1.2_assembly/10_results_assembly_flye/single_assemblies'
outdir='./1.2_assembly/20_results_assessment_vclust/single_assemblies'
mkdir -p $outdir

$vclust prefilter -i $indir/assembly.fasta -o $outdir/fltr.txt
$vclust align -i $indir/assembly.fasta -o $outdir/ani.tsv -t 30 --filter $outdir/fltr.txt
$vclust cluster -i $outdir/ani.tsv -o $outdir/clusters_tani_90.tsv --ids $outdir/ani.ids.tsv --metric tani --tani 0.90
$vclust cluster -i $outdir/ani.tsv -o $outdir/clusters_tani_70.tsv --ids $outdir/ani.ids.tsv --metric tani --tani 0.70
$vclust cluster -i $outdir/ani.tsv -o $outdir/clusters_ani_90.tsv --ids $outdir/ani.ids.tsv --metric ani --ani 0.90
$vclust cluster -i $outdir/ani.tsv -o $outdir/clusterreps.tsv --ids $outdir/ani.ids.tsv --algorithm uclust --metric ani --ani 0.95 --cov 0.85 --out-repr

source ./py3env/bin/activate
python3 ./python_scripts/assembly/20_run_filter_representatives.py $indir/assembly.fasta $outdir/assembly.fasta $outdir/clusterreps.tsv
deactivate
```
Questions - Go through the results of vClust
vClust is a great tool to get a feeling for the diversity within your assemblies (or any kind of set of sequences).
- How many similar sequences were found in each assembly?
- How do the results depend on the choice of your metric (ANI, TANI, or GANI) and your cutoff values?
Key Points
Flye can be used to assemble long and noisy nanopore reads from metagenomic samples.
Samples can be assembled individually and combined in a cross-assembly
vClust can be used to assess the diversity of sequences in your assembly
Assessing assembly quality
Overview
Teaching: 0 min
Exercises: 120 min
Objectives
Assess the quality of your assemblies
Now we will measure some basic aspects of the assemblies, such as the fragmentation degree and the percentage of the raw data they represent. A metagenome consists of all the genomic information of all the organisms in a given system. So in the ideal case, a metagenomic assembly would contain a single and complete contig for each chromosome or plasmid in the sample, in which case all the metagenomic sequencing reads can be perfectly mapped back to the assembly. Of course, this will never happen, because in most biomes there is a long tail of rare organisms that contribute only a tiny fraction of the sequenced DNA, so complete horizontal coverage of their genome cannot be achieved, and their assembly remains fragmented. Other problems are repeated regions in the metagenome and microdiversity between strains, which both lead to complex structures in the assembly graph, and thus fragmented assemblies.
Contig length distribution
To get a first idea about your assembly, it is helpful to look at the distribution of the lengths of the generated contigs. We can describe this distribution with some numbers derived from it, such as the maximum, the median, and the so-called N50 or N90. These numbers are computed by sorting all contigs by length and adding up their lengths from longest to shortest: the length of the contig at which 50% (or 90%) of the total assembly length is reached is called the N50 (or N90). The QUAST program (Gurevich et al., 2013) can be used to compute these values and visualize the distribution of the contig lengths. The tool can additionally use a reference sequence to compare the assembly against for assessing its fragmentation. We do not have a reference and will use the basic analysis of Quast.
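Quast computes these values for you, but to make the definition concrete, here is a small illustrative Python sketch of the N50/N90 computation:

```python
def nx(lengths, fraction=0.5):
    """Return the length of the contig at which `fraction` of the total
    assembly length is reached, walking from the longest contig down."""
    threshold = sum(lengths) * fraction
    running_total = 0
    for length in sorted(lengths, reverse=True):
        running_total += length
        if running_total >= threshold:
            return length

contig_lengths = [80, 70, 50, 40, 30, 20, 10]  # toy assembly, total length 300
print(nx(contig_lengths, 0.5))  # N50 = 70 (80 + 70 = 150, i.e. 50% of 300)
print(nx(contig_lengths, 0.9))  # N90 = 30 (80 + 70 + 50 + 40 + 30 = 270, i.e. 90% of 300)
```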
Exercise - Use Quast to compute the contig length distribution
Use the Quast program to visualize the distribution of the contig lengths. You will need to run it two times, once per assembly, and save the results to different folders (i.e. `result_quast/cross_assembly` and `result_quast/single_assemblies`). Quast does not need many resources; assigning 2 CPUs and 5 GB of RAM for sbatch should be enough.

```bash
# create a folder for the assessment within today's folder
mkdir 30_results_assessment_quast

# Quast is already installed on the server. It's a python script located here:
quast='python3 /home/groups/VEO/tools/quast/v5.2.0/quast.py'

# run quast two times, once per assembly
$quast -o 30_results_assessment_quast/cross_assembly /path/to/your/cross_assembly/assembly.fasta

# run quast again on the assembly.fasta merged from your vclust results (merged x3 single assemblies)
$quast -o 30_results_assessment_quast/single_assemblies ./1.2_assembly/20_results_assessment_vclust/single_assemblies/assembly.fasta
```
You can get the results from the file `report.txt`, or copy the whole results folder to your computer and open `report.html` in your local browser.
sbatch script for running Quast
```bash
#!/bin/bash
#SBATCH --tasks=1
#SBATCH --cpus-per-task=2
#SBATCH --partition=short,standard,interactive
#SBATCH --mem=1G
#SBATCH --time=00:30:00
#SBATCH --job-name=assessment_quast
#SBATCH --output=./1.2_assembly/30_results_assessment_quast/assessment_quast.slurm.%j.out
#SBATCH --error=./1.2_assembly/30_results_assessment_quast/assessment_quast.slurm.%j.err

# Set some variables for the quast script on draco and the files to be analysed
quast='python3 /home/groups/VEO/tools/quast/v5.2.0/quast.py'
cross_assembly='./1.2_assembly/10_results_assembly_flye/cross_assembly/assembly.fasta'
single_assembly='./1.2_assembly/20_results_assessment_vclust/single_assemblies/assembly.fasta'
outdir='./1.2_assembly/30_results_assessment_quast'

# run Quast to visualize the distribution of contig lengths
$quast -o $outdir/cross_assembly $cross_assembly
$quast -o $outdir/single_assemblies $single_assembly
```
How well does the assembly represent the reads?
Next, we will test how much of the raw data (reads) is represented by the assembly (contigs). We will use minimap2 to align the reads from each sample to the assembled contigs and samtools to handle the output of minimap2.
Exercise - Map the reads back to the assembly with minimap2
You will map the reads from each barcode separately. Minimap2 is an alignment tool that outputs the Sequence Alignment Map (SAM) file format, which has to be converted to the compressed binary (BAM) format and then sorted using `samtools view` and `samtools sort`. You can also use `samtools sort` to create index files, which facilitate quick access to the data in the binary file. Last, you can get basic stats of the mapping using `samtools stats`. This will tell you how many of the reads aligned to each of the assemblies, and how many did not align. You can also pipe the respective outputs of each step into the next step, saving disk IO and possibly speeding things up (minimap2 and samtools are specifically designed for this). You can read the manuals for minimap2 and samtools to figure out the specific commands to use (use the solution if you're short on time).

```bash
# minimap2 and samtools are installed on draco and you can set aliases to their location:
minimap2='/home/groups/VEO/tools/minimap2/v2.26/minimap2'
samtools='/home/groups/VEO/tools/samtools/v1.17/bin/samtools'

# pipe the commands into each other. The '-' tells the tools to take their input from the pipe:
$minimap2 ... | $samtools view ... - | $samtools sort ... -

# get mapping statistics of each bam file
$samtools stats ...
```
`samtools stats` creates a long report with tabular statistics that could be plotted. The size of the output is determined by the maximum read length and can get very large for alignments of long reads. A summary of the statistics is located at the beginning of the file; you can read the relevant first 46 lines with `head -46 path/to/stats/file.txt` or page through the file with `less path/to/stats/file.txt`. There is a lot of information contained in these lines; important measures for us are the number of reads that could be mapped back to the assembly and the number of bases mapped.
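The summary lines at the top of the report are prefixed with `SN`; the samtools documentation suggests extracting them like this (the stats file name here follows the sbatch script below):

```bash
# keep only the summary-number (SN) lines and drop the SN prefix
grep ^SN barcode62_stats.txt | cut -f 2-

# e.g. pull out just the mapped-read and mapped-base counts
grep ^SN barcode62_stats.txt | grep -e "reads mapped:" -e "bases mapped:"
```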
sbatch script for aligning the samples to the assemblies
```bash
#!/bin/bash
#SBATCH --tasks=1
#SBATCH --cpus-per-task=32
#SBATCH --partition=short,standard,interactive
#SBATCH --mem=2G
#SBATCH --time=00:30:00
#SBATCH --job-name=alignment_minimap2
#SBATCH --output=./1.2_assembly/40_results_alignment_minimap2/alignment_minimap2.slurm.%j.out
#SBATCH --error=./1.2_assembly/40_results_alignment_minimap2/alignment_minimap2.slurm.%j.err

# assign tool paths to variables for better readability
minimap2='/home/groups/VEO/tools/minimap2/v2.26/minimap2'
samtools='/home/groups/VEO/tools/samtools/v1.17/bin/samtools'

# run minimap2 to align all reads from a sample to the assembled contigs
# and pipe the output into samtools for conversion into the binary bam format
#
# minimap2 parameters (https://lh3.github.io/minimap2/minimap2.html):
# -x map-ont : use a preset for parameterizing the affine gap penalty model for the extension
#              of matched seeds, suited for noisy nanopore reads
# -a : output in SAM format
# -t 30 : run with 30 threads
#
# samtools parameters (http://www.htslib.org/doc/samtools.html):
#
# samtools view can be used to convert between SAM, BAM and CRAM formats.
# view -u : output uncompressed binary format (BAM)
#
# samtools sort can be used to sort a SAM, BAM or CRAM file. Some tools expect sorted alignments.
# sort --write-index : also output the index of the sorted alignments, which can reduce file IO
#                      when accessing only a subset of the alignments
# sort -o : set the output file for the sorted alignments
#
# - : the - tells samtools to take the input from the pipe (| is the piping operator)

indir='./1.1_QC/20_chopper'

# cross-assembly
outdir='./data/alignments/cross_assembly'
assembly='./1.2_assembly/10_results_assembly_flye/cross_assembly/assembly.fasta'
mkdir -p $outdir

# loop through the numbers 62 to 64 and use them to generate the different filenames within the loop
for barcode in $(seq 62 64)
do
  $minimap2 -x map-ont -a -t 30 $assembly $indir/barcode${barcode}_filtered_reads.fastq.gz | \
    $samtools view -u - | $samtools sort -o $outdir/barcode$barcode.bam --write-index -
  $samtools stats $outdir/barcode$barcode.bam > $outdir/barcode${barcode}_stats.txt
done

# single assemblies
outdir='./data/alignments/single_assemblies'
assembly='./1.2_assembly/20_results_assessment_vclust/single_assemblies/assembly.fasta'
mkdir -p $outdir

for barcode in $(seq 62 64)
do
  $minimap2 -x map-ont -a -t 30 $assembly $indir/barcode${barcode}_filtered_reads.fastq.gz | \
    $samtools view -u - | $samtools sort -o $outdir/barcode$barcode.bam --write-index -
  $samtools stats $outdir/barcode$barcode.bam > $outdir/barcode${barcode}_stats.txt
done
```
Questions - Compare both assemblies
You now have access to several metrics about your assemblies.
- The assembly data from Flye;
- The similarity between your contigs from vClust;
- N50, N90 and the distribution of contig lengths from Quast;
- The alignment of your reads to the assembled contigs from Minimap2/Samtools.
Can you summarize these results in a clear way and explain the difference between the assemblies? Focus on the difference between the cross-assembly and the separate assemblies.
- Which assembly has more contigs? Why?
- Which assembly has more contigs above a fixed length? Why?
- Which assembly better represents its input reads? Why?
- Can you think of other metrics to assess the quality of a metagenomic assembly?
Key Points
Quast computes metrics such as N50 and the contig length distribution to assess the fragmentation of an assembly
minimap2 aligns long and noisy nanopore reads efficiently to large (meta)genomes
samtools can be used to read, filter, convert and summarize alignments
Visualizing the assembly
Overview
Teaching: 0 min
Exercises: 90 min
Objectives
Understand the topology of the de-Bruijn graph
Understand how the presence of similar species in the sample affects the assembly
This section is suggested as homework.
Choose one of the following two topics: “A. Paths in the de-Bruijn graph” or “B. Effect of related species”. Pick what sounds more interesting to you, and discuss your results with someone who picked the other topic. (Bandage is difficult to run on Windows.)
A. Paths in the de-Bruijn graph
We will use Bandage, a tool to visualize the assembly graph. Bandage is difficult to run on a Windows computer. In the releases section, follow the instructions to download the most appropriate version, such as `Bandage_Ubuntu-x86-64_v0.9.0_AppImage.zip`. To run it, unzip the file and call Bandage from the terminal like this:

```bash
# run Bandage
./Bandage_Ubuntu-x86-64_v0.9.0.AppImage
```
In File > Load_graph, navigate to and load the file `assembly_graph.fastg` of the cross-assembly. Then click Draw graph to visualize the graph. Note that this graph has already been compacted by collapsing nodes that form linear, unbranching paths into unitigs. Nodes in the graph are called edge_N (a confusing name…) with N being an integer. They often correspond to the final contigs in your assembly.
Bubbles and junctions
Open the file `assembly_info.txt` corresponding to the graph you are looking at. The N in the node names displayed by Bandage corresponds to the number assigned to continuous paths in the de-Bruijn graph by Flye. The “graph_path” column holds this information for all contigs.
- What does an asterisk * mean?
- What do multiple occurrences of the same number mean?
Pick two components of the visualized de-Bruijn graph and explain their topology and information content.
- Are there bubbles and junctions?
- Can you relate the complexity of the visualized graph to the Flye command line parameters?
B. Effect of related species
Metagenomic samples often contain several strains of a given species. This is particularly evident for viruses, which typically comprise many haplotypes. Each small difference between the genomes leads to a fork and a structure in the assembly graph. This complicates the path-finding algorithm implemented in the assembly tool, and mistakes at this point can lead to chimeric contigs containing sequences from more than one strain. To visualize this effect, we will align the reads in our samples back to the assembled contigs and use JBrowse2 to visualize the differences between reads and contigs. Download the tool. Under Linux or Mac, you can start the AppImage from within the corresponding folder simply by typing:

```bash
# run JBrowse2
./jbrowse-desktop-v2.13.1-linux.AppImage
```
After starting JBrowse2, select “open sequence file(s)”. Assign a name to your assembly, select FastaAdapter under “Type”, then use “choose file” to navigate to your assembly file and submit.

Click “launch view” with “linear genome view” selected, then “show all regions in assembly” and “open track selector”. Click the plus sign in the lower right corner and select “add track”. Under “main file”, select “File”, then “choose file”, and navigate to the alignments you computed for the first of our three samples (e.g., “barcode62.bam”). After that, repeat the same under “index file” and navigate to the index file of the alignments (e.g., “barcode62.bam.csi”). Click “Next”, make sure the IndexedBamAdapter is selected, and confirm.
Repeat this process for the other two alignment files and their index files. Now you can inspect single reads aligned to your assembly. Try to familiarize yourself with the interface. You can search for a specific contig by typing its name into the search bar at the top of the interface.
Chimeras and reads connecting contigs
Try to look into contigs with high coverage and find some which have reads mapped to them in all three samples.
- Can you identify multiple virus strains visually? Make a screenshot, explain what you see, and how this fits into what you learned about assemblies with de-Bruijn graphs.
When clicking on an alignment, you can find alternative alignments of the same read to other contigs.
- Name at least one reason that (part of) a read could be aligned to multiple locations.
Key Points
Bandage can visualize the de-Bruijn graph
JBrowse2 can visualize genomic data like alignments and coverage
Identifying Viral Contigs I
Overview
Teaching: 0 min
Exercises: 180 min
Objectives
Understand the concept and structure of benchmarking virus identification tools
Evaluate the benchmarking results presented in the paper by Wu et al.
Choose a virus identification tool
Comparing Virus Identification Tools
In this theoretical part, you will dive into a very important part of microbiome data analysis: how to choose a good tool. Several tools have been developed over the past years for identifying viral sequences. Each tool measures different aspects of the sequence, applies different algorithms to extract and compare the information, and uses different reference databases to distinguish viral from non-viral sequences.
To choose the best one for your project, you should consider the overall performance of the tools (e.g. sensitivity and specificity), but also ask whether the purpose of the tool aligns with your research goal. If a recent tool does the same task as a previously published one, the authors should compare (or benchmark) them in their publication. But benchmarks can be biased, so ideally, bioinformatic tools are benchmarked by independent scientists.
Today, we will study the paper: “Benchmarking bioinformatic virus identification tools using real‐world metagenomic data across biomes” (Wu et al., 2024). Considering the allocated time for this activity, you will probably not be able to read the entire paper. To answer the proposed questions, please focus on Figures 1, 3, and 6 (and associated Methods and Results paragraphs, search online when necessary).
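As a refresher for the questions below, the evaluation metrics all derive from the four confusion-matrix counts. A minimal sketch with made-up numbers:

```python
# hypothetical benchmark counts for one tool
TP, FP, FN, TN = 80, 10, 20, 90

sensitivity = TP / (TP + FN)  # true positive rate (TPR), a.k.a. recall
specificity = TN / (TN + FP)  # 1 - specificity is the false positive rate (FPR)
precision   = TP / (TP + FP)  # fraction of predicted viral sequences that are truly viral

print(f"TPR = {sensitivity:.2f}, FPR = {1 - specificity:.2f}, precision = {precision:.2f}")
```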
Questions
The authors investigated nine tools, which can be divided into three categories. What categories? Describe the basic approach of each category of tools.
What is meant by the sets of True Positives, False Negatives, False Positives, and True Negatives in Figure 1? Explain what each of these sets can tell us in the benchmark, and explain how they were generated.
In Figure 3 A/B/C, what indicates a good performance?
In Figure 3 D/E/F, the terms TPR, FPR, and AUC/ROC are used for evaluation. What are those terms? What is the main takeaway from these figures? (See the sketch after these questions for how such metrics are computed.)
Overall, which types of tools perform better? Why do you think that is?
Explain what is shown in Figure 6. Do you believe it? What value does this analysis add to the benchmarking?
Considering all you have learned from the paper, is there an ideal tool to identify viral sequences? Why (not)?
For the project we are developing in this module, which tool(s) would you use? Why?
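To make the evaluation terms in the questions above concrete, here is a small python sketch of how sensitivity (TPR), specificity, FPR, and precision are computed from the four sets of a confusion matrix. The counts are invented for illustration and are not taken from the paper:
# invented example counts for one tool on one dataset
TP = 90   # truly viral, predicted viral
FN = 10   # truly viral, predicted non-viral
FP = 20   # truly non-viral, predicted viral
TN = 880  # truly non-viral, predicted non-viral

sensitivity = TP / (TP + FN)  # a.k.a. recall or TPR: fraction of real viruses found
specificity = TN / (TN + FP)  # fraction of non-viral sequences correctly rejected
FPR = FP / (FP + TN)          # x-axis of a ROC curve; equals 1 - specificity
precision = TP / (TP + FP)    # fraction of predicted viruses that are truly viral

print(f"TPR={sensitivity:.2f} specificity={specificity:.2f} FPR={FPR:.2f} precision={precision:.2f}")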
Key Points
Different approaches can be used in a virus identification tool, such as reference-based and machine learning
A benchmark is a comparison between tools and should offer concrete evidence of performance, such as the numbers of TP, FN, FP and TN
When choosing a tool, you should consider the priorities of the project. For example, if you need to identify as many viruses as possible and false positives are not critical, a tool with high recall is ideal
Identifying viral contigs II
Overview
Teaching: 0 min
Exercises: 180 min
Objectives
Select medium-complete and low-contaminated contigs
Identify phage contigs in assembled viromes and metagenomes
Interpret Jaeger’s output
Select viral contigs for further analysis
Assess assembly completeness
Understand differences between viromes and metagenomes
In this section, we will identify viral sequences among our assembled contigs. This morning, you studied how Wu et al. benchmarked nine bioinformatic virus identification tools. As many of those tools are quite slow, we will use a different tool that was developed by the VEO-MGX Groups: Jaeger. Jaeger was developed after the Wu et al. benchmarking study, so we will test how well it performs ourselves by running it on the cross-assembled virome contig dataset, and also on contigs from total community metagenomes from the same Unterwarnow estuary samples. We will compare the percentage of detected contigs to the numbers reported in the Wu et al. paper.
Document your activities and answer the questions below in your lab book. Do not forget to cite all relevant sources in your work.
Identify viral contigs
We will start to identify viral contigs with Jaeger.
Challenge
Viromes and metagenomes are obtained in the wet lab through different filtration steps. Smaller filters remove larger particles such as eukaryotes and prokaryotes. Where do you expect the most hits from Jaeger (virome or metagenome) and why?
Why is it important to use tools like Jaeger on viromes?
Note that Jaeger should require a bit more than 1 GB for the sbatch job, so make sure to allocate at least 2 GB of memory. Jaeger can run on CPU nodes, but its speed is optimal when run on GPU nodes. In the --partition parameter of the sbatch script, you could add gpu,short to allocate the job to a GPU node.
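For example, the resource section of such an sbatch script could start like this (whether an explicit GPU request is additionally required depends on Draco's Slurm configuration; the --gres line is an assumption to check against the cluster documentation):
#SBATCH --partition=gpu,short
#SBATCH --mem=2G
# some clusters also require an explicit GPU request, e.g.:
# #SBATCH --gres=gpu:1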
Exercise
Run Jaeger for cross-assembly:
- Read and interpret the output following Jaeger’s webpage
module load nvidia/cuda/12.1.0
source /vast/groups/VEO/tools/miniconda3_2024/etc/profile.d/conda.sh
conda activate jaeger_dev
python3 /home/groups/VEO/tools/jaeger/v1.1.30a0/Jaeger/bin/jaeger run -i <file> -o <output_file>
Inspect the output files. After running Jaeger, answer the questions below.
head <output>
sbatch script for jaeger
#!/bin/bash
#SBATCH --tasks=1
#SBATCH --cpus-per-task=10
#SBATCH --partition=standard
#SBATCH --mem=5G
#SBATCH --time=2:00:00
#SBATCH --job-name=jaeger
#SBATCH --output=./1.3_virus_identification/10_jaeger/jaeger.slurm.out.%j
#SBATCH --error=./1.3_virus_identification/10_jaeger/jaeger.slurm.err.%j

# The most updated version of Jaeger is from 20240806
module load nvidia/cuda/12.1.0
source /vast/groups/VEO/tools/miniconda3_2024/etc/profile.d/conda.sh
conda activate jaeger_dev

# from here on, we will just use the cross-assembly for our analyses
assembly='./1.2_assembly/10_results_assembly_flye/cross_assembly/assembly.fasta'
outdir='./1.3_virus_identification/10_jaeger'
jaeger='python3 /home/groups/VEO/tools/jaeger/v1.1.30a0/Jaeger/bin/jaeger run'

mkdir -p $outdir
$jaeger -i $assembly -o $outdir/results_jaeger
Questions
- Do the results corroborate your expectations?
- Why are not all contigs in the virome identified as viral contigs?
- Take the longest contig from the dataset and BLAST it using blastn. What are the top hits? Are they expected?
Optional: How many viral contigs are there in the virome?
bash command for getting the number of phage contigs
cut -f 3 <file> | sort | uniq -c
For a command to extract the sequence of a given contig ID, see the "Extract contig fasta" section of the Gene Calling and Functional Annotation II tutorial.
Estimating genome completeness
There are tools to assess the completeness of bacterial and viral genome sequences. For viruses we use CheckV. The tool identifies genes on the query sequence and compares them to a database of viral and bacterial marker genes. Each taxonomic group of bacteria or phages, e.g. a species or a family, has certain marker genes on its genome. So based on the number and types of marker genes, CheckV can figure out to which taxon a query contig belongs and estimate how much of the genome it represents (estimated completeness). (We will improve the taxonomic annotation in the “Viral Taxonomy and Phylogeny” section next week.) CheckV also checks for unexpected marker genes on the sequence (estimated contamination), and whether part of a phage contig likely represents bacterial sequence, in which case the fragment could be part of a host genome with an integrated prophage.
# create a folder for the assessment (or let sbatch create it when you assign the output and error log files)
$ mkdir 30_results_assessment_checkv
# activate the conda environment containing the checkv installation
$ source /vast/groups/VEO/tools/anaconda3/etc/profile.d/conda.sh && conda activate checkv_v1.0.1
# run checkV on both assemblies
$ checkv end_to_end ...
sbatch script for running checkV
#!/bin/bash
#SBATCH --tasks=1
#SBATCH --cpus-per-task=22
#SBATCH --partition=standard
#SBATCH --mem=20G
#SBATCH --time=02:30:00
#SBATCH --job-name=assessment_checkv
#SBATCH --output=./1.3_virus_identification/20_results_assessment_checkv/assessment_checkv.slurm.%j.out
#SBATCH --error=./1.3_virus_identification/20_results_assessment_checkv/assessment_checkv.slurm.%j.err

# run CheckV to assess the completeness of single-contig virus genomes.
# First, activate the conda environment which holds the CheckV installation on draco:
source /vast/groups/VEO/tools/anaconda3/etc/profile.d/conda.sh && conda activate checkv_v1.0.1

# CheckV parameters (https://bitbucket.org/berkeleylab/checkv/src/master/#markdown-header-running-checkv)
# checkv end_to_end runs the CheckV pipeline from end to end :). It expects an input fasta file
# with the assembly and an output path.
# -t: threads

# assigning variables for readability
database='/work/groups/VEO/databases/checkv/v1.5'
outdir='./1.3_virus_identification/20_results_assessment_checkv'
cross_assembly='./1.2_assembly/10_results_assembly_flye/cross_assembly/assembly.fasta'
single_assembly='./1.2_assembly/20_results_assessment_vclust/single_assemblies/assembly.fasta'

checkv end_to_end -t 20 -d $database $cross_assembly $outdir/cross_assembly
checkv end_to_end -t 20 -d $database $single_assembly $outdir/single_assemblies
Go through the CheckV results
CheckV produces many output files, and also saves files for the intermediate steps of the tools that are used to find viral marker genes (diamond and hmmsearch). A summary of all results can be found in the file quality_summary.tsv. Open the file and familiarize yourself with the information presented in the table. On CheckV's website, you can find information about the output in the sections "How it works" and "Output files".
- How are completeness and length related? (qualitative answer, name examples)
- What are reasons for contigs with low completeness to appear in the assembly?
- How do completeness and contig coverage relate? (qualitative answer, name examples; the plotting sketch below may help)
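To get a quick overview before answering, a small plotting sketch like the following can help. It assumes the standard quality_summary.tsv columns contig_length and completeness (completeness can be "NA" for some contigs, hence the coercion); contig coverage is not part of this file and would have to be merged in from the assembly statistics:
# sketch: scatter plot of CheckV completeness against contig length
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("quality_summary.tsv", sep="\t")
df["completeness"] = pd.to_numeric(df["completeness"], errors="coerce")
df = df.dropna(subset=["completeness"])

fig, ax = plt.subplots(figsize=(5, 4))
ax.scatter(df["contig_length"], df["completeness"], s=8, alpha=0.5)
ax.set_xscale("log")  # contig lengths span orders of magnitude
ax.set_xlabel("contig length (bp)")
ax.set_ylabel("estimated completeness (%)")
fig.savefig("completeness_vs_length.png", dpi=200)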
Filter contigs
Exercise - select high-quality phage contigs
Use Jaeger's predictions ('phage' and 'non-phage') and CheckV's estimates of completeness and contamination to separate phage from non-phage contigs and reduce our assembly to high-quality contigs of high completeness. If you are an experienced programmer, write (a) script(s) for that. If you are not, use the solutions below. For the solutions below, we used the file assembly_default_jaeger.tsv; however, you could also use assembly_default_phages_jaeger.tsv. The results of CheckV are located in the file quality_summary.tsv in CheckV's output folder. As a rule of thumb, you could keep all contigs with completeness >50% and contamination <5%. These values can be adjusted depending on the data and the project. Note that filtering out low-completeness contigs can remove some conserved regions. To read the tabular files, you can use python's pandas package. If you didn't do this yesterday, install it in your virtual environment now.
Then, you can load a .tsv file and select content from it like this:
import pandas as pd

# the .tsv format separates cells by a tab ('\t')
jaeger_df = pd.read_csv(jaeger_results_path, sep='\t')

# create a python set of contig names that stick to the if-statement in the loop
jaeger_selection = {row['contig_id'] for index, row in jaeger_df.iterrows() if row['prediction'] == 'phage'}

# do the same for the checkv results and use set operations to get the contigs selected by both tools
joint_selection = jaeger_selection.intersection(checkv_selection)

# modify the code from yesterday (rename and filter contigs) to go through the assembly and save contigs in the joint selection
...
python script for selecting viral contigs
import os, sys
import pandas as pd
from Bio import SeqIO

def main():
    assembly_path = os.path.abspath(sys.argv[1])
    assert assembly_path.endswith(".fasta")
    jaeger_results_path = os.path.abspath(sys.argv[2])
    assert jaeger_results_path.endswith(".tsv")
    checkv_results_path = os.path.abspath(sys.argv[3])
    assert checkv_results_path.endswith(".tsv")
    out_fasta = os.path.abspath(sys.argv[4])
    assert out_fasta.endswith(".fasta")

    # read the tsv files as pandas dataframes
    jaeger_df = pd.read_csv(jaeger_results_path, sep='\t')
    checkv_df = pd.read_csv(checkv_results_path, sep='\t')

    # collect the sets of contigs which stick to our selection cutoffs
    jaeger_selection = {row['contig_id'] for index, row in jaeger_df.iterrows() if row['prediction'] == 'phage'}
    checkv_selection = {row['contig_id'] for index, row in checkv_df.iterrows() if row['completeness'] > 50 and row['contamination'] < 5}

    # use the set operation intersection to get the contigs in the jaeger set AND in the checkv set
    joint_selection = jaeger_selection.intersection(checkv_selection)

    # print some numbers
    print(f"Number of input contigs: {len(jaeger_df.index)}, selected by jaeger: {len(jaeger_selection)}, selected by checkv: {len(checkv_selection)}, joint selection: {len(joint_selection)}")

    # define a list of records to keep and fill it by comparing the contig id of each record to the joint set of selected contigs
    out_records = []
    with open(assembly_path) as handle:
        for record in SeqIO.parse(handle, "fasta"):
            if record.id in joint_selection:
                out_records.append(record)

    # write the selected records into a new file
    with open(out_fasta, "w") as fout:
        SeqIO.write(out_records, fout, "fasta")

if __name__ == "__main__":
    main()
sbatch script for submitting the python script
#!/bin/bash
#SBATCH --tasks=1
#SBATCH --cpus-per-task=2
#SBATCH --partition=short
#SBATCH --mem=1G
#SBATCH --time=00:30:00
#SBATCH --job-name=filter_contigs
#SBATCH --output=./1.3_virus_identification/40_results_filter_contigs/filter_contigs.slurm.%j.out
#SBATCH --error=./1.3_virus_identification/40_results_filter_contigs/filter_contigs.slurm.%j.err

# activate the python virtual environment with the packages we need
source ./py3env/bin/activate

# in this sbatch script, it's not necessary to create the directory,
# we already told sbatch to create it for the log files.
# mkdir -p 40_results_filter_contigs

# In this solution, our script takes the assembly and the files
# 'assembly_default_jaeger.tsv' from jaeger and
# 'quality_summary.tsv' from CheckV as an input. Set them as variables
# for readability
cross_assembly='./1.2_assembly/10_results_assembly_flye/cross_assembly/assembly.fasta'
jaegerresults='./1.3_virus_identification/10_jaeger/results_jaeger/assembly/assembly_default_jaeger.tsv'
checkvresults='./1.3_virus_identification/20_results_assessment_checkv/cross_assembly/quality_summary.tsv'
outdir='./1.3_virus_identification/40_results_filter_contigs'

# run our script for filtering contigs based on the output of jaeger
# and CheckV as well as the output path.
python ./python_scripts/identify/40_filter_contigs.py $cross_assembly $jaegerresults $checkvresults $outdir/assembly.fasta

# deactivate the environment
deactivate
Compare the results
CheckV and Jaeger follow different approaches, and we do not expect their outputs to match perfectly. Estimate how different their predictions of the contigs' viralness are.
- At the selected cutoffs, how many contigs are chosen by Jaeger, and how many by CheckV?
- How many Jaeger hits were annotated by CheckV as high-quality?
To find how many Jaeger hits were annotated by CheckV as high quality:
cut -f1 assembly_default_phages_jaeger.tsv > phage_contigs_jaeger.list
grep -w -Ff phage_contigs_jaeger.list 20_results_assessment_checkv/cross_assembly/quality_summary.tsv | grep "High-" | wc -l
Key Points
Filtering contigs by completeness and contamination is crucial to obtain an informative dataset
Tools like Jaeger classify your contigs, enabling you to understand your samples
No wet-lab or dry-lab technique is perfect. Filtering non-viral contigs from your data improves its quality, helping you obtain better results
Visualizing distributions
Overview
Teaching: 0 min
Exercises: 90 min
Objectives
plot pairwise groups of measurements and determine if they are different
This section is suggested as homework.
Differences between assemblies
The metagenome assembly is located at /work/groups/VEO/shared_data/veo_students/metagenome_XJ/bacterial_assembly_q15.fasta. You will need to copy it to your own home directory.
This assembly contains much longer contigs than the virome assembly, and also generally more of them. Use Matplotlib to visualize the distribution of the contig lengths. Use at least two plot types (e.g. histogram, violin plot, or boxplot) and adjust the plots to optimally visualize the differences in the contig length distributions. The bacterial assembly contains some very long fragments (outliers); try to find a way to deal with those (one option is sketched after the example code below).
First, you have to include Matplotlib into your virtual environment:
# activate the environment
$ source path/to/your/py3env/bin/activate
# install the packages we need
$ pip install matplotlib
Then you can write a plotting script. If you want to put multiple plots into the same figure, you can use the following lines of code:
import matplotlib.pyplot as plt
# create a figure with two panels horizontally next to each other
fig, axs = plt.subplots(nrows=1, ncols=2, figsize=(9, 4))
# plot into the first (left) panel
axs[0].violinplot(list_of_datasets_to_plot)
# plot into the second (right) panel
axs[1].boxplot(list_of_datasets_to_plot)
# save the figure to a file
fig.savefig("filename.png")
python script for plotting differences in distributions
# This script takes 3 arguments. Run it:
# python your_script_file_name.py path/to/assembly1.fasta path/to/assembly2.fasta box_and_violin.png
import os, sys
from Bio import SeqIO
import matplotlib.pyplot as plt

def main():
    # take the first argument to the script as the filename of the viral (phage) assembly
    assembly1_filename = os.path.abspath(sys.argv[1])
    assert assembly1_filename.endswith(".fasta")

    # take the second argument to the script as the filename of the bacterial assembly
    assembly2_filename = os.path.abspath(sys.argv[2])
    assert assembly2_filename.endswith(".fasta")

    # set the filename of the output PNG file
    out_filename = os.path.abspath(sys.argv[3])
    assert out_filename.endswith(".png")

    # open the assembly files and get the lengths of each contig within one list
    with open(assembly1_filename) as handle:
        lengths_assembly1 = [len(record.seq) for record in SeqIO.parse(handle, "fasta")]
    with open(assembly2_filename) as handle:
        lengths_assembly2 = [len(record.seq) for record in SeqIO.parse(handle, "fasta")]

    # use pyplot to create a multi-panel plot with 2 columns and 1 row. figsize is in inches...
    fig, axs = plt.subplots(nrows=1, ncols=2, figsize=(9, 4))

    # make violin plots for both lists of contig lengths
    axs[0].violinplot([lengths_assembly1, lengths_assembly2], showextrema=False)

    # add axis description and title, limit the y range for comparability
    axs[0].set_xticks([1,2], ["phages", "bacteria"])
    axs[0].set_ylim([0,100000])
    axs[0].set_title('Violin plot')

    # make box plots for both lists of contig lengths and add labels
    axs[1].boxplot([lengths_assembly1, lengths_assembly2], showfliers=False)
    axs[1].set_xticks([1,2], ["phages", "bacteria"])
    axs[1].set_title('Box plot')

    fig.savefig(out_filename, dpi=200)

if __name__ == "__main__":
    main()
Plot the difference between two assemblies
Plot the distributions of the contig lengths of both the bacterial and the viral assemblies using Matplotlib. Choose at least 2 of the following:
- histogram
- boxplot
- violin plot
What are the strengths and weaknesses of your chosen visualizations? How do they compare in highlighting the differences in the length distributions of the two assemblies?
Key Points
matplotlib and pyplot provide multiple tools for the visualization of data points
Gene Calling and Functional Annotation I
Overview
Teaching: 120 min
Exercises: 60 min
Objectives
Understand what gene calling and functional annotation are
Investigate the features of phages and how they impact ORF annotation
Understand what information functional annotation can provide
Gene Finding
Once we have an assembled genome sequence (or even a contig that represents a fragment of a genome), we want to know what kind of organism the sequence belongs to (taxonomic annotation), what genes it encodes (gene annotation or gene calling), and the functions of those genes (functional annotation). Ideally, gene calling uses a good gene model that is tailored to the organism of study, but to determine the organism we need to know its genes. So it can be an iterative process. In our case, there are pretty good standard gene models available for bacteria and for phages, which we will use.
Phage genomes have specific features that are important to keep in mind. The following lecture by Dr. Robert Edwards explains these, and how they were taken into account when developing a specialized phage gene caller Phanotate.
- Click on the image to see the lecture video “Phage Genomics” by Dr. Rob Edwards (29 minutes):
The video below by Dr Evelien Adriaenssens explains more about phage genome structure and functional annotation
- Click on the image to see the lecture video “Basics of phage genome annotation & classification: getting started” by Dr Evelien Adriaenssens (68 minutes, watch from 12:02 to 30:45):
Exercise
Watch the video lectures for gene finding and functional annotation to answer the questions below:
- What is genome annotation?
- What are ORFs and why are they important in annotating genes?
- Explain the Phanotate algorithm.
- Name at least 3 features of phages that should be considered when developing a phage gene annotation tool (for intuition, a naive ORF-scanning sketch follows below)
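For intuition, here is a deliberately naive python sketch of ORF scanning: it reports ATG-to-stop stretches in the three forward reading frames of a sequence. Real gene callers such as Phanotate and Prodigal are far more sophisticated (they score candidate ORFs and handle both strands, overlaps, and alternative start codons):
# naive ORF scan on the forward strand only (illustration, not a real gene caller)
from Bio.Seq import Seq

STOPS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_len=90):
    """Return (start, end, frame) tuples of simple ATG-to-stop ORFs."""
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = str(seq[i:i + 3])
            if codon == "ATG" and start is None:
                start = i  # first ATG opens a candidate ORF
            elif codon in STOPS and start is not None:
                if i + 3 - start >= min_len:
                    orfs.append((start, i + 3, frame))
                start = None  # stop codon closes the candidate
    return orfs

print(find_orfs(Seq("AAATGGCTGCTAAGGCTGCTGCTTAAGG"), min_len=9))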
Additional Resources
The links for Prodigal and Phanotate might also be useful for answering the questions.
As additional material, the video by Dr. Katelyn McNair explains the algorithm of Phanotate. The topic is similar to the video by Dr. Robert Edwards, but it goes into more detail about the algorithm.
Functional Annotation
Once you have ORFs for your phage genomes, you might be curious about which genes they represent. Annotating genes on a genome is called functional annotation, and it gives insights into the phage lifestyle (lytic or lysogenic), taxonomy, host interactions and much more. Therefore, this is probably one of the most biologically informative and interesting steps of the viromics pipeline.
To annotate the phage genomes we will use a tool called Pharokka. Pharokka uses Phanotate for the gene calling.
Exercise
The pharokka paper.
Please read sections 1 to 2.4 of the pharokka paper to answer the following questions:
- What genomics features are taken into account in the pharokka algorithm?
- How does pharokka assign functional annotations to ORFs using PHROGs?
- What is an HMM?
Additional Resources
The PHROG paper and the PHROGs database might also be useful for you
A caveat to pharokka
As pharokka uses PHROGs to annotate viral genomes, it is limited in the types of auxiliary metabolic genes (AMGs) it can assign. This limitation is set by which genes were previously classified as AMGs and therefore have a PHROG; de novo assignment of AMGs is not possible. AMGs might therefore be better assigned using a bacterial or a combined viral+bacterial functional annotation tool like DRAM.
Key Points
Genome annotation gives meaning to genomic sequences
ORFs can be predicted from start and stop codons in the genomic sequences
Phages have different genomic features than prokaryotes, which influences the design of algorithms
Tools like Phanotate are very useful for processing a large number of contigs. However, no tool is perfect, so a critical interpretation of the results is important
Functional annotation of viral genomes can give clues about viral lifestyle and host interactions
Gene Calling and Functional Annotation II
Overview
Teaching: min
Exercises: 180 min
Objectives
Perform gene calling and functional annotation on the viral contigs
Running pharokka
For the functional annotation of our viral contigs, we will be using Pharokka. The "viral contigs" are the contigs we filtered yesterday (i.e. using Jaeger and CheckV).
We first have to load the appropriate conda environment
source /vast/groups/VEO/tools/anaconda3/etc/profile.d/conda.sh
conda activate pharokka_v1.7.1
The base usage for this program follows:
pharokka.py -i <fasta file> -o <output folder> -d <path/to/database_dir> -t <threads>
See the docs for the pharokka parameters
We have to add the following options:
- the pharokka database is here: /work/groups/VEO/databases/pharokka/v1.7.1/
- to run on viral contigs from a metavirome: -m
- to also use PyHMMER: --meta_hmm
- to use Phanotate for the gene prediction: -g phanotate
Therefore your command should look like this:
pharokka.py -i filtered_assembly.fasta -o output_dir -d path/to/database -t 10 -m --meta_hmm -f -g phanotate
We recommend modifying the following SBATCH resources:
- #SBATCH --cpus-per-task=10
- #SBATCH --partition=short,standard,interactive
- #SBATCH --mem=10G
Exercise
- Run pharokka with the appropriate parameters using an sbatch script
sbatch script for pharokka
#!/bin/bash
#SBATCH --tasks=1
#SBATCH --cpus-per-task=10
#SBATCH --partition=short,standard,interactive
#SBATCH --mem=20G
#SBATCH --time=02:00:00
#SBATCH --job-name=pharokka
#SBATCH --output=./1.4_annotation/10_pharokka/pharokka.slurm.out.%j
#SBATCH --error=./1.4_annotation/10_pharokka/pharokka.slurm.err.%j

source /vast/groups/VEO/tools/anaconda3/etc/profile.d/conda.sh && conda activate pharokka_v1.7.1

database='/work/groups/VEO/databases/pharokka/v1.7.1/'
assembly='./1.3_virus_identification/40_results_filter_contigs/assembly.fasta'
outdir='./1.4_annotation/10_pharokka'

# extra options to use
# -m: run in metavirome mode (each contig is a new virus)
# --meta_hmm : also run PyHMMER on the PHROGS (richer annotations with this)
# -f: force restart the outputs
# -g: gene annotation by phanotate (default for meta mode is prodigal-gv)
pharokka.py -i $assembly -o $outdir -d $database -t 10 -m --meta_hmm -f -g phanotate
Plotting functional annotations
Choose a pet contig (or two)
Exercise - extracting contig information
- Choose a pet contig. Maybe your favourite contig number… but we recommend choosing one with high completeness and which is 'annotation-rich'
- perhaps with many annotated genes, a CRISPR region, mash results (from pharokka), etc.
- Discuss with your classmates so that everybody picks a different virus. We will continue using this "pet" contig in the future lessons. a) Write down some information about your contig, e.g. contig length, # of genes, # of annotated genes, GC%, coding density
For the plotting, we will still use pharokka. Pharokka's built-in pharokka_plotter.py uses a python library called pyCirclize. You will also use this library later for the visualization homework, so stay tuned!
pharokka_plotter.py requires a gff, genbank and fasta file for the viral contig. Therefore, we have to extract these from the larger pharokka.gff and pharokka.gbk outputs.
The .gff file contains information about the gene positions and strands
The .gbk file contains information about the functional annotations of each gene
The fasta file contains the nucleotide sequence
Exercise - extracting contig information
We need 3 pieces of information about your particular contig to use the pharokka plotter
- contig.gff : extracted from the pharokka.gff
- contig.gbk : extracted from the pharokka.gbk
- contig.fasta : nucl. sequence extracted from the filtered assembly
- Extract the functional annotation information for your contig from the pharokka.gff and pharokka.gbk files, and the contig nucleotide sequence from the fasta file
Gff and fasta files are easy to pull information from, so we can just grep or sed the contig directly into an output file.
Extract contig.gff
grep "contig_xxxx" pharokka.gff > contig_xxxx.gff # for example mkdir -p ./1.4_annotation/10_pharokka/11_selected_contig # get the contig.gff grep "contig_827" ./1.4_annotation/10_pharokka/pharokka.gff > ./1.4_annotation/10_pharokka/11_selected_contig/contig_827.gff
Extract contig fasta
# if your fasta file is in multi-line format
# this tells sed to print between "contig_1026" and the next ">", including both patterns
# `head` removes the last line (the second header)
sed -n '/contig_1026/,/>/p' phage_contigs.fasta | head -n -1

# for example:
sed -n '/contig_827/,/>/p' ./1.2_assembly/10_results_assembly_flye/cross_assembly/assembly.fasta | head -n -1 > ./1.4_annotation/10_pharokka/11_selected_contig/contig_827.fasta
Genbank files are a bit more difficult to extract information from, but they are formatted sequentially by records. Luckily for us, there are pre-built modules in the biopython library to work with genbank files: SeqRecord and Bio.SeqIO.parse.
If you head the pharokka.gbk file, you can see how the records are structured:
LOCUS contig_1026 35733 bp DNA linear PHG 06-SEP-2024
DEFINITION contig_1026.
ACCESSION contig_1026
VERSION contig_1026
KEYWORDS .
SOURCE .
ORGANISM .
.
FEATURES Location/Qualifiers
CDS complement(618..749)
/ID="contig_1026_CDS_0001"
/transl_table=11
/phrog="No_PHROG"
/top_hit="No_PHROG"
/locus_tag="contig_1026_CDS_0001"
/function="unknown function"
/product="hypothetical protein"
/source="Pyrodigal-gv_0.3.1"
/score="3.3"
/phase="0"
/translation="MTRWNTTLAVYEIWTGTQWQAVASSTYSINYLIVAGGASGAHY*"
See if you can write a script to extract the genbank record for your contig using this module. For clues, see the solution below!
The whole contig is called a “record” - see the biopython SeqRecord class page for what each element in the record is called, use Bio.SeqIO.parse to iterate through these records and Bio.SeqIO.write to write out the record you want.
biopython script to extract contig.gbk (genbank)
#!/usr/bin/python3
from Bio import SeqIO
import sys

# read files in as variables
input_gbk = sys.argv[1]
output_gbk = sys.argv[2]
contig = sys.argv[3]

# parse the genbank and write output
for record in SeqIO.parse(input_gbk, "genbank"):
    # print(record.id)
    if record.id == contig:
        SeqIO.write(record, output_gbk, "genbank")
sbatch script to run your python script
#!/bin/bash
#SBATCH --tasks=1
#SBATCH --cpus-per-task=1
#SBATCH --partition=short,standard,interactive
#SBATCH --mem=2G
#SBATCH --time=00:01:00
#SBATCH --job-name=py_script
#SBATCH --output=./1.4_annotation/10_pharokka/extract_contigs.slurm.out.%j
#SBATCH --error=./1.4_annotation/10_pharokka/extract_contigs.slurm.err.%j

# activate the python virtual environment with the packages we need
source ./py3env/bin/activate

# change the output and contig names accordingly
python3 python_scripts/annotation/extract_contig_gbk.py pharokka.gbk output_contig.gbk contig_xxxx
Plot your contigs!
Once we have our fasta, gff and genbank files, we can use pharokka_plotter.py to plot our pet contig. pharokka_plotter.py requires the same conda environment as pharokka, so to activate the environment, use the same commands you used for pharokka. Feel free to experiment with the different options given in the manual. The base usage is:
pharokka_plotter.py -i input.fasta -n pharokka_plot --gff pharokka.gff --genbank pharokka.gbk
You no longer need the pharokka database, and we recommend setting the SBATCH parameter #SBATCH --mem=10G.
Exercise - plotting
- Plot the genome annotations for your pet contig
Make sure to add your plot into your lab book!
sbatch script for all the extraction steps + pharokka_plotter
#!/bin/bash
#SBATCH --tasks=1
#SBATCH --cpus-per-task=10
#SBATCH --partition=short,standard,interactive
#SBATCH --mem=10G
#SBATCH --time=2:00:00
#SBATCH --job-name=pharokka_1
#SBATCH --output=./1.4_annotation/10_pharokka/pharokka_plotter.slurm.out.%j
#SBATCH --error=./1.4_annotation/10_pharokka/pharokka_plotter.slurm.err.%j

assembly='./1.2_assembly/10_results_assembly_flye/cross_assembly/assembly.fasta'
outdir='./1.4_annotation/10_pharokka/11_selected_contig'
mkdir -p $outdir

# get the contig.gff and contig.fasta
grep "contig_827" ./1.4_annotation/10_pharokka/pharokka.gff > $outdir/contig_827.gff
sed -n '/contig_827/,/>/p' $assembly | head -n -1 > $outdir/contig_827.fasta

# next, we get the contig.gbk using a python script
# change the names accordingly

# activate the python virtual environment with the packages we need
source ./py3env/bin/activate

# the inputs for the following python script should be
# python3 script.py input.gbk output.gbk contig_name
input_gbk='./1.4_annotation/10_pharokka/pharokka.gbk'
output_gbk='./1.4_annotation/10_pharokka/11_selected_contig/contig_827.gbk'
name='contig_827'

python3 python_scripts/annotation/extract_contig_gbk.py $input_gbk $output_gbk $name

# Then we run pharokka plotter on our selected contig
source /vast/groups/VEO/tools/anaconda3/etc/profile.d/conda.sh && conda activate pharokka_v1.7.1

# define all your file paths for readability
outdir='./1.4_annotation/10_pharokka/11_selected_contig/'
fasta='./1.4_annotation/10_pharokka/11_selected_contig/contig_827.fasta'
gff='./1.4_annotation/10_pharokka/11_selected_contig/contig_827.gff'
gbk='./1.4_annotation/10_pharokka/11_selected_contig/contig_827.gbk'

pharokka_plotter.py -i $fasta -n $outdir/pharokka_plot --gff $gff --genbank $gbk
The final plot might look something like this:
Interpreting your contig
With functional annotation, we can begin to get insight into our viruses and piece together information about their lifestyles, host interactions, and the types of genes they carry. You'll find that most genes on viral contigs are not actually annotated. This is because 1) viruses evolve very quickly, so assigning gene annotations by homology is difficult, and 2) viruses are highly mosaic: they pick up and lose genes from their hosts and other viruses, meaning not all viruses share the same genes, and viral gene databases do not always account for this variability. Newer tools like phold help with this by using structural homologs to assign gene annotations.
Note: We have phold installed on Draco, if you would like to experiment with it for your project! For example, you might pick a contig that is NOT well annotated and see how many extra genes you can annotate using structural homology.
Exercise - Interpret your contigs
- How many structural genes are annotated?
- Describe at least 2 different types of structural genes a) Where are they placed in your genome? b) To which part of the phage structure do they belong? (capsid, tail, etc.) c) What other viruses have these structural features?
- How many replication genes are annotated?
- These could be genes related to transcriptional regulation, DNA/RNA metabolism, lysis, etc.
- Describe the role of 1 replication gene a) What is the function of the gene? b) In which part of the replication cycle is it used? c) How does it interact with host genes (if it does)?
- Describe 1 or more interesting genomic features
- CRISPR arrays, tRNAs, virulence factors, etc.
- To the best of your ability, infer how this virus might be interacting with its host.
- You might investigate clues about temperate/lytic lifestyles, auxiliary metabolic genes, anti-CRISPRs, etc.
Key Points
Functional annotation of viral genomes can give clues about viral lifestyle and host interactions
Visualizing an annotated viral genome
Overview
Teaching: min
Exercises: 90 min
Objectives
Use Python to draw a map of a viral genome
Visualize gene overlap and direction
Next, we will visualize the differences in the ORFs called by the two tools Phanotate and prodigal-gv. Phanotate is used by Pharokka, which we ran today to find genes and annotate them with functions. Yesterday we used CheckV to estimate the completeness of viral contigs; this tool is based on gene annotations found by prodigal-gv. To illustrate the difference between the two methods, we will plot the sets of genes used by both tools, similar to the plots you generated today using Pharokka, which are based on the python package pyCirclize. The pyCirclize package provides tools to generate high-quality plots of annotated genomes in a circular layout and allows for adding various data elements. Check the GitHub page to get an idea of its possibilities. To use the package, you have to install it into your virtual environment:
# activate the environment
$ source path/to/your/py3env/bin/activate
# install the packages we need
$ pip install pyCirclize
You can get some examples of how to use the package here.
Starting from the first example, the Circos plot of the Enterobacter phage, adjust it to show both Phanotate and CheckV ORFs on your pet phage genome. We want to compare the differences in ORF calling. The information about the ORFs used by CheckV can be found in the tmp folder in your CheckV results (gene_features.tsv).
Exercise - plot two sets of gene predictions
Optional: Think about a way to convert the information in the TSV file into something parsable by pyCirclize and write a script for plotting the two sets of ORFs.
You can use the solution for the optional task above to create a .gff file from the gene predictions contained in the gene_features.tsv file of CheckV. If you run the script(s) on Draco, please remember to use slurm. The plotting script should not need much memory and can run on a single core.
Python script for converting tabular annotations into the GFF format
# This script expects three parameters to be passed to it: the path to CheckV's gene_features.tsv,
# the filename of the output .gff file, and the name of the contig to be exported to gff.
import pandas as pd
import gffutils.gffwriter as gw
import gffutils.feature as gf
import os, sys

def main():
    # read the parameters passed to the script
    in_file = os.path.abspath(sys.argv[1])
    assert in_file.endswith(".tsv")
    out_file = os.path.abspath(sys.argv[2])
    writer = gw.GFFWriter(out_file)
    contig_id = sys.argv[3]

    # read the .tsv file into a pandas dataframe (sep="\t" reads tab-separated files)
    df = pd.read_csv(in_file, sep="\t")

    # select the ORFs with the specified contig name
    df = df.loc[df["contig_id"] == contig_id]

    # iterate through the ORFs of the selected contig
    for index, row in df.iterrows():
        # convert the strand integer into GFF-compatible "+" and "-"
        if row["strand"] == 1:
            strand = "+"
        else:
            strand = "-"
        # use gffutils to create a GFF feature from the information in the dataframe and append it to the output file
        f = gf.Feature(seqid=row["contig_id"], featuretype="CDS", start=row["start"], end=row["end"], strand=strand)
        writer.write_rec(f)

    # close the writer to flush the writing buffer and properly finish the file
    writer.close()

if __name__ == "__main__":
    main()
python script for plotting ORFs with pyCirclize
# This script expects four parameters passed to it: 1. the filename of the .gff file you generated with the above script,
# 2. the gff created by phanotate during today's practical session, 3. the filename of the output .png file and 4. the name
# of the contig you want to plot
from pycirclize import Circos
from pycirclize.parser import Gff
import os, sys

def main():
    # read the parameters passed to the script (see first comment for details)
    prodigal_filename = os.path.abspath(sys.argv[1])
    assert prodigal_filename.endswith(".gff")
    phanotate_filename = os.path.abspath(sys.argv[2])
    assert phanotate_filename.endswith(".gff")
    out_filename = os.path.abspath(sys.argv[3])
    assert out_filename.endswith(".png")
    contig_id = sys.argv[4]

    # use pycirclize's GFF parser to read the two input GFF files
    prodigal_gff = Gff(prodigal_filename)
    phanotate_gff = Gff(phanotate_filename)

    # Initialize circos sector by genome size
    circos = Circos(sectors={phanotate_gff.name: phanotate_gff.range_size})
    circos.text(contig_id, size=15)
    sector = circos.sectors[0]

    # Outer track: add size marks for the genome's length
    outer_track = sector.add_track((98, 100))
    outer_track.axis(fc="lightgrey")
    outer_track.xticks_by_interval(5000, label_formatter=lambda v: f"{v / 1000:.0f} Kb")
    outer_track.xticks_by_interval(1000, tick_length=1, show_label=False)

    # Plot forward & reverse CDS genomic features and color them differently for phanotate's ORF predictions
    cds_track = sector.add_track((90, 95))
    cds_track.genomic_features(phanotate_gff.extract_features("CDS", target_strand=1), plotstyle="arrow", fc="salmon")
    cds_track.genomic_features(phanotate_gff.extract_features("CDS", target_strand=-1), plotstyle="arrow", fc="skyblue")

    # Plot forward & reverse CDS genomic features and color them differently for prodigal's ORF predictions
    cds_track = sector.add_track((82, 87))
    cds_track.genomic_features(prodigal_gff.extract_features("CDS", target_strand=1), plotstyle="arrow", fc="salmon")
    cds_track.genomic_features(prodigal_gff.extract_features("CDS", target_strand=-1), plotstyle="arrow", fc="skyblue")

    # save plot to file
    circos.savefig(out_filename)

if __name__ == "__main__":
    main()
sbatch script for submitting the plotting script
#!/bin/bash
#SBATCH --tasks=1
#SBATCH --cpus-per-task=1
#SBATCH --partition=short
#SBATCH --mem=2G
#SBATCH --time=00:01:00
#SBATCH --job-name=plot
#SBATCH --output=./1.4_annotation/20_plot_viral_genome/viral_genome.slurm.%j.out
#SBATCH --error=./1.4_annotation/20_plot_viral_genome/viral_genome.slurm.%j.err

# activate the python virtual environment with the packages we need for plotting
source ./py3env/bin/activate

mkdir -p ./1.4_annotation/20_plot_viral_genome/

# define an array with contig ids to plot (add your own pet contigs here)
declare -a arr=("contig_827" "contig_827")

features='./1.3_virus_identification/20_results_assessment_checkv/cross_assembly/tmp/gene_features.tsv'
outdir='./1.4_annotation/20_plot_viral_genome'
pharokka_dir='./1.4_annotation/10_pharokka/11_selected_contig'

# loop through the array
for contig in "${arr[@]}"
do
    echo "$contig"

    # run the python script which extracts the prodigal-gv-based annotations from the CheckV output
    # in this solution, the script expects three parameters:
    # first: the filename of the table generated by CheckV
    # second: the output filename for the filtered gff file
    # third: the contig id to filter for
    python python_scripts/annotation/10_table_to_gff.py $features $outdir/$contig.gff $contig

    # run the python script which plots the two annotations in a circle
    # in this solution, the script expects four parameters:
    # first: the gff file of the prodigal-gv-based annotation created from CheckV's gene_features.tsv
    # second: the gff file created by phanotate
    # third: the output file name
    # fourth: the contig id to write into the plot
    python python_scripts/annotation/10_viral_genome.py $outdir/$contig.gff $pharokka_dir/$contig.gff $outdir/"$contig"_circos.png $contig
done

deactivate
Compare the annotations
Describe the differences between the two gene annotations based on the plots you created. Does the visualization convey these differences well?
Key Points
The pyCirclize package provides tools for plotting genomic data, e.g., the set of genes, in a circular layout.
Host Prediction I
Overview
Teaching: 60 min
Exercises: 120 min
Objectives
Understand how biological information is used to predict hosts
Understand the difficulties with host prediction
Learn about the new techniques that are being used for host prediction
Host prediction lecture
Listen to the lecture by Varada and Malte (pay attention!!!): slides for host prediction
Questions
- What are some biological interactions viruses have with their hosts? (hint: start with the infection cycle)
Pick one classical host prediction method:
- In brief, explain how this works
- What is the major challenge with this method?
- How can you be confident in your host prediction?
- Evaluate the following scenarios: a) Two viral contigs match to the same host during the host prediction b) One viral contig matches 2 hosts: s__Salinibacter ruber and s__Longimonas halophila
For the hands-on part, we will be using a tool called RaFAH. This tool uses a random forest machine learning model to predict hosts for phages. The authors trained the model on the protein content of viral sequences and compared its predictions with manually curated classical host predictions (CRISPR sequences, tRNAs and homology-based matches) and with other related tools. Have a look into the RaFAH paper and focus on the introduction and Figures 1 and 2. (A toy random forest example is sketched at the end of this section, after the additional resources.)
Questions
- Briefly describe how RaFAH derives its predictions from a genome sequence.
RaFAH produces a probability score for each host genus in its training set and reports the genus with the maximum probability. In Figure 2, you can find how this score relates to precision and recall for the test dataset the authors used.
- Decide which probability score you would use as a cutoff for predictions you would trust. Explain your decision.
Additional resources
- IBM link on Random Forest (RF): introduces RF and decision trees in a short and simple way
- Chapter on RF for Bioinformatics: explains how RF measures feature importance and describes Bioinformatic applications
- Chapter on Decision Trees and RF: describes decision trees and RF by direct application in python. Also, if you want a great introduction to machine learning in general and python programming, take a look at the whole book by Jake VanderPlas, which is freely available online
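To make the random forest idea concrete, here is a toy sketch using scikit-learn. It is not RaFAH: the protein-cluster features and host labels are invented, but the structure (protein presence/absence in, one probability per host genus out) mirrors the approach described in the paper:
from sklearn.ensemble import RandomForestClassifier

# rows: phages; columns: presence/absence of 4 hypothetical protein clusters
X_train = [
    [1, 0, 1, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 1, 0, 1],
    [1, 0, 0, 0],
    [0, 1, 1, 1],
]
# host genus of each training phage (invented labels)
y_train = ["Vibrio", "Vibrio", "Salinibacter", "Salinibacter", "Vibrio", "Salinibacter"]

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# a new phage: the model reports a probability per host genus,
# analogous to RaFAH's per-genus probability scores
for genus, p in zip(rf.classes_, rf.predict_proba([[1, 0, 1, 1]])[0]):
    print(genus, round(p, 2))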
Key Points
Host Prediction II
Overview
Teaching: 0 min
Exercises: 240 min
Objectives
Run RaFAH to predict a host genus for each contig.
Run BLASTn to find homologies between metagenome and metavirome.
Host prediction
Today, we will use the tool RaFAH to predict a genus of a putative host for our contigs. The tool predicts proteins in each viral sequence and then assigns them to a set of orthologous groups. The annotated proteins are then used to predict a host genus with a pretrained random forest model. This model matches the set of proteins with host proteins it was trained on.
After running RaFAH, we will compare the resulting prediction with the bacterial metagenome you briefly analyzed in one of last week's homework assignments. The bacterial metagenome was taxonomically annotated using GTDB. We will get into the taxonomic classification of viruses tomorrow; for today, it is important to know that there are different methods of taxonomic classification. RaFAH uses the NCBI taxonomy, so we will have to translate the output of RaFAH into the GTDB taxonomy to be able to compare the predictions with our bacterial metagenome.
Next, we will use BLAST as an alternative way of linking viruses to their hosts. With BLAST, we will align our viral contigs to putative bacterial and archaeal genomes (metagenome-assembled genomes, MAGs) to find sequence matches between the viruses and potential hosts.
Finally, we will combine the results from RaFAH and BLAST to compare the two methods and see how well they agree.
RaFAH
Exercise - Use RaFAH to predict hosts for our contigs
RaFAH requires a single file for each contig to run. You first have to write a python script which separates the combined assembly into single files. You can use the biopython package installed in your virtual environment for this:
import os, sys
from Bio import SeqIO

# define file paths from the arguments
...

# loop through the records in the combined assembly
with open(assembly_path) as handle:
    for record in SeqIO.parse(handle, "fasta"):
        # set a filename per record
        out_fasta = os.path.join(out_dir, f"{record.id}.fasta")
        # write the record to the file
        with open(out_fasta, "w") as fout:
            SeqIO.write([record], fout, "fasta")
Next, you will write a python script, which translates the host genus predictions from the NCBI taxonomy into the GTDB taxonomy. You can copy the translation table from this location on draco:
$ cp /work/groups/VEO/shared_data/Viromics2024Workspace/2.1_host_prediction/ncbi_to_gtdb.csv ./2.1_host_prediction/10_rafah/ncbi_to_gtdb.csv
There are hundreds of thousands of taxa in both taxonomy schemes, so we need to efficiently map one onto the other. Here are some hints:
import os, sys
import pandas as pd

# Extract the genus part of the NCBI taxonomy.
# This should correspond to RaFAH's output
def get_ncbi_genus(s):
    for ss in s.split(";"):
        if ss.startswith("g__"):
            return ss[3:]
    return "none"

# Extract the complete taxonomic classification up to genus from GTDB
def get_gtdb_upto_genus(s):
    return s.split(";s__")[0]

# set or get filenames from sys.argv and read tables with pandas
...

# create a dictionary for efficient mapping between the two taxonomies
translation_dict = {get_ncbi_genus(r["ncbi_taxonomy"]): get_gtdb_upto_genus(r["gtdb_taxonomy"]) for i, r in translate_df.iterrows()}

# use the dictionary to translate the predictions in the RaFAH output table
...
You can run the scripts in the same sbatch script as RaFAH. Remember to source and deactivate our python virtual environment accordingly. Here you can find a description of the parameters you can pass to RaFAH (the page is a bit hard to read). The tool is programmed in perl and you can find it here on draco:
# activate the conda environment with the dependencies RaFAH requires
source /vast/groups/VEO/tools/miniconda3_2024/etc/profile.d/conda.sh && conda activate perl_v5.32.1

# set a variable to call the RaFAH script
rafah='/home/groups/VEO/tools/rafah/RaFAH.pl'

# create an output folder or use the one you set for the slurm logs:
mkdir 10_results_hostprediction_rafah

# run RaFAH with the appropriate file paths and arguments
perl "$rafah" --predict --genomes_dir 10_results_hostprediction_rafah/split_contigs --extension .fasta
RaFAH is computationally expensive and can use multiple threads. We recommend you set the following sbatch parameters:
- #SBATCH --cpus-per-task=32
- #SBATCH --partition=standard,short
- #SBATCH --mem=50G
RaFAH uses the random forest model to compute a probability for all host genera included in its training. You can find these probabilities in the output file *Host_Predictions.tsv. The file *Seq_Info_Prediction.tsv contains, per contig, the genus with the highest probability.
- How many contigs have a probability score larger than your chosen cutoff?
- Does the probability score relate to the completeness of the contigs? (qualitative answer)
python script for splitting the assembly into separate files
import os, sys
from Bio import SeqIO

def main():
    # define file paths from the arguments
    assembly_path = os.path.abspath(sys.argv[1])
    assert assembly_path.endswith(".fasta")

    # set an output directory and create it if it does not exist
    out_dir = os.path.abspath(sys.argv[2])
    if not os.path.exists(out_dir):
        os.makedirs(out_dir)

    # loop through the records in the combined assembly
    with open(assembly_path) as handle:
        for record in SeqIO.parse(handle, "fasta"):
            # set a filename per record
            out_fasta = os.path.join(out_dir, f"{record.id}.fasta")
            # write the record to the file
            with open(out_fasta, "w") as fout:
                SeqIO.write([record], fout, "fasta")

if __name__ == "__main__":
    main()
python script for translating NCBI genus IDs to GTDB
import os, sys
import pandas as pd

# Extract the genus part of the NCBI taxonomy.
# This should correspond to RaFAH's output
def get_ncbi_genus(s):
    for ss in s.split(";"):
        if ss.startswith("g__"):
            return ss[3:]
    return "none"

# Extract the complete taxonomic classification up to genus from GTDB
def get_gtdb_upto_genus(s):
    return s.split(";s__")[0]

def main():
    # set the filename of the rafah output file containing the best prediction per contig
    rafah_filename = os.path.abspath(sys.argv[1])
    assert rafah_filename.endswith(".tsv")

    # set the filename of the translation table with NCBI and GTDB taxonomic IDs
    translation_filename = os.path.abspath(sys.argv[2])
    assert translation_filename.endswith(".csv")

    # set the output filename
    out_filename = os.path.abspath(sys.argv[3])
    assert out_filename.endswith(".csv")

    # read the input tables using pandas
    translate_df = pd.read_csv(translation_filename)
    rafah_df = pd.read_csv(rafah_filename, sep='\t')

    # we are only interested in the first 3 columns
    rafah_df = rafah_df.iloc[:, :3]

    # create a dictionary for efficient matching of NCBI to GTDB
    translation_dict = {get_ncbi_genus(r["ncbi_taxonomy"]): get_gtdb_upto_genus(r["gtdb_taxonomy"]) for i, r in translate_df.iterrows()}

    # translate all NCBI genus predictions to GTDB.
    # We don't manipulate the table while iterating over it because this could lead to problems.
    translated_hosts = []
    for i, row in rafah_df.iterrows():
        translated_hosts.append(translation_dict[row["Predicted_Host"]] if row["Predicted_Host"] in translation_dict else "no_translation")

    # add the translated predictions to the dataframe and save the table.
    rafah_df["Predicted_Host"] = translated_hosts
    rafah_df.to_csv(out_filename, index=False)

if __name__ == "__main__":
    main()
sbatch script for host prediction with RaFAH
#!/bin/bash
#SBATCH --tasks=1
#SBATCH --cpus-per-task=32
#SBATCH --partition=short,standard
#SBATCH --mem=50G
#SBATCH --time=02:00:00
#SBATCH --job-name=rafah
#SBATCH --output=./2.1_host_prediction/10_rafah/rafah.slurm.%j.out
#SBATCH --error=./2.1_host_prediction/10_rafah/rafah.slurm.%j.err

assembly='./1.3_virus_identification/40_results_filter_contigs/assembly.fasta'
contigs='./2.1_host_prediction/10_rafah/split_contigs'

# RaFAH expects each genome in a separate file. Activate our virtual environment
# and run a python script to split our filtered assembly into single files
source ./py3env/bin/activate

# The script requires the assembly path and a directory for outputting the contigs
python ./python_scripts/host_prediction/10_split_assembly_file.py $assembly $contigs

# deactivate the python environment, just to be sure not to cause problems with the conda environment RaFAH needs
deactivate

# activate the conda environment with the dependencies RaFAH requires
source /vast/groups/VEO/tools/miniconda3_2024/etc/profile.d/conda.sh && conda activate perl_v5.32.1

# set a variable to call the RaFAH script
rafah='/home/groups/VEO/tools/rafah/RaFAH.pl'

# rafah parameters (https://gensoft.pasteur.fr/docs/RaFAH/0.3/)
# --predict: run the pipeline for predicting hosts
# --genomes_dir: the directory with the separate files for each contig
# --extension: the extension of the contig files
# --file_prefix: can specify an output dir and "run name" (rafah_1) here
perl "$rafah" --predict --genomes_dir $contigs --extension .fasta --file_prefix ./2.1_host_prediction/10_rafah/rafah_1

# deactivate RaFAH's conda environment
conda deactivate

# Our python script for translating the taxonomy needs our python environment again
source ./py3env/bin/activate

# set the path to the table with the translation between NCBI and GTDB and to the RaFAH output
translation='./2.1_host_prediction/10_rafah/ncbi_to_gtdb.csv'
rafahtable='./2.1_host_prediction/10_rafah/rafah_1_Host_Predictions.tsv'

# run our script with the appropriate file names as parameters
python3 ./python_scripts/host_prediction/20_translate_taxonomy.py $rafahtable $translation ./2.1_host_prediction/10_rafah/rafah_1_Host_Predictions_gtdb.csv

deactivate
Host Prediction using Blastn
While RaFAH uses a database of reference bacterial/archaeal genomes to find hosts, these bacteria/archaea may or may not be in your samples. One of the simplest ways to look for potential hosts is to do a quick blastn. Viral genomes often contain fragments of host DNA, for example an auxiliary metabolic gene. Conversely, viral DNA can also be found in host genomes; e.g. prophages, CRISPR spacers, fragments of viral genes, tRNAs etc.
For this task, we assembled metagenome-assembled genomes (MAGs, or bins) of the prokaryotic community from the same three samples given to you. This resulted in 110 MAGs. However, not all of these are complete, and many are contaminated. After filtering for a completeness >= 50% and a contamination <= 6%, 19 "good MAGs" remained.
These "good MAGs" were concatenated and are located at /work/groups/VEO/shared_data/Viromics2024Workspace/test_run/2.1_host_prediction/20_blastn/good_bins_all.fa. These MAGs will be the "database" against which we align our viral contigs (query sequences).
You also require the GTDB taxonomic classifications for the bins. These are located at /work/groups/VEO/shared_data/Viromics2024Workspace/2.3_bacterial_binning/40_gtdbtk/results/gtdbtk.bac120.summary.tsv
Exercise - copy the MAGs and the gtdb bin classification over
Please copy this fasta file into your own directory
# make a new directory for this analysis
mkdir ./2.1_host_prediction/20_blastn/

# copy over the good MAGs
cp /work/groups/VEO/shared_data/Viromics2024Workspace/test_run/2.1_host_prediction/20_blastn/good_bins_all.fa ./2.1_host_prediction/20_blastn/

# copy over the taxonomic classification for the MAGs
cp /work/groups/VEO/shared_data/Viromics2024Workspace/2.3_bacterial_binning/40_gtdbtk/results/gtdbtk.bac120.summary.tsv ./2.1_host_prediction/20_blastn/gtdbtk.bac120.summary.tsv
Next, we will create a custom blast database using our MAGs fasta file and use blastn (nucleotide-nucleotide) to do the alignment
Exercise - makeblastdb and blastn
Create an sbatch script for the makeblastdb and blastn
- makeblastdb is located at /home/groups/VEO/tools/blast/v2.14.0/bin/makeblastdb
- blastn is located at /home/groups/VEO/tools/blast/v2.14.0/bin/blastn
- A conda environment is NOT required for this tool!
The basic usage of makeblastdb and blastn is the following.
NOTE: the database you create with makeblastdb has to be fed into blastn
# make a custom database
makeblastdb -in mags_file.fasta -out mags_file.fasta -dbtype nucl

# run blastn against the mags_file.fasta database
blastn -db mags_file.fasta -query filtered_cross_assembly.fasta -out viral_contigs_vs_bins_blastn.out -num_threads 32 -outfmt 6
Resources for blastn can vary! Large databases require a lot of computational power and can take a long time, but luckily our task is very small. We recommend the following resources:
#SBATCH --cpus-per-task=32
#SBATCH --partition=short,standard
#SBATCH --mem=5G
#SBATCH --time=01:00:00
-outfmt 6: a tabular output format. Your output contains the following 12 columns in THIS order:
query, subject, pid, aln_length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore
query: query id - this is your viral contig
subject: subject id - this is the bin and contig it matches
pid: percent identity of the match
aln_length: alignment length
mismatch: number of mismatches
gapopen: number of gap openings
qstart, qend: start and end positions of the query sequence
sstart, send: start and end positions of the subject sequence
evalue: expect value (the lower, the better)
bitscore: a score based on gaps, mismatches, percent identity and alignment length
More resources on blast output formats can be found here and in the blast manual.
sbatch script for blast tasks
#!/bin/bash #SBATCH --tasks=1 #SBATCH --cpus-per-task=32 #SBATCH --partition=short,standard #SBATCH --mem=5G #SBATCH --time=01:00:00 #SBATCH --job-name=blastn #SBATCH --output=./2.1_host_prediction/20_blastn/blastn.slurm.%j.out #SBATCH --error=./2.1_host_prediction/20_blastn/blastn.slurm.%j.err # the blast tools makeblastdb='/home/groups/VEO/tools/blast/v2.14.0/bin/makeblastdb' blastn='/home/groups/VEO/tools/blast/v2.14.0/bin/blastn' # the query (filtered cross assembly) and subject (MAGs fasta) assembly='./1.3_virus_identification/40_results_filter_contigs/assembly.fasta' bins_fasta='./2.1_host_prediction/20_blastn/good_bins_all.fa' $makeblastdb -in $bins_fasta -out $bins_fasta -dbtype nucl $blastn -db $bins_fasta -query $assembly -out ./2.1_host_prediction/20_blastn/viral_contigs_vs_bins_blastn.out -num_threads 32 -outfmt 6
Exercise - analyzing blast results
- How many total hits did you get from blastn?
- How many of these hits would you trust?
- For how many unique viral contigs were you able to get host predictions from blastn?
- Were there cases where one viral contig hit multiple MAGs?
a) Which interaction would you then trust?
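Hint - exploring the blast table with pandas
If you want a starting point for these questions, here is a minimal pandas sketch. The identity and alignment-length cutoffs are example values chosen for illustration, not official thresholds; pick your own and justify them.

import pandas as pd

# column names follow blast -outfmt 6 (see above)
cols = ['query', 'subject', 'pid', 'aln_length', 'mismatch', 'gapopen',
        'qstart', 'qend', 'sstart', 'send', 'evalue', 'bitscore']
df = pd.read_csv('./2.1_host_prediction/20_blastn/viral_contigs_vs_bins_blastn.out',
                 sep='\t', header=None, names=cols)

print(len(df))                      # total number of hits

# keep only hits with high identity and a reasonably long alignment (example cutoffs)
trusted = df[(df['pid'] >= 90) & (df['aln_length'] >= 1000)].copy()
print(len(trusted))                 # hits you might trust

print(trusted['query'].nunique())   # unique viral contigs with a host prediction

# viral contigs hitting more than one MAG (strip the contig part off the subject id)
trusted['bin'] = trusted['subject'].str.split('.fa').str[0]
hits_per_contig = trusted.groupby('query')['bin'].nunique()
print(hits_per_contig[hits_per_contig > 1])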
Combining results for blast and RaFAH
Now we can write a python script to combine the results from RaFAH and blastn to compare the host predictions from the different methods. You might use pandas dataframes and pd.join or pd.merge to write this.
You will need the following 3 files:
- the blast output: ./2.1_host_prediction/20_blastn/viral_contigs_vs_bins_blastn.out
- the taxonomic classifications for the MAGs: ./2.1_host_prediction/20_blastn/gtdbtk.bac120.summary.tsv
- the RaFAH results with the converted taxonomy: ./2.1_host_prediction/10_rafah/rafah_1_Host_Predictions_gtdb.csv
py script for merging the blastn and rafah results
import pandas as pd
import sys

# read in the file paths
blast = sys.argv[1]
taxa = sys.argv[2]
rafah = sys.argv[3]
out = sys.argv[4]

df_rafah = pd.read_csv(rafah, header=0)
print(df_rafah)

df_bins_gtdb = pd.read_csv(taxa, sep='\t', header=0)
print(df_bins_gtdb)

df_blast = pd.read_csv(blast, sep='\t', header=None,
                       names=['query', 'subject', 'pid', 'aln_length', 'mismatch', 'gapopen',
                              'qstart', 'qend', 'sstart', 'send', 'evalue', 'bitscore'])
print(df_blast.head(n=10))

# add MAG taxonomy to the blast results
df_blast['bin'] = df_blast['subject'].str.split('.fa').str[0]
print(df_blast.head(n=10))

df_blast_gtdb_merged = df_blast.merge(df_bins_gtdb[['user_genome', 'classification']],
                                      how='left', left_on='bin', right_on='user_genome')
print(df_blast_gtdb_merged.head(n=10))

# merge blast results with rafah
df_blast_gtdb_rafah_merged = df_blast_gtdb_merged.merge(df_rafah, how='left', left_on='query',
                                                        right_on='Sequence', suffixes=(None, '_rafah'))
print(df_blast_gtdb_rafah_merged)

# select relevant columns
df_blast_gtdb_rafah_merged = df_blast_gtdb_rafah_merged[['query', 'subject', 'pid', 'aln_length',
                                                         'mismatch', 'gapopen', 'qstart', 'qend',
                                                         'sstart', 'send', 'evalue', 'bitscore',
                                                         'bin', 'classification',
                                                         'Predicted_Host', 'Winner_Score']]

# rename the columns to meaningful headers
df_blast_gtdb_rafah_merged = df_blast_gtdb_rafah_merged.rename(
    columns={'query': 'viral_contig', 'subject': 'host_bin_contig', 'bin': 'host_bin',
             'classification': 'blast_predicted_host', 'Predicted_Host': 'rafah_predicted_host',
             'Winner_Score': 'rafah_score'})
print(df_blast_gtdb_rafah_merged)

# write out to a csv
df_blast_gtdb_rafah_merged.to_csv(out, index=False)
sbatch script for combining the blastn and RaFAH results
#!/bin/bash
#SBATCH --tasks=1
#SBATCH --cpus-per-task=10
#SBATCH --partition=short
#SBATCH --mem=5G
#SBATCH --time=01:00:00
#SBATCH --job-name=combine
#SBATCH --output=./2.1_host_prediction/30_combined_results/combine.slurm.%j.out
#SBATCH --error=./2.1_host_prediction/30_combined_results/combine.slurm.%j.err

# Our python script for combining blastn and rafah needs our python environment again
source ./py3env/bin/activate

# file paths
blast='./2.1_host_prediction/20_blastn/viral_contigs_vs_bins_blastn.out'
gtdb='./2.1_host_prediction/20_blastn/gtdbtk.bac120.summary.tsv'
rafah='./2.1_host_prediction/10_rafah/rafah_1_Host_Predictions_gtdb.csv'
output='./2.1_host_prediction/30_combined_results/combined_blast_rafah_host_predictions.csv'

python3 ./python_scripts/host_prediction/30_combine_host_prediction.py $blast $gtdb $rafah $output
Exercise - comparing RaFAH and blast results
- For how many host predictions do RaFAH and blastn agree, and up to which taxonomic level?
- Are you able to identify CRISPRs or integrated phage genomes?
a) Explain your decision
b) How might you confirm your CRISPR/prophage identification?
- Pick one example where you would trust the RaFAH prediction more and another where you would trust the blast prediction more, and explain why.
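Hint - checking agreement between the two methods
As a sketch for the first question, assuming the combined CSV produced above and that the rafah_predicted_host column holds a genus name (if yours stores a full GTDB lineage, extract its g__ field the same way):

import pandas as pd

df = pd.read_csv('./2.1_host_prediction/30_combined_results/combined_blast_rafah_host_predictions.csv')

# GTDB lineages look like 'd__Bacteria;p__...;c__...;o__...;f__...;g__...;s__...'
def gtdb_genus(lineage):
    if pd.isna(lineage):
        return None
    for field in lineage.split(';'):
        if field.startswith('g__'):
            return field.replace('g__', '')
    return None

# extract the genus from the blast-predicted host lineage
df['blast_genus'] = df['blast_predicted_host'].apply(gtdb_genus)

# contigs for which both methods point to the same genus
agree = df[df['blast_genus'] == df['rafah_predicted_host']]
print(agree['viral_contig'].nunique(), 'contigs with genus-level agreement')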
Key Points
RaFAH uses a random forest model to predict hosts at the genus level for phages
RaFAH returns a probability for each host genus it can predict
Blastn between the viral and prokaryotic fractions can be used as a quick method for host prediction
Machine learning and classical host prediction methods have a trade-off between recall and precision
Viral taxonomy and phylogeny I
Overview
Teaching: 60 min
Exercises: 120 min
Objectives
Understand the differences between taxonomic approaches for viral and cellular organisms
Explain the challenges with viral taxonomy, and how they may be overcome
How are viruses taxonomically classified?
A widely accepted approach for taxonomic classification is to select a marker gene that is shared by all organisms, create a multiple sequence alignment and phylogenetic tree, and identify taxa as characteristic branches or lineages in the tree. This approach can be applied to all cellular organisms, including bacteria, archaea, and eukaryotes, particularly with marker genes functioning in the ribosome, but only to subgroups of viruses, as no marker gene is universally conserved across all viruses. This lack of a universal genomic feature is thought to reflect their multiple evolutionary origins.
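To make the alignment-to-tree step concrete, here is a minimal sketch using Biopython's neighbor-joining implementation. The input file marker_gene_alignment.fasta is a placeholder for a pre-computed multiple sequence alignment; in practice, dedicated alignment and maximum-likelihood tree tools are preferred.

from Bio import AlignIO, Phylo
from Bio.Phylo.TreeConstruction import DistanceCalculator, DistanceTreeConstructor

# read a pre-computed multiple sequence alignment of the marker gene
aln = AlignIO.read("marker_gene_alignment.fasta", "fasta")

# compute pairwise distances and build a neighbor-joining tree
calculator = DistanceCalculator("identity")
constructor = DistanceTreeConstructor(calculator, method="nj")
tree = constructor.build_tree(aln)

# taxa then correspond to characteristic branches (clades) in this tree
Phylo.draw_ascii(tree)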
There is no single method to classify viruses. Many experts from the global virology community have done their part by classifying viruses according to their specific knowledge. This has generated a patchwork of methods that, ideally, capture the features of different viral lineages and generate meaningful taxa that are in agreement with biology. With the advance of viromics and the discovery of viruses by their genome sequences only, bioinformaticians have developed many methods to cluster them. Thus far, these methods have not been widely adopted for official ICTV virus taxonomy. They are rarely universally applicable, and if they are, there is bound to be some level of disagreement with some taxa that were previously defined by experts.
One popular method in the bacteriophage field is the gene-sharing network. This involves creating a network where nodes are genomes and edges represent shared gene families. Tight clusters in this gene-sharing network represent groups of viral genomes that share many genes, and those are interpreted as taxonomic groups. Clusters can be identified with different gene-sharing cutoffs, corresponding to different taxonomic ranks, where closely related viruses (species, genera) share more genes than distant ones (families). Tools like VICTOR and vConTACT3 work this way.
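As a toy illustration of the idea (made-up genomes and counts, not how VICTOR or vConTACT3 are actually implemented), you can build a weighted network and read off clusters at a chosen gene-sharing cutoff:

import networkx as nx

# made-up data: genomes as nodes, numbers of shared gene families as edge weights
shared_genes = {
    ('virus_A', 'virus_B'): 35,
    ('virus_A', 'virus_C'): 30,
    ('virus_B', 'virus_C'): 28,
    ('virus_C', 'virus_D'): 2,   # a weak link to a distant genome
}

g = nx.Graph()
for (u, v), n_shared in shared_genes.items():
    g.add_edge(u, v, weight=n_shared)

# apply a cutoff: a stricter cutoff yields tighter clusters (lower taxonomic ranks)
cutoff = 10
strict = nx.Graph()
strict.add_edges_from((u, v, d) for u, v, d in g.edges(data=True) if d['weight'] >= cutoff)

# connected components of the thresholded network are candidate taxonomic groups
print(list(nx.connected_components(strict)))   # [{'virus_A', 'virus_B', 'virus_C'}]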
As more and more viruses are sequenced and we get a better view of the virosphere, marker genes are making a comeback. Although a universal marker gene shared by all viruses does not exist, marker genes are certainly shared by viruses of lower-ranking taxa (species, genera, families, orders, and in some cases above). This was already clear from the gene-sharing network. Detecting the marker gene is evidence that a virus belongs to a given taxon, and phylogenetic trees can help resolve the lower-level taxonomy, just like ribosomal marker gene trees for cellular organisms. Tools like vClassifier and geNomad use marker gene approaches.
Exercise - An overview of viral taxonomy
- Check out the ICTV website: https://ictv.global/.
Discuss with your fellow students and write down your answers:
- What is ICTV? (not just a TV channel featuring polar bears)
- How many taxonomic ranks are available to classify the virosphere?
- What is a realm?
- How many viral genera are there?
Read the following sections from the review “Global Organization and Proposed Megataxonomy of the Virus World” by Koonin et al.:
- The Baltimore classes of viruses, virus hallmark genes, and major evolutionary trends in the virus world
- Evolutionary links among viruses within and across the Baltimore classes, section: Double-Stranded DNA Viruses
Discuss with your fellow students and write down your answers:
- What are the Baltimore Classes, and what are they based on?
- What triggered the change in viral taxonomy?
- What are the pros and cons of a “genomic taxonomy”?
Key Points
Viruses have multiple origins, so there is no universal marker gene
Viral taxonomy is based on many different methods
Gene-sharing networks and marker genes are popular methods for bacteriophage taxonomy
Viral taxonomy and phylogeny II
Overview
Teaching: 240 min
Exercises: 0 min
Objectives
Run vConTACT3 to work out the taxonomy of your viruses
Interpret this taxonomy
What kind of phages exist in your dataset?
For this task, we will use vConTACT3 and geNomad.
vConTACT3
vConTACT3 has an underlying assumption that the fraction of shared genes between two viruses represents their evolutionary relationship. Thus, the vConTACT3 gene-sharing network closely correlates with the ICTV taxonomy.
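In its simplest form, that assumption can be written down in a few lines. The following is a made-up illustration with invented gene families; vConTACT3's actual scoring is more sophisticated.

# fraction of gene families shared between two genomes (here: the Jaccard index)
def shared_gene_fraction(genome_a, genome_b):
    return len(genome_a & genome_b) / len(genome_a | genome_b)

phage_1 = {'terminase', 'portal', 'major_capsid', 'tail_fiber', 'integrase'}
phage_2 = {'terminase', 'portal', 'major_capsid', 'tail_fiber', 'lysin'}
phage_3 = {'rdrp', 'coat_protein'}

print(shared_gene_fraction(phage_1, phage_2))  # 0.67 -> likely closely related
print(shared_gene_fraction(phage_1, phage_3))  # 0.0  -> no evidence of relatedness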
Exercise - run vConTACT3
- Run vcontact3 on your virome dataset
You will need the following commands for the conda environment and database:

source /vast/groups/VEO/tools/anaconda3/etc/profile.d/conda.sh && conda activate mamba_20231101_python_3.9

# vcontact3 needs to be pointed at the database path
vcontact3 --db-path /work/groups/VEO/databases/vcontact3/v20240513
See your options:
vcontact3 run --help
Example command
Note: take care to request enough memory (RAM) for processing; vConTACT3 loads the full viral network into memory as a graph.
vcontact3 run --nucleotide virus_genomes.fasta --output output_directory --db-path /work/groups/VEO/databases/vcontact3/v20231101/ -e cytoscape,tree

# other options you might use
# -t : threads
# --distance-metric : {VirClust,SqRoot,Shorter,Jaccard}
# --min-shared-genes : minimum number of shared genes for two genomes to be connected
# -e : exporting formats/files
sbatch script to run vcontact3
#!/bin/bash
#SBATCH --tasks=1
#SBATCH --cpus-per-task=10
#SBATCH --partition=short,standard,interactive
#SBATCH --mem=50G
#SBATCH --time=2:00:00
#SBATCH --job-name=vcontact3
#SBATCH --output=./2.2_taxonomy/10_vcontact_results/vcontact3.slurm.out.%j
#SBATCH --error=./2.2_taxonomy/10_vcontact_results/vcontact3.slurm.err.%j

# activate conda environment
source /vast/groups/VEO/tools/anaconda3/etc/profile.d/conda.sh && conda activate mamba_20231101_python_3.9

database='/work/groups/VEO/databases/vcontact3/v20231101/'
assembly='./1.3_virus_identification/40_results_filter_contigs/assembly.fasta'
outdir='./2.2_taxonomy/10_vcontact_results'

# options
# -e : how do you want the output formatted? See the vcontact3 help for more options
vcontact3 run --nucleotide $assembly --output $outdir --db-path $database -e cytoscape
Output hints
- final_assignments.csv contains your taxonomic assignments
- .cyjs files are network files for Cytoscape
- .pkl.gz files are compressed pickle files (read with Python/pandas)
geNomad
geNomad has a different way of assigning taxonomy. It uses a marker-gene-to-taxonomy database. It first identifies marker genes on the contigs, classifies the taxonomy for each gene, and then consolidates it using weights.
Read it here.
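Conceptually, the consolidation step works like a weighted vote over the gene-level assignments. The snippet below is a deliberately simplified illustration with invented weights, not geNomad's actual procedure (see the paper for the real weighting scheme).

from collections import defaultdict

def consolidate(gene_calls):
    """gene_calls: (taxon, weight) tuples for the marker genes found on one contig."""
    totals = defaultdict(float)
    for taxon, weight in gene_calls:
        totals[taxon] += weight
    # the contig is assigned the taxon with the highest summed weight
    return max(totals, key=totals.get)

genes = [('Caudoviricetes', 1.0), ('Caudoviricetes', 0.8), ('Megaviricetes', 0.3)]
print(consolidate(genes))  # Caudoviricetes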
From this pipeline, we will only use the annotate module, which includes functional annotation and taxonomic classification. geNomad can also identify and classify proviruses (integrated viruses) and plasmids.
Exercise - run geNomad
- Run geNomad on your virome dataset
The conda environment you will require:
source /vast/groups/VEO/tools/anaconda3/etc/profile.d/conda.sh
conda activate genNomad_v20230721
We will only be using the annotate module from the geNomad pipeline. The basic command for this:

genomad annotate [OPTIONS] INPUT OUTPUT DATABASE
sbatch script to run genomad
#!/bin/bash
#SBATCH --tasks=1
#SBATCH --cpus-per-task=10
#SBATCH --partition=short,standard,interactive
#SBATCH --mem=50G
#SBATCH --time=2:00:00
#SBATCH --job-name=genomad
#SBATCH --output=./2.2_taxonomy/20_genomad_results/genomad.slurm.out.%j
#SBATCH --error=./2.2_taxonomy/20_genomad_results/genomad.slurm.err.%j

# activate conda environment
source /vast/groups/VEO/tools/anaconda3/etc/profile.d/conda.sh && conda activate genNomad_v20230721

database='/work/groups/VEO/databases/geNomad/v1.3'
assembly='./1.3_virus_identification/40_results_filter_contigs/assembly.fasta'
outdir='./2.2_taxonomy/20_genomad_results'

genomad annotate --threads 10 $assembly $outdir $database
Outputs from genomad
- phage_contigs_taxonomy.tsv: taxonomic classifications
- phage_contigs_genes.tsv: functional annotations
One of the advantages of running geNomad is that it also outputs functional annotations as part of its "annotate" pipeline module. These functions are assigned differently from how pharokka assigns them, using a marker dataset of 227,897 profiles.
Exercise - comparing your taxonomic annotations
- What percentage of your viruses were classified by each tool?
- How can you judge the accuracy of your taxonomic classifications?
- How many classifications agree/disagree?
- What is the maximum taxonomic distance between viruses that are still connected in this graph?
- What are the limitations of using these tools?
Pick one of the viral contigs annotated by vConTACT3
- How many genes are shared with known reference viruses?
- Which genes are shared with the other known viruses?
- Visualize your virus on the network (see the homework exercises here).
- Which viruses are closely related? Which are not?
Hint on querying pickle files using python
import pandas as pd

# read in the shared genes pickle file
# this pickle file contains genes shared at even 30% identity - there are other files for 40, 50, 60 and 70% identity
df_3 = pd.read_pickle("HMMprofile.0.3_SqRoot_shared_genes.pkl.gz")
print(df_3)

# check how many genes contig_518 shares with the other contigs
df_3_contig_518 = df_3.loc[df_3['contig_518'] > 0, 'contig_518']
print(df_3_contig_518)
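Hint - a rough sketch for the "percentage classified" question. Column names and file locations here are assumptions based on the outputs described above; check the headers of your own files before relying on this.

import pandas as pd

# vConTACT3: count contigs with at least a class-level prediction
df_vc = pd.read_csv('./2.2_taxonomy/10_vcontact_results/final_assignments.csv', header=0)
df_vc = df_vc[df_vc['GenomeName'].str.contains('contig')]
pct_vc = df_vc['class (prediction)'].notna().mean() * 100
print(f"vConTACT3 classified {pct_vc:.1f}% of contigs (class level)")

# geNomad: unplaced contigs carry 'Unclassified' in their lineage
# (adjust the path to wherever geNomad wrote its *_taxonomy.tsv)
df_gn = pd.read_csv('./2.2_taxonomy/20_genomad_results/assembly_annotate/assembly_taxonomy.tsv',
                    sep='\t', header=0)
pct_gn = (~df_gn['lineage'].str.contains('Unclassified', na=True)).mean() * 100
print(f"geNomad classified {pct_gn:.1f}% of contigs")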
Next steps - getting all your data together
Now that we have results from many different tools and programs, we can build a table to compare and correlate them. You can build this table however you would like, using excel or python etc. If you are using python, we recommend the pandas df.merge or df.join methods for this task; you will likely need to use them multiple times. Here's a sample of the type of table you need to build:
Contig | Contig length (bp) | Coverage | Completeness | Contamination | Num Genes: CheckV (Total / Viral / Bacterial) | Num Genes: Phanotate (Total / Viral / Bacterial) | Num Genes: GeNomad (Total / Viral / Bacterial) | Taxonomy: vConTACT3 | Taxonomy: GeNomad | Host Prediction
---|---|---|---|---|---|---|---|---|---|---
Contig_1 | 1000 | | 97 | 0.3 | 50 / 45 / 2 | 50 / 47 / 2 | 50 / 47 / 2 | Viruses;Duplodnaviria;Heunggongvirae;Uroviricota;Caudoviricetes | Viruses;Duplodnaviria;Heunggongvirae;Uroviricota;Caudoviricetes | unknown
py script for merging all the output tables from all the tools!
# %%
import pandas as pd
import functools as ft

assembly_filename = "assembly_info.txt"
jaeger_filename = "assembly_default_jaeger.tsv"
checkv_filename = "quality_summary.tsv"
pharokka_filename = "pharokka_cds_functions.tsv"
genomad_filename = "assembly_genes.tsv"
genomad_taxonomy_filename = "assembly_taxonomy.tsv"
vcontact_filename = "final_assignments.csv"
out_filename = "viral_contigs_all_outputs_merged.csv"

# %%
df_jaeger = pd.read_csv(jaeger_filename, header=0, sep='\t', usecols=[0,2,16]).set_index('contig_id')
df_jaeger = df_jaeger.loc[(df_jaeger['prediction'] == 'phage')]
print(df_jaeger)

# %%
df_assembly = pd.read_csv(assembly_filename, header=0, sep='\t', usecols=[0,1,2,3],
                          names=['contig_id','length','coverage','is_circular']).set_index('contig_id')
print(df_assembly)

# %%
df_checkv = pd.read_csv(checkv_filename, header=0, sep='\t', usecols=[0,4,6,7,9,11],
                        names=['contig_id','checkv_total_genes','checkv_viral_genes',
                               'checkv_quality','completeness','contamination']).set_index('contig_id')
df_checkv.fillna(0, inplace=True)
# print(df_checkv)
df_checkv = df_checkv.loc[(df_checkv['completeness'] > 50) & (df_checkv['contamination'] < 5)]
print(df_checkv)

# %%
df_pharokka = pd.read_csv(pharokka_filename, header=0, sep='\t')
df_pharokka = df_pharokka.pivot(index='contig', columns='Description')
df_pharokka.columns = df_pharokka.columns.get_level_values(1)
#df_pharokka = df_pharokka[['Count']]
df_pharokka = df_pharokka[['CDS','CRISPRs','DNA, RNA and nucleotide metabolism','connector',
                           'head and packaging','integration and excision','lysis',
                           'moron, auxiliary metabolic gene and host takeover','other','tRNAs',
                           'tail','transcription regulation']]
df_pharokka.rename(columns={'CDS':'phanotate_total_genes'}, inplace=True)
print(df_pharokka)

# %%
# genomad genes
df_genomad_genes = pd.read_csv(genomad_filename, header=0, sep='\t', usecols=[0,8],
                               names=['contig_id_cds','marker'])
#print(df_genomad_genes)
# split the column, join to df and add a column name
df_genomad_genes = df_genomad_genes.join(df_genomad_genes['contig_id_cds'].str.rsplit('_', n=1, expand=True).add_prefix('A'))
# rename the columns
df_genomad_genes.rename(columns={'A0':'contig_id','A1':'cds'}, inplace=True)
# print(df_genomad_genes)
# get total genes
df_genomad_genes_counts = df_genomad_genes.groupby(by='contig_id').count()
# print(df_genomad_genes_counts)
# get viral genes based on the marker column containing "VV"
df_genomad_genes['marker'].fillna('bla', inplace=True)
df_genomad_vv_counts = df_genomad_genes[df_genomad_genes['marker'].str.contains('VV')].groupby(by='contig_id').count()
print(df_genomad_vv_counts)
# merge the gene counts and marker gene counts and rename
df_final_genomad = pd.DataFrame.join(df_genomad_genes_counts[['cds']], df_genomad_vv_counts[['marker']],
                                     on='contig_id', how='left')
df_final_genomad.rename(columns={'cds':'genomad_total_genes','marker':'genomad_viral_genes'}, inplace=True)
print(df_final_genomad)

# %%
df_rafah = pd.read_csv("rafah_1_Host_Predictions_gtdb.csv", header=0, usecols=[1,2,3],
                       names=['contig_id','rafah_predicted_host','rafah_prediction_score']).set_index('contig_id')
print(df_rafah)

# %%
# vcontact3 taxonomy
df_vcontact = pd.read_csv(vcontact_filename, header=0)
df_vcontact = df_vcontact[df_vcontact['GenomeName'].str.contains('contig')]
cols = ['realm (prediction)','phylum (prediction)','class (prediction)','order (prediction)',
        'family (prediction)','subfamily (prediction)','genus (prediction)']
df_vcontact['taxonomy'] = df_vcontact[cols].apply(lambda row: ';'.join(row.values.astype(str)), axis=1)
df_vcontact_final = df_vcontact[['GenomeName','taxonomy']]
df_vcontact_final.rename(columns={'GenomeName':'contig_id','taxonomy':'vcontact_taxonomy'}, inplace=True)
df_vcontact_final.set_index('contig_id', inplace=True)
print(df_vcontact_final)

# %%
# genomad taxonomy
df_genomad_taxa = pd.read_csv(genomad_taxonomy_filename, sep='\t', header=0, usecols=[0,2,3,4],
                              names=['contig_id','genomad_agreement','genomad_taxid','genomad_taxonomy']).set_index('contig_id')
print(df_genomad_taxa)

# %%
# merging all the files together
dfs_list = [df_assembly, df_pharokka, df_final_genomad, df_rafah, df_vcontact_final, df_genomad_taxa]
df_jaeger_checkv = df_checkv.join(df_jaeger, how='inner')
print(df_jaeger_checkv)
df_others = ft.reduce(lambda left, right: pd.merge(left, right, how='left',
                                                   left_index=True, right_index=True), dfs_list)
df_final = df_jaeger_checkv.join(df_others, how='left')
print(df_final)

# %%
df_final.to_csv(out_filename)
sbatch script for submitting the merging script
#!/bin/bash
#SBATCH --tasks=1
#SBATCH --cpus-per-task=2
#SBATCH --partition=short
#SBATCH --mem=5G
#SBATCH --time=00:10:00
#SBATCH --job-name=merge
#SBATCH --output=2.2_taxonomy/30_merge_summary/merge.slurm.%j.out
#SBATCH --error=2.2_taxonomy/30_merge_summary/merge.slurm.%j.err

# activate the python virtual environment with the packages we need
source ./py3env/bin/activate

python3 ./python_scripts/taxonomy/merge_summary.py

# deactivate the environment
deactivate
Key Points
vConTACT3 and geNomad are two programs that classify the taxonomy of viral sequences using differing strategies
vConTACT3 uses gene-sharing networks, while geNomad uses marker genes
Both methods show strong agreement with the ICTV classifications
Visualizing viral taxonomy
Overview
Teaching: 0 min
Exercises: 90 min
Objectives
Plot the network of closely related viruses created by vConTACT3
This section is suggested as homework.
Today, you taxonomically annotated your phage contigs with vConTACT3.
The tool applies a gene-sharing network approach by computing the number of homologous genes between two viral sequences.
Based on this information, it integrates unseen contigs into a reference network of taxonomically classified phages, and then uses information about related viruses in the network to infer a taxonomic classification.
The resulting network, which contains both the reference phages and our assembled contigs, can be found in the output file graph.cyjs
in your vConTACT3 results folder.
This file contains the network encoded in JSON format, which can be parsed with the Python standard library package json.
Python also provides several tools for handling and analyzing networks. In this section, you will use the NetworkX package to visualize the network of phages. The goal is to get an idea of the taxonomic diversity of the contigs in our dataset, and how it compares to reference phages in the database. You can either focus on a single contig and its closely related phages (close network neighborhood) or plot the whole network to get an overview of the full virosphere (at least the viruses in the reference database).
NetworkX provides several functions for drawing the network. Our network only consists of nodes (phages) and weighted edges (number of shared genes) between them, without any layout information (where to position the nodes). There are many algorithms for finding a visually pleasing layout for a given network. The Kamada-Kawai layout and the spring layout both employ a physical force-based simulation that optimizes the distance or attraction between node pairs. In our network, the number of shared genes between two nodes serves as a measure of attraction. The spring layout is computationally less demanding and should be used for the visualization of the complete network.
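To get a feeling for the two algorithms before applying them to the phage network, you can compare them on a small random graph (a throwaway example, not part of the assignment):

import networkx as nx
import matplotlib.pyplot as plt

# a small random graph to compare the two layout algorithms on
g = nx.erdos_renyi_graph(n=30, p=0.1, seed=42)

layouts = {"spring": nx.spring_layout(g, seed=42),
           "kamada-kawai": nx.kamada_kawai_layout(g)}

fig, axes = plt.subplots(1, 2, figsize=(8, 4))
for ax, (name, pos) in zip(axes, layouts.items()):
    nx.draw_networkx(g, pos=pos, ax=ax, node_size=50, with_labels=False)
    ax.set_title(name)
    ax.axis('off')
fig.savefig("layout_comparison.png", dpi=300)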
Making these plots involves a fair amount of programming and testing your code. This homework is not so much about figuring out the code as about introducing you to networkx as a tool for network analysis and visualization, and about creating a visualization which can convey the relational character of the data better than a table could. You can either try to figure out the code for this homework yourself or start with one of the solutions and try to adjust or improve it. The goal is for you to end up with a picture of the network of phages which conveys the idea of the algorithm vConTACT3 implements.
# you have to add networkx to the virtual environment first
import networkx as nx
# the json package is part of the standard library
import os, sys, json
# path to the vConTACT3 network output (adjust to your results folder)
graph_filename = "./2.2_taxonomy/10_vcontact_results/graph.cyjs"

# parse the json file and create a networkx graph from it
with open(graph_filename) as json_file:
graph_json = json.load(json_file)
# the following lines are necessary for compatibility with networkx
graph_json["data"] = []
graph_json["directed"] = False
for node in graph_json["elements"]["nodes"]:
# the 'value' attribute will be used for naming the nodes
node["data"]["value"] = node["data"]["id"]
# create a networkx graph from the graph_json dictionary
g = nx.cytoscape_graph(graph_json)
# use draw_networkx_nodes(...) and draw_networkx_edges(...) to draw the graph
# use g.subgraph(...) or g.edge_subgraph(...) to select a subgraph of interest
python script for plotting the local neighborhood of your selected contig
import os, sys, json
import networkx as nx
import numpy as np   # needed for np.unique (missing in the original script)
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import matplotlib as mpl

# the keys to access taxonomic ranks of the phages in the graph
ranks = ["order", "class", "phylum"]

# this function looks for the lowest available rank according to the ones defined in the ranks variable
def get_lowest_tax(node):
    for r in ranks:
        if r in node.keys():
            return node[r]
    return "none"

def main():
    graph_filename = os.path.abspath(sys.argv[1])
    assert graph_filename.endswith(".cyjs")
    out_filename = os.path.abspath(sys.argv[2])
    assert out_filename.endswith(".png")
    contig_id = sys.argv[3]

    with open(graph_filename) as json_file:
        graph_json = json.load(json_file)

    # set for compatibility with networkx
    graph_json["data"] = []
    for node in graph_json["elements"]["nodes"]:
        # assign "value" attribute for compatibility
        node["data"]["value"] = node["data"]["id"]
        # assign lowest available taxonomic rank with our function
        node["data"]["lowest_tax"] = get_lowest_tax(node["data"])

    g = nx.cytoscape_graph(graph_json)

    # find your favourite contig and get its internal node id
    contig_id_node = -1
    for node in g.nodes(data="name"):
        if node[1] == contig_id:
            contig_id_node = node[0]

    # raise an error, if the contig could not be found
    if contig_id_node == -1:
        raise(AssertionError(f"Could not find {contig_id} in the graph"))

    # get all nodes connected to the selected contig and compute the corresponding subgraph
    node_set = {contig_id_node}.union({n for n in g.neighbors(contig_id_node)})
    subgraph = g.subgraph(node_set)

    # get all taxa in the subgraph
    taxa = nx.get_node_attributes(subgraph, "lowest_tax")

    # set up a mapping from node to color based on the taxonomic rank
    unique_taxa = np.unique([t for t in taxa.values()])
    taxa_to_color = {t: mpl.colormaps["tab20"](i) for i, t in enumerate(unique_taxa)}
    colors = {n: taxa_to_color[t] for n, t in taxa.items()}

    # compute a positional layout using the kamada-kawai algorithm
    pos = nx.kamada_kawai_layout(subgraph)

    # set colors for nodes. use the taxonomic information to color the reference nodes
    names = nx.get_node_attributes(subgraph, "name")
    colors = [('seagreen' if names[n] == contig_id else 'red') if names[n].startswith('contig') else colors[n]
              for n in subgraph.nodes()]

    # draw the network using the functions networkx provides
    nx.draw_networkx_edges(subgraph, pos=pos, width=0.05, alpha=0.4)
    nx.draw_networkx_nodes(subgraph, pos=pos, alpha=0.7, node_size=100, node_color=colors)
    nx.draw_networkx_labels(subgraph, pos=pos, labels=names, font_size=3)

    # create rectangles for a legend connecting taxa with their colors in the graph
    patches = [mpatches.Patch(color=c, label=l, alpha=0.6) for l, c in taxa_to_color.items() if l != 'none'] + \
              [mpatches.Patch(color='red', label='our contigs'),
               mpatches.Patch(color='seagreen', label='selected contig')]
    plt.legend(handles=patches, loc='upper right', framealpha=0.5, frameon=True, prop={'size': 5})

    ax = plt.gca()
    ax.axis('off')
    plt.tight_layout()
    plt.savefig(out_filename, dpi=700)

if __name__ == "__main__":
    main()
python script for plotting the complete network of phages
import os, sys, json
import networkx as nx
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import matplotlib as mpl

# the keys to access taxonomic ranks of the phages in the graph
ranks = ["order", "class", "phylum"]

# this function looks for the lowest available rank according to the ones defined in the ranks variable
def get_lowest_tax(node):
    for r in ranks:
        if r in node.keys():
            return node[r]
    return "none"

def main():
    # parse first parameter: path to the graph file
    graph_filename = os.path.abspath(sys.argv[1])
    assert graph_filename.endswith(".cyjs")

    # parse second parameter: path to the output file
    out_filename = os.path.abspath(sys.argv[2])
    assert out_filename.endswith(".png")

    # third parameter: the id of your favourite contig to highlight in the network
    contig_id = sys.argv[3]

    # parse the json file containing the graph generated by VContact3
    with open(graph_filename) as json_file:
        graph_json = json.load(json_file)

    # set for compatibility with networkx
    graph_json["data"] = []
    for node in graph_json["elements"]["nodes"]:
        # assign "value" attribute for compatibility
        node["data"]["value"] = node["data"]["id"]
        # assign lowest available taxonomic rank with our function
        node["data"]["lowest_tax"] = get_lowest_tax(node["data"])
    for edge in graph_json["elements"]["edges"]:
        # add "weight" attribute for compatibility
        edge["data"]["weight"] = float(edge["data"]["distance"])
        # assign "attraction" attribute as inverse of "distance" for the position finding algorithm (spring)
        edge["data"]["attraction"] = 1 - edge["data"]["weight"]

    # generate a networkx graph object from the json object
    g = nx.cytoscape_graph(graph_json)

    # find your favourite contig and get its internal node id
    contig_id_node = -1
    for node in g.nodes(data="name"):
        if node[1] == contig_id:
            contig_id_node = node[0]

    # raise an error, if the contig could not be found
    if contig_id_node == -1:
        raise(AssertionError(f"Could not find {contig_id} in the graph"))

    # limit our drawing to nodes actually connected to anything.
    connected_nodes = {n for n in g.nodes() if len(g.adj[n]) > 0}
    subgraph = g.subgraph(connected_nodes)

    # get all taxa in the subgraph
    taxa = nx.get_node_attributes(subgraph, "lowest_tax")

    # the number of taxa could be above the number of colors in a colormap, combine two for more colors
    colormap = [mpl.colormaps["tab20"](i) for i in range(20)] + [mpl.colormaps["tab20b"](i) for i in range(20)]

    # set up a mapping from node to color based on the taxonomic rank
    unique_taxa = np.unique([t for t in taxa.values()])
    taxa_to_color = {t: colormap[i] for i, t in enumerate(unique_taxa)}
    colors = {n: taxa_to_color[t] for n, t in taxa.items()}

    # compute positions for the nodes in the network. The spring algorithm is fast, but not so pretty.
    # sadly, networkx has no good alternatives for large networks.
    pos = nx.spring_layout(subgraph, k=1/100000, weight="attraction", iterations=50)

    # The produced layout is very concentrated in the center. Transform the coordinates for zooming in.
    pos = {k: np.array(p) - np.array(p) * np.log(np.linalg.norm(p)) for k, p in pos.items()}

    # get the "name" attribute of all nodes. We can use it to identify our contigs
    names = nx.get_node_attributes(subgraph, "name")

    # get the subgraph containing the reference set of phages and assign colors to the nodes according to the taxonomic rank
    nodes_ref = {n for n in subgraph.nodes() if not names[n].startswith("contig_")}
    subgraph_ref = subgraph.subgraph(nodes_ref)
    subgraph_ref_colors = [colors[n] for n in subgraph_ref.nodes()]

    # get the subgraph containing our contigs, we will color them all the same.
    nodes_contigs = {n for n in subgraph.nodes() if names[n].startswith("contig_")}
    subgraph_contigs = subgraph.subgraph(nodes_contigs)

    # get the subgraph of your favourite contig (only the contig :)) and the edge subgraph connecting the contig to the reference
    # we can use it to highlight the connections of the contig in the reference network
    subgraph_select = subgraph.subgraph({contig_id_node})
    subgraph_neighbors = subgraph.edge_subgraph([(contig_id_node, n) for n in subgraph.neighbors(contig_id_node)])

    # create a figure and axis with pyplot
    fig, ax = plt.subplots()

    # draw the reference network. first, we draw the edges and then the nodes. networkx also does this automatically.
    # read the comment of the next code block for setting the drawing order manually
    nx.draw_networkx_edges(subgraph, pos=pos, alpha=0.1,
                           width=[1 - subgraph[u][v]["weight"] for u, v in subgraph.edges()],
                           node_size=10, edge_color="grey", ax=ax)
    nx.draw_networkx_nodes(subgraph_ref, pos=pos, alpha=0.6, node_size=10,
                           node_color=subgraph_ref_colors, linewidths=0.0, ax=ax)

    # highlight your favourite contigs neighborhood by drawing the connecting edges and circles around the neighbor nodes
    # set_zorder tells matplotlib the order, in which objects are to be rendered. the values can be set arbitrarily as long as
    # they set up an order.
    edges_on_top = nx.draw_networkx_edges(subgraph_neighbors, pos=pos, alpha=0.6, width=0.02, node_size=10, ax=ax)
    edges_on_top.set_zorder(2)
    edges_on_top = nx.draw_networkx_nodes(subgraph_neighbors, pos=pos, linewidths=0.02, node_size=10,
                                          node_color='none', edgecolors="black", ax=ax)
    edges_on_top.set_zorder(2)

    # draw our contigs on top of everything else
    contig_nodes_collection = nx.draw_networkx_nodes(subgraph_contigs, pos=pos, alpha=0.9, node_size=10,
                                                     node_color="red", linewidths=0.0, ax=ax)
    contig_nodes_collection.set_zorder(3)

    # add your favourite contig on top of that
    select_collection = nx.draw_networkx_nodes(subgraph_select, pos=pos, alpha=0.9, node_size=10,
                                               node_color="green", linewidths=0.0, ax=ax)
    select_collection.set_zorder(4)

    # create rectangles for a legend connecting taxa with their colors in the graph
    patches = [mpatches.Patch(color=c, label=l, alpha=0.6) for l, c in taxa_to_color.items()] + \
              [mpatches.Patch(color='red', label='our contigs'),
               mpatches.Patch(color='green', label='selected contig')]
    ax.legend(handles=patches, loc='center right', framealpha=0.5, frameon=True, prop={'size': 5})

    # do some adjustments to improve the plot, first remove outer borders
    ax.axis('off')

    # set the min and max of each axis, add a bit of free space on the upper xlim for the legend
    xs, ys = [p[0] for p in pos.values()], [p[1] for p in pos.values()]
    xmin, xmax, ymin, ymax = np.min(xs), np.max(xs), np.min(ys), np.max(ys)
    ax.set_xlim([xmin + 0.05*xmin, xmax + 0.5*xmax])
    ax.set_ylim([ymin + 0.05*ymin, ymax + 0.05*ymax])

    # save the figure with a high resolution to be able to see enough
    fig.savefig(out_filename, dpi=700)

if __name__ == "__main__":
    main()
sbatch script for submitting the plotting script
#!/bin/bash
#SBATCH --tasks=1
#SBATCH --cpus-per-task=2
#SBATCH --partition=short
#SBATCH --mem=5G
#SBATCH --time=00:10:00
#SBATCH --job-name=taxonomic_graph
#SBATCH --output=./2.2_taxonomy/40_plot_networks/taxonomic_graph.slurm.%j.out
#SBATCH --error=./2.2_taxonomy/40_plot_networks/taxonomic_graph.slurm.%j.err

# activate the python virtual environment with the packages we need
source ../py3env/bin/activate

# in this sbatch script, it's not necessary to create the directory,
# we already told sbatch to create it for the log files.
mkdir -p ./2.2_taxonomy/40_plot_networks/

# define an array with contig ids to plot (bash arrays are space-separated, no commas)
declare -a arr=("contig_10" "contig_109" "contig_164")

# loop through the array
for contig in "${arr[@]}"
do
    # run our scripts for plotting the network around a contig of choice
    # PLOT CONTIG IN LOCAL GENE SHARING NETWORK
    python python_scripts/taxonomy/30_taxonomic_graph_local.py ../2.2_viral_taxonomy/02_vcontact3_results_new_contig_names/graph.cyjs ./2.2_taxonomy/40_plot_networks/"$contig"_graph_local.png $contig
    # PLOT CONTIG IN GLOBAL GENE SHARING NETWORK
    python python_scripts/taxonomy/30_taxonomic_graph_global.py ../2.2_viral_taxonomy/02_vcontact3_results_new_contig_names/graph.cyjs ./2.2_taxonomy/40_plot_networks/"$contig"_graph_global.png $contig
done

# deactivate the environment
deactivate
Plot the network of closely related phages
You can choose to plot the whole network of all reference phages and contigs, or zoom in and plot a single contig and its neighborhood.
- Use the taxonomic information encoded in the network to color the nodes.
- Use draw_networkx_nodes and draw_networkx_edges to better access the plot attributes.
Key Points
The Python package NetworkX can be used to work with networks
The visualization of graphs usually requires the computation of positions for the nodes
Many of our contigs are close to Caudoviricetes, but we also have several connected groups of unclassified sequences
Designing a research project
Overview
Teaching: min
Exercises: 480 min
Objectives
Get acquainted with the state of the art of phage research
Choose one or more papers for inspiration
Write down 1 or 2 research questions
Make a plan on how to tackle the questions
To close off this course, we would like to give you the opportunity to design your own research project. A research question does not have to be very profound; it can be simple. Viromics is a new field, the virosphere is huge, and there is a LOT of data, so there are more unanswered than answered questions. Looking back on the course, are there things you are wondering about or want more time to work on? You can also look in the literature for inspiration.
Here are a few papers that caught our eye; maybe they inspire you (feel free to find other papers too):
- A highly abundant bacteriophage discovered in the unknown sequences of human faecal metagenomes, by Dutilh et al (2014)
- Bacteriophage evolution differs by host, lifestyle and genome, by Mavrich and Hatfull (2017)
- Evolution of BACON Domain Tandem Repeats in crAssphage and Novel Gut Bacteriophage Lineages, by de Jonge et al (2019)
- Virus Bioinformatics, by Pappas et al (2021)
- High viral abundance and low diversity are associated with increased CRISPR-Cas prevalence across microbial ecosystems, by Meaden et al (2022)
- Viruses interact with hosts that span distantly related microbial domains in dense hydrothermal mats, by Hwang et al (2023)
Choose one type of project
Choose one type of project that best fits your interests:
- Go into detail into analyses
- Build a broader research proposal
Detailed Analyses
This type of project is more hands-on. You might write your own script(s), run your own analyses and produce your own plots. The idea here would be to focus on one of the methods/analyses we covered in the course and to go deeper into the analysis. In any case, the question you want to address should be clearly defined.
An example: “What are the differences between functional annotation done with Pharokka and Genomad?”.
Below is what should be included in your documentation/presentation for project type 1:
- A brief background on the topic
- A workflow for the steps you plan to take
- A rationale for why you will take those steps
- What will be the inputs and outputs?
- Which programs do you plan to use?
Note that hypotheses do not need to be proven. As results, we want to see:
- Output (a table or a figure)
- Interpretation of the output: what does the data tell you? Does it answer your question? It does not have to; in a negative case, can you say what you might do differently next time?
Research Proposal Style
This type of project is more in the style of a research proposal with a broader scope. You should focus on the biological questions and how to address them using viromics.
Your hypothesis and aims should be clear and backed up by the literature.
Below is what should be included in your documentation/presentation for project type 2:
- A brief background on the topic
- Formulate your research question or hypothesis to test
- What informed the question?
- How would you address this question or test the hypothesis? (make a detailed plan)
- Estimate the complexity of the problem
Presentation
- Present the project you worked on
- Limitations
- Things you found particularly interesting
- Any lessons learnt
Your project notes should still go into your lab book
Exercise
- Once you have an idea of a research question, take some time to talk about your idea with your fellow students. Often your initial idea can be further refined based on discussions with others.
- Make sure you check the literature: are there any papers that do something related to your question? Find the three papers that are the most closely related to your project, and use them to refine your question.
- Make a plan for tackling your question(s). The more detail you can add, the better. Think about data sources, bioinformatic methods, possible outcomes, expectations, backup/follow-up plans, hypotheses, and interpretation.
Here, you can find ideas for projects.
Key Points
Working on project
Overview
Teaching: min
Exercises: 180 min
Objectives
Work on your own project
Clarify unclear points with your teachers
This morning, please continue working on your research project design. Additionally, we have open hours with the teachers: discuss the project and/or clarify any open issues with them.
Key Points
Prepare Presentation
Overview
Teaching: min
Exercises: 180 min
Objectives
Prepare your final presentation
Take the time today to prepare your final presentation.
Key Points
Presentation
Overview
Teaching: min
Exercises: 180 min
Objectives
Present your progress
Get feedback
Present your progress and get feedback from the teachers and your colleagues.
Key Points
Send Documentation
Overview
Teaching: min
Exercises: 180 min
Objectives
Clean up documentation and scripts
Send all required documentation to your teachers
In your final hours on the module, clean up your lab book, scripts and any additional documentation and send them to your teachers. We hope you learned interesting things during the module and wish you all the best!
Key Points