Course introduction
In this course, you will analyze viral sequence data
You need to keep a lab notebook
You need access to the HPC system Draco
Introduction to viromics
Viromics is the study of viruses, and in our case bacteriophages, using next-generation sequencing technologies
Bacteriophages are diverse and ubiquitous across all biomes
Bacteriophages have a large impact on their environment, including the human gut
Getting to know your files
Sequence data and information about genes and genomes can come in many different formats
The most common file formats are FASTA (nucleotide and amino acid), FASTQ, SAM and BAM, GenBank, GFF and TSV
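As an illustration of how simple some of these formats are, here is a minimal sketch of a FASTA reader in plain Python. The function name `parse_fasta` and the example records are made up for this illustration. Real analyses would normally use an established parser.

```python
def parse_fasta(lines):
    """Yield (header, sequence) tuples from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            # Sequence lines may be wrapped, so collect them piece by piece
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Hypothetical example records, not taken from the course data
example = """>contig_1 hypothetical phage contig
ACGTACGT
ACGT
>contig_2
GGGCCC
""".splitlines()

records = list(parse_fasta(example))
for header, sequence in records:
    print(header, len(sequence))
```

The same function works for nucleotide and amino acid FASTA files, since both share the header and sequence layout.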
Sequencing Quality Control
Sequencing quality control is an important first evaluation of our data
NanoPlot can be used to evaluate read quality of long reads
Chopper can be used to filter reads based on quality and/or length
The GC content profile of a metagenome or virome is composed of a mix of bell curves
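The quality and GC content ideas above can be sketched in a few lines of Python. This is an illustration only, not how NanoPlot or Chopper are implemented: the function names, thresholds and reads are hypothetical, and the choice to average error probabilities rather than raw Phred scores is an assumption about how a mean read quality is best computed.

```python
import math


def mean_quality(quality_string, offset=33):
    """Mean Phred quality of a read, averaged over per-base error probabilities."""
    probs = [10 ** (-(ord(ch) - offset) / 10) for ch in quality_string]
    return -10 * math.log10(sum(probs) / len(probs))


def gc_content(sequence):
    """Fraction of G and C bases in a sequence."""
    seq = sequence.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def filter_reads(reads, min_quality=10.0, min_length=5):
    """Keep reads that pass both the length and the mean quality threshold."""
    return [
        (name, seq, qual)
        for name, seq, qual in reads
        if len(seq) >= min_length and mean_quality(qual) >= min_quality
    ]


# Hypothetical reads as (name, sequence, quality string) tuples
reads = [
    ("read1", "ACGTACGTAC", "IIIIIIIIII"),  # high quality, long enough
    ("read2", "ACG", "III"),                # high quality, too short
    ("read3", "GGCCGGCCGG", "##########"),  # long enough, low quality
]

kept = filter_reads(reads)
for name, seq, qual in kept:
    print(name, round(mean_quality(qual), 1), gc_content(seq))
```

Computing `gc_content` for every read and plotting the values as a histogram gives the kind of profile described above, where each abundant genome contributes its own bell-shaped peak.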
Assembly lecture
Assembly and cross-assembly
Flye can be used to assemble long and noisy nanopore reads from metagenomic samples
Samples can be assembled individually and combined in a cross-assembly
vClust can be used to assess the diversity of sequences in your assembly
Assessing assembly quality
CheckV assesses the quality of your contigs with regard to viral completeness and contamination
minimap2 aligns long and noisy nanopore reads efficiently to large (meta)genomes
samtools can be used to read, filter, convert and summarize alignments
Visualizing the assembly
Identifying Viral Contigs I
Different approaches can be used in a virus identification tool, such as reference-based and machine learning
A benchmark is a comparison between tools and should offer concrete evidence of performance, such as the number of true positives (TP), false negatives (FN), false positives (FP) and true negatives (TN)
When choosing a tool, you should consider the priorities of the project. For example, if you need to identify as many viruses as possible and FP are not critical, a tool with high recall is ideal
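The link between the four counts and the usual performance metrics can be made concrete with a short sketch. The function name and the counts for the two tools are invented for illustration and do not come from any real benchmark.

```python
def benchmark_metrics(tp, fp, fn, tn):
    """Compute common performance metrics from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0        # also called sensitivity
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
    }


# Made-up counts for two hypothetical tools on the same 1000 contigs
tool_a = benchmark_metrics(tp=90, fp=40, fn=10, tn=860)
tool_b = benchmark_metrics(tp=70, fp=5, fn=30, tn=895)

for name, metrics in (("tool_a", tool_a), ("tool_b", tool_b)):
    print(name, {key: round(value, 3) for key, value in metrics.items()})
```

In this made-up comparison, `tool_a` finds more of the true viruses (higher recall) while `tool_b` makes fewer false calls (higher precision), so the better choice depends on which error matters more for the project.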
Identifying viral contigs II
Filtering contigs by completeness and contamination is crucial to obtain an informative dataset
Tools like Jaeger classify your contigs, enabling you to understand your samples
No wet-lab or dry-lab technique is perfect. Filtering non-viral contigs from your data improves its quality, helping you obtain better results
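Filtering by completeness and contamination can be sketched as a pass over a tab-separated quality table. The column names (`contig_id`, `completeness`, `contamination`), the thresholds and the table contents below are assumptions for illustration, so check them against the actual output of your quality assessment tool.

```python
import csv
import io

# Hypothetical quality table; values are invented for this example
table = (
    "contig_id\tcompleteness\tcontamination\n"
    "contig_1\t95.2\t0.0\n"
    "contig_2\t32.1\t0.0\n"
    "contig_3\t88.0\t25.5\n"
    "contig_4\tNA\t0.0\n"
)


def to_float(value):
    """Convert a table field to float, returning None for missing values."""
    try:
        return float(value)
    except ValueError:
        return None


def filter_contigs(handle, min_completeness=50.0, max_contamination=5.0):
    """Return IDs of contigs that pass both thresholds."""
    kept = []
    for row in csv.DictReader(handle, delimiter="\t"):
        completeness = to_float(row["completeness"])
        contamination = to_float(row["contamination"])
        if completeness is None or contamination is None:
            continue  # no estimate available, so the contig is not kept
        if completeness >= min_completeness and contamination <= max_contamination:
            kept.append(row["contig_id"])
    return kept


kept = filter_contigs(io.StringIO(table))
print(kept)
```

Dropping contigs without an estimate is a design choice of this sketch. Depending on the project, it may be better to keep them and flag them for manual inspection.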
Visualizing distributions
Gene Calling and Functional Annotation I
Genome annotation gives meaning to genomic sequences
ORFs can be predicted from start and stop codons in the genomic sequences
Phages have different genomic features than prokaryotes, which influences the design of algorithms
Tools like Phanotate are very useful for processing a large number of contigs. However, no tool is perfect, so a critical interpretation of the results is important
Functional annotation of viral genomes can give clues about viral lifestyle and host interactions
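The idea of predicting ORFs from start and stop codons can be shown with a deliberately naive sketch. It scans only the forward strand, accepts only ATG as a start codon, and is not the algorithm used by Phanotate or any other gene caller. The function name, the minimum length and the example sequence are made up for illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(sequence, start_codon="ATG", min_length=9):
    """Return (start, end, frame) for forward-strand ORFs, end exclusive."""
    seq = sequence.upper()
    orfs = []
    for frame in range(3):
        start = None
        # Walk through the sequence codon by codon in this reading frame
        for pos in range(frame, len(seq) - 2, 3):
            codon = seq[pos:pos + 3]
            if start is None and codon == start_codon:
                start = pos  # first start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                end = pos + 3  # include the stop codon
                if end - start >= min_length:
                    orfs.append((start, end, frame))
                start = None  # look for the next ORF in this frame
    return orfs


example = "CCATGAAATTTGGGTAACC"  # hypothetical sequence
orfs = find_orfs(example)
for start, end, frame in orfs:
    print(start, end, frame, example[start:end])
```

A real gene caller also has to handle the reverse strand, alternative start codons and overlapping genes, which is part of why phage-specific algorithms differ from prokaryotic ones.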
Gene Calling and Functional Annotation II
Visualizing an annotated viral genome
The pyCirclize package provides tools for plotting genomic data, e.g., the set of genes, in a circular layout
Host Prediction I
Host Prediction II
RaFAH uses a random forest model to predict phage hosts at the genus level
RaFAH returns a probability for each host genus it can predict
BLASTn between the viral and prokaryotic fractions can be used as a quick method for host prediction
Machine learning and classical host prediction methods have a trade-off between precision and recall (sensitivity)
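When a tool reports one probability per host genus, a prediction is usually made by taking the best-scoring genus and applying a score cutoff. The sketch below assumes the probabilities have already been loaded into a dictionary. The function name, the cutoff of 0.5 and all values are invented and do not reflect the output format or recommended thresholds of RaFAH.

```python
def best_host(probabilities, min_score=0.5):
    """Return (genus, score) for the top genus, or (None, score) below the cutoff."""
    genus, score = max(probabilities.items(), key=lambda item: item[1])
    if score < min_score:
        return None, score  # not confident enough to call a host
    return genus, score


# Hypothetical per-genus probabilities for two contigs
predictions = {
    "contig_1": {"Escherichia": 0.82, "Salmonella": 0.10, "Bacteroides": 0.08},
    "contig_2": {"Escherichia": 0.30, "Salmonella": 0.38, "Bacteroides": 0.32},
}

hosts = {contig: best_host(probs) for contig, probs in predictions.items()}
for contig, (genus, score) in hosts.items():
    print(contig, genus, score)
```

Raising `min_score` leaves more contigs without a predicted host but makes the remaining predictions more reliable, while lowering it does the opposite.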
Viral taxonomy and phylogeny I
Viruses have multiple origins, so there is no universal marker gene
Viral taxonomy is based on many different methods
Gene-sharing networks and marker genes are popular methods for bacteriophage taxonomy
Viral taxonomy and phylogeny II
vConTACT3 and geNomad are two programs that assign taxonomy to viral sequences using different strategies
vConTACT3 uses gene-sharing networks, while geNomad uses marker genes
Both methods agree strongly with the ICTV classifications
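The intuition behind a gene-sharing network can be sketched by linking genomes that share a large fraction of their protein clusters. This is a simplified stand-in that uses the Jaccard index as the similarity measure. It is not the scoring used by vConTACT3, and the genome names, protein cluster IDs and threshold are hypothetical.

```python
from itertools import combinations


def gene_sharing_edges(genome_clusters, min_jaccard=0.3):
    """Return (genome_a, genome_b, jaccard) for genome pairs above the threshold."""
    edges = []
    for a, b in combinations(sorted(genome_clusters), 2):
        shared = genome_clusters[a] & genome_clusters[b]
        union = genome_clusters[a] | genome_clusters[b]
        jaccard = len(shared) / len(union) if union else 0.0
        if jaccard >= min_jaccard:
            edges.append((a, b, round(jaccard, 3)))
    return edges


# Hypothetical genomes, each described by the protein clusters (PCs) it encodes
genomes = {
    "phage_A": {"PC1", "PC2", "PC3", "PC4"},
    "phage_B": {"PC1", "PC2", "PC3", "PC5"},
    "phage_C": {"PC6", "PC7"},
}

edges = gene_sharing_edges(genomes)
print(edges)
```

Here `phage_A` and `phage_B` share three of five protein clusters and become connected, while `phage_C` shares none and stays isolated.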
Visualizing viral taxonomy
The Python package NetworkX can be used to work with networks
The visualization of graphs usually requires the computation of positions for the nodes
Many of our contigs are close to Caudoviricetes, but we also have several connected groups of unclassified sequences
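To show what computing node positions means, here is a minimal sketch of a circular layout in plain Python, which places the nodes evenly on a circle. NetworkX ships its own layout functions, such as `spring_layout` and `circular_layout`. The function below is a hand-written illustration with hypothetical node names, not the NetworkX implementation.

```python
import math


def circular_layout(nodes, radius=1.0):
    """Map each node to an (x, y) position evenly spaced on a circle."""
    nodes = list(nodes)
    positions = {}
    for index, node in enumerate(nodes):
        angle = 2 * math.pi * index / len(nodes)
        positions[node] = (radius * math.cos(angle), radius * math.sin(angle))
    return positions


positions = circular_layout(["contig_1", "contig_2", "contig_3", "contig_4"])
for node, (x, y) in positions.items():
    print(node, round(x, 2), round(y, 2))
```

A circular layout ignores the edges entirely. Force-directed layouts instead use the edges to pull connected nodes together, which is why connected groups of contigs appear as visible clusters in the network plot.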
Designing a research project
Working on project
Prepare Presentation
Send Documentation