Course introduction
|
In this course, you will analyze viral sequence data
You need to keep a lab notebook
You need to be able to access the Draco HPC system
|
Introduction to viromics
|
Viromics is the study of viruses, and in our case bacteriophages, using next-generation sequencing technologies
Bacteriophages are diverse and ubiquitous across all biomes
Bacteriophages have a large impact on their environment, including the human gut
|
Getting to know your files
|
Sequence data and information about genes and genomes can come in many different formats
The most common file formats are FASTA (nucleotide and amino acid), FASTQ, SAM and BAM, GenBank, GFF and TSV
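To make the FASTA layout concrete, here is a minimal sketch of a reader written with the Python standard library only. The function name `read_fasta` and the two example records are made up for illustration; in practice a dedicated parser would normally be used.

```python
from io import StringIO

def read_fasta(handle):
    """Yield (header, sequence) tuples from an open FASTA file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            # Sequence lines may be wrapped, so collect them until the next header
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

# Hypothetical two-record file, held in memory for the example
example = StringIO(">contig_1 example\nACGTAC\nGTTA\n>contig_2\nGGCC\n")
records = list(read_fasta(example))
for name, seq in records:
    print(name, len(seq))
```

The same function works on a real file by passing `open("your_file.fasta")` instead of the in-memory example.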
|
Sequencing Quality Control
|
Sequencing quality control is an important first evaluation of our data
NanoPlot can be used to evaluate read quality of long reads
Chopper can be used to filter reads based on quality and/or length
The GC content profile of a metagenome or virome is composed of a mix of bell curves
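The quantities behind these points can be sketched in a few lines of Python. This is not how NanoPlot or Chopper are implemented; the function names and the default cutoffs (`min_length`, `min_quality`) are assumptions chosen for illustration, and quality strings are assumed to use the Phred+33 encoding.

```python
import math

def gc_content(seq):
    """Return the fraction of G and C bases in a sequence (0.0 for empty input)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)

def mean_phred(quality_string, offset=33):
    """Mean read quality, averaged over error probabilities rather than raw scores."""
    if not quality_string:
        return 0.0
    probs = [10 ** (-(ord(ch) - offset) / 10) for ch in quality_string]
    return -10 * math.log10(sum(probs) / len(probs))

def keep_read(seq, qual, min_length=1000, min_quality=10):
    """Hypothetical filter: keep reads that are long enough and good enough."""
    return len(seq) >= min_length and mean_phred(qual) >= min_quality

# Hypothetical reads: one long high-quality read, one very short read
print(gc_content("ATGC"))
print(keep_read("ACGT" * 300, "I" * 1200))
print(keep_read("ACGT", "IIII"))
```

Computing `gc_content` for every read and plotting the values as a histogram gives the GC content profile of the sample.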
|
Assembly lecture
|
|
Assembly and cross-assembly
|
Flye can be used to assemble long and noisy nanopore reads from metagenomic samples.
Samples can be assembled individually and combined in a cross-assembly
vClust can be used to assess the diversity of sequences in your assembly
|
Assessing assembly quality
|
CheckV assesses the quality of your contigs with regard to viral completeness and contamination
minimap2 aligns long and noisy nanopore reads efficiently to large (meta)genomes
samtools can be used to read, filter, convert and summarize alignments
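As an illustration of what summarizing alignments means, the sketch below counts mapped and unmapped reads directly from SAM text lines using the FLAG field (bit 0x4 for unmapped, 0x100 for secondary, 0x800 for supplementary). It is a simplified stand-in for what `samtools flagstat` reports, not a replacement for it; the function name and the example records are hypothetical.

```python
def summarize_sam(lines):
    """Count primary mapped, unmapped, and secondary/supplementary records."""
    counts = {"mapped": 0, "unmapped": 0, "secondary_or_supplementary": 0}
    for line in lines:
        # Skip empty lines and header lines, which start with "@"
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])  # second column of a SAM record is the bitwise FLAG
        if flag & (0x100 | 0x800):
            counts["secondary_or_supplementary"] += 1
        elif flag & 0x4:
            counts["unmapped"] += 1
        else:
            counts["mapped"] += 1
    return counts

# Hypothetical SAM content: one header line and three alignment records
example_sam = [
    "@SQ\tSN:contig_1\tLN:5000",
    "read1\t0\tcontig_1\t100\t60\t10M\t*\t0\t0\tACGTACGTAC\t*",
    "read2\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\t*",
    "read1\t2048\tcontig_1\t900\t60\t4M6H\t*\t0\t0\tACGT\t*",
]
summary = summarize_sam(example_sam)
print(summary)
```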
|
Visualizing the assembly
|
|
Identifying Viral Contigs I
|
Virus identification tools can use different approaches, such as reference-based searches and machine learning
A benchmark is a comparison between tools and should offer concrete evidence of performance, such as the number of true positives (TP), false negatives (FN), false positives (FP) and true negatives (TN)
When choosing a tool, you should consider the priorities of the project. For example, if you need to identify as many viruses as possible and false positives are not critical, a tool with high recall is ideal
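The standard metrics derived from these four counts can be computed as in the sketch below. The counts in the example are made up and do not come from any real benchmark.

```python
def benchmark_metrics(tp, fp, fn, tn):
    """Compute common classification metrics from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0    # how many calls are correct
    recall = tp / (tp + fn) if (tp + fn) else 0.0       # how many viruses are found
    specificity = tn / (tn + fp) if (tn + fp) else 0.0  # how many non-viruses are rejected
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
    }

# Hypothetical counts for one tool on a labelled test set
metrics = benchmark_metrics(tp=80, fp=20, fn=20, tn=880)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

A project that must not miss viruses would rank tools by recall, while a project that cannot tolerate false positives would rank them by precision.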
|
Identifying viral contigs II
|
Filtering contigs by completeness and contamination is crucial to obtain an informative dataset
Tools like Jaeger classify your contigs, enabling you to understand your samples
No wet-lab or dry-lab technique is perfect. Filtering non-viral contigs from your data improves its quality, helping you obtain better results
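Filtering by completeness and contamination can be sketched as follows. The column names (`contig_id`, `completeness`, `contamination`) are assumed to match a CheckV-style quality summary table, and both thresholds are hypothetical values that should be adapted to the project.

```python
import csv
from io import StringIO

def filter_contigs(handle, min_completeness=50.0, max_contamination=5.0):
    """Return IDs of contigs that pass both quality thresholds."""
    kept = []
    for row in csv.DictReader(handle, delimiter="\t"):
        try:
            completeness = float(row["completeness"])
            contamination = float(row["contamination"])
        except ValueError:
            # Non-numeric entries such as "NA" cannot be evaluated, so skip them
            continue
        if completeness >= min_completeness and contamination <= max_contamination:
            kept.append(row["contig_id"])
    return kept

# Hypothetical quality table, held in memory for the example
table = StringIO(
    "contig_id\tcompleteness\tcontamination\n"
    "contig_1\t95.2\t0.0\n"
    "contig_2\t12.5\t0.0\n"
    "contig_3\tNA\t0.0\n"
    "contig_4\t80.0\t35.0\n"
)
kept = filter_contigs(table)
print(kept)
```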
|
Visualizing distributions
|
|
Gene Calling and Functional Annotation I
|
Genome annotation gives meaning to genomic sequences
ORFs can be predicted from start and stop codons in the genomic sequences
Phages have different genomic features than prokaryotes, which influences the design of algorithms
Tools like Phanotate are very useful for processing a large number of contigs. However, no tool is perfect, so a critical interpretation of the results is important
Functional annotation of viral genomes can give clues about viral lifestyle and host interactions
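The idea of predicting ORFs from start and stop codons can be sketched with a naive scanner. This is a simplification for illustration, not the algorithm used by Phanotate or any other gene caller: it only considers ATG as a start codon (alternative starts such as GTG and TTG also occur in bacteria and phages), and the `min_length` cutoff is a hypothetical value.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODONS = {"ATG"}  # assumption: alternative start codons are ignored here

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]

def find_orfs(seq, min_length=30):
    """Return (frame, start, end) tuples for ORFs on the given strand.

    Coordinates are 0-based, end-exclusive, and include the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon in START_CODONS:
                start = i  # first start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if i + 3 - start >= min_length:
                    orfs.append((frame, start, i + 3))
                start = None  # look for the next ORF in this frame
    return orfs

# Hypothetical sequence with a single ORF in the third reading frame
example_seq = "CC" + "ATG" + "AAA" * 10 + "TAA" + "GG"
forward_orfs = find_orfs(example_seq)
reverse_orfs = find_orfs(reverse_complement(example_seq))
print(forward_orfs, reverse_orfs)
```

Running the scanner on both the sequence and its reverse complement covers all six reading frames.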
|
Gene Calling and Functional Annotation II
|
|
Visualizing an annotated viral genome
|
The pyCirclize package provides tools for plotting genomic data, e.g., the set of genes, in a circular layout.
|
Host Prediction I
|
|
Host Prediction II
|
RaFAH uses a random forest model to predict phage hosts at the genus level
RaFAH returns a probability for each host genus it can predict.
Blastn between the viral and prokaryotic fractions can be used as a quick method for host prediction
Machine learning and classical host prediction methods trade off precision against recall (sensitivity)
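Turning per-genus probabilities into a host call can be sketched as below. The nested dictionary is a made-up stand-in for a probability table and does not reproduce RaFAH's actual output format; the cutoff `min_probability` is likewise a hypothetical value.

```python
def call_hosts(probabilities, min_probability=0.5):
    """Pick the most probable host genus per contig, or None below the cutoff."""
    calls = {}
    for contig, genus_probs in probabilities.items():
        if not genus_probs:
            calls[contig] = None
            continue
        genus, prob = max(genus_probs.items(), key=lambda item: item[1])
        calls[contig] = genus if prob >= min_probability else None
    return calls

# Hypothetical probabilities for two contigs and two candidate host genera
probabilities = {
    "contig_1": {"Bacteroides": 0.82, "Escherichia": 0.10},
    "contig_2": {"Bacteroides": 0.35, "Escherichia": 0.40},
}
strict_calls = call_hosts(probabilities, min_probability=0.5)
lenient_calls = call_hosts(probabilities, min_probability=0.3)
print(strict_calls)
print(lenient_calls)
```

A stricter cutoff leaves more contigs without a prediction but keeps only the confident calls, while a lenient cutoff assigns more hosts at the risk of more wrong ones.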
|
Viral taxonomy and phylogeny I
|
Viruses have multiple origins, so there is no universal marker gene
Viral taxonomy is based on many different methods
Gene-sharing networks and marker genes are popular methods for bacteriophage taxonomy
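The principle of a gene-sharing network can be illustrated with a Jaccard index on shared protein clusters. This is a deliberately simplified stand-in: dedicated tools use their own scoring schemes, and the genome names, protein cluster labels and `min_jaccard` cutoff below are all hypothetical.

```python
from itertools import combinations

def gene_sharing_edges(genome_clusters, min_jaccard=0.3):
    """Return (genome_a, genome_b, score) for genome pairs sharing enough protein clusters."""
    edges = []
    for a, b in combinations(sorted(genome_clusters), 2):
        shared = genome_clusters[a] & genome_clusters[b]
        union = genome_clusters[a] | genome_clusters[b]
        score = len(shared) / len(union) if union else 0.0
        if score >= min_jaccard:
            edges.append((a, b, round(score, 3)))
    return edges

# Hypothetical genomes, each described by the set of protein clusters it encodes
genome_clusters = {
    "phage_A": {"PC1", "PC2", "PC3", "PC4"},
    "phage_B": {"PC2", "PC3", "PC4", "PC5"},
    "phage_C": {"PC9"},
}
edges = gene_sharing_edges(genome_clusters)
print(edges)
```

Genomes connected by many strong edges form groups that can then be compared with established taxa.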
|
Viral taxonomy and phylogeny II
|
vConTACT3 and geNomad are two programs that assign taxonomy to viral sequences using different strategies
vConTACT3 uses gene-sharing networks, while geNomad uses marker genes
Both methods have strong agreement with the ICTV classifications
|
Visualizing viral taxonomy
|
The Python package NetworkX can be used to work with networks
The visualization of graphs usually requires the computation of positions for the nodes
Many of our contigs are close to Caudoviricetes, but we also have several connected groups of unclassified sequences
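What computing node positions and finding connected groups means can be shown without any plotting library. The sketch below uses only the standard library and a circular layout as the simplest possible case; NetworkX offers ready-made equivalents such as `nx.connected_components`, `nx.circular_layout` and `nx.spring_layout`. The node names and edges are hypothetical.

```python
import math
from collections import deque

def connected_components(edges, nodes=()):
    """Return a list of node sets, one per connected group in the network."""
    adjacency = {node: set() for node in nodes}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    seen, components = set(), []
    for node in sorted(adjacency):
        if node in seen:
            continue
        # Breadth-first search collects everything reachable from this node
        queue, component = deque([node]), set()
        seen.add(node)
        while queue:
            current = queue.popleft()
            component.add(current)
            for neighbor in adjacency[current]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        components.append(component)
    return components

def circular_layout(nodes, radius=1.0):
    """Place nodes evenly on a circle and return {node: (x, y)}."""
    nodes = list(nodes)
    positions = {}
    for i, node in enumerate(nodes):
        angle = 2 * math.pi * i / len(nodes)
        positions[node] = (radius * math.cos(angle), radius * math.sin(angle))
    return positions

# Hypothetical network: one contig linked to a reference phage, two contigs linked to each other
edges = [("contig_1", "ref_phage"), ("contig_2", "contig_3")]
components = connected_components(edges)
positions = circular_layout(sorted({node for edge in edges for node in edge}))
print(components)
print(positions)
```

The resulting coordinates are what a drawing function needs in order to place each node on the canvas.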
|
Designing a research project
|
|
Working on project
|
|
Prepare Presentation
|
|
Presentation
|
|
Send Documentation
|
|