Viromics2024: Glossary

Key Points

Course introduction
  • In this course, you will analyze viral sequence data

  • You need to keep a lab notebook

  • You need to be able to access the HPC Draco

Introduction to viromics
  • Viromics is the study of viruses, and in our case bacteriophages, using next-generation sequencing technologies

  • Bacteriophages are diverse and ubiquitous across all biomes

  • Bacteriophages have a large impact on their environments, including the human gut

Getting to know your files
  • Sequence data and information about genes and genomes can come in many different formats

  • The most common file formats are FASTA (nucleotide and amino acid), FASTQ, SAM and BAM, GenBank, GFF and TSV
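
As an illustration of how simple the FASTA format is, the following minimal sketch reads records from a text handle. The function name `read_fasta` and the two example records are hypothetical; in practice an established parser is usually the safer choice.

```python
from io import StringIO


def read_fasta(handle):
    """Yield (header, sequence) tuples from a FASTA-formatted text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)  # sequences may span several lines
    if header is not None:
        yield header, "".join(chunks)


# Hypothetical two-record input; a real file would be opened with open(path).
example = StringIO(">contig_1 example\nACGTAC\nGGA\n>contig_2\nTTGA\n")
records = list(read_fasta(example))
for name, seq in records:
    print(name, len(seq))
```

The same function works for nucleotide and amino acid FASTA files, because the format only distinguishes header lines (starting with `>`) from sequence lines.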

Sequencing Quality Control
  • Sequencing quality control is an important first evaluation of our data

  • NanoPlot can be used to evaluate read quality of long reads

  • Chopper can be used to filter reads based on quality and/or length

  • The GC content profile of a metagenome or virome is composed of a mix of bell curves
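
The ideas behind read filtering and GC profiles can be sketched in a few lines of Python. This is not how NanoPlot or Chopper are implemented; the function names and the default thresholds (`min_len=1000`, `min_q=10`) are assumptions chosen for illustration, and the quality string is assumed to use the Phred+33 encoding.

```python
import math


def gc_content(seq):
    """Fraction of G and C bases in a sequence (0.0 for an empty sequence)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def mean_quality(qual, offset=33):
    """Mean Phred quality of a non-empty quality string.

    The scores are converted to error probabilities before averaging,
    so a few very poor bases are not hidden by many good ones.
    """
    probs = [10 ** (-(ord(c) - offset) / 10) for c in qual]
    return -10 * math.log10(sum(probs) / len(probs))


def keep_read(seq, qual, min_len=1000, min_q=10):
    """Keep a read only if it is long enough and of sufficient mean quality."""
    # The length check runs first, so empty reads never reach mean_quality.
    return len(seq) >= min_len and mean_quality(qual) >= min_q


# Hypothetical read: 1200 bases, every base with Phred quality 20 ("5").
read_seq = "ACGT" * 300
read_qual = "5" * 1200
print(gc_content(read_seq), keep_read(read_seq, read_qual))
```

Plotting a histogram of `gc_content` over all reads or contigs gives the GC profile; each genome in the sample contributes roughly one bell curve to it.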

Assembly lecture
  • Sequence assembly can be used to assemble genomes from reads

  • Metagenome assembly generally yields shorter contigs than genome assembly

Assembly and cross-assembly
  • Flye can be used to assemble long and noisy nanopore reads from metagenomic samples

  • Samples can be assembled individually and combined in a cross-assembly

  • vClust can be used to assess the diversity of sequences in your assembly

Assessing assembly quality
  • CheckV assesses the quality of your contigs with regard to viral completeness and contamination

  • minimap2 aligns long and noisy nanopore reads efficiently to large (meta)genomes

  • samtools can be used to read, filter, convert and summarize alignments

Visualizing the assembly
  • Bandage can visualize the de Bruijn graph of an assembly

  • JBrowse2 can visualize genomic data like alignments and coverage

Identifying Viral Contigs I
  • Virus identification tools can use different approaches, such as reference-based searches and machine learning

  • A benchmark is a comparison between tools and should offer concrete evidence of performance, such as the number of true positives (TP), false negatives (FN), false positives (FP) and true negatives (TN)

  • When choosing a tool, you should consider the priorities of the project. For example, if you need to identify as many viruses as possible and false positives are not critical, a tool with high recall is ideal
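
The standard metrics derived from TP, FP, FN and TN counts can be computed as in the sketch below. The function name `benchmark_metrics` and the example counts are hypothetical and do not come from any real benchmark.

```python
def benchmark_metrics(tp, fp, fn, tn):
    """Compute common classification metrics from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0    # how many calls are correct
    recall = tp / (tp + fn) if (tp + fn) else 0.0       # how many viruses are found
    specificity = tn / (tn + fp) if (tn + fp) else 0.0  # how many non-viruses are rejected
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
    }


# Hypothetical counts for one tool on a labelled test set of 1000 contigs.
metrics = benchmark_metrics(tp=80, fp=20, fn=20, tn=880)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

A project that must not miss viruses would rank tools by recall, whereas a project that cannot tolerate false positives would rank them by precision.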

Identifying viral contigs II
  • Filtering contigs by completeness and contamination is crucial to obtain an informative dataset

  • Tools like Jaeger classify your contigs, enabling you to understand your samples

  • No wet-lab or dry-lab technique is perfect. Filtering non-viral contigs from your data improves its quality, helping you obtain better results
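
Filtering by completeness and contamination amounts to a simple pass over a results table. The sketch below assumes a tab-separated table with columns named `contig_id`, `completeness` and `contamination`; these column names, the example rows and the thresholds are assumptions for illustration and should be checked against the actual output of your tool.

```python
import csv
from io import StringIO

# Hypothetical quality table; "NA" marks a contig without a completeness estimate.
table = (
    "contig_id\tcompleteness\tcontamination\n"
    "c1\t95.2\t0.0\n"
    "c2\t12.5\t0.0\n"
    "c3\t88.0\t35.0\n"
    "c4\tNA\t0.0\n"
)


def filter_contigs(handle, min_completeness=50.0, max_contamination=10.0):
    """Return the IDs of contigs that pass both thresholds."""
    kept = []
    for row in csv.DictReader(handle, delimiter="\t"):
        try:
            completeness = float(row["completeness"])
            contamination = float(row["contamination"])
        except ValueError:
            continue  # skip rows with non-numeric entries such as "NA"
        if completeness >= min_completeness and contamination <= max_contamination:
            kept.append(row["contig_id"])
    return kept


kept_ids = filter_contigs(StringIO(table))
print(kept_ids)
```

Stricter thresholds give a cleaner but smaller dataset, so the cutoffs should follow from the research question.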

Visualizing distributions
  • Matplotlib and its pyplot interface provide multiple tools for the visualization of data points

Gene Calling and Functional Annotation I
  • Genome annotation gives meaning to genomic sequences

  • ORFs can be predicted from start and stop codons in the genomic sequences

  • Phages have different genomic features than prokaryotes, which influences the design of algorithms

  • Tools like Phanotate are very useful to process a large number of contigs. However, no tool is perfect, so a critical interpretation of the results is important

  • Functional annotation of viral genomes can give clues about viral lifestyle and host interactions
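
The idea of predicting ORFs from start and stop codons can be sketched as a naive scan. This is a deliberately simple illustration and not the algorithm used by Phanotate or any other gene caller: it only looks at the forward strand, only accepts `ATG` as a start codon, and the name `find_orfs` and the `min_len` default are assumptions.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODONS = {"ATG"}  # assumption: alternative start codons are ignored here


def find_orfs(seq, min_len=30):
    """Return (start, end, frame) tuples for forward-strand ORFs.

    Coordinates are 0-based, `end` is exclusive and includes the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon in START_CODONS:
                start = i  # open an ORF at the first start codon
            elif start is not None and codon in STOP_CODONS:
                if i + 3 - start >= min_len:
                    orfs.append((start, i + 3, frame))
                start = None  # close the ORF and look for the next start
    return orfs


# Hypothetical toy sequence: CC ATG AAA TTT GGG TAA CC
toy = "CCATGAAATTTGGGTAACC"
print(find_orfs(toy, min_len=9))
```

Real gene callers also scan the reverse strand, allow alternative start codons and score candidate ORFs, which is where phage-specific features such as short, densely packed genes influence the design.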

Gene Calling and Functional Annotation II
  • Functional annotation of viral genomes can give clues about viral lifestyle and host interactions

Host Prediction I
Host Prediction II
  • RaFAH uses a random forest model to predict hosts to the genus level for phages

  • RaFAH returns a probability for each host genus it can predict

  • Blastn between the viral and prokaryotic fractions can be used as a quick method for host prediction

  • Machine learning and classical host prediction methods have a trade-off between precision and recall
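
Turning per-genus probabilities into a single host call usually involves picking the best-scoring genus and applying a cutoff. The sketch below shows that step only; the function name `best_host`, the scores and the cutoff values are hypothetical and do not reflect RaFAH's actual output format or recommended thresholds.

```python
def best_host(probabilities, min_score=0.5):
    """Return (genus, score) for the top-scoring genus, or None below the cutoff."""
    if not probabilities:
        return None
    genus, score = max(probabilities.items(), key=lambda item: item[1])
    return (genus, score) if score >= min_score else None


# Hypothetical scores for one viral contig.
scores = {"Escherichia": 0.72, "Salmonella": 0.18, "Bacteroides": 0.10}

print(best_host(scores))                 # lenient cutoff: a host is reported
print(best_host(scores, min_score=0.9))  # strict cutoff: no confident call
```

Raising the cutoff leaves more contigs without a prediction but makes the remaining predictions more reliable.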

Viral taxonomy and phylogeny I
  • Viruses have multiple origins, so there is no universal marker gene

  • Viral taxonomy is based on many different methods

  • Gene-sharing networks and marker genes are popular methods for bacteriophage taxonomy

Viral taxonomy and phylogeny II
  • vConTACT3 is used to assign viral sequences to taxa based on the number of shared genes

  • Marker genes such as the terminase large subunit (terL) can also be used to judge how related viruses are and, in some cases, to classify them at lower taxonomic ranks
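
The principle of a gene-sharing network can be illustrated with a Jaccard similarity between the sets of protein clusters found in each genome. This is a simplified sketch and not the method implemented in vConTACT3; the genome names, the protein cluster labels (`PC1`, `PC2`, ...) and the `min_similarity` cutoff are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Shared items divided by all items found in either set."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def gene_sharing_edges(genomes, min_similarity=0.3):
    """Return (genome_a, genome_b, similarity) for sufficiently similar pairs."""
    edges = []
    for name_a, name_b in combinations(sorted(genomes), 2):
        similarity = jaccard(genomes[name_a], genomes[name_b])
        if similarity >= min_similarity:
            edges.append((name_a, name_b, similarity))
    return edges


# Hypothetical genomes described by the protein clusters they encode.
genomes = {
    "phage_A": {"PC1", "PC2", "PC3", "PC4"},
    "phage_B": {"PC2", "PC3", "PC4", "PC5"},
    "phage_C": {"PC9"},
}
edges = gene_sharing_edges(genomes)
print(edges)
```

Genomes connected by an edge share a substantial fraction of their genes, while genomes such as `phage_C` remain isolated nodes in the network.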

Visualizing viral taxonomy
  • The Python package NetworkX can be used to work with networks

  • The visualization of graphs usually requires the computation of positions for the nodes

  • Many of our contigs are close to Caudoviricetes, but we also have several connected groups of unclassified sequences
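
Connected groups in a network can be found with a breadth-first search, as sketched below in plain Python. NetworkX provides this as `networkx.connected_components`; the standalone version here only illustrates the idea, and the contig names and edges are hypothetical.

```python
from collections import deque


def connected_components(edges, nodes=()):
    """Group nodes of an undirected graph into connected components."""
    adjacency = {node: set() for node in nodes}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)

    seen, components = set(), []
    for start in sorted(adjacency):
        if start in seen:
            continue
        queue, component = deque([start]), set()
        seen.add(start)
        while queue:
            node = queue.popleft()
            component.add(node)
            # Visit every neighbour that has not been reached yet.
            for neighbour in adjacency[node] - seen:
                seen.add(neighbour)
                queue.append(neighbour)
        components.append(component)
    return components


# Hypothetical network: two connected groups and one isolated contig.
example_edges = [("c1", "c2"), ("c2", "c3"), ("c4", "c5")]
groups = connected_components(example_edges, nodes=["c6"])
print(groups)
```

Components that contain no reference genome correspond to the connected groups of unclassified sequences.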

Designing a research project
Working on project
Prepare Presentation
Presentation
Send Documentation

Glossary

FIXME