Viromics2024: Glossary

Key Points

Course introduction
  • In this course, you will analyze viral sequence data

  • You need to keep a lab notebook

  • You need to be able to access the HPC Draco

Introduction to viromics
  • Viromics is the study of viruses, and in our case bacteriophages, using next-generation sequencing technologies

  • Bacteriophages are diverse and ubiquitous across all biomes

  • Bacteriophages have a large impact on their environment, including the human gut

Getting to know your files
  • Sequence data and information about genes and genomes can come in many different formats

  • The most common file formats are FASTA (nucleotide and amino acid), FASTQ, SAM and BAM, GenBank, GFF and TSV
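
  To make the FASTA format concrete, here is a minimal sketch of a reader written with the Python standard library only. The function name read_fasta and the example records are illustrative assumptions, not part of the course material; in practice a dedicated library is usually a better choice.

```python
from io import StringIO


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            # Sequence lines may be wrapped, so collect them until the next header.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Hypothetical two-record example held in memory instead of a file.
example = StringIO(">contig_1 example\nATGC\nGGTA\n>contig_2\nTTAA\n")
records = list(read_fasta(example))
```

  Each record is returned as a tuple, with wrapped sequence lines joined into a single string.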

Sequencing Quality Control
  • Sequencing quality control is an important first evaluation of our data

  • NanoPlot can be used to evaluate read quality of long reads

  • Chopper can be used to filter reads based on quality and/or length

  • The GC content profile of a metagenome or virome is composed of a mix of bell curves
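
  The ideas behind these checks can be sketched in a few lines of Python. This is only an illustration of GC content and of filtering reads on length and quality; it is not how NanoPlot or Chopper are implemented, and the function names and default thresholds are assumptions chosen for the example.

```python
import math


def gc_content(seq):
    """Fraction of G and C bases in a sequence (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def mean_quality(quality_string, offset=33):
    """Mean Phred quality of a read, averaged over error probabilities.

    Assumes the common Phred+33 encoding of FASTQ quality strings.
    """
    if not quality_string:
        return 0.0
    probs = [10 ** (-(ord(ch) - offset) / 10) for ch in quality_string]
    return -10 * math.log10(sum(probs) / len(probs))


def keep_read(seq, qual, min_length=1000, min_quality=10):
    """Keep a read only if it is long enough and of sufficient mean quality.

    The thresholds are hypothetical defaults for this sketch.
    """
    return len(seq) >= min_length and mean_quality(qual) >= min_quality
```

  Computing gc_content for every read or contig and plotting the values as a histogram gives the GC content profile mentioned above.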

Assembly lecture
  • Sequence assembly can be used to assemble genomes from reads

  • Metagenome assembly generally yields shorter contigs than genome assembly

Assembly and cross-assembly
  • Flye can be used to assemble long and noisy nanopore reads from metagenomic samples.

  • Samples can be assembled individually and combined in a cross-assembly

  • vClust can be used to assess the diversity of sequences in your assembly

Assessing assembly quality
  • CheckV assesses the quality of your contigs with regard to viral completeness and contamination

  • minimap2 aligns long and noisy nanopore reads efficiently to large (meta)genomes

  • samtools can be used to read, filter, convert and summarize alignments

Visualizing the assembly
  • Bandage can visualize the de Bruijn graph

  • JBrowse2 can visualize genomic data like alignments and coverage

Identifying Viral Contigs I
  • Different approaches can be used in a virus identification tool, such as reference-based and machine learning

  • A benchmark is a comparison between tools and should offer concrete evidence of performance, such as the number of true positives (TP), false negatives (FN), false positives (FP) and true negatives (TN)

  • When choosing a tool, you should consider the priorities of the project. For example, if you need to identify as many viruses as possible and false positives are not critical, a tool with high recall is ideal
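
  As a small worked example of how benchmark counts translate into performance metrics, the sketch below computes precision and recall from a set of labelled contigs. The contig names, labels and function names are made up for illustration.

```python
def confusion_counts(truth, predicted_viral):
    """Count TP, FP, FN and TN.

    truth maps contig name -> True if the contig really is viral.
    predicted_viral is the set of contigs a tool called viral.
    """
    tp = fp = fn = tn = 0
    for contig, is_viral in truth.items():
        called = contig in predicted_viral
        if is_viral and called:
            tp += 1
        elif is_viral and not called:
            fn += 1
        elif not is_viral and called:
            fp += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def precision(tp, fp):
    """Fraction of viral calls that are correct."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Fraction of true viruses that were found."""
    return tp / (tp + fn) if (tp + fn) else 0.0


# Hypothetical benchmark: three true viruses, two non-viral contigs.
truth = {"c1": True, "c2": True, "c3": False, "c4": False, "c5": True}
predicted_viral = {"c1", "c2", "c3"}
tp, fp, fn, tn = confusion_counts(truth, predicted_viral)
```

  A tool tuned for high recall keeps fn low, usually at the cost of a higher fp count.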

Identifying viral contigs II
  • Filtering contigs by completeness and contamination is crucial to obtain an informative dataset

  • Tools like Jaeger classify your contigs, enabling you to understand your samples

  • No wet-lab or dry-lab technique is perfect. Filtering non-viral contigs from your data improves its quality, helping you obtain better results
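
  Filtering by completeness and contamination could look roughly like the sketch below. It assumes a tab-separated summary table with the columns contig_id, completeness and contamination; these column names, the example values and the thresholds are assumptions for illustration and should be checked against the actual output of the tool you use.

```python
import csv
from io import StringIO


def to_float(value):
    """Convert a table cell to float, returning None for missing values such as 'NA'."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def filter_contigs(handle, min_completeness=50.0, max_contamination=5.0):
    """Return the IDs of contigs that pass both thresholds.

    Rows without a numeric completeness or contamination value are dropped.
    """
    kept = []
    for row in csv.DictReader(handle, delimiter="\t"):
        completeness = to_float(row.get("completeness"))
        contamination = to_float(row.get("contamination"))
        if completeness is None or contamination is None:
            continue
        if completeness >= min_completeness and contamination <= max_contamination:
            kept.append(row["contig_id"])
    return kept


# Hypothetical summary table held in memory instead of a file.
table = StringIO(
    "contig_id\tcompleteness\tcontamination\n"
    "contig_1\t95.2\t0.0\n"
    "contig_2\t30.1\t0.0\n"
    "contig_3\t88.0\t12.5\n"
    "contig_4\tNA\t0.0\n"
)
kept = filter_contigs(table)
```

  Stricter thresholds give a cleaner but smaller dataset, so the cut-offs should follow from the research question.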

Visualizing distributions
  • matplotlib and pyplot provide multiple tools for the visualization of data points

Gene Calling and Functional Annotation I
  • Genome annotation gives meaning to genomic sequences

  • ORFs can be predicted from start and stop codons in the genomic sequences

  • Phages have different genomic features than prokaryotes, which influences the design of algorithms

  • Tools like Phanotate are very useful for processing a large number of contigs. However, no tool is perfect, so a critical interpretation of the results is important

  • Functional annotation of viral genomes can give clues about viral lifestyle and host interactions
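
  The idea of predicting ORFs from start and stop codons can be illustrated with a naive scanner. This is a simplified sketch, not the algorithm used by Phanotate or any other gene caller: it only looks at the forward strand, uses ATG as the default start codon, and the function name and minimum length are assumptions for the example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, start_codons=("ATG",), min_length=30):
    """Return (start, end, frame) tuples for ORFs on the forward strand.

    Coordinates are 0-based, end is exclusive and includes the stop codon.
    Only the first start codon before each stop codon is used.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(seq) - 2, 3):
            codon = seq[pos:pos + 3]
            if start is None and codon in start_codons:
                start = pos  # open a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                end = pos + 3
                if end - start >= min_length:
                    orfs.append((start, end, frame))
                start = None  # look for the next ORF in this frame
    return orfs


# Toy sequence with a single short ORF in frame 2.
toy = "CC" + "ATG" + "AAA" * 3 + "TAA" + "GG"
orfs = find_orfs(toy, min_length=15)
```

  Passing additional start codons via start_codons and scanning the reverse complement as well would bring the sketch closer to what real gene callers have to consider.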

Gene Calling and Functional Annotation II
  • Functional annotation of viral genomes can give clues about viral lifestyle and host interactions

Visualizing an annotated viral genome
  • The pyCirclize package provides tools for plotting genomic data, e.g., the set of genes, in a circular layout.

Host Prediction I
Host Prediction II
  • RaFAH uses a random forest model to predict hosts to the genus level for phages

  • RaFAH returns a probability for each host genus it can predict.

  • Blastn between the viral and prokaryotic fractions can be used as a quick method for host prediction

  • Machine learning and classical host prediction methods have a trade-off between precision and recall
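
  How a probability cut-off affects the number of host predictions can be sketched as follows. The per-genus probabilities, contig names and the predict_host function are hypothetical and only mimic the kind of per-genus scores described above; they are not real RaFAH output.

```python
def predict_host(probabilities, min_probability=0.5):
    """Return the most probable host genus, or None if its score is below the cut-off."""
    if not probabilities:
        return None
    genus, score = max(probabilities.items(), key=lambda item: item[1])
    return genus if score >= min_probability else None


# Hypothetical per-genus probabilities for two contigs.
scores = {
    "contig_1": {"Escherichia": 0.82, "Salmonella": 0.10},
    "contig_2": {"Bacteroides": 0.35, "Prevotella": 0.30},
}

# A strict cut-off yields fewer but more confident predictions.
strict = {contig: predict_host(p, min_probability=0.5) for contig, p in scores.items()}
# A lenient cut-off assigns more hosts at the risk of more wrong calls.
lenient = {contig: predict_host(p, min_probability=0.3) for contig, p in scores.items()}
```

  Comparing strict and lenient shows the trade-off directly: contig_2 only receives a host when the cut-off is lowered.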

Viral taxonomy and phylogeny I
  • Viruses have multiple origins, so there is no universal marker gene

  • Viral taxonomy is based on many different methods

  • Gene-sharing networks and marker genes are popular methods for bacteriophage taxonomy

Viral taxonomy and phylogeny II
  • vConTACT3 and geNomad are two programs that assign viral taxonomy to sequences using different strategies

  • vConTACT3 uses gene-sharing networks, while geNomad uses marker genes

  • Both methods have strong agreement with the ICTV classifications

Visualizing viral taxonomy
  • The Python package NetworkX can be used to work with networks

  • The visualization of graphs usually requires the computation of positions for the nodes

  • Many of our contigs are close to Caudoviricetes, but we also have several connected groups of unclassified sequences
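
  Finding connected groups in a network can be sketched with a breadth-first search. The version below uses only the Python standard library as a stand-in for what a graph package such as NetworkX offers; the edges, node names and the "ref_" prefix for classified reference genomes are assumptions made up for this example.

```python
from collections import deque


def connected_components(edges, nodes=()):
    """Return a list of node sets, one per connected component of an undirected graph."""
    adjacency = {node: set() for node in nodes}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)

    seen, components = set(), []
    for node in adjacency:
        if node in seen:
            continue
        # Breadth-first search collects everything reachable from this node.
        queue, group = deque([node]), set()
        seen.add(node)
        while queue:
            current = queue.popleft()
            group.add(current)
            for neighbor in adjacency[current] - seen:
                seen.add(neighbor)
                queue.append(neighbor)
        components.append(group)
    return components


# Hypothetical gene-sharing network: edges link sequences that share genes.
edges = [
    ("contig_1", "ref_phage_A"),
    ("contig_2", "contig_1"),
    ("contig_3", "contig_4"),
]
components = connected_components(edges, nodes=["contig_5"])

# Groups without any reference genome consist of unclassified sequences only.
unclassified_groups = [
    group for group in components
    if not any(name.startswith("ref_") for name in group)
]
```

  In this toy network, contig_3 and contig_4 form a connected group with no classified reference, and contig_5 is an isolated node.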

Designing a research project
Working on project
Prepare Presentation
Presentation
Send Documentation

Glossary
