Viromics Workshop: Glossary

Key Points

Welcome
Listen to Assembly lecture
The dataset
  • Metagenomics is the culture-independent study of the collection of genomes from different microorganisms present in a complex sample.

  • We call sequences that don’t match any known sequence in the databases "dark matter".

  • FASTA format does not contain sequencing quality information.

  • Next Generation Sequencing data is made of short sequences.
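
To make the FASTA point above concrete, here is a minimal sketch that converts FASTQ records to FASTA. It assumes simple four-line FASTQ records, and the two reads are made up for illustration. The quality string is read but has nowhere to go in the FASTA output.

```python
from typing import Iterator, List, Tuple


def parse_fastq(lines: List[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (read_id, sequence, quality) from four-line FASTQ records."""
    for i in range(0, len(lines) - 3, 4):
        header, sequence, _plus, quality = lines[i:i + 4]
        # Drop the leading "@" of the FASTQ header line.
        yield header[1:].strip(), sequence.strip(), quality.strip()


def to_fasta(records: Iterator[Tuple[str, str, str]]) -> str:
    """Format records as FASTA; the quality string is discarded."""
    return "".join(
        f">{read_id}\n{sequence}\n" for read_id, sequence, _quality in records
    )


# Two made-up reads, only for illustration.
fastq_lines = [
    "@read1", "ACGTACGT", "+", "IIIIHHHH",
    "@read2", "TTGACCAA", "+", "IIIIIIII",
]
fasta_text = to_fasta(parse_fastq(fastq_lines))
print(fasta_text)
```

Going the other way (FASTA to FASTQ) is not possible without inventing quality values, which is why quality filtering has to happen before this conversion.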

Metavirome assembly
  • With sequence assembly we get longer, more meaningful genomic fragments from short sequencing reads.

  • In a cross-assembly, reads coming from the same species in different samples are merged into the same contig.

Lunch break
Visualizing the assembly graph
Assessing assembly quality
Binning contigs
Setup and run DeepVirFinder
  • Different tools have different environments. Keeping them in separate environments makes runs reproducible and prevents a variety of problems.

  • We are running DeepVirFinder during the lecture, because the run takes ~50 minutes.

Listen to Virus Detection lecture
Setup and run PPR-Meta
  • Setting up the right conda environment for a tool can be tricky.

  • PPR-Meta runs much faster than DeepVirFinder.

Setup and Run VirFinder
  • Logistic regression is another type of machine learning that can be used to distinguish between viral and non-viral sequences.

Comparing Virus Identification Tools
  • Despite our data being almost exclusively viral, the tools identify at most two thirds of the sequences as viral.

  • Making the decision boundary less strict will include more sequences, which might seem like an advantage in this case. However, if we were working with a mixed metagenomic dataset, we would then falsely annotate more microbial sequences as viral.
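
The effect of moving the decision boundary can be sketched in a few lines. The contig names, scores and thresholds below are hypothetical and do not come from any of the tools in this workshop. The sketch only shows that a lower threshold always calls at least as many contigs viral.

```python
from typing import Dict, List


def call_viral(scores: Dict[str, float], threshold: float) -> List[str]:
    """Return the contigs whose score reaches the threshold, sorted by name."""
    return sorted(name for name, score in scores.items() if score >= threshold)


# Hypothetical per-contig scores between 0 and 1 (higher = more virus-like).
scores = {"contig_1": 0.97, "contig_2": 0.81, "contig_3": 0.55, "contig_4": 0.12}

strict = call_viral(scores, threshold=0.9)   # strict boundary
lenient = call_viral(scores, threshold=0.5)  # less strict boundary
print(strict)
print(lenient)
```

In a mixed metagenome, some of the contigs gained by the lenient threshold would be microbial, so the extra calls come at the cost of more false positives.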

Setup and run VirSorter
  • VirSorter is a homology-based tool.

  • Because VirSorter has to compare each sequence to a database, it is slower than many other tools.

Lunch Break
  • Nutrition is important

  • Enjoy your meal!

Compare Results for four tools
  • The tools often agree, but they also often disagree, on whether a contig is viral. This is affected to some extent by the length of the contig.

  • Even for current state-of-the-art tools, achieving high sensitivity is hard.

  • Some tools make more similar predictions than others.
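
One simple way to quantify how similar the predictions of two tools are is the Jaccard index of their sets of viral contigs. The sketch below uses placeholder tool names and made-up contig sets. It is one possible comparison, not necessarily the one used in the lesson.

```python
from itertools import combinations
from typing import Dict, Set, Tuple


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Size of the intersection divided by size of the union."""
    if not a and not b:
        return 1.0  # two empty prediction sets agree trivially
    return len(a & b) / len(a | b)


def pairwise_agreement(calls: Dict[str, Set[str]]) -> Dict[Tuple[str, str], float]:
    """Jaccard index for every pair of tools, keyed by (tool, tool)."""
    return {
        (t1, t2): jaccard(calls[t1], calls[t2])
        for t1, t2 in combinations(sorted(calls), 2)
    }


# Made-up sets of contigs called viral by three placeholder tools.
calls = {
    "tool_a": {"c1", "c2", "c3"},
    "tool_b": {"c2", "c3", "c4"},
    "tool_c": {"c1"},
}
agreement = pairwise_agreement(calls)
for pair, value in agreement.items():
    print(pair, round(value, 2))
```

A value of 1 means two tools call exactly the same contigs viral, and 0 means they share none.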

Prophage Prediction
  • Features such as changes in GC content or a sudden enrichment in viral genes indicate the presence of a prophage in a contig or genome.

  • Results of a tool are sometimes distributed across multiple folders. Make sure to check all output files so that you get the most out of your experiment.
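
The GC content signal mentioned above can be illustrated with a sliding window. This is a toy sketch: the sequence, window size and step are arbitrary assumptions, and real prophage predictors combine several signals rather than relying on GC content alone.

```python
from typing import List


def gc_fraction(sequence: str) -> float:
    """Fraction of G and C bases in a sequence (0.0 for an empty sequence)."""
    sequence = sequence.upper()
    if not sequence:
        return 0.0
    return (sequence.count("G") + sequence.count("C")) / len(sequence)


def windowed_gc(sequence: str, window: int, step: int) -> List[float]:
    """GC fraction of each full window along the sequence."""
    return [
        gc_fraction(sequence[start:start + window])
        for start in range(0, len(sequence) - window + 1, step)
    ]


# Toy contig: AT-rich flanks around a GC-rich insert (made up for illustration).
toy_contig = "ATATATAT" + "GCGCGCGC" + "ATATATAT"
profile = windowed_gc(toy_contig, window=8, step=8)
print(profile)
```

A sharp jump in the profile, as in the middle window here, marks a region whose composition differs from its surroundings and is worth a closer look.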

Listen to Benchmarking lecture
Introduction and setting up
Gene prediction
Prodigal modes
Functional annotation
Lunch break
Clustering proteins
Integrating annotations
Inspecting the MSAs
Install R package
  • Start the installation of the R package and continue with the next lesson.

Clustering and taxonomic classification of uncultivated viral genomes
Required data
Homology based search
  • Sometimes you might find a good hit, for example for the crAssphage bins, or for many of the well-described viruses infecting humans. In other cases, we need more sophisticated search strategies to assign a given viral sequence to a previously described taxon.

Clustering viral sequences based on shared proteins
Installing and running Vcontact2
  • We’re running vContact2 over lunch because it takes around an hour to finish.

Lunch break
Check Vcontact2
Phylogeny based on marker genes
Gene sharing networks with Vcontact2
Assessing viral contigs completeness and contamination
Track alpha and beta diversity dynamics of viral/microbial communities
  • Next Generation Sequencing data is compositional and should be analyzed using compositional data analysis methods.

  • Hill numbers are linear and more intuitive than the original alpha diversity indices.

RStudio setup from Conda environment, package installation and data download
Exploring data
  • Plot absolute abundance, relative abundance and centered log-ratio abundance to see how the different abundance measures differ.

  • Picking thresholds for filtering can be tricky. Play with the thresholds to filter data based on your questions and your data.
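
As a rough illustration of the two points above, the sketch below converts counts to relative abundance and filters features by a prevalence threshold. The lesson itself works in R. The table, the feature names and the thresholds here are made-up assumptions chosen only to show the mechanics.

```python
from typing import Dict, List


def relative_abundance(counts: List[int]) -> List[float]:
    """Scale the counts of one sample so that they sum to 1."""
    total = sum(counts)
    if total == 0:
        return [0.0] * len(counts)
    return [count / total for count in counts]


def filter_by_prevalence(
    table: Dict[str, List[int]], min_count: int, min_samples: int
) -> Dict[str, List[int]]:
    """Keep features with at least min_count reads in at least min_samples samples."""
    return {
        feature: counts
        for feature, counts in table.items()
        if sum(count >= min_count for count in counts) >= min_samples
    }


# Made-up count table: one row per feature, one column per sample.
table = {
    "virus_a": [120, 80, 95],
    "virus_b": [3, 0, 1],
    "virus_c": [0, 40, 60],
}
kept = filter_by_prevalence(table, min_count=10, min_samples=2)
first_sample = relative_abundance([counts[0] for counts in table.values()])
print(sorted(kept))
print(first_sample)
```

Changing `min_count` or `min_samples` changes which features survive, which is why it pays to try several thresholds before settling on one.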

Break
Alpha diversity
  • Different alpha diversity indices emphasize different aspects of alpha diversity. Make choices based on your questions and interpret the results based on the methods you chose.

  • Hill numbers are linear and intuitive, while the original alpha diversity index values are not.
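
What "linear" means here can be checked numerically. The sketch below uses the standard definition of the Hill number of order q, with the q = 1 case taken as the exponential of Shannon entropy. The two communities are made up: doubling the number of equally abundant species doubles the Hill number, while Shannon entropy itself does not double.

```python
import math
from typing import List


def hill_number(counts: List[float], q: float) -> float:
    """Hill number (effective number of species) of order q."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must contain at least one positive value")
    proportions = [count / total for count in counts if count > 0]
    if abs(q - 1.0) < 1e-12:
        # Limit for q -> 1: exponential of Shannon entropy.
        return math.exp(-sum(p * math.log(p) for p in proportions))
    return sum(p ** q for p in proportions) ** (1.0 / (1.0 - q))


# Made-up communities with 4 and 8 equally abundant species.
four_species = [10, 10, 10, 10]
eight_species = [10] * 8

for q in (0, 1, 2):
    print(q, hill_number(four_species, q), hill_number(eight_species, q))
```

Order q = 0 counts species (richness), while higher orders give progressively more weight to abundant species.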

Beta diversity
  • Understand the similarities and dissimilarities of different beta diversity/distance matrices.

  • Aitchison distance is the distance between samples or features within simplex space. We use Aitchison distance in compositional data analysis.
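
Aitchison distance is the Euclidean distance between centered log-ratio (clr) transformed samples, which the following sketch computes directly. The default pseudocount of 0.5 used to handle zeros is an assumption made for this example, not a recommendation from the lesson, and the samples are made up.

```python
import math
from typing import List


def clr(counts: List[float], pseudocount: float = 0.5) -> List[float]:
    """Centered log-ratio: log of each value minus the mean log of the sample."""
    logs = [math.log(count + pseudocount) for count in counts]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]


def aitchison_distance(
    sample_a: List[float], sample_b: List[float], pseudocount: float = 0.5
) -> float:
    """Euclidean distance between the clr-transformed samples."""
    clr_a = clr(sample_a, pseudocount)
    clr_b = clr(sample_b, pseudocount)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(clr_a, clr_b)))


# Made-up samples: the second is the first sequenced ten times deeper.
shallow = [1, 2, 4]
deep = [10, 20, 40]
reversed_sample = [4, 2, 1]

same_composition = aitchison_distance(shallow, deep, pseudocount=0.0)
different_composition = aitchison_distance(shallow, reversed_sample, pseudocount=0.0)
print(same_composition, different_composition)
```

Without a pseudocount, two samples with the same proportions are at distance zero regardless of sequencing depth, which is the property that makes this distance suitable for compositional data.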

Differential abundance
  • Use the centered log-ratio (clr) transformation to transform the data.
