Clustering viral sequences based on shared proteins

Overview

Teaching: 0 min
Exercises: 60 min
Questions
  • Can we cluster viral sequences based on shared protein clusters?

Objectives
  • Identify putative crAssphage bins

Another way to classify your genomes is to cluster them based on their gene content. We will first visualize the protein content of the viral bins in a heatmap that reflects the phylogenetic profiles (presence or absence across bins) of the protein clusters. For this we will use the MMseqs2 protein clusters (PCs) that you generated on day 3 from your binned sequences.

First, open a terminal and make a new folder for today’s results to keep things organised:

$ cd /path/of/viromics/folder/
$ mkdir day4
$ cd day4

We provide an R script for this part of the analysis. However, if you are familiar with making heatmaps and clustering data in R (or any other programming language), please feel free to read along and make your own script.

# Download the script from GitHub:
$ wget https://raw.githubusercontent.com/MGXlab/Viromics-Workshop-MGX/gh-pages/code/day4/clustering.R

The tidyverse package should have finished installing by now: go back to RStudio and check. If you have trouble installing tidyverse, please contact one of the assistants.

Create a new project in RStudio (File > New Project > Existing Directory) and select the /path/of/viromics/folder/day4 folder.

Open the clustering.R script you just downloaded. It should be listed in the lower right pane under “Files”.

Question 1: Data exploration.

Follow the R script up to line 18. Make sure to change the file paths so that they point to your own files, and read the comments to understand what each step does.
Take a moment to look at the data you just loaded.

  • What’s in each of the different columns?
  • What information do you still lack to be able to cluster genomes based on protein content?

The RefSeq proteins have unique protein IDs (e.g. YP_009124822.1), but no identifier that tells us which genome they belong to. We will download this information, together with some additional taxonomic metadata, from NCBI Virus.

Go to Find Data > Bacteriophages (A):

Image

Click the Protein tab (B). In the filter panel on the left (C), select both complete and partial under RefSeq genome completeness. Then click “Download” (D):

Image

In the screen that pops up, click Current table view results, CSV format (E):

Image

Download all records:

Image

Select Accession, Species, Genus, Family, Length, Protein, and Accession with version (F):

Image

Finally, click Download. The download might take a while, so as a backup we’ve also included the file in the GitHub repo:

# Download the metadata from GitHub:
$ wget https://raw.githubusercontent.com/MGXlab/Viromics-Workshop-MGX/gh-pages/data/sequences_bac.csv

Question 2: Clustering contigs based on shared protein clusters

Go back to the R script and follow the steps up to line 87.
Look at the heatmap. Are there any core viral genes? Can you find closely related bins?

Solution

We observe:

  • The heatmap is sparse: most PCs occur in only a few bins, which indicates high viral diversity in terms of gene content.
  • No genes are shared between all bins. This suggests there are no core genes.
  • Some bins share some PCs, and might be closely related.
    Example heatmap
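
The idea behind the heatmap can also be sketched on the command line. The example below is not taken from clustering.R; it is a minimal illustration, assuming a two-column table with one line per (bin, PC) pair. The file names, bin names and PC names are all made up. It counts how many PCs each pair of bins shares, which is the signal that makes bins group together in the heatmap.

```shell
# Toy input: one line per (bin, protein cluster) pair, tab-separated.
# All names are invented for illustration.
printf '%s\t%s\n' \
  bin_01 PC_001 \
  bin_01 PC_002 \
  bin_02 PC_001 \
  bin_02 PC_002 \
  bin_02 PC_003 \
  bin_03 PC_003 \
  bin_04 PC_004 > toy_bin_pc.tsv

# sort -u removes duplicate pairs and fixes the order of bins within each PC,
# so every pair of bins is always written the same way round.
sort -u toy_bin_pc.tsv | awk -F'\t' '
  { members[$2] = members[$2] " " $1 }        # collect the bins that contain each PC
  END {
    for (pc in members) {
      n = split(members[pc], b, " ")
      for (i = 1; i <= n; i++)
        for (j = i + 1; j <= n; j++)
          shared[b[i] "\t" b[j]]++            # one more PC shared by this pair of bins
    }
    for (pair in shared) print pair "\t" shared[pair]
  }
' | sort > toy_shared_pcs.tsv

# Columns: first bin, second bin, number of shared PCs.
cat toy_shared_pcs.tsv
```

Bins that never appear in the output (bin_04 in this toy table) share no PCs with any other bin. A PC would count as a core gene only if it occurred in every bin.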

Question 3: How can we assign taxonomy to our bins?

Right now we see a few bins that are perhaps related to each other because they share a few proteins. So we know they are similar, but not what they are. Can you think of a way to annotate these sequences through clustering?

Solution

To annotate our contigs, we need to add sequences with known taxonomy and check if they cluster with our bins.
In other words, we need to include our RefSeq protein clusters and genomes.
Would you add all RefSeq genomes? Why (not)?
Hint: check how many RefSeq genomes are in our protein cluster dataset.
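
For the hint above, a quick count outside R could look like the sketch below. It runs on a toy table, and the column names (Protein_ID, Genome_ID, Genus, Family) are hypothetical; the real metadata file may use different headers. The helper count_distinct is not part of the workshop material. It also splits naively on commas, so it assumes that no field contains a quoted comma.

```shell
# Toy stand-in for the metadata CSV. Column names and all values are invented.
cat > toy_metadata.csv <<'EOF'
Protein_ID,Genome_ID,Genus,Family
prot_1,genome_A,GenusX,FamilyOne
prot_2,genome_A,GenusX,FamilyOne
prot_3,genome_B,GenusY,FamilyOne
prot_4,genome_C,GenusZ,FamilyTwo
EOF

# count_distinct FILE COLUMN
# Print the number of distinct values in the column with the given header name.
count_distinct() {
  awk -F',' -v col="$2" '
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == col) c = i; next }  # locate the column
    c && !seen[$c]++ { n++ }                                          # count first occurrences
    END { print n + 0 }
  ' "$1"
}

count_distinct toy_metadata.csv Genome_ID   # distinct genomes in the toy table
count_distinct toy_metadata.csv Family      # distinct families in the toy table
```

On the real file, first look at the header (for example with head -n 1 sequences_bac.csv) and pass whichever column identifies the genome.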

Next, we are going to add crAssphage genomes and protein clusters from our RefSeq set, to see if we can find any crAssphage-like phages in our assembled bins.

Question 4: Heatmap with crAssphage RefSeq genomes

Follow the steps in the R script until the end.
Look at the new heatmap. Interpret what you see.

Solution

  • We see a couple of different clusters of RefSeq crAssphage genomes.
  • Some of our bins cluster with these crAssphages, so perhaps these sequences are also crAssphages?
    Example heatmap with crAssphages

Challenge: Try to classify more bins by including other RefSeq sequences

See if you can annotate any of the other bins by a similar method.
Include PCs and genomes for another viral genus or species that you expect to be present in the human gut microbiome (search online to find which phages you would expect).
Modify the script to filter the RefSeq sequences for that taxon, and make a heatmap. Can you classify any of the other bins?
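
Before editing the R script, it can help to check on the command line that the taxon you picked actually occurs in the metadata. The sketch below filters a toy table by genus. The column names and values are invented, the helper filter_by_column is hypothetical, and the comma splitting assumes that no field contains a quoted comma.

```shell
# Toy stand-in for the RefSeq metadata. Column names and all values are invented.
cat > toy_refseq.csv <<'EOF'
Protein_ID,Genome_ID,Genus
prot_1,genome_A,GenusX
prot_2,genome_A,GenusX
prot_3,genome_B,GenusY
EOF

# filter_by_column FILE COLUMN VALUE
# Keep the header line plus every row whose COLUMN is exactly VALUE.
filter_by_column() {
  awk -F',' -v col="$2" -v val="$3" '
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == col) c = i; print; next }
    c && $c == val
  ' "$1"
}

filter_by_column toy_refseq.csv Genus GenusX > toy_genusx.csv
cat toy_genusx.csv
```

If only the header comes back, the taxon is not in the table under that exact spelling, and filtering for it in the R script would also return nothing.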

Key Points