Gene prediction

Overview

Teaching: 30 min
Exercises: 15 min
Questions
Objectives

We will use Prodigal, a tool for “prokaryotic gene recognition and translation initiation site identification”, to identify putative protein-coding genes in our phage contigs. The title is quoted on purpose: it hints that the tool was not built specifically for phage gene prediction, but for prokaryotes.

Why would Prodigal still perform well for phages?

Before Prodigal can predict genes it has to be trained, for which it uses several sequence characteristics, among others:

Prodigal algorithm

Have a look at the 2010 paper for a more detailed explanation of the algorithm.

  • What does Prodigal do when there is no so-called “Shine-Dalgarno” motif?
  • Why is verifying the predictions difficult?
  • What start codon(s) does Prodigal use in its search?

Now that we have a rough idea of how Prodigal works, go to the “metagenomes” section in the Prodigal wiki.

  • What would be the best approach for our dataset?

Contig length

The manual states: Isolated, short sequences (<100kbp) such as plasmids, phages, and viruses should generally be analyzed using Anonymous Mode. Note that “Anonymous mode” was previously called “meta” mode, which is why the option is still spelled -p meta. Are our phages shorter than 100 kbp? How long is the longest contig? Use bioawk to find out.

Solution

bioawk -c fastx '{ print $name, length($seq) }' fasta_bins/all_binned_contigs.fasta
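The bioawk command above prints one length per contig, so the longest contig still has to be picked out by eye. As a rough sketch of how that could be automated, the block below computes the lengths with plain awk (in case bioawk is not installed), sorts them, and counts how many contigs exceed a threshold. The toy FASTA file, the contig names, and the threshold of 10 are made up for illustration; on the real data you would point it at fasta_bins/all_binned_contigs.fasta and use 100000.

```shell
# Toy FASTA standing in for fasta_bins/all_binned_contigs.fasta (hypothetical names).
tmp=$(mktemp -d)
cat > "$tmp/toy.fasta" <<'EOF'
>contig_A
ACGTACGT
ACGT
>contig_B
ACGTACGTACGTACGTACGT
>contig_C
ACG
EOF

# Sum the sequence length per record (a sequence may wrap over several lines),
# print "name<TAB>length", then sort numerically on the length, longest first.
lengths=$(awk '
  /^>/ { if (name != "") print name "\t" len; name = substr($1, 2); len = 0; next }
       { len += length($0) }
  END  { if (name != "") print name "\t" len }
' "$tmp/toy.fasta" | sort -k2,2nr)

# The first line after sorting is the longest contig.
longest=$(printf '%s\n' "$lengths" | head -n 1)
echo "$longest"

# Count contigs longer than the threshold (use 100000 for the real data).
threshold=10
n_long=$(printf '%s\n' "$lengths" | awk -v t="$threshold" '$2 > t' | wc -l | tr -d ' ')
echo "$n_long"
```

The same `sort -k2,2nr | head -n 1` step can be appended to the bioawk command, since its output also has the length in the second column.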

Running Prodigal

Let's run Prodigal with -p meta on all our contigs. Make sure we are in the day3 folder, then run the code below.

mkdir prodigal_default
cd prodigal_default
prodigal -i ../fasta_bins/all_binned_contigs.fasta -a proteins.faa -o genes.txt -f sco -p meta
cd ..

  • How many proteins did we predict? (use grep for example)
  • That we did not predict a protein does not mean it is not there. In what cases do you think Prodigal will miss proteins?
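One possible way to approach the counting question is sketched below on a toy file, so that it does not give away the number for the real data. It relies on every predicted protein in the .faa output having its own header line starting with ">", and on Prodigal naming each protein after its contig plus an underscore and a running gene number. The toy headers are shortened (real ones carry more fields after ID=...), and the contig names and sequences are invented.

```shell
tmp=$(mktemp -d)

# Toy protein FASTA imitating Prodigal headers:
# ">contigname_genenumber # start # end # strand # ID=...;..."
cat > "$tmp/proteins.faa" <<'EOF'
>contig_A_1 # 3 # 98 # 1 # ID=1_1;partial=00;start_type=ATG
MKT*
>contig_A_2 # 120 # 300 # -1 # ID=1_2;partial=00;start_type=GTG
MSS*
>contig_B_1 # 10 # 200 # 1 # ID=2_1;partial=00;start_type=ATG
MAA*
EOF

# Total number of predicted proteins: one header line per protein.
n_proteins=$(grep -c '^>' "$tmp/proteins.faa")
echo "$n_proteins"

# Proteins per contig: strip ">" and the trailing "_<gene number>" from the ID,
# then tally the remaining contig names.
per_contig=$(grep '^>' "$tmp/proteins.faa" |
  awk '{ id = substr($1, 2); sub(/_[0-9]+$/, "", id); n[id]++ }
       END { for (c in n) print c "\t" n[c] }' | sort)
echo "$per_contig"
```

For the real run, replace the toy file with prodigal_default/proteins.faa. Contigs that are missing from the per-contig tally had no predicted proteins at all, which is a useful starting point for the second question.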

Key Points