Gene Calling and Functional Annotation II

Overview

Teaching: min
Exercises: 180 min
Objectives
  • Perform gene calling and functional annotation on the viral contigs

Running pharokka

For the functional annotation of our viral contigs, we will be using Pharokka. The “viral contigs” are the contigs we filtered yesterday (i.e. using Jaeger and CheckV)

We first have to load the appropriate conda environment


source /vast/groups/VEO/tools/anaconda3/etc/profile.d/conda.sh

conda activate pharokka_v1.7.1

The base usage for this program follows:


pharokka.py -i <fasta file> -o <output folder> -d <path/to/database_dir> -t <threads>  

See the docs for the pharokka parameters

We have to add the following options:

Therefore your command should look like this:


pharokka.py -i filtered_assembly.fasta -o output_dir -d path/to/database -t 10 -m --meta_hmm -f -g phanotate

We recommend modifying the following SBATCH resources:

Exercise

  1. Run pharokka with the appropriate parameters using an sbatch script

sbatch script for pharokka

#!/bin/bash
#SBATCH --tasks=1
#SBATCH --cpus-per-task=10
#SBATCH --partition=short,standard,interactive
#SBATCH --mem=20G
#SBATCH --time=02:00:00
#SBATCH --job-name=pharokka
#SBATCH --output=./1.4_annotation/10_pharokka/pharokka.slurm.out.%j
#SBATCH --error=./1.4_annotation/10_pharokka/pharokka.slurm.err.%j

source /vast/groups/VEO/tools/anaconda3/etc/profile.d/conda.sh && conda activate pharokka_v1.7.1


database='/work/groups/VEO/databases/pharokka/v1.7.1/'
assembly='./1.3_virus_identification/40_results_filter_contigs/assembly.fasta'
outdir='./1.4_annotation/10_pharokka'

# extra options to use

# -m: run in metavirome mode (each contig is a new virus)
# --meta_hmm : also run PyHMMER on the PHROGS (richer annotations with this)
# -f: force restart the outputs
# -g: gene annotation by phanotate (default for meta mode is prodigal-gv)

pharokka.py -i $assembly -o $outdir -d $database -t 10 -m --meta_hmm -f -g phanotate

Plotting functional annotations

Choose a pet contig (or two)

Exercise - extracting contig information

  1. Choose a pet contig. Maybe your favourite contig number… but we recommend choosing one with high completeness and which is ‘annotation-rich’
    • perhaps with many annotated genes, a crispr region, mash results (from pharokka) etc.
    • Discuss with your classmates so that everybody picks a different virus one. We will continue using this “pet” contig in the future lessons also. a) Write down some information about your contig. e.g. contig length, # of genes, # of annotated genes, gc%, coding density

For the plotting we, we will still use pharokka. Pharokka’s built-in pharokka_plotter.py uses a python library called pyCirclize. You will also use this library later for the visualization homework so stay tuned!

pharokka_plotter.py requires a gff, genbank and fasta file for the viral contig. Therefore, we have to extract these from the larger pharokka.gff and pharokka.gbk outputs.

The .gff file contains information about the gene positions and strands
The .gbk file contains information about the functional annotations of each gene
The fasta file contains the nucleotide sequence

Exercise - extracting contig information

We need 3 pieces of information about your particular contig to use the pharokka plotter

  • contig.gff : extracted from the pharokka.gff
  • contig.gbk : extracted from the pharokka.gbk
  • contig.fasta : nucl. sequence extracted from the filtered assembly
  1. Extract the functional annotation information for your contig from the pharokka.gff and pharokka.gbk files and the contig nucleotide sequence from the fasta file

Gff and fasta files are easy to pull information from so we can just grep or sed the contig directly into an output file

Extract contig.gff

grep "contig_xxxx" pharokka.gff > contig_xxxx.gff

# for example
mkdir -p ./1.4_annotation/10_pharokka/11_selected_contig

# get the contig.gff
grep "contig_827" ./1.4_annotation/10_pharokka/pharokka.gff > ./1.4_annotation/10_pharokka/11_selected_contig/contig_827.gff

Extract contig fasta


# if your fasta file is in multi-line format
# this tells sed to print between "contig_1026" and the next ">", including both patterns
# `head` removes the last line (the second header)
sed -n '/contig_1026/,/>/p' phage_contigs.fasta | head -n -1

# for example:
sed -n '/contig_827/,/>/p' ./1.2_assembly/10_results_assembly_flye/cross_assembly/assembly.fasta | head -n -1  > ./1.4_annotation/10_pharokka/11_selected_contig/contig_827.fasta

Genbank files are a bit more difficult to extract information from but they are formatted sequentially by records. Luckily for us, there are pre-built modules in the biopython library to work with genbank files. SeqRecord and Bio.SeqIO.parse

If you head the pharokka.gbk file, you might see how the records are structured:


LOCUS       contig_1026            35733 bp    DNA     linear   PHG 06-SEP-2024
DEFINITION  contig_1026.
ACCESSION   contig_1026
VERSION     contig_1026
KEYWORDS    .
SOURCE      .
  ORGANISM  .
            .
FEATURES             Location/Qualifiers
     CDS             complement(618..749)
                     /ID="contig_1026_CDS_0001"
                     /transl_table=11
                     /phrog="No_PHROG"
                     /top_hit="No_PHROG"
                     /locus_tag="contig_1026_CDS_0001"
                     /function="unknown function"
                     /product="hypothetical protein"
                     /source="Pyrodigal-gv_0.3.1"
                     /score="3.3"
                     /phase="0"
                     /translation="MTRWNTTLAVYEIWTGTQWQAVASSTYSINYLIVAGGASGAHY*"

See if you can write a script to extract the genbank record for your contig using this module - for clues - see solution below!

The whole contig is called a “record” - see the biopython SeqRecord class page for what each element in the record is called, use Bio.SeqIO.parse to iterate through these records and Bio.SeqIO.write to write out the record you want.

biopython script to extract contig.gbk (genbank)

#!/usr/bin/python3

from Bio import SeqIO
import sys 

# read files in as variables
input_gbk = sys.argv[1]
output_gbk = sys.argv[2]
contig = sys.argv[3]

# parse the genbank and write output
for record in SeqIO.parse(input_gbk, "genbank"):
   # print(record.id)
   if record.id == contig:
       SeqIO.write(record, output_gbk, "genbank")

sbatch script to run your python script

#!/bin/bash
#SBATCH --tasks=1
#SBATCH --cpus-per-task=1
#SBATCH --partition=short,standard,interactive
#SBATCH --mem=2G
#SBATCH --time=00:01:00
#SBATCH --job-name=py_script
#SBATCH --output=./1.4_annotation/10_pharokka/extract_contigs.slurm.out.%j
#SBATCH --error=./1.4_annotation/10_pharokka/extract_contigs.slurm.err.%j

# activate the python virtual environment with the packages we need
source ./py3env/bin/activate

# change the output and contig names accordingly
python3 python_scripts/annotation/extract_contig_gbk.py pharokka.gbk output_contig.gbk contig_xxxx

Plot your contigs!

Once we have our fasta, gff and genbank files, we can use pharokka_plotter.py to plot our pet contig. pharokka_plotter.py requires the same conda environment as pharokka. So, to to activate the environment use the same commands you used for pharokka. Feel free to experiment with the different options given in the manual. The base usage is:

pharokka_plotter.py -i input.fasta -n pharokka_plot --gff pharokka.gff --genbank pharokka.gbk

You no longer need the pharokka database and we recommend modifying the following SBATCH parameters #SBATCH --mem=10G

Exercise - plotting

  1. Plot the genome annotations for your pet contig

Make sure to add your plot into your lab book!

sbatch script for all the extraction steps + pharokka_plotter

#!/bin/bash
#SBATCH --tasks=1
#SBATCH --cpus-per-task=10
#SBATCH --partition=short,standard,interactive
#SBATCH --mem=10G
#SBATCH --time=2:00:00
#SBATCH --job-name=pharokka_1
#SBATCH --output=./1.4_annotation/10_pharokka/pharokka_plotter.slurm.out.%j
#SBATCH --error=./1.4_annotation/10_pharokka/pharokka_plotter.slurm.err.%j

assembly='./1.2_assembly/10_results_assembly_flye/cross_assembly/assembly.fasta'
outdir='./1.4_annotation/10_pharokka/11_selected_contig'

mkdir -p $outdir

# get the contig.gff and contig.fasta
grep "contig_827" ./1.4_annotation/10_pharokka/pharokka.gff > $outdir/contig_827.gff

sed -n '/contig_827/,/>/p' $assembly | head -n -1  > $outdir/contig_827.fasta

# next, we get the contig.gbk using a python script
# change the names accordingly

# activate the python virtual environment with the packages we need
source ./py3env/bin/activate

# the inputs for the following python script should be
# python3 script.py input.gbk output.gbk contig_name
input_gbk='./1.4_annotation/10_pharokka/pharokka.gbk'
output_gbk='./1.4_annotation/10_pharokka/11_selected_contig/contig_827.gbk'
name='contig_827'

python3 python_scripts/annotation/extract_contig_gbk.py $input_gbk $output_gbk $name

# Then we run pharokka plotter on our selected contig

source /vast/groups/VEO/tools/anaconda3/etc/profile.d/conda.sh && conda activate pharokka_v1.7.1

# define all your file paths for readability
outdir='./1.4_annotation/10_pharokka/11_selected_contig/'

fasta='./1.4_annotation/10_pharokka/11_selected_contig/contig_827.fasta'
gff='./1.4_annotation/10_pharokka/11_selected_contig/contig_827.gff'
gbk='./1.4_annotation/10_pharokka/11_selected_contig/contig_827.gbk'

pharokka_plotter.py -i $fasta -n $outdir/pharokka_plot --gff $gff --genbank $gbk

The final plot might look something like this:

pharokka plotter contig_827

Interpreting your contig

With functional annotation we can begin to get an insight into our viruses to piece together information about their lifestyles, hosts interactions etc, what type of genes they have. You’ll find that most genes on viral contigs are not actually annotated - this is because 1) viruses evolve very quickly and assigning gene annotations by homology is difficult, and 2) viruses are highly mosaic - they pick up and lose genes from their hosts and other viruses, meaning not all viruses have all the same genes and viral gene databases do not always account for their variability. The creation of newer tools like phold aid in this by using structural homologs to assign gene annotation.

Note : We have phold installed on Draco, if you would like to experiment with it for your project! For example, you might pick a contig that is NOT well annotated and see how many extra genes you can annotate using structural homology

Exercise - Interpret your contigs

  1. How many structural genes are annotated?
  2. Describe at least 2 different types of structural genes a) Where are they placed in your genome? b) To which part of phage structure do they belong? (capsid, tail etc) c) What other viruses have these structural features?
  3. How many replication genes are annotated?
    • These could be genes related to transcriptional regulation, DNA/RNA metabolism lysis etc
  4. Describe the role of 1 replication gene a) What is the function of the gene? b) What part of the replication cycle is it used? c) How does it interact with the host genes (if they do)
  5. Describe 1 or more interesting genomic features
    • CRISPR arrays, tRNA, virulence factors etc
  6. To the best of your ability, infer how this virus might be interacting with its host.
    • You might investigate clues temperate/lytic lifestyles, auxiliary metabolic genes, anti-crisprs etc

Key Points

  • Functional annotation of viral genomes can give clues about viral lifestyle and host interactions