Visualizing the assembly

Overview

Teaching: 0 min
Exercises: 60 min
Objectives
  • Understand the topology of the de-Bruijn graph

  • Understand how the presence of similar species in the sample affects the assembly

Contig length distribution

To get an idea of the quality of your assembly, i.e. how fragmented it is and whether it may contain full-length, complete viral genomes, it helps to look at the length distribution of the generated contigs. We can describe this distribution with a few summary numbers, such as the minimum, maximum or median length, and with statistics called N50 and N90. These are computed by sorting all contigs from longest to shortest and concatenating them in that order. The length of the contig that sits at 50% (or 90%) of the total length of all contigs combined this way is called N50 (or N90). The QUAST program (Gurevich et al., 2013) computes these values and visualizes the distribution of the contig lengths. QUAST can additionally compare the assembly against a reference sequence to assess its fragmentation. We do not have a reference, so we will use only the basic QUAST analysis.
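To make the N50 definition concrete, here is a minimal sketch that computes Nx from a list of contig lengths with standard shell tools. The helper names fasta_lengths and nx are made up for this illustration and are not part of QUAST. The sketch assumes the usual convention: contigs are sorted from longest to shortest, and Nx is the length of the first contig at which the running sum reaches x% of the total length.

```shell
# Print one sequence length per line for a FASTA file (hypothetical helper).
fasta_lengths() {
    awk '/^>/ { if (seen) print len; len = 0; seen = 1; next }
         { len += length($0) }
         END { if (seen) print len }' "$1"
}

# Read contig lengths from stdin and print Nx (x defaults to 50).
nx() {
    sort -rn | awk -v x="${1:-50}" '
        { len[NR] = $1; total += $1 }
        END {
            for (i = 1; i <= NR; i++) {
                running += len[i]
                # first contig at which the running sum reaches x% of the total
                if (running * 100 >= total * x) { print len[i]; exit }
            }
        }'
}

# Toy example: five contigs with a combined length of 300
printf '%s\n' 100 80 60 40 20 | nx 50   # prints 80
printf '%s\n' 100 80 60 40 20 | nx 90   # prints 40
```

On a real assembly the two helpers can be combined, for example as fasta_lengths /path/to/your/cross_assembly/assembly.fasta | nx 50, and the result compared with the value reported by QUAST.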

Exercise - Use Quast to compute the contig length distribution

Use the Quast program to visualize the distribution of the contig lengths. You will need to run it twice, once per assembly, and save the results to different folders (e.g. 30_results_assessment_quast/cross_assembly and 30_results_assessment_quast/single_assemblies). Quast does not need many resources. Assigning 2 CPUs and 5 GB of RAM for sbatch should be enough.

# create a folder for the assessment within today's folder
$ mkdir 30_results_assessment_quast

# Quast is already installed on the server. It's a Python script located here:
$ quast='python3 /home/groups/VEO/tools/quast/v5.2.0/quast.py'

# run quast
$ $quast -o 30_results_assessment_quast/cross_assembly /path/to/your/cross_assembly/assembly.fasta

You can get the results from the file report.txt or copy the whole results folder to your computer and open report.html in your local browser.
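If you only want the contig length statistics on the command line, a small filter over report.txt is enough. The sketch below is a hypothetical helper, not part of Quast. It assumes that the rows of interest in report.txt start with the labels "# contigs", "Largest contig", "Total length", "N50" and "N90". Check your own report and adjust the pattern if the labels differ.

```shell
# Print the contig-length rows of a Quast text report (hypothetical helper).
# Assumption: the row labels in the pattern match those in your report.txt.
quast_summary() {
    grep -E '^(# contigs|Largest contig|Total length|N50|N90) ' "$1"
}

# To copy the whole results folder to your own computer instead, run something
# like this locally (user, server and path are placeholders):
#   scp -r user@server:/path/to/30_results_assessment_quast/cross_assembly .
```

For the run shown above, the call would be quast_summary 30_results_assessment_quast/cross_assembly/report.txt.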

How fragmented is your assembly?

The distribution of contig lengths can already tell you a lot about the assembly. Combined with some prior knowledge about the sample you sequenced, we can use it as an indicator of the quality or completeness of the assembly.

  1. What would be 2 extreme cases of length distributions? (Something other than no contigs at all :))
  2. How do the numbers we computed fit into your expectations about the sample you analyzed?

sbatch script for running Quast

#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=2
#SBATCH --partition=short,standard,interactive
#SBATCH --mem=5G
#SBATCH --time=00:30:00
#SBATCH --job-name=quast
# note: the 30_quast folder must exist before you submit the job, Slurm does not create it
#SBATCH --output=30_quast/quast.slurm.%j.out
#SBATCH --error=30_quast/quast.slurm.%j.err

# Set some variables for the quast script on draco and the files to be analysed
quast='python3 /home/groups/VEO/tools/quast/v5.2.0/quast.py'
assembly='10_assembly_flye/assembly.fasta'
outdir='30_quast'

# run Quast to visualize the distribution of contig lengths
# ($quast is left unquoted on purpose so that it splits into 'python3' and the script path)
$quast -o "$outdir/assembly" "$assembly"

Optional: Paths in the de-Bruijn graph

We will use Bandage, a tool to visualize the assembly graph. Bandage is difficult to run on a Windows computer. In the releases section of the Bandage GitHub repository, follow the instructions to download the version that best matches your system, such as Bandage_Ubuntu-x86-64_v0.9.0_AppImage.zip. To run it, unzip the file and call Bandage from the terminal like this:

# run Bandage
$ ./Bandage_Ubuntu-x86-64_v0.9.0.AppImage

In File > Load graph, navigate to the file assembly_graph.gfa of the cross-assembly and load it. Then click Draw graph to visualize the graph. Note that this graph has already been compacted by collapsing nodes that form linear, unbranching paths into unitigs. Nodes in the graph are called edge_N (confusing name…) with N being an integer. They often correspond to the final contigs in your assembly.

Bubbles and junctions

Open the file assembly_info.txt corresponding to the graph you are looking at. The N in the node names as displayed by Bandage corresponds to the number assigned to continuous paths in the de-Bruijn graph by Flye. The “graph_path” column holds this information for all contigs.

  1. What does an asterisk * mean?
  2. What do multiple occurrences of the same number mean?
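To find edge numbers worth looking at in Bandage, it can help to tabulate how often each number occurs across all graph paths. The following sketch uses a hypothetical helper, edge_usage. It assumes that assembly_info.txt is tab-separated, that its header line starts with '#', that graph_path is the last column, and that each path is a comma-separated list in which a leading minus sign only indicates orientation. Entries that are not numbers are skipped.

```shell
# Count how often each edge number occurs across all graph paths
# (hypothetical helper; see the assumptions about the file layout above).
edge_usage() {
    awk -F '\t' '
        /^#/ { next }                        # skip the header line
        {
            n = split($NF, path, ",")        # graph_path assumed to be the last column
            for (i = 1; i <= n; i++) {
                id = path[i]
                sub(/^-/, "", id)            # ignore the orientation sign
                if (id ~ /^[0-9]+$/) count[id]++
            }
        }
        END { for (id in count) print id "\t" count[id] }' "$1" | sort -n
}
```

Calling edge_usage /path/to/your/cross_assembly/assembly_info.txt prints one line per edge number followed by its count. Numbers with a count above 1 are good starting points for the questions above.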

Pick two components of the visualized de-Bruijn graph and explain their topology and information content.

  1. Are there bubbles and junctions?
  2. Can you relate the complexity of the visualized graph to the Flye command line parameters?

Key Points

  • Bandage can visualize the de-Bruijn graph

  • JBrowse2 can visualize genomic data like alignments and coverage