Visualizing the assembly
Overview
Teaching: 0 min
Exercises: 60 min
Objectives
Understand the topology of the de-Bruijn graph
Understand how the presence of similar species in the sample affects the assembly
Contig length distribution
To get an idea of the quality of your assembly, i.e. how fragmented it is and whether it may contain full-length viral genomes, it helps to look at the length distribution of the generated contigs. We can describe this distribution with summary numbers such as the minimum, maximum, or median length, and with two values called N50 and N90. These are computed by ordering all contigs from longest to shortest and concatenating them. The length of the contig that sits at 50% (or 90%) of the total combined length is called N50 (or N90). The QUAST program (Gurevich et al., 2013) computes these values and visualizes the distribution of contig lengths. The tool can additionally compare the assembly against a reference sequence to assess its fragmentation. We do not have a reference, so we will use the basic analysis of Quast.
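As a minimal sketch of the definition above, the following shell functions compute N50 and N90 from a list of contig lengths. The function names `fasta_lengths` and `nx` are made up for this illustration and are not part of Quast. Numbers reported by Quast can differ, because Quast by default only considers contigs above a minimum length.

```shell
#!/bin/bash
# Hypothetical helper: print the length of every sequence in a FASTA file,
# one length per line (multi-line sequences are summed up).
fasta_lengths() {
    awk '/^>/ { if (n) print n; n = 0; next }
         { n += length($0) }
         END { if (n) print n }' "$1"
}

# Hypothetical helper: read contig lengths from stdin and print Nx,
# e.g. "nx 50" for N50 or "nx 90" for N90.
nx() {
    local percent="$1"
    # Order contigs from longest to shortest, then walk along the
    # concatenated lengths until x% of the total length is reached.
    sort -rn | awk -v p="$percent" '
        { len[NR] = $1; total += $1 }
        END {
            cum = 0
            for (i = 1; i <= NR; i++) {
                cum += len[i]
                if (cum * 100 >= total * p) { print len[i]; exit }
            }
        }'
}

# Toy example: five contigs with a combined length of 100
printf '%s\n' 40 25 15 10 10 | nx 50   # prints 25
printf '%s\n' 40 25 15 10 10 | nx 90   # prints 10

# On a real assembly this would be used as:
#   fasta_lengths /path/to/your/cross_assembly/assembly.fasta | nx 50
```

In the toy example the two longest contigs (40 + 25 = 65) are the first to cover at least half of the total length, so N50 is 25.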
Exercise - Use Quast to compute the contig length distribution
Use the Quast program to visualize the distribution of the contig lengths. You will need to run it twice, once per assembly, and save the results to different folders (e.g. result_quast/cross_assembly and result_quast/single_assemblies). Quast does not need many resources. Assigning 2 CPUs and 5 GB of RAM for sbatch should be enough.
# create a folder for the assessment within today's folder
$ mkdir 30_results_assessment_quast
# Quast is already installed on the server. It's a Python script located here:
$ quast='python3 /home/groups/VEO/tools/quast/v5.2.0/quast.py'
# run quast
$ $quast -o 30_results_assessment_quast/cross_assembly /path/to/your/cross_assembly/assembly.fasta
You can get the results from the file report.txt, or copy the whole results folder to your computer and open report.html in your local browser.
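If you only want the headline numbers on the command line, a small filter can help. The sketch below assumes that Quast also wrote a tab-separated report.tsv next to report.txt, and that the row labels match the ones listed in the awk script. Both are assumptions you should check against your own output folder. The function name `quast_metrics` is made up for this example.

```shell
#!/bin/bash
# Hypothetical helper: print a few rows of a Quast report.tsv
# (first column = metric name, second column = value for the assembly).
quast_metrics() {
    awk -F'\t' '
        $1 == "# contigs" || $1 == "Largest contig" ||
        $1 == "Total length" || $1 == "N50" || $1 == "N90" {
            printf "%-16s %s\n", $1, $2
        }' "$1"
}

# Output folder as used in the exercise above; adjust it to your own path.
report="30_results_assessment_quast/cross_assembly/report.tsv"
if [ -f "$report" ]; then
    quast_metrics "$report"
fi
```

Rows that are missing from your report are silently skipped, so an empty result means the labels need adjusting.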
How fragmented is your assembly?
The distribution of contig lengths can already tell you a lot about the assembly. Combined with some prior knowledge about the sample you sequenced, we can use it as an indicator of the quality or completeness of the assembly.
- What would be 2 extreme cases of length distributions? (Something other than no contigs at all :))
- How do the numbers we computed fit into your expectations about the sample you analyzed?
sbatch script for running Quast
#!/bin/bash
#SBATCH --tasks=1
#SBATCH --cpus-per-task=2
#SBATCH --partition=short,standard,interactive
#SBATCH --mem=1G
#SBATCH --time=00:30:00
#SBATCH --job-name=quast
#SBATCH --output=30_quast/quast.slurm.%j.out
#SBATCH --error=30_quast/quast.slurm.%j.err

# Set some variables for the quast script on draco and the files to be analysed
quast='python3 /home/groups/VEO/tools/quast/v5.2.0/quast.py'
assembly='10_assembly_flye/assembly.fasta'
outdir='30_quast'

# run Quast to visualize the distribution of contig lengths
$quast -o $outdir/assembly $assembly
Optional: Paths in the de-Bruijn graph
We will use Bandage, a tool to visualize the assembly graph. Bandage is difficult to run on a Windows computer. In the releases section, follow the instructions to download the most appropriate version, such as Bandage_Ubuntu-x86-64_v0.9.0_AppImage.zip.
To run it, unzip the file and call Bandage from the terminal like this:
# run Bandage
$ ./Bandage_Ubuntu-x86-64_v0.9.0.AppImage
In File > Load graph, navigate to and load the file assembly_graph.fastg of the cross-assembly. Then click Draw graph to visualize the graph. Note that this graph has already been compacted by collapsing nodes that form linear, unbranching paths into unitigs. Nodes in the graph are called edge_N (a confusing name), with N being an integer. They often correspond to the final contigs in your assembly.
Bubbles and junctions
Open the file assembly_info.txt corresponding to the graph you are looking at. The N in the node names as displayed by Bandage corresponds to the number assigned to continuous paths in the de-Bruijn graph by Flye. The “graph_path” column holds this information for all contigs.
- What does an asterisk * mean?
- What do multiple occurrences of the same number mean?
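To explore these questions on the command line, the sketch below lists each contig with its graph path and counts how often each edge number is used across all paths. It assumes that assembly_info.txt is tab-separated with a header line containing a graph_path column, that path entries are comma-separated, and that a leading minus sign only marks orientation. Asterisks are skipped because they are not edge numbers. The function names `graph_paths` and `edge_usage` are made up for this example.

```shell
#!/bin/bash
# Hypothetical helper: print contig name and graph path, tab-separated.
graph_paths() {
    awk -F'\t' '
        NR == 1 { for (i = 1; i <= NF; i++) if ($i == "graph_path") col = i; next }
        col     { print $1 "\t" $col }' "$1"
}

# Hypothetical helper: count how often each edge number occurs in any path.
# Output: edge number, tab, number of occurrences.
edge_usage() {
    awk -F'\t' '
        NR == 1 { for (i = 1; i <= NF; i++) if ($i == "graph_path") col = i; next }
        col {
            n = split($col, parts, ",")
            for (j = 1; j <= n; j++) {
                e = parts[j]
                if (e == "*") continue      # not an edge number
                sub(/^-/, "", e)            # assumption: sign = orientation only
                count[e]++
            }
        }
        END { for (e in count) print e "\t" count[e] }' "$1" | sort -n
}

# Path as in the exercise above; adjust it to your own assembly folder.
info="/path/to/your/cross_assembly/assembly_info.txt"
if [ -f "$info" ]; then
    graph_paths "$info"
    edge_usage "$info"
fi
```

Edges with a count above 1 are good candidates to look up in Bandage when answering the second question.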
Pick two components of the visualized de-Bruijn graph and explain their topology and information content.
- Are there bubbles and junctions?
- Can you relate the complexity of the visualized graph to the Flye command line parameters?
Key Points
Bandage can visualize the de-Bruijn graph
JBrowse2 can visualize genomic data like alignments and coverage