Visualizing the assembly graph
Overview
Teaching: 20 min
Exercises: 30 min
Questions
How does the k-mer size contribute to the connectivity of the graph?
Are related species connected in the graph?
Objectives
Understanding how k-mer size affects the topology of the graph
Understanding how the presence of similar species in the sample affects the graph
Effect of k-mer size
The choice of k-mer size has a great impact on the final assembly. When running SPAdes, you might have noticed that it doesn’t use a single k-mer size per assembly but rather a range of k-mer sizes (21, 33 and 55 in this case), where each subsequent graph is built on the previous one. This is what the SPAdes authors call a multisized de Bruijn graph, which benefits from the high connectivity of small k-mer sizes and the simplicity of the large ones. From Bankevich et al., 2012: "smaller values of k collapse more repeats together, making the graph more tangled. Larger values of k may fail to detect overlaps between reads, particularly in low coverage regions, making the graph more fragmented. […] Ideally, one should use smaller values of k in low-coverage regions (to reduce fragmentation) and larger values of k in high-coverage regions (to reduce repeat collapsing). The multisized de Bruijn graph allows us to vary k in this manner."
Recall that the k-mer size determines the overlap (k-1 nucleotides) required to join two k-mers in the de Bruijn graph. The longer the k-mer, the longer the stretch of correct nucleotides needed to make that junction. Knowing this, which k-mer sizes (small or large) are more affected by sequencing errors? Explain why.
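The repeat-collapsing effect of small k values can be seen on a toy sequence. The sketch below is only an illustration and is not part of SPAdes: the `kmer_counts` helper and the 16-nucleotide sequence are made up for this example. It counts how many k-mers a sequence contains and how many of them are distinct; k-mers that occur more than once are represented only once in the graph, which is where repeats get merged.

```shell
# Count total and distinct k-mers of a sequence for a given k.
# Hypothetical helper for illustration; $1 = sequence, $2 = k.
kmer_counts() {
  printf '%s\n' "$1" | awk -v k="$2" '{
    n = length($0) - k + 1                      # number of k-mer positions
    for (i = 1; i <= n; i++) seen[substr($0, i, k)]++
    d = 0
    for (m in seen) d++                         # number of distinct k-mers
    printf "k=%d total=%d distinct=%d\n", k, n, d
  }'
}

# Toy sequence (made up): the stretch ACGTA occurs twice.
seq="TTACGTAGGACGTACC"
kmer_counts "$seq" 3
kmer_counts "$seq" 7
```

With k = 3 the repeated stretch produces several identical k-mers, so there are fewer distinct k-mers than positions. With k = 7, which is longer than the repeat, every k-mer is unique.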
We will use the Bandage tool to visualize the graph. In the releases section,
follow the instructions to download the most appropriate version. If you are attending
the workshop live, download Bandage_Ubuntu-x86-64_v0.9.0_AppImage.zip.
To run it, unzip the file, open a new terminal tab (Ctrl + Shift + t), activate the conda
environment for today, and call Bandage from the terminal like this:
# run Bandage
$ ./Bandage_Ubuntu-x86-64_v0.9.0.AppImage
In File > Load graph, navigate to any of the per-sample assemblies and load
assembly_graph.fastg for the k-mer size 21. Then click Draw graph to see the
graph. Note that this graph has already been compacted by collapsing the nodes
that form linear, unbranched paths (click any large node to see that its length is
much larger than 21, the k-mer size). Without closing the Bandage window, open a new
terminal tab and run Bandage again to load the graph for the k-mer size 55. Answer
the questions below:
- Why do we expect the graph to be very tangled with small k-mer sizes such as K21?
- The K55 graph seems easier to traverse, but note that it has been constructed using information from the smaller k-mer sizes too. Can you think of any disadvantage of using only a large k-mer size to construct the graph? Would you expect high or low connectivity?
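To back up the visual comparison with numbers, one option is to count the records in each FASTG file. This is a minimal sketch under two assumptions: that the per-k graphs sit in K21/ and K55/ subfolders of the sample's SPAdes output folder, and that each edge is written twice (forward and reverse complement), as SPAdes normally does. The `count_edges` helper is a name introduced here for illustration.

```shell
# Count the records (header lines) in a FASTG file.
# Hypothetical helper; $1 = path to a FASTG file.
count_edges() {
  grep -c '^>' "$1"
}

# Assumed layout: K21/ and K55/ inside the current sample folder.
# Adjust the paths to match your own run.
for k in K21 K55; do
  graph="$k/assembly_graph.fastg"
  if [ -f "$graph" ]; then
    echo "$k: $(count_edges "$graph") edge records"
  fi
done
```

If every edge is indeed listed in both orientations, halve the reported number to get the count of distinct edges.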
Effect of related species
Often in metagenomic samples, a number of strains of a given species are present, and this is particularly evident with viral communities, which typically contain an abundance of haplotypes (or quasispecies). Because of the large number of homologous regions between these strains, the assembly graph is complex, as multiple genomes occupy much of the same k-mer space. The convergence-divergence structure that these homologous regions generate in the graph makes traversing it more complex, and mistakes at this point can lead to chimeric contigs containing sequence from more than one strain. Note that the same convergence-divergence structure is also produced by horizontal gene transfer events between species.
The crAssphage in the graph
From Dutilh et al., 2014 we know that one of the viruses in this dataset is the prototypical crAssphage (p-crAssphage). Moreover, by mapping the sequencing reads back to the cross-assembly, they found a small contig that recruited reads from all the samples, suggesting that the genome this contig comes from, or related genomes sharing that sequence, was present in all the samples.
Let’s identify the p-crAssphage in the cross-assembly graph using Bandage and BLAST. Download the p-crAssphage genome as follows:
# download p-crAssphage genome
$ wget https://raw.githubusercontent.com/MGXlab/Viromics-Workshop-MGX/gh-pages/code/day1/crAssphage.fasta
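Before using the file as a BLAST query, it can be worth checking that the download is a complete FASTA file. The sketch below is an optional sanity check; `fasta_stats` is a helper name introduced here, not part of any tool used in the workshop.

```shell
# Report the number of records and the total number of bases in a FASTA file.
# Hypothetical helper; $1 = path to a FASTA file.
fasta_stats() {
  awk '
    /^>/ { n++; next }                              # header line: one more record
    { gsub(/[ \t\r]/, ""); len += length($0) }      # sequence line: add its length
    END { printf "records=%d bases=%d\n", n, len }
  ' "$1"
}

# Only runs if the file from the wget command above is present.
if [ -f crAssphage.fasta ]; then
  fasta_stats crAssphage.fasta
fi
```

A single record with a non-zero number of bases suggests the genome was downloaded correctly.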
Next, run Bandage and load the cross-assembly graph under
1_assemblies/cross_assembly/assembly_graph.fastg. Then click Create/view BLAST search
and use crAssphage.fasta as query. Colored nodes are the ones showing similarity to
the p-crAssphage. In a new Bandage window, repeat the BLAST analysis with the sample
F2T1, which was used in the original paper to reconstruct the p-crAssphage. Can you
explain what you see? To corroborate your answer, inspect the assembly_graph.fastg
file of F2T1 to find out which scaffold is the p-crAssphage. After this, note whether
it clustered with any other scaffolds from different samples.
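One way to inspect the FASTG file from the terminal is to list its longest records. The sketch below assumes headers of the form `EDGE_<id>_length_<n>_cov_<x>`, which is how SPAdes usually names them, and skips the reverse-complement copies (names ending in `'`). The `longest_edges` helper and the placeholder path are introduced here for illustration. Since the p-crAssphage genome is about 97 kbp, an unusually long record is a natural candidate to compare against the nodes colored by the BLAST search.

```shell
# List the N longest records of a FASTG file as "<length><TAB><name>".
# Hypothetical helper; $1 = FASTG file, $2 = number of records to show.
longest_edges() {
  grep '^>' "$1" \
    | sed -e 's/^>//' -e 's/[:;].*$//' \
    | grep -v "'\$" \
    | awk -F'_' '{ for (i = 1; i < NF; i++) if ($i == "length") print $(i + 1) "\t" $0 }' \
    | sort -k1,1nr \
    | head -n "$2"
}

# Placeholder path: point it to the F2T1 assembly graph of your own run.
graph="path/to/F2T1/assembly_graph.fastg"
if [ -f "$graph" ]; then
  longest_edges "$graph" 5
fi
```

The `sed` step removes the list of neighbouring edges that follows the record name, so only the name itself is parsed for its length.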
Key Points