Visualizing the assembly graph

Overview

Teaching: 20 min
Exercises: 30 min

Questions

How does the k-mer size contribute to the connectivity of the graph?

Are related species connected in the graph?

Objectives

Understanding how k-mer size affects the topology of the graph

Understanding how the presence of similar species in the sample affects the graph

Effect of k-mer size

The choice of k-mer size has a great impact on the final assembly. When running SPAdes, you might have noticed that it doesn’t use a single k-mer size per assembly but rather a range of k-mer sizes (21, 33 and 55 in this case), where each subsequent graph is built on the previous one. This is what the authors call a multisized de Bruijn graph, which benefits from the high connectivity of small k-mer sizes and the simplicity of the large ones. From Bankevich et al., 2012: “Smaller values of k collapse more repeats together, making the graph more tangled. Larger values of k may fail to detect overlaps between reads, particularly in low coverage regions, making the graph more fragmented. […] Ideally, one should use smaller values of k in low-coverage regions (to reduce fragmentation) and larger values of k in high-coverage regions (to reduce repeat collapsing). The multisized de Bruijn graph allows us to vary k in this manner.”
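The trade-off described in the quote can be reproduced on a toy example. The sketch below is not how SPAdes works internally: it builds a plain de Bruijn graph from made-up, error-free reads and ignores reverse complements. The 28-nt genome with its 5-nt repeat (GATTC), the read layout, and the function names (build_graph, count_branching, count_components, summarize) are all assumptions chosen for illustration.

```python
def build_graph(reads, k):
    """Map each (k-1)-mer node to the set of nodes it points to."""
    succ = {}
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            succ.setdefault(kmer[:-1], set()).add(kmer[1:])
            succ.setdefault(kmer[1:], set())  # register sink nodes too
    return succ


def count_branching(succ):
    """Count nodes with more than one successor or more than one predecessor."""
    pred = {node: set() for node in succ}
    for node, nexts in succ.items():
        for nxt in nexts:
            pred[nxt].add(node)
    return sum(1 for n in succ if len(succ[n]) > 1 or len(pred[n]) > 1)


def count_components(succ):
    """Count connected pieces of the graph, ignoring edge direction."""
    neighbours = {node: set() for node in succ}
    for node, nexts in succ.items():
        for nxt in nexts:
            neighbours[node].add(nxt)
            neighbours[nxt].add(node)
    seen, components = set(), 0
    for start in neighbours:
        if start in seen:
            continue
        components += 1
        seen.add(start)
        stack = [start]
        while stack:
            for nxt in neighbours[stack.pop()]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
    return components


def summarize(reads, k):
    """Return (branching nodes, connected components) for a given k."""
    graph = build_graph(reads, k)
    return count_branching(graph), count_components(graph)


# Hypothetical genome: the 5-nt repeat GATTC occurs twice.
genome = "ACGTAGATTCAGCCTGCTGATTCTGGAC"
# Made-up error-free reads: 8 nt long, one starting every 4 nt.
reads = [genome[i:i + 8] for i in range(0, len(genome) - 7, 4)]

for k in (4, 7):
    branching, components = summarize(reads, k)
    print(f"k={k}: {branching} branching nodes, {components} connected components")
```

With k=4 the reads form a single connected graph, but the repeat collapses into shared nodes and creates branches. With k=7 the branches disappear, but reads that overlap by only 4 nt can no longer be joined, so the graph breaks into one piece per read.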

Recall that the k-mer size determines the amount of overlap (k-1) that is necessary to join two nodes in the de Bruijn graph. The longer the k-mer, the longer the stretch of correct nucleotides needed to make that junction. Knowing this, which k-mer sizes (small or large) are more affected by sequencing errors? Explain why.

We will use the Bandage tool to visualize the graph. In the releases section, follow the instructions to download the most appropriate version. If you are attending the workshop live, download Bandage_Ubuntu-x86-64_v0.9.0_AppImage.zip. To run it, unzip the file, open a new terminal tab (Ctrl + Shift + t), activate the conda environment for today, and call Bandage from the terminal like this:

# run Bandage
$ ./Bandage_Ubuntu-x86-64_v0.9.0.AppImage

In File > Load graph, navigate to any of the per-sample assemblies and load assembly_graph.fastg for the k-mer size 21. Then click Draw graph to see the graph. Note that this graph has already been compacted by collapsing nodes that form linear, unbranched paths (click any large node to see that its length is much larger than 21, the k-mer size). Without closing the Bandage window, open a new terminal tab and run Bandage again to load the graph for the k-mer size 55. Answer the questions below:

Why do we expect the graph to be very tangled with small k-mer sizes such as K21?
The K55 graph seems easier to traverse, but note that this graph has been constructed using information from the previous k-mer sizes too. Can you think of any disadvantage of using only a large k-mer size to construct the graph? Would you expect high or low connectivity?
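The compaction mentioned above, where linear, unbranched paths are collapsed into single nodes, can be sketched in a few lines. This is a minimal illustration of the general idea, not the procedure SPAdes or Bandage actually implement: the two input sequences are made up, reverse complements are ignored, and build_graph and compact are hypothetical helper names.

```python
def build_graph(seqs, k):
    """Map each (k-1)-mer node to the set of nodes it points to."""
    succ = {}
    for seq in seqs:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            succ.setdefault(kmer[:-1], set()).add(kmer[1:])
            succ.setdefault(kmer[1:], set())  # register sink nodes too
    return succ


def compact(succ):
    """Merge every maximal unbranched path into one sequence.

    Isolated cycles made only of linear nodes are skipped in this sketch.
    """
    indegree = {node: 0 for node in succ}
    for nexts in succ.values():
        for nxt in nexts:
            indegree[nxt] += 1

    def is_linear(node):
        # Exactly one way in and one way out: safe to merge through.
        return indegree[node] == 1 and len(succ[node]) == 1

    merged = []
    for node in sorted(succ):
        if is_linear(node):
            continue  # paths only start at branching nodes, sources or sinks
        for first in sorted(succ[node]):
            current, seq = first, node + first[-1]
            while is_linear(current):
                (current,) = succ[current]
                seq += current[-1]
            merged.append(seq)
    return merged


# Two made-up sequences that share the middle segment GATTACAG.
seqs = ["CCGTGATTACAGTCGG", "AAGCGATTACAGCTTA"]
k = 5
graph = build_graph(seqs, k)
for seq in compact(graph):
    print(len(seq), seq)
```

In this toy case, 21 nodes of length k-1 = 4 collapse into five merged sequences of length 8, which is why nodes in a compacted graph are usually much longer than the k-mer size.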

Effect of related species

Often in metagenomic samples, a number of strains of a given species are present. This is particularly evident in viral communities, which typically contain an abundance of haplotypes (or quasispecies). Because these strains share many homologous regions, the assembly graph is complex: multiple genomes occupy much of the same k-mer space. The convergence-divergence structure that these homologous regions generate in the graph makes traversing it harder, and mistakes at this point can lead to chimeric contigs containing sequence from more than one strain. Note that the convergence-divergence structure is also observed in horizontal gene transfer events between any species.
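To make the risk of chimeric contigs concrete, the sketch below builds a de Bruijn graph from two made-up strain sequences that share a core segment and then spells out every path from a start node to an end node. It is only an illustration under strong assumptions: perfect coverage, no sequencing errors, no reverse complements, and an acyclic graph. Real assemblers do not enumerate paths this way, and the names build_graph and spell_all_paths are hypothetical.

```python
def build_graph(seqs, k):
    """Map each (k-1)-mer node to the set of nodes it points to."""
    succ = {}
    for seq in seqs:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            succ.setdefault(kmer[:-1], set()).add(kmer[1:])
            succ.setdefault(kmer[1:], set())  # register sink nodes too
    return succ


def spell_all_paths(succ):
    """Spell the sequence of every source-to-sink path (acyclic graphs only)."""
    has_predecessor = {nxt for nexts in succ.values() for nxt in nexts}
    sources = sorted(node for node in succ if node not in has_predecessor)
    spelled = []

    def walk(node, seq):
        if not succ[node]:  # sink reached: the path is complete
            spelled.append(seq)
            return
        for nxt in sorted(succ[node]):
            walk(nxt, seq + nxt[-1])

    for source in sources:
        walk(source, source)
    return spelled


# Made-up strains: different flanks around the shared core GATTACAG.
strains = {
    "strain_A": "CCGTGATTACAGTCGG",
    "strain_B": "AAGCGATTACAGCTTA",
}
graph = build_graph(strains.values(), k=5)
paths = spell_all_paths(graph)
true_genomes = set(strains.values())
chimeras = [p for p in paths if p not in true_genomes]

print("paths through the graph:", len(paths))
print("chimeric paths:", chimeras)
```

The two strains converge at the start of the shared core and diverge at its end, so the graph contains four equally valid paths. Only two of them correspond to real strains; the other two combine the left flank of one strain with the right flank of the other.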

The crAssphage in the graph

From Dutilh et al., 2014 we know that one of the viruses in this dataset is the prototypical crAssphage (p-crAssphage). Moreover, by mapping the sequencing reads back to the cross-assembly, they could see a small contig recruiting reads from all the samples, suggesting that the genome this contig comes from, or maybe related genomes that share that genomic sequence, was present in all the samples.

Let’s identify the p-crAssphage in the cross-assembly graph using Bandage and BLAST. Download the p-crAssphage genome as follows:

# download p-crAssphage genome
$ wget https://raw.githubusercontent.com/MGXlab/Viromics-Workshop-MGX/gh-pages/code/day1/crAssphage.fasta

Then, run Bandage and load the cross-assembly graph under 1_assemblies/cross_assembly/assembly_graph.fastg. Next, click Create/view BLAST search and use crAssphage.fasta as the query. Colored nodes are the ones showing similarity to the p-crAssphage. In a new Bandage window, repeat the BLAST analysis with sample F2T1, which was used in the original paper to reconstruct the p-crAssphage. Can you explain what you see? To corroborate your answer, inspect the assembly_graph.fastg file of F2T1 to find out which scaffold is the p-crAssphage. After this, note whether it clustered with any other scaffolds from different samples.
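One way to inspect a FASTG file outside Bandage is to list its longest segments from the header lines. The sketch below assumes SPAdes-style headers of the form >NAME_length_N_cov_X:neighbours; with a trailing ' marking reverse-complement entries. That layout is an assumption, so check the first lines of your own file before relying on it. The example headers and all numbers in them are invented for illustration, and longest_segments is a hypothetical helper.

```python
import re

# Assumed header layout: >NAME[']:[NEIGHBOUR,...];
HEADER = re.compile(r"^>([^:;']+)'?(?::[^;]*)?;$")
# Assumed naming scheme inside NAME: ..._length_<int>_cov_<float>
STATS = re.compile(r"_length_(\d+)_cov_([\d.]+)")


def longest_segments(fastg_lines, top=3):
    """Return (name, length, coverage) for the longest segments."""
    segments = {}
    for line in fastg_lines:
        header = HEADER.match(line.strip())
        if not header:
            continue  # sequence line, or a header in an unexpected layout
        name = header.group(1)
        stats = STATS.search(name)
        if stats:
            # Forward and reverse-complement entries share a name,
            # so each segment is stored only once.
            segments[name] = (int(stats.group(1)), float(stats.group(2)))
    ranked = sorted(segments.items(), key=lambda item: item[1][0], reverse=True)
    return [(name, length, cov) for name, (length, cov) in ranked[:top]]


# Invented example content; real files contain full sequences.
example = """\
>EDGE_1_length_97000_cov_35.2:EDGE_2_length_120_cov_40.0;
ACGT
>EDGE_1_length_97000_cov_35.2':EDGE_3_length_2500_cov_8.1';
ACGT
>EDGE_2_length_120_cov_40.0;
ACGT
>EDGE_3_length_2500_cov_8.1:EDGE_1_length_97000_cov_35.2;
ACGT
""".splitlines()

for name, length, cov in longest_segments(example, top=2):
    print(name, length, cov)
```

To try this on a real graph, pass an open file handle instead of the example lines, for instance with open("assembly_graph.fastg") as handle: longest_segments(handle).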

Key Points
