Assessing assembly quality

Overview

Teaching: 0 min
Exercises: 120 min
Objectives
  • Assess the quality of your assemblies

Now we will measure some basic aspects of the assemblies, such as the fragmentation degree and the percentage of the raw data they represent. A metagenome consists of all the genomic information of all the organisms in a given system. So in the ideal case, a metagenomic assembly would contain a single and complete contig for each chromosome or plasmid in the sample, in which case all the metagenomic sequencing reads can be perfectly mapped back to the assembly. Of course, this will never happen, because in most biomes there is a long tail of rare organisms that contribute only a tiny fraction of the sequenced DNA, so complete horizontal coverage of their genome cannot be achieved, and their assembly remains fragmented. Other problems are repeated regions in the metagenome and microdiversity between strains, which both lead to complex structures in the assembly graph, and thus fragmented assemblies.

Contig length distribution

To get a first idea about your assembly, it is helpful to look at the distribution of the length of the generated contigs. We can discribe this distribution using some numbers derived from it, such as the maximum, the median and something called N50 or N90. These numbers are computed by concatenating all contings ordered by their length. The length of the contig sitting at 50% (or 90%) of the total length of all contigs combined this way, is called N50 (or N90). The QUAST program (Gurevich et al., 2013) can be used to compute these values and visualize the distribution of the contig lengths. The tool can additionally use a reference sequence to compare the assembly against for assessing its fragmentation. We do not have a reference and will use the basic analysis of Quast.

Exercise - Use Quast to compute the contig length distribution

Use the Quast program to visualize the distribution of the contig lengths. You will need to run it two times, once per assembly, and save the results to different folders (ie. result_quast/cross_assembly and result_quast/single_assemblies). Quast does not need many ressources. Assigning 2 CPUs and 5 GB of RAM for sbatch should be enough.

# create a folder for the assessment within todays folder
$ mkdir 30_results_assessment_quast

# Quast is already installed on the server. Its a python script located here:
$ quast='python3 /home/groups/VEO/tools/quast/v5.2.0/quast.py'

# run quast two times, once per assembly
$ $quast -o 30_results_assessment_quast/cross_assembly /path/to/your/cross_assembly/assembly.fasta

# run quast again on the assembly.fasta, which is merged from your vclust results (merged x3 single assemblies)
$ $quast -o 30_results_assessment_quast/single_assemblies ./1.2_assembly/20_results_assessment_vclust/single_assemblies/assembly.fasta

You can get the results from the file report.txt or copy the whole results folder to your computer and open report.html in your local browser.

sbatch script for running Quast

#!/bin/bash
#SBATCH --tasks=1
#SBATCH --cpus-per-task=2
#SBATCH --partition=short,standard,interactive
#SBATCH --mem=1G
#SBATCH --time=00:30:00
#SBATCH --job-name=assessment_quast
#SBATCH --output=./1.2_assembly/30_results_assessment_quast/assessment_quast.slurm.%j.out
#SBATCH --error=./1.2_assembly/30_results_assessment_quast/assessment_quast.slurm.%j.err

# Set some variables for the quast script on draco and the files to be analysed
quast='python3 /home/groups/VEO/tools/quast/v5.2.0/quast.py'
cross_assembly='./1.2_assembly/10_results_assembly_flye/cross_assembly/assembly.fasta'
single_assembly='./1.2_assembly/20_results_assessment_vclust/single_assemblies/assembly.fasta'
outdir='./1.2_assembly/30_results_assessment_quast'

# run Quast to visualize the distribution of contig lengths
$quast -o $outdir/cross_assembly $cross_assembly
$quast -o $outdir/single_assemblies $single_assembly

How well does the assembly represent the reads?

Next, we will test how much of the raw data (reads) is represented by the assembly (contigs). We will use minimap2 to align the reads from each sample to the assembled contigs and samtools to handle the output of minimap2.

Exercise - Map the reads back to the assembly with minimap2

You will map the reads from each barcode separately. Minimap2 is an alignment tool that outputs Sequence Alignment Map (SAM) file format, which has to be converted to a compressed binary (BAM) format and then sorted using samtools view and samtools sort. You can also use samtools sort to create index files, which facilitate quick access to the data in the binary file. Last, you can get basic stats of the mapping using samtools stats. This will tell you how many of the reads aligned to each of the assemblies, and how many did not align. You can also pipe the respective outputs of each step into the next step, saving disk IO and possibly speeding up things (minimap2 and samtools are specifically designed for this). You can read the manuals for minimap2 and samtools to figure out the specific commands to use (use the solution, if you’re short on time).

# minimap2 and samtools are installed on draco and you can set aliases to their location:
minimap2='/home/groups/VEO/tools/minimap2/v2.26/minimap2'
samtools='/home/groups/VEO/tools/samtools/v1.17/bin/samtools'

# pipe the commands into eachother. The '-'
# tells the tools to take their input from the pipe:
$ $minimap2 ... | $samtools view ... - | $samtools sort ... -

# get mapping statistics of each bam file
$ $samtools stats ...

samtools stats creates a long report with tabular statistics which could be plotted. The size of the output is determined by the maximum length of all reads and can get very large for alignments of long reads. A summary of the statistics is located at the beginning of the file and you can read the relevant first 46 lines with “head -46 path/to/stats/file.txt” or using “less path/to/stats/file.txt”. There is a lot of information contained in these lines, an important measure for us is the number of reads which could be mapped back to the assembly and also the number of bases mapped.

sbatch script for aligning the samples to the assemblies

#!/bin/bash
#SBATCH --tasks=1
#SBATCH --cpus-per-task=32
#SBATCH --partition=short,standard,interactive
#SBATCH --mem=2G
#SBATCH --time=00:30:00
#SBATCH --job-name=alignment_minimap2
#SBATCH --output=./1.2_assembly/40_results_alignment_minimap2/alignment_minimap2.slurm.%j.out
#SBATCH --error=./1.2_assembly/40_results_alignment_minimap2/alignment_minimap2.slurm.%j.err


# assign tool paths to aliases for better readability
minimap2='/home/groups/VEO/tools/minimap2/v2.26/minimap2'
samtools='/home/groups/VEO/tools/samtools/v1.17/bin/samtools'

# run minimap2 to align all reads from a sample to the assembled contigs
# and pipe the output into samtools for conversion into the binary bam format
#
# minimap2 parameters (https://lh3.github.io/minimap2/minimap2.html):
# -x map_ont : Use a preset for parameterizing the affine gap penalty model for the extension of matched seeds
# suited for noisy nanopore reads.
# -a : output in SAM format
# -t 30 : run with 30 threads
#
# samtools parameters (http://www.htslib.org/doc/samtools.html):
#
# samtools view can be used to convert between SAM, BAM and CRAM formats.
# view -u : output uncompressed binary format (BAM)
#
# samtools sort can be used to sort a SAM, BAM or CRAM file. Some tools expect sorted alignments.
# sort --write-index : output the index of the sorted alignments, can reduce file IO when accessing only a subset of the alignments
# sort -o : set the output file for the sorted alignments
#
# - : the - tells samtools to take the inpute from the pipe (| is the piping operator).


indir='./1.1_QC/20_chopper'

# cross-assembly
outdir='./data/alignments/cross_assembly'
assembly='./1.2_assembly/10_results_assembly_flye/cross_assembly/assembly.fasta'
mkdir -p $outdir

# loop through the numbers 62 to 64 and use it to generate diffenrent filenames within the loop
for barcode in $(seq 62 64) 
do 
 $minimap2 -x map-ont -a -t 30  $assembly $indir/barcode${barcode}_filtered_reads.fastq.gz | \
   $samtools view -u - | $samtools sort -o $outdir/barcode$barcode.bam --write-index -
 $samtools stats $outdir/barcode$barcode.bam > $outdir/barcode$barcode_stats.txt
done

# single assemblies
outdir='./data/alignments/single_assemblies'
assembly='./1.2_assembly/20_results_assessment_vclust/single_assemblies/assembly.fasta'
mkdir -p $outdir

for barcode in $(seq 62 64) 
do 
 $minimap2 -x map-ont -a -t 30 $assembly $indir/barcode${barcode}_filtered_reads.fastq.gz | \
   $samtools view -u - | $samtools sort -o $outdir/barcode$barcode.bam --write-index -
 $samtools stats $outdir/barcode$barcode.bam > $outdir/barcode$barcode_stats.txt
done

Questions - Compare both assemblies

You now have access to several metrics about your assemblies.

  • The assembly data from Flye;
  • The similarity between your contigs from vClust;
  • N50, N90 and the distribution of contig lengths from Quast;
  • The alignment of your reads to the assembled contigs from Minimap2/Samtools.

Can you summarize these results in a clear way and explain the difference between the assemblies? Focus on the difference between the cross-assembly and the separate assemblies.

  1. Which assembly has more contigs? Why?
  2. Which assembly has more contigs above a fixed length? Why?
  3. Which assembly better represents its input reads? Why?
  4. Can you think of other metrics to assess the quality of a metagenomic assembly?

Key Points

  • checkV assesses the quality of your contigs with regard to viral completeness and contamination

  • minimap2 aligns long and noisy nanopore reads efficiently to large (meta)genomes

  • samtools can be used to read, filter, convert and summarize alignments