Assessing assemblies quality

Overview

Teaching: 15 min
Exercises: 25 min
Questions
  • Which assembly contains longer contigs?

  • Which assembly best represents the raw data (sequencing reads)?

Objectives

Assemblies assessment

Now we will measure some basic aspects of the assemblies, such as the fragmentation degree and the percentage of the raw data they contain. Ideally, the assembly would contain a single and complete contig for each species in the sample, and would represent 100% of the sequencing reads.

Fragmentation

Use the QUAST program (Gurevich et al., 2013) to assess how fragmented are the assemblies. Have a look at the possible parameters with quast -h. You will need to run it two times, one per assembly, and save the results to different folders (ie. quast_crossassembly and quast_separate)

# create a folder for the assessment
$ mkdir 1_assemblies/assessment

# run quast two times, one per assembly
$ quast -o 1_assemblies/assessment/<OUTPUT_FOLDER> ...

Raw data representation

You can know the amount of raw data represented by the assemblies by mapping the reads back to them and quantifying the percentage or reads that could be aligned. For this, use the BWA (cite) and Samtools (cite) programs. BWA is a short-read aligner, while Samtools is a suite of programs intended to work with mapping results. Mapping step requires you to first index the assemblies with bwa index so BWA can quickly access them. After it, use bwa mem to align the sequences to the assemblies and save the results in a SAM format file (ie. crossassembly.sam and separate.sam). Then use samtools view to convert the SAM files to BAM format (ie. crossassembly.bam and separate.bam), which is the binary form of the SAM format. Once you have the BAM files, sort them with samtools sort (output could be crossassembly_sorted.bam and separate_sorted.bam). Last, index the sorted BAM files to allow for an efficient processing, and get basic stats of the mapping using samtools flagstats.

# index the assemblies
$ bwa index <ASSEMBLY_FASTA>

# map the reads to each assembly
$ bwa mem ... > 1_assemblies/assessment/<OUTPUT_SAM>

# convert SAM file to BAM file
$ samtools view ...

# sort the BAM file
$ samtools sort ...

# index the sorted BAM file
$ samtools index ...

# get mapping statistics
$ samtools flagstats ...

Compare both assemblies

So far you have calculated some metrics to assess the quality of the assemblies, but bare in mind there also exist also others we can check for this, such as the number of ORFs or the depth of coverage across the contigs. In the report generated by Quast, look at metrics regarding scaffolds length, such as the N50. Can you explain the difference between both assemblies? Regarding the raw data containment, how different are both assemblies? Which metric do you find more relevant for metagenomics? Can you think of other metric to assess the quality of an assembly?

Key Points