Assembly and cross-assembly

Overview

Teaching: 30 min
Exercises: 180 min
Objectives
  • Assemble the metavirome from 3 samples

In this lesson, you will assemble the metavirome in two different ways using the tool Flye introduced in this article. Flye was designed to work well with noisy long reads and for metagenomic samples. Since you work on Draco, everything even slightly computationally expensive will be run through slurm. Please organize all the following steps in one or more sbatch scripts, as you learned yesterday. Since every tool needs different resources, it is recommended to have a single script per tool. All code snippets presented here assume that you put them in an adequate sbatch script. The necessary resources are mentioned in the comments or descriptions.

Cross-assembly

In a cross-assembly, reads from multiple samples are combined and assembled together. This allows for the discovery of shared sequence elements between the samples. If a virus (or other sequence element) is present in several samples, its sequencing reads from the different samples may be assembled together in one contig. Of course, whether this happens depends on how similar the sequence elements are, and on the stringency settings of the assembly algorithm. To figure out which contigs are present in which sample, we can map the sequencing reads from each sample to the contigs, and estimate the abundance by measuring the depth of coverage, i.e. the number of times that each nucleotide occurs in the reads. (We note that this can also be done with contigs that were assembled from a single metagenome.)

Exercise - Use flye to assemble a metagenome

You will perform a cross-assembly of the samples you started working with yesterday. To this end, you will first concatenate the sample files and then run Flye on the merged file. Gzipped files can be concatenated just like text files with the command line tool cat. Flye is installed in a conda environment on Draco.

# activate conda environment with flye installation on Draco
source /vast/groups/VEO/tools/miniconda3_2024/etc/profile.d/conda.sh && conda activate flye_v2.9.2

# merge the sequences
cat /path/to/*.fastq.gz > /path/to/all_samples.fastq.gz

# create a folder for the output cross-assembly
mkdir -p 10_results_assembly_flye/cross_assembly

# complete the flye command. This is the computationally expensive part
# and profits from many cores (30 is a good number). The used memory should
# not exceed 20GB of RAM. Think about the parameters you can pass to flye listed
# in the link below.
flye --nano-raw /path/to/all_samples.fastq.gz --meta --out-dir 10_results_assembly_flye/cross_assembly

Here you can find an overview over the possible parameters. Flye can be used for single-organism assemblies as well as metagenomic assemblies. Your sequences were generated with the Nanopore MinION platform and filtered to contain only high quality reads. You can pass an estimate of the size of your metagenome to flye, i.e., the combined length of the assembled contigs. It is difficult to accurately predict this if you do not know what’s in your samples (as in our case). How would you very roughly estimate this number?

After you have finalized your sbatch script with the resource assignments and the completed commands, you can either run it directly, or continue to expand it with the commands for the second approach described below. If you run it now, remember you can check the output of your script in the slurm log files which you can set with the sbatch parameters at the beginning of your script. Check both files. Often, the error output contains more than just errors, and this file is the more informative one.

#SBATCH --output=10_results_assembly_flye/assembly_flye.slurm.%j.out
#SBATCH --error=10_results_assembly_flye/assembly_flye.slurm.%j.err

Flye creates an assembly in multiple steps. You can read through the Flye log file that you defined with the #SBATCH --error= parameter. Afterward, you can find a list of all assembled contigs with additional information in the file assembly_info.txt, located in the Flye output folder.

  1. What are the longest and shortest contig lengths?
  2. What is the range of depth (coverage) values?
  3. How many circular contigs were assembled? Do you think this percentage is high or low?
  4. How does your guess about the size of the metagenome (above) compare to the actual size?

Separate assemblies

This part is recommended but optional. If you cannot finish it in the suggested time, move on to the next section “Assessing assembly quality”, which will be important for the coming days.

The second approach consists of performing separate assemblies for each sample and merging the resulting contigs by similarity in the end. Note that if a species is present in several samples, this final set will contain multiple contigs representing the same sequence, each coming from one sample. Because of this, we will further de-replicate the final contigs to get representative sequences.

Exercise - Use flye to assemble each sample individually

To run separate assemblies, you can adapt the flye command used for the cross-assembly to take each sample file individually and output the assemblies in separate folders in 10_results_assembly_flye. For this you can run a for loop over the respective files. The following expects the sample files to be called barcodeN.fastq.gz with N in (62, 63, 64):

# The individual assemblies need less memory than the cross-assembly,
# but you can still use the same resources as before.
for barcode in $(seq 62 64)
do
	flye ... /path/to/barcode$barcode.fastq.gz ...
done

Here, the command seq generates a sequence of integer numbers between its two arguments. Once the assemblies have finished, you will combine the contigs generated for each sample into a single file. Since the generated contigs are only assigned numbers by flye (not necessarily sequential), the same names will be present in each assembly. We have to rename them according to the sample they originate from for all contigs to have unique names. We can do this using Python and the Biopython package. Biopython provides many tools for the analysis of sequencing data, including tools for parsing and writing .fasta files. You can find the documentation on these here.

To use this Python package on Draco, use the virtual environment you set up yesterday. If you did not already install the biopython package yesterday, do so now. You can now write a python script that reads your contigs, changes their name, and writes a new file with the results using biopython. The basic structure of the script could look like this:

from Bio import SeqIO
import os, sys

# read the arguments passed to your script. They are stored as a list in sys.argv.
# sys.argv contains the parameters passed to the python binary, so sys.argv[0]
# always contains the file name of your script. The arguments passed to the script
# then start at sys.argv[1]
assembly = os.path.abspath(sys.argv[1])

# open a file handle using the with open(x) as y syntax. This ensures the file is
# closed properly after the code block.
with open(assembly) as file_handle:
  for record in SeqIO.parse(file_handle, "fasta"):
    # do something with the sequence name

# write the records with the new names
...

This script should not be computationally expensive, but we will anyways execute it from within an sbatch script. We can also combine it with the script or scripts you already have written for the assemblies themselves. In this case, it is important to deactivate the conda environment holding the flye installation before activating the environment you just created.

python script for renaming contigs

import os, sys
from Bio import SeqIO

def main():

   # access parameters passed to your script
   assembly = os.path.abspath(sys.argv[1])

   # throw an error if the statement behind assert is not true
   assert assembly.endswith(".fasta")
   
   out_fasta = os.path.abspath(sys.argv[2])
   assert out_fasta.endswith(".fasta")
   
   sample_id = sys.argv[3]
   
   # modify names of the scaffols and store in to_write list
   to_write = list()
   with open(assembly) as handle:
       for record in SeqIO.parse(handle, "fasta"):
           # make sure not to do this twice, if run several times. Its not necessary.
           if record.id.startswith(sample_id): continue

           # adjust the id of the record object
           record.id = f"{sample_id}_{record.id}"

           # append the object to the list for writing back to a file later.
           to_write.append(record)

   # write records in to_write to .fasta file
   with open(out_fasta, "w") as fout:
       SeqIO.write(to_write, fout, "fasta")


if __name__ == "__main__":
   main()

sbatch script for assemblies

#!/bin/bash
#SBATCH --tasks=1
#SBATCH --cpus-per-task=32
#SBATCH --partition=short,standard,interactive
#SBATCH --mem=20G
#SBATCH --time=01:00:00
#SBATCH --job-name=assembly_flye
#SBATCH --output=./1.2_assembly/10_results_assembly_flye/assembly_flye.slurm.%j.out
#SBATCH --error=./1.2_assembly/10_results_assembly_flye/assembly_flye.slurm.%j.err

# run flye in metagenomic mode for de-novo assembly of viral contigs
# First, activate the conda environment which holds the flye installation on draco:


# First, activate the conda environment which holds the flye installation on draco:
source /vast/groups/VEO/tools/miniconda3_2024/etc/profile.d/conda.sh && conda activate flye_v2.9.2

# set a data directory holding your samples in .fastq.gz format
datadir="1.1_QC/20_chopper"

# create a merged fastq.gz file by concatenating three samples:
cat $datadir/barcode*_filtered_reads.fastq.gz > $datadir/merged_filtered.fastq.gz

# set a directory for ouputting results
outdir="./1.2_assembly/10_results_assembly_flye/cross_assembly"

# create a folder for the cross-assembly
mkdir -p $outdir

# flye parameters (https://gensoft.pasteur.fr/docs/Flye/2.9/USAGE.html)
# --nano-raw: tells flye about the input data. --nano-hq is also possible but results in a smaller assembly (Why?).
# --meta: for metagenomes with uneven coverage
# --genome-size: estimated size for metagenome assembly (mg size)
# -t: threads
flye --nano-raw $datadir/merged_filtered.fastq.gz --meta --genome-size 30m --out-dir $outdir -t 30

# single assemblies
# Run flye on all samples sequentially, save the results in separate folders named like the samples:

outdir="./1.2_assembly/10_results_assembly_flye/single_assemblies"
mkdir -p $outdir
for barcode in $(seq 62 64)
do
        flye --nano-raw $datadir/barcode${barcode}_filtered_reads.fastq.gz --meta --genome-size 10m --out-dir $outdir/barcode$barcode -t 30
done

# close the conda environment
conda deactivate

# activate your virtual environment
source ./py3env/bin/activate

# rename the contings in the generated single assemblies with a python script.
# It assumes three parameters here, first the input file name, second the output file name and third,
# the sample name to add to each contig name in the respective fasta file.
for barcode in $(seq 62 64)
do
        python3 python_scripts/assembly/10_run_rename_scaffolds.py $outdir/barcode$barcode/assembly.fasta $outdir/barcode$barcode.fasta barcode$barcode
done

deactivate

cat $outdir/barcode*.fasta > $outdir/assembly.fasta

Detecting similar sequences within and between assemblies

Some samples will contain the same virus strains. This will lead to the assembly of the same sequence in several times. To de-replicate the scaffolds of the single asseblies, we will cluster them at 95% Average Nucleotide Identify (ANI) over 85% of the length of the shorter sequence. These cutoffs are often used to cluster viral genomes at the species rank. This can be done with the tool vClust and results in both, clustering complete viral genomes at the species level and clustering genome fragments along with very similar and longer sequences.

Exercise - use vClust to find simililar contigs

vClust can also output cluster representatives, which will be the longest sequences within a cluster. Use this mode to get the information we need to dereplicate our merged single assemblies. Also apply vClust to the cross-assembly, this will give us more information on the assembled sequences. You can play with the similarity cutoffs (and metrics) used for clustering to see how they affect the results. vClust needs to align all sequences to each other and can run heavily in parallel. Remember to put this step into a sbatch script again and assign around 30 cores and 20GB of RAM.

# vClust is a python script and can be run by simply calling it on draco
# you have to run it with python 3.9
vclust='python3.9 /home/groups/VEO/tools/vclust/v1.0.3/vclust.py'

$vclust prefilter -i 10_results_assembly_flye/single_assemblies/assembly.fasta ...
$vclust align ...
$vclust cluster ...

After vClust ran successfully, you need to write a Python script to filter the assembly according to the information provided by vClust. To parse a tabular file in the format of CSV or TSV (comma- or tab-separated valus), the Python package pandas can be used. If its not in your virtual environment, install it now.

Now you can write a Python script that filters the assemblies corresponding to the output of vClust:

from Bio import SeqIO
import pandas as pd
import os, sys

# read the tsv file generated by vClust
cluster_df = pd.read_csv('path/to/the/file.tsv', sep='\t')
 
with open(assembly) as file_handle:
  for record in SeqIO.parse(file_handle, "fasta"):
    # check if the contig is in the set of representatives

# write the records which passed the test
...

Python script for filtering similar contigs

import os, sys
import pandas as pd
from Bio import SeqIO

def main():

    assembly_filename = os.path.abspath(sys.argv[1])
    assert assembly_filename.endswith(".fasta")
   
    out_filename = os.path.abspath(sys.argv[2])
    assert out_filename.endswith(".fasta")
   
    cluster_reps_filename = os.path.abspath(sys.argv[3])
    assert cluster_reps_filename.endswith(".tsv")

    cluster_reps_df = pd.read_csv(cluster_reps_filename, sep='\t')
    cluster_reps_set = set(cluster_reps_df["cluster"])
	
    to_write = list()
    with open(assembly_filename) as handle:
        for record in SeqIO.parse(handle, "fasta"):
            if record.id in cluster_reps_set: to_write.append(record)

    # write records in to_write to .fasta file
    with open(out_filename, "w") as fout:
        SeqIO.write(to_write, fout, "fasta")


if __name__ == "__main__":
    main()

sbatch script for dereplication

#!/bin/bash
#SBATCH --tasks=1
#SBATCH --cpus-per-task=32
#SBATCH --partition=short,standard,interactive
#SBATCH --mem=2G
#SBATCH --time=01:00:00
#SBATCH --job-name=assessment_vclust
#SBATCH --output=./1.2_assembly/20_results_assessment_vclust/assessment_vclust.slurm.%j.out
#SBATCH --error=./1.2_assembly/20_results_assessment_vclust/assessment_vclust.slurm.%j.err


vclust='python3.9 /home/groups/VEO/tools/vclust/v1.0.3/vclust.py'

indir='./1.2_assembly/10_results_assembly_flye/cross_assembly'
outdir='./1.2_assembly/20_results_assessment_vclust/cross_assembly'
mkdir -p $outdir

$vclust prefilter -i $indir/assembly.fasta -o $outdir/fltr.txt
$vclust align -i $indir/assembly.fasta -o $outdir/ani.tsv -t 30 --filter $outdir/fltr.txt
$vclust cluster -i $outdir/ani.tsv -o $outdir/clusters_tani_90.tsv --ids $outdir/ani.ids.tsv --metric tani --tani 0.90
$vclust cluster -i $outdir/ani.tsv -o $outdir/clusters_tani_70.tsv --ids $outdir/ani.ids.tsv --metric tani --tani 0.70
$vclust cluster -i $outdir/ani.tsv -o $outdir/clusters_ani_90.tsv --ids $outdir/ani.ids.tsv --metric ani --ani 0.90
$vclust cluster -i $outdir/ani.tsv -o $outdir/clusterreps.tsv --ids $outdir/ani.ids.tsv --algorithm uclust --metric ani --ani 0.95 --cov 0.85 --out-repr

source ./py3env/bin/activate

python3 ./python_scripts/assembly/20_run_filter_representatives.py $indir/assembly.fasta $outdir/assembly.fasta $outdir/clusterreps.tsv

deactivate

echo Done running python script for cross-assembly...

indir='./1.2_assembly/10_results_assembly_flye/single_assemblies'
outdir='./1.2_assembly/20_results_assessment_vclust/single_assemblies'
mkdir -p $outdir

$vclust prefilter -i $indir/assembly.fasta -o $outdir/fltr.txt
$vclust align -i $indir/assembly.fasta -o $outdir/ani.tsv -t 30 --filter $outdir/fltr.txt
$vclust cluster -i $outdir/ani.tsv -o $outdir/clusters_tani_90.tsv --ids $outdir/ani.ids.tsv --metric tani --tani 0.90
$vclust cluster -i $outdir/ani.tsv -o $outdir/clusters_tani_70.tsv --ids $outdir/ani.ids.tsv --metric tani --tani 0.70
$vclust cluster -i $outdir/ani.tsv -o $outdir/clusters_ani_90.tsv --ids $outdir/ani.ids.tsv --metric ani --ani 0.90
$vclust cluster -i $outdir/ani.tsv -o $outdir/clusterreps.tsv --ids $outdir/ani.ids.tsv --algorithm uclust --metric ani --ani 0.95 --cov 0.85 --out-repr

source ./py3env/bin/activate

python3 ./python_scripts/assembly/20_run_filter_representatives.py $indir/assembly.fasta $outdir/assembly.fasta $outdir/clusterreps.tsv

deactivate

Questions - Go through the results of vClust

vClust is a great tool to get a feeling for the diversity within your assemblies (or any kind of set of sequences).

  1. How many similar sequences were found in each assembly?
  2. How do the results depend on the choice of your metric (ANI, TANI, or GANI) and your cutoff values?

Key Points

  • Flye can be used to assemble long and noisy nanopore reads from metagenomic samples.

  • Samples can be assembled individually and combined in a cross-assembly

  • vClust can be used to assess the diversity of sequences in your assembly