Assembly of a metavirome

Overview

Teaching: 30 min
Exercises: 210 min

Objectives

Assemble a metavirome from reads

In this lesson, you will assemble the metavirome using the tool Flye introduced in this article. To then get a first idea about the diversity of the viral sequences contained in it, we will use vclust to group sequences together based on their similarity. Flye was designed to work well with noisy long reads and for metagenomic samples. Since you work on Draco, everything even slightly computationally expensive will be run through slurm. Please organize all the following steps in one or more sbatch scripts, as you learned yesterday. Since every tool needs different resources, it is recommended to have a single script per tool. All code snippets presented here assume that you put them in an adequate sbatch script. The necessary resources are mentioned in the comments or descriptions.

Assembly of a metavirome

There are multiple ways in which you can assemble reads from multiple samples as we have in our experimental data. We can just combine all data we have (replicates and conditions), to increase the coverage per virus. This can often be beneficial if the sequencing depth is not deep enough in the individual samples. Alternatively, we can assemble each sample indivudally. This preserves differences between samples which could get lost otherwise. Here, we will focus on a single assembly of a single file.

Exercise - Use flye to assemble one sample

To run a single sample assembly, you can use flye and output the assemblys into the folder 10_results_assembly_flye:
# The assembly can require large amounts of memory and time, depending on the number of reads
# and the diversity of the virome we want to study. Since we reduced the number of reads
# specifically to get a quick assembly, we can use about 50GB of RAM and 10 cores.

flye --nano-hq /path/to/barcode$barcode.fastq.gz --meta --genome-size 30m -t 10
Here you can find an overview over the possible parameters.

Flye can be used for single-organism assemblies as well as metagenomic assemblies. Your sequences were generated with the Nanopore MinION platform and filtered to contain only high quality reads. You can pass an estimate of the size of your metagenome to flye, i.e., the combined length of the assembled contigs. It is difficult to accurately predict this if you do not know what’s in your samples (as in our case). How would you very roughly estimate this number by upper and lower bounds?

After you have finalized your sbatch script with the resource assignments and the completed commands, you can run it to submit the assembly as a job to the Draco, our computational cluster. You can check the output of your script in the slurm log files which you can set with the sbatch parameters at the beginning of your script. Check both files. Often, the error output contains more than just errors, and this file is the more informative one.
#SBATCH --output=10_results_assembly_flye/assembly_flye.slurm.%j.out
#SBATCH --error=10_results_assembly_flye/assembly_flye.slurm.%j.err
Flye creates an assembly in multiple steps. You can read through the Flye log file that you defined with the #SBATCH --error= parameter. Afterward, you can find a list of all assembled contigs with additional information in the file assembly_info.txt, located in the Flye output folder.

What are the longest and shortest contig lengths?

What is the range of depth (coverage) values?

How many circular contigs were assembled? Do you think this percentage is high or low?

How does your guess about the size of the metagenome (above) compare to the actual size?

sbatch script for the assembly

#!/bin/bash
#SBATCH --tasks=1
#SBATCH --cpus-per-task=32
#SBATCH --partition=short,standard
#SBATCH --mem=20G
#SBATCH --time=01:00:00
#SBATCH --job-name=assembly_flye
#SBATCH --output=10_assembly_flye/assembly_flye.slurm.%j.out
#SBATCH --error=10_assembly_flye/assembly_flye.slurm.%j.err

# run flye in metagenomic mode for de-novo assembly of viral contigs
# First, activate the conda environment which holds the flye installation on draco:

# First, activate the conda environment which holds the flye installation on draco:
source /vast/groups/VEO/tools/miniconda3_2024/etc/profile.d/conda.sh && conda activate flye_v2.9.2

# set a data directory holding your samples in .fastq.gz format
datadir="../1.1_QC/20_chopper"

# single assemblies
# Run flye on all samples sequentially, save the results in separate folders named like the samples:

outdir="10_assembly_flye"

flye --nano-hq $datadir/virome_filtered_reads.fastq.gz --meta --genome-size 10m --out-dir $outdir -t 30

conda deactivate

Estimating the sequence diversity in your assembly

Which viruses did we sequence now? Without further analysis, we only have a nucleotide sequence and corresponding abundance estimates from the assembly. In the next days we will go through several steps to characterize the sequences we assembled now in more detail. Now, as a first overview over the diversity of the sequences, we will cluster them at 95% Average Nucleotide Identify (ANI) over 85% of the length of the shorter sequence. These cutoffs are often used to cluster viral genomes at the species rank. This can be done with the tool vClust and results in both, clustering very similar complete viral genomes and grouping genome fragments along with similar and longer sequences.

Exercise - use vClust to find simililar contigs

vClust can also output cluster representatives, which will be the longest sequences within a cluster. We use this mode to get the information we need to to estimate the diversity of the virome we sequenced. You can play with the similarity cutoffs (and metrics) used for clustering to see how they affect the results. vClust needs to align all sequences to each other and can run heavily in parallel. Remember to put this step into a sbatch script again and assign around 30 cores and 20GB of RAM.
# vClust is a python script and can be run by simply calling it on draco
# you have to run it with python 3.9
vclust='python3.9 /home/groups/VEO/tools/vclust/v1.0.3/vclust.py'

$vclust prefilter -i 10_results_assembly_flye/single_assemblies/assembly.fasta ...
$vclust align ...
$vclust cluster ...
After vClust ran successfully, you can use a Python script to analyze the clusters computed by vClust. To parse a tabular file in the format of CSV or TSV (comma- or tab-separated values), the Python package pandas can be used. If its not in your virtual environment, install it now.

Pandas can be used to analyze the output of vClust. Some interesting aspects of the clusters we can compute from the cluster assignments allone are the number of clusters and how large the clusters are.
import pandas as pd
import os
from collections import Counter

# read the tsv file generated by vClust
cluster_df = pd.read_csv('path/to/the/file.tsv', sep='\t')

# use python's Counter object to count the "cluster" column in the vClust output table
cluster_counter = Counter(cluster_df["cluster"])

# use the counter object to compute how many clusters there are and how large they are
...

Python script for analyzing cluster sizes

import os
import pandas as pd
from collections import Counter


vclust_results_path = './1.2_assembly/20_results_assessment_vclust/'

for results_file in ("clusters_tani_90.tsv", "clusters_tani_70.tsv", "clusters_ani_90.tsv"):
    # read the tsv file generated by vClust
    cluster_df = pd.read_csv(os.path.join(vclust_results_path, results_file), sep='\t')

    # use python's Counter object to count the "cluster" column in the vClust output table
    cluster_counter = Counter(cluster_df["cluster"])
    n_clusters = len(cluster_counter)
    largest_clusters = cluster_counter.most_common(10)

    # output number of clusters and the sizes of the largest clusters
    print(results_file)
    print(f"Number of clusters: {len(clusters}")
    print("\n".join([f"{mc[0]}: {mc[1]}" for mc in largest_clusters]))

sbatch script for running vclust

#!/bin/bash
#SBATCH --tasks=1
#SBATCH --cpus-per-task=32
#SBATCH --partition=short,standard
#SBATCH --mem=10G
#SBATCH --time=01:00:00
#SBATCH --job-name=vclust
#SBATCH --output=20_vclust/vclust.slurm.%j.out
#SBATCH --error=20_vclust/vclust.slurm.%j.err


vclust='python3.9 /home/groups/VEO/tools/vclust/v1.2.7/vclust/vclust.py'

indir='10_assembly_flye/'
outdir='20_vclust/'

$vclust prefilter -i $indir/assembly.fasta -o $outdir/fltr.txt
$vclust align -i $indir/assembly.fasta -o $outdir/ani.tsv -t 30 --filter $outdir/fltr.txt
$vclust cluster -i $outdir/ani.tsv -o $outdir/clusters_tani_90.tsv --ids $outdir/ani.ids.tsv --metric tani --tani 0.90 --out-repr
$vclust cluster -i $outdir/ani.tsv -o $outdir/clusters_tani_70.tsv --ids $outdir/ani.ids.tsv --metric tani --tani 0.70 --out-repr
$vclust cluster -i $outdir/ani.tsv -o $outdir/clusters_ani_90.tsv --ids $outdir/ani.ids.tsv --metric ani --ani 0.90 --out-repr

source ../py3env/bin/activate

python3 ../python_scripts/1.2_vclust_results.py

deactivate

Questions - Go through the results of vClust

vClust is a great tool to get a feeling for the diversity within your assemblies (or any kind of set of sequences).

How many sequence clusters were found in your assembly and how many sequences were grouped in the largest one?

How do the results depend on the choice of your metric (ANI, TANI, or GANI) and your cutoff values?

Key Points

Flye can be used to assemble long and noisy nanopore reads from metagenomic samples.

Samples can be assembled individually and combined in a cross-assembly

vClust can be used to assess the diversity of sequences in your assembly

previous episode

Viromics2025

next episode

Assembly of a metavirome

Overview