Getting to know your files

Overview

Teaching: 120 min
Exercises: 0 min

Objectives

Interpret different file formats used in bioinformatics

Utilize bash commands to look into sequence data

Get a basic understanding of what sequence data looks like

Work with sequence data in Python or R to create basic plots

Navigating your filesystem space

First, let’s copy the sequence data to your own home folders.

Exercise - get your sequence data

The data is in the folder /work/groups/VEO/shared_data/Viromics2024Workspace/data/sequences/viral_metagenome

# make a viromics/data/sequences directory IN YOUR OWN HOME DIRECTORY if it doesn't already exist
mkdir -p ~/viromics/data/sequences

# go into your sequences directory
cd ~/viromics/data/sequences

# copy over the 3 barcode files you need
cp /work/groups/VEO/shared_data/Viromics2024Workspace/data/sequences/viral_metagenome/full_barcode*.fastq.gz ./
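After copying, it can be worth checking that the files arrived and are not truncated. The sketch below is one possible check, not part of the course material: it builds a tiny stand-in file in a scratch directory (the name full_barcode_demo.fastq.gz and the read in it are made up) so that it runs anywhere. On the cluster you would point the same commands at your own full_barcode*.fastq.gz files.

```shell
# Stand-in for a copied file: a tiny gzipped FASTQ in a scratch directory (hypothetical name)
scratch=$(mktemp -d)
printf '%s\n' '@read1' 'ACGT' '+' 'IIII' | gzip > "$scratch/full_barcode_demo.fastq.gz"

# Show the size of each file; a size of 0 would suggest a failed copy
ls -lh "$scratch"/full_barcode*.fastq.gz

# gzip -t tests the integrity of a compressed file without writing the decompressed data
status="ok"
for f in "$scratch"/full_barcode*.fastq.gz; do
  if gzip -t "$f"; then
    echo "intact: $f"
  else
    echo "damaged: $f"
    status="damaged"
  fi
done
```

If a file is reported as damaged, copying it again is usually the simplest fix.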

Exercise - check out your files

Take a look at the file structure and the files. We have given you a subset of viromics data in ./data/sequences.

Hint: gzipped files need to be decompressed with zcat first, e.g. zcat file.fastq.gz | head -10

Some helpful commands might be:
# list the contents of a directory (ll is a shell alias for a long listing and may not exist on every system)
ls
ll

# path to working directory
pwd

# print first 10 lines
zcat file.fastq.gz | head -10
# print last 10 lines
zcat file.fastq.gz | tail -10

# open a file in the terminal (press 'q' to exit)
less file.txt
more file.txt

# count number of lines
wc -l file.txt

# search for "@" inside file 
grep "@" file.fastq

# or, for gzipped files
zgrep "@" file.fastq.gz | head -10

# count number of sequence headers in a fasta file
grep -c ">" file.fasta
zgrep -c ">" file.fasta.gz
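The commands above can be combined to count the reads in a FASTQ file. The sketch below assumes the common layout in which every read takes exactly 4 lines (header, sequence, "+", quality), and it uses a made-up demo file rather than the course data. It also shows why counting "@" lines is less reliable for FASTQ than counting ">" is for FASTA: "@" is a valid quality character, so a quality line can start with it.

```shell
# Build a tiny demo FASTQ with 3 reads in a scratch directory (file name and reads are made up)
workdir=$(mktemp -d)
printf '%s\n' \
  '@read1' 'ACGTACGT' '+' 'IIIIIIII' \
  '@read2' 'GGCCTTAA' '+' '@@IIHHGG' \
  '@read3' 'TTTTAAAA' '+' 'FFFFFFFF' \
  | gzip > "$workdir/demo.fastq.gz"

# Assuming 4 lines per read: number of reads = number of lines / 4
n_lines=$(zcat "$workdir/demo.fastq.gz" | wc -l)
n_reads=$(( n_lines / 4 ))
echo "lines: $n_lines, reads: $n_reads"

# Counting lines that start with "@" overcounts here,
# because the quality line of read2 also starts with "@"
n_at=$(zcat "$workdir/demo.fastq.gz" | grep -c '^@')
echo "lines starting with @: $n_at"
```

In this demo the line count gives 3 reads, while the "@" count gives 4.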

See more useful commands and one-liners here

When creating new directories, please give each one a number and a meaningful name. An example is here
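As a rough illustration of this naming scheme, the sketch below creates three numbered directories inside a scratch location. The directory names are hypothetical and only show the pattern; they are not the layout prescribed by the course.

```shell
# Hypothetical layout: the leading numbers keep directories in analysis order when listed
project=$(mktemp -d)/viromics
mkdir -p "$project/01_raw_reads" "$project/02_quality_control" "$project/03_assembly"

# ls sorts by name, so the numbered directories appear in the order of the workflow
ls "$project"
```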

Understanding bioinformatics file formats

It is always a good idea to check the contents of your data files and make regular “sanity checks” to confirm that you understand what is in them. The first video (9 minutes) covers file formats commonly used in bioinformatics: FASTA, FASTQ, BAM, SAM, VCF, BCF, GFF, GTF and BED.

The second video (11 minutes) covers different types of sequence files, including FASTA and FASTQ, Phred scores, files containing metadata (TSV files) and compressed files.
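To make Phred scores more concrete, here is a small sketch of how a quality character in a FASTQ file maps to an error probability. It assumes the widely used Phred+33 encoding, where the score is Q = ASCII code - 33 and the error probability is P = 10^(-Q/10). The function name phred_score is made up for this illustration.

```shell
# Convert one Phred+33 quality character into its quality score Q
phred_score() {
  local code
  # printf with a leading quote in the argument prints the character's numeric code
  code=$(printf '%d' "'$1")
  echo $(( code - 33 ))
}

# Three example characters: lowest quality, Q20 and Q40
for ch in '!' '5' 'I'; do
  q=$(phred_score "$ch")
  # P(error) = 10^(-Q/10)
  p=$(awk -v q="$q" 'BEGIN { printf "%.5f", 10^(-q/10) }')
  echo "char $ch  Q=$q  P(error)=$p"
done
```

Under this encoding, "I" corresponds to Q40, which is an expected error rate of 1 in 10,000 bases.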

Exercise

What information is contained in sequencing files?
a) Choose 3 file formats and describe them

Choose the barcode62.fastq.gz file in your data/sequences/ folder
a) How many lines does this file have?
b) How many sequences are present in the file?

Print the first 5 lines and the last 5 lines in the terminal.
Hint: gzipped files need to be decompressed with zcat first, e.g. zcat file.fastq.gz | head -10
You don’t need to copy this output into your lab book.
To print only the first 100 characters of each of the first 10 lines, use: zcat barcode62.fastq.gz | cut -b -100 | head -10
a) What differences do you observe between these lines?

Key Points

Sequence data and information about genes and genomes can come in many different formats

The most common file formats are FASTA (nucleotide and amino acid), FASTQ, SAM and BAM, GenBank, GFF and TSV files
