Getting to know your files

Overview

Teaching: 120 min
Exercises: 0 min
Objectives
  • Interpret different file formats used in bioinformatics

  • Utilize bash commands to look into sequence data

  • Get a basic understanding of what sequence data looks like

  • Work with sequence data in Python or R to create basic plots

First, let’s copy the sequence data to your own home folders.

Exercise - get your sequence data

The data is in the folder /work/groups/VEO/shared_data/Viromics2024Workspace/data/sequences/viral_metagenome


# make a viromics/data/sequences directory IN YOUR OWN HOME DIRECTORY if it doesn't already exist

mkdir -p viromics/data/sequences

# go into your sequences directory
cd viromics/data/sequences

# copy over the 3 barcode files you need
cp /work/groups/VEO/shared_data/Viromics2024Workspace/data/sequences/viral_metagenome/full_barcode*.fastq.gz ./
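
After copying compressed files it can be worth confirming that they arrived intact. The sketch below is one possible check, not part of the original instructions: it uses gzip -t, which tests a compressed file without unpacking it. The file demo_barcode01.fastq.gz and the variable names are made up for illustration, so that the snippet runs anywhere.

```shell
# Create a small gzipped FASTQ in a temporary directory (hypothetical demo file).
demo_dir=$(mktemp -d)
printf '@read1\nACGT\n+\nIIII\n' | gzip -c > "$demo_dir/demo_barcode01.fastq.gz"

# gzip -t exits with status 0 if the compressed file is intact.
status="ok"
for f in "$demo_dir"/*.fastq.gz; do
    if gzip -t "$f"; then
        echo "intact: $f"
    else
        echo "problem: $f"
        status="failed"
    fi
done
echo "overall: $status"

# Remove the temporary demo directory again.
rm -r "$demo_dir"
```

To check your own copies, run the same loop inside your sequences directory with ./*.fastq.gz in place of the demo path.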

Exercise - check out your files

  1. Take a look at the file structure and files. We have given you a subset of viromics data in ./data/sequences. Some helpful commands are listed below. Hint: gzipped files need to be decompressed with zcat first, e.g. zcat file.fastq.gz | head -10
# list files (ll is a common alias for a long listing such as ls -l and may not exist on every system)
ls
ll

# path to working directory
pwd

# print first 10 lines
zcat file.fastq.gz | head -10
# print last 10 lines
zcat file.fastq.gz | tail -10

# open file on terminal (press 'q' to exit less or more)
less file.txt
more file.txt

# count number of lines
wc -l file.txt

# search for lines containing "@" inside file (fastq headers start with "@", but quality lines can contain "@" too)
grep "@" file.fastq

# or gzipped files
zgrep "@" file.fastq.gz | head -10

# count number of sequence headers in a fasta file
grep -c ">" file.fasta
zgrep -c ">" file.fasta.gz
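
The commands above can be combined to count the reads in a FASTQ file. The following is a minimal sketch on a made-up two-read file (demo.fastq.gz and all variable names are hypothetical). It assumes the standard layout of four lines per record, and it uses gzip -cd, which does the same job as zcat. It also shows why counting "@" matches can overestimate the number of reads.

```shell
# Build a tiny two-read gzipped FASTQ in a temporary directory (made-up demo data).
demo_dir=$(mktemp -d)
printf '@read1\nACGT\n+\nII@I\n@read2\nGGCC\n+\n@III\n' | gzip -c > "$demo_dir/demo.fastq.gz"

# Each FASTQ record spans 4 lines, so number of reads = total lines / 4.
n_lines=$(gzip -cd "$demo_dir/demo.fastq.gz" | wc -l)
n_reads=$((n_lines / 4))

# Counting lines that contain "@" also counts quality lines with an "@" in them.
n_at=$(gzip -cd "$demo_dir/demo.fastq.gz" | grep -c "@")

echo "lines: $n_lines  reads: $n_reads  lines containing @: $n_at"

# Remove the temporary demo directory again.
rm -r "$demo_dir"
```

In this demo the file holds 2 reads, yet 4 lines contain "@", because "@" is also a valid quality character.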

See more useful commands and one-liners here

When creating new directories, please give each one a number and a meaningful name. An example is here

Understanding bioinformatics file formats

It is always a good idea to check the contents of your data files and make regular “sanity checks” to see if you understand everything. The first video covers file formats commonly used in bioinformatics, namely FASTA, FASTQ, BAM, SAM, VCF, BCF, GFF, GTF and BED files (9 minutes).

Introduction to common file formats in bioinformatics & computational biology

The second video covers different types of sequence files, including FASTA and FASTQ, Phred scores, files containing metadata (TSV files) and compressed files (11 minutes).

Basic file formats in bioinformatics
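
To make the Phred scores mentioned above more concrete, here is a small sketch that decodes the quality line of one FASTQ record. The record is made up, and the sketch assumes the common Phred+33 encoding, where the score is the ASCII code of the quality character minus 33. Your own data may need to be checked against that assumption.

```shell
# One made-up FASTQ record: header, sequence, separator, quality string.
record=$(printf '@read1\nACGT\n+\nI5+!\n')

# The 4th line of a record holds one quality character per base.
quality=$(printf '%s\n' "$record" | sed -n '4p')

# Assumed Phred+33 encoding: score = ASCII code of the character minus 33.
scores=""
i=1
while [ "$i" -le "${#quality}" ]; do
    char=$(printf '%s' "$quality" | cut -c "$i")
    ascii=$(printf '%d' "'$char")   # a leading quote makes printf return the character code
    scores="$scores $((ascii - 33))"
    i=$((i + 1))
done
scores=${scores# }   # drop the leading space

echo "quality string: $quality -> Phred scores: $scores"
```

For this record the characters I, 5, + and ! decode to the scores 40, 20, 10 and 0, so the first base is the most reliable call and the last one the least.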

Exercise

  1. What information is contained in sequencing files?
    a) Choose 3 file formats and describe them
  2. Choose the barcode62.fastq.gz file in your data/sequences/ folder
    a) How many lines does this file have?
    b) How many sequences are present in the file?
  3. Print the first 5 lines and the last 5 lines in the terminal.
    Hint: gzipped files need to be decompressed with zcat first, e.g. zcat file.fastq.gz | head -10
    Hint: to print only the first 100 characters of the first 10 lines, use zcat barcode62.fastq.gz | cut -b -100 | head -10
    You don’t need to print this output into your lab-book
    a) What differences do you observe between these lines?

Key Points

  • Sequence data and information about genes and genomes can come in many different formats

  • The most common file formats are FASTA (nucleotide and amino acid), FASTQ, SAM and BAM, GenBank, GFF and TSV files