Lab 8: Aligning sequences to a reference genome

Lab 8 Overview

Many applications in genomics and bioinformatics depend on aligning sequenced data to a reference genome. These are called “re-sequencing” projects. In this lab, we will align cholera strains isolated from the 2010 Haitian outbreak to the Vibrio cholerae reference genome and view our alignments. Eventually, we will use an expanded dataset to build a phylogenetic tree to address the origins and spread of the 2010 Haitian cholera outbreak.

We will use a program called BWA (Burrows Wheeler Aligner, Li and Durbin 2009) to rapidly map reads to the reference genome. When we align or “map” reads against a reference, what we mean is that we want to go from a FASTQ file listing lots of reads to another type of file which lists the reads and where/if it maps against the reference genome. The second file is called a SAM file and will be discussed later. BWA also allows for a certain number of mismatches to account for variation present in our strain vs. the reference genome.

Tidy up and data prep!

Log on to the server and move into your /workdir/genomics/your_id/Data directory.
You will need the files below to complete the tasks in this lab. Copy any files you do not already have from /workdir/genomics/Data into your /workdir/genomics/your_id/Data directory.

SRR531199_1.fastq.gz
SRR531199_2.fastq.gz
VibrioCholerae_N16961_genome.fna – the reference genome of V. cholerae
VibrioCholerae_N16961_genes.gff – a file describing the genes of V. cholerae

Make a new directory called Lab7. Move all other data and files to this Lab7 directory.

Align sequences to a reference genome

Let’s start with aligning the 500,000 reads isolated from the Haitian cholera outbreak to our reference genome. Perform adaptor and quality trimming using trimmomatic, as you have done previously. Name your output files like SRR531199_1_500K_paired.fastq, SRR531199_1_500K_unpaired.fastq, ….

Now, we are ready to turn to our alignment step. Before we can start aligning reads to a reference genome, the genome needs to be “indexed”. This means sorting the genome into easily searchable chunks. To index the reference genome, use the following command:

$ bwa index reference_file_name

TASK: Align reads 1 and 2 to the reference V. cholerae genome. We will be using a specific algorithm within the bwa package called mem (Maximal Exact Matches). Use the bwa mem help file ($ bwa mem) to write a command that will align our reads to the reference genome. You will want to run the alignment while increasing the number of threads to 4 (-t 4). Redirect the output to a file called SRR531199.sam.

Specify the full command you entered.

How many lines does your SAM file have?
|
|
|
|

The `SAM` header

Your raw alignment is stored in what’s called the SAM (Simple Alignment Method) format. This is why we gave your step #3 output the .sam extension. Because SAM files are plain text format, you can view your file with less or head.

SAM files consist of a header and an alignment section. A SAM file header for a C. elegans alignment might look something like the file below. C. elegans has six chromosomes (I-V, X) and chromosome I is 15,072,434 bp long.

@SQ SN:I LN:15072434
@SQ SN:II LN:15279421
@SQ SN:III LN:13783801
@SQ SN:IV LN:17493829
@SQ SN:V LN:20924180
@SQ SN:X LN:17718942
@RG ID:VB00023_L001 SM: celegans-01
@PG ID:bwa PN:bwac VN:0.7.10-r789 […]
I_2011868_2012306_0:0:0_0:0_2489 83 I 2012257 40 […]

Term	Meaning
`@SQ`	stores information about the reference sequence (e.g., chromosome number and length)
`@RG`	contains important information about read group and sample metadata
`ID`	sample identifier and must be unique
`SM`	keeps track of the sample information
`PL`	keys for platform identification.
`@PG`	contains metadata about the programs used to create and process a set of SAM/BAM files like program name and version
`I_2011868`	first line of your alignment section (i.e., no “@”; see below)

NOTE: your file does not yet have an “@RG” header line. We will add this in a later lab.

In addition to using head or less to view a sam file, you can extract and view only the header using a program called SAMTOOLS. SAMTOOLS is very useful for interacting with SAM or BAM (more later…) files.

TASK: View your header file with SAMTOOLS using the command below. Then answer the questions below.

$ samtools view -H file_name

What are the chromosome names and how long are they?

|
|
|

The `SAM` alignments

The alignment section of a SAM file contains read alignments (and may include reads that did not align). Each alignment entry is composed of 11 required fields and may have some optional fields.

Let’s step through each field in an alignment. Because alignment lines are quite lengthy and would overflow the page width, return each field on a new line. To do this, use $ samtools view with no options to return only alignment lines (i.e., no header lines). Then, employ a nifty tool called tr to convert tabs to newlines. Finally, just take the first 11 lines of this output:

$ samtools view f1.sam | tr '\t' '\n' | head -n 11

Term	Meaning
`SRR531199.2`	QNAME, query name (i.e., sequenced read’s name)
`83`	bitwise flag, contains info about the alignment
`NC_002505.1`	RNAME, reference name, tells which chromosome the read aligned to
`1677874`	POS, position on the reference sequence of the first mapping base (leftmost) in the query sequence
`60`	MAPQ, mapping quality, measures how likely the read is to actually originate from the position it maps to
`@72M`	CIGAR, CIGAR strings are specialized formats for describing the alignment
`=`	RNEXT, paired end (i.e, “next read”) is named properly
`1677781`	RPOS, position of paired end read
`-165`	TLEN, template or “insert” length for paired-end reads
`GCCTAT[…]`	SEQ, stores the original sequence, will always be in the orientation it’s aligned in
`@@@DD=[…]`	QUAL, stores the original read base quality

TASK: A mapping quality of “60” is the highest quality possible. Use a combination of commands to determine (1) how many sequences have a mapping quality lower than 60 and (2) the lowest observed mapping quality.

How do you interpret your lowest mapping quality (Google may help here)?

|
|
|
|

Use a combination of commands to determine the ranges of insert sizes in your data (i.e., the shortest (3) and longest (4) insert). Hint: you can restrict your search to positive numbers. For each negative insert size, the paired read will have the same positive value.

Do your answers above surprise you in any way?

|
|
|
|

TASK: One common filtering step is to restrict alignments to only uniquely mapped reads. Write a UNIX command that generates a new file, filtered.sam, containing only uniquely mapped reads. You will need to generate a header file, the filtered reads, and then concatenate those together.
Specify your command below and determine the number of reads before and after filtering.

Number of reads raw SAM:

Number of reads filtered SAM:

`SAM` to `BAM`

SAM files are nice an human interpretable, but they can be quite large. BAM files are binary versions of SAM files. Therefore, BAM are not human readable, but they have the advantage of being much smaller than SAM files.

Convert your SAM file to a Binary Alignment Method (BAM) files and then sort it. $ samtools view converts SAM to BAM files with the -b option. We can then pipe that output to samtools sort and redirect the final output to a newly named BAM file, filtered.sort.bam.

Specify your command below:

Use head to check out the BAM and SAM files you generated. Does the BAM file make any sense to you?

TASK: Compare the sizes of your BAM and SAM files using the command below.

$ du -sh file_name

BAM file size:

SAM file size:

For the rest of the lab, we will be working with our BAM files, because they take up much less space. Plus, we can always convert back to a SAM file using samtools view. Given that, let’s clean up any large SAM files. Removing such intermediate files is a good practice when performing any workflow.

$ rm -i file_name

Visualize your alignment with `IGV`

Now, let’s visualize your alignment in a graphical framework using a program called the Integrated Genome Viewer, or IGV. IGV is an interactive, powerful GUI that allows for flexible and targeted viewing of alignments. However, it can also be a bit of a memory suck, so please be patient if it’s taking a while (or crashes…). To use IGV, we will need:

a sorted, binary version of your SAM file
an indexed file that references the file in # 1
our reference genome
a reference gene file

We have already got #1, #3, and #4. Now, let’s index our SAM file using samtools index.

$ samtools index bam_file

You will be utilizing IGV on your own computer. First, download IGV at this website (http://software.broadinstitute.org/software/igv/). Then, move the appropriate files to a folder on your desktop using sftp as you have done before.

To run IGV:

Start the IGV application on your local computer.
From the drop-down menu, select Genomes -> Load Genome from File…
Select your V. cholerae reference genome. You should now see two chromosome names (NC_002505.1 and NC_002506.1) in a drop-down menu at the top of the application.
To load your file, select File -> Load and choose your sorted BAM file.

Let’s explore IGV. Zoom in on a region by dragging a smaller window onto the chromosome. Then, double click until you see coverage. It should look something like this:

Horizontal bars represent reads (i.e., Illumina sequences).
The total number of reads overlapping a single nucleotide is coverage for that nucleotide (we previously calculated average, genome-wide coverage).
Gray reads represent reads with expected insert sizes.
Red reads indicate that the inferred insert size is larger than expected (and therefore a section of DNA is deleted in the subject genome relative to the reference).
Blue reads indicate that the inferred insert size is smaller than expected (and therefore a section of DNA is inserted in the subject genome relative to the reference.
Paired ends mapping to different chromosomes are colored by chromosome and indicate rearrangements.
You can read more about insert size interpretations here (http://software.broadinstitute.org/software/igv/interpreting_insert_size).

Right-click on the Alignments panel and select View as pairs to see the two components of paired-end reads represented together facing each other and connected with a line (you may need to zoom in to see this). Hovering your mouse over a specific pair of reads will give you a yellow box with detailed information for that pair. Please note that insert size refers to the total distance between the 5’-ends of two reads in a pair.

TASK: Hover over a gray read. What is the name of the read pair that you are hovering over? What is it’s insert size and what is each read’s mapping quality?

|
|
|
|

Hover over a blue read. What is the name of the read pair that you are hovering over? What is it’s insert size and what is each read’s mapping quality?

|
|
|
|

Hover over a red read. What is the name of the read pair that you are hovering over? What is it’s insert size and what is each read’s mapping quality?

|
|
|
|

Viewing genes with `IGV`

We’ve looked at IGV using the DNA sequence information from our reference, but we also may want to know something about genes (and variation within them). We can use VibrioCholerae_N16961_genes.gff to display information about genes.

From the drop-down menu, select Genomes -> Create .genome file
Assign your genome an appropriate Unique identifier and Descriptive name. Upload the VibrioCholerae_N16961_genome.fna file to the FASTA file box. Upload the VibrioCholerae_N16961_genes.gff to the Gene file box. Click OK.
To load your file, select File -> Load from File and choose your sorted BAM file.

Your screen should look similar, but with an additional row of horizontal blue bars in the Gene track. The thick bars are protein-coding regions and the thin bars are regions that do not code for proteins (but may be transcribed). V. cholerae is a bacteria and, therefore, has no introns. In eukaryotes, thin bars indicate introns and intergenic space (i.e., not transcribed) has no bars. Bacteria also have very high gene density. Notice how close the genes are.

TASK: Zoom in on a single gene. Hover over the genes and look for the gene name and product. Find a gene that has a described product (i.e., is not a hypothetical protein). What is the name of your chosen gene? What is its function?

|
|
|
|
|

Snip? SNP!

Single nucleotide polymorphisms or SNPS are nucleotide sites where the subject (i.e., your re-sequence data) varies from the reference. If you want to know about genetic variation in a population or across species, you would start by identifying SNPs.

TASK: Scroll through your IGV output until you find a site that you feel confident is a SNP (and that is different than the one below). This should be a site where all or nearly all of the reads differ from the reference in the same way. There are not many SNPs in the genome, so it may take a few minutes of searching until you find one.

An example of SNP (along with other important features) is below:

Zoom in on the selected SNP until you can read the reference nucleotide sequence:

Is your SNP in a protein-coding region of a gene? If not, find a SNP in a region that is in a protein-coding region?

|
|
|

What is the location of your SNP ?

|
|
|

What is the coverage at your SNP? To find this, simply count the number of reads that cover your SNP.

|
|
|

What nucleotide does the reference genome possess at this site and what nucleotide is our Haitian cholera strain?

REF:

Our strain:

At the amino acid that overlaps with your SNP, what amino acid does the reference encode for? What amino acid does the Haitian cholera strain encode? Consult the chart below to translate your codon.

REF:

Our strain:

Does your SNP change the amino-acid sequence? What effect might this have on protein activity? Note: amino acids that are color-coded the same in the chart below have similar activity (i.e., charge). How might this affect your interpretation?

|
|
|
|
|
|

Amino acid translation chart:

Lab 8: Aligning sequences to a reference genome

Findley Finseth

23 Feb 2018

Objectives

Lab 8 Overview

Tidy up and data prep!

Align sequences to a reference genome

The `SAM` header

The `SAM` alignments

`SAM` to `BAM`

Visualize your alignment with `IGV`

Viewing genes with `IGV`

Snip? SNP!

Lab 8: Aligning sequences to a reference genome

Findley Finseth

23 Feb 2018

Objectives

Lab 8 Overview

Tidy up and data prep!

Align sequences to a reference genome

The SAM header

The SAM alignments

SAM to BAM

Visualize your alignment with IGV

Viewing genes with IGV

Snip? SNP!

The `SAM` header

The `SAM` alignments

`SAM` to `BAM`

Visualize your alignment with `IGV`

Viewing genes with `IGV`