Develop and implement workflows (i.e., “pipelines”) that transform raw data into alignments (bwa
, samtools
)
Understand the SAM
format
Visualize and navigate alignments using the Integrated Genome Viewer
Understand and interpret SNPs and how they relate to genetic variation
Many applications in genomics and bioinformatics depend on aligning sequenced data to a reference genome. These are called “re-sequencing” projects. In this lab, we will align cholera strains isolated from the 2010 Haitian outbreak to the Vibrio cholerae reference genome and view our alignments. Eventually, we will use an expanded dataset to build a phylogenetic tree to address the origins and spread of the 2010 Haitian cholera outbreak.
We will use a program called BWA
(Burrows Wheeler Aligner, Li and Durbin 2009) to rapidly map reads to the reference genome. When we align or “map” reads against a reference, what we mean is that we want to go from a FASTQ
file listing lots of reads to another type of file which lists the reads and where/if it maps against the reference genome. The second file is called a SAM
file and will be discussed later. BWA
also allows for a certain number of mismatches to account for variation present in our strain vs. the reference genome.
Log on to the server and move into your /workdir/genomics/your_id/Data
directory.
You will need the files below to complete the tasks in this lab. Copy any files you do not already have from /workdir/genomics/Data
into your /workdir/genomics/your_id/Data
directory.
SRR531199_1.fastq.gz
SRR531199_2.fastq.gz
VibrioCholerae_N16961_genome.fna
– the reference genome of V. choleraeVibrioCholerae_N16961_genes.gff
– a file describing the genes of V. choleraeLab7
. Move all other data and files to this Lab7 directory.Let’s start with aligning the 500,000 reads isolated from the Haitian cholera outbreak to our reference genome. Perform adaptor and quality trimming using trimmomatic
, as you have done previously. Name your output files like SRR531199_1_500K_paired.fastq
, SRR531199_1_500K_unpaired.fastq
, ….
Now, we are ready to turn to our alignment step. Before we can start aligning reads to a reference genome, the genome needs to be “indexed”. This means sorting the genome into easily searchable chunks. To index the reference genome, use the following command:
$ bwa index reference_file_name
TASK: Align reads 1 and 2 to the reference V. cholerae genome. We will be using a specific algorithm within the bwa
package called mem
(Maximal Exact Matches). Use the bwa mem
help file ($ bwa mem
) to write a command that will align our reads to the reference genome. You will want to run the alignment while increasing the number of threads to 4 (-t 4
). Redirect the output to a file called SRR531199.sam.
Specify the full command you entered.
$
How many lines does your SAM
file have?
|
|
|
|
SAM
headerYour raw alignment is stored in what’s called the SAM
(Simple Alignment Method) format. This is why we gave your step #3 output the .sam
extension. Because SAM
files are plain text format, you can view your file with less
or head
.
SAM
files consist of a header and an alignment section. A SAM
file header for a C. elegans alignment might look something like the file below. C. elegans has six chromosomes (I-V, X) and chromosome I is 15,072,434 bp long.
@SQ SN:I LN:15072434
@SQ SN:II LN:15279421
@SQ SN:III LN:13783801
@SQ SN:IV LN:17493829
@SQ SN:V LN:20924180
@SQ SN:X LN:17718942
@RG ID:VB00023_L001 SM: celegans-01
@PG ID:bwa PN:bwac VN:0.7.10-r789 […]
I_2011868_2012306_0:0:0_0:0_2489 83 I 2012257 40 […]
Term | Meaning |
---|---|
@SQ |
stores information about the reference sequence (e.g., chromosome number and length) |
@RG |
contains important information about read group and sample metadata |
ID |
sample identifier and must be unique |
SM |
keeps track of the sample information |
PL |
keys for platform identification. |
@PG |
contains metadata about the programs used to create and process a set of SAM/BAM files like program name and version |
I_2011868 |
first line of your alignment section (i.e., no “@”; see below) |
NOTE: your file does not yet have an “@RG” header line. We will add this in a later lab.
In addition to using head
or less
to view a sam file, you can extract and view only the header using a program called SAMTOOLS
. SAMTOOLS
is very useful for interacting with SAM
or BAM
(more later…) files.
TASK: View your header file with SAMTOOLS
using the command below. Then answer the questions below.
$ samtools view -H file_name
What are the chromosome names and how long are they?
|
|
|
SAM
alignmentsThe alignment section of a SAM
file contains read alignments (and may include reads that did not align). Each alignment entry is composed of 11 required fields and may have some optional fields.
Let’s step through each field in an alignment. Because alignment lines are quite lengthy and would overflow the page width, return each field on a new line. To do this, use $ samtools view
with no options to return only alignment lines (i.e., no header lines). Then, employ a nifty tool called tr
to convert tabs to newlines. Finally, just take the first 11 lines of this output:
$ samtools view f1.sam | tr '\t' '\n' | head -n 11
Term | Meaning |
---|---|
SRR531199.2 |
QNAME, query name (i.e., sequenced read’s name) |
83 |
bitwise flag, contains info about the alignment |
NC_002505.1 |
RNAME, reference name, tells which chromosome the read aligned to |
1677874 |
POS, position on the reference sequence of the first mapping base (leftmost) in the query sequence |
60 |
MAPQ, mapping quality, measures how likely the read is to actually originate from the position it maps to |
@72M |
CIGAR, CIGAR strings are specialized formats for describing the alignment |
= |
RNEXT, paired end (i.e, “next read”) is named properly |
1677781 |
RPOS, position of paired end read |
-165 |
TLEN, template or “insert” length for paired-end reads |
GCCTAT[…] |
SEQ, stores the original sequence, will always be in the orientation it’s aligned in |
@@@DD=[…] |
QUAL, stores the original read base quality |
TASK: A mapping quality of “60” is the highest quality possible. Use a combination of commands to determine (1) how many sequences have a mapping quality lower than 60 and (2) the lowest observed mapping quality.
$
$
How do you interpret your lowest mapping quality (Google may help here)?
|
|
|
|
Use a combination of commands to determine the ranges of insert sizes in your data (i.e., the shortest (3) and longest (4) insert). Hint: you can restrict your search to positive numbers. For each negative insert size, the paired read will have the same positive value.
$
$
Do your answers above surprise you in any way?
|
|
|
|
TASK: One common filtering step is to restrict alignments to only uniquely mapped reads. Write a UNIX
command that generates a new file, filtered.sam
, containing only uniquely mapped reads. You will need to generate a header file, the filtered reads, and then concatenate those together.
Specify your command below and determine the number of reads before and after filtering.
$
Number of reads raw SAM
:
Number of reads filtered SAM
:
SAM
to BAM
SAM
files are nice an human interpretable, but they can be quite large. BAM
files are binary versions of SAM
files. Therefore, BAM
are not human readable, but they have the advantage of being much smaller than SAM
files.
Convert your SAM
file to a Binary Alignment Method (BAM
) files and then sort it. $ samtools view
converts SAM
to BAM
files with the -b
option. We can then pipe that output to samtools sort
and redirect the final output to a newly named BAM
file, filtered.sort.bam
.
Specify your command below:
$
Use head
to check out the BAM
and SAM
files you generated. Does the BAM
file make any sense to you?
TASK: Compare the sizes of your BAM
and SAM
files using the command below.
$ du -sh file_name
BAM
file size:
SAM
file size:
For the rest of the lab, we will be working with our BAM
files, because they take up much less space. Plus, we can always convert back to a SAM
file using samtools view
. Given that, let’s clean up any large SAM
files. Removing such intermediate files is a good practice when performing any workflow.
$ rm -i file_name
IGV
Now, let’s visualize your alignment in a graphical framework using a program called the Integrated Genome Viewer, or IGV
. IGV
is an interactive, powerful GUI that allows for flexible and targeted viewing of alignments. However, it can also be a bit of a memory suck, so please be patient if it’s taking a while (or crashes…). To use IGV
, we will need:
SAM
fileWe have already got #1, #3, and #4. Now, let’s index our SAM
file using samtools index
.
$ samtools index bam_file
You will be utilizing IGV
on your own computer. First, download IGV
at this website (http://software.broadinstitute.org/software/igv/). Then, move the appropriate files to a folder on your desktop using sftp
as you have done before.
To run IGV
:
Start the IGV
application on your local computer.
From the drop-down menu, select Genomes
-> Load Genome from File…
Select your V. cholerae reference genome. You should now see two chromosome names (NC_002505.1
and NC_002506.1
) in a drop-down menu at the top of the application.
To load your file, select File
-> Load
and choose your sorted BAM
file.
Let’s explore IGV
. Zoom in on a region by dragging a smaller window onto the chromosome. Then, double click until you see coverage. It should look something like this:
Horizontal bars represent reads (i.e., Illumina sequences).
The total number of reads overlapping a single nucleotide is coverage for that nucleotide (we previously calculated average, genome-wide coverage).
Gray reads represent reads with expected insert sizes.
Red reads indicate that the inferred insert size is larger than expected (and therefore a section of DNA is deleted in the subject genome relative to the reference).
Blue reads indicate that the inferred insert size is smaller than expected (and therefore a section of DNA is inserted in the subject genome relative to the reference.
Paired ends mapping to different chromosomes are colored by chromosome and indicate rearrangements.
You can read more about insert size interpretations here (http://software.broadinstitute.org/software/igv/interpreting_insert_size).
Right-click
on the Alignments
panel and select View as pairs
to see the two components of paired-end reads represented together facing each other and connected with a line (you may need to zoom in to see this). Hovering your mouse over a specific pair of reads will give you a yellow box with detailed information for that pair. Please note that insert size refers to the total distance between the 5’-ends of two reads in a pair.
TASK: Hover over a gray read. What is the name of the read pair that you are hovering over? What is it’s insert size and what is each read’s mapping quality?
|
|
|
|
Hover over a blue read. What is the name of the read pair that you are hovering over? What is it’s insert size and what is each read’s mapping quality?
|
|
|
|
Hover over a red read. What is the name of the read pair that you are hovering over? What is it’s insert size and what is each read’s mapping quality?
|
|
|
|
IGV
We’ve looked at IGV
using the DNA sequence information from our reference, but we also may want to know something about genes (and variation within them). We can use VibrioCholerae_N16961_genes.gff
to display information about genes.
From the drop-down menu, select Genomes
-> Create .genome file
Assign your genome an appropriate Unique identifier
and Descriptive name
. Upload the VibrioCholerae_N16961_genome.fna
file to the FASTA
file box. Upload the VibrioCholerae_N16961_genes.gff
to the Gene
file box. Click OK
.
To load your file, select File
-> Load from File
and choose your sorted BAM
file.
Your screen should look similar, but with an additional row of horizontal blue bars in the Gene
track. The thick bars are protein-coding regions and the thin bars are regions that do not code for proteins (but may be transcribed). V. cholerae is a bacteria and, therefore, has no introns. In eukaryotes, thin bars indicate introns and intergenic space (i.e., not transcribed) has no bars. Bacteria also have very high gene density. Notice how close the genes are.
TASK: Zoom in on a single gene. Hover over the genes and look for the gene name and product. Find a gene that has a described product (i.e., is not a hypothetical protein). What is the name of your chosen gene? What is its function?
|
|
|
|
|
Single nucleotide polymorphisms or SNPS are nucleotide sites where the subject (i.e., your re-sequence data) varies from the reference. If you want to know about genetic variation in a population or across species, you would start by identifying SNPs
.
TASK: Scroll through your IGV
output until you find a site that you feel confident is a SNP
(and that is different than the one below). This should be a site where all or nearly all of the reads differ from the reference in the same way. There are not many SNPs in the genome, so it may take a few minutes of searching until you find one.
An example of SNP (along with other important features) is below:
Zoom in on the selected SNP
until you can read the reference nucleotide sequence:
SNP
in a protein-coding region of a gene? If not, find a SNP
in a region that is in a protein-coding region?|
|
|
SNP
?|
|
|
SNP
? To find this, simply count the number of reads that cover your SNP
.|
|
|
REF:
Our strain:
SNP
, what amino acid does the reference encode for? What amino acid does the Haitian cholera strain encode? Consult the chart below to translate your codon.REF:
Our strain:
SNP
change the amino-acid sequence? What effect might this have on protein activity? Note: amino acids that are color-coded the same in the chart below have similar activity (i.e., charge). How might this affect your interpretation?|
|
|
|
|
|
Amino acid translation chart: