Lab 15: GWAS — Visualization and Interrogation

##

## For example usage please run: vignette('qqman')

##

## Citation appreciated but not required:

## Turner, S.D. qqman: an R package for visualizing GWAS results using Q-Q and manhattan plots. biorXiv DOI: 10.1101/005165 (2014).

##

## See example usage at http://sahirbhatnagar.com/manhattanly/

Objectives

Generate and interpret a Manhattan plot to visualize the results of a GWAS
Independently write code to generate a Manhattan plot by using a vignette as a guide
Subset dataframes in R using $, brackets, and/or the subset function

Note: This lab is heavily adapted from a GWAS tutorial designed by the Foulkes lab at Mount Holyoke College (http://www.stat-gen.org/tut/tut_intro.html).

GWAS overview

In this module, we performed a Genome Wide Association Study on 1401 individuals to identify the genetic basis of coronary artery disease (CAD). In previous labs, we’ve generated results for both typed and imputed SNPs based on their association with HDL cholestrol. The last step in the GWAS is to conduct a post-analytic genomic interrogation to provide further insight into the association results. We will first combine our imputed and typed results and isolate the SNPs in a region of interest. Then, we will visualize the dataset in a Manhattan plot.

Data prep and package installation

Set your working directory in R to the directory used in Labs 12-14.
Load the necessary R packages below. If a package is not already installed, be sure to install it first.

$ library(plyr)
$ library(manhattanly)
$ library(qqman)

Load your saved data.

$ load("GWAA_4.Rdata")  ## Yours may also be coded as "GWAS_4.Rdata"

Combining imputed and typed results

We will now combine our imputed and typed results and isolate the SNPs in a region of interest. First, the typed SNP results are loaded from the file generated by the GWAA function and manipulated to have them in the same format as the imputed SNP results. We follow similar steps as in the last lab to attach chromosome and position information to each SNP, order them by significance, and take -log10(p values). We also want to add the same type column to the typed SNP results data as we did to the imputed data. Note that we can get the SNP coordinate information from genoBim which contains genomic information on every SNP.

# Format the typed dataset in the same way that we formatted the imputed data
$ GWASout <- read.table("GWAA.txt", header = TRUE, colClasses = c("character",
    rep("numeric", 4)))
$ GWASout$Neg_logP <- -log10(GWASout$p.value)
$ GWASout <- merge(GWASout, genoBim[, c("SNP", "chr", "position")])
$ GWASout <- arrange(GWASout, -Neg_logP)
$ GWASout$type <- "typed"

# Check the imputed and typed datasets to compare formats
$ head(GWASout)
$ head(imputeOut)

##          SNP   Estimate  Std.Error   t.value      p.value Neg_logP chr
## 1  rs1532625  0.2076724 0.03759827  5.523456 4.025954e-08 7.395131  16
## 2  rs3803768 -0.3128172 0.06770158 -4.620530 4.209979e-06 5.375720  17
## 3 rs16951746 -0.3220072 0.07856747 -4.098479 4.418944e-05 4.354682  15
## 4 rs17778044  0.1882157 0.04804339  3.917619 9.418208e-05 4.026032  17
## 5 rs17643302  0.2421791 0.06182924  3.916903 9.439839e-05 4.025035  17
## 6  rs4149504 -0.1824512 0.04664569 -3.911427 9.654280e-05 4.015280  16
##   position  type
## 1 57005301 typed
## 2 80872028 typed
## 3 68605486 typed
## 4 68310164 typed
## 5 64770822 typed
## 6 75565845 typed

##          SNP      p.value position chr    type Neg_logP
## 1  rs1800775 3.970871e-08 56995236  16 imputed 7.401114
## 2  rs3816117 3.970871e-08 56996158  16 imputed 7.401114
## 3  rs1532624 4.745842e-08 57005479  16 imputed 7.323687
## 4  rs7205804 4.745842e-08 57004889  16 imputed 7.323687
## 5 rs12933833 2.127739e-05 56697684  16 imputed 4.672082
## 6 rs11076159 2.420860e-05 56670827  16 imputed 4.616030

The only difference now between the datasets is that imputeOut is missing Estimate, Std.Error, and t.value columns. That’s actually OK for the imputed dataset, so we will just make those columns and fill them in with NAs.

Now we need to merge the typed SNP results (GWASout) with the imputed (imputeOut) SNP results. The two tables of typed and imputed genotypes are combined into a single table using function rbind.fill, which fills in missing columns with NAs.

# Combine imputed and non-imputed SNPs
$ GWAScomb<-rbind.fill(GWASout, imputeOut)
$ head(GWAScomb)

##          SNP   Estimate  Std.Error   t.value      p.value Neg_logP chr
## 1  rs1532625  0.2076724 0.03759827  5.523456 4.025954e-08 7.395131  16
## 2  rs3803768 -0.3128172 0.06770158 -4.620530 4.209979e-06 5.375720  17
## 3 rs16951746 -0.3220072 0.07856747 -4.098479 4.418944e-05 4.354682  15
## 4 rs17778044  0.1882157 0.04804339  3.917619 9.418208e-05 4.026032  17
## 5 rs17643302  0.2421791 0.06182924  3.916903 9.439839e-05 4.025035  17
## 6  rs4149504 -0.1824512 0.04664569 -3.911427 9.654280e-05 4.015280  16
##   position  type
## 1 57005301 typed
## 2 80872028 typed
## 3 68605486 typed
## 4 68310164 typed
## 5 64770822 typed
## 6 75565845 typed

##               SNP Estimate Std.Error t.value   p.value     Neg_logP chr
## 190467 rs61623275       NA        NA      NA 0.9999513 2.117120e-05  16
## 190468  rs6500987       NA        NA      NA 0.9999513 2.117120e-05  16
## 190469  rs7195973       NA        NA      NA 0.9999513 2.117120e-05  16
## 190470  rs7203036       NA        NA      NA 0.9999513 2.117120e-05  16
## 190471  rs7501006       NA        NA      NA 0.9999513 2.117120e-05  16
## 190472     rs6834       NA        NA      NA 0.9999891 4.736465e-06  16
##        position    type
## 190467  7621211 imputed
## 190468  7621129 imputed
## 190469  7619538 imputed
## 190470  7620768 imputed
## 190471  7620546 imputed
## 190472 75661803 imputed

There! Now our two datasets are together.

Next, we use the map2gene function defined in the previous lab to subset the data for SNPs in the CETP gene and print the observations. Recall that the CETP gene had many associated SNPs within its boundary. Note that the function was saved and reloaded into your workspace, so you don’t need to go searching for it.

# Subset for CETP SNPs
$ CETP <- map2gene("CETP", coords = genes, SNPs = GWAScomb)
$ CETP <- CETP[, c("SNP", "p.value", "Neg_logP", "chr", "position", "type", "gene")]
$ print(CETP)

##                SNP      p.value  Neg_logP chr position    type gene
## 1        rs1532625 4.025954e-08 7.3951312  16 57005301   typed CETP
## 15        rs289742 3.485697e-04 3.4577104  16 57017762   typed CETP
## 130       rs289715 4.039906e-03 2.3936287  16 57008508   typed CETP
## 32369    rs1800775 3.970871e-08 7.4011142  16 56995236 imputed CETP
## 32370    rs3816117 3.970871e-08 7.4011142  16 56996158 imputed CETP
## 32371    rs1532624 4.745842e-08 7.3236868  16 57005479 imputed CETP
## 32372    rs7205804 4.745842e-08 7.3236868  16 57004889 imputed CETP
## 32386   rs17231506 3.454117e-05 4.4616630  16 56994528 imputed CETP
## 32388     rs183130 3.454117e-05 4.4616630  16 56991363 imputed CETP
## 32391    rs3764261 3.454117e-05 4.4616630  16 56993324 imputed CETP
## 32392     rs821840 3.454117e-05 4.4616630  16 56993886 imputed CETP
## 32394     rs820299 4.459160e-05 4.3507469  16 57000284 imputed CETP
## 32420   rs12149545 9.961987e-05 4.0016540  16 56993161 imputed CETP
## 32445   rs12447839 2.059154e-04 3.6863112  16 56993935 imputed CETP
## 32446   rs12447924 2.059154e-04 3.6863112  16 56994192 imputed CETP
## 32448    rs4783962 2.059154e-04 3.6863112  16 56995038 imputed CETP
## 32480   rs34620476 3.515561e-04 3.4540054  16 56996649 imputed CETP
## 32481     rs708272 3.515561e-04 3.4540054  16 56996288 imputed CETP
## 32482     rs711752 3.515561e-04 3.4540054  16 56996211 imputed CETP
## 32525   rs12447620 3.562447e-04 3.4482516  16 57014319 imputed CETP
## 32526     rs158480 3.562447e-04 3.4482516  16 57008227 imputed CETP
## 32527     rs158617 3.562447e-04 3.4482516  16 57008287 imputed CETP
## 32748   rs11508026 1.931023e-03 2.7142126  16 56999328 imputed CETP
## 32749   rs12444012 1.931023e-03 2.7142126  16 57001438 imputed CETP
## 32750   rs12720926 1.931023e-03 2.7142126  16 56998918 imputed CETP
## 32751    rs4784741 1.931023e-03 2.7142126  16 57001216 imputed CETP
## 33223  rs112039804 4.047413e-03 2.3928225  16 57018856 imputed CETP
## 33224   rs12708985 4.047413e-03 2.3928225  16 57014610 imputed CETP
## 33225     rs736274 4.047413e-03 2.3928225  16 57009769 imputed CETP
## 37700     rs289746 3.222405e-02 1.4918199  16 57020205 imputed CETP
## 73128   rs12934552 2.567757e-01 0.5904460  16 57021433 imputed CETP
## 84923   rs12708983 3.243383e-01 0.4890017  16 57014411 imputed CETP
## 116860  rs12598522 5.225887e-01 0.2818400  16 57022352 imputed CETP
## 116861  rs56315364 5.225887e-01 0.2818400  16 57021524 imputed CETP

TASK: How many CETP SNPs are there total? How many were imputed and how many were typed?

|
|
|
|

Visualizing the results with a Manhattan plot

Today, we will be using the qqman package to visualize your data with a Manhattan plot. Manhattan plots are used to visualize GWAS significant results by chromosome location. It is a common way to take a glance at where signal is coming from and how strong it is. Often, significant results are highlighted in a different color, and a significance threshold line is presented. It is common practice to plot the negative log p values on the y-axis, with genomic location on the x-axis. In today’s lab, we will be making a standard Manhattan plot. For your homework, you will be making an interactive Manhattan plot.

In this lab, we will set up a plot for chromosomes 15-17, as these are the chromosomes we ran our analysis on. Don’t forget we also have imputed data available for chromosome 16.

Up until now, our R labs have generally had the code laid out for you. The remainder of the lab is going to be thin on code and heavy on you writing some of your own code.

We will be using a vignette for the qqman package as a guide. In R, vignettes can be thought of as short tutorials or manuals for a particular package. In “real life”, vignettes are often essential guides for analyzing bioinformatics data with new packages. The vignette can be found here (https://cran.r-project.org/web/packages/qqman/vignettes/qqman.html). Alternatively, you can access the vignette with the vignette('qqman') function.

Be sure to save all your code in an R script, so that you can access it later.

Re-format your GWAS results

Check out the GWAScomb object using str() or head(). Compare your data object to the one in the vignette. We will want to make a new gwasResults object that is formatted the same way as the ones in the vignette.

Make a new gwasResults object that retains only the following columns (in this order) by subsetting your GWAScomb dataframe: SNP, chr, position, p.value, type. Subsetting can be accomplished using the subset() function or using bracketing as done previously.

Change the column names to “SNP”, “CHR”, “BP”,“P”, “Type”. To do this, first, assign a new vector to the column names. The names will need to be recognized as strings, so should be in quotes (“”). Then, use the colnames() function to assign that vector to the column names in your gwasResults object. If you need to review the colnames() function, use ?colnames() to access the help file.

Make a Manhattan plot

Now, use the vignette as a guide to make your Manhattan plots and answer the questions below. When you have an orange and blue Manhattan plot, make sure I check it out before you proceed. Note that you will need to modify the vignette’s code to apply to your particular dataset. Then, answer the questions below by applying the vignette’s use of the table function to guide you.

TASK:

What does the Manhattan plot represent? What do you think the lines represent? What does it mean for a SNP to be above the line (basic plot)? Use the ?manhattan() function if you are stuck.

|
|
|
|
|
|
|
|
|

How many SNPs are on each chromosome? Why are there so many on chromosome 16?

|
|
|

How many SNPs are imputed and how many are typed?

|
|
|

Highlight SNPs of interest

Move on to begin highlighting some SNPs of interest. Using the vignette as a guide, highlight all of those SNPs that are in the gene we identified as being associated earlier, CETP. Earlier in the lab, we identified SNPs in the CETP gene as being of interest in the CETP object.

Take a look at the CETP object using head(). Make a vector called snpsOfInterest that represents the SNPs in the “SNP” column in the CETP object. Using the $ or bracketing may useful to generate your vector. Then, highlight those as done in the vignette.

In your own words, what SNPs did we highlight?

|
|
|
|

Do you think this highlights all SNPs of interest in the CETP region? Why might we be missing some that look associated?

|
|
|

Let’s zoom in on the CETP region and take a closer look at what’s going on. Using the vignette’s chromosome 3 example as a guide, subset your data to look only at chromosome 16 between 56,900,000 and 57,100,000 (and highlight the SNPs of interest).

OK… based on this it DOES look like we are missing some key SNPs that are likely associated. Zoom in even closer to get a better handle on the coordinates of interest. This time, isolate the region between 56,980,000 and 57,030,000 on chromosome 16.

In your opinion, what range of SNPs should be highlighted (best guesses are fine)?

|
|
|

Notice how some of the SNPs form a horizontal line. That seems a bit odd and potentially a result from estimated (i.e., imputed) data. I wonder if those are SNPs that we typed or SNPs that were imputed? Let’s check it out.

To do so, create two new subsetted datasets from gwasResults that retain either only typed or imputed data. You can use either subset() or bracketing to do this. From these subsetted datasets, graph typed SNPs in one graph and imputed SNPs in another. Include only the region to chromosome 16 between 56,980,000 and 57,030,000.

Are most of the SNPs of interest typed or imputed?

|
|
|

How many “typed” SNPs are driving the association? How does this make you feel about the data?

|
|
|

Continue with the annotation portion of the vignette. Stop before you get to the Z score part.

How many SNPs are annotated at P < 0.01, with default settings?

|
|
|

Turn the annotateTop option off and on with FALSE and TRUE. What do you think that option does?

|
|
|

When you zoomed in on chromosome 16, you noticed that there are some SNPs that are not highlight that seem like they should be highlighted. To this end, you determined a range of positions you felt should be highlighted.

First, determine if you like your chose range by to using the abline() function to draw vertical lines on your Manhattan plot corresponding to your range ( ?abline() for help). You will likely want to re-plot your Manhattan plot, using a “zoomed in” version with both typed and imputed SNPs. If you like it, move on. If you don’t, play with the values until your find a range you like.

Second, write code that will highlight SNPs in this range. To do this, first, make a database that represents only the SNPs with the positions you want. Then, choose only the SNPs from that database by selecting the column for “SNPs”. Finally, substitute your new vector for snpsOfInterest in the manhattan() function.

How many SNPs did you end up highlighting?

|
|
|

Save your work

$ save(genotype, genoBim, clinical, pcs, imputed, target, rules,
     phenoSub, phenodata, genodata_sub, support,
     genes, impCETP,  impCETPgeno, imputeOut, GWAA, map2gene, GWASout, GWAScomb, CETP, snpsOfInterest, gwasResults, file = "GWAS_5.Rdata")

Homework

Due Monday, April 9, by midnight.

Take a look are your blue and orange graph. Do you like the way that the graph looks? Check out the manhattan function by typing ?manhattan(), Choose a color palette you prefer and apply it to the Manhattan plot. You may also decide you want to give each chromosome a different color or make your points even smaller. Be aware that highlighted SNPs will be in lime green (very difficult to change), so you should consider that when choosing your palette. Turn in a screenshot of your Manhattan plot in the new colors.
It’s nice that those important SNPs are highlighted and all, but wouldn’t it be even nicer if we could just scroll over them with our mice and the corresponding gene names would pop up?! That way we could just really easily know which genes are interesting. And maybe other info could pop up, too….like whether or not that SNP is typed vs. imputed or how the SNP effects the phenotype (i.e., what the “estimate” is). Yeah…that would be pretty swish.

Wait! It’s not just a pipe (|) dream. Thanks to Sahir Bhatnagar – a data scientist who occasionally writes software packages for R – we can actually turn our Manhattan plots into interactive plots pretty easily. Check it out here (http://moderndata.plot.ly/manhattanly-r-package-for-interactive-manhattan-plots/).

For your interactive Manhattan plot, you will need to have the manhattanly package and your GWAS_5.Rdata loaded. You will also need to do some data formatting prior to generating the plot. Specifically, you will first need to map your subset of SNPs on chromosomes 15 - 17 to genes using a for loop as done below. Then you will need to re-format to match the data input required by manhattanly.

#Select the genes that are on chromsomes 
$ genesSub <- genes[genes$chr == 15 | genes$chr == 16 | genes$chr == 17,]

#Initiate the variables in the for loop
$ final_df = NULL
$ temp_df = NULL

#Assign gene ids to SNPs
$ for(i in 1:nrow(genesSub)){
  possibleError <- tryCatch(
    map2gene(genesSub$gene[i], coords = genes, SNPs = GWAScomb),
    error=function(e) e)
  if(inherits(possibleError, "error")) next
  temp_df <- map2gene(genesSub$gene[i], coords = genes, SNPs = GWAScomb) 
  final_df <- rbind(final_df,temp_df)
}

# Confirm that your final_df object is generated properly
$ head(final_df)

#Re-format and subset your final dataframe to be in line with the requirements for manhattanly
$ df_sub <- final_df[,c("SNP", "chr", "position", "p.value", "Estimate", "type", "gene")]
$ colnames(df_sub) <- c("SNP", "CHR", "BP", "P", "Estimate", "Type", "GENE")
$ head(df_sub)

Now, generate an interactive plot using the generalized format below (you will need to modify).

$ manhattanly(your_dataframe_here, snp = "SNP", gene = "GENE", highlight = snps_you_want_to_highlight_here, annotation1 = "Type", annotation2 = "Estimate")

Scroll across your graph and notice any SNPs that pop up as strongly associated with HDL levels. Find an associated SNP that is associate that is NOT in the CETP gene. Scroll over that SNP until the interactive box pops up. Attach a screenshot of this plot.

Lab 15: GWAS — Visualization and Interrogation

Findley Finseth

4/4/2018

Objectives

GWAS overview

Data prep and package installation

Combining imputed and typed results

Visualizing the results with a Manhattan plot

Re-format your GWAS results

Make a Manhattan plot

Highlight SNPs of interest

Save your work

Homework