Lab 3: Parsing text with UNIX tools

Objectives

Gain proficiency with the UNIX tools grep sed, cut, sort, wc, and uniq
Manipulate, summarize, and parse text files through applying UNIX tools
Practice building pipelines by combining UNIX commands with pipes and file re-directions
Get more familiar with regular expressions such as | and .

Note: This lab was originally adapted from Daniel Martinez and Anthony Bellantuono’s Genomics and Transcriptomics Without Tears. It also borrows heavily from “Chapter 7” of Bioinformatics Data Skills, by Vince Buffalo:

Lab 3 overview

Today we will mainly work with two files, matrix and Hydra_rna.fa. The matrix file includes counts of particular mRNA transcipts. You have seen Hydra_rna.fa before, which contains RNA nucleotide sequences from Hydra magnipapillata genes in the FASTA format.

Starting today, you will begin to see commands introduced in a generalized format (e.g., $ head file_name). It will be up to you to tailor the command for your specific task (e.g., $ head Hydra_rna.fa). You will also have to figure out several commands on your own.

In the last exercise of today’s lab, you will save a record of all the commands you implemented during the current computer session (from login to logout/termination). This record—–called history—–can be very useful when you want to repeat certain steps in the future (in addition to allowing Findley to verify that you have completed the exercises successfully). It is good practice, therefore, to get in the habit of saving your history after working through lab exercises.

Tidy up and get your data!

Log on to the server and move into your /workdir/genomics/my_lab_id folder. Make a new directory lab2. Remove your results_XYZ directory. Move all remaining data and output files into your lab2 folder.
Copy the all the files from /workdir/genomics/data/lab3 into your working directory. Verify that the files were transferred.

GREPing for foxes and flies – review of Lab 2

Take a look at your Hydra_rna.fa using head or less and remind yourself of the FASTA format. As a review of last lab, write a simple grep command that will just display header lines. What command did you write?

Let’s assume you want to generate some summary stastics on this file. You may want to know the number of sequences in the file or the number of sequences that match certain criteria. To do this, you can count using an option in grep you have seen before. If you don’t remember the option, take a look at the manual for grep or the course google doc cheatsheet.

Write a command that you can use to count the number of sequences in the file.

TASK: LAB 3, #1

NOTE: This task is on the accompanying worksheet on Sakai

Now, grep for forkhead gene headers and save the names into a file called fox using the redirection operator (>) you learned about in Lab 2 (we will use this file later today). Then, grep for Drosophila gene headers and save the names into a file called Droso (you will use this file on Wednesday).

Specify the full commands you entered.

$   
  
$

Using either less or cat, verify that your fox file contains only forkhead genes and your Droso file contains only Drosophila genes.

Calculating GC content with simple pipelines

One important feature of sequences is the percent of G’s and C’s. G–C pairings have three hydrogen bonds, while A–T pairings only share two. Therefore, the % of GCs can tell you a lot about the local environment of DNA.

To calculate GC content in Hydra_rna.fa, we will need to count the number of Gs, Cs, As, Ts, and Ns (unknown nucleotides) in only the sequence lines (not the header lines). We can do this by combining grep commands with a pipe |, as you did in Lab 2.

The first command should select only the lines that contain sequence (i.e., no header lines). You may want to use a grep option you learned in Lab 2 to invert the output and return all the lines that do not match a pattern. Pipe this output to head or less, as it will likely be very big.

Specify the complete first command:

The second command will count (one at a time) the Gs, As, Ts, Cs, and Ns (or any other characters you may want to find). For this command, you will grep with the -o option. Option -o instructs grep to search for an individual character rather than a line.

Specify the complete second command that you would use to return all the Gs. Again, pipe this output to head or less.

grep -o will report each of the matched characters in a single line (one character per line). From there, we can simply count the number of lines (i.e., number of matched characters) using a new word count command (wc), set to count the number of lines (-l).

The third command is:

$   wc -l

To summarize, the first command retrieves only the sequence lines, the second returns the characters that match a certain pattern like the number of Gs (one per line), and the third command will count the number of lines (and therefore the number of character matches). All of these can performed by connecting each command with a pipe (|) on a single line.

Write the entire pipeline to count the number of Gs in Hydra_rna.fa:

TASK: LAB 3, #2

NOTE: This task is on the accompanying worksheet on Sakai

Entering the commands one at a time is a lot of work. In addition to connecting commands, pipes (|) can also be interpreted as a regular expression with the meaning “or”. Regular expressions are a powerful language for search and replace operations that we have seen before. To turn on something called extended regular expressions in grep, we use the -E option. Then, the | in the following grep command is interpreted as “or” and searches for multiple characters at once.

$ grep -o -E ‘character1|character2|character3’ file_name

Specify how you would count the number of Gs and Cs at once:

Use a similar command to count the total number of bases. Re-calculate the %GC. Did you get the same answer as before?

There are more elegant (and faster) ways of counting all of the bases in a FASTA file at the same time. Note that sometimes, you will have gaps (-) that you should also count (though none exist here).

The two grep commands below will generate the same result when used on the file Hydra_rna.fa, and should work on most FASTA files. The last one, however, has an advantage. It uses the regular expression of a period (.), which can represent any single character. Thus the last has the advantage that it would report any other characters that may be present in the sequences. The other command will only report the number of each of the specified characters.

$ grep -v “>” Hydra_rna.fa | grep -o -E 'C|G|A|T|N|-' | wc -l   

$ grep -v “>” Hydra_rna.fa | grep -o . | wc -l

Cutting, sorting, and being unique

Take a look at the matrix file using head or less. This file has three columns corresponding (from left to right) to transcript names, abundance values for sample 1, and abundance values for sample 2. There are 51,863 entries in the matrix file (you can confirm this with wc, if you like).

Let’s imagine that you want to know the 5 lowest unique abundance values for sample 1. First, you will want to isolate sample 1 by extracting column 2 (that’s where sample 1’s data lives). One easy way to do this is to use cut. This program cuts out specificed columns (also known as fields) from a text file. By default, cut recognizes tabs as column separators (which is what matrix uses). To extract the second column we use:

$ cut -f2 matrix | head

Now let’s sort the extracted column. sort is a very commonly used in bioinformatics, as we often need to work with sorted plain text files. This is because some operations are much more efficient on sorted data and it’s also a prerequisite to finding unique values (more on that later).

First, let’s just check out the native behavior of sort.

$ sort matrix | head

Notice that the native behavior of sort is to sort on the first column in alphanumerically. sort also has the ability to sort on a particular column by specifying the -k option followed by a column number (in this case 2).

$ sort -k2  matrix | head

Here, below we are again telling sort to sort on the second column (2). But now, we want to make sure sort sorts the column numerically (-n). The native behavior would be to sort (5,10, 1, 2, 11) like (1,10,11,5,2). By specificying -n, sort will return (1,2,5,10,11).

$ sort -k2 -n  matrix | head

Now, let’s combine sort and cut to extract the second column and then sort it numerically. Notice that we no longer need to specify the column (“2”) with sort, as cut has already extracted it.

$ cut -f2 matrix | sort -n | head

Notice that all of the values are 0.00. What you actually want to know is the 5 lowest unique values. To do this, you can implement uniq. uniq takes lines from a file or stream and outputs all lines with consecutive duplicates removed. Let’s check out its behavior using an example.

$ cat letters.txt
$ uniq letters.txt

As you can see, uniq only removes consecutive duplicate lines (keeping one). It does not return the unique values. To find all unique values, you would need to sort first. Adding the -c allows you to count the occurences of each unique value.

$ sort letters.txt | uniq
$ sort letters.txt | uniq -c

If you wanted to know the 5 lowest abundance values (the original prompt), you need to combine cut, sort, and uniq.

$ cut -f2 matrix | sort -n | uniq | head -5

TASK: LAB 3, #3

NOTE: This task is on the accompanying worksheet on Sakai

Easier `sed` than done

You generated a file called fox that is full of forkhead gene headers. Forkhead genes are normally represented using the nomenclature Fox or FOX (e.g., FoxA1, FOXO3). sed allows you to find and replace text. Check out the general format of sed below on the first line. Then, replace “forkhead box protein”" with “FOX”, as indicated on the second line.

$ sed ‘s/pattern_to_find/pattern_to_replace/g’ file
$ sed 's/forkhead box protein/FOX/g' fox

Now use sed and the fox file to 1) replace “forkhead box protein”" with “FOX” (as above), 2) replace “-like” with nothing, and 2) replace “-B” with nothing. Save the output to a new file called fox2. You should be able to do this with a single line of code and piping. What command did you type?

Let’s now add a header line to the fox2 file. Currently, you can either print the output of sed to the screen or redirect it to a new file. To directly modify the fox2 file, choose the option sed -i. To add a header, you can use the 1i notation as shown below. The 1i will add the subsequent text to the first line (2i is the second line and so on).

$ sed '1iheader_text_here\'

TASK: LAB 3, #4

NOTE: This task is on the accompanying worksheet on Sakai

Lab3 is history

Generate a file with your history and save it in your folder. Saving your history regularly can be useful (especially if you are forgetful).

$ history > history.$(date +%F).your_last_name

Lab 3: Parsing text with UNIX tools

Findley Finseth

3 Feb 2019

Objectives

Lab 3 overview

Tidy up and get your data!

GREPing for foxes and flies – review of Lab 2

Calculating GC content with simple pipelines

Cutting, sorting, and being unique

Easier `sed` than done

Lab3 is history

Lab 3 assignment

Lab 3: Parsing text with UNIX tools

Findley Finseth

3 Feb 2019

Objectives

Lab 3 overview

Tidy up and get your data!

GREPing for foxes and flies – review of Lab 2

Calculating GC content with simple pipelines

Cutting, sorting, and being unique

Easier sed than done

Lab3 is history

Lab 3 assignment

Easier `sed` than done