Bioinformatics and Unix

Bio-unix: Examples of Simple Scripting in Bioinformatics (Including Awk, Perl, and Python)
by Oliver; Jan. 13, 2014
     
Introduction

In "A Quick Guide for Developing Effective Bioinformatics Programming Skills", published in PLOS Computational Biology, there's a whole section entitled The Importance of UNIX Skills. The authors note:
The relevance of many of the rules outlined here can be directly evaluated through a survey of the bioinformatics positions described within scientific job sites, such as Nature Jobs (http://www.naturejobs.com) and Science Jobs (http://www.sciencejobs.com). For example, at the time of this writing, a search for the term “UNIX” finds more than 100 open positions seeking proficiency in UNIX.
What's all the fuss about? One reason these two subjects mesh so well is that in bioinformatics—which is to say, computing with biological data—there are heaps of data. A modern sequencing experiment will produce gigabytes or terabytes of it. This data could have many fates. Most likely, it will be mapped to a reference genome or transcriptome. Or perhaps, if such a reference doesn't exist, it will be assembled into a new reference and deposited in big-name online databases like NCBI, UCSC, or Ensembl. For any species there is a nearly unlimited number of genetic strains to be sequenced, and each of these can be annotated with genes, introns, exons, microRNAs, etc.

A number of reasons make unix particularly well suited to this context. The first is its ease of manipulating large batches of files and directories. Imagine trying to organize hundreds of data files in hundreds of folders with a GUI—it would be a nightmare. Second, unix has great parsing power, allowing you to sniff out properties or inconsistencies of files too big to scroll through by eye. For hard problems, you want to write a proper script. But the command line—with its myriad utilities, including awk, sed, and perl—can answer a surprisingly broad range of questions on its own. If you're sorting a file or counting unique lines, for example, you won't get more efficient than binaries like sort or uniq. A third reason for unix's hegemony is that many of the major tools in the field, like Blast, Samtools, Bedtools, Bowtie, TopHat, and bwa, are run from the command line.

Below I give some examples of simple scripting chores in bioinformatics. Most of the problems have no relevance other than to demonstrate how to work fast on the command line.

An Example with Awk - Finding Properties of a Fasta File

For this example, which I'm partly borrowing from An Introduction to Unix, you need to know that fasta is one of the file formats in which sequencing data is stored. In fasta format, every sequence has an ID line, which begins with >, followed by the sequence itself, which is allowed to span multiple lines. For DNA, the sequence is composed of the letters (or base pairs) A C T G. A fasta file of two genes could look like this:
$ cat myfasta.fa
>GeneA
ATGCTGAAAGGTCGTAGGATTCGTAG
>GeneB
ATGAACGTAA
(if genes were this small)

Starting simply, let's suppose we only want to see ID lines:
$ cat myfasta.fa | awk '/^>/'
>GeneA
>GeneB
Suppose we want to create a file of ID versus sequence length:
$ cat myfasta.fa | awk '{if ($1 ~ />/) {printf $1"\t"} else {print length}}' | sed 's|>||' 
GeneA	26
GeneB	10
Note that length is a keyword in awk: called with no argument, it returns the length of the current line. (Also note that this one-liner assumes each sequence occupies a single line, as in our example.)
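To see the keyword in action on a lone string:
$ echo "ATGAACGTAA" | awk '{print length}'
10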

Another common chore is filtering a fasta file. Sometimes we want to ignore small sequences, because either we don't believe they're real or because we'll have trouble aligning them. This is simply:
$ cat myfasta.fa | awk '{if ($0 ~ />/) {id=$0} else if (length > 20) {print id; print}}'
>GeneA
ATGCTGAAAGGTCGTAGGATTCGTAG
Likewise, perhaps we want to order our fasta file by length and filter out any sequences less than or equal to 50 base pairs. Let's suppose our fasta file is much longer. We could do this as follows:
$ cat myfasta.fa | paste - - | awk -F"\t" '{print length($2)"\t"$0}' | sort -k1,1nr | 
  awk '$1>50' | awk -F"\t" '{print $2; print $3}' > myfasta.gt50.fa
The
paste - -
bit combines every two rows into one row, making the ID and sequence two fields in the same line. Then we put a third field in front, the sequence length, and sort on it. Next we pipe the whole thing into awk again and filter it. Finally we re-format it back into fasta form. Note that we explicitly tell awk to delimit on tabs just in case there is whitespace in the fasta ID.
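If you want to see what paste - - does on its own, run it on our toy file (again, this assumes each sequence occupies exactly one line):
$ cat myfasta.fa | paste - -
>GeneA	ATGCTGAAAGGTCGTAGGATTCGTAG
>GeneB	ATGAACGTAA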

Here is a more difficult problem. When the sequencing machine does its magic, sometimes it can't make a call on what the base is, so instead of writing A, C, T, or G it uses the character N to denote an unknown. Suppose you have a fasta file—perhaps the sequences are thousands of base pairs long—and you want to know the coordinates of these stretches of Ns. As a test case, let's look at the following miniature fasta file:
$ cat test.fa
>a
ACTGNNNAGGTNNNA
>b
NNACTGNNNAGGTNN
In the long tradition of one-liners, here's what I wrote to solve this problem:
$ cat test.fa | awk 'BEGIN{flag=0}{if ($0 ~ />/) {if (flag==1) {print nstart"\t"nend;}; print; pos=0; flag=0} else {for (i=1; i<=length; i++) { pos++; c=substr($0,i,1); if (c=="N" && flag==0) {nstart=pos; nend=pos; flag=1;} else if (c=="N") {nend=pos} else if (c!="N" && flag==1) {print nstart"\t"nend; flag=0}}}}END{if (flag==1) {print nstart"\t"nend;}}'
Got that? Probably not, because the strength of one-liners has never been readability. Let's expand this line so we can see what it's doing:
$ cat test.fa | awk 'BEGIN{flag=0} {
 if ($0 ~ />/) # ID line 
 {
	if (flag==1) {print nstart"\t"nend;}; 
	# print ID, set position to 0, flag to 0
	print; 
	pos=0; 
	flag=0;
 } 
 else # sequence line
 {
	# loop through the sequence
	for (i=1; i<=length; i++) 
	{ 
		# increment position
		pos++; 
		# grab the i-th letter
		c=substr($0,i,1); 

		# record the start position when we hit an N
		# the flag goes hi while we're inside a run of Ns
		if (c=="N" && flag==0) { nstart=pos; nend=pos; flag=1; } 
		else if (c=="N") { nend=pos } 
		else if (c!="N" && flag==1) { print nstart"\t"nend; flag=0; }

	} # for loop
 } # not ID line
}END{if (flag==1) {print nstart"\t"nend;}}'
>a
5       7
12      14
>b
1       2
7       9
14      15
Looks like we got the ranges right! Let's talk through this.

When we're on an ID line, we reset our position variable and our flag. The flag is a boolean (meaning it will be either 1, hi, or 0, low) which we're going to set hi every time we see an N. This plays out in the following way: once we hit a sequence line, we loop over every single character. If the character is an N and the flag is low—meaning the previous character was not an N—then we record this position, pos, as being both the start and end of an N streak: nstart=pos and nend=pos. If the character is an N and the flag is hi, it means the previous character was an N—hence, we're in the middle of a streak of Ns—so the start position is whatever it was when we saw the first N and we just have to update the end position. If the character is not an N but the flag is hi, it means the previous character was an N and we just finished a streak, so we print the range of the streak and set the flag low to signify we're no longer in a streak.

There's one more subtlety: what if the sequence line ends on an N character? Then we won't end up printing it because our logic only prints things the next time we see a non-N. To solve this issue, we add a line in the if (line == ID line) block to check if our flag is hi. If it is, we print the last-seen range before clearing the variables. And there's an even more subtle point: what if the last line of our file ends on an N? In this case, that fix won't work because we won't be hitting any more ID lines. To solve this, we use the END{ } block: if the flag is hi and we're done reading the file, we'll print the last-seen range. In this example, our fasta file was small and we could eyeball the N ranges but, with a monster file, we'd need to trust our awk rather than our eyes.
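As a quick sanity check of that END block, we can feed the one-liner a toy sequence that terminates in an N and confirm the final range still gets printed:
$ printf ">c\nACTGNN\n" | awk 'BEGIN{flag=0}{if ($0 ~ />/) {if (flag==1) {print nstart"\t"nend;}; print; pos=0; flag=0} else {for (i=1; i<=length; i++) { pos++; c=substr($0,i,1); if (c=="N" && flag==0) {nstart=pos; nend=pos; flag=1;} else if (c=="N") {nend=pos} else if (c!="N" && flag==1) {print nstart"\t"nend; flag=0}}}}END{if (flag==1) {print nstart"\t"nend;}}'
>c
5	6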

An Example with Perl and Awk - Combining Rows with Same First Field in a File

This problem is not specific to bioinformatics, but let's say we want to combine the rows of a file if the value of the first field is the same. (I'll borrow some of the following from the Perl Wiki.) Suppose our file is:
1       a
2       w
3       c
3       d
4       x
and we want to get this in the form:
1       a
2       w
3       c,d
4       x
Here's a perl script to do this for input piped in from stdin:
#!/usr/bin/env perl

use Data::Dumper; 
use strict; 

my %h=();

while (<STDIN>)
{
	chomp($_); 
	my @a=split; 

	if ( not exists $h{$a[0]} )
	{ 
		$h{$a[0]} = $a[1]; 
	} 
	else 
	{ 
		$h{$a[0]} = $h{$a[0]}.",".$a[1]; 
	}
}

foreach my $key ( sort keys %h ) 
{
	print $key,"\t",$h{$key},"\n"
}

# print Dumper \%h
This script loops over stdin. It gets the first field of the input and makes an entry in the hash, h, if it does not exist:
field 1 --> field 2
If the key already exists in the hash table, it concatenates the second field onto the current value with a comma. However, this approach has disadvantages. For a small file, it works fine. But, for a giant file, creating a hash can overwhelm the computer's memory. There's another way to solve this problem that presupposes our input file is sorted on the first field. We'll use awk:
$ cat file.txt | 
   awk '{if (NR==1) {prev=$1; x=$2} else if ($1==prev) {x=x","$2} else {print prev"\t"x; prev=$1; x=$2}}END{print prev"\t"x;}'
If we're on the first line, this code grabs the first and second fields. Otherwise, if the first field is the same as the previous row's first field, we accumulate the values of the second field in x. If there's a change, we spit out the values of prev and x and reset our variables. Since lines only get printed when there's a change in the first field, we have a problem when we get to the end of the file: we're one behind. To solve this, we use an END block, which executes after the whole file has been read, and print out the final values of prev and x.

To recap: approach 1 works best on small files, which don't have to be sorted, while approach 2 works better on giant files, which must be sorted.

Update: Here's a better solution than both of the above, using GNU datamash:
$ cat file.txt | datamash -g 1 collapse 2 -s -W
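On our example file this returns (assuming GNU datamash: -g 1 groups on column one, collapse 2 joins each group's second-column values with commas, -s sorts the input first, and -W treats runs of whitespace as the field delimiter):
1	a
2	w
3	c,d
4	x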
This solves the problem in a more general way. You can collapse on identifiers in any column, average the collapsed numbers, etc. Read more on the unix tutorial page. Hat tip: Albert Lee.

An Example with Unix - Examining a Batch of Files for Inconsistencies

Let's take an example from stuff I happened to be doing today. I'd broken a large file into 500 chunks, blasted each chunk, and used Bedtools and GFFs from NCBI to annotate the files. Each blast file and each output file of the annotating script were supposed to have the same number of IDs.

The annotation files looked like this:
ID			Gene			Species
ENSMMUT00000008032      ACTR3,CXorf56   	Homo sapiens
ENSMMUT00000032516      ADRA1A  		Pan paniscus,Homo sapiens,Papio anubis,Pongo abelii
ENSMMUT00000039107      LOC708652,PDLIM2  	Pan paniscus,Macaca mulatta,Homo sapiens
ENSMMUT00000019854      LEPROTL1        	Pan troglodytes,Macaca mulatta,Nomascus leucogenys
while the blast files looked like this:
qseqid			sseqid					pident	length	mismatch	gapopen
ENSMMUT00000019076	gi|289063412|ref|NM_001172429.1|        97.71   1133    0       14	...      
ENSMMUT00000019076	gi|402852391|ref|XM_003890858.1|        95.17   1138    16      26      ...
ENSMMUT00000019076	gi|116805349|ref|NM_018942.2|   	93.49   1136    43      20      ...
ENSMMUT00000020734	gi|402852393|ref|XM_003890859.1|        99.37   1117    7       0       ...
ENSMMUT00000020734	gi|402852393|ref|XM_003890859.1|        98.86   350     4       0       ...
How could I check that the number of unique IDs was the same in each of the 500 file pairs, as I expected it to be? I decided to look at the first 100 files with a one-liner:
for i in $( seq 1 100 ); do echo "***"$i; cat ../tmp/blastout${i} | cut -f1 | sort -u | wc -l; cat ../tmp/tmp${i}.report.txt | cut -f1 | sort -u | wc -l; done
For the benefit of readability, that's:
for i in $( seq 1 100 ); do 
	echo "***"$i; 
	cat ../tmp/blastout${i} | cut -f1 | sort -u | wc -l; 
	cat ../tmp/tmp${i}.report.txt | cut -f1 | sort -u | wc -l; 
done 
This printed the file number and the number of unique IDs in each file:
***1
213
213
***2
214
214
***3
214
214
So far so good. I then refined the code to improve the output format:
 for i in $( seq 1 500 ); do 
	a=$( cat ../tmp/blastout${i} | cut -f1 | sort -u | wc -l ); 
	b=$( cat ../tmp/tmp${i}.report.txt | cut -f1 | sort -u | wc -l ); 
	echo -e "$i\t$a\t$b"; 
done 
which produced:
1       213     213
2       214     214
3       214     214
4       214     214
To find which files didn't meet our criterion, some simple awk:
 for i in $( seq 1 500 ); do 
	a=$( cat ../tmp/blastout${i} | cut -f1 | sort -u | wc -l ); 
	b=$( cat ../tmp/tmp${i}.report.txt | cut -f1 | sort -u | wc -l ); 
	echo -e "$i\t$a\t$b"; 
done | awk '{if ($2!=$3) {print}}'
which gave us our answer:
56      214     0
92      213     0
96      213     0
124     213     0
The annotation had failed in files 56, 92, 96, and 124. Next, I grabbed these numbers and saved them in a file:
 for i in $( seq 1 500 ); do 
	a=$( cat ../tmp/blastout${i} | cut -f1 | sort -u | wc -l ); 
	b=$( cat ../tmp/tmp${i}.report.txt | cut -f1 | sort -u | wc -l ); 
	echo -e "$i\t$a\t$b"; 
done | awk '{if ($2!=$3) {print}}' | cut -f1 > failedjob.txt
in order to re-run the annotation script on only these failed jobs:
for i in $( cat failedjob.txt ); do 
	echo "***"$i; 
	../scripts/annot_script.pl --input ../tmp/blastout${i} > log${i}.o 2> log${i}.e; 
done
That's it. Nothing hard, but this illustrates how you can play fast and loose with unix, chaining commands together to attack a problem (and all conveniently on one line).

An Example with Command-Line Perl - Counting Different Types of Rows in a File

For a project in which I was annotating different sequences with genes, I had a file mapping sequence ID to a comma-delimited list of genes. If we look at the beginning of one of these files, the first two columns are:
ID			Gene(s)
ENSMMUT00000011192      TYROBP
ENSMMUT00000038564      MAP4K4,PLAT,IL12RB2,MMAA
ENSMMUT00000019949      ZNF420,LOC101059449,LOC100603678
ENSMMUT00000017551      C11H19orf33,C19H19orf33,C19orf33
ENSMMUT00000038523      -
ENSMMUT00000009027      ZNF540
ENSMMUT00000027763      APLP1
ENSMMUT00000034882      -
ENSMMUT00000049101      -
Let's say we wanted to count the number of rows that have one gene, multiple genes, and no genes. This is a job for awk or command-line perl. Here's our one-liner (I will assume the file has no header):
cat report.txt | perl -ne 'BEGIN{$num_tot=0; $num_one=0; $num_mult=0; $num_non=0}{$num_tot++; chomp($_); @a=split("\t",$_); @b=split(",",$a[1]); if ($a[1] eq "-") {$num_non++; $symb="non"} elsif (scalar(@b)==1 ) {$num_one++; $symb="one"} else {$num_mult++; $symb="mult"} print $symb,"\t",scalar(@b),"\t",$a[1],"\n"}END{print "***counts:\nTot\t".$num_tot. "\nOne\t".$num_one."\t".int(100*$num_one/$num_tot). "%\nMult\t".$num_mult."\t".int(100*$num_mult/$num_tot). "%\nNogene\t".$num_non."\t".int(100*$num_non/$num_tot)."%\n"}'
Let's make this readable:
cat report.txt | perl -ne 'BEGIN{$num_tot=0; $num_one=0; $num_mult=0; $num_non=0}
{
	$num_tot++; 
	chomp($_); 
	@a=split("\t",$_); 
	@b=split(",",$a[1]); 

	if ($a[1] eq "-") 
	{
		$num_non++; 
		$symb="non"
	} 
	elsif (scalar(@b)==1) 
	{
		$num_one++; 
		$symb="one"
	} 
	else 
	{
		$num_mult++; 
		$symb="mult"
	} 

	print $symb,"\t",scalar(@b),"\t",$a[1],"\n"

}END{ print "***counts:\nTot\t".$num_tot;
      print "\nOne\t".$num_one."\t".int(100*$num_one/$num_tot);
      print "%\nMult\t".$num_mult."\t".int(100*$num_mult/$num_tot);
      print "%\nNogene\t".$num_non."\t".int(100*$num_non/$num_tot)."%\n"; }'
If we pipe this into head, we get:
one     1       TYROBP
mult    4       MAP4K4,PLAT,IL12RB2,MMAA
mult    3       ZNF420,LOC101059449,LOC100603678
mult    3       C11H19orf33,C19H19orf33,C19orf33
non     1       -
one     1       ZNF540
one     1       APLP1
non     1       -
non     1       -
while piping into tail gives:
***counts:
Tot     212
One     92      43%
Mult    112     52%
Nogene  8       3%
This is first-grade math, but it illustrates how we can use perl on the command line to extract some simple properties of a file.

An Example with Unix & Awk - Finding a Property of a Large File

Let's return to unix and see another example of how we can answer a question about a file that's too big to cat. The other day I had some data:
$ cat bedintersect.parse.txt | head -20 | cut -f1-6
ID			Accession	Feature		Start	End	Gene
ENSMMUT00000001218      AC002115.1      Genbank-region  8420    8586
ENSMMUT00000001218      AC002115.1      Genbank-region  8420    8586
ENSMMUT00000052495      AC207480.4      Genbank-region  1944    1995
ENSMMUT00000052495      AC207480.4      Genbank-region  1944    1995
ENSMMUT00000052495      AL035045.5      Genbank-region  80788   80839   
ENSMMUT00000052495      AL035045.5      Genbank-region  80788   80839   
ENSMMUT00000052495      NG_032096.1     Genbank-region  137     194
ENSMMUT00000052495      NG_032096.1     Genbank-region  137     194
ENSMMUT00000052495      NG_032096.1     Genbank-gene    137     194     RNY4P25
ENSMMUT00000052495      NG_032096.1     Genbank-exon    137     194
ENSMMUT00000052495      AL589764.19     Genbank-region  56636   56693   
ENSMMUT00000052495      AL589764.19     Genbank-region  56636   56693   
ENSMMUT00000052495      AC200892.4      Genbank-region  5289    5345
ENSMMUT00000052495      AC200892.4      Genbank-region  5289    5345
ENSMMUT00000052495      NG_016630.1     Genbank-region  133     189
ENSMMUT00000052495      NG_016630.1     Genbank-region  133     189
ENSMMUT00000052495      NG_016630.1     Genbank-gene    133     189     RNY4P18
ENSMMUT00000052495      NG_016630.1     Genbank-exon    133     189
ENSMMUT00000052495      AC161614.4      Genbank-region  46721   46778   
ENSMMUT00000052495      AC161614.4      Genbank-region  46721   46778
(where I've added a header for clarity). The question I had was this: could there be more than one Genbank-gene feature immediately following a Genbank-region feature? To answer this, we'll focus on column three and grep for region or gene, since that's all we care about:
$ cat bedintersect.parse.txt | head -20 | egrep "region|gene" | cut -f3
Genbank-region
Genbank-region
Genbank-region
Genbank-region
Genbank-region
Genbank-region
Genbank-region
Genbank-region
Genbank-gene
Genbank-region
Genbank-region
Genbank-region
Genbank-region
Genbank-region
Genbank-region
Genbank-gene
Genbank-region
Genbank-region
Now we'll use a line of awk. Anytime we see a Genbank-gene line, we'll increment a counter; otherwise, we'll reset that counter to 0. We'll then print the counter along with the row number. If more than one Genbank-gene line appears without an intervening Genbank-region line, this counter will become greater than 1. To wit:
$ cat bedintersect.parse.txt | head -20 | egrep "region|gene" | cut -f3 | 
   awk 'BEGIN{x=0}{if ($1=="Genbank-gene") {x++} else {x=0}}{print x"\t"NR"\t"$0}' 
0       1       Genbank-region
0       2       Genbank-region
0       3       Genbank-region
0       4       Genbank-region
0       5       Genbank-region
0       6       Genbank-region
0       7       Genbank-region
0       8       Genbank-region
1       9       Genbank-gene
0       10      Genbank-region
0       11      Genbank-region
0       12      Genbank-region
0       13      Genbank-region
0       14      Genbank-region
0       15      Genbank-region
1       16      Genbank-gene
0       17      Genbank-region
0       18      Genbank-region
The last step is to awk our awk and see if this first column ever rises above 1:
$ cat bedintersect.parse.txt | egrep "region|gene" | cut -f3 | 
   awk 'BEGIN{x=0}{if ($1=="Genbank-gene") {x++} else {x=0}}{print x"\t"NR"\t"$0}'  | awk '$1>1'
2       426     Genbank-gene
So we discover that on line 426, it does indeed. Although this is not hard, it's another illustration of using unix pipelines to crunch through a file and make it reveal its essence.

An Example with Python - Changing File Formats: GTF --> BED

Bioinformatics is rife with different file formats, and a common chore is twisting one file format into another. GTF is a format for associating features (say, exons and introns) with ranges (i.e., positions) in some file of sequence data (usually fasta). Ensembl gives the specification; to quote, the columns are:
  1. seqname - name of the chromosome or scaffold; chromosome names can be given with or without the 'chr' prefix.
  2. source - name of the program that generated this feature, or the data source (database or project name)
  3. feature - feature type name, e.g. Gene, Variation, Similarity
  4. start - Start position of the feature, with sequence numbering starting at 1.
  5. end - End position of the feature, with sequence numbering starting at 1.
  6. score - A floating point value.
  7. strand - defined as + (forward) or - (reverse).
  8. frame - One of '0', '1' or '2'. '0' indicates that the first base of the feature is the first base of a codon, '1' that the second base is the first base of a codon, and so on.
  9. attribute - A semicolon-separated list of tag-value pairs, providing additional information about each feature.
They also remind us: "Fields must be tab-separated. Also, all but the final field in each feature line must contain a value; 'empty' columns should be denoted with a '.'"

Another common format is BED, which also describes the positions of features. Here are the fields, quoting from the UCSC page:
  1. chrom - The name of the chromosome (e.g. chr3, chrY, chr2_random) or scaffold (e.g. scaffold10671).
  2. chromStart - The starting position of the feature in the chromosome or scaffold. The first base in a chromosome is numbered 0.
  3. chromEnd - The ending position of the feature in the chromosome or scaffold. The chromEnd base is not included in the display of the feature. For example, the first 100 bases of a chromosome are defined as chromStart=0, chromEnd=100, and span the bases numbered 0-99.
  4. name - Defines the name of the BED line. This label is displayed to the left of the BED line in the Genome Browser window when the track is open to full display mode or directly to the left of the item in pack mode.
  5. score - A score between 0 and 1000. If the track line useScore attribute is set to 1 for this annotation data set, the score value will determine the level of gray in which this feature is displayed (higher numbers = darker gray).
  6. strand - Defines the strand - either '+' or '-'.
  7. thickStart - The starting position at which the feature is drawn thickly (for example, the start codon in gene displays).
  8. thickEnd - The ending position at which the feature is drawn thickly (for example, the stop codon in gene displays).
  9. itemRgb - An RGB value of the form R,G,B (e.g. 255,0,0). If the track line itemRgb attribute is set to "On", this RGB value will determine the display color of the data contained in this BED line. NOTE: It is recommended that a simple color scheme (eight colors or less) be used with this attribute to avoid overwhelming the color resources of the Genome Browser and your Internet browser.
  10. blockCount - The number of blocks (exons) in the BED line.
  11. blockSizes - A comma-separated list of the block sizes. The number of items in this list should correspond to blockCount.
  12. blockStarts - A comma-separated list of block starts. All of the blockStart positions should be calculated relative to chromStart. The number of items in this list should correspond to blockCount.
The first three columns—chromosome, chromosome start, chromosome end—are mandatory. Although the .bed file has taken on a larger life in the bioinformatics community, it was first invented by UCSC as a format to display tracks on the Genome Browser. Fields 4 through 12 are optional and inform how the .bed file will be rendered.

One thing to note about the BED format is that it follows one of the stupidest conventions ever invented, which is that ranges are 0-based and the end coordinate is not included. To illustrate, for the following sequence:
1 2 3 4
A C T G
0 1 2 3
nucleotides ACT would be described as at positions 1-3. However, UCSC describes them as at 0-3—the equivalent of calling a cat a dog—causing no end of misery with off-by-one errors.
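Concretely, converting a 1-based, end-inclusive GTF range into a 0-based, end-exclusive BED range means subtracting one from the start and leaving the end alone. A minimal sketch, printing just the three mandatory BED columns for the first GTF row shown below:
$ echo -e "5\tCU\texon\t1629\t3068" | awk -F"\t" '{print $1"\t"$4-1"\t"$5}'
5	1628	3068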

The problem at hand is a simple one: convert a .gtf file into a .bed file. Here's the .gtf file:
chr	source	feature	start	end	score	strand	frame	attribute
5       CU      exon    1629    3068    .       -       .       attributes
5       CU      exon    7918    8005    .       -       .       attributes
5       CU      exon    10109   10276   .       -       .       attributes
5       CU      exon    7268    7291    .       -       .       attributes
5       CU      exon    7918    8005    .       -       .       attributes
5       CU      exon    10109   12778   .       -       .       attributes
10      CU      exon    2669    4520    .       +       .       attributes
10      CU      exon    4981    5890    .       +       .       attributes
10      CU      exon    2669    4511    .       +       .       attributes
10      CU      exon    5149    5890    .       +       .       attributes
10      CU      exon    2678    4148    .       +       .       attributes
10      CU      exon    4594    4678    .       +       .       attributes
13      CU      exon    3576    7683    .       -       .       attributes
13      CU      exon    21180   21237   .       -       .       attributes
13      CU      exon    24379   27184   .       -       .       attributes
13      CU      exon    67096   67274   .       -       .       attributes
13      CU      exon    76511   76684   .       -       .       attributes
13      CU      exon    127251  127532  .       -       .       attributes
13      CU      exon    203289  203742  .       -       .       attributes
13      CU      exon    3576    7683    .       -       .       attributes
Since it's too wide to display, I've just called the 9th field "attributes". It actually looks like this:
attribute
gene_id "G033544"; transcript_id "T92538"; exon_number "1"; gene_name "POLR3K"
gene_id "G033544"; transcript_id "T92538"; exon_number "2"; gene_name "POLR3K"
gene_id "G033544"; transcript_id "T92538"; exon_number "3"; gene_name "POLR3K"
gene_id "G033544"; transcript_id "T92539"; exon_number "1"; gene_name "POLR3K"
gene_id "G033544"; transcript_id "T92539"; exon_number "2"; gene_name "POLR3K"
gene_id "G033544"; transcript_id "T92539"; exon_number "3"; gene_name "POLR3K"
gene_id "G002358"; transcript_id "T06620"; exon_number "1"; gene_name "LOC721591"
gene_id "G002358"; transcript_id "T06620"; exon_number "2"; gene_name "LOC721591"
gene_id "G002358"; transcript_id "T06621"; exon_number "1"; gene_name "LOC721591"
gene_id "G002358"; transcript_id "T06621"; exon_number "2"; gene_name "LOC721591"
gene_id "G002358"; transcript_id "T06624"; exon_number "1"; gene_name "LOC721591"
gene_id "G002358"; transcript_id "T06624"; exon_number "2"; gene_name "LOC721591"
gene_id "G008166"; transcript_id "T23412"; exon_number "1"; gene_name "FILIP1"
gene_id "G008166"; transcript_id "T23412"; exon_number "2"; gene_name "FILIP1"
gene_id "G008166"; transcript_id "T23412"; exon_number "3"; gene_name "FILIP1"
gene_id "G008166"; transcript_id "T23412"; exon_number "4"; gene_name "FILIP1"
gene_id "G008166"; transcript_id "T23412"; exon_number "5"; gene_name "FILIP1"
gene_id "G008166"; transcript_id "T23412"; exon_number "6"; gene_name "FILIP1"
gene_id "G008166"; transcript_id "T23412"; exon_number "7"; gene_name "FILIP1"
gene_id "G008166"; transcript_id "T23413"; exon_number "1"; gene_name "FILIP1"
This is a mouthful and converting it into BED form with a one-liner on the command line isn't practical. So let's break with the thread of this article and write a small script to do this. I will assume the file has no header.

Part 1

#!/usr/bin/env python

import sys, re

if (len(sys.argv) == 1):
    print('''A script which converts exon ranges in gtf format 
into UCSC appropriate bed format.
The ninth column is assumed to be of the format 
gene_id "X"; transcript_id "X"; exon_number "X"; gene_name "X"

Usage: 
gtf2ucsc_bed.py my.gtf

Verbose Mode:
gtf2ucsc_bed.py my.gtf 1
''');
    sys.exit(0)

# verbose
verbose = 0

if (len(sys.argv) > 2):
    verbose = int(sys.argv[2])

Part 2

# define dicts
dtran={}  # transcript id to chr, gene, strand mapping
dgene={}  # gene to transcript mapping
dstlen={} # transcript id to start position, and length mapping

with open(sys.argv[1], "r") as f:
    for line in f:
        # print(line),

        # break line on tabs
        listcols = line.split("\t")

        # get columns
        chr = listcols[0]
        exstart = int(listcols[3])
        exend = int(listcols[4])
        strand = listcols[6]
        attr = listcols[8]
        # print(attr),

        # extract strings from attributes column
        match = re.search(r'gene_id "(\S+)"; transcript_id "(\S+)"; exon_number "(\S+)"; gene_name "(\S+)"', attr)

        if match:
            # get names
            geneid = match.group(1)
            tranid = match.group(2)
            exonnum = match.group(3)
            genename = match.group(4)

            # print(geneid, tranid, exonnum, genename)

            # map transcript isoform id to gene
            dtran[tranid] = (chr, genename, strand)

            # map gene to transcript isoform ids
            # (create the inner dict on first sight of a gene, then always
            # register the transcript id, so a gene's first transcript
            # isn't silently dropped)
            if genename not in dgene:
                dgene[genename] = {}
            dgene[genename][tranid] = 1

            # map transcript isoform id to its exon starts and lengths
            if tranid in dstlen:
                dstlen[tranid].append((exstart, exend - exstart + 1))
            else:
                dstlen[tranid] = [(exstart, exend - exstart + 1)]


if verbose:
    print(dtran)
    print()
    print(dgene)
    print()
    print(dstlen)
    print()

Part 3

# keys are genes
for mygene in dgene:

    if verbose:
        print("***" + mygene + "***")

    # define isoform num to iterate over
    isoform_number = 1

    for myisoform in dgene[mygene]:
        newgenename = mygene + "." + str(isoform_number)
        isoform_number += 1

        startlist = ""
        endlist = ""
        lenlist = ""
        start_coord = 0
        end_coord = 0
        num_blocks = len(dstlen[myisoform])

        # boolean to track if we're at the first exon in the isoform
        isfirst = 1

        # get starts and lengths
        for mytup in dstlen[myisoform]:
            # convert format to idiotic UCSC convention: must subtract one from the start
            if isfirst:
                start_coord = mytup[0] - 1
                isfirst = 0
            startlist += "," + str(mytup[0] - 1 - start_coord) # this is an OFFSET!
            endlist += "," + str(mytup[0] - 1 + mytup[1])
            lenlist += "," + str(mytup[1])
            end_coord = mytup[0] - 1 + mytup[1]

        # remove leading comma
        startlist = re.sub(r',', "", startlist, 1)
        lenlist = re.sub(r',', "", lenlist, 1)
        endlist = re.sub(r',', "", endlist, 1)

        # print line for UCSC bed file
        print(dtran[myisoform][0] + "\t" + str(start_coord) + "\t" + str(end_coord) +
            "\t" + newgenename + "\t" + "1000" + "\t" + dtran[myisoform][2] +
            "\t" + str(start_coord) + "\t" + str(end_coord) + "\t" + "0" +
            "\t" + str(num_blocks) + "\t" + lenlist + "\t" + startlist)

        if verbose:
            print(myisoform)
            print(newgenename)
            print(startlist)
            print(lenlist)
            print()
What's going on here? In Part 1 we're setting up our arguments. If the user gives no arguments, we'll print a help message and quit. The first argument will be our GTF file. We'll make a boolean variable verbose that will be false by default, but can be turned on with a second argument.

In Part 2 we're going to define a couple of dictionaries we need, and populate them. Study the GTF file for a minute, and note that a single gene id or gene name can be spread over multiple transcript IDs. This is because in biology one gene can have numerous transcript isoforms based on different splicing patterns. So, in our GTF file, the transcript ID is the unit we want to focus on. The three dicts we will use are:
  • dtran - map transcript id to chr, gene, strand
  • dstlen - map transcript id to start position, and length of each of its exons
  • dgene - map gene to transcript ids
Now we populate these dictionaries. We loop through the GTF file and split each line on tabs, grabbing the columns we need. For the attribute field, we extract the gene_name etc. with a regular expression. If verbose mode is on, we can see what we have. Let's take a look.

dtran is:
{'T92539': ('5', 'POLR3K', '-'), 'T92538': ('5', 'POLR3K', '-'), 'T06624': ('10', 'LOC721591', '+'), 'T06621': ('10', 'LOC721591', '+'), 'T23412': ('13', 'FILIP1', '-'), 'T23413': ('13', 'FILIP1', '-'), 'T06620': ('10', 'LOC721591', '+')}
dstlen is:
{'T92539': [(7268, 24), (7918, 88), (10109, 2670)], 'T92538': [(1629, 1440), (7918, 88), (10109, 168)], 'T06624': [(2678, 1471), (4594, 85)], 'T06621': [(2669, 1843), (5149, 742)], 'T23412': [(3576, 4108), (21180, 58), (24379, 2806), (67096, 179), (76511, 174), (127251, 282), (203289, 454)], 'T23413': [(3576, 4108)], 'T06620': [(2669, 1852), (4981, 910)]}
dgene is:
{'LOC721591': {'T06620': 1, 'T06621': 1, 'T06624': 1}, 'FILIP1': {'T23412': 1, 'T23413': 1}, 'POLR3K': {'T92539': 1, 'T92538': 1}}
In precise language, the first dict maps a string to a tuple; the second maps a string to a list of tuples; and the third maps a string to a dict (an easy way to ensure that a gene maps to a non-repeated set of transcript ids, since the keys of a dict are unique). Now that we have our data structures, we just have to print them in BED form.

In Part 3 we'll use our dicts to print a BED file. Our goal is a file with standard gene names and, to deal with isoforms, we'll append a number to each gene name denoting the isoform. For example, we will call the two isoforms of POLR3K (corresponding to T92538 and T92539) POLR3K.1 and POLR3K.2.

In this block we first loop over the genes. Next, we loop over the isoforms. Within each isoform, we loop over the list containing the start position and length of each exon. Per the BED specification, we have to convert the start positions into 0-based offsets relative to chromStart in column twelve. We also have to take care to obey the UCSC convention of subtracting one from the starting coordinate. Once we've done all this we get our results, a BED file:
chr Strt  End     name         score s  thSt  thEnd   Rgb Ct blockSizes                   blockStarts
10  2668  5890    LOC721591.1  1000  +  2668  5890    0  2  1852,910                      0,2312
10  2668  5890    LOC721591.2  1000  +  2668  5890    0  2  1843,742                      0,2480
10  2677  4678    LOC721591.3  1000  +  2677  4678    0  2  1471,85                       0,1916
13  3575  203742  FILIP1.1     1000  -  3575  203742  0  7  4108,58,2806,179,174,282,454  0,17604,20803,63520,72935,123675,199713
13  3575  7683    FILIP1.2     1000  -  3575  7683    0  1  4108                          0
5   7267  12778   POLR3K.1     1000  -  7267  12778   0  3  24,88,2670                    0,650,2841
5   1628  10276   POLR3K.2     1000  -  1628  10276   0  3  1440,88,168                   0,6289,8480
(where I've added a header for clarity)
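To trace one row by hand: for transcript T06620, dstlen holds [(2669, 1852), (4981, 910)], so chromStart = 2669 - 1 = 2668; the block offsets are 2669 - 1 - 2668 = 0 and 4981 - 1 - 2668 = 2312; and chromEnd = 4981 - 1 + 910 = 5890, which is exactly the LOC721591.1 line above.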
Score!

An Example with Unix & Awk - Joining Two Files on a Common Key when They're Too Big to Hash

Joining two files on a common key is a familiar task. We can easily solve this by employing a hash (or dict), but what if one of the files is huge? If we try to make a hash table from a giant file—say, one of a couple hundred megabytes—our system may run out of memory. The other day, I was in this very situation. I had two files: a small blast file and a monster file mapping accession IDs to gene names. I wanted to join them so I could see the gene names in the blast file.

The blast file, 1.blast, looked like this:
qseqid			sseqid 				pident  length  ...
TCONS_00007936|m.162    gi|27151736|ref|NP_006727.2|    100.00  324
TCONS_00007944|m.1236   gi|55749932|ref|NP_001918.3|    99.36   470     
TCONS_00007947|m.1326   gi|157785645|ref|NP_005867.3|   91.12   833     
TCONS_00007948|m.1358   gi|157785645|ref|NP_005867.3|   91.12   833     
TCONS_00007949|m.1390   gi|157785645|ref|NP_005867.3|   98.56   833     
TCONS_00007951|m.1446   gi|157785645|ref|NP_005867.3|   97.75   2307    
TCONS_00007952|m.1541   gi|157785645|ref|NP_005867.3|   97.13   1461    
TCONS_00007952|m.1542   gi|157785645|ref|NP_005867.3|   68.17   600     
TCONS_00007954|m.1664   gi|31881779|ref|NP_037467.2|    99.29   420     
TCONS_00007955|m.1692   gi|31881779|ref|NP_037467.2|    99.29   420     
While the accession to gene file, acc2gene.txt, looked like this:
accession	gene
NP_047184.1     repA1
NP_047186.1     repA2
NP_047187.1     leuA
NP_047188.1     leuB
NP_047189.1     leuC
NP_047190.1     leuD
NP_858065.1     ibp
NP_858066.1     repA1
NP_047185.1     pLeuDn_02
YP_003329478.1  leuC
(where I've added headers for clarity). Instead of making a hash table mapping:
accession ID --> gene
we're going to concatenate these two files and sort them on their common key, accession ID. First we have to make sure that each file has an accession column, and that these columns line up. We'll be fancy and use a bit of process substitution:
$ cat 1.blast | cut -d"|" -f5 > 1.blast.cut # make file of accession IDs from blast file
$ sort -k1,1 <( paste 1.blast.cut 1.blast ) acc2gene.txt > comb.sort 
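(Process substitution, the <( command ) construct, lets the output of a command stand in for a file name, so here sort reads the pasted blast file as if it were an ordinary file on disk.)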
The file comb.sort is a combination of our two files, sorted on accession ID. Notice what we've accomplished: the rows we wanted to join are right next to each other. If we look around, say, accession NP_006727.2 in comb.sort we see:
NP_006727.2     DNAJB2
NP_006727.2     DNAJB2
NP_006727.2     DNAJB2
NP_006727.2     DNAJB2
NP_006727.2     DNAJB2
NP_006727.2     TCONS_00007936|m.162    gi|27151736|ref|NP_006727.2|    100.00  324 
NP_006727.2     TCONS_00007937|m.215    gi|27151736|ref|NP_006727.2|    92.86   98  
NP_006727.2     TCONS_00007937|m.224    gi|27151736|ref|NP_006727.2|    100.00  58  
NP_006727.2     TCONS_00007938|m.286    gi|27151736|ref|NP_006727.2|    100.00  140 
NP_006727.2     TCONS_00007938|m.292    gi|27151736|ref|NP_006727.2|    100.00  58  
Now all we have to do is join these adjacent rows:
$ cat comb.sort | 
   awk -F"\t" '{if (NF==2) {id=$1; gene=$2;} else {arr[id]=gene; print $0"\t"arr[$1]}}' > comb.sort.ann
What does this code do? If a row has two columns, it came from the giant file, and we know column 1 is accession and column 2 is gene, so we store these in the variables id and gene. Otherwise, the row came from the smaller blast file: we record the mapping in a hash table, arr, and use it to print the line with its gene name appended. Note that arr is much smaller than a hash indiscriminately mapping all IDs to genes, because it only holds the IDs which appear in the small blast file. So comb.sort.ann looks like:
NP_006727.2     TCONS_00007936|m.162    gi|27151736|ref|NP_006727.2|    100.00  ...	DNAJB2
NP_006727.2     TCONS_00007937|m.215    gi|27151736|ref|NP_006727.2|    92.86   ...	DNAJB2
NP_006727.2     TCONS_00007937|m.224    gi|27151736|ref|NP_006727.2|    100.00  ...	DNAJB2
NP_006727.2     TCONS_00007938|m.286    gi|27151736|ref|NP_006727.2|    100.00  ...	DNAJB2
NP_006727.2     TCONS_00007938|m.292    gi|27151736|ref|NP_006727.2|    100.00  ...	DNAJB2
And we're done, right? No, unfortunately, there's a problem. If we look through the file carefully, we see that sometimes we get a situation like this:
NP_006727.2     TCONS_00007936|m.162    gi|27151736|ref|NP_006727.2|    100.00  324 
NP_006727.2     TCONS_00007937|m.215    gi|27151736|ref|NP_006727.2|    92.86   98  
NP_006727.2     TCONS_00007937|m.224    gi|27151736|ref|NP_006727.2|    100.00  58  
NP_006727.2     TCONS_00007938|m.286    gi|27151736|ref|NP_006727.2|    100.00  140 
NP_006727.2     TCONS_00007938|m.292    gi|27151736|ref|NP_006727.2|    100.00  58  
NP_006727.2     DNAJB2
NP_006727.2     DNAJB2
NP_006727.2     DNAJB2
NP_006727.2     DNAJB2
NP_006727.2     DNAJB2
where the blast lines come before the gene-to-accession lines. This sort order breaks our code: when a blast line arrives, the hash table arr won't yet contain the mapping
arr[NP_006727.2] = DNAJB2
because that assignment gets made too late. We can fix this with a "hack":
$ cat 1.blast | cut -d"|" -f5 > 1.blast.cut  # get file of accession numbers from the blast file
$ sort -k1,1 -k3,3 <( paste 1.blast.cut 1.blast ) acc2gene.txt > comb.sort 
$ cat comb.sort | 
   awk -F"\t" '{if (NF==2) {id=$1; gene=$2;} else {arr[id]=gene; print $0"\t"arr[$1]}}' > comb.sort.ann
The only thing we've added is:
sort -k1,1 -k3,3
Since acc2gene.txt has an empty third column, this will ensure its lines always land above the blast lines and our problem is solved.
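(For the record, coreutils also ships a join utility built for exactly this kind of merge, provided both inputs are sorted on the key. A minimal sketch reusing the intermediate files from above:
$ join -t $'\t' <( paste 1.blast.cut 1.blast | sort -k1,1 ) <( sort -k1,1 acc2gene.txt )
By default join prints the key followed by the remaining fields of each file, so this yields roughly the same annotated rows, though blast lines with no matching accession are dropped unless you add a flag like -a 1.)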