Wiki: Bioinformatics
A Small Bioinformatics Wiki Covering Software, File Formats, and Useful Commandsby Oliver; 2014
Introduction
This article is no longer up to date or maintained
This is a collection of miscellaneous bioinformatics stuff for reference (more a reminder to myself than a carefully crafted article).
Biotechnology Resources
Illumina, a leading manufacturer of sequencing machines, has an introduction to sequencing worth reading here: An Introduction to Next-Generation Sequencing Technology. I've stolen a figure from this document illustrating the next-generation sequencing workflow:
As the figure says, the four steps are:
A major resource for biotechnology on the web is the National Center for Biotechnology Information.
- NGS library is prepared by fragmenting a gDNA sample and ligating specialized adapters to both fragment ends.
- Library is loaded into a flow cell and the fragments hybridize to the flow cell surface. Each bound fragment is amplified into a clonal cluster through bridge amplification.
- Sequencing reagents, including fluorescently labeled nucleotides, are added and the first base is incorporated. The flow cell is imaged and the emission from each cluster is recorded. The emission wavelength and intensity are used to identify the base. This cycle is repeated "n" times to create a read length of “n” bases.
- Reads are aligned to a reference sequence with bioinformatics software. After alignment, differences between the reference genome and the newly sequenced reads can be identified.
BLAST (discussed below) is a famous sequence aligner you can use online.
The UCSC Genome Browser (discussed below) provides a graphical user interface for browsing the human genome, as well as the genomes of other species. The Exome Aggregation Consortium Browser provides a searchable database of variants from "a wide variety of large-scale sequencing projects." DAVID is a tool for analyzing lists of genes.
Biostars and SEQanswers are bioinformatics-community forums which post questions and answers in the manner of Stack Overflow. They tend to pop up when you Google bioinformatics questions.
What Is a Read?; Mapping
Modern sequencing technologies break DNA or RNA into many shorter fragments, and it is these short fragments—reads in sequencing parlance—which are sequenced. For example, a read of 20 nucleotides might look like this:AGCTCAGCAAAGCTTCGCTCTypically, reads are longer than this, although the precise length is dependent on the sequencing platform. A giant file of raw reads isn't the most useful thing, so typically reads have to be processed—they are the input to bioinformatic programs or pipelines. A simple starting point is to map or align the reads, which means matching them to a pre-existing reference by sequence homology. This will tell us the read's location. For example, perhaps a hypothetical read maps to chromosome 1 of the human genome reference at position 2345. Then, after alignment, you might see something like this:
chr1 2345 AGCTCAGCAAAGCTTCGCTCfor the particular read.
Your set of reads is unlikely to agree perfectly with the reference to which you're mapping it, since members of the same species don't have identical genomes. Mapping is a way to quantify both the similarities and the differences vis-à-vis the reference. This difference can take the form of single nucleotide variants (SNVs) and insertions/deletions (indels). Larger structural rearrangements, such as copy number variations and chromosomal translocations, can be more challenging to catch.
Two important concepts related to mapping are paired-end reads and coverage depth. Coverage depth is simply the number of reads that align on a particular spot of the reference genome:

The Illumina sequencing introduction linked above describes paired-end reads:
A major advance in NGS technology occurred with the development of paired-end (PE) sequencing. PE sequencing involves sequencing both ends of the DNA fragments in a sequencing library and aligning the forward and reverse reads as read pairs. In addition to producing twice the number of reads for the same time and effort in library preparation, sequences aligned as read pairs enable more accurate read alignment and the ability to detect indels, which is simply not possible with single-read data. Analysis of differential read-pair spacing also allows removal of PCR duplicates, a common artifact resulting from PCR amplification during library preparation. Furthermore, paired-end sequencing produces a higher number of SNV calls following read-pair alignment. While some methods are best served by single-read sequencing, such as small RNA sequencing, most researchers currently use the paired-end approach.Their figure:

We mentioned the alignment software Blast above but for high coverage, paired-end, short reads—i.e., what comes out of most modern sequencing machines—you'll want to avoid Blast and use a program tailor-made for this job, such as bwa.
Assembly
What if you have sequencing data for an uncharacterized bacteria or transcriptome? Or what if you want to produce your own reference genome? In all of these cases, the protocol is sequence assembly, which is the process of stitching your reads back together—i.e., forming the sequences your reads comprised before they were sheared into fragments. When two or more reads are joined, the longer sequence they produce is called a contig. Assembly is usually a very computationally-intensive process and there are many programs to choose from, as listed on the Wikipedia page. After your assembly is finished, you've made a reference and other people can map their reads to it. Various big bioinformatics outfits regularly roll out new and improved reference assemblies of the human genome and those of other species (see, e.g., The Genome Reference Consortium). One side note about assembly is that there is no answer key—no way to check how close you got to the right answer.The Genome
For the purpose of getting a simple model in your head, think of the genome as having two distinct regions: genes and intergenic regions. Continuing the oversimplification, think of genes as having two kinds of components: exons, the coding part, and introns, the non-coding part. Here's a figure from Wikipedia:
Current DNA sequencing experiments often focus only on the total set of exons, known as the exome (although whole-genome sequencing is becoming more common).
DNA vs RNA Sequencing Data
The fundamental difference between DNA-seq and RNA-seq data is that the latter contains transcripts which are the product of splicing—the exons we met above can join in various combinations to form mRNA. This makes dealing with RNA data more of a challenge: trying to map it onto a genomic reference will be harder, since two contiguous exons in an RNA transcript may be separated by an intron in the genomic DNA. Furthermore, a DNA-seq dataset should have sequences representing the whole genome (although, in practice, often just the exome is captured), whereas an RNA-seq dataset will only contain the particular transcripts expressed in your population of cells (e.g., liver cells are going to look different than brain cells). Because humans are diploid, a heterozygous variant at a particular position will be present in roughly 50% of the covering reads in DNA-seq data. In RNA-seq, however, this signal will be much noisier because coverage is related to transcript expression and all the complicated regulatory machinery that goes along with it. It's easier to call variants in DNA sequencing data, but RNA sequencing data has other dimensions: you can use it for analyses of gene expression, gene fusions, and pathogen discovery. As my coworker eloquently put it, RNA is like a way station—if you want to express protein, you have no choice but to pass through it, so the genetic material of the organism as well as that of any virus infecting it will show up in the RNA.Hat tip: Sak
Fastq and Fasta Formats
Unaligned, short reads often come in fastq ("fast Q") format. In this file format, there are 4 lines per read: read ID; sequence; a plus sign; and sequence quality. For example, the following 12 lines comprise three sequences:@CWS01-ILLUMINA:9:FC:1:1:4115:1677 1:Y:0:GGCTAC NGCGCCGGGCGGGGCGAAAGCGGGCGTCATGGGCGCAGCTTCAGGCGAATGTAGAGGAAATTGACCGCCGCGGCGG + ############################################################################ @CWS01-ILLUMINA:9:FC:1:1:4153:1672 1:Y:0:GGCTAC NGCGGAAAGCGGCTGGGGGGTGCGGGGACGGCGCGCGCACGGAGGGGGCTCGCCGGCGCCGCGGGCAGGCGGGCGC + ############################################################################ @CWS01-ILLUMINA:9:FC:1:1:4197:1676 1:N:0:GGCTAC NGAAAAAGGCGGCAGCAGCATGGTTGGTGGCGGGGAGCATGTGCGCCGCGGCGCACGCGCAAGGCACGGGGACGCT + #,.,*///./@@@@@@@2@@<<<:<@@@@@##############################################A more general format for storing sequencing data is fasta ("fast A"). The pattern for this format is >Sequence ID followed by the sequence, spanning any number of lines. Note the identifier always starts with a > character. Compared to fastq format, fasta format is suitable for longer sequences. For example, here are two sequences in fasta format:
>c0_g1_i1 GTTTCATACTTGCTCTGAATGGCGTTGCTGAGTCCAGACCAGCATCCGCGGTGCCAGCGC CGGGCACCGCAGGGGACCTGAGTCCTGCCCCAGTGTCAAGCCGCGGTTCTGTTATCCGTC ACCACGACAACCAAGCACCAGACCTGCCGGGTGTTCGGCAGTGGCAGTGACAGCTCAGAC CCGCTCAGACGCAGGGTAGTGAGCCGGGCGGCTCTGGTG >c2_g1_i1 TTGTATTCAAATTCATTCCTAGTCCTATTGGCTCTTTTATTCTTTCCACGGAATCTTCCG TTGTGTGTACTTTTTATCTTTTCTTTAGAATTTCTTTTGAATTTAATTAATACCATAAGG AGCCTAGTGATTCCCAACTGTTTACAAATTTTGCCTTGTCAATGAGATTTTAAATTCTTT TCCTAAGGACTTAGTTGGGGCATAGTAAGAAATTGTAATTGATTGATAAAAAAAAACCATNote that there's no sequence quality in this format. You'll usually find reference genomes or contigs produced by an assembly in fasta format, while reads fresh off the sequencing machine come in fastq.
Alignment and SAM Format
As we remarked above, alignment, also called mapping, is the process of matching reads to a pre-existing reference by sequence homology. Aligned reads are stored in yet another file format known as sam (for Sequence Alignment/Map), which includes both the read and reference sequences and information about where and how they differ. The compressed form of a sam file is known as a bam file. Here's a typical workflow:
fastq file of reads --> align to reference --> bam file
A bam file can be indexed to produce a corresponding bai index file, as you can read about on on BioStars.
Here's what the first 10 lines of a sam file looks like (you can scroll to the right):

Quoting directly from the man page, the columns are:
sam
1 | Query template NAME |
2 | bitwise FLAG |
3 | Reference sequence NAME |
4 | 1-based leftmost mapping POSition |
5 | MAPping Quality |
6 | CIGAR string |
7 | Ref. name of the mate/next read |
8 | Position of the mate/next read |
9 | observed Template LENgth |
10 | segment SEQuence |
11 | Phred-scaled base QUALity+33 |
Bowtie and BWA
As we remarked above, bowtie—now officially bowtie2—and bwa are two popular aligners. They excel at mapping short reads from high-throughput sequencing data.Bowtie
To use bowtie, you must first index your reference:$ bowtie2-build ref.fa prefix(It will create files whose names begin with prefix.)
To map paired reads (bowtie2 version 2.1.0):
$ bowtie2 -x /path/to/ref/prefix -1 read1.fq -2 read2.fq > aln-pe.samTo quote the manual directly, the general usage is:
Usage: bowtie2 [options]* -x <bt2-idx> {-1 <m1> -2 <m2> | -U <r>} [-S <sam>] <bt2-idx> Index filename prefix (minus trailing .X.bt2). NOTE: Bowtie 1 and Bowtie 2 indexes are not compatible. <m1> Files with #1 mates, paired with files in <m2>. Could be gzip'ed (extension: .gz) or bzip2'ed (extension: .bz2). <m2> Files with #2 mates, paired with files in <m1>. Could be gzip'ed (extension: .gz) or bzip2'ed (extension: .bz2). <r> Files with unpaired reads. Could be gzip'ed (extension: .gz) or bzip2'ed (extension: .bz2). <sam> File for SAM output (default: stdout) <m1>, <m2>, <r> can be comma-separated lists (no whitespace) and can be specified many times. E.g. '-U file1.fq,file2.fq -U file3.fq'.Oftentimes, your goal with mapping is produced a sorted bam file along with its bai index. If you're smart, you can use bash pipes to do this without creating a lot of polluting intermediate files. E.g.:
$ # map, convert sam to bam, sort bam $ bowtie2 -x $ref -1 $mate1 -2 $mate2 | samtools view -bS - | samtools sort - $mylocaloutputdir/${name}.sortedThis will produce a file called $mylocaloutputdir/${name}.sorted.bam.
BWA
Index your reference:$ bwa index ref.faFor single end reads:
bwa aln ref.fa short_read.fq > aln_sa.sai bwa samse ref.fa aln_sa.sai short_read.fq > aln-se.samFor paired reads:
bwa aln ref.fa read1.fq > aln_sa1.sai bwa aln ref.fa read2.fq > aln_sa2.sai bwa sampe ref.fa aln_sa1.sai aln_sa2.sai read1.fq read2.fq > aln-pe.samGenerate a bam file, instead of a sam file, with bwa:
bwa sampe ref.fa aln_sa1.sai aln_sa2.sai read1.fq read2.fq | samtools view -bS - > aln-pe.bamFor long reads:
bwa bwasw ref.fa long_read.fq > aln.samHere's an example of a script to automate bwa mapping. It assumes the fastq files are zipped and uses 8 cores:
#!/bin/bash echo [start] echo [pwd] `pwd` echo [date] `date` mate1=$1 mate2=$2 myoutputdir=$3 name=$4 ref=$5 echo 1 bwa aln -t 8 $ref <( cat $mate1 | gunzip ) > ${myoutputdir}/$name.1.sai echo 2 bwa aln -t 8 $ref <( cat $mate2 | gunzip ) > ${myoutputdir}/$name.2.sai echo 3 bwa sampe $ref ${myoutputdir}/$name.1.sai ${myoutputdir}/$name.2.sai <( cat $mate1 | gunzip ) <( cat $mate2 | gunzip ) | samtools view -bS - > ${myoutputdir}/${name}.bam # now sort and index bam file echo [finish] echo [date] `date`Update: bwa mem is the preferred command, not bwa aln plus bwa samse / bwa sampe. This source notes:
Yes, the aln/sampe commands invoke the original BWA algorithm (backtracking, from 2009), and the mem command invokes the latest BWA improvement (MEM algorithm, 2013). I think it is fair to say that mem should replace aln/sampe for everything now. (except maybe on older, < 70 bp reads)
Samtools
As we remarked above, sam format is a common file structure in which to store aligned sequence data. Samtools is a popular suite of software tools for manipulating sam and bam files. Here are some basic samtools (version 0.1.19) commands.View the first one hundred lines of a bam file on the command line:
$ samtools view myfile.bam | head -100 | column -t | less -SNote the:
column -t | less -S
combination, which makes viewing jagged multi-column files in the terminal easy.
Convert bam to sam:
$ samtools view myfile.bam -h > myfile.samThe -h flag preservers the header, which is necessary for converting sam back to bam.
Get a particular region of a bam file, e.g. chr20:
$ samtools view -h myfile.bam chr20Sam to bam:
$ samtools view -bS myfile.sam > myfile.bamSort bam:
$ samtools sort myfile.bam myfile.sortedIndex a bam file (to produce a .bai file):
$ samtools index myfile.sorted.bamGet the coverage depth of a bam file:
$ samtools depth myfile.sorted.bam > myfile.sorted.depth.txtIndex a fasta file (to produce a .fai file):
$ samtools faidx myfile.faPull out a subsequence from an faidx-indexed fasta file by range:
$ samtools faidx myfile.fa chr1:100000-100005Generate a bam file such that all reads meet the criteria "the read is paired in sequencing" and "the read is mapped in a proper pair":
$ samtools view -b -h -f 0x0001 -f 0x0002 myfile.bam > myfile.goodreads.bamA one-liner to split your bam file by chromosome (scroll horizontally to see the full command):
$ for i in {1..22} {X,Y,MT}; do echo "* "$i; mkdir -p chr/${i}; samtools view -bS <( samtools view -h in.bam $i ) > chr/${i}/out.${i}.bam; doneSamtools has an ASCII alignment viewer which allows you to see how reads mapped to the reference right in your terminal. (First make sure that you've indexed your .bam file so that the .bai file is in the same directory.) Syntax:
$ samtools tview $bam $refThen type g to go to a position or ? for help.
Here's an example of what it looks like:

View reads covering a particular position (e.g., chr9:711187):
$ samtools tview $bam $ref -p chr9:711187-711187 -d tWe'll talk about pileup format below, but here's how to generate it with samtools for a particular region (e.g., chr9:711187-711287):
$ samtools mpileup -d 100000 -L 100000 -r chr9:711187-711287 -f $ref $bamsThe mpileup command has a lot of flags and you need to pay careful attention to them. Here, we're using:
- -d INT max per-BAM depth to avoid excessive memory usage [250]
- -L INT max per-sample depth for INDEL calling [250]
$ samtools mpileup -A -Q 0 -d 100000 -L 100000 -r chr9:711187-711287 -f $ref $bams
Samtools Flags
The second column in a sam file is the flag. The flag gives you information about how your read mapped, but to decode this information you have to convert the flag from base-10 to base-2. For example, if your flag is 99, then in binary it's:
1100011
How do we read this?
Quoting from the Samtools man page:
Bit | Hexidecimal | Description |
1 | 0x1 | template having multiple segments in sequencing (i.e., read paired) |
2 | 0x2 | each segment properly aligned according to the aligner (i.e., read mapped in proper pair) |
4 | 0x4 | segment unmapped (i.e., read unmapped) |
8 | 0x8 | next segment in the template unmapped (i.e., mate unmapped) |
16 | 0x10 | SEQ being reverse complemented |
32 | 0x20 | SEQ of the next segment in the template being reverse complemented |
64 | 0x40 | the first segment in the template |
128 | 0x80 | the last segment in the template |
256 | 0x100 | secondary alignment |
512 | 0x200 | not passing filters, such as platform/vendor quality controls |
1024 | 0x400 | PCR or optical duplicate |
2048 | 0x800 | supplementary alignment |
So, for 99 == 1100011, the 20, 21, 25, and 26 bits are all high—i.e., it's paired (1); it's mapped in proper pair (2); the mate is on the reverse strand (32); and the first mate in the pair (64). As we saw in the previous section, we can filter on these qualities. For example, to get all the first mate reads, it's:
samtools view -f 64 file.bamTo get the second mate, it's:
samtools view -f 128 file.bam
File Formats in Which to Store Variants: mpileup, vcf, maf
If you have multiple bam files which you want to compare and all are mapped to the same reference, you can put them in mpileup format. To emphasize this point, bam files have information for a single sample, while pileup format can accommodate an arbitrary number of samples merged. A pileup file contains all the information in the bam files displayed "vertically." It gives the reference base per position as well as the corresponding read bases and read base qualities for all samples. It might look like this:chr | pos | ref base | cov depth | read bases | read base quals |
1 | 15026 | C | 6 | ,$..,,, | HD?E@9 |
1 | 15027 | T | 6 | ..,,,, | DD>A:C |
1 | 15028 | C | 4 | .,,, | EC;7 |
1 | 15029 | G | 5 | .,,,, | AE?>C |
1 | 15030 | T | 5 | .,,,, | 9CC=? |
1 | 15031 | G | 5 | A,,,, | 5FAAE |
1 | 15032 | G | 4 | .,,, | >F5< |
1 | 15033 | C | 5 | .,,,, | BB<<D |

The columns for a single sample are:
mpileup
1 | chromosome |
2 | 1-based coordinate |
3 | reference base |
4 | the number of reads covering the site |
5 | read bases |
6 | base qualities |
Pileup format isn't particularly human-readable, so there's one more format translation you need to do, into Variant Call Format (vcf). Here's the idea:
multiple bam files --> mpileup --> vcf
Here's what a vcf file looks like (scroll right):

Quoting directly from here, the columns are:
vcf
1 | CHROM | the chromosome |
2 | POS | the genome coordinate of the first base in the variant. Within a chromosome, VCF records are sorted in order of increasing position |
3 | ID | a semicolon-separated list of marker identifiers |
4 | REF | the reference allele expressed as a sequence of one or more A/C/G/T nucleotides (e.g. "A" or "AAC") |
5 | ALT | the alternate allele expressed as a sequence of one or more A/C/G/T nucleotides (e.g. "A" or "AAC"). If there is more than one alternate alleles, the field should be a comma-separated list of alternate alleles |
6 | QUAL | probability that the ALT allele is incorrectly specified, expressed on the the phred scale (-10log10(probability)) |
7 | FILTER | Either "PASS" or a semicolon-separated list of failed quality control filters |
8 | INFO | additional information (no white space, tabs, or semi-colons permitted) |
9 | FORMAT | colon-separated list of data subfields reported for each sample. |
##fileformat=VCFv4.2 ##source=pileup2multiallele_vcf ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype"> ##FORMAT=<ID=SDP,Number=1,Type=Integer,Description="Raw Read Depth"> ##FORMAT=<ID=RD,Number=1,Type=Integer,Description="Depth of reference-supporting reads"> ##FORMAT=<ID=AD,Number=1,Type=Integer,Description="Depth of variant-supporting reads"> ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Quality Read Depth of bases with Phred score >= 0"> ##FORMAT=<ID=FREQ,Number=1,Type=Float,Description="Variant allele frequency"> ##FORMAT=<ID=RBQ,Number=1,Type=Integer,Description="Average quality of reference-supporting reads"> ##FORMAT=<ID=ABQ,Number=1,Type=Integer,Description="Average quality of variant-supporting reads"> ##FORMAT=<ID=RDF,Number=1,Type=Integer,Description="Depth of reference-supporting reads on forward strand"> ##FORMAT=<ID=RDR,Number=1,Type=Integer,Description="Depth of reference-supporting reads on reverse strand"> ##FORMAT=<ID=ADF,Number=1,Type=Integer,Description="Depth of variant-supporting reads on forward strand"> ##FORMAT=<ID=ADR,Number=1,Type=Integer,Description="Depth of variant-supporting reads on reverse strand"> ##SnpSiftVersion="SnpSift 4.1c (build 2015-03-29), by Pablo Cingolani" ##SnpSiftCmd="SnpSift annotate dbSnp138.vcf res/report.vcf" ##INFO=<ID=CLNDSDBID,Number=.,Type=String,Description="Variant disease database ID"> ##INFO=<ID=PH3,Number=0,Type=Flag,Description="HAP_MAP Phase 3 genotyped: filtered, non-redundant"> ##INFO=<ID=VC,Number=1,Type=String,Description="Variation Class"> ##INFO=<ID=G5A,Number=0,Type=Flag,Description=">5% minor allele frequency in each and all populations"> ##INFO=<ID=CFL,Number=0,Type=Flag,Description="Has Assembly conflict. This is for weight 1 and 2 variant that maps to different chromosomes on different assemblies."> ##INFO=<ID=KGValidated,Number=0,Type=Flag,Description="1000 Genome validated">Whatever sub-fields are in the INFO or FORMAT field should be described here. After the nine fixed fields of the vcf, you can have an additional field for each sample, whose format is specified by the FORMAT column. The INFO sub-fields generally describe features that hold true across all samples, while the FORMAT sub-fields give you information per sample.
Here are some examples of sub-fields of format field:
DP | Description="Quality Read Depth of bases with Phred score >= 15" |
RD | Description="Depth of reference-supporting bases" |
AD | Description="Depth of variant-supporting bases" |
RBQ | Description="Average quality of reference-supporting bases" |
ABQ | Description="Average quality of variant-supporting bases" |
RDF | Description="Depth of reference-supporting bases on forward strand" |
RDR | Description="Depth of reference-supporting bases on reverse strand" |
ADF | Description="Depth of variant-supporting bases on forward strand" |
ADR | Description="Depth of variant-supporting bases on reverse strand" |
maf
1 | Hugo_Symbol |
2 | Entrez_Gene_Id |
3 | Genome center |
4 | NCBI_Build |
5 | Chromosome |
6 | Start_Position |
7 | End_Position |
8 | Strand |
9 | Variant_Classification |
10 | Variant_Type |
11 | Reference_Allele |
12 | Tumor_Seq_Allele1 |
13 | Tumor_Seq_Allele2 |
14 | dbSNP_RS |
15 | dbSNP_Val_Status |
16 | Tumor_Sample_Barcode |
17 | Matched_Norm_Sample_Barcode |
18 | Match_Norm_Seq_Allele1 |
19 | Match_Norm_Seq_Allele2 |
20 | Tumor_Validation_Allele1 |
21 | Tumor_Validation_Allele2 |
22 | Match_Norm_Validation_Allele1 |
23 | Match_Norm_Validation_Allele2 |
24 | Verification_Status |
25 | Validation_Status |
26 | Mutation_Status |
27 | Sequencing_Phase |
28 | Sequence_Source |
29 | Validation_Method |
30 | Score |
31 | BAM_File |
32 | Sequencer |
33 | Tumor_Sample_UUID |
34 | Matched_Norm_Sample_UUID |
Working with VCF Files
View a vcf file in the terminal (cutting off header lines):$ cat file.vcf | sed '/##/d' | column -t | less -SIndex tab-delimited files, such as those in vcf format, with tabix for fast searching:
$ bgzip file.vcf # produces file.vcf.gz $ tabix -p vcf file.vcf.gz # index vcf file with tabix (make .tbi file) $ # now we can grab ranges super fast, say chr 10 pos 370800 to 370900 $ tabix file.vcf.gz 10:370800-370900Note that this uses samtools' bgzip, not a standard unix utility like bzip2. Once you've indexed a vcf file in this way, you can easily pull out any region—say, chromosome 10, positions 370800-370900—as shown above.
There are Perl scripts for manipulating vcfs called VCFtools, but these can be slow with large vcfs. vcflib, faster than VCFtools, is "a simple C++ library for parsing and manipulating VCF files, + many command-line utilities".
Extract particular fields from the INFO-field:
$ vcfkeepinfo test.vcf DP4 MQ > test.small.vcfExtract particular fields from the GENOTYPE-field (i.e., FORMAT-field):
$ vcfkeepgeno test.vcf RBQ ABQ SDP AD > test.small.vcfIn this particular case, we cut the reference and variant quality fields, and the total depth and variant depth.
Calling Variants
Calling variants is a phrase which can mean two things in bioinformatics: (1) simply enumerating differences between a reference and a sample; and (2) determining which of those differences are significant. These are two very different questions. In both cases, we follow the pattern discussed above:
bam files --> mpileup file --> vcf
Enumerating the differences is straightforward; finding the significant ones is difficult.
Here is a simple example of calling variants with mpileup + bcftools (version 0.1.19):
$ samtools mpileup -D -S -d 5000 -r 1 -uf ref.fa 1.bam 2.bam | bcftools view -bvcg -p 2 - | bcftools view - > out.vcfThese commands are highly dependent on the flag options. Here are some of the options:
Output options: -d the max per-BAM depth to avoid excessive memory usage [250] -D output per-sample DP in BCF (require -g/-u) -g generate BCF output (genotype likelihoods) -O output base positions on reads (disabled by -g/-u) -s output mapping quality (disabled by -g/-u) -S output per-sample strand bias P-value in BCF (require -g/-u) -u generate uncompress BCF output
Blast
Blast, for Basic Local Alignment Search Tool, is a robust and versatile aligner which can handle short as well as very long sequences. The docs summarize:BLAST finds regions of similarity between biological sequences. The program compares nucleotide or protein sequences to sequence databases and calculates the statistical significance.As the name "Blast" says, it's a local aligner which means that, no matter how large your query and subject (reference) sequences are, it will find all the matching subsequences. It also means the alignment can be degenerate—one piece of the query sequence can map, non-uniquely, to many different places on the reference sequence. Blast can search very, very large databases of sequence data but, as with most aligners, it's computationally expensive.
Blast has an online GUI whose portal looks like this:

The output after performing an alignment looks like this:

Although this portal is convenient for small queries, if you want to run large queries, make your own blast databases, or run batches of jobs, it's worth downloading the command line version. Example command line blast syntax:
$ blastn -task megablast -db /db/prefix -outfmt 6 -query input.fa -word_size 28 -evalue .002 $ blastn -task megablast -db /db/prefix -outfmt 6 -query input.fa -evalue 1e-30 -max_target_seqs 5The header is:
1 | qseqid |
2 | sseqid |
3 | pident |
4 | length |
5 | mismatch |
6 | gapopen |
7 | qstart |
8 | qend |
9 | sstart |
10 | send |
11 | evalue |
12 | bitscore |
Index a Fasta File to Use as a Blast Database
To make a reference for blast, you have to index it. E.g.:$ makeblastdb -in ${input_fasta} -dbtype nucl -out ${name} -parse_seqids -hash_index
Downloading NCBI's Blast NT Database
The Blast NT database comprises a broad swath of sequence data across all life—"All GenBank + EMBL + DDBJ + PDB sequences," as the docs describe it. It's the most comprehensive repository of sequences there is and what you should blast to if you don't know what species your query sequence is. To download:$ wget ftp://ftp.ncbi.nlm.nih.gov/blast/db/nt*Then untar.
Making a Blast Subset Database
Once you've downloaded the NT database you can make a subset of it corresponding to, say, viruses or bats or anything else. To do this, you first need to get a list of the GI asscessions of all the sequences in the taxa you want. Go to NCBI Nucleotide and refine your search results using the Taxonomic Groups box on the right. Download the GIs as a text file by selecting: "Send to: File" and "Format: GI List". Here's a screenshot showing the process of downloading all bat GIs:
It will be called something like sequence.gi.txt. Then use the blastdb_aliastool to make a subset db:
$ blastdb_aliastool -db /path/to/blast/database/nt -dbtype nucl -gilist sequence.gi.txt -title YourTitle -out YourName
Querying the NT Database from the Command Line
To see the "head" of your NT database in fasta format, use:$ blastdbcmd -db ${ntdb} -entry all -outfmt '%f' | headTo query the taxid and description for an entry in the NT database for an individual GI accession, use:
$ blastdbcmd -db ${ntdb} -entry ${gi_accession} -outfmt '%T %t' -target_onlyFor a batch of GI accessions, use:
$ blastdbcmd -db ${ntdb} -entry_batch ${gi_accession_file} -outfmt '%T %t' -target_only
Blat
Blat is a BLAST-like tool developed by UCSC. It's often much faster than BLAST. Example command line blat syntax:$ blat ref.fa query.fa output.pslThe output is in UCSC's psl format.
Get blast style output:
$ blat ref.fa query.fa output.psl -out=blastGet tabular blast style output:
$ blat ref.fa query.fa output.psl -out=blast8
UCSC Genome Browser
UCSC Genome Browser is a very popular and rightly-famous tool for visualizing reference genomes (i.e., assemblies). You can zoom in and out and overlay the reference genome with all sorts of annotation. It was written by Jim Kent.SRA Toolkit
If you use NCBI's Sequence Read Archive (SRA), you can download raw sequencing data as files with an .sra extension. To convert these into fastq, download the SRA toolkit and run its fastq-dump:$ fastq-dump myfile.sra --gzip --split-files > log.o 2> log.e(the flags output your file in compressed form as well as one file per mate.)
IGV
Integrative Genomics Viewer (IGV) is a highly regarded and widely used tool for visualizing, among other things, reads mapped to reference. For example, you can load bam files into it and view the coverage depth of reads on the reference at any particular window on the genome.If you have a list of variants which you believe are real, it's a good idea to look at them in IGV to make sure the supporting reads don't look funny. Here's a screenshot of two bam files mapped to the human genome (hg19) displayed in IGV. At the position of the cursor, the reference is a C but some of the reads are calling a T:

Bedtools
Bedtools is a suite of utilities for so-called genomic arithmetic—e.g., finding intersections between features in annotated sets of sequence data. One thing I hate about it: Bedtools follows the incredibly stupid and insanely confusing UCSC genome browser convention of representing coordinates with zero-based counting where the end coordinate is not included. Here's an example of this nonsense, using Bedtools' getfasta command, which "extracts sequences from a FASTA file for each of the intervals defined in a BED file":$ cat junk.fa >junk ACTGGAACCT
$ cat junk.bed junk 2 5
$ bedtools getfasta -fi junk.fa -bed junk.bed -fo out.fa $ cat out.fa >junk:2-5 TGG
GFF Format
GFF, or Generic Feature Format, is the standard format for describing or annotating sequence features. Read about it here.FastQC
FastQC is a quality control tool for high throughput sequence data. It provides nice graphical displays of quality metrics for your fastq files and alerts you if, for example, your reads are low quality or particular k-mers are overrepresented. For this reason, it's a good idea to run this on your fastqs before you map them to nip any potential problems in the bud. Here's the simple syntax:$ fastqc -o outputdir -f fastq myfile.fastqHere's one of the many graphs it produces, average per base sequence quality:

Cutadapt
Cutadapt is a program for cutting junk adapters out of sequencing data.Trinity
Trinity is a de novo RNA seq assembler—that is to say, you input reads and it outputs a transcriptome. One of the best things you can say about Trinity is that, unlike many other tools in the field, it's super easy to use. As with all assemblers, it takes a significant amount of time and memory for large data sets.Command Line Bioformatics: Pulling Out Subsequences from a Fasta File by Range
Here's how to pull out subsequences from a fasta file, such as the reference genome hg19, by range. The only trick is that you have to index your fasta with samtools faidx first. Here, we're going to look at the sequences flanking three random positions in chromosome one, 100000, 2349483, 534900, by 10 base pairs on either side (check out the cool combination of awk and xargs!):$ echo -e "100000\n2349483\n534900" | awk '{print "chr1:"$1-10"-"$1+10}' | xargs -i echo "samtools faidx /path/hg19.fa {}" | sh >chr1:99990-100010 tcatcacactcactaagcaca >chr1:2349473-2349493 TGTCCTGCCCCCAAACGCGAA >chr1:534890-534910 gttggctaggctggtctcgaa
Command Line Bioformatics: Converting Fastq to Fasta
$ cat myfile.fastq | awk '{if (NR%4==1) {print ">"substr($1,2)} else if (NR%4==2) {print $0}}'
Command Line Bioformatics: Putting the Sequence Part of a Fasta File on a Single Line
$ cat myfile.fasta | awk 'BEGIN{f=0}{if ($0~/^>/) {if (f) printf "\n"; print $0; f=1} else printf $0}END{printf "\n"}'(from my co-worker, Vladimir)
Command Line Bioformatics: Reverse Complementing a Sequence
$ echo TTACGGT | rev | tr ATGC TACGCourtesy of this blog.
Command Line Bioformatics: Computing the Sequence Composition (i.e., how many A,C,G,Ts) in a Fasta File
One way:$ for i in A C T G; do cat myfile.fasta | grep -v ">" | grep -o ${i} | wc -l; doneA better way from my co-worker, Albert:
$ cat myfile.fasta | grep -v ">" | fold -w 1 | sort | uniq -c
Command Line Bioformatics: Generating Random Sequence Data
Via my co-worker, Albert:$ cat /dev/urandom | tr -dc 'actg' | fold -w 50 | head