
Bioinformatics Tools

1 Sequence Alignment

1.1 ClustalW

Useful for aligning a small number of sequences, such as homologs from multiple species. Less useful for multiple alignment of a large number of sequences, for instance from ChIP seq.

1.2 SeaView

Graphical visualisation of multiple sequence alignments. Can perform alignment of sequences, or can be used to visualise the results of alignments from other algorithms. Alignments can be adjusted manually, which can be useful when incorporating prior knowledge into the alignment.

1.3 PAGAN

A phylogeny-aware general purpose aligner which can be more useful for clustering when there are likely to be large gaps in some sequences compared to others (for instance if comparing gene sequences from different species with non-homologous exons).

1.4 NCBI BLAST

Suite of tools for searching a protein or nucleotide database to find potential matches or homologs, or for aligning multiple sequences. This tool is particularly useful when you have a query sequence and want to find either known homologs amongst different species, or to find additional proteins with similar structural domains.

1.5 HMMer

HMMer is similar to BLAST, but uses hidden Markov models, a stronger underlying mathematical model that improves accuracy and the ability to detect remote homologs. This suite of tools can be used to search for homologs in protein and transcript databases in a similar way to BLAST.

1.6 FASTA

FASTA is the original sequence similarity search tool, and can be used to search for a query protein amongst known protein or transcript databases such as UniProt. It uses a heuristic approach, which has been improved over the years, but BLAST is probably a better approach to use in general.

1.7 MUSCLE

MUSCLE is a multiple sequence comparison tool similar to ClustalW that may be slightly faster with more accurate results. It can be used to perform gapped alignment of a set of sequences in any supported format (e.g. fasta).

2 Motif and Domain Analysis

2.1 MEME Suite

The MEME suite is a series of tools for motif analysis and consists of the following tools:

2.1.1 MEME

De novo discovery of motifs from a series of related protein or transcript sequences. This method is designed more for comparison of sequences from, say, a handful of transcription factor sites and is not so well suited to large numbers of sequences (e.g. from ChIP seq experiments). This tool is better for the identification of wider motifs. For narrower motifs use DREME.

2.1.2 DREME

Like MEME above, but better for the discovery of narrower motifs.

2.1.3 MAST

MAST is a motif alignment and search tool for the identification of known (or identified) motifs amongst some database of sequences. This can be either a protein or transcript database, or for instance a known motif can be searched within a set of sequences such as from a ChIP seq experiment. A match score is calculated by comparing the position weight matrix of the motif with each sequence.
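As a rough illustration of scoring a sequence against a position weight matrix, here is a minimal Python sketch. It is not MAST's own scoring scheme: the uniform background, the pseudocount and the toy motif counts are all assumptions made for the example, and the function names are hypothetical.

```python
import math

def log_odds_pwm(counts, background=0.25, pseudocount=1.0):
    """Turn per-position base counts into a log-odds position weight matrix."""
    pwm = []
    for column in counts:
        total = sum(column.values()) + 4 * pseudocount
        pwm.append({
            base: math.log2(((column.get(base, 0) + pseudocount) / total) / background)
            for base in "ACGT"
        })
    return pwm

def best_match(pwm, sequence):
    """Slide the PWM along the sequence and return (best_score, offset)."""
    width = len(pwm)
    best = (float("-inf"), -1)
    for offset in range(len(sequence) - width + 1):
        window = sequence[offset:offset + width]
        # Sum the per-position log-odds for the bases in this window
        score = sum(pwm[i][base] for i, base in enumerate(window))
        if score > best[0]:
            best = (score, offset)
    return best

# Toy motif built from four made-up aligned sites: TATA, TATA, TACA, TATT
counts = [
    {"T": 4},
    {"A": 4},
    {"T": 3, "C": 1},
    {"A": 3, "T": 1},
]
pwm = log_odds_pwm(counts)
score, offset = best_match(pwm, "GGCTATAGGC")
print(offset, round(score, 2))
```

The sketch only handles sequences made up of A, C, G and T on one strand; a real search would also scan the reverse complement and convert scores into p-values.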

2.1.4 TOMTOM

TOMTOM is a way to compare de novo motifs to a database of known motifs contained within the MEME database. A list of matching motifs is returned, so this can be used to see if for instance your motif matches the known binding motif of a transcription factor within the database. One of the databases that is queried is the JASPAR database (see resources) which contains a large well-maintained and curated set of known TF binding sites.

2.1.5 GOMO

GOMO takes an input motif and scores all genes based on the binding affinity of the motif at their upstream promoter region. It then performs a gene ontology analysis to see if this motif is associated with the promoter regions of genes involved in any particular function.

2.1.6 MEME-CHIP

MEME-CHIP is a combined motif discovery tool for ChIP seq results. It relies on the fact that the middle 100bp region of each supplied sequence represents the likely binding location within the peak, performs motif discovery across a large number of sequences, and compares the results against known motif databases.

2.2 HOMER Suite

The HOMER (Hypergeometric Optimization of Motif EnRichment) suite of tools is ostensibly a package (like MEME) for de novo identification of enriched motifs from genomic data, but has expanded considerably to provide resources for many aspects of NGS analysis. It has tools for analysis of ChIP-Seq, RNA-Seq, GRO-Seq, Hi-C and BS-Seq data. For instance, for ChIP-seq it has some very useful functions for annotation of peaks (annotatePeaks.pl), motif analysis on called peaks (findMotifsGenome.pl), and differential peak analysis (getDifferentialPeaksReplicates.pl).
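The "hypergeometric" in the name refers to the kind of enrichment test sketched below: given how often a motif occurs across all sequences, how surprising is it to see it this often in the target set? This is a minimal sketch of the general test using only the Python standard library, not HOMER's own code, and the numbers in the example are made up.

```python
from math import comb

def hypergeom_enrichment_p(population, population_hits, sample, sample_hits):
    """P(X >= sample_hits) when `sample` sequences are drawn without
    replacement from `population`, of which `population_hits` contain the motif."""
    upper = min(sample, population_hits)
    total = comb(population, sample)
    tail = sum(
        comb(population_hits, k) * comb(population - population_hits, sample - k)
        for k in range(sample_hits, upper + 1)
    )
    return tail / total

# Hypothetical numbers: 1000 sequences overall, 100 contain the motif,
# and 15 of our 50 target sequences contain it (5 expected by chance).
p = hypergeom_enrichment_p(1000, 100, 50, 15)
print(p)
```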

3 Domain Analysis

3.1 InterProScan

InterPro is a database of protein families and known structural domains from the EBI. InterProScan allows users to scan a query sequence for known structural domains or associated protein families, comparing against proteins in the UniProt Knowledge Database.

3.2 Pfam

Pfam is a database from the EBI containing identified protein families based on multiple sequence alignments and HMM analysis of known sequences. The Pfam website can be used to view protein families, to view the protein domain structure of known proteins, or alternatively to search a query sequence for known domains.

3.3 SMART

SMART is a well-maintained modular database (based on SwissProt, TrEMBL and the like) of annotated and functional protein domains. This tool can be used to view the domain structure of known proteins as well as to identify known domains within novel protein sequences.

3.4 Data Repositories

3.5 Gene Expression Omnibus (GEO)

GEO is the NCBI repository of high throughput data, including sequencing and microarray data. All data maintained within this database is MIAME-compliant, and is synched with projects in ArrayExpress as well. Most experiments contain both raw data (raw chip data for microarrays, or raw read files for sequencing), as well as processed data files (mapped data, called peaks, normalised microarray results, etc).

3.6 ArrayExpress

ArrayExpress is the EBI-maintained repository for high-throughput data. As with GEO, it contains data for microarray experiments and sequencing experiments of all kinds, all of which are MIAME-compliant. Most experiments contain both raw data (raw chip data for microarrays, or raw read files for sequencing), as well as processed data files (mapped data, called peaks, normalised microarray results, etc).

3.7 Sequence Read Archive (SRA)

The SRA is probably the single most useful repository for sequence level data in the world. GEO and ArrayExpress are linked to the SRA, and this is where the raw and mapped data for most sequencing experiments are stored. These data are stored in the SRA’s own archive format, and the SRA Toolkit applications can be used to dump the files into one of a number of useful formats (e.g. fastq-dump and sam-dump).

4 Genome Mapping

4.1 BWA

The Burrows-Wheeler Aligner is one of the most widely used mapping algorithms available, and is particularly useful for mapping low-divergent sequences against a large reference genome. Each sequence is mapped with a mapping quality score, which takes into account gapped alignments and mismatches, and this can be used to define accurate mapping. The output SAM files also describe how each read aligns (including insertions, deletions and clipping) in CIGAR format. The benefits of using this algorithm are its speed and accuracy. One negative is that if a sequence maps to multiple locations, only one mapped locus is returned (obviously with a low mapping score). If you want to see all mapped locations (e.g. if you are interested in regions of low complexity) then Bowtie may be better.
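To make the CIGAR format a little more concrete, here is a minimal Python sketch that parses a CIGAR string and works out how much of the reference and of the read it covers. The operation letters come from the SAM format, where the M operation covers both matching and mismatching bases; the function names and the example alignment are made up for illustration.

```python
import re

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")
REF_OPS = set("MDN=X")    # operations that consume reference bases
QUERY_OPS = set("MIS=X")  # operations that consume read bases

def parse_cigar(cigar):
    """Split a CIGAR string such as '5S20M2I10M' into (length, op) pairs."""
    if not re.fullmatch(r"(\d+[MIDNSHP=X])+", cigar):
        raise ValueError(f"not a valid CIGAR string: {cigar!r}")
    return [(int(n), op) for n, op in CIGAR_RE.findall(cigar)]

def reference_span(pos, cigar):
    """Return the 1-based inclusive (start, end) covered on the reference."""
    length = sum(n for n, op in parse_cigar(cigar) if op in REF_OPS)
    return pos, pos + length - 1

def query_length(cigar):
    """Number of read bases described by the CIGAR string."""
    return sum(n for n, op in parse_cigar(cigar) if op in QUERY_OPS)

# Hypothetical 50 bp read mapped at position 100 with one insertion,
# one deletion and five soft-clipped bases at its start.
print(reference_span(100, "5S20M2I10M3D13M"))
print(query_length("5S20M2I10M3D13M"))
```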

4.2 Bowtie

Bowtie is one of the most commonly used mapping algorithms, and uses indexing using Burrows-Wheeler to produce fast and memory efficient alignments. Bowtie can be used to provide multiple mapping coordinates for reads from regions of low complexity and repetitive sequences in the genome. Bowtie is well maintained (with version 2 currently in circulation).

4.3 MAQ

MAQ is an older aligner that does not seem to be maintained any longer. It has been superseded by BWA and Bowtie and should probably not be used any longer.

4.4 TopHat

TopHat is a splice-junction aware mapper for the alignment of RNA Seq data to the genome. It is based on the Bowtie aligner above, but analyses mapped data to identify splice junctions between exons. Alternatively a list of known exon splice junctions can be provided which can be used to guide the alignment across splice junctions. This is one of the most commonly used tools for the mapping of RNA seq data.

4.5 STAR

Like TopHat, STAR is a splice aware aligner that is used to map RNA Seq reads to the genome. However, STAR uses a much faster algorithm than TopHat and tends to produce slightly better results.

4.6 NovoAlign

Novoalign is a licensed mapping software tool (although a free version is available). For multi-mapping reads, you can choose to return all, none, or just one randomly selected mapping in your output.

4.7 SHRiMP

SHRiMP is another older free mapping algorithm which no longer has active support from the authors. It is based on seeding together with the Smith-Waterman algorithm, but is surprisingly fast.
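For reference, the Smith-Waterman local alignment score itself can be sketched in a few lines of Python. This shows only the dynamic programming step, not SHRiMP's seeding or any of its optimisations, and the scoring values (match 2, mismatch -1, linear gap -2) are arbitrary assumptions.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between a and b with a linear gap penalty."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A cell never drops below zero, which is what makes the alignment local
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best

print(smith_waterman_score("TTTACGTTT", "GGACGGG"))
```

The full matrix costs time proportional to the product of the two sequence lengths, which is why mappers only run it around seed hits rather than across the whole genome.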

4.8 ELAND

ELAND is the Illumina pipeline software that comes attached to each of the Illumina sequencing machines. It is their proprietary software and is not easy to get hold of, although as indirect customers we could probably obtain a copy if necessary.

5 De Novo Assembly

5.1 SOAP

SOAP is a suite of tools for analysis of short oligonucleotide sequences, including for de novo assembly, alignment against a target genome, splice mapping for RNA seq data, and SNP and indel calling for resequencing experiments. The website has not been updated since 2010, so this may not be maintained any more and may be an older tool that is not of any use any more.

5.2 ABySS

ABySS is a de novo assembler that is paired end aware and well suited for short reads. It is well maintained, with the latest version (1.5.2) out in July.

5.3 Trinity

Trinity is a complete package for assembly of transcripts from Illumina RNA Seq data, developed at the Broad Institute and the Hebrew University of Jerusalem. There are three main stages to the assembly: 1) The Inchworm program performs the read assembly, generating either full length transcripts, or unique portions for alternatively spliced transcripts; 2) The Chrysalis program then clusters these unique contigs into “gene” clusters and constructs de Bruijn graphs for each cluster based on member assignment; 3) The Butterfly program then deconvolutes these graphs, generating the full length transcripts for the alternatively spliced transcripts and identifying paralogous genes. It can be used in conjunction with the Trinotate package, which uses a number of standard programs for annotation of the resulting transcripts.
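The de Bruijn graph idea can be sketched in Python as follows. This is only a toy illustration of the data structure, nothing like Trinity's actual implementation: the reads are made up and error free, and the walk simply stops at the first branch or cycle.

```python
from collections import defaultdict

def de_bruijn(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph

def walk_unbranched(graph, start):
    """Extend a contig from `start` while each node has exactly one successor."""
    contig, node, seen = start, start, {start}
    while len(graph.get(node, ())) == 1:
        (node,) = graph[node]
        if node in seen:  # stop rather than loop forever on a cycle
            break
        seen.add(node)
        contig += node[-1]
    return contig

# Three overlapping toy reads from the same hypothetical transcript
reads = ["ATGGCG", "GGCGTA", "CGTACC"]
graph = de_bruijn(reads, k=4)
contig = walk_unbranched(graph, "ATG")
print(contig)
```

Alternative splicing shows up in such a graph as a node with more than one successor, which is the kind of branch that has to be resolved into separate transcripts.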

6 Read Quantification

6.1 HTseq

HTseq is a python based tool for quantifying reads amongst specific genomic elements (typically used to count reads that map to individual genes in RNA Seq data analysis). It relies on mapped data from one of the (likely splice-aware) mapping algorithms mentioned above. In particular, it allows you to specify how best to deal with reads that overlap multiple features (e.g. due to multiple splicing isoforms, or due to reads mapping to genes that overlap on opposite strands).
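The counting logic can be sketched roughly as below. This is loosely modelled on the idea behind a "union" style counting mode rather than taken from HTseq itself: each read is treated as a single half-open interval (so spliced reads and strand are ignored), and the gene models and reads are made up.

```python
from collections import Counter

def count_reads(reads, genes):
    """Assign each read to a gene if it overlaps exons of exactly one gene."""
    counts = Counter()
    for chrom, start, end in reads:
        hits = {
            gene
            for gene, exons in genes.items()
            for g_chrom, g_start, g_end in exons
            if g_chrom == chrom and start < g_end and g_start < end
        }
        if not hits:
            counts["__no_feature"] += 1
        elif len(hits) == 1:
            counts[hits.pop()] += 1
        else:
            counts["__ambiguous"] += 1  # overlaps more than one gene
    return counts

# Hypothetical gene models: lists of (chromosome, start, end) exons
genes = {
    "geneA": [("chr1", 100, 200), ("chr1", 300, 400)],
    "geneB": [("chr1", 380, 500)],
}
reads = [
    ("chr1", 120, 170),  # geneA
    ("chr1", 310, 360),  # geneA
    ("chr1", 385, 395),  # overlaps geneA and geneB
    ("chr1", 450, 490),  # geneB
    ("chr1", 600, 650),  # no gene
]
counts = count_reads(reads, genes)
print(dict(counts))
```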

6.2 Kallisto

Kallisto is an alignment-free read quantification algorithm, used to quantify the abundance of transcripts from (say) RNA-Seq data. It uses an index of the target transcriptome and uses a pseudoalignment method which can quantify 30 million human reads in less than 3 minutes. This has been shown to be comparable to alignment-based quantification methods (e.g. using Tophat2 or STAR), and is robust to read-level errors and so often out-performs these, whilst also being orders of magnitude faster.

7 Gene Ontology (GO) and Network Analysis

7.1 DAVID

DAVID is an online tool for performing gene ontology analysis of a set of target genes. As well as comparing gene lists against gene ontology classes, it is also able to compare against many other annotated gene lists, including pathways from KEGG and Panther, disease genes, literature searches, etc. It is a fantastic resource for looking for enriched functions within a given gene list.

7.2 GSEA

GSEA is the Broad Institute’s visualisation tool for looking for enrichment of functional annotations in a given gene set. You provide the tool with a molecular profile data set (e.g. from RNA seq or microarray) which allows you to rank all genes in the genome based on some measure of expression (or enrichment above a control). This ranked list is compared against a database of gene sets (including curated sets such as gene ontology and disease data sets, as well as any sets that may be of interest specifically for your current experiment, such as results from a different experiment), and a running score is calculated along the ranked list based on whether or not each gene is present in the gene set. The idea is that if there is an enrichment of genes of a certain function in your data set, then you will see this as a peak in the graph of the score towards one end of the ranked list (it will peak near the start, and then tail off for instance). If however there is no enrichment (the genes in the gene set are spread across the entire continuum of expression scores) then no such peak will be seen.
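The running score can be sketched in Python as follows. This is the simple unweighted running sum (every gene in the set counts equally), which is an assumption made to keep the example short; GSEA itself can weight each step by how strongly the gene is ranked. The gene names are made up.

```python
def enrichment_score(ranked_genes, gene_set):
    """Largest deviation from zero of an unweighted running sum over the ranked list."""
    hits = [gene in gene_set for gene in ranked_genes]
    n_hits = sum(hits)
    n_miss = len(hits) - n_hits
    if n_hits == 0 or n_miss == 0:
        raise ValueError("gene set must contain some, but not all, of the ranked genes")
    running, best = 0.0, 0.0
    for is_hit in hits:
        # Step up for a gene in the set, down for a gene outside it
        running += 1 / n_hits if is_hit else -1 / n_miss
        if abs(running) > abs(best):
            best = running
    return best

ranked = ["g1", "g2", "g3", "g4", "g5", "g6", "g7", "g8", "g9", "g10"]
top_set = {"g1", "g2", "g4"}      # clustered at the top of the ranking
bottom_set = {"g8", "g9", "g10"}  # clustered at the bottom
print(enrichment_score(ranked, top_set))
print(enrichment_score(ranked, bottom_set))
```

A set clustered at the top of the list gives a large positive score and one clustered at the bottom gives a large negative score. Significance would then be judged by repeating the calculation on permuted data, which is not shown here.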

7.3 GO

The Gene Ontology consortium maintains a fully curated database of gene ontologies for the vast majority of identified genes amongst a whole host of organisms. Gene ontology classes are subdivided into 3 groups: biological process, molecular function, and cellular component. The gene ontology classes also follow a hierarchical structure, with increased specificity as you look at more fine-grained annotations. The GO website allows you to search for enrichment of certain GO terms within genes in a dataset that you input, such as from a microarray experiment.

7.4 GREAT

GREAT is a software tool for predicting the function of cis-regulatory regions from a given set of genomic regions. Essentially it takes a set of genomic loci and analyses the annotation of any nearby genes to see if the specified regions are close to genes of a specific function. For instance, you might have identified a set of ncRNAs that are specifically involved in cis-regulatory activity of target genes of a specific class.

7.5 PANTHER

The PANTHER (Protein ANalysis THrough Evolutionary Relationships) database is another database of gene/protein family and subfamily functional classes, that actually uses the GO terms from the gene ontology database. It uses evolutionarily conserved elements of novel genes/proteins to identify functions based on similar families present in the database. It is ultimately used in a similar way to the GO database.

7.6 blast2GO

blast2GO is a bioinformatics package for the identification of GO classes and other functional information from novel nucleotide level data (e.g. from de novo sequencing). It utilises the BLAST algorithm to identify genes with similar sequence structure and known functional annotation that can be inferred across to the novel sequence.

8 Quality Control

8.1 FastQC

FastQC from the Babraham Institute is a very easy to use tool for quality control of fastq files, such as those used in sequencing. This can be a fantastic first pass analysis for sequencing data to check that no obvious problems exist with the sequencing data prior to mapping. A variety of figures are generated, including the base-level base calling scores across each read which can show if there are any systematic regions that require trimming (e.g. the 3’ end of the reads where quality can fade), the per-base sequence composition which can help to identify the presence of contaminants (such as adapter sequences), and the levels of sequence duplication which might indicate significant PCR bias. It is usually a very good idea to run FastQC on the sequencing files as a matter of course prior to mapping to see what specific issues you should expect to encounter in the processing.
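As a rough idea of what the per-base quality plot is built from, here is a minimal Python sketch that computes the mean base calling score at each read position. It assumes plain four-line FASTQ records with Phred+33 encoded qualities, and the two reads are made up. FastQC itself reports far more than this.

```python
import io

def mean_quality_per_position(handle, offset=33):
    """Mean Phred score at each read position of a four-line FASTQ file."""
    totals, counts = [], []
    for line_number, line in enumerate(handle):
        if line_number % 4 != 3:  # only every fourth line holds qualities
            continue
        for i, char in enumerate(line.rstrip("\n")):
            if i == len(totals):
                totals.append(0)
                counts.append(0)
            totals[i] += ord(char) - offset
            counts[i] += 1
    return [total / count for total, count in zip(totals, counts)]

# Two toy reads; quality fades at the 3' end of the second one
fastq = io.StringIO(
    "@read1\nACGT\n+\nIIII\n"
    "@read2\nACGT\n+\nII5#\n"
)
means = mean_quality_per_position(fastq)
print(means)
```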

8.2 MultiQC

MultiQC is a tool that I have recently started to use which is able to create interactive reports aggregating the results from various QC programs into one simple to navigate web page. This provides a single report for each sample that can easily be perused to identify issues with the data. It is set up to deal with output from many common analysis tools, including cutadapt, fastQC, Trimmomatic, BWA, Tophat, Bowtie, STAR, etc.

9 Peak Calling

9.1 FindPeaks

FindPeaks is part of the Vancouver short read analysis package, and is a peak calling algorithm designed to identify regions of enrichment above control and background levels, in a set of sequencing data (e.g. ChIP seq). It works well at identifying tighter peaks but is not well designed for the identification of extended enriched regions (e.g. H3K36me3).

9.2 MACS

MACS is a model-based peak-calling algorithm that uses the shift between the ChIP fragments on the positive and negative strands to pinpoint the peak centre. A Poisson model is used to model the background, and enriched regions above the background or above a control sample are estimated. MACS is useful for tight and medium width peaks, but is not so good at identifying wider peaks such as for gene-wide histone marks.
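The Poisson part of this can be sketched in Python as a simple tail probability: given an expected background count for a window, how likely is a count at least as large as the one observed? This is only the basic test and not MACS's implementation, which estimates the background rate much more carefully. The counts in the example are made up.

```python
import math

def poisson_upper_tail(observed, expected):
    """P(X >= observed) for X ~ Poisson(expected)."""
    if observed <= 0:
        return 1.0
    term = math.exp(-expected)  # P(X = 0)
    cdf = 0.0
    for k in range(observed):
        cdf += term              # add P(X = k)
        term *= expected / (k + 1)
    # Subtracting from 1 loses precision for very small tails,
    # so treat tiny values as "effectively zero" rather than exact.
    return max(0.0, 1.0 - cdf)

# Hypothetical window: 5 reads expected from background, 30 observed
p = poisson_upper_tail(30, 5.0)
print(p)
```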

9.3 HPeak

HPeak uses hidden Markov models to identify regions of the genome with significantly more mapped reads than expected by chance. It was released in 2010 and version 2 is now out, although I do not think that it is maintained and updated any more.

9.4 PeakSeq

PeakSeq is another recent peak calling algorithm that identifies local thresholds for peak calling based on a Poisson approximation. The ChIPseq data are first normalised to a control sample, and then enriched regions are identified based on this threshold.

9.5 QuEST

Quantitative enrichment of sequence tags (QuEST) is an older peak calling algorithm that works in a similar way to MACS. It is most suited for identifying clear tight peaks in ChIP seq data.

9.6 HOMER

The HOMER suite of tools is very useful for performing various aspects of ChIPseq analyses, such as annotation of called peaks and motif finding. Their peak calling tool is designed specifically for ChIP style sequencing data, and relies on defining a constant peak length for all peaks. This is designed to keep things simple for the motif analysis tool, but means that this algorithm is not well designed for identifying longer enriched regions.

9.7 PeakRanger

PeakRanger is a versatile peak calling algorithm that is useful for both narrow and broader peaks in ChIP seq data.

10 Sequence Manipulation

10.1 BEDTools

BEDTools is a suite of Unix tools for dealing with genomic interval files in formats such as bed, bedgraph and wig. They can be used for intersecting, merging, counting, complementing and shuffling genomic intervals, as well as for converting between different file formats such as BAM, BED, GFF/GTF and VCF.
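To show what a merge operation on interval files does, here is a minimal pure Python sketch. It is an illustration of the idea rather than a replacement for the real tools: intervals are (chromosome, start, end) tuples with BED-style half-open coordinates, and merging intervals that merely touch is a choice made for this sketch.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching intervals on the same chromosome."""
    merged = []
    for chrom, start, end in sorted(intervals):
        last = merged[-1] if merged else None
        if last and last[0] == chrom and start <= last[2]:
            last[2] = max(last[2], end)  # extend the previous interval
        else:
            merged.append([chrom, start, end])
    return [tuple(interval) for interval in merged]

# Made-up peaks, deliberately unsorted
peaks = [
    ("chr1", 300, 400),
    ("chr1", 100, 200),
    ("chr2", 100, 200),
    ("chr1", 150, 250),
]
print(merge_intervals(peaks))
```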

10.2 SAMtools

SAMtools is a suite of Unix tools for dealing with SAM and BAM formatted files following mapping of genome wide sequencing data. This includes reading, writing, editing, indexing and viewing of these files, as well as filtering based on different fields in the file (mapping quality etc.).
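Filtering on mapping quality comes down to looking at two columns of each SAM record, which the Python sketch below does directly on text lines. In practice you would use the real tools on BAM files; this only shows which fields are involved (column 2 is the bitwise FLAG, column 5 is MAPQ), and the records are made up.

```python
def filter_by_mapq(sam_lines, min_mapq):
    """Yield header lines plus mapped records with MAPQ >= min_mapq."""
    for line in sam_lines:
        if line.startswith("@"):  # header lines pass straight through
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        flag, mapq = int(fields[1]), int(fields[4])
        if flag & 0x4:            # 0x4 marks an unmapped read
            continue
        if mapq >= min_mapq:
            yield line

# Toy SAM content: one header line and three reads
sam = [
    "@SQ\tSN:chr1\tLN:1000\n",
    "r1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII\n",
    "r2\t0\tchr1\t200\t3\t4M\t*\t0\t0\tACGT\tIIII\n",
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII\n",
]
kept = list(filter_by_mapq(sam, min_mapq=30))
print("".join(kept), end="")
```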

11 Adapter Trimming

11.1 Cutadapt

Cutadapt is a tool for trimming adapter sequences from the 3’ and 5’ ends of called reads from sequencing studies. It can be used to trim off known sequences as well as to trim reads to a specified length. Older versions were not useful for paired end reads as they did not maintain synchronisation between the mate pair files, although more recent versions can process both files of a pair together.
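The basic idea of 3’ adapter trimming can be sketched in Python as below. This is a deliberately simple exact-match version and not Cutadapt's algorithm, which also tolerates mismatches and indels. The minimum overlap of 3 bases is an arbitrary assumption, and the read is made up.

```python
def trim_3prime_adapter(read, adapter, min_overlap=3):
    """Remove a 3' adapter, including one that runs off the end of the read."""
    for start in range(len(read) - min_overlap + 1):
        # Compare as much of the adapter as fits between `start` and the read end
        overlap = min(len(adapter), len(read) - start)
        if read[start:start + overlap] == adapter[:overlap]:
            return read[:start]
    return read

adapter = "AGATCGGAAGAGC"
read = "ACGTACGTTTAGATCGGAAG"  # insert followed by the first 10 adapter bases
print(trim_3prime_adapter(read, adapter))
```

For paired end data, whatever is done to one read has to be mirrored for its mate (for example, dropping both reads if either becomes too short), which is the synchronisation issue mentioned above.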

11.2 Trimmomatic

Trimmomatic is another tool for trimming sequences such as adapter sequences from called reads, that also allows you to crop reads to remove poor quality bases and maintains synchronisation between mate pair files.

11.3 Trim Galore

Trim galore is a wrapper script for cutadapt that allows users to maintain synchronisation between mate pair files.

11.4 Reaper

Reaper is part of the Kraken package from EMBL/EBI, and is a fast and memory efficient method for demultiplexing, trimming and filtering short read sequencing data. It can be used for a variety of purposes, including barcode filtering, quality checking, and the like.

12 Pathway Analysis

12.1 Ingenuity

Ingenuity pathway analysis is a platform for the analysis of pathways within sequencing or microarray results based on comparisons with the Ingenuity knowledge base. In particular, the pathway analysis tool provides an interactive approach to the analysis of complex ’omics data, by using known interactions between genes contained within the knowledgebase to generate connections between genes from your analyses.

12.2 KEGG

Like the gene ontology database, the Kyoto Encyclopedia of Genes and Genomes is a database resource of various aspects of biological systems such as pathways, functional hierarchies, protein modules, reactome associations, etc. KEGG pathway maps provide interaction information between genes in the given pathway, and expression data can be overlaid to understand the interaction effects in your data.

12.3 Cytoscape

Cytoscape is an open source software platform for the visualisation of complex networks, such as those resulting from interaction datasets, gene ontology classes, etc. It can be used to look for potential interaction networks by combining data from multiple genome-wide experiments.

12.4 STRING

The STRING database provides protein-protein interaction networks identified using a variety of different methods (computer predictions, text mining from published reports and other databases, high throughput experiments, conserved co-expression, etc). This includes both direct interactions and functional interactions. This database can be very useful to understand protein interaction networks and to identify hub genes in gene expression results.

13 Genome Browsers

13.1 University of California, Santa Cruz Genome Browser (UCSC)

One of the most well-known genome browsers available. Describing everything that this browser can do is beyond the scope of this brief outline, but the UCSC browser allows you to represent genomic data from various sources in parallel tracks on the internet browser allowing you to identify correlations between different factors throughout the genome. You can upload your own data in a variety of formats (e.g. mapped RNA seq data in wiggle format, raw read data in BAM format, called peaks from ChIP Seq data in BED format, etc.), and can also access a huge amount of data available from UCSC’s own servers (e.g. instances of the vast ENCODE data). This is a great way of visualising and exploring your data and should be explored in more detail to see what you can do.

13.2 Ensembl Genome Browser

The Ensembl Genome Browser is an alternative web based genome browser to the UCSC browser offered by Ensembl. It is important to be aware that, whilst both UCSC and Ensembl use the same assemblies, there are many key differences that make them fairly incompatible. One important difference is that chromosomes are labelled differently (e.g. chr1, chr2, chr3 for UCSC but 1, 2, 3 for Ensembl). Similarly the genome builds are named differently. It generally works out simpler to pick one of these specifications and stick to it. Additional data (your own files or published data such as that from ENCODE) can be added using the track hubs feature, which allows you to specify where to find these data and how to represent them in your browser session.
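If you do need to move between the two naming styles, the conversion for the main chromosomes is simple enough to sketch in Python. The function names are made up, and this only covers the standard chromosomes plus the mitochondrial genome (chrM in UCSC style, MT in Ensembl style). Unplaced scaffolds and alternate haplotypes are named quite differently and would need a proper lookup table.

```python
def ucsc_to_ensembl(name):
    """Convert e.g. 'chr1' -> '1' and 'chrM' -> 'MT'."""
    if name == "chrM":
        return "MT"
    return name[3:] if name.startswith("chr") else name

def ensembl_to_ucsc(name):
    """Convert e.g. '1' -> 'chr1' and 'MT' -> 'chrM'."""
    if name == "MT":
        return "chrM"
    return name if name.startswith("chr") else "chr" + name

for chrom in ["chr1", "chrX", "chrM"]:
    print(chrom, "->", ucsc_to_ensembl(chrom))
```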

13.3 Integrative Genomics Viewer (IGV)

The IGV browser from the Broad Institute is an alternative to the UCSC browser. Unlike the UCSC browser, which accesses a centrally located database through a web browser, the IGV runs locally and so does not suffer from the lag and slow-down often seen with the UCSC browser (particularly during high traffic times of the day). The IGV browser has some very nice personalisation and formatting options not present in UCSC (e.g. being able to easily resize all of your tracks). At first glance, it is a little more bare bones than UCSC as it does not have the same amount of data available at your fingertips, but these annotated published data sets can be added using well annotated track hubs.
