Basic Local Alignment Search Tool (BLAST)

66 %
34 %
Information about Basic Local Alignment Search Tool (BLAST)
Software

Published on June 27, 2014

Author: wijesingheasirisuranga

Source: slideshare.net

Basic Local Alignment Search Tool (BLAST) by Stephen F. Altschul et al J. Mol. Bio 1990 W.O.K.A.S Wijesinghe, S. D. L Gunawardena, A. A. M Athukorala

CONTENTS ConclusionResults MethodsIntroduction » » » » » Questions & Answers »

INTRODUCTION

WHY COMPARE ONE SEQUENCE TO ANOTHER? Introduction

• Function of a newly sequenced gene or a protein can be predicted by discovering it’s homology to a known gene or a protein. • A major task of Bioinformatics is to find homologous sequence in a database of sequences • Databases of DNA and amino acid sequences continue to grow in size. • There are number of software tools based on many algorithms. Introduction

EVOLUTION OF SIMILARITY SEARCH ALGORITHMS Introduction

Needleman-Wunsch Algorithm • Dynamic programming algorithm • Assign scores to insertions ,deletions and replacements. • Compute an alignment of two sequences with least mutations. • Measurement of Similarity • Because of computational requirements impractical for searching a large database without a supercomputer Introduction

Smith Waterman • Rapid Heuristic Algorithm • Allows large databases to be searched on common computers FastP Algorithm • David J. Lipman and William R. Pearson in 1985 • First find locally similar regions between 2 sequences based on identities but not gaps. • Then rescores those regions using a similarity matrix. • Quite popular Introduction

NEED FOR A FAST ALIGNMENT METHOD Introduction

BLAST(Basic Local Alignment Search Tool) Algorithm • A new approach to rapid sequence comparison. • It directly approximate alignments that optimize a measure of local similarity called the Maximal Segment Pair(MSP) score. • Recent mathematical results on MSP scores allows performance analysis of this method • Generated alignment has a statistical significance. • Simple & a robust algorithm Introduction

Can be applied to various contexts • DNA sequences • Protein sequences • Motif searches • Gene identification searches • Analysis of multiple similarity regions in long DNA sequences Characteristic features • Flexibility & Tractability. • Very much faster than existing sequence comparison tools of comparable sensitivity. Introduction This research paper describes all the methods & implementations of BLAST algorithm.

METHODS

THE MSP MEASURE Discussion on Maximal Segment Pairs METHODS

TWO TYPES OF SIMILARITY MEASURES Methods Global • Optimize the overall alignment of two sequences. • Includes large stretches of low similarity Local • Seek only relatively conserved subsequences • Single comparison may yield several distinct subsequence alignments

• Local similarity measure are preferred for database searches where cDNAs can be compared with partially sequence genes. • Many similarity measures including the one they describe begins with a matrix of similarity score for all possible pairs of residues B L A S T Methods

Scoring Alignments • Scoring matrix: 4 x 4 matrix (DNA) or 20 x 20 matrix (protein) • Identities & conservative replacements >>> Positive Scores • Unlikely Replacements >> Negative scores • Amino acid sequences : “PAM” matrix BLOSUM • DNA sequences : match = +5 mismatch = -4 • Sequence Segment – Contiguous stretch of residues of any length • Similarity score for segments is the sum of similarity values for each pair of aligned residues Methods

Maximal Segment Pair (MSP) • Given these rules they have defined a Maximal Segment Pair (MSP) which is the highest scoring pair of identical length segments chosen from 2 sequences. • The similarity score of an MSP is called the MSP score which is calculated by BLAST. • With long sequences the search for the MSP score becomes computationally demanding. • Therefore BLAST searches for locally maximal segment pairs – Score cannot be improved by either extending or shortening segments. Methods

RAPID APPROXIMATION OF MSP SCORES METHODS

• Goal is to report those database sequences that have MSP score above some cutoff score S. • Statistically the highest MSP score S can be estimated at which “chance similarities” are likely to appear. • BLAST minimizes time spent on database sequences whose similarity with the query has little chance of exceeding this score. Methods

• Let a word pair be a segment pair with a fixed length w. • Main strategy: seek only segment pairs (one from database, one query) that contain a word pair with score at least T. • Such hit will be extended until it exceeds the cutoff score S and those hits will be the final output of BLAST. • Lower T => Fewer false negatives • Lower T => More pairs to analyze Methods

IMPLEMENTATION Key Steps of BLAST Algorithm METHODS

• BLAST finds locally maximal segment pairs that exceeds a particular cutoff. • Detailed annotation of Three Algorithmic steps. • Compile a list of high-scoring words. • Scanning the DB for hits. • Extending hits that meet certain s coring criteria (Extend only word pairs with a score of at least T to determine if it has a segment pair of score at least S). METHODS B L A S T

COMPILING OF HIGH SCORING WORDS • Obtain the list of words in the target sequence (k-mers), that give a score of T or higher when aligned with the query sequence. • Assume that the query sequence is P Q G E F G • We have the following four 3-mers (recall, the number of k-mers is always N-k+1): P Q G Q G E G E F E F G METHODS Each of these 3-mers are then scored against each and every one of the k-mers in each of the target sequence. For long sequences, this could well include all 8000 possible k-mers.

• So, of all pairwise scorings for P Q G (using the BLOSUM- 62 matrix), we can find the following high scoring ones: • P Q G (of course, this is a perfect match) score of 7+5+6 = 18 • P E G score of 7+2+6 = 15 • P Q A score of 7+5+0 = 12 METHODS

SCANNING THE DB FOR HITS • Scan the database for hits with the compiled list of words obtained in previous step. • How efficiently search a long sequence for multiple occurrences of short sequences. • BLAST has two approaches • Indexing approach • Finite state machine METHODS

INDEXING APPROACH – EXAMPLE 1 • Build a lookup table of size |Σ|w for all w-length words in DB. METHODS

INDEXING APPROACH – EXAMPLE 2 • Let w=3. For amino acids, the number of words is 203. • Map a word to an integer between 1 and 203. • Thus a word has an index into an array. • Each index points to a list of matches of the word in the query sequence. • As we scan the database, each database word immediately leads to the hits in the query sequence. METHODS

EXTENDING HITS • Once the hits are located both in the query and the target sequence, extend the hits to form high scoring segment pairs. • Find the highest scoring segment (the maximal segment pair) or those whose score exceeds (another user set) threshold S. • When manage to find a hit (a match between a “word” and a database entry), extend the hit in either direction. • Keep track of the score (use a scoring matrix). METHODS

EXTENDING HITS – EXAMPLE 1 • Extend each seed on either side until the aggregate alignment score falls below a threshold. • Un-gapped: Extend by only either matches or mismatches. • Gapped: Extend by matches, mismatches or a limited number of insertion/deletion gaps. METHODS

EXTENDING HITS – EXAMPLE 2 METHODS

RESULTS

Evaluating Statistical Significance RESULTS

• Finally, evaluate the statistical significance of the alignments / scores that exceed the threshold. • BLAST statistical significance of MSP scores can be evaluated by following factors. • Performance of BLAST with random sequences • Performance of BLAST with homologous sequences • Performance comparing long DNA sequences RESULTS B L A S T

Performance of BLAST with random sequences RESULTS

• When two random sequences of length m and n compared, the probability of finding at least one HSP “by chance” is: • Hence, the probability of finding exactly x HSPs with a score ≥ S is given by: RESULTS E eXPXPXP   1)0(1)1(1)1( ! )( x E exXP x E  B L A S T

• Where E can be defined by according to the Karlin-Altschul equation. • E(HSPs with score) ≥ S, also called the E-value: RESULTS

• The probability of finding c or more distinct segment pairs, all with a score of at least S, is given by the formula: • Utilizing this formula can be detected two sequences that share distinct regions of similarity as significantly related. RESULTS      1 0 ! 1)(1)( c i i E i E ecXPcXP B L A S T

The choice of word length & threshold parameters Necessary & Sufficient Adjustments to be Done in Terms of Word Length & Threshold Value RESULTS

Time required to execute BLAST • To Compile List of Words. • To Scan the Database for Hits (MSP>T). • To Extend All Hits to Seek Segment Pairs with Scores Exceeding the Cutoff (HSP>S). • All these three steps depend on W & T Can we make the process more optimal? • Decrease the time spent on step 3, by increasing the W. But there are complementary problems created by larger W. • For Proteins – 20W possible word, Therefor when W increases number of words generated by query grows exponentially. (But number of words increases linearly with the length of the query) • It increases time spent on step 1 & also the amount of memory required. Results B L A S T

Optimal T and W values • For protein sequences W=4, T=17. • For DNA sequences W=11. How W and T affect the performance of BLAST Results B L A S T T Execution Time Accuracy Speed W Execution Time Accuracy Speed Computational Complexity ?

Performance of BLAST with homologous sequences Things to be Noted When Query Sequence BLAST Against Set of Homologous Sequences RESULTS

What is homologous? • Related by common ancestor. Researchers Example 1 • Search for wooly monkey sequence. • When W=4 & T=17. • Found 178 MSPs with scores (50-80). • Random model suggest that BLAST should miss 24 of MSPs • But actual miss 43. • Therefor error = 44.2% Researchers Example 2 • Search for mouse sequences. • Same W & T as previous. • Found 33 MSPs with scores (45-65). • Random model suggest that BLAST should miss 8 of MSPs • But actual miss 2. RESULTS B L A S T

• Failure to detect significant similarity does only shows our inability to detect homology, it does not prove that the sequences are not homologous. • The overall performance of BLAST depends on the distribution of MSP scores. Strengths of BLAST • Great utility is for identify high scoring MSPs quickly. • Takes lower amount of time for the alignment process. Further improvements can be done • Novel approaches like Position Specific Iterated BLAST (PSI BLAST) RESULTS B L A S T

Comparison of two long DNA sequences Does Adjustments of W Make the Process Faster? RESULTS

Main Classes of Locally Similar Regions • Genes. • Long interspersed Repeats. • Anticipated Weaker Similarities. Example (Human Gene VS Rabbit Gene) Step -1 • Match Score = 5, Mismatch Score = -4 & W=12. • 93 Alignments scoring over 200, 57 Alignments scoring over 350 with 1301 highest score. Step -2 • W=8 • Only additional 32 alignments are found score over 200. • Use of Smaller W does not provides new essential information always. Results B L A S T

Results B L A S T The Time & Sensitivity of BLAST on DNA Sequences as a Function of W

CONCLUSION

Underlying Concept of BLAST • Simple & Robust. • Can be implemented in many ways. • Can be used in variety of contexts. Researchers Implementation • Used a shared memory version of BLAST. • Why shared memory? • Loads compressed DNA file into memory once & allow subsequent steps to skip that step. • BLAST approach permits construction of extremely fast programs for database searching, Which provides additional advantage on mathematical advantage To Whom This Tool Would Help • Molecular Biologists • Doctors & etc… Conclusion B L A S T

QUESTIONS & ANSWERS

Q&A Q&A • What are the disadvantages of Blast with compared to other heuristic algorithms? – by Pubudu • What is the criteria of selecting threshold S and T values using in implementation part of this research paper? – by Parinda • What are the chances that the Maximal Segment Pair score for two unrelated sequences would be greater than or equal to S value? – by Parinda • Among PAM and BLOSUM matrices what is the most suitable matrix when scoring segments in amino acid sequences ? Why? – by Shashika

THANKS FOR YOUR TIME

Add a comment

Related presentations

Speaker: Matt Stine Developing for the Cloud Track Marc Andressen has famou...

This presentation explains how to develop a Web API in Java using (JAX-RS or Restl...

1 App,

1 App,

November 10, 2014

How to bring innovation to your organization by streamlining the deployment proces...

Cisco Call-control solutions can handle voice, video and data

Nathan Sharp of Siemens Energy recently spoke at the SAP Project Management in Atl...

Related pages

BLAST (Basic Local Alignment Search Tool)

nucleotide blast: Search a nucleotide database using a nucleotide query Algorithms: blastn, megablast, discontiguous megablast: protein blast: Search ...
Read more

BLAST Basic Local Alignment Search Tool

BLAST Basic Local Alignment Search Tool Blast Program Selection Guide Table of Content 1. Introduction 2. BLAST Database Content 3. Program Selection Table
Read more

BLAST: Basic Local Alignment Search Tool

The BLAST+ 2.2.31 executables can produce a new version of the BLAST XML (XML2). XML2 is also also available from the NCBI BLAST website and the BLAST AMI ...
Read more

Basic Local Alignment Search Tool - BLAST algorithm

A new approach to rapid sequence comparison, basic local alignment search tool (BLAST), directly approximates alignments that optimize a measure of local ...
Read more

BLAST - Wikipedia, the free encyclopedia

In bioinformatics, BLAST for Basic Local Alignment Search Tool is an algorithm for comparing primary biological sequence information, such as the amino ...
Read more

Using the Basic Local Alignment Search Tool (BLAST)

INTRODUCTION. The BLAST algorithm was developed as a way to perform DNA and protein sequence similarity searches by an algorithm that is ...
Read more

BLAST: Basic Local Alignment Search Tool - NIH Library

The Basic Local Alignment Search Tool (BLAST) finds regions of local similarity between sequences. The program compares nucleotide or protein sequences to ...
Read more

NCBI BLAST on Windows Azure - Microsoft Research

... based implementation of the Basic Local Alignment Search Tool (BLAST) ... is the BLAST application. BLAST on Windows Azure uses the latest ...
Read more

Basic Local Alignment Search Tool (BLAST) | Learn Science ...

Awash in a sea of data, how do scientists identify the function of a newly cloned gene? Online resources like the Basic Local Alignment Search Tool (BLAST ...
Read more

Basic local alignment search tool - ScienceDirect

A new approach to rapid sequence comparison, basic local alignment search tool (BLAST), directly approximates alignments that optimize a measure of local ...
Read more