gene finding 1


Published on November 16, 2007

Author: Hannah

Source: authorstream.com

BMB3600 - Bioinformatics:
  March 25 – gene finding 1
  March 30 – gene finding 2
  April 01 – prediction of binding motifs
  April 06 – microarray data analysis
  April 08 – sequence comparison
  April 13 – protein function prediction 1
  April 15 – protein function prediction 2
  April 20 – protein structure prediction 1
  April 22 – protein structure prediction 2
  April 27 – take-home exam

Gene Finding I -- outline:
  Problem definition
  Basic gene structures in eukaryotic versus prokaryotic genomes
  Codons and reading frames
  Codon frequencies in coding versus non-coding regions
  Basic idea of distinguishing coding versus non-coding regions
  Computational methods for distinguishing coding from non-coding regions
  Collecting data for model building
  How to develop a simple gene finder

Gene finding:
  The human genome has ~3 billion base pairs and about 35,000 protein-coding genes
  ccgtacgtacgtagagtgctagtctagtcgtagcgccgtagtcgatcgtgtgggtagtagctgatatgatgcgaggtaggggataggatagcaacagatgagcggatgctgagtgcagtggcatgcgatgtcgatgatagcggtaggtagacttcgcgcataaagctgcgcgagatgattgcaaagragttagatgagctgatgctagaggtcagtgactgatgatcgatgcatgcatggatgatgcagctgatcgatgtagatgcaataagtcgatgatcgatgatgatgctagatgatagctagatgtgatcgatggtaggtaggatggtaggtaaattgatagatgctagatcgtaggtagtagctagatgcagggataaacacacggaggcgagtgatcggtaccgggctgaggtgttagctaatgatgagtacgtatgaggcaggatgagtgacccgatgaggctagatgcgatggatggatcgatgatcgatgcatggtgatgcgatgctagatgatgtgtgtcagtaagtaagcgatgcggctgctgagagcgtaggcccgagaggagagatgtaggaggaaggtttgatggtagttgtagatgattgtgtagttgtagctgatagtgatgatcgtag
  …………………………………
  Where are the protein-encoding genes?

The basic idea of pattern recognition:
  How do kids learn to distinguish “dogs” from “cats”?
    they are “trained” by being told “A is a dog”, “B is a cat”, “C is another dog”, …
    they learn to “extract” common features (patterns) among the animals they were told are “dogs” and “cats”
    they then apply these extracted features to identify new dogs and cats
  Pattern recognition is generally done by
    providing “training sets” whose members are individually labeled “positive” versus “negative”, or “good” versus “bad”, etc.
    learning the general rules that separate the “positives” from the “negatives”, or the “good” from the “bad”, …
    applying the learned rules to new situations

Gene finding through learning:
  Learning “general rules” about finding genes
  ccgtacgtacgtagagtgctagtctagtcgtagcgccgtagtcgatcgtgtgggtagtagctgatatgatgcgaggtaggggataggatagcaacagatgagcggatgctgagtgcagtggcatgcgatgtcgatgatagcggtaggtagacttcgcgcataaagctgcgcgagatgattgcaaagragttagatgagctgatgctagaggtcagtgactgatgatcgatgcatgcatggatgatgcagctgatcgatgtagatgcaataagtcgatgatcgatgatgatgctagatgatagctagatgtgatcgatggtaggtaggatggtaggtaaattgatagatgctagatcgtaggtagtagctagatgcagggataaacacacggaggcgagtgatcggtaccgggctgaggtgttagctaatgatgagtacgtatgaggcaggatgagtgacccgatgaggctagatgcgatggatggatcgatgatcgatgcatggtgatgcgatgctagatgatgtgtgtcagtaagtaagcgatgcggctgctgagagcgtaggcccgagaggagagatgtaggaggaaggtttgatggtagttgtagatgattgtgtagttgtagctgatagtgatgatcgtag
  …………………………………
  Over the years, numerous genes have been identified through experiments.
  Also, some DNA segments have been verified by experiments to be non-genes.

Gene finding through learning:
  So we know
    ccgtacgtacgtagagtgctagtctagtcgtagcgccgtagtcgatcgtgt
    gggtagtagctgatatgatgcgaggtaggggataggatagcaacagatgagcggatgctgagtgcagtggcatgcgatgtcgatgatagcggtaggtagacttcgcgcataaagctgcgcgagatgattgcaaagragttagatgagctgatgctagag
    gtcagtgactgatgatcgatgcatgcatg
    gatgatgcagctgatcgatgtagatgcaataagtcgatgatcgatgatgatgctagatgatagctagatgtgatcgatggtaggtaggatggtaggtaaattgatagatg
    ctagatcgtaggtagtagctagatgcagggataaacacacggaggc
    gagtgatcggtaccgggctgaggtgttagctaatgatgagtacgtatgaggcaggatgagtgacccgatgaggctagatgcgatggatggatcgatgatcgatgcatggtgatgcgatgctagatgatgtgtgtcagtaagtaa
    …………………………………
  Legend: genes, non-genes

Gene finding through learning:
  Is this a gene?
    gcgatgcggctgctgagagcgtaggcccgagaggagagatgtaggaggaaggtttgatggtagttgtagatgattgtgtagttgtagctgatagtgatgatcgtag
  Remember “dogs” and “cats” … but the “patterns” here are much more hidden and more complex than the features that distinguish “dogs” from “cats”
  We need to study the basic structures of genes first!

Basic Gene Structures:
  Eukaryotic genes
    exons, introns, translation starts and stops, splice (donor/acceptor) junctions

Basic Gene Structure:
  Prokaryotic genes
    coding regions, non-coding regions
    translation starts and stops
  [Figure labels: gene, gene, gene, promoter, start, stop]
  Prokaryotic genes are easier to identify than eukaryotic genes because of the simplicity of their gene structure and the density of genes in the genome

Gene Structure -- codons:
  A triplet of nucleotides is called a codon
  There are 64 codons (4 * 4 * 4 = 64): AAA, …, TTT
  Three codons (TAG, TGA, TAA) are called stop codons because they signal the termination of a gene's translation
  Each of the other codons encodes an amino acid

Gene Structure – reading frame:
  Reading (or translation) frame: each DNA segment has six possible reading frames
  Forward strand: ATGGCTTACGCTTGA
    Reading frame #0   ATG GCT TAC GCT TGA
    Reading frame #1   TGG CTT ACG CTT GA.
    Reading frame #2   GGC TTA CGC TTG A..
  Reverse strand: TCAAGCGTAAGCCAT
    Reading frame #0   TCA AGC GTA AGC CAT
    Reading frame #1   CAA GCG TAA GCC AT.
    Reading frame #2   AAG CGT AAG CCA T..

Gene Structure – open reading frame (ORF):
  Open reading frame (ORF): a segment of DNA with an in-frame start codon at one end, an in-frame stop codon at the other end, and no in-frame stop codon in between
  Each ORF has a fixed reading frame
  How many genes can an ORF have inside it?
  Answer: one, because an ORF has only one stop

Gene Structure -- open reading frame (ORF):
  Generally true: long (> 300 bp) ORFs in prokaryotic genomes encode genes
  But this is not necessarily true for eukaryotic genomes
  Coding region: a gene in prokaryotic genomes, an exon in eukaryotic genomes

Gene Structure:
  Each coding region (exon or whole gene) has a fixed translation frame
  A coding region always sits inside an ORF of the same reading frame
  All exons of a gene are on the same strand
  Neighboring exons of a gene could have different reading frames
  [Figure labels: frame 1, frame 2, frame 3]

Gene Structure – reading frame consistency:
  Now … we are talking about slightly more “complex” features
  Neighboring exons of a gene should be frame-consistent
    ATG GCT TGG GCT TTA A -------------- GT TTC CCG GAG AT ------ T GGG
    [Figure labels: exon 1, exon 2, exon 3; splicing]
  exon1 [i, j] in frame a and exon2 [m, n] in frame b are consistent if
    b = (m - j - 1 + a) mod 3
  Reminder: 1 mod 3 = 1, 2 mod 3 = 2, 3 mod 3 = 0, 4 mod 3 = 1, 5 mod 3 = 2, …

Codon Frequencies:
  Coding sequences are translated into protein sequences
  We found the following: the frequencies of amino acid dimers (adjacent pairs) in protein sequences are NOT evenly distributed
  The average frequency is 5%
  Some amino acids prefer to be next to each other
  Some other amino acids prefer not to be next to each other
  [Figure label: Shewanella]

Dicodon Frequencies:
  Believe it or not: the biased (uneven) dimer frequencies are the foundation of many gene finding programs!
  Basic idea: if a dimer has a lower-than-average frequency, proteins tend not to have that dimer in their sequences; hence if we see a dicodon encoding this dimer, we may want to bet against this dicodon being in a coding region!

Dicodon Frequencies:
  Hence if we see many such dicodons in a DNA segment, we may want to bet that this region is a non-coding region!
  This is the very basic idea of gene finding!

Dicodon Frequencies:
  Dicodon frequencies in coding versus non-coding regions are genome-dependent
  [Figure labels: Shewanella, bovine]

Dicodon Frequencies:
  Relative frequencies of a dicodon in coding versus non-coding regions
    frequency of dicodon X (e.g., AAAAAA) in coding regions: the total number of occurrences of X in coding regions divided by the total number of dicodon occurrences in coding regions
    frequency of dicodon X (e.g., AAAAAA) in non-coding regions: the total number of occurrences of X in non-coding regions divided by the total number of dicodon occurrences in non-coding regions
  In the human genome, the frequency of the dicodon “AAA AAA” is ~1% in coding regions versus ~5% in non-coding regions
  Question: if you see a region with many “AAA AAA”, would you guess it is a coding or a non-coding region?

Basic idea of gene finding:
  Most dicodons show a bias towards either coding or non-coding regions; only a fraction of dicodons are neutral
  Foundation for coding region identification
    Dicodon frequencies are the key signal used for coding region detection; most gene finding programs use this information
    Regions consisting of dicodons that mostly tend to be in coding regions are probably coding regions; otherwise they are probably non-coding regions

Basic idea of gene finding:
  In-frame (the correct frame) versus any-frame (all three frames) dicodons
    ATG TTG GAT GCC CAG AAG …
    In-frame: ATG TTG GAT GCC CAG AAG
    Not in-frame: TGTTGG, ATGCCC, AGAAG. and GTTGGA, TGCCCA, GAAG..
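
The in-frame versus not in-frame distinction above can be sketched in a few lines of Python. This is a minimal illustration, not the course's code: the function name `dicodons` and its `offset` and `step` parameters are hypothetical, and the slides do not say whether consecutive dicodons overlap by one codon (`step=3`) or are taken back to back (`step=6`), so both are supported.

```python
def dicodons(seq, offset=0, step=3):
    """Return the complete 6-mers of seq, starting at offset and
    advancing step bases each time (hypothetical helper)."""
    seq = seq.upper()
    return [seq[i:i + 6] for i in range(offset, len(seq) - 5, step)]

region = "ATGTTGGATGCCCAGAAG"  # the example sequence from the slide

# In-frame dicodons: start at offset 0, advance one codon at a time.
in_frame = dicodons(region, offset=0, step=3)

# Not in-frame dicodons: the same scan shifted by 1 or 2 bases.
shifted_1 = dicodons(region, offset=1, step=3)
shifted_2 = dicodons(region, offset=2, step=3)

# Any-frame: dicodons from all three frames pooled together.
any_frame = in_frame + shifted_1 + shifted_2

print(in_frame)
print(dicodons(region, offset=1, step=6))  # back-to-back 6-mers in frame +1
print(dicodons(region, offset=2, step=6))  # back-to-back 6-mers in frame +2
```

With `step=6` the shifted scans give the complete not in-frame 6-mers listed on the slide; trailing partial pieces such as `AGAAG.` are dropped because they are shorter than six bases.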

Basic idea of gene finding:
  In-frame dicodon frequencies provide a more sensitive measure than any-frame dicodon frequencies

Computational model for gene finding:
  YES, it is still simple …
  Preference model:
    for each dicodon X (e.g., AAA AAA), calculate its frequencies in coding and non-coding regions, FC(X) and FN(X)
    calculate X's preference value P(X) = log (FC(X)/FN(X))
  Properties:
    P(X) is 0 if X has the same frequency in coding and non-coding regions
    P(X) is positive if X has a higher frequency in coding than in non-coding regions; the larger the difference, the more positive the score
    P(X) is negative if X has a higher frequency in non-coding than in coding regions; the larger the difference, the more negative the score

Computational model for gene finding:
  Example: coding preference of a region (an any-frame model)
  AAA ATT, AAA GAC, AAA TAG have the following frequencies
    FC(AAA ATT) = 1.4%, FN(AAA ATT) = 5.2%
    FC(AAA GAC) = 1.9%, FN(AAA GAC) = 4.8%
    FC(AAA TAG) = 0.0%, FN(AAA TAG) = 6.3%
  We have (using base-10 logarithms)
    P(AAA ATT) = log (1.4/5.2) = -0.57
    P(AAA GAC) = log (1.9/4.8) = -0.40
    P(AAA TAG) = -infinity (STOP codons are treated differently)
  A region consisting of only these dicodons is probably a non-coding region
  Calculate the preference scores of all dicodons of the region and sum them up; if the total score is positive, predict the region to be a coding region; otherwise predict a non-coding region.

Computational model for gene finding:
  Ok, now you may want to run away …
  In-frame preference model
  Actually, the concept is still simple …
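
The worked any-frame example can be checked with a short Python sketch. It assumes base-10 logarithms, which is what reproduces the -0.57 and -0.40 on the slide; the names `preference`, `region_score` and `call` are illustrative, not from the course material.

```python
import math

def preference(fc, fn):
    """P(X) = log10(FC(X) / FN(X)).

    Returns -inf when X never occurs in coding regions (FC = 0) and
    +inf when it never occurs in non-coding regions (FN = 0).
    """
    if fc == 0:
        return float("-inf")
    if fn == 0:
        return float("inf")
    return math.log10(fc / fn)

# Frequencies in percent, as given on the slide: dicodon -> (FC, FN)
freqs = {
    "AAAATT": (1.4, 5.2),
    "AAAGAC": (1.9, 4.8),
    "AAATAG": (0.0, 6.3),
}
prefs = {x: preference(fc, fn) for x, (fc, fn) in freqs.items()}

def region_score(dicodon_list, prefs):
    """Sum the preference values of all dicodons seen in a region."""
    return sum(prefs[x] for x in dicodon_list)

def call(score):
    """Positive total score: coding; otherwise non-coding."""
    return "coding" if score > 0 else "non-coding"

score = region_score(["AAAATT", "AAAGAC"], prefs)
print(round(prefs["AAAATT"], 2), round(prefs["AAAGAC"], 2), call(score))
```

A dicodon that ends in a stop codon, such as AAA TAG, drives the sum to minus infinity, which is one reason the slide notes that stop codons are treated differently.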

Computational model for gene finding:
  In-frame preference model (most commonly used in prediction programs)
  Application step:
    For each possible reading frame of a region, calculate the total in-frame preference score Σ P0(X), the total (in-frame + 1) preference score Σ P1(X), and the total (in-frame + 2) preference score Σ P2(X), and sum them up
    If the score is positive, predict the region to be a coding region; otherwise non-coding

Computational Gene Finding:
  Prediction procedure for coding regions
  Procedure:
    Calculate all ORFs of a DNA segment
    For each ORF, do the following:
      slide through the ORF with an increment of 10 base pairs
      calculate the preference score, in the same frame as the ORF, within a window of 60 base pairs, and assign the score to the center of the window
  Example (forward strand in one particular frame)
  [Figure: preference scores along the sequence; axis ticks 0, +5, -5]

Computational Gene Finding:
  Making the call: coding or non-coding, and where the boundaries are
  Need a training set with known coding and non-coding regions
  Select threshold(s) to include as many known coding regions as possible and at the same time exclude as many known non-coding regions as possible
    If threshold = 0.2, we will include 90% of coding regions and also 10% of non-coding regions
    If threshold = 0.4, we will include 70% of coding regions and also 6% of non-coding regions
    If threshold = 0.5, we will include 60% of coding regions and also 2% of non-coding regions
  Where to draw the line?

Computational Gene Finding:
  Why dicodon (6-mer)?
  Codon (3-mer) based models are not nearly as information-rich as dicodon-based models
  Tricodon (9-mer) based models need too many data points to be practical
  People have used 7-mer or 8-mer based models; they could provide better prediction methods than 6-mer based models
  There are
    4*4*4 = 64 codons
    4*4*4*4*4*4 = 4,096 dicodons
    4*4*4*4*4*4*4*4*4 = 262,144 tricodons
  To make our statistics reliable, we would need at least ~15 occurrences of each X-mer; so for tricodon-based models, we need at least 15*262,144 = 3,932,160 tricodon occurrences in the coding part of our training data, which is probably not going to be available for most genomes

Collecting data:
  Where can we collect the data for estimating the initial dicodon frequencies?
    GenBank: http://www.ncbi.nlm.nih.gov/entrez
    an example: Shewanella oneidensis MR-1
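
Once labeled coding and non-coding sequences have been collected, the frequencies FC(X) and FN(X) are simple counts. The sketch below shows one way this might be done. The function name `dicodon_frequencies` is hypothetical, the two toy sequences stand in for real GenBank data, and counting in-frame dicodons one codon apart (`step=3`) is an assumption rather than something the slides specify.

```python
from collections import Counter

def dicodon_frequencies(sequences, step=3):
    """Estimate dicodon frequencies from a list of sequences.

    Each sequence is assumed to start in frame. The frequency of X is
    the number of occurrences of X divided by the total number of
    dicodon occurrences.
    """
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(0, len(seq) - 5, step):
            counts[seq[i:i + 6]] += 1
    total = sum(counts.values())
    return {x: n / total for x, n in counts.items()}

# Toy training data (made up for illustration, NOT real annotations).
coding = ["ATGGCTTACGCTTGA"]
noncoding = ["AAAAAAAAATTT"]

fc = dicodon_frequencies(coding)     # FC(X)
fn = dicodon_frequencies(noncoding)  # FN(X)

print(fc)
print(fn)
```

With real data, dicodons seen fewer than about 15 times would give unreliable estimates, as noted above, and a dicodon missing from one of the two sets has no finite log ratio, so it would need special handling before computing P(X).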
