bai2

Published on November 20, 2007

Author: lebahiep

Source: slideshare.net

Whole-Genome Prokaryote Phylogeny without Sequence Alignment

Bailin HAO and Ji QI

T-Life Research Center, Fudan University, Shanghai 200433, China

Institute of Theoretical Physics, Academia Sinica, Beijing 100080, China

http://www.itp.ac.cn/~hao/

Classification of Prokaryotes: A Long-Standing Problem

Traditional taxonomy: too few features

Morphology: spherical, helical, rod-shaped, ...

Metabolism: photosynthesis, N-fixing, desulfurization, ...

Gram staining: positive and negative

SSU rRNA Tree (Carl Woese et al., 1977):

16S rRNA: ancient conserved sequences of about 1500 nucleotides

Discovery of the three domains of life: Archaea, Bacteria and Eucarya

Endosymbiont origin of mitochondria and chloroplasts

The SSU rRNA Tree of Life: major progress in the molecular phylogeny of prokaryotes, as evidenced by the history of Bergey’s Manual

Bergey’s Manual Trust: Bergey’s Manual

1st Ed. “Determinative Bacteriology”: 1923

8th Ed. “Determinative Bacteriology”: 1974

1st Ed. “Systematic Bacteriology”: 1984-1989, 4 volumes

9th Ed. “Determinative Bacteriology”: 1994

2nd Ed. “Systematic Bacteriology”: 2001-200?, 5 volumes planned

On-Line “Taxonomic Outline of the Prokaryotes” by Garrity et al. (October 2003) (26 phyla: A1-A2, B1-B24)

Our Final Result

132 organisms (16A + 110B + 6E)

Input: genome data

Output: phylogenetic tree

No selection of genes, no alignment of sequences, no fine adjustment whatsoever

See the tree first. Story follows.

 

Protein Tree for 145 Organisms From 82 Genera (K=5) 16 Archaea (11 genera, 16 species) 123 Bacteria (65 genera, 98 species) 6 Eukaryotes

 

Complete Bacterial Genomes Appeared since 1995

Early Expectations:

More support to the SSU rRNA Tree of Life

Add details to the classification (branchings and groupings)

More hints on taxonomic revisions

Confusion brought by the hyperthermophiles

Aquifex aeolicus (Aquae) 1998: 1551335

Thermotoga maritima (Thema) 1999: 1860725

“ Genome Data Shake tree of life ”

Science 280 (1 May 1998) 672

“ Is it time to uproot the tree of life? ”

Science 284 (21 May 1999) 130

“ Uprooting the tree of life ”

W. Ford Doolittle, Scientific American (February 2000) 90

Debate on Lateral Gene Transfer

Extreme estimate: 17% in E. coli

Limitations of the above approach: B. Wang, J. Mol. Evol. 53 (2001) 244

“Phase transition” and “crystallization” of species (C. Woese 1998)

Lateral transfer within smaller gene pools as an innovative agent

Composition vector may incorporate LGT within small gene pools

Alignment-Based Molecular Phylogeny

TCAGACGC

TCGGAGT

T C A G A C G C

T C G G A - G T

Scoring scheme

Gap penalty

16S rRNA tree was based on sequence alignment
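The role of the scoring scheme and gap penalty can be illustrated on the two short sequences from this slide. The sketch below is a plain Needleman-Wunsch global alignment; the slide does not name an algorithm, and the scores used here (match +1, mismatch -1, gap -1) are assumptions chosen only for illustration.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment of strings a and b (scores are illustrative assumptions)."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[n][m]


top, bottom, total = needleman_wunsch("TCAGACGC", "TCGGAGT")
print(top)
print(bottom)
print("score:", total)
```

Different scoring parameters can produce a different gap placement, which is one reason alignment-based trees depend on choices that a composition-based method does not need to make.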

Problem: sequence alignment cannot be readily applied to complete genomes

Homology -> alignment

Different genome size, gene content and gene order

[Figure: genes A, B, C and their counterparts A', B' in the 1st and 2nd species; one counterpart is marked "?"]

Our Motivations:

Develop a molecular phylogeny method that makes use of complete genomes, with no selection of particular genes

Avoid sequence alignment

Try to reach higher resolution to provide an independent comparison with other approaches such as SSU rRNA trees

Make comparison with bacteriologists’ systematics as reflected in Bergey’s Manual (2001, 2002)

Our paper accepted by J. Molecular Evolution

Other Whole-Genome Approaches

Gene content

Presence or absence of COGs

Conserved Gene Pairs

“ Information” distances

Domain order in proteins (Ken Nishikawa’s talk at InCoB2003)



Comparison of Complete Genomes/Proteomes

Compositional vectors

Nucleotides: a, t, c, g

aatcgcgcttaagtc

Di-nucleotide (K=2) distribution:

{aa at ac ag ta tt tc tg ca ct cc cg ga gt gc gg}

{ 2, 1, 0, 1, 1, 1, 2, 0, 0, 1, 0, 2, 0, 1, 2, 0}
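The counting step can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name `composition_vector` is hypothetical, and the alphabet order "atcg" is chosen to match the ordering of the 16 di-nucleotides listed on the slide.

```python
from collections import Counter
from itertools import product


def composition_vector(seq, k, alphabet):
    """Count every overlapping K-string of seq, in the order induced by alphabet."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # product() enumerates all len(alphabet)**k possible K-strings in a fixed order.
    return [counts["".join(p)] for p in product(alphabet, repeat=k)]


vector = composition_vector("aatcgcgcttaagtc", 2, "atcg")
print(vector)
print("number of 2-strings counted:", sum(vector))
```

A sequence of length L contains L - K + 1 overlapping K-strings, so the 15-letter example yields 14 di-nucleotides spread over the 16 vector components.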

K-strings make a composition vector

DNA sequence -> vector of dimension 4^K

Protein sequence -> vector of dimension 20^K

Given a genomic or protein sequence -> a unique composition vector

The converse: a vector -> one or more sequences?

K big enough -> uniqueness

Connection with the number of Eulerian loops in a graph (a separate study available as a preprint at arXiv:physics/0103028 and from Hao’s webpage)

A Key Improvement: Subtraction of Random Background

Mutations took place randomly at the molecular level

Selection shaped the direction of evolution

Many neutral mutations remain as random background

At the single amino acid level protein sequences are quite close to random

Highlighting the role of selection by subtracting a random background

Frequency and Probability

A sequence of length L

A K-string

Frequency of appearance

Probability

Predicting #(K-strings) from that of lengths (K-1) and (K-2) strings

Joint probability vs. conditional probability

Making the weakest Markov assumption:

Another joint probability:

(K-2)-th Order Markov Model

Change to frequencies:

Normalization factor may be ignored when L>>K

Construct compositional vectors using these modified string counts:

For the i-th string type of species A we use
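The formulas on these four slides did not survive in the transcript. The block below is a reconstruction, not the original slide content: the symbols f, p, f^0, p^0 and α are assumed notation, and the probability is assumed to be the count divided by the number of string positions. The prediction and the final component agree with the GKSTL and HAMSC worked examples given later in the deck.

```latex
\begin{align*}
% Probability of a K-string from its count f in a sequence of length L
p(\alpha_1\alpha_2\cdots\alpha_K) &= \frac{f(\alpha_1\alpha_2\cdots\alpha_K)}{L-K+1} \\
% Joint probability written with a conditional probability
p(\alpha_1\cdots\alpha_K) &= p(\alpha_K \mid \alpha_1\cdots\alpha_{K-1})\, p(\alpha_1\cdots\alpha_{K-1}) \\
% Weakest Markov assumption: the last letter does not depend on the first
p(\alpha_K \mid \alpha_1\cdots\alpha_{K-1}) &\approx p(\alpha_K \mid \alpha_2\cdots\alpha_{K-1}) \\
% Another joint probability
p(\alpha_2\cdots\alpha_K) &= p(\alpha_K \mid \alpha_2\cdots\alpha_{K-1})\, p(\alpha_2\cdots\alpha_{K-1}) \\
% (K-2)-th order Markov prediction
p^0(\alpha_1\cdots\alpha_K) &= \frac{p(\alpha_1\cdots\alpha_{K-1})\, p(\alpha_2\cdots\alpha_K)}{p(\alpha_2\cdots\alpha_{K-1})} \\
% The same prediction in terms of counts; the last factor tends to 1 when L >> K
f^0(\alpha_1\cdots\alpha_K) &= \frac{f(\alpha_1\cdots\alpha_{K-1})\, f(\alpha_2\cdots\alpha_K)}{f(\alpha_2\cdots\alpha_{K-1})} \cdot \frac{(L-K+1)(L-K+3)}{(L-K+2)^2} \\
% Component of the composition vector for the i-th string type
a_i &= \begin{cases} \dfrac{f - f^0}{f^0}, & f^0 \neq 0 \\[1ex] 0, & f^0 = 0 \end{cases}
\end{align*}
```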

Composition Distance

Define correlation between two compositional vectors by the cosine of angle

From two complete proteomes:

A: {a_1, a_2, ..., a_n}, n = 20^5 = 3 200 000

B: {b_1, b_2, ..., b_n}

C(A,B) ∈ [-1,1]

Distance

D(A,B) ∈ [0,1]
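A minimal sketch of the correlation and distance follows. The cosine definition is taken from the slide; the distance formula itself is not preserved in the transcript, so D = (1 - C) / 2 is an assumption, chosen because it maps the stated range [-1, 1] of C onto the stated range [0, 1] of D. Vectors are stored as dictionaries keyed by K-string (a hypothetical representation) because most of the 3 200 000 components are zero.

```python
import math


def correlation(a, b):
    """Cosine of the angle between two sparse composition vectors (dicts)."""
    dot = sum(value * b[key] for key, value in a.items() if key in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    # Assumes neither vector is all zeros.
    return dot / (norm_a * norm_b)


def distance(a, b):
    """Assumed form D = (1 - C) / 2, which lies in [0, 1]."""
    return (1.0 - correlation(a, b)) / 2.0


# Toy vectors with made-up components, for illustration only.
species_a = {"GKSTL": 0.6, "HAMSC": 197.0}
species_b = {"GKSTL": 0.5, "HAMSC": 150.0, "AAAAA": -0.2}
print(distance(species_a, species_b))
```

Identical vectors give a distance of 0, orthogonal vectors 0.5, and opposite vectors 1.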

Materials:

Genomes from NCBI (ftp.ncbi.nih.gov/genomes/Bacteria/)

Not the original GenBank files

6 eukaryote genomes were included for reference

Tree construction: Neighbor-Joining in Phylip
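Phylip's distance programs read a square matrix in a simple text layout: the number of taxa on the first line, then one row per taxon with the name in the first 10 characters followed by the distances. A minimal sketch of writing such a file is shown below; the function name `format_phylip_matrix`, the taxon labels and the distance values are illustrative placeholders, not results from the talk.

```python
def format_phylip_matrix(names, dist):
    """Return a square distance matrix as text in PHYLIP layout."""
    lines = [f"{len(names):5d}"]
    for name, row in zip(names, dist):
        # PHYLIP treats the first 10 characters of each row as the taxon name.
        label = name[:10].ljust(10)
        lines.append(label + " " + " ".join(f"{d:.6f}" for d in row))
    return "\n".join(lines) + "\n"


names = ["Ecoli", "Aquae", "Thema"]
dist = [
    [0.0, 0.4, 0.5],  # placeholder distances, not real data
    [0.4, 0.0, 0.3],
    [0.5, 0.3, 0.0],
]
print(format_phylip_matrix(names, dist), end="")
```

The resulting text can be saved as the input file for the Neighbor-Joining step.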

Protein Tree for 132 species (K=5) 16 Archaea (11 genera, 16 species) 110 Bacteria (57 genera, 88 species) 6 Eukaryotes

 

Protein Tree for 132 species K=6 16 Archaea (11 genera, 16 species) 110 Bacteria (57 genera, 88 species) 6 Eukaryotes

 

Protein Class vs. Whole Proteome

Trees based on collection of ribosomal proteins (SSU + LSU): ribosomal proteins are interwoven with rRNA to form functioning complex; results consistent with SSU rRNA trees

Trees based on collection of aminoacyl-tRNA synthetases (AARS). Trees based on single AARS were not good. Trees based on all 20 AARSs much better but not as good as that based on rProteins.

Genus Tree based on Ribosomal Proteins

A Genus Tree based on Aminoacyl tRNA synthetases

Chloroplast Tree

Sequences of about 100 000 bp

Tree of the endosymbiont partners

Paper accepted by Molecular Biology and Evolution on 12 August 2003

Chloroplast tree

Coronaviruses including Human SARS-CoV

Sequences of tens of kilobases

SARS sequence: about 29730 bases

Paper published in Chinese Science Bulletin on 26 June 2003

Coronavirus tree

Understanding the Subtraction Procedure: Analysis of Extreme Cases in E. coli

There are 1 343 887 5-strings belonging to 841 832 different types.

Maximal count before subtraction: 58 for the 5-peptide GKSTL. 58 reduces to 0.646 after subtraction.

Maximal component after subtraction: 197 for the 5-peptide HAMSC. The number 197 came from a single count of 1 before the subtraction.

GKSTL: how does 58 reduce to 0.646?

#(GKST)=113

#(KSTL)=77

#(KST)=247

Markov prediction: 113*77/247=35.23

Final result: (58-35.23)/35.23=0.646

HAMSC: how does 1 grow to 197?

#(HAMS)=1

#(AMSC)=1

#(AMS)=198

Markov prediction: 1*1/198=1/198

Final result: (1-1/198)/(1/198)=197
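The two worked examples follow the same arithmetic, which can be checked with a short sketch. The function names are hypothetical, and the normalization factor is ignored, as the deck says is permissible when L >> K.

```python
def markov_prediction(f_prefix, f_suffix, f_middle):
    """Predicted count of a K-string from its two (K-1)-substrings and the shared (K-2)-substring."""
    if f_middle == 0:
        return 0.0
    return f_prefix * f_suffix / f_middle


def subtracted_component(f, f_prefix, f_suffix, f_middle):
    """Relative deviation of the observed count f from the Markov prediction."""
    f0 = markov_prediction(f_prefix, f_suffix, f_middle)
    if f0 == 0:
        return 0.0
    return (f - f0) / f0


# GKSTL: #(GKSTL)=58, #(GKST)=113, #(KSTL)=77, #(KST)=247
print(round(subtracted_component(58, 113, 77, 247), 3))
# HAMSC: #(HAMSC)=1, #(HAMS)=1, #(AMSC)=1, #(AMS)=198
print(round(subtracted_component(1, 1, 1, 198), 3))
```

A frequent string that is well predicted by its substrings ends up with a small component, while a rare string that the Markov model hardly expects at all ends up with a large one.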

6121 Exact Matches of GKSTL in PIR Rel. 1.26 with >1.2 Mil Proteins

These 6121 matches came from a diverse taxonomic assortment from virus to bacteria to fungi to plants and animals including human being

In the parlance of classic cladistics GKSTL contributes to plesiomorphic characters that should be eliminated in a strict phylogeny

The subtraction procedure did the job.

15 Exact Matches of HAMSC in PIR Rel. 1.26 with >1.2 Mil Proteins

1 match from a eukaryotic protein

4 matches (the same protein) from viruses

10 matches from prokaryotes, among which

3 from Shigella and E. coli (HAMSCAPDKE)

3 from Salmonella (HAMSCAPERD)

HAMSC is characteristic of prokaryotes

HAMSCA is specific to enterobacteria

Stable Topology of the Tree

K=1: makes some sense!

K=2,3,4: topology gradually converges

K=5 and K=6: present calculation

K=7 and more: too high resolution; star-tree or bush expected

Statistical Test of the Tree

Bootstrap versus jackknife

Bootstrap in sequence alignments

“Bootstrap” by random selections from the AA-sequence pool

A time-consuming job

180 bootstraps for 72 species

About 70% of the genes of every species were selected in one bootstrap
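One resampling round could be sketched as follows. This is an assumption-laden illustration: the slides do not say whether proteins were drawn with or without replacement, so the sketch draws without replacement, and the function name `resample_proteome` and the protein identifiers are hypothetical.

```python
import random


def resample_proteome(proteins, fraction=0.7, rng=None):
    """Pick a random subset containing about `fraction` of a species' proteins."""
    rng = rng or random.Random()
    size = round(fraction * len(proteins))
    # sample() draws without replacement, so no protein is selected twice.
    return rng.sample(proteins, size)


proteome = [f"protein_{i}" for i in range(10)]
subset = resample_proteome(proteome, rng=random.Random(0))
print(len(subset), "of", len(proteome), "proteins selected")
```

Each round would then rebuild the composition vectors from the selected proteins and recompute the tree, which is why 180 rounds for 72 species is a time-consuming job.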

“K-string Picture” of Evolution

K=5 -> 3 200 000 points in the space of 5-strings

K=6 -> 64 000 000 points

In the primordial soup: short polypeptides of a limited assortment

Evolution by growth, fusion, mutation leads to diffusion in the string space

String space not saturated yet
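The sizes quoted above are simply powers of the 20-letter amino acid alphabet. As a rough check of the "not saturated" remark, the sketch below also divides the number of distinct 5-string types reported for E. coli in the subtraction analysis (841 832) by the size of the 5-string space; the function name `string_space_size` is hypothetical.

```python
def string_space_size(k, alphabet_size=20):
    """Number of distinct K-strings over an alphabet of the given size."""
    return alphabet_size ** k


print(string_space_size(5))
print(string_space_size(6))

# Distinct 5-string types reported for E. coli elsewhere in the deck.
ecoli_types = 841_832
occupancy = ecoli_types / string_space_size(5)
print(f"fraction of 5-string space occupied by E. coli: {occupancy:.3f}")
```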

The Problem of Higher Taxa

1974: Bacteria as a separate kingdom

1994: Archaea and Bacteria as two domains

The relation of higher taxa?

Summary

As composition vectors do not depend on genome size and gene content, the use of whole-genome data is straightforward

Data independent of 16S rRNA

Method different from that based on SSU rRNA

Results agree with SSU rRNA trees and the Bergey’s Manual

Hint on groupings of higher taxa

A method without “free parameters”: data in, tree out

Possibility of an automatic and objective classification tool for prokaryotes

Conclusion: The Tree of Life is saved! There is phylogenetic information in the prokaryotic proteomes. Time to work on molecular definition of taxa. Thank you!

 

 

 

 

A Failed Attempt Using Avoidance Signatures

 

Comparison with the Bergey’s Manual

Tree Construction

phylip package of J. Felsenstein (Neighbor-Joining)

The Fitch method is not feasible here,

Nondistance-matrix methods (MP, ML, etc.)

Material

ftp://ncbi.nlm.nih.gov/genomes/Bacteria/

          Phyla  Classes  Orders  Families  Genera  Species  Strains
Archaea       2        7       9         9       9       13       13
Bacteria      9       14      23        28      37       46       57
Total        11       21      32        37      46       59       70

Early expectation from genome data

Was there intensive lateral gene transfer?

Gene tree cannot be equated to the real tree of life

Genome data: 10^6 to 10^7

Difficult to align whole genome data

Prokaryote and Eukaryote

Three Kingdoms( Carl Woese ,16S rRNA )

Archaea

Eubacteria

Eukarya

Five Kingdoms ( Lynn Margulis )

Bacteria ( Archaea, Eubacteria )

Protoctista

Animalia

Fungi

Plantae

Common features of Archaea and Eubacteria:

Small cells, no nuclear membrane, circular DNA, no cap at the 5’ end of mRNA, presence of Shine-Dalgarno (S-D) sequences

Many proteins associated with replication, transcription, and translation are common to Archaea and Eukaryotes

Features of Archaea: lack of some enzymes, insensitive to some antibiotics

《 Compositional Representation of Protein Sequences and the Number of Eulerian Loops 》

by Bailin Hao, Huimin Xie, Shuyu Zhang

K=5: 76.7% proteins have unique reconstruction

K=6:  94.0%

K=10: >99%

Checked 2820 AA-seqs from pdb.seq, a special selection of SWISS-PROT

See Los Alamos National Lab E-Archive: physics/0103028

Subtraction of Random Background

Using a (K-2)-order Markov Model

K=2: genomic signature by Karlin and Burge

May be justified by using Maximal Entropy Principle with appropriate constraints (Hu & Wang, 2001)

What to do next

Detailed comparison with traditional taxonomy

Add more eukaryotes

Elucidation of the foundation and limitations of the compositional approach

Software and web interface

Problem of lateral gene transfer

Viruses?

Confusion brought by the hyperthermophiles

Aquifex aeolicus (Aqua) 1998: 1551335

Thermotoga maritima (Tmar) 1999: 1860725

“ Genome Data Shake tree of life”

Science 280 (1 May 1998) 672

“ Is it time to uproot the tree of life?”

Science 284 (21 May 1999) 130

“ Uprooting the tree of life”

Sci. Amer. (February 2000) 90

Problem of Lateral Gene Transfer (LGT): tree or network

Problem of higher taxa
