UC Davis EVE161 Lecture 15 by @phylogenomics

Education

Published on February 27, 2014

Author: phylogenomics

Source: slideshare.net

Description

Slides for Lecture 15 in EVE 161 Course by Jonathan Eisen at UC Davis

EVE 161: Microbial Phylogenomics
Lecture #15: Era IV: Shotgun Metagenomics
UC Davis, Winter 2014
Instructor: Jonathan Eisen
Slides for UC Davis EVE161 Course Taught by Jonathan Eisen Winter 2014

Where we are going and where we have been
• Previous lecture:
  - 14: Era IV: Metagenomics
• Current lecture:
  - 15: Era IV: Shotgun Metagenomics
• Next lecture:
  - 16: Era IV: Function in Metagenomics

Era IV: Genomes in the environment
Era IV: Shotgun Metagenomics

Environmental Shotgun Sequencing
• ESS first applied to endosymbiont genomes
• Buchnera genome sequenced with ESS
• Endosymbionts relatively clonal within one host and even within one species sometimes
• Many others too

Wolbachia Metagenomic Sequencing
shotgun sequence

Wolbachia pipientis wMel
Wu et al., 2004. Collaboration between Jonathan Eisen and Scott O’Neill (Yale, U. Queensland).

Community structure and metabolism through reconstruction of microbial genomes from the environment
Gene W. Tyson, Jarrod Chapman, Philip Hugenholtz, Eric E. Allen, Rachna J. Ram, Paul M. Richardson, Victor V. Solovyev, Edward M. Rubin, Daniel S. Rokhsar & Jillian F. Banfield

Environmental Genome Shotgun Sequencing of the Sargasso Sea
J. Craig Venter, Karin Remington, John F. Heidelberg, Aaron L. Halpern, Doug Rusch, Jonathan A. Eisen, Dongying Wu, Ian Paulsen, Karen E. Nelson, William Nelson, Derrick E. Fouts, Samuel Levy, Anthony H. Knap, Michael W. Lomas, Ken Nealson, Owen White, Jeremy Peterson, Jeff Hoffman, Rachel Parsons, Holly Baden-Tillson, Cynthia Pfannkoch, Yu-Hui Rogers, Hamilton O. Smith

The study of microbial evolution and ecology has been revolutionized by DNA sequencing and analysis¹⁻³. However, isolates have been the main source of sequence data, and only a small fraction of microorganisms have been cultivated⁴⁻⁶. Consequently, focus has shifted towards the analysis of uncultivated microorganisms via cloning of conserved genes and genome fragments directly from the environment⁷⁻⁹. To date, only a small fraction of genes have been recovered from individual environments, limiting the analysis of [...]

[...] fluorescence in situ hybridization (FISH) revealed that all biofilms contained mixtures of bacteria (Leptospirillum, Sulfobacillus and, in a few cases, Acidimicrobium) and archaea (Ferroplasma and other members of the Thermoplasmatales). The genome of one of these archaea, Ferroplasma acidarmanus fer1, isolated from the Richmond mine, has been sequenced previously (http://www.jgi.doe.gov/JGI_microbial/html/ferroplasma/ferro_homepage.html). A pink biofilm (Fig. 1a) typical of AMD communities was [...]

Shotgun metagenomics
shotgun sequence

Community structure and metabolism through reconstruction of microbial genomes from the environment

Gene W. Tyson¹, Jarrod Chapman³,⁴, Philip Hugenholtz¹, Eric E. Allen¹, Rachna J. Ram¹, Paul M. Richardson⁴, Victor V. Solovyev⁴, Edward M. Rubin⁴, Daniel S. Rokhsar³,⁴ & Jillian F. Banfield¹,²

¹ Department of Environmental Science, Policy and Management, ² Department of Earth and Planetary Sciences, and ³ Department of Physics, University of California, Berkeley, California 94720, USA
⁴ Joint Genome Institute, Walnut Creek, California 94598, USA

Microbial communities are vital in the functioning of all ecosystems; however, most microorganisms are uncultivated, and their roles in natural systems are unclear. Here, using random shotgun sequencing of DNA from a natural acidophilic biofilm, we report reconstruction of near-complete genomes of Leptospirillum group II and Ferroplasma type II, and partial recovery of three other genomes. This was possible because the biofilm was dominated by a small number of species populations and the frequency of genomic rearrangements and gene insertions or deletions was relatively low. Because each sequence read came from a different individual, we could determine that single-nucleotide polymorphisms are the predominant form of heterogeneity at the strain level. The Leptospirillum group II genome had remarkably few nucleotide polymorphisms, despite the existence of low-abundance variants. The Ferroplasma type II genome seems to be a composite from three ancestral strains that have undergone homologous recombination to form a large population of mosaic genomes. Analysis of the gene complement for each organism revealed the pathways for carbon and nitrogen fixation and energy generation, and provided insights into survival strategies in an extreme environment.

The study of microbial evolution and ecology has been revolutionized by DNA sequencing and analysis¹⁻³. However, isolates have been the main source of sequence data, and only a small fraction of [...]

[...] fluorescence in situ hybridization (FISH) revealed that all biofilms contained mixtures of bacteria (Leptospirillum, Sulfobacillus and, in a few cases, Acidimicrobium) and archaea (Ferroplasma and other [...]

Acid Mine Drainage 2004

[...] is internally self consistent, with 97.2% of end pairs from the same clone assembled with the appropriate orientation and separation, as expected for a low rate of mispairing error (tracking and chimaeric clones).

The first step in assignment of scaffolds to organism types was to [...]

[...] uncultured Ferroplasma species distinct from fer1. We designate this as Ferroplasma type II. The dominance of this organism type was unexpected before the genomic analysis.

We assigned the roughly 3× coverage, high G+C scaffolds to Leptospirillum group III on the basis of rRNA markers (474 scaffolds up to 31 kb, totalling 2.66 Mb). Comparison of these scaffolds with those assigned to Leptospirillum group II indicates significant sequence divergence and only locally conserved gene order, confirming that the scaffolds belong to a relatively distant relative of Leptospirillum group II. A partial 16S rRNA gene sequence from Sulfobacillus thermosulfidooxidans was identified in the unassembled reads, suggesting very low coverage of this organism. If any Sulfobacillus scaffolds >2 kb were assembled, they would be grouped with the Leptospirillum group III scaffolds.

We compared the 3× coverage, low G+C scaffolds (580 scaffolds, 4.12 Mb) to the fer1 genome in order to assign them to organism types (Supplementary Fig. S6). Scaffolds with ≥96% nucleotide identity to fer1 were assigned to an environmental Ferroplasma type I genome (170 scaffolds up to 47 kb in length and comprising 1.48 Mb of sequence). The remaining low-coverage, low G+C scaffolds are tentatively assigned to G-plasma. The largest scaffold in this bin (62 kb) contains the G-plasma 16S rRNA gene. The 410 scaffolds assigned to G-plasma comprise 2.65 Mb of sequence. A partial 16S rRNA gene sequence from A-plasma was identified in the unassembled reads, suggesting low coverage of this organism. Any scaffolds from A-plasma >2 kb would be included in the G-plasma bin.

Although eukaryotes are present in the AMD system, they were in low abundance in the biofilm studied. So far, no scaffolds from eukaryotes have been detected.

As independent evidence that the Leptospirillum group II and Ferroplasma type II genomes are nearly complete, we located a full complement of transfer RNA synthetases in each genome data set. An almost complete set of these genes was also recovered from Leptospirillum group III. The G-plasma bin contains more than a full set of tRNA synthetases, consistent with inclusion of some A-plasma scaffolds. In addition, we established that the Leptospirillum group II, Leptospirillum group III, Ferroplasma type I, Ferroplasma type II and G-plasma bins contained only one set of rRNA genes.

Figure 1 The pink biofilm. a, Photograph of the biofilm in the Richmond mine (hand included for scale). b, FISH image of a. Probes targeting bacteria (EUBmix; fluorescein isothiocyanate (green)) and archaea (ARC915; Cy5 (blue)) were used in combination with a probe targeting the Leptospirillum genus (LF655; Cy3 (red)). Overlap of red and green (yellow) indicates Leptospirillum cells and shows the dominance of Leptospirillum. c, Relative microbial abundances determined using quantitative FISH counts.

NATURE | doi:10.1038/nature02340 | www.nature.com/nature ©2004 Nature Publishing Group

Methods
• Plasmid library
• Shotgun sequence
• Assembled
• Binning
  - GC content
  - Coverage
• Potential “nearly” complete genomes
  - Leptospirillum group II
  - Ferroplasma type II
  - Evidence for completeness: housekeeping genes
• Annotation, population analysis
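The binning step listed above groups assembled scaffolds by GC content and read coverage. A minimal sketch of that idea in Python follows. The scaffold sequences, the depths, and both cutoffs (50% GC, 6× coverage) are hypothetical values chosen for illustration; they are not taken from the lecture or from the paper.

```python
def gc_fraction(seq):
    """Return the fraction of G and C bases in a sequence (case-insensitive)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def assign_bin(seq, coverage, gc_cutoff=0.5, cov_cutoff=6.0):
    """Place a scaffold in one of four bins by GC content and coverage.

    Both cutoffs are hypothetical placeholders, not published values.
    """
    gc_label = "high-GC" if gc_fraction(seq) >= gc_cutoff else "low-GC"
    cov_label = "high-coverage" if coverage >= cov_cutoff else "low-coverage"
    return f"{gc_label}/{cov_label}"


# Made-up scaffolds: (name, sequence, mean read depth)
scaffolds = [
    ("s1", "GCGCGGCCATGC", 10.0),
    ("s2", "ATATTTAAGCAT", 9.5),
    ("s3", "GGCCGCGCTAGC", 3.1),
    ("s4", "ATTTAAATTAGC", 2.8),
]

bins = {name: assign_bin(seq, cov) for name, seq, cov in scaffolds}
for name, label in bins.items():
    print(name, label)
```

In practice the cutoffs would be read off the observed distribution of GC content and depth across scaffolds instead of being fixed in advance.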

[...] Leptospirillum group II genome may reflect strong recent environmental selection for this genome type or be the result of a founder effect.

[...] undergone homologous recombination. It is unlikely that the reads with pattern transitions represent variants that arose simply through accumulation of nucleotide polymorphisms, because this [...]

Figure 2 Segment of the Ferroplasma type II composite genome. a, A 4.2-kb region showing annotated open reading frames (ORFs) (red), average read depth (blue line), and the number of nucleotide polymorphisms in the ‘green’ and ‘yellow’ relative to the ‘pink’ strain (green and yellow lines) averaged over 60-bp windows. Black dots indicate recombination sites. b, Alignment of individual reads (XYG) for a 96-bp region in a. Letters indicate nucleotide polymorphisms in the green and yellow strains relative to the pink strain. Note the recombinant sequence (XYG48207). c, Evolutionary distance tree inferred from the ancestral strain sequences in a.

NATURE | doi:10.1038/nature02340 | www.nature.com/nature ©2004 Nature Publishing Group

Figure 3 Schematic diagram illustrating a diversity of mosaic genome types within the Ferroplasma type II population that are inferred to have arisen by homologous recombination between three closely related ancestral genome types (pink, yellow and green).

©2004 Nature Publishing Group

[...] genes needed to fix carbon by means of the Calvin–Benson–Bassham cycle (using type II ribulose 1,5-bisphosphate carboxylase–oxygenase). All genomes recovered from the AMD system [...]

[...] fixation via the reductive acetyl coenzyme A (acetyl-CoA) pathway by some, or all, organisms. Given the large number of ABC-type sugar and amino acid transporters encoded in the Ferroplasma type [...]

Figure 4 Cell metabolic cartoons constructed from the annotation of 2,180 ORFs identified in the Leptospirillum group II genome (63% with putative assigned function) and 1,931 ORFs in the Ferroplasma type II genome (58% with assigned function). The cell [...] drainage stream (viewed in cross-section). Tight coupling between ferrous iron oxidation, pyrite dissolution and acid generation is indicated. Rubisco, ribulose 1,5-bisphosphate carboxylase–oxygenase. THF, tetrahydrofolate.

RESEARCH ARTICLE

Environmental Genome Shotgun Sequencing of the Sargasso Sea

J. Craig Venter¹*, Karin Remington¹, John F. Heidelberg³, Aaron L. Halpern², Doug Rusch², Jonathan A. Eisen³, Dongying Wu³, Ian Paulsen³, Karen E. Nelson³, William Nelson³, Derrick E. Fouts³, Samuel Levy², Anthony H. Knap⁶, Michael W. Lomas⁶, Ken Nealson⁵, Owen White³, Jeremy Peterson³, Jeff Hoffman¹, Rachel Parsons⁶, Holly Baden-Tillson¹, Cynthia Pfannkoch¹, Yu-Hui Rogers⁴, Hamilton O. Smith¹

We have applied “whole-genome shotgun sequencing” to microbial populations collected en masse on tangential flow and impact filters from seawater samples collected from the Sargasso Sea near Bermuda. A total of 1.045 billion base pairs of nonredundant sequence was generated, annotated, and analyzed to elucidate the gene content, diversity, and relative abundance of the organisms within these environmental samples. These data are estimated to derive from at least 1800 genomic species based on sequence relatedness, including 148 previously unknown bacterial phylotypes. We have identified over 1.2 million previously unknown genes represented in these samples, including more than 782 new [...]

http://www.sciencemag.org/content/304/5667/66

Sargasso Sea

[...] two groups of scaffolds representing two distinct strains closely related to the published [...] at depths ranging from 4× to 36× (indicated with shading in table S3 with nine depicted in [...]

Fig. 1. MODIS-Aqua satellite image of ocean chlorophyll in the Sargasso Sea grid about the BATS site from 22 February 2003. The station locations are overlain with their respective identifications. Note the elevated levels of chlorophyll (green color shades) around station 3, which are not present around stations 11 and 13.

http://www.sciencemag.org/content/304/5667/66

Sampling Protocols. Sampling on the RV Weatherbird II was done as follows: Seawater (170 liters) from stations 11 and 13 was directly filtered through a 0.8 µm Supor membrane disc filter (Pall Life Sciences) followed in series by a 0.22 µm Supor membrane disc filter (Pall Life Sciences). The sample from station 3 was pumped into a 250 L carboy prior to being filtered through the impact filters. The length of time from collection of the sample until the end of the filtration step was approximately one hour. Filters were placed in 5 ml of sucrose lysis buffer (20 mM EDTA, 400 mM NaCl, 0.75 M sucrose, 50 mM Tris-HCl, pH 9.0) and stored in liquid nitrogen on the Weatherbird, then placed at -80 °C until DNA extractions were done.

Alternatively, seawater (340 liters) was collected from 5 meters below the surface into a carboy, then filtered through a 0.8 µm Supor membrane disc filter (Pall Life Sciences), followed by concentration to 1 liter using a Pellicon tangential flow filtration system (Millipore) with a 0.1 µm Durapore VVPP cartridge (Millipore); again the total time for the filtration and concentration was approximately one hour. Cells were pelleted at 10,000 rpm, 4 °C for 30 minutes. The impact filters and the retentate from the TFF were then handled as described above.

The carboys, tubing and filter systems were cleaned with a 10% hydrochloric acid wash prior to each leg of the sampling. Any of the sampling equipment (tubing, etc.) that could reasonably be soaked was soaked in an acid bath for at least 24 hours. Sampling carboys were filled with the acid wash and “soaked” for at least 24 hours as well. All acid-washed items were subsequently rinsed very liberally with Milli-Q water. A liberal Milli-Q water rinse was also conducted between samples on the same leg. All spigots from the carboys were covered with a ziploc bag until needed. Tubing was stored in clean ziploc bags until needed.

Sample preparation. The impact filters were cut into quarters and placed in individual 50 ml conical tubes. TE buffer (5 ml, pH 8) containing 150 µg/ml lysozyme was added to each tube. The tubes were incubated at 37 °C for 2 hours. SDS was added to 0.1% and the samples were then put through three freeze/thaw cycles. The lysate was then treated with Proteinase K (100 µg/ml) for one hour at 55 °C followed by three aqueous phenol extractions and one extraction with phenol/chloroform. The supernatant was then precipitated with two volumes of 100% ethanol and the DNA pellet washed with 70% ethanol.

DNA preparation. DNA was randomly sheared, end-polished with consecutive BAL31 nuclease and T4 DNA polymerase treatments, and size-selected by electrophoresis on 1% low-melting-point agarose. After ligation to BstXI adapters (Invitrogen, catalog no. N408-18), DNA was purified by three rounds of gel electrophoresis to remove excess adapters, and the fragments, now with 3'-CACA overhangs, were inserted into BstXI-linearized plasmid vector with 3'-TGTG overhangs. Fragments were cloned in a medium-copy pBR322 derivative.
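The two overhangs named in the protocol, CACA on the adapted fragments and TGTG on the linearized vector, can be checked for compatibility with a few lines of Python. This is only an illustrative sketch: it assumes both overhangs are written in the same strand direction, and the helper name `reverse_complement` is ours.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


insert_overhang = "CACA"  # overhang on adapter-ligated fragments (from the protocol)
vector_overhang = "TGTG"  # overhang on the linearized vector (from the protocol)

# Two single-stranded overhangs can anneal when one is the reverse
# complement of the other.
insert_fits_vector = reverse_complement(insert_overhang) == vector_overhang

# An overhang that equals its own reverse complement could pair with
# another copy of itself; neither of these does.
insert_self_pairs = reverse_complement(insert_overhang) == insert_overhang
vector_self_pairs = reverse_complement(vector_overhang) == vector_overhang

print(insert_fits_vector, insert_self_pairs, vector_self_pairs)
```

The check shows that fragment ends pair with vector ends but not with each other, which is consistent with the protocol's use of non-palindromic overhangs.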

Sequence assembly. With default parameter settings, the highly covered genome sequences would have been treated as repetitive DNA by the Celera Assembler. Since the Celera Assembler constructs scaffolds only from a backbone of sequence heuristically classified as unique, these organisms would not have been eligible for scaffolding and would have been absent from the final assembly. However, by tuning the threshold parameter for classifying unique sequence, we were able to compensate for the apparent repetitiveness of these genomic regions, and scaffold them appropriately. This was accomplished by identifying the most deeply assembling, obviously nonrepetitive contigs in an initial run of the assembler (in this case, the strong assemblies at 21-36× coverage which were identified as gene-rich Burkholderia-like and plasmid scaffolds), and using a value slightly below the calculated “A-statistic” (an empirical uniqueness measure within the Assembler) of these contigs as the threshold parameter in a subsequent run. This allows the deep contigs to be treated as unique sequence, when they would otherwise be labeled as repetitive.

At the other end of the spectrum, rare organisms in the sample have been sampled by sequencing only to a shallow depth of coverage. Routine assembly would not have considered the small fragment overlap based assemblies with shallow coverage as an eligible basis for scaffolding, due to a minimum length requirement of 1000 bp, which is typically in place for efficiency. Therefore, in the present use case, the organisms represented by these sequences would not have been ordered and oriented with mate-pairs without adjusting the default minimum length to compensate for the low anticipated coverage depth and assembly length. With this selection of parameters, more suitable to the environmental project at hand, we were able to adequately assemble both the dominant and rare species simultaneously.
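The threshold tuning described in the assembly text (take the A-statistic of trusted deep contigs and set the uniqueness threshold slightly below it) can be sketched as follows. Everything numeric here is made up for illustration: the A-statistic values, the default threshold, and the margin. The rule that a contig counts as unique when its A-statistic is at or above the threshold is also an assumption of this sketch, not a statement about the Celera Assembler's code.

```python
def pick_uniqueness_threshold(deep_contig_a_stats, margin=0.5):
    """Return a threshold slightly below the lowest A-statistic
    among contigs trusted to be nonrepetitive.

    `margin` is a hypothetical safety gap, not an assembler default.
    """
    return min(deep_contig_a_stats) - margin


def classify(a_stat, threshold):
    """Label a contig, assuming 'unique' means A-statistic >= threshold."""
    return "unique" if a_stat >= threshold else "repetitive"


# Hypothetical values: deeply covered contigs score low on a
# uniqueness measure because high depth resembles a collapsed repeat.
default_threshold = 10.0
deep_contigs = {"contig_A": -3.2, "contig_B": -1.7}

tuned_threshold = pick_uniqueness_threshold(deep_contigs.values())

before = {name: classify(a, default_threshold) for name, a in deep_contigs.items()}
after = {name: classify(a, tuned_threshold) for name, a in deep_contigs.items()}
print(before)
print(after)
```

With the lowered threshold, the deep contigs move from "repetitive" to "unique" and so become eligible as a scaffolding backbone, which is the effect the text describes.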

Methods
• Plasmid library
• Shotgun sequence
• Assembled
• No Major Binning
• Potential “nearly” complete genomes
• Annotation, population analysis, phylogenetic analysis

[...] the relatively limited depth of sequencing coverage given the level of diversity [...] sample. [...] genome shotgun (WGS) assembly [...] deposited at DDBJ/EMBL/GenBank [...] project accession AACY00000000, [...] have been deposited in a corre[...] trace archive. The version [...] this paper is the first version, [...]. Unlike a conventional WGS [...] deposited not just contigs and [...] unassembled paired singletons [...] singletons in order to accurate[...] diversity in the sample and [...] across the entire sample with[...] database. [...] and large assemblies. Our [...] focused on the well-sampled ge[...] characterizing scaffolds with at least [...] depth. There were 333 scaffolds [...]26 contigs and spanning 30.9 [...] this criterion (table S3), account[...] 410,000 reads, or 25% of the [...] data set. From this set of well-[...] we were able to cluster and [...] assemblies by organism; from the rare [...] sample, we used sequence similar[...] methods together with computational [...] obtain both qualitative and quan[...] of genomic and functional diver[...] particular marine environment. [...] several criteria to sort the [...] pieces into tentative organism [...] include depth of coverage, oligo[...]

Fig. 2. Gene conservation among closely related Prochlorococcus. The outermost concentric circle of the diagram depicts the completed genomic sequence of Prochlorococcus marinus MED4 (11). Fragments from environmental sequencing were compared to this completed Prochlorococcus genome and are shown in the inner concentric circles and were given boxed outlines. Genes for the outermost circle have been assigned pseudospectrum colors based on the position of those genes along the chromosome, where genes nearer to the start of the genome are colored in red, and genes nearer to the end of the genome are colored in blue. Fragments from environmental sequencing were subjected to an analysis that identifies conserved gene order between those fragments and the completed Prochlorococcus MED4 genome. Genes on the environmental genome segments that exhibited conserved gene order are colored with the same color assignments as the Prochlorococcus MED4 chromosome. Colored regions on the environmental segments exhibiting color differences from the adjacent outermost concentric circle are the result of conserved gene order with other MED4 regions and probably represent chromosomal rearrangements. Genes that did not exhibit conserved gene order are colored in black.

http://www.sciencemag.org/content/304/5667/66
www.sciencemag.org SCIENCE VOL 304 2 APRIL 2004

Fig. 3. Comparison of Sargasso Sea scaffolds to Crenarchaeal clone 4B7. Predicted proteins from 4B7 and the scaffolds showing significant homology to 4B7 by tBLASTx are arrayed in positional order along the x and y axes. Colored boxes represent BLASTp matches scoring at least 25% similarity and with an e value of better than 1e-5. Black vertical and horizontal lines delineate scaffold borders.

http://www.sciencemag.org/content/304/5667/66

Fig. 4. Circular diagrams of nine complete megaplasmids. Genes encoded in the forward direction are shown in the outer concentric circle; reverse coding genes are shown in the inner concentric circle. The genes have been given role category assignment and colored accordingly: amino acid biosynthesis, violet; biosynthesis of cofactors, prosthetic groups, and carriers, light blue; cell envelope, light green; cellular processes, red; central intermediary metabolism, brown; DNA metabolism, gold; energy metabolism, light gray; fatty acid and phospholipid metabolism, magenta; protein fate and protein synthesis, pink; purines, pyrimidines, nucleosides, and nucleotides, orange; regulatory functions and signal transduction, olive; transcription, dark green; transport and binding proteins, blue-green; genes with no known homology to other proteins and genes with homology to genes with no known function, white; genes of unknown function, gray. Tick marks are placed on 10-kb intervals.

[...] homogenous blend of discrepancies from consensus without any apparent separation into haplotypes, such as the Prochlorococcus scaffold region (Fig. 5). Indeed, the Prochlorococcus scaffolds display considerable heterogeneity not only at the nucleotide sequence level (Fig. 5) but also at the genomic level, where multiple scaffolds align with the same region of the MED4 (11) genome but differ due to gene or genomic island insertion, deletion, rearrangement events. This observation is consistent with previous findings (12). For instance, scaffolds 2221918 and 2223700 share gene synteny with each other and MED4 but differ by the insertion of 15 genes of probable phage origin, likely representing an integrated bacteriophage. These genomic differences are displayed graphically in Fig. 2, where it is evident that up to f[...] conflicting scaffolds can align with the same region of the MED4 genome. More than 8[...] of the Prochlorococcus MED4 genome can be aligned with Sargasso Sea scaffolds greater than 10 kb; however, there appear to be a couple of regions of MED4 that are not represented in the 10-kb scaffolds (Fig. 2). The larger of these two regions (PMM1187 to PMM1277) consists primarily of a gene cluster coding for surface polysaccharide biosynthesis, which may represent a MED4-specific polysaccharide absent or highly diverged in our Sargasso Sea Prochlorococcus bacteria. The heterogeneity of the Prochlorococcus scaffolds suggests that the scaffolds are not derived from a single discrete strain, but instead probably represent a conglomerate assembled from a population of closely related Prochlorococcus biotypes.

The gene complement of the Sargasso [...] The heterogeneity of the Sargasso sequences complicates the identification of microbial genes. The typical approach for microbial annotation, model-based gene finding, relies entirely on training with a subset of manually [...]

http://www.sciencemag.org/content/304/5667/66

[...] frames (5). A total of 69,901 novel genes belonging to 15,601 single link clusters were identified. The predicted genes were categorized [...]

Table 1. Gene count breakdown by TIGR role category. Gene set includes those found on assemblies from samples 1 to 4 and fragment reads from samples 5 to 7. A more detailed table, separating Weatherbird II samples from the Sorcerer II samples is presented in the SOM (table S4). Note that there are 28,023 genes which were classified in more than one role category.

TIGR role category                                            Total genes
Amino acid biosynthesis                                            37,118
Biosynthesis of cofactors, prosthetic groups, and carriers         25,905
Cell envelope                                                      27,883
Cellular processes                                                 17,260
Central intermediary metabolism                                    13,639
DNA metabolism                                                     25,346
Energy metabolism                                                  69,718
Fatty acid and phospholipid metabolism                             18,558
Mobile and extrachromosomal element functions                       1,061
Protein fate                                                       28,768
Protein synthesis                                                  48,012
Purines, pyrimidines, nucleosides, and nucleotides                 19,912
Regulatory functions                                                8,392
Signal transduction                                                 4,817
Transcription                                                      12,756
Transport and binding proteins                                     49,185
Unknown function                                                   38,067
Miscellaneous                                                       1,864
Conserved hypothetical                                            794,061
Total number of roles assigned                                  1,242,230
Total number of genes                                           1,214,207

Fig. 5. Prochlorococcus-related scaffold 2223290 illustra[...]nity of closely related organisms, distinctly nonpunctat[...] global structure of Scaffold 2223290 with respect to asse[...] sequence alignment. Blue segments, contigs; green segm[...] stages of the assembly of fragments into the resulting [...] fragments were initially assembled in several different [...] form the final contig structure. The multiple sequenc[...] homogenous blend of haplotypes, none with sufficie[...] separate assembly.

http://www.sciencemag.org/content/304/5667/66
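The totals quoted with Table 1 can be cross-checked with simple arithmetic. The short Python sketch below uses only numbers stated in the table and the accompanying text; the variable names are ours.

```python
# Figures as stated in Table 1 and its caption
total_roles_assigned = 1_242_230
total_genes = 1_214_207
genes_in_multiple_roles = 28_023

# Roles assigned beyond one per gene
extra_roles = total_roles_assigned - total_genes

# Figures as stated in the text before the table
novel_genes = 69_901
single_link_clusters = 15_601
mean_genes_per_cluster = novel_genes / single_link_clusters

print(extra_roles)
print(round(mean_genes_per_cluster, 2))
```

The surplus of roles over genes equals the stated number of genes classified in more than one role category, which is what one would expect if each of those genes contributes exactly one additional role. The novel genes average roughly 4.5 per single-link cluster.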

[...]d curated genes. With the vast ma[...] Sargasso sequence in short (less [...] unassociated scaffolds and single[...] hundreds of different organisms, it is [...] to apply this approach. Instead, we [...] an evidence-based gene finder (5). [...] evidence in the form of protein align[...] sequences in the bacterial portion of [...] nonredundant amino acid (nraa) data set [...] used to determine the most likely [...]. Likewise, approximate start and [...] were determined from the bound[...] of the alignments and refined to [...] specific start and stop codons. This [...] identified 1,214,207 genes covering [...]B of the total data set. This repre[...] approximately an order of magnitude [...] sequences than currently archived in the [...] SwissProt database (14), which con[...]

http://www.sciencemag.org/content/304/5667/66


rRNA phylotyping from metagenomics
http://www.sciencemag.org/content/304/5667/66

Shotgun Sequencing Allows Alternative Anchors (e.g., RecA) http://www.sciencemag.org/content/304/5667/66

[…]nomic group using the phylogenetic analysis described for rRNA. For example, our data set […] marker genes, is roughly comparable to the 97% cutoff traditionally used for rRNA. Thus […]

Fig. 6. Phylogenetic diversity of Sargasso Sea sequences using multiple phylogenetic markers. The relative contribution of organisms from different major phylogenetic groups (phylotypes) was measured using multiple phylogenetic markers that have been used previously in phylogenetic studies of prokaryotes: 16S rRNA, RecA, EF-Tu, EF-G, HSP70, and RNA polymerase B (RpoB). The relative proportion of different phylotypes for each sequence (weighted by the depth of coverage of the contigs from which those sequences came) is shown. The phylotype distribution was determined as follows:
(i) Sequences in the Sargasso data set corresponding to each of these genes were identified using HMM and BLAST searches.
(ii) Phylogenetic analysis was performed for each phylogenetic marker identified in the Sargasso data separately compared with all members of that gene family in all complete genome sequences (only complete genomes were used to control for the differential sampling of these markers in GenBank).
(iii) The phylogenetic affinity of each sequence was assigned based on the classification of the nearest neighbor in the phylogenetic tree.

http://www.sciencemag.org/content/304/5667/66
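The coverage weighting mentioned in the Fig. 6 caption can be sketched in a few lines. This is only an illustration, not the pipeline used in the paper: the function name `phylotype_proportions`, the input layout, and the example values are all assumptions made up here, and step (iii) (nearest-neighbor assignment in a tree) is taken as already done.

```python
from collections import defaultdict

def phylotype_proportions(assignments):
    """Return the coverage-weighted share of each phylotype.

    `assignments` is an iterable of (phylotype, depth_of_coverage) pairs,
    one pair per marker-gene sequence (hypothetical input layout).
    """
    totals = defaultdict(float)
    for phylotype, depth in assignments:
        if depth < 0:
            raise ValueError("depth of coverage cannot be negative")
        totals[phylotype] += depth  # weight each sequence by its contig depth
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {name: weight / grand_total for name, weight in totals.items()}

# Hypothetical RecA hits, already assigned to phylotypes by nearest neighbor.
reca_hits = [
    ("Alphaproteobacteria", 6.0),
    ("Cyanobacteria", 3.0),
    ("Alphaproteobacteria", 1.0),
]
proportions = phylotype_proportions(reca_hits)
for name, share in sorted(proportions.items()):
    print(f"{name}: {share:.2f}")
```

Weighting by depth means a sequence from a deeply covered contig counts for more than one from a singleton read, which is the point of the parenthetical in the caption.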

[…] method based on fitting the observed depth of coverage to a theoretical model of assembly progress for a sample corresponding to a mix[…] […] that a minimum of 12-fold deeper sampling would be required to obtain 95% of the unique sequence. However, these are only lower […]

Table 2. Diversity of ubiquitous single copy protein coding phylogenetic markers. Protein column uses symbols that identify six proteins encoded by exactly one gene in virtually all known bacteria. Sequence ID specifies the GenBank identifier for corresponding E. coli sequence. Ortholog cutoff identifies BLASTx e-value chosen to identify orthologs when querying the E. coli sequence against the complete Sargasso Sea data set. Maximum fragment depth shows the number of reads satisfying the ortholog cutoff at the point along the query for which this value is maximal. Observed “species” shows the number of distinct clusters of reads from the maximum fragment depth column, after grouping reads whose containing assemblies had an overlap of at least 40 bp with > 94% nucleotide identity (single-link clustering). Singleton “species” shows the number of distinct clusters from the observed “species” column that consist of a single read. Most abundant column shows the fraction of the maximum fragment depth that consists of single largest cluster.

Protein  Sequence ID    Ortholog cutoff  Max. fragment depth  Observed “species”  Singleton “species”  Most abundant (%)
AtpD     NTL01EC03653   1e-32            836                  456                 317                  6
GyrB     NTL01EC03620   1e-11            924                  569                 429                  4
Hsp70    NT01EC0015     1e-31            812                  515                 394                  4
RecA     NTL01EC02639   1e-21            592                  341                 244                  8
RpoB     NTL01EC03885   1e-41            669                  428                 331                  7
TufA     NTL01EC03262   1e-41            597                  397                 307                  3

Table 3 (caption cropped on the slide). Diversity models based on depth of coverage. Each row corresponds to an abundance class of organisms. The first column in each model “fr(asm)” gives the fraction of the assembly consensus modeled […] column) in the sample. The thi[…] a genome expected to be s[…] gives the resulting estimat[…]

http://www.sciencemag.org/content/304/5667/66
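The Table 2 columns follow from a single-link clustering step, which can be sketched with a union-find structure. This is a minimal illustration under stated assumptions, not the authors' code: the caption groups reads by overlaps between their containing assemblies, whereas this sketch simplifies to direct read-to-read overlaps, and the function names, input tuples, and toy data are hypothetical.

```python
from collections import Counter

def single_link_clusters(read_ids, overlaps, min_overlap=40, min_identity=0.94):
    """Cluster reads by single linkage and return cluster sizes, largest first.

    `overlaps` holds (read_a, read_b, overlap_bp, identity) tuples.
    Two reads are linked when overlap_bp >= min_overlap and
    identity > min_identity (thresholds taken from the Table 2 caption).
    """
    parent = {read: read for read in read_ids}

    def find(read):
        # Follow parent pointers to the root, halving the path as we go.
        while parent[read] != read:
            parent[read] = parent[parent[read]]
            read = parent[read]
        return read

    for read_a, read_b, overlap_bp, identity in overlaps:
        if overlap_bp >= min_overlap and identity > min_identity:
            root_a, root_b = find(read_a), find(read_b)
            if root_a != root_b:
                parent[root_a] = root_b  # merge the two clusters

    sizes = Counter(find(read) for read in read_ids)
    return sorted(sizes.values(), reverse=True)

def summarize(cluster_sizes):
    """Compute the Table 2 style summary columns from cluster sizes."""
    depth = sum(cluster_sizes)
    return {
        "max_fragment_depth": depth,
        "observed_species": len(cluster_sizes),
        "singleton_species": sum(1 for size in cluster_sizes if size == 1),
        "most_abundant_pct": 100.0 * max(cluster_sizes) / depth,
    }

# Toy data (hypothetical): six reads and four candidate overlaps.
reads = ["r1", "r2", "r3", "r4", "r5", "r6"]
hits = [
    ("r1", "r2", 120, 0.99),  # linked
    ("r2", "r3", 55, 0.95),   # linked, so r1-r2-r3 chain into one cluster
    ("r4", "r5", 30, 0.99),   # overlap too short
    ("r5", "r6", 200, 0.90),  # identity too low
]
sizes = single_link_clusters(reads, hits)
summary = summarize(sizes)
print(sizes, summary)
```

Single linkage chains reads together (r1 and r3 never overlap directly here), so it tends to merge closely related strains, which is one reason the caption puts “species” in quotation marks.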

Figure S6. Accumulation curve for rpoB. Observed (black) OTU counts for rpoB (based on the fragment grouping summarized in Table 2), as well as the Chao1-corrected estimate of total species (red; see (3)). Points are mean values of 1000 shufflings of the observed data, while bars show 90% confidence intervals. http://www.sciencemag.org/content/304/5667/66
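The two quantities plotted in Figure S6 can be sketched as follows. The caption does not say which form of Chao1 was used, so the bias-corrected form S_obs + F1(F1 - 1) / (2(F2 + 1)) below is an assumption (the classic form is S_obs + F1^2 / (2 F2)). The function names and toy labels are hypothetical, and the confidence intervals from the figure are not computed.

```python
import random
from collections import Counter

def chao1(cluster_sizes):
    """Bias-corrected Chao1 richness estimate from OTU (cluster) sizes."""
    s_obs = len(cluster_sizes)
    f1 = sum(1 for size in cluster_sizes if size == 1)  # singletons
    f2 = sum(1 for size in cluster_sizes if size == 2)  # doubletons
    return s_obs + f1 * (f1 - 1) / (2 * (f2 + 1))

def accumulation_curve(otu_labels, n_shuffles=1000, seed=0):
    """Mean number of distinct OTUs seen after 1..N reads over random orderings."""
    rng = random.Random(seed)
    labels = list(otu_labels)
    sums = [0] * len(labels)
    for _ in range(n_shuffles):
        rng.shuffle(labels)
        seen = set()
        for index, label in enumerate(labels):
            seen.add(label)
            sums[index] += len(seen)
    return [total / n_shuffles for total in sums]

# Toy data (hypothetical): one OTU label per read.
labels = ["a", "a", "a", "b", "b", "c", "d"]
sizes = list(Counter(labels).values())
estimate = chao1(sizes)
curve = accumulation_curve(labels, n_shuffles=200)
print(estimate, curve)
```

With many singletons relative to doubletons, as in Table 2, the Chao1 estimate sits well above the observed count, which matches the gap between the red and black series described in the caption.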

MS 1093857: Environmental Genome Shotgun Sequencing of the Sargasso Sea Venter et al., revised Figure S7. Each point in the figure corresponds to a scaffold from the assembly (restricted to scaffolds > 10kb). Scaffolds were placed in separate panels of the figure according to the most closely related organism as indicated by the BLAST searches described in the text. Within a panel, a scaffold is shown with x coordinate equal to its length, y coordinate equal to its estimated depth of coverage, and color determined by which of 6 k-mer composition clusters it was assigned to. Depth of coverage was estimated as the total base pairs in reads belonging to a given assembly piece divided by the length of the consensus sequence for the piece. K-mer composition clusters were determined by representing each scaffold as a vector of the frequencies of all possible 4mers, considering both the forward and reverse strands of the sequence, and then applying the K-means clustering algorithm. http://www.sciencemag.org/content/304/5667/66
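The two per-scaffold quantities in the Figure S7 caption, the 4-mer frequency vector and the depth of coverage, can be sketched as below. This is an illustration under assumptions, not the original implementation: the function names are made up here, windows containing non-ACGT characters are skipped (the caption does not say how ambiguous bases were handled), and the K-means step itself is left out; the resulting 256-element vectors could be passed to any K-means implementation.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")
KMERS = ["".join(letters) for letters in product("ACGT", repeat=4)]  # 256 4-mers

def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def tetramer_frequencies(seq):
    """Frequencies of all 256 4-mers, counted on both strands of `seq`."""
    seq = seq.upper()
    counts = dict.fromkeys(KMERS, 0)
    for strand in (seq, reverse_complement(seq)):
        for start in range(len(strand) - 3):
            kmer = strand[start:start + 4]
            if kmer in counts:  # skip windows with N or other symbols
                counts[kmer] += 1
    total = sum(counts.values())
    return [counts[kmer] / total if total else 0.0 for kmer in KMERS]

def depth_of_coverage(read_lengths, consensus_length):
    """Total base pairs in reads divided by the consensus length."""
    return sum(read_lengths) / consensus_length

# Toy scaffold (hypothetical sequence and read lengths).
vector = tetramer_frequencies("ACGTACGT")
depth = depth_of_coverage([800, 750, 900], 1000)
print(len(vector), vector[KMERS.index("ACGT")], depth)
```

Counting both strands makes the vector independent of which strand happened to be assembled, so a scaffold and its reverse complement land on the same point before clustering.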

Functional Diversity of Proteorhodopsins? http://www.sciencemag.org/content/304/5667/66

Figure S10. Scaffold 2217664, containing the gene encoding Proteorhodopsin. Genes are colored using color assignments described in Fig. 2, and contig boundaries are indicated with red vertical lines. In this scaffold, rhodopsin is associated with a DNA-directed RNA polymerase, sigma subunit (rpoD) originating in the CFB group. http://www.sciencemag.org/content/304/5667/66

Binning challenge A B C D E F G T U V W X Y Z

Binning challenge A B C D E F G T U V W X Y Z Best binning method: reference genomes

Glassy Winged Sharpshooter
• Feeds on xylem sap
• Vector for Pierce’s Disease
• Potential bioterror agent
• Collaboration with Nancy Moran to sequence symbiont genomes
• Funded by NSF
• Published in PLOS Biology 2006

Wu et al. 2006 PLoS Biology 4: e188.

Sharpshooter Shotgun Sequencing. Collaboration with Nancy Moran’s lab. Wu et al. 2006 PLoS Biology 4: e188.


Binning challenge A B C D E F G T U V W X Y Z No reference genome? What do you do? Phylogeny ....

CFB Phyla

Sulcia makes vitamins and cofactors. Baumannia makes amino acids. Wu et al. 2006 PLoS Biology 4: e188.


Sorcerer II GOS Expedition

Figure 1. Sampling Sites. Microbial populations were sampled from locations in the order shown. Samples were collected at approximately 200 miles (320 km) intervals along the eastern North American coast through the Gulf of Mexico into the equatorial Pacific. Samples 00 and 01 identify sets of sites sampled as part of the Sargasso Sea pilot study [19]. Samples 27 through 36 were sampled off the Galapagos Islands (see inset). Sites shown in gray were not analyzed as part of this study. doi:10.1371/journal.pbio.0050077.g001

[…] environments as well as a few nonmarine aquatic samples for contrast (Table 1). […]

[…] the pilot Sargasso Sea study, 200 l surface seawater was filtered to isolate microorganisms for metagenomic analysis. […]

Stalking the Fourth Domain in Metagenomic Data: Searching for, Discovering, and Interpreting Novel, Deep Branches in Marker Gene Phylogenetic Trees

Dongying Wu1, Martin Wu1,4, Aaron Halpern2,3, Douglas B. Rusch2,3, Shibu Yooseph2,3, Marvin Frazier2,3, J. Craig Venter2,3, Jonathan A. Eisen1*

1 Department of Evolution and Ecology, Department of Medical Microbiology and Immunology, University of California Davis Genome Center, University of California Davis, Davis, California, United States of America, 2 The J. Craig Venter Institute, Rockville, Maryland, United States of America, 3 The J. Craig Venter Institute, La Jolla, California, United States of America, 4 University of Virginia, Charlottesville, Virginia, United States of America

Abstract

Background: Most of our knowledge about the ancient evolutionary history of organisms has been derived from data associated with specific known organisms (i.e., organisms that we can study directly such as plants, metazoans, and culturable microbes). Recently, however, a new source of data for such studies has arrived: DNA sequence data generated directly from environmental samples. Such metagenomic data has enormous potential in a variety of areas including, as we argue here, in studies of very early events in the evolution of gene families and of species.

Methodology/Principal Findings: We designed and implemented new methods for analyzing metagenomic data and used them to search the Global Ocean Sampling (GOS) Expedition data set for novel lineages in three gene families commonly used in phylogenetic studies of known and unknown organisms: small subunit rRNA and the recA and rpoB superfamilies. Though the methods available could not accurately identify very deeply branched ss-rRNAs (largely due to difficulties in making robust sequence alignments for novel rRNA fragments), our analysis revealed the existence of multiple novel branches in the recA and rpoB gene families. Analysis of available sequence data likely from the same genomes as these novel recA and rpoB homologs was then used to further characterize the possible organismal source of the novel sequences.

Conclusions/Significance: Of the novel recA and rpoB homologs identified in the metagenomic data, some likely come from uncharacterized viruses while others may represent ancient paralogs not yet seen in any cultured organism. A third possibility is that some come from novel cellular lineages that are only distantly related to any organisms for which sequence data is currently available. If there exist any major, but so-far-undiscovered, deeply branching lineages in the tree of life, we suggest that methods such as those described herein currently offer the best way to search for them.

Citation: Wu D, Wu M, Halpern A, Rusch DB, Yooseph S, et al. (2011) Stalking the Fourth Domain in Metagenomic Data: Searching for, Discovering, and Interpreting Novel, Deep Branches in Marker Gene Phylogenetic Trees. PLoS ONE 6(3): e18011. doi:10.1371/journal.pone.0018011

Editor: Robert Fleischer, Smithsonian Institution National Zoological Park, United States of America

Received October 25, 2010; Accepted February 20, 2011; Published March 18, 2011

This is an open-access article distributed under the terms of the Creative Commons Public Domain declaration which stipulates that, once placed in the public domain, this work may be freely reproduced, distributed, transmitted, modified, built upon, or otherwise used by anyone for any lawful purpose.

Funding: The development and main work on this project was supported by the National Science Foundation via an “Assembling the Tree of Life” grant (number 0228651) to Jonathan A. Eisen and Naomi Ward. The final work on this project was funded by the Gordon and Betty Moore Foundation (through […]

Stalking the Fourth Domain

Figure 1. Phylogenetic tree of the RecA superfamily. All RecA sequences were grouped into clusters using the Lek algorithm. Representatives of each cluster that contained >2 members were then selected and aligned using MUSCLE. A phylogenetic tree was built from this alignment using PHYML; bootstrap values are based on 100 replicates. The Lek cluster ID precedes each sequence accession ID. Proposed subfamilies in the RecA superfamily are shaded and given a name on the right. Five of the proposed subfamilies contained only GOS sequences at the time of our initial analysis (RecA-like SAR, Phage SAR1, Phage SAR2, Unknown 1 and Unknown 2) and are highlighted by colored shading. As noted on the tree and in the text, sequences from two Archaea that were released after our initial analysis group in the Unknown 2 subfamily. doi:10.1371/journal.pone.0018011.g001

PLoS ONE | www.plosone.org | March 2011 | Volume 6 | Issue 3 | e18011

Table 2 (caption). Five RecA subfamilies were identified as being novel (i.e., only seen in metagenomic data) in our initial analyses. GOS metagenome assemblies that encode members of these subfamilies were identified and the genes neighboring the novel RecAs were characterized. The neighboring gene descriptions are based on the top BLASTP hits against the NRAA database; taxonomy assignments are based on their closest neighbor in phylogenetic trees built from the top NRAA BLASTP hits. doi:10.1371/journal.pone.0018011.t002

Figure 2. The largest assembly from the GOS data that encodes a novel RecA subfamily member (a representative of subfamily Unknown 2). This GOS assembly (ID 1096627390330) encodes 33 annotated genes plus 16 hypothetical proteins, including several with similarity to known archaeal genes (e.g., DNA primase, translation initiation factor 2, Table 2). The arrow indicates a novel recA homolog from the Unknown 2 subfamily (cluster ID 9). doi:10.1371/journal.pone.0018011.g002

Stalking the Fourth Domain

Figure 3. Phylogenetic tree of the RpoB superfamily. All RpoB sequences were grouped into clusters using the Lek algorithm. Representatives of each cluster that contained >2 members were then selected and aligned using MUSCLE. A phylogenetic tree was built from this alignment using PHYML; bootstrap values are based on 100 replicates. The Lek cluster ID precedes each sequence accession ID. Proposed subfamilies in the RpoB superfamily are shaded and given a name on the right. The two novel RpoB clades that contain only GOS sequences are highlighted by the colored panels. doi:10.1371/journal.pone.0018011.g003

Methods

Identification of deeply-branching ss-rRNA sequences

[…] these 340 sequences were extracted from the European Ribosomal RNA database [66] and then manually curated to remove columns with more than 90% gaps or with poor alignment […]
