Published on March 12, 2009
Automated Prokaryotic Annotation at the JCVI. Daniel Haft, 2008
A Dual-Use Pipeline. Multiple types of stored evidence: persistent and flexibly interleaved; supports selective re-annotation. Features annotation-driving databases: CHAR, TIGRFAMs, Genome Properties, BrainGrab Rules. Evidence used by machine and by experts: MANATEE interface for annotators; capture new rules with BrainGrab.
Computable objects: Output from one program becomes input to another. HMM results drive Genome Properties Genome Properties guide GO process assignments GO process terms
Identification of Genome Features. Genome Sequence --> ORFs --> IMM built: Glimmer builds a statistical model from the training set. Other Genome Features: • rRNA, tRNA, Rfam • IS elements • Phage regions • Repeats
Gene Finding: Glimmer & friends, homology methods. Homology Searches (gathering evidence): BLAST-Extend-Repraze, Hidden Markov Models, misc. Structural Curation (ORF Management): Auto_Gene_Curate (start sites, overlaps), InterEvidence. Functional Assignments: Auto_Annotate, Manual, Mapped. Data Availability.
Homology Searches • HMM searches: TIGRFAMs & Pfam • BLAST searches: against internal NIAA • PROSITE motifs • InterPro • TMHMM • SignalP • Lipoprotein • PSORT • Generate Paralogous Families • Custom database searches (TransportDB, Rules)
Gene Model Curation • Overlaps resolved by evidence competition • Start site curation • Missed genes / unsupported gene calls
Evidence can Overhang the Gene: Blast-Extend-Repraze (BER). The extensions help in the detection of frameshifts (FS) and point mutations resulting in in-frame stop codons (PM). This is indicated when similarity extends outside the coordinates of the protein coding sequence. Blue line indicates the predicted protein coding sequence, green line indicates up- and downstream extensions. Red line is the match protein. [Figure: ORFxxxxx from end5 to end3 with a 300 bp extension on each side; the search protein is aligned to the match protein. Legend: normal full-length match; ! = similarity extends upstream through a start, or downstream past a frameshift; * = similarity extends in the same frame through a stop codon.]
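The extension idea on this slide can be sketched in a few lines. This is a minimal illustration, not the BER implementation: it assumes 1-based inclusive coordinates, and the function names `extend_orf` and `overhang` are hypothetical. Only the 300 bp flank comes from the slide.

```python
def extend_orf(end5, end3, contig_length, flank=300):
    """Return (low, high) bounds of the ORF padded by `flank` bp per side.

    Coordinates are 1-based and inclusive. Only the lower and upper
    bounds are used, so the ORF may lie on either strand. The result
    is clamped to the contig (an assumption of this sketch).
    """
    low, high = min(end5, end3), max(end5, end3)
    return max(1, low - flank), min(contig_length, high + flank)


def overhang(match_start, match_stop, end5, end3):
    """Return how many bases of a match lie outside the ORF.

    The result is (below, above) in contig coordinates. A non-zero
    value means similarity extends past the called gene, which is the
    signal the slide associates with frameshifts, in-frame stops, or
    a start site that needs curation.
    """
    low, high = min(end5, end3), max(end5, end3)
    return max(0, low - match_start), max(0, match_stop - high)
```

For example, `overhang(850, 1900, 1000, 1900)` reports 150 bases of similarity below the called ORF and none above it.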
Pfam vs. TIGRFAMs
Pfam: Names for homology domains in proteins. Granularity tuned for twilight-level sequence similarity detection. Explains things to annotator. Pfam: Explanations.
TIGRFAMs: Functional assignments to proteins. Granularity tuned for single-hit equivalogs (mono-functional!). Generates computable objects --> pathway reconstructions. TIGRFAMs: RULES.
TIGRFAMs equivalogs vs. Pfam domains [Figure: proteins X, Y, and Z grouped by brackets labeled TIGRxxxxx and PFxxxxx.]
TIGRFAMs as annotation rules EC number computable ! GO term computable ! protein name computable ? HMM hit computable !!
Isology (homology) types: ranking our rules EXCEPTION additional info, e.g. “vegetative” EQUIVALOG the SAME (in enough ways) to receive the same name across multiple genomes, reflecting one specific function. SUBFAMILY can name a whole class DOMAIN class name for a protein region (and apply these classifications also to Pfam)
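One way to read this ranking is as a priority order for choosing among several model hits on the same protein. The sketch below assumes the order on the slide (EXCEPTION, EQUIVALOG, SUBFAMILY, DOMAIN) is most specific first and breaks ties by score; the dictionary keys and model names are hypothetical.

```python
# Lower number = more specific = preferred (assumed from the slide order).
ISOLOGY_RANK = {"exception": 0, "equivalog": 1, "subfamily": 2, "domain": 3}


def best_hit(hits):
    """Pick the hit whose model has the most specific isology type.

    `hits` is a list of dicts with keys 'model', 'isology', 'score'.
    Hits with an unknown isology type are ignored. Ties on isology
    type are broken by the higher score. Returns None if nothing ranks.
    """
    ranked = [h for h in hits if h["isology"] in ISOLOGY_RANK]
    if not ranked:
        return None
    return min(ranked, key=lambda h: (ISOLOGY_RANK[h["isology"]], -h["score"]))


hits = [
    {"model": "model_domain", "isology": "domain", "score": 500.0},
    {"model": "model_equivalog", "isology": "equivalog", "score": 200.0},
]
chosen = best_hit(hits)
```

Here the equivalog hit is chosen even though the domain hit scores higher, because the more specific classification supports a more specific name.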
CHAR: Experimentally Characterized Protein Database • Highly curated database of experimentally characterized proteins; connects protein accessions, known function, and the scientific literature. • What does it include: – Controlled vocabulary describes the type of experimentation performed in each publication – Key annotation fields (protein name, gene symbol, Enzyme Commission (EC) number, taxonomic data, Gene Ontology (GO) terms) are extracted – Synonymous protein accessions obtained from public databases (GenBank and UniProt) are stored
Annotation Proceeds from … Inside --> out (e.g. AutoAnnotate): for every protein, collect evidence and make a best-guess annotation. Outside --> in (e.g. TIGRFAMs): for every model, search tool + cutoff + standards = annotation rule; achieves partial coverage. Hybrid (BrainGrab): for every unfinished protein, look for means to annotate (blastp, synteny, hole-filling, etc.); capture annotator logic as a new rule; add to library of rules/models for all future genomes.
[Figure: BrainGrab rule cycle. Recoverable labels: RULES, IMPORT, validate, share, NEW, Subject Genome, BrainGrab, Proper Realm of Annotator Attention, Trusted genome, Complete genome, Automatic.]
A Teachable Moment: EcHS_A1984 is manually annotated confidently because it is similar enough to: SP|P07363|CHEA_ECOLI Chemotaxis protein cheA EC 184.108.40.206 (method: defines “similar enough”) BLASTP_MATCH [SP|P07363, 1600, 95, 92, 60, 1] Must be the only protein in genome that scores >= 1600 by blastp, covering >= 95% of the length of the characterized protein and >= 92% of the target protein, with >= 60% sequence identity.
A sample of expert opinion: “For This Particular Protein Family” I (D.H.H.) assert that any > 75%-identical, full-length match is the same protein. Ditto any > 65% match, as long as the region is clearly syntenic. Ditto any single-copy > 50% match, as long as it fills this hole in this otherwise mostly complete pathway.
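The BLASTP_MATCH rule on this slide can be read as a small record plus a filter. The sketch below is an illustration under assumptions, not BrainGrab code: the class and field names are hypothetical, the last field is interpreted as the maximum number of qualifying proteins per genome, and uniqueness is applied to proteins that pass all four thresholds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BlastpMatchRule:
    accession: str         # characterized protein, e.g. "SP|P07363"
    min_score: float       # minimum blastp score
    min_query_cov: float   # min % of the characterized protein covered
    min_target_cov: float  # min % of the target protein covered
    min_identity: float    # min % sequence identity
    max_hits: int          # max qualifying proteins allowed in the genome


@dataclass(frozen=True)
class Hit:
    protein_id: str
    score: float
    query_cov: float
    target_cov: float
    identity: float


def apply_rule(rule, hits):
    """Return ids of proteins the rule annotates, or [] if it does not fire.

    The rule fires only when at least one and at most `max_hits`
    proteins in the genome pass every threshold.
    """
    passing = [
        h.protein_id
        for h in hits
        if h.score >= rule.min_score
        and h.query_cov >= rule.min_query_cov
        and h.target_cov >= rule.min_target_cov
        and h.identity >= rule.min_identity
    ]
    return passing if 0 < len(passing) <= rule.max_hits else []


rule = BlastpMatchRule("SP|P07363", 1600, 95, 92, 60, 1)
genome_hits = [
    Hit("prot_1", 1700.0, 99.0, 98.0, 85.0),  # hypothetical strong match
    Hit("prot_2", 300.0, 40.0, 35.0, 30.0),   # hypothetical weak paralog
]
annotated = apply_rule(rule, genome_hits)
```

Storing the thresholds as data rather than code is what lets a rule captured on one genome be replayed on every later genome.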
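The three tiers of this expert opinion translate directly into a cascade of checks. The following is a minimal sketch with a hypothetical function name and flags; it assumes the full-length requirement applies only to the first tier, where the slide states it, and it uses strict inequalities as written.

```python
def same_protein(identity, full_length=False, syntenic=False,
                 single_copy=False, fills_pathway_hole=False):
    """Return the name of the tier that justifies the call, or None.

    `identity` is percent identity. The other flags are judgments the
    pipeline or an annotator would have to supply; how they are
    computed is outside this sketch.
    """
    if identity > 75 and full_length:
        return "identity"
    if identity > 65 and syntenic:
        return "synteny"
    if identity > 50 and single_copy and fills_pathway_hole:
        return "hole-filling"
    return None
```

Returning the tier rather than a bare boolean keeps a record of which line of reasoning supported the annotation.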
B “Bag of Genes” G Genome Properties E Evidence to drive other programs Image from Gödel, Escher, Bach: an Eternal Golden Braid by Douglas Hofstadter, 1979
Genome Properties: annotation at the level of systems. States: NO, some evidence, not supported, YES. Property types: pathway (glyoxylate shunt), system (type III secretion), structure (outer membrane), genometrics (GC content), phenotype (motility, pathogenesis)
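A property state can be pictured as a function of which required steps have evidence in the genome. The sketch below is only an illustration: the state names come from the slide, but the function name, the 50% threshold, and the mapping of partial evidence onto "some evidence" versus "not supported" are assumptions, not the Genome Properties algorithm.

```python
def property_state(required_steps, steps_with_evidence, min_fraction=0.5):
    """Assign one of the slide's four states to a genome property.

    required_steps: names of the steps the property requires.
    steps_with_evidence: names of steps that have evidence (for
        example an HMM hit above cutoff) in this genome.
    min_fraction: assumed threshold separating "some evidence" from
        "not supported" when only part of the property is found.
    """
    required = set(required_steps)
    if not required:
        raise ValueError("a property needs at least one required step")
    found = required & set(steps_with_evidence)
    if len(found) == len(required):
        return "YES"
    if not found:
        return "NO"
    if len(found) / len(required) >= min_fraction:
        return "some evidence"
    return "not supported"


steps = ["step_a", "step_b", "step_c"]  # hypothetical three-step pathway
state = property_state(steps, ["step_a", "step_c"])
```

Because the state is itself a computed object, it can serve as evidence for downstream programs in the same way HMM hits serve as evidence for the property.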
Some Novel Genome Properties 12 subtypes of CRISPR/Cas system PEP-CTERM / exo-sortase: Biofilm-associated protein sorting Type VI secretion (53 loci in B. mallei 23344) Post-translational selenium-modified enzymes Heterocycle-containing bacterial toxin production: BA_2677 = “heterocyclo-anthracin”
A family of variable putative toxins with patterns of CGG insertions.
Future Annotation Pipeline Enhancements • Populate the Characterized Protein Database • Develop META-RULES from CHAR • BrainGrab for novel content • Import additional computable evidence • Improve exchanges of validation sets • Build a protein names ontology
Acknowledgements: Ramana Madupu, Jeremy Selengut, Alex Richter, JCVI microbial annotation team