Automated Prokaryotic Annotation at JCVI

Information about Automated Prokaryotic Annotation at JCVI

Published on March 12, 2009

Author: Pathema



Conference: Annual BRC Meeting (BRC6), Oct 28-29, 2008 in Ft. Lauderdale, Florida.
Presenter: Dan Haft

Automated Prokaryotic Annotation at the JCVI
Daniel Haft, 2008

A Dual-Use Pipeline
• Multiple types of stored evidence
• Persistent & flexibly interleaved
• Supports selective re-annotation
• Features annotation-driving databases
  - CHAR
  - TIGRFAMs
  - Genome Properties
  - BrainGrab Rules
• Evidence used by machine and by experts
• MANATEE interface for annotators
• Capture new rules with BrainGrab

Computable objects: output from one program becomes input to another.
• HMM results drive Genome Properties
• Genome Properties guide GO process assignments
• GO process terms

Identification of Genome Features
[Diagram: Genome Sequence, ORFs, IMM built]
Glimmer builds a statistical model from the training set.
Other Genome Features:
• rRNA, tRNA, Rfam
• IS elements
• Phage regions
• Repeats

Gene Finding: Glimmer & friends, homology methods
Homology Searches (gathering evidence): BLAST-Extend-Repraze, Hidden Markov Models, misc.
Structural Curation (ORF Management): Auto_Gene_Curate (start sites, overlaps), InterEvidence
Functional Assignments: Auto_Annotate, Manual, Mapped
Data Availability

Homology Searches
• HMM searches: TIGRFAMs & Pfam
• BLAST searches: against internal NIAA
• PROSITE motifs
• InterPro
• TMHMM
• SignalP
• Lipoprotein
• PSORT
• Generate Paralogous Families
• Custom database searches (TransportDB, Rules)

Gene Model Curation
• Overlaps resolved by evidence competition
• Start site curation
• Missed genes / unsupported gene calls

Evidence can Overhang the Gene
BLAST-Extend-Repraze (BER)
The extensions help detect frameshifts (FS) and point mutations that create in-frame stop codons (PM). Either is indicated when similarity extends outside the coordinates of the protein-coding sequence.
[Figure: ORFxxxxx drawn from end5 to end3 with a 300 bp extension on each side; the search protein is aligned against the match protein. The blue line is the predicted protein-coding sequence, the green lines are the upstream and downstream extensions, and the red line is the match protein.]
Legend:
• normal full-length match
• ! similarity extends upstream through a start, or downstream past a frameshift
• * similarity extends in the same frame through a stop codon
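The BER legend can be read as a small classification over alignment coordinates. The sketch below is only an illustration of that reading, not the JCVI implementation: the `Alignment` record, the `same_frame` flag, and the function name are all hypothetical, and coordinates are assumed to be measured on the ORF plus its two 300 bp extensions.

```python
from dataclasses import dataclass

EXTENSION_BP = 300  # flank length shown on the slide


@dataclass(frozen=True)
class Alignment:
    """Hypothetical alignment record. Coordinates are 0-based, end-exclusive,
    measured on the ORF plus its upstream and downstream extensions."""
    start: int
    end: int
    same_frame: bool  # True if the whole alignment stays in the ORF's frame


def classify_overhang(orf_length_bp: int, aln: Alignment,
                      extension_bp: int = EXTENSION_BP) -> str:
    """Return 'normal', '!' or '*' following the slide's legend."""
    cds_start = extension_bp
    cds_end = extension_bp + orf_length_bp
    upstream = aln.start < cds_start
    downstream = aln.end > cds_end
    if not (upstream or downstream):
        return "normal"
    # Same-frame similarity running past the stop: candidate point mutation (PM).
    if downstream and not upstream and aln.same_frame:
        return "*"
    # Upstream overhang, or downstream overhang in another frame (FS).
    # Assumption: when both ends overhang, the '!' case takes precedence.
    return "!"


if __name__ == "__main__":
    print(classify_overhang(900, Alignment(300, 1350, same_frame=True)))
```

An alignment confined to the 900 bp ORF is reported as `normal`; one that runs 150 bp past the stop in the same frame is reported as `*`.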

Pfam vs. TIGRFAMs
Pfam:
• Names for homology domains in proteins
• Granularity tuned for twilight-level sequence similarity detection
• Explains things to annotator
• Pfam: Explanations
TIGRFAMs:
• Functional assignments to proteins
• Granularity tuned for single-hit equivalogs (mono-functional!)
• Generates computable objects --> pathway reconstructions
• TIGRFAMs: RULES

TIGRFAMs equivalogs vs. Pfam domains
[Figure: protein regions labeled X, Y, Z, with brackets marking TIGRxxxxx and PFxxxxx]

TIGRFAMs as annotation rules
• EC number: computable!
• GO term: computable!
• protein name: computable?
• HMM hit: computable!!

Isology (homology) types: ranking our rules
• EXCEPTION: additional info, e.g. “vegetative”
• EQUIVALOG: the SAME (in enough ways) to receive the same name across multiple genomes, reflecting one specific function
• SUBFAMILY: can name a whole class
• DOMAIN: class name for a protein region
(and apply these classifications also to Pfam)
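One way to picture "ranking our rules" is to pick, for each protein, the hit whose isology type ranks highest. The sketch below assumes the slide lists the types from most to least specific; that ordering, the tuple layout, and the model identifiers are illustrative assumptions rather than the pipeline's actual logic.

```python
from typing import Optional

# Assumed rank order, taken from the order of the list on the slide.
ISOLOGY_RANK = {"EXCEPTION": 0, "EQUIVALOG": 1, "SUBFAMILY": 2, "DOMAIN": 3}

# A hit is (model_id, isology_type, name); the layout is hypothetical.
Hit = tuple[str, str, str]


def best_hit(hits: list[Hit]) -> Optional[Hit]:
    """Return the hit with the highest-ranked isology type, or None."""
    ranked = [h for h in hits if h[1] in ISOLOGY_RANK]
    if not ranked:
        return None
    return min(ranked, key=lambda h: ISOLOGY_RANK[h[1]])


if __name__ == "__main__":
    hits = [
        ("MODEL_A", "DOMAIN", "response regulator receiver domain"),
        ("MODEL_B", "EQUIVALOG", "chemotaxis protein CheA"),
        ("MODEL_C", "SUBFAMILY", "histidine kinase family protein"),
    ]
    print(best_hit(hits))
```

With these made-up hits, the equivalog-level model supplies the name and the domain-level hit remains available as supporting evidence.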

CHAR: Experimentally Characterized Protein Database
• Highly curated database of experimentally characterized proteins; connects protein accessions, known function, and the scientific literature.
• What does it include:
  – Controlled vocabulary describes the type of experimentation performed in each publication
  – Key annotation fields (protein name, gene symbol, Enzyme Commission (EC) number, taxonomic data, Gene Ontology (GO) terms) are extracted
  – Synonymous protein accessions obtained from public databases (GenBank and UniProt) are stored

Annotation Proceeds from …
Inside --> out (e.g. AutoAnnotate): for every protein
• Collect evidence
• Best-guess annotation
Outside --> in (e.g. TIGRFAMs): for every model
• Search tool + cutoff + standards = annotation rule
• Achieves partial coverage
Hybrid (BrainGrab): for every unfinished protein
• Look for means to annotate: blastp, synteny, hole-filling, etc.
• Capture annotator logic as a new rule
• Add to library of rules/models for all future genomes
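The outside-in direction can be sketched as a loop over models, each carrying its own cutoff, with whatever remains unannotated handed to the hybrid step. Everything named here (`ModelRule`, the score dictionary, the identifiers and values) is a hypothetical stand-in; the real pipeline's data structures are not described on the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelRule:
    """Hypothetical rule: a model, a score cutoff, and the name it assigns."""
    model_id: str
    cutoff: float
    protein_name: str


def annotate_outside_in(rules: list[ModelRule],
                        scores: dict[tuple[str, str], float]) -> dict[str, str]:
    """For every model, name each protein scoring at or above the cutoff.

    `scores` maps (protein_id, model_id) to a search score.
    The first rule to fire for a protein wins.
    """
    annotations: dict[str, str] = {}
    for rule in rules:
        for (protein_id, model_id), score in scores.items():
            if model_id == rule.model_id and score >= rule.cutoff:
                annotations.setdefault(protein_id, rule.protein_name)
    return annotations


def unfinished(all_proteins: list[str], annotations: dict[str, str]) -> list[str]:
    """Proteins left over for the hybrid (BrainGrab-style) step."""
    return [p for p in all_proteins if p not in annotations]


if __name__ == "__main__":
    rules = [ModelRule("MODEL_A", 250.0, "example protein A")]
    scores = {("prot1", "MODEL_A"): 310.5, ("prot2", "MODEL_A"): 40.2}
    named = annotate_outside_in(rules, scores)
    print(named, unfinished(["prot1", "prot2", "prot3"], named))
```

The leftover list makes the "partial coverage" point concrete: models only name the proteins they were built for, and everything else still needs another source of annotation.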

[Diagram labels: RULES, IMPORT, validate, NEW, share, BrainGrab, Subject Genome, Proper Realm of Annotator Attention, Trusted, Complete, Automatic, genome, genome]

A Teachable Moment
EcHS_A1984 is manually annotated confidently because it is similar enough to:
• SP|P07363|CHEA_ECOLI
• Chemotaxis protein cheA
• EC
(method: defines “similar enough”)
BLASTP_MATCH [SP|P07363, 1600, 95, 92, 60, 1]
Must be the only protein in the genome that scores >= 1600 by blastp, covering >= 95 % of the length of the characterized protein and >= 92 % of the target protein, with >= 60 % sequence identity.
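The BLASTP_MATCH rule reads naturally as a record of thresholds plus a uniqueness limit. The sketch below is one possible encoding, not the BrainGrab format itself: the field names are invented, the last field is assumed to be the maximum number of qualifying proteins per genome, and uniqueness is assumed to be counted among hits that pass every threshold.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BlastpMatchRule:
    """Hypothetical field names for BLASTP_MATCH [acc, score, cov, cov, id, n]."""
    accession: str
    min_score: float
    min_ref_coverage: float     # % of the characterized protein covered
    min_target_coverage: float  # % of the target protein covered
    min_identity: float         # % sequence identity
    max_qualifying: int         # assumed meaning of the trailing "1"


@dataclass(frozen=True)
class BlastpHit:
    protein_id: str
    score: float
    ref_coverage: float
    target_coverage: float
    identity: float


def apply_rule(rule: BlastpMatchRule, hits: list[BlastpHit]) -> list[str]:
    """Return the protein IDs the rule names, or [] if it does not fire."""
    qualifying = [
        h for h in hits
        if h.score >= rule.min_score
        and h.ref_coverage >= rule.min_ref_coverage
        and h.target_coverage >= rule.min_target_coverage
        and h.identity >= rule.min_identity
    ]
    # Too many qualifying proteins means the match is not unique enough.
    if not qualifying or len(qualifying) > rule.max_qualifying:
        return []
    return [h.protein_id for h in qualifying]


if __name__ == "__main__":
    rule = BlastpMatchRule("SP|P07363", 1600, 95, 92, 60, 1)
    # Hit values are made up for illustration.
    hits = [BlastpHit("EcHS_A1984", 1700.0, 99.0, 98.0, 97.0)]
    print(apply_rule(rule, hits))
```

If a second protein in the same genome also passed every threshold, the function would return an empty list and the protein would stay in the annotator's queue.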

A sample of expert opinion: “For This Particular Protein Family”
• I (D.H.H.) assert that any > 75 %-identical, full-length match is the same protein.
• Ditto any > 65 % match, as long as the region is clearly syntenic.
• Ditto any single-copy > 50 % match, as long as it fills this hole in this otherwise mostly complete pathway.
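The three assertions amount to a tiered test in which weaker sequence identity is accepted when supporting context is stronger. The sketch below is a minimal rendering under one reading of the slide: the full-length requirement is applied only to the first tier, as written, and the boolean inputs are assumed to be computed elsewhere.

```python
def same_protein(identity_pct: float,
                 full_length: bool = False,
                 syntenic: bool = False,
                 single_copy: bool = False,
                 fills_pathway_hole: bool = False) -> bool:
    """Tiered family-specific rule, following the three expert assertions."""
    # Tier 1: high identity over the full length is enough on its own.
    if identity_pct > 75 and full_length:
        return True
    # Tier 2: lower identity is accepted in a clearly syntenic region.
    if identity_pct > 65 and syntenic:
        return True
    # Tier 3: lower still, if single-copy and it fills a pathway hole.
    if identity_pct > 50 and single_copy and fills_pathway_hole:
        return True
    return False


if __name__ == "__main__":
    print(same_protein(70.0, syntenic=True))
    print(same_protein(70.0, full_length=True))
```

A 70 %-identical match qualifies when the region is syntenic but not on full-length coverage alone, which is the kind of annotator logic the hybrid approach aims to capture as a reusable rule.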

B: “Bag of Genes”
G: Genome Properties
E: Evidence to drive other programs
Image from Gödel, Escher, Bach: an Eternal Golden Braid by Douglas Hofstadter, 1979

Genome Properties: annotation at the level of systems
• pathway (glyoxylate shunt)
• system (type III secretion)
• structure (outer membrane)
• genometrics (GC content)
• phenotype (motility, pathogenesis)
[Figure labels: YES, some evidence, not supported, NO]

Some Novel Genome Properties
• 12 subtypes of CRISPR/Cas system
• PEP-CTERM / exo-sortase: Biofilm-associated protein sorting
• Type VI secretion (53 loci in B. mallei 23344)
• Post-translational selenium-modified enzymes
• Heterocycle-containing bacterial toxin production: BA_2677 = “heterocyclo-anthracin”

A family of variable putative toxins with patterns of CGG insertions.

Future Annotation Pipeline Enhancements
• Populate the Characterized Protein Database
• Develop META-RULES from CHAR
• BrainGrab for novel content
• Import additional computable evidence
• Improve exchanges of validation sets
• Build a protein names ontology

Acknowledgements
Ramana Madupu
Jeremy Selengut
Alex Richter
JCVI microbial annotation team
