Bio2RDF: Towards A Mashup To Build Bioinformatics Knowledge System


Published on June 2, 2008

Author: fbelleau

Source: slideshare.net

Description

Bio2RDF presentation at WWW2007 HCLS Workshop.

http://bio2rdf.org/www2007/

Towards A Mashup To Build Bioinformatics Knowledge System
François Belleau, Marc-Alexandre Nolin, Nicole Tourigny, Philippe Rigault, Jean Morissette
Département d'informatique et de génie logiciel, Université Laval

Presentation Plan
● Knowledge integration vision
● Bio2RDF architecture
● RDFization of knowledge
● Normalization of URI
● Parkinson Example Demo
● Conclusion
Banff, May 8, 2007 CHUL research center - Laval University

From the RDF inventor: "Wouldn't it be great if you were able to organize all this information based on your own terms, instead of based on the application you use to access the information?" (1999) Ramanathan V. Guha
From Wikipedia: Mashup (web application hybrid). A mashup is a website or application that combines content from more than one source into an integrated experience. (2007)

Sir Berners-Lee’s vision of the semantic web: « The Semantic Web is not a separate Web but an extension of the current one, in which information is given well-defined meaning, better enabling computers and people to work in cooperation. » Tim Berners-Lee, Scientific American, 2001 http://www.w3.org/2006/Talks/0404-mit-tbl/

Bio2RDF starting vision at ISMB 2005
● Too many knowledge sources available for life science scientists
● Too many formats (text, XML, HTML)
● New sources each day, each with a specialized tool or web interface
● Integration problem recognized by the global community
Thanks to Christopher Baker, Eric Neumann, Kei Cheung and Joanne Luciano for their ideas.

The knowledge integration problem in bioinformatics
[Figures: from the BioPAX group (2004); from Carol Goble at ISWC 2005]

Integration methods in bioinformatics
1) Davidson 1995: “Transform data to the federated database on demand”
2) Köhler 2003: “In different databases the same things can be given different names”
3) Stein 2003: “link integration, view integration and data warehousing”

Data warehouse approaches
http://www.ncbi.nlm.nih.gov/Database/
http://www.genome.jp/dbget/dbget.links.html

Bio2RDF’s approach to knowledge integration: “Solve the problem of knowledge integration in biology by applying a semantic web approach.”

Other semantic web projects

Bio2RDF’s design rules
2. Convert documents to RDF format;
3. Use a triplestore technology (Sesame, Virtuoso, Oracle);
4. Normalize URIs;
5. Build a mashup as needed to answer a specific question (Elmo);
6. Query the mashup with SeRQL or SPARQL.

Bio2RDF’s architecture
[Architecture diagram with components labeled #1 to #6]

Bio2RDF’s knowledge sources

RDF conversion statistics

Data source | LSID example            | Number of RDF documents | Size of data converted
go          | go:0000001              |                  22 961 |            507 963 321
kegg        | path:aae00010           |                  35 257 |          1 038 593 137
kegg        | cpd:c00001              |                  14 292 |              8 902 205
mgi         | mgi:96103               |                 438 724 |            210 458 897
ncbi        | omim:100050             |                  17 359 |            573 639 380
ncbi        | geneid:1                |               2 744 786 |         67 225 535 082
obo         | obo's 59 namespaces     |                 279 720 |            216 007 267
pdb         | pdb:100d                |                  34 421 |         16 309 651 935
uniprot     | uniprot:A0A000          |               4 177 176 |         29 453 203 064
uniprot     | enzyme:1.-.-.-          |                   5 020 |              2 844 058
uniprot     | pubmed:100133           |                 191 664 |            364 728 083
uniprot     | taxonomy:10             |                 337 564 |            125 630 659
uniprot     | uniref:UniRef100_A0A000 |               7 990 452 |         14 865 490 144
…           | …                       |                       … |                      …

OpenRDF’s software http://www.openrdf.org/

RDF of geneid:15275
• rdf:about
• rdfs:label
• dc:identifier, title, created
• bio2rdf:lsid
• bio2rdf:url
• bio2rdf:synonym
• bio2rdf:xRef
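As a rough illustration of what such a description could look like once serialized, here is a minimal Python sketch that emits N-Triples for a few of the predicates listed above. The `bio2rdf` namespace URI is taken from the predicate URIs in the deck's SeRQL query; every literal value, the LSID authority, and the xRef target are placeholders, not the real content of the record, and the helper names are hypothetical.

```python
# Namespace URIs. The bio2rdf one matches the predicate URIs in the deck's SeRQL query.
NS = {
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "dc": "http://purl.org/dc/elements/1.1/",
    "bio2rdf": "http://bio2rdf.org/bio2rdf#",
}


def expand(qname):
    """Turn a prefixed name such as 'rdfs:label' into a full URI."""
    prefix, local = qname.split(":", 1)
    return NS[prefix] + local


def to_ntriples(subject, properties):
    """Serialize (qname, value, is_uri) tuples as N-Triples lines."""
    lines = []
    for qname, value, is_uri in properties:
        if is_uri:
            obj = f"<{value}>"
        else:
            escaped = value.replace("\\", "\\\\").replace('"', '\\"')
            obj = f'"{escaped}"'
        lines.append(f"<{subject}> <{expand(qname)}> {obj} .")
    return lines


# Placeholder values only: assumptions for illustration, not the actual record.
record = [
    ("rdfs:label", "placeholder label [geneid:15275]", False),
    ("dc:identifier", "geneid:15275", False),
    ("dc:title", "placeholder title", False),
    ("bio2rdf:lsid", "urn:lsid:example.org:geneid:15275", False),
    ("bio2rdf:xRef", "http://bio2rdf.org/example:0000", True),
]

triples = to_ntriples("http://bio2rdf.org/geneid:15275", record)
print("\n".join(triples))
```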

RDFizer
To rdfize: to convert an existing document into RDF format.
[Diagram labels: efetch, rdfizer]

How to rdfize
• From HTML pages (prosite:ps00101)
• From XML documents using XSLT (path:mmu00010)
• From XML documents using XPath and JSTL (geneid:15275)
• From direct SQL access (ensembl:ensmusg00000025875)
• From RDF documents (uniprot:p26838)
• From text files (cpd:c00001)

1) prosite:ps00101 from HTML using a regex
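The slide image is not preserved, so the following is only a sketch of the general idea: pull fields out of an HTML page with regular expressions and emit them as triples. The HTML fragment, the field names, and the `rdfize_html` helper are all hypothetical; the real PROSITE page layout and the project's actual expressions are not shown in the transcript.

```python
import re

RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
XREF = "http://bio2rdf.org/bio2rdf#xRef"

# Hypothetical HTML fragment standing in for a source page.
HTML = """
<html><body>
<h1>PS00101</h1>
<td class="name">EXAMPLE_PATTERN</td>
<a href="/doc/PDOC00001">PDOC00001</a>
</body></html>
"""


def rdfize_html(html, namespace="prosite"):
    """Extract an accession, a name, and cross-references as (s, p, o) triples."""
    accession = re.search(r"<h1>(\w+)</h1>", html).group(1).lower()
    name = re.search(r'<td class="name">([^<]+)</td>', html).group(1)
    subject = f"http://bio2rdf.org/{namespace}:{accession}"
    triples = [(subject, RDFS_LABEL, name)]
    # Each documentation link becomes a cross-reference (target URI form is assumed).
    for doc_id in re.findall(r'href="/doc/(\w+)"', html):
        target = f"http://bio2rdf.org/{namespace}:{doc_id.lower()}"
        triples.append((subject, XREF, target))
    return triples


html_triples = rdfize_html(HTML)
for triple in html_triples:
    print(triple)
```

Regex extraction is brittle against layout changes, which is presumably why the deck lists it as only one of several conversion routes.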

2) KEGG’s path:mmu00010 from XML using XSL
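The stylesheet itself is not in the transcript. As a stand-in, the sketch below walks a small XML document with Python's `xml.etree.ElementTree` rather than XSLT (the standard library has no XSLT processor), which shows the same document-to-triples mapping. The XML fragment is a simplified, hypothetical pathway file, and `rdfize_pathway` is an illustrative name.

```python
import xml.etree.ElementTree as ET

BASE = "http://bio2rdf.org/"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
XREF = "http://bio2rdf.org/bio2rdf#xRef"

# Simplified, hypothetical pathway document (not the real source file).
XML = """<pathway name="path:mmu00010" title="Example pathway title">
  <entry id="1" name="mmu:15275" type="gene"/>
  <entry id="2" name="cpd:C00001" type="compound"/>
</pathway>"""


def rdfize_pathway(xml_text):
    """Map the pathway element and its entries to (s, p, o) triples."""
    root = ET.fromstring(xml_text)
    subject = BASE + root.get("name").lower()
    triples = [(subject, RDFS_LABEL, root.get("title"))]
    for entry in root.iter("entry"):
        # Lowercase the identifier, following the deck's URI normalization rule.
        triples.append((subject, XREF, BASE + entry.get("name").lower()))
    return triples


xml_triples = rdfize_pathway(XML)
for triple in xml_triples:
    print(triple)
```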

3) ensembl:ensmusg00000025875 from SQL
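A minimal sketch of the SQL route, assuming a single hypothetical `gene` table with two columns. An in-memory SQLite database stands in for the real database server, and neither the schema nor the description text reflects the actual Ensembl schema or record.

```python
import sqlite3

RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
DC_IDENTIFIER = "http://purl.org/dc/elements/1.1/identifier"

# In-memory stand-in database with a hypothetical one-table schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE gene (stable_id TEXT PRIMARY KEY, description TEXT)")
conn.execute(
    "INSERT INTO gene VALUES (?, ?)",
    ("ENSMUSG00000025875", "placeholder description"),
)


def rdfize_gene(connection, stable_id):
    """Look up one row and return it as (s, p, o) triples, or [] if absent."""
    row = connection.execute(
        "SELECT stable_id, description FROM gene WHERE stable_id = ?",
        (stable_id,),
    ).fetchone()
    if row is None:
        return []
    identifier = "ensembl:" + row[0].lower()
    subject = "http://bio2rdf.org/" + identifier
    return [
        (subject, RDFS_LABEL, row[1]),
        (subject, DC_IDENTIFIER, identifier),
    ]


sql_triples = rdfize_gene(conn, "ENSMUSG00000025875")
for triple in sql_triples:
    print(triple)
```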

4) uniprot:p26838 from RDF using SeRQL

One reality, many names
● Different namespace identifier: pubmed:11992264 vs pmid:11992264
● Uppercase and lowercase: uniprot:p26838 vs uniprot:P26838
● Version number: genbank:ac008393 vs genbank:ac008393.7
● Total id length: go:0032283 vs go:32283

RDFizing documents is not enough; we also need normalized URIs.
http://bio2rdf.org/namespace:id
http://bio2rdf.org/pubmed:11992264
http://bio2rdf.org/uniprot:p26838
http://bio2rdf.org/genbank:ac008393
http://bio2rdf.org/go:0032283

URI normalization rules
● Different namespace identifier: we resolve namespace synonymy with a urlrewrite rule, for example pubmed and pmid.
● Uppercase and lowercase: we write every URI in lowercase.
● Version number: an owl:sameAs predicate is used to link the different versions of a document.
● Total id length: a fixed length is determined for each id.
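The four rules can be sketched as a small Python helper. This is an illustration under stated assumptions, not the project's code: the synonym table holds only the pubmed/pmid pair named in the slides, the 7-digit width for `go` is inferred from the example go:0032283, treating only `genbank` ids as versioned is an assumption, and the sketch maps `pmid` straight to `pubmed` instead of going through a rewrite rule.

```python
import re

BASE = "http://bio2rdf.org/"
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"

NAMESPACE_SYNONYMS = {"pmid": "pubmed"}  # the only pair named in the slides
ID_WIDTH = {"go": 7}                     # inferred from go:0032283
VERSIONED_NAMESPACES = {"genbank"}       # assumption for this sketch


def to_uri(identifier):
    """Apply the lowercase, synonym, and fixed-length rules to 'namespace:id'."""
    namespace, local_id = identifier.strip().lower().split(":", 1)
    namespace = NAMESPACE_SYNONYMS.get(namespace, namespace)
    if namespace in ID_WIDTH and local_id.isdigit():
        local_id = local_id.zfill(ID_WIDTH[namespace])
    return f"{BASE}{namespace}:{local_id}"


def version_link(identifier):
    """Return an owl:sameAs triple from a versioned id to its unversioned form."""
    namespace, local_id = identifier.strip().lower().split(":", 1)
    if namespace not in VERSIONED_NAMESPACES:
        return None
    unversioned = re.sub(r"\.\d+$", "", local_id)
    if unversioned == local_id:
        return None
    return (to_uri(identifier), OWL_SAME_AS, to_uri(f"{namespace}:{unversioned}"))


for raw in ["pmid:11992264", "uniprot:P26838", "go:32283", "genbank:ac008393.7"]:
    print(raw, "->", to_uri(raw))
print(version_link("genbank:ac008393.7"))
```

Keeping the versioned URI and linking it with owl:sameAs, rather than silently dropping the version suffix, preserves the distinction between document versions while still letting queries join across them.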

Url Rewrite Filter http://tuckey.org/urlrewrite/

<rule>
  <from>^/search:(.*?)@pubmed</from>
  <to>/rdfizer/ncbi-entrez2rdf.jsp?db=pubmed&amp;query=$1</to>
</rule>
<rule>
  <from>^/pubmed:(.*)</from>
  <to>/rdfizer/ncbi-pubmed2rdf.jsp?id=$1</to>
</rule>
<rule>
  <from>^/pmid:(.*)</from>
  <to>/rdfizer/lsid-sameas2rdf.jsp?from=pmid:$1&amp;to=pubmed:$1</to>
</rule>
<rule>
  <from>^/(.*):(.*)</from>
  <to type="redirect">http://bio2rdf.org/$1:$2</to>
</rule>
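To see what these four rules do to an incoming path, the sketch below replays them in Python with the `re` module. It is a simplification: it assumes the first matching rule wins and does not model how the real filter chains rules or handles query strings, and the `rewrite` function is a hypothetical name.

```python
import re

# (pattern, replacement template, is_redirect), in the same order as the rules above.
RULES = [
    (r"^/search:(.*?)@pubmed", r"/rdfizer/ncbi-entrez2rdf.jsp?db=pubmed&query=\1", False),
    (r"^/pubmed:(.*)", r"/rdfizer/ncbi-pubmed2rdf.jsp?id=\1", False),
    (r"^/pmid:(.*)", r"/rdfizer/lsid-sameas2rdf.jsp?from=pmid:\1&to=pubmed:\1", False),
    (r"^/(.*):(.*)", r"http://bio2rdf.org/\1:\2", True),
]


def rewrite(path):
    """Return (target, is_redirect) for the first rule that matches the path."""
    for pattern, template, is_redirect in RULES:
        match = re.match(pattern, path)
        if match:
            return match.expand(template), is_redirect
    return path, False  # no rule matched: leave the path unchanged


for path in ["/search:parkinson@pubmed", "/pubmed:11992264", "/pmid:11992264", "/go:0032283"]:
    print(path, "->", rewrite(path))
```

The last rule acts as a catch-all: any other namespace:id request is redirected to the canonical bio2rdf.org URI, while the pmid rule routes to a page that relates the synonym to its pubmed form.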

URL vs LSID
http://bio2rdf.org/uniprot:p26838 owl:sameAs urn:lsid:uniprot.org:uniprot:p26838
http://bio2rdf.org/uniprot:p26838
http://bio2rdf.org/urn:lsid:uniprot.org:uniprot:p26838
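The correspondence shown on this slide can be written as a short Python sketch. The authority table contains only the uniprot.org entry that appears in the slide; authorities for other namespaces would have to be supplied, and the function names are illustrative.

```python
BASE = "http://bio2rdf.org/"
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"

# Only this namespace-to-authority pair appears in the slides.
AUTHORITIES = {"uniprot": "uniprot.org"}


def to_lsid(identifier):
    """Build urn:lsid:<authority>:<namespace>:<id> from 'namespace:id'."""
    namespace, local_id = identifier.split(":", 1)
    return f"urn:lsid:{AUTHORITIES[namespace]}:{namespace}:{local_id}"


def same_as_triple(identifier):
    """Link the bio2rdf.org URL to the LSID with owl:sameAs."""
    return (BASE + identifier, OWL_SAME_AS, to_lsid(identifier))


def lsid_url(identifier):
    """URL form under which the LSID itself is addressed on bio2rdf.org."""
    return BASE + to_lsid(identifier)


print(same_as_triple("uniprot:p26838"))
print(lsid_url("uniprot:p26838"))
```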

Our method to answer a question: to answer a very specialized question, we build a specific knowledge base (the mashup stored in an RDF triplestore) and then query it with SeRQL.

Parkinson examples
1. What is the semantic network of OMIM records describing Parkinson’s disease?
2. Which MeSH terms are most often cited in Parkinson’s disease publications?
3. What genes related to Parkinson’s disease are involved in pathways according to KEGG?

Time for demo!

The big everything about Parkinson
http://localhost:8080/bio2rdf/search:parkinson@omim
http://localhost:8080/bio2rdf/search:parkinson@geneid
http://localhost:8080/bio2rdf/search:parkinson@uniprot
http://localhost:8080/bio2rdf/search:parkinson@kegg
http://localhost:8080/bio2rdf/load:pubmed
http://localhost:8080/bio2rdf/sameas:hsa-geneid
http://localhost:8080/bio2rdf/learn:geneid
http://localhost:8080/bio2rdf/load:cpd
http://localhost:8080/bio2rdf/load:reactome
http://localhost:8080/bio2rdf/load:biopax-xref
http://localhost:8080/bio2rdf/load:chebi
http://localhost:8080/bio2rdf/load:obo-xref
http://localhost:8080/bio2rdf/sameas:keggcompound-cpd
1.700 K triples, 97 Mbytes in Turtle format, in 90 minutes

Third example SeRQL query: What genes related to Parkinson’s disease are involved in pathways according to KEGG?

SELECT GeneticDisorder-label, Gene-label, pathway-label
FROM {GeneticDisorder} rdf:type {<http://bio2rdf.org/omim#GeneticDisorder>},
     {GeneticDisorder} rdfs:label {GeneticDisorder-label},
     {GeneticDisorder} <http://www.w3.org/2002/07/owl#sameAs> {sameAs},
     {Gene} <http://bio2rdf.org/bio2rdf#xRef> {sameAs},
     {Gene} rdfs:label {Gene-label},
     {Gene2} <http://www.w3.org/2000/01/rdf-schema#seeAlso> {Gene},
     {xobject} <http://bio2rdf.org/kegg#xobject> {Gene2},
     {xentry1} <http://bio2rdf.org/kegg#xentry1> {xobject},
     {pathway} <http://bio2rdf.org/kegg#xrelation> {xentry1},
     {pathway} rdfs:label {pathway-label}
WHERE GeneticDisorder-label like "*PARKINSON*"
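To make the join chain in this query easier to follow, here is a small Python sketch that evaluates the same ten path expressions over an in-memory list of triples. It is not how a triplestore executes SeRQL: the `match` function is a naive backtracking matcher written for illustration, the `ex:` resources and all labels are invented toy data, and the wildcard filter uses `fnmatchcase` on the assumption that the `like` comparison is case-sensitive.

```python
from fnmatch import fnmatchcase

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"
BIO2RDF = "http://bio2rdf.org/bio2rdf#"
KEGG = "http://bio2rdf.org/kegg#"
OMIM = "http://bio2rdf.org/omim#"

# Invented toy data; the "ex:" resources are placeholders, not real records.
TRIPLES = [
    ("ex:disorder", RDF_TYPE, OMIM + "GeneticDisorder"),
    ("ex:disorder", RDFS + "label", "PARKINSON DISEASE (toy record)"),
    ("ex:disorder", OWL_SAME_AS, "ex:same"),
    ("ex:gene", BIO2RDF + "xRef", "ex:same"),
    ("ex:gene", RDFS + "label", "toy gene"),
    ("ex:gene2", RDFS + "seeAlso", "ex:gene"),
    ("ex:xobject", KEGG + "xobject", "ex:gene2"),
    ("ex:xentry1", KEGG + "xentry1", "ex:xobject"),
    ("ex:pathway", KEGG + "xrelation", "ex:xentry1"),
    ("ex:pathway", RDFS + "label", "toy pathway"),
    ("ex:other", RDF_TYPE, OMIM + "GeneticDisorder"),
    ("ex:other", RDFS + "label", "UNRELATED DISORDER (toy record)"),
]

# One pattern per path expression in the query; variables start with "?".
PATTERNS = [
    ("?disorder", RDF_TYPE, OMIM + "GeneticDisorder"),
    ("?disorder", RDFS + "label", "?disorderLabel"),
    ("?disorder", OWL_SAME_AS, "?sameAs"),
    ("?gene", BIO2RDF + "xRef", "?sameAs"),
    ("?gene", RDFS + "label", "?geneLabel"),
    ("?gene2", RDFS + "seeAlso", "?gene"),
    ("?xobject", KEGG + "xobject", "?gene2"),
    ("?xentry1", KEGG + "xentry1", "?xobject"),
    ("?pathway", KEGG + "xrelation", "?xentry1"),
    ("?pathway", RDFS + "label", "?pathwayLabel"),
]


def match(patterns, triples, binding=None):
    """Yield every variable binding that satisfies all patterns."""
    binding = binding or {}
    if not patterns:
        yield binding
        return
    first, rest = patterns[0], patterns[1:]
    for triple in triples:
        candidate = dict(binding)
        consistent = True
        for term, value in zip(first, triple):
            if term.startswith("?"):
                # Bind the variable, or check it against its earlier binding.
                if candidate.setdefault(term, value) != value:
                    consistent = False
                    break
            elif term != value:
                consistent = False
                break
        if consistent:
            yield from match(rest, triples, candidate)


rows = [
    (b["?disorderLabel"], b["?geneLabel"], b["?pathwayLabel"])
    for b in match(PATTERNS, TRIPLES)
    if fnmatchcase(b["?disorderLabel"], "*PARKINSON*")
]
print(rows)
```

The shared variables (`?sameAs`, `?gene`, `?gene2`, `?xobject`, `?xentry1`) are what stitch the OMIM, gene, and KEGG records together, which is why the normalized URIs matter: the join only succeeds when each source names the same resource identically.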

Query result

Conclusion

Before Bio2RDF integration

Our main results
● RDF is a framework that enables a very simple thing: scalability of the knowledge base complexity.
● The Bio2RDF project proposes to keep complexity in the bioinformatics knowledge space under control by applying this proven semantic web approach.

Now with Bio2RDF semantic integration

Bio2RDF’s vision of knowledge map

Bio2RDF’s map of distributed bioinformatics knowledge http://bio2rdf.org/bio2rdf-2007-02.owl

Map of semantic resource

Montreal’s subway map

Bio2RDF’s current knowledge map

Achievement Public data + open source software + rdf technology + rdfizer + normalized U RIs = Bio2RDF knowledge integration; A bioinformatic-integration ontology wont exist if it is not adopted by the community, bio2rdf.owl is just a proposed starting point; 46 millions RDF documents are now available at http:/ / bio2rdf.org. Banff, May 8, 2007 CHUL research center - Laval University 43

The Bio2RDF project provides open source RDFizers to the community. So much still needs to be rdfized; if you are interested in contributing, join us! Now let's build the big knowledge map of bioinformatics…

Final words
Please, tell Sir Tim Berners-Lee that he was right: ‘semantic web in bioinformatics’ is a killer app to illustrate all the potential of the semantic web. And also, tell Mark Wilkinson that semantic web in bioinformatics won’t be full of creeps if we organize it like we did…

Thanks
● Jean Morissette
● Nicole Tourigny
● Philippe Rigault
● Bioinformatics lab’s team at CHUL Research Center
● Many open source communities (OpenRDF, Simile’s project, Tomcat, JSTL and many more)
● W3C Bio-RDF Group
● Génome Québec
● Génome Canada

Visit http://bio2rdf.org
Download http://sourceforge.net/projects/bio2rdf/
Discover http://bio2rdf.org/bio2rdf-2007-02.owl
Contact us at bio2rdf@gmail.com
