Chemical named entity recognition and literature mark-up

Published on March 11, 2008

Author: dullhunk

Source: slideshare.net

Description

Presentation by Colin Batchelor, Royal Society of Chemistry publishing, in Manchester, March 2008

Chemical named entity recognition and literature mark-up

Colin Batchelor

Informatics Department, Royal Society of Chemistry

[email_address]

Overview

Project Prospect: what we find and how we find it.

RDF: How should we be disseminating it?

Next steps: Basics for a chemical ontology.

Project Prospect: What do we find?

Chemical compounds

Chemical terms from the IUPAC Gold Book

Gene products: function, process, location

Nucleotide and polypeptide sequence terms

Cell types

Project Prospect: How do we find it?

For compound names:

~60% Oscar (Corbett and Murray-Rust 2006, Batchelor and Corbett 2007)

~20% PubChem

~20% ChemDraw

For compound numbers:

~70% author ChemDraw

~30% editors

RDF in an RSS reader

RDF: how we do it now

Content module from RSS 1.0

http://web.resource.org/rss/1.0/modules/content

In what sense does an article “contain” pyridine or base pairs?

We would much rather have proper RDF predicates, e.g. “is_about” or “mentions”.
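As a rough illustration of what such predicates might look like, the sketch below emits N-Triples statements linking an article to an entity found in it. The `http://example.org/prospect#` namespace and the `mentions` property URI are hypothetical placeholders chosen for this example, not a published RSC vocabulary.

```python
# Hypothetical vocabulary: this namespace is a placeholder, not a published schema.
VOCAB = "http://example.org/prospect#"


def triple(subject: str, predicate: str, obj: str) -> str:
    """Serialise one statement as an N-Triples line (all three terms are IRIs)."""
    return f"<{subject}> <{VOCAB}{predicate}> <{obj}> ."


article = "http://xlink.rsc.org/?DOI=b716356h"
# Sequence Ontology term URI as it appears in the RSC feed item.
sequence_term = "http://purl.org/obo/owl/SO#SO:0000028"

statements = [triple(article, "mentions", sequence_term)]

for line in statements:
    print(line)
```

Swapping `"mentions"` for `"is_about"` would give the stronger claim; which entities deserve which predicate is exactly the open question raised here.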

RDF: what it looks like now

<item rdf:about="http://xlink.rsc.org/?DOI=b716356h&amp;RSS=1">

<title> [… title] </title>

<link>http://xlink.rsc.org/?DOI=b716356h&amp;RSS=1</link>

<description> [… blah] </description>

<content:encoded> [… human-readable stuff] </content:encoded>

[… dublin core stuff …]

<content:items>

<rdf:Bag>

<rdf:li>

<content:item rdf:about="info:inchi/InChI=1/C22H22NO4/c1-13-16-11-21(26-4)20(25-3)10-15(16)8-18-17-12-22(27-5)19(24-2)9-14(17)6-7-23(13)18/h6-12H,1-5H3/q+1"/>

</rdf:li>

<rdf:li>

<content:item rdf:about="http://purl.org/obo/owl/SO#SO:0000028"/>

</rdf:li>

</rdf:Bag>

</content:items>

</item>
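A consumer of such a feed could pull the entity URIs out of each item along these lines. This is a minimal sketch using Python's standard library; the `FEED` string is a cut-down stand-in for the real feed (only one `content:item` is kept, and the enclosing `rdf:RDF` element with its namespace declarations is assumed, since the slide shows only the `item`).

```python
import xml.etree.ElementTree as ET

# Namespace URIs for RSS 1.0, RDF and the RSS 1.0 content module.
NS = {
    "rss": "http://purl.org/rss/1.0/",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "content": "http://purl.org/rss/1.0/modules/content/",
}

# Abbreviated stand-in for the feed; the wrapper element is an assumption.
FEED = """<rdf:RDF xmlns="http://purl.org/rss/1.0/"
         xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <item rdf:about="http://xlink.rsc.org/?DOI=b716356h&amp;RSS=1">
    <link>http://xlink.rsc.org/?DOI=b716356h&amp;RSS=1</link>
    <content:items>
      <rdf:Bag>
        <rdf:li>
          <content:item rdf:about="http://purl.org/obo/owl/SO#SO:0000028"/>
        </rdf:li>
      </rdf:Bag>
    </content:items>
  </item>
</rdf:RDF>"""


def entity_uris(feed_xml):
    """Map each item's rdf:about URI to the URIs of its content:item children."""
    root = ET.fromstring(feed_xml)
    about = f"{{{NS['rdf']}}}about"  # ElementTree spells qualified names as {uri}local
    result = {}
    for item in root.findall("rss:item", NS):
        uris = [ci.get(about) for ci in item.findall(".//content:item", NS)]
        result[item.get(about)] = uris
    return result


print(entity_uris(FEED))
```

Note that the parser only accepts the item once the attribute values are quoted and the ampersands escaped as `&amp;`.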

Basics for a chemical ontology

Unambiguous representation of objects of chemical discourse

Proper parthood relations

Basics for a chemical ontology: 1. Objects of chemical discourse

Must be able to represent and clearly distinguish

Compounds

Classes of compound

Parts of molecules

Mixtures

Would be nice to have:

Disambiguation cues for the first three

Imidazole

An imidazole

The imidazole side-chain/group/ring

Can ChEBI handle this?

Imidazoles (!) (CHEBI:24780)

Imidazole (CHEBI:16069)

Imidazole ring: not yet

Imidazolyl group: not yet (though methyl, benzyl, etc. are there)

… and there are no disambiguation cues

Disambiguation

One Sense per Discourse (Gale et al. 1992)

… this doesn’t hold at all

One Sense per Collocation (Yarowsky 1993)

… matches our intuitions

Disambiguation: What a one sense per collocation feature set might look like

CLASS:

w (–1) = a, an, the, this

w (0) plural (bit of a cheat, as not a collocation)

PART:

w (–1) = bridging, terminal

w (+1) = backbone, bridge, chain, core, dyad, fluorophore, fragment, framework (and many more)

w (+1) w (+2) = “building block”, “protecting group”, “side chain”
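The feature set above can be sketched as a small rule-based tagger. This is only an illustration under stated assumptions: the word lists are the ones on the slide (without the "many more"), the plural test is a crude "ends in s" check, and letting PART cues take precedence over CLASS cues is a choice made for this example so that "the imidazole side chain" comes out as a part.

```python
# Cue words copied from the slide; the slide's "and many more" entries are not included.
CLASS_DETERMINERS = {"a", "an", "the", "this"}
PART_PREMODIFIERS = {"bridging", "terminal"}
PART_HEADS = {"backbone", "bridge", "chain", "core", "dyad",
              "fluorophore", "fragment", "framework"}
PART_BIGRAMS = {("building", "block"), ("protecting", "group"), ("side", "chain")}


def classify(tokens, i):
    """Guess the sense of the chemical name at tokens[i]: 'PART', 'CLASS' or None."""

    def w(offset):
        # Lower-cased word at position i + offset, or "" when out of range.
        j = i + offset
        return tokens[j].lower() if 0 <= j < len(tokens) else ""

    # Assumption: PART cues are checked first and win over CLASS cues.
    if (w(-1) in PART_PREMODIFIERS
            or w(1) in PART_HEADS
            or (w(1), w(2)) in PART_BIGRAMS):
        return "PART"
    # Assumption: a trailing "s" stands in for a real plural detector.
    if w(-1) in CLASS_DETERMINERS or w(0).endswith("s"):
        return "CLASS"
    return None


print(classify("An imidazole".split(), 1))
print(classify("the imidazole side chain".split(), 1))
print(classify(["imidazole"], 0))
```

A bare "imidazole" with no cues returns `None`, which corresponds to falling back on the compound reading.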

Basics for a chemical ontology: 2. Parthood relations

Parthood in ChEBI means at least three things:

Is necessarily chemically part of:

Carbonyl group part_of carbonyl compounds

Is possibly chemically part of:

Lead(2+) part_of lead diacetate

(most lead(2+) isn’t)

Electron part_of muonium (!)

Is part of a mixture

Kanamycin A part_of kanamycin

Solution 1: define relationships according to the pattern: all instances of X have a relationship with some Y. (Smith et al., “Relations in biomedical ontologies”, 2005)

carbonyl compound has_part carbonyl group

Lead diacetate has_part lead(2+) (?!)

Muonium has_part electron

Kanamycin has_part kanamycin A (?!)
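Written out, the all-some pattern for has_part reads roughly as follows. This is a simplified rendering that leaves out time-indexing; inst(x, X) is shorthand introduced here for "x is an instance of class X".

```latex
X \;\mathit{has\_part}\; Y
\;\equiv\;
\forall x \,\bigl( \mathrm{inst}(x, X) \rightarrow
  \exists y \,( \mathrm{inst}(y, Y) \wedge x \;\mathit{has\_part}\; y ) \bigr)
```

Read this way, "Lead(2+) part_of lead diacetate" cannot be asserted at class level, because most lead(2+) is not part of lead diacetate; only the has_part direction is available.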

Solution 2 (for discussion): Distinguish molecular-level relationships from sample-level relationships

Carbonyl compound molecule has_part carbonyl substituent

Muonium atom has_part electron

Kanamycin has_component kanamycin A

Lead diacetate has_component lead(2+) (?!)
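One way to picture the two-level split is a small data model in which every entity carries its level and the predicate follows from it. This is a sketch for discussion only: the `level` labels and the rule used in `relate` are assumptions made for this example, not part of ChEBI or the OBO relation ontology.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    name: str
    level: str  # assumed labels: "molecular" or "sample"


@dataclass(frozen=True)
class Relation:
    subject: Entity
    predicate: str
    obj: Entity


def relate(subject, obj):
    """Choose the predicate from the levels involved (an assumed rule, for illustration)."""
    if subject.level == "molecular" and obj.level == "molecular":
        return Relation(subject, "has_part", obj)
    if subject.level == "sample":
        return Relation(subject, "has_component", obj)
    raise ValueError(f"no relation defined from {subject.level} to {obj.level}")


muonium_atom = Entity("muonium atom", "molecular")
electron = Entity("electron", "molecular")
kanamycin = Entity("kanamycin", "sample")
kanamycin_a = Entity("kanamycin A", "sample")

for rel in (relate(muonium_atom, electron), relate(kanamycin, kanamycin_a)):
    print(rel.subject.name, rel.predicate, rel.obj.name)
```

The lead diacetate case stays awkward under this rule too, since it is unclear which level lead(2+) should be assigned to.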

Open questions

How do we represent the relationship between named entities and documents?

How do we integrate ontologies and word-sense disambiguation?

What is the best way of distinguishing molecules and samples?

Acknowledgements

University of Cambridge: Peter Corbett

OBO Foundry: Chris Mungall (Berkeley), Barry Smith (Buffalo)

www.projectprospect.org
