Chemical named entity recognition and literature mark-up

Published on March 11, 2008

Author: dullhunk

Source: slideshare.net

Description

Presentation by Colin Batchelor, Royal Society of Chemistry publishing, in Manchester, March 2008
Chemical named entity recognition and literature mark-up. Colin Batchelor, Informatics Department, Royal Society of Chemistry [email_address]

Overview

Project Prospect: what we find and how we find it.

RDF: How should we be disseminating it?

Next steps: Basics for a chemical ontology.

Project Prospect: What do we find?

Chemical compounds

Chemical terms from the IUPAC Gold Book

Gene products: function, process, location

Nucleotide and polypeptide sequence terms

Cell types

Project Prospect: How do we find it?

For compound names:

~60% Oscar (Corbett and Murray-Rust 2006, Batchelor and Corbett 2007)

~20% PubChem

~20% ChemDraw

For compound numbers:

~70% author ChemDraw

~30% editors

RDF in an RSS reader

RDF: how we do it now

Content module from RSS 1.0

http://web.resource.org/rss/1.0/modules/content

In what sense does an article “contain” pyridine or base pairs?

We would much rather have proper RDF predicates, e.g. “is_about”, “mentions”.
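To make the contrast concrete, here is a minimal Python sketch in which each article-to-entity link carries its own predicate instead of sitting in an undifferentiated bag. The `EX` namespace and the `is_about` and `mentions` predicate URIs are hypothetical, and the choice of which entity gets which predicate is purely illustrative. The article and entity URIs are the ones from the RSS example in this talk.

```python
# Hypothetical namespace for the predicates; not an existing RSC or W3C vocabulary.
EX = "http://example.org/prospect#"

ARTICLE = "http://xlink.rsc.org/?DOI=b716356h&RSS=1"
COMPOUND = (
    "info:inchi/InChI=1/C22H22NO4/c1-13-16-11-21(26-4)20(25-3)10-15(16)8-18-17-12-22"
    "(27-5)19(24-2)9-14(17)6-7-23(13)18/h6-12H,1-5H3/q+1"
)
SEQUENCE_TERM = "http://purl.org/obo/owl/SO#SO:0000028"


def triple(subject, predicate, obj):
    """Serialise one statement as an N-Triples line (all three terms are IRIs)."""
    return f"<{subject}> <{predicate}> <{obj}> ."


# Illustrative assignment only: deciding between the two predicates is the open problem.
statements = [
    triple(ARTICLE, EX + "is_about", COMPOUND),
    triple(ARTICLE, EX + "mentions", SEQUENCE_TERM),
]

for line in statements:
    print(line)
```

Unlike the `content:items` bag, each statement here says what kind of relationship holds between the article and the entity.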

RDF: what it looks like now

<item rdf:about="http://xlink.rsc.org/?DOI=b716356h&amp;RSS=1">
  <title> [… title] </title>
  <link>http://xlink.rsc.org/?DOI=b716356h&amp;RSS=1</link>
  <description> [… blah] </description>
  <content:encoded> [… human-readable stuff …] </content:encoded>
  [… dublin core stuff …]
  <content:items>
    <rdf:Bag>
      <rdf:li>
        <content:item rdf:about="info:inchi/InChI=1/C22H22NO4/c1-13-16-11-21(26-4)20(25-3)10-15(16)8-18-17-12-22(27-5)19(24-2)9-14(17)6-7-23(13)18/h6-12H,1-5H3/q+1"/>
      </rdf:li>
      <rdf:li>
        <content:item rdf:about="http://purl.org/obo/owl/SO#SO:0000028"/>
      </rdf:li>
    </rdf:Bag>
  </content:items>
</item>
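A consumer of such a feed could pull out the entity URIs per article along these lines. This is a sketch using Python's standard `xml.etree.ElementTree`. The slide shows only the `<item>` element, so the `rdf:RDF` wrapper and its namespace declarations in `SAMPLE` are assumptions based on the usual RSS 1.0 layout, and `entity_uris` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Namespace URIs for RSS 1.0, RDF and the RSS 1.0 content module.
NS = {
    "rss": "http://purl.org/rss/1.0/",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "content": "http://purl.org/rss/1.0/modules/content/",
}

# Cut-down feed: the wrapper element is assumed, the item follows the slide.
SAMPLE = """\
<rdf:RDF xmlns="http://purl.org/rss/1.0/"
         xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <item rdf:about="http://xlink.rsc.org/?DOI=b716356h&amp;RSS=1">
    <content:items>
      <rdf:Bag>
        <rdf:li>
          <content:item rdf:about="http://purl.org/obo/owl/SO#SO:0000028"/>
        </rdf:li>
      </rdf:Bag>
    </content:items>
  </item>
</rdf:RDF>
"""


def entity_uris(feed_xml):
    """Map each item's rdf:about URI to the list of content:item URIs it holds."""
    root = ET.fromstring(feed_xml)
    about = f"{{{NS['rdf']}}}about"  # ElementTree spells qualified attributes as {uri}name
    result = {}
    for item in root.findall("rss:item", NS):
        found = item.findall(".//content:item", NS)
        result[item.get(about)] = [entity.get(about) for entity in found]
    return result


print(entity_uris(SAMPLE))
```

Note that nothing in the extracted data says how the article relates to each entity, which is the limitation the previous slide points out.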

Basics for a chemical ontology

Unambiguous representation of objects of chemical discourse

Proper parthood relations

Basics for a chemical ontology: 1. Objects of chemical discourse

Must be able to represent and clearly distinguish

Compounds

Classes of compound

Parts of molecules

Mixtures

Would be nice to have:

Disambiguation cues for the first three

Imidazole

An imidazole

The imidazole side-chain/group/ring

Can ChEBI handle this?

Imidazoles (!) (CHEBI:24780)

Imidazole (CHEBI:16069)

Imidazole ring: not yet

Imidazolyl group: not yet (but methyl, benzyl, etc.)

… and there are no disambiguation cues
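The coverage described on this slide can be written down as a small lookup table. The sketch below is only an illustration in Python: the `Sense` enum and the `resolve` helper are hypothetical names, the two identifiers are the ones quoted on the slide, and `None` stands for the senses marked "not yet".

```python
from enum import Enum


class Sense(Enum):
    """The three senses of a name such as "imidazole" that need telling apart."""
    COMPOUND = "compound"
    CLASS = "class of compound"
    PART = "part of a molecule"


# Identifiers as quoted on the slide; None marks a sense with no ChEBI entry yet.
IMIDAZOLE_SENSES = {
    Sense.COMPOUND: "CHEBI:16069",  # imidazole
    Sense.CLASS: "CHEBI:24780",     # imidazoles
    Sense.PART: None,               # imidazole ring / imidazolyl group
}


def resolve(sense):
    """Return the ChEBI identifier for the given sense, or None if there is none."""
    return IMIDAZOLE_SENSES[sense]


print(resolve(Sense.CLASS))
print(resolve(Sense.PART))
```

A table like this still needs something to choose the `Sense` from the surrounding text, which is where disambiguation cues come in.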

Disambiguation

One Sense per Discourse (Gale et al. 1992)

… this doesn’t hold at all

One Sense per Collocation (Yarowsky 1993)

… matches our intuitions

Disambiguation: What a one sense per collocation feature set might look like

CLASS:

w (–1) = a, an, the, this

w (0) plural (bit of a cheat, as not a collocation)

PART:

w (–1) = bridging, terminal

w (+1) = backbone, bridge, chain, core, dyad, fluorophore, fragment, framework (and many more)

w (+1) w (+2) = “building block”, “protecting group”, “side chain”
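The feature set above could be turned into a rule-based guesser roughly as follows. This is a sketch, not the method used in Project Prospect: the word lists are only the examples given on the slide, and the precedence (PART cues beat CLASS cues), the COMPOUND default, and the crude trailing-"s" plural test are all assumptions made for illustration.

```python
# Cue words taken from the slide; the real lists are longer ("and many more").
CLASS_PREV = {"a", "an", "the", "this"}
PART_PREV = {"bridging", "terminal"}
PART_NEXT = {"backbone", "bridge", "chain", "core", "dyad",
             "fluorophore", "fragment", "framework"}
PART_NEXT_BIGRAMS = {("building", "block"), ("protecting", "group"), ("side", "chain")}


def classify(tokens, i):
    """Guess the sense (PART, CLASS or COMPOUND) of the chemical name at tokens[i]."""

    def word(k):
        # Lower-cased neighbour, or "" when k falls outside the sentence.
        return tokens[k].lower() if 0 <= k < len(tokens) else ""

    prev, nxt, nxt2 = word(i - 1), word(i + 1), word(i + 2)

    # Assumption: PART cues are checked first, so "the imidazole side chain" is PART.
    if prev in PART_PREV or nxt in PART_NEXT or (nxt, nxt2) in PART_NEXT_BIGRAMS:
        return "PART"
    # Assumption: a trailing "s" is used as a rough stand-in for "w(0) plural".
    if prev in CLASS_PREV or word(i).endswith("s"):
        return "CLASS"
    # Assumption: with no cue at all, fall back to the specific compound.
    return "COMPOUND"


print(classify("the imidazole side chain".split(), 1))
print(classify("an imidazole was prepared".split(), 1))
print(classify("imidazole was added".split(), 0))
```

Whitespace tokenisation is enough for this toy input. Hyphenated forms such as "side-chain" would need a tokeniser that splits on hyphens.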

Basics for a chemical ontology: 2. Parthood relations

Parthood in ChEBI means at least three things:

Is necessarily chemically part of:

carbonyl group part_of carbonyl compounds

Is possibly chemically part of:

Lead(2+) part_of lead diacetate

(most lead(2+) isn’t)

Electron part_of muonium (!)

Is part of a mixture

Kanamycin A part_of kanamycin

Solution 1: define relationships according to the pattern: all instances of X have a relationship with some Y. (Smith et al., “Relations in biomedical ontologies”, 2005)

carbonyl compound has_part carbonyl group

Lead diacetate has_part lead(2+) (?!)

Muonium has_part electron

Kanamycin has_part kanamycin A (?!)
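The "all instances of X, some Y" pattern can be checked mechanically against instance-level data. The sketch below uses toy data with hypothetical identifiers (two lead diacetate units and three lead(2+) ions, one of which belongs to a different salt), and `holds_all_some` is a hypothetical helper. It shows why the class-level assertion only survives in the has_part direction. It does not address the doubts marked "(?!)" on the slide.

```python
def holds_all_some(instances_of_x, related, instances_of_y):
    """True if every instance of X is related to at least one instance of Y."""
    ys = set(instances_of_y)
    return all(
        any(target in ys for target in related.get(x, ()))
        for x in instances_of_x
    )


# Toy instance data; every identifier here is made up for illustration.
lead_diacetate = ["Pb(OAc)2#1", "Pb(OAc)2#2"]
lead_ions = ["Pb2+#1", "Pb2+#2", "Pb2+#3"]

has_part = {"Pb(OAc)2#1": ["Pb2+#1"], "Pb(OAc)2#2": ["Pb2+#2"]}
part_of = {
    "Pb2+#1": ["Pb(OAc)2#1"],
    "Pb2+#2": ["Pb(OAc)2#2"],
    "Pb2+#3": ["PbCl2#1"],  # a lead(2+) ion that sits in a different salt
}

# "lead diacetate has_part lead(2+)": every unit contains some lead(2+) ion.
forward = holds_all_some(lead_diacetate, has_part, lead_ions)
# "lead(2+) part_of lead diacetate": fails, because most lead(2+) isn't.
reverse = holds_all_some(lead_ions, part_of, lead_diacetate)

print(forward, reverse)
```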

Solution 2 (for discussion): Distinguish molecular-level relationships from sample-level relationships

Carbonyl compound molecule has_part carbonyl substituent

Muonium atom has_part electron

Kanamycin has_component kanamycin A

Lead diacetate has_component lead(2+) (?!)
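One way to keep the two kinds of relationship apart in data is to tag each relation with the level it applies to. The sketch below is only an illustration of that idea in Python: the `Assertion` class, the `RELATION_LEVEL` table and the `at_level` helper are hypothetical, while the four assertions are the ones listed on the slide.

```python
from dataclasses import dataclass

MOLECULAR, SAMPLE = "molecular", "sample"

# Assumed mapping: has_part is read at the molecular level, has_component at the sample level.
RELATION_LEVEL = {"has_part": MOLECULAR, "has_component": SAMPLE}


@dataclass(frozen=True)
class Assertion:
    subject: str
    relation: str
    obj: str

    @property
    def level(self):
        """Level (molecular or sample) implied by the relation used."""
        return RELATION_LEVEL[self.relation]


# The four assertions from the slide.
ASSERTIONS = [
    Assertion("carbonyl compound molecule", "has_part", "carbonyl substituent"),
    Assertion("muonium atom", "has_part", "electron"),
    Assertion("kanamycin", "has_component", "kanamycin A"),
    Assertion("lead diacetate", "has_component", "lead(2+)"),
]


def at_level(level):
    """Return the assertions whose relation belongs to the given level."""
    return [a for a in ASSERTIONS if a.level == level]


print([a.subject for a in at_level(SAMPLE)])
```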

Open questions

How do we represent the relationship between named entities and documents?

How do we integrate ontologies and word-sense disambiguation?

What is the best way of distinguishing molecules and samples?

Acknowledgements

University of Cambridge: Peter Corbett

OBO Foundry: Chris Mungall (Berkeley), Barry Smith (Buffalo)

www.projectprospect.org
