ekaw2006 tutorial


Published on October 21, 2007

Author: Reva

Source: authorstream.com

Slide1: Human Language Technology (HLT) and Knowledge Acquisition for the Semantic Web: a Tutorial
Diana Maynard (University of Sheffield)
Julien Nioche (University of Sheffield)
Marta Sabou (Vrije Universiteit Amsterdam)
Johanna Völker (AIFB)
Atanas Kiryakov (Ontotext Lab, Sirma AI)
EKAW 2006
[This work has been supported by SEKT (http://sekt.semanticweb.org/) and KnowledgeWeb (http://knowledgeweb.semanticweb.org/)]

Slide2: Structure of the Tutorial
- Motivation, background
- GATE overview
- Information Extraction
- GATE's HLT components
- IE and the Semantic Web
- Ontology learning with Text2Onto
- Focused ontology learning
- Massive Semantic Annotation

Aims of this tutorial
- Investigates some technical aspects of HLT for the SW and brings this methodology closer to non-HLT experts
- Provides an introduction to an HLT toolkit (GATE)
- Demonstrates using HLT for automating SW-specific knowledge acquisition tasks such as: semantic annotation, ontology learning, ontology population

Some Terminology
- Semantic annotation: annotate in the texts all mentions of instances relating to concepts in the ontology
- Ontology learning: automatically derive an ontology from texts
- Ontology population: given an ontology, populate the concepts with instances derived automatically from a text

Semantic Annotation: Motivation
- Semantic metadata extraction and annotation is the glue that ties ontologies into document spaces
- Metadata is the link between knowledge and its management
- Manual metadata production cost is too high
- State-of-the-art in automatic annotation needs extending to target ontologies and to scale to industrial document stores and the web

Challenge of the Semantic Web
- The Semantic Web requires machine-processable, repurposable data to complement hypertext
- Once metadata is attached to documents, they become much more useful and more easily processable, e.g. for categorising, finding relevant information, and monitoring
- Such metadata can be divided into two types of information: explicit and implicit

Metadata extraction
- Explicit metadata extraction involves information describing the document, such as that contained in the header information of HTML documents (titles, abstracts, authors, creation date, etc.)
- Implicit metadata extraction involves semantic information deduced from the text, i.e. endogenous information such as names of entities and relations contained in the text. This essentially involves Information Extraction techniques, often with the help of an ontology.

Ontology Learning and Population: Motivation
- Creating and populating ontologies manually is a very time-consuming and labour-intensive task
- It requires both domain and ontology experts
- Manually created ontologies are generally not compatible with other ontologies, which reduces interoperability and reuse
- Manual methods are impossible with very large amounts of data

Semantic Annotation vs Ontology Population
- Semantic Annotation: mentions of instances in the text are annotated with respect to concepts (classes) in the ontology. Requires that instances are disambiguated. It is the text which is modified.
- Ontology Population: generates new instances in an ontology from a text. Links unique mentions of instances in the text to instances of concepts in the ontology. Instances must not only be disambiguated; co-reference between them must also be established. It is the ontology which is modified.
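To make the contrast concrete, here is a minimal sketch in Java against the Jena 2.x API (an ontology toolkit GATE integrates with); the namespace, the Company class and the Ryanair instance are illustrative assumptions, not the tutorial's own code:

  // Sketch: semantic annotation vs ontology population (illustrative names).
  import com.hp.hpl.jena.ontology.Individual;
  import com.hp.hpl.jena.ontology.OntClass;
  import com.hp.hpl.jena.ontology.OntModel;
  import com.hp.hpl.jena.rdf.model.ModelFactory;

  public class PopulationSketch {
    public static void main(String[] args) {
      String ns = "http://example.org/onto#";  // assumed namespace

      // Semantic annotation: the TEXT is modified. Here reduced to a
      // stand-off record: character offsets plus the class the mention denotes.
      System.out.println("annotation: chars 0-7 -> " + ns + "Company");

      // Ontology population: the ONTOLOGY is modified. A new instance is added.
      OntModel model = ModelFactory.createOntologyModel();
      OntClass company = model.createClass(ns + "Company");
      Individual ryanair = model.createIndividual(ns + "Ryanair", company);
      System.out.println("new instance: " + ryanair.getURI());
    }
  }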
Slide10: Structure of the Tutorial (recap; next: GATE overview)

GATE: an open source framework for HLT
- GATE (General Architecture for Text Engineering) is a framework for language processing (http://gate.ac.uk)
- Open source (LGPL licence), hosted on SourceForge: http://sourceforge.net/projects/gate
- Ten years old (!), with 1000s of users at 100s of sites
- Current version 3.1

4 sides to the story
- An architecture: a macro-level organisational picture for HLT software systems
- A framework: for programmers, GATE is an object-oriented class library that implements the architecture
- A development environment: for language engineers, computational linguists et al., a graphical development environment
- A community of users and contributors

Slide13: Architectural principles
- Non-prescriptive, theory-neutral (a strength and a weakness)
- Re-use and interoperation, not reimplementation (e.g. diverse XML support; integration of Protégé, Jena, Yale, ...)
- (Almost) everything is a component, and component sets are user-extendable
- (Almost) all operations are available both from the API and the GUI

All the world's a Java Bean...
- CREOLE: a Collection of REusable Objects for Language Engineering
- GATE components: modified Java Beans with XML configuration
- The minimal component = 10 lines of Java, 10 lines of XML, 1 URL
- Why bother? It allows the system to load arbitrary language processing components
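What such a minimal component can look like, sketched against the GATE 3.x Java API; the class below is hypothetical, and the matching ~10 lines of creole.xml plus the plugin directory URL are what register it with the framework:

  // A minimal CREOLE processing resource (sketch; illustrative class name).
  import gate.creole.AbstractLanguageAnalyser;
  import gate.creole.ExecutionException;

  public class MinimalPR extends AbstractLanguageAnalyser {
    public void execute() throws ExecutionException {
      // 'document' is inherited from AbstractLanguageAnalyser;
      // a real component would read and annotate it here.
      System.out.println("Processing " + document.getName());
    }
  }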
Slide15: Notes
- Everything is a replaceable bean
- All communication via fixed APIs
- Low coupling, high modularity, high extensibility
[Diagram: GATE APIs sitting over a Language Resource layer (LRs): ontologies (Protégé), WordNet, gazetteers, ...]

In short...
GATE includes:
- plugins for language processing, e.g. parsers, machine learning tools, stemmers, IR tools, IE components for various languages, ...
- tools for visualising and manipulating ontologies
- ontology-based information extraction tools
- evaluation and benchmarking tools

GATE Users
- American National Corpus project
- Perseus Digital Library project, Tufts University, US
- Longman Pearson publishing, UK
- Merck KGaA, Germany
- Canon Europe, UK
- Knight Ridder, US
- BBN (leading HLT research lab), US
- SMEs: Melandra, SG-MediaStyle, ...
- a large number of other UK, US and EU universities
- UK and EU projects inc. SEKT, PrestoSpace, KnowledgeWeb, MyGrid, CLEF, Dot.Kom, AMITIES, CubReporter, ...

Past Projects using GATE
- MUMIS: conceptual indexing: automatic semantic indices for sports video
- MUSE: multi-genre multilingual IE
- HSL: IE in the domain of health and safety
- Old Bailey: IE on 17th century court reports
- Multiflora: plant taxonomy text analysis for biodiversity research in e-science
- EMILLE: creation of a S. Asian language corpus
- ACE / TIDES: IE competitions and collaborations in English, Chinese, Arabic, Hindi
- h-TechSight: ontology-based IE and text mining

Current projects using GATE
- ETCSL: language tools for a Sumerian digital library
- SEKT: Semantic Knowledge Technologies
- PrestoSpace: preservation of audiovisual data
- KnowledgeWeb: Semantic Web network of excellence
- MEDIACAMPAIGN: discovering, inter-relating and navigating cross-media campaign knowledge
- TAO: Transitioning Applications to Ontologies
- MUSING: SW-based business intelligence tools
- NEON: Networked Ontologies

GATE
[Screenshot of the GATE GUI]

Slide21: Structure of the Tutorial (recap; next: Information Extraction)

IE is not IR
- IE pulls facts and structured information from the content of large text collections. You analyse the facts.
- IR pulls documents from large text collections (usually the Web) in response to specific keywords or queries. You analyse the documents.

IE for Document Access
- With traditional query engines, getting the facts can be hard and slow:
  - Where has the Queen visited in the last year?
  - Which places on the East Coast of the US have had cases of West Nile Virus?
  - Which search terms would you use to get this kind of information? How can you specify that you want someone's home page?
- IE returns information in a structured way
- IR returns documents containing the relevant information somewhere (if you're lucky)

HaSIE: an example application
- Application developed by the University of Sheffield which aims to find out how companies report health and safety information
- Answers questions such as:
  - "How many members of staff died or had accidents in the last year?"
  - "Is there anyone responsible for health and safety?"
  - "What measures have been put in place to improve health and safety in the workplace?"

HaSIE
- Identification of such information is too time-consuming and arduous to be done manually; each company report may be hundreds of pages long
- IR systems can't help because they return whole documents
- The system identifies relevant sections of each document, pulls out sentences about health and safety issues, and populates a database with relevant information
- This can then be analysed by an expert

HASIE
[Screenshot of the HaSIE application]

Named Entity Recognition: the cornerstone of IE
- Identification of proper names in texts, and their classification into a set of predefined categories of interest:
  - Persons
  - Organisations (companies, government organisations, committees, etc.)
  - Locations (cities, countries, rivers, etc.)
  - Date and time expressions
  - Various other types as appropriate
Why is NE important?
- NE provides a foundation from which to build more complex IE systems
- Relations between NEs can provide tracking, ontological information and scenario building
- Tracking (co-reference): "Dr Smith", "John Smith", "John", "he"
- Ontologies: "Athens, Georgia" vs "Athens, Greece"

Two kinds of approaches
- Knowledge Engineering
  - rule-based
  - developed by experienced language engineers
  - makes use of human intuition
  - requires only a small amount of training data
  - development can be very time-consuming
  - some changes may be hard to accommodate
- Learning Systems
  - use statistics or other machine learning
  - developers do not need LE expertise
  - require large amounts of annotated training data
  - some changes may require re-annotation of the entire training corpus

Typical NE pipeline
- Pre-processing (tokenisation, sentence splitting, morphological analysis, POS tagging)
- Entity finding (gazetteer lookup, NE grammars)
- Coreference (alias finding, orthographic coreference, etc.)
- Export to database / XML

An Example
"Ryanair announced yesterday that it will make Shannon its next European base, expanding its route network to 14 in an investment worth around €180m. The airline says it will deliver 1.3 million passengers in the first year of the agreement, rising to two million by the fifth year."
- Entities: Ryanair, Shannon
- Descriptions: European base
- Relations: Shannon base_of Ryanair
- Events: investment(€180m)
- Mentions: it = Ryanair, The airline = Ryanair, it = the airline

System development cycle
1. Collect a corpus of texts
2. Manually annotate a gold standard
3. Develop the system
4. Evaluate performance against the gold standard
5. Return to step 3 until the desired performance is reached

Performance Evaluation
Two main requirements:
- Evaluation metric: mathematically defines how to measure the system's performance against the human-annotated gold standard
- Scoring program: implements the metric and provides performance measures, for each document and over the entire corpus, and for each type of NE

Evaluation Metrics
- Most common are Precision and Recall:
  - Precision = correct answers / answers produced
  - Recall = correct answers / total possible correct answers
- There is a trade-off between precision and recall; the balanced F1 measure combines them: F1 = 2PR / (P + R)
- Some tasks use other metrics, e.g. cost-based (good for application-specific adjustment)
- Ontology-based IE requires measures sensitive to the ontology
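A small numeric sketch of these definitions (plain Java; the counts are invented for illustration):

  // Precision, recall and F1 from example counts.
  public class Metrics {
    public static void main(String[] args) {
      int correct = 80, produced = 100, possible = 120;  // illustrative counts
      double precision = (double) correct / produced;    // correct / answers produced
      double recall    = (double) correct / possible;    // correct / possible correct
      double f1 = 2 * precision * recall / (precision + recall);
      System.out.printf("P=%.2f R=%.2f F1=%.2f%n", precision, recall, f1);
    }
  }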
GATE AnnotationDiff Tool
[Screenshot of the AnnotationDiff tool]

Corpus-level Regression Testing
- We need to track the system's performance over time
- When a change is made, we want to know its implications over the whole corpus
- Why: because an improvement in one case can lead to problems in others
- GATE offers the corpus benchmark tool, which can compare different versions of the same system against a gold standard
- This operates on a whole corpus rather than a single document

Corpus Benchmark Tool
[Screenshot of the corpus benchmark tool]

Slide38: Structure of the Tutorial (recap; next: GATE's HLT components)

GATE's Rule-based System - ANNIE
- ANNIE: A Nearly-New IE system; a version is distributed as part of GATE
- GATE automatically deals with document formats, saving of results, evaluation, and visualisation of results for debugging
- GATE has a finite-state pattern-action rule language, JAPE, used by ANNIE
- A reusable and easily extendable set of components

What is ANNIE?
ANNIE is a vanilla information extraction system comprising a set of core PRs:
- Tokeniser
- Gazetteers
- Sentence Splitter
- POS tagger
- Semantic tagger (JAPE transducer)
- Orthomatcher (orthographic coreference)

Slide41: Core ANNIE Components
[Diagram of the core ANNIE components]

Re-using ANNIE
- Typically a new application will use most of the core components from ANNIE
- The tokeniser, sentence splitter and orthomatcher are basically language-, domain- and application-independent
- The POS tagger is language-dependent but domain- and application-independent
- The gazetteer lists and JAPE grammars may act as a starting point but will almost certainly need to be modified
- You may also require additional PRs (either existing or new ones)

DEMO of ANNIE and the GATE GUI
- Loading ANNIE
- Creating a corpus
- Loading documents
- Running ANNIE on the corpus
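The same demo steps can be driven from GATE Embedded; a sketch roughly along the lines of the GATE 3.x user guide, where the document URL is a placeholder assumption:

  // Load ANNIE, build a corpus, run the pipeline (sketch).
  import gate.Corpus;
  import gate.Factory;
  import gate.Gate;
  import gate.ProcessingResource;
  import gate.creole.ANNIEConstants;
  import gate.creole.SerialAnalyserController;
  import java.io.File;
  import java.net.URL;

  public class RunAnnie {
    public static void main(String[] args) throws Exception {
      Gate.init();  // initialise the framework
      Gate.getCreoleRegister().registerDirectories(
          new File(Gate.getPluginsHome(), "ANNIE").toURI().toURL());

      // Loading ANNIE: assemble the standard PRs into a pipeline
      SerialAnalyserController annie = (SerialAnalyserController)
          Factory.createResource("gate.creole.SerialAnalyserController");
      for (String prName : ANNIEConstants.PR_NAMES) {
        annie.add((ProcessingResource)
            Factory.createResource(prName, Factory.newFeatureMap()));
      }

      // Creating a corpus and loading a document (placeholder URL)
      Corpus corpus = Factory.newCorpus("tutorialCorpus");
      corpus.add(Factory.newDocument(new URL("http://gate.ac.uk")));

      // Running ANNIE on the corpus
      annie.setCorpus(corpus);
      annie.execute();
    }
  }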
Gazetteers
- Gazetteers are plain text files containing lists of names (e.g. rivers, cities, people, ...)
- The information is used by JAPE rules
- Each gazetteer set has an index file listing all the lists, plus features of each list (majorType, minorType and language)
- Lists can be modified either internally using Gaze, or externally in your favourite editor
- Gazetteers can also be mapped to ontologies
- Generates Lookup results of the given kind

JAPE grammars
- JAPE is a pattern-matching language
- The LHS of each rule contains patterns to be matched
- The RHS contains details of annotations (and optionally features) to be created
- The patterns in the corpus are identified using ANNIC

Input specifications
The head of each grammar phase needs to contain certain information: the phase name, the inputs, and the matching style, e.g.:

  Phase: location
  Input: Token Lookup Number
  Control: appelt

Slide50: NE Rule in JAPE

  Rule: Company1
  Priority: 25
  (
    ({Token.orthography == upperInitial})+  // from tokeniser
    {Lookup.kind == companyDesignator}      // from gazetteer lists
  ):match
  -->
  :match.NamedEntity = { kind=company, rule="Company1" }

This will match, for example, "Digital Pebble Ltd".

LHS of the rule
- The LHS is expressed in terms of existing annotations, and optionally features and their values
- Any annotation to be used must be included in the input header
- Any annotation not included in the input header will be ignored (e.g. whitespace)
- Each annotation is enclosed in curly braces
- Each pattern to be matched is enclosed in round brackets and has a label attached

Macros
- Macros look like the LHS of a rule but have no label:

  Macro: NUMBER
  (({Digit})+)

- They are used in rules by enclosing the macro name in round brackets: ((NUMBER)+):match
- It is conventional to name macros in uppercase letters
- Macros hold across an entire set of grammar phases

Contextual information
- Contextual information can be specified in the same way, but has no label
- Contextual information will be consumed by the rule:

  ({Annotation1})
  ({Annotation2}):match
  ({Annotation3})

RHS of the rule
- The LHS and RHS are separated by -->
- The label matches that on the LHS; the annotation to be created follows the label:

  (Annotation1):match
  -->
  :match.NE = {feature1 = value1, feature2 = value2}

Example Rule for Dates

  Macro: ONE_DIGIT
  ({Token.kind == number, Token.length == "1"})
  Macro: TWO_DIGIT
  ({Token.kind == number, Token.length == "2"})

  Rule: TimeDigital1  // 20:14:25
  (
    (ONE_DIGIT | TWO_DIGIT)
    {Token.string == ":"}
    TWO_DIGIT
    ({Token.string == ":"} TWO_DIGIT)?
    (TIME_AMPM)?
    (TIME_DIFF)?
    (TIME_ZONE)?
  ):time
  -->
  :time.TempTime = {kind = "positive", rule = "TimeDigital1"}
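For intuition only, the digits-and-colons core of TimeDigital1 re-expressed as a plain Java regular expression; this is not how JAPE executes rules, and the AM/PM, time-difference and time-zone macros are omitted:

  // Illustrative regex equivalent of the core TimeDigital1 pattern.
  import java.util.regex.Matcher;
  import java.util.regex.Pattern;

  public class TimeRegexSketch {
    public static void main(String[] args) {
      // one or two digits, ':', two digits, optionally ':' and two more digits
      Pattern time = Pattern.compile("\\b\\d{1,2}:\\d{2}(:\\d{2})?\\b");
      Matcher m = time.matcher("The broadcast starts at 20:14:25 today.");
      while (m.find()) {
        System.out.println("TempTime: " + m.group());  // -> 20:14:25
      }
    }
  }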
Identifying patterns in corpora
- ANNIC: ANNotations In Context
- Provides a keyword-in-context-like interface for identifying annotation patterns in corpora
- Uses JAPE LHS syntax, except that + and * need to be quantified, e.g. {Person}{Token}*3{Organisation} finds all Person and Organisation annotations within up to 3 tokens of each other
- To use, pre-process the corpus with ANNIE or your own components, then query it via the GUI

ANNIC Demo
- Formulating queries
- Finding matches in the corpus
- Analysing the contexts
- Refining the queries

Using phases
- Grammars usually consist of several phases, run sequentially
- A definition phase (conventionally called main.jape) lists the phases to be used, in order
- Only the definition phase needs to be loaded
- Temporary annotations may be created in early phases and used as input for later phases
- Annotations from earlier phases may need to be combined or modified

Matching algorithms and Rule Priority
- Rules compete within a single phase!
- 3 styles of matching:
  - Brill (fire every rule that applies)
  - First (shortest rule fires)
  - Appelt (use of priorities)
- Appelt priority is applied in the following order:
  1. Starting point of a pattern
  2. Longest pattern
  3. Explicit priority (default = -1)

Slide61: Named Entities in GATE
[Screenshot of named entity annotation in GATE]

Using co-reference
- The orthographic co-reference module matches proper names in a document
- It improves results by assigning an entity type to previously unclassified names, based on relations with classified entities
- It may not reclassify already-classified entities
- Classification of unknown entities is very useful for surnames which match a full name, or for abbreviations, e.g. [Bonfield] will match [Sir Peter Bonfield]; [International Business Machines Ltd.] will match [IBM]
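A deliberately simplified illustration of those two matching cases (this is not GATE's orthomatcher, just the idea):

  // Toy alias matching: surname-in-full-name and acronym-from-initials.
  public class AliasSketch {
    static boolean surnameMatch(String shortName, String fullName) {
      String[] parts = fullName.split("\\s+");
      return parts[parts.length - 1].equals(shortName);  // "Bonfield" vs "Sir Peter Bonfield"
    }

    static boolean acronymMatch(String acronym, String fullName) {
      StringBuilder initials = new StringBuilder();
      for (String w : fullName.split("\\s+")) {
        if (Character.isUpperCase(w.charAt(0))) initials.append(w.charAt(0));
      }
      // "International Business Machines Ltd." -> "IBML"; allow a trailing designator
      return initials.toString().startsWith(acronym);
    }

    public static void main(String[] args) {
      System.out.println(surnameMatch("Bonfield", "Sir Peter Bonfield"));              // true
      System.out.println(acronymMatch("IBM", "International Business Machines Ltd.")); // true
    }
  }

GATE's actual module covers many more cases and exceptions than these two heuristics.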
Named Entity Coreference
[Screenshot of named entity coreference in GATE]

GATE 4.0
- Due before the end of 2006
- Faster and leaner!
- Nicer GUI
- ANNIC included
- Improved Machine Learning API (based on YALE)
- and more...

Structure of the Tutorial (recap; next: IE and the Semantic Web)

Information Extraction for the Semantic Web
- Traditional IE is based on a flat structure, e.g. recognising Person, Location, Organisation, Date, Time, etc.
- For the Semantic Web, we need information in a hierarchical structure
- The idea is that we attach semantic metadata to the documents, pointing to concepts in an ontology
- Information can be exported as an ontology annotated with instances, or as text annotated with links to the ontology

Richer NE Tagging
- Attachment of instances in the text to concepts in the domain ontology
- Disambiguation of instances, e.g. Cambridge, MA vs Cambridge, UK

Magpie: an example
- Developed by the Open University
- Plugin for a standard web browser
- Automatically associates an ontology-based semantic layer with web resources, allowing relevant services to be linked
- Provides the means for a structured and informed exploration of web resources
- e.g. looking at a list of publications, we can find information about an author such as the projects they work on, other people they work with, etc.

MAGPIE in action
[Two screenshots of Magpie]

GATE and the Semantic Web
- Supports ontologies as part of IE applications: Ontology-Based IE (OBIE)
- Supports semantic annotation and ontology population
- Can combine learning and rule-based methods
- Allows the combination of IE and IR
- Enables the use of large-scale linguistic resources for IE, such as WordNet

Ontology Management in GATE
[Screenshot of GATE's ontology tools]

Linking the Text to the Ontology
[Screenshot]

Exported Database
[Screenshot]

Evaluation for OBIE
- Traditional IE is evaluated in terms of Precision, Recall and F-measure
- But these are not sufficient for ontology-based IE, because the distinction between right and wrong is less obvious
- Recognising a Person as a Location is clearly wrong, but recognising a Research Assistant as a Lecturer is not so wrong
- Similarity metrics need to be integrated so that items closer together in the hierarchy are given a higher score, if wrong

Augmented Precision and Recall
- Development of a new BDM (Balanced Distance Metric) which compares key and response concepts with respect to a given ontology
- In the case of an ontological mismatch, it provides an indication of how serious the error is, and weights it accordingly
- BDM provides a score between 0 and 1 for each key/response match instead of a binary measure

Augmented Precision and Recall (cont.)
BDM is integrated with traditional Precision and Recall in the following way to produce a score at the corpus level:
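The integration formula itself appeared as a figure; the following is a plausible reconstruction based on the published BDM evaluation work, i.e. an assumption rather than a quote from the deck. Each key/response match contributes its BDM score (in [0,1]) instead of a binary 1, with n the number of matched pairs:

  % Augmented precision (AP) and augmented recall (AR), reconstructed.
  \[
  AP = \frac{\sum_{i=1}^{n} BDM_i}{n + \mathit{Spurious}},
  \qquad
  AR = \frac{\sum_{i=1}^{n} BDM_i}{n + \mathit{Missing}}
  \]

Note that if every BDM_i equals 1, these reduce to standard precision and recall.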
Examples of misclassification
[Figure: examples of misclassification]

Slide79: Structure of the Tutorial (recap; next: Ontology learning with Text2Onto)

Ontology Learning with Text2Onto
http://ontoware.org/projects/text2onto/
Johanna Völker (voelker@aifb.uni-karlsruhe.de), Institute AIFB, University of Karlsruhe

Agenda
- Ontology Learning: tasks, problems
- Text2Onto: overview, architecture, linguistic preprocessing, ontology learning approaches
- Summary

Ontology Learning
- Extraction of (domain) ontologies from natural language text
- Machine learning
- Natural language processing
- Tools: OntoLearn, OntoLT, ASIUM, Mo'K Workbench, JATKE, TextToOnto, ...

Ontology Learning – Tasks
[Figure: overview of ontology learning tasks]

Slide84: instance-of( Hewlett Packard, organization ); subclass-of( research, activity )

Slide85: reach( information, people ); address_in( issue, article ); subclass-of( resource, knowledge )

Ontology Learning – Problems: Text Understanding
- Words are ambiguous: 'A bank is a financial institution. A bank is a piece of furniture.' → subclass-of( bank, financial institution ) ?
- Natural language is informal: 'The sea is water.' → subclass-of( sea, water ) ?
- Sentences may be underspecified: 'Mary started the book.' → read( Mary, book_1 ) ?
- Anaphora: 'Peter lives in Munich. This is a city in Bavaria.' → instance-of( Munich, city ) ?
- Metaphors, ...

Ontology Learning – Problems: Knowledge Modeling
- What is an instance / concept? 'The koala is an animal living in Australia.' → instance-of( koala, animal ) or subclass-of( koala, animal ) ?
- How to deal with opinions and quoted speech? 'Tom thinks that Peter loves Mary.' → love( Peter, Mary ) ?
- Knowledge is changing: instance-of( Pluto, planet ) ?
- Conclusion: ontology learning is difficult; what we can learn is fuzzy and uncertain; ontology maintenance is important

Text2Onto
- Support for (semi-)automatic ontology extraction from natural language text
- Support for ontology maintenance and data-driven ontology evolution by incremental ontology learning
- Model of Possible Ontologies (POM):
  - confidence / relevance values attached to all concepts, instances and relations
  - enhanced user interaction
  - maintenance of multiple modeling alternatives in parallel
  - independence from any particular ontology language

Slide89: subclass-of( user, human ) / confidence 1.0; subclass-of( document, communication ) / confidence 0.75

Text2Onto – Evidence, Reference and Change Management
- Explicit modeling of evidence:
  - algorithms provide different types of evidence
  - explanation component
  - references for annotation and change detection
- Explicit modeling of changes: corpus, evidence, reference and ontology changes
- Future work: ontology change strategies

Text2Onto – Workflow
- Workflow composition
- Complex algorithms
- Different types of algorithms for each ontology learning task
- Flexible combination of results
- Combination strategies: minimum, maximum, average, linear, classifier, ...

Slide92: Text2Onto architecture
[Diagram: a GATE corpus feeds an algorithm controller; the POM, backed by evidence and reference stores, is accessed through a workflow manager API with POM visualization; OWL, RDFS and F-Logic writers export the resulting ontology]

Linguistic Preprocessing (GATE)
- Standard ANNIE components for tokenization, sentence splitting, POS tagging, stemming / lemmatizing
- Self-defined JAPE patterns and processing resources for stop word detection and shallow parsing
- GATE applications for English, German and Spanish

Ontology Learning Approaches: Concept Classification
- Heuristics: 'image processing software' → subclass-of( image processing software, software )
- Patterns: 'animals such as dogs', 'dogs and other animals', 'a dog is an animal' → subclass-of( dog, animal )

JAPE Patterns for Ontology Learning

  Rule: Hearst_1
  (
    (NounPhrase):superconcept
    {SpaceToken.kind == space}
    {Token.string == "such"}
    {SpaceToken.kind == space}
    {Token.string == "as"}
    {SpaceToken.kind == space}
    (NounPhrasesAlternatives):subconcept
  ):hearst1
  -->
  :hearst1.SubclassOfRelation = { rule = "Hearst1" },
  :subconcept.Domain = { rule = "Hearst1" },
  :superconcept.Range = { rule = "Hearst1" }

Ontology Learning Approaches: Instance Classification
- Context similarity: 'Columbus is the capital of the state of Ohio. Columbus has a population of about 700,000 inhabitants.'
  - Columbus: ( capital (1), state (1), Ohio (1), population (1), inhabitant (1) )
  - city: ( country (2), state (1), inhabitant (2), mayor (1), attraction (1) )
  - explorer: ( ship (1), sailor (2), discovery (1) )
  - → instance-of( Columbus, city )
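A toy version of this context-similarity idea in plain Java, using the counts from the slide: build term-count vectors and assign the instance to the concept with the highest cosine similarity (illustrative only, not Text2Onto's implementation):

  // Toy context-similarity classifier: cosine over term-count vectors.
  import java.util.Map;

  public class ContextSimilarity {
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
      double dot = 0, na = 0, nb = 0;
      for (Map.Entry<String, Integer> e : a.entrySet()) {
        dot += e.getValue() * b.getOrDefault(e.getKey(), 0);
        na  += e.getValue() * e.getValue();
      }
      for (int v : b.values()) nb += v * v;
      return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    public static void main(String[] args) {
      Map<String, Integer> columbus = Map.of("capital", 1, "state", 1, "ohio", 1,
                                             "population", 1, "inhabitant", 1);
      Map<String, Integer> city     = Map.of("country", 2, "state", 1, "inhabitant", 2,
                                             "mayor", 1, "attraction", 1);
      Map<String, Integer> explorer = Map.of("ship", 1, "sailor", 2, "discovery", 1);

      // city shares 'state' and 'inhabitant' with the Columbus context;
      // explorer shares nothing -> instance-of( Columbus, city )
      System.out.printf("city: %.2f explorer: %.2f%n",
          cosine(columbus, city), cosine(columbus, explorer));
    }
  }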
Ontology Learning Approaches: Relation Extraction
- Subcategorization frames:
  - 'Tina drives a Ford.' → instance-of( Tina, person ), instance-of( Ford, vehicle )
  - 'Her father drives a bus.' → subclass-of( father, person ), subclass-of( bus, vehicle )
  - subcat: drive( subj: person, obj: vehicle ) → drive( person, vehicle )

Slide98: incluyen( ontologías, definiciones ) / confidence 1.0  [Spanish: include( ontologies, definitions )]

Other Ontology Learning Approaches
- WordNet: Hyponym( 'bank', 'institution' ) → subclass-of( bank, institution ) ?
- Google: 'cities such as London', 'persons such as London', ... 'such as London' → instance-of( London, city ) ?
- Instance clustering: hierarchical clustering of context vectors
- Formal Concept Analysis (FCA): breathe( animal ); breathe( human ), speak( human ) → subclass-of( human, animal ) ?

Summary
- Ontology learning is difficult, because language is fuzzy and knowledge is changing
- Text2Onto targets these problems: Model of Possible Ontologies, heterogeneous sources of evidence, incremental ontology learning

Thanks!
http://www.aifb.de/WBS/jvo/ontology-learning
http://www.ontoware.org/projects/text2onto

Slide102: Structure of the Tutorial (recap; next: Focused ontology learning)

Focused Ontology Learning with GATE
Marta Sabou
A Practical Report on Learning Web Service Ontologies

Slide104: Goal of the Talk
- To describe a Semantic Web-relevant task: Focused Ontology Learning
- To exemplify this task in the context of Web Services
- To show how focused ontology learning can be implemented in GATE
- The focus of the talk is NOT ontology learning but the elements of GATE that helped to perform this task

Slide105: Outline
1) Generic problem: Focused Ontology Learning (definition and characteristics)
2) Specific problem: Learning Web Service Ontologies (context, problem scenario)
3) GATE support for: writing extraction patterns; evaluating term extraction performance

Slide106: Ontology Learning in Restricted Domains
- Focused Ontology Learning is Ontology Learning in a restricted domain, for a well-defined task; it is therefore simpler than Ontology Learning in general, and more and more frequent with the growth of the Semantic Web
- The previous talk's conclusion: generic Ontology Learning is important but difficult, because language is fuzzy and knowledge is changing
- However, the Semantic Web is increasingly used in specialized domains, where:
  - language exhibits (strong) domain characteristics, e.g. mathematics, medicine
  - the knowledge to be extracted is defined by the task for which the ontology will be used, e.g. searching patient records, accessing drug-related articles

Slide107: Focused Ontology Learning
Focused Ontology Learning characteristics:
1. A (small) corpus with special (domain/context) characteristics
2. Well-defined ontological knowledge to be extracted
3. An easy-to-detect correspondence between text characteristics and ontology elements
4. Usually an easy solution (adaptation of OL techniques)
5. Implemented/adapted by a non-NLP expert
What is needed to support domain experts?
- libraries of basic NLP tools/data structures
- tools to easily adapt/combine these NLP elements
- an intuitive way to create and debug one's own applications; usability plays an important role
- generic methodologies of ontology learning rather than hard-coded algorithms
Slide108: Outline (recap; next: the specific problem)

Slide109: Context - Semantic Web Services
- Semantic WS: semantically annotated WS, to automate discovery, composition, execution

  < rdf:ID="WS1">
    <owls:hasInput rdf:resource=" "/>
    <owls:hasInput rdf:resource=" "/>
    <owls:hasOutput rdf:resource=" "/>
  </ >

- ⇒ broad domain coverage, but an increasing number of web services

Slide110: A real life story...
- Semantic Grid middleware to support in silico experiments in biology
- Bioinformatics programs are exposed as semantic web services (600 services)
- 550 concepts, but only 125 (23%) used for SWS tasks
- Our goal: support the expert to learn, from more services, in less time, a "better" ontology (for SWS descriptions)

Slide111: FOL Characteristics - 1
1. A (small) corpus with special (domain/context) characteristics
- Data source: short descriptions of service functionalities
- Characteristics: small corpora (100/200 documents) which employ a specific style (sublanguage), e.g.:
  - "Replace or delete sequence sections."
  - "Find antigenic sites in proteins."
  - "Cai codon usage statistic."

Slide112: FOL Characteristics - 2
2. A well-defined ontology structure to be extracted. Web Service ontologies contain:
- a Data Structure hierarchy
- a Functionality hierarchy

Slide113: FOL Characteristics - 3
3. An easy-to-detect correspondence between text characteristics and ontology elements, e.g. "Replace or delete sequence sections."

Slide114: FOL Characteristics - 4
4. Usually an easy solution (adaptation of OL techniques), e.g. POS tagging
[Diagram: generic solution and its implementation]

Slide115: FOL Characteristics - 4 (cont.)
4. Usually an easy solution (adaptation of OL techniques), e.g. dependency parsing

Slide116: Outline (recap; next: GATE support)

Slide117: GATE Implementation
- Easy-to-follow extraction (step by step)
- Easy to adapt for domain engineers

Slide118: Pattern-based rules – Example

  (
    (DET)*:det
    ((ADJ) | (NOUN))*:mods
    (NOUN):hn
  ):np
  -->
  :np.NP = {}

A noun phrase consists of: zero or more determiners; zero or more modifiers, which can be adjectives or nouns; and one noun, which is the head noun.

Slide119: Outline (recap; next: evaluating term extraction performance)

Slide120: Performance Evaluation
- Pipeline: linguistic analysis → extraction patterns → ontology building → ontology pruning
- A set of important terms is extracted; terms are indicated by annotations of type NP and Funct
- The correctness of these terms has a direct influence on the correctness of the ontology building step, so evaluating them is important
- The Corpus Benchmark Tool of GATE compares annotation types in two corpora, usually the manually annotated gold standard corpus and the automatically annotated corpus
- It identifies correct, missed and spurious annotations of a given type and computes Precision and Recall for each document and for the whole corpus (a sketch of this bookkeeping follows)
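The correct/missed/spurious bookkeeping can be sketched as plain set arithmetic (an illustrative encoding of annotations as "type:start-end" keys, not the tool's actual implementation):

  // Sketch of the counting behind the Corpus Benchmark Tool.
  import java.util.HashSet;
  import java.util.List;
  import java.util.Set;

  public class BenchmarkSketch {
    public static void main(String[] args) {
      // Annotations reduced to "type:start-end" keys (illustrative offsets)
      Set<String> gold = new HashSet<>(List.of("Funct:0-13", "Funct:20-35"));
      Set<String> resp = new HashSet<>(List.of("Funct:0-13", "Funct:20-35", "Funct:40-52"));

      Set<String> correct  = new HashSet<>(gold);  correct.retainAll(resp);   // true positives
      Set<String> missed   = new HashSet<>(gold);  missed.removeAll(resp);    // false negatives
      Set<String> spurious = new HashSet<>(resp);  spurious.removeAll(gold);  // false positives

      double precision = (double) correct.size() / resp.size();  // correct / all extracted
      double recall    = (double) correct.size() / gold.size();  // correct / all gold standard
      System.out.printf("correct=%d missed=%d spurious=%d P=%.2f R=%.2f%n",
          correct.size(), missed.size(), spurious.size(), precision, recall);
    }
  }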
Slide121: Performance Evaluation – Example 1
Document 105_profit.xml; Keys: 2, Responses: 3
Text: "Scan a sequence or database with a matrix or profile."
- Gold standard annotations: Funct(scan_sequence), Funct(scan_database)
- Automatic annotations: Funct(scan_sequence), Funct(scan_database), Funct(scan_profile)
- Correct = correctly identified annotations (true positives)
- Spurious = incorrect annotations (false positives)

Slide122: Performance Evaluation – Example 2
Document 104_printsextract.xml; Keys: 1, Responses: 0
Text: "Preprocess the prints database for use with the program pscan."
- Gold standard annotations: Funct(preprocess_prints database)
- Automatic annotations: none
- Missed = unidentified annotations (false negatives)

Slide123: Performance Evaluation – Statistics
- GoldStandard_Terms vs Extracted_Terms: correct, missed, spurious
- Precision = correct / all extracted terms
- Recall = correct / all gold standard terms

Slide124: Performance Evaluation – Pros
- Very important when developing term extraction; it allows evaluating (1) the performance of the linguistic analyses and (2) the coverage of the patterns
- Allows comparing the performance of different tools, e.g. two different POS taggers
- Easy to use (both from the GUI and the command line)
- Possible improvement: the current textual output does not allow direct access to all spurious or all missed annotations (these are important when fine-tuning the extraction); we are trying to improve this usability issue through visualisation

Slide125: Summary
- Focused Ontology Learning = OL in a restricted domain
- GATE supports the development of FOL in many ways:
  - allows easy reuse and combination of basic NLP modules
  - offers software libraries for fundamental NLP data structures (documents, corpora, annotations)
  - incorporates evaluation mechanisms
  - is easy to debug and use for non-NLP experts
- Example FOL = OL for Web Services

Slide126: Structure of the Tutorial (recap; next: Massive Semantic Annotation)

KIM Platform: An Overview
Atanas Kiryakov, Ontotext Lab, Sirma AI
naso@sirma.bg
http://www.ontotext.com/kim/

Semantic Annotation: An example
"XYZ was established on 03 November 1978 in London. It opened a plant in Bulgaria in ..."
[Diagram: an ontology & KB fragment linking the mentions: classes Company, City, Country, Location; relations type, partOf, HQ, establOn; literal "03/11/1978"]

Semantic Annotation of NEs
A semantic annotation of the named entities (NEs) in a text includes:
- recognition of the type of the entities in the text out of a rich taxonomy of classes (not a flat set of 10 types)
- an identification of the entities, which is also a reference to their semantic description
The traditional (IE-style) NE recognition approach results in:

  <Person>Lama Ole Nydahl</Person>

The semantic annotation of NEs results in:

  <ReligiousPerson ID="http://..kim/Person111111">Lama Ole Nydahl</ReligiousPerson>

Platforms for Large-Scale Semantic Annotation
- Allow the use of corpus-wide statistics to improve metadata quality, e.g. disambiguation
- Automated alias discovery
- Generate SemWeb output (RDF, OWL)
- Stand-off storage and indexing of metadata (see the sketch below)
- Use large instance bases to disambiguate against
- Ontology servers for reasoning and access
- Architecture elements: crawler, ontology storage, document indexing, query, annotators
- Apps: semantic browsers, authoring tools, etc.
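"Stand-off" means the metadata lives outside the document; a minimal sketch of such a record with assumed field names, pairing a class URI with an instance URI as in the ReligiousPerson example above:

  // Minimal stand-off annotation record (illustrative field names):
  // the metadata references the document rather than being embedded in it.
  public class StandoffAnnotation {
    final String docId;        // which document the mention occurs in
    final int start, end;      // character offsets of the mention
    final String classUri;     // e.g. a ReligiousPerson class in the taxonomy
    final String instanceUri;  // e.g. "http://..kim/Person111111" (as on the slide)

    StandoffAnnotation(String docId, int start, int end,
                       String classUri, String instanceUri) {
      this.docId = docId; this.start = start; this.end = end;
      this.classUri = classUri; this.instanceUri = instanceUri;
    }
  }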
The KIM Platform
A platform offering services and infrastructure for:
- (semi-)automatic semantic annotation and ontology population
- semantic indexing and retrieval of content
- query and navigation over the formal knowledge
Based on an Information Extraction technology.
Aim: to arm Semantic Web applications by providing a metadata generation technology in a standard, consistent, and scalable framework.

KIM Architecture
[Diagram: the KIM Server exposes Semantic Repository, Semantic Annotation, Query, Index and Document Persistence APIs over RMI; it hosts an annotation server, a news collector, entity ranking and custom IE; clients include the KIM Web UI, any web browser via a browser plug-in, and custom applications/back-ends]

PROTON Ontology
- A light-weight upper-level ontology
- 250 NE classes; 100 relations and attributes; 200,000 entity descriptions
- Covers mostly NE classes, and ignores general concepts
- Includes classes representing lexical resources
- proton.semanticweb.org

KIM Scaling on Data
- The Semantic Repository is based on Sesame
- Our practical tests demonstrate good performance on top of 1.2M entity descriptions: about 15M explicit statements, above 30M statements after forward chaining
- Document and annotation storage and indexing with Lucene: 0.5M docs, processed on a $1000-worth machine; retrieval in milliseconds

Simple Usage: Highlight, Hyperlink, and ...
[Screenshot]

Simple Usage: ... Explore and Navigate
[Screenshot]

How KIM Searches Better
KIM can match the query "documents about a telecom company in Europe, John Smith, and a date in the first half of 2002" with a document containing: "At its meeting on the 10th of May, the board of Vodafone appointed John G. Smith as CTO".
Classical IR could not match:
- Vodafone with "a telecom in Europe", because Vodafone is a mobile operator, which is a sort of telecom, and Vodafone is in the UK, which is a part of Europe
- the 10th of May with "a date in the first half of 2002"
- "John G. Smith" with "John Smith"

Entity Pattern Search
[Screenshot]

Pattern Search: Entity Results
[Screenshot]

Entity Pattern Search: KIM Explorer
[Screenshot]

Pattern Search, Referring Documents
[Screenshot]

Document Details
[Screenshot]

Summary
KIM is a platform for:
- semantic annotation and ontology population
- semantic indexing and retrieval
- providing an API for remote access and integration
- based on Information Extraction (IE) using GATE
KIM is robust, scalable, and a general-purpose, off-the-shelf platform!

THANK YOU! (for not snoring)
The slides: http://www.gate.ac.uk/sale/talks/ekaw2006/ekaw2006-tutorial.ppt
[This work has been supported by SEKT (http://sekt.semanticweb.org/) and KnowledgeWeb (http://knowledgeweb.semanticweb.org/)]
