Lecture 03: Annotation (cont.)


Published on November 15, 2007

Author: Gulkund

Source: authorstream.com

Corpus Annotation II
Martin Volk, Universität Zürich / Eurospider Information Technology AG

Overview
- Clean-Up and Text Structure Recognition
- Sentence Boundary Recognition
- Proper Name Recognition and Classification
- Part-of-Speech Tagging
- Tagging Correction and Sentence Boundary Correction
- Lemmatisation and Lemma Filtering
- NP/PP Chunk Recognition
- Recognition of Local and Temporal PPs
- Clause Boundary Recognition

Part-of-Speech Tagging
Tagging was done with the Tree-Tagger (Helmut Schmid, IMS Stuttgart). The Tree-Tagger
- is a statistical tagger,
- uses the STTS tag set (50 PoS tags and 3 tags for punctuation),
- assigns one tag to each word,
- preserves pre-set tags.

Tagging Correction
Correction of observed tagger problems:
- Sentence-initial adjectives are often tagged as noun (NN): '...liche[nr]' or '...ische[nr]' → ADJA
- Verb group patterns:
  - the verb in front of 'worden' must be a perfect participle: VVXXX + 'worden' → VVPP
  - if verb + modal verb, then the verb must be an infinitive: VVXXX + VMYYY → VVINF
- Unknown prepositions (a, via, innert, ennet)

Correction of Sentence Boundaries
E.g.: a suspected ordinal number followed by a capitalized determiner, pronoun, preposition, or adverb → insert a sentence boundary.
Open question: Could all sentence boundary detection be done after PoS tagging?

Lemmatisation
Lemmatisation was done with Gertwol (by Lingsoft Oy, Helsinki) for adjectives, nouns, prepositions, and verbs. Gertwol
- is a two-level morphology analyzer for German,
- is lexicon-based,
- returns all possible interpretations for each word form,
- segments compound words dynamically,
- analyzes hyphenated compounds only if all parts are known (e.g.
Software-Aktien but not Informix-Aktien) → feed the last element to Gertwol.

Lemma Filtering (a project by Julian Käser)
After lemmatisation: merging of Gertwol and tagger information.
- Case 1: The lemma was prespecified during proper name recognition (IBMs → IBM).
- Case 2: Gertwol does not find a lemma → insert the word form as lemma (marked with '?').
- Case 3: Gertwol finds exactly one lemma for the given PoS → insert the lemma.
- Case 4: Gertwol finds multiple lemmas for the given PoS → disambiguate and insert the best lemma.
  Disambiguation weights the segmentation symbols:
  - strong compound segment boundary: 4 points
  - weak compound segment boundary: 2 points
  - derivational segment boundary: 1 point
  The lemma with the lowest score wins!
  Example: Abteilungen → Abt~ei#lunge (5 points) vs. Ab|teil~ung (3 points)
- Case 5: Gertwol finds a lemma, but not for the given PoS → this indicates a tagger error (Gertwol is more reliable than the tagger).
  - Case 5.1: Gertwol finds a lemma for exactly one PoS → insert the lemma and exchange the PoS tag.
  - Case 5.2: Gertwol finds lemmas for more than one PoS → find the closest PoS tag, or guess.

0.74% of all PoS tags were exchanged (2% of Adj, N, V tags). In other words, ~14'000 tags per annual volume of the ComputerZeitung were exchanged. 85% are cases with exactly one Gertwol tag, 15% are guesses.

Limitations of Gertwol
Compounds are lemmatized only if all parts are known.
Idea: Use the corpus for lemmatizing the remaining compounds (e.g. kaputtreden, Waferfabriken).
Solution: If the first part occurs standing alone AND the second part occurs standing alone with a lemma, then segment and lemmatize, and store the first part as a lemma of itself!

NP/PP Chunk Recognition (a project by Dominik A.
Merz)
A pattern matcher with patterns over PoS tags. Example patterns:
- ADV ADJA --> AP
- APPR ART ADJA NN --> PP
- APPR ART AP NN --> PP
Note: The morphological information provided by Gertwol (e.g. grammatical case, number, gender) was not used!

Representation Format
The NEGRA export format
- is a line-based format,
- works with pointers for tree structure,
- comprises node labels (constituents) and edge labels (grammatical functions),
- has no provision for semantic information. Therefore, we use the comment field.

Recognition of Temporal PPs (a project by Stefan Höfler)
A second step towards semantic annotation. Starting point:
- Prepositions (3) that always introduce a temporal PP: binnen, während, zeit
- Prepositions (30) that may introduce a temporal PP: ab, an, auf, bis, ... + additional evidence
Additional evidence:
- a temporal adverb in the PP: heute, niemals, wann, ...
- a temporal noun in the PP: Minute, Stunde, Jahr, Anfang, ...
Evaluation corpus: 990 sentences with 263 manually checked temporal PPs.
Result: Precision: 81%, Recall: 76%.

Recognition of Local PPs
Starting point:
- Prepositions (3) that always introduce a local PP: fern, oberhalb, südlich von
- Prepositions (30) that may introduce a local PP: ab, auf, bei, ... + additional evidence
Additional evidence:
- a local adverb in the PP: dort, hier, oben, rechts, ...
- a local noun in the PP: Strasse, Quartier, Land, Norden, <GEO>, ...

A Word on Recall and Precision
The focus varies with the application! For my project, precision is more important than recall.
Idea: If I annotate something, then I want to be 'sure' that it is correct.
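The Case-4 disambiguation described above can be sketched as a small scoring function. This is a minimal sketch in Python, assuming Gertwol marks strong compound boundaries with '#', weak compound boundaries with '|', and derivational boundaries with '~'; the symbol-to-weight mapping is inferred from the Abteilungen example and is not confirmed by the slides.

```python
# Sketch of the lemma-filtering disambiguation (Case 4): each Gertwol
# segmentation symbol contributes a weight; the analysis with the lowest
# total score wins. Symbol meanings are assumptions, see the lead-in.
WEIGHTS = {"#": 4,   # strong compound segment boundary
           "|": 2,   # weak compound segment boundary
           "~": 1}   # derivational segment boundary

def lemma_score(analysis: str) -> int:
    """Sum the weights of all segmentation symbols in a Gertwol analysis."""
    return sum(WEIGHTS.get(ch, 0) for ch in analysis)

def best_lemma(analyses: list[str]) -> str:
    """Return the analysis with the lowest segmentation score."""
    return min(analyses, key=lemma_score)

# Worked example from the slides: 'Abteilungen'
print(lemma_score("Abt~ei#lunge"))   # 5
print(lemma_score("Ab|teil~ung"))    # 3
print(best_lemma(["Abt~ei#lunge", "Ab|teil~ung"]))  # Ab|teil~ung
```

The "lowest score wins" rule prefers derivation over compounding, which matches the example: the implausible compound reading Abt~ei#lunge loses to the derivational reading Ab|teil~ung.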
Clause Boundary Recognition (a project by Gaudenz Lügstenmann)
Definition: A clause is a unit consisting of a full verb together with its (non-clausal) complements and adjuncts. A sentence consists of one or more clauses, and a clause consists of one or more phrases.
Clauses are important for determining the cooccurrence of verbs and PPs (among other things).

Exceptions from the definition: clauses with more than one verb
- Coordinated verbs (e.g. Daten können überführt und verarbeitet werden)
- Perception verb + infinitive verb (= AcI) (e.g. die den Markt wachsen sehen)
- 'lassen' + infinitive verb (e.g. lässt die Handbücher übertragen)

Exceptions from the definition: clauses without a verb
- Elliptical clauses (e.g. in coordinated structures)
  Example: Er beobachtet den Markt und seine Mitarbeiter die Konkurrenz.

The CB recognizer is realized as a pattern matcher over PoS tags (34 patterns). Example patterns:
- Comma + relative pronoun
- Finite verb ... + conjunction + ... finite verb
Most difficult: a CB without an overt punctuation symbol or trigger word.
Example: Simple Budgetreduzierungen in der IT in den Vordergrund zu stellen <CB> ist der falsche Ansatz.

Evaluation corpus: 1150 sentences with 754 intra-sentential CBs.
Results (counting all CBs): Precision: 95.8%, Recall: 84.9%
Results (counting only intra-sentential CBs): Precision: 90.5%, Recall: 61.1%

Using a PoS Tagger for Clause Boundary Recognition
A CB recognizer can be seen as a disambiguator over commas and CB trigger tokens (if we disregard the CBs without a trigger). A tagger may serve the same purpose.
Example: ...
schrieb der Präsident,<Co> Michael Eisner,<Co> im Jahresbericht.
... schrieb der Präsident,<CB> der Michael Eisner kannte,<CB> im Jahresbericht.
Evaluation corpus: 1150 sentences with 754 intra-sentential CBs; training the Brill tagger on 75% and applying it to the remaining 25%.
Results: 93% Precision, 91% Recall. Caution: very small evaluation corpus!

Clause Boundary Recognition vs. Clause Recognition
CB recognition marks only the boundaries. It does not identify discontinuous parts of clauses.
Example:
Nur ein Projekt der Volkswagen AG, <CB> die ihre europäischen Vertragswerkstätten per Satellit vernetzen will, <CB> stößt in ähnliche Dimensionen vor.
<C> Nur ein Projekt der Volkswagen AG, <C> die ihre europäischen Vertragswerkstätten per Satellit vernetzen will, </C> stößt in ähnliche Dimensionen vor. </C>
Clause recognition should be done with a recursive parsing approach because of clause nesting.
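The pattern-matching approach to CB recognition described above can be illustrated with one of its patterns, a comma followed by a relative pronoun. This is a minimal sketch assuming STTS tag names ('$,' for comma, PRELS for relative pronouns) and pre-tagged input; the example sentence is invented for illustration, and the real recognizer uses 34 such patterns.

```python
# Sketch of one clause-boundary pattern over STTS PoS tags:
# a comma immediately followed by a relative pronoun opens a new clause,
# so a <CB> marker is inserted after the comma.

def insert_clause_boundaries(tagged: list[tuple[str, str]]) -> list[str]:
    """tagged: list of (token, STTS tag) pairs; returns tokens with <CB> marks."""
    out = []
    for i, (tok, tag) in enumerate(tagged):
        out.append(tok)
        next_tag = tagged[i + 1][1] if i + 1 < len(tagged) else None
        if tag == "$," and next_tag == "PRELS":
            out.append("<CB>")  # pattern: comma + relative pronoun
    return out

# Hypothetical pre-tagged sentence: "Er las das Buch, das sie schrieb."
sent = [("Er", "PPER"), ("las", "VVFIN"), ("das", "ART"), ("Buch", "NN"),
        (",", "$,"), ("das", "PRELS"), ("sie", "PPER"), ("schrieb", "VVFIN"),
        (".", "$.")]
print(" ".join(insert_clause_boundaries(sent)))
# Er las das Buch , <CB> das sie schrieb .
```

Note that this trigger-based matching is exactly what fails for the hardest case on the slides (Budgetreduzierungen ... zu stellen <CB> ist ...), where no punctuation or trigger word marks the boundary.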
