Lecture 05: Annotation (cont.)

Published on February 5, 2008

Author: Carmina

Source: authorstream.com

Corpus Annotation II
Martin Volk, Stockholm University

Overview
- Clean-Up and Text Structure Recognition
- Sentence Boundary Recognition
- Proper Name Recognition and Classification
- Part-of-Speech Tagging
- Tagging Correction and Sentence Boundary Correction
- Lemmatisation and Lemma Filtering
- NP/PP Chunk Recognition
- Recognition of Local and Temporal PPs
- Clause Boundary Recognition

Annotation pipeline (Slide 3)
Pipeline diagram: Tokenizer and Sentence Boundary Recognizer (input: documents; resource: abbreviation list) → Proper Name Recognizer for persons, locations, ... (resources: first name list, location list, ...) → Part-of-Speech Tagger and Lemmatiser (resource: training corpus SUC) → Swetwol morphological analyser for lemmas, tags and compounds (resources: morphological rules, lexicon).

Part-of-Speech Tagging for German
- was done with the TreeTagger (by Helmut Schmid, IMS Stuttgart).
- The TreeTagger is a statistical tagger.
- It uses the STTS tag set (50 PoS tags and 3 tags for punctuation).
- It assigns one tag to each word form.
- It preserves pre-set tags.

A statistical Part-of-Speech tagger
- learns tagging "rules" from a manually Part-of-Speech annotated corpus (= training corpus):
  Vid/PR kliniken/NN i/PR Huddinge/PM övervakas/VB nu/AB Mijailovic/PM ständigt/AB av/PR två/RG vårdare/NN.
- applies the learned "rules" to new sentences.
- Problems: words that were not in the training corpus; words with many possible tags.

Two Swedish example word forms with multiple PoS tags in SUC
- av: adverb (AB) 48 times, particle (PL) 407 times, proper name (PM) 4 times, preposition (PR) 14,580 times, foreign word (UO) 2 times
- lagar (EN: laws, or to make/repair): noun (NN) 43 times, verb (VB) 5 times

Part-of-Speech Tagging for Swedish
- is done with the TreeTagger,
- which is trained on SUC (Stockholm-Umeå Corpus; 1 million words)
- with the SUC tag set (slightly enlarged): originally 22 tags, plus VBFIN, VBINF, VBSUP, VBIMP,
- and has an estimated error rate of 4% (i.e. every 25th word is incorrectly tagged!).

Part-of-Speech Tagging with Lemmatisation
The TreeTagger also assigns lemmas that it has learned from the training corpus.
Rule: If word form W in the corpus has lemma L1 with tag T1 and lemma L2 with tag T2, then the TreeTagger will assign the lemma corresponding to the chosen tag.
Example (Swedish): låg
- ligger (EN: to lie) with VVFIN (finite full verb)
- låg (EN: low) with JJ (adjective)
A nice example of PoS tagging as word sense disambiguation.

PoS Tagging with Lemmatisation
But it is possible that word form W has more than one lemma with tag T1 in the training corpus.
Example (Swedish): kön
- kö (EN: queue), noun
- kön (EN: gender, sex), noun
The TreeTagger will simply assign to W all lemmas that go with T1 (no lemma disambiguation).

Tagging Correction in German
Correction of observed tagger problems:
- Sentence-initial adjectives are often tagged as noun (NN): '...liche[nr]' or '...ische[nr]' → ADJA
- Verb group patterns:
  - the verb in front of 'worden' must be a perfect participle: VVXXX + 'worden' → VVPP
  - if verb + modal verb, then the verb must be an infinitive: VVXXX + VMYYY → VVINF
- Unknown prepositions (a, via, innert, ennet)

Correction of sentence boundaries
E.g.: a suspected ordinal number followed by a capitalized determiner, pronoun, preposition or adverb → insert a sentence boundary. (A sketch of such correction rules follows below.)
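These corrections can be implemented as simple post-processing rules over the tagger output. The following is a minimal sketch, assuming a token list of (word, tag) pairs with STTS-style tags; the rule set shown is only a subset, and the function names are illustrative, not the project's actual code:

```python
# Minimal sketch of rule-based tagging correction over (word, tag) pairs.
# STTS-style tags assumed; rule set and names are illustrative only.

def correct_tags(tokens):
    """tokens: list of (word, tag) tuples; returns a corrected copy."""
    tokens = list(tokens)
    for i, (word, tag) in enumerate(tokens):
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        # Verb in front of 'worden' must be a perfect participle (VVPP).
        if tag.startswith("VV") and nxt and nxt[0].lower() == "worden":
            tokens[i] = (word, "VVPP")
        # Full verb directly followed by a modal verb must be an infinitive (VVINF).
        elif tag.startswith("VV") and nxt and nxt[1].startswith("VM"):
            tokens[i] = (word, "VVINF")
        # Sentence-initial '...lichen/-r' or '...ischen/-r' mis-tagged as NN -> ADJA.
        elif i == 0 and tag == "NN" and word.lower().endswith(
                ("lichen", "licher", "ischen", "ischer")):
            tokens[i] = (word, "ADJA")
    return tokens


def insert_sentence_boundaries(tokens):
    """Insert a boundary marker after a suspected ordinal number that is
    followed by a capitalized determiner, pronoun, preposition or adverb."""
    out = []
    for i, (word, tag) in enumerate(tokens):
        out.append((word, tag))
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        is_ordinal = word.endswith(".") and word.rstrip(".").isdigit()
        if is_ordinal and nxt and nxt[0][:1].isupper() and nxt[1] in (
                "ART", "PDS", "PPER", "APPR", "ADV"):
            out.append(("<S>", "SENT_BOUNDARY"))
    return out
```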
Open question: Could all sentence boundary detection be done after PoS tagging?

Lemmatisation for Swedish
is (partly) done by the TreeTagger by re-using the lemmas from SUC (Stockholm-Umeå Corpus).
Limits: word forms that are not in SUC, in particular
- names → proper name recognition
- compounds → Swetwol
- neologisms, foreign expressions → ??
Further limits:
- SUC lemmas have no compound boundaries (byskolan → byskola, konstindustriskolan → konstindustriskola)
- elliptical compounds (e.g. kostnads- och tidseffektivt) → ?? The TreeTagger ignores the hyphen.
- upper case / lower case (e.g. Bo vs. bo) → ?? The TreeTagger treats them separately.

Morphological information
- such as case, number, gender etc. is important for correct linguistic analysis.
- It could be taken from SUC, based on the triple word form – PoS tag – lemma.
Examples:
- kön – NN – kön → NEUtrum SINgular INDefinite NOMinative
- kön – NN – kö → UTRum SINgular DEFinite NOMinative
Limits: word forms that are not in SUC, and triples that have more than one set of morphological features.

Lemmatisation for Swedish
can be done with Swetwol (Lingsoft Oy, Helsinki) for
- adjectives (inflection: lyckligt – lyckliga; gradation: söt – sötare – sötaste),
- nouns (inflection: hus – husen – huset),
- verbs (inflection: arbeta – arbetar – arbetat ...).
Swetwol
- is a two-level morphology analyzer for Swedish,
- is lexicon-based,
- returns all possible interpretations for each word form:
  kön → kön N NEU INDEF SG/PL NOM
  kön → kö N UTR DEF SG NOM
- segments compound words dynamically if all parts are known: cirkusskolan → cirkus#skola
- analyzes hyphenated compounds only if all parts are known: FN-uppdraget → FN-uppdrag, but tPA-plantan → ?? (although plantan → planta) → feed the last element to Swetwol.

Lemmatisation for German
can be done with Gertwol (Lingsoft Oy, Helsinki) for
- adjectives (inflection: schöne – schönes; gradation: schöner – schönste),
- nouns (inflection: Haus – Hauses – Häuser – Häusern),
- prepositions (contraction: zum – zur – zu),
- verbs (inflection: zeige – zeigst – zeigt – zeigte – zeigten ...).
Gertwol
- is a two-level morphology analyzer for German,
- is lexicon-based,
- returns all possible interpretations for each word form,
- segments compound words dynamically,
- analyzes hyphenated compounds only if all parts are known: e.g. Software-Aktien but not Informix-Aktien → feed the last element to Gertwol.

Lemma Filtering (a project by Julian Käser)
After lemmatisation: merging of Ger/Swetwol and tagger information.
- Case 1: The lemma was prespecified during proper name recognition (IBMs → IBM).
- Case 2: Ger/Swetwol does not find a lemma → insert the word form as lemma (mark it with '?').

Lemma Filtering
- Case 3: Ger/Swetwol finds exactly one lemma for the given PoS → insert the lemma.
- Case 4: Ger/Swetwol finds multiple lemmas for the given PoS → disambiguate and insert the best lemma.
Disambiguation weights the segmentation symbols:
- strong compound segment boundary (#): 4 points
- weak compound segment boundary (|): 2 points
- derivational segment boundary (~): 1 point
The lemma with the lowest score wins!
Examples:
- Abteilungen → Abt~ei#lunge (5 points) vs. Ab|teil~ung (3 points)
- rådhusklockan → råd|hus#klocka (6 points) vs. råd#hus#klocka (8 points)
(A sketch of this scoring is given below.)

Lemma Filtering
- Case 5: Ger/Swetwol finds a lemma, but not for the given PoS → this indicates a tagger error (Ger/Swetwol is more reliable than the tagger).
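The Case 4 disambiguation above amounts to scoring each Gertwol/Swetwol segmentation and keeping the cheapest candidate. A minimal sketch, assuming the segmentations use '#', '|' and '~' as in the examples; the function names are illustrative:

```python
# Minimal sketch of Case 4 lemma disambiguation: score each candidate
# segmentation by its boundary symbols and keep the lowest-scoring lemma.
# Weights as in the slides: '#' = 4, '|' = 2, '~' = 1.

BOUNDARY_WEIGHTS = {"#": 4, "|": 2, "~": 1}

def segmentation_score(segmented_lemma):
    """Sum the weights of all segmentation symbols in a candidate lemma."""
    return sum(BOUNDARY_WEIGHTS.get(ch, 0) for ch in segmented_lemma)

def best_lemma(candidates):
    """Return the candidate segmentation with the lowest score."""
    return min(candidates, key=segmentation_score)

# Examples from the slides:
print(segmentation_score("Abt~ei#lunge"))                # 5
print(segmentation_score("Ab|teil~ung"))                 # 3
print(best_lemma(["Abt~ei#lunge", "Ab|teil~ung"]))       # Ab|teil~ung
print(best_lemma(["råd|hus#klocka", "råd#hus#klocka"]))  # råd|hus#klocka
```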
- Case 5.1: Ger/Swetwol finds a lemma for exactly one PoS → insert the lemma and exchange the PoS tag.
- Case 5.2: Ger/Swetwol finds lemmas for more than one PoS → find the "closest" PoS tag, or guess.
Option: check whether the PoS tag in the corpus was licensed by SUC. If yes, ask the user for a decision.

Lemma Filtering for German
- 0.74% of all PoS tags were exchanged (2% of adjective, noun and verb tags).
- In other words: ~14,000 tags per annual volume of the ComputerZeitung were exchanged.
- 85% are cases with exactly one Gertwol tag, 15% are guesses.

Limitations of Gertwol
- Compounds are lemmatized only if all parts are known.
- Idea: use a corpus for lemmatizing the remaining compounds. Examples: kaputtreden, Waferfabriken.
- Solution: if the first part occurs standing alone AND the second part occurs standing alone with a lemma, then segment and lemmatize, and store the first part as a lemma (of itself)!

NP/PP Chunk Recognition (a project by Dominik A. Merz)
- adapted to Swedish by Jerker Hagman, 2004
- a pattern matcher with patterns over PoS tags
- Example patterns:
  ADV ADJ --> AP
  ART AP NN --> NP
  PR NP --> PP
- (Example tree on the slide.)

Jerker Hagman's results
- 135 chunking rules
- Categories: AdjP, AdvP, MPN, Coordinated_MPN, MPN_genitive, NP, Coordinated_NP, NP_genitive, PP, VerbCluster (hade gått), InfinitiveGroup (att minska)
- Evaluation against a small treebank: 75% precision, 68% recall

Recognition of temporal PPs in German (a project by Stefan Höfler)
A second step towards semantic annotation.
Starting point:
- prepositions (3) that always introduce a temporal PP: binnen, während, zeit
- prepositions (30) that may introduce a temporal PP: ab, an, auf, bis, ... + additional evidence
Additional evidence:
- a temporal adverb in the PP: heute, niemals, wann, ...
- a temporal noun in the PP: Minute, Stunde, Jahr, Anfang, ...

Recognition of temporal PPs
Evaluation corpus: 990 sentences with 263 manually checked temporal PPs.
Result: Precision 81%, Recall 76%.

Recognition of local PPs
Starting point:
- prepositions that always introduce a local PP: fern, oberhalb, südlich von
- prepositions that may introduce a local PP: ab, auf, bei, ... + additional evidence
Additional evidence:
- a local adverb in the PP: dort, hier, oben, rechts, ...
- a local noun in the PP: Strasse, Quartier, Land, Norden, <LOC>, ...
(A sketch of this decision rule follows below.)

Recognition of temporal and local PPs
(Figure on the original slide.)
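A minimal sketch of this "always vs. may + additional evidence" decision rule, assuming the chunker has already delivered a PP as a list of (word, tag) tokens; the word lists are abbreviated from the slides and the function name is illustrative:

```python
# Minimal sketch of temporal-PP recognition over a chunked PP.
# The PP is a list of (word, tag) pairs; word lists abbreviated from the slides.

ALWAYS_TEMPORAL_PREPS = {"binnen", "während", "zeit"}
MAY_TEMPORAL_PREPS = {"ab", "an", "auf", "bis"}          # 30 in the project
TEMPORAL_ADVERBS = {"heute", "niemals", "wann"}
TEMPORAL_NOUNS = {"minute", "stunde", "jahr", "anfang"}

def is_temporal_pp(pp_tokens):
    """pp_tokens: list of (word, PoS tag) pairs, first token the preposition."""
    prep = pp_tokens[0][0].lower()
    if prep in ALWAYS_TEMPORAL_PREPS:
        return True
    if prep in MAY_TEMPORAL_PREPS:
        # Additional evidence: a temporal adverb or temporal noun inside the PP.
        for word, tag in pp_tokens[1:]:
            if tag == "ADV" and word.lower() in TEMPORAL_ADVERBS:
                return True
            if tag == "NN" and word.lower() in TEMPORAL_NOUNS:
                return True
    return False

# Example: "während der letzten Stunde" -> temporal
print(is_temporal_pp([("während", "APPR"), ("der", "ART"),
                      ("letzten", "ADJA"), ("Stunde", "NN")]))  # True
```

The local-PP recognizer works analogously, with the local preposition, adverb and noun lists (and <LOC> proper names) as evidence.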
A Word on Recall and Precision
- The focus varies with the application!
- Often, precision is more important than recall.
- Idea: if I annotate something, then I want to be 'sure' that it is correct.

Clause Boundary Recognition (a project by Gaudenz Lügstenmann)
Definition: A clause is a unit consisting of a full verb together with its (non-clausal) complements and adjuncts. A sentence consists of one or more clauses, and a clause consists of one or more phrases.
Clauses are important for determining the cooccurrence of verbs and PPs (among other things).

Example (Slide 29)
<S> Mijailovic vårdas på sjukhus <S> Anna Lindhs mördare Mijailo Mijailovic är så sjuk <CB> att han förts till sjukhus. <S> Sedan i lördags vårdas han vid rättspsykiatriska kliniken på Karolinska universitetssjukhuset i Huddinge. <S> Dit fördes han <CB> sedan en läkare vid Kronobergshäktet i Stockholm konstaterat <CB> att han det fanns risk <CB> att han skulle försöka <CB> ta livet av sig i häktet. <S> Det skriver Aftonbladet och Expressen. <S> Mijailovic, <CB> som väntar på rättegången i Högsta domstolen <CB> efter att ha dömts till sluten psykiatrisk vård och inte till fängelse, <CB> ska enligt tidningarna ha slutat ta sina tabletter <CB> och blivit starkt förvirrad. <S> Enligt Kriminalvårdsstyrelsens bestämmelser ska i sådana fall en fånge föras till sjukhus. (Dagens Nyheter, 20 Sept. 2004)

Clause Boundary Recognition
Exceptions from the definition: clauses with more than one verb.
- Coordinated verbs: Daten können überführt und verarbeitet werden.
- Perception verb + infinitive verb (= AcI): die den Markt wachsen sehen.
- 'lassen' + infinitive verb: lässt die Handbücher übertragen.

Clause Boundary Recognition
Exceptions from the definition: clauses without a verb.
- Elliptical clauses (e.g. in coordinated structures). Examples:
  Er beobachtet den Markt und seine Mitarbeiter die Konkurrenz.
  Heute kann die Welt nur mehr knapp 30 dieser früher äusserst populären Riesenbilder bewundern, drei davon in der Schweiz.

Clause Boundary Recognition
The German CB recognizer is realized as a pattern matcher over PoS tags (34 patterns). (See the sketch at the end of this part.)
Example patterns:
- comma + relative pronoun
- finite verb ... + conjunction + ... finite verb
Most difficult: a CB without an overt punctuation symbol or trigger word.
Example: Simple Budgetreduzierungen in der IT in den Vordergrund zu stellen <CB> ist der falsche Ansatz.
This happens often in Swedish.

Clause Boundary Recognition for German
Evaluation corpus: 1150 sentences with 754 intra-sentential CBs.
- Results (counting all CBs): Precision 95.8%, Recall 84.9%
- Results (counting only intra-sentential CBs): Precision 90.5%, Recall 61.1%

Using a PoS Tagger for Clause Boundary Recognition in German
A CB recognizer can be seen as a disambiguator over commas and CB-trigger tokens (if we disregard the CBs without a trigger). A tagger may serve the same purpose.
Example:
... schrieb der Präsident,<Co> Michael Eisner,<Co> im Jahresbericht.
... schrieb der Präsident,<CB> der Michael Eisner kannte,<CB> im Jahresbericht.

Using a PoS Tagger for Clause Boundary Recognition in German
Evaluation corpus: 1150 sentences with 754 intra-sentential CBs.
Training the Brill tagger on 75% and applying it to the remaining 25%.
Results: 93% precision, 91% recall.
Caution: very small evaluation corpus!

Clause Boundary Recognition vs. Clause Recognition
CB recognition marks only the boundaries. It does not identify discontinuous parts of clauses, and it does not identify nesting.
Example:
<S> Mijailovic, <CB> som väntar på rättegången i Högsta domstolen <CB> efter att ha dömts till sluten psykiatrisk vård och inte till fängelse, <CB> ska enligt tidningarna ha slutat ta sina tabletter <CB> och blivit starkt förvirrad.
<C> Mijailovic, <C> som väntar på rättegången i Högsta domstolen <C> efter att ha dömts till sluten psykiatrisk vård och inte till fängelse, </C></C> ska enligt tidningarna ha slutat ta sina tabletter </C><C> och blivit starkt förvirrad.</C>
Clause Recognition should be done with a recursive parsing approach because of clause nesting.
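Returning to the pattern matcher described a few slides back: the comma-plus-trigger patterns can be pictured as a small matcher over PoS-tagged tokens. A minimal sketch, assuming STTS-style tags (PRELS for relative pronouns, KOUS for subordinating conjunctions); the two rules shown are only a fraction of the 34 patterns, and all names are illustrative:

```python
# Minimal sketch of clause boundary (CB) insertion as pattern matching
# over PoS-tagged tokens. Only two illustrative patterns are shown;
# the actual recognizer uses 34 patterns.

def insert_clause_boundaries(tokens):
    """tokens: list of (word, tag) pairs; returns tokens with '<CB>' markers."""
    out = []
    for i, (word, tag) in enumerate(tokens):
        out.append((word, tag))
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        if word == "," and nxt:
            # Pattern 1: comma + relative pronoun -> clause boundary.
            # Pattern 2: comma + subordinating conjunction -> clause boundary.
            if nxt[1] in ("PRELS", "KOUS"):
                out.append(("<CB>", "CB"))
    return out

# Example: "... schrieb der Präsident , der Michael Eisner kannte , ..."
tagged = [("schrieb", "VVFIN"), ("der", "ART"), ("Präsident", "NN"),
          (",", "$,"), ("der", "PRELS"), ("Michael", "NE"),
          ("Eisner", "NE"), ("kannte", "VVFIN"), (",", "$,"),
          ("im", "APPRART"), ("Jahresbericht", "NN")]
print(insert_clause_boundaries(tagged))
```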
Summary
- Part-of-Speech tagging based on statistical methods is robust and reliable.
- The TreeTagger assigns PoS tags and lemmas.
- Swetwol is a morphological analyser that, given a word form, outputs the PoS tag, the lemma and the morphological features for all of its readings.
- Multiple knowledge sources (e.g. the PoS tagger and Swetwol) may lead to conflicting tags.
- Chunking (partial parsing) builds partial trees.
- Clause boundary detection can be realized as pattern matching over PoS tags.
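All of the evaluation figures quoted above (chunking, PP recognition, clause boundary recognition) are precision/recall scores. As a reminder of how they are computed, here is a small worked example; the counts are invented, but chosen so that the scores mirror the temporal-PP figures above (81% precision, 76% recall):

```python
# Precision and recall from invented annotation counts (illustrative only).

def precision(true_positives, false_positives):
    """Share of system annotations that are correct."""
    return true_positives / (true_positives + false_positives)

def recall(true_positives, false_negatives):
    """Share of gold annotations that the system found."""
    return true_positives / (true_positives + false_negatives)

# Suppose a recognizer marks 100 PPs, 81 of which are correct,
# and the gold standard contains 107 PPs in total.
tp, fp, fn = 81, 19, 26
print(f"Precision: {precision(tp, fp):.0%}")  # 81%
print(f"Recall:    {recall(tp, fn):.0%}")     # 76%
```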
