Published on January 19, 2014
KWIC corpora as a source of specialized definitional information: a pilot study
Antonio San Martín
University of Granada, Spain
Motivation: definition writing
• Definitions in other resources
• Corpus analysis
http://ecolexicon.ugr.es
What should I include in my definitions?
Assumption
The lexical units that normally co-occur with a given lexical unit are potentially important for defining it.
Hypothesis
Corpus of KWIC (Key Word In Context) concordances of the concept to define
Term list: potentially definitional terms for the concept to define
2. Methods Analysis list Reference list
2.1. Reference list
- Term list generated with TermoStat Web 3.0 (Drouin 2003): most frequent nouns, noun phrases and adjectives (4 or more occurrences)
- Source: English corpus of 133 specialized definitions of MAGMA.
2.1. Reference list
- To minimize interference from terminological variation, terms in the reference list were categorized according to the conceptual proposition established with MAGMA.
- Any categorization has a certain degree of subjectivity. The configuration of our reference list is the result of certain choices.
2.1. Reference list
Conceptual proposition | Instances from the list generated by TermoStat
magma is a rock | rock (163), molten rock (79), rock material (17), molten rock material (10), liquid rock (4)
magma is a material | material (37), rock material (17), molten rock material (10), molten material (8)
magma is (a) liquid / magma is a fluid | liquid (13), fluid (6), liquid rock (4)
magma is a mixture / magma is made of a mixture | mixture (6)
magma is molten | molten (105), molten rock (79), molten rock material (10), molten material (8), molten state (4)
magma is hot | hot (18), temperature (6)
magma is mobile | mobile (6)
magma contains gas/bubbles | gas (25), bubble (4)
magma contains crystals | crystal (24)
magma contains silicate | silicate (9)
magma contains volatiles | volatile (4)
magma contains minerals | mineral (4)
magma undergoes solidification | solidification (6), solid (5)
magma undergoes (partial) melting | melting (7), partial melting (6)
magma causes intrusion | intrusion (7)
magma causes extrusion | extrusion (6)
magma becomes igneous rock / magma is the raw material of igneous rocks | igneous (40), igneous rock (37), raw material (4)
magma becomes lava | lava (38)
magma is found under the Earth’s or a planet’s surface | earth (98), surface (63), planet (5), deep (6), depth (4), underground (5)
magma is found deep in the Earth / at depth | deep (6), depth (4)
magma is found in the (Earth’s) crust | crust (33)
magma is found in the upper part of the (Earth’s) mantle | mantle (20), upper (5)
magma is erupted from a volcano | volcano (7), volcanic (7)
2.2. Analysis lists
- An English corpus of environmental texts (PANACEA corpus + LexiCon corpus): 359 occurrences of MAGMA.
- WordSmith Tools (Scott 2008) was used to generate KWIC concordance lines, yielding five corpora (Nc = N characters of context on each side of MAGMA):
  100c MAGMA 100c
  250c MAGMA 250c
  500c MAGMA 500c
  750c MAGMA 750c
  Sentences
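The character-window corpora listed above can be illustrated with a minimal sketch. It is not WordSmith Tools and does not reproduce its behaviour; the function name `kwic_lines` and its parameters are hypothetical, and the sketch simply cuts a fixed number of characters on each side of every keyword match in a plain-text string.

```python
import re


def kwic_lines(text, keyword, width):
    """Return one concordance line per occurrence of `keyword`.

    Each line holds up to `width` characters of context on each side
    of the match (fewer at the start or end of the text).
    Hypothetical helper for illustration only.
    """
    # Whole-word, case-insensitive match of the keyword.
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    lines = []
    for match in pattern.finditer(text):
        start = max(0, match.start() - width)
        end = min(len(text), match.end() + width)
        lines.append(text[start:end])
    return lines


sample = "Hot magma rises. When magma cools it forms igneous rock."
for line in kwic_lines(sample, "magma", 5):
    print(line)
```

Calling the function with widths of 100, 250, 500 and 750 on the same text would give four corpora of the kind described on this slide, under the assumptions stated above.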
2.2. Analysis list
- Each corpus was fed into TermoStat in order to obtain the most frequent nouns, noun phrases, and adjectives.
- The 50 and 100 terms with the highest raw frequency were retained for comparison with the reference list.
- Analysis lists:
  50-term 100c, 50-term 250c, 50-term 500c, 50-term 750c, 50-term sentence
  100-term 100c, 100-term 250c, 100-term 500c, 100-term 750c, 100-term sentence
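The cut-off by raw frequency can be sketched as follows. This is only an illustration of keeping the top N items of a frequency count: it assumes the candidate terms have already been extracted (the part-of-speech-based extraction done by TermoStat is not reproduced), and `top_terms` and `min_freq` are hypothetical names.

```python
from collections import Counter


def top_terms(candidate_terms, n, min_freq=1):
    """Return the `n` most frequent terms as (term, frequency) pairs.

    `candidate_terms` is a flat list with one entry per occurrence.
    Terms occurring fewer than `min_freq` times are discarded.
    Hypothetical helper for illustration only.
    """
    counts = Counter(candidate_terms)
    # most_common() sorts by descending frequency.
    ranked = [(term, freq) for term, freq in counts.most_common() if freq >= min_freq]
    return ranked[:n]


occurrences = ["rock", "lava", "rock", "crust", "lava", "rock"]
print(top_terms(occurrences, 2))
```

With `n=50` and `n=100` the same call would produce lists of the two sizes compared in the study.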
2.3. Precision and recall
P = TP / (TP + FP)
R = TP / (TP + FN)
- TP (true positive): a term in the analysis list that matches any of the categories in the reference list.
- FP (false positive): a term in the analysis list that matches no category in the reference list.
- FN (false negative): a category in the reference list that is not matched by any of the terms in the analysis list.
- Results are expressed as percentages.
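A minimal sketch of these two ratios is given below. The slide does not spell out whether the TP count in the recall formula refers to matched terms or to matched categories; the sketch assumes precision is computed over terms and recall over categories, which is only one possible reading. The function name, the dictionary layout, and the exact-string matching are likewise assumptions.

```python
def precision_recall(analysis_terms, reference):
    """Compute precision and recall as percentages.

    `analysis_terms`: list of terms in the analysis list.
    `reference`: dict mapping each category (conceptual proposition)
    to the set of terms that instantiate it.
    Assumption: precision counts terms, recall counts categories.
    """
    analysis_set = set(analysis_terms)
    reference_terms = set().union(*reference.values())

    # Precision: share of analysis terms that match some category.
    tp_terms = sum(1 for term in analysis_terms if term in reference_terms)
    fp_terms = len(analysis_terms) - tp_terms
    precision = 100 * tp_terms / (tp_terms + fp_terms) if analysis_terms else 0.0

    # Recall: share of categories matched by at least one analysis term.
    matched = sum(1 for terms in reference.values() if terms & analysis_set)
    missed = len(reference) - matched
    recall = 100 * matched / (matched + missed) if reference else 0.0

    return precision, recall


reference = {
    "magma is hot": {"hot", "temperature"},
    "magma becomes lava": {"lava"},
    "magma is mobile": {"mobile"},
}
print(precision_recall(["hot", "lava", "volcano", "eruption"], reference))
```

In this toy input, two of the four analysis terms match a category and two of the three categories are matched.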
2.3. Precision and recall
- F2-measure (Chinchor, 1992, 25), which gives twice as much importance to recall as to precision. The formula used was the following:
F2 = (5 · P · R) / (4 · P + R)
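The F2 score is the β = 2 case of the general F-measure, F_β = ((1 + β²) · P · R) / (β² · P + R). A small sketch follows; the function name `f_beta` is hypothetical, and the input values in the example are invented for illustration and are not results from the study.

```python
def f_beta(precision, recall, beta=2.0):
    """General F-measure; beta=2 weights recall twice as much as precision.

    `precision` and `recall` must be on the same scale
    (both percentages or both fractions).
    """
    if precision == 0 and recall == 0:
        return 0.0  # avoid division by zero when nothing matches
    beta_sq = beta * beta
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


# Invented example values, not taken from the study.
print(round(f_beta(50.0, 100.0), 2))
```

Because recall is weighted more heavily, a list with high recall and moderate precision scores better under F2 than it would under the balanced F1.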
3. Results
- The 100-term 250c list performed best (F2: 69.08%). Its recall was also the highest (78.28%).
- The highest precision corresponded to the 50-term 100c list, but its recall was 12 points below that of the 100-term 250c list.
- The SC (sentence corpus) list obtained a lower F2 score than any of the KWIC lists.
- Beyond the 250-character context, longer contexts caused both precision and recall to decrease.
4. Conclusions and future work
Conclusions and future work
‣ Although the scope of this pilot study was limited, results indicate that a 250-character KWIC corpus coupled with a 100-term list generated from it could be a useful tool for definition writing.
‣ The inevitable bias caused by the use of a reference list based on a manual classification does not invalidate the results.
Conclusions and future work
‣ This initial pilot study will subsequently be expanded to include new variables:
  ‣ other kinds of definienda
  ‣ verbs and adverbs in the term lists
  ‣ corpora of different levels of specialization
  ‣ more KWIC corpora with different character counts
  ‣ comparison of the output of TermoStat with other term extractors as well as a simple keyword generator
Conclusions and future work
‣ Our ultimate objective is to combine our approach with the application of knowledge-pattern-based techniques (Pearson, 1998; Meyer, 2001; Malaisé et al., 2005; Marshman and L’Homme, 2006; Auger and Barrière, 2008, inter alia) to create a system of semi-automatic definitional information extraction.
Thank you firstname.lastname@example.org http://lexicon.ugr.es/sanmartin