KWIC corpora as a source of specialized definitional information: a pilot study (by Antonio San Martín)

Published on January 19, 2014

Author: antsanmartin

Source: slideshare.net

Description

Presentation delivered at CEC-TAL '13 held in Montreal (Canada). September 2013.

Full paper to be published in the proceedings soon.

KWIC corpora as a source of specialized definitional information: a pilot study Antonio San Martín University of Granada, Spain

1. Introduction

Motivation: definition writing
•Definitions in other resources
•Corpus analysis
http://ecolexicon.ugr.es
What should I include in my definitions?

Assumption: The lexical units that normally co-occur with another lexical unit are potentially important for defining it.

Hypothesis: A corpus of KWIC (Key Word In Context) concordances of the concept to be defined yields a term list of potentially definitional terms for that concept.

2. Methods

The method compares analysis lists against a reference list.

2.1. Reference list
- Term list generated with TermoStat Web 3.0 (Drouin 2003): most frequent nouns, noun phrases and adjectives (4 or more occurrences).
- Source: English corpus of 133 specialized definitions of MAGMA.
- To minimize interference from terminological variation, terms in the reference list were categorized according to the conceptual proposition that they establish with MAGMA.
- Any categorization has a certain degree of subjectivity. The configuration of our reference list is the result of certain choices.

Conceptual proposition: instances from the list generated by TermoStat
- magma is a rock: rock (163), molten rock (79), rock material (17), molten rock material (10), liquid rock (4)
- magma is a material: material (37), rock material (17), molten rock material (10), molten material (8)
- magma is (a) liquid / magma is a fluid: liquid (13), fluid (6), liquid rock (4)
- magma is a mixture / magma is made of a mixture: mixture (6)
- magma is molten: molten (105), molten rock (79), molten rock material (10), molten material (8), molten state (4)
- magma is hot: hot (18), temperature (6)
- magma is mobile: mobile (6)
- magma contains gas/bubbles: gas (25), bubble (4)
- magma contains crystals: crystal (24)
- magma contains silicate: silicate (9)
- magma contains volatiles: volatile (4)
- magma contains minerals: mineral (4)
- magma undergoes solidification: solidification (6), solid (5)
- magma undergoes (partial) melting: melting (7), partial melting (6)
- magma causes intrusion: intrusion (7)
- magma causes extrusion: extrusion (6)
- magma becomes igneous rock / magma is the raw material of igneous rocks: igneous (40), igneous rock (37), raw material (4)
- magma becomes lava: lava (38)
- magma is found under the Earth’s or a planet’s surface: earth (98), surface (63), planet (5), deep (6), depth (4), underground (5)
- magma is found deep in the Earth / at depth: deep (6), depth (4)
- magma is found in the (Earth’s) crust: crust (33)
- magma is found in the upper part of the (Earth’s) mantle: mantle (20), upper (5)
- magma is erupted from a volcano: volcano (7), volcanic (7)

2.2. Analysis lists
- An English corpus of environmental texts (PANACEA corpus + LexiCon corpus), with 359 occurrences of MAGMA.
- WordSmith Tools (Scott 2008) was used to generate KWIC concordance lines with five context sizes:
  - 100c MAGMA 100c (100 characters on each side of MAGMA)
  - 250c MAGMA 250c
  - 500c MAGMA 500c
  - 750c MAGMA 750c
  - Sentences
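The character-window concordances can be approximated without a dedicated concordancer. The following is a minimal sketch, assuming plain-text input: the function name `kwic_windows` and the sample sentence are hypothetical and are not part of the toolchain used in the study, which relied on WordSmith Tools.

```python
import re

def kwic_windows(text, keyword, width):
    """Return one context string per occurrence of `keyword`,
    keeping up to `width` characters on each side of the match."""
    # Whole-word, case-insensitive match of the keyword.
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    lines = []
    for match in pattern.finditer(text):
        # Clamp the window to the boundaries of the text.
        start = max(0, match.start() - width)
        end = min(len(text), match.end() + width)
        lines.append(text[start:end])
    return lines

# Hypothetical sample text, for illustration only.
sample = ("Molten rock below the surface is called magma. "
          "When magma erupts it becomes lava.")

for line in kwic_windows(sample, "magma", 20):
    print(repr(line))
```

Calling the function with widths of 100, 250, 500 and 750 would give four corpora comparable in shape to those listed above; the sentence-based corpus would need a sentence splitter instead.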

- Each corpus was fed into TermoStat in order to obtain the most frequent nouns, noun phrases, and adjectives.
- The 50 and 100 terms with the highest raw frequency were retained for comparison with the reference list.
- Analysis lists:
  - 50-term: 100c, 250c, 500c, 750c, sentence
  - 100-term: 100c, 250c, 500c, 750c, sentence
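Retaining the N terms with the highest raw frequency can be sketched as follows. This is an illustration only: it assumes the candidate terms have already been extracted (the study used TermoStat for that step), and `top_terms` and the sample candidates are hypothetical.

```python
from collections import Counter

def top_terms(candidates, n):
    """Return the `n` most frequent candidate terms with their raw counts."""
    # Lower-case so that "Rock" and "rock" are counted together.
    counts = Counter(term.lower() for term in candidates)
    return counts.most_common(n)

# Hypothetical candidate terms, for illustration only.
candidates = ["rock", "lava", "Rock", "crust", "lava", "rock"]
print(top_terms(candidates, 2))
```

Using `n=50` and `n=100` on each corpus would produce the ten analysis lists.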

2.3. Precision and recall
P = TP / (TP + FP)
R = TP / (TP + FN)
- TP (true positive): a term in the analysis list that matches any of the categories in the reference list.
- FP (false positive): a term in the analysis list that matches no category in the reference list.
- FN (false negative): a category in the reference list that is not matched by any of the terms in the analysis list.
Each result is expressed as a percentage.
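The definitions above can be turned into a short computation. The sketch below follows them literally, so TP and FP count terms while FN counts categories; whether the study counted them in exactly this way is an assumption. The function `precision_recall`, the three-category reference excerpt, and the analysis terms "eruption" and "chamber" are hypothetical.

```python
def precision_recall(analysis_terms, reference):
    """Compute (precision, recall) as fractions.

    `reference` maps each category (conceptual proposition)
    to the set of terms that instantiate it.
    """
    analysis = set(analysis_terms)
    all_reference_terms = set().union(*reference.values())

    # TP: analysis terms that match at least one category.
    tp = len(analysis & all_reference_terms)
    # FP: analysis terms that match no category.
    fp = len(analysis) - tp
    # FN: categories not matched by any analysis term.
    fn = sum(1 for terms in reference.values() if not (terms & analysis))

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Small excerpt modelled on the reference list (illustrative only).
reference = {
    "magma is molten": {"molten", "molten rock"},
    "magma becomes lava": {"lava"},
    "magma contains crystals": {"crystal"},
}
# Hypothetical analysis list.
analysis = ["molten", "lava", "eruption", "chamber"]

p, r = precision_recall(analysis, reference)
print(f"P = {p:.2%}, R = {r:.2%}")
```

Here two of the four analysis terms match a category (TP = 2, FP = 2) and one category is left unmatched (FN = 1).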

F2 measure (Chinchor, 1992, 25), which gives twice as much importance to recall as to precision. The formula used was the following:
F2 = (5 · P · R) / (4 · P + R)
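For reference, the general F-beta measure is (1 + β²) · P · R / (β² · P + R), so with β = 2 the numerator coefficient is 5 and the denominator is 4 · P + R. A minimal sketch, using hypothetical precision and recall values rather than figures from the study:

```python
def f_beta(precision, recall, beta=2.0):
    """General F-beta measure; beta=2 weights recall above precision."""
    b2 = beta * beta
    denominator = b2 * precision + recall
    if denominator == 0:
        return 0.0
    return (1 + b2) * precision * recall / denominator

# Hypothetical values, for illustration only.
p, r = 0.5, 0.8
print(f"F2 = {f_beta(p, r):.4f}")
print(f"F1 = {f_beta(p, r, beta=1.0):.4f}")
```

Because recall is the higher of the two values in this example, F2 comes out above F1.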

3. Results


- The 100-term 250c list performed best (F2 measure: 69.08 %). Its recall ratio was also the highest (78.28 %).
- The highest precision ratio corresponded to the 50-term 100c list, but its recall ratio was 12 points below that of the 100-term 250c list.
- The SC (sentence) list obtained a lower F2 score than any of the KWIC lists.
- Once the threshold of the 250-character context was exceeded, longer contexts caused both precision and recall to decrease.

4. Conclusions and future work

‣Although the scope of this pilot study was limited, results indicate that a 250-character KWIC corpus coupled with a 100-term list generated from it could be a useful tool for definition writing.
‣The inevitable bias caused by the use of a reference list based on a manual classification does not invalidate the results.
‣This initial pilot study will subsequently be expanded to include new variables:
  ‣other kinds of definienda
  ‣verbs and adverbs in the term lists
  ‣corpora of different levels of specialization
  ‣more KWIC corpora with different character counts
  ‣comparison of the output of TermoStat with other term extractors as well as a simple keyword generator
‣Our ultimate objective is to combine our approach with the application of knowledge-pattern-based techniques (Pearson, 1998; Meyer, 2001; Malaisé et al., 2005; Marshman and L’Homme, 2006; Auger and Barrière, 2008, inter alia) to create a system of semi-automatic definitional information extraction.

Thank you asanmartin@ugr.es http://lexicon.ugr.es/sanmartin
