The CW Corpus PITR2013


Published on February 20, 2014

Author: mattshardlow

Source: slideshare.net

Description

The CW Corpus is a medium-sized resource containing examples of difficult words in context, along with suggested simplifications. It was produced by mining Simple Wikipedia edit histories for instances of simplification. This talk was given at the Second Workshop on Predicting and Improving Text Readability for Target Reader Populations in Sofia, Bulgaria, in 2013. An associated paper is available at: http://aclweb.org/anthology/W/W13/W13-2908.pdf

The CW Corpus: A new resource for evaluating the identification of complex words
Matthew Shardlow
The University of Manchester
http://lexicalsimplification.blogspot.co.uk

Lexical Simplification
● Complex Word Identification
– He profoundly changed.
● Substitution Generation
– Profoundly: extremely, very, deeply, acutely
● Word Sense Disambiguation
– Profoundly: extremely, very, deeply, acutely
● Synonym Ranking
– #1) deeply #2) extremely #3) acutely

Complex Words
● How do we define a Complex Word?
● Manual Definition
– Any word which impedes a reader's comprehension of a text.
● Heuristic Features
– Frequency
– Familiarity
– Length
– Context
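As an illustration of two of the heuristic features listed on this slide (frequency and length), the following is a minimal sketch and not the method used in the talk. The toy reference text, the function names and both thresholds are assumptions made up for the example; familiarity and context are not modelled.

```python
from collections import Counter

# Toy reference text standing in for a large frequency corpus (hypothetical data).
REFERENCE_TEXT = "he changed very much and he changed deeply and the very old town changed"
FREQ = Counter(REFERENCE_TEXT.split())

def word_features(word):
    """Return the length of a word and its count in the reference text."""
    return {"length": len(word), "count": FREQ[word.lower()]}

def is_complex(word, min_length=7, max_count=1):
    """Flag words that are both long and rare; the thresholds are arbitrary."""
    feats = word_features(word)
    return feats["length"] >= min_length and feats["count"] <= max_count

sentence = "He profoundly changed"
flagged = [w for w in sentence.split() if is_complex(w)]
print(flagged)
```

With this toy data only "profoundly" is flagged: "changed" is long enough but occurs three times in the reference text, and "He" is short.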

Complex Word Identification
● Important to get it right: propagation errors
– Correct: He profoundly changed → He deeply changed
– Incorrect: He profoundly changed → He profoundly turned
● No evaluation data.
● Gold standard data required.

Gold Standard Data
● Criteria for corpus entries:
– Annotated sentences.
– Coherent English.
– One complex word per sentence.
● Difficult to generate automatically.
● Expensive to generate manually.
● So, we mine Simple Wikipedia edit histories.

Simple Wikipedia Edit Histories
● Simple Wikipedia is:
– An online encyclopedia.
– Written in simplified English.
– Collaboratively edited.
– Available to download in XML format.
● Changes to articles are recorded in edit histories.
● Some changes are simplifications.

Simple Wikipedia Edit Histories
● Advantages:
– Fully automated
– High throughput
– Cost-effective
● Disadvantages:
– Content quality
– Sparsity of simplifications
– Data exhaustion

Mining – Extract Likely Candidates
● There are 2 stages to the mining process.
● Stage 1:
– 2 adjacent revisions are selected.
– A similarity score (TF-IDF) is calculated at sentence level.
– High scoring pairs are passed on.
– All other pairs are discarded.
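Stage 1 can be sketched roughly as follows. The slide names only "a similarity score (TF-IDF)", so the details here are assumptions: cosine similarity over TF-IDF vectors, a smoothed IDF computed over the sentences of both revisions, whitespace tokenisation, and a threshold of 0.5. The function names and example sentences are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(sentences):
    """Build one sparse TF-IDF vector (a dict) per tokenised sentence."""
    n = len(sentences)
    doc_freq = Counter(tok for sent in sentences for tok in set(sent))
    vectors = []
    for sent in sentences:
        tf = Counter(sent)
        # Smoothed IDF: tokens found in every sentence keep a non-zero weight.
        vectors.append({t: c * (math.log((1 + n) / (1 + doc_freq[t])) + 1)
                        for t, c in tf.items()})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0

def candidate_pairs(old_rev, new_rev, threshold=0.5):
    """Return (old_index, new_index, score) for sentence pairs scoring at or above threshold."""
    old_tok = [s.lower().split() for s in old_rev]
    new_tok = [s.lower().split() for s in new_rev]
    vecs = tfidf_vectors(old_tok + new_tok)
    old_vecs, new_vecs = vecs[:len(old_tok)], vecs[len(old_tok):]
    return [(i, j, cosine(u, v))
            for i, u in enumerate(old_vecs)
            for j, v in enumerate(new_vecs)
            if cosine(u, v) >= threshold]

old_revision = ["He profoundly changed", "The town is old"]
new_revision = ["He deeply changed", "The town is old"]
pairs = candidate_pairs(old_revision, new_revision)
for i, j, score in pairs:
    print(old_revision[i], "|", new_revision[j], "|", round(score, 3))
```

In this sketch unchanged sentences also score highly and are passed on; they would be rejected by the one-word-difference check in Stage 2.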

Mining – Validate Candidates
● There are 2 stages to the mining process.
● Stage 2: A series of checks
– One word difference.
– Real words. (not: spam / vandalism / nonsense)
– Different stems.
– Synonyms.
– Simplifying.
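The Stage 2 checks can be sketched as a chain of filters applied in the order listed. Everything below is illustrative: the vocabulary, synonym pairs and frequency counts are hypothetical toy data, the suffix stripper is a crude stand-in for a real stemmer, and treating "simplifying" as "the new word is more frequent than the old one" is an assumption, since the slides do not say how that check was implemented. The test sentences reuse the "It was a _____ evening" pattern from the deck's Example Discarded Pairs slide; the accepted pair (chilly → cool) is invented for the example.

```python
import re

# Hypothetical toy resources; a real system would need a dictionary,
# a stemmer, a thesaurus and a word-frequency list.
VOCAB = {"it", "was", "a", "cool", "cooler", "calm", "long", "chilly", "evening"}
SYNONYMS = {("chilly", "cool"), ("calm", "cool")}  # unordered pairs, stored sorted
FREQ = {"cool": 900, "calm": 950, "chilly": 120, "long": 2000, "cooler": 300}

def crude_stem(word):
    """Very rough suffix stripper, used only for illustration."""
    return re.sub(r"(er|est|ing|ed|ly|s)$", "", word)

def validate(old_sentence, new_sentence):
    """Return 'accepted' or the name of the first check that fails."""
    old_toks = old_sentence.lower().split()
    new_toks = new_sentence.lower().split()
    if len(old_toks) != len(new_toks):
        return "not a one-word difference"
    diffs = [(o, n) for o, n in zip(old_toks, new_toks) if o != n]
    if len(diffs) != 1:
        return "not a one-word difference"
    old, new = diffs[0]
    if old not in VOCAB or new not in VOCAB:
        return "not real words"
    if crude_stem(old) == crude_stem(new):
        return "same stem"
    if tuple(sorted((old, new))) not in SYNONYMS:
        return "not synonyms"
    # Assumption: a substitution simplifies if the new word is more frequent.
    if FREQ.get(new, 0) <= FREQ.get(old, 0):
        return "not simplifying"
    return "accepted"

for word in ["cuol", "cooler", "long", "calm", "chilly"]:
    print(word, "->", validate(f"It was a {word} evening", "It was a cool evening"))
```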

Analysis
● Six annotators
● Each given a 70 instance sample.
– 50 examples from the corpus (different for each).
– 20 common examples as a validation set.
● 2 annotators ruled out by validation set.
● Final corpus accuracy of 97.5%.

Experiments
● Several experiments performed so far.
● Presented at ACL Student Research Workshop.
● 3 techniques for identification were compared.
● Sophisticated strategies gave little or no improvement over a baseline.

Summary
● Identifying Complex Words is important.
● The CW Corpus lets us evaluate methods.
● Preliminary results give little improvement.

References
● Corpus: http://tinyurl.com/cwcorpus
S. Devlin and J. Tait. The use of a psycholinguistic database in the simplification of text for aphasic readers. Linguistic Databases, p 161–173, 1998.
M. Yatskar, B. Pang, C. Danescu-Niculescu-Mizil, and L. Lee. For the sake of simplicity: unsupervised extraction of lexical simplifications from Wikipedia. In HLT ’10 NAACL, p 365–368, Stroudsburg, PA, USA, 2010.

Any Questions
● Corpus: http://tinyurl.com/cwcorpus

Annotator Agreement

Annotator Index    Kappa    Sample Accuracy
1                  1        98%
2                  1        96%
3                  0.4      70%
4                  1        100%
5                  0.6      84%
6                  1        96%
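The slides do not state how the final accuracy of 97.5% was derived or what cut-off ruled out two annotators. As a quick check, assuming that annotators were kept only if their kappa was 1 and that the final figure is the mean of the retained sample accuracies, the numbers in the table are consistent with that reading:

```python
# (kappa, sample accuracy in %) per annotator, copied from the table above.
annotators = {
    1: (1.0, 98),
    2: (1.0, 96),
    3: (0.4, 70),
    4: (1.0, 100),
    5: (0.6, 84),
    6: (1.0, 96),
}

# Assumption: an annotator is retained only with a kappa of 1.
retained = {idx: acc for idx, (kappa, acc) in annotators.items() if kappa == 1.0}

# Assumption: the corpus accuracy is the mean over retained annotators.
mean_accuracy = sum(retained.values()) / len(retained)
print(sorted(retained), mean_accuracy)
```

Under these assumptions annotators 3 and 5 are excluded, which matches the "2 annotators ruled out" on the Analysis slide.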

Example Discarded Pairs
● It was a _____ evening.
● Nonsense Words (spelling correction)
– Cuol → Cool
● Different Stems (sense correction)
– Cooler → Cool
● Synonymy (meaning change)
– Long → Cool
● Simplifying
– Calm → Cool
