umist


Published on February 4, 2008

Author: Marco1

Source: authorstream.com

The METER Corpus: A corpus for analysing journalistic text reuse
Robert Gaizauskas (1), Jonathan Foster (2), Yorick Wilks (1), John Arundel (2), Paul Clough (1), Scott Piao (1)
(1) Department of Computer Science, (2) Department of Journalism, University of Sheffield
UMIST Seminar, November 16, 2001

Outline of Talk:
The METER Project and the METER Corpus
Text Reuse in the British Press
Construction of the Corpus
Structure of the Corpus
Annotation of the Corpus
Preliminary Experiments with the Corpus
Conclusion/Discussion

The METER Project and the METER Corpus:
The MEasuring TExt Reuse (METER) project aims
  to investigate how text is reused in the production of newspaper articles from newswire sources
  to determine whether algorithms can be discovered to detect and quantify such reuse automatically
From this we hope to gain broader insights into the nature of text derivation and paraphrase
  the newspaper-newswire scenario provides an ideal initial case study
  the newspaper-newswire scenario has considerable potential application
To assist in this study we have constructed the METER corpus, containing
  newswire source texts
  newspaper articles reporting the same stories
    some derived from the newswire texts
    some not derived from the newswire texts

The Text Derivation Game:
(Figure: three texts labelled A?, C? and B?)
Text Reuse in the British Press:
The Press Association (PA) is the national news agency for the UK and Ireland
  provides regional, national and international news 24 hours a day, 365 days a year, to media customers throughout Britain and abroad
  sources 1,500 news, sport and feature stories daily
  also supplies finance, arts and entertainment and television listings, and materials for websites, magazines and periodicals
PA performs a critical function for the British media in setting the news agenda
  widely regarded as a credible, authoritative and trustworthy journalistic source
PA is widely reused
  directly: cut and paste; paraphrase
  indirectly: fact checking; “copy tasting”

Text Reuse in the British Press: Example:

The Times:
Eamon Reidy, 32, a drink-driver who rammed into Queen Elizabeth the Queen Mother's Daimler, was fined £700 and banned from driving for two years. The Queen Mother was not in the car when the accident happened on July 4 in Surrey.

The Telegraph:
A driver was almost three times over the limit when he crashed into Queen Elizabeth the Queen Mother's Daimler then fled, a court was told yesterday. Eamon Reidy, 32, reversed away but crashed his Citroen BX into a wall at Egham, near Windsor Great Park, Surrey. He then ran off and was caught after a mile-and-a-half chase.

The Mirror:
A BOOZY driver who smashed into the Queen Mum's chauffeur-driven Daimler minutes after she had been dropped off was banned for two years and fined £700 yesterday. Eamon Reidy, 32, fled across fields in Windsor Great Park after the crash, the court heard. Grandad John Horton, 56, head gardener on the royal estate, chased him in his slippers for one and a half miles as armed cops, dogs and helicopter joined in the pursuit. John caught up with the fugitive and grabbed his arm.
But when Reidy threatened him - "he decided discretion was the better part of valour and let him go," Woking magistrates were told. Police discovered airport worker Reidy lying in undergrowth near the Queen Mum's Royal Lodge on the Crown estate. He was found to be two-and-a-half times over the legal limit. Reidy, of Langley, Berks, admitted drink-driving and failing to stop.

The Sun:
A DRUNK driver who ploughed into the Queen Mother's limo was fined £700 and banned for two years yesterday. Eamon Reidy, 32, was 2½ times over the legal limit when he rammed the parked Daimler in a country lane. The Queen Mum - 99 last week - was not in the car at the time but her chauffeur was. Airport worker Reidy sped off after the smash near Egham, Surrey, on July 4. He glanced off a wall and flattened some bushes before abandoning his Citroen.
Chased
Then he ran 1½ miles across fields chased by crash witness John Horton. Mr Horton finally cornered him - but Reidy threatened him and fled. Reidy, of Langley, Berks, tried to hide in some undergrowth. But he was spotted by a police helicopter and arrested, magistrates in Woking, Surrey, heard. Defending, Lesley Barry said Reidy was trying to buy a house and had money worries. He had drunk two glasses of champagne at his parents' wedding anniversary party before drinking three pints of strong lager at a pub.

PA version:
A drink-driver who ran into the Queen Mother's official Daimler was fined £700 and banned from driving for two years today. Eamon Reidy, 32, was two-and-a-half times over the drink-drive limit when he rammed the royal car, magistrates in Woking, Surrey, were told. The 99-year-old Queen Mother was not in the vehicle when the accident happened on July 4 in Bishopsgate, Egham, Surrey. Magistrates were told that Reidy sped off before abandoning his car, running across fields and hiding in undergrowth until he was spotted by the police helicopter.
Prosecuting Robin Bowen said: "At 8pm the defendant was driving towards Englefield Green in a black Citroen BX and collided with a Daimler limousine, a vehicle which was used on a daily basis by the Queen Mother. She was not in it at the time. It was being driven by a chauffeur
(+ 11 sentences)

The Star:
A DRUNK driver who crashed into the back of the Queen Mum's limo was banned for two years yesterday. Airport worker Eamon Reidy, 32, was nearly three times the drink-drive limit when he hit the royal Daimler after a two-and-a-half hour session in the pub. He reversed his black E-reg Citroen BX after the crash and hit a wall before fleeing the crash scene. But he was chased for a mile-and-a-half by a passer-by who gave police a description of the Citroen driver. A helicopter and armed police were drafted into the search and Reidy was found hiding in bushes. The Queen Mother, who uses the Daimler daily, was not in the car when it was hit. Reidy refused to comment after the case at Woking magistrates' court. He hit the chauffeur-driven car, registration NLT 2, in Bishopsgate Road, Egham, Surrey, last month. Head gardener John Horton, 56, chased Reidy, who told his pursuer to leave him alone or he would "have him". Reidy was found in bushes by police, but ran off again before he was finally arrested.
(+ 11 sentences)

Text Reuse in the British Press: Utility of Measuring Reuse:
Like most newswire agencies, PA does not monitor uptake or dissemination of the copy it releases, because it lacks
  tools
  technologies
  a conceptual framework for measuring reuse
Potential applications of accurately measuring reuse include:
  monitoring of source take-up to identify unused or little used stories
  identifying the most reused stories within the British media
  determining customer dependencies on PA copy
  new methods for charging customers based upon the amount of copy reused

Construction of the Corpus:
Texts of the METER corpus were collected manually from
  the PA online service
  the paper editions of nine British newspapers: The Sun, Daily Mirror, Daily Star, Daily Mail, Daily Express, The Times, The Daily Telegraph, The Guardian and The Independent
Scope of corpus is limited to two domains
  British law court reporting
  show business stories
Court stories
  substantial amount of data in newspapers and PA
  regular recurrence in British news
  revolve around “facts” (name of the accused, charge, etc.), with limited scope for journalistic interpretation
Show business
  more expansive style, with greater freedom of expression/interpretation
  more frivolous, light-hearted manner

Construction of the Corpus (cont):
Temporal extent of corpus is limited to
  24 days for court domain
  13 days for show business domain
Spread over 1 year period from July 1999 to June 2000
PA stories are classified
  into broad categories: Courts, Showbiz
  stories within these categories are called catchlines, e.g. Courts(Axe), Courts(Strangle), Courts(Gamekeeper)
  updates for each catchline, called PA pages, appear throughout the day
For each selected catchline
  all PA pages downloaded
  final Southern paper editions of 9 dailies from next day examined
Selected newspaper articles were scanned and spell-corrected

Construction of the Corpus: Statistics:
(slide content not preserved in this transcript)

Construction of the Corpus: Story Overlap:
(slide content not preserved in this transcript)

Structure of the Corpus:
(Figure: directory tree of the corpus. Labels recoverable from the slide: meter corpus; PA and newspapers; rawtext and annotated; Courts and Showbiz; dates 12.07.99 ... 21.06.00; Catch line 1 ... Catch line N; Page 1 ... Page N; Newspaper 1 ... Newspaper N; "Lowest level of alignment".)

Annotation of the Corpus:
The METER corpus is annotated at two levels:
  the document level: indicating degree of derivation from PA
  the word sequence level: indicating extent of text reuse
All annotations were carried out by a single professional journalist
Second judgments are being collected for 5% of the material to validate the annotations

Annotation of the Corpus: Classification at the Document Level:
Each document in the newspaper portion of the corpus is classified to indicate its derivational relation to the PA:
  Wholly derived (WD): all content of the target text is derived only from the PA source text
  Partially derived (PD): some content of the target text is derived from the source text. Other sources have also been used
  Non-derived (ND): no content of the target text is derived from the source text.
Although verbatim and rewritten text may appear in the target text, the context, overlap of entities or use of source text is not indicative of reuse

Annotation of the Corpus: Classification at the Word Sequence Level:
About half of the newspaper texts (~450) are annotated at the level of word sequences
  Verbatim: text that is reused from PA word-for-word in the same context
  Rewrite: text that is reused from PA, but paraphrased to create a different surface appearance. The context is still the same
  New: text that does not appear in PA, or that appears to be verbatim or rewritten but is used in a different context

Annotation of the Corpus: DTD:
<meterdocument> (required)
Attributes:
  filename: filename of the text (required)
  newspaper: the newspaper name (required)
  domain: courts or showbiz (required)
  classification: either wholly-derived, partially-derived or non-derived (optional)
  pagenumber: the newspaper page number (optional)
  date: the date of publication (required)
  catchline: the catchline as given by the journalist (required)

Annotation of the Corpus: DTD -- Example:

Original PA version:
BANKER'S BITTERNESS LED TO SYSTEMATIC THEFTS
By Lyndsay Moss, PA News
A middle-aged banker who stole more than £270,000 from his bosses because he resented younger staff being promoted over his head, was jailed for four years today. Trusted Derek Boe, 48, used some of the money to splash out on holidays, buy a car and a caravan, and pay for expensive home improvements.

Telegraph version:
A BANKER who stole more than £270,000 from his bosses because he resented younger staff being promoted over his head, was jailed for four years yesterday. Derek Boe, 48, used some of the money for holidays, to buy a car and a caravan, and to pay for home improvements.
Annotated Telegraph version:
<!DOCTYPE meterdocument SYSTEM "meter_corpus/dtds/meter.dtd">
<meterdocument filename="meter_corpus/newspapers/annotated/courts/16.07.99/banker/banker125_telegraph.sgml" newspaper="telegraph" domain="courts" classification="wholly-derived" pagenumber="4" date="16.07.99" catchline="banker">
<body>
<verbatim PA_src="">A </verbatim>
<verbatim PA_src="">BANKER who stole more than </verbatim>
<rewrite PA_src="">£270,000 </rewrite>
<verbatim PA_src="">from his bosses because he resented younger staff being promoted over his head, was jailed for four years </verbatim>
<rewrite PA_src="">yesterday. </rewrite>
<verbatim PA_src="">Derek Boe, 48, used some of the money </verbatim>
<rewrite PA_src="">for </rewrite>
<verbatim PA_src="">holidays, </verbatim>
<rewrite PA_src="">to </rewrite>
<verbatim PA_src="">buy a car and a caravan, and </verbatim>
<rewrite PA_src="">to </rewrite>
<verbatim PA_src="">pay for </verbatim>
<verbatim PA_src="">home improvements. </verbatim>
</body>
</meterdocument>

Preliminary Experiments with the Corpus:
Initial experiments are underway to explore techniques for detecting whether a candidate reused text is wholly derived, partially derived or non-derived from a PA source text.
Techniques being investigated include:
  Dotplot
  Information retrieval text similarity measures (tf.idf)
  Word n-gram overlap measures
    50-70% correct identification of document level classification
  Statistical alignment techniques
    80-90% correct identification of document level classification

Dotplot - visualising patterns of reuse (1):
[1] J. Helfman, Dotplot: a program for exploring self-similarity in millions of lines of text and code, Journal of Computational and Graphical Statistics, 2(2), pp. 153-174, June 1993.
Can be used to visually identify derived newspaper stories using patterns formed from matching verbatim text.
Using a simple combination of n-grams and Dotplot [1].
Dotplot is immune to change in word order (not substitution).
Useful for displaying relationships between long texts, biological subsequences or software programs.
Specific Dotplot “patterns” indicate relationships between the sequences analysed: long diagonal lines imply verbatim text in the same order; blocks indicate verbatim blocks re-ordered.
No quantitative score is given, and it relies on human analysis.

Dotplot - visualising patterns of reuse (2):
(Figure: dotplots for unrelated texts, non-derived texts and derived texts. Annotations on the plots: one part shows self-reuse of PA (PA duplicated text); another part shows any reuse between X and PA; quite obvious diagonals imply verbatim reuse.)

Information Retrieval Approach:
Used Okapi [1], a state-of-the-art Information Retrieval system.
Probabilistic approach using the Best Match (BM) set of similarity operators; tested BM25.
Indexed all newspaper articles (removed function words and stemmed).
Used PA copy as the query: removed function words and selected n words (using tf*idf weighting based upon the index).
Calibrated the BM25 scores between PA query and returned newspaper documents, to enable classification.
Would like to try a vector-space approach for comparison.
[1] S. Robertson et al., Okapi at TREC-7: automatic ad hoc, filtering, VLC and interactive track, NIST Special Publication 500-242 (TREC-7), pp. 253-264, 1998.

N-grams:
Extracted unique word n-grams from newspaper and PA texts.
Hypothesis: derived texts will share longer n-grams than non-derived texts.
Compared PA and newspaper texts using set-theoretic association scores: Dice, Jaccard, resemblance, containment, cosine.
Compared texts for n-grams between 1 and 10 words.
Tried removing function words, morphological analysis and removing direct quotes.
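The n-gram comparison described on this slide can be sketched roughly as follows. This is a minimal illustration, not the project's code: the function names, the lowercased whitespace tokenisation and the choice of three of the listed scores (containment, Dice, Jaccard) are assumptions made here. The two input strings are fragments of the PA and Telegraph banker story quoted earlier in the talk.

```python
from typing import Set, Tuple

NGramSet = Set[Tuple[str, ...]]


def word_ngrams(text: str, n: int) -> NGramSet:
    """Unique word n-grams of a text (assumed tokenisation: lowercase, split on whitespace)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def containment(source: NGramSet, target: NGramSet) -> float:
    """Share of the target's n-grams that also occur in the source."""
    return len(source & target) / len(target) if target else 0.0


def dice(a: NGramSet, b: NGramSet) -> float:
    """Dice coefficient of two n-gram sets."""
    return 2 * len(a & b) / (len(a) + len(b)) if (a or b) else 0.0


def jaccard(a: NGramSet, b: NGramSet) -> float:
    """Jaccard coefficient of two n-gram sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Fragments from the banker example (PA source vs. Telegraph version).
pa = "was jailed for four years today"
telegraph = "was jailed for four years yesterday"

for n in (1, 3):
    src, tgt = word_ngrams(pa, n), word_ngrams(telegraph, n)
    print(n, containment(src, tgt), dice(src, tgt), jaccard(src, tgt))
```

Containment is asymmetric (it asks how much of the newspaper text is covered by PA), whereas Dice and Jaccard treat both texts alike; which is preferable for the derivation task is left open here.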
Found the hypothesis to be true, except that quotes (direct and indirect) cause problems (unexpected 10-grams in non-derived texts) and rewriting leaves no 10-grams in some wholly-derived texts.
Experimenting with approximate word matching, e.g. edit distance, to allow more 10-grams to match in WD texts.

Longest common substrings:
Extract the longest common substrings between PA and newspaper texts.
Hypothesis: derived texts will have more long substrings than non-derived texts.
Using Greedy String Tiling [1], an efficient and popular method used in plagiarism detection.
Finds a maximal tiling between two texts.
Greedy because longer matches are preferred. Among matches of the same length, it takes the first it finds.
Gets over limitations of longest common subsequence in that re-ordering does not affect the GST algorithm.
Used the longest common substring length, the mean tile length, the standard deviation of tile lengths and the PA and newspaper file lengths as features to a classifier.
[1] M. Wise, YAP3: improved detection of similarities in computer programs and other texts, presented at SIGCSE'96, pp. 130-134, 1996.

Alignment:
The METER task is construed as a translation task, from the source to derived texts.
First align sentences between the candidate source and derived texts.
Align by pair-wise comparison between all sentences using a similarity score; some sentences may fail to be aligned.
Based on the successfully aligned sentences, estimate the probability that the document is derived from the PA.
Combined with a machine learning classifier, the candidate derived texts are classified into WD, PD or ND.

Comparison of initial results - courts domain:
Randomly selected 166 files from all domains (used all WD files).
Used same files for all approaches.
Used results as features to Naïve Bayes classifier.
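The Greedy String Tiling step from the "Longest common substrings" slide can be sketched as below. This is a simplified, unoptimised illustration and not the implementation used in the experiments: the function name, the word-level tokens and the `min_match` threshold of 3 are assumptions made here, and the hashing speed-ups of published implementations are omitted. The inputs are the second sentences of the PA and Telegraph banker story, with punctuation removed by hand.

```python
from statistics import mean
from typing import List, Tuple

Tile = Tuple[int, int, int]  # (start in source, start in target, length in tokens)


def greedy_string_tiling(src: List[str], tgt: List[str], min_match: int = 3) -> List[Tile]:
    """Cover src and tgt with non-overlapping common substrings, longest first."""
    src_used = [False] * len(src)
    tgt_used = [False] * len(tgt)
    tiles: List[Tile] = []
    while True:
        best_len = min_match
        candidates: List[Tile] = []
        # Find the longest matches that use only tokens not yet inside a tile.
        for i in range(len(src)):
            for j in range(len(tgt)):
                k = 0
                while (i + k < len(src) and j + k < len(tgt)
                       and src[i + k] == tgt[j + k]
                       and not src_used[i + k] and not tgt_used[j + k]):
                    k += 1
                if k > best_len:
                    best_len, candidates = k, [(i, j, k)]
                elif k == best_len:
                    candidates.append((i, j, k))
        placed = False
        for i, j, k in candidates:
            # Among equally long matches the first one found wins; later
            # ones that overlap it are skipped.
            if any(src_used[i:i + k]) or any(tgt_used[j:j + k]):
                continue
            for off in range(k):
                src_used[i + off] = True
                tgt_used[j + off] = True
            tiles.append((i, j, k))
            placed = True
        if not placed:
            return tiles


pa_tokens = "used some of the money to splash out on holidays buy a car and a caravan".split()
tel_tokens = "used some of the money for holidays to buy a car and a caravan".split()

tiles = greedy_string_tiling(pa_tokens, tel_tokens)
lengths = [length for _, _, length in tiles]
print(tiles)                        # tiles in the order they were placed
print(max(lengths), mean(lengths))  # two of the tile-length features named on the slide
```

Because tiles are placed independently of their position, a block that has been moved within the newspaper text is still matched, which is the re-ordering property the slide contrasts with longest common subsequence.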
Comments on the results:
Alignment gives the best results, but it is not clear if they are significantly better.
Although alignment is the best, the very simple n-gram approach is close behind.
N-gram intuition correct: WD share more long n-grams than ND, but for classification not enough examples of 10-grams exist. Therefore 1-grams give the best classification.
Direct quotes and rewriting cause problems with exact matching.
Problem in classification is between the PD and ND classes. WD separates well whichever method is used.
Combining the features from different approaches together with some computational approximations to human judgements might work best, e.g. use longest strings, distribution of matches, types of matches etc.

Conclusion/Discussion (1):
Have presented the METER corpus
  first corpus to attempt to support the study of (legitimate) text reuse
  first corpus to attempt to systematically align source/derived text in the journalistic world
Texts are derived from two domains (Courts and Showbiz) over a period of one year
Texts are annotated at two levels
  Document level: a coarse indication of derivation/reuse
  Word sequence level: a fine-grained indication of derivation/reuse

Conclusion/Discussion (2):
Corpus is limited in terms of
  Scope (2 domains only)
  Temporal extent (36 days over 1 year only)
  Size (1717 stories in total)
  Annotation content (no links back to source texts)
  Annotation accuracy (one annotator; evolving conception of annotation guidelines)
Primary purpose is to serve as a pilot: if useful/interesting, subsequent versions or related corpora can be created
Distribution through ELRA/LDC underway

Conclusion/Discussion (3):
Experiments with various string matching algorithms have been carried out with a view to automatically classifying texts as
  wholly derived (WD)
  partially derived (PD) or
  non-derived (ND)
from PA source
Results suggest the WD-ND distinction can be quite accurately captured, with alignment techniques apparently performing best; the 3-way WD-PD-ND distinction is harder to capture
Algorithms to test derivation at the word sequence level are currently being tested
An open question is how to utilize richer linguistic models for this task
