OHSummarize Sept2003

Published on August 22, 2007

Author: Funtoon

Source: authorstream.com

Automatic text summarization
Hercules Dalianis
NADA-KTH, Royal Institute of Technology, 100 44 Stockholm
ph: +46-8-790 91 05, mobile: +46 70 568 13 59, email: hercules@kth.se

Overview of talk:
Background
Other summarizers
Technique
Future improvements
Applications
Evaluation

Automatic text summarization:
Automatic text summarization is the method by which a computer summarizes a text.
A text is given to the computer and it returns a shorter, non-redundant text: an extract from the longer original.
The technique has its roots in the 1960s.
With the Internet and the WWW there has been a reawakened interest in summarization techniques.

Summarization tools:
http://www.nada.kth.se/~hercules/HDbookmarks.htm
http://www.ics.mq.edu.au/~swan/summarization/projects_full.htm
Microsoft Word 97, 98 and Word 2000 have a summarizer for documents.
Intelligent Miner for Text, Summarization tool (IBM)
Inxight (XEROX)
Datahammer (Glucose Development Corporation)

Slide5:
Corporum Summarizer, Cognit AS (Norway)
Pertinence (France)
Copernic Summarizer
MuST Prototype
Automated Text Summarization (SUMMARIST)
Columbia Newsblaster, http://www1.cs.columbia.edu/nlp/newsblaster
OracleContext
Autonomy

What is automatic summarization good for?:
Newspaper setting and printing: Sydsvenska Dagbladet, Bergens Tidende
Summarizing scientific texts: Danmarks Elektroniske Forskningsbibliotek
Telephone systems: reading summarized news with synthetic speech

Slide7:
Search engines, to summarize documents for the hit list, cf. Google, SiteSeeker
NewsAgent: business intelligence
TDT (Topic Detection and Tracking) and Columbia Newsblaster

Summarization approaches:
Extraction vs. Abstraction
Generic vs. Query based
Indicative vs. Informative
Restricted vs. Unrestricted domain
Background information vs. New information (TDT)
Single-document vs. Multiple-document
Monolingual vs. Multilingual
Textual vs. Multimedia

Text summarization:
Extraction is much easier than abstraction.
Abstraction needs understanding and rewriting.

Techniques:
Find what the text is about
Then decide what to say
Then decide how to say it
Text summarization (extraction) uses statistical, linguistic and heuristic methods.

Techniques:
A text is divided into sentences
Sentence positions (news/reports)
Title words
Bold text, numerical values, citations
Named entities (frequency based)
Keyword frequency and extraction (nouns, adverbs, adjectives)
Use morphological information: the lemma

Key word lexicon:
Key words in the news domain
Also called 'open class word lexicon'
Key words can be nouns, adjectives or adverbs

Slide16:
Words which are present in all other sentences
User adaptation: use the user's keywords to obtain slanted summaries
A combination function of all rankings, with different weights, gives the rank of each sentence.
Generate all high-ranking sentences
Voilà, the summary!

SweSum:
The first text summarizer for Swedish
Summarizes Swedish newspaper text in HTML/text format on the WWW.
Uses a Swedish key word lexicon that contains 40 000 words and their 700 000 possible inflections.
During summarization, 5-10 key words are produced that describe or categorize the text.
Key words: a miniature summary.

The Swedish keyword lexicon:
Inflected version (700 000 words)    Lemma (40 000 words)
statsminister                        statsminister
statsministern                       statsminister
statsministerns                      statsminister
statsministrarna                     statsminister
statsministrarnas                    statsminister
...                                  ...
regeringen                           regeringen
regeringens                          regeringen
regeringarna                         regeringen
regeringarnas                        regeringen
...                                  ...

SweSum:
SweSum is available to summarize news texts in Swedish, Danish, Norwegian, English, Spanish, French, German and Farsi (Persian).
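The extraction steps listed under Techniques (split into sentences, score by position, title words, numerical values and keyword frequency over lemmas, then combine the rankings with weights) can be sketched roughly as follows. This is a minimal illustration, not SweSum's actual code: the tiny English `LEXICON`, the `WEIGHTS` values, the feature formulas and the function names are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical toy key word lexicon: inflected form -> lemma.
# (The real Swedish lexicon maps about 700 000 inflections to 40 000 lemmas.)
LEXICON = {
    "government": "government", "governments": "government",
    "budget": "budget",
    "minister": "minister", "ministers": "minister",
    "spending": "spending",
}

# Hypothetical weights for the combination function (not from the source).
WEIGHTS = {"position": 1.0, "title": 2.0, "numeric": 0.5, "keyword": 1.5}


def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?'."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def words(sentence):
    """Lower-cased alphabetic tokens (digits are ignored)."""
    return re.findall(r"[^\W\d_]+", sentence.lower())


def lemmas(sentence):
    """Map each token to its lemma, falling back to the token itself."""
    return [LEXICON.get(w, w) for w in words(sentence)]


def summarize(title, text, cutoff=0.3):
    """Return roughly `cutoff` of the sentences, in their original order."""
    sentences = split_sentences(text)
    if not sentences:
        return []
    title_lemmas = set(lemmas(title))
    # Keyword frequency is counted over lemmas of lexicon words only.
    freq = Counter(LEXICON[w] for s in sentences for w in words(s) if w in LEXICON)
    scored = []
    for i, s in enumerate(sentences):
        toks = words(s)
        features = {
            "position": 1.0 / (i + 1),  # earlier sentences rank higher
            "title": sum(l in title_lemmas for l in lemmas(s)),
            "numeric": 1.0 if re.search(r"\d", s) else 0.0,
            "keyword": sum(freq[LEXICON[w]] for w in toks if w in LEXICON)
                       / max(len(toks), 1),
        }
        # Combination function: weighted sum of the individual rankings.
        scored.append((sum(WEIGHTS[k] * v for k, v in features.items()), i))
    keep = max(1, round(cutoff * len(sentences)))
    chosen = sorted(i for _, i in sorted(scored, reverse=True)[:keep])
    return [sentences[i] for i in chosen]
```

Calling `summarize(title, text, cutoff=0.3)` keeps about 30 percent of the sentences. Changing the weights, or adding the user's own keywords to the lexicon, shifts which sentences survive, which is one way to picture the "slanted summaries" mentioned above.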
Slide22:
Text summarization slideshow (Textsammanfattningsbildspel)

Problems:
Pronouns and other anaphoric references
Kalle sprang. Han sprang fort. ("Kalle ran. He ran fast.")
Pronoun resolution
Clauses can be too long or too short
Clause reduction and clause combination rules
Aggregation

SweSum without PRM:
Analysera mera!
Director: Harold Ramis
Cast: Robert De Niro, Billy Crystal, Lisa Kudrow
Running time: 1 h 45 min
…
One of many reasons to rejoice at Analysera mera is that Robert De Niro is truly practicing the art of acting again here. He accelerates emotionally from 0 to 100 in no time at all, and then brakes softly as a cat and parks, calm and composed. And he is fairly irresistible. Here he has delivered yet another intelligent comedy for all of us friends of intelligence and comedy, preferably in combination.
SvD 99-10-08

SweSum with PRM:
Analysera mera!
Director: Harold Ramis
Cast: Robert De Niro, Billy Crystal, Lisa Kudrow
Running time: 1 h 45 min
…
One of many reasons to rejoice at Analysera mera is that Robert De Niro is truly practicing the art of acting again here. Robert accelerates emotionally from 0 to 100 in no time at all, and then brakes softly as a cat and parks, calm and composed. And Robert is fairly irresistible. Here Harold has delivered yet another intelligent comedy for all of us friends of intelligence and comedy, preferably in combination.
SvD 99-10-08

Evaluation:
We found that if one summarizes a text to 30 percent of its original length, one obtains around 70-80 percent accuracy on news articles of 3-4 pages.
... but query-based evaluations are based on subjective opinions.
These evaluations need a large human effort.
Small overlap of opinions
We need man-made extracts so that machine-made extracts can be compared against them automatically.

Slide28:
There are some man-made extracts for English news texts.
We had to create such extracts for Swedish news text.
We created the KTH Extract Corpus, a corpus created manually once by voting.
One can then compare the texts from SweSum with the KTH Extract Corpus manually or, soon, automatically.

KTH extract corpus:
http://www.nada.kth.se/iplab/hlt/kthxc/showsumstats.php
and
http://www.nada.kth.se/iplab/hlt/kthxc/

Slide30:
http://www.nada.kth.se/iplab/hlt/kthxc/showsumstats.php?cutoff=30&fileid=svenska-%3Etest-%3Etext001.htm

Future improvements of SweSum:
Tagging instead of static lexicons
Clause level summarization
Improved Named Entity recognition
Improved Pronominal Resolution
Lexical chains using SIMPLE and/or EuroWordNet
Automatic evaluation method

Demonstrators:
SweSum, standard version: http://swesum.nada.kth.se/index-eng.html
SweSum, experimental NE version: http://www.nada.kth.se/~xmartin/swesum_lab/index-eng.html
(SweSum uses a Perl-CGI script; there is also a standalone version for plain text/HTML.)
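The idea from the Evaluation slides, building a man-made extract by voting and then comparing a machine-made extract against it automatically, might look something like the sketch below. The slides do not say how votes are counted or which overlap measure is used, so the `min_votes` threshold, the use of precision and recall, and the function names are assumptions for illustration only; this is not a description of how the 70-80 percent figure was obtained.

```python
from collections import Counter


def vote_extract(annotator_choices, min_votes):
    """Build a gold extract from several annotators' sentence picks.

    annotator_choices: one list of chosen sentence indices per annotator.
    A sentence enters the gold extract if at least `min_votes` annotators
    chose it (hypothetical voting rule).
    """
    votes = Counter(i for choice in annotator_choices for i in set(choice))
    return {i for i, n in votes.items() if n >= min_votes}


def overlap_scores(machine, gold):
    """Precision and recall of a machine extract against a gold extract.

    Both arguments are collections of sentence indices.
    """
    machine, gold = set(machine), set(gold)
    common = machine & gold
    precision = len(common) / len(machine) if machine else 0.0
    recall = len(common) / len(gold) if gold else 0.0
    return precision, recall
```

For instance, with three annotators and `min_votes=2`, only sentences picked by at least two of them form the gold extract; `overlap_scores` then reports how many of the summarizer's sentences appear in it. Because the slides note a small overlap of opinions between people, the choice of threshold can change the gold extract considerably.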
