Effective Extraction of Thematically Grouped Key Terms From Text

Category: Technology

Published on March 23, 2009

Author: maria.grineva

Source: slideshare.net

Description

"Effective Extraction of Thematically Grouped Key Terms From Text". Presentation for AAAI-SSS-09: Social Semantic Web: Where Web 2.0 Meets Web 3.0:

Effective Extraction of Thematically Grouped Key Terms From Text
Maria Grineva, Ph.D., research scientist at the Institute for System Programming of RAS

Outline

Key terms extraction: traditional approaches and applications

Using Wikipedia as a knowledge base for Natural Language Processing

Main techniques of our approach:

Wikipedia-based semantic relatedness

Network analysis algorithm to detect community structure in networks

Our method

Experimental evaluation

Key Terms Extraction

Basic step for various NLP tasks:

document classification

document clustering

text summarization

inferring a more general topic of a text document

Core task of Internet content-based advertising systems, such as Google AdSense and Yahoo! Contextual Match.

Approaches to Key Terms Extraction

Based on statistical learning:

use, for example: a frequency criterion (the TFxIDF model; a sketch follows this list), keyphrase frequency, or distance between terms normalized by the number of words in the document (KEA)

compute statistical features over the Wikipedia corpus (Wikify!)

require a training set

Based on analyzing syntactic or semantic term relatedness within a document:

compute semantic relatedness between terms (using, for example, Wikipedia)

model the document as a semantic graph of terms and apply graph analysis techniques to it (TextRank)

no training set required
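For concreteness, a minimal sketch of the TFxIDF criterion mentioned above; the counts are illustrative, not from the paper:

```python
import math

def tf_idf(term_count: int, doc_len: int, docs_with_term: int, num_docs: int) -> float:
    """TFxIDF: high when a term is frequent in the document but rare in the corpus."""
    tf = term_count / doc_len
    idf = math.log(num_docs / (1 + docs_with_term))  # +1 guards against zero
    return tf * idf

# Illustrative numbers: a term appearing 12 times in a 300-word post,
# found in 50 of 10,000 corpus documents.
print(tf_idf(12, 300, 50, 10_000))  # higher score => stronger key-term candidate
```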

Using Wikipedia as a Knowledge Base for Natural Language Processing

Wikipedia (www.wikipedia.org) – free open encyclopedia

Today Wikipedia is the biggest encyclopedia (more than 2.7 million articles in the English Wikipedia)

It is always up-to-date thanks to millions of editors around the world

Has a huge network of cross-references between articles, a large number of categories, redirect pages, and disambiguation pages => a rich resource for bootstrapping NLP and IR tasks

Basic Techniques of Our Method: Semantic Relatedness of Terms

Semantic relatedness assigns a score to a pair of terms that represents the strength of relatedness between them

Can be computed over a dictionary or thesaurus; we use Wikipedia

Wikipedia-based semantic relatedness for two terms can be computed using:

the links found within their corresponding Wikipedia articles

the Wikipedia category structure

the article’s textual content

Basic Techniques of Our Method: Semantic Relatedness of Terms

Using the Dice measure for Wikipedia-based semantic relatedness

Denis Turdakov and Pavel Velikhov. “Semantic Relatedness Metric for Wikipedia Concepts Based on Link Analysis and its Application to Word Sense Disambiguation.” SYRCoDIS, 2008.
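A minimal sketch of a Dice-style relatedness over Wikipedia link sets, in the spirit of the cited metric; the link sets below are toy stand-ins for the links found in each concept's article:

```python
def dice_relatedness(links_a: set, links_b: set) -> float:
    """Dice coefficient over the sets of links in two Wikipedia articles."""
    if not links_a and not links_b:
        return 0.0
    return 2 * len(links_a & links_b) / (len(links_a) + len(links_b))

# Toy link sets for two concepts; real sets come from the Wikipedia link graph.
apple_inc = {"Steve Jobs", "ITunes", "MacOS", "Cupertino"}
itunes = {"Apple Inc.", "Steve Jobs", "MacOS", "Music"}
print(dice_relatedness(apple_inc, itunes))  # 0.5
```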

Basic Techniques of Our Method: Detecting Community Structure in Networks

Community – a densely interconnected group of nodes in a network

Girvan-Newman algorithm for detecting community structure in networks:

betweenness – how much an edge lies “in between” different communities

modularity – a partition is a good one if there are many edges within communities and only a few between them
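Both ideas are available off the shelf; a hedged sketch using networkx, where the karate-club graph is just a stand-in network:

```python
import networkx as nx
from networkx.algorithms.community import girvan_newman, modularity

G = nx.karate_club_graph()  # stand-in network with a known community structure

# Girvan-Newman: repeatedly remove the edge with the highest betweenness;
# each iteration yields the next, finer partition. Take the first split.
first_split = next(girvan_newman(G))
print([sorted(c) for c in first_split])

# Modularity: high when many edges fall within communities and few between them;
# a clearly positive value indicates real community structure.
print(modularity(G, first_split))
```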

Our Method

Candidate terms extraction

Word sense disambiguation

Building semantic graph

Discovering community structure of the semantic graph

Selecting valuable communities

Our Method: Candidate Terms Extraction

Goal: extract all terms from the document and, for each term, prepare a set of Wikipedia articles that can describe its meaning

Parse the input document and extract all possible n-grams

For each n-gram (+ its morphological variations) provide a set of Wikipedia article titles

“drinks”, “drinking”, “drink” => [Wikipedia:] Drink; Drinking
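A minimal sketch of this step, assuming a `titles` index that maps normalized n-grams to Wikipedia article titles (a stand-in for the real title index with redirects and morphological variants):

```python
def candidate_terms(text: str, titles: dict, max_n: int = 3) -> dict:
    """Extract all n-grams up to max_n words and keep those matching a title."""
    words = text.lower().split()
    found = {}
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            ngram = " ".join(words[i:i + n])
            if ngram in titles:
                found[ngram] = titles[ngram]  # candidate Wikipedia articles
    return found

# Stand-in title index, mirroring the slide's drink/drinking/drinks example.
titles = {"drink": ["Drink"], "drinks": ["Drink"], "drinking": ["Drinking", "Drink"]}
print(candidate_terms("Energy drinks and drinking habits", titles))
# -> {'drinks': ['Drink'], 'drinking': ['Drinking', 'Drink']}
```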

Our Method: Word Sense Disambiguation

Goal: choose the most appropriate Wikipedia article from the set of candidate articles for each ambiguous term extracted in the previous step

Use Wikipedia disambiguation and redirect pages to obtain candidate meanings of ambiguous terms

Denis Turdakov and Pavel Velikhov. “Semantic Relatedness Metric for Wikipedia Concepts Based on Link Analysis and its Application to Word Sense Disambiguation.” SYRCoDIS, 2008.
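A sketch of the general idea behind relatedness-based disambiguation: for each ambiguous term, keep the candidate article most related to the surrounding context articles. The relatedness scores below are toy values, not the cited metric:

```python
def disambiguate(candidates: dict, context: list, relatedness) -> dict:
    """For each ambiguous term, pick the candidate article most related to the
    context articles (a sketch of the idea, not the authors' exact scheme)."""
    return {
        term: max(articles, key=lambda a: sum(relatedness(a, c) for c in context))
        for term, articles in candidates.items()
    }

# Toy relatedness lookup; real scores come from the Wikipedia link structure.
scores = {("Apple Inc.", "ITunes"): 0.8, ("Apple (fruit)", "ITunes"): 0.1}
rel = lambda a, b: scores.get((a, b), 0.0)
print(disambiguate({"apple": ["Apple Inc.", "Apple (fruit)"]}, ["ITunes"], rel))
# -> {'apple': 'Apple Inc.'}
```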

Our Method: Building Semantic Graph

Goal: build the document semantic graph using semantic relatedness between terms

[Figure: semantic graph built from the news article “Apple to Make ITunes More Accessible For the Blind”]
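A minimal sketch of graph construction, assuming a `relatedness` function (e.g. the Dice measure above) and an illustrative edge threshold:

```python
import itertools
import networkx as nx

def build_semantic_graph(terms, relatedness, threshold=0.3):
    """Vertices are disambiguated terms; weighted edges connect pairs whose
    semantic relatedness clears the threshold (values here are illustrative)."""
    g = nx.Graph()
    g.add_nodes_from(terms)
    for a, b in itertools.combinations(terms, 2):
        w = relatedness(a, b)
        if w >= threshold:
            g.add_edge(a, b, weight=w)
    return g

# Toy relatedness table for terms from the "Apple ... ITunes" news article.
table = {frozenset({"Apple Inc.", "ITunes"}): 0.8,
         frozenset({"ITunes", "Accessibility"}): 0.4,
         frozenset({"Apple Inc.", "Blindness"}): 0.1}
rel = lambda a, b: table.get(frozenset({a, b}), 0.0)
g = build_semantic_graph(["Apple Inc.", "ITunes", "Accessibility", "Blindness"], rel)
print(list(g.edges(data=True)))  # "Blindness" stays an isolated vertex
```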

Our Method: Detecting Community Structure of the Semantic Graph

Dense communities represent main topics of the document

Disambiguation mistakes become isolated vertices

Modularity for semantic graphs: 0.3 to 0.5

Our Method: Selecting Valuable Communities

Goal: rank term communities in a way that:

the highest-ranked communities contain key terms

the lowest-ranked communities contain unimportant terms and possible disambiguation mistakes

Use:

density of a community – the sum of the community's inner edges divided by the number of vertices in the community

informativeness – the sum of the keyphraseness measure (a Wikipedia-based TFxIDF analogue) over the community's terms

Community rank: density * informativeness

Take the 2-3 communities with the highest rank
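A minimal sketch of the ranking, following the definitions above and interpreting "sum of inner edges" as a sum of edge weights (an assumption); the keyphraseness values are illustrative:

```python
import networkx as nx

def rank_communities(graph, communities, keyphraseness):
    """Rank = density * informativeness, per the definitions above."""
    ranked = []
    for comm in communities:
        sub = graph.subgraph(comm)
        # density: sum of inner edge weights divided by the number of vertices
        density = sum(d["weight"] for _, _, d in sub.edges(data=True)) / len(comm)
        # informativeness: sum of keyphraseness over the community's terms
        info = sum(keyphraseness.get(t, 0.0) for t in comm)
        ranked.append((density * info, sorted(comm)))
    return sorted(ranked, reverse=True)  # take the top 2-3 communities

g = nx.Graph()
g.add_weighted_edges_from([("Apple Inc.", "ITunes", 0.8),
                           ("Accessibility", "Blindness", 0.3)])
kp = {"Apple Inc.": 0.9, "ITunes": 0.7, "Accessibility": 0.3, "Blindness": 0.2}
print(rank_communities(g, [{"Apple Inc.", "ITunes"},
                           {"Accessibility", "Blindness"}], kp))
```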

Advantages of the Method

No training. Instead of training the system with hand-created examples, we use semantic information derived from Wikipedia

Thematically grouped key terms. These significantly improve further inference of document topics using, for example, spreading activation over the Wikipedia category graph

High accuracy. Evaluated using human judgments (further in this presentation)

Experimental Evaluation: Creating Test Set

30 blog posts from technical blogs

5 people took part in the evaluation and were asked to:

identify from 5 to 10 key terms for each blog post

each key term must be present in the blog post and must be identified using Wikipedia article names as the allowed vocabulary

the chosen key terms should cover several main topics of the blog post

Eventually, a key term was considered valid if at least two of the participants identified the same key term in the blog post

Experimental Evaluation: Precision and Recall

30 blog posts; 180 key terms extracted manually; 297 key terms extracted by our method; 123 of the manually extracted key terms were also extracted by our method

Recall equals 68%

Precision equals 41%
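These figures follow directly from the counts above:

```python
matched, manual, automatic = 123, 180, 297
recall = matched / manual        # 123/180 ≈ 0.683 -> 68%
precision = matched / automatic  # 123/297 ≈ 0.414 -> 41%
print(f"recall {recall:.0%}, precision {precision:.0%}")
```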

Experimental Evaluation: Revision of Precision and Recall

Our method typically extracts more related terms in each thematic group than a human does (possibly, our method produces better term coverage for a specific topic than an average human) => revise precision and recall

Each participant reviewed the key terms extracted automatically for every blog post and, if possible, extended their manually identified key terms with some from the automatically extracted set

Recall after revision equals 73%

Precision after revision equals 52%

Your Questions
