KRISTAL2006 04

67 %
33 %
Information about KRISTAL2006 04

Published on November 20, 2007

Author: Carlton


Web Document Clustering :  Web Document Clustering 2006. 8. 24 한국해양대학교 공과대학 IT공학부 김재훈 Contents:  Contents IR and Web Search Problems on Web Search Results Document Clustering? Why Cluster Documents on Web Search Results? Applications of Document Clustering Clustering Approaches More Applications Discussions Typical IR Task :  Typical IR Task Given: A corpus of textual natural-language documents. A user query in the form of a textual string. Find: A ranked set of documents that are relevant to the query. IR System:  IR System Relevance:  Relevance a subjective judgment may include: Being on the proper subject. Being timely (recent information). Being authoritative (from a trusted source). Satisfying the goals of the user and his/her intended use of the information (information need). Keyword Search:  Keyword Search Simplest notion of relevance is that the query string appears verbatim in the document. Slightly less strict notion is that the words in the query appear frequently in the document, in any order (bag of words). Intelligent IR:  Intelligent IR Taking into account the meaning of the words used. Taking into account the order of words in the query. Taking into account the authority of the source. Adapting to the user based on the direct or indirect feedback. Web Search:  Web Search Application of IR to HTML documents on the World Wide Web. Differences with typical IR: Must assemble document corpus by spidering the web. Can use the structural layout information in HTML (XML). Can use the link structure of the web. Documents change uncontrollably. Web Search System:  Web Search System Web users’ information needs:  Web users’ information needs classification based on the type of answers expected by the Web user 1. Question answering type of answer : a very short answer (single sentence, part of the document) query example “What is the population of Korea?” 2. Named page finding - Homepage finding type of answer : a single document containing specific information query example “Authors instructions for the KISS journal” 3. Topic relevance type of answer: a range of documents relevant to the topic of their information need query example “KOREA’s immigration policy” 4. Online services type of answer: a range of documents that allow the user to access a particular service query example “download mp3” 5. Topic distillation type of answer: a range of documents that are relevant key resources on the topic (Relevance + Quality) query example “highway safety” Web search results:  Web search results Big problem A long list of search results, ranked by their relevancies to the given query Engine : search engine, car part, Engine Corp. A time consuming task when multiple sub-topics of the given query are mixed together. Possible solution To (online) cluster web search results Evidence Relationships between the results Cluster Hypothesis (van Rijsbergen 1979): “Closely related documents tend to be relevant to the same requests.” Potential results Aids user-engine interaction Browsing Help users express his need An example - Vivisimo:  An example - Vivisimo What is clustering?:  What is clustering? Partition unlabeled examples into disjoint subsets of clusters, such that: Examples within a cluster are very similar Examples in different clusters are very different the act of grouping similar objects into sets unsupervised classification no predefined classes Typical applications to get insight into data as a preprocessing step What is clustering? (cont.):  What is clustering? (cont.) What is document clustering?:  What is document clustering? Document clustering : To group similar documents into sets To group similar or related documents into a common cluster. Eg) Scatter/Gather Why not just document clustering?:  Why not just document clustering? Web search results clustering is a version of document clustering, but… Billions of pages Constantly changing Data mainly unstructured and heterogeneous Additional information to consider (i.e. links, click-through data, etc.) Why cluster web documents?:  Why cluster web documents? 1. For whole corpus analysis/navigation Better user interface 2. For improving recall in search applications Better search results 3. For better navigation of search results Effective “user recall” will be higher 4. For speeding up vector space retrieval Faster search 1. Corpus analysis/navigation:  1. Corpus analysis/navigation Standard IR Index Aardvark, 15 Blueberry, 200 Capricorn, 1, 45-55 Dog, 79-99 Egypt, 65 Falafel, 78-90 Giraffes, 45-59 … Table of Contents 1. Science of Cognition 1.a. Motivations 1.a.i. Intellectual Curiosity 1.a.ii. Practical Applications 1.b. History of Cognitive Psychology 2. The Neural Basis of Cognition 2.a. The Nervous System 2.b. Organization of the Brain 2.c. The Visual System 3. Perception and Attention 3.a. Sensory Memory 3.b. Attention and Sensory Information Processing Document clusters Some time, TOC is very useful 1. Corpus analysis/navigation (cont.):  1. Corpus analysis/navigation (cont.) Document clustering Can induce a tree of topics To allow user to browse through corpus to find information Crucial need: meaningful labels for topic nodes. Yahoo!: manual hierarchy Often not available for new document collection For visualizing a document collection and its themes:  For visualizing a document collection and its themes Wise et al, “Visualizing the non-visual” PNNL ThemeScapes, Cartia [Mountain height = cluster size] 2. Improvement of search recall:  2. Improvement of search recall Cluster hypothesis Documents with similar text are related Therefore, to improve search recall: To cluster docs in corpus a priori (pre-clustering) To return other docs in the cluster containing D when a query matches a doc D. Hope if we do this: The query “car” will also return docs containing automobile Because clustering grouped together docs containing car with those containing automobile. 3. Better navigation of search results:  3. Better navigation of search results For grouping search results thematically / Vivisimo 3. Better navigation of search results - - :  3. Better navigation of search results - - 3. Better navigation of search results - - :  3. Better navigation of search results - - 3. Better navigation of search results - -:  3. Better navigation of search results - - 3. Better navigation of search results:  3. Better navigation of search results One can also view grouping documents with the same sense of a word as clustering Given the results of a search (say Jaguar, or NLP), partition into groups of related docs Can be viewed as a form of word sense disambiguation E.g., jaguar may have senses: The car company The animal The football team The video game 4. Speeding up vector space retrieval:  4. Speeding up vector space retrieval In vector space retrieval, To find nearest doc vectors to query vector To entail finding the similarity of the query to every doc – slow (for some applications) By clustering docs in corpus a priori To find nearest docs in cluster close to query To be inexact but avoid exhaustive similarity computation What Is A Good Clustering?:  What Is A Good Clustering? Internal criterion: A good clustering will produce high quality clusters in which: the intra-class (that is, intra-cluster) similarity is high the inter-class similarity is low The measured quality of a clustering depends on both the document representation and the similarity measure used External criterion: The quality of a clustering is also measured by its ability to discover some or all of the hidden patterns or latent classes Assessable with gold standard data Main issues:  Main issues Online or offline clustering? What to use as input Entire documents Snippets Structure information (links) Other data (i.e. click-through) Use stop word lists, stemming, etc. How to define similarity? Content (i.e. vector-space model) Link analysis Usage statistics How to group similar documents? How to label the groups? Components Of Clustering System:  Components Of Clustering System 1. Feature selection/extraction the most effective subset from patterns 2. Feature representation one or more transformations of the input features (attributes) to produce new salient features. 3. Pattern proximity (measure similarity) The distance of pairs of patterns (features). 4. Clustering (Grouping) Hard (a partition of the data into groups) Fuzzy (each pattern has a variable degree) 5. Data abstraction Extracting a compact representation of a data set Automatic analysis, Human-oriented. 6. Evaluation to measure the performance of competing algorithms 1. Feature Selection/Extraction:  1. Feature Selection/Extraction What is the representation of documents? document = a set of features What is the features? Linguistic structure in documents Co-occurrences of Terms : n-gram Semantic structure : case structure Meta-data of documents Authors Citation Co-citation document : documents cites the examined documents bibliographic coupling : documents are cited by the examined documents 1. Feature Selection/Extraction:  1. Feature Selection/Extraction Which terms to use as axes for vector space? Better is to use highest weight mid-frequency words – the most discriminating terms Pseudo-linguistic heuristics, e.g., drop stop-words stemming/lemmatization use only nouns/noun phrases Good clustering should “figure out” some of these 2. Feature representation:  2. Feature representation Documents are transformed into vectors of weighted terms Data matrix (two modes) How to compute xij tf*idf 3. Pattern proximity (measure similarity):  3. Pattern proximity (measure similarity) Dissimilarity matrix (one mode) How to compute dij Dissimilarity - Distance Similarity - inverse of distance Proximity Measure:  Proximity Measure Dissimilarity The Euclidean distance: The Manhattan distance: Weighted distance: Similarity Inner Product Cos Coefficient 4. Clustering Approach and/or Algorithm:  4. Clustering Approach and/or Algorithm (1) Using contents of documents (2) Using user’s usage logs (3) Using current search engines (4) Using hyperlinks (5) Other classical methods (1) Using Contents of Documents:  (1) Using Contents of Documents Based on snippets returned by web search engines. Be as good as clusters created using the full text of Web documents. Suffix Tree Clustering (STC) : incremental, O(n) time algorithm three logical steps: (1) document “cleaning” (2) identifying base clusters using a suffix tree (3) combining these base clusters into clusters (2) Using user’s usage logs:  (2) Using user’s usage logs Relevancy information is objectively reflected by the usage logs An experimental result on (3) Using current web search engines - Metacrawler -:  (3) Using current web search engines - Metacrawler - (4) Using hyperlinks:  (4) Using hyperlinks Cluster web documents based on both the textual and hyperlink The hyperlink structure is used as the dominant factor in the similarity metric Kleinberg’s HITS (Hypertext Induced Topic Selection) algorithm based purely on hyperlink information. authority and hub documents for a user query. only cover the most popular topics and leave out the less popular ones. Other classical approaches:  Other classical approaches Hierarchical Clustering:  Hierarchical Clustering Clusters are created in levels actually creating sets of clusters at each level. Agglomerative Initially each item in its own cluster Iteratively clusters are merged together Bottom Up Divisive Initially all items in one cluster Large clusters are successively divided Top Down Hierarchical Clustering (Cent):  Hierarchical Clustering (Cent) Use distance matrix as clustering criteria. This method does not require the number of clusters k as an input, but needs a termination condition. agglomerative divisive Agglomerative Example:  Agglomerative Example B A E C D 4 Threshold of 2 3 5 1 A B C D E Distance Between Two Clusters:  Distance Between Two Clusters Single-Link Method : Nearest Neighbor Complete-Link : Furthest Neighbor Average-link : all cross-cluster pairs. Single-Link Method:  Single-Link Method b a Distance Matrix Euclidean Distance (1) (2) (3) a,b,c c c d a,b d d a,b,c,d Complete-Link Method:  Complete-Link Method b a Distance Matrix Euclidean Distance (1) (2) (3) a,b c c d a,b d c,d a,b,c,d Example of the Chaining Effect:  Example of the Chaining Effect Single-link (10 clusters) Complete-link (2 clusters) Effect of Bias towards Spherical Clusters:  Effect of Bias towards Spherical Clusters Single-link (2 clusters) Complete-link (2 clusters) Partitioning clustering:  Partitioning clustering Partitioning method: Construct a partition of a database D of n objects into a set of k clusters Given a k, find a partition of k clusters that optimizes the chosen partitioning criterion Global optimal: exhaustively enumerate all partitions Heuristic methods: k-means and k-medoids algorithms k-means (MacQueen’67): Each cluster is represented by the center of the cluster k-medoids or PAM (Partition around medoids) (Kaufman & Rousseeuw’87): Each cluster is represented by one of the objects in the cluster K-means algorithm:  K-means algorithm Initial center of cluster are randomly selected Assign objects to cluster using distances between center and object Re-compute the center of each cluster Return step2 until stopping criteria is satisfied K-means algorithm:  K-means algorithm Example What is the problem of k-Means?:  What is the problem of k-Means? The k-means algorithm is sensitive to outliers ! Since an object with an extremely large value may substantially distort the distribution of the data. K-Medoids: Instead of taking the mean value of the object in a cluster as a reference point, medoids can be used, which is the most centrally located object in a cluster. 5. Data Abstraction - Cluster labelling -:  5. Data Abstraction - Cluster labelling - How do we generate meaningful descriptions, or labels, for clusters? it is an open issue Possible solution To list 5-10 most frequent features from the cluster features: index terms, noun phrases, proper names, … To show features that occur frequently in this cluster and not frequently in others Differential labeling To summarize multiple documents in the cluster 6. Cluster validation:  6. Cluster validation How can we tell if a clustering is good or not? ask users whether they agree with the clusters agreement with human clustering is problematic (Macskassy et al., 1998) use statistical techniques to measure qualities of the cluster, e.g. the purity, or the divergence from random clustering For information retrieval do we manage to cluster relevant documents together? This is the “holy grail” for cluster-based IR can be evaluated through cluster-based retrieval Approaches to evaluating:  Approaches to evaluating Anecdotal “I wrote this clustering algorithm and look what it found!” No benchmarks, no comparison possible User inspection Experts evaluate the results and score them Expensive / time-consuming Ground “truth” comparison To compare clustering results to a known taxonomy like Yahoo The static prior taxonomy may be incomplete/wrong in places Microeconomic / utility Net economic gain produced by an approach (vs. another approach) How do various clustering methods affect the quality of what’s retrieved? Compare two IR algorithms 1. send query, present ranked results 2. send query, cluster results, present clusters Purely quantitative measures Probability of generating clusters found Average distance between cluster members Cluster Validity Assessment:  Cluster Validity Assessment (Ci, Cj) : inter-cluster distance (Ck) : intra-cluster distance D ( C K ) d ( C i , C j ) Inter-cluster distance : (Ci, Cj):  Inter-cluster distance : (Ci, Cj) Average linkage Centroid linkage Complete linkage Single linkage Average to centroids linkage Intra-cluster distance: (Ck) :  Intra-cluster distance: (Ck) Complete diameter Average diameter Centroid diameter DB(Davies-Bouldin) Index:  DB(Davies-Bouldin) Index K : the number of clusters Sij : Distance between the center zi of cluster i Ci & the center zj of cluster j Cj Si : Scatter between Ci and Cj Davies-Bouldin (DB) Index (2):  Davies-Bouldin (DB) Index (2) The objective: To minimize the index the number of clusters optimal number of clusters, K The clusters with minimized DB index Small values of DB correspond to good clusters Dunn’s Index:  Dunn’s Index The objective To identify sets of clusters that are compact and well separated to maximize the inter-cluster distances and minimize the intra-cluster distances Large values of VD correspond to good clusters Other Cluster Validity Measures:  Other Cluster Validity Measures Calinski Harabasz (CH) Index Index I Xie and Beni F-Measure -Measure Silhouette Symmetry Graph-Based Boundary Analysis Partition Coefficient Separation Index Classification Entropy (CE) Etc. Other clustering applications:  Other clustering applications Relevance feedback (Iwayama, 2000) Search engine query logs cluster queries on the basis of the documents that users selected from these queries (Beeferman & Berger, 2000) Recommender systems cluster users based on preferences, purchases, etc. (Sarwar et al., 2002) Topic detection and tracking (TDT) & Novelty Detection cluster news stories together and try to find emerging new topics (Allan, 2001) cluster previously seen information, and measure novelty of incoming information by its distance to that already seen (TREC Novelty track, 2002, 2003) Discussion:  Discussion In general, clustering presents problems and challenges: Selection of attributes for clustering Selection of clustering method Generation of cluster representations Validity/quality of the generated clustering Updating clustering structures e.g. inserting or deleting a document from a hierarchy Effectiveness gains are not always evident Computational expenses Looking back:  Looking back Document clustering has been applied to IR for over 30 years the research focus has shifted over the years efficiency (70s), effectiveness (80s), other applications (e.g. browsing, visualisation (90s)), dynamic clustering (00s), …? But, there are still many unresolved issues: Cluster representation Cluster-based search strategies Dynamic clustering (per-query basis) Algorithmic aspects Document representation for clustering (e.g. reduce noise by using only the most important part of a document) New applications for clustering in IR As a conclusion:  As a conclusion Clustering is a useful tool for IR It is theoretically solid and intuitively plausible Although it has been used for over 30 years there are still open issues It is still an interesting area to research Some references:  Some references If you only read one article/reference: Tombros, A., PhD Thesis, 2002. Chapter 3 (optionally 4 & 5) available at: Willett, P., Recent trends in hierarchic document clustering: A critical review. Information Processing & Management, 24(5):577-597, 1988. More than worth to have a look at: van Rijsbergen, C.J., Information Retrieval. London: Butterworths, 2nd Edition, 1979; available at Also recommended : Hearst, M.A. and Pedersen, J.O. Re-examining the Cluster Hypothesis: Scatter/Gather on Retrieval Results. In Proceedings of ACM SIGIR ‘96, pages 76-84, 1996. Some references:  Some references Zamir, O. and Etzioni, O. Web document clustering: A feasibility demonstration. In Proceedings of ACM SIGIR ‘09, pages 46-54, 1998. Iwayama, M. Relevance feedback with a small number of relevance judgements: incremental relevance feedback vs. document clustering. In Proceedings of ACM SIGIR ’00, pages 10-16, 2000. Beeferman, D. and Berger, A. Agglomerative clustering of a search engine log. In Proceedings of the 6th International Conference on Knowledge Discovery in Data, pages 407-416, 2000. B.M. Sarwar, G. Karypis, J. Konstan, and J. Riedl. Recommender Systems for Large-Scale E-Commerce: Scalable Neighborhood Formation Using Clustering. Proceedings of the 5th International Conference on Computer and Information Technology, 2002. J. Allan, J. Carbonell, G. Doddington, J. Yamron and Y. Yang. Topic detection and tracking pilot study. In Topic Detection and Tracking Workshop Report, 2001. Some references:  Some references Macskassy, S.A., Banerjee, A., Davidson, B.D., Hirsh, H. Human performance on clustering web pages: a preliminary study. In Proceedings of The 4th Knowledge Discovery and Data Mining Conference (KDD-98), pp. 264-268, 1998. Goldszmidt, M. and Sahami, M. A Probabilistic Approach to Full-Text Document Clustering. Technical Report ITAD-433-MS-98-044, SRI International, 1998. Available at Bradley, P., Fayyad, U., Reina, C. Scaling EM (expectation maximization) algorithm to large databases, Microsoft Research Technical Report, MSR-TR-98, 1998. Cutting, D.R., Karger, D.R., Pedersen, J.O., Tukey, J.W. Scatter/Gather: A cluster based approach to browsing large document collections. In Proceedings of ACM SIGIR ‘92, pages 126-135, 1992. Acknowledgement:  Acknowledgement C. Manning and P. Raghavan Why cluster documents I. Corna, Cluster Validation, T. Tombros, Clustering for IR, External Evaluation of Cluster Quality:  External Evaluation of Cluster Quality Assesses clustering with respect to ground truth Assume that there are C gold standard classes, while our clustering algorithms produce k clusters, π1, π2, …, πk with ni members. Simple measure: purity, the ratio between the dominant class in the cluster πi and the size of cluster πi Others are entropy of classes in clusters (or mutual information between classes and clusters) Purity:                   Cluster I Cluster II Cluster III Cluster I: Purity = 1/6 (max(5, 1, 0)) = 5/6 Cluster II: Purity = 1/6 (max(1, 4, 1)) = 4/6 Cluster III: Purity = 1/5 (max(2, 0, 3)) = 3/5 Purity

Add a comment

Related presentations

Related pages

СПЕЦОДЯГ, БОРИСЛАВСЬКА ШВЕЙНА ... Про ... 04.12.2015: Tweet. Бізнес- ...
Read more

Dokumente pdf,word,excel - Këshilltari Juaj

Test+pergjigje_Kadare_Kristal2006.docx Visualizza Scarica: ... 18 mar 2016, 11:04: Aleksander Mita XH. XHOJSI; Selection File type icon File name Description
Read more

Hash Killer

... 081147159800512 6f762a4afa62c390a3f9a71bf2d0ec36 endomorphismen 6f762a7c8a472553557176bcd0b6b25c kristal2006 ... 04/15/1922 ...
Read more

Правління Розділ: Робочі органи ...

235-55-04 , 097-968-15-16 . Михайло Коропецький ... 504305111, 5-06-52, 5-05-41 ...
Read more

다국어문자표현 - KRISTAL IRMS Home - GIIS

2007-04-17 동국대-DBLAB 3 머릿말 z문자는 말(정보)을 담는 그릇 z컴퓨터는 숫자만을 다루며, 음성기호 또는 문 자(character ...
Read more

1. Доповни речення словами ...

от kristal2006 21.10.2015. ... 2015-10-21T18:04:52+00:00. 1)Водять літаки ...
Read more

СПЕЦОДЕЖДА, БОРИСЛАВСКАЯ ... О ... 04.12.2015: Tweet. Бизнес- ...
Read more

Web Document Clustering - KRISTAL IRMS Home - GIIS

KRISTAL 2006 2 Contents zIR and Web Search zProblems on Web Search Results zDocument Clustering? zWhy Cluster Documents on Web Search Results ...
Read more