rastogi slides

33 %
67 %
Information about rastogi slides

Published on January 23, 2008

Author: Penelope

Source: authorstream.com

Data Mining meets the Internet: Techniques for Web Information Retrieval :  Data Mining meets the Internet: Techniques for Web Information Retrieval Rajeev Rastogi Internet Management Research Bell laboratories, Murray Hill The Web:  The Web Over 1 billion HTML pages, 15 terabytes Wealth of information Bookstores, restaraunts, travel, malls, dictionaries, news, stock quotes, yellow & white pages, maps, markets, ......... Diverse media types: text, images, audio, video Heterogeneous formats: HTML, XML, postscript, pdf, JPEG, MPEG, MP3 Highly Dynamic 1 million new pages each day Average page changes in a few weeks Graph structure with links between pages Average page has 7-10 links Hundreds of millions of queries per day Why is Web Information Retrieval Important?:  Why is Web Information Retrieval Important? According to most predictions, the majority of human information will be available on the Web in ten years Effective information retrieval can aid in Research: Find all papers that use the primal dual method to solve the facility location problem Health/Medicene: What could be reason for symptoms of “yellow eyes”, high fever and frequent vomitting Travel: Find information on the tropical island of St. Lucia Business: Find companies that manufacture digital signal processors (DSPs) Entertainment: Find all movies starring Marilyn Monroe during the years 1960 and 1970 Arts: Find all short stories written by Jhumpa Lahiri Web Information Retrieval Model:  Web Information Retrieval Model Web Server Crawler Clustering Classification Indexer Storage Server Inverted Index Topic Hierarchy Search Query Business Root News Science Computers Automobiles Plants Animals The jaguar, a cat, can run at speeds reaching 50 mph The jaguar has a 4 liter engine engine jaguar cat jaguar Repository Documents in repository Why is Web Information Retrieval Difficult?:  Why is Web Information Retrieval Difficult? The Abundance Problem (99% of information of no interest to 99% of people) Hundreds of irrelevant documents returned in response to a search query Limited Coverage of the Web (Internet sources hidden behind search interfaces) Largest crawlers cover less than 18% of Web pages The Web is extremely dynamic 1 million pages added each day Very high dimensionality (thousands of dimensions) Limited query interface based on keyword-oriented search Limited customization to individual users How can Data Mining Improve Web Information Retrieval?:  How can Data Mining Improve Web Information Retrieval? Latent Semantic Indexing (LSI) SVD-based method to improve precision and recall Document clustering to generate topic hierarchies Hypergraph partitioning, STIRR, ROCK Document classification to assign topics to new documents Naive Bayes, TAPER Exploiting hyperlink structure to locate authoritative Web pages HITS, Google, Web Trawling Collaborative searching SearchLight Image Retrieval QBIC, Virage, Photobook, WBIIS, WALRUS Latent Semantic Indexing :  Latent Semantic Indexing Problems with Inverted Index Approach:  Problems with Inverted Index Approach Synonymy Many ways to refer to the same object Polysemy Most words have more than one distinct meaning Doc 1 Doc 2 Doc 3 automobile animal jaguar speed car engine porsche X Polysemy Synonymy X X X X X X X X X LSI - Key Idea [DDF 90]:  LSI - Key Idea [DDF 90] Apply SVD to terms by documents (t x d) matrix X X = T0 S0 D0’ T0 , D0 have orthonormal columns and S0 is diagonal Ignoring very small singular values in S (keeping only the first k largest values) X X = T S D New matrix X of rank k is closest to X in the least squares sense t x d t x d m x d t x m m x m t x k k x k k x d Comparing Documents and Queries:  Comparing Documents and Queries Comparing two documents Essentially dot product of two column vectors of X X’ X = D S D’ So one can consider rows of DS matrix as coordinates for documents and take dot products in this space Finding documents similar to query q with term vector Xq Derive a representation Dq for query Dq = Xq’ T S Dot product of DqS and appropriate row of DS matrix yields similarity between query and specific document 2 -1 LSI - Benefits:  LSI - Benefits Reduces Dimensionality of Documents From tens of thousands (one dimension per keyword) to a few 100 Decreases storage overhead of index structures Speeds up retrieval of documents similar to a query Makes search less “brittle” Captures semantics of documents Addresses problems of synonymy and polysemy Transforms document space from discrete to continuous Improves both search precision and recall Document Clustering :  Document Clustering Improve Search Using Topic Hierarchies:  Improve Search Using Topic Hierarchies Web directories (or topic hierarchies) provide a hierarchical classification of documents (e.g., Yahoo!) Searches performed in the context of a topic restricts the search to only a subset of web pages related to the topic Clustering can be used to generate topic hierarchies Recreation Science Business News Yahoo home page Sports Travel Companies Finance Jobs Clustering:  Clustering Given: Data points (documents) and number of desired clusters k Group the data points (documents) into k clusters Data points (documents) within clusters are more similar than across clusters Document similarity measure Each document can be represented by vector with 0/1 value along each word dimension Cosine of angle between document vectors is a measure of their similarity, or (euclidean) distance between the vectors Other applications: Customer segmentation Market basket analysis k-means Algorithm:  k-means Algorithm Choose k initial means Assign each point to the cluster with the closest mean Compute new mean for each cluster Iterate until the k means stabilize Agglomerative Hierarchical Clustering Algorithms:  Agglomerative Hierarchical Clustering Algorithms Initially each point is a distinct cluster Repeatedly merge closest clusters until the number of clusters becomes k Closest: dmean (Ci, Cj) = dmin (Ci, Cj) = Likewise dave (Ci, Cj) and dmax (Ci, Cj) Agglomerative Hierarchical Clustering Algorithms (Continued):  Agglomerative Hierarchical Clustering Algorithms (Continued) (a) Centroid (b) MST (c) Correct Clusters dmean: Centroid approach dmin: Minimum Spanning Tree (MST) approach Drawbacks of Traditional Clustering Methods:  Drawbacks of Traditional Clustering Methods Traditional clustering methods are ineffective for clustering documents Cannot handle thousands of dimensions Cannot scale to millions of documents Centroid-based method splits large and non-hyperspherical clusters Centers of subclusters can be far apart MST-based algorithm is sensitive to outliers and slight change in position Exhibits chaining effect on string of outliers Using other similarity measures such as Jaccard coefficient instead of euclidean distance does not help Example - Centroid Method for Clustering Documents:  Example - Centroid Method for Clustering Documents As cluster size grows The number of dimensions appearing in mean go up Their value in the mean decreases Thus, very difficult to distinguish two points that differ on few dimensions ripple effect {1,4} and {6} are merged even though they have no elements in common! Itemset Clustering using Hypergraph Partitioning [HKK 97] :  Itemset Clustering using Hypergraph Partitioning [HKK 97] Build a weighted hypergraph with frequent itemsets Hyperedge: each frequent item Weight of hyperedge: average of confidences of all association rules generated from itemset Hypergraph partitioning algorithm is used to cluster items Minimize sum of weights of cut hyperedges Label customers with Item clusters by scoring Assume that items defining clusters are disjoint! STIRR - A System for Clustering Categorical Attribute Values [GKR 98] :  STIRR - A System for Clustering Categorical Attribute Values [GKR 98] Motivated by spectral graph partitioning, a method for clustering undirected graphs Each distinct attribute value becomes a separate node v with weight w(v) Node weights w(v) updated in each iteration as follows: For each tuple do Update set of weights so that it is orthonormal Positive and negative weights in non-principal basins tend to represent good partitions of the data ROCK [GRS 99]:  ROCK [GRS 99] Hierarchical clustering algorithm for categorical attributes Example: market basket customers Use novel concept of links for merging clusters sim(pi, pj): similarity function that captures the closeness between pi and pj pi and pj are said to be neighbors if sim(pi, pj) link(pi, pj): the number of common neighbors At each step, merge clusters/points with the most number of links Points belonging to a single cluster will in general have a large number of common neighbors Random sampling used for scale up In final labeling phase, each point on disk is assigned to cluster with maximum neighbors ROCK:  ROCK {1, 2, 6} and {1, 2, 7} have 5 links. {1, 2, 3} and {1, 2, 6} have 3 links. <1, 2, 3, 4, 5> {1, 2, 3} {1, 4, 5} {1, 2, 4} {2, 3, 4} {1, 2, 5} {2, 3, 5} {1, 3, 4} {2, 4, 5} {1, 3, 5} {3, 4, 5} <1, 2, 6, 7> {1, 2, 6} {1, 2, 7} {1, 6, 7} {2, 6, 7} Clustering Algorithms for Numeric Attributes :  Clustering Algorithms for Numeric Attributes Scalable Clustering Algorithms (From Database Community) CLARANS DBSCAN BIRCH CLIQUE CURE Above algorithms can be used to cluster documents after reducing their dimensionality using SVD ……. BIRCH [ZRL 96]:  BIRCH [ZRL 96] Pre-cluster data points using CF-tree CF-tree is similar to R-tree For each point CF-tree is traversed to find the closest cluster If the cluster is within epsilon distance, the point is absorbed into the cluster Otherwise, the point starts a new cluster Requires only single scan of data Cluster summaries stored in CF-tree are given to main memory hierarchical clustering algorithm CURE [GRS 98]:  CURE [GRS 98] Hierarchical algorithm for dicovering arbitrary shaped clusters Uses a small number of representatives per cluster Note: Centroid-based: uses 1 point to represent a cluster => Too little information..Hyper-spherical clusters MST-based: uses every point to represent a cluster =>Too much information..Easily mislead Uses random sampling Uses Partitioning Labeling using representatives Cluster Representatives:  Cluster Representatives A Representative set of points: Small in number : c Distributed over the cluster Each point in cluster is close to one representative Distance between clusters: smallest distance between representatives Computing Cluster Representatives:  Computing Cluster Representatives Finding Scattered Representatives We want to Distribute around the center of the cluster Spread well out over the cluster Capture the physical shape and geometry of the cluster Use farthest point heuristic to scatter the points over the cluster Shrink uniformly around the mean of the cluster Computing Cluster Representatives (Continued):  Computing Cluster Representatives (Continued) Shrinking the Representatives Why do we need to alter the Representative Set? Too close to the boundary of cluster Shrink uniformly around the mean (center) of the cluster Document Classification :  Document Classification Classification:  Classification Given: Database of tuples (documents), each assigned a class label Develop a model/profile for each class Example profile (good credit): (25 <= age <= 40 and income > 40k) or (married = YES) Example profile (automobile): Document contains a word from {car, truck, van, SUV, vehicle, scooter} Other applications: Credit card approval (good, bad) Bank locations (good, fair, poor) Treatment effectiveness (good, fair, poor) Naive Bayesian Classifier:  Naive Bayesian Classifier Class c for new document d is the one for which Pr[c/d] is maximum Assume independent term occurrences in document - fraction of documents in class c that contain term t Then, by Bayes rule Hierarchical Classifier (TAPER) [CDA 97]:  Hierarchical Classifier (TAPER) [CDA 97] Class of new document d is leaf node c such that Pr[c/d] is maximum Topic Hierarchy can be computed using Bayes rule Problem of computing c reduces to finding leaf node c with the least cost path from the root to c c k-Nearest Neighbor Classifier:  k-Nearest Neighbor Classifier Assign to a point the label for majority of the k-nearest neighbors For k=1, error rate never worse than twice the Bayes rate (unlimited number of samples) Scalability issues Use index to find k-nearest neighbors R-tree family works well up to 20 dimensions Pyramid tree for high-dimensional data Use SVD to reduce dimensionality of data set Use clusters to reduce the dataset size Decision Trees:  Decision Trees accept reject salary < 20000 no yes no yes accept education in graduate Credit Analysis Decision Tree Algorithms:  Decision Tree Algorithms Building phase Recursively split nodes using best splitting attribute for node Pruning phase Smaller imperfect decision tree generally achieves better accuracy Prune leaf nodes recursively to prevent over-fitting Decision Tree Algorithms :  Decision Tree Algorithms Classifiers from machine learning community: ID3 C4.5 CART Classifiers for large databases: SLIQ, SPRINT PUBLIC SONAR Rainforest, BOAT Decision Trees:  Decision Trees Pros Fast execution time Generated rules are easy to interpret by humans Scale well for large data sets Can handle high dimensional data Cons Cannot capture correlations among attributes Consider only axis-parallel cuts Feature Selection:  Feature Selection Choose a collection of keywords that help discriminate between two or more sets of documents Fewer keywords help to speed up classification Improves classification accuracy by eliminating noise from documents Fischer’s discriminant (ratio of between-class to within-class scatter) where and if d contains t Exploiting Hyperlink Structure :  Exploiting Hyperlink Structure HITS (Hyperlink-Induced Topic Search) [Kle 98]:  HITS (Hyperlink-Induced Topic Search) [Kle 98] HITS uses hyperlink structure to identify authoritative Web sources for broad-topic information discovery Premise: Sufficiently broad topics contain communities consisting of two types of hyperlinked pages: Authorities: highly-referenced pages on a topic Hubs: pages that “point” to authorities A good authority is pointed to by many good hubs; a good hub points to many good authorities Hubs Authorities HITS - Discovering Web Communities :  HITS - Discovering Web Communities Discovering the community for a specific topic/query involves the following steps Collect seed set of pages S (returned by search engine) Expand seed set to contain pages that point to or are pointed to by pages in seed set Iteratively update hub weight h(p) and authority weight a(p) for each page: After a fixed number of iterations, pages with highest hub/authority weights form core of community Extensions proposed in Clever Assign links different weights based on relevance of link anchor text Google [BP 98]:  Google [BP 98] Search engine that uses link structure to calculate a quality ranking (PageRank) for each page PageRank Can be calculated using a simple iterative algorithm, and corresponds to principal eigenvector of the normalized link matrix Intuition: PageRank is the probability that a “random surfer” visits a page Parameter p is probability that the surfer gets bored and starts on a new random page (1-p) is the probability that the random surfer follows a link on current page Google - Features:  Google - Features In addition to PageRank, in order to improve search Google also weighs keyword matches Anchor text Provide more accurate descriptions of Web pages Anchors exist for un-indexable documents (e.g., images) Font sizes of words in text Words in larger or bolder font are assigned higher weights Google v/s HITS Google: PageRanks computed initially for Web Pages independent of search query HITS: Hub and authority weights computed for different root sets in the context of a particular search query Trawling the Web for Emerging Communities [KRR 98]:  Trawling the Web for Emerging Communities [KRR 98] Co-citation: pages that are related are frequently referenced together Web communities are characterized by dense directed bipartite subgraphs Computing (i,j) Bipartite cores Sort edge list by source id and detect all source pages s with out-degree j (let D be the set of destination pages that s points to) Compute intersection S of sets of source pages pointing to destination pages in D (using an index on dest id to generate each source set) Output Bipartite (S,D) Bipartite Core Using Hyperlinks to Improve Classification [CDI 98]:  Using Hyperlinks to Improve Classification [CDI 98] Use text from neighbors when classifying Web page Ineffective because referenced pages may belong to different class Use class information from pre-classified neighbors Choose class ci for which Pr(ci/Ni) is maximum (Ni is class labels of all the neighboring documents) By Bayes rule, we choose ci to maximize Pr(Ni/ci) Pr(ci) Assuming independence of neighbor classes, Collaborative Search :  Collaborative Search SearchLight :  SearchLight Key Idea: Improve search by sharing information on URLs visited by members of a community during search Based on the concept of “search sessions” A search session is the search engine query (collection of keywords) and the URLs visited in response to the query Possible to extract search sessions from the proxy logs SearchLight maintains a database of (query, target URL) pairs Target URL is heuristically chosen to be last URL in search session for the query In response to a search query, SearchLight displays URLs from its database for the specified query Image Retrieval :  Image Retrieval Similar Images:  Similar Images Given: A set of images Find: All images similar to a given image All pairs of similar images Sample applications: Medical diagnosis Weather predication Web search engine for images E-commerce Similar Image Retrieval Systems:  Similar Image Retrieval Systems QBIC, Virage, Photobook Compute feature signature for each image QBIC uses color histograms WBIIS, WALRUS use wavelets Use spatial index to retrieve database image whose signature is closest to the query’s signature QBIC drawbacks Computes single signature for entire image Thus, fails when images contain similar objects, but at different locations or in varying sizes Color histograms cannot capture shape, texture and location information (wavelets can!) WALRUS Similarity Model [NRS 99]:  WALRUS Similarity Model [NRS 99] WALRUS decomposes an image into regions A single signature is stored for each region Two images are considered to be similar if they have enough similar region pairs WALRUS (Step 1):  WALRUS (Step 1) Generation of Signatures for Sliding Windows Each image is broken into sliding windows For the signature of each sliding window, use coefficients from lowest frequency band of the Haar wavelet Naive Algorithm: Dynamic Programming Algorithm: N - number of pixels in the image S - - max window size WALRUS (Step 2):  WALRUS (Step 2) Clustering Sliding Windows Cluster the windows in the image using pre-clustering phase of BIRCH Each cluster defines a region in the image. For each cluster, the centroid is used as a signature. (c.f. bounding box) WALRUS - Retrieval Results:  WALRUS - Retrieval Results Query image Automated Schema Extraction for XML Data: The XTRACT System :  Automated Schema Extraction for XML Data: The XTRACT System XML Primer I:  XML Primer I Standard for data representation and data exchange Unified, self-describing format for publishing/exchanging management data across heterogeneous network/NM platforms Looks like HTML but it isn’t Collection of elements Atomic (raw character data) Composite (sequence of nested sub-elements) Example <book> <title>A relational Model for Large Shared Data Banks </title> <author> <name> E.F. Codd </name> <affiliation> IBM Research </affiliation> </author> </ book> XML Primer II:  XML Primer II XML documents can be accompanied by Document Type Descriptors (DTDs) DTDs serve the role of the schema of the document Specify a regular expression for every element Example <!ELEMENT book (title, author*)> <!ELEMENT title (#PCDATA)> <!ELEMENT author (name, affiliation)> <!ELEMENT name (#PCDATA)> <!ELEMENT affiliation (#PCDATA)> The XTRACT System [GGR 00]:  The XTRACT System [GGR 00] DTDs are of great practical importance Efficient storage of XML data collections Formulation and optimization of XML queries However, DTDs are not mandatory => XML data may not be accompanied by a DTD Automatically-generated XML documents (e.g., from relational databases or flat files) DTD standards for many communities are still evolving Goal of the XTRACT system: Automated inference of DTDs from XML-document collections Problem Formulation:  Problem Formulation Element types Þ alphabet Infer DTD for each element type separately Example sequences: instances of nested sub-elements Þ Only one level down in the hierarchy Problem statement: Given a set example sequences for element e Infer a “good” regular expression for e Hard problem!! DTDs can comprise general, complex regular expressions quantify notion of “goodness” for regular expressions Example XML Documents:  Example XML Documents <book> <book> <title></title> <title></title> <author> <author> <name></name> <name></name> <affiliation></affiliation> <address></address> </author> </author> </book> <author> <name></name> <address></address> </author> <editor></editor> <paper> <book> <title></title> <author> <name></name> <affiliation></affiliation> </author> <conference></conference> <year></year> </paper> Example (Continued):  Example (Continued) Simplified example sequences <book> : - <title><author>, <title><author><author><editor> <paper> :- <title><author><conference><year> <author> :- <name><affiliation>, <name><affiliation>, <name><address>, <name><address> Desirable solution: <!ELEMENT book (title, author*, editor?)> <!ELEMENT paper (title, author, conference, year)> <!ELEMENT author (name, affiliation?, address?)> DTD Inference Requirements:  DTD Inference Requirements Requirements for a good DTD: Generalizes to intuitively correct but previously unseen examples It should be concise (i.e., small in size) It should be precise (i.e., not cover too many sequences not contained in the set of examples) Example: Consider the case p :- ta, taa, taaa, ta, taaaa Candidate DTD The XTRACT Approach: MDL Principle:  The XTRACT Approach: MDL Principle Minimum Description Length (MDL) quantifies and resolves the tradeoff between DTD conciseness and preciseness MDL principle: The best theory to infer from a set of data is the one which minimizes the sum of (A) the length of the theory, in bits, plus (B) the length of the data, in bits, when encoded with the help of the theory. Part (A) captures conciseness, and Part (B) captures preciseness Overview of the XTRACT System:  Overview of the XTRACT System XTRACT consists of 3 subsystems Input Sequences I = { ab, abab, ac, ad, bc, bd, bbd, bbbe } SG= I U { (ab)*, (a|b)*, b*d, b*e } SF = SG U { (a|b)(c|d), b*(d|e) } Inferred DTD: (ab)*| (a|b)(c|d) | b*(d|e) MDL Subsystem:  MDL Subsystem MDL principle: Minimize the sum of Theory description length, plus Data description length given the theory In order to use MDL, need to: Define theory description length (candidate DTD) Define data description length (input sequences) given the theory (candidate DTD) Solve the resulting minimization problem MDL Subsystem - Encoding Scheme:  MDL Subsystem - Encoding Scheme Description Length of a DTD: Number of bits required to encode the DTD |Size of DTD| * log | U {(,),|,*}| Description length of a sequence given a candidate DTD: Number of bits required to specify the sequence given DTD Use a sequence of encoding indices Encoding of a given a is the empty string Î Encoding of a given (a|b|c) is the index 0 Encoding of aaa given a* is the index 3 Example: Encoding of ababcabc given ((ab)*c)* is the sequence 2,2,1 MDL Encoding Example:  MDL Encoding Example Consider again the case p :- ta, taa, taaa, taaaa Theory Description ta | taa | taaa| taaaa (t|a)* ta* 0, 1,0 2, 3 17 + 7 = 24 Data Description (given the theory) 201, 3011, 40111, 501111 1, 2, 3, 4 6 + 21 = 27 3 + 7 = 10 MDL Subsystem - Minimization:  MDL Subsystem - Minimization Maps to the Facility Location Problem (NP-hard) XTRACT employs fast heuristic algorithms proposed by the Operations Research community ta taaa taa taaaa ta w11 w12 c1 c2 c3 ta ta* taa Input Sequences Candidate DTDs References:  References [BP 98] S. Brin, and L. Page. The anatomy of a large-scale hypertextual Web search engine. WWW7, 1998. [CDA 97] S. Chakrabarti, B. Dom, and P. Indyk. Enhanced hypertext categorization using hyperlinks. ACM SIGMOD, 1998. [CDI 98] S. Chakrabarti, B. Dom, R. Agrawal, and P. Raghavan. Scalable feature selection, classification and signature generation for organizing large text databases into hierarchical topic taxonomies. VLDB Journal, 1998. [CGR 00] K. Chakrabarti, M. Garofalakis, R. Rastogi, and K. Shim. Approximate Query Processing Using Wavelets. VLDB, 2000. [DDF 90] S. Deerwater, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman. Indexing by latent semantic analysis. Journal of the Society for Information Science, 41(6), 1990. [GGR 00] M. Garofalakis, A. Gionis, R. Rastogi, S. Seshadri, and K. Shim. XTRACT: A System for Extracting Document Type Descriptors from XML Documents. ACM SIGMOD, 2000. References (Continued):  References (Continued) [GKR 98] D. Gibson, J. Kleinberg, and P. Raghavan. Clustering categorical data: An approach based on dynamical systems. VLDB, 1998. [GRS 99] S. Guha, K. Shim, and R. Rastogi. CURE: An efficient clustering algorithm for large databases. ACM SIGMOD, 1998. [GRS 98] S. Guha, K. Shim, and R. Rastogi. ROCK: A robust clustering algorithm for categorical attributes. Data Engineering, 1999. [HKK 97] E. Han, G. Karypis, V. Kumar, and B. Mobasher. Clustering based on association rule hypergraphs. DMKD Workshop, 1997. [Kle 98] J. Kleinberg. Authoritative sources in a hyperlinked environment. SODA, 1998. [KRR 98] R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins. Trawling the Web for emerging cyber-communities. WWW8, 1999. [ZRL 96] T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: An efficient data clustering method for very large databases. ACM SIGMOD, 1996.

Add a comment

Related presentations