Information Retrieval
Dragomir R. Radev, University of Michigan
radev@umich.edu
September 19, 2005

About the instructor: Dragomir R. Radev. Associate Professor, University of Michigan: School of Information; Department of Electrical Engineering and Computer Science; Department of Linguistics. Head of CLAIR (Computational Linguistics And Information Retrieval) at U. Michigan. Treasurer, North American Chapter of the ACL. Ph.D., 1998, Computer Science, Columbia University. Email: radev@umich.edu. Home page: http://tangra.si.umich.edu/~radev

Introduction

IR systems: Google, Vivísimo, AskJeeves, NSIR, Lemur, MG, Nutch.

Examples of IR systems: Conventional (library catalog): search by keyword, title, author, etc. Text-based (Lexis-Nexis, Google, FAST): search by keywords; limited search using queries in natural language. Multimedia (QBIC, WebSeek, SaFe): search by visual appearance (shapes, colors, ...). Question answering systems (AskJeeves, NSIR, AnswerBus): search in (restricted) natural language.

Need for IR: Advent of the WWW - more than 8 billion documents indexed on Google. How much information? 200 TB according to Lyman and Varian 2003 (http://www.sims.berkeley.edu/research/projects/how-much-info/). Search, routing, filtering. The user's information need.

Some definitions of Information Retrieval (IR): Salton (1989): "Information-retrieval systems process files of records and requests for information, and identify and retrieve from the files certain records in response to the information requests. The retrieval of particular records depends on the similarity between the records and the queries, which in turn is measured by comparing the values of certain attributes to records and information requests." Kowalski (1997): "An Information Retrieval System is a system that is capable of storage, retrieval, and maintenance of information. Information in this context can be composed of text (including numeric and date data), images, audio, video, and other multi-media objects."

Sample queries (from Excite), reproduced verbatim, misspellings included:
In what year did baseball become an offical sport?
play station codes . com
birth control and depression
government
"WorkAbility I"+conference
kitchen appliances
where can I find a chines rosewood
tiger electronics
58 Plymouth Fury
How does the character Seyavash in Ferdowsi's Shahnameh exhibit characteristics of a hero?
emeril Lagasse
Hubble
M.S Subalaksmi
running

Mappings and abstractions: Reality -> Data; Information need -> Query. (From Korfhage's book.)

Typical IR system: (Crawling), Indexing, Retrieval, User interface.

Key terms used in IR: QUERY: a representation of what the user is looking for - can be a list of words or a phrase. DOCUMENT: an information entity that the user wants to retrieve. COLLECTION: a set of documents. INDEX: a representation of information that makes querying easier. TERM: a word or concept that appears in a document or a query.

Documents: Not just printed paper. Collections vs. documents. Data structures: representations. Bag-of-words method. Document surrogates: keywords, summaries. Encoding: ASCII, Unicode, etc.

Document preprocessing: Formatting. Tokenization (Paul's, Willow Dr., Dr. Willow, 555-1212, New York, ad hoc). Casing (cat vs. CAT). Stemming (computer, computation). Soundex.

Document representations: Term-document matrix (m x n), term-term matrix (m x m x n), document-document matrix (n x n). Example: 3,000,000 documents (n) with 50,000 terms (m). Sparse matrices. Boolean vs. integer matrices.

Document representations (cont'd): Term-document matrix: evaluating queries (e.g., (A AND B) OR C); storage issues. Inverted files: storage issues; evaluating queries; advantages and disadvantages.

IR models

Major IR models: Boolean, vector, probabilistic, language modeling, fuzzy retrieval, latent semantic indexing.

Major IR tasks: Ad hoc retrieval, filtering and routing, question answering, spoken document retrieval, multimedia retrieval.

Venn diagrams: [figure: Venn diagram with regions w, x, y, z over documents D1 and D2]

Boolean model: [figure: Venn diagram of sets A and B] Example query: restaurants AND (Mideastern OR vegetarian) AND inexpensive.

Boolean queries: What types of documents are returned? Stemming; thesaurus expansion; inclusive vs. exclusive OR; confusing uses of AND and OR: dinner AND sports AND symphony. 4 OF (Pentium, printer, cache, PC, monitor, computer, personal).

Boolean queries (cont'd): Weighting (Beethoven AND sonatas). Precedence: coffee AND croissant OR muffin; raincoat AND umbrella OR sunglasses. Use of negation: potential problems. Conjunctive and disjunctive normal forms; full CNF and DNF.

Transformations: De Morgan's Laws: NOT (A AND B) = (NOT A) OR (NOT B); NOT (A OR B) = (NOT A) AND (NOT B). CNF or DNF? Reference librarians prefer CNF - why?

Boolean model (cont'd): Partition. Partial relevance? Operators: AND, NOT, OR, parentheses.

Exercise:
D1 = "computer information retrieval"
D2 = "computer retrieval"
D3 = "information"
D4 = "computer information"
Q1 = "information AND retrieval"
Q2 = "information AND NOT computer"

Exercise: ((chaucer OR milton) AND (NOT swift)) OR ((NOT chaucer) AND (swift OR shakespeare))
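The Boolean exercises above can be checked mechanically. Below is a minimal sketch (not from the original slides) in Python: an inverted index built as term -> set of document IDs, with AND, OR, and NOT mapped to set intersection, union, and difference.

# Minimal Boolean-model sketch for the first exercise above.
docs = {
    "D1": "computer information retrieval",
    "D2": "computer retrieval",
    "D3": "information",
    "D4": "computer information",
}

# Build the inverted index: term -> set of documents containing it.
index = {}
for doc_id, text in docs.items():
    for term in text.split():
        index.setdefault(term, set()).add(doc_id)

q1 = index["information"] & index["retrieval"]   # information AND retrieval
q2 = index["information"] - index["computer"]    # information AND NOT computer

print(sorted(q1))   # ['D1']
print(sorted(q2))   # ['D3']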
Stop lists: The 250-300 most common words in English account for 50% or more of a given text. Example: "the" and "of" represent 10% of tokens; "and", "to", "a", and "in" another 10%; the next 12 words another 10%. Moby Dick Ch. 1: 859 unique words (types), 2256 word occurrences (tokens). The top 65 types cover 1132 tokens (> 50%). Token/type ratio: 2256/859 = 2.63.

Vector models: [figure: three-dimensional vector space with axes Term 1, Term 2, Term 3 and points Doc 1, Doc 2, Doc 3]

Vector queries: Each document is represented as a vector. Non-efficient representations (bit vectors). Dimensional compatibility.

The matching process: Document space. Matching is done between a document and a query (or between two documents): distance vs. similarity. Euclidean distance, Manhattan distance, word overlap, Jaccard coefficient, etc.

Miscellaneous similarity measures: The cosine measure:

  sigma(D,Q) = (sum_i d_i * q_i) / (sqrt(sum_i d_i^2) * sqrt(sum_i q_i^2))

which, for binary (set) representations X and Y, reduces to |X intersect Y| / (sqrt|X| * sqrt|Y|). The Jaccard coefficient:

  J(X,Y) = |X intersect Y| / |X union Y|

Exercise: Compute the cosine measures sigma(D1,D2) and sigma(D1,D3) for the documents D1 = <1,3>, D2 = <100,300>, and D3 = <3,1>. Compute the corresponding Euclidean distances, Manhattan distances, and Jaccard coefficients.
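A minimal Python sketch (not part of the slides) of the four measures used in the exercise; the vectors D1, D2, D3 are the ones given above.

from math import sqrt

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (sqrt(sum(a * a for a in x)) * sqrt(sum(b * b for b in y)))

def euclidean(x, y):
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

def jaccard(x, y):
    # Set-based Jaccard over the indices of nonzero coordinates.
    xs = {i for i, a in enumerate(x) if a}
    ys = {i for i, b in enumerate(y) if b}
    return len(xs & ys) / len(xs | ys)

D1, D2, D3 = (1, 3), (100, 300), (3, 1)
print(cosine(D1, D2))    # 1.0 -- same direction, so cosine ignores vector length
print(cosine(D1, D3))    # 0.6
print(euclidean(D1, D2), manhattan(D1, D2))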
Evaluation

Relevance: Difficult to define: fuzzy, inconsistent. Methods: exhaustive, sampling, pooling, search-based.

Contingency table:

                 relevant     not relevant
retrieved           w              x          n1 = w + x
not retrieved       y              z
                 n2 = w + y                       N

Precision and recall: Recall = w / (w + y). Precision = w / (w + x).

Exercise: Go to Google (www.google.com) and search for documents on Tolkien's "Lord of the Rings". Try different ways of phrasing the query: e.g., Tolkien, "JRR Tolkien", +"JRR Tolkien" +"Lord of the Rings", etc. For each query, compute the precision (P) based on the first 10 documents returned. Note: before starting the exercise, have a clear idea of what a relevant document for your query should look like. Try different information needs; later, try different queries.

[figure: precision-recall curves, from Salton's book]

Interpolated average precision (e.g., 11-point). Interpolation: what is precision at recall = 0.5?

Issues: Why not use accuracy A = (w + z)/N? Average precision. Average P at given "document cutoff values". Report P where P = R. F measure: F = (b^2 + 1)PR / (b^2 P + R). F1 measure: F1 = 2 / (1/R + 1/P), the harmonic mean of P and R.

Kappa: N: number of items (index i). n: number of categories (index j). k: number of annotators.

Kappa example (from Manning, Schuetze, Raghavan). Two judges each label 400 documents as relevant (+) or nonrelevant (-); reconstructed from the computation below, the counts are: both +, 300; judge 1 + and judge 2 -, 20; judge 1 - and judge 2 +, 10; both -, 70 (370 agreements out of 400).

Kappa (cont'd):
P(A) = 370/400 = 0.925
P(-) = (10 + 20 + 70 + 70)/800 = 0.2125
P(+) = (10 + 20 + 300 + 300)/800 = 0.7875
P(E) = 0.2125^2 + 0.7875^2 = 0.665
K = (0.925 - 0.665) / (1 - 0.665) = 0.776
Kappa higher than 0.67 is tentatively acceptable; higher than 0.8 is good.
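The contingency-table measures and the kappa computation above can be sketched in a few lines of Python (assumption: this code is illustrative and not from the slides; the counts are the ones from the Manning/Schuetze/Raghavan example).

from math import isclose

def precision(w, x): return w / (w + x)
def recall(w, y): return w / (w + y)

def f_measure(p, r, beta=1.0):
    # F = (b^2 + 1) P R / (b^2 P + R); beta = 1 gives the harmonic mean F1.
    return (beta ** 2 + 1) * p * r / (beta ** 2 * p + r)

def kappa(yes_yes, yes_no, no_yes, no_no):
    n = yes_yes + yes_no + no_yes + no_no            # items judged by both annotators
    p_agree = (yes_yes + no_no) / n                  # observed agreement P(A)
    p_plus = (2 * yes_yes + yes_no + no_yes) / (2 * n)   # pooled marginal P(+)
    p_minus = (2 * no_no + yes_no + no_yes) / (2 * n)    # pooled marginal P(-)
    p_e = p_plus ** 2 + p_minus ** 2                 # chance agreement P(E)
    return (p_agree - p_e) / (1 - p_e)

print(kappa(300, 20, 10, 70))   # ~0.776, matching the worked example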
Relevance collections: TREC ad hoc collections, 2-6 GB. TREC Web collections, 2-100 GB.

Sample TREC query:

<top>
<num> Number: 305
<title> Most Dangerous Vehicles
<desc> Description: Which are the most crashworthy, and least crashworthy, passenger vehicles?
<narr> Narrative: A relevant document will contain information on the crashworthiness of a given vehicle or vehicles that can be used to draw a comparison with other vehicles. The document will have to describe/compare vehicles, not drivers. For instance, it should be expected that vehicles preferred by 16-25 year-olds would be involved in more crashes, because that age group is involved in more crashes. I would view number of fatalities per 100 crashes to be more revealing of a vehicle's crashworthiness than the number of crashes per 100,000 miles, for example.
</top>

Relevant documents for this query: LA031689-0177 FT922-1008 LA090190-0126 LA101190-0218 LA082690-0158 LA112590-0109 FT944-136 LA020590-0119 FT944-5300 LA052190-0048 LA051689-0139 FT944-9371 LA032390-0172 LA042790-0172 LA021790-0136 LA092289-0167 LA111189-0013 LA120189-0179 LA020490-0021 LA122989-0063 LA091389-0119 LA072189-0048 FT944-15615 LA091589-0101 LA021289-0208

Sample TREC document:

<DOCNO> LA031689-0177 </DOCNO>
<DOCID> 31701 </DOCID>
<DATE><P>March 16, 1989, Thursday, Home Edition </P></DATE>
<SECTION><P>Business; Part 4; Page 1; Column 5; Financial Desk </P></SECTION>
<LENGTH><P>586 words </P></LENGTH>
<HEADLINE><P>AGENCY TO LAUNCH STUDY OF FORD BRONCO II AFTER HIGH RATE OF ROLL-OVER ACCIDENTS </P></HEADLINE>
<BYLINE><P>By LINDA WILLIAMS, Times Staff Writer </P></BYLINE>
<TEXT>
<P>The federal government's highway safety watchdog said Wednesday that the Ford Bronco II appears to be involved in more fatal roll-over accidents than other vehicles in its class and that it will seek to determine if the vehicle itself contributes to the accidents. </P>
<P>The decision to do an engineering analysis of the Ford Motor Co. utility-sport vehicle grew out of a federal accident study of the Suzuki Samurai, said Tim Hurd, a spokesman for the National Highway Traffic Safety Administration. NHTSA looked at Samurai accidents after Consumer Reports magazine charged that the vehicle had basic design flaws. </P>
<P>Several Fatalities </P>
<P>However, the accident study showed that the "Ford Bronco II appears to have a higher number of single-vehicle, first event roll-overs, particularly those involving fatalities," Hurd said. The engineering analysis of the Bronco, the second of three levels of investigation conducted by NHTSA, will cover the 1984-1989 Bronco II models, the agency said. </P>
<P>According to a Fatal Accident Reporting System study included in the September report on the Samurai, 43 Bronco II single-vehicle roll-overs caused fatalities, or 19 of every 100,000 vehicles. There were eight Samurai fatal roll-overs, or 6 per 100,000; 13 involving the Chevrolet S10 Blazers or GMC Jimmy, or 6 per 100,000, and six fatal Jeep Cherokee roll-overs, for 2.5 per 100,000. After the accident report, NHTSA declined to investigate the Samurai. </P>
...
</TEXT>
<GRAPHIC><P> Photo, The Ford Bronco II "appears to have a higher number of single-vehicle, first event roll-overs," a federal official said. </P></GRAPHIC>
<SUBJECT><P>TRAFFIC ACCIDENTS; FORD MOTOR CORP; NATIONAL HIGHWAY TRAFFIC SAFETY ADMINISTRATION; VEHICLE INSPECTIONS; RECREATIONAL VEHICLES; SUZUKI MOTOR CO; AUTOMOBILE SAFETY </P></SUBJECT>
</DOC>

TREC (cont'd): http://trec.nist.gov/tracks.html ; http://trec.nist.gov/presentations/presentations.html
Word distribution models

Shakespeare, Romeo and Juliet, word frequencies: And, 667; The, 661; I, 570; To, 515; A, 447; Of, 382; My, 356; Is, 343; That, 343; In, 314; You, 289; Thou, 277; Me, 262; Not, 257; With, 234; It, 224; For, 223; This, 215; Be, 207; But, 181; Thy, 167; What, 163; O, 160; As, 156; Her, 150; Will, 147; So, 145; Thee, 139; Love, 135; His, 128; Have, 127; He, 120; Romeo, 115; By, 114; She, 114; Shall, 107; Your, 103; No, 102; Come, 96; Him, 96; All, 92; Do, 89; From, 86; Then, 83; Good, 82; Now, 82; Here, 80; If, 80; An, 78; Go, 76; On, 76; I'll, 71; Death, 69; Night, 68; Are, 67; More, 67; We, 66; At, 65; Man, 65; Or, 65; There, 64; Hath, 63; Which, 60; ... and, at the other end of the distribution: A-bed, 1; A-bleeding, 1; A-weary, 1; Abate, 1; Abbey, 1; Abhorred, 1; Abhors, 1; Aboard, 1; Abound'st, 1; Abroach, 1; Absolved, 1; Abuse, 1; Abused, 1; Abuses, 1; Accents, 1; Access, 1; Accident, 1; Accidents, 1; According, 1; Accursed, 1; Accustom'd, 1; Ache, 1; Aches, 1; Aching, 1; Acknowledge, 1; Acquaint, 1; Acquaintance, 1; Acted, 1; Acting, 1; Action, 1; Acts, 1; Adam, 1; Add, 1; Added, 1; Adding, 1; Addle, 1; Adjacent, 1; Admired, 1; Ado, 1; Advance, 1; Adversary, 1; Adversity's, 1; Advise, 1; Afeard, 1; Affecting, 1; Afflicted, 1; Affliction, 1; Affords, 1; Affray, 1; Affright, 1; Afire, 1; Agate-stone, 1; Agile, 1; Agree, 1; Agrees, 1; Aim'd, 1; Alderman, 1; All-cheering, 1; All-seeing, 1; Alla, 1; Alliance, 1; Alligator, 1; Allow, 1; Ally, 1; Although, 1. Source: http://www.mta75.org/curriculum/english/Shakes/indexx.html

The BNC (Adam Kilgarriff), top 20 lemmas (rank, frequency, word, part of speech):
1   6187267  the   det
2   4239632  be    v
3   3093444  of    prep
4   2687863  and   conj
5   2186369  a     det
6   1924315  in    prep
7   1620850  to    infinitive-marker
8   1375636  have  v
9   1090186  it    pron
10  1039323  to    prep
11   887877  for   prep
12   884599  i     pron
13   760399  that  conj
14   695498  you   pron
15   681255  he    pron
16   680739  on    prep
17   675027  with  prep
18   559596  do    v
19   534162  at    prep
20   517171  by    prep
Kilgarriff, A. Putting Frequencies in the Dictionary. International Journal of Lexicography 10 (2), 1997, pp. 135-155.

Stop lists: The 250-300 most common words in English account for 50% or more of a given text. Example: "the" and "of" represent 10% of tokens; "and", "to", "a", and "in" another 10%; the next 12 words another 10%. Moby Dick Ch. 1: 859 unique words (types), 2256 word occurrences (tokens). The top 65 types cover 1132 tokens (> 50%). Token/type ratio: 2256/859 = 2.63.

Zipf's law: Rank x Frequency ~ Constant.

Zipf's law is fairly general! Frequency of accesses to web pages: in particular, the access counts on Wikipedia pages, with s approximately equal to 0.3; page access counts on the Polish Wikipedia (data for late July 2003) approximately obey Zipf's law with s about 0.5. Words in the English language: for instance, in Shakespeare's play Hamlet, with s approximately 0.5. Sizes of settlements. Income distributions amongst individuals. Sizes of earthquakes. Notes in musical performances. See http://en.wikipedia.org/wiki/Zipf's_law

Zipf's law (cont'd): Limitations: low and high frequencies; lack of convergence. A power law with coefficient c = -1: y = k * x^c. Li (1992): even typing words one letter at a time, including spaces, produces a Zipf-like distribution.

Heaps' law: Size of vocabulary: V(n) = K * n^beta. In English, K is between 10 and 100 and beta is between 0.4 and 0.6. [figure: V(n) plotted against n] See http://en.wikipedia.org/wiki/Heaps%27_law

Heaps' law (cont'd): Related to Zipf's law: generative models. Zipf's and Heaps' law coefficients change with language. Alexander Gelbukh, Grigori Sidorov. Zipf and Heaps Laws' Coefficients Depend on Language. Proc. CICLing-2001, Conference on Intelligent Text Processing and Computational Linguistics, February 18-24, 2001, Mexico City. Lecture Notes in Computer Science N 2004, ISSN 0302-9743, ISBN 3-540-41687-0, Springer-Verlag, pp. 332-335.
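Both laws are easy to eyeball on any corpus. A minimal sketch, assuming a plain-text file (the file name below is a placeholder, not from the slides):

from collections import Counter

text = open("moby_dick.txt").read().lower().split()   # hypothetical file name

# Zipf: rank * frequency should stay roughly constant near the top.
counts = Counter(text)
for rank, (word, freq) in enumerate(counts.most_common(10), start=1):
    print(rank, word, freq, rank * freq)

# Heaps: vocabulary size V(n) after the first n tokens; expect V(n) ~ K * n**beta.
seen = set()
for n, token in enumerate(text, start=1):
    seen.add(token)
    if n % 10000 == 0:
        print(n, len(seen))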
Indexing

Methods: Manual: e.g., Library of Congress subject headings, MeSH. Automatic.

LOC subject headings (http://www.loc.gov/catdir/cpso/lcco/lcco.html):
A -- GENERAL WORKS
B -- PHILOSOPHY. PSYCHOLOGY. RELIGION
C -- AUXILIARY SCIENCES OF HISTORY
D -- HISTORY (GENERAL) AND HISTORY OF EUROPE
E -- HISTORY: AMERICA
F -- HISTORY: AMERICA
G -- GEOGRAPHY. ANTHROPOLOGY. RECREATION
H -- SOCIAL SCIENCES
J -- POLITICAL SCIENCE
K -- LAW
L -- EDUCATION
M -- MUSIC AND BOOKS ON MUSIC
N -- FINE ARTS
P -- LANGUAGE AND LITERATURE
Q -- SCIENCE
R -- MEDICINE
S -- AGRICULTURE
T -- TECHNOLOGY
U -- MILITARY SCIENCE
V -- NAVAL SCIENCE
Z -- BIBLIOGRAPHY. LIBRARY SCIENCE. INFORMATION RESOURCES (GENERAL)

Medicine: CLASS R - MEDICINE. Subclass R:
R5-920       Medicine (General)
R5-130.5     General works
R131-687     History of medicine. Medical expeditions
R690-697     Medicine as a profession. Physicians
R702-703     Medicine and the humanities. Medicine and disease in relation to history, literature, etc.
R711-713.97  Directories
R722-722.32  Missionary medicine. Medical missionaries
R723-726     Medical philosophy. Medical ethics
R726.5-726.8 Medicine and disease in relation to psychology. Terminal care. Dying
R727-727.5   Medical personnel and the public. Physician and the public
R728-733     Practice of medicine. Medical practice economics
R735-854     Medical education. Medical schools. Research
R855-855.5   Medical technology
R856-857     Biomedical engineering. Electronics. Instrumentation
R858-859.7   Computer applications to medicine. Medical informatics
R864         Medical records
R895-920     Medical physics. Medical radiology. Nuclear medicine

Finding the most frequent terms in a document: Typically stop words: the, and, in. Not content-bearing. Terms vs. words. Luhn's method.

Luhn's method: [figure: word frequency distribution with upper and lower frequency cut-offs; the significant words lie between the two]

Computing term salience: Term frequency (TF). Document frequency (DF). Inverse document frequency (IDF).

Applications of TF*IDF: Cosine similarity. Indexing. Clustering.

Vector-based matching: The cosine measure:

  sim(D,C) = (sum_k d_k * c_k * idf(k)) / (sqrt(sum_k d_k^2) * sqrt(sum_k c_k^2))

IDF: Inverse document frequency. N: number of documents; d_k: number of documents containing term k; f_ik: absolute frequency of term k in document i; w_ik: weight of term k in document i.

  idf_k = log2(N/d_k) + 1 = log2(N) - log2(d_k) + 1

TF*IDF is used for automated indexing and for topic discrimination. Asian and European news - top TF*IDF terms in two clusters:
622.941 deng / 306.835 china / 196.725 beijing / 153.608 chinese / 152.113 xiaoping / 124.591 jiang / 108.777 communist / 102.894 body / 85.173 party / 71.898 died / 68.820 leader / 43.402 state / 38.166 people
97.487 nato / 92.151 albright / 74.652 belgrade / 46.657 enlargement / 34.778 alliance / 34.778 french / 33.803 opposition / 32.571 russia / 14.095 government / 9.389 told / 9.154 would / 8.459 their / 6.059 which

Other topics:
120.385 shuttle / 99.487 space / 90.128 telescope / 70.224 hubble / 59.992 rocket / 50.160 astronauts / 49.722 discovery / 47.782 canaveral / 47.782 cape / 40.889 mission / 35.778 florida / 27.063 center
74.652 compuserve / 65.321 massey / 55.989 salizzoni / 29.996 bob / 27.994 online / 27.198 executive / 15.890 interim / 15.271 chief / 11.647 service / 11.174 second / 6.781 world / 6.315 president
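A sketch of tf*idf weighting using exactly the idf variant defined above, idf_k = log2(N/d_k) + 1. The three mini-documents are hypothetical stand-ins echoing the news clusters, not the real collection behind those scores.

from collections import Counter
from math import log2

docs = [
    "deng xiaoping died beijing",
    "china beijing communist party",
    "nato albright belgrade enlargement",
]
N = len(docs)
tokenized = [d.split() for d in docs]

# Document frequency d_k: in how many documents does each term appear?
df = Counter(term for toks in tokenized for term in set(toks))

def tfidf(term, toks):
    tf = toks.count(term)            # absolute frequency f_ik
    idf = log2(N / df[term]) + 1     # the slide's idf_k
    return tf * idf

for toks in tokenized:
    print({t: round(tfidf(t, toks), 3) for t in set(toks)})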
Compression

Compression methods: fixed length codes; Huffman coding; Ziv-Lempel codes.

Fixed length codes: Binary representations. ASCII. Representational power: 2^k symbols, where k is the number of bits.

Variable length codes. The Morse alphabet:
A .-     N -.     0 -----
B -...   O ---    1 .----
C -.-.   P .--.   2 ..---
D -..    Q --.-   3 ...--
E .      R .-.    4 ....-
F ..-.   S ...    5 .....
G --.    T -      6 -....
H ....   U ..-    7 --...
I ..     V ...-   8 ---..
J .---   W .--    9 ----.
K -.-    X -..-
L .-..   Y -.--
M --     Z --..
Demos: http://www.babbage.demon.co.uk/morse.html ; http://www.scphillips.com/morse/

Most frequent letters in English: E T A O I N S H R D L U (http://www.math.cornell.edu/~mec/modules/cryptography/subs/frequencies.html; demo: http://www.amstat.org/publications/jse/secure/v7n2/count-char.cfm). Also, bigrams: TH HE IN ER AN RE ND AT ON NT (http://www.math.cornell.edu/~mec/modules/cryptography/subs/digraphs.html).

Useful links about cryptography: http://world.std.com/~franl/crypto.html ; http://www.faqs.org/faqs/cryptography-faq/ ; http://en.wikipedia.org/wiki/Cryptography

Huffman coding: Developed by David Huffman (1952). Average of 5 bits per character (37.5% compression). Based on frequency distributions of symbols. Algorithm: iteratively build a tree of symbols, starting by merging the two least frequent symbols. (A Python sketch appears at the end of this section.)

[figure: example Huffman tree with 0/1 branch labels and leaves a, e, h, j, i, g, f, d, b, c]

Exercise: Consider the bit string 01101101111000100110001110100111000110101101011101. Use the Huffman code from the example to decode it. Try inserting, deleting, and switching some bits at random locations and try decoding.

Ziv-Lempel coding: Two types - one is known as LZ77 (used in GZIP). Code: a set of triples <a,b,c>. a: how far back in the decoded text to look for the upcoming text segment; b: how many characters to copy; c: new character to add to complete the segment.

LZ77 example:
<0,0,p>   p
<0,0,e>   pe
<0,0,t>   pet
<2,1,r>   peter
<0,0,_>   peter_
<6,1,i>   peter_pi
<8,2,r>   peter_piper
<6,3,c>   peter_piper_pic
<0,0,k>   peter_piper_pick
<7,1,d>   peter_piper_picked
<7,1,a>   peter_piper_picked_a
<9,2,e>   peter_piper_picked_a_pe
<9,2,_>   peter_piper_picked_a_peck_
<0,0,o>   peter_piper_picked_a_peck_o
<0,0,f>   peter_piper_picked_a_peck_of
<17,5,l>  peter_piper_picked_a_peck_of_pickl
<12,1,d>  peter_piper_picked_a_peck_of_pickled
<16,3,p>  peter_piper_picked_a_peck_of_pickled_pep
<3,2,r>   peter_piper_picked_a_peck_of_pickled_pepper
<0,0,s>   peter_piper_picked_a_peck_of_pickled_peppers

Links on text compression: Data compression: http://www.data-compression.info/ ; Calgary corpus: http://links.uwaterloo.ca/calgary.corpus.html ; Huffman coding: http://www.compressconsult.com/huffman/ , http://en.wikipedia.org/wiki/Huffman_coding ; LZ: http://en.wikipedia.org/wiki/LZ77
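A sketch of the Huffman construction described above (assumption: heap-based merging with arbitrary tie-breaking, so the exact bit patterns may differ from the tree in the figure, although the resulting code is still optimal):

import heapq
from itertools import count

def huffman(freqs):
    tiebreak = count()   # unique counter keeps heap entries comparable on ties
    heap = [(f, next(tiebreak), sym) for sym, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Repeatedly merge the two least frequent nodes, as in the slide.
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tiebreak), (left, right)))
    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):      # internal node: branch on 0/1
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:                            # leaf: an input symbol
            codes[node] = prefix or "0"
    walk(heap[0][2], "")
    return codes

print(huffman({"e": 12, "t": 9, "a": 8, "o": 7, "i": 6}))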
Relevance feedback and query expansion

Relevance feedback: Problem: the initial query may not be the most appropriate way to satisfy a given information need. Idea: modify the original query so that it gets closer to the right documents in the vector space.

Relevance feedback (cont'd): Automatic or manual. Method: identify feedback terms. Q' = a1*Q + a2*R - a3*N. Often a1 = 1, a2 = 1/|R|, and a3 = 1/|N|.

Example: Q = "safety minivans". D1 = "car safety minivans tests injury statistics" - relevant. D2 = "liability tests safety" - relevant. D3 = "car passengers injury reviews" - non-relevant. R = ? S = ? Q' = ?

Pseudo relevance feedback: Automatic query expansion. Thesaurus-based expansion (e.g., using latent semantic indexing - later). Distributional similarity. Query log mining.

Examples. Lexical semantics (hypernymy): Book: publication, product, fact, dramatic composition, record. Computer: machine, expert, calculator, reckoner, figurer. Fruit: reproductive structure, consequence, product, bear. Politician: leader, schemer. Newspaper: press, publisher, product, paper, newsprint. Distributional clustering: Book: autobiography, essay, biography, memoirs, novels. Computer: adobe, computing, computers, developed, hardware. Fruit: leafy, canned, fruits, flowers, grapes. Politician: activist, campaigner, politicians, intellectuals, journalist. Newspaper: daily, globe, newspapers, newsday, paper.

Examples (query logs): Book: booksellers, bookmark, blue. Computer: sales, notebook, stores, shop. Fruit: recipes, cake, salad, basket, company. Games: online, play, gameboy, free, video. Politician: careers, federal, office, history. Newspaper: online, website, college, information. Schools: elementary, high, ranked, yearbook. California: berkeley, san francisco, southern. French: embassy, dictionary, learn.

Problems with automatic query expansion: Adding frequent words may dilute the results (example?).
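A sketch of the Q' = a1*Q + a2*R - a3*N update above, applied to the safety/minivans example (assumptions: vectors represented as term -> weight dicts, and a1 = 1, a2 = 1/|R|, a3 = 1/|N| as on the slide):

def rocchio(query, relevant, nonrelevant, a1=1.0):
    a2 = 1.0 / len(relevant) if relevant else 0.0
    a3 = 1.0 / len(nonrelevant) if nonrelevant else 0.0
    terms = set(query) | {t for d in relevant + nonrelevant for t in d}
    return {t: (a1 * query.get(t, 0.0)
                + a2 * sum(d.get(t, 0.0) for d in relevant)
                - a3 * sum(d.get(t, 0.0) for d in nonrelevant))
            for t in terms}

Q = {"safety": 1, "minivans": 1}
D1 = dict.fromkeys("car safety minivans tests injury statistics".split(), 1)
D2 = dict.fromkeys("liability tests safety".split(), 1)
D3 = dict.fromkeys("car passengers injury reviews".split(), 1)
print(rocchio(Q, [D1, D2], [D3]))   # "safety" is boosted; "passengers" goes negative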
Stemming

Goals: Motivation: computer, computers, computerize, computational, computerization; user, users, using, used. Represent related words as one token. Simplify matching. Reduce storage and computation. Also known as term conflation.

Methods: Manual (tables): achievement -> achiev; achiever -> achiev; etc. Affix removal (Harman 1991, Frakes 1992): if a word ends in "ies" but not "eies" or "aies", then "ies" -> "y"; if a word ends in "es" but not "aes", "ees", or "oes", then "es" -> "e"; if a word ends in "s" but not "us" or "ss", then "s" -> NULL. (Apply only the first applicable rule.)

Porter's algorithm (Porter 1980): Home page: http://www.tartarus.org/~martin/PorterStemmer. Reading assignment: http://www.tartarus.org/~martin/PorterStemmer/def.txt. Consonant-vowel sequences: CVCV...C, CVCV...V, VCVC...C, VCVC...V. Shorthand: [C]VCVC...[V].

Porter's algorithm (cont'd): [C](VC){m}[V], where {m} indicates repetition. Examples: m=0: TR, EE, TREE, Y, BY. m=1: TROUBLE, OATS, TREES, IVY. m=2: TROUBLES, PRIVATE, OATEN. Conditions: *S - the stem ends with S (and similarly for the other letters). *v* - the stem contains a vowel. *d - the stem ends with a double consonant (e.g. -TT, -SS). *o - the stem ends cvc, where the second c is not W, X or Y (e.g. -WIL, -HOP).

Step 1a:
SSES -> SS   caresses -> caress
IES -> I     ponies -> poni, ties -> ti
SS -> SS     caress -> caress
S ->         cats -> cat

Step 1b:
(m>0) EED -> EE   feed -> feed, agreed -> agree
(*v*) ED ->       plastered -> plaster, bled -> bled
(*v*) ING ->      motoring -> motor, sing -> sing

Step 1b1 (if the second or third of the rules in Step 1b is successful, the following is done):
AT -> ATE    conflat(ed) -> conflate
BL -> BLE    troubl(ed) -> trouble
IZ -> IZE    siz(ed) -> size
(*d and not (*L or *S or *Z)) -> single letter   hopp(ing) -> hop, tann(ed) -> tan, fall(ing) -> fall, hiss(ing) -> hiss, fizz(ed) -> fizz
(m=1 and *o) -> E    fail(ing) -> fail, fil(ing) -> file

Step 1c:
(*v*) Y -> I    happy -> happi, sky -> sky

Step 2:
(m>0) ATIONAL -> ATE    relational -> relate
(m>0) TIONAL -> TION    conditional -> condition, rational -> rational
(m>0) ENCI -> ENCE      valenci -> valence
(m>0) ANCI -> ANCE      hesitanci -> hesitance
(m>0) IZER -> IZE       digitizer -> digitize
(m>0) ABLI -> ABLE      conformabli -> conformable
(m>0) ALLI -> AL        radicalli -> radical
(m>0) ENTLI -> ENT      differentli -> different
(m>0) ELI -> E          vileli -> vile
(m>0) OUSLI -> OUS      analogousli -> analogous
(m>0) IZATION -> IZE    vietnamization -> vietnamize
(m>0) ATION -> ATE      predication -> predicate
(m>0) ATOR -> ATE       operator -> operate
(m>0) ALISM -> AL       feudalism -> feudal
(m>0) IVENESS -> IVE    decisiveness -> decisive
(m>0) FULNESS -> FUL    hopefulness -> hopeful
(m>0) OUSNESS -> OUS    callousness -> callous
(m>0) ALITI -> AL       formaliti -> formal
(m>0) IVITI -> IVE      sensitiviti -> sensitive
(m>0) BILITI -> BLE     sensibiliti -> sensible

Step 3:
(m>0) ICATE -> IC    triplicate -> triplic
(m>0) ATIVE ->       formative -> form
(m>0) ALIZE -> AL    formalize -> formal
(m>0) ICITI -> IC    electriciti -> electric
(m>0) ICAL -> IC     electrical -> electric
(m>0) FUL ->         hopeful -> hope
(m>0) NESS ->        goodness -> good

Step 4:
(m>1) AL ->      revival -> reviv
(m>1) ANCE ->    allowance -> allow
(m>1) ENCE ->    inference -> infer
(m>1) ER ->      airliner -> airlin
(m>1) IC ->      gyroscopic -> gyroscop
(m>1) ABLE ->    adjustable -> adjust
(m>1) IBLE ->    defensible -> defens
(m>1) ANT ->     irritant -> irrit
(m>1) EMENT ->   replacement -> replac
(m>1) MENT ->    adjustment -> adjust
(m>1) ENT ->     dependent -> depend
(m>1 and (*S or *T)) ION ->   adoption -> adopt
(m>1) OU ->      homologou -> homolog
(m>1) ISM ->     communism -> commun
(m>1) ATE ->     activate -> activ
(m>1) ITI ->     angulariti -> angular
(m>1) OUS ->     homologous -> homolog
(m>1) IVE ->     effective -> effect
(m>1) IZE ->     bowdlerize -> bowdler

Step 5a:
(m>1) E ->             probate -> probat, rate -> rate
(m=1 and not *o) E ->  cease -> ceas

Step 5b:
(m>1 and *d and *L) -> single letter    controll -> control, roll -> roll

Porter's algorithm (cont'd): Example: the word "duplicatable": duplicat (rule 4), duplicate (rule 1b1), duplic (rule 3). The other applicable rule in step 4, removing "ic," cannot be applied, since only one rule from each step is allowed to apply.

% cd /clair4/class/ir-w03/tf-idf
% ./stem.pl computers
computers comput

Stemming caveats: Not always appropriate (e.g., proper names, titles). The same applies to casing (e.g., CAT vs. cat).
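The three Harman-style affix-removal rules quoted earlier translate directly into a runnable toy. This is deliberately not the full Porter algorithm, which adds the [C](VC){m}[V] measure conditions and the five rule steps above; it is only the simple "first applicable rule wins" scheme:

def simple_stem(word):
    if word.endswith("ies") and not word.endswith(("eies", "aies")):
        return word[:-3] + "y"          # "ies" -> "y": ponies -> pony
    if word.endswith("es") and not word.endswith(("aes", "ees", "oes")):
        return word[:-1]                # "es" -> "e": bases -> base
    if word.endswith("s") and not word.endswith(("us", "ss")):
        return word[:-1]                # "s" -> NULL: cats -> cat
    return word                         # no rule applies

for w in ["ponies", "bases", "cats", "caress", "virus"]:
    print(w, "->", simple_stem(w))      # caress and virus are left alone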
String matching

String matching methods: Index-based. Full or approximate: e.g., theater = theatre.

Index-based matching: Inverted files; position-based inverted files; block-based inverted files. Example text, with starting character positions in parentheses: This(1) is(6) a(9) text(11). A(17) text(19) has(24) many(28) words(33). Words(40) are(46) made(50) from(55) letters(60). Postings: Text: 11, 19. Words: 33, 40. From: 55.

Inverted index (trie): [figure: trie branching on first letters l, m, t, w (with m splitting into m-a-d and m-a-n), storing Letters: 60; Made: 50; Many: 28; Text: 11, 19; Words: 33, 40]

Sequential searching: No indexing structure given. Given: database d and search pattern p. Example: find "words" in the earlier example. Brute force method: try all possible starting positions. O(n) positions in the database and O(m) characters in the pattern, so the total worst-case runtime is O(mn). Typical runtime is actually O(n), given that mismatches are easy to notice.

Knuth-Morris-Pratt: Average runtime similar to brute force. Worst-case runtime is linear: O(n). Idea: reuse knowledge. Requires preprocessing of the pattern. (See the Python sketch after this section.)

Knuth-Morris-Pratt (cont'd): Example (http://en.wikipedia.org/wiki/Knuth-Morris-Pratt_algorithm). Database: ABC ABC ABC ABDAB ABCDABCDABDE. Pattern: ABCDABD. Partial-match (failure) table:

index   0   1   2   3   4   5   6   7
char    A   B   C   D   A   B   D   -
pos    -1   0   0   0   0   1   2   0

Knuth-Morris-Pratt (cont'd): [worked trace, five alignments: after each mismatch (marked ^ on the slide) the pattern ABCDABD is shifted right by the amount the table dictates, rather than restarting at the next character, until it matches near the end of the database]

Boyer-Moore: Used in text editors. Demos: http://www-sr.informatik.uni-tuebingen.de/~buehler/BM/BM.html ; http://www.blarg.com/~doyle/pages/bmi.html
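A sketch of KMP in Python (assumption: the standard prefix-function formulation, whose table is indexed slightly differently from the slide's partial-match table but encodes the same "how far can I fall back" information):

def prefix_function(p):
    # pi[i] = length of the longest proper prefix of p[:i+1] that is also a suffix.
    pi = [0] * len(p)
    k = 0
    for i in range(1, len(p)):
        while k > 0 and p[i] != p[k]:
            k = pi[k - 1]
        if p[i] == p[k]:
            k += 1
        pi[i] = k
    return pi

def kmp_search(text, p):
    pi = prefix_function(p)
    hits, k = [], 0
    for i, c in enumerate(text):
        while k > 0 and c != p[k]:
            k = pi[k - 1]        # reuse knowledge: fall back, never re-scan text
        if c == p[k]:
            k += 1
        if k == len(p):
            hits.append(i - len(p) + 1)
            k = pi[k - 1]
    return hits

print(kmp_search("ABC ABCDAB ABCDABCDABDE", "ABCDABD"))   # [15]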
Word similarity: Hamming distance - when words are of the same length. Levenshtein distance - the number of edits (insertions, deletions, replacements): color -> colour (1); survey -> surgery (2); com puter -> computer (?). Longest common subsequence (LCS): lcs(survey, surgery) = surey.

Levenshtein edit distance: Examples: theatre -> theater; Ghaddafi -> Qadafi; computer -> counter. Edit distance (inserts, deletes, substitutions). Edit transcript. Done through dynamic programming.

Recurrence relation: Three dependencies:
D(i,0) = i
D(0,j) = j
D(i,j) = min[ D(i-1,j) + 1, D(i,j-1) + 1, D(i-1,j-1) + t(i,j) ]
Simple edit distance: t(i,j) = 0 iff S1(i) = S2(j), and 1 otherwise.

[figures: worked dynamic-programming tables and tracebacks, from Gusfield 1997]

Weighted edit distance: Used to emphasize the relative cost of different edit operations. Useful in bioinformatics: homology information, BLAST, Blosum (http://eta.embl-heidelberg.de:8000/misc/mat/blosum50.html).

Edit distance web sites: http://www.merriampark.com/ld.htm ; http://odur.let.rug.nl/~kleiweg/lev/
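The recurrence above translates directly into a dynamic program; a minimal sketch with unit costs:

def levenshtein(s1, s2):
    m, n = len(s1), len(s2)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        D[i][0] = i                            # D(i,0) = i
    for j in range(n + 1):
        D[0][j] = j                            # D(0,j) = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            t = 0 if s1[i - 1] == s2[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1,     # deletion
                          D[i][j - 1] + 1,     # insertion
                          D[i - 1][j - 1] + t) # substitution or match
    return D[m][n]

print(levenshtein("color", "colour"))     # 1
print(levenshtein("survey", "surgery"))   # 2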
Clustering

Clustering: Exclusive vs. overlapping clusters. Hierarchical vs. flat clusters. The cluster hypothesis: documents in the same cluster are relevant to the same query.

Representations for document clustering: Typically vector-based. Words: "cat", "dog", etc. Features: document length, author name, etc. Each document is represented as a vector in an n-dimensional space. Similar documents appear nearby in the vector space (distance measures are needed).

Hierarchical clustering: dendrograms. E.g., language similarity. [figure: dendrogram] http://odur.let.rug.nl/~kleiweg/clustering/clustering.html

Another example - a biological taxonomy: Kingdom = animal; Phylum = Chordata; Subphylum = Vertebrata; Class = Osteichthyes; Subclass = Actinopterygii; Order = Salmoniformes; Family = Salmonidae; Genus = Oncorhynchus; Species = Oncorhynchus kisutch (coho salmon).

Clustering using dendrograms:
REPEAT
  Compute pairwise similarities
  Identify closest pair
  Merge pair into single node
UNTIL only one node left
Q: what is the equivalent Venn diagram representation? Example: cluster the following sentences: A B C B A A D C C A D E C D E F C D A E F G F D A A C D A B A

Methods: Single-linkage: one common pair is sufficient; disadvantage: long chains. Complete-linkage: all pairs have to match; disadvantage: too conservative. Average-linkage. Centroid-based (online): look at distances to centroids. Demo: /clair4/class/ir-w05/clustering

k-means: Needed: a small number k of desired clusters. Hard vs. soft decisions. Example: Weka.

k-means algorithm (see the Python sketch after this section):
1  initialize cluster centroids to arbitrary vectors
2  while further improvement is possible do
3    for each document d do
4      find the cluster c whose centroid is closest to d
5      assign d to cluster c
6    end for
7    for each cluster c do
8      recompute the centroid of cluster c based on its documents
9    end for
10 end while

Example: Cluster the following vectors into two groups: A = <1,6>, B = <2,2>, C = <4,0>, D = <3,3>, E = <2,5>, F = <2,1>.

Complexity: O(kn) per iteration, because at each step n documents have to be compared to k centroids.

Weka: A general environment for machine learning (e.g., for classification and clustering). Book by Witten and Frank. www.cs.waikato.ac.nz/ml/weka

Demos: http://vivisimo.com/ ; http://www.elet.polimi.it/upload/matteucc/Clustering/tutorial_html/AppletKM.html ; http://cgm.cs.mcgill.ca/~godfried/student_projects/bonnef_k-means ; http://www.cs.washington.edu/research/imagedatabase/demo/kmcluster ; http://www.cc.gatech.edu/~dellaert/html/software.html ; http://www-2.cs.cmu.edu/~awm/tutorials/kmeans11.pdf ; http://www.ece.neu.edu/groups/rpl/projects/kmeans/

Human clustering: Humans show significant disagreement in the number of clusters, the overlap of clusters, and the composition of clusters (Macskassy et al. 1998).
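A sketch of the k-means pseudocode above, run on the six exercise vectors (assumptions: Euclidean distance, a fixed iteration count instead of a convergence test, and no handling of empty clusters):

import random

def kmeans(points, k, iters=20):
    centroids = random.sample(points, k)   # step 1: arbitrary initial centroids
    clusters = []
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                   # steps 3-6: assign to nearest centroid
            c = min(range(k),
                    key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])))
            clusters[c].append(p)
        for i, cl in enumerate(clusters):  # steps 7-9: recompute centroids
            if cl:
                centroids[i] = tuple(sum(xs) / len(cl) for xs in zip(*cl))
    return centroids, clusters

pts = [(1, 6), (2, 2), (4, 0), (3, 3), (2, 5), (2, 1)]
centroids, clusters = kmeans(pts, 2)
print(centroids)
print(clusters)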
Lexical networks

Lexical networks: Used to represent relationships between words. Example: WordNet - created by George Miller's team at Princeton. Based on synsets (sets of synonymous, interchangeable words) and lexical matrices.

Lexical matrix: [figure: lexical matrix mapping word forms to word meanings]

Synsets: Disambiguation: {board, plank} vs. {board, committee}. Synonym substitution; weak substitution; synonyms must be of the same part of speech.

$ ./wn board -hypen
Synonyms/Hypernyms (Ordered by Frequency) of noun board
9 senses of board
Sense 1: board => committee, commission => administrative unit => unit, social unit => organization, organisation => social group => group, grouping
Sense 2: board => sheet, flat solid => artifact, artefact => object, physical object => entity, something
Sense 3: board, plank => lumber, timber => building material => artifact, artefact => object, physical object => entity, something
Sense 4: display panel, display board, board => display => electronic device => device => instrumentality, instrumentation => artifact, artefact => object, physical object => entity, something
Sense 5: board, gameboard => surface => artifact, artefact => object, physical object => entity, something
Sense 6: board, table => fare => food, nutrient => substance, matter => object, physical object => entity, something
Sense 7: control panel, instrument panel, control board, board, panel => electrical device => device => instrumentality, instrumentation => artifact, artefact => object, physical object => entity, something
Sense 8: circuit board, circuit card, board, card => printed circuit => computer circuit => circuit, electrical circuit, electric circuit => electrical device => device => instrumentality, instrumentation => artifact, artefact => object, physical object => entity, something
Sense 9: dining table, board => table => furniture, piece of furniture, article of furniture => furnishings => instrumentality, instrumentation => artifact, artefact => object, physical object => entity, something

Antonymy: "x" vs. "not-x"; "rich" vs. "poor"? {rise, ascend} vs. {fall, descend}.

Other relations: Meronymy: X is a meronym of Y when native speakers of English accept sentences similar to "X is a part of Y", "X is a member of Y". Hyponymy: {tree} is a hyponym of {plant}. Hierarchical structure based on hyponymy (and hypernymy).

Other features of WordNet: Index of familiarity. Polysemy.

Familiarity and polysemy:
board used as a noun is familiar (polysemy count = 9)
bird used as a noun is common (polysemy count = 5)
cat used as a noun is common (polysemy count = 7)
house used as a noun is familiar (polysemy count = 11)
information used as a noun is common (polysemy count = 5)
retrieval used as a noun is uncommon (polysemy count = 3)
serendipity used as a noun is very rare (polysemy count = 1)

Compound nouns: advisory board, appeals board, backboard, backgammon board, baseboard, basketball backboard, big board, billboard, binder's board, binder board, blackboard, board game, board measure, board meeting, board member, board of appeals, board of directors, board of education, board of regents, board of trustees.

Overview of senses:
1. board -- (a committee having supervisory powers; "the board has seven members")
2. board -- (a flat piece of material designed for a special purpose; "he nailed boards across the windows")
3. board, plank -- (a stout length of sawn timber; made in a wide variety of sizes and used for many purposes)
4. display panel, display board, board -- (a board on which information can be displayed to public view)
5. board, gameboard -- (a flat portable surface (usually rectangular) designed for board games; "he got out the board and set up the pieces")
6. board, table -- (food or meals in general; "she sets a fine table"; "room and board")
7. control panel, instrument panel, control board, board, panel -- (an insulated panel containing switches and dials and meters for controlling electrical devices; "he checked the instrument panel"; "suddenly the board lit up like a Christmas tree")
8. circuit board, circuit card, board, card -- (a printed circuit that can be inserted into expansion slots in a computer to increase the computer's capabilities)
9. dining table, board -- (a table at which meals are served; "he helped her clear the dining table"; "a feast was spread upon the board")

Top-level concepts: {act, action, activity}; {animal, fauna}; {artifact}; {attribute, property}; {body, corpus}; {cognition, knowledge}; {communication}; {event, happening}; {feeling, emotion}; {food}; {group, collection}; {location, place}; {motive}; {natural object}; {natural phenomenon}; {person, human being}; {plant, flora}; {possession}; {process}; {quantity, amount}; {relation}; {shape}; {state, condition}; {substance}; {time}.

WordNet and DistSim:
wn reason -hypen   (hypernyms)
wn reason -synsn   (synsets)
wn reason -simsn   (synonyms)
wn reason -over    (overview of senses)
wn reason -famln   (familiarity/polysemy)
wn reason -grepn   (compound nouns)
/data2/tools/relatedwords/relate reason
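For readers without the original wn binary, the same lookups can be approximated with NLTK's WordNet interface (assumption: NLTK and its WordNet corpus are installed; this is the NLTK API, not the command-line tool shown on the slide):

from nltk.corpus import wordnet as wn

for synset in wn.synsets("board", pos=wn.NOUN):
    print(synset.name(), "-", synset.definition())
    # One step up the hypernym hierarchy, roughly `wn board -hypen`.
    for hyper in synset.hypernyms():
        print("   =>", hyper.name())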
0.6337 -0.0602 0.0290 0.5324 -0.0008 0.0003 -0.0008 Rank-4 approximation:  Rank-4 approximation u*s4 -1.1056 -0.1203 0.0207 -0.5558 0 0 0 -0.4155 0.3748 0.5606 0.1573 0 0 0 -0.5576 -0.5719 -0.1226 0.3210 0 0 0 -0.1786 0.1801 -0.1765 -0.0587 0 0 0 -0.4155 0.3748 0.5606 0.1573 0 0 0 -0.2984 0.4778 -0.6015 0.1018 0 0 0 -0.5576 -0.5719 -0.1226 0.3210 0 0 0 -0.3348 0.4241 0.1149 0.2255 0 0 0 -0.2984 0.4778 -0.6015 0.1018 0 0 0 Rank-4 approximation:  Rank-4 approximation s4*v' -0.2674 -0.7087 -0.4266 -0.6292 -0.7451 -0.4996 -0.7451 0.5333 0.2869 0.5351 0.5092 -0.3863 -0.6384 -0.3863 -0.7150 0.5544 0.6001 -0.4686 -0.0605 -0.1457 -0.0605 0.1808 -0.1749 0.3918 -0.1043 -0.2085 0.5700 -0.2085 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Rank-2 approximation:  Rank-2 approximation s2 = 1.5849 0 0 0 0 0 0 0 1.2721 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Rank-2 approximation:  Rank-2 approximation u*s2*v' 0.1361 0.4673 0.2470 0.3908 0.5563 0.4089 0.5563 0.2272 0.2703 0.2695 0.3150 0.0815 -0.0571 0.0815 -0.1457 0.1204 -0.0904 -0.0075 0.4358 0.4628 0.4358 0.1057 0.1205 0.1239 0.1430 0.0293 -0.0341 0.0293 0.2272 0.2703 0.2695 0.3150 0.0815 -0.0571 0.0815 0.2507 0.2412 0.2813 0.3097 -0.0048 -0.1457 -0.0048 -0.1457 0.1204 -0.0904 -0.0075 0.4358 0.4628 0.4358 0.2343 0.2454 0.2685 0.3027 0.0286 -0.1073 0.0286 0.2507 0.2412 0.2813 0.3097 -0.0048 -0.1457 -0.0048 Rank-2 approximation:  Rank-2 approximation u*s2 -1.1056 -0.1203 0 0 0 0 0 -0.4155 0.3748 0 0 0 0 0 -0.5576 -0.5719 0 0 0 0 0 -0.1786 0.1801 0 0 0 0 0 -0.4155 0.3748 0 0 0 0 0 -0.2984 0.4778 0 0 0 0 0 -0.5576 -0.5719 0 0 0 0 0 -0.3348 0.4241 0 0 0 0 0 -0.2984 0.4778 0 0 0 0 0 Rank-2 approximation:  Rank-2 approximation s2*v' -0.2674 -0.7087 -0.4266 -0.6292 -0.7451 -0.4996 -0.7451 0.5333 0.2869 0.5351 0.5092 -0.3863 -0.6384 -0.3863 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Documents to concepts and terms to concepts:  Documents to concepts and terms to concepts A(:,1)'*u*s -0.4238 0.6784 -0.8541 0.1446 -0.0000 -0.1853 0.0095 >> A(:,1)'*u*s4 -0.4238 0.6784 -0.8541 0.1446 0 0 0 >> A(:,1)'*u*s2 -0.4238 0.6784 0 0 0 0 0 >> A(:,2)'*u*s2 -1.1233 0.3650 0 0 0 0 0 >> A(:,3)'*u*s2 -0.6762 0.6807 0 0 0 0 0 Documents to concepts and terms to concepts:  Documents to concepts and terms to concepts >> A(:,4)'*u*s2 -0.9972 0.6478 0 0 0 0 0 >> A(:,5)'*u*s2 -1.1809 -0.4914 0 0 0 0 0 >> A(:,6)'*u*s2 -0.7918 -0.8121 0 0 0 0 0 >> A(:,7)'*u*s2 -1.1809 -0.4914 0 0 0 0 0 Cont’d:  Cont’d >> (s2*v'*A(1,:)')' -1.7523 -0.1530 0 0 0 0 0 0 0 >> (s2*v'*A(2,:)')' -0.6585 0.4768 0 0 0 0 0 0 0 >> (s2*v'*A(3,:)')' -0.8838 -0.7275 0 0 0 0 0 0 0 >> (s2*v'*A(4,:)')' -0.2831 0.2291 0 0 0 0 0 0 0 >> (s2*v'*A(5,:)')' -0.6585 0.4768 0 0 0 0 0 0 0 Cont’d:  Cont’d >> (s2*v'*A(6,:)')' -0.4730 0.6078 0 0 0 0 0 0 0 >> (s2*v'*A(7,:)')' -0.8838 -0.7275 0 0 0 0 0 0 0 >> (s2*v'*A(8,:)')' -0.5306 0.5395 0 0 0 0 0 0 0 >> (s2*v'*A(9,:)')‘ -0.4730 0.6078 0 0 0 0 0 0 0 Properties:  Properties A*A' 1.5471 0.3364 0.5041 0.2025 0.3364 0.2025 0.5041 0.2025 0.2025 0.3364 0.6728 0 0 0.6728 0 0 0.3364 0 0.5041 0 1.0082 0 0 0 0.5041 0 0 0.2025 0 0 0.2025 0 0.2025 0 0.2025 0.2025 0.3364 0.6728 0 0 0.6728 0 0 0.3364 0 0.2025 0 0 0.2025 0 0.7066 0 0.2025 0.7066 0.5041 0 0.5041 0 0 0 1.0082 0 0 0.2025 0.3364 0 0.2025 0.3364 0.2025 0 0.5389 0.2025 0.2025 0 0 0.2025 0 0.7066 0 0.2025 0.7066 A'*A 1.0082 0 0 0.6390 0 0 0 0 1.0092 0.6728 0.2610 0.4118 0 0.4118 0 0.6728 1.0092 
0.2610 0 0 0 0.6390 0.2610 0.2610 1.0125 0.3195 0 0.3195 0 0.4118 0 0.3195 1.0082 0.5041 0.5041 0 0 0 0 0.5041 1.0082 0.5041 0 0.4118 0 0.3195 0.5041 0.5041 1.0082 A is a document to term matrix. What is A*A’, what is A’*A? Latent semantic indexing (LSI):  Latent semantic indexing (LSI) Dimensionality reduction = identification of hidden (latent) concepts Query matching in latent space Useful pointers:  Useful pointers http://lsa.colorado.edu http://lsi.research.telcordia.com/ http://www.cs.utk.edu/~lsi/ http://javelina.cet.middlebury.edu/lsa/out/lsa_definition.htm http://citeseer.nj.nec.com/deerwester90indexing.html http://www.pcug.org.au/~jdowling/ Models of the Web:  Models of the Web Size:  Size The Web is the largest repository of data and it grows exponentially. 320 Million Web pages [Lawrence & Giles 1998] 800 Million Web pages, 15 TB [Lawrence & Giles 1999] 8 Billion Web pages indexed [Google 2005] Amount of data roughly 200 TB [Lyman et al. 2003] Bow-tie model of the Web:  Bow-tie model of the Web SCC 56 M OUT 44 M IN 44 M Bröder & al. WWW 2000, Dill & al. VLDB 2001 DISC 17 M TEND 44M 24% of pages reachable from a given page Power laws:  Power laws Web site size (Huberman and Adamic 1999) Power-law connectivity (Barabasi and Albert 1999): exponents 2.45 for out-degree and 2.1 for the in-degree Others: call graphs among telephone carriers, citation networks (Redner 1998), e.g., Erdos, collaboration graph of actors, metabolic pathways (Jeong et al. 2000), protein networks (Maslov and Sneppen 2002). All values of gamma are around 2-3. Small-world networks:  Small-world networks Diameter = average length of the shortest path between all pairs of nodes. Example… Milgram experiment (1967) Kansas/Omaha --> Boston (42/160 letters) diameter = 6 Albert et al. 1999 – average distance between two verstices is d = 0.35 + 2.06 log10n. For n = 109, d=18.89. Six degrees of separation Clustering coefficient:  Clustering coefficient Cliquishness (c): between the kv (kv – 1)/2 pairs of neighbors. Examples: Models of the Web:  Models of the Web Erdös/Rényi 59, 60 Barabási/Albert 99 Watts/Strogatz 98 Kleinberg 98 Menczer 02 Radev 03 Evolving networks: fundamental object of statistical physics, social networks, mathematical biology, and epidemiology Self-triggerability across hyperlinks:  Self-triggerability across hyperlinks Document closures for information retrieval Self-triggerability [Mosteller&Wallace 84]  Poisson distribution Two-Poisson [Bookstein&Swanson 74] Negative Binomial, K-mixture [Church&Gale 95] Triggerability across hyperlinks? Evolving Word-based Web:  Evolving Word-based Web Observations: Links are made based on topics Topics are expressed with words Words are distributed very unevenly (Zipf, Benford, self-triggerability laws) Model Pick n Generate n lengths according to a power-law distribution Generate n documents using a trigram model Model (cont’d) Pick words in decreasing order of r. Generate hyperlinks with random directionality Outcome Generates power-law degree distributions Generates topical communities Natural variation of PageRank: LexRank Social network analysis for IR:  Social network analysis for IR Social networks:  Social networks Induced by a relation Symmetric or not Examples: Friendship networks Board membership Citations Power grid of the US WWW Slide190:  Krebs 2004 Prestige and centrality:  Prestige and centrality Degree centrality: how many neighbors each node has. 
Closeness centrality: how close a node is to all of the other nodes Betweenness centrality: based on the role that a node plays by virtue of being on the path between two other nodes Eigenvector centrality: the paths in the random walk are weighted by the centrality of the nodes that the path connects. Prestige = same as centrality but for directed graphs. Graph-based representations:  Graph-based representations Square connectivity (incidence) matrix Graph G (V,E) Markov chains:  Markov chains A homogeneous Markov chain is defined by an initial distribution x and a Markov kernel E. Path = sequence (x0, x1, …, xn). Xi = xi-1*E The probability of a path can be computed as a product of probabilities for each step i. Random walk = find Xj given x0, E, and j. Stationary solutions:  Stationary solutions The fundamental Ergodic Theorem for Markov chains [Grimmett and Stirzaker 1989] says that the Markov chain with kernel E has a stationary distribution p under three conditions: E is stochastic E is irreducible E is aperiodic To make these conditions true: All rows of E add up to 1 (and no value is negative) Make sure that E is strongly connected Make sure that E is not bipartite Example: PageRank [Brin and Page 1998]: use “teleportation” Slide195:  Example This graph E has a second graph E’ (not drawn) superimposed on it: E’ is the uniform transition graph. Eigenvectors:  Eigenvectors An eigenvector is an implicit “direction” for a matrix. Mv = λv, where v is non-zero, though λ can be any complex number in principle. The largest eigenvalue of a stochastic matrix E is real: λ1 = 1. For λ1, the left (principal) eigenvector is p, the right eigenvector = 1 In other words, ETp = p. Computing the stationary distribution:  Computing the stationary distribution function PowerStatDist (E): begin p(0) = u; (or p(0) = [1,0,…0]) i=1; repeat p(i) = ETp(i-1) L = ||p(i)-p(i-1)||1; i = i + 1; until L <  return p(i) end Solution for the stationary distribution Slide198:  Example How Google works:  How Google works Crawling Anchor text Fast query processing Pagerank More about PageRank:  More about PageRank Named after Larry Page, founder of Google (and UM alum) Reading “The anatomy of a large-scale hypertextual web search engine” by Brin and Page. Independent of query (although more recent work by Haveliwala (WWW 2002) has also identified topic-based PageRank. HITS:  HITS Query-dependent model (Kleinberg 97) Hubs and authorities (e.g., cars, Honda) Algorithm obtain root set using input query expanded the root set by radius one run iterations on the hub and authority scores together report top-ranking authorities and hubs The link-content hypothesis:  The link-content hypothesis Topical locality: page is similar () to the page that points to it (). Davison (TF*IDF, 100K pages) 0.31 same domain 0.23 linked pages 0.19 sibling 0.02 random Menczer (373K pages, non-linear least squares fit) Chakrabarti (focused crawling) - prob. of losing the topic Van Rijsbergen 1979, Chakrabarti & al. WWW 1999, Davison SIGIR 2000, Menczer 2001 1=1.8, 2=0.6, Measuring the Web:  Measuring the Web Bharat and Broder 1998: 
