introToIR

Information about introToIR
Education

Published on January 17, 2008

Author: Marco1

Source: authorstream.com

Introduction to Information Retrieval (IR):
  Mark Craven
  craven@cs.wisc.edu
  craven@biostat.wisc.edu
  5730 Medical Sciences Center

Documents and Corpora:
  document: a passage of free text or hypertext
    Usenet posting
    Web page
    newswire story
    MEDLINE abstract
    journal article
  corpus (pl. corpora): a collection of documents
    MEDLINE
    Reuters stories from 1999
    the Web

The Ad-Hoc Retrieval Problem:
  given:
    a document collection (corpus)
    an arbitrary query
  do:
    return a list of relevant documents
  this is the problem addressed by Web search engines

Typical IR System:
  [diagram; labeled component: inverted index]

The Index and Inverse Index:
  index: a relation mapping each document to the set of keywords it is about
  inverse index
  where do these come from?

Inverted Index:
  [diagram; labels: index, corpus]

A Simple Boolean Query:
  to answer the query “hungry AND zebra”, get the intersection of the documents pointed to by “hungry” and the documents pointed to by “zebra”

Other Things to Consider:
  How can we search on phrases?
  Should we treat these queries differently?
    “a hungry zebra”
    “the hungry zebra”
    “hungry as a zebra”
  If we query on “laugh zebra”, should we return documents containing the following?
    “laughing zebra”
    “laughable zebra”
  Boolean queries are too coarse: they return too many or too few relevant documents.

Handling Phrases:
  store position information in the inverted index
  to answer the query “hungry zebra”, look for documents having “hungry” at position i and “zebra” at position i + 1

Handling Phrases:
  but this is a primitive notion of phrase
  we might want “zebras that are hungry” to be considered a match to the phrase “hungry zebra”
  this requires sentence analysis: determining parts of speech for words, etc.

Stop Words:
  Should we treat these queries differently?
    “a hungry zebra”
    “the hungry zebra”
    “hungry as a zebra”
  Some systems employ a list of stop words (a.k.a. function words) that are probably not informative for most searches.
    a, an, the, that, this, of, by, with, to …
  stop words in a query are ignored
  but might be handled differently in phrases

Stop Words:
  a able about above according accordingly across actually after afterwards again against all allow allows almost alone along already also although always am among amongst an and another any anybody anyhow anyone anything anyway anyways anywhere apart appear appreciate appropriate are around as aside ask asking associated at available away awfully b be became because become becomes becoming been before beforehand behind being believe below beside besides best better between beyond both brief but by ...

A Special Purpose Stop List:
  Bos taurus, Botrytis cinerea, C. elegans, Chicken, Goat, Gorilla, Guinea pig, Hamster, Human, Mouse, Pig, Rat, Spinach
  unknown gene cDNA DNA clone BAC PAC cosmid clone genomic sequence potentially degraded

Stemming:
  If we query on “laugh zebra”, should we return documents containing the following?
    “laughing zebra”
    “laughable zebra”
  Some systems perform stemming on words: truncating related words to a common stem.
    laugh     ->  laugh-
    laughs    ->  laugh-
    laughing  ->  laugh-
    laughed   ->  laugh-

Stemming:
  the Lovins stemmer
    260 suffix patterns
    iterative longest match procedure
  (.*)SSES           ->  $1SS
  (.*[AEIOU].*)ED    ->  $1
  the Porter stemmer
    about 60 patterns grouped into sets
    apply patterns in each set before moving to the next

Stemming:
  May be helpful
    reduces vocabulary by 10-50%
    may increase recall
  May not be helpful
    for some queries, the sense of a word is important
    stemming algorithms are heuristic; they may conflate semantically different words (e.g. gall and gallery)
  As with stop words, we might want to handle stemming differently in phrases

The Vector Space Model:
  Boolean queries are too coarse: they return too many or too few relevant documents.
  Most IR systems are based on the vector space model

The Vector Space Model:
  documents and queries are represented by vectors in a high-dimensional space
  each dimension corresponds to a word in the vocabulary
  the most relevant documents are those whose vectors are closest to the query vector

Vector Similarity:
  one way to determine vector similarity is the cosine measure, written here for a query vector q and a document vector d:
    \cos(\vec{q}, \vec{d}) = \frac{\vec{q} \cdot \vec{d}}{\lVert \vec{q} \rVert \, \lVert \vec{d} \rVert}
  if the vectors are normalized, we can simply take their dot product

Determining Word Weights:
  lots of heuristics
  one well-established one is TFIDF (term frequency, inverse document frequency) weighting
    the numerator includes the number of occurrences of the word in the document
    the denominator includes the total number of occurrences of the word in the corpus

TFIDF: One Form:
  [equation not preserved in this transcript]
  (N = total number of words in the corpus)

The Probability Ranking Principle:
  most IR systems are based on the premise that ranking documents in order of decreasing probability of relevance is the right thing to do
  assumes documents are independent
    does the wrong thing with duplicates
    doesn’t promote diversity in returned documents
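
The inverted index, the Boolean query “hungry AND zebra”, and the position-based phrase lookup from the slides can be sketched together in a few lines of Python. This is only an illustration under stated assumptions: the function names, the whitespace tokenizer, and the three-document toy corpus are all hypothetical and not taken from the slides.

```python
from collections import defaultdict

def build_positional_index(corpus):
    """Map each word to {doc_id: [positions]} for a dict of doc_id -> text."""
    index = defaultdict(dict)
    for doc_id, text in corpus.items():
        # Assumption: lowercase whitespace tokenization is good enough here.
        for position, word in enumerate(text.lower().split()):
            index[word].setdefault(doc_id, []).append(position)
    return index

def boolean_and(index, first, second):
    """Documents containing both words: intersect the two posting sets."""
    return set(index.get(first, {})) & set(index.get(second, {}))

def phrase_query(index, first, second):
    """Documents with `first` at position i and `second` at position i + 1."""
    matches = set()
    for doc_id in boolean_and(index, first, second):
        second_positions = set(index[second][doc_id])
        if any(i + 1 in second_positions for i in index[first][doc_id]):
            matches.add(doc_id)
    return matches

# Hypothetical toy corpus, for illustration only.
corpus = {
    1: "the hungry zebra ate grass",
    2: "a zebra is rarely hungry",
    3: "the hungry lion slept",
}
index = build_positional_index(corpus)

print(boolean_and(index, "hungry", "zebra"))   # documents 1 and 2
print(phrase_query(index, "hungry", "zebra"))  # document 1 only
```

Document 2 satisfies the Boolean query but not the phrase query, because “zebra” does not directly follow “hungry” there. That is the distinction the position information is stored for.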
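
Stop-word removal and suffix stripping can be illustrated with a deliberately tiny sketch. It uses the short stop list shown in the slides and the two suffix patterns shown there (lowercased); the third, `-ing`, rule is an assumption added only so that the “laughing” example works. This is not the Lovins or Porter stemmer, both of which use many more patterns and conditions.

```python
import re

# The short stop list from the slides.
STOP_WORDS = {"a", "an", "the", "that", "this", "of", "by", "with", "to"}

# (pattern, replacement) pairs, tried in order; the first match wins.
SUFFIX_RULES = [
    (re.compile(r"^(.*)sses$"), r"\1ss"),          # from the slides
    (re.compile(r"^(.*[aeiou].*)ed$"), r"\1"),     # from the slides
    (re.compile(r"^(.*[aeiou].*)ing$"), r"\1"),    # assumption, for illustration
]

def stem(word):
    """Strip one suffix using the first rule that matches, if any."""
    for pattern, replacement in SUFFIX_RULES:
        if pattern.match(word):
            return pattern.sub(replacement, word)
    return word

def normalize(text):
    """Lowercase, drop stop words, and stem the remaining words."""
    return [stem(word) for word in text.lower().split()
            if word not in STOP_WORDS]

print(normalize("the laughing zebra"))  # ['laugh', 'zebra']
print(normalize("a hungry zebra"))      # ['hungry', 'zebra']
```

With this normalization, “a hungry zebra” and “the hungry zebra” reduce to the same query, while “hungry as a zebra” keeps “as” because it is not on the short list. The vowel condition in the `-ed` rule keeps a word such as “bed” from being truncated to “b”.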
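
Ranking by cosine similarity over weighted word vectors can be sketched as follows. The exact TFIDF formula on the slide is not preserved, so the weight used here is an assumption built from the quantities the slides name: the word's count in the document times log(N / c), where c is the word's total count in the corpus and N is the total number of words in the corpus. Many systems use document frequency instead. All function names and the toy documents are hypothetical.

```python
import math
from collections import Counter

def corpus_weights(docs):
    """log(N / c) per word: N = total words in corpus, c = the word's corpus count."""
    counts = Counter(word for words in docs.values() for word in words)
    total = sum(counts.values())
    return {word: math.log(total / count) for word, count in counts.items()}

def weight_vector(words, weights):
    """Sparse vector {word: term count * corpus weight}; unknown words are skipped."""
    tf = Counter(words)
    return {word: tf[word] * weights[word] for word in tf if word in weights}

def cosine(u, v):
    """Cosine of the angle between two sparse vectors (0.0 if either is all zeros)."""
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank(query_words, docs):
    """Document ids ordered by decreasing cosine similarity to the query."""
    weights = corpus_weights(docs)
    query = weight_vector(query_words, weights)
    scored = [(cosine(query, weight_vector(words, weights)), doc_id)
              for doc_id, words in docs.items()]
    return [doc_id for _, doc_id in sorted(scored, reverse=True)]

# Hypothetical toy documents, already tokenized.
docs = {
    1: ["hungry", "zebra", "hungry", "zebra"],
    2: ["hungry", "lion", "sleeps"],
    3: ["green", "grass", "grows"],
}
print(rank(["hungry", "zebra"], docs))  # [1, 2, 3]
```

Document 1 points in the same direction as the query, so its cosine is 1 even though it is longer than the query; dividing by the vector lengths is what removes the effect of document length.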
