sample solution

50 %
50 %
Information about sample solution
Education

Published on January 9, 2008

Author: Bianca

Source: authorstream.com

CS/Info 430: Information Retrieval:  CS/Info 430: Information Retrieval Sample Midterm Examination Notes on the Solutions Midterm Examination -- Question 1:  Midterm Examination -- Question 1 1(a) Define the terms inverted file, inverted list, posting. Inverted file: a list of the words in a set of documents and the documents in which they appear. Inverted list: All the entries in an inverted file that apply to a specific word. Posting: Entry in an inverted list 1(b) When implementing an inverted file system, what are the criteria that you would use to judge whether the system is suitable for very large-scale information retrieval? Q1 (continued):  Q1 (continued) Storage Inverted files are big, typically 10% to 100% the size of the collection of documents. Update performance It must be possible, with a reasonable amount of computation, to: (a) Add a large batch of documents (b) Add a single document Retrieval performance Retrieval must be fast enough to satisfy users and not use excessive resource. Q1 (continued):  Q1 (continued) 1(c) You are designing an inverted file system to be used with Boolean queries on a very large collection of textual documents. New documents are being continually added to the collection.   (i) What file structure(s) would you use?   (ii) How well does your design satisfy the criteria listed in Part (b)? Q1 (continued):  Q1 (continued) Term Pointer to list of postings ant bee cat dog elk fox gnu hog Lists of postings Separate inverted index from lists of postings inverted index postings file Question 1 (continued):  Question 1 (continued) (a) Postings file may be stored sequentially as a linked list. (b) Index file is best stored as a tree. Binary trees provide fast searching but have problems with updating. B-trees are better, with B+-trees as the best. Note: Other answers are possible to this part of the question. Question 1 (continued):  Question 1 (continued) 1(c)(ii) How well does your design satisfy the criteria listed in Part (b)? • Sequential list for each term is efficient for storage and for processing Boolean queries. The disadvantage is a slow update time for long inverted lists. • B-trees combine fast retrieval with moderately efficient updating. • Bottom-up updating is usual fast, but may require recursive tree climbing to the root. • The main weakness is poor storage utilization; typically buckets are only 0.69 full. Midterm Examination -- Question 2:  Midterm Examination -- Question 2 2(b) You have the collection of documents that contain the following index terms: D1: alpha bravo charlie delta echo foxtrot golf D2: golf golf golf delta alpha D3: bravo charlie bravo echo foxtrot bravo D4: foxtrot alpha alpha golf golf delta (i) Use an incidence matrix of terms to calculate a similarity matrix for these four documents, with no term weighting. Incidence array:  Incidence array D1: alpha bravo charlie delta echo foxtrot golf D2: golf golf golf delta alpha D3: bravo charlie bravo echo foxtrot bravo D4: foxtrot alpha alpha golf golf delta alpha bravo charlie delta echo foxtrot golf D1 1 1 1 1 1 1 1 D2 1 1 1 D3 1 1 1 1 D4 1 1 1 1 7 3 4 4 Document similarity matrix:  Document similarity matrix D1 D2 D3 D4 D1 0.65 0.76 0.76 D2 0.65 0.00 0.87 D3 0.76 0.00 0.25 D4 0.76 0.87 0.25 Question 2 (continued):  Question 2 (continued) 2b(ii) Use a frequency matrix of terms to calculate a similarity matrix for these documents, with weights proportional to the term frequency and inversely proportional to the document frequency. Frequency Array :  Frequency Array D1: alpha bravo charlie delta echo foxtrot golf D2: golf golf golf delta alpha D3: bravo charlie bravo echo foxtrot bravo D4: foxtrot alpha alpha golf golf delta alpha bravo charlie delta echo foxtrot golf D1 1 1 1 1 1 1 1 D2 1 1 3 D3 3 1 1 1 D4 2 1 1 2 Inverse Document Frequency Weighting:  Inverse Document Frequency Weighting Principle: (a) Weight is proportional to the number of times that the term appears in the document (b) Weight is inversely proportional to the number of documents that contain the term: wik = fik / dk Where: wik is the weight given to term k in document i fik is the frequency with which term k appears in document i dk is the number of documents that contain term k Frequency Array with Weights:  Frequency Array with Weights D1: alpha bravo charlie delta echo foxtrot golf D2: golf golf golf delta alpha D3: bravo charlie bravo echo foxtrot bravo D4: foxtrot alpha alpha golf golf delta alpha bravo charlie delta echo foxtrot golf D1 0.33 0.50 0.50 0.33 0.50 0.33 0.33 D2 0.33 0.33 1.00 D3 1.50 0.50 0.50 0.33 D4 0.67 0.33 0.33 0.67 dk 3 2 2 3 2 3 3 length 1.09 1.11 1.69 1.05 Lengths have been corrected. Document similarity matrix:  Document similarity matrix D1 D2 D3 D4 D1 0.46 0.74 0.58 D2 0.46 0.00 0.86 D3 0.74 0.00 0.06 D4 0.56 0.86 0.06 Question 3:  Question 3 3(a) Define the terms recall and precision. 3(b) Q is a query. D is a collection of 1,000,000 documents. When the query Q is run, a set of 200 documents is returned. (i) How in a practical experiment would you calculate the precision? Have an expert examine each of the 200 documents and decide whether it is relevant. Precision is number judged relevant divided by 200. (ii) How in a practical experiment would you calculate the recall? It is not feasible to examine 1,000,000 records. Therefore sampling must be used ... Question 3 (continued):  Question 3 (continued) 3(c) Suppose that, by some means, it is known that 100 of the documents in D are relevant to Q. Of the 200 documents returned by the search, 50 are relevant. (i) What is the precision? 50/200 = 0.25 (ii) What is the recall? 50/100 = 0.5 3(d) Explain in general terms the method used by TREC to estimate the recall. Question 3 (continued):  Question 3 (continued) For each query, a pool of potentially relevant documents is assembled, using the top 100 ranked documents from each participant The human expert who set the query looks at every document in the pool and determines whether it is relevant. Documents outside the pool are not examined. [In TREC-8: 7,100 documents in the pool 1,736 unique documents (eliminating duplicates) 94 judged relevant]

Add a comment

Related presentations

Related pages

Mktg, Inc. Programming & Hosting -- Sample Frame Experts

The sample you need. Call us toll-free: 1 866-519-6343 : Home; Services. ... Sample Frame Management Solutions; White Papers; Articles; Press Releases; Awards;
Read more

dict.cc | sample solution | Wörterbuch Englisch-Deutsch

Übersetzung für sample solution im Englisch-Deutsch-Wörterbuch dict.cc.
Read more

Sample Solutions Europe

Having worked with Sample Solutions Europe since 2011, they have been an invaluable resource for providing nationally representative and B2B sample.
Read more

Sample Solutions LLC

Your value-added supplier for contact manufacturing and packaging services. ISO 13485:2003 & ISO 9001:2008 certified. FDA registered Medical Devices and ...
Read more

Solution - Wikipedia, the free encyclopedia

In chemistry, a solution is a homogeneous mixture composed of only one phase. In such a mixture, a solute is a substance dissolved in another substance ...
Read more

Sample: Work with solutions

This sample code is for Microsoft Dynamics CRM 2015 and Microsoft Dynamics CRM Online 2015 Update. Download the Microsoft Dynamics CRM SDK package. It can ...
Read more

Example of Solution

Chemistry: A solution is a mixture of dissolved materials in a liquid. Common usage: A solution is the answer to a problem, conceptual or literal.
Read more

Home | Sample Answers

Sample Answers is a leading global sampling agency, providing consumer/business, online/offline samples to market research firms. Information is high ...
Read more

Sample Solutions White Paper - Mktg, Inc. Programming ...

Consulting: Sample Source Auditors; Sample Solutions ... The solution: An analysis of ...
Read more

Problem Solution Essay - Samples & Examples

Choosing a Topic for a Problem Solution Essay. The variety of problem solution essay topics is very big, from environment to religion, from technological ...
Read more