07 document file


Published on October 4, 2007

Author: demirel

Source: authorstream.com

Building an Index
By: Ryan Knowles
“building the automatic index is as important as any other component of search engine development”

Building an Index Requires Two Lengthy Steps
  Document analysis and purification
  Token analysis or term extraction

Example
  There once was a searcher named Hanna, (1)
  Who needed some info on manna. (2)
  She put “rye” and “wheat” in her query (3)
  Along with “potato” or “cranbeery,” (4)
  But no mention of “sourdough” or “banana.” (5)
  Instead of rye, cranberry, or wheat, (6)
  The results had more spiritual meat. (7)
  So Hanna was not pleased, (8)
  Nor was her hunger eased, (9)
  ‘Cause she was looking for something to eat. (10)

Document Analysis and Purification
  Why is document analysis needed?
  Hypertext documents are more than just text (photos, tables, charts, audio clips).
  Looks at how each document is organized and what it is composed of.
  Decides what information will be indexed and what will not.

Token Analysis or Term Extraction
  Decides which words should be used to represent the meaning of documents.
  Why would it not be necessary to extract every word?
  Stop words: able, about, after, allow, became, been, before, certainly, clearly, enough…
  Stemming: removing suffixes and sometimes prefixes to reduce a word to its root form.

Example: Terms Extracted
  Doc No.   Terms/Keywords
  1         searcher, Hanna
  2         manna
  3         rye, wheat, query
  4         potato, cranbeery -> cranb
  5         sourdough, banana
  6         rye, cranberry, wheat -> cranb
  7         spiritual, meat
  8         Hanna
  9         hunger
  10        No terms

Manual Indexing
  Why is this no longer practical?
  What are some upsides to this strategy?
  Do you think any companies still do this?
    Yahoo (2002)
    Small companies
    National Library of Medicine
    H.W. Wilson Company
    CINAHL

Automatic Indexing
  The dominant method for processing documents from large web databases.
  Why is this more efficient?
  What are some downsides?
    Spamming
    Intent of searcher

Item Normalization
  Taking the smallest unit of the document and constructing searchable data structures.
  What needs to be done in order to create an inverted file structure.
  Why is this normalization necessary?

Inverted File Structures
  The document file
    Each doc is given a unique ID
    All terms identified
  The dictionary
    Sorted list of all the unique terms
  The inversion list
    Points from term to which docs contain it

Example: Dictionary List
  Term        Docs
  Banana      1
  Cranb       2
  Hanna       2
  Hunger      1
  Manna       1
  Meat        1
  Potato      1
  Query       1
  Rye         2
  Sourdough   1
  Spiritual   1
  Wheat       2

Example: Inversion List
  Term        (Doc No., word position)
  Banana      (5,7)
  Cranb       (4,5); (6,4)
  Hanna       (1,7); (8,2)
  Hunger      (9,4)
  Manna       (2,6)
  Meat        (7,6)
  Potato      (4,3)
  Query       (3,8)
  Rye         (3,3); (6,3)
  Sourdough   (5,5)
  Spiritual   (7,5)
  Wheat       (3,5); (6,6)

Other File Structures
  Signature Files
    Eliminates all non-matches rather than matching the query with the term.

Other Questions
  How frequently should crawlers go through a certain page?
  A question that is still being looked into.
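
Sketch: building the inverted file for the limerick
  The steps in the slides (stop-word removal, stemming, then a dictionary and an inversion list) can be sketched in Python. This is a minimal illustration, not the presenter's method. The STOP_WORDS set and the STEMS lookup table are assumptions made up for this example, and STEMS stands in for a real stemming algorithm. Each limerick line is treated as one document, and positions count every word in the line, which is how the (doc, position) pairs in the slides appear to be numbered.

```python
import re
from collections import defaultdict

# Each limerick line is one document; IDs run from 1 to 10.
DOCS = [
    "There once was a searcher named Hanna,",
    "Who needed some info on manna.",
    'She put "rye" and "wheat" in her query',
    'Along with "potato" or "cranbeery,"',
    'But no mention of "sourdough" or "banana."',
    "Instead of rye, cranberry, or wheat,",
    "The results had more spiritual meat.",
    "So Hanna was not pleased,",
    "Nor was her hunger eased,",
    "'Cause she was looking for something to eat.",
]

# Assumption: a small illustrative stop list, not the one used in the slides.
STOP_WORDS = {
    "a", "along", "and", "but", "for", "had", "her", "in", "instead",
    "more", "no", "nor", "not", "of", "on", "once", "or", "put", "she",
    "so", "some", "the", "there", "to", "was", "who", "with",
}

# Assumption: a lookup table standing in for a real stemmer.
STEMS = {"cranbeery": "cranb", "cranberry": "cranb"}


def tokenize(text):
    """Lowercase the text and return its alphabetic words in order."""
    return re.findall(r"[a-z]+", text.lower())


def build_inverted_file(docs):
    """Return (dictionary, inversion_list) for a list of document strings.

    dictionary maps each term to the number of documents containing it.
    inversion_list maps each term to a list of (doc_id, word_position).
    """
    inversion = defaultdict(list)
    for doc_id, text in enumerate(docs, start=1):
        # Positions count every word, including stop words.
        for position, word in enumerate(tokenize(text), start=1):
            if word in STOP_WORDS:
                continue
            term = STEMS.get(word, word)
            inversion[term].append((doc_id, position))
    dictionary = {
        term: len({doc_id for doc_id, _ in postings})
        for term, postings in sorted(inversion.items())
    }
    return dictionary, dict(inversion)


dictionary, inversion_list = build_inverted_file(DOCS)
for term, doc_count in dictionary.items():
    print(term, doc_count, inversion_list[term])
```

  Because the stop list here is shorter than a hand-picked keyword list, the output contains more terms than the dictionary in the slides (for example "needed" and "results"), but the postings for the shared terms, such as Hanna, Cranb, Rye and Wheat, come out the same.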
