
lecnew05 11


Published on December 3, 2007

Author: Goldie

Source: authorstream.com


Chapter 7: Document Preprocessing (textbook)

Document preprocessing is a procedure that can be divided mainly into five text operations (or transformations):

(1) Lexical analysis of the text, with the objective of treating digits, hyphens, punctuation marks, and the case of letters.

(2) Elimination of stopwords, with the objective of filtering out words with very low discrimination value for retrieval purposes.

(3) Stemming of the remaining words, with the objective of removing affixes (i.e., prefixes and suffixes) and allowing the retrieval of documents containing syntactic variations of query terms (e.g., connect, connecting, connected, etc.).

(4) Selection of index terms, to determine which words/stems (or groups of words) will be used as indexing elements. Usually, the decision on whether a particular word will be used as an index term is related to the syntactic nature of the word; in fact, nouns frequently carry more semantics than adjectives, adverbs, and verbs.

(5) Construction of term categorization structures such as a thesaurus, or extraction of structure directly represented in the text, to allow expansion of the original query with related terms (a usually useful procedure).

Lexical Analysis of the Text

Task: convert a string of characters into a sequence of words.
- The main task is dealing with spaces; e.g., multiple spaces are treated as one space.
- Digits: ignoring numbers is a common approach, but there are important special cases: numbers such as 1999 and 2000 standing for specific years, mixed digit-letter strings such as 510B.C., and 16-digit numbers that might be credit card numbers.
- Hyphens: "state-of-the-art" and "state of the art" should be treated as the same.
- Punctuation marks: remove them, with exceptions such as the period in 510B.C.
- Lower and upper case letters: treated as the same, though there are many exceptions, so the treatment is at best semi-automatic.

Elimination of Stopwords

Words that appear too often are not useful for IR. Words that appear in more than 80% of the documents in the collection are stopwords and are filtered out as potential index words.

Stemming

A stem is the portion of a word that is left after the removal of its affixes (i.e., prefixes or suffixes). Example: "connect" is the stem for {connected, connecting, connection, connections}.
The Porter algorithm uses a suffix list for suffix stripping, e.g., "sses" → "ss".

Index Term Selection

Identification of noun groups: treat nouns that appear close together as a single component, e.g., "computer science".
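As a rough illustration of steps (1)-(3), here is a minimal, self-contained Python sketch. The stopword list and the few suffix-stripping rules are made-up stand-ins, not a real stopword list or the actual Porter algorithm:

```python
import re

# Hypothetical stopword list; the slides suggest treating words that
# appear in more than 80% of the documents as stopwords.
STOPWORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}

def porter_like_stem(word):
    # A tiny taste of suffix stripping in the spirit of the Porter
    # algorithm (e.g., "sses" -> "ss"); the real algorithm applies
    # several ordered phases of such rules.
    if word.endswith("sses"):
        return word[:-2]
    if word.endswith("ing") and len(word) > 5:
        return word[:-3]
    if word.endswith("ed") and len(word) > 4:
        return word[:-2]
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

def preprocess(text):
    # (1) Lexical analysis: lowercase and keep alphabetic tokens
    # (this simple version ignores digits, the common approach).
    tokens = re.findall(r"[a-z]+", text.lower())
    # (2) Stopword elimination, then (3) stemming of what remains.
    return [porter_like_stem(t) for t in tokens if t not in STOPWORDS]

print(preprocess("Connecting computers and connections in 1999"))
# -> ['connect', 'computer', 'connection']
```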
Thesaurus

A thesaurus is a precompiled list of important words in a given domain of knowledge; for each word in the list there is a set of related words. Two topics follow:
- Vocabulary control in an information retrieval system
- Thesaurus construction (manual or automatic)

Vocabulary Control

A standard vocabulary for both indexing and searching, shared by the constructors of the system and the users of the system.

Objectives of Vocabulary Control

- To promote the consistent representation of subject matter by indexers and searchers, thereby avoiding the dispersion of related materials.
- To facilitate the conduct of a comprehensive search on some topic by linking together terms whose meanings are related paradigmatically.

What a Thesaurus Is

Unlike a common dictionary, which lists words with their explanations, a thesaurus may cover the words of a whole language or only those of a specific domain, and it carries a lot of other information, especially the relationships between words: a classification of the words in the language, and word relationships such as synonyms and antonyms.

On-Line Thesauri

http://www.thesaurus.com
http://www.dictionary.com/
http://www.cogsci.princeton.edu/~wn/

Dictionary vs. Thesaurus

A dictionary entry gives definitions, e.g.: in·for·ma·tion (ĭn′fər-mā′shən) n. 1. Knowledge derived from study, experience, or instruction. 2. Knowledge of specific events or situations that has been gathered or received by communication; intelligence or news. See Synonyms at knowledge. ... A thesaurus entry (check "information" at http://www.thesaurus.com) instead lists related words: [Nouns] information, enlightenment, acquaintance ...; [Verbs] tell; inform, inform of; acquaint, acquaint with; impart ...; [Adjectives] informed; reported; published.

Use of a Thesaurus

- To control the terms used in indexing: for a specific domain, only the terms in the thesaurus are used as indexing terms.
- To assist users in forming proper queries, through the help information contained in the thesaurus.

Construction of a Thesaurus

A thesaurus can be constructed either manually or automatically; stemming can be used to reduce its size.

WordNet: Manually Constructed

WordNet® is an online lexical reference system whose design is inspired by current psycholinguistic theories of human lexical memory. English nouns, verbs, adjectives, and adverbs are organized into synonym sets, each representing one underlying lexical concept. Different relations link the synonym sets.

Relations in WordNet

[Figure: the slides show a table of WordNet relations, not preserved in this transcription.]

Automatic Thesaurus Construction

A variety of methods can be used to construct a thesaurus automatically; term similarity, for example, can be used.

Complete Term Relation Method

The term-document relationship can be calculated using a variety of methods, such as tf-idf; term similarity can then be calculated from the term-document relationship. [The slides work an example: pairwise term similarities are computed, a threshold of 10 is set, and the terms fall into the groups {T1, T3, T4, T6}, {T1, T5}, {T2, T4, T6}, {T2, T6, T8}, {T7}.] A small sketch of this method appears at the end of this section.

Indexing

Indexing is an arrangement of data (a data structure) that permits fast searching. Which list is easier to search?

  sow fox pig eel yak hen ant cat dog hog
  ant cat dog eel fox hen hog pig sow yak

Creating Inverted Files

Starting from the original documents, map documents to document IDs and extract words, mapping them to word IDs; the inverted file then records, for each word, the documents in which it occurs:

  W1: d1, d2, d3
  W2: d2, d4, d7, d9
  ...
  Wn: di, ..., dn

To create an inverted file: map the file names to file IDs; mark the stopwords (shown in red in the slides) and remove them; stem the remaining words (optionally lowercasing them and deleting numbers); and finally sort the word list. [The slides walk through this on a small example collection, showing the inverted file first unsorted and then sorted.]

Searching an Inverted File

Binary search is used on a small scale. Larger systems create a thesaurus and combine techniques such as hashing, B+-trees, and pointers to the addresses in the indexed file.
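Here are two small Python sketches of the techniques above. The first illustrates the Complete Term Relation Method; the weight matrix is invented for illustration, and the inner-product similarity is only one plausible choice (the slides merely say the similarity is computed, "for example", from the term-document relationship):

```python
import itertools

# Hypothetical term-document weights (rows: terms, columns: documents
# d1..d4); in practice these would come from a scheme such as tf-idf.
weights = {
    "T1": [0, 4, 0, 1],
    "T2": [3, 1, 0, 2],
    "T3": [3, 0, 0, 5],
}

def sim(a, b):
    # Term similarity as the inner product of the two term rows.
    return sum(x * y for x, y in zip(weights[a], weights[b]))

THRESHOLD = 10  # the slides set the threshold to 10
related = [(a, b) for a, b in itertools.combinations(weights, 2)
           if sim(a, b) >= THRESHOLD]
print(related)  # [('T2', 'T3')] with these made-up weights
```

The second is a minimal inverted-file builder; the toy stopword list is hypothetical, and a real system would also stem each term before indexing:

```python
import re

STOP = {"a", "an", "and", "is", "the"}  # toy stopword list

def build_inverted_index(docs):
    """docs: {doc_id: text}. Returns {term: sorted list of doc ids}."""
    index = {}
    for doc_id, text in docs.items():
        # Lexical analysis: lowercase and keep alphabetic tokens only;
        # then drop stopwords before indexing, as in the slides.
        for term in re.findall(r"[a-z]+", text.lower()):
            if term not in STOP:
                index.setdefault(term, set()).add(doc_id)
    # Keep the term dictionary sorted so it can be binary-searched.
    return {t: sorted(ids) for t, ids in sorted(index.items())}

docs = {1: "the fox and the hen", 2: "a fox is a pig"}
print(build_inverted_index(docs))
# {'fox': [1, 2], 'hen': [1], 'pig': [2]}
```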
Huffman Codes

Binary character code: each character is represented by a unique binary string. A data file can be coded in two ways, illustrated in the slides with six characters and their frequencies:

  character:   a     b     c     d     e     f
  frequency:   45    13    12    16    9     5
  variable:    0     101   100   111   1101  1100

The first way, a fixed-length code of 3 bits per character, needs 100 × 3 = 300 bits. The second way, the variable-length code above, needs 45·1 + 13·3 + 12·3 + 16·3 + 9·4 + 5·4 = 224 bits.

Variable-Length Codes

Some care is needed when reading a variable-length code. Consider 001011101 with the codewords a=0, b=00, c=01, d=11: where do we cut? 00 can be read either as aa or as b. (Recall that the prefixes of 0011 are 0, 00, 001, and 0011.) Prefix codes: no codeword is a prefix of any other codeword (prefix-free). Prefix codes are simple to encode and decode.

Encoding and Decoding with the Codeword Table

Using the variable-length code in the table above:
Encode: abc = 0.101.100 = 0101100 (just concatenate the codewords).
Decode: 001011101 = 0.0.101.1101 = aabe (decode by walking the binary code tree; the slides draw one tree for the fixed-length code and one for the variable-length code).

Binary Tree View

Every non-leaf node has two children. The fixed-length code in our example is not optimal. The total number of bits required to encode a file is

  B(T) = Σ_{c ∈ C} f(c) · dT(c),

where f(c) is the frequency (number of occurrences) of c in the file and dT(c) denotes the depth of c's leaf in the tree T.

Constructing an Optimal Code

Formal definition of the problem.
Input: a set of characters C = {c1, c2, ..., cn}, where each c ∈ C has frequency f[c].
Output: a binary tree representing codewords such that the total number of bits required for the file is minimized.
Huffman proposed a greedy algorithm to solve the problem. [The slides trace it on the frequencies above, a:45, b:13, c:12, d:16, e:9, f:5, merging the two least-frequent nodes at each step.]

The Huffman Algorithm

HUFFMAN(C)
1  n := |C|
2  Q := C
3  for i := 1 to n-1 do
4      z := ALLOCATE_NODE()
5      x := left[z] := EXTRACT_MIN(Q)
6      y := right[z] := EXTRACT_MIN(Q)
7      f[z] := f[x] + f[y]
8      INSERT(Q, z)
9  return EXTRACT_MIN(Q)

This algorithm builds the tree T corresponding to an optimal code in a bottom-up manner. C is a set of n characters, and each character c in C has a defined frequency f[c]. Q is a priority queue, keyed on f, used to identify the two least-frequent objects to merge; the result of a merger is a new object (an internal node) whose frequency is the sum of the frequencies of the two merged objects.

Time Complexity

Lines 4-8 are executed n-1 times, and each heap operation in lines 4-8 takes O(lg n) time, so the total time required is O(n lg n). Note: the details of the heap operations will not be tested; the O(n lg n) time complexity should be remembered.

Another example: frequencies e:4, a:6, c:6, b:9, d:11. [The slides trace the merges for this example as well.]
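Here is a runnable Python version of HUFFMAN(C), using a binary heap as the priority queue Q; on the example frequencies it reproduces the 224-bit total computed above. The tie-breaking counter is an implementation detail, not part of the pseudocode:

```python
import heapq

def huffman(freq):
    """Build an optimal prefix code for {symbol: frequency}, following
    HUFFMAN(C): merge the two least-frequent nodes n-1 times."""
    # Heap entries are (frequency, tie_breaker, tree), where a tree is
    # a bare symbol (leaf) or a (left, right) pair (internal node).
    heap = [(f, i, sym) for i, (sym, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        fx, _, x = heapq.heappop(heap)                # x := EXTRACT_MIN(Q)
        fy, _, y = heapq.heappop(heap)                # y := EXTRACT_MIN(Q)
        heapq.heappush(heap, (fx + fy, tie, (x, y)))  # f[z] := f[x] + f[y]
        tie += 1
    _, _, root = heap[0]

    codes = {}
    def assign(node, prefix):
        if isinstance(node, tuple):          # internal node: recurse
            assign(node[0], prefix + "0")    # left edge labelled 0
            assign(node[1], prefix + "1")    # right edge labelled 1
        else:
            codes[node] = prefix or "0"      # leaf: record its codeword
    assign(root, "")
    return codes

freq = {"a": 45, "b": 13, "c": 12, "d": 16, "e": 9, "f": 5}
codes = huffman(freq)
print(codes)
print(sum(freq[s] * len(codes[s]) for s in freq))  # 224 bits
```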
Correctness of Huffman's Greedy Algorithm (Fun Part, Not Required)

Again, we use our general strategy. Let x and y be the two characters in C having the lowest frequencies (the first two characters selected by the greedy algorithm). We will show two properties:

1. There exists an optimal solution Topt (a binary tree representing codewords) such that x and y are siblings in Topt.
2. Let z be a new character with frequency f[z] = f[x] + f[y], and let C' = C - {x, y} ∪ {z}. Let T' be an optimal tree for C'. Then we can get Topt from T' by replacing the leaf z with an internal node z whose two children are the leaves x and y.

Proof of Property 1: Look at the lowest siblings in Topt, say b and c. Exchange x with b and y with c, and call the resulting tree Tnew. Then B(Topt) - B(Tnew) ≥ 0, since f[x] and f[y] are the smallest frequencies (the computation is spelled out after this proof). Property 1 is proved.

Proof of Property 2: Let T be the tree obtained from T' by replacing the leaf z with the three nodes. Then

  B(T) = B(T') + f[x] + f[y]   ... (1)

(the codes for x and y are one bit longer than the code for z). Now we prove T = Topt by contradiction. Suppose T ≠ Topt; then

  B(T) > B(Topt).   ... (2)

By Property 1, x and y are siblings in Topt. Thus we can delete x and y from Topt and get another tree T'' for C' with

  B(T'') = B(Topt) - f[x] - f[y] < B(T) - f[x] - f[y] = B(T'),

using (2) for the inequality and (1) for the last equality. Thus B(T'') < B(T'), contradicting the optimality of T'.
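For completeness, here is one way to spell out the exchange computation in the proof of Property 1; this is a reconstruction of the standard argument, not taken verbatim from the slides. Swapping x with b changes only the depths of those two leaves, so:

```latex
% Only x and b change depth when they are exchanged, hence
\begin{align*}
B(T_{\mathrm{opt}}) - B(T_{\mathrm{new}})
  &= \bigl(f[x]\,d_T(x) + f[b]\,d_T(b)\bigr)
   - \bigl(f[x]\,d_T(b) + f[b]\,d_T(x)\bigr) \\
  &= \bigl(f[b] - f[x]\bigr)\,\bigl(d_T(b) - d_T(x)\bigr) \;\ge\; 0,
\end{align*}
% since f[x] is minimal (so f[b] >= f[x]) and b is a deepest leaf
% (so d_T(b) >= d_T(x)). The swap of y with c is handled identically.
```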
