TakagiBISC2004

Information about TakagiBISC2004

Published on March 14, 2008

Author: ozturk

Source: authorstream.com

Conceptual Fuzzy Sets and Context Sensitive Information Retrieval
Tomohiro Takagi, Meiji University, UC Berkeley

Outline:
Coping with context dependent meanings
Toward conceptual fuzzy sets
From IR point of view
Trial 1: TREC Novelty Track
Trial 2: TREC Web track
Trial 3: Enhancing Google Image Search
Trial 4: Detection of illegal websites
Sparse Coding Model

Coping with context dependent meanings:
Ordinary approach: cases × cases × cases × … Impossible.
Ex) heavy: (elephant or human or dog or cat or mouse or …) × (old or middle or young or child or baby …) × (Europe or Asia or Africa or …) × …
Humans do not memorize things in that way.

Slide4:
Our approach: fusion of fractions of knowledge
“Heavy” means bigger weight than usual.
Usually “middle” and “young” are bigger than “child”.
Usually “baby” is smaller than “child”.
…

Fractions of knowledge:
(figure)

“Meaning representation from use” proposed by Wittgenstein:
According to Wittgenstein, the various meanings of a label (word) can be represented by other labels (words) in its use.
In this spirit, conceptual fuzzy sets are proposed, in which the meaning of a word is represented by the distribution of the activation of other words depending on context.
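
The “heavy” example can be made concrete with a small sketch. It assumes one rule (“bigger weight than usual”) plus one “usual” value per context, instead of a table of cases × cases × cases. The function name heavy_grade, the spread parameter, and every weight below are hypothetical illustrations, not values from the slides.

```python
# Hypothetical "usual" weights in kg, one small fraction of knowledge per context.
# The numbers are illustrative only and do not come from the presentation.
USUAL_WEIGHT_KG = {
    "elephant": 4000.0,
    "human adult": 65.0,
    "human baby": 7.0,
    "cat": 4.0,
}


def heavy_grade(weight_kg, context, spread=0.5):
    """Grade in [0, 1] for "heavy" in a given context.

    The grade is 0 at or below the usual weight and reaches 1 at
    (1 + spread) times the usual weight. The linear shape is an assumption.
    """
    usual = USUAL_WEIGHT_KG[context]
    excess = (weight_kg - usual) / (spread * usual)
    return max(0.0, min(1.0, excess))


# The same 10 kg gets a different grade depending on the context.
for ctx in ("human baby", "cat", "elephant"):
    print(ctx, round(heavy_grade(10.0, ctx), 3))
```

Adding a new context needs only one new “usual” value; the rule for “heavy” itself is not repeated per case.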

Toward conceptual fuzzy sets:
Heavy: Word A (with grade), Word B (with grade), Word C (with grade), Word D (with grade), Word E (with grade), Word F (with grade)
Conceptual fuzzy set (possibility distribution supporting the concept “heavy”)
But how to generate a possibility distribution reflecting context?

Meanings of JAVA in three different contexts:
Java: coffee, island, programming language

Activated Fraction of Knowledge:
coffee, island

Slide12:
Topics (concepts)
Clustering words (word-document matrix) + optimization using corpus
Dmoz (ODP)
Diagram labels: Artificial Brain, Word vector, Context, Word vector, IN, OUT

Slide13:
Inputs: Java & Mocha, Java & Windows
Figure labels: Relational matrix, CFS with 3 prototype vectors, CFS with 15 prototype vectors
Categories: coffee, S/W, travel, H/W

Simulations using actual home pages:
Randomly selected 45 home pages
Extracted 247 words from the pages
Built a 247 × 247 relational matrix based on co-occurrence

Slide15:
Results expanded from keyword input: java & application
Categories: coffee, computer, travel
3 iterations, 10 iterations
Co-occurrence

Clustering 60 web pages:
Co-occurrence, CFSs
Legend: Movie, Music, Travel, Cooking

From IR point of view:
Exact word matching: un-match
Expansion: soft match, but low precision
Context aware focused expansion: better quality match

From both points of view:
Fuzzy sets, information retrieval
Information retrieval based on meaning representation using CFS, which is a possibility distribution of words reflecting context.

TRIAL 1:
10,000 words
800 fractions = 800 clusters
Optimized weights

Slide22:
(diagram) Input x is compared with each fraction c1, …, cm: Similarity(x, c1), Similarity(x, c2), …, Similarity(x, cm). Weights: a11, a12, …, a1n, …, am1, …, amn.

Examples of expansion:
WORLD SPORTS AT 0000 GMT WORLD CUP. PARIS _ FIFA bans Laurent Blanc for two games, confirming that the French defender is out of Sunday's World Cup final against Brazil.

TREC Novelty Track:
Tasks: Relevancy Detection, Novelty Detection
Learning data: Reuters (TREC 2002) corpus, 810,000 documents
Indexed words: 10,000
Prototypes: 800

Relevancy Detection System:
(figure)

Result of Task 1 and Task 3:
(figure)

Task 1, Relevant and Novel F Scores:
(figure)

TRIAL 2:
Case 1: 120,000 fractions = docs
Case 2: 70,000 fractions = clusters of docs

TREC Web track Topic Distillation Task:
Diagram: the document and the query are each turned into a modified vector, and the two are matched to produce the output.
Gov collection (1.2 million HTML docs)
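
The expansion used in Trials 1 and 2 can be sketched as follows. The slides do not give the formula. This sketch assumes that the modified vector is y[j] = Σ_i Similarity(x, c_i) · a[i][j], that Similarity is cosine similarity, and that the weights a[i][j] equal the prototypes themselves. The vocabulary, prototypes, and the function names cosine and expand are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def expand(x, prototypes, weights):
    """Assumed expansion: y[j] = sum_i Similarity(x, c_i) * a[i][j]."""
    sims = [cosine(x, c) for c in prototypes]
    return [
        sum(s * row[j] for s, row in zip(sims, weights))
        for j in range(len(x))
    ]


# Hypothetical 4-word vocabulary and two fractions (clusters); illustrative only.
VOCAB = ["java", "coffee", "mocha", "software"]
PROTOTYPES = [
    [1.0, 1.0, 1.0, 0.0],  # c1: a "coffee" fraction
    [1.0, 0.0, 0.0, 1.0],  # c2: a "software" fraction
]
WEIGHTS = PROTOTYPES  # assumption: a[i][j] taken equal to prototype c_i

query = [1.0, 0.0, 1.0, 0.0]  # "java" + "mocha"
expanded = expand(query, PROTOTYPES, WEIGHTS)
for word, grade in sorted(zip(VOCAB, expanded), key=lambda p: -p[1]):
    print(f"{word}: {grade:.3f}")
```

With “mocha” as context, the coffee fraction is more similar to the query than the software fraction, so “coffee” ends up with a higher grade than “software” even though neither word was in the query.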

Example of Expansion:
Physical fitness (0.0392 → 0.1362)
  0.111806  fit
  0.107622  physic
  0.031421  sport
  0.023926  exercis
  0.020036  aerob
  0.018505  heart
  0.018082  obes
  0.017206  particl
  0.015366  walk
Computer virus (0.0105 → 0.0982)
  0.098488  viru
  0.086169  comput
  0.036659  softwar
  0.031507  encrypt
  0.029903  vulner
  0.027442  hacker
  0.026835  virus
  0.024170  intrus
  0.024154  secur
  0.022238  password

Results (R-Precision):
  Case 2                0.1733
  The best last year    0.1636
  Case 1                0.1612
  2nd best last year    0.1485

TRIAL 3:
Enhancing Google Image Search
20,000 index words
60,000 prototypes

Slide33:
Experimental results
Query “gates”: Bill Gates

Slide34:
Query
User relevance feedback
Meaning representation using CFS
Query refinement
Focus reflecting context

Slide35:
Experimental result - 1
Query = cat
With feedback, without feedback

Slide36:
Experimental result - 2
Query = apple
With feedback, without feedback

Slide38:
Text based Image Search
Content (Image) based Search
Enhanced Image Search
Next Step of Image Search

TRIAL 4:
Detection of illegal websites

Illegal sites:
Warez: illegal distribution and sale of commercial software
Emulation: illegal distribution of software, such as video games
Music: distribution of music data that infringes on copyrights
Adult: pornographic depictions and expressions
Hacking & Cracking: distribution of illegal hacking and cracking software; sharing of technical know-how
Drugs & Guns: sale of drugs and guns; sharing of acquisition routes
Killing: descriptions of murder and other violent acts

Illustration of illegal site:
Many suspicious words
Many commercial software names
High link rate to compressed files
Looks like illegal distribution and sale of commercial software

Concept Description of “Warez”, “Music” and “Emulation”:
Concepts: Warez, Music, Emulator
Diagram labels: Suspicious words, Commercial software, Software maker, Suspicious words, Compressed file types, URL List

CFS System:
HTML document
TF-IDF values
Types of linked files and URLs
Names (software, makers, music, artists)
Support Vector Machine

Evaluation:
Randomly selected 300 actual Web sites (including 85 illegal sites)
Compared CFS system with plain TF-IDF system

Results (300 pages):
                             CFS system   TF-IDF system
  Illegal pages  precision   0.9878       1.0000
                 recall      0.9529       0.8706
                 E measure   0.0299       0.0692
  Legal pages    precision   0.9817       0.9556
                 recall      0.9953       1.0000
                 E measure   0.0115       0.0227

CFS based on Sparse Coding:
Training corpus: 200,000 Reuters news articles (1996/08/20 - 1997/08/19)

Sparse Coding:
In human brain
One information : one neuron (grandmother cell)
One information : several neurons (cell assembly)
Diagram labels: Information A, Information B, Information C

Interconnection based on Mutual Information:
Term Layer
Context Layer

Slide49:
Meaning of a word is encoded as an activation pattern of neurons.
Fractions of knowledge are encoded as interconnections of neurons.
Get the most appropriate word as a result.

Operating cell assembly:
1. term input
2. propagating activation to related context
3. detecting the context
4. propagating activation to related word
5. term output

Examples of expansion:
Input “child” + “seat”
Input “child”
Input “seat”
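
The five steps of operating the cell assembly can be sketched as a two-layer propagation. This is a minimal illustration, not the presented system. The connection strengths stand in for mutual-information weights and are hypothetical, as are the context names, the extra terms, and the function expand_terms. Keeping only the single most active context is an assumed reading of “detecting the context”.

```python
# Hypothetical term-to-context connection strengths standing in for
# mutual-information weights. All numbers are illustrative only.
CONNECTIONS = {
    "child":  {"family": 0.8, "car": 0.4, "theater": 0.1},
    "seat":   {"family": 0.1, "car": 0.6, "theater": 0.7},
    "baby":   {"family": 0.9, "car": 0.4, "theater": 0.0},
    "belt":   {"family": 0.0, "car": 0.8, "theater": 0.0},
    "ticket": {"family": 0.0, "car": 0.1, "theater": 0.9},
}


def expand_terms(inputs, top_contexts=1):
    """Return the non-input terms ranked by activation, most active first."""
    # Steps 1-2: term input, then propagate activation to related contexts.
    context_act = {}
    for term in inputs:
        for ctx, weight in CONNECTIONS[term].items():
            context_act[ctx] = context_act.get(ctx, 0.0) + weight

    # Step 3: detect the context by keeping only the most active one(s).
    detected = sorted(context_act, key=context_act.get, reverse=True)[:top_contexts]

    # Step 4: propagate activation from the detected context back to terms.
    term_act = {}
    for term, links in CONNECTIONS.items():
        if term in inputs:
            continue
        term_act[term] = sum(links[ctx] * context_act[ctx] for ctx in detected)

    # Step 5: term output, ranked by activation.
    return sorted(term_act, key=term_act.get, reverse=True)


for query in (["child", "seat"], ["child"], ["seat"]):
    print(query, "->", expand_terms(query)[0])
```

With these made-up weights, “child” + “seat” together select a different context than either word alone, which is the behavior the three inputs on the last slide are meant to contrast.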

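The slides do not define the E measure reported in the Trial 4 results. The reported figures are consistent with E = 1 - F, where F is the balanced harmonic mean of precision and recall. That reading is an inference from the numbers, not a definition given in the presentation. The check below recomputes E from the reported precision and recall; the function name e_measure is hypothetical.

```python
def e_measure(precision, recall):
    """Assumed definition: E = 1 - F, with F the harmonic mean of precision and recall."""
    f_score = 2 * precision * recall / (precision + recall)
    return 1 - f_score


# (label, precision, recall, reported E measure), copied from the Trial 4 results.
REPORTED = [
    ("CFS, illegal pages",    0.9878, 0.9529, 0.0299),
    ("TF-IDF, illegal pages", 1.0000, 0.8706, 0.0692),
    ("CFS, legal pages",      0.9817, 0.9953, 0.0115),
    ("TF-IDF, legal pages",   0.9556, 1.0000, 0.0227),
]

for label, p, r, reported in REPORTED:
    print(f"{label}: computed {e_measure(p, r):.4f}, reported {reported:.4f}")
```

Lower E is better under this reading, so the CFS system improves on plain TF-IDF for both illegal and legal pages.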