SIGIR04


Published on November 20, 2007

Author: Danior

Source: authorstream.com

Translating Unknown Queries with Web Corpora for Cross-Language Information Retrieval (CLIR)
Pu-Jen Cheng, Jei-Wen Teng, Ruei-Cheng Chen, Jenq-Haur Wang, Wen-Hsiang Lu, and Lee-Feng Chien
Academia Sinica, Taiwan

Outline
- Introduction
- The Proposed Approaches: Anchor-Text-Based Approach; Search-Result-Based Approach
- Experiments
- Applications: LiveTrans (http://livetrans.iis.sinica.edu.tw/lt.html)
- Discussion & Conclusion

Query Translation for CLIR
- Pipeline: source query (S) → query translation via translation dictionaries → translated query (T) → mono-lingual IR

Problem
- Most queries are proper nouns that translation dictionaries miss (example queries: George Bush, Sheffield, Yahoo, Document Classification)

Observation from Query Logs
- Most real queries are short: 2.3 English words [Silverstein'98] and 3.18 Chinese characters [Pu'02]
- Most are out-of-dictionary: 82.9% of high-frequency query terms
- 12.4% of English queries against Chinese documents are unknown, yet most of their Chinese translations are also found in the logs
- Hence a clear demand for translation

The Web as Corpora
- Idea: replace static dictionaries in the pipeline with Web resources — anchor texts [Lu TOIS'04] and search-result pages

Purpose
- To increase translation coverage: unknown queries, general domains
- To improve CLIR performance: query expansion; combination of multiple translation approaches
- To benefit cross-language Web search: speed

Difference from Conventional Approaches
(comparison figure)

Our Ideas
- Anchor-Text-Based Approach [Lu TOIS'04]
- Search-Result-Based Approach

Anchor Text in Multiple Languages [Lu'04]
- Anchor text: the descriptive part of a link to a Web page

Probabilistic Inference Model [Lu'04]
- Based on page authority and co-occurrence

Drawbacks of the Anchor-Text-Based Approach
- Limited domains
- Powerful spiders required
- Large training corpora
- More network bandwidth & storage

Multilingual Search-Result Pages
- Example: the search-result page in Chinese for the English query "Yahoo", with its snippets

Correct Translations
- The mixed-language characteristic of Chinese pages yields correct translations

Relevant Translations
- Relevant translations enable effective query expansion

Observation
- Coverage of top-ranked translation candidates in search-result pages: 95% for popular queries, 70% for random queries
- Many relevant translations are found as well

Challenges
1. To extract translation candidates with correct lexical boundaries
2. To select correct or relevant translation candidates
3. To integrate translations extracted by different approaches to improve CLIR performance

Search-Result-Based Approach
- Pipeline: source query (S) → search engine(s) → search-result pages → term extraction → translation candidates → translation selection → translated query (T)

Challenge 1: Term Extraction
- SCP (Symmetric Conditional Probability): measures the cohesion holding the words together; low-frequency or long terms tend to be discarded [Silva'99]
- CD (Context Dependency): measures dependence on the left- or right-adjacent word/character; low-frequency or long terms can be extracted [Chien'97]

Term Extraction (II)
- SCPCD: a combination of SCP and CD
- PAT tree as the data structure
- LocalMaxs as the key-term selection algorithm
- No threshold required

Challenge 2: Translation Selection
- Candidates for the query term "Yahoo": 雅虎 (Yahoo!), 奇摩 (Kimo), 雅虎台灣 (Yahoo! Taiwan)
- Similarity estimation: S and Ti frequently co-occur in the same pages (not true for synonyms); S and Ti have similar co-occurring context terms

Chi-Square Test
- A statistical method based on co-occurrence of S and Ti
- Each translation needs only three Web searches (Boolean queries)

Context Vector Analysis
- A vector-space model using co-occurring context terms as feature vectors, with a weighting scheme and a similarity measure

Comparison of Chi-Square and Context Vector Methods
- FE: feature extraction; N: number of translation candidates

Challenge 3: CLIR
- Retrieval model [Xu'01]
- Estimation of P(s|t): consider various ranges of similarity-score rankings in each method m, with a weight assigned to each m

Experiments
- The NTCIR-2 English-Chinese task
- Translating Web-query terms
- Translating scientists' names and disease names (English-to-Chinese/Japanese/Korean)

Translation Performance — corpora compared:
- Hong Kong law parallel text collection (238K paragraphs)
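The chi-square selection step above can be sketched directly from its co-occurrence setup. This is a minimal sketch assuming the standard 2×2 contingency form of the test; the page counts (f_s = 1000, n = 100000, and the per-candidate counts) are made-up illustrative numbers, not figures from the paper.

```python
def chi_square_similarity(f_s, f_t, f_st, n):
    """Chi-square similarity between source term s and candidate t from
    four page counts:
      f_s  - pages containing s
      f_t  - pages containing t
      f_st - pages containing both s and t
      n    - total number of pages considered
    As the slides note, only three Web searches (s, t, s AND t) are
    needed per candidate; n is treated as a known constant.
    """
    a = f_st              # pages with both s and t
    b = f_s - f_st        # pages with s but not t
    c = f_t - f_st        # pages with t but not s
    d = n - a - b - c     # pages with neither
    denom = (a + b) * (a + c) * (b + d) * (c + d)
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom

# Rank hypothetical candidates for the query "Yahoo" (illustrative
# counts only): a common but unrelated term like 新聞 ("news") gets a
# much lower score than a genuine translation.
counts = {          # candidate -> (f_t, f_st)
    "雅虎": (900, 800),
    "奇摩": (700, 500),
    "新聞": (5000, 100),
}
ranked = sorted(
    counts,
    key=lambda t: chi_square_similarity(1000, counts[t][0], counts[t][1], 100000),
    reverse=True,
)
```

When s and t are statistically independent (a·d = b·c), the score is zero, so the measure directly rewards co-occurrence beyond chance.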
[Kwok'01]
- Web corpora: search results; the anchor-text collection (109K URLs) [Lu'04]; search results + anchor texts

Performance Metric
- Top-k inclusion rate: the percentage of queries whose correct translations appear among the first k extracted translations

Translation Performance (II)
- CV achieves higher precision than X2
- CV+X2 outperforms either CV or X2 alone

Translation Performance (III)
- AT achieves higher precision than CV+X2, while CV+X2 has higher coverage than AT: the two are complementary
- CV+X2+AT has the best overall performance

Extracted Correct Translations and Extracted Relevant Translations
(example tables)

CLIR Performance — runs compared:
- Dic: LDC English-Chinese lexicon (102K entries)
- SR: X2+CV
- SR+AT: X2+CV+AT
- All: X2+CV+AT+Dictionary

CLIR Performance (II)
- Dic has higher precision than SR and SR+AT at k = 1 (top-1 inclusion rates: 50.3% and 61.2%)

CLIR Performance (III)
- SR and SR+AT have higher precision than Dic when k > 3 (top-3 inclusion rates: 68.0% and 78.1%); the curves then start converging

CLIR Performance (IV)
- Improvement when adding our approaches to the dictionary: from 0.043 to 0.061 / 0.064 / 0.059 / 0.063 / 0.064
- OOV inclusion rates: 68.1%, 81.8%, 86.3%
- CLIR performance improves by translating OOV terms

Experiments on Translating Web-Query Terms
- Web-query logs and test query sets
- Popular Web queries are translated better than random Web queries
- AT performs worse for random Web queries
- By query type (popular query set, search-result-based approach): place > people > computer & network > others > organization

Common Nouns and Verbs
- The proposed search-result-based approach is less reliable for common terms

Experiments on Translating Scientists' Names and Technical Terms (English-to-Chinese/Japanese/Korean)
- An example of multilingual translation

Applications: LiveTrans (http://livetrans.iis.sinica.edu.tw/lt.html)
- A cross-language meta-search engine
- Provides an online translation service for query terms in cross-language Web search
- Example query "Sheffield": transliteration; senses found include industry, the city in mid U.K., Sheffield Univ., and Sheffield Hallam Univ.
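The top-k inclusion rate used throughout the evaluation above follows directly from its definition. A minimal sketch, where the queries, ranked candidate lists, and gold translations are invented examples rather than the paper's data:

```python
def inclusion_rate(results, gold, k):
    """Top-k inclusion rate: the percentage of queries whose correct
    translation appears among the first k extracted candidates.

    results - maps each query to its ranked translation candidates
    gold    - maps each query to the set of acceptable translations
    """
    hits = sum(
        1 for q, candidates in results.items()
        if any(c in gold[q] for c in candidates[:k])
    )
    return 100.0 * hits / len(results)

# Hypothetical example: 2 of 3 queries have a correct top-1 candidate,
# and all 3 are covered within the top 2.
results = {
    "yahoo":     ["雅虎", "奇摩"],
    "sheffield": ["雪菲爾", "大學"],
    "jaguar":    ["汽車", "美洲豹"],
}
gold = {"yahoo": {"雅虎"}, "sheffield": {"雪菲爾"}, "jaguar": {"美洲豹"}}
top1 = inclusion_rate(results, gold, 1)   # ≈ 66.7
top2 = inclusion_rate(results, gold, 2)   # 100.0
```

Because a set of acceptable translations is allowed per query, the same function covers both "correct" and the looser "relevant" translation evaluations mentioned in the slides.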
Discussion and Conclusions
- Advantages:
  - Can translate unknown queries to improve CLIR performance
  - Can provide query expansion for CLIR
  - Can extract translations with multiple meanings
  - Flexible query specification
  - Useful for online cross-language Web search
- Disadvantages:
  - Dependent on the employed search engines
  - Does not perform well for common terms
  - Not applicable to language pairs without the mixed-language characteristic
- Examples: "Jaguar" → Jaguar (car) vs. jaguar (animal); "have a temperature, 38°C, pneumonia" → SARS (severe acute respiratory syndrome)

Thank you for your attention! Q&A
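As a closing sketch of the other selection method compared above, context-vector analysis can be approximated with raw term-frequency context vectors and cosine similarity. The slides' actual weighting scheme is not reproduced here, and the snippets below are invented examples, so this is only a simplified stand-in for the paper's method.

```python
import math

def context_vector(snippets, term):
    """Build a sparse feature vector of context terms co-occurring with
    `term` in search-result snippets (plain term-frequency weights; the
    original slides use a specific weighting scheme not shown here)."""
    vec = {}
    for snippet in snippets:
        tokens = snippet.split()
        if term in tokens:
            for tok in tokens:
                if tok != term:
                    vec[tok] = vec.get(tok, 0) + 1
    return vec

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(u[k] * v.get(k, 0) for k in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# A source term and a candidate sharing context terms score high:
s_vec = context_vector(["yahoo search portal news"], "yahoo")
t_vec = context_vector(["雅虎 search portal mail"], "雅虎")
sim = cosine(s_vec, t_vec)
```

This captures the slides' second similarity cue — S and Ti having similar co-occurring context terms — which, unlike the chi-square test, can also match synonyms that rarely appear on the same page.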
