Information about CLIR

Published on December 6, 2007

Author: Baverly_gal


Cross Language Information Retrieval (CLIR): Modern Information Retrieval, Sharif University of Technology, Fall 2005. Mohsen Jamali.

The General Problem: Find documents written in any language, using queries expressed in a single language.

The General Problem (cont): Traditional IR identifies relevant documents in the same language as the query (monolingual IR). Cross-language information retrieval (CLIR) tries to identify relevant documents in a language different from that of the query. This problem is more and more acute for IR on the Web due to the fact that the Web is a truly multilingual environment.

Why is CLIR important?

Characteristics of the WWW: [Figure: Country of Origin of Public Web Sites, 2001 (% of Total). Source: OCLC Web Characterization, 2001]

Global Internet User Population: [Figure: global Internet user population, 2000 and 2005; visible labels: English, Chinese. Source: Global Reach]

Importance of CLIR: CLIR research is becoming more and more important for global information exchange and knowledge sharing. National security; foreign patent information access; medical information access for patients.

CLIR is Multidisciplinary: CLIR involves researchers from the following fields: information retrieval, natural language processing, machine translation and summarization, speech processing, document image understanding, human-computer interaction.

User Needs: Search a monolingual collection in a language that the user cannot read. Retrieve information from a multilingual collection using a query in a single language. Select images from a collection indexed with free text captions in an unfamiliar language. Locate documents in a multilingual collection of scanned page images.

Why Do Cross-Language IR?
When users can read several languages: eliminates multiple queries; query in most fluent language. Monolingual users can also benefit: if translations can be provided; if it suffices to know that a document exists; if text captions are used to search for images.

Language Identification: Can be specified using metadata (included in HTTP and HTML). Determined using word-scale features (which dictionary gets the most hits?). Determined using subword features: letter n-grams in electronic and printed text; phoneme n-grams in speech.

Language Encoding Standards: Language (alphabet) specific native encodings: Chinese GB, Big5; Western European ISO-8859-1 (Latin1); Russian KOI-8, ISO-8859-5, CP-1251. UNICODE (ISO/IEC 10646): UTF-8 (variable byte length); UTF-16 (two or four bytes per character); UCS-2 (fixed double-byte).

CLIR Experimental System: 2 systems. SMART information retrieval system, modified to work with 11 European languages (Danish, Dutch, English, Finnish, French, German, Italian, Norwegian, Portuguese, Spanish, Swedish). Generation of restricted bigrams. Pseudo-relevance feedback. TAPIR is a language model IR system written by M. Srikanth. It has been adapted to work with 12 different European languages (Danish, Dutch, English, Finnish, French, German, Italian, Norwegian, Portuguese, Russian, Spanish, Swedish). Stemming using Porter’s stemmer. Translation using Intertran (

Approaches to CLIR

Design Decisions: What to index? Free text or controlled vocabulary. What to translate? Queries or documents. Where to get translation knowledge?
Dictionary, ontology, training corpus.

Slide16: [Diagram: taxonomy of cross-language text retrieval approaches. Labels: Cross-Language Text Retrieval; Controlled Vocabulary; Free Text; Knowledge-based; Corpus-based; Ontology-based; Dictionary-based; Thesaurus-based; Parallel; Comparable; Unaligned; Term-aligned; Sentence-aligned; Document-aligned; Query Translation; Document Translation; Text Translation; Vector Translation]

Early Development: 1964, International Road Research Documentation: English, French and German thesaurus. 1969, Pevzner: exact match with a large Russian/English thesaurus. 1970, Salton: ranked retrieval with small English/German dictionary. 1971, UNESCO: proposed standard for multilingual thesauri.

Controlled Vocabulary Matures: 1977, IBM STAIRS-TLS: large-scale commercial cross-language IR. 1978, ISO Standard 5964: guidelines for developing multilingual thesauri. 1984, EUROVOC thesaurus: now includes all 9 EC languages. 1985: ISO Standard 5964 revised.

Free Text Developments: 1970, 1973, Salton: hand coded bilingual term lists. 1990: Latent Semantic Indexing. 1994, European multilingual IR project: first precision/recall evaluation. 1996: SIGIR Cross-lingual IR workshop. 1998: EU/NSF digital library working group.

Query vs. Document Translation:  Query vs.
Document Translation

Query translation: Very efficient for short queries. Not as big an advantage for relevance feedback. Hard to resolve ambiguous query terms. Document translation: May be needed by the selection interface, and supports adaptive filtering well. Slow, but only need to do it once per document. Poor scale-up to large numbers of languages.

Document Translation Example: Approach: select a single query language; translate every document into that language; perform monolingual retrieval. Long documents provide enough context, and many translation errors do not hurt retrieval. Much of the generation effort is wasted, and choosing a single translation can hurt.

Text Translation: One weakness of present fully automatic machine translation systems is that they are able to produce high quality translations only in limited domains. Text retrieval systems are typically more tolerant of syntactic than semantic translation errors, but semantic accuracy suffers when insufficient domain knowledge is encoded into a translation system. In fact, some of the work done by a machine translation system could actually reduce some measures of retrieval effectiveness.

Query Translation Example: Select controlled vocabulary search terms. Retrieve documents in desired language. Form monolingual query from the documents. Perform a monolingual free text search. [Diagram labels: Information Need; Thesaurus; Controlled Vocabulary; Multilingual Text Retrieval System; Alta Vista; French Query Terms; English Abstracts; English Web Pages]

Query Translation: [Diagram: An English-Chinese CLIR System. Labels: Queries (E); Queries (C); Results (C); Results (E); Chinese Documents]

Controlled Vocabulary: A controlled vocabulary information retrieval system can be very useful in the hands of a skilled searcher, but end users often find free text searching to be more helpful.
Experience has shown that although the domain knowledge that can be encoded in a thesaurus permits experienced users to form more precise queries, casual and intermittent users have difficulty exploiting the expressive power of a traditional query interface in exact match retrieval systems. Controlled vocabulary text retrieval systems are widely used in libraries, and user needs assessment has received considerable attention from library and information science researchers.

Knowledge-based Techniques for Free Text Searching

Knowledge Structures for IR: Ontology: representation of concepts and relationships. Thesaurus: ontology specialized for retrieval. Bilingual lexicon: ontology specialized for machine translation. Bilingual dictionary: ontology specialized for human translation.

Machine Readable Dictionaries: Based on printed bilingual dictionaries. Becoming widely available. Used to produce bilingual term lists. Cross-language term mappings are accessible. Sometimes listed in order of most common usage. Some knowledge structure is also present. Hard to extract and represent automatically. The challenge is to pick the right translation.

CLIR: Dictionary Based: Problems: limitations of dictionaries; inflected word forms; phrases and compound words; lexical ambiguity. Possible solution: approximate string matching.

Unconstrained Query Translation: Replace each word with every translation. Typically 5-10 translations per word. About 50% of monolingual effectiveness. Ambiguity is a serious problem. Example: fly (English): 8 word senses (e.g., to fly a flag); 13 Spanish translations (enarbolar, ondear, …); 38 English retranslations (hoist, brandish, lift…).

Exploiting Part-of-Speech Tags: Constrain translations by part of speech (noun, verb, adjective, …). Effective taggers are available. Works well when queries are full
sentences. Short queries provide little basis for tagging. Constrained matching can hurt monolingual IR: nouns in queries often match verbs in documents.

Phrase Indexing: Improves retrieval effectiveness two ways: phrases are less ambiguous than single words; idiomatic phrases translate as a single concept. Three ways to identify phrases: semantic (e.g., appears in a dictionary); syntactic (e.g., parse as a noun phrase); cooccurrence (words found together often). Semantic phrase results are impressive.

Corpus-based Techniques for Free Text Searching

Types of Bilingual Corpora: Parallel corpora (translation-equivalent pairs): document pairs; sentence pairs; term pairs. Comparable corpora: content-equivalent document pairs. Unaligned corpora: content from the same domain.

Pseudo-Relevance Feedback: Enter query terms in French. Find top French documents in parallel corpus. Construct a query from English translations. Perform a monolingual free text search. [Diagram labels: Top ranked French Documents; French Text Retrieval System; Alta Vista; French Query Terms; English Translations; English Web Pages; Parallel Corpus]

Learning From Document Pairs: Count how often each term occurs in each pair. Treat each pair as a single document. [Table: counts of English terms E1-E5 and Spanish terms S1-S4 in document pairs Doc 1-Doc 5; the cell layout is not recoverable. Values as extracted: 4 2 2 1 8 4 4 2 2 2 2 1 2 1 2 1 4 1 2 1]

Similarity-Based Dictionaries: Automatically developed from aligned documents. Terms E1 and E3 are used in similar ways. Terms E1 & S1 (or E3 & S4) are even more similar. For each term, find most similar in other language. Retain only the top few (5 or so). Performs as well as dictionary-based techniques. Evaluated on a comparable corpus of news stories; stories were automatically linked based on date and subject.

Generalized Vector Space Model: “Term space” of each language is different. But the
“document space” for a corpus is the same. Describe new documents based on the corpus: vector of cosine similarity to each corpus document. Easily generated from a vector of term weights: multiply by the term-document matrix. Compute cosine similarity in document space. Excellent results when the domain is the same.

Latent Semantic Indexing: Designed for better monolingual effectiveness. Works well across languages too: cross-language is just a type of term choice variation. Produces short dense document vectors. Better than long sparse ones for adaptive filtering. Training data needs grow with dimensionality. Not as good for retrieval efficiency: always 300 multiplications, even for short queries.

Sentence-Aligned Parallel Corpora: Easily constructed from aligned documents: match pattern of relative sentence lengths. Not yet used directly for effective retrieval, but all experiments have included domain shift. Good first step for term alignment: sentences define a natural context.

Cooccurrence-Based Translation: Align terms using cooccurrence statistics. How often does a term pair occur in sentence pairs?
Weighted by relative position in the sentences. Retain term pairs that occur unusually often. Useful for query translation: excellent results when the domain is the same. Also practical for document translation: term usage reinforces good translations.

Exploiting Unaligned Corpora: Documents about the same set of subjects. No known relationship between document pairs. Easily available in many applications. Two approaches: (1) use a dictionary for rough translation, but refine it using the unaligned bilingual corpus; (2) use a dictionary to find alignments in the corpus, then extract translation knowledge from the alignments.

Feedback with Unaligned Corpora: Pseudo-relevance feedback is fully automatic. Augment the query with top ranked documents. Improves recall. “Recenters” queries based on the corpus. Short queries get the most dramatic improvement. Two opportunities: query language (improve the query); document language (suppress translation error).

Context Linking: Automatically align portions of documents. For each query term: find translation pairs in corpus using dictionary; select a “context” of nearby terms (e.g., +/- 5 words in each language); choose translations from most similar contexts, based on cooccurrence with other translation pairs. No reported experimental results.

Problems with CLIR: Morphological processing is difficult for some languages (e.g. Arabic). Many different encodings for Arabic: Windows Arabic (e.g. dictionaries); Unicode (UTF-8) (e.g. corpus); Macintosh Arabic (e.g. queries). Normalization: remove diacritics, e.g. العَرَبِيَّة to العربية (Arabic, the language); standardize spellings for foreign names, e.g. كلينتون vs كلنتون (“Kleentoon” vs “Klntoon”) for Clinton.

Problems with CLIR (contd): Morphological processing (contd.)
Arabic stemming: root + patterns + suffixes + prefixes = word, e.g. ktb + CiCaC = kitab. All verbs and nouns are derived from fewer than 2000 roots. Roots are too abstract for information retrieval. ktb → kitab (a book), kitabi (my book), alkitab (the book), kitabuki (your book, f), kataba (to write), kitabuka (your book, m), maktab (office), kitabuhu (his book), maktaba (library, bookstore), ... Want stem = root + pattern + derivational affixes? No standard stemmers available, only morphological (root) analyzers.

Problems with CLIR (contd): Availability of resources. Names and phrases are very important, but most lexicons do not have good coverage. Bilingual dictionaries are difficult to get hold of, but can sometimes be found on the Web; e.g. for a recent Arabic cross-lingual evaluation we used 3 on-line Arabic-English dictionaries (including harvesting) and a small lexicon of country and city names. Parallel corpora are more difficult and require more formal arrangements.

CLIR better than IR?: How can cross-language beat within-language?
We know there are translation errors. Surely those errors should hurt performance. Hypothesis is that translation process may disambiguate some query terms: words that are ambiguous in Arabic may not be ambiguous in English; expansion during translation from English to Arabic prevents the ambiguity from re-appearing. Has been proposed that CLIR is a model for IR: translate query into one language and then back to original; given hypothesis, should have an improved query; should be reasonable to do this across many different languages.

“Low Density Languages”: Languages for which few on-line resources exist. Rumor has it that 25 languages are well represented on Web. Extreme is “kitchen languages” that are only spoken. More extreme: a language made up of whistling. Corpus to be searched may also be very small. Bilingual dictionaries often exist in print; may need to use “interlingua” such as French. Some approaches, such as those relying on translation probabilities, may not work well. Solution depends on specific application.

Performance Evaluation

Constructing Test Collections: One collection for retrospective retrieval: start with a monolingual test collection (documents, queries, relevance judgments); translate the queries by hand. Need 2 collections for adaptive filtering: monolingual test collection in one language, plus a document collection in the other language; generate relevance judgments for the same queries.

Evaluating Corpus-Based Techniques: Same domain evaluation: partition a bilingual corpus; design queries; generate relevance judgments for evaluation part. Cross-domain evaluation: can use existing collections and corpora; no good metric for degree of domain shift.

Evaluation Example: Corpus-based same domain evaluation. Use average precision as figure of merit.

User Interface Design

Query Formulation:
Interactive word sense disambiguation. Show users the translated query: retranslate it for monolingual users; provide an easy way of adjusting it, but don’t require that users adjust or approve it.

Selection and Examination: Document selection is a decision process (relevance feedback, problem refinement, read it), based on factors not used by the retrieval system. Provide information to support that decision. May not require very good translations (e.g., word-by-word title translation). People can “read past” some ambiguity. May help to display a few alternative translations.

References:
Miguel E. Ruiz. Cross Language Information Retrieval (CLIR). PowerPoint presentation, University of Buffalo, 2002.
Douglas W. Oard, Bonnie J. Dorr. A Survey of Multilingual Text Retrieval. 1996.
Jian-Yun Nie. Cross-Language Information Retrieval. IEEE Computational Intelligence Bulletin 2(1): 19-24, 2003.
Preben Hansen, Daniela Petrelli, Jussi Karlgren, Micheline Beaulieu, Mark Sanderson. User-Centered Interface Design for Cross-Language Information Retrieval. In: Proceedings of the Twenty-fifth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Tampere, Finland, 2002.
Elizabeth D. Liddy, Anne R. Diekema. Cross-Language Information Exploitation of Arabic. PowerPoint presentation, April 2005.
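
Illustration for the Unconstrained Query Translation slide: a minimal Python sketch of "replace each word with every translation". The term list is a toy assumption; only "enarbolar" and "ondear" appear in the slides, the other entries and the function name `translate_query` are hypothetical. Keeping untranslatable words unchanged is one possible fallback, not something the slides prescribe.

```python
from typing import Dict, List

# Hypothetical toy English-to-Spanish term list. Only "enarbolar" and "ondear"
# come from the slides; every other entry is an illustrative assumption.
TERM_LIST: Dict[str, List[str]] = {
    "fly": ["enarbolar", "ondear", "volar", "mosca"],
    "flag": ["bandera"],
}


def translate_query(query: str, term_list: Dict[str, List[str]]) -> List[str]:
    """Replace each query word with every translation listed for it.

    Words missing from the term list are kept unchanged (an assumed
    fallback that lets names pass through untranslated).
    """
    translated: List[str] = []
    for word in query.lower().split():
        translated.extend(term_list.get(word, [word]))
    return translated


if __name__ == "__main__":
    print(translate_query("fly flag", TERM_LIST))
```

Even in this toy case the two-word query grows to five target-language terms, mixing the "hoist a flag" sense with the insect and aviation senses, which is the ambiguity problem the slide describes.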

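Illustration for the Similarity-Based Dictionaries slide: a rough Python sketch, assuming each term is represented by its counts across aligned document pairs and that "most similar" means highest cosine similarity (the slides do not name the measure). The counts below are hypothetical and are not the values from the Learning From Document Pairs table.

```python
import math
from typing import Dict, List, Tuple

# Hypothetical counts: term -> occurrences in each of three aligned document pairs.
ENGLISH: Dict[str, List[int]] = {"E1": [4, 0, 2], "E2": [0, 3, 0]}
SPANISH: Dict[str, List[int]] = {"S1": [2, 0, 1], "S2": [0, 1, 0], "S3": [1, 1, 1]}


def cosine(a: List[int], b: List[int]) -> float:
    """Cosine similarity of two equal-length count vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def most_similar(
    term_vector: List[int], candidates: Dict[str, List[int]], top_k: int = 5
) -> List[Tuple[str, float]]:
    """Rank terms of the other language by similarity and keep the top few."""
    scored = [(name, cosine(term_vector, vec)) for name, vec in candidates.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


if __name__ == "__main__":
    for term, vector in ENGLISH.items():
        print(term, most_similar(vector, SPANISH, top_k=2))
```

Here E1 and S1 occur in the same document pairs in the same proportions, so S1 ranks first for E1; retaining only the top few candidates per term mirrors the "5 or so" cutoff on the slide.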

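Illustration for the Evaluation Example slide, which uses average precision as the figure of merit: a small Python sketch of one common definition (non-interpolated average precision, normalized by the total number of relevant documents). The slides do not specify which variant was used, so this choice and the function name `average_precision` are assumptions.

```python
from typing import Sequence, Set


def average_precision(ranked: Sequence[str], relevant: Set[str]) -> float:
    """Non-interpolated average precision for one query.

    Precision is sampled at the rank of each relevant document that was
    retrieved; relevant documents never retrieved contribute zero.
    """
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


if __name__ == "__main__":
    # Relevant documents at ranks 1 and 3: (1/1 + 2/3) / 2
    print(average_precision(["d1", "d2", "d3", "d4"], {"d1", "d3"}))
```

For a cross-language run, the same function can be applied to the ranking produced from the translated query and compared with the monolingual ranking for the same query, which is how figures such as "about 50% of monolingual effectiveness" are typically expressed.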