Published on January 10, 2008
Mandarin-English Information (MEI): Investigating Translingual Speech Retrieval
Johns Hopkins University Center for Language and Speech Processing, Summer Workshop 2000
Progress Update
The MEI Team, August 2, 2000

Outline:
  Baseline (Pat, Gina, Wai-Kit)
  Upper Bounds (Pat, Erika, Helen)
  Climbing Upwards (Upcoming Research Problems)
    translation (Gina, Jian Qiang)
    word-subword fusion (Helen, Doug, Wai-Kit)
    named entities, numerals (Helen, Sanjeev, Wai-Kit, Karen)
    syllable lattice generation (Hsin-Min, Berlin)

The MEI Task:
  An example query (NYT, AP newswire):
    A China Airlines A-310 jetliner returning from the Indonesian island of Bali with 197 passengers and crew crashed and burst into flames Monday night just short of Taipei’s Chiang Kai-Shek Airport……. (full story used as query, typically 200-500 words)
  An example document (VOA), accompanied by raw anchor scripts

Our Baseline System:
  Query -> Query Term Selection (1 to full document) -> Query Term Translation (dictionary lookup) -> translated, hexified Chinese query terms -> InQuery Retrieval Engine
  Audio documents -> Dragon Mandarin Speech Recognizer -> tokenized, hexified Chinese word sequence -> InQuery Retrieval Engine
  Evaluate retrieval outputs

Our First Retrieval Experiment...:
  Queries
    17 exemplars
    1 per topic in TDT2 corpus
  Documents
    2265 in all
    ~500 belong to at least 1 topic
    others are “off-topic” or “briefs”
    each topic has >=2 relevant documents

Our First Retrieval Experiment:
  No. of query terms selected = 100 (sweep)
  No. of alternative translations per term = 1
  Word-based retrieval
  Average Precision = 16.91%

In Search of Upper Bounds...:
  Confounding factors on the query side
    term selection, translation (no. of terms, definition of a term, named entities, dictionary / COTS system)
  Confounding factors on the document side
    syllable recognition performance, OOV word tokenization
  Confounding factors in retrieval
    word-based or subword-based (characters, syllables)
    subword n-grams (n=??)

Upper Bounds (Word):
  Queries (ASR); Documents (ASR)
    isolates the confounding factors (term selection, translation, recognition performance, word tokenization)
    Ave Precision = 73.3%
  Queries (Xinhua); Documents (ASR/TKN)
    isolates similar confounding factors
    resembles MEI TDT task (queries and documents come from different news sources)
    word tokenization (CETA / Dragon)
    Best Ave Precision = 53.5% (ASR), 58.7% (TKN)

Chinese Words and Subwords:
  Characters (written) -> syllables (spoken)
  Degenerate mapping
    /hang2/, /hang4/, /heng2/ or /xing2/
    /fu4 shu4/ (LDC’s CALLHOME lexicon)
  Tokenization / Segmentation
    /zhe4 yi1 wan3 hui4 ru2 chang2 ju3 xing2/

Upper Bounds (Subword):
  Queries (Xinhua); Documents (ASR/TKN)
  character-based retrieval
    overlapping character n-grams (document, within-term for queries, bigrams fare best)
    Best Ave Precision = 54.3% (ASR), 55.9% (TKN)
    overlapping bigrams in queries
    Best Ave Precision = 61.7% (cross-term overlap)
  syllable-based retrieval
    word tokenization affects syllable lookup
    syllable bigrams fare best
    Best Ave Precision = 51.6% (ASR), 53.3% (TKN)

Upper Bound (Translingual):
  Putting back the translingual element
  Selected English query terms --> translated Chinese query terms (Oracle -- Jian Qiang Wang)
  Retrieval performance
    word-based (180 terms, no #syn, #sum) 50.6%
    subword-based retrieval (character bigrams, #sum 52.1%, #syn 52.3%)
    TKN??

Thus Far...:
  [Chart; surviving labels: Ave Precision; TDT_English / ASR (???); “perfect” translation, “best” index term set; Trying to climb up]

Better Translation:
  # translation alternatives per term
    Current best (120 query terms, 3 translations per term, word-based retrieval, ASR reseg with CETA, #sum 28.1%)
    (90 query terms, 2 translations per term, word-based retrieval, ASR orig, #sum 27.53%)
  Phrase-based translation
    2 types of phrases (named entities, dictionary-based phrases)
    term selection (consider both phrases and component words), higher # terms
    Current best (250 query terms, all translations, word-based retrieval, 43.3%)

Word-Subword Fusion:
  Words incorporate lexical knowledge
  Subwords are intended to handle the OOV problem
  Combination of both may beat either alone
  Ranked list of retrieved documents
    from word-based retrieval
    from subword-based retrieval

Merging: Loose Coupling:
  Types of Evidence
    Score
    Rank
  Score Combination
    Max
    Linear combination
  Rank Combination
    Round robin
    Source bias
    Query bias
  Example ranked lists:
    1 voa4062 .22; 2 voa3052 .21; 3 voa4091 .17; …; 1000 voa4221 .04
    1 voa4062 .52; 2 voa2156 .37; 3 voa3052 .31; …; 1000 voa2159 .02
    1 voa40612; 2 voa30522; 3 voa40911; …; 1000 voa42201

Tight Coupling: Words and Bigrams:
  Lattice: jiang|qiang zhe|ze min|ming
  Words: Jiang Zemin
  Bigrams: jiang_zhe jiang_ze qiang_zhe qiang_ze zhe_min zhe_ming ze_min ze_ming
  Combination: jiang_zhe zhe_min Jiang Zemin

Word-Subword Fusion (weighted similarity):
  Merging ranked lists
  Each retrieved document is scored [equation missing from transcript]
  i denotes words, subword n-grams

Numerals and Named Entities:
  Verbalize numerals
  Named Entities
    BBN tags (names of locations, people, organizations)
    Derive Bilingual Term List from TDT2
    English letter-to-phone generation
    Cross-lingual phonetic mapping (English phones to Chinese phones)
    Syllabification

Cross-Lingual Phonetic Mapping:
  Named entity: Jiang Zemin, Kosovo
  Syllabify Pinyin spelling, e.g. jiang ze min
  English Pronunciation Lookup or Letter-to-Phone Generation -> English phones, e.g. k ao s ax v ow
  Cross-lingual Phonetic Mapping -> Chinese phones, e.g. k e s u o w o
  Syllabification -> Chinese syllables, e.g. ke suo wo

Syllable Lattice for Document Representation:
  Address ASR errors and OOV
  Augment Dragon ASR output with alternate syllable hypotheses
  Generate syllable n-grams for audio indexing
  Include into word-subword fusion

Lots to do still...

Slide22:
  [System diagram, as of Sunday July 9, 2000]
  Mandarin-English Information: Investigating Translingual Speech Retrieval <http://www.glue.umd.edu/~meiweb>
  Johns Hopkins University, Center for Language and Speech Processing, JHU/NSF Summer Workshop 2000
  MEI Team: Helen MENG (CUHK), Berlin CHEN (National Taiwan University), Erika GRAMS (Advanced Analytic Tools), Sanjeev KHUDANPUR (JHU/CLSP), Gina-Anne LEVOW (University of Maryland), Wai-Kit LO (CUHK), Douglas OARD (University of Maryland), Patrick SCHONE (Department of Defense), Karen TANG (Princeton University), Hsin-Min WANG (Academia Sinica), Jianqiang WANG (University of Maryland)
  Diagram labels: Named Entity Tagger; Phrase tagging; Unknown words and phrases; English to Chinese translation dictionary; Term Translation; Spoken Mandarin documents; Dragon Mandarin ASR; Query Processing; Document Processing; Query to INQUERY; Document to INQUERY; Character n-gram generation; Character n-gram generation; Word sequence; Character n-gram sequence; INQUERY; Ranked List of Possibly Relevant Documents; Translated words and phrases; Relevance Assessments; Figure of Merit Scoring; Query Term Selection; Word sequence; Character n-gram sequence; Segmented Chinese Text; Input English text query
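
Sketch: syllable bigrams from a lattice (Tight Coupling: Words and Bigrams)
  The bigram list on the "Tight Coupling: Words and Bigrams" slide can be reproduced with a few lines of Python. This is only an illustrative sketch, not the MEI system's code. It assumes the lattice can be simplified to a sequence of positions, each holding alternative syllable hypotheses (jiang/qiang, zhe/ze, min/ming), and that a bigram pairs any hypothesis at one position with any hypothesis at the next. The function name lattice_bigrams and the list-of-lists representation are hypothetical.

```python
from itertools import product


def lattice_bigrams(lattice):
    """Return overlapping syllable bigrams from a simplified lattice.

    `lattice` is a list of positions; each position is a list of
    alternative syllable hypotheses. This position-by-position layout
    is an assumption made for illustration only.
    """
    bigrams = []
    # Pair each position with the one that follows it.
    for left, right in zip(lattice, lattice[1:]):
        # Every hypothesis on the left combines with every one on the right.
        for a, b in product(left, right):
            bigrams.append(f"{a}_{b}")
    return bigrams


# Syllable alternatives taken from the slide.
lattice = [["jiang", "qiang"], ["zhe", "ze"], ["min", "ming"]]
print(" ".join(lattice_bigrams(lattice)))
```

  For the three positions on the slide this yields the same eight bigrams the slide lists. Indexing all of them alongside the recognized words is what lets a query bigram such as ze_min match even when the recognizer's top hypothesis was zhe or ming.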