2005 JRC Workshop Eisele


Published on May 2, 2008

Author: Tibald

Source: authorstream.com

Exploiting Multilingual Corpora for Machine Translation:
Andreas Eisele, Saarland University & DFKI, eisele@dfki.de
Arona, September 2005
JRC Enlargement and Integration Workshop: Exploiting parallel corpora in up to 20 languages

Overview:
- Multilingual/MT Projects & Tools at DFKI
- MT-Related Activities at Saarland University
- Work in the PTOLEMAIOS Project
- Plans for Near-Term Future

Multilingual Projects at DFKI:
Main LT Application Areas:
- Multilingual Natural Communication
- Multilingual Document Production
- Crosslingual Information and Knowledge Management

Multilingual Natural Communication:
- NL Dialogue Systems (DISCO, COSMA, Interprice)
- Speech Dialogue Processing (Verbmobil, Interprice)
- Robust Speech Parsing (Verbmobil, Interprice)
- Automatic Processing and Answering of Email (COSMA, ICC, XtraMind)
- Natural Speech Synthesis (Mary, Interprice)
Sample Application Areas: e-commerce (product search, CRM)
Application Projects with Interprice, AOL Europe and spin-off company XtraMind Technologies

Multilingual Document Production:
- Terminology Checking (DiET, FLAG, WHITEBOARD, SKATE)
- Grammar and Style Checking (LATESLAV, FLAG, SKATE)
- Controlled Language Checking (FLAG, WHITEBOARD, SKATE)
- Automatic XML Tagging (WHITEBOARD)
- Consistency Control (BiLD, WHITEBOARD)
Sample Application Areas: multilingual document production, web-content production
Application Project with SAP Spin-Off company

Crosslingual Information and Knowledge Management:
- Crosslingual Content Management (TWENTYONE, MUCHMORE)
- Crosslingual Information Retrieval (TWENTYONE, MULINEX, MIETTA, MUCHMORE)
- Crosslingual Multimedia Retrieval (POP-EYE, OLIVE, MUMIS, DIRECT INFO)
- Crosslingual Information Extraction (PARADIME, WHITEBOARD, DIRECT INFO)
- Crosslingual Text Mining, Terminology Extraction (GETESS, AIRFORCE, WIPO)
- Multilingual Summarization (MULINEX, MUCHMORE, MUSI)
- Multilingual Language Generation (TG/2, TEMSIS, MIETTA)
Sample Application Areas: multilingual and crosslingual search, tourism information on the web, up-to-date air quality reporting, information management for mega-events (world championship, Olympic Games), phonetic trademark search, term extraction from patent translations
Application Projects with German Telekom, ESG, Dresdner Bank, law firm Boehmert&Boehmert, feasibility study on terminology extraction with WIPO (via acrolinx), …

Multilingual Resources at DFKI:
- POS-tagger TnT (T. Brants) and Chunkie can be trained for arbitrary languages
- Middleware HoG for multilingual robust shallow and HPSG-based deep analysis (mapping into RMRSs)
- Morphologies from the MMorph project exist for German, English, French, Spanish, Italian
- Morphologies are encoded as FS transducers, usable for error-tolerant analysis and generation
- Adding more languages is very easy (as done for Arabic with A. Soudi)
- Uniform handling of all EU languages would be extremely convenient, but linguistic resources are currently lacking

Multilingual Projects at DFKI:
Main LT Application Areas:
- Multilingual Natural Communication
- Multilingual Document Production
- Crosslingual Information and Knowledge Management
Topic emerging since 2005: Machine Translation

Machine Translation at DFKI:
- Topics in Compass (Digital Olympics 2006): Multi-Engine Machine Translation, Speech Technologies, Multilingual Content Management, Cross-lingual Information Retrieval and Multilingual Question Answering
- Open LOGOS
  - LOGOS MT ® = one of the largest and most powerful among the commercial MT engines
  - DFKI turned LOGOS MT into an open source product (in cooperation with GlobalWare AG)
- Plans for integrated, hybrid MT from rule-based and stochastic engines (code name: EuroMatrix)

MT Activities at Saarland University:
Guiding principle: Start with a method that works today, improve it by adding linguistic functionality as appropriate
Starting point: Phrase-based SMT (Koehn, Och, Marcu, HLT-NAACL 2003)
- Conceptually, phrase-based SMT is an intermediate step between TM and MT: it combines TM's ability to learn from examples with the compositionality of MT
- Among the best approaches in the ongoing DARPA evaluation campaign
- Easy to deploy (thanks to tools by F.J. Och and P. Koehn)
- Conceptually very simple, hence a good candidate for enriching models with linguistic sophistication

MT Activities at Saarland University:
- April '05: participation in ACL shared task on statistical machine translation with a multi-engine approach, {Finnish, French, German, Spanish} → English
- May '05: participation in DARPA MT evaluation with baseline phrase-based SMT system (Chinese → English)
- Project seminar on empirical MT: students learned to turn parallel corpora into SMT systems (based on the EuroParl corpus, but also Welsh ↔ English and Arabic ↔ English)
- Diploma thesis on corpus-based MT via RMRS alignment
Experience: Using parallel corpora for MT quickly yields very promising results! We should have more language pairs and more data…
- Crawling of the UN document repository, collection of a 6-way parallel {Arabic, Chinese, English, French, Russian, Spanish} corpus (+ some German)

The PTOLEMAIOS project:
Assumptions:
- Advanced language technology for truly multilingual applications is a key challenge for computational linguistics
- Treebanking and supervised learning have been successful for English (and some other languages), but may not be feasible for "smaller" languages
- Parallel corpora can be used to transfer knowledge about linguistic relations across languages or to induce linguistic knowledge from data
- Word alignments derived from simple models (GIZA++) can help to support this process
"Parallel-Text-based Optimization for Language learning - Exploiting Multilingual Alignment for the Induction Of Syntactic grammars"

PTOLEMAIOS:
Funding: Emmy-Noether fellowship from DFG, P.I. Jonas Kuhn
Expected Duration: April 2005 – March 2009
Original Goal: Induce grammars from parallel corpora (and evaluate them in isolation)
Revised Goal (since August '05): Evaluate grammars with respect to their impact on MT performance
First Steps:
- Use GIZA++-derived word alignment as a filter to speed up parsing; several papers on suitable parsing algorithms
- Use of LinearB's SMT decoder on the phrase-aligned EuroParl corpus
Planned Steps:
- Explore the usefulness of syntactic analyses for phrase-based SMT
  - word-based and syntax-based partial analyses are offered to the decoder
  - the decoder can exploit syntax if useful, and fall back to plain PBSMT if not
  - the optimal weight of syntactic dependencies can be determined empirically
- Work on more languages (UN corpus in 6 languages, AC corpus)

EuroMatrix: current situation (joint work with Philipp Koehn and Chris Callison-Burch, Edinburgh):
MT systems per language pair (data taken from J. Hutchins' Compendium of Translation Software, 10th Edition)

EuroMatrix: current situation:
Most language pairs remain uncovered

EuroMatrix: SMT for many languages:
The EuroParl corpus has been constructed to build statistical MT systems
Source: "Europarl: A Parallel Corpus for Statistical Machine Translation", Philipp Koehn, MT Summit X, September 2005

EuroMatrix: SMT for many languages:
Multilingual corpora can be aligned across all languages…
Source: "Europarl: A Parallel Corpus for Statistical Machine Translation", Philipp Koehn, MT Summit X, September 2005

EuroMatrix: SMT for many languages:
SMT systems derived from the corpora vary in quality
Source: "Europarl: A Parallel Corpus for Statistical Machine Translation", Philipp Koehn, MT Summit X, September 2005

EuroMatrix: SMT for many languages:
Difficulty of translation into and from a given language may differ widely…
Source: "Europarl: A Parallel Corpus for Statistical Machine Translation", Philipp Koehn, MT Summit X, September 2005

EuroMatrix:
Ideas:
- For language pairs where both rule-based MT and SMT based on parallel corpora exist, they should be integrated to exploit the complementary strengths of both approaches
- Parallel corpora can then be used in two ways:
  - feeding the SMT sub-system
  - fine-tuning the integrated setup
- For language pairs where only monolingual resources (lexicons, morphologies, taggers, …) and parallel corpora exist, transfer rules operating on linguistic representations should be derived from data
- We need a generic framework that allows different approaches to be plugged in and combined (an open source MT toolbox)
- Development of MT systems needs an open evaluation campaign, in the style of DARPA MTeval / ACL shared task

Conclusion:
- Machine translation performance can be enabled/boosted by parallel corpora
- Current work just scratches the surface of what can be done
- SMT systems for the languages of new member states should soon emerge from the AC corpus
- More parallel data for these languages would be desirable (100MW much better than 10MW!)
- It would be very helpful to cooperate with teams from "new" countries for morphologies, taggers, parsers, …
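
The "Planned Steps" on the PTOLEMAIOS slide (offer syntactic analyses to the decoder, fall back to plain PBSMT when syntax does not help, and determine the weight of syntactic dependencies empirically) might be sketched roughly as below. This is only an illustration under stated assumptions: the names Hypothesis, syntax_weight and tune_weight, the two-feature score, and the simple grid search over weights are all hypothetical and are not taken from the slides or from any actual decoder.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    """One candidate translation with two (hypothetical) log-scores."""
    text: str
    pbsmt_logprob: float   # score from the plain phrase-based model
    syntax_logprob: float  # score from an assumed syntactic feature


def score(h: Hypothesis, syntax_weight: float) -> float:
    # Weighted sum of log-scores; a weight of 0.0 reduces to plain PBSMT.
    return h.pbsmt_logprob + syntax_weight * h.syntax_logprob


def best(hyps, syntax_weight):
    # Pick the highest-scoring candidate under the given weight.
    return max(hyps, key=lambda h: score(h, syntax_weight))


def tune_weight(dev_set, candidate_weights):
    """Grid search: choose the weight whose top candidate matches the
    reference most often on held-out (candidates, reference) pairs."""
    def matches(w):
        return sum(best(hyps, w).text == ref for hyps, ref in dev_set)
    return max(candidate_weights, key=matches)


if __name__ == "__main__":
    hyps = [
        Hypothesis("a", pbsmt_logprob=-1.0, syntax_logprob=-5.0),
        Hypothesis("b", pbsmt_logprob=-2.0, syntax_logprob=-1.0),
    ]
    print(best(hyps, 0.0).text)  # plain PBSMT choice
    print(best(hyps, 1.0).text)  # choice once syntax is weighted in
    print(tune_weight([(hyps, "b")], [0.0, 0.5, 1.0]))
```

Including 0.0 among the candidate weights means the tuning step can itself select the plain PBSMT fallback whenever the syntactic feature does not help on the held-out data.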
