Better translations through automated source and post-edit analysis


Published on March 4, 2014

Author: Welocalize

Source: slideshare.net

Description

Presentation by David Landan of Welocalize at memoQfest Americas 2014. Machine translation (MT) is here to stay.
Better MT means less post-editing effort, resulting in higher throughput for less money. This technical presentation details how automated source and post-edit analysis produces better translations. MT quality depends on the quantity, quality, and relevance of the training data. Includes a review of WeScore, a dashboard for viewing MT metrics.

Better translations through automated source and post-edit analysis
David Landan, Welocalize

Background
• MT is here to stay
  – Better MT = less post-editing (PE) effort = higher throughput for less money
• MT quality depends on training data quantity, quality, and relevance
  – Selecting in-domain data raises BLEU scores by 10-20 points over generic engines
• LSPs have less control over quantity, so we need to focus on quality & relevance

A data-driven approach
• Analytics at each step
  – Training: Perplexity Evaluator, Candidate Scorer, Source Content Profiler (joint project w/CNGL), StyleScorer
  – MT production: UGC Normalization, Number checking, StyleScorer
  – Post-editing: WeScore, StyleScorer

Candidate Scorer
• Uses corpus of known “difficult” text
• Compares part-of-speech (POS) n-grams
  – Generates per-sentence scores
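The slide does not say how the POS n-grams are compared, so the following is only a minimal sketch under stated assumptions: sentences are already POS-tagged, the n-gram order is 3, and the per-sentence score is the share of a sentence's POS trigrams that also occur in the "difficult" corpus. The function names and the tag sequences are hypothetical.

```python
from collections import Counter

def pos_ngrams(tags, n=3):
    """Count the POS n-grams of one tagged sentence."""
    return Counter(tuple(tags[i:i + n]) for i in range(len(tags) - n + 1))

def build_profile(tagged_sentences, n=3):
    """Pool the POS n-grams of every sentence in the 'difficult' corpus."""
    profile = Counter()
    for tags in tagged_sentences:
        profile.update(pos_ngrams(tags, n))
    return profile

def difficulty_score(tags, profile, n=3):
    """Assumed scoring rule: fraction of the sentence's n-grams seen in the profile."""
    grams = pos_ngrams(tags, n)
    total = sum(grams.values())
    if total == 0:
        return 0.0  # sentence shorter than n tokens
    hits = sum(count for gram, count in grams.items() if gram in profile)
    return hits / total

# Hypothetical 'difficult' corpus: long noun stacks.
difficult = [
    ["DET", "NOUN", "NOUN", "NOUN", "VERB"],
    ["NOUN", "NOUN", "NOUN", "ADP", "NOUN"],
]
profile = build_profile(difficult)
candidate = ["DET", "NOUN", "NOUN", "NOUN", "ADP"]
score = difficulty_score(candidate, profile)
```

A higher score here means the sentence looks structurally more like the known-difficult text; a production scorer would likely weight n-grams by frequency rather than treat them as present or absent.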

Perplexity (PPL) Evaluator
• Build language models (LMs) from multiple corpora
  – Known “good” sentences for MT
  – Known “bad” sentences for MT
  – Client-specific in-domain data
• Each document gets a PPL score against each LM
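To make the PPL-per-LM idea concrete, here is a small sketch that scores one document against three language models. It assumes an add-one-smoothed unigram model, which is far simpler than anything a real MT pipeline would use; the class name and the toy corpora are hypothetical.

```python
import math
from collections import Counter

class UnigramLM:
    """Add-one-smoothed unigram LM (an illustrative stand-in for a real n-gram LM)."""

    def __init__(self, sentences):
        self.counts = Counter(tok for s in sentences for tok in s.lower().split())
        self.total = sum(self.counts.values())
        self.vocab_size = len(self.counts) + 1  # +1 slot for unseen tokens

    def prob(self, token):
        return (self.counts[token] + 1) / (self.total + self.vocab_size)

    def perplexity(self, document):
        tokens = document.lower().split()
        if not tokens:
            raise ValueError("cannot score an empty document")
        log_sum = sum(math.log(self.prob(t)) for t in tokens)
        return math.exp(-log_sum / len(tokens))

# Hypothetical toy corpora, one per LM named on the slide.
models = {
    "good": UnigramLM(["the file was saved", "the menu is open"]),
    "bad": UnigramLM(["lol cant wait 2 c u", "omg teh best evar"]),
    "client": UnigramLM(["click the save button", "open the settings menu"]),
}
doc = "click the settings button"
scores = {name: lm.perplexity(doc) for name, lm in models.items()}
```

Lower perplexity means the document is less surprising to that model, so in this toy setup the document scores lowest against the client LM and highest against the "bad" LM.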

StyleScorer
• Combines PPL ratio, dissimilarity score, and classification score
  – Each document receives a score from 0 to 4
  – Higher score indicates better match to style established by client’s documents
  – Does not require parallel data
• Source scored for training/tuning suitability

Source Content Profiler
• CNGL project (beta)
  – Classification of docs into profiles
  – Features based on:
    • Word & sentence length
    • Readability score
    • Syntactic structure
    • Terminology
    • Tag ratios
    • Do Not Translate lists
    • Glossary matches
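As a rough illustration of the simpler features in the list above, the sketch below extracts word and sentence length, tag ratio, and glossary and Do Not Translate matches from a tagged source string. It is not the CNGL profiler: the function name, the regex-based tokenization, and the sample lists are assumptions, and the readability, syntactic, and classification steps are left out.

```python
import re

TAG_PATTERN = re.compile(r"<[^>]+>")

def profile_features(text, glossary, do_not_translate):
    """Extract a few surface features; both term sets are assumed lowercase."""
    tags = TAG_PATTERN.findall(text)
    plain = TAG_PATTERN.sub(" ", text)
    # Naive sentence split on terminal punctuation followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", plain.strip()) if s]
    words = re.findall(r"\w+", plain)
    lowered = [w.lower() for w in words]
    n_words = len(words) or 1  # avoid division by zero on empty input
    return {
        "avg_word_length": sum(len(w) for w in words) / n_words,
        "avg_sentence_length": len(words) / (len(sentences) or 1),
        "tag_ratio": len(tags) / n_words,
        "glossary_matches": sum(1 for w in lowered if w in glossary),
        "dnt_matches": sum(1 for w in lowered if w in do_not_translate),
    }

features = profile_features(
    "Click <b>Save</b> to store the file. Open memoQ again.",
    glossary={"save", "file"},
    do_not_translate={"memoq"},
)
```

A feature dictionary like this could then feed whatever classifier assigns documents to profiles.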

Does it work?
Engine            en-US → nl-NL   en-US → pl-PL   en-US → hu-HU
Plain vanilla     21.26           16.88           17.31
Domain match      36.39           37.07           38.36
Plain + target    44.07           34.61           30.43
Domain + target   64.40           54.55           49.53


UGC normalization
• Make substitutions in source for known MT pain points before translating
  – Frequent misspellings: “teh”, “mroe”, etc.
  – Abbreviations: “imho”, “tyvm”, etc.
  – Missing punctuation: “cant”, “theyll”, etc.
  – Emoticons
  – Spelling variants/slang: “cuz”, “usu”, etc.
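The substitution step can be sketched as a whole-word lookup applied before the text is sent to MT. The source tokens come from the slide; the replacements on the right-hand side are assumed expansions, the function name is hypothetical, and emoticon handling is omitted because the slide does not say what they should become.

```python
import re

# Keys are from the slide; the replacement values are assumptions.
SUBSTITUTIONS = {
    "teh": "the",
    "mroe": "more",
    "imho": "in my humble opinion",
    "tyvm": "thank you very much",
    "cant": "can't",
    "theyll": "they'll",
    "cuz": "because",
    "usu": "usually",
}

# \b keeps the match to whole words, so "cantaloupe" is left alone.
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(k) for k in SUBSTITUTIONS) + r")\b",
    re.IGNORECASE,
)

def _replace(match):
    word = match.group(0)
    replacement = SUBSTITUTIONS[word.lower()]
    # Preserve sentence-initial capitalization.
    if word[0].isupper():
        replacement = replacement[0].upper() + replacement[1:]
    return replacement

def normalize(text):
    """Apply the known-pain-point substitutions to one source segment."""
    return _PATTERN.sub(_replace, text)

cleaned = normalize("Teh servers cant handle mroe load")
```

A blind lookup will miscorrect legitimate words such as the noun "cant", so a real list would need per-client review.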

Number checking
• Verify that numeric MT output is localized correctly
  – Currency: “$1B” vs “1 млрд. $”
  – Dates: “2/28/2014” vs “28/2/2014”
  – Time: “2pm” vs “14h00”
  – Separator & radix: “1,234.5” vs “1 234,5”
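For the separator and radix case only, one possible check is to parse the source and target numbers under their respective locale conventions and compare the values. This is a sketch under assumptions: the small format table, the function names, and the choice of `Decimal` are illustrative, and currency, date, and time checks would need their own logic.

```python
from decimal import Decimal, InvalidOperation

# Illustrative subset of locale conventions (grouping separator, radix).
LOCALE_FORMATS = {
    "en-US": {"group": ",", "radix": "."},
    "pl-PL": {"group": " ", "radix": ","},
    "nl-NL": {"group": ".", "radix": ","},
}

def parse_number(text, locale):
    """Parse a number written in the given locale's format into a Decimal."""
    fmt = LOCALE_FORMATS[locale]
    # Treat non-breaking and narrow non-breaking spaces as plain spaces.
    cleaned = text.replace("\u00a0", " ").replace("\u202f", " ")
    cleaned = cleaned.replace(fmt["group"], "").replace(fmt["radix"], ".")
    return Decimal(cleaned)

def numbers_match(source, target, source_locale, target_locale):
    """True if the target is a valid target-locale rendering of the source value."""
    try:
        return parse_number(source, source_locale) == parse_number(target, target_locale)
    except InvalidOperation:
        # The text does not parse under that locale, e.g. it was left unlocalized.
        return False

ok = numbers_match("1,234.5", "1 234,5", "en-US", "pl-PL")
unlocalized = numbers_match("1,234.5", "1,234.5", "en-US", "pl-PL")
```

Here `ok` is true, while `unlocalized` is false because the en-US string does not parse as a pl-PL number.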

StyleScorer revisited
• MT output is compared to client’s historical (in-domain) PE data
  – Treat each target segment as a document
  – Lower scores indicate segments likely to require greater PE effort


WeScore
• Dashboard for viewing MT metrics
  – Tokenizes input from a variety of formats & runs several scoring algorithms in parallel
  – Exports detailed analysis to spreadsheet for sentence-by-sentence review
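The "several scorers in parallel, then export" pattern might look roughly like the sketch below. It does not reflect how WeScore is built: the two metrics are simple placeholders (the slides do not name the algorithms), whitespace splitting stands in for tokenization, and CSV stands in for the spreadsheet export.

```python
import csv
import io
from concurrent.futures import ThreadPoolExecutor

def length_ratio(mt, ref):
    """Placeholder metric: MT length relative to reference length, in tokens."""
    return len(mt.split()) / max(len(ref.split()), 1)

def token_overlap(mt, ref):
    """Placeholder metric: share of reference token types found in the MT output."""
    mt_tokens, ref_tokens = set(mt.lower().split()), set(ref.lower().split())
    return len(mt_tokens & ref_tokens) / max(len(ref_tokens), 1)

SCORERS = {"length_ratio": length_ratio, "token_overlap": token_overlap}

def score_segments(pairs):
    """Run every scorer on every (MT, reference) pair using a thread pool."""
    rows = []
    with ThreadPoolExecutor() as pool:
        for mt, ref in pairs:
            futures = {name: pool.submit(fn, mt, ref) for name, fn in SCORERS.items()}
            row = {"mt": mt, "reference": ref}
            row.update({name: fut.result() for name, fut in futures.items()})
            rows.append(row)
    return rows

def to_csv(rows):
    """Write one row per segment so reviewers can go sentence by sentence."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["mt", "reference", *SCORERS])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

rows = score_segments([("the cat sat", "the cat sat down")])
report = to_csv(rows)
```

Threads are enough for a sketch; CPU-heavy metrics would more likely run in separate processes.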


StyleScorer III
• PE output is compared to client’s historical (in-domain) data
  – Treat each PE segment as a document
  – Lower score indicates possible deviation from established style

Feedback loop
• Data collected and lessons learned
  – Update client-specific data for future engine training
  – Mine data for generalizable patterns in problem areas
  – Work with post-editors to understand how to make a better system & how to improve PE experience and throughput

Q&A
Thank you!
