Better translations through automated source and post-edit analysis • David Landan, Welocalize
Background • MT is here to stay – Better MT = less post-editing (PE) effort = higher throughput for less money • MT quality depends on training data quantity, quality, and relevance – Selecting in-domain data increases BLEU scores by 10-20 points over generic engines • LSPs have less control over quantity, so we need to focus on quality & relevance
A data-driven approach • Analytics at each step – Training: Perplexity Evaluator, Candidate Scorer, Source Content Profiler (joint project w/CNGL), StyleScorer – MT Production: UGC Normalization, Number checking, StyleScorer – Post-Editing: WeScore, StyleScorer
Candidate Scorer • Uses corpus of known “difficult” text • Compares part of speech (POS) n-grams – Generates per-sentence scores
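The slide does not say how the POS n-grams are compared, so the following is only a minimal sketch under one assumption: a sentence is scored by the fraction of its POS trigrams that also occur in the "difficult" corpus. The function names, the trigram order, and the hand-tagged data are all hypothetical; a real system would run a POS tagger first.

```python
def pos_ngrams(tags, n=3):
    """Return the POS n-grams of a tag sequence as a list of tuples."""
    return [tuple(tags[i:i + n]) for i in range(len(tags) - n + 1)]


def build_profile(tagged_sentences, n=3):
    """Collect every POS n-gram seen in the known-'difficult' corpus."""
    profile = set()
    for tags in tagged_sentences:
        profile.update(pos_ngrams(tags, n))
    return profile


def difficulty_score(tags, profile, n=3):
    """Fraction of the sentence's POS n-grams that occur in the profile."""
    grams = pos_ngrams(tags, n)
    if not grams:
        return 0.0  # sentence shorter than n tokens
    return sum(g in profile for g in grams) / len(grams)


# Hypothetical hand-tagged data standing in for tagger output.
difficult_corpus = [
    ["NOUN", "NOUN", "NOUN", "VERB"],
    ["VERB", "ADP", "NOUN", "NOUN", "NOUN"],
]
profile = build_profile(difficult_corpus)

candidate = ["DET", "NOUN", "NOUN", "NOUN", "VERB"]
score = difficulty_score(candidate, profile)  # higher = more like "difficult" text
```

A set-overlap ratio is the simplest per-sentence score; weighting n-grams by their frequency in the difficult corpus would be a natural refinement.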
Perplexity (PPL) Evaluator • Build language models (LMs) from multiple corpora – Known “good” sentences for MT – Known “bad” sentences for MT – Client-specific in-domain data • Each document gets a PPL score against each LM
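To make the per-LM scoring concrete, here is a minimal sketch that assumes an add-one smoothed unigram model. The slide does not specify the LM type, and production systems would normally use higher-order n-gram models; the class name and the toy corpora are hypothetical.

```python
import math
from collections import Counter


class UnigramLM:
    """Add-one smoothed unigram LM (a deliberately tiny stand-in)."""

    def __init__(self, sentences):
        self.counts = Counter(tok for s in sentences for tok in s.lower().split())
        self.total = sum(self.counts.values())
        self.vocab = len(self.counts) + 1  # +1 reserves mass for unseen tokens

    def prob(self, token):
        return (self.counts[token] + 1) / (self.total + self.vocab)

    def perplexity(self, text):
        tokens = text.lower().split()
        if not tokens:
            raise ValueError("cannot score an empty document")
        log_sum = sum(math.log(self.prob(t)) for t in tokens)
        return math.exp(-log_sum / len(tokens))


# Hypothetical toy corpora; real ones would hold many thousands of sentences.
lms = {
    "good": UnigramLM(["click the save button", "click the print button"]),
    "bad": UnigramLM(["lol cant find teh button", "teh thing is broke lol"]),
}

document = "click the button"
ppl_scores = {name: lm.perplexity(document) for name, lm in lms.items()}
```

A lower perplexity means the document looks more like the corpus behind that LM, so here the document sits closer to the "good" model than to the "bad" one.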
StyleScorer • Combines PPL ratio, dissimilarity score, and classification score – Each document receives a score from 0-4 – Higher score indicates better match to style established by client’s documents – Does not require parallel data • Source scored for training/tuning suitability
Source Content Profiler • CNGL project (beta) – Classification of docs into profiles – Features based on: word & sentence length, readability score, syntactic structure, terminology, tag ratios, Do Not Translate lists, glossary matches
Does it work?
Engine            en-US→nl-NL   en-US→pl-PL   en-US→hu-HU
Plain vanilla     21.26         16.88         17.31
Domain match      36.39         37.07         38.36
Plain + target    44.07         34.61         30.43
Domain + target   64.40         54.55         49.53
UGC (user-generated content) normalization • Make substitutions in source for known MT pain points before translating – Frequent misspellings: “teh”, “mroe”, etc. – Abbreviations: “imho”, “tyvm”, etc. – Missing punctuation: “cant”, “theyll”, etc. – Emoticons – Spelling variants/slang: “cuz”, “usu”, etc.
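The substitution step can be sketched as a whole-word lookup table applied before the text is sent to MT. The table below only mirrors the slide's examples, and the expansions, the name `SUBSTITUTIONS`, and the `normalize` helper are assumptions for illustration, not the production rule set.

```python
import re

# Hypothetical substitution table built from the slide's examples.
SUBSTITUTIONS = {
    "teh": "the",
    "mroe": "more",
    "imho": "in my humble opinion",
    "tyvm": "thank you very much",
    "cant": "can't",
    "theyll": "they'll",
    "cuz": "because",
    "usu": "usually",
}

# One alternation over all keys; \b keeps matches to whole words only.
_PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, SUBSTITUTIONS)) + r")\b",
    re.IGNORECASE,
)


def normalize(text):
    """Replace known problem tokens; the original capitalization is not kept."""
    return _PATTERN.sub(lambda m: SUBSTITUTIONS[m.group(0).lower()], text)


normalized = normalize("imho teh new version cant be mroe stable")
```

A plain lookup is context-blind: "cant" is also a legitimate English word, so a real rule set would need exceptions or context checks for entries like that.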
Number checking • Verify that numeric MT output is localized correctly – Currency: “$1B” vs “1 млрд. $” – Dates: “2/28/2014” vs “28/2/2014” – Time: “2pm” vs “14h00” – Separator & radix: “1,234.5” vs “1 234,5”
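For the separator and radix case, one possible check is to convert each en-US number found in the source to the target convention and confirm that the converted form appears in the MT output. This is only a sketch: the `SEPARATORS` table, the locale keys, and both helper functions are assumptions, and currency, date, and time checks would need their own rules.

```python
import re

# Hypothetical per-locale conventions: (thousands separator, decimal mark).
SEPARATORS = {
    "en-US": (",", "."),
    "de-DE": (".", ","),
    "ru-RU": (" ", ","),  # real text may use a no-break space instead
}

# Matches en-US numbers such as "1,234.5", "20", or "3.14".
NUMBER_EN_US = re.compile(r"\d{1,3}(?:,\d{3})+(?:\.\d+)?|\d+(?:\.\d+)?")


def localize_number(number, target_locale):
    """Rewrite an en-US formatted number string in the target convention."""
    group, radix = SEPARATORS[target_locale]
    integer, _, fraction = number.partition(".")
    integer = integer.replace(",", group)
    return integer + (radix + fraction if fraction else "")


def missing_numbers(source, mt_output, target_locale):
    """Return expected localized numbers that do not appear in the MT output."""
    expected = [localize_number(n, target_locale) for n in NUMBER_EN_US.findall(source)]
    # Plain substring test: good enough for a sketch, too loose for production.
    return [n for n in expected if n not in mt_output]


source = "Total: 1,234.5 units in 20 boxes"
ok = missing_numbers(source, "Gesamt: 1.234,5 Einheiten in 20 Kisten", "de-DE")
bad = missing_numbers(source, "Gesamt: 1,234.5 Einheiten in 20 Kisten", "de-DE")
```

An empty list means every number was found in localized form; a non-empty list names the forms a post-editor should look for.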
StyleScorer revisited • MT output is compared to client’s historical (in-domain) PE data – Treat each target segment as a document – Lower scores indicate segments likely to require greater PE effort
WeScore • Dashboard for viewing MT metrics – Tokenizes input from variety of formats & runs several scoring algorithms in parallel – Exports detailed analysis to spreadsheet for sentence-by-sentence review
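The slide does not list the scoring algorithms WeScore runs, so the sketch below substitutes two simple stand-in metrics (token-level edit distance and length ratio) to illustrate the general shape: score each segment with several metrics concurrently, then export one row per sentence. All names are hypothetical, and CSV stands in for the spreadsheet export.

```python
import csv
import io
from concurrent.futures import ThreadPoolExecutor


def length_ratio(mt, ref):
    """Token-count ratio of MT output to reference."""
    return len(mt.split()) / max(len(ref.split()), 1)


def token_edit_distance(mt, ref):
    """Levenshtein distance computed over whitespace tokens."""
    a, b = mt.split(), ref.split()
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, 1):
        cur = [i]
        for j, tok_b in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (tok_a != tok_b)))
        prev = cur
    return prev[-1]


# Stand-in metrics; the actual WeScore metric set is not given in the slides.
METRICS = {"length_ratio": length_ratio, "edit_distance": token_edit_distance}


def score_segments(pairs):
    """Score every (mt, ref) pair with every metric using a thread pool."""
    with ThreadPoolExecutor() as pool:
        futures = {
            name: [pool.submit(fn, mt, ref) for mt, ref in pairs]
            for name, fn in METRICS.items()
        }
        return [
            {"mt": mt, "ref": ref, **{name: futures[name][i].result() for name in METRICS}}
            for i, (mt, ref) in enumerate(pairs)
        ]


def to_csv(rows):
    """Serialize the per-sentence rows as CSV text for spreadsheet review."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["mt", "ref", *METRICS])
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


pairs = [("the cat sat", "the cat sat down"), ("a b", "a b")]
rows = score_segments(pairs)
report = to_csv(rows)
```

Threads keep the example short; CPU-heavy metrics in Python would more likely run in separate processes.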
StyleScorer III • PE output is compared to client’s historical (in-domain) data – Treat each PE segment as a document – Lower score indicates possible deviation from established style
Feedback loop • Data collected and lessons learned – Update client-specific data for future engine training – Mine data for generalizable patterns in problem areas – Work with post-editors to understand how to make a better system & how to improve PE experience and throughput
Q&A Thank you!