Accuracy sas-redmore-2014-2

Published on March 6, 2014

Author: sredmore



How to optimize the accuracy of your sentiment analysis system. Covers the limits of accuracy, crowdsourcing (MTurk), and a process for tuning and testing your system's accuracy.

The Opportunity of Accuracy

No, you can't always get what you want
You can't always get what you want
You can't always get what you want
And if you try sometime you find
You get what you need

Seth Redmore, VP Marketing and Product Management
@sredmore, @lexalytics

Accuracy? Opportunity? What?
A very rough estimate of “companies that care about accuracy…”
[Chart categories: Care Lots / Care Some / Don't Care]
©2014 Lexalytics Inc. All rights reserved.

Agenda
• “Accuracy” is imprecise
  - Sentiment is personal
  - Precision/Recall/F1
  - Different applications require different balance
  - Precision and Recall are bounded by Inter-Rater Agreement
• How to tune
• How to crowdsource

“Accuracy” is imprecise.
• Because sentiment is personal (e.g. over what dataset is sentiment “accurate”?)
• Because you may care more about precision, or you may care more about recall

Sentiment Accuracy is Personal!
• “Wells Fargo lost $200M last month”
• “Kölnisch wasser smells like my grandmother.”
• “Taco Bell is like Russian Roulette for your ass, but it’s worth the risk.”
• “We’re switching to Direct TV.”
• “Microsoft is dropping their prices.”

Precision, Recall, F1
• Precision: “Of the items you coded, what % are correct?”
• Recall: “Of all the possible items that match the code, what % did you retrieve?”
• F1: the harmonic mean of precision and recall
  F1 = 2 * (precision * recall) / (precision + recall)
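A minimal sketch of the three definitions above, applied to one label at a time. The function name, the label strings, and the toy data are all made up for illustration; nothing here comes from a Lexalytics API.

```python
def precision_recall_f1(gold, predicted, target):
    """Return (precision, recall, f1) for a single label such as "neg"."""
    pairs = list(zip(gold, predicted))
    true_pos = sum(1 for g, p in pairs if p == target and g == target)
    coded = sum(1 for _, p in pairs if p == target)     # items the system coded as target
    relevant = sum(1 for g, _ in pairs if g == target)  # items that truly match the code
    precision = true_pos / coded if coded else 0.0
    recall = true_pos / relevant if relevant else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical toy data: human (gold) labels vs. system output.
gold = ["neg", "neg", "neg", "pos", "neu", "neg"]
predicted = ["neg", "neg", "pos", "pos", "neg", "neu"]

precision, recall, f1 = precision_recall_f1(gold, predicted, "neg")
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

On this toy set the system coded three items as negative and two of them were right (precision 2/3), while it found two of the four truly negative items (recall 1/2).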

Different apps require different balance
• High precision -> Social media trending
  - Want to know that what you’re graphing has absolutely no crap
• High recall -> Customer support requests
  - Really don’t want to miss even a single pissed-off customer, even at the cost of having to filter through lots of not-upset customers
[Figure labels: HIGH PRECISION / HIGH RECALL]
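One common way to move between the two ends is a cutoff on a confidence score. This sketch assumes the system emits a numeric "how upset is this customer" score, which the slides do not state; the scores, judgments, and thresholds are invented.

```python
def precision_recall_at(scores, is_upset, threshold):
    """Flag items scoring >= threshold, then measure precision and recall."""
    flagged = [s >= threshold for s in scores]
    true_pos = sum(1 for f, u in zip(flagged, is_upset) if f and u)
    n_flagged = sum(flagged)
    n_upset = sum(is_upset)
    # Convention used here: with nothing flagged (or nothing to find), report 1.0.
    precision = true_pos / n_flagged if n_flagged else 1.0
    recall = true_pos / n_upset if n_upset else 1.0
    return precision, recall


# Hypothetical system scores paired with human judgments.
scores = [0.95, 0.90, 0.70, 0.60, 0.40, 0.30, 0.10]
is_upset = [True, True, False, True, True, False, False]

strict = precision_recall_at(scores, is_upset, 0.80)  # favors precision
loose = precision_recall_at(scores, is_upset, 0.35)   # favors recall
print("strict:", strict)
print("loose: ", loose)
```

The strict cutoff flags only items it is sure about and misses half of the upset customers; the loose cutoff catches all of them but lets a not-upset customer through.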

Sentiment F1 scores (and “accuracy”) are bounded by Inter-Rater Agreement (IRA)
• MPQA Corpus
  - Wiebe et al., 2005, “Annotating expressions of opinions and emotions in language,” Language Resources and Evaluation, 39:165-210
  - Grad students, 40 hours of training, 16k sentences, ~80% IRA
• To ponder: If people max out at 80%, how can a machine be scored any better?
• Answer: It can’t. A machine will do a “poor” job of scoring content that people can’t agree on.
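A rough sketch of how you might measure inter-rater agreement on your own content, assuming two people have scored the same documents. Raw percent agreement is the figure the slide talks about. Cohen's kappa is not mentioned in the slides; it is included only as a common chance-corrected companion. The labels are invented.

```python
from collections import Counter


def percent_agreement(rater_a, rater_b):
    """Fraction of items on which the two raters gave the same label."""
    return sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)


def cohens_kappa(rater_a, rater_b):
    """Agreement corrected for what two raters would match on by chance."""
    n = len(rater_a)
    observed = percent_agreement(rater_a, rater_b)
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical labels from two people scoring the same ten documents.
rater_a = ["pos", "neg", "neu", "neg", "pos", "neu", "neg", "pos", "neu", "neg"]
rater_b = ["pos", "neg", "neg", "neg", "pos", "neu", "neu", "pos", "neu", "neg"]

agreement = percent_agreement(rater_a, rater_b)
kappa = cohens_kappa(rater_a, rater_b)
print(f"agreement={agreement:.2f} kappa={kappa:.2f}")
```

If your own raters only agree on 8 of 10 documents, a system scored against either one of them has little room to look better than that.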

So, you want to maximize your own accuracy?
• Get a clear goal on what you’re optimizing for
  - Precision/recall
  - What is “sentiment”: does an opinion have to be expressed, or not?
  - Bounds of neutral
• Score a set of content yourself
• Crowdsource
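To make “bounds of neutral” concrete, here is a sketch that assumes the system returns a numeric score in [-1, 1] and that you have hand-scored a small set yourself. The score range, the candidate bands, and all data are assumptions for illustration, not how any particular engine works.

```python
def to_label(score, neutral_low, neutral_high):
    """Map a numeric sentiment score to a label using the neutral band."""
    if score < neutral_low:
        return "neg"
    if score > neutral_high:
        return "pos"
    return "neu"


def accuracy(scores, gold, neutral_low, neutral_high):
    """Share of items whose banded label matches the hand-scored label."""
    hits = sum(
        to_label(s, neutral_low, neutral_high) == g for s, g in zip(scores, gold)
    )
    return hits / len(gold)


# Hypothetical system scores and your own hand-scored labels.
scores = [-0.8, -0.3, -0.1, 0.05, 0.2, 0.4, 0.9]
gold = ["neg", "neg", "neu", "neu", "neu", "pos", "pos"]

# Try a few candidate neutral bands and keep the one that fits best.
candidates = [(-0.5, 0.5), (-0.2, 0.3), (-0.05, 0.05)]
best = max(candidates, key=lambda band: accuracy(scores, gold, *band))
print("best neutral band:", best)
```

Bands picked this way fit the set they were tuned on, so it is worth holding back some hand-scored content to test the chosen bounds afterwards.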

Tuning

MTurk Jargon
• Worker: the individual scoring the doc
• Requester: you
• HIT: Human Intelligence Task (work unit)
• Quals: qualifications (which workers can work on your task?)

Crowdsourcing Flowchart

Worker Qualifications
• Control which workers get to work on your HITs
• Amazon has built-in qualifications (Categorization Masters)
  - ~20% more expensive
  - Opaque process
  - Workers don’t get anything more for them
• Manage your list tightly:
  - Build a small qual test, open up to a limited set of users
  - Manually add workers to qualification list
• Use a “# of accepted HITs > 5000” requirement (or some other number)
• Check against gold set
• Drop workers, don’t reject HITs
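The gold-set check can be sketched offline, without touching the MTurk API. This assumes you have exported results as (worker, item, label) rows and seeded some items whose answers you trust; the 80% cutoff, the minimum of three gold items, and every ID below are hypothetical.

```python
def workers_to_drop(answers, gold, min_accuracy=0.8, min_gold_items=3):
    """Return worker IDs whose accuracy on gold items is below min_accuracy.

    answers: iterable of (worker_id, item_id, label) tuples.
    gold: dict mapping item_id -> trusted label.
    """
    seen, correct = {}, {}
    for worker_id, item_id, label in answers:
        if item_id not in gold:
            continue  # only gold items count toward the check
        seen[worker_id] = seen.get(worker_id, 0) + 1
        correct[worker_id] = correct.get(worker_id, 0) + (label == gold[item_id])
    # Judge a worker only once they have answered enough gold items.
    return sorted(
        w for w, n in seen.items()
        if n >= min_gold_items and correct[w] / n < min_accuracy
    )


# Hypothetical gold set and exported answers.
gold = {"d1": "neg", "d2": "pos", "d3": "neu"}
answers = [
    ("w1", "d1", "neg"), ("w1", "d2", "pos"), ("w1", "d3", "neu"),
    ("w2", "d1", "pos"), ("w2", "d2", "pos"), ("w2", "d3", "pos"),
    ("w3", "d1", "pos"), ("w3", "d9", "neg"),  # too few gold items to judge
]

dropped = workers_to_drop(answers, gold)
print("remove from qualification list:", dropped)
```

In keeping with “drop workers, don’t reject HITs”, the output is a list of workers to take off your qualification list; the work they already submitted is left alone.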

Communication with MTurkers
• Boards
• Turkopticon

MTurk compensation
• It is unfortunate that “crowdsourcing” is sometimes a sophisticated term for trying to get the cheapest work possible.
• Sophisticated MTurkers (the ones you want doing your work) look for a lower bound of $6/hr.
• Don’t rely on “sweatshop” labor.
• You cannot rely on the little MTurk compensation app; the only way to fairly judge compensation is to do some of the work yourself.
• Lexalytics aims for $8-10/hr for our sentiment scoring work. Some projects go more, some less.
• If you are using a 3rd party service that you know is doing your crowdsourcing, please go to “” and look to see what rates *they* are paying the workers for your HITs.
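The hourly figures above translate into per-HIT rewards with simple arithmetic once you have timed yourself doing the task. The function names, the 5-cent reward, and the 30-second timing below are placeholders, not Lexalytics numbers.

```python
def effective_hourly_rate(reward_per_hit, seconds_per_hit):
    """Dollars per hour a worker earns at a given reward and pace."""
    return reward_per_hit * 3600 / seconds_per_hit


def reward_for_target(target_hourly, seconds_per_hit):
    """Reward per HIT needed to reach a target hourly rate."""
    return target_hourly * seconds_per_hit / 3600


# Hypothetical: you timed yourself at about 30 seconds per document.
seconds_per_hit = 30

rate = effective_hourly_rate(0.05, seconds_per_hit)  # paying 5 cents per HIT
reward = reward_for_target(9.00, seconds_per_hit)    # aiming for $9/hr
print(f"5 cents per HIT is about ${rate:.2f}/hr")
print(f"$9/hr needs about ${reward:.3f} per HIT")
```

At 30 seconds per document, 5 cents per HIT only reaches the $6/hr lower bound, and a $9/hr target needs 7.5 cents per HIT.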

Summary
• There’s opportunity in caring about accuracy, since not everyone does.
• Sentiment is personal, and precision/recall are bounded by Inter-Rater Agreement
• Understand your content: what’s positive/negative for you?
• Understand how you need to balance precision and recall
• Score some of your own content
• Write some instructions
• Gather a set of workers
• Set them loose (and pay them right!)
