Published on January 14, 2014
Tools for Text
Dr. Stuart Shulman
Prepared for the Digital Methods Initiative Winter School 2014
University of Amsterdam
Acknowledgements: Richard Rogers; The National Science Foundation; Mark J. Hoy
Plan of Attack: A few high-level thoughts; Five pillars of text analytics; Getting started on DiscoverText; A small collaborative project; The twittersifter.com beta release
“A funny thing happened…”: A brief history of DiscoverText
A Master Metaphor: Sifter
An Open Source Kernel
Three Primary Tasks in CAT
Classification of Text: A 2,500-year-old problem; Plato argued it would be frustrating; It still is…
Grimmer & Stewart, “Text as Data,” Political Analysis (2013): Volume is a problem for scholars; Coders are expensive; Groups struggle to accurately label text at scale; Validation of both humans and machines is “essential”; Some models are easier to validate than others; All models are wrong; Automated models enhance/amplify, but don’t replace humans; There is no one right way to do this; “Validate, validate, validate”; “What should be avoided then, is the blind use of any method without a validation step.”
(Patent Pending)
Three Important Books
One Particularly Important Idea
Five Pillars of Text Analytics: Search; Filter; Code; Cluster; Classify. You can execute all five using DiscoverText (DT)
Pillar #1: Search
Search for Negative Cases
Defined Search (Multi-term)
Pillar #2: Filters. Remember this filter
Another Common Filter
Pillar #3: Human Coding
Keystroke Coding is Fast
Coding Off a List is Faster
Data Cleaning is Fundamental
Pillar #4: Clustering
Latent Dirichlet Allocation (LDA) Topic Models
LDA on the Christie Data: Data is still processing…
Pillar #5: Machine Learning
Getting Started on DiscoverText
Use the Key in Your Email
Note the Peer Visibility Setting
Peers Make Collaboration Possible
Perhaps a Trending Topic
The Basics: Raw Data; Subsets of Data; Data Humans or Machines Classify
Grab Some Twitter Data
Create an Empty Archive
Log in to a Twitter Account
Enable via OAuth
Ready to Query Twitter
Use Operators to Refine Queries
Set the Frequency of Fetches
Data Will Start Flowing
Data List View
Best List Settings for Twitter Data
Use Buckets to Refine Lists: Search results go into buckets; “Defined search” is a multi-term filter; Metadata filters are also useful for buckets; Buckets focus the text analytic process
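The bucket idea can be sketched in a few lines of Python. This is only an illustration of the concept, not DiscoverText's implementation: the `Item` class, the `defined_search` and `metadata_filter` helpers, the `lang` metadata key, and the sample texts are all hypothetical, and matching is plain case-insensitive substring matching.

```python
from dataclasses import dataclass, field


@dataclass
class Item:
    """One unit of text plus whatever metadata came with it (hypothetical)."""
    text: str
    meta: dict = field(default_factory=dict)


def defined_search(items, terms, match_all=False):
    """Multi-term filter: keep items containing any (or all) of the terms."""
    lowered = [t.lower() for t in terms]
    combine = all if match_all else any
    return [it for it in items if combine(t in it.text.lower() for t in lowered)]


def metadata_filter(items, key, value):
    """Keep items whose metadata field `key` equals `value`."""
    return [it for it in items if it.meta.get(key) == value]


# Made-up archive for illustration only.
archive = [
    Item("Christie addresses the bridge closure", {"lang": "en"}),
    Item("Traffic on the bridge this morning", {"lang": "en"}),
    Item("Christie habla sobre el puente", {"lang": "es"}),
    Item("Lunch was great today", {"lang": "en"}),
]

# Search results go into a bucket; a metadata filter narrows it further.
bucket = defined_search(archive, ["christie", "bridge"])
english_bucket = metadata_filter(bucket, "lang", "en")
both_terms = defined_search(archive, ["christie", "bridge"], match_all=True)
```

Each step returns a smaller list, which is the sense in which buckets focus the later coding and classification work.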
Create a Dataset to Code: Any archive or bucket; Use the random sampling tool; Standard: All coders get all items; Triage: Coders get next uncoded item
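A minimal sketch of random sampling and the two assignment modes, assuming items are identified by integer IDs. The function names, the coder names, and the fixed seed are hypothetical choices for illustration; they are not taken from DiscoverText.

```python
import random


def draw_sample(item_ids, n, seed=0):
    """Draw n distinct items at random; the seed makes the draw repeatable."""
    rng = random.Random(seed)
    return rng.sample(list(item_ids), n)


def assign_standard(sample, coders):
    """Standard: every coder receives every item in the dataset."""
    return {coder: list(sample) for coder in coders}


def triage_next(sample, coded_ids):
    """Triage: return the next item nobody has coded yet, or None when done."""
    for item_id in sample:
        if item_id not in coded_ids:
            return item_id
    return None


sample = draw_sample(range(1000), 10, seed=42)
queues = assign_standard(sample, ["ana", "ben"])

coded = set()
first = triage_next(sample, coded)
coded.add(first)
second = triage_next(sample, coded)
```

Standard assignment produces overlapping codes that can be compared across coders, while triage spreads the items across coders so each one is coded once.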
Select from Three Coding Styles: Default: Mutually Exclusive Codes; Option 1: Non-Mutually Exclusive Codes; Option 2: User-Defined Codes (Grounded Theory)
Assign Peers to Code a Dataset: How many coders? How many items need to be coded? How many test or training sets? There are no cookbook answers
Look at Inter-Rater Reliability: Highly reliable coding (easy tasks); Unreliable coding (interesting tasks); If humans can’t, neither can machines; Some tasks are better suited for machines
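One common way to put a number on inter-rater reliability for two coders is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The slides do not say which statistic DiscoverText reports, so treat this as an assumed example; the labels and coder lists below are made up.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two coders who labeled the same items in the same order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both coders must label the same non-empty set of items")
    n = len(labels_a)
    # Share of items on which the two coders agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each coder's label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both coders used one identical label throughout
    return (observed - expected) / (1 - expected)


coder_a = ["rel", "rel", "not", "not", "rel", "not", "rel", "not"]
coder_b = ["rel", "rel", "not", "rel", "rel", "not", "not", "not"]
kappa = cohens_kappa(coder_a, coder_b)  # 0.75 observed vs 0.5 by chance
```

Here the coders agree on 6 of 8 items, but with two evenly used labels half of that agreement is expected by chance, so kappa comes out at 0.5.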
Adjudication: The Secret Sauce: Expert review or consensus process; Invalidate false positives; Identify strong and weak coders; Exclude false positives from training sets
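A sketch of what adjudication produces, assuming each coding decision has already been marked valid or invalid by an expert or a consensus process. The record layout, the coder names, and the labels are hypothetical.

```python
from collections import defaultdict

# Each record: (item_id, coder, label, adjudicated_valid). Made-up data.
decisions = [
    (1, "ana", "christie", True),
    (1, "ben", "christie", True),
    (2, "ana", "christie", False),  # false positive, invalidated
    (2, "ben", "other", True),
    (3, "ana", "other", True),
    (3, "ben", "christie", False),
    (4, "ben", "christie", False),
]


def coder_validity(decisions):
    """Share of each coder's decisions that survived adjudication."""
    totals, valid = defaultdict(int), defaultdict(int)
    for _, coder, _, ok in decisions:
        totals[coder] += 1
        valid[coder] += ok
    return {coder: valid[coder] / totals[coder] for coder in totals}


def training_set(decisions):
    """Keep only validated (item, label) pairs; invalidated codes are dropped."""
    return sorted({(item_id, label) for item_id, _, label, ok in decisions if ok})


rates = coder_validity(decisions)
train = training_set(decisions)
```

The validity rates point to stronger and weaker coders, and the training set contains no invalidated codes, so item 4 drops out entirely.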
Use Classification Scores as Filters: Iteration plays a critical role; Train, classify, filter; Repeat until the model is trusted; Each round weeds out false positives
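The filtering step of one round can be sketched as a split on classifier scores. The thresholds, the `christie` label, and the score values are hypothetical, and the scores are assumed to come from a classifier trained elsewhere.

```python
def split_by_score(scored_items, label, accept_at=0.9, review_at=0.5):
    """Sort items into accepted, needs-review, and rejected lists by score."""
    accepted, review, rejected = [], [], []
    for item_id, scores in scored_items:
        score = scores.get(label, 0.0)
        if score >= accept_at:
            accepted.append(item_id)
        elif score >= review_at:
            review.append(item_id)
        else:
            rejected.append(item_id)
    return accepted, review, rejected


# Made-up classifier output: (item_id, {label: score}).
scored = [
    ("t1", {"christie": 0.97}),
    ("t2", {"christie": 0.55}),
    ("t3", {"christie": 0.08}),
    ("t4", {"christie": 0.91}),
]

accepted, review, rejected = split_by_score(scored, "christie")
```

In an iterative workflow, the review band is where human coding effort pays off most: those items go back to coders, and the adjudicated results feed the next training round.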
Classifier Histograms: More Filtering
Track Your Progress
Running the Classifier
Filter by Classification
Filtered List: >95% Not Chris Christie
Thanks for Having Me! Dr. Stuart Shulman; @stuartwshulman; email@example.com; discovertext.com; twittersifter.com