DiscoverText: Tools for Text


Published on January 14, 2014

Author: StuShulman



A talk prepared for the Digital Methods Initiative 2014 Winter School, held at the University of Amsterdam.

Tools for Text
 Dr. Stuart Shulman
 Prepared for the Digital Methods Initiative Winter School 2014
 University of Amsterdam

Acknowledgements
 Richard Rogers
 The National Science Foundation
 Mark J. Hoy

Plan of Attack
 A few high-level thoughts
 Five pillars of text analytics
 Getting started on DiscoverText
 A small collaborative project
 The beta release

“A funny thing happened…”
 A brief history of DiscoverText

A Master Metaphor: Sifter

An Open Source Kernel

Three Primary Tasks in CAT

Classification of Text
 A 2,500-year-old problem
 Plato argued it would be frustrating
 It still is…

Grimmer & Stewart “Text as Data”
 Political Analysis (2013)
 Volume is a problem for scholars
 Coders are expensive
 Groups struggle to accurately label text at scale
 Validation of both humans and machines is “essential”
 Some models are easier to validate than others
 All models are wrong
 Automated models enhance/amplify, but don’t replace humans
 There is no one right way to do this
 “Validate, validate, validate”
 “What should be avoided then, is the blind use of any method without a validation step.”

(Patent Pending)

Three Important Books

One Particularly Important Idea

Five Pillars of Text Analytics
 Search
 Classify
 You can execute all five using DiscoverText (DT)

Pillar #1: Search

Search for Negative Cases

Defined Search (Multi-term)

Pillar #2: Filters
 Remember this filter

Another Common Filter


Pillar #3: Human Coding

Keystroke Coding is Fast

Coding Off a List is Faster

Data Cleaning is Fundamental

Pillar #4: Clustering


Latent Dirichlet Allocation (LDA) Topic Models

LDA on the Christie Data
 Data is still processing…

Pillar #5: Machine-Learning

Getting Started on DiscoverText

Use the Key in Your Email

Note the Peer Visibility Setting

Peers Make Collaboration Possible




Perhaps a Trending Topic


The Basics
 Raw Data
 Subsets of Data
 Data Humans or Machines Classify


Grab Some Twitter Data

Create an Empty Archive

Log in to a Twitter Account

Enable via OAuth

Ready to Query Twitter

Use Operators to Refine Queries

Set the Frequency of Fetches

Data Will Start Flowing

Data List View

Best List Settings for Twitter Data

Use Buckets to Refine Lists
 Search results go into buckets
 “Defined search” is a multi-term filter
 Metadata filters are also useful for buckets
 Buckets focus the text analytic process
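The idea of a multi-term "defined search" feeding a bucket can be sketched in a few lines of Python. This is only an illustration of the concept, not DiscoverText's implementation: the function name `defined_search`, the `match_all` switch, and the sample tweets are all hypothetical.

```python
from typing import Iterable, List


def defined_search(items: Iterable[str], terms: Iterable[str],
                   match_all: bool = False) -> List[str]:
    """Return the items that contain any (or, with match_all, every) term.

    Matching is a plain case-insensitive substring test; a real tool
    would likely support phrases, wildcards, and metadata as well.
    """
    lowered_terms = [term.lower() for term in terms]
    combine = all if match_all else any
    bucket = []
    for text in items:
        lowered = text.lower()
        if combine(term in lowered for term in lowered_terms):
            bucket.append(text)
    return bucket


# Made-up example archive (hypothetical tweets, for illustration only).
archive = [
    "Governor Christie speaks about the bridge",
    "Traffic on the bridge is terrible today",
    "Chris Christie press conference at noon",
    "Lunch plans, anyone?",
]

# Any-term search: a broad bucket.
bucket = defined_search(archive, ["christie", "bridge"])

# All-term search: a narrower bucket drawn from the same archive.
both = defined_search(archive, ["christie", "bridge"], match_all=True)
```

Running a narrower search over an existing bucket, rather than the whole archive, is how buckets progressively focus the analysis.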


Create a Dataset to Code
 Any archive or bucket
 Use the random sampling tool
 Standard: All coders get all items
 Triage: Coders get next uncoded item
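The difference between the two dataset modes can be sketched as follows. This is a minimal sketch under stated assumptions: `sample_dataset`, `standard_assignment`, and `TriageQueue` are hypothetical names, and the real tool's sampling and queueing behavior may differ.

```python
import random
from typing import Dict, List, Optional, Sequence


def sample_dataset(items: Sequence[str], size: int, seed: int = 0) -> List[str]:
    """Draw a simple random sample (without replacement) from an archive or bucket."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return rng.sample(list(items), min(size, len(items)))


def standard_assignment(dataset: Sequence[str],
                        coders: Sequence[str]) -> Dict[str, List[str]]:
    """Standard mode: every coder receives every item in the dataset."""
    return {coder: list(dataset) for coder in coders}


class TriageQueue:
    """Triage mode: each request hands out the next item nobody has coded yet."""

    def __init__(self, dataset: Sequence[str]) -> None:
        self._pending = list(dataset)

    def next_item(self) -> Optional[str]:
        # Each item is handed out once, to whichever coder asks first.
        return self._pending.pop(0) if self._pending else None


# Hypothetical archive of 100 placeholder items.
archive = [f"tweet {i}" for i in range(100)]
dataset = sample_dataset(archive, size=10, seed=42)

assignments = standard_assignment(dataset, ["coder_a", "coder_b"])

queue = TriageQueue(dataset)
first = queue.next_item()   # goes to whichever coder asks first
second = queue.next_item()  # the next coder gets a different item
```

Standard mode produces overlapping codes, which is what makes inter-rater comparison possible; triage mode trades that overlap for throughput.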


Select from Three Coding Styles
 Default: Mutually Exclusive Codes
 Option 1: Non-Mutually Exclusive Codes
 Option 2: User-Defined Codes (Grounded Theory)


Assign Peers to Code a Dataset
 How many coders?
 How many items need to be coded?
 How many test or training sets?
 There are no cookbook answers

Look at Inter-Rater Reliability
 Highly reliable coding (easy tasks)
 Unreliable coding (interesting tasks)
 If humans can’t, neither can machines
 Some tasks better suited for machines
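The slides do not say which reliability statistic is used. As one common choice for two coders, Cohen's kappa compares observed agreement with the agreement expected by chance. The sketch below assumes two coders who labeled the same items in the same order; the labels are made up.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(coder_a: Sequence[str], coder_b: Sequence[str]) -> float:
    """Cohen's kappa for two coders over the same items, in the same order."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("Both coders must label the same non-empty set of items")
    n = len(coder_a)

    # Observed agreement: share of items where the two codes match.
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n

    # Chance agreement: from each coder's marginal label frequencies.
    counts_a, counts_b = Counter(coder_a), Counter(coder_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        # Both coders used a single identical label for everything.
        return 1.0
    return (observed - expected) / (1 - expected)


# Made-up codes for four items.
coder_a = ["yes", "yes", "no", "no"]
coder_b = ["yes", "no", "no", "no"]
kappa = cohens_kappa(coder_a, coder_b)  # observed 0.75, chance 0.5 -> kappa 0.5
```

Values near 1 correspond to the "easy tasks" case; values near 0 mean the coders agree no more often than chance would predict.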

Adjudication: The Secret Sauce
 Expert review or consensus process
 Invalidate false positives
 Identify strong and weak coders
 Exclude false positives from training sets



Use Classification Scores as Filters
 Iteration plays a critical role
 Train, classify, filter
 Repeat until the model is trusted
 Each round weeds out false positives
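One round of train, classify, filter can be sketched with a toy word-count classifier. The slides do not describe the classifier DiscoverText uses, so the naive Bayes style scoring here is a stand-in, and the function names, training examples, and the 0.8 threshold are all hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple

Model = Dict[bool, Counter]


def train(labeled: List[Tuple[str, bool]]) -> Model:
    """Count word occurrences separately for relevant (True) and irrelevant (False) items."""
    counts: Model = {True: Counter(), False: Counter()}
    for text, is_relevant in labeled:
        counts[is_relevant].update(text.lower().split())
    return counts


def score(model: Model, text: str) -> float:
    """Score in (0, 1) that the text is relevant, assuming equal class priors."""
    vocab = set(model[True]) | set(model[False])
    total_pos = sum(model[True].values())
    total_neg = sum(model[False].values())
    log_odds = 0.0
    for word in text.lower().split():
        if word not in vocab:
            continue  # words never seen in training carry no evidence
        # Add-one smoothing keeps unseen class/word pairs from zeroing the score.
        p_pos = (model[True][word] + 1) / (total_pos + len(vocab))
        p_neg = (model[False][word] + 1) / (total_neg + len(vocab))
        log_odds += math.log(p_pos / p_neg)
    log_odds = max(-50.0, min(50.0, log_odds))  # avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-log_odds))


def filter_by_score(model: Model, items: List[str], threshold: float) -> List[str]:
    """Keep only the items the model scores at or above the threshold."""
    return [text for text in items if score(model, text) >= threshold]


# Made-up adjudicated codes: True = about Chris Christie, False = not.
labeled = [
    ("christie bridge scandal", True),
    ("governor christie speech", True),
    ("kristin chenoweth concert", False),
    ("chris brown concert tickets", False),
]
model = train(labeled)

uncoded = ["christie bridge closure", "concert tickets tonight"]
kept = filter_by_score(model, uncoded, threshold=0.8)
```

In a further round, coders would review the `kept` items, the adjudicated codes would be added to `labeled`, and the model would be retrained; repeating this is what gradually weeds out false positives.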

Classifier Histograms: More Filtering

Track Your Progress



Running the Classifier


Filter by Classification

Filtered List: >95% Not Chris Christie

Thanks for Having Me!
 Dr. Stuart Shulman
 @stuartwshulman
