Tools for Text
Dr. Stuart Shulman 

@stuartwshulman
stu@texifter.com


Prepared for the Digital Methods Initiative Winter School 2014

University of Amsterdam
!1
Acknowledgements
Richard Rogers

The National Science Foundation

Mark J. Hoy

!2
Plan of Attack
A few high level thoughts
Five pillars of text analytics
Getting started on DiscoverText
A small collaborative project
The twittersifter.com beta release
!3
“A funny thing happened…”
A brief history of DiscoverText

!4
A Master Metaphor: Sifter

!5
An Open Source Kernel

!6
Three Primary Tasks in CAT

!7
Classification of Text
A 2,500-year-old problem
Plato argued it would be frustrating
It still is…
!8
Grimmer & Stewart, “Text as Data,” Political Analysis (2013)
Volume is a problem for scholars
Coders are expensive
Groups struggle to accurately label text at scale
Validation of both humans and machines is “essential”
Some models are easier to validate than others
All models are wrong
Automated models enhance/amplify, but don’t replace humans
There is no one right way to do this
“Validate, validate, validate”

“What should be avoided then, is the blind use of any method without a validation step.”
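
One way to act on the validation advice above is to compare machine labels with human labels on a held-out sample. The sketch below is a minimal illustration under stated assumptions, not a description of how DiscoverText validates models; the function name and the six sample labels are hypothetical.

```python
def precision_recall(human, machine, positive):
    """Compare machine labels to human (gold) labels for one class."""
    if len(human) != len(machine):
        raise ValueError("label lists must be the same length")
    pairs = list(zip(human, machine))
    # True positives: both human and machine chose the positive class.
    tp = sum(1 for h, m in pairs if h == positive and m == positive)
    # False positives: machine chose the class, the human did not.
    fp = sum(1 for h, m in pairs if h != positive and m == positive)
    # False negatives: human chose the class, the machine missed it.
    fn = sum(1 for h, m in pairs if h == positive and m != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical gold labels and classifier output for six items.
human = ["relevant", "relevant", "other", "other", "relevant", "other"]
machine = ["relevant", "other", "other", "relevant", "relevant", "other"]
precision, recall = precision_recall(human, machine, "relevant")
```

Reporting both numbers matters because a model can look precise while missing most of the relevant items, or the reverse.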
!9
(Patent Pending)

!10
Three Important Books

!11
One Particularly Important Idea

!12
Five Pillars of Text Analytics
Search

Filter

Code

Cluster

Classify
You can execute all five using DT
!13
Pillar #1: Search

!14
Search for Negative Cases

!15
Defined Search (Multi-term)

!16
Pillar #2: Filters
Remember this filter

!17
Another Common Filter

!18
!19
Pillar #3: Human Coding

!20
Keystroke Coding is Fast

!21
Coding Off a List is Faster

!22
Data Cleaning is Fundamental

!23
Pillar #4: Clustering

!24
!25
Latent Dirichlet Allocation (LDA) Topic Models

!26
LDA on the Christie Data
Data is still processing…

!27
Pillar #5: Machine Learning

!28
Getting Started on DiscoverText

!29
Use the Key in Your Email

!30
Note the Peer Visibility Setting

!31
Peers Make Collaboration Possible

!32
!33
!34
!35
Perhaps a Trending Topic

!36
!37
The Basics

Raw Data
Subsets of Data
Data Humans or Machines Classify
!38
!39
Grab Some Twitter Data

!40
Create an Empty Archive

!41
Log in to a Twitter Account

!42
Enable via OAuth

!43
Ready to Query Twitter

!44
Use Operators to Refine Queries
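
As a rough illustration of refining a query with operators, the sketch below assembles a search string from three widely used Twitter search operators: `OR`, a leading `-` to exclude a term, and double quotes for an exact phrase. The helper name and the example terms are hypothetical, and which operators are honored depends on the Twitter endpoint in use, so treat this as an assumption to check rather than a specification.

```python
def build_query(any_terms=(), exclude=(), phrase=None):
    """Assemble a search string from a few common operators.

    Hypothetical helper for illustration only.
    """
    parts = []
    if phrase:
        # Double quotes request an exact phrase match.
        parts.append(f'"{phrase}"')
    if any_terms:
        # OR matches items containing any of the listed terms.
        parts.append(" OR ".join(any_terms))
    # A leading minus sign excludes items containing the term.
    parts.extend(f"-{term}" for term in exclude)
    return " ".join(parts)


broad_query = build_query(any_terms=["christie", "governor"], exclude=["auction"])
phrase_query = build_query(phrase="chris christie", exclude=["auction"])
```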

!45
Set the Frequency of Fetches

!46
Data Will Start Flowing

!47
Data List View

!48
Best List Settings for Twitter Data

!49
Use Buckets to Refine Lists
Search results go into buckets
“Defined search” is a multi-term filter
Metadata filters are also useful for buckets
Buckets focus the text analytic process
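
The idea of a defined search feeding a bucket can be sketched as a multi-term filter over a list of items. This is a minimal sketch, not DiscoverText's search engine: the function name, its parameters, and the three sample tweets are made up for illustration, and matching is plain case-insensitive substring matching.

```python
def defined_search(items, any_terms=(), none_terms=()):
    """Return items containing at least one of any_terms and none of none_terms.

    With no any_terms, nothing matches.
    """
    bucket = []
    for text in items:
        lowered = text.lower()
        has_wanted = any(term.lower() in lowered for term in any_terms)
        has_unwanted = any(term.lower() in lowered for term in none_terms)
        if has_wanted and not has_unwanted:
            bucket.append(text)
    return bucket


# Hypothetical items; a real archive would hold fetched tweets.
tweets = [
    "Governor Christie speaks about the bridge",
    "Christie's new album drops Friday",
    "Traffic on the bridge is terrible",
]
bucket = defined_search(tweets, any_terms=["christie"], none_terms=["album"])
```

The resulting bucket is a smaller, more focused list that can then be sampled, coded, or searched again.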
!50
!51
Create a Dataset to Code
Any archive or bucket
Use the random sampling tool
Standard: All coders get all items
Triage: Coders get next uncoded item
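
The two assignment modes above can be sketched as follows. This is an illustrative model only, assuming "Standard" means every coder receives the full sample and "Triage" means coders pull from one shared queue of uncoded items; the function and class names, coder names, and item IDs are hypothetical.

```python
import random
from collections import deque


def sample_dataset(items, size, seed=None):
    """Draw a simple random sample without replacement."""
    rng = random.Random(seed)
    return rng.sample(list(items), size)


def assign_standard(dataset, coders):
    """Standard: every coder gets every item."""
    return {coder: list(dataset) for coder in coders}


class TriageQueue:
    """Triage: each request hands out the next uncoded item."""

    def __init__(self, dataset):
        self._uncoded = deque(dataset)

    def next_item(self):
        # Returns None once every item has been handed out.
        return self._uncoded.popleft() if self._uncoded else None


archive = [f"item-{i}" for i in range(100)]
dataset = sample_dataset(archive, 10, seed=42)
standard = assign_standard(dataset, ["ann", "bob"])
queue = TriageQueue(dataset)
first = queue.next_item()
second = queue.next_item()
```

Standard assignment produces overlapping codes, which reliability measures need; triage spreads the work so a large set gets covered once.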
!52
!53
Select from Three Coding Styles
Default: Mutually Exclusive Codes
Option 1: Non-Mutually Exclusive Codes
Option 2: User-Defined Codes
(Grounded Theory)
!54
!55
Assign Peers to Code a Dataset
How many coders?
How many items need to be coded?
How many test or training sets?
There are no cookbook answers
!56
Look at Inter-Rater Reliability
Highly reliable coding (easy tasks)
Unreliable coding (interesting tasks)
If humans can’t, neither can machines
Some tasks are better suited to machines
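
Inter-rater reliability can be quantified in several ways. The slides do not say which statistic is used, so the sketch below assumes Cohen's kappa for two coders, which corrects raw agreement for the agreement expected by chance. The labels are hypothetical.

```python
from collections import Counter


def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa for two coders labeling the same items."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(coder_a)
    # Observed agreement: share of items where both coders match.
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Chance agreement from each coder's label frequencies.
    counts_a = Counter(coder_a)
    counts_b = Counter(coder_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        # Both coders used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical codes from two coders on eight items.
coder_a = ["pos", "pos", "neg", "neg", "pos", "neg", "pos", "neg"]
coder_b = ["pos", "neg", "neg", "neg", "pos", "neg", "pos", "pos"]
kappa = cohens_kappa(coder_a, coder_b)
```

Here the coders agree on 6 of 8 items (0.75), chance agreement is 0.5, and kappa is (0.75 - 0.5) / (1 - 0.5) = 0.5.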
!57
Adjudication: The Secret Sauce
Expert review or consensus process
Invalidate false positives
Identify strong and weak coders
Exclude false positives from training sets
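
The adjudication steps above can be sketched as a pass over reviewed coding decisions. This is a simplified model, not DiscoverText's adjudication feature: it assumes an adjudicator has marked each coder's choice valid or invalid and validated at most one code per item. All names and records are hypothetical.

```python
from collections import defaultdict


def summarize_adjudication(decisions):
    """decisions: iterable of (item_id, coder, code, is_valid) tuples.

    Returns a training set of validated codes and each coder's validity rate.
    """
    valid_counts = defaultdict(int)
    total_counts = defaultdict(int)
    training = {}
    for item_id, coder, code, is_valid in decisions:
        total_counts[coder] += 1
        if is_valid:
            valid_counts[coder] += 1
            # Only validated codes reach the training set.
            training[item_id] = code
    validity = {c: valid_counts[c] / total_counts[c] for c in total_counts}
    return training, validity


# Hypothetical adjudicated decisions for three items and two coders.
decisions = [
    (1, "ann", "christie", True),
    (1, "bob", "other", False),
    (2, "ann", "other", True),
    (2, "bob", "other", True),
    (3, "ann", "other", True),
    (3, "bob", "christie", False),
]
training, validity = summarize_adjudication(decisions)
```

The validity rates give a rough view of strong and weak coders, while the training set stays free of invalidated codes.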
!58
!59
!60
Use Classification Scores as Filters
Iteration plays a critical role
Train, classify, filter
Repeat until the model is trusted
Each round weeds out false positives
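
The filter step of that loop can be sketched as a threshold on classifier scores. The sketch assumes each item carries a per-label score between 0 and 1; the function name, the scores, and the 0.95 cutoff are illustrative assumptions, not values taken from DiscoverText.

```python
def filter_by_score(scored_items, label, minimum):
    """Keep the text of items whose score for `label` is at least `minimum`."""
    return [
        text
        for text, scores in scored_items
        if scores.get(label, 0.0) >= minimum
    ]


# Hypothetical classifier output: (text, {label: score}).
scored = [
    ("Gov. Christie holds press conference", {"christie": 0.97, "other": 0.03}),
    ("Christie's auction sets record", {"christie": 0.40, "other": 0.60}),
    ("Bridge lanes reopened", {"christie": 0.55, "other": 0.45}),
]
kept = filter_by_score(scored, "christie", 0.95)
```

Items that fall below the cutoff are natural candidates for the next round of human coding, which is where the iteration comes in.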
!61
Classifier Histograms: More Filtering

!62
Track Your Progress

!63
!64
!66
Running the Classifier

!67
!68
Filter by Classification

!69
Filtered List >95% Not Chris Christie

!70
http://beta.twittersifter.com
Thanks for Having Me!
Dr. Stuart Shulman

@stuartwshulman

stu@texifter.com

discovertext.com

twittersifter.com
!74
