SlideShare ist ein Scribd-Unternehmen logo
1 von 45
Downloaden Sie, um offline zu lesen
Recap Supervised Machine Learning Exercise Next meetings
Big Data and Automated Content Analysis
Week 8 – Wednesday
»Supervised Machine Learning«
Damian Trilling
d.c.trilling@uva.nl
@damian0604
www.damiantrilling.net
Afdeling Communicatiewetenschap
Universiteit van Amsterdam
18 May 2016
Big Data and Automated Content Analysis Damian Trilling
Recap Supervised Machine Learning Exercise Next meetings
Today
1 Recap: Types of Automated Content Analysis
2 Supervised Machine Learning
You have done it before!
Applications
An implementation
3 Exercise
4 Next meetings
Big Data and Automated Content Analysis Damian Trilling
Recap Supervised Machine Learning Exercise Next meetings
Recap: Types of Automated Content Analysis
Big Data and Automated Content Analysis Damian Trilling
Recap Supervised Machine Learning Exercise Next meetings
Top-down vs. bottom-up
From concrete to abstract, from manifest to latent
Our approaches until know
Big Data and Automated Content Analysis Damian Trilling
Recap Supervised Machine Learning Exercise Next meetings
Top-down vs. bottom-up
From concrete to abstract, from manifest to latent
Our approaches until know
• Top-down: counting predefined list of words or predefined
regular expressions
Big Data and Automated Content Analysis Damian Trilling
Recap Supervised Machine Learning Exercise Next meetings
Top-down vs. bottom-up
From concrete to abstract, from manifest to latent
Our approaches until know
• Top-down: counting predefined list of words or predefined
regular expressions
• Bottom-up: frequency counts, co-occurrence networks, LDA
Big Data and Automated Content Analysis Damian Trilling
Recap Supervised Machine Learning Exercise Next meetings
Top-down vs. bottom-up
From concrete to abstract, from manifest to latent
Our approaches until know
• Top-down: counting predefined list of words or predefined
regular expressions
• Bottom-up: frequency counts, co-occurrence networks, LDA
But there is some middle-ground: Predefined categories, but no
explicit rules.
Big Data and Automated Content Analysis Damian Trilling
Recap Supervised Machine Learning Exercise Next meetings
Top-down vs. bottom-up
Enter supervised machine learning
Methodological approach
deductive inductive
Typical research interests
and content features
Common statistical
procedures
visibility analysis
sentiment analysis
subjectivity analysis
Counting and
Dictionary
Supervised
Machine Learning
Unsupervised
Machine Learning
frames
topics
gender bias
frames
topics
string comparisons
counting
support vector machines
naive Bayes
principal component analysis
cluster analysis
latent dirichlet allocation
semantic network analysis
Boumans, J.W., & Trilling, D. (2016). Taking stock of the toolkit: An overview of relevant automated content
analysis approaches and techniques for digital journalism scholars. Digital Journalism, 4, 1. 8–23.
Big Data and Automated Content Analysis Damian Trilling
Recap Supervised Machine Learning Exercise Next meetings
Top-down vs. bottom-up
Recap: supervised vs. unsupervised
Unsupervised
• No manually coded data
• We want to identify patterns
or to make groups of most
similar cases
Big Data and Automated Content Analysis Damian Trilling
Recap Supervised Machine Learning Exercise Next meetings
Top-down vs. bottom-up
Recap: supervised vs. unsupervised
Unsupervised
• No manually coded data
• We want to identify patterns
or to make groups of most
similar cases
Example: We have a dataset of
Facebook-massages on an
organizations’ page. We use clustering
to group them and later interpret these
clusters (e.g., as complaints, questions,
praise, . . . )
Big Data and Automated Content Analysis Damian Trilling
Recap Supervised Machine Learning Exercise Next meetings
Top-down vs. bottom-up
Recap: supervised vs. unsupervised
Unsupervised
• No manually coded data
• We want to identify patterns
or to make groups of most
similar cases
Example: We have a dataset of
Facebook-massages on an
organizations’ page. We use clustering
to group them and later interpret these
clusters (e.g., as complaints, questions,
praise, . . . )
Supervised
• We code a small dataset by
hand and use it to “train” a
machine
• The machine codes the rest
Big Data and Automated Content Analysis Damian Trilling
Recap Supervised Machine Learning Exercise Next meetings
Top-down vs. bottom-up
Recap: supervised vs. unsupervised
Unsupervised
• No manually coded data
• We want to identify patterns
or to make groups of most
similar cases
Example: We have a dataset of
Facebook-massages on an
organizations’ page. We use clustering
to group them and later interpret these
clusters (e.g., as complaints, questions,
praise, . . . )
Supervised
• We code a small dataset by
hand and use it to “train” a
machine
• The machine codes the rest
Example: We have 2,000 of these
messages grouped into such categories
by human coders. We then use this
data to group all remaining messages
as well.
Big Data and Automated Content Analysis Damian Trilling
Recap Supervised Machine Learning Exercise Next meetings
Supervised Machine Learning
Big Data and Automated Content Analysis Damian Trilling
Recap Supervised Machine Learning Exercise Next meetings
You have done it before!
You have done it before!
Big Data and Automated Content Analysis Damian Trilling
Recap Supervised Machine Learning Exercise Next meetings
You have done it before!
You have done it before!
Regression
Big Data and Automated Content Analysis Damian Trilling
Recap Supervised Machine Learning Exercise Next meetings
You have done it before!
You have done it before!
Regression
1 Based on your data, you estimate some regression equation
yi = α + β1xi1 + · · · + βpxip + εi
Big Data and Automated Content Analysis Damian Trilling
Recap Supervised Machine Learning Exercise Next meetings
You have done it before!
You have done it before!
Regression
1 Based on your data, you estimate some regression equation
yi = α + β1xi1 + · · · + βpxip + εi
2 Even if you have some new unseen data, you can estimate
your expected outcome ˆy!
Big Data and Automated Content Analysis Damian Trilling
Recap Supervised Machine Learning Exercise Next meetings
You have done it before!
You have done it before!
Regression
1 Based on your data, you estimate some regression equation
yi = α + β1xi1 + · · · + βpxip + εi
2 Even if you have some new unseen data, you can estimate
your expected outcome ˆy!
3 Example: You estimated a regression equation where y is
newspaper reading in days/week:
y = −.8 + .4 × man + .08 × age
Big Data and Automated Content Analysis Damian Trilling
Recap Supervised Machine Learning Exercise Next meetings
You have done it before!
You have done it before!
Regression
1 Based on your data, you estimate some regression equation
yi = α + β1xi1 + · · · + βpxip + εi
2 Even if you have some new unseen data, you can estimate
your expected outcome ˆy!
3 Example: You estimated a regression equation where y is
newspaper reading in days/week:
y = −.8 + .4 × man + .08 × age
4 You could now calculate ˆy for a man of 20 years and a woman
of 40 years – even if no such person exists in your dataset:
ˆyman20 = −.8 + .4 × 1 + .08 × 20 = 1.2
ˆywoman40 = −.8 + .4 × 0 + .08 × 40 = 2.4
Big Data and Automated Content Analysis Damian Trilling
Recap Supervised Machine Learning Exercise Next meetings
You have done it before!
This is
Supervised Machine Learning!
Big Data and Automated Content Analysis Damian Trilling
Recap Supervised Machine Learning Exercise Next meetings
You have done it before!
. . . but. . .
• We will only use half (or another fraction) of our data to estimate the
model, so that we can use the other half to check if our
predictions match the manual coding (“labeled
data”,“annotated data” in SML-lingo)
Big Data and Automated Content Analysis Damian Trilling
Recap Supervised Machine Learning Exercise Next meetings
You have done it before!
. . . but. . .
• We will only use half (or another fraction) of our data to estimate the
model, so that we can use the other half to check if our
predictions match the manual coding (“labeled
data”,“annotated data” in SML-lingo)
• e.g., 2000 labeled cases, 1000 for training, 1000 for testing —
if successful, run on 100,000 unlabeled cases
Big Data and Automated Content Analysis Damian Trilling
Recap Supervised Machine Learning Exercise Next meetings
You have done it before!
. . . but. . .
• We will only use half (or another fraction) of our data to estimate the
model, so that we can use the other half to check if our
predictions match the manual coding (“labeled
data”,“annotated data” in SML-lingo)
• e.g., 2000 labeled cases, 1000 for training, 1000 for testing —
if successful, run on 100,000 unlabeled cases
• We use many more independent variables (“features”)
Big Data and Automated Content Analysis Damian Trilling
Recap Supervised Machine Learning Exercise Next meetings
You have done it before!
. . . but. . .
• We will only use half (or another fraction) of our data to estimate the
model, so that we can use the other half to check if our
predictions match the manual coding (“labeled
data”,“annotated data” in SML-lingo)
• e.g., 2000 labeled cases, 1000 for training, 1000 for testing —
if successful, run on 100,000 unlabeled cases
• We use many more independent variables (“features”)
• Typically, IVs are word frequencies (often weighted, e.g.
tf×idf) (⇒BOW-representation)
Big Data and Automated Content Analysis Damian Trilling
Recap Supervised Machine Learning Exercise Next meetings
Applications
Applications
Big Data and Automated Content Analysis Damian Trilling
Recap Supervised Machine Learning Exercise Next meetings
Applications
Applications
In other fields
A lot of different applications
• from recognizing hand-written characters to recommendation
systems
Big Data and Automated Content Analysis Damian Trilling
Recap Supervised Machine Learning Exercise Next meetings
Applications
Applications
In other fields
A lot of different applications
• from recognizing hand-written characters to recommendation
systems
In our field
It starts to get popular to measure latent variables
• frames
• topics
Big Data and Automated Content Analysis Damian Trilling
Recap Supervised Machine Learning Exercise Next meetings
Applications
SML to code frames and topics
Some work by Burscher and colleagues
• Humans can code generic frames (human-interest, economic,
. . . )
• Humans can code topics from a pre-defined list
Big Data and Automated Content Analysis Damian Trilling
Recap Supervised Machine Learning Exercise Next meetings
Applications
SML to code frames and topics
Some work by Burscher and colleagues
• Humans can code generic frames (human-interest, economic,
. . . )
• Humans can code topics from a pre-defined list
• But it is very hard to formulate an explicit rule
(as in: code as ’Human Interest’ if regular expression R is
matched)
Big Data and Automated Content Analysis Damian Trilling
Recap Supervised Machine Learning Exercise Next meetings
Applications
SML to code frames and topics
Some work by Burscher and colleagues
• Humans can code generic frames (human-interest, economic,
. . . )
• Humans can code topics from a pre-defined list
• But it is very hard to formulate an explicit rule
(as in: code as ’Human Interest’ if regular expression R is
matched)
⇒ This is where you need supervised machine learning!
Burscher, B., Odijk, D., Vliegenthart, R., De Rijke, M., & De Vreese, C. H. (2014). Teaching the computer to
code frames in news: Comparing two supervised machine learning approaches to frame analysis. Communication
Methods and Measures, 8(3), 190–206. doi:10.1080/19312458.2014.937527
Burscher, B., Vliegenthart, R., & De Vreese, C. H. (2015). Using supervised machine learning to code policy issues:
Can classifiers generalize across contexts? Annals of the American Academy of Political and Social Science, 659(1),
122–131.
Big Data and Automated Content Analysis Damian Trilling
http://commons.wikimedia.org/wiki/File:Precisionrecall.svg
Some measures of accuracy
• Recall
• Precision
• F1 = 2 · precision·recall
precision+recall
• AUC (Area under curve)
[0, 1], 0.5 = random
guessing
Recap Supervised Machine Learning Exercise Next meetings
Applications
What does this mean for our research?
Big Data and Automated Content Analysis Damian Trilling
Recap Supervised Machine Learning Exercise Next meetings
Applications
What does this mean for our research?
It we have 2,000 documents with manually coded frames and
topics. . .
• we can use them to train a SML classifier
• which can code an unlimited number of new documents
• with an acceptable accuracy
Some easier tasks even need only 500 training documents, see Hopkins, D. J., & King, G. (2010). A method of
automated nonparametric content analysis for social science. American Journal of Political Science, 54(1), 229–247.
Big Data and Automated Content Analysis Damian Trilling
Recap Supervised Machine Learning Exercise Next meetings
An implementation
An implementation
Let’s say we have a list of tuples with movie reviews and their
rating:
1 reviews=[("This is a great movie",1),("Bad movie",-1), ... ...]
And a second list with an identical structure:
1 test=[("Not that good",-1),("Nice film",1), ... ...]
Both are drawn from the same population, it is pure chance
whether a specific review is on the one list or the other.
Based on an example from http://blog.dataquest.io/blog/naive-bayes-movies/
Big Data and Automated Content Analysis Damian Trilling
Recap Supervised Machine Learning Exercise Next meetings
An implementation
Training a A Naïve Bayes Classifier
1 from sklearn.naive_bayes import MultinomialNB
2 from sklearn.feature_extraction.text import CountVectorizer
3 from sklearn import metrics
4
5 # This is just an efficient way of computing word counts
6 vectorizer = CountVectorizer(stop_words=’english’)
7 train_features = vectorizer.fit_transform([r[0] for r in reviews])
8 test_features = vectorizer.transform([r[0] for r in test])
9
10 # Fit a naive bayes model to the training data.
11 nb = MultinomialNB()
12 nb.fit(train_features, [r[1] for r in reviews])
13
14 # Now we can use the model to predict classifications for our test
features.
15 predictions = nb.predict(test_features)
16 actual=[r[1] for r in test]
17
18 # Compute the error.
19 fpr, tpr, thresholds = metrics.roc_curve(actual, predictions, pos_label
=1)
20 print("Multinomal naive bayes AUC: {0}".format(metrics.auc(fpr, tpr)))
Big Data and Automated Content Analysis Damian Trilling
Recap Supervised Machine Learning Exercise Next meetings
An implementation
And it works!
Using 50,000 IMDB movies that are classified as either negative or
positive,
• I created a list with 25,000 training tuples and another one
with 25,000 test tuples and
• trained a classifier
• that achieved an AUC of .82.
Dataset obtained from http://ai.stanford.edu/~amaas/data/sentiment, Maas, A.L., Daly, R.E., Pham, P.T.,
Huang, D., Ng, A.Y., & Potts, C. (2011). Learning word vectors for sentiment analysis. 49th Annual Meeting of
the Association for Computational Linguistics (ACL 2011)
Big Data and Automated Content Analysis Damian Trilling
Recap Supervised Machine Learning Exercise Next meetings
An implementation
Playing around with new data
1 newdata=vectorizer.transform(["What a crappy movie! It sucks!", "This is
awsome. I liked this movie a lot, fantastic actors","I would not
recomment it to anyone.", "Enjoyed it a lot"])
2 predictions = nb.predict(newdata)
3 print(predictions)
This returns, as you would expect and hope:
1 [-1 1 -1 1]
Big Data and Automated Content Analysis Damian Trilling
Recap Supervised Machine Learning Exercise Next meetings
An implementation
But we can do even better
We can use different vectorizers and different classifiers.
Big Data and Automated Content Analysis Damian Trilling
Recap Supervised Machine Learning Exercise Next meetings
An implementation
Different vectorizers
• CountVectorizer (=simple word counts)
• TfidfVectorizer (word counts (“term frequency”) weighted by
number of documents in which the word occurs at all
(“inverse document frequency”))
• additional options: stopwords, thresholds for minimum
frequencies etc.
Big Data and Automated Content Analysis Damian Trilling
Recap Supervised Machine Learning Exercise Next meetings
An implementation
Different classifiers
• Naïve Bayes
• Logistic Regression
• Support Vector Machine (SVM)
• . . .
Big Data and Automated Content Analysis Damian Trilling
Let’s look at the source code together and find out which setup
performs best.
Recap Supervised Machine Learning Exercise Next meetings
Next (last. . . ) meetings
Monday
pandas & matplotlib — doing statistics in Python
Wednesday
OpenLab
Questions regarding final project
Big Data and Automated Content Analysis Damian Trilling

Weitere ähnliche Inhalte

Was ist angesagt?

Using the search engine as recommendation engine
Using the search engine as recommendation engineUsing the search engine as recommendation engine
Using the search engine as recommendation engineLars Marius Garshol
 
Legal Analytics Course - Class 6 - Overfitting, Underfitting, & Cross-Validat...
Legal Analytics Course - Class 6 - Overfitting, Underfitting, & Cross-Validat...Legal Analytics Course - Class 6 - Overfitting, Underfitting, & Cross-Validat...
Legal Analytics Course - Class 6 - Overfitting, Underfitting, & Cross-Validat...Daniel Katz
 

Was ist angesagt? (20)

BDACA1617s2 - Lecture5
BDACA1617s2 - Lecture5BDACA1617s2 - Lecture5
BDACA1617s2 - Lecture5
 
BDACA - Lecture4
BDACA - Lecture4BDACA - Lecture4
BDACA - Lecture4
 
BDACA - Lecture3
BDACA - Lecture3BDACA - Lecture3
BDACA - Lecture3
 
BDACA1617s2 - Lecture4
BDACA1617s2 - Lecture4BDACA1617s2 - Lecture4
BDACA1617s2 - Lecture4
 
BDACA - Lecture7
BDACA - Lecture7BDACA - Lecture7
BDACA - Lecture7
 
BDACA - Lecture6
BDACA - Lecture6BDACA - Lecture6
BDACA - Lecture6
 
BDACA - Lecture8
BDACA - Lecture8BDACA - Lecture8
BDACA - Lecture8
 
BDACA1617s2 - Lecture3
BDACA1617s2 - Lecture3BDACA1617s2 - Lecture3
BDACA1617s2 - Lecture3
 
BDACA1516s2 - Lecture4
 BDACA1516s2 - Lecture4 BDACA1516s2 - Lecture4
BDACA1516s2 - Lecture4
 
BDACA - Lecture2
BDACA - Lecture2BDACA - Lecture2
BDACA - Lecture2
 
BDACA - Lecture5
BDACA - Lecture5BDACA - Lecture5
BDACA - Lecture5
 
BD-ACA week3a
BD-ACA week3aBD-ACA week3a
BD-ACA week3a
 
BDACA1617s2 - Lecture 2
BDACA1617s2 - Lecture 2BDACA1617s2 - Lecture 2
BDACA1617s2 - Lecture 2
 
BD-ACA week7a
BD-ACA week7aBD-ACA week7a
BD-ACA week7a
 
BDACA1617s2 - Tutorial 1
BDACA1617s2 - Tutorial 1BDACA1617s2 - Tutorial 1
BDACA1617s2 - Tutorial 1
 
BDACA - Tutorial5
BDACA - Tutorial5BDACA - Tutorial5
BDACA - Tutorial5
 
BD-ACA week5
BD-ACA week5BD-ACA week5
BD-ACA week5
 
Analyzing social media with Python and other tools (2/4)
Analyzing social media with Python and other tools (2/4) Analyzing social media with Python and other tools (2/4)
Analyzing social media with Python and other tools (2/4)
 
Using the search engine as recommendation engine
Using the search engine as recommendation engineUsing the search engine as recommendation engine
Using the search engine as recommendation engine
 
Legal Analytics Course - Class 6 - Overfitting, Underfitting, & Cross-Validat...
Legal Analytics Course - Class 6 - Overfitting, Underfitting, & Cross-Validat...Legal Analytics Course - Class 6 - Overfitting, Underfitting, & Cross-Validat...
Legal Analytics Course - Class 6 - Overfitting, Underfitting, & Cross-Validat...
 

Ähnlich wie BDACA1516s2 - Lecture8

Primer to Machine Learning
Primer to Machine LearningPrimer to Machine Learning
Primer to Machine LearningJeff Tanner
 
Machine learning-in-details-with-out-python-code
Machine learning-in-details-with-out-python-codeMachine learning-in-details-with-out-python-code
Machine learning-in-details-with-out-python-codeOsama Ghandour Geris
 
Machine learning – 101
Machine learning – 101Machine learning – 101
Machine learning – 101Behzad Altaf
 
Machine learning introduction
Machine learning introductionMachine learning introduction
Machine learning introductionAnas Jamil
 
Machine Learning Basics - By Animesh Sinha
Machine Learning Basics - By Animesh Sinha Machine Learning Basics - By Animesh Sinha
Machine Learning Basics - By Animesh Sinha Animesh Sinha
 
Semi-Supervised Insight Generation from Petabyte Scale Text Data
Semi-Supervised Insight Generation from Petabyte Scale Text DataSemi-Supervised Insight Generation from Petabyte Scale Text Data
Semi-Supervised Insight Generation from Petabyte Scale Text DataTech Triveni
 
End-to-End Machine Learning Project
End-to-End Machine Learning ProjectEnd-to-End Machine Learning Project
End-to-End Machine Learning ProjectEng Teong Cheah
 
San Francisco Hacker News - Machine Learning for Hackers
San Francisco Hacker News - Machine Learning for HackersSan Francisco Hacker News - Machine Learning for Hackers
San Francisco Hacker News - Machine Learning for HackersAdam Gibson
 
PyData SF 2016 --- Moving forward through the darkness
PyData SF 2016 --- Moving forward through the darknessPyData SF 2016 --- Moving forward through the darkness
PyData SF 2016 --- Moving forward through the darknessChia-Chi Chang
 
Introduction to Applied Machine Learning in R
Introduction to Applied Machine Learning in RIntroduction to Applied Machine Learning in R
Introduction to Applied Machine Learning in RBrian Spiering
 
Machine Learning Basics
Machine Learning BasicsMachine Learning Basics
Machine Learning BasicsSuresh Arora
 
Defcon 21-pinto-defending-networks-machine-learning by pseudor00t
Defcon 21-pinto-defending-networks-machine-learning by pseudor00tDefcon 21-pinto-defending-networks-machine-learning by pseudor00t
Defcon 21-pinto-defending-networks-machine-learning by pseudor00tpseudor00t overflow
 
Machine Learning lecture1(introduction)
Machine Learning lecture1(introduction)Machine Learning lecture1(introduction)
Machine Learning lecture1(introduction)cairo university
 
Machine Learning: Opening the Pandora's Box - Dhiana Deva @ QCon São Paulo 2019
Machine Learning: Opening the Pandora's Box - Dhiana Deva @ QCon São Paulo 2019Machine Learning: Opening the Pandora's Box - Dhiana Deva @ QCon São Paulo 2019
Machine Learning: Opening the Pandora's Box - Dhiana Deva @ QCon São Paulo 2019Dhiana Deva
 
Module 7: Unsupervised Learning
Module 7:  Unsupervised LearningModule 7:  Unsupervised Learning
Module 7: Unsupervised LearningSara Hooker
 
Machine Learning with Azure and Databricks Virtual Workshop
Machine Learning with Azure and Databricks Virtual WorkshopMachine Learning with Azure and Databricks Virtual Workshop
Machine Learning with Azure and Databricks Virtual WorkshopCCG
 
BIG DATA AND MACHINE LEARNING
BIG DATA AND MACHINE LEARNINGBIG DATA AND MACHINE LEARNING
BIG DATA AND MACHINE LEARNINGUmair Shafique
 

Ähnlich wie BDACA1516s2 - Lecture8 (20)

BD-ACA Week8a
BD-ACA Week8aBD-ACA Week8a
BD-ACA Week8a
 
Data Science.pptx
Data Science.pptxData Science.pptx
Data Science.pptx
 
Primer to Machine Learning
Primer to Machine LearningPrimer to Machine Learning
Primer to Machine Learning
 
Machine learning-in-details-with-out-python-code
Machine learning-in-details-with-out-python-codeMachine learning-in-details-with-out-python-code
Machine learning-in-details-with-out-python-code
 
Machine learning – 101
Machine learning – 101Machine learning – 101
Machine learning – 101
 
Machine learning introduction
Machine learning introductionMachine learning introduction
Machine learning introduction
 
Machine Learning Basics - By Animesh Sinha
Machine Learning Basics - By Animesh Sinha Machine Learning Basics - By Animesh Sinha
Machine Learning Basics - By Animesh Sinha
 
Semi-Supervised Insight Generation from Petabyte Scale Text Data
Semi-Supervised Insight Generation from Petabyte Scale Text DataSemi-Supervised Insight Generation from Petabyte Scale Text Data
Semi-Supervised Insight Generation from Petabyte Scale Text Data
 
End-to-End Machine Learning Project
End-to-End Machine Learning ProjectEnd-to-End Machine Learning Project
End-to-End Machine Learning Project
 
BD-ACA week1b
BD-ACA week1bBD-ACA week1b
BD-ACA week1b
 
San Francisco Hacker News - Machine Learning for Hackers
San Francisco Hacker News - Machine Learning for HackersSan Francisco Hacker News - Machine Learning for Hackers
San Francisco Hacker News - Machine Learning for Hackers
 
PyData SF 2016 --- Moving forward through the darkness
PyData SF 2016 --- Moving forward through the darknessPyData SF 2016 --- Moving forward through the darkness
PyData SF 2016 --- Moving forward through the darkness
 
Introduction to Applied Machine Learning in R
Introduction to Applied Machine Learning in RIntroduction to Applied Machine Learning in R
Introduction to Applied Machine Learning in R
 
Machine Learning Basics
Machine Learning BasicsMachine Learning Basics
Machine Learning Basics
 
Defcon 21-pinto-defending-networks-machine-learning by pseudor00t
Defcon 21-pinto-defending-networks-machine-learning by pseudor00tDefcon 21-pinto-defending-networks-machine-learning by pseudor00t
Defcon 21-pinto-defending-networks-machine-learning by pseudor00t
 
Machine Learning lecture1(introduction)
Machine Learning lecture1(introduction)Machine Learning lecture1(introduction)
Machine Learning lecture1(introduction)
 
Machine Learning: Opening the Pandora's Box - Dhiana Deva @ QCon São Paulo 2019
Machine Learning: Opening the Pandora's Box - Dhiana Deva @ QCon São Paulo 2019Machine Learning: Opening the Pandora's Box - Dhiana Deva @ QCon São Paulo 2019
Machine Learning: Opening the Pandora's Box - Dhiana Deva @ QCon São Paulo 2019
 
Module 7: Unsupervised Learning
Module 7:  Unsupervised LearningModule 7:  Unsupervised Learning
Module 7: Unsupervised Learning
 
Machine Learning with Azure and Databricks Virtual Workshop
Machine Learning with Azure and Databricks Virtual WorkshopMachine Learning with Azure and Databricks Virtual Workshop
Machine Learning with Azure and Databricks Virtual Workshop
 
BIG DATA AND MACHINE LEARNING
BIG DATA AND MACHINE LEARNINGBIG DATA AND MACHINE LEARNING
BIG DATA AND MACHINE LEARNING
 

Mehr von Department of Communication Science, University of Amsterdam (9)

BDACA - Tutorial1
BDACA - Tutorial1BDACA - Tutorial1
BDACA - Tutorial1
 
BDACA - Lecture1
BDACA - Lecture1BDACA - Lecture1
BDACA - Lecture1
 
BDACA1617s2 - Lecture 1
BDACA1617s2 - Lecture 1BDACA1617s2 - Lecture 1
BDACA1617s2 - Lecture 1
 
Media diets in an age of apps and social media: Dealing with a third layer of...
Media diets in an age of apps and social media: Dealing with a third layer of...Media diets in an age of apps and social media: Dealing with a third layer of...
Media diets in an age of apps and social media: Dealing with a third layer of...
 
Conceptualizing and measuring news exposure as network of users and news items
Conceptualizing and measuring news exposure as network of users and news itemsConceptualizing and measuring news exposure as network of users and news items
Conceptualizing and measuring news exposure as network of users and news items
 
Data Science: Case "Political Communication 2/2"
Data Science: Case "Political Communication 2/2"Data Science: Case "Political Communication 2/2"
Data Science: Case "Political Communication 2/2"
 
Data Science: Case "Political Communication 1/2"
Data Science: Case "Political Communication 1/2"Data Science: Case "Political Communication 1/2"
Data Science: Case "Political Communication 1/2"
 
BDACA1516s2 - Lecture1
BDACA1516s2 - Lecture1BDACA1516s2 - Lecture1
BDACA1516s2 - Lecture1
 
Should we worry about filter bubbles?
Should we worry about filter bubbles?Should we worry about filter bubbles?
Should we worry about filter bubbles?
 

Kürzlich hochgeladen

How to Fix XML SyntaxError in Odoo the 17
How to Fix XML SyntaxError in Odoo the 17How to Fix XML SyntaxError in Odoo the 17
How to Fix XML SyntaxError in Odoo the 17Celine George
 
Measures of Position DECILES for ungrouped data
Measures of Position DECILES for ungrouped dataMeasures of Position DECILES for ungrouped data
Measures of Position DECILES for ungrouped dataBabyAnnMotar
 
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptxINTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptxHumphrey A Beña
 
Textual Evidence in Reading and Writing of SHS
Textual Evidence in Reading and Writing of SHSTextual Evidence in Reading and Writing of SHS
Textual Evidence in Reading and Writing of SHSMae Pangan
 
Blowin' in the Wind of Caste_ Bob Dylan's Song as a Catalyst for Social Justi...
Blowin' in the Wind of Caste_ Bob Dylan's Song as a Catalyst for Social Justi...Blowin' in the Wind of Caste_ Bob Dylan's Song as a Catalyst for Social Justi...
Blowin' in the Wind of Caste_ Bob Dylan's Song as a Catalyst for Social Justi...DhatriParmar
 
Beauty Amidst the Bytes_ Unearthing Unexpected Advantages of the Digital Wast...
Beauty Amidst the Bytes_ Unearthing Unexpected Advantages of the Digital Wast...Beauty Amidst the Bytes_ Unearthing Unexpected Advantages of the Digital Wast...
Beauty Amidst the Bytes_ Unearthing Unexpected Advantages of the Digital Wast...DhatriParmar
 
Unraveling Hypertext_ Analyzing Postmodern Elements in Literature.pptx
Unraveling Hypertext_ Analyzing  Postmodern Elements in  Literature.pptxUnraveling Hypertext_ Analyzing  Postmodern Elements in  Literature.pptx
Unraveling Hypertext_ Analyzing Postmodern Elements in Literature.pptxDhatriParmar
 
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdfGrade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdfJemuel Francisco
 
Grade Three -ELLNA-REVIEWER-ENGLISH.pptx
Grade Three -ELLNA-REVIEWER-ENGLISH.pptxGrade Three -ELLNA-REVIEWER-ENGLISH.pptx
Grade Three -ELLNA-REVIEWER-ENGLISH.pptxkarenfajardo43
 
Multi Domain Alias In the Odoo 17 ERP Module
Multi Domain Alias In the Odoo 17 ERP ModuleMulti Domain Alias In the Odoo 17 ERP Module
Multi Domain Alias In the Odoo 17 ERP ModuleCeline George
 
Congestive Cardiac Failure..presentation
Congestive Cardiac Failure..presentationCongestive Cardiac Failure..presentation
Congestive Cardiac Failure..presentationdeepaannamalai16
 
Team Lead Succeed – Helping you and your team achieve high-performance teamwo...
Team Lead Succeed – Helping you and your team achieve high-performance teamwo...Team Lead Succeed – Helping you and your team achieve high-performance teamwo...
Team Lead Succeed – Helping you and your team achieve high-performance teamwo...Association for Project Management
 
4.16.24 21st Century Movements for Black Lives.pptx
4.16.24 21st Century Movements for Black Lives.pptx4.16.24 21st Century Movements for Black Lives.pptx
4.16.24 21st Century Movements for Black Lives.pptxmary850239
 
Oppenheimer Film Discussion for Philosophy and Film
Oppenheimer Film Discussion for Philosophy and FilmOppenheimer Film Discussion for Philosophy and Film
Oppenheimer Film Discussion for Philosophy and FilmStan Meyer
 
Student Profile Sample - We help schools to connect the data they have, with ...
Student Profile Sample - We help schools to connect the data they have, with ...Student Profile Sample - We help schools to connect the data they have, with ...
Student Profile Sample - We help schools to connect the data they have, with ...Seán Kennedy
 
Using Grammatical Signals Suitable to Patterns of Idea Development
Using Grammatical Signals Suitable to Patterns of Idea DevelopmentUsing Grammatical Signals Suitable to Patterns of Idea Development
Using Grammatical Signals Suitable to Patterns of Idea Developmentchesterberbo7
 

Kürzlich hochgeladen (20)

How to Fix XML SyntaxError in Odoo the 17
How to Fix XML SyntaxError in Odoo the 17How to Fix XML SyntaxError in Odoo the 17
How to Fix XML SyntaxError in Odoo the 17
 
Measures of Position DECILES for ungrouped data
Measures of Position DECILES for ungrouped dataMeasures of Position DECILES for ungrouped data
Measures of Position DECILES for ungrouped data
 
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptxINTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
 
Textual Evidence in Reading and Writing of SHS
Textual Evidence in Reading and Writing of SHSTextual Evidence in Reading and Writing of SHS
Textual Evidence in Reading and Writing of SHS
 
Blowin' in the Wind of Caste_ Bob Dylan's Song as a Catalyst for Social Justi...
Blowin' in the Wind of Caste_ Bob Dylan's Song as a Catalyst for Social Justi...Blowin' in the Wind of Caste_ Bob Dylan's Song as a Catalyst for Social Justi...
Blowin' in the Wind of Caste_ Bob Dylan's Song as a Catalyst for Social Justi...
 
Beauty Amidst the Bytes_ Unearthing Unexpected Advantages of the Digital Wast...
Beauty Amidst the Bytes_ Unearthing Unexpected Advantages of the Digital Wast...Beauty Amidst the Bytes_ Unearthing Unexpected Advantages of the Digital Wast...
Beauty Amidst the Bytes_ Unearthing Unexpected Advantages of the Digital Wast...
 
Unraveling Hypertext_ Analyzing Postmodern Elements in Literature.pptx
Unraveling Hypertext_ Analyzing  Postmodern Elements in  Literature.pptxUnraveling Hypertext_ Analyzing  Postmodern Elements in  Literature.pptx
Unraveling Hypertext_ Analyzing Postmodern Elements in Literature.pptx
 
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdfGrade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
 
Grade Three -ELLNA-REVIEWER-ENGLISH.pptx
Grade Three -ELLNA-REVIEWER-ENGLISH.pptxGrade Three -ELLNA-REVIEWER-ENGLISH.pptx
Grade Three -ELLNA-REVIEWER-ENGLISH.pptx
 
Multi Domain Alias In the Odoo 17 ERP Module
Multi Domain Alias In the Odoo 17 ERP ModuleMulti Domain Alias In the Odoo 17 ERP Module
Multi Domain Alias In the Odoo 17 ERP Module
 
Faculty Profile prashantha K EEE dept Sri Sairam college of Engineering
Faculty Profile prashantha K EEE dept Sri Sairam college of EngineeringFaculty Profile prashantha K EEE dept Sri Sairam college of Engineering
Faculty Profile prashantha K EEE dept Sri Sairam college of Engineering
 
Congestive Cardiac Failure..presentation
Congestive Cardiac Failure..presentationCongestive Cardiac Failure..presentation
Congestive Cardiac Failure..presentation
 
Team Lead Succeed – Helping you and your team achieve high-performance teamwo...
Team Lead Succeed – Helping you and your team achieve high-performance teamwo...Team Lead Succeed – Helping you and your team achieve high-performance teamwo...
Team Lead Succeed – Helping you and your team achieve high-performance teamwo...
 
4.16.24 21st Century Movements for Black Lives.pptx
4.16.24 21st Century Movements for Black Lives.pptx4.16.24 21st Century Movements for Black Lives.pptx
4.16.24 21st Century Movements for Black Lives.pptx
 
Oppenheimer Film Discussion for Philosophy and Film
Oppenheimer Film Discussion for Philosophy and FilmOppenheimer Film Discussion for Philosophy and Film
Oppenheimer Film Discussion for Philosophy and Film
 
prashanth updated resume 2024 for Teaching Profession
prashanth updated resume 2024 for Teaching Professionprashanth updated resume 2024 for Teaching Profession
prashanth updated resume 2024 for Teaching Profession
 
Student Profile Sample - We help schools to connect the data they have, with ...
Student Profile Sample - We help schools to connect the data they have, with ...Student Profile Sample - We help schools to connect the data they have, with ...
Student Profile Sample - We help schools to connect the data they have, with ...
 
Paradigm shift in nursing research by RS MEHTA
Paradigm shift in nursing research by RS MEHTAParadigm shift in nursing research by RS MEHTA
Paradigm shift in nursing research by RS MEHTA
 
INCLUSIVE EDUCATION PRACTICES FOR TEACHERS AND TRAINERS.pptx
INCLUSIVE EDUCATION PRACTICES FOR TEACHERS AND TRAINERS.pptxINCLUSIVE EDUCATION PRACTICES FOR TEACHERS AND TRAINERS.pptx
INCLUSIVE EDUCATION PRACTICES FOR TEACHERS AND TRAINERS.pptx
 
Using Grammatical Signals Suitable to Patterns of Idea Development
Using Grammatical Signals Suitable to Patterns of Idea DevelopmentUsing Grammatical Signals Suitable to Patterns of Idea Development
Using Grammatical Signals Suitable to Patterns of Idea Development
 

BDACA1516s2 - Lecture8

  • 1. Recap Supervised Machine Learning Exercise Next meetings Big Data and Automated Content Analysis Week 8 – Wednesday »Supervised Machine Learning« Damian Trilling d.c.trilling@uva.nl @damian0604 www.damiantrilling.net Afdeling Communicatiewetenschap Universiteit van Amsterdam 18 May 2016 Big Data and Automated Content Analysis Damian Trilling
  • 2. Recap Supervised Machine Learning Exercise Next meetings Today 1 Recap: Types of Automated Content Analysis 2 Supervised Machine Learning You have done it before! Applications An implementation 3 Exercise 4 Next meetings Big Data and Automated Content Analysis Damian Trilling
  • 3. Recap Supervised Machine Learning Exercise Next meetings Recap: Types of Automated Content Analysis Big Data and Automated Content Analysis Damian Trilling
  • 4. Recap Supervised Machine Learning Exercise Next meetings Top-down vs. bottom-up From concrete to abstract, from manifest to latent Our approaches until know Big Data and Automated Content Analysis Damian Trilling
  • 5. Recap Supervised Machine Learning Exercise Next meetings Top-down vs. bottom-up From concrete to abstract, from manifest to latent Our approaches until know • Top-down: counting predefined list of words or predefined regular expressions Big Data and Automated Content Analysis Damian Trilling
  • 6. Recap Supervised Machine Learning Exercise Next meetings Top-down vs. bottom-up From concrete to abstract, from manifest to latent Our approaches until know • Top-down: counting predefined list of words or predefined regular expressions • Bottom-up: frequency counts, co-occurrence networks, LDA Big Data and Automated Content Analysis Damian Trilling
  • 7. Recap Supervised Machine Learning Exercise Next meetings Top-down vs. bottom-up From concrete to abstract, from manifest to latent Our approaches until know • Top-down: counting predefined list of words or predefined regular expressions • Bottom-up: frequency counts, co-occurrence networks, LDA But there is some middle-ground: Predefined categories, but no explicit rules. Big Data and Automated Content Analysis Damian Trilling
  • 8. Recap Supervised Machine Learning Exercise Next meetings Top-down vs. bottom-up Enter supervised machine learning Methodological approach deductive inductive Typical research interests and content features Common statistical procedures visibility analysis sentiment analysis subjectivity analysis Counting and Dictionary Supervised Machine Learning Unsupervised Machine Learning frames topics gender bias frames topics string comparisons counting support vector machines naive Bayes principal component analysis cluster analysis latent dirichlet allocation semantic network analysis Boumans, J.W., & Trilling, D. (2016). Taking stock of the toolkit: An overview of relevant automated content analysis approaches and techniques for digital journalism scholars. Digital Journalism, 4, 1. 8–23. Big Data and Automated Content Analysis Damian Trilling
  • 9. Recap Supervised Machine Learning Exercise Next meetings Top-down vs. bottom-up Recap: supervised vs. unsupervised Unsupervised • No manually coded data • We want to identify patterns or to make groups of most similar cases Big Data and Automated Content Analysis Damian Trilling
  • 10. Recap Supervised Machine Learning Exercise Next meetings Top-down vs. bottom-up Recap: supervised vs. unsupervised Unsupervised • No manually coded data • We want to identify patterns or to make groups of most similar cases Example: We have a dataset of Facebook-massages on an organizations’ page. We use clustering to group them and later interpret these clusters (e.g., as complaints, questions, praise, . . . ) Big Data and Automated Content Analysis Damian Trilling
  • 11. Recap Supervised Machine Learning Exercise Next meetings Top-down vs. bottom-up Recap: supervised vs. unsupervised Unsupervised • No manually coded data • We want to identify patterns or to make groups of most similar cases Example: We have a dataset of Facebook-massages on an organizations’ page. We use clustering to group them and later interpret these clusters (e.g., as complaints, questions, praise, . . . ) Supervised • We code a small dataset by hand and use it to “train” a machine • The machine codes the rest Big Data and Automated Content Analysis Damian Trilling
  • 12. Recap Supervised Machine Learning Exercise Next meetings Top-down vs. bottom-up Recap: supervised vs. unsupervised Unsupervised • No manually coded data • We want to identify patterns or to make groups of most similar cases Example: We have a dataset of Facebook-massages on an organizations’ page. We use clustering to group them and later interpret these clusters (e.g., as complaints, questions, praise, . . . ) Supervised • We code a small dataset by hand and use it to “train” a machine • The machine codes the rest Example: We have 2,000 of these messages grouped into such categories by human coders. We then use this data to group all remaining messages as well. Big Data and Automated Content Analysis Damian Trilling
  • 13. Recap Supervised Machine Learning Exercise Next meetings Supervised Machine Learning Big Data and Automated Content Analysis Damian Trilling
  • 14. Recap Supervised Machine Learning Exercise Next meetings You have done it before! You have done it before! Big Data and Automated Content Analysis Damian Trilling
  • 15. Recap Supervised Machine Learning Exercise Next meetings You have done it before! You have done it before! Regression Big Data and Automated Content Analysis Damian Trilling
  • 16. Recap Supervised Machine Learning Exercise Next meetings You have done it before! You have done it before! Regression 1 Based on your data, you estimate some regression equation yi = α + β1xi1 + · · · + βpxip + εi Big Data and Automated Content Analysis Damian Trilling
  • 17. Recap Supervised Machine Learning Exercise Next meetings You have done it before! You have done it before! Regression 1 Based on your data, you estimate some regression equation yi = α + β1xi1 + · · · + βpxip + εi 2 Even if you have some new unseen data, you can estimate your expected outcome ˆy! Big Data and Automated Content Analysis Damian Trilling
  • 18. Recap Supervised Machine Learning Exercise Next meetings You have done it before! You have done it before! Regression 1 Based on your data, you estimate some regression equation yi = α + β1xi1 + · · · + βpxip + εi 2 Even if you have some new unseen data, you can estimate your expected outcome ˆy! 3 Example: You estimated a regression equation where y is newspaper reading in days/week: y = −.8 + .4 × man + .08 × age Big Data and Automated Content Analysis Damian Trilling
  • 19. Recap Supervised Machine Learning Exercise Next meetings You have done it before! You have done it before! Regression 1 Based on your data, you estimate some regression equation yi = α + β1xi1 + · · · + βpxip + εi 2 Even if you have some new unseen data, you can estimate your expected outcome ˆy! 3 Example: You estimated a regression equation where y is newspaper reading in days/week: y = −.8 + .4 × man + .08 × age 4 You could now calculate ˆy for a man of 20 years and a woman of 40 years – even if no such person exists in your dataset: ˆyman20 = −.8 + .4 × 1 + .08 × 20 = 1.2 ˆywoman40 = −.8 + .4 × 0 + .08 × 40 = 2.4 Big Data and Automated Content Analysis Damian Trilling
  • 20. Recap Supervised Machine Learning Exercise Next meetings You have done it before! This is Supervised Machine Learning! Big Data and Automated Content Analysis Damian Trilling
  • 21. Recap Supervised Machine Learning Exercise Next meetings You have done it before! . . . but. . . • We will only use half (or another fraction) of our data to estimate the model, so that we can use the other half to check if our predictions match the manual coding (“labeled data”,“annotated data” in SML-lingo) Big Data and Automated Content Analysis Damian Trilling
  • 22. Recap Supervised Machine Learning Exercise Next meetings You have done it before! . . . but. . . • We will only use half (or another fraction) of our data to estimate the model, so that we can use the other half to check if our predictions match the manual coding (“labeled data”,“annotated data” in SML-lingo) • e.g., 2000 labeled cases, 1000 for training, 1000 for testing — if successful, run on 100,000 unlabeled cases Big Data and Automated Content Analysis Damian Trilling
  • 23. Recap Supervised Machine Learning Exercise Next meetings You have done it before! . . . but. . . • We will only use half (or another fraction) of our data to estimate the model, so that we can use the other half to check if our predictions match the manual coding (“labeled data”,“annotated data” in SML-lingo) • e.g., 2000 labeled cases, 1000 for training, 1000 for testing — if successful, run on 100,000 unlabeled cases • We use many more independent variables (“features”) Big Data and Automated Content Analysis Damian Trilling
  • 24. Recap Supervised Machine Learning Exercise Next meetings You have done it before! . . . but. . . • We will only use half (or another fraction) of our data to estimate the model, so that we can use the other half to check if our predictions match the manual coding (“labeled data”,“annotated data” in SML-lingo) • e.g., 2000 labeled cases, 1000 for training, 1000 for testing — if successful, run on 100,000 unlabeled cases • We use many more independent variables (“features”) • Typically, IVs are word frequencies (often weighted, e.g. tf×idf) (⇒BOW-representation) Big Data and Automated Content Analysis Damian Trilling
  • 25. Recap Supervised Machine Learning Exercise Next meetings Applications Applications Big Data and Automated Content Analysis Damian Trilling
  • 26. Recap Supervised Machine Learning Exercise Next meetings Applications Applications In other fields A lot of different applications • from recognizing hand-written characters to recommendation systems Big Data and Automated Content Analysis Damian Trilling
  • 27. Recap Supervised Machine Learning Exercise Next meetings Applications Applications In other fields A lot of different applications • from recognizing hand-written characters to recommendation systems In our field It starts to get popular to measure latent variables • frames • topics Big Data and Automated Content Analysis Damian Trilling
  • 28. Recap Supervised Machine Learning Exercise Next meetings Applications SML to code frames and topics Some work by Burscher and colleagues • Humans can code generic frames (human-interest, economic, . . . ) • Humans can code topics from a pre-defined list Big Data and Automated Content Analysis Damian Trilling
  • 29. Recap Supervised Machine Learning Exercise Next meetings Applications SML to code frames and topics Some work by Burscher and colleagues • Humans can code generic frames (human-interest, economic, . . . ) • Humans can code topics from a pre-defined list • But it is very hard to formulate an explicit rule (as in: code as ’Human Interest’ if regular expression R is matched) Big Data and Automated Content Analysis Damian Trilling
  • 30. Recap Supervised Machine Learning Exercise Next meetings Applications SML to code frames and topics Some work by Burscher and colleagues • Humans can code generic frames (human-interest, economic, . . . ) • Humans can code topics from a pre-defined list • But it is very hard to formulate an explicit rule (as in: code as ’Human Interest’ if regular expression R is matched) ⇒ This is where you need supervised machine learning! Burscher, B., Odijk, D., Vliegenthart, R., De Rijke, M., & De Vreese, C. H. (2014). Teaching the computer to code frames in news: Comparing two supervised machine learning approaches to frame analysis. Communication Methods and Measures, 8(3), 190–206. doi:10.1080/19312458.2014.937527 Burscher, B., Vliegenthart, R., & De Vreese, C. H. (2015). Using supervised machine learning to code policy issues: Can classifiers generalize across contexts? Annals of the American Academy of Political and Social Science, 659(1), 122–131. Big Data and Automated Content Analysis Damian Trilling
  • 31.
  • 32.
  • 33.
  • 34. http://commons.wikimedia.org/wiki/File:Precisionrecall.svg Some measures of accuracy • Recall • Precision • F1 = 2 · precision·recall precision+recall • AUC (Area under curve) [0, 1], 0.5 = random guessing
  • 35. Recap Supervised Machine Learning Exercise Next meetings Applications What does this mean for our research? Big Data and Automated Content Analysis Damian Trilling
  • 36. Recap Supervised Machine Learning Exercise Next meetings Applications What does this mean for our research? It we have 2,000 documents with manually coded frames and topics. . . • we can use them to train a SML classifier • which can code an unlimited number of new documents • with an acceptable accuracy Some easier tasks even need only 500 training documents, see Hopkins, D. J., & King, G. (2010). A method of automated nonparametric content analysis for social science. American Journal of Political Science, 54(1), 229–247. Big Data and Automated Content Analysis Damian Trilling
  • 37. Recap Supervised Machine Learning Exercise Next meetings An implementation An implementation Let’s say we have a list of tuples with movie reviews and their rating: 1 reviews=[("This is a great movie",1),("Bad movie",-1), ... ...] And a second list with an identical structure: 1 test=[("Not that good",-1),("Nice film",1), ... ...] Both are drawn from the same population, it is pure chance whether a specific review is on the one list or the other. Based on an example from http://blog.dataquest.io/blog/naive-bayes-movies/ Big Data and Automated Content Analysis Damian Trilling
  • 38. Recap Supervised Machine Learning Exercise Next meetings An implementation Training a A Naïve Bayes Classifier 1 from sklearn.naive_bayes import MultinomialNB 2 from sklearn.feature_extraction.text import CountVectorizer 3 from sklearn import metrics 4 5 # This is just an efficient way of computing word counts 6 vectorizer = CountVectorizer(stop_words=’english’) 7 train_features = vectorizer.fit_transform([r[0] for r in reviews]) 8 test_features = vectorizer.transform([r[0] for r in test]) 9 10 # Fit a naive bayes model to the training data. 11 nb = MultinomialNB() 12 nb.fit(train_features, [r[1] for r in reviews]) 13 14 # Now we can use the model to predict classifications for our test features. 15 predictions = nb.predict(test_features) 16 actual=[r[1] for r in test] 17 18 # Compute the error. 19 fpr, tpr, thresholds = metrics.roc_curve(actual, predictions, pos_label =1) 20 print("Multinomal naive bayes AUC: {0}".format(metrics.auc(fpr, tpr))) Big Data and Automated Content Analysis Damian Trilling
  • 39. Recap Supervised Machine Learning Exercise Next meetings An implementation And it works! Using 50,000 IMDB movies that are classified as either negative or positive, • I created a list with 25,000 training tuples and another one with 25,000 test tuples and • trained a classifier • that achieved an AUC of .82. Dataset obtained from http://ai.stanford.edu/~amaas/data/sentiment, Maas, A.L., Daly, R.E., Pham, P.T., Huang, D., Ng, A.Y., & Potts, C. (2011). Learning word vectors for sentiment analysis. 49th Annual Meeting of the Association for Computational Linguistics (ACL 2011) Big Data and Automated Content Analysis Damian Trilling
  • 40. Recap Supervised Machine Learning Exercise Next meetings An implementation Playing around with new data 1 newdata=vectorizer.transform(["What a crappy movie! It sucks!", "This is awsome. I liked this movie a lot, fantastic actors","I would not recomment it to anyone.", "Enjoyed it a lot"]) 2 predictions = nb.predict(newdata) 3 print(predictions) This returns, as you would expect and hope: 1 [-1 1 -1 1] Big Data and Automated Content Analysis Damian Trilling
  • 41. Recap Supervised Machine Learning Exercise Next meetings An implementation But we can do even better We can use different vectorizers and different classifiers. Big Data and Automated Content Analysis Damian Trilling
  • 42. Recap Supervised Machine Learning Exercise Next meetings An implementation Different vectorizers • CountVectorizer (=simple word counts) • TfidfVectorizer (word counts (“term frequency”) weighted by number of documents in which the word occurs at all (“inverse document frequency”)) • additional options: stopwords, thresholds for minimum frequencies etc. Big Data and Automated Content Analysis Damian Trilling
  • 43. Recap Supervised Machine Learning Exercise Next meetings An implementation Different classifiers • Naïve Bayes • Logistic Regression • Support Vector Machine (SVM) • . . . Big Data and Automated Content Analysis Damian Trilling
  • 44. Let’s look at the source code together and find out which setup performs best.
  • 45. Recap Supervised Machine Learning Exercise Next meetings Next (last. . . ) meetings Monday pandas & matplotlib — doing statistics in Python Wednesday OpenLab Questions regarding final project Big Data and Automated Content Analysis Damian Trilling