Trends in ACA Supervised Machine Learning Unsupervised machine learning Concluding remarks
Big Data and Automated Content Analysis
Week 8 – Monday
»How further?«
Damian Trilling
d.c.trilling@uva.nl
@damian0604
www.damiantrilling.net
Afdeling Communicatiewetenschap
Universiteit van Amsterdam
18 May 2015
Big Data and Automated Content Analysis Damian Trilling
Today
1 Trends in Automated Content Analysis
2 Supervised Machine Learning
Applications
An implementation
3 Unsupervised machine learning
Overview
Topic Modeling
4 Concluding remarks
Trends in Automated Content Analysis
From concrete to abstract, from manifest to latent
Our approaches until now
• Focus on specific words (or regular expressions)
• But we took a step towards more abstract representations (cooccurrence networks)
• Today: One step further
Machine learning:
Making sense of millions of cases — finding structures & grouping cases
Making sense of millions of cases
The problem
• We don’t care about each single tweet
• We want groups of tweets (texts, cases, . . . ) instead
⇒ We need a technique that tells us to which group each text belongs
Machine learning

Supervised
• We code a small dataset by hand and use it to “train” a machine
• The machine codes the rest
Example: We use a hand-coded CSV table with two columns (tweet and gender of the sender) as a training dataset and then predict, for each tweet in a different dataset, whether it was sent by a man or a woman.

Unsupervised
• No manually coded data
• We want groups of the most similar cases
• For each case, we want to know to which group it belongs
Example: We have a dataset of Facebook messages on an organization’s page. We use clustering to group them and later interpret these clusters (e.g., as complaints, questions, praise, . . . ).
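A minimal sketch of the supervised workflow in scikit-learn: the tiny training list below is invented toy data that stands in for the hand-coded CSV table, so treat names and labels as illustrative only.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Invented toy stand-in for the hand-coded table: (tweet, gender) pairs
training = [("I love shopping with my girls", "f"),
            ("Watching football with the lads", "m"),
            ("New nail polish today", "f"),
            ("Great match, what a goal", "m")]

# Represent each tweet as word counts and train on the hand-coded labels
vectorizer = CountVectorizer()
X = vectorizer.fit_transform([tweet for tweet, gender in training])
clf = MultinomialNB()
clf.fit(X, [gender for tweet, gender in training])

# The machine codes the rest: predict the label for a tweet it has never seen
new = vectorizer.transform(["What a goal in that match"])
print(clf.predict(new))
```

A real application would of course need thousands of hand-coded tweets, not four.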
Supervised Machine Learning
Applications

In other fields
A lot of different applications
• from recognizing hand-written characters to recommendation systems

In our field
It is becoming popular to measure latent variables
• frames
• topics
SML to code frames and topics
Some work by Burscher and colleagues
• Humans can code generic frames (human-interest, economic, . . . )
• Humans can code topics from a pre-defined list
• But it is very hard to formulate an explicit rule (as in: code as ‘Human Interest’ if regular expression R is matched)
⇒ This is where you need supervised machine learning!

Burscher, B., Odijk, D., Vliegenthart, R., De Rijke, M., & De Vreese, C. H. (2014). Teaching the computer to code frames in news: Comparing two supervised machine learning approaches to frame analysis. Communication Methods and Measures, 8(3), 190–206. doi:10.1080/19312458.2014.937527
Burscher, B., Vliegenthart, R., & De Vreese, C. H. (2015). Using supervised machine learning to code policy issues: Can classifiers generalize across contexts? Annals of the American Academy of Political and Social Science, 659(1), 122–131.
Some measures of accuracy
• Recall
• Precision
• F1 = 2 · (precision · recall) / (precision + recall)
• AUC (area under the curve): ranges from 0 to 1, 0.5 = random guessing
(Image: http://commons.wikimedia.org/wiki/File:Precisionrecall.svg)
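These measures are easy to compute with scikit-learn's metrics module; the six gold-standard and predicted labels below are invented purely for illustration.

```python
from sklearn import metrics

# Invented gold standard and classifier output for six documents
actual = [1, 1, 1, -1, -1, -1]
predictions = [1, 1, -1, 1, -1, -1]

# 2 of the 3 predicted positives are correct; 2 of the 3 actual positives are found
print(metrics.precision_score(actual, predictions))
print(metrics.recall_score(actual, predictions))
print(metrics.f1_score(actual, predictions))

# AUC from the ROC curve, as in the movie-review example later on
fpr, tpr, thresholds = metrics.roc_curve(actual, predictions, pos_label=1)
print(metrics.auc(fpr, tpr))
```

Here precision, recall, and (therefore) F1 all equal 2/3.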
What does this mean for our research?
If we have 2,000 documents with manually coded frames and topics . . .
• we can use them to train an SML classifier
• which can code an unlimited number of new documents
• with acceptable accuracy
Some easier tasks even need only 500 training documents, see Hopkins, D. J., & King, G. (2010). A method of automated nonparametric content analysis for social science. American Journal of Political Science, 54(1), 229–247.
An implementation
Let’s say we have a list of tuples with movie reviews and their rating:

reviews = [("This is a great movie", 1), ("Bad movie", -1), ...]

And a second list with an identical structure:

test = [("Not that good", -1), ("Nice film", 1), ...]

Both are drawn from the same population (with replacement); it is pure chance whether a specific review ends up on the one list or the other.
Based on an example from http://blog.dataquest.io/blog/naive-bayes-movies/
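Such a pair of lists can be produced by shuffling one labeled dataset and splitting it in half; a sketch with invented data (unlike the with-replacement wording above, a plain shuffle-and-split assigns each review to exactly one list, and the seed is fixed only for reproducibility).

```python
import random

# Invented labeled dataset of (review, rating) tuples
labeled = [("This is a great movie", 1), ("Bad movie", -1),
           ("Not that good", -1), ("Nice film", 1),
           ("Terrible acting", -1), ("Wonderful story", 1)]

# Shuffle so that chance alone decides on which list a review ends up
random.seed(42)   # fixed only to make the sketch reproducible
random.shuffle(labeled)

half = len(labeled) // 2
reviews, test = labeled[:half], labeled[half:]
print(len(reviews), len(test))
```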
Training a Naïve Bayes classifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import metrics

# This is just an efficient way of computing word counts
vectorizer = CountVectorizer(stop_words='english')
train_features = vectorizer.fit_transform([r[0] for r in reviews])
test_features = vectorizer.transform([r[0] for r in test])

# Fit a naive bayes model to the training data.
nb = MultinomialNB()
nb.fit(train_features, [r[1] for r in reviews])

# Now we can use the model to predict classifications for our test features.
predictions = nb.predict(test_features)
actual = [r[1] for r in test]

# Compute the error.
fpr, tpr, thresholds = metrics.roc_curve(actual, predictions, pos_label=1)
print("Multinomial naive bayes AUC: {0}".format(metrics.auc(fpr, tpr)))
And it works!
Using 50,000 IMDB movie reviews that are classified as either negative or positive,
• I created a list with 25,000 training tuples and another one with 25,000 test tuples and
• trained a classifier
• that achieved an AUC of .82.
Dataset obtained from http://ai.stanford.edu/~amaas/data/sentiment, Maas, A.L., Daly, R.E., Pham, P.T., Huang, D., Ng, A.Y., & Potts, C. (2011). Learning word vectors for sentiment analysis. 49th Annual Meeting of the Association for Computational Linguistics (ACL 2011).
Playing around with new data

newdata = vectorizer.transform(["What a crappy movie! It sucks!", "This is awsome. I liked this movie a lot, fantastic actors", "I would not recomment it to anyone.", "Enjoyed it a lot"])
predictions = nb.predict(newdata)
print(predictions)

This returns, as you would expect and hope:

[-1 1 -1 1]
A last remark on SML
To see the bigger picture: a regression model is also a form of SML, as you can use the fitted regression formula to predict the outcome for new, unknown cases!
(You see, in the end, everything fits together.)
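A toy illustration of that point, with invented numbers: fit a regression on known cases, then apply the fitted formula to a new case.

```python
from sklearn.linear_model import LinearRegression

# Invented training data that follows y = 2x + 1 exactly
X = [[1], [2], [3], [4]]
y = [3, 5, 7, 9]

model = LinearRegression()
model.fit(X, y)

# "Supervised learning": the fitted formula predicts the outcome
# for a new, unknown case (here x = 10, so y should be about 21)
print(model.predict([[10]]))
```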
Unsupervised machine learning
Overview

Supervised machine learning
You have a dataset with both predictors and outcomes (independent and dependent variables) — a labeled dataset.

Unsupervised machine learning
You have no labels.
Again, you already know some techniques for this from other courses:
• Principal Component Analysis (PCA)
• Cluster analysis
• . . .
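As a minimal sketch of the clustering idea applied to text (the four messages and the choice of two clusters are invented for illustration, echoing the Facebook example from earlier):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import KMeans

# Invented messages: two complaints and two questions
messages = ["My order never arrived, very disappointed",
            "The order arrived broken, so disappointed",
            "When do you open on Sundays?",
            "When do you open on Mondays?"]

# Represent each message as word counts and group them into two clusters
X = CountVectorizer().fit_transform(messages)
km = KMeans(n_clusters=2, n_init=10, random_state=1)
labels = km.fit_predict(X)
print(labels)   # per message: to which cluster it belongs
```

Interpreting what the clusters mean (complaints? questions?) remains a human task.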
Principal Component Analysis? How does that fit in here?
In fact, PCA is used everywhere, even in image compression
PCA in ACA
• Find out which words cooccur (inductive frame analysis)
• Basically, transform each document into a vector of word frequencies and do a PCA
But we’ll look at the state of the art instead: Latent Dirichlet Allocation (LDA)
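That two-step recipe can be sketched as follows (toy documents; TruncatedSVD plays the role of PCA here because it works directly on the sparse word-count matrix that CountVectorizer produces):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

# Invented documents: two about the economy, two about football
docs = ["tax deficit budget economy", "economy budget tax increase",
        "world cup football match", "football match victory goal"]

# Each document becomes a vector of word frequencies ...
X = CountVectorizer().fit_transform(docs)

# ... which we then reduce to two components, PCA-style
svd = TruncatedSVD(n_components=2, random_state=1)
scores = svd.fit_transform(X)
print(scores.shape)   # one row of component scores per document
```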
Topic Modeling
Topic Modeling with Latent Dirichlet Allocation (LDA)
LDA, what’s that?
No mathematical details here, but the general idea:
• There are k topics
• Each document Di consists of a mixture of these topics, e.g. 80% T1, 15% T2, 0% T3, . . . 5% Tk
• On the next level, each topic consists of a specific probability distribution of words
• Thus, based on the frequencies of words in Di, one can infer its distribution of topics
• Note that LDA is a Bag-of-Words (BOW) approach
Doing an LDA in Python
You can use gensim (Řehůřek & Sojka, 2010) for this:

cd /home/damian/anaconda3/bin
sudo ./pip install gensim

Furthermore, let us assume you have a list of lists of words (!) called texts:

articles = ['The tax deficit is higher than expected. This said xxx ...', 'Germany won the World Cup. After a']
texts = [art.split() for art in articles]

which looks like this:

[['The', 'tax', 'deficit', 'is', 'higher', 'than', 'expected.', 'This', 'said', 'xxx', '...'], ['Germany', 'won', 'the', 'World', 'Cup.', 'After', 'a']]
Řehůřek, R., & Sojka, P. (2010). Software framework for topic modelling with large corpora. Proceedings of the
LREC 2010 Workshop on New Challenges for NLP Frameworks, pp. 45–50. Valletta, Malta: ELRA.
from gensim import corpora, models

NTOPICS = 100
LDAOUTPUTFILE = "topicscores.tsv"

# Create a BOW representation of the texts
id2word = corpora.Dictionary(texts)
mm = [id2word.doc2bow(text) for text in texts]

# Train the LDA model.
lda = models.ldamodel.LdaModel(corpus=mm, id2word=id2word, num_topics=NTOPICS, alpha="auto")

# Print the topics.
for top in lda.print_topics(num_topics=NTOPICS, num_words=5):
    print("\n", top)

print("\nFor further analysis, a dataset with the topic score for each document is saved to", LDAOUTPUTFILE)

scoresperdoc = lda.inference(mm)

with open(LDAOUTPUTFILE, "w", encoding="utf-8") as fo:
    for row in scoresperdoc[0]:
        fo.write("\t".join(["{:0.3f}".format(score) for score in row]))
        fo.write("\n")
Output: Topics (below) & topic scores (next slide)

1 0.069*fusie + 0.058*brussel + 0.045*europesecommissie + 0.036*europese + 0.023*overname
2 0.109*bank + 0.066*britse + 0.041*regering + 0.035*financien + 0.033*minister
3 0.114*nederlandse + 0.106*nederland + 0.070*bedrijven + 0.042*rusland + 0.038*russische
4 0.093*nederlandsespoorwegen + 0.074*den + 0.036*jaar + 0.029*onderzoek + 0.027*raad
5 0.099*banen + 0.045*jaar + 0.045*productie + 0.036*ton + 0.029*aantal
6 0.041*grote + 0.038*bedrijven + 0.027*ondernemers + 0.023*goed + 0.015*jaar
7 0.108*werknemers + 0.037*jongeren + 0.035*werkgevers + 0.029*jaar + 0.025*werk
8 0.171*bank + 0.122* + 0.041*klanten + 0.035*verzekeraar + 0.028*euro
9 0.162*banken + 0.055*bank + 0.039*centrale + 0.027*leningen + 0.024*financiele
10 0.052*post + 0.042*media + 0.038*nieuwe + 0.034*netwerk + 0.025*personeel
11 ...
Concluding remarks
Of course, this lecture could only give a very superficial overview.
It should have shown you, though, what is possible — and that
you can do it!
Next meeting
Wednesday
Lab session, focus on INDIVIDUAL PROJECTS! Prepare!
(No common exercise — but you can work on those I send you)
BD-ACA Week8a
  • 1. Trends in ACA Supervised Machine Learning Unsupervised machine learning Concluding remarks Big Data and Automated Content Analysis Week 8 – Monday »How further?« Damian Trilling d.c.trilling@uva.nl @damian0604 www.damiantrilling.net Afdeling Communicatiewetenschap Universiteit van Amsterdam 18 May 2015 Big Data and Automated Content Analysis Damian Trilling
  • 2. Trends in ACA Supervised Machine Learning Unsupervised machine learning Concluding remarks Today 1 Trends in Automated Content Analysis 2 Supervised Machine Learning Applications An implementation 3 Unsupervised machine learning Overview Topic Modeling 4 Concluding remarks Big Data and Automated Content Analysis Damian Trilling
  • 3. Trends in ACA Supervised Machine Learning Unsupervised machine learning Concluding remarks Trends in Automated Content Analysis Big Data and Automated Content Analysis Damian Trilling
  • 4. Trends in ACA Supervised Machine Learning Unsupervised machine learning Concluding remarks From concrete to abstract, from manifest to latent Our approaches until now Big Data and Automated Content Analysis Damian Trilling
  • 5. Trends in ACA Supervised Machine Learning Unsupervised machine learning Concluding remarks From concrete to abstract, from manifest to latent Our approaches until now • Focus on specific words (or regular expressions) Big Data and Automated Content Analysis Damian Trilling
  • 6. Trends in ACA Supervised Machine Learning Unsupervised machine learning Concluding remarks From concrete to abstract, from manifest to latent Our approaches until now • Focus on specific words (or regular expressions) • But we made a step towards more abstract representations (cooccurrence networks) Big Data and Automated Content Analysis Damian Trilling
  • 7. Trends in ACA Supervised Machine Learning Unsupervised machine learning Concluding remarks From concrete to abstract, from manifest to latent Our approaches until now • Focus on specific words (or regular expressions) • But we made a step towards more abstract representations (cooccurrence networks) • Today: One step further Big Data and Automated Content Analysis Damian Trilling
  • 8.
  • 9. Trends in ACA Supervised Machine Learning Unsupervised machine learning Concluding remarks Machine learning: Making sense of millions of cases — finding structures & grouping cases Big Data and Automated Content Analysis Damian Trilling
  • 10. Trends in ACA Supervised Machine Learning Unsupervised machine learning Concluding remarks Making sense of millions of cases The problem • We don’t care about each single tweet • We want groups of tweets (texts, cases, . . . ) instead Big Data and Automated Content Analysis Damian Trilling
  • 11. Trends in ACA Supervised Machine Learning Unsupervised machine learning Concluding remarks Making sense of millions of cases The problem • We don’t care about each single tweet • We want groups of tweets (texts, cases, . . . ) instead ⇒ We need a technique that tells us to which group each text belongs Big Data and Automated Content Analysis Damian Trilling
  • 12. Trends in ACA Supervised Machine Learning Unsupervised machine learning Concluding remarks Machine learning Supervised • We code a small dataset by hand and use it to “train” a machine • The machine codes the rest Big Data and Automated Content Analysis Damian Trilling
  • 13. Trends in ACA Supervised Machine Learning Unsupervised machine learning Concluding remarks Machine learning Supervised • We code a small dataset by hand and use it to “train” a machine • The machine codes the rest Example: We use a hand-coded CSV table with two columns (tweet and gender of the sender) as training dataset and then predict for a different dataset per tweet if it was sent by a man or a woman. Big Data and Automated Content Analysis Damian Trilling
  • 14. Trends in ACA Supervised Machine Learning Unsupervised machine learning Concluding remarks Machine learning Supervised • We code a small dataset by hand and use it to “train” a machine • The machine codes the rest Example: We use a hand-coded CSV table with two columns (tweet and gender of the sender) as training dataset and then predict for a different dataset per tweet if it was sent by a man or a woman. Unsupervised • No manually coded data • We want groups of most similar cases • Per case, we want to know to which group it belongs Big Data and Automated Content Analysis Damian Trilling
  • 15. Trends in ACA Supervised Machine Learning Unsupervised machine learning Concluding remarks Machine learning Supervised • We code a small dataset by hand and use it to “train” a machine • The machine codes the rest Example: We use a hand-coded CSV table with two columns (tweet and gender of the sender) as training dataset and then predict for a different dataset per tweet if it was sent by a man or a woman. Unsupervised • No manually coded data • We want groups of most similar cases • Per case, we want to know to which group it belongs Example: We have a dataset of Facebook messages on an organization's page. We use clustering to group them and later interpret these clusters (e.g., as complaints, questions, praise, . . . ) Big Data and Automated Content Analysis Damian Trilling
  • 16. Trends in ACA Supervised Machine Learning Unsupervised machine learning Concluding remarks Supervised Machine Learning Big Data and Automated Content Analysis Damian Trilling
  • 17. Trends in ACA Supervised Machine Learning Unsupervised machine learning Concluding remarks Applications Applications Big Data and Automated Content Analysis Damian Trilling
  • 18. Trends in ACA Supervised Machine Learning Unsupervised machine learning Concluding remarks Applications Applications In other fields A lot of different applications • from recognizing hand-written characters to recommendation systems Big Data and Automated Content Analysis Damian Trilling
  • 19. Trends in ACA Supervised Machine Learning Unsupervised machine learning Concluding remarks Applications Applications In other fields A lot of different applications • from recognizing hand-written characters to recommendation systems In our field It is becoming popular to measure latent variables • frames • topics Big Data and Automated Content Analysis Damian Trilling
  • 20. Trends in ACA Supervised Machine Learning Unsupervised machine learning Concluding remarks Applications SML to code frames and topics Some work by Burscher and colleagues • Humans can code generic frames (human-interest, economic, . . . ) • Humans can code topics from a pre-defined list Big Data and Automated Content Analysis Damian Trilling
  • 21. Trends in ACA Supervised Machine Learning Unsupervised machine learning Concluding remarks Applications SML to code frames and topics Some work by Burscher and colleagues • Humans can code generic frames (human-interest, economic, . . . ) • Humans can code topics from a pre-defined list • But it is very hard to formulate an explicit rule (as in: code as ’Human Interest’ if regular expression R is matched) Big Data and Automated Content Analysis Damian Trilling
  • 22. Trends in ACA Supervised Machine Learning Unsupervised machine learning Concluding remarks Applications SML to code frames and topics Some work by Burscher and colleagues • Humans can code generic frames (human-interest, economic, . . . ) • Humans can code topics from a pre-defined list • But it is very hard to formulate an explicit rule (as in: code as ’Human Interest’ if regular expression R is matched) ⇒ This is where you need supervised machine learning! Burscher, B., Odijk, D., Vliegenthart, R., De Rijke, M., & De Vreese, C. H. (2014). Teaching the computer to code frames in news: Comparing two supervised machine learning approaches to frame analysis. Communication Methods and Measures, 8(3), 190–206. doi:10.1080/19312458.2014.937527 Burscher, B., Vliegenthart, R., & De Vreese, C. H. (2015). Using supervised machine learning to code policy issues: Can classifiers generalize across contexts? Annals of the American Academy of Political and Social Science, 659(1), 122–131. Big Data and Automated Content Analysis Damian Trilling
  • 23.
  • 24.
  • 25.
  • 26. http://commons.wikimedia.org/wiki/File:Precisionrecall.svg Some measures of accuracy • Recall • Precision • F1 = 2 · (precision · recall) / (precision + recall) • AUC (Area under curve) [0, 1], 0.5 = random guessing
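These measures can be checked with a small pure-Python sketch; the labels and predictions below are made up purely for illustration, with 1 marking the positive class:

```python
# Hypothetical gold-standard labels and classifier predictions
actual    = [1, 1, 1, 0, 0, 0, 1, 0]
predicted = [1, 0, 1, 0, 1, 0, 1, 0]

tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)  # true positives
fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)  # false positives
fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)  # false negatives

recall = tp / (tp + fn)        # share of real positives that were found
precision = tp / (tp + fp)     # share of predicted positives that are correct
f1 = 2 * precision * recall / (precision + recall)
print(recall, precision, f1)
```

With these invented counts (tp=3, fp=1, fn=1), recall, precision, and F1 all come out at 0.75.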
  • 27. Trends in ACA Supervised Machine Learning Unsupervised machine learning Concluding remarks Applications What does this mean for our research? Big Data and Automated Content Analysis Damian Trilling
  • 28. Trends in ACA Supervised Machine Learning Unsupervised machine learning Concluding remarks Applications What does this mean for our research? If we have 2,000 documents with manually coded frames and topics. . . • we can use them to train a SML classifier • which can code an unlimited number of new documents • with an acceptable accuracy Some easier tasks even need only 500 training documents, see Hopkins, D. J., & King, G. (2010). A method of automated nonparametric content analysis for social science. American Journal of Political Science, 54(1), 229–247. Big Data and Automated Content Analysis Damian Trilling
  • 29. Trends in ACA Supervised Machine Learning Unsupervised machine learning Concluding remarks An implementation An implementation Let's say we have a list of tuples with movie reviews and their rating:

    reviews = [("This is a great movie", 1), ("Bad movie", -1), ...]

And a second list with an identical structure:

    test = [("Not that good", -1), ("Nice film", 1), ...]

Both are drawn from the same population (with replacement); it is pure chance whether a specific review ends up on one list or the other. Based on an example from http://blog.dataquest.io/blog/naive-bayes-movies/ Big Data and Automated Content Analysis Damian Trilling
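Such a chance split of one labeled dataset into a training and a test part can be sketched as follows; the mini-dataset and the 50/50 split are invented for illustration (with real data you would split thousands of coded cases):

```python
import random

# Invented labeled cases: (text, rating)
labeled = [("This is a great movie", 1), ("Bad movie", -1),
           ("Nice film", 1), ("Not that good", -1),
           ("Enjoyed it a lot", 1), ("Terrible acting", -1)]

random.seed(42)                  # only to make this example reproducible
random.shuffle(labeled)          # random order: pure chance which half a case lands in
cutoff = len(labeled) // 2
reviews, test = labeled[:cutoff], labeled[cutoff:]
print(len(reviews), len(test))
```

Because the split happens after shuffling, both halves are random draws from the same pool of cases.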
  • 30. Trends in ACA Supervised Machine Learning Unsupervised machine learning Concluding remarks An implementation Training a Naïve Bayes Classifier

    from sklearn.naive_bayes import MultinomialNB
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn import metrics

    # This is just an efficient way of computing word counts
    vectorizer = CountVectorizer(stop_words='english')
    train_features = vectorizer.fit_transform([r[0] for r in reviews])
    test_features = vectorizer.transform([r[0] for r in test])

    # Fit a naive bayes model to the training data.
    nb = MultinomialNB()
    nb.fit(train_features, [r[1] for r in reviews])

    # Now we can use the model to predict classifications for our test features.
    predictions = nb.predict(test_features)
    actual = [r[1] for r in test]

    # Compute the error.
    fpr, tpr, thresholds = metrics.roc_curve(actual, predictions, pos_label=1)
    print("Multinomial naive bayes AUC: {0}".format(metrics.auc(fpr, tpr)))

Big Data and Automated Content Analysis Damian Trilling
  • 31. Trends in ACA Supervised Machine Learning Unsupervised machine learning Concluding remarks An implementation And it works! Using 50,000 IMDB movie reviews that are classified as either negative or positive, • I created a list with 25,000 training tuples and another one with 25,000 test tuples and • trained a classifier • that achieved an AUC of .82. Dataset obtained from http://ai.stanford.edu/~amaas/data/sentiment, Maas, A.L., Daly, R.E., Pham, P.T., Huang, D., Ng, A.Y., & Potts, C. (2011). Learning word vectors for sentiment analysis. 49th Annual Meeting of the Association for Computational Linguistics (ACL 2011) Big Data and Automated Content Analysis Damian Trilling
  • 32. Trends in ACA Supervised Machine Learning Unsupervised machine learning Concluding remarks An implementation Playing around with new data

    newdata = vectorizer.transform(["What a crappy movie! It sucks!",
        "This is awsome. I liked this movie a lot, fantastic actors",
        "I would not recomment it to anyone.",
        "Enjoyed it a lot"])
    predictions = nb.predict(newdata)
    print(predictions)

This returns, as you would expect and hope:

    [-1  1 -1  1]

Big Data and Automated Content Analysis Damian Trilling
  • 33. Trends in ACA Supervised Machine Learning Unsupervised machine learning Concluding remarks An implementation A last remark on SML To get the bigger picture: A regression model is also a form of SML, as you can use the regression formula you get to predict the outcome for new, unknown cases! (You see, in the end, everything fits together) Big Data and Automated Content Analysis Damian Trilling
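A minimal sketch of that point, with made-up numbers: estimate a bivariate OLS formula on some "training" cases, then use it to predict a new, unseen case.

```python
# Hypothetical training data: x is the predictor, y the outcome
x = [1, 2, 3, 4]
y = [2, 4, 6, 8]

# Ordinary least squares for y = a + b*x
n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n
b = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
     / sum((xi - mean_x) ** 2 for xi in x))
a = mean_y - b * mean_x

# The fitted regression formula now predicts the outcome for an unknown case
prediction = a + b * 5
print(prediction)   # -> 10.0
```

Here the invented data follow y = 2x exactly, so the "trained model" (a = 0, b = 2) predicts 10.0 for the new case x = 5; with noisy real data the logic is the same, only the fit is imperfect.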
  • 35. Trends in ACA Supervised Machine Learning Unsupervised machine learning Concluding remarks Overview Overview Supervised machine learning You have a dataset with both predictor and outcome (independent and dependent variables) — a labeled dataset Big Data and Automated Content Analysis Damian Trilling
  • 36. Trends in ACA Supervised Machine Learning Unsupervised machine learning Concluding remarks Overview Overview Supervised machine learning You have a dataset with both predictor and outcome (independent and dependent variables) — a labeled dataset Unsupervised machine learning You have no labels. Big Data and Automated Content Analysis Damian Trilling
  • 37. Trends in ACA Supervised Machine Learning Unsupervised machine learning Concluding remarks Overview Overview Supervised machine learning You have a dataset with both predictor and outcome (independent and dependent variables) — a labeled dataset Unsupervised machine learning You have no labels. Again, you already know some techniques for this from other courses: • Principal Component Analysis (PCA) • Cluster analysis • . . . Big Data and Automated Content Analysis Damian Trilling
  • 38. Trends in ACA Supervised Machine Learning Unsupervised machine learning Concluding remarks Overview Principal Component Analysis? How does that fit in here? Big Data and Automated Content Analysis Damian Trilling
  • 39. Trends in ACA Supervised Machine Learning Unsupervised machine learning Concluding remarks Overview Principal Component Analysis? How does that fit in here? In fact, PCA is used everywhere, even in image compression Big Data and Automated Content Analysis Damian Trilling
  • 40. Trends in ACA Supervised Machine Learning Unsupervised machine learning Concluding remarks Overview Principal Component Analysis? How does that fit in here? PCA in ACA • Find out which words cooccur (inductive frame analysis) • Basically, transform each document into a vector of word frequencies and do a PCA Big Data and Automated Content Analysis Damian Trilling
  • 41. Trends in ACA Supervised Machine Learning Unsupervised machine learning Concluding remarks Overview Principal Component Analysis? How does that fit in here? PCA in ACA • Find out which words cooccur (inductive frame analysis) • Basically, transform each document into a vector of word frequencies and do a PCA But we'll look at the state of the art instead: Latent Dirichlet Allocation (LDA) Big Data and Automated Content Analysis Damian Trilling
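The two-step idea (document-term matrix, then PCA) can be sketched with NumPy; the word counts below are invented and far too small for a real analysis:

```python
import numpy as np

# Rows = documents, columns = frequencies of three (hypothetical) words
dtm = np.array([[3., 0., 1.],
                [2., 1., 0.],
                [0., 4., 3.],
                [1., 3., 4.]])

centered = dtm - dtm.mean(axis=0)          # center each word's counts
cov = np.cov(centered, rowvar=False)       # word-by-word covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)     # eigenvalues in ascending order

loadings = eigvecs[:, -1]                  # first principal component: word loadings
scores = centered @ loadings               # per-document component scores
print(loadings, scores)
```

Words that load strongly on the same component tend to cooccur across documents, which is exactly what an inductive frame analysis looks for.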
  • 42. Trends in ACA Supervised Machine Learning Unsupervised machine learning Concluding remarks Topic Modeling Topic Modeling with Latent Dirichlet Allocation (LDA) Big Data and Automated Content Analysis Damian Trilling
  • 43. Trends in ACA Supervised Machine Learning Unsupervised machine learning Concluding remarks Topic Modeling LDA, what's that? No mathematical details here, but the general idea • There are k topics • Each document Di consists of a mixture of these topics, e.g. 80% T1, 15% T2, 0% T3, . . . 5% Tk • On the next level, each topic consists of a specific probability distribution of words • Thus, based on the frequencies of words in Di, one can infer its distribution of topics • Note that LDA is a Bag-of-Words (BOW) approach Big Data and Automated Content Analysis Damian Trilling
  • 44. Trends in ACA Supervised Machine Learning Unsupervised machine learning Concluding remarks Topic Modeling Doing an LDA in Python You can use gensim (Řehůřek & Sojka, 2010) for this.

    cd /home/damian/anaconda3/bin
    sudo ./pip install gensim

Furthermore, let us assume you have a list of lists of words (!) called texts:

    articles = ['The tax deficit is higher than expected. This said xxx ...',
                'Germany won the World Cup. After a']
    texts = [art.split() for art in articles]

which looks like this:

    [['The', 'tax', 'deficit', 'is', 'higher', 'than', 'expected.', 'This', 'said', 'xxx', '...'],
     ['Germany', 'won', 'the', 'World', 'Cup.', 'After', 'a']]

Řehůřek, R., & Sojka, P. (2010). Software framework for topic modelling with large corpora. Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pp. 45–50. Valletta, Malta: ELRA. Big Data and Automated Content Analysis Damian Trilling
  • 45.

    from gensim import corpora, models

    NTOPICS = 100
    LDAOUTPUTFILE = "topicscores.tsv"

    # Create a BOW representation of the texts
    id2word = corpora.Dictionary(texts)
    mm = [id2word.doc2bow(text) for text in texts]

    # Train the LDA model.
    lda = models.ldamodel.LdaModel(corpus=mm, id2word=id2word,
                                   num_topics=NTOPICS, alpha="auto")

    # Print the topics.
    for top in lda.print_topics(num_topics=NTOPICS, num_words=5):
        print("\n", top)

    print("\nFor further analysis, a dataset with the topic score for each document is saved to", LDAOUTPUTFILE)

    scoresperdoc = lda.inference(mm)

    with open(LDAOUTPUTFILE, "w", encoding="utf-8") as fo:
        for row in scoresperdoc[0]:
            fo.write("\t".join(["{:0.3f}".format(score) for score in row]))
            fo.write("\n")
  • 46. Trends in ACA Supervised Machine Learning Unsupervised machine learning Concluding remarks Topic Modeling Output: Topics (below) & topic scores (next slide)

    1  0.069*fusie + 0.058*brussel + 0.045*europesecommissie + 0.036*europese + 0.023*overname
    2  0.109*bank + 0.066*britse + 0.041*regering + 0.035*financien + 0.033*minister
    3  0.114*nederlandse + 0.106*nederland + 0.070*bedrijven + 0.042*rusland + 0.038*russische
    4  0.093*nederlandsespoorwegen + 0.074*den + 0.036*jaar + 0.029*onderzoek + 0.027*raad
    5  0.099*banen + 0.045*jaar + 0.045*productie + 0.036*ton + 0.029*aantal
    6  0.041*grote + 0.038*bedrijven + 0.027*ondernemers + 0.023*goed + 0.015*jaar
    7  0.108*werknemers + 0.037*jongeren + 0.035*werkgevers + 0.029*jaar + 0.025*werk
    8  0.171*bank + 0.122* + 0.041*klanten + 0.035*verzekeraar + 0.028*euro
    9  0.162*banken + 0.055*bank + 0.039*centrale + 0.027*leningen + 0.024*financiele
    10 0.052*post + 0.042*media + 0.038*nieuwe + 0.034*netwerk + 0.025*personeel
    ...

Big Data and Automated Content Analysis Damian Trilling
  • 47.
  • 48. Trends in ACA Supervised Machine Learning Unsupervised machine learning Concluding remarks Concluding remarks Big Data and Automated Content Analysis Damian Trilling
  • 49. Trends in ACA Supervised Machine Learning Unsupervised machine learning Concluding remarks Concluding remarks Of course, this lecture could only give a very superficial overview. It should have shown you, though, what is possible — and that you can do it! Big Data and Automated Content Analysis Damian Trilling
  • 50. Trends in ACA Supervised Machine Learning Unsupervised machine learning Concluding remarks Next meeting Wednesday Lab session, focus on INDIVIDUAL PROJECTS! Prepare! (No common exercise — but you can work on those I send you) Big Data and Automated Content Analysis Damian Trilling