Tweets Topic Modelling Across Different Countries (presentation)

Tweets Topic Modelling Across Different Countries

Authors / scientific advisors:
Ahmed ABDELWAHAB
Jose ROBLES
Costin-Gabriel CHIRU - costin.chiru@cs.pub.ro
Traian REBEDEA

University Politehnica of Bucharest
Faculty of Automatic Control and Computers
Computer Science Department

Contents
• The Problem
• Twitter
• The Data
• Topic Modeling
• Results
• Conclusions
24.04.2014 eLSE 2014 2

The Problem
• Are there differences between the topics of
interest of people around the world?
• Identify the topics of interest for people from
different countries
• Tweets make it possible to identify both the
topics of interest and the location of different
persons → analysis of English tweets published in
different countries

Twitter
• Used intensively by the data mining community:
– an enormous amount of data is available (more than 883
million users, 241 million of them active monthly, and
about 500 million tweets sent per day on average
[Edwards, 2013])
– news and hot stories spread very fast on this
micro-blogging network
• We extracted tweets containing both text and
URLs to external articles and modelled the topics
of the content and of the URLs independently

The Data (1)
• Only 2% of the tweets had a location stamp set
• The number of tweets differs from one country to
another, and the location-stamped tweets are
even more non-uniformly distributed across
countries

The Data (2)
• From these tweets, we kept only those written in
English which also contained shared URLs
• Besides countries such as the UK, USA, South Africa and
Canada, the largest shares of tweets written in English
were seen in many European countries (e.g.
Latvia, Serbia, Poland, Germany, Ukraine,
Netherlands, Italy, France, Portugal, Spain)
• Initially 1 million tweets → only 50k met all the
conditions
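
The selection conditions above (location stamp, English text, shared URL) can be sketched as a simple filter. This is only an illustrative sketch, not the authors' code: the record fields `text`, `country` and `lang`, and the helpers `keep_tweet` and `filter_tweets`, are hypothetical, and the language is assumed to be already tagged on each record.

```python
import re

# Deliberately simple pattern for http/https links (illustration only).
URL_RE = re.compile(r"https?://\S+")

def keep_tweet(tweet):
    """Return True if the tweet has a location, is in English and shares a URL."""
    has_location = bool(tweet.get("country"))
    is_english = tweet.get("lang") == "en"
    has_url = URL_RE.search(tweet.get("text", "")) is not None
    return has_location and is_english and has_url

def filter_tweets(tweets):
    """Keep only the tweets that satisfy all three conditions."""
    return [t for t in tweets if keep_tweet(t)]

# Hypothetical records, not data from the study.
sample = [
    {"text": "Great read http://example.com/a", "country": "UK", "lang": "en"},
    {"text": "No link here", "country": "UK", "lang": "en"},
    {"text": "Bom dia http://example.com/b", "country": "Brazil", "lang": "pt"},
    {"text": "Nice http://example.com/c", "country": None, "lang": "en"},
]
kept = filter_tweets(sample)
```

Only the first sample record passes all three checks, which mirrors how strongly the combined conditions shrink the dataset.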

Topic Modeling
• After extracting the data from tweets:
– For the URLs, the webpage is fetched and the
HTML is parsed → the 10 main topics are extracted using
Latent Dirichlet Allocation (LDA) [Blei, 2012]
– For the tweets’ content, we removed the #s and
then used affinity propagation (AP) to cluster the
tweets for each country. The 10 main topics were
extracted from the resulting clusters using LDA
(the text was too short → we did not apply LDA directly).
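
Before any clustering or topic extraction, each tweet has to be separated into its text and its shared links. A minimal sketch of that preprocessing step follows. It assumes that "the #s" refers to the hash characters only (the tag word itself is kept), and the helper name `split_tweet` is hypothetical. The AP clustering and LDA steps are not shown.

```python
import re

# Deliberately simple pattern for http/https links (illustration only).
URL_RE = re.compile(r"https?://\S+")

def split_tweet(text):
    """Split a tweet into (cleaned text, list of shared URLs)."""
    urls = URL_RE.findall(text)
    body = URL_RE.sub(" ", text)   # drop the links from the text part
    body = body.replace("#", "")   # assumption: remove only the '#' character
    body = " ".join(body.split())  # collapse repeated whitespace
    return body, urls

body, urls = split_tweet("Loving #London today http://example.com/trip #travel")
```

The cleaned text would then go to the clustering step, while each URL would be fetched and parsed separately.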

Results (1)
• Top 10 topics for URL content: activities,
business, career, cooking, fun, market, places,
social, sports, twitter
• Top 10 topics for tweets’ content: city,
entertainment, fun, health, movies, places,
request, restaurants, romance, travel
• Top 10 words for each topic are presented in
the paper

Results (2)
• Correlation between the two distributions (the topic tj
for URL content and the topic ti for tweets):
\[
\mathrm{dissimilarity}(t_i, t_j) = \frac{\sum_{w_k \in t_i} O(w_k, t_j) + \sum_{w_h \in t_j} O(w_h, t_i)}{|t_i| + |t_j|}
\]
• Where
\[
O(w, t) = \begin{cases} \mathrm{rank}(w, t), & \text{if } w \in t \\ |t| + 1, & \text{otherwise} \end{cases}
\]
• This way we considered both the intersection of the
words for the 2 topics and how representative these
words were for the corresponding topics
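
One possible reading of the dissimilarity measure on this slide: every word of one topic is scored by its rank in the other topic (or by the topic size plus one if it is absent), and the sum is normalised by the combined topic sizes. The sketch below follows that reading. It assumes that a topic is a ranked list of its top words, that rank is the 1-based position in that list, and that |t| is the number of words. The function names and word lists are hypothetical.

```python
def occurrence(word, topic):
    """O(w, t): 1-based rank of `word` in `topic`, or len(topic) + 1 if absent."""
    if word in topic:
        return topic.index(word) + 1
    return len(topic) + 1

def dissimilarity(topic_i, topic_j):
    """Sum the cross-topic ranks of all words, normalised by |t_i| + |t_j|."""
    total = sum(occurrence(w, topic_j) for w in topic_i)
    total += sum(occurrence(w, topic_i) for w in topic_j)
    return total / (len(topic_i) + len(topic_j))

# Hypothetical ranked word lists (most representative word first).
t_a = ["food", "dinner", "wine"]
t_b = ["dinner", "food", "chef"]
t_c = ["goal", "match", "team"]

same = dissimilarity(t_a, t_a)   # identical topics
close = dissimilarity(t_a, t_b)  # two shared words
far = dissimilarity(t_a, t_c)    # no shared words
```

Under this reading, lower values mean that the two topics share more of their highly ranked words.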

Results (3)
• To identify the coupling between topics we used a
greedy algorithm and obtained the following pairs
(tweet topic - URL topic):
– Entertainment - Social
– Places - Activities
– Restaurants - Career
– Request - Sports
– Travel - Market
– City - Twitter
– Health - Places
– Romance - Cooking
– Fun - Fun
– Movies - Business
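
The slides do not detail the greedy algorithm. One plausible sketch, offered purely as an assumption, repeatedly takes the remaining pair with the lowest dissimilarity and then retires both topics. The function name `greedy_pairs` and the scores are hypothetical.

```python
def greedy_pairs(scores):
    """Pair tweet topics with URL topics greedily, lowest dissimilarity first.

    `scores` maps (tweet_topic, url_topic) -> dissimilarity value.
    """
    pairs = []
    used_tweet, used_url = set(), set()
    ordered = sorted(scores.items(), key=lambda item: item[1])
    for (tweet_topic, url_topic), _ in ordered:
        if tweet_topic in used_tweet or url_topic in used_url:
            continue  # one side is already matched
        pairs.append((tweet_topic, url_topic))
        used_tweet.add(tweet_topic)
        used_url.add(url_topic)
    return pairs

# Hypothetical scores, not values from the paper.
scores = {
    ("fun", "fun"): 2.0,
    ("fun", "social"): 3.5,
    ("entertainment", "fun"): 3.0,
    ("entertainment", "social"): 4.0,
}
pairs = greedy_pairs(scores)
```

A greedy matching like this is simple but not guaranteed to minimise the total dissimilarity, which may explain some of the less intuitive pairs.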

Results (4)
• Country comparison:
– construct a new matrix using:
N(Ci,tweets,tj)*P(Ci,tweets,tj) (i stands for the
countries and j for the topics).
– For both URL and tweets’ content, select the 5 most
representative countries for each topic
– Use the following formula to evaluate how similar the
topics are:
\[
\mathrm{eval}(t_i, t_j) = \frac{|\mathrm{Top5Countries}(t_i) \cap \mathrm{Top5Countries}(t_j)|}{5}
\]
– eval = 56%
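
The country comparison on this slide can be read as the share of top-5 countries that a tweet topic and a URL topic have in common. The sketch below implements that reading. It assumes each topic comes with one score per country (for example an entry of the matrix described above). The function names and the numbers are hypothetical, not data from the study.

```python
def top_countries(scores, k=5):
    """Return the k countries with the highest score for one topic."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:k])

def eval_overlap(tweet_scores, url_scores, k=5):
    """Share of the top-k countries that the two topics have in common."""
    common = top_countries(tweet_scores, k) & top_countries(url_scores, k)
    return len(common) / k

# Hypothetical per-country scores for one pair of topics.
tweet_fun = {"USA": 9, "UK": 8, "Canada": 7, "Spain": 6, "Italy": 5, "France": 1}
url_fun = {"USA": 9, "UK": 7, "Canada": 6, "France": 5, "Poland": 4, "Spain": 1}
score = eval_overlap(tweet_fun, url_fun)
```

In this toy case three of the five top countries coincide, so the overlap is 0.6.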

Results (5)
• What different countries are talking about:
– USA: other tweets 50% of the time, and blogs and other
social networks 10% of the time
– UK: other tweets less than 30% of the time, and other social
networks and blogs about 20%
– Canada: other tweets 40% of the time, and other
social media 15%
– South Africa: other tweets 25% of the time, and other
social networks another 20% of the time
• In all the countries, the share of discussion about the
social and blogs topic equals the share of tweets
about sports

Conclusions (1)
• Low matching between the topics debated in URLs and
tweets (maybe because the tweet doesn't always describe
or summarize the content of the shared URL)
• Analyzing the combined text of the tweets and of the
shared webpages showed that the topics generated from
tweets and shared URLs have only a 56% matching across
different countries
• We expected somewhat similar country
distributions for the computed topics. However, the degree to
which a topic is debated is highly influenced by the country
→ cultural differences between countries are at least
partly responsible for this difference

Conclusions (2)
• Results should be interpreted carefully:
– not all the countries have a representative number of
tweets in our dataset
– the ratio of English tweets to the total number of
tweets varies by country (e.g. Brazil has a very low
percentage of English tweets)
– only 2% of the Twitter users set a location stamp
– people who write tweets often do not care about
spelling and use words that are not in the English
dictionary → parsing problems (some words are
ignored)

Q&A
Thank you for your time!
