Every country has its own topics of interest and its own hot topics at different moments in time. In this paper we present a system that helps to understand and compare different countries, starting from the topics that are debated among their members. To do that, we recorded and analyzed the content of the messages sent on Twitter by people living in several countries, hoping to capture the topics of interest for each culture and to predict their hot topics. We analyzed only tweets written in English, because English has become a global language that is used even by Internet users from non-English speaking countries when they want to share their thoughts with a global audience. Our study tries to capture the topic models both for the tweets and for the URLs shared in them. We then compare the distribution of topics across different countries, both for the tweets and for the URLs, to check how consistent these models are. For the topic modelling task, we designed a specialized procedure adapted to tweets (which are limited to 140 characters and are therefore too short for classical topic modelling methods). Our system has been tested on a corpus consisting of English tweets, collected using the Twitter streaming API, that have a location attached to them and that also contain a URL. To eliminate our bias, we extracted tweets without any restrictions (including tweets written in other languages, tweets without URLs, and tweets without a location attached) and then checked the percentage of our targeted tweets for each country. As a consequence, we extended the period of collecting the tweets to decrease the risk of dealing with abnormal events occurring in a certain country.
2. Contents
• The Problem
• Twitter
• The Data
• Topic Modeling
• Results
• Conclusions
24.04.2014 eLSE 2014 2
3. The Problem
• Are there differences between the topics of
interest of people around the world?
• Identify the topics of interest for the people from
different countries
• Tweets offer the possibility to identify both the
topics of interest and the location of different
persons, hence the analysis of English tweets
published in different countries
4. Twitter
• Used intensively by the data mining communities:
– enormous amount of data available (more than 883
million users, of which 241 million are active monthly;
the average number of tweets sent daily is
about 500 million [Edwards, 2013]).
– news and hot stories are spreading very fast on this
micro-blogging network
• We extracted tweets containing both text and
URLs to external articles and modelled the topics
of the content and of the URLs independently
5. The Data (1)
• Only 2% of the tweets had a location stamp set
• The number of tweets differs from one country to
another, and the location-stamped tweets are
even more non-uniformly distributed across
countries
6. The Data (2)
• From these tweets, we kept only those written in
English which also contained shared URLs
• The largest shares of tweets written in English (besides
countries such as the UK, USA, South Africa and Canada)
were seen in many European countries (e.g.
Latvia, Serbia, Poland, Germany, Ukraine,
Netherlands, Italy, France, Portugal, Spain)
• Of the initial 1 million tweets, only 50k satisfied all
conditions
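The filtering conditions listed on the two data slides can be sketched as follows. This is only an illustration, not the pipeline used in the study: it assumes each tweet has already been flattened into a dictionary with the hypothetical keys `country`, `lang` and `text`, which is not the raw layout returned by the Twitter streaming API.

```python
import re

# Loose URL matcher; good enough for a sketch, not a full URL grammar.
URL_PATTERN = re.compile(r"https?://\S+")


def keep_tweet(tweet):
    """True if the tweet has a location, is in English and shares a URL.

    The keys "country", "lang" and "text" are hypothetical field names.
    """
    has_location = bool(tweet.get("country"))
    is_english = tweet.get("lang") == "en"
    has_url = URL_PATTERN.search(tweet.get("text", "")) is not None
    return has_location and is_english and has_url


def share_kept_per_country(tweets):
    """Share of targeted tweets among the location-stamped tweets of each country."""
    totals, kept = {}, {}
    for tweet in tweets:
        country = tweet.get("country")
        if not country:
            continue  # tweets without a location cannot be assigned to a country
        totals[country] = totals.get(country, 0) + 1
        if keep_tweet(tweet):
            kept[country] = kept.get(country, 0) + 1
    return {country: kept.get(country, 0) / n for country, n in totals.items()}
```

Computing the share per country, rather than only a global count, mirrors the check described in the abstract: the unrestricted sample is collected first, and the percentage of targeted tweets is then inspected country by country.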
7. Topic Modeling
• After extracting the data from tweets:
– For the URLs, the webpage is fetched, the
HTML is parsed, and the main 10 topics are extracted
using Latent Dirichlet Allocation (LDA) [Blei, 2012]
– For the tweets' content, we removed the #s and
then used affinity propagation (AP) to cluster the
tweets for each country. The main 10 topics were
extracted from the resulting clusters using LDA
(the text was too short to apply LDA directly).
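The two preprocessing steps named above (parsing the fetched HTML and removing the #s from tweets) could look roughly like the sketch below. It uses only the Python standard library and works on an HTML string that has already been downloaded. The helper names are hypothetical, and treating "#s" as the literal `#` character (keeping the tag word) is an assumption. The clustering and LDA steps are not shown.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    """Return the visible text of an HTML document as one string."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def strip_hash_signs(tweet_text):
    """Drop the '#' character but keep the hashtag word itself (assumption)."""
    return tweet_text.replace("#", "")
```

Keeping the hashtag word while dropping the `#` sign preserves vocabulary that may matter for the short tweet texts. Whether the study kept or discarded the tag words is not stated on the slide.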
8. Results (1)
• Top 10 topics for URL content: activities,
business, career, cooking, fun, market, places,
social, sports, twitter
• Top 10 topics for tweets’ content: city,
entertainment, fun, health, movies, places,
request, restaurants, romance, travel
• Top 10 words for each topic are presented in
the paper
9. Results (2)
• Correlation between the two distributions (the topic tj
for URL content and the topic ti for tweets):

$$\mathrm{dissimilarity}(t_i, t_j) = \frac{\sum_{w_k \in t_i} O(w_k, t_j) + \sum_{w_h \in t_j} O(w_h, t_i)}{|t_i| + |t_j|}$$

• Where

$$O(w, t) = \begin{cases} \mathrm{rank}(w, t), & \text{if } w \in t \\ |t| + 1, & \text{otherwise} \end{cases}$$

• This way we considered both the intersection of the
words for the 2 topics and how representative these
words were for the corresponding topics
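A minimal sketch of this dissimilarity measure, under stated assumptions: each topic is represented as a list of its words ordered from most to least representative, rank(w, t) is taken to be the 1-based position of w in that list, and a word absent from a topic is charged |t| + 1. The function names are illustrative only.

```python
def order(word, topic):
    """O(w, t): 1-based rank of the word in the topic, or len(topic) + 1 if absent."""
    if word in topic:
        return topic.index(word) + 1
    return len(topic) + 1


def dissimilarity(topic_i, topic_j):
    """Average cross-topic rank of the words of both topics (lower = more similar)."""
    total = sum(order(w, topic_j) for w in topic_i)
    total += sum(order(w, topic_i) for w in topic_j)
    return total / (len(topic_i) + len(topic_j))
```

With this reading, two topics that share no words reach the maximum score, and shared words lower the score more the closer they are to the top of the other topic's word list.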
10. Results (3)
• For identifying the coupling between topics we used a
greedy algorithm and obtained the following pairs:
– Entertainment - Social
– Places - Activities
– Restaurant - Career
– Request - Sports
– Travel - Market
– City - Twitter
– Health - Places
– Romance - Cooking
– Fun - Fun
– Movies - Business
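The slides do not spell out the greedy algorithm. One plausible reading, shown here purely as an assumption, is to repeatedly take the still-unpaired (tweet topic, URL topic) pair with the lowest dissimilarity score. The input dictionary and the scores in the usage example are hypothetical.

```python
def greedy_pairs(scores):
    """Greedily couple tweet topics with URL topics.

    scores maps (tweet_topic, url_topic) -> dissimilarity (lower is better).
    Each topic on either side is used at most once.
    """
    pairs = []
    used_tweet, used_url = set(), set()
    # Visit candidate pairs from the most similar to the least similar.
    for (tweet_topic, url_topic), _ in sorted(scores.items(), key=lambda item: item[1]):
        if tweet_topic in used_tweet or url_topic in used_url:
            continue
        pairs.append((tweet_topic, url_topic))
        used_tweet.add(tweet_topic)
        used_url.add(url_topic)
    return pairs


# Usage with made-up scores for two tweet topics and two URL topics.
example_scores = {
    ("fun", "fun"): 1.0,
    ("fun", "social"): 2.0,
    ("entertainment", "fun"): 1.5,
    ("entertainment", "social"): 3.0,
}
example_pairs = greedy_pairs(example_scores)
```

A greedy choice is not guaranteed to minimize the total dissimilarity over all pairs, which may explain why some of the listed couplings (for example Movies - Business) look weak.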
11. Results (4)
• Country comparison:
– construct new matrix using:
N(Ci,tweets,tj)*P(Ci,tweets,tj) (i stands for the
countries and j for the topics).
– For both URL and tweets’ content, for each topic
select the most representative 5 countries
– Use the next formula to evaluate how similar the
topics are:

$$\mathrm{eval}(t_i, t_j) = \frac{|\mathrm{Top5Countries}(t_i) \cap \mathrm{Top5Countries}(t_j)|}{5}$$

– eval = 56%
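The country comparison can be sketched as an overlap between two top-5 sets, assuming the evaluation counts the countries shared by the five most representative countries of a tweet topic and of its paired URL topic, divided by 5. The per-country scores stand for the N*P values of the matrix; the numbers in the usage example are invented for illustration.

```python
def top_countries(scores_by_country, k=5):
    """Return the k countries with the highest score for one topic."""
    ranked = sorted(scores_by_country.items(), key=lambda item: item[1], reverse=True)
    return {country for country, _ in ranked[:k]}


def eval_overlap(tweet_scores, url_scores, k=5):
    """Fraction of the top-k countries shared by a tweet topic and a URL topic."""
    shared = top_countries(tweet_scores, k) & top_countries(url_scores, k)
    return len(shared) / k


# Usage with invented N*P scores for one tweet topic and its paired URL topic.
tweet_topic_scores = {"US": 9, "UK": 8, "CA": 7, "ZA": 6, "DE": 5, "FR": 4}
url_topic_scores = {"US": 9, "UK": 8, "CA": 7, "FR": 6, "IT": 5, "DE": 1}
overlap = eval_overlap(tweet_topic_scores, url_topic_scores)
```

A single overall figure such as the reported 56% would presumably come from averaging this value over the ten topic pairs. That aggregation step is an assumption and is not shown on the slide.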
12. Results (5)
• What different countries are talking about:
– USA: about other tweets 50% of the time and about
blogs and other social networks 10% of the time
– UK: about tweets less than 30% of the time and about
other social networks and blogs about 20% of the time
– Canada: about tweets 40% of the time and about other
social media 15% of the time
– South Africa: about tweets 25% of the time and about
other social networks another 20% of the time
• In all the countries the percentage of discussion
about the social and blogs topic is equal to the
percentage of tweets about sports
13. Conclusions (1)
• Low matching between the topics debated in URLs and
tweets (maybe because the tweet doesn't always describe
or summarize the content of the shared URL)
• Analyzing the combined text of the tweets and of the
shared webpages showed that the topics generated from
tweets and shared URLs have only a 56% matching across
different countries
• We expected somewhat similar country
distributions for the computed topics, but the degree to
which a topic is debated is highly influenced by the country;
the cultural differences between countries are at least
partly responsible for this difference
14. Conclusions (2)
• Results should be interpreted carefully:
– not all the countries have a representative number of
tweets in our dataset
– the ratio of English tweets to the total number of
tweets differs for each country (e.g. Brazil has a very low
percentage of English tweets)
– only 2% of the Twitter users set a location stamp
– the people who write tweets usually don't care about
spelling and often use words which are not in the English
dictionary, which causes parsing problems (some
words are ignored)