These slides are from the talk I gave at the ASE/IEEE SocialCom 2013 International Conference, where I presented the research work entitled "Trending Topics on Twitter Improve the Prediction of Google Hot Queries", which was selected among the top 5% of accepted papers.
Once every five minutes, Twitter publishes a list of trending topics by monitoring and analyzing tweets from its users. Similarly, Google publishes an hourly list of hot queries that have been issued to the search engine. In this work, we analyze the time series derived from the daily volume index of each trend, on either Twitter or Google. Our study on a real-world dataset reveals that about 26% of the trending topics arising on Twitter are also found "as-is" among the hot queries issued to Google. We also find that about 72% of the similar trends appear first on Twitter. Thus, we assess the relation between comparable Twitter and Google trends by testing three classes of time series regression models. We validate the forecasting power of Twitter by showing that models that use Google as the dependent variable and Twitter as the explanatory variable retain the past values of Twitter as significant 60% of the time.
1. Trending Topics on Twitter Improve
the Prediction of Google Hot Queries
Gabriele Tolomei
Università Ca’ Foscari Venezia, Italy
Federica Giummolè
Università Ca’ Foscari Venezia, Italy
Salvatore Orlando
Università Ca’ Foscari Venezia, Italy
2013 ASE/IEEE International Conference on Social Computing
September 8th-14th, 2013 - Washington D.C., USA
Monday, September 30, 13
2. Agenda
Social vs. Web Trends
• Introduction
• Methodology
• Experiments & Results
• Conclusion
2013 ASE/IEEE SocialCom09/09/2013, Washington DC, USA 2
4. Twitter
• The most popular real-time microblogging service
• ~ 500M users
• ~ 400M tweets per day on avg. (as of 2012)
• Tweets limited to 140 characters
• Social trends pushed by the social network via user-generated content
• hashtags (#)
• trending topics
5. Google
• The most popular Web search engine
• ~ 5B search queries per day on avg. (as of 2012)
• Web trends derived from search keywords issued by users
• Zeitgeist
• Google (Hot) Trends
6. Social vs. Web Trends
[Figure: example trending keywords, shown in three groups]
• 49ers, dow jones, nba, obama 2016, world war z, ...
• 50 cent, democrats, iphone 5, romney, windows 8, ...
• anne hathaway, barack obama, election, nyc marathon, veterans day, ...
8. Agenda
Social vs. Web Trends
• Introduction
• Methodology
• Experiments & Results
• Conclusion
9. Data Collection
[Diagram: data sources accessed via Streaming API, Search API, and Atom feed]
• 15 consecutive days of crawling
• from 2012-11-01 00:00:00 UTC to 2012-11-15 23:59:59 UTC
• Google
• Hot Trends
• Twitter
• Trending Topics
• Public Timelines
10. Google Hot Trends
[Diagram: Top-20 hourly US queries; Pre-processing & Cleaning; resulting vocabulary VY of keywords y, |VY| = 190]
• e.g., 49ers, ..., election, ..., obama 2016, ..., world war z
11. Search Volume Index
Normalized integer score in [0,100]
Daily relative searches for a keyword limited to a specific country within a range of dates
12. Twitter Trending Topics
[Diagram: Top-10 trending topics every 5 minutes; Top-10 hourly aggregated; Pre-processing & Cleaning; resulting vocabulary VX of keywords x, |VX| = 892]
• e.g., 50 cent, ..., iphone 5, ..., election, ..., windows 8
13. Trend Volume Index
• Use the crawled public timelines
• ~ 260M tweets = 10% random sampling
• To be consistent with Google
• daily relative number of tweets mentioning a particular keyword (this could also be computed hourly)
• normalized integer score in [0,100]
• limited to US and within a range of dates
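A minimal sketch of how such an index could be computed. The slides only state that the score is a daily relative count normalized to an integer in [0,100]; scaling by the peak day, as done here, is an assumption, and the function name `volume_index` and the toy counts are hypothetical.

```python
def volume_index(keyword_counts, total_counts):
    """Scale daily relative frequencies to integer scores in [0, 100]."""
    # Relative share of tweets mentioning the keyword on each day.
    shares = [k / t if t else 0.0 for k, t in zip(keyword_counts, total_counts)]
    peak = max(shares, default=0.0)
    if peak == 0:
        return [0] * len(shares)
    # Assumption: the day with the highest share gets 100, the rest scale to it.
    return [round(100 * s / peak) for s in shares]


# Toy data: tweets mentioning a keyword vs. all sampled tweets, per day.
election = volume_index([60, 100, 1000, 550], [1000, 1000, 1000, 1000])
print(election)
```

Dividing by the daily total before scaling keeps the score relative, so a day with more overall traffic does not inflate the index by itself.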
14. Trend Time Series
• 15 daily observations T = <t1, ..., t15>
• Google
• Hot Trends + SearchVolume Index
• e.g., Yt = election = <5,...,7,40,100,...,15,...>
• Twitter
• Trending Topics + TrendVolume Index
• e.g., Xt = election = <6,...,10,100,55,...,5,...>
15. Trend Pairing
• Not every pair of Google/Twitter trend time series is worth analyzing!
• anne hathaway vs. veterans day
• We focus only on trends that are “similar enough” to each other
• election vs. election
• election vs. barack obama
16. Trend Bipartite Graph
[Figure: bipartite graph linking Twitter keywords x ∈ VX to Google keywords y ∈ VY, with edges weighted by trend similarity]
• VX: ..., 50 cent, ..., democrats, ..., iphone 5, ..., election, ..., romney, ..., windows 8, ...
• VY: ..., 49ers, ..., dow jones, ..., election, ..., nba, ..., obama 2016, ..., world war z, ...
17. Trend Similarity
• Edge weighting scheme of the TBG
• string/lexical: e.g., Levenshtein, Jaccard, n-grams, etc.
• semantic: e.g., Wikipedia-based
• We use the normalized longest common subsequence (nlcs) between two keywords
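A sketch of one possible nlcs computation. The slides do not say how the LCS length is normalized or whether it is taken over characters or words; this version assumes a character-level LCS divided by the length of the longer keyword, so that a score of 1.0 means the two keywords are identical. The function names are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two strings (row-by-row DP)."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def nlcs(x, y):
    """LCS length normalized by the longer keyword (an assumed normalization)."""
    if not x or not y:
        return 0.0
    return lcs_length(x, y) / max(len(x), len(y))


print(nlcs("election", "election"))   # identical keywords
print(nlcs("election", "elections"))  # near-duplicate keywords
```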
18. Datasets
• 2 thresholds on nlcs, η1 = 1.0 and η2 = 0.6, lead to 2 TBGs
• D1 = {(Xt, Yt) | nlcs(x, y) = η1}, |D1| = 50
• D2 = {(Xt, Yt) | nlcs(x, y) ≥ η2}, |D2| = 69
• Aggregate and normalize Twitter time series linked to the same Google keyword in the TBG
• |VX| > |VY|
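The thresholding and aggregation step could look roughly like the sketch below. The slides do not specify how the linked Twitter series are aggregated; summing them day by day and rescaling the result to [0,100] is an assumption, as are the function name `build_dataset`, the edge weights, and the toy series.

```python
from collections import defaultdict


def build_dataset(edges, twitter_series, eta):
    """Keep TBG edges with weight >= eta and merge Twitter series per Google keyword."""
    linked = defaultdict(list)
    for (x, y), weight in edges.items():
        if weight >= eta:
            linked[y].append(twitter_series[x])
    dataset = {}
    for y, group in linked.items():
        # Assumption: aggregate by summing day by day, then rescale the peak to 100.
        summed = [sum(day) for day in zip(*group)]
        peak = max(summed)
        dataset[y] = [round(100 * v / peak) if peak else 0 for v in summed]
    return dataset


# Hypothetical TBG edges: (Twitter keyword, Google keyword) -> similarity.
edges = {
    ("election", "election"): 1.0,
    ("elections", "election"): 0.89,
    ("iphone 5", "election"): 0.2,
}
twitter_series = {
    "election": [20, 100, 50],
    "elections": [20, 60, 30],
    "iphone 5": [100, 5, 5],
}
d1 = build_dataset(edges, twitter_series, eta=1.0)
d2 = build_dataset(edges, twitter_series, eta=0.6)
print(d1, d2)
```

Since similarity never exceeds 1.0, the condition `weight >= 1.0` selects the same edges as the equality used for D1.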
19. Research Questions
1) Is there any relation between a particular pair of (Xt, Yt)?
• Cross-Correlation (lagged relationship)
2) Are variables from Twitter time series useful to forecast those from Google?
• Time series regression
• Why not the opposite? Because in our data the same trend appears first on Twitter about 70% of the time
20. Agenda
Social vs. Web Trends
• Introduction
• Methodology
• Experiments & Results
• Conclusion
21. Cross-Correlation
• Measures the correlation between two time series Xt, Yt shifted by δ time units
• Xt refers to Twitter and Yt refers to Google
• min δ = 1 day
• Check for which δ the cross-correlation is maximum
• X leads Y if one or more Xt+δ, with δ < 0, are predictors of Yt
• X lags Y, otherwise
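A small sketch of the lag search described above. It uses the Pearson correlation of the overlapping parts of the two series at each shift, which is one common estimator and not necessarily the one used in the paper; the function names and the toy series are hypothetical.

```python
from statistics import mean


def pearson(a, b):
    """Pearson correlation of two equally long sequences."""
    ma, mb = mean(a), mean(b)
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    var_a = sum((u - ma) ** 2 for u in a)
    var_b = sum((v - mb) ** 2 for v in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / (var_a * var_b) ** 0.5


def cross_correlation(x, y, delta):
    """Correlation between x[t + delta] and y[t] over the overlapping days."""
    n = len(x)
    if delta < 0:
        xs, ys = x[:n + delta], y[-delta:]
    else:
        xs, ys = x[delta:], y[:n - delta]
    return pearson(xs, ys)


def best_lag(x, y, max_lag=3):
    """Return the shift delta in [-max_lag, max_lag] with maximum cross-correlation."""
    lags = range(-max_lag, max_lag + 1)
    return max(lags, key=lambda d: cross_correlation(x, y, d))


# Toy series: Google repeats Twitter's values one day later.
twitter = [0, 1, 0, 5, 9, 3, 1, 0, 2, 8, 4, 1, 0, 1, 0]
google = [0] + twitter[:-1]
print(best_lag(twitter, google))
```

A negative best lag means that past values of X line up with current values of Y, that is, X leads Y.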
22. Lagged Relationship
Most pairs of time series exhibit their max cross-correlation at lag δ = 0
Nevertheless, some exceptions occur and cross-correlation at lag δ = -1 is still significant
Twitter as measured one day before could help explain Google
23. Time Series Regression
• Relate Y (dependent variable) to a parametric function of a set of explanatory variables X1,...,Xr
• The most widely used function is linear in the parameters
• Linear Regression
$Y = X\beta + \varepsilon$
• $Y$: $k \times 1$ column vector
• $X$: $k \times r$ matrix of observed values for X1,...,Xr
• $X\beta$ is parametrized by $\beta$
• $\varepsilon$: $k \times 1$ column vector of errors
24. Ordinary Least Squares
• Technique to estimate the true vector of coefficients β
• Choose β’ such that:
$\beta' = \arg\min_{\beta} \{ (Y - X\beta)^T (Y - X\beta) \}$
$\beta' = (X^T X)^{-1} X^T Y$
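To make the closed form concrete, here is a dependency-free sketch that builds the normal equations (X^T X) β’ = X^T Y and solves them by Gauss-Jordan elimination. The helper names `solve` and `ols` and the toy data are illustrative; in practice a numerical library's least-squares routine would normally be preferred for stability.

```python
def solve(a, b):
    """Solve the square system a z = b by Gauss-Jordan elimination."""
    n = len(a)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        # Partial pivoting: bring the largest entry of this column up.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("X^T X is singular; columns are collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [u - factor * v for u, v in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def ols(rows, y):
    """Estimate beta' from design-matrix rows and targets via the normal equations."""
    r = len(rows[0])
    xtx = [[sum(row[i] * row[j] for row in rows) for j in range(r)] for i in range(r)]
    xty = [sum(row[i] * target for row, target in zip(rows, y)) for i in range(r)]
    return solve(xtx, xty)


# Toy data generated from y = 2 + 3x; the leading 1.0 in each row is the intercept.
rows = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [2.0, 5.0, 8.0, 11.0]
beta = ols(rows, y)
print(beta)
```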
25. Autoregressive: AR(p)
• The simplest time series regression model
• Relate a variable Yt to a linear combination of up to p of its previous values
$Y_t = \alpha + \phi_1 Y_{t-1} + \phi_2 Y_{t-2} + \dots + \phi_p Y_{t-p} + \varepsilon_t$
• $\alpha, \phi_1, \dots, \phi_p$: parameters
• $\varepsilon_t$: random noise
26. Distributed Lag: DL(q)
• The dependent variable Yt is related only to q+1 values of the explanatory variable Xt: the current one and its q previous ones
$Y_t = \alpha + \psi_1 X_t + \psi_2 X_{t-1} + \dots + \psi_{q+1} X_{t-q} + \varepsilon_t$
• $\alpha, \psi_1, \dots, \psi_{q+1}$: parameters
• $\varepsilon_t$: random noise
27. Autoregressive Distributed Lag: ADL(p,q)
• Relate the dependent variable Yt to lags of itself and of an explanatory variable Xt
$Y_t = \alpha + \phi_1 Y_{t-1} + \phi_2 Y_{t-2} + \dots + \phi_p Y_{t-p} + \psi_1 X_t + \psi_2 X_{t-1} + \dots + \psi_{q+1} X_{t-q} + \varepsilon_t$
• $\alpha, \phi_1, \dots, \phi_p, \psi_1, \dots, \psi_{q+1}$: parameters
• $\varepsilon_t$: random noise
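Fitting an ADL(p,q) model by least squares amounts to arranging the lagged values into a design matrix. The sketch below shows one way to build it; the function name `adl_design` is hypothetical and the two short series are made-up toy values, not data from the study.

```python
def adl_design(y, x, p, q):
    """Build regression rows and targets for ADL(p, q).

    Each row is [1, Y_{t-1}, ..., Y_{t-p}, X_t, X_{t-1}, ..., X_{t-q}]
    and the matching target is Y_t.
    """
    start = max(p, q)  # first t for which every required lag exists
    rows, targets = [], []
    for t in range(start, len(y)):
        y_lags = [y[t - i] for i in range(1, p + 1)]
        x_lags = [x[t - j] for j in range(0, q + 1)]
        rows.append([1.0] + y_lags + x_lags)
        targets.append(y[t])
    return rows, targets


# Toy daily scores (illustrative only).
google = [5, 7, 40, 100, 15]
twitter = [6, 10, 100, 55, 5]
rows, targets = adl_design(google, twitter, p=1, q=1)
for row, target in zip(rows, targets):
    print(row, "->", target)
```

Setting p = 0 yields the DL(q) design; an AR(p) design would simply drop the X columns. With only 15 daily observations per trend, each extra lag costs one usable row, which is one reason to keep p and q small.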
28. Model Comparison
• We measure how often each model, AR(p), DL(q), or ADL(p,q), retains its lagged component as significant
• Null hypothesis H0: “the lagged coefficient is not significant”
• Rejecting H0 means that the lagged coefficient is useful to fit the data
• H0 is rejected whenever the p-value is below a significance level α (e.g., α = .05)
29. Model Evaluation
• Compute both R² ∈ [0,1] and its adjusted variation, which penalizes models with too many explanatory terms
• Describes how well a regression line fits the observed data
• Provides a measure of how well future observations are likely to be predicted by the model
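For reference, a short sketch of both measures using their standard textbook definitions, with n observations and k explanatory terms (intercept excluded). The function names and the toy values are illustrative.

```python
def r_squared(y, fitted):
    """R^2 = 1 - SS_res / SS_tot for observed values y and model predictions."""
    mean_y = sum(y) / len(y)
    ss_res = sum((a - f) ** 2 for a, f in zip(y, fitted))
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    return 1.0 - ss_res / ss_tot


def adjusted_r_squared(r2, n, k):
    """Adjusted R^2 for n observations and k explanatory terms (no intercept)."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)


# Toy example: four observations and a one-variable fit.
observed = [1.0, 2.0, 3.0, 4.0]
fitted = [1.1, 1.9, 3.2, 3.8]
r2 = r_squared(observed, fitted)
adj = adjusted_r_squared(r2, n=len(observed), k=1)
print(round(r2, 4), round(adj, 4))
```

Because the penalty grows with k, adding a lag raises the adjusted score only when it improves the fit enough to justify the extra parameter.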
30. AR(p) vs. DL(q)
On both D1 and D2, DL(q) models retain their q-lagged coefficient much more often than AR(p) models
Twitter is actually useful to fit Google data!
31. ADL(p,q)
Slightly fewer cases where the lagged component of Twitter is significant to predict Google data...
But the adjusted R² is much better than for DL(q)
32. Wrap Up
ADL(1,1) is the best model
Reasonable!
It mixes the autoregressive component of Google with the prediction of Twitter, captured one day before
33. Overcome Limitations
We might expect better results if a finer-grained (hourly) analysis were possible...
Twitter vs. Wikipedia: Upcoming CIKM’13 Workshop
34. Agenda
Social vs. Web Trends
• Introduction
• Methodology
• Experiments & Results
• Conclusion
35. Conclusion
• Relate Twitter trending topics (social trends) with Google hot queries (web trends)
• Trend Bipartite Graph (TBG) links social and web trends
• Time Series Analysis
• maximum cross-correlation occurs at lag 0 but Twitter leads Google significantly (~ 60% of the time)
• the very best model to explain data uses both Twitter and Google lagged coefficients
39. Trend Vocabularies
[Figure: trend vocabularies VX and VY, with example keywords shown in three groups]
• 49ers, dow jones, nba, obama 2016, world war z, ...
• 50 cent, democrats, iphone 5, romney, windows 8, ...
• anne hathaway, barack obama, election, nyc marathon, veterans day, ...
40. Trend Scores
• Given a discrete time interval T = <t1, ..., tT>
• Assign 2 scores (social and web) to each trending keyword during each time unit
• The score measures how strongly a keyword is trending at a given time
41. Trend Time Series
• Model each Twitter/Google trending keyword as a time series of tT random variables
• Each random variable evaluates to the trending score of the keyword
• The observed time series for a trend is the sequence of values of its trending score
42. Trend Bipartite Graph
• 2 disjoint sets of nodes are the vocabularies of Twitter and Google trends
• Weighted edges measure the pairwise trend similarity
• string/lexical: edit distance, LCS, n-grams
• semantic: Wikipedia-based
• TBG identifies a set of pairs of comparable time series associated with similar trends
43. (Weak) Stationarity
Autocorrelation of a stationary variable decays into “noise” and/or negative values within a few lags
[Figure: autocorrelation plots, panels: Google, Twitter]