SocialCom 2013

Trending Topics on Twitter Improve
the Prediction of Google Hot Queries
Gabriele Tolomei
Università Ca’ Foscari Venezia, Italy
Federica Giummolè
Salvatore Orlando
2013 ASE/IEEE International Conference on Social Computing
September 8th-14th, 2013 - Washington D.C., USA
Monday, September 30, 13

Agenda
Social vs. Web Trends
• Introduction
• Methodology
• Experiments & Results
• Conclusion
2013 ASE/IEEE SocialCom, 09/09/2013, Washington DC, USA


Twitter
• The most popular real-time microblogging
service
• ~ 500M users
• ~ 400M tweets per day on avg. (as of 2012)
• Tweets limited to 140 characters
• Social trends pushed by the social network via
user-generated content
• hashtags (#)
• trending topics

Google
• The most popular Web search engine
• ~ 5B search queries per day on avg. (as of 2012)
• Web trends derived from search keywords
issued by users
• Zeitgeist
• Google (Hot) Trends

[Figure: three groups of example trending keywords: 49ers, dow jones, nba, obama 2016, world war z, ... | 50 cent, democrats, iphone 5, romney, windows 8, ... | anne hathaway, barack obama, election, nyc marathon, veterans day, ...]

Which Came First?
[Figure: Volume Index (0 to 100) over Timestamp (11-01 to 11-15) for the keyword “election”, Google vs. Twitter]
Our claim is that a trending topic on Twitter
could later become a hot query on Google


Data Collection
Streaming API
Search API
Atom feed
• 15 consecutive days of crawling
• from 2012-11-01 00:00:00 UTC to 2012-11-15 23:59:59 UTC
• Google
• Hot Trends
• Twitter
• Trending Topics
• Public Timelines

Google Hot Trends
• Top-20 hourly US queries
• Pre-processing & Cleaning
• |VY| = 190 keywords y, e.g., 49ers, ..., election, ..., obama 2016, ..., world war z

SearchVolume Index
Normalized integer score in [0,100]
Daily relative searches for a keyword limited to
a specific country within a range of dates

Twitter Trending Topics
• Top-10 trending topics every 5 minutes
• Top-10 hourly aggregated
• Pre-processing & Cleaning
• |VX| = 892 keywords x, e.g., 50 cent, ..., iphone 5, ..., election, ..., windows 8

TrendVolume Index
• Use the crawled public timelines
• ~ 260M tweets = 10% random sampling
• To be consistent with Google
• daily relative number of tweets mentioning a
particular keyword (could also be hourly!)
• normalized integer score in [0,100]
• limited to US and within a range of dates
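The rescaling to an integer score in [0,100] can be sketched as follows. This is a minimal sketch, assuming the peak day in the range maps to 100 and every other day is scaled proportionally; the slides do not state the exact formula, and the name `volume_index` and the sample counts are hypothetical.

```python
def volume_index(counts):
    """Rescale raw daily counts to integer scores in [0, 100].

    Assumption (not stated in the slides): the largest count in the
    date range maps to 100 and the other days are scaled relative to it.
    """
    peak = max(counts)
    if peak == 0:
        # A keyword that is never mentioned gets an all-zero series.
        return [0 for _ in counts]
    return [round(100 * c / peak) for c in counts]


# Made-up daily tweet counts for one keyword.
daily_mentions = [120, 240, 2400, 1320, 90]
print(volume_index(daily_mentions))  # prints [5, 10, 100, 55, 4]
```

Scaling against the peak makes series with very different absolute volumes comparable, which is the property needed to set a Twitter series next to a Google one.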

Trend Time Series
• 15 daily observations T = <t1, ..., t15>
• Google
• Hot Trends + SearchVolume Index
• e.g., Yt = election = <5,...,7,40,100,...,15,...>
• Twitter
• Trending Topics + TrendVolume Index
• e.g., Xt = election = <6,...,10,100,55,...,5,...>

Trend Pairing
• Not every pair of Google/Twitter trend time series
is worth analyzing!
• anne hathaway vs. veterans day
• We focus only on trends that are “similar enough”
to each other
• election vs. election
• election vs. barack obama

Trend Bipartite Graph
[Figure: bipartite graph between the vocabularies VX and VY; the edge between keywords x and y is weighted by their trend similarity. Example nodes: 49ers, dow jones, election, nba, obama 2016, world war z, ... and 50 cent, democrats, iphone 5, election, romney, windows 8, ...]

Trend Similarity
• Edge weighting scheme of the TBG
• string/lexical: e.g., Levenshtein, Jaccard, n-grams, etc.
• semantic: e.g., Wikipedia-based
• We use the normalized longest common subsequence
(nlcs) between two keywords
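A normalized longest common subsequence score could look like the sketch below. The slides do not give the normalization, so dividing by the length of the longer keyword is an assumption here (it makes the score 1.0 only for identical keywords); the function names `lcs_length` and `nlcs` are illustrative.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of strings a and b."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                # Extend the best subsequence ending before both characters.
                curr.append(prev[j - 1] + 1)
            else:
                # Skip one character from either string.
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def nlcs(x, y):
    """LCS length divided by the longer keyword's length (assumed normalization)."""
    if not x or not y:
        return 0.0
    return lcs_length(x, y) / max(len(x), len(y))


print(nlcs("election", "election"))      # identical keywords score 1.0
print(nlcs("election", "barack obama"))  # a weaker, partial match
```

The dynamic program keeps only two rows, so memory grows with the shorter dimension of the table rather than with the product of both keyword lengths.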

Datasets
• 2 thresholds on nlcs η1 = 1.0 and η2 = 0.6 lead to 2 TBGs
• D1 = {(Xt, Yt) | nlcs (x, y) = η1}, |D1| = 50
• D2 = {(Xt, Yt) | nlcs (x, y) >= η2}, |D2| = 69
• Aggregate and normalize Twitter time series
linked to the same Google keyword in the TBG
• |VX| > |VY|

Research Questions
1) Is there any relation between a particular pair
of (Xt,Yt)?
• Cross-Correlation (lagged relationship)
2) Are variables from Twitter time series useful
to forecast those from Google?
• Time series regression
...Why not the opposite?
Because in our data the same trend appears first on Twitter
about 70% of the time


Cross-Correlation
• Measures the correlation between two time
series Xt, Yt shifted by δ time units
• Xt refers to Twitter and Yt refers to Google
• min δ = 1 day
• Check for which δ the cross-correlation is
maximum
• X leads Y if one or more Xt+δ are predictors
of Yt and δ < 0
• X lags Y, otherwise
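The lag search described above can be sketched in plain Python. This is an illustrative version, assuming equal-length series and a Pearson correlation computed only over the overlapping observations of Xt+δ and Yt; standard sample cross-correlation estimators normalize slightly differently, and the helper names and toy data are hypothetical.

```python
from math import sqrt


def pearson(u, v):
    """Pearson correlation of two equal-length sequences."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    sd_u = sqrt(sum((a - mean_u) ** 2 for a in u))
    sd_v = sqrt(sum((b - mean_v) ** 2 for b in v))
    if sd_u == 0 or sd_v == 0:
        return 0.0  # a constant series carries no correlation signal
    return cov / (sd_u * sd_v)


def cross_correlation(x, y, delta):
    """Correlation between x[t + delta] and y[t] on the overlapping range."""
    n = len(x)
    pairs = [(x[t + delta], y[t]) for t in range(n) if 0 <= t + delta < n]
    xs, ys = zip(*pairs)
    return pearson(xs, ys)


def best_lag(x, y, max_lag):
    """Lag delta in [-max_lag, max_lag] with the highest cross-correlation."""
    return max(range(-max_lag, max_lag + 1),
               key=lambda d: cross_correlation(x, y, d))


# Toy series: y repeats x one day later, so x leads y (delta < 0).
x = [0, 1, 5, 9, 3, 1, 0, 0]
y = [0, 0, 1, 5, 9, 3, 1, 0]
print(best_lag(x, y, max_lag=2))
```

With only 15 daily observations per series, each extra unit of lag removes one pair from the overlap, so small values of `max_lag` keep the estimates from resting on very few points.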

Lagged Relationship
Most pairs of time series exhibit their max cross-correlation at lag δ = 0
Nevertheless, some exceptions occur and cross-correlation at lag δ = -1 is still significant
Twitter as measured one day before could help explain Google

Time Series Regression
• Relate Y (dependent variable) to a parametric function
of a set of explanatory variables X1,...,Xr
• The most widely used function is linear in the parameters
• Linear Regression
$Y = X\beta + \varepsilon$
• Y: k×1 column vector
• X: k×r matrix of observed values for X1,...,Xr, parametrized by β
• ε: k×1 column vector of errors

Ordinary Least Squares
• Technique to estimate the real vector of coefficients β
• Choose β’ such that:
$\beta' = \arg\min_{\beta} \{ (Y - X\beta)^{T} (Y - X\beta) \}$
$\beta' = (X^{T} X)^{-1} X^{T} Y$
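The closed form β’ = (XᵀX)⁻¹XᵀY can be checked on a small example. The sketch below is illustrative only: it solves the normal equations by Gauss-Jordan elimination in plain Python, which is fine for a handful of regressors but not what a statistics package would use; the name `ols` and the toy data are assumptions.

```python
def ols(X, y):
    """Estimate beta by solving (X^T X) beta = X^T y.

    X is a list of k rows with r values each; y is a list of k values.
    """
    k, r = len(X), len(X[0])
    # Augmented normal-equation system [X^T X | X^T y].
    A = [[sum(X[i][a] * X[i][b] for i in range(k)) for b in range(r)]
         + [sum(X[i][a] * y[i] for i in range(k))]
         for a in range(r)]
    for col in range(r):
        # Partial pivoting: use the row with the largest entry in this column.
        pivot = max(range(col, r), key=lambda row: abs(A[row][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("X^T X is singular")
        A[col], A[pivot] = A[pivot], A[col]
        scale = A[col][col]
        A[col] = [v / scale for v in A[col]]
        # Eliminate this column from every other row.
        for row in range(r):
            if row != col:
                factor = A[row][col]
                A[row] = [v - factor * p for v, p in zip(A[row], A[col])]
    return [A[a][r] for a in range(r)]


# Toy data generated from y = 2 + 3x; the first column is the intercept.
X = [[1, 0], [1, 1], [1, 2], [1, 3]]
y = [2, 5, 8, 11]
print(ols(X, y))  # approximately [2.0, 3.0]
```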

Autoregressive: AR(p)
• The simplest time series regression model
• Relate a variable Yt to a linear combination of
up to p of its previous values
$Y_t = \alpha + \phi_1 Y_{t-1} + \phi_2 Y_{t-2} + \dots + \phi_p Y_{t-p} + \varepsilon_t$
where α, φ1, ..., φp are parameters and εt is random noise

Distributed Lag: DL(q)
• The dependent variable Yt is related only to an
explanatory variable Xt and its q previous values (q+1 terms)
$Y_t = \alpha + \psi_1 X_t + \psi_2 X_{t-1} + \dots + \psi_{q+1} X_{t-q} + \varepsilon_t$

Autoregressive Distributed Lag:
ADL(p,q)
• Relate the dependent variable Yt to lags of
itself and of an explanatory variable Xt
$Y_t = \alpha + \phi_1 Y_{t-1} + \phi_2 Y_{t-2} + \dots + \phi_p Y_{t-p} + \psi_1 X_t + \psi_2 X_{t-1} + \dots + \psi_{q+1} X_{t-q} + \varepsilon_t$
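To fit ADL(p,q) by least squares, the lagged values first have to be arranged as regression rows. The sketch below shows one possible layout, assuming equal-length daily series; the helper name `adl_design` and the sample values are hypothetical and not taken from the talk.

```python
def adl_design(x, y, p, q):
    """Build regression rows for ADL(p, q).

    Each row is [1, y[t-1], ..., y[t-p], x[t], x[t-1], ..., x[t-q]]
    and the matching target is y[t]. Rows start at t = max(p, q), the
    first time step where every required lag is available.
    """
    start = max(p, q)
    rows, targets = [], []
    for t in range(start, len(y)):
        row = [1.0]                                   # intercept (alpha)
        row += [y[t - i] for i in range(1, p + 1)]    # phi terms
        row += [x[t - j] for j in range(0, q + 1)]    # psi terms
        rows.append(row)
        targets.append(y[t])
    return rows, targets


# Made-up Twitter (x) and Google (y) volume indexes.
x = [6, 10, 100, 55, 20, 5]
y = [5, 7, 40, 100, 30, 15]
rows, targets = adl_design(x, y, p=1, q=1)
print(rows[0], targets[0])  # prints [1.0, 5, 10, 6] 7
```

Setting p = 0 drops the autoregressive columns and gives the DL(q) layout. Each additional lag costs one usable observation, which matters with only 15 days of data.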

Model Comparison
• We measure how likely a model AR(p), DL(q),
ADL(p,q) retains its lagged component as significant
• Null hypothesis H0: “the lagged coefficient is not significant”
• Rejecting H0 means that the lagged coefficient is useful
to fit the data
• H0 is rejected whenever the p-value is below a
significance level α (e.g., α = .05)

Model Evaluation
• Compute both R2 ∈ [0,1] and its adjusted
variation, which penalizes models with too
many explanatory terms
• Describes how well a regression line fits the
observed data
• Provides a measure of how well future observations
are likely to be predicted by the model
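Both scores can be computed directly from observed and fitted values. The sketch below uses the standard textbook definitions, assuming n observations and k explanatory terms excluding the intercept; the function names and sample values are illustrative.

```python
def r_squared(y, y_hat):
    """R^2 = 1 - SS_res / SS_tot for observed y and fitted y_hat."""
    mean_y = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    return 1 - ss_res / ss_tot


def adjusted_r_squared(r2, n, k):
    """Adjusted R^2 for n observations and k explanatory terms.

    The (n - 1) / (n - k - 1) factor grows with k, so an extra term
    must improve the fit enough to offset the penalty.
    """
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


# Made-up observed and fitted values.
y = [1, 2, 3, 4, 5]
y_hat = [1.5, 2, 3, 4, 4.5]
r2 = r_squared(y, y_hat)
print(r2, adjusted_r_squared(r2, n=len(y), k=1))
```

Because the penalty depends on k, the adjusted score is the one to use when comparing models with different numbers of lagged terms, such as DL(q) against ADL(p,q).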

AR(p) vs. DL(q)
On both D1 and D2, DL(q) models retain their q-lagged
coefficient much more often than AR(p) models
Twitter is actually useful to fit Google data!

ADL(p,q)
Slightly fewer cases where the lagged component
of Twitter is significant to predict Google data...
But adjusted R2 evaluates much better than DL(q)

Wrap Up
ADL(1,1) is the best model
Reasonable!
It mixes the autoregressive component of Google with the
prediction of Twitter, captured one day before

Overcome Limitations
We might expect better results
if finer-grained analysis (hourly) were possible...
Twitter vs. Wikipedia: Upcoming CIKM’13 Workshop


Conclusion
• Relate Twitter trending topics (social trends)
with Google hot queries (web trends)
• Trend Bipartite Graph (TBG) links social and
web trends
• Time Series Analysis
• maximum cross-correlation occurs at lag-0 but
Twitter leads Google significantly (~ 60% of times)
• the very best model to explain data uses both
Twitter and Google lagged coefficients

Thank You!
Questions?

Backup

Trend Vocabularies
[Figure: the vocabularies VX and VY with three groups of example keywords: 49ers, dow jones, nba, obama 2016, world war z, ... | 50 cent, democrats, iphone 5, romney, windows 8, ... | anne hathaway, barack obama, election, nyc marathon, veterans day, ...]

Trend Scores
• Given a discrete time interval T = <t1, ..., tT>
• Assign 2 scores (social and web) to each
trending keyword during each time unit
• The score measures how strongly a keyword
is trending at a given time

Trend Time Series
• Model each Twitter/Google trending keyword as
a time series of tT random variables
• Each random variable evaluates to the trending
score of the keyword
• The observed time series for a trend is the
sequence of values of its trending score

Trend Bipartite Graph
• 2 disjoint sets of nodes are the vocabularies of
Twitter and Google trends
• Weighted edges measure the pairwise trend
similarity
• string/lexical: edit distance, LCS, n-grams
• semantic: Wikipedia-based
• TBG identifies a set of pairs of comparable
time series associated with similar trends

(Weak) Stationarity
Autocorrelation of a stationary variable decays
into “noise” and/or negative values within a few lags
[Figure: autocorrelation plots for Google and Twitter]
