Trending Topics on Twitter Improve
the Prediction of Google Hot Queries
Gabriele Tolomei
Università Ca’ Foscari Venezia, Italy
Federica Giummolè
Università Ca’ Foscari Venezia, Italy
Salvatore Orlando
Università Ca’ Foscari Venezia, Italy
2013 ASE/IEEE International Conference on Social Computing
September 8th-14th, 2013 - Washington D.C., USA
Monday, September 30, 13
Agenda
Social vs. Web Trends
• Introduction
• Methodology
• Experiments & Results
• Conclusion
2013 ASE/IEEE SocialCom, 09/09/2013, Washington DC, USA
Twitter
• The most popular real-time microblogging
service
• ~ 500M users
• ~ 400M tweets per day on avg. (as of 2012)
• Tweets limited to 140 characters
• Social trends pushed by the social network via
user-generated content
• hashtags (#)
• trending topics
Google
• The most popular Web search engine
• ~ 5B search queries per day on avg. (as of 2012)
• Web trends derived from search keywords
issued by users
• Zeitgeist
• Google (Hot) Trends
Social vs. Web Trends
[Figure: example trend keywords shown in three groups]
• 49ers, dow jones, nba, obama 2016, world war z
• 50 cent, democrats, iphone 5, romney, windows 8
• anne hathaway, barack obama, election, nyc marathon, veterans day
Which Came First?
[Figure: Volume Index (0 to 100) over Timestamp (11-01 to 11-15) for the keyword “election”, with one curve for Google and one for Twitter]
Our claim is that a trending topic on Twitter could later become a hot query on Google
Data Collection
• 15 consecutive days of crawling
• from 2012-11-01 00:00:00 UTC to 2012-11-15 23:59:59 UTC
• Google
• Hot Trends
• Twitter
• Trending Topics
• Public Timelines
[Diagram labels: Streaming API, Search API, Atom feed]
Google Hot Trends
• Top-20 hourly US queries
• Pre-processing & cleaning
• Vocabulary of Google trend keywords y: |VY| = 190
• e.g., 49ers, election, obama 2016, world war z
Search Volume Index
Normalized integer score in [0,100]
Daily relative searches for a keyword limited to
a specific country within a range of dates
Twitter Trending Topics
• Top-10 trending topics every 5 minutes
• Top-10 hourly aggregated
• Pre-processing & cleaning
• Vocabulary of Twitter trend keywords x: |VX| = 892
• e.g., 50 cent, iphone 5, election, windows 8
Trend Volume Index
• Use the crawled public timelines: ~260M tweets, a 10% random sample
• To be consistent with Google:
• daily relative number of tweets mentioning a particular keyword (could be hourly!)
• normalized integer score in [0,100]
• limited to the US and within a range of dates
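The slides do not spell out how the Trend Volume Index is normalized. A minimal sketch, assuming a Google-Trends-style scaling in which each day's share of tweets mentioning the keyword is divided by the peak share and mapped to an integer in [0,100]; the function name and both input lists are hypothetical:

```python
def trend_volume_index(keyword_counts, total_counts):
    """Scale daily keyword shares to integers in [0, 100].

    Assumption (not stated in the slides): the day with the highest
    relative share maps to 100, as in Google's Search Volume Index.
    """
    if len(keyword_counts) != len(total_counts):
        raise ValueError("both series must cover the same days")
    # Relative number of tweets mentioning the keyword on each day.
    shares = [k / t if t > 0 else 0.0
              for k, t in zip(keyword_counts, total_counts)]
    peak = max(shares, default=0.0)
    if peak == 0.0:
        return [0 for _ in shares]
    return [round(100 * s / peak) for s in shares]


# Hypothetical counts for three days of sampled tweets.
index = trend_volume_index([10, 50, 25], [1000, 1000, 1000])
```

Dividing by the daily total first keeps a keyword from looking "hotter" simply because more tweets were sampled that day.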
Trend Time Series
• 15 daily observations T = <t1, ..., t15>
• Google
• Hot Trends + Search Volume Index
• e.g., Yt = election = <5,...,7,40,100,...,15,...>
• Twitter
• Trending Topics + Trend Volume Index
• e.g., Xt = election = <6,...,10,100,55,...,5,...>
Trend Pairing
• Not every pair of Google/Twitter trend time series
is worth analyzing!
• anne hathaway vs. veterans day
• We focus only on trends that are “similar enough”
to each other
• election vs. election
• election vs. barack obama
Trend Bipartite Graph
[Figure: bipartite graph between the vocabularies VX and VY; the edge between a Twitter trend x and a Google trend y is weighted by their trend similarity]
• Google side (y): 49ers, dow jones, election, nba, obama 2016, world war z
• Twitter side (x): 50 cent, democrats, iphone 5, election, romney, windows 8
Trend Similarity
• Edge weighting scheme of the TBG
• string/lexical: e.g., Levenshtein, Jaccard, n-grams, etc.
• semantic: e.g., Wikipedia-based
• We use the normalized longest common subsequence
(nlcs) between two keywords
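The slides name the normalized longest common subsequence but not its exact normalization. A minimal sketch, assuming the LCS length is divided by the length of the longer keyword so that identical keywords score 1.0; `lcs_length` and `nlcs` are illustrative names:

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of strings a and b."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[len(b)]


def nlcs(x, y):
    """LCS length normalized to [0, 1].

    Assumption: normalize by the longer keyword; the slides do not
    say which normalization the authors used.
    """
    if not x or not y:
        return 0.0
    return lcs_length(x, y) / max(len(x), len(y))


same = nlcs("election", "election")          # identical keywords
related = nlcs("obama 2016", "barack obama")  # shares "obama"
```

Under this normalization a score of 1.0 is reached only by identical keywords, which matches the use of 1.0 as the strictest threshold.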
Datasets
• Two thresholds on nlcs, η1 = 1.0 and η2 = 0.6, lead to two TBGs
• D1 = {(Xt, Yt) | nlcs(x, y) = η1}, |D1| = 50
• D2 = {(Xt, Yt) | nlcs(x, y) ≥ η2}, |D2| = 69
• Aggregate and normalize Twitter time series
linked to the same Google keyword in the TBG
• |VX| > |VY|
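The thresholding and aggregation step could be sketched as follows. This is only an illustration under stated assumptions: "aggregate" is taken to be an element-wise sum and "normalize" a rescaling to a peak of 100, neither of which the slides confirm. The edge weights, keywords, and series are made up.

```python
def build_dataset(edges, twitter_series, google_series, threshold):
    """Pair each Google keyword y with one aggregated Twitter series.

    edges: iterable of (x, y, weight) triples from the TBG.
    Keeps edges with weight >= threshold. Because the similarity never
    exceeds 1.0, threshold = 1.0 selects exactly the edges used for D1.
    """
    summed = {}
    for x, y, weight in edges:
        if weight < threshold:
            continue
        series = twitter_series[x]
        acc = summed.setdefault(y, [0] * len(series))
        for i, value in enumerate(series):
            acc[i] += value  # assumed aggregation: element-wise sum
    dataset = {}
    for y, acc in summed.items():
        peak = max(acc)
        # Assumed normalization: rescale so the peak is 100 again.
        scaled = [round(100 * v / peak) if peak else 0 for v in acc]
        dataset[y] = (scaled, google_series[y])
    return dataset


# Hypothetical TBG fragment: two Twitter trends linked to one Google query.
edges = [("election", "election", 1.0),
         ("elections", "election", 0.89),
         ("romney", "election", 0.1)]
twitter = {"election": [10, 100, 50],
           "elections": [0, 50, 10],
           "romney": [5, 5, 5]}
google = {"election": [5, 100, 15]}

d1 = build_dataset(edges, twitter, google, threshold=1.0)
d2 = build_dataset(edges, twitter, google, threshold=0.6)
```

Since |VX| > |VY|, several Twitter trends can map to one Google keyword, which is why an aggregation step is needed before pairing the series.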
Research Questions
1) Is there any relation between the two series in a particular pair (Xt, Yt)?
• Cross-correlation (lagged relationship)
2) Are variables from Twitter time series useful
to forecast those from Google?
• Time series regression
...Why not the opposite?
Because in our data the same trend appears first on Twitter about 70% of the time
Cross-Correlation
• Measures the correlation between two time
series Xt, Yt shifted by δ time units
• Xt refers to Twitter and Yt refers to Google
• min δ = 1 day
• Check for which δ the cross-correlation is
maximum
• X leads Y if one or more Xt+δ are predictors
of Yt and δ < 0
• X lags Y, otherwise
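The lag search described above can be sketched in a few lines. This is a simplified illustration, not the authors' code: it computes the Pearson correlation on the overlapping part of the two series for each shift δ, whereas the textbook sample cross-correlation normalizes with full-series statistics. The example series are invented.

```python
from statistics import mean


def pearson(a, b):
    """Pearson correlation of two equally long sequences."""
    mean_a, mean_b = mean(a), mean(b)
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / (var_a * var_b) ** 0.5


def cross_correlation(x, y, delta):
    """Correlation between x[t + delta] and y[t] on the overlapping range.

    A negative delta pairs earlier Twitter values with later Google
    values, i.e. it tests whether X leads Y.
    """
    n = len(x)
    pairs = [(x[t + delta], y[t]) for t in range(n) if 0 <= t + delta < n]
    xs, ys = zip(*pairs)
    return pearson(xs, ys)


def best_lag(x, y, max_lag):
    """Lag in [-max_lag, max_lag] with the highest cross-correlation."""
    return max(range(-max_lag, max_lag + 1),
               key=lambda d: cross_correlation(x, y, d))


# Hypothetical series: y repeats x one day later, so X leads Y by one day.
x = [0, 0, 10, 100, 50, 10, 0, 0]
y = [0, 0, 0, 10, 100, 50, 10, 0]
lag = best_lag(x, y, max_lag=2)
```

With only 15 daily observations per trend, each additional lag removes one pair from the overlap, so only small values of `max_lag` are meaningful.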
Lagged Relationship
Most pairs of time series exhibit their
maximum cross-correlation at lag δ = 0
Nevertheless, some exceptions
occur, and the cross-correlation at lag
δ = -1 is still significant
Twitter as measured one day before could help explain Google
Time Series Regression
• Relate Y (dependent variable) to a parametric function
of a set of explanatory variables X1,...,Xr
• The most widely used function is linear in the parameters
• Linear Regression
$Y = X\beta + \varepsilon$
• Y: k×1 column vector
• X: k×r matrix of observed values for X1,...,Xr
• β: r×1 vector of parameters
• ε: k×1 column vector of errors
Ordinary Least Squares
• Technique to estimate the true vector of
coefficients β
• Choose β’ such that:
$\beta' = \arg\min_{\beta} \{(Y - X\beta)^T (Y - X\beta)\}$
$\beta' = (X^T X)^{-1} X^T Y$
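The closed-form estimate can be illustrated with a small pure-Python sketch that forms the normal equations (XᵀX)β’ = XᵀY and solves them by Gauss-Jordan elimination. The helper names `solve` and `ols` are illustrative; in practice a numerical library's least-squares routine would be the more robust choice.

```python
def solve(a, b):
    """Solve the square system a z = b by Gauss-Jordan elimination
    with partial pivoting."""
    n = len(a)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("X^T X is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [u - factor * v for u, v in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def ols(X, Y):
    """Ordinary least squares estimate for Y = X beta + eps.

    X is a list of k rows with r entries each; Y is a list of k values.
    """
    r = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(r)]
           for i in range(r)]
    xty = [sum(row[i] * y for row, y in zip(X, Y)) for i in range(r)]
    return solve(xtx, xty)


# Toy data lying exactly on y = 1 + 2x; the leading 1 is the intercept column.
X = [[1, 0], [1, 1], [1, 2], [1, 3]]
Y = [1, 3, 5, 7]
beta = ols(X, Y)
```

Solving the normal equations directly avoids computing the inverse of XᵀX explicitly, which is the usual advice even though the slide writes the estimator in inverse form.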
Autoregressive: AR(p)
• The simplest time series regression model
• Relate a variable Yt to a linear combination of
up to p of its previous values
$Y_t = \alpha + \varphi_1 Y_{t-1} + \varphi_2 Y_{t-2} + \dots + \varphi_p Y_{t-p} + \varepsilon_t$
where α, φ1, ..., φp are the parameters and εt is random noise
Distributed Lag: DL(q)
• The dependent variable Yt is related only to the explanatory variable X,
through its current value Xt and its q previous values (q+1 terms)
$Y_t = \alpha + \psi_1 X_t + \psi_2 X_{t-1} + \dots + \psi_{q+1} X_{t-q} + \varepsilon_t$
where α, ψ1, ..., ψq+1 are the parameters and εt is random noise
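Autoregressive Distributed Lag: ADL(p,q)
• Relate the dependent variable Yt to lags of
itself and of an explanatory variable Xt
$Y_t = \alpha + \varphi_1 Y_{t-1} + \varphi_2 Y_{t-2} + \dots + \varphi_p Y_{t-p} + \psi_1 X_t + \psi_2 X_{t-1} + \dots + \psi_{q+1} X_{t-q} + \varepsilon_t$
where α, φ1, ..., φp, ψ1, ..., ψq+1 are the parameters and εt is random noise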
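To make the lag structure concrete, the sketch below builds the regression rows for an ADL(p,q) model from a Google series y and a Twitter series x. The function name and the example values are illustrative only; the resulting rows and targets can be handed to any least-squares routine.

```python
def adl_design(y, x, p, q):
    """Build regression rows and targets for an ADL(p, q) model.

    Each row is [1, y[t-1], ..., y[t-p], x[t], x[t-1], ..., x[t-q]]
    and the matching target is y[t]. Setting p = 0 gives the rows of
    a DL(q) model.
    """
    if len(x) != len(y):
        raise ValueError("both series must have the same length")
    start = max(p, q)  # first t for which every lag is available
    rows, targets = [], []
    for t in range(start, len(y)):
        ar_terms = [y[t - i] for i in range(1, p + 1)]
        dl_terms = [x[t - j] for j in range(0, q + 1)]
        rows.append([1.0] + ar_terms + dl_terms)
        targets.append(y[t])
    return rows, targets


# Hypothetical five-day excerpt of a Google (y) and a Twitter (x) series.
y = [5, 7, 40, 100, 15]
x = [6, 10, 100, 55, 5]
rows, targets = adl_design(y, x, p=1, q=1)
```

Each extra lag costs one usable observation: with 15 daily values, an ADL(1,1) model is fitted on 14 rows with four parameters (α, φ1, ψ1, ψ2).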
Model Comparison
• We measure how often each model, AR(p), DL(q), or
ADL(p,q), retains its lagged component as significant
• Null hypothesis H0: “the lagged coefficient is not significant”
• Rejecting H0 means that the lagged coefficient is useful
to fit the data
• H0 is rejected whenever the p-value is below a
significance level α (e.g., α = .05)
Model Evaluation
• Compute both $R^2 \in [0,1]$ and its adjusted
variant, which penalizes models with too
many explanatory terms
• Describes how well a regression line fits the
observed data
• Provides a measure of how well future observations
are likely to be predicted by the model
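Both scores can be computed directly from observed and fitted values. The sketch below uses the standard textbook definitions, which the slides do not write out; the numbers are invented for illustration.

```python
def r_squared(y, y_hat):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y) / len(y)
    ss_res = sum((obs - fit) ** 2 for obs, fit in zip(y, y_hat))
    ss_tot = sum((obs - mean_y) ** 2 for obs in y)
    return 1.0 - ss_res / ss_tot


def adjusted_r_squared(r2, n, k):
    """Adjusted R^2 for n observations and k explanatory terms
    (not counting the intercept); requires n > k + 1."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)


# Hypothetical observed and fitted values.
observed = [1, 2, 3, 4]
fitted = [1.1, 1.9, 3.2, 3.8]
r2 = r_squared(observed, fitted)
adj = adjusted_r_squared(r2, n=len(observed), k=1)
```

Because the penalty grows with k, adding a lag that barely reduces the residual error lowers the adjusted score even though plain R² can never decrease.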
AR(p) vs. DL(q)
On both D1 and D2, DL(q) models retain their q-lagged
coefficient much more often than AR(p) models
Twitter is actually useful to fit Google data!
ADL(p,q)
Slightly fewer cases where the lagged component
of Twitter is significant to predict Google data...
But its adjusted R² is much better than that of DL(q)
Wrap Up
ADL(1,1) is the best model
Reasonable!
It mixes the autoregressive component of Google with the
prediction of Twitter, captured one day before
Overcome Limitations
We might expect better results
if finer-grained (hourly) analysis were possible...
Twitter vs. Wikipedia: upcoming CIKM’13 workshop
Conclusion
• Relate Twitter trending topics (social trends)
with Google hot queries (web trends)
• Trend Bipartite Graph (TBG) links social and
web trends
• Time Series Analysis
• maximum cross-correlation occurs at lag 0, but
Twitter leads Google significantly (~60% of the time)
• the very best model to explain the data uses both
Twitter and Google lagged coefficients
Thank You!
Questions?
Backup
Trend Vocabularies
[Figure: the vocabularies VX and VY, with example keywords shown in three groups]
• 49ers, dow jones, nba, obama 2016, world war z
• 50 cent, democrats, iphone 5, romney, windows 8
• anne hathaway, barack obama, election, nyc marathon, veterans day
Trend Scores
• Given a discrete time interval T = <t1, ..., tT>
• Assign 2 scores (social and web) to each
trending keyword during each time unit
• The score measures how strongly a keyword
is trending at a given time
Trend Time Series
• Model each Twitter/Google trending keyword as
a time series of tT random variables
• Each random variable evaluates to the trending
score of the keyword
• The observed time series for a trend is the
sequence of values of its trending score
Trend Bipartite Graph
• 2 disjoint sets of nodes are the vocabularies of
Twitter and Google trends
• Weighted edges measure the pairwise trend
similarity
• string/lexical: edit distance, LCS, n-grams
• semantic: Wikipedia-based
• TBG identifies a set of pairs of comparable
time series associated with similar trends
(Weak) Stationarity
The autocorrelation of a stationary variable decays
into “noise” and/or negative values within a few lags
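A quick way to eyeball this behaviour is to compute the sample autocorrelation at increasing lags. The sketch below uses the usual sample estimator with the full-series mean and variance; the alternating series is an invented example, not data from the study.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of a series at a non-negative lag."""
    n = len(series)
    mu = sum(series) / n
    denom = sum((v - mu) ** 2 for v in series)
    if denom == 0:
        return 0.0
    num = sum((series[t] - mu) * (series[t + lag] - mu)
              for t in range(n - lag))
    return num / denom


# Hypothetical alternating series: strongly negative at lag 1.
series = [1, -1, 1, -1, 1, -1]
acf = [autocorrelation(series, k) for k in range(3)]
```

Plotting such values against the lag gives the kind of correlogram used to judge whether a series looks (weakly) stationary.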
[Figure: autocorrelation plots in two panels, Google and Twitter]

Weitere ähnliche Inhalte

Ähnlich wie SocialCom 2013

Data Modeling with Neo4j
Data Modeling with Neo4jData Modeling with Neo4j
Data Modeling with Neo4jNeo4j
 
Research @ RELEASeD (presented at SATTOSE2013)
Research @ RELEASeD (presented at SATTOSE2013)Research @ RELEASeD (presented at SATTOSE2013)
Research @ RELEASeD (presented at SATTOSE2013)kim.mens
 
Top-N Recommendations from Implicit Feedback leveraging Linked Open Data
Top-N Recommendations from Implicit Feedback leveraging Linked Open DataTop-N Recommendations from Implicit Feedback leveraging Linked Open Data
Top-N Recommendations from Implicit Feedback leveraging Linked Open DataVito Ostuni
 
Effective Strategies for Creating Scientific graphics
Effective Strategies for Creating Scientific graphicsEffective Strategies for Creating Scientific graphics
Effective Strategies for Creating Scientific graphicsJoel Kelly
 
Data Sets as Facilitator for new Products and Services for Universities
Data Sets as Facilitator for new Products and Services for UniversitiesData Sets as Facilitator for new Products and Services for Universities
Data Sets as Facilitator for new Products and Services for UniversitiesHendrik Drachsler
 
2014-10-10-SBC361-Reproducible research
2014-10-10-SBC361-Reproducible research2014-10-10-SBC361-Reproducible research
2014-10-10-SBC361-Reproducible researchYannick Wurm
 
Link Analysis in Networks - or - Finding the Terrorists
Link Analysis in Networks - or - Finding the TerroristsLink Analysis in Networks - or - Finding the Terrorists
Link Analysis in Networks - or - Finding the TerroristsJames McGivern
 
master defense hyun-wong choi_2019_05_14_rev19
master defense hyun-wong choi_2019_05_14_rev19master defense hyun-wong choi_2019_05_14_rev19
master defense hyun-wong choi_2019_05_14_rev19Hyun Wong Choi
 
master defense hyun-wong choi_2019_05_14_rev19
master defense hyun-wong choi_2019_05_14_rev19master defense hyun-wong choi_2019_05_14_rev19
master defense hyun-wong choi_2019_05_14_rev19Hyun Wong Choi
 
master defense hyun-wong choi_2019_05_14_rev19
master defense hyun-wong choi_2019_05_14_rev19master defense hyun-wong choi_2019_05_14_rev19
master defense hyun-wong choi_2019_05_14_rev19Hyun Wong Choi
 
defense hyun-wong choi_2019_05_14_rev18
defense hyun-wong choi_2019_05_14_rev18defense hyun-wong choi_2019_05_14_rev18
defense hyun-wong choi_2019_05_14_rev18Hyun Wong Choi
 
Final edited master defense-hyun_wong choi_2019_05_23_rev21
Final edited master defense-hyun_wong choi_2019_05_23_rev21Final edited master defense-hyun_wong choi_2019_05_23_rev21
Final edited master defense-hyun_wong choi_2019_05_23_rev21Hyun Wong Choi
 
Adventures in Crowdsourcing: Research at UT Austin & Beyond
Adventures in Crowdsourcing: Research at UT Austin & BeyondAdventures in Crowdsourcing: Research at UT Austin & Beyond
Adventures in Crowdsourcing: Research at UT Austin & BeyondMatthew Lease
 
EDUC5102G Session 2 Presentation
EDUC5102G Session 2 PresentationEDUC5102G Session 2 Presentation
EDUC5102G Session 2 PresentationRobert Power
 
[系列活動] 人工智慧與機器學習在推薦系統上的應用
[系列活動] 人工智慧與機器學習在推薦系統上的應用[系列活動] 人工智慧與機器學習在推薦系統上的應用
[系列活動] 人工智慧與機器學習在推薦系統上的應用台灣資料科學年會
 
405_02_Montgomery_Introduction-to-statistical-quality-control-7th-edtition-20...
405_02_Montgomery_Introduction-to-statistical-quality-control-7th-edtition-20...405_02_Montgomery_Introduction-to-statistical-quality-control-7th-edtition-20...
405_02_Montgomery_Introduction-to-statistical-quality-control-7th-edtition-20...ssuserac56571
 

Ähnlich wie SocialCom 2013 (20)

Data Modeling with Neo4j
Data Modeling with Neo4jData Modeling with Neo4j
Data Modeling with Neo4j
 
Research @ RELEASeD (presented at SATTOSE2013)
Research @ RELEASeD (presented at SATTOSE2013)Research @ RELEASeD (presented at SATTOSE2013)
Research @ RELEASeD (presented at SATTOSE2013)
 
Top-N Recommendations from Implicit Feedback leveraging Linked Open Data
Top-N Recommendations from Implicit Feedback leveraging Linked Open DataTop-N Recommendations from Implicit Feedback leveraging Linked Open Data
Top-N Recommendations from Implicit Feedback leveraging Linked Open Data
 
Effective Strategies for Creating Scientific graphics
Effective Strategies for Creating Scientific graphicsEffective Strategies for Creating Scientific graphics
Effective Strategies for Creating Scientific graphics
 
Data Sets as Facilitator for new Products and Services for Universities
Data Sets as Facilitator for new Products and Services for UniversitiesData Sets as Facilitator for new Products and Services for Universities
Data Sets as Facilitator for new Products and Services for Universities
 
2014-10-10-SBC361-Reproducible research
2014-10-10-SBC361-Reproducible research2014-10-10-SBC361-Reproducible research
2014-10-10-SBC361-Reproducible research
 
Link Analysis in Networks - or - Finding the Terrorists
Link Analysis in Networks - or - Finding the TerroristsLink Analysis in Networks - or - Finding the Terrorists
Link Analysis in Networks - or - Finding the Terrorists
 
master defense hyun-wong choi_2019_05_14_rev19
master defense hyun-wong choi_2019_05_14_rev19master defense hyun-wong choi_2019_05_14_rev19
master defense hyun-wong choi_2019_05_14_rev19
 
master defense hyun-wong choi_2019_05_14_rev19
master defense hyun-wong choi_2019_05_14_rev19master defense hyun-wong choi_2019_05_14_rev19
master defense hyun-wong choi_2019_05_14_rev19
 
master defense hyun-wong choi_2019_05_14_rev19
master defense hyun-wong choi_2019_05_14_rev19master defense hyun-wong choi_2019_05_14_rev19
master defense hyun-wong choi_2019_05_14_rev19
 
defense hyun-wong choi_2019_05_14_rev18
defense hyun-wong choi_2019_05_14_rev18defense hyun-wong choi_2019_05_14_rev18
defense hyun-wong choi_2019_05_14_rev18
 
Final edited master defense-hyun_wong choi_2019_05_23_rev21
Final edited master defense-hyun_wong choi_2019_05_23_rev21Final edited master defense-hyun_wong choi_2019_05_23_rev21
Final edited master defense-hyun_wong choi_2019_05_23_rev21
 
Slides ecir2016
Slides ecir2016Slides ecir2016
Slides ecir2016
 
Adventures in Crowdsourcing: Research at UT Austin & Beyond
Adventures in Crowdsourcing: Research at UT Austin & BeyondAdventures in Crowdsourcing: Research at UT Austin & Beyond
Adventures in Crowdsourcing: Research at UT Austin & Beyond
 
CLIM Program: Remote Sensing Workshop, Foundations Session: A Discussion - Br...
CLIM Program: Remote Sensing Workshop, Foundations Session: A Discussion - Br...CLIM Program: Remote Sensing Workshop, Foundations Session: A Discussion - Br...
CLIM Program: Remote Sensing Workshop, Foundations Session: A Discussion - Br...
 
EDUC5102G Session 2 Presentation
EDUC5102G Session 2 PresentationEDUC5102G Session 2 Presentation
EDUC5102G Session 2 Presentation
 
Data mining BY Zubair Yaseen
Data mining BY Zubair YaseenData mining BY Zubair Yaseen
Data mining BY Zubair Yaseen
 
[系列活動] 人工智慧與機器學習在推薦系統上的應用
[系列活動] 人工智慧與機器學習在推薦系統上的應用[系列活動] 人工智慧與機器學習在推薦系統上的應用
[系列活動] 人工智慧與機器學習在推薦系統上的應用
 
Resume
ResumeResume
Resume
 
405_02_Montgomery_Introduction-to-statistical-quality-control-7th-edtition-20...
405_02_Montgomery_Introduction-to-statistical-quality-control-7th-edtition-20...405_02_Montgomery_Introduction-to-statistical-quality-control-7th-edtition-20...
405_02_Montgomery_Introduction-to-statistical-quality-control-7th-edtition-20...
 

Kürzlich hochgeladen

The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Scott Andery
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentPim van der Noll
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesThousandEyes
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rick Flair
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationKnoldus Inc.
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
Manual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditManual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditSkynet Technologies
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfNeo4j
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesKari Kakkonen
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfIngrid Airi González
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Alkin Tezuysal
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch TuesdayIvanti
 

Kürzlich hochgeladen (20)

The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog Presentation
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
Manual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditManual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance Audit
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdf
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examples
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdf
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch Tuesday
 

SocialCom 2013

  • 1. Trending Topics on Twitter Improve the Prediction of Google Hot Queries Gabriele Tolomei Università Ca’ FoscariVenezia, Italy Federica Giummolè Università Ca’ FoscariVenezia, Italy Salvatore Orlando Università Ca’ FoscariVenezia, Italy 2013 ASE/IEEE International Conference on Social Computing September 8th-14th, 2013 - Washington D.C., USA Monday, September 30, 13
  • 2. Agenda Social vs.Web Trends • Introduction • Methodology • Experiments & Results • Conclusion 2013 ASE/IEEE SocialCom09/09/2013, Washington DC, USA 2 Monday, September 30, 13
  • 3. Agenda Social vs.Web Trends • Introduction • Methodology • Experiments & Results • Conclusion 32013 ASE/IEEE SocialCom09/09/2013, Washington DC, USA Monday, September 30, 13
  • 4. Twitter • The most popular real-time microblogging service • ~ 500M users • ~ 400M tweets per day on avg. (as of 2012) • 140-chars limited size tweets • Social trends pushed by the social network via user-generated content • hashtags (#) • trending topics 42013 ASE/IEEE SocialCom09/09/2013, Washington DC, USA Monday, September 30, 13
  • 5. Google • The most popular Web search engine • ~ 5B search queries per day on avg. (as of 2012) • Web trends derived from search keywords issued by users • Zeitgeist • Google (Hot)Trends 52013 ASE/IEEE SocialCom09/09/2013, Washington DC, USA Monday, September 30, 13
  • 6. Social vs.Web Trends ... 49ers ... dow jones ... nba ... obama 2016 ... world war z ... ... 50 cent ... democrats ... iphone 5 ... romney ... windows 8 ... ... anne hathaway ... barack obama ... election ... nyc marathon ... veterans day ... 62013 ASE/IEEE SocialCom09/09/2013, Washington DC, USA Monday, September 30, 13
  • 7. Which Came First? 0 20 40 60 80 100 11-01 11-03 11-05 11-07 11-09 11-11 11-13 11-15 VolumeIndex Timestamp election Google Twitter Our claim is that a trending topic on Twitter could later become a hot query on Google 72013 ASE/IEEE SocialCom09/09/2013, Washington DC, USA Monday, September 30, 13
  • 8. Agenda Social vs.Web Trends • Introduction • Methodology • Experiments & Results • Conclusion 82013 ASE/IEEE SocialCom09/09/2013, Washington DC, USA Monday, September 30, 13
  • 9. Data Collection 2013 ASE/IEEE SocialCom09/09/2013, Washington DC, USA 9 Streaming API Search API Atom feed • 15 consecutive days of crawling • from 2012-11-01 00:00:00UTC to 2012-11-15 23:59:59UTC • Google • Hot Trends • Twitter • Trending Topics • Public Timelines Monday, September 30, 13
  • 10. Google Hot Trends 49ers ... election ... obama 2016 ... world war z Pre-processing & Cleaning Top-20 hourly US queries |VY|=190 Top-20 hourly US queries 102013 ASE/IEEE SocialCom09/09/2013, Washington DC, USA y Monday, September 30, 13
  • 11. SearchVolume Index Normalized integer score in [0,100] Daily relative searches for a keyword limited to a specific country within a range of dates 112013 ASE/IEEE SocialCom09/09/2013, Washington DC, USA Monday, September 30, 13
  • 12. Twitter Trending Topics |VX|=892 50 cent ... iphone 5 ... election ... windows 8 122013 ASE/IEEE SocialCom09/09/2013, Washington DC, USA Pre-processing & Cleaning Top-10 trending topics every 5 minutes Top-10 hourly aggregated x Monday, September 30, 13
  • 13. TrendVolume Index 132013 ASE/IEEE SocialCom09/09/2013, Washington DC, USA • Use the public timelines crawled ~ 260M tweets = 10% random sampling • To be consistent with Google • daily relative number of tweets mentioning a particular keyword could be hourly! • normalized integer score in [0,100] • limited to US and within a range of dates Monday, September 30, 13
  • 14. Trend Time Series • 15 daily observations T = <t1, ..., t15> • Google: Hot Trends + Search Volume Index, e.g., Yt for "election" = <5,...,7,40,100,...,15,...> • Twitter: Trending Topics + Trend Volume Index, e.g., Xt for "election" = <6,...,10,100,55,...,5,...>
  • 15. Trend Pairing • Not every pair of Google/Twitter trend time series is worth analyzing! • anne hathaway vs. veterans day • We focus only on trends that are “similar enough” to each other • election vs. election • election vs. barack obama
  • 16. Trend Bipartite Graph [Figure: bipartite graph between Twitter trends x in VX (e.g., 50 cent, democrats, iphone 5, election, romney, windows 8) and Google trends y in VY (e.g., 49ers, dow jones, election, nba, obama 2016, world war z); each edge (x, y) is weighted by trend similarity]
  • 17. Trend Similarity • Edge weighting scheme of the TBG • string/lexical: e.g., Levenshtein, Jaccard, n-grams, etc. • semantic: e.g., Wikipedia-based • We use the normalized longest common subsequence (nlcs) between two keywords
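
The nlcs edge weight from slide 17 could be sketched as follows. The slides do not state how the LCS length is normalized; dividing by the length of the longer keyword is an assumption made here so that a score of 1.0 is reached only by identical strings. The function names are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of strings a and b."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def nlcs(x, y):
    """Normalized LCS in [0, 1].

    Assumption: normalize by the longer keyword's length.
    """
    if not x or not y:
        return 0.0
    return lcs_length(x, y) / max(len(x), len(y))


for pair in [("election", "election"), ("obama", "obama 2016")]:
    print(pair, nlcs(*pair))
```

Under this normalization, a threshold of 1.0 keeps only exact keyword matches, while a lower threshold also admits near matches.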
  • 18. Datasets • 2 thresholds on nlcs, η1 = 1.0 and η2 = 0.6, lead to 2 TBGs • D1 = {(Xt, Yt) | nlcs(x, y) = η1}, |D1| = 50 • D2 = {(Xt, Yt) | nlcs(x, y) >= η2}, |D2| = 69 • Aggregate and normalize the Twitter time series linked to the same Google keyword in the TBG (|VX| > |VY|)
  • 19. Research Questions 1) Is there any relation between a particular pair (Xt, Yt)? • Cross-Correlation (lagged relationship) 2) Are variables from Twitter time series useful to forecast those from Google? • Time series regression • Why not the opposite? Because in our data the same trend appears first on Twitter about 70% of the time.
  • 20. Agenda: Social vs. Web Trends • Introduction • Methodology • Experiments & Results • Conclusion
  • 21. Cross-Correlation • Measures the correlation between two time series Xt, Yt shifted by δ time units • Xt refers to Twitter and Yt refers to Google • min δ = 1 day • Check for which δ the cross-correlation is maximum • X leads Y if one or more Xt+δ are predictors of Yt and δ < 0 • X lags Y, otherwise
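
A minimal sketch of the lag search on slide 21, assuming the cross-correlation at lag δ is computed as the Pearson correlation between Xt+δ and Yt over the days where both exist. The authors' exact estimator is not given in the slides, and the two series below are made up.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # a constant series carries no correlation signal
    return cov / sqrt(var_a * var_b)


def cross_correlation(x, y, delta):
    """Correlation between x[t + delta] and y[t] on the overlapping days."""
    n = len(x)
    if delta < 0:
        xs, ys = x[:n + delta], y[-delta:]
    else:
        xs, ys = x[delta:], y[:n - delta]
    return pearson(xs, ys)


# Hypothetical series: Google (y) repeats Twitter (x) one day later.
x = [0, 0, 1, 5, 2, 0, 0, 0]
y = [0, 0, 0, 1, 5, 2, 0, 0]
best_lag = max(range(-3, 4), key=lambda d: cross_correlation(x, y, d))
print(best_lag)
```

A negative best lag means that past Twitter values line up with current Google values, which is the "X leads Y" case.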
  • 22. Lagged Relationship Most pairs of time series exhibit their maximum cross-correlation at lag δ = 0. Nevertheless, some exceptions occur, and cross-correlation at lag δ = -1 is still significant. Twitter as measured one day before could help explain Google.
  • 23. Time Series Regression • Relate Y (dependent variable) to a parametric function of a set of explanatory variables X1,...,Xr • The most widely used function is linear in the parameters • Linear Regression: $Y = X\beta + \varepsilon$, where $Y$ is a $k \times 1$ column vector, $X$ is a $k \times r$ matrix of observed values for $X_1, \dots, X_r$ parametrized by $\beta$, and $\varepsilon$ is a $k \times 1$ column vector of errors
  • 24. Ordinary Least Squares • Technique to estimate the real vector of coefficients β • Choose $\beta'$ such that $\beta' = \arg\min_{\beta} \{(Y - X\beta)^T (Y - X\beta)\}$, which gives $\beta' = (X^T X)^{-1} X^T Y$
  • 25. Autoregressive: AR(p) • The simplest time series regression model • Relate a variable Yt to a linear combination of up to p of its previous values: $Y_t = \alpha + \phi_1 Y_{t-1} + \phi_2 Y_{t-2} + \dots + \phi_p Y_{t-p} + \varepsilon_t$, where $\alpha, \phi_1, \dots, \phi_p$ are parameters and $\varepsilon_t$ is random noise
  • 26. Distributed Lag: DL(q) • The dependent variable Yt is related only to q+1 values of the explanatory variable Xt (the current value and its q previous values): $Y_t = \alpha + \psi_1 X_t + \psi_2 X_{t-1} + \dots + \psi_{q+1} X_{t-q} + \varepsilon_t$, where $\alpha, \psi_1, \dots, \psi_{q+1}$ are parameters and $\varepsilon_t$ is random noise
  • 27. Autoregressive Distributed Lag: ADL(p,q) • Relate the dependent variable Yt to lags of itself and of an explanatory variable Xt: $Y_t = \alpha + \phi_1 Y_{t-1} + \phi_2 Y_{t-2} + \dots + \phi_p Y_{t-p} + \psi_1 X_t + \psi_2 X_{t-1} + \dots + \psi_{q+1} X_{t-q} + \varepsilon_t$, where $\alpha, \phi_i, \psi_j$ are parameters and $\varepsilon_t$ is random noise
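
To make the ADL(p,q) definition concrete, the sketch below builds its design matrix and estimates the coefficients with ordinary least squares via the normal equations. It is a plain-Python illustration, not the authors' code: the helper names are hypothetical, and the data are simulated without noise so that the true coefficients are known.

```python
def solve(a, b):
    """Solve the linear system a z = b (Gaussian elimination, partial pivoting)."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (m[i][n] - tail) / m[i][i]
    return z


def ols(rows, targets):
    """Least-squares coefficients from the normal equations (X^T X) b = X^T Y."""
    r = len(rows[0])
    xtx = [[sum(row[i] * row[j] for row in rows) for j in range(r)]
           for i in range(r)]
    xty = [sum(row[i] * t for row, t in zip(rows, targets)) for i in range(r)]
    return solve(xtx, xty)


def adl_design(x, y, p, q):
    """Rows [1, y[t-1..t-p], x[t..t-q]] and targets y[t] for an ADL(p,q) fit."""
    rows, targets = [], []
    for t in range(max(p, q), len(y)):
        row = [1.0]
        row += [y[t - i] for i in range(1, p + 1)]
        row += [x[t - j] for j in range(0, q + 1)]
        rows.append(row)
        targets.append(y[t])
    return rows, targets


# Simulated 15-day series (hypothetical values, no noise term).
x = [6, 3, 10, 100, 55, 20, 8, 5, 12, 7, 30, 4, 9, 15, 2]
y = [1.0]
for t in range(1, len(x)):
    y.append(2.0 + 0.5 * y[t - 1] + 1.0 * x[t] + 3.0 * x[t - 1])

rows, targets = adl_design(x, y, p=1, q=1)
alpha, phi1, psi1, psi2 = ols(rows, targets)
print(alpha, phi1, psi1, psi2)
```

Setting q so that no X terms are included, or p = 0, in the same design function would give the AR(p) and DL(q) special cases; with only 15 daily observations, each extra lag costs a noticeable share of the usable rows.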
  • 28. Model Comparison • We measure how likely a model AR(p), DL(q), ADL(p,q) retains its lagged component as significant • Null hypothesis H0: “the lagged coefficient is not significant” • Rejecting H0 means that the lagged coefficient is useful to fit the data • H0 is rejected whenever the p-value is below a significance level α (e.g., α = .05)
  • 29. Model Evaluation • Compute both R2 ∈ [0,1] and its adjusted variation, which penalizes models with too many explanatory terms • Describes how well a regression line fits the observed data • Provides a measure of how well future observations are likely to be predicted by the model
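
The two scores on slide 29 can be computed as in the sketch below. It uses the standard textbook definitions of R2 and adjusted R2; the slides do not say which variant the authors used, so treat the adjusted formula as an assumption. The function names are hypothetical.

```python
def r_squared(observed, fitted):
    """R2 = 1 - (residual sum of squares / total sum of squares)."""
    mean = sum(observed) / len(observed)
    ss_res = sum((o - f) ** 2 for o, f in zip(observed, fitted))
    ss_tot = sum((o - mean) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def adjusted_r_squared(r2, n_obs, n_terms):
    """Adjusted R2 for n_obs observations and n_terms explanatory terms.

    n_terms excludes the intercept; each extra term shrinks the score
    unless it improves the fit enough to compensate.
    """
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_terms - 1)


print(r_squared([1, 2, 3, 4], [1.1, 1.9, 3.2, 3.8]))
print(adjusted_r_squared(0.5, n_obs=10, n_terms=2))
```

The adjustment matters when comparing models with different numbers of lagged terms, since plain R2 never decreases when a term is added.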
  • 30. AR(p) vs. DL(q) On both D1 and D2, DL(q) models retain their q-lagged coefficient much more often than AR(p) models. Twitter is actually useful to fit Google data!
  • 31. ADL(p,q) Slightly fewer cases where the lagged component of Twitter is significant to predict Google data... But the adjusted R2 is much better than for DL(q)
  • 32. Wrap Up ADL(1,1) is the best model. Reasonable! It mixes the autoregressive component of Google with the prediction of Twitter, captured one day before.
  • 33. Overcome Limitations We might expect better results if a finer-grained (hourly) analysis were possible... Twitter vs. Wikipedia: upcoming CIKM’13 Workshop
  • 34. Agenda: Social vs. Web Trends • Introduction • Methodology • Experiments & Results • Conclusion
  • 35. Conclusion • Relate Twitter trending topics (social trends) with Google hot queries (web trends) • Trend Bipartite Graph (TBG) links social and web trends • Time Series Analysis • maximum cross-correlation occurs at lag 0, but Twitter leads Google significantly (~60% of the time) • the very best model to explain the data uses both Twitter and Google lagged coefficients
  • 36. Thank You! Questions?
  • 38. Backup
  • 39. Trend Vocabularies [Figure: vocabularies VX and VY, shown as three groups of example trend keywords: (49ers, dow jones, nba, obama 2016, world war z); (50 cent, democrats, iphone 5, romney, windows 8); (anne hathaway, barack obama, election, nyc marathon, veterans day)]
  • 40. Trend Scores • Given a discrete time interval T = <t1, ..., tT> • Assign 2 scores (social and web) to each trending keyword during each time unit • The score measures how strongly a keyword is trending at a given time
  • 41. Trend Time Series • Model each Twitter/Google trending keyword as a time series of tT random variables • Each random variable evaluates to the trending score of the keyword • The observed time series for a trend is the sequence of values of its trending score
  • 42. Trend Bipartite Graph • 2 disjoint sets of nodes are the vocabularies of Twitter and Google trends • Weighted edges measure the pairwise trend similarity • string/lexical: edit distance, LCS, n-grams • semantic: Wikipedia-based • TBG identifies a set of pairs of comparable time series associated with similar trends
  • 43. (Weak) Stationarity The autocorrelation of a stationary variable decays into “noise” and/or negative values within a few lags. [Figure: autocorrelation plots for Google and Twitter]