Hashtagger+: Real-time Social
Tagging of Streaming News
Georgiana Ifrim
(joint work with Bichen Shi, Gevorg Poghosyan, Neil Hurley)
Insight Centre for Data Analytics,
University College Dublin, Ireland
1
The Umbrella Revolution:
Sit-in street protests in Hong Kong, 2014
2
“The ants have megaphones now”
C. Anderson
[Figure: timeline from Sep 21 to Oct 07 for the hashtags #OccupyCentral, #UmbrellaRevolution and #HongKong, annotated with news headlines:
• Hong Kong students begin pro-democracy class boycott
• Thousands at Hong Kong protest as Occupy Central is launched
• Hong Kong protests: Thousands defy calls to go home
• Hong Kong students vow stronger protests if leader stays
• Hong Kong protests: Formal talks agreed as protests shrink]
3
Motivation:
News articles – Hashtag – Twitter conversation
#IndyRef (Referendum on Scottish Independence)
• BBC: Scottish independence: Yes vote 'means big Scots EU boost'
• BBC: Could Scotland compete on tax with Westminster?
• IrishTimes: Brown promises more devolution for Scotland
• RTE: Lloyds could move south if Scots vote for independence
• Reuters: British PM heads to Scotland as independence campaign gathers steam
• TheGuardian: Scottish independence: No camp sends for Gordon Brown as polls tighten
April 2016 4
Outline
•Problem Statement
•State-of-the-Art
•Hashtagger+ Model
•Applications
•Conclusion
5
Problem Statement
Map a stream of articles to a stream of
hashtags in real-time, with high-precision
and high-coverage.
Joe Schmidt makes six changes to Irish side to face Japan #rugby

Paris Airshow: eight takeaways from the major aerospace event #business

The tortoise and the software: the human glitch in the machine #ux

Duke of Edinburgh leaves hospital #princephilip

6
Problem Statement
•Real-time Recommendation: given an article, how quickly
can we recommend hashtags? (5mins ok, 5h not ok)
•High-precision (focused hashtags):
X Deadly car bomb targets Afghan bank #news

V Deadly car bomb targets Afghan bank #afghanistan #helmand

•High-coverage: how many articles get any recommended
hashtags within 5 minutes? (9 out of 10 ok, 1 out of 10 not ok)
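The two quality targets above can be made concrete with a small sketch. It assumes that P@1 is computed only over articles that received at least one hashtag and that coverage is the share of all articles that received any recommendation; the function names and the toy article ids are hypothetical, and the relevance judgements are illustrative only.

```python
from typing import Dict, List, Set


def precision_at_1(recommended: Dict[str, List[str]],
                   relevant: Dict[str, Set[str]]) -> float:
    """Share of tagged articles whose top-ranked hashtag is judged relevant."""
    tagged = [a for a, tags in recommended.items() if tags]
    if not tagged:
        return 0.0
    hits = sum(1 for a in tagged if recommended[a][0] in relevant.get(a, set()))
    return hits / len(tagged)


def coverage(recommended: Dict[str, List[str]], n_articles: int) -> float:
    """Share of all articles that received at least one hashtag."""
    tagged = sum(1 for tags in recommended.values() if tags)
    return tagged / n_articles


# Toy data: ranked recommendations per article (empty list = nothing recommended in time).
recommended = {
    "afghan-bank": ["#afghanistan", "#helmand"],
    "irish-rugby": ["#rugby"],
    "paris-airshow": ["#news"],      # too generic, judged irrelevant here
    "prince-philip": [],
}
relevant = {
    "afghan-bank": {"#afghanistan", "#helmand"},
    "irish-rugby": {"#rugby"},
    "paris-airshow": {"#business"},
    "prince-philip": {"#princephilip"},
}

p_at_1 = precision_at_1(recommended, relevant)
article_coverage = coverage(recommended, n_articles=len(recommended))
```

In this toy run two of the three tagged articles have a relevant top hashtag, and three of the four articles are covered.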
7
State-of-the-Art
•Modeling Approach:
• Multi-class Classification
• Content-based Features
• Static Datasets
•Workflow: Tweets -> Article
• Collect Tweets -> Hashtags as Classes -> Train Hashtag
Classifiers -> Apply Classifiers to Article ->
Recommended Hashtags
8
State-of-the-art:
• Multi-class classification (e.g., Naive Bayes, SVM,
LDA, CNN)
• One hashtag = one class
• Content-based features
#GE16
#ge16: Fine Gael and Fianna Fáil to discuss government options
Ruth Coppinger to be nominated for Taoiseach #GE16 #irishwater
…
#Germanwings
"No evidence" that co-pilot told anyone he was planning #Germanwings
crash, prosecutor says
…
9
State-of-the-art:
Train (Model):
• #GE16
  • #ge16: Fine Gael and Fianna Fáil to discuss government options
  • Ruth Coppinger to be nominated for Taoiseach #GE16 #irishwater
  • …
• #Germanwings
  • "No evidence" that co-pilot told anyone he was planning #Germanwings crash, prosecutor says…
  • …
Apply:
• #PanamaPapers
  • #PanamaPapers: Mossack Fonseca leak reveals elite's tax havens
  • #PanamaPapers: How the World's Rich and Famous Hide Their Money Offshore
• #Germanwings:
  • One year on, Haltern commemorates the crash
  • Nice flight from Manchester to Koln/Bonn this morning.
Weakness: How about new hashtags? Concept-drift of old hashtags? -> Re-train the model
10
Challenges:
•Many Classes: thousands of hashtags (e.g., 26k/day)
•Dynamic Classes: hashtags emerge and die off
•Concept Drift: usage and meaning of hashtags change
•Efficiency/coverage: real-time tagging to capture
how the story moves over time
•Precision: state-of-the-art models have P@1 of ~50%
11
Hashtagger+ Model
•Modeling Approach:
• Learning-to-rank (L2R)
• Focus on the concept of hashtag relevance
• IR Framework: Article = query, Hashtags = documents retrieved/ranked for the query
• Workflow: Article -> Tweets
12
Hashtagger+ Model
13
Hashtagger+ Model
SOTA: Multi-class Classification
Object      Class
Article_1   Hashtag_x, Hashtag_y, Hashtag_z
Article_2   Hashtag_x
Article_3   Hashtag_y, Hashtag_z
Proposed L2R Model
Object                    Class
(Article_1, Hashtag_x)    Relevant
(Article_1, Hashtag_m)    Irrelevant
(Article_1, Hashtag_n)    Irrelevant
14
• Pointwise L2R model
• Input feature vector x_{article,hashtag,time} describes a given (Article, Hashtag) pair at a point in time
• Human-provided label y_{article,hashtag,time} tells whether the hashtag is relevant or irrelevant to the article at that point in time
• Time-aware features capture how strongly a hashtag is associated with an article:
  • Content Similarity
  • Hashtag Popularity, Specificity, Trending
  • User Credibility
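As an illustration of one such feature, the sketch below computes a content-similarity value for an (article, hashtag) pair as the bag-of-words cosine between the article text and the tweets that carry the hashtag. This is an assumption about how such a feature could be built, not the feature definition used by Hashtagger+; restricting `hashtag_tweets` to a recent time window is one hypothetical way to make the value time-aware.

```python
import math
import re
from collections import Counter
from typing import Iterable, List


def tokens(text: str) -> List[str]:
    """Lowercased alphanumeric tokens; the '#' of a hashtag is dropped."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def content_similarity(article_text: str, hashtag_tweets: Iterable[str]) -> float:
    """Similarity between an article and all tweets observed with one hashtag."""
    article_vec = Counter(tokens(article_text))
    tweets_vec: Counter = Counter()
    for tweet in hashtag_tweets:
        tweets_vec.update(tokens(tweet))
    return cosine(article_vec, tweets_vec)


article = "Deadly car bomb targets Afghan bank"
# Hypothetical tweets seen with each hashtag in the current time window.
sim_afghanistan = content_similarity(article, ["Car bomb hits bank in Helmand #afghanistan"])
sim_rugby = content_similarity(article, ["Schmidt names Irish side #rugby"])
```

The focused hashtag shares the terms "car", "bomb" and "bank" with the article and scores above zero, while the unrelated hashtag shares none.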
15
Hashtagger+ Model
Hashtagger+ Model:
Train (Model):
(Article1, #GE16)           0.34  0.73  0  …  Relevant
(Article1, #Germanwings)    0.01  0.23  0  …  Irrelevant
…
(Article2, #GE16)           0.02  0.48  0  …  Irrelevant
(Article2, #Germanwings)    0.76  0.45  1  …  Relevant
…
How about new hashtags? Concept-drift of old hashtags?
Train once, use model (no retraining needed)
Apply:
(Article1, #PanamaPapers)   0.66  0.82  1  …
(Article2, #PanamaPapers)   0.08  0.73  0  …
Apply:
(Article1, #Germanwings)    0.28  0.45  0  …
(Article2, #Germanwings)    0.53  0.24  1  …
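The "train once, apply to unseen hashtags" idea can be sketched with a tiny pointwise model. The example below fits a logistic regression by stochastic gradient ascent on the four labelled rows shown on the slide and then scores the two #PanamaPapers pairs, a hashtag the model never saw in training. The choice of logistic regression, the hyperparameters and the reading of the three numeric columns as generic features f1..f3 are assumptions for illustration; the slide truncates the feature list with "…", and the experiments in the deck report a RandomForest as the strongest pointwise learner.

```python
import math
from typing import List, Sequence, Tuple

Row = Tuple[Sequence[float], int]  # (feature vector, 1 = relevant / 0 = irrelevant)


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def train_pointwise(rows: List[Row], epochs: int = 2000, lr: float = 0.5):
    """Fit logistic-regression weights and bias on labelled (article, hashtag) rows."""
    weights = [0.0] * len(rows[0][0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in rows:
            p = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))
            err = y - p
            weights = [w + lr * err * xi for w, xi in zip(weights, x)]
            bias += lr * err
    return weights, bias


def relevance(model, x: Sequence[float]) -> float:
    """Relevance score in (0, 1); depends only on features, not on the hashtag identity."""
    weights, bias = model
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))


# Labelled rows from the slide (first three feature columns only).
train_rows: List[Row] = [
    ((0.34, 0.73, 0.0), 1),  # (Article1, #GE16)
    ((0.01, 0.23, 0.0), 0),  # (Article1, #Germanwings)
    ((0.02, 0.48, 0.0), 0),  # (Article2, #GE16)
    ((0.76, 0.45, 1.0), 1),  # (Article2, #Germanwings)
]
model = train_pointwise(train_rows)

# Unseen hashtag: no retraining, only new feature vectors.
score_a1_panama = relevance(model, (0.66, 0.82, 1.0))  # (Article1, #PanamaPapers)
score_a2_panama = relevance(model, (0.08, 0.73, 0.0))  # (Article2, #PanamaPapers)
```

Because the model sees only pair features, a new hashtag needs no new class; in this toy run the Article1 pair scores higher than the Article2 pair.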
16
Two-Step L2R Approach
• Filtering: Article -> Set of Candidate Hashtags
• Efficient Data Collection
• Query generation from given article
• Retrieving relevant tweets for article/query
• Ranking Model: Article, Candidate Hashtags -> Ranked
Hashtag List
• Apply pre-trained L2R model to rank candidate hashtags
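A minimal sketch of the two steps: hashtags are extracted from the tweets retrieved for an article (filtering), then a scoring function standing in for the pre-trained L2R model ranks them and a score cut-off decides what is recommended. The function names, the `threshold` and `top_k` parameters and the toy scores are hypothetical.

```python
import re
from collections import Counter
from typing import Callable, Iterable, List, Tuple

HASHTAG = re.compile(r"#\w+")


def candidate_hashtags(tweets: Iterable[str]) -> Counter:
    """Step 1 (filtering): count in how many retrieved tweets each hashtag occurs."""
    counts: Counter = Counter()
    for tweet in tweets:
        counts.update({tag.lower() for tag in HASHTAG.findall(tweet)})
    return counts


def recommend(article: str,
              tweets: Iterable[str],
              score: Callable[[str, str], float],
              threshold: float = 0.5,
              top_k: int = 3) -> List[Tuple[str, float]]:
    """Step 2 (ranking): score each candidate and keep the best ones above the cut-off."""
    scored = [(tag, score(article, tag)) for tag in candidate_hashtags(tweets)]
    kept = [(tag, s) for tag, s in scored if s >= threshold]
    kept.sort(key=lambda pair: (-pair[1], pair[0]))
    return kept[:top_k]


tweets = [
    "#ge16: Fine Gael and Fianna Fáil to discuss government options",
    "Ruth Coppinger to be nominated for Taoiseach #GE16 #irishwater",
]
candidates = candidate_hashtags(tweets)

# Stand-in for the pre-trained model: fixed, made-up relevance scores.
toy_scores = {"#ge16": 0.9, "#irishwater": 0.4}
recs = recommend("Fine Gael and Fianna Fáil to discuss government options",
                 tweets,
                 score=lambda article, tag: toy_scores.get(tag, 0.0))
```

Raising the cut-off trades coverage for precision: fewer articles receive any hashtag, but the ones that do are more likely to receive a relevant one.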
17
Hashtagger+ Model
18
Query Generation: Article -> Query
•What is a good set of keywords to describe what
the article is about? (open research problem)
•How quickly can we generate the query?
•How good is the set of tweets retrieved with a
given query?
•We compare 4 methods for query generation and
the effect on quality & size of retrieved tweet set
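One simple way to turn an article into keyword queries is sketched below: terms are scored with a smoothed tf-idf against background document frequencies, and the top terms are combined into two-word queries. This is a simplification under stated assumptions and not any of the four compared methods: it has no POS or NER filtering, and the stopword list, `DOC_FREQ` counts and `N_DOCS` are made up for illustration.

```python
import math
import re
from collections import Counter
from itertools import combinations
from typing import Dict, List

STOPWORDS = {"of", "it", "has", "the", "this", "and", "is", "on", "for", "more", "says"}


def tfidf_scores(text: str, doc_freq: Dict[str, int], n_docs: int) -> Dict[str, float]:
    """Term frequency times a smoothed inverse document frequency."""
    tf = Counter(t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS)
    return {t: c * math.log((1 + n_docs) / (1 + doc_freq.get(t, 0))) for t, c in tf.items()}


def keyword_pair_queries(text: str, doc_freq: Dict[str, int], n_docs: int,
                         top_terms: int = 4, n_queries: int = 5) -> List[str]:
    """Build two-keyword queries from the highest-scoring terms, best pair first."""
    scores = tfidf_scores(text, doc_freq, n_docs)
    top = sorted(scores, key=lambda t: (-scores[t], t))[:top_terms]
    pairs = combinations(sorted(top), 2)
    ranked = sorted(pairs, key=lambda p: (-(scores[p[0]] + scores[p[1]]), p))
    return [" ".join(p) for p in ranked[:n_queries]]


# Hypothetical background statistics: number of documents containing each term.
N_DOCS = 1000
DOC_FREQ = {"easyjet": 2, "pilots": 20, "hunt": 40, "doubles": 60, "doubled": 60,
            "female": 150, "number": 300, "year": 500}

headline_and_sub = ("Easyjet doubles number of female pilots. Easyjet says it has doubled "
                    "the number of female pilots this year and is on the hunt for more.")
queries = keyword_pair_queries(headline_and_sub, DOC_FREQ, N_DOCS)
```

Short two-word queries keep generation fast; how precise the retrieved tweets are depends on which terms survive the scoring.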
19
Tweet Retrieval: Query -> Tweets
•Given a query (generated from an article), how do
we quickly collect a good set of tweets?
•Cold-start Search for new articles:
• Re-use tweets collected for older articles
• How do we do this efficiently/effectively?
•Twitter Streaming API to continuously update
tweet collection for each article
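The cold-start idea of re-using tweets already collected for older articles can be sketched as a local keyword search over a small inverted index. The `TweetStore` class and its methods are hypothetical and only illustrate the lookup; a real deployment would also need time-based eviction and the streaming updates mentioned above.

```python
import re
from collections import defaultdict
from typing import DefaultDict, List, Set


def _terms(text: str) -> List[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


class TweetStore:
    """In-memory store of tweets collected so far, searchable by keywords."""

    def __init__(self) -> None:
        self._tweets: List[str] = []
        self._index: DefaultDict[str, Set[int]] = defaultdict(set)

    def add(self, text: str) -> None:
        """Store a tweet and index each of its distinct terms."""
        tweet_id = len(self._tweets)
        self._tweets.append(text)
        for term in set(_terms(text)):
            self._index[term].add(tweet_id)

    def search(self, query: str) -> List[str]:
        """Return stored tweets that contain every term of the query."""
        terms = _terms(query)
        if not terms:
            return []
        ids = set.intersection(*(self._index.get(t, set()) for t in terms))
        return [self._tweets[i] for i in sorted(ids)]


store = TweetStore()
for tweet in [
    "Nice flight from Manchester to Koln/Bonn this morning.",
    "#Germanwings: One year on, Haltern commemorates the crash",
    '"No evidence" that co-pilot told anyone he was planning #Germanwings crash, prosecutor says',
]:
    store.add(tweet)

# A new article arrives: answer its query from tweets gathered for older articles.
hits = store.search("germanwings crash")
```

A new article can then receive candidate tweets immediately from the local store while the streaming collection for its query catches up.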
20
Experiments
•Query Generation
•Comparing L2R Algorithms
•Comparing to State-of-the-Art Methods
21
Query Generation
22
empirical study to evaluate the impact of each query type
on the amount/quality of data collected, as well as how this
influences the recommendation effectiveness.
TABLE 1
Example article and ranked article-keyphrases using 4 approaches.
Article Headline     Easyjet doubles number of female pilots
Subheadline          Easyjet says it has doubled the number of female pilots this year and is on the hunt for more.
First Sentence       The Amy Johnson initiative, named after the first female pilot to fly solo from the UK to Australia, caused a surge in applications.
POS + Tf.idf         (1) australia easyjet, (2) easyjet number, (3) easyjet uk, (4) australia number, (5) australia uk
POS + NER + Tf.idf   (1) amy johnson, (2) australia easyjet, (3) easyjet uk, (4) australia uk, (5) easyjet number
AlchemyAPI           (1) amy johnson initiative, (2) female pilots, (3) easyjet, (4) female pilot, (5) surge
URL                  (1) bbc.com/news/business-38326523
Query Generation
23
TABLE 4
Average cosine similarity, number of tweets, number of candidate hashtags and hashtag frequency using tweets collected using four query generation methods.
            POS + Tf.idf   POS + NER + Tf.idf   AlchemyAPI   URL
Cosine      0.221          0.242                0.246        0.265
Tweets      3696.2         2982.9               5083.8       4.2
Hashtags    529            442                  976          1.5
Tag Freq    5.26           5.73                 5.81         1.49
TABLE 3
Precision@1, article coverage and running time of end-to-end hashtag recommendation using tweets collected using four query generation methods.
            L2R (POS + Tf.idf)   L2R (POS + NER + Tf.idf)   L2R (AlchemyAPI)   L2R (URL)
P@1         0.930                0.947                      0.901              0.410
Coverage    67.3%                63.3%                      71.3%              22.1%
Time        301s                 200s                       588s               48s
Comparing L2R algorithms
24
TABLE 5
Comparing the P@1, NDCG@3 and running time of 16 ranking methods using Ranklib, sklearn and Cornell's RankSVM.
L2R Algorithm                      P@1     NDCG@3   Time(s)
Pointwise
  RandomForest(sklearn)            0.852   0.848    2.75
  MultilayerPerceptron(sklearn)    0.835   0.803    6.14
  SVM(poly)(sklearn)               0.823   0.827    0.78
  GradientBoosting(sklearn)        0.810   0.817    1.71
  LinearRegression(sklearn)        0.803   0.824    0.16
  AdaBoost(sklearn)                0.801   0.840    1.51
  RandomForest(ranklib)            0.792   0.778    2.01
  MART(ranklib)                    0.783   0.768    49.87
  GaussianNaiveBayes(sklearn)      0.764   0.757    0.05
Pairwise
  RankBoost(ranklib)               0.774   0.773    15.67
  RankSVM(cornell)                 0.728   0.734    2.05
  RankNet(ranklib)                 0.654   0.718    7.45
Listwise
  CoordinateAscent(ranklib)        0.778   0.765    28.11
  LambdaMART(ranklib)              0.769   0.766    54.48
  ListNet(ranklib)                 0.751   0.756    14.56
  AdaRank(ranklib)                 0.737   0.749    2.53
listwise ranking algorithms, pointwise methods have higher
•Multi-class Classification Methods:
• Use hashtagged tweets as labeled data (hashtag = class)
• Need to wait to collect enough training data (tweet history size, e.g., 2h or 4h of past tweets)
• Need to be retrained often to keep up with changes in tweet vocabulary and emerging/dying hashtags (retraining time: the time required to train the model determines how often we can retrain)
• Naive Bayes, Liblinear SVM, Neural Net
•L2R Methods:
• Trained once with hashtagged tweets or manually labeled (article, hashtag) examples
• Pairwise L2R and Pointwise L2R (Hashtagger+)
25
Comparing to State-of-the-Art
Comparing to State-of-the-Art
26
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. , NO.
[Figure: Precision@1 (y-axis, 0.1 to 1) against article coverage (x-axis, 0% to 90%) for all articles. Legend: Hashtagger+ (search), Hashtagger (stream), PairwiseL2R, Liblinear (2h/30min), Naive Bayes (4h/5min), MultilayerPerc (1h/1h). Annotated points: 66% coverage at P@1 0.94 (th=0.5) and 77% coverage at P@1 0.89 (th=0.3).]
Fig. 7. P@1 and article coverage of the SOTA methods compared.
27
Comparing to State-of-the-Art:
Popular vs Niche Articles
[Figure: Precision@1 (y-axis) against article coverage (x-axis, 0% to 90%), with separate panels for popular and niche articles. Legend entries recovered: Hashtagger (stream), PairwiseL2R, Liblinear (2h/30min), Naive Bayes (4h/5min), MultilayerPerc (1h/1h). Annotated points in the Niche Articles panel: 45% coverage at P@1 0.94 (th=0.5) and 58% coverage at P@1 0.89 (th=0.3).]
Fig. 8. P@1 and article coverage for popular versus niche articles.
Applications
•Hashtagger+ is deployed in a Web application
(http://insight4news.ucd.ie)
•Using the recommended hashtags:
• News Publishing on Twitter
• Story Detection & Tracking
28
Live Tweeting with Hashtagger+
https://twitter.com/Insight4News3
29
[Figure: three bar charts comparing the groups No Hashtag, #News and Hashtagger: Sum of Impressions (axis 0 to 150,000), Sum of Engagements (axis 0 to 1,000) and Sum of Url Clicks (axis 0 to 600).]
Twitter account (@insight4news3) automatically tweets article headlines.
Randomly allocate articles into 3 groups:
• No Hashtag: Article Headline + URL
• #News: Article Headline + URL + #News
• Hashtagger: Article Headline + URL + Recommended Hashtags
Twitter Analytics Stats
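The allocation scheme of this experiment can be sketched as follows. The group names come from the slide; the function names, the fixed random seed and the example URL are hypothetical, and posting to Twitter is left out.

```python
import random
from typing import Dict, Iterable, List

GROUPS = ("No Hashtag", "#News", "Hashtagger")


def allocate(article_ids: Iterable[str], seed: int = 0) -> Dict[str, str]:
    """Assign each article to one of the three groups uniformly at random."""
    rng = random.Random(seed)  # seeded so that the allocation is reproducible
    return {article_id: rng.choice(GROUPS) for article_id in article_ids}


def compose_tweet(headline: str, url: str, group: str, recommended: List[str]) -> str:
    """Build the tweet text for an article according to its group."""
    parts = [headline, url]
    if group == "#News":
        parts.append("#News")
    elif group == "Hashtagger":
        parts.extend(recommended)
    return " ".join(parts)


allocation = allocate(["a1", "a2", "a3", "a4", "a5", "a6"], seed=1)
example_tweet = compose_tweet("Duke of Edinburgh leaves hospital",
                              "https://example.org/a1",
                              "Hashtagger",
                              ["#princephilip"])
```

Random allocation keeps the three groups comparable, so differences in impressions, engagements and URL clicks can be attributed to the hashtags rather than to the choice of articles.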
30
Story Detection: http://ani.ucd.ie/plot_news_patterns.html
31
Story Tracking with Social Tags:
32
Story Tracking with Social Tags
33
Conclusion
•Hashtagger+: a framework for real-time hashtag recommendation for news.
•An L2R model trained with human-labeled data can address efficiency & precision challenges.
•By merging news and social media we can address difficult problems: story & entity detection/visualization/tracking/disambiguation/linking.
34
Thank you!
References
•Hashtagger+: Efficient High-Coverage Social Tagging of Streaming News, B. Shi, G. Poghosyan, G. Ifrim, N. Hurley [2017, under review]
•Learning-to-Rank for Real-Time High-Precision Hashtag Recommendation for Streaming News, B. Shi, G. Ifrim, N. Hurley [WWW16]
•Real-time News Story Detection and Tracking with Hashtags, G. Poghosyan, G. Ifrim [CNewsStory16]
•Topy: Real-time Story Tracking via Social Tags, G. Poghosyan, A. Qureshi, G. Ifrim [ECML/PKDD16]
•Insight4news: Connecting news to relevant social conversations, B. Shi, G. Ifrim, N. Hurley [ECML/PKDD14]
35

Weitere ähnliche Inhalte

Ähnlich wie Hashtagger+: Real-time Social Tagging of Streaming News - Dr. Georgiana Ifrim

Social Phrases Having Impact in Altmetrics - SOPHIA
Social Phrases Having Impact in Altmetrics - SOPHIASocial Phrases Having Impact in Altmetrics - SOPHIA
Social Phrases Having Impact in Altmetrics - SOPHIAInsight_Altmetrics
 
Detecting Incongruity Between News Headline and Body Text via a Deep Hierarch...
Detecting Incongruity Between News Headline and Body Text via a Deep Hierarch...Detecting Incongruity Between News Headline and Body Text via a Deep Hierarch...
Detecting Incongruity Between News Headline and Body Text via a Deep Hierarch...Seoul National University
 
Data Science in 2016: Moving Up
Data Science in 2016: Moving UpData Science in 2016: Moving Up
Data Science in 2016: Moving UpPaco Nathan
 
Data Science in 2016: Moving up by Paco Nathan at Big Data Spain 2015
Data Science in 2016: Moving up by Paco Nathan at Big Data Spain 2015Data Science in 2016: Moving up by Paco Nathan at Big Data Spain 2015
Data Science in 2016: Moving up by Paco Nathan at Big Data Spain 2015Big Data Spain
 
Notey's talk 20160923
Notey's talk 20160923Notey's talk 20160923
Notey's talk 20160923Rosanna Man
 
Twitter analytics: some thoughts on sampling, tools, data, ethics and user re...
Twitter analytics: some thoughts on sampling, tools, data, ethics and user re...Twitter analytics: some thoughts on sampling, tools, data, ethics and user re...
Twitter analytics: some thoughts on sampling, tools, data, ethics and user re...Farida Vis
 
Rob Procter
Rob ProcterRob Procter
Rob ProcterNSMNSS
 
Finding potential candidates via git hub network analysis
Finding potential candidates via git hub network analysisFinding potential candidates via git hub network analysis
Finding potential candidates via git hub network analysisRangsarid Pringwanid
 
Groundhog Day: Near-Duplicate Detection on Twitter
Groundhog Day: Near-Duplicate Detection on Twitter Groundhog Day: Near-Duplicate Detection on Twitter
Groundhog Day: Near-Duplicate Detection on Twitter Ke Tao
 
Kim Hammar Msc Thesis Defense - 2018
Kim Hammar Msc Thesis Defense - 2018Kim Hammar Msc Thesis Defense - 2018
Kim Hammar Msc Thesis Defense - 2018Kim Hammar
 
Groundhog day: near duplicate detection on twitter
Groundhog day: near duplicate detection on twitterGroundhog day: near duplicate detection on twitter
Groundhog day: near duplicate detection on twitterDan Nguyen
 
Spark Social Media
Spark Social Media Spark Social Media
Spark Social Media suresh sood
 
Strata sf - Amundsen presentation
Strata sf - Amundsen presentationStrata sf - Amundsen presentation
Strata sf - Amundsen presentationTao Feng
 
The Web of Data: do we actually understand what we built?
The Web of Data: do we actually understand what we built?The Web of Data: do we actually understand what we built?
The Web of Data: do we actually understand what we built?Frank van Harmelen
 
ODSC East 2017: Data Science Models For Good
ODSC East 2017: Data Science Models For GoodODSC East 2017: Data Science Models For Good
ODSC East 2017: Data Science Models For GoodKarry Lu
 
Data council sf amundsen presentation
Data council sf    amundsen presentationData council sf    amundsen presentation
Data council sf amundsen presentationTao Feng
 
Database novelty detection
Database novelty detectionDatabase novelty detection
Database novelty detectionMostafaAliAbbas
 
Drone Emprit: Konsep dan Teknologi
Drone Emprit: Konsep dan TeknologiDrone Emprit: Konsep dan Teknologi
Drone Emprit: Konsep dan TeknologiIsmail Fahmi
 

Ähnlich wie Hashtagger+: Real-time Social Tagging of Streaming News - Dr. Georgiana Ifrim (20)

Social Phrases Having Impact in Altmetrics - SOPHIA
Social Phrases Having Impact in Altmetrics - SOPHIASocial Phrases Having Impact in Altmetrics - SOPHIA
Social Phrases Having Impact in Altmetrics - SOPHIA
 
Detecting Incongruity Between News Headline and Body Text via a Deep Hierarch...
Detecting Incongruity Between News Headline and Body Text via a Deep Hierarch...Detecting Incongruity Between News Headline and Body Text via a Deep Hierarch...
Detecting Incongruity Between News Headline and Body Text via a Deep Hierarch...
 
wendi_ppt
wendi_pptwendi_ppt
wendi_ppt
 
Data Science in 2016: Moving Up
Data Science in 2016: Moving UpData Science in 2016: Moving Up
Data Science in 2016: Moving Up
 
Data Science in 2016: Moving up by Paco Nathan at Big Data Spain 2015
Data Science in 2016: Moving up by Paco Nathan at Big Data Spain 2015Data Science in 2016: Moving up by Paco Nathan at Big Data Spain 2015
Data Science in 2016: Moving up by Paco Nathan at Big Data Spain 2015
 
Notey's talk 20160923
Notey's talk 20160923Notey's talk 20160923
Notey's talk 20160923
 
Twitter analytics: some thoughts on sampling, tools, data, ethics and user re...
Twitter analytics: some thoughts on sampling, tools, data, ethics and user re...Twitter analytics: some thoughts on sampling, tools, data, ethics and user re...
Twitter analytics: some thoughts on sampling, tools, data, ethics and user re...
 
Rob Procter
Rob ProcterRob Procter
Rob Procter
 
Finding potential candidates via git hub network analysis
Finding potential candidates via git hub network analysisFinding potential candidates via git hub network analysis
Finding potential candidates via git hub network analysis
 
Groundhog Day: Near-Duplicate Detection on Twitter
Groundhog Day: Near-Duplicate Detection on Twitter Groundhog Day: Near-Duplicate Detection on Twitter
Groundhog Day: Near-Duplicate Detection on Twitter
 
Kim Hammar Msc Thesis Defense - 2018
Kim Hammar Msc Thesis Defense - 2018Kim Hammar Msc Thesis Defense - 2018
Kim Hammar Msc Thesis Defense - 2018
 
Groundhog day: near duplicate detection on twitter
Groundhog day: near duplicate detection on twitterGroundhog day: near duplicate detection on twitter
Groundhog day: near duplicate detection on twitter
 
Spark Social Media
Spark Social Media Spark Social Media
Spark Social Media
 
Strata sf - Amundsen presentation
Strata sf - Amundsen presentationStrata sf - Amundsen presentation
Strata sf - Amundsen presentation
 
Saner17 sharma
Saner17 sharmaSaner17 sharma
Saner17 sharma
 
The Web of Data: do we actually understand what we built?
The Web of Data: do we actually understand what we built?The Web of Data: do we actually understand what we built?
The Web of Data: do we actually understand what we built?
 
ODSC East 2017: Data Science Models For Good
ODSC East 2017: Data Science Models For GoodODSC East 2017: Data Science Models For Good
ODSC East 2017: Data Science Models For Good
 
Data council sf amundsen presentation
Data council sf    amundsen presentationData council sf    amundsen presentation
Data council sf amundsen presentation
 
Database novelty detection
Database novelty detectionDatabase novelty detection
Database novelty detection
 
Drone Emprit: Konsep dan Teknologi
Drone Emprit: Konsep dan TeknologiDrone Emprit: Konsep dan Teknologi
Drone Emprit: Konsep dan Teknologi
 

Mehr von Sebastian Ruder

Frontiers of Natural Language Processing
Frontiers of Natural Language ProcessingFrontiers of Natural Language Processing
Frontiers of Natural Language ProcessingSebastian Ruder
 
Strong Baselines for Neural Semi-supervised Learning under Domain Shift
Strong Baselines for Neural Semi-supervised Learning under Domain ShiftStrong Baselines for Neural Semi-supervised Learning under Domain Shift
Strong Baselines for Neural Semi-supervised Learning under Domain ShiftSebastian Ruder
 
On the Limitations of Unsupervised Bilingual Dictionary Induction
On the Limitations of Unsupervised Bilingual Dictionary InductionOn the Limitations of Unsupervised Bilingual Dictionary Induction
On the Limitations of Unsupervised Bilingual Dictionary InductionSebastian Ruder
 
Neural Semi-supervised Learning under Domain Shift
Neural Semi-supervised Learning under Domain ShiftNeural Semi-supervised Learning under Domain Shift
Neural Semi-supervised Learning under Domain ShiftSebastian Ruder
 
Successes and Frontiers of Deep Learning
Successes and Frontiers of Deep LearningSuccesses and Frontiers of Deep Learning
Successes and Frontiers of Deep LearningSebastian Ruder
 
Optimization for Deep Learning
Optimization for Deep LearningOptimization for Deep Learning
Optimization for Deep LearningSebastian Ruder
 
Human Evaluation: Why do we need it? - Dr. Sheila Castilho
Human Evaluation: Why do we need it? - Dr. Sheila CastilhoHuman Evaluation: Why do we need it? - Dr. Sheila Castilho
Human Evaluation: Why do we need it? - Dr. Sheila CastilhoSebastian Ruder
 
Machine intelligence in HR technology: resume analysis at scale - Adrian Mihai
Machine intelligence in HR technology: resume analysis at scale - Adrian MihaiMachine intelligence in HR technology: resume analysis at scale - Adrian Mihai
Machine intelligence in HR technology: resume analysis at scale - Adrian MihaiSebastian Ruder
 
Transfer Learning for Natural Language Processing
Transfer Learning for Natural Language ProcessingTransfer Learning for Natural Language Processing
Transfer Learning for Natural Language ProcessingSebastian Ruder
 
Transfer Learning -- The Next Frontier for Machine Learning
Transfer Learning -- The Next Frontier for Machine LearningTransfer Learning -- The Next Frontier for Machine Learning
Transfer Learning -- The Next Frontier for Machine LearningSebastian Ruder
 
Making sense of word senses: An introduction to word-sense disambiguation and...
Making sense of word senses: An introduction to word-sense disambiguation and...Making sense of word senses: An introduction to word-sense disambiguation and...
Making sense of word senses: An introduction to word-sense disambiguation and...Sebastian Ruder
 
Spoken Dialogue Systems and Social Talk - Emer Gilmartin
Spoken Dialogue Systems and Social Talk - Emer GilmartinSpoken Dialogue Systems and Social Talk - Emer Gilmartin
Spoken Dialogue Systems and Social Talk - Emer GilmartinSebastian Ruder
 
NIPS 2016 Highlights - Sebastian Ruder
NIPS 2016 Highlights - Sebastian RuderNIPS 2016 Highlights - Sebastian Ruder
NIPS 2016 Highlights - Sebastian RuderSebastian Ruder
 
Modeling documents with Generative Adversarial Networks - John Glover
Modeling documents with Generative Adversarial Networks - John GloverModeling documents with Generative Adversarial Networks - John Glover
Modeling documents with Generative Adversarial Networks - John GloverSebastian Ruder
 
Multi-modal Neural Machine Translation - Iacer Calixto
Multi-modal Neural Machine Translation - Iacer CalixtoMulti-modal Neural Machine Translation - Iacer Calixto
Multi-modal Neural Machine Translation - Iacer CalixtoSebastian Ruder
 
Funded PhD/MSc. Opportunities at AYLIEN
Funded PhD/MSc. Opportunities at AYLIENFunded PhD/MSc. Opportunities at AYLIEN
Funded PhD/MSc. Opportunities at AYLIENSebastian Ruder
 
FaDA: Fast document aligner with word embedding - Pintu Lohar, Debasis Gangul...
FaDA: Fast document aligner with word embedding - Pintu Lohar, Debasis Gangul...FaDA: Fast document aligner with word embedding - Pintu Lohar, Debasis Gangul...
FaDA: Fast document aligner with word embedding - Pintu Lohar, Debasis Gangul...Sebastian Ruder
 
Transformation Functions for Text Classification: A case study with StackOver...
Transformation Functions for Text Classification: A case study with StackOver...Transformation Functions for Text Classification: A case study with StackOver...
Transformation Functions for Text Classification: A case study with StackOver...Sebastian Ruder
 
Dynamic Topic Modeling via Non-negative Matrix Factorization (Dr. Derek Greene)
Dynamic Topic Modeling via Non-negative Matrix Factorization (Dr. Derek Greene)Dynamic Topic Modeling via Non-negative Matrix Factorization (Dr. Derek Greene)
Dynamic Topic Modeling via Non-negative Matrix Factorization (Dr. Derek Greene)Sebastian Ruder
 
Idiom Token Classification using Sentential Distributed Semantics (Giancarlo ...
Idiom Token Classification using Sentential Distributed Semantics (Giancarlo ...Idiom Token Classification using Sentential Distributed Semantics (Giancarlo ...
Idiom Token Classification using Sentential Distributed Semantics (Giancarlo ...Sebastian Ruder
 

Mehr von Sebastian Ruder (20)

Frontiers of Natural Language Processing
Frontiers of Natural Language ProcessingFrontiers of Natural Language Processing
Frontiers of Natural Language Processing
 
Strong Baselines for Neural Semi-supervised Learning under Domain Shift
Strong Baselines for Neural Semi-supervised Learning under Domain ShiftStrong Baselines for Neural Semi-supervised Learning under Domain Shift
Strong Baselines for Neural Semi-supervised Learning under Domain Shift
 
On the Limitations of Unsupervised Bilingual Dictionary Induction
On the Limitations of Unsupervised Bilingual Dictionary InductionOn the Limitations of Unsupervised Bilingual Dictionary Induction
On the Limitations of Unsupervised Bilingual Dictionary Induction
 
Neural Semi-supervised Learning under Domain Shift
Neural Semi-supervised Learning under Domain ShiftNeural Semi-supervised Learning under Domain Shift
Neural Semi-supervised Learning under Domain Shift
 
Successes and Frontiers of Deep Learning
Successes and Frontiers of Deep LearningSuccesses and Frontiers of Deep Learning
Successes and Frontiers of Deep Learning
 
Optimization for Deep Learning
Optimization for Deep LearningOptimization for Deep Learning
Optimization for Deep Learning
 
Human Evaluation: Why do we need it? - Dr. Sheila Castilho
Human Evaluation: Why do we need it? - Dr. Sheila CastilhoHuman Evaluation: Why do we need it? - Dr. Sheila Castilho
Human Evaluation: Why do we need it? - Dr. Sheila Castilho
 
Machine intelligence in HR technology: resume analysis at scale - Adrian Mihai
Machine intelligence in HR technology: resume analysis at scale - Adrian MihaiMachine intelligence in HR technology: resume analysis at scale - Adrian Mihai
Machine intelligence in HR technology: resume analysis at scale - Adrian Mihai
 
Transfer Learning for Natural Language Processing
Transfer Learning for Natural Language ProcessingTransfer Learning for Natural Language Processing
Transfer Learning for Natural Language Processing
 
Transfer Learning -- The Next Frontier for Machine Learning
Transfer Learning -- The Next Frontier for Machine LearningTransfer Learning -- The Next Frontier for Machine Learning
Transfer Learning -- The Next Frontier for Machine Learning
 
Making sense of word senses: An introduction to word-sense disambiguation and...
Making sense of word senses: An introduction to word-sense disambiguation and...Making sense of word senses: An introduction to word-sense disambiguation and...
Making sense of word senses: An introduction to word-sense disambiguation and...
 
Spoken Dialogue Systems and Social Talk - Emer Gilmartin
Spoken Dialogue Systems and Social Talk - Emer GilmartinSpoken Dialogue Systems and Social Talk - Emer Gilmartin
Spoken Dialogue Systems and Social Talk - Emer Gilmartin
 
NIPS 2016 Highlights - Sebastian Ruder
NIPS 2016 Highlights - Sebastian RuderNIPS 2016 Highlights - Sebastian Ruder
NIPS 2016 Highlights - Sebastian Ruder
 
Modeling documents with Generative Adversarial Networks - John Glover
Modeling documents with Generative Adversarial Networks - John GloverModeling documents with Generative Adversarial Networks - John Glover
Modeling documents with Generative Adversarial Networks - John Glover
 
Multi-modal Neural Machine Translation - Iacer Calixto
Multi-modal Neural Machine Translation - Iacer CalixtoMulti-modal Neural Machine Translation - Iacer Calixto
Multi-modal Neural Machine Translation - Iacer Calixto
 
Funded PhD/MSc. Opportunities at AYLIEN
Funded PhD/MSc. Opportunities at AYLIENFunded PhD/MSc. Opportunities at AYLIEN
Funded PhD/MSc. Opportunities at AYLIEN
 
FaDA: Fast document aligner with word embedding - Pintu Lohar, Debasis Gangul...
FaDA: Fast document aligner with word embedding - Pintu Lohar, Debasis Gangul...FaDA: Fast document aligner with word embedding - Pintu Lohar, Debasis Gangul...
FaDA: Fast document aligner with word embedding - Pintu Lohar, Debasis Gangul...
 
Transformation Functions for Text Classification: A case study with StackOver...
Transformation Functions for Text Classification: A case study with StackOver...Transformation Functions for Text Classification: A case study with StackOver...
Transformation Functions for Text Classification: A case study with StackOver...
 
Dynamic Topic Modeling via Non-negative Matrix Factorization (Dr. Derek Greene)
Dynamic Topic Modeling via Non-negative Matrix Factorization (Dr. Derek Greene)Dynamic Topic Modeling via Non-negative Matrix Factorization (Dr. Derek Greene)
Dynamic Topic Modeling via Non-negative Matrix Factorization (Dr. Derek Greene)
 
Idiom Token Classification using Sentential Distributed Semantics (Giancarlo ...
Idiom Token Classification using Sentential Distributed Semantics (Giancarlo ...Idiom Token Classification using Sentential Distributed Semantics (Giancarlo ...
Idiom Token Classification using Sentential Distributed Semantics (Giancarlo ...
 

Hashtagger+: Real-time Social Tagging of Streaming News - Dr. Georgiana Ifrim

  • 1. Hashtagger+: Real-time Social Tagging of Streaming News. Georgiana Ifrim (joint work with Bichen Shi, Gevorg Poghosyan, Neil Hurley), Insight Centre for Data Analytics, University College Dublin, Ireland
  • 2. The Umbrella Revolution: Sit-in street protests in Hong Kong, 2014. “The ants have megaphones now” (C. Anderson)
  • 3. [Timeline, Sep 21 to Oct 07, of the hashtags #OccupyCentral, #UmbrellaRevolution and #HongKong, annotated with news headlines:] Hong Kong students begin pro-democracy class boycott; Thousands at Hong Kong protest as Occupy Central is launched; Hong Kong protests: Thousands defy calls to go home; Hong Kong students vow stronger protests if leader stays; Hong Kong protests: Formal talks agreed as protests shrink
  • 4. Insight Centre for Data Analytics. Motivation: News articles – Hashtag – Twitter conversation. #IndyRef (Referendum on Scottish Independence): BBC: Scottish independence: Yes vote 'means big Scots EU boost'; BBC: Could Scotland compete on tax with Westminster?; IrishTimes: Brown promises more devolution for Scotland; RTE: Lloyds could move south if Scots vote for independence; Reuters: British PM heads to Scotland as independence campaign gathers steam; TheGuardian: Scottish independence: No camp sends for Gordon Brown as polls tighten. April 2016
  • 6. Problem Statement: Map a stream of articles to a stream of hashtags in real time, with high precision and high coverage. Examples: Joe Schmidt makes six changes to Irish side to face Japan #rugby; Paris Airshow: eight takeaways from the major aerospace event #business; The tortoise and the software: the human glitch in the machine #ux; Duke of Edinburgh leaves hospital #princephilip
  • 7. Problem Statement • Real-time Recommendation: given an article, how quickly can we recommend hashtags? (5 min ok, 5 h not ok) • High-precision (focused hashtags): ✗ Deadly car bomb targets Afghan bank #news; ✓ Deadly car bomb targets Afghan bank #afghanistan #helmand • High-coverage: how many articles get any recommended hashtags within 5 minutes? (9 out of 10 ok, 1 out of 10 not ok)
  • 8. State-of-the-Art • Modeling Approach: Multi-class Classification; Content-based Features; Static Datasets • Workflow (diagram labels: Tweets, Article): Collect Tweets -> Hashtags as Classes -> Train Hashtag Classifiers -> Apply Classifiers to Article -> Recommended Hashtags
  • 9. State-of-the-art: • Multi-class classification (e.g., Naive Bayes, SVM, LDA, CNN) • One hashtag = one class • Content-based features. Example classes with their tweets: #GE16: "#ge16: Fine Gael and Fianna Fáil to discuss government options"; "Ruth Coppinger to be nominated for Taoiseach #GE16 #irishwater"; … #Germanwings: ""No evidence" that co-pilot told anyone he was planning #Germanwings crash, prosecutor says"; …
  • 10. State-of-the-art: Train the model on tweets grouped by hashtag class (#GE16: "#ge16: Fine Gael and Fianna Fáil to discuss government options"; "Ruth Coppinger to be nominated for Taoiseach #GE16 #irishwater"; … #Germanwings: ""No evidence" that co-pilot told anyone he was planning #Germanwings crash, prosecutor says"; …), then apply it. How about new hashtags? #PanamaPapers: "#PanamaPapers: Mossack Fonseca leak reveals elite's tax havens"; "#PanamaPapers: How the World's Rich and Famous Hide Their Money Offshore". Concept-drift of old hashtags? "#Germanwings: One year on, Haltern commemorates the crash"; "Nice flight from Manchester to Koln/Bonn this morning." Weakness: Re-train the model
  • 11. Challenges: • Many Classes: thousands of hashtags (e.g., 26k/day) • Dynamic Classes: hashtags emerge and die off • Concept Drift: the usage and meaning of hashtags change • Efficiency/coverage: real-time tagging to capture how the story moves over time • Precision: state-of-the-art models have P@1 of ~50%
  • 12. Hashtagger+ Model • Modeling Approach: Learning-to-rank (L2R); Focus on the concept of hashtag relevance; IR Framework: Article = query, Hashtags = documents retrieved/ranked for the query • Workflow (diagram labels: Article, Tweets)
  • 14. Hashtagger+ Model
    SOTA: Multi-class Classification (Object -> Class):
    Article1 -> Hashtagx, Hashtagy, Hashtagz
    Article2 -> Hashtagx
    Article3 -> Hashtagy, Hashtagz
    Proposed L2R Model (Object -> Class):
    (Article1, Hashtagx) -> Relevant
    (Article1, Hashtagm) -> Irrelevant
    (Article1, Hashtagn) -> Irrelevant
  • 15. Hashtagger+ Model • Pointwise L2R model • Input feature vector x_(article, hashtag, time) describes a given (Article, Hashtag) pair at a point in time • Human-provided label y_(article, hashtag, time) tells if the hashtag is relevant or irrelevant to the article, at that point in time • Time-aware features capture how strongly a hashtag is associated with an article: Content Similarity; Hashtag Popularity, Specificity, Trending; User Credibility
  • 16. Hashtagger+ Model: Train once, use model (no retraining needed)
    Train (pair, feature values, label):
    (Article1, #GE16)           0.34  0.73  0  …  Relevant
    (Article1, #Germanwings)    0.01  0.23  0  …  Irrelevant
    …
    (Article2, #GE16)           0.02  0.48  0  …  Irrelevant
    (Article2, #Germanwings)    0.76  0.45  1  …  Relevant
    …
    Apply. How about new hashtags? Concept-drift of old hashtags?
    (Article1, #PanamaPapers)   0.66  0.82  1  …
    (Article2, #PanamaPapers)   0.08  0.73  0  …
    (Article1, #Germanwings)    0.28  0.45  0  …
    (Article2, #Germanwings)    0.53  0.24  1  …
  • 17. Two-Step L2R Approach • Filtering: Article -> Set of Candidate Hashtags • Efficient Data Collection • Query generation from given article • Retrieving relevant tweets for article/query • Ranking Model: Article, Candidate Hashtags -> Ranked Hashtag List • Apply pre-trained L2R model to rank candidate hashtags
  • 19. Query Generation: Article -> Query • What is a good set of keywords to describe what the article is about? (open research problem) • How quickly can we generate the query? • How good is the set of tweets retrieved with a given query? • We compare 4 methods for query generation and the effect on quality & size of retrieved tweet set
  • 20. Tweet Retrieval: Query -> Tweets • Given a query (generated from an article), how do we quickly collect a good set of tweets? • Cold-start Search for new articles: • Re-use tweets collected for older articles • How do we do this efficiently/effectively? • Twitter Streaming API to continuously update tweet collection for each article
  • 21. Experiments • Query Generation • Comparing L2R Algorithms • Comparing to State-of-the-Art Methods
  • 22. Query Generation: empirical study to evaluate the impact of each query type on the amount/quality of data collected, as well as how this influences the recommendation effectiveness.
    TABLE 1. Example article and ranked article-keyphrases using 4 approaches.
    Article Headline: Easyjet doubles number of female pilots
    Subheadline: Easyjet says it has doubled the number of female pilots this year and is on the hunt for more.
    First Sentence: The Amy Johnson initiative, named after the first female pilot to fly solo from the UK to Australia, caused a surge in applications.
    POS + Tf.idf: (1) australia easyjet, (2) easyjet number, (3) easyjet uk, (4) australia number, (5) australia uk
    POS + NER + Tf.idf: (1) amy johnson, (2) australia easyjet, (3) easyjet uk, (4) australia uk, (5) easyjet number
    AlchemyAPI: (1) amy johnson initiative, (2) female pilots, (3) easyjet, (4) female pilot, (5) surge
    URL: (1) bbc.com/news/business-38326523
  • 23. Query Generation: Precision@1, article coverage and running time for hashtag recommendation
    TABLE 3. P@1, article coverage, and running time of end-to-end hashtag recommendation using tweets collected using four query generation methods.
                L2R (POS + Tf.idf)   L2R (POS + NER + Tf.idf)   L2R (AlchemyAPI)   L2R (URL)
    P@1         0.930                0.947                      0.901              0.410
    Coverage    67.3%                63.3%                      71.3%              22.1%
    Time        301s                 200s                       588s               48s
    TABLE 4. Average cosine similarity, number of tweets, number of candidate hashtags and hashtag frequency using tweets collected using four query generation methods.
                POS + Tf.idf   POS + NER + Tf.idf   AlchemyAPI   URL
    Cosine      0.221          0.242                0.246        0.265
    Tweets      3696.2         2982.9               5083.8       4.2
    Hashtags    529            442                  976          1.5
    Tag Freq    5.26           5.73                 5.81         1.49
  • 24. Comparing L2R algorithms
    TABLE 5. Comparing the P@1, NDCG@3 and running time of 16 ranking methods using Ranklib, sklearn and Cornell’s RankSVM.
    L2R Algorithm                     P@1     NDCG@3   Time(s)
    Pointwise
    RandomForest(sklearn)             0.852   0.848    2.75
    MultilayerPerceptron(sklearn)     0.835   0.803    6.14
    SVM(poly)(sklearn)                0.823   0.827    0.78
    GradientBoosting(sklearn)         0.810   0.817    1.71
    LinearRegression(sklearn)         0.803   0.824    0.16
    AdaBoost(sklearn)                 0.801   0.840    1.51
    RandomForest(ranklib)             0.792   0.778    2.01
    MART(ranklib)                     0.783   0.768    49.87
    GaussianNaiveBayes(sklearn)       0.764   0.757    0.05
    Pairwise
    RankBoost(ranklib)                0.774   0.773    15.67
    RankSVM(cornell)                  0.728   0.734    2.05
    RankNet(ranklib)                  0.654   0.718    7.45
    Listwise
    CoordinateAscent(ranklib)         0.778   0.765    28.11
    LambdaMART(ranklib)               0.769   0.766    54.48
    ListNet(ranklib)                  0.751   0.756    14.56
    AdaRank(ranklib)                  0.737   0.749    2.53
  • 25. Comparing to State-of-the-Art • Multi-class Classification Methods: • Use hashtagged tweets as labeled data (hashtag = class) • Need to wait to collect enough training data (tweet history size, e.g., 2h or 4h of past tweets) • Need to be retrained often to keep up with changes in tweet vocabulary and emerging/dying hashtags (retraining time: the time required to train the model decides how often we can re-train) • Naive Bayes, Liblinear SVM, Neural Net • L2R Methods: • Trained once with hashtagged tweets or manually labeled (article, hashtag) examples • Pairwise L2R and Pointwise L2R (Hashtagger+)
  • 26. Comparing to State-of-the-Art (figure from IEEE Transactions on Knowledge and Data Engineering). Fig. 7. P@1 and article coverage of the SOTA methods compared. Axes: Article Coverage, Precision@1. Methods: Hashtagger+ (search), Hashtagger (stream), PairwiseL2R, Liblinear (2h/30min), Naive Bayes (4h/5min), MultilayerPerc (1h/1h). Annotated points (All Articles): 66% coverage, P@1 0.94 (th=0.5); 77% coverage, P@1 0.89 (th=0.3).
  • 27. Comparing to State-of-the-Art: Popular vs Niche Articles. Fig. 8. P@1 and article coverage for popular versus niche articles. Axes: Article Coverage, Precision@1. Methods: Hashtagger (stream), PairwiseL2R, Liblinear (2h/30min), Naive Bayes (4h/5min), MultilayerPerc (1h/1h). Annotated points (Niche Articles): 45% coverage, P@1 0.94 (th=0.5); 58% coverage, P@1 0.89 (th=0.3).
  • 28. Applications • Hashtagger+ is deployed in a Web application (http://insight4news.ucd.ie) • Using the recommended hashtags: • News Publishing on Twitter • Story Detection & Tracking
  • 29. Live Tweeting with Hashtagger+: https://twitter.com/Insight4News3
  • 30. Twitter account (@insight4news3) automatically tweets article headlines. Randomly allocate articles into 3 groups: No Hashtag: Article Headline + URL; #News: Article Headline + URL + #News; Hashtagger: Article Headline + URL + Recommended Hashtags. [Bar charts of Twitter Analytics stats for the three groups (No Hashtag, #News, Hashtagger): Sum of Impressions, Sum of Engagements, Sum of Url Clicks]
  • 32. Story Tracking with Social Tags
  • 33. Story Tracking with Social Tags
  • 34. Conclusion • Hashtagger+: a framework for real-time hashtag recommendation to news. • L2R model trained with human-labeled data can address efficiency & precision challenges. • By merging news and social media we can address difficult problems: story & entity detection/visualization/tracking/disambiguation/linking.
  • 35. Thank you! References • Hashtagger+: Efficient High-Coverage Social Tagging of Streaming News, B. Shi, G. Poghosyan, G. Ifrim, N. Hurley [2017, under review] • Learning-to-Rank for Real-Time High-Precision Hashtag Recommendation for Streaming News, B. Shi, G. Ifrim, N. Hurley [WWW16] • Real-time News Story Detection and Tracking with Hashtags, G. Poghosyan, G. Ifrim [CNewsStory16] • Topy: Real-time Story Tracking via Social Tags, G. Poghosyan, A. Qureshi, G. Ifrim [ECML/PKDD16] • Insight4news: Connecting news to relevant social conversations, B. Shi, G. Ifrim, N. Hurley [ECML/PKDD14]
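
The evaluation measures reported in the slides (P@1, article coverage, NDCG@3) and the trade-off between the th=0.5 and th=0.3 points in the comparison figures can be sketched as follows. This is a minimal Python sketch, not the authors' evaluation code. It assumes binary relevance labels, and it assumes that th is a cutoff on the model's relevance score below which no hashtag is recommended for an article. All function names and the toy scores are hypothetical.

```python
from math import log2


def top_hashtag(scored, threshold):
    """Return the best-scoring hashtag, or None if nothing clears the threshold."""
    if not scored:
        return None
    tag, score = max(scored.items(), key=lambda kv: kv[1])
    return tag if score >= threshold else None


def precision_at_1_and_coverage(predictions, relevant, threshold):
    """P@1 over the articles that received a hashtag, plus the share of such articles."""
    tagged = 0
    correct = 0
    for article, scored in predictions.items():
        tag = top_hashtag(scored, threshold)
        if tag is None:
            continue  # article stays untagged, which lowers coverage
        tagged += 1
        if tag in relevant.get(article, set()):
            correct += 1
    coverage = tagged / len(predictions) if predictions else 0.0
    p_at_1 = correct / tagged if tagged else 0.0
    return p_at_1, coverage


def ndcg_at_k(ranked_tags, relevant_tags, k=3):
    """NDCG@k with binary gains (an assumption; graded gains are also common)."""
    dcg = sum(1.0 / log2(i + 2)
              for i, tag in enumerate(ranked_tags[:k]) if tag in relevant_tags)
    ideal_hits = min(k, len(relevant_tags))
    idcg = sum(1.0 / log2(i + 2) for i in range(ideal_hits))
    return dcg / idcg if idcg else 0.0


# Hypothetical relevance scores per (article, hashtag) pair.
scores = {
    "article1": {"#GE16": 0.9, "#news": 0.2},
    "article2": {"#Germanwings": 0.6, "#GE16": 0.1},
    "article3": {"#news": 0.4},
    "article4": {},  # no candidate hashtags were retrieved
}
# Hypothetical human labels: the relevant hashtags for each article.
relevant = {
    "article1": {"#GE16"},
    "article2": {"#Germanwings"},
    "article3": {"#afghanistan"},
}

strict = precision_at_1_and_coverage(scores, relevant, threshold=0.5)
loose = precision_at_1_and_coverage(scores, relevant, threshold=0.3)
ndcg = ndcg_at_k(["#GE16", "#news", "#irishwater"], {"#GE16", "#irishwater"}, k=3)
print(strict, loose, ndcg)
```

On this toy data, lowering the threshold from 0.5 to 0.3 tags one more article (coverage rises from 2/4 to 3/4) but the extra hashtag is wrong (P@1 drops from 1 to 2/3), which is the same direction of trade-off as the annotated points in the figures.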