SlideShare ist ein Scribd-Unternehmen logo
1 von 35
Downloaden Sie, um offline zu lesen
Twinder: A Search Engine for Twitter Streams

                                   #ICWE2012 Berlin, Germany
                                              July 25th, 2012




     Ke Tao, Fabian Abel, Claudia Hauff, Geert-Jan Houben
                      Web Information Systems, TU Delft



                                                            1
Get information from Twitter
How do people use Twitter as a source of information?

•  Twitter is more like a news media.


•  How do people search Twitter?
   •  Search on Twitter (Teevan et al.)



                                  Web Search     Twitter Search
      Query length (chars)            18.80          12.00
     Query length (words)                 3.08       1.64
      Is a celebrity name             3.11%         15.22%



           Twinder: A search engine for Twitter streams           2
Research Questions
What are the challenges we are facing?

 1. Given a topic, can we identify the relevant tweets
 based on the characteristics of the tweets?


 2. Are semantics meaningful for determining the tweets’
 relevance for a topic?


 3. How can we design an architecture that is scalable?




          Twinder: A search engine for Twitter streams     3
Search on Twitter
What you can do on current Twitter?


• Twitter search interface

• Ordered by time

• Keyword-based match




          Twinder: A search engine for Twitter streams   4
Twinder = TWeet +fINDER
Our solution - Architecture
                                                                                               $03.0&

                                                            B%"&@'                                 3""?:#*<'
                                                                                    &"1%5$1'

                                                                        ."#&*/'01"&'2-$"&3#*"'

                             784*%3.&&
                             93/.2:&&                                   4"5"6#-*"'(1+7#+,-'
                             ;*+4*3&
                                                                         !"#$%&"'()$&#*+,-'
                             !"#$%&"'()$&#*+,-'


                                                   1#(4254*03*04)63&&                              1#(42503*04)63&&
                                ;#1<'=&,<"&'


                                                  .@-$#*+*#5'!"#$%&"1'                            C"@D,&?>:#1"?'
                                                                                                    4"5"6#-*"'
                 3"#$%&"'                         A,-$")$%#5'!"#$%&"1'          2-?")'
                ")$&#*+,-'                                                                        ."7#-+*1>:#1"?'
!"#$%&!#'($)*+&   $#1<1'                          ."7#-+*'!"#$%&"1'                                  4"5"6#-*"'
 ,*-./01.$21$.3&


                                                                                     7"11#E"1'


                                                                         .,*8#5'9":'.$&"#71'



                   Twinder: A search engine for Twitter streams                                                       5
Core Components of Twinder
Feature Extraction

 •  Receive Twitter messages from Social Web Streams


 •  Features of two categories:
    •  (1) Topic-sensitive features

    •  (2) Topic-insensitive features



 •  Different extracting strategies designed for different features




          Twinder: A search engine for Twitter streams                6
Core Components of Twinder
Feature Extraction Task Broker

 •  Twinder makes use of MapReduce and cloud computing
    infrastructures to allow for high scalability and frequent
    updates of its multifaceted index.


 •  Feature Extration Task Broker dispatches features
    extration tasks and indexing tasks to cloud computing
    infrastructure.




          Twinder: A search engine for Twitter streams           7
Core Components of Twinder
Relevance Estimation

 •  Accepting search queries from front-end, passing them to
    Feature Extration Component.


 •  Tweets are classified into the relevant and the non-relevant
    by Relevance Estimation component, and are further
    delivered to front-end for rendering.


 •  Twinder can learn the classification model, initially from
    training dataset, then from usage data.




          Twinder: A search engine for Twitter streams             8
Efficiency of Indexing
How good does Twinder make use of cloud-computing
infrastructure?


Corpus size        Mainstream Server EMR(10 instances)
100k (13MBytes)    0.4 min             5 min
1m (122MBytes)     5 min               8 min
10m (1.3GBytes)    48 min              19 min
32m (3.9GBytes)    283 min             47 min




         Twinder: A search engine for Twitter streams    9
Features of Microposts

 Hypothesis H1: The greater the keyword-
 based relevance score, the more relevant
 and interesting the tweet is to the topic.

 We already have keyword-based relevance,
 and…?

   Topic sensitive            Topic insensitive
    Keyword-based
      relevance
          ?
       Twinder: A search engine for Twitter streams   10
Semantic-based relevance
Expand the queries to match more tweets




                               dbp:United_States

  Reformulated query is expected to get a
            dbp:Hu_Jintao
  more accurate retrieval score.

 Hypothesis H2 : The greater the semantic-
 based relevance score, the more relevant
 and interesting the tweet is.
         Twinder: A search engine for Twitter streams   11
Semantic-based relatedness
Is there a semantic overlap between the query and the tweet?




                                dbp:the_United_States

dbp:Hu_Jintao


Hypothesis H3 : If a tweet is considered to
be semantically related to the query then it
is also relevant and interesting for the user.


          Twinder: A search engine for Twitter streams         12
Overview of features
What do we have now?




   Topic sensitive             Topic insensitive
    Keyword-based                      ?
   Semantic-based




         Twinder: A search engine for Twitter streams   13
Syntactical feature : Hashtag
Is a tweet more relevant if it contains a #hashtag?

  Hypothesis 4: tweets that contain hashtags
  are more likely to be relevant than tweets
  that do not contain hashtags.




          Twinder: A search engine for Twitter streams   14
Syntactical feature : hasURL
Is a tweet that contains a URL more relevant?

  Hypothesis 5: tweets that contain a URL are
  more likely to be relevant than tweets that
  do not contain a URL.




          Twinder: A search engine for Twitter streams   15
Syntactical feature : isReply
Is a tweet which is a reply to @somebody more relevant?

  Hypothesis 6: tweets that are formulated as
  a reply to another tweet are less likely to be
  relevant than other tweets.




          Twinder: A search engine for Twitter streams    16
Syntactical feature : length
Does the length of a tweet influence its relevance for a topic?




  Hypothesis 7: the longer a tweet, the more
  likely it is to be relevant and interesting.



          Twinder: A search engine for Twitter streams            17
Overview of features
 Short summary




    Topic sensitive             Topic insensitive
     Keyword-based              Syntactical features
    Semantic-based                       ?


   Are there further features that
allow for estimating the relevance?

          Twinder: A search engine for Twitter streams   18
Semantic features
Find semantics in a tweet to estimate the relevance


         dbp:Tim_Berners-Lee          dbp:World_Wide_Web




                                              dbp:France

                                       dbp:Lyon

                          dbp:International_World_Wide_Web_Conference




          Twinder: A search engine for Twitter streams             19
Semantic features : #entity
Is a tweet with more entities more interesting?


• 5 entities extracted.




  Hypothesis 8: the more entities a tweet
  mentions, the more likely it is to be relevant
  and interesting.
          Twinder: A search engine for Twitter streams   20
Semantic features : diversity
How many types are there in the entities?


• 4 types of entities




  Hypothesis 9: the greater the diversity of
  concepts mentioned in a tweet, the more
  likely it is to be interesting and relevant.
          Twinder: A search engine for Twitter streams   21
Semantic features : sentiment
Was the author of the tweet happy or not?


• Sentiment : Neutral




  Hypothesis 10: the likelihood of a tweet’s
  relevance is influenced by its sentiment
  polarity.
          Twinder: A search engine for Twitter streams   22
Overview of features
By now, we have 4 types of features.




    Topic sensitive              Topic insensitive
     Keyword-based                  Syntactical
    Semantic-based                  Semantics



       Can we utilize the contextual
          information of tweets?
          Twinder: A search engine for Twitter streams   23
Contextual features
Does the number of followers influence the relatedness?




    Hypothesis 11: The higher the number of
    followers a creator of a message has, the
    more likely it is that her tweets are relevant.


          Twinder: A search engine for Twitter streams    24
Contextual features
Or the number of followers that the author appears in?

 Hypothesis 12: The higher the number of
 lists in which the creator of a message
 appears, the more likely it is that her tweets
 are relevant.




          Twinder: A search engine for Twitter streams   25
Contextual features
How long has been the author on Twitter?

 Hypothesis 13: The older the Twitter
 account of a user, the more likely it is that
 her tweets are relevant.




         Signed up                    Post


   July 2008                    June 2012




           Twinder: A search engine for Twitter streams   26
Summary of Features
The features




    Topic sensitive             Topic insensitive
     Keyword-based                 Syntactical
    Semantic-based                 Semantics
                                   Contextual




          Twinder: A search engine for Twitter streams   27
Analysis

• Research Questions:
  1.  Which features are more influential on predicting the
      relatedness of a tweet to a certain topic?
  2.  Which types of features are more important? Are
      semantics meaningful?
  3.  What’s the performance that we can achieve by utilizing
      these features?

• Twinder Setup
  •  Consider the search problem as a classification task
  •  Classification algorithm = Logistic Regression



         Twinder: A search engine for Twitter streams           28
Dataset
From TREC 2011 Microblog Track


•  Twitter corpus
   •  16 million tweets (Jan. 24th, 2011 – Feb. 8th)
   •  4,766,901 tweets classified as English
   •  6.2 million entity-extractions (140k distinct entities)

•  Relevance judgments
   •  49 topics
   •  40,855 (topic, tweet) pairs
   •  60.31 relevant tweets per topic (on average)




           Twinder: A search engine for Twitter streams         29
Results
Which type of features matters?
Features             Precision       Recall            F-measure
keyword relevance           0.3036            0.2851         0.2940
semantic relevance         0.3050             0.3294        0.3167
topic-sensitive            0.3135             0.3252        0.3192
topic-insensitive           0.1956            0.0064         0.0123
without semantics           0.3363            0.4618         0.3965
without sentiment           0.3701            0.3923         0.4048
without context            0.3827             0.4714        0.4225
all features                0.3674            0.4736         0.4138

 Overall, we can achieve the precision and
 recall of over 35% and 45% respectively by
 applying all the features.
           Twinder: A search engine for Twitter streams            30
Weights of features                              Syntactical
     Which feature matters?         2
                                    1
         Keyword-based              0    hasHashtag     hasURL               isReply       length
2
                                    -1
1
0                                                       Semantics
-1       Keyword-based relevance    2
                                    1
         Semantic-based             0
     2                                      #entities            diversity             sentiment

                                    -1
     1
                                                      Contextual
     0
                                    2
 -1      Relevance    Relatedness
                                    1
                                    0      #followers             #lists                 Age

                                    -1
               Twinder: A search engine for Twitter streams                                        31
Topics of different categories
The impact on the performance and models

•  49 topics categorized into 2 parts w.r.t. 3 dimensions:
   •  Popularity
   •  Gobal vs. Local
   •  Temporal persistence

•  Popularity
   •  Higher recall for popular topics
   •  Less impact from sentiment features on unpopular topics

•  Temporal persistence
   •  Higher performance on shorter-term topics
   •  Less impact from sentiment features on persistent topic



           Twinder: A search engine for Twitter streams         32
Conclusions
What are our contributions?

1.  Twinder search engine proposed: analyzing various
    features to determine the relevance and interestingness of
    Twitter messages for a given topic.

2.  Scalability demonstrated for the Twinder search engine.

3.  Extensive analysis on 13 features along two-dimensions:
    topic-sensitive features and topic-insensitive features.




          Twinder: A search engine for Twitter streams           33
Conclusions
The lessons learned

1.  The learned models which take advantage of semantics
    and topic-sensitive features outperform those which do not
    take the semantics and topic-sensitive features into
    account.

2.  Contextual features that characterize the users who are
    posting the messages have little impact on the relevance
    estimation.

3.  The importance of a feature differs depending on the topic
    characteristics; for example, the sentiment-based features
    are more important for popular than for unpopular topics.



          Twinder: A search engine for Twitter streams           34
QUESTIONS?



July 25th, 2012

k.tao@tudelft.nl       http://ktao.nl/


THANK YOU!


            Twinder: A search engine for Twitter streams   35

Weitere ähnliche Inhalte

Andere mochten auch

Leveraging User Modeling on the Social Web with Linked Data
Leveraging User Modeling on the Social Web with Linked DataLeveraging User Modeling on the Social Web with Linked Data
Leveraging User Modeling on the Social Web with Linked DataKe Tao
 
2014-User Modeling for Contextual Suggestion-TREC
2014-User Modeling for Contextual Suggestion-TREC2014-User Modeling for Contextual Suggestion-TREC
2014-User Modeling for Contextual Suggestion-TRECHua Li, PhD
 
UMAP2016 - Analyzing Aggregated Semantics-enabled User Modeling on Google+ an...
UMAP2016 - Analyzing Aggregated Semantics-enabled User Modeling on Google+ an...UMAP2016 - Analyzing Aggregated Semantics-enabled User Modeling on Google+ an...
UMAP2016 - Analyzing Aggregated Semantics-enabled User Modeling on Google+ an...GUANGYUAN PIAO
 
SEMANTiCS2016 - Exploring Dynamics and Semantics of User Interests for User ...
SEMANTiCS2016 - Exploring Dynamics and Semantics of User Interests for User ...SEMANTiCS2016 - Exploring Dynamics and Semantics of User Interests for User ...
SEMANTiCS2016 - Exploring Dynamics and Semantics of User Interests for User ...GUANGYUAN PIAO
 
EKAW2016 - Interest Representation, Enrichment, Dynamics, and Propagation: A ...
EKAW2016 - Interest Representation, Enrichment, Dynamics, and Propagation: A ...EKAW2016 - Interest Representation, Enrichment, Dynamics, and Propagation: A ...
EKAW2016 - Interest Representation, Enrichment, Dynamics, and Propagation: A ...GUANGYUAN PIAO
 
GeniUS: Generic User Modeling Library for the Social Semantic Web
GeniUS: Generic User Modeling Library for the Social Semantic WebGeniUS: Generic User Modeling Library for the Social Semantic Web
GeniUS: Generic User Modeling Library for the Social Semantic WebWeb Information Systems, TU Delft
 

Andere mochten auch (6)

Leveraging User Modeling on the Social Web with Linked Data
Leveraging User Modeling on the Social Web with Linked DataLeveraging User Modeling on the Social Web with Linked Data
Leveraging User Modeling on the Social Web with Linked Data
 
2014-User Modeling for Contextual Suggestion-TREC
2014-User Modeling for Contextual Suggestion-TREC2014-User Modeling for Contextual Suggestion-TREC
2014-User Modeling for Contextual Suggestion-TREC
 
UMAP2016 - Analyzing Aggregated Semantics-enabled User Modeling on Google+ an...
UMAP2016 - Analyzing Aggregated Semantics-enabled User Modeling on Google+ an...UMAP2016 - Analyzing Aggregated Semantics-enabled User Modeling on Google+ an...
UMAP2016 - Analyzing Aggregated Semantics-enabled User Modeling on Google+ an...
 
SEMANTiCS2016 - Exploring Dynamics and Semantics of User Interests for User ...
SEMANTiCS2016 - Exploring Dynamics and Semantics of User Interests for User ...SEMANTiCS2016 - Exploring Dynamics and Semantics of User Interests for User ...
SEMANTiCS2016 - Exploring Dynamics and Semantics of User Interests for User ...
 
EKAW2016 - Interest Representation, Enrichment, Dynamics, and Propagation: A ...
EKAW2016 - Interest Representation, Enrichment, Dynamics, and Propagation: A ...EKAW2016 - Interest Representation, Enrichment, Dynamics, and Propagation: A ...
EKAW2016 - Interest Representation, Enrichment, Dynamics, and Propagation: A ...
 
GeniUS: Generic User Modeling Library for the Social Semantic Web
GeniUS: Generic User Modeling Library for the Social Semantic WebGeniUS: Generic User Modeling Library for the Social Semantic Web
GeniUS: Generic User Modeling Library for the Social Semantic Web
 

Ähnlich wie Twinder: A Search Engine for Twitter Streams

Twitter, Twinder, Twitcident: Filtering and Search in Social Web Streams
Twitter, Twinder, Twitcident: Filtering and Search in Social Web StreamsTwitter, Twinder, Twitcident: Filtering and Search in Social Web Streams
Twitter, Twinder, Twitcident: Filtering and Search in Social Web StreamsFabian Abel
 
Groundhog Day: Near-Duplicate Detection on Twitter
Groundhog Day: Near-Duplicate Detection on Twitter Groundhog Day: Near-Duplicate Detection on Twitter
Groundhog Day: Near-Duplicate Detection on Twitter Ke Tao
 
Unleashing twitter data for fun and insight
Unleashing twitter data for fun and insightUnleashing twitter data for fun and insight
Unleashing twitter data for fun and insightDigital Reasoning
 
Unleashing Twitter Data for Fun and Insight
Unleashing Twitter Data for Fun and InsightUnleashing Twitter Data for Fun and Insight
Unleashing Twitter Data for Fun and InsightMatthew Russell
 
Twitter, Twinder, Twitcident: Filtering and Search in Social Web Streams
Twitter, Twinder, Twitcident: Filtering and Search in Social Web StreamsTwitter, Twinder, Twitcident: Filtering and Search in Social Web Streams
Twitter, Twinder, Twitcident: Filtering and Search in Social Web StreamsWeb Information Systems, TU Delft
 
Twitter analysis by Kaify Rais
Twitter analysis by Kaify RaisTwitter analysis by Kaify Rais
Twitter analysis by Kaify RaisAjay Ohri
 
Detection and Analysis of Twitter Trending Topics via Link-Anomaly Detection
Detection and Analysis of Twitter Trending Topics via Link-Anomaly DetectionDetection and Analysis of Twitter Trending Topics via Link-Anomaly Detection
Detection and Analysis of Twitter Trending Topics via Link-Anomaly DetectionIJERA Editor
 
Development of Twitter Application #2 - Twitter for Websites
Development of Twitter Application #2 - Twitter for WebsitesDevelopment of Twitter Application #2 - Twitter for Websites
Development of Twitter Application #2 - Twitter for WebsitesMyungjin Lee
 
How Twitter Usage Differs Between Humans and Company Accounts
How Twitter Usage Differs Between Humans and Company AccountsHow Twitter Usage Differs Between Humans and Company Accounts
How Twitter Usage Differs Between Humans and Company AccountsAlex Ishida
 
Twitter: A Hands-On Learning Session for Researcher
Twitter: A Hands-On Learning Session for ResearcherTwitter: A Hands-On Learning Session for Researcher
Twitter: A Hands-On Learning Session for ResearcherKMb Unit, York University
 
IRJET - Implementation of Twitter Sentimental Analysis According to Hash Tag
 IRJET - Implementation of Twitter Sentimental Analysis According to Hash Tag IRJET - Implementation of Twitter Sentimental Analysis According to Hash Tag
IRJET - Implementation of Twitter Sentimental Analysis According to Hash TagIRJET Journal
 
Under the Hood with Social Metadata - SEMpdx Feb 2014
Under the Hood with Social Metadata - SEMpdx Feb 2014Under the Hood with Social Metadata - SEMpdx Feb 2014
Under the Hood with Social Metadata - SEMpdx Feb 2014Mike Arnesen
 
Social Media Brand Positioning Workflow- David Gerson
Social Media Brand Positioning Workflow- David GersonSocial Media Brand Positioning Workflow- David Gerson
Social Media Brand Positioning Workflow- David GersonPyData
 
Perceptual Mapping using Twitter Data
Perceptual Mapping using Twitter DataPerceptual Mapping using Twitter Data
Perceptual Mapping using Twitter DataDavid Gerson
 
Vector Search for Data Scientists.pdf
Vector Search for Data Scientists.pdfVector Search for Data Scientists.pdf
Vector Search for Data Scientists.pdfConnorShorten2
 
Hadoop presentation
Hadoop presentationHadoop presentation
Hadoop presentationPriyaj Kumar
 
Adventure in Data: A tour of visualization projects at Twitter
Adventure in Data: A tour of visualization projects at TwitterAdventure in Data: A tour of visualization projects at Twitter
Adventure in Data: A tour of visualization projects at TwitterKrist Wongsuphasawat
 

Ähnlich wie Twinder: A Search Engine for Twitter Streams (20)

The Value of Twitter
The Value of TwitterThe Value of Twitter
The Value of Twitter
 
Twitter, Twinder, Twitcident: Filtering and Search in Social Web Streams
Twitter, Twinder, Twitcident: Filtering and Search in Social Web StreamsTwitter, Twinder, Twitcident: Filtering and Search in Social Web Streams
Twitter, Twinder, Twitcident: Filtering and Search in Social Web Streams
 
Groundhog Day: Near-Duplicate Detection on Twitter
Groundhog Day: Near-Duplicate Detection on Twitter Groundhog Day: Near-Duplicate Detection on Twitter
Groundhog Day: Near-Duplicate Detection on Twitter
 
Unleashing twitter data for fun and insight
Unleashing twitter data for fun and insightUnleashing twitter data for fun and insight
Unleashing twitter data for fun and insight
 
Unleashing Twitter Data for Fun and Insight
Unleashing Twitter Data for Fun and InsightUnleashing Twitter Data for Fun and Insight
Unleashing Twitter Data for Fun and Insight
 
Analyzing social media with Python and other tools (2/4)
Analyzing social media with Python and other tools (2/4) Analyzing social media with Python and other tools (2/4)
Analyzing social media with Python and other tools (2/4)
 
Twitter, Twinder, Twitcident: Filtering and Search in Social Web Streams
Twitter, Twinder, Twitcident: Filtering and Search in Social Web StreamsTwitter, Twinder, Twitcident: Filtering and Search in Social Web Streams
Twitter, Twinder, Twitcident: Filtering and Search in Social Web Streams
 
Twitter analysis by Kaify Rais
Twitter analysis by Kaify RaisTwitter analysis by Kaify Rais
Twitter analysis by Kaify Rais
 
Detection and Analysis of Twitter Trending Topics via Link-Anomaly Detection
Detection and Analysis of Twitter Trending Topics via Link-Anomaly DetectionDetection and Analysis of Twitter Trending Topics via Link-Anomaly Detection
Detection and Analysis of Twitter Trending Topics via Link-Anomaly Detection
 
Development of Twitter Application #2 - Twitter for Websites
Development of Twitter Application #2 - Twitter for WebsitesDevelopment of Twitter Application #2 - Twitter for Websites
Development of Twitter Application #2 - Twitter for Websites
 
How Twitter Usage Differs Between Humans and Company Accounts
How Twitter Usage Differs Between Humans and Company AccountsHow Twitter Usage Differs Between Humans and Company Accounts
How Twitter Usage Differs Between Humans and Company Accounts
 
Twitter: A Hands-On Learning Session for Researcher
Twitter: A Hands-On Learning Session for ResearcherTwitter: A Hands-On Learning Session for Researcher
Twitter: A Hands-On Learning Session for Researcher
 
IRJET - Implementation of Twitter Sentimental Analysis According to Hash Tag
 IRJET - Implementation of Twitter Sentimental Analysis According to Hash Tag IRJET - Implementation of Twitter Sentimental Analysis According to Hash Tag
IRJET - Implementation of Twitter Sentimental Analysis According to Hash Tag
 
Under the Hood with Social Metadata - SEMpdx Feb 2014
Under the Hood with Social Metadata - SEMpdx Feb 2014Under the Hood with Social Metadata - SEMpdx Feb 2014
Under the Hood with Social Metadata - SEMpdx Feb 2014
 
Social Media Brand Positioning Workflow- David Gerson
Social Media Brand Positioning Workflow- David GersonSocial Media Brand Positioning Workflow- David Gerson
Social Media Brand Positioning Workflow- David Gerson
 
Perceptual Mapping using Twitter Data
Perceptual Mapping using Twitter DataPerceptual Mapping using Twitter Data
Perceptual Mapping using Twitter Data
 
Vector Search for Data Scientists.pdf
Vector Search for Data Scientists.pdfVector Search for Data Scientists.pdf
Vector Search for Data Scientists.pdf
 
Using Twitter
Using TwitterUsing Twitter
Using Twitter
 
Hadoop presentation
Hadoop presentationHadoop presentation
Hadoop presentation
 
Adventure in Data: A tour of visualization projects at Twitter
Adventure in Data: A tour of visualization projects at TwitterAdventure in Data: A tour of visualization projects at Twitter
Adventure in Data: A tour of visualization projects at Twitter
 

Kürzlich hochgeladen

AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embeddingZilliz
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 

Kürzlich hochgeladen (20)

AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embedding
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 

Twinder: A Search Engine for Twitter Streams

  • 1. Twinder: A Search Engine for Twitter Streams #ICWE2012 Berlin, Germany July 25th, 2012 Ke Tao, Fabian Abel, Claudia Hauff, Geert-Jan Houben Web Information Systems, TU Delft 1
  • 2. Get information from Twitter How do people use Twitter as a source of information? •  Twitter is more like a news media. •  How do people search Twitter? •  Search on Twitter (Teevan et al.) Web Search Twitter Search Query length (chars) 18.80 12.00 Query length (words) 3.08 1.64 Is a celebrity name 3.11% 15.22% Twinder: A search engine for Twitter streams 2
  • 3. Research Questions What are the challenges we are facing? 1. Given a topic, can we identify the relevant tweets based on the characteristics of the tweets? 2. Are semantics meaningful for determining the tweets’ relevance for a topic? 3. How can we design an architecture that is scalable? Twinder: A search engine for Twitter streams 3
  • 4. Search on Twitter What you can do on current Twitter? • Twitter search interface • Ordered by time • Keyword-based match Twinder: A search engine for Twitter streams 4
  • 5. Twinder = TWeet +fINDER Our solution - Architecture $03.0& B%"&@' 3""?:#*<' &"1%5$1' ."#&*/'01"&'2-$"&3#*"' 784*%3.&& 93/.2:&& 4"5"6#-*"'(1+7#+,-' ;*+4*3& !"#$%&"'()$&#*+,-' !"#$%&"'()$&#*+,-' 1#(4254*03*04)63&& 1#(42503*04)63&& ;#1<'=&,<"&' .@-$#*+*#5'!"#$%&"1' C"@D,&?>:#1"?' 4"5"6#-*"' 3"#$%&"' A,-$")$%#5'!"#$%&"1' 2-?")' ")$&#*+,-' ."7#-+*1>:#1"?' !"#$%&!#'($)*+& $#1<1' ."7#-+*'!"#$%&"1' 4"5"6#-*"' ,*-./01.$21$.3& 7"11#E"1' .,*8#5'9":'.$&"#71' Twinder: A search engine for Twitter streams 5
  • 6. Core Components of Twinder Feature Extraction •  Receive Twitter messages from Social Web Streams •  Features of two categories: •  (1) Topic-sensitive features •  (2) Topic-insensitive features •  Different extracting strategies designed for different features Twinder: A search engine for Twitter streams 6
  • 7. Core Components of Twinder Feature Extraction Task Broker •  Twinder makes use of MapReduce and cloud computing infrastructures to allow for high scalability and frequent updates of its multifaceted index. •  Feature Extration Task Broker dispatches features extration tasks and indexing tasks to cloud computing infrastructure. Twinder: A search engine for Twitter streams 7
  • 8. Core Components of Twinder Relevance Estimation •  Accepting search queries from front-end, passing them to Feature Extration Component. •  Tweets are classified into the relevant and the non-relevant by Relevance Estimation component, and are further delivered to front-end for rendering. •  Twinder can learn the classification model, initially from training dataset, then from usage data. Twinder: A search engine for Twitter streams 8
  • 9. Efficiency of Indexing How good does Twinder make use of cloud-computing infrastructure? Corpus size Mainstream Server EMR(10 instances) 100k (13MBytes) 0.4 min 5 min 1m (122MBytes) 5 min 8 min 10m (1.3GBytes) 48 min 19 min 32m (3.9GBytes) 283 min 47 min Twinder: A search engine for Twitter streams 9
  • 10. Features of Microposts Hypothesis H1: The greater the keyword- based relevance score, the more relevant and interesting the tweet is to the topic. We already have keyword-based relevance, and…? Topic sensitive Topic insensitive Keyword-based relevance ? Twinder: A search engine for Twitter streams 10
  • 11. Semantic-based relevance Expand the queries to match more tweets dbp:United_States Reformulated query is expected to get a dbp:Hu_Jintao more accurate retrieval score. Hypothesis H2 : The greater the semantic- based relevance score, the more relevant and interesting the tweet is. Twinder: A search engine for Twitter streams 11
  • 12. Semantic-based relatedness Is there a semantic overlap between the query and the tweet? dbp:the_United_States dbp:Hu_Jintao Hypothesis H3 : If a tweet is considered to be semantically related to the query then it is also relevant and interesting for the user. Twinder: A search engine for Twitter streams 12
  • 13. Overview of features What do we have now? Topic sensitive Topic insensitive Keyword-based ? Semantic-based Twinder: A search engine for Twitter streams 13
  • 14. Syntactical feature : Hashtag Is a tweet more relevant if it contains a #hashtag? Hypothesis 4: tweets that contain hashtags are more likely to be relevant than tweets that do not contain hashtags. Twinder: A search engine for Twitter streams 14
  • 15. Syntactical feature : hasURL Is a tweet that contains a URL more relevant? Hypothesis 5: tweets that contain a URL are more likely to be relevant than tweets that do not contain a URL. Twinder: A search engine for Twitter streams 15
  • 16. Syntactical feature : isReply Is a tweet which is a reply to @somebody more relevant? Hypothesis 6: tweets that are formulated as a reply to another tweet are less likely to be relevant than other tweets. Twinder: A search engine for Twitter streams 16
  • 17. Syntactical feature : length Does the length of a tweet influence its relevance for a topic? Hypothesis 7: the longer a tweet, the more likely it is to be relevant and interesting. Twinder: A search engine for Twitter streams 17
  • 18. Overview of features Short summary Topic sensitive Topic insensitive Keyword-based Syntactical features Semantic-based ? Are there further features that allow for estimating the relevance? Twinder: A search engine for Twitter streams 18
  • 19. Semantic features Find semantics in a tweet to estimate the relevance dbp:Tim_Berners-Lee dbp:World_Wide_Web dbp:France dbp:Lyon dbp:International_World_Wide_Web_Conference Twinder: A search engine for Twitter streams 19
  • 20. Semantic features : #entity Is a tweet with more entities more interesting? • 5 entities extracted. Hypothesis 8: the more entities a tweet mentions, the more likely it is to be relevant and interesting. Twinder: A search engine for Twitter streams 20
  • 21. Semantic features : diversity How many types are there in the entities? • 4 types of entities Hypothesis 9: the greater the diversity of concepts mentioned in a tweet, the more likely it is to be interesting and relevant. Twinder: A search engine for Twitter streams 21
  • 22. Semantic features : sentiment Was the author of the tweet happy or not? • Sentiment : Neutral Hypothesis 10: the likelihood of a tweet’s relevance is influenced by its sentiment polarity. Twinder: A search engine for Twitter streams 22
  • 23. Overview of features By now, we have 4 types of features. Topic sensitive Topic insensitive Keyword-based Syntactical Semantic-based Semantics Can we utilize the contextual information of tweets? Twinder: A search engine for Twitter streams 23
  • 24. Contextual features Does the number of followers influence the relatedness? Hypothesis 11: The higher the number of followers a creator of a message has, the more likely it is that her tweets are relevant. Twinder: A search engine for Twitter streams 24
  • 25. Contextual features Or the number of followers that the author appears in? Hypothesis 12: The higher the number of lists in which the creator of a message appears, the more likely it is that her tweets are relevant. Twinder: A search engine for Twitter streams 25
  • 26. Contextual features How long has been the author on Twitter? Hypothesis 13: The older the Twitter account of a user, the more likely it is that her tweets are relevant. Signed up Post July 2008 June 2012 Twinder: A search engine for Twitter streams 26
  • 27. Summary of Features The features Topic sensitive Topic insensitive Keyword-based Syntactical Semantic-based Semantics Contextual Twinder: A search engine for Twitter streams 27
  • 28. Analysis • Research Questions: 1.  Which features are more influential on predicting the relatedness of a tweet to a certain topic? 2.  Which types of features are more important? Are semantics meaningful? 3.  What’s the performance that we can achieve by utilizing these features? • Twinder Setup •  Consider the search problem as a classification task •  Classification algorithm = Logistic Regression Twinder: A search engine for Twitter streams 28
  • 29. Dataset From TREC 2011 Microblog Track •  Twitter corpus •  16 million tweets (Jan. 24th, 2011 – Feb. 8th) •  4,766,901 tweets classified as English •  6.2 million entity-extractions (140k distinct entities) •  Relevance judgments •  49 topics •  40,855 (topic, tweet) pairs •  60.31 relevant tweets per topic (on average) Twinder: A search engine for Twitter streams 29
  • 30. Results Which type of features matters? Features Precision Recall F-measure keyword relevance 0.3036 0.2851 0.2940 semantic relevance 0.3050 0.3294 0.3167 topic-sensitive 0.3135 0.3252 0.3192 topic-insensitive 0.1956 0.0064 0.0123 without semantics 0.3363 0.4618 0.3965 without sentiment 0.3701 0.3923 0.4048 without context 0.3827 0.4714 0.4225 all features 0.3674 0.4736 0.4138 Overall, we can achieve the precision and recall of over 35% and 45% respectively by applying all the features. Twinder: A search engine for Twitter streams 30
  • 31. Weights of features Syntactical Which feature matters? 2 1 Keyword-based 0 hasHashtag hasURL isReply length 2 -1 1 0 Semantics -1 Keyword-based relevance 2 1 Semantic-based 0 2 #entities diversity sentiment -1 1 Contextual 0 2 -1 Relevance Relatedness 1 0 #followers #lists Age -1 Twinder: A search engine for Twitter streams 31
  • 32. Topics of different categories The impact on the performance and models •  49 topics categorized into 2 parts w.r.t. 3 dimensions: •  Popularity •  Gobal vs. Local •  Temporal persistence •  Popularity •  Higher recall for popular topics •  Less impact from sentiment features on unpopular topics •  Temporal persistence •  Higher performance on shorter-term topics •  Less impact from sentiment features on persistent topic Twinder: A search engine for Twitter streams 32
  • 33. Conclusions What are our contributions? 1.  Twinder search engine proposed: analyzing various features to determine the relevance and interestingness of Twitter messages for a given topic. 2.  Scalability demonstrated for the Twinder search engine. 3.  Extensive analysis on 13 features along two-dimensions: topic-sensitive features and topic-insensitive features. Twinder: A search engine for Twitter streams 33
  • 34. Conclusions The lessons learned 1.  The learned models which take advantage of semantics and topic-sensitive features outperform those which do not take the semantics and topic-sensitive features into account. 2.  Contextual features that characterize the users who are posting the messages have little impact on the relevance estimation. 3.  The importance of a feature differs depending on the topic characteristics; for example, the sentiment-based features are more important for popular than for unpopular topics. Twinder: A search engine for Twitter streams 34
  • 35. QUESTIONS? July 25th, 2012 k.tao@tudelft.nl http://ktao.nl/ THANK YOU! Twinder: A search engine for Twitter streams 35