Searching for Quality Microblog Posts:
Filtering and Ranking based on Content
Analysis and Implicit Links


Jan Vosecky, Kenneth Wai-Ting Leung, and Wilfred Ng
Department of Computer Science and Engineering
HKUST
Hong Kong

DASFAA’12
Introduction   Method      Features    Experiments   Conclusions


      Agenda


         Introduction
         Proposed method
         Quality features of tweets
         Experiments
         Conclusions
      Introduction
      Microblogs


         [Figure: two example tweets annotated with user, mentioned user, timestamp, hashtag, and URL link]


         Both social network and social media
           Links between users (follow, mention, re-tweet)
           Users post updates (tweets)
      Searching for “ipad” on Twitter


         Around 50 tweets mentioning “iPad” posted within a 1-minute period
      Research challenge


         Twitter: user-generated content
           Short messages, often comments or opinions
           High volume
           Varying quality
                  “Most tweets are not of general interest (57%)” (Alonso et al.’10)
              Information overload
         Research questions:
           How to distinguish content worth reading from
            useless or less important messages?
           How to promote ‘high quality’ content?
      Defining ‘quality’


         General (global) definition for assessing tweet
          quality
         3 criteria:
              Well-formedness
               + Well-written, grammatically correct, understandable
               - Heavy slang, misspellings, excessive punctuation
              Factuality
               + News, events, announcements
               - Unclear message, private conversations, generic personal
                 feelings
              Navigational quality (URL links)
               + Reputable external resources (e.g. news articles)
      Quality-based tweet filtering


         [Figure: five example tweets, each marked + or -, in the order +, -, -, +, -]
      Quality-based tweet ranking


         [Figure: five example tweets with quality scores 5, 4, 3, 1, 1]
      Research goals


         Quality-based tweet filtering
           Filtering out low-quality tweets
                In Twitter feeds
                In search results

         Quality-based tweet ranking
           Re-ranking Twitter search results
                For a given time period
      Proposed Method
      Representation of tweets


         Vector-space model: not sufficient
           Short tweet length, terms often malformed
           Ignores special features in Twitter

         Feature-vector representation
           Extract features from tweet
           Traditional features: e.g. length, spelling

           Twitter-specific features:
                Exploiting hashtags, URL links, mentioned usernames
      Quality Features of Tweets
      Feature categories


         1. Punctuation and spelling
              Number of exclamation marks
              Number of question marks
              Max. no. of repeated letters
              % of correctly spelled words
              No. of capitalized words
              Max. no. of consecutive capitalized words

         2. Syntactic and semantic complexity
              Max. & avg. word length
              Length of tweet
              Percentage of stopwords
              Contains numbers
              Contains a measure
              Contains emoticons
              Uniqueness score

         3. Grammaticality
              Has first-person part-of-speech
              Formality score
              Number of proper names
              Max. no. of consecutive proper names
              Number of named entities

         4. Link-based
              Contains link
              Is reply-tweet
              Is re-tweet
              No. of mentions of users
              Number of hashtags
              URL domain reputation score
              RT source reputation score
              Hashtag reputation score

         5. Timestamp
      1. Punctuation and spelling


         Excessive punctuation
              Number of exclamation marks
              Number of question marks
              Max. number of consecutive dots
         Capitalization
              Presence of all-capitalized words
              Largest number of consecutive words in capital letters
         Spellchecking
              Number of correctly spelled words
              Percentage of words found in a dictionary
                    RT @_ChocolateCoco: WHO IS CHUCK NORRIS??!!!??
                    lls. He's only the greatest guy next to jesus lmao
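
A minimal sketch of how the punctuation and capitalization counts above might be extracted. The function name, the word pattern, and the rule that an "all-capitalized" word needs at least two letters are assumptions made for illustration, not the authors' implementation. Spellchecking is left out because it needs a dictionary resource.

```python
import re


def punctuation_features(tweet: str) -> dict:
    """Count-based punctuation and capitalization features for one tweet."""
    words = re.findall(r"[A-Za-z']+", tweet)
    # Assumption: a word is "all-capitalized" if it has 2+ characters, all upper case.
    caps = [len(w) > 1 and w.isupper() for w in words]

    # Longest run of consecutive all-capitalized words.
    longest = run = 0
    for is_cap in caps:
        run = run + 1 if is_cap else 0
        longest = max(longest, run)

    dot_runs = re.findall(r"\.+", tweet)
    return {
        "exclamation_marks": tweet.count("!"),
        "question_marks": tweet.count("?"),
        "max_consecutive_dots": max((len(r) for r in dot_runs), default=0),
        "has_all_caps_word": any(caps),
        "max_consecutive_caps_words": longest,
    }


example = ("RT @_ChocolateCoco: WHO IS CHUCK NORRIS??!!!?? "
           "lls. He's only the greatest guy next to jesus lmao")
features = punctuation_features(example)
```

Applied to the example tweet from the slide, the run "WHO IS CHUCK NORRIS" is what drives the consecutive-capitals feature.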
      2. Syntactic and semantic complexity
         Syntactic complexity
              Tweet length
              Max. & avg. word length
              Percentage of stopwords
              Presence of emoticons and other sentiment indicators
              Presence of measure symbols ($, %)
              Numbers – number of digits
         Tweet uniqueness
              Uniqueness of the tweet relative to other tweets by the author
              [Formula not preserved in this transcript]
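
The syntactic complexity features listed above could be computed roughly as follows. This is only a sketch: the tiny stopword list, the word pattern, and the feature names are assumptions, and emoticon detection and the uniqueness score are omitted because the slide does not define them in text.

```python
import re

# Tiny illustrative stopword list (assumption); a real system would use a fuller one.
STOPWORDS = {"a", "an", "the", "is", "to", "of", "and", "in", "on", "for"}


def syntactic_features(tweet: str) -> dict:
    """Length, word-length, stopword, measure-symbol and digit features."""
    words = re.findall(r"[A-Za-z']+", tweet)
    lengths = [len(w) for w in words]
    n_stop = sum(1 for w in words if w.lower() in STOPWORDS)
    return {
        "tweet_length": len(tweet),
        "max_word_length": max(lengths, default=0),
        "avg_word_length": sum(lengths) / len(lengths) if lengths else 0.0,
        "pct_stopwords": 100.0 * n_stop / len(words) if words else 0.0,
        "has_measure_symbol": any(c in "$%" for c in tweet),
        "num_digits": sum(c.isdigit() for c in tweet),
    }


features = syntactic_features("The price of the iPad is $499")
```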
      3. Grammaticality


         Parts-of-speech labelling
              Presence of first person parts-of-speech
              Formality score [Heylighen’02]
                  F = (noun freq. + adjective freq. + preposition freq. + article freq.
                       − pronoun freq. − verb freq. − adverb freq. − interjection freq. + 100) / 2
         Names
              Number of ‘proper names’, i.e. words with a single initial capital letter
              Number of consecutive ‘proper names’
              Number of named entities
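
The formality score can be written as a small function. The sketch below assumes the part-of-speech frequencies have already been computed by some tagger and are given as percentages of all words in the tweet; the dictionary keys are illustrative names.

```python
def formality_score(freq: dict) -> float:
    """F-score from POS frequencies (percentages of all words), per the formula above."""
    formal = ("noun", "adjective", "preposition", "article")
    contextual = ("pronoun", "verb", "adverb", "interjection")
    total = sum(freq.get(k, 0.0) for k in formal)
    total -= sum(freq.get(k, 0.0) for k in contextual)
    return (total + 100.0) / 2.0


# Hypothetical frequencies for one tweet (they sum to 100%).
score = formality_score({
    "noun": 40.0, "adjective": 10.0, "preposition": 10.0, "article": 10.0,
    "pronoun": 5.0, "verb": 15.0, "adverb": 5.0, "interjection": 5.0,
})
```

With these made-up frequencies the formal categories total 70 and the others 30, so F = (70 − 30 + 100) / 2 = 70. A text with no tagged words at all would score the neutral value 50.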




           F. Heylighen and J.-M. Dewaele. Variation in the contextuality of language: An empirical measure.
           Context in Context. Special issue Foundations of Science, 7(3):293–340, 2002.
      4. Link-based features


         Links to other items
           Re-tweet (RT), reply tweet, mention of other users
           Presence of a URL link
           Number of hashtags as indicated by the “#” sign

         Link target’s quality reputation
           Metrics to reflect the quality of tweets which relate to:
                a URL domain
                a hashtag
                a user
      URL domain reputation


         Observation:
               Tweets which link to news articles are usually of better quality than
                tweets which link to photo sharing websites

               [Figure: Tweets 1-3 (Q=1, Q=3, Q=2) link to Tweetpic.com; Tweets 4-6 (Q=5, Q=4, Q=5) link to NYtimes.com]

         Questions:
               What does the quality of tweets linking to a website say about its
                quality?
               Can we predict quality of future tweets linking to that website?
      URL domain reputation


         Step 1: URL translation
           Short link to original link
                 bit.ly/e2jt9F → http://www.reuters.com/4151120


         Step 2: Summarize tweets linking to a URL domain
              Accumulate “quality reputation” over time
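
The two steps can be sketched as follows. Resolving a short link normally requires a network request, so this example substitutes a hypothetical lookup table (`SHORT_LINKS`); the helper names and the choice to strip a leading "www." are likewise assumptions.

```python
from collections import defaultdict
from urllib.parse import urlparse

# Hypothetical lookup table standing in for a real short-link resolver.
SHORT_LINKS = {"bit.ly/e2jt9F": "http://www.reuters.com/4151120"}


def expand(url: str) -> str:
    """Step 1: translate a short link to the original link, if known."""
    return SHORT_LINKS.get(url, url)


def domain_of(url: str) -> str:
    """Extract the host name, dropping a leading 'www.'."""
    if "://" not in url:
        url = "http://" + url
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host


def accumulate(labelled_links):
    """Step 2: collect the quality labels of all tweets linking to each domain."""
    by_domain = defaultdict(list)
    for url, quality in labelled_links:
        by_domain[domain_of(expand(url))].append(quality)
    return dict(by_domain)


reputation_data = accumulate([
    ("bit.ly/e2jt9F", 5),
    ("http://www.reuters.com/another-article", 4),
    ("http://twitpic.com/abc", 1),
])
```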
      URL domain reputation


         Average URL domain quality

              AvgQ(d) = (1 / |Td|) · Σ_{t ∈ Td} qt

              Td = set of tweets linking to domain d
              qt = quality label of tweet t


              Weakness:
                  Does not reflect the number of inlink tweets in the score
                  Favours domains with few inlink tweets
      URL domain reputation


         Domain reputation score

              DRS(d) = AvgQ(d) · log10(|Td|)

                   where AvgQ(d) is between [-1, +1]

              “Collecting evidence” behaviour:
                  The score rises as more good-quality inlink tweets accumulate

              [Figure: DRS (from -4.00 to 4.00) plotted against |Td| (1 to 1000, log scale), one curve per AvgQ value: -1, -0.5, 0, 0.5, 1]
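
A small sketch of the "collecting evidence" behaviour. It assumes the score is the average quality multiplied by the base-10 logarithm of the number of inlink tweets; that form is an inference that matches the values in the domain table on the following slide (e.g. reuters.com: AvgQ 1.00, 6 inlinks, score 0.78), not a formula quoted from the authors. Quality labels are assumed to be already scaled to [-1, +1].

```python
import math


def avg_quality(labels):
    """Average quality of the tweets linking to a domain (labels in [-1, +1])."""
    return sum(labels) / len(labels)


def domain_reputation_score(labels):
    """Assumed form: AvgQ(d) * log10(|Td|)."""
    return avg_quality(labels) * math.log10(len(labels))


one_good_tweet = [1.0]          # a domain seen once
many_good_tweets = [1.0] * 100  # a domain with plenty of evidence

# The plain average cannot tell these two domains apart ...
same_average = avg_quality(one_good_tweet) == avg_quality(many_good_tweets)
# ... while the reputation score rewards the accumulated evidence.
drs_one = domain_reputation_score(one_good_tweet)
drs_many = domain_reputation_score(many_good_tweets)
```

Under this assumption a domain with a single inlink tweet scores zero regardless of that tweet's label, which addresses the weakness of the plain average noted on the previous slide.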
      URL domain reputation


     10 domains with a high DRS (mainly news-oriented sites):
     Domain            AvgQ   Inlinks    DRS
     gallup.com        0.96        99   1.92
     mashable.com      0.79        97   1.58
     hrw.org           0.86        57   1.51
     foxnews.com       0.68        38   1.08
     good.is           0.68        31   1.01
     intuit.com        0.57        60   1.01
     forbes.com        0.68        19   0.87
     reuters.com       1.00         6   0.78
     cnn.com           0.36        85   0.70

     10 domains with a low DRS (mainly image and location sharing sites):
     Domain            AvgQ   Inlinks    DRS
     tweetphoto.com   -0.77       106  -1.57
     twitpic.com      -0.75       113  -1.54
     twitlonger.com   -0.85        66  -1.54
     myloc.me         -0.85        54  -1.48
     instagr.am       -0.62        52  -1.06
     formspring.me    -0.78        18  -0.98
     yfrog.com        -0.55        53  -0.94
     lockerz.com      -0.63        16  -0.75
     qik.com          -0.75         8  -0.68
      Reputation of hashtag & user


         [Figure: Tweets 1-3 (Q=1, Q=3, Q=2) use #justforfun; Tweets 4-6 (Q=5, Q=4, Q=5) use #DASFAA]

         Hashtag reputation: #DASFAA vs. #justforfun
         Re-tweet source user reputation: @barackobama vs. @wysz22212
      Experiments
      Dataset


         10,000 tweets
           100 users, 100 recent tweets per user
         Users:
           50 random users
           50 influential users
                Selected from listorious.com
                5 categories: technology, business, politics,
                 celebrities, activism
                10 users per category
      Labelling


         Crowdsourcing
              Amazon Mechanical Turk
         3 labels per tweet from different reviewers
         Possible labels: 1 to 5
              1 = low quality, 5 = high quality
         Random order of tweets
      Labelling results


         Tweet quality distribution
         [Figure: distribution of tweets by quality score]
      Feature analysis


         Total 29 features
         Top 5 features based on Information Gain:

                   0.374   Domain reputation
                   0.287   Contains link
                   0.130   Formality score
                   0.127   Num. proper names
                   0.113   Max. proper names
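
For reference, information gain for a discrete feature can be computed as the drop in label entropy after splitting on the feature. The sketch below is a generic textbook version with illustrative names; continuous features such as the domain reputation score would first need to be discretised, and how the authors did that is not stated on the slide.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """Entropy of the labels minus the entropy remaining after splitting on the feature."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remaining = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remaining


labels = ["high", "high", "low", "low"]
gain_informative = information_gain([1, 1, 0, 0], labels)    # feature separates the classes
gain_uninformative = information_gain([1, 0, 1, 0], labels)  # feature tells us nothing
```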
      Feature selection


         Greedy attribute selection
           15 selected features:

               Domain reputation                RT source reputation
               Formality                        Tweet uniqueness
               No. named entities               % correctly spelled words
               Max. no. repeated letters        No. hashtags
               Contains numbers                 No. capitalized words
               Is reply-tweet                   Is re-tweet
               Avg. word length                 Contains first-person
               No. exclamation marks
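
One common form of greedy attribute selection is forward selection: start from an empty set and repeatedly add the feature that improves an evaluation score the most, stopping when nothing helps. The slide does not say which variant was used, so the sketch below is only an illustration; `score` is a hypothetical callback (in practice it might be cross-validated classifier accuracy), and the toy weights are invented.

```python
def greedy_forward_selection(features, score):
    """Add one feature at a time while the score keeps improving."""
    selected, best = [], float("-inf")
    remaining = list(features)
    while remaining:
        candidate_score, candidate = max(
            (score(selected + [f]), f) for f in remaining
        )
        if candidate_score <= best:
            break  # no remaining feature improves the score
        selected.append(candidate)
        remaining.remove(candidate)
        best = candidate_score
    return selected, best


# Toy scoring function with invented per-feature contributions (assumption).
toy_weights = {"domain_reputation": 0.5, "formality": 0.2, "timestamp": -0.1}


def toy_score(subset):
    return sum(toy_weights[f] for f in subset)


chosen, chosen_score = greedy_forward_selection(toy_weights, toy_score)
```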
      Classification and Ranking Method
         Classification:
           SVM, binary classification (high-quality, low-quality)
           50/50 split for training/testing

         Ranking:
           Learning-to-rank (Rank SVM)
           30 queries from 5 topic categories

           Process:
               1.   Retrieve tweets matching a query
               2.   Extract features from the tweets
               3.   ‘Query-tweet vector’ pairs + quality scores of the
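
Rank SVM is usually trained on pairs of items from the same query: each pair with different quality scores becomes one example whose features are the difference of the two vectors and whose label says which item should rank higher. The sketch below shows only that standard pairwise transformation, with illustrative data structures; it is not taken from the slides and omits the SVM training itself.

```python
from itertools import combinations


def pairwise_examples(query_results):
    """query_results maps a query to a list of (feature_vector, quality_score)."""
    examples = []
    for tweets in query_results.values():
        # Pairs are formed only among tweets retrieved for the same query.
        for (xa, qa), (xb, qb) in combinations(tweets, 2):
            if qa == qb:
                continue  # ties carry no ordering information
            diff = [a - b for a, b in zip(xa, xb)]
            examples.append((diff, 1 if qa > qb else -1))
    return examples


# Hypothetical two-feature vectors for three tweets matching one query.
pairs = pairwise_examples({
    "ipad": [([1.0, 0.0], 5), ([0.0, 1.0], 1), ([0.5, 0.5], 5)],
})
```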
      Classification results


                                                      High-Quality     Low-Quality      Overall
      Features                        #attributes     P       R        P       R        AUC
      Link only                       1               0.798   0.702    0.894   0.934    0.818
      TF-IDF                          3322            0.862   0.665    0.885   0.96     0.813
      Subset.Reputation               3               0.812   0.746    0.909   0.936    0.841
      Subset.SVM (“greedy”)           15              0.715   0.758    0.912   0.936    0.847
      All quality features            29              0.815   0.66     0.882   0.944    0.802
      All quality features + TF-IDF   3351            0.739   0.775    0.915   0.899    0.837


      Optimal feature set (15 attrs.) outperforms TF-IDF (3322 attrs.)
      Link-based “reputation” features (3 attrs.) achieve the 2nd best result
      Combining quality features + TF-IDF does not improve result
      Classification results


      Features                        #attributes     AUC
      Link only                       1               0.818
      TF-IDF                          3322            0.813
      Subset.Reputation               3               0.841
      Subset.SVM (“greedy”)           15              0.847
      All quality features            29              0.802
      All quality features + TF-IDF   3351            0.837

      Optimal feature set achieves reduced training time and storage cost

      [Figures: storage cost and training time per feature set]
      Ranking results


      [Formula not preserved in this transcript]

                                                   NDCG@N
      Features                 #attributes   1       2       5       10      MAP
      Link only                1             0.067   0.111   0.22    0.324   0.398
      Subset.Reputation        3             0.822   0.777   0.777   0.764   0.661
      Subset.SVM (“greedy”)    15            0.867   0.767   0.778   0.769   0.653
      All quality features     29            0.733   0.733   0.763   0.753   0.637



      Optimal feature set (15 attrs.) achieves the best result
      Link-based “reputation” features (3 attrs.) achieve the 2nd best result
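
For readers unfamiliar with the NDCG@N columns, the sketch below computes one widely used formulation (gain 2^rel − 1, logarithmic position discount). Whether the authors used exactly this variant is not recoverable from the transcript, so treat it as an illustration of the metric rather than a reproduction of their evaluation.

```python
import math


def dcg_at_n(relevances, n):
    """Discounted cumulative gain of the first n results."""
    return sum(
        (2 ** rel - 1) / math.log2(position + 2)
        for position, rel in enumerate(relevances[:n])
    )


def ndcg_at_n(relevances, n):
    """DCG normalised by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_n(sorted(relevances, reverse=True), n)
    return dcg_at_n(relevances, n) / ideal if ideal > 0 else 0.0


perfect = ndcg_at_n([5, 4, 3, 1, 1], 5)  # already in ideal order
worst_first = ndcg_at_n([1, 5], 1)       # low-quality tweet ranked first
```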
      Conclusions
      Summary


         Method for quality-based classification and
          ranking of tweets
         Proposed and evaluated a set of tweet
          features to capture tweet quality
         Link-based features lead to the best
          performance
      Future work


         Consider different types of queries in Twitter
           E.g. searching for hot topics, movie reviews,
            facts, opinions, etc.
           Different features may be important in different
            scenarios
         Incorporating recent hot topics
         Personalized re-ranking
      Q/A
      Thank You
      Related work


        Spam detection
              Bag-of-words, keyword-based
              Feature-based approaches
              Combinations

        Social networks
            Finding quality answers in Q-A systems
              E.g. Yahoo Answers
              Feature-based

        Web search
            Quality-based ranking of web documents
                Feature-based quality score (WSDM’11)
      ROC Curve




        Area under the ROC curve: probability that a classifier
         will rank a randomly chosen positive instance higher
         than a randomly chosen negative one
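
The probabilistic reading of AUC above can be checked directly by comparing every positive instance with every negative one. This brute-force sketch (function name and toy scores are illustrative) counts a tie as half a win, which is the usual convention.

```python
def auc(scores, labels):
    """Fraction of (positive, negative) pairs in which the positive scores higher."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Toy classifier scores: 1 = high-quality tweet, 0 = low-quality tweet.
value = auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
```

Here three of the four positive-negative pairs are ordered correctly, so the AUC is 0.75.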

Computer 10: Lesson 10 - Online Crimes and Hazards
 
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdf
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdf
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdf
 
Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024
 
Introduction to Matsuo Laboratory (ENG).pptx
Introduction to Matsuo Laboratory (ENG).pptxIntroduction to Matsuo Laboratory (ENG).pptx
Introduction to Matsuo Laboratory (ENG).pptx
 
UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6
 
IEEE Computer Society’s Strategic Activities and Products including SWEBOK Guide
IEEE Computer Society’s Strategic Activities and Products including SWEBOK GuideIEEE Computer Society’s Strategic Activities and Products including SWEBOK Guide
IEEE Computer Society’s Strategic Activities and Products including SWEBOK Guide
 
VoIP Service and Marketing using Odoo and Asterisk PBX
VoIP Service and Marketing using Odoo and Asterisk PBXVoIP Service and Marketing using Odoo and Asterisk PBX
VoIP Service and Marketing using Odoo and Asterisk PBX
 
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDEADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
 
AI Fame Rush Review – Virtual Influencer Creation In Just Minutes
AI Fame Rush Review – Virtual Influencer Creation In Just MinutesAI Fame Rush Review – Virtual Influencer Creation In Just Minutes
AI Fame Rush Review – Virtual Influencer Creation In Just Minutes
 
Machine Learning Model Validation (Aijun Zhang 2024).pdf
Machine Learning Model Validation (Aijun Zhang 2024).pdfMachine Learning Model Validation (Aijun Zhang 2024).pdf
Machine Learning Model Validation (Aijun Zhang 2024).pdf
 
The Kubernetes Gateway API and its role in Cloud Native API Management
The Kubernetes Gateway API and its role in Cloud Native API ManagementThe Kubernetes Gateway API and its role in Cloud Native API Management
The Kubernetes Gateway API and its role in Cloud Native API Management
 
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve DecarbonizationUsing IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
 
Nanopower In Semiconductor Industry.pdf
Nanopower  In Semiconductor Industry.pdfNanopower  In Semiconductor Industry.pdf
Nanopower In Semiconductor Industry.pdf
 
UiPath Clipboard AI: "A TIME Magazine Best Invention of 2023 Unveiled"
UiPath Clipboard AI: "A TIME Magazine Best Invention of 2023 Unveiled"UiPath Clipboard AI: "A TIME Magazine Best Invention of 2023 Unveiled"
UiPath Clipboard AI: "A TIME Magazine Best Invention of 2023 Unveiled"
 
Secure your environment with UiPath and CyberArk technologies - Session 1
Secure your environment with UiPath and CyberArk technologies - Session 1Secure your environment with UiPath and CyberArk technologies - Session 1
Secure your environment with UiPath and CyberArk technologies - Session 1
 
UiPath Community: AI for UiPath Automation Developers
UiPath Community: AI for UiPath Automation DevelopersUiPath Community: AI for UiPath Automation Developers
UiPath Community: AI for UiPath Automation Developers
 
UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7
 
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCostKubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
 
Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024
 
COMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a WebsiteCOMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a Website
 

Searching for Quality Microblog Posts: Filtering and Ranking based on Content Analysis and Implicit Links

  • 1. Searching for Quality Microblog Posts: Filtering and Ranking based on Content Analysis and Implicit Links
      Jan Vosecky, Kenneth Wai-Ting Leung, and Wilfred Ng
      Department of Computer Science and Engineering, HKUST, Hong Kong
      DASFAA'12
  • 2. Agenda
      • Introduction
      • Proposed method
      • Quality features of tweets
      • Experiments
      • Conclusions
  • 3. Introduction
  • 4. Microblogs
      [Figure: two example tweets annotated with user, mentioned user, timestamp, hashtag, and URL link]
      • Both social network and social media
          • Links between users (follow, mention, re-tweet)
          • Users post updates (tweets)
  • 5. Searching for "ipad" on Twitter
      • Around 50 tweets mentioning "iPad" posted within a 1-minute period
  • 6. Research challenge
      • Twitter: user-generated content
          • Short messages, often comments or opinions
          • High volume
          • Varying quality
              • "Most tweets are not of general interest (57%)" (Alonso et al.'10)
          • Information overload
      • Research questions:
          • How to distinguish content worth reading from useless or less important messages?
          • How to promote 'high quality' content?
  • 7. Defining 'quality'
      • General (global) definition for assessing tweet quality
      • 3 criteria:
          • Well-formedness
              + Well-written, grammatically correct, understandable
              - Heavy slang, misspellings, excessive punctuation
          • Factuality
              + News, events, announcements
              - Unclear message, private conversations, generic personal feelings
          • Navigational quality (URL links)
              + Reputable external resources (e.g. news articles)
  • 8. Quality-based tweet filtering
      [Figure: five example tweets marked +, -, -, +, -]
  • 9. Quality-based tweet ranking
      [Figure: five example tweets with quality scores 5, 4, 3, 1, 1]
  • 10. Research goals
      • Quality-based tweet filtering
          • Filtering out low-quality tweets
              • In Twitter feeds
              • In search results
      • Quality-based tweet ranking
          • Re-ranking Twitter search results
              • For a given time period
  • 11. Proposed Method
  • 12. Representation of tweets
      • Vector-space model: not sufficient
          • Short tweet length, terms often malformed
          • Ignores special features in Twitter
      • Feature-vector representation
          • Extract features from tweet
          • Traditional features: e.g. length, spelling
          • Twitter-specific features:
              • Exploiting hashtags, URL links, mentioned usernames
  • 13. Quality Features of Tweets
  • 14. Feature categories
      • 1. Punctuation and Spelling
          • Number of exclamation marks
          • Number of question marks
          • Max. no. of repeated letters
          • % of correctly spelled words
          • No. of capitalized words
          • Max. no. of consecutive capitalized words
      • 2. Syntactic and semantic complexity
          • Max. & Avg. word length
          • Length of tweet
          • Percentage of stopwords
          • Contains numbers
          • Contains a measure
          • Contains emoticons
          • Uniqueness score
      • 3. Grammaticality
          • Has first-person part-of-speech
          • Formality score
          • Number of proper names
          • Max. no. of consecutive proper names
          • Number of named entities
      • 4. Link-based
          • Contains link
          • Is reply-tweet
          • Is re-tweet
          • No. of mentions of users
          • Number of hashtags
          • URL domain reputation score
          • RT source reputation score
          • Hashtag reputation score
      • 5. Timestamp
  • 15. Punctuation and spelling (feature category 1)
      • Excessive punctuation
          • Number of exclamation marks
          • Number of question marks
          • Max. number of consecutive dots
      • Capitalization
          • Presence of all-capitalized words
          • Largest number of consecutive words in capital letters
      • Spellchecking
          • Number of correctly spelled words
          • Percentage of words found in a dictionary
      • Example tweet: RT @_ChocolateCoco: WHO IS CHUCK NORRIS??!!!?? lls. He's only the greatest guy next to jesus lmao
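The punctuation, capitalization, and spellchecking features listed on this slide can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name, the tokenization regex, the rule that an all-capitalized word has at least two letters, and the `dictionary` word set are all assumptions made for the example.

```python
import re


def punctuation_spelling_features(tweet, dictionary):
    """Compute slide-15 style features for one tweet.

    `dictionary` is a set of lowercase words treated as correctly spelled
    (a stand-in for a real spellchecker, which the slides do not name).
    """
    # Assumed tokenization: runs of letters and apostrophes count as words.
    words = re.findall(r"[A-Za-z']+", tweet)

    # Assumption: an all-capitalized word has at least two letters.
    is_caps = [w.isupper() and len(w) > 1 for w in words]

    # Longest run of consecutive all-capitalized words.
    longest_run = run = 0
    for caps in is_caps:
        run = run + 1 if caps else 0
        longest_run = max(longest_run, run)

    # Longest run of consecutive dots, e.g. "..." gives 3.
    max_dots = max((len(m) for m in re.findall(r"\.+", tweet)), default=0)

    known = sum(1 for w in words if w.lower() in dictionary)

    return {
        "exclamation_marks": tweet.count("!"),
        "question_marks": tweet.count("?"),
        "max_consecutive_dots": max_dots,
        "has_all_caps_word": any(is_caps),
        "max_consecutive_caps_words": longest_run,
        "correctly_spelled": known,
        "pct_in_dictionary": known / len(words) if words else 0.0,
    }
```

In practice the dictionary lookup would be replaced by whichever spellchecker is available; the feature definitions stay the same.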
  • 16. Syntactic and semantic complexity (feature category 2)
      • Syntactic complexity
          • Tweet length
          • Max. & avg. word length
          • Percentage of stopwords
          • Presence of emoticons and other sentiment indicators
          • Presence of measure symbols ($, %)
          • Numbers: number of digits
      • Tweet uniqueness
          • Uniqueness of the tweet relative to other tweets by the author
          [Equation: tweet uniqueness score]
  • 17. Grammaticality (feature category 3)
      • Parts-of-speech labelling
          • Presence of first-person parts-of-speech
          • Formality score [Heylighen'02]
              • F = (noun freq. + adjective freq. + preposition freq. + article freq. − pronoun freq. − verb freq. − adverb freq. − interjection freq. + 100) / 2
      • Names
          • Number of 'proper names', taken as words with a single initial capital letter
          • Number of consecutive 'proper names'
          • Number of named entities
      • Reference: F. Heylighen and J.-M. Dewaele. Variation in the contextuality of language: An empirical measure. Context in Context, special issue of Foundations of Science, 7(3):293–340, 2002.
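The formality score F can be computed directly from part-of-speech counts. The sketch below assumes the tweet has already been tagged and that each frequency is the percentage of all tagged words falling in that category; the tag names and the `formality_score` function are illustrative choices, not part of the slides.

```python
# Categories that raise formality and categories that lower it,
# following the formula F on the slide. Tag names are hypothetical.
POSITIVE_TAGS = ("noun", "adjective", "preposition", "article")
NEGATIVE_TAGS = ("pronoun", "verb", "adverb", "interjection")


def formality_score(tag_counts):
    """Return F in [0, 100] from a mapping of POS tag -> word count.

    Assumption: frequencies are percentages of all tagged words, so any
    tag outside the eight categories (e.g. "other") only affects the total.
    """
    total = sum(tag_counts.values())
    if total == 0:
        raise ValueError("no tagged words")

    def freq(tag):
        return 100.0 * tag_counts.get(tag, 0) / total

    balance = sum(freq(t) for t in POSITIVE_TAGS) - sum(freq(t) for t in NEGATIVE_TAGS)
    return (balance + 100.0) / 2.0
```

For example, ten words tagged as 4 nouns, 2 verbs, and 1 each of pronoun, adjective, article, and preposition give (70 − 30 + 100) / 2 = 70.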
  • 18. Link-based features (feature category 4)
      • Links to other items
          • Re-tweet (RT), reply tweet, mention of other users
          • Presence of a URL link
          • Number of hashtags, as indicated by the "#" sign
      • Link target's quality reputation
          • Metrics to reflect the quality of tweets which relate to:
              • a URL domain
              • a hashtag
              • a user
  • 19. URL domain reputation
      • Observation:
          • Tweets which link to news articles are usually of better quality than tweets which link to photo-sharing websites
      [Figure: tweets 1 to 6 with quality labels Q (1 to 5), each linking to either Tweetpic.com or NYtimes.com]
      • Questions:
          • What does the quality of tweets linking to a website say about its quality?
          • Can we predict the quality of future tweets linking to that website?
  • 20. URL domain reputation
      • Step 1: URL translation
          • Short link to original link: bit.ly/e2jt9F → http://www.reuters.com/4151120
      • Step 2: Summarize tweets linking to a URL domain
          • Accumulate "quality reputation" over time
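The two steps above can be sketched as below. The slides do not say how short links were resolved, so this example assumes the expansion has already been done and is available as a lookup table (`expansions`, a hypothetical mapping from short link to original URL); `domain_of` and `accumulate_by_domain` are likewise illustrative names.

```python
from collections import defaultdict
from urllib.parse import urlparse


def domain_of(url):
    """Return the host of an absolute URL without a leading 'www.'."""
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def accumulate_by_domain(tweets, expansions):
    """Group tweet quality labels by the domain each tweet links to.

    tweets:     iterable of (link, quality_label) pairs
    expansions: dict mapping short links to original URLs (assumed to be
                built beforehand, e.g. by following HTTP redirects)
    """
    labels_by_domain = defaultdict(list)
    for link, quality in tweets:
        full_url = expansions.get(link, link)  # Step 1: URL translation
        domain = domain_of(full_url)
        if domain:  # skip links that could not be resolved to a host
            labels_by_domain[domain].append(quality)  # Step 2: accumulate
    return dict(labels_by_domain)
```

Keeping the raw labels per domain, rather than a running average, leaves the choice of reputation score open.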
  • 21. URL domain reputation
      • Average URL domain quality: AvgQ(d) = (1 / |Td|) · Σ_{t ∈ Td} qt
          • Td = set of tweets linking to domain d
          • qt = quality label of tweet t
      • Weakness:
          • Does not reflect the number of inlink tweets in the score
          • Favours domains with few inlink tweets
  • 22. URL domain reputation
      • Domain reputation score (DRS)
          [Equation: DRS(d) as a function of AvgQ(d) and |Td|], where AvgQ(d) is in [-1, +1]
      • "Collecting evidence" behaviour:
          • The score gets higher with more good-quality inlink tweets
      [Figure: DRS (y-axis, -4.00 to 4.00) against |Td| (x-axis, log scale, 1 to 1000), one curve for each AvgQ in {-1, -0.5, 0, 0.5, 1}]
  • 23. URL domain reputation
      10 domains with a high DRS (mainly news-oriented sites):
          Domain            AvgQ    Inlinks   RS
          gallup.com        0.96    99        1.92
          mashable.com      0.79    97        1.58
          hrw.org           0.86    57        1.51
          foxnews.com       0.68    38        1.08
          good.is           0.68    31        1.01
          intuit.com        0.57    60        1.01
          forbes.com        0.68    19        0.87
          reuters.com       1.00    6         0.78
          cnn.com           0.36    85        0.70
      10 domains with a low DRS (mainly image and location sharing sites):
          Domain            AvgQ    Inlinks   RS
          tweetphoto.com    -0.77   106       -1.57
          twitpic.com       -0.75   113       -1.54
          twitlonger.com    -0.85   66        -1.54
          myloc.me          -0.85   54        -1.48
          instagr.am        -0.62   52        -1.06
          formspring.me     -0.78   18        -0.98
          yfrog.com         -0.55   53        -0.94
          lockerz.com       -0.63   16        -0.75
          qik.com           -0.75   8         -0.68
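The transcript does not preserve the DRS formula itself. The per-domain values in the table above are, however, consistent with the average quality multiplied by the base-10 logarithm of the number of inlink tweets (for example gallup.com: 0.96 × log10(99) ≈ 1.92, reuters.com: 1.00 × log10(6) ≈ 0.78). The sketch below implements that inferred form; treat it as a reading of the reported numbers, not as the authors' stated definition. It also assumes quality labels have already been rescaled to [-1, +1], since the slides do not show how the 1 to 5 labels are mapped.

```python
import math


def average_quality(labels):
    """Mean quality label of the tweets linking to one domain.

    Assumption: labels are already rescaled to the range [-1, +1].
    """
    if not labels:
        raise ValueError("domain has no inlink tweets")
    return sum(labels) / len(labels)


def domain_reputation_score(avg_q, n_inlinks):
    """Inferred DRS: average quality weighted by log10 of the inlink count.

    A domain with a single inlink tweet scores 0, and the magnitude grows
    as more evidence (inlink tweets) accumulates.
    """
    if n_inlinks < 1:
        raise ValueError("n_inlinks must be at least 1")
    return avg_q * math.log10(n_inlinks)
```

Under this form a domain needs both a high average and many inlink tweets to reach a high score, which matches the "collecting evidence" behaviour described in the slides.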
  • 24. Reputation of hashtag & user
      [Figure: tweets 1 to 6 with quality labels Q (1 to 5), each linked to either #justforfun or #DASFAA]
      • Hashtag reputation
          • #DASFAA vs. #justforfun
      • Re-tweet source user reputation
          • @barackobama vs. @wysz22212
  • 25. Experiments
  • 26. Dataset
      • 10,000 tweets
          • 100 users, 100 recent tweets per user
      • Users:
          • 50 random users
          • 50 influential users
              • Selected from listorious.com
              • 5 categories: technology, business, politics, celebrities, activism
              • 10 users per category
  • 27. Labelling
      • Crowdsourcing
          • Amazon Mechanical Turk
          • 3 labels per tweet from different reviewers
          • Possible labels: 1 to 5
              • 1 = low quality, 5 = high quality
          • Random order of tweets
  • 28. Labelling results
      [Figure: distribution of tweets by quality score]
  • 29. Feature analysis
      • Total 29 features
      • Top 5 features based on Information Gain:
          Information Gain   Feature
          0.374              Domain reputation
          0.287              Contains link
          0.130              Formality score
          0.127              Num. proper names
          0.113              Max. proper names
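Information gain measures how much knowing a feature's value reduces the entropy of the class label. The sketch below uses the standard definition for a discrete feature; continuous features such as the domain reputation score would first need to be binned, and the slides do not say how that was done. Function names are illustrative.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """Entropy of the labels minus their entropy conditioned on the feature.

    Both arguments are equal-length sequences; feature values must be
    discrete (hashable).
    """
    n = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional
```

A binary feature that perfectly separates balanced high-quality and low-quality tweets has a gain of 1 bit, while a feature independent of the label has a gain of 0.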
  • 30. Feature selection
      • Greedy attribute selection
      • 15 selected features:
          • Domain reputation
          • RT source reputation
          • Formality
          • Tweet uniqueness
          • No. of named entities
          • % correctly spelled words
          • Max. no. of repeated letters
          • No. of hashtags
          • Contains numbers
          • No. of capitalized words
          • Is reply-tweet
          • Is re-tweet
          • Avg. word length
          • Contains first-person
          • No. of exclamation marks
  • 31. Classification and Ranking Method
      • Classification:
          • SVM, binary classification (high-quality, low-quality)
          • 50/50 split for training/testing
      • Ranking:
          • Learning-to-rank (Rank SVM)
          • 30 queries from 5 topic categories
          • Process:
              1. Retrieve tweets matching a query
              2. Extract features from the tweets
              3. 'Query-tweet vector' pairs + quality scores of the
  • 32. Classification results
          Features                       #attributes   High-Quality      Low-Quality       Overall
                                                       P       R         P       R         AUC
          Link only                      1             0.798   0.702     0.894   0.934     0.818
          TF-IDF                         3322          0.862   0.665     0.885   0.96      0.813
          Subset.Reputation              3             0.812   0.746     0.909   0.936     0.841
          Subset.SVM ("greedy")          15            0.715   0.758     0.912   0.936     0.847
          All quality features           29            0.815   0.66      0.882   0.944     0.802
          All quality ftr's + TF-IDF     3351          0.739   0.775     0.915   0.899     0.837
      • Optimal feature set (15 attrs.) outperforms TF-IDF (3322 attrs.)
      • Link-based "reputation" features (3 attrs.) achieve the 2nd best result
      • Combining quality features + TF-IDF does not improve the result
  • 33. Classification results
          Features                       #attributes   AUC
          Link only                      1             0.818
          TF-IDF                         3322          0.813
          Subset.Reputation              3             0.841
          Subset.SVM ("greedy")          15            0.847
          All quality features           29            0.802
          All quality ftr's + TF-IDF     3351          0.837
      [Figures: storage cost and training time per feature set]
      • Optimal feature set achieves reduced training time and storage cost
  • 34. Ranking results
      [Equation: NDCG@N]
          Features                  #attributes   NDCG@1   NDCG@2   NDCG@5   NDCG@10   MAP
          Link only                 1             0.067    0.111    0.22     0.324     0.398
          Subset.Reputation         3             0.822    0.777    0.777    0.764     0.661
          Subset.SVM ("greedy")     15            0.867    0.767    0.778    0.769     0.653
          All quality features      29            0.733    0.733    0.763    0.753     0.637
      • Optimal feature set (15 attrs.) achieves the best result
      • Link-based "reputation" features (3 attrs.) achieve the 2nd best result
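The NDCG@N formula used for this table is not preserved in the transcript. The sketch below shows one widely used formulation, with an exponential gain of 2^rel − 1 and a log2 rank discount; other variants (for example a linear gain) exist, so the exact numbers above may have been computed differently. Function names are illustrative.

```python
import math


def dcg_at_n(relevances, n):
    """Discounted cumulative gain over the top-n results.

    relevances: graded relevance (e.g. quality score 1 to 5) of each result,
                in the order the ranker returned them.
    Assumption: gain 2**rel - 1 and discount log2(rank + 1).
    """
    return sum(
        (2 ** rel - 1) / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:n], start=1)
    )


def ndcg_at_n(relevances, n):
    """DCG@n divided by the DCG@n of the ideal (descending) ordering."""
    ideal = dcg_at_n(sorted(relevances, reverse=True), n)
    return dcg_at_n(relevances, n) / ideal if ideal > 0 else 0.0
```

A ranking that already lists tweets in descending quality order scores 1.0, and placing low-quality tweets first pushes the score towards 0.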
  • 35. Conclusions
  • 36. Summary
      • Method for quality-based classification and ranking of tweets
      • Proposed and evaluated a set of tweet features to capture a tweet's quality
      • Link-based features lead to the best performance
  • 37. Future work
      • Consider different types of queries in Twitter
          • E.g. searching for hot topics, movie reviews, facts, opinions, etc.
          • Different features may be important in different scenarios
      • Incorporating recent hot topics
      • Personalized re-ranking
  • 38. Q/A
  • 39. Thank You
  • 40. Related work
      • Spam detection
          • Bag-of-words, keyword-based
          • Feature-based approaches
          • Combinations
      • Social networks
          • Finding quality answers in Q-A systems
              • E.g. Yahoo Answers
              • Feature-based
      • Web search
          • Quality-based ranking of web documents
              • Feature-based quality score (WSDM'11)
  • 41. ROC Curve
      • Area under the ROC curve: the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one
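The probabilistic reading of AUC can be computed directly by comparing every positive instance with every negative one. The sketch below counts ties as half a win, which is the usual convention; the function name and the 0/1 label encoding are assumptions for the example, and the quadratic loop is meant for illustration rather than large datasets.

```python
def auc_by_pairs(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    scores: classifier scores, higher meaning "more likely positive"
    labels: 1 for positive (e.g. high-quality), 0 for negative
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative instance")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a correct ranking
    return wins / (len(positives) * len(negatives))
```

A classifier that scores every high-quality tweet above every low-quality one reaches 1.0, and random scoring averages 0.5.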