Language-Independent Twitter Sentiment Analysis

Sascha Narr, Michael Hülfenhaus, Sahin Albayrak


Sascha Narr
Competence Center Information Retrieval & Machine Learning


KDML 2012, LWA, Dortmund, Germany
Overview



► 1. Sentiment analysis on social media
► 2. Creation of a multilingual evaluation dataset of tweets
► 3. A language-independent sentiment labeling heuristic for semi-supervised learning
► 4. Experiments on the multilingual dataset




           18. September 2012   Language-Independent Twitter Sentiment Analysis   2
1. Sentiment Analysis on Social Media


► Why Sentiment Analysis?
    ▪ Large volumes of people’s opinions and sentiments about products and events are invaluable for market research, product feedback and more
    ▪ Sentiment analysis makes it possible to collect such data automatically

► Why Twitter?
    ▪ 400 million tweets are posted each day [1]
    ▪ Shorter text lengths encourage people to “just write” what they think
    ▪ Tweets are often informal and contain lots of opinions


[1]: http://news.cnet.com/8301-1023_3-57448388-93/twitter-hits-400-million-tweets-per-day-mostly-mobile/

1. Methods for Sentiment Classification

► Sentiment classification goals:
    ▪ Subjectivity: “Does the tweet contain an opinion?”
    ▪ Polarity: “Is the expressed opinion positive or negative?”
► Classifiers used:
    ▪ Naive Bayes, Maximum Entropy, Support Vector Machines
► Features used:
    ▪ n-grams, WordNet semantics, part-of-speech information
► Tweet texts have unique properties:
    ▪ Informal, contain slang, emoticons, misspellings



1. Multilingual Sentiment Analysis

► Fewer than 40% of tweets are in English [1]
► Natural language processing methods are often designed specifically for one language
► Increase the coverage of sentiment analysis by using a language-independent approach:
    ▪ No extra effort for additional languages
    ▪ Is the approach really effective for all languages?



                                  [1] http://semiocast.com/publications/2011_11_24_Arabic_highest_growth_on_Twitter


2. Creation of a Multilingual Evaluation Dataset


► We created a hand-annotated sentiment evaluation dataset of over 12,000 tweets
    ▪ 4 languages: English, German, French, Portuguese
► Used the Amazon Mechanical Turk platform for annotation
► Each tweet was annotated by 3 different workers:
    ▪ Labels: “positive”, “neutral”, “negative”
    ▪ Added validation tweets to try to ensure the quality of the annotations




2. Our Multilingual Evaluation Dataset

► We observed low inter-annotator agreement in our dataset
    ▪ Sentiment classification is a hard task, even for humans
    ▪ Tweets that humans disagree on are also harder to classify automatically
► The dataset is publicly available for research purposes




              Table 1: Tweet counts for the complete annotated dataset




3. A Language-Independent Heuristic

► To train a sentiment classifier, a large amount of labeled training data is needed
    ▪ Can be obtained without human effort using a previously proposed heuristic
► The heuristic uses emoticons in tweets as noisy labels
► Heuristic: If a tweet contains only positive emoticons, label its whole text as positive (and vice versa for negative).
► Examples of emoticons we used:
    ▪ Positive: :) :-) =) ;) :] :D ^-^ ^_^
    ▪ Negative: :( :-( :(( -.- >:-( D: :/


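The emoticon heuristic from section 3 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the emoticon sets are the examples listed on the slide (the full lists are not given), the function name `label_tweet` is hypothetical, tokens are split on whitespace, and removing the emoticons from the training text is an assumption the slides do not confirm.

```python
# Example emoticons from the slide; the complete lists are not given there.
POSITIVE = {":)", ":-)", "=)", ";)", ":]", ":D", "^-^", "^_^"}
NEGATIVE = {":(", ":-(", ":((", "-.-", ">:-(", "D:", ":/"}


def label_tweet(text):
    """Return (label, text_without_emoticons), or None if no noisy label applies.

    Hypothetical helper: a tweet is labeled only when it contains emoticons
    of exactly one polarity.
    """
    tokens = text.split()  # assumption: plain whitespace tokenization
    has_pos = any(t in POSITIVE for t in tokens)
    has_neg = any(t in NEGATIVE for t in tokens)
    if has_pos == has_neg:
        # Either no emoticon at all, or mixed polarities: skip the tweet.
        return None
    label = "positive" if has_pos else "negative"
    # Assumption: emoticons are dropped so a classifier cannot simply memorize them.
    cleaned = " ".join(t for t in tokens if t not in POSITIVE | NEGATIVE)
    return label, cleaned
```

Matching whole tokens rather than substrings keeps a URL such as `http://example.com` from being mistaken for the negative emoticon `:/`.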
3. Heuristic for Semi-Supervised Learning

► The heuristic can be applied to almost any language, since emoticons are used extensively on Twitter
► The number of tweets with emoticons differs among languages
    ▪ Caused by many factors, such as language-specific ways of expressing sentiment or different proportions of “formal” tweets




            Table 2: Number of tweets containing emoticons for each language




4. Experiments – Sentiment Classification

► Data:
    ▪ Training: from ~800M random tweets of mixed languages:
        ▪ Filter for languages: English, German, French, Portuguese
        ▪ Use the emoticon heuristic to select and label training data
    ▪ Evaluation: 12,597 hand-annotated tweets (4 languages)

► Setup:
    ▪ Classification: sentiment polarity only
    ▪ Classifier: Naive Bayes
    ▪ Features: 1-grams alone, and 1-grams combined with 2-grams
    ▪ Trained 4 classifiers, one each for en, de, fr, pt, plus 1 classifier for the combined en+de+fr+pt data
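To make the setup concrete, the following minimal sketch shows n-gram feature extraction and a small Naive Bayes polarity classifier. The slides only name the classifier and the feature types, so everything else here is an assumption: the multinomial variant, add-one smoothing, lowercasing, whitespace tokenization, and the names `ngram_features` and `NaiveBayes` are all chosen for illustration.

```python
import math
from collections import Counter, defaultdict


def ngram_features(text, max_n=1):
    """Return all n-grams for n = 1..max_n (max_n=2 gives 1-grams plus 2-grams)."""
    tokens = text.lower().split()  # assumption: lowercase, whitespace tokens
    feats = []
    for n in range(1, max_n + 1):
        feats += [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return feats


class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing (an assumed variant)."""

    def __init__(self, max_n=1):
        self.max_n = max_n
        self.class_counts = Counter()            # tweets seen per label
        self.feat_counts = defaultdict(Counter)  # feature counts per label
        self.vocab = set()

    def train(self, labeled_tweets):
        for text, label in labeled_tweets:
            self.class_counts[label] += 1
            for feat in ngram_features(text, self.max_n):
                self.feat_counts[label][feat] += 1
                self.vocab.add(feat)

    def classify(self, text):
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            score = math.log(count / total)  # log prior
            denom = sum(self.feat_counts[label].values()) + len(self.vocab)
            for feat in ngram_features(text, self.max_n):
                if feat in self.vocab:  # features never seen in training are ignored
                    score += math.log((self.feat_counts[label][feat] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

Under these assumptions, a per-language classifier and the combined en+de+fr+pt classifier differ only in which emoticon-labeled tweets are passed to `train`; the features themselves need no language-specific resources.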


4. Experiments: Evaluation Dataset

► 2 variations of our evaluation set were used for the experiments:
    ▪ agree-3: tweets for which all 3 annotators agreed on the sentiment
    ▪ agree-2: tweets for which at least 2 annotators agreed on the sentiment
► Baseline: always guess “positive” (there are more positive tweets than negative ones)




               Table 3: Tweet counts for the evaluation datasets
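The agree-2 and agree-3 sets and the “always positive” baseline can be derived from the three annotations per tweet roughly as follows. This is a sketch under assumptions: the helper names are hypothetical, and dropping tweets whose agreed label is “neutral” is inferred from the polarity-only setup rather than stated on the slides.

```python
from collections import Counter


def majority_label(annotations, min_agree):
    """Return the label given by at least `min_agree` annotators, else None."""
    label, votes = Counter(annotations).most_common(1)[0]
    return label if votes >= min_agree else None


def build_eval_set(annotated, min_agree):
    """Build agree-2 (min_agree=2) or agree-3 (min_agree=3) from (text, labels) pairs."""
    result = []
    for text, annotations in annotated:
        label = majority_label(annotations, min_agree)
        # Assumption: only polar tweets are kept, since polarity is what is classified.
        if label in ("positive", "negative"):
            result.append((text, label))
    return result


def baseline_accuracy(eval_set):
    """Accuracy of the baseline that always guesses "positive"."""
    return sum(1 for _, label in eval_set if label == "positive") / len(eval_set)
```

With three annotators, agree-3 is always a subset of agree-2, so the two sets differ only in the tweets on which exactly two annotators agreed.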



4. Results – English Classifier

► Best results: English classifier using 1-grams, on the agree-3 set
    ▪ 81.3% accuracy (trained on 500k tweets)
► Performance on the agree-2 set is consistently lower than on agree-3

[Figure: English (en) classifier results]




4. Results – All Languages

[Figure panels: en, de, fr, pt]




4. Evaluation – All Languages Compared

► Strong differences between languages
► Differences do not correlate with the number of emoticons in each language
► The emoticon heuristic is a better fit for some languages; this may depend on the style of expressing sentiment in each language
► Example: “muito engraçado kkkkkkkk” (Portuguese for “very funny”, with laughter written as “kkkk” rather than an emoticon)

[Figure panels: en, de, fr, pt]

Table 3: Tweet counts containing emoticons for each language



4. Evaluation – Multi-language Classifier
► Tested on the combined 4-language evaluation set
► Highest performance: 71.5% accuracy
    ▪ Slightly lower than using 4 individual classifiers (73.9% accuracy)
► The usefulness of a combined classifier can outweigh the performance degradation

[Figure: combined en+de+fr+pt classifier results]




Conclusions

► We presented and evaluated a language-independent sentiment classification approach on 4 languages
    ▪ A language-independent classifier can be trained given only raw tweets, using a noisy label heuristic
    ▪ Good performance across languages, though it varies for each
    ▪ Classifiers need a very large number of tweets for training
    ▪ Mixed-language classifiers are viable

► Future work:
    ▪ Currently we only classify sentiment polarity
    ▪ Classifying subjectivity in tweets is important, but finding a good heuristic to label “neutral” tweets is a challenge

Language-Independent Twitter Sentiment Analysis




         Thanks for your attention!

                            Questions?



Contact


Sascha Narr, Dipl.-Inform.
Competence Center Information Retrieval & Machine Learning
sascha.narr@dai-labor.de
Phone +49 (0) 30 / 314 – 74 138
Fax +49 (0) 30 / 314 – 74 003

DAI-Labor
Technische Universität Berlin
Fakultät IV – Elektrotechnik & Informatik
Sekretariat TEL 14
Ernst-Reuter-Platz 7
10587 Berlin
www.dai-labor.de

                18. September 2012   Language-Independent Twitter Sentiment Analysis   22
1 von 22

Recomendados

Language-Independent Twitter Sentiment Analysis von
Language-Independent Twitter Sentiment AnalysisLanguage-Independent Twitter Sentiment Analysis
Language-Independent Twitter Sentiment Analysissaschanarr
1.4K views22 Folien
P1803018289 von
P1803018289P1803018289
P1803018289IOSR Journals
269 views8 Folien
Data Science - Experiments von
Data Science - ExperimentsData Science - Experiments
Data Science - ExperimentsGaurav Marwaha
749 views27 Folien
J1803015357 von
J1803015357J1803015357
J1803015357IOSR Journals
261 views5 Folien
TwitIE: An Open-Source Information Extraction Pipeline for Microblog Text von
TwitIE: An Open-Source Information Extraction Pipeline for Microblog TextTwitIE: An Open-Source Information Extraction Pipeline for Microblog Text
TwitIE: An Open-Source Information Extraction Pipeline for Microblog TextLeon Derczynski
3.3K views21 Folien
Technical Development Workshop - Text Analytics with Python von
Technical Development Workshop - Text Analytics with PythonTechnical Development Workshop - Text Analytics with Python
Technical Development Workshop - Text Analytics with PythonMichelle Purnama
215 views38 Folien

Más contenido relacionado

Destacado

Text Classification, Sentiment Analysis, and Opinion Mining von
Text Classification, Sentiment Analysis, and Opinion MiningText Classification, Sentiment Analysis, and Opinion Mining
Text Classification, Sentiment Analysis, and Opinion MiningFabrizio Sebastiani
1.4K views91 Folien
Sentiment Analysis of Twitter Data von
Sentiment Analysis of Twitter DataSentiment Analysis of Twitter Data
Sentiment Analysis of Twitter DataSumit Raj
77.9K views21 Folien
P3 von
P3P3
P3John Kirbow
117 views8 Folien
Algorithm Name Detection & Extraction von
Algorithm Name Detection & ExtractionAlgorithm Name Detection & Extraction
Algorithm Name Detection & ExtractionDeeksha thakur
489 views11 Folien
Evaluation Datasets for Twitter Sentiment Analysis: A survey and a new datase... von
Evaluation Datasets for Twitter Sentiment Analysis: A survey and a new datase...Evaluation Datasets for Twitter Sentiment Analysis: A survey and a new datase...
Evaluation Datasets for Twitter Sentiment Analysis: A survey and a new datase...Knowledge Media Institute - The Open University
4.7K views18 Folien
Unsupervised Sentiment Analysis von
Unsupervised Sentiment AnalysisUnsupervised Sentiment Analysis
Unsupervised Sentiment AnalysisTaras Zagibalov
4.7K views28 Folien

Destacado(20)

Text Classification, Sentiment Analysis, and Opinion Mining von Fabrizio Sebastiani
Text Classification, Sentiment Analysis, and Opinion MiningText Classification, Sentiment Analysis, and Opinion Mining
Text Classification, Sentiment Analysis, and Opinion Mining
Fabrizio Sebastiani1.4K views
Sentiment Analysis of Twitter Data von Sumit Raj
Sentiment Analysis of Twitter DataSentiment Analysis of Twitter Data
Sentiment Analysis of Twitter Data
Sumit Raj77.9K views
Algorithm Name Detection & Extraction von Deeksha thakur
Algorithm Name Detection & ExtractionAlgorithm Name Detection & Extraction
Algorithm Name Detection & Extraction
Deeksha thakur489 views
Unsupervised Sentiment Analysis von Taras Zagibalov
Unsupervised Sentiment AnalysisUnsupervised Sentiment Analysis
Unsupervised Sentiment Analysis
Taras Zagibalov4.7K views
Text classification & sentiment analysis von M. Atif Qureshi
Text classification & sentiment analysisText classification & sentiment analysis
Text classification & sentiment analysis
M. Atif Qureshi4.9K views
A Pipeline for Distributed Topic and Sentiment Analysis of Tweets on Pivotal ... von Srivatsan Ramanujam
A Pipeline for Distributed Topic and Sentiment Analysis of Tweets on Pivotal ...A Pipeline for Distributed Topic and Sentiment Analysis of Tweets on Pivotal ...
A Pipeline for Distributed Topic and Sentiment Analysis of Tweets on Pivotal ...
Srivatsan Ramanujam8.3K views
Best Practices for Sentiment Analysis Webinar von Mechanical Turk
Best Practices for Sentiment Analysis Webinar Best Practices for Sentiment Analysis Webinar
Best Practices for Sentiment Analysis Webinar
Mechanical Turk 2.1K views
Sentiment Analysis Using Hybrid Structure of Machine Learning Algorithms von Sangeeth Nagarajan
Sentiment Analysis Using Hybrid Structure of Machine Learning AlgorithmsSentiment Analysis Using Hybrid Structure of Machine Learning Algorithms
Sentiment Analysis Using Hybrid Structure of Machine Learning Algorithms
Sangeeth Nagarajan1.9K views
Sentiment Analysis with NVivo 11 Plus von Shalin Hai-Jew
Sentiment Analysis with NVivo 11 PlusSentiment Analysis with NVivo 11 Plus
Sentiment Analysis with NVivo 11 Plus
Shalin Hai-Jew7.9K views
Practical Sentiment Analysis von People Pattern
Practical Sentiment AnalysisPractical Sentiment Analysis
Practical Sentiment Analysis
People Pattern12.4K views
Can Deep Learning solve the Sentiment Analysis Problem von Mark Cieliebak
Can Deep Learning solve the Sentiment Analysis ProblemCan Deep Learning solve the Sentiment Analysis Problem
Can Deep Learning solve the Sentiment Analysis Problem
Mark Cieliebak7.1K views
Sentiment Analysis via R Programming von Skillspeed
Sentiment Analysis via R ProgrammingSentiment Analysis via R Programming
Sentiment Analysis via R Programming
Skillspeed3.5K views
Sentiment analysis-by-nltk von Wei-Ting Kuo
Sentiment analysis-by-nltkSentiment analysis-by-nltk
Sentiment analysis-by-nltk
Wei-Ting Kuo13.1K views
MTech Seminar Presentation [IIT-Bombay] von Sagar Ahire
MTech Seminar Presentation [IIT-Bombay]MTech Seminar Presentation [IIT-Bombay]
MTech Seminar Presentation [IIT-Bombay]
Sagar Ahire14.9K views
R by example: mining Twitter for consumer attitudes towards airlines von Jeffrey Breen
R by example: mining Twitter for consumer attitudes towards airlinesR by example: mining Twitter for consumer attitudes towards airlines
R by example: mining Twitter for consumer attitudes towards airlines
Jeffrey Breen173.4K views

Similar a Language-Independent Twitter Sentiment Analysis

Sentiment Analysis and Political Disaffection in Italy von
Sentiment Analysis and Political Disaffection in ItalySentiment Analysis and Political Disaffection in Italy
Sentiment Analysis and Political Disaffection in ItalyCorrado Monti
1K views24 Folien
D. Zardetto, Using Twitter data for the Social Mood on Economy Index von
D. Zardetto, Using Twitter data for the Social Mood on Economy Index D. Zardetto, Using Twitter data for the Social Mood on Economy Index
D. Zardetto, Using Twitter data for the Social Mood on Economy Index Istituto nazionale di statistica
1.5K views42 Folien
Affect Level Opinion Mining von
Affect Level Opinion MiningAffect Level Opinion Mining
Affect Level Opinion MiningYasas Senarath
35 views10 Folien
Analysis Of Web Series And Movies Using Sentiment Analysis A Review von
Analysis Of Web Series And Movies Using Sentiment Analysis  A ReviewAnalysis Of Web Series And Movies Using Sentiment Analysis  A Review
Analysis Of Web Series And Movies Using Sentiment Analysis A ReviewMartha Brown
9 views5 Folien
Rethinking Social Media Measurement von
Rethinking Social Media MeasurementRethinking Social Media Measurement
Rethinking Social Media MeasurementMasood Akhtar
771 views8 Folien
A tailor-made one-size-fits-all approach to sentiment analysis von
A tailor-made one-size-fits-all approach to sentiment analysisA tailor-made one-size-fits-all approach to sentiment analysis
A tailor-made one-size-fits-all approach to sentiment analysisDiana Maynard
1.2K views18 Folien

Similar a Language-Independent Twitter Sentiment Analysis(20)

Sentiment Analysis and Political Disaffection in Italy von Corrado Monti
Sentiment Analysis and Political Disaffection in ItalySentiment Analysis and Political Disaffection in Italy
Sentiment Analysis and Political Disaffection in Italy
Corrado Monti1K views
Analysis Of Web Series And Movies Using Sentiment Analysis A Review von Martha Brown
Analysis Of Web Series And Movies Using Sentiment Analysis  A ReviewAnalysis Of Web Series And Movies Using Sentiment Analysis  A Review
Analysis Of Web Series And Movies Using Sentiment Analysis A Review
Martha Brown9 views
Rethinking Social Media Measurement von Masood Akhtar
Rethinking Social Media MeasurementRethinking Social Media Measurement
Rethinking Social Media Measurement
Masood Akhtar771 views
A tailor-made one-size-fits-all approach to sentiment analysis von Diana Maynard
A tailor-made one-size-fits-all approach to sentiment analysisA tailor-made one-size-fits-all approach to sentiment analysis
A tailor-made one-size-fits-all approach to sentiment analysis
Diana Maynard1.2K views
Sentiment Analysis of Social Media Content: A multi-tool for listening to you... von Eirini Ntoutsi
Sentiment Analysis of Social Media Content: A multi-tool for listening to you...Sentiment Analysis of Social Media Content: A multi-tool for listening to you...
Sentiment Analysis of Social Media Content: A multi-tool for listening to you...
Eirini Ntoutsi339 views
This assignment allows you to demonstrate mastery of outcome # 2.docx von howardh5
This assignment allows you to demonstrate mastery of outcome # 2.docxThis assignment allows you to demonstrate mastery of outcome # 2.docx
This assignment allows you to demonstrate mastery of outcome # 2.docx
howardh55 views
IRJET- Real Time Sentiment Analysis of Political Twitter Data using Machi... von IRJET Journal
IRJET-  	  Real Time Sentiment Analysis of Political Twitter Data using Machi...IRJET-  	  Real Time Sentiment Analysis of Political Twitter Data using Machi...
IRJET- Real Time Sentiment Analysis of Political Twitter Data using Machi...
IRJET Journal49 views
Detecting insults in social media conversations von raj
Detecting insults in social media conversationsDetecting insults in social media conversations
Detecting insults in social media conversations
raj 628 views
Sentiment analysis - Our approach and use cases von Karol Chlasta
Sentiment analysis - Our approach and use casesSentiment analysis - Our approach and use cases
Sentiment analysis - Our approach and use cases
Karol Chlasta1.7K views
Intellexy social media analysis solutions d2011 von Maya Marashlian
Intellexy social media analysis solutions d2011Intellexy social media analysis solutions d2011
Intellexy social media analysis solutions d2011
Maya Marashlian229 views
Intellexy Social Media Monitoring and Analysis Solutions D2011 von MayaMar
Intellexy Social Media Monitoring and Analysis Solutions D2011Intellexy Social Media Monitoring and Analysis Solutions D2011
Intellexy Social Media Monitoring and Analysis Solutions D2011
MayaMar368 views
A User Modeling Oriented Analysis of Cultural Backgrounds in Microblogging von Elena Daehnhardt
A User Modeling Oriented Analysis of Cultural Backgrounds in MicrobloggingA User Modeling Oriented Analysis of Cultural Backgrounds in Microblogging
A User Modeling Oriented Analysis of Cultural Backgrounds in Microblogging
Elena Daehnhardt1.3K views
To Label or Not? Advances and Open Challenges in SE-specific Sentiment Analysis von Nicole Novielli
To Label or Not? Advances and Open Challenges in SE-specific Sentiment AnalysisTo Label or Not? Advances and Open Challenges in SE-specific Sentiment Analysis
To Label or Not? Advances and Open Challenges in SE-specific Sentiment Analysis
Nicole Novielli164 views
Exciting Strategies for GED Test Preparation Instruction von Meagen Farrell
Exciting Strategies for GED Test Preparation InstructionExciting Strategies for GED Test Preparation Instruction
Exciting Strategies for GED Test Preparation Instruction
Meagen Farrell867 views
VenTESOL Social Media for Effective Teacher Development von Andrés Ramos
VenTESOL Social Media for Effective Teacher DevelopmentVenTESOL Social Media for Effective Teacher Development
VenTESOL Social Media for Effective Teacher Development
Andrés Ramos526 views

Último

AI: mind, matter, meaning, metaphors, being, becoming, life values von
AI: mind, matter, meaning, metaphors, being, becoming, life valuesAI: mind, matter, meaning, metaphors, being, becoming, life values
AI: mind, matter, meaning, metaphors, being, becoming, life valuesTwain Liu 刘秋艳
34 views16 Folien
Throughput von
ThroughputThroughput
ThroughputMoisés Armani Ramírez
32 views11 Folien
"Role of a CTO in software outsourcing company", Yuriy Nakonechnyy von
"Role of a CTO in software outsourcing company", Yuriy Nakonechnyy"Role of a CTO in software outsourcing company", Yuriy Nakonechnyy
"Role of a CTO in software outsourcing company", Yuriy NakonechnyyFwdays
40 views21 Folien
ChatGPT and AI for Web Developers von
ChatGPT and AI for Web DevelopersChatGPT and AI for Web Developers
ChatGPT and AI for Web DevelopersMaximiliano Firtman
174 views82 Folien
Data-centric AI and the convergence of data and model engineering: opportunit... von
Data-centric AI and the convergence of data and model engineering:opportunit...Data-centric AI and the convergence of data and model engineering:opportunit...
Data-centric AI and the convergence of data and model engineering: opportunit...Paolo Missier
29 views40 Folien
MemVerge: Past Present and Future of CXL von
MemVerge: Past Present and Future of CXLMemVerge: Past Present and Future of CXL
MemVerge: Past Present and Future of CXLCXL Forum
110 views26 Folien

Último(20)

AI: mind, matter, meaning, metaphors, being, becoming, life values von Twain Liu 刘秋艳
AI: mind, matter, meaning, metaphors, being, becoming, life valuesAI: mind, matter, meaning, metaphors, being, becoming, life values
AI: mind, matter, meaning, metaphors, being, becoming, life values
"Role of a CTO in software outsourcing company", Yuriy Nakonechnyy von Fwdays
"Role of a CTO in software outsourcing company", Yuriy Nakonechnyy"Role of a CTO in software outsourcing company", Yuriy Nakonechnyy
"Role of a CTO in software outsourcing company", Yuriy Nakonechnyy
Fwdays40 views
Data-centric AI and the convergence of data and model engineering: opportunit... von Paolo Missier
Data-centric AI and the convergence of data and model engineering:opportunit...Data-centric AI and the convergence of data and model engineering:opportunit...
Data-centric AI and the convergence of data and model engineering: opportunit...
Paolo Missier29 views
MemVerge: Past Present and Future of CXL von CXL Forum
MemVerge: Past Present and Future of CXLMemVerge: Past Present and Future of CXL
MemVerge: Past Present and Future of CXL
CXL Forum110 views
"Fast Start to Building on AWS", Igor Ivaniuk von Fwdays
"Fast Start to Building on AWS", Igor Ivaniuk"Fast Start to Building on AWS", Igor Ivaniuk
"Fast Start to Building on AWS", Igor Ivaniuk
Fwdays36 views
Microchip: CXL Use Cases and Enabling Ecosystem von CXL Forum
Microchip: CXL Use Cases and Enabling EcosystemMicrochip: CXL Use Cases and Enabling Ecosystem
Microchip: CXL Use Cases and Enabling Ecosystem
CXL Forum129 views
TE Connectivity: Card Edge Interconnects von CXL Forum
TE Connectivity: Card Edge InterconnectsTE Connectivity: Card Edge Interconnects
TE Connectivity: Card Edge Interconnects
CXL Forum96 views
.conf Go 2023 - How KPN drives Customer Satisfaction on IPTV von Splunk
.conf Go 2023 - How KPN drives Customer Satisfaction on IPTV.conf Go 2023 - How KPN drives Customer Satisfaction on IPTV
.conf Go 2023 - How KPN drives Customer Satisfaction on IPTV
Splunk86 views
Upskilling the Evolving Workforce with Digital Fluency for Tomorrow's Challen... von NUS-ISS
Upskilling the Evolving Workforce with Digital Fluency for Tomorrow's Challen...Upskilling the Evolving Workforce with Digital Fluency for Tomorrow's Challen...
Upskilling the Evolving Workforce with Digital Fluency for Tomorrow's Challen...
NUS-ISS23 views
Micron CXL product and architecture update von CXL Forum
Micron CXL product and architecture updateMicron CXL product and architecture update
Micron CXL product and architecture update
CXL Forum27 views
"Thriving Culture in a Product Company — Practical Story", Volodymyr Tsukur von Fwdays
"Thriving Culture in a Product Company — Practical Story", Volodymyr Tsukur"Thriving Culture in a Product Company — Practical Story", Volodymyr Tsukur
"Thriving Culture in a Product Company — Practical Story", Volodymyr Tsukur
Fwdays40 views
The details of description: Techniques, tips, and tangents on alternative tex... von BookNet Canada
The details of description: Techniques, tips, and tangents on alternative tex...The details of description: Techniques, tips, and tangents on alternative tex...
The details of description: Techniques, tips, and tangents on alternative tex...
BookNet Canada110 views
.conf Go 2023 - Data analysis as a routine von Splunk
.conf Go 2023 - Data analysis as a routine.conf Go 2023 - Data analysis as a routine
.conf Go 2023 - Data analysis as a routine
Splunk90 views
GigaIO: The March of Composability Onward to Memory with CXL von CXL Forum
GigaIO: The March of Composability Onward to Memory with CXLGigaIO: The March of Composability Onward to Memory with CXL
GigaIO: The March of Composability Onward to Memory with CXL
CXL Forum126 views
"Ukrainian Mobile Banking Scaling in Practice. From 0 to 100 and beyond", Vad... von Fwdays
"Ukrainian Mobile Banking Scaling in Practice. From 0 to 100 and beyond", Vad..."Ukrainian Mobile Banking Scaling in Practice. From 0 to 100 and beyond", Vad...
"Ukrainian Mobile Banking Scaling in Practice. From 0 to 100 and beyond", Vad...
Fwdays40 views
MemVerge: Memory Viewer Software von CXL Forum
MemVerge: Memory Viewer SoftwareMemVerge: Memory Viewer Software
MemVerge: Memory Viewer Software
CXL Forum118 views
Transcript: The Details of Description Techniques tips and tangents on altern... von BookNet Canada
Transcript: The Details of Description Techniques tips and tangents on altern...Transcript: The Details of Description Techniques tips and tangents on altern...
Transcript: The Details of Description Techniques tips and tangents on altern...
BookNet Canada119 views

Language-Independent Twitter Sentiment Analysis

  • 1. Language-Independent Twitter Sentiment Analysis Sascha Narr, Michael Hülfenhaus, Sahin Albayrak Sascha Narr Competence Center Information Retrieval & Machine Learning KDML 2012, LWA, Dortmund, Germany
  • 2. Overview ►1. Sentiment analysis on social media ►2. Creation of a multilingual evaluation dataset of tweets ►3. A language-independent sentiment labeling heuristic for semi-supervised learning ►4. Experiments on the multilingual dataset 18. September 2012 Language-Independent Twitter Sentiment Analysis 2
  • 3. Overview ►1. Sentiment analysis on social media ►2. Creation of a multilingual evaluation dataset of tweets ►3. A language-independent sentiment labeling heuristic for semi-supervised learning ►4. Experiments on the multilingual dataset 18. September 2012 Language-Independent Twitter Sentiment Analysis 3
  • 4. 1. Sentiment Analysis on Social Media ► Why Sentiment Analysis?  People’s opinions and sentiments about products and events in large numbers are invaluable:  Market research, product feedback and more  Sentiment Analysis allows to automatically collect such data ► Why Twitter?  400 Million tweets posted each day[1]  Shorter text lengths encourage people to “just write” what they think  Tweets are often informal and contain lots of opinions [1]: http://news.cnet.com/8301-1023 3-57448388-93/twitter-hits-400-million-tweets-per-day-mostly-mobile/ 18. September 2012 Language-Independent Twitter Sentiment Analysis 4
  • 5. 1. Methods for Sentiment Classification ► Sentiment classification goals:  Subjectivity: “Does the tweet contain an opinion?”  Polarity: “Is the expressed opinion positive or negative?” ► Classifiers used:  Naive Bayes, Maximum Entropy, Support Vector Machines ► Features used:  n-grams, WordNet semantics, part-of-speech information ► Tweet texts have unique properties:  Informal, contain slang, emoticons, misspellings 18. September 2012 Language-Independent Twitter Sentiment Analysis 5
  • 6. 1. Multilingual Sentiment Analysis ►Less than 40% of tweets are English [1] ►Natural language processing methods are often designed specifically for one language ► Increase coverage of sentiment analysis by using a language-independent approach: No extra effort for additional languages Is the approach really effective for all languages? [1] http://semiocast.com/publications/2011_11_24_Arabic_highest_growth_on_Twitter 18. September 2012 Language-Independent Twitter Sentiment Analysis 6
  • 7. Overview ►1. Sentiment analysis on social media ►2. Creation of a multilingual evaluation dataset of tweets ►3. A language-independent sentiment labeling heuristic for semi-supervised learning ►4. Experiments on the multilingual dataset 18. September 2012 Language-Independent Twitter Sentiment Analysis 7
  • 8. 2. Creation of a Multilingual Evaluation Dataset ► We created a hand-annotated sentiment evaluation dataset of over 12000 tweets  4 languages: English, German, French, Portuguese ►Used the Amazon Mechanical Turk platform for annotation ►Each tweet was annotated by 3 different workers:  Labels: “positive”, “neutral”, “negative”  Added validation tweets to try to ensure the quality of the annotations 18. September 2012 Language-Independent Twitter Sentiment Analysis 8
  • 9. 2. Our Multilingual Evaluation Dataset ► Observed a low inter-annotator agreement in our dataset  Sentiment classification is a hard task, even for humans  Tweets that humans disagree on are harder to classify as well ► The dataset is publicly available for research purposes Table 1: Tweet counts for the complete annotated dataset 18. September 2012 Language-Independent Twitter Sentiment Analysis 9
  • 10. Overview
      ► 1. Sentiment analysis on social media
      ► 2. Creation of a multilingual evaluation dataset of tweets
      ► 3. A language-independent sentiment labeling heuristic for semi-supervised learning
      ► 4. Experiments on the multilingual dataset
  • 11. 3. A Language-Independent Heuristic
      ► Training a sentiment classifier requires a large amount of labeled training data
          ▪ Such data can be obtained without human effort using a previously proposed heuristic
      ► The heuristic uses emoticons in tweets as noisy labels
      ► Heuristic: If a tweet contains only positive emoticons, label its whole text as positive (and likewise, label it negative if it contains only negative emoticons).
      ► Examples of emoticons we used:
          ▪ Positive: :) :-) =) ;) :] :D ^-^ ^_^
          ▪ Negative: :( :-( :(( -.- >:-( D: :/
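The labeling rule on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the emoticon lists are only the examples shown on the slide, the function name `label_tweet` is hypothetical, and matching on whitespace-separated tokens is an assumption made here so that the `:/` inside `http://` is not mistaken for a negative emoticon.

```python
from typing import Optional

# Example emoticons copied from the slide; the full lists used in the
# experiments are not given, so treat these as placeholders.
POSITIVE = {":)", ":-)", "=)", ";)", ":]", ":D", "^-^", "^_^"}
NEGATIVE = {":(", ":-(", ":((", "-.-", ">:-(", "D:", ":/"}


def label_tweet(text: str) -> Optional[str]:
    """Return a noisy sentiment label for a tweet, or None if it is unusable.

    Assumption: emoticons are matched as whole whitespace-separated tokens.
    Emoticons glued to a word (e.g. "great:)") are therefore missed.
    """
    tokens = text.split()
    has_pos = any(tok in POSITIVE for tok in tokens)
    has_neg = any(tok in NEGATIVE for tok in tokens)
    if has_pos and not has_neg:
        return "positive"
    if has_neg and not has_pos:
        return "negative"
    # No emoticons, or mixed emoticons: the tweet is not used for training.
    return None


if __name__ == "__main__":
    for tweet in ["love this :)", "so tired :( -.-", "well :) or :(",
                  "see http://example.com"]:
        print(repr(tweet), "->", label_tweet(tweet))
```

Tweets for which the function returns `None` would simply be skipped when building the training set.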
  • 12. 3. Heuristic for Semi-Supervised Learning
      ► The heuristic can be applied to almost any language, since emoticons are used extensively on Twitter
      ► The number of tweets with emoticons differs among languages
          ▪ Caused by many factors, such as language-specific ways of expressing sentiment or different proportions of “formal” tweets
      Table 2: Number of tweets containing emoticons for each language
  • 13. Overview
      ► 1. Sentiment analysis on social media
      ► 2. Creation of a multilingual evaluation dataset of tweets
      ► 3. A language-independent sentiment labeling heuristic for semi-supervised learning
      ► 4. Experiments on the multilingual dataset
  • 14. 4. Experiments – Sentiment Classification
      ► Data:
          ▪ Training: from ~800M random tweets of mixed languages:
              ▪ Filter for languages: English, German, French, Portuguese
              ▪ Use the emoticon heuristic to select and label training data
          ▪ Evaluation: 12,597 hand-annotated tweets (4 languages)
      ► Setup:
          ▪ Classification: sentiment polarity only
          ▪ Classifier: Naive Bayes
          ▪ Features: 1-grams, and 1,2-grams (1-grams plus 2-grams)
          ▪ Trained 4 classifiers, one each for en, de, fr, pt
          ▪ Trained 1 classifier for the combined en+de+fr+pt data
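The setup above (Naive Bayes over 1-gram or 1,2-gram features) can be illustrated with a small standard-library sketch. Several details are assumptions, since the slides do not specify them: lowercasing and whitespace tokenization, the multinomial model with add-one smoothing, and the names `ngrams` and `NaiveBayes`. The toy training tweets are invented for illustration only.

```python
import math
from collections import Counter, defaultdict


def ngrams(text, max_n):
    """Return all 1..max_n-grams of a lowercased, whitespace-tokenized text."""
    tokens = text.lower().split()
    feats = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            feats.append(" ".join(tokens[i:i + n]))
    return feats


class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing (an assumed variant)."""

    def __init__(self, max_n=1):
        self.max_n = max_n                       # 1 = 1-grams, 2 = 1,2-grams
        self.class_counts = Counter()            # tweets seen per label
        self.feature_counts = defaultdict(Counter)  # n-gram counts per label
        self.vocab = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            self.class_counts[label] += 1
            for feat in ngrams(text, self.max_n):
                self.feature_counts[label][feat] += 1
                self.vocab.add(feat)
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            score = math.log(doc_count / total_docs)  # log prior
            denom = sum(self.feature_counts[label].values()) + len(self.vocab)
            for feat in ngrams(text, self.max_n):
                if feat in self.vocab:  # n-grams never seen in training are skipped
                    score += math.log((self.feature_counts[label][feat] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy data standing in for emoticon-labeled training tweets.
train_texts = ["i love this phone", "what a great day",
               "i hate waiting", "this is awful"]
train_labels = ["positive", "positive", "negative", "negative"]
clf = NaiveBayes(max_n=2).fit(train_texts, train_labels)

if __name__ == "__main__":
    print(clf.predict("great phone"))
    print(clf.predict("i hate this"))
```

Under this sketch, the four per-language classifiers would be four `NaiveBayes` instances, each fitted on one language's tweets, and the combined classifier a single instance fitted on all four languages together.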
  • 15. 4. Experiments: Evaluation Dataset
      ► 2 variations of our evaluation set were used for the experiments:
          ▪ agree-3: tweets for which all 3 annotators agreed on the sentiment
          ▪ agree-2: tweets for which at least 2 annotators agreed on the sentiment
      ► Baseline: always guess “positive” (there are more positive than negative tweets)
      Table 3: Tweet counts for the evaluation datasets
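The two evaluation subsets and the always-“positive” baseline can be sketched as follows. This is an illustration under stated assumptions: the function names and the toy annotations are hypothetical, and dropping tweets whose agreed label is “neutral” is an assumption based on the experiments classifying polarity only.

```python
from collections import Counter


def majority_label(annotations, min_agree):
    """Return the most frequent label if at least min_agree annotators chose it."""
    label, count = Counter(annotations).most_common(1)[0]
    return label if count >= min_agree else None


def build_eval_set(annotated, min_agree):
    """Keep tweets with sufficient agreement on a polarity label.

    min_agree=3 gives the agree-3 set, min_agree=2 the agree-2 set.
    Assumption: tweets agreed to be "neutral" are excluded.
    """
    eval_set = []
    for text, annotations in annotated:
        label = majority_label(annotations, min_agree)
        if label in ("positive", "negative"):
            eval_set.append((text, label))
    return eval_set


def baseline_accuracy(eval_set):
    """Accuracy of a classifier that always guesses "positive"."""
    if not eval_set:
        return 0.0
    return sum(1 for _, label in eval_set if label == "positive") / len(eval_set)


# Toy annotations (3 workers per tweet), invented for illustration.
annotated = [
    ("t1", ["positive", "positive", "positive"]),
    ("t2", ["positive", "positive", "neutral"]),
    ("t3", ["negative", "negative", "negative"]),
    ("t4", ["positive", "neutral", "negative"]),
    ("t5", ["neutral", "neutral", "negative"]),
]
agree3 = build_eval_set(annotated, min_agree=3)
agree2 = build_eval_set(annotated, min_agree=2)

if __name__ == "__main__":
    print(len(agree3), baseline_accuracy(agree3))
    print(len(agree2), baseline_accuracy(agree2))
```

By construction every agree-3 tweet is also in agree-2, so agree-2 is the larger set and includes the tweets on which the annotators partly disagreed.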
  • 16. 4. Results – English Classifier
      ► Best results: English classifier using 1-grams, on the agree-3 set
          ▪ 81.3% accuracy (trained on 500k tweets)
      ► Performance on the agree-2 set is consistently lower than on the agree-3 set
      [Figure: results plot for en]
  • 17. 4. Results – All Languages
      [Figure: results plots for en, de, fr, pt]
  • 18. 4. Evaluation – All Languages Compared
      ► Strong differences between languages
      ► The differences do not correlate with the number of emoticons in each language
      ► The emoticon heuristic is a better fit for some languages, which may depend on how sentiment is typically expressed in that language
      ► Example: “muito engraçado kkkkkkkk” (Portuguese: “very funny hahaha”)
      [Figure: results plots for en, de, fr, pt]
      Table 3: Tweet counts containing emoticons for each language
  • 19. 4. Evaluation – Multi-language Classifier
      ► Tested on the combined 4-language evaluation set
      ► Highest performance: 71.5% accuracy
          ▪ Slightly lower than with 4 individual classifiers (73.9% accuracy)
      ► The usefulness of a combined classifier can outweigh the performance degradation
      [Figure: results plot for en+de+fr+pt]
  • 20. Conclusions
      ► We presented and evaluated a language-independent sentiment classification approach on 4 languages
          ▪ A language-independent classifier can be trained given only raw tweets, using a noisy label heuristic
          ▪ Good performance across languages, although it varies from language to language
          ▪ Classifiers need a very large number of tweets for training
          ▪ Mixed-language classifiers are viable
      ► Future work:
          ▪ Currently we only classify sentiment polarity
          ▪ Classifying subjectivity in tweets is important, but finding a good heuristic to label “neutral” tweets is a challenge
  • 21. Language-Independent Twitter Sentiment Analysis
      Thanks for your attention! Questions?
  • 22. Contact
      Sascha Narr, Dipl.-Inform.
      Competence Center Information Retrieval & Machine Learning
      sascha.narr@dai-labor.de
      Phone +49 (0) 30 / 314 – 74 138
      Fax +49 (0) 30 / 314 – 74 003
      www.dai-labor.de
      DAI-Labor
      Technische Universität Berlin
      Fakultät IV – Elektrotechnik & Informatik
      Sekretariat TEL 14
      Ernst-Reuter-Platz 7
      10587 Berlin