Language-Independent Twitter Sentiment Analysis

Sascha Narr, Michael Hülfenhaus, Sahin Albayrak

Sascha Narr
Competence Center Information Retrieval & Machine Learning

KDML 2012, LWA, Dortmund, Germany

Overview

►1. Sentiment analysis on social media
►2. Creation of a multilingual evaluation dataset of tweets
►3. A language-independent sentiment labeling heuristic for semi-supervised learning
►4. Experiments on the multilingual dataset

18. September 2012 Language-Independent Twitter Sentiment Analysis 2


1. Sentiment Analysis on Social Media

► Why Sentiment Analysis?
 People’s opinions and sentiments about products and events
in large numbers are invaluable:
 Market research, product feedback and more
 Sentiment analysis makes it possible to collect such data automatically

► Why Twitter?
 400 Million tweets posted each day[1]
 Shorter text lengths encourage people to
“just write” what they think
 Tweets are often informal and contain lots of opinions

[1]: http://news.cnet.com/8301-1023_3-57448388-93/twitter-hits-400-million-tweets-per-day-mostly-mobile/


1. Methods for Sentiment Classification

► Sentiment classification goals:
 Subjectivity: “Does the tweet contain an opinion?”
 Polarity: “Is the expressed opinion positive or negative?”
► Classifiers used:
 Naive Bayes, Maximum Entropy, Support Vector Machines
► Features used:
 n-grams, WordNet semantics, part-of-speech information

► Tweet texts have unique properties:
 Informal, contain slang, emoticons, misspellings


1. Multilingual Sentiment Analysis

► Fewer than 40% of tweets are in English [1]
► Natural language processing methods are often designed specifically for one language

► Increase coverage of sentiment analysis by using a language-independent approach:
 No extra effort for additional languages
 Is the approach really effective for all languages?

[1]: http://semiocast.com/publications/2011_11_24_Arabic_highest_growth_on_Twitter



2. Creation of a Multilingual Evaluation Dataset

► We created a hand-annotated sentiment evaluation dataset of over 12,000 tweets
 4 languages: English, German, French, Portuguese
► Used the Amazon Mechanical Turk platform for annotation
► Each tweet was annotated by 3 different workers:
 Labels: “positive”, “neutral”, “negative”
 Added validation tweets to try to ensure the quality of the annotations


2. Our Multilingual Evaluation Dataset

► Observed low inter-annotator agreement in our dataset
 Sentiment classification is a hard task, even for humans
 Tweets that humans disagree on are also harder for a classifier to label
► The dataset is publicly available for research purposes

Table 1: Tweet counts for the complete annotated dataset
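
The slides report low inter-annotator agreement but do not name the measure used. One common choice when every item is labeled by the same number of annotators is Fleiss' kappa. The sketch below is an illustration under that assumption, not the authors' computation; the function name and the toy ratings are hypothetical.

```python
from collections import Counter

LABELS = ("positive", "neutral", "negative")

def fleiss_kappa(ratings, labels=LABELS):
    """Fleiss' kappa for a list of items, each a list of annotator labels.

    Assumes every item has the same number of annotators (3 in the slides)
    and that not all labels fall into a single category (otherwise the
    denominator below would be zero).
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    totals = Counter()       # how often each label was assigned overall
    agreement_sum = 0.0      # sum of per-item agreement values P_i
    for item in ratings:
        counts = Counter(item)
        totals.update(counts)
        pairs = sum(c * c for c in counts.values()) - n_raters
        agreement_sum += pairs / (n_raters * (n_raters - 1))
    p_bar = agreement_sum / n_items                       # observed agreement
    p_e = sum((totals[label] / (n_items * n_raters)) ** 2
              for label in labels)                        # chance agreement
    return (p_bar - p_e) / (1 - p_e)

# Toy example (made-up ratings, not data from the dataset)
toy = [
    ["positive", "positive", "positive"],
    ["negative", "negative", "neutral"],
    ["neutral", "positive", "negative"],
]
print(round(fleiss_kappa(toy), 3))
```

A value of 1 means perfect agreement, and values near 0 mean agreement no better than chance.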



3. A Language-Independent Heuristic

► To train a sentiment classifier, a large amount of labeled
training data is needed
 Can be obtained without human effort using a previously
proposed heuristic
► The heuristic uses emoticons in tweets as noisy labels

► Heuristic: If a tweet contains only positive emoticons, label its
whole text as positive (and vice versa for negative).

► Examples of emoticons we used:
 Positive: :) :-) =) ;) :] :D ^-^ ^_^
 Negative: :( :-( :(( -.- >:-( D: :/
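
The labeling rule above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes emoticons appear as whitespace-separated tokens (which also avoids matching ":/" inside URLs), and the function name `label_tweet` is hypothetical. The emoticon lists are the examples from the slide; the lists actually used may be longer.

```python
# Example emoticons from the slide; circumflex emoticons are written with ASCII "^".
POSITIVE = {":)", ":-)", "=)", ";)", ":]", ":D", "^-^", "^_^"}
NEGATIVE = {":(", ":-(", ":((", "-.-", ">:-(", "D:", ":/"}

def label_tweet(text):
    """Return a noisy polarity label for a tweet, or None if the rule does not apply.

    Assumption: emoticons are separated from other words by whitespace.
    """
    tokens = text.split()
    has_pos = any(tok in POSITIVE for tok in tokens)
    has_neg = any(tok in NEGATIVE for tok in tokens)
    if has_pos and not has_neg:
        return "positive"
    if has_neg and not has_pos:
        return "negative"
    return None  # no emoticon, or mixed emoticons: not used for training

for tweet in ["finally weekend :)", "my train is late again :(", "ok :) or not :("]:
    print(tweet, "->", label_tweet(tweet))
```

Tweets for which the function returns None are simply left out of the training data.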


3. Heuristic for Semi-Supervised Learning

► The heuristic can be applied to almost any language, since emoticons are used extensively on Twitter
► The number of tweets with emoticons differs among languages
 Caused by many factors, such as language-specific ways of expressing sentiment or differing shares of “formal” tweets

Table 2: Number of tweets containing emoticons for each language



4. Experiments – Sentiment Classification

► Data:
 Training: from ~800M random tweets of mixed languages:
   Filter for languages: English, German, French, Portuguese
   Use emoticon heuristic to select and label training data
 Evaluation: 12,597 hand-annotated tweets (4 languages)

► Setup:
 Classification: Sentiment polarity only
 Classifier: Naive Bayes
 Features: 1-grams only, or 1-grams and 2-grams combined
 Trained 4 single-language classifiers (en, de, fr, pt) and 1 classifier for the combined data (en+de+fr+pt)
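
The setup above (n-gram features fed to a Naive Bayes classifier) can be sketched as follows. This is a minimal multinomial Naive Bayes, assuming whitespace tokenization, lowercasing, and add-one smoothing; the slides do not specify these details, and the class and function names are hypothetical. The training sentences are toy examples, not data from the experiments.

```python
import math
from collections import Counter, defaultdict

def ngram_features(text, max_n=1):
    """Return all 1..max_n-grams of a lowercased, whitespace-tokenized tweet."""
    tokens = text.lower().split()
    feats = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            feats.append(" ".join(tokens[i:i + n]))
    return feats

class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing (an assumed variant)."""

    def __init__(self, max_n=1):
        self.max_n = max_n                         # 1 = 1-grams, 2 = 1-grams + 2-grams
        self.class_counts = Counter()              # training tweets per label
        self.feature_counts = defaultdict(Counter) # feature counts per label
        self.vocab = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            self.class_counts[label] += 1
            for feat in ngram_features(text, self.max_n):
                self.feature_counts[label][feat] += 1
                self.vocab.add(feat)
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, float("-inf")
        for label, n_docs in self.class_counts.items():
            score = math.log(n_docs / total_docs)  # log prior
            total = sum(self.feature_counts[label].values())
            for feat in ngram_features(text, self.max_n):
                if feat not in self.vocab:
                    continue                       # skip features unseen in training
                count = self.feature_counts[label][feat]
                score += math.log((count + 1) / (total + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Toy usage with made-up training tweets
texts = ["i love this", "so great and fun", "i hate this", "so bad and sad"]
labels = ["positive", "positive", "negative", "negative"]
clf = NaiveBayes(max_n=2).fit(texts, labels)
print(clf.predict("love it"))
```

Because the features are plain token n-grams, the same code can be trained on tweets of one language or on the mixed en+de+fr+pt data without changes.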


4. Experiments: Evaluation Dataset

► 2 variations of our evaluation set for the experiments:
 agree-3: Tweets for which all 3 annotators chose the same sentiment
 agree-2: Tweets for which at least 2 annotators chose the same sentiment
► Baseline: always guess “positive” (there are more positive than negative tweets)

Table 3: Tweet counts for the evaluation datasets
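
The two evaluation variants and the baseline can be sketched as below. This is an illustration under assumptions: the data layout (a tweet text paired with a list of 3 labels) and the function names are hypothetical, and neutral tweets are dropped on the assumption that only polarity is evaluated. The annotations are toy data.

```python
from collections import Counter

def agreed_label(annotations, min_agree):
    """Return the most frequent label if at least min_agree annotators chose it."""
    label, count = Counter(annotations).most_common(1)[0]
    return label if count >= min_agree else None

def build_eval_set(annotated, min_agree):
    """Keep positive/negative tweets whose label meets the agreement threshold.

    min_agree=3 corresponds to agree-3, min_agree=2 to agree-2.
    """
    subset = []
    for text, annotations in annotated:
        label = agreed_label(annotations, min_agree)
        if label in ("positive", "negative"):
            subset.append((text, label))
    return subset

def baseline_accuracy(eval_set):
    """Accuracy of always guessing "positive"."""
    hits = sum(1 for _, label in eval_set if label == "positive")
    return hits / len(eval_set)

# Toy annotations (made up for illustration)
annotated = [
    ("tweet a", ["positive", "positive", "positive"]),
    ("tweet b", ["positive", "positive", "neutral"]),
    ("tweet c", ["negative", "negative", "negative"]),
    ("tweet d", ["positive", "neutral", "negative"]),
    ("tweet e", ["neutral", "neutral", "neutral"]),
]
agree3 = build_eval_set(annotated, min_agree=3)
agree2 = build_eval_set(annotated, min_agree=2)
print(len(agree3), len(agree2))
print(baseline_accuracy(agree3), baseline_accuracy(agree2))
```

By construction, every tweet in agree-3 is also in agree-2, so agree-2 is the larger and noisier of the two sets.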


4. Results – English Classifier

► Best results: English classifier using 1-grams, on the agree-3 set
 81.3% accuracy (trained on 500k tweets)
► Performance on the agree-2 set is consistently lower than on agree-3

[Figure: results for the en classifier]


4. Results – All Languages
[Figure: results for the en, de, fr and pt classifiers, one panel per language]


4. Evaluation – All Languages Compared
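► Strong differences between languages
► Differences do not correlate with the number of emoticons in each language

► The emoticon heuristic is a better fit for some languages; this may depend on the style of expressing sentiment in each language
 Example: “muito engraçado kkkkkkkk” (Portuguese: “very funny”, with “kkkkkkkk” as written laughter instead of an emoticon)

[Figure: results for the en, de, fr and pt classifiers, one panel per language]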

Table 2: Number of tweets containing emoticons for each language


4. Evaluation – Multi-language Classifier
► Tested on the combined 4-language evaluation set
► Highest performance: 71.5% accuracy
 Slightly lower than using 4 individual classifiers (73.9% accuracy)
► The usefulness of a combined classifier can outweigh the performance degradation

[Figure: results for the combined en+de+fr+pt classifier]


Conclusions

► We presented and evaluated a language-independent
sentiment classification approach on 4 languages
 A language-independent classifier can be trained given only
raw tweets, using a noisy label heuristic
 Good performance across languages, though results vary by language
 Classifiers need a very large number of tweets for training
 Mixed-language classifiers are viable

► Future work:
 Currently we only classify sentiment polarity
 Classifying subjectivity in tweets is important, but finding a
good heuristic to label “neutral” tweets is a challenge



Thanks for your attention!

Questions?


Contact

Sascha Narr
Dipl.-Inform.
Competence Center Information Retrieval & Machine Learning

sascha.narr@dai-labor.de
Phone +49 (0) 30 / 314 – 74 138
Fax +49 (0) 30 / 314 – 74 003

DAI-Labor
Technische Universität Berlin
Fakultät IV – Elektrotechnik & Informatik
Sekretariat TEL 14
Ernst Reuter Platz 7
10587 Berlin

www.dai-labor.de

