Opinion mining for social media and news items in Romanian

Authors
University Politehnica of Bucharest
Claudia Cârdei
Filip Manișor
Traian Rebedea traian.rebedea@cs.pub.ro

Overview
• Introduction
• Previous Work
– English
– Romanian
• Proposed Solutions
• Opinionated Corpus
• Results and Comparisons
• Conclusions
22.09.13 Bachelor's Thesis Session - July 2012

Introduction
• Sentiment analysis and opinion mining research
has mainly concentrated on English and other
important languages (Spanish, Chinese, etc.)
– Various commercial and open-source solutions exist
mainly for English
– Corpora of opinionated texts and databases of
affective words (general or domain specific) also exist
for these languages
• Objective: develop an opinion mining solution for
Romanian texts gathered from a wide range of
online sources (mostly social media and news
items)
22.09.13, ICSCS 2013, K-TEAMS 2013 Workshop

Introduction
• A popular research area in recent years
• Sentiment, subjectivity, opinion, publicity
– Related, but somewhat different
• Sentiment or subjectivity in a text:
– Positive, negative or neutral
– Subjective or objective
• Opinionated text
– Opinion author
– Opinion target (subject)
– Opinion (affective) words
– Opinion polarity
E.g. President Obama declared that the US immigration system is broken.
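
As a rough illustration, the four components listed above can be held in a small record. The sketch below fills one in for the example sentence. The class name `Opinion`, its field names, and the reading of "broken" as a negative opinion word are assumptions made for illustration; they are not annotations taken from the corpus.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Opinion:
    author: str       # who holds the opinion
    target: str       # entity or topic the opinion is about
    words: List[str]  # affective words that carry the opinion
    polarity: str     # "positive", "negative" or "neutral"


# Hypothetical annotation of the example sentence on this slide.
example = Opinion(
    author="President Obama",
    target="US immigration system",
    words=["broken"],
    polarity="negative",  # assumed reading of "broken"
)
```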

Previous Work - English
• Many studies and corpora across different domains
• The movie reviews dataset – very popular
• Initial results using BoW, punctuation, etc.
– Accuracy ≈ 80%
• Improvement: find relations/dependencies
between opinion targets and affective words
• Mining frequent dependency subtrees for
positive and negative reviews and using an SVM
with these subtrees as features

Previous Work - Romanian
• Use machine translation to generate English
texts, then apply opinion mining
• Translate affective word databases into
Romanian (e.g. WordNet Affect)
• Develop new affective word lists
• Training and evaluation on specific corpora in
Romanian
• Problems with NER, dependency parsing,
affective words scores

Proposed Solutions
• Supervised solution trained for several
different opinion subjects (entities)
• Three approaches
– Bag of words
– Affective words and dependency parsing
– N-grams probabilities

Bag of Words
• Bag of words model:
– Tokenization, diacritics restoration, lemmatization
– Distinct lemmas selected as features
– Improvements: POS filter, word n-grams filter
– Used both binary features and TF-IDF
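
The two feature types mentioned above can be sketched in plain Python. This is a minimal illustration, not the authors' pipeline: `tokenize` is a stand-in for the tokenization, diacritics restoration and lemmatization steps, the TF-IDF variant (relative term frequency times log inverse document frequency) is one common choice among several, and the three toy sentences are invented.

```python
import math
from collections import Counter


def tokenize(text):
    # Placeholder for tokenization + diacritics restoration + lemmatization.
    return text.lower().split()


def build_vocabulary(docs):
    # Distinct tokens (lemmas in the real pipeline) become the features.
    return sorted({tok for doc in docs for tok in tokenize(doc)})


def binary_features(doc, vocab):
    # 1 if the term occurs in the document, 0 otherwise.
    present = set(tokenize(doc))
    return [1 if term in present else 0 for term in vocab]


def tfidf_features(doc, vocab, docs):
    # Relative term frequency multiplied by log(N / document frequency).
    counts = Counter(tokenize(doc))
    total = sum(counts.values()) or 1
    n_docs = len(docs)
    vector = []
    for term in vocab:
        df = sum(1 for d in docs if term in set(tokenize(d)))
        idf = math.log(n_docs / df) if df else 0.0
        vector.append(counts[term] / total * idf)
    return vector


# Invented toy documents (diacritics omitted for simplicity).
docs = ["banca este buna", "banca este slaba", "berea este buna"]
vocab = build_vocabulary(docs)
binary = binary_features(docs[0], vocab)
tfidf = tfidf_features(docs[0], vocab, docs)
```

A term that occurs in every document ("este" here) gets a TF-IDF weight of zero, while the binary representation still marks it as present. A POS filter or n-gram filter would be applied inside `tokenize` or when building the vocabulary.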

Affective Scores & Dependency Parsing
• Compute affective word scores in Romanian:
– Translate all the adjectives and adverbs from the English WordNet
into Romanian using Google Translate
– Uses the probability of each translation pair
• Several affective score databases have been translated:
SentiWordNet, SenticNet 2 and ANEW
• Used the UAIC Romanian FDG parser to identify dependencies
between the subject entity and adjectives or adverbs
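
The slide does not say how the translation probabilities enter the score, so the sketch below assumes one plausible reading: a Romanian word receives the probability-weighted average of the scores of the English words that translate to it. The entity polarity is then taken as the sign of the summed scores of the adjectives and adverbs attached to the entity (a comparison with 0). The function names, the example words and all numeric values are hypothetical.

```python
def transfer_scores(english_scores, translations):
    """Project English affective scores onto Romanian words.

    translations maps an English word to a list of
    (romanian_word, translation_probability) pairs.
    """
    weighted_sum = {}
    weight_total = {}
    for en_word, pairs in translations.items():
        if en_word not in english_scores:
            continue  # no affective score available for this word
        for ro_word, prob in pairs:
            weighted_sum[ro_word] = (
                weighted_sum.get(ro_word, 0.0) + prob * english_scores[en_word]
            )
            weight_total[ro_word] = weight_total.get(ro_word, 0.0) + prob
    # Assumed aggregation: probability-weighted average.
    return {
        w: weighted_sum[w] / weight_total[w]
        for w in weighted_sum
        if weight_total[w] > 0
    }


def entity_polarity(dependent_words, ro_scores):
    # dependent_words: adjectives/adverbs a parser linked to the entity.
    total = sum(ro_scores.get(w, 0.0) for w in dependent_words)
    if total > 0:
        return "positive"
    if total < 0:
        return "negative"
    return "neutral"


# Invented scores and translation probabilities, for illustration only.
english_scores = {"good": 0.75, "fine": 0.5, "bad": -0.625}
translations = {
    "good": [("bun", 0.8)],
    "fine": [("bun", 0.2)],
    "bad": [("rău", 0.9)],
}
ro_scores = transfer_scores(english_scores, translations)
```

With this design, words that the parser fails to attach to the entity contribute nothing, so parsing errors translate directly into missed or wrong polarity decisions.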

N-grams Probabilities
• Compute the conditional probability for each
n-gram in the corpus given that the document
is either positive or negative
• Then use the following score for each n-gram
(feature f): [formula not preserved in the slide text]
• The score of a new text is computed by
summing the scores for each of the n-grams
existing in that text
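
The steps above can be sketched as follows. The per-n-gram score used on the original slide is not available here, so this sketch substitutes an assumed one: the log ratio of the smoothed conditional probabilities P(f | positive) and P(f | negative), estimated as the fraction of documents in each class that contain the n-gram. The function names, the add-one smoothing and the toy documents are all assumptions made for illustration.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def train(pos_docs, neg_docs, n=2):
    # Count, per class, how many documents contain each n-gram.
    pos_counts = Counter(g for d in pos_docs for g in set(ngrams(d.split(), n)))
    neg_counts = Counter(g for d in neg_docs for g in set(ngrams(d.split(), n)))
    scores = {}
    for g in set(pos_counts) | set(neg_counts):
        # Add-one smoothing keeps both probabilities strictly positive.
        p_pos = (pos_counts[g] + 1) / (len(pos_docs) + 2)
        p_neg = (neg_counts[g] + 1) / (len(neg_docs) + 2)
        # Assumed score: log ratio, > 0 leans positive, < 0 leans negative.
        scores[g] = math.log(p_pos / p_neg)
    return scores


def score_text(text, scores, n=2):
    # Sum the scores of the distinct n-grams found in the text;
    # n-grams never seen in training contribute 0.
    return sum(scores.get(g, 0.0) for g in set(ngrams(text.split(), n)))


# Invented toy corpus (diacritics omitted for simplicity).
pos_docs = ["servicii foarte bune", "foarte bune oferte"]
neg_docs = ["servicii foarte slabe", "oferte foarte slabe"]
scores = train(pos_docs, neg_docs, n=2)
```

A text would then be labeled positive when `score_text` returns a value above 0 and negative when it is below 0. Whether repeated n-grams should count once or several times is not specified in the slides; this sketch counts each distinct n-gram once.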

Opinionated Corpus
• Corpus manually annotated by analysts for their
customers (created by Treeworks for their
product ZeList, www.zelist.ro)
• ZeList indexes most of the texts published in
Romanian on the most popular social networks, blogs,
online forums, news websites, etc.
• Used data for seven different entities (companies
or brands), ranging from banks and beer brands
to web publishers and media corporations
• The names of the entities have been anonymized

Opinionated Corpus
• Problems:
– These texts are very noisy, very heterogeneous,
from a wide range of sources and with different
writing styles (e.g. Twitter vs. news items)
– Some of them may also express positive or
negative publicity rather than opinions

Opinionated Corpus
• The table below describes the first version of the corpus
• Data collection periods ranged from a couple of months to a couple of
years, depending on the entity
• The second version contained a larger export of data for each
entity
Entity  Total items  Neutral  Opinionated  Positive  Negative
Ent1    6055         5853     202          29        173
Ent2    2240         1961     279          222       57
Ent3    343          260      83           64        19
Ent4    1168         876      292          120       172
Ent5    539          520      19           17        2
Ent6    1025         570      455          330       125
Ent7    3787         3016     771          593       178

Results - Outline
• Results obtained for the first version of the corpus, for all
entities
• Positive-negative accuracy should be the more relevant measure
• Good results for entities with more data, poor results for the
ones with a small number of opinionated texts
Entity  Total items  Neutral  Opinionated  Accuracy (opinion-neutral)  Accuracy (positive-negative)
Ent1    6055         5853     202          97.01%                      92.07%
Ent2    2240         1961     279          91.79%                      87.81%
Ent3    343          260      83           84.84%                      89.15%
Ent4    1168         876      292          86.22%                      82.19%
Ent5    539          520      19           97.40%                      57.89%
Ent6    1025         570      455          76.20%                      84.17%
Ent7    3787         3016     771          81.75%                      83.65%
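
Because the classes are skewed for most entities, it can help to read these accuracies against a majority-class baseline (always predicting the larger class). The sketch below computes that baseline from the counts in the corpus and results tables above. The baseline comparison is an added reading aid, not part of the original evaluation, and the variable and function names are hypothetical.

```python
# (entity, neutral, opinionated, positive, negative,
#  reported opinion-neutral accuracy, reported positive-negative accuracy)
rows = [
    ("Ent1", 5853, 202, 29, 173, 97.01, 92.07),
    ("Ent2", 1961, 279, 222, 57, 91.79, 87.81),
    ("Ent3", 260, 83, 64, 19, 84.84, 89.15),
    ("Ent4", 876, 292, 120, 172, 86.22, 82.19),
    ("Ent5", 520, 19, 17, 2, 97.40, 57.89),
    ("Ent6", 570, 455, 330, 125, 76.20, 84.17),
    ("Ent7", 3016, 771, 593, 178, 81.75, 83.65),
]


def majority_baseline(count_a, count_b):
    # Accuracy (in %) of always predicting the larger of two classes.
    return 100.0 * max(count_a, count_b) / (count_a + count_b)


# Gain over the baseline, in percentage points, for both tasks.
gains = {}
for entity, neutral, opinionated, pos, neg, acc_on, acc_pn in rows:
    gains[entity] = (
        acc_on - majority_baseline(neutral, opinionated),
        acc_pn - majority_baseline(pos, neg),
    )

for entity, (gain_on, gain_pn) in gains.items():
    print(f"{entity}: opinion-neutral {gain_on:+.2f} pp, "
          f"positive-negative {gain_pn:+.2f} pp")
```

Under these counts, the opinion-neutral baseline for Ent1 is already about 96.66% (5853 of 6055 items are neutral), so the reported 97.01% is only a small gain. For Ent5, which has just 19 opinionated texts, the reported positive-negative accuracy of 57.89% falls below the 17-of-19 baseline of about 89.47%.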

Results - Comparison
• Comparison of the solutions presented above, using the
second (larger) version of the corpus
• Run for a single entity, on a balanced dataset of 700
positive and 700 negative opinionated texts
Method                                            Accuracy
BoW + POS filter                                  81.31%
BoW, adjectives only                              70.89%
BoW, adjectives and adverbs only                  76.60%
Frequent bigrams                                  80.88%
Frequent trigrams                                 76.60%
Affective scores + dependency parsing             52.18%
Affective scores (decision by comparison with 0)  55.35%
Trigrams probabilities                            88.44%
Bigrams probabilities                             72.54%

Conclusions
• Several alternatives for determining the opinion
polarity have been evaluated on a corpus manually
annotated for different Romanian entities
• Best results obtained so far: BoW plus a POS
filter or a frequent bigrams approach + SVM classifier
• The Romanian FDG parser does not provide good
accuracy for the dependency parsing task, especially
for texts from social media
– Texts are somewhat freely written, with little regard for
usual form or structure
– Improvements to this method and to the affective words
database are still possible

Thank you!
• Questions?
• Discussions
22.09.13 CSCS 2013 – Bucharest, Romania
