Cross-domain Sentiment Classification: Resource
Selection and Algorithms
Natalia Ponomareva
Statistical Cybermetrics Research Group,
University of Wolverhampton, UK
December 17, 2011
Natalia Ponomareva Cross-domain Sentiment Classification
Outline
1. Background
   Introduction
   State-of-the-art research
2. Preliminary experiments
   In-domain study
   Cross-domain experiments
3. Modeling accuracy loss for cross-domain SC
   Domain similarity
   Domain complexity
   Model construction and validation
4. Graph-based algorithms
   Comparison
   Document similarity
   Strategy for choosing the best parameters
What is Sentiment Classification?
A task within the research field of Sentiment Analysis: classifying documents on the basis of the overall sentiment expressed by their authors.
Different scales can be used:
positive/negative;
positive, negative and neutral;
rating: 1*, 2*, 3*, 4*, 5*;
Example
“The film was fun and I enjoyed it.” ⇒ positive
“The film lasted too long and I got bored.” ⇒ negative
Applications: Business Intelligence
Applications: Event prediction
Applications: Opinion search
Why challenging?
Irony, humour.
Example
If you are reading this because it is your darling fragrance, please
wear it at home exclusively and tape the windows shut.
Generally positive words.
Example
This film should be brilliant. It sounds like a great plot, the actors
are first grade, and the supporting cast is good as well, and
Stallone is attempting to deliver a good performance.
However, it cannot hold up.
Why challenging?
Context dependency.
Example
This is a great camera.
A great amount of money was spent for promoting this camera.
One might think this is a great camera. Well think again,
because.....
Rejection or advice?
Example
Go read the book.
Approaches to Sentiment Classification
Lexical approaches
Supervised machine learning
Semi-supervised and unsupervised approaches
Cross-domain Sentiment Classification (SC)
Lexical approaches
Use of dictionaries of sentiment words with a given semantic
orientation.
Dictionaries are built either manually or (semi-)automatically.
A special scoring function is applied in order to calculate the
final semantic orientation of a text.
Example
lightweight +3, good +4, ridiculous -2
Lightweight, stores a ridiculous amount of books and good battery
life.
SO1 = (3 + 4 − 2) / 3 = 5/3 ≈ 1.67 (average of the scores)
SO2 = sign(a*) · |a*| = 4, where |a*| = max{|3|, |4|, |−2|} (the score with the largest absolute value)
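The two scoring functions in the example can be sketched as follows (a minimal illustration: the toy lexicon mirrors the scores above; SO1 averages the scores found in the text, SO2 keeps the strongest one):

```python
# Toy sentiment lexicon from the example above (not a real dictionary).
LEXICON = {"lightweight": 3, "good": 4, "ridiculous": -2}

def so_average(tokens):
    """SO1: average score of the sentiment words found in the text."""
    scores = [LEXICON[t] for t in tokens if t in LEXICON]
    return sum(scores) / len(scores) if scores else 0.0

def so_strongest(tokens):
    """SO2: the single score with the largest absolute value."""
    scores = [LEXICON[t] for t in tokens if t in LEXICON]
    return max(scores, key=abs) if scores else 0.0

text = "lightweight stores a ridiculous amount of books and good battery life"
tokens = text.lower().split()
print(so_average(tokens))    # (3 + 4 - 2) / 3 = 5/3 ≈ 1.67
print(so_strongest(tokens))  # 4
```
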
Supervised Machine Learning
Learn sentiment phenomena from an annotated corpus.
Different machine learning methods have been tested (NB, SVM, ME); in the majority of cases SVM demonstrates the best performance.
For review data, the ML approach performs better than the lexical one when training and test data belong to the same domain.
But it needs a substantial amount of annotated data.
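A minimal supervised setup of this kind might look as follows (a sketch using scikit-learn with invented toy reviews, not the experimental setup of this work):

```python
# Bag-of-words features + linear SVM: the standard supervised recipe.
# Toy training data for illustration; a real setup needs an annotated corpus.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

train_texts = [
    "highly recommend, excellent and inspiring",
    "my favorite, unique and fun",
    "boring and disappointing, a waste of money",
    "annoying, poorly written, do not buy",
]
train_labels = ["pos", "pos", "neg", "neg"]

clf = make_pipeline(CountVectorizer(binary=True, ngram_range=(1, 2)), LinearSVC())
clf.fit(train_texts, train_labels)

print(clf.predict(["excellent and fun, highly recommend"]))
```
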
Semi-supervised and unsupervised approaches
Require a small amount of annotated data or no data at all.
Various techniques have been explored:
Automatic extraction of sentiment words on the Web using seed words (Turney, 2002);
Spectral clustering and active learning (Dasgupta et al., 2009);
Co-training (Li et al., 2010);
Bootstrapping (Zagibalov, 2010);
Graph-based algorithms (Goldberg et al., 2006).
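Turney's seed-word idea can be illustrated with his SO-PMI score: a word's orientation is its association with a positive seed ("excellent") minus its association with a negative seed ("poor"). The hit counts below are invented toy values (Turney used web-scale search counts):

```python
import math

# Invented toy hit counts; Turney (2002) used web search hit counts.
hits_near = {                       # hits(word NEAR seed)
    ("superb", "excellent"): 40, ("superb", "poor"): 4,
    ("horrible", "excellent"): 3, ("horrible", "poor"): 30,
}
hits_seed = {"excellent": 1000, "poor": 800}   # hits(seed)

def so_pmi(word, pos_seed="excellent", neg_seed="poor"):
    """SO-PMI: PMI(word, pos_seed) - PMI(word, neg_seed)."""
    return math.log2(
        (hits_near[(word, pos_seed)] * hits_seed[neg_seed])
        / (hits_near[(word, neg_seed)] * hits_seed[pos_seed])
    )

print(so_pmi("superb") > 0)    # True: leans positive
print(so_pmi("horrible") < 0)  # True: leans negative
```
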
Cross-domain SC
Main approaches:
Ensemble of classifiers (Read 2005, Aue and Gamon 2005);
Structural Correspondence Learning (Blitzer 2007);
Graph-based algorithms (Wu 2009).
Ensemble of classifiers
Classifiers are learned on data belonging to different source
domains.
Various methods can be used to combine the classifiers:
Majority voting;
Weighted voting, where a development data set is used to learn credibility weights for each classifier;
Learning a meta-classifier on a small amount of target-domain data.
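The first two combination rules can be sketched as follows (a minimal illustration over hypothetical label sets; the weights for weighted voting would be learned on a development set):

```python
from collections import Counter

def majority_vote(predictions):
    """Combine per-classifier labels for one document by majority vote."""
    return Counter(predictions).most_common(1)[0][0]

def weighted_vote(predictions, weights):
    """Weighted vote: each classifier's label counts with its credibility weight."""
    scores = {}
    for label, w in zip(predictions, weights):
        scores[label] = scores.get(label, 0.0) + w
    return max(scores, key=scores.get)

# Three source-domain classifiers vote on one target-domain document.
print(majority_vote(["pos", "pos", "neg"]))                   # pos
print(weighted_vote(["pos", "pos", "neg"], [0.2, 0.3, 0.9]))  # neg
```

A credible third classifier can overturn the majority, which is the point of learning weights on development data.
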
Structural Correspondence Learning
Blitzer et al., 2007:
Introduce pivot features that appear frequently in source and
target domains.
Find projections of source features that co-occur with pivots in
a target domain.
Example
The laptop is great, it is extremely fast.
The book is great, it is very engaging.
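The first SCL step, choosing pivot candidates, can be sketched as selecting features that occur frequently in both domains (a simplified illustration over toy sentences; Blitzer et al. additionally require pivots to be predictive of the sentiment label):

```python
from collections import Counter

def pivot_candidates(source_docs, target_docs, min_count=2):
    """Features frequent in BOTH domains are pivot candidates."""
    src = Counter(w for d in source_docs for w in d.split())
    tgt = Counter(w for d in target_docs for w in d.split())
    return sorted(w for w in src
                  if src[w] >= min_count and tgt[w] >= min_count)

laptops = ["the laptop is great", "great battery , very fast", "great screen"]
books = ["the book is great", "a great plot , very engaging", "great story"]
print(pivot_candidates(laptops, books))  # ['great']
```

As in the example sentences, "great" survives as a pivot, while domain-specific words like "fast" and "engaging" do not.
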
Discussion
Machine learning methods demonstrate very good performance and, given a substantial amount of data, outperform lexical approaches.
On the other hand, there is a plethora of annotated resources on the Web, and the possibility to re-use them would be very beneficial.
Structural Correspondence Learning and similar approaches work well for binary classification but are difficult to apply to multi-class problems.
This motivates us to exploit graph-based cross-domain algorithms.
Data
The data form a corpus of Amazon product reviews on 7 different topics: books (BO), electronics (EL), kitchen & housewares (KI), DVDs (DV), music (MU), health & personal care (HE) and toys & games (TO).
Reviews are rated either as positive or negative.
Data within each domain are balanced: 1000 positive and 1000 negative reviews.
Data statistics

corpus  num words  mean words  vocab size  vocab size (freq >= 3)
BO      364k       181.8       23k         8,256
DV      397k       198.7       24k         8,632
MU      300k       150.1       19k         6,163
EL      236k       117.9       12k         4,465
KI      198k        98.9       11k         4,053
TO      206k       102.9       11k         4,018
HE      188k        93.9       11k         4,022

BO, DV, MU: longer reviews, richer vocabularies.
Feature selection
We compared several characteristics of features:
words vs. stems and lemmas;
unigrams vs. unigrams + bigrams;
binary weights vs. frequency, idf and tfidf;
features filtered by presence of verbs, adjectives, adverbs and
modal verbs vs. unfiltered features.
Feature selection
Filtering of features worsens the accuracy for all domains.
Unigrams + bigrams generally perform significantly better than unigrams alone.
Binary, idf and delta idf weights generally give better results than frequency, tfidf and delta tfidf weights.
Feature selection

domain  feature preference    confidence interval, α = 0.01
BO      word ≈ lemma ≈ stem   inside
DV      word ≈ lemma ≈ stem   inside
MU      lemma > stem > word   boundary
EL      word > lemma ≈ stem   inside
KI      word ≈ lemma > stem   inside
TO      word ≈ stem > lemma   boundary
HE      stem > lemma > word   inside
10 most discriminative positive features

BO                 EL                 KI                 DV
highly recommend   plenty             perfect for        album
concise            plenty of          be perfect         magnificent
for anyone         highly recommend   favorite           superb
i highly           highly             highly recommend   debut
excellent          ps NUM             fiestaware         wolf
my favorite        please with        be easy            join
unique             very happy         easy to            charlie
inspiring          beat               perfect            love it
must read          glad               eliminate          highly recommend
and also           well as            easy               rare
10 most discriminative negative features

BO              EL           KI                DV
poorly          refund       waste of          your money
disappointing   repair       return it         so bad
waste of        do not buy   it break          ridiculous
your money      waste of     refund            waste of
waste           waste        to return         waste
annoying        defective    waste             worst movie
bunch           forum        return            pointless
boring          junk         very disappoint   talk and
bunch of        stop work    worst             pathetic
to finish       worst        I return          horrible
Results
Results for cross-domain SC
[Figure: Accuracy / Accuracy drop]
Motivation
Usually cross-domain algorithms do not work well for very different source and target domains.
Combinations of classifiers from different domains in some cases perform much worse than a single classifier trained on the closest domain (Blitzer et al. 2007).
Finding the closest domain can help to improve the results of cross-domain sentiment classification.
How to compare data sets?
Machine-learning techniques are based on the assumption that training and test data are drawn from the same probability distribution and, therefore, perform much better when training and test data sets are alike.
The task of finding the best training data thus becomes the task of finding data whose feature distribution is similar to that of the test data.
We propose two characteristics to model accuracy loss: domain similarity and domain complexity (or, more precisely, domain complexity variance).
Domain similarity approximates the similarity between the distributions of frequent features.
Domain complexity compares the tails of the distributions.
Domain similarity
We are not interested in all terms, but rather in those bearing sentiment.
Studies on SA suggest that adjectives, verbs and adverbs are the main indicators of sentiment, so we keep only unigrams and bigrams containing these POS as features.
We compare different weighting schemes (frequencies, TF-IDF and IDF) for computing corpus similarity.
Measures of domain similarity
χ2 taken from Corpus Linguistics where it was demonstrated
to have the best correlation with the gold standard.
Kullback-Leibler divergence (DKL ) and its symmetric analogue
Jensen-Shannon divergence (DJS ) were borrowed from
Information Theory.
Jaccard coefficient (Jaccard) and cosine similarity (cosine) are
well-known similarity measures
Natalia Ponomareva Cross-domain Sentiment Classification
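As a rough illustration, four of the measures above (cosine, Jaccard, D_KL, D_JS) can be sketched over word-frequency counts. This is a minimal sketch: the toy "corpora", counts and smoothing constant are invented for illustration, and χ² is omitted.

```python
from collections import Counter
from math import log, sqrt

# Hypothetical toy "corpora": in practice these would be sentiment-bearing
# unigrams/bigrams (adjectives, verbs, adverbs) from two review domains.
source = Counter({"good": 30, "boring": 10, "plot": 25, "read": 15})
target = Counter({"good": 25, "sound": 20, "loud": 10, "play": 25})

vocab = set(source) | set(target)

def to_dist(counts, eps=1e-9):
    """Relative frequencies over the shared vocabulary (smoothed to avoid log 0)."""
    total = sum(counts.values())
    return {w: (counts.get(w, 0) + eps) / (total + eps * len(vocab)) for w in vocab}

p, q = to_dist(source), to_dist(target)

def cosine(p, q):
    dot = sum(p[w] * q[w] for w in vocab)
    return dot / (sqrt(sum(v * v for v in p.values())) *
                  sqrt(sum(v * v for v in q.values())))

def jaccard(a, b):
    return len(set(a) & set(b)) / len(set(a) | set(b))

def kl(p, q):
    return sum(p[w] * log(p[w] / q[w]) for w in vocab)

def js(p, q):
    m = {w: 0.5 * (p[w] + q[w]) for w in vocab}
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

print(f"cosine={cosine(p, q):.3f}  Jaccard={jaccard(source, target):.3f}  "
      f"D_KL={kl(p, q):.3f}  D_JS={js(p, q):.3f}")
```

Note the asymmetry of D_KL, which is why the symmetric D_JS is also considered.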
Correlation for different domain similarity measures

Table: Correlation with accuracy drop

measure   R (freq)   R (filtr., freq)   R (filtr., TF-IDF)   R (filtr., IDF)
cosine    -0.790     -0.840             -0.836               -0.863
Jaccard   -0.869     -0.879             -0.879               -0.879
χ²         0.855      0.869              0.876                0.879
D_KL       0.734      0.827              0.676                0.796
D_JS       0.829      0.833              0.804                0.876
Domain similarity: χ²_inv

The boundary between similar and distinct domains approximately corresponds to χ²_inv = 1.7.
Domain complexity

Similarity between domains is mostly controlled by frequent words, but the shape of the corpus distribution is also influenced by the rare words in its tail.
It was shown that richer domains with more rare words are more complex for SC.
We also observed that the accuracy loss is higher in cross-domain settings when the source domain is more complex than the target one.
Measures of domain complexity

We propose several measures to approximate domain complexity:
percentage of rare words;
word richness (proportion of vocabulary size to corpus size);
relative entropy.

Correlation of domain complexity measures with in-domain accuracy:

% of rare words   word richness   rel. entropy
-0.904            -0.846          0.793
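The three complexity measures can be sketched as follows. This is a minimal sketch under stated assumptions: the corpus is a toy snippet, the rarity threshold (count ≤ 1) and the normalisation of relative entropy by log|V| are illustrative choices, not necessarily those of the thesis.

```python
from collections import Counter
from math import log

# Hypothetical tokenised corpus; the real corpora are Amazon review domains.
tokens = ("the book was good the plot was slow but the characters were "
          "vivid and the ending was truly unexpected").split()

counts = Counter(tokens)
n_tokens, vocab_size = len(tokens), len(counts)

# Percentage of rare words: share of vocabulary occurring at most once here
# (the exact rarity threshold is an assumption).
rare_pct = 100 * sum(1 for c in counts.values() if c == 1) / vocab_size

# Word richness: vocabulary size as a proportion of corpus size.
richness = vocab_size / n_tokens

# Relative entropy: corpus entropy relative to its maximum log|V|
# (one plausible reading of the measure; an assumption).
entropy = -sum((c / n_tokens) * log(c / n_tokens) for c in counts.values())
rel_entropy = entropy / log(vocab_size)

print(f"rare%={rare_pct:.1f}  richness={richness:.3f}  rel_entropy={rel_entropy:.3f}")
```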
Domain complexity

corpus   accuracy   % of rare words   word richness   rel. entropy
BO       0.786      64.77             0.064           9.23
DV       0.796      64.16             0.061           8.02
MU       0.774      67.16             0.063           8.98
EL       0.812      61.71             0.049           12.66
KI       0.829      61.49             0.053           14.44
TO       0.816      63.37             0.053           15.27
HE       0.808      61.83             0.056           15.82
Modeling accuracy loss

To model the performance drop we assume a linear dependency on domain similarity and complexity variance, and propose the following linear regression model:

F(s_ij, ∆c_ij) = β0 + β1 s_ij + β2 ∆c_ij,   (1)

where
s_ij is the domain similarity (or distance) between target domain i and source domain j;
∆c_ij = c_i − c_j is the difference between domain complexities.

The unknown coefficients β0, β1, β2 are the solution of the following system of linear equations:

β0 + β1 s_ij + β2 ∆c_ij = ∆a_ij,   (2)

where ∆a_ij is the accuracy drop when adapting the classifier from domain i to domain j.
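Since system (2) is over-determined (one equation per domain pair, three unknowns), the coefficients are naturally found by least squares. A minimal sketch, with fabricated (s_ij, ∆c_ij, ∆a_ij) triples standing in for the real domain pairs:

```python
import numpy as np

# Fabricated data for shape only; the real triples come from the
# cross-domain pairs studied in the thesis.
s  = np.array([0.8, 1.2, 1.7, 2.3, 3.0, 3.5])     # domain distance s_ij
dc = np.array([0.5, -1.0, 2.0, -0.5, 1.5, -2.0])  # complexity difference Δc_ij
da = np.array([2.1, 4.0, 8.5, 11.0, 14.2, 18.9])  # accuracy drop Δa_ij (%)

# Solve the over-determined system β0 + β1*s + β2*Δc = Δa by least squares.
X = np.column_stack([np.ones_like(s), s, dc])
beta, *_ = np.linalg.lstsq(X, da, rcond=None)
pred = X @ beta
print("beta:", beta.round(3), " max residual:", np.abs(da - pred).max().round(3))
```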
Model evaluation

The evaluation of the constructed regression model includes the following steps:
Global test (F-test) to verify the statistical significance of the regression model with respect to all its predictors.
Test on individual variables (t-test) to reveal regressors that do not make a significant contribution to the model.
Leave-one-out cross-validation on the data set of 42 examples.
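The leave-one-out step can be sketched as follows: refit the model on all pairs but one, predict the held-out drop, and record the error. The data below is synthetic (a linear relation plus noise), standing in for the 42 real domain pairs.

```python
import numpy as np

# Synthetic stand-in for the 42 (s, Δc, Δa) examples.
rng = np.random.default_rng(0)
s, dc = rng.uniform(0.5, 4, 42), rng.uniform(-3, 3, 42)
da = 1.0 + 4.0 * s - 0.5 * dc + rng.normal(0, 0.5, 42)

X = np.column_stack([np.ones(42), s, dc])
errors = []
for i in range(42):
    mask = np.arange(42) != i                     # leave example i out
    beta, *_ = np.linalg.lstsq(X[mask], da[mask], rcond=None)
    errors.append(abs(X[i] @ beta - da[i]))       # error on the held-out pair
errors = np.array(errors)
print(f"mean abs error={errors.mean():.3f}  std={errors.std():.3f}")
```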
Global test

The null hypothesis of the global test states that there is no correlation between the regressors and the response variable.
Our purpose is to demonstrate that this hypothesis must be rejected with a high level of confidence.
In other words, we have to show that the coefficient of determination R² is high enough to consider its value significantly different from zero.

R²      R       F-value   p-value
0.873   0.935   134.60    << 0.0001
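As a sanity check, the reported F-value can be approximately recovered from R² with the standard formula F = (R²/k) / ((1−R²)/(n−k−1)), assuming k = 2 regressors (similarity and complexity difference) and n = 42 domain pairs:

```python
# Recover the global-test F-value from R² (assumptions: k = 2, n = 42).
r2, k, n = 0.873, 2, 42
f_value = (r2 / k) / ((1 - r2) / (n - k - 1))
print(f"F = {f_value:.1f}")   # close to the reported 134.60
```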
Test on individual coefficients

                 β0          β1          β2
value            -8.67       27.71       -0.55
standard error   1.08        1.77        0.11
t-value          -8.00       15.67       -4.86
p-value          << 0.0001   << 0.0001   << 0.0001

All coefficients are statistically significant at a confidence level higher than 99.9%.
Leave-one-out cross-validation results

accuracy drop    standard error   standard deviation   max error, 95%
all data         1.566            1.091                3.404
< 5%             1.465            1.133                3.373
> 5%, < 10%      1.646            1.173                3.622
> 10%            1.556            1.166                3.519

We are able to predict the accuracy loss with a standard error of 1.5% and a maximum error not exceeding 3.4%.
Lower errors are observed for domains which are more similar.
This is a strength of the model, as our main purpose is to identify the closest domains.
Comparing actual and predicted drop

[Figure: actual vs. predicted accuracy drop]
Graph-based algorithms: OPTIM

Goldberg et al., 2006:
The algorithm is based on the assumption that the rating function is smooth with respect to the graph.
The rating difference between the closest nodes is minimised.
The difference between the initial rating and the final value is also minimised.
The result is the solution of an optimisation problem.
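The two minimised quantities can be combined into a single quadratic objective. A minimal sketch of this kind of optimisation (not the exact formulation of Goldberg et al., 2006; the graph, scores and trade-off weight μ are invented):

```python
import numpy as np

# Keep predicted ratings f close to the initial ratings y, while penalising
# rating differences across graph edges:
#   minimise  Σ_i (f_i − y_i)²  +  μ Σ_{ij} w_ij (f_i − f_j)²
# Setting the gradient to zero gives (I + μL) f = y, with L the graph Laplacian.
y = np.array([1.0, 0.8, 0.0, -0.9])           # initial sentiment scores
W = np.array([[0, 1, 1, 0],                   # symmetric similarity graph
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
mu = 0.5                                      # smoothness trade-off (assumed)
L = np.diag(W.sum(axis=1)) - W                # graph Laplacian
f = np.linalg.solve(np.eye(4) + mu * L, y)    # closed-form minimiser
print(f.round(3))
```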
Graph-based algorithms: RANK

Wu et al., 2009:
On each iteration of the algorithm, the sentiment scores of unlabeled documents are updated on the basis of the weighted sum of the sentiment scores of the nearest labeled neighbours and the nearest unlabeled neighbours.
The process stops when convergence is achieved.
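The iteration above can be sketched as follows. This is a simplified RANK-style update, not the exact algorithm of Wu et al., 2009: the similarity matrices, the normalisation, and the γ weighting between labeled and unlabeled neighbours are illustrative assumptions.

```python
import numpy as np

def rank(sim_ul, sim_uu, labels, gamma=0.9, tol=1e-6, max_iter=200):
    """sim_ul: unlabeled x labeled similarities; sim_uu: unlabeled x unlabeled."""
    scores = np.zeros(sim_ul.shape[0])
    for _ in range(max_iter):
        labeled_part = sim_ul @ labels / sim_ul.sum(axis=1)
        unlabeled_part = sim_uu @ scores / np.maximum(sim_uu.sum(axis=1), 1e-12)
        new = gamma * labeled_part + (1 - gamma) * unlabeled_part
        if np.abs(new - scores).max() < tol:      # convergence check
            return new
        scores = new
    return scores

labels = np.array([1.0, 1.0, -1.0])               # source-domain sentiment
sim_ul = np.array([[0.9, 0.7, 0.1], [0.1, 0.2, 0.8]])
sim_uu = np.array([[0.0, 0.3], [0.3, 0.0]])
print(rank(sim_ul, sim_uu, labels).round(3))
```

With γ close to 1 the labeled neighbours dominate; lowering γ shifts weight to the unlabeled part, which matters in the parameter analysis later.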
Comparison

OPTIM algorithm (Goldberg et al., 2006)    RANK algorithm (Wu et al., 2009)

[Figure: graph structure of the two algorithms]
Comparison

The initial setting of RANK does not allow in-domain and out-of-domain neighbours to be different: easy to change!
The condition of smoothness of the sentiment function over the nodes is satisfied by both algorithms.
Unlike RANK, OPTIM requires closeness between the initial sentiment values and the output ones for unlabeled nodes.
The last condition makes the OPTIM solution more stable.
What about the measure of similarity between graph nodes?
Document representation

We consider 2 types of document representation:
feature-based, which involves weighted document features.
Features are filtered by POS: adjectives, verbs and adverbs.
Features are weighted using either TF-IDF or IDF.
sentiment unit-based, which is based on the percentage of positive and negative units in a document.
Units can be either sentences or words.
PSP stands for positive sentence percentage, PWP for positive word percentage.
A lexical approach was exploited to calculate the semantic orientation of sentiment units, with the use of SentiWordNet and the SOCAL dictionary.
The SO of a sentence is averaged over the number of its positive and negative words.
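The unit-based representation can be sketched for word units (PWP). A toy illustration: the positive and negative word sets below are invented stand-ins; the thesis scores words with SentiWordNet or the SOCAL dictionary instead.

```python
# Invented mini-lexicons standing in for SentiWordNet / SOCAL scores.
POSITIVE = {"good", "great", "fun", "enjoyed"}
NEGATIVE = {"bad", "boring", "slow", "bored"}

def unit_representation(text):
    """Return (positive-word %, negative-word %) among sentiment-bearing words."""
    tokens = text.lower().split()
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    total = max(pos + neg, 1)          # avoid division by zero
    return pos / total, neg / total

print(unit_representation("the film was fun and i enjoyed it"))   # (1.0, 0.0)
```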
Results

Correlation between document ratings and document features/units:

domain   idf     tfidf   PSP SWN   PSP SOCAL   PWP SWN   PWP SOCAL
BO       0.387   0.377   0.034     0.206       0.067     0.252
DV       0.376   0.368   0.064     0.251       0.098     0.316
EL       0.433   0.389   0.048     0.182       0.043     0.196
KI       0.444   0.416   0.068     0.238       0.076     0.230

Feature-based document representation with idf weights correlates better with document rating than any other representation.
SentiWordNet does not provide good results for this task, probably due to the high level of noise that comes from its automatic construction.
Document similarity is calculated using the cosine measure.
Best accuracy improvement achieved by the algorithms

We tested the performance of each algorithm for several values of their parameters.
The best accuracy improvement given by each algorithm:

[Figure: OPTIM vs. RANK best accuracy improvement]
General observations

We selected and examined only those results that fell inside the confidence interval of the best accuracy for α = 0.01.
RANK: tends to depend strongly on the values of its parameters, and the most unstable results are obtained when the source and target domains are different.
RANK: a great improvement is achieved when adapting the classifier from more complex to simpler domains.
OPTIM: stable, but the results are modest.
Analysis of RANK behaviour

Within clusters of similar domains, the majority of good answers have γ ≥ 0.9.
This demonstrates that the information provided by labeled data is more valuable.
For non-similar domains, when the source domain is more complex than the target one, the best results are achieved with a smaller γ, close to 0.5.
This means that the algorithm benefits greatly from unlabeled data.
Analysis of RANK behaviour

For non-similar domains, when the target domain is more complex than the source one, γ tends to increase to 0.7.
This gives preference to the simpler labeled data.
The numbers of labeled and unlabeled neighbours are not equal: there is a clear tendency to prefer results with a smaller number of unlabeled and a higher number of labeled examples.
A proportion of 50 against 150 seems to be ideal, covering most of the cases.
RANK vs. best RANK

[Figure]

OPTIM vs. best RANK

[Figure]
Conclusions and future work

Our strategy seems reasonable: the RANK performance is still higher than the OPTIM performance.
In the future we aim to apply the gradient descent method to refine the parameter values.
Thank you for your attention!