2. Outline
Introduction
Background
Motivation and purpose
System Framework and Methodology
Seed word selection
Expansion of emotion words and intensity
Sentiment classification of stock news
Experimental Results
Experiment setup
Seed word generation
Emotion word expansion
Comparative results
Discussion
Conclusions
3. Background
Stock trend prediction using technical indices, e.g., the
moving average (MA) and the relative strength index (RSI),
has been extensively investigated in the stock market.
Textual data such as stock news articles are also an
important factor affecting the stock price.
Due to the huge number of articles and reports, finding
such useful information in daily news is not an easy task
for investors.
4. Motivation and purpose
Sentiment classification of stock news can help
investors identify sentimental tendency in stock news
and facilitate their investment decision making.
This study focuses on mining useful features to classify
the sentiment of stock news.
5. Motivation and purpose
One of the major characteristics of stock news articles is
the emotion words contained within them, and these
words may have different intensities.
ex:
Positive emotion words: soar (strong) vs. rise (normal)
Negative emotion words: collapse (strong) vs. fall (normal)
6. Motivation and purpose
To discover emotion words and their intensity,
traditional approaches can be divided into two areas
of research:
knowledge-based methods
• Rely on expert knowledge to create affective lexicons, or
use existing lexicons, to obtain emotion words and their
intensity.
corpus-based methods
• Automatically acquire emotion words with intensity from large
corpora based on a set of seed words.
7. Outline
Introduction
Background
Motivation and purpose
System Framework and Methodology
Seed word selection
Expansion of emotion words and intensity
Sentiment classification of stock news
Experimental Results
Experiment setup
Seed word generation
Emotion word expansion
Comparative results
Discussion
Conclusions
9. Seed word generation
Information Gain (IG)
Gain(word) = I(p, n) - E(word),

I(p, n) = -\frac{p}{p+n}\log_2\frac{p}{p+n} - \frac{n}{p+n}\log_2\frac{n}{p+n},

I(p, n) = 0, if either p = 0 or n = 0,

E(word) = \sum_i \frac{p_i + n_i}{p + n} I(p_i, n_i),

where p and n denote the numbers of positive and negative news
articles, and p_i and n_i are the respective counts within the i-th
partition induced by the word.
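The information-gain computation above can be sketched as follows; this is a minimal illustration, not the authors' implementation, and the function names are my own:

```python
import math

def entropy(p, n):
    """I(p, n): class entropy of p positive and n negative articles.
    Defined as 0 when either count is zero."""
    if p == 0 or n == 0:
        return 0.0
    total = p + n
    return (-(p / total) * math.log2(p / total)
            - (n / total) * math.log2(n / total))

def information_gain(p, n, partitions):
    """Gain(word) = I(p, n) - E(word), where `partitions` lists the
    (p_i, n_i) counts of articles in each partition induced by the
    word's value (e.g. word present vs. absent)."""
    total = p + n
    expected = sum(((pi + ni) / total) * entropy(pi, ni)
                   for pi, ni in partitions)
    return entropy(p, n) - expected

# Example: a word occurring in 40 positive / 5 negative articles,
# and absent from the remaining 10 / 45.
gain = information_gain(50, 50, [(40, 5), (10, 45)])
```

Words with high gain separate positive from negative articles well, which is why they are good seed-word candidates.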
10. Emotional word expansion
The following methods are used to acquire more emotion
words and their intensity from the unlabeled corpus based
on the given seed words.
Pointwise Mutual Information (PMI)
• Measures the co-occurrence strength between two words.
Contextual Entropy Model (CE)
• Considers both co-occurrence strength and context distribution
between the candidate words and the seed words.
11. Pointwise mutual information
The following are equations of PMI expansion method:
PMI(c_i, seed_j) = \log_2 \frac{C(c_i, seed_j) \times N}{C(c_i)\, C(seed_j)},

PMI(c_i, P_{seed}) = \frac{1}{|P_{seed}|} \sum_{seed_j \in P_{seed}} PMI(c_i, seed_j),

PMI(c_i, N_{seed}) = \frac{1}{|N_{seed}|} \sum_{seed_j \in N_{seed}} PMI(c_i, seed_j),

where C(c_i, seed_j) is the co-occurrence count of candidate word
c_i and seed word seed_j, C(·) is a word's frequency, N is the corpus
size, and P_{seed} and N_{seed} are the positive and negative seed
word sets.
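A minimal sketch of these PMI equations, assuming simple count dictionaries (the data-structure names are my own, not from the paper):

```python
import math

def pmi(count_pair, count_c, count_seed, n_total):
    """PMI(c_i, seed_j) = log2(C(c_i, seed_j) * N / (C(c_i) * C(seed_j)))."""
    return math.log2(count_pair * n_total / (count_c * count_seed))

def avg_pmi(candidate, seed_set, cooc, counts, n_total):
    """Average PMI between a candidate word and a seed set
    (P_seed or N_seed), as in the averaging equations above."""
    return sum(
        pmi(cooc[(candidate, s)], counts[candidate], counts[s], n_total)
        for s in seed_set
    ) / len(seed_set)

# Toy counts: candidate "c" against two positive seeds.
counts = {"c": 100, "s1": 100, "s2": 50}
cooc = {("c", "s1"): 10, ("c", "s2"): 10}
score = avg_pmi("c", ["s1", "s2"], cooc, counts, 1000)
```

A higher averaged score means the candidate behaves more like that seed set in the corpus.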
12. Pointwise mutual information
The following are equations of PMI expansion method:
Intensity(c_i) =
\begin{cases}
\dfrac{1}{1 + \exp(-PMI(c_i, P_{seed}))}, & \text{if the sentiment class of } c_i \text{ is positive,} \\[6pt]
\dfrac{1}{1 + \exp(-PMI(c_i, N_{seed}))}, & \text{if the sentiment class of } c_i \text{ is negative.}
\end{cases}
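The intensity mapping is the logistic function applied to the averaged association score; a one-line sketch:

```python
import math

def intensity(score):
    """Map an averaged association score (PMI or CE against the
    matching seed set) into (0, 1) via 1 / (1 + exp(-score)).
    Stronger association yields intensity closer to 1."""
    return 1.0 / (1.0 + math.exp(-score))
```

For example, a zero score maps to an intensity of 0.5, and strongly associated words approach 1.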
13. Contextual entropy model
The contextual entropy model consists of three steps:
Vector representation
• Represents the co-occurrence strength between each word
and its context words.
Similarity measure
• Measures the difference between the probabilistic context
distributions of a seed word and a candidate word.
Expansion procedure
• Determines the sentiment class of each candidate word.
14. Vector representation
The contextual entropy model uses a high-dimensional
vector to record the co-occurrence strength between a
word and its context words.
v_{w_i} = \langle m_1, \dots, m_d, \dots, m_k \rangle,

where k is the window size and d is the distance between w_i and a
context word.
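A rough sketch of collecting left and right context co-occurrences within a window of size k; raw counts stand in for the model's distance-based weights, which are not fully recoverable from the slide:

```python
from collections import Counter

def context_vectors(corpus, target, k=2):
    """Collect left- and right-context co-occurrence counts for
    `target` within a window of k words on each side of every
    occurrence. Returns (left, right) count vectors."""
    left, right = Counter(), Counter()
    for sentence in corpus:
        for i, word in enumerate(sentence):
            if word != target:
                continue
            for d in range(1, k + 1):       # d = distance to the target
                if i - d >= 0:
                    left[sentence[i - d]] += 1
                if i + d < len(sentence):
                    right[sentence[i + d]] += 1
    return left, right
```

Normalizing these counts gives the probabilistic context distributions compared in the next step.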
15. Similarity measure
The distance between c_i and seed_j can be calculated as the sum
of the KL divergences of their left and right context distributions.
That is,

Dist(c_i, seed_j) = Div(v^{left}_{c_i}, v^{left}_{seed_j}) + Div(v^{right}_{c_i}, v^{right}_{seed_j}).

The similarity between c_i and seed_j can then be defined as

CE(c_i, seed_j) = \frac{1}{1 + Dist(c_i, seed_j)}.
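A minimal sketch of the similarity step, assuming Div is the KL divergence over smoothed discrete context distributions (the exact divergence and smoothing in the original may differ):

```python
import math

def kl_divergence(p, q, eps=1e-10):
    """KL divergence between two discrete context distributions
    given as dicts mapping context word -> probability; `eps`
    smooths zero probabilities."""
    words = set(p) | set(q)
    return sum(
        p.get(w, eps) * math.log2(p.get(w, eps) / q.get(w, eps))
        for w in words
    )

def ce_similarity(cand_left, cand_right, seed_left, seed_right):
    """CE(c_i, seed_j) = 1 / (1 + Dist), where Dist sums the
    divergences of the left and right context distributions."""
    dist = (kl_divergence(cand_left, seed_left)
            + kl_divergence(cand_right, seed_right))
    return 1.0 / (1.0 + dist)
```

Identical context distributions give a distance of 0 and hence a maximal similarity of 1.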
16. Expansion procedure
CE(c_i, P_{seed}) = \frac{1}{|P_{seed}|} \sum_{seed_j \in P_{seed}} CE(c_i, seed_j),

CE(c_i, N_{seed}) = \frac{1}{|N_{seed}|} \sum_{seed_j \in N_{seed}} CE(c_i, seed_j),

Intensity(c_i) =
\begin{cases}
\dfrac{1}{1 + \exp(-CE(c_i, P_{seed}))}, & \text{if the sentiment class of } c_i \text{ is positive,} \\[6pt]
\dfrac{1}{1 + \exp(-CE(c_i, N_{seed}))}, & \text{if the sentiment class of } c_i \text{ is negative.}
\end{cases}
17. Sentiment classification of stock news
This study develops two classification schemes: binary
and intensity, depending on whether or not intensity is
used in classification.
Binary classification scheme
• Only compares the number of positive and negative
emotion words contained in stock news articles without
consideration of their intensity.
Intensity classification scheme
• Compares the sum of the intensity of positive and
negative emotion words in the articles.
18. Binary classification scheme
The binary classification scheme is defined as

l(D) =
\begin{cases}
positive, & \text{if } \sum_{w_i \in P} \sum_{w_j \in D} I(w_i, w_j) - \sum_{w_i \in N} \sum_{w_j \in D} I(w_i, w_j) \ge 0, \\[6pt]
negative, & \text{if } \sum_{w_i \in P} \sum_{w_j \in D} I(w_i, w_j) - \sum_{w_i \in N} \sum_{w_j \in D} I(w_i, w_j) < 0,
\end{cases}

I(w_i, w_j) =
\begin{cases}
1, & \text{if } w_i = w_j, \\
0, & \text{if } w_i \ne w_j,
\end{cases}

where D is a news article and P and N are the positive and negative
emotion word sets.
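The binary scheme above amounts to comparing lexicon-match counts; a minimal sketch (lexicons passed in as plain sets):

```python
def classify_binary(doc_words, pos_lexicon, neg_lexicon):
    """Label an article positive iff it contains at least as many
    positive-lexicon word tokens as negative-lexicon ones,
    ignoring intensity."""
    pos = sum(1 for w in doc_words if w in pos_lexicon)
    neg = sum(1 for w in doc_words if w in neg_lexicon)
    return "positive" if pos - neg >= 0 else "negative"
```

Note the tie case (equal counts) is labeled positive here, matching the ">= 0" condition.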
19. Intensity classification scheme
The intensity classification scheme is defined as

l(D) =
\begin{cases}
positive, & \text{if } \sum_{w_i \in P} \sum_{w_j \in D} I(w_i, w_j)\,Intensity(w_i) - \sum_{w_i \in N} \sum_{w_j \in D} I(w_i, w_j)\,Intensity(w_i) \ge 0, \\[6pt]
negative, & \text{if } \sum_{w_i \in P} \sum_{w_j \in D} I(w_i, w_j)\,Intensity(w_i) - \sum_{w_i \in N} \sum_{w_j \in D} I(w_i, w_j)\,Intensity(w_i) < 0.
\end{cases}
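The intensity scheme replaces the raw counts with summed intensities; a minimal sketch where the lexicons are dicts mapping each emotion word to its intensity:

```python
def classify_intensity(doc_words, pos_intensity, neg_intensity):
    """Label an article by comparing the summed intensities of the
    positive and negative emotion words it contains."""
    pos = sum(pos_intensity.get(w, 0.0) for w in doc_words)
    neg = sum(neg_intensity.get(w, 0.0) for w in doc_words)
    return "positive" if pos - neg >= 0 else "negative"
```

Here a single strong negative word (e.g. intensity 0.9) can outweigh a weak positive one, which the binary scheme cannot capture.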
20. Outline
Introduction
Background
Motivation and purpose
System Framework and Methodology
Seed word selection
Expansion of emotion words and intensity
Sentiment classification of stock news
Experimental Results
Experiment setup
Seed word generation
Emotion word expansion
Comparative results
Discussion
Conclusions
21. Experiment setup
Experimental data
Data source: a total of 7291 stock news articles from Yahoo!NEWS
Labeled news articles: 3262 articles
Unlabeled news articles: 4029 articles
Classifiers and feature sets
Classifiers: Binary and Intensity
Seed: seed words generated by applying information gain to the
labeled news and then manually filtering the candidates with
human experts.
PMI: seed words plus emotion words expanded from the
unlabeled news using PMI.
CE: seed words plus emotion words expanded from the
unlabeled news using the contextual entropy method.
24. Emotion word expansion
Accuracy of CE with different window sizes against different
threshold values for expanded word selection (β).
(Figure: binary classification, left; intensity classification, right.)
25. Emotion word expansion
Accuracy of PMI and CE against different threshold values for
expanded word selection (β).
(Figure: binary classification, left; intensity classification, right.)
26. Emotion word expansion
Accuracy of Seed, PMI, and CE against different proportions of
the labeled corpus for seed word generation (α).
(Figure: binary classification, left; intensity classification, right.)
27. Comparative results
Comparative results of Seed, PMI, and CE for
sentiment classification (%)
            Seed     PMI      CE
Binary      67.90    70.69    73.19*
Intensity   70.07    66.17    76.54*

* CE vs. PMI significantly different (p < 0.05).
28. Discussion
Discussion of the features used in the binary and intensity
classification schemes.
Discussion of the number of noisy words produced by the
different expansion approaches.
30. Discussion of noisy words
Examples of emotion words and their intensity acquired
by PMI and CE.

Seed word: 下跌 (fall)

CE                          PMI
feature        intensity    feature    intensity
走黑           0.6366       摔破       0.9999
台股收低       0.6023       瀕臨       0.9998
指小跌         0.5940       拋匯       0.9990
年大虧         0.5759       動盪       0.9974
台股走跌       0.5504       盤跌       0.9954
31. Outline
Introduction
Background
Motivation and purpose
System Framework and Methodology
Seed word selection
Expansion of emotion words and intensity
Sentiment classification of stock news
Experimental Results
Seed word generation
Emotion word expansion
Comparative results
Discussion
Conclusions
32. Conclusions
Experimental results show that the use of the expanded
emotion words contributed to the classification performance,
and incorporating the intensity further improved the
performance.
Our proposed method, which considers both co-occurrence
strength and context distribution, can acquire more useful
emotion words with less noise.
Our future work will be devoted to the following directions:
The current approach will be extended to multiple
categories of news.
Investigating more significant features to further improve
classification performance.