應用文脈熵模型擴增情緒詞彙與強度於
股市新聞情感分類之研究
Using a Contextual Entropy Model to Expand
Emotion Words and their Intensity for Sentiment
Classification of Stock News
Advisor: Dr. 禹良治
Graduate student: 朱炫碩
Outline
Introduction
 Background
 Motivation and purpose
System Framework and Methodology
 Seed word selection
 Expansion of emotion words and intensity
 Sentiment classification of stock news
Experimental Results
 Experiment setup
 Seed word generation
 Emotion word expansion
 Comparative results
 Discussion
Conclusions
Background
Stock trend prediction using technical indices has been
extensively investigated in the stock market.
 e.g., moving average (MA), relative strength index (RSI)
Textual data such as stock news articles are also an
important factor affecting the stock price.
Owing to the huge number of articles and reports, finding
such useful information in daily news is not an easy task
for investors.
Motivation and purpose
Sentiment classification of stock news can help
investors identify the sentiment tendency of stock news
and facilitate their investment decision making.
This study focuses on mining useful features to classify
the sentiment of stock news.
Motivation and purpose
One of the major characteristics of stock news articles is
the emotion words contained within them, and these
words may differ in intensity.
ex:
 Positive emotion words: soar (strong) vs. rise (normal)
 Negative emotion words: collapse (strong) vs. fall (normal)
Motivation and purpose
To discover emotion words and their intensity,
traditional approaches can be divided into two areas
of research:
 knowledge-based methods
• Rely on expert knowledge to create affective lexicons, or
use existing lexicons to obtain emotion words and their
intensity.
 corpus-based methods
• Automatically acquire emotion words and their intensity
from large corpora based on a set of seed words.
Outline
Introduction
 Background
 Motivation and purpose
System Framework and Methodology
 Seed word selection
 Expansion of emotion words and intensity
 Sentiment classification of stock news
Experimental Results
 Experiment setup
 Seed word generation
 Emotion word expansion
 Comparative results
 Discussion
Conclusions
System Framework
(Figure: system framework diagram)
Seed word generation
Information Gain (IG):

Gain(word) = I(p, n) − E(word)

I(p, n) = −(p/(p+n)) · log₂(p/(p+n)) − (n/(p+n)) · log₂(n/(p+n))

I(p, n) = 0, if either p = 0 or n = 0

E(word) = Σ_{i=1..w} ((p_i + n_i)/(p + n)) · I(p_i, n_i)
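The IG computation above can be sketched as follows; the two-way partition into documents that contain and do not contain the word is an illustrative choice for how E(word) splits the corpus:

```python
import math

def class_entropy(p, n):
    """I(p, n): entropy of a set with p positive and n negative documents."""
    if p == 0 or n == 0:
        return 0.0
    total = p + n
    return (-(p / total) * math.log2(p / total)
            - (n / total) * math.log2(n / total))

def information_gain(p, n, partitions):
    """Gain(word) = I(p, n) - E(word), where `partitions` holds the
    (p_i, n_i) counts of each subset induced by the word (e.g. the
    documents that contain it and those that do not)."""
    total = p + n
    expected = sum((pi + ni) / total * class_entropy(pi, ni)
                   for pi, ni in partitions)
    return class_entropy(p, n) - expected
```

For instance, a word that splits a balanced corpus of 10 positive and 10 negative articles into an (8, 2) and a (2, 8) subset yields a gain of about 0.28 bits.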
Emotion word expansion
The following methods acquire more emotion words
and their intensity from the unlabeled corpus based on
the given seed words.
 Pointwise Mutual Information (PMI)
• Measures the co-occurrence strength between two words.
 Contextual Entropy Model (CE)
• Considers both co-occurrence strength and context distribution
between the candidate words and the seed words.
Pointwise mutual information
The PMI expansion method uses the following equations:

PMI(c_i, seed_j) = log₂( C(c_i, seed_j) · N / (C(c_i) · C(seed_j)) )

PMI(c_i, P_seed) = (1/|P_seed|) · Σ_{seed_j ∈ P_seed} PMI(c_i, seed_j)

PMI(c_i, N_seed) = (1/|N_seed|) · Σ_{seed_j ∈ N_seed} PMI(c_i, seed_j)

where C(·) denotes a frequency count and N is the corpus size.
Pointwise mutual information
The averaged PMI score is then mapped to an intensity value
through a sigmoid function:

Intensity(c_i) = 1 / (1 + exp(−PMI(c_i, P_seed))), if the sentiment class of c_i is positive
Intensity(c_i) = 1 / (1 + exp(−PMI(c_i, N_seed))), if the sentiment class of c_i is negative
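The sigmoid mapping can be sketched directly; it squashes the averaged score into (0, 1), so words more strongly associated with their seed set receive intensities closer to 1:

```python
import math

def word_intensity(avg_score):
    """Intensity(c) = 1 / (1 + exp(-score)), where `score` is the averaged
    PMI (or CE) of c against the seed set of its sentiment class."""
    return 1.0 / (1.0 + math.exp(-avg_score))
```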
Contextual entropy model
The contextual entropy model consists of three steps:
 Vector representation
• Represents the co-occurrence strength between each
word and its context words.
 Similarity measure
• Measures the difference between the probabilistic
context distributions of a seed word and a candidate word.
 Expansion procedure
• Determines the sentiment class of each candidate word.
Vector representation
The contextual entropy model uses a high-dimensional
vector to record the co-occurrence strength between a
word and its context words.
(Figure: the context vector of word w_i; each dimension stores the
weight m = k − d + 1 of a context word, where k is the window size
and d is the distance between w_i and the context word, so closer
context words receive higher weights.)
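A minimal sketch of building distance-weighted context vectors, assuming each context word at distance d within a window of size k receives weight k − d + 1 (closer words count more); the left/right split is kept for the similarity step:

```python
from collections import defaultdict

def context_vectors(tokens, target, k):
    """Build left and right context vectors for `target`: a context word
    at distance d (1 <= d <= k) contributes weight k - d + 1."""
    left, right = defaultdict(float), defaultdict(float)
    for i, tok in enumerate(tokens):
        if tok != target:
            continue
        for d in range(1, k + 1):
            if i - d >= 0:
                left[tokens[i - d]] += k - d + 1
            if i + d < len(tokens):
                right[tokens[i + d]] += k - d + 1
    return dict(left), dict(right)
```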
Similarity measure
 The distance between c_i and seed_j can be calculated as the
sum of the KL divergences of their left and right context
distributions. That is,

Dist(c_i, seed_j) = Div(v_{c_i}^left ‖ v_{seed_j}^left) + Div(v_{c_i}^right ‖ v_{seed_j}^right)

 The similarity between c_i and seed_j can then be defined as

CE(c_i, seed_j) = 1 / (1 + Dist(c_i, seed_j))
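A minimal sketch of the similarity computation above, assuming the left and right context vectors are normalized into probability distributions and smoothed with a small ε so the KL divergence stays finite:

```python
import math

def kl_divergence(p, q, eps=1e-9):
    """Div(p || q) between two smoothed context distributions
    (dicts of word -> weight), normalized over the joint vocabulary."""
    vocab = set(p) | set(q)
    zp = sum(p.values()) + eps * len(vocab)
    zq = sum(q.values()) + eps * len(vocab)
    total = 0.0
    for w in vocab:
        pw = (p.get(w, 0.0) + eps) / zp
        qw = (q.get(w, 0.0) + eps) / zq
        total += pw * math.log2(pw / qw)
    return total

def ce_similarity(cand_left, cand_right, seed_left, seed_right):
    """CE(c, seed) = 1 / (1 + Dist), Dist = left KL + right KL."""
    dist = (kl_divergence(cand_left, seed_left)
            + kl_divergence(cand_right, seed_right))
    return 1.0 / (1.0 + dist)
```

Identical context distributions give a similarity of 1; the more a candidate's contexts diverge from the seed's, the closer the similarity falls toward 0.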
Expansion procedure
CE(c_i, P_seed) = (1/|P_seed|) · Σ_{seed_j ∈ P_seed} CE(c_i, seed_j)

CE(c_i, N_seed) = (1/|N_seed|) · Σ_{seed_j ∈ N_seed} CE(c_i, seed_j)

Intensity(c_i) = 1 / (1 + exp(−CE(c_i, P_seed))), if the sentiment class of c_i is positive
Intensity(c_i) = 1 / (1 + exp(−CE(c_i, N_seed))), if the sentiment class of c_i is negative
Sentiment classification of stock news
This study develops two classification schemes: binary
and intensity, depending on whether or not intensity is
used in classification.
 Binary classification scheme
• Only compares the number of positive and negative
emotion words contained in stock news articles without
consideration of their intensity.
 Intensity classification scheme
• Compares the sum of the intensity of positive and
negative emotion words in the articles.
Binary classification scheme
The binary classification scheme is defined as

l(D) = positive, if Σ_{w_i ∈ P} Σ_{w_j ∈ D} I(w_i, w_j) − Σ_{w_i ∈ N} Σ_{w_j ∈ D} I(w_i, w_j) ≥ 0
l(D) = negative, if Σ_{w_i ∈ P} Σ_{w_j ∈ D} I(w_i, w_j) − Σ_{w_i ∈ N} Σ_{w_j ∈ D} I(w_i, w_j) < 0

where P and N are the positive and negative emotion lexicons, D is the
set of words in the article, and the indicator function is

I(w_i, w_j) = 1, if w_i = w_j
I(w_i, w_j) = 0, if w_i ≠ w_j
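The binary scheme can be sketched as follows; `pos_words` and `neg_words` stand in for the expanded positive and negative lexicons, and ties fall to the positive label (the "≥ 0" branch):

```python
def classify_binary(doc_words, pos_words, neg_words):
    """Label a document by comparing occurrence counts of positive and
    negative emotion words; intensity is ignored."""
    pos = sum(1 for w in doc_words if w in pos_words)
    neg = sum(1 for w in doc_words if w in neg_words)
    return 'positive' if pos - neg >= 0 else 'negative'
```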
Intensity classification scheme
The intensity classification scheme is defined as

l(D) = positive, if Σ_{w_i ∈ P} Σ_{w_j ∈ D} I(w_i, w_j) · Intensity(w_i) − Σ_{w_i ∈ N} Σ_{w_j ∈ D} I(w_i, w_j) · Intensity(w_i) ≥ 0
l(D) = negative, if Σ_{w_i ∈ P} Σ_{w_j ∈ D} I(w_i, w_j) · Intensity(w_i) − Σ_{w_i ∈ N} Σ_{w_j ∈ D} I(w_i, w_j) · Intensity(w_i) < 0
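A sketch of the intensity scheme, assuming `pos_lex` and `neg_lex` are illustrative lexicons mapping each expanded word to its intensity magnitude; unlike the binary scheme, a few strong words can outweigh many weak ones:

```python
def classify_intensity(doc_words, pos_lex, neg_lex):
    """Label a document by the difference between the summed intensities
    of its positive and negative emotion words."""
    pos = sum(pos_lex[w] for w in doc_words if w in pos_lex)
    neg = sum(neg_lex[w] for w in doc_words if w in neg_lex)
    return 'positive' if pos - neg >= 0 else 'negative'
```

For example, one strong positive word with intensity 0.95 outweighs two weak negative words at 0.30 each, a document the binary scheme would label negative.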
Outline
Introduction
 Background
 Motivation and purpose
System Framework and Methodology
 Seed word selection
 Expansion of emotion words and intensity
 Sentiment classification of stock news
Experimental Results
 Experiment setup
 Seed word generation
 Emotion word expansion
 Comparative results
 Discussion
Conclusions
Experiment setup
Experimental data
 Data source: a total of 7291 stock news articles from Yahoo!NEWS
 Labeled news articles: 3262
 Unlabeled news articles: 4029
Classifiers and feature sets
 Classifiers: Binary and Intensity
 Seed: seed words generated by ranking words with information
gain on the labeled news and then manually filtering them with
human experts.
 PMI: seed words plus emotion words expanded from the
unlabeled news using PMI.
 CE: seed words plus emotion words expanded from the
unlabeled news using the contextual entropy method.
Seed word generation
Classification accuracy of Seed against different
proportions of the labeled corpus for seed word
generation (α).
Emotion word expansion
Classification accuracy of different window sizes (k) against
different threshold values for expanded word selection (β):

β            0.1    0.2    0.3    0.4    0.5    0.6    0.7    0.8    0.9    1.0
Binary
  PMI        68.45  68.73  68.91  68.78  69.38  68.55  67.90  66.56  62.03  52.47
  CE (k = 3) 69.56  70.21  70.16  69.84  68.22  67.02  67.11  65.87  65.82  64.76
  CE (k = 4) 69.70  70.81  70.99  70.44  70.16  68.68  68.55  67.99  67.25  67.58
  CE (k = 5) 69.42  70.95  70.85  70.30  69.61  69.15  68.55  67.76  66.79  66.14
Intensity
  PMI        69.75  70.67  71.18  71.64  71.87  72.06  71.22  69.47  65.96  61.94
  CE (k = 3) 72.24  73.07  73.30  72.56  70.85  70.02  69.56  69.01  68.82  67.85
  CE (k = 4) 72.38  73.63  74.00  74.13  73.12  71.73  71.32  70.67  70.62  71.13
  CE (k = 5) 71.82  73.72  74.04  72.89  72.56  71.87  71.27  70.53  69.88  70.39
Emotion word expansion
Accuracy of CE with different window sizes (k) against
different threshold values for expanded word selection (β).

(Figures: binary classification and intensity classification)
Emotion word expansion
Accuracy of PMI and CE against different threshold
values for expanded word selection (β).

(Figures: binary classification and intensity classification)
Emotion word expansion
Accuracy of Seed, PMI, and CE against different
proportions of the labeled corpus for seed word
generation (α).
(Figures: binary classification and intensity classification)
Comparative results
Comparative results of Seed, PMI, and CE for
sentiment classification (%)
            Seed    PMI     CE
Binary      67.90   70.69   73.19*
Intensity   70.07   66.17   76.54*

* CE vs. PMI significantly different (p < 0.05).
Discussion
Discussion of the features used in the binary and intensity
schemes.
Discussion of the number of noise words produced by the
different expansion approaches.
Discussion of different intensity
Example of a stock news article
<Title> 台股漲 新台幣盤升8.4分 </Title>
<Time> 2011/8/16 10:07 </Time>
<Content>
台股開盤上漲,美元偏弱,台北外匯市場盤初觀望氣氛濃,新台幣兌美元匯率走升;上午9時
45分新台幣匯率28.868元,升值8.4分。 匯銀人士指出,受美股收紅的激勵,台股開盤上漲71
點,台北匯市盤初瀰漫觀望氣氛,因外商銀行拋匯力道稍強,新台幣盤初走升。1000816
</Content>
Feature (count)   Binary           Intensity
                  PMI     CE       PMI       CE
上漲 (2)          +1      +1       +1.0000   +1.0000
走升 (2)          +1      -1       +0.7688   -0.5088
收紅 (1)          +1      +1       +1.0000   +1.0000
瀰漫 (1)          -1      -1       -0.9746   -0.5296
拋匯 (1)          +1      -1       +0.9741   -0.5202
開盤 (1)          -1      —        -0.2066   —
弱 (1)            -1      —        -0.3359   —
觀望 (2)          -1      —        -0.3242   —
濃 (1)            -1      —        -0.7591   —
兌 (1)            -1      —        -0.1305   —
Sum               -1      -1       2.5741    0.9327
Discussion of noise words
Examples of emotion words and their intensity acquired
by PMI and CE for the seed word 下跌:

CE                          PMI
feature      intensity      feature   intensity
走黑         0.6366         摔破      0.9999
台股收低     0.6023         瀕臨      0.9998
指小跌       0.5940         拋匯      0.9990
年大虧       0.5759         動盪      0.9974
台股走跌     0.5504         盤跌      0.9954
Outline
Introduction
 Background
 Motivation and purpose
System Framework and Methodology
 Seed word selection
 Expansion of emotion words and intensity
 Sentiment classification of stock news
Experimental Results
 Seed word generation
 Emotion word expansion
 Comparative results
 Discussion
Conclusions
Conclusions
 Experimental results show that using the expanded emotion
words improved classification performance, and that
incorporating their intensity improved it further.
 The proposed method, which considers both co-occurrence
strength and context distribution, can acquire more useful
emotion words with less noise.
 Our future work will be devoted to the following directions.
 Extending the current approach to multiple categories
of news.
 Investigating more significant features to further improve
classification performance.

Hinweis der Redaktion

  1. 本研究主題是應用文脈熵模型擴增情緒字及強度於股市新聞的情感分類研究
  2. 過去預測股票走勢大多使用技術指標,例如:MA、RSI等當作分析標的。 但是影響股票市場的因素,不僅只有技術指標,人類情感也是左右盤勢的主要因素之一,本研究認為,股市新聞是影響投資者情緒,進而造成股價走勢變動的重大因素之一。 但一般投資者無法即時讀取每日數量龐大的新聞,因此可能會造成資訊上的落差,而做出錯誤的決策。
  3. 所以本研究想透過文字探勘技術的應用,對股市新聞進行情感傾向分類,以輔助投資者做決策。 目的是希望找出能有效分析股市情感的特徵。
  4. 其中股市新聞中的情緒字以及強度是本研究的重點 本研究會將情緒字分為正向及負向類別,而且在同類別的眾多情緒字中,會有程度的上的差別,以下面正向及負向類別的情緒字為例。 例如,屬於正向類別的soar和rise都有漲的意思,但是soar卻有飛漲的意思,rise則只有漲的意思,顯示出雖然都同屬正向類別卻具有程度上的差異。 另外,collapse和fall也同屬負向的情緒字,但collapse有崩跌的意思,fall則只有跌的意思,顯示出雖同屬負向類別但也具有程度上的差異。
  5. 對於情緒字及其強度的研究,常使用的方法分為knowledge-based和corpus-based兩種,knowledge-based是使用先前已由專家定義好的字典來分析,corpus-based則是根據基礎字並以自動化的方式擴增出情緒字及其強度。 本研究是使用corpus-based這個方法。
  6. 接下來介紹系統架構和研究方法
  7. 這是本研究的架構,首先我們會先從網路抓取股市新聞,並將新聞資料分為已標記情感和未標記情感兩類,已標記新聞用於基礎字的挑選,未標記的新聞用於情感字的擴增…
  8. 亂度愈亂愈不好(值愈小愈好)
  9. **
  10. 接著介紹本研究所提出的“文脈熵模型”方法,文脈熵模型會依照這三個程序挑選情緒字和計算強度
  11. L-d+1的公式解釋:M代表距離愈近權重值就愈高 運用右邊的計算公式計算存放於左邊維度中