應用文脈熵模型擴增情緒詞彙與強度於
股市新聞情感分類之研究
Using a Contextual Entropy Model to Expand
Emotion Words and their Intensity for Sentiment
Classification of Stock News
Advisor: Dr. 禹良治
Graduate student: 朱炫碩
Outline
Introduction
 Background
 Motivation and purpose
System Framework and Methodology
 Seed word selection
 Expansion of emotion words and intensity
 Sentiment classification of stock news
Experimental Results
 Experiment setup
 Seed word generation
 Emotion word expansion
 Comparative results
 Discussion
Conclusions
Background
Stock trend prediction using technical indices has been
extensively investigated in the stock market.
 e.g., moving average (MA), relative strength index (RSI)
Textual data such as stock news articles are also an
important factor affecting the stock price.
Owing to the huge number of articles and reports, finding
such useful information in daily news is not an easy task
for investors.
Motivation and purpose
Sentiment classification of stock news can help
investors identify the sentiment tendency of stock news
and facilitate their investment decision making.
This study focuses on mining useful features to classify
the sentiment of stock news.
Motivation and purpose
One of the major characteristics of stock news articles is
the emotion words contained within them, and these
words may differ in intensity.
ex:
 Positive emotion words: soar (strong) vs. rise (normal)
 Negative emotion words: collapse (strong) vs. fall (normal)
Motivation and purpose
To discover emotion words and their intensity,
traditional approaches can be divided into two areas
of research:
 knowledge-based methods
• Rely on expert knowledge to create affective lexicons, or
use existing lexicons to obtain emotion words and their
intensity.
 corpus-based methods
• Automatically acquire emotion words and their intensity
from large corpora based on a set of seed words.
Outline
Introduction
 Background
 Motivation and purpose
System Framework and Methodology
 Seed word selection
 Expansion of emotion words and intensity
 Sentiment classification of stock news
Experimental Results
 Experiment setup
 Seed word generation
 Emotion word expansion
 Comparative results
 Discussion
Conclusions
System Framework
(Figure: system framework diagram)
Seed word generation
Information Gain (IG):

Gain(word) = I(p, n) − E(word)

I(p, n) = −(p/(p+n)) · log₂(p/(p+n)) − (n/(p+n)) · log₂(n/(p+n))

I(p, n) = 0, if either p = 0 or n = 0

E(word) = Σ_{i=1..w} ((p_i + n_i)/(p + n)) · I(p_i, n_i)
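The IG computation above can be sketched as follows; the two-way partition into documents that contain and do not contain the word is an illustrative choice for how E(word) splits the corpus:

```python
import math

def class_entropy(p, n):
    """I(p, n): entropy of a set with p positive and n negative documents."""
    if p == 0 or n == 0:
        return 0.0
    total = p + n
    return (-(p / total) * math.log2(p / total)
            - (n / total) * math.log2(n / total))

def information_gain(p, n, partitions):
    """Gain(word) = I(p, n) - E(word), where `partitions` holds the
    (p_i, n_i) counts of each subset induced by the word (e.g. the
    documents that contain it and those that do not)."""
    total = p + n
    expected = sum((pi + ni) / total * class_entropy(pi, ni)
                   for pi, ni in partitions)
    return class_entropy(p, n) - expected
```

For instance, a word that splits a balanced corpus of 10 positive and 10 negative articles into an (8, 2) and a (2, 8) subset yields a gain of about 0.28 bits.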
Emotion word expansion
The following methods acquire more emotion words
and their intensity from the unlabeled corpus based on
the given seed words.
 Pointwise Mutual Information (PMI)
• Measures the co-occurrence strength between two words.
 Contextual Entropy Model (CE)
• Considers both co-occurrence strength and context distribution
between the candidate words and the seed words.
Pointwise mutual information
The PMI expansion method uses the following equations:

PMI(c_i, seed_j) = log₂( C(c_i, seed_j) · N / (C(c_i) · C(seed_j)) )

PMI(c_i, P_seed) = (1/|P_seed|) · Σ_{seed_j ∈ P_seed} PMI(c_i, seed_j)

PMI(c_i, N_seed) = (1/|N_seed|) · Σ_{seed_j ∈ N_seed} PMI(c_i, seed_j)

where C(·) denotes a frequency count and N is the corpus size.
Pointwise mutual information
The averaged PMI score is then mapped to an intensity value
through a sigmoid function:

Intensity(c_i) = 1 / (1 + exp(−PMI(c_i, P_seed))), if the sentiment class of c_i is positive
Intensity(c_i) = 1 / (1 + exp(−PMI(c_i, N_seed))), if the sentiment class of c_i is negative
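The sigmoid mapping can be sketched directly; it squashes the averaged score into (0, 1), so words more strongly associated with their seed set receive intensities closer to 1:

```python
import math

def word_intensity(avg_score):
    """Intensity(c) = 1 / (1 + exp(-score)), where `score` is the averaged
    PMI (or CE) of c against the seed set of its sentiment class."""
    return 1.0 / (1.0 + math.exp(-avg_score))
```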
Contextual entropy model
The contextual entropy model consists of three steps:
 Vector representation
• Represents the co-occurrence strength between each
word and its context words.
 Similarity measure
• Measures the difference between the probabilistic
context distributions of a seed word and a candidate word.
 Expansion procedure
• Determines the sentiment class of each candidate word.
Vector representation
The contextual entropy model uses a high-dimensional
vector to record the co-occurrence strength between a
word and its context words.
(Figure: the context vector of word w_i; each dimension stores the
weight m = k − d + 1 of a context word, where k is the window size
and d is the distance between w_i and the context word, so closer
context words receive higher weights.)
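A minimal sketch of building distance-weighted context vectors, assuming each context word at distance d within a window of size k receives weight k − d + 1 (closer words count more); the left/right split is kept for the similarity step:

```python
from collections import defaultdict

def context_vectors(tokens, target, k):
    """Build left and right context vectors for `target`: a context word
    at distance d (1 <= d <= k) contributes weight k - d + 1."""
    left, right = defaultdict(float), defaultdict(float)
    for i, tok in enumerate(tokens):
        if tok != target:
            continue
        for d in range(1, k + 1):
            if i - d >= 0:
                left[tokens[i - d]] += k - d + 1
            if i + d < len(tokens):
                right[tokens[i + d]] += k - d + 1
    return dict(left), dict(right)
```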
Similarity measure
 The distance between c_i and seed_j can be calculated as the
sum of the KL divergences of their left and right context
distributions. That is,

Dist(c_i, seed_j) = Div(v_{c_i}^left ‖ v_{seed_j}^left) + Div(v_{c_i}^right ‖ v_{seed_j}^right)

 The similarity between c_i and seed_j can then be defined as

CE(c_i, seed_j) = 1 / (1 + Dist(c_i, seed_j))
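A minimal sketch of the similarity computation above, assuming the left and right context vectors are normalized into probability distributions and smoothed with a small ε so the KL divergence stays finite:

```python
import math

def kl_divergence(p, q, eps=1e-9):
    """Div(p || q) between two smoothed context distributions
    (dicts of word -> weight), normalized over the joint vocabulary."""
    vocab = set(p) | set(q)
    zp = sum(p.values()) + eps * len(vocab)
    zq = sum(q.values()) + eps * len(vocab)
    total = 0.0
    for w in vocab:
        pw = (p.get(w, 0.0) + eps) / zp
        qw = (q.get(w, 0.0) + eps) / zq
        total += pw * math.log2(pw / qw)
    return total

def ce_similarity(cand_left, cand_right, seed_left, seed_right):
    """CE(c, seed) = 1 / (1 + Dist), Dist = left KL + right KL."""
    dist = (kl_divergence(cand_left, seed_left)
            + kl_divergence(cand_right, seed_right))
    return 1.0 / (1.0 + dist)
```

Identical context distributions give a similarity of 1; the more a candidate's contexts diverge from the seed's, the closer the similarity falls toward 0.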
Expansion procedure
CE(c_i, P_seed) = (1/|P_seed|) · Σ_{seed_j ∈ P_seed} CE(c_i, seed_j)

CE(c_i, N_seed) = (1/|N_seed|) · Σ_{seed_j ∈ N_seed} CE(c_i, seed_j)

Intensity(c_i) = 1 / (1 + exp(−CE(c_i, P_seed))), if the sentiment class of c_i is positive
Intensity(c_i) = 1 / (1 + exp(−CE(c_i, N_seed))), if the sentiment class of c_i is negative
Sentiment classification of stock news
This study develops two classification schemes: binary
and intensity, depending on whether or not intensity is
used in classification.
 Binary classification scheme
• Only compares the number of positive and negative
emotion words contained in stock news articles without
consideration of their intensity.
 Intensity classification scheme
• Compares the sum of the intensity of positive and
negative emotion words in the articles.
Binary classification scheme
The binary classification scheme is defined as

l(D) = positive, if Σ_{w_i ∈ P} Σ_{w_j ∈ D} I(w_i, w_j) − Σ_{w_i ∈ N} Σ_{w_j ∈ D} I(w_i, w_j) ≥ 0
l(D) = negative, if Σ_{w_i ∈ P} Σ_{w_j ∈ D} I(w_i, w_j) − Σ_{w_i ∈ N} Σ_{w_j ∈ D} I(w_i, w_j) < 0

where P and N are the positive and negative emotion lexicons, D is the
set of words in the article, and the indicator function is

I(w_i, w_j) = 1, if w_i = w_j
I(w_i, w_j) = 0, if w_i ≠ w_j
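The binary scheme can be sketched as follows; `pos_words` and `neg_words` stand in for the expanded positive and negative lexicons, and ties fall to the positive label (the "≥ 0" branch):

```python
def classify_binary(doc_words, pos_words, neg_words):
    """Label a document by comparing occurrence counts of positive and
    negative emotion words; intensity is ignored."""
    pos = sum(1 for w in doc_words if w in pos_words)
    neg = sum(1 for w in doc_words if w in neg_words)
    return 'positive' if pos - neg >= 0 else 'negative'
```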
Intensity classification scheme
The intensity classification scheme is defined as

l(D) = positive, if Σ_{w_i ∈ P} Σ_{w_j ∈ D} I(w_i, w_j) · Intensity(w_i) − Σ_{w_i ∈ N} Σ_{w_j ∈ D} I(w_i, w_j) · Intensity(w_i) ≥ 0
l(D) = negative, if Σ_{w_i ∈ P} Σ_{w_j ∈ D} I(w_i, w_j) · Intensity(w_i) − Σ_{w_i ∈ N} Σ_{w_j ∈ D} I(w_i, w_j) · Intensity(w_i) < 0
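A sketch of the intensity scheme, assuming `pos_lex` and `neg_lex` are illustrative lexicons mapping each expanded word to its intensity magnitude; unlike the binary scheme, a few strong words can outweigh many weak ones:

```python
def classify_intensity(doc_words, pos_lex, neg_lex):
    """Label a document by the difference between the summed intensities
    of its positive and negative emotion words."""
    pos = sum(pos_lex[w] for w in doc_words if w in pos_lex)
    neg = sum(neg_lex[w] for w in doc_words if w in neg_lex)
    return 'positive' if pos - neg >= 0 else 'negative'
```

For example, one strong positive word with intensity 0.95 outweighs two weak negative words at 0.30 each, a document the binary scheme would label negative.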
Outline
Introduction
 Background
 Motivation and purpose
System Framework and Methodology
 Seed word selection
 Expansion of emotion words and intensity
 Sentiment classification of stock news
Experimental Results
 Experiment setup
 Seed word generation
 Emotion word expansion
 Comparative results
 Discussion
Conclusions
Experiment setup
Experimental data
 Data source: a total of 7291 stock news articles from Yahoo!NEWS
 Labeled news articles: 3262
 Unlabeled news articles: 4029
Classifiers and feature sets
 Classifiers: Binary and Intensity
 Seed: seed words generated by ranking words with information
gain on the labeled news and then manually filtering them with
human experts.
 PMI: seed words plus emotion words expanded from the
unlabeled news using PMI.
 CE: seed words plus emotion words expanded from the
unlabeled news using the contextual entropy method.
Seed word generation
Classification accuracy of Seed against different
proportions of the labeled corpus for seed word
generation (α).
Emotion word expansion
Classification accuracy of different window sizes (k) against
different threshold values for expanded word selection (β):

β            0.1    0.2    0.3    0.4    0.5    0.6    0.7    0.8    0.9    1.0
Binary
  PMI        68.45  68.73  68.91  68.78  69.38  68.55  67.90  66.56  62.03  52.47
  CE (k = 3) 69.56  70.21  70.16  69.84  68.22  67.02  67.11  65.87  65.82  64.76
  CE (k = 4) 69.70  70.81  70.99  70.44  70.16  68.68  68.55  67.99  67.25  67.58
  CE (k = 5) 69.42  70.95  70.85  70.30  69.61  69.15  68.55  67.76  66.79  66.14
Intensity
  PMI        69.75  70.67  71.18  71.64  71.87  72.06  71.22  69.47  65.96  61.94
  CE (k = 3) 72.24  73.07  73.30  72.56  70.85  70.02  69.56  69.01  68.82  67.85
  CE (k = 4) 72.38  73.63  74.00  74.13  73.12  71.73  71.32  70.67  70.62  71.13
  CE (k = 5) 71.82  73.72  74.04  72.89  72.56  71.87  71.27  70.53  69.88  70.39
Emotion word expansion
Accuracy of CE with different window sizes (k) against
different threshold values for expanded word selection (β).

(Figures: binary classification and intensity classification)
Emotion word expansion
Accuracy of PMI and CE against different threshold
values for expanded word selection (β).

(Figures: binary classification and intensity classification)
Emotion word expansion
Accuracy of Seed, PMI, and CE against different
proportions of the labeled corpus for seed word
generation (α).
(Figures: binary classification and intensity classification)
Comparative results
Comparative results of Seed, PMI, and CE for
sentiment classification (%)
            Seed    PMI     CE
Binary      67.90   70.69   73.19*
Intensity   70.07   66.17   76.54*

* CE vs. PMI significantly different (p < 0.05).
Discussion
Discussion of the features used in the binary and intensity
schemes.
Discussion of the number of noise words produced by the
different expansion approaches.
Discussion of different intensity
Example of a stock news article
<Title> 台股漲 新台幣盤升8.4分 </Title>
<Time> 2011/8/16 10:07 </Time>
<Content>
台股開盤上漲,美元偏弱,台北外匯市場盤初觀望氣氛濃,新台幣兌美元匯率走升;上午9時
45分新台幣匯率28.868元,升值8.4分。 匯銀人士指出,受美股收紅的激勵,台股開盤上漲71
點,台北匯市盤初瀰漫觀望氣氛,因外商銀行拋匯力道稍強,新台幣盤初走升。1000816
</Content>
Feature (count)   Binary           Intensity
                  PMI     CE       PMI       CE
上漲 (2)          +1      +1       +1.0000   +1.0000
走升 (2)          +1      -1       +0.7688   -0.5088
收紅 (1)          +1      +1       +1.0000   +1.0000
瀰漫 (1)          -1      -1       -0.9746   -0.5296
拋匯 (1)          +1      -1       +0.9741   -0.5202
開盤 (1)          -1      —        -0.2066   —
弱 (1)            -1      —        -0.3359   —
觀望 (2)          -1      —        -0.3242   —
濃 (1)            -1      —        -0.7591   —
兌 (1)            -1      —        -0.1305   —
Sum               -1      -1       2.5741    0.9327
Discussion of noise words
Examples of emotion words and their intensity acquired
by PMI and CE for the seed word 下跌:

CE                          PMI
feature      intensity      feature   intensity
走黑         0.6366         摔破      0.9999
台股收低     0.6023         瀕臨      0.9998
指小跌       0.5940         拋匯      0.9990
年大虧       0.5759         動盪      0.9974
台股走跌     0.5504         盤跌      0.9954
Outline
Introduction
 Background
 Motivation and purpose
System Framework and Methodology
 Seed word selection
 Expansion of emotion words and intensity
 Sentiment classification of stock news
Experimental Results
 Seed word generation
 Emotion word expansion
 Comparative results
 Discussion
Conclusions
Conclusions
 Experimental results show that using the expanded emotion
words improved classification performance, and that
incorporating their intensity improved it further.
 The proposed method, which considers both co-occurrence
strength and context distribution, can acquire more useful
emotion words with less noise.
 Our future work will be devoted to the following directions.
 Extending the current approach to multiple categories
of news.
 Investigating more significant features to further improve
classification performance.

Hinweis der Redaktion

  1. 本研究主題是應用文脈熵模型擴增情緒字及強度於股市新聞的情感分類研究
  2. 過去預測股票走勢大多使用技術指標,例如:MA、RSI等當作分析標的。 但是影響股票市場的因素,不僅只有技術指標,人類情感也是左右盤勢的主要因素之一,本研究認為,股市新聞是影響投資者情緒,進而造成股價走勢變動的重大因素之一。 但一般投資者無法即時讀取每日數量龐大的新聞,因此可能會造成資訊上的落差,而做出錯誤的決策。
  3. 所以本研究想透過文字探勘技術的應用,對股市新聞進行情感傾向分類,以輔助投資者做決策。 目的是希望找出能有效分析股市情感的特徵。
  4. 其中股市新聞中的情緒字以及強度是本研究的重點 本研究會將情緒字分為正向及負向類別,而且在同類別的眾多情緒字中,會有程度的上的差別,以下面正向及負向類別的情緒字為例。 例如,屬於正向類別的soar和rise都有漲的意思,但是soar卻有飛漲的意思,rise則只有漲的意思,顯示出雖然都同屬正向類別卻具有程度上的差異。 另外,collapse和fall也同屬負向的情緒字,但collapse有崩跌的意思,fall則只有跌的意思,顯示出雖同屬負向類別但也具有程度上的差異。
  5. 對於情緒字及其強度的研究,常使用的方法分為knowledge-based和corpus-based兩種,knowledge-based是使用先前已由專家定義好的字典來分析,corpus-based則是根據基礎字並以自動化的方式擴增出情緒字及其強度。 本研究是使用corpus-based這個方法。
  6. 接下來介紹系統架構和研究方法
  7. 這是本研究的架構,首先我們會先從網路抓取股市新聞,並將新聞資料分為已標記情感和未標記情感兩類,已標記新聞用於基礎字的挑選,未標記的新聞用於情感字的擴增…
  8. 亂度愈亂愈不好(值愈小愈好)
  9. **
  10. 接著介紹本研究所提出的“文脈熵模型”方法,文脈熵模型會依照這三個程序挑選情緒字和計算強度
  11. L-d+1的公式解釋:M代表距離愈近權重值就愈高 運用右邊的計算公式計算存放於左邊維度中