Can Reinforcement Learning Be Used for Natural Language Processing?
The Reward Sparsity Problem and Approaches to It
김영삼
2018.8.9
NAVER Tech Talk
Table of contents
1. Motivation
2. Background
3. Experiments on Sentiment Polarity of Words
4. Experiment on Adverse Drug Reaction of nursing statements
5. Conclusion
1
Motivation
Basic motivation and research questions
Basic motivation
How can computational reinforcement learning be applied to
questions in NLP?
Two research questions
• Prediction problem of on-line values of words in text
• Prediction problem of on-line values of text
2
More specific research questions
• Prediction problem of on-line sentiment polarity values of words
• Prediction problem of on-line Adverse Drug Reaction of nursing
statements
Why focus on on-line processing? → “language processing is known to be
on-line”
3
Background
Reinforcement learning and language learning
In Verbal Behavior, Skinner (1957) argues that language learning can be
explained by association of stimulus and reinforcement.
Chomsky (1959) criticized this argument for the following reasons:
• Poverty of stimulus in language learning
• Poverty of rewards or penalties in language learning
4
Characteristics of RL
• Evaluative and delayed feedback
• No supervisor, only reward signal
• Time matters
• Agent’s actions affect the subsequent data it receives
• Sampling approach
• Approximated value functions
• Trial and error approach
5
Similarity of RL and language processing
Immediacy of interpretation: language processing is incremental
processing (Marslen-Wilson, 1973, 1975).
Syntactic processing is incremental.
• Syntactic parsing is not delayed.
• Syntactic reanalysis is costly.
• e.g. “The defendant examined by the lawyer turned out to be
unreliable.”
Semantic processing is also on-line.
• Reading time increases when a gender violation occurs in anaphora
resolution.
• Simple linguistic inference also comes with on-line processing.
6
Different tasks in RL and NLP
Reinforcement learning
• Robotics
• Game control
Natural language processing
• POS-tagging
• Anaphora resolution
• Syntactic parsing
• Sentiment analysis
• Question and answering
• Machine translation
7
Difficulties in applying RL to NLP
Cost in exploration
• Exploration/Exploitation dilemma
• The cost is high when state/action sizes are large
• Long training time
Problem of sparsity of rewards
• Some learning problems suffer from reward sparseness in model-free
methods
• If rewards are sparse, learning will be very difficult
8
Example: random-walk experiment using TD(λ)
9
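For reference, here is a minimal Python sketch of the 19-state random-walk MRP behind Figures 1 and 2, assuming the standard setup (start in the center state, reward −1/+1 only on the left/right terminal transition); the (state, reward, next_state) episode format is reused in the later sketches.

import random

def random_walk_episode(n_states=19, seed=None):
    # One episode of the 19-state random walk: start in the center state and
    # step left or right with equal probability until falling off either end.
    # Rewards are 0 everywhere except the terminal transition: -1 at the left
    # end, +1 at the right end. Returns (state, reward, next_state) triples,
    # with next_state = None at termination.
    rng = random.Random(seed)
    s = n_states // 2
    episode = []
    while True:
        s_next = s + rng.choice((-1, 1))
        if s_next < 0:
            episode.append((s, -1, None))
            return episode
        if s_next >= n_states:
            episode.append((s, +1, None))
            return episode
        episode.append((s, 0, s_next))
        s = s_next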
Temporal difference learning
• A core algorithm of reinforcement learning
• TD methods learn directly from episodes of experience
• TD is model-free: no knowledge of MDP / MRP
• TD learns from incomplete episodes, by bootstrapping
Temporal-difference learning seems a natural solution for on-line natural
language processing problems.
10
Markov Reward Process
An MRP, a Markov Decision Process without actions, consists of four
components.
Definition
• S is a finite set of states.
• P is a state transition probability matrix, Pss′ = P[St+1 = s′ | St = s].
• R is a reward function (rewards r ∈ R).
• γ is a discount factor, γ ∈ [0, 1].
11
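To make the definition concrete, a toy MRP with a known model can be written down and solved directly; this is a minimal sketch with purely illustrative numbers.

import numpy as np

# A toy 3-state MRP: P[i, j] = P[S_{t+1} = j | S_t = i], R[i] is the expected
# reward received on leaving state i, and gamma is the discount factor.
P = np.array([[0.5, 0.5, 0.0],
              [0.0, 0.5, 0.5],
              [0.0, 0.0, 1.0]])
R = np.array([0.0, 1.0, 0.0])
gamma = 0.9

# With the model known, the state values solve the Bellman equation
# V = R + gamma * P @ V, i.e. V = (I - gamma * P)^(-1) R.
V = np.linalg.solve(np.eye(3) - gamma * P, R)

The model-free TD methods above estimate the same values from sampled episodes only, without knowing P and R.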
Value function
A value function is defined as below:
V(St) = E[ Σ_{k=0..∞} γ^k Rt+k+1 | St = s ]   (1)
The Monte-Carlo update of the value function is
V(St) ← V(St) + α( Σ_{k=0..∞} γ^k Rt+k+1 − V(St) )   (2)
and the update for the simplest TD method, TD(0), is
V(St) ← V(St) + α( Rt+1 + γV(St+1) − V(St) )   (3)
where α is the learning rate.
12
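A minimal tabular sketch of updates (2) and (3), with V as a Python dict defaulting to 0 and episodes given as (state, reward, next_state) triples as above:

def mc_update(V, episode, alpha, gamma):
    # Every-visit Monte-Carlo update (Eq. 2): move V(s) toward the return G_t,
    # computed backwards over an episode of (state, reward, next_state) triples.
    G = 0.0
    for s, r, _ in reversed(episode):
        G = r + gamma * G
        V[s] = V.get(s, 0.0) + alpha * (G - V.get(s, 0.0))

def td0_update(V, s, r, s_next, alpha, gamma):
    # One-step TD update (Eq. 3), applied on-line after each observed
    # transition; s_next is None at the terminal step.
    v_next = V.get(s_next, 0.0) if s_next is not None else 0.0
    V[s] = V.get(s, 0.0) + alpha * (r + gamma * v_next - V.get(s, 0.0))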
TD(λ) method
TD(λ) of Sutton (1984, 1988) combines the simplest TD and
Monte-Carlo methods in an incremental framework with the introduction
of eligibility traces.
The method is made incremental with the traces and the trace-decay
parameter, λ ∈ [0, 1], which determines where to interpolate between the
MC and TD(0) updates.
When λ = 0, the update is equivalent to TD(0) and λ = 1 provides an
every-visit MC update.
13
Eligibility traces
The eligibility trace implements the ‘backward view’ mechanism of
TD(λ).
On each step, the trace values of all states decay by γλ, while the trace of
the visited state is incremented in one of several ways, depending on the
trace type.
14
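The usual increment rules (accumulating, replacing, and Dutch traces, the variants compared in the later experiments) can be sketched as below; this is a generic illustration, and the 'significance' traces used later are the author's own variant, not reproduced here.

def bump_trace(e, s, trace_type, alpha=0.1):
    # Increment rule for the trace of the visited state s, assuming the
    # gamma*lambda decay has already been applied to every trace this step.
    if trace_type == 'accumulating':
        e[s] = e.get(s, 0.0) + 1.0
    elif trace_type == 'replacing':
        e[s] = 1.0
    elif trace_type == 'dutch':
        e[s] = (1.0 - alpha) * e.get(s, 0.0) + 1.0
    return e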
Algorithm 1: Fast TD(λ) with replacing traces
1 Initialize V (s) arbitrarily and let e(s) = 0 for all s ∈ S;
2 H ← new hash table;
3 repeat
4 while st not at end of the episode do
5 observe reward, r, and st+1;
6 δ ← r + γV (st+1) − V (st );
7 e(st ) ← 1;
8 if H does not contain st then
9 insert st into H;
10 for all h ∈ H do
11 if e(h) ≤ 0.001 then
12 e(h) ← 0;
13 remove h from H;
14 continue;
15 V (h) ← V (h) + αδe(h);
16 e(h) ← γλe(h);
17 until the episode is terminal;
15
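A runnable Python sketch of Algorithm 1, assuming tabular values in a dict and episodes as (state, reward, next_state) triples as in the earlier sketches:

def td_lambda_replacing(episodes, alpha=0.1, gamma=1.0, lam=0.9, eps=1e-3):
    # Tabular values live in the dict V; only states with non-negligible traces
    # are kept in the dict H (the "hash table"), so each step touches few
    # states even when the state space is large. `episodes` is an iterable of
    # episodes, each a list of (state, reward, next_state) triples with
    # next_state = None at the terminal step.
    V = {}
    for episode in episodes:
        H = {}                                    # active eligibility traces
        for s, r, s_next in episode:
            v_next = V.get(s_next, 0.0) if s_next is not None else 0.0
            delta = r + gamma * v_next - V.get(s, 0.0)
            H[s] = 1.0                            # replacing trace for st
            for h in list(H):
                if H[h] <= eps:                   # prune negligible traces
                    del H[h]
                    continue
                V[h] = V.get(h, 0.0) + alpha * delta * H[h]
                H[h] *= gamma * lam
    return V

Feeding it episodes from the random-walk sketch above, e.g. td_lambda_replacing([random_walk_episode(seed=i) for i in range(1000)]), gives the kind of runs summarized in Figures 1 and 2.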
Sutton’s results
Figure 1: Performance of on-line TD(λ) on the 19-state random walk task
16
My replication of TD methods
Figure 2: Performance of TD methods on the 19-state random walk task
17
Experiments on Sentiment
Polarity of Words
Problem formulation
A movie review is represented as a sequence of words (states):
w1, w2, . . . , wt, . . . , wT where T is the length of the review.
State
A state is defined as a word type in the corpus vocabulary.
Reward
We regard the classification label of a text as the reward: +1 for the
positive label, −1 for the negative label.
18
Problem formulation
Figure 3: An example of a 6-state Markov Reward Process. The numbers on
the arrows indicate the rewards; which of the two terminal-state values (+1
or −1) is received is determined by the label of the review.
Figure 4: In this MRP, every reward is returned by the classification label.
19
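One way to realize the setting of Fig. 3 in code, assuming 'pos'/'neg' label strings and the episode format used in the earlier sketches:

def review_to_episode(words, label):
    # Each word type is a state; every reward is 0 except the terminal
    # transition, whose reward is given by the review label (+1 positive,
    # -1 negative). Returns (state, reward, next_state) triples.
    terminal_reward = 1 if label == 'pos' else -1
    episode = []
    for i, w in enumerate(words):
        last = (i == len(words) - 1)
        episode.append((w, terminal_reward if last else 0,
                        None if last else words[i + 1]))
    return episode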
Datasets
Movie Review Dataset We use the polarity dataset v2.0 for the
indirect evaluation (Pang and Lee, 2004), which consists of 1,000
positive and 1,000 negative movie reviews.
Stanford Sentiment Treebank The data is based on 11,855 single
sentences extracted from movie reviews and contains sentiment polarity
values for all phrases which are annotated by 3 human judges (Socher et
al., 2013).
Large Movie Review Dataset This corpus for binary sentiment
classification (Maas et al., 2011) is used to train the LSTM sentiment
classifier of our method.
20
Configuration of experiments
• Experiment 1: Hyper-parameter exploration with feature selection
paradigm (with setting in Fig. 3)
• Experiment 2: Evaluation with feature selection paradigm (with
setting in Fig. 4)
• Experiment 3: Direct evaluation with Stanford Sentiment Treebank
(with Setting in Fig. 3)
21
Setting of Experiment 1
Naive Bayes classification is performed based on the top 10,000 selected
words.
10-fold cross validation is used for each condition and the accuracies are
averaged.
Conditions of TD methods: Hyper-parameter combinations of learning
rate (0.1∼0.5) and trace-decay rate (0.1∼1.0) with step size of 0.1
Incremental means of the TD values over time steps are used to estimate
word values (sketched after this slide).
Compared feature selection methods
• Document Frequency
• Averaged TF-IDF
• χ² statistic (CHI)
• Information Gain
22
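A sketch of the word-estimation step described above; the running mean follows the slide, while ranking words by the absolute mean value to pick the top 10,000 NB features is an assumption about the selection criterion.

from collections import defaultdict

mean_value = defaultdict(float)
count = defaultdict(int)

def update_mean(word, v):
    # Incremental (running) mean of each word's TD value across time steps.
    count[word] += 1
    mean_value[word] += (v - mean_value[word]) / count[word]

def top_features(k=10000):
    # Assumed ranking criterion: largest absolute mean value first.
    return sorted(mean_value, key=lambda w: abs(mean_value[w]), reverse=True)[:k]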
Results of Experiment 1 (hyper-parameters)
Figure 5: Performance of TD with replacing traces as function of λ
23
Results of Experiment 1 (accuracies)
Method NB Accuracy
TD(1) with accumulation 0.84
TD(1) with replacing 0.83
TD(1) with Dutch 0.83
TD(1) with significance 0.83
True Online TD(1) 0.78
Simple Averages 0.64
Document Frequency 0.67
TF-IDF 0.69
χ² statistic (CHI) 0.66
Information Gain 0.83
Table 1: 10-fold cross validation accuracies of the TD methods and the feature
selection methods
24
Setting of Experiment 2
An LSTM sentiment classifier is trained on Large Movie Review Dataset.
A movie review is split into sentences.
The LSTM sentiment classifier gives a label (+1 or −1) to each string,
which is formed incrementally as below:
This
This is
This is a
This is a good
This is a good movie
In this setting, 10 accuracies from 10 samplings of training/test sets are
averaged, using the last-step TD values of the words (a sketch of the
per-prefix labeling follows).
25
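A sketch of how the per-step rewards of Fig. 4 could be produced; classifier.predict_label is a hypothetical interface standing in for the trained LSTM.

def prefix_rewards(words, classifier):
    # The sentence is rebuilt word by word and the sentiment classifier labels
    # every prefix; that label (+1 or -1) becomes the reward for the step.
    rewards, prefix = [], []
    for w in words:
        prefix.append(w)
        rewards.append(classifier.predict_label(' '.join(prefix)))
    return rewards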
Model-based reinforcement learning
• A model M is a representation of an MDP/MRP, parametrized by η.
• A model M = (Pη, Rη) represents state transitions Pη ≈ P and
rewards Rη ≈ R
St+1 ∼ Pη(St+1|St)
Rt+1 = Rη(Rt+1|St)
• Model-based RL plans the value function from the model.
26
Optimal value of trace-decay rate in Setting 2
Figure 6: Performance of TD with replacing traces as function of λ
27
Results of Experiment 2
Method NB Accuracy
TD(λ) with accumulation traces 0.82
TD(λ) with Dutch traces 0.82
TD(λ) with replacing traces 0.82
TD(λ) with significance traces 0.82
Information Gain 0.75
Table 2: Classification Accuracy of TD methods with Setting of Fig. 4
Note that the ‘Information Gain’ method is based on the sentence-based
dataset.
28
Setting of Experiment 3
For the direct evaluation, correlation coefficients between the estimated
values and the labeled polarities are calculated.
The labels are real values ranging from 0 to 1 (Stanford Sentiment
Treebank).
TD methods follow the previous setting (the setting of Fig. 3).
Full dataset: all 21,684 words in the dataset
Reduced dataset: 4,532 words tagged as adjective, adjective (superlative),
adverb, or adverb (comparative).
For comparison, the estimation method based on Bayes probabilities
(Potts, 2010) is used.
29
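The direct evaluation can be sketched as below (estimated and labeled are word-to-value dicts; this is illustrative, not the exact evaluation script):

from scipy.stats import pearsonr, spearmanr

def correlations(estimated, labeled):
    # Compare only the words present in both dictionaries.
    words = sorted(set(estimated) & set(labeled))
    x = [estimated[w] for w in words]
    y = [labeled[w] for w in words]
    return pearsonr(x, y)[0], spearmanr(x, y)[0]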
Results of Experiment 3 (full set)
Method Pearson (full) Spearman (full)
Bayes Prob. (Potts, 2010) 0.21 0.2
TD(1) with replacing 0.24 0.21
TD(1) with Dutch 0.24 0.21
TD(1) with significance 0.24 0.21
TD(1) with accumulation 0.24 0.21
Table 3: Correlations between the estimation results and the human-labeled
polarity values of the 24,684 words.
30
Results of Experiment 3 (reduced set)
Method Pearson (reduced) Spearman (reduced)
Bayes Prob. (Potts, 2010) 0.32 0.3
TD(1) with replacing 0.38 0.35
TD(1) with Dutch 0.38 0.35
TD(1) with significance 0.38 0.35
TD(1) with accumulation 0.38 0.34
Table 4: Correlations between the estimation results and the human-labeled
polarity values of the 4,532 words.
31
Plots of the labeled and estimated values
Figure 7: Data plots of the annotated, Bayesian and the TD(1) with
significance traces values of the total dataset (from the lowest to the highest)
32
Summary of the experiment results
• TD methods achieve the same level of performance as other feature
selection methods.
• TD-based estimates are more differentiated, providing more realistic
values.
• TD-based methods provide an easy tool for on-line estimation of
words.
33
Summary of the differences between the two settings
• In setting 1, TD methods with λ = 1 show the best performance.
• In setting 2, TD methods with λ = 0.7 show the best performance.
• TD methods with setting 2 show better performance overall.
• Thus, for TD methods with setting 2, the best performance is obtained
with an intermediate value of λ.
• Note that the TD method with setting 2 is a model-based approach.
34
Experiment on Adverse Drug
Reaction of nursing statements
Data
• Data source: Ajou University hospital
• Nursing statements from 8,316 patients
• 4,158 ADR labeled patients
• 4,158 non-ADR labeled patients
• Average number of sentences per patient: 421
• Largest number of sentences of a patient: 10,625
• Total number of sentence types: 837,293
35
Data example
Time Sentence
20120627055500 심호흡 교육함
20120627055500 구강간호 시행함
20120627063000 오심 감소함
20120627063000 침상안정 중임
20120627080000 수액 주입중임Right.arm 20G
20120627080000 정맥주사부위 통증,부종,발적 없음
20120627080000 일혈 위험약물과 증상에 대해 교육함
20120627080000 정맥주사부위 통증,부종,발적 없음
20120627080000 금식 중임
20120627080000 갈증 없음
20120627080000 수분부족 증상 관찰함
36
Other methods: NB
• 9,647 sentence types are used whose frequency is greater than 20.
• Train, Dev, Test sets at ratio of 8:1:1
• Information Gain is used to select N-best features.
• Grid-search is performed to find the best N (3700)
37
Other methods: SVM
• 9,647 sentence types are used whose frequency is greater than 20.
• Train, Dev, Test sets at ratio of 8:1:1
• Linear and RBF models are both used.
• Grid-search is performed to find the best parameters (Table 5).
Min DF Max DF (proportion) Gamma C
5 0.5 0.1 4
Table 5: Parameter values used in SVM
38
Other methods: CNN
• All 837,293 sentence types are used.
• Train, Dev, Test sets at ratio of 8:1:1
• Pretrained paragraph vectors (Le & Mikolov, 2014) are used for
embedding condition.
• In ADR cases, the latest 288 statements up to the ADR date are used.
• In non-ADR cases, the latest 288 statements up to a random index are
used.
39
Other methods: LSTM
• All 837,293 sentence types are used.
• Train, Dev, Test sets at ratio of 8:1:1
• Pretrained paragraph vectors (Le & Mikolov, 2014) are used for
embedding condition.
• In ADR cases, the latest 200 statements up to the ADR date are used.
• In non-ADR cases, the latest 200 statements up to a random index are
used.
40
Problem formulation
Each statement is represented as a state of the patient:
s1, s2, . . . , st, . . . , sT , where T is the number of statements.
State
A state is defined as a functional event of a statement.
Reward
We regard the classification label of the statements as the reward: +1 for
the ADR label, −1 for the non-ADR case.
41
Memory-based function approximation
• Non-parametric function approximation
• When a query state is given, it retrieves related training examples.
• The state value is then estimated from those examples.
• It provides a way to avoid the curse of dimensionality.
42
Building training examples
A dataset of 400 patients is sampled, and a nursing expert marked a
statement if it seemed highly related to ADR.
Sentence Label
오심 없음
구토 없음
침상안정 중임
2(구두처방)Tramadol T
주관적 진술:”머리랑 배가 아파요.” T
복부 통증 호소함 T
통증 양상 관찰함 T
통증완화 위한 다양한 방법 격려함 T
의사에게 알림 T
심호흡 교육함
구강간호 시행함
오심 감소함
43
Building training examples
The labeled statements are manually classified into 7 events as follows:
State Meaning Example
0 Unknown event CT찍고 옴.
1 Drug related event epocelin 1g 투여 함
2 Abnormal reactions 피부 가려움 호소함부위:both arm
3 Doctor related event 의사에게 알림의사:xxx
4 Subjective response 주관적 진술:몸이 가벼워요
5 1 & 2 event Tramadol 맞은 후 구토 2회 함
6 4 & 2 event 주관적 진술:속이 계속 울렁거리네요.
Table 6: Table of defined states and example sentences
44
Building NB classifier
A Naive Bayes classifier is trained on binary feature functions and
unigrams in each sentence.
Feature functions:
• is_high_temp: positive if body temperature > 37.4
• is_subjective: positive if a subjective statement marker is present
• is_drug_related: positive if a prescription statement or drug names
appear
• is_negation: positive if a negation expression is present
• is_reactive: positive if ADR-related symptoms are present
Accuracy of the classifier: 0.72
45
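A rough sketch of these binary feature functions; the small lexicons and the decimal-number heuristic for body temperature are assumptions, with the markers taken from the statement examples shown earlier.

import re

SUBJECTIVE_MARKER = '주관적 진술'                  # marker seen in the data examples
NEGATION_MARKER = '없음'                           # e.g. "... 없음" in the examples
DRUG_TERMS = {'tramadol', 'epocelin'}              # placeholder drug lexicon
ADR_SYMPTOMS = {'오심', '구토', '가려움', '발적'}   # placeholder symptom lexicon

def binary_features(sentence):
    # Extract decimal numbers as candidate body temperatures (heuristic).
    numbers = [float(x) for x in re.findall(r'\d+\.\d+', sentence)]
    lowered = sentence.lower()
    return {
        'is_high_temp': any(t > 37.4 for t in numbers),
        'is_subjective': SUBJECTIVE_MARKER in sentence,
        'is_drug_related': any(d in lowered for d in DRUG_TERMS),
        'is_negation': NEGATION_MARKER in sentence,
        'is_reactive': any(s in sentence for s in ADR_SYMPTOMS),
    }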
Memory-based reward function
The ADR environment provides sparse rewards, so the reward function is
modeled as below:
def get_reward(n_state, n_idx):
    # n_state: classification label of the episode ('adr' or 'nor');
    # n_idx: event index of the current statement (see Table 6).
    if n_state == 'adr':
        return 1                    # ADR label
    elif n_state == 'nor':
        return -1                   # non-ADR label
    elif n_idx in (2, 3, 4, 5, 6):  # ADR-related intermediate events
        return 1
    else:
        return 0
46
Training
• 8,316 episodes are split into Train, Validation, Test sets at ratio of
8:1:1.
• Replacing traces are used for the eligibility traces.
Parameter Value
Learning rate (α) 0.1
Trace-decay rate (λ) 0.3
Discount factor (γ) 0.1
Table 7: Table of values of the parameters
47
Estimated state values from train set
State Meaning Value
0 Unknown event NA
1 Drug related event 0.17
2 Abnormal reactions 0.47
3 Doctor related event 0.27
4 Subjective response 0.51
5 1 & 2 event 0.18
6 4 & 2 event 0.80
Table 8: Estimated values of states using TD(λ)
48
Figure 8: Graph of state values across train datasets
49
Classification based on state values
• The state values corresponding to the states in the Validation set are
averaged (the value of the unknown state is regarded as 0).
• Simple logistic regression is performed on the Validation set (sketched
after Table 9).
• The regression classifier is tested on the Test set.
Method Accuracy
NB 0.64
SVM (linear) 0.63
SVM (RBF) 0.63
CNN 0.58
CNN (with embedding) 0.58
LSTM 0.61
LSTM (with embedding) 0.57
TD-based logistic regression 0.61
Table 9: Results of ADR classifications
50
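A sketch of this classification step, assuming the estimated state values from Table 8 are available as a dict; the commented usage with X_val, y_val, etc. uses hypothetical variable names.

import numpy as np
from sklearn.linear_model import LogisticRegression

def patient_score(event_sequence, state_values):
    # Mean of the estimated state values over a patient's statement events;
    # unknown events (state 0) contribute 0, as described above.
    values = [state_values.get(e, 0.0) for e in event_sequence]
    return float(np.mean(values)) if values else 0.0

# Hypothetical usage: scores from the Validation set fit the classifier,
# scores from the Test set evaluate it.
# X_val = np.array([patient_score(seq, V) for seq in val_sequences]).reshape(-1, 1)
# clf = LogisticRegression().fit(X_val, y_val)
# accuracy = clf.score(X_test, y_test)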
Incremental analysis using TD learning
• The proposed method is on-line, so it allows us to monitor signs of
ADR in real time.
• This model captures our intuitive response when reading ADR
events.
51
Conclusion
General Conclusion
• RL methods provide incremental learning, which is useful for some
practical cases.
• Model-based RL methods are nice options for cognitive processing of
texts.
• In NLP tasks, RL methods are difficult to apply because of sparsity
of rewards.
• Model-based RL methods are useful to avoid the reward sparseness
problem.
52
Future study
• Model learning in model-based RL needs to be improved during
planning.
• Thus, supervised models can be improved when combined with RL.
• An experience replay mechanism can be used for such an improvement
process.
• With this approach, RL methods might be used in many supervised
learning tasks.
53
References
Bo Pang and Lillian Lee. (2004). A sentimental education: Sentiment analysis using
subjectivity summarization based on minimum cuts. ACL
Bond, C., & Raehl, C. L. (2006). Adverse drug reactions in united states hospitals.
Pharmacotherapy: The Journal of Human Pharmacology and Drug Therapy, 26(5),
601–608.
Chomsky, N. (1959). A review of BF Skinner’s verbal behavior. Language, 35(1),
26–58.
Kim, Y. (2014). Convolutional neural networks for sentence classification. ArXiv
Preprint ArXiv:1408.5882.
Le, Q., & Mikolov, T. (2014, January). Distributed representations of sentences and
documents. In International Conference on Machine Learning (pp. 1188-1196).
Marslen-Wilson, W. (1973). Linguistic structure and speech shadowing at very short
latencies. Nature, 244(5417), 522.
Marslen-Wilson, W. (1975). Sentence perception as an interactive parallel process.
Science,189(4198), 226–228.
A. Maas, R. Daly, P. Pham, D. Huang, A. Ng, and C. Potts. (2011). Learning word
vectors for sentiment analysis. ACL
54
Potts, C. (2010). On the negativity of negation. Semantics and Linguistic Theory,
20(0), 636–659.
van Seijen, H., & Sutton, R. S. (2014, January). True online TD(λ). In PMLR (pp.
692–700).
Skinner, B. (1957). Verbal Behavior. New York: Appleton-Century-Crofts.
R. Socher, A. Perelygin, J. Y. Wu, J. Chuang, C. D. Manning, A. Y. Ng, and C. Potts.
(2013). Recursive deep models for semantic compositionality over a sentiment
treebank. In Proceedings of EMNLP (pp. 1631–1642).
Sutton, Richard S. (1988). Learning to predict by the methods of temporal differences.
Machine Learning, 3(1), 9–44.
Sutton, Richard Stuart. (1984). Temporal credit assignment in reinforcement learning
(Ph.D. Dissertation). University of Massachusetts, Amherst, MA.
Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: An introduction. MIT
press Cambridge.
55