Speaker: Youngsam Kim (Ph.D., Seoul National University)
Date: August 2018
With the 2015 Atari game-control results and AlphaGo's 2016 victory over the world's top Go players, reinforcement learning has drawn the attention of many machine learning researchers, yet no clear strategy for applying it to natural language processing has emerged. This talk points to the sparsity of rewards as one of the main reasons reinforcement learning is hard to apply to NLP problems, and discusses model-based reinforcement learning and memory-based approaches as possible remedies. To illustrate these possibilities, two tasks were carried out with temporal-difference learning, estimating the sentiment values of words and estimating the state values of nursing statements for adverse drug reactions, and the talk discusses how the methods can be used and what the results mean.
Can Reinforcement Learning Be Used for Natural Language Processing? (The Reward Sparsity Problem and Possible Solutions)
1. Can Reinforcement Learning Be Used for Natural Language Processing?
The Reward Sparsity Problem and Possible Solutions
Youngsam Kim
August 9, 2018
Naver Tech Talk
2. Table of contents
1. Motivation
2. Background
3. Experiments on Sentiment Polarity of Words
4. Experiment on Adverse Drug Reactions in Nursing Statements
5. Conclusion
4. Basic motivation and research questions
Basic motivation
How can computational reinforcement learning be applied to
questions in NLP?
Two research questions
• Prediction problem of on-line values of words in text
• Prediction problem of on-line values of text
5. More specific research questions
• Prediction problem of on-line sentiment polarity values of words
• Prediction problem of on-line Adverse Drug Reaction of nursing
statements
Why focus on on-line processing? → “language processing is known to be
on-line”
7. Reinforcement learning and language learning
In Verbal Behavior, Skinner (1957) argues that language learning can be
explained by association of stimulus and reinforcement.
Chomsky (1959) criticized this argument on the following grounds:
• Poverty of stimulus in language learning
• Poverty of rewards or penalties in language learning
8. Characteristics of RL
• Evaluative and delayed feedback
• No supervisor, only reward signal
• Time matters
• Agent’s actions affect the subsequent data it receives
• Sampling approach
• Approximated value functions
• Trial and error approach
9. Similarity of RL and language processing
Immediacy of interpretation: language processing is incremental
processing (Marslen-Wilson, 1973, 1975).
Syntactic processing is incremental.
• Syntactic parsing is not delayed.
• Syntactic reanalysis is costly.
• e.g. “The defendant examined by the lawyer turned out to be
unreliable.”
Semantic processing is also on-line.
• Reading times increase when a gender violation occurs in anaphora
resolution.
• Simple linguistic inferences are also drawn on-line.
10. Different tasks in RL and NLP
Reinforcement learning
• Robotics
• Game control
Natural language processing
• POS-tagging
• Anaphora resolution
• Syntactic parsing
• Sentiment analysis
• Question answering
• Machine translation
11. Difficulties in applying RL to NLP
Cost in exploration
• Exploration/Exploitation dilemma
• The cost is high when state/action sizes are large
• Long training time
Problem of sparsity of rewards
• Some learning problems suffer from reward sparseness in model-free
methods
• If rewards are sparse, learning will be very difficult
13. Temporal difference learning
• A core algorithm of reinforcement learning
• TD methods learn directly from episodes of experience
• TD is model-free: no knowledge of MDP / MRP
• TD learns from incomplete episodes, by bootstrapping
Temporal-difference learning seems a natural solution for on-line natural
language processing problems.
14. Markov Reward Process
An MRP is a Markov Decision Process without actions; it consists of four
components.
Definition
• S is a finite set of states.
• P is a state transition probability matrix, P_{ss'} = P[S_{t+1} = s' | S_t = s].
• R is a reward function, R_s = E[R_{t+1} | S_t = s].
• γ is a discount factor, γ ∈ [0, 1].
15. Value function
A value function is defined as below:

V(s) = E[ Σ_{k=0}^{∞} γ^k R_{t+k+1} | S_t = s ]    (1)

A value-function update for Monte-Carlo learning is

V(S_t) = V(S_t) + α ( Σ_{k=0}^{∞} γ^k R_{t+k+1} − V(S_t) )    (2)

A value-function update for the simplest TD method, TD(0), is as follows:

V(S_t) = V(S_t) + α ( R_{t+1} + γ V(S_{t+1}) − V(S_t) )    (3)

where α is the learning rate.
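As an illustration of the TD(0) update in (3), here is a minimal Python sketch (not from the talk; the episode format and names are illustrative), applying the rule to a tabular value function:

from collections import defaultdict

def td0_update(episode, V, alpha=0.1, gamma=0.9):
    # Apply V(S_t) = V(S_t) + alpha * (R_{t+1} + gamma * V(S_{t+1}) - V(S_t)).
    # `episode` is a list of (state, reward-on-entering-this-state) pairs;
    # the last state is terminal, and V defaults to 0 for unseen states.
    for t in range(len(episode) - 1):
        s, _ = episode[t]
        s_next, r_next = episode[t + 1]
        td_error = r_next + gamma * V[s_next] - V[s]
        V[s] += alpha * td_error
    return V

V = defaultdict(float)
td0_update([("this", 0.0), ("movie", 0.0), ("rocks", 0.0), ("<END>", 1.0)], V)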
16. TD(λ) method
TD(λ) of Sutton (1984, 1988) combines the simplest TD and
Monte-Carlo methods in an incremental framework with the introduction
of eligibility traces.
The method is made incremental with the traces and the trace-decay
parameter, λ ∈ [0, 1], which determines where to interpolate between the
MC and TD(0) updates.
When λ = 0, the update is equivalent to TD(0) and λ = 1 provides an
every-visit MC update.
13
17. Eligibility traces
The eligibility trace implements the ‘backward view’ mechanism of
TD(λ).
On each step, the trace of every state decays by γλ, while the trace of the
currently visited state is incremented (or reset), depending on the trace type.
18. Algorithm 1: Fast TD(λ) with replacing traces
Initialize V(s) arbitrarily and let e(s) = 0 for all s ∈ S
H ← new hash table
repeat
    while s_t is not at the end of the episode do
        observe reward r and s_{t+1}
        δ ← r + γ V(s_{t+1}) − V(s_t)
        e(s_t) ← 1
        if H does not contain s_t then
            insert s_t into H
        for all h ∈ H do
            if e(h) ≤ 0.001 then
                e(h) ← 0
                remove h from H
                continue
            V(h) ← V(h) + α δ e(h)
            e(h) ← γ λ e(h)
until the episode is terminal
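A minimal Python sketch of this procedure (an assumption about how it could be implemented, not the talk's exact code; the episode is a list of states followed by a single terminal reward, as in the word-level setting later in the talk):

from collections import defaultdict

def td_lambda_replacing(episode, terminal_reward, V, alpha=0.1, gamma=1.0, lam=0.9):
    # V should default to 0.0 for unseen states (e.g. a defaultdict(float)).
    e = defaultdict(float)                  # eligibility traces; plays the role of the hash table H
    for t, s in enumerate(episode):
        terminal = (t == len(episode) - 1)
        r = terminal_reward if terminal else 0.0
        v_next = 0.0 if terminal else V[episode[t + 1]]
        delta = r + gamma * v_next - V[s]   # TD error
        e[s] = 1.0                          # replacing trace for the visited state
        for h in list(e):
            V[h] += alpha * delta * e[h]
            e[h] *= gamma * lam             # decay every active trace
            if e[h] <= 0.001:
                del e[h]                    # prune negligible traces, as in Algorithm 1
    return V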
22. Problem formulation
A movie review is represented as a sequence of words (states):
w1, w2, . . . , wt, . . . , wT where T is the length of the review.
State
A state is defined as a word type in the corpus vocabulary.
Reward
We regard the classification label of a text as the reward: +1 for a positive
label, −1 for a negative label.
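A small usage sketch of this formulation, reusing the hypothetical td_lambda_replacing function sketched above (tokenization and label are illustrative):

from collections import defaultdict

V = defaultdict(float)                                  # word-type values, default 0.0
review = "a stunning and deeply moving film".split()    # word states w1, ..., wT
label = 1                                               # +1 for a positive review, -1 for a negative one
td_lambda_replacing(review, terminal_reward=label, V=V, gamma=1.0, lam=1.0)
# After many labeled reviews, V[word] serves as the word's estimated sentiment polarity.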
23. Problem formulation
Figure 3: An example of a 6-state Markov Reward Process. The numbers on
the arrows indicate the rewards; which of the two values (+1 or −1) is given at
the terminal state is determined by the label of the review.
Figure 4: In this MRP, every reward is given by the classification label.
24. Datasets
Movie Review Dataset We use the polarity dataset v2.0 for the
indirect evaluation (Pang and Lee, 2004), which consists of 1,000
positive and 1,000 negative movie reviews.
Stanford Sentiment Treebank The data is based on 11,855 single
sentences extracted from movie reviews and contains sentiment polarity
values for all phrases which are annotated by 3 human judges (Socher et
al., 2013).
Large Movie Review Dataset This corpus for binary sentiment
classification (Maas et al., 2011) is used to train the LSTM sentiment
classifier of our method.
25. Configuration of experiments
• Experiment 1: Hyper-parameter exploration with feature selection
paradigm (with setting in Fig. 3)
• Experiment 2: Evaluation with feature selection paradigm (with
setting in Fig. 4)
• Experiment 3: Direct evaluation with Stanford Sentiment Treebank
(with Setting in Fig. 3)
26. Setting of Experiment 1
Naive Bayes classification is performed based on the top 10,000 selected
words.
10-fold cross validation is used for each condition and the accuracies are
averaged.
Conditions of TD methods: Hyper-parameter combinations of learning
rate (0.1∼0.5) and trace-decay rate (0.1∼1.0) with step size of 0.1
Incremental means of the TD values over time steps are used to estimate
word values.
Compared feature selection methods
• Document Frequency
• Averaged TF-IDF
• χ² statistic (CHI)
• Information Gain
27. Results of Experiment 1 (hyper-parameters)
Figure 5: Performance of TD with replacing traces as a function of λ
28. Results of Experiment 1 (accuracies)
Method                      NB Accuracy
TD(1) with accumulation     0.84
TD(1) with replacing        0.83
TD(1) with Dutch            0.83
TD(1) with significance     0.83
True Online TD(1)           0.78
Simple Averages             0.64
Document Frequency          0.67
TF-IDF                      0.69
χ² statistic (CHI)          0.66
Information Gain            0.83
Table 1: 10-fold cross validation accuracies of the TD methods and the feature
selection methods
29. Setting of Experiment 2
An LSTM sentiment classifier is trained on Large Movie Review Dataset.
A movie review is split into sentences.
The LSTM sentiment classifier assigns a label (+1 or −1) to each prefix
string, which is formed incrementally as below:
This
This is
This is a
This is a good
This is a good movie
This time, accuracies from 10 samplings of training/test sets are
averaged, using the last-step TD values of the words.
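A sketch of how per-step rewards could be obtained in this setting (classify stands for the trained LSTM classifier returning +1 or −1 for a string; the function name is an assumption):

def incremental_rewards(sentence_words, classify):
    # Label each growing prefix with the classifier and use the labels as rewards.
    rewards, prefix = [], []
    for w in sentence_words:
        prefix.append(w)
        rewards.append(classify(" ".join(prefix)))   # +1 or -1 for the current prefix
    return rewards

# e.g. incremental_rewards("This is a good movie".split(), classify)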
30. Model-based reinforcement learning
• A model M is a representation of an MDP/MRP, parametrized by η.
• A model M = (P_η, R_η) represents state transitions P_η ≈ P and
rewards R_η ≈ R:
S_{t+1} ∼ P_η(S_{t+1} | S_t)
R_{t+1} = R_η(R_{t+1} | S_t)
• Model-based RL plans the value function from the model.
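One simple way to learn such a model from data is a count-based table-lookup model; the sketch below is an illustration under that assumption (not the talk's exact procedure), estimating P_η and R_η from observed (state, reward, next state) triples:

from collections import defaultdict

def fit_table_model(episodes):
    counts = defaultdict(lambda: defaultdict(int))   # counts[s][s'] = observed s -> s' transitions
    reward_sum = defaultdict(float)
    visits = defaultdict(int)
    for episode in episodes:                         # episode: list of (state, reward, next_state)
        for s, r, s_next in episode:
            counts[s][s_next] += 1
            reward_sum[s] += r
            visits[s] += 1
    P = {s: {s2: n / visits[s] for s2, n in nexts.items()} for s, nexts in counts.items()}
    R = {s: reward_sum[s] / visits[s] for s in visits}   # expected immediate reward per state
    return P, R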
31. Optimal value of trace-decay rate in Setting 2
Figure 6: Performance of TD with replacing traces as a function of λ
32. Results of Experiment 2
Method NB Accuracy
TD(λ) with accumulation traces 0.82
TD(λ) with Dutch traces 0.82
TD(λ) with replacing traces 0.82
TD(λ) with significance traces 0.82
Information Gain 0.75
Table 2: Classification Accuracy of TD methods with Setting of Fig. 4
Note that the ‘Information Gain’ method is based on the sentence-based
dataset.
33. Setting of Experiment 3
For the direct evaluation, correlation coefficients between the estimated
values and the labeled polarities are calculated.
The labels are real values ranging from 0 to 1 (Stanford Sentiment
Treebank).
TD methods follow the previous setting (the setting of Fig. 3).
Full dataset: 21,684 words in the dataset
Reduced dataset: 4,532 words whose POS is adjective, adjective (superlative),
adverb, or adverb (comparative).
For comparison, an estimation method based on Bayes probabilities (Potts, 2010)
is used.
34. Results of Experiment 3 (full set)
Method Pearson (full) Spearman (full)
Bayes Prob. (Potts, 2010) 0.21 0.2
TD(1) with replacing 0.24 0.21
TD(1) with Dutch 0.24 0.21
TD(1) with significance 0.24 0.21
TD(1) with accumulation 0.24 0.21
Table 3: Correlations between the estimation results and the human-labeled
polarity values of the 24,684 words.
35. Results of Experiment 3 (reduced set)
Method Pearson (reduced) Spearman (reduced)
Bayes Prob. (Potts, 2010) 0.32 0.3
TD(1) with replacing 0.38 0.35
TD(1) with Dutch 0.38 0.35
TD(1) with significance 0.38 0.35
TD(1) with accumulation 0.38 0.34
Table 4: Correlations between the estimation results and the human-labeled
polarity values of the 4,532 words.
36. Plots of the labeled and estimated values
Figure 7: Plots of the annotated values, the Bayesian estimates, and the
TD(1)-with-significance-traces estimates for the full dataset (sorted from lowest to highest)
37. Summary of the experiment results
• TD methods achieve the same level of performance as other feature
selection methods.
• TD-based estimates are more differentiated, providing more realistic
values.
• TD-based methods provide an easy tool for on-line estimation of
words.
38. Summary of the differences between the two settings
• In setting 1, TD methods with λ = 1 show best performances.
• In setting 2, TD methods with λ = 0.7 show best performances.
• TD methods with setting 2 show better performances.
• Thus, for TD methods with setting 2, best performance is obtained
with an intermediate value of λ.
• Note that TD method with setting 2 is a model-based approach.
40. Data
• Data source: Ajou University hospital
• Nursing statements from 8,316 patients
• 4,158 ADR labeled patients
• 4,158 non-ADR labeled patients
• Average number of sentences per patient: 421
• Largest number of sentences of a patient: 10,625
• Total number of sentence types: 837,293
42. Other methods: NB
• 9,647 sentence types are used whose frequency is greater than 20.
• Train, Dev, Test sets at ratio of 8:1:1
• Information Gain is used to select N-best features.
• Grid-search is performed to find the best N (3700)
43. Other methods: SVM
• 9,647 sentence types are used whose frequency is greater than 20.
• Train, Dev, Test sets at ratio of 8:1:1
• Linear and RBF models are both used.
• Grid-search is performed to find the best parameters (Table 5).
Min DF   Max DF Proportion   Gamma   C
5        0.5                 0.1     4
Table 5: Parameter values used in SVM
44. Other methods: CNN
• All 837,293 sentence types are used.
• Train, Dev, Test sets at ratio of 8:1:1
• Pretrained paragraph vectors (Le & Mikolov, 2014) are used for the
embedding condition.
• In ADR cases, the latest 288 statements up to the ADR date are used.
• In non-ADR cases, the latest 288 statements up to a random index are
used.
45. Other methods: LSTM
• All 837,293 sentence types are used.
• Train, Dev, Test sets at ratio of 8:1:1
• Pretrained paragraph vectors (Le & Mikolov, 2014) are used for the
embedding condition.
• In ADR cases, the latest 200 statements up to the ADR date are used.
• In non-ADR cases, the latest 200 statements up to a random index are
used.
46. Problem formulation
Each statement is represented as a state of the patient:
s1, s2, . . . , st, . . . , sT, where T is the number of statements.
State
A state is defined as a functional event of a statement.
Reward
We regard the classification label of the statements as the reward: +1 for
the ADR label, −1 for the non-ADR case.
47. Memory-based function approximation
• Non-parametric function approximation
• When a query state is given, it retrieves related training examples.
• It then estimates the state value from those examples, as sketched below.
• It provides a way to avoid the curse of dimensionality.
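A minimal sketch of the idea, using a nearest-neighbour average (the distance function and memory format are illustrative assumptions):

def knn_value(query, memory, distance, k=5):
    # memory: list of (state, value) pairs collected from training examples.
    nearest = sorted(memory, key=lambda item: distance(query, item[0]))[:k]
    return sum(v for _, v in nearest) / len(nearest)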
48. Building training examples
A dataset of 400 patients is sampled, and a nursing expert marked a
statement if it seemed highly related to ADR.
Sentence  Label
오심 없음 (no nausea)
구토 없음 (no vomiting)
침상안정 중임 (on bed rest)
2(구두처방)Tramadol ((verbal order) Tramadol)  T
주관적 진술: "머리랑 배가 아파요." (subjective statement: "My head and stomach hurt.")  T
복부 통증 호소함 (complains of abdominal pain)  T
통증 양상 관찰함 (pain pattern observed)  T
통증완화 위한 다양한 방법 격려함 (encouraged various methods of pain relief)  T
의사에게 알림 (doctor notified)  T
심호흡 교육함 (deep-breathing education provided)
구강간호 시행함 (oral care performed)
오심 감소함 (nausea decreased)
49. Building training examples
The labeled statements are manually classified into 7 events as follows:
State  Meaning  Example
0  Unknown event  CT찍고 옴. (returned from CT scan)
1  Drug related event  epocelin 1g 투여 함 (epocelin 1 g administered)
2  Abnormal reactions  피부 가려움 호소함 부위: both arm (complains of itchy skin; site: both arms)
3  Doctor related event  의사에게 알림 의사: xxx (doctor notified; doctor: xxx)
4  Subjective response  주관적 진술: 몸이 가벼워요 (subjective statement: "My body feels light")
5  1 & 2 event  Tramadol 맞은 후 구토 2회 함 (vomited twice after receiving Tramadol)
6  4 & 2 event  주관적 진술: 속이 계속 울렁거리네요. (subjective statement: "My stomach keeps feeling queasy")
Table 6: Table of defined states and example sentences
50. Building NB classifier
A Naive Bayes classifier is trained on binary feature functions and the
unigrams in each sentence.
Feature functions:
• is_high_temp: positive if body temperature > 37.4
• is_subjective: positive if a subjective-statement marker is present
• is_drug_related: positive if a prescription statement or a drug name
appears
• is_negation: positive if a negation expression is present
• is_reactive: positive if ADR-related symptoms are present
Accuracy of the classifier: 0.72
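A hedged sketch of what such binary feature functions might look like (the keyword lists, temperature extraction, and helper names are purely illustrative):

import re

DRUG_TERMS = ["tramadol", "epocelin"]          # illustrative drug lexicon
ADR_SYMPTOMS = ["오심", "구토", "가려움"]        # nausea, vomiting, itching (illustrative)

def is_high_temp(sentence):
    # Positive if a recorded body temperature exceeds 37.4.
    m = re.search(r"\d{2}\.\d", sentence)
    return bool(m) and float(m.group()) > 37.4

def is_subjective(sentence):
    return "주관적 진술" in sentence             # subjective-statement marker

def is_drug_related(sentence):
    s = sentence.lower()
    return "구두처방" in sentence or any(d in s for d in DRUG_TERMS)

def is_reactive(sentence):
    return any(sym in sentence for sym in ADR_SYMPTOMS)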
51. Memory-based reward function
The ADR environment provides sparse rewards, so the reward function is
modeled as below:

def get_reward(n_state, n_idx):
    # Terminal states: the episode's classification label.
    if n_state == 'adr':
        return 1
    elif n_state == 'nor':
        return -1
    # Intermediate reward from the memory-based state classifier:
    # event indices 2-6 are ADR-related events (see Table 6).
    elif n_idx in (2, 3, 4, 5, 6):
        return 1
    else:
        return 0
52. Training
• 8,316 episodes are split into Train, Validation, Test sets at ratio of
8:1:1.
• Replacing traces are used as the eligibility traces.
Parameter Value
Learning rate (α) 0.1
Trace-decay rate (λ) 0.3
Discount factor (γ) 0.1
Table 7: Table of values of the parameters
53. Estimated state values from train set
State Meaning Value
0 Unknown event NA
1 Drug related event 0.17
2 Abnormal reactions 0.47
3 Doctor related event 0.27
4 Subjective response 0.51
5 1 & 2 event 0.18
6 4 & 2 event 0.80
Table 8: Estimated values of states using TD(λ)
55. Classification based on state values
• The state values corresponding to the states in the Validation set are
averaged (the value of the unknown state is regarded as 0).
• Simple logistic regression is performed on the Validation set.
• The regression classifier is tested on the Test set (see the sketch after this list).
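A minimal sketch of this final step, assuming per-patient lists of event indices, the state values of Table 8, and scikit-learn for the logistic regression (all of which are assumptions about tooling and data layout):

import numpy as np
from sklearn.linear_model import LogisticRegression

STATE_VALUES = {1: 0.17, 2: 0.47, 3: 0.27, 4: 0.51, 5: 0.18, 6: 0.80}   # Table 8; unknown state -> 0.0

def patient_feature(event_indices):
    # Average the TD state values over a patient's statements.
    vals = [STATE_VALUES.get(i, 0.0) for i in event_indices]
    return np.mean(vals) if vals else 0.0

# Hypothetical validation data: per-patient event sequences and ADR labels (1 = ADR, 0 = non-ADR).
X_val = np.array([[patient_feature(seq)] for seq in [[2, 4, 6], [0, 0, 1], [3, 2], [0]]])
y_val = np.array([1, 0, 1, 0])
clf = LogisticRegression().fit(X_val, y_val)
# The fitted classifier is then evaluated on the held-out Test set in the same way.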
Method Accuracy
NB 0.64
SVM (linear) 0.63
SVM (RBF) 0.63
CNN 0.58
CNN (with embedding) 0.58
LSTM 0.61
LSTM (with embedding) 0.57
TD-based logistic regression 0.61
Table 9: Results of ADR classifications
56. Incremental analysis using TD learning
• The proposed method is on-line, so it allows us to monitor signs
of ADR in real time.
• This model captures our intuitive response when reading ADR
events.
58. General Conclusion
• RL methods provide incremental learning, which is useful for some
practical cases.
• Model-based RL methods are nice options for cognitive processing of
texts.
• In NLP tasks, RL methods are difficult to apply because of sparsity
of rewards.
• Model-based RL methods are useful to avoid the reward sparseness
problem.
59. Future study
• Model learning in model-based RL needs to be improved during
planning.
• Thus, supervised models can be improved when combined with RL.
• An experience-replay mechanism can be used for this improvement
process.
• With this approach, RL methods might be used in many supervised
learning tasks.
60. References
Pang, B., & Lee, L. (2004). A sentimental education: Sentiment analysis using
subjectivity summarization based on minimum cuts. In Proceedings of ACL.
Bond, C., & Raehl, C. L. (2006). Adverse drug reactions in united states hospitals.
Pharmacotherapy: The Journal of Human Pharmacology and Drug Therapy, 26(5),
601–608.
Chomsky, N. (1959). A review of BF Skinner’s verbal behavior. Language, 35(1),
26–58.
Kim, Y. (2014). Convolutional neural networks for sentence classification. ArXiv
Preprint ArXiv:1408.5882.
Le, Q., & Mikolov, T. (2014, January). Distributed representations of sentences and
documents. In International Conference on Machine Learning (pp. 1188-1196).
Marslen-Wilson, W. (1973). Linguistic structure and speech shadowing at very short
latencies. Nature, 244(5417), 522.
Marslen-Wilson, W. (1975). Sentence perception as an interactive parallel process.
Science,189(4198), 226–228.
Maas, A., Daly, R., Pham, P., Huang, D., Ng, A., & Potts, C. (2011). Learning word
vectors for sentiment analysis. In Proceedings of ACL.
Potts, C. (2010). On the negativity of negation. Semantics and Linguistic Theory,
20, 636–659.
van Seijen, H., & Sutton, R. S. (2014). True online TD(λ). In Proceedings of ICML
(PMLR, pp. 692–700).
Skinner, B. (1957). Verbal Behavior. New York: Appleton-Century-Crofts.
Socher, R., Perelygin, A., Wu, J. Y., Chuang, J., Manning, C. D., Ng, A. Y., & Potts, C.
(2013). Recursive deep models for semantic compositionality over a sentiment
treebank. In Proceedings of EMNLP (pp. 1631–1642).
Sutton, Richard S. (1988). Learning to predict by the methods of temporal differences.
Machine Learning, 3(1), 9–44.
Sutton, Richard Stuart. (1984). Temporal credit assignment in reinforcement learning
(Ph.D. Dissertation). University of Massachusetts, Amherst, MA.
Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: An introduction.
Cambridge, MA: MIT Press.