Speaker: Youngsam Kim (Ph.D., Seoul National University)
Date: August 2018
With the 2015 Atari game-control results and AlphaGo's 2016 victory over the world's top Go players, reinforcement learning has drawn the attention of many machine learning researchers, yet no clear strategy for applying it to natural language processing has emerged. This talk points to the sparsity of rewards as one of the main reasons reinforcement learning is hard to apply to NLP problems, and discusses model-based reinforcement learning and memory-based approaches as possible remedies. To illustrate these possibilities, two tasks were carried out with temporal-difference learning, estimating the sentiment values of words and estimating the state values of nursing statements for adverse drug reactions, and the talk discusses how the methods can be used and what the results mean.
Can Reinforcement Learning Be Used for Natural Language Processing? (The Reward Sparsity Problem and Possible Solutions)
1. Can Reinforcement Learning Be Used for Natural Language Processing?
The Reward Sparsity Problem and Possible Solutions
Youngsam Kim
August 9, 2018
Naver Tech Talk
2. Table of contents
1. Motivation
2. Background
3. Experiments on Sentiment Polarity of Words
4. Experiment on Adverse Drug Reactions in Nursing Statements
5. Conclusion
4. Basic motivation and research questions
Basic motivation
How can computational reinforcement learning be applied to
questions in NLP?
Two research questions
• Prediction problem of on-line values of words in text
• Prediction problem of on-line values of text
5. More specific research questions
• Prediction problem of on-line sentiment polarity values of words
• Prediction problem of on-line Adverse Drug Reaction of nursing
statements
Why focus on on-line processing? → “language processing is known to be
on-line”
7. Reinforcement learning and language learning
In Verbal Behavior, Skinner (1957) argues that language learning can be
explained by association of stimulus and reinforcement.
Chomsky (1959) criticized this argument on the following grounds:
• Poverty of stimulus in language learning
• Poverty of rewards or penalties in language learning
8. Characteristics of RL
• Evaluative and delayed feedback
• No supervisor, only reward signal
• Time matters
• Agent’s actions affect the subsequent data it receives
• Sampling approach
• Approximated value functions
• Trial and error approach
9. Similarity of RL and language processing
Immediacy of interpretation: language processing is incremental
processing (Marslen-Wilson, 1973, 1975).
Syntactic processing is incremental.
• Syntactic parsing is not delayed.
• Syntactic reanalysis is costly.
• e.g. “The defendant examined by the lawyer turned out to be
unreliable.”
Semantic processing is also on-line.
• Reading times increase when a gender violation occurs in anaphora
resolution.
• Simple linguistic inferences are also drawn on-line.
10. Different tasks in RL and NLP
Reinforcement learning
• Robotics
• Game control
Natural language processing
• POS-tagging
• Anaphora resolution
• Syntactic parsing
• Sentiment analysis
• Question answering
• Machine translation
11. Difficulties in applying RL to NLP
Cost in exploration
• Exploration/Exploitation dilemma
• The cost is high when state/action sizes are large
• Long training time
Problem of sparsity of rewards
• Some learning problems suffer from reward sparseness in model-free
methods
• If rewards are sparse, learning will be very difficult
13. Temporal difference learning
• A core algorithm of reinforcement learning
• TD methods learn directly from episodes of experience
• TD is model-free: no knowledge of MDP / MRP
• TD learns from incomplete episodes, by bootstrapping
Temporal-difference learning seems a natural solution for on-line natural
language processing problems.
14. Markov Reward Process
An MRP is a Markov Decision Process without actions; it consists of four
components.
Definition
• S is a finite set of states.
• P is a state transition probability matrix, P_{ss'} = P[S_{t+1} = s' | S_t = s].
• R is a reward function, R_s = E[R_{t+1} | S_t = s].
• γ is a discount factor, γ ∈ [0, 1].
15. Value function
A value function is defined as below:

V(s) = E[ Σ_{k=0}^{∞} γ^k R_{t+k+1} | S_t = s ]    (1)

A value-function update for Monte-Carlo learning is

V(S_t) = V(S_t) + α ( Σ_{k=0}^{∞} γ^k R_{t+k+1} − V(S_t) )    (2)

A value-function update for the simplest TD method, TD(0), is as follows:

V(S_t) = V(S_t) + α ( R_{t+1} + γ V(S_{t+1}) − V(S_t) )    (3)

where α is the learning rate.
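As an illustration of the TD(0) update in (3), here is a minimal Python sketch (not from the talk; the episode format and names are illustrative), applying the rule to a tabular value function:

from collections import defaultdict

def td0_update(episode, V, alpha=0.1, gamma=0.9):
    # Apply V(S_t) = V(S_t) + alpha * (R_{t+1} + gamma * V(S_{t+1}) - V(S_t)).
    # `episode` is a list of (state, reward-on-entering-this-state) pairs;
    # the last state is terminal, and V defaults to 0 for unseen states.
    for t in range(len(episode) - 1):
        s, _ = episode[t]
        s_next, r_next = episode[t + 1]
        td_error = r_next + gamma * V[s_next] - V[s]
        V[s] += alpha * td_error
    return V

V = defaultdict(float)
td0_update([("this", 0.0), ("movie", 0.0), ("rocks", 0.0), ("<END>", 1.0)], V)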
16. TD(λ) method
TD(λ) of Sutton (1984, 1988) combines the simplest TD and
Monte-Carlo methods in an incremental framework with the introduction
of eligibility traces.
The method is made incremental with the traces and the trace-decay
parameter, λ ∈ [0, 1], which determines where to interpolate between the
MC and TD(0) updates.
When λ = 0, the update is equivalent to TD(0) and λ = 1 provides an
every-visit MC update.
13
17. Eligibility traces
The eligibility trace implements the ‘backward view’ mechanism of
TD(λ).
On each step, the trace of every state decays by γλ, while the trace of the
currently visited state is incremented (or reset), depending on the trace type.
18. Algorithm 1: Fast TD(λ) with replacing traces
Initialize V(s) arbitrarily and let e(s) = 0 for all s ∈ S
H ← new hash table
repeat
    while s_t is not at the end of the episode do
        observe reward r and s_{t+1}
        δ ← r + γ V(s_{t+1}) − V(s_t)
        e(s_t) ← 1
        if H does not contain s_t then
            insert s_t into H
        for all h ∈ H do
            if e(h) ≤ 0.001 then
                e(h) ← 0
                remove h from H
                continue
            V(h) ← V(h) + α δ e(h)
            e(h) ← γ λ e(h)
until the episode is terminal
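A minimal Python sketch of this procedure (an assumption about how it could be implemented, not the talk's exact code; the episode is a list of states followed by a single terminal reward, as in the word-level setting later in the talk):

from collections import defaultdict

def td_lambda_replacing(episode, terminal_reward, V, alpha=0.1, gamma=1.0, lam=0.9):
    # V should default to 0.0 for unseen states (e.g. a defaultdict(float)).
    e = defaultdict(float)                  # eligibility traces; plays the role of the hash table H
    for t, s in enumerate(episode):
        terminal = (t == len(episode) - 1)
        r = terminal_reward if terminal else 0.0
        v_next = 0.0 if terminal else V[episode[t + 1]]
        delta = r + gamma * v_next - V[s]   # TD error
        e[s] = 1.0                          # replacing trace for the visited state
        for h in list(e):
            V[h] += alpha * delta * e[h]
            e[h] *= gamma * lam             # decay every active trace
            if e[h] <= 0.001:
                del e[h]                    # prune negligible traces, as in Algorithm 1
    return V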
22. Problem formulation
A movie review is represented as a sequence of words (states):
w1, w2, . . . , wt, . . . , wT where T is the length of the review.
State
A state is defined as a word type in the corpus vocabulary.
Reward
We regard the classification label of a text as the reward: +1 for a positive
label, −1 for a negative label.
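A small usage sketch of this formulation, reusing the hypothetical td_lambda_replacing function sketched above (tokenization and label are illustrative):

from collections import defaultdict

V = defaultdict(float)                                  # word-type values, default 0.0
review = "a stunning and deeply moving film".split()    # word states w1, ..., wT
label = 1                                               # +1 for a positive review, -1 for a negative one
td_lambda_replacing(review, terminal_reward=label, V=V, gamma=1.0, lam=1.0)
# After many labeled reviews, V[word] serves as the word's estimated sentiment polarity.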
23. Problem formulation
Figure 3: An example of a 6-state Markov Reward Process. The numbers on
the arrows indicate the rewards; which of the two values (+1 or −1) is given at
the terminal state is determined by the label of the review.
Figure 4: In this MRP, every reward is given by the classification label.
24. Datasets
Movie Review Dataset We use the polarity dataset v2.0 for the
indirect evaluation (Pang and Lee, 2004), which consists of 1,000
positive and 1,000 negative movie reviews.
Stanford Sentiment Treebank The data is based on 11,855 single
sentences extracted from movie reviews and contains sentiment polarity
values for all phrases which are annotated by 3 human judges (Socher et
al., 2013).
Large Movie Review Dataset This corpus for binary sentiment
classification (Maas et al., 2011) is used to train the LSTM sentiment
classifier of our method.
25. Configuration of experiments
• Experiment 1: Hyper-parameter exploration with feature selection
paradigm (with setting in Fig. 3)
• Experiment 2: Evaluation with feature selection paradigm (with
setting in Fig. 4)
• Experiment 3: Direct evaluation with Stanford Sentiment Treebank
(with Setting in Fig. 3)
26. Setting of Experiment 1
Naive Bayes classification is performed based on the top 10,000 selected
words.
10-fold cross validation is used for each condition and the accuracies are
averaged.
Conditions of TD methods: Hyper-parameter combinations of learning
rate (0.1∼0.5) and trace-decay rate (0.1∼1.0) with step size of 0.1
Incremental means of the TD values over time steps are used to estimate
word values.
Compared feature selection methods
• Document Frequency
• Averaged TF-IDF
• χ² statistic (CHI)
• Information Gain
27. Results of Experiment 1 (hyper-parameters)
Figure 5: Performance of TD with replacing traces as a function of λ
28. Results of Experiment 1 (accuracies)
Method                      NB Accuracy
TD(1) with accumulation     0.84
TD(1) with replacing        0.83
TD(1) with Dutch            0.83
TD(1) with significance     0.83
True Online TD(1)           0.78
Simple Averages             0.64
Document Frequency          0.67
TF-IDF                      0.69
χ² statistic (CHI)          0.66
Information Gain            0.83
Table 1: 10-fold cross validation accuracies of the TD methods and the feature
selection methods
29. Setting of Experiment 2
An LSTM sentiment classifier is trained on Large Movie Review Dataset.
A movie review is split into sentences.
The LSTM sentiment classifier assigns a label (+1 or −1) to each prefix
string, which is formed incrementally as below:
This
This is
This is a
This is a good
This is a good movie
This time, accuracies from 10 samplings of training/test sets are
averaged, using the last-step TD values of the words.
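A sketch of how per-step rewards could be obtained in this setting (classify stands for the trained LSTM classifier returning +1 or −1 for a string; the function name is an assumption):

def incremental_rewards(sentence_words, classify):
    # Label each growing prefix with the classifier and use the labels as rewards.
    rewards, prefix = [], []
    for w in sentence_words:
        prefix.append(w)
        rewards.append(classify(" ".join(prefix)))   # +1 or -1 for the current prefix
    return rewards

# e.g. incremental_rewards("This is a good movie".split(), classify)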
30. Model-based reinforcement learning
• A model M is a representation of an MDP/MRP, parametrized by η.
• A model M = (P_η, R_η) represents state transitions P_η ≈ P and
rewards R_η ≈ R:
S_{t+1} ∼ P_η(S_{t+1} | S_t)
R_{t+1} = R_η(R_{t+1} | S_t)
• Model-based RL plans the value function from the model.
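One simple way to learn such a model from data is a count-based table-lookup model; the sketch below is an illustration under that assumption (not the talk's exact procedure), estimating P_η and R_η from observed (state, reward, next state) triples:

from collections import defaultdict

def fit_table_model(episodes):
    counts = defaultdict(lambda: defaultdict(int))   # counts[s][s'] = observed s -> s' transitions
    reward_sum = defaultdict(float)
    visits = defaultdict(int)
    for episode in episodes:                         # episode: list of (state, reward, next_state)
        for s, r, s_next in episode:
            counts[s][s_next] += 1
            reward_sum[s] += r
            visits[s] += 1
    P = {s: {s2: n / visits[s] for s2, n in nexts.items()} for s, nexts in counts.items()}
    R = {s: reward_sum[s] / visits[s] for s in visits}   # expected immediate reward per state
    return P, R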
31. Optimal value of trace-decay rate in Setting 2
Figure 6: Performance of TD with replacing traces as a function of λ
32. Results of Experiment 2
Method NB Accuracy
TD(λ) with accumulation traces 0.82
TD(λ) with Dutch traces 0.82
TD(λ) with replacing traces 0.82
TD(λ) with significance traces 0.82
Information Gain 0.75
Table 2: Classification Accuracy of TD methods with Setting of Fig. 4
Note that the ‘Information Gain’ method is based on the sentence-based
dataset.
33. Setting of Experiment 3
For the direct evaluation, correlation coefficients between the estimated
values and the labeled polarities are calculated.
The labels are real values ranging from 0 to 1 (Stanford Sentiment
Treebank).
TD methods follow the previous setting (the setting of Fig. 3).
Full dataset: 21,684 words in the dataset
Reduced dataset: 4,532 words whose POS is adjective, adjective (superlative),
adverb, or adverb (comparative).
For comparison, an estimation method based on Bayes probabilities (Potts, 2010)
is used.
34. Results of Experiment 3 (full set)
Method Pearson (full) Spearman (full)
Bayes Prob. (Potts, 2010) 0.21 0.2
TD(1) with replacing 0.24 0.21
TD(1) with Dutch 0.24 0.21
TD(1) with significance 0.24 0.21
TD(1) with accumulation 0.24 0.21
Table 3: Correlations between the estimation results and the human-labeled
polarity values of the 24,684 words.
35. Results of Experiment 3 (reduced set)
Method Pearson (reduced) Spearman (reduced)
Bayes Prob. (Potts, 2010) 0.32 0.3
TD(1) with replacing 0.38 0.35
TD(1) with Dutch 0.38 0.35
TD(1) with significance 0.38 0.35
TD(1) with accumulation 0.38 0.34
Table 4: Correlations between the estimation results and the human-labeled
polarity values of the 4,532 words.
36. Plots of the labeled and estimated values
Figure 7: Plots of the annotated values, the Bayesian estimates, and the
TD(1)-with-significance-traces estimates for the full dataset (sorted from lowest to highest)
37. Summary of the experiment results
• TD methods achieve the same level of performance as other feature
selection methods.
• TD-based estimates are more differentiated, providing more realistic
values.
• TD-based methods provide an easy tool for on-line estimation of
words.
38. Summary of the differences between the two settings
• In setting 1, TD methods with λ = 1 show best performances.
• In setting 2, TD methods with λ = 0.7 show best performances.
• TD methods with setting 2 show better performances.
• Thus, for TD methods with setting 2, best performance is obtained
with an intermediate value of λ.
• Note that TD method with setting 2 is a model-based approach.
40. Data
• Data source: Ajou University hospital
• Nursing statements from 8,316 patients
• 4,158 ADR labeled patients
• 4,158 non-ADR labeled patients
• Average number of sentences per patient: 421
• Largest number of sentences of a patient: 10,625
• Total number of sentence types: 837,293
42. Other methods: NB
• 9,647 sentence types are used whose frequency is greater than 20.
• Train, Dev, Test sets at ratio of 8:1:1
• Information Gain is used to select N-best features.
• Grid-search is performed to find the best N (3700)
43. Other methods: SVM
• 9,647 sentence types are used whose frequency is greater than 20.
• Train, Dev, Test sets at ratio of 8:1:1
• Linear and RBF models are both used.
• Grid-search is performed to find the best parameters (Table 5).
Min DF   Max DF Proportion   Gamma   C
5        0.5                 0.1     4
Table 5: Parameter values used in SVM
44. Other methods: CNN
• All 837,293 sentence types are used.
• Train, Dev, Test sets at ratio of 8:1:1
• Pretrained paragraph vectors (Le & Mikolov, 2014) are used for the
embedding condition.
• In ADR cases, the latest 288 statements up to the ADR date are used.
• In non-ADR cases, the latest 288 statements up to a random index are
used.
45. Other methods: LSTM
• All 837,293 sentence types are used.
• Train, Dev, Test sets at ratio of 8:1:1
• Pretrained paragraph vectors (Le & Mikolov, 2014) are used for the
embedding condition.
• In ADR cases, the latest 200 statements up to the ADR date are used.
• In non-ADR cases, the latest 200 statements up to a random index are
used.
46. Problem formulation
Each statement is represented as a state of the patient:
s1, s2, . . . , st, . . . , sT, where T is the number of statements.
State
A state is defined as a functional event of a statement.
Reward
We regard the classification label of the statements as the reward: +1 for
the ADR label, −1 for the non-ADR case.
47. Memory-based function approximation
• Non-parametric function approximation
• When a query state is given, it retrieves related training examples.
• It then estimates the state value from those examples, as sketched below.
• It provides a way to avoid the curse of dimensionality.
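A minimal sketch of the idea, using a nearest-neighbour average (the distance function and memory format are illustrative assumptions):

def knn_value(query, memory, distance, k=5):
    # memory: list of (state, value) pairs collected from training examples.
    nearest = sorted(memory, key=lambda item: distance(query, item[0]))[:k]
    return sum(v for _, v in nearest) / len(nearest)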
48. Building training examples
A dataset of 400 patients is sampled, and a nursing expert marked a
statement if it seemed highly related to ADR.
Sentence  Label
오심 없음 (no nausea)
구토 없음 (no vomiting)
침상안정 중임 (on bed rest)
2(구두처방)Tramadol ((verbal order) Tramadol)  T
주관적 진술: "머리랑 배가 아파요." (subjective statement: "My head and stomach hurt.")  T
복부 통증 호소함 (complains of abdominal pain)  T
통증 양상 관찰함 (pain pattern observed)  T
통증완화 위한 다양한 방법 격려함 (encouraged various methods of pain relief)  T
의사에게 알림 (doctor notified)  T
심호흡 교육함 (deep-breathing education provided)
구강간호 시행함 (oral care performed)
오심 감소함 (nausea decreased)
49. Building training examples
The labeled statements are manually classified into 7 events as follows:
State  Meaning  Example
0  Unknown event  CT찍고 옴. (returned from CT scan)
1  Drug related event  epocelin 1g 투여 함 (epocelin 1 g administered)
2  Abnormal reactions  피부 가려움 호소함 부위: both arm (complains of itchy skin; site: both arms)
3  Doctor related event  의사에게 알림 의사: xxx (doctor notified; doctor: xxx)
4  Subjective response  주관적 진술: 몸이 가벼워요 (subjective statement: "My body feels light")
5  1 & 2 event  Tramadol 맞은 후 구토 2회 함 (vomited twice after receiving Tramadol)
6  4 & 2 event  주관적 진술: 속이 계속 울렁거리네요. (subjective statement: "My stomach keeps feeling queasy")
Table 6: Table of defined states and example sentences
50. Building NB classifier
A Naive Bayes classifier is trained on binary feature functions and the
unigrams in each sentence.
Feature functions:
• is_high_temp: positive if body temperature > 37.4
• is_subjective: positive if a subjective-statement marker is present
• is_drug_related: positive if a prescription statement or a drug name
appears
• is_negation: positive if a negation expression is present
• is_reactive: positive if ADR-related symptoms are present
Accuracy of the classifier: 0.72
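A hedged sketch of what such binary feature functions might look like (the keyword lists, temperature extraction, and helper names are purely illustrative):

import re

DRUG_TERMS = ["tramadol", "epocelin"]          # illustrative drug lexicon
ADR_SYMPTOMS = ["오심", "구토", "가려움"]        # nausea, vomiting, itching (illustrative)

def is_high_temp(sentence):
    # Positive if a recorded body temperature exceeds 37.4.
    m = re.search(r"\d{2}\.\d", sentence)
    return bool(m) and float(m.group()) > 37.4

def is_subjective(sentence):
    return "주관적 진술" in sentence             # subjective-statement marker

def is_drug_related(sentence):
    s = sentence.lower()
    return "구두처방" in sentence or any(d in s for d in DRUG_TERMS)

def is_reactive(sentence):
    return any(sym in sentence for sym in ADR_SYMPTOMS)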
51. Memory-based reward function
The ADR environment provides sparse rewards, so the reward function is
modeled as below:

def get_reward(n_state, n_idx):
    # Terminal states: the episode's classification label.
    if n_state == 'adr':
        return 1
    elif n_state == 'nor':
        return -1
    # Intermediate reward from the memory-based state classifier:
    # event indices 2-6 are ADR-related events (see Table 6).
    elif n_idx in (2, 3, 4, 5, 6):
        return 1
    else:
        return 0
52. Training
• 8,316 episodes are split into Train, Validation, Test sets at ratio of
8:1:1.
• Replacing traces are used as the eligibility traces.
Parameter Value
Learning rate (α) 0.1
Trace-decay rate (λ) 0.3
Discount factor (γ) 0.1
Table 7: Table of values of the parameters
53. Estimated state values from train set
State Meaning Value
0 Unknown event NA
1 Drug related event 0.17
2 Abnormal reactions 0.47
3 Doctor related event 0.27
4 Subjective response 0.51
5 1 & 2 event 0.18
6 4 & 2 event 0.80
Table 8: Estimated values of states using TD(λ)
55. Classification based on state values
• The state values corresponding to the states in the Validation set are
averaged (the value of the unknown state is regarded as 0).
• Simple logistic regression is performed on the Validation set.
• The regression classifier is tested on the Test set (see the sketch after this list).
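A minimal sketch of this final step, assuming per-patient lists of event indices, the state values of Table 8, and scikit-learn for the logistic regression (all of which are assumptions about tooling and data layout):

import numpy as np
from sklearn.linear_model import LogisticRegression

STATE_VALUES = {1: 0.17, 2: 0.47, 3: 0.27, 4: 0.51, 5: 0.18, 6: 0.80}   # Table 8; unknown state -> 0.0

def patient_feature(event_indices):
    # Average the TD state values over a patient's statements.
    vals = [STATE_VALUES.get(i, 0.0) for i in event_indices]
    return np.mean(vals) if vals else 0.0

# Hypothetical validation data: per-patient event sequences and ADR labels (1 = ADR, 0 = non-ADR).
X_val = np.array([[patient_feature(seq)] for seq in [[2, 4, 6], [0, 0, 1], [3, 2], [0]]])
y_val = np.array([1, 0, 1, 0])
clf = LogisticRegression().fit(X_val, y_val)
# The fitted classifier is then evaluated on the held-out Test set in the same way.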
Method Accuracy
NB 0.64
SVM (linear) 0.63
SVM (RBF) 0.63
CNN 0.58
CNN (with embedding) 0.58
LSTM 0.61
LSTM (with embedding) 0.57
TD-based logistic regression 0.61
Table 9: Results of ADR classifications
56. Incremental analysis using TD learning
• The proposed method is on-line, so it allows us to monitor signs
of ADR in real time.
• This model captures our intuitive response when reading ADR
events.
58. General Conclusion
• RL methods provide incremental learning, which is useful for some
practical cases.
• Model-based RL methods are nice options for cognitive processing of
texts.
• In NLP tasks, RL methods are difficult to apply because of sparsity
of rewards.
• Model-based RL methods are useful to avoid the reward sparseness
problem.
59. Future study
• Model learning in model-based RL needs to be improved during
planning.
• Thus, supervised models can be improved when combined with RL.
• An experience-replay mechanism can be used for this improvement
process.
• With this approach, RL methods might be used in many supervised
learning tasks.
60. References
Pang, B., & Lee, L. (2004). A sentimental education: Sentiment analysis using
subjectivity summarization based on minimum cuts. In Proceedings of ACL.
Bond, C., & Raehl, C. L. (2006). Adverse drug reactions in united states hospitals.
Pharmacotherapy: The Journal of Human Pharmacology and Drug Therapy, 26(5),
601–608.
Chomsky, N. (1959). A review of BF Skinner’s verbal behavior. Language, 35(1),
26–58.
Kim, Y. (2014). Convolutional neural networks for sentence classification. ArXiv
Preprint ArXiv:1408.5882.
Le, Q., & Mikolov, T. (2014, January). Distributed representations of sentences and
documents. In International Conference on Machine Learning (pp. 1188-1196).
Marslen-Wilson, W. (1973). Linguistic structure and speech shadowing at very short
latencies. Nature, 244(5417), 522.
Marslen-Wilson, W. (1975). Sentence perception as an interactive parallel process.
Science,189(4198), 226–228.
Maas, A., Daly, R., Pham, P., Huang, D., Ng, A., & Potts, C. (2011). Learning word
vectors for sentiment analysis. In Proceedings of ACL.
Potts, C. (2010). On the negativity of negation. Semantics and Linguistic Theory,
20, 636–659.
van Seijen, H., & Sutton, R. S. (2014). True online TD(λ). In Proceedings of ICML
(PMLR, pp. 692–700).
Skinner, B. (1957). Verbal Behavior. New York: Appleton-Century-Crofts.
Socher, R., Perelygin, A., Wu, J. Y., Chuang, J., Manning, C. D., Ng, A. Y., & Potts, C.
(2013). Recursive deep models for semantic compositionality over a sentiment
treebank. In Proceedings of EMNLP (pp. 1631–1642).
Sutton, Richard S. (1988). Learning to predict by the methods of temporal differences.
Machine Learning, 3(1), 9–44.
Sutton, Richard Stuart. (1984). Temporal credit assignment in reinforcement learning
(Ph.D. Dissertation). University of Massachusetts, Amherst, MA.
Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: An introduction.
Cambridge, MA: MIT Press.