Presentation date: January 2018
Presenter: Edward Choi (최윤재), Ph.D. student, Georgia Tech
Since 2012, deep learning, or representation learning, has shown impressive progress in computer vision, speech recognition, and natural language processing. The power of deep learning comes from combining expressive models with large labeled datasets. This allows machines to extract useful information from high-dimensional data, which was a human responsibility before the rise of deep learning.
Massive amounts of data have been collected in healthcare since the introduction of electronic health records (EHR), more than human medical experts can process. In this regard, deep learning is expected to play a significant role in healthcare, as it did in vision and language. However, computational healthcare requires predictive models to be both accurate and interpretable.
My talk will introduce how to use recurrent neural networks (RNN), one of the building blocks of deep learning, to process longitudinal EHR data and predict a future event. Specifically, I will focus on predicting heart failure onset given a patient's 18-month record. Building on top of this, I will address the interpretability issue of deep learning models and propose a method to make predictions that are both accurate and interpretable.
24. Limitation of RNN
• Transparency
• RNN is a black box
• Feed input, receive output
• Hard to tell what caused the outcome
• Example: outcome 0.9
• Was it because of “Justice”?
• Was it because of “impressive”?
• Was it because of “Christmas”?
32. Attention models
• Attention, what is it good for?
• c is an explicit combination of all past information
• α_1, α_2, ⋯, α_T denote the usefulness of each word
• We can tell which word contributed the most/least to the outcome
[Figure: attention weights α_1, …, α_T over the words, combined into the context vector c]
33. Attention models
• Attention, what is it good for?
• Now c is an explicit combination of all past information
• α_1, α_2, ⋯, α_T denote the usefulness of each word
• We can tell which word contributed the most/least to the outcome
• The attention weights α_t are generated using an MLP (see the sketch below)
[Figure: attention weights α_1, …, α_T over the words, combined into the context vector c]
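To make this concrete, here is a minimal numpy sketch of this style of attention. The sizes, the random "word representations", and the one-layer scorer standing in for the MLP are all illustrative assumptions, not the exact model from the talk:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
T, d = 16, 8                          # number of words, representation size (assumed)
h = rng.normal(size=(T, d))           # h[t]: representation of the t-th word

# One-layer scorer standing in for the MLP: one scalar score per word.
w = rng.normal(size=d)
alpha = softmax(h @ w)                # alpha_1, ..., alpha_T (sum to 1)

# c is an explicit combination of all past information.
c = (alpha[:, None] * h).sum(axis=0)  # c = sum_t alpha_t * h[t]

# Interpretability: which word was used the most / least for the outcome.
print("most-used word index:", alpha.argmax(), " least-used:", alpha.argmin())
```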
37. Structure of EHR
• Assumption so far
• Word sequence = Dx (diagnosis) sequence
• Justice, League, is, as, impressive, as, …
• Cough, Benzonatate, Fever, Pneumonia, Chest X-ray, Amoxicillin, ...
[Figure: medical codes (Cough, Benzonatate, Fever, Pneumonia, Chest X-ray, Amoxicillin) laid out along a time axis]
39. Structure of EHR
• Assumption so far
• Word sequence = Dx sequence
• Justice, League, is, as, impressive, as, …
• Cough, Benzonatate, Fever, Pneumonia, Chest X-ray, Amoxicillin, ...
• In reality, however, codes are grouped into visits (see the sketch below)
[Figure: codes grouped by visit — Visit 1: Cough, Fever; Visit 2: Fever, Chill; Visit 3: Fever, Pneumonia, Chest X-ray, Tylenol, IV fluid]
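The two-level structure can be written down directly. A tiny sketch, with the visit grouping approximated from the slide figure:

```python
# A patient record is a sequence of visits, and each visit is a set of medical
# codes -- unlike the flat word sequence of a sentence.
patient = [
    ["Cough", "Fever"],                                            # Visit 1
    ["Fever", "Chill"],                                            # Visit 2
    ["Fever", "Pneumonia", "Chest X-ray", "Tylenol", "IV fluid"],  # Visit 3 (approximate)
]
```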
55. RETAIN: Model Architecture
[Figure 2: Unfolded view of RETAIN's architecture. Given an input sequence x_1, …, x_i, the model predicts the label y_i.]

(Excerpt from the RETAIN paper:) In the standard attention mechanism, to find the j-th word in the target language we generate attentions α_i^j for each word in the original sentence, then compute the context vector c_j = Σ_i α_i^j h_i to predict the j-th word in the target language. In general, the attention mechanism allows the model to focus on specific words in the given sentence when generating each word in the target language. In this work, we define a temporal attention mechanism to provide interpretability in healthcare: doctors generally pay attention to specific clinical information and its timing when reviewing EHR data, and we exploit this insight to develop a temporal attention mechanism that mimics doctors' practice.

2.2 Reverse Time Attention Model RETAIN
Figure 2 shows the high-level overview of our model. One key idea is to delegate a considerable portion of the prediction responsibility to the attention-weight generation process. RNNs become hard to interpret due to the recurrent weights feeding past information to the hidden layer. Therefore, to consider both the visit-level and the variable-level (individual coordinates of x_i) influence, we use a linear embedding of the input vector x_i. That is, we define (Step 1)

v_i = W_emb x_i,

where v_i ∈ ℝ^m denotes the embedding of the input vector x_i ∈ ℝ^r, m the size of the embedding dimension, and W_emb ∈ ℝ^{m×r} the embedding matrix to learn. We can easily choose a more sophisticated, yet still interpretable, representation such as a multilayer perceptron (MLP), which has been used for representation learning in EHR data [10]. (A minimal sketch of Step 1 follows.)
56. RETAIN: Model Architecture
[Figure 2: Unfolded view of RETAIN's architecture.]

We use two sets of weights, one for the visit-level attention and one for the variable-level attention. The scalars α_1, …, α_i are the visit-level attention weights that govern the influence of each visit embedding v_1, …, v_i. The vectors β_1, …, β_i are the variable-level attention weights that focus on each coordinate of the visit embeddings v_{1,1}, v_{1,2}, …, v_{1,m}, …, v_{i,1}, v_{i,2}, …, v_{i,m}.

We use two RNNs, RNN_α and RNN_β, to separately generate the α's and β's as follows:

g_i, g_{i−1}, …, g_1 = RNN_α(v_i, v_{i−1}, …, v_1)
e_j = w_α^⊤ g_j + b_α,   for j = 1, …, i
α_1, α_2, …, α_i = Softmax(e_1, e_2, …, e_i)   (Step 2)

h_i, h_{i−1}, …, h_1 = RNN_β(v_i, v_{i−1}, …, v_1)
β_j = tanh(W_β h_j + b_β),   for j = 1, …, i   (Step 3)
57. RETAIN: Model Architecture
[Figure 2: Unfolded view of RETAIN's architecture.]

Here g_i ∈ ℝ^p is the hidden layer of RNN_α at time step i, h_i ∈ ℝ^q the hidden layer of RNN_β at time step i, and w_α ∈ ℝ^p, b_α ∈ ℝ, W_β ∈ ℝ^{m×q}, and b_β ∈ ℝ^m are the parameters to learn. The hyperparameters p and q determine the hidden layer sizes of RNN_α and RNN_β, respectively. (A sketch of Steps 2 and 3 follows.)
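A sketch of Steps 2 and 3. A plain tanh recurrence stands in for the two RNNs (an actual implementation would use gated RNNs), and all sizes and weights are illustrative placeholders:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def simple_rnn(vs, W_in, W_rec):
    # Plain tanh recurrence as a stand-in for RNN_alpha / RNN_beta.
    h, out = np.zeros(W_rec.shape[0]), []
    for v in vs:
        h = np.tanh(W_in @ v + W_rec @ h)
        out.append(h)
    return np.stack(out)

rng = np.random.default_rng(0)
i, m, p, q = 5, 64, 32, 32            # visits so far, embedding size, RNN sizes (assumed)
v = rng.normal(size=(i, m))           # visit embeddings v_1 ... v_i

# Both RNNs run in *reversed* time order (v_i, ..., v_1); reverse back for indexing.
g = simple_rnn(v[::-1], rng.normal(size=(p, m)), rng.normal(size=(p, p)))[::-1]
h = simple_rnn(v[::-1], rng.normal(size=(q, m)), rng.normal(size=(q, q)))[::-1]

# Step 2: scalar visit-level attention weights.
w_a, b_a = rng.normal(size=p), 0.0
alpha = softmax(g @ w_a + b_a)        # alpha_1 ... alpha_i

# Step 3: vector variable-level attention weights.
W_b, b_b = rng.normal(size=(m, q)), np.zeros(m)
beta = np.tanh(h @ W_b.T + b_b)       # beta_j in R^m for each visit j
```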
58. RETAIN: Model Architecture
[Figure 2: Unfolded view of RETAIN's architecture.]

When doctors review a patient's past records, they typically study the most recent records first and go back in time. Computationally, running the RNN in reversed time order has several advantages as well: the reversed time order allows us to generate e's and β's that dynamically change their values when making predictions at different time steps i = 1, 2, …, T. It ensures that the attention vectors differ at each time stamp and makes the attention generation process computationally more stable.

We generate the context vector c_i for a patient up to the i-th visit as follows:

c_i = Σ_{j=1}^{i} α_j β_j ⊙ v_j,   (Step 4)

where ⊙ denotes element-wise multiplication. We use the context vector c_i ∈ ℝ^m to predict the true label y_i ∈ {0,1}^s as follows:

ŷ_i = Softmax(W c_i + b),   (Step 5)

where W ∈ ℝ^{s×m} and b ∈ ℝ^s are parameters to learn. We use the cross-entropy to calculate the classification loss as follows:

L(x_1, …, x_T) = −(1/N) Σ_{n=1}^{N} (1/T^{(n)}) Σ_{i=1}^{T^{(n)}} ( y_i^⊤ log(ŷ_i) + (1 − y_i)^⊤ log(1 − ŷ_i) )   (1)

where we sum the cross-entropy errors from all dimensions of ŷ_i. In the case of a real-valued output y_i ∈ ℝ^s, the cross-entropy in Eq. (1) can be replaced by, for example, the mean squared error. (A sketch of Steps 4 and 5 follows.)
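A self-contained sketch of Steps 4 and 5 with placeholder attention values. A sigmoid is used for a single binary label such as heart failure onset; the paper writes Softmax for the general output y_i ∈ {0,1}^s:

```python
import numpy as np

rng = np.random.default_rng(0)
i, m, s = 5, 64, 1                        # visits, embedding size, output size (assumed)
v = rng.normal(size=(i, m))               # visit embeddings (placeholders)
alpha = np.full(i, 1.0 / i)               # placeholder visit-level attentions
beta = np.tanh(rng.normal(size=(i, m)))   # placeholder variable-level attentions

# Step 4: context vector, a doubly-weighted combination of all past visits.
c_i = (alpha[:, None] * beta * v).sum(axis=0)   # sum_j alpha_j * (beta_j elementwise v_j)

# Step 5: prediction for one binary label. Small weights keep the sigmoid
# away from saturation in this toy example.
W, b = 0.1 * rng.normal(size=(s, m)), np.zeros(s)
y_hat = 1.0 / (1.0 + np.exp(-(W @ c_i + b)))

# One term of the loss in Eq. (1); the full loss averages over time steps and patients.
y_true = np.array([1.0])
loss = -(y_true * np.log(y_hat) + (1 - y_true) * np.log(1 - y_hat)).sum()
print(float(y_hat), float(loss))
```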
59. RETAIN: Model Architecture
[Figure 2: Unfolded view of RETAIN's architecture.]

Overall, our attention mechanism can be viewed as the inverted architecture of the standard attention mechanism for NLP [2], where the words are encoded using an RNN and the attention weights are generated using an MLP. Our method, on the other hand, uses an MLP to embed the visit information to preserve interpretability, and uses RNNs to generate two sets of attention weights, recovering the sequential information as well as mimicking the behavior of physicians.
60. RETAIN: Calculating the Contributions
We present a method to interpret the end-to-end behavior of RETAIN. By keeping the α and β values fixed as the attention of doctors, we analyze the change in the probability of each label y_{i,1}, …, y_{i,s} in terms of a change in the original input x_{1,1}, …, x_{1,r}, …, x_{i,1}, …, x_{i,r}. The x_{j,k} that leads to the largest change in y_{i,d} will be the input variable with the highest contribution. More formally, given the sequence x_1, …, x_i, we are trying to predict the probability of the output vector y_i ∈ {0,1}^s, which can be expressed as follows:

p(y_i | x_1, …, x_i) = p(y_i | c_i) = Softmax(W c_i + b)   (2)

where c_i ∈ ℝ^m denotes the context vector. According to Step 4, c_i is the sum of the visit embeddings v_1, …, v_i weighted by the attentions α and β. Therefore Eq. (2) can be rewritten as follows:

p(y_i | x_1, …, x_i) = Softmax( W ( Σ_{j=1}^{i} α_j β_j ⊙ v_j ) + b )   (3)

Using the fact that the visit embedding v_j is the sum of the columns of W_emb weighted by each element of x_j, Eq. (3) can be rewritten as follows:

p(y_i | x_1, …, x_i) = Softmax( W ( Σ_{j=1}^{i} α_j β_j ⊙ Σ_{k=1}^{r} x_{j,k} W_emb[:, k] ) + b )
                     = Softmax( Σ_{j=1}^{i} Σ_{k=1}^{r} x_{j,k} α_j W ( β_j ⊙ W_emb[:, k] ) + b )   (4)

where x_{j,k} is the k-th element of the input vector x_j. (A numeric check of this decomposition follows.)
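Since Eqs. (3) and (4) are claimed to be the same quantity, the decomposition can be checked numerically on tiny made-up sizes; everything below is placeholder data:

```python
import numpy as np

rng = np.random.default_rng(0)
i, r, m, s = 3, 10, 4, 2                            # tiny assumed sizes
W_emb = rng.normal(size=(m, r))
W, b = rng.normal(size=(s, m)), rng.normal(size=s)
x = (rng.random(size=(i, r)) < 0.3).astype(float)   # multi-hot visit vectors
alpha = rng.random(size=i)
beta = np.tanh(rng.normal(size=(i, m)))

v = x @ W_emb.T                                     # v_j = W_emb x_j

# Logits inside the Softmax of Eq. (3).
logits3 = W @ (alpha[:, None] * beta * v).sum(axis=0) + b

# Logits inside the Softmax of Eq. (4): deconstructed into per-code terms.
logits4 = b.copy()
for j in range(i):
    for k in range(r):
        logits4 += x[j, k] * alpha[j] * (W @ (beta[j] * W_emb[:, k]))

print(np.allclose(logits3, logits4))                # True: the decomposition is exact
```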
61. RETAIN: Calculating the Contributions
Eq. (4) tells us that the calculation of the likelihood of y_i can be completely deconstructed down to the variables at each input x_1, …, x_i. Therefore we can calculate the contribution ω of the k-th variable of the input x_j at time step j ≤ i for predicting y_i.
62. RETAIN: Calculating the Contributions
The contribution ω of the k-th variable of the input x_j at time step j ≤ i, for predicting y_i, is:

ω(y_i, x_{j,k}) = α_j W ( β_j ⊙ W_emb[:, k] ) · x_{j,k}   (5)

where the first factor α_j W(β_j ⊙ W_emb[:, k]) is the contribution coefficient, x_{j,k} is the input value, and W_emb[:, k] is the k-th column of the embedding matrix W_emb. (A sketch of this computation follows.)
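A sketch of Eq. (5) for one code at one visit; the attention values, weight matrices, and the code index k are placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
r, m, s = 1000, 64, 1                    # codes, embedding size, outputs (assumed)
W_emb = rng.normal(size=(m, r))          # embedding matrix (learned in practice)
W = rng.normal(size=(s, m))              # output weights (learned in practice)
alpha_j = 0.3                            # visit-level attention of visit j (placeholder)
beta_j = np.tanh(rng.normal(size=m))     # variable-level attention of visit j (placeholder)
k, x_jk = 17, 1.0                        # code index and its multi-hot input value

# Eq. (5): omega(y_i, x_{j,k}) = [alpha_j * W (beta_j elementwise W_emb[:, k])] * x_{j,k}
coeff = alpha_j * (W @ (beta_j * W_emb[:, k]))   # contribution coefficient (one per output)
omega = coeff * x_jk                             # contribution of the k-th code in visit j
print(omega)
```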
63. RETAIN: Calculating the Contributions
• The inner sum of Eq. (4), Σ_{k=1}^{r} x_{j,k} W_emb[:, k], rebuilds the visit embedding v_j inside the iteration over k.
64. RETAIN: Calculating the Contributions
• After rearranging Eq. (4), the scalars x_{j,k} and α_j sit in the front: x_{j,k} α_j W(β_j ⊙ W_emb[:, k]).
65. RETAIN: Calculating the Contributions
• ω(y_i, x_{j,k}) is the contribution of the k-th code in the j-th visit.
• The index i is omitted in α_j and β_j; as described in Section 2.2, the attention weights are regenerated at each prediction step i.
68. Heart failure prediction
• Performance measure
• Area under the ROC curve (AUC)
• Competing models
• Logistic regression (LR)
• Aggregate all past codes into a fixed-size vector and feed it to LR
• MLP
• Aggregate all past codes into a fixed-size vector and feed it to an MLP
• Two-layer RNN
• Visits are fed to an RNN, whose hidden layers are fed to another RNN
• RNN+attention (Bahdanau et al., 2014)
• Visits are fed to an RNN; visit-level attentions are generated by an MLP
• RETAIN
69. Heart failure prediction

Model                 AUC               Training time/epoch   Test time (5K patients)
Logistic Regression   0.7900 ± 0.0111   0.15 s                0.11 s
MLP                   0.8256 ± 0.0096   0.25 s                0.11 s
Two-layer RNN         0.8706 ± 0.0080   10.3 s                0.57 s
RNN+attention         0.8624 ± 0.0079   6.7 s                 0.48 s
RETAIN                0.8705 ± 0.0081   10.8 s                0.63 s

• RETAIN is as accurate as the two-layer RNN
• Requires similar training and test time
• RETAIN is interpretable, while the RNN is a black box
71. Conclusion
• RETAIN: an interpretable prediction framework
• As accurate as an RNN
• Interpretable prediction
• Predictions can be explained
• Can be extended to general prognosis
• What diseases is the patient likely to have in the future?
• Can be used for any sequence with the same two-level structure
• E.g., online shopping