Interpretable	Deep	Learning	
for	Healthcare
Edward	Choi	(mp2893@gatech.edu)
Jimeng Sun	(jsun@cc.gatech.edu)
SunLab (sunlab.org)
Index
• Healthcare	&	Machine	Learning
• Sequence	Prediction	with	RNN
• Attention	mechanism &	interpretable	prediction
• Proposed	model:	RETAIN
• Experiments	&	results
• Conclusion
2
Healthcare	
&	
Machine	Learning
SunLab &	Healthcare
• SunLab &	Collaborators
Provider, Government, University, Company
4
SunLab Healthcare	Projects
• Predictive	analytics	pipeline	&	Bayesian	optimization
• Patient	phenotyping
• Treatment	recommendation
• Epilepsy	patient	prediction
• Heart	failure	prediction
• Disease	progression	modeling
5
SunLab Healthcare	Projects
• Predictive	analytics	pipeline	&	Bayesian	optimization
• Patient	phenotyping
• Treatment	recommendation
• Epilepsy	patient	prediction
• Heart	failure	prediction
• Disease	progression	modeling
6
SunLab Healthcare	Projects
• Predictive	analytics	pipeline	&	Bayesian	optimization
• Patient	phenotyping
• Treatment	recommendation
• Epilepsy	patient	prediction
• Heart	failure	prediction
• Disease	progression	modeling
7
[Figure: patient timeline showing the observation window, diagnosis date, index date, and prediction window]
SunLab Healthcare	Projects
• Predictive	analytics	pipeline	&	Bayesian	optimization
• Patient	phenotyping
• Treatment	recommendation
• Epilepsy	patient	prediction
• Heart	failure	prediction
• Disease	progression	modeling
8
[Figure: example patient record with three visits over time, containing codes such as Cough, Fever, Chill, Pneumonia, Chest X-ray, Tylenol, and IV fluid]
SunLab Healthcare	Projects
• Predictive	analytics	pipeline	&	Bayesian	optimization
• Patient	phenotyping
• Treatment	recommendation
• Epilepsy	patient	prediction
• Heart	failure	prediction
• Disease	progression	modeling
9
Recurrent	Neural	Network	(RNN)
Sequence	Prediction	
with	RNN
Sequence	Prediction	- NLP
• Given	a	sequence	of	symbols,	predict	a	certain	outcome.
• Is	the	given	sentence	positive	or	negative	?
• “Justice”	“League”	“is”	“as”	“impressive”	“as”	“a”	“preschool”	“Christmas”	“play”
• Each	word	is	a	symbol
• Outcome:	0,	1	(binary)
• The	sentence	is	either	positive	or	negative.
11
Sequence	Prediction	- EHR
• Given	a	sequence	of	symbols,	predict	a	certain	outcome.
• Given	a	diagnosis	history,	will	the	patient	have	heart	failure?
• Hypertension,	Hypertension,	Diabetes,	CKD,	CKD,	Diabetes,	MI
• Each	diagnosis	is	a	symbol
• Outcome:	0, 1	(binary)
• Either	you	have	HF,	or	you	don’t
12
What	is	sequence	prediction?
• Given	a	sequence	of	symbols,	predict	a	certain	outcome.
• Where	is	the	boundary	between	exons	and	introns	in	the	DNA	
sequence?
• What	is	the	French	translation	of	the	given	English	sentence?
• Given	a	diagnosis	history,	what	will	he/she	have	in	the	next	visit?
13
Sequence	prediction	with	MLP
• Let’s	start	with	a	simple	Multi-layer	Perceptron	(MLP)
• Sentiment	classification	(positive	or	negative?)
• “justice league is as impressive as a preschool christmas play”
0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, …   x (a vector with 1M elements, one for each word)
(word positions shown: boy, cat, justice, preschool, fun)
14
Sequence	prediction	with	MLP
• Let’s	start	with	a	simple	Multi-layer	Perceptron	(MLP)
• Sentiment	classification	(positive	or	negative?)
Input	Layer	x
Hidden	Layer	h
x	(a	vector	with	1M	elements.	One	for	each	word)
h = σ(Wh^T x)   (transform x for an easier prediction)
15
Sequence	prediction	with	MLP
• Let’s	start	with	a	simple	Multi-layer	Perceptron	(MLP)
• Sentiment	classification	(positive	or	negative?)
Input	Layer	x
Hidden	Layer	h
Output	y
x	(a	vector	with	1M	elements.	One	for	each	word)
h = σ(Wh^T x)   (transform x for an easier prediction)
y = σ(wo^T h)   (generate an outcome 0.0~1.0)
16
Sequence	prediction	with	MLP
• Let’s	start	with	a	simple	Multi-layer	Perceptron	(MLP)
• Sentiment	classification	(positive	or	negative?)
Input	Layer	x
Hidden	Layer	h
Output	y
x	(a	vector	with	1M	elements.	One	for	each	word)
h = σ(Wh^T x)   (transform x for an easier prediction)
y = σ(wo^T h)   (generate an outcome 0.0~1.0)
17
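The bag-of-words MLP above can be written out in a few lines. A minimal NumPy sketch, assuming an illustrative vocabulary size (the slides use ~1M words) and untrained random weights in place of learned parameters:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

vocab_size, hidden_size = 10_000, 128                    # illustrative sizes (slides: ~1M words)
W_h = np.random.randn(vocab_size, hidden_size) * 0.01    # hidden-layer weights (untrained)
w_o = np.random.randn(hidden_size) * 0.01                # output weights (untrained)

def predict_sentiment(word_indices):
    """Multi-hot sentence vector -> hidden layer -> sigmoid output in (0, 1)."""
    x = np.zeros(vocab_size)
    x[word_indices] = 1.0            # one element set per word appearing in the sentence
    h = sigmoid(W_h.T @ x)           # h = sigma(Wh^T x)
    return sigmoid(w_o @ h)          # y = sigma(wo^T h)
```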
Sequence	prediction	with	RNN
• Now	let’s	use	Recurrent	Neural	Network	(RNN)
• Same	sentiment	classification	(positive	or	negative?)
Hidden	Layer	h1
h1 = σ(Wi^T x1)
x1 (a	vector	with	1M	elements.	Only	“justice”	is	1.)
18
0,	0,	0,	1,	0,	0,	0,	0,	0,	0,	0,	0,	…
justice
Sequence	prediction	with	RNN
• Let’s	use	RNN	now
• Same	sentiment	classification	(positive	or	negative?)
h2
League
h1
Justice
x1 x2
h2 = σ(Wh^T h1 + Wi^T x2)
19
Sequence	prediction	with	RNN
• Let’s	use	RNN	now
• Same	sentiment	classification	(positive	or	negative?)
h9
h10
h2
League Christmas play
h1
Justice
x1 x2 x9 x10
h10 = σ(Wh^T h9 + Wi^T x10)
20
Sequence	prediction	with	RNN
• Let’s	use	RNN	now
• Same	sentiment	classification	(positive	or	negative?)
h10
Output
y = σ(wo^T h10)
Outcome	0.0	~	1.0
21
Sequence	prediction	with	RNN
• Let’s	use	RNN	now
• Same	sentiment	classification	(positive	or	negative?)
h10
Output
y = σ(wo^T h10)
Outcome	0.0	~	1.0
22
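A minimal NumPy sketch of this recurrence, again with illustrative, untrained weights; a practical implementation would use an RNN/LSTM layer from a deep-learning library:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

vocab_size, hidden_size = 10_000, 64
W_i = np.random.randn(vocab_size, hidden_size) * 0.01    # input-to-hidden weights
W_h = np.random.randn(hidden_size, hidden_size) * 0.01   # hidden-to-hidden weights
w_o = np.random.randn(hidden_size) * 0.01                # hidden-to-output weights

def predict_sentiment(word_indices):
    """h_t = sigma(Wh^T h_{t-1} + Wi^T x_t); prediction uses only the last hidden state."""
    h = np.zeros(hidden_size)
    for idx in word_indices:         # one word (symbol) per timestep
        x = np.zeros(vocab_size)
        x[idx] = 1.0
        h = sigmoid(W_h.T @ h + W_i.T @ x)
    return sigmoid(w_o @ h)          # y = sigma(wo^T h_T), outcome between 0.0 and 1.0
```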
Limitation	of	RNN
• Transparency
• RNN	is	a	blackbox
• Feed	input,	receive	output
• Hard	to	tell	what	caused	the	outcome
23
Limitation	of	RNN
• Transparency
• RNN	is	a	blackbox
• Feed	input,	receive	output
• Hard	to	tell	what	caused	the	outcome
• Outcome	0.9
• Was	it	because	of	“Justice”?
• Was	it	because	of	“impressive”?
• Was	it	because	of	“Christmas”?
24
Limitation	of	RNN
• Transparency
• RNN	is	a	blackbox
• Feed	input,	receive	output
• Hard	to	tell	what	caused	the	outcome
h9
h10
h2
League Christmas play
h1
Justice
All	inputs	accumulated	here
25
Attention	mechanism
&
Interpretable	Prediction
Attention	models
• Bahdanau,	Cho,	Bengio,	2014
• English-French	translation	using	RNN
• Let’s	use	hidden	layers	from	all	timesteps to	make	predictions
27
Attention	models
• Bahdanau,	Cho,	Bengio,	2014
• English-French	translation	using	RNN
• Let’s	use	hidden	layers	from	all	timesteps to	make	predictions
h9
h10
h2
League Christmas play
h1
Justice
28
Attention	models
• Bahdanau,	Cho,	Bengio,	2014
• English-French	translation	using	RNN
• Let’s	use	hidden	layers	from	all	timesteps to	make	predictions
h9
h10
h2
League Christmas play
h1
Justice
c
α1, α2, …, α9, α10,   with α1 + α2 + ⋯ + α10 = 1
c = α1 h1 + α2 h2 + ⋯ + α10 h10
29
Attention	models
• Bahdanau,	Cho,	Bengio,	2014
• English-French	translation	using	RNN
• Let’s	use	hidden	layers	from	all	timesteps to	make	predictions
c → Output
α1, α2, …, α9, α10
y = σ(wo^T c)
h9
h10
h2
League Christmas play
h1
Justice
30
Attention	models
• Attention,	what	is	it	good	for?
31
Attention	models
• Attention,	what	is	it	good	for?
• c is an explicit combination of all past information
• α1, α2, ⋯, α10 denote the usefulness of each word
• We can tell which word contributed the most/least to the outcome
c, weighted by α1, α2, …, α10
32
Attention	models
• Attention,	what	is	it	good	for?
• Now c is an explicit combination of all past information
• α1, α2, ⋯, α10 denote the usefulness of each word
• We can tell which word contributed the most/least to the outcome
• The attentions α_i are generated using an MLP
c, weighted by α1, α2, …, α10
33
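A minimal sketch of this attention step, assuming the hidden states h_1, …, h_T have already been produced by the RNN and reducing the attention-scoring MLP to a single linear layer for brevity:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(scores):
    e = np.exp(scores - scores.max())
    return e / e.sum()

def attend_and_predict(H, w_a, w_o):
    """H: (T, hidden) RNN hidden states; returns the outcome and the attention weights."""
    scores = H @ w_a          # one scalar score per timestep (scoring MLP reduced to a linear layer)
    alpha = softmax(scores)   # alpha_1 + ... + alpha_T = 1
    c = alpha @ H             # c = alpha_1 h_1 + ... + alpha_T h_T
    return sigmoid(w_o @ c), alpha   # y = sigma(wo^T c)
```

Inspecting the returned alpha shows which timestep (word) the prediction relied on most.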
Attention	Example
• English-French	translation
• Bahdanau,	Cho,	Bengio 2014
[Figure 3 (Bahdanau et al.): four sample alignments found by RNNsearch; each pixel shows the attention weight α_ij between a source (English) word and a target (French) word, in grayscale (0: black, 1: white)]
34
RETAIN:	Interpretable	Sequence	
Prediction	for	Healthcare	
(NIPS	2016)
Structure	of	EHR
• Assumption	so	far
• Word	sequence	=	Dx sequence
• Justice,	League,	is,	as,	impressive,	as,	…
• Cough,	Benzonatate,	Fever,	Pneumonia,	Chest	X-ray,	Amoxicillin,	...
36
Structure	of	EHR
• Assumption	so	far
• Word	sequence	=	Dx sequence
• Justice,	League,	is,	as,	impressive,	as,	…
• Cough,	Benzonatate,	Fever,	Pneumonia,	Chest	X-ray,	Amoxicillin,	...
[Figure: codes laid out on a timeline - Cough, Benzonatate, Fever, Pneumonia, Chest X-ray, Amoxicillin]
37
Structure	of	EHR
• Assumption	so	far
• Word	sequence	=	Dx sequence
• Justice,	League,	is,	as,	impressive,	as,	…
• Cough,	Benzonatate,	Fever,	Pneumonia,	Chest	X-ray,	Amoxicillin,	...
[Figure: codes laid out on a timeline - Cough, Benzonatate, Fever, Pneumonia, Chest X-ray, Amoxicillin]
38
Structure	of	EHR
• Assumption	so	far
• Word	sequence	=	Dx sequence
• Justice,	League,	is,	as,	impressive,	as,	…
• Cough,	Benzonatate,	Fever,	Pneumonia,	Chest	X-ray,	Amoxicillin,	...
[Figure: the same record grouped into visits - three visits over time containing Cough, Fever, Chill, Pneumonia, Chest X-ray, Tylenol, and IV fluid]
39
Straightforward	RNN	for	EHR
• RNN	now	accepts	multiple	medical	codes	at	each	timestep (i.e.	visit)
1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, …   x1 (first-visit vector with 40K elements, one for each medical code; set positions correspond to codes such as cough, fever, tylenol, pneumonia)
40
[Figure: Visit 1 containing Cough, Fever, Tylenol, IV fluid]
Straightforward	RNN	for	EHR
• RNN	now	accepts	multiple	medical	codes	at	each	timestep (aka	visit)
Input	Layer	x1
Embedding	Layer	v1
x1 (a	multi-hot	vector	with	40K	elements.	One	for	each	code)
v1 = tanh(Wv^T x1)   (transform x to a compact representation)
41
Straightforward	RNN	for	EHR
• RNN	now	accepts	multiple	medical	codes	at	each	timestep (aka	visit)
Input	Layer	x1
Embedding	Layer	v1
x1 (a	multi-hot	vector	with	40K	elements.	One	for	each	code)
v1 = tanh(Wv^T x1)   (transform x to a compact representation)
Hidden	Layer	h1
h1 = σ(Wi^T v1)
42
Straightforward	RNN	for	EHR
• RNN	now	accepts	multiple	medical	codes	at	each	timestep (aka	visit)
x1
v1
Hidden	Layer	h1
x2
v2
Hidden	Layer	h2
h2 = σ(Wh^T h1 + Wi^T v2)
43
Straightforward	RNN	for	EHR
• RNN	now	accepts	multiple	medical	codes	at	each	timestep (aka	visit)
x1
v1
Hidden	Layer	h1
x2
v2
Hidden	Layer	h2
xT
vT
Hidden	Layer	hT
hT = σ(Wh^T hT-1 + Wi^T vT)
44
Straightforward	RNN	for	EHR
• RNN	now	accepts	multiple	medical	codes	at	each	timestep (aka	visit)
Hidden	Layer	hT
Output
y = σ(wo^T hT)
Outcome	0.0	~	1.0
45
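A minimal NumPy sketch of this visit-level RNN, assuming each visit is given as a list of medical-code indices and all weights are untrained placeholders:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

num_codes, emb_size, hidden_size = 40_000, 128, 128
W_v = np.random.randn(num_codes, emb_size) * 0.01        # code-embedding weights
W_i = np.random.randn(emb_size, hidden_size) * 0.01      # input-to-hidden weights
W_h = np.random.randn(hidden_size, hidden_size) * 0.01   # hidden-to-hidden weights
w_o = np.random.randn(hidden_size) * 0.01                # output weights

def predict_hf(visits):
    """visits: list of visits, each a list of medical-code indices, in chronological order."""
    h = np.zeros(hidden_size)
    for codes in visits:
        x = np.zeros(num_codes)
        x[codes] = 1.0                        # multi-hot visit vector
        v = np.tanh(W_v.T @ x)                # v_t = tanh(Wv^T x_t)
        h = sigmoid(W_h.T @ h + W_i.T @ v)    # h_t = sigma(Wh^T h_{t-1} + Wi^T v_t)
    return sigmoid(w_o @ h)                   # heart-failure probability, 0.0 ~ 1.0
```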
RETAIN:	Motivation
• Which	visit	contributes	more	to	the	final	prediction?
x1
v1
Hidden	Layer	h1
x2
v2
Hidden	Layer	h2
xT
vT
Hidden	Layer	hT
46
RETAIN:	Motivation
• Within	a	single	visit,	which	code	contributes	more	to	the	prediction?
v1
Hidden	Layer	h1
v2
Hidden	Layer	h2
vT
Hidden	Layer	hT
1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, …   (codes: cough, fever, tylenol, pneumonia)
47
RETAIN:	Design	Choices
48
[Diagram: standard attention model vs RETAIN, side by side]
RETAIN:	Design	Choices
49
[Diagram: standard attention model vs RETAIN]
Standard attention model: an RNN embeds the visits. RETAIN: an MLP embeds the visits.
RETAIN:	Design	Choices
50
[Diagram: standard attention model vs RETAIN]
Standard attention model: an MLP generates the attentions for the visits. RETAIN: an RNN generates the attentions for the visits.
RETAIN:	Design	Choices
51
[Diagram: standard attention model vs RETAIN]
RETAIN: another RNN generates the attentions for the codes within each visit.
RETAIN:	Design	Choices
52
[Diagram: standard attention model vs RETAIN]
Both: the visits are combined for the prediction.
RETAIN:	Design	Choices
53
[Diagram: standard attention model vs RETAIN]
Standard attention model: less interpretable end-to-end. RETAIN: interpretable end-to-end.
RETAIN:	Model	Architecture
[Figure 2: Unfolded view of RETAIN's architecture. Given the input sequence x1, …, xi, the model predicts the label yi.]
54
RETAIN:	Model	Architecture
[Figure 2: Unfolded view of RETAIN's architecture]
In NLP attention (Bahdanau et al.), the model scores every source word and combines the encoder states into a context vector c_j = Σ_i α_{j,i} h_i when producing the j-th target word; attention lets the model focus on specific source words for each target word. RETAIN defines a temporal attention mechanism in the same spirit for healthcare: doctors pay attention to specific clinical information and its timing when reviewing EHR data.
One key idea of RETAIN is to delegate much of the prediction responsibility to the attention-weight generation process, since RNNs become hard to interpret once the recurrent weights have mixed past information into the hidden layer. To preserve both visit-level and variable-level (individual coordinates of xi) influence, RETAIN uses a linear embedding of the input vector xi:
v_i = W_emb x_i      (Step 1)
where v_i ∈ R^m is the embedding of the input vector x_i ∈ R^r, m is the embedding dimension, and W_emb ∈ R^{m×r} is the embedding matrix to learn. A more sophisticated but still interpretable representation, such as an MLP, could be used instead.
55
RETAIN:	Model	Architecture
[Figure 2: Unfolded view of RETAIN's architecture]
RETAIN uses two sets of attention weights: the scalars α_1, …, α_i are visit-level attention weights that govern the influence of each visit embedding v_1, …, v_i, and the vectors β_1, …, β_i are variable-level attention weights that focus on each coordinate of the visit embeddings v_{1,1}, v_{1,2}, …, v_{i,m}. Two RNNs, RNN_α and RNN_β, separately generate the α's and β's:
g_i, g_{i-1}, …, g_1 = RNN_α(v_i, v_{i-1}, …, v_1)
e_j = w_α^T g_j + b_α,   for j = 1, …, i
α_1, α_2, …, α_i = Softmax(e_1, e_2, …, e_i)
h_i, h_{i-1}, …, h_1 = RNN_β(v_i, v_{i-1}, …, v_1)
β_j = tanh(W_β h_j + b_β),   for j = 1, …, i
where g_i ∈ R^p is the hidden layer of RNN_α at time step i, h_i ∈ R^q the hidden layer of RNN_β, and w_α ∈ R^p, b_α ∈ R, W_β ∈ R^{m×q}, b_β ∈ R^m are the parameters to learn. The hyperparameters p and q determine the hidden-layer sizes of RNN_α and RNN_β.
56
RETAIN:	Model	Architecture
[Figure 2: Unfolded view of RETAIN's architecture]
g_i, g_{i-1}, …, g_1 = RNN_α(v_i, v_{i-1}, …, v_1)
e_j = w_α^T g_j + b_α,   for j = 1, …, i
α_1, α_2, …, α_i = Softmax(e_1, e_2, …, e_i)      (Step 2)
h_i, h_{i-1}, …, h_1 = RNN_β(v_i, v_{i-1}, …, v_1)
β_j = tanh(W_β h_j + b_β),   for j = 1, …, i      (Step 3)
57
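A minimal NumPy sketch of Steps 2 and 3, assuming the visit embeddings v_1, …, v_i are already computed and standing in a plain tanh recurrence for RNN_α and RNN_β; the parameter names in the dictionary p are illustrative:

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

def simple_rnn(V, W_in, W_rec):
    """Plain tanh recurrence over the rows of V; stands in for RNN_alpha / RNN_beta."""
    h = np.zeros(W_rec.shape[0])
    states = []
    for v in V:
        h = np.tanh(W_in @ v + W_rec @ h)
        states.append(h)
    return np.array(states)

def attention_weights(V, p):
    """V: (i, m) visit embeddings. Returns visit-level alphas (i,) and variable-level betas (i, m)."""
    V_rev = V[::-1]                                            # feed visits in reverse time order
    G = simple_rnn(V_rev, p["W_in_a"], p["W_rec_a"])[::-1]     # g_1 ... g_i, back in forward order
    H = simple_rnn(V_rev, p["W_in_b"], p["W_rec_b"])[::-1]     # h_1 ... h_i
    e = G @ p["w_a"] + p["b_a"]                                # e_j = w_a^T g_j + b_a
    alpha = softmax(e)                                         # Step 2: alphas sum to 1
    beta = np.tanh(H @ p["W_b"].T + p["b_b"])                  # Step 3: beta_j = tanh(W_b h_j + b_b)
    return alpha, beta
```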
RETAIN:	Model	Architecture
[Figure 2: Unfolded view of RETAIN's architecture]
When reviewing past records, doctors typically study the most recent records first and then go back in time; running RNN_α and RNN_β in reverse time order mimics this and also lets the generated e's and β's change dynamically across prediction steps i = 1, 2, …, T, which keeps the attention generation computationally stable. The context vector c_i for a patient up to the i-th visit is
c_i = Σ_{j=1}^{i} α_j β_j ⊙ v_j      (Step 4)
where ⊙ denotes element-wise multiplication. The context vector c_i ∈ R^m is used to predict the true label y_i ∈ {0, 1}^s:
ŷ_i = Softmax(W c_i + b)      (Step 5)
where W ∈ R^{s×m} and b ∈ R^s are parameters to learn. Training minimizes the cross-entropy loss
L(x_1, …, x_T) = −(1/N) Σ_{n=1}^{N} (1/T^(n)) Σ_{i=1}^{T^(n)} ( y_i^T log(ŷ_i) + (1 − y_i)^T log(1 − ŷ_i) )      (1)
summing the cross-entropy over all dimensions of ŷ_i; for real-valued outputs, the cross-entropy in Eq. (1) can be replaced by, e.g., the mean squared error.
58
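A minimal sketch of the cross-entropy loss in Eq. (1), assuming the per-visit predictions ŷ_i have already been computed for every patient:

```python
import numpy as np

def retain_loss(Y_true, Y_hat, eps=1e-8):
    """Eq. (1): cross-entropy summed over label dimensions, averaged over each patient's
    visits, then averaged over the N patients. Y_true, Y_hat: one (T_n, s) array per patient."""
    total = 0.0
    for y, y_hat in zip(Y_true, Y_hat):
        ce = y * np.log(y_hat + eps) + (1 - y) * np.log(1 - y_hat + eps)
        total += ce.sum(axis=1).mean()    # sum over the s labels, average over the T_n visits
    return -total / len(Y_true)           # negate and average over patients
```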
RETAIN:	Model	Architecture
[Figure 2: Unfolded view of RETAIN's architecture]
c_i = Σ_{j=1}^{i} α_j β_j ⊙ v_j      (Step 4)
ŷ_i = Softmax(W c_i + b)      (Step 5)
Overall, RETAIN's attention mechanism can be viewed as the inverted architecture of the standard attention mechanism for NLP, where the words are encoded with an RNN and the attention weights are generated with an MLP. RETAIN instead uses an MLP-style embedding of the visits, for easier interpretation, and uses RNNs to generate the two sets of attention weights, recovering the sequential information while mimicking the behavior of physicians.
59
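Putting the pieces together, a minimal sketch of the forward pass (Steps 1, 4 and 5), reusing the attention_weights helper from the sketch above; all parameter names are illustrative and untrained:

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

def retain_forward(X, W_emb, W_out, b_out, p):
    """X: (i, r) multi-hot visit vectors. Returns the label probabilities, shape (s,)."""
    V = X @ W_emb.T                                # Step 1: v_j = W_emb x_j
    alpha, beta = attention_weights(V, p)          # Steps 2-3 (see the earlier sketch)
    c = (alpha[:, None] * beta * V).sum(axis=0)    # Step 4: c_i = sum_j alpha_j * (beta_j ⊙ v_j)
    return softmax(W_out @ c + b_out)              # Step 5: y_hat_i = Softmax(W c_i + b)
```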
RETAIN:	Calculating	the	Contributions
RETAIN's end-to-end behavior can be interpreted as follows. Keeping the attentions α and β fixed (the "attention of the doctors"), analyze how the probability of each label y_{i,1}, …, y_{i,s} changes with each original input variable x_{1,1}, …, x_{1,r}, …, x_{i,1}, …, x_{i,r}; the x_{j,k} that leads to the largest change in y_{i,d} is the input variable with the highest contribution. More formally, given the sequence x_1, …, x_i, we predict the probability of the output vector y_i ∈ {0, 1}^s:
p(y_i | x_1, …, x_i) = p(y_i | c_i) = Softmax(W c_i + b)      (2)
where c_i ∈ R^m denotes the context vector. According to Step 4, c_i is the sum of the visit embeddings weighted by the attentions α and β, so Eq. (2) can be rewritten as
p(y_i | x_1, …, x_i) = Softmax( W ( Σ_{j=1}^{i} α_j β_j ⊙ v_j ) + b )      (3)
Using the fact that the visit embedding v_j is the sum of the columns of W_emb weighted by the elements of x_j, Eq. (3) can be rewritten as
p(y_i | x_1, …, x_i) = Softmax( W ( Σ_{j=1}^{i} α_j β_j ⊙ Σ_{k=1}^{r} x_{j,k} W_emb[:, k] ) + b )
                     = Softmax( Σ_{j=1}^{i} Σ_{k=1}^{r} x_{j,k} α_j W ( β_j ⊙ W_emb[:, k] ) + b )      (4)
where x_{j,k} is the k-th element of the input vector x_j. Eq. (4) tells us that the likelihood of y_i can be completely deconstructed down to the variables at each input x_1, …, x_i.
60
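Eq. (4) can be checked numerically: a small sketch, with the same illustrative names as above, that computes the prediction both through the context vector (Eq. 2) and through the per-variable decomposition (Eq. 4) and confirms the two agree for fixed α's and β's:

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

def check_decomposition(X, W_emb, W_out, b_out, alpha, beta):
    """True iff Softmax(W c_i + b) equals the double sum over visits j and variables k."""
    V = X @ W_emb.T                                  # v_j = W_emb x_j
    c = (alpha[:, None] * beta * V).sum(axis=0)      # Step 4
    p_eq2 = softmax(W_out @ c + b_out)               # Eq. (2)
    logits = b_out.astype(float).copy()
    for j in range(X.shape[0]):                      # visits
        for k in range(X.shape[1]):                  # input variables (codes)
            logits += X[j, k] * alpha[j] * (W_out @ (beta[j] * W_emb[:, k]))
    return np.allclose(p_eq2, softmax(logits))       # Eq. (4)
```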
RETAIN:	Calculating	the	Contributions
[Paper excerpt repeated: Steps 4-5 and Eqs. (2)-(4) above]
61
RETAIN: Calculating the Contributions
[Paper excerpt repeated: Eqs. (2)-(4), deconstructing the prediction into per-variable terms]
62
RETAIN: Calculating the Contributions
From Eq. (4), the contribution ω of the k-th variable of the input x_j at time step j ≤ i, for predicting y_i, is exactly the term inside the iteration over k:
ω(y_i, x_{j,k}) = α_j W( β_j ⊙ W_emb[:, k] ) · x_{j,k}      (5)
where W_emb[:, k] is the k-th column of the embedding matrix, the first factor is the contribution coefficient, the second is the input value, and the index i is omitted from α_j and β_j.
63
RETAIN: Calculating the Contributions
p(y_i | x_1, …, x_i) = Softmax( Σ_{j=1}^{i} Σ_{k=1}^{r} x_{j,k} α_j W ( β_j ⊙ W_emb[:, k] ) + b )      (4)
The scalars x_{j,k} and α_j sit in front of the vector term β_j ⊙ W_emb[:, k].
64
RETAIN: Calculating the Contributions
ω(y_i, x_{j,k}) = α_j W( β_j ⊙ W_emb[:, k] ) · x_{j,k}      (5)
The contribution coefficient α_j W( β_j ⊙ W_emb[:, k] ) times the input value x_{j,k} gives ω(y_i, x_{j,k}): the contribution of the k-th code in the j-th visit to the prediction y_i.
65
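A minimal sketch of Eq. (5) with the same illustrative names: the contribution of code k in visit j to the prediction at visit i is its contribution coefficient times its input value:

```python
import numpy as np

def contribution(alpha, beta, W_emb, W_out, X, j, k):
    """omega(y_i, x_{j,k}) = alpha_j * W(beta_j ⊙ W_emb[:, k]) * x_{j,k}, one value per label."""
    coeff = alpha[j] * (W_out @ (beta[j] * W_emb[:, k]))   # contribution coefficient
    return coeff * X[j, k]                                 # zero whenever code k is absent from visit j
```

Ranking codes by these contributions is what makes the prediction explainable visit by visit and code by code.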
Experiments	&	Results
Heart	Failure	(HF)	Prediction
• Objective
• Given	a	patient	record,	predict	whether	he/she	will	be	diagnosed	with	HF	in	the	
future
• 34K	patients	from	Sutter	PAMF
• 4K	cases,	30K	controls
• Use the 18-month history before being diagnosed with HF
• Number	of	medical	codes
• 283	diagnosis	codes
• 96	medication	codes
• 238	procedure	codes
67
617	medical	codes
Heart	failure	prediction
• Performance	measure
• Area	under	the	ROC	curve	(AUC)
• Competing	models
• Logistic	regression
• Aggregate	all	past	codes	into	a	fixed-size	vector.	Feed	it	to	LR
• MLP
• Aggregate	all	past	codes	into	a	fixed-size	vector.	Feed	it	to	MLP
• Two-layer	RNN
• Visits	are	fed	to	the	RNN,	whose	hidden	layers	are	fed	to	another	RNN.
• RNN+attention (Bahdanau et	al.	2014)
• Visits	are	fed	to	RNN.	Visit-level	attentions	are	generated	by	MLP
• RETAIN
68
Heart	failure	prediction
Models AUC Training time	/	epoch Test	time	for	5K	patients
Logistic	Regression 0.7900	± 0.0111	 0.15s 0.11s
MLP 0.8256	± 0.0096 0.25s 0.11s
Two-layer	RNN 0.8706	± 0.0080	 10.3s 0.57s
RNN+attention 0.8624	± 0.0079 6.7s 0.48s
RETAIN 0.8705	± 0.0081 10.8s 0.63s
• RETAIN	as	accurate	as	RNN
• Requires	similar	training	time	&	test	time
• RETAIN	is	interpretable!
• RNN	is	a	blackbox
69
RETAIN	visualization
• Demo
70
Conclusion
• RETAIN:	interpretable	prediction	framework
• As	accurate	as	RNN
• Interpretable	prediction
• Predictions	can	be	explained
• Can	be	extended	to	general	prognosis
• What diseases is he/she likely to have in the future?
• Can	be	used	for	any	sequences	with	the	two-layer	structure
• E.g.	online	shopping
71
Interpretable	Deep	Learning	
for	Healthcare
Edward	Choi	(mp2893@gatech.edu)
Jimeng Sun	(jsun@cc.gatech.edu)
SunLab (sunlab.org)
How to generate the attentions α_i?
• Use	another	neural	network	model
Input	Layer	x
Hidden	Layer	h
Output	y
x
h = σ(Wh^T x)
y = wo^T h   (outcome −∞ ~ +∞)
Let’s	call	this	function	y=a(x)
73
How to generate the attentions α_i?
• Use	function	a(x)	for	each	word:	Justice,	League,	…,	Christmas,	play
• Feed	the	scores	y1,	y2,	…,	y10 into	the	Softmax function
League playJustice
a(x1)
y1
a(x2)
y2
a(x10)
y10
α_i = exp(y_i) / Σ_{j=1}^{10} exp(y_j)
Christmas
a(x9)
y9
74
How to generate the attentions α_i?
• Use	function	a(x)	for	each	word:	Justice,	League,	…,	Christmas,	play
• Feed	the	scores	y1,	y2,	…,	y10 into	the	Softmax function
League playJustice
a(x1)
y1
a(x2)
y2
a(x10)
y10
α_i = exp(y_i) / Σ_{j=1}^{10} exp(y_j)
Christmas
a(x9)
y9
The Softmax function ensures the α_i's sum to 1
Return
75
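A minimal sketch of this appendix, assuming the scorer a(x) is the small two-layer network above with untrained weights:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def score(x, W_h, w_o):
    """The scoring network a(x): y = wo^T sigma(Wh^T x), an unbounded scalar."""
    return w_o @ sigmoid(W_h.T @ x)

def attentions(X, W_h, w_o):
    """Score every word vector x_1 ... x_T, then apply Softmax so the alphas sum to 1."""
    y = np.array([score(x, W_h, w_o) for x in X])
    e = np.exp(y - y.max())
    return e / e.sum()
```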

 
200820 NAVER TECH CONCERT 13_네이버에서 오픈 소스 개발을 통해 성장하는 방법
200820 NAVER TECH CONCERT 13_네이버에서 오픈 소스 개발을 통해 성장하는 방법200820 NAVER TECH CONCERT 13_네이버에서 오픈 소스 개발을 통해 성장하는 방법
200820 NAVER TECH CONCERT 13_네이버에서 오픈 소스 개발을 통해 성장하는 방법NAVER Engineering
 
200820 NAVER TECH CONCERT 12_상반기 네이버 인턴을 돌아보며
200820 NAVER TECH CONCERT 12_상반기 네이버 인턴을 돌아보며200820 NAVER TECH CONCERT 12_상반기 네이버 인턴을 돌아보며
200820 NAVER TECH CONCERT 12_상반기 네이버 인턴을 돌아보며NAVER Engineering
 
200820 NAVER TECH CONCERT 11_빠르게 성장하는 슈퍼루키로 거듭나기
200820 NAVER TECH CONCERT 11_빠르게 성장하는 슈퍼루키로 거듭나기200820 NAVER TECH CONCERT 11_빠르게 성장하는 슈퍼루키로 거듭나기
200820 NAVER TECH CONCERT 11_빠르게 성장하는 슈퍼루키로 거듭나기NAVER Engineering
 
200819 NAVER TECH CONCERT 07_신입 iOS 개발자 개발업무 적응기
200819 NAVER TECH CONCERT 07_신입 iOS 개발자 개발업무 적응기200819 NAVER TECH CONCERT 07_신입 iOS 개발자 개발업무 적응기
200819 NAVER TECH CONCERT 07_신입 iOS 개발자 개발업무 적응기NAVER Engineering
 

Mehr von NAVER Engineering (20)

React vac pattern
React vac patternReact vac pattern
React vac pattern
 
디자인 시스템에 직방 ZUIX
디자인 시스템에 직방 ZUIX디자인 시스템에 직방 ZUIX
디자인 시스템에 직방 ZUIX
 
진화하는 디자인 시스템(걸음마 편)
진화하는 디자인 시스템(걸음마 편)진화하는 디자인 시스템(걸음마 편)
진화하는 디자인 시스템(걸음마 편)
 
서비스 운영을 위한 디자인시스템 프로젝트
서비스 운영을 위한 디자인시스템 프로젝트서비스 운영을 위한 디자인시스템 프로젝트
서비스 운영을 위한 디자인시스템 프로젝트
 
BPL(Banksalad Product Language) 무야호
BPL(Banksalad Product Language) 무야호BPL(Banksalad Product Language) 무야호
BPL(Banksalad Product Language) 무야호
 
이번 생에 디자인 시스템은 처음이라
이번 생에 디자인 시스템은 처음이라이번 생에 디자인 시스템은 처음이라
이번 생에 디자인 시스템은 처음이라
 
날고 있는 여러 비행기 넘나 들며 정비하기
날고 있는 여러 비행기 넘나 들며 정비하기날고 있는 여러 비행기 넘나 들며 정비하기
날고 있는 여러 비행기 넘나 들며 정비하기
 
쏘카프레임 구축 배경과 과정
 쏘카프레임 구축 배경과 과정 쏘카프레임 구축 배경과 과정
쏘카프레임 구축 배경과 과정
 
플랫폼 디자이너 없이 디자인 시스템을 구축하는 프로덕트 디자이너의 우당탕탕 고통 연대기
플랫폼 디자이너 없이 디자인 시스템을 구축하는 프로덕트 디자이너의 우당탕탕 고통 연대기플랫폼 디자이너 없이 디자인 시스템을 구축하는 프로덕트 디자이너의 우당탕탕 고통 연대기
플랫폼 디자이너 없이 디자인 시스템을 구축하는 프로덕트 디자이너의 우당탕탕 고통 연대기
 
200820 NAVER TECH CONCERT 15_Code Review is Horse(코드리뷰는 말이야)(feat.Latte)
200820 NAVER TECH CONCERT 15_Code Review is Horse(코드리뷰는 말이야)(feat.Latte)200820 NAVER TECH CONCERT 15_Code Review is Horse(코드리뷰는 말이야)(feat.Latte)
200820 NAVER TECH CONCERT 15_Code Review is Horse(코드리뷰는 말이야)(feat.Latte)
 
200819 NAVER TECH CONCERT 03_화려한 코루틴이 내 앱을 감싸네! 코루틴으로 작성해보는 깔끔한 비동기 코드
200819 NAVER TECH CONCERT 03_화려한 코루틴이 내 앱을 감싸네! 코루틴으로 작성해보는 깔끔한 비동기 코드200819 NAVER TECH CONCERT 03_화려한 코루틴이 내 앱을 감싸네! 코루틴으로 작성해보는 깔끔한 비동기 코드
200819 NAVER TECH CONCERT 03_화려한 코루틴이 내 앱을 감싸네! 코루틴으로 작성해보는 깔끔한 비동기 코드
 
200819 NAVER TECH CONCERT 10_맥북에서도 아이맥프로에서 빌드하는 것처럼 빌드 속도 빠르게 하기
200819 NAVER TECH CONCERT 10_맥북에서도 아이맥프로에서 빌드하는 것처럼 빌드 속도 빠르게 하기200819 NAVER TECH CONCERT 10_맥북에서도 아이맥프로에서 빌드하는 것처럼 빌드 속도 빠르게 하기
200819 NAVER TECH CONCERT 10_맥북에서도 아이맥프로에서 빌드하는 것처럼 빌드 속도 빠르게 하기
 
200819 NAVER TECH CONCERT 08_성능을 고민하는 슬기로운 개발자 생활
200819 NAVER TECH CONCERT 08_성능을 고민하는 슬기로운 개발자 생활200819 NAVER TECH CONCERT 08_성능을 고민하는 슬기로운 개발자 생활
200819 NAVER TECH CONCERT 08_성능을 고민하는 슬기로운 개발자 생활
 
200819 NAVER TECH CONCERT 05_모르면 손해보는 Android 디버깅/분석 꿀팁 대방출
200819 NAVER TECH CONCERT 05_모르면 손해보는 Android 디버깅/분석 꿀팁 대방출200819 NAVER TECH CONCERT 05_모르면 손해보는 Android 디버깅/분석 꿀팁 대방출
200819 NAVER TECH CONCERT 05_모르면 손해보는 Android 디버깅/분석 꿀팁 대방출
 
200819 NAVER TECH CONCERT 09_Case.xcodeproj - 좋은 동료로 거듭나기 위한 노하우
200819 NAVER TECH CONCERT 09_Case.xcodeproj - 좋은 동료로 거듭나기 위한 노하우200819 NAVER TECH CONCERT 09_Case.xcodeproj - 좋은 동료로 거듭나기 위한 노하우
200819 NAVER TECH CONCERT 09_Case.xcodeproj - 좋은 동료로 거듭나기 위한 노하우
 
200820 NAVER TECH CONCERT 14_야 너두 할 수 있어. 비전공자, COBOL 개발자를 거쳐 네이버에서 FE 개발하게 된...
200820 NAVER TECH CONCERT 14_야 너두 할 수 있어. 비전공자, COBOL 개발자를 거쳐 네이버에서 FE 개발하게 된...200820 NAVER TECH CONCERT 14_야 너두 할 수 있어. 비전공자, COBOL 개발자를 거쳐 네이버에서 FE 개발하게 된...
200820 NAVER TECH CONCERT 14_야 너두 할 수 있어. 비전공자, COBOL 개발자를 거쳐 네이버에서 FE 개발하게 된...
 
200820 NAVER TECH CONCERT 13_네이버에서 오픈 소스 개발을 통해 성장하는 방법
200820 NAVER TECH CONCERT 13_네이버에서 오픈 소스 개발을 통해 성장하는 방법200820 NAVER TECH CONCERT 13_네이버에서 오픈 소스 개발을 통해 성장하는 방법
200820 NAVER TECH CONCERT 13_네이버에서 오픈 소스 개발을 통해 성장하는 방법
 
200820 NAVER TECH CONCERT 12_상반기 네이버 인턴을 돌아보며
200820 NAVER TECH CONCERT 12_상반기 네이버 인턴을 돌아보며200820 NAVER TECH CONCERT 12_상반기 네이버 인턴을 돌아보며
200820 NAVER TECH CONCERT 12_상반기 네이버 인턴을 돌아보며
 
200820 NAVER TECH CONCERT 11_빠르게 성장하는 슈퍼루키로 거듭나기
200820 NAVER TECH CONCERT 11_빠르게 성장하는 슈퍼루키로 거듭나기200820 NAVER TECH CONCERT 11_빠르게 성장하는 슈퍼루키로 거듭나기
200820 NAVER TECH CONCERT 11_빠르게 성장하는 슈퍼루키로 거듭나기
 
200819 NAVER TECH CONCERT 07_신입 iOS 개발자 개발업무 적응기
200819 NAVER TECH CONCERT 07_신입 iOS 개발자 개발업무 적응기200819 NAVER TECH CONCERT 07_신입 iOS 개발자 개발업무 적응기
200819 NAVER TECH CONCERT 07_신입 iOS 개발자 개발업무 적응기
 

Kürzlich hochgeladen

Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 

Kürzlich hochgeladen (20)

Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 

Interpretable deep learning for healthcare

  • 2. Index • Healthcare & Machine Learning • Sequence Prediction with RNN • Attention mechanism & interpretable prediction • Proposed model: RETAIN • Experiments & results • Conclusion 2
  • 4. SunLab & Healthcare • SunLab & Collaborators ProviderGovernment University Company 4
  • 5. SunLab Healthcare Projects • Predictive analytics pipeline & Bayesian optimization • Patient phenotyping • Treatment recommendation • Epilepsy patient prediction • Heart failure prediction • Disease progression modeling 5
  • 6. SunLab Healthcare Projects • Predictive analytics pipeline & Bayesian optimization • Patient phenotyping • Treatment recommendation • Epilepsy patient prediction • Heart failure prediction • Disease progression modeling 6
  • 7. SunLab Healthcare Projects • Predictive analytics pipeline & Bayesian optimization • Patient phenotyping • Treatment recommendation • Epilepsy patient prediction • Heart failure prediction • Disease progression modeling 7 Observation Window Diagnosis Date Prediction Window Index Date Time
  • 8. SunLab Healthcare Projects • Predictive analytics pipeline & Bayesian optimization • Patient phenotyping • Treatment recommendation • Epilepsy patient prediction • Heart failure prediction • Disease progression modeling 8 Cough Visit 1 Fever Fever Visit 2 Chill Fever Visit 3 Pneumonia Chest X-ray Tylenol IV fluid
  • 9. SunLab Healthcare Projects • Predictive analytics pipeline & Bayesian optimization • Patient phenotyping • Treatment recommendation • Epilepsy patient prediction • Heart failure prediction • Disease progression modeling 9 Recurrent Neural Network (RNN)
  • 11. Sequence Prediction - NLP • Given a sequence of symbols, predict a certain outcome. • Is the given sentence positive or negative ? • “Justice” “League” “is” “as” “impressive” “as” “a” “preschool” “Christmas” “play” • Each word is a symbol • Outcome: 0, 1 (binary) • The sentence is either positive or negative. 11
  • 12. Sequence Prediction - EHR • Given a sequence of symbols, predict a certain outcome. • Given a diagnosis history, will the patient have heart failure? • Hypertension, Hypertension, Diabetes, CKD, CKD, Diabetes, MI • Each diagnosis is a symbol • Outcome: 0, 1 (binary) • Either you have HF, or you don’t 12
  • 13. What is sequence prediction? • Given a sequence of symbols, predict a certain outcome. • Where is the boundary between exons and introns in the DNA sequence? • What is the French translation of the given English sentence? • Given a diagnosis history, what will he/she have in the next visit? 13
  • 14. Sequence prediction with MLP • Let’s start with a simple Multi-layer Perceptron (MLP) • Sentiment classification (positive or negative?) • “justice leagues was as impressive as a preschool christmas play” 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, … x (a vector with 1M elements. One for each word) boy cat justice preschool fun 14
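To make the bag-of-words MLP concrete, here is a minimal NumPy sketch (not from the slides): the tiny vocabulary, the random weights, and the sentence are hypothetical stand-ins for the 1M-word vocabulary and trained parameters described above.

```python
import numpy as np

# Hypothetical tiny vocabulary instead of the 1M-word one in the slides.
vocab = {"justice": 0, "league": 1, "impressive": 2, "preschool": 3, "christmas": 4, "play": 5}
V, H = len(vocab), 4          # vocabulary size, hidden layer size

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Multi-hot bag-of-words vector x: one slot per vocabulary word.
sentence = ["justice", "league", "preschool", "christmas", "play"]
x = np.zeros(V)
for w in sentence:
    x[vocab[w]] = 1.0

# Randomly initialized parameters stand in for trained weights.
rng = np.random.default_rng(0)
W_h = rng.normal(size=(V, H))   # input -> hidden
w_o = rng.normal(size=H)        # hidden -> output

h = sigmoid(W_h.T @ x)          # h = sigma(W_h^T x)
y = sigmoid(w_o @ h)            # scalar probability of "positive"
print(f"P(positive) = {y:.3f}")
```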
  • 23. Limitation of RNN • Transparency • RNN is a black box • Feed input, receive output • Hard to tell what caused the outcome 23
  • 24. Limitation of RNN • Transparency • RNN is a black box • Feed input, receive output • Hard to tell what caused the outcome • Outcome 0.9 • Was it because of “Justice”? • Was it because of “impressive”? • Was it because of “Christmas”? 24
  • 25. Limitation of RNN • Transparency • RNN is a black box • Feed input, receive output • Hard to tell what caused the outcome [diagram: hidden states h1, h2, . . . , h9, h10 over the words “Justice”, “League”, . . . , “Christmas”, “play”; all inputs are accumulated in the last hidden state] 25
  • 27. Attention models • Bahdanau, Cho, Bengio, 2014 • English-French translation using RNN • Let’s use hidden layers from all timesteps to make predictions 27
  • 28. Attention models • Bahdanau, Cho, Bengio, 2014 • English-French translation using RNN • Let’s use hidden layers from all timesteps to make predictions [diagram: hidden states h1, h2, . . . , h9, h10 over the words “Justice”, “League”, . . . , “Christmas”, “play”] 28
  • 29. Attention models • Bahdanau, Cho, Bengio, 2014 • English-French translation using RNN • Let’s use hidden layers from all timesteps to make predictions [diagram: attention weights α1, α2, α3, . . . , α10 over the hidden states, combined into a context vector c] • α1 + α2 + ⋯ + α10 = 1 • c = α1h1 + α2h2 + ⋯ + α10h10 29
  • 30. Attention models • Bahdanau, Cho, Bengio, 2014 • English-French translation using RNN • Let’s use hidden layers from all timesteps to make predictions [diagram: the context vector c feeds the output layer] • Output: y = σ(wo⊤c) 30
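The context-vector computation on these slides can be sketched in a few lines of NumPy; the hidden states and the scoring vector below are random placeholders, assuming a 10-word sentence and softmax-normalized attention weights as in the slides.

```python
import numpy as np

rng = np.random.default_rng(1)
T, D = 10, 8                       # 10 words, hidden size 8

# Stand-ins for the RNN hidden states h_1 ... h_10 (one row per word).
H = rng.normal(size=(T, D))

# Attention scores from a small scoring function (here a learned vector w_a),
# normalized with softmax so that alpha_1 + ... + alpha_10 = 1.
w_a = rng.normal(size=D)
scores = H @ w_a
alphas = np.exp(scores - scores.max())
alphas /= alphas.sum()

# Context vector: explicit weighted combination of all hidden states.
c = alphas @ H                     # c = sum_i alpha_i * h_i

# Final prediction y = sigmoid(w_o^T c).
w_o = rng.normal(size=D)
y = 1.0 / (1.0 + np.exp(-(w_o @ c)))
print("attention weights:", np.round(alphas, 3), "P(positive) =", round(float(y), 3))
```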
  • 32. Attention models • Attention, what is it good for? • c is an explicit combination of all past information • α1, α2, ⋯, α10 denote the usefulness of each word • We can tell which word contributed the most/least to the outcome 32
  • 33. Attention models • Attention, what is it good for? • Now c is an explicit combination of all past information • α1, α2, ⋯, α10 denote the usefulness of each word • We can tell which word contributed the most/least to the outcome • The attentions αi are generated using an MLP 33
  • 34. Attention Example • English-French translation • Bahdanau, Cho, Bengio 2014 [Figure 3 from the paper: four sample alignments found by RNNsearch; each pixel shows the attention weight αij of the j-th source word for the i-th target word in grayscale, for randomly selected sentences of 10–20 words from the test set] 34
  • 36. Structure of EHR • Assumption so far • Word sequence = Dx sequence • Justice, League, is, as, impressive, as, … • Cough, Benzonatate, Fever, Pneumonia, Chest X-ray, Amoxicillin, ... 36
  • 37. Structure of EHR • Assumption so far • Word sequence = Dx sequence • Justice, League, is, as, impressive, as, … • Cough, Benzonatate, Fever, Pneumonia, Chest X-ray, Amoxicillin, ... Cough Benzonatate Fever Pneumonia Amoxicillin Chest X-ray Time 37
  • 38. Structure of EHR • Assumption so far • Word sequence = Dx sequence • Justice, League, is, as, impressive, as, … • Cough, Benzonatate, Fever, Pneumonia, Chest X-ray, Amoxicillin, ... Cough Benzonatate Fever Pneumonia Amoxicillin Chest X-ray Time 38
  • 39. Structure of EHR • Assumption so far • Word sequence = Dx sequence • Justice, League, is, as, impressive, as, … • Cough, Benzonatate, Fever, Pneumonia, Chest X-ray, Amoxicillin, ... Cough Visit 1 Fever Fever Visit 2 Chill Fever Visit 3 Pneumonia Chest X-ray Tylenol IV fluid 39
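As a small illustration of this two-level structure, a patient record can be held as a list of visits, each visit being a set of medical codes; the codes below are the ones from the slide’s example.

```python
# One patient as a sequence of visits; each visit is a set of medical codes
# (diagnoses, medications, procedures), mirroring the example in the slides.
patient = [
    {"Cough", "Fever", "Tylenol", "IV fluid"},          # Visit 1
    {"Fever", "Chill"},                                  # Visit 2
    {"Fever", "Pneumonia", "Chest X-ray"},               # Visit 3
]
for i, visit in enumerate(patient, 1):
    print(f"Visit {i}: {sorted(visit)}")
```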
  • 40. Straightforward RNN for EHR • RNN now accepts multiple medical codes at each timestep (i.e. visit) 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, … x1 (First visit vector with 40K elements. One for each medical code) cough fever tylenol pneumonia 40 Cough Visit 1 Fever Tylenol IV fluid
  • 41. Straightforward RNN for EHR • RNN now accepts multiple medical codes at each timestep (aka visit) Input Layer x1 Embedding Layer v1 x1 (a multi-hot vector with 40K elements. One for each code) v1 = tanh(Wv Tx1) (Transform x to a compact representation) 41
  • 42. Straightforward RNN for EHR • RNN now accepts multiple medical codes at each timestep (aka visit) Input Layer x1 Embedding Layer v1 x1 (a multi-hot vector with 40K elements. One for each code) v1 = tanh(Wv Tx1) (Transform x to a compact representation) Hidden Layer h1 h1= 𝝈(Wi Tv1) 42
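A minimal sketch of the visit pipeline described above, assuming toy sizes (12 codes instead of ~40K) and random weights; the recurrent connection of the RNN is omitted, so only the multi-hot encoding, the visit embedding, and the first hidden state are shown.

```python
import numpy as np

rng = np.random.default_rng(2)
R, M, Q = 12, 6, 5          # toy sizes: 12 codes (instead of ~40K), embedding 6, hidden 5

# Hypothetical code index; in practice this would cover tens of thousands of codes.
codes = {"cough": 0, "fever": 1, "tylenol": 2, "iv_fluid": 3, "chill": 4, "pneumonia": 5}

def multi_hot(visit):
    x = np.zeros(R)
    for c in visit:
        x[codes[c]] = 1.0
    return x

W_v = rng.normal(size=(R, M))   # embedding matrix
W_i = rng.normal(size=(M, Q))   # input-to-hidden weights (recurrence omitted for brevity)

x1 = multi_hot(["cough", "fever", "tylenol", "iv_fluid"])   # first visit
v1 = np.tanh(W_v.T @ x1)                                     # v1 = tanh(W_v^T x1)
h1 = 1.0 / (1.0 + np.exp(-(W_i.T @ v1)))                     # h1 = sigma(W_i^T v1)
print("visit embedding:", np.round(v1, 2))
print("hidden state   :", np.round(h1, 2))
```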
  • 48. RETAIN: Design Choices [diagram: a standard attention model shown next to RETAIN] 48
  • 49. RETAIN: Design Choices [diagram] • Standard attention model: an RNN embeds the visits • RETAIN: an MLP embeds the visits 49
  • 50. RETAIN: Design Choices [diagram] • Standard attention model: an MLP generates the attentions for the visits • RETAIN: an RNN generates the attentions for the visits 50
  • 51. RETAIN: Design Choices [diagram] • RETAIN: another RNN generates the attentions for the codes within each visit 51
  • 52. RETAIN: Design Choices [diagram] • In both models, the visits are combined for prediction 52
  • 53. RETAIN: Design Choices [diagram] • Standard attention model: less interpretable end-to-end • RETAIN: interpretable end-to-end 53
  • 54. RETAIN: Model Architecture [Figure 2: unfolded view of RETAIN’s architecture. Given the input sequence x1, . . . , xi, the model predicts the label yi; two RNNs (RNNα and RNNβ) run over the visit embeddings in reverse time order to generate the attention weights.] 54
  • 55. RETAIN: Model Architecture • Step 1: embed each visit, vi = E xi, where vi ∈ Rm is the embedding of the input vector xi ∈ Rr, m is the embedding dimension, and E ∈ Rm×r is the embedding matrix to learn. A more sophisticated but still interpretable representation (e.g., an MLP) could also be used. 55
  • 56. RETAIN: Model Architecture • Two sets of attention weights are used: the scalars α1, . . . , αi are visit-level attention weights that govern the influence of each visit embedding v1, . . . , vi, and the vectors β1, . . . , βi are variable-level attention weights that focus on each coordinate of the visit embeddings v1,1, v1,2, . . . , vi,m. 56
  • 57. RETAIN: Model Architecture • Two RNNs, RNNα and RNNβ, separately generate the α’s and β’s: gi, gi−1, . . . , g1 = RNNα(vi, vi−1, . . . , v1); ej = wα⊤gj + bα for j = 1, . . . , i; α1, α2, . . . , αi = Softmax(e1, e2, . . . , ei) (Step 2). hi, hi−1, . . . , h1 = RNNβ(vi, vi−1, . . . , v1); βj = tanh(Wβ hj + bβ) for j = 1, . . . , i (Step 3). Here gi ∈ Rp and hi ∈ Rq are the hidden layers of RNNα and RNNβ at time step i, and wα, bα, Wβ, bβ are parameters to learn. 57
  • 58. RETAIN: Model Architecture • Both RNNs run in reversed time order, mimicking how doctors review the most recent records first; this also lets the attention weights change dynamically across prediction time steps and keeps the attention generation computationally stable. • Step 4: the context vector for a patient up to the i-th visit is ci = Σj=1..i αj βj ⊙ vj, where ⊙ denotes element-wise multiplication. 58
  • 59. RETAIN: Model Architecture • Step 5: the prediction is ŷi = Softmax(W ci + b), with parameters W and b to learn. • Training minimizes the cross-entropy loss L(x1, . . . , xT) = −(1/N) Σn (1/T(n)) Σi [ yi⊤ log(ŷi) + (1 − yi)⊤ log(1 − ŷi) ]; for real-valued outputs the cross-entropy can be replaced by, e.g., mean squared error. 59
  • 60. RETAIN: Calculating the Contributions • To interpret the end-to-end behavior, keep the α and β values fixed (the “attention of the doctor”) and analyze how the probability of each label changes with each input variable xj,k; the xj,k that leads to the largest change has the highest contribution. • p(yi | x1, . . . , xi) = p(yi | ci) = Softmax(W ci + b) (Eq. 2). 60
  • 61. RETAIN: Calculating the Contributions • Since ci is the sum of the visit embeddings v1, . . . , vi weighted by the attentions α and β (Step 4), Eq. (2) can be rewritten as p(yi | x1, . . . , xi) = Softmax( W ( Σj=1..i αj βj ⊙ vj ) + b ) (Eq. 3). 61
  • 62. RETAIN: Calculating the Contributions • Since the visit embedding vj is the sum of the columns of E weighted by the elements of xj, Eq. (3) becomes p(yi | x1, . . . , xi) = Softmax( Σj=1..i Σk=1..r xj,k αj W (βj ⊙ e:,k) + b ) (Eq. 4), where xj,k is the k-th element of xj and e:,k is the k-th column of E. 62
  • 63. RETAIN: Calculating the Contributions • Inside the iteration over k, the scalar xj,k can be pulled to the front: Eq. (4) shows that the likelihood of yi can be completely deconstructed down to the individual variables of each input x1, . . . , xi. 63
  • 64. RETAIN: Calculating the Contributions • The contribution ω of the k-th variable of the input xj at time step j ≤ i for predicting yi is ω(yi, xj,k) = αj W (βj ⊙ e:,k) · xj,k (Eq. 5), i.e., a contribution coefficient multiplied by the input value. 64
  • 65. RETAIN: Calculating the Contributions • ω(yi, xj,k) is the contribution of the k-th code in the j-th visit: it combines the visit-level attention αj, the variable-level attention βj, the embedding column e:,k, and the output weights W. 65
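The five steps and the contribution formula in Eq. (5) can be put together in a short NumPy sketch. This is a simplified illustration, not the authors’ implementation: the two attention RNNs are replaced by plain tanh recurrences run in reversed time order, and all sizes, weights, and inputs are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
r, m, p, q, s = 10, 6, 5, 5, 2    # codes, embedding, RNN_alpha, RNN_beta, labels (toy sizes)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Parameters (random stand-ins for trained weights).
E = rng.normal(size=(m, r))                                    # embedding matrix (Step 1)
U_a, V_a = rng.normal(size=(p, m)), rng.normal(size=(p, p))    # RNN_alpha (plain recurrence)
w_a, b_a = rng.normal(size=p), 0.0
U_b, V_b = rng.normal(size=(q, m)), rng.normal(size=(q, q))    # RNN_beta (plain recurrence)
W_b, b_b = rng.normal(size=(m, q)), np.zeros(m)
W, b = rng.normal(size=(s, m)), np.zeros(s)                    # output layer (Step 5)

# A toy patient: 3 visits, each a multi-hot vector over r codes.
X = np.zeros((3, r))
X[0, [0, 1]] = 1; X[1, [1, 4]] = 1; X[2, [1, 5, 7]] = 1

V = np.tanh(X @ E.T)                  # Step 1: visit embeddings v_j

# Steps 2 & 3: run both recurrences in reversed time order (most recent visit first).
g, h = np.zeros(p), np.zeros(q)
e_scores, betas = [], []
for v in V[::-1]:
    g = np.tanh(U_a @ v + V_a @ g)
    h = np.tanh(U_b @ v + V_b @ h)
    e_scores.append(w_a @ g + b_a)
    betas.append(np.tanh(W_b @ h + b_b))
e_scores, betas = e_scores[::-1], betas[::-1]      # back to chronological order
alphas = softmax(np.array(e_scores))               # Step 2: visit-level attention

# Step 4: context vector c_i = sum_j alpha_j * (beta_j ⊙ v_j).
c = sum(a * (bta * v) for a, bta, v in zip(alphas, betas, V))
y_hat = softmax(W @ c + b)                          # Step 5: prediction
print("prediction:", np.round(y_hat, 3))

# Eq. (5): contribution of code k in visit j = alpha_j * W (beta_j ⊙ E[:, k]) * x_{j,k}.
contrib = np.zeros((len(V), r, s))
for j in range(len(V)):
    for k in range(r):
        contrib[j, k] = alphas[j] * (W @ (betas[j] * E[:, k])) * X[j, k]
print("largest contribution to label 0 (visit, code):",
      np.unravel_index(contrib[:, :, 0].argmax(), (len(V), r)))
```

Because every quantity in Eq. (5) is available after a single forward pass, the per-code contributions come essentially for free once the α’s, β’s, E, and W are known.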
  • 67. Heart Failure (HF) Prediction • Objective • Given a patient record, predict whether he/she will be diagnosed with HF in the future • 34K patients from Sutter PAMF • 4K cases, 30K controls • Use 18-months history before being diagnosed with HF • Number of medical codes • 283 diagnosis codes • 96 medication codes • 238 procedure codes 67 617 medical codes
  • 68. Heart failure prediction • Performance measure • Area under the ROC curve (AUC) • Competing models • Logistic regression • Aggregate all past codes into a fixed-size vector. Feed it to LR • MLP • Aggregate all past codes into a fixed-size vector. Feed it to MLP • Two-layer RNN • Visits are fed to the RNN, whose hidden layers are fed to another RNN. • RNN+attention (Bahdanau et al. 2014) • Visits are fed to RNN. Visit-level attentions are generated by MLP • RETAIN 68
  • 69. Heart failure prediction
Model               | AUC             | Training time / epoch | Test time for 5K patients
Logistic Regression | 0.7900 ± 0.0111 | 0.15s                 | 0.11s
MLP                 | 0.8256 ± 0.0096 | 0.25s                 | 0.11s
Two-layer RNN       | 0.8706 ± 0.0080 | 10.3s                 | 0.57s
RNN+attention       | 0.8624 ± 0.0079 | 6.7s                  | 0.48s
RETAIN              | 0.8705 ± 0.0081 | 10.8s                 | 0.63s
• RETAIN is as accurate as the RNN • Requires similar training time & test time • RETAIN is interpretable! • The RNN is a black box 69
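For reference, the AUC reported in the table can be computed with scikit-learn’s roc_auc_score; the labels and scores below are made-up values, not the Sutter PAMF data.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical labels (1 = heart failure case, 0 = control) and model scores.
y_true = np.array([1, 0, 0, 1, 0, 1, 0, 0])
y_score = np.array([0.81, 0.20, 0.35, 0.66, 0.48, 0.90, 0.15, 0.55])
print("AUC =", roc_auc_score(y_true, y_score))
```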
  • 71. Conclusion • RETAIN: an interpretable prediction framework • As accurate as an RNN • Interpretable prediction • Predictions can be explained • Can be extended to general prognosis • What are the likely diseases he/she will have in the future? • Can be used for any sequence with the two-layer structure • E.g. online shopping 71
  • 74. How to generate the attentions αi? • Use function a(x) for each word: Justice, League, …, Christmas, play • Feed the scores y1, y2, …, y10 into the Softmax function [diagram: a(x1) → y1, a(x2) → y2, . . . , a(x9) → y9, a(x10) → y10 for the words Justice, League, . . . , Christmas, play] • αi = exp(yi) / Σj=1..10 exp(yj) 74
  • 75. How to generate the attentions αi? • Use function a(x) for each word: Justice, League, …, Christmas, play • Feed the scores y1, y2, …, y10 into the Softmax function • αi = exp(yi) / Σj=1..10 exp(yj) • The Softmax function ensures the αi’s sum to 1 • Return 75
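A two-line NumPy version of this softmax normalization, assuming the ten per-word scores have already been produced by the scoring function a(x); subtracting the maximum score is a standard numerical-stability trick and does not change the resulting weights.

```python
import numpy as np

# Per-word scores y_1 ... y_10 from a hypothetical scoring function a(x).
scores = np.array([0.2, 1.5, -0.3, 0.1, 2.2, 0.0, -1.0, 0.4, 1.1, 0.6])

# Softmax: alpha_i = exp(y_i) / sum_j exp(y_j); the weights are positive and sum to 1.
alphas = np.exp(scores - scores.max())
alphas /= alphas.sum()
print(np.round(alphas, 3), "sum =", alphas.sum())
```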