Time series data is increasingly ubiquitous. This trend is especially obvious in health and wellness, with both the adoption of electronic health record (EHR) systems in hospitals and clinics and the proliferation of wearable sensors. In 2009, intensive care units in the United States treated nearly 55,000 patients per day, generating digital-health databases containing millions of individual measurements, most of those forming time series. In the first quarter of 2015 alone, over 11 million health-related wearables were shipped by vendors. Recording hundreds of measurements per day per user, these devices are fueling a health time series data explosion. As a result, we will need ever more sophisticated tools to unlock the true value of this data to improve the lives of patients worldwide.
Deep learning, specifically with recurrent neural networks (RNNs), has emerged as a central tool in a variety of complex temporal-modeling problems, such as speech recognition. However, RNNs are also among the most challenging models to work with, particularly outside the domains where they are widely applied. Josh Patterson, David Kale, and Zachary Lipton bring the open source deep learning library DL4J to bear on the challenge of analyzing clinical time series using RNNs. DL4J provides a reliable, efficient implementation of many deep learning models embedded within an enterprise-ready open source data ecosystem (e.g., Hadoop and Spark), making it well suited to complex clinical data. Josh, David, and Zachary offer an overview of deep learning and RNNs and explain how they are implemented in DL4J. They then demonstrate a workflow example that uses a pipeline based on DL4J and Canova to prepare publicly available clinical data from PhysioNet and apply the DL4J RNN.
Modeling Electronic Health Records with Recurrent Neural Networks
1. Modeling electronic health records with recurrent neural networks
David C. Kale,¹,² Zachary C. Lipton,³ Josh Patterson⁴
STRATA - San Jose - 2016
¹ University of Southern California
² Virtual PICU, Children’s Hospital Los Angeles
³ University of California San Diego
⁴ Patterson Consulting
2. Outline
• Machine (and deep) learning
• Sequence learning with recurrent neural networks
• Clinical sequence classification using LSTM RNNs
• A real world case study using DL4J
• Conclusion and looking forward
8. When/why does this fail?
• Sometimes the correct function cannot be encoded a priori (e.g., what counts as spam?)
• The optimal solution might change over time
• Programmers are expensive
12. Activation Functions
• At internal nodes, common choices for the activation function are the sigmoid, tanh, and ReLU functions (definitions sketched below)
• At the output layer, the activation function can be linear (regression), sigmoid (multilabel classification), or softmax (multiclass classification)
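For reference, the standard definitions of these activation functions (our addition, not spelled out on the slide) are:
σ(a) = 1 / (1 + exp(−a))
tanh(a) = (exp(a) − exp(−a)) / (exp(a) + exp(−a))
ReLU(a) = max(0, a)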
13. Training with Backpropagation
• Goal: calculate the rate of change of the loss function with respect to each parameter (weight) in the model
• Update the weights by following the negative gradient (see the update rule sketched below)
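In our notation (not on the original slide), the stochastic gradient descent update for each weight w, with learning rate η and loss L, is:
w ← w − η · ∂L/∂w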
16. Deep Networks
• It used to be difficult (it seemed impossible) to train nets with many hidden layers
• TL;DR: it turns out we just needed to do everything 1,000x faster…
17. Outline
• Machine (and deep) learning
• Sequence learning with recurrent neural networks
• Clinical sequence classification using LSTM RNNs
• A real world case study using DL4J
• Conclusion and looking forward
20. We would like to capture temporal/sequential dynamics in the data
• Standard approaches address sequential structure:
– Markov models
– Conditional random fields
– Linear dynamical systems
• Problem: we desire a system that learns representations, captures nonlinear structure, and captures long-term sequential relationships
29. Outline
• Machine (and deep) learning
• Sequence learning with recurrent neural networks
• Clinical sequence classification using LSTM RNNs
• A real world case study using DL4J
• Conclusion and looking forward
33. Challenges: sampling rates, missingness
• Sparse, irregular, unaligned sampling in time, across variables
• Sample selection bias (e.g., more likely to record abnormal values)
• Entire sequences (non-randomly) missing
[Figure: HR, RR, and ETCO2 traces from admission to discharge, illustrating irregular sampling; courtesy of Ben Marlin, UMass Amherst]
34. Challenges: alignment, variable length
• Observations begin at time of admission, not at onset of illness
• Sequences vary in length from hours to weeks (or longer)
• Variable dynamics across patients, even with the same disease
• Long-term dependencies: future state depends on earlier condition
[Figure: HR traces for two patients from admission to discharge, illustrating misalignment and varying lengths; courtesy of Ben Marlin, UMass Amherst]
35. PhysioNet Challenge 2012
• Task: predict in-hospital mortality from only the first 48 hours of data
• Classic models (SAPS, APACHE, PRISM): expert features + regression
– Useful for quantifying illness at admission and for standardized performance comparisons
– Not accurate enough to be used for decision support
• Each record includes
– patient descriptors (age, gender, weight, height, ICU unit)
– irregular sequences of ~40 vitals and labs from the first 48 hours
– one treatment variable: mechanical ventilation
– binary outcome: in-hospital survival or mortality (~13% mortality)
• Only 4,000 labeled records publicly available (“set A”)
– 4,000 unlabeled records (“set B”) used for tuning during the competition (we didn’t use them)
– 4,000 test examples (“set C”) not available
• Very challenging task: temporal outcome, unobserved treatment effects
• Winning entry score: min(Precision, Recall) = 0.5353 (see the definition sketched below)
https://www.physionet.org/challenge/2012/
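For reference, in our notation (not from the original slide), with TP, FP, and FN counted for the mortality (positive) class:
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
Score = min(Precision, Recall)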
36. PhysioNet Challenge 2012: predict in-hospital mortality from observations x_1, x_2, x_3, …, x_T during the first 48 hours of the ICU stay.
Solution: recurrent neural network (RNN)*
p(y_mort = 1 | x_1, x_2, x_3, …, x_T) ≈ p(y_mort = 1 | s_T), with s_t = f(s_{t-1}, x_t)
s_t = φ(W s_{t-1} + U x_t + b)
y_t = σ(V s_t + c)
• Efficient parameterization: s_t represents an exponential number of states relative to the number of nodes
• Can encode (“remember”) longer histories
• During learning, pass future information backward via backpropagation through time
[Figure: RNN unrolled in time, with states s_0, s_1, s_2, …, s_T, inputs x_1, x_2, …, x_T, and outputs y_1, y_2, …, y_T]
* We actually use a long short-term memory (LSTM) network; the standard cell equations are sketched below
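For completeness, the standard (non-peephole) LSTM cell equations, which replace the simple recurrence s_t = φ(W s_{t-1} + U x_t + b) with gated updates over a memory cell c_t and hidden state h_t, are:
i_t = σ(W_i x_t + U_i h_{t-1} + b_i)    (input gate)
f_t = σ(W_f x_t + U_f h_{t-1} + b_f)    (forget gate)
o_t = σ(W_o x_t + U_o h_{t-1} + b_o)    (output gate)
g_t = tanh(W_g x_t + U_g h_{t-1} + b_g) (candidate update)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ g_t
h_t = o_t ⊙ tanh(c_t)
DL4J's GravesLSTM layer, used later in the talk, follows Alex Graves' variant of this formulation.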
37. Outline
• Machine (and deep) learning
• Sequence learning with recurrent neural networks
• Clinical sequence classification using LSTM RNNs
• A real world case study using DL4J
• Conclusion and looking forward
38. PhysioNet Raw Data
• set-a
– Directory of single files, one file per patient
– 48 hours of ICU data
• Format
– Header line
– 6 descriptor values at time 00:00 (collected at admission)
– 37 irregularly sampled columns over the 48 hours
Example record (a minimal parsing sketch follows the sample):
Time,Parameter,Value
00:00,RecordID,132601
00:00,Age,74
00:00,Gender,1
00:00,Height,177.8
00:00,ICUType,2
00:00,Weight,75.9
00:15,pH,7.39
00:15,PaCO2,39
00:15,PaO2,137
00:56,pH,7.39
00:56,PaCO2,37
00:56,PaO2,222
01:26,Urine,250
01:26,Urine,635
01:31,DiasABP,70
01:31,FiO2,1
01:31,HR,103
01:31,MAP,94
01:31,MechVent,1
01:31,SysABP,154
01:34,HCT,24.9
01:34,Platelets,115
01:34,WBC,16.4
01:41,DiasABP,52
01:41,HR,102
01:41,MAP,65
01:41,SysABP,95
01:56,DiasABP,64
01:56,GCS,3
01:56,HR,104
01:56,MAP,85
01:56,SysABP,132
…
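A minimal sketch of reading one such set-a file into (time, parameter, value) records in plain Java; the class and method names here are illustrative only and are not part of the DL4J/Canova pipeline used in the talk:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class PhysioNetRecordReader {
    // One (time, parameter, value) measurement from a set-a patient file
    static class Measurement {
        final String time;      // "HH:MM" offset from ICU admission
        final String parameter; // e.g., "HR", "pH", "MechVent"
        final double value;
        Measurement(String time, String parameter, double value) {
            this.time = time;
            this.parameter = parameter;
            this.value = value;
        }
    }

    // Read a single patient file, skipping the "Time,Parameter,Value" header line
    static List<Measurement> read(String path) throws IOException {
        List<Measurement> measurements = new ArrayList<>();
        List<String> lines = Files.readAllLines(Paths.get(path));
        for (String line : lines.subList(1, lines.size())) {
            String[] fields = line.split(",");
            if (fields.length != 3) continue; // skip malformed lines
            measurements.add(new Measurement(fields[0], fields[1], Double.parseDouble(fields[2])));
        }
        return measurements;
    }
}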
39. Preparing Input Data
• Input is a 3D tensor
– Mini-batch as the first dimension
– Feature columns as the second dimension
– Timesteps as the third dimension
• With a mini-batch size of 20, 43 columns, and 202 timesteps
– We have 20 × 43 × 202 = 173,720 values per input tensor (see the ND4J sketch below)
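A minimal ND4J sketch (ours, not from the talk) of allocating and filling a feature tensor with that shape:

import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.factory.Nd4j;

public class FeatureTensorExample {
    public static void main(String[] args) {
        // DL4J RNN feature layout: [miniBatch, featureColumns, timeSteps] = [20, 43, 202]
        INDArray features = Nd4j.zeros(new int[] {20, 43, 202});
        // Example: set feature column 0 (e.g., albumin) at timestep 2 for the first example in the batch
        features.putScalar(new int[] {0, 0, 2}, 0.5);
        // 20 * 43 * 202 = 173,720 values in total
        System.out.println(features.length());
    }
}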
40. A Single Training Example
Vector columns × timesteps:
             0    1    2    3    4   …
albumin    0.0  0.0  0.5  0.0  0.0
alp        0.0  0.1  0.0  0.0  0.2
alt        0.0  0.0  0.0  0.9  0.0
ast        0.0  0.0  0.0  0.0  0.4
…
Vector columns with single values:
albumin  0.0
alp      1.0
alt      0.5
ast      0.0
…
A single training example gets the added dimension of timesteps for each column.
42. Uneven Time Steps and Masking
Single input (columns + timesteps):
             0    1    2    3    4   …
albumin    0.0  0.0  0.5  0.0  0.0
alp        0.0  0.1  0.0  0.0  0.0
alt        0.0  0.0  0.0  0.9  0.0
ast        0.0  0.0  0.0  0.0  0.0
…
Input mask (timesteps only; see the masking sketch below):
1.0  1.0  1.0  1.0  0.0  0.0
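A minimal ND4J sketch (ours, not from the talk) of building such a per-example timestep mask, where 1.0 marks a real timestep and 0.0 marks padding; DL4J can consume masks of this shape alongside the feature tensor:

import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.factory.Nd4j;

public class MaskExample {
    public static void main(String[] args) {
        int miniBatch = 20;
        int maxTimeSteps = 202;
        // One mask row per example in the mini-batch: shape [miniBatch, timeSteps]
        INDArray featuresMask = Nd4j.zeros(miniBatch, maxTimeSteps);
        // Suppose the first example has only 4 recorded timesteps; mark them as valid
        for (int t = 0; t < 4; t++) {
            featuresMask.putScalar(new int[] {0, t}, 1.0);
        }
        System.out.println(featuresMask.getRow(0));
    }
}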
43. DL4J
• “The Hadoop of Deep Learning”
– Command line driven
– Java, Scala, and Python APIs
– ASF 2.0 licensed
• Java implementation
– Parallelization (YARN, Spark)
– GPU support, including multiple GPUs per host
• Runtime neutral
– Local
– Hadoop / YARN + Spark
– AWS
• https://github.com/deeplearning4j/deeplearning4j
44. RNNs in DL4J
// Network configuration: two GravesLSTM layers followed by a softmax output layer.
// learningRate, lstmLayerSize, nOut, iter, max_epochs, and dataset_iter are defined elsewhere.
MultiLayerConfiguration conf = new NeuralNetConfiguration.Builder()
    .optimizationAlgo(OptimizationAlgorithm.STOCHASTIC_GRADIENT_DESCENT).iterations(1)
    .learningRate(learningRate)
    .rmsDecay(0.95)
    .seed(12345)
    .regularization(true)
    .l2(0.001)
    .list(3)
    .layer(0, new GravesLSTM.Builder().nIn(iter.inputColumns()).nOut(lstmLayerSize)
        .updater(Updater.RMSPROP)
        .activation("tanh").weightInit(WeightInit.DISTRIBUTION)
        .dist(new UniformDistribution(-0.08, 0.08)).build())
    .layer(1, new GravesLSTM.Builder().nIn(lstmLayerSize).nOut(lstmLayerSize)
        .updater(Updater.RMSPROP)
        .activation("tanh").weightInit(WeightInit.DISTRIBUTION)
        .dist(new UniformDistribution(-0.08, 0.08)).build())
    .layer(2, new RnnOutputLayer.Builder(LossFunction.MCXENT).activation("softmax")
        .updater(Updater.RMSPROP)
        .nIn(lstmLayerSize).nOut(nOut).weightInit(WeightInit.DISTRIBUTION)
        .dist(new UniformDistribution(-0.08, 0.08)).build())
    .pretrain(false).backprop(true)
    .build();

// Build and initialize the network, then train for max_epochs passes over the data
MultiLayerNetwork net = new MultiLayerNetwork(conf);
net.init();
for (int epoch = 0; epoch < max_epochs; ++epoch)
    net.fit(dataset_iter);
45. Experimental Results
• Winning entry: min(P, R) = 0.5353 (two others scored over 0.5)
– Trained on full set A (4K), tuned on set B (4K), tested on set C
– All used extensively hand-engineered features
• Our best model so far: min(P, R) = 0.4907
– 60/20/20 training/validation/test split of set A
– LSTM with 2 × 300-cell layers on the inputs
– Different test sets, so not directly comparable
– Disadvantage: much smaller training set
– Required no feature engineering or domain knowledge
46. Map sequences into a fixed vector representation
• Not perfectly separable in 2D, but some cluster structure related to mortality
• Can repurpose the “representation” for other tasks (e.g., searching for similar patients, clustering, etc.)
47. Final comments
• We believe we could improve performance to well over 0.5
– Overfitting: training min(P, R) > 0.6 (vs. 0.49 on test)
– Try smaller or simpler RNN layers, adding dropout, multitask training
• Flexible NN architectures are well suited to complex clinical data
– but will likely demand much larger data sets
– may be better matched to “raw” signals (e.g., waveforms)
• More general challenges
– missing (or unobserved) inputs and outcomes
– treatment effects confound predictive models
– outcomes often have temporal components (posing the problem as binary classification ignores that)
• You can try it out: https://github.com/jpatanooga/dl4j-rnn-timeseries-examples/
• See the related paper to appear at ICLR 2016: http://arxiv.org/abs/1511.03677
48. Questions?
Thank you for your time and attention
Gibson & Patterson. Deep Learning: A Practitioner’s Approach. O’Reilly, Q2 2016.
Lipton et al. A Critical Review of RNNs. arXiv.
Lipton & Kale. Learning to Diagnose with LSTM RNNs. ICLR 2016.
49. Sepp Hochreiter
Father of LSTMs,* renowned beer thief
S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9 (8): 1735-1780, 1997.
Editor’s notes
But how do we produce functions? We need a function for that
Neural network is a graphical model of computation.
Graph composed of “nodes” and “edges”
which informally model “neurons” and “synapses”
A little computation takes place at each node.
For a node j:
First, calculate a linear combination of the outputs from neurons connected by an incoming edge; we call this value “a”, the input activation.
Then apply a (usually nonlinear) activation function; in this example we show the logistic (sigmoid) function.
Some examples of activation functions are the sigmoidal units (sigmoid, hyperbolic tangent) and the rectifier (which learns faster because it has a more substantially nonzero derivative).
Simple training by stochastic gradient descent:
First, randomly sample an example from the dataset.
Second, calculate the derivative of some objective/loss function on that example.
Third, update the weights to minimize the loss on that example.
Forward pass: first we set the value of each input node from the input vector, then we compute the output prediction and calculate the loss with respect to the true label.
Then we calculate the delta value for each node, where delta is the derivative of the loss with respect to that node's linear input.
Image classification is a case where this has been dramatically successful.
Each image is provided without context (think Facebook upload) and is assigned to accurate object categories.
It’s harder to imagine how someone might represent an arbitrarily sized document as a fixed length vector.
A recurrent net is like a feedforward neural network but augmented by the inclusion of recurrent edges.
At any given “time step” computation is feedforward, but recurrent edges span adjacent time steps.
Can view the recurrent neural network as a deep network with an output at each layer and weight tying across time steps.
Each hidden layer depends both on the input and the previous state’s hidden layer.
Notice that it’s essentially a feedforward network in this view.
With a ReLU activation in the hidden node, the effect of an input on the output decays if the weight on the recurrent edge is small and explodes if it is large.
LSTM cells are composed to form a network in the same way as ordinary hidden nodes in normal RNNs.
All patients were adults who were admitted for a wide variety of reasons to cardiac, medical, surgical, and trauma ICUs. ICU stays of less than 48 hours have been excluded.
Up to 42 variables were recorded at least once during the first 48 hours after admission to the ICU. Not all variables are available in all cases, however. Six of these variables are general descriptors (collected on admission), and the remainder are time series, for which multiple observations may be available.
One possible solution: recurrent neural nets, which combine Markov structure with hidden states consisting of learned latent features.
Real valued, distributed states can encode many more states/histories for the same number of nodes (vs. discrete hidden states).
Learning via backpropagation through time.
No alignment attempted per timestep across records, just indexing each recorded timestep (simpler way to find long term dependencies)
Alternative was: (60sec) x (60min) x (48h) == 172,800 timesteps (not easy to model)
Dataset statistics inform the vectorization process for zero-mean, unit-variance (ZMUV) scaling and normalization.