COMP7404 AI Group Project – 15 Apr 2018 – v2.1
1. A BRIEF INTRODUCTION TO THE GATED RECURRENT UNIT
COMP7404 – GROUP U
Image: Illustration of gated recurrent unit. From Chung et al., 2014:
https://arxiv.org/abs/1412.3555
2. AGENDA
1) A Quick Recap on Deep Learning Architectures
① Standard Neural Network (NN)
② Recurrent Neural Network (RNN)
2) A Deep Dive into the Gated Recurrent Unit (GRU)
3) Rainfall Project Overview
① Competition Overview
② Source Data
③ Code and Demo: Rainfall Prediction
4) Q&A
3. LIMITATIONS OF STANDARD NEURAL
NETWORK
Source: A Critical Review of Recurrent Neural Networks for Sequence Learning, Lipton et al., 2015; Simulation of Neural Networks (1st ed., in German), Zell, Andreas, 1994
Major constraints:
Inputs and outputs of fixed length; not efficient with sequential data
Does not exploit previously learned features well
Image: Coursera: Sequence Models, Andrew Ng
Image: Wrist mounted device sleep graph, Lucid Dreaming App
Example (machine translation): « Est-ce que vous êtes prêt ? » → "Are you ready?"
4. RECURRENT NEURAL NETWORK (RNN)
Source: Understanding LSTM Networks, Olah, 2015; A Critical Review of Recurrent Neural Networks for Sequence Learning, Lipton et al., 2015; The Unreasonable Effectiveness of Recurrent Neural Networks, Karpathy, 2015
Has loops to deal with sequential data
Can handle vectors of variable length
Source: Predicting E-commerce Consumer Behavior using Recurrent Neural Networks (accessed 10/04/2018)
Figure 2 shows some output h; we call this the hidden state of the cell. So, what exactly are cells? Let's have a quick look at the two cell architectures mentioned above: LSTM and GRU. We will be testing both architectures when training the models.
Long Short-Term Memory (LSTM)
The Long Short-Term Memory (LSTM) cell, like the basic cell, computes a new state and an output from an input and the previous state. The (hidden) state of the LSTM is split into two vectors, one for the short-term state and one for the long-term state. Figure 3 illustrates the architecture of the LSTM cell. In figure 3, c is the long-term state and h is the short-term state. The LSTM receives both the short-term and long-term states from previous time steps [5].
Figure 2: Cell folded/unfolded. Inspired by Aurélien Géron.
Recurrent architecture
# Update the hidden state
h_t = f( W^(x) · x_t + V^(h) · h_{t−1} )
# Compute the output vector
y_t = g( W^(y) · h_t )
Notations
x_t = input at time t
h_t = hidden state at time t
y_t = output at time t
f = activation function, e.g. tanh
g = activation function for the output, e.g. a sigmoid
W^(x), V^(h), W^(y) = parameters
One-to-one (like a standard NN): image classification
Many-to-one: sentiment analysis, video recognition
Many-to-many: text translation
One-to-many: music generation
5. LIMITATIONS WITH RNNs
Unable to handle "long-term dependencies" well in practice ("vanishing gradient")
Example text: It was a handwritten application from Steve Jobs for jobs at HP. We use online applications for jobs nowadays. What do jobs and careers mean for college graduates? Those who have read the biography of Jobs may have a different viewpoint …
Loops make the path longer and the derivative more complicated to calculate
Solution:
Use a more sophisticated architecture which allows for a shorter path and less multiplication to calculate the gradient.
6. Learns how to keep memories for long-distance dependencies
Avoids the vanishing gradient problem
Equations
Different diagrams appear in the literature:
Update gate: z_t = f( W^(z) · x_t + V^(z) · h_{t−1} )
Reset gate: r_t = f( W^(r) · x_t + V^(r) · h_{t−1} )
Candidate memory: h̃_t = g( W^(h) · x_t + V^(h) · ( r_t ∗ h_{t−1} ) )
Memory to transmit: h_t = z_t ∗ h_{t−1} + (1 − z_t) ∗ h̃_t
(with this convention, z_t close to 1 keeps the previous state)
Gated Recurrent Unit (GRU)
Source: Understanding GRU Networks, Towards Data Science: https://towardsdatascience.com/understanding-gru-networks-2ef37df6c9be
First, let's introduce the notation. If you are not familiar with this terminology, we recommend tutorials on the "sigmoid" and "tanh" functions and the "Hadamard product" operation.
#1. Update gate
We start by calculating the update gate z_t for time step t using the formula given above.
Gated Recurrent Unit
Source: Predicting E-commerce Consumer Behavior using Recurrent Neural Networks: https://blog.nirida.ai/predicting-e-commerce-consumer-behavior-using-recurrent-neural-networks-36e37f1aed22
Figure 4 illustrates the GRU cell. In the GRU, both state vectors of the LSTM have been merged into a single vector, h. Instead of three gate controllers, the GRU uses two: one controls both the forget gate and the input gate, and there is no output gate. The full state vector is the output at every time step.
Figure 4: GRU cell. Inspired by Christopher Olah.
Example: "Jobs is founder of a company, which company has its headquarter … . Jobs …"
z_t = 1 ⇒ h_t = 1 … … … … h_t = 1 (the update gate carries the "Jobs" information forward)
Source: Cho et al. 2014, Coursera: Sequence Models, Andrew Ng
7. COMPETITION: HOW MUCH DID IT RAIN?
https://www.kaggle.com/c/how-much-did-it-rain-ii/data
Our solution
o Goal: showcase the application of GRU
o Methodology: train GRU using the radar snapshot data in
order to predict the rainfall
o Tools: Keras
Data preprocessing challenges
o Irregular radar measurement times
o Outliers
o Overfitting: training data and test data may not be fully independent
Source: www.kaggle.com/c/how-much-did-it-rain-ii/data
Scripts and sources (incl. image, script quotation, etc.) to be updated / Danny
8. TRAINING DATA
Training data sample:
Source: www.kaggle.com/c/how-much-did-it-rain-ii/data
9. DATA PREPROCESSING AND TESTING
Source: www.kaggle.com/c/how-much-did-it-rain-ii/data
11. DEMO
Source: []
Items to be included in GitHub uploads:
• Demo with source and all dependencies and detailed instructions in
markdown format on how to run the demo to be uploaded as a single
zip file.
• Please ensure that the instructions on how to run the demo are sufficiently detailed. You won't be able to get a good grade if we are not able to run your demo.
Items to be included in this PPT:
• The presentation must include this live demo. If a live demo is not
possible, a video demo can be presented. The file size limit for this zip
file is 100MB.
• A link to the GitHub version. The link should be included as a QR code in this PPT.
Slides, scripts and sources
(incl. image, script
quotation, codes, etc.) to be
updated / Paul
A major limitation of the standard neural network we learnt in class is that, in general, it only works with fixed-length vectors: the sizes of the input and the output must be fixed in advance. This means a standard network is not efficient with variable-length data such as sequences of text, videos, sounds, or the time series found in medical, financial, or industrial data.
Another limitation is that a standard neural network is not good at remembering features learned previously. For example, it would not work well if you wanted to build a model that learns to recognize names in text: "Jobs" may have appeared as a last name earlier in the text, and you would like the model to recognize "Jobs" as a name when it appears again later.
A recurrent neural network has a more sophisticated architecture that addresses these limitations. The main idea is that it introduces loops which allow information to persist between time steps.
The flexibility given by the loops allows for working with sequential data of various lengths. A few examples: one-to-one, one-to-many, many-to-one, many-to-many.
This diagram will help us understand how a recurrent model works:
When x_t enters the network unit, it is multiplied by its own weight W^(x). This step should be familiar, as we learnt this function in class already. The improvement from the loops appears in the second part of the equation: h_{t−1}, which holds the information from the previous time steps, multiplied by its own weight V^(h). Both results are added together and squashed by an activation function. (The output of the unit is y_t, which is also calculated by a nonlinear function, usually a sigmoid; this is simply a way to keep the result between 0 and 1 and interpret it as a probability.)
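The recurrent step just described can be sketched in a few lines of NumPy (our own illustration, not code from the competition; the names rnn_step, Wx, Vh, Wy are ours, mirroring the slide's W^(x), V^(h), W^(y)):

```python
import numpy as np

def rnn_step(x_t, h_prev, Wx, Vh, Wy):
    """One RNN time step: update the hidden state, then compute the output."""
    h_t = np.tanh(Wx @ x_t + Vh @ h_prev)    # f = tanh on input + recurrence
    y_t = 1.0 / (1.0 + np.exp(-(Wy @ h_t)))  # g = sigmoid on the output
    return h_t, y_t

rng = np.random.default_rng(0)
n_in, n_hid, n_out = 3, 4, 2
Wx = rng.normal(size=(n_hid, n_in))
Vh = rng.normal(size=(n_hid, n_hid))
Wy = rng.normal(size=(n_out, n_hid))

h = np.zeros(n_hid)
for x in rng.normal(size=(5, n_in)):   # a length-5 input sequence
    h, y = rnn_step(x, h, Wx, Vh, Wy)  # the same weights are reused each step
```

Note how the same three weight matrices are reused at every time step, whatever the sequence length.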
Remark: unlike feed-forward neural networks, which have different parameters at each layer, an RNN shares the same parameters (W^(x), V^(h), W^(y)) across all time steps.
It is difficult to train an RNN model. A key problem is the vanishing gradient. We have learnt in class how to minimize the loss function using gradient descent. However, with the loops of an RNN this method can cause problems: backpropagation now goes through the layers as well as through each time step, which makes the derivative calculation much more complex because of the compounding effect. As a result, the long-term dependencies end up dominated by the short-term dependencies. We will not cover this in full detail given the complexity of the topic; instead, we attached some links for further reference.
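A toy numerical illustration of that compounding effect (ours, for intuition only): backpropagating through T time steps multiplies roughly T per-step factors together, so a factor even slightly below 1 shrinks the gradient exponentially.

```python
# Toy illustration of compounding in backpropagation through time: the
# gradient reaching a state T steps back is scaled by a product of T
# per-step factors (here a constant 0.9, a simplifying assumption).
factor = 0.9
scales = {T: factor ** T for T in (10, 50, 100)}
# After 100 steps the scale falls below 1e-4: the long-term signal has
# vanished, while recent (short-term) contributions dominate the gradient.
```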
Let's use the same "Jobs" example to illustrate the idea. "Jobs" as a last name could occur at the beginning of a very long sentence. Although an RNN should be able to identify the word "Jobs" in the latter part of the sentence, in practice it might fail because of the vanishing gradient, as that occurrence is too far away from the beginning.
How do we deal with these problems? Many techniques have been proposed; we'll explain one of them today. The main idea is to change the architecture, using a more sophisticated activation unit to create a shorter path and avoid a vanishing gradient.
The new structure presented here is called the GRU (Gated Recurrent Unit). It is a less complex version of another architecture, the LSTM (Long Short-Term Memory), which was introduced by another group. Incidentally, it has been found that both structures perform about equally well in most cases.
Most papers we read take a lot of space to explain the GRU clearly with their own diagrams. We are unlikely to explain the GRU in full detail in a few minutes given its complexity; instead, we attached links with explanations for further reference, and we will focus on the equations.
The main idea is to add two more vectors to the standard RNN, called the update gate and the reset gate, which decide how much relevant information to keep and how much irrelevant information to skip or delete from the past. These gates help the network learn long-term dependencies and avoid the vanishing gradient.
So how do the GRU and its gates work?
The update gate and the reset gate each have an equation similar to the standard RNN's: a nonlinear function (usually a sigmoid) of a linear combination of the new input x_t and the previous information h_{t−1}.
The difference between the two gates lies in their weights and in the way they are used: we want the update gate to determine how much information from the previous time step is passed on, and the reset gate to determine how much information is skipped.
How do we do that?
We first create a memory vector h̃_t (h-t-hat) to store relevant information from the past through the reset gate. When the reset gate is close to 0, h_{t−1} will be skipped.
Then we combine the memory vector h̃_t and the update gate to calculate the vector h_t, which contains the information to be transmitted through the network.
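The four GRU equations can be sketched directly in NumPy (our own illustration; the names are ours, and we follow the convention used in our examples where z_t close to 1 keeps the previous state — some references, e.g. Cho et al. 2014, swap the roles of z_t and 1 − z_t):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, Wz, Vz, Wr, Vr, Wh, Vh):
    """One GRU time step (f = sigmoid for the gates, g = tanh)."""
    z = sigmoid(Wz @ x_t + Vz @ h_prev)              # update gate
    r = sigmoid(Wr @ x_t + Vr @ h_prev)              # reset gate
    h_tilde = np.tanh(Wh @ x_t + Vh @ (r * h_prev))  # candidate memory
    return z * h_prev + (1.0 - z) * h_tilde          # memory to transmit

rng = np.random.default_rng(1)
n_in, n_hid = 3, 4
W = lambda: rng.normal(size=(n_hid, n_in))   # input-to-hidden weights
V = lambda: rng.normal(size=(n_hid, n_hid))  # hidden-to-hidden weights
params = (W(), V(), W(), V(), W(), V())      # Wz, Vz, Wr, Vr, Wh, Vh

h = np.zeros(n_hid)
for x in rng.normal(size=(5, n_in)):  # a length-5 input sequence
    h = gru_step(x, h, *params)
```

When z is 1, the line `z * h_prev + (1.0 - z) * h_tilde` simply copies the previous state forward, which is how relevant information can travel over long distances.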
Let’s take our name example again:
The relevant information for recognizing "Jobs" as a name appears at the beginning of the sequence. The model will learn to keep the update gate z_t close to 1, so h_t retains the previous information almost fully, while 1 − z_t will be close to 0 and the current irrelevant information will be skipped.
If the irrelevant information is at the beginning, the model will learn to set the reset vector close to 0 in order to skip that irrelevant information from the past.
It is remarkable that the model can learn what to transmit and what to reset using the two gates. The beauty of these simple equations is that they are extremely powerful yet not hard to implement.
Now it's time to apply this GRU idea to our rain problem.
Rainfall measurements are very important in the agricultural field.
In the old days, rain gauges were used to measure the rainfall each hour. However, though they can measure rainfall at a specific location accurately, rainfall differs from one location to another.
In order to have widespread coverage, weather radars are nowadays used to measure the hourly rainfall, and the technology keeps improving. One type of radar, the polarimetric radar, can provide higher-quality data than conventional Doppler radars because it transmits radio-wave pulses with both horizontal and vertical orientations.
In this competition, data from the U.S. National Weather Service are used. We are asked to predict the hourly rain gauge reading from a set of snapshots of radar values obtained in the same hour.
The data was collected in the midwestern US during the five-month corn-growing season from April to August 2014.
The data from the first 20 days of each month is used for training our model. Each record consists of snapshots of radar values obtained within an hour and the corresponding hourly gauge reading. The data from the remaining 10 or 11 days of each month forms the test set.
With the training data, we train our GRU model.
With the test data, we evaluate our GRU model and predict the "Expected" rainfall value.
However, when analyzing the data source, we found that we face three major challenges:
Irregular radar measurement times (the time series of observations is not regular);
Outliers (noisy records in the training data);
Overfitting (within the same year, we cannot say the rainfall in the first 20 days of a month is unrelated to the rainfall in the rest of that month, so the training data and test data may not be fully independent).
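A minimal pandas sketch (ours, not the actual competition code) of how the first two challenges could be handled: group snapshots by gauge-hour Id, zero-pad each irregular-length hour to a fixed number of steps, and drop outlier gauge readings. The column names Id, minutes_past, radardist_km, and Expected follow the Kaggle data description; the 164 mm cutoff and the 19-step length are illustrative assumptions.

```python
import numpy as np
import pandas as pd

MAX_STEPS = 19       # assumed fixed sequence length for padding
OUTLIER_MM = 164.0   # assumed cutoff; the sample solution stays below ~164 mm

def make_sequences(df, feature_cols):
    """Group snapshots by gauge-hour Id, drop outlier labels,
    and zero-pad each irregular-length sequence to MAX_STEPS."""
    X, y = [], []
    for gauge_id, grp in df.groupby("Id"):
        label = grp["Expected"].iloc[0]
        if label > OUTLIER_MM:          # challenge 2: outliers
            continue
        seq = grp.sort_values("minutes_past")[feature_cols].to_numpy()
        seq = seq[:MAX_STEPS]           # challenge 1: irregular lengths
        pad = np.zeros((MAX_STEPS - len(seq), len(feature_cols)))
        X.append(np.vstack([pad, seq]))
        y.append(label)
    return np.stack(X), np.array(y)

# Tiny made-up example in the training-data format
df = pd.DataFrame({
    "Id": [1, 1, 2, 2, 2, 3],
    "minutes_past": [3, 16, 1, 6, 11, 5],
    "radardist_km": [2.0, 2.0, 10.0, 10.0, 10.0, 5.0],
    "Expected": [1.016, 1.016, 0.254, 0.254, 0.254, 32740.0],
})
X, y = make_sequences(df, ["minutes_past", "radardist_km"])
```

The fixed-shape array X can then be fed to a recurrent model, with the 32740 mm outlier hour excluded.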
Let's see the next slides.
Let me say more about the training data.
As we can see here, there are 24 columns in total:
The first column, Id, represents the hour of the gauge observation.
The second column is the minute of the gauge observation within that hour. (Here we can see the first challenge mentioned earlier, irregular radar measurement times: for hour one, the observations are collected at minutes 3, 16, 25, 35, and so on, around 6 times; in hour two, they are collected at minutes 1, 6, 11, 16, and so on, around 12 times.)
The third column is the distance from the radar to the gauge (km).
The other columns are the various parameters collected by the polarimetric radar.
The last column is the volume of rainfall (in mm) at that gauge at the end of that hour.
For example, the row highlighted in red is the observation at the first minute of the second hour; the distance from the radar to the observation point is 2 kilometers, and the volume of rainfall at the end of the second hour is 1.016 mm.
Let's go to the next slide to see the challenges and what we need to do.
We are required to generate a file with two columns (as you can see from the snapshot of the sample solution data provided by the organizer on the right-hand side):
Column "Id": a unique number for the set of observations over an hour at a gauge.
Column "Expected": the actual gauge observation in mm at the end of the hour.
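Producing that two-column file can be sketched with pandas (our own minimal illustration; the ids and values below are placeholders — in the real pipeline "Expected" comes from the trained GRU's predictions):

```python
import pandas as pd

# ids would come from the test set and predicted_mm from the trained
# model; the values below are placeholders for illustration only.
ids = [1, 2, 3]
predicted_mm = [1.02, 0.25, 4.70]

submission = pd.DataFrame({"Id": ids, "Expected": predicted_mm})
submission.to_csv("submission.csv", index=False)  # the required format
```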
On the left-hand side you can see a snapshot of the training data; it contains many outliers, which is the second challenge mentioned earlier.
From the sample solution data, we can see that the acceptable or reasonable volume of rainfall is below 164 mm.
However, the training data contains unexpected values such as 32740 mm, which does not make sense.
A snapshot of the test data is shown below. It has the same format but is 'missing' the "Expected" column.
We are required to predict the gauge observation at the end of each hour for it.
My teammate Paul will introduce our code and our results…