Digital Transformation starts with data. What if a solution existed that put data at the center, in a single place, serving all applications around it? This training will include a demonstration on a distributed data-centric platform that provides a data intelligence layer, composed of artificial intelligence models able to make use of a whole company’s data.
Nowadays, one of the most innovative techniques in artificial intelligence is deep neural networks. Among the many applications, language modelling, machine translation and image generation are receiving particular attention. Deep nets are also powerful in predictive modelling domains such as stock pricing and the energy industry. We will address a few case studies modeled with TensorFlow, running on Stratio’s data-centric product in a distributed cluster.
By: Fernando Velasco
18. Environment Summary
Multiuser environment: manages users and provisions notebooks.
Analytic environment: each user (User 1 ... User N) has their own front-end and an independent back-end that runs that user's code.
20. Distribution strategies: Data vs. Model Parallelism
When splitting the training of a neural network across multiple
compute nodes, two strategies are commonly employed:
● Data parallelism: individual instances of the model are
created on each node and fed different training samples; this
allows for higher training throughput.
● Model parallelism: a single instance of the model is split across multiple nodes, allowing larger models, ones which may not necessarily fit in the memory of a single node, to be trained.
● Mixed: if desired, these two strategies can also be composed, resulting in multiple instances of a given model, with each instance spanning multiple nodes (a data-parallel sketch follows below).
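As a minimal illustration of data parallelism (a sketch only, using the current tf.distribute API rather than whatever the original demo used; the model, shapes and data below are made up):

```python
import numpy as np
import tensorflow as tf

# Data parallelism sketch: one replica of the model per available device,
# each replica sees a different shard of every batch (illustrative data only).
strategy = tf.distribute.MirroredStrategy()

with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(20,)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")

x = np.random.rand(1024, 20).astype("float32")
y = np.random.rand(1024, 1).astype("float32")
model.fit(x, y, batch_size=64, epochs=1)  # the global batch is split across replicas
```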
21. Distributed Computation Synchrony
There are many ways to specify distributed structure in TensorFlow. Possible approaches include:
Asynchronous training: In this approach, each replica of the graph has an independent training loop that executes without coordination. It is compatible with both forms of replication above.
Synchronous training: In this approach, all of the replicas read the same values for the current parameters, compute gradients in parallel, and then apply them together. It is compatible with in-graph replication (e.g. using gradient averaging) and with between-graph replication (a between-graph sketch follows below).
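And a minimal sketch of the between-graph, parameter-server pattern from the TF 1.x era the deck targets (hostnames, ports and the tiny model are placeholders, not the original demo):

```python
import tensorflow as tf  # TF 1.x-style distributed API, as used at the time of the deck

# Placeholder cluster definition: one parameter server, two workers.
cluster = tf.train.ClusterSpec({
    "ps": ["ps0.example.com:2222"],
    "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
})
server = tf.train.Server(cluster, job_name="worker", task_index=0)

# Between-graph replication: every worker runs this script and builds its own graph;
# replica_device_setter pins the variables to the ps task, ops stay on the worker.
with tf.device(tf.train.replica_device_setter(cluster=cluster)):
    x = tf.placeholder(tf.float32, [None, 20])
    y = tf.placeholder(tf.float32, [None, 1])
    w = tf.get_variable("w", [20, 1])
    loss = tf.reduce_mean(tf.square(tf.matmul(x, w) - y))
    train_op = tf.train.GradientDescentOptimizer(0.01).minimize(loss)
# Each worker then runs its own loop against server.target, so updates arrive asynchronously.
```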
36. Activation functions: Outputs
● Linear
● Binomial: sigmoid
● Multinomial: softmax
Activation Functions
39. Sigmoid and ReLU functions
ReLU:
- Sparse activation
- Efficient computation
- “Differentiable”
- Unbounded
- Potential dying ReLU
- Convolution-friendly
Sigmoid:
- Bounded
- Probability-like function
- Dense computation
- Differentiable
- Used in many examples of fully connected layers
We are too cool to speak about linear activators, aren’t we? Not entirely... (a small sketch of both functions follows below)
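To make the comparison concrete, a small NumPy sketch of both activations (purely illustrative):

```python
import numpy as np

def sigmoid(x):
    # bounded in (0, 1), probability-like, differentiable everywhere, dense output
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    # unbounded above, sparse (exact zeros for x < 0), not differentiable at 0
    return np.maximum(0.0, x)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(sigmoid(z))  # all values strictly between 0 and 1
print(relu(z))     # negatives clipped to 0 (this is what can "kill" a ReLU unit)
```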
43. On the ease of Derivations
● Sigmoid
● Hyperbolic Tangent
● ReLU
● Softmax
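The derivative formulas appear as images in the original slides; the standard forms, recalled here for reference, are:
$\sigma'(x) = \sigma(x)\,(1-\sigma(x))$
$\tanh'(x) = 1 - \tanh^2(x)$
$\mathrm{ReLU}'(x) = 1$ for $x > 0$, $0$ for $x < 0$ (undefined at $x = 0$)
$\partial\,\mathrm{softmax}_i(z)/\partial z_j = \mathrm{softmax}_i(z)\,(\delta_{ij} - \mathrm{softmax}_j(z))$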
45. Loss Functions
46. Regression error
● The most classic measure; penalizes big mistakes heavily; less interpretable
● Scale invariant; symmetric; interpretable; harder differentiability and convergence
● Penalizes big mistakes less; interpretable; harder differentiability and convergence
47. Regression error
The choice is always problem-dependent (a small numeric sketch follows).
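As a purely illustrative sketch of why the choice is problem-dependent (MSE and MAE are the usual candidates; the numbers are made up), a single large error affects the two measures very differently:

```python
import numpy as np

def mse(y_true, y_pred):
    # quadratic: a single large error dominates the total
    return np.mean((y_true - y_pred) ** 2)

def mae(y_true, y_pred):
    # linear: more robust to outliers, but the gradient magnitude is constant
    return np.mean(np.abs(y_true - y_pred))

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_good = np.array([1.1, 2.1, 2.9, 4.0])      # small errors everywhere
y_outlier = np.array([1.0, 2.0, 3.0, 14.0])  # one large miss

print(mse(y_true, y_good), mae(y_true, y_good))        # both small
print(mse(y_true, y_outlier), mae(y_true, y_outlier))  # MSE explodes, MAE stays moderate
```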
48. Cost functions
● Regression:
● Classification:
The shortest way is not always the best one
49. Classification and Categorical Cross-Entropy
● Categorical Cross-Entropy
where index i runs over the examples and j over the classes, the y's are the true labels and the p's their assigned probabilities.
On two classes it reduces to the familiar binary cross-entropy.
Compared to accuracy, cross-entropy is a more granular way to measure error, since it takes into account how close a prediction is to the true label.
Its derivative also simplifies the calculus compared with RMSE.
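The formula itself is shown as an image in the original slides; in the notation defined above it is the standard categorical cross-entropy (a reconstruction, assuming one-hot labels):
$\mathcal{L} = -\sum_i \sum_j y_{ij}\,\log p_{ij}$
which, for two classes, reduces to the binary cross-entropy
$\mathcal{L} = -\sum_i \left[\, y_i \log p_i + (1-y_i)\log(1-p_i) \,\right]$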
53. Regularization: Norm penalties
● Add a penalty to the loss function (see the sketch below):
● L2:
○ Keeps weights near zero.
○ Simplest one, differentiable.
● L1:
○ Sparse results, feature selection.
○ Not differentiable, slower.
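A minimal Keras sketch of adding a norm penalty to a layer (layer size and the α value are illustrative, and a recent TensorFlow/Keras install is assumed):

```python
from tensorflow import keras
from tensorflow.keras import regularizers

# alpha (the regularization strength) is illustrative: alpha = 0 means no regularization,
# larger alpha means stronger shrinkage of the weights towards zero.
layer = keras.layers.Dense(
    64,
    activation="relu",
    kernel_regularizer=regularizers.l2(1e-4),  # use regularizers.l1(...) for sparse weights
)
```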
54. Regularization: Dropout
● Randomly drop neurons (along with their connections) during training.
● Acts like adding noise.
● Very effective, computationally inexpensive.
● Acts as an ensemble of all the sub-networks generated (see the sketch below).
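A minimal Keras sketch of dropout between two dense layers (architecture and rate are illustrative; dropout is only active at training time):

```python
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Dense(128, activation="relu", input_shape=(20,)),
    keras.layers.Dropout(0.5),   # randomly zeroes 50% of the activations at train time
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
```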
58. Optimization: Challenges
● The difficulty in training neural networks is mainly attributed to the optimization part.
● The number of plateaus, saddle points and local minima grows exponentially with the dimension.
● Classical convex optimization algorithms don’t perform well.
59. Optimization: Batch Gradient Descent
● Goes over the whole training set at every update.
● Very expensive.
● There isn’t an easy way to incorporate new data into the training set.
60. Optimization: Mini-Batch Gradient Descent
● Stochastic Gradient Descent (SGD)
● Randomly sample a small number of examples (a minibatch)
● Estimate the cost function and gradient on it
● Batch size: length of the minibatch
● Iteration: every time we update the weights
● Epoch: one pass over the whole training set
● k = 1 => online learning
● Small batches => regularization effect (a minimal sketch follows below)
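A minimal NumPy sketch of the mini-batch loop, reusing the 1,500-example / batch-size-500 case from the notes (the linear model and learning rate are made up for the illustration):

```python
import numpy as np

X = np.random.rand(1500, 10)
true_w = np.random.rand(10)
y = X @ true_w + 0.01 * np.random.randn(1500)

w = np.zeros(10)
lr, batch_size = 0.1, 500          # 1500 / 500 = 3 iterations per epoch

for epoch in range(20):            # one epoch = one pass over the whole training set
    order = np.random.permutation(len(X))
    for start in range(0, len(X), batch_size):
        batch = order[start:start + batch_size]
        grad = 2 * X[batch].T @ (X[batch] @ w - y[batch]) / batch_size
        w -= lr * grad             # one weight update = one iteration
```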
61. Optimization: Variants
● Momentum:The momentum algorithm accumulates an exponentially decaying moving
average of past gradients and continues to move in their direction.
● AdaGrad: The learning rate is adapted component-wise, and is given by the square root of
sum of squares of the historical.
● RMSProp: modifies AdaGrad to perform better in the non-convex setting by changing the
gradient accumulation into an exponentially weighted moving average
● ADAM(Adaptive Moment): Combination of RMSPROP and momentum.
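In Keras these variants are drop-in optimizer objects; a sketch, with illustrative hyperparameters and assuming a recent TensorFlow/Keras:

```python
from tensorflow import keras

# Each variant is a drop-in replacement; the hyperparameters below are illustrative.
sgd_momentum = keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)
adagrad = keras.optimizers.Adagrad(learning_rate=0.01)
rmsprop = keras.optimizers.RMSprop(learning_rate=0.001)
adam = keras.optimizers.Adam(learning_rate=0.001)  # RMSProp-style scaling + momentum

model = keras.Sequential([keras.layers.Dense(1, input_shape=(10,))])
model.compile(optimizer=adam, loss="mse")
```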
71. Welcome to the jungle!
● Me Tarzan, you Cheetah. Human-friendly interface. User actions are minimized in order to ease the process, isolating users from the backend.
● Territorial behaviors are allowed. Several backends can be used: TensorFlow, CNTK and Theano (poor Theano!), but there is also another interesting property, modularization: every model is a sequence of standalone modules plugged together with as few restrictions as possible, allowing us to fully configure cost functions, optimizers, initializations, activation functions... (see the sketch below).
● Keeps your model herd a-growin’. New modules are simple to add, and existing modules provide ample examples.
● Kaa is our friend. We love Python! It makes the lives of data scientists easier: the code is compact, easier to debug, and allows for ease of extensibility.
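A minimal sketch of that modularity in Keras (every choice below is an interchangeable example, not a prescription):

```python
from tensorflow import keras

# Every piece is a pluggable module: initializer, activation, loss and optimizer
# can each be swapped independently of the rest of the model.
model = keras.Sequential([
    keras.layers.Dense(32, activation="tanh",
                       kernel_initializer="glorot_uniform", input_shape=(10,)),
    keras.layers.Dense(3, activation="softmax"),
])
model.compile(optimizer=keras.optimizers.RMSprop(learning_rate=1e-3),
              loss="categorical_crossentropy",
              metrics=["accuracy"])
```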
73. Analytic Agenda
1. Introduction: why combine models?
2. Boosting & Bagging basics
3. Demo:
○ AdaBoost implementation with binary trees
○ Feature selection with Random Forest
Not all that wander are lost
What do we say to those who think machine translation sucks? Not today!
Any Questions?
Old theory; now is the moment for representation learning.
Two main software modules:
PyStratio is a Python package providing complete access to all SparkML distributed algorithms via PySpark, as well as to Stratio Crossdata.
RStratio is an R package that relies on SparkR and provides wrappers for SparkML distributed algorithms, a feature not supported by the official SparkR releases.
Integration with other distributed libraries such as H2O and Tensorspark.
TensorFlow over Spark.
REST API: it eases and speeds up the process of moving a trained model from development to production environments. The model becomes accessible through a web REST interface, and can be applied to get real-time predictions or run on massive batch data.
Distributed cluster: each user launches their own environment via JupyterHub, and each notebook runs independently.
The data is kept locally to ease access and because it is small; accessing a full datastore would be more data-centric.
We will execute an algorithm as simple as approximating Pi via Monte Carlo. Each node executes its tasks independently. It is in-graph replication and it is synchronous. The graph can be viewed via TensorBoard (a rough sketch follows below).
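A rough in-graph sketch of such a demo (TF 1.x-style API; device names, the session target and the sample counts are placeholders, not the actual demo code):

```python
import tensorflow as tf  # TF 1.x-style API

# In-graph replication: a single graph whose per-worker pieces are pinned to devices.
workers = ["/job:worker/task:0", "/job:worker/task:1"]   # placeholder device names
n_per_worker = 1000000
counts = []
for device in workers:
    with tf.device(device):
        xy = tf.random_uniform([n_per_worker, 2])                      # points in the unit square
        inside = tf.cast(tf.reduce_sum(xy * xy, axis=1) <= 1.0, tf.float32)
        counts.append(tf.reduce_sum(inside))

# Synchronous: the final op depends on every worker's partial count.
pi_estimate = 4.0 * tf.add_n(counts) / float(n_per_worker * len(workers))

with tf.Session("grpc://chief.example.com:2222") as sess:              # placeholder master address
    print(sess.run(pi_estimate))
```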
Without going into detail: with cross-entropy the output * (1 - output) factor cancels out of the derivative, which does not happen with RMSE, so there is no problem when we output very high probabilities.
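To back up that note, with a sigmoid output $o = \sigma(z)$ and target $y$ (standard results, not copied from the slides):
with MSE (up to a constant factor), $\partial L/\partial z = (o - y)\,o\,(1 - o)$, which vanishes when $o$ saturates near 0 or 1;
with cross-entropy, $\partial L/\partial z = o - y$, so the $o(1-o)$ factor cancels and learning does not stall on confident wrong predictions.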
Setting α to 0 results in no regularization; larger values of α correspond to more regularization.
Different F, different results.
The derivative of the L2 penalty increases linearly with w; for L1 it is constant, sign(w).
Optimization algorithms that use the entire training set are called batch or deterministic gradient methods.
For instance, if our training set has 1,500 examples and our batch size is 500, it will take 3 iterations to complete 1 epoch.
Keras (κέρας) means horn in Greek. It is a reference to a literary image from ancient Greek and Latin literature, first found in the Odyssey, where dream spirits (Oneiroi, singular Oneiros) are divided between those who deceive men with false visions, who arrive to Earth through a gate of ivory, and those who announce a future that will come to pass, who arrive through a gate of horn. It's a play on the words κέρας (horn) / κραίνω (fulfill), and ἐλέφας (ivory) / ἐλεφαίρομαι (deceive).
"Oneiroi are beyond our unravelling --who can be sure what tale they tell? Not all that men look for comes to pass. Two gates there are that give passage to fleeting Oneiroi; one is made of horn, one of ivory. The Oneiroi that pass through sawn ivory are deceitful, bearing a message that will not be fulfilled; those that come out through polished horn have truth behind them, to be accomplished for men who see them." Homer, Odyssey 19. 562 ff (Shewring translation).
Example: red juxtaposition.
Short term: neurotransmitters from one neuron to the next.
Medium term: activation of terminals that were doing nothing.
Long term: new terminals; this implies changes in gene expression, remodelling of the cell, etc.
This is our programme. You will be able to see information, updates and our schedule by scanning this QR code.