Digital Transformation starts with data. What if a solution existed that put data at the center, in a single place, serving all applications around it? This training will include a demonstration on a distributed data-centric platform that provides a data intelligence layer, composed of artificial intelligence models able to make use of a whole company’s data.
Nowadays, one of the most innovative techniques in artificial intelligence is deep neural networks. Among the many applications, language modelling, machine translation and image generation are receiving particular attention. Deep nets are also powerful in predictive modelling domains such as stock pricing and the energy industry. We will address a few case studies modeled with TensorFlow, running on Stratio’s data-centric product in a distributed cluster.
By: Fernando Velasco
18. Environment Summary
Multiuser environment: manages users and provisions notebooks.
Analytic environment: each user (User 1 ... User N) has their own front-end and an independent back-end that runs that user's code.
20. Distribution strategies: Data vs. Model Parallelism
When splitting the training of a neural network across multiple
compute nodes, two strategies are commonly employed:
● Data parallelism: individual instances of the model are
created on each node and fed different training samples; this
allows for higher training throughput.
● Model parallelism: a single instance of the model is split across multiple nodes, allowing larger models, ones which may not necessarily fit in the memory of a single node, to be trained.
● Mixed: if desired, these two strategies can also be composed, resulting in multiple instances of a given model, with each instance spanning multiple nodes (a data-parallel sketch follows below).
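As a minimal illustration of data parallelism (a sketch only, using the current tf.distribute API rather than whatever the original demo used; the model, shapes and data below are made up):

```python
import numpy as np
import tensorflow as tf

# Data parallelism sketch: one replica of the model per available device,
# each replica sees a different shard of every batch (illustrative data only).
strategy = tf.distribute.MirroredStrategy()

with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(20,)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")

x = np.random.rand(1024, 20).astype("float32")
y = np.random.rand(1024, 1).astype("float32")
model.fit(x, y, batch_size=64, epochs=1)  # the global batch is split across replicas
```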
21. Distributed Computation Synchrony
There are many ways to specify distributed structure in TensorFlow. Possible approaches include:
Asynchronous training: In this approach, each replica of the graph has an independent training loop that executes without coordination. It is compatible with both forms of replication above.
Synchronous training: In this approach, all of the replicas read the same values for the current parameters, compute gradients in parallel, and then apply them together. It is compatible with in-graph replication (e.g. using gradient averaging) and with between-graph replication (a between-graph sketch follows below).
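And a minimal sketch of the between-graph, parameter-server pattern from the TF 1.x era the deck targets (hostnames, ports and the tiny model are placeholders, not the original demo):

```python
import tensorflow as tf  # TF 1.x-style distributed API, as used at the time of the deck

# Placeholder cluster definition: one parameter server, two workers.
cluster = tf.train.ClusterSpec({
    "ps": ["ps0.example.com:2222"],
    "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
})
server = tf.train.Server(cluster, job_name="worker", task_index=0)

# Between-graph replication: every worker runs this script and builds its own graph;
# replica_device_setter pins the variables to the ps task, ops stay on the worker.
with tf.device(tf.train.replica_device_setter(cluster=cluster)):
    x = tf.placeholder(tf.float32, [None, 20])
    y = tf.placeholder(tf.float32, [None, 1])
    w = tf.get_variable("w", [20, 1])
    loss = tf.reduce_mean(tf.square(tf.matmul(x, w) - y))
    train_op = tf.train.GradientDescentOptimizer(0.01).minimize(loss)
# Each worker then runs its own loop against server.target, so updates arrive asynchronously.
```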
36. Activation functions: Outputs
● Linear
● Binomial: sigmoid
● Multinomial: softmax
Activation Functions
39. Sigmoid and ReLU functions
ReLU:
- Sparse activation
- Efficient computation
- “Differentiable”
- Unbounded
- Potential dying ReLU
- Convolution-friendly
Sigmoid:
- Bounded
- Probability-like function
- Dense computation
- Differentiable
- Used in many examples of fully connected layers
We are too cool to speak about linear activators, aren’t we? Not entirely... (a small sketch of both functions follows below)
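To make the comparison concrete, a small NumPy sketch of both activations (purely illustrative):

```python
import numpy as np

def sigmoid(x):
    # bounded in (0, 1), probability-like, differentiable everywhere, dense output
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    # unbounded above, sparse (exact zeros for x < 0), not differentiable at 0
    return np.maximum(0.0, x)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(sigmoid(z))  # all values strictly between 0 and 1
print(relu(z))     # negatives clipped to 0 (this is what can "kill" a ReLU unit)
```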
43. On the ease of Derivations
● Sigmoid
● Hyperbolic Tangent
● ReLU
● Softmax
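The derivative formulas appear as images in the original slides; the standard forms, recalled here for reference, are:
$\sigma'(x) = \sigma(x)\,(1-\sigma(x))$
$\tanh'(x) = 1 - \tanh^2(x)$
$\mathrm{ReLU}'(x) = 1$ for $x > 0$, $0$ for $x < 0$ (undefined at $x = 0$)
$\partial\,\mathrm{softmax}_i(z)/\partial z_j = \mathrm{softmax}_i(z)\,(\delta_{ij} - \mathrm{softmax}_j(z))$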
45. Loss Functions
46. Regression error
● The most classic measure; penalizes big mistakes heavily; less interpretable
● Scale invariant; symmetric; interpretable; harder differentiability and convergence
● Penalizes big mistakes less; interpretable; harder differentiability and convergence
47. Regression error
The choice is always problem-dependent (a small numeric sketch follows).
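As a purely illustrative sketch of why the choice is problem-dependent (MSE and MAE are the usual candidates; the numbers are made up), a single large error affects the two measures very differently:

```python
import numpy as np

def mse(y_true, y_pred):
    # quadratic: a single large error dominates the total
    return np.mean((y_true - y_pred) ** 2)

def mae(y_true, y_pred):
    # linear: more robust to outliers, but the gradient magnitude is constant
    return np.mean(np.abs(y_true - y_pred))

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_good = np.array([1.1, 2.1, 2.9, 4.0])      # small errors everywhere
y_outlier = np.array([1.0, 2.0, 3.0, 14.0])  # one large miss

print(mse(y_true, y_good), mae(y_true, y_good))        # both small
print(mse(y_true, y_outlier), mae(y_true, y_outlier))  # MSE explodes, MAE stays moderate
```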
48. Cost functions
● Regression:
● Classification:
The shortest way is not always the best one
49. Classification and Categorical Cross-Entropy
● Categorical Cross-Entropy
where index i runs over the examples and j over the classes, the y's are the true labels and the p's their assigned probabilities.
On two classes it reduces to the familiar binary cross-entropy.
Compared to accuracy, cross-entropy is a more granular way to measure error, since it takes into account how close a prediction is to the true label.
Its derivative also simplifies the calculus compared with RMSE.
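The formula itself is shown as an image in the original slides; in the notation defined above it is the standard categorical cross-entropy (a reconstruction, assuming one-hot labels):
$\mathcal{L} = -\sum_i \sum_j y_{ij}\,\log p_{ij}$
which, for two classes, reduces to the binary cross-entropy
$\mathcal{L} = -\sum_i \left[\, y_i \log p_i + (1-y_i)\log(1-p_i) \,\right]$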
53. Regularization: Norm penalties
● Add a penalty to the loss function (see the sketch below):
● L2:
○ Keeps weights near zero.
○ Simplest one, differentiable.
● L1:
○ Sparse results, feature selection.
○ Not differentiable, slower.
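A minimal Keras sketch of adding a norm penalty to a layer (layer size and the α value are illustrative, and a recent TensorFlow/Keras install is assumed):

```python
from tensorflow import keras
from tensorflow.keras import regularizers

# alpha (the regularization strength) is illustrative: alpha = 0 means no regularization,
# larger alpha means stronger shrinkage of the weights towards zero.
layer = keras.layers.Dense(
    64,
    activation="relu",
    kernel_regularizer=regularizers.l2(1e-4),  # use regularizers.l1(...) for sparse weights
)
```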
54. Regularization: Dropout
● Randomly drop neurons (along with their connections) during training.
● Acts like adding noise.
● Very effective, computationally inexpensive.
● Acts as an ensemble of all the sub-networks generated (see the sketch below).
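A minimal Keras sketch of dropout between two dense layers (architecture and rate are illustrative; dropout is only active at training time):

```python
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Dense(128, activation="relu", input_shape=(20,)),
    keras.layers.Dropout(0.5),   # randomly zeroes 50% of the activations at train time
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
```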
58. Optimization: Challenges
● The difficulty in training neural networks is mainly attributed to the optimization part.
● The number of plateaus, saddle points and local minima grows exponentially with the dimension.
● Classical convex optimization algorithms don’t perform well.
59. Optimization: Batch Gradient Descent
● Goes over the whole training set at every update.
● Very expensive.
● There isn’t an easy way to incorporate new data into the training set.
60. Optimization: Mini-Batch Gradient Descent
● Stochastic Gradient Descent (SGD)
● Randomly sample a small number of examples (a minibatch)
● Estimate the cost function and gradient on it
● Batch size: length of the minibatch
● Iteration: every time we update the weights
● Epoch: one pass over the whole training set
● k = 1 => online learning
● Small batches => regularization effect (a minimal sketch follows below)
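A minimal NumPy sketch of the mini-batch loop, reusing the 1,500-example / batch-size-500 case from the notes (the linear model and learning rate are made up for the illustration):

```python
import numpy as np

X = np.random.rand(1500, 10)
true_w = np.random.rand(10)
y = X @ true_w + 0.01 * np.random.randn(1500)

w = np.zeros(10)
lr, batch_size = 0.1, 500          # 1500 / 500 = 3 iterations per epoch

for epoch in range(20):            # one epoch = one pass over the whole training set
    order = np.random.permutation(len(X))
    for start in range(0, len(X), batch_size):
        batch = order[start:start + batch_size]
        grad = 2 * X[batch].T @ (X[batch] @ w - y[batch]) / batch_size
        w -= lr * grad             # one weight update = one iteration
```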
61. Optimization: Variants
● Momentum:The momentum algorithm accumulates an exponentially decaying moving
average of past gradients and continues to move in their direction.
● AdaGrad: The learning rate is adapted component-wise, and is given by the square root of
sum of squares of the historical.
● RMSProp: modifies AdaGrad to perform better in the non-convex setting by changing the
gradient accumulation into an exponentially weighted moving average
● ADAM(Adaptive Moment): Combination of RMSPROP and momentum.
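In Keras these variants are drop-in optimizer objects; a sketch, with illustrative hyperparameters and assuming a recent TensorFlow/Keras:

```python
from tensorflow import keras

# Each variant is a drop-in replacement; the hyperparameters below are illustrative.
sgd_momentum = keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)
adagrad = keras.optimizers.Adagrad(learning_rate=0.01)
rmsprop = keras.optimizers.RMSprop(learning_rate=0.001)
adam = keras.optimizers.Adam(learning_rate=0.001)  # RMSProp-style scaling + momentum

model = keras.Sequential([keras.layers.Dense(1, input_shape=(10,))])
model.compile(optimizer=adam, loss="mse")
```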
71. Welcome to the jungle!
● Me Tarzan, you Cheetah. Human-friendly interface. User actions are minimized in order to ease the process, isolating users from the backend.
● Territorial behaviors are allowed. Several backends can be used: TensorFlow, CNTK and Theano (poor Theano!), but there is also another interesting property, modularization: every model is a sequence of standalone modules plugged together with as few restrictions as possible, allowing us to fully configure cost functions, optimizers, initializations, activation functions... (see the sketch below).
● Keeps your model herd a-growin’. New modules are simple to add, and existing modules provide ample examples.
● Kaa is our friend. We love Python! It makes the lives of data scientists easier: the code is compact, easier to debug, and allows for ease of extensibility.
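A minimal sketch of that modularity in Keras (every choice below is an interchangeable example, not a prescription):

```python
from tensorflow import keras

# Every piece is a pluggable module: initializer, activation, loss and optimizer
# can each be swapped independently of the rest of the model.
model = keras.Sequential([
    keras.layers.Dense(32, activation="tanh",
                       kernel_initializer="glorot_uniform", input_shape=(10,)),
    keras.layers.Dense(3, activation="softmax"),
])
model.compile(optimizer=keras.optimizers.RMSprop(learning_rate=1e-3),
              loss="categorical_crossentropy",
              metrics=["accuracy"])
```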
73. Analytic Agenda
1. Introduction: why combine models?
2. Boosting & Bagging basics
3. Demo:
○ AdaBoost implementation with binary trees
○ Feature selection with Random Forest
Not all that wander are lost
What do we say to those who think machine translation sucks? Not today!
Any Questions?
Old theory; now is the moment for representation learning.
Two main software modules:
PyStratio is a Python package providing complete access to all SparkML distributed algorithms via PySpark, as well as to Stratio Crossdata.
RStratio is an R package that relies on SparkR and provides wrappers for SparkML distributed algorithms, a feature not supported by the official SparkR releases.
Integration with other distributed libraries such as H2O and Tensorspark.
TensorFlow over Spark.
REST API: it eases and speeds up the process of moving a trained model from development to production environments. The model becomes accessible through a web REST interface, and can be applied to get real-time predictions or run on massive batch data.
Distributed cluster: each user launches their own environment via JupyterHub, and each notebook runs independently.
The data is kept locally to ease access and because it is small; accessing a full datastore would be more data-centric.
We will execute an algorithm as simple as approximating Pi via Monte Carlo. Each node executes its tasks independently. It is in-graph replication and it is synchronous. The graph can be viewed via TensorBoard (a rough sketch follows below).
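A rough in-graph sketch of such a demo (TF 1.x-style API; device names, the session target and the sample counts are placeholders, not the actual demo code):

```python
import tensorflow as tf  # TF 1.x-style API

# In-graph replication: a single graph whose per-worker pieces are pinned to devices.
workers = ["/job:worker/task:0", "/job:worker/task:1"]   # placeholder device names
n_per_worker = 1000000
counts = []
for device in workers:
    with tf.device(device):
        xy = tf.random_uniform([n_per_worker, 2])                      # points in the unit square
        inside = tf.cast(tf.reduce_sum(xy * xy, axis=1) <= 1.0, tf.float32)
        counts.append(tf.reduce_sum(inside))

# Synchronous: the final op depends on every worker's partial count.
pi_estimate = 4.0 * tf.add_n(counts) / float(n_per_worker * len(workers))

with tf.Session("grpc://chief.example.com:2222") as sess:              # placeholder master address
    print(sess.run(pi_estimate))
```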
Without going into detail: with cross-entropy the output * (1 - output) factor cancels out of the derivative, which does not happen with RMSE, so there is no problem when we output very high probabilities.
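To back up that note, with a sigmoid output $o = \sigma(z)$ and target $y$ (standard results, not copied from the slides):
with MSE (up to a constant factor), $\partial L/\partial z = (o - y)\,o\,(1 - o)$, which vanishes when $o$ saturates near 0 or 1;
with cross-entropy, $\partial L/\partial z = o - y$, so the $o(1-o)$ factor cancels and learning does not stall on confident wrong predictions.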
Setting α to 0 results in no regularization; larger values of α correspond to more regularization.
Different F, different results.
The derivative of the L2 penalty increases linearly with w; for L1 it is constant, sign(w).
Optimization algorithms that use the entire training set are called batch or deterministic gradient methods.
For instance, if our training set has 1,500 examples and our batch size is 500, it will take 3 iterations to complete 1 epoch.
Keras (κέρας) means horn in Greek. It is a reference to a literary image from ancient Greek and Latin literature, first found in the Odyssey, where dream spirits (Oneiroi, singular Oneiros) are divided between those who deceive men with false visions, who arrive to Earth through a gate of ivory, and those who announce a future that will come to pass, who arrive through a gate of horn. It's a play on the words κέρας (horn) / κραίνω (fulfill), and ἐλέφας (ivory) / ἐλεφαίρομαι (deceive).
"Oneiroi are beyond our unravelling --who can be sure what tale they tell? Not all that men look for comes to pass. Two gates there are that give passage to fleeting Oneiroi; one is made of horn, one of ivory. The Oneiroi that pass through sawn ivory are deceitful, bearing a message that will not be fulfilled; those that come out through polished horn have truth behind them, to be accomplished for men who see them." Homer, Odyssey 19. 562 ff (Shewring translation).
Example: red juxtaposition.
Short term: neurotransmitters from one neuron to the next.
Medium term: activation of terminals that were doing nothing.
Long term: new terminals; this implies changes in gene expression, remodelling of the cell, etc.
This is our programme. You will be able to see information, updates and our schedule by scanning this QR code.