TensorFlow is a wonderful tool for rapidly implementing neural networks. In this presentation, we will learn the basics of TensorFlow and show how neural networks can be built with just a few lines of code. We will highlight some of the confusing bits of TensorFlow as a way of developing the intuition necessary to avoid common pitfalls when developing your own models. Additionally, we will discuss how to roll our own Recurrent Neural Networks. While many tutorials focus on using built-in modules, this presentation focuses on writing neural networks from scratch, enabling us to build flexible models when TensorFlow's high-level components can't quite fit our needs.
About Nathan Lintz:
Nathan Lintz is a research scientist at indico Data Solutions, where he is responsible for developing machine learning systems in the domains of language detection, text summarization, and emotion recognition. Outside of work, Nathan is currently writing a book on TensorFlow as an extension to his tutorial repository https://github.com/nlintz/TensorFlow-Tutorials
Link to video https://www.youtube.com/watch?v=op1QJbC2g0E&feature=youtu.be
27. import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets('MNIST_data', one_hot=True)
X = tf.placeholder(tf.float32, [128, 784])
Y_true = tf.placeholder(tf.float32, [128, 10])
Placeholders
28. import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets('MNIST_data', one_hot=True)
X = tf.placeholder(tf.float32, [128, 784])
Y_true = tf.placeholder(tf.float32, [128, 10])
m = tf.get_variable('m', [784, 10])
b = tf.get_variable('b', [10])
Y_pred = tf.nn.xw_plus_b(X, m, b)
Parameters and Operations
29. import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets('MNIST_data', one_hot=True)
X = tf.placeholder(tf.float32, [128, 784])
Y_true = tf.placeholder(tf.float32, [128, 10])
m = tf.get_variable('m', [784, 10])
b = tf.get_variable('b', [10])
Y_pred = tf.nn.xw_plus_b(X, m, b)
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(Y_pred, Y_true))
Cost
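For reference, softmax_cross_entropy_with_logits turns the raw scores (logits) in Y_pred into probabilities with a softmax and then measures cross-entropy against the one-hot labels in Y_true; per example it computes roughly
$$p_i = \frac{e^{z_i}}{\sum_j e^{z_j}}, \qquad \text{cost} = -\sum_i y_i \log p_i,$$
where z is a row of Y_pred and y is the corresponding row of Y_true; tf.reduce_mean then averages this over the batch.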
30. import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets('MNIST_data', one_hot=True)
X = tf.placeholder(tf.float32, [128, 784])
Y_true = tf.placeholder(tf.float32, [128, 10])
m = tf.get_variable('m', [784, 10])
b = tf.get_variable('b', [10])
Y_pred = tf.nn.xw_plus_b(X, m, b)
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(Y_pred, Y_true))
optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.5).minimize(cost)
Optimizer
31. import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets('MNIST_data', one_hot=True)
X = tf.placeholder(tf.float32, [128, 784])
Y_true = tf.placeholder(tf.float32, [128, 10])
m = tf.get_variable('m', [784, 10])
b = tf.get_variable('b', [10])
Y_pred = tf.nn.xw_plus_b(X, m, b)
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(Y_pred, Y_true))
optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.5).minimize(cost)
sess = tf.Session()
sess.run(tf.initialize_all_variables())
for i in range(2000):
    trX, trY = mnist.train.next_batch(128)
    sess.run(optimizer, feed_dict={X: trX, Y_true: trY})
Train Code
54. Scaling Predictions
(Slide diagram: X pixels [784] → m, b → softmax(mx + b) → Y_true [10])
X = tf.placeholder(tf.float32, [128, 784])
Y_true = tf.placeholder(tf.float32, [128, 10])
m = tf.get_variable('m', [784, 10])
b = tf.get_variable('b', [10])
Y_pred = tf.nn.xw_plus_b(X, m, b)
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(Y_pred, Y_true))
VS.
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(tf.nn.softmax(Y_pred), Y_true))
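The second version is the pitfall: softmax_cross_entropy_with_logits already applies a softmax internally, so feeding it values that have been pushed through tf.nn.softmax squashes the scores twice and badly scales the gradients. If you also want probabilities (confidences) for inference, a minimal sketch is to compute them separately from the loss:
Y_probs = tf.nn.softmax(Y_pred)  # probabilities for inference / reporting confidences
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(Y_pred, Y_true))  # loss still takes the raw logits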
64. Placeholders
X = tf.placeholder(tf.float32, [None, 784])
model = …
cost = …
optimizer = …
for i in range(1000):
    trX, trY = mnist.train.next_batch(128)
    sess.run(optimizer, feed_dict={X: trX, Y_true: trY})
65. Placeholders
X = tf.placeholder(tf.float32, [None, 784])
model = …
cost = …
optimizer = …
for i in range(1000):
    trX, trY = mnist.train.next_batch(128)
    sess.run(optimizer, feed_dict={X: trX, Y_true: trY})
66. Placeholders
X = tf.placeholder(tf.float32, [None, 784])
model = …
cost = …
optimizer = …
for i in range(1000):
    trX, trY = mnist.train.next_batch(512)
    sess.run(optimizer, feed_dict={X: trX, Y_true: trY})
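Because the batch dimension is None, the same graph also accepts a completely different batch size at evaluation time. A minimal sketch of exploiting that (assuming the model's logits are still called Y_pred and Y_true is also declared with a None batch dimension):
correct = tf.equal(tf.argmax(Y_pred, 1), tf.argmax(Y_true, 1))  # did the highest-scoring class match the label?
accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))         # fraction of correct predictions
# the full 10,000-image test set goes through in a single pass
print(sess.run(accuracy, feed_dict={X: mnist.test.images, Y_true: mnist.test.labels}))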
67. Advanced TensorFlow: Building RNNs
Note – Most of the code for the generation is “pseudo-code” meant mostly to illustrate my point. If you wish to see the actual code, feel free to email me and I’ll send you a copy.
77. X = tf.placeholder(tf.float32, [27, 128, 28])  # first 27 rows of image
Y = tf.placeholder(tf.float32, [27, 128, 28])  # last 27 rows of image
m_output = tf.get_variable('m_output', [256, 28])
b_output = tf.get_variable('b_output', [28])
states = rnn(X)
output_img = tf.map_fn(lambda x: tf.nn.xw_plus_b(x, m_output, b_output), tf.pack(states))
cost = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(output_img, Y))
Language Model
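The rnn(X) helper and the gru.step_ call below are part of the code the note above describes as pseudo-code; they are not shown on the slides. Purely as an illustration (the names, shapes, and gating equations here are my assumption, not the author's implementation), a GRU-style step and a manual unroll over time might look roughly like this:
# hypothetical GRU parameters: inputs are 28 pixels wide, hidden state is 256 units
W_z = tf.get_variable('W_z', [28 + 256, 256]); b_z = tf.get_variable('b_z', [256])
W_r = tf.get_variable('W_r', [28 + 256, 256]); b_r = tf.get_variable('b_r', [256])
W_h = tf.get_variable('W_h', [28 + 256, 256]); b_h = tf.get_variable('b_h', [256])

def gru_step(h_prev, x):
    # one GRU update: h_prev is [batch, 256], x is [batch, 28]
    xh = tf.concat(1, [x, h_prev])
    z = tf.sigmoid(tf.nn.xw_plus_b(xh, W_z, b_z))  # update gate
    r = tf.sigmoid(tf.nn.xw_plus_b(xh, W_r, b_r))  # reset gate
    h_tilde = tf.tanh(tf.nn.xw_plus_b(tf.concat(1, [x, r * h_prev]), W_h, b_h))  # candidate state
    return (1 - z) * h_prev + z * h_tilde

def rnn(X):
    # unroll over the 27 time steps, collecting the hidden state after each step
    states = []
    h = tf.zeros([128, 256])      # initial hidden state: [batch, hidden_dim]
    for x_t in tf.unpack(X):      # x_t is one row of pixels: [128, 28]
        h = gru_step(h, x_t)
        states.append(h)
    return states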
78. def generate(num_steps):
    states = [tf.zeros([batch_size, hidden_dim])]
    outputs = []
    for _ in range(num_steps):
        next_output = tf.sigmoid(tf.nn.xw_plus_b(states[-1], m_output, b_output))
        outputs.append(next_output)
        state = gru.step_(states[-1], outputs[-1])
        states.append(state)
    return tf.pack(outputs)
Language Model (Generate)
Welcome, my name is Nathan Lintz. I am a researcher at indico Data Solutions and I spend a lot of time writing TensorFlow. In this presentation we will learn how to build basic models in TensorFlow, some tips and tricks to avoid common TensorFlow pitfalls, and some advanced TensorFlow techniques for building RNNs. TensorFlow, and to some extent machine learning more broadly, is like learning how to bake a cake.
m and b are the parameters associated with baking.
Our operation is the multiplication between m and x, as well as the addition we apply with b.
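A minimal sketch of that idea in TensorFlow (the names and shapes here are just for illustration):
x = tf.placeholder(tf.float32, [None])  # inputs, whatever we are fitting the line to
m = tf.get_variable('m', [])            # scalar slope parameter
b = tf.get_variable('b', [])            # scalar bias parameter
y_pred = m * x + b                      # the multiply and the add are the whole "recipe"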
Transition To Next Slide:
While baking is cool, this is a somewhat contrived example. For our classification problem, let's try something a bit more realistic: optical character recognition. We want to take an image consisting of black and white pixels and classify it as a digit from 0-9.
Confidences (slide figure): example predictions with confidence scores 98, 39, 73, 28.
If we didn't have a nonlinearity, the hidden layer wouldn't do anything. For a sequence of linear operations there is an equivalent linear operation that only takes a single layer. Imagine we had a Rubik's cube: the linear operations are like turning one of its faces. There are a limited number of transformations we can apply, and they all kind of do the same thing, turn a face. In contrast, nonlinearities are like solving a Rubik's cube in little-brother mode, where you smash it and then rebuild it. Nonlinearities let us smash features from our model in ways that linear operations simply cannot, so they give our model more flexibility in solving its task.
I’d like to call out here that the only new part of this model is the hidden layer
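To make the "stacked linear layers collapse" point concrete, here is a minimal sketch (the layer sizes are just for illustration): without the tf.nn.relu, the two matmuls compose into a single matrix, so the hidden layer adds nothing.
W1 = tf.get_variable('W1', [784, 256])
W2 = tf.get_variable('W2', [256, 10])
linear_only = tf.matmul(tf.matmul(X, W1), W2)  # equivalent to one matmul with the product W1·W2
h = tf.nn.relu(tf.matmul(X, W1))               # the nonlinearity is what makes the hidden layer count
with_hidden = tf.matmul(h, W2)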
Transition:
In addition to monitoring the model on test data, examples the model hasn't seen, we're also going to monitor the accuracy on train data, the examples it has already seen.
Overfitting can occur when the model has too many parameters. It learns an overly complex set of parameters to reduce the training error, and those parameters don't generalize to our test data.
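A minimal sketch of monitoring both splits (reusing the accuracy op sketched earlier and the usual mnist splits); a train accuracy that keeps climbing while test accuracy stalls is the overfitting signal:
for i in range(1000):
    trX, trY = mnist.train.next_batch(128)
    sess.run(optimizer, feed_dict={X: trX, Y_true: trY})
    if i % 100 == 0:
        train_acc = sess.run(accuracy, feed_dict={X: trX, Y_true: trY})
        test_acc = sess.run(accuracy, feed_dict={X: mnist.test.images, Y_true: mnist.test.labels})
        print(i, train_acc, test_acc)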
Dropout forces the model to learn more general representations. The parameters can't get lazy and rely on each other too heavily, as they could with our original model. Dropout forces each parameter to learn how to process a useful feature from the data, making the parameters better at generalizing. It's kind of like good software design: you don't want your components to be too tightly coupled. Sure, a tightly coupled system might be able to solve the task you're working on currently, but as soon as you need to extend your system to new challenges, you run into trouble.
I'd like to call out here that the p_keep value we are setting is how likely we are to keep an activation: 0.8 means keep any given activation with 80% probability, and 1 means keep all of the activations.
Be explicit about p_keep, since it is a little confusing.
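For reference, a minimal sketch of how p_keep is typically wired up with tf.nn.dropout (the placeholder name and the hidden layer h are the ones from the sketch above, just for illustration): feed 0.8 while training and 1.0 when evaluating.
p_keep = tf.placeholder(tf.float32)
h = tf.nn.relu(tf.matmul(X, W1))   # hidden layer as before
h_drop = tf.nn.dropout(h, p_keep)  # zeroes each activation with probability 1 - p_keep and rescales the rest by 1 / p_keep
# build cost and optimizer on top of h_drop exactly as before, then:
sess.run(optimizer, feed_dict={X: trX, Y_true: trY, p_keep: 0.8})  # train with dropout on
sess.run(accuracy, feed_dict={X: mnist.test.images, Y_true: mnist.test.labels, p_keep: 1.0})  # evaluate keeping everything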
If you look at the whole sequence all at once, you can't account for things like the position of a word. "The food at the restaurant was very good" ends up equivalent to a shuffled version like "restaurant the food good", or any other ordering.
Transition: explain that since we compute each time step from the previous one, this model can be used in a generative fashion as well.
RNNs can be run in different modes. At train time we treat the RNN like a standard neural network. Alternatively, we can run the RNN in generation mode, where we take an element of the input sequence at time t, apply our RNN, and compute the t+1 element of the sequence. We then feed the t+1 element back in to generate the t+2 element, and so on.