6. Exploring the Twitter data
•All the Twitter data is stored in ElasticSearch…,
•We don’t know exactly yet what it looks like…,
•We want to create a Recurrent Neural Network with LSTM in Tensorflow…,
•So it’s a good thing Python has an ElasticSearch and a Tensorflow module!
Short demo
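A minimal sketch of what the demo roughly does, pulling tweets from ElasticSearch with the Python client; the host, index name and field name are assumptions, not the actual schema:

```python
from elasticsearch import Elasticsearch

# Connect to a local ElasticSearch instance (host and port are assumptions)
es = Elasticsearch("http://localhost:9200")

# Fetch a batch of tweets mentioning bitcoin from a hypothetical "tweets" index
response = es.search(
    index="tweets",
    body={"query": {"match": {"text": "bitcoin"}}, "size": 100},
)

tweets = [hit["_source"]["text"] for hit in response["hits"]["hits"]]
print(f"Fetched {len(tweets)} tweets")
```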
7. Predicting the sentiment of a tweet: positive or negative?
1 million tweets… How do we analyze these tweets and how do we feed them into a deep learning algorithm?
Deep learning needs scalars or matrices of scalars as input.
For example, a convolutional neural network uses the pixels of images for object recognition.
Likewise, text and speech need to be vectorized before they can be analyzed.
“Only words or word encodings provide no useful information regarding the relationships
that may exist between the individual symbols” (tensorflow.org).
So vectorization of our tweets….
9. The basic ideas behind a Word2Vec model
A Word2Vec model is a neural network with one hidden layer.
This hidden layer is a matrix with dimension N x D, where
D is the length of the vector representing a word.
The input is a one-hot vector of a word and has dimension
N x 1, where N is the number of words in your dictionary.
The output layer is a vector with the probabilities that the
input word is a neighbour of each of the words in the dictionary.
This hidden layer is exactly what we are looking for!
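A toy numpy sketch of this idea with made-up sizes (a 5-word dictionary and 3-dimensional vectors instead of 400k and 50), showing that multiplying a one-hot input by the hidden-layer matrix simply selects one row, i.e. the word vector:

```python
import numpy as np

N, D = 5, 3                      # toy sizes: 5 words in the dictionary, vectors of length 3
hidden = np.random.randn(N, D)   # hidden-layer matrix, one row per word

word_index = 2                   # say our word is the 3rd entry of the dictionary
one_hot = np.zeros(N)
one_hot[word_index] = 1.0        # one-hot input vector of dimension N x 1

word_vector = one_hot @ hidden   # (1 x N) one-hot times (N x D) matrix -> D-dim vector
assert np.allclose(word_vector, hidden[word_index])  # exactly the row for our word
```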
10. [Diagram: one-hot input vector and the Word2Vec network]
11. Pre-trained Word2Vec models
• Available on Stanford website (https://nlp.stanford.edu/projects/glove/)
• Data is available with different numbers of words and several vector dimensions.
• In this project a set of 400k words is used with vectors of dimension 50 x 1.
• The data consist of a word list and a matrix:
❖The word list contains 400k words each represented by a number
❖The matrix has dimension 400k x 50, for each word a vector representation of length 50
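A sketch of loading that pre-trained set, assuming the 50-dimensional file from the Stanford download (glove.6B.50d.txt); each line of the file is a word followed by its 50 numbers:

```python
import numpy as np

words, vectors = [], []
with open("glove.6B.50d.txt", encoding="utf-8") as f:
    for line in f:
        parts = line.rstrip().split(" ")
        words.append(parts[0])                                   # the word itself
        vectors.append(np.asarray(parts[1:], dtype=np.float32))  # its 50 numbers

word_to_index = {w: i for i, w in enumerate(words)}  # each word represented by a number
embedding_matrix = np.vstack(vectors)                # matrix of dimension 400k x 50

print(embedding_matrix.shape)                        # (400000, 50)
print(embedding_matrix[word_to_index["the"]][:5])    # first entries of the vector for "the"
```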
12. Long Short-Term Memories, why should we use them?
source: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
Recurrent Neural Networks are sufficient if you want to predict for instance the sentiment of:
“The movie was really bad”
The problem arises when the relevant information is much further away or spread out over multiple
sentences:
“This is the best day ever. The weather is beautiful and I got a new job. However the movie I just saw
was really bad”
In a simpler recurrent network this may be predicted as negative. Long Short-Term Memories can
deal with the information in the whole text.
13. First an intuitive interpretation
•The complete network consists of n such layers.
•At each layer you put in the next word of your text, Xt, and add it to the already stored information.
•A number of updates and calculations are done, and finally there is some output, ht, and we move on
to the next layer.
And now step by step…
14. Step by Step: the main information line
•On this line all the information is stored, and this information loops through all the cells until all words have been processed.
•Within each cell information is added, removed and updated.
15. Step by Step: the forget gate
•The next word is added to the cell, Xt, just like the information from the previous cell, ht-1.
•The sigmoid function determines which information from ht-1 is kept, e.g.:
When Xt is a new subject, you may want to forget the old one, which is stored in the cell state on
the main line
•The outcome is multiplied with information from the cell state Ct-1.
16. Step by Step: the forget gate - example
• Assume the word at Xt is “bitcoin”. As stated earlier, we use word vectors:
• The vector is multiplied by a weight matrix Wx,f with dimension 50 x (num LSTM units) and after that a bias is
added. In formula notation: Xt ∗ Wx,f + bx,f
• We work with 50d vectors and 64 LSTM units, so the formula gives us a vector of length 64.
• Finally this is put into the sigmoid function, σ(Xt ∗ Wx,f + bx,f), and the outcome goes to the cell state Ct.
• The same computation, σ(ht-1 ∗ Wh,f + bh,f), is done for the previous state ht-1. Together the complete
equation becomes:
ft = σ(Xt ∗ Wx,f + ht-1 ∗ Wh,f + bf)
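A small numpy sketch of that computation with the dimensions from this slide (a 50-dimensional word vector and 64 LSTM units); the weights are random stand-ins for the ones the optimizer would learn:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

num_units = 64
x_t    = np.random.randn(50)         # word vector for "bitcoin" (50d)
h_prev = np.random.randn(num_units)  # output of the previous cell, h_{t-1}
c_prev = np.random.randn(num_units)  # previous cell state, C_{t-1}

W_xf = np.random.randn(50, num_units)         # 50 x 64 weight matrix
W_hf = np.random.randn(num_units, num_units)  # 64 x 64 weight matrix
b_f  = np.zeros(num_units)                    # bias

f_t = sigmoid(x_t @ W_xf + h_prev @ W_hf + b_f)  # forget gate, values between 0 and 1
c_after_forget = f_t * c_prev  # multiply with the old cell state: entries near 0 are "forgotten"
```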
17. Step by Step: the input gate
•The input gate consists of two functions:
1. A sigmoid function is used to determine what kind of information we would like to store. e.g. the
new subject
2. A tanh function is used to determine the content of the information, e.g. is the new subject male
or female?
•The output of these functions together is added to the current cell state Ct.
18. Step by Step: the output gate
•The output gate filters some information from the current cell state.
•A sigmoid decides what we are going to output and the tanh function makes sure the values are
between -1 and 1:
If we saw a new subject, the output will be whether the subject is male or female, singular or plural.
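Continuing the same kind of numpy sketch for these two gates (again random stand-in weights, 50-d word vectors, 64 units): the input gate's sigmoid and tanh results are added to the cell state, and the output gate then produces ht:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

num_units = 64
x_t, h_prev = np.random.randn(50), np.random.randn(num_units)
c_t = np.random.randn(num_units)   # cell state after the forget gate of this cell

def gate(x, h, activation):
    # helper: x*W_x + h*W_h + b followed by a nonlinearity, with random stand-in weights
    W_x = np.random.randn(50, num_units)
    W_h = np.random.randn(num_units, num_units)
    b = np.zeros(num_units)
    return activation(x @ W_x + h @ W_h + b)

# input gate: a sigmoid picks which entries to update, a tanh provides the content
i_t     = gate(x_t, h_prev, sigmoid)
c_tilde = gate(x_t, h_prev, np.tanh)
c_t     = c_t + i_t * c_tilde      # added (not multiplied) to the cell state

# output gate: a sigmoid filters the cell state, a tanh squashes it between -1 and 1
o_t = gate(x_t, h_prev, sigmoid)
h_t = o_t * np.tanh(c_t)           # output ht of this cell, passed on to the next one
```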
20. Hyperparameters:
There are a lot of choices you have to make before training the RNN with LSTM.
• Length of the sequence: the number of LSTM cells.
• Number of LSTM units: comparable with the number of units in a layer of a regular NN.
• Iterations: how often you run the model during your training. Each iteration you run one batch.
• Batch size: the number of tweets you run in each iteration.
• Optimizer: the function that tries to optimize the loss. Often used functions are Gradient Descent and Adam.
• DropoutWrapper and its probability: the probability of keeping information; it helps prevent overfitting.
• Learning rate: too big and your model may not converge, too small and it may take ages.
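A sketch of how these choices show up in TensorFlow 1.x-style code; the sequence length, units, batch size and iterations below are the values mentioned in this talk, while the learning rate and keep probability are just placeholder guesses:

```python
import tensorflow as tf   # assumes TensorFlow 1.x-style APIs

max_seq_length = 60       # length of the sequence: number of LSTM cells
num_lstm_units = 64       # number of LSTM units
batch_size     = 64       # tweets per batch
iterations     = 100000   # how often you run one batch during training
learning_rate  = 0.001    # placeholder value
keep_prob      = 0.75     # DropoutWrapper keep probability, against overfitting (placeholder)

cell = tf.nn.rnn_cell.BasicLSTMCell(num_lstm_units)
cell = tf.nn.rnn_cell.DropoutWrapper(cell, output_keep_prob=keep_prob)

# one 50-d GloVe vector per word, tweets padded/truncated to max_seq_length words
inputs = tf.placeholder(tf.float32, [batch_size, max_seq_length, 50])
outputs, state = tf.nn.dynamic_rnn(cell, inputs, dtype=tf.float32)
```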
21. Loss function:
The loss function we use is softmax cross entropy:
• Softmax function: it squashes the output vector of real numbers to a vector of real
numbers between 0 and 1 that add up to 1:
S(v)_i = e^{v_i} / Σ_{k=1..N} e^{v_k}
• Cross entropy is an often-used alternative to the well-known squared error and is
defined by:
H(y, S) = − Σ_i y_i log(S_i)
Where S_i is the output of the softmax function. Cross entropy is only meaningful when
the input is a probability distribution, hence the softmax function.
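A quick numpy check of these two formulas on a single made-up output vector with two classes (positive/negative):

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())        # subtract the max for numerical stability
    return e / e.sum()             # entries lie in (0, 1) and sum to 1

def cross_entropy(y, s):
    return -np.sum(y * np.log(s))  # H(y, S) = -sum_i y_i * log(S_i)

logits = np.array([2.0, 0.5])      # raw network output for [positive, negative]
y_true = np.array([1.0, 0.0])      # the tweet is labelled positive

s = softmax(logits)                # roughly [0.82, 0.18]
print(s, cross_entropy(y_true, s))
```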
22. Optimization of the loss function:
The optimization functions used in this model are Gradient Descent and the Adam optimizer. The
Adam optimizer is an extension of Stochastic Gradient Descent (SGD). SGD is defined as:
W_{t+1} = W_t − α ∇_t
SGD maintains a single learning rate for all parameter updates. Adam has a learning rate for each
network weight and they are adapted separately.
• Adam: Adaptive Moment Estimation
• Adam stores the first and second moments (mean and variance) of the decaying average of the past gradients:
m_t = β1 m_{t−1} + (1 − β1) ∇_t
v_t = β2 v_{t−1} + (1 − β2) ∇_t²
These variables are used to update the parameters/weights used in the model:
W_{t+1} = W_t − α m_t / (√v_t + ε)
http://ruder.io/optimizing-gradient-descent/index.html#adam
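A minimal numpy sketch of these updates on a toy problem; the bias correction that the full Adam algorithm also applies is left out to stay close to the formulas above:

```python
import numpy as np

alpha, beta1, beta2, eps = 0.001, 0.9, 0.999, 1e-8

W = np.random.randn(5)   # some weights
m = np.zeros_like(W)     # first moment: decaying mean of past gradients
v = np.zeros_like(W)     # second moment: decaying variance of past gradients

for t in range(1000):
    grad = 2 * W                             # toy gradient of the loss sum(W**2)
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    W = W - alpha * m / (np.sqrt(v) + eps)   # per-weight effective learning rate

print(W)   # moves towards zero, the minimum of the toy loss
```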
26. How about the ‘derivative’ of the sentiment?
• If the sentiment is getting better, the derivative is positive,
• If the sentiment is getting worse, the derivative is negative,
• If the sentiment is stable, the derivative is zero.
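Computed on a series of sentiment scores per time bucket, this is just the difference between consecutive values; a tiny illustration with made-up numbers:

```python
import numpy as np

sentiment = np.array([0.2, 0.4, 0.4, 0.1])   # made-up sentiment score per hour
derivative = np.diff(sentiment)              # [ 0.2,  0.0, -0.3]

# positive = getting better, zero = stable, negative = getting worse
print(derivative)
```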
27. Discussion and conclusion
• Recurrent Neural Networks with LSTM are powerful tools to work with,
• The mathematics behind them is complicated; however, the code is not that hard to understand,
• Many parameters to tune,
• Bitcoins and sentiment are not related according to this model.
Some possible improvements:
• Use a training set with the same kind of tweets as the actual set,
• Use other keywords when collecting tweets than only news- and finance-related topics,
• Put a higher weight on tweets that were more retweeted than others.
28. Thank you all for coming
★Questions: https://www.linkedin.com/in/olaf-de-leeuw-6a2b073b/
★Code/Notebooks: https://github.com/olafdeleeuw/ODSC-London-2018
Editor's notes
Leicester Square
We wanted to learn about RNN’s with LSTM and sentiment analysis.
Needed a cool topic, so bitcoin
We built an application in Java that collects Twitter data and stores it in ES. We ran the collector for a couple of weeks
We collected tweets with finance- and news-related items
The bitcoin data is stored in MySQL
Opportunity to learn some new things: ES
I wanted to learn about LSTM and Tensorflow
So I needed ES, Tensorflow and Recurrent NN —> python
We collected 1 million tweets, but an RNN needs vectors, not strings
Example about image recognition
Strings provide no useful info to an RNN
How to convert the data? Vectorization
Words related in semantics, meaning and context are closer to each other
Word2Vec is a neural network: 1 hidden layer
Input is a 1-hot vector: see picture next slide. Length is the number of words in your dictionary. In my case 400k
Output of the NN is a vector with probabilities for all words in the dict that your input word is their neighbour.
Hidden layer is the vector matrix we want. It has dim 400k X 50 and is the vector representation of all the words in our dictionary
The hidden layer is the word vector matrix. We don’t need the output layer here
You can train a word2vec yourself but you need a lot of text. It is not the purpose of this talk. So I used a pre-trained model.
There are sets available with vector dim from 50 to 300
So we split our tweets into words and each word in the tweet is converted to a word vector.
Use RNN with LSTM when regular RNN is not good enough, so when there is too much information and when it’s spread out.
Ref Colah's blog
N cells, usually about the number of words. In our case the max length of tweets is about 60.
At each cell you put in a new word of your tweet
In the cell the input of the new word and output of previous cell is used to update your information about the sentiment, about your prediction..
Main layer, stores all relevant information
This goes from beginning to end, the output
The information is updated in each cell based on new words via multiplication and addition
Next word added just like information about the previous state
Sigmoid determines what to throw away from this —> 0 all, 1 nothing
Example: a new subject may be interesting and you may want to throw away the old subject
The output of sigmoid is multiplied with the current cell state to throw away this irrelevant data
Bitcoin as a vector —> Xt via GloVe
Multiplication by weight matrix and add bias
In the model we start with a randomly initialized, normally distributed weight matrix and a constant bias
Via optimization algorithms such as SGD or Adam these weights and biases are updated
The outcomes are multiplied with Ct to throw away the information you don’t need anymore
At the input gate you do 2 things:
- determine which items you want to update, e.g. the new subject
- determine what information you want to update: e.g. plural or singular, male or female
This is added (not multiplied) to Ct because you want to add information
In the last gate information is filtered which we would like to output
This information is also sent to the forget gate of the next cell
A sigmoid function determines which items are output, such as the new subject as in the previous example
A tanh function on the cell state determines what information the model outputs at this timestep
Start with all the tweets
Split them to lists
Create indices of words
Create vectors with the GloVe dictionary/dataset
Run the RNN model with LSTM —> check loss, optimize with for example Adam
Evaluate the output labels
Explain hyperparameters
Batch size and number of iterations may influence the overfitting of your model. My example used a subset, with batch size 64 and 100k iterations
For each item you want the probabilities to sum up to one —> softmax, e.g. 0.4 for pos and 0.6 for neg
So in fact it creates a probability distribution
Normal squared error leads to a non-convex problem for classification, therefore cross entropy. This makes sure we have a convex problem
Adam is better suited because it has a learning rate for each parameter, SGD has one for all
Changing epsilon can help prevent fluctuations; in my model it didn't
One period without predictions, because I had no data. Skiing :)