5. Context is important
Checker shadow illusion (Edward Adelson, neuroscientist, MIT): the squares labeled A and B are the same color.
7. But sometimes it gets ambiguous...
"Can't play Spain? Improve your playing via easy step-by-step video lessons!" (Here "Spain" is a piece of music, not the country.)
10. But sometimes it gets ambiguous...
"Mom is a great TV show" (Is "Mom" a mother, or the title of the show?)
11. NER as a sequence-labeling problem
➔ Processing one word after another
➔ Assigning a label to each word, based on local as well as global features
➔ Labels are B-PER, I-PER, B-LOC, I-LOC, OTHER, etc. (a.k.a. IOB)
Example: I/O am/O working/O for/O Basis/B-ORG Technology/I-ORG
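A minimal Python sketch (the helper name and span format are hypothetical, not from the talk) of how entity spans become per-token IOB labels like the example above:

```python
def to_iob(tokens, spans):
    """Convert entity spans (start, end, type) over a token list into IOB labels."""
    labels = ["O"] * len(tokens)
    for start, end, ent_type in spans:
        labels[start] = f"B-{ent_type}"      # first token of the entity
        for i in range(start + 1, end):
            labels[i] = f"I-{ent_type}"      # tokens inside the entity
    return labels

tokens = ["I", "am", "working", "for", "Basis", "Technology"]
print(list(zip(tokens, to_iob(tokens, [(4, 6, "ORG")]))))
# [('I', 'O'), ..., ('Basis', 'B-ORG'), ('Technology', 'I-ORG')]
```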
13. Traditional ML vs. Deep Learning
Traditional ML: "I love this movie" → feature extraction (words, part-of-speech tags, lemmas, Brown clusters) → vectorization into a sparse binary vector [00010010110000101001…001] → modeling → ☺ Positive
Deep learning: "I love this movie" → embeddings lookup, one dense vector per word ([0.323, -0.3434, 0.901, …, -0.267], [-0.4923, 0.554, 0.001, …, -0.365], …) → modeling → ☺ Positive
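A hedged sketch of the deep-learning branch of that pipeline: instead of hand-crafted sparse features, each word index is looked up in a dense embedding table. The vocabulary and dimensions here are toy values.

```python
import torch
import torch.nn as nn

vocab = {"<unk>": 0, "i": 1, "love": 2, "this": 3, "movie": 4}
embeddings = nn.Embedding(num_embeddings=len(vocab), embedding_dim=4)

ids = torch.tensor([vocab[w] for w in "i love this movie".split()])
vectors = embeddings(ids)   # shape (4 words, 4 dims): one dense row per word
print(vectors)
```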
14. Word embeddings
Vector arithmetic captures analogies: Tokyo - Japan + Germany = Berlin
[Figure: 2-D projection of word embeddings; nearby points include Germany, German, Europe, European, Africa]
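The analogy on this slide can be reproduced with any pretrained vectors; a sketch using gensim's downloader (the "glove-wiki-gigaword-100" model is one assumption, and its vocabulary is lowercased):

```python
import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-100")
# Tokyo - Japan + Germany ≈ ?
print(wv.most_similar(positive=["tokyo", "germany"], negative=["japan"], topn=1))
# "berlin" is expected to rank first
```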
15. Feed-forward network for NER
[Figure: a window of words ("... while I listen to Spain ...") feeds through Layer 1 and Layer 2 to an output layer over labels (B-PER, I-PER, B-LOC, ...)]
Natural Language Processing (Almost) from Scratch (Collobert et al., 2011)
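A minimal sketch of such a window-based feed-forward tagger in the spirit of Collobert et al. (2011), not their exact model: embeddings of a word and its neighbors are concatenated and passed through two hidden layers. Window size and dimensions are assumptions.

```python
import torch
import torch.nn as nn

class WindowTagger(nn.Module):
    def __init__(self, vocab_size, n_labels, emb_dim=50, window=5, hidden=100):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.net = nn.Sequential(
            nn.Linear(window * emb_dim, hidden), nn.Tanh(),   # Layer 1
            nn.Linear(hidden, hidden), nn.Tanh(),             # Layer 2
            nn.Linear(hidden, n_labels),                      # Output
        )

    def forward(self, window_ids):                 # (batch, window)
        e = self.emb(window_ids)                   # (batch, window, emb_dim)
        return self.net(e.flatten(start_dim=1))    # label scores for centre word

tagger = WindowTagger(vocab_size=10_000, n_labels=5)
scores = tagger(torch.randint(0, 10_000, (1, 5)))
```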
19. Recurrent neural network (RNN)
➔ At each time step we process one word, concatenated with the output from the previous time step
➔ It remembers information for many time steps
20. Recurrent neural network (RNN)
[Figure: the RNN unrolled over time steps t-1, t, t+1, predicting B-PER, I-PER, OTHER]
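A hedged sketch of a single RNN step, making the concatenation in the bullet concrete (weights and dimensions are illustrative):

```python
import torch

emb_dim, hid_dim = 4, 3
W = torch.randn(hid_dim, emb_dim + hid_dim)
b = torch.zeros(hid_dim)

def rnn_step(x_t, h_prev):
    # h_t = tanh(W [x_t ; h_{t-1}] + b)
    return torch.tanh(W @ torch.cat([x_t, h_prev]) + b)

h = torch.zeros(hid_dim)
for x in torch.randn(5, emb_dim):   # five word vectors
    h = rnn_step(x, h)              # the hidden state carries context forward
```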
21. Long Short-Term Memory (LSTM)
➔ It can forget information when necessary
[Figure: LSTM cells unrolled over time steps t-1, t, t+1, predicting B-PER, I-PER, OTHER]
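A stripped-down sketch of the LSTM cell update, to make "it can forget" concrete: the forget gate f_t scales the previous cell state, so values of f_t near 0 erase old memory. Weights and sizes are illustrative, not a full implementation.

```python
import torch

def lstm_step(x, h_prev, c_prev, W, b):
    z = W @ torch.cat([x, h_prev]) + b
    i, f, o, g = z.chunk(4)                      # gate pre-activations
    i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
    c = f * c_prev + i * torch.tanh(g)           # forget old state, write new
    h = o * torch.tanh(c)
    return h, c

emb_dim, hid = 4, 3
W, b = torch.randn(4 * hid, emb_dim + hid), torch.zeros(4 * hid)
h = c = torch.zeros(hid)
h, c = lstm_step(torch.randn(emb_dim), h, c, W, b)
```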
22. LSTM for sequence labeling
[Figure: an LSTM processes "Washington said in Chicago last ..." left to right, emitting one label per word: B-PER, OTHER, OTHER, B-LOC, OTHER]
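A hedged sketch of this tagger: embed each word, run an LSTM left to right, and project every hidden state to label scores. All sizes are toy values.

```python
import torch
import torch.nn as nn

class LSTMTagger(nn.Module):
    def __init__(self, vocab_size, n_labels, emb_dim=50, hid_dim=64):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, n_labels)

    def forward(self, ids):                 # (batch, seq_len)
        h, _ = self.lstm(self.emb(ids))     # (batch, seq_len, hid_dim)
        return self.out(h)                  # one score vector per word

tagger = LSTMTagger(vocab_size=10_000, n_labels=5)
scores = tagger(torch.randint(0, 10_000, (1, 6)))  # e.g. "Washington said in Chicago last ..."
```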
23. Bidirectional LSTM for sequence labeling
[Figure: forward and backward LSTMs run over "Washington said in Chicago last ..."; their outputs are combined (+) at each position to predict B-PER, OTHER, OTHER, B-LOC, OTHER]
Bidirectional LSTM-CRF Models for Sequence Tagging (Huang et al., 2015)
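In the same sketch, the bidirectional variant is a one-flag change: PyTorch runs a second LSTM right to left and concatenates both directions at each position (the "+" in the figure may denote summation instead; concatenation is PyTorch's convention).

```python
import torch.nn as nn

lstm = nn.LSTM(input_size=50, hidden_size=64, batch_first=True,
               bidirectional=True)
out = nn.Linear(2 * 64, 5)   # 2 * hid_dim: the two directions are concatenated
```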
24. Multilayer LSTM for sequence labeling
[Figure: a second bidirectional LSTM layer is stacked on the first; the top layer's outputs predict the labels]
25. Multilayer LSTM for sequence labeling
[Figure: the same network with one more stacked bidirectional LSTM layer; each layer feeds the one above]
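Stacking layers, as on these two slides, is again one argument in the same sketch (three layers here is an arbitrary choice): each LSTM layer feeds its output sequence to the next.

```python
import torch.nn as nn

lstm = nn.LSTM(input_size=50, hidden_size=64, num_layers=3,
               batch_first=True, bidirectional=True)
```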
26. Alternative decoding using Conditional Random Fields (CRF)
[Figure: the bidirectional LSTM over "Washington said in Chicago last ..." feeds a CRF layer, which predicts the whole label sequence B-PER, OTHER, OTHER, B-LOC, OTHER jointly]
27. Alternative decoding using Conditional Random Fields (CRF)
[Figure: at each position the CRF considers every candidate label (B-PER, I-PER, B-LOC, I-LOC, OTHER), forming a lattice over the sentence]
28. Decoding with CRF
[Figure: one path through the label lattice is highlighted: the global score of a specific sequence of labels]
29. Decoding with CRF
The global score of a specific sequence of labels includes learned transition scores T between adjacent labels, e.g. T[OTHER, I-PER] < T[B-PER, I-PER]: an I-PER tag is plausible right after B-PER, but not right after OTHER.
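A hedged NumPy sketch of that global score (the function name and the label-to-index order [B-PER, I-PER, B-LOC, I-LOC, OTHER] are illustrative): per-word emission scores come from the BiLSTM and transition scores from the CRF; decoding then searches for the highest-scoring sequence, typically with the Viterbi algorithm.

```python
import numpy as np

def path_score(emissions, T, labels):
    """emissions: (seq_len, n_labels); T[i, j]: score of label i -> label j."""
    score = emissions[0, labels[0]]
    for t in range(1, len(labels)):
        score += T[labels[t - 1], labels[t]] + emissions[t, labels[t]]
    return score

rng = np.random.default_rng(0)
emissions, T = rng.normal(size=(5, 5)), rng.normal(size=(5, 5))
print(path_score(emissions, T, [0, 4, 4, 2, 4]))  # B-PER OTHER OTHER B-LOC OTHER
```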
32. Character encoding results
              English   Arabic   Korean
BiLSTM          83.5      80.3     82.3
BiLSTM+Char     85.1      82.5     86.0
*Results are F-score measured over Basis’ evaluation set.
33. Char encode, word encode, decode
[Figure: three-stage architecture; "Washington said in Chicago last ..." passes through char encoding, then word encoding, then decoding into labels]
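A hedged sketch of this three-stage architecture (one of several reported combinations; a character LSTM, word BiLSTM, and linear decoder are chosen here, and all sizes are toy values): a character-level LSTM builds a vector per word, it is concatenated with the word embedding, a bidirectional word-level LSTM encodes the sentence, and a linear layer decodes labels. A CRF could replace the linear decoder.

```python
import torch
import torch.nn as nn

class CharWordTagger(nn.Module):
    def __init__(self, n_chars, n_words, n_labels,
                 char_dim=25, word_dim=50, hid=64):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim)
        self.char_lstm = nn.LSTM(char_dim, char_dim, batch_first=True)
        self.word_emb = nn.Embedding(n_words, word_dim)
        self.word_lstm = nn.LSTM(word_dim + char_dim, hid,
                                 batch_first=True, bidirectional=True)
        self.decode = nn.Linear(2 * hid, n_labels)

    def forward(self, word_ids, char_ids):
        # char_ids: (seq_len, max_word_len) - the characters of each word
        _, (h_char, _) = self.char_lstm(self.char_emb(char_ids))
        char_vecs = h_char[-1]                        # (seq_len, char_dim)
        words = torch.cat([self.word_emb(word_ids), char_vecs], dim=-1)
        enc, _ = self.word_lstm(words.unsqueeze(0))   # add batch dim
        return self.decode(enc)                       # (1, seq_len, n_labels)

tagger = CharWordTagger(n_chars=100, n_words=10_000, n_labels=5)
scores = tagger(torch.randint(0, 10_000, (6,)), torch.randint(0, 100, (6, 10)))
```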
34. Reported combinations
                          Char encoder   Word encoder   Decoder
Collobert et al. (2011)   None           CNN            CRF
Mesnil et al. (2013)      None           RNN            RNN
Nguyen et al. (2016)      None           RNN            GRU
Huang et al. (2015)       None           LSTM           CRF
Lample et al. (2016)      LSTM           LSTM           CRF
Chiu & Nichols (2016)     CNN            LSTM           CRF
Zhai et al. (2017)        CNN            LSTM           LSTM
Yang et al. (2016)        GRU            GRU            CRF
Strubell et al. (2017)    None           Dilated CNN    CRF
Shen et al. (2018)        CNN            CNN            LSTM
Borrowed from Shen et al. (2018)
36. The dying algorithm
By Siddhartha Mukherjee, Jan 2018: an algorithm that predicts death for oncological patients.
"Here is the strange rub of such a deep learning system: It learns, but it cannot tell us why it has learned...
...the algorithm looks vacantly at us when we ask, Why? It is, like death, another black box."
37. Bidirectional LSTM for NER
[Figure: the bidirectional LSTM tagger over "Washington said in Chicago last ...", predicting B-PER, OTHER, OTHER, B-LOC, OTHER]
38. What does LSTM actually learn?
[Figure: the same bidirectional LSTM tagger over "Washington said in Chicago last ..."]
39. What does LSTM actually learn?
[Figure: the same network, with a single cell vector singled out]
Let's look at this cell vector over time...
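A sketch of the inspection the slide proposes (random vectors stand in for real word embeddings): step an LSTMCell through the sentence manually and record the cell state c_t at each word, so a single dimension of c can be tracked over time.

```python
import torch
import torch.nn as nn

cell = nn.LSTMCell(input_size=50, hidden_size=64)
h, c = torch.zeros(1, 64), torch.zeros(1, 64)

cell_states = []
for x in torch.randn(6, 1, 50):       # "Washington said in Chicago last ..."
    h, c = cell(x, (h, c))
    cell_states.append(c.squeeze(0))  # snapshot of the cell vector

trace = torch.stack(cell_states)      # (seq_len, 64): one row per time step
print(trace[:, 0])                    # one cell's value across the sentence
```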