Big Data Analytics course: Named Entities and Deep Learning for NLP
1. Big Data and Deep Learning in NLP
christian.morbidoni@gmail.com
Lecture given as part of the
Specialization course in «Big Data Engineering», A.A. 2017/2018,
Università Politecnica delle Marche, Ancona
2. Outline
• Named Entities Recognition
• The NER Task .. a little theory
• Tools and API
• Open/commercial tools demos
• (Semantic) Data
• Applications
• Faceted browsing, recommendations, social data analysis…
• Deep Learning and NLP
(e.g. online news topics classification, hotel review sentiment polarity)
• Neural Networks .. a little “theory”
• fully connected networks
• convolutional networks
• .. in NLP
• Word Embeddings
• Text Classification
• Python examples (Pytorch)
6. Major challenges in entity recognition
• Entity spotting - identification of the pieces of text that
represent entities
• Chunking – correctly selecting the sequence of words that
represents an entity
• Checking if a particular text segment really represents an entity
(i.e., it is not a false positive)
7. Major challenges in entity recognition
• Determining the type of an entity
• Group (Team) vs. Location:
• “England won the World Cup” vs. “The World Cup took place in England”
• Company vs. Artefact:
• “having shares in BBC” vs. “watching BBC”
• Location vs. Organization:
• “she met him at Heathrow” vs. “the Heathrow authorities…”
8. Approaches
• “Traditional”:
• List lookup approaches
• Rely on the use of domain specific dictionaries and gazetteer lists
• Rule-based approaches
• Big Data driven
• Machine learning and knowledge bases
• In practice: hybrid approaches
• They combine two or more of the aforementioned approaches
• Most frequently applied in practice
9. Rule-based approaches: shallow parsing
• Well-known Hearst patterns for recognizing entities of different types
• M. Hearst. Automatic Acquisition of Hyponyms from Large Text Corpora. In Proc. of the 14th Int’l Conference on Computational Linguistics, Nantes, France, 1992.
• such NP as {NP,}* {or | and} NP
• ... works by such authors as Herrick, Goldsmith, and Shakespeare
• NP {,} including {NP,}* {or | and} NP
• All common-law countries, including Canada and England ...
• NP {,} especially {NP,}* {or | and} NP
• ... most European countries, especially France, England, and Spain.
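A minimal Python sketch of the first pattern ("such NP as ...") may help make the idea concrete. It is only an illustration: a noun phrase is approximated by a single word, and the names `SUCH_AS` and `hearst_such_as` are hypothetical, not taken from Hearst's paper or from any tool.

```python
import re

# Simplified "such NP as NP, NP, and NP" pattern.
# Assumption of this sketch: every noun phrase is a single word.
SUCH_AS = re.compile(r"such (\w+) as ((?:\w+, )*\w+,? (?:and|or) \w+|\w+)")

def hearst_such_as(text):
    """Return (hypernym, [hyponyms]) pairs matched by the 'such NP as' pattern."""
    results = []
    for match in SUCH_AS.finditer(text):
        hypernym = match.group(1)
        # Split the enumeration on ", and ", ", or " and plain commas.
        parts = re.split(r",?\s+(?:and|or)\s+|,\s*", match.group(2))
        results.append((hypernym, [p for p in parts if p]))
    return results

pairs = hearst_such_as("... works by such authors as Herrick, Goldsmith, and Shakespeare")
print(pairs)  # [('authors', ['Herrick', 'Goldsmith', 'Shakespeare'])]
```

A real shallow parser would use a chunker to find multi-word noun phrases instead of `\w+`.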
10. Rule-based approaches: regular expressions
• Particularly suitable for detecting entities whose textual representation
has to follow a well-defined structure
• An example: regular expression for recognizing someone’s username
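The regular expression shown on the original slide is not available in this text, so the following is only a sketch under an assumed rule: a username is 3 to 16 characters long and uses only lowercase letters, digits, underscores and hyphens. The pattern and the helper `is_username` are hypothetical.

```python
import re

# Assumed username rule (not from the slide): 3-16 characters,
# lowercase letters, digits, underscore or hyphen.
USERNAME = re.compile(r"[a-z0-9_-]{3,16}")

def is_username(candidate):
    """True if the whole string follows the assumed username structure."""
    return USERNAME.fullmatch(candidate) is not None

print(is_username("john_doe-42"))  # True
print(is_username("ab"))           # False (too short)
```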
11. Web (data)-based
• Often used in combination with Machine Learning
• Huge amount of training data!
• → Deep Neural Networks
Hint:
“The more examples you show to the machine, the better it learns”
We need BIG DATA here!
12. Big training data
• Every 1,000 Wikipedia pages provide…
• … over 100,000 examples to train the machine learning NER model
13. (some) Tools
• Basic python example (NLTK)
• Open tools:
• AIDA
• Link: https://www.mpi-inf.mpg.de/departments/databases-and-information-systems/research/yago-naga/aida/
• Demo: https://gate.d5.mpi-inf.mpg.de/webaida/
• DisplaCy
• https://explosion.ai/demos/displacy-ent
• A list of open tools: http://nlp.cs.rpi.edu/kbp/2014/tools.html
• Open tools using “Big Data”
• DBpedia Spotlight
• Demo: https://www.dbpedia-spotlight.org/demo/
• DBpedia what? A big semantically structured graph. E.g. http://dbpedia.org/page/Harry_Potter
• TAGME!
• Demo: https://tagme.d4science.org/tagme/
• Commercial tools (available for research purposes)
• Dandelion API
• Demo: https://dandelion.eu/semantic-text/entity-extraction-demo/
• Derived from TAGME!
• TextRazor
• Demo: https://www.textrazor.com/demo
14. Python: NER with NLTK
• Local: http://localhost:8888/notebooks/NE_example.ipynb
• Jupyter Notebook:
• https://nbviewer.jupyter.org/github/chrmor/ExampleCode/blob/master/NE_example.ipynb
• NLTK demo page: http://text-processing.com/demo/tag/
15. Open source tools: example
• Demo: https://explosion.ai/demos/displacy-ent
16. Open source tools: example
• Demo: https://gate.d5.mpi-inf.mpg.de/webaida/
20. Evaluation
• Example of experimental evaluation:
• http://gerbil.aksw.org/gerbil/experiment?id=201807090004
• http://gerbil.aksw.org/gerbil/experiment?id=201807090005
• Run other experiments:
• http://gerbil.aksw.org/gerbil/
21. Knowledge graphs
• Entity recognition is a way of:
• Making the semantics of a document emerge
• Connecting the text to a big knowledge graph (like Wikipedia, DBpedia or others)
• A knowledge graph semantically connects resources (entities)
24. DBpedia.org
• Similar to Wikipedia but…
• Wikipedia has pages
• Mostly free text (unstructured)
• DBpedia has entities (resources):
• contains structured information
• Based on Semantic Web and Linked Data standards
• “Semantically” Structured Data
• Data and schema together
• It is a graph where nodes are entities (resources)
Example: http://dbpedia.org/page/Harry_Potter
26. Exploring DBpedia
• “instances of Software in DBpedia”:
• select distinct ?software
where
{?software a <http://dbpedia.org/ontology/Software>}
LIMIT 100
• Run query via HTTP API:
• http://dbpedia.org/sparql?default-graph-uri=http%3A%2F%2Fdbpedia.org&query=select+distinct+%3Fsoftware+where+%7B%3Fsoftware+a+%3Chttp%3A%2F%2Fdbpedia.org%2Fontology%2FSoftware%3E%7D+LIMIT+100&format=text%2Fhtml&CXML_redir_for_subjs=121&CXML_redir_for_hrefs=&timeout=30000&debug=on&run=+Run+Query+
• Choose a result and browse the graph:
• http://dbpedia.org/page/Final_Fantasy_II
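The request URL above can also be assembled programmatically. The sketch below only builds the URL with the standard library and does not contact the endpoint. The helper name `build_sparql_url` is hypothetical, and only the `default-graph-uri`, `query` and `format` parameters from the slide's URL are reproduced.

```python
from urllib.parse import urlencode, urlparse, parse_qs

ENDPOINT = "http://dbpedia.org/sparql"  # endpoint used in the slide's URL

QUERY = """
select distinct ?software
where { ?software a <http://dbpedia.org/ontology/Software> }
LIMIT 100
"""

def build_sparql_url(query, result_format="text/html"):
    """Encode a SPARQL query as an HTTP GET URL (illustrative helper)."""
    params = {
        "default-graph-uri": "http://dbpedia.org",
        "query": " ".join(query.split()),  # collapse newlines and extra spaces
        "format": result_format,
    }
    return ENDPOINT + "?" + urlencode(params)

url = build_sparql_url(QUERY)
print(url)
```

Fetching `url` (for example with `urllib.request.urlopen`) would run the query, subject to the availability of the public endpoint.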
47. Big Data application example: “Social recommender for Journalists”
• Given a popular event:
• W3 recommends aspects still uncovered (or poorly covered) in news articles related to events
• Detecting communication and information needs in micro-blogs (Twitter) and online information sources (Wikipedia), respectively
• Goal: providing feedback to journalists:
• highlighting aspects of a story that need to be further addressed
• finding issues that appear to be of interest to the public but have been ignored
• helping local newspapers echo international press releases
50. The task
• Given a corpus of news
• Find topics (sets of named entities, hashtags, words) that:
• Emerge as topics of collective interest in Twitter and Wikipedia (popularity)
• Are related to a given event reported in news items (saliency)
• Are novel w.r.t. the reported news items (serendipity)
55. Data
• Time span: Jun-Sept 2014
• Online news
• Gathered from Google News (GN) and highbeam.com (HB)
• For each news item: day of publication, source, title, text snippet (~25 words)
• 351,922 news items from 88 sources (GN), 1,181,166 news items from 325 sources (HB)
• Wikipedia page views
• Downloaded from the Wikimedia dumps site
• English queries only, Wikipedia pages only, no redirects
• 27,708,310,008 page views on about 388 million pages
• Twitter messages
• Collected 1% of Twitter traffic using the standard Twitter API
• Overall, about 235 million tweets
56. Entity extraction & linking
• Pre-processing: extract entity mentions from news and tweets
• State-of-the-art Entity Extraction & Linking tools
• Dandelion for news (https://dandelion.eu)
• TextRazor for tweets (https://www.textrazor.com/)
• Both of them link to Wikipedia pages
57. Symbolic Aggregate Approximation (convert signals into symbolic strings)
SAX conversion of the signal “Ukraine” in a Twitter stream, with an alphabet of 2 symbols, during a 10-day window in July 2014.
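The SAX idea can be sketched in a few lines of Python: z-normalize the signal, average it over equal-sized segments (piecewise aggregate approximation), then map each segment to a symbol. This is a generic sketch, not the lecture's code: the function `sax` and the daily counts are hypothetical, and with 2 symbols the single breakpoint is 0 (the mean of the normalized signal).

```python
from statistics import mean, pstdev

def sax(signal, n_segments, breakpoints=(0.0,), alphabet="ab"):
    """Convert a numeric signal into a symbolic string (illustrative SAX sketch)."""
    if len(signal) % n_segments:
        raise ValueError("signal length must be a multiple of n_segments")
    # 1. z-normalize the signal (zero mean, unit standard deviation)
    mu, sigma = mean(signal), pstdev(signal)
    z = [(v - mu) / sigma if sigma else 0.0 for v in signal]
    # 2. piecewise aggregate approximation: one average per segment
    size = len(z) // n_segments
    paa = [mean(z[i * size:(i + 1) * size]) for i in range(n_segments)]
    # 3. one symbol per segment, chosen by counting the breakpoints it exceeds
    return "".join(alphabet[sum(v > b for b in breakpoints)] for v in paa)

# Hypothetical daily mention counts of one token over a 10-day window
daily_counts = [1, 1, 2, 1, 1, 9, 10, 8, 2, 1]
print(sax(daily_counts, n_segments=10))  # aaaaabbbaa
```

The burst of attention shows up as the run of "b" symbols, which is the kind of shape that can then be matched against patterns.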
58. Example of cluster: Malaysia Airlines crash (Wikipedia page views)
Clusters are sets of active tokens (= words, entities, hashtags, page views...) matching patterns of collective attention AND with a similar shape in the same temporal window
59. Malaysia Airlines crash: meta-clusters alignment
[Ukraine, Malaysia Airlines, Surface-to-air missile, Malaysia, Kuala Lumpur, Eastern Ukraine,
Malaysia Airlines Flight 17, Boeing 777, Amsterdam, Airliner, Russia, Government of Ukraine,
Buk missile system, Hrabove, Donetsk Oblast, 2014 pro-Russian unrest in Ukraine, Soviet Union,
Kiev, War in Donbass, Barack Obama, United States, Malaysian, Amsterdam Airport
Schiphol, Russian Empire, Jet airliner, . . . ]
________________________________________________________________________________________
[Malaysia, Aviation accidents and incidents, Malaysia Airlines, Ukraine, Airline, Malaysia
Airlines Flight 17, Russia, Airliner, Passenger, Boeing 777, Interfax, Eastern Ukraine, Kuala
Lumpur, Jet aircraft, Missile, Amsterdam, Boeing, Reuters, Vladimir Putin, Aviation, Jet airliner,
United States, Airplane, CNN, President of Russia, Surface-to-air missile, AirAsia, Barack
Obama, Kiev, United Kingdom, Government of Ukraine, Aircraft, Buk missile system, Sky
News, Flight recorder, BBC, Southwest Airlines, Terrorism, Carpet bombing, Altitude, Iran Air,
France, Ministry of Internal Affairs (Ukraine), USS Vincennes (CG-49), . . . ]
________________________________________________________________________________________
[Kuala Lumpur, Siberia Airlines Flight 1812, Korean Air Lines Flight 007, Malaysia Airlines,
Boeing 777, Surface-to-air missile, Malaysia, Buk missile system, Malaysia Airlines Flight 370,
Malaysia Airlines Flight 17, Iran Air Flight 655, Ukraine, 2014 Crimean crisis, Pan Am Flight 103,
Ukraine, Malaysia Airlines Flight 17, 2014 pro-Russian unrest in Ukraine, Crimea, Igor Girkin,
Russia, 2014 Russian military intervention in Ukraine, bermuda triangle, . . . ]
Sources of the three lists above, in order: News, Twitter, Wikipedia
60. Malaysia Airlines crash: Recommendations
Rin news: [ ukraine, russia, malaysia airlines, kuala lumpur, malaysia airlines
flight 17, surface-to-air missile, boeing 777, buk missile system, 2014 pro-
russian unrest in ukraine, malaysia, crimea, igor girkin, malaysia airlines flight
370 ]
Rnovel: [ministry of internal affairs (ukraine), southwest airlines, iran air,
interfax, trans world airlines, flight recorder, jet aircraft, military aircraft,
buffer state, carpet bombing, uss vincennes (cg-49) ]
________________________________________________________________________
Rin news: [ukraine, malaysia airlines, malaysia airlines flight 17, russia, kuala
lumpur, boeing 777, malaysia, crimea, iran air flight 655, buk missile system,
2014 pro-russian unrest in ukraine ]
Rnovel: [malaysia airlines flight 370, 2014 crimean crisis, 2014 russian military
intervention in ukraine, korean air lines flight 007, bermuda triangle, siberia
airlines flight 1812, pan am flight 103 ]
Sources of the two blocks above, in order: Twitter, Wikipedia
62. Text classification
• Task: classify text into N classes
• Examples:
• Sentiment classification:
• Classes: positive, negative
• Language detection:
• Classes: English, Italian, Spanish, etc.
• News topic classification:
• classes: sport, economy, politics, etc.
• Focus: Using (Deep) Neural Networks
63. NLP & Machine Learning
•The input is always a Feature Vector
• Example:
• Features related to individual words:
word length;
first capital letter;
all capital letters;
part of speech (POS) role;
the frequency of the word’s occurrence in the training set;
position in the sentence,…
• Features related to the word’s context/surrounding:
width of the surrounding;
the types of words (POS) in the surrounding, …
Feature vector: [9, 1, 0, 6, 1000, 65, 10, 12]
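As a small illustration of hand-crafted features, the sketch below computes a few of the word-level features listed above for one token. The function `word_features`, the example sentence and the training-set counts are hypothetical. POS-based and context features are left out because they would need a tagger.

```python
def word_features(tokens, position, train_counts):
    """Hand-crafted features for the token at `position` (illustrative subset)."""
    word = tokens[position]
    return [
        len(word),                          # word length
        int(word[:1].isupper()),            # first capital letter
        int(word.isupper()),                # all capital letters
        train_counts.get(word.lower(), 0),  # frequency in the training set
        position,                           # position in the sentence
    ]

tokens = ["She", "met", "him", "at", "Heathrow"]
train_counts = {"heathrow": 12}  # hypothetical training-set frequencies
print(word_features(tokens, 4, train_counts))  # [8, 1, 0, 12, 4]
```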
66. Features ?
• “Good old” ML:
• features are carefully chosen by domain experts
• features are extracted in a data pre-processing step
• Feature tuning is an art…
• But in the era of Big Data…
• Use the raw data as features!
• KEY IDEA: let the model learn the right features itself!
• In the case of images: pixel colors
• In the case of text: words, characters, n-grams
67. Machine Learning Classifier
[Figure: raw data → classifier (“some fancy algorithm”) → output labels: 0 | 1 for a binary classifier, a vector with one value per class for a multi-class classifier]
In machine learning:
• Feature vector as input
• A vector as output
• A loss function measures the deviation from the correct output
• Iterative feedback to minimize the loss
70. Representing RGB Images
[Figure: an RGB image as a grid of numbers (RGB levels) of size image width x image height x 3 (RGB)]
How do we detect patterns in an image?
We do not want pixels to be independent!
We want to exploit the proximity of pixels.
71. What about TEXT?
• Many ways to represent TEXT:
• Stream of characters
• Bag of words
• Sequences of words
• But we need a numerical input!
• A vector or matrix that we can feed to the classifier
72. “Traditional” NLP features
Feature vector:
• word length: 9
• first capital letter: 1
• all capital letters: 0
• part of speech (POS) role: 6
• the frequency of the word’s occurrence: 1000
• position in the sentence: 65
• width of the surrounding: 10
• the types of words (POS) in the surrounding: 12
• Features have “meaning”
• Are chosen by experts
74. Bag of Words
TEXT: “The dog and the cat are good friends”
FEATURE VECTOR (word counts): are: 1, dog: 1, cat: 1, the: 2, good: 1, friends: 1, …
vector length: the size of the vocabulary (e.g. all the possible English words)
• Possible problems:
• Sparse vectors: a lot of zeros!
• Huge vector size!
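The bag-of-words vector for the example sentence can be computed with a few lines of Python. The function name `bag_of_words` and the tiny six-word vocabulary are illustrative assumptions. A real vocabulary would contain every known word, which is where the sparsity comes from.

```python
from collections import Counter
import re

def bag_of_words(text, vocabulary):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    return [counts[word] for word in vocabulary]

# Tiny illustrative vocabulary, in the order shown on the slide
vocabulary = ["are", "dog", "cat", "the", "good", "friends"]
vector = bag_of_words("The dog and the cat are good friends", vocabulary)
print(vector)  # [1, 1, 1, 2, 1, 1]
```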
75. Neural Network Classifier
[Figure: raw data (TEXT) → neural network → output labels: 0 | 1 for a binary classifier, a vector with one value per class for a multi-class classifier]
In machine learning:
• Feature vector as input
• A vector as output
• A loss function measures the deviation from the correct output
• Iterative feedback to minimize the loss
76. Linear Regression
Example: house prices (price vs. m2)
To make a predictor:
We want to draw a line that best approximates house prices
y = W*x + b
Find the best W and the best b
77. Logistic Regression: Classifier
We want to draw a line that separates pink items from blue items
y = sigmoid( W*x + b )
Find the best W and the best b
In this case we have only one feature (m2)
Feature vector: 1 x 1
Output: 1 x 1
81. Logistic Regression: Classifier
• How does this work in practice?
• At each epoch the classifier looks at all the data examples
• First initialize W and b to random values
• At each epoch:
• calculate y = sigmoid( W*x + b ) for all examples x
• calculate the loss
• Loss: distance between the output (y) and the true labels
• Update W and b to reduce the loss
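The training loop described above can be sketched in plain Python for the one-feature case. Several choices here are assumptions, not stated on the slides: the loss is cross-entropy, the update is plain gradient descent with a fixed learning rate, and the toy data, the function name `train_logistic` and the hyper-parameters are hypothetical.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(xs, labels, epochs=2000, learning_rate=0.1, seed=0):
    """Fit y = sigmoid(w*x + b) with gradient descent (illustrative sketch)."""
    rng = random.Random(seed)
    w, b = rng.uniform(-1, 1), rng.uniform(-1, 1)  # random initialization
    for _ in range(epochs):                         # one pass over all data = one epoch
        grad_w = grad_b = 0.0
        for x, target in zip(xs, labels):
            y = sigmoid(w * x + b)                  # prediction for this example
            grad_w += (y - target) * x              # cross-entropy gradient w.r.t. w
            grad_b += (y - target)                  # cross-entropy gradient w.r.t. b
        w -= learning_rate * grad_w / len(xs)       # update to reduce the loss
        b -= learning_rate * grad_b / len(xs)
    return w, b

# Hypothetical, already-centered feature values with labels 0 / 1
xs = [-2.0, -1.5, -1.0, 1.0, 1.5, 2.0]
labels = [0, 0, 0, 1, 1, 1]
w, b = train_logistic(xs, labels)
print([round(sigmoid(w * x + b)) for x in xs])
```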
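82. From logistic classifier as a “shallow” net
Feature vector (1 x 5): m2, #rooms, #bathrooms, garden?, city distance
Output (1 x 2): Class A, Class B
Here we have 5 features and 2 output classes.
$$y_1 = w_{1,1} x_1 + w_{2,1} x_2 + w_{3,1} x_3 + w_{4,1} x_4 + w_{5,1} x_5 + b_1$$
$$\begin{bmatrix} x_1 & x_2 & x_3 & x_4 & x_5 \end{bmatrix} \begin{bmatrix} w_{1,1} & w_{1,2} \\ w_{2,1} & w_{2,2} \\ w_{3,1} & w_{3,2} \\ w_{4,1} & w_{4,2} \\ w_{5,1} & w_{5,2} \end{bmatrix} + \begin{bmatrix} b_1 & b_2 \end{bmatrix} = \begin{bmatrix} y_1 & y_2 \end{bmatrix}$$
X * W + B = Y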
84. From logistic classifier as a “shallow” net
Same network as in the previous slide (feature vector 1 x 5, output 1 x 2, X * W + B = Y), with example values for the weight matrix:
$$W = \begin{bmatrix} 0.25 & 0.67 \\ 0.54 & 0.13 \\ 0.91 & 0.58 \\ 0.72 & 0.33 \\ 0.04 & 0.69 \end{bmatrix}$$
89. From logistic classifier as a “shallow” net
Same network (feature vector 1 x 5, output 1 x 2, X * W + B = Y), with the weight matrix:
$$W = \begin{bmatrix} 0.05 & 0.33 \\ 0.20 & 0.25 \\ 0.02 & 0.01 \\ 0.36 & 0.36 \\ 0.25 & 0.01 \end{bmatrix}$$
These are the network’s parameters (we learn them via training).
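The product X * W + B = Y can be checked by hand with a short Python sketch. The weight values are the ones shown on the slide. The input vector `x`, the zero bias `b` and the helper name `forward` are hypothetical choices made only to have something to compute.

```python
def forward(x, W, b):
    """Compute Y = X * W + B for a single 1 x n feature vector."""
    return [
        sum(x_i * row[j] for x_i, row in zip(x, W)) + b[j]
        for j in range(len(b))
    ]

# 5 x 2 weight matrix from the slide
W = [
    [0.05, 0.33],
    [0.20, 0.25],
    [0.02, 0.01],
    [0.36, 0.36],
    [0.25, 0.01],
]
b = [0.0, 0.0]                 # hypothetical bias
x = [1.0, 1.0, 1.0, 1.0, 1.0]  # hypothetical (already scaled) feature vector
y = forward(x, W, b)
print(y)  # approximately [0.88, 0.96]: one score per output class
```

With all inputs equal to 1 each output is simply the sum of one column of W, which makes the shapes (1 x 5 times 5 x 2 gives 1 x 2) easy to verify.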
90. … to (Deep) Neural Networks
[X] → [W1] [b1] → [W2] [b2] → [W3] [b3] → [W4] [b4] → [Y]
1. Forward propagation: compute Y from X using weights
2. Loss
3. Backward propagation: compute derivatives and update weights
4. Re-do until convergence
91. Back-prop: Gradient Descent
For each iteration step:
• calculate the derivative of the function
• move a step in the direction opposite to the derivative, i.e. downhill (updates network parameters)
Simple gradient descent example in one dimension!
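A one-dimensional gradient descent loop fits in a few lines. In this sketch the function to minimize is f(x) = (x - 3)^2, chosen only for illustration. The learning rate, the number of steps and the name `gradient_descent` are likewise assumptions.

```python
def gradient_descent(derivative, start, learning_rate=0.1, steps=100):
    """Repeatedly step against the derivative to approach a minimum."""
    x = start
    for _ in range(steps):
        x -= learning_rate * derivative(x)  # move downhill
    return x

# f(x) = (x - 3)^2 has derivative 2 * (x - 3) and its minimum at x = 3
minimum = gradient_descent(lambda x: 2 * (x - 3), start=0.0)
print(round(minimum, 4))  # 3.0
```

In a neural network the same update is applied to every weight, with the derivatives obtained by back-propagation.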
93. Back-prop: Gradient Descent
Two-dimensional example
• Loss function: we have to find the minimum!
• Predicting function: we want to fit the data!
For each iteration step:
• calculate the derivative of the function
• move a step in the direction opposite to the derivative (updates network parameters)
Fancy video: https://www.youtube.com/watch?v=GCvWD9zIF-s
95. Neural Networks in practice
• Today, practitioners do not have to understand all the details…
• A number of easy-to-use frameworks exist
• Python rulez! :-)
• PyTorch, https://pytorch.org/
• TensorFlow, https://www.tensorflow.org/
• Keras, https://keras.io/
• Theano, http://deeplearning.net/software/theano/
• ….
• Note: for Big Data you need computational power.
96. Practical aspects
• NN models can be big in memory and can require a lot of CPU time to run through several epochs
• Often run on Graphics Processing Units (GPUs)
• Faster than CPUs at computing matrix operations
• In-house solutions are relatively cheap
• Business plans available for cloud services (e.g. Amazon)
97. PyTorch example
• Text classification with “simple plain” BOW (PyTorch tutorial):
• Based on the official tutorial: https://pytorch.org/tutorials/beginner/nlp/deep_learning_tutorial.html
• Notebook: deep_learning_tutorial(NLP).ipynb
102. Network parameters as “features”
• By feature vector we mean the input of a classifier
• In deep networks:
• Output of layer X is input to layer X+1
• … its feature vector
[Figure: network layers with weights W1, W2, W3]
The network finds hidden features
• They are relevant to the task (e.g. classification)
• We do not know their meaning (semantics)
105. NLP input vector and embeddings
• Suppose we have a 1-hot encoding (0 or 1 only)
• Vector length: the size of the vocabulary (e.g. all the possible English words)
[Figure: one-hot input vector (are, dog, cat, the, good, …) and weight matrix W1, with the embedded word vector of “dog” highlighted]
• Embedded word vector of “dog”: … semantically represents the concept “dog” (?!?)
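Multiplying a one-hot vector by the first weight matrix simply selects one row of that matrix, and that row is the word's embedding. The sketch below shows this with a made-up 5 x 3 matrix: the vocabulary order, the values in `W1` and the helper names are all hypothetical.

```python
vocabulary = ["are", "dog", "cat", "the", "good"]

# Hypothetical 5 x 3 weight matrix: one row per vocabulary word
W1 = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [1.0, 1.1, 1.2],
    [1.3, 1.4, 1.5],
]

def one_hot(word):
    """1 at the word's vocabulary index, 0 everywhere else."""
    return [1 if w == word else 0 for w in vocabulary]

def embed(word):
    """One-hot vector times W1: picks out the row of W1 for this word."""
    x = one_hot(word)
    return [sum(x_i * row[j] for x_i, row in zip(x, W1)) for j in range(len(W1[0]))]

print(one_hot("dog"))  # [0, 1, 0, 0, 0]
print(embed("dog"))    # [0.4, 0.5, 0.6]
```

This is why frameworks implement an embedding layer as a table lookup instead of a full matrix product.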
108. How to learn good embeddings?
• We want the embedded vector of a word to encode its “semantics”
• Semantics can be derived from CONTEXT.
… the cat and the dog …
… my cat is selfish …
… dog and cats are friends …
The context of “cat” is the set of words that often occur close to it.
It is related to the meaning of the word.
109. Learning word embedding … via neural networks
• Idea: train a network to predict the context of a word…
• … after training, the weight matrix W will contain embeddings that:
• are related to the context of the word (other words appearing together)
• are dense vectors: short (e.g. 100 - 300 dimensions) and with no zeros
113. Learning word embedding … via neural networks
… the cat and the dog …
… my cat is selfish …
… dog and cats are friends …
[Figure: network with input “cat” and target context [dog, selfish, friends]; each row of the weight matrix W (Embeddings) is the numeric vector of one word (dog, cat, friends, selfish)]
Dense vector
• A representation of the context of a word
• “Semantic” representation of a word
114. Learning word embedding
… via neural networks
• Two alternative models for learning word embeddings:
• CBOW: input: the (summed) context words; output: the word
• Skip-gram: input: the word; output: its context words
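The difference between the two models is easiest to see in the training examples they are fed. The sketch below builds both kinds of examples from a tokenized sentence. The window size and the function names `skipgram_pairs` and `cbow_examples` are illustrative assumptions, and no network is trained here.

```python
def skipgram_pairs(tokens, window=2):
    """Skip-gram examples: (input word, one context word to predict)."""
    pairs = []
    for i, word in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        pairs.extend((word, tokens[j]) for j in range(lo, hi) if j != i)
    return pairs

def cbow_examples(tokens, window=2):
    """CBOW examples: (list of context words, word to predict)."""
    examples = []
    for i, word in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + window + 1]
        examples.append((context, word))
    return examples

tokens = "my cat is selfish".split()
print(skipgram_pairs(tokens, window=1))
print(cbow_examples(tokens, window=1)[1])  # (['my', 'is'], 'cat')
```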
115. Pre-trained embeddings
• Available pre-trained embeddings:
◦ ConVec (Wikipedia page embeddings), reference paper [1]
◦ NASARI-embed (BabelNet synset embeddings)
◦ RDF2vec embeddings (DBpedia and Wikidata entity embeddings)
◦ Google News word2vec (Google News word embeddings + Freebase entity embeddings)
◦ GloVe pretrained vectors (word embeddings)
◦ FastText (word embeddings)
◦ … proven to be effective in a variety of machine learning tasks on your own data
116. Embedded vectors “Magic”
• Vectors of semantically related words are similar!
• Embeddings encode word semantics
120. Embedded vectors algebra
• Readings:
• https://deeplearning4j.org/word2vec.html
• https://www.tensorflow.org/tutorials/representation/word2vec
King - Man + Woman ≈ Queen
Other examples:
Geopolitics: Iraq - Violence ≈ Jordan
Distinction: Human - Animal ≈ Ethics
President - Power ≈ Prime Minister
Library - Books ≈ Hall
Analogy: Stock Market ≈ Thermometer
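Vector algebra on embeddings can be tried out with toy vectors. In the sketch below the three dimensions (roughly "royalty", "male", "female") and all the numbers are invented for illustration. Real embeddings are learned and have hundreds of dimensions. The helper names `cosine` and `analogy` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def analogy(vectors, a, b, c):
    """Word whose vector is closest to a - b + c, excluding a, b and c."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

# Hand-made toy vectors (not real embeddings)
vectors = {
    "king":  [0.9, 0.9, 0.1],
    "queen": [0.9, 0.1, 0.9],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.1, 0.1],
}
print(analogy(vectors, "king", "man", "woman"))  # queen
```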
121. Embeddings as input features
[Figure: word embeddings → neural network → output labels: 0 | 1 for a binary classifier, a vector with one value per class for a multi-class classifier]
122. Deep Learning and Big Data
• Deep neural networks need a lot of training examples to work well
• Data examples have to be annotated!
• for example, in sentiment analysis, as positive or negative
• Search engines and social networks are obvious sources of huge amounts of annotated data (e.g. Facebook likes, frequently searched keywords)
• This is generally not available to researchers
• Big evaluation datasets are provided by the research community for different NLP tasks
• However, in many applications the availability of training data is a problem!
123. PyTorch example: Learning embeddings
• A simple network to learn embeddings from a text corpus:
• https://pytorch.org/tutorials/beginner/nlp/word_embeddings_tutorial.html
• Notebook: word_embeddings_tutorial.ipynb
124. Convolution
• Convolutional Neural Networks can be better explained with image classification
• Intuitive video explanation:
• https://www.youtube.com/watch?time_continue=105&v=ISHGyvsT0QY
• For more details see, for example:
• https://ujjwalkarn.me/2016/08/11/intuitive-explanation-convnets/
125. Convolution
Neural Network ?
Exploit local patterns
Learns patterns that are relevant to the classification
(e.g. edges, colored elements, faces, etc.)
126. Evaluation & data
• Measures
• How do we know if a NER system works?
• Randomly split the data into train/validation/test datasets
• TRAIN DATA: to train the classifier. Data that we show to the classifier.
• VALIDATION DATA: data we use to tune the classifier. Adjust parameters to obtain better results.
• TEST DATA: to evaluate the performance:
• Precision: how many of the spotted entities are correct?
• Recall: how many of all the possible entities are spotted?
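Both steps, splitting the data and scoring the spotted entities, can be sketched with the standard library. The 80/10/10 proportions, the function names and the example entity sets are assumptions made for illustration. Entities are compared as exact (mention, position) pairs.

```python
import random

def split_dataset(examples, train=0.8, validation=0.1, seed=42):
    """Randomly split examples into train / validation / test lists."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_val = int(len(shuffled) * validation)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])

def precision_recall(predicted, gold):
    """Precision and recall of predicted entity spots against the gold standard."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall

train_set, val_set, test_set = split_dataset(range(100))
print(len(train_set), len(val_set), len(test_set))  # 80 10 10

# Hypothetical (mention, token position) pairs
predicted = {("Heathrow", 3), ("BBC", 7), ("England", 9)}
gold = {("Heathrow", 3), ("BBC", 7), ("Ancona", 12), ("Wikipedia", 15)}
p, r = precision_recall(predicted, gold)
print(p, r)
```

Here 2 of the 3 spotted entities are correct (precision 2/3) and 2 of the 4 gold entities are found (recall 1/2).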
127. PyTorch example: images
• Image classification with a simple 1-layer network (PyTorch tutorial):
• http://pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html
• Notebook: cifar10_tutorial_original.ipynb
129. Convolution
• Intuitive for images…
• .. what about TEXT?
• we can represent documents as a vector or matrix of word embeddings
• They are numeric and dense
• We can exploit word proximity with convolution
[Figure: a text document represented as the word vectors of consecutive words]
130. Convolution
Neural Network ?
Exploit local sequences of words
Learns patterns that are relevant to the classification
(e.g. edges, colored elements, faces, etc.)
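For text, a convolution filter spans the word vectors of a few consecutive words and slides along the document. The sketch below computes the responses of a single filter in plain Python. The 2-dimensional word vectors, the filter values and the name `conv1d_responses` are hypothetical. In a real ConvNet the filter values are learned, and many filters are used together with max-pooling.

```python
def conv1d_responses(word_vectors, kernel):
    """Slide a filter covering len(kernel) consecutive words over the text."""
    width = len(kernel)
    responses = []
    for start in range(len(word_vectors) - width + 1):
        window = word_vectors[start:start + width]
        # dot product between the filter and the window of word vectors
        responses.append(sum(
            w * k
            for vec, kernel_vec in zip(window, kernel)
            for w, k in zip(vec, kernel_vec)
        ))
    return responses

# Hypothetical 2-dimensional word vectors for a 4-word document
word_vectors = [[1, 0], [0, 1], [1, 1], [0, 0]]
kernel = [[1, 0], [0, 1]]  # hypothetical filter spanning 2 consecutive words

responses = conv1d_responses(word_vectors, kernel)
print(responses)       # [2, 1, 1]
print(max(responses))  # 2 (max-pooling keeps the strongest local match)
```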
131. PyTorch example: ConvNets for Text Classification
• Paper: Convolutional Neural Networks for Sentence Classification: https://arxiv.org/abs/1408.5882
• Several implementations on GitHub
• Example: https://github.com/Shawn1993/cnn-text-classification-pytorch
• Jupyter Notebook used in this lecture:
• https://github.com/chrmor/NN-experiments/blob/master/Experiment_1.py
• http://localhost:8888/notebooks/Events2Categories/Experiments/Experiment_1.ipynb
• Two models:
• ConvNet on word2vec Google embeddings
• C-BOW on word2vec Google embeddings
132. PyTorch example: ConvNets for Text Classification
['India', 'enacts', 'new', 'rules', 'designed', 'to',
'make', 'it', 'more', 'difficult', 'for', 'foreign',
'investors', 'to', 'use', 'the', 'country', 'as', 'a',
'tax', 'dodge.', '(Bloomberg)']
Class: business and economy
PredictedClass: business and economy (OK)
['Barack', 'Obama', 'and', 'Raul', 'Castro', 'exchange',
'handshakes', 'despite', 'a', 'United', 'States',
'embargo', 'against', 'Cuba.', '(ABC', 'News)']
Class: arts and culture
PredictedClass: politics and elections (ERR)
After 1 epoch of training with CBOW…
Accuracy: 71%
133. PyTorch example: ConvNets for Text Classification
['India', 'enacts', 'new', 'rules', 'designed', 'to',
'make', 'it', 'more', 'difficult', 'for', 'foreign',
'investors', 'to', 'use', 'the', 'country', 'as', 'a',
'tax', 'dodge.', '(Bloomberg)']
Class: business and economy
PredictedClass: business and economy (OK)
['Barack', 'Obama', 'and', 'Raul', 'Castro', 'exchange',
'handshakes', 'despite', 'a', 'United', 'States',
'embargo', 'against', 'Cuba.', '(ABC', 'News)']
Class: arts and culture
PredictedClass: international relations (ERR)
After 1 epoch of training with CBOW…
Accuracy: 78%
135. Readings
• A very good overview and tutorial:
• A Primer on Neural Network Models for Natural Language Processing (2015)
• A good blog post:
• Best Practices for Document Classification with Deep Learning (blog post)
• Selected papers to deepen some lecture topics:
• Bag-of-embeddings for text classification (2016), IJCAI
• Text classification with heterogeneous information network kernels (2016), AAAI
• Few-Shot Text Classification with Pre-Trained Word Embeddings and a Human in the Loop (2018)
• Very Deep Convolutional Networks for Text Classification (2017)
• Character-level Convolutional Networks for Text Classification (2015)
• RDF2Vec: RDF Graph Embeddings for Data Mining (2016)