Tutorial given at RANLP 2015 in Hissar, Bulgaria
Recent years have seen many changes in the field of computational linguistics, most of them due to the widespread use of the Internet and the benefits and problems it brings. The first part of this tutorial will discuss these changes and will focus on crowdsourcing and how it has influenced the creation of annotated data.
Annotation of data employed to train and test NLP methods used to be the task of language experts who had a good understanding of the linguistic phenomena to be tackled. Given that a large number of people now have access to the Internet, crowdsourcing has become an alternative way of obtaining annotated data. The core idea of crowdsourcing is that it is possible to design tasks that can be completed by non-experts and that the outputs of these tasks can be combined to obtain high-quality linguistic annotation, which would normally be produced by experts. Examples of how crowdsourcing was employed in computational linguistics will be given.
Big data is another trend in computational linguistics, as researchers rely on ever more data to improve the results of their methods. The second part of the tutorial will introduce the MapReduce programming model and show how it has been used in language processing. Alongside the ability to process larger quantities of data, the field of computational linguistics has successfully applied deep learning to various tasks, improving their accuracy. An introduction to deep learning will be provided, followed by examples of how it has been applied to tasks such as learning semantic representations, sentiment analysis and machine translation evaluation.
New trends in NLP applications
1. New trends in NLP applications
Constantin Orasan
University of Wolverhampton, UK
http://www.wlv.ac.uk/~in6093/
6th September 2015 RANLP 2015, HISSAR, BULGARIA 1/100
2. A better title: Constantin’s subjective view of some of the interesting trends in NLP that can be presented in 3 hours
3. The latest trend in NLP is … natural language understanding
4. Not understanding like in …
“Open the pod bay doors, please Hal...”
Jurafsky, D., & Martin, J. H. (2009) Speech and language processing (2nd ed.). Pearson Prentice Hall. More information
from http://www.cs.colorado.edu/~martin/slp.html
5. NLU for specific applications
• Translate texts between two languages
• Simplify texts
• Find out the opinion/sentiment of texts
• Find out the entities mentioned in texts and the relations between them
• Answer questions from large collections of documents
• Help customers navigate knowledge databases
• Filter spam in social media
• Profile people
• Summarise texts
• ….
Are these new?
8. The technology advanced
The Internet evolved
Web 2.0
Openness
More access
Better hardware
9. NLP is approaching maturity
More interest from companies in developing and deploying working applications
Interest from users in employing NLP technologies in their companies
“NLP for masses”
More datasets available, more tools available
10. Structure of the tutorial
1. Text analytics: example of an established field with impact on industry
2. Crowdsourcing
3. Processing large quantities of data
4. Deep learning
11. Text analytics
12. Text analytics – from users’ perspective
From Grimes, S. (2014). Text / Content Analytics 2014: User Perspectives on Solutions and Providers. Retrieved from www.altaplana.com
Text analytics = software and transformational processes that uncover business value in “unstructured” text. Text analytics applies statistical, linguistic, machine learning, and data analysis and visualization techniques to identify and extract salient information and insights. The goal is to inform decision-making and support business optimization.
Survey of 220 users of text analytics tools
13. Text analytics
Can benefit from crowdsourcing
Requires processing of large quantities of data
Needs better ML algorithms
It is widely and successfully used by companies
Other similarly successful applications are machine translation and virtual personal assistants
14.–17. Charts from Grimes, S. (2014). Text / Content Analytics 2014: User Perspectives on Solutions and Providers. Retrieved from www.altaplana.com
18. Comments on the overall experience
• It is a messy business, but invaluable if there is no other information available.
• It gives us an overview of the data that we could not achieve without it.
• I have been doing text analytics since 1984, and I have yet to find an environment that meets my requirements for knowledge extraction.
• When applied properly and when its limits are understood, it works quite well.
• With access to proper info, I can generate a PhD level analysis in one day.
• We annotate incoming text against our taxonomy and then use the annotations as the basis of text analytics as well as search.
• As with any “adolescent” technology, there is no single end-to-end product that finds, analyzes, and visualizes all available data sources.
• Accuracy needs improvement. Tools need to be customized to specific business cases.
• Still need a human to interpret context, inference, etc.
• It is (relatively) easy to apply algorithms. It is difficult to assess the accuracy of the results or to translate them into strategic insight.
• Text content analytics is in its early infancy, and there is a long road ahead.
From Grimes, S. (2014). Text / Content Analytics 2014 : User Perspectives on Solutions and Providers. Retrieved from www.altaplana.com
19. Technology-related growth drivers1
Open source: lowers the barriers to technology adoption and enables a focus on building higher-level, more specific applications
The API economy: enables easier adoption of technologies
Data availability: there is more data than ever that needs to be analysed and that is available to train our systems
Synthesis: as different technologies become mature they lead to more complex systems and more automation
1 Adapted from Grimes, S. (2014). Text / Content Analytics 2014: User Perspectives on Solutions and Providers. Retrieved from www.altaplana.com, where they are presented from the perspective of text analytics
20. NLP meets the cloud
• Software as a Service (SaaS) is a very popular way of giving
access to software
• The software is run in the cloud and users pay some kind of
subscription to access it
• Great way to develop (commercial) NLP applications that
mashup information from several services
• Can lead to scalable applications
• There are already several established providers of APIs that allow language processing (usually branded as text analytics)
• Difficult to assess how accurate these tools are
• “don’t try to compete with what’s there, but build something new using it.”1
1 Dale, R. (2015). NLP meets the cloud. Natural Language Engineering, 21(04), 653–659. http://doi.org/10.1017/S1351324915000200
21. “text analytics has come of age”1
Is data science the next big thing (or is it already the big thing)?
1Text Analytics: The Next Generation of Big Data, http://insidebigdata.com/2015/06/05/text-analytics-the-next-generation-of-big-data/
23. Crowdsourcing
Crowdsourcing = the act of delegating a task to a large diffuse group, usually without substantial
monetary compensation1
It has developed largely as a result of Web 2.0 and of increasing access to the Internet by the masses
“distributed labor networks are using the Internet to exploit the spare processing power of millions of human brains”1
Wikipedia is considered one of the most successful projects using this approach
Embraced by the research community and industry
It is not outsourcing, but crowdsourcing
1Jeff Howe (June 2006). The Rise of Crowdsourcing. Wired. Available at http://www.wired.com/wired/archive/14.06/crowds.html
25. Crowdsourcing in NLP
Used to
Create gold standards
Collect human judgements
Involve the community in projects (e.g. competitions)
Used increasingly in NLP1
1 Fort, K., Adda, G., & Cohen, K. B. (2011). Amazon Mechanical
Turk: Gold Mine or Coal Mine? Computational Linguistics, 37(2),
413–420. http://doi.org/10.1162/COLI_a_00057
26. Standard annotation flow
Linguistic analysis of the problem tackled → annotation guidelines produced
Annotation process → annotated dataset produced
Inter-annotator agreement calculated → disagreements discussed
Revision of annotation guidelines → the cycle repeats
Language experts are involved in all stages
27. The crowdsourcing approach
Relies much less on experts
Requires decomposing the (annotation) task into simple tasks that do not require linguistic knowledge (e.g. for paraphrasing the expression desert rat, ask participants to fill in the gap rat that … desert(s)1)
These tasks can be combined to obtain high quality annotation
Requires screening of participants, filtering of noise, validation of data
In some cases the tasks are presented as games
1 Nakov, P. (2008). Noun compound interpretation using paraphrasing verbs: Feasibility study. In Proceedings of the 13th international
conference on Artificial Intelligence: Methodology, Systems, and Applications, AIMSA '08, pages 103 - 117, Berlin, Heidelberg. Springer-Verlag.
28. Crowdsourcing used for
Annotation of data:
• label data according to predefined categories
• quality can be assessed using inter-annotator agreement
Creation of new content:
• text created for certain purposes e.g. translation of a sentence, description of an image
• validation of the work is more difficult
• validation can be decomposed as a series of crowdsourced tasks
Obtaining subjective information:
• in some cases there is more than one correct answer and the opinion of the majority is sought, e.g. important features of mobile phones for an IQA system1
1 Konstantinova, N., Orasan, C., & Balage, P. P. (2012). A Corpus-Based Method for Product Feature Ranking for Interactive Question
Answering Systems. International Journal of Computational Linguistics and Applications, 3(1), 57 – 70.
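The bullets above note that annotation quality can be assessed using inter-annotator agreement. As a minimal sketch, Cohen's kappa for two annotators fits in a few lines of Python (the labels below are invented for illustration):

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two annotators, corrected for chance."""
    n = len(labels_a)
    # Observed agreement: fraction of items where the two labels match.
    p_o = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    # Expected chance agreement, from each annotator's label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Two hypothetical annotators labelling the same 10 items
a = ["pos", "pos", "neg", "pos", "neg", "neg", "pos", "pos", "neg", "pos"]
b = ["pos", "neg", "neg", "pos", "neg", "pos", "pos", "pos", "neg", "pos"]
print(round(cohen_kappa(a, b), 2))  # → 0.58
```

Values near 1 indicate strong agreement; values near 0 mean the annotators agree no more often than chance would predict.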
30. Open Mind Common Sense project
One of the first examples of crowdsourcing
A project initiated at the MIT Media Lab with the goal of building and utilizing a large common sense knowledge base
Since 1999 it has collected more than 1 million English facts from over 15,000 contributors
“an attempt to ... harness some of the distributed human computing power of the Internet”1
1 http://commons.media.mit.edu/en/ (not working August 2015) summarised at https://en.wikipedia.org/wiki/Open_Mind_Common_Sense
31. Teaching computers common sense
The slow progress in AI is due to the fact that computers lack common sense1
Common Sense: The mental skills that most people share. Common sense thinking is actually more complex than many of the intellectual accomplishments that attract more attention and respect, because the mental skills we call “expertise” often engage large amounts of knowledge but usually employ only a few types of representations. In contrast, common sense involves many kinds of representations and thus requires a larger range of different skills.
It is estimated that humans have hundreds of millions of pieces of common sense knowledge
1Singh, P. (2002). The Open Mind Common Sense Project. KurzwilAI.net. Retrieved from http://web.media.mit.edu/~push/Kurzweil.html
32. Cyc vs OMCS
Cyc is another attempt to acquire common sense knowledge, backed by the Cycorp company (http://www.cyc.com/)
It employs knowledge engineers to populate the database
People from the Cyc team worked for nearly two decades to build a database of 1.5 million pieces of common knowledge, at a cost of many tens of millions of dollars1
1Information from Singh, P. (2002). The Open Mind Common Sense Project. KurzwilAI.net. Representing the situation at the turn of the century
33. Open Mind Common Sense
Asks volunteers to provide common knowledge by:
• Asking them to fill in templates: A hammer is for ________ or The effect of eating a sandwich is ________
• Giving them a story and asking them to enter knowledge in response:
User is prompted with a story: Bob had a cold. Bob went to the doctor.
User enters many kinds of knowledge in response: Bob was feeling sick. Bob wanted to feel better. The doctor wore a stethoscope around his neck.
• Collecting information longer than one sentence (photo captions, short stories, annotated movies of simple iconic spatial events)
• After information is entered, showing the user an inference the system made, which can be accepted or rejected
The participants provided the information as English sentences, which were processed afterwards
Peer reviewing used to ensure the quality of the input
34. Phrase detectives
A specially designed interface developed at the University of Essex, UK, used to create a resource for anaphora resolution
It is presented as a game with a purpose, where participants collect points
Participation is not paid, but at times rewards are given to the most active participants
One of the main challenges is how to present the task to non-experts
Further reading: Chamberlain, J., Fort, K., Kruschwitz, U., Lafourcade, M., & Poesio, M. (2013). Using Games to Create Language Resources:
Successes and Limitations of the Approach. In I. Gurevych & J. Kim (Eds.), The People’s Web Meets NLP (pp. 3–44). Springer Berlin Heidelberg.
http://doi.org/10.1007/978-3-642-35085-6_1
39. Phrase detectives
The interface operates in two modes:
• Annotation mode: name the culprit
• Validation mode: detectives conference
New participants are trained on a gold standard before they progress to real documents
Each markable is annotated by 8 players to collect multiple judgements (4 more judgements can be added in case of disagreement)
Users are profiled to identify spammers, rate the quality of their work, etc.
The quality of the resource produced is considered excellent: in 84% of all annotations the interpretation specified by the majority vote of non-experts was identical to the one assigned by an expert (agreement between experts: 94%)
Agreement for the property category was 0%, and for non-referential 100%
40. Amazon Mechanical Turk
Amazon Mechanical Turk (MTurk) is a crowdsourcing Internet marketplace that enables individuals and businesses (known as Requesters) to coordinate the use of human intelligence to perform tasks that computers are currently unable to do.1
Requesters who need tasks completed load HITs (Human Intelligence Tasks) on MTurk, indicating various parameters (how much they are willing to pay, conditions for participants, maximum time allowed, etc.)
One of the most used crowdsourcing platforms
1 https://en.wikipedia.org/wiki/Amazon_Mechanical_Turk
To find out more about the original mechanical turk
http://www.bbc.co.uk/news/magazine-21882456
43. Why MTurk (or similar services)?
Little work required to set up the interface (pre-existing templates or fairly simple programming)
Uses existing infrastructure (hardware, payment)
Access to workers (at times tasks are completed extremely fast): in Jan 2011 over 500,000 workers from 190 countries1
But keep in mind that you will have to pay these services in addition to the workers
1 Information from https://en.wikipedia.org/wiki/Amazon_Mechanical_Turk
44. Snow et al. (2008)1
Use crowdsourcing for five tasks: affect recognition, word similarity, recognition of textual entailment, event temporal ordering and word sense disambiguation
The main purpose of the research was to explore the quality of resources created using crowdsourcing
Propose a model to assess the reliability of individual workers and correct their biases
1Snow, R., O’Connor, B., Jurafsky, D., & Ng, A. Y. (2008). Cheap and fast---but is it good?: evaluating non-expert annotations for natural language
tasks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (pp. 254–263). Association for Computational
Linguistics. Retrieved from http://portal.acm.org/citation.cfm?id=1613751
45. Affect recognition
Based on the task proposed in Strapparava and Mihalcea (2007)
Annotators were shown short headlines and gave numeric judgements:
• between 0 and 100 for 6 emotions: anger, disgust, fear, joy, sadness and surprise
• between -100 and 100 to denote the overall positive or negative valence
E.g. Outcry at N Korea ‘nuclear test’
(Anger, 30), (Disgust, 30), (Fear, 30), (Joy, 0), (Sadness, 20), (Surprise, 40), (Valence, -50)
100 headlines were selected and each was annotated by 10 annotators
From Snow, R., O’Connor, B., Jurafsky, D., & Ng, A. Y. (2008). Cheap and fast---but is it good?: evaluating non-expert annotations for natural
language tasks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (pp. 254–263). Association for
Computational Linguistics. Retrieved from http://portal.acm.org/citation.cfm?id=1613751
46. Affect recognition
Pearson correlation was calculated between the labels
Individual experts are better than individual non-experts, but adding non-expert annotations to the gold standard improves the quality of the gold standard
On average it takes 4 non-expert annotations to achieve the equivalent of the ITA of an expert annotator
The numbers are different for each class: 2 for anger, disgust and sadness; 5 for valence; 7 for joy and 9 for surprise; for fear, more than 10
From Snow, R., O’Connor, B., Jurafsky, D., & Ng, A. Y. (2008). Cheap and fast---but is it good?: evaluating non-expert annotations for natural
language tasks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (pp. 254–263). Association for
Computational Linguistics. Retrieved from http://portal.acm.org/citation.cfm?id=1613751
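The averaging effect described on this slide can be sketched in Python: correlate an expert's numeric labels with the mean of several non-expert labels. All scores below are invented for illustration, not Snow et al.'s data:

```python
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical "fear" scores (0-100) for five headlines
expert = [30, 0, 70, 10, 50]
workers = [[60, 10, 50, 0, 55],   # three noisy non-expert annotators
           [20, 0, 90, 40, 20],
           [35, 30, 65, 5, 60]]

# Averaging the non-expert judgements per headline cancels out much of
# the individual noise before correlating with the expert labels.
averaged = [mean(col) for col in zip(*workers)]
print(round(pearson(expert, averaged), 3))
```

With these toy numbers each individual worker correlates less well with the expert than their average does, which is the effect the slide describes.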
47. Affect recognition: system
• A bag-of-words unigram system was trained on crowdsourced data to predict the affect and valence
• Explanation for these unexpected results: “individual labelers (including experts) tend to have a strong bias, and since multiple non-expert labelers may contribute to a single set of non-expert annotations, the annotator diversity within the single set of labels may have the effect of reducing annotator bias and thus increasing system performance.”
From Snow, R., O’Connor, B., Jurafsky, D., & Ng, A. Y. (2008). Cheap and fast---but is it good?: evaluating non-expert annotations for natural
language tasks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (pp. 254–263).
48. Word similarity
Provide numeric judgements on word similarity for 30 word pairs on a scale of [0,10]
E.g. {boy, lad} and {noon, string}
Crowdsourcing was used to collect 10 annotations for the 30 pairs
It took less than 11 minutes to complete all the annotations
Previous studies reported inter-annotator agreement between 0.958 and 0.97
Annotation obtained using crowdsourcing achieves 0.952
From Snow, R., O’Connor, B., Jurafsky, D., & Ng, A. Y. (2008). Cheap and fast---but is it good?: evaluating non-expert annotations for natural
language tasks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (pp. 254–263).
49. Recognising textual entailment
For a pair of sentences, workers were asked to say whether the second sentence can be inferred from the first
Collected 10 annotations for 100 RTE sentence pairs
Expert inter-annotator agreement is between 91% and 96%
Using MTurk, an ITA of 89.7% is observed
From Snow, R., O’Connor, B., Jurafsky, D., & Ng, A. Y. (2008). Cheap and fast---but is it good?: evaluating non-expert annotations for natural
language tasks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (pp. 254–263).
50. Event annotation
Annotate verb events from the TimeBank corpus with the relations strictly before and strictly after
462 verb event pairs were annotated by 10 workers
ITA 0.94 using simple voting over 10 annotators
No expert ITA available for this task
From Snow, R., O’Connor, B., Jurafsky, D., & Ng, A. Y. (2008). Cheap and fast---but is it good?: evaluating non-expert annotations for natural
language tasks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (pp. 254–263).
51. Bias correction for non-expert annotations
• A small number of workers do a large portion of the task
• Some of the workers produce low-quality annotations, whilst others are biased
• Model the reliability and biases of individual workers and correct for them
• Train the model on a small gold standard
From Snow, R., O’Connor, B., Jurafsky, D., & Ng, A. Y. (2008). Cheap and fast---but is it good?: evaluating non-expert annotations for natural
language tasks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (pp. 254–263).
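Snow et al. fit a full probabilistic model, but the core idea above — estimate each worker's reliability on a small gold standard, then weight their votes accordingly — can be sketched as follows. The workers, labels and smoothing are hypothetical, not their exact model:

```python
import math
from collections import defaultdict

def worker_accuracy(gold, answers):
    """Estimate each worker's accuracy on the items that have gold labels."""
    acc = {}
    for worker, labels in answers.items():
        correct = sum(labels[i] == g for i, g in gold.items() if i in labels)
        seen = sum(1 for i in gold if i in labels)
        acc[worker] = (correct + 1) / (seen + 2)  # smoothed: never exactly 0 or 1
    return acc

def weighted_vote(item_answers, acc):
    """Log-odds weighted vote: reliable workers count more, biased ones count against."""
    scores = defaultdict(float)
    for worker, label in item_answers.items():
        a = acc[worker]
        scores[label] += math.log(a / (1 - a))
    return max(scores, key=scores.get)

# Hypothetical binary entailment labels; items 0-2 have gold answers
gold = {0: "yes", 1: "no", 2: "yes"}
answers = {
    "w1": {0: "yes", 1: "no",  2: "yes", 3: "yes"},  # accurate worker
    "w2": {0: "no",  1: "yes", 2: "no",  3: "no"},   # systematically wrong
    "w3": {0: "yes", 1: "no",  2: "no",  3: "no"},   # middling worker
}
acc = worker_accuracy(gold, answers)
# On unlabelled item 3 the raw majority says "no", but the weighted
# vote discounts the unreliable workers and flips the decision.
item3 = {w: labels[3] for w, labels in answers.items()}
print(weighted_vote(item3, acc))  # → yes
```

Note that a systematically wrong worker gets a negative weight, so their vote becomes evidence for the opposite label — exactly the kind of bias correction the slide describes.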
52. Word sense disambiguation
Obtain 10 annotations for each of the 177 examples of the noun “president” from the SemEval corpus
3 senses available
100% inter-annotator agreement
These results are so high because of the simplicity of the task. For more complicated tasks a small set of expert annotators performs much better than a large number of untrained turkers1
1 Bhardwaj, V., & Passonneau, R. (2010). Anveshan: a framework for analysis of multiple annotators’ labeling behavior. In Proceedings of the
Fourth Linguistic Annotation Workshop (pp. 47–55). Uppsala, Sweden. Retrieved from http://dl.acm.org/citation.cfm?id=1868726
53. Callison-Burch (2009)1
• Presents several experiments which attempt to create resources for MT evaluation
• He shows that by combining the judgements of several non-experts it is possible to produce a resource like those created by experts
• Ranking of sentences works quite well, but producing a gold standard does not, because many workers used MT engines
• A second task was created to identify poor reference translations
1 Callison-Burch, C. (2009). Fast, cheap, and creative: evaluating translation quality using
Amazon’s Mechanical Turk. In EMNLP ’09 Proceedings of the 2009 Conference on Empirical
Methods in Natural Language Processing (Vol. 1, pp. 286–295).
http://doi.org/10.3115/1699510.1699548
54. Gillick and Liu (2010)1
• Try to use non-experts to evaluate automatic summarisation systems
• Workers are given two reference summaries and the topic of the summaries
• They are asked to rank a summary produced by a system on a scale from 1 to 10
• The annotated data was noisy and unlikely to produce a ranking that matches the one of experts
• The reason is that non-experts are not able to separate the evaluation of content from the evaluation of readability
• For the evaluation of automatic summarisation, crowdsourcing could be used for extrinsic evaluation
1Gillick, D. and Liu, Y. (2010). Non-expert evaluation of summarization systems is risky. In Proceedings of the NAACL HLT 2010
Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk, CSLDAMT '10, pages 148 - 151, Stroudsburg, PA,
USA.
55. Costs
Many people employ crowdsourcing because it can reduce the costs
Workers are paid between $0.01 and $1 per task
The approximate costs1 for marking anaphoric relations in 1m tokens:
• Partially validated data: 0.83 markables/$1
• Entirely validated data: 0.33 markables/$1
• Mturk: 20-84 markables/$1 + costs of researchers
• Phrase detectives: 1 markable/$1
If you pay too little you may draw the wrong conclusions (e.g. translation, summarisation, …)
1Chamberlain, J., Fort, K., Kruschwitz, U., Lafourcade, M., & Poesio, M. (2013). Using Games to Create Language Resources: Successes and
Limitations of the Approach. In I. Gurevych & J. Kim (Eds.), The People’s Web Meets NLP (pp. 3–44). Springer Berlin Heidelberg.
56. Criticism of MTurk (and similar services)
MTurk has become “the digital equivalent of an unregulated sweatshop”1,2
Limitations of crowdsourcing approaches:
• Lack of expertise
• Decomposition of complex tasks into simpler tasks introduces bias
• Need to validate the results afterwards (e.g. use PhD students)
• Impossible to control some aspects about workers (e.g. language level)
1 http://vonahn.blogspot.co.uk/2010/07/work-and-internet.html
2 http://www.utne.com/science-and-technology/amazon-mechanical-turk-zm0z13jfzlin.aspx
Further readings: Fort, K., Adda, G., & Cohen, K. B. (2011). Amazon Mechanical Turk: Gold Mine or Coal Mine? Computational Linguistics, 37(2),
413–420. http://doi.org/10.1162/COLI_a_00057
Fort, K., Adda, G., Sagot, B., Mariani, J., & Couillault, A. (2014). Crowdsourcing for language resource development: Criticisms about Amazon
Mechanical Turk overpowering use. Lecture Notes in Computer Science (including Subseries Lecture Notes in Artificial Intelligence and Lecture
Notes in Bioinformatics), 8387 LNAI, 303–314. http://doi.org/10.1007/978-3-319-08958-4_25
57. Conclusions
Crowdsourcing has been hailed as one of the solutions to information overload
Used properly, it can create large resources that otherwise could not be obtained
Don’t forget the ethical implications of using crowdsourcing
58. Processing large datasets
59. More data means better results
Banko & Brill (2001)1 carried out experiments showing that it is possible to improve the performance of ML methods by increasing the size of the training data
They show that for the confusion set disambiguation {to, two, too} the performance increases almost linearly as the size of the dataset increases
For their task it is possible to obtain annotated data for free
1Banko, M., & Brill, E. (2001). Scaling to Very Very Large Corpora for Natural Language
Disambiguation. In Proceedings of the 39th Annual Meeting on Association for
Computational Linguistics (pp. 26 – 33). Toulouse, France. Retrieved from
http://dx.doi.org/10.3115/1073012.1073017
60. Big data in NLP
We usually have huge text collections
It is not possible to load the text collections in memory
We need to obtain statistics, for example:
• Number of times each distinct word appears in the collection
• Search for occurrences of a word or of several words
• Produce language models
61. The MapReduce paradigm
It is one of the most common approaches used to process large collections of documents
It was inspired by functional programming (e.g. Lisp)
It assumes that the task can be decomposed into (key, value) pairs and that these pairs can be:
• processed independently of each other (map)
• the results of processing combined to obtain the final result (reduce)
Normally processing is distributed between computers and involves large datasets
62. MapReduce in Python
Sum of the squares
from functools import reduce  # needed in Python 3; reduce is built in to Python 2

def pow2(a): return a * a

def add(a, b): return a + b

def iterative(my_list):
    s = 0
    for x in my_list:
        s += pow2(x)
    return s

reduce(add, map(pow2, my_list))
63. Word counting using Unix commands
words(collection) | sort | uniq -c
• words prints each word from the collection on a separate line (not a Unix command!)
• sort: sorts all the words alphabetically
• uniq -c: collapses runs of identical adjacent lines, prefixing each output line with the number of occurrences
Here words plays the role of MAP, and sort | uniq -c the role of REDUCE
64. MapReduce
Input: a set of key-value pairs derived from the dataset to be processed
Map(k, v) → <k’, v’>*
• Needs to be written by the programmer
• Takes a key-value pair and outputs a set (including the empty set) of key-value pairs
• There is one call of the Map function for each input pair
Reduce(k’, <v’>*) → <k’, v’’>*
• All values v’ with the same key k’ are reduced together
• There is one call of the Reduce function for each unique key k’
65. Word count using MapReduce
map(key, value):
// key: document name; value: text of the document
for each word w in value:
emit(w, 1)
reduce(key, values):
// key: a word; values: an iterator over counts
result = 0
for each count v in values:
result += v
emit(key, result)
Source: Mining Massive Datasets Course, Lecture 1.2 https://class.coursera.org/mmds-002/
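The pseudocode above can be simulated on a single machine in plain Python. This is only a toy sketch of the paradigm — a real framework distributes the map, shuffle and reduce steps across a cluster:

```python
from itertools import groupby
from operator import itemgetter

def map_fn(doc_name, text):
    # emit(w, 1) for each word, mirroring the pseudocode above
    for word in text.split():
        yield word, 1

def reduce_fn(word, counts):
    yield word, sum(counts)

def map_reduce(inputs, mapper, reducer):
    """Single-machine MapReduce: map, shuffle (sort/group by key), reduce."""
    mapped = [pair for key, value in inputs for pair in mapper(key, value)]
    mapped.sort(key=itemgetter(0))                 # the "shuffle" step
    result = []
    for key, group in groupby(mapped, key=itemgetter(0)):
        result.extend(reducer(key, (v for _, v in group)))
    return result

docs = [("d1", "the quick brown fox"), ("d2", "the lazy dog the end")]
print(map_reduce(docs, map_fn, reduce_fn))
# → [('brown', 1), ('dog', 1), ('end', 1), ('fox', 1), ('lazy', 1), ('quick', 1), ('the', 3)]
```

Sorting by key before grouping is the in-memory stand-in for the distributed shuffle that routes all pairs with the same key to the same reducer.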
66. Word count using MapReduce [diagram]
Source: Mining Massive Datasets Course, Lecture 1.2 https://class.coursera.org/mmds-002/
67. Word count in Apache Spark
public static void wordCountJava8( String filename ) {
// Define a configuration to use to interact with Spark
SparkConf conf = new SparkConf().setMaster("local").setAppName("Word Count App");
// Create a Java version of the Spark Context from the configuration
JavaSparkContext sc = new JavaSparkContext(conf);
// Load the input data, which is a text file read from the command line
JavaRDD<String> input = sc.textFile( filename );
// Java 8 with lambdas: split the input string into words
JavaRDD<String> words = input.flatMap( s -> Arrays.asList( s.split( " " ) ) );
// Java 8 with lambdas: transform the collection of words into pairs (word and 1) and then count them
JavaPairRDD<String, Integer> counts = words.mapToPair( t -> new Tuple2( t, 1 ) )
.reduceByKey( (x, y) -> (int)x + (int)y );
// Save the word count back out to a text file, causing evaluation.
counts.saveAsTextFile( "output" );
}
Tutorial from http://www.javaworld.com/article/2972863/big-data/open-source-java-projects-apache-spark.html
68. Language model in MapReduce
Count the number of times each 5-gram occurs in a large corpus of documents
Map
• Extract (5-gram, count) pairs from documents
Reduce
• Combine the counts
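The Map and Reduce steps above can be sketched in a few lines of Python; this toy in-memory version only shows the counting logic (Counter stands in for the distributed reduce, and the two-sentence corpus is invented for illustration):

```python
from collections import Counter

def five_gram_map(tokens):
    # Map: emit (5-gram, 1) for every window of five consecutive tokens
    for i in range(len(tokens) - 4):
        yield tuple(tokens[i:i + 5]), 1

def count_five_grams(corpus):
    # Reduce: combine the counts of each unique 5-gram
    counts = Counter()
    for doc in corpus:
        for gram, one in five_gram_map(doc.split()):
            counts[gram] += one
    return counts

corpus = ["the cat sat on the mat", "the cat sat on the rug"]
fg = count_five_grams(corpus)
print(fg[("the", "cat", "sat", "on", "the")])  # 2
```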
69. Other requirements
To use MapReduce in real-life scenarios you need much more than this:
• A distributed cluster of computers
• A distributed file system e.g. Google GFS, Hadoop HDFS
• A framework that implements MapReduce e.g. Hadoop, Apache Spark
Setting up is not difficult, but fine-tuning requires quite a bit of knowledge
70. MapReduce in NLP
• Build co-occurrence matrices from very large corpora1
◦ Uses a cluster of 20 computers running Hadoop
◦ the co-occurrence matrix for the Gigaword corpus (7.15 million documents and about 2.97 billion words)
◦ takes about 37 minutes for a window of 2 words, and 1 hour and 23 minutes for a window of 6 words.
• Build language models2
◦ Use MapReduce to build language models from corpora of between 13 million and 2 trillion tokens
◦ The quality of MT engines using these language models improves as the corpus grows
• Increase the processing speed: Watson running on a single processor took 2 hours to answer a single question; a distributed implementation with over 2,500 cores can answer in 3–5 seconds
1 Lin, J. (2008). Scalable language processing algorithms for the masses: a case study in computing word co-occurrence matrices with MapReduce. In Proceedings
of the Conference on Empirical Methods in Natural Language Processing, EMNLP '08, pages 419 – 428, http://www.aclweb.org/anthology/D08-1044
2 Brants, T., Popat, A. C., Xu, P., Och, F. J., & Dean, J. (2007). Large Language Models in Machine Translation. In Proceedings of the 2007 Joint Conference on
Empirical Methods in Natural Language Processing and Computational Natural Language Learning (pp. 858–867). Prague, Czech Republic.
http://www.aclweb.org/anthology/D07-1090.pdf
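The window-based co-occurrence counting of Lin (2008) can be illustrated with a small single-machine Python sketch; the real system distributes ((word, context), count) pairs over a Hadoop cluster, while this in-memory version only shows the counting logic (corpus and window size are toy values):

```python
from collections import defaultdict

def cooccurrence(corpus, window=2):
    # Emit ((word, context_word), 1) for every context word within
    # `window` positions; the dictionary plays the role of Reduce.
    counts = defaultdict(int)
    for doc in corpus:
        toks = doc.split()
        for i, w in enumerate(toks):
            lo, hi = max(0, i - window), min(len(toks), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(w, toks[j])] += 1
    return counts

m = cooccurrence(["the cat sat on the mat"], window=2)
print(m[("the", "sat")])  # 2
```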
71. Further reading
Jimmy Lin and Chris Dyer (2010) Data-Intensive Text Processing with
MapReduce. Morgan & Claypool Publishers. Available at
https://lintool.github.io/MapReduceAlgorithms/
Mining Massive Datasets Course to start on 12th Sept 2015
https://www.coursera.org/course/mmds
72. Deep learning
73. The standard ML approach
Input (annotated) data set → low-level features → machine learning algorithm → evaluation → try to improve (and repeat)
74. But what if you can’t always
define the features?
Can deep learning help find the perfect date?
http://www.kdnuggets.com/2015/07/can-deep-learning-help-find-perfect-girl.html
75. Quick introduction to neural networks (NNs)
Perceptron
Multi-layer network
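As a reminder of the basic building block, a perceptron computes a weighted sum of its inputs followed by a step function. A minimal Python sketch, with weights set by hand to implement logical AND (in practice they would be learned with the perceptron update rule):

```python
# A perceptron: a weighted sum of the inputs passed through a step
# function. Weights are hand-picked here to implement logical AND.
def perceptron(x, w, b):
    return 1 if sum(xi * wi for xi, wi in zip(x, w)) + b > 0 else 0

w, b = [1.0, 1.0], -1.5
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, "->", perceptron(x, w, b))  # only (1, 1) fires
```

A multi-layer network stacks such units, with the step replaced by a differentiable activation so that the weights can be trained by backpropagation.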
76. Degree of complexity
From: http://www.slideshare.net/roelofp/deep-learning-for-information-retrieval
77. What is deep learning?
A new big trend in machine learning
Neural networks composed of many layers
Deep learning algorithms attempt to automatically learn multiple levels of representation of increasing complexity/abstraction
Biologically motivated: the audio/visual cortex has multiple processing stages, i.e. it is hierarchical
78. Why deep learning?
• Neural networks can work as lookup tables to represent functions (i.e. some neurons activate only for a specific range of values)
• For some functions we would need too many units in the hidden layer – not efficient
• More hidden units also require more training data
• Instead we can try to learn a complex function as a composition of simple functions
79. Different levels of abstraction
80. Google trends for 5 search terms: machine learning, deep learning, neural networks, support vector machines, naïve Bayes
81. Why now?
NNs have been around for many years
Breakthrough around 2006
• More data
• Faster processing: GPUs and multi-core CPUs
• Better ideas about how to train deep architectures
82. Representation models
• In the standard representation model a word is represented as a vector with one 1 and the rest 0s, e.g. cat = [0 0 0 0 0 0 1 0 0 0 0 0 0]
• Problem with this vector space model: any two distinct words are orthogonal, e.g.
cat [0 0 0 0 0 0 1 0 0 0 0 0 0] AND dog [0 0 1 0 0 0 0 0 0 0 0 0 0] = 0
• “You shall know a word by the company it keeps” (Firth 1957)
• In distributional-similarity-based representations a word is represented by the words that appear in its context (a co-occurrence vector)
• Examples:
◦ Latent Semantic Analysis (LSA/LSI)
◦ Latent Dirichlet Allocation (LDA)
◦ Word embeddings
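The orthogonality problem and the distributional fix can be seen in a few lines of Python (the vocabulary and the hand-made co-occurrence counts are purely illustrative):

```python
import numpy as np

vocab = ["the", "cat", "dog", "sat", "barked", "purred"]
idx = {w: i for i, w in enumerate(vocab)}

def one_hot(w):
    v = np.zeros(len(vocab))
    v[idx[w]] = 1.0
    return v

# One-hot vectors of distinct words are orthogonal: the model
# sees no similarity at all between "cat" and "dog".
print(one_hot("cat") @ one_hot("dog"))  # 0.0

# Toy co-occurrence vectors (counts of the context words in `vocab`):
# shared contexts such as "the" and "sat" give a non-zero dot product.
cat = np.array([2.0, 0.0, 0.0, 1.0, 0.0, 1.0])
dog = np.array([2.0, 0.0, 0.0, 1.0, 1.0, 0.0])
print(cat @ dog)  # 5.0
```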
83. Properties of continuous
space representations
The vector space representation has some very interesting features:
• It allows a level of generalisation not possible with the classical n-gram model
• In a continuous-space model, similar words are likely to have similar vectors
• When the model parameters are adjusted in response to a particular word or word sequence, the improvements carry over to occurrences of similar words and sequences
Mikolov, T., Yih, W., & Zweig, G. (2013). Linguistic regularities in continuous space word representations. In Proceedings
of NAACL-HLT (pp. 746–751). https://www.aclweb.org/anthology/N/N13/N13-1090.pdf
84. Word embedding
A word embedding is a parameterised function that maps the words of a language into high-dimensional vectors
It learns simultaneously:
• A distributed representation for each word
• A probability function for word sequences
One of the most exciting developments in deep learning for NLP
Proposed quite a while ago: Yoshua Bengio, Réjean Ducharme, Pascal
Vincent, and Christian Janvin. 2003. A neural probabilistic language
model. Journal of Machine Learning Research 3 (March 2003), 1137-1155.
http://dl.acm.org/citation.cfm?id=944966
86. Collobert et al. (2011)1
• Train a NN to obtain word embeddings
• Experiments with small datasets did not lead to good results
• The Wikipedia and Reuters RCV1 corpora are used instead
• The resulting “map” of words makes a lot of sense
1 Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., & Kuksa, P. (2011). Natural language processing (almost) from scratch.
The Journal of Machine Learning Research, 1(12), 2493–2537. Retrieved from http://dl.acm.org/citation.cfm?id=2078186
87. Collobert et al. (2011)1
1 Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., & Kuksa, P. (2011). Natural language processing (almost) from scratch.
The Journal of Machine Learning Research, 1(12), 2493–2537. Retrieved from http://dl.acm.org/citation.cfm?id=2078186
2 Bottou, L. (2011). From Machine Learning to Machine Reasoning. Arxiv Preprint arXiv11021808, 15. Retrieved from
http://arxiv.org/abs/1102.1808
3 See Richard Socher’s tutorial on Deep learning for NLP (without magic) http://lxmls.it.pt/2014/socher-lxmls.pdf for detailed information how to
train this network
R(W(“cat”), W(“sat”), W(“on”), W(“the”), W(“mat”)) = 1
R(W(“cat”), W(“sat”), W(“song”), W(“the”), W(“mat”)) = 0
• The trained network predicts whether a 5-gram is valid3
• Not particularly useful information as such
• However, the word embeddings are very useful
• Training these networks on large datasets can take weeks
• They use these embeddings to train more complicated NNs to perform POS tagging, chunking, NER and SRL
[Figure from Bottou (2011): the network determines whether a 5-gram is valid]
88. Mikolov et al (2013)1
• Use a Recurrent Neural Network Language Model
• The model has no knowledge of syntax, morphology
or semantics
• Used to measure linguistic regularities using the
pattern “a is to b as c is to ___”
◦ Syntactic test: year:years law:laws
◦ Semantic test: clothing:shirt dish:bowl
• The relationships can be expressed in terms
of offsets
1 Mikolov, T., Yih, W., & Zweig, G. (2013). Linguistic regularities in continuous space
word representations. In Proceedings of NAACL-HLT (pp. 746–751). Atlanta, Georgia,
USA. Retrieved from https://www.aclweb.org/anthology/N/N13/N13-1090.pdf
89. Adjectival scales
• Mikolov et al. (2013) show that continuous-space representations capture syntactic and semantic regularities:
◦ apple − apples ≈ car − cars ≈ family − families
◦ king − man + woman ≈ queen
• Kim & de Marneffe (2013)1 derive adjectival scales from such representations
1 Kim, J., & de Marneffe, M.-C. (2013). Deriving adjectival scales from continuous space word representations. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (pp. 1625 – 1630). Retrieved from http://www.aclweb.org/anthology/D13-1169
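The offset arithmetic above can be demonstrated with hand-picked toy vectors; real embeddings learn such regularities from data, whereas these 2-d vectors are constructed so that the man→woman offset exactly equals the king→queen offset:

```python
import numpy as np

# Hand-picked 2-d "embeddings" (illustrative only)
emb = {
    "king":  np.array([0.9, 0.8]),
    "man":   np.array([0.5, 0.2]),
    "woman": np.array([0.5, 0.9]),
    "queen": np.array([0.9, 1.5]),
    "apple": np.array([0.1, 0.1]),
}

def nearest(v, exclude=()):
    # Return the word whose vector has the highest cosine similarity to v
    cos = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max((w for w in emb if w not in exclude),
               key=lambda w: cos(emb[w], v))

target = emb["king"] - emb["man"] + emb["woman"]  # offset arithmetic
print(nearest(target, exclude={"king", "man", "woman"}))  # queen
```

This is the same nearest-neighbour test Mikolov et al. use, just in two dimensions instead of several hundred.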
90. Bilingual word embeddings1
• Two word embeddings are trained in the traditional manner (for English and
Mandarin Chinese)
• An additional constraint is introduced that words from the two languages
that have similar meaning should be close together
• Words that were not known as translations of each other end up close
together
• The word embeddings are used:
◦ in a Chinese word-similarity task, where they lead to results better than the state of the art
◦ in phrase-based machine translation, where they produce a 0.49 increase in the BLEU score
1Zou, W. Y., Socher, R., Cer, D., & Manning, C. D. (2013). Bilingual Word Embeddings for
Phrase-Based Machine Translation. In Proceedings of the 2013 Conference on Empirical
Methods in Natural Language Processing (EMNLP 2013).
91. Tree Structured Long Short
Term Memory (Tree-LSTM)
• Recurrent NNs with Long Short-Term Memory (LSTM) proved very good at representing sentences and useful in capturing long-distance dependencies1
• Recurrent NNs (RNNs) can process sequences of arbitrary length
• Plain RNNs, however, have problems learning long-distance correlations in a sequence
• LSTMs have a memory cell that preserves state over long periods of time
• Tree-LSTMs are very useful for semantic relatedness and for sentiment analysis of movie reviews:
◦ they outperform the state of the art for fine-grained sentiment classification and are comparable for binary classification
◦ they outperform the best-performing systems on the SemEval 2014 semantic relatedness task
1Kai Sheng Tai, Richard Socher, and Christopher D. Manning. 2015. Improved semantic representations from tree-structured long short-term
memory networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint
Conference on Natural Language Processing (Volume 1: Long Papers), pages 1556–1566, Beijing, China.
92. ReVal: MT evaluation metric1
• An evaluation metric based on Tree-Structured Long Short-Term Memory (Tree-LSTM) networks
• Trained on WMT-13 ranking data (the ranks had to be converted to similarity scores) and on 4,500 pairs from the SICK data
• Performs better at system level than some methods that rely on many features
• Average performance for segment-level evaluation
• A good example of how new methods can be developed on top of existing deep learning approaches
• Code available at
https://github.com/rohitguptacs/ReVal
1 Rohit Gupta, Constantin Orasan, and Josef van Genabith. 2015. Reval: A Simple and Effective Machine Translation Evaluation Metric based on
Recurrent Neural Networks. In Proceedings of EMNLP-2015, Lisbon, Portugal.
Rohit Gupta, Constantin Orasan and Josef van Genabith. 2015. Machine Translation Evaluation using Recurrent Neural Networks. In Proceeding
of WMT-2015.
93. Text understanding from
scratch1
• Discusses how it is possible to achieve text understanding starting from characters (i.e. without providing any information about words, paragraphs, etc.)
• Uses convolutional networks (ConvNets) to determine the polarity of texts and the main topic of articles
• The method works for both English and Chinese
• The conclusion of the paper is that ConvNets do not need any knowledge of the syntactic or semantic structure of a language in order to work
1Zhang, X., & LeCun, Y. (2015). Text Understanding from Scratch. Retrieved from http://arxiv.org/pdf/1502.01710v3.pdf
94. Word embedding for Verbal
comprehension questions
• Attempt to answer verbal reasoning questions from IQ tests:
◦ Isotherm is to temperature as isobar is to? (i) atmosphere, (ii) wind, (iii) pressure, (iv) latitude, (v) current.
◦ Which is the odd one out? (i) calm, (ii) quiet, (iii) relaxed, (iv) serene, (v) unruffled.
◦ Which word is most opposite to MUSICAL? (i) discordant, (ii) loud, (iii) lyrical, (iv) verbal, (v) euphonious?
• These questions belong to predefined categories that can be identified easily by computers
• Each category has a different solver
• A novel way of producing word embeddings was necessary
• 200 people were asked to answer the questions via Amazon Mechanical Turk
• The average performance of the human participants is a little lower than that of the proposed method
• “Our model can reach the intelligence level between the people with the bachelor degrees and
those with the master degrees”
• “The results indicate that with appropriate uses of the deep learning technologies we might be a
further step closer to the human intelligence.”
Huazheng Wang, Bin Gao, Jiang Bian, Fei Tian, Tie-Yan Liu (2015) Solving Verbal Comprehension Questions in IQ Test by Knowledge-Powered
Word Embedding. Retrieved from http://arxiv.org/abs/1505.07909
96. Deep learning
• It leads to better results than other methods
• It can be applied to a large number of tasks
• … but how many of these tasks tackle realistic data?
• … will it really lead to proper text understanding?
• … or is it yet another trend?
• A proper understanding of deep learning requires a very good background in maths
• … but there are many packages available that implement the methods
97. Many resources available
Slides and presentations from tutorials:
• Using Neural Networks for Modelling and Representing Natural Languages
http://www.coling-2014.org/COLING%202014%20Tutorial-fix%20-
%20Tomas%20Mikolov.pdf
• Richard Socher’s tutorial on Deep learning for NLP (without magic)
http://lxmls.it.pt/2014/socher-lxmls.pdf
• General Sequence Learning using Recurrent Neural Networks
https://youtu.be/VINCQghQRuM
Book: http://neuralnetworksanddeeplearning.com/
Comprehensive hub of information: http://deeplearning.net/
The topic appears constantly on social media:
• Less than one day ago: “What are the limits of deep learning” on Reddit
https://redd.it/3jo968
98. Are we closer to “text understanding”, or are we only getting better at optimising for some (very specific and sometimes unnatural) tasks?
“Open the pod bay doors, please Hal...”
https://youtu.be/dSIKBliboIo
99. The latest version of the slides available at:
http://www.slideshare.net/dinel/new-trends-in-nlp-applications
You can contact me by email at C.Orasan@wlv.ac.uk