Tutorial given at RANLP 2015 in Hissar, Bulgaria
Recent years have seen many changes in the field of computational linguistics, most of them due to the widespread use of the Internet and the benefits and problems it brings. The first part of this tutorial will discuss these changes and will focus on crowdsourcing and how it has influenced the creation of annotated data.
Annotation of data employed to train and test NLP methods used to be the task of language experts who had a good understanding of the linguistic phenomena to be tackled. Given that a large number of people now have access to the Internet, crowdsourcing has become an alternative way of obtaining annotated data. The core idea of crowdsourcing is that it is possible to design tasks that can be completed by non-experts and that the outputs of these tasks can be combined to obtain high-quality linguistic annotation, which would normally be produced by experts. Examples of how crowdsourcing was employed in computational linguistics will be given.
Big data is another trend in computational linguistics, as researchers rely on ever more data to improve the results of their methods. The second part of the tutorial will introduce the MapReduce programming model and show how it has been used in language processing. Alongside the ability to process larger quantities of data, the field of computational linguistics has successfully applied deep learning to various tasks, improving their accuracy. An introduction to deep learning will be provided, followed by examples of how it has been applied to tasks such as learning semantic representations, sentiment analysis and machine translation evaluation.
New trends in NLP applications
1. New trends in NLP applications
Constantin Orasan
University of Wolverhampton, UK
http://www.wlv.ac.uk/~in6093/
6th September 2015 RANLP 2015, HISSAR, BULGARIA 1/100
2. A better title: Constantin’s subjective view of some of the interesting trends in NLP that can be presented in 3 hours
3. The latest trend in NLP is … natural language understanding
4. Not understanding like in …
“Open the pod bay doors, please Hal...”
Jurafsky, D., & Martin, J. H. (2009) Speech and language processing (2nd ed.). Pearson Prentice Hall. More information
from http://www.cs.colorado.edu/~martin/slp.html
5. NLU for specific applications
• Translate texts between two languages
• Simplify texts
• Find out the opinion/sentiment of texts
• Find out the entities mentioned in texts and the relations between them
• Answer questions from large collections of documents
• Help customers navigate knowledge databases
• Filter spam in social media
• Profile people
• Summarise texts
• ….
Are these new?
8. The technology advanced
The Internet evolved
Web 2.0
Openness
More access
Better hardware
9. NLP is approaching maturity
More interest from companies in developing and deploying working applications
Interest from users in employing NLP technologies in their companies
“NLP for masses”
More datasets available, more tools available
10. Structure of the tutorial
1. Text analytics: example of an established field with impact on industry
2. Crowdsourcing
3. Processing large quantities of data
4. Deep learning
11. Text analytics
12. Text analytics – from users’ perspective
From Grimes, S. (2014). Text / Content Analytics 2014: User Perspectives on Solutions and Providers. Retrieved from www.altaplana.com
Text analytics = software and transformational processes that uncover business value in “unstructured” text. Text analytics applies statistical, linguistic, machine learning, and data analysis and visualization techniques to identify and extract salient information and insights. The goal is to inform decision-making and support business optimization.
Survey of 220 users of text analytics tools
13. Text analytics
Can benefit from crowdsourcing
Requires processing of large quantities of data
Needs better ML algorithms
It is widely and successfully used by companies
Other similarly successful applications are machine translation and virtual personal assistants
14.–17. Charts from Grimes, S. (2014). Text / Content Analytics 2014: User Perspectives on Solutions and Providers. Retrieved from www.altaplana.com
18. Comments on the overall experience
• It is a messy business, but invaluable if there is no other information available.
• It gives us an overview of the data that we could not achieve without it.
• I have been doing text analytics since 1984, and I have yet to find an environment that meets my requirements for knowledge extraction.
• When applied properly and when its limits are understood, it works quite well.
• With access to proper info, I can generate a PhD level analysis in one day.
• We annotate incoming text against our taxonomy and then use the annotations as the basis of text analytics as well as search.
• As with any “adolescent” technology, there is no single end-to-end product that finds, analyzes, and visualizes all available data sources.
• Accuracy needs improvement. Tools need to be customized to specific business cases.
• Still need a human to interpret context, inference, etc.
• It is (relatively) easy to apply algorithms. It is difficult to assess the accuracy of the results or to translate them into strategic insight.
• Text content analytics is in its early infancy, and there is a long road ahead.
From Grimes, S. (2014). Text / Content Analytics 2014 : User Perspectives on Solutions and Providers. Retrieved from www.altaplana.com
19. Technology-related growth drivers1
Open source: lowers the barriers to technology adoption and enables a focus on building higher-level, more specific applications
The API economy: enables easier adoption of technologies
Data availability: there is more data than ever that needs to be analysed and that is available to train our systems
Synthesis: as different technologies become mature they lead to more complex systems and more automation
1 Adapted from Grimes, S. (2014). Text / Content Analytics 2014: User Perspectives on Solutions and Providers. Retrieved from www.altaplana.com, where they are presented from the perspective of text analytics
20. NLP meets the cloud
• Software as a Service (SaaS) is a very popular way of giving
access to software
• The software is run in the cloud and users pay some kind of
subscription to access it
• Great way to develop (commercial) NLP applications that
mashup information from several services
• Can lead to scalable applications
• There are already several established providers of APIs that allow language processing (usually branded as text analytics)
• Difficult to assess how accurate these tools are
• “don’t try to compete with what’s there, but build something new using it.”1
1 Dale, R. (2015). NLP meets the cloud. Natural Language Engineering, 21(04), 653–659. http://doi.org/10.1017/S1351324915000200
21. “text analytics has come of age”1
Is data science the next big thing (or is it already the big thing)?
1Text Analytics: The Next Generation of Big Data, http://insidebigdata.com/2015/06/05/text-analytics-the-next-generation-of-big-data/
23. Crowdsourcing
Crowdsourcing = the act of delegating a task to a large diffuse group, usually without substantial
monetary compensation1
It has developed largely as a result of Web 2.0 and of increasing access to the Internet by the masses
“distributed labor networks are using the Internet to exploit the spare processing power of millions of human brains”1
Wikipedia is considered one of the most successful projects using this approach
Embraced by the research community and industry
It is not outsourcing, but crowdsourcing
1Jeff Howe (June 2006). The Rise of Crowdsourcing. Wired. Available at http://www.wired.com/wired/archive/14.06/crowds.html
25. Crowdsourcing in NLP
Used to
Create gold standards
Collect human judgements
Involve the community in projects (e.g. competitions)
Used increasingly in NLP1
1 Fort, K., Adda, G., & Cohen, K. B. (2011). Amazon Mechanical
Turk: Gold Mine or Coal Mine? Computational Linguistics, 37(2),
413–420. http://doi.org/10.1162/COLI_a_00057
26. Standard annotation flow
Linguistic analysis of the problem tackled → annotation guidelines produced
Annotation process → annotated dataset produced
Inter-annotator agreement calculated → disagreements discussed
Revision of annotation guidelines → the cycle repeats
Language experts are involved in all stages
27. The crowdsourcing approach
Relies much less on experts
Requires decomposing the (annotation) task into simple tasks that do not require linguistic knowledge (e.g. for paraphrasing the expression desert rat, ask participants to fill in the gap rat that … desert(s)1)
These tasks can be combined to obtain high quality annotation
Requires screening of participants, filtering of noise, validation of data
In some cases the tasks are presented as games
1 Nakov, P. (2008). Noun compound interpretation using paraphrasing verbs: Feasibility study. In Proceedings of the 13th international
conference on Artificial Intelligence: Methodology, Systems, and Applications, AIMSA '08, pages 103 - 117, Berlin, Heidelberg. Springer-Verlag.
28. Crowdsourcing used for
Annotation of data:
• label data according to predefined categories
• quality can be assessed using inter-annotator agreement
Creation of new content:
• text created for certain purposes e.g. translation of a sentence, description of an image
• validation of the work is more difficult
• validation can be decomposed as a series of crowdsourced tasks
Obtaining subjective information:
• in some cases there is more than one correct answer and the opinion of the majority is sought, e.g. important features of mobile phones for an IQA system1
1 Konstantinova, N., Orasan, C., & Balage, P. P. (2012). A Corpus-Based Method for Product Feature Ranking for Interactive Question
Answering Systems. International Journal of Computational Linguistics and Applications, 3(1), 57 – 70.
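The bullets above note that annotation quality can be assessed using inter-annotator agreement. As a minimal sketch, Cohen's kappa for two annotators fits in a few lines of Python (the labels below are invented for illustration):

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two annotators, corrected for chance."""
    n = len(labels_a)
    # Observed agreement: fraction of items where the two labels match.
    p_o = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    # Expected chance agreement, from each annotator's label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Two hypothetical annotators labelling the same 10 items
a = ["pos", "pos", "neg", "pos", "neg", "neg", "pos", "pos", "neg", "pos"]
b = ["pos", "neg", "neg", "pos", "neg", "pos", "pos", "pos", "neg", "pos"]
print(round(cohen_kappa(a, b), 2))  # → 0.58
```

Values near 1 indicate strong agreement; values near 0 mean the annotators agree no more often than chance would predict.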
30. Open Mind Common Sense project
One of the first examples of crowdsourcing
A project initiated at the MIT Media Lab with the goal of building and utilizing a large common sense knowledge base
Since 1999 it has collected more than 1 million English facts from over 15,000 contributors
“an attempt to ... harness some of the distributed human computing power of the Internet”1
1 http://commons.media.mit.edu/en/ (not working August 2015) summarised at https://en.wikipedia.org/wiki/Open_Mind_Common_Sense
31. Teaching computers common sense
The slow progress in AI is due to the fact that computers lack common sense1
Common Sense: The mental skills that most people share. Common sense thinking is actually more complex than many of the intellectual accomplishments that attract more attention and respect, because the mental skills we call “expertise” often engage large amounts of knowledge but usually employ only a few types of representations. In contrast, common sense involves many kinds of representations and thus requires a larger range of different skills.
It is estimated that humans have hundreds of millions of pieces of common sense knowledge
1Singh, P. (2002). The Open Mind Common Sense Project. KurzwilAI.net. Retrieved from http://web.media.mit.edu/~push/Kurzweil.html
32. Cyc vs OMCS
Cyc is another attempt to acquire common sense knowledge, backed by the Cycorp company (http://www.cyc.com/)
It employs knowledge engineers to populate the database
People from the Cyc team worked for nearly two decades to build a database of 1.5 million pieces of common knowledge, at a cost of many tens of millions of dollars1
1Information from Singh, P. (2002). The Open Mind Common Sense Project. KurzwilAI.net. Representing the situation at the turn of the century
33. Open Mind Common Sense
Asks volunteers to provide common knowledge by:
• Asking them to fill in templates: A hammer is for ________ or The effect of eating a sandwich is ________
• Giving them a story and asking them to enter knowledge in response:
User is prompted with a story: Bob had a cold. Bob went to the doctor.
User enters many kinds of knowledge in response: Bob was feeling sick. Bob wanted to feel better. The doctor wore a stethoscope around his neck.
• Collecting information longer than one sentence (photo captions, short stories, annotated movies of simple iconic spatial events)
• After information is entered, showing the user an inference the system made, which can be accepted or rejected
The participants provided the information as English sentences, which were processed afterwards
Peer reviewing used to ensure the quality of the input
34. Phrase detectives
A specially designed interface developed at the University of Essex, UK, used to create a resource for anaphora resolution
It is presented as a game with a purpose, where participants collect points
Participation is not paid, but at times rewards are given to the most active participants
One of the main challenges is how to present the task to non-experts
Further reading: Chamberlain, J., Fort, K., Kruschwitz, U., Lafourcade, M., & Poesio, M. (2013). Using Games to Create Language Resources:
Successes and Limitations of the Approach. In I. Gurevych & J. Kim (Eds.), The People’s Web Meets NLP (pp. 3–44). Springer Berlin Heidelberg.
http://doi.org/10.1007/978-3-642-35085-6_1
39. Phrase detectives
The interface operates in two modes:
• Annotation mode: name the culprit
• Validation mode: detectives conference
New participants are trained on a gold standard before they progress to real documents
Each markable is annotated by 8 players to collect multiple judgements (4 more judgements can be added in case of disagreement)
Users are profiled to identify spammers, rate the quality of their work, etc.
The quality of the resource produced is considered excellent: in 84% of all annotations the interpretation specified by the majority vote of non-experts was identical to the one assigned by an expert (agreement between experts: 94%)
Agreement for the property category was 0%, and for non-referential 100%
40. Amazon Mechanical Turk
Amazon Mechanical Turk (MTurk) is a crowdsourcing Internet marketplace that enables individuals and businesses (known as Requesters) to coordinate the use of human intelligence to perform tasks that computers are currently unable to do.1
Requesters who need tasks completed load HITs (Human Intelligence Tasks) on MTurk, indicating various parameters (how much they are willing to pay, conditions for participants, maximum time allowed, etc.)
One of the most used crowdsourcing platforms
1 https://en.wikipedia.org/wiki/Amazon_Mechanical_Turk
To find out more about the original mechanical turk
http://www.bbc.co.uk/news/magazine-21882456
43. Why MTurk (or similar services)?
Little work required to set up the interface (pre-existing templates or fairly simple programming)
Uses existing infrastructure (hardware, payment)
Access to workers (at times tasks are completed extremely fast): in Jan 2011 over 500,000 workers from 190 countries1
But keep in mind that you will have to pay these services in addition to the workers
1 Information from https://en.wikipedia.org/wiki/Amazon_Mechanical_Turk
44. Snow et al. (2008)1
Use crowdsourcing for five tasks: affect recognition, word similarity, recognition of textual entailment, event temporal ordering and word sense disambiguation
The main purpose of the research was to explore the quality of resources created using crowdsourcing
Propose a model to assess the reliability of individual workers and correct their biases
1Snow, R., O’Connor, B., Jurafsky, D., & Ng, A. Y. (2008). Cheap and fast---but is it good?: evaluating non-expert annotations for natural language
tasks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (pp. 254–263). Association for Computational
Linguistics. Retrieved from http://portal.acm.org/citation.cfm?id=1613751
45. Affect recognition
Based on the task proposed in Strapparava and Mihalcea (2007)
Annotators were shown short headlines and gave numeric judgements:
• between 0 and 100 for 6 emotions: anger, disgust, fear, joy, sadness and surprise
• between -100 and 100 to denote the overall positive or negative valence
E.g. Outcry at N Korea ‘nuclear test’
(Anger, 30), (Disgust, 30), (Fear, 30), (Joy, 0), (Sadness, 20), (Surprise, 40), (Valence, -50)
100 headlines were selected and each was annotated by 10 annotators
From Snow, R., O’Connor, B., Jurafsky, D., & Ng, A. Y. (2008). Cheap and fast---but is it good?: evaluating non-expert annotations for natural
language tasks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (pp. 254–263). Association for
Computational Linguistics. Retrieved from http://portal.acm.org/citation.cfm?id=1613751
46. Affect recognition
Pearson correlation was calculated between the labels
Individual experts are better than individual non-experts, but adding non-expert annotations to the gold standard improves the quality of the gold standard
On average it takes 4 non-expert annotations to achieve the equivalent of the ITA of an expert annotator
The numbers are different for each class: 2 for anger, disgust and sadness; 5 for valence; 7 for joy and 9 for surprise; for fear, more than 10
From Snow, R., O’Connor, B., Jurafsky, D., & Ng, A. Y. (2008). Cheap and fast---but is it good?: evaluating non-expert annotations for natural
language tasks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (pp. 254–263). Association for
Computational Linguistics. Retrieved from http://portal.acm.org/citation.cfm?id=1613751
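The averaging effect described on this slide can be sketched in Python: correlate an expert's numeric labels with the mean of several non-expert labels. All scores below are invented for illustration, not Snow et al.'s data:

```python
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical "fear" scores (0-100) for five headlines
expert = [30, 0, 70, 10, 50]
workers = [[60, 10, 50, 0, 55],   # three noisy non-expert annotators
           [20, 0, 90, 40, 20],
           [35, 30, 65, 5, 60]]

# Averaging the non-expert judgements per headline cancels out much of
# the individual noise before correlating with the expert labels.
averaged = [mean(col) for col in zip(*workers)]
print(round(pearson(expert, averaged), 3))
```

With these toy numbers each individual worker correlates less well with the expert than their average does, which is the effect the slide describes.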
47. Affect recognition: system
• A bag-of-words unigram system was trained on crowdsourced data to predict the affect and valence
• Explanation for these unexpected results: “individual labelers (including experts) tend to have a strong bias, and since multiple non-expert labelers may contribute to a single set of non-expert annotations, the annotator diversity within the single set of labels may have the effect of reducing annotator bias and thus increasing system performance.”
From Snow, R., O’Connor, B., Jurafsky, D., & Ng, A. Y. (2008). Cheap and fast---but is it good?: evaluating non-expert annotations for natural
language tasks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (pp. 254–263).
48. Word similarity
Provide numeric judgements on word similarity for 30 word pairs on a scale of [0,10]
E.g. {boy, lad} and {noon, string}
Crowdsourcing was used to collect 10 annotations for the 30 pairs
It took less than 11 minutes to complete all the annotations
Previous studies reported inter-annotator agreement between 0.958 and 0.97
Annotation obtained using crowdsourcing achieves 0.952
From Snow, R., O’Connor, B., Jurafsky, D., & Ng, A. Y. (2008). Cheap and fast---but is it good?: evaluating non-expert annotations for natural
language tasks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (pp. 254–263).
49. Recognising textual entailment
For a pair of sentences, workers were asked to say whether the second sentence can be inferred from the first
Collected 10 annotations for 100 RTE sentence pairs
Expert inter-annotator agreement is between 91% and 96%
Using MTurk, an ITA of 89.7% is observed
From Snow, R., O’Connor, B., Jurafsky, D., & Ng, A. Y. (2008). Cheap and fast---but is it good?: evaluating non-expert annotations for natural
language tasks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (pp. 254–263).
50. Event annotation
Annotate verb events from the TimeBank corpus with the relations strictly before and strictly after
462 verb event pairs were annotated by 10 workers
ITA 0.94 using simple voting over 10 annotators
No expert ITA available for this task
From Snow, R., O’Connor, B., Jurafsky, D., & Ng, A. Y. (2008). Cheap and fast---but is it good?: evaluating non-expert annotations for natural
language tasks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (pp. 254–263).
51. Bias correction for non-expert annotations
• A small number of workers do a large portion of the task
• Some of the workers produce low-quality annotations, whilst others are biased
• Model the reliability and biases of individual workers and correct for them
• Train the model on a small gold standard
From Snow, R., O’Connor, B., Jurafsky, D., & Ng, A. Y. (2008). Cheap and fast---but is it good?: evaluating non-expert annotations for natural
language tasks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (pp. 254–263).
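Snow et al. fit a full probabilistic model, but the core idea above — estimate each worker's reliability on a small gold standard, then weight their votes accordingly — can be sketched as follows. The workers, labels and smoothing are hypothetical, not their exact model:

```python
import math
from collections import defaultdict

def worker_accuracy(gold, answers):
    """Estimate each worker's accuracy on the items that have gold labels."""
    acc = {}
    for worker, labels in answers.items():
        correct = sum(labels[i] == g for i, g in gold.items() if i in labels)
        seen = sum(1 for i in gold if i in labels)
        acc[worker] = (correct + 1) / (seen + 2)  # smoothed: never exactly 0 or 1
    return acc

def weighted_vote(item_answers, acc):
    """Log-odds weighted vote: reliable workers count more, biased ones count against."""
    scores = defaultdict(float)
    for worker, label in item_answers.items():
        a = acc[worker]
        scores[label] += math.log(a / (1 - a))
    return max(scores, key=scores.get)

# Hypothetical binary entailment labels; items 0-2 have gold answers
gold = {0: "yes", 1: "no", 2: "yes"}
answers = {
    "w1": {0: "yes", 1: "no",  2: "yes", 3: "yes"},  # accurate worker
    "w2": {0: "no",  1: "yes", 2: "no",  3: "no"},   # systematically wrong
    "w3": {0: "yes", 1: "no",  2: "no",  3: "no"},   # middling worker
}
acc = worker_accuracy(gold, answers)
# On unlabelled item 3 the raw majority says "no", but the weighted
# vote discounts the unreliable workers and flips the decision.
item3 = {w: labels[3] for w, labels in answers.items()}
print(weighted_vote(item3, acc))  # → yes
```

Note that a systematically wrong worker gets a negative weight, so their vote becomes evidence for the opposite label — exactly the kind of bias correction the slide describes.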
52. Word sense disambiguation
Obtain 10 annotations for each of the 177 examples of the noun “president” from the SemEval corpus
3 senses available
100% inter-annotator agreement
These results are so high because of the simplicity of the task. For more complicated tasks a small set of expert annotators performs much better than a large number of untrained turkers1
1 Bhardwaj, V., & Passonneau, R. (2010). Anveshan: a framework for analysis of multiple annotators’ labeling behavior. In Proceedings of the
Fourth Linguistic Annotation Workshop (pp. 47–55). Uppsala, Sweden. Retrieved from http://dl.acm.org/citation.cfm?id=1868726
53. Callison-Burch (2009)1
• Presents several experiments which attempt to create resources for MT evaluation
• He shows that by combining the judgements of several non-experts it is possible to produce a resource like those created by experts
• Ranking of sentences works quite well, but producing a gold standard does not, because many workers used MT engines
• A second task was created to identify poor reference translations
1 Callison-Burch, C. (2009). Fast, cheap, and creative: evaluating translation quality using
Amazon’s Mechanical Turk. In EMNLP ’09 Proceedings of the 2009 Conference on Empirical
Methods in Natural Language Processing (Vol. 1, pp. 286–295).
http://doi.org/10.3115/1699510.1699548
54. Gillick and Liu (2010)1
• Try to use non-experts to evaluate automatic summarisation systems
• Workers are given two reference summaries and the topic of the summaries
• They are asked to rank a summary produced by a system on a scale from 1 to 10
• The annotated data was noisy and unlikely to produce a ranking that matches the one of experts
• The reason is that non-experts are not able to separate the evaluation of content from the evaluation of readability
• For the evaluation of automatic summarisation, crowdsourcing could be used for extrinsic evaluation
1Gillick, D. and Liu, Y. (2010). Non-expert evaluation of summarization systems is risky. In Proceedings of the NAACL HLT 2010
Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk, CSLDAMT '10, pages 148 - 151, Stroudsburg, PA,
USA.
55. Costs
Many people employ crowdsourcing because it can reduce the costs
Workers are paid between $0.01 and $1 per task
The approximate costs1 for marking anaphoric relations in 1m tokens:
• Partially validated data: 0.83 markables/$1
• Entirely validated data: 0.33 markables/$1
• Mturk: 20-84 markables/$1 + costs of researchers
• Phrase detectives: 1 markable/$1
If you pay too little you may draw the wrong conclusions (e.g. translation, summarisation, …)
1Chamberlain, J., Fort, K., Kruschwitz, U., Lafourcade, M., & Poesio, M. (2013). Using Games to Create Language Resources: Successes and
Limitations of the Approach. In I. Gurevych & J. Kim (Eds.), The People’s Web Meets NLP (pp. 3–44). Springer Berlin Heidelberg.
56. Criticism of MTurk (and similar services)
MTurk has become “the digital equivalent of an unregulated sweatshop”1,2
Limitations of crowdsourcing approaches:
• Lack of expertise
• Decomposition of complex tasks into simpler tasks introduces bias
• Need to validate the results afterwards (e.g. use PhD students)
• Impossible to control some aspects about workers (e.g. language level)
1 http://vonahn.blogspot.co.uk/2010/07/work-and-internet.html
2 http://www.utne.com/science-and-technology/amazon-mechanical-turk-zm0z13jfzlin.aspx
Further readings: Fort, K., Adda, G., & Cohen, K. B. (2011). Amazon Mechanical Turk: Gold Mine or Coal Mine? Computational Linguistics, 37(2),
413–420. http://doi.org/10.1162/COLI_a_00057
Fort, K., Adda, G., Sagot, B., Mariani, J., & Couillault, A. (2014). Crowdsourcing for language resource development: Criticisms about Amazon
Mechanical Turk overpowering use. Lecture Notes in Computer Science (including Subseries Lecture Notes in Artificial Intelligence and Lecture
Notes in Bioinformatics), 8387 LNAI, 303–314. http://doi.org/10.1007/978-3-319-08958-4_25
57. Conclusions
Crowdsourcing has been hailed as one of the solutions to information overload
Used properly, it can create large resources that otherwise could not be obtained
Don’t forget the ethical implications of using crowdsourcing
58. Processing large datasets
59. More data means better results
Banko & Brill (2001)1 carried out experiments showing that it is possible to improve the performance of ML methods by increasing the size of the training data
They show that for the confusion set disambiguation {to, two, too} the performance increases almost linearly as the size of the dataset increases
For their task it is possible to obtain annotated data for free
1Banko, M., & Brill, E. (2001). Scaling to Very Very Large Corpora for Natural Language
Disambiguation. In Proceedings of the 39th Annual Meeting on Association for
Computational Linguistics (pp. 26 – 33). Toulouse, France. Retrieved from
http://dx.doi.org/10.3115/1073012.1073017
60. Big data in NLP
We usually have huge text collections
It is not possible to load the text collections in memory
We need to obtain statistics, for example:
• Number of times each distinct word appears in the collection
• Search for occurrences of a word or of several words
• Produce language models
61. The MapReduce paradigm
It is one of the most common approaches used to process large collections of documents
It was inspired by functional programming (e.g. Lisp)
It assumes that the task can be decomposed into (key, value) pairs and that these pairs can be:
• processed independently of each other (map)
• the results of processing combined to obtain the final result (reduce)
Normally processing is distributed between computers and involves large datasets
62. MapReduce in Python
Sum of the squares
from functools import reduce  # needed in Python 3; reduce is built in to Python 2

def pow2(a): return a * a

def add(a, b): return a + b

def iterative(my_list):
    s = 0
    for x in my_list:
        s += pow2(x)
    return s

reduce(add, map(pow2, my_list))
63. Word counting using Unix commands
words(collection) | sort | uniq -c
• words prints each word from the collection on a separate line (not a Unix command!)
• sort: sorts all the words alphabetically
• uniq -c: collapses runs of identical adjacent lines, prefixing each output line with the number of occurrences
Here words plays the role of MAP, and sort | uniq -c the role of REDUCE
64. MapReduce
Input: a set of key-value pairs derived from the dataset to be processed
Map(k, v) → <k’, v’>*
• Needs to be written by the programmer
• Takes a key-value pair and outputs a set (including the empty set) of key-value pairs
• There is one call of the Map function for each input pair
Reduce(k’, <v’>*) → <k’, v’’>*
• All values v’ with the same key k’ are reduced together
• There is one call of the Reduce function for each unique key k’
65. Word count using MapReduce
map(key, value):
// key: document name; value: text of the document
for each word w in value:
emit(w, 1)
reduce(key, values):
// key: a word; values: an iterator over counts
result = 0
for each count v in values:
result += v
emit(key, result)
Source: Mining Massive Datasets Course, Lecture 1.2 https://class.coursera.org/mmds-002/
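The pseudocode above can be simulated on a single machine in plain Python. This is only a toy sketch of the paradigm — a real framework distributes the map, shuffle and reduce steps across a cluster:

```python
from itertools import groupby
from operator import itemgetter

def map_fn(doc_name, text):
    # emit(w, 1) for each word, mirroring the pseudocode above
    for word in text.split():
        yield word, 1

def reduce_fn(word, counts):
    yield word, sum(counts)

def map_reduce(inputs, mapper, reducer):
    """Single-machine MapReduce: map, shuffle (sort/group by key), reduce."""
    mapped = [pair for key, value in inputs for pair in mapper(key, value)]
    mapped.sort(key=itemgetter(0))                 # the "shuffle" step
    result = []
    for key, group in groupby(mapped, key=itemgetter(0)):
        result.extend(reducer(key, (v for _, v in group)))
    return result

docs = [("d1", "the quick brown fox"), ("d2", "the lazy dog the end")]
print(map_reduce(docs, map_fn, reduce_fn))
# → [('brown', 1), ('dog', 1), ('end', 1), ('fox', 1), ('lazy', 1), ('quick', 1), ('the', 3)]
```

Sorting by key before grouping is the in-memory stand-in for the distributed shuffle that routes all pairs with the same key to the same reducer.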
66. Word count using MapReduce [diagram]
Source: Mining Massive Datasets Course, Lecture 1.2 https://class.coursera.org/mmds-002/
67. Word count in Apache Spark
public static void wordCountJava8( String filename ) {
// Define a configuration to use to interact with Spark
SparkConf conf = new SparkConf().setMaster("local").setAppName("Word Count App");
// Create a Java version of the Spark Context from the configuration
JavaSparkContext sc = new JavaSparkContext(conf);
// Load the input data, which is a text file read from the command line
JavaRDD<String> input = sc.textFile( filename );
// Java 8 with lambdas: split the input string into words
JavaRDD<String> words = input.flatMap( s -> Arrays.asList( s.split( " " ) ) );
// Java 8 with lambdas: transform the collection of words into pairs (word and 1) and then count them
JavaPairRDD<String, Integer> counts = words.mapToPair( t -> new Tuple2( t, 1 ) )
.reduceByKey( (x, y) -> (int)x + (int)y );
// Save the word count back out to a text file, causing evaluation.
counts.saveAsTextFile( "output" );
}
Tutorial from http://www.javaworld.com/article/2972863/big-data/open-source-java-projects-apache-spark.html
68. Language model in MapReduce
Count the number of times each 5-gram occurs in a large corpus of documents
Map
• Extract (5-gram, count) pairs from documents
Reduce
• Combine the counts
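The Map and Reduce steps above can be sketched in a few lines of Python; this toy in-memory version only shows the counting logic (Counter stands in for the distributed reduce, and the two-sentence corpus is invented for illustration):

```python
from collections import Counter

def five_gram_map(tokens):
    # Map: emit (5-gram, 1) for every window of five consecutive tokens
    for i in range(len(tokens) - 4):
        yield tuple(tokens[i:i + 5]), 1

def count_five_grams(corpus):
    # Reduce: combine the counts of each unique 5-gram
    counts = Counter()
    for doc in corpus:
        for gram, one in five_gram_map(doc.split()):
            counts[gram] += one
    return counts

corpus = ["the cat sat on the mat", "the cat sat on the rug"]
fg = count_five_grams(corpus)
print(fg[("the", "cat", "sat", "on", "the")])  # 2
```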
69. Other requirements
To use MapReduce in real-life scenarios you need much more than this:
• A distributed cluster of computers
• A distributed file system e.g. Google GFS, Hadoop HDFS
• A framework that implements MapReduce e.g. Hadoop, Apache Spark
Setting up is not difficult, but fine-tuning requires quite a bit of knowledge
70. MapReduce in NLP
• Build co-occurrence matrices from very large corpora1
◦ Uses a cluster of 20 computers running Hadoop
◦ the co-occurrence matrix for the Gigaword corpus (7.15 million documents and about 2.97 billion words)
◦ takes about 37 minutes for a window of 2 words, and 1 hour and 23 minutes for a window of 6 words.
• Build language models2
◦ Use MapReduce to build language models from corpora of between 13 million and 2 trillion tokens
◦ The quality of MT engines using these language models improves as the corpus grows
• Increase the processing speed: Watson running on a single processor took 2 hours to answer a single question; a distributed implementation with over 2,500 cores can answer in 3–5 seconds
1 Lin, J. (2008). Scalable language processing algorithms for the masses: a case study in computing word co-occurrence matrices with MapReduce. In Proceedings
of the Conference on Empirical Methods in Natural Language Processing, EMNLP '08, pages 419 – 428, http://www.aclweb.org/anthology/D08-1044
2 Brants, T., Popat, A. C., Xu, P., Och, F. J., & Dean, J. (2007). Large Language Models in Machine Translation. In Proceedings of the 2007 Joint Conference on
Empirical Methods in Natural Language Processing and Computational Natural Language Learning (pp. 858–867). Prague, Czech Republic.
http://www.aclweb.org/anthology/D07-1090.pdf
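The window-based co-occurrence counting of Lin (2008) can be illustrated with a small single-machine Python sketch; the real system distributes ((word, context), count) pairs over a Hadoop cluster, while this in-memory version only shows the counting logic (corpus and window size are toy values):

```python
from collections import defaultdict

def cooccurrence(corpus, window=2):
    # Emit ((word, context_word), 1) for every context word within
    # `window` positions; the dictionary plays the role of Reduce.
    counts = defaultdict(int)
    for doc in corpus:
        toks = doc.split()
        for i, w in enumerate(toks):
            lo, hi = max(0, i - window), min(len(toks), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(w, toks[j])] += 1
    return counts

m = cooccurrence(["the cat sat on the mat"], window=2)
print(m[("the", "sat")])  # 2
```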
71. Further reading
Jimmy Lin and Chris Dyer (2010) Data-Intensive Text Processing with
MapReduce. Morgan & Claypool Publishers. Available at
https://lintool.github.io/MapReduceAlgorithms/
Mining Massive Datasets Course to start on 12th Sept 2015
https://www.coursera.org/course/mmds
72. Deep learning
73. The standard ML approach
Input (annotated) data set → low-level features → machine learning algorithm → evaluation → try to improve (and repeat)
74. But what if you can’t always
define the features?
Can deep learning help find the perfect date?
http://www.kdnuggets.com/2015/07/can-deep-learning-help-find-perfect-girl.html
75. Quick introduction to neural networks (NNs)
Perceptron
Multi-layer network
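As a reminder of the basic building block, a perceptron computes a weighted sum of its inputs followed by a step function. A minimal Python sketch, with weights set by hand to implement logical AND (in practice they would be learned with the perceptron update rule):

```python
# A perceptron: a weighted sum of the inputs passed through a step
# function. Weights are hand-picked here to implement logical AND.
def perceptron(x, w, b):
    return 1 if sum(xi * wi for xi, wi in zip(x, w)) + b > 0 else 0

w, b = [1.0, 1.0], -1.5
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, "->", perceptron(x, w, b))  # only (1, 1) fires
```

A multi-layer network stacks such units, with the step replaced by a differentiable activation so that the weights can be trained by backpropagation.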
76. Degree of complexity
From: http://www.slideshare.net/roelofp/deep-learning-for-information-retrieval
77. What is deep learning?
A new big trend in machine learning
Neural networks composed of many layers
Deep learning algorithms attempt to automatically learn multiple levels of representation of increasing complexity/abstraction
Biologically motivated: the audio/visual cortex has multiple processing stages, i.e. it is hierarchical
78. Why deep learning?
• Neural networks can work as lookup tables to represent functions (i.e. some neurons activate only for a specific range of values)
• For some functions we would need too many units in the hidden layer – not efficient
• More hidden units also require more training data
• Instead we can try to learn a complex function as a composition of simple functions
79. Different levels of abstraction
80. Google trends for 5 search terms: machine learning, deep learning, neural networks, support vector machines, naïve Bayes
81. Why now?
NNs have been around for many years
Breakthrough around 2006
• More data
• Faster processing: GPUs and multi-core CPUs
• Better ideas about how to train deep architectures
82. Representation models
• In the standard representation model a word is represented as a vector with one 1 and the rest 0s, e.g. cat = [0 0 0 0 0 0 1 0 0 0 0 0 0]
• Problem with this vector space model: any two distinct words are orthogonal, e.g.
cat [0 0 0 0 0 0 1 0 0 0 0 0 0] AND dog [0 0 1 0 0 0 0 0 0 0 0 0 0] = 0
• “You shall know a word by the company it keeps” (Firth 1957)
• In distributional-similarity-based representations a word is represented by the words that appear in its context (a co-occurrence vector)
• Examples:
◦ Latent Semantic Analysis (LSA/LSI)
◦ Latent Dirichlet Allocation (LDA)
◦ Word embeddings
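The orthogonality problem and the distributional fix can be seen in a few lines of Python (the vocabulary and the hand-made co-occurrence counts are purely illustrative):

```python
import numpy as np

vocab = ["the", "cat", "dog", "sat", "barked", "purred"]
idx = {w: i for i, w in enumerate(vocab)}

def one_hot(w):
    v = np.zeros(len(vocab))
    v[idx[w]] = 1.0
    return v

# One-hot vectors of distinct words are orthogonal: the model
# sees no similarity at all between "cat" and "dog".
print(one_hot("cat") @ one_hot("dog"))  # 0.0

# Toy co-occurrence vectors (counts of the context words in `vocab`):
# shared contexts such as "the" and "sat" give a non-zero dot product.
cat = np.array([2.0, 0.0, 0.0, 1.0, 0.0, 1.0])
dog = np.array([2.0, 0.0, 0.0, 1.0, 1.0, 0.0])
print(cat @ dog)  # 5.0
```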
83. Properties of continuous
space representations
The vector space representation has some very interesting features:
• It allows a level of generalisation not possible with the classical n-gram model
• In a continuous-space model, similar words are likely to have similar vectors
• When the model parameters are adjusted in response to a particular word or word sequence, the improvements carry over to occurrences of similar words and sequences
Mikolov, T., Yih, W., & Zweig, G. (2013). Linguistic regularities in continuous space word representations. In Proceedings
of NAACL-HLT (pp. 746–751). https://www.aclweb.org/anthology/N/N13/N13-1090.pdf
84. Word embedding
A word embedding is a parameterised function that maps the words of a language into high-dimensional vectors
It learns simultaneously:
• A distributed representation for each word
• A probability function for word sequences
One of the most exciting developments in deep learning for NLP
Proposed quite a while ago: Yoshua Bengio, Réjean Ducharme, Pascal
Vincent, and Christian Janvin. 2003. A neural probabilistic language
model. Journal of Machine Learning Research 3 (March 2003), 1137-1155.
http://dl.acm.org/citation.cfm?id=944966
86. Collobert et al. (2011)1
• Train a NN to obtain word embeddings
• Experiments with small datasets did not lead to good results
• The Wikipedia and Reuters RCV1 corpora are used instead
• The resulting “map” of words makes a lot of sense
1 Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., & Kuksa, P. (2011). Natural language processing (almost) from scratch.
The Journal of Machine Learning Research, 1(12), 2493–2537. Retrieved from http://dl.acm.org/citation.cfm?id=2078186
87. Collobert et al. (2011)1
1 Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., & Kuksa, P. (2011). Natural language processing (almost) from scratch.
The Journal of Machine Learning Research, 1(12), 2493–2537. Retrieved from http://dl.acm.org/citation.cfm?id=2078186
2 Bottou, L. (2011). From Machine Learning to Machine Reasoning. Arxiv Preprint arXiv11021808, 15. Retrieved from
http://arxiv.org/abs/1102.1808
3 See Richard Socher’s tutorial on Deep learning for NLP (without magic) http://lxmls.it.pt/2014/socher-lxmls.pdf for detailed information how to
train this network
R(W(“cat”), W(“sat”), W(“on”), W(“the”), W(“mat”)) = 1
R(W(“cat”), W(“sat”), W(“song”), W(“the”), W(“mat”)) = 0
• The trained network predicts whether a 5-gram is valid3
• Not particularly useful information as such
• However, the word embeddings are very useful
• Training these networks on large datasets can take weeks
• They use these embeddings to train more complicated NNs to perform POS tagging, chunking, NER and SRL
[Figure from Bottou (2011): the network determines whether a 5-gram is valid]
88. Mikolov et al (2013)1
• Use a Recurrent Neural Network Language Model
• The model has no knowledge of syntax, morphology
or semantics
• Used to measure linguistic regularities using the
pattern “a is to b as c is to ___”
◦ Syntactic test: year:years law:laws
◦ Semantic test: clothing:shirt dish:bowl
• The relationships can be expressed in terms
of offsets
1 Mikolov, T., Yih, W., & Zweig, G. (2013). Linguistic regularities in continuous space
word representations. In Proceedings of NAACL-HLT (pp. 746–751). Atlanta, Georgia,
USA. Retrieved from https://www.aclweb.org/anthology/N/N13/N13-1090.pdf
89. Adjectival scales
• Mikolov et al. (2013) show that continuous-space representations capture syntactic and semantic regularities:
◦ apple − apples ≈ car − cars ≈ family − families
◦ king − man + woman ≈ queen
• Kim & de Marneffe (2013)1 derive adjectival scales from such representations
1 Kim, J., & de Marneffe, M.-C. (2013). Deriving adjectival scales from continuous space word representations. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (pp. 1625 – 1630). Retrieved from http://www.aclweb.org/anthology/D13-1169
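The offset arithmetic above can be demonstrated with hand-picked toy vectors; real embeddings learn such regularities from data, whereas these 2-d vectors are constructed so that the man→woman offset exactly equals the king→queen offset:

```python
import numpy as np

# Hand-picked 2-d "embeddings" (illustrative only)
emb = {
    "king":  np.array([0.9, 0.8]),
    "man":   np.array([0.5, 0.2]),
    "woman": np.array([0.5, 0.9]),
    "queen": np.array([0.9, 1.5]),
    "apple": np.array([0.1, 0.1]),
}

def nearest(v, exclude=()):
    # Return the word whose vector has the highest cosine similarity to v
    cos = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max((w for w in emb if w not in exclude),
               key=lambda w: cos(emb[w], v))

target = emb["king"] - emb["man"] + emb["woman"]  # offset arithmetic
print(nearest(target, exclude={"king", "man", "woman"}))  # queen
```

This is the same nearest-neighbour test Mikolov et al. use, just in two dimensions instead of several hundred.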
90. Bilingual word embeddings1
• Two word embeddings are trained in the traditional manner (for English and
Mandarin Chinese)
• An additional constraint is introduced that words from the two languages
that have similar meaning should be close together
• Words that were not known as translations of each other end up close
together
• The word embeddings are used:
◦ in a Chinese word-similarity task, where they lead to results better than the state of the art
◦ in phrase-based machine translation, where they produce a 0.49 increase in the BLEU score
1Zou, W. Y., Socher, R., Cer, D., & Manning, C. D. (2013). Bilingual Word Embeddings for
Phrase-Based Machine Translation. In Proceedings of the 2013 Conference on Empirical
Methods in Natural Language Processing (EMNLP 2013).
91. Tree Structured Long Short
Term Memory (Tree-LSTM)
• Recurrent NNs with Long Short-Term Memory (LSTM) proved very good at representing sentences and useful in capturing long-distance dependencies1
• Recurrent NNs (RNNs) can process sequences of arbitrary length
• Plain RNNs, however, have problems learning long-distance correlations in a sequence
• LSTMs have a memory cell that preserves state over long periods of time
• Tree-LSTMs are very useful for semantic relatedness and for sentiment analysis of movie reviews:
◦ they outperform the state of the art for fine-grained sentiment classification and are comparable for binary classification
◦ they outperform the best-performing systems on the SemEval 2014 semantic relatedness task
1Kai Sheng Tai, Richard Socher, and Christopher D. Manning. 2015. Improved semantic representations from tree-structured long short-term
memory networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint
Conference on Natural Language Processing (Volume 1: Long Papers), pages 1556–1566, Beijing, China.
92. ReVal: MT evaluation metric1
• An evaluation metric based on Tree-Structured Long Short-Term Memory (Tree-LSTM) networks
• Trained on WMT-13 ranking data (the ranks had to be converted to similarity scores) and on 4,500 pairs from the SICK data
• Performs better at system level than some methods that rely on many features
• Average performance for segment-level evaluation
• A good example of how new methods can be developed on top of existing deep learning approaches
• Code available at
https://github.com/rohitguptacs/ReVal
1 Rohit Gupta, Constantin Orasan, and Josef van Genabith. 2015. Reval: A Simple and Effective Machine Translation Evaluation Metric based on
Recurrent Neural Networks. In Proceedings of EMNLP-2015, Lisbon, Portugal.
Rohit Gupta, Constantin Orasan and Josef van Genabith. 2015. Machine Translation Evaluation using Recurrent Neural Networks. In Proceeding
of WMT-2015.
93. Text understanding from
scratch1
• Discusses how it is possible to achieve text understanding starting from characters (i.e. without providing any information about words, paragraphs, etc.)
• Uses convolutional networks (ConvNets) to determine the polarity of texts and the main topic of articles
• The method works for both English and Chinese
• The conclusion of the paper is that ConvNets do not need any knowledge of the syntactic or semantic structure of a language in order to work
1Zhang, X., & LeCun, Y. (2015). Text Understanding from Scratch. Retrieved from http://arxiv.org/pdf/1502.01710v3.pdf
94. Word embedding for Verbal
comprehension questions
• Attempt to answer verbal reasoning questions from IQ tests:
◦ Isotherm is to temperature as isobar is to? (i) atmosphere, (ii) wind, (iii) pressure, (iv) latitude, (v) current.
◦ Which is the odd one out? (i) calm, (ii) quiet, (iii) relaxed, (iv) serene, (v) unruffled.
◦ Which word is most opposite to MUSICAL? (i) discordant, (ii) loud, (iii) lyrical, (iv) verbal, (v) euphonious?
• These questions belong to predefined categories that can be identified easily by computers
• Each category has a different solver
• A novel way of producing word embeddings was necessary
• 200 people were asked to answer the questions via Amazon Mechanical Turk
• The average performance of the human participants is a little lower than that of the proposed method
• “Our model can reach the intelligence level between the people with the bachelor degrees and
those with the master degrees”
• “The results indicate that with appropriate uses of the deep learning technologies we might be a
further step closer to the human intelligence.”
Huazheng Wang, Bin Gao, Jiang Bian, Fei Tian, Tie-Yan Liu (2015) Solving Verbal Comprehension Questions in IQ Test by Knowledge-Powered
Word Embedding. Retrieved from http://arxiv.org/abs/1505.07909
96. Deep learning
• It leads to better results than other methods
• It can be applied to a large number of tasks
• … but how many of these tasks tackle realistic data?
• … will it really lead to proper text understanding?
• … or is it yet another trend?
• A proper understanding of deep learning requires a very good background in maths
• … but there are many packages available that implement the methods
97. Many resources available
Slides and presentations from tutorials:
• Using Neural Networks for Modelling and Representing Natural Languages
http://www.coling-2014.org/COLING%202014%20Tutorial-fix%20-
%20Tomas%20Mikolov.pdf
• Richard Socher’s tutorial on Deep learning for NLP (without magic)
http://lxmls.it.pt/2014/socher-lxmls.pdf
• General Sequence Learning using Recurrent Neural Networks
https://youtu.be/VINCQghQRuM
Book: http://neuralnetworksanddeeplearning.com/
Comprehensive hub of information: http://deeplearning.net/
The topic appears constantly on social media:
• Less than one day ago: “What are the limits of deep learning” on Reddit
https://redd.it/3jo968
98. Are we closer to “text understanding”, or are we only getting better at optimising for some (very specific and sometimes unnatural) tasks?
“Open the pod bay doors, please Hal...”
https://youtu.be/dSIKBliboIo
99. The latest version of the slides available at:
http://www.slideshare.net/dinel/new-trends-in-nlp-applications
You can contact me by email at C.Orasan@wlv.ac.uk