Frontiers of Natural Language Processing
Deep Learning Indaba 2018, Stellenbosch, South Africa
Sebastian Ruder, Herman Kamper, Panellists, Leaders in NLP, Everyone
Goals of session
1. What is NLP? What are the major developments in the last few
years?
2. What are the biggest open problems in NLP?
3. Get to know the local community and start thinking about
collaborations
1 / 68
What is NLP? What were the major advances?
A Review of the Recent History of NLP
Sebastian Ruder
Timeline
2001 • Neural language models
2008 • Multi-task learning
2013 • Word embeddings
2013 • Neural networks for NLP
2014 • Sequence-to-sequence models
2015 • Attention
2015 • Memory-based networks
2018 • Pretrained language models
3 / 68
Neural language models
• Language modeling: predict next word given previous words
• Classic language models: n-grams with smoothing
• First neural language models: feed-forward neural networks that take
into account n previous words
• The initial look-up layer is commonly known as the word embedding matrix,
as each word corresponds to one vector
[Bengio et al., NIPS ’01; Bengio et al., JMLR ’03] 5 / 68
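To make the idea concrete, here is a minimal sketch (not the original model) of a Bengio-style feed-forward language model in PyTorch; the vocabulary size, dimensions, and context size n are illustrative assumptions.

import torch
import torch.nn as nn

class FeedForwardLM(nn.Module):
    # Predict the next word from the n previous words; the first layer is the
    # word embedding look-up matrix mentioned above.
    def __init__(self, vocab_size, emb_dim=64, hidden_dim=128, n=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.hidden = nn.Linear(n * emb_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, context):                      # context: (batch, n) word ids
        e = self.embed(context)                      # (batch, n, emb_dim)
        h = torch.tanh(self.hidden(e.flatten(1)))    # concatenate the n embeddings
        return self.out(h)                           # logits over the next word

# Toy usage: two contexts of n=4 word ids drawn from a 1000-word vocabulary.
lm = FeedForwardLM(vocab_size=1000)
next_word_logits = lm(torch.randint(0, 1000, (2, 4)))  # shape (2, 1000)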
Neural language models
• Later language models: RNNs and LSTMs [Mikolov et al., Interspeech ’10]
• Many new models in recent years; classic LSTM is still a strong
baseline [Melis et al., ICLR ’18]
• Active research area: What information do language models capture?
• Language modelling: despite its simplicity, core to many later
advances
• Word embeddings: the objective of word2vec is a simplification of
language modelling
• Sequence-to-sequence models: predict response word-by-word
• Pretrained language models: representations useful for transfer learning
6 / 68
Multi-task learning
• Multi-task learning: sharing parameters between models trained on
multiple tasks
[Collobert & Weston, ICML ’08; Collobert et al., JMLR ’11]
8 / 68
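As a rough illustration of hard parameter sharing (a sketch, not the Collobert & Weston architecture), one encoder can be shared between a token-level and a sentence-level task, with only the output heads being task-specific; the tasks and sizes below are assumptions.

import torch
import torch.nn as nn

class SharedEncoderMTL(nn.Module):
    # One shared encoder, one output head per task.
    def __init__(self, vocab_size, emb_dim=64, hidden_dim=128, n_tags=10, n_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hidden_dim, batch_first=True)  # shared parameters
        self.tagger = nn.Linear(hidden_dim, n_tags)        # e.g. a POS-tagging head (per token)
        self.classifier = nn.Linear(hidden_dim, n_classes) # e.g. a sentence-classification head

    def forward(self, words):                              # words: (batch, seq_len)
        states, (h_n, _) = self.encoder(self.embed(words))
        return self.tagger(states), self.classifier(h_n[-1])

# Training alternates between (or sums) the two task losses, so gradients from
# both tasks update the shared encoder.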
Multi-task learning
• [Collobert & Weston, ICML ’08] won Test-of-time Award at ICML 2018
• Paper contained a lot of other influential ideas:
• Word embeddings
• CNNs for text
9 / 68
Multi-task learning
• Multi-task learning goes back a lot further
[Caruana, ICML ’93; Caruana, ICML ’96]
10 / 68
Multi-task learning
• “Joint learning” / “multi-task learning” used interchangeably
• Now used for many tasks in NLP, either using existing tasks or
“artificial” auxiliary tasks
• MT + dependency parsing / POS tagging / NER
• Joint multilingual training
• Video captioning + entailment + next-frame prediction [Pasunuru &
Bansal; ACL ’17]
• . . .
11 / 68
Multi-task learning
• Sharing of parameters is typically predefined
• Can also be learned [Ruder et al., ’17]
[Yang et al., ICLR ’17]
12 / 68
Word embeddings
• Main innovation: pretraining word embedding look-up matrix on a
large unlabelled corpus
• Popularized by word2vec, an efficient approximation to language
modelling
• word2vec comes in two variants: skip-gram and CBOW
[Mikolov et al., ICLR ’13; Mikolov et al., NIPS ’13]
14 / 68
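A minimal sketch of the skip-gram objective with negative sampling (one common word2vec variant); the sampling of negatives and the training loop are omitted, and all sizes are placeholders.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SkipGram(nn.Module):
    # Score (centre word, context word) pairs with two embedding tables.
    def __init__(self, vocab_size, dim=100):
        super().__init__()
        self.in_embed = nn.Embedding(vocab_size, dim)   # the word embeddings that are kept
        self.out_embed = nn.Embedding(vocab_size, dim)  # context ("output") vectors

    def forward(self, centre, context, negatives):       # negatives: (batch, k) sampled word ids
        v = self.in_embed(centre)                        # (batch, dim)
        u_pos = self.out_embed(context)                  # (batch, dim)
        u_neg = self.out_embed(negatives)                # (batch, k, dim)
        pos = F.logsigmoid((v * u_pos).sum(-1))                        # pull true contexts closer
        neg = F.logsigmoid(-(u_neg * v.unsqueeze(1)).sum(-1)).sum(-1)  # push sampled words away
        return -(pos + neg).mean()                       # loss to minimise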
Word embeddings
• Word embeddings pretrained on an unlabelled corpus capture certain
relations between words
[Tensorflow tutorial]
15 / 68
Word embeddings
• Pretrained word embeddings have been shown to improve
performance on many downstream tasks [Kim, EMNLP ’14]
• Later methods show that word embeddings can also be learned via
matrix factorization [Pennington et al., EMNLP ’14; Levy et al., NIPS ’14]
• Nothing inherently special about word2vec; classic methods (PMI,
SVD) can also be used to learn good word embeddings from
unlabeled corpora [Levy et al., TACL ’15]
16 / 68
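The Levy et al. observation can be made concrete in a few lines of NumPy: build a word-context co-occurrence matrix, take positive PMI, and factorise it with a truncated SVD. Window size and corpus handling are left out, and the interface is an assumption.

import numpy as np

def ppmi_svd_embeddings(cooc, dim=50):
    # cooc: (V, V) word-context co-occurrence counts.
    p_wc = cooc / cooc.sum()
    p_w = p_wc.sum(axis=1, keepdims=True)
    p_c = p_wc.sum(axis=0, keepdims=True)
    with np.errstate(divide="ignore"):
        pmi = np.log(p_wc / (p_w * p_c))
    ppmi = np.maximum(pmi, 0.0)              # positive PMI: clip negative associations
    ppmi[~np.isfinite(ppmi)] = 0.0           # zero counts contribute nothing
    U, S, _ = np.linalg.svd(ppmi)            # truncated SVD gives dense word vectors
    return U[:, :dim] * np.sqrt(S[:dim])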
Word embeddings
• Lots of work on word embeddings, but word2vec is still widely used
• Skip-gram has been applied to learn representations in many other
settings, e.g. sentences [Le & Mikolov, ICML ’14; Kiros et al., NIPS ’15],
networks [Grover & Leskovec, KDD ’16], biological sequences [Asgari & Mofrad,
PLoS One ’15], etc.
17 / 68
Word embeddings
• Projecting word embeddings of different languages into the same
space enables (zero-shot) cross-lingual transfer [Ruder et al., JAIR ’18]
[Luong et al., ’15]
18 / 68
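Given a small seed dictionary, one standard way to project two pretrained embedding spaces into a shared space is the orthogonal Procrustes solution sketched below; this is a generic recipe under those assumptions, not the exact procedure of any single cited paper.

import numpy as np

def procrustes_mapping(X, Y):
    # X, Y: (n_pairs, dim) source and target embeddings of translation pairs
    # from a seed dictionary. Returns the orthogonal W minimising ||X W - Y||_F.
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

# Every source-language vector x can then be mapped into the target space as x @ W,
# enabling nearest-neighbour word translation and zero-shot cross-lingual transfer.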
Neural networks for NLP
• Key challenge for neural networks: dealing with dynamic input
sequences
• Three main model types
• Recurrent neural networks
• Convolutional neural networks
• Recursive neural networks
20 / 68
Recurrent neural networks
• Vanilla RNNs [Elman, CogSci ’90] are typically not used as gradients
vanish or explode with longer inputs
• Long short-term memory networks [Hochreiter & Schmidhuber, NeuComp ’97]
are the model of choice
[Olah, ’15]
21 / 68
Convolutional neural networks
• 1D adaptation of convolutional neural networks for images
• Filter is moved along temporal dimension
[Kim, EMNLP ’14]
22 / 68
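A minimal sketch of a Kim-style text CNN: filters of a few widths slide along the temporal dimension of the embedded sentence, followed by max-over-time pooling. The filter widths, counts, and classifier head are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TextCNN(nn.Module):
    def __init__(self, vocab_size, emb_dim=100, n_filters=64, widths=(3, 4, 5), n_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.convs = nn.ModuleList(nn.Conv1d(emb_dim, n_filters, w) for w in widths)
        self.out = nn.Linear(n_filters * len(widths), n_classes)

    def forward(self, words):                             # (batch, seq_len)
        x = self.embed(words).transpose(1, 2)             # (batch, emb_dim, seq_len)
        pooled = [F.relu(conv(x)).max(dim=2).values for conv in self.convs]  # max over time
        return self.out(torch.cat(pooled, dim=1))         # class logits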
Convolutional neural networks
• More parallelizable than RNNs, focus on local features
• Can be extended with wider receptive fields (dilated convolutions) to
capture wider context [Kalchbrenner et al., ’17]
• CNNs and LSTMs can be combined and stacked [Wang et al., ACL ’16]
• Convolutions can be used to speed up an LSTM [Bradbury et al., ICLR ’17]
23 / 68
Recursive neural networks
• Natural language is inherently hierarchical
• Treat input as tree rather than as a sequence
• Can also be extended to LSTMs [Tai et al., ACL ’15]
[Socher et al., EMNLP ’13]
24 / 68
Other tree-based neural networks
• Word embeddings based on dependencies [Levy and Goldberg, ACL ’14]
• Language models that generate words based on a syntactic stack [Dyer
et al., NAACL ’16]
• CNNs over a graph (trees), e.g. graph-convolutional neural networks
[Bastings et al., EMNLP ’17]
25 / 68
Sequence-to-sequence models
• General framework for applying neural networks to tasks where output
is a sequence
• Killer application: Neural Machine Translation
• Encoder processes input word by word; decoder then predicts output
word by word
[Sutskever et al., NIPS ’14]
27 / 68
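A minimal encoder-decoder sketch without attention (the plain Sutskever-style setup), using teacher forcing during training; GRUs, the sizes, and the absence of beam search are simplifying assumptions.

import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    # Encoder reads the source word by word; decoder predicts the target word by word.
    def __init__(self, src_vocab, tgt_vocab, emb_dim=64, hidden_dim=128):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, emb_dim)
        self.tgt_embed = nn.Embedding(tgt_vocab, emb_dim)
        self.encoder = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.decoder = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, tgt_vocab)

    def forward(self, src, tgt_in):                       # tgt_in: gold target shifted right
        _, h = self.encoder(self.src_embed(src))          # h: the source compressed into one state
        dec_states, _ = self.decoder(self.tgt_embed(tgt_in), h)
        return self.out(dec_states)                       # (batch, tgt_len, tgt_vocab) logits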
Sequence-to-sequence models
• Go-to framework for natural language generation tasks
• Output can not only be conditioned on a sequence, but on arbitrary
representations, e.g. an image for image captioning
[Vinyals et al., CVPR ’15]
28 / 68
Sequence-to-sequence models
• Even applicable to structured prediction tasks, e.g. constituency
parsing [Vinyals et al., NIPS ’15], named entity recognition [Gillick et al.,
NAACL ’16], etc. by linearizing the output
[Vinyals et al., NIPS ’15]
29 / 68
Sequence-to-sequence models
• Typically RNN-based, but other encoders and decoders can be used
• New architectures mainly coming out of work in Machine Translation
• Recent models: Deep LSTM [Wu et al., ’16], Convolutional encoders
[Kalchbrenner et al., arXiv ’16; Gehring et al., arXiv ’17], Transformer [Vaswani et al.,
NIPS ’17], Combination of LSTM and Transformer [Chen et al., ACL ’18]
30 / 68
Attention
• One of the core innovations in Neural Machine Translation
• Weighted average of source sentence hidden states
• Mitigates bottleneck of compressing source sentence into a single
vector
[Bahdanau et al., ICLR ’15]
32 / 68
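The core computation is small: score each encoder state against the current decoder state, normalise with a softmax, and take the weighted average. The sketch below uses dot-product scores (Luong-style); Bahdanau's original formulation uses an additive scoring function instead.

import torch

def dot_product_attention(query, enc_states):
    # query: (batch, hidden) current decoder state
    # enc_states: (batch, src_len, hidden) encoder hidden states
    scores = torch.bmm(enc_states, query.unsqueeze(2)).squeeze(2)      # (batch, src_len)
    weights = torch.softmax(scores, dim=1)                             # attention distribution
    context = torch.bmm(weights.unsqueeze(1), enc_states).squeeze(1)   # weighted average
    return context, weights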
Attention
• Different forms of attention available [Luong et al., EMNLP ’15]
• Widely applicable: constituency parsing [Vinyals et al., NIPS ’15], reading
comprehension [Hermann et al., NIPS ’15], one-shot learning [Vinyals et al.,
NIPS ’16], image captioning [Xu et al., ICML ’15]
[Xu et al., ICML ’15]
33 / 68
Attention
• Not only restricted to looking at another sequence
• Can be used to obtain more contextually sensitive word
representations by attending to the same sequence → self-attention
• Used in Transformer [Vaswani et al., NIPS ’17], state-of-the-art architecture
for machine translation
34 / 68
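Stripped of the learned projections and multiple heads that the Transformer adds, self-attention is just the same weighted average computed over the sequence's own states; the snippet below is that simplified core, not a full Transformer layer.

import torch

def simple_self_attention(X):                    # X: (batch, seq_len, dim)
    d = X.size(-1)
    scores = torch.matmul(X, X.transpose(1, 2)) / d ** 0.5   # every position attends to every position
    weights = torch.softmax(scores, dim=-1)
    return torch.matmul(weights, X)              # contextualised representations, same shape as X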
Memory-based neural networks
• Attention can be seen as fuzzy memory
• Models with more explicit memory have been proposed
• Different variants: Neural Turing Machine [Graves et al., arXiv ’14],
Memory Networks [Weston et al., ICLR ’15] and End-to-end Memory
Networks [Sukhbaatar et al., NIPS ’15], Dynamic Memory Networks [Kumar et
al., ICML ’16], Differentiable Neural Computer [Graves et al., Nature ’16],
Recurrent Entity Network [Henaff et al., ICLR ’17]
36 / 68
Memory-based neural networks
• Memory is typically accessed based on similarity to the current state,
similar to attention; it can be written to and read from
• End-to-end Memory Networks [Sukhbaatar et al., NIPS ’15] process input
multiple times and update memory
• Neural Turing Machines also have location-based addressing; they can
learn simple computer programs like sorting
• Memory can be a knowledge base or populated based on input
37 / 68
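A sketch of one memory "hop" in the spirit of end-to-end memory networks: address the memory by similarity to the current state, read a weighted value, and update the state. The key/value split and the additive state update are simplifying assumptions.

import torch

def memory_hop(query, memory_keys, memory_values):
    # query: (dim,) current state; memory_keys / memory_values: (slots, dim)
    scores = torch.matmul(memory_keys, query)     # similarity of each memory slot to the state
    p = torch.softmax(scores, dim=0)              # soft addressing over memory slots
    read = torch.matmul(p, memory_values)         # content read from memory
    return query + read                           # updated state, fed into the next hop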
Pretrained language models
• Word embeddings are context-agnostic, only used to initialize first
layer
• Use better representations for initialization or as features
• Language models pretrained on a large corpus capture a lot of
additional information
• Language model embeddings can be used as features in a target
model [Peters et al., NAACL ’18] or a language model can be fine-tuned on
target task data [Howard & Ruder, ACL ’18]
39 / 68
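A sketch of the "features" route: run a frozen pretrained language model over the input and concatenate its hidden states with ordinary word embeddings before the task-specific encoder. The pretrained_lm module and all sizes here are placeholders, not the actual ELMo or ULMFiT code.

import torch
import torch.nn as nn

class TaggerWithLMFeatures(nn.Module):
    def __init__(self, pretrained_lm, lm_dim, vocab_size, emb_dim=100, hidden_dim=128, n_tags=10):
        super().__init__()
        self.lm = pretrained_lm                   # assumed: maps word ids to (batch, len, lm_dim)
        for p in self.lm.parameters():            # keep the pretrained weights fixed
            p.requires_grad = False
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.LSTM(emb_dim + lm_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, n_tags)

    def forward(self, words):
        with torch.no_grad():
            lm_states = self.lm(words)            # contextual representations used as extra features
        x = torch.cat([self.embed(words), lm_states], dim=-1)
        states, _ = self.encoder(x)
        return self.out(states)                   # per-token tag logits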
Pretrained language models
• Adding language model embeddings gives a large improvement over
state-of-the-art across many different tasks
[Peters et al., ’18]
40 / 68
Pretrained language models
• Enables learning models with significantly less data
• Additional benefit: Language models only require unlabelled data
• Enables application to low-resource languages where labelled data is
scarce
41 / 68
Other milestones
• Character-based representations
• Use a CNN/LSTM over characters to obtain a character-based word
representation (a sketch follows after this list)
• First used for sequence labelling tasks [Lample et al., NAACL ’16; Plank et
al., ACL ’16]; now widely used
• Even fully character-based NMT [Lee et al., TACL ’17]
• Adversarial learning
• Adversarial examples are becoming widely used [Jia & Liang, EMNLP ’17]
• (Virtual) adversarial training [Miyato et al., ICLR ’17; Yasunaga et al., NAACL
’18] and domain-adversarial loss [Ganin et al., JMLR ’16; Kim et al., ACL ’17]
are useful forms of regularization
• GANs are used, but not yet too effective for NLG [Semeniuta et al., ’18]
• Reinforcement learning
• Useful for tasks with a temporal dependency, e.g. selecting data [Fang &
Cohn, EMNLP ’17; Wu et al., NAACL ’18] and dialogue [Liu et al., NAACL ’18]
• Also effective for directly optimizing non-differentiable metrics (ROUGE, BLEU)
for summarization [Paulus et al., ICLR ’18] or MT [Ranzato et al., ICLR ’16]
42 / 68
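The character-based word representation mentioned above can be sketched as a small CNN over character embeddings with max pooling; the character vocabulary size, filter width, and dimensions are placeholders.

import torch
import torch.nn as nn
import torch.nn.functional as F

class CharCNNWordEncoder(nn.Module):
    # Build a word vector from its characters; typically concatenated with the
    # word's ordinary embedding before the sentence-level model.
    def __init__(self, n_chars, char_dim=25, n_filters=50, width=3):
        super().__init__()
        self.char_embed = nn.Embedding(n_chars, char_dim, padding_idx=0)
        self.conv = nn.Conv1d(char_dim, n_filters, width, padding=1)

    def forward(self, char_ids):                          # (batch, word_len) character ids
        x = self.char_embed(char_ids).transpose(1, 2)     # (batch, char_dim, word_len)
        return F.relu(self.conv(x)).max(dim=2).values     # (batch, n_filters) word vector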
The Biggest Open Problems in NLP
Sebastian Ruder, Jade Abbott, Stephan Gouws, Omoju Miller, Bernardt Duvenhage
The biggest open problems: Answers from experts
Hal Daumé III, Barbara Plank, Miguel Ballesteros, Anders Søgaard,
Manaal Faruqui, Mikel Artetxe, Sebastian Riedel, Isabelle Augenstein,
Bernardt Duvenhage, Lea Frermann, Brink van der Merwe, Karen Livescu,
Jan Buys, Kevin Gimpel, Christine de Kock, Alta de Waal, Michael Roth,
Maletšabisa Molapo, Annie Louis, Chris Dyer, Yoshua Bengio, Felix Hill,
Kevin Knight, Richard Socher, George Dahl, Dirk Hovy, Kyunghyun Cho
44 / 68
We asked the experts:
What are the three biggest open problems
in NLP at the moment?
The biggest open problems in NLP
1. Natural language understanding
2. NLP for low-resource scenarios
3. Reasoning about large or multiple documents
4. Datasets, problems and evaluation
46 / 68
Problem 1: Natural language understanding
• Many experts argued that this is central, also for generation
• Almost none of our current models have “real” understanding
• What (biases, structure) should we build explicitly into our models?
• Models should incorporate common sense
• Dialogue systems (and chat bots) were mentioned in several responses
47 / 68
Problem 1: Natural language understanding
Article: Nikola Tesla
Paragraph: In January 1880, two of Tesla’s uncles put together enough
money to help him leave Gospić for Prague where he was to study.
Unfortunately, he arrived too late to enroll at Charles-Ferdinand University;
he never studied Greek, a required subject; and he was illiterate in Czech,
another required subject. Tesla did, however, attend lectures at the
university, although, as an auditor, he did not receive grades for the
courses.
Question: What city did Tesla move to in 1880?
Answer: Prague
Model predicts: Prague
48 / 68
Problem 1: Natural language understanding
Article: Nikola Tesla
Paragraph: In January 1880, two of Tesla’s uncles put together enough
money to help him leave Gospić for Prague where he was to study.
Unfortunately, he arrived too late to enroll at Charles-Ferdinand University;
he never studied Greek, a required subject; and he was illiterate in Czech,
another required subject. Tesla did, however, attend lectures at the
university, although, as an auditor, he did not receive grades for the
courses. Tadakatsu moved to the city of Chicago in 1881.
Question: What city did Tesla move to in 1880?
Answer: Prague
Model predicts: Chicago
48 / 68
Problem 1: Natural language understanding
[Jia and Liang, EMNLP’17]
49 / 68
Problem 1: Natural language understanding
I think the biggest open problems are all related to natural language
understanding. . . we should develop systems that read and
understand text the way a person does, by forming a representation of
the world of the text, with the agents, objects, settings, and the
relationships, goals, desires, and beliefs of the agents, and everything else
that humans create to understand a piece of text.
Until we can do that, all of our progress is in improving
our systems’ ability to do pattern matching. Pattern
matching can be very effective for developing products
and improving people’s lives, so I don’t want to
denigrate it, but . . .
— Kevin Gimpel
50 / 68
Problem 1: Natural language understanding
Questions to panellists/audience:
• To achieve NLU, is it important to build models that process
language “the way a person does”?
• How do you think we would go about doing this?
• Do we need inductive biases or can we expect models to learn
everything from enough data?
• Questions from audience
51 / 68
Problem 2: NLP for low-resource scenarios
• Generalisation beyond the training data – relevant everywhere!
• Domain-transfer, transfer learning, multi-task learning
• Learning from small amounts of data
• Semi-supervised, weakly-supervised, “Wiki-ly” supervised,
distantly-supervised, lightly-supervised, minimally-supervised
• Unsupervised learning
52 / 68
Problem 2: NLP for low-resource scenarios
Word translation without parallel data:
[Conneau et al., ICLR’18]
53 / 68
Problem 2: NLP for low-resource scenarios
[Chung et al., arXiv’18]
54 / 68
Problem 2: NLP for low-resource scenarios
Questions to panellists/audience:
• Is it necessary to develop specialised NLP tools for specific languages,
or is it enough to work on general NLP?
• Since there are inherently only small amounts of text available for
under-resourced languages, the benefits of NLP in such settings will
also be limited. Agree or disagree?
• Unsupervised learning vs. transfer learning from high-resource
languages?
• Questions from audience
55 / 68
Problem 3: Reasoning about large or multiple
documents
• Related to understanding
• How do we deal with large contexts?
• Can be either text or spoken documents
• Again incorporating common sense is essential
56 / 68
Problem 3: Reasoning about large or multiple
documents
Example from NarrativeQA dataset:
[Kočiský et al., TACL’18]
57 / 68
Problem 3: Reasoning about large or multiple
documents
Questions to panellists/audience:
• Do we need better models or just train on more data?
58 / 68
Problem 3: Reasoning about large or multiple
documents
Questions to panellists/audience:
• Do we need better models or just train on more data?
• Questions from audience
58 / 68
Problem 4: Datasets, problems and evaluation
Perhaps the biggest problem is to properly define the
problems themselves. And by properly defining a
problem, I mean building datasets and evaluation
procedures that are appropriate to measure our
progress towards concrete goals. Things would be
easier if we could reduce everything to Kaggle style
competitions! — Mikel Artetxe
. . . basic resources (e.g. stop word lists) — Alta de Waal
59 / 68
Problem 4: Datasets, problems and evaluation
https://rma.nwu.ac.za
60 / 68
Problem 4: Datasets, problems and evaluation
Questions to panellists/audience:
• What are the most important NLP problems that should be tackled
for societies in Africa?
• How do we make sure that we don’t overfit to our benchmarks?
• Questions from audience
61 / 68
We asked the experts a few more questions:
What, if anything, has led the field in the
wrong direction?
What has led the field in the wrong direction?
• “Synthetic data/synthetic problems” — Hal Daumé III
• “Benchmark/leaderboard chasing” — Sebastian Riedel
• “Obsession of . . . beating the state of the art through ‘neural
architecture search’” — Isabelle Augenstein
• “Chomskyan theories of linguistics instead of corpus linguistics”
— Brink van der Merwe
• “Not incorporating enough Chomskyan theory into our models”
— Someone Else
• “Too much emphasis on Bayesian methods (sorry :)”— Karen Livescu
• “Haha, as if the field as a whole moved in a single direction”
— Michael Roth
63 / 68
What has led the field in the wrong direction?
I don’t think there is anything like that. We can learn
from “wrong” directions and “correct” directions, if
such a thing even exists.
— Miguel Ballesteros
Anything new will temporarily lead the field in the
wrong direction, I guess, but upon returning, we may
nevertheless have pushed research horizons.
— Anders Søgaard
Sentiment shared in many of the other responses
64 / 68
We asked the experts a few more questions:
What advice would you give a
postgraduate student in NLP
starting their project now?
What advice would you give a postgraduate
student in NLP starting their project now?
Do not limit yourself to reading NLP papers. Read a lot
of machine learning, deep learning, reinforcement learning
papers. A PhD is a great time in one’s life to go for a
big goal, and even small steps towards that will be valued.
— Yoshua Bengio
Learn how to tune your models, learn how to make
strong baselines, and learn how to build baselines that
test particular hypotheses. Don’t take any single paper
too seriously, wait for its conclusions to show up more
than once. — George Dahl
66 / 68
What advice would you give a postgraduate
student in NLP starting their project now?
i believe scientific pursuit is meant to be full of failures.
. . . if every idea works out, it’s either (a) you’re not
ambitious enough, (b) you’re subconsciously cheating
yourself, or (c) you’re a genius, the last of which i heard
happens only once every century or so. so, don’t despair!
— Kyunghyun Cho
Understand psychology and the core problems of semantic
cognition. Read . . . Go to CogSci. Understand machine
learning. Go to NIPS. Don’t worry about ACL. Submit
something terrible (or even good, if possible) to a
workshop as soon as you can. You can’t learn how to do
these things without going through the process. — Felix Hill
67 / 68
Summary of session
• What is NLP? What are the major developments in the last few
years?
• What are the biggest open problems in NLP?
• Get to know the local community and start thinking about
collaborations
• We now have the closing ceremony, so eat and chat!
68 / 68

Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝
 
REVISTA DE BIOLOGIA E CIÊNCIAS DA TERRA ISSN 1519-5228 - Artigo_Bioterra_V24_...
REVISTA DE BIOLOGIA E CIÊNCIAS DA TERRA ISSN 1519-5228 - Artigo_Bioterra_V24_...REVISTA DE BIOLOGIA E CIÊNCIAS DA TERRA ISSN 1519-5228 - Artigo_Bioterra_V24_...
REVISTA DE BIOLOGIA E CIÊNCIAS DA TERRA ISSN 1519-5228 - Artigo_Bioterra_V24_...
 
Citronella presentation SlideShare mani upadhyay
Citronella presentation SlideShare mani upadhyayCitronella presentation SlideShare mani upadhyay
Citronella presentation SlideShare mani upadhyay
 
BUMI DAN ANTARIKSA PROJEK IPAS SMK KELAS X.pdf
BUMI DAN ANTARIKSA PROJEK IPAS SMK KELAS X.pdfBUMI DAN ANTARIKSA PROJEK IPAS SMK KELAS X.pdf
BUMI DAN ANTARIKSA PROJEK IPAS SMK KELAS X.pdf
 
The dark energy paradox leads to a new structure of spacetime.pptx
The dark energy paradox leads to a new structure of spacetime.pptxThe dark energy paradox leads to a new structure of spacetime.pptx
The dark energy paradox leads to a new structure of spacetime.pptx
 
Pests of soyabean_Binomics_IdentificationDr.UPR.pdf
Pests of soyabean_Binomics_IdentificationDr.UPR.pdfPests of soyabean_Binomics_IdentificationDr.UPR.pdf
Pests of soyabean_Binomics_IdentificationDr.UPR.pdf
 
Functional group interconversions(oxidation reduction)
Functional group interconversions(oxidation reduction)Functional group interconversions(oxidation reduction)
Functional group interconversions(oxidation reduction)
 

Frontiers of Natural Language Processing

  • 1. Frontiers of Natural Language Processing Deep Learning Indaba 2018, Stellenbosch, South Africa Sebastian Ruder, Herman Kamper, Panellists, Leaders in NLP, Everyone
  • 2. Goals of session 1. What is NLP? What are the major developments in the last few years? 2. What are the biggest open problems in NLP? 3. Get to know the local community and start thinking about collaborations 1 / 68
  • 3. What is NLP? What were the major advances? A Review of the Recent History of NLP
  • 4. What is NLP? What were the major advances? A Review of the Recent History of NLP Sebastian Ruder
  • 5. Timeline 2001 • Neural language models 2008 • Multi-task learning 2013 • Word embeddings 2013 • Neural networks for NLP 2014 • Sequence-to-sequence models 2015 • Attention 2015 • Memory-based networks 2018 • Pretrained language models 3 / 68
  • 6. Timeline 2001 • Neural language models 2008 • Multi-task learning 2013 • Word embeddings 2013 • Neural networks for NLP 2014 • Sequence-to-sequence models 2015 • Attention 2015 • Memory-based networks 2018 • Pretrained language models 4 / 68
  • 7. Neural language models • Language modeling: predict next word given previous words • Classic language models: n-grams with smoothing • First neural language models: feed-forward neural networks that take into account n previous words • Initial look-up layer is commonly known as word embedding matrix as each word corresponds to one vector [Bengio et al., NIPS ’01; Bengio et al., JMLR ’03] 5 / 68
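To make the feed-forward setup concrete, here is a minimal sketch (not taken from the slides or the cited papers) of a Bengio-style n-gram neural language model in PyTorch: an embedding look-up over the n previous words, one hidden layer, and a softmax over the vocabulary. All sizes and names are illustrative.

    # Minimal feed-forward n-gram language model (illustrative sizes).
    import torch
    import torch.nn as nn

    class FeedForwardLM(nn.Module):
        def __init__(self, vocab_size, emb_dim=64, context_size=4, hidden=128):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, emb_dim)   # the word embedding matrix
            self.ff = nn.Sequential(
                nn.Linear(context_size * emb_dim, hidden),
                nn.Tanh(),
                nn.Linear(hidden, vocab_size),               # scores over the next word
            )

        def forward(self, context_ids):                      # (batch, context_size)
            e = self.embed(context_ids)                      # (batch, context_size, emb_dim)
            return self.ff(e.flatten(start_dim=1))           # (batch, vocab_size)

    model = FeedForwardLM(vocab_size=10_000)
    logits = model(torch.randint(0, 10_000, (8, 4)))         # 8 contexts of 4 previous words
    loss = nn.functional.cross_entropy(logits, torch.randint(0, 10_000, (8,)))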
  • 8. Neural language models • Later language models: RNNs and LSTMs [Mikolov et al., Interspeech ’10] • Many new models in recent years; classic LSTM is still a strong baseline [Melis et al., ICLR ’18] • Active research area: What information do language models capture? • Language modelling: despite its simplicity, core to many later advances • Word embeddings: the objective of word2vec is a simplification of language modelling • Sequence-to-sequence models: predict response word-by-word • Pretrained language models: representations useful for transfer learning 6 / 68
  • 9. Timeline 2001 • Neural language models 2008 • Multi-task learning 2013 • Word embeddings 2013 • Neural networks for NLP 2014 • Sequence-to-sequence models 2015 • Attention 2015 • Memory-based networks 2018 • Pretrained language models 7 / 68
  • 10. Multi-task learning • Multi-task learning: sharing parameters between models trained on multiple tasks [Collobert & Weston, ICML ’08; Collobert et al., JMLR ’11] 8 / 68
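A minimal sketch of the most common setup, hard parameter sharing: one shared encoder whose parameters receive gradients from several task-specific heads. This is a generic illustration rather than the architecture of the cited papers; the tasks and sizes are made up.

    # Hard parameter sharing: one shared sentence encoder, two task-specific heads.
    import torch
    import torch.nn as nn

    class SharedEncoder(nn.Module):
        def __init__(self, vocab_size=10_000, emb_dim=64, hidden=128):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, emb_dim)
            self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True)

        def forward(self, ids):                       # (batch, seq_len)
            out, _ = self.lstm(self.embed(ids))
            return out[:, -1]                         # last hidden state as a sentence vector

    encoder = SharedEncoder()
    topic_head = nn.Linear(128, 20)                   # e.g. 20 topic classes (illustrative)
    sentiment_head = nn.Linear(128, 2)                # e.g. binary sentiment

    # Batches from both tasks update the shared encoder.
    ids = torch.randint(0, 10_000, (8, 25))
    h = encoder(ids)
    loss = (nn.functional.cross_entropy(topic_head(h), torch.randint(0, 20, (8,)))
            + nn.functional.cross_entropy(sentiment_head(h), torch.randint(0, 2, (8,))))
    loss.backward()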
  • 11. Multi-task learning • [Collobert & Weston, ICML ’08] won Test-of-time Award at ICML 2018 • Paper contained a lot of other influential ideas: • Word embeddings • CNNs for text 9 / 68
  • 12. Multi-task learning • Multi-task learning goes back a lot further [Caruana, ICML ’93; Caruana, ICML ’96] 10 / 68
  • 13. Multi-task learning • “Joint learning” / “multi-task learning” used interchangeably • Now used for many tasks in NLP, either using existing tasks or “artificial” auxiliary tasks • MT + dependency parsing / POS tagging / NER • Joint multilingual training • Video captioning + entailment + next-frame prediction [Pasunuru & Bansal; ACL ’17] • . . . 11 / 68
  • 14. Multi-task learning • Sharing of parameters is typically predefined • Can also be learned [Ruder et al., ’17] [Yang et al., ICLR ’17] 12 / 68
  • 15. Timeline 2001 • Neural language models 2008 • Multi-task learning 2013 • Word embeddings 2013 • Neural networks for NLP 2014 • Sequence-to-sequence models 2015 • Attention 2015 • Memory-based networks 2018 • Pretrained language models 13 / 68
  • 16. Word embeddings • Main innovation: pretraining word embedding look-up matrix on a large unlabelled corpus • Popularized by word2vec, an efficient approximation to language modelling • word2vec comes in two variants: skip-gram and CBOW [Mikolov et al., ICLR ’13; Mikolov et al., NIPS ’13] 14 / 68
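A compact sketch of the skip-gram objective with negative sampling: score (centre word, context word) pairs with a dot product and push observed pairs above randomly sampled negatives. This is an illustrative toy version, not the reference word2vec implementation.

    # Skip-gram with negative sampling (toy version, random data).
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    V, D = 10_000, 100
    in_emb = nn.Embedding(V, D)    # centre-word vectors (usually the embeddings that are kept)
    out_emb = nn.Embedding(V, D)   # context-word vectors

    def sgns_loss(centre, context, negatives):
        c = in_emb(centre)                                                # (batch, D)
        pos = (c * out_emb(context)).sum(-1)                              # (batch,)
        neg = torch.bmm(out_emb(negatives), c.unsqueeze(-1)).squeeze(-1)  # (batch, k)
        return -(F.logsigmoid(pos) + F.logsigmoid(-neg).sum(-1)).mean()

    centre = torch.randint(0, V, (32,))
    context = torch.randint(0, V, (32,))
    negatives = torch.randint(0, V, (32, 5))   # 5 random "negative" context words per pair
    loss = sgns_loss(centre, context, negatives)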
  • 17. Word embeddings • Word embeddings pretrained on an unlabelled corpus capture certain relations between words [Tensorflow tutorial] 15 / 68
  • 18. Word embeddings • Pretrained word embeddings have been shown to improve performance on many downstream tasks [Kim, EMNLP ’14] • Later methods show that word embeddings can also be learned via matrix factorization [Pennington et al., EMNLP ’14; Levy et al., NIPS ’14] • Nothing inherently special about word2vec; classic methods (PMI, SVD) can also be used to learn good word embeddings from unlabeled corpora [Levy et al., TACL ’15] 16 / 68
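The count-based route mentioned above is easy to sketch: build a positive PMI matrix from co-occurrence counts and factorize it with a truncated SVD. The counts below are random placeholders for a real co-occurrence matrix.

    # Count-based embeddings: positive PMI on co-occurrence counts, then SVD.
    import numpy as np

    C = np.random.randint(0, 5, size=(1000, 1000)).astype(float)   # toy co-occurrence counts
    total = C.sum()
    p_ij = C / total
    p_i = C.sum(axis=1, keepdims=True) / total
    p_j = C.sum(axis=0, keepdims=True) / total
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(p_ij / (p_i * p_j))
    ppmi = np.nan_to_num(np.maximum(pmi, 0))                        # positive PMI

    U, S, _ = np.linalg.svd(ppmi, full_matrices=False)
    dim = 100
    embeddings = U[:, :dim] * np.sqrt(S[:dim])                      # rows are word vectors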
  • 19. Word embeddings • Lots of work on word embeddings, but word2vec is still widely used • Skip-gram has been applied to learn representations in many other settings, e.g. sentences [Le & Mikolov, ICML ’14; Kiros et al., NIPS ’15], networks [Grover & Leskovec, KDD ’16], biological sequences [Asgari & Mofrad, PLoS One ’15], etc. 17 / 68
  • 20. Word embeddings • Projecting word embeddings of different languages into the same space enables (zero-shot) cross-lingual transfer [Ruder et al., JAIR ’18] [Luong et al., ’15] 18 / 68
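One common way to do this projection is to learn an orthogonal (Procrustes) mapping from a small seed dictionary, roughly as below; the arrays are random stand-ins for real pretrained embeddings of translation pairs.

    # Orthogonal (Procrustes) mapping of source-language vectors into the target space.
    import numpy as np

    d, n_pairs = 300, 5000
    X_src = np.random.randn(n_pairs, d)   # source-language vectors of the seed pairs
    X_tgt = np.random.randn(n_pairs, d)   # target-language vectors of their translations

    U, _, Vt = np.linalg.svd(X_tgt.T @ X_src)
    W = U @ Vt                            # orthogonal map minimising ||X_src W^T - X_tgt||

    mapped = X_src @ W.T                  # source vectors expressed in the target space
    # Nearest neighbours of `mapped` among target vectors then act as translations.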
  • 21. Timeline 2001 • Neural language models 2008 • Multi-task learning 2013 • Word embeddings 2013 • Neural networks for NLP 2014 • Sequence-to-sequence models 2015 • Attention 2015 • Memory-based networks 2018 • Pretrained language models 19 / 68
  • 22. Neural networks for NLP • Key challenge for neural networks: dealing with dynamic input sequences • Three main model types • Recurrent neural networks • Convolutional neural networks • Recursive neural networks 20 / 68
  • 23. Recurrent neural networks • Vanilla RNNs [Elman, CogSci ’90] are typically not used as gradients vanish or explode with longer inputs • Long short-term memory networks [Hochreiter & Schmidhuber, NeuComp ’97] are the model of choice [Olah, ’15] 21 / 68
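In modern toolkits an LSTM encoder is a few lines; a minimal PyTorch example with illustrative sizes:

    # A bidirectional LSTM encoder over a batch of token-id sequences.
    import torch
    import torch.nn as nn

    embed = nn.Embedding(10_000, 128)
    lstm = nn.LSTM(input_size=128, hidden_size=256, num_layers=1,
                   batch_first=True, bidirectional=True)

    ids = torch.randint(0, 10_000, (8, 25))     # batch of 8 sentences, 25 tokens each
    outputs, (h_n, c_n) = lstm(embed(ids))
    # outputs: (8, 25, 512)  per-token states, both directions concatenated
    # h_n:     (2, 8, 256)   final hidden state of each direction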
  • 24. Convolutional neural networks • 1D adaptation of convolutional neural networks for images • Filter is moved along temporal dimension [Kim, EMNLP ’14] 22 / 68
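A minimal sketch of such a text CNN: the embedded sentence is convolved along the time axis and max-pooled over time (in the spirit of Kim-style CNNs; all sizes illustrative).

    # 1D convolution over a sentence: filters slide along the temporal dimension.
    import torch
    import torch.nn as nn

    embed = nn.Embedding(10_000, 128)
    conv = nn.Conv1d(in_channels=128, out_channels=100, kernel_size=3, padding=1)

    ids = torch.randint(0, 10_000, (8, 25))
    x = embed(ids).transpose(1, 2)          # (batch, emb_dim, seq_len) for Conv1d
    feats = torch.relu(conv(x))             # (batch, 100, seq_len)
    pooled = feats.max(dim=2).values        # max-over-time pooling -> (batch, 100)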
  • 25. Convolutional neural networks • More parallelizable than RNNs, focus on local features • Can be extended with wider receptive fields (dilated convolutions) to capture wider context [Kalchbrenner et al., ’17] • CNNs and LSTMs can be combined and stacked [Wang et al., ACL ’16] • Convolutions can be used to speed up an LSTM [Bradbury et al., ICLR ’17] 23 / 68
  • 26. Recursive neural networks • Natural language is inherently hierarchical • Treat input as tree rather than as a sequence • Can also be extended to LSTMs [Tai et al., ACL ’15] [Socher et al., EMNLP ’13] 24 / 68
  • 27. Other tree-based neural networks • Word embeddings based on dependencies [Levy and Goldberg, ACL ’14] • Language models that generate words based on a syntactic stack [Dyer et al., NAACL ’16] • CNNs over a graph (trees), e.g. graph-convolutional neural networks [Bastings et al., EMNLP ’17] 25 / 68
  • 28. Timeline 2001 • Neural language models 2008 • Multi-task learning 2013 • Word embeddings 2013 • Neural networks for NLP 2014 • Sequence-to-sequence models 2015 • Attention 2015 • Memory-based networks 2018 • Pretrained language models 26 / 68
  • 29. Sequence-to-sequence models • General framework for applying neural networks to tasks where output is a sequence • Killer application: Neural Machine Translation • Encoder processes input word by word; decoder then predicts output word by word [Sutskever et al., NIPS ’14] 27 / 68
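A compact and deliberately simplified sketch of the idea (no attention, greedy decoding): the encoder's final state conditions a decoder that emits one word at a time. Token ids, sizes, and the assumption that id 1 is the beginning-of-sentence symbol are all illustrative.

    # Encoder-decoder without attention: the encoder's final state seeds the decoder.
    import torch
    import torch.nn as nn

    V, D, H = 10_000, 128, 256
    src_embed, tgt_embed = nn.Embedding(V, D), nn.Embedding(V, D)
    encoder = nn.LSTM(D, H, batch_first=True)
    decoder = nn.LSTM(D, H, batch_first=True)
    out_proj = nn.Linear(H, V)

    src = torch.randint(0, V, (1, 30))                 # one source sentence
    _, state = encoder(src_embed(src))                 # (h, c) summarising the source

    token = torch.tensor([[1]])                        # assume id 1 is <bos>
    generated = []
    for _ in range(20):                                # greedy decoding, at most 20 steps
        out, state = decoder(tgt_embed(token), state)
        token = out_proj(out[:, -1]).argmax(-1, keepdim=True)
        generated.append(token.item())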
  • 30. Sequence-to-sequence models • Go-to framework for natural language generation tasks • Output can not only be conditioned on a sequence, but on arbitrary representations, e.g. an image for image captioning [Vinyals et al., CVPR ’15] 28 / 68
  • 31. Sequence-to-sequence models • Even applicable to structured prediction tasks, e.g. constituency parsing [Vinyals et al., NIPS ’15], named entity recognition [Gillick et al., NAACL ’16], etc. by linearizing the output [Vinyals et al., NIPS ’15] 29 / 68
  • 32. Sequence-to-sequence models • Typically RNN-based, but other encoders and decoders can be used • New architectures mainly coming out of work in Machine Translation • Recent models: Deep LSTM [Wu et al., ’16], Convolutional encoders [Kalchbrenner et al., arXiv ’16; Gehring et al., arXiv ’17], Transformer [Vaswani et al., NIPS ’17], Combination of LSTM and Transformer [Chen et al., ACL ’18] 30 / 68
  • 33. Timeline 2001 • Neural language models 2008 • Multi-task learning 2013 • Word embeddings 2013 • Neural networks for NLP 2014 • Sequence-to-sequence models 2015 • Attention 2015 • Memory-based networks 2018 • Pretrained language models 31 / 68
  • 34. Attention • One of the core innovations in Neural Machine Translation • Weighted average of source sentence hidden states • Mitigates bottleneck of compressing source sentence into a single vector [Bahdanau et al., ICLR ’15] 32 / 68
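The mechanism itself is only a few lines: score every encoder state against the current decoder state, normalise the scores with a softmax, and take the weighted average. Dot-product scoring is used below for brevity; Bahdanau-style attention uses an additive score instead.

    # Attention as a weighted average of encoder hidden states.
    import torch
    import torch.nn.functional as F

    enc_states = torch.randn(8, 30, 256)    # (batch, src_len, hidden)
    dec_state = torch.randn(8, 256)         # current decoder state

    scores = torch.bmm(enc_states, dec_state.unsqueeze(-1)).squeeze(-1)   # (batch, src_len)
    weights = F.softmax(scores, dim=-1)                                   # attention weights
    context = torch.bmm(weights.unsqueeze(1), enc_states).squeeze(1)      # (batch, hidden)
    # `context` is combined with dec_state to predict the next target word.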
  • 35. Attention • Different forms of attention available [Luong et al., EMNLP ’15] • Widely applicable: constituency parsing [Vinyals et al., NIPS ’15], reading comprehension [Hermann et al., NIPS ’15], one-shot learning [Vinyals et al., NIPS ’16], image captioning [Xu et al., ICML ’15] [Xu et al., ICML ’15] 33 / 68
  • 36. Attention • Not only restricted to looking at another sequence • Can be used to obtain more contextually sensitive word representations by attending to the same sequence → self-attention • Used in Transformer [Vaswani et al., NIPS ’17], state-of-the-art architecture for machine translation 34 / 68
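A single-head, unmasked sketch of the scaled dot-product self-attention at the heart of the Transformer; the projections and sizes are illustrative.

    # Scaled dot-product self-attention over one sequence (single head, no mask).
    import math
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    d_model = 256
    W_q, W_k, W_v = (nn.Linear(d_model, d_model) for _ in range(3))

    x = torch.randn(8, 25, d_model)                          # (batch, seq_len, d_model)
    Q, K, V = W_q(x), W_k(x), W_v(x)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_model)    # (batch, seq_len, seq_len)
    attn = F.softmax(scores, dim=-1)
    contextual = attn @ V                                    # each position mixes in the others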
  • 37. Timeline 2001 • Neural language models 2008 • Multi-task learning 2013 • Word embeddings 2013 • Neural networks for NLP 2014 • Sequence-to-sequence models 2015 • Attention 2015 • Memory-based networks 2018 • Pretrained language models 35 / 68
  • 38. Memory-based neural networks • Attention can be seen as fuzzy memory • Models with more explicit memory have been proposed • Different variants: Neural Turing Machine [Graves et al., arXiv ’14], Memory Networks [Weston et al., ICLR ’15] and End-to-end Memory Networks [Sukhbaatar et al., NIPS ’15], Dynamic Memory Networks [Kumar et al., ICML ’16], Differentiable Neural Computer [Graves et al., Nature ’16], Recurrent Entity Network [Henaff et al., ICLR ’17] 36 / 68
  • 39. Memory-based neural networks • Memory is typically accessed based on similarity to the current state, similar to attention; it can be written to and read from • End-to-end Memory Networks [Sukhbaatar et al., NIPS ’15] process the input multiple times and update the memory • Neural Turing Machines also have location-based addressing and can learn simple computer programs like sorting • Memory can be a knowledge base or populated based on the input 37 / 68
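The similarity-based read can be sketched in a few lines: match the query against the memory slots, softmax the match scores, and read out a weighted sum of the slot contents (one "hop" in the style of end-to-end memory networks). The tensors below are placeholders.

    # One memory read: address slots by similarity to the query, then blend their values.
    import torch
    import torch.nn.functional as F

    memory_keys = torch.randn(32, 128)    # 32 memory slots (e.g. embedded context sentences)
    memory_vals = torch.randn(32, 128)
    query = torch.randn(128)              # current state / question embedding

    weights = F.softmax(memory_keys @ query, dim=0)    # (32,) addressing weights
    read = weights @ memory_vals                       # (128,) retrieved summary
    new_query = query + read                           # a "hop": the query is updated with the read
    # End-to-end memory networks repeat this read several times (multiple hops).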
  • 40. Timeline 2001 • Neural language models 2008 • Multi-task learning 2013 • Word embeddings 2013 • Neural networks for NLP 2014 • Sequence-to-sequence models 2015 • Attention 2015 • Memory-based networks 2018 • Pretrained language models 38 / 68
  • 41. Pretrained language models • Word embeddings are context-agnostic, only used to initialize first layer • Use better representations for initialization or as features • Language models pretrained on a large corpus capture a lot of additional information • Language model embeddings can be used as features in a target model [Peters et al., NAACL ’18] or a language model can be fine-tuned on target task data [Howard & Ruder, ACL ’18] 39 / 68
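A generic sketch of the fine-tuning route: keep the pretrained LM's encoder, add a small task head, and fine-tune the encoder with a much smaller learning rate so that the pretrained knowledge is not overwritten (in the spirit of discriminative fine-tuning). The checkpoint path and all sizes are hypothetical.

    # Reuse a pretrained LM encoder for classification: add a head, fine-tune gently.
    import torch
    import torch.nn as nn

    class LMEncoder(nn.Module):
        def __init__(self, vocab=30_000, dim=400):
            super().__init__()
            self.embed = nn.Embedding(vocab, dim)
            self.lstm = nn.LSTM(dim, dim, num_layers=2, batch_first=True)

        def forward(self, ids):
            out, _ = self.lstm(self.embed(ids))
            return out[:, -1]                    # final state as a document vector

    encoder = LMEncoder()
    # encoder.load_state_dict(torch.load("pretrained_lm.pt"))   # hypothetical checkpoint
    classifier = nn.Linear(400, 2)

    optimizer = torch.optim.Adam([
        {"params": encoder.parameters(), "lr": 1e-5},     # small LR: preserve pretraining
        {"params": classifier.parameters(), "lr": 1e-3},  # larger LR for the new head
    ])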
  • 42. Pretrained language models • Adding language model embeddings gives a large improvement over state-of-the-art across many different tasks [Peters et al., ’18] 40 / 68
  • 43. Pretrained language models • Enables learning models with significantly less data • Additional benefit: Language models only require unlabelled data • Enables application to low-resource languages where labelled data is scarce 41 / 68
  • 44. Other milestones • Character-based representations • Use a CNN/LSTM over characters to obtain a character-based word representation • First used for sequence labelling tasks [Lample et al., NAACL ’16; Plank et al., ACL ’16]; now widely used • Even fully character-based NMT [Lee et al., TACL ’17] • Adversarial learning • Adversarial examples are becoming widely used [Jia & Liang, EMNLP ’17] • (Virtual) adversarial training [Miyato et al., ICLR ’17; Yasunaga et al., NAACL ’18] and domain-adversarial loss [Ganin et al., JMLR ’16; Kim et al., ACL ’17] are useful forms of regularization • GANs are used, but not yet too effective for NLG [Semeniuta et al., ’18] • Reinforcement learning • Useful for tasks with a temporal dependency, e.g. selecting data [Fang & Cohn, EMNLP ’17; Wu et al., NAACL ’18] and dialogue [Liu et al., NAACL ’18] • Also effective for directly optimizing a surrogate loss (ROUGE, BLEU) for summarization [Paulus et al., ICLR ’18] or MT [Ranzato et al., ICLR ’16] 42 / 68
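A sketch of one such character-based word representation: a BiLSTM runs over a word's characters and the final states of the two directions are concatenated; the character vocabulary and sizes are illustrative.

    # A character-based word representation: BiLSTM over the word's characters.
    import torch
    import torch.nn as nn

    char_embed = nn.Embedding(100, 32)                        # ~100 characters (illustrative)
    char_lstm = nn.LSTM(32, 64, batch_first=True, bidirectional=True)

    def char_word_repr(char_ids):                             # (num_words, max_word_len)
        _, (h_n, _) = char_lstm(char_embed(char_ids))         # h_n: (2, num_words, 64)
        return torch.cat([h_n[0], h_n[1]], dim=-1)            # (num_words, 128)

    words = torch.randint(0, 100, (16, 12))                   # 16 words, up to 12 characters each
    reprs = char_word_repr(words)
    # Typically concatenated with the token's word embedding before the sentence encoder.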
  • 45. The Biggest Open Problems in NLP
  • 46. The Biggest Open Problems in NLP Sebastian Ruder Jade Abbott Stephan Gouws Omoju Miller Bernardt Duvenhage
  • 47. The biggest open problems: Answers from experts Hal Daumé III Barbara Plank Miguel Ballesteros Anders Søgaard Manaal Faruqui Mikel Artetxe Sebastian Riedel Isabelle Augenstein Bernardt Duvenhage Lea Frermann Brink van der Merwe Karen Livescu Jan Buys Kevin Gimpel Christine de Kock Alta de Waal Michael Roth Maletěabisa Molapo Annie Louise Chris Dyer Yoshua Bengio Felix Hill Kevin Knight Richard Socher George Dahl Dirk Hovy Kyunghyun Cho 44 / 68
  • 48. We asked the experts: What are the three biggest open problems in NLP at the moment?
  • 49. The biggest open problems in NLP 1. Natural language understanding 2. NLP for low-resource scenarios 3. Reasoning about large or multiple documents 4. Datasets, problems and evaluation 46 / 68
  • 50. Problem 1: Natural language understanding • Many experts argued that this is central, also for generation • Almost none of our current models have “real” understanding • What (biases, structure) should we build explicitly into our models? • Models should incorporate common sense • Dialogue systems (and chat bots) were mentioned in several responses 47 / 68
  • 51. Problem 1: Natural language understanding Article: Nikola Tesla Paragraph: In January 1880, two of Tesla’s uncles put together enough money to help him leave Gospić for Prague where he was to study. Unfortunately, he arrived too late to enroll at Charles-Ferdinand University; he never studied Greek, a required subject; and he was illiterate in Czech, another required subject. Tesla did, however, attend lectures at the university, although, as an auditor, he did not receive grades for the courses. 48 / 68
  • 52. Problem 1: Natural language understanding Article: Nikola Tesla Paragraph: In January 1880, two of Tesla’s uncles put together enough money to help him leave Gospić for Prague where he was to study. Unfortunately, he arrived too late to enroll at Charles-Ferdinand University; he never studied Greek, a required subject; and he was illiterate in Czech, another required subject. Tesla did, however, attend lectures at the university, although, as an auditor, he did not receive grades for the courses. Question: What city did Tesla move to in 1880? 48 / 68
  • 53. Problem 1: Natural language understanding Article: Nikola Tesla Paragraph: In January 1880, two of Tesla’s uncles put together enough money to help him leave Gospić for Prague where he was to study. Unfortunately, he arrived too late to enroll at Charles-Ferdinand University; he never studied Greek, a required subject; and he was illiterate in Czech, another required subject. Tesla did, however, attend lectures at the university, although, as an auditor, he did not receive grades for the courses. Question: What city did Tesla move to in 1880? Answer: Prague 48 / 68
  • 54. Problem 1: Natural language understanding Article: Nikola Tesla Paragraph: In January 1880, two of Tesla’s uncles put together enough money to help him leave Gospić for Prague where he was to study. Unfortunately, he arrived too late to enroll at Charles-Ferdinand University; he never studied Greek, a required subject; and he was illiterate in Czech, another required subject. Tesla did, however, attend lectures at the university, although, as an auditor, he did not receive grades for the courses. Question: What city did Tesla move to in 1880? Answer: Prague Model predicts: Prague 48 / 68
  • 55. Problem 1: Natural language understanding Article: Nikola Tesla Paragraph: In January 1880, two of Tesla’s uncles put together enough money to help him leave Gospić for Prague where he was to study. Unfortunately, he arrived too late to enroll at Charles-Ferdinand University; he never studied Greek, a required subject; and he was illiterate in Czech, another required subject. Tesla did, however, attend lectures at the university, although, as an auditor, he did not receive grades for the courses. Tadakatsu moved to the city of Chicago in 1881. Question: What city did Tesla move to in 1880? Answer: Model predicts: 48 / 68
  • 56. Problem 1: Natural language understanding Article: Nikola Tesla Paragraph: In January 1880, two of Tesla’s uncles put together enough money to help him leave Gospić for Prague where he was to study. Unfortunately, he arrived too late to enroll at Charles-Ferdinand University; he never studied Greek, a required subject; and he was illiterate in Czech, another required subject. Tesla did, however, attend lectures at the university, although, as an auditor, he did not receive grades for the courses. Tadakatsu moved to the city of Chicago in 1881. Question: What city did Tesla move to in 1880? Answer: Prague Model predicts: 48 / 68
  • 57. Problem 1: Natural language understanding Article: Nikola Tesla Paragraph: In January 1880, two of Tesla’s uncles put together enough money to help him leave Gospić for Prague where he was to study. Unfortunately, he arrived too late to enroll at Charles-Ferdinand University; he never studied Greek, a required subject; and he was illiterate in Czech, another required subject. Tesla did, however, attend lectures at the university, although, as an auditor, he did not receive grades for the courses. Tadakatsu moved to the city of Chicago in 1881. Question: What city did Tesla move to in 1880? Answer: Prague Model predicts: Chicago 48 / 68
  • 58. Problem 1: Natural language understanding [Jia and Liang, EMNLP’17] 49 / 68
  • 59. Problem 1: Natural language understanding I think the biggest open problems are all related to natural language understanding. . . we should develop systems that read and understand text the way a person does, by forming a representation of the world of the text, with the agents, objects, settings, and the relationships, goals, desires, and beliefs of the agents, and everything else that humans create to understand a piece of text. Until we can do that, all of our progress is in improving our systems’ ability to do pattern matching. Pattern matching can be very effective for developing products and improving people’s lives, so I don’t want to denigrate it, but . . . — Kevin Gimpel 50 / 68
  • 60. Problem 1: Natural language understanding Questions to panellists/audience: • To achieve NLU, is it important to build models that process language “the way a person does”? 51 / 68
  • 61. Problem 1: Natural language understanding Questions to panellists/audience: • To achieve NLU, is it important to build models that process language “the way a person does”? • How do you think we would go about doing this? 51 / 68
  • 62. Problem 1: Natural language understanding Questions to panellists/audience: • To achieve NLU, is it important to build models that process language “the way a person does”? • How do you think we would go about doing this? • Do we need inductive biases or can we expect models to learn everything from enough data? 51 / 68
  • 63. Problem 1: Natural language understanding Questions to panellists/audience: • To achieve NLU, is it important to build models that process language “the way a person does”? • How do you think we would go about doing this? • Do we need inductive biases or can we expect models to learn everything from enough data? • Questions from audience 51 / 68
  • 64. Problem 2: NLP for low-resource scenarios 52 / 68
  • 65. Problem 2: NLP for low-resource scenarios • Generalisation beyond the training data 52 / 68
  • 66. Problem 2: NLP for low-resource scenarios • Generalisation beyond the training data – relevant everywhere! 52 / 68
  • 67. Problem 2: NLP for low-resource scenarios • Generalisation beyond the training data – relevant everywhere! • Domain-transfer, transfer learning, multi-task learning • Learning from small amounts of data • Semi-supervised, weakly-supervised, “Wiki-ly” supervised, distantly-supervised, lightly-supervised, minimally-supervised 52 / 68
  • 68. Problem 2: NLP for low-resource scenarios • Generalisation beyond the training data – relevant everywhere! • Domain-transfer, transfer learning, multi-task learning • Learning from small amounts of data • Semi-supervised, weakly-supervised, “Wiki-ly” supervised, distantly-supervised, lightly-supervised, minimally-supervised • Unsupervised learning 52 / 68
  • 69. Problem 2: NLP for low-resource scenarios Word translation without parallel data: [Conneau et al., ICLR’18] 53 / 68
  • 70. Problem 2: NLP for low-resource scenarios [Chung et al., arXiv’18] 54 / 68
  • 71. Problem 2: NLP for low-resource scenarios Questions to panellists/audience: • Is it necessary to develop specialised NLP tools for specific languages, or is it enough to work on general NLP? 55 / 68
  • 72. Problem 2: NLP for low-resource scenarios Questions to panellists/audience: • Is it necessary to develop specialised NLP tools for specific languages, or is it enough to work on general NLP? • Since there are inherently only small amounts of text available for under-resourced languages, the benefits of NLP in such settings will also be limited. Agree or disagree? 55 / 68
  • 73. Problem 2: NLP for low-resource scenarios Questions to panellists/audience: • Is it necessary to develop specialised NLP tools for specific languages, or is it enough to work on general NLP? • Since there are inherently only small amounts of text available for under-resourced languages, the benefits of NLP in such settings will also be limited. Agree or disagree? • Unsupervised learning vs. transfer learning from high-resource languages? • Questions from audience 55 / 68
  • 74. Problem 3: Reasoning about large or multiple documents • Related to understanding • How do we deal with large contexts? • Can be either text or spoken documents • Again incorporating common sense is essential 56 / 68
  • 75. Problem 3: Reasoning about large or multiple documents Example from NarrativeQA dataset: [Kočiský et al., TACL’18] 57 / 68
  • 76. Problem 3: Reasoning about large or multiple documents Questions to panellists/audience: • Do we need better models or just train on more data? 58 / 68
  • 77. Problem 3: Reasoning about large or multiple documents Questions to panellists/audience: • Do we need better models or just train on more data? • Questions from audience 58 / 68
  • 78. Problem 4: Datasets, problems and evaluation Perhaps the biggest problem is to properly define the problems themselves. And by properly defining a problem, I mean building datasets and evaluation procedures that are appropriate to measure our progress towards concrete goals. Things would be easier if we could reduce everything to Kaggle style competitions! — Mikel Artetxe . . . basic resources (e.g. stop word lists) — Alta de Waal 59 / 68
  • 79. Problem 4: Datasets, problems and evaluation https://rma.nwu.ac.za 60 / 68
  • 80. Problem 4: Datasets, problems and evaluation Questions to panellists/audience: • What are the most important NLP problems that should be tackled for societies in Africa? 61 / 68
  • 81. Problem 4: Datasets, problems and evaluation Questions to panellists/audience: • What are the most important NLP problems that should be tackled for societies in Africa? • How do we make sure that we don’t overfit to our benchmarks? 61 / 68
  • 82. Problem 4: Datasets, problems and evaluation Questions to panellists/audience: • What are the most important NLP problems that should be tackled for societies in Africa? • How do we make sure that we don’t overfit to our benchmarks? • Questions from audience 61 / 68
  • 83. We asked the experts a few more questions:
  • 84. We asked the experts a few more questions: What, if anything, has led the field in the wrong direction?
  • 85. What has led the field in the wrong direction? • “Synthetic data/synthetic problems” — Hal Daumé III • “Benchmark/leaderboard chasing” — Sebastian Riedel • “Obsession of . . . beating the state of the art through “neural architecture search” — Isabelle Augenstein 63 / 68
  • 86. What has led the field in the wrong direction? • “Synthetic data/synthetic problems” — Hal Daumé III • “Benchmark/leaderboard chasing” — Sebastian Riedel • “Obsession of . . . beating the state of the art through “neural architecture search” — Isabelle Augenstein • “Chomskyan theories of linguistics instead of corpus linguistics” — Brink van der Merwe 63 / 68
  • 87. What has led the field in the wrong direction? • “Synthetic data/synthetic problems” — Hal Daumé III • “Benchmark/leaderboard chasing” — Sebastian Riedel • “Obsession of . . . beating the state of the art through “neural architecture search” — Isabelle Augenstein • “Chomskyan theories of linguistics instead of corpus linguistics” — Brink van der Merwe • “Not incorporating enough Chomskyan theory into our models” — Someone Else 63 / 68
  • 88. What has led the field in the wrong direction? • “Synthetic data/synthetic problems” — Hal Daumé III • “Benchmark/leaderboard chasing” — Sebastian Riedel • “Obsession of . . . beating the state of the art through “neural architecture search” — Isabelle Augenstein • “Chomskyan theories of linguistics instead of corpus linguistics” — Brink van der Merwe • “Not incorporating enough Chomskyan theory into our models” — Someone Else • “Too much emphasis on Bayesian methods (sorry :)”— Karen Livescu 63 / 68
  • 89. What has led the field in the wrong direction? • “Synthetic data/synthetic problems” — Hal Daumé III • “Benchmark/leaderboard chasing” — Sebastian Riedel • “Obsession of . . . beating the state of the art through “neural architecture search” — Isabelle Augenstein • “Chomskyan theories of linguistics instead of corpus linguistics” — Brink van der Merwe • “Not incorporating enough Chomskyan theory into our models” — Someone Else • “Too much emphasis on Bayesian methods (sorry :)”— Karen Livescu • “Haha, as if the field as a whole moved in a single direction” — Michael Roth 63 / 68
  • 90. What has led the field in the wrong direction? I don’t think there is anything like that. We can learn from “wrong” directions and “correct” directions, if such a thing even exists. — Miguel Ballesteros Anything new will temporarily lead the field in the wrong direction, I guess, but upon returning, we may nevertheless have pushed research horizons. — Anders Søgaard Sentiment shared in many of the other responses 64 / 68
  • 91. We asked the experts a few more questions:
  • 92. We asked the experts a few more questions: What advice would you give a postgraduate student in NLP starting their project now?
  • 93. What advice would you give a postgraduate student in NLP starting their project now? Do not limit yourself to reading NLP papers. Read a lot of machine learning, deep learning, reinforcement learning papers. A PhD is a great time in one’s life to go for a big goal, and even small steps towards that will be valued. — Yoshua Bengio Learn how to tune your models, learn how to make strong baselines, and learn how to build baselines that test particular hypotheses. Don’t take any single paper too seriously, wait for its conclusions to show up more than once. — George Dahl 66 / 68
  • 94. What advice would you give a postgraduate student in NLP starting their project now? i believe scientific pursuit is meant to be full of failures. . . . if every idea works out, it’s either (a) you’re not ambitious enough, (b) you’re subconsciously cheating yourself, or (c) you’re a genius, the last of which i heard happens only once every century or so. so, don’t despair! — Kyunghyun Cho Understand psychology and the core problems of semantic cognition. Read . . . Go to CogSci. Understand machine learning. Go to NIPS. Don’t worry about ACL. Submit something terrible (or even good, if possible) to a workshop as soon as you can. You can’t learn how to do these things without going through the process. — Felix Hill 67 / 68
  • 95. Summary of session • What is NLP? What are the major developments in the last few years? • What are the biggest open problems in NLP? • Get to know the local community and start thinking about collaborations 68 / 68
  • 96. Summary of session • What is NLP? What are the major developments in the last few years? • What are the biggest open problems in NLP? • Get to know the local community and start thinking about collaborations • We now have the closing ceremony, so eat and chat! 68 / 68