The Brain’s Guide to Dealing with Context in Language Understanding
Like the visual cortex, the regions of the brain involved in understanding language represent information hierarchically. But whereas the visual cortex organizes information into a spatial hierarchy, the language regions encode information into a hierarchy of timescales. This organization is key to our uniquely human ability to integrate semantic information across narratives. More and more, deep learning-based approaches to natural language understanding embrace models that incorporate contextual information at varying timescales. This has not only led to state-of-the-art performance on many difficult natural language tasks, but also to breakthroughs in our understanding of brain activity.
In this talk, we will discuss the important connection between language understanding and context at different timescales. We will explore how different deep learning architectures capture timescales in language and how closely their encodings mimic the brain. Along the way, we will uncover some surprising discoveries about what depth does and doesn’t buy you in deep recurrent neural networks. And we’ll describe a new, more flexible way to think about these architectures that eases design-space exploration. Finally, we’ll discuss some of the exciting applications made possible by these breakthroughs.
1. The brain’s guide to dealing with context in language understanding
Ted Willke, Javier Turek, and Vy Vo
Intel Labs
November 8th, 2019
Alex Huth and Shailee Jain
UT-Austin
2. Natural Language Understanding
A form of natural language processing that deals with machine reading comprehension.
Example:
“The problem to be solved is: Tom has twice as many fish as Mary has guppies. If Mary has 3 guppies, what is the number of fish Tom has?”
(D.G. Bobrow, 1964)
3. A 1960’s example
Input text:
“The problem to be solved is: If the number of customers Tom gets is twice the square of 20 percent of the number of advertisements he runs, and the number of advertisements he runs is 45, what is the number of customers Tom gets?”
NLP (Lisp example) → canonical sentences, with mark-up:
“The number (of/op) customers Tom (gets/verb) is 2 (times/op 1) the (square/op 1) of 20 (percent/op 2) (of/op) the number (of/op) advertisements (he/pro) runs (period/dlm) The number (of/op) advertisements (he/pro) runs is 45 (period/dlm) (what/qword) is the number (of/op) customers Tom (gets/verb) (qmark/dlm)”
NLU → Answer: “The number of customers Tom gets is 162”
NLU derives meaning from the lexicon, grammar, and context. E.g., what is the meaning of “(he/pro) runs” here?
(D.G. Bobrow, 1964)
4. Applications of NLU
Super-valuable stuff!
• Machine translation (Google Translate)
• Question answering (The Stanford Question Answering Dataset 2.0)
• Machine reasoning (Aristo, Allen AI)
• Even visual question answering! (Zhu et al., 2015)
5. The importance of context in language understanding
• Retaining information about narratives is key to effective comprehension.
• This information must be:
  - Represented
  - Organized
  - Effectively applied
https://rajpurkar.github.io/SQuAD-explorer/explore/v2.0/dev/Economic_inequality.html
The brain is great at this. What can it teach us?
6. Key questions for this talk
How does the brain organize and represent narratives?
What can deep learning models tell us about the brain?
Are the more effective ones more brain-like?
How well do deep learning models deal with narrative context?
8. The brain’s organization
In order to understand language, the human brain explicitly represents information at a hierarchy of different timescales across different brain areas.
• Early stages: auditory processing in milliseconds to words at the sub-second scale
• Later stages: derive meaning by combining information across minutes and hours
Representations at long timescales have been shown to exist in separate brain areas, but little is known about their structure and format.
(Lerner et al., 2011)
9. Key questions for this talk
How does the brain organize and represent narratives?
How well do deep learning models deal with narrative context?
What can deep learning models tell us about the brain?
Are the more effective ones more brain-like?
10. A look at recent state-of-the-art models
Recurrent Neural Networks
Temporal Convolutional Networks
Transformer Networks
11. Evaluating the performance of these models
•Sequence modeling
Given an input sequence x_0, …, x_T and desired corresponding outputs (predictions) y_0, …, y_T, we wish to learn a function ŷ_0, …, ŷ_T = f(x_0, …, x_T), where ŷ_t depends only on past inputs x_0, …, x_t (the causal constraint).
Used as a proxy to study the performance of backbone models for NLU (e.g., predicting the next character or word)
•Sequence modeling applied to language is language modeling
•Self-supervised, basis for many other NLP tasks, and exploits context for prediction
12. Example sequence modeling tasks
• Add: Add two numbers that are marked in a long sequence, and output the sum after a delay (see the data-generation sketch after this list)
• Copy: Copy a short sequence that appears much earlier in a long sequence
• Classify (MNIST): Given a sequence of pixel values from MNIST (784x1), predict the corresponding digit (0-9)
• Predict word (LAMBADA): Given a dataset of 10K passages from novels, with an average context of 4.6 sentences, predict the last word of a target sentence
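To make the Add and Copy tasks concrete, here is a minimal data-generation sketch (not from the talk; the sequence lengths, marker encoding, and alphabet size are illustrative assumptions):

```python
import numpy as np

def adding_problem(seq_len=200, n_samples=1000, rng=np.random.default_rng(0)):
    """Each input step is (value, marker); exactly two steps are marked.
    The target is the sum of the two marked values, reported at the end."""
    values = rng.random((n_samples, seq_len))
    markers = np.zeros((n_samples, seq_len))
    targets = np.zeros(n_samples)
    for i in range(n_samples):
        a, b = rng.choice(seq_len, size=2, replace=False)
        markers[i, [a, b]] = 1.0
        targets[i] = values[i, a] + values[i, b]
    x = np.stack([values, markers], axis=-1)   # (n_samples, seq_len, 2)
    return x, targets

def copy_problem(seq_len=100, copy_len=10, n_symbols=8, n_samples=1000,
                 rng=np.random.default_rng(0)):
    """A short symbol sequence appears at the start; after a long stretch of
    blanks and a delimiter, the model must reproduce it at the end."""
    blank, delim = 0, n_symbols + 1
    x = np.full((n_samples, seq_len + 2 * copy_len), blank)
    y = np.full_like(x, blank)
    pattern = rng.integers(1, n_symbols + 1, size=(n_samples, copy_len))
    x[:, :copy_len] = pattern
    x[:, seq_len + copy_len - 1] = delim       # "start copying" marker
    y[:, -copy_len:] = pattern                 # expected output at the end
    return x, y
```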
13. A look at recent state-of-the-art models
Recurrent Neural Networks
Temporal Convolutional Networks
Transformer Networks
14. Using recurrence to solve the problem
Can process a sequence of vectors x_t by applying a recurrence formula at each time step:
h_t = f_W(h_{t−1}, x_t)
where h_t is the new state, h_{t−1} is the old state, x_t is the input vector at time t, and f_W is some function with parameters W.
The same function and parameters are used at every time step!
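A minimal NumPy sketch of this recurrence, assuming the common tanh vanilla-RNN form h_t = tanh(W_hh·h_{t−1} + W_xh·x_t + b); the weight names and initialization are illustrative:

```python
import numpy as np

class VanillaRNNCell:
    """h_t = tanh(W_hh @ h_{t-1} + W_xh @ x_t + b): the same weights W
    are reused at every time step."""
    def __init__(self, input_dim, hidden_dim, rng=np.random.default_rng(0)):
        self.W_xh = rng.standard_normal((hidden_dim, input_dim)) * 0.01
        self.W_hh = rng.standard_normal((hidden_dim, hidden_dim)) * 0.01
        self.b = np.zeros(hidden_dim)

    def step(self, h_prev, x_t):
        return np.tanh(self.W_hh @ h_prev + self.W_xh @ x_t + self.b)

    def forward(self, xs):
        """Apply the same recurrence to a whole sequence of input vectors."""
        h = np.zeros(self.W_hh.shape[0])
        states = []
        for x_t in xs:
            h = self.step(h, x_t)
            states.append(h)
        return states
```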
15. Example: Character-level language model
Predicting the next character…
Vocabulary: [h, e, l, o]
Training sequence: “hello”
(Example adapted from Stanford’s excellent CS231n course. Thank you Fei-Fei Li, Justin Johnson, and Serena Yeung!)
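A toy version of this setup, sketched with PyTorch (this does not reproduce the course example exactly; the embedding size, hidden size, and learning rate are illustrative assumptions):

```python
import torch
import torch.nn as nn

vocab = ['h', 'e', 'l', 'o']
stoi = {c: i for i, c in enumerate(vocab)}
seq = "hello"
x = torch.tensor([stoi[c] for c in seq[:-1]])   # inputs:  h, e, l, l
y = torch.tensor([stoi[c] for c in seq[1:]])    # targets: e, l, l, o

embed = nn.Embedding(len(vocab), 8)
rnn = nn.RNN(input_size=8, hidden_size=16, batch_first=True)
head = nn.Linear(16, len(vocab))
params = list(embed.parameters()) + list(rnn.parameters()) + list(head.parameters())
opt = torch.optim.Adam(params, lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

for step in range(200):
    out, _ = rnn(embed(x).unsqueeze(0))         # (1, T, 16)
    logits = head(out).squeeze(0)               # (T, vocab size)
    loss = loss_fn(logits, y)                   # next-character prediction
    opt.zero_grad(); loss.backward(); opt.step()

print(logits.argmax(dim=-1))  # should converge toward the indices of e, l, l, o
```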
17. Dealing with longer timescales
• Learning long-term dependencies is difficult
  - Vanishing and exploding gradient problem (repeated multiplication by matrices with singular values < 1 shrinks gradients; singular values > 1 make them explode)
  - Smaller weight given to long-term interactions
  - Little training success for sequences > 10-20 in length
• Solution: Gated RNNs
  - Control over the timescale of integration of feedback
  - Eliminates repeated matrix multiplies
18. One possible solution: LSTM
• Long Short-Term Memory
  - Provides uninterrupted gradient flow
  - Solves the problem at the expense of more parameters
• As revolutionary for sequential processing as CNNs were for spatial processing
  - Toy problems: long sequence recall, long-distance interactions (math), classification and ordering of widely-separated symbols, noisy inputs, etc.
  - Real applications: neural machine translation, text-to-speech, music and handwriting generation
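As a usage sketch (assuming PyTorch), a gated LSTM is a drop-in replacement for the vanilla recurrence; its cell state provides the gate-controlled, additive path for gradients described above:

```python
# The cell state c_t gives an additive, gate-controlled path for gradients,
# at the cost of roughly 4x the parameters of a plain RNN of the same size.
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=8, hidden_size=16, num_layers=2, batch_first=True)
x = torch.randn(1, 100, 8)                 # (batch, time, features)
output, (h_n, c_n) = lstm(x)
print(output.shape, h_n.shape, c_n.shape)  # (1, 100, 16), (2, 1, 16), (2, 1, 16)
```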
21. Samples from a character-level RNN at successive stages of training: at first, then after training further… and further… and further….
(Andrej Karpathy’s blog: The Unreasonable Effectiveness of Recurrent Neural Networks)
22. After a few hours of training:
(Andrej Karpathy’s blog: The Unreasonable Effectiveness of Recurrent Neural Networks)
23. Can RNNs learn complex syntactic structures?
The Stacks Project: open-source textbook on algebraic geometry
• LaTeX source!
• 455,910 lines of code
(Andrej Karpathy’s blog: The Unreasonable Effectiveness of Recurrent Neural Networks)
24. Algebraic Geometry (LaTeX): generates nearly compilable LaTeX!
(Andrej Karpathy’s blog: The Unreasonable Effectiveness of Recurrent Neural Networks)
25. Algebraic Geometry (LaTeX), continued
(Andrej Karpathy’s blog: The Unreasonable Effectiveness of Recurrent Neural Networks)
26. Algebraic Geometry (LaTeX): too long-term a dependency? It never closes!
(Andrej Karpathy’s blog: The Unreasonable Effectiveness of Recurrent Neural Networks)
27. Code generation?
• Concatenated into a giant file (474 MB of C)
• 10-million-parameter RNN
(Andrej Karpathy’s blog: The Unreasonable Effectiveness of Recurrent Neural Networks)
28. (Andrej Karpathy’s blog: The Unreasonable Effectiveness of Recurrent Neural Networks)
• Concatenated into a giant file (474 MB of C)
• 10-million-parameter RNN
Observations on the generated C:
• Comments here and there
• Proper syntax for strings and pointers
• Correctly learns to use brackets
• Often uses undefined variables!
• Declares variables it never uses!
29. Another problem with long-term dependencies: the generated code stays within scope, but is vacuous!
(Andrej Karpathy’s blog: The Unreasonable Effectiveness of Recurrent Neural Networks)
30. A look at recent state-of-the-art models
Recurrent Neural Networks
Temporal Convolutional Networks
Transformer Networks
31. Temporal Convolutional Networks
(Bai et al., 2018)
TCN = 1D FCN + causal convolution
Benefits:
• Parallelism!
• Flexible receptive field size
• Stable gradients
• Low memory for training
• Variable input lengths
Details:
• Uses dilated convolutions for an exponentially growing receptive field vs. depth (see the sketch after this list)
• Effective history of a layer is (k − 1)·d, where the dilation d = O(2^i) and i is the layer number
• Uses residuals, ReLUs, and weight normalization
• Spatial dropout
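A minimal sketch of one dilated causal convolution in this style (assuming PyTorch; the actual TCN block in Bai et al. additionally uses weight normalization, ReLU, spatial dropout, and a residual connection):

```python
import torch
import torch.nn as nn

class CausalConv1d(nn.Module):
    def __init__(self, channels, kernel_size=3, dilation=1):
        super().__init__()
        # Left-pad so each output only sees current and past time steps.
        self.pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x):                        # x: (batch, channels, time)
        x = nn.functional.pad(x, (self.pad, 0))  # pad the past only (causal)
        return self.conv(x)

# Stacking layers with dilation d = 2^i gives a receptive field that grows
# exponentially with depth: 1 + (k-1)*(1 + 2 + 4 + 8) = 31 steps here for k=3.
layers = nn.Sequential(*[CausalConv1d(16, kernel_size=3, dilation=2**i) for i in range(4)])
y = layers(torch.randn(2, 16, 100))
print(y.shape)                                   # (2, 16, 100)
```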
32. TCNs versus LSTMs
(Bai et al., 2018)
The ‘unlimited memory’ of LSTMs is quite limited compared to the expansive receptive field of the generic TCN.
Copy memory task (last 10 elements evaluated)
33. A look at recent state-of-the-art models
Recurrent Neural Networks
Temporal Convolutional Networks
Transformer Networks
34. Transformer Networks
(Vaswani et al., 2017)
Relies entirely on attention to compute representations!
Details:
• Encoder-decoder structure and auto-regressive model
• Multi-headed self-attention mechanisms
• FC feed forward networks applied to each position separately and identically
• Input and output embeddings used
• No recurrence and no convolution, so must inject positional encodings
Benefits:
• Low computational complexity
• Highly-parallelizable computation
• Low ‘path length’ for long-term dependencies
Attention(Q, K, V) = softmax(QK^T / √d_k) V
• Encoder has self-attention at each layer
• Decoder attends to all positions in the input sequence
• Decoder also has self-attention, masked for causality
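A minimal NumPy sketch of the scaled dot-product attention formula above, including the causal mask used in the decoder; a full Transformer adds multiple heads, learned projections, and positional encodings:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Q, K: (T, d_k); V: (T, d_v). Returns (T, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)             # (T, T) pairwise similarities
    if mask is not None:
        scores = np.where(mask, scores, -1e9)   # block masked positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V

# Causal (decoder-style) mask: position t may attend only to positions <= t.
T, d = 5, 8
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((T, d)) for _ in range(3))
causal = np.tril(np.ones((T, T), dtype=bool))
out = scaled_dot_product_attention(Q, K, V, mask=causal)
print(out.shape)   # (5, 8)
```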
35. Why self-attention?
(Vaswani et al., 2017)
Here n is the sequence length, d is the representation dimension, k is the kernel size for convolutions, and r is the neighborhood size in restricted attention.
It’s not only the length of context that matters, but also the ease with which it can be accessed.
(Table from Vaswani et al., 2017: per-layer complexity, sequential operations, and maximum path length. Recurrent and convolutional layers have longer path lengths; when d > n, recurrence also requires more operations per layer than self-attention.)
36. Transformers vs TCNs
(Vaswani et al., 2017)
Machine translation: even with a relatively limited context (e.g., 128), Transformers beat Google’s TCN for NMT and FAIR’s TCN with attention.
(Dai et al., 2019)
But with a segment-level recurrence mechanism, Transformer-XL is freed of fixed context lengths and it soars (WikiText-103 word-level sequence modeling).
37. Transformer-XL
(Dai et al., 2019)
Continued gains in performance to 1000+ contexts
Total hallucination!
(but nice generalization)
38. Key questions for this talk
How does the brain organize and represent narratives?
How well do deep learning models deal with narrative context?
What can deep learning models tell us about the brain?
Are the more effective ones more brain-like?
39. Are deep neural networks organized by timescale?
Figure: a stack of neural network layers fills in “The boy went out to fly an _____” with “airplane.” Do successive layers correspond to short, intermediate, and long timescales, as in the brain?
40. The methodology
Pipeline: story → neural models → neural activations.
Goal: Determine how well NN layer activations predict fMRI data (regression).
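A hedged sketch of this encoding-model regression (assuming scikit-learn ridge regression; the array shapes, regularization strength, and variable names are illustrative, and real analyses also handle hemodynamic delays and cross-validate the regularizer):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

n_trs, n_features, n_voxels = 1000, 256, 500
rng = np.random.default_rng(0)
layer_activations = rng.standard_normal((n_trs, n_features))  # NN features per TR
fmri = rng.standard_normal((n_trs, n_voxels))                  # BOLD responses

# Fit a regularized linear map from layer activations to every voxel at once,
# holding out the final portion of the story for evaluation.
X_tr, X_te, y_tr, y_te = train_test_split(layer_activations, fmri,
                                          test_size=0.2, shuffle=False)
model = Ridge(alpha=100.0).fit(X_tr, y_tr)
pred = model.predict(X_te)

# Per-voxel performance: correlation between predicted and held-out responses,
# the kind of score mapped onto cortex in the following slides.
r = np.array([np.corrcoef(pred[:, v], y_te[:, v])[0, 1] for v in range(n_voxels)])
print(r.mean())
```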
41. Predicting brain activity with encoding models
Eickenberg et al., NeuroImage 2017
Kell et al., Neuron 2018
42. Relative predictive power of models
Panels: LSTM vs. embedding (Jain et al., 2018); Transformer vs. embedding (Jain et al., unpublished).
43. Layer-specific correlations for LSTM
(Jain et al., 2018)
Callouts: low-level speech-processing region; higher semantic region.
white = no layer preference
44. Open questions
Why do LSTMs perform so poorly?
Not all that predictive.
Not exhibiting layer-specific correlations.
Do TCNs and Transformers exhibit multi-timescale characteristics?
47. Encoding model performance for Transformer
• Averaged across 3 subjects
• Contextual models from all layers outperform the embedding
• Increasing context length (to a point) helps all layers
• Long-context representations are still missing information!
TCNs exhibit similar characteristics but do not seem to learn the same representations.
(Jain et al., unpublished)
48. Summary and Challenges
• The brain’s language pathway is organized into a multi-timescale hierarchy, making it very effective at utilizing context
• Language models are catching up, with Transformer-XL in the lead
• TCNs and Transformers indeed have explicit multi-timescale hierarchies
  - Last layers have lower predictive performance; why?
  - How to get more out of context at longer timescales?
  - Lack of clear timescales in RNNs should lead to a revisiting of their depth characteristics (e.g., see Turek et al. 2019, https://arxiv.org/abs/1909.00021)
• More study needed on representations
- What specific information is captured in representations across the cortex?
- Are the same representations found across deep learning architectures?