Introduction to Transformers for NLP
where we are and how we got here
Olga Petrova
AI Product Manager
DataTalks.Club
Preliminaries
● Who I am:
○ Product Manager for AI PaaS at Scaleway
Scaleway: European cloud provider, originating from France
○ Machine Learning engineer at Scaleway
○ Quantum physicist at École Normale Supérieure, the Max Planck Institute, Johns Hopkins University
www.olgapaints.net
Outline
1. NLP 101: how do we represent words in machine learning
🤖 Tokenization 🤖 Word embeddings
2. Order matters: Recurrent Neural Networks
🤖 RNNs 🤖 Seq2Seq models
🤖 What is wrong with RNNs
3. Attention is all you need!
🤖 What is Attention 🤖 Attention, the linear algebra perspective
4. The Transformer architecture
🤖 Mostly Attention
5. BERT: contextualized word embeddings
Representing words
I play with my dog.
● Machine learning: inputs are vectors / matrices
Example of tabular inputs:
ID Age Gender Weight Diagnosis
264264 27 0 72 1
908696 61 1 80 0
● How do we convert “I play with my dog” into numerical values?
NLP 101 RNNs Attention Transformers BERT
Representing words
I play with my dog.
I play with my cat.
● One hot encoding
● Dimensionality = vocabulary size
Tokens and their vocabulary indices: I = 1, play = 2, with = 3, my = 4, dog = 5, . = 6, cat = 7
One-hot vectors (vocabulary size 7):
I    = [1 0 0 0 0 0 0]
play = [0 1 0 0 0 0 0]
with = [0 0 1 0 0 0 0]
my   = [0 0 0 1 0 0 0]
dog  = [0 0 0 0 1 0 0]
.    = [0 0 0 0 0 1 0]
cat  = [0 0 0 0 0 0 1]
NLP 101 RNNs Attention Transformers BERT
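As a concrete illustration of the one-hot construction above, here is a minimal Python sketch (not from the original slides); the toy vocabulary is just the seven tokens from the example sentences.

# One-hot encoding for the toy vocabulary above (illustration only)
vocab = ["I", "play", "with", "my", "dog", ".", "cat"]
index = {token: i for i, token in enumerate(vocab)}   # token -> position in the vocabulary

def one_hot(token):
    # A vector of length N = vocabulary size, with a single 1 at the token's index
    vector = [0] * len(vocab)
    vector[index[token]] = 1
    return vector

for token in ["I", "play", "with", "my", "cat", "."]:
    print(token, one_hot(token))   # e.g. cat -> [0, 0, 0, 0, 0, 0, 1]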
Tokenization
I play / played / was playing with my dog.
Word = token: play, played, playing
Subword = token: play, #ed, #ing
E.g.: played = play + #ed
● In practice, subword tokenization is used most often
● For simplicity, we’ll go with word = token for the rest of the talk
NLP 101 RNNs Attention Transformers BERT
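In practice the subword vocabulary is learned from data (e.g. WordPiece or BPE). As a hedged sketch, the snippet below assumes the HuggingFace transformers library is installed and uses "bert-base-uncased", one commonly available checkpoint; the exact splits depend on that checkpoint's learned vocabulary.

# Subword tokenization with a learned vocabulary (assumes `pip install transformers`)
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

print(tokenizer.tokenize("I was playing with my dog."))
# Words missing from the vocabulary get split into pieces; BERT's WordPiece marks
# continuation pieces with "##" (the slide's "#ed" is shorthand for the same idea).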
Word embeddings
What do you mean, “meaningful way”?
One hot encodings (N = vocabulary size): cat, dog, kitten, puppy each get an N-dimensional vector with a single 1
● dog : puppy = cat : kitten. How can we record this?
● Goal: a lower-dimensional representation of a word (M << N)
● Words are forced to “group together” in a meaningful way
Embeddings with M = 3:
cat    = [1 0 1]
dog    = [0 1 1]
kitten = [1 0 0]
puppy  = [0 1 0]
(first entry: 1 for feline; second: 1 for canine; third: 1 for adult / 0 for baby)
dog - puppy = cat - kitten
NLP 101 RNNs Attention Transformers BERT
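With the toy M = 3 embeddings above, the dog : puppy = cat : kitten analogy becomes plain vector arithmetic. A quick check in Python (illustrative values only):

import numpy as np

# Toy embeddings from the slide: [feline, canine, adult]
emb = {
    "cat":    np.array([1, 0, 1]),
    "dog":    np.array([0, 1, 1]),
    "kitten": np.array([1, 0, 0]),
    "puppy":  np.array([0, 1, 0]),
}

# dog - puppy and cat - kitten both point in the "adult" direction
print(emb["dog"] - emb["puppy"])    # [0 0 1]
print(emb["cat"] - emb["kitten"])   # [0 0 1]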
Word embeddings
How do we arrive at this representation?
● Pass the one-hot vector through a network with a bottleneck layer
= multiply the N-dim vector by an N×M matrix to get an M-dim vector (see the sketch below)
● Some information gets lost in the process
● The embedding matrix itself is learned through machine learning!
[Figure: one-hot vector × embedding matrix W → embedded vector]
In practice:
● Embedding can be part of the model that you train for an ML task
● Or: use pre-trained embeddings (e.g. Word2Vec, GloVe, etc.)
NLP 101 RNNs Attention Transformers BERT
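A minimal sketch of the bottleneck view, with random placeholder numbers: multiplying an N-dimensional one-hot vector by an N×M matrix simply selects one row of the matrix, which is what an embedding layer (e.g. torch.nn.Embedding) does without ever materializing the one-hot vector.

import numpy as np

N, M = 7, 3                          # toy vocabulary size and embedding dimension
rng = np.random.default_rng(0)
W = rng.normal(size=(N, M))          # embedding matrix; learned during training in practice

one_hot_dog = np.zeros(N)
one_hot_dog[4] = 1                   # position of "dog" in the toy vocabulary (0-based)

embedded = one_hot_dog @ W           # N-dim one-hot times N x M matrix -> M-dim vector
print(np.allclose(embedded, W[4]))   # True: the multiplication is just a row lookup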
Further reading
Tokenization:
blog.floydhub.com/tokenization-nlp/
Word embeddings:
jalammar.github.io/illustrated-word2vec/
NLP 101 RNNs Attention Transformers BERT
Sequence to Sequence models
● Not all seq2seq models are NLP
● Not all NLP models are seq2seq!
● But: they are common enough
● Language modeling, question answering, machine translation, etc.
[Diagram: Sequence 1 → ML model → Sequence 2]
Machine Translation example: “I am a student” → “je suis étudiant”
Takeaways:
1. Multiple elements in a sequence
2. Order of the elements matters
NLP 101 RNNs Attention Transformers BERT
RNNs: Recurrent Neural Networks
[Slide shows the Wikipedia definition of a recurrent neural network]
What does this mean???
NLP 101 RNNs Attention Transformers BERT
RNNs: the seq2seq example
[Diagram: an encoder RNN reads the input sequence, a decoder RNN produces the output sequence]
Inputs (word embeddings)
Outputs (one-hot)
Context vector: contains the meaning of the whole input sequence
Temporal direction (remember the Wikipedia definition?)
NLP 101 RNNs Attention Transformers BERT
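A minimal encoder-decoder sketch of the picture above, assuming PyTorch and toy dimensions (an illustration, not the exact model from the slides): the encoder compresses the whole input sequence into a single context vector, which then initializes the decoder.

import torch
import torch.nn as nn

EMB_DIM, HIDDEN_DIM, VOCAB_OUT = 32, 64, 1000      # hypothetical sizes

class Encoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.rnn = nn.GRU(EMB_DIM, HIDDEN_DIM, batch_first=True)

    def forward(self, src_embeddings):              # (batch, src_len, EMB_DIM)
        _, context = self.rnn(src_embeddings)       # context: (1, batch, HIDDEN_DIM)
        return context                              # the fixed-length context vector

class Decoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.rnn = nn.GRU(EMB_DIM, HIDDEN_DIM, batch_first=True)
        self.to_vocab = nn.Linear(HIDDEN_DIM, VOCAB_OUT)

    def forward(self, tgt_embeddings, context):     # (batch, tgt_len, EMB_DIM)
        out, _ = self.rnn(tgt_embeddings, context)  # decoder conditioned on the context vector
        return self.to_vocab(out)                   # scores over the output vocabulary

encoder, decoder = Encoder(), Decoder()
src = torch.randn(2, 5, EMB_DIM)                    # 2 input sentences, 5 tokens each
tgt = torch.randn(2, 6, EMB_DIM)                    # 6 output positions (teacher forcing)
logits = decoder(tgt, encoder(src))                 # (2, 6, VOCAB_OUT)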
Does this really work?
Yes! (For the most part.) Google Translate has been using these since ~2016.
What is the catch?
● Catch # 1: Recurrent = Non-parallelizable 🐌
● Catch # 2: Relying on a fixed-length context vector to capture the full input sequence. What can go wrong?
“I am a student.” ✅
“It was a wrong number that started it, the telephone ringing three times in the dead of night, and the voice on the other end asking for someone he was not.” 😱
NLP 101 RNNs Attention Transformers BERT
The Attention Mechanism
● Recurrent NN from the previous section
● Send all of the hidden states to the Decoder, not just the last one
● The Decoder determines which input elements get attended to via a softmax layer (sketched in code below)
NLP 101 RNNs Attention Transformers BERT
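A sketch of that idea with toy numbers (dot-product scores are used here for brevity; the original mechanism learns its scoring function): the decoder scores every encoder hidden state, a softmax turns the scores into weights, and the weighted sum of all hidden states becomes the context for the current output step.

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
encoder_states = rng.normal(size=(5, 8))   # 5 input tokens, hidden size 8 (toy values)
decoder_state = rng.normal(size=8)         # current decoder hidden state

scores = encoder_states @ decoder_state    # one score per input token
weights = softmax(scores)                  # attention weights, they sum to 1
context = weights @ encoder_states         # weighted sum of *all* encoder hidden states

print(weights.round(2))                    # which input elements get attended to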
Further reading
jalammar.github.io/illustrated-transformer/
My blog post that this talk is based on + the linear algebra perspective on Attention:
towardsdatascience.com/natural-language-processing-the-age-of-transformers-a36c0265937d
NLP 101 RNNs Attention Transformers BERT
Attention is All You Need © Google, 2017
● Keep the Attention mechanism, throw away the sequential RNN
● Reminder: the basic idea of Attention was to pass all of the hidden states as inputs
● Transformer:
○ Decoder and Encoder both have multiple layers
○ The output hidden states of the Encoder are inputs to all the layers of the Decoder
○ The intermediate hidden states of the Encoder are also inputs… to other sequence elements’ Encoder layers! This is Self-Attention
○ Also Self-Attention in the Decoder
NLP 101 RNNs Attention Transformers BERT
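A minimal sketch of the scaled dot-product Self-Attention that the Transformer stacks in every layer (single head, no masking, toy sizes; the projection matrices are random here but learned in practice):

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 3, 8, 8                 # three elements w1, w2, w3, toy dimensions
X = rng.normal(size=(seq_len, d_model))         # word embeddings (+ position encodings)

Wq, Wk, Wv = (rng.normal(size=(d_model, d_k)) for _ in range(3))   # learned in practice
Q, K, V = X @ Wq, X @ Wk, X @ Wv                # queries, keys, values

scores = Q @ K.T / np.sqrt(d_k)                 # every element scores every other element
A = softmax(scores, axis=-1)                    # row i: attention of w_i over w1..w3
out = A @ V                                     # contextualized representation of each element

print(A.round(2))                               # computed in parallel, bidirectional by design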
Attention all around
● Each input element: word embedding + position encoding
● All of the sequence elements (w1, w2, w3) are being processed in parallel
● Each element’s encoding gets information (through hidden states) from other elements
● Bidirectional by design 😍
NLP 101 RNNs Attention Transformers BERT
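Because all positions are processed in parallel, word order has to be injected explicitly; the Transformer paper does this by adding sinusoidal position encodings to the word embeddings. A short sketch (toy sizes, placeholder embeddings):

import numpy as np

def positional_encoding(seq_len, d_model):
    # Sinusoidal position encodings as described in "Attention is All You Need"
    positions = np.arange(seq_len)[:, None]          # 0 .. seq_len - 1
    dims = np.arange(d_model)[None, :]               # 0 .. d_model - 1
    angles = positions / np.power(10000, (2 * (dims // 2)) / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])            # even dimensions: sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])            # odd dimensions: cosine
    return pe

word_embeddings = np.zeros((3, 8))                   # w1, w2, w3 (toy placeholder values)
encoder_input = word_embeddings + positional_encoding(3, 8)
print(encoder_input.round(2))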
Why Self-Attention?
● “The animal did not cross the road because it was too tired.”
vs.
● “The animal did not cross the road because it was too wide.”
What does “it” refer to? The answer depends on words that come later in the sentence.
A uni-directional RNN would have missed this! Bi-directional is nice.
NLP 101 RNNs Attention Transformers BERT
(Contextual) word embeddings
● Dense (i.e. not sparse) vectors capturing the meaning of tokens/words
● Meaning often depends on the context (surrounding words)
● Self-Attention in Transformers processes context when encoding a token
● Can we just take the Transformer Encoder and use it to embed words?
NLP 101 RNNs Attention Transformers BERT
(Contextual) word embeddings
BERT: Bidirectional Encoder Representations from Transformers
Instead of Word2Vec, use pre-trained BERT for your NLP tasks ;)
NLP 101 RNNs Attention Transformers BERT
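A hedged sketch of how one might do that with the HuggingFace 🤗 Transformers library mentioned at the end of the talk (assumes transformers and PyTorch are installed; "bert-base-uncased" is one publicly available checkpoint):

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("I play with my dog.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One vector per (sub)token, computed from the whole sentence: contextual embeddings
print(outputs.last_hidden_state.shape)   # (1, number of tokens, hidden size)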
What was BERT pre-trained on?
Task # 1: Masked Language Modeling
● Take a text corpus, randomly mask 15% of the words
● Put a softmax layer (dim = vocab size) on top of the BERT encodings to predict the missing words from the words that are present
Task # 2: Next Sentence Prediction
● Input sequence: two sentences separated by a special token
● Binary classification: are these sentences successive or not?
NLP 101 RNNs Attention Transformers BERT
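To see the masked language modeling objective in action, the pipeline API of the HuggingFace library (linked below) can fill in a masked token; a small hedged sketch, with "bert-base-uncased" as the example checkpoint:

from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")   # assumes `pip install transformers`

for prediction in fill_mask("I play with my [MASK]."):         # [MASK] is BERT's mask token
    print(prediction["token_str"], round(prediction["score"], 3))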
Further reading and coding
My blog posts:
towardsdatascience.com/natural-language-processing-the-age-of-transformers-a36c0265937d
blog.scaleway.com/understanding-text-with-bert/
The HuggingFace 🤗 Transformers library: huggingface.co/
NLP 101 RNNs Attention Transformers BERT

Weitere ähnliche Inhalte

Was ist angesagt?

Introduction to Named Entity Recognition
Introduction to Named Entity RecognitionIntroduction to Named Entity Recognition
Introduction to Named Entity RecognitionTomer Lieber
 
Recurrent Neural Networks (RNN) | RNN LSTM | Deep Learning Tutorial | Tensorf...
Recurrent Neural Networks (RNN) | RNN LSTM | Deep Learning Tutorial | Tensorf...Recurrent Neural Networks (RNN) | RNN LSTM | Deep Learning Tutorial | Tensorf...
Recurrent Neural Networks (RNN) | RNN LSTM | Deep Learning Tutorial | Tensorf...Edureka!
 
Transformer Seq2Sqe Models: Concepts, Trends & Limitations (DLI)
Transformer Seq2Sqe Models: Concepts, Trends & Limitations (DLI)Transformer Seq2Sqe Models: Concepts, Trends & Limitations (DLI)
Transformer Seq2Sqe Models: Concepts, Trends & Limitations (DLI)Deep Learning Italia
 
[Paper Reading] Attention is All You Need
[Paper Reading] Attention is All You Need[Paper Reading] Attention is All You Need
[Paper Reading] Attention is All You NeedDaiki Tanaka
 
Building NLP applications with Transformers
Building NLP applications with TransformersBuilding NLP applications with Transformers
Building NLP applications with TransformersJulien SIMON
 
Recurrent Neural Networks, LSTM and GRU
Recurrent Neural Networks, LSTM and GRURecurrent Neural Networks, LSTM and GRU
Recurrent Neural Networks, LSTM and GRUananth
 
GPT-2: Language Models are Unsupervised Multitask Learners
GPT-2: Language Models are Unsupervised Multitask LearnersGPT-2: Language Models are Unsupervised Multitask Learners
GPT-2: Language Models are Unsupervised Multitask LearnersYoung Seok Kim
 
Deep Learning for Natural Language Processing: Word Embeddings
Deep Learning for Natural Language Processing: Word EmbeddingsDeep Learning for Natural Language Processing: Word Embeddings
Deep Learning for Natural Language Processing: Word EmbeddingsRoelof Pieters
 
Sequence Modelling with Deep Learning
Sequence Modelling with Deep LearningSequence Modelling with Deep Learning
Sequence Modelling with Deep LearningNatasha Latysheva
 
Deep learning - A Visual Introduction
Deep learning - A Visual IntroductionDeep learning - A Visual Introduction
Deep learning - A Visual IntroductionLukas Masuch
 
Neural Language Generation Head to Toe
Neural Language Generation Head to Toe Neural Language Generation Head to Toe
Neural Language Generation Head to Toe Hady Elsahar
 
Natural Language processing
Natural Language processingNatural Language processing
Natural Language processingSanzid Kawsar
 
What Is Deep Learning? | Introduction to Deep Learning | Deep Learning Tutori...
What Is Deep Learning? | Introduction to Deep Learning | Deep Learning Tutori...What Is Deep Learning? | Introduction to Deep Learning | Deep Learning Tutori...
What Is Deep Learning? | Introduction to Deep Learning | Deep Learning Tutori...Simplilearn
 
Notes on attention mechanism
Notes on attention mechanismNotes on attention mechanism
Notes on attention mechanismKhang Pham
 
Reinventing Deep Learning
 with Hugging Face Transformers
Reinventing Deep Learning
 with Hugging Face TransformersReinventing Deep Learning
 with Hugging Face Transformers
Reinventing Deep Learning
 with Hugging Face TransformersJulien SIMON
 
Introduction to Deep learning
Introduction to Deep learningIntroduction to Deep learning
Introduction to Deep learningleopauly
 

Was ist angesagt? (20)

Introduction to Named Entity Recognition
Introduction to Named Entity RecognitionIntroduction to Named Entity Recognition
Introduction to Named Entity Recognition
 
Word2Vec
Word2VecWord2Vec
Word2Vec
 
Recurrent Neural Networks (RNN) | RNN LSTM | Deep Learning Tutorial | Tensorf...
Recurrent Neural Networks (RNN) | RNN LSTM | Deep Learning Tutorial | Tensorf...Recurrent Neural Networks (RNN) | RNN LSTM | Deep Learning Tutorial | Tensorf...
Recurrent Neural Networks (RNN) | RNN LSTM | Deep Learning Tutorial | Tensorf...
 
Transformer Seq2Sqe Models: Concepts, Trends & Limitations (DLI)
Transformer Seq2Sqe Models: Concepts, Trends & Limitations (DLI)Transformer Seq2Sqe Models: Concepts, Trends & Limitations (DLI)
Transformer Seq2Sqe Models: Concepts, Trends & Limitations (DLI)
 
[Paper Reading] Attention is All You Need
[Paper Reading] Attention is All You Need[Paper Reading] Attention is All You Need
[Paper Reading] Attention is All You Need
 
Building NLP applications with Transformers
Building NLP applications with TransformersBuilding NLP applications with Transformers
Building NLP applications with Transformers
 
Recurrent Neural Networks, LSTM and GRU
Recurrent Neural Networks, LSTM and GRURecurrent Neural Networks, LSTM and GRU
Recurrent Neural Networks, LSTM and GRU
 
Attention Is All You Need
Attention Is All You NeedAttention Is All You Need
Attention Is All You Need
 
Transformers AI PPT.pptx
Transformers AI PPT.pptxTransformers AI PPT.pptx
Transformers AI PPT.pptx
 
GPT-2: Language Models are Unsupervised Multitask Learners
GPT-2: Language Models are Unsupervised Multitask LearnersGPT-2: Language Models are Unsupervised Multitask Learners
GPT-2: Language Models are Unsupervised Multitask Learners
 
Deep Learning for Natural Language Processing: Word Embeddings
Deep Learning for Natural Language Processing: Word EmbeddingsDeep Learning for Natural Language Processing: Word Embeddings
Deep Learning for Natural Language Processing: Word Embeddings
 
Sequence Modelling with Deep Learning
Sequence Modelling with Deep LearningSequence Modelling with Deep Learning
Sequence Modelling with Deep Learning
 
Deep learning - A Visual Introduction
Deep learning - A Visual IntroductionDeep learning - A Visual Introduction
Deep learning - A Visual Introduction
 
Neural Language Generation Head to Toe
Neural Language Generation Head to Toe Neural Language Generation Head to Toe
Neural Language Generation Head to Toe
 
Natural Language processing
Natural Language processingNatural Language processing
Natural Language processing
 
What Is Deep Learning? | Introduction to Deep Learning | Deep Learning Tutori...
What Is Deep Learning? | Introduction to Deep Learning | Deep Learning Tutori...What Is Deep Learning? | Introduction to Deep Learning | Deep Learning Tutori...
What Is Deep Learning? | Introduction to Deep Learning | Deep Learning Tutori...
 
Notes on attention mechanism
Notes on attention mechanismNotes on attention mechanism
Notes on attention mechanism
 
Transformers in 2021
Transformers in 2021Transformers in 2021
Transformers in 2021
 
Reinventing Deep Learning
 with Hugging Face Transformers
Reinventing Deep Learning
 with Hugging Face TransformersReinventing Deep Learning
 with Hugging Face Transformers
Reinventing Deep Learning
 with Hugging Face Transformers
 
Introduction to Deep learning
Introduction to Deep learningIntroduction to Deep learning
Introduction to Deep learning
 

Ähnlich wie Introduction to Transformers for NLP - Olga Petrova

Kaggle Tweet Sentiment Extraction: 1st place solution
Kaggle Tweet Sentiment Extraction: 1st place solutionKaggle Tweet Sentiment Extraction: 1st place solution
Kaggle Tweet Sentiment Extraction: 1st place solutionArtsemZhyvalkouski
 
CSCE181 Big ideas in NLP
CSCE181 Big ideas in NLPCSCE181 Big ideas in NLP
CSCE181 Big ideas in NLPInsoo Chung
 
Nlp and transformer (v3s)
Nlp and transformer (v3s)Nlp and transformer (v3s)
Nlp and transformer (v3s)H K Yoon
 
Deep Learning for Natural Language Processing
Deep Learning for Natural Language ProcessingDeep Learning for Natural Language Processing
Deep Learning for Natural Language ProcessingParrotAI
 
Transformers and BERT with SageMaker
Transformers and BERT with SageMakerTransformers and BERT with SageMaker
Transformers and BERT with SageMakerSuman Debnath
 
Introduction to Transformers
Introduction to TransformersIntroduction to Transformers
Introduction to TransformersSuman Debnath
 
A Panorama of Natural Language Processing
A Panorama of Natural Language ProcessingA Panorama of Natural Language Processing
A Panorama of Natural Language ProcessingTed Xiao
 
MACHINE-DRIVEN TEXT ANALYSIS
MACHINE-DRIVEN TEXT ANALYSISMACHINE-DRIVEN TEXT ANALYSIS
MACHINE-DRIVEN TEXT ANALYSISMassimo Schenone
 
NeurIPS_2018_ConvAI2_ParticipantSlides.pptx
NeurIPS_2018_ConvAI2_ParticipantSlides.pptxNeurIPS_2018_ConvAI2_ParticipantSlides.pptx
NeurIPS_2018_ConvAI2_ParticipantSlides.pptxKaiduTester
 
Week 3 Deep Learning And POS Tagging Hands-On
Week 3 Deep Learning And POS Tagging Hands-OnWeek 3 Deep Learning And POS Tagging Hands-On
Week 3 Deep Learning And POS Tagging Hands-OnSARCCOM
 
Beyond the Symbols: A 30-minute Overview of NLP
Beyond the Symbols: A 30-minute Overview of NLPBeyond the Symbols: A 30-minute Overview of NLP
Beyond the Symbols: A 30-minute Overview of NLPMENGSAYLOEM1
 
Building a Pipeline for State-of-the-Art Natural Language Processing Using Hu...
Building a Pipeline for State-of-the-Art Natural Language Processing Using Hu...Building a Pipeline for State-of-the-Art Natural Language Processing Using Hu...
Building a Pipeline for State-of-the-Art Natural Language Processing Using Hu...Databricks
 
NLP Bootcamp 2018 : Representation Learning of text for NLP
NLP Bootcamp 2018 : Representation Learning of text for NLPNLP Bootcamp 2018 : Representation Learning of text for NLP
NLP Bootcamp 2018 : Representation Learning of text for NLPAnuj Gupta
 
Engineering Intelligent NLP Applications Using Deep Learning – Part 2
Engineering Intelligent NLP Applications Using Deep Learning – Part 2 Engineering Intelligent NLP Applications Using Deep Learning – Part 2
Engineering Intelligent NLP Applications Using Deep Learning – Part 2 Saurabh Kaushik
 
Recurrent Neural Networks for Text Analysis
Recurrent Neural Networks for Text AnalysisRecurrent Neural Networks for Text Analysis
Recurrent Neural Networks for Text Analysisodsc
 
DotNet 2019 | Pablo Doval - Recurrent Neural Networks with TF2.0
DotNet 2019 | Pablo Doval - Recurrent Neural Networks with TF2.0DotNet 2019 | Pablo Doval - Recurrent Neural Networks with TF2.0
DotNet 2019 | Pablo Doval - Recurrent Neural Networks with TF2.0Plain Concepts
 
Scientists meet Entrepreneurs - AI & Machine Learning, Mark Fishel, Institute...
Scientists meet Entrepreneurs - AI & Machine Learning, Mark Fishel, Institute...Scientists meet Entrepreneurs - AI & Machine Learning, Mark Fishel, Institute...
Scientists meet Entrepreneurs - AI & Machine Learning, Mark Fishel, Institute...MobileMonday Estonia
 
Recipe2Vec: Or how does my robot know what’s tasty
Recipe2Vec: Or how does my robot know what’s tastyRecipe2Vec: Or how does my robot know what’s tasty
Recipe2Vec: Or how does my robot know what’s tastyPyData
 

Ähnlich wie Introduction to Transformers for NLP - Olga Petrova (20)

Kaggle Tweet Sentiment Extraction: 1st place solution
Kaggle Tweet Sentiment Extraction: 1st place solutionKaggle Tweet Sentiment Extraction: 1st place solution
Kaggle Tweet Sentiment Extraction: 1st place solution
 
CSCE181 Big ideas in NLP
CSCE181 Big ideas in NLPCSCE181 Big ideas in NLP
CSCE181 Big ideas in NLP
 
Nlp and transformer (v3s)
Nlp and transformer (v3s)Nlp and transformer (v3s)
Nlp and transformer (v3s)
 
BERT.pptx
BERT.pptxBERT.pptx
BERT.pptx
 
Deep Learning for Natural Language Processing
Deep Learning for Natural Language ProcessingDeep Learning for Natural Language Processing
Deep Learning for Natural Language Processing
 
Transformers and BERT with SageMaker
Transformers and BERT with SageMakerTransformers and BERT with SageMaker
Transformers and BERT with SageMaker
 
Introduction to Transformers
Introduction to TransformersIntroduction to Transformers
Introduction to Transformers
 
A Panorama of Natural Language Processing
A Panorama of Natural Language ProcessingA Panorama of Natural Language Processing
A Panorama of Natural Language Processing
 
MACHINE-DRIVEN TEXT ANALYSIS
MACHINE-DRIVEN TEXT ANALYSISMACHINE-DRIVEN TEXT ANALYSIS
MACHINE-DRIVEN TEXT ANALYSIS
 
NeurIPS_2018_ConvAI2_ParticipantSlides.pptx
NeurIPS_2018_ConvAI2_ParticipantSlides.pptxNeurIPS_2018_ConvAI2_ParticipantSlides.pptx
NeurIPS_2018_ConvAI2_ParticipantSlides.pptx
 
Week 3 Deep Learning And POS Tagging Hands-On
Week 3 Deep Learning And POS Tagging Hands-OnWeek 3 Deep Learning And POS Tagging Hands-On
Week 3 Deep Learning And POS Tagging Hands-On
 
Beyond the Symbols: A 30-minute Overview of NLP
Beyond the Symbols: A 30-minute Overview of NLPBeyond the Symbols: A 30-minute Overview of NLP
Beyond the Symbols: A 30-minute Overview of NLP
 
Building a Pipeline for State-of-the-Art Natural Language Processing Using Hu...
Building a Pipeline for State-of-the-Art Natural Language Processing Using Hu...Building a Pipeline for State-of-the-Art Natural Language Processing Using Hu...
Building a Pipeline for State-of-the-Art Natural Language Processing Using Hu...
 
NLP Bootcamp
NLP BootcampNLP Bootcamp
NLP Bootcamp
 
NLP Bootcamp 2018 : Representation Learning of text for NLP
NLP Bootcamp 2018 : Representation Learning of text for NLPNLP Bootcamp 2018 : Representation Learning of text for NLP
NLP Bootcamp 2018 : Representation Learning of text for NLP
 
Engineering Intelligent NLP Applications Using Deep Learning – Part 2
Engineering Intelligent NLP Applications Using Deep Learning – Part 2 Engineering Intelligent NLP Applications Using Deep Learning – Part 2
Engineering Intelligent NLP Applications Using Deep Learning – Part 2
 
Recurrent Neural Networks for Text Analysis
Recurrent Neural Networks for Text AnalysisRecurrent Neural Networks for Text Analysis
Recurrent Neural Networks for Text Analysis
 
DotNet 2019 | Pablo Doval - Recurrent Neural Networks with TF2.0
DotNet 2019 | Pablo Doval - Recurrent Neural Networks with TF2.0DotNet 2019 | Pablo Doval - Recurrent Neural Networks with TF2.0
DotNet 2019 | Pablo Doval - Recurrent Neural Networks with TF2.0
 
Scientists meet Entrepreneurs - AI & Machine Learning, Mark Fishel, Institute...
Scientists meet Entrepreneurs - AI & Machine Learning, Mark Fishel, Institute...Scientists meet Entrepreneurs - AI & Machine Learning, Mark Fishel, Institute...
Scientists meet Entrepreneurs - AI & Machine Learning, Mark Fishel, Institute...
 
Recipe2Vec: Or how does my robot know what’s tasty
Recipe2Vec: Or how does my robot know what’s tastyRecipe2Vec: Or how does my robot know what’s tasty
Recipe2Vec: Or how does my robot know what’s tasty
 

Mehr von Alexey Grigorev

Codementor - Data Science at OLX
Codementor - Data Science at OLX Codementor - Data Science at OLX
Codementor - Data Science at OLX Alexey Grigorev
 
Data Monitoring with whylogs
Data Monitoring with whylogsData Monitoring with whylogs
Data Monitoring with whylogsAlexey Grigorev
 
Data engineering zoomcamp introduction
Data engineering zoomcamp  introductionData engineering zoomcamp  introduction
Data engineering zoomcamp introductionAlexey Grigorev
 
AI in Fashion - Size & Fit - Nour Karessli
 AI in Fashion - Size & Fit - Nour Karessli AI in Fashion - Size & Fit - Nour Karessli
AI in Fashion - Size & Fit - Nour KaressliAlexey Grigorev
 
AI-Powered Computer Vision Applications in Media Industry - Yulia Pavlova
AI-Powered Computer Vision Applications in Media Industry - Yulia PavlovaAI-Powered Computer Vision Applications in Media Industry - Yulia Pavlova
AI-Powered Computer Vision Applications in Media Industry - Yulia PavlovaAlexey Grigorev
 
ML Zoomcamp 10 - Kubernetes
ML Zoomcamp 10 - KubernetesML Zoomcamp 10 - Kubernetes
ML Zoomcamp 10 - KubernetesAlexey Grigorev
 
Paradoxes in Data Science
Paradoxes in Data ScienceParadoxes in Data Science
Paradoxes in Data ScienceAlexey Grigorev
 
ML Zoomcamp 8 - Neural networks and deep learning
ML Zoomcamp 8 - Neural networks and deep learningML Zoomcamp 8 - Neural networks and deep learning
ML Zoomcamp 8 - Neural networks and deep learningAlexey Grigorev
 
ML Zoomcamp 6 - Decision Trees and Ensemble Learning
ML Zoomcamp 6 - Decision Trees and Ensemble LearningML Zoomcamp 6 - Decision Trees and Ensemble Learning
ML Zoomcamp 6 - Decision Trees and Ensemble LearningAlexey Grigorev
 
ML Zoomcamp 5 - Model deployment
ML Zoomcamp 5 - Model deploymentML Zoomcamp 5 - Model deployment
ML Zoomcamp 5 - Model deploymentAlexey Grigorev
 
ML Zoomcamp 4 - Evaluation Metrics for Classification
ML Zoomcamp 4 - Evaluation Metrics for ClassificationML Zoomcamp 4 - Evaluation Metrics for Classification
ML Zoomcamp 4 - Evaluation Metrics for ClassificationAlexey Grigorev
 
ML Zoomcamp 3 - Machine Learning for Classification
ML Zoomcamp 3 - Machine Learning for ClassificationML Zoomcamp 3 - Machine Learning for Classification
ML Zoomcamp 3 - Machine Learning for ClassificationAlexey Grigorev
 
ML Zoomcamp Week #2 Office Hours
ML Zoomcamp Week #2 Office HoursML Zoomcamp Week #2 Office Hours
ML Zoomcamp Week #2 Office HoursAlexey Grigorev
 
AMLD2021 - ML in online marketplaces
AMLD2021 - ML in online marketplacesAMLD2021 - ML in online marketplaces
AMLD2021 - ML in online marketplacesAlexey Grigorev
 
ML Zoomcamp 2.1 - Car Price Prediction Project
ML Zoomcamp 2.1 - Car Price Prediction ProjectML Zoomcamp 2.1 - Car Price Prediction Project
ML Zoomcamp 2.1 - Car Price Prediction ProjectAlexey Grigorev
 
ML Zoomcamp - Course Overview and Logistics
ML Zoomcamp - Course Overview and LogisticsML Zoomcamp - Course Overview and Logistics
ML Zoomcamp - Course Overview and LogisticsAlexey Grigorev
 

Mehr von Alexey Grigorev (20)

MLOps week 1 intro
MLOps week 1 introMLOps week 1 intro
MLOps week 1 intro
 
Codementor - Data Science at OLX
Codementor - Data Science at OLX Codementor - Data Science at OLX
Codementor - Data Science at OLX
 
Data Monitoring with whylogs
Data Monitoring with whylogsData Monitoring with whylogs
Data Monitoring with whylogs
 
Data engineering zoomcamp introduction
Data engineering zoomcamp  introductionData engineering zoomcamp  introduction
Data engineering zoomcamp introduction
 
AI in Fashion - Size & Fit - Nour Karessli
 AI in Fashion - Size & Fit - Nour Karessli AI in Fashion - Size & Fit - Nour Karessli
AI in Fashion - Size & Fit - Nour Karessli
 
AI-Powered Computer Vision Applications in Media Industry - Yulia Pavlova
AI-Powered Computer Vision Applications in Media Industry - Yulia PavlovaAI-Powered Computer Vision Applications in Media Industry - Yulia Pavlova
AI-Powered Computer Vision Applications in Media Industry - Yulia Pavlova
 
ML Zoomcamp 10 - Kubernetes
ML Zoomcamp 10 - KubernetesML Zoomcamp 10 - Kubernetes
ML Zoomcamp 10 - Kubernetes
 
Paradoxes in Data Science
Paradoxes in Data ScienceParadoxes in Data Science
Paradoxes in Data Science
 
ML Zoomcamp 8 - Neural networks and deep learning
ML Zoomcamp 8 - Neural networks and deep learningML Zoomcamp 8 - Neural networks and deep learning
ML Zoomcamp 8 - Neural networks and deep learning
 
Algorithmic fairness
Algorithmic fairnessAlgorithmic fairness
Algorithmic fairness
 
MLOps at OLX
MLOps at OLXMLOps at OLX
MLOps at OLX
 
ML Zoomcamp 6 - Decision Trees and Ensemble Learning
ML Zoomcamp 6 - Decision Trees and Ensemble LearningML Zoomcamp 6 - Decision Trees and Ensemble Learning
ML Zoomcamp 6 - Decision Trees and Ensemble Learning
 
ML Zoomcamp 5 - Model deployment
ML Zoomcamp 5 - Model deploymentML Zoomcamp 5 - Model deployment
ML Zoomcamp 5 - Model deployment
 
ML Zoomcamp 4 - Evaluation Metrics for Classification
ML Zoomcamp 4 - Evaluation Metrics for ClassificationML Zoomcamp 4 - Evaluation Metrics for Classification
ML Zoomcamp 4 - Evaluation Metrics for Classification
 
ML Zoomcamp 3 - Machine Learning for Classification
ML Zoomcamp 3 - Machine Learning for ClassificationML Zoomcamp 3 - Machine Learning for Classification
ML Zoomcamp 3 - Machine Learning for Classification
 
ML Zoomcamp Week #2 Office Hours
ML Zoomcamp Week #2 Office HoursML Zoomcamp Week #2 Office Hours
ML Zoomcamp Week #2 Office Hours
 
AMLD2021 - ML in online marketplaces
AMLD2021 - ML in online marketplacesAMLD2021 - ML in online marketplaces
AMLD2021 - ML in online marketplaces
 
ML Zoomcamp 2 - Slides
ML Zoomcamp 2 - SlidesML Zoomcamp 2 - Slides
ML Zoomcamp 2 - Slides
 
ML Zoomcamp 2.1 - Car Price Prediction Project
ML Zoomcamp 2.1 - Car Price Prediction ProjectML Zoomcamp 2.1 - Car Price Prediction Project
ML Zoomcamp 2.1 - Car Price Prediction Project
 
ML Zoomcamp - Course Overview and Logistics
ML Zoomcamp - Course Overview and LogisticsML Zoomcamp - Course Overview and Logistics
ML Zoomcamp - Course Overview and Logistics
 

Kürzlich hochgeladen

Gardella_Mateo_IntellectualProperty.pdf.
Gardella_Mateo_IntellectualProperty.pdf.Gardella_Mateo_IntellectualProperty.pdf.
Gardella_Mateo_IntellectualProperty.pdf.MateoGardella
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingTechSoup
 
Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..Disha Kariya
 
Unit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptxUnit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptxVishalSingh1417
 
APM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAPM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAssociation for Project Management
 
Mixin Classes in Odoo 17 How to Extend Models Using Mixin Classes
Mixin Classes in Odoo 17  How to Extend Models Using Mixin ClassesMixin Classes in Odoo 17  How to Extend Models Using Mixin Classes
Mixin Classes in Odoo 17 How to Extend Models Using Mixin ClassesCeline George
 
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in DelhiRussian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhikauryashika82
 
Unit-V; Pricing (Pharma Marketing Management).pptx
Unit-V; Pricing (Pharma Marketing Management).pptxUnit-V; Pricing (Pharma Marketing Management).pptx
Unit-V; Pricing (Pharma Marketing Management).pptxVishalSingh1417
 
Z Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphZ Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphThiyagu K
 
Making and Justifying Mathematical Decisions.pdf
Making and Justifying Mathematical Decisions.pdfMaking and Justifying Mathematical Decisions.pdf
Making and Justifying Mathematical Decisions.pdfChris Hunter
 
Web & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfWeb & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfJayanti Pande
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...EduSkills OECD
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityGeoBlogs
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introductionMaksud Ahmed
 
Ecological Succession. ( ECOSYSTEM, B. Pharmacy, 1st Year, Sem-II, Environmen...
Ecological Succession. ( ECOSYSTEM, B. Pharmacy, 1st Year, Sem-II, Environmen...Ecological Succession. ( ECOSYSTEM, B. Pharmacy, 1st Year, Sem-II, Environmen...
Ecological Succession. ( ECOSYSTEM, B. Pharmacy, 1st Year, Sem-II, Environmen...Shubhangi Sonawane
 

Kürzlich hochgeladen (20)

Gardella_Mateo_IntellectualProperty.pdf.
Gardella_Mateo_IntellectualProperty.pdf.Gardella_Mateo_IntellectualProperty.pdf.
Gardella_Mateo_IntellectualProperty.pdf.
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy Consulting
 
Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..
 
Unit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptxUnit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptx
 
APM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAPM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across Sectors
 
Mixin Classes in Odoo 17 How to Extend Models Using Mixin Classes
Mixin Classes in Odoo 17  How to Extend Models Using Mixin ClassesMixin Classes in Odoo 17  How to Extend Models Using Mixin Classes
Mixin Classes in Odoo 17 How to Extend Models Using Mixin Classes
 
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptxINDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
 
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in DelhiRussian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
 
Unit-V; Pricing (Pharma Marketing Management).pptx
Unit-V; Pricing (Pharma Marketing Management).pptxUnit-V; Pricing (Pharma Marketing Management).pptx
Unit-V; Pricing (Pharma Marketing Management).pptx
 
Z Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphZ Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot Graph
 
Making and Justifying Mathematical Decisions.pdf
Making and Justifying Mathematical Decisions.pdfMaking and Justifying Mathematical Decisions.pdf
Making and Justifying Mathematical Decisions.pdf
 
Mehran University Newsletter Vol-X, Issue-I, 2024
Mehran University Newsletter Vol-X, Issue-I, 2024Mehran University Newsletter Vol-X, Issue-I, 2024
Mehran University Newsletter Vol-X, Issue-I, 2024
 
Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"
 
Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1
 
Web & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfWeb & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdf
 
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activity
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introduction
 
Ecological Succession. ( ECOSYSTEM, B. Pharmacy, 1st Year, Sem-II, Environmen...
Ecological Succession. ( ECOSYSTEM, B. Pharmacy, 1st Year, Sem-II, Environmen...Ecological Succession. ( ECOSYSTEM, B. Pharmacy, 1st Year, Sem-II, Environmen...
Ecological Succession. ( ECOSYSTEM, B. Pharmacy, 1st Year, Sem-II, Environmen...
 

Introduction to Transformers for NLP - Olga Petrova

  • 1. Introduction to Transformers for NLP where we are and how we got here Olga Petrova AI Product Manager DataTalks.Club
  • 2. Preliminaries ● Who I am: ○ Product Manager for AI PaaS at Scaleway Scaleway: European cloud provider, originating from France
  • 3. Preliminaries ● Who I am: ○ Product Manager for AI PaaS at Scaleway Scaleway: European cloud provider, originating from France ○ Machine Learning engineer at Scaleway
  • 4. Preliminaries ● Who I am: ○ Product Manager for AI PaaS at Scaleway Scaleway: European cloud provider, originating from France ○ Machine Learning engineer at Scaleway ○ Quantum physicist at École Normale Supérieure, the Max Planck Institute, Johns Hopkins University
  • 5. Preliminaries ● Who I am: ○ Product Manager for AI PaaS at Scaleway Scaleway: European cloud provider, originating from France ○ Machine Learning engineer at Scaleway ○ Quantum physicist at École Normale Supérieure, the Max Planck Institute, Johns Hopkins University www.olgapaints.net
  • 6. Outline 1. NLP 101: how do we represent words in machine learning 🤖 Tokenization 🤖 Word embeddings 2. Order matters: Recurrent Neural Networks 🤖 RNNs 🤖 Seq2Seq models 🤖 What is wrong with RNNs 3. Attention is all you need! 🤖 What is Attention 🤖 Attention, the linear algebra perspective 4. The Transformer architecture 🤖 Mostly Attention 5. BERT: contextualized word embeddings
  • 7. Outline 1. NLP 101: how do we represent words in machine learning 🤖 Tokenization 🤖 Word embeddings 2. Order matters: Recurrent Neural Networks 🤖 RNNs 🤖 Seq2Seq models 🤖 What is wrong with RNNs 3. Attention is all you need! 🤖 What is Attention 🤖 Attention, the linear algebra perspective 4. The Transformer architecture 🤖 Mostly Attention 5. BERT: contextualized word embeddings
  • 8. Representing words NLP 101 RNNs Attention Transformers BERT
  • 9. Representing words I play with my dog. NLP 101 RNNs Attention Transformers BERT
  • 10. Representing words I play with my dog. ● Machine learning: inputs are vectors / matrices NLP 101 RNNs Attention Transformers BERT
  • 11. Representing words I play with my dog. ● Machine learning: inputs are vectors / matrices ID Age Gender Weight Diagnosis 264264 27 0 72 1 908696 61 1 80 0 NLP 101 RNNs Attention Transformers BERT
  • 12. Representing words ID Age Gender Weight Diagnosis 264264 27 0 72 1 908696 61 1 80 0 I play with my dog. ● Machine learning: inputs are vectors / matrices ● How do we convert “I play with my dog” into numerical values? NLP 101 RNNs Attention Transformers BERT
  • 13. Representing words I play with my dog. I play with my dog . NLP 101 RNNs Attention Transformers BERT
  • 14. Representing words I play with my dog. I play with my dog . 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 ● One hot encoding ● Dimensionality = vocabulary size 1 0 0 0 0 0 1 2 3 4 5 6 NLP 101 RNNs Attention Transformers BERT
  • 15. Representing words I play with my dog. I play with my cat. I play with my dog . 1 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 ● One hot encoding ● Dimensionality = vocabulary size 1 2 3 4 5 6 NLP 101 RNNs Attention Transformers BERT
  • 16. Representing words I play with my dog. I play with my cat. I play with my dog . 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 ● One hot encoding ● Dimensionality = vocabulary size cat 0 0 0 0 0 0 1 1 2 3 4 5 6 7 NLP 101 RNNs Attention Transformers BERT
  • 17. play played playing I play / played / was playing with my dog. Word = token: , , Subword = token: , , E.g.: played = ; ● In practice, subword tokenization is used most often ● For simplicity, we’ll go with word = token for the rest of the talk Tokenization play #ed #ing play #ed NLP 101 RNNs Attention Transformers BERT
  • 18. Word embeddings = …… N = …… = …… = …… One hot encodings N = vocabulary size ● puppy : dog = kitten : cat. How can we record this? ● Goal: a lower-dimensional representation of a word (M << N) ● Words are forced to “group together” in a meaningful way 0 1 cat 0 0 dog 0 0 1 0 0 0 0 1 kitten 0 0 1 0 puppy NLP 101 RNNs Attention Transformers BERT
  • 19. Word embeddings = …… N = …… = …… = …… One hot encodings ● dog : puppy = cat : kitten. How can we record this? ● Goal: a lower-dimensional representation of a word (M << N) ● Words are forced to “group together” in a meaningful way 0 1 cat 0 0 dog 0 0 1 0 0 0 0 1 kitten 0 0 1 0 puppy NLP 101 RNNs Attention Transformers BERT
  • 20. Word embeddings = …… N = …… = …… = …… One hot encodings ● dog : puppy = cat : kitten. How can we record this? ● Goal: a lower-dimensional representation of a word (M << N) ● Words are forced to “group together” in a meaningful way 0 1 cat 0 0 dog 0 0 1 0 0 0 0 1 kitten 0 0 1 0 puppy NLP 101 RNNs Attention Transformers BERT
  • 21. Word embeddings = …… N = …… = …… = …… One hot encodings ● dog : puppy = cat : kitten. How can we record this? ● Goal: a lower-dimensional representation of a word (M << N) ● Words are forced to “group together” in a meaningful way 0 1 cat 0 0 dog 0 0 1 0 0 0 0 1 kitten 0 0 1 0 puppy 1 0 1 0 1 1 1 0 0 0 1 0 = M = = = cat dog kitten puppy NLP 101 RNNs Attention Transformers BERT
  • 22. Word embeddings What do you mean, “meaningful way”? = …… N = …… = …… = …… One hot encodings ● dog : puppy = cat : kitten. How can we record this? ● Goal: a lower-dimensional representation of a word (M << N) ● Words are forced to “group together” in a meaningful way 1 0 1 0 1 1 1 0 0 0 1 0 = M = = = cat dog kitten puppy NLP 101 RNNs Attention Transformers BERT
  • 23. Word embeddings What do you mean, “meaningful way”? = …… N = …… = …… = …… One hot encodings ● dog : puppy = cat : kitten. How can we record this? ● Goal: a lower-dimensional representation of a word (M << N) ● Words are forced to “group together” in a meaningful way 1 0 1 0 1 1 1 0 0 0 1 0 = M = = = cat dog kitten puppy 1 for feline 1 for canine - = - 1 for adult / 0 for baby dog puppy cat kitten NLP 101 RNNs Attention Transformers BERT
  • 24. How do we arrive at this representation? Word embeddings NLP 101 RNNs Attention Transformers BERT
  • 25. How do we arrive at this representation? ● Pass the one-hot vector through a network with a bottleneck layer Word embeddings NLP 101 RNNs Attention Transformers BERT
  • 26. Word embeddings How do we arrive at this representation? ● Pass the one-hot vector through a network with a bottleneck layer = multiply the N-dim vector by an N×M matrix to get an M-dim vector [Figure: one-hot vector × embedding matrix W → embedded vector] NLP 101 RNNs Attention Transformers BERT
  • 27. Word embeddings How do we arrive at this representation? ● Pass the one-hot vector through a network with a bottleneck layer = multiply the N-dim vector by an N×M matrix to get an M-dim vector ● Some information gets lost in the process [Figure: one-hot vector × embedding matrix W → embedded vector] NLP 101 RNNs Attention Transformers BERT
  • 28. Word embeddings How do we arrive at this representation? ● Pass the one-hot vector through a network with a bottleneck layer = multiply the N-dim vector by an N×M matrix to get an M-dim vector ● Some information gets lost in the process ● Where does the embedding matrix come from? Through machine learning! [Figure: one-hot vector × embedding matrix W → embedded vector] In practice: ● Embedding can be part of your model that you train for an ML task ● Or: use pre-trained embeddings (e.g. Word2Vec, GloVe, etc) NLP 101 RNNs Attention Transformers BERT
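A short PyTorch sketch (not from the slides; the sizes are made up) showing that multiplying a one-hot vector by the N×M embedding matrix is the same operation as an nn.Embedding lookup, which is what one uses in practice:

```python
import torch
import torch.nn as nn

N, M = 7, 3                      # vocabulary size, embedding dimension
W = torch.randn(N, M)            # the (here untrained) embedding matrix

# One-hot route: an N-dim one-hot vector times the N x M matrix ...
one_hot = torch.zeros(N)
one_hot[4] = 1.0                 # token id 4, e.g. "dog"
via_matmul = one_hot @ W         # -> M-dim vector

# ... is just row 4 of W, which is what nn.Embedding looks up directly.
embedding = nn.Embedding(N, M)
embedding.weight.data = W.clone()
via_lookup = embedding(torch.tensor(4))

print(torch.allclose(via_matmul, via_lookup))  # True
```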
  • 30. Outline 1. NLP 101: how do we represent words in machine learning 🤖 Tokenization 🤖 Word embeddings 2. Order matters: Recurrent Neural Networks 🤖 RNNs 🤖 Seq2Seq models 🤖 What is wrong with RNNs 3. Attention is all you need! 🤖 What is Attention 🤖 Attention, the linear algebra perspective 4. The Transformer architecture 🤖 Mostly Attention 5. BERT: contextualized word embeddings
  • 31. Sequence to Sequence models ● Not all seq2seq models are NLP models ● Not all NLP models are seq2seq! ● But: they are common enough NLP 101 RNNs Attention Transformers BERT
  • 32. Sequence to Sequence models ● Not all seq2seq models are NLP models ● Not all NLP models are seq2seq! ● But: they are common enough Sequence 1 → ML model → Sequence 2 NLP 101 RNNs Attention Transformers BERT
  • 33. Sequence to Sequence models ● Not all seq2seq models are NLP models ● Not all NLP models are seq2seq! ● But: they are common enough ● Language modeling, question answering, machine translation, etc Sequence 1 → ML model → Sequence 2 Machine Translation: "I am a student" → "je suis étudiant" NLP 101 RNNs Attention Transformers BERT
  • 34. Sequence to Sequence models ● Not all seq2seq models are NLP models ● Not all NLP models are seq2seq! ● But: they are common enough ● Language modeling, question answering, machine translation, etc Sequence 1 → ML model → Sequence 2 Machine Translation: "I am a student" → "je suis étudiant" Takeaways: 1. Multiple elements in a sequence 2. Order of the elements matters NLP 101 RNNs Attention Transformers BERT
  • 35. RNNs: Recurrent Neural Networks NLP 101 RNNs Attention Transformers BERT
  • 36. RNNs: Recurrent Neural Networks What does this mean??? NLP 101 RNNs Attention Transformers BERT
  • 37. RNNs: the seq2seq example NLP 101 RNNs Attention Transformers BERT
  • 38. RNNs: the seq2seq example NLP 101 RNNs Attention Transformers BERT
  • 39. RNNs: the seq2seq example NLP 101 RNNs Attention Transformers BERT
  • 40. RNNs: the seq2seq example Output sequence Input sequence NLP 101 RNNs Attention Transformers BERT
  • 41. RNNs: the seq2seq example NLP 101 RNNs Attention Transformers BERT Outputs (one-hot) Inputs (word embeddings)
  • 42. RNNs: the seq2seq example NLP 101 RNNs Attention Transformers BERT Outputs (one-hot) Inputs (word embeddings)
  • 43. RNNs: the seq2seq example NLP 101 RNNs Attention Transformers BERT Outputs (one-hot) Inputs (word embeddings)
  • 44. RNNs: the seq2seq example NLP 101 RNNs Attention Transformers BERT Outputs (one-hot) Inputs (word embeddings)
  • 45. RNNs: the seq2seq example NLP 101 RNNs Attention Transformers BERT Outputs (one-hot) Inputs (word embeddings)
  • 46. RNNs: the seq2seq example NLP 101 RNNs Attention Transformers BERT Outputs (one-hot) Inputs (word embeddings)
  • 47. RNNs: the seq2seq example NLP 101 RNNs Attention Transformers BERT Outputs (one-hot) Context vector: contains the meaning of the whole input sequence Inputs (word embeddings)
  • 48. RNNs: the seq2seq example NLP 101 RNNs Attention Transformers BERT Outputs (one-hot) Context vector: contains the meaning of the whole input sequence Inputs (word embeddings) Temporal direction (remember the Wikipedia definition?)
  • 49. RNNs: the seq2seq example NLP 101 RNNs Attention Transformers BERT Outputs (one-hot) Context vector: contains the meaning of the whole input sequence Inputs (word embeddings) Temporal direction (remember the Wikipedia definition?)
  • 50. RNNs: the seq2seq example NLP 101 RNNs Attention Transformers BERT Outputs (one-hot) Inputs (word embeddings)
  • 51. RNNs: the seq2seq example NLP 101 RNNs Attention Transformers BERT Outputs (one-hot) Inputs (word embeddings)
  • 52. RNNs: the seq2seq example NLP 101 RNNs Attention Transformers BERT Outputs (one-hot) Inputs (word embeddings)
  • 53. RNNs: the seq2seq example NLP 101 RNNs Attention Transformers BERT Outputs (one-hot) Inputs (word embeddings)
  • 54. RNNs: the seq2seq example NLP 101 RNNs Attention Transformers BERT Outputs (one-hot) Inputs (word embeddings)
  • 55. RNNs: the seq2seq example NLP 101 RNNs Attention Transformers BERT Outputs (one-hot) Inputs (word embeddings)
  • 56. RNNs: the seq2seq example NLP 101 RNNs Attention Transformers BERT Outputs (one-hot) Inputs (word embeddings)
  • 57. RNNs: the seq2seq example NLP 101 RNNs Attention Transformers BERT Outputs (one-hot) Inputs (word embeddings)
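To make the encoder-decoder picture concrete, here is a minimal PyTorch sketch (not from the slides; the GRU cell, layer sizes and vocabulary sizes are arbitrary assumptions). The encoder's final hidden state plays the role of the fixed-length context vector, which the decoder then unrolls into the output sequence one token at a time:

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=32, hidden_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True)

    def forward(self, token_ids):                 # (batch, seq_len)
        embedded = self.embed(token_ids)          # (batch, seq_len, emb_dim)
        outputs, hidden = self.rnn(embedded)      # hidden: (1, batch, hidden_dim)
        return outputs, hidden                    # hidden = the context vector

class Decoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=32, hidden_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)   # scores over the output vocab

    def forward(self, prev_token, hidden):        # one decoding step
        embedded = self.embed(prev_token)         # (batch, 1, emb_dim)
        output, hidden = self.rnn(embedded, hidden)
        return self.out(output), hidden           # (batch, 1, vocab_size)

src = torch.randint(0, 100, (1, 5))               # a dummy 5-token input sentence
_, context = Encoder(100)(src)
logits, _ = Decoder(120)(torch.zeros(1, 1, dtype=torch.long), context)
print(logits.shape)                               # torch.Size([1, 1, 120])
```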
  • 58. Does this really work? NLP 101 RNNs Attention Transformers BERT
  • 59. Does this really work? Yes! (For the most part) Google Translate has been using these since ~ 2016 NLP 101 RNNs Attention Transformers BERT
  • 60. Does this really work? Yes! (For the most part) Google Translate has been using these since ~ 2016 What is the catch? NLP 101 RNNs Attention Transformers BERT
  • 61. Does this really work? Yes! (For the most part) Google Translate has been using these since ~ 2016 What is the catch? ● Catch # 1: Recurrent = Non-parallelizable 🐌 NLP 101 RNNs Attention Transformers BERT
  • 62. Does this really work? Yes! (For the most part) Google Translate has been using these since ~ 2016 What is the catch? ● Catch # 1: Recurrent = Non-parallelizable 🐌 ● Catch # 2: Relying on a fixed-length context vector to capture the full input sequence. What can go wrong? NLP 101 RNNs Attention Transformers BERT
  • 63. Does this really work? Yes! (For the most part) Google Translate has been using these since ~ 2016 What is the catch? ● Catch # 1: Recurrent = Non-parallelizable 🐌 ● Catch # 2: Relying on a fixed-length context vector to capture the full input sequence. What can go wrong? “I am a student.” NLP 101 RNNs Attention Transformers BERT
  • 64. Does this really work? Yes! (For the most part) Google Translate has been using these since ~ 2016 What is the catch? ● Catch # 1: Recurrent = Non-parallelizable 🐌 ● Catch # 2: Relying on a fixed-length context vector to capture the full input sequence. What can go wrong? “I am a student.” ✅ NLP 101 RNNs Attention Transformers BERT
  • 65. Does this really work? Yes! (For the most part) Google Translate has been using these since ~ 2016 What is the catch? ● Catch # 1: Recurrent = Non-parallelizable 🐌 ● Catch # 2: Relying on a fixed-length context vector to capture the full input sequence. What can go wrong? “I am a student.” ✅ “It was a wrong number that started it, the telephone ringing three times in the dead of night, and the voice on the other end asking for someone he was not.” NLP 101 RNNs Attention Transformers BERT
  • 66. Does this really work? Yes! (For the most part) Google Translate has been using these since ~ 2016 What is the catch? ● Catch # 1: Recurrent = Non-parallelizable 🐌 ● Catch # 2: Relying on a fixed-length context vector to capture the full input sequence. What can go wrong? “I am a student.” ✅ “It was a wrong number that started it, the telephone ringing three times in the dead of night, and the voice on the other end asking for someone he was not.” 😱 NLP 101 RNNs Attention Transformers BERT
  • 67. Outline 1. NLP 101: how do we represent words in machine learning 🤖 Tokenization 🤖 Word embeddings 2. Order matters: Recurrent Neural Networks 🤖 RNNs 🤖 Seq2Seq models 🤖 What is wrong with RNNs 3. Attention is all you need! 🤖 What is Attention 🤖 Attention, the linear algebra perspective 4. The Transformer architecture 🤖 Mostly Attention 5. BERT: contextualized word embeddings
  • 68. The Attention Mechanism NLP 101 RNNs Attention Transformers BERT
  • 69. The Attention Mechanism NLP 101 RNNs Attention Transformers BERT Recurrent NN from the previous section
  • 70. The Attention Mechanism NLP 101 RNNs Attention Transformers BERT Recurrent NN from the previous section
  • 71. The Attention Mechanism NLP 101 RNNs Attention Transformers BERT Recurrent NN from the previous section Send all of the hidden states to the Decoder, not just the last one
  • 72. The Attention Mechanism NLP 101 RNNs Attention Transformers BERT Recurrent NN from the previous section Send all of the hidden states to the Decoder, not just the last one The Decoder determines which input elements get attended to via a softmax layer
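In code, the mechanism on this slide amounts to three steps: score each encoder hidden state against the current decoder state, turn the scores into weights with a softmax, and take the weighted sum of the hidden states. A hedged NumPy sketch (dot-product scoring is just one common choice; learned additive scoring is another):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

hidden_dim, src_len = 64, 5
rng = np.random.default_rng(0)
encoder_states = rng.normal(size=(src_len, hidden_dim))  # one per input token
decoder_state = rng.normal(size=hidden_dim)              # current decoder state

scores = encoder_states @ decoder_state        # how relevant is each input token?
weights = softmax(scores)                      # attention weights, sum to 1
context = weights @ encoder_states             # weighted sum of encoder states

print(weights.round(2), context.shape)         # 5 weights, (64,) context vector
```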
  • 73. Further reading Jay Alammar’s Illustrated Transformer: jalammar.github.io/illustrated-transformer/ My blog post that this talk is based on, plus the linear algebra perspective on Attention: towardsdatascience.com/natural-language-processing-the-age-of-transformers-a36c0265937d NLP 101 RNNs Attention Transformers BERT
  • 74. Outline 1. NLP 101: how do we represent words in machine learning 🤖 Tokenization 🤖 Word embeddings 2. Order matters: Recurrent Neural Networks 🤖 RNNs 🤖 Seq2Seq models 🤖 What is wrong with RNNs 3. Attention is all you need! 🤖 What is Attention 🤖 Attention, the linear algebra perspective 4. The Transformer architecture 🤖 Mostly Attention 5. BERT: contextualized word embeddings
  • 75. Attention is All You Need © Google, 2017 NLP 101 RNNs Attention Transformers BERT
  • 76. Attention is All You Need © Google, 2017 ● Keep the Attention mechanism, throw away the sequential RNN ● Reminder: the basic idea of Attention was to pass all of the hidden states as inputs NLP 101 RNNs Attention Transformers BERT
  • 77. Attention is All You Need © Google, 2017 ● Keep the Attention mechanism, throw away the sequential RNN ● Reminder: the basic idea of Attention was to pass all of the hidden states as inputs ● Transformer: ○ Decoder and Encoder both have multiple layers NLP 101 RNNs Attention Transformers BERT
  • 78. Attention is All You Need © Google, 2017 ● Keep the Attention mechanism, throw away the sequential RNN ● Reminder: the basic idea of Attention was to pass all of the hidden states as inputs ● Transformer: ○ Decoder and Encoder both have multiple layers ○ The output hidden states of the Encoder are inputs to all the layers of the Decoder NLP 101 RNNs Attention Transformers BERT
  • 79. Attention is All You Need © Google, 2017 ● Keep the Attention mechanism, throw away the sequential RNN ● Reminder: the basic idea of Attention was to pass all of the hidden states as inputs ● Transformer: ○ Decoder and Encoder both have multiple layers ○ The output hidden states of the Encoder are inputs to all the layers of the Decoder ○ The intermediate hidden states of the Encoder are also inputs… to other sequence elements’ Encoder layers! Self-Attention NLP 101 RNNs Attention Transformers BERT
  • 80. Attention is All You Need © Google, 2017 ● Keep the Attention mechanism, throw away the sequential RNN ● Reminder: the basic idea of Attention was to pass all of the hidden states as inputs ● Transformer: ○ Decoder and Encoder both have multiple layers ○ The output hidden states of the Encoder are inputs to all the layers of the Decoder ○ The intermediate hidden states of the Encoder are also inputs… to other sequence elements’ Encoder layers! Self-Attention ○ Also Self-Attention in the Decoder NLP 101 RNNs Attention Transformers BERT
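A single-head sketch of the Self-Attention computation at the heart of the paper (the real Transformer runs several such heads in parallel and wraps them in residual connections, layer normalization and feed-forward blocks): every token is projected into a query, a key and a value, and attends to every other token via scaled dot products.

```python
import numpy as np

def scaled_dot_product_self_attention(X, Wq, Wk, Wv):
    """X: (seq_len, d_model) token embeddings; W*: learned projection matrices."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                    # (seq_len, seq_len)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # row-wise softmax
    return weights @ V                                 # each row mixes all tokens

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 6, 32, 16
X = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_k)) for _ in range(3))
print(scaled_dot_product_self_attention(X, Wq, Wk, Wv).shape)  # (6, 16)
```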
  • 81. Attention all around NLP 101 RNNs Attention Transformers BERT
  • 82. Attention all around NLP 101 RNNs Attention Transformers BERT Word embedding + position encoding
  • 83. Attention all around ● All of the sequence elements (w1, w2, w3) are being processed in parallel ● Each element’s encoding gets information (through hidden states) from other elements ● Bidirectional by design 😍 NLP 101 RNNs Attention Transformers BERT
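Since the parallel Self-Attention stack is order-agnostic on its own, each word embedding gets a position encoding added to it before the first layer. Below is a sketch of the sinusoidal encoding from the original paper (learned position embeddings are an equally common alternative):

```python
import numpy as np

def sinusoidal_position_encoding(seq_len, d_model):
    """Return a (seq_len, d_model) matrix that is added to the word embeddings."""
    positions = np.arange(seq_len)[:, None]                 # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]                # even dimensions 2i
    angles = positions / (10000 ** (dims / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                            # even dims: sine
    pe[:, 1::2] = np.cos(angles)                            # odd dims: cosine
    return pe

embeddings = np.random.normal(size=(6, 32))                 # 6 tokens, d_model = 32
inputs_to_encoder = embeddings + sinusoidal_position_encoding(6, 32)
print(inputs_to_encoder.shape)                              # (6, 32)
```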
  • 84. Why Self-Attention? ● “The animal did not cross the road because it was too tired.” vs. ● “The animal did not cross the road because it was too wide.” NLP 101 RNNs Attention Transformers BERT
  • 85. Why Self-Attention? ● “The animal did not cross the road because it was too tired.” vs. ● “The animal did not cross the road because it was too wide.” Uni-directional RNN would have missed this! Bi-directional is nice. NLP 101 RNNs Attention Transformers BERT
  • 86.
  • 87. Outline 1. NLP 101: how do we represent words in machine learning 🤖 Tokenization 🤖 Word embeddings 2. Order matters: Recurrent Neural Networks 🤖 RNNs 🤖 Seq2Seq models 🤖 What is wrong with RNNs 3. Attention is all you need! 🤖 What is Attention 🤖 Attention, the linear algebra perspective 4. The Transformer architecture 🤖 Mostly Attention 5. BERT: contextualized word embeddings
  • 88. (Contextual) word embeddings ● Dense (i.e. not sparse) vectors capturing the meaning of tokens/words NLP 101 RNNs Attention Transformers BERT
  • 89. (Contextual) word embeddings ● Dense (i.e. not sparse) vectors capturing the meaning of tokens/words ● Meaning often depends on the context (surrounding words) NLP 101 RNNs Attention Transformers BERT
  • 90. (Contextual) word embeddings ● Dense (i.e. not sparse) vectors capturing the meaning of tokens/words ● Meaning often depends on the context (surrounding words) ● Self-Attention in Transformers processes context when encoding a token NLP 101 RNNs Attention Transformers BERT
  • 91. (Contextual) word embeddings ● Dense (i.e. not sparse) vectors capturing the meaning of tokens/words ● Meaning often depends on the context (surrounding words) ● Self-Attention in Transformers processes context when encoding a token ● Can we just take the Transformer Encoder and use it to embed words? NLP 101 RNNs Attention Transformers BERT
  • 92. (Contextual) word embeddings ● Dense (i.e. not sparse) vectors capturing the meaning of tokens/words ● Meaning often depends on the context (surrounding words) ● Self-Attention in Transformers processes context when encoding a token ● Can we just take the Transformer Encoder and use it to embed words? NLP 101 RNNs Attention Transformers BERT
  • 93. (Contextual) word embeddings BERT: Bidirectional Encoder Representations from Transformers NLP 101 RNNs Attention Transformers BERT
  • 94. (Contextual) word embeddings BERT: Bidirectional Encoder Representations from Transformers Instead of Word2Vec, use pre-trained BERT for your NLP tasks ;) NLP 101 RNNs Attention Transformers BERT
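A hedged sketch of what “use pre-trained BERT” looks like with the HuggingFace transformers library (standard bert-base-uncased checkpoint; the weights are downloaded on first use). Each token comes out with a 768-dimensional contextual vector, so the same word gets different vectors in different sentences:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentences = ["I deposited cash at the bank.",
             "We had a picnic on the river bank."]
batch = tokenizer(sentences, padding=True, return_tensors="pt")

with torch.no_grad():
    hidden = model(**batch).last_hidden_state   # (batch, seq_len, 768)

# The vector for "bank" now differs between the two sentences,
# because Self-Attention has mixed in the surrounding context.
print(hidden.shape)
```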
  • 95. What was BERT pre-trained on? Task # 1: Masked Language Modeling ● Take a text corpus, randomly mask 15% of the words ● Put a softmax layer (dim = vocab size) on top of the BERT encodings at the masked positions to predict which word is missing NLP 101 RNNs Attention Transformers BERT
  • 96. What was BERT pre-trained on? Task # 1: Masked Language Modeling ● Take a text corpus, randomly mask 15% of the words ● Put a softmax layer (dim = vocab size) on top of the BERT encodings at the masked positions to predict which word is missing Task # 2: Next Sentence Prediction ● Input sequence: two sentences separated by a special token ● Binary classification: are these sentences successive or not? NLP 101 RNNs Attention Transformers BERT
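The masked-language-modelling objective can be poked at directly via the fill-mask pipeline in HuggingFace transformers (a sketch; the exact predictions depend on the checkpoint):

```python
from transformers import pipeline

# Downloads a pretrained masked-language model on first use.
unmasker = pipeline("fill-mask", model="bert-base-uncased")

# BERT predicts a distribution over the vocabulary for the [MASK] position.
for prediction in unmasker("I play with my [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))
# Plausible completions such as "dog", "friends" or "cat" should rank highly.
```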
  • 97. Further reading and coding My blog posts: towardsdatascience.com/natural-language-processing-the-age-of-transformers-a36c0 265937d blog.scaleway.com/understanding-text-with-bert/ The HuggingFace 🤗 Transformers library: huggingface.co/ NLP 101 RNNs Attention Transformers BERT