6. From One-Hot Vectors to Word Embeddings & Self-Attention
[Diagram: the words "animal … street … it" mapped from sparse one-hot vectors (0 0 0 0 … 1, …) to dense embedding vectors (1.4 … 3.7, …)]
The Annotated Transformer, The Illustrated Transformer, The Illustrated BERT
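The one-hot → embedding step on this slide can be sketched in a few lines of NumPy (the toy vocabulary and embedding size are assumptions for illustration): looking up a word's embedding row is exactly the same as multiplying its one-hot vector by the embedding matrix.

```python
import numpy as np

vocab = ["animal", "street", "it"]           # toy vocabulary from the slide
d_model = 4                                   # assumed embedding size

# One-hot: each word is a sparse vector with a single 1.
one_hot = np.eye(len(vocab))                  # shape (3, 3)

# Embedding: a learned dense matrix; multiplying a one-hot vector by it
# is just a row lookup.
rng = np.random.default_rng(0)
E = rng.normal(size=(len(vocab), d_model))    # shape (3, 4)

idx = vocab.index("it")
assert np.allclose(one_hot[idx] @ E, E[idx])  # one-hot @ E == row lookup
```

This is why real implementations store an embedding table and index into it rather than materializing one-hot vectors.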
8. From One-Hot Vectors to Word Embeddings & Self-Attention: query, key, value
[Diagram: the same one-hot → embedding mapping, with each word's embedding projected into query, key, and value vectors]
9. From One-Hot Vectors to Word Embeddings & Self-Attention: query, key, value
[Diagram: (self-)attention weights (0.1, 0.2, 0.7) over the embedded words "animal … street … it"]
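The query/key/value picture on these slides can be sketched as scaled dot-product self-attention (random toy weights; the sizes are assumptions): each word's query is compared against every key, the softmaxed scores become attention weights like the 0.1 / 0.2 / 0.7 on the slide, and each output is a weighted mix of the values.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
n, d = 3, 4                        # 3 toy words, embedding size 4 (assumed)
X = rng.normal(size=(n, d))        # word embeddings
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

Q, K, V = X @ Wq, X @ Wk, X @ Wv   # query, key, value projections
weights = softmax(Q @ K.T / np.sqrt(d))  # attention weights; rows sum to 1
out = weights @ V                  # each output is a weighted mix of values

assert np.allclose(weights.sum(axis=1), 1.0)
```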
11. OpenAI: Generative Pretraining
[Diagram: stacks of Transformer layers reading left to right; inputs "<s> the … too" and "<s> the … tired", predictions "The animal … tired", labeled "Acceptable"]
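A minimal sketch of what left-to-right generative pretraining implies architecturally, assuming scaled-dot-product-style attention scores: a causal (lower-triangular) mask so each position attends only to itself and earlier positions, never to future tokens.

```python
import numpy as np

n = 4  # toy sequence length (assumed)

# Left-to-right LMs mask attention so position i only sees positions <= i.
causal_mask = np.tril(np.ones((n, n), dtype=bool))

scores = np.random.default_rng(0).normal(size=(n, n))
scores = np.where(causal_mask, scores, -np.inf)  # block future positions

e = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = e / e.sum(axis=-1, keepdims=True)

assert np.allclose(np.triu(weights, k=1), 0.0)   # no attention to the future
```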
12. Understanding Can Need "Future" Information
How far is Jacksonville from Miami?
"Jacksonville is in the First Coast region of northeast Florida and is centered on the banks of the St. Johns River, about 25 miles (40 km) south of the Georgia state line and about 340 miles (550 km) north of Miami."
"Mark which area you want to distress." (Mark: VERB) vs. "Mark, which area do you want to distress?" (Mark: NOUN)
13. Naive Bidirectionality: Words Can "See Themselves"
[Diagram: stacked bidirectional Transformer layers over "<s> the … too", predicting "The animal … tired"; with full bidirectional context, a word's prediction can indirectly condition on the word itself]
14. Training BERT
Masked Language Model (fill-in-the-blank)
"Deep learning (also [MASK] [MASK] deep structured learning or [MASK] learning) is part of a broader family of machine learning methods [MASK] on [MASK] data representations, as opposed to task-specific algorithms."
"[MASK] is allergic to peaches." → Is
Sources: https://en.wikipedia.org/wiki/Deep_learning, https://en.wikipedia.org/wiki/Daniel_Tiger%27s_Neighborhood
BooksCorpus: Zhu, Kiros, Zemel, Salakhutdinov, Urtasun, Torralba, Fidler, CVPR 2015
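The fill-in-the-blank objective above can be sketched as a toy masking function. The 15% rate matches BERT; note the full recipe also keeps some selected tokens unchanged or swaps in random tokens, while this sketch only does the plain [MASK] replacement.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, seed=1):
    """Randomly replace ~15% of tokens with [MASK]; return the masked
    sequence and the (position, original token) targets to fill in."""
    rng = random.Random(seed)
    masked, targets = [], []
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append("[MASK]")
            targets.append((i, tok))
        else:
            masked.append(tok)
    return masked, targets

toks = "deep learning is part of a broader family of machine learning methods".split()
masked, targets = mask_tokens(toks)
assert all(masked[i] == "[MASK]" and toks[i] == t for i, t in targets)
```

The model is then trained to predict each original token at the masked positions, using both left and right context.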
19. Pretraining Tasks Matter...and Bigger = Better *
MRPC: Dolan and Brockett, IWP 2005
20. Do I Need Full BERT Models for All My Tasks?
Houlsby, Giurgiu, Jastrzebski, Morrone, de Laroussilhe, Gesmundo, Attariyan, Gelly, arXiv, Feb 2019
21. Try It Out, Get Faster Training with TPUs
22. Mismatches between Training and Realistic Inputs
Two Case Studies: Mixed-Language Text and Identifying Commands
25. "A Fast, Compact, Accurate Model for Language Identification of Codemixed Text"
Zhang, Riesa, Gillick, Bakalov, Baldridge, Weiss, EMNLP 2018
31. "A Challenge Set and Methods for Noun-Verb Ambiguity", EMNLP 2018
"Certain insects can damage plumerias, such as mites, flies, or aphids." → NOUN
"Mark which area you want to distress." → VERB