This is a presentation on XLNet, which recently outperformed BERT on 20 NLP tasks and achieved state-of-the-art results on 18 of them.
The paper proposes a pretrained language model that keeps the autoregressive property of earlier language models while, like BERT, making good use of context from multiple directions. XLNet's three key components are:
- Permutation Language Model
- Two-Stream Self-Attention
- Transformer-XL
Thank you.
* Paper: https://arxiv.org/abs/1906.08237
PR-175: XLNet: Generalized Autoregressive Pretraining for Language Understanding
1. PR-175 :
XLNet: Generalized Autoregressive Pretraining for
Language Understanding
Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le
Carnegie Mellon University, Google Brain
Presented by
Sungnam Park, Yonsei University
sungnam1108@naver.com
3. PR-175: XLNet: Generalized Autoregressive Pretraining for Language Understanding
Contents
1. Recent Trends
2. AutoRegressive vs AutoEncoding
3. Changes
3-1. Permutation Language Model
3-2. Two-stream Self-attention Mechanism
3-3. Recurrence Mechanism
4. Experiments
5. Ablation Study
6. Conclusion
4. PR-175: XLNet: Generalized Autoregressive Pretraining for Language Understanding
Recent Trends
•Unsupervised representation learning has been successful!
•Pretrain on large-scale unlabeled text corpora → finetuning
•CoVe(2017)*, ELMo(2018)**, GPT(2018)***, BERT(2018)****
•Pretraining methods : Autoregressive vs Autoencoding
*McCann et al 2017, Learned in translation: Contextualized word vectors. **Peters et al 2018, Deep contextualized word representations.
***Radford et al 2018, Improving language understanding by generative pre-training ****Devlin et al 2018, Bert: Pre-training of deep bidirectional transformers for language understanding.
(Diagram: the two pretraining families, Autoregressive and Autoencoding)
5. PR-175: XLNet: Generalized Autoregressive Pretraining for Language Understanding
Autoregressive vs Autoencoding
•AR language model (GPT)
[ x1, x2, x3, x4, x5, x6, x7, x8 ] : predicted token by token, forward or backward
$\max_\theta \; \log p_\theta(\mathbf{x}) = \sum_{t=1}^{T} \log p_\theta(x_t \mid \mathbf{x}_{<t})$
→ Good at text generation
•AE language model (BERT)
[ x1, x2, x3, x4, x5, x6, x7, x8 ] : masked tokens predicted from bidirectional context
$\max_\theta \; \log p_\theta(\bar{\mathbf{x}} \mid \hat{\mathbf{x}}) \approx \sum_{t=1}^{T} m_t \log p_\theta(x_t \mid \hat{\mathbf{x}})$
(where $\hat{\mathbf{x}}$ is the corrupted input, $\bar{\mathbf{x}}$ the masked tokens, and $m_t = 1$ iff $x_t$ is masked)
→ Good at language understanding
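A minimal Python sketch (not from the paper or any release) contrasting how the two objectives are assembled: the probability functions below are dummy stand-ins for p_θ, and only the structure of the two sums matters.

```python
import math

# Toy sequence; the two probability functions are made-up stand-ins for p_theta,
# used only to show how each training objective sums its log-likelihood terms.
sequence = ["the", "cat", "sat", "on", "the", "mat"]

def p_ar(token, left_context):
    """Stand-in for p_theta(x_t | x_<t): an AR model conditions only on the left context."""
    return 0.1  # dummy constant probability

def p_ae(token, corrupted_sequence):
    """Stand-in for p_theta(x_t | x_hat): a masked LM conditions on the whole corrupted input."""
    return 0.1  # dummy constant probability

# Autoregressive objective: sum over every position, conditioned on x_<t only.
ar_log_likelihood = sum(
    math.log(p_ar(tok, sequence[:t])) for t, tok in enumerate(sequence)
)

# Autoencoding (masked LM) objective: sum over masked positions only,
# each token predicted independently given the corrupted sequence x_hat.
masked_positions = {1, 4}  # mask "cat" and the second "the"
corrupted = [tok if t not in masked_positions else "[MASK]"
             for t, tok in enumerate(sequence)]
ae_log_likelihood = sum(
    math.log(p_ae(sequence[t], corrupted)) for t in sorted(masked_positions)
)

print(ar_log_likelihood, ae_log_likelihood)
```

The AR sum covers every token but sees only one direction; the AE sum sees both directions but covers only the masked tokens and treats them as independent, which is exactly the trade-off the next slides address.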
6. PR-175: XLNet: Generalized Autoregressive Pretraining for Language Understanding
Autoregressive vs Autoencoding
•BERT’s limitations
•It assumes that all masked tokens are independent of each other (e.g., masking both tokens of "New York" in "New York is a city" drops the dependency between "New" and "York")
•A generalized model should not have to rely on data corruption (masking)
•The [MASK] token never appears in real downstream data → pretrain-finetune discrepancy
•Limited long-term dependency (context is bounded by a fixed-length segment)
7. PR-175: XLNet: Generalized Autoregressive Pretraining for Language Understanding
Changes
Permutation Language Modeling
Two-Stream Self-Attention
Transformer-XL
8. PR-175: XLNet: Generalized Autoregressive Pretraining for Language Understanding
Permutation Language Modeling
Original sequence : [ 1, 2, 3, 4, 5, 6, 7, 8 ]
Forward AR        : [ 1, 2, 3, 4, 5, 6, 7, 8 ]
Backward AR       : [ 8, 7, 6, 5, 4, 3, 2, 1 ]
Permutation AR    : [ 3, 2, 5, 6, 8, 1, 7, 4 ]
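As a rough illustration (the objective is the paper's, the helper below is not): XLNet maximizes E_{z∼Z_T}[Σ_t log p_θ(x_{z_t} | x_{z<t})], the ordinary AR objective evaluated under a sampled permutation z of the positions. Only the factorization order is permuted; the input tokens and their positional encodings stay in the original order.

```python
import random

# Sequence positions, 1-indexed to match the slide.
positions = [1, 2, 3, 4, 5, 6, 7, 8]

# A factorization order z is just a permutation of the positions;
# forward and backward AR are two special cases of it.
forward_order  = positions
backward_order = positions[::-1]
random_order   = random.sample(positions, len(positions))

def visible_context(order):
    """For each target position z_t, return the positions z_<t that the model
    may condition on, i.e. the positions that come earlier in the factorization
    order. The input sequence itself is never reordered."""
    return {order[t]: sorted(order[:t]) for t in range(len(order))}

print(visible_context([3, 2, 5, 6, 8, 1, 7, 4]))
# Under this order, position 5 is predicted from the tokens at positions {2, 3},
# while position 4 (last in the order) sees every other position. Averaged over
# many sampled orders, each token is predicted from context on both sides.
```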
13. PR-175: XLNet: Generalized Autoregressive Pretraining for Language Understanding
Two-Stream Self-Attention
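The slide's figure is omitted in this transcript, but the masking logic behind the two streams can be sketched from the factorization order (an interpretation sketch, not the released implementation): the content stream at a position may attend to every position at or before it in the order, so it can encode the token itself, while the query stream may attend only to strictly earlier positions, because it must predict that token without seeing it (it still receives the target's positional information).

```python
import numpy as np

def two_stream_masks(order):
    """Build boolean attention masks for one sampled factorization order.
    mask[i, j] is True when position i may attend to position j.
    - content stream: attends to positions whose order index is <= its own
    - query stream:   attends only to positions strictly earlier in the order"""
    n = len(order)
    rank = np.empty(n, dtype=int)
    for idx, pos in enumerate(order):
        rank[pos] = idx              # rank[pos] = where pos appears in the order
    content_mask = rank[None, :] <= rank[:, None]
    query_mask   = rank[None, :] <  rank[:, None]
    return content_mask, query_mask

# 0-indexed version of the slide's example order [3, 2, 5, 6, 8, 1, 7, 4]
order = [2, 1, 4, 5, 7, 0, 6, 3]
content_mask, query_mask = two_stream_masks(order)
print(content_mask.astype(int))
print(query_mask.astype(int))
```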
14. PR-175: XLNet: Generalized Autoregressive Pretraining for Language Understanding
Transformer-XL (PR-161)
•Segment Recurrence
15. PR-175: XLNet: Generalized Autoregressive Pretraining for Language Understanding
Transformer-XL (PR-161)
•Segment Recurrence
Google AI blog, Transformer-XL: Unleashing the Potential of Attention Models
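A single-layer NumPy sketch of segment recurrence under strong simplifications (one head, no relative positional terms, no layer norm, made-up dimensions): the previous segment's hidden states are cached and prepended when forming keys and values, so attention reaches across the segment boundary, while gradients never flow into the cache.

```python
import numpy as np

def attention_with_memory(h_curr, memory, W_q, W_k, W_v):
    """One attention layer with a Transformer-XL style memory (sketch).
    Queries come from the current segment only; keys and values are computed
    over the cached memory concatenated with the current segment."""
    kv_input = np.concatenate([memory, h_curr], axis=0)       # (mem_len + seg_len, d)
    q, k, v = h_curr @ W_q, kv_input @ W_k, kv_input @ W_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)            # softmax over memory + current
    return weights @ v

d, seg_len, mem_len = 16, 4, 4                                # illustrative sizes
rng = np.random.default_rng(0)
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
memory = np.zeros((mem_len, d))                               # empty cache before the first segment

for _ in range(3):                                            # process a stream segment by segment
    h_curr = rng.normal(size=(seg_len, d))
    out = attention_with_memory(h_curr, memory, W_q, W_k, W_v)
    memory = h_curr.copy()                                    # cache for the next segment (no gradient)
print(out.shape)                                              # (4, 16)
```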
16. PR-175: XLNet: Generalized Autoregressive Pretraining for Language Understanding
Transformer-XL (PR-161)
•Relative Positional Embedding
•When we reuse old hidden states, how can we define positional encodings?
•It is enough to know the relative distance between the key position and the query position (i − j)
•Standard Transformer : absolute positions restart for every segment
  segment 1: [0, 1, 2, 3]   segment 2: [0, 1, 2, 3]   segment 3: [0, 1, 2, 3]
•Transformer-XL : encodes only the relative distance i − j between query position i and key position j
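A small sketch of why this works (segment and memory lengths are illustrative): with absolute encodings every segment restarts at 0, so reused hidden states become ambiguous, but if each query position i only looks up an embedding indexed by the distance i − j to key position j, the pattern it sees is identical in every segment, including the part of the window that lies in the cached memory.

```python
import numpy as np

seg_len, mem_len = 4, 4

# Absolute positions restart for every segment, so "position 2" is ambiguous
# once hidden states from an earlier segment are reused.
print("absolute positions per segment:", np.arange(seg_len))

# Relative positions: a query at position i attends to keys at positions j
# spanning the cached memory plus the current segment, and only i - j matters.
i = np.arange(mem_len, mem_len + seg_len)[:, None]     # query positions in the combined window
j = np.arange(mem_len + seg_len)[None, :]               # key positions (memory + current segment)
print(i - j)
# Each row is the distance pattern one query sees; it looks the same in every
# segment. Negative entries (keys after the query) would be masked in an AR model.
```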
17. PR-175: XLNet: Generalized Autoregressive Pretraining for Language Understanding
Experiments
•Pretraining Datasets
•BERT : BookCorpus + English Wikipedia
•XLNet : BookCorpus + English Wikipedia + Giga5 + ClueWeb + Common Crawl
•Model Size
•XLNet-Large ≈ BERT-Large
•Training Time
•512 TPU v3 chips for 500K steps with Adam optimizer (batch size 2048, 2.5 days)
→ Still underfits!
18. PR-175: XLNet: Generalized Autoregressive Pretraining for Language Understanding
Experiments
•SQuAD Dataset
19. PR-175: XLNet: Generalized Autoregressive Pretraining for Language Understanding
Experiments
•Text Classification
20. PR-175: XLNet: Generalized Autoregressive Pretraining for Language Understanding
Experiments
•RACE Dataset
21. PR-175: XLNet: Generalized Autoregressive Pretraining for Language Understanding
Experiments
•GLUE Dataset
22. PR-175: XLNet: Generalized Autoregressive Pretraining for Language Understanding
Ablation Study
•Importance of Transformer-XL
23. PR-175: XLNet: Generalized Autoregressive Pretraining for Language Understanding
Ablation Study
•Effectiveness of Permutation Language Modeling
24. PR-175: XLNet: Generalized Autoregressive Pretraining for Language Understanding
Conclusion
•The permutation language model covers both the AR and AE approaches as special cases
(Diagram: the AR model's forward/backward factorization over [ x1, …, x8 ] and the AE model's bidirectional prediction of masked tokens, shown next to XLNet's permutation orders such as [ 1, 2, 3, 4 ], [ 8, 7, 6, 5 ], and [ 1, 4, 2, 3 ], which include both as special cases)
•Transformer-XL is very good for downstream tasks as well as language modeling
25. PR-175: XLNet: Generalized Autoregressive Pretraining for Language Understanding
Summary
Generalized Autoregressive Pretraining for Language Understanding
26. PR-175: XLNet: Generalized Autoregressive Pretraining for Language Understanding
Summary
Generalized Autoregressive Pretraining for Language Understanding
Pretrain without data corruption(masking) by using Permutation LM
27. PR-175: XLNet: Generalized Autoregressive Pretraining for Language Understanding
Summary
Generalized Autoregressive Pretraining for Language Understanding
Autoregressive language model but also utilizes bidirectional context
28. PR-175: XLNet: Generalized Autoregressive Pretraining for Language Understanding
Summary
Generalized Autoregressive Pretraining for Language Understanding
Outperforms BERT on 20 NLP tasks and achieves state of the art on 18 of them
29. PR-175 :
XLNet: Generalized Autoregressive Pretraining for
Language Understanding
Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le
Carnegie Mellon University, Google Brain
Presented by
Sungnam Park, Yonsei University
sungnam1108@naver.com