This is a presentation on XLNet, which recently outperformed BERT on 20 NLP tasks and achieved state-of-the-art results on 18 of them.
The paper proposes a pretrained language model that keeps the autoregressive property of earlier language models while, like BERT, making good use of context from multiple directions. XLNet's three key components are:
- Permutation Language Model
- Two-Stream Self-Attention
- Transformer-XL
Thank you.
* Paper: https://arxiv.org/abs/1906.08237
PR-175: XLNet: Generalized Autoregressive Pretraining for Language Understanding
1. PR-175 :
XLNet: Generalized Autoregressive Pretraining for
Language Understanding
Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le
Carnegie Mellon University, Google Brain
Presented by
Sungnam Park, Yonsei University
sungnam1108@naver.com
3. PR-175: XLNet: Generalized Autoregressive Pretraining for Language Understanding
Contents
1. Recent Trends
2. AutoRegressive vs AutoEncoding
3. Changes
3-1. Permutation Language Model
3-2. Two-stream Self-attention Mechanism
3-3. Recurrence Mechanism
4. Experiments
5. Ablation Study
6. Conclusion
4. PR-175: XLNet: Generalized Autoregressive Pretraining for Language Understanding
Recent Trends
•Unsupervised representation learning has been successful!
•Pretrain on large-scale unlabeled text corpora → finetuning
•CoVe(2017)*, ELMo(2018)**, GPT(2018)***, BERT(2018)****
•Pretraining methods : Autoregressive vs Autoencoding
*McCann et al 2017, Learned in translation: Contextualized word vectors. **Peters et al 2018, Deep contextualized word representations.
***Radford et al 2018, Improving language understanding by generative pre-training ****Devlin et al 2018, Bert: Pre-training of deep bidirectional transformers for language understanding.
(Diagram: the two pretraining families, Autoregressive and Autoencoding)
5. PR-175: XLNet: Generalized Autoregressive Pretraining for Language Understanding
Autoregressive vs Autoencoding
•AR language model (GPT)
[ x1, x2, x3, x4, x5, x6, x7, x8 ] : predicted token by token, forward or backward
$\max_\theta \; \log p_\theta(\mathbf{x}) = \sum_{t=1}^{T} \log p_\theta(x_t \mid \mathbf{x}_{<t})$
→ Good at text generation
•AE language model (BERT)
[ x1, x2, x3, x4, x5, x6, x7, x8 ] : masked tokens predicted from bidirectional context
$\max_\theta \; \log p_\theta(\bar{\mathbf{x}} \mid \hat{\mathbf{x}}) \approx \sum_{t=1}^{T} m_t \log p_\theta(x_t \mid \hat{\mathbf{x}})$
(where $\hat{\mathbf{x}}$ is the corrupted input, $\bar{\mathbf{x}}$ the masked tokens, and $m_t = 1$ iff $x_t$ is masked)
→ Good at language understanding
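A minimal Python sketch (not from the paper or any release) contrasting how the two objectives are assembled: the probability functions below are dummy stand-ins for p_θ, and only the structure of the two sums matters.

```python
import math

# Toy sequence; the two probability functions are made-up stand-ins for p_theta,
# used only to show how each training objective sums its log-likelihood terms.
sequence = ["the", "cat", "sat", "on", "the", "mat"]

def p_ar(token, left_context):
    """Stand-in for p_theta(x_t | x_<t): an AR model conditions only on the left context."""
    return 0.1  # dummy constant probability

def p_ae(token, corrupted_sequence):
    """Stand-in for p_theta(x_t | x_hat): a masked LM conditions on the whole corrupted input."""
    return 0.1  # dummy constant probability

# Autoregressive objective: sum over every position, conditioned on x_<t only.
ar_log_likelihood = sum(
    math.log(p_ar(tok, sequence[:t])) for t, tok in enumerate(sequence)
)

# Autoencoding (masked LM) objective: sum over masked positions only,
# each token predicted independently given the corrupted sequence x_hat.
masked_positions = {1, 4}  # mask "cat" and the second "the"
corrupted = [tok if t not in masked_positions else "[MASK]"
             for t, tok in enumerate(sequence)]
ae_log_likelihood = sum(
    math.log(p_ae(sequence[t], corrupted)) for t in sorted(masked_positions)
)

print(ar_log_likelihood, ae_log_likelihood)
```

The AR sum covers every token but sees only one direction; the AE sum sees both directions but covers only the masked tokens and treats them as independent, which is exactly the trade-off the next slides address.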
6. PR-175: XLNet: Generalized Autoregressive Pretraining for Language Understanding
Autoregressive vs Autoencoding
•BERT’s limitations
•It assumes that all masked tokens are independent of each other (e.g., masking both tokens of "New York" in "New York is a city" drops the dependency between "New" and "York")
•A generalized model should not have to rely on data corruption (masking)
•The [MASK] token never appears in real downstream data → pretrain-finetune discrepancy
•Limited long-term dependency (context is bounded by a fixed-length segment)
7. PR-175: XLNet: Generalized Autoregressive Pretraining for Language Understanding
Changes
Permutation Language Modeling
Two-Stream Self-Attention
Transformer-XL
8. PR-175: XLNet: Generalized Autoregressive Pretraining for Language Understanding
Permutation Language Modeling
Original sequence : [ 1, 2, 3, 4, 5, 6, 7, 8 ]
Forward AR        : [ 1, 2, 3, 4, 5, 6, 7, 8 ]
Backward AR       : [ 8, 7, 6, 5, 4, 3, 2, 1 ]
Permutation AR    : [ 3, 2, 5, 6, 8, 1, 7, 4 ]
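As a rough illustration (the objective is the paper's, the helper below is not): XLNet maximizes E_{z∼Z_T}[Σ_t log p_θ(x_{z_t} | x_{z<t})], the ordinary AR objective evaluated under a sampled permutation z of the positions. Only the factorization order is permuted; the input tokens and their positional encodings stay in the original order.

```python
import random

# Sequence positions, 1-indexed to match the slide.
positions = [1, 2, 3, 4, 5, 6, 7, 8]

# A factorization order z is just a permutation of the positions;
# forward and backward AR are two special cases of it.
forward_order  = positions
backward_order = positions[::-1]
random_order   = random.sample(positions, len(positions))

def visible_context(order):
    """For each target position z_t, return the positions z_<t that the model
    may condition on, i.e. the positions that come earlier in the factorization
    order. The input sequence itself is never reordered."""
    return {order[t]: sorted(order[:t]) for t in range(len(order))}

print(visible_context([3, 2, 5, 6, 8, 1, 7, 4]))
# Under this order, position 5 is predicted from the tokens at positions {2, 3},
# while position 4 (last in the order) sees every other position. Averaged over
# many sampled orders, each token is predicted from context on both sides.
```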
13. PR-175: XLNet: Generalized Autoregressive Pretraining for Language Understanding
Two-Stream Self-Attention
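The slide's figure is omitted in this transcript, but the masking logic behind the two streams can be sketched from the factorization order (an interpretation sketch, not the released implementation): the content stream at a position may attend to every position at or before it in the order, so it can encode the token itself, while the query stream may attend only to strictly earlier positions, because it must predict that token without seeing it (it still receives the target's positional information).

```python
import numpy as np

def two_stream_masks(order):
    """Build boolean attention masks for one sampled factorization order.
    mask[i, j] is True when position i may attend to position j.
    - content stream: attends to positions whose order index is <= its own
    - query stream:   attends only to positions strictly earlier in the order"""
    n = len(order)
    rank = np.empty(n, dtype=int)
    for idx, pos in enumerate(order):
        rank[pos] = idx              # rank[pos] = where pos appears in the order
    content_mask = rank[None, :] <= rank[:, None]
    query_mask   = rank[None, :] <  rank[:, None]
    return content_mask, query_mask

# 0-indexed version of the slide's example order [3, 2, 5, 6, 8, 1, 7, 4]
order = [2, 1, 4, 5, 7, 0, 6, 3]
content_mask, query_mask = two_stream_masks(order)
print(content_mask.astype(int))
print(query_mask.astype(int))
```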
14. PR-175: XLNet: Generalized Autoregressive Pretraining for Language Understanding
Transformer-XL (PR-161)
•Segment Recurrence
15. PR-175: XLNet: Generalized Autoregressive Pretraining for Language Understanding
Transformer-XL (PR-161)
•Segment Recurrence
Google AI blog, Transformer-XL: Unleashing the Potential of Attention Models
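A single-layer NumPy sketch of segment recurrence under strong simplifications (one head, no relative positional terms, no layer norm, made-up dimensions): the previous segment's hidden states are cached and prepended when forming keys and values, so attention reaches across the segment boundary, while gradients never flow into the cache.

```python
import numpy as np

def attention_with_memory(h_curr, memory, W_q, W_k, W_v):
    """One attention layer with a Transformer-XL style memory (sketch).
    Queries come from the current segment only; keys and values are computed
    over the cached memory concatenated with the current segment."""
    kv_input = np.concatenate([memory, h_curr], axis=0)       # (mem_len + seg_len, d)
    q, k, v = h_curr @ W_q, kv_input @ W_k, kv_input @ W_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)            # softmax over memory + current
    return weights @ v

d, seg_len, mem_len = 16, 4, 4                                # illustrative sizes
rng = np.random.default_rng(0)
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
memory = np.zeros((mem_len, d))                               # empty cache before the first segment

for _ in range(3):                                            # process a stream segment by segment
    h_curr = rng.normal(size=(seg_len, d))
    out = attention_with_memory(h_curr, memory, W_q, W_k, W_v)
    memory = h_curr.copy()                                    # cache for the next segment (no gradient)
print(out.shape)                                              # (4, 16)
```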
16. PR-175: XLNet: Generalized Autoregressive Pretraining for Language Understanding
Transformer-XL (PR-161)
•Relative Positional Embedding
•When we reuse old hidden states, how can we define positional encodings?
•It is enough to know the relative distance between the key position and the query position (i − j)
•Standard Transformer : absolute positions restart for every segment
  segment 1: [0, 1, 2, 3]   segment 2: [0, 1, 2, 3]   segment 3: [0, 1, 2, 3]
•Transformer-XL : encodes only the relative distance i − j between query position i and key position j
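A small sketch of why this works (segment and memory lengths are illustrative): with absolute encodings every segment restarts at 0, so reused hidden states become ambiguous, but if each query position i only looks up an embedding indexed by the distance i − j to key position j, the pattern it sees is identical in every segment, including the part of the window that lies in the cached memory.

```python
import numpy as np

seg_len, mem_len = 4, 4

# Absolute positions restart for every segment, so "position 2" is ambiguous
# once hidden states from an earlier segment are reused.
print("absolute positions per segment:", np.arange(seg_len))

# Relative positions: a query at position i attends to keys at positions j
# spanning the cached memory plus the current segment, and only i - j matters.
i = np.arange(mem_len, mem_len + seg_len)[:, None]     # query positions in the combined window
j = np.arange(mem_len + seg_len)[None, :]               # key positions (memory + current segment)
print(i - j)
# Each row is the distance pattern one query sees; it looks the same in every
# segment. Negative entries (keys after the query) would be masked in an AR model.
```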
17. PR-175: XLNet: Generalized Autoregressive Pretraining for Language Understanding
Experiments
•Pretraining Datasets
•BERT : BookCorpus + English Wikipedia
•XLNet : BookCorpus + English Wikipedia + Giga5 + ClueWeb + Common Crawl
•Model Size
•XLNet-Large ≈ BERT-Large
•Training Time
•512 TPU v3 chips for 500K steps with Adam optimizer (batch size 2048, 2.5 days)
→ Still underfits!
18. PR-175: XLNet: Generalized Autoregressive Pretraining for Language Understanding
Experiments
•SQuAD Dataset
19. PR-175: XLNet: Generalized Autoregressive Pretraining for Language Understanding
Experiments
•Text Classification
20. PR-175: XLNet: Generalized Autoregressive Pretraining for Language Understanding
Experiments
•RACE Dataset
21. PR-175: XLNet: Generalized Autoregressive Pretraining for Language Understanding
Experiments
•GLUE Dataset
22. PR-175: XLNet: Generalized Autoregressive Pretraining for Language Understanding
Ablation Study
•Importance of Transformer-XL
23. PR-175: XLNet: Generalized Autoregressive Pretraining for Language Understanding
Ablation Study
•Effectiveness of Permutation Language Modeling
24. PR-175: XLNet: Generalized Autoregressive Pretraining for Language Understanding
Conclusion
•The permutation language model covers both the AR and AE approaches as special cases
(Diagram: the AR model's forward/backward factorization over [ x1, …, x8 ] and the AE model's bidirectional prediction of masked tokens, shown next to XLNet's permutation orders such as [ 1, 2, 3, 4 ], [ 8, 7, 6, 5 ], and [ 1, 4, 2, 3 ], which include both as special cases)
•Transformer-XL is very good for downstream tasks as well as language modeling
25. PR-175: XLNet: Generalized Autoregressive Pretraining for Language Understanding
Summary
Generalized Autoregressive Pretraining for Language Understanding
26. PR-175: XLNet: Generalized Autoregressive Pretraining for Language Understanding
Summary
Generalized Autoregressive Pretraining for Language Understanding
Pretrain without data corruption(masking) by using Permutation LM
27. PR-175: XLNet: Generalized Autoregressive Pretraining for Language Understanding
Summary
Generalized Autoregressive Pretraining for Language Understanding
Autoregressive language model but also utilizes bidirectional context
28. PR-175: XLNet: Generalized Autoregressive Pretraining for Language Understanding
Summary
Generalized Autoregressive Pretraining for Language Understanding
Outperforms BERT on 20 NLP tasks and achieves state of the art on 18 of them
29. PR-175 :
XLNet: Generalized Autoregressive Pretraining for
Language Understanding
Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le
Carnegie Mellon University, Google Brain
Presented by
Sungnam Park, Yonsei University
sungnam1108@naver.com