PR-175 :
XLNet: Generalized Autoregressive Pretraining for
Language Understanding
Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le
Carnegie Mellon University, Google Brain
Presented by
Sungnam Park, Yonsei University
sungnam1108@naver.com
Contents
1. Recent Trends
2. AutoRegressive vs AutoEncoding
3. Changes
3-1. Permutation Language Model
3-2. Two-stream Self-attention Mechanism
3-3. Recurrence Mechanism
4. Experiments
5. Ablation Study
6. Conclusion
Recent Trends
•Unsupervised representation learning has been successful!
•Pretrain on large-scale unlabeled text corpora → finetuning
•CoVe(2017)*, ELMo(2018)**, GPT(2018)***, BERT(2018)****
•Pretraining methods : Autoregressive vs Autoencoding
*McCann et al., 2017. Learned in Translation: Contextualized Word Vectors.
**Peters et al., 2018. Deep Contextualized Word Representations.
***Radford et al., 2018. Improving Language Understanding by Generative Pre-Training.
****Devlin et al., 2018. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.

Autoregressive vs Autoencoding
•AR language model (GPT)
[ x1, x2, x3, x4, x5, x6, x7, x8 ] (forward / backward)
$\max_\theta \log p_\theta(x) = \sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t})$
→ Good at text generation
•AE language model (BERT)
[ x1, x2, x3, x4, x5, x6, x7, x8 ] (bidirectional, via masked tokens)
$\max_\theta \log p_\theta(\bar{x} \mid \hat{x}) \approx \sum_{t=1}^{T} m_t \log p_\theta(x_t \mid \hat{x})$
→ Good at language understanding
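To make the two objectives concrete, here is a minimal numpy sketch (the toy data and the helper `toy_logits` are made up for illustration, not from the paper): the AR objective scores every token given its prefix, while the masked-LM objective scores only the masked tokens given the corrupted input, treating them as independent.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, T = 10, 8
x = rng.integers(0, vocab_size, size=T)            # toy token ids x1..x8
m = rng.random(T) < 0.15                           # BERT-style corruption mask m_t

def toy_logits(context):
    """Stand-in for a trained model: fixed pseudo-random logits per context."""
    local = np.random.default_rng(hash(tuple(int(c) for c in context)) % (2**32))
    return local.standard_normal(vocab_size)

def log_softmax(z):
    z = z - z.max()
    return z - np.log(np.exp(z).sum())

# AR objective: sum_t log p(x_t | x_<t)  -- exact left-to-right factorization
ar = sum(log_softmax(toy_logits(x[:t]))[x[t]] for t in range(T))

# AE / masked-LM objective: sum over masked positions of log p(x_t | x_hat),
# where x_hat is the corrupted sequence and predictions are made independently
MASK = vocab_size                                  # extra id playing the role of <mask>
x_hat = np.where(m, MASK, x)
ae = sum(log_softmax(toy_logits(x_hat))[x[t]] for t in range(T) if m[t])

print(f"AR log-likelihood: {ar:.2f}   masked-LM log-likelihood: {ae:.2f}")
```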
Autoregressive vs Autoencoding
•BERT’s limitations
•The model assumes that all masked tokens are independent of each other
•A generalized model should not rely on data corruption (masking)
•The <mask> token never appears in real downstream data
•It lacks long-term dependency
Changes
Permutation Language Modeling
Two-Stream Self-Attention
Transformer-XL
Permutation Language Modeling
Original sequence : [ 1, 2, 3, 4, 5, 6, 7, 8 ]
Forward AR : [ 1, 2, 3, 4, 5, 6, 7, 8 ]
Backward AR : [ 8, 7, 6, 5, 4, 3, 2, 1 ]
Permutation AR : [ 3, 2, 5, 6, 8, 1, 7, 4 ]
Permutation Language Modeling
Permutation orders of a length-4 sequence (examples) :
[ 3, 2, 4, 1 ]
[ 2, 4, 3, 1 ]
[ 1, 4, 2, 3 ]
[ 4, 3, 1, 2 ]
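Formally, XLNet's permutation language modeling objective maximizes the expected autoregressive log-likelihood over factorization orders $z$ sampled from the set $\mathcal{Z}_T$ of all permutations of $\{1, \dots, T\}$:

$$\max_\theta \; \mathbb{E}_{z \sim \mathcal{Z}_T}\left[\sum_{t=1}^{T} \log p_\theta\!\left(x_{z_t} \mid x_{z_{<t}}\right)\right]$$

The parameters $\theta$ are shared across all orders, so in expectation each position learns to use context from both sides, while the factorization itself stays autoregressive.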
Permutation Language Modeling
Masked Language Model (autoencoding) :
[ x1, x2, x3, x4, x5, x6, x7, x8 ] → replace some tokens with mask → predict the masked tokens from the corrupted sequence
Permutation Language Model (autoregressive) :
[ x1, x2, x3, x4, x5, x6, x7, x8 ] → keep all tokens and predict them autoregressively in a permuted order, e.g. [ x2, x1, x5, x8, x6, x5, x3, x7 ]
Permutation Language Modeling
Original : [ ‘New’, ‘York’, ‘is’, ‘a’, ‘city’ ]
Masked Language Model : [ <MASK>, <MASK>, ‘is’, ‘a’, ‘city’ ] → predict ‘New’ and ‘York’ independently of each other
Permutation Language Model : factorization order [ ‘is’, ‘a’, ‘city’, ‘New’, ‘York’ ] → predict ‘New’ given ‘is a city’, then ‘York’ given ‘is a city New’
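A small sketch (the helper `prediction_steps` is illustrative, not from the slides) of the (context → target) steps an AR model trains on: under the permuted order, ‘New’ is predicted from the real context ‘is a city’, and ‘York’ additionally conditions on ‘New’, which is exactly the dependency the masked LM's independent predictions discard.

```python
def prediction_steps(tokens, order):
    """List the (context, target) pairs for a given factorization order."""
    return [([tokens[j] for j in order[:t]], tokens[i]) for t, i in enumerate(order)]

tokens = ['New', 'York', 'is', 'a', 'city']

for ctx, tgt in prediction_steps(tokens, [0, 1, 2, 3, 4]):   # forward AR
    print(ctx, '->', tgt)
print()
for ctx, tgt in prediction_steps(tokens, [2, 3, 4, 0, 1]):   # order: is, a, city, New, York
    print(ctx, '->', tgt)
```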
Two-Stream Self-Attention
•GPT : [ x1, x2, x3, x4 ] → [ x5 ]
•BERT : a Transformer reads the input with some tokens replaced by <mask> and predicts the original tokens at those positions
•Permutation LM : given the same context [ x2, x1, x4, x3 ], the next target may be x5, x6, or x7 depending on the sampled order → ??? (the standard parameterization cannot tell which position it should predict)
•Solution : Two-Stream Self-Attention
•Content stream : the standard representation; each position can attend to its own content
•Query stream : encodes the target position but not its content (not used when fine-tuning)
Two-Stream Self-Attention
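A minimal numpy sketch (assumed function name, not the authors' implementation) of how the two streams differ for one sampled factorization order: the content stream at each position may attend to itself and to everything earlier in the order, while the query stream may attend only to earlier positions, so it knows where it is predicting but not what.

```python
import numpy as np

def two_stream_masks(order):
    """Attention masks for a factorization order (1 = query pos may attend to key pos)."""
    T = len(order)
    step = np.empty(T, dtype=int)
    step[list(order)] = np.arange(T)                        # step[i] = when position i is predicted
    content = (step[None, :] <= step[:, None]).astype(int)  # earlier tokens AND the token itself
    query = (step[None, :] < step[:, None]).astype(int)     # earlier tokens only
    return content, query

content, query = two_stream_masks([2, 0, 3, 1])   # a factorization order over positions 0..3
print(content)
print(query)
```

At fine-tuning time the query stream is dropped and only the content stream remains, so the network reduces to an ordinary Transformer(-XL) encoder.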
Transformer-XL (PR-161)
•Segment Recurrence
Google AI blog, Transformer-XL: Unleashing the Potential of Attention Models
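A single-head numpy sketch of segment recurrence (the weights, sizes, and the omitted causal mask and relative-position term are simplifications, not the paper's implementation): queries come only from the current segment, while keys and values also cover the cached hidden states of the previous segment, so the usable context grows beyond a single segment.

```python
import numpy as np

rng = np.random.default_rng(0)
d, seg_len = 16, 4
Wq, Wk, Wv = (0.1 * rng.standard_normal((d, d)) for _ in range(3))

def attention_with_memory(x, mem):
    """One attention step with Transformer-XL style segment recurrence."""
    ctx = np.concatenate([mem, x], axis=0)         # extended context = [cached ; current]
    q, k, v = x @ Wq, ctx @ Wk, ctx @ Wv           # queries only for the current segment
    scores = q @ k.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ v                             # (seg_len, d) new hidden states

mem = np.zeros((0, d))                             # no memory before the first segment
for _ in range(3):                                 # three consecutive segments
    x = rng.standard_normal((seg_len, d))
    h = attention_with_memory(x, mem)
    mem = h                                        # cache states (treated as constant) for the next segment
print(h.shape, mem.shape)
```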
Transformer-XL (PR-161)
•Relative Positional Embedding
•When we reuse old hidden states, how can we define positional encodings?
•Standard Transformer : absolute positions restart in every segment (segment 1: [0, 1, 2, 3], segment 2: [0, 1, 2, 3], segment 3: [0, 1, 2, 3]), so positions from reused states clash with the current segment
•Transformer-XL : it is enough to know the relative distance i − j between the query vector (position i) and the key vector (position j)
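A minimal sketch (illustrative, not the paper's exact parameterization) of the quantity the relative positional embeddings are indexed by: the distance i − j between each query position i in the current segment and each key position j over [memory ; current segment].

```python
import numpy as np

def relative_distances(q_len, mem_len):
    """Matrix of relative distances i - j between current-segment queries and all keys."""
    i = np.arange(q_len)[:, None] + mem_len        # query positions inside the extended context
    j = np.arange(mem_len + q_len)[None, :]        # key positions over [memory ; segment]
    return i - j                                   # same matrix for every segment

print(relative_distances(q_len=4, mem_len=4))
```

Because only i − j enters the attention score, the encoding is identical no matter which segment the cached states came from, which is exactly what the restarting absolute positions [0, 1, 2, 3] cannot provide.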
Experiments
•Pretraining Datasets
•BERT : BookCorpus + English Wikipedia
•XLNet : BookCorpus + English Wikipedia + Giga5 + ClueWeb + Common Crawl
•Model Size
•XLNet-Large ≈ BERT-Large
•Training Time
•512 TPU v3 chips for 500K steps with the Adam optimizer (batch size 2048, 2.5 days) → still underfits!
Experiments
•SQuAD Dataset
Experiments
•Text Classification
Experiments
•RACE Dataset
Experiments
•GLUE Dataset
Ablation Study
•Importance of Transformer-XL
Ablation Study
•Effectiveness of Permutation Language Modeling
Conclusion
•Permutation language modeling can cover both AR and AE models
XLNet : forward order [ 1, 2, 3, 4 ], backward order [ 8, 7, 6, 5 ], and permuted orders such as [ 1, 4, 2, 3 ] over [ x1, x2, x3, x4, x5, x6, x7, x8 ] are all instances of one objective
BERT (AE) : bidirectional context over [ x1, x2, x3, x4, x5, x6, x7, x8 ] via masked tokens; XLNet keeps the bidirectional context without masking
•Transformer-XL is very good for downstream tasks as well as language modeling
Summary
Generalized Autoregressive Pretraining for Language Understanding
•Pretrains without data corruption (masking) by using the permutation LM
•An autoregressive language model that also utilizes bidirectional context
•Outperformed BERT on 20 NLP tasks, with state-of-the-art results on 18 of them
Q & A
