Super Tickets in Pre-Trained Language Models: From
Model Compression to Improving Generalization
Chen Liang, Simiao Zuo, Minshuo Chen, Haoming Jiang, Xiaodong Liu, Pengcheng
He, Tuo Zhao, Weizhu Chen
Lottery Ticket Hypothesis
• A randomly-initialized, dense neural network contains a subnetwork
that is initialized such that—when trained in isolation—it can match the
test accuracy of the original network after training for at most the same
number of iterations.
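The hypothesis above can be illustrated with one round of the magnitude-pruning procedure from the original lottery ticket work: train, keep the largest-magnitude weights, and reset the survivors to their initial values. This is only a rough sketch. The function name `magnitude_mask` is hypothetical, and "training" is replaced by a random perturbation so that the example stays self-contained.

```python
import numpy as np

def magnitude_mask(trained_weights, keep_fraction):
    """Boolean mask that keeps the largest-magnitude fraction of the weights."""
    n_keep = int(round(keep_fraction * trained_weights.size))
    order = np.argsort(np.abs(trained_weights).ravel())  # ascending magnitude
    mask = np.zeros(trained_weights.size, dtype=bool)
    mask[order[trained_weights.size - n_keep:]] = True    # keep the top n_keep
    return mask.reshape(trained_weights.shape)

rng = np.random.default_rng(0)
init_weights = rng.normal(size=(4, 5))
# Assumption: stand-in for real training, just a random perturbation.
trained_weights = init_weights + rng.normal(scale=0.5, size=(4, 5))

mask = magnitude_mask(trained_weights, keep_fraction=0.2)
# The "ticket": surviving weights are rewound to their initial values.
ticket = np.where(mask, init_weights, 0.0)
```

The ticket would then be trained in isolation and its test accuracy compared with that of the dense network.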
Phase Transition on LTH
1) Phase Transition: The change in the test accuracy of the compressed model as the percentage of weights remaining varies
2) Super Ticket: The ticket at the best value of weights remaining (esp. between Phase 1 and Phase 2 in this paper).
Contributions
• The first to identify the phase transition phenomenon in pruning large
neural language models
• The first to show that pruning can improve the generalization when the
models are lightly compressed
• Propose a new pruning approach for multi-task fine-tuning of neural
language models
Transformer – MultiHeadAttention
• Attention
• Multi-Head Attention
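The two bullets above refer to the standard Transformer definitions, which can be sketched in NumPy as follows. This is a minimal illustration, assuming a single unbatched sequence and no masking or biases. The weight names `w_q`, `w_k`, `w_v`, `w_o` are chosen here for illustration and do not come from the slides.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """Scaled dot-product attention: softmax(q k^T / sqrt(d_k)) v."""
    d_k = q.shape[-1]
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(d_k)
    return softmax(scores, axis=-1) @ v

def multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """Project x, run attention independently per head, concatenate, project."""
    seq_len, d_model = x.shape
    d_head = d_model // num_heads

    def split(t):  # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    heads = attention(split(x @ w_q), split(x @ w_k), split(x @ w_v))
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ w_o

rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 4, 8, 2
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v, w_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))
out = multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads)
```

Each head is a separate slice of the projected representation, which is what makes individual heads a natural unit for structured pruning.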
Finding Super Tickets
• Pruning of attention heads and feed-forward layers.
• Adopt an importance score
Low Importance Score: small contribution towards the output
High Importance Score: high expressive power for the output
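The slide does not give the formula for the importance score. As an illustration only, the sketch below assumes a sensitivity-style score: the expected absolute gradient of the loss with respect to a per-head mask variable. It uses a toy model (the prediction is a masked sum of scalar head outputs, with squared loss) so that the gradient has a closed form. The names `importance_scores` and `prune_mask` are hypothetical.

```python
import numpy as np

def importance_scores(head_outputs, targets):
    """Mean absolute gradient of the loss w.r.t. each head's mask variable.

    Toy model (an assumption): y_hat = sum_h mask_h * head_outputs[:, h],
    with all masks equal to 1 and loss = 0.5 * (y_hat - y)^2, so
    dLoss/dmask_h = (y_hat - y) * head_outputs[:, h].
    """
    residual = head_outputs.sum(axis=1) - targets   # shape (n_samples,)
    grads = residual[:, None] * head_outputs        # shape (n_samples, n_heads)
    return np.abs(grads).mean(axis=0)

def prune_mask(scores, prune_fraction):
    """Boolean keep-mask that drops the lowest-scoring fraction of heads."""
    n_prune = int(round(prune_fraction * scores.size))
    mask = np.ones(scores.size, dtype=bool)
    mask[np.argsort(scores)[:n_prune]] = False
    return mask

rng = np.random.default_rng(0)
head_outputs = rng.normal(size=(256, 6))
head_outputs[:, 3] *= 1e-3                          # head 3 contributes almost nothing
targets = rng.normal(size=256)

scores = importance_scores(head_outputs, targets)
mask = prune_mask(scores, prune_fraction=1 / 6)     # drop the single weakest head
```

In this toy setup the near-silent head receives the lowest score and is the one removed, matching the "small contribution towards the output" reading above.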
Multi-task learning with Tickets Sharing
Experiments - Single Task
• Models
- Baseline: ST-DNN (Base/Large): BERT (Base/Large) with a single-task FFN.
- Proposed: SuperT (Base/Large): BERT (Base/Large) with super tickets.
SuperT is pruned at 8 different sparsity levels (e.g. 10% heads / 20% FFN), and the best one is chosen.
• Compile/Train Options
- Optimizer: Adamax
- Learning rate: {5e-5, 1e-4, 2e-4}
- Batch size: {8, 6, 32}
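The "prune at 8 sparsity levels, then choose the best" step above can be sketched as a simple search over the fraction of weights remaining. This is an illustrative sketch, not the authors' code. `toy_evaluate` is a hypothetical stand-in for "fine-tune the pruned model and measure its dev-set score", and the fractions listed are placeholders rather than the levels used in the paper.

```python
def select_super_ticket(remaining_fractions, evaluate):
    """Return (best_fraction, best_score) over the candidate sparsity levels."""
    scored = [(evaluate(r), r) for r in remaining_fractions]
    best_score, best_fraction = max(scored)
    return best_fraction, best_score

def toy_evaluate(remaining):
    # Assumption: synthetic score that peaks at light compression (0.8),
    # standing in for a real fine-tune-and-evaluate run.
    return 1.0 - (remaining - 0.8) ** 2

fractions = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]  # 8 placeholder levels
best_fraction, best_score = select_super_ticket(fractions, toy_evaluate)
```

In a real run the selection would be made on a held-out development set, so that the reported test score is not biased by the search.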
Experiment results on GLUE Benchmarks
In all tasks, SuperT consistently achieves
better generalization than ST-DNN.
The performance gain of the super tickets is more
significant on small tasks.
The performance of the super tickets is related to
model size. In large models, more non-expressive
tickets can be pruned without
performance degradation.
Experiment results on GLUE Benchmarks
Single-task fine-tuning evaluation results of
1) super tickets (blue), 2) random tickets (orange), 3) losing tickets, at 8 different sparsity levels
Experiments – Multi Task
• Models
Baseline:
1) MT-DNN (Base/Large): BERT (Base/Large) with task-shared layers
2) MT-DNN (Base/Large) + ST Fine-tuning: MT-DNN further trained on each individual downstream task
Proposed:
1) Ticket-Share (Base/Large): MT-DNN model refined through the ticket sharing strategy.
2) Ticket-Share (Base/Large) + ST Fine-tuning: a Ticket-Share model fine-tuned on a single task.
• Compile/Train Options
- Same as those of Single Task.
Experiment results on GLUE
Experiment results on SNLI/ SciTail
Analysis
• Sensitivity to Random Seed
Training with super tickets effectively reduces
the variance in model performance caused
by random initialization.
• Tickets Importance Across Tasks
SST-2 benefits little from tickets sharing (see
Figure 6(a)(c)(d)).
Some tickets are dominated by a single task, e.g.,
CoLA (Figure 6(c)), or dominated jointly by
two tasks, e.g., CoLA and STS-B (Figure 6(d)).
Thus, some tickets only learn task-specific
knowledge, and the two tasks may share
certain task-specific knowledge.
Discussion
• Structured Lottery Tickets
• Searching Better Generalized Super Tickets
• Searching Super Tickets Efficiently


Editor's Notes

  1. In multi-task learning, the shared model is highly over-parameterized to ensure sufficient capacity for fitting individual tasks. A multi-task model therefore inevitably exhibits task-dependent redundancy when it is adapted to individual tasks.