SWITCH TRANSFORMERS: SCALING TO
TRILLION PARAMETER MODELS WITH
SIMPLE AND EFFICIENT SPARSITY
NLP Team: 박희수 (presenter), 백지윤, 진명훈
Motivation
How much energy does training a single NLP model actually consume?
Energy and Policy Considerations for Deep Learning in NLP
Training a single Transformer model via NAS makes it a major contributor to global warming…
Motivation
And yet, NLP model sizes keep growing…!
http://gabrielilharco.com/publications/EMNLP_2020_Tutorial__High_Performance_NLP.pdf
Motivation
GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding
Sparsely-activated transformers based on Mixture of Experts (MoE)
MoE has not seen wide adoption because of three problems:
(1) complexity,
(2) communication costs, and
(3) training instabilities.
Goal: build efficient sparsely-activated transformers that solve these three problems!
What is the Mixture of Experts (MoE)?
Two ImageNet images from the same class
Needs the ability to recognize a dog from just part of it
Needs the ability to find the dog among the background and other objects
Too much load for a single model!
What is the Mixture of Experts (MoE)?
Two ImageNet images from the same class
Background-separation expert
Object-detection expert
Dog-part-recognition expert
MoE Layer
Two ImageNet images from the same class
$h(x) = W_r \cdot x$
$\rightarrow p(x) = \mathrm{softmax}(h(x))$
$\rightarrow$ select the top-$k$ experts $\mathcal{T}$
$y = \sum_{i \in \mathcal{T}} p_i(x) \, E_i(x)$
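A minimal sketch of this routing (NumPy; names are hypothetical): the router projects a token, softmaxes the logits, keeps the top-k experts, and combines their outputs weighted by the gate values.

```python
import numpy as np

def moe_layer(x, W_r, experts, k=2):
    """Route one token x [d_model] to its top-k experts.

    W_r: router weights [d_model, num_experts]
    experts: list of callables, experts[i](x) -> [d_model]
    """
    h = x @ W_r                                      # router logits h(x) = W_r . x
    p = np.exp(h - h.max()); p /= p.sum()            # p = softmax(h(x))
    top_k = np.argsort(p)[-k:]                       # indices of the top-k experts
    return sum(p[i] * experts[i](x) for i in top_k)  # y = sum_i p_i(x) E_i(x)

# usage: three toy experts on a 4-dim token
experts = [lambda x: x * 2.0, lambda x: x + 1.0, lambda x: -x]
y = moe_layer(np.ones(4), np.random.default_rng(0).normal(size=(4, 3)), experts, k=2)
```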
MoE for Transformer
GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding
In the Transformer, MoE is applied only to the feed-forward network
Routing is applied per token
Any Questions?
Basic idea for Switch Transformer
Select only a single expert!
1. Using a single expert reduces the router computation (top-k -> top-1)
2. Selecting k experts requires copying the data k times; with a single expert this is unnecessary, so the effective batch size per expert shrinks compared to standard MoE
3. Routing is followed by communication between devices; selecting a single expert reduces this communication cost
k = 1 routing strategy
(1) Distributed Switch Implementation
expert capacity = (tokens per batch / number of experts) × capacity factor
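A worked example: with 1,024 tokens per batch, 8 experts, and a capacity factor of 1.25, each expert gets a buffer of 1024 / 8 × 1.25 = 160 token slots; tokens routed to an expert beyond its capacity overflow and are passed on through the residual connection. A one-function sketch of the formula:

```python
def expert_capacity(tokens_per_batch: int, num_experts: int, capacity_factor: float) -> int:
    # expert capacity = (tokens per batch / number of experts) * capacity factor
    return int(tokens_per_batch / num_experts * capacity_factor)

assert expert_capacity(1024, 8, 1.25) == 160
```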
k = 1 routing strategy
(2) A Differentiable Load Balancing Loss.
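The loss from the paper pushes routing toward a uniform distribution over the $N$ experts: with $f_i$ the fraction of tokens dispatched to expert $i$ and $P_i$ the mean router probability assigned to expert $i$, the auxiliary loss is $\alpha \cdot N \cdot \sum_i f_i \, P_i$ with $\alpha \approx 10^{-2}$. A minimal NumPy sketch (names are hypothetical):

```python
import numpy as np

def load_balancing_loss(router_probs, chosen_expert, num_experts, alpha=1e-2):
    """router_probs: [num_tokens, num_experts] softmax outputs.
    chosen_expert: [num_tokens] index of the top-1 expert per token."""
    f = np.bincount(chosen_expert, minlength=num_experts) / len(chosen_expert)
    P = router_probs.mean(axis=0)               # mean router probability per expert
    return alpha * num_experts * np.sum(f * P)  # minimized (= alpha) under uniform routing
```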
Benchmarking Switch versus MoE
Comparison at a quality threshold of Neg. Log Perp. = -1.495.
Improved Training and Fine-Tuning Techniques
(1) Selective Precision
Float32 precision is used only within the body of the router function – on computations local to that device.
(Diagram: float32 inside the router; bfloat16 for the surrounding computation and communication.)
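A minimal sketch of the pattern (NumPy; float16 stands in for bfloat16, which NumPy lacks, and names are hypothetical): inputs are upcast to float32 only for the local router computation, and the result is cast back down before any cross-device communication.

```python
import numpy as np

def router_probs(x_low, W_r):
    """x_low: [tokens, d_model] in low precision; W_r: [d_model, num_experts]."""
    x = x_low.astype(np.float32)             # upcast: softmax/exp are unstable in 16-bit
    logits = x @ W_r.astype(np.float32)
    p = np.exp(logits - logits.max(axis=-1, keepdims=True))
    p /= p.sum(axis=-1, keepdims=True)
    return p.astype(np.float16)              # downcast before dispatch/communication
```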
Improved Training and Fine-Tuning Techniques
(2) Selective Dropout
The best performance is achieved when the ratio between expert dropout and standard dropout is tuned appropriately
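A sketch of this setting (the 0.1 / 0.4 rates follow the combination the paper reports as best; names here are hypothetical):

```python
import numpy as np

# Small dropout at non-expert layers, much larger dropout inside the expert FFNs.
DROPOUT = {"non_expert": 0.1, "expert": 0.4}

def dropout(x, layer_kind, rng=np.random.default_rng(0), training=True):
    rate = DROPOUT[layer_kind]
    if not training or rate == 0.0:
        return x
    keep = rng.random(x.shape) >= rate
    return x * keep / (1.0 - rate)   # inverted dropout: rescale kept activations
```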
Improved Training and Fine-Tuning Techniques
(3) A Better Initialization
Multiplying the truncated-normal initialization by 0.1 gave the smallest variance across runs → stable results
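A minimal sketch, assuming the paper's scheme of drawing weights from a truncated normal with standard deviation √(scale / fan_in) and reducing the scale from 1.0 to 0.1 (truncation at ±2 standard deviations is an assumption here):

```python
import numpy as np

def truncated_normal_init(fan_in, fan_out, scale=0.1, seed=0):
    """Draw weights from N(0, scale/fan_in), re-sampling values beyond 2 std."""
    rng = np.random.default_rng(seed)
    std = np.sqrt(scale / fan_in)
    w = rng.normal(0.0, std, size=(fan_in, fan_out))
    out_of_range = np.abs(w) > 2.0 * std
    while out_of_range.any():                 # simple rejection resampling
        w[out_of_range] = rng.normal(0.0, std, size=out_of_range.sum())
        out_of_range = np.abs(w) > 2.0 * std
    return w
```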
Any Questions?
Scaling Properties
(1) Results on a step-basis (total training steps fixed)
• Comparison with all other variables, e.g. FLOPS, held equal
• Results: (1) performance improves as the number of experts increases; (2) Switch reaches in 60K steps the performance the T5 base model reaches at 450K steps
Scaling Properties
(2) Results on a time-basis (total training time fixed)
• T5 and Switch compared again under fixed time and memory budgets
• Result: Switch reaches the same performance 7x faster than T5
Scaling Properties
(3) Scaling vs. a Large Dense Model (total parameter count fixed)
• T5-large uses 3.5x more compute per token
• Result 1: Switch reaches the same performance 7x faster than T5-base
• Result 2: Switch reaches the same performance 2.5x faster than T5-large
Any Questions?
Fine-tuning
(1) Baseline and Switch models used for fine-tuning
Fine-tuning
(2) Fine-tuning tasks and datasets.
Distillation
(1) Distillation techniques.
Distillation
(2) Achievable compression rates.
Distillation
(3) Distilling a fine-tuned model.
Multilingual Learning
Designing models with data, model, and expert-parallelism
All of the layouts below partition the same FFN computation: input $x \in [B, d_{model}]$; intermediate $h = xW_{in} \in [B, d_{ff}]$; output $y = \mathrm{ReLU}(h)\,W_{out} \in [B, d_{model}]$, with $d_{ff} \gg d_{model}$. Here $n$ is the number of data-parallel shards, $m$ the number of model-parallel shards, and $N = n \times m$ the total number of cores.
Data parallelism ($n = N$, $m = 1$): This has the advantage that no communication is needed until the entire forward and backward pass is finished and the gradients then need to be aggregated across all cores.
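A toy NumPy sketch of this FFN and of the two axes the layouts shard (sizes are illustrative):

```python
import numpy as np

B, d_model, d_ff = 8, 4, 16                # toy sizes; in practice d_ff >> d_model
x = np.random.randn(B, d_model)            # input tokens [B, d_model]
W_in = np.random.randn(d_model, d_ff)
W_out = np.random.randn(d_ff, d_model)

h = x @ W_in                               # intermediate [B, d_ff]
y = np.maximum(h, 0.0) @ W_out             # y = ReLU(h) W_out -> [B, d_model]

# Data parallelism shards the token axis B across n cores; model parallelism
# shards the d_ff axis (columns of W_in, rows of W_out) across m cores.
```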
Model parallelism ($n = 1$, $m = N$): All cores must keep the full $B$ tokens, and each core contains a unique slice of the weights. A communication cost is now incurred for each forward and backward pass.
Model and data parallelism ($n = 4$, $m = 4$): Each core is responsible for $B/n$ tokens and $d_{ff}/m$ of both the weights and the intermediate activations.
Expert and data parallelism ($n = N$, $m = 1$): Each expert sits on its own data-parallel shard; each core is responsible for $B/n$ tokens and, since $m = 1$, holds the full $d_{ff}$ of its expert's weights and intermediate activations.
Expert, data, and model parallelism ($n$ = number of experts, $m = 4$): Each core is responsible for $B/n$ tokens and $d_{ff}/m$ of both the weights and the intermediate activations.
Sample efficiency versus T5-XXL: The gap continues to increase with additional training, with the Switch-XXL model outperforming T5-XXL by 0.087 neg. log perplexity by 500k steps.
Training instability: The larger Switch-C model, with 1.6T parameters and 2048 experts, exhibits no training instability at all. In contrast, the Switch-XXL version, with nearly 10x larger FLOPs per sequence, is sometimes unstable.
DISCUSSION
• Isn't Switch Transformer better due to sheer parameter count?
Yes, and by design! Parameters, independent of the total FLOPs used, are a useful axis to scale neural language models.
• I don't have access to a supercomputer – is this still useful for me?
Though this work has focused on extremely large models, we also find that models with as few as two experts improve performance while easily fitting within the memory constraints of commonly available GPUs or TPUs.
• Do sparse models outperform dense models on the speed-accuracy Pareto curve?
Yes. Across a wide variety of model sizes, sparse models outperform dense models per step and on wall-clock time.
• I can't deploy a trillion-parameter model – can we shrink these models?
We cannot fully preserve the model quality, but compression rates of 10 to 100x are achievable by distilling sparse models into dense models while retaining ≈30% of the quality gain of the expert model.
Any Questions?