SlideShare ist ein Scribd-Unternehmen logo
1 von 20
Downloaden Sie, um offline zu lesen
Kabir Nagrecha & Arun Kumar
5/4/2023
Saturn
Unifying Parallelism, Resource Allocation, and Scheduling for
Multi-Large-Model Deep Learning
Agenda
• Introduction
• Deep Learning - A New Era
• Critical Challenges
• Why Unify?
• Saturn - A New Way?
• Conclusion
Introduction
(5mins)
A New Era
• Deep Learning has changed….
Fine-Tuning & Applications
• Off-the-shelf models have to be fine-tuned and adapted
• Model is big…data might not be
• Model Selection is critical - motivating multi-model
• Democratizing fine-tuning for domain scientists & practitioners
Critical Challenges - Parallelism
• Parallelism has become essential but complex
• Model Parallel?
• Pipelining?
• Offloading?
• Data Parallel / Sharded Data Parallel?
• Hybrids?
Critical Challenges - Resource Allocation
• Non-Linear Scaling Complicates Resource Apportioning
• In a multi-job, how should GPUs be distributed?
• How does each model’s performance scale?
• Local performance vs global throughput
Critical Challenges - Scheduling
• Scheduling requires both local & global understanding
• What’s the estimated runtime of each job?
• How can I most effectively utilize my GPUs to minimize makespan?
Unification
Parallelism
Conclusion: We have to join these problems!
GPU Apportioning
Scheduling
Saturn - A New Way
(7mins)
SPASE: A New Optimization Problem
• Select Parallelism Pipeline Parallel or Data Parallel?
• Allocate resources How Many GPUs per Job?
• SchedulE jobs A before B, or B before A?
Given a Multi-Job of Large Models, we have to….
Saturn - A SPASE System
1. Library
2. Profiler
3. Joint Optimizer
4. Executor
User
Parallelism Registration
Job Submission
Saturn - A SPASE System
Library: register & retrieve parallelism techniques
Already supports popular techniques such as pipelining, DDP, FSDP, and more!
Saturn - A SPASE System
Profiler: performance estimates for each model
under each parallelism & possible apportionment
Saturn - A SPASE System
Introspective Solver: MILP-solving tool to produce
parallelisms, apportionments, & start times for each
model
Pro
fi
ler Results
Hardware Information
Parallelism Selection
per Model
GPU Allocation Per
Model
Start Time Per Model
Evaluations - Background
• GPT Fine-Tuning hyperparameter selection
• 12 6B parameter models
• WikiText data
• Different learning rates, batch sizes
• Vision Transformer
• Neural Architecture Evaluation
• ImageNet
• 12 500M - 2B parameter models
• 8-GPU A100 nodes
Evaluations: Single-Node, 8-GPU
Baseline: 8-GPUs per model, run in sequence
Standard Practice
30.6 hours
Standard Practice
19.05 hours
ViT
GPT
Saturn Saturn
17.4 hours
10.75 hours
1.76X Speedup!
1.77X Speedup!
Evaluations: Two-Node, 16-GPU
Standard Practice
14.57 hours
10.15 hours
ViT
GPT
Saturn Saturn
8.23 hours
5.17 hours
1.77X Speedup!
1.96X Speedup!
Baseline: 8-GPUs per model, run in sequence
Standard Practice
Conclusion
(2mins)
Conclusion
• Modern DL Scale challenges motivate automated, easy-to-use, and
resource-efficient training systems
• We should consider DL efficiency holistically
• Saturn, the first work to tackle this new joint problem of
Parallelism, Allocation, and Scheduling demonstrates 40-50%
runtime reductions

Weitere ähnliche Inhalte

Ähnlich wie Unifying Parallelism, Resource Allocation, and Scheduling for Multi-Large-Model Deep Learning

Deep Learning on Apache® Spark™: Workflows and Best Practices
Deep Learning on Apache® Spark™: Workflows and Best PracticesDeep Learning on Apache® Spark™: Workflows and Best Practices
Deep Learning on Apache® Spark™: Workflows and Best PracticesDatabricks
 
Deep Learning on Apache® Spark™: Workflows and Best Practices
Deep Learning on Apache® Spark™: Workflows and Best PracticesDeep Learning on Apache® Spark™: Workflows and Best Practices
Deep Learning on Apache® Spark™: Workflows and Best PracticesJen Aman
 
GPU Support In Spark And GPU/CPU Mixed Resource Scheduling At Production Scale
GPU Support In Spark And GPU/CPU Mixed Resource Scheduling At Production ScaleGPU Support In Spark And GPU/CPU Mixed Resource Scheduling At Production Scale
GPU Support In Spark And GPU/CPU Mixed Resource Scheduling At Production ScaleSpark Summit
 
Query-time Nonparametric Regression with Temporally Bounded Models - Patrick ...
Query-time Nonparametric Regression with Temporally Bounded Models - Patrick ...Query-time Nonparametric Regression with Temporally Bounded Models - Patrick ...
Query-time Nonparametric Regression with Temporally Bounded Models - Patrick ...Lucidworks
 
Building a modern data platform with scala, akka, apache beam
Building a modern data platform with scala, akka, apache beamBuilding a modern data platform with scala, akka, apache beam
Building a modern data platform with scala, akka, apache beamRaymond Tay
 
MongoDB for Spatio-Behavioral Data Analysis and Visualization
MongoDB for Spatio-Behavioral Data Analysis and VisualizationMongoDB for Spatio-Behavioral Data Analysis and Visualization
MongoDB for Spatio-Behavioral Data Analysis and VisualizationMongoDB
 
Lessons Learned Replatforming A Large Machine Learning Application To Apache ...
Lessons Learned Replatforming A Large Machine Learning Application To Apache ...Lessons Learned Replatforming A Large Machine Learning Application To Apache ...
Lessons Learned Replatforming A Large Machine Learning Application To Apache ...Databricks
 
Using graphs for recommendations
Using graphs for recommendationsUsing graphs for recommendations
Using graphs for recommendationsRik Van Bruggen
 
Combining Machine Learning frameworks with Apache Spark
Combining Machine Learning frameworks with Apache SparkCombining Machine Learning frameworks with Apache Spark
Combining Machine Learning frameworks with Apache SparkDataWorks Summit/Hadoop Summit
 
Towards True Elasticity of Spark-(Michael Le and Min Li, IBM)
Towards True Elasticity of Spark-(Michael Le and Min Li, IBM)Towards True Elasticity of Spark-(Michael Le and Min Li, IBM)
Towards True Elasticity of Spark-(Michael Le and Min Li, IBM)Spark Summit
 
PAC 2019 virtual Alexander Podelko
PAC 2019 virtual Alexander Podelko PAC 2019 virtual Alexander Podelko
PAC 2019 virtual Alexander Podelko Neotys
 
Graphene – Microsoft SCOPE on Tez
Graphene – Microsoft SCOPE on Tez Graphene – Microsoft SCOPE on Tez
Graphene – Microsoft SCOPE on Tez DataWorks Summit
 
WSO2Con USA 2017: Scalable Real-time Complex Event Processing at Uber
WSO2Con USA 2017: Scalable Real-time Complex Event Processing at UberWSO2Con USA 2017: Scalable Real-time Complex Event Processing at Uber
WSO2Con USA 2017: Scalable Real-time Complex Event Processing at UberWSO2
 
Hadoop: The Default Machine Learning Platform ?
Hadoop: The Default Machine Learning Platform ?Hadoop: The Default Machine Learning Platform ?
Hadoop: The Default Machine Learning Platform ?Milind Bhandarkar
 
Challenges on Distributed Machine Learning
Challenges on Distributed Machine LearningChallenges on Distributed Machine Learning
Challenges on Distributed Machine Learningjie cao
 
GPU Support in Spark and GPU/CPU Mixed Resource Scheduling at Production Scale
GPU Support in Spark and GPU/CPU Mixed Resource Scheduling at Production ScaleGPU Support in Spark and GPU/CPU Mixed Resource Scheduling at Production Scale
GPU Support in Spark and GPU/CPU Mixed Resource Scheduling at Production Scalesparktc
 
Spark on Mesos
Spark on MesosSpark on Mesos
Spark on MesosJen Aman
 
Ideas spracklen-final
Ideas spracklen-finalIdeas spracklen-final
Ideas spracklen-finalsupportlogic
 
Writing Scalable Software in Java
Writing Scalable Software in JavaWriting Scalable Software in Java
Writing Scalable Software in JavaRuben Badaró
 

Ähnlich wie Unifying Parallelism, Resource Allocation, and Scheduling for Multi-Large-Model Deep Learning (20)

Deep Learning on Apache® Spark™: Workflows and Best Practices
Deep Learning on Apache® Spark™: Workflows and Best PracticesDeep Learning on Apache® Spark™: Workflows and Best Practices
Deep Learning on Apache® Spark™: Workflows and Best Practices
 
Deep Learning on Apache® Spark™: Workflows and Best Practices
Deep Learning on Apache® Spark™: Workflows and Best PracticesDeep Learning on Apache® Spark™: Workflows and Best Practices
Deep Learning on Apache® Spark™: Workflows and Best Practices
 
Machine learninginspark
Machine learninginsparkMachine learninginspark
Machine learninginspark
 
GPU Support In Spark And GPU/CPU Mixed Resource Scheduling At Production Scale
GPU Support In Spark And GPU/CPU Mixed Resource Scheduling At Production ScaleGPU Support In Spark And GPU/CPU Mixed Resource Scheduling At Production Scale
GPU Support In Spark And GPU/CPU Mixed Resource Scheduling At Production Scale
 
Query-time Nonparametric Regression with Temporally Bounded Models - Patrick ...
Query-time Nonparametric Regression with Temporally Bounded Models - Patrick ...Query-time Nonparametric Regression with Temporally Bounded Models - Patrick ...
Query-time Nonparametric Regression with Temporally Bounded Models - Patrick ...
 
Building a modern data platform with scala, akka, apache beam
Building a modern data platform with scala, akka, apache beamBuilding a modern data platform with scala, akka, apache beam
Building a modern data platform with scala, akka, apache beam
 
MongoDB for Spatio-Behavioral Data Analysis and Visualization
MongoDB for Spatio-Behavioral Data Analysis and VisualizationMongoDB for Spatio-Behavioral Data Analysis and Visualization
MongoDB for Spatio-Behavioral Data Analysis and Visualization
 
Lessons Learned Replatforming A Large Machine Learning Application To Apache ...
Lessons Learned Replatforming A Large Machine Learning Application To Apache ...Lessons Learned Replatforming A Large Machine Learning Application To Apache ...
Lessons Learned Replatforming A Large Machine Learning Application To Apache ...
 
Using graphs for recommendations
Using graphs for recommendationsUsing graphs for recommendations
Using graphs for recommendations
 
Combining Machine Learning frameworks with Apache Spark
Combining Machine Learning frameworks with Apache SparkCombining Machine Learning frameworks with Apache Spark
Combining Machine Learning frameworks with Apache Spark
 
Towards True Elasticity of Spark-(Michael Le and Min Li, IBM)
Towards True Elasticity of Spark-(Michael Le and Min Li, IBM)Towards True Elasticity of Spark-(Michael Le and Min Li, IBM)
Towards True Elasticity of Spark-(Michael Le and Min Li, IBM)
 
PAC 2019 virtual Alexander Podelko
PAC 2019 virtual Alexander Podelko PAC 2019 virtual Alexander Podelko
PAC 2019 virtual Alexander Podelko
 
Graphene – Microsoft SCOPE on Tez
Graphene – Microsoft SCOPE on Tez Graphene – Microsoft SCOPE on Tez
Graphene – Microsoft SCOPE on Tez
 
WSO2Con USA 2017: Scalable Real-time Complex Event Processing at Uber
WSO2Con USA 2017: Scalable Real-time Complex Event Processing at UberWSO2Con USA 2017: Scalable Real-time Complex Event Processing at Uber
WSO2Con USA 2017: Scalable Real-time Complex Event Processing at Uber
 
Hadoop: The Default Machine Learning Platform ?
Hadoop: The Default Machine Learning Platform ?Hadoop: The Default Machine Learning Platform ?
Hadoop: The Default Machine Learning Platform ?
 
Challenges on Distributed Machine Learning
Challenges on Distributed Machine LearningChallenges on Distributed Machine Learning
Challenges on Distributed Machine Learning
 
GPU Support in Spark and GPU/CPU Mixed Resource Scheduling at Production Scale
GPU Support in Spark and GPU/CPU Mixed Resource Scheduling at Production ScaleGPU Support in Spark and GPU/CPU Mixed Resource Scheduling at Production Scale
GPU Support in Spark and GPU/CPU Mixed Resource Scheduling at Production Scale
 
Spark on Mesos
Spark on MesosSpark on Mesos
Spark on Mesos
 
Ideas spracklen-final
Ideas spracklen-finalIdeas spracklen-final
Ideas spracklen-final
 
Writing Scalable Software in Java
Writing Scalable Software in JavaWriting Scalable Software in Java
Writing Scalable Software in Java
 

Kürzlich hochgeladen

Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz1
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxolyaivanovalion
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...Suhani Kapoor
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxolyaivanovalion
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxJohnnyPlasten
 
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiLow Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiSuhani Kapoor
 
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一ffjhghh
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAroojKhan71
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfSocial Samosa
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130Suhani Kapoor
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxEmmanuel Dauda
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxolyaivanovalion
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxolyaivanovalion
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingNeil Barnes
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfLars Albertsson
 
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiVIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiSuhani Kapoor
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxolyaivanovalion
 
Ukraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSUkraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSAishani27
 

Kürzlich hochgeladen (20)

Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptx
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptx
 
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptx
 
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiLow Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
 
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
 
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptx
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptx
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptx
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data Storytelling
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdf
 
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiVIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptx
 
Ukraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSUkraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICS
 

Unifying Parallelism, Resource Allocation, and Scheduling for Multi-Large-Model Deep Learning

  • 1. Kabir Nagrecha & Arun Kumar 5/4/2023 Saturn Unifying Parallelism, Resource Allocation, and Scheduling for Multi-Large-Model Deep Learning
  • 2. Agenda • Introduction • Deep Learning - A New Era • Critical Challenges • Why Unify? • Saturn - A New Way? • Conclusion
  • 4. A New Era • Deep Learning has changed….
  • 5. Fine-Tuning & Applications • Off-the-shelf models have to be fine-tuned and adapted • Model is big…data might not be • Model Selection is critical - motivating multi-model • Democratizing fine-tuning for domain scientists & practitioners
  • 6. Critical Challenges - Parallelism • Parallelism has become essential but complex • Model Parallel? • Pipelining? • Offloading? • Data Parallel / Sharded Data Parallel? • Hybrids?
  • 7. Critical Challenges - Resource Allocation • Non-Linear Scaling Complicates Resource Apportioning • In a multi-job, how should GPUs be distributed? • How does each model’s performance scale? • Local performance vs global throughput
  • 8. Critical Challenges - Scheduling • Scheduling requires both local & global understanding • What’s the estimated runtime of each job? • How can I most effectively utilize my GPUs to minimize makespan?
  • 9. Unification Parallelism Conclusion: We have to join these problems! GPU Apportioning Scheduling
  • 10. Saturn - A New Way (7mins)
  • 11. SPASE: A New Optimization Problem • Select Parallelism Pipeline Parallel or Data Parallel? • Allocate resources How Many GPUs per Job? • SchedulE jobs A before B, or B before A? Given a Multi-Job of Large Models, we have to….
  • 12. Saturn - A SPASE System 1. Library 2. Profiler 3. Joint Optimizer 4. Executor User Parallelism Registration Job Submission
  • 13. Saturn - A SPASE System Library: register & retrieve parallelism techniques Already supports popular techniques such as pipelining, DDP, FSDP, and more!
  • 14. Saturn - A SPASE System Profiler: performance estimates for each model under each parallelism & possible apportionment
  • 15. Saturn - A SPASE System Introspective Solver: MILP-solving tool to produce parallelisms, apportionments, & start times for each model Pro fi ler Results Hardware Information Parallelism Selection per Model GPU Allocation Per Model Start Time Per Model
  • 16. Evaluations - Background • GPT Fine-Tuning hyperparameter selection • 12 6B parameter models • WikiText data • Different learning rates, batch sizes • Vision Transformer • Neural Architecture Evaluation • ImageNet • 12 500M - 2B parameter models • 8-GPU A100 nodes
  • 17. Evaluations: Single-Node, 8-GPU Baseline: 8-GPUs per model, run in sequence Standard Practice 30.6 hours Standard Practice 19.05 hours ViT GPT Saturn Saturn 17.4 hours 10.75 hours 1.76X Speedup! 1.77X Speedup!
  • 18. Evaluations: Two-Node, 16-GPU Standard Practice 14.57 hours 10.15 hours ViT GPT Saturn Saturn 8.23 hours 5.17 hours 1.77X Speedup! 1.96X Speedup! Baseline: 8-GPUs per model, run in sequence Standard Practice
  • 20. Conclusion • Modern DL Scale challenges motivate automated, easy-to-use, and resource-efficient training systems • We should consider DL efficiency holistically • Saturn, the first work to tackle this new joint problem of Parallelism, Allocation, and Scheduling demonstrates 40-50% runtime reductions