Kaarthik Sivashanmugam, Wee Hyong Tok
Microsoft
Infrastructure for Deep Learning
in Apache Spark
#UnifiedAnalytics #SparkAISummit
Agenda
• Evolution of data infrastructure
• ML workflow: Data prep & DNN training
• Intro to deep learning and computing needs
• Distributed deep learning and challenges
• Unified platform using Spark
– Infra considerations, challenges
• ML Pipelines
Organization’s data: video feeds, call log data, web logs, products, images, …
→ Database / Data Warehouse
Machine Learning: Typical E2E Process
Prepare → Experiment → Deploy → …
Orchestrate
+ Machine Learning and Deep Learning workloads
How long does it take to train ResNet-50 on ImageNet?
Before 2017: 14 days on an NVIDIA M40 GPU
Training ResNet-50 on ImageNet
• Apr 2017: 1 hour (Facebook, Caffe2, Tesla P100 x 256)
• Sept 2017: 31 mins (UC Berkeley, TACC, UC Davis, TensorFlow, 1,600 CPUs)
• Nov 2017: 15 mins (Preferred Networks, ChainerMN, Tesla P100 x 1,024)
• July 2018: 6.6 mins (Tencent, TensorFlow, Tesla P40 x 2,048)
• Nov 2018: 2.0 mins (Sony, Neural Network Library (NNL), Tesla V100 x 3,456)
• Apr 2019: 1.2 mins (Fujitsu, MXNet, Tesla V100 x 2,048)
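To put the timeline in perspective, the wall-clock times above can be turned into rough speedup factors against the 14-day baseline. This is a small sketch using only the numbers on the slides; the hardware differs from entry to entry, so these ratios say nothing about per-device efficiency.

```python
# Wall-clock ResNet-50/ImageNet training times as listed on the slides.
# Speedups are relative to the 14-day NVIDIA M40 baseline; hardware differs
# across entries, so this is not a like-for-like efficiency comparison.
BASELINE_MINUTES = 14 * 24 * 60  # 14 days = 20,160 minutes

RESULTS = [
    ("Apr 2017", "Facebook, Caffe2", "Tesla P100 x 256", 60.0),
    ("Sept 2017", "UC Berkeley, TACC, UC Davis", "1,600 CPUs", 31.0),
    ("Nov 2017", "Preferred Networks, ChainerMN", "Tesla P100 x 1,024", 15.0),
    ("July 2018", "Tencent, TensorFlow", "Tesla P40 x 2,048", 6.6),
    ("Nov 2018", "Sony, NNL", "Tesla V100 x 3,456", 2.0),
    ("Apr 2019", "Fujitsu, MXNet", "Tesla V100 x 2,048", 1.2),
]


def speedup(minutes: float) -> float:
    """Return how many times faster a run is than the 14-day baseline."""
    return BASELINE_MINUTES / minutes


if __name__ == "__main__":
    for when, who, hardware, minutes in RESULTS:
        print(f"{when:<10} {who:<30} {hardware:<20} "
              f"{minutes:>5.1f} min  ~{speedup(minutes):,.0f}x")
```

For example, the April 2017 run is 20,160 / 60 = 336 times faster than the baseline, and the April 2019 run roughly 16,800 times faster.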
Considerations for Deep Learning @ Scale
• CPU vs. GPU
• Single vs. multi-GPU
• MPI vs. non-MPI
• InfiniBand vs. Ethernet
Credits: Mathew Salvaris
https://azure.microsoft.com/en-us/blog/gpus-vs-cpus-for-deployment-of-deep-learning-models/
“Things” you need to deal with when training machine learning/deep learning models
• Gather results
• Secure access
• Scale resources
• Schedule jobs
• Dependencies and containers
• Provision VM clusters
• Distribute data
• Handling failures
Machine Learning and Deep Learning
Figures: ML and DL (top figure source not given; bottom figure from NVIDIA)
Lots of ML Frameworks …
TensorFlow, PyTorch, scikit-learn, MXNet, Chainer, Keras
Design Choices for Big Data and Machine Learning/Deep Learning
• Laptop
• Spark + separate infrastructure for ML/DL training/inference
• Cloud
• Spark
Execution Models for Spark and Deep Learning
Spark (Task 1, Task 2, Task 3)
• Independent tasks
• Embarrassingly parallel and massively scalable
Distributed Learning (Data Parallelism, Model Parallelism)
• Non-independent tasks
• Some parallel processing
• Optimizing communication between nodes
Credits – Reynold Xin, Project Hydrogen: State-of-the-Art Deep Learning on Apache Spark
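Why are data-parallel training tasks not independent? A minimal pure-Python sketch of synchronous data-parallel SGD makes the dependency visible: every worker computes a gradient on its own data shard, but no worker can take its next step until all gradients have been averaged. The toy model (a single weight `w` fitting `y = w * x`) and all function names are illustrative assumptions; `allreduce_mean` is a stand-in for the collective a real framework (e.g. via MPI or NCCL) would perform. Model parallelism, which splits the model itself across devices, is not shown.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # one (x, y) training example


def local_gradient(w: float, shard: List[Sample]) -> float:
    """Gradient of the mean squared error mean((w*x - y)**2) w.r.t. w on one shard."""
    return sum(2.0 * (w * x - y) * x for x, y in shard) / len(shard)


def allreduce_mean(values: List[float]) -> float:
    """Stand-in for an allreduce: every replica receives the same averaged value."""
    return sum(values) / len(values)


def data_parallel_sgd(data: List[Sample], num_workers: int,
                      lr: float = 0.01, steps: int = 200, w: float = 0.0) -> float:
    """Synchronous data-parallel SGD for the toy model y = w * x.

    Assumes len(data) >= num_workers so that no shard is empty."""
    # Each worker gets a disjoint shard of the data and a full replica of the model.
    shards = [data[i::num_workers] for i in range(num_workers)]
    for _ in range(steps):
        # On a cluster these gradients are computed concurrently, one per worker.
        grads = [local_gradient(w, shard) for shard in shards]
        # Every replica applies the same averaged gradient, so they stay in sync.
        w -= lr * allreduce_mean(grads)
    return w


if __name__ == "__main__":
    data = [(float(x), 3.0 * x) for x in range(1, 9)]  # y = 3x
    print(data_parallel_sgd(data, num_workers=4))  # converges toward 3.0
```

With equally sized shards, the averaged gradient equals the full-batch gradient, so four workers follow the same trajectory as one; the price is a synchronization point at every step.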
Execution Models for Spark and Deep Learning
Spark (Task 1, Task 2, Task 3)
• Independent tasks
• Embarrassingly parallel and massively scalable
• Re-run crashed task
Distributed Learning (Task 1, Task 2, Task 3)
• Non-independent tasks
• Some parallel processing
• Optimizing communication between nodes
• Re-run all tasks
Credits – Reynold Xin, Project Hydrogen: State-of-the-Art Deep Learning on Apache Spark
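The two failure-handling policies above can be compared with a small simulation, assuming a hypothetical failure model `fails(task_id, attempt)`. It counts task launches: with independent tasks only the crashed task is retried, while with communicating tasks a single crash forces the whole group to restart. This is an illustration of the policies, not Spark's scheduler code.

```python
from typing import Callable

# Hypothetical failure model: fails(task_id, attempt) -> True if that attempt crashes.
FailureModel = Callable[[int, int], bool]


def run_independent(num_tasks: int, fails: FailureModel, max_attempts: int = 4) -> int:
    """Spark-style stage: a crashed task is simply re-run on its own.

    Returns the total number of task launches."""
    launches = 0
    for task in range(num_tasks):
        for attempt in range(max_attempts):
            launches += 1
            if not fails(task, attempt):
                break  # this task succeeded; the others are unaffected
        else:
            raise RuntimeError(f"task {task} failed {max_attempts} times")
    return launches


def run_all_or_nothing(num_tasks: int, fails: FailureModel, max_attempts: int = 4) -> int:
    """Distributed-training-style stage: tasks exchange state every step, so a
    single crash invalidates the attempt and all tasks are re-run together.

    Returns the total number of task launches."""
    launches = 0
    for attempt in range(max_attempts):
        launches += num_tasks  # the whole group is (re)launched
        if not any(fails(task, attempt) for task in range(num_tasks)):
            return launches
    raise RuntimeError(f"job failed {max_attempts} times")


if __name__ == "__main__":
    crash_task_1_once = lambda task, attempt: task == 1 and attempt == 0
    print(run_independent(3, crash_task_1_once))     # 4 launches
    print(run_all_or_nothing(3, crash_task_1_once))  # 6 launches
```

The gap grows with the number of tasks: with N workers, one crash costs one extra launch under the first policy but N extra launches under the second.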
Spark + ML/DL
• www.aka.ms/spark
• Sparkflow
• TensorFlowOnSpark
• Project Hydrogen
• HorovodRunner
Microsoft Machine Learning for Apache Spark v0.16
Microsoft’s Open Source Contributions to Apache Spark
www.aka.ms/spark
Azure/mmlspark
• Cognitive Services
• Spark Serving
• Model Interpretability
• LightGBM Gradient Boosting
• Deep Networks with CNTK
• HTTP on Spark
Demo – Azure Databricks and Deep Learning
Demo – Distributed Deep Learning using TensorFlow with HorovodRunner
Physics of Machine Learning and Deep Learning
What do you need for training / distributed training?
• CPU
• GPU
• Network
• Storage
• Deep Learning Framework
• Memory
GPU Device Interconnect
• NVLink
• GPUDirect P2P
• GPUDirect RDMA
Figure: interconnect topology sample
Credits: CUDA-MPI Blog (https://bit.ly/2KnmN58)
From CUDA to NCCL 1 to NCCL 2
• Multi-core CPU
• GPU: CUDA
• Multi-GPU: NCCL 1
• Multi-GPU, multi-node: NCCL 2
NCCL: Multi-GPU Communication Library
Credits: NCCL Tutorial (https://bit.ly/2KpPP44)
NCCL 2.x (multi-node)
Credits: NCCL Tutorial (https://bit.ly/2KpPP44)
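Libraries such as NCCL provide collectives like allreduce, and a ring-based algorithm is one well-known way to implement it. The sketch below simulates ring allreduce in pure Python under simplifying assumptions (one process, lists instead of GPU buffers, no real network): each of the n workers only talks to its right-hand neighbour, first in a reduce-scatter phase and then in an allgather phase, sending 2(n − 1) chunks of roughly 1/n of the vector each. It illustrates the algorithm; it is not NCCL's implementation.

```python
from typing import List

Vector = List[float]


def split(vec: Vector, n: int) -> List[Vector]:
    """Split vec into n contiguous chunks whose sizes differ by at most one."""
    size, extra = divmod(len(vec), n)
    chunks, start = [], 0
    for i in range(n):
        end = start + size + (1 if i < extra else 0)
        chunks.append(vec[start:end])
        start = end
    return chunks


def ring_allreduce(vectors: List[Vector]) -> List[Vector]:
    """Sum equal-length vectors across n simulated workers arranged in a ring.

    Worker i only ever sends to worker (i + 1) % n, one chunk per step."""
    n = len(vectors)
    chunks = [split(list(v), n) for v in vectors]

    # Phase 1, reduce-scatter: after n - 1 steps worker i holds the full sum
    # of chunk (i + 1) % n.
    for step in range(n - 1):
        # Snapshot outgoing chunks first: sends within one step happen simultaneously.
        sends = [(i, (i - step) % n, list(chunks[i][(i - step) % n])) for i in range(n)]
        for src, c, payload in sends:
            dst = (src + 1) % n
            chunks[dst][c] = [a + b for a, b in zip(chunks[dst][c], payload)]

    # Phase 2, allgather: circulate the completed chunks so every worker has all of them.
    for step in range(n - 1):
        sends = [(i, (i + 1 - step) % n, list(chunks[i][(i + 1 - step) % n])) for i in range(n)]
        for src, c, payload in sends:
            chunks[(src + 1) % n][c] = payload

    return [[x for chunk in worker for x in chunk] for worker in chunks]


if __name__ == "__main__":
    grads = [[1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0], [100.0, 200.0, 300.0, 400.0]]
    for worker, result in enumerate(ring_allreduce(grads)):
        print(worker, result)  # every worker ends with [111.0, 222.0, 333.0, 444.0]
```

Because each worker's traffic stays roughly constant as n grows, the ring pattern scales well on the interconnects discussed above; dividing the result by n turns the sum into the gradient average used in data-parallel training.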
Spark & GPU
• Options for using GPUs with Spark:
1. Native support (cluster manager, GPU tasks): SPARK-24615
2. Use cores/memory as a proxy for GPU resources and allow GPU-enabled code to execute
3. Code implementation/generation for GPU offload
• Considerations
– Flexibility
– Data management
– Multi-GPU execution
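Option 2 can be sketched as follows, assuming each executor is configured (e.g. via `spark.executor.cores`) with as many task slots as the node has GPUs, and that each task derives a GPU index from its partition id (inside a Spark task, `pyspark.TaskContext.get().partitionId()` returns it). The helper names and the modulo mapping are hypothetical illustrations, not a documented Spark mechanism. The `collisions` helper shows the weakness of this proxy approach: Spark does not promise which partitions run concurrently on one executor, so two tasks can map to the same GPU.

```python
import os
from collections import defaultdict
from typing import Dict, List, MutableMapping, Optional


def gpu_for_task(partition_id: int, gpus_per_node: int) -> int:
    """Hypothetical mapping: pick a GPU index from the task's partition id."""
    return partition_id % gpus_per_node


def pin_gpu(partition_id: int, gpus_per_node: int,
            env: Optional[MutableMapping[str, str]] = None) -> int:
    """Restrict this process to one GPU; must run before the DL framework initializes CUDA."""
    env = os.environ if env is None else env
    gpu = gpu_for_task(partition_id, gpus_per_node)
    env["CUDA_VISIBLE_DEVICES"] = str(gpu)
    return gpu


def collisions(partition_ids_on_executor: List[int],
               gpus_per_node: int) -> Dict[int, List[int]]:
    """GPUs that would be shared by tasks running concurrently on one executor."""
    by_gpu: Dict[int, List[int]] = defaultdict(list)
    for pid in partition_ids_on_executor:
        by_gpu[gpu_for_task(pid, gpus_per_node)].append(pid)
    return {gpu: pids for gpu, pids in by_gpu.items() if len(pids) > 1}


if __name__ == "__main__":
    print(pin_gpu(5, 4, env={}))          # partition 5 on a 4-GPU node -> GPU 1
    print(collisions([0, 4, 2, 3], 4))    # partitions 0 and 4 would share GPU 0
```

Native GPU-aware scheduling (option 1) avoids this guesswork by letting the scheduler hand each task an explicit GPU assignment.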
Infrastructure Considerations
• Data format, storage and reuse
– Co-locate Data Engineering storage infrastructure (cluster-local)
– DL Framework support for HDFS (reading from HDFS does not mean data-locality-aware computation)
– Sharing data between Spark and Deep Learning (HDFS, Spark-TF connector, Parquet/Petastorm)
• Job execution
– Gang scheduling – Refer to SPARK-24374
– Support for GPU (and other accelerators) – Refer to SPARK-24615
– Cluster sharing with other types of jobs (CPU-only cluster vs. CPU+GPU cluster)
– Quota management
– Support for Docker containers
– MPI vs. non-MPI
– Different GPU generations
• Node, GPU connectivity
– InfiniBand, RDMA
– GPU Interconnect options
– Interconnect-aware scheduling, minimize distribution, repacking
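Gang scheduling (SPARK-24374, which led to the barrier execution mode exposed through `RDD.barrier()` and `BarrierTaskContext` in Spark 2.4) means launching all tasks of a stage together or not at all. The sketch below contrasts that all-or-nothing rule with regular greedy placement, using a hypothetical `free_slots` map from node name to available task slots. It is a simplified model of the scheduling decision, not Spark's implementation.

```python
from typing import Dict, List, Optional


def schedule_greedy(free_slots: Dict[str, int], num_tasks: int) -> List[str]:
    """Regular Spark-style placement: launch whatever fits now; the rest wait."""
    placement: List[str] = []
    for node, slots in sorted(free_slots.items()):
        take = min(slots, num_tasks - len(placement))
        placement.extend([node] * take)
    return placement


def schedule_gang(free_slots: Dict[str, int], num_tasks: int) -> Optional[List[str]]:
    """Gang (all-or-nothing) placement: start every task together or none at all."""
    if sum(free_slots.values()) < num_tasks:
        # A partial launch would leave the started workers blocked at their
        # first collective (e.g. an allreduce), waiting for peers that never arrive.
        return None
    return schedule_greedy(free_slots, num_tasks)


if __name__ == "__main__":
    free = {"node-a": 2, "node-b": 1}
    print(schedule_greedy(free, 4))  # ['node-a', 'node-a', 'node-b']; 4th task waits
    print(schedule_gang(free, 4))    # None: not enough slots for the whole gang
    print(schedule_gang(free, 3))    # ['node-a', 'node-a', 'node-b']
```

Greedy placement is fine for independent tasks, but for distributed training it would tie up three slots with workers that cannot make progress until the fourth starts.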
ML Pipelines
• Using machine learning pipelines, data scientists, data engineers,
and IT professionals can collaborate on different steps/phases
• Enable use of best tech for different phases in ML/DL workflow
Demo – Azure ML Pipelines & Databricks

Spark Summit 2019: Infrastructure for Deep Learning in Apache Spark (0425)

  • 1.
  • 2. Kaarthik Sivashanmugam, Wee Hyong Tok Microsoft Infrastructure for Deep Learning in Apache Spark #UnifiedAnalytics #SparkAISummit
  • 3. Agenda • Evolution of data infrastructure • ML workflow: Data prep & DNN training • Intro to deep learning and computing needs • Distributed deep learning and challenges • Unified platform using Spark – Infra considerations, challenges • ML Pipelines 3#UnifiedAnalytics #SparkAISummit
  • 4. Video Feeds Call Logs Data Web logs Products Images …… Organization’s Data Database / Data Warehouse Organization’s data
  • 5. Machine Learning: Typical E2E Process … (figure stages: Prepare, Experiment, Deploy, Orchestrate)
  • 6. + Machine Learning and Deep Learning workloads
  • 7. How long does it take to train ResNet-50 on ImageNet? Before 2017: 14 days on an NVIDIA M40 GPU.
  • 8. Training ResNet-50 on ImageNet
      • 2017
        – Apr: 1 hour, Tesla P100 x 256 (Facebook; Caffe2)
        – Sept: 31 mins, 1,600 CPUs (UC Berkeley, TACC, UC Davis; TensorFlow)
        – Nov: 15 mins, Tesla P100 x 1,024 (Preferred Networks; ChainerMN)
      • 2018
        – July: 6.6 mins, Tesla P40 x 2,048 (Tencent; TensorFlow)
        – Nov: 2.0 mins, Tesla V100 x 3,456 (Sony; Neural Network Library, NNL)
      • 2019
        – Apr: 1.2 mins, Tesla V100 x 2,048 (Fujitsu; MXNet)
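Slides 7 and 8 quote wall-clock training times, so dividing the 14-day single-GPU baseline by each result gives the wall-clock speedup over that baseline. A small Python sketch of that arithmetic, using only the numbers on the slides (the table layout and the `speedup` helper are illustrative, not from the talk):

```python
"""Wall-clock speedups for ResNet-50/ImageNet training, using the figures on slides 7-8."""

# Baseline from slide 7: 14 days on a single NVIDIA M40 GPU, before 2017.
BASELINE_MINUTES = 14 * 24 * 60

# (date, team, framework, hardware, training time in minutes), as listed on slide 8.
RESULTS = [
    ("Apr 2017", "Facebook", "Caffe2", "Tesla P100 x 256", 60.0),
    ("Sept 2017", "UC Berkeley, TACC, UC Davis", "TensorFlow", "1,600 CPUs", 31.0),
    ("Nov 2017", "Preferred Networks", "ChainerMN", "Tesla P100 x 1,024", 15.0),
    ("July 2018", "Tencent", "TensorFlow", "Tesla P40 x 2,048", 6.6),
    ("Nov 2018", "Sony", "Neural Network Library (NNL)", "Tesla V100 x 3,456", 2.0),
    ("Apr 2019", "Fujitsu", "MXNet", "Tesla V100 x 2,048", 1.2),
]


def speedup(minutes: float) -> float:
    """Return how many times faster a run is than the 14-day baseline."""
    if minutes <= 0:
        raise ValueError("training time must be positive")
    return BASELINE_MINUTES / minutes


if __name__ == "__main__":
    for date, team, framework, hardware, minutes in RESULTS:
        print(f"{date:<10} {team:<28} {framework:<29} {hardware:<19} "
              f"{minutes:>5} min  {speedup(minutes):>8,.0f}x")
```

By this measure the April 2019 run is about 16,800x faster than the baseline (20,160 / 1.2), but it used 2,048 newer GPUs, so the gain comes from scale and newer hardware together rather than from one faster device.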
  • 9. Considerations for Deep Learning @ Scale
      • CPU vs. GPU
      • Single vs. multi-GPU
      • MPI vs. non-MPI
      • InfiniBand vs. Ethernet
      Credits: Mathew Salvaris, https://azure.microsoft.com/en-us/blog/gpus-vs-cpus-for-deployment-of-deep-learning-models/
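One way to see why the interconnect (InfiniBand vs. Ethernet) matters: in synchronous data-parallel training, each step ends with an all-reduce of the gradients. With a ring all-reduce, a common choice in Horovod and NCCL, each worker sends and receives about 2(N - 1)/N times the gradient size. The sketch below estimates only that bandwidth term. The ResNet-50 parameter count (about 25.6 million FP32 values), the worker count, and the link speeds are assumptions for illustration; latency, protocol overhead, and overlap with computation are ignored.

```python
"""Bandwidth-only estimate of gradient all-reduce time per training step (illustrative)."""

RESNET50_PARAMS = 25_600_000  # approximate ResNet-50 parameter count (assumption)
BYTES_PER_PARAM = 4           # FP32 gradients (assumption)


def ring_allreduce_seconds(num_workers: int, message_bytes: float, link_gbps: float) -> float:
    """Estimate one ring all-reduce over links of link_gbps gigabits per second.

    Each worker sends (and receives) 2 * (N - 1) / N * message_bytes bytes:
    N - 1 reduce-scatter steps plus N - 1 all-gather steps, each moving 1/N
    of the buffer. Latency and compute/communication overlap are ignored.
    """
    if num_workers < 1:
        raise ValueError("need at least one worker")
    if num_workers == 1:
        return 0.0  # nothing to exchange
    bytes_per_worker = 2 * (num_workers - 1) / num_workers * message_bytes
    link_bytes_per_second = link_gbps * 1e9 / 8
    return bytes_per_worker / link_bytes_per_second


if __name__ == "__main__":
    gradient_bytes = RESNET50_PARAMS * BYTES_PER_PARAM
    # Hypothetical links: 10/25 Gb/s Ethernet and 100 Gb/s (EDR) InfiniBand.
    for name, gbps in [("10 GbE", 10), ("25 GbE", 25), ("EDR InfiniBand", 100)]:
        t = ring_allreduce_seconds(8, gradient_bytes, gbps)
        print(f"{name:<15} 8 workers: {t * 1000:6.1f} ms per all-reduce")
```

With 8 workers this comes to roughly 143 ms per step on 10 Gb/s Ethernet versus about 14 ms on 100 Gb/s InfiniBand; whether the difference matters depends on how long each step's computation takes.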
  • 10. “Things” you need to deal with when training machine learning/deep learning models: gathering results, securing access, scaling resources, scheduling jobs, managing dependencies and containers, provisioning VM clusters, distributing data, and handling failures
  • 11. Machine Learning: Typical E2E Process … (figure stages: Prepare, Experiment, Deploy, Orchestrate)
  • 12. Machine Learning and Deep Learning (figure panels: ML, DL). Top figure source; bottom figure from NVIDIA.
  • 13. Lots of ML frameworks: TensorFlow, PyTorch, Scikit-Learn, MXNet, Chainer, Keras, …
  • 14. Design Choices for Big Data and Machine Learning/Deep Learning (figure labels: Laptop; Spark + separate infrastructure for ML/DL training/inference; Cloud; Spark)
  • 15. Execution Models for Spark and Deep Learning
      • Spark (figure: Tasks 1–3): independent tasks; embarrassingly parallel and massively scalable
      • Distributed learning: non-independent tasks; some parallel processing; optimizing communication between nodes
      • Figure labels: Data Parallelism, Model Parallelism
      Credits: Reynold Xin, Project Hydrogen: State-of-the-Art Deep Learning on Apache Spark
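Data parallelism, one of the two strategies labeled on slide 15, keeps a full copy of the model on every worker, gives each worker a different shard of the data, and averages the workers' gradients before each update. A minimal pure-Python sketch, assuming a toy linear model and equal-sized shards; the names are illustrative and not from any framework, and the averaging step stands in for the all-reduce a real cluster would perform:

```python
"""Synchronous data-parallel SGD on a toy linear model y = w * x + b (illustrative)."""
from typing import List, Tuple


def mse_gradient(w: float, b: float, xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Gradient of the mean squared error over one shard."""
    n = len(xs)
    dw = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
    db = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
    return dw, db


def data_parallel_step(w: float, b: float, xs: List[float], ys: List[float],
                       num_workers: int, lr: float) -> Tuple[float, float]:
    """One step: each worker computes a local gradient, then gradients are averaged."""
    if len(xs) % num_workers:
        raise ValueError("sketch assumes the data splits into equal shards")
    size = len(xs) // num_workers
    local_grads = [
        mse_gradient(w, b, xs[k * size:(k + 1) * size], ys[k * size:(k + 1) * size])
        for k in range(num_workers)  # on a cluster these would run concurrently
    ]
    # Averaging plays the role of an all-reduce: every replica gets the same result.
    avg_dw = sum(g[0] for g in local_grads) / num_workers
    avg_db = sum(g[1] for g in local_grads) / num_workers
    return w - lr * avg_dw, b - lr * avg_db


if __name__ == "__main__":
    xs = [float(i) for i in range(8)]
    ys = [3.0 * x + 1.0 for x in xs]
    w, b = 0.0, 0.0
    for _ in range(5000):
        w, b = data_parallel_step(w, b, xs, ys, num_workers=4, lr=0.01)
    print(f"w={w:.3f}, b={b:.3f}")
```

Because the shards are equal in size, the averaged gradient equals the full-batch gradient, so four workers take the same step as one; the cost is that every worker waits for the gradient exchange at every step, which is the inter-node communication the slide refers to.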
  • 16. Execution Models for Spark and Deep Learning
      • Spark (figure: Tasks 1–3): independent tasks; embarrassingly parallel and massively scalable
      • Distributed learning (figure: Tasks 1–3): non-independent tasks; some parallel processing; optimizing communication between nodes
      Credits: Reynold Xin, Project Hydrogen: State-of-the-Art Deep Learning on Apache Spark
  • 17. Execution Models for Spark and Deep Learning
      • Spark (figure: Tasks 1–3): independent tasks; embarrassingly parallel and massively scalable; re-run a crashed task
      • Distributed learning (figure: Tasks 1–3): non-independent tasks; some parallel processing; optimizing communication between nodes; re-run all tasks
      Credits: Reynold Xin, Project Hydrogen: State-of-the-Art Deep Learning on Apache Spark
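Slide 17's contrast between re-running one crashed task and re-running all tasks can be made concrete by counting task launches. The sketch below is a toy model, not Spark's scheduler: `failures` is a hypothetical schedule of which tasks crash on which attempt, `gang=False` models independent Spark tasks, and `gang=True` models gang-scheduled training tasks that must restart together (the gang scheduling that slide 28 links to SPARK-24374).

```python
"""Count task launches for independent vs. gang-scheduled tasks (toy model)."""
from typing import Dict, Set


def run_stage(num_tasks: int, failures: Dict[int, Set[int]], gang: bool,
              max_attempts: int = 10) -> int:
    """Return the total number of task launches needed to finish one stage.

    failures[a] holds the ids of tasks that crash during attempt a.
    gang=False: finished tasks keep their results; only crashed tasks re-run.
    gang=True: tasks depend on each other, so any crash restarts every task.
    """
    all_tasks = set(range(num_tasks))
    pending = set(all_tasks)
    launches = 0
    for attempt in range(max_attempts):
        launches += len(pending)
        crashed = failures.get(attempt, set()) & pending
        if not crashed:
            return launches
        pending = set(all_tasks) if gang else crashed
    raise RuntimeError(f"stage did not finish within {max_attempts} attempts")


if __name__ == "__main__":
    schedule = {0: {2}}  # task 2 crashes on the first attempt
    print("independent:", run_stage(4, schedule, gang=False))
    print("gang:       ", run_stage(4, schedule, gang=True))
```

With the sample schedule, independent tasks need 5 launches and the gang needs 8; the gap grows with the number of tasks in the gang.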
  • 18. Spark + ML/DL: Sparkflow, TensorFlowOnSpark, Project Hydrogen, HorovodRunner (www.aka.ms/spark)
  • 19. Microsoft Machine Learning for Apache Spark (MMLSpark) v0.16: Microsoft’s open-source contributions to Apache Spark (www.aka.ms/spark, Azure/mmlspark)
      • Cognitive Services
      • Spark Serving
      • Model Interpretability
      • LightGBM Gradient Boosting
      • Deep Networks with CNTK
      • HTTP on Spark
  • 20. Demo – Azure Databricks and Deep Learning
  • 21. Demo – Distributed Deep Learning using TensorFlow with HorovodRunner
  • 22. What do you need for training / distributed training? CPU, GPU, network, storage, memory, and a deep learning framework (graphic: Physics of Machine Learning and Deep Learning)
  • 23. GPU Device Interconnect
      • NVLink
      • GPUDirect P2P
      • GPUDirect RDMA
      Figure: interconnect topology sample. Credits: CUDA-MPI Blog (https://bit.ly/2KnmN58)
  • 24. From CUDA to NCCL 1 to NCCL 2 (NCCL: multi-GPU communication library)
      • Multi-core CPU to GPU: CUDA
      • Multi-GPU: NCCL 1
      • Multi-GPU, multi-node: NCCL 2
      Credits: NCCL Tutorial (https://bit.ly/2KpPP44)
  • 25. NCCL 2.x (multi-node) Credits: NCCL Tutorial (https://bit.ly/2KpPP44)
  • 26. NCCL 2.x (multi-node) Credits: NCCL Tutorial (https://bit.ly/2KpPP44)
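NCCL provides collective operations such as all-reduce across GPUs and, with NCCL 2, across nodes; a ring all-reduce is one of the algorithms it uses. The following pure-Python simulation shows the two phases of a ring all-reduce (reduce-scatter, then all-gather) on in-memory lists. It is a teaching sketch with no GPUs or network involved, and the function names are hypothetical.

```python
"""Simulated ring all-reduce over in-memory buffers (teaching sketch, no GPUs involved)."""
from typing import List, Tuple


def chunk_bounds(length: int, num_chunks: int) -> List[Tuple[int, int]]:
    """Split range(length) into num_chunks contiguous [start, end) ranges of near-equal size."""
    base, extra = divmod(length, num_chunks)
    bounds, start = [], 0
    for k in range(num_chunks):
        end = start + base + (1 if k < extra else 0)
        bounds.append((start, end))
        start = end
    return bounds


def ring_allreduce(buffers: List[List[float]]) -> List[List[float]]:
    """Return every worker's buffer after an element-wise sum all-reduce.

    Worker i only ever sends to worker (i + 1) % n, as in a ring.
    """
    n = len(buffers)
    if n == 0 or any(len(buf) != len(buffers[0]) for buf in buffers):
        raise ValueError("need at least one buffer, all of equal length")
    data = [list(buf) for buf in buffers]  # each worker's local copy
    bounds = chunk_bounds(len(data[0]), n)

    def exchange(chunk_for_sender, accumulate: bool) -> None:
        # All workers send at once, so snapshot outgoing chunks before applying them.
        outgoing = []
        for i in range(n):
            start, end = bounds[chunk_for_sender(i)]
            outgoing.append((start, end, data[i][start:end]))
        for i, (start, end, values) in enumerate(outgoing):
            dst = data[(i + 1) % n]
            if accumulate:
                for k, v in enumerate(values):
                    dst[start + k] += v
            else:
                dst[start:end] = values

    # Phase 1, reduce-scatter: afterwards worker i holds the full sum of chunk (i + 1) % n.
    for step in range(n - 1):
        exchange(lambda i: (i - step) % n, accumulate=True)
    # Phase 2, all-gather: circulate the finished chunks so every worker has all of them.
    for step in range(n - 1):
        exchange(lambda i: (i + 1 - step) % n, accumulate=False)
    return data


if __name__ == "__main__":
    workers = [[1.0, 2.0, 3.0, 4.0, 5.0],
               [10.0, 20.0, 30.0, 40.0, 50.0],
               [100.0, 200.0, 300.0, 400.0, 500.0]]
    for rank, result in enumerate(ring_allreduce(workers)):
        print(rank, result)
```

Each worker sends 2(n - 1) chunks, each 1/n of the buffer, so per-worker traffic stays nearly constant as workers are added; this is why the ring pattern scales well when bandwidth, rather than latency, is the bottleneck.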
  • 27. Spark & GPU
      • Options for using GPUs with Spark:
        1. Native support (cluster manager, GPU tasks): SPARK-24615
        2. Use cores/memory as a proxy for GPU resources and allow GPU-enabled code execution
        3. Code implementation/generation for GPU offload
      • Considerations
        – Flexibility
        – Data management
        – Multi-GPU execution
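Option 2 on slide 27 (cores as a proxy for GPUs) can be approximated with standard Spark settings: an executor runs `spark.executor.cores / spark.task.cpus` tasks at once, so with C cores and G GPUs per executor, setting `spark.task.cpus` to C / G allows exactly G concurrent tasks, one per GPU. The helper below is a hypothetical sketch that computes those two settings. In this mode Spark does not know about the GPUs, so binding each task to a specific GPU is left to the task's own code.

```python
"""Derive Spark settings that use CPU cores as a proxy for GPUs (illustrative sketch)."""
from typing import Dict


def gpu_proxy_conf(executor_cores: int, gpus_per_executor: int) -> Dict[str, str]:
    """Return spark.executor.cores / spark.task.cpus so that concurrent tasks == GPUs.

    An executor runs executor_cores // task_cpus tasks at a time, so choosing
    task_cpus = executor_cores / gpus_per_executor gives one running task per GPU.
    """
    if gpus_per_executor < 1 or executor_cores < gpus_per_executor:
        raise ValueError("need at least one GPU and at least one core per GPU")
    if executor_cores % gpus_per_executor:
        raise ValueError("executor cores must divide evenly among the GPUs")
    task_cpus = executor_cores // gpus_per_executor
    return {
        "spark.executor.cores": str(executor_cores),
        "spark.task.cpus": str(task_cpus),
    }


if __name__ == "__main__":
    # Hypothetical executor: 8 cores and 2 GPUs -> 2 concurrent tasks, 4 cores each.
    for key, value in gpu_proxy_conf(8, 2).items():
        print(f"--conf {key}={value}")
```

The printed `--conf` flags can be passed to `spark-submit`. Compared with native GPU scheduling (SPARK-24615), the trade-off is that reserved cores may sit idle while the GPU does the work, and nothing in Spark stops a task from using a GPU it was not meant to.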
  • 28. Infrastructure Considerations
      • Data format, storage and reuse
        – Co-locate data engineering storage infrastructure (cluster-local)
        – DL framework support for HDFS (reading from HDFS does not mean data-locality-aware computation)
        – Sharing data between Spark and deep learning (HDFS, Spark-TF connector, Parquet/Petastorm)
      • Job execution
        – Gang scheduling: refer to SPARK-24374
        – Support for GPUs (and other accelerators): refer to SPARK-24615
        – Cluster sharing with other types of jobs (CPU-only cluster vs. CPU+GPU cluster)
        – Quota management
        – Support for Docker containers
        – MPI vs. non-MPI
        – Different GPU generations
      • Node and GPU connectivity
        – InfiniBand, RDMA
        – GPU interconnect options
        – Interconnect-aware scheduling, minimizing distribution, repacking
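The last group on slide 28 (interconnect-aware scheduling, minimizing distribution, repacking) is about placing a multi-GPU job on as few nodes as possible, since GPUs inside one node usually communicate faster than GPUs across the network. A greedy sketch of that idea with hypothetical node names: prefer the single node that fits the job most tightly, otherwise take nodes with the most free GPUs first. It ignores the topology inside each node and is not the scheduler described in the talk.

```python
"""Greedy GPU placement that keeps a job on as few nodes as possible (illustrative)."""
from typing import Dict, Optional


def place_job(free_gpus: Dict[str, int], requested: int) -> Optional[Dict[str, int]]:
    """Return {node: gpus_assigned}, or None if the cluster lacks free GPUs.

    1. If one node can host the whole job, pick the fitting node with the fewest
       free GPUs (best fit), which leaves larger free blocks for later jobs.
    2. Otherwise fill nodes with the most free GPUs first, which minimizes the
       number of nodes the job spans.
    """
    if requested <= 0:
        raise ValueError("requested GPU count must be positive")
    if sum(free_gpus.values()) < requested:
        return None
    fitting = [node for node, free in free_gpus.items() if free >= requested]
    if fitting:
        best = min(fitting, key=lambda node: (free_gpus[node], node))
        return {best: requested}
    placement: Dict[str, int] = {}
    remaining = requested
    for node in sorted(free_gpus, key=lambda n: (-free_gpus[n], n)):
        if remaining == 0:
            break
        take = min(free_gpus[node], remaining)
        if take:
            placement[node] = take
            remaining -= take
    return placement


if __name__ == "__main__":
    cluster = {"node-a": 4, "node-b": 2, "node-c": 8}  # hypothetical free GPUs per node
    for request in (2, 4, 10, 20):
        print(request, "->", place_job(cluster, request))
```

Repacking, meaning moving running jobs to consolidate free GPUs, would be a separate step on top of a placement rule like this one.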
  • 29. ML Pipelines
      • Using machine learning pipelines, data scientists, data engineers, and IT professionals can collaborate on different steps/phases
      • Enable use of the best technology for each phase of the ML/DL workflow
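A pipeline lets each phase run on the technology that suits it, with explicit dependencies between steps. The sketch below models that with Python's standard `graphlib` module (Python 3.9+). The step names and compute targets are hypothetical, and this is not the Azure ML Pipelines API.

```python
"""Order ML pipeline steps by their dependencies (hypothetical steps and targets)."""
from graphlib import TopologicalSorter
from typing import Dict, List, Tuple

# step name -> (compute target, upstream steps); all names are illustrative.
PIPELINE: Dict[str, Tuple[str, List[str]]] = {
    "ingest_raw_data": ("spark_cluster", []),
    "prepare_features": ("spark_cluster", ["ingest_raw_data"]),
    "train_dnn": ("gpu_cluster", ["prepare_features"]),
    "evaluate_model": ("cpu_cluster", ["train_dnn", "prepare_features"]),
    "deploy_model": ("inference_service", ["evaluate_model"]),
}


def execution_plan(pipeline: Dict[str, Tuple[str, List[str]]]) -> List[Tuple[str, str]]:
    """Return (step, compute target) pairs in an order that respects every dependency."""
    graph = {step: deps for step, (_, deps) in pipeline.items()}
    # Iterating static_order() raises graphlib.CycleError if the steps form a cycle.
    order = TopologicalSorter(graph).static_order()
    return [(step, pipeline[step][0]) for step in order]


if __name__ == "__main__":
    for step, target in execution_plan(PIPELINE):
        print(f"{step:<18} -> {target}")
```

Because `train_dnn` is the only step that needs GPUs, only that step has to run on the GPU cluster; the Spark steps can stay on the existing data engineering cluster.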
  • 30. Demo – Azure ML Pipelines & Databricks
  • 31. What do you need for training / distributed training? CPU, GPU, network, storage, memory, and a deep learning framework (graphic: Physics of Machine Learning and Deep Learning)
  • 32. Kaarthik Sivashanmugam, Wee Hyong Tok (Microsoft): Infrastructure for Deep Learning in Apache Spark
  • 33. DON’T FORGET TO RATE AND REVIEW THE SESSIONS SEARCH SPARK + AI SUMMIT