SlideShare ist ein Scribd-Unternehmen logo
1 von 38
Downloaden Sie, um offline zu lesen
ML at Facebook:
An Infrastructure View
Yangqing Jia
Director, Facebook AI Infra
(* This is not
Facebook AI Infra)
The Machine Learning Moore’s Law?
0
500
1000
1500
2000
2500
2001 2003 2005 2007 2009 2011 2013 2015 2017
NumberofCitations
Machine Learning Execution Flow
Data Features Training Evaluation Inference
Offline Online
Machine Learning Execution Flow
Data Features Training Eval Inference PredictionsModel
It’s an infrastructure challenge
Data
Offline Training Online Inference
Storage
Challenges!
Network
Challenges!
Compute
Challenges!
Let’s Answer Some Pressing Questions
• Does Facebook leverage machine learning?
• Does Facebook design hardware?
• Does Facebook design hardware for machine learning?
• What platforms and frameworks exist; can the community use them?
• What assumptions break when supporting 2B people?
Does Facebook Use Machine Learning?
News Feed Ads Search
Language Translation
Sigma Facer Lumos
Speech Recognition
Content
Understanding
Classification
& Ranking
Services
What ML Models Do We Leverage?
GBDTSVM MLP CNN RNN
Support
Vector
Machines
Gradient-
Boosted
Decision Trees
Multi-Layer
Perceptron
Convolutional
Neural Nets
Recurrent
Neural Nets
Facer Sigma News Feed
Ads
Search
Sigma
Facer
Lumos
Language
Translation
Content
Understanding
Speech Rec
How Often Do We Train Models?
minutes hours days months
How Long Does Training Take?
seconds minutes hours days
How Much Compute Does Inference Consume?
100X 10x 1x
Does Facebook Design Hardware?
• Yes! Since 2010! All designs released through open compute!
• Facebook Server Design Philosophy
• Identify a small number of major services with unique resource requirements
• Design servers for those major services
One major
server design
versus
A B
D
C
Customized, dedicated hardwareGlobal shared pool
Does Facebook Design Hardware?
Yosemite/Twin Lakes:
For the web tier and
other “stateless services”
Open Compute “Sleds”
are 2U x 3 Across in an
Open Compute Rack
Server Card
Chassis
4-way
Shared NIC
SSD Boot
Rack
Does Facebook Design Hardware?
Tioga Pass:
For compute or memory-
intensive workloads:
Bryce Canyon:
For storage-heavy workloads:
Does Facebook Design Hardware for AI/ML?
• HP SL270s (2013): learning serviceability, thermal, perf, reliability, cluster mgmt.
• Big Sur (M40) -> Big Basin (P100) -> Big Basin Volta (V100)
Big Sur
Integrated Compute
8 Nvidia M40 GPUs
Big Basin
JBOG Design (CPU headnode)
8 Nvidia P100 / V100 GPUs
Putting it Together
Data Features Training Evaluation Inference
Bryce Canyon Big Basin, Tioga Pass
Tioga Pass
Twin Lakes
Let’s Answer Some Pressing Questions
• Does Facebook leverage machine learning?
• Does Facebook design hardware?
• Does Facebook design hardware for machine learning?
• What platforms and frameworks exist; can the community use them?
• What assumptions break when supporting 2B people?
Facebook AI Frameworks
Infra Efficiency for Production
• Stability
• Scale & Speed
• Data Integration
• Relatively Fixed
Developer Efficiency for Research
• Flexible
• Fast Iteration
• Highly Debuggable
• Less Robust
Facebook AI Frameworks
Infra Efficiency for Production
• Stability
• Scale & Speed
• Data Integration
• Relatively Fixed
Developer Efficiency for Research
• Flexible
• Fast Iteration
• Highly Debuggable
• Less Robust
OpenGL ESNNPACK
Metal™/
MPSCNN
Qualcomm
Snapdragon
NPE
CUDA/cuDNN
Deep Learning Frameworks
Framework
backends
Vendor and numeric libraries
Apple CoreML Nvidia TensorRT
Intel/Nervana
ngraph
Qualcom
SNPE
…
O(n2) pairs
Shared model and operator representation
Open Neural Network Exchange
Framework
backends
Vendor and numeric libraries
Apple CoreML Nvidia TensorRT
Intel/Nervana
ngraph
Qualcom
SNPE
…
From O(n2) to O(n) pairs
Putting it Together
Data Features Training Evaluation Inference
Bryce Canyon Big Basin, Tioga Pass
Tioga Pass
Twin Lakes
Data APIs Caffe2, PyTorch, ONNX C2 Predictor
Facebook AI Ecosystem
Frameworks: Core ML Software
Caffe2 / PyTorch / ONNX
Platforms: Workflow Management, Deployment
FB Learner
Infrastructure: Servers, Storage, Network Strategy
Open Compute Project
FB Learner Platform
FB Learner
Feature Store
FB Learner
Flow
FB Learner
Predictor
• AI Workflow
• Model Management and Deployment
FBLearner in ML
Data Features Training Eval Inference PredictionsModel
FBLearner
Flow
FBLearner
Predictor
FBLearner
Feature
Store
CPU+GPUCPU CPU
Putting it All Together
Data Features Training Evaluation Inference
Bryce Canyon Big Basin, Tioga Pass
Tioga Pass
Twin Lakes
Feature Store FBLearner Flow FBL Predictor
Data APIs Caffe2, PyTorch, ONNX C2 Predictor
2 Billion People
What changes when you scale to over
Scaling Challenges / Opportunities
Lots of Data Lots of Compute
Scaling Opportunity: Free Compute!
Santa Clara, California
Ashburn, Virginia
Prineville, Oregon
Forest City, North Carolina
Lulea, Sweden
Altoona, Iowa
Fort Worth, Texas
Clonee, Ireland
Los Lunas, New Mexico
Odense, Denmark
New Albany, Ohio
Papillion, Nebraska
Key Takeaways
Facebook AI
Lots of
Data
Wide variety
of models
Full stack
challenges Global scale
Kim Hazelwood Sarah Bird David Brooks Soumith Chintala Utku Diril Dmytro Dzhulgakov
Mohamed Fawzy Bill Jia Yangqing Jia Aditya Kalro James Law
Kevin Lee Jason Lu Pieter Noordhuis Misha Smelyanskiy Liang Xiong Xiaodong Wang
Facebook ML Infrastructure - 2018 slides

Weitere ähnliche Inhalte

Was ist angesagt?

Apache Hadoop YARN: best practices
Apache Hadoop YARN: best practicesApache Hadoop YARN: best practices
Apache Hadoop YARN: best practices
DataWorks Summit
 

Was ist angesagt? (20)

Apache Kafka Best Practices
Apache Kafka Best PracticesApache Kafka Best Practices
Apache Kafka Best Practices
 
End to end Machine Learning using Kubeflow - Build, Train, Deploy and Manage
End to end Machine Learning using Kubeflow - Build, Train, Deploy and ManageEnd to end Machine Learning using Kubeflow - Build, Train, Deploy and Manage
End to end Machine Learning using Kubeflow - Build, Train, Deploy and Manage
 
Kafka Summit NYC 2017 - Singe Message Transforms are not the Transformations ...
Kafka Summit NYC 2017 - Singe Message Transforms are not the Transformations ...Kafka Summit NYC 2017 - Singe Message Transforms are not the Transformations ...
Kafka Summit NYC 2017 - Singe Message Transforms are not the Transformations ...
 
Application Timeline Server - Past, Present and Future
Application Timeline Server - Past, Present and FutureApplication Timeline Server - Past, Present and Future
Application Timeline Server - Past, Present and Future
 
Apache Hadoop YARN: best practices
Apache Hadoop YARN: best practicesApache Hadoop YARN: best practices
Apache Hadoop YARN: best practices
 
facebook architecture for 600M users
facebook architecture for 600M usersfacebook architecture for 600M users
facebook architecture for 600M users
 
When Apache Spark Meets TiDB with Xiaoyu Ma
When Apache Spark Meets TiDB with Xiaoyu MaWhen Apache Spark Meets TiDB with Xiaoyu Ma
When Apache Spark Meets TiDB with Xiaoyu Ma
 
Apache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the CoversApache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the Covers
 
MySQL InnoDB Cluster 소개
MySQL InnoDB Cluster 소개MySQL InnoDB Cluster 소개
MySQL InnoDB Cluster 소개
 
How Uber scaled its Real Time Infrastructure to Trillion events per day
How Uber scaled its Real Time Infrastructure to Trillion events per dayHow Uber scaled its Real Time Infrastructure to Trillion events per day
How Uber scaled its Real Time Infrastructure to Trillion events per day
 
Evening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in FlinkEvening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in Flink
 
Near Real-Time Netflix Recommendations Using Apache Spark Streaming with Nit...
 Near Real-Time Netflix Recommendations Using Apache Spark Streaming with Nit... Near Real-Time Netflix Recommendations Using Apache Spark Streaming with Nit...
Near Real-Time Netflix Recommendations Using Apache Spark Streaming with Nit...
 
Presto overview
Presto overviewPresto overview
Presto overview
 
Fundamentals of Stream Processing with Apache Beam, Tyler Akidau, Frances Perry
Fundamentals of Stream Processing with Apache Beam, Tyler Akidau, Frances Perry Fundamentals of Stream Processing with Apache Beam, Tyler Akidau, Frances Perry
Fundamentals of Stream Processing with Apache Beam, Tyler Akidau, Frances Perry
 
The Impala Cookbook
The Impala CookbookThe Impala Cookbook
The Impala Cookbook
 
Nifi workshop
Nifi workshopNifi workshop
Nifi workshop
 
Introduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processingIntroduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processing
 
High-Performance Hibernate Devoxx France 2016
High-Performance Hibernate Devoxx France 2016High-Performance Hibernate Devoxx France 2016
High-Performance Hibernate Devoxx France 2016
 
Bizweb Microservices Architecture
Bizweb Microservices ArchitectureBizweb Microservices Architecture
Bizweb Microservices Architecture
 
State of the Trino Project
State of the Trino ProjectState of the Trino Project
State of the Trino Project
 

Ähnlich wie Facebook ML Infrastructure - 2018 slides

2018 .NET Conf - 利用Machine Learning .NET整合機器學習至應用程式
2018 .NET Conf - 利用Machine Learning .NET整合機器學習至應用程式2018 .NET Conf - 利用Machine Learning .NET整合機器學習至應用程式
2018 .NET Conf - 利用Machine Learning .NET整合機器學習至應用程式
Alan Tsai
 
Start Getting Your Feet Wet in Open Source Machine and Deep Learning
Start Getting Your Feet Wet in Open Source Machine and Deep Learning Start Getting Your Feet Wet in Open Source Machine and Deep Learning
Start Getting Your Feet Wet in Open Source Machine and Deep Learning
Ian Gomez
 

Ähnlich wie Facebook ML Infrastructure - 2018 slides (20)

2018 .NET Conf - 利用Machine Learning .NET整合機器學習至應用程式
2018 .NET Conf - 利用Machine Learning .NET整合機器學習至應用程式2018 .NET Conf - 利用Machine Learning .NET整合機器學習至應用程式
2018 .NET Conf - 利用Machine Learning .NET整合機器學習至應用程式
 
Start Getting Your Feet Wet in Open Source Machine and Deep Learning
Start Getting Your Feet Wet in Open Source Machine and Deep Learning Start Getting Your Feet Wet in Open Source Machine and Deep Learning
Start Getting Your Feet Wet in Open Source Machine and Deep Learning
 
Tour de France Azure PaaS 6/7 Ajouter de l'intelligence
Tour de France Azure PaaS 6/7 Ajouter de l'intelligenceTour de France Azure PaaS 6/7 Ajouter de l'intelligence
Tour de France Azure PaaS 6/7 Ajouter de l'intelligence
 
Data Agility—A Journey to Advanced Analytics and Machine Learning at Scale
Data Agility—A Journey to Advanced Analytics and Machine Learning at ScaleData Agility—A Journey to Advanced Analytics and Machine Learning at Scale
Data Agility—A Journey to Advanced Analytics and Machine Learning at Scale
 
Introduction to ML.NET
Introduction to ML.NETIntroduction to ML.NET
Introduction to ML.NET
 
Technology and AI sharing - From 2016 to Y2017 and Beyond
Technology and AI sharing - From 2016 to Y2017 and BeyondTechnology and AI sharing - From 2016 to Y2017 and Beyond
Technology and AI sharing - From 2016 to Y2017 and Beyond
 
Paige Roberts: Shortcut MLOps with In-Database Machine Learning
Paige Roberts: Shortcut MLOps with In-Database Machine LearningPaige Roberts: Shortcut MLOps with In-Database Machine Learning
Paige Roberts: Shortcut MLOps with In-Database Machine Learning
 
Session 1 - The Current Landscape of Big Data Benchmarks
Session 1 - The Current Landscape of Big Data BenchmarksSession 1 - The Current Landscape of Big Data Benchmarks
Session 1 - The Current Landscape of Big Data Benchmarks
 
Microsoft power platform
Microsoft power platformMicrosoft power platform
Microsoft power platform
 
Architecting an Open Source AI Platform 2018 edition
Architecting an Open Source AI Platform   2018 editionArchitecting an Open Source AI Platform   2018 edition
Architecting an Open Source AI Platform 2018 edition
 
Introduction to PowerAI - The Enterprise AI Platform
Introduction to PowerAI - The Enterprise AI PlatformIntroduction to PowerAI - The Enterprise AI Platform
Introduction to PowerAI - The Enterprise AI Platform
 
GDG Cloud Southlake #16: Priyanka Vergadia: Scalable Data Analytics in Google...
GDG Cloud Southlake #16: Priyanka Vergadia: Scalable Data Analytics in Google...GDG Cloud Southlake #16: Priyanka Vergadia: Scalable Data Analytics in Google...
GDG Cloud Southlake #16: Priyanka Vergadia: Scalable Data Analytics in Google...
 
Maintainable Machine Learning Products
Maintainable Machine Learning ProductsMaintainable Machine Learning Products
Maintainable Machine Learning Products
 
Microsoft AI Platform Overview
Microsoft AI Platform OverviewMicrosoft AI Platform Overview
Microsoft AI Platform Overview
 
A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, and Deep Learnin...
A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, and Deep Learnin...A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, and Deep Learnin...
A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, and Deep Learnin...
 
Data Con LA 2018 - A Tale of DL Frameworks: TensorFlow, Keras, & Deep Learnin...
Data Con LA 2018 - A Tale of DL Frameworks: TensorFlow, Keras, & Deep Learnin...Data Con LA 2018 - A Tale of DL Frameworks: TensorFlow, Keras, & Deep Learnin...
Data Con LA 2018 - A Tale of DL Frameworks: TensorFlow, Keras, & Deep Learnin...
 
A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & Deep Learning ...
A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & Deep Learning ...A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & Deep Learning ...
A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & Deep Learning ...
 
Considerations for Abstracting Complexities of a Real-Time ML Platform, Zhenz...
Considerations for Abstracting Complexities of a Real-Time ML Platform, Zhenz...Considerations for Abstracting Complexities of a Real-Time ML Platform, Zhenz...
Considerations for Abstracting Complexities of a Real-Time ML Platform, Zhenz...
 
AzureML Welcome to the future of Predictive Analytics
AzureML Welcome to the future of Predictive Analytics AzureML Welcome to the future of Predictive Analytics
AzureML Welcome to the future of Predictive Analytics
 
Cookbook for Building An App
Cookbook for Building An AppCookbook for Building An App
Cookbook for Building An App
 

Mehr von Karthik Murugesan

BIng NLP Expert - Dl summer-school-2017.-jianfeng-gao.v2
BIng NLP Expert - Dl summer-school-2017.-jianfeng-gao.v2BIng NLP Expert - Dl summer-school-2017.-jianfeng-gao.v2
BIng NLP Expert - Dl summer-school-2017.-jianfeng-gao.v2
Karthik Murugesan
 

Mehr von Karthik Murugesan (20)

Rakuten - Recommendation Platform
Rakuten - Recommendation PlatformRakuten - Recommendation Platform
Rakuten - Recommendation Platform
 
Yahoo's Knowledge Graph - 2014 slides
Yahoo's Knowledge Graph - 2014 slidesYahoo's Knowledge Graph - 2014 slides
Yahoo's Knowledge Graph - 2014 slides
 
Free servers to build Big Data Systems on: Bing's Approach
Free servers to build Big Data Systems on: Bing's  Approach Free servers to build Big Data Systems on: Bing's  Approach
Free servers to build Big Data Systems on: Bing's Approach
 
Microsoft cosmos
Microsoft cosmosMicrosoft cosmos
Microsoft cosmos
 
Microsoft AI Platform - AETHER Introduction
Microsoft AI Platform - AETHER IntroductionMicrosoft AI Platform - AETHER Introduction
Microsoft AI Platform - AETHER Introduction
 
BIng NLP Expert - Dl summer-school-2017.-jianfeng-gao.v2
BIng NLP Expert - Dl summer-school-2017.-jianfeng-gao.v2BIng NLP Expert - Dl summer-school-2017.-jianfeng-gao.v2
BIng NLP Expert - Dl summer-school-2017.-jianfeng-gao.v2
 
Lyft data Platform - 2019 slides
Lyft data Platform - 2019 slidesLyft data Platform - 2019 slides
Lyft data Platform - 2019 slides
 
The Evolution of Spotify Home Architecture - Qcon 2019
The Evolution of Spotify Home Architecture - Qcon 2019The Evolution of Spotify Home Architecture - Qcon 2019
The Evolution of Spotify Home Architecture - Qcon 2019
 
Unifying Twitter around a single ML platform - Twitter AI Platform 2019
Unifying Twitter around a single ML platform  - Twitter AI Platform 2019Unifying Twitter around a single ML platform  - Twitter AI Platform 2019
Unifying Twitter around a single ML platform - Twitter AI Platform 2019
 
The magic behind your Lyft ride prices: A case study on machine learning and ...
The magic behind your Lyft ride prices: A case study on machine learning and ...The magic behind your Lyft ride prices: A case study on machine learning and ...
The magic behind your Lyft ride prices: A case study on machine learning and ...
 
The journey toward a self-service data platform at Netflix - sf 2019
The journey toward a self-service data platform at Netflix - sf 2019The journey toward a self-service data platform at Netflix - sf 2019
The journey toward a self-service data platform at Netflix - sf 2019
 
2019 Slides - Michelangelo Palette: A Feature Engineering Platform at Uber
2019 Slides - Michelangelo Palette: A Feature Engineering Platform at Uber2019 Slides - Michelangelo Palette: A Feature Engineering Platform at Uber
2019 Slides - Michelangelo Palette: A Feature Engineering Platform at Uber
 
Developing a ML model using TF Estimator
Developing a ML model using TF EstimatorDeveloping a ML model using TF Estimator
Developing a ML model using TF Estimator
 
Production Model Deployment - StitchFix - 2018
Production Model Deployment - StitchFix - 2018Production Model Deployment - StitchFix - 2018
Production Model Deployment - StitchFix - 2018
 
Netflix factstore for recommendations - 2018
Netflix factstore  for recommendations - 2018Netflix factstore  for recommendations - 2018
Netflix factstore for recommendations - 2018
 
Trends in Music Recommendations 2018
Trends in Music Recommendations 2018Trends in Music Recommendations 2018
Trends in Music Recommendations 2018
 
Netflix Ads Personalization Solution - 2017
Netflix Ads Personalization Solution - 2017Netflix Ads Personalization Solution - 2017
Netflix Ads Personalization Solution - 2017
 
State Of AI 2018
State Of AI 2018State Of AI 2018
State Of AI 2018
 
Spotify Machine Learning Solution for Music Discovery
Spotify Machine Learning Solution for Music DiscoverySpotify Machine Learning Solution for Music Discovery
Spotify Machine Learning Solution for Music Discovery
 
AirBNB - Zipline: Airbnb’s Machine Learning Data Management Platform
AirBNB - Zipline: Airbnb’s Machine Learning Data Management Platform AirBNB - Zipline: Airbnb’s Machine Learning Data Management Platform
AirBNB - Zipline: Airbnb’s Machine Learning Data Management Platform
 

Kürzlich hochgeladen

IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
Enterprise Knowledge
 

Kürzlich hochgeladen (20)

08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 

Facebook ML Infrastructure - 2018 slides

  • 1. ML at Facebook: An Infrastructure View Yangqing Jia Director, Facebook AI Infra
  • 2. (* This is not Facebook AI Infra)
  • 3.
  • 4. The Machine Learning Moore’s Law? 0 500 1000 1500 2000 2500 2001 2003 2005 2007 2009 2011 2013 2015 2017 NumberofCitations
  • 5. Machine Learning Execution Flow Data Features Training Evaluation Inference Offline Online
  • 6. Machine Learning Execution Flow Data Features Training Eval Inference PredictionsModel
  • 7. It’s an infrastructure challenge Data Offline Training Online Inference Storage Challenges! Network Challenges! Compute Challenges!
  • 8. Let’s Answer Some Pressing Questions • Does Facebook leverage machine learning? • Does Facebook design hardware? • Does Facebook design hardware for machine learning? • What platforms and frameworks exist; can the community use them? • What assumptions break when supporting 2B people?
  • 9. Does Facebook Use Machine Learning? News Feed Ads Search Language Translation Sigma Facer Lumos Speech Recognition Content Understanding Classification & Ranking Services
  • 10. What ML Models Do We Leverage? GBDTSVM MLP CNN RNN Support Vector Machines Gradient- Boosted Decision Trees Multi-Layer Perceptron Convolutional Neural Nets Recurrent Neural Nets Facer Sigma News Feed Ads Search Sigma Facer Lumos Language Translation Content Understanding Speech Rec
  • 11. How Often Do We Train Models? minutes hours days months
  • 12. How Long Does Training Take? seconds minutes hours days
  • 13. How Much Compute Does Inference Consume? 100X 10x 1x
  • 14. Does Facebook Design Hardware? • Yes! Since 2010! All designs released through open compute! • Facebook Server Design Philosophy • Identify a small number of major services with unique resource requirements • Design servers for those major services One major server design versus A B D C Customized, dedicated hardwareGlobal shared pool
  • 15. Does Facebook Design Hardware? Yosemite/Twin Lakes: For the web tier and other “stateless services” Open Compute “Sleds” are 2U x 3 Across in an Open Compute Rack Server Card Chassis 4-way Shared NIC SSD Boot Rack
  • 16. Does Facebook Design Hardware? Tioga Pass: For compute or memory- intensive workloads: Bryce Canyon: For storage-heavy workloads:
  • 17. Does Facebook Design Hardware for AI/ML? • HP SL270s (2013): learning serviceability, thermal, perf, reliability, cluster mgmt. • Big Sur (M40) -> Big Basin (P100) -> Big Basin Volta (V100) Big Sur Integrated Compute 8 Nvidia M40 GPUs Big Basin JBOG Design (CPU headnode) 8 Nvidia P100 / V100 GPUs
  • 18. Putting it Together Data Features Training Evaluation Inference Bryce Canyon Big Basin, Tioga Pass Tioga Pass Twin Lakes
  • 19. Let’s Answer Some Pressing Questions • Does Facebook leverage machine learning? • Does Facebook design hardware? • Does Facebook design hardware for machine learning? • What platforms and frameworks exist; can the community use them? • What assumptions break when supporting 2B people?
  • 20. Facebook AI Frameworks Infra Efficiency for Production • Stability • Scale & Speed • Data Integration • Relatively Fixed Developer Efficiency for Research • Flexible • Fast Iteration • Highly Debuggable • Less Robust
  • 21. Facebook AI Frameworks Infra Efficiency for Production • Stability • Scale & Speed • Data Integration • Relatively Fixed Developer Efficiency for Research • Flexible • Fast Iteration • Highly Debuggable • Less Robust OpenGL ESNNPACK Metal™/ MPSCNN Qualcomm Snapdragon NPE CUDA/cuDNN
  • 22.
  • 23.
  • 24.
  • 25. Deep Learning Frameworks Framework backends Vendor and numeric libraries Apple CoreML Nvidia TensorRT Intel/Nervana ngraph Qualcom SNPE … O(n2) pairs
  • 26. Shared model and operator representation Open Neural Network Exchange Framework backends Vendor and numeric libraries Apple CoreML Nvidia TensorRT Intel/Nervana ngraph Qualcom SNPE … From O(n2) to O(n) pairs
  • 27. Putting it Together Data Features Training Evaluation Inference Bryce Canyon Big Basin, Tioga Pass Tioga Pass Twin Lakes Data APIs Caffe2, PyTorch, ONNX C2 Predictor
  • 28. Facebook AI Ecosystem Frameworks: Core ML Software Caffe2 / PyTorch / ONNX Platforms: Workflow Management, Deployment FB Learner Infrastructure: Servers, Storage, Network Strategy Open Compute Project
  • 29. FB Learner Platform FB Learner Feature Store FB Learner Flow FB Learner Predictor • AI Workflow • Model Management and Deployment
  • 30. FBLearner in ML Data Features Training Eval Inference PredictionsModel FBLearner Flow FBLearner Predictor FBLearner Feature Store CPU+GPUCPU CPU
  • 31. Putting it All Together Data Features Training Evaluation Inference Bryce Canyon Big Basin, Tioga Pass Tioga Pass Twin Lakes Feature Store FBLearner Flow FBL Predictor Data APIs Caffe2, PyTorch, ONNX C2 Predictor
  • 32. 2 Billion People What changes when you scale to over
  • 33. Scaling Challenges / Opportunities Lots of Data Lots of Compute
  • 35. Santa Clara, California Ashburn, Virginia Prineville, Oregon Forest City, North Carolina Lulea, Sweden Altoona, Iowa Fort Worth, Texas Clonee, Ireland Los Lunas, New Mexico Odense, Denmark New Albany, Ohio Papillion, Nebraska
  • 36. Key Takeaways Facebook AI Lots of Data Wide variety of models Full stack challenges Global scale
  • 37. Kim Hazelwood Sarah Bird David Brooks Soumith Chintala Utku Diril Dmytro Dzhulgakov Mohamed Fawzy Bill Jia Yangqing Jia Aditya Kalro James Law Kevin Lee Jason Lu Pieter Noordhuis Misha Smelyanskiy Liang Xiong Xiaodong Wang