
Jonas Schneider, Head of Engineering for Robotics, OpenAI

Machine Learning Systems at Scale:
OpenAI is a non-profit research company discovering and enacting the path to safe artificial general intelligence. As part of our work, we regularly push the limits of scalability in cutting-edge ML algorithms. We’ve found that in many cases, designing the systems we build around the core algorithms is as important as designing the algorithms themselves. This means that many systems engineering areas, such as distributed computing, networking, and orchestration, are crucial for machine learning to succeed on large problems requiring thousands of computers. As a result, engineers and researchers at OpenAI work closely together to build these large systems, rather than maintaining a strict researcher/engineer split. In this talk, we will go over some of the lessons we’ve learned, and how they come together in the design and internals of our system for learning-based robotics research.

Bio: Jonas leads technology development for OpenAI’s robotics group, developing methods to apply machine learning and AI to robots. He also helped build the infrastructure to scale OpenAI’s distributed ML systems to thousands of machines.


  1. Machine Learning Systems at Scale MLconf San Francisco Jonas Schneider November 10th, 2017
  2. OpenAI Non-profit research lab Goal: ensure AGI is good for humanity Teams: Robotics, Dota, basic research, …
  3. Robots that Learn https://blog.openai.com/robots-that-learn/
  4. Dota 2 https://blog.openai.com/dota-2/
  5-6. What’s in an ML system? ML core (e.g. PPO, A3C, …)
  7-8. What’s in an ML system? The ML core (e.g. PPO, A3C, …), surrounded by: Data munging, Compute infra, Networking, Observability, Tooling, Regression tests, Deployment/Inference, Storage, Orchestration
  9. Example: Orchestration Our Model on Kubernetes on Azure
  10. Example: Orchestration Our Model on Kubernetes, whether on Azure, GCE, or on-premises hardware
  11. Scriptable infrastructure
      exp = Experiment()
      exp.add_parameter_server()
      for i in range(NUM_WORKERS):
          exp.add_tensorflow_worker(my_tf_graph, cpu=24, gpu=4)
      exp.run(mode='kube')  # or 'docker'
      https://blog.openai.com/infrastructure-for-deep-learning/
      “Building the Infrastructure that powers the future of AI”, KubeCon 2017
  12-13. Instead of “Research / Engineering”, think “Systems / Algorithms” (TRPO, PPO, DQN, ES, ?) https://blog.openai.com/evolution-strategies/ https://blog.openai.com/openai-baselines-ppo/
  14. How to scale RL? Supervised learning: gradient averaging. Large batch sizes fix many problems. Turns out, it works for reinforcement learning too.
  15. Example: DDPG+HER: an optimizer, rollout workers, and an evaluator
  16. 1. Scale your models 2. Scale your team
  17-18. Know your stack: TensorFlow = CUDA bindings + TF Graph Language + Distributed TF. CUDA bindings: seems fast until you see PyTorch. TF Graph Language: nice design, takes getting used to. Distributed TF: performance issues on plain Ethernet.
  19. TensorFlow++, one of our stacks: CUDA bindings + TF Graph Language + MPI + Redis + Custom Ops
  20. Track performance https://blog.openai.com/more-on-dota-2/
  21. Track regressions
  22. If OpenAI can do it…
  23. 1. Hire a team with diverse skills. 2. Think about the entire system. 3. Track your performance.
  24. Thanks! Interested in working at OpenAI? Ping jonas@openai.com!
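Slides 9-10 make the portability point: the same model runs on Kubernetes whether the cluster sits on Azure, GCE, or on-premises hardware. As a minimal illustration of why that works (not OpenAI's actual tooling; the function name, image, and command below are invented), one can target the cloud-agnostic Kubernetes Job API and submit the identical manifest to any cluster:

```python
# Hypothetical sketch: portability across Azure / GCE / on-prem comes
# from writing against the Kubernetes API, not any one cloud. The same
# Job manifest works on any conformant cluster.

def make_training_job(name, image, command, gpus=0):
    """Build a Kubernetes batch/v1 Job manifest as a plain dict."""
    resources = {"limits": {"nvidia.com/gpu": gpus}} if gpus else {}
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "template": {
                "spec": {
                    "containers": [{
                        "name": name,
                        "image": image,          # illustrative image name
                        "command": command,
                        "resources": resources,
                    }],
                    "restartPolicy": "Never",
                }
            }
        },
    }

job = make_training_job("train-model", "myregistry/model:latest",
                        ["python", "train.py"], gpus=4)
```

The manifest could then be submitted unchanged to any of the three clusters on slide 10.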
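The `Experiment` API on slide 11 appears only as a fragment; here is a minimal runnable mock of what such a scriptable-infrastructure interface could look like. The class body is invented for illustration (the real system is not public); only the method names and call pattern come from the slide:

```python
# Mock of the slide's Experiment API: declare the shape of a distributed
# training run in code, then hand it to a backend (Kubernetes or local
# Docker) for scheduling. This version just records the job specs.

class Experiment:
    def __init__(self):
        self.jobs = []

    def add_parameter_server(self, cpu=8):
        self.jobs.append({"role": "ps", "cpu": cpu})

    def add_tensorflow_worker(self, graph, cpu=24, gpu=4):
        self.jobs.append({"role": "worker", "graph": graph,
                          "cpu": cpu, "gpu": gpu})

    def run(self, mode="kube"):
        # A real implementation would translate self.jobs into
        # Kubernetes pods or Docker containers; we just tag them.
        assert mode in ("kube", "docker")
        return [dict(job, mode=mode) for job in self.jobs]

NUM_WORKERS = 4
my_tf_graph = "graph-placeholder"  # stand-in for an actual TF graph

exp = Experiment()
exp.add_parameter_server()
for i in range(NUM_WORKERS):
    exp.add_tensorflow_worker(my_tf_graph, cpu=24, gpu=4)
scheduled = exp.run(mode="kube")
```

The payoff of this style is that the same script describes the experiment regardless of which backend runs it.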
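Slide 14's claim, that the gradient-averaging recipe from supervised learning carries over to RL, can be sketched with plain numbers. This is a toy stand-in: in a real system each worker's gradient comes from its own batch of rollouts on a separate machine, and the averaged update is what makes large effective batch sizes possible:

```python
# Toy sketch of synchronous data-parallel gradient averaging: each
# worker computes a gradient on its own data, all workers apply the
# element-wise mean. Numbers are arbitrary illustration values.

def average_gradients(worker_grads):
    """Element-wise mean over per-worker gradient vectors."""
    n = len(worker_grads)
    return [sum(g[i] for g in worker_grads) / n
            for i in range(len(worker_grads[0]))]

def sgd_step(params, grad, lr=0.1):
    """One gradient-descent update with the averaged gradient."""
    return [p - lr * g for p, g in zip(params, grad)]

params = [1.0, -2.0]
worker_grads = [[0.2, 0.4], [0.6, 0.0], [0.4, 0.2]]  # 3 workers
avg = average_gradients(worker_grads)
params = sgd_step(params, avg)
```

Because every worker applies the same averaged gradient, all replicas stay in sync without a central parameter server, which is why the same pattern scales to thousands of machines.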
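Slide 15's example combines DDPG with Hindsight Experience Replay (HER). Independent of the optimizer/worker/evaluator layout, the HER idea can be sketched on its own: a failed episode is stored twice, once with the original goal and once relabeled as if a state it actually reached had been the goal, turning sparse failures into useful training signal. The transition format and names below are illustrative, not OpenAI's code:

```python
# Sketch of HER's "final" relabeling strategy with a sparse
# goal-conditioned reward (0 on success, -1 otherwise).

def reward(achieved, goal):
    return 0.0 if achieved == goal else -1.0

def her_relabel(episode, goal):
    """episode: list of (state, action, achieved_state) tuples."""
    buffer = []
    final_achieved = episode[-1][2]  # last state the episode reached
    for state, action, achieved in episode:
        # Original transition, judged against the intended goal...
        buffer.append((state, action, goal, reward(achieved, goal)))
        # ...plus a copy relabeled as if the final achieved state
        # had been the goal all along.
        buffer.append((state, action, final_achieved,
                       reward(achieved, final_achieved)))
    return buffer

# A 2-step episode that never reaches the real goal "g9":
episode = [("s0", "a0", "g1"), ("s1", "a1", "g2")]
buf = her_relabel(episode, goal="g9")
```

Every original transition here earns -1, but the relabeled copy of the last step earns 0, so the replay buffer now contains a success the critic can learn from.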
