
Jonas Schneider, Head of Engineering for Robotics, OpenAI

Machine Learning Systems at Scale:
OpenAI is a non-profit research company discovering and enacting the path to safe artificial general intelligence. As part of our work, we regularly push the limits of scalability in cutting-edge ML algorithms. We’ve found that in many cases, designing the systems we build around the core algorithms is as important as designing the algorithms themselves. This means that many systems engineering areas, such as distributed computing, networking, and orchestration, are crucial for machine learning to succeed on large problems requiring thousands of computers. As a result, engineers and researchers at OpenAI work closely together to build these large systems, rather than maintaining a strict researcher/engineer split. In this talk, we will go over some of the lessons we’ve learned, and how they come together in the design and internals of our system for learning-based robotics research.

Bio: Jonas leads technology development for OpenAI’s robotics group, developing methods to apply machine learning and AI to robots. He also helped build the infrastructure to scale OpenAI’s distributed ML systems to thousands of machines.


  1. Machine Learning Systems at Scale MLconf San Francisco Jonas Schneider November 10th, 2017
  2. OpenAI Non-profit research lab Goal: ensure AGI is good for humanity Teams: Robotics, Dota, basic research, …
  3. Robots that Learn https://blog.openai.com/robots-that-learn/
  4. Dota 2 https://blog.openai.com/dota-2/
  5-6. What’s in an ML system? ML core (e.g. PPO, A3C, …)
  7-8. What’s in an ML system? The ML core (e.g. PPO, A3C, …), surrounded by: Data munging, Compute infra, Networking, Observability, Tooling, Regression tests, Deployment/Inference, Storage, Orchestration
  9. Example: Orchestration Our Model on Kubernetes on Azure
  10. Example: Orchestration Our Model on Kubernetes, whether on Azure, GCE, or on-premises hardware
  11. Scriptable infrastructure
      exp = Experiment()
      exp.add_parameter_server()
      for i in range(NUM_WORKERS):
          exp.add_tensorflow_worker(my_tf_graph, cpu=24, gpu=4)
      exp.run(mode='kube')  # or 'docker'
      https://blog.openai.com/infrastructure-for-deep-learning/
      “Building the Infrastructure that powers the future of AI”, KubeCon 2017
  12-13. Instead of “Research / Engineering”, think “Systems / Algorithms” (TRPO, PPO, DQN, ES, ?) https://blog.openai.com/evolution-strategies/ https://blog.openai.com/openai-baselines-ppo/
  14. How to scale RL? Supervised learning: gradient averaging. Large batch sizes fix many problems. Turns out, it works for reinforcement learning too.
  15. Example: DDPG+HER: an optimizer, rollout workers, and an evaluator
  16. 1. Scale your models 2. Scale your team
  17-18. Know your stack: TensorFlow = CUDA bindings + TF Graph Language + Distributed TF. CUDA bindings: seems fast until you see PyTorch. TF Graph Language: nice design, takes getting used to. Distributed TF: performance issues on plain Ethernet.
  19. TensorFlow++, one of our stacks: CUDA bindings + TF Graph Language + MPI + Redis + Custom Ops
  20. Track performance https://blog.openai.com/more-on-dota-2/
  21. Track regressions
  22. If OpenAI can do it…
  23. 1. Hire a team with diverse skills. 2. Think about the entire system. 3. Track your performance.
  24. Thanks! Interested in working at OpenAI? Ping jonas@openai.com!
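Slides 9-10 make the portability point: the same model runs on Kubernetes whether the cluster sits on Azure, GCE, or on-premises hardware. As a minimal illustration of why that works (not OpenAI's actual tooling; the function name, image, and command below are invented), one can target the cloud-agnostic Kubernetes Job API and submit the identical manifest to any cluster:

```python
# Hypothetical sketch: portability across Azure / GCE / on-prem comes
# from writing against the Kubernetes API, not any one cloud. The same
# Job manifest works on any conformant cluster.

def make_training_job(name, image, command, gpus=0):
    """Build a Kubernetes batch/v1 Job manifest as a plain dict."""
    resources = {"limits": {"nvidia.com/gpu": gpus}} if gpus else {}
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "template": {
                "spec": {
                    "containers": [{
                        "name": name,
                        "image": image,          # illustrative image name
                        "command": command,
                        "resources": resources,
                    }],
                    "restartPolicy": "Never",
                }
            }
        },
    }

job = make_training_job("train-model", "myregistry/model:latest",
                        ["python", "train.py"], gpus=4)
```

The manifest could then be submitted unchanged to any of the three clusters on slide 10.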
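The `Experiment` API on slide 11 appears only as a fragment; here is a minimal runnable mock of what such a scriptable-infrastructure interface could look like. The class body is invented for illustration (the real system is not public); only the method names and call pattern come from the slide:

```python
# Mock of the slide's Experiment API: declare the shape of a distributed
# training run in code, then hand it to a backend (Kubernetes or local
# Docker) for scheduling. This version just records the job specs.

class Experiment:
    def __init__(self):
        self.jobs = []

    def add_parameter_server(self, cpu=8):
        self.jobs.append({"role": "ps", "cpu": cpu})

    def add_tensorflow_worker(self, graph, cpu=24, gpu=4):
        self.jobs.append({"role": "worker", "graph": graph,
                          "cpu": cpu, "gpu": gpu})

    def run(self, mode="kube"):
        # A real implementation would translate self.jobs into
        # Kubernetes pods or Docker containers; we just tag them.
        assert mode in ("kube", "docker")
        return [dict(job, mode=mode) for job in self.jobs]

NUM_WORKERS = 4
my_tf_graph = "graph-placeholder"  # stand-in for an actual TF graph

exp = Experiment()
exp.add_parameter_server()
for i in range(NUM_WORKERS):
    exp.add_tensorflow_worker(my_tf_graph, cpu=24, gpu=4)
scheduled = exp.run(mode="kube")
```

The payoff of this style is that the same script describes the experiment regardless of which backend runs it.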
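Slide 14's claim, that the gradient-averaging recipe from supervised learning carries over to RL, can be sketched with plain numbers. This is a toy stand-in: in a real system each worker's gradient comes from its own batch of rollouts on a separate machine, and the averaged update is what makes large effective batch sizes possible:

```python
# Toy sketch of synchronous data-parallel gradient averaging: each
# worker computes a gradient on its own data, all workers apply the
# element-wise mean. Numbers are arbitrary illustration values.

def average_gradients(worker_grads):
    """Element-wise mean over per-worker gradient vectors."""
    n = len(worker_grads)
    return [sum(g[i] for g in worker_grads) / n
            for i in range(len(worker_grads[0]))]

def sgd_step(params, grad, lr=0.1):
    """One gradient-descent update with the averaged gradient."""
    return [p - lr * g for p, g in zip(params, grad)]

params = [1.0, -2.0]
worker_grads = [[0.2, 0.4], [0.6, 0.0], [0.4, 0.2]]  # 3 workers
avg = average_gradients(worker_grads)
params = sgd_step(params, avg)
```

Because every worker applies the same averaged gradient, all replicas stay in sync without a central parameter server, which is why the same pattern scales to thousands of machines.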
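Slide 15's example combines DDPG with Hindsight Experience Replay (HER). Independent of the optimizer/worker/evaluator layout, the HER idea can be sketched on its own: a failed episode is stored twice, once with the original goal and once relabeled as if a state it actually reached had been the goal, turning sparse failures into useful training signal. The transition format and names below are illustrative, not OpenAI's code:

```python
# Sketch of HER's "final" relabeling strategy with a sparse
# goal-conditioned reward (0 on success, -1 otherwise).

def reward(achieved, goal):
    return 0.0 if achieved == goal else -1.0

def her_relabel(episode, goal):
    """episode: list of (state, action, achieved_state) tuples."""
    buffer = []
    final_achieved = episode[-1][2]  # last state the episode reached
    for state, action, achieved in episode:
        # Original transition, judged against the intended goal...
        buffer.append((state, action, goal, reward(achieved, goal)))
        # ...plus a copy relabeled as if the final achieved state
        # had been the goal all along.
        buffer.append((state, action, final_achieved,
                       reward(achieved, final_achieved)))
    return buffer

# A 2-step episode that never reaches the real goal "g9":
episode = [("s0", "a0", "g1"), ("s1", "a1", "g2")]
buf = her_relabel(episode, goal="g9")
```

Every original transition here earns -1, but the relabeled copy of the last step earns 0, so the replay buffer now contains a success the critic can learn from.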
