Training Recurrent Neural Networks at Scale

One of our projects at Baidu's Silicon Valley AI Lab is using deep learning to develop state-of-the-art end-to-end speech recognition systems based on recurrent neural networks for multiple languages. The training set for each language is multiple terabytes in size, and each model requires in excess of 10 exaflops to train. Training such models requires scale and techniques that are unusual for deep learning but more common in high performance computing. I will talk about the challenges involved and the software and hardware solutions that we employ.
Natural User Interfaces
• Goal: Make interacting with computers as natural as interacting with humans
• AI problems:
– Speech recognition
– Emotion recognition
– Semantic understanding
– Dialog systems
– Speech synthesis
Deep Speech Applications
• Voice controlled apps
• Peel Partnership
• English and Mandarin APIs in the US
• Integration into Baidu’s products in China
Deep Speech: End-to-end learning
• Deep neural network predicts the probability of characters directly from audio (sketched below)
[Figure: network output characters over time — T H _ E … D O G]
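To make the end-to-end idea concrete, here is a minimal sketch (plain numpy, with hypothetical layer sizes and a single simple recurrent layer rather than the actual Deep Speech architecture) of a recurrent network that turns audio feature frames into a per-timestep distribution over characters:

import numpy as np

rng = np.random.default_rng(0)
n_feats, n_hidden, n_chars = 160, 256, 29   # hypothetical: spectrogram bins, hidden units, characters incl. blank

Wx = rng.standard_normal((n_hidden, n_feats)) * 0.01   # input -> hidden
Wh = rng.standard_normal((n_hidden, n_hidden)) * 0.01  # hidden -> hidden recurrence
Wy = rng.standard_normal((n_chars, n_hidden)) * 0.01   # hidden -> character logits

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def char_probs(audio_frames):
    """Return a (timesteps, n_chars) array of per-frame character probabilities."""
    h = np.zeros(n_hidden)
    out = []
    for x in audio_frames:              # one feature frame per timestep
        h = np.tanh(Wx @ x + Wh @ h)    # simple recurrent layer
        out.append(softmax(Wy @ h))     # distribution over characters at this timestep
    return np.array(out)

probs = char_probs(rng.standard_normal((100, n_feats)))   # 100 frames of stand-in audio
print(probs.shape)                                         # (100, 29)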
Deep Speech: CTC
• Simplified sequence of network outputs (probabilities)

        Time →
E       .01   .05   .1    .1    .8    .05
H       .01   .1    .1    .6    .05   .05
T       .01   .8    .75   .2    .05   .1
BLANK   .97   .05   .05   .1    .1    .8
• Generally many more timesteps than letters
• Need to look at all the ways we can write “the”
• Adjacent characters collapse
• TTTHEE, TTTTHE, TTHHEE, THEEEE, ….
• Solve with dynamic programming
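A minimal sketch of these two ideas (illustrative Python, not warp-ctc): the collapse rule that maps paths like TTTHEE to "the", and the dynamic-programming forward pass that sums the probability of every path collapsing to the target, using the simplified table above:

import numpy as np

alphabet = ["-", "T", "H", "E"]                 # "-" is the CTC blank
net_outputs = np.array([                        # per-timestep probabilities from the table above
    [.97, .05, .05, .10, .10, .80],             # blank
    [.01, .80, .75, .20, .05, .10],             # T
    [.01, .10, .10, .60, .05, .05],             # H
    [.01, .05, .10, .10, .80, .05],             # E
])

def collapse(path):
    """CTC collapse rule: merge adjacent repeats, then drop blanks."""
    out, prev = [], None
    for c in path:
        if c != prev and c != "-":
            out.append(c)
        prev = c
    return "".join(out)

assert collapse("TTTHEE") == collapse("TTHHEE") == collapse("THEEEE") == "THE"

def ctc_forward(probs, label, alphabet):
    """Sum the probability of every path that collapses to `label` (forward DP)."""
    idx = {c: i for i, c in enumerate(alphabet)}
    ext = ["-"]                                  # interleave blanks: "THE" -> - T - H - E -
    for c in label:
        ext += [c, "-"]
    T = probs.shape[1]
    alpha = np.zeros((len(ext), T))
    alpha[0, 0] = probs[idx[ext[0]], 0]
    alpha[1, 0] = probs[idx[ext[1]], 0]
    for t in range(1, T):
        for s, c in enumerate(ext):
            a = alpha[s, t - 1]
            if s > 0:
                a += alpha[s - 1, t - 1]
            if s > 1 and c != "-" and c != ext[s - 2]:   # may skip the blank between two different characters
                a += alpha[s - 2, t - 1]
            alpha[s, t] = a * probs[idx[c], t]
    return alpha[-1, -1] + alpha[-2, -1]         # end on the final character or a trailing blank

print("P('THE') =", ctc_forward(net_outputs, "THE", alphabet))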
warp-ctc
• Recently open sourced our CTC implementation
• Efficient, parallel CPU and GPU backend
• 100-400X faster than other implementations
• Apache license, C interface
https://github.com/baidu-research/warp-ctc
Accuracy scales with Data
[Figure: Performance vs. Data & Model Size — deep learning algorithms compared with many previous methods]
• 40% error reduction for each 10x increase in dataset size
Training sets
• Train on ~1½ years of data (and growing)
• English and Mandarin
• End-to-end deep learning is key to assembling large datasets
• Datasets drive accuracy
Large Datasets = Large Models
[Figure: Accuracy vs. Dataset Size for a big model and a small model]
• Models require over 20 exaflops to train (exa = 10^18)
• Trained on 4+ Terabytes of audio
Parallelism across GPUs
[Figure: model-parallel vs. data-parallel training — in the data-parallel setup each replica trains on its own share of the training data and replicas are synchronized with MPI_Allreduce()]
For these models, Data Parallelism works best
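As a rough illustration of the data-parallel pattern (a minimal sketch using mpi4py and numpy, not the actual training code; the sizes and learning rate are made up), each rank computes gradients on its own shard of the data, the gradients are summed across all GPUs with an allreduce, and every rank then applies the same update:

import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

weights = np.ones(1_000_000, dtype=np.float32)                    # model replica, identical on every rank
local_grad = np.random.randn(weights.size).astype(np.float32)     # stand-in for gradients from this rank's data shard

global_grad = np.empty_like(local_grad)
comm.Allreduce(local_grad, global_grad, op=MPI.SUM)               # sum gradients across all GPUs
global_grad /= size                                               # average

lr = 1e-3
weights -= lr * global_grad                                       # identical update keeps the replicas in sync

Because every replica applies the same averaged gradient, the model copies never diverge; the cost is one allreduce of the full gradient per step, which is why allreduce performance and interconnect bandwidth matter.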
Performance for RNN training
• 55% of GPU FMA peak using a single GPU
• ~48% of peak using 8 GPUs in one node
• Weak scaling very efficient, albeit algorithmically challenged
[Figure: sustained TFLOP/s (1–512, log scale) vs. number of GPUs (1–128), covering one-node and multi-node runs; a typical training run is marked]
All-reduce
• We implemented our own all-reduce out of send and receive (a ring version is sketched after this list)
• Several algorithm choices based on size
• Careful attention to affinity and topology
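As one illustration of building an allreduce purely from point-to-point messages, here is a minimal ring allreduce sketch in mpi4py/numpy (a common choice for large messages; this is illustrative, not Baidu's implementation, and it ignores the affinity and topology tuning mentioned above):

import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
left, right = (rank - 1) % size, (rank + 1) % size

def ring_allreduce(buf):
    """In-place sum-allreduce of a 1-D float32 array using only send/receive around a ring."""
    chunks = np.array_split(buf, size)                  # views into buf, one chunk per rank
    recv = np.empty(max(c.size for c in chunks), dtype=buf.dtype)

    # Reduce-scatter: after size-1 steps, each rank holds one fully summed chunk.
    for step in range(size - 1):
        send_idx = (rank - step) % size
        recv_idx = (rank - step - 1) % size
        comm.Sendrecv(chunks[send_idx], dest=right,
                      recvbuf=recv[:chunks[recv_idx].size], source=left)
        chunks[recv_idx] += recv[:chunks[recv_idx].size]

    # Allgather: circulate the finished chunks so every rank has the full result.
    for step in range(size - 1):
        send_idx = (rank - step + 1) % size
        recv_idx = (rank - step) % size
        comm.Sendrecv(chunks[send_idx], dest=right,
                      recvbuf=recv[:chunks[recv_idx].size], source=left)
        chunks[recv_idx][:] = recv[:chunks[recv_idx].size]

grad = np.full(16, float(rank), dtype=np.float32)
ring_allreduce(grad)                                    # every element becomes sum(range(size))

The reduce-scatter plus allgather structure sends each byte only about 2·(N−1)/N times per node, so bandwidth cost is nearly independent of the number of GPUs; for small messages other algorithms (e.g. tree-based) can win, consistent with choosing the algorithm based on size.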
Scalability
• Batch size is hard to increase
– algorithmic and memory limits
• Performance at small batch sizes (32, 64) leads to scalability limits
Precision
• FP16 also mostly works
– Use FP32 for softmax and weight updates (sketched after the figure below)
• More sensitive to labeling error
[Figure: "Weight Distribution" histogram — count (1 to 10^8, log scale) vs. magnitude (exponents -31 to 0)]
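A minimal numpy sketch of this recipe (illustrative values, not the actual training code): small FP16 updates can round away entirely, which is why an FP32 master copy of the weights is used for updates, and the softmax is computed in FP32 for range and stability:

import numpy as np

lr = np.float16(1e-3)
grad = np.float16(2e-4)              # a small gradient from the FP16 backward pass

# Naive FP16 update: the step (~2e-7) is far below FP16 spacing around 1.0, so it vanishes.
w_fp16 = np.float16(1.0)
w_fp16 = np.float16(w_fp16 - lr * grad)
print(w_fp16)                        # 1.0 -- no progress

# Mixed precision: keep an FP32 master copy, update it, then cast back to FP16 for the next step.
w_master = np.float32(1.0)
w_master = w_master - np.float32(lr) * np.float32(grad)
w_fp16 = np.float16(w_master)
print(w_master)                      # ~0.9999998 -- the update survives

def softmax_fp32(logits_fp16):
    """Softmax computed in FP32: subtract the max and exponentiate with FP32 range."""
    x = logits_fp16.astype(np.float32)
    e = np.exp(x - x.max())
    return e / e.sum()

print(softmax_fp32(np.array([12.0, 1.0, -3.0], dtype=np.float16)))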
Conclusion
• We have to do experiments at scale
• Pushing compute scaling for end-to-end deep learning
• Efficient training for large datasets
– 50 TFLOP/s sustained on one model
– 20 exaflops to train each model
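As a back-of-the-envelope check relating those two numbers (assuming the 50 TFLOP/s is sustained for the whole run): 20 exaflops / 50 TFLOP/s = 20×10^18 / 50×10^12 ≈ 4×10^5 seconds, i.e. roughly 4–5 days of compute per model.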
• Thanks to Bryan Catanzaro, Carl Case, Adam Coates for donating some slides
Speaker notes:
Model Parallel: Latency sensitive
Data Parallel: Bandwidth sensitive