Anima Anandkumar, Principal Scientist, Amazon Web Services, Endowed Professor, Caltech, at MLconf SF 2017


Large-scale Machine Learning: Deep, Distributed and Multi-Dimensional:
Modern machine learning involves deep neural network architectures, which yield state-of-the-art performance in multiple domains such as computer vision, natural language processing and speech recognition. As data and models scale, it becomes necessary to use multiple processing units for both training and inference. Apache MXNet is an open-source framework developed for distributed deep learning. I will describe the underlying lightweight hierarchical parameter server architecture that results in high efficiency in distributed settings.
Pushing the current boundaries of deep learning requires using multiple dimensions and modalities. These can be encoded into tensors, which are natural extensions of matrices. We present new deep learning architectures that preserve the multi-dimensional information in data end-to-end. We show that tensor contraction and regression layers are an effective replacement for fully connected layers in deep learning architectures. They result in significant space savings with negligible performance degradation. These functionalities are available in the TensorLy package, with an MXNet backend interface for large-scale, efficient learning.

Bio: Anima Anandkumar is a principal scientist at Amazon Web Services and a Bren Professor in the Caltech CMS department. Her research interests are in the areas of large-scale machine learning, non-convex optimization and high-dimensional statistics. In particular, she has been spearheading the development and analysis of tensor algorithms. She is the recipient of several awards, such as the Alfred P. Sloan Fellowship, the Microsoft Faculty Fellowship, a Google research award, ARO and AFOSR Young Investigator Awards, the NSF CAREER Award, the Early Career Excellence in Research Award at UCI, the Best Thesis Award from the ACM SIGMETRICS society, the IBM Fran Allen PhD Fellowship, and several best-paper awards. She has been featured in a number of forums, such as YourStory, the Quora ML session, O'Reilly Media, and so on. She received her B.Tech in Electrical Engineering from IIT Madras in 2004 and her PhD from Cornell University in 2009. She was a postdoctoral researcher at MIT from 2009 to 2010, an assistant professor at U.C. Irvine between 2010 and 2016, and a visiting researcher at Microsoft Research New England in 2012 and 2014.



  1. Learning at Scale: Deep, Distributed and Multi-dimensional. Anima Anandkumar, Amazon AI & Caltech
  2. The "deep learning" trend of the past 10 years has significantly improved applications across multiple domains: image understanding, speech recognition, natural language processing, autonomy, and more.
  3. Image Classification: multilevel feature extraction from raw pixels to semantic meanings (Layer 1 → Layer 2 → Output); convolution layers exploit spatial information.
  4. Image Classification
     § Hard to define the network: the definition of the Inception network has >1k lines of code in Caffe.
     § A single image requires billions of floating-point operations (Intel i7: ~500 GFLOPS; Nvidia Titan X: ~5 TFLOPS).
     § Memory consumption is linear in the number of layers, and state-of-the-art networks have tens to hundreds of layers.
  5. Outline: 1 Introduction, 2 Distributed Deep Learning Using MXNet, 3 Learning in Multiple Dimensions, 4 Conclusion
  6. MXNet (image credit: Wikipedia)
     • Imperative and Declarative Programming
     • Language Support
     • Backend and Automatic Parallelization
  7. Writing Parallel Programs is Painful. Each forward-backward-update involves O(num_layer), often 100–1,000, tensor computations and communications. Dependency graph for a 2-layer neural network on 2 GPUs:
     data = next_batch()
     data[gpu0].copyfrom(data[0:50])
     fc1[gpu0] = FullcForward(data[gpu0], fc1_weight[gpu0])
     fc2[gpu0] = FullcForward(fc1[gpu0], fc2_weight[gpu0])
     fc2_ograd[gpu0] = LossGrad(fc2[gpu0], label[0:50])
     fc1_ograd[gpu0], fc2_wgrad[gpu0] = FullcBackward(fc2_ograd[gpu0], fc2_weight[gpu0])
     _, fc1_wgrad[gpu0] = FullcBackward(fc1_ograd[gpu0], fc1_weight[gpu0])
     data[gpu1].copyfrom(data[51:100])
     fc1[gpu1] = FullcForward(data[gpu1], fc1_weight[gpu1])
     fc2[gpu1] = FullcForward(fc1[gpu1], fc2_weight[gpu1])
     fc2_ograd[gpu1] = LossGrad(fc2[gpu1], label[51:100])
     fc1_ograd[gpu1], fc2_wgrad[gpu1] = FullcBackward(fc2_ograd[gpu1], fc2_weight[gpu1])
     _, fc1_wgrad[gpu1] = FullcBackward(fc1_ograd[gpu1], fc1_weight[gpu1])
     fc1_wgrad[cpu] = fc1_wgrad[gpu0] + fc1_wgrad[gpu1]
     fc1_weight[cpu] -= lr * fc1_wgrad[cpu]
     fc1_weight[cpu].copyto(fc1_weight[gpu0], fc1_weight[gpu1])
     fc2_wgrad[cpu] = fc2_wgrad[gpu0] + fc2_wgrad[gpu1]
     fc2_weight[cpu] -= lr * fc2_wgrad[cpu]
     fc2_weight[cpu].copyto(fc2_weight[gpu0], fc2_weight[gpu1])
  8. Auto Parallelization: write serial programs, run in parallel.
     >>> import mxnet as mx
     >>> A = mx.nd.ones((2,2)) * 2
     >>> C = A + 2
     >>> B = A + 1
     >>> D = B * C
     >>> D.wait_to_read()
     (Dependency graph: A = 2; C = A + 2; B = A + 1; D = B ⨉ C.)
  9. Data Parallelism, via a key-value store:
     1. Read a data partition
     2. Pull the parameters
     3. Compute the gradient
     4. Push the gradient
     5. Update the parameters
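     A minimal sketch of those five steps against MXNet's key-value store API (a single-machine 'local' store and a placeholder gradient stand in for the distributed setup and a real forward/backward pass):
        import mxnet as mx

        kv = mx.kv.create('local')                 # 'device' aggregates on GPUs; 'dist_sync' spans machines
        kv.init(0, mx.nd.zeros((100, 50)))         # register parameter 0 on the store
        kv.set_optimizer(mx.optimizer.SGD(learning_rate=0.1))

        weight = mx.nd.zeros((100, 50))
        for _ in range(5):                         # 1. read a data partition (synthetic loop here)
            kv.pull(0, out=weight)                 # 2. pull the latest parameters
            grad = mx.nd.ones((100, 50))           # 3. compute the gradient (placeholder gradient)
            kv.push(0, grad)                       # 4. push the gradient
            # 5. the store aggregates gradients and applies the SGD update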
  10. Scale to Multiple GPU Machines. Per machine: 4 GPUs behind a PCIe switch plus a CPU (PCIe 3.0 16x: 15.75 GB/s per link, 63 GB/s over 4 links); machines connected through a network switch over 10 Gbit Ethernet (1.25 GB/s). Hierarchical parameter server: workers and level-1 servers on the GPUs, level-2 servers on the CPUs.
  11. Experiment Setup
     ✧ 1.2 million images with 1000 classes
     ✧ ResNet 152-layer model
     ✧ EC2 P2.16xlarge (GPUs 0-15 behind PCIe switches, plus CPU)
     ✧ Minibatch SGD
     ✧ Synchronized updating
  12. Scalability over Multiple Machines. (Figure: time (sec)/batch vs. number of GPUs, 0 to 128, including communication cost, for batch size/GPU = 2, 4, 8, 16; 115x speedup.)
  13. (Timeline, before 2012 through 2017: imperative and symbolic programming styles, MXNet, and Gluon.)
  14. Back-end System
     ✧ Optimization
       ✓ Memory optimization
       ✓ Operator fusion
     ✧ Scheduling
       ✓ Auto-parallelization
     Front-end (imperative):
       import mxnet as mx
       a = mx.nd.zeros((100, 50))
       b = mx.nd.ones((100, 50))
       c = a * b
       c += 1
     Front-end (symbolic):
       import mxnet as mx
       net = mx.symbol.Variable('data')
       net = mx.symbol.FullyConnected(data=net, num_hidden=128)
       net = mx.symbol.SoftmaxOutput(data=net)
       texec = mx.module.Module(net)
       texec.forward(data=c)
       texec.backward()
     (Both front-ends feed the same back-end computation graph.)
  15. In summary
     ✦ Symbolic: efficient & portable, but hard to use
     ✦ Imperative: flexible, but may be slow
     ✦ Gluon: imperative for developing, symbolic for deploying
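     A minimal Gluon sketch of that hybrid workflow, with arbitrary layer sizes: the network is written and debugged imperatively, and hybridize() switches it to the cached symbolic graph for deployment.
        import mxnet as mx
        from mxnet.gluon import nn

        net = nn.HybridSequential()
        net.add(nn.Dense(128, activation='relu'),   # developed imperatively, layer by layer
                nn.Dense(10))
        net.initialize()

        x = mx.nd.random.uniform(shape=(4, 20))
        net(x)             # imperative execution, easy to inspect and debug
        net.hybridize()    # compile to a symbolic graph for speed and deployment
        net(x)             # same call, now runs through the cached graph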
  16. Outline: 1 Introduction, 2 Distributed Deep Learning Using MXNet, 3 Learning in Multiple Dimensions, 4 Conclusion
  17. Tensors: Beyond the 2D World. Modern data is inherently multi-dimensional.
  18. Tensors: Beyond the 2D World. Modern data is inherently multi-dimensional. (Figure: a neural network with Input, Hidden 1, Hidden 2, and Output layers.)
  19. Tensor Contraction extends the notion of matrix product.
     Matrix product: Mv = Σ_j v_j M_j (a weighted sum of the columns of M).
     Tensor contraction: T(u, v, ·) = Σ_{i,j} u_i v_j T_{i,j,:} (a weighted sum of the mode-3 fibers of T).
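     Both definitions can be written as a single einsum; a small NumPy illustration with arbitrary shapes:
        import numpy as np

        M = np.random.rand(4, 5); v = np.random.rand(5); u = np.random.rand(4)
        T = np.random.rand(4, 5, 6)

        Mv = np.einsum('ij,j->i', M, v)           # matrix product: sum_j v_j M[:, j]
        Tuv = np.einsum('ijk,i,j->k', T, u, v)    # tensor contraction: sum_{i,j} u_i v_j T[i, j, :]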
  20. Employing Tensor Contractions in AlexNet: replace the fully connected layer with a tensor contraction layer.
  21. Enabling the Tensor Contraction Layer in MXNet.
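     A NumPy sketch of what a tensor contraction layer computes (illustrative shapes and names, not the library's API): the activation tensor keeps its multi-dimensional structure and is contracted mode-by-mode with small factor matrices instead of being flattened into a fully connected layer.
        import numpy as np

        def tensor_contraction_layer(X, U1, U2, U3):
            # contract activations X (batch, C, H, W) with factors U1 (C'xC), U2 (H'xH), U3 (W'xW)
            return np.einsum('bchw,ic,jh,kw->bijk', X, U1, U2, U3)

        X = np.random.rand(8, 256, 7, 7)          # e.g. the last convolutional activations
        U1 = np.random.rand(64, 256)
        U2, U3 = np.random.rand(5, 7), np.random.rand(5, 7)
        Y = tensor_contraction_layer(X, U1, U2, U3)   # shape (8, 64, 5, 5); ~16k parameters
                                                      # vs. millions for a flattening dense layer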
  22. Performance of the TCL
     • Trained end-to-end
     • On ImageNet with VGG: 65.9% space savings, performance drop of only 0.6%
     • On ImageNet with AlexNet: 56.6% space savings, performance improvement of 0.5%
  23. Low-rank tensor regression. Tensor Regression Networks, J. Kossaifi, Z. C. Lipton, A. Khanna, T. Furlanello and A. Anandkumar, arXiv preprint.
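     In the same spirit, a tensor regression layer keeps the regression weights in low-rank (here Tucker-factored) form and never materializes the full weight tensor; a NumPy illustration, not the paper's code:
        import numpy as np

        def tensor_regression_layer(X, G, U1, U2, U3, V):
            # y[b, o] = <X[b], W[:, :, :, o]>, with W kept in Tucker form (core G, factors U1, U2, U3, V)
            Z = np.einsum('bchw,ic,jh,kw->bijk', X, U1, U2, U3)   # contract X with the input factors
            return np.einsum('bijk,ijkr,or->bo', Z, G, V)         # then with the core and output factor

        X = np.random.rand(8, 256, 7, 7)
        G = np.random.rand(16, 4, 4, 10)
        U1, U2, U3 = np.random.rand(16, 256), np.random.rand(4, 7), np.random.rand(4, 7)
        V = np.random.rand(1000, 10)
        y = tensor_regression_layer(X, G, U1, U2, U3, V)          # shape (8, 1000)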
  24. Performance and rank.
  25. Speeding up Tensor Contractions
     1 Tensor contractions are a core primitive of multilinear algebra.
     2 BLAS 3: unbounded compute intensity (no. of ops per I/O).
     Consider single-index contractions, e.g. C_mnp = A_mnk B_kp.
  26. Speeding up Tensor Contraction. Explicit permutation dominates, especially for small tensors. Consider C_mnp = A_km B_pkn:
     1 A_km → A_mk
     2 B_pkn → B_kpn
     3 C_mnp → C_mpn
     4 C_m(pn) = A_mk B_k(pn)
     5 C_mpn → C_mnp
     (Figure, top: CPU; bottom: GPU. The fraction of time spent in copies/transpositions vs. n; lines are shown for 1, 2, 3, and 6 transpositions.)
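     For reference, the permute-then-GEMM route versus a direct contraction for C_mnp = A_km B_pkn, with NumPy standing in for the BLAS calls (sizes are arbitrary):
        import numpy as np

        m, n, p, k = 64, 64, 64, 64
        A = np.random.rand(k, m); B = np.random.rand(p, k, n)

        # permutation route: explicit copies/transpositions around one flattened GEMM
        C1 = (A.T @ B.transpose(1, 0, 2).reshape(k, p * n)).reshape(m, p, n).transpose(0, 2, 1)

        # direct route: a single contraction call, no explicit copies in user code
        C2 = np.einsum('km,pkn->mnp', A, B)
        assert np.allclose(C1, C2)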
  27. Existing Primitives
     GEMM: suboptimal for many small matrices.
     Pointer-to-Pointer BatchedGEMM (available in MKL 11.3β and cuBLAS 4.1):
       C[p] = α op(A[p]) op(B[p]) + β C[p]
       cublas<T>gemmBatched(cublasHandle_t handle,
                            cublasOperation_t transA, cublasOperation_t transB,
                            int M, int N, int K,
                            const T* alpha, const T** A, int ldA, const T** B, int ldB,
                            const T* beta, T** C, int ldC, int batchCount)
  28. Tensor Contraction with Extended BLAS Primitives
     C_mn[p] = A_mk B_kn[p]
     cublasDgemmStridedBatched(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                               M, N, K,
                               &alpha, A, ldA1, 0,
                               B, ldB1, ldB2,
                               &beta, C, ldC1, ldC2, P)
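     The semantics of that strided batched call, C_mn[p] = A_mk B_kn[p], can be mimicked in NumPy by broadcasting matmul over the batch index p (illustrative only; the slide's call is the cuBLAS C API):
        import numpy as np

        M, N, K, P = 32, 32, 32, 8
        A = np.random.rand(M, K)
        B = np.random.rand(K, N, P)

        # one GEMM per slice B[:, :, p], reusing the same A: C[m, n, p] = sum_k A[m, k] B[k, n, p]
        C = np.matmul(A, B.transpose(2, 0, 1)).transpose(1, 2, 0)
        assert np.allclose(C, np.einsum('mk,knp->mnp', A, B))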
  29. Tensor Contraction with Extended BLAS Primitives
     C_mnp = A_** × B_*** (A has two indices, B has three), with C_mnp ≡ C[m + n · ldC1 + p · ldC2]
     Case  Contraction   Kernel(s)
     1.1   A_mk B_knp    C_m(np) = A_mk B_k(np) ;  C_mn[p] = A_mk B_kn[p]
     1.2   A_mk B_kpn    C_mn[p] = A_mk B_k[p]n ;  C_m[n]p = A_mk B_kp[n]
     1.3   A_mk B_nkp    C_mn[p] = A_mk B_nk[p]
     1.4   A_mk B_pkn    C_m[n]p = A_mk B_pk[n]
     1.5   A_mk B_npk    C_m(np) = A_mk B_(np)k ;  C_mn[p] = A_mk B_n[p]k
     1.6   A_mk B_pnk    C_m[n]p = A_mk B_p[n]k
     2.1   A_km B_knp    C_m(np) = A_km B_k(np) ;  C_mn[p] = A_km B_kn[p]
     2.2   A_km B_kpn    C_mn[p] = A_km B_k[p]n ;  C_m[n]p = A_km B_kp[n]
     2.3   A_km B_nkp    C_mn[p] = A_km B_nk[p]
     2.4   A_km B_pkn    C_m[n]p = A_km B_pk[n]
     2.5   A_km B_npk    C_m(np) = A_km B_(np)k ;  C_mn[p] = A_km B_n[p]k
     2.6   A_km B_pnk    C_m[n]p = A_km B_p[n]k
     3.1   A_nk B_kmp    C_mn[p] = B_km[p] A_nk
     3.2   A_nk B_kpm    C_mn[p] = B_k[p]m A_nk
     3.3   A_nk B_mkp    C_mn[p] = B_mk[p] A_nk
     3.4   A_nk B_pkm    (none)
     3.5   A_nk B_mpk    C_mn[p] = B_m[p]k A_nk
     3.6   A_nk B_pmk    (none)
     4.1   A_kn B_kmp    C_mn[p] = B_km[p] A_kn
     4.2   A_kn B_kpm    C_mn[p] = B_k[p]m A_kn
     4.3   A_kn B_mkp    C_mn[p] = B_mk[p] A_kn
     4.4   A_kn B_pkm    (none)
     4.5   A_kn B_mpk    C_mn[p] = B_m[p]k A_kn
     4.6   A_kn B_pmk    (none)
     5.1   A_pk B_kmn    C_(mn)p = B_k(mn) A_pk ;  C_m[n]p = B_km[n] A_pk
     5.2   A_pk B_knm    C_m[n]p = B_k[n]m A_pk
     5.3   A_pk B_mkn    C_m[n]p = B_mk[n] A_pk
     5.4   A_pk B_nkm    (none)
     5.5   A_pk B_mnk    C_(mn)p = B_(mn)k A_pk ;  C_m[n]p = B_m[n]k A_pk
     5.6   A_pk B_nmk    (none)
     6.1   A_kp B_kmn    C_(mn)p = B_k(mn) A_kp ;  C_m[n]p = B_km[n] A_kp
     6.2   A_kp B_knm    C_m[n]p = B_k[n]m A_kp
     6.3   A_kp B_mkn    C_m[n]p = B_mk[n] A_kp
     6.4   A_kp B_nkm    (none)
     6.5   A_kp B_mnk    C_(mn)p = B_(mn)k A_kp ;  C_m[n]p = B_m[n]k A_kp
     6.6   A_kp B_nmk    (none)
  30. A new primitive: StridedBatchedGEMM. Performance on par with pure GEMM (P100 and beyond).
  31. Applications: Tucker Decomposition
     T_mnp = G_ijk A_mi B_nj C_pk
     Main steps in the algorithm (alternating updates at iteration t):
       Y_mjk = T_mnp B^t_nj C^t_pk
       Y_ink = T_mnp A^(t+1)_mi C^t_pk
       Y_ijp = T_mnp B^(t+1)_nj A^(t+1)_mi
     (Figure: time (sec) vs. n for Tucker decomposition, comparing TensorToolbox, BTAS, Cyclops, CPU Batched, and GPU Batched.)
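     A minimal NumPy sketch of those alternating (HOOI-style) updates for a 3-way tensor; the function name and fixed iteration count are illustrative choices, not the benchmarked implementation:
        import numpy as np

        def tucker_hooi(T, ranks, n_iter=10):
            # alternating updates for a Tucker model T ~ G x1 A x2 B x3 C with ranks (r1, r2, r3)
            m, n, p = T.shape
            r1, r2, r3 = ranks
            # HOSVD initialization: leading left singular vectors of each unfolding
            A = np.linalg.svd(T.reshape(m, -1))[0][:, :r1]
            B = np.linalg.svd(T.transpose(1, 0, 2).reshape(n, -1))[0][:, :r2]
            C = np.linalg.svd(T.transpose(2, 0, 1).reshape(p, -1))[0][:, :r3]
            for _ in range(n_iter):
                Y = np.einsum('mnp,nj,pk->mjk', T, B, C)     # Y_mjk = T_mnp B_nj C_pk
                A = np.linalg.svd(Y.reshape(m, -1))[0][:, :r1]
                Y = np.einsum('mnp,mi,pk->ink', T, A, C)     # Y_ink = T_mnp A_mi C_pk
                B = np.linalg.svd(Y.transpose(1, 0, 2).reshape(n, -1))[0][:, :r2]
                Y = np.einsum('mnp,mi,nj->ijp', T, A, B)     # Y_ijp = T_mnp A_mi B_nj
                C = np.linalg.svd(Y.transpose(2, 0, 1).reshape(p, -1))[0][:, :r3]
            G = np.einsum('mnp,mi,nj,pk->ijk', T, A, B, C)   # core tensor
            return G, A, B, C

        G, A, B, C = tucker_hooi(np.random.rand(20, 30, 40), ranks=(5, 5, 5))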
  32. Tensor Sketches: randomized dimensionality reduction through sketching.
     ◮ Complexity independent of tensor order: exponential gain!
     (Figure: a tensor T is hashed, with ±1 signs, into a short sketch s.)
     Applications:
     • Tensor decomposition via sketching
     • Visual question and answering: CNN image features and RNN question features ("What is the mustache made of?") are combined with MCT, then average pooling, FC, ReLU, BatchNorm, FC, and Softmax produce the answer ("Banana").
  33. MCT in Visual Question & Answering. (Figure: the same pipeline, with CNN and RNN features fused by MCT pooling and passed through FC, ReLU, and Softmax layers to answer the question.)
  34. Multimodal Tensor Pooling. (Figure: an image feature map of size C × W × H and a text feature of length L are compressed by a spatial sketch and a count sketch to sizes d1 × d2 × d3 and d4, then combined via 3D FFT and 1D FFT, followed by an optional 3D IFFT.)
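     A minimal two-modality version of this pipeline in NumPy: each feature vector is count-sketched, and the elementwise product of the sketches in the FFT domain is a count sketch of their (never materialized) outer product. The slide's multimodal tensor pooling extends this idea to spatial feature maps with 3D FFTs; all sizes below are arbitrary.
        import numpy as np

        def count_sketch(x, d, h, s):
            # project x into d bins using hash indices h and random signs s
            y = np.zeros(d)
            np.add.at(y, h, s * x)
            return y

        rng = np.random.default_rng(0)
        d = 1024                                   # sketch dimension
        img = rng.standard_normal(512)             # stand-in image feature
        txt = rng.standard_normal(300)             # stand-in text feature
        h1, s1 = rng.integers(0, d, img.size), rng.choice([-1.0, 1.0], img.size)
        h2, s2 = rng.integers(0, d, txt.size), rng.choice([-1.0, 1.0], txt.size)

        # FFT-domain product = circular convolution of the two count sketches
        pooled = np.real(np.fft.ifft(np.fft.fft(count_sketch(img, d, h1, s1)) *
                                     np.fft.fft(count_sketch(txt, d, h2, s2))))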
  35. Tensor Decompositions.
  36. Extracting Topics from Documents. (Figure: documents are mixtures of topics such as crime, Sports, and Education, each a distribution over words like police, witness, campus, with per-document topic proportions.)
     A. Anandkumar, D. P. Foster, D. Hsu, S. M. Kakade, Y.-K. Liu, "Two SVDs Suffice: Spectral decompositions for probabilistic topic modeling and latent Dirichlet allocation," NIPS 2012.
  37. Tensor Methods for Topic Modeling
     Topic-word matrix: P[word = i | topic = j], with linearly independent columns.
     Moment tensor: the co-occurrence of word triplets decomposes as a sum of rank-one terms, one per topic (crime, Sports, Education), each built from that topic's word distribution (police, witness, campus).
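     After whitening, the moment tensor is approximately symmetric and orthogonally decomposable, and its rank-one components can be recovered with the tensor power iteration; a minimal NumPy sketch (random initialization, no deflation or restarts):
        import numpy as np

        def tensor_power_iteration(T, n_iter=100, seed=0):
            # recover one component (lam, v) of a symmetric 3-way tensor T via power iteration
            rng = np.random.default_rng(seed)
            v = rng.standard_normal(T.shape[0])
            v /= np.linalg.norm(v)
            for _ in range(n_iter):
                v = np.einsum('ijk,j,k->i', T, v, v)    # tensor analogue of the matrix power step
                v /= np.linalg.norm(v)
            lam = np.einsum('ijk,i,j,k->', T, v, v, v)  # component weight
            return lam, v

        # toy check: T is a sum of rank-one terms with orthonormal components
        U = np.linalg.qr(np.random.randn(10, 3))[0]
        T = sum(w * np.einsum('i,j,k->ijk', U[:, t], U[:, t], U[:, t])
                for t, w in enumerate([3.0, 2.0, 1.0]))
        lam, v = tensor_power_iteration(T)              # converges to one of the components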
  38. Tensors vs. Variational Inference
     Criterion: Perplexity = exp[−likelihood].
     Learning topics from PubMed on Spark, 8 million articles. (Figure: running time and perplexity, tensor method vs. variational inference.)
     Learning network communities from social network data: Facebook n ∼ 20k, Yelp n ∼ 40k, DBLP-sub n ∼ 1e5, DBLP n ∼ 1e6. (Figure: running time and error on FB, YP, DBLP-sub, DBLP.)
     F. Huang, U. N. Niranjan, M. Hakeem, A. Anandkumar, "Online tensor methods for training latent variable models," JMLR 2014.
  39. Tensors vs. Variational Inference (same experiments and citation as the previous slide): orders of magnitude faster and more accurate.
  40. Outline: 1 Introduction, 2 Distributed Deep Learning Using MXNet, 3 Learning in Multiple Dimensions, 4 Conclusion
  41. Conclusion
     Distributed deep learning at scale: MXNet has many attractive features
       ◮ Flexible programming
       ◮ Portable
       ◮ Highly efficient
     Easy to deploy large-scale DL on the AWS cloud
       ◮ Deep Learning AMI
       ◮ CloudFormation templates
     Tensors are the future of ML
       ◮ Tensor contractions: space savings in deep architectures.
       ◮ New primitives speed up tensor contractions: extended BLAS.
