AI at scale requires a perfect storm of data, algorithms, and cloud infrastructure. Modern deep learning needs large amounts of training data, so we develop methods that improve data collection, aggregation, and augmentation, drawing on active learning, partial feedback, crowdsourcing, and generative models. We analyze large-scale machine learning methods for distributed training and show that gradient quantization can yield the best of both worlds: accuracy and communication efficiency. We extend matrix methods to higher dimensions using tensor-algebraic techniques and show superior performance. Finally, at AWS, we are developing robust software frameworks and AI cloud services at all levels of the stack.
9. RESULTS
NER task on the largest open benchmark (OntoNotes)
Active learning heuristics:
• Least confidence (LC)
• Maximum normalized log-probability (MNLP)
Deep active learning matches:
• SOTA with just 25% of the data on English, 30% on Chinese.
• The best shallow model (trained on full data) with 12% of the data on English, 17% on Chinese.
Figure: Test F1 score vs. percent of words annotated, for English and Chinese panels, comparing MNLP, LC, and RAND against the best deep and best shallow models.
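A minimal sketch of the two uncertainty heuristics named above, for sentences scored by per-token log-probabilities. The function names and the ranking helper are illustrative, not the paper's implementation.

```python
import numpy as np

def least_confidence(token_logprobs):
    # LC: 1 - probability of the most likely sequence, where the
    # sequence log-probability is the sum of per-token log-probs.
    return 1.0 - np.exp(np.sum(token_logprobs))

def mnlp(token_logprobs):
    # MNLP: negative length-normalized log-probability, so long
    # sentences are not unfairly favored; higher = more uncertain.
    return -np.mean(token_logprobs)

def rank_by_uncertainty(batch, score_fn):
    # Indices of unlabeled sentences, most uncertain first; the top
    # of this ranking is sent for annotation.
    return sorted(range(len(batch)),
                  key=lambda i: score_fn(batch[i]), reverse=True)
```

Length normalization is the whole difference between the two: under LC, long sentences almost always look uncertain simply because their sequence probability is a product of many terms.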
10. TAKE-AWAY
• Uncertainty sampling works. Normalizing for length helps in the low-data regime.
• With active learning, deep beats shallow even in the low-data regime.
• With active learning, SOTA is achieved with far fewer samples.
11. ACTIVE LEARNING WITH PARTIAL FEEDBACK
Figure: images are labeled through binary questions (e.g., "dog?") that partition the classes into dog vs. non-dog, yielding partial labels.
• Hierarchical class labeling: labor is proportional to the number of binary questions asked.
• Can we actively pick informative questions?
12. RESULTS ON TINY IMAGENET (100K SAMPLES)
• Yields 8% higher accuracy at 30% of the questions (w.r.t. Uniform).
• Obtains full annotation with 40% fewer binary questions.
Methods compared:
• ALPF-ERC: active data, active questions
• Uniform: inactive data, inactive questions
• AL-ME: active data, inactive questions
• AQ-ERC: inactive data, active questions
Figure: accuracy vs. number of questions for Uniform, AL-ME, AQ-ERC, and ALPF-ERC (−40% questions, +8% accuracy).
13. TWO TAKE-AWAYS
• Don’t annotate from scratch: select questions actively based on the learned model.
• Don’t sleep on partial labels: re-train the model from partial labels.
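A minimal sketch of the expected-remaining-classes idea behind active question selection: given the model's class probabilities, ask the binary question that leaves the fewest candidate classes in expectation. The function names and the greedy selection are illustrative assumptions, not the paper's implementation.

```python
def expected_remaining_classes(probs, question):
    # probs: model's class probabilities; question: set of class
    # indices meaning "is the true label in this set?".
    p_yes = sum(probs[c] for c in question)
    # Expected number of candidate classes left after the answer.
    return (p_yes * len(question)
            + (1 - p_yes) * (len(probs) - len(question)))

def pick_question(probs, questions):
    # Greedy ERC-style heuristic (sketch): minimize the expected
    # number of classes still consistent with the partial label.
    return min(questions, key=lambda q: expected_remaining_classes(probs, q))
```

If the model is confident in one class, asking directly about that class is cheap in expectation; a vague model is better served by questions that split the probability mass evenly.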
14. CROWDSOURCING: AGGREGATION OF CROWD ANNOTATIONS
Majority rule
• Simple and common.
• Wasteful: ignores the varying quality of different annotators.
15. CROWDSOURCING: AGGREGATION OF CROWD ANNOTATIONS
Annotator-quality models
• Can improve accuracy.
• Hard: quality must be estimated without ground truth.
16. SOME INTUITIONS
Majority rule to estimate annotator quality (probability of correctness):
• Justification: majority rule approaches the ground truth when there are enough workers.
• Downside: requires a large number of annotations per example for majority rule to be correct.
17. OUR APPROACH
• Use the ML model being trained as a proxy for the ground truth.
• Use the annotator-quality model to train the ML model with a weighted loss.
• Justification: as iterations proceed, the ML model agrees more with the correct workers.
• Works with a single annotation per sample!
Figure: an annotator-quality model (probability of correctness) coupled with an ML model that estimates the ground truth.
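A minimal sketch of the quality-estimation step above, with hypothetical array inputs: each worker's quality is taken to be their agreement rate with the current model's predictions, which then weight the training loss. This is an illustrative skeleton, not the paper's estimator.

```python
import numpy as np

def estimate_worker_quality(model_preds, annotations, workers, n_workers):
    # Quality of each worker = fraction of their labels that agree
    # with the current model's predictions (model used as a
    # ground-truth proxy, per the approach above).
    quality = np.ones(n_workers)
    for w in range(n_workers):
        mask = workers == w
        if mask.any():
            quality[w] = (annotations[mask] == model_preds[mask]).mean()
    return quality
```

Each training round would then weight every sample's loss by its annotator's estimated quality, so labels from unreliable workers pull the model less.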
18. LABELING ONCE IS OPTIMAL: BOTH IN THEORY AND PRACTICE
MS-COCO dataset, fixed budget of 35k annotations: +5% accuracy w.r.t. majority rule (plotted against the number of workers per sample).
Theorem: under a fixed budget, generalization error is minimized with a single annotation per sample.
Assumptions:
• The best predictor is accurate enough (under no label noise).
• Simplified case: all workers have the same quality.
• Probability of being correct > 83%.
19. DATA AUGMENTATION
• To improve generalization, augment the training data.
• Mostly used in computer vision, with simple techniques: rotation, cropping, etc.
• Can we do better with more sophisticated approaches?
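For concreteness, a minimal sketch of the simple label-preserving augmentations mentioned above (flip and crop) on a 2D array; the function and its parameters are illustrative.

```python
import numpy as np

def augment(image, rng):
    # Two simple label-preserving augmentations: a random horizontal
    # flip, then a random crop taken from a reflect-padded copy so
    # the output keeps the input's shape.
    if rng.random() < 0.5:
        image = image[:, ::-1]
    padded = np.pad(image, ((2, 2), (2, 2)), mode="reflect")
    y, x = rng.integers(0, 5, size=2)
    return padded[y:y + image.shape[0], x:x + image.shape[1]]
```

Such transforms multiply the effective dataset without new labels, which is exactly the gap the generative approaches on the next slides try to push further.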
20. DATA AUGMENTATION
GAN-generated data
• Merits: captures the statistics of natural images; learnable.
• Perils: quality of generated images is not high; introduces artifacts.
Synthetic (rendered) data
• Merits: high-quality rendering; full annotation for free; can generate infinite data.
• Perils: domain mismatch; rendered for visual appeal, not for classification.
Our GAN-based framework, Mr.GANs, narrows the gap between synthetic and real data.
21. MIXED-REALITY GENERATIVE ADVERSARIAL NETWORKS (MR-GAN)
Improvement over training on real data alone:
• Real + synthetic: 5.43% (Stage 0)
• Real + CycleGAN on synthetic: 5.09% (Stage 1)
• Real + Mr.GAN on synthetic: 8.85% (Stage 2)
CIFAR-DA 4: ~1 million synthetic images from 3D models, ~0.25% real images from CIFAR-100, 4 classes.
Each stage of Mr.GANs maps the synthetic distribution closer to the real one:
D(P(real-2) || P(synt-2)) < D(P(real-1) || P(synt-1)) < D(P(real) || P(synt))
22. DATA AUGMENTATION 2: SYMBOLIC EXPRESSIONS
Goal: learn a domain of functions (sin, cos, log, add, …).
• Training on numerical input-output pairs alone does not generalize.
Data augmentation with symbolic expressions:
• Efficiently encodes relationships between functions.
Solution:
• Design networks that use both symbolic and numerical data.
23. ARCHITECTURE: TREE LSTM
Examples: sin²(x) + cos²(x) = 1 (symbolic); sin(−2.5) ≈ −0.6 (numerical).
• Symbolic expression trees and function-evaluation trees.
• Decimal trees: encode numbers in decimal representation (numerical), e.g., a decimal tree for 2.5.
• Can encode any expression, function evaluation, and number.
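A toy stand-in for the symbolic side of the input, to make the tree encoding concrete: leaves are numbers or the variable "x", internal nodes apply a named function. This is an illustrative evaluator, not the Tree-LSTM itself.

```python
import math

# Tiny expression trees: ("op", child, ...) tuples, "x", or numbers.
OPS = {"add": lambda a, b: a + b,
       "mul": lambda a, b: a * b,
       "sin": math.sin,
       "cos": math.cos}

def evaluate(node, x):
    # Recursively evaluate an expression tree at a numeric point x.
    if node == "x":
        return x
    if isinstance(node, (int, float)):
        return float(node)
    op, *children = node
    return OPS[op](*(evaluate(c, x) for c in children))

# The identity from the slide, sin^2(x) + cos^2(x) = 1, as a tree.
IDENTITY = ("add",
            ("mul", ("sin", "x"), ("sin", "x")),
            ("mul", ("cos", "x"), ("cos", "x")))
```

The same tree structure serves both tasks: fed symbolically it encodes relationships between functions, and evaluated numerically it produces input-output training pairs.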
24. RESULTS
• Vastly improved numerical evaluation: 90% over the function-fitting baseline.
• Generalization to verifying symbolic equations of greater depth.
• Combining symbolic + numerical data yields better generalization on both tasks: symbolic verification and numerical evaluation.
Accuracy: LSTM (symbolic): 76.40% | TreeLSTM (symbolic): 93.27% | TreeLSTM (symbolic + numeric): 96.17%
26. OPTIMIZATION IN DEEP LEARNING
• Data
• Deep, layered model with lots of parameters
• Descend the gradient of the objective function
27. THE DEEP LEARNING COOKBOOK
• Data → randomly augmented data
• Deep, layered model with lots of parameters → with {insert arbitrary} regularizer applied; dropped out / batch normalized / bypassed
• Descend the gradient of the objective function → running average of rescaled, clipped stochastic gradients
29. DISTRIBUTED TRAINING INVOLVES COMPUTATION & COMMUNICATION
Figure: a parameter server coordinates GPU 1 and GPU 2, each training on half the data; the gradients exchanged in each direction could be compressed.
30. DISTRIBUTED TRAINING BY MAJORITY VOTE
Figure: each of GPUs 1-3 sends sign(g) to the parameter server, which broadcasts back the elementwise majority vote sign[sum(sign(g))].
31. SIGN QUANTIZATION COMPETITIVE ON IMAGENET
Communication gains without performance loss:
• Signum: sign of the momentum.
• Accuracy of Signum is similar to that of Adam.
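A minimal sketch of the Signum update just described: keep an exponential moving average of gradients and step using only its sign. The learning rate and momentum constant here are illustrative defaults, not the paper's tuned values.

```python
import numpy as np

def signum_step(params, momentum, grad, lr=1e-3, beta=0.9):
    # Signum: update the momentum (EMA of gradients), then move each
    # parameter by a fixed step in the direction of the momentum's
    # sign only -- magnitudes are discarded.
    momentum = beta * momentum + (1 - beta) * grad
    return params - lr * np.sign(momentum), momentum
```

Discarding magnitudes is what makes the update 1-bit-per-coordinate compressible in the distributed setting on the next slides.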
32. SGD vs. SIGNSGD THEORY
Notation: gradient vector g; Lipschitz scalar L (curvature); variance scalar σ²; and their per-coordinate counterparts, the gradient, Lipschitz, and variance densities.
• SignSGD wins: the function is mostly flat (sparse curvature vector).
• SGD wins: the function has sparse gradients.
33. DISTRIBUTED SIGNSGD: MAJORITY VOTE THEORY
If gradients are unimodal and symmetric (reasonable by the central limit theorem), majority vote with M workers converges at a rate with the same variance reduction as SGD.
34. TAKE-AWAYS FOR SIGNSGD
• Converges even under biased gradients and noise.
• Faster than SGD along sparse curvature directions.
• For distributed training, the same variance reduction as SGD.
• In practice, similar accuracy to Adam but with far less communication.
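A minimal sketch of the majority-vote scheme from the earlier slide: each worker transmits only the sign of its gradient, and the server broadcasts the elementwise majority. The function and its learning rate are illustrative assumptions.

```python
import numpy as np

def majority_vote_step(params, worker_grads, lr=0.1):
    # Each of M workers transmits only sign(g) -- one bit per
    # coordinate. The server takes an elementwise majority vote and
    # broadcasts a single sign vector for all workers to apply.
    signs = np.sign(worker_grads)        # shape (M, d)
    vote = np.sign(signs.sum(axis=0))    # elementwise majority
    return params - lr * vote
```

Communication is compressed in both directions: workers upload 1 bit per coordinate, and the server's broadcast is itself a sign vector.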
36. INTUITIONS
Figure: ground truth vs. model outputs. Only the correct category contributes to the cross-entropy loss function (the ground-truth baseline).
37. KNOWLEDGE DISTILLATION
Figure: teacher outputs vs. student outputs.
• All categories in the teacher output contribute to the loss.
• If the teacher is highly accurate and confident, its output is similar to the original labels.
What is the role of the wrong categories?
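A minimal sketch of the distillation loss described above, with NumPy and an assumed temperature parameter: the student is trained against the teacher's softened distribution, so every category contributes to the gradient, not just the correct one.

```python
import numpy as np

def softmax(logits, T=1.0):
    # Temperature-scaled softmax; higher T spreads probability mass
    # over the wrong categories (the "dark knowledge").
    z = logits / T
    e = np.exp(z - z.max())
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, T=2.0):
    # Cross-entropy of the student against the teacher's softened
    # distribution: all categories contribute to the loss.
    p_teacher = softmax(teacher_logits, T)
    log_p_student = np.log(softmax(student_logits, T))
    return -np.sum(p_teacher * log_p_student)
```

The loss is minimized when the student's softened distribution matches the teacher's, so the relative probabilities the teacher assigns to wrong categories are exactly what gets transferred.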
38. CONFIDENCE WEIGHTED BY TEACHER MAX (CWTM)
Figure: ground truth vs. student outputs, with the loss weighted by teacher confidence (high- vs. low-confidence teacher). The contribution of the max category is isolated.
39. DARK KNOWLEDGE WITH PERMUTED PREDICTIONS (DKPP)
Figure: permuted teacher outputs vs. student outputs.
• Permutation destroys pairwise correlations.
• But some higher-order information is invariant.
41. RESULTS ON CIFAR-100
• Best results on CIFAR-100 using simple KD on DenseNet (Born-Again Networks, BAN).
• Multiple BAN stages helped for smaller networks.
• Similar behavior in other experiments: ResNet to DenseNet, Penn Treebank.
• CWTM is worse than DKPP: information about higher-order moments helps.
Test accuracy on DenseNet. Born again: the student surpasses the teacher!
42. TENSORS FOR LEARNING IN MANY DIMENSIONS: BEYOND THE 2D WORLD
Modern data is inherently multi-dimensional.
43. UNSUPERVISED LEARNING OF TOPIC MODELS THROUGH TENSOR METHODS
Extracting topics (e.g., Justice, Education, Sports) from documents.
44. TENSOR-BASED LDA TRAINING IS FASTER
Figure: training time (minutes) vs. number of topics, spectral (tensor) method vs. Mallet:
• NYTimes (300,000 documents): 22x faster on average.
• PubMed (8 million documents): 12x faster on average.
• Mallet is an open-source framework for topic modeling.
• Benchmarks on the AWS SageMaker platform.
• Built into the AWS Comprehend NLP service.
50. TENSORLY: HIGH-LEVEL API FOR TENSOR ALGEBRA
• Python programming
• User-friendly API
• Multiple backends: flexible + scalable
• Example notebooks in the repository
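To make "tensor algebra" concrete, here is a NumPy sketch of mode-n unfolding, one of the basic operations a tensor-algebra API like TensorLy wraps; the helper name and example tensor are illustrative, not TensorLy's own code.

```python
import numpy as np

def unfold(tensor, mode):
    # Mode-n unfolding: arrange the mode-`mode` fibers of the tensor
    # as columns of a matrix of shape (tensor.shape[mode], -1), so
    # matrix algorithms can be applied along any single mode.
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

T = np.arange(24).reshape(2, 3, 4)  # a small 3-way tensor
```

Unfoldings are the bridge between tensors and matrix methods: decompositions such as CP and Tucker are computed by alternating matrix operations over the different unfoldings.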
51. INFRASTRUCTURE AND FRAMEWORKS ON AWS
Application Services
• Vision: Rekognition Image, Rekognition Video
• Speech: Polly, Transcribe
• Language: Lex, Translate, Comprehend
Platform Services
• Amazon Machine Learning, Mechanical Turk, Spark & EMR, Amazon SageMaker, AWS DeepLens
Frameworks
• Apache MXNet, PyTorch, Cognitive Toolkit, Keras, Caffe2 & Caffe, TensorFlow, Gluon
Infrastructure
• AWS Deep Learning AMI
• GPU (P3 instances), CPU, Mobile, IoT (Greengrass)
55. CONCLUSION
• DATA
• Collection: active learning and partial feedback
• Aggregation: crowdsourcing models
• Augmentation: graphics rendering + GANs, symbolic expressions
• ALGORITHMS
• Convergence: SignSGD has good rates in theory and practice
• Scalability: SignSGD has the same variance reduction as SGD in the multi-machine setting
• Multi-dimensionality: tensor algebra for neural networks and probabilistic models
• INFRASTRUCTURE
• Frameworks: TensorLy is a high-level API for deep tensorized networks
• Platform: SageMaker is an ML platform with automatic parallelization
AI needs integration of data, algorithms, and infrastructure.