AI at scale requires a perfect storm of data, algorithms, and cloud infrastructure. Modern deep learning needs large amounts of training data, so we develop methods that improve data collection, aggregation, and augmentation, drawing on active learning, partial feedback, crowdsourcing, and generative models. We analyze large-scale machine learning methods for distributed training and show that gradient quantization can yield the best of both worlds: accuracy and communication efficiency. We extend matrix methods to higher dimensions using tensor-algebraic techniques and show superior performance. Finally, at AWS, we are developing robust software frameworks and AI cloud services at all levels of the stack.
9. RESULTS
NER task on the largest open benchmark (OntoNotes)
Active learning heuristics:
• Least confidence (LC)
• Maximum normalized log-probability (MNLP)
Deep active learning matches:
• SOTA with just 25% of the data on English, 30% on Chinese.
• The best shallow model (trained on full data) with 12% of the data on English, 17% on Chinese.
Figure: Test F1 score vs. percent of words annotated, for English and Chinese panels, comparing MNLP, LC, and RAND against the best deep and best shallow models.
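A minimal sketch of the two uncertainty heuristics named above, for sentences scored by per-token log-probabilities. The function names and the ranking helper are illustrative, not the paper's implementation.

```python
import numpy as np

def least_confidence(token_logprobs):
    # LC: 1 - probability of the most likely sequence, where the
    # sequence log-probability is the sum of per-token log-probs.
    return 1.0 - np.exp(np.sum(token_logprobs))

def mnlp(token_logprobs):
    # MNLP: negative length-normalized log-probability, so long
    # sentences are not unfairly favored; higher = more uncertain.
    return -np.mean(token_logprobs)

def rank_by_uncertainty(batch, score_fn):
    # Indices of unlabeled sentences, most uncertain first; the top
    # of this ranking is sent for annotation.
    return sorted(range(len(batch)),
                  key=lambda i: score_fn(batch[i]), reverse=True)
```

Length normalization is the whole difference between the two: under LC, long sentences almost always look uncertain simply because their sequence probability is a product of many terms.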
10. TAKE-AWAY
• Uncertainty sampling works. Normalizing for length helps in the low-data regime.
• With active learning, deep beats shallow even in the low-data regime.
• With active learning, SOTA is achieved with far fewer samples.
11. ACTIVE LEARNING WITH PARTIAL FEEDBACK
Figure: images are labeled through binary questions (e.g., "dog?") that partition the classes into dog vs. non-dog, yielding partial labels.
• Hierarchical class labeling: labor is proportional to the number of binary questions asked.
• Can we actively pick informative questions?
12. RESULTS ON TINY IMAGENET (100K SAMPLES)
• Yields 8% higher accuracy at 30% of the questions (w.r.t. Uniform).
• Obtains full annotation with 40% fewer binary questions.
Methods compared:
• ALPF-ERC: active data, active questions
• Uniform: inactive data, inactive questions
• AL-ME: active data, inactive questions
• AQ-ERC: inactive data, active questions
Figure: accuracy vs. number of questions for Uniform, AL-ME, AQ-ERC, and ALPF-ERC (−40% questions, +8% accuracy).
13. TWO TAKE-AWAYS
• Don’t annotate from scratch: select questions actively based on the learned model.
• Don’t sleep on partial labels: re-train the model from partial labels.
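A minimal sketch of the expected-remaining-classes idea behind active question selection: given the model's class probabilities, ask the binary question that leaves the fewest candidate classes in expectation. The function names and the greedy selection are illustrative assumptions, not the paper's implementation.

```python
def expected_remaining_classes(probs, question):
    # probs: model's class probabilities; question: set of class
    # indices meaning "is the true label in this set?".
    p_yes = sum(probs[c] for c in question)
    # Expected number of candidate classes left after the answer.
    return (p_yes * len(question)
            + (1 - p_yes) * (len(probs) - len(question)))

def pick_question(probs, questions):
    # Greedy ERC-style heuristic (sketch): minimize the expected
    # number of classes still consistent with the partial label.
    return min(questions, key=lambda q: expected_remaining_classes(probs, q))
```

If the model is confident in one class, asking directly about that class is cheap in expectation; a vague model is better served by questions that split the probability mass evenly.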
14. CROWDSOURCING: AGGREGATION OF CROWD ANNOTATIONS
Majority rule
• Simple and common.
• Wasteful: ignores the varying quality of different annotators.
15. CROWDSOURCING: AGGREGATION OF CROWD ANNOTATIONS
Annotator-quality models
• Can improve accuracy.
• Hard: quality must be estimated without ground truth.
16. SOME INTUITIONS
Majority rule to estimate annotator quality (probability of correctness):
• Justification: majority rule approaches the ground truth when there are enough workers.
• Downside: requires a large number of annotations per example for majority rule to be correct.
17. OUR APPROACH
• Use the ML model being trained as a proxy for the ground truth.
• Use the annotator-quality model to train the ML model with a weighted loss.
• Justification: as iterations proceed, the ML model agrees more with the correct workers.
• Works with a single annotation per sample!
Figure: an annotator-quality model (probability of correctness) coupled with an ML model that estimates the ground truth.
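A minimal sketch of the quality-estimation step above, with hypothetical array inputs: each worker's quality is taken to be their agreement rate with the current model's predictions, which then weight the training loss. This is an illustrative skeleton, not the paper's estimator.

```python
import numpy as np

def estimate_worker_quality(model_preds, annotations, workers, n_workers):
    # Quality of each worker = fraction of their labels that agree
    # with the current model's predictions (model used as a
    # ground-truth proxy, per the approach above).
    quality = np.ones(n_workers)
    for w in range(n_workers):
        mask = workers == w
        if mask.any():
            quality[w] = (annotations[mask] == model_preds[mask]).mean()
    return quality
```

Each training round would then weight every sample's loss by its annotator's estimated quality, so labels from unreliable workers pull the model less.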
18. LABELING ONCE IS OPTIMAL: BOTH IN THEORY AND PRACTICE
MS-COCO dataset, fixed budget of 35k annotations: +5% accuracy w.r.t. majority rule (plotted against the number of workers per sample).
Theorem: under a fixed budget, generalization error is minimized with a single annotation per sample.
Assumptions:
• The best predictor is accurate enough (under no label noise).
• Simplified case: all workers have the same quality.
• Probability of being correct > 83%.
19. DATA AUGMENTATION
• To improve generalization, augment the training data.
• Mostly used in computer vision, with simple techniques: rotation, cropping, etc.
• Can we do better with more sophisticated approaches?
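For concreteness, a minimal sketch of the simple label-preserving augmentations mentioned above (flip and crop) on a 2D array; the function and its parameters are illustrative.

```python
import numpy as np

def augment(image, rng):
    # Two simple label-preserving augmentations: a random horizontal
    # flip, then a random crop taken from a reflect-padded copy so
    # the output keeps the input's shape.
    if rng.random() < 0.5:
        image = image[:, ::-1]
    padded = np.pad(image, ((2, 2), (2, 2)), mode="reflect")
    y, x = rng.integers(0, 5, size=2)
    return padded[y:y + image.shape[0], x:x + image.shape[1]]
```

Such transforms multiply the effective dataset without new labels, which is exactly the gap the generative approaches on the next slides try to push further.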
20. DATA AUGMENTATION
GAN-generated data
• Merits: captures the statistics of natural images; learnable.
• Perils: quality of generated images is not high; introduces artifacts.
Synthetic (rendered) data
• Merits: high-quality rendering; full annotation for free; can generate infinite data.
• Perils: domain mismatch; rendered for visual appeal, not for classification.
Our GAN-based framework, Mr.GANs, narrows the gap between synthetic and real data.
21. MIXED-REALITY GENERATIVE ADVERSARIAL NETWORKS (MR-GAN)
Improvement over training on real data alone:
• Real + synthetic: 5.43% (Stage 0)
• Real + CycleGAN on synthetic: 5.09% (Stage 1)
• Real + Mr.GAN on synthetic: 8.85% (Stage 2)
CIFAR-DA 4: ~1 million synthetic images from 3D models, ~0.25% real images from CIFAR-100, 4 classes.
Each stage of Mr.GANs maps the synthetic distribution closer to the real one:
D(P(real-2) || P(synt-2)) < D(P(real-1) || P(synt-1)) < D(P(real) || P(synt))
22. DATA AUGMENTATION 2: SYMBOLIC EXPRESSIONS
Goal: learn a domain of functions (sin, cos, log, add, …).
• Training on numerical input-output pairs alone does not generalize.
Data augmentation with symbolic expressions:
• Efficiently encodes relationships between functions.
Solution:
• Design networks that use both symbolic and numerical data.
23. ARCHITECTURE: TREE LSTM
Examples: sin²(x) + cos²(x) = 1 (symbolic); sin(−2.5) ≈ −0.6 (numerical).
• Symbolic expression trees and function-evaluation trees.
• Decimal trees: encode numbers in decimal representation (numerical), e.g., a decimal tree for 2.5.
• Can encode any expression, function evaluation, and number.
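A toy stand-in for the symbolic side of the input, to make the tree encoding concrete: leaves are numbers or the variable "x", internal nodes apply a named function. This is an illustrative evaluator, not the Tree-LSTM itself.

```python
import math

# Tiny expression trees: ("op", child, ...) tuples, "x", or numbers.
OPS = {"add": lambda a, b: a + b,
       "mul": lambda a, b: a * b,
       "sin": math.sin,
       "cos": math.cos}

def evaluate(node, x):
    # Recursively evaluate an expression tree at a numeric point x.
    if node == "x":
        return x
    if isinstance(node, (int, float)):
        return float(node)
    op, *children = node
    return OPS[op](*(evaluate(c, x) for c in children))

# The identity from the slide, sin^2(x) + cos^2(x) = 1, as a tree.
IDENTITY = ("add",
            ("mul", ("sin", "x"), ("sin", "x")),
            ("mul", ("cos", "x"), ("cos", "x")))
```

The same tree structure serves both tasks: fed symbolically it encodes relationships between functions, and evaluated numerically it produces input-output training pairs.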
24. RESULTS
• Vastly improved numerical evaluation: 90% over the function-fitting baseline.
• Generalization to verifying symbolic equations of greater depth.
• Combining symbolic + numerical data yields better generalization on both tasks: symbolic verification and numerical evaluation.
Accuracy: LSTM (symbolic): 76.40% | TreeLSTM (symbolic): 93.27% | TreeLSTM (symbolic + numeric): 96.17%
26. OPTIMIZATION IN DEEP LEARNING
• Data
• Deep, layered model with lots of parameters
• Descend the gradient of the objective function
27. THE DEEP LEARNING COOKBOOK
• Data → randomly augmented data
• Deep, layered model with lots of parameters → with {insert arbitrary} regularizer applied; dropped out / batch normalized / bypassed
• Descend the gradient of the objective function → running average of rescaled, clipped stochastic gradients
29. DISTRIBUTED TRAINING INVOLVES COMPUTATION & COMMUNICATION
Figure: a parameter server coordinates GPU 1 and GPU 2, each training on half the data; the gradients exchanged in each direction could be compressed.
30. DISTRIBUTED TRAINING BY MAJORITY VOTE
Figure: each of GPUs 1-3 sends sign(g) to the parameter server, which broadcasts back the elementwise majority vote sign[sum(sign(g))].
31. SIGN QUANTIZATION COMPETITIVE ON IMAGENET
Communication gains without performance loss:
• Signum: sign of the momentum.
• Accuracy of Signum is similar to that of Adam.
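A minimal sketch of the Signum update just described: keep an exponential moving average of gradients and step using only its sign. The learning rate and momentum constant here are illustrative defaults, not the paper's tuned values.

```python
import numpy as np

def signum_step(params, momentum, grad, lr=1e-3, beta=0.9):
    # Signum: update the momentum (EMA of gradients), then move each
    # parameter by a fixed step in the direction of the momentum's
    # sign only -- magnitudes are discarded.
    momentum = beta * momentum + (1 - beta) * grad
    return params - lr * np.sign(momentum), momentum
```

Discarding magnitudes is what makes the update 1-bit-per-coordinate compressible in the distributed setting on the next slides.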
32. SGD vs. SIGNSGD THEORY
Notation: gradient vector g; Lipschitz scalar L (curvature); variance scalar σ²; and their per-coordinate counterparts, the gradient, Lipschitz, and variance densities.
• SignSGD wins: the function is mostly flat (sparse curvature vector).
• SGD wins: the function has sparse gradients.
33. DISTRIBUTED SIGNSGD: MAJORITY VOTE THEORY
If gradients are unimodal and symmetric (reasonable by the central limit theorem), majority vote with M workers converges at a rate with the same variance reduction as SGD.
34. TAKE-AWAYS FOR SIGNSGD
• Converges even under biased gradients and noise.
• Faster than SGD along sparse curvature directions.
• For distributed training, the same variance reduction as SGD.
• In practice, similar accuracy to Adam but with far less communication.
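A minimal sketch of the majority-vote scheme from the earlier slide: each worker transmits only the sign of its gradient, and the server broadcasts the elementwise majority. The function and its learning rate are illustrative assumptions.

```python
import numpy as np

def majority_vote_step(params, worker_grads, lr=0.1):
    # Each of M workers transmits only sign(g) -- one bit per
    # coordinate. The server takes an elementwise majority vote and
    # broadcasts a single sign vector for all workers to apply.
    signs = np.sign(worker_grads)        # shape (M, d)
    vote = np.sign(signs.sum(axis=0))    # elementwise majority
    return params - lr * vote
```

Communication is compressed in both directions: workers upload 1 bit per coordinate, and the server's broadcast is itself a sign vector.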
36. INTUITIONS
Figure: ground truth vs. model outputs. Only the correct category contributes to the cross-entropy loss function (the ground-truth baseline).
37. KNOWLEDGE DISTILLATION
Figure: teacher outputs vs. student outputs.
• All categories in the teacher output contribute to the loss.
• If the teacher is highly accurate and confident, its output is similar to the original labels.
What is the role of the wrong categories?
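A minimal sketch of the distillation loss described above, with NumPy and an assumed temperature parameter: the student is trained against the teacher's softened distribution, so every category contributes to the gradient, not just the correct one.

```python
import numpy as np

def softmax(logits, T=1.0):
    # Temperature-scaled softmax; higher T spreads probability mass
    # over the wrong categories (the "dark knowledge").
    z = logits / T
    e = np.exp(z - z.max())
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, T=2.0):
    # Cross-entropy of the student against the teacher's softened
    # distribution: all categories contribute to the loss.
    p_teacher = softmax(teacher_logits, T)
    log_p_student = np.log(softmax(student_logits, T))
    return -np.sum(p_teacher * log_p_student)
```

The loss is minimized when the student's softened distribution matches the teacher's, so the relative probabilities the teacher assigns to wrong categories are exactly what gets transferred.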
38. CONFIDENCE WEIGHTED BY TEACHER MAX (CWTM)
Figure: ground truth vs. student outputs, with the loss weighted by teacher confidence (high- vs. low-confidence teacher). The contribution of the max category is isolated.
39. DARK KNOWLEDGE WITH PERMUTED PREDICTIONS (DKPP)
Figure: permuted teacher outputs vs. student outputs.
• Permutation destroys pairwise correlations.
• But some higher-order information is invariant.
41. RESULTS ON CIFAR-100
• Best results on CIFAR-100 using simple KD on DenseNet (Born-Again Networks, BAN).
• Multiple BAN stages helped for smaller networks.
• Similar behavior in other experiments: ResNet to DenseNet, Penn Treebank.
• CWTM is worse than DKPP: information about higher-order moments helps.
Test accuracy on DenseNet. Born again: the student surpasses the teacher!
42. TENSORS FOR LEARNING IN MANY DIMENSIONS: BEYOND THE 2D WORLD
Modern data is inherently multi-dimensional.
43. UNSUPERVISED LEARNING OF TOPIC MODELS THROUGH TENSOR METHODS
Extracting topics (e.g., Justice, Education, Sports) from documents.
44. TENSOR-BASED LDA TRAINING IS FASTER
Figure: training time (minutes) vs. number of topics, spectral (tensor) method vs. Mallet:
• NYTimes (300,000 documents): 22x faster on average.
• PubMed (8 million documents): 12x faster on average.
• Mallet is an open-source framework for topic modeling.
• Benchmarks on the AWS SageMaker platform.
• Built into the AWS Comprehend NLP service.
50. TENSORLY: HIGH-LEVEL API FOR TENSOR ALGEBRA
• Python programming
• User-friendly API
• Multiple backends: flexible + scalable
• Example notebooks in the repository
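To make "tensor algebra" concrete, here is a NumPy sketch of mode-n unfolding, one of the basic operations a tensor-algebra API like TensorLy wraps; the helper name and example tensor are illustrative, not TensorLy's own code.

```python
import numpy as np

def unfold(tensor, mode):
    # Mode-n unfolding: arrange the mode-`mode` fibers of the tensor
    # as columns of a matrix of shape (tensor.shape[mode], -1), so
    # matrix algorithms can be applied along any single mode.
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

T = np.arange(24).reshape(2, 3, 4)  # a small 3-way tensor
```

Unfoldings are the bridge between tensors and matrix methods: decompositions such as CP and Tucker are computed by alternating matrix operations over the different unfoldings.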
51. INFRASTRUCTURE AND FRAMEWORKS ON AWS
Application Services
• Vision: Rekognition Image, Rekognition Video
• Speech: Polly, Transcribe
• Language: Lex, Translate, Comprehend
Platform Services
• Amazon Machine Learning, Mechanical Turk, Spark & EMR, Amazon SageMaker, AWS DeepLens
Frameworks
• Apache MXNet, PyTorch, Cognitive Toolkit, Keras, Caffe2 & Caffe, TensorFlow, Gluon
Infrastructure
• AWS Deep Learning AMI
• GPU (P3 instances), CPU, Mobile, IoT (Greengrass)
55. CONCLUSION
• DATA
• Collection: active learning and partial feedback
• Aggregation: crowdsourcing models
• Augmentation: graphics rendering + GANs, symbolic expressions
• ALGORITHMS
• Convergence: SignSGD has good rates in theory and practice
• Scalability: SignSGD has the same variance reduction as SGD in the multi-machine setting
• Multi-dimensionality: tensor algebra for neural networks and probabilistic models
• INFRASTRUCTURE
• Frameworks: TensorLy is a high-level API for deep tensorized networks
• Platform: SageMaker is an ML platform with automatic parallelization
AI needs integration of data, algorithms, and infrastructure.