ADVANCES IN TRINITY OF AI:
DATA, ALGORITHMS & CLOUD
Anima Anandkumar
Principal Scientist at Amazon AI
Bren Professor at Caltech
TRINITY FUELING ARTIFICIAL INTELLIGENCE
DATA
• COLLECTION
• AGGREGATION
• AUGMENTATION
ALGORITHMS
• OPTIMIZATION
• SCALABILITY
• MULTI-DIMENSIONALITY
INFRASTRUCTURE: ML STACK ON CLOUD
• APPLICATION SERVICES
• ML PLATFORM
• ML FRAMEWORKS
DATA
• COLLECTION: ACTIVE LEARNING, PARTIAL LABELS…
• AGGREGATION: CROWDSOURCING MODELS…
• AUGMENTATION: GENERATIVE MODELS, SYMBOLIC EXPRESSIONS…
ACTIVE LEARNING
[Diagram] A small pool of labeled data and a large pool of unlabeled data.
Goal
• Reach SOTA with a smaller labeled dataset
• Active learning is well analyzed in theory
• In practice, applied only to small classical models
Can it work at scale with deep learning?
TASK: NAMED ENTITY RECOGNITION
RESULTS
NER task on the largest open benchmark (OntoNotes)
Active learning heuristics (sketch below):
• Least confidence (LC)
• Max. normalized log-probability (MNLP)
Deep active learning matches:
• SOTA with just 25% of the data on English, 30% on Chinese.
• Best shallow model (on full data) with 12% of the data on English, 17% on Chinese.
[Plots] Test F1 score vs. percent of words annotated, for English and Chinese: MNLP, LC, and RAND curves compared against the best deep and best shallow models.
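A minimal sketch of the two acquisition scores, assuming a tagger that returns per-token log-probabilities for its best labeling (the `log_probs` arrays below are illustrative):

```python
import numpy as np

def least_confidence(log_probs):
    # LC: 1 minus the probability of the model's most likely labeling.
    # Long sentences accumulate more log-probability mass, so LC tends
    # to prefer them.
    return 1.0 - np.exp(np.sum(log_probs))

def mnlp(log_probs):
    # MNLP: negative log-probability normalized by sentence length,
    # so uncertainty is comparable across short and long sentences.
    return -np.mean(log_probs)

# Rank unlabeled sentences, most uncertain first, and send the top ones
# to annotators.
unlabeled = {"sent-1": np.log([0.9, 0.8, 0.95]), "sent-2": np.log([0.6, 0.5])}
ranked = sorted(unlabeled, key=lambda s: mnlp(unlabeled[s]), reverse=True)
```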
TAKE-AWAY
• Uncertainty sampling works. Normalizing for sentence length helps in the low-data regime.
• With active learning, deep beats shallow even in the low-data regime.
• With active learning, SOTA is achieved with far fewer samples.
ACTIVE LEARNING WITH PARTIAL FEEDBACK
[Diagram] Images flow through binary questions ("dog?") into partial labels (dog / non-dog).
• Hierarchical class labeling: labor is proportional to the number of binary questions asked
• Actively pick informative questions? (A selection sketch follows.)
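One concrete way to score candidate questions: a sketch of expected-entropy-reduction selection (not necessarily the exact ALPF criterion; `posterior` and the question subsets are illustrative):

```python
import numpy as np

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def expected_remaining_entropy(posterior, subset):
    # Ask "is the label in `subset`?" and compute the expected entropy
    # of the class posterior after hearing yes or no.
    p_yes = posterior[subset].sum()
    if p_yes <= 0.0 or p_yes >= 1.0:
        return entropy(posterior)  # question gives no information
    yes = np.zeros_like(posterior)
    yes[subset] = posterior[subset] / p_yes
    no = posterior.copy()
    no[subset] = 0.0
    no /= 1.0 - p_yes
    return p_yes * entropy(yes) + (1.0 - p_yes) * entropy(no)

posterior = np.array([0.5, 0.3, 0.1, 0.1])    # model's current belief
questions = {"dog?": [0, 1], "husky?": [0]}   # class subset per question
best = min(questions, key=lambda q: expected_remaining_entropy(posterior, questions[q]))
```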
RESULTS ON TINY IMAGENET (100K SAMPLES)
• Yields 8% higher accuracy at 30% of the questions (vs. Uniform)
• Obtains full annotation with 40% fewer binary questions
Methods compared:
• Uniform: inactive data, inactive questions
• AL-ME: active data, inactive questions
• AQ-ERC: inactive data, active questions
• ALPF-ERC: active data, active questions
[Plot] Accuracy vs. # of questions for Uniform, AL-ME, AQ-ERC, and ALPF-ERC (annotations: −40% questions, +8% accuracy).
TWO TAKE-AWAYS
• Don’t annotate from scratch: select questions actively based on the learned model.
• Don’t sleep on partial labels: re-train the model from partial labels.
CROWDSOURCING: AGGREGATION OF CROWD ANNOTATIONS
Majority rule
• Simple and common.
• Wasteful: ignores the varying quality of different annotators.
Annotator-quality models
• Can improve accuracy.
• Hard: quality must be estimated without ground truth.
SOME INTUITIONS
Majority rule to estimate annotator quality
• Justification: majority rule approaches the ground truth given enough workers.
• Downside: requires a large number of annotations per example for majority rule to be correct.
[Diagram] Annotator quality model (prob. of correctness)
OUR APPROACH
• Use the ML model being trained as a stand-in for ground truth.
• Use the annotator quality model to train the ML model with a weighted loss (sketch below).
• Justification: as iterations proceed, the ML model agrees more with the correct workers.
• Works with a single annotation per sample!
[Diagram] Annotator quality model (prob. of correctness) coupled with the ML model that estimates ground truth.
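A schematic numpy sketch of the alternating estimation, with the model's predictions standing in for ground truth (the `model`, `X`, `y` pieces are hypothetical):

```python
import numpy as np

def estimate_worker_quality(model_preds, labels, worker_ids, n_workers):
    # Each worker's prob. of correctness = agreement rate between their
    # labels and the current model's predictions (the ground-truth proxy).
    quality = np.full(n_workers, 0.5)
    for w in range(n_workers):
        mask = worker_ids == w
        if mask.any():
            quality[w] = (model_preds[mask] == labels[mask]).mean()
    return quality

# Alternate (schematic): retrain with per-example weights given by the
# quality of the single worker who annotated each example.
# for it in range(n_iters):
#     quality = estimate_worker_quality(model.predict(X), y, worker_ids, n_workers)
#     model.fit(X, y, sample_weight=quality[worker_ids])
```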
LABELING ONCE IS OPTIMAL: BOTH IN THEORY AND PRACTICE
[Plot] MS-COCO dataset, fixed budget of 35k annotations: performance vs. no. of workers per example; ~5% gain w.r.t. majority rule.
Theorem: Under a fixed budget, generalization error is minimized with a single annotation per sample.
Assumptions:
• Best predictor is accurate enough (under no label noise).
• Simplified case: all workers have the same quality.
• Prob. of being correct > 83%.
DATA AUGMENTATION
• To improve generalization, augment the training data.
• Mostly used in computer vision, with simple techniques: rotation, cropping, etc. (example below).
• More sophisticated approaches?
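For reference, the simple techniques amount to a few lines in any vision framework; a typical torchvision pipeline (parameter values here are illustrative, not from the talk):

```python
from torchvision import transforms

# Standard label-preserving augmentations: each training image is
# randomly rotated, cropped, and flipped before being fed to the model.
augment = transforms.Compose([
    transforms.RandomRotation(degrees=15),
    transforms.RandomResizedCrop(size=224, scale=(0.8, 1.0)),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ToTensor(),
])
# augmented = augment(pil_image)  # applied per sample during training
```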
DATA AUGMENTATION
GAN
• Merits: captures the statistics of natural images; learnable.
• Perils: quality of generated images is not high; introduces artifacts.
Synthetic data
• Merits: high-quality rendering; full annotation for free; can generate unlimited data.
• Perils: domain mismatch; rendered for visual appeal, not for classification.
Our GAN-based framework, Mr.GANs, narrows the gap between synthetic and real data.
MIXED-REALITY GENERATIVE ADVERSARIAL NETWORKS (MR-GAN)
Improvement over training on real data:
• Real + Synthetic: 5.43% (Stage 0)
• Real + CycleGAN on synthetic: 5.09% (Stage 1)
• Real + Mr.GAN on synthetic: 8.85% (Stage 2)
CIFAR-DA-4: ~1 million synthetic images from 3D models, ~0.25% real images from CIFAR-100, 4 classes.
Stage 1: mapping at stage 1 of Mr.GANs
Stage 2: mapping at stage 2 of Mr.GANs
D(P(real-2) || P(synt-2)) < D(P(real-1) || P(synt-1)) < D(P(real) || P(synt))
DATA AUGMENTATION 2: SYMBOLIC EXPRESSIONS
Goal: Learn a domain of functions (sin, cos, log, add…)
• Training on numerical input-output does not generalize.
Data Augmentation with Symbolic Expressions
• Efficiently encode relationships between functions.
Solution:
• Design networks that use both symbolic and numerical data
ARCHITECTURE: TREE LSTM
sin²(θ) + cos²(θ) = 1    sin(−2.5) ≈ −0.6
• Symbolic expression trees; function evaluation trees.
• Decimal trees: encode numbers in their decimal representation (numerical).
• Can encode any expression, function evaluation, and number.
[Diagram] Decimal tree for 2.5 (evaluation sketch below).
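A toy sketch of evaluating a symbolic expression tree numerically; for simplicity, leaves here hold plain floats rather than the decimal trees used in the work:

```python
import math

class Node:
    def __init__(self, op, *children):
        self.op, self.children = op, children

    def eval(self):
        # Leaves store numbers; internal nodes apply their function
        # to recursively evaluated children.
        if not self.children:
            return self.op
        args = [c.eval() for c in self.children]
        fns = {"add": lambda a, b: a + b, "mul": lambda a, b: a * b,
               "sin": lambda a: math.sin(a), "cos": lambda a: math.cos(a),
               "pow2": lambda a: a * a}
        return fns[self.op](*args)

# sin^2(x) + cos^2(x), evaluated at x = -2.5 (should be 1.0)
x = Node(-2.5)
expr = Node("add", Node("pow2", Node("sin", x)), Node("pow2", Node("cos", x)))
print(expr.eval())            # 1.0 up to floating-point error
print(Node("sin", x).eval())  # ≈ -0.599
```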
RESULTS
• Vastly improved numerical evaluation: 90% over the function-fitting baseline.
• Generalizes to verifying symbolic equations of greater depth.
• Combining symbolic + numerical data gives better generalization on both tasks: symbolic and numerical evaluation.
Accuracy by model:
• LSTM (symbolic): 76.40%
• TreeLSTM (symbolic): 93.27%
• TreeLSTM (symbolic + numeric): 96.17%
ALGORITHMS
• OPTIMIZATION: ANALYSIS OF CONVERGENCE
• SCALABILITY: GRADIENT QUANTIZATION
• MULTI-DIMENSIONALITY: TENSOR ALGEBRA
OPTIMIZATION IN DEEP LEARNING
• Data
• Deep, layered model with lots of parameters
• Descend gradient of objective function
THE DEEP LEARNING COOKBOOK
• Data → randomly augmented data, with {insert arbitrary} regularizer applied
• Deep, layered model with lots of parameters → dropped out / batch normalized / bypassed
• Descend gradient of objective function → running average of rescaled, clipped stochastic gradients
DISTRIBUTED TRAINING INVOLVES COMPUTATION & COMMUNICATION
[Diagram] Parameter server connected to GPU 1 and GPU 2, each training on 1/2 of the data. Compress? (the gradients exchanged in each direction)
DISTRIBUTED TRAINING BY MAJORITY VOTE
[Diagram] Each of GPU 1–3 sends sign(g) to the parameter server; the server broadcasts sign[sum(sign(g))] back (sketch below).
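A numpy sketch of the scheme in the diagram, with gradients compressed to signs in both directions (the worker count and shapes are illustrative):

```python
import numpy as np

def worker_message(grad):
    # Each GPU compresses its stochastic gradient to 1 bit per weight.
    return np.sign(grad)

def server_aggregate(signs):
    # The parameter server takes a majority vote and again sends only
    # signs, so communication is compressed in both directions.
    return np.sign(np.sum(signs, axis=0))

rng = np.random.default_rng(0)
grads = [rng.normal(size=5) for _ in range(3)]   # 3 workers
update = server_aggregate([worker_message(g) for g in grads])
theta = np.zeros(5)
theta -= 0.01 * update   # SGD-style step with the voted sign
```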
SIGN QUANTIZATION COMPETITIVE ON IMAGENET
Communication gain without performance loss
• Signum: sign of the momentum.
• Accuracy of Signum is similar to Adam.
SGD THEORY
Notation:
• g: gradient vector
• L: Lipschitz scalar (curvature)
• σ²: variance scalar
• ~g, ~L, ~σ: the corresponding gradient, Lipschitz, and variance densities
signSGD THEORY
• signSGD wins: function mostly flat (sparse curvature vector).
• SGD wins: function has sparse gradients.
DISTRIBUTED SIGNSGD: MAJORITY VOTE THEORY
• If gradients are unimodal and symmetric (reasonable by the central limit theorem), majority vote with M workers converges at a rate with the same variance reduction as SGD.
TAKE-AWAYS FOR SIGN-SGD
• Convergence even under biased gradients and noise.
• Faster than SGD under sparse curvature directions.
• For distributed training, similar variance reduction as SGD.
• In practice, similar accuracy as Adam but with far less communication.
KNOWLEDGE DISTILLATION
• Knowledge distillation conveys “dark knowledge” from teacher to student
INTUITIONS: GROUND-TRUTH BASELINE
[Diagram] Bar charts: ground truth vs. model outputs (output values) and each category’s contribution to the cross-entropy loss.
Only the correct category contributes to the loss function.
KNOWLEDGE DISTILLATION
[Diagram] Bar charts: teacher outputs vs. student outputs (output values) and their contribution to the cross-entropy loss.
• All categories in the teacher output contribute to the loss (see the loss sketch below).
• If the teacher is highly accurate and confident, its output is similar to the original labels.
What is the role of wrong categories?
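In code, “all categories contribute” is visible directly in the standard distillation loss; a PyTorch sketch (temperature T and mixing weight alpha are conventional knobs, not values from the talk):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, target, T=4.0, alpha=0.9):
    # Soft term: KL divergence against the teacher's softened distribution.
    # Every class with teacher mass contributes, not just the correct one.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard term: usual cross entropy against the ground-truth label.
    hard = F.cross_entropy(student_logits, target)
    return alpha * soft + (1 - alpha) * hard

student_logits = torch.randn(8, 100)
teacher_logits = torch.randn(8, 100)
target = torch.randint(0, 100, (8,))
loss = distillation_loss(student_logits, teacher_logits, target)
```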
[Diagram] Bar charts: ground truth vs. student outputs (output values); loss with a high-confidence teacher vs. a low-confidence teacher.
CONFIDENCE WEIGHTED BY TEACHER MAX
The contribution of the max category is isolated.
[Diagram] Bar charts: permuted teacher outputs vs. student outputs (output values) and their contribution to the cross-entropy loss.
Permutation destroys
pairwise correlations.
But some higher-order
information is invariant.
DARK KNOWLEDGE WITH PERMUTED PREDICTIONS
BORN-AGAIN NETWORKS
Knowledge Distillation between identical neural network architectures
Step 0 → Step 1 → Step 2
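The procedure itself is a short loop; a schematic sketch with hypothetical `make_model` and `train` helpers (not the paper's code):

```python
# Schematic Born-Again training: each generation is a fresh copy of the
# same architecture, trained to match the previous generation's outputs.
def born_again(make_model, train, data, generations=3):
    teacher = train(make_model(), data, teacher=None)  # Step 0: plain training
    models = [teacher]
    for _ in range(generations):
        student = train(make_model(), data, teacher=teacher)  # distill
        models.append(student)
        teacher = student  # the student becomes the next teacher
    return models  # later generations often surpass the original teacher
```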
RESULTS ON CIFAR-100
• Best results on CIFAR-100 using simple KD on DenseNet (BAN).
• Multiple BAN stages helped for smaller networks.
• Similar behavior in other experiments: ResNet to DenseNet, Penn Treebank.
• CWTM worse than DKPP: information about higher-order moments helps.
Test accuracy on DenseNet.
Born Again: the student surpasses the teacher!
TENSORS FOR LEARNING IN MANY DIMENSIONS
Tensors: beyond the 2D world
Modern data is inherently multi-dimensional
UNSUPERVISED LEARNING OF TOPIC MODELS THROUGH TENSOR METHODS
Extracting topics from documents
[Diagram] Example topics: Justice, Education, Sports.
TENSOR-BASED LDA TRAINING IS FASTER
[Charts] Training time (minutes) vs. number of topics, spectral (tensor) method vs. Mallet:
• NYTimes (300,000 documents): 22x faster on average
• PubMed (8 million documents): 12x faster on average
• Mallet is an open-source framework for topic modeling
• Benchmarks run on the AWS SageMaker platform
• Built into the AWS Comprehend NLP service
DEEP NEURAL NETS: TRANSFORMING TENSORS
DEEP TENSORIZED NETWORKS
SPACE SAVING IN DEEP TENSORIZED NETWORKS
TENSORS FOR LONG-TERM FORECASTING
Tensor Train RNNs and LSTMs
Challenges:
• Long-term dependencies
• High-order correlations
• Error propagation
[Plots] Traffic dataset and Climate dataset.
TENSOR LSTM FOR LONG-TERM FORECASTING
TENSORLY: HIGH-LEVEL API FOR TENSOR ALGEBRA
• Python programming
• User-friendly API
• Multiple backends: flexible + scalable
• Example notebooks in the repository (usage sketch below)
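For example, a CP decomposition in recent TensorLy versions runs unchanged on any configured backend (the rank and tensor shape here are illustrative):

```python
import numpy as np
import tensorly as tl
from tensorly.decomposition import parafac

tl.set_backend("numpy")  # also: "pytorch", "tensorflow", ...

X = tl.tensor(np.random.rand(10, 10, 10))
# CP/PARAFAC: factor a 3-way tensor into rank-5 components.
cp = parafac(X, rank=5)
X_hat = tl.cp_to_tensor(cp)
print(tl.norm(X - X_hat) / tl.norm(X))  # relative reconstruction error
```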
INFRASTRUCTURE AND FRAMEWORKS ON AWS
Application Services: Vision (Rekognition Image, Rekognition Video); Speech (Polly, Transcribe); Language (Lex, Translate, Comprehend)
Platform Services: Amazon Machine Learning, Mechanical Turk, Spark & EMR, Amazon SageMaker, AWS DeepLens
Frameworks: Apache MXNet, PyTorch, Cognitive Toolkit, Keras, Caffe2 & Caffe, TensorFlow, Gluon
Infrastructure: AWS Deep Learning AMI; GPU (P3 instances), CPU, Mobile, IoT (Greengrass)
AMAZON SAGEMAKER SOLVES THESE PROBLEMS
BUILD: pre-built notebooks for common problems; built-in, high-performance algorithms
TRAIN: one-click training; hyperparameter optimization
DEPLOY: one-click deployment; fully managed hosting with auto-scaling
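With the SageMaker Python SDK, that build/train/deploy flow reduces to a few calls; a sketch with placeholder image, role, and S3 values (exact parameter names vary across SDK versions):

```python
import sagemaker
from sagemaker.estimator import Estimator

# Placeholders: supply your own container image, IAM role, and S3 data.
est = Estimator(
    image_uri="<training-image-uri>",
    role="<execution-role-arn>",
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    sagemaker_session=sagemaker.Session(),
)
est.fit({"train": "s3://<bucket>/train"})           # one-click training
predictor = est.deploy(initial_instance_count=1,    # one-click deployment
                       instance_type="ml.m5.xlarge")
```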
A New Vision for Autonomy
Center for Autonomous Systems and Technologies
CAST: BRINGING DRONES, ROBOTICS AND AI TOGETHER
CONCLUSION
AI needs the integration of data, algorithms, and infrastructure.
• DATA
• Collection: active learning and partial feedback
• Aggregation: crowdsourcing models
• Augmentation: graphics rendering + GANs, symbolic expressions
• ALGORITHMS
• Convergence: signSGD has good rates in theory and practice
• Scalability: signSGD has the same variance reduction as SGD in the multi-machine setting
• Multi-dimensionality: tensor algebra for neural networks and probabilistic models
• INFRASTRUCTURE
• Frameworks: TensorLy is a high-level API for deep tensorized networks
• Platform: SageMaker is an ML platform with automatic parallelization
COLLABORATORS (LIMITED LIST)
Thank you
