Machine Learning
A managed supervised learning environment for building binary classification, multi-class classification, and regression models. The demos cover three cases. The first uses a dataset of banking customers with demographics to predict the likelihood that a customer will default, using binary classification. The second predicts traffic at a UK bike rental shop using linear regression. The third predicts rainforest soil type using multi-class classification.
Benefits: A managed, on-demand environment for supervised learning algorithms, available as batch processing or as a real-time API.
Spark ML Cluster
Runs Spark on an AWS managed cluster, storing data on HDFS or S3 persistent storage. Modules include MLlib and Zeppelin (a web notebook). The demo builds a movie recommendation engine based on collaborative filtering. The dataset contains 10M ratings provided by GroupLens from the MovieLens website.
Benefits: Fully managed clusters with high availability, scalability, elasticity, and Spot instance pricing.
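To make the idea of collaborative filtering concrete, here is a minimal sketch in plain Python. It is not the Spark MLlib code used in the demo: it swaps in a small matrix factorization trained with stochastic gradient descent, and the ratings, names, and hyperparameters are made up for illustration.

```python
import random

def predict(P, Q, user, item):
    """Predicted rating: dot product of the user's and the item's factor vectors."""
    return sum(p * q for p, q in zip(P[user], Q[item]))

def rmse(P, Q, ratings):
    """Root mean squared error of the predictions over the given ratings."""
    squared = [(r - predict(P, Q, u, i)) ** 2 for u, i, r in ratings]
    return (sum(squared) / len(squared)) ** 0.5

def train_mf(ratings, n_factors=2, lr=0.02, reg=0.02, epochs=200, seed=0):
    """Fit factor vectors to (user, item, rating) triples with SGD."""
    rng = random.Random(seed)  # seeded so repeated runs give the same result
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    P = {u: [rng.uniform(0.1, 0.5) for _ in range(n_factors)] for u in users}
    Q = {i: [rng.uniform(0.1, 0.5) for _ in range(n_factors)] for i in items}
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - predict(P, Q, u, i)
            for k in range(n_factors):
                pu, qi = P[u][k], Q[i][k]
                # Gradient step on squared error with L2 regularization.
                P[u][k] += lr * (err * qi - reg * pu)
                Q[i][k] += lr * (err * pu - reg * qi)
    return P, Q

# Hypothetical toy data, not taken from the MovieLens dataset.
toy_ratings = [
    ("ann", "Alien", 5), ("ann", "Heat", 4),
    ("bob", "Alien", 4), ("bob", "Amelie", 1),
    ("cat", "Amelie", 5), ("cat", "Heat", 2),
    ("dan", "Alien", 5), ("dan", "Amelie", 2),
]

P0, Q0 = train_mf(toy_ratings, epochs=0)  # untrained factors, for comparison
P, Q = train_mf(toy_ratings)

print("RMSE before training:", round(rmse(P0, Q0, toy_ratings), 3))
print("RMSE after training: ", round(rmse(P, Q, toy_ratings), 3))
# Score a pair that was never rated; this is the "recommendation" step.
print("Predicted rating, ann / Amelie:", round(predict(P, Q, "ann", "Amelie"), 2))
```

On a dataset with millions of ratings the same idea would be run on the cluster, which is where a distributed library such as MLlib comes in.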
3. Three types of data-driven development
Retrospective analysis and reporting
Here-and-now real-time processing and dashboards
Predictions to enable smart applications
Amazon Kinesis
Amazon EC2
AWS Lambda
Amazon Redshift
Amazon RDS
Amazon S3
Amazon EMR
4. Three Supported Types of Predictions
Binary Classification
Predict the answer to a Yes/No question
Multi-class classification
Predict the correct category from a list
Regression
Predict the value of a numeric variable
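The three types differ only in what the target column looks like. A rough sketch of that decision follows; the helper `suggest_prediction_type` and the sample values are hypothetical and are not part of any AWS API.

```python
def suggest_prediction_type(target_values):
    """Guess the prediction type from the observed values of the target column."""
    distinct = set(target_values)
    if len(distinct) < 2:
        raise ValueError("need at least two distinct target values")
    if len(distinct) == 2:
        # Yes/No style question.
        return "binary classification"
    # bool is a subclass of int in Python, so exclude it explicitly.
    numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool)
        for v in distinct
    )
    if numeric:
        # Value of a numeric variable.
        return "regression"
    # Category chosen from a list.
    return "multi-class classification"

# Made-up target columns in the spirit of the three demos.
print(suggest_prediction_type(["yes", "no", "no", "yes"]))   # will the customer default?
print(suggest_prediction_type([120, 87, 301, 45]))           # bike rentals per day
print(suggest_prediction_type(["clay", "sand", "loam"]))     # soil type
```

A real dataset needs more care than this heuristic, for example when class labels are stored as numbers.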
5. Smart applications by example
Based on what you know about the user: Will they use your product?
Based on what you know about an order: Is this order fraudulent?
Based on what you know about a news article: What other articles are interesting?
15. Cost of Errors
• Cost of Customer Churn and Acquisition (false negative):
    • Foregone cash flow
    • Advertising costs
    • POS and sign-up admin costs
• Customer Retention Cost (false positive + true positive):
    • Discounts
    • Phone upgrades
    • etc.
16. Financial Outcome of Applying a Model
Prior Churn | Churn Cost | Cost without ML
14.49%      | $500.00    | $72.46

False Negative | True + False Pos | Retention Cost | Cost with ML
4.80%          | 26.40%           | $100.00        | $50.40
• $22.06 of savings per customer
• With 100,000 customers, over $2MM in savings with ML
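The figures above can be reproduced with a few lines of arithmetic. This is a sketch under one assumption: the expected cost per customer is the rate multiplied by the cost, summed over the error types listed on the previous slide. Function and variable names are illustrative. With the rounded 14.49% churn rate the baseline comes out one cent below the slide's $72.46, presumably because the slide used an unrounded rate.

```python
def cost_without_ml(prior_churn, churn_cost):
    """Expected churn cost per customer when no model is applied."""
    return prior_churn * churn_cost

def cost_with_ml(false_negative, flagged, churn_cost, retention_cost):
    """Expected cost per customer when a model flags customers for retention.

    false_negative: share of customers who churn without being flagged
    flagged: share of customers flagged (true positives + false positives)
    """
    return false_negative * churn_cost + flagged * retention_cost

baseline = cost_without_ml(prior_churn=0.1449, churn_cost=500.00)
with_model = cost_with_ml(false_negative=0.0480, flagged=0.2640,
                          churn_cost=500.00, retention_cost=100.00)
savings_per_customer = baseline - with_model
total_savings = savings_per_customer * 100_000

print(f"Cost without ML:      ${baseline:.2f}")
print(f"Cost with ML:         ${with_model:.2f}")
print(f"Savings per customer: ${savings_per_customer:.2f}")
print(f"Savings for 100,000:  ${total_savings:,.0f}")
```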
20. Fraud.net Uses AWS to Quickly, Easily Detect Online Fraud
Fraud.net is the world’s leading crowdsourced fraud prevention platform.
“Amazon Machine Learning helps us reduce complexity and make sense of emerging fraud patterns.”
Oliver Clark, CTO, Fraud.net
• Needed to build and train a larger number of more targeted and precise machine-learning models
• Uses Amazon Machine Learning to provide more than 20 machine-learning models
• Easily builds and trains machine-learning models to effectively detect online payment fraud
• Reduces complexity and makes sense of emerging fraud patterns
• Saves clients $1 million weekly by helping them detect and prevent fraud
21. AdiMap Provides Financial Intelligence at Scale Using AWS
AdiMap is a data science company that combines the disciplines of computer science, statistics, and business.
“Using Amazon Machine Learning, we provide users and customers with financial intelligence at scale.”
Dr. Iddo Drori, Founder and CEO, AdiMap
• Needed to cost-effectively meet compute needs and increase machine learning capabilities.
• Uses Amazon Machine Learning to predict and infer financials.
• Builds predictive models without spending millions on compute resources and hardware.
• Provides scalable financial intelligence.
• Reduces time to market for new products.
23. Why aren’t there more smart applications?
1. Machine learning expertise is rare
2. Building and scaling machine learning technology is hard
3. Closing the gap between models and applications is time-consuming and expensive
25. Amazon EMR
• Managed platform
• MapReduce, Apache Spark, Presto
• Launch a cluster in minutes
• Open source distribution and MapR distribution
• Leverage the elasticity of the cloud
• Baked-in security features
• Pay by the hour and save with Spot
• Flexibility to customize
27. An Example EMR Cluster
Master Node: r3.2xlarge
    NameNode (HDFS), ResourceManager (YARN)
Slave Group - Core: c3.2xlarge
    HDFS (DataNode), YARN (NodeManager)
Slave Group - Task: m3.xlarge
Slave Group - Task: m3.2xlarge (EC2 Spot)
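The layout above could be written down as data along the following lines. The key names follow the instance-group fields accepted by the EMR `RunJobFlow` API (for example through boto3's `run_job_flow`), but nothing here calls AWS. The instance counts, the group names, and the on-demand market for the non-Spot groups are assumptions, since the slide gives only instance types.

```python
# Instance groups for the example cluster. Counts and names are hypothetical.
instance_groups = [
    {"Name": "Master", "InstanceRole": "MASTER", "Market": "ON_DEMAND",
     "InstanceType": "r3.2xlarge", "InstanceCount": 1},
    {"Name": "Core", "InstanceRole": "CORE", "Market": "ON_DEMAND",
     "InstanceType": "c3.2xlarge", "InstanceCount": 2},
    {"Name": "Task (on-demand)", "InstanceRole": "TASK", "Market": "ON_DEMAND",
     "InstanceType": "m3.xlarge", "InstanceCount": 2},
    {"Name": "Task (Spot)", "InstanceRole": "TASK", "Market": "SPOT",
     "InstanceType": "m3.2xlarge", "InstanceCount": 2},
]

def nodes_by_role(groups):
    """Total instance count per role (MASTER, CORE, TASK)."""
    totals = {}
    for group in groups:
        role = group["InstanceRole"]
        totals[role] = totals.get(role, 0) + group["InstanceCount"]
    return totals

def spot_groups(groups):
    """Names of the groups that run on EC2 Spot capacity."""
    return [g["Name"] for g in groups if g["Market"] == "SPOT"]

print(nodes_by_role(instance_groups))
print(spot_groups(instance_groups))
```

Only the core group stores HDFS data, which is why the Spot capacity sits in a task group: losing those nodes costs compute but not data.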
28. Choice of Multiple Instances
CPU: c3 family, cc1.4xlarge, cc2.8xlarge
Memory: m2 family, r3 family
Disk/IO: d2 family, i2 family
General: m1 family, m3 family
Machine Learning
Batch Processing
In-memory (Spark & Presto)
Large HDFS