How to Wrangle Data for Machine Learning on AWS

© 2018 Amazon Web Services, Inc. or its Affiliates. All rights reserved.
May 31, 2018 | 10:00 AM PT

Today’s speakers
• Pratap Ramamurthy, Partner Solutions Architect, Amazon Web Services, Inc.
• David McNamara, Customer Success Manager, Trifacta
• Harrison Lynch, Senior Director of Product Development, Consensus Corporation

Today’s agenda
• An overview of machine learning (ML) solutions offered through AWS and the AWS Partner Network
• Featured AWS Machine Learning Partner: Trifacta
• Case study: Consensus Corporation
• Q&A / Discussion

Learning objectives
• How easy it is to get started with machine learning solutions for data
wrangling on the cloud
• Why automating your data wrangling tasks can lead to greater data
accuracy and more meaningful insights
• How you can reduce your data preparation time by 60% or more
with self-service data wrangling tools built for AWS
• How Consensus Corporation is using Trifacta on AWS to detect fraud

Machine learning on AWS

A long heritage of machine learning at Amazon
• Personalized recommendations
• Inventing entirely new customer experiences
• Fulfillment automation and inventory management
• Drones
• Voice-driven interactions

Our mission:
Put machine learning in the
hands of every developer and
data scientist

Market adoption: $46B market by 2020
• Strong overall appetite for adopting AI
• Top heavy in High Tech due to expertise
• Opportunities exist in Health Care, Education, Retail, and other segments
• 3000+ startups today (up from 100 in 2011)
Source: McKinsey Global Institute, Artificial Intelligence: The Next Digital Frontier.

Amazon Machine Learning Stack
Application Services
• Vision: Amazon Rekognition Image, Amazon Rekognition Video
• Speech: Amazon Polly, Amazon Transcribe
• Language: Amazon Translate, Amazon Comprehend, Amazon Lex
Platform Services
• Amazon SageMaker
• AWS DeepLens
• Amazon Machine Learning
• Apache Spark on Amazon EMR
• Amazon Mechanical Turk
Frameworks & Infrastructure
• AWS Deep Learning AMI
• Frameworks: TensorFlow, Gluon, Apache MXNet, Cognitive Toolkit, Caffe2 & Caffe, PyTorch, Keras
• Infrastructure: GPU (P3 Instances), Mobile, CPU, IoT (AWS Greengrass)

New: Amazon Rekognition Video
Video analysis
• Object and activity detection
• Person tracking
• Face recognition
• Real-time live stream
• Content moderation
• Celebrity recognition

Customers running machine learning on AWS today

The AWS Machine Learning Competency Program
What is the AWS Competency Program?
The AWS Competency Program is designed to highlight APN Partners who have demonstrated technical proficiency and proven customer success in specialized solution areas. Attaining an AWS Competency allows partners to differentiate themselves to customers by showcasing expertise in a specific solution area.

Data wrangling for machine learning on AWS
David McNamara, Customer Success Manager, Trifacta

We believe that our ability to solve big
problems in business and society
depends on seeing patterns in the data
we collect. But data comes in all shapes
and sizes, and too often the messy
process of pulling it together gets in the
way of progress. At Trifacta, we
empower change-makers to work with
diverse and fragmented data—as it’s
being cleaned and refined—so they can
ask more interesting questions and
create a better future.

A global leader in data preparation
#1 Rankings from
Media & Analysts
85+ Global Technology
and SI Partners
#1 in Users with 10,000+ Companies
Enterprise Standard for Data Preparation at
100+ Accounts

Self-service data wrangling: the critical enabler
*Wrangler: Interactive Visual Specification of Data Transformation Scripts,
Heer, Hellerstein, Kandel, Paepcke; Stanford University & University of California, Berkeley (2011)
DATA PLATFORMS
ANALYSIS & CONSUMPTION
80%
“There's the joke that 80 percent of data science is cleaning the data and 20 percent is complaining about cleaning the data”
— Kaggle founder and CEO Anthony Goldbloom

The Solution: Trifacta Data Wrangling Platform
DATA PLATFORMS
DATA WRANGLING ACTIVITIES: Discover, Structure, Clean, Enrich, Validate, Publish
ANALYSIS & CONSUMPTION

Typical AI/ML modeling data pipeline to
empower business self-service

Consensus Corporation:
Improved data wrangling to speed up machine learning model building for anti-fraud software
Harrison Lynch, Senior Director of Product Development, Consensus Corporation

Your speaker
Harrison Lynch
Sr. Director of Product Development

Consensus history
Timeline: 1999, 2007, 2012, 2014, 2018 (company names shown as logos on the slide are not recoverable from the text)
• launches as an online retailer of wireless phones & services
• iPhone launches
• acquires
• acquires LetsTalk.com & rebrands as
• launches Connected Commerce in
• First client launch

Use our wireless capabilities to perform new tricks
Multiplex Bundles
Product & Connected Service Bundles
Loyalty
Subscriptions
$625 Cash or $29 per month
Underwriters, Services, Supply Chain
1. CATCH A TRANSACTION
2. CONNECT
3. COLLECT

Multiplex is about two things
INSIGHTS
• Retailers face increasing price competition leading to downward margin and revenue pressure.
• Consumers face fragmented ‘buying to using’ experiences, and sticker shock on large ticket items.
KEY BENEFITS OF OUR SOLUTION
Get More with Assisted Sales
• Cart Margin Enhancement (CME): Add products and activate services to improve margin for the retailer and deliver a more complete customer experience.
Pay Less with Subscriptions
• Future Proof Subscriptions (FPS): Give guests subscriptions to bundles of the latest products and services with options to pay over time.

Wireless retailing economics
• COGS: $850
• Customer finances 100%
• Carrier pays device cost subsidy
• Carrier pays commission
• Retailer makes money on warranty, accessories
• Carrier takes back commission and device subsidy if the sale is fraudulent
• 1 bad sale wipes out the profit from up to 10 good sales

Current industry practice relies on insufficient credit scoring methods
Will They Pay On Time?
• FICO
• Income
• Payment History
• Length of Employment
• Credit Utilization
• # of Accounts
V.
Identity thieves are trying to steal the best credit scores
Will They Ever Pay?
• Distance from Store
• Basket Composition
• Rep Tenure
• Time of Day
• Number Port from

Extract:
• Start with a robust data set, preferably at least 3 months of data from orders that are at least 120 days old
• This dataset (of orders) should indicate whether the carrier has deactivated the corresponding line (and, if possible, whether the carrier classified the order as fraudulent)
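The extract step can be sketched roughly as follows. This is a minimal illustration, not the Consensus implementation: the field names (`order_date`, `carrier_deactivated`, `carrier_flagged_fraud`) and the labeling rule are assumptions made for the example; only the 120-day age threshold comes from the slide.

```python
from datetime import date, timedelta

MIN_ORDER_AGE_DAYS = 120  # from the slide: orders should be at least 120 days old

def extract_mature_orders(orders, today):
    """Keep only orders old enough for the carrier outcome to be known."""
    cutoff = today - timedelta(days=MIN_ORDER_AGE_DAYS)
    return [o for o in orders if o["order_date"] <= cutoff]

def label_order(order):
    """Hypothetical labeling rule: 1 = bad order, 0 = good order."""
    deactivated = order["carrier_deactivated"]
    flagged = order.get("carrier_flagged_fraud", False)
    return int(deactivated or flagged)

# Toy data; all values are invented for illustration.
orders = [
    {"order_id": "A1", "order_date": date(2018, 1, 5), "carrier_deactivated": False},
    {"order_id": "A2", "order_date": date(2018, 1, 20), "carrier_deactivated": True,
     "carrier_flagged_fraud": True},
    {"order_id": "A3", "order_date": date(2018, 5, 1), "carrier_deactivated": False},
]

mature = extract_mature_orders(orders, today=date(2018, 5, 31))
labels = {o["order_id"]: label_order(o) for o in mature}
```

Order A3 is dropped because it is too recent for the carrier to have reported an outcome.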
Analyze:
• Apply hunches, theories and business knowledge
• This is what’s known as “Ground Truth”
Identify:
• To extract and analyze a set of
characteristics from the set of orders
• These are referred to as “Features”
Augment:
• Identify characteristics and tangential data elements about Features that enhance the model’s usefulness
• If reliable cell phone customers tend to come from particular areas, it may be prudent to model the regions from which customers drive to reach the cell phone store
• If an annual festival increases a store region’s population by 100,000 people and a higher percentage of fraud cases come from this annual period, it may be prudent to cluster purchase orders that are proximate to the festival period
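Two augmentations of this kind might look like the sketch below: a great-circle distance between the customer and the store, and a flag for orders placed near a festival. The function names, the 3-day margin, and the use of the haversine formula are assumptions for illustration; the deck does not say how these features were computed.

```python
import math
from datetime import date, timedelta

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def near_festival(order_date, festival_start, festival_end, margin_days=3):
    """True if the order falls within the festival period plus a margin (assumed)."""
    margin = timedelta(days=margin_days)
    return festival_start - margin <= order_date <= festival_end + margin

# Hypothetical usage: one degree of longitude along the equator.
distance = haversine_km(0.0, 0.0, 0.0, 1.0)
flag = near_festival(date(2018, 7, 2), date(2018, 7, 4), date(2018, 7, 8))
```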
Transform:
• Put the data into a format that makes it easier for data modeling systems to read
• This is usually one or a series of two-dimensional tables with one of the following types of data
  • Continuous: numeric values, like dollar values or distance
  • Binary: 1/0, or a yes/no
  • Categorical: colors (white, black, gold) or a carrier
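As a rough sketch of this transform, the function below turns one order into a flat numeric row covering the three data types. The field names and carrier values are hypothetical, and one-hot encoding is only one common way to handle categorical values; the deck does not specify an encoding.

```python
# Hypothetical category values; a real list would come from the order data.
CARRIERS = ["carrier_a", "carrier_b", "carrier_c"]

def encode_order(order):
    """Encode one order as a numeric row: continuous, binary, then categorical."""
    row = [
        float(order["basket_value_usd"]),  # continuous: dollar value
        float(order["distance_km"]),       # continuous: distance
        1 if order["number_ported"] else 0,  # binary: yes/no as 1/0
    ]
    # Categorical: one column per known carrier (one-hot encoding).
    row.extend(1 if order["carrier"] == c else 0 for c in CARRIERS)
    return row

example_row = encode_order({
    "basket_value_usd": 625,
    "distance_km": 12.5,
    "number_ported": True,
    "carrier": "carrier_b",
})
```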
Split data into two sets:
• Larger set is the training set that feeds into a set of
models; it allows the models to identify good vs bad
orders and to generate correlations.
• Smaller test set is one that the data scientist puts away
for future use against the model
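A minimal sketch of the split, assuming a random 80/20 partition (the deck only says the training set is the larger one; the fraction and the fixed seed are assumptions chosen for reproducibility):

```python
import random

def split_train_test(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of the rows and hold out a smaller test set."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

rows = list(range(100))  # stand-in for encoded orders
train_set, test_set = split_train_test(rows)
```

The test set is then left untouched until the candidate models have been trained.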
Choose:
• Models that would best fit the data that comes through the system
based on the models that have worked in the past
• Put the training set through the set of candidate models, which results in:
  • Risk assessments for each order in the training set
  • Refinement of the candidate models such that they become “pickled models”
Train & Test:
• After the candidate models generate risk assessments
for the training data set, a data scientist pulls the test
set out of a drawer (so to speak)
• The models are then judged for accuracy against data
they’ve never seen - this is how you make sure you’re
not building a model to predict yesterday’s weather
• The data scientist evaluates the candidate models
based upon the comparison.
• The candidate models undergo tuning and refinement
such that they produce the best results possible against
the test data set.
• The data scientist selects the candidate model that
produces the best results against the test data set
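The comparison against held-out data can be sketched as below. The candidate "models" are toy threshold rules and plain accuracy is the metric; both are assumptions for illustration, since the deck names neither the metric nor the actual candidates.

```python
def accuracy(model, rows, labels):
    """Fraction of rows where the model's prediction matches the label."""
    correct = sum(1 for row, y in zip(rows, labels) if model(row) == y)
    return correct / len(rows)

def select_best(candidates, test_rows, test_labels):
    """Score every candidate on the test set and return the best name and all scores."""
    scores = {name: accuracy(model, test_rows, test_labels)
              for name, model in candidates.items()}
    best_name = max(scores, key=scores.get)
    return best_name, scores

# Toy candidates: flag an order as bad (1) when the customer is far from the store.
candidates = {
    "distance_over_50km": lambda row: int(row["distance_km"] > 50),
    "distance_over_100km": lambda row: int(row["distance_km"] > 100),
}
test_rows = [{"distance_km": d} for d in (5, 60, 120, 80)]
test_labels = [0, 1, 1, 1]

best_name, scores = select_best(candidates, test_rows, test_labels)
```

For fraud data, where bad orders are rare, a metric other than accuracy may be a better fit; accuracy is used here only to keep the sketch short.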
Pickle, Promote & Deploy:
• Compare the results for the best candidate model against the model currently in use, if it’s a winner move forward
• Test the performance of the new model in use to ensure that it meets performance standards and promote it to production
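The promote step could be sketched with Python's standard `pickle` module, assuming "pickled models" refers to that kind of serialization. To keep the example self-contained, the "model" here is just a dictionary of parameters for a hypothetical threshold rule; the scores and function names are invented.

```python
import pickle

def promote_if_better(candidate_params, candidate_score, current_score):
    """Serialize the candidate only if it beats the model currently in use."""
    if candidate_score <= current_score:
        return None
    return pickle.dumps(candidate_params)

def load_model(blob):
    """Rebuild a scoring function from pickled parameters.

    Only unpickle data from a trusted source: pickle can execute code on load.
    """
    params = pickle.loads(blob)
    threshold = params["threshold_km"]
    return lambda row: int(row["distance_km"] > threshold)

# Hypothetical scores for the challenger and the production model.
blob = promote_if_better({"threshold_km": 50}, candidate_score=0.92, current_score=0.88)
model = load_model(blob)
prediction = model({"distance_km": 60})
```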
Model building

System for machine learning
1. Orders arrive at the machine learning system via several channels
2. The system parses the data. It is able to do so regardless of the channel of origin
3. The system extracts the data that is most relevant to the order scoring process, transforms the extracted data into a format the model can read & loads the reformatted data into the model
4. The system scores the extracted and reformatted order using the risk scoring model
5. The system determines at random whether an order should be in a control group. It approves those orders immediately and without regard to their score
6. The system scores the extracted and reformatted order using the risk scoring model
7. Based on the results of the rules application, the system routes orders to third-party services and manual review processes as appropriate
8. The system sends a yes/no determination and/or a risk score for the order back to the originating commercial channel
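The scoring, control-group, and routing steps might be wired together roughly as follows. The control rate, the two thresholds, and the decision names are all assumptions for illustration; the deck does not describe the actual rules.

```python
import random

def route_order(order, score_fn, rng, control_rate=0.05,
                review_threshold=0.5, decline_threshold=0.9):
    """Score an order, then approve, review, or decline it (thresholds are hypothetical)."""
    score = score_fn(order)
    # Control group: approved at random, without regard to the score.
    if rng.random() < control_rate:
        return {"decision": "approve", "score": score, "control_group": True}
    if score >= decline_threshold:
        decision = "decline"
    elif score >= review_threshold:
        decision = "manual_review"
    else:
        decision = "approve"
    return {"decision": decision, "score": score, "control_group": False}

rng = random.Random(0)
result = route_order({"order_id": "A1"}, score_fn=lambda o: 0.6, rng=rng, control_rate=0.0)
```

Approving a random control group regardless of score gives the team unbiased outcome data on orders the model would otherwise have blocked.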

The whole universe of statistical models
Conventional
• Many systems claim to use “models”
• What they’re wedded to are linear models
• LMs are great for some things, not for others
• Overfitting is an issue
V.
Consensus
• The universe of models
• Select the one that best fits the job
Gaussian Kernel
Random Forest
Support Vector Machine

Searching for a data wrangling solution
• Join disparate data sources together
  • Carrier reconciliation data
  • Internal order data
  • External data
• Data is messy
  • Shifting date formats
  • Inconsistent data types
  • Shifts in logic across partners
• Reduce reliance on developers/data analysts
  • I’m lousy at SQL
  • Day jobs need to be attended to
• Scaling Discovery
  • Needed to be able to act on hunches
  • Explore ground truth
• Data Prep is a labor of love
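To make the "shifting date formats" and "inconsistent data types" problems concrete, here is a small hand-written sketch of the kind of cleanup involved. The list of formats and the helper names are assumptions; this is the sort of script a wrangling tool is meant to replace, not a description of how Trifacta works.

```python
from datetime import datetime

# Hypothetical set of date formats seen across data sources.
KNOWN_DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%Y%m%d")

def parse_order_date(text):
    """Try each known format in turn; return None if nothing matches."""
    for fmt in KNOWN_DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    return None

def parse_amount(value):
    """Coerce amounts that arrive as numbers or as strings like '$1,250.00'."""
    if isinstance(value, (int, float)):
        return float(value)
    return float(value.replace("$", "").replace(",", "").strip())

dates = [parse_order_date(s) for s in ("2018-05-31", "05/31/2018", "20180531", "n/a")]
amounts = [parse_amount(v) for v in (625, "$625.00", "1,250.5")]
```

Values that match no known format come back as `None` so they can be reviewed rather than silently guessed.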

How Trifacta helps me
• Data is in different places (and sometimes toxic)
  • Black box data
  • Pulling standalone sets of hashed values
  • Script Dev > SRE > Output > back to me
• Making data repairs repeatable
  • Geocode failures
  • Exploration of Census data
• Tracking your changes
  • Showing your work & data provenance

Our results with Trifacta
• Discovery of data: from 2-3 days to less than one day
• Data preparation: from 8 hours to less than one hour

Q & A

Next steps and further information:
Start wrangling today with Trifacta Wrangler Pro in AWS Marketplace:
• aws.amazon.com/marketplace/
• Search for “Trifacta Wrangler Pro”
Learn more about machine learning on AWS:
• aws.amazon.com/machine-learning/featured-partner-solutions/
Try AWS for free:
• aws.amazon.com/free/

Thank you!
