© 2018 Amazon Web Services, Inc. or its Affiliates. All rights reserved.
How to wrangle data for
machine learning on AWS
May 31, 2018 | 10:00 AM PT
Pratap Ramamurthy, Partner Solutions Architect,
Amazon Web Services, Inc.
David McNamara, Customer Success Manager, Trifacta
Harrison Lynch, Senior Director of Product
Development, Consensus Corporation
Today’s speakers
• An overview of machine learning (ML) solutions offered through
AWS and the AWS Partner Network
• Featured AWS Machine Learning Partner: Trifacta
• Case study: Consensus Corporation
• Q&A / Discussion
Today’s agenda
Learning objectives
• How easy it is to get started with machine learning solutions for data
wrangling on the cloud
• Why automating your data wrangling tasks can lead to greater data
accuracy and more meaningful insights
• How you can reduce your data preparation time by 60% or more
with self-service data wrangling tools built for AWS
• How Consensus Corporation is using Trifacta on AWS to detect fraud
Machine learning on AWS
A long heritage of machine learning at Amazon
• Personalized recommendations
• Inventing entirely new customer experiences
• Fulfillment automation and inventory management
• Drones
• Voice-driven interactions
Our mission:
Put machine learning in the
hands of every developer and
data scientist
Source: McKinsey Global Institute, Artificial Intelligence: The Next
Digital Frontier?
• Strong overall appetite for
adopting AI
• Top heavy in High Tech due to
expertise
• Opportunities exist in Health
Care, Education, Retail, and other
segments
• 3000+ startups today (up from 100
in 2011)
Market adoption: $46B market by 2020
Amazon Machine Learning Stack
Application Services
• Vision: Amazon Rekognition Image, Amazon Rekognition Video
• Speech: Amazon Polly, Amazon Transcribe
• Language: Amazon Translate, Amazon Comprehend, Amazon Lex
Platform Services
• Amazon SageMaker
• AWS DeepLens
• Amazon Machine Learning
• Apache Spark on Amazon EMR
• Amazon Mechanical Turk
Frameworks & Infrastructure
• AWS Deep Learning AMI
• Infrastructure: GPU (P3 Instances), CPU, Mobile, IoT (AWS Greengrass)
• Frameworks: TensorFlow, Gluon, Apache MXNet, Cognitive Toolkit, Caffe2 & Caffe, PyTorch, Keras
New: Amazon Rekognition Video
• Object and activity detection
• Person tracking
• Face recognition
• Real-time live stream
• Content moderation
• Celebrity recognition
Video analysis
Customers running machine learning on AWS
today
The AWS Competency Program is designed to
highlight APN Partners who have demonstrated
technical proficiency and proven customer success
in specialized solution areas. Attaining an AWS
Competency allows partners to differentiate
themselves to customers by showcasing expertise in
a specific solution area.
WHAT IS THE AWS COMPETENCY PROGRAM?
The AWS Machine Learning Competency Program
Data wrangling for machine learning on AWS
David McNamara, Customer Success Manager, Trifacta
We believe that our ability to solve big
problems in business and society
depends on seeing patterns in the data
we collect. But data comes in all shapes
and sizes, and too often the messy
process of pulling it together gets in the
way of progress. At Trifacta, we
empower change-makers to work with
diverse and fragmented data—as it’s
being cleaned and refined—so they can
ask more interesting questions and
create a better future.
A global leader in data preparation
#1 Rankings from
Media & Analysts
85+ Global Technology
and SI Partners
#1 in Users with 10,000+ Companies
Enterprise Standard for Data Preparation at
100+ Accounts
Self-service data wrangling: the critical enabler
*Wrangler: Interactive Visual Specification of Data Transformation Scripts.
Kandel, Paepcke, Hellerstein, Heer; Stanford University & University of California, Berkeley (2011)
DATA PLATFORMS
ANALYSIS & CONSUMPTION
80%
“There's the joke that 80 percent of data science
is cleaning the data and 20 percent is
complaining about cleaning the data”
— Kaggle founder and CEO Anthony Goldbloom
DATA PLATFORMS
ANALYSIS & CONSUMPTION
DATA WRANGLING ACTIVITIES
Discover Structure Clean Enrich Validate Publish
The Solution: Trifacta Data Wrangling Platform
Typical AI/ML modeling data pipeline to
empower business self-service
Demo
Consensus Corporation:
Improved data wrangling to increase the
speed of machine learning model building
for anti-fraud software
Harrison Lynch, Senior Director of Product Development,
Consensus Corporation
Your speaker
Harrison Lynch
Sr. Director of Product Development
Consensus history
Consensus / Proprietary & Confidential
Timeline: 1999, 2007, 2012, 2014, 2018
• iPhone launches
• [...] launches as an online retailer of wireless phones & services
• [...] acquires [...]
• [...] acquires LetsTalk.com & rebrands as [...]
• [...] launches Connected Commerce in [...]
• First client launch
Use our wireless capabilities to perform new tricks
Multiplex Bundles
Product & Connected Service Bundles
Loyalty
Subscriptions
$625 Cash or $29 per month
Underwriters Services Supply Chain
3. COLLECT
2. CONNECT
1. CATCH A TRANSACTION
Multiplex is about two things
INSIGHTS
• Retailers face increasing price competition, leading to downward margin and revenue pressure.
• Consumers face fragmented ‘buying to using’ experiences and sticker shock on large-ticket items.
KEY BENEFITS OF OUR SOLUTION
• Get More with Assisted Sales: Cart Margin Enhancement (CME). Add products and activate services to improve margin for the retailer and deliver a more complete customer experience.
• Pay Less with Subscriptions: Future Proof Subscriptions (FPS). Give guests subscriptions to bundles of the latest products and services, with options to pay over time.
• COGS: $850
• Customer finances 100%
• Carrier pays device cost subsidy
• Carrier pays commission
• Retailer makes money on warranty, accessories
• Carrier takes back commission and device subsidy if fraud
• 1 bad sale wipes out the profit from up to 10 good sales
Wireless retailing economics
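The last bullet implies a break-even fraud rate. A minimal sketch of that arithmetic, assuming (as a simplification not stated on the slide) that every good sale earns the same profit and that one bad sale costs exactly the profit of 10 good sales:

```python
# Hypothetical illustration: profit and loss are measured in units of
# "profit from one good sale"; the function name is an assumption.
def breakeven_fraud_rate(good_sales_wiped_out: float) -> float:
    """Fraud rate at which losses from bad sales cancel profit from good sales.

    With fraud rate p and a bad sale costing n units, the expected profit per
    sale is (1 - p) * 1 - p * n, which reaches zero at p = 1 / (1 + n).
    """
    return 1.0 / (1.0 + good_sales_wiped_out)


rate = breakeven_fraud_rate(10)
print(f"Break-even fraud rate: {rate:.1%}")  # prints "Break-even fraud rate: 9.1%"
```

Under these assumptions, a retailer loses money once roughly 1 order in 11 is fraudulent.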
Current industry practice relies on insufficient credit scoring methods
Will They Pay On Time?
• FICO
• Income
• Payment History
• Length of Employment
• Credit Utilization
• # of Accounts
V.
Identity Thieves are trying to steal the
best credit scores
Will They Ever Pay?
• Distance from Store
• Basket Composition
• Rep Tenure
• Time of Day
• Number Port from
Extract:
• Start with a robust data set – preferably at least 3 months of data from orders that
are at least 120 days old
• This dataset (of orders) should indicate whether the carrier has deactivated a
corresponding line (and if possible, whether the carrier classified the order as
fraudulent)
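The Extract step can be sketched as a filter on order age plus a label column. This is a minimal illustration using only the standard library; the field names (`order_date`, `line_deactivated`) and the sample orders are hypothetical, not taken from the Consensus system.

```python
from datetime import date, timedelta


def select_mature_orders(orders, today, min_age_days=120):
    """Keep only orders old enough for the carrier outcome to be known."""
    cutoff = today - timedelta(days=min_age_days)
    return [o for o in orders if o["order_date"] <= cutoff]


# Hypothetical sample orders; real data would come from carrier reconciliation.
orders = [
    {"order_id": "A1", "order_date": date(2018, 1, 5), "line_deactivated": True},
    {"order_id": "A2", "order_date": date(2018, 1, 20), "line_deactivated": False},
    {"order_id": "A3", "order_date": date(2018, 5, 1), "line_deactivated": False},
]

mature = select_mature_orders(orders, today=date(2018, 5, 31))
# The label for each order: 1 if the carrier deactivated the line, else 0.
labels = [int(o["line_deactivated"]) for o in mature]
```

Order A3 is dropped because it is less than 120 days old, so its outcome is not yet reliable.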
Analyze:
• Apply hunches, theories, and business
knowledge
• This is what’s known as “Ground Truth”
Identify:
• Extract and analyze a set of
characteristics from the set of orders
• These are referred to as “Features”
Augment:
• Identify characteristics and tangential
data elements about Features that
enhance the model’s usefulness
• If reliable cell phone customers tend to
come from particular areas, it may be
prudent to model the regions from which
customers drive to reach the cell phone
store
• If an annual festival increases a store region’s
population by 100,000 people and a higher
percentage of fraud cases come from this
annual period, it may be prudent to cluster
purchase orders that are proximate to the
festival period
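The two Augment examples above can be sketched as derived features: distance travelled to the store and proximity to a festival period. The haversine formula is standard; the function names, the 3-day window, and the sample dates are assumptions for illustration only.

```python
import math
from datetime import date, timedelta


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two latitude/longitude points."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


def near_festival(order_date, festival_start, festival_end, window_days=3):
    """True if the order falls within the festival period plus a margin on each side."""
    margin = timedelta(days=window_days)
    return festival_start - margin <= order_date <= festival_end + margin


# Hypothetical usage: one degree of latitude is roughly 111 km.
distance = haversine_km(0.0, 0.0, 1.0, 0.0)
flag = near_festival(date(2018, 7, 3), date(2018, 7, 4), date(2018, 7, 8))
```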
Transform:
• Put the data into a format that makes it easier for data
modeling systems to read
• This is usually one or a series of two-dimensional tables with one of
the following types of data
  • Continuous: numeric values, like dollar values or distance
  • Binary: 1/0, or a yes/no
  • Categorical: colors (white, black, gold) or a carrier
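A minimal sketch of the Transform step, turning one order into a numeric row that covers the three data types. The feature names and the carrier list are hypothetical; one-hot encoding is one common way to represent a categorical value, not necessarily the one used here.

```python
# Hypothetical category list; a real system would derive it from the data.
CARRIERS = ["carrier_a", "carrier_b", "carrier_c"]


def encode_order(order):
    """Encode one order as a flat list of floats for a modeling library."""
    # Continuous: kept as numbers.
    row = [float(order["basket_value"]), float(order["distance_km"])]
    # Binary: yes/no becomes 1/0.
    row.append(1.0 if order["number_ported"] else 0.0)
    # Categorical: one column per category, 1 for the matching carrier.
    row.extend(1.0 if order["carrier"] == c else 0.0 for c in CARRIERS)
    return row


example = encode_order(
    {"basket_value": 625, "distance_km": 12.5, "number_ported": True, "carrier": "carrier_b"}
)
```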
Split data into two sets:
• Larger set is the training set that feeds into a set of
models; it allows the models to identify good vs bad
orders and to generate correlations.
• Smaller test set is one that the data scientist puts away
for future use against the model
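The split can be sketched with the standard library alone. The 80/20 ratio and the fixed seed are assumptions chosen for illustration; the slide says only that the training set is the larger of the two.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of the rows and return (training set, test set)."""
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


train, test = train_test_split(list(range(100)))
```

Every row lands in exactly one of the two sets, so the test set stays unseen during training.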
Choose:
• Models that would best fit the data that comes through the system
based on the models that have worked in the past
• Put the training set through the set of candidate models, which
results in:
• Risk assessments for each order in the training
set
• Refinement of the candidate models such that they
become “pickled models”
Train & Test:
• After the candidate models generate risk assessments
for the training data set, a data scientist pulls the test
set out of a drawer (so to speak)
• The models are then judged for accuracy against data
they’ve never seen - this is how you make sure you’re
not building a model to predict yesterday’s weather
• The data scientist evaluates the candidate models
based upon the comparison.
• The candidate models undergo tuning and refinement
such that they produce the best results possible against
the test data set.
• The data scientist selects the candidate model that
produces the best results against the test data set
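Selecting the winning candidate against the held-out test set can be sketched as below. Accuracy is used as the comparison metric only as an assumption; the toy threshold "models" and the data are hypothetical stand-ins for real trained models.

```python
def accuracy(model, rows, labels):
    """Fraction of test rows for which the model's prediction matches the label."""
    correct = sum(1 for x, y in zip(rows, labels) if model(x) == y)
    return correct / len(labels)


def select_best(candidates, rows, labels):
    """Score every candidate on the test set and return the best name with all scores."""
    scores = {name: accuracy(model, rows, labels) for name, model in candidates.items()}
    best_name = max(scores, key=scores.get)
    return best_name, scores


# Hypothetical test set: one feature per order, label 1 means the order was bad.
test_rows = [[5.0], [40.0], [60.0], [3.0]]
test_labels = [0, 1, 1, 0]
candidates = {
    "threshold_50": lambda x: int(x[0] > 50),
    "threshold_20": lambda x: int(x[0] > 20),
}

best_name, scores = select_best(candidates, test_rows, test_labels)
```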
Pickle, Promote & Deploy:
• Compare the results for the best candidate model against the model currently in use; if it’s a winner, move forward
• Test the performance of the new model in use to ensure that it meets performance standards and promote it to production
Model building
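"Pickling" refers to serializing a trained model so it can be loaded in production. A minimal sketch with Python's `pickle` module, assuming the model can be represented as a plain dictionary of parameters; the parameter names and the scores are hypothetical.

```python
import pickle


def promote_if_better(candidate_params, candidate_score, current_score):
    """Return the pickled candidate if it beats the current model, else None."""
    if candidate_score > current_score:
        return pickle.dumps(candidate_params)
    return None


# Hypothetical scores measured on the same test set for both models.
blob = promote_if_better({"feature_index": 0, "cutoff": 20.0}, 0.94, 0.91)

# In production, the stored bytes are loaded back into an equivalent object.
restored = pickle.loads(blob) if blob is not None else None
```

Only load pickled data from a trusted source, because unpickling can execute arbitrary code.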
System for machine learning
1. Orders arrive at the machine learning system via several channels.
2. The system parses the data. It is able to do so regardless of the channel of origin.
3. The system extracts the data most relevant to the order scoring process, transforms the extracted data into a format the model can read, and loads the reformatted data into the model.
4. The system scores the extracted and reformatted order using the risk scoring model.
5. The system determines at random whether an order should be in a control group. It approves those orders immediately and without regard to their score.
6. The system scores the extracted and reformatted order using the risk scoring model.
7. Based on the results of the rules application, the system routes orders to third-party services and manual review processes as appropriate.
8. The system sends a yes/no determination and/or a risk score for the order back to the originating commercial channel.
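The scoring, control-group, and response steps can be sketched as a single function. This is an illustration only: the 5% control rate, the 0.5 approval threshold, and the `score_fn` callable are assumptions, not details of the Consensus system.

```python
import random


def handle_order(order, score_fn, rng, control_rate=0.05, threshold=0.5):
    """Score an order, apply the random control group, and build the response."""
    score = score_fn(order)
    # Control-group orders are approved regardless of score, which gives an
    # unbiased sample of outcomes for measuring the model later.
    in_control = rng.random() < control_rate
    approved = True if in_control else score < threshold
    return {
        "order_id": order["order_id"],
        "approved": approved,
        "control_group": in_control,
        "risk_score": score,
    }


rng = random.Random(42)
risky = {"order_id": "B7"}
# With control_rate=0.0 the risky order is declined on its score.
declined = handle_order(risky, lambda o: 0.9, rng, control_rate=0.0)
# With control_rate=1.0 the same order is approved as part of the control group.
controlled = handle_order(risky, lambda o: 0.9, rng, control_rate=1.0)
```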
Conventional
• Many systems claim to
use “models”
• What they’re wedded
to are linear models
• LM’s are great for some
things, not for others
• Overfitting is an issue
The whole universe of statistical models
V.
Consensus
• The universe of models
• Select the one that best fits the job
Gaussian Kernel
Random Forest
Support Vector Machine
Support Vector Machine
Searching for a data wrangling solution
• Join disparate data sources together
• Carrier reconciliation data
• Internal order data
• External data
• Data is messy
• Shifting date formats
• Inconsistent data types
• Shifts in logic across partners
• Reduce reliance on developers/data analysts
• I’m lousy at SQL
• Day jobs need to be attended to
• Scaling Discovery
• Needed to be able to act on hunches
• Explore ground truth
• Data Prep is a labor of love
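"Shifting date formats" is the kind of problem that otherwise needs hand-written code. A minimal sketch of normalizing dates that arrive in several layouts; the three formats listed are assumptions for illustration, and values that match none of them are returned as `None` for review rather than guessed.

```python
from datetime import datetime

# Hypothetical formats seen across partners; extend as new layouts appear.
KNOWN_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%Y%m%d")


def parse_date(text):
    """Return a date for the first matching format, or None if nothing matches."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    return None


raw_values = ["2018-05-31", "05/31/2018", "20180531", "not a date"]
parsed = [parse_date(v) for v in raw_values]
```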
• Data is in different places (and sometimes toxic)
• Black box data
• Pulling standalone sets of hashed values
• Script Dev > SRE > Output > back to me
• Making data repairs repeatable
• Geocode failures
• Exploration of Census data
• Tracking your changes
• Showing your work & Data Provenance
How Trifacta helps me
• Discovery of data
• From 2-3 days, to less than one day
• Data preparation
• From 8 hours to less than one hour
Our results with Trifacta
Q & A
Start wrangling today with Trifacta Wrangler Pro in AWS Marketplace:
• aws.amazon.com/marketplace/
• Search for “Trifacta Wrangler Pro”
Learn more about machine learning on AWS
• aws.amazon.com/machine-learning/featured-partner-solutions/
Try AWS for free:
• aws.amazon.com/free/
Next steps and further information:
Thank you!

Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSAmazon Web Services
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAmazon Web Services
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareAmazon Web Services
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWSAmazon Web Services
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckAmazon Web Services
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without serversAmazon Web Services
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...Amazon Web Services
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceAmazon Web Services
 

Mehr von Amazon Web Services (20)

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS Fargate
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWS
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot
 
Open banking as a service
Open banking as a serviceOpen banking as a service
Open banking as a service
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
 
Computer Vision con AWS
Computer Vision con AWSComputer Vision con AWS
Computer Vision con AWS
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatare
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e web
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWS
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch Deck
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without servers
 
Fundraising Essentials
Fundraising EssentialsFundraising Essentials
Fundraising Essentials
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container Service
 

How to Wrangle Data for Machine Learning on AWS

  • 1. How to wrangle data for machine learning on AWS. May 31, 2018 | 10:00 AM PT. © 2018 Amazon Web Services, Inc. or its Affiliates. All rights reserved.
  • 2. Today’s speakers: Pratap Ramamurthy, Partner Solutions Architect, Amazon Web Services, Inc.; David McNamara, Customer Success Manager, Trifacta; Harrison Lynch, Senior Director of Product Development, Consensus Corporation
  • 3. Today’s agenda: • An overview of machine learning (ML) solutions offered through AWS and the AWS Partner Network • Featured AWS Machine Learning Partner: Trifacta • Case study: Consensus Corporation • Q&A / Discussion
  • 4. Learning objectives: • How easy it is to get started with machine learning solutions for data wrangling on the cloud • Why automating your data wrangling tasks can lead to greater data accuracy and more meaningful insights • How you can reduce your data preparation time by 60% and more with self-service data wrangling tools built for AWS • How Consensus Corporation is using Trifacta on AWS to detect fraud
  • 5. Machine learning on AWS
  • 6. A long heritage of machine learning at Amazon: • Personalized recommendations • Inventing entirely new customer experiences • Fulfillment automation and inventory management • Drones • Voice-driven interactions
  • 7. © 2018 Amazon Web Services, Inc. or its Affiliates. All rights reserved.
  • 8. Our mission: Put machine learning in the hands of every developer and data scientist
  • 9. Market adoption: $46B market by 2020. • Strong overall appetite for adopting AI • Top-heavy in high tech due to expertise • Opportunities exist in health care, education, retail, and other segments • 3,000+ startups today (up from 100 in 2011). Source: McKinsey Global Institute, Artificial Intelligence: The Next Digital Frontier.
  • 10. Amazon Machine Learning Stack
      • Application Services: Vision (Amazon Rekognition Image, Amazon Rekognition Video); Speech (Amazon Polly, Amazon Transcribe); Language (Amazon Translate, Amazon Comprehend, Amazon Lex)
      • Platform Services: Amazon SageMaker, AWS DeepLens, Amazon Machine Learning, Spark on Amazon EMR, Amazon Mechanical Turk
      • Frameworks & Infrastructure: AWS Deep Learning AMI; GPU (P3 Instances), Mobile, CPU, IoT (Greengrass); TensorFlow, Gluon, Apache MXNet, Cognitive Toolkit, Caffe2 & Caffe, PyTorch, Keras
  • 11. New: Amazon Rekognition Video. Video analysis: • Object and activity detection • Person tracking • Face recognition • Real-time live stream • Content moderation • Celebrity recognition
  • 12. © 2018 Amazon Web Services, Inc. or its Affiliates. All rights reserved.
  • 13. Customers running machine learning on AWS today
  • 14. The AWS Machine Learning Competency Program. What is the AWS Competency Program? The AWS Competency Program is designed to highlight APN Partners who have demonstrated technical proficiency and proven customer success in specialized solution areas. Attaining an AWS Competency allows partners to differentiate themselves to customers by showcasing expertise in a specific solution area.
  • 15. Data wrangling for machine learning on AWS. David McNamara, Customer Success Manager, Trifacta
  • 16. We believe that our ability to solve big problems in business and society depends on seeing patterns in the data we collect. But data comes in all shapes and sizes, and too often the messy process of pulling it together gets in the way of progress. At Trifacta, we empower change-makers to work with diverse and fragmented data—as it’s being cleaned and refined—so they can ask more interesting questions and create a better future.
  • 17. A global leader in data preparation: • #1 Rankings from Media & Analysts • 85+ Global Technology and SI Partners • #1 in Users with 10,000+ Companies • Enterprise Standard for Data Preparation at 100+ Accounts
  • 18. Self-service data wrangling: the critical enabler. DATA PLATFORMS | ANALYSIS & CONSUMPTION. 80%: “There's the joke that 80 percent of data science is cleaning the data and 20 percent is complaining about cleaning the data” (Kaggle founder and CEO Anthony Goldbloom). *Wrangler: Interactive Visual Specification of Data Transformation Scripts, Heer, Hellerstein, Kandel, Paepcke; Stanford University & University of California, Berkeley (2011)
  • 19. The Solution: Trifacta Data Wrangling Platform. DATA PLATFORMS | DATA WRANGLING ACTIVITIES: Discover, Structure, Clean, Enrich, Validate, Publish | ANALYSIS & CONSUMPTION
  • 20. Typical AI/ML modeling data pipeline to empower business self-service
  • 21. Demo
  • 22. Consensus Corporation: Improved data wrangling to increase the speed of machine learning model building for anti-fraud software. Harrison Lynch, Senior Director of Product Development, Consensus Corporation
  • 23. Your speaker: Harrison Lynch, Sr. Director of Product Development
  • 24. Consensus history (timeline figure; years shown: 1999, 2007, 2012, 2014, 2018; company names appeared as logos and are not in the transcript). Events shown: launches as an online retailer of wireless phones & services; iPhone launches; acquires LetsTalk.com & rebrands; launches Connected Commerce; first client launch. Consensus / Proprietary & Confidential
  • 25. Use our wireless capabilities to perform new tricks. Multiplex: 1. Catch a transaction; 2. Connect; 3. Collect. Diagram labels: Bundles; Product & Connected Service Bundles; Loyalty; Subscriptions ($625 cash or $29 per month); Underwriters; Services; Supply Chain
  • 26. Multiplex is about two things
      • Insights: Retailers face increasing price competition leading to downward margin and revenue pressure. Consumers face fragmented ‘buying to using’ experiences, and sticker shock on large ticket items.
      • Key benefits of our solution:
          • Get More with Assisted Sales. Cart Margin Enhancement (CME): Add products and activate services to improve margin for retailer and deliver a more complete customer experience.
          • Pay Less with Subscriptions. Future Proof Subscriptions (FPS): Give guests subscriptions to bundles of latest products and services with options to pay over time.
  • 27. Wireless retailing economics: • COGS: $850 • Customer finances 100% • Carrier pays device cost subsidy • Carrier pays commission • Retailer makes money on warranty, accessories • Carrier takes back commission and device subsidy if fraud • 1 bad sale wipes out the profit from up to 10 good sales
  • 28. Current industry practice relies on insufficient credit scoring methods. Will They Pay On Time? • FICO • Income • Payment History • Length of Employment • Credit Utilization • # of Accounts. V. Will They Ever Pay? • Distance from Store • Basket Composition • Rep Tenure • Time of Day • Number Port from. Identity thieves are trying to steal the best credit scores.
  • 29. Model building
      1. Extract:
          • Start with a robust data set, preferably at least 3 months of data from orders that are at least 120 days old
          • This dataset (of orders) should indicate whether the carrier has deactivated a corresponding line (and if possible, whether the carrier classified the order as fraudulent)
      2. Analyze:
          • Apply hunches, theories and business knowledge
          • This is what’s known as “Ground Truth”
      3. Identify:
          • Extract and analyze a set of characteristics from the set of orders
          • These are referred to as “Features”
      4. Augment:
          • Identify characteristics and tangential data elements about Features that enhance the model’s usefulness
          • If reliable cell phone customers tend to come from particular areas, it may be prudent to model the regions from which customers drive to reach the cell phone store
          • If an annual festival increases a store region’s population by 100,000 people and a higher percentage of fraud cases come from this annual period, it may be prudent to cluster purchase orders that are proximate to the festival period
      5. Transform:
          • Put the data into a format that makes it easier for data modeling systems to read
          • This is usually one or a series of two-dimensional tables with one of the following types of data:
              • Continuous: numeric values, like dollar values or distance
              • Binary: 1/0, or a yes/no
              • Categorical: colors (white, black, gold) or a carrier
      6. Split data into two sets:
          • Larger set is the training set that feeds into a set of models; it allows the models to identify good vs. bad orders and to generate correlations
          • Smaller test set is one that the data scientist puts away for future use against the model
      7. Choose:
          • Models that would best fit the data that comes through the system, based on the models that have worked in the past
          • Put the training set through the set of candidate models, which results in:
              • Risk assessments for each order in the training set
              • Refinement of the candidate models such that they become “pickled models”
      8. Train & Test:
          • After the candidate models generate risk assessments for the training data set, a data scientist pulls the test set out of a drawer (so to speak)
          • The models are then judged for accuracy against data they’ve never seen; this is how you make sure you’re not building a model to predict yesterday’s weather
          • The data scientist evaluates the candidate models based upon the comparison
          • The candidate models undergo tuning and refinement such that they produce the best results possible against the test data set
          • The data scientist selects the candidate model that produces the best results against the test data set
      9. Pickle, Promote & Deploy:
          • Compare the results for the best candidate model against the model currently in use; if it’s a winner, move forward
          • Test the performance of the new model in use to ensure that it meets performance standards and promote it to production
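The transform, split, train-and-test, and pickle steps from slide 29 can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions: the order fields, the carrier codes, and the toy distance-cutoff "model" are hypothetical stand-ins, not the features or models Consensus uses.

```python
import pickle
import random

# Hypothetical orders; every field name and value here is illustrative.
ORDERS = [
    {"distance_miles": 3.0, "ported_number": 0, "carrier": "A", "fraud": 0},
    {"distance_miles": 5.5, "ported_number": 0, "carrier": "B", "fraud": 0},
    {"distance_miles": 2.0, "ported_number": 1, "carrier": "A", "fraud": 0},
    {"distance_miles": 8.0, "ported_number": 0, "carrier": "C", "fraud": 0},
    {"distance_miles": 60.0, "ported_number": 1, "carrier": "B", "fraud": 1},
    {"distance_miles": 45.0, "ported_number": 1, "carrier": "C", "fraud": 1},
    {"distance_miles": 75.0, "ported_number": 0, "carrier": "A", "fraud": 1},
    {"distance_miles": 52.0, "ported_number": 1, "carrier": "B", "fraud": 1},
]
CARRIERS = ["A", "B", "C"]  # assumed category values


def encode(order):
    """Transform: turn one order into a flat numeric row."""
    row = [float(order["distance_miles"])]                  # continuous
    row.append(int(order["ported_number"]))                 # binary
    row += [int(order["carrier"] == c) for c in CARRIERS]   # categorical, one-hot
    return row


def split(rows, labels, test_fraction=0.25, seed=0):
    """Split: shuffle once, then hold back the smaller test set."""
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)
    cut = int(len(idx) * (1 - test_fraction))
    train, test = idx[:cut], idx[cut:]
    return ([rows[i] for i in train], [labels[i] for i in train],
            [rows[i] for i in test], [labels[i] for i in test])


def fit(rows, labels):
    """Toy candidate model: cutoff halfway between the class mean distances.

    Assumes both good and fraudulent orders appear in the training set.
    """
    good = [r[0] for r, y in zip(rows, labels) if y == 0]
    bad = [r[0] for r, y in zip(rows, labels) if y == 1]
    return {"cutoff": (sum(good) / len(good) + sum(bad) / len(bad)) / 2}


def predict(model, row):
    """Flag the order as risky (1) when it is farther than the cutoff."""
    return int(row[0] > model["cutoff"])


def accuracy(model, rows, labels):
    """Share of orders whose prediction matches the known outcome."""
    return sum(predict(model, r) == y for r, y in zip(rows, labels)) / len(rows)


rows = [encode(o) for o in ORDERS]
labels = [o["fraud"] for o in ORDERS]
train_x, train_y, test_x, test_y = split(rows, labels)

model = fit(train_x, train_y)
print("held-out accuracy:", accuracy(model, test_x, test_y))

# "Pickle": serialize the fitted model so it can be promoted and reloaded later.
blob = pickle.dumps(model)
restored = pickle.loads(blob)
```

The point of the held-out set is the one the slide makes: the model is judged only on orders it never saw during fitting. In practice the single toy model would be replaced by several candidate models compared on the same test set.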
  • 30. System for machine learning
      1. Orders arrive at the machine learning system via several channels
      2. The system parses the data. It is able to do so regardless of the channel of origin
      3. The system extracts the data that is most relevant to the order scoring process, transforms the extracted data into a format the model can read, and loads the reformatted data into the model
      4. The system scores the extracted and reformatted order using the risk scoring model
      5. The system determines at random whether an order should be in a control group. It approves those orders immediately and without regard to their score
      6. The system scores the extracted and reformatted order using the risk scoring model
      7. Based on the results of the rules application, the system routes orders to third-party services and manual review processes as appropriate
      8. The system sends a yes/no determination and/or a risk score for the order back to the originating commercial channel
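The scoring, control-group, and routing steps on slide 30 can be sketched as a single function. This is a hedged illustration, not the Consensus system: `route_order`, the 5% control rate, and the two score thresholds are all assumptions made up for the example, and the scoring function is passed in as a placeholder for the real risk model.

```python
import random


def route_order(features, score_fn, rng,
                control_rate=0.05, review_at=0.5, decline_at=0.9):
    """Score one order and decide what happens to it.

    control_rate, review_at and decline_at are hypothetical values.
    """
    score = score_fn(features)

    # Control group: chosen at random and approved regardless of score.
    if rng.random() < control_rate:
        return {"decision": "approve", "score": score, "control": True}

    # Otherwise apply simple rules to the score.
    if score >= decline_at:
        decision = "decline"
    elif score >= review_at:
        decision = "manual_review"
    else:
        decision = "approve"
    return {"decision": decision, "score": score, "control": False}


# Usage with a stand-in scoring function that always returns 0.7.
rng = random.Random(42)
result = route_order({"distance_miles": 40.0}, lambda f: 0.7, rng, control_rate=0.0)
print(result["decision"])
```

Passing the random generator in explicitly keeps the control-group draw reproducible in tests while still being random in production.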
  • 31. The whole universe of statistical models. Conventional: • Many systems claim to use “models” • What they’re wedded to are linear models • LMs are great for some things, not for others • Overfitting is an issue. V. Consensus: • The universe of models • Select the one that best fits the job. Models shown: Gaussian Kernel, Random Forest, Support Vector Machine
  • 32. Searching for a data wrangling solution
      • Join disparate data sources together
          • Carrier reconciliation data
          • Internal order data
          • External data
      • Data is messy
          • Shifting date formats
          • Inconsistent data types
          • Shifts in logic across partners
      • Reduce reliance on developers/data analysts
          • I’m lousy at SQL
          • Day jobs need to be attended to
      • Scaling Discovery
          • Needed to be able to act on hunches
          • Explore ground truth
          • Data Prep is a labor of love
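The "shifting date formats" problem on slide 32 is the kind of cleanup that is tedious to script by hand. A minimal Python sketch of the manual approach follows, assuming a hypothetical list of formats seen across partner feeds; it is not how Trifacta handles dates, only an illustration of the task.

```python
from datetime import datetime

# Hypothetical formats observed across feeds, tried in this order.
KNOWN_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%Y%m%d")


def normalize_date(text):
    """Return the date as ISO 8601 (YYYY-MM-DD), or None if no format matches."""
    cleaned = text.strip()
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date().isoformat()
        except ValueError:
            continue  # not this format; try the next one
    return None


for raw in ["2018-05-31", "05/31/2018", "20180531", "not a date"]:
    print(raw, "->", normalize_date(raw))
```

Unparseable values come back as `None` rather than raising, so they can be counted and reviewed. A format list like this cannot tell day-first from month-first dates apart, which is one reason the ordering of `KNOWN_FORMATS` has to be chosen per data source.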
  • 33. How Trifacta helps me: • Data is in different places (and sometimes toxic) • Black box data • Pulling standalone sets of hashed values • Script Dev > SRE > Output > back to me • Making repairing of data repeatable • Geocode failures • Exploration of Census data • Tracking your changes • Showing your work & Data Provenance
  • 34. Our results with Trifacta: • Discovery of data: from 2-3 days to less than one day • Data preparation: from 8 hours to less than one hour
  • 35. Q & A
  • 36. Next steps and further information. Start wrangling today with Trifacta Wrangler Pro in AWS Marketplace: • aws.amazon.com/marketplace/ • Search for “Trifacta Wrangler Pro”. Learn more about machine learning on AWS: • aws.amazon.com/machine-learning/featured-partner-solutions/. Try AWS for free: • aws.amazon.com/free/
  • 37. Thank you!