Join our webinar to hear how Consensus, a Target-owned subsidiary, utilizes AWS and Trifacta to prepare data for use in fraud detection algorithms. You’ll learn how self-service automated data wrangling can save your organization time and money, and get tips for getting started with Trifacta’s solution, built for AWS.
Webinar attendees will learn:
- Why automating your data wrangling tasks can lead to greater data accuracy and more meaningful insights.
- How you can reduce your data preparation time by 60% or more with self-service data wrangling tools built for AWS.
- How easy it is to get started with machine learning solutions for data wrangling on the cloud.
15. Data wrangling for machine learning on AWS
David McNamara, Customer Success Manager, Trifacta
16. We believe that our ability to solve big
problems in business and society
depends on seeing patterns in the data
we collect. But data comes in all shapes
and sizes, and too often the messy
process of pulling it together gets in the
way of progress. At Trifacta, we
empower change-makers to work with
diverse and fragmented data—as it’s
being cleaned and refined—so they can
ask more interesting questions and
create a better future.
17. A global leader in data preparation
#1 Rankings from Media & Analysts
85+ Global Technology and SI Partners
#1 in Users with 10,000+ Companies
Enterprise Standard for Data Preparation at 100+ Accounts
18. Self-service data wrangling: the critical enabler
*Wrangler: Interactive Visual Specification of Data Transformation Scripts –
Heer, Hellerstein, Kandel, Paepcke; Stanford University & University of California, Berkeley (2011)
DATA PLATFORMS
ANALYSIS & CONSUMPTION
80%
“There’s the joke that 80 percent of data science is cleaning the data and 20 percent is complaining about cleaning the data”
— Kaggle founder and CEO Anthony Goldbloom
19. The Solution: Trifacta Data Wrangling Platform
DATA PLATFORMS
ANALYSIS & CONSUMPTION
DATA WRANGLING ACTIVITIES
Discover, Structure, Clean, Enrich, Validate, Publish
22. Consensus Corporation:
Improved data wrangling to increase the speed of machine learning model building for anti-fraud software
Harrison Lynch, Senior Director of Product Development,
Consensus Corporation
24. Consensus history
Consensus / Proprietary & Confidential
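Timeline: 1999, 2007, 2012, 2014, 2018 (company names appeared as logos on the slide)
• [logo] launches as an online retailer of wireless phones & services
• iPhone launches
• [logo] acquires [logo]
• [logo] acquires LetsTalk.com & rebrands as [logo]
• [logo] launches Connected Commerce in
• First client launch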
25. Use our wireless capabilities to perform new tricks
Multiplex Bundles
Product & Connected Service Bundles
Loyalty Subscriptions
$625 Cash or $29 per month
Underwriters Services Supply Chain
1. CATCH A TRANSACTION
2. CONNECT
3. COLLECT
26. Multiplex is about two things
INSIGHTS
• Retailers face increasing price competition, leading to downward margin and revenue pressure.
• Consumers face fragmented ‘buying to using’ experiences and sticker shock on large-ticket items.
KEY BENEFITS OF OUR SOLUTION
• Get More with Assisted Sales: Cart Margin Enhancement (CME) adds products and activates services to improve margin for the retailer and deliver a more complete customer experience.
• Pay Less with Subscriptions: Future Proof Subscriptions (FPS) give guests subscriptions to bundles of the latest products and services, with options to pay over time.
27. Wireless retailing economics
• COGS: $850
• Customer finances 100%
• Carrier pays device cost subsidy
• Carrier pays commission
• Retailer makes money on warranty, accessories
• Carrier takes back commission and device subsidy if fraud
• 1 bad sale wipes out the profit from up to 10 good sales
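The point that one bad sale can wipe out the profit from ten good sales implies a break-even fraud rate. The sketch below works that out, assuming the ratio is exactly 10 (the slide says "up to 10"); the function name is hypothetical.

```python
from fractions import Fraction


def break_even_fraud_rate(good_sales_per_bad_sale: int) -> Fraction:
    """Share of fraudulent sales at which total profit is exactly zero.

    If one bad sale loses as much as `good_sales_per_bad_sale` good sales
    earn, then with fraud share f the profit per sale is proportional to
    (1 - f) - f * good_sales_per_bad_sale, which is zero at
    f = 1 / (1 + good_sales_per_bad_sale).
    """
    if good_sales_per_bad_sale <= 0:
        raise ValueError("ratio must be positive")
    return Fraction(1, 1 + good_sales_per_bad_sale)


rate = break_even_fraud_rate(10)
print(f"Break-even fraud rate: {float(rate):.1%}")  # prints 9.1%
```

Under that assumption, a retailer stops making money once roughly one sale in eleven is fraudulent.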
28. Current industry practice relies on insufficient credit scoring methods
Will They Pay on Time?
• FICO
• Income
• Payment History
• Length of Employment
• Credit Utilization
• # of Accounts
vs.
Identity thieves are trying to steal the best credit scores
Will They Ever Pay?
• Distance from Store
• Basket Composition
• Rep Tenure
• Time of Day
• Number Port from
29. Model building
Extract:
• Start with a robust data set, preferably at least 3 months of data from orders that are at least 120 days old
• This dataset (of orders) should indicate whether the carrier has deactivated a corresponding line (and, if possible, whether the carrier classified the order as fraudulent)
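A minimal sketch of the Extract step, assuming orders are plain dictionaries. The field names (`order_date`, `line_deactivated`, `carrier_flagged_fraud`) and the labeling rule are illustrative assumptions, not the actual Consensus schema.

```python
from datetime import date, timedelta


def extract_mature_orders(orders, as_of, min_age_days=120):
    """Keep only orders at least `min_age_days` old as of `as_of`."""
    cutoff = as_of - timedelta(days=min_age_days)
    return [o for o in orders if o["order_date"] <= cutoff]


def label(order):
    """Assumed labeling rule: 1 = bad order, 0 = good order."""
    return int(order["line_deactivated"] or order["carrier_flagged_fraud"])


orders = [
    {"order_id": "A", "order_date": date(2018, 1, 15),
     "line_deactivated": True, "carrier_flagged_fraud": False},
    {"order_id": "B", "order_date": date(2018, 4, 1),
     "line_deactivated": False, "carrier_flagged_fraud": False},
]

# Order B is younger than 120 days on 2018-06-01, so it is excluded.
mature = extract_mature_orders(orders, as_of=date(2018, 6, 1))
print([(o["order_id"], label(o)) for o in mature])
```

Waiting 120 days gives the carrier time to deactivate a line, so the label reflects a settled outcome rather than a pending one.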
Analyze:
• Apply hunches, theories and business knowledge
• This is what’s known as “Ground Truth”
Identify:
• Extract and analyze a set of characteristics from the set of orders
• These are referred to as “Features”
Augment:
• Identify characteristics and tangential
data elements about Features that
enhance the model’s usefulness
• If reliable cell phone customers tend to
come from particular areas, it may be
prudent to model the regions from which
customers drive to reach the cell phone
store
• If an annual festival increases a store region’s population by 100,000 people and a higher percentage of fraud cases comes from this annual period, it may be prudent to cluster purchase orders that are proximate to the festival period
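The two Augment examples above could be turned into features roughly as follows. This is a sketch under assumptions: the haversine formula stands in for whatever distance measure was actually used, and the three-day padding around the festival is an arbitrary illustrative value.

```python
import math
from datetime import date, timedelta


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    radius_km = 6371.0  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    # min() guards against tiny floating-point overshoot above 1.0
    return 2 * radius_km * math.asin(math.sqrt(min(1.0, a)))


def near_festival(order_date, festival_start, festival_end, pad_days=3):
    """True if the order date falls inside the festival window plus padding."""
    pad = timedelta(days=pad_days)
    return festival_start - pad <= order_date <= festival_end + pad


# Hypothetical order: customer address vs. store location, and order date.
features = {
    "distance_km": haversine_km(0.0, 0.0, 0.0, 1.0),
    "near_festival": near_festival(date(2018, 7, 2),
                                   date(2018, 7, 4), date(2018, 7, 8)),
}
print(features)
```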
Transform:
• Put the data into a format that makes it easier for data modeling systems to read
• This is usually one or a series of two-dimensional tables with one of the following types of data:
Continuous: numeric values, like dollar values or distance
Binary: 1/0, or a yes/no
Categorical: colors (white, black, gold) or a carrier
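As an illustration of the three data types, the sketch below flattens one order into a numeric row. One-hot encoding is a common way to represent categorical values and is assumed here; the field names and carrier list are hypothetical.

```python
def one_hot(value, categories):
    """Encode a categorical value as a list of 0/1 flags, one per category."""
    return [1 if value == c else 0 for c in categories]


def to_row(order, carriers):
    """Flatten one order into a numeric row for a two-dimensional table."""
    return [
        float(order["basket_value"]),         # continuous: dollar value
        float(order["distance_km"]),          # continuous: distance
        1 if order["number_ported"] else 0,   # binary: yes/no as 1/0
    ] + one_hot(order["carrier"], carriers)   # categorical: carrier


carriers = ["CarrierA", "CarrierB", "CarrierC"]
order = {"basket_value": 625.0, "distance_km": 12.5,
         "number_ported": True, "carrier": "CarrierB"}
print(to_row(order, carriers))
```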
Split data into two sets:
• Larger set is the training set that feeds into a set of
models; it allows the models to identify good vs bad
orders and to generate correlations.
• Smaller test set is one that the data scientist puts away
for future use against the model
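A minimal sketch of the split, assuming a seeded random shuffle and an 80/20 ratio; the slide specifies neither the ratio nor the sampling method.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle rows reproducibly and hold out a smaller test set."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    # Larger training set first, smaller held-out test set second.
    return shuffled[n_test:], shuffled[:n_test]


train, test = train_test_split(range(10))
print(len(train), len(test))  # prints 8 2
```

Fixing the seed keeps the held-out set stable between runs, so later model comparisons use the same unseen orders.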
Choose:
• Models that would best fit the data that comes through the system
based on the models that have worked in the past
• Put the training set through the set of candidate models, which
results in:
• Risk assessments for each order in the training
set
• Refinement of the candidate models such that they become “pickled models”
Train & Test:
• After the candidate models generate risk assessments
for the training data set, a data scientist pulls the test
set out of a drawer (so to speak)
• The models are then judged for accuracy against data they’ve never seen; this is how you make sure you’re not building a model to predict yesterday’s weather
• The data scientist evaluates the candidate models
based upon the comparison.
• The candidate models undergo tuning and refinement
such that they produce the best results possible against
the test data set.
• The data scientist selects the candidate model that
produces the best results against the test data set
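The evaluate-and-select loop above might look like the following sketch. The two candidates are hypothetical distance-threshold rules standing in for real trained models, and plain accuracy is used only for simplicity; with rare fraud cases another metric may be more informative.

```python
def accuracy(model, rows, labels):
    """Fraction of held-out rows the model classifies correctly."""
    correct = sum(1 for row, y in zip(rows, labels) if model(row) == y)
    return correct / len(rows)


def select_best(candidates, test_rows, test_labels):
    """Score every candidate on the test set and return the best name."""
    scores = {name: accuracy(model, test_rows, test_labels)
              for name, model in candidates.items()}
    return max(scores, key=scores.get), scores


# Hypothetical candidates: flag an order as bad (1) beyond a distance.
candidates = {
    "far_50": lambda row: int(row["distance_km"] > 50),
    "far_100": lambda row: int(row["distance_km"] > 100),
}
test_rows = [{"distance_km": 5}, {"distance_km": 80},
             {"distance_km": 200}, {"distance_km": 12}]
test_labels = [0, 0, 1, 0]

best, scores = select_best(candidates, test_rows, test_labels)
print(best, scores)
```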
Pickle, Promote & Deploy:
• Compare the results for the best candidate model against the model currently in use; if it’s a winner, move forward
• Test the performance of the new model in use to ensure that it meets performance standards, then promote it to production
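"Pickling" most likely refers to serializing a fitted model, for example with Python's `pickle` module. The sketch below assumes that reading, represents the model as a plain dictionary of parameters, and uses a hypothetical promotion rule that requires a strict improvement over the model in use.

```python
import pickle


def predict(params, row):
    """Score one order with a threshold model described by `params`."""
    return int(row["distance_km"] > params["threshold_km"])


def should_promote(challenger_score, champion_score):
    """Promote only if the challenger strictly beats the model in use."""
    return challenger_score > champion_score


challenger = {"threshold_km": 100.0}   # fitted parameters (illustrative)
blob = pickle.dumps(challenger)        # serialize for storage or deployment
restored = pickle.loads(blob)          # only unpickle data you trust

if should_promote(challenger_score=0.95, champion_score=0.90):
    print("promote:", predict(restored, {"distance_km": 150}))
```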
Model building
30. System for machine learning
1. Orders arrive at the machine learning system via several channels
2. The system parses the data. It is able to do so regardless of the channel of origin
3. The system extracts that data which is most relevant to the order scoring process, transforms the extracted data into a format the model can read & loads reformatted data into the model
4. The system scores the extracted and reformatted order using the risk scoring model
5. The system determines at random whether an order should be in a control group. It approves those orders immediately and without regard to their score
6. The system scores the extracted and reformatted order using the risk scoring model
7. Based on the results of the rules application, the system routes orders to third-party services and manual review processes as appropriate
8. The system sends a yes/no determination and/or a risk score for the order back to the originating commercial channel
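The scoring, control-group, and routing steps can be sketched as a single decision function. The control rate, thresholds, and route names are illustrative placeholders, not values from the Consensus system.

```python
import random


def decide(order, score_fn, rng, control_rate=0.05,
           review_at=0.5, decline_at=0.9):
    """Return a decision record for one parsed, reformatted order."""
    score = score_fn(order)                  # score with the risk model
    if rng.random() < control_rate:          # random control group:
        # approve immediately, without regard to the score
        return {"approved": True, "score": score, "route": "control"}
    if score >= decline_at:                  # rule: very high risk
        return {"approved": False, "score": score, "route": "auto"}
    if score >= review_at:                   # rule: route to manual review
        return {"approved": None, "score": score, "route": "manual_review"}
    return {"approved": True, "score": score, "route": "auto"}


rng = random.Random(7)
result = decide({"risk": 0.7}, lambda o: o["risk"], rng, control_rate=0.0)
print(result)
```

Approving a random control group regardless of score gives later model building unbiased outcomes for orders the model would otherwise have blocked.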
31. The whole universe of statistical models
Conventional
• Many systems claim to use “models”
• What they’re wedded to are linear models
• LMs are great for some things, not for others
• Overfitting is an issue
vs.
Consensus
• The universe of models
• Select the one that best fits the job
Gaussian Kernel
Random Forest
Support Vector Machine
32. Searching for a data wrangling solution
• Join disparate data sources together
  • Carrier reconciliation data
  • Internal order data
  • External data
• Data is messy
  • Shifting date formats
  • Inconsistent data types
  • Shifts in logic across partners
• Reduce reliance on developers/data analysts
  • I’m lousy at SQL
  • Day jobs need to be attended to
• Scaling Discovery
  • Needed to be able to act on hunches
  • Explore ground truth
• Data Prep is a labor of love
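Shifting date formats, one of the messy-data problems listed above, can be handled in code by trying a list of known formats in order. This is a generic sketch, not how Trifacta does it; the format list is an assumption, and ambiguous inputs such as 03/04/2018 are read month-first here.

```python
from datetime import date, datetime

# Assumed formats seen across partner feeds, tried in this order.
KNOWN_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%Y%m%d")


def parse_date(text):
    """Return a date for the first matching format, or None if none match."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue  # try the next known format
    return None


raw = ["2018-03-05", "03/05/2018", "20180305", "not a date"]
print([parse_date(value) for value in raw])
```

Returning `None` instead of raising lets unparseable rows be counted and reviewed rather than silently dropped.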
33. How Trifacta helps me
• Data is in different places (and sometimes toxic)
• Black box data
• Pulling standalone sets of hashed values
• Script Dev > SRE > Output > back to me
• Making data repair repeatable
• Geocode failures
• Exploration of Census data
• Tracking your changes
• Showing your work & Data Provenance
34. Our results with Trifacta
• Discovery of data
  • From 2-3 days to less than one day
• Data preparation
  • From 8 hours to less than one hour