Human Factors of XR: Using Human Factors to Design XR Systems
Data Refinement: The missing link between data collection and decisions
1. Data Refinement:
The missing link between data collection and decisions
Stephen H. Yu
Data Strategy & Analytics Consultant
2. What we will cover
•
•
•
•
•
•
Database Marketing Landscape
Analytics and Models
“Model-Ready” Environment
Data Summarization & Categorization
Delivery: Scoring & QC
Closing the Loop
1
3. Big Data, Small Data, Neat Data, Messy Data
How is the "Big Data” working out for you?
2.5 quintillion bytes collected per “day”
1 quintillion (exabytes) = 1 billion gigabytes
• Did all this data improve your decision making process?
• Do you have the results to show for?
• Information Overload? You bet!
Harness insights, drop the noise
2
4. Database Marketing Landscape
• No guessing game – You MUST know your target
• Vast amount of online & offline data collected
But are they being used properly?
• Analytics play a huge roles in prospecting & CRM
• Short paced marketing cycle getting shorter
• Huge difference between advanced marketers
and those who are falling behind
Winners are the ones who know how to wield the power
of all available data faster.
3
5. Insights, not Raw Data
• Database marketers must excel in:
Collection
Refinement
Delivery
Size & speed
matters
Get to answers, not
just ingredients
For consumption
by end-users
• Must provide “marketing answers” via advanced analytics,
not just bits and pieces of data
• Insight does not come from data, it is derived from data
4
6. Refined Answers
Raw Data
Marketing Answers
•
•
•
•
•
•
•
•
•
• Likely to buy a luxury car
• Likely to take a foreign vacation
• Likely to response to free
shipping offer
• Likely to be a high value customer
• Likely to be qualified for credit
• Likely to upgrade
• Likely to leave
• Likely to come back
Demographic / Firmographic
RFM
Products & Services Used
Promotion / Response History
Lifestyle / Survey Responses
Delinquent history
Call / Communication Log
Movement Data
Sentiments
5
7. Different Types of Analytics
“Analytics” means different things…
• BI (Business Intelligence) Reporting: Display of success metrics, dashboard
reporting
• Descriptive Analytics: Profiling, segmentation, clustering
• Predictive Modeling: Response models, cloning Models, lifetime
value, revenue models, etc.
• Optimization: Channel optimization, marketing spending
analysis, econometrics models
Predictive Modeling for 1-to-1 Marketing
6
8. Why model?
• Increase Targeting Accuracy
• Reduce costs by contacting less/smart
• Stay relevant
• Consistent results
• Reveal hidden patterns in data
• Repeatable – key for automation
• Expandable
• “Supposedly” save time and effort
Models summarize complex data into simple-to-use “scores”
7
9. Why NOT model?
• Universe is too small
• Predictable data not available
• 1-to-1 marketing channels not in plan
• Tight budget
• Lack of resources
8
11. Any pain implementing models?
• Not easy to find “Best” customers
• Modelers are fixing data all the time
• Rely on a few popular variables
• Always need more variables
• Takes too long to build models and score
• Inconsistencies shown when scored
• Disappointing results
10
12. What does your database support?
If you have a database…
• Order Fulfillment
• Contact Management
• Standard Reports
• Ad hoc Reports and Queries
• Name Selections
• Response Analysis
• Trend Analysis
But does it support predictive
modeling and scoring?
11
13. For modeling, clean the data first
“Garbage-in, garbage-out”
• Most data sets are messy & “unstructured”
• Over 70-80% of model development time
goes to data prep work
• Most databases are NOT model-ready
• Modeling & Scoring
o Extension of database work
o Consistency is “the” key
12
14. Predictive Modeling is all about “Ranking”
2
1
3
Ultimately, Models
must properly “Rank”
Determine the level of
data accordingly
• Households
• Individuals
• Products
• Relational or unstructured
databases won’t cut it
• Must create “Descriptors” that
fit the level that needs to be
ranked
13
15. Unstructured to Structured
• Most modern databases optimized for massive storage and
rapid retrieval, not necessarily for predictive analytics
o Relational databases
o NoSQL databases
• Need “Analytical Sandbox” (or Database/Data-mart)
o Structured & de-normalized
o Variables as descriptors of model targets
o Common analytical language (SAS, R, SPSS)
o Must support “in-database” scoring
14
16. Analytical Sandbox – “Model-Ready” Environment
From “all” types of data collection to decisions. Then repeat.
15
17. Why Front-end DP Important?
Inexperienced analysts spend majority of time doing DP work
Modeling work at the last minute!
Creative variables enhance models
Inconsistent data creates a chain reaction to melt-downs
Data append/match becomes ineffective
16
19. 3 Major Types of Data for Marketers
Descriptive Data
• Demographic Data
• Firmographic Data
• Geo-demographic Data
Transaction Data /
Behavioral Data
• Transaction data
Attitudinal Data
• Surveys
• Sentiments
• Compiled / Co-op data
• Lifestyle data
• Online behavioral data
3-Dimensions in Predictive Analytics
18
20. Data Inventory
• “Modeling is making the best of
what we know”
• Beyond obvious RFM data
• Get Deeper
o Product/Service Level Data
o Historical Data
o Channel Data – Inbound & Outbound
o Online activities, sentiments,
unstructured data
• External Data
19
21. Create Data Menu
• Base it on Companywide Need-Analysis
• Ask the Analysts first:
What type of models are in the plan?
o Affinity/Look-alike Models
o Promotion/Response Models
o Time-series Models
o Attrition Models
• Consider non-analytical departments
• Maintain the ones that fit the objective
Don’t be afraid to throw out “noises”
20
22. Data Menu (continued)
Check the ingredients
Cost - Can you afford to maintain it?
• What do you have today?
• What can be bought?
• What can be created?
• Storage/Platform – Consider the
scoring part, too
• Programming/Processing Time
• Software
• Update
• External Data
21
23. Check Your Data Inventory
You may have more than you thought, so let’s start
with what you have:
•
•
Name & Address: Key to Geo/Demographic Data
Order Transaction Data: “RFM”, Payment Methods
•
Item/SKU Level Data: Products, Price, Units
•
Promotion/Response History: Source, Channel, Offer
•
Life-to-Date/Past “x” Months Summary Data
•
Customer Level Status Flags: Active, Dormant, Delinquent
•
Surveys/Product Registration Forms : Attitudinal/Lifestyle
•
Customer Communication History Data: Call-center, Web
•
Social Media, Click-through, Page views : Sentiment/Intentions
Need Conversion, Categorization, & Summarization
22
24. Maximize the Power of Transaction Data
Most databases describe shopping baskets
Start describing your targets
• RFM Data must be Summarized (or Denormalized)
• Turn RFM data into individual / household level
“Descriptors”
• Combine with essential categorical variables
(e.g., product, offer, channel, etc.)
23
25. Data Summarization – Matching the level of Data
Order Table
Cust ID
Order #
Order Summary Table
Order Date
$ Amount
000123
100011
2009-05-06
$199.99
000123
100128
2010-08-30
$50.49
000123
103082
2011-12-21
100036
2010-06-06
$43.99
003859
101658
2011-01-20
102189
2011-04-15
$119.45
003859
106458
2012-02-18
104535
2012-07-30
$354.72
016899
107296
2011-07-14
102982
2010-09-07
$128.60
019872
103826
2011-04-30
$499.99
019872
109056
2012-03-12
Last Order
Date
$199.99
019872
First Order
Date
$43.99
004593
$ Total
$43.99
003859
#
Orders
$128.60
003859
Cust ID
$59.99
000123
3
$379.08
2009-05-06
2011-12-21
003859
4
$251.42
2010-06-06
2012-02-18
004593
1
$354.72
2012-07-30
2012-07-30
016899
1
$199.99
2011-07-14
2011-07-14
019872
3
$688.58
2010-09-07
2012-03-12
24
26. Sample Variables after Summarization
Before
Recency
Frequency
Monetary
After Summarization
• Weeks since last online purchase
• Years since member sign up
• Days since last delinquent date
• Months since last response date
• Orders by offer type
• Orders by product/service type
• Payments by pay method
• Average days between transactions
• Total $ past 24 months
• Life-to-date spending
• Average dollars by channel
• Average dollars by product type
25
27. RFM Data Summary – Timeline
Life-to-date Summary provides
the historical view
May create bias towards
tenured customers
Put time limit on variables (e.g.
12-month, 24-month, etc.)
May require higher number of
variables and complicate the
process
For Lifetime Value & Time
Series Models
Must create historical arrays
(daily, weekly, monthly counts
of events)
26
28. Who does the summary work?
Answer: Not the statistician!
Key Takeaway
The data variables must be consistent
everywhere in the “Analytical Sandbox”
• Main analytical database
• Model development sample
• Pool of records to be scored
Pre-built summary variables in the
database
27
29. Data Categorization
Free-form data comes to life through categorization
Don’t Give Up!
• Hidden data in:
o Product, service, offer, channel, source, status, titles, surveys, etc.
• Have categorization guideline?
• Who will do it?
o Consider text mining techniques
• What to throw out?
o Keep data that matters in predictive modeling
28
30. Categorical Data
Any Non-numeric Data
• Product
• Service
• Offer
• Channel
• Source
• Market
• Region
• Business Title
• Member Status
• Payment Status
• etc…
Offer Code Example:
• Flat Dollar Discount
• % Discount
• Buy 1, Get 1 Free
• Free Shipping
• No Payment Until…
• Free Gift
• etc…
Categorize as much as possible at the data collection stage
29
31. Categorization Guidelines
Be consistent throughout
•
•
•
•
•
•
Survey Form
Data Entry
Inventory Database
Data Collection & Compilation
Summarization
Modeling, and Scoring
Create “Code” structure
• NEVER allow free-form answers
30
32. Categorization Guidelines (continued)
Create Rules and DON’T Deviate
from them
Training & Automation
More specific the better
Combine them later
But, don’t allow too many
variations (over 20) in one code
Break into multiple
codes if necessary
Don’t forget the end goals and
don’t over-do it
Must be “relevant”
31
33. Data Hygiene & Data Append
Do you know how many customers you
have in your database?
• Data conversion
o
o
o
o
Create consistency
Standardization
Edit
Purge
• Cover all bases – PII & RFM Data
• Create rules and be consistent
32
34. PII – Gateway to External Data
What is hidden behind simple name &
address?
• Standardize Name & Address first
o Maintain PII (Personally Identifiable Information)
o Hygiene via periodic NCOA and standardization
• First & Last Name – Ethnic, Gender
• Name, Address, Email – Demographic Data
• Address – Geo-demographic, Census Data
• Zip – County, Market Region, DMA
33
35. External Data
• Always consider buying data
before collecting and building
• Compiled Demographic /
Firmographic Data
• Behavioral / Transaction / Co-op
Data
• Lifestyle / Attitudinal / Survey
• Census / Geo-Demographic Data
34
36. External Data Check List
• Test multiple data sources
1.
2.
3.
4.
5.
6.
Depth of information / Uniqueness
Coverage / Match Rate
Consistency / Update Cycle
Price
User-friendliness
Delivery Options – Real-time?
• Learn about the data sources
o What’s real and what’s imputed?
o Don’t stop at Demographic: always consider
“Behavioral” data
35
37. Missing Values
“Missing Data can be meaningful.”
• For Numeric Data (e.g., $, Counters, Dates, etc.)
o Incalculable vs. Data-append Non-matches
o Missing is missing: DO NOT fill in with 0’s
$
• For Categorical Data (e.g., Codes, Text, etc.)
o Leave room for “N/A” (e.g., blank, “N/A”, “0”, “.”,
etc.)
o Code “Non-matches” to external files differently
And yet, unstructured database only store
what’s available.
36
38. More on missing data
• Agree on Imputation Rules
o Do it upfront
o Must be part of scoring codes
Data
• Educate non-analysts
o Hard to undo when combined with
other values
o Train vendors
• Always check % missing
o Development Sample vs. Life
Databases
37
39. Scoring – Sample vs. Database
• Development Sample vs. Live Database
o
o
o
o
Database Structure
Variable List/Name
Variable Value
Imputation Assumption
Lead to disasters if “anything” is different
• Do NOT play with model groups that are set in the
development sample
38
40. Scoring QC
Most troubles happen after the models are built…
Check:
•
•
•
•
•
•
Model Group Distributions
Variable distributions (values and indices)
Missing Values
Match rate for appended data
Scoring codes, including score breaks
Compare to previous runs – Check Deterioration
Set parameters for acceptable differences and Enforce
39
41. Score installment in the database
• Plan ahead to store model
scores in the database and datamarts
Sync with the main database
• Store raw scores, not just model
groups
• Match the levels of scores
o
o
o
o
Household
Individual
Email
Product
40
42. Back-end Analysis
Close the Loop Properly!
• 1-to-1 MKT 101:
“Learn from past campaigns”
• Must Plan ahead
o No excuse for not doing it
o Schedule ahead & budget properly
o Keep historical promo/response data (even
without marketing databases)
• Set Metrics Upfront
o List/Data Source
o By Offer, Creative, Season, Product, etc.
41
43. Response Reports & BI Analytics
• Start with “Canned” Reports and BI Dashboards
from vendors
• Don’t insist real-time for every report
• Get ready to create “Custom” reports and
dashboard views
o Prioritize what you want
• Format and Delivery
• Timing and Interval
• Timeline to be covered (YTD, 12-mo, etc.)
42
44. Key ROI Metrics
Set ROI Metrics, such as:
• Open, Click-through, Conversion Rates
o “Denominator” in each?
• Revenue
o Per 1,000 mailed/calls
o Per Order
o Per Display, Email, Click-through, Conversion
• By
o Source, Campaign, Time Period, Model
Group, Offer, Creative, Targeting
Criteria, Channel (in & outbound), Ad
server, Publisher, Key word, Script, Daypart, etc.
• Key variables must reside in the database
o Keep them in “ready-to-use” format
43
45. Where to Begin with Analytical Sandbox
Spec it out:
• Project Goal
• Data Source List (as detailed as possible)
• Final Variable List
• Project Flow:
o Collection
o Scoring
o Conversion & Categorization
o Storage
o Summarization
o Backend Analysis
o Matching / Data Append
o Development Sample
44
46. Who will build the sandbox?
• In-house vs. Out-sourcing;
Must consider
o Platform
o Software
o Programming
o Staffing
• Cost it out
o Don’t forget the update cost
• Back to the analysts for variable list
review
• Don’t be shy and ask for help from
consultants
45
47. 10 essential items to consider when outsourcing analytics
It is not just about modeling, but all surrounding services as well
10 essential items to consider when outsourcing analytical projects
1. Consulting capability: Translate marketing goals into mathematics
2. Data processing: Conversion, edit, summarization, data-append, etc.
3. Pricing structure: Model development is only one part; hidden fees?
4. Track record in the industry: Not in rocket science, but in marketing
5. Types of models supported: Watch out for one-trick ponies
6. Speed of execution: Turnaround time measured in days, not weeks
7. Documentation: Full disclosure of algorithms, charts and reports
8. Scoring validation: Job not done until fully scored and validated
9. Back-end analysis: For true “Closed-loop” marketing
10. Ongoing Support: Periodic review and update
46
48. Scope it out
• Know what you need, but don’t
over do it
“Modeling is making the best of
what’s available”
• Take a phased approach
o If budget is tight, start with low
hanging fruits
o Proof of concepts without full
database commitment in the
beginning
o Maintain consistency
o Keep Historical Data
47
49. Key Takeaways
•
•
•
•
•
•
•
•
Invest in analytics: It is about the insights, not just the data
Set business & analytical goals first
Advanced analytics requires its own sandbox
Ensure consistent data every step of the way: From
sampling to scoring
Check every data source, but don’t wait for a perfect data
set
Match the levels of data (Data Summary)
Don’t over-do it – employ phased approach
Ask for help
50. Stephen H. Yu · Data Strategy & Analytics Consultant· stephenyu1210@gmail.com· 201.218.2068