Presentation by Dr. Peter Bruce, Statistics.com. Presented on April 27, 2012 at the MRA Spring Research Symposium hosted by the Mid-Atlantic Chapter of the Marketing Research Association.
2. About Statistics.com
THE INSTITUTE FOR STATISTICS EDUCATION
• 100+ courses, introductory and advanced
• Traditional statistics, data mining, machine learning, text mining, clinical trials, optimization, use of R
• All online
• Typically 4 weeks, scheduled dates
• No need to be online at particular times or days
• Private discussion forum with instructors - noted
authors & experts
4. Predictive Analytics
• In marketing, used for model-driven, targeted sales efforts
• Also: will a loan default? What is the diagnosis (given symptoms)? Is a tax return fraudulent? …
5. Market Research
• Traditionally surveys, analysis, information
gathering, strategy
• Moving online increases the amount of
data, speeds its flow, and makes it more
accessible
6. Washington Post (web)
• 35 different reports tracking traffic daily
• Midday report: “Are we on track for visitors?”
• Number of visitors from key domains: .gov, .mil, .senate or .house
7. Daily Mail (UK web)
• A traditional ingredient is stories about
animals – tracked on web
• “The animals that do best are
monkeys, dogs, and cats, in that order…”
Martin Clark (editor)
9. Predictive Analytics
• Goes beyond the obvious, capturing
complexity
• Implemented for real-time behavior and
decisions
10. Pregnant?
• Obvious retail clues – maternity clothes, baby
food, baby clothes, crib …
• These may be too late
• Earlier clues are less obvious: lotions, supplements, and especially combinations of purchases and changes in purchase patterns
• Data mining algorithms can capture these less
obvious, more complex signals
11. Training the Model
• Bridal registry
• Women of similar demographics who are not on a bridal registry
• Together, these form the training set
– Known outcome
– Purchase data over time
18. Therefore: Validate the Model
• Partition the original data
– Training
– Validation
• Fit the model to the training data
• Assess performance using the validation data
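The fit-then-assess steps above can be sketched in a few lines. This is a minimal illustration, not the presenter's method: the data, the single "purchase score" variable, and the threshold model are all invented for the example.

```python
# Hypothetical data: (purchase_score, outcome) pairs, invented for illustration.
training = [(1, 0), (2, 0), (3, 0), (4, 1), (5, 1), (6, 1)]
validation = [(2, 0), (3, 1), (5, 1), (6, 1)]

def accuracy(data, threshold):
    """Share of records classified correctly when score >= threshold predicts class 1."""
    correct = sum(1 for score, outcome in data if (score >= threshold) == (outcome == 1))
    return correct / len(data)

def fit_threshold(data):
    """'Fit' the model: pick the observed score that maximizes accuracy on the given data."""
    candidates = sorted({score for score, _ in data})
    return max(candidates, key=lambda t: accuracy(data, t))

# Fit on the training partition only, then assess on the validation partition.
threshold = fit_threshold(training)
train_acc = accuracy(training, threshold)
valid_acc = accuracy(validation, threshold)
print(f"threshold={threshold}, training accuracy={train_acc:.2f}, "
      f"validation accuracy={valid_acc:.2f}")
```

In this toy case the model is perfect on the data it was fitted to and less accurate on the validation data, which is why performance is assessed on records the model has not seen.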
20. Confusion Matrix and Cutoff Control
Training Data Scoring - Summary Report
Cutoff probability value for success (updatable): 0.5
Classification Confusion Matrix
                Predicted Class
Actual Class    1       0
1               43      8
0               6       247
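As a worked example, the counts in the confusion matrix above can be turned into summary rates. The second half sketches how moving the cutoff changes the matrix; the `confusion_counts` helper and the six scored cases are hypothetical, not taken from the software output shown.

```python
# Counts from the confusion matrix above (cutoff = 0.5).
tp = 43   # actual 1, predicted 1
fn = 8    # actual 1, predicted 0
fp = 6    # actual 0, predicted 1
tn = 247  # actual 0, predicted 0

total = tp + fn + fp + tn
accuracy = (tp + tn) / total      # share of all cases classified correctly
sensitivity = tp / (tp + fn)      # share of actual 1s that were caught
specificity = tn / (tn + fp)      # share of actual 0s correctly left alone
precision = tp / (tp + fp)        # share of predicted 1s that are truly 1s

print(f"n={total} accuracy={accuracy:.3f} sensitivity={sensitivity:.3f} "
      f"specificity={specificity:.3f} precision={precision:.3f}")

def confusion_counts(probs, actuals, cutoff=0.5):
    """Hypothetical helper: return (tp, fn, fp, tn) when prob >= cutoff predicts class 1."""
    tp = fn = fp = tn = 0
    for prob, actual in zip(probs, actuals):
        predicted = 1 if prob >= cutoff else 0
        if actual == 1 and predicted == 1:
            tp += 1
        elif actual == 1:
            fn += 1
        elif predicted == 1:
            fp += 1
        else:
            tn += 1
    return tp, fn, fp, tn

# Invented scores: lowering the cutoff catches more 1s at the cost of more false alarms.
probs = [0.9, 0.7, 0.4, 0.3, 0.2, 0.1]
actuals = [1, 1, 1, 0, 0, 0]
at_half = confusion_counts(probs, actuals, cutoff=0.5)
at_quarter = confusion_counts(probs, actuals, cutoff=0.25)
```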
21. Lift
• In classifying “pregnant” vs. “not-pregnant”, classifying everyone as “not-pregnant” has very high overall accuracy
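• Need a metric that reflects the greater importance of the “pregnant” category, which is rare
• Lift is the model’s improvement over average random selection: the response rate among the cases the model ranks highest, divided by the overall response rate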
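A small sketch of the lift calculation, using ten invented cases of which two are "pregnant". The `lift_at` function name and the data are assumptions for illustration only.

```python
def lift_at(probs, actuals, fraction):
    """Lift for the top `fraction` of cases ranked by predicted probability.

    Lift = response rate among the top-ranked cases / overall response rate.
    """
    ranked = sorted(zip(probs, actuals), key=lambda pair: pair[0], reverse=True)
    n_top = max(1, round(len(ranked) * fraction))
    top_rate = sum(actual for _, actual in ranked[:n_top]) / n_top
    overall_rate = sum(actuals) / len(actuals)
    return top_rate / overall_rate

# Invented scores and outcomes: 2 of 10 cases belong to the rare class.
probs = [0.95, 0.80, 0.60, 0.40, 0.30, 0.25, 0.20, 0.10, 0.05, 0.02]
actuals = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]

# Classifying everyone as 0 is right 80% of the time but finds no one.
all_negative_accuracy = actuals.count(0) / len(actuals)

lift_top_20 = lift_at(probs, actuals, 0.2)   # top 2 cases hold 1 of the 2 positives
lift_top_30 = lift_at(probs, actuals, 0.3)   # top 3 cases hold both positives
print(f"all-negative accuracy={all_negative_accuracy:.2f}, "
      f"lift@20%={lift_top_20:.2f}, lift@30%={lift_top_30:.2f}")
```

A lift of 2.5 in the top 20% means that contacting those cases finds the rare class at 2.5 times the rate of contacting cases at random.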
23. Validate the Model
• Compare one model to another
• Avoid overfitting
• Solution: apply model to hold-out sample
– Assess performance of different models
– Fine tune parameters of individual models
24. Partitioning
• Randomly split the initial data into 2 or 3
groups
– Training
– Validation
– Test
• Repeated use of the validation data to compare and fine-tune models leads to overfitting to the validation data, in addition to the training data
– “Test” partition used only once, at the end
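A random three-way partition can be sketched with the Python standard library. The 50/30/20 proportions, the fixed seed, and the `partition` function name are assumptions chosen for the example; the slides do not specify them.

```python
import random

def partition(records, train_frac=0.5, valid_frac=0.3, seed=42):
    """Randomly split records into training, validation, and test partitions.

    Whatever remains after the training and validation shares goes to test.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)   # fixed seed so the split is reproducible
    n_train = round(len(shuffled) * train_frac)
    n_valid = round(len(shuffled) * valid_frac)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]
    return train, valid, test

# 100 hypothetical record IDs.
records = list(range(100))
train, valid, test = partition(records)
print(len(train), len(valid), len(test))
```

Models are fitted on `train`, compared and tuned on `valid`, and the chosen model is scored on `test` only once, at the end.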
25. Software
• SAS Enterprise Miner $$$$
• IBM SPSS Modeler (Clementine) $$$$
• XLMiner (Excel add-in) $
• Statistica Data Miner $$
• Salford Systems $$
• RapidMiner $$ (free open-source version available)
• R (open source, free)
26. Data Mining - More
• Clustering (segmentation)
• Profiling (explanatory models)
• Time series
• Affinity (recommender systems)
• Text analytics (NLP, sentiment analysis)
27. Skill Shortage
• McKinsey “Big Data” report
– Supply gap of 140,000-190,000 people with “deep analytical talent”
• Emergence of “Analytics” masters programs
(Northwestern, NC State, …)