Slidedeck from our seminar about Data Science (30/09/2014)
Topics covered:
- What is Data Science?
- What can Data Science do for your business?
- How does Data Science relate to Statistics, BI and BigData?
- Practical application of data mining techniques: decision trees, naive Bayes, k-means clustering, Apriori
- Real-world case of applied data science
1. Data Science Company
Introduction to (big) data science
Infofarm - Seminar
Veldkant 33A, Kontich ● info@infofarm.be ● www.infofarm.be
30/09/2014
2. Agenda
• About us
• What is Data Science?
• Data Science in practice
– Models
– Tools
• Case study
3. About us
4. InfoFarm - Company
• Data Science and BigData startup
• Part of the Cronos group
– Largest independent IT services supplier in Belgium
– Organized in limited-sized highly focused competence centers
– 3000+ Consultants
• Incubated at Xplore Group, within the context of:
– Java
– PHP
– e-commerce (Hybris, Intershop, Magento, DrupalCommerce, ...)
– Mobile development (iOS, Android, ...)
– Web development (HTML5, CSS3, ...)
5. InfoFarm - Team
• Mixed skills team
– 2 Data Scientists
• Mathematics
• Statistics
– 4 BigData Consultants
– 1 Infra specialist
– n Cronos colleagues with various backgrounds
• Certifications
– CCDH - Cloudera Certified Developer for Apache Hadoop
– CCAH - Cloudera Certified Administrator for Apache Hadoop
– OCJP – Oracle Certified Java Programmer
6. InfoFarm - Focus
• Mission
– “Help our customers to excel in their business activities by
providing them with new information and insights of high
business value.
Identifying, extracting and using data of all types and origins;
exploring, correlating and using it in new and innovative ways in
order to extract meaning and business value from it.”
• Focus Domains
– Data Science
– Machine Learning
– Big Data
7. Introduction: what is Data Science?
8. What is Data Science?
• Data Science & Business decisions
• Data Science vs …
– Statistics
– Business Intelligence
– Big Data
• What can Data Science do for your business?
• The Data Science maturity model
9. Business decisions
• Any business requires continuous decision making
– Will we offer this customer a discount or not?
– Do we need to keep extra stock for product X?
– How do we answer this customer question?
– At which supplier do we buy this product?
– With which solution will we respond to this RFP?
– Do we need to replace device X?
– …
• The possible answers to these questions are based on prior
experience with the business
• Each decision can turn out to be the right or the wrong one; business
knowledge should help you avoid the wrong ones
10. Business decisions
– However …
• Do you really know your business that well?
• Hasn’t it evolved in this fast-changing world?
• Are you sure your competitors aren’t making better decisions?
– You probably own a lot more information than you might realize!
• All your business processes are generating data which you can
use to your advantage!
• Quotes you made vs deals you won
• Historical sales records
• Web logs showing user activity
• Social media activity referring to your brand/product
• Metering info on devices (internet of things)
• …
11. Types of Data
– Proprietary data
• ERP, CRM, Orders, Customers, Products, etc…
– “Dark Data” – currently unused; you may not even be aware of it
• Unknown, but present in the company
• Cost-efficient BigData tools might enable business cases using this data
– External data
• Websites, social media, open data, …
– Data still to be captured
• “If only we knew X or Y” …
– There might be a huge added value in “mashing up” proprietary
data with public/open data!
12. Business Knowledge vs Data Science
(Intuitive knowledge vs data driven decisions)
Business Knowledge
Acquired by experience
(assumed) insights
RISK: too high bias on past experience and gut feeling
Data Science
Complementary to business knowledge
Confirmative or new insights
Data-driven decision taking
RISK: too naive data interpretation, disconnected from business
13. Business Knowledge vs Data Science
(Intuitive knowledge vs data driven decisions)
14. Business decisions: marketing example
• Example:
We want to send mailings about our new product
• Decisions to take:
– Which mail to send to which customers?
– We need customer segmentation!
• Risks in failing to do this correctly
– Missing opportunities (not informing customers)
– Annoying customers with irrelevant mailings (churn, reputation damage, …)
15. Business decisions: marketing example
• Business knowledge based approach
– “We know our segments: -25y, 25y-35y, 35y+ groups, and male/female”
– But is this (still) true?
– E.g.: do we really want to send an ad of the new iPhone to a long-time Android
user because he’s a 30-something male customer?
16. Business decisions: marketing example
• Data-driven approach: Can we identify different segments automatically? (machine learning!)
– WEB SERVER LOGS
Which customers have already looked at similar products on our website?
– ORDER HISTORY
Which customers own complementary products?
– CRM INFORMATION
What is the typical profile of a customer that clicked
through on the last e-mail campaign for a similar product?
– …
• Business knowledge and Data Science become in- and output for
each other!
– Ideas/hypotheses and data to be examined should be identified from business
knowledge!
– A/B testing can be applied to test approaches and check results
– Let the data speak for itself! New business insights are generated
17. Being a Data Scientist
• “Data Scientist – the sexiest job of the 21st century”
- Thomas H. Davenport
• Data Scientist: “A person who is better at statistics than any software
engineer and better at software engineering than any statistician”
- Josh Wills
18. Data Science = team work!
19. Data Science vs Statistics
• Basic Statistics concepts
– Reliability and validity
– Probability
– Descriptive statistics and graphics
• Inferential statistics (and hypothesis testing)
– Probability distributions
– Populations and samples
– Confidence intervals
– Correlation
• Data Science
– Link with IT (tooling, scale, …)
– Data preparation & hacking (get data from databases, websites, …)
– Machine learning and automation
– Working interactively together with business
20. Data Science vs Business Intelligence
• Basic BI concepts: structuring data to report and query upon it
– DWH, OLAP, ETL processes
– Star- and snowflake schemas
– Query-oriented architectures
– Close to typical IT development cycle
• Data Science: working and experimenting with data to gain insights
– Exploratory working
– Work in a research cycle rather than development cycle
– Limited investment towards analysis that might or might not deliver
– Tools designed to avoid heavy ETL (loosely structured data)
– Eventually valuable analyses can be ported to BI systems
21. Data Science vs Business Intelligence
• Using tools that are designed to support exploratory
working
– Not requiring strict up-front schema design
– Allowing fast and cheap hypotheses testing
– Open up opportunities to quickly integrate many data sources
• Excel files, Text files, Word Documents
• Log files
• Relational databases
• Sensor data
• Timeseries data
• ...
• Integrations with online (OLTP) and analytical
(OLAP/BI) systems
– Typically for automating repetitive analysis and reporting outputs
22. Data Science vs Big Data
• Process of statistical inference: sampling & induction
• BigData allows:
– N=ALL (avoid sampling errors)
• Sampling issues can be overcome by just processing ALL available data (process massive data)
– N=1 (avoid issues with non-homogeneous datasets)
• Categorization becomes true personalisation: project towards ONE individual (calculate per item)
• Significance considerations are not applicable!
23. What can Data Science do for your business?
• Extract meaning from data
– Using and combining data in ways it has never been used before
– Finding patterns and correlations in data from all possible sources
– Detecting anomalies and changes in known patterns
• Transform data of various types into valuable information
– As a basis for management decisions
– As a basis for data products
– That can improve your business in any way
• Build and integrate Data Products
– Recommendation engines, Prediction models, Automated classification, …
• The key point is spotting opportunities to outperform your
competitors using any data available!
24. Scientific cycle
Question → Hypothesis → Experiment (data) → Analyse results → Conclusion
• This is NOT a development cycle!
• Experimentation vs engineering
• Because it is a science, the outcome cannot be predicted
• This makes it hard to integrate into an IT development process
25. Scientific cycle
• Take small steps
• Formulate hypotheses
• Actually build things
• Apply A/B testing
• Even without success,
you learned something!
26. The Data Science maturity model
• Don’t run before you can walk: The Data Science Maturity model
Each level builds on the quality of the underlying step. It’s science, not magic …
– Start off by simply collecting the data you need (type, quantity, quality)
– Then report on your current business (confirmative analysis)
– Discover new and valuable information (exploratory analysis)
– Build and test prediction models (predictive analysis)
– Steer your business based on advice derived from your predictions (data-driven)
Collect → Describe → Discover → Predict → Advise
27. The Data Science maturity model
Phase | Actions | Examples in commerce
Collect | Logging information; gathering data from different sources | Logging user actions on a website; using loyalty cards to identify customers
Describe | Exploratory data analysis; basic analytical functions; checking quantity and quality of data | Typical reporting; correlating data across sources
Discover | Finding correlations; building models | Finding similarly behaving customers
Predict | Building prediction models; formulating expectations for the future based on past info | Predicting sales figures for a new product; predicting whether a certain customer will or will not buy a certain product
Advise | Using prediction models to evaluate decision possibilities and pick the best | Targeting advertising to the right customer groups to optimize revenue
28. Data Science in practice
35. Modeling methods & statistics
• Basic patterns
– Recommendations
Based on known taste, propose items that might be liked as well
– Clustering
Detecting correlation groups in data without using pre-defined
segmentation based on business knowledge
– Classification
Automated labeling, acceptance/rejection of data based on
probability models
• Supervised & unsupervised learning methods
– k-means, naive Bayes, k-nearest neighbors, random forests,
logistic regression, Apriori, ...
36. Modeling methods: Decision Tree
• Query: which kind of fruit am I looking at?
– More general: image recognition
• Clean your data
– What to do with missing values?
• Insert average value
• Insert special value
• Delete data
– What to do with outliers?
• Wrong data?
37. Modeling methods: Decision Tree
• Find most decisive variable
– Categorical variable: one leaf for each category, or one leaf for a
group of categories
– Numerical variable: find best cut-off(s)
(Diagram: Query → Color → Green / Yellow / Red)
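One common way to make "most decisive variable" concrete is to pick the split that leaves the lowest weighted Gini impurity. The slides do not say which criterion was used, so the Gini measure, the function names and the toy fruit records below are all assumptions for illustration.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0 means the leaf is pure)."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

def weighted_gini(rows, feature, target):
    """Impurity left after splitting rows into one leaf per category of `feature`."""
    groups = {}
    for row in rows:
        groups.setdefault(row[feature], []).append(row[target])
    return sum(len(g) / len(rows) * gini(g) for g in groups.values())

def most_decisive(rows, features, target):
    """Return the feature whose split leaves the least impurity."""
    return min(features, key=lambda f: weighted_gini(rows, f, target))

# Hypothetical toy records, not taken from the seminar.
fruits = [
    {"color": "green", "shape": "round", "fruit": "apple"},
    {"color": "green", "shape": "round", "fruit": "apple"},
    {"color": "yellow", "shape": "round", "fruit": "lemon"},
    {"color": "yellow", "shape": "thin", "fruit": "banana"},
    {"color": "red", "shape": "round", "fruit": "cherry"},
    {"color": "red", "shape": "round", "fruit": "cherry"},
]
best = most_decisive(fruits, ["color", "shape"], "fruit")
```

On these records, splitting on color leaves only the yellow leaf mixed, while splitting on shape leaves almost everything in one mixed leaf, so color is selected first.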
38. Modeling methods: Decision Tree
• For each leaf, repeat the process:
Size is actually numerical: find size cut-offs
Query → Color
  Green → Size: Big / Medium / Small
  Yellow → Shape: Round / Thin
  Red → Size: Medium / Small
39. Modeling methods: Decision Tree
Query → Color
  Green → Size
    Big → Watermelon
    Medium → Green apple
    Small → Grapes
  Yellow → Shape
    Round → Size
      Big → Grapefruit
      Medium → Lemon
    Thin → Banana
  Red → Size
    Medium → Red apple
    Small → Try it
      Sweet → Cherry
      Sour → Grape
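Once built, a decision tree is just a chain of nested tests. The sketch below walks the fruit tree from this slide as it can be read from the flattened diagram; the lowercase labels and the `taste` parameter (standing in for the "Try it" node) are assumptions made for the example.

```python
def classify_fruit(color, size=None, shape=None, taste=None):
    """Walk the fruit tree: color first, then size, shape or taste as needed."""
    if color == "green":
        return {"big": "watermelon", "medium": "green apple", "small": "grapes"}[size]
    if color == "yellow":
        if shape == "thin":
            return "banana"
        # Round yellow fruit is separated by size.
        return {"big": "grapefruit", "medium": "lemon"}[size]
    if color == "red":
        if size == "medium":
            return "red apple"
        # Small red fruit: "try it" and decide on taste.
        return {"sweet": "cherry", "sour": "grape"}[taste]
    raise ValueError(f"unknown color: {color}")
```

For example, `classify_fruit("yellow", shape="thin")` follows the Yellow → Thin branch and returns "banana" without ever looking at size.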
40. Modeling methods: Decision Tree - Distributed
• A big advantage of the big data tools is their distributed
processing power (running processes in parallel)
• Build your decision tree
– Each leaf can be processed by another node
– All your data should still be available to every mapper
• Upgrading your decision tree
– Bagging trees (sampling your data)
– Random Forest (sampling your variables)
– Every mapper should only read a part of your data
– Still in general better results than a decision tree
41. Modeling methods: Decision Tree
• QUESTION: Can we predict whether a customer will place an order during this web session?
(Figure: decision tree with splits Date_added > 1.5, Hour_added > 16.29 and Date_added < 5.113; leaf values 0.06, 0.1136, 0.1829 and 0.3273)
• Modeling (data mining)
– Input: historical surfing information
– Decision tree algorithm
• Loop over historical data
• Find most decisive variable
• For each leaf, repeat
– Avoid overfitting!
• Runtime usage
– Pass the current session info through the tree model
– Allow certain discounts to increase conversion?
– Put user on checkout or in-store after putting product in basket?
42. Modeling methods: Naive Bayes
• QUESTION: Will I play tennis today?
• Start with labeled data from the past
Again clean your data!
• Often used with plain text
• Assumes that the variables are independent of each other (given the class)
• Named after Bayes' rule (statistics)
43. Modeling methods: Naive Bayes
Day  Outlook   Temperature  Humidity  Wind    PlayTennis
D1   Sunny     Hot          High      Weak    No
D2   Sunny     Hot          High      Strong  No
D3   Overcast  Hot          High      Weak    Yes
D4   Rain      Mild         High      Weak    Yes
D5   Rain      Cool         Normal    Weak    Yes
D6   Rain      Cool         Normal    Strong  No
D7   Overcast  Cool         Normal    Strong  Yes
D8   Sunny     Mild         High      Weak    No
D9   Sunny     Cool         Normal    Weak    Yes
D10  Rain      Mild         Normal    Weak    Yes
D11  Sunny     Mild         Normal    Strong  Yes
D12  Overcast  Mild         High      Strong  Yes
D13  Overcast  Hot          Normal    Weak    Yes
D14  Rain      Mild         High      Strong  No
44. Modeling methods: Naive Bayes
• Consider the PlayTennis problem and a new instance (sun, cool, high, strong)
45. Modeling methods: Naive Bayes
• Estimate parameters
– P(yes) = 9/14 P(no) = 5/14
– P(Wind=strong|yes) = 3/9
– P(Wind=strong|no) = 3/5
– …
• We have
P(y)P(sun|y)P(cool|y)P(high|y)P(strong|y) = 0.005
P(n)P(sun|n)P(cool|n)P(high|n)P(strong|n) = 0.021
• Therefore this new instance is classified as “no”
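The two scores can be reproduced by simple counting over the 14 labeled days of the PlayTennis table. This is a minimal sketch without smoothing; the `score` helper and the variable names are chosen for the example and are not from the seminar.

```python
from fractions import Fraction

# (Outlook, Temperature, Humidity, Wind, PlayTennis) for days D1..D14.
DATA = [
    ("Sunny", "Hot", "High", "Weak", "No"),
    ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),
    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Strong", "No"),
]

def score(instance, label):
    """P(label) times the product of P(feature value | label), by counting."""
    rows = [r for r in DATA if r[-1] == label]
    result = Fraction(len(rows), len(DATA))  # prior, e.g. 9/14 for "Yes"
    for i, value in enumerate(instance):
        matches = sum(1 for r in rows if r[i] == value)
        result *= Fraction(matches, len(rows))  # conditional, no smoothing
    return result

new_day = ("Sunny", "Cool", "High", "Strong")
scores = {label: score(new_day, label) for label in ("Yes", "No")}
prediction = max(scores, key=scores.get)
print({k: round(float(v), 3) for k, v in scores.items()}, prediction)
# prints {'Yes': 0.005, 'No': 0.021} No
```

The scores are not probabilities (they do not sum to 1); only their ratio matters for picking the class.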
46. Modeling methods: Naive Bayes - distributed
• Vectorisation of training data (more or less a word count) can
easily be distributed:
– Each text goes to one mapper
– Even when dealing with a large text: cut your text into pieces
– Every small block of data is read only once, by one mapper
• Vectorisation of your new instance
• The actual prediction is a multiplication of all conditional probabilities,
so the calculation of the prediction is also easy to distribute
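The mapper/reducer split for the word count can be sketched in a single process. Nothing here uses a real Hadoop API: `map_chunk`, `reduce_counts` and the text blocks are hypothetical stand-ins that only show why each block needs to be read once.

```python
from collections import Counter
from functools import reduce

def map_chunk(chunk):
    """Mapper: count the words in one block of text, independently of all others."""
    return Counter(chunk.lower().split())

def reduce_counts(left, right):
    """Reducer: merge two partial word counts."""
    return left + right

# Hypothetical text blocks; on a cluster each would go to a different mapper.
chunks = ["the invoice is wrong", "the delivery is late", "invoice not received"]
partials = [map_chunk(c) for c in chunks]  # independent, so this could run in parallel
totals = reduce(reduce_counts, partials, Counter())
```

Because merging counts is associative, the partial results can be combined in any order, which is what makes the step easy to distribute.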
47. Modeling methods: Naive Bayes
• QUESTION: Can we route incoming questions (free text) to the right person/department?
• Modeling (data mining)
– Input: historical information questions and handling person/department
– Naive bayes algorithm
• For each word or n-gram (2 or 3 words) – count occurrences per file
• Very valuable are words with high frequency in a single document
• Very valuable are words only used in a small number of documents
• Remove stopwords, generic words, etc…
• Runtime usage
– Vectorize incoming document (which words/n-grams occur how many
times?)
– Predict category based on comparison with historical documents
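The two "very valuable" rules above (frequent within one document, rare across documents) are what a TF-IDF weighting expresses. The slides do not name TF-IDF, so treat the formula, the `tf_idf` helper and the sample questions below as assumptions for illustration.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {word: weight} dict per document (relative term frequency x log IDF)."""
    tokenized = [doc.lower().split() for doc in documents]
    # Document frequency: in how many documents does each word occur?
    doc_freq = Counter(word for words in tokenized for word in set(words))
    n_docs = len(documents)
    weights = []
    for words in tokenized:
        counts = Counter(words)
        weights.append({
            word: (count / len(words)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return weights

# Hypothetical incoming questions, not from the seminar.
docs = [
    "my invoice shows the wrong invoice amount",
    "the parcel arrived damaged",
    "where is the parcel",
]
weights = tf_idf(docs)
```

A word such as "the", which occurs in every document, gets weight 0 and so behaves like a stopword, while "invoice" dominates the first question.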
48. Modeling methods: k-means Clustering
• QUESTION: Which countries have the same type of food consumption?
• Your data is not labeled!
• You define labels for your clusters after applying the cluster
algorithm
• Choose the number of clusters you are expecting
– Try different numbers of clusters
– Run an algorithm to decide the optimal number of clusters
• Plot your final results mapped on your principal components
50. Modeling methods: k-means Clustering
• Define a metric: take every variable into account as much as all other variables
• Create random starting points (as many as clusters you expect)
• Assign each point to the closest center (or starting) point
• Calculate the center of each cluster
• Iterate the previous two steps
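The steps above can be sketched in a few lines of plain Python. To keep the run reproducible, this sketch uses fixed starting points instead of random ones, a fixed number of iterations, Euclidean distance, and only the (RedMeat, Fish) values of six countries from the food consumption data; all of those choices are assumptions for the example.

```python
import math

def kmeans(points, centers, iterations=10):
    """Plain k-means: assign each point to the nearest center, then recompute centers."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # New center = mean of the cluster; an empty cluster keeps its old center.
        centers = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else center
            for cluster, center in zip(clusters, centers)
        ]
    return centers, clusters

# (RedMeat, Fish) for Albania, Bulgaria, Romania, Denmark, Norway, Sweden.
points = [(10.1, 0.2), (7.8, 1.2), (6.2, 1.0), (10.6, 9.9), (9.4, 9.7), (9.9, 7.5)]
start = [points[0], points[3]]  # fixed starting points instead of random ones
centers, clusters = kmeans(points, start)
```

With these inputs the low-fish and high-fish countries end up in separate clusters after the first pass and the centers stop moving.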
53. Modeling methods: k-means Clustering
"cluster 1"
Country RedMeat Fish Fr.Veg
Albania 10.1 0.2 1.7
Bulgaria 7.8 1.2 4.2
Romania 6.2 1.0 2.8
Yugoslavia 4.4 0.6 3.2
"cluster 2"
Country RedMeat Fish Fr.Veg
Denmark 10.6 9.9 2.4
Finland 9.5 5.8 1.4
Norway 9.4 9.7 2.7
Sweden 9.9 7.5 2.0
"cluster 3"
Country RedMeat Fish Fr.Veg
Czechoslovakia 9.7 2.0 4.0
E Germany 8.4 5.4 3.6
Hungary 5.3 0.3 4.2
Poland 6.9 3.0 6.6
USSR 9.3 3.0 2.9
"cluster 4"
Country RedMeat Fish Fr.Veg
Austria 8.9 2.1 4.3
Belgium 13.5 4.5 4.0
France 18.0 5.7 6.5
Ireland 13.9 2.2 2.9
Netherlands 9.5 2.5 3.7
Switzerland 13.1 2.3 4.9
UK 17.4 4.3 3.3
W Germany 11.4 3.4 3.8
"cluster 5"
Country RedMeat Fish Fr.Veg
Greece 10.2 5.9 6.5
Italy 9.0 3.4 6.7
Portugal 6.2 14.2 7.9
Spain 7.1 7.0 7.2
54. Modeling methods: k-means Clustering - distributed
• Calculate conditional chances
– Every mapper only needs one variable
• Assigning points to clusters:
– All centers in distributed cache
– Rest of the data only read once by one mapper
– Calculate distances and assign to the closest center point
• Update center points
– One mapper for each cluster
55. Modeling methods: k-means Clustering
• QUESTION: In which different segments can we split our customer base?
• Modeling (data mining)
– Input: any information on the customers (CRM, ERP, Social Media, …)
– Very important to find columns to use (requires business knowledge to
formulate hypotheses!)
– K-means clustering algorithm
• Define a “distance” formula to calculate how close two customers are to
each other
• Define starting points for each cluster center
• Iterate and re-allocate customers to a cluster, move cluster centers
• Runtime usage
– Quickly check the cluster in which a new customer could be residing
56. Modeling methods: Apriori
• QUESTION: Which books might be interesting for you, knowing which books you have read?
• Modeling (data mining)
– Input: all titles of books someone has read
– Make sure that same books have same titles (e.g.: drop edition from
title)
– Apriori algorithm
• Make baskets of read books, labeled with the reader
• Identify books that commonly occur together
• Tweak your recommendation rules:
– Choose a big enough support
– Confidence of recommendations can be calculated
– The bigger the lift, the more valuable your recommendation might be for the reader
• Runtime usage
– Check if a subset of the books occurs as the left-hand side of a rule
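Support, confidence and lift for a single rule can be computed directly from the baskets. The sketch below covers only these three measures, not the candidate generation of the full algorithm; the baskets and book titles are hypothetical.

```python
def support(baskets, items):
    """Fraction of baskets that contain every item in `items`."""
    items = set(items)
    return sum(1 for b in baskets if items <= b) / len(baskets)

def rule_metrics(baskets, lhs, rhs):
    """Support, confidence and lift of the rule lhs -> rhs."""
    both = support(baskets, set(lhs) | set(rhs))
    confidence = both / support(baskets, lhs)   # P(rhs | lhs)
    lift = confidence / support(baskets, rhs)   # > 1 means lhs makes rhs more likely
    return both, confidence, lift

# Hypothetical reading baskets, one set of titles per reader.
baskets = [
    {"Book A", "Book B"},
    {"Book A", "Book B", "Book C"},
    {"Book A", "Book C"},
    {"Book B", "Book D"},
]
rule_support, rule_confidence, rule_lift = rule_metrics(baskets, {"Book A"}, {"Book B"})
```

Here the rule "Book A → Book B" has support 0.5 and confidence 2/3, but a lift below 1: readers of Book A are slightly less likely than average to have read Book B, so the rule would not make a valuable recommendation.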
57. Modeling methods: Apriori
• Data consists of books bought online
• There were more than 40000 users buying more than one book (If they only
bought one book, they are not useful to make your model)
• In total they bought more than 220000 books
• Notice the permutations in the rules
• As you might expect, sequel books are bought together
59. Modeling methods: Apriori - distributed
• Make list of books bought together (training data)
– Similar to n-grams (Naïve Bayes)
– Every customer is read only once, by one mapper
• Make recommendations
– Every mapper handles a number of rules
60. Modeling methods: Apriori
• QUESTION: Which ads can I show on a website?
• Modeling (data mining)
– Input: All visited links, all bought items, …
– Decide what you think is important: you want to show items others were
also interested in, items others also bought, ….
– Apriori algorithm
• Find items which occur together
• Define your support, confidence and lift you want
• Runtime usage
– Check if a subset of the visited links occur as a left hand side of a rule
61. Case study