2. 6/4/2013Dataiku 2
Hi !
Current Life:
CEO, Dataiku
Tweet about this: @dataiku @club_dsi_gun
Past Life:
Criteo
IsCool Entertainment
Exalead
Florian
Douetteau
Available on Slide Share
http://www.slideshare.net/Dataiku
Goals Today:
• Concrete Feedback on Data Analytics
Projects
• Data Team in practice and Key technologies
• Motivate you to start a data science project
Slide deck allergic ? Check:
https://github.com/dataiku
5. Collocation
Big Apple
Big Mama
Big Data
A familiar grouping of words,
especially words that habitually appear
together and thereby convey meaning
by association.
6. “Big” Data in 1999
struct Element {
    Key key;
    void* stat_data;
};
….
C
Optimized Data structures
Perfect Hashing
HP-UX Servers – 4GB RAM
100 GB data
Web Crawler – Socket reuse
HTTP 0.9
1 Month
7. Hadoop
Java / Pig / Hive / Scala /
Clojure / …
A dozen NoSQL data stores
MPP Databases
Real-Time
Big Data in 2013
1 Hour
9. Meet Hal Alowne
6/4/2013Dataiku - Data Tuesday 9
Big Guys
• $10B+ revenue
• 100M+ customers
• 100+ data scientists
Hal Alowne
BI Manager
Dim’s Private Showroom
Hey Hal ! We need
a big data platform
like the big guys.
Let’s just do as they do!
European e-commerce website
• $100M revenue
• 1 million customers
• 1 data analyst (Hal himself)
Dim Sum
CEO & Founder
Dim’s Private Showroom
Big Data
Copy Cat
Project
10. Technology is complex
Hadoop
Ceph
Sphere
Cassandra
Spark
Scikit-Learn
Mahout
WEKA
MLBase
RapidMiner
Pandas
D3
Crossfilter
InfiniDB
LucidDB
Impala
Elastic Search
SOLR
MongoDB
Riak
Membase
Pig
Hive
Cascading
Talend
Machine Learning
Mystery Land
Scalability Central
NoSQL-Slavia
SQL Columnar Republic
Visualization County
Data Clean Wasteland
Statistician Old
House
R
11. Statistics and Machine Learning are
complex!
Try to understand
myself
13. Plumbing is not complex
(but difficult)
Implicit User Data
(Views, Searches…)
Content Data
(Title, Categories, Price, …)
Explicit User Data
(Click, Buy, …)
User Information
(Location, Graph…)
500TB
50TB
1TB
200GB
Transformation
Matrix
Transformation
Predictor
Per User Stats
Per Content Stats
User Similarity
Rank Predictor
Content Similarity
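The "User Similarity" step in the pipeline above can be illustrated with a small sketch. Everything here is an assumption made for illustration: the view counts, the user names and the choice of cosine similarity are hypothetical and are not taken from the talk.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two sparse vectors stored as {item: count}."""
    shared = set(u) & set(v)
    dot = sum(u[item] * v[item] for item in shared)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # a user with no activity is similar to nobody
    return dot / (norm_u * norm_v)

# Hypothetical implicit user data: number of product views per user.
views = {
    "alice": {"shoes": 3, "belt": 1},
    "bob": {"shoes": 6, "belt": 2},
    "carol": {"hat": 4},
}

sim_alice_bob = cosine_similarity(views["alice"], views["bob"])
sim_alice_carol = cosine_similarity(views["alice"], views["carol"])
print(round(sim_alice_bob, 3), round(sim_alice_carol, 3))
```

Alice and Bob view the same products in the same proportions, so they score close to 1; Alice and Carol share nothing and score 0. At the volumes quoted on the slide, the same computation would run as a distributed job rather than in memory.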
14. MERIT = TIME + ROI
Targeted
Newsletter
Recommender
Systems
Adapted Product
/ Promotions
TIME : 6 MONTHS ROI : APPS
Build a lab in 6 months
(rather than 18 months)
Find the right
people
(6 months?)
Choose the
technology
(6 months?)
Make it work
(6 months?)
Build the lab
(6 months)
Deploy apps
that actually deliver value
2013 2014
2013
• Train People
• Reuse working patterns
16. Our Goal
Our Goal:
Change his perspective
on data science projects
(sorry, we couldn’t
find a picture of Hal
Smiling)
17. Why and For What ?
◦ Business Theory
◦ Concrete Projects
How: people and projects?
◦ How to start
◦ Dedicated team ?
What technologies ?
◦ Machine Learning
◦ Architecture
Agenda
19. Product Success
driven by Quality !
Margin / Customer
Value / Traffic /
Acquisition
Example: Launching an App
on the App Store
20. Margin for new
customers might
decline …
Margin for new
features might
decline …
Is your business
really scalable ?
you continue growing ….
21. Existing Customers
Profiles
Existing Product Assets
Existing Specific
Business Model
And your KNOWLEDGE
of it
Where is your core business
advantage ?
22. Data Driven Business
What is your value?
Number of
Customers
Customer Knowledge
Increase over time with:
- Time spent in your app
- User relationship (network effect)
- Partner / Other Apps Interactions
Your Value
23. Data Impact
Not all businesses are equal
Online
Advertising
Telecommunication
Insurance
Ability
to Acquire
Margin
New
Services
Overall
Subscription
Market
Infrastructure
Driver
Selling Data
Risk / Price
Optimization
Subscription
Market
Subscription
Market
25. What should be free
in the application ?
How to optimize
conversion ?
How to plan and
create a business
model ?
Main Pain Point:
How to plan and
optimize pricing in
the application ?
Freemium Application
26. Example (Freemium Application)
Freemium Model Optimization
Business
Model
User
Cluster
Simulation
Optimized Pricing: Margin
+23%
Business Planning
Capability
1 month 9 months
R + Python + InfiniDB
On-Premise
1TB Dataset
5 weeks project
27. Business Intelligence
stack has scalability and
maintenance issues
Backoffice implements
business rules that are
challenged
Existing infrastructure
cannot cope with per-
user information
Main Pain Point:
23 hours 52 minutes to
compute Business Intelligence
aggregates for one day.
Large E-Retailer
28. • Relieve their current DWH and
accelerate production of some
aggregates/KPIs
• Be the backbone for new
personalized user experience
on their website: more
recommendations, more
profiling, etc.,
• Train existing people around
machine learning and
segmentation experience
1h12 to perform the
aggregate, available every
morning
New home page
personalization deployed in a
few weeks
Hadoop Cluster (24 cores)
Google Compute Engine
Python + R + Vertica
12 TB dataset
6-week project
Large E-Retailer : The Datalab
29. BI performed directly on
production databases
New reports required
direct work by the CTO
for design and
implementation
Each photo tag manually
validated and completed
Large Photo Bank
Main pain point:
No visibility into new users'
behaviours
30. Implementing a Cloud-based
data lab to :
• centralize all available data,
previously scattered between
SQL DB and file systems,
• improve web tracking
granularity to enhance
customer knowledge via
behavior modeling and
segmentation,
• create content-based
recommendation engines with
keywords clustering and
association.
Large Photo Bank : The Datalab
R + Vertica + Hadoop
Amazon Web Services
8-week project
Automated content filtering
and recommendation
31. Large set of
manually crafted
linguistic resources
for interpreting
users' queries
New brands, rare
terms… hard to
maintain
Large Online Directory
Main Pain Point:
Ability to maintain a very
large ontological knowledge
set, with more than 100k
concepts
32. Analyze clicks,
rephrasing and navigation to
detect queries that
require specific
processing
Gather web and external
data to enrich the
existing index
Train the team on Hadoop
and Machine Learning
Continuous Relevance
Monitoring
Automated enrichment
2x more productivity
Hadoop (48 cores)
Python
On Premise
10-week project
Large Online Directory: The Data Lab
33. Launch A Marketing
campaign
After a few days
PREDICT based on
behaviours
◦ Total ARPU for users
after 3 months
◦ Efficiency of a campaign
◦ Continue or not ?
Example ( E-Application )
Marketing Campaign Prediction
34. A very large community
Some mid-size
communities
Lots of small clusters
(mostly 2 players)
Correlation
◦ between community size
and engagement / virality
Meaningful patterns
◦ 2 players / Family / Group
What is the minimum
number of friends to
have in the application
to get additional
engagement ?
Example (Social Gaming)
Social Gaming Communities
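A minimal sketch of how the community sizes described above could be measured: treat each friendship as a graph edge and count connected components with a union-find structure. The player names, the edge list and the `community_sizes` helper are hypothetical; the talk does not say which method or tool was used.

```python
from collections import Counter

def community_sizes(friendships):
    """Return the size of each connected component, largest first."""
    parent = {}

    def find(node):
        # Register unseen nodes as their own root, then walk up to the root.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for a, b in friendships:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b  # merge the two communities

    sizes = Counter(find(node) for node in list(parent))
    return sorted(sizes.values(), reverse=True)

# Hypothetical friendships: one group of four players and two pairs.
edges = [
    ("ann", "bob"), ("bob", "cat"), ("cat", "dan"),
    ("eve", "fay"),
    ("gus", "hal"),
]
sizes = community_sizes(edges)
print(sizes)
```

Joining the resulting sizes with per-player engagement metrics is what would then expose the correlation between community size and engagement that the slide mentions.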
35. What do others do?
◦ Concrete Projects
How: people and projects?
◦ How to start
◦ Dedicated team ?
What technologies ?
◦ Machine Learning
◦ Architecture
Agenda
37. A / B Test
(or equivalent for your
business) is the first step to
get into a “data-driven”
mind set
No advanced analytics
required; some existing
tools can help
Changing a button color:
+21%
(1) Be Data Driven
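Reading an A/B test needs very little machinery. The sketch below applies a standard two-proportion z-test using only the Python standard library. The visitor and conversion counts are hypothetical, chosen so that the relative lift comes out at 21% like the figure on the slide; they are not data from the talk.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (relative_lift, z, two_sided_p_value) for variant B against control A."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled conversion rate under the hypothesis that both variants perform equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    lift = (p_b - p_a) / p_a
    return lift, z, p_value

# Hypothetical experiment: 10,000 visitors per variant.
lift, z, p_value = two_proportion_z_test(conv_a=1000, n_a=10000, conv_b=1210, n_b=10000)
print(f"lift={lift:+.1%} z={z:.2f} p={p_value:.2g}")
```

With these made-up counts the difference is far larger than its standard error, so the lift would be unlikely to be noise. With a few hundred visitors per variant, the same 21% lift could easily be.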
38. People Microsoft Excel
(2) Use Excel
39. Data Team Data Tools
(3) Build a team
The Business Expert
who knows maths
The Analyst
who reveals patterns
The Coding Guy
who is enthusiastic
40. data lab (n.): a small group
with all the expertise, including
business minded people,
machine learning knowledge and
the right technology
A proven organization used by
successful data-driven
companies over the past few
years (eBay, LinkedIn, Walmart…)
TEAM + TOOLS = LAB
41. Organization
Targeted campaigns
Price optimization
Personalized
experience
Quality Assurance
Workload and yield
management
User Feedback (A/B Test)
Continuous improvement
Data
Product
Designer
Business
&
Marketing
Engineers
User
Voice
42. Short Term Focus / Long Term Drive
• Business People
  ◦ Short term: Optimize margin, …
  ◦ Long term: Create new business revenue streams
• Marketing People
  ◦ Short term: Optimize click ratio
  ◦ Long term: Brand awareness and impact
• IT People
  ◦ Short term: Make IT work
  ◦ Long term: Clean and efficient architecture
• Data People
  ◦ Short term: Get stats right, make predictions
  ◦ Long term: Create data-driven features
It’s just a new team
43. Super Intern
What is your ability to integrate a new
smart guy and give him any
data he would need and any computing
power he would need to enhance
your product ?
44. What do others do?
◦ Concrete Projects
How: people and projects?
◦ How to start
◦ Dedicated team ?
What technologies ?
◦ Machine Learning
◦ Architecture
Agenda
51. Classic Columnar Architecture
Lots of data Some Place To
Pour It In
Some Tool To
Do Some Maths And Graphs
Web Tracking Logs
Raw Server Logs
Order / Product / Customer
Facebook Info
Open Data (Weather, Currency …)
52. The Corinthian Architecture
Lots of data
Some Place
To Perform
Rapid Calculations
Some Tools To
Do Some Maths
And Charts
Some Place To
Pour It In And
Clean / Prepare It
53. Data Storage And Preparation
Large Scale:
Hadoop Cluster
Cassandra
MPP SQL Columnar
Medium/Large Scale:
CouchBase
MongoDB
….
Selection Drivers
Volume
Scalability
55. The Corinthian Architecture
Lots of data
Some Place
To Perform
Rapid Calculations
Some Tools To
Do Some Maths
And Charts
Some Place To
Pour It In And
Clean / Prepare It
Statistics
Cohorts
Regressions
Bar Charts For Marketing
Nice Infographics for your Company Board
56. The Corinthian Architecture
Lots of data
Some Database
To Perform
Rapid Calculations
Some Tools To
Do Some Maths
Some Other
To Do Some
Charts
Some Place To
Pour It In And
Clean / Prepare It
60. The “One Database won’t
do it all” problem
Lots of data
Some Database
To Perform
Rapid Calculations
Some Tools To
Do Some Maths
Some Other
To Do Some
Charts
Some Place To
Pour It In And
Clean / Prepare It
JOIN / Aggregate
Rapid Group By Computations
Direct Access to the computed Results
to production etc..
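The JOIN and GROUP BY workload named above looks like this in miniature. The sketch uses SQLite from the Python standard library purely as a stand-in so that it runs anywhere; it is not one of the analytical databases the talk refers to, and the table names, columns and rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, country TEXT);
    CREATE TABLE orders (customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'FR'), (2, 'FR'), (3, 'DE');
    INSERT INTO orders VALUES (1, 20.0), (1, 30.0), (2, 50.0), (3, 10.0);
""")

# Join orders to customers, then aggregate revenue per country.
rows = conn.execute("""
    SELECT c.country, COUNT(*) AS n_orders, SUM(o.amount) AS revenue
    FROM orders AS o
    JOIN customers AS c ON c.id = o.customer_id
    GROUP BY c.country
    ORDER BY c.country
""").fetchall()
print(rows)
conn.close()
```

The query text is the part that carries over: an analytical database earns its place in the architecture by answering this kind of query quickly over billions of rows, which a single-file engine like the one used here is not designed for.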
61. The Roman Social Forum
Lots of data
Some Database
To Perform
Rapid Calculations
And Some Database
For Graphs
Some Tools To
Do Some Maths
Some Other
To Do Some
Charts
Some Place To
Pour It In And
Clean / Prepare It
63. The Key Value Store
Lots of data
Some Database
To Perform
Rapid Calculations
And Some Database
For Graphs And
Some Distributed Key
Value Store
Some Tools To
Do Some Maths
Some Other
To Do Some
Charts
Some Place To
Pour It In And
Clean / Prepare It
64. NoSQL
Search
• SOLR
• ElasticSearch
Document
• MongoDB
• CouchDB
KeyValue
• Redis
• HBase
…
Selection Drivers
Durability / Availability …
Performance
Ease of use and API
Indexing
65. Action requires Prediction
Lots of data
Some Database
To Perform
Rapid Calculations
And some database
for graphs And
Some Distributed Key
Value Store
Some Tools To
Do Some Maths
Some Other
To Do Some
Charts
Some Place To
Pour It In And
Clean / Prepare It
Draw A Line For the future
What are my real users groups ?
Should I launch a discount offering or not ?
To everybody or to specific users only ?
66. The Medieval Fairy Land
Lots of data
Some Tools To
Do Some Maths
Some Other
To Do Some
Charts and some
MACHINE LEARNING
Some Place To
Pour It In And
Clean / Prepare It
Some Database
To Perform
Rapid Calculations
And Some Database
For Graphs And
Some Distributed Key
Value Store
67. Predictions
Java
• Mahout (Hadoop)
• WEKA
Python
• Scikit-Learn
• PyML
R
Commercial
• Kxen
• SAS
• SPSS…
Selection Drivers
Scalability
Black Box / White Box ?
Data Management Integration
69. Exploratory Data Analysis
◦ Identifying and visualizing key patterns and correlations within the dataset
Unsupervised Learning
◦ Create groups of similar observations sharing same patterns (aka Clustering, Segmentation)
Supervised Learning
◦ Modeling a variable using independent features (aka Scoring, Predictive Modeling, Classification)
Time Series Forecasting
◦ Predict a time-dependent variable using its own history, and sometimes other covariates (variables)
Graph Analysis
◦ Analyzing relationships between a set of “nodes”, linked by “edges”
Associations / Sequences Mining
◦ Identifying frequently associated items within transactions/ events databases, sometimes ordered over time
And many more…
Classes of Machine Learning Problems
04/06/2013Dataiku - Innovation Services 69
70. Mapping ML to Business Questions
Class: Sample Business Questions
• Exploratory Data Analysis
  ◦ What does my dataset look like? What are the key correlations in my data?
• Unsupervised Learning
  ◦ Can I create groups of users who share the same purchasing behavior? The same navigation behavior?
• Supervised Learning
  ◦ Which users are likely to click on ad X? Which users are likely to convert to paying users? Who is going to leave my service? What is the profile of the users who do X?
• Time Series Forecasting
  ◦ What is my revenue forecast for next month? Given the weather forecast, can I also forecast my sales? Product sales forecast (for overbooking)
• Graph Analysis
  ◦ Can I identify influencers in my user community? Can I recommend new friends to my users?
• Association & Sequences Mining
  ◦ Which products are frequently bought together? What is the typical navigation path on my website?
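The "frequently bought together" question can be answered, in its simplest form, by counting how often each pair of products shares a basket. The sketch below is a brute-force illustration, not the Apriori algorithm; the baskets, the `frequent_pairs` helper and the 50% support threshold are hypothetical.

```python
from collections import Counter
from itertools import combinations

def frequent_pairs(baskets, min_support):
    """Return {(item_a, item_b): support} for pairs found in enough baskets."""
    counts = Counter()
    for basket in baskets:
        # Sort and de-duplicate so each pair is counted once per basket.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    total = len(baskets)
    return {
        pair: count / total
        for pair, count in counts.items()
        if count / total >= min_support
    }

# Hypothetical transactions.
baskets = [
    ["shoes", "socks", "belt"],
    ["shoes", "socks"],
    ["shoes", "belt"],
    ["socks", "hat"],
]
pairs = frequent_pairs(baskets, min_support=0.5)
print(pairs)
```

Counting every pair grows quadratically with basket size, which is why dedicated frequent-itemset algorithms prune candidates on larger catalogs.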
71. Machine Learning Methods Detailed
Analytical Task / ML Task: Sample Algorithms, Shape of Dataset
Exploratory Data Analysis
• Univariate Analysis
  ◦ Sample algorithms: distribution, frequencies, histogram, boxplots, fit tests...
  ◦ Shape of dataset: N obs. (1 row per obs.) * P features
• Bivariate Analysis
  ◦ Sample algorithms: scatterplots, correlations (Pearson, Spearman), GLM, Chi Square...
  ◦ Shape of dataset: N obs. (1 row per obs.) * P features
• Multivariate Analysis
  ◦ Sample algorithms: principal components analysis, multi-dimensional scaling, correspondence analysis, factor analysis…
  ◦ Shape of dataset: N obs. (1 row per obs.) * P features
“Oriented” Data Analysis
• Unsupervised Learning
  ◦ Sample algorithms: K-means, K-medoids, hierarchical clustering, Gaussian mixture models, mean shift, DBSCAN, spectral clustering...
  ◦ Shape of dataset: N obs. (1 row per obs.) * P features
• Supervised Learning
  ◦ Sample algorithms: linear & logistic regression, decision trees, neural networks, SVM, naïve Bayes, K-NN, random forests…
  ◦ Shape of dataset: N obs. (1 row per obs.) * P features
• Time Series Forecasting
  ◦ Sample algorithms: ARMA, VARMAX, ARIMA…
  ◦ Shape of dataset: time series (rows: time period, columns: measures)
• Graph Analysis
  ◦ Sample algorithms: centrality (closeness, betweenness, PageRank, HITS), modularity (Louvain)…
  ◦ Shape of dataset: nodes and edges lists (+ attributes)
• Associations & Sequences
  ◦ Sample algorithms: frequent itemsets, Apriori, market basket…
  ◦ Shape of dataset: (timestamped) events or transactions
72. Cluster a dataset
into K Buckets by
choosing the
“closest”
neighbours
Unsupervised Method
K-Means
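A minimal K-means sketch in pure Python, assuming 2-D points and Euclidean distance. The sample points, the seed and the fixed iteration count are illustrative choices, not values from the talk; real projects would normally rely on a library implementation such as the one in Scikit-Learn.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def assign(points, centroids):
    """Index of the closest centroid for every point."""
    return [
        min(range(len(centroids)), key=lambda c: squared_distance(p, centroids[c]))
        for p in points
    ]

def k_means(points, k, iterations=20, seed=0):
    """Lloyd iterations: assign points, then move each centroid to its bucket mean."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k of the data points
    for _ in range(iterations):
        labels = assign(points, centroids)
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a bucket ends up empty
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return centroids, assign(points, centroids)

# Hypothetical data: two visually obvious groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, labels = k_means(points, k=2)
print(centroids, labels)
```

On this toy data the three low points and the three high points end up in separate buckets. On real data the result depends on the starting centroids, so implementations usually run several random restarts.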
73. Predict the color of
a point depending
on the colors of its
K closest
neighbours
Supervised
K-Nearest-Neighbours
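The rule on this slide translates almost word for word into code. The sketch below assumes 2-D points, Euclidean distance and a simple majority vote; the training points and the `knn_predict` helper are hypothetical.

```python
from collections import Counter

def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def knn_predict(train, query, k=3):
    """Predict the color of `query` by majority vote among its k closest points.

    `train` is a list of (point, color) pairs.
    """
    nearest = sorted(train, key=lambda item: squared_distance(item[0], query))[:k]
    votes = Counter(color for _, color in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical labeled points.
train = [
    ((1, 1), "red"), ((1, 2), "red"), ((2, 1), "red"),
    ((8, 8), "blue"), ((9, 8), "blue"), ((8, 9), "blue"),
]
print(knn_predict(train, (2, 2)), knn_predict(train, (7, 7)))
```

There is no training step: the model is the dataset itself, and every prediction scans it. That is one reason the method is easy to explain but expensive on large datasets.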
74. Find the most
“significant” input
variable and split
value
Split the dataset
recursively
Supervised
Decision Tree
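The two steps above, finding the best split and then recursing, can be sketched as follows. The sketch assumes Gini impurity as the measure of a "significant" split, which the slide does not specify; the rows (age, newsletter flag), the labels and the helper names are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure node)."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(rows, labels):
    """Return (feature_index, threshold, weighted_gini), or None if no split exists."""
    best = None
    for feature in range(len(rows[0])):
        for threshold in sorted({row[feature] for row in rows}):
            left = [lab for row, lab in zip(rows, labels) if row[feature] <= threshold]
            right = [lab for row, lab in zip(rows, labels) if row[feature] > threshold]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if best is None or score < best[2]:
                best = (feature, threshold, score)
    return best

def build_tree(rows, labels, max_depth=3):
    """Split recursively until nodes are pure or max_depth is reached."""
    split = best_split(rows, labels) if max_depth > 0 and len(set(labels)) > 1 else None
    if split is None:
        return Counter(labels).most_common(1)[0][0]  # leaf: majority class
    feature, threshold, _ = split
    left = [(r, l) for r, l in zip(rows, labels) if r[feature] <= threshold]
    right = [(r, l) for r, l in zip(rows, labels) if r[feature] > threshold]
    return {
        "feature": feature,
        "threshold": threshold,
        "left": build_tree([r for r, _ in left], [l for _, l in left], max_depth - 1),
        "right": build_tree([r for r, _ in right], [l for _, l in right], max_depth - 1),
    }

# Hypothetical customers: (age, subscribed_to_newsletter) -> bought or not.
rows = [(25, 1), (30, 0), (45, 1), (50, 0), (60, 1), (65, 0)]
labels = ["no", "no", "no", "yes", "yes", "yes"]
tree = build_tree(rows, labels)
print(tree)
```

Here a single split on the first feature separates the classes, so the tree has one decision node and two leaves. The resulting structure can be read as plain rules, which is the interpretability argument for simple trees.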
75. Several Paths to Machine Learning
[Flowchart: paths from an analytical dataset to a family of methods. The Yes/No branch layout did not survive text extraction; the decision questions and the methods named in each area are listed below.]
Exploratory Data Analysis
• Entry points: “I just want to explore”; “I’m looking variable by variable, or pairs”
• Decision questions: All my variables are numeric? I have a distance matrix?
• Methods: data viz, correlation analysis, GLM, parametric and non-parametric statistical tests, PCA, CA, MDS…
Unsupervised Learning
• Entry point: “I’m looking for clusters”
• Decision questions: I know how many groups to look for? Small dataset (<<1K)? Medium dataset (<<100K)? I can sample?
• Methods: HCA, partitioning (K-means…), GMM, DP GMM, K-means + Gap | Silhouette | …, 2-step clustering, Affinity Propagation, Mean Shift…
Supervised Learning*
• Entry point: “I want to predict a variable”
• Decision question: I value interpretability? (Yes / Not only)
• Methods: Generalized Linear Model, simple decision tree, Support Vector Machines, neural networks, K-Nearest Neighbors, ensembles (Random Forest, Gradient Boosted Trees), MARS, Generalized Additive Model
* Methods generally working for both classification & regression
76. Questions?
Take Away
◦ There are new ways to perform data
analytics that are within your reach and
can bring business value
Some Additional Resources
◦ Open Source Projects
Dataiku Cloud Transport Client
http://dctc.io
Dataiku Web Tracker
https://github.com/dataiku/wt1
◦ Our Technical Blog
http://www.dataiku.com/blog