This document discusses the development of predictive applications and outlines a vision for a platform called "Blue Box" that could help address many of the challenges in building and deploying these applications at scale. It notes that building predictive applications currently requires integrating multiple separate components. The document then describes desired features for the Blue Box platform, such as data cleansing, external data integration, model updating, decision logic, auditing, and serving predictions in real-time. It poses questions about how such a platform could be created, whether through open source or a commercial offering.
1. Imagine
How will predictive applications be put in production 5 years from now?
Our Goal Today
How are we doing today?
What is difficult?
What should be simpler?
2. What is a predictive application?
Churn Prevention
Fraud Detection
Demand Forecast
Targeting
Maintenance
Match Making
Ad Bidding
Drug Studies
Pricing
Ranking
3. This discussion is not relevant to all
Use cases: Churn, Maintenance, Drug Studies, Bidding
Compared on: Data Span, Retrain every …, Score every …
Bidding: Two Weeks of data, Sub-Second scoring
Other values on the slide: Multi-Years (x3), Weekly (x2), Yearly (x2), Monthly (x2), Day
Production = Dev
Online Learning
4. Not just a “model”
Data Collection
Data Prep
Domain-Specific Feature Eng.
Feature Eng.
Model(s)
Scoring / Decision
Let’s call this a Predictive Service Specification
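One way to picture a Predictive Service Specification is as a single versioned object that lists every stage in order, so the whole chain can be run as one unit. The sketch below is only an illustration: `Stage`, `PredictiveServiceSpec`, the lambdas, and the toy "model" are hypothetical names and placeholders, not something defined in the talk.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List


@dataclass
class Stage:
    """One step of the chain (hypothetical structure)."""
    name: str                   # e.g. "Data Prep"
    run: Callable[[Any], Any]   # transformation applied at this stage


@dataclass
class PredictiveServiceSpec:
    """All stages of a predictive service, kept together under one version."""
    version: str
    stages: List[Stage] = field(default_factory=list)

    def execute(self, payload: Any) -> Any:
        # Apply every stage in order, feeding each output to the next stage.
        for stage in self.stages:
            payload = stage.run(payload)
        return payload


# Toy stages; the "model" is a hard-coded placeholder, not a trained model.
spec = PredictiveServiceSpec(
    version="v1",
    stages=[
        Stage("Data Collection", lambda raw: dict(raw)),
        Stage("Data Prep", lambda r: {**r, "age": int(r["age"])}),
        Stage("Feature Eng.", lambda r: {**r, "is_senior": r["age"] >= 60}),
        Stage("Model(s)", lambda r: {**r, "proba": 0.8 if r["is_senior"] else 0.2}),
        Stage("Scoring / Decision", lambda r: "A" if r["proba"] > 0.5 else "B"),
    ],
)

decision = spec.execute({"age": "64"})
```

Keeping the stages in one object means build time and run time share the same definition instead of being re-implemented separately.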
5. How much effort?
Pipeline: Data Collection -> Data Prep -> Domain-Specific Feature Eng. -> Feature Eng. -> Model(s) -> Scoring / Decision
Effort share per stage (six stages): 20% 30% 25% 5% 5% 15%
6. Who Does What?
Pipeline: Data Prep -> Domain-Specific Feature Eng. -> Feature Eng. -> Model(s) -> Scoring / Decision
Data Domain Engineers
Data Analysts
Data Scientists
Business Intelligence Engineers
7. Huge Variety of Tech
Data Collection: ETL? Ad-hoc?
Data Prep: ETL? Ad-hoc?
Domain-Specific Feature Eng.: ETL? SQL? R? Python?
Feature Eng.: Matlab? R? Python?
Model(s): R? Python? SAS?
Scoring / Decision: Java / Python, Business Rules Management System
8. From Build to Run
Build Time: Data Prep -> Domain-Specific Feature Eng. -> Feature Eng. -> Model(s) -> Scoring / Decision
Run Time: Input Data -> ? -> Decision
9. How Do People Do That Today?
Pipeline: Data Collection -> Data Prep -> Domain-Specific Feature Eng. -> Feature Eng. -> Model(s) -> Scoring / Decision
Technologies: ETL, Script/SQL, PMML, WebService
A Predictive Service = up to 4 different “applications” that can run out of sync
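To make the out-of-sync risk concrete, here is a minimal sketch of a consistency check across the separately deployed pieces. The component names mirror the technologies named on this slide; the version strings and the `out_of_sync` helper are hypothetical, introduced only for illustration.

```python
from typing import Dict, List


def out_of_sync(deployed: Dict[str, str], expected_version: str) -> List[str]:
    """Return the components whose deployed version differs from the expected one."""
    return sorted(
        name for name, version in deployed.items() if version != expected_version
    )


# Hypothetical deployment state: one component lags behind the others.
deployed = {"ETL": "v3", "Script/SQL": "v3", "PMML": "v2", "WebService": "v3"}
stale = out_of_sync(deployed, expected_version="v3")
```

With four separately released applications, a check like this has to be run by hand or by convention, which is part of the problem being described.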
10. Some Integrated Per-Platform Approaches
in Database: SQL Commercial Warehouse + Scoring UDF
in SAS: End-to-end integration script
in Hadoop/Spark: Ad-hoc development
12. Reason 1: Prohibitive Costs Kill Projects
Pipeline (shown twice): Data Prep -> Domain-Specific Feature Eng. -> Feature Eng. -> Model(s) -> Scoring / Decision
Technologies, first pipeline: R, SQL, Python, R
Technologies, second pipeline: SQL, ETL, WebService, SQL, PMML
Costs: 300K$, 50K$, 200K$, 100K$
50K$
650K$
13. Reason 2: Distribution Drift
New behaviour
New product
New competitor
Model stops working as planned
You need to be able to do a same-week update
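Drift has to be noticed before it can trigger an update. The sketch below uses the Population Stability Index (PSI), which is one common drift measure chosen here for illustration and not a method named in the talk. The bin edges, the `1e-6` floor, and the `0.25` threshold (a widely used rule of thumb) are all assumptions.

```python
import math
from typing import List, Sequence


def bin_shares(values: Sequence[float], edges: Sequence[float]) -> List[float]:
    """Share of values falling in each bin defined by sorted edges."""
    counts = [0] * (len(edges) + 1)
    for v in values:
        # Bin index = number of edges that v has reached or passed.
        counts[sum(1 for e in edges if v >= e)] += 1
    # Small floor keeps empty bins from causing log(0) or division by zero.
    return [max(c / len(values), 1e-6) for c in counts]


def psi(reference: Sequence[float], recent: Sequence[float],
        edges: Sequence[float]) -> float:
    """Population Stability Index between a reference and a recent sample."""
    ref, rec = bin_shares(reference, edges), bin_shares(recent, edges)
    return sum((r - f) * math.log(r / f) for f, r in zip(ref, rec))


def needs_update(reference: Sequence[float], recent: Sequence[float],
                 edges: Sequence[float], threshold: float = 0.25) -> bool:
    """Flag the model for an update when the feature distribution moved too far."""
    return psi(reference, recent, edges) > threshold


# Toy samples of one feature: training-time values vs. values seen this week.
reference = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
this_week = [0.7, 0.8, 0.9, 0.95, 0.85, 0.75, 0.5, 0.6, 0.9]
edges = [0.33, 0.66]
drifted = needs_update(reference, this_week, edges)
```

A check like this only says that inputs changed; deciding whether to retrain, roll back, or adjust the decision rules remains a separate step.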
14. Reason 3: Mitigating Data Hazards
You need to be able to do a same-week update
Most interesting “Big Data” sources are fragile
15. Reason 4: Deciding Goes Beyond Predicting
Most interesting problems require combining
Models + Heuristics + Non-local Optimization
16. Reason 5: “Suits ready” for scalability
Pipeline: Data Prep -> Domain-Specific Feature Eng. -> Feature Eng. -> Model(s) -> Scoring / Decision
Your CTO could certainly keep it up and running all by himself
17. Imagine the Dream Platform That Would Solve All This
New Data -> ? -> Decision
Let’s call it Blue Box
18. Feature: Cleansing, Enriching and Merging
Blue Box must be the perfect Data Blending runtime
19. Feature: Aggregating Data
Raw Events Stream (1TB-100TB+) -> Aggregate State (100MB-10GB)
Consolidating History Must Be Part of Blue Box
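Consolidating history means folding a large event stream into a much smaller per-entity state that can be updated incrementally. The sketch below assumes events are `(entity_id, timestamp, amount)` tuples; that shape, the `EntityState` fields, and the `update_state` helper are hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple

Event = Tuple[str, int, float]  # (entity_id, timestamp, amount), assumed shape


@dataclass
class EntityState:
    """Compact summary kept per entity instead of its full event history."""
    count: int = 0
    total: float = 0.0
    last_ts: int = 0


def update_state(state: Dict[str, EntityState],
                 events: Iterable[Event]) -> Dict[str, EntityState]:
    """Fold new raw events into the existing aggregate state, in place."""
    for entity_id, ts, amount in events:
        s = state.setdefault(entity_id, EntityState())
        s.count += 1
        s.total += amount
        s.last_ts = max(s.last_ts, ts)
    return state


# First batch builds the state, second batch updates it without replaying history.
state = update_state({}, [("u1", 1, 10.0), ("u2", 2, 5.0), ("u1", 3, 2.5)])
state = update_state(state, [("u2", 4, 1.0)])
```

Because each update only touches the aggregate, the raw stream can stay in cheap storage while scoring reads the small state.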
20. Feature: External Data Compliant
main data + additional data (e.g. Census, Map, etc.) -> enriched main data
Third-Party Data Must Be “In” the Blue Box
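Enrichment can be sketched as a left join of the main data against an external table keyed on a shared column. Everything concrete below is made up for illustration: the `zip` key, the `median_income` column, its value, and the `enrich` helper are hypothetical.

```python
from typing import Dict, List

Row = Dict[str, object]


def enrich(main_rows: List[Row], additional_by_key: Dict[str, Row], key: str) -> List[Row]:
    """Left join: keep every main row, add external columns when the key matches."""
    enriched = []
    for row in main_rows:
        extra = additional_by_key.get(row[key], {})
        # Main data wins if both sides define the same column.
        enriched.append({**extra, **row})
    return enriched


# Toy data: one row has matching external data, the other does not.
main = [{"zip": "75001", "customer": "c1"}, {"zip": "99999", "customer": "c2"}]
census = {"75001": {"median_income": 30000}}
enriched = enrich(main, census, key="zip")
```

Rows without a match pass through unchanged, so missing external data degrades features instead of dropping records.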
21. Feature: Update Data Service
Smart Lazy Human
A/B Test Support in Blue Box
Decision Ver. A
Decision Ver. B
P D F M S
New Model
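A/B test support needs a stable way to route each entity to Decision Ver. A or Ver. B. The sketch below uses hash-based bucketing, a common approach that is assumed here and not described in the talk; `assign_variant`, the experiment name, and the `share_b` parameter are hypothetical.

```python
import hashlib


def assign_variant(entity_id: str, experiment: str, share_b: float = 0.5) -> str:
    """Deterministically route an entity to decision version "A" or "B"."""
    digest = hashlib.sha256(f"{experiment}:{entity_id}".encode("utf-8")).hexdigest()
    # First 32 bits of the hash mapped to a value in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return "B" if bucket < share_b else "A"


# The same entity always lands in the same version for a given experiment.
variant = assign_variant("customer-42", "new-model-rollout", share_b=0.1)
```

Hashing on the experiment name as well as the entity keeps assignments independent across experiments, and raising `share_b` gradually rolls the new model out.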
22. Feature: Programmatic Decision
Need for Business-Compliant “Real-Time” Rules in Blue Box
model 1, model 2, model 3
if / combine / with
if proba > 0.63: decision A, else: decision B
if proba > 0.79: decision A, else: decision B
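The rules on this slide can be sketched as a small function that combines several model outputs and applies a threshold. The thresholds 0.63 and 0.79 come from the slide; averaging as the "combine" step, the constant toy models, and the `decide` helper are assumptions made for illustration.

```python
from typing import Callable, Dict, Sequence

Features = Dict[str, float]
Model = Callable[[Features], float]  # returns a probability in [0, 1]


def decide(features: Features, models: Sequence[Model], threshold: float) -> str:
    """Combine model probabilities, then apply a business threshold."""
    if not models:
        raise ValueError("at least one model is required")
    probas = [model(features) for model in models]
    proba = sum(probas) / len(probas)  # "combine" = plain average (assumption)
    return "decision A" if proba > threshold else "decision B"


# Toy stand-ins for model 1, model 2, model 3: each returns a fixed probability.
models = [lambda f: 0.6, lambda f: 0.7, lambda f: 0.8]

first_rule = decide({}, models, threshold=0.63)   # rule with the 0.63 cutoff
second_rule = decide({}, models, threshold=0.79)  # rule with the 0.79 cutoff
```

Keeping the threshold as a parameter lets the business rule change without retraining or redeploying the models.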
23. Feature: Audit and Logs
Smart Lazy Human
?
Blue Box needs to keep track of its decisions and why
Decision Cause Log
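A decision cause log could record, for each decision, what was decided, by which model version, and which inputs weighed most. The sketch below is an assumption about what such a record might contain; the field names, the per-feature `contributions` input, and the `log_decision` helper are hypothetical.

```python
import json
from typing import Dict, List


def log_decision(log: List[str], entity_id: str, decision: str, proba: float,
                 contributions: Dict[str, float], model_version: str,
                 ts: int, top_n: int = 2) -> dict:
    """Append one audit record, keeping the top contributing features as causes."""
    causes = sorted(contributions, key=lambda k: abs(contributions[k]), reverse=True)
    record = {
        "ts": ts,
        "entity_id": entity_id,
        "model_version": model_version,
        "proba": proba,
        "decision": decision,
        "causes": causes[:top_n],
    }
    log.append(json.dumps(record, sort_keys=True))  # one JSON line per decision
    return record


# Toy contributions: how much each feature pushed the score (made-up numbers).
audit_log: List[str] = []
record = log_decision(
    audit_log, "customer-42", "decision A", 0.81,
    {"days_inactive": 0.4, "plan_price": -0.1, "support_tickets": 0.3},
    model_version="v7", ts=1700000000,
)
```

Storing the model version next to each decision is what later makes rollback and A/B comparisons auditable.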
24. What does Blue Box look like?
External Data
Advanced Join / Matching
Ad-Hoc Transformation
Python / R / Spark DataFrame transformations
SQL-Like Transformations
Scoring Causes / Audit
A/B Test Support
Model Rollback / Versioning
Prediction Log, Stats / Audit
Ad-hoc scoring/decision code
Open Source
?
25. Interesting / Potential Open Source Projects
Real-Time Entity Update, Management, Scoring
Open Source PMML Scoring in Java
Oryx: Lambda Architecture built on Spark and Kafka, with specialisation on real-time machine learning
26. How will we create the “blue box”?
?
Specification ? PMML Extension ?
Open Source Framework ?
Hadoop / Spark Specific ?
27. Thank you!
is blue
Convince decision makers to make data their competitive advantage
florian.douetteau@dataiku.com
jobs@dataiku.com
Wanna work on this topic?
Wanna share your dream features?