What 'kind of things' does a data scientist do? What are the foundations and principles of data science? What is a Data Product? What does the data science process look like? Learning from data: Data Modeling or Algorithmic Modeling? - talk by Carlos Somohano @ds_ldn at The Cloud and Big Data: HDInsight on Azure London 25/01/13
Big Data [sorry] & Data Science: What Does a Data Scientist Do?
1. Big Data [sorry] & Data Science: What Does a Data Scientist Do?
Carlos Somohano
Founder Data Science London
@ds_ldn
datasciencelondon.org
The Cloud and Big Data: HDInsight on Azure London 25/01/13
3. Man on the Moon – Small Data!
Computer Program
Apollo XI
Man on the Moon
Date: 1969
Speed: 3,500 km/hour
Distance: 356,000 km
64 Kb, 2Kb RAM, Fortran
Weight: 13,500 kg
Never been there before
Must work 1st time
Lots of complex data
Must return to Earth
4. Apollo XI, 1969
SkyDive Stratos, 2012
64 Kb
Tens of Gigabytes
Think About It – We live in Crazy Times!
6. What is Big Data? IT mumbo-jumbo
A fashionable term typically used by some IT
vendors to remarket old-fashioned software &
hardware
7. What is Big Data? The n-Vs
Volume …
Variety …
Velocity …
(add your own V here…)
So What?
8. Change! Water Cooler Chat
We need to parallelize data operations but it’s too costly & complex…
The business can’t get access to all the relevant data, we need external data…
We can’t match customer master data to live customer interactions…
We can’t just force everything into a star-schema…
These BI reports and charts don’t tell us anything we didn’t know…
We are missing the ETL window, the data we needed didn’t arrive on time…
We can’t predict with confidence if we can’t explore data & develop our own models
9. What is Big Data? Force of Change
Big Data forces you to change the way you collect,
store, manage, analyze and visualize data
11. Big Data = Crude Oil [not New Oil]
Think of data as ‘crude oil.’
Big Data is about extracting the ‘crude oil,’
transporting it in ‘mega-tankers,’ siphoning it through
‘pipelines,’ and storing it in massive ‘silos’…
All ‘this’ is about IT Big Data… fine and well…
… BUT
12. You need to refine the ‘crude oil’
Enter Data Science…
13. The Science [and Art] of…
Discovering what we don’t know from data
Obtaining predictive, actionable insight from data
Creating Data Products that have business impact now
Communicating relevant business stories from data
Building confidence in decisions that drive business value
14. Brief History of Data Science
6th C BC - 1st C BC – The Greeks! Pyrrhonism, Skepticism & Empiricism…
1974 – Peter Naur @UoC Datalogy & Data Science
2001 – William S. Cleveland @CSU Data Science: An Action Plan …
2002 – Committee on Data for Science and Technology (CODATA)
2003 – Journal of Data Science
2009 – Jeff Hammerbacher @ Facebook What does a Data Scientist Do?
2010 – Drew Conway @NYU The Data Science Venn Diagram
2010 – Hilary Mason & Chris Wiggins @Dataists
2010 – Mike Loukides @O’Reilly “What is Data Science?”
2011 – DJ Patil @LinkedIn data scientist vs. data analyst
15. Jeff Hammerbacher, 2009
“... on any given day, a team member could author a
multistage processing pipeline in Python,
design a hypothesis test, perform a regression analysis
over data samples with R,
design and implement an algorithm for some data-intensive
product or service in Hadoop, or
communicate the results of our analyses to other
members of the organization.”
16. Mike Loukides, 2010
Data science enables the creation of data
products.
Whether... data is search terms, voice samples, or
product reviews,... users are in a feedback loop in
which they contribute to the products they use.
That's the beginning of data science.
17. Hilary Mason & Chris Wiggins, 2010
Data science is clearly a blend of the hackers’ arts, statistics
and machine learning...;
and the expertise in mathematics and the domain of the
data for the analysis to be interpretable...
It requires creative decisions and open-mindedness in a
scientific context.
19. DJ Patil, 2011
“We realized that as our organizations grew, we both had to figure out
what to call the people on our teams. “Business analyst” and “data analyst”
seemed too limiting.
The focus of our teams was to work on data applications that would have
an immediate and massive impact on the business.
The term that seemed to fit best was data scientist: those who use both
data and science to create something new.”
21. The Duck-Billed Platypus
The Data Scientist-Billed Platypus
22. The Platypus-Billed Data Scientist
Machine Learning
Hacking
Statistics
Math
Visualization
Science
Programming
Data Mining
The Data Scientist-Billed Platypus
24. Class DataScientist {
Is skeptical, curious. Has inquisitive mind
Knows Machine Learning, Statistics, Probability
Applies Scientific Method. Runs Experiments
Is good at Coding & Hacking
Able to deal with IT & Data Engineering
Knows how to build data products
Able to find answers to known unknowns
Tells relevant business stories from data
Has Domain Knowledge
}
26. 10 Things [most] Data Scientists Do
1 Ask Good Questions. What is What…
…we don’t know?
…we’d like to know?
2 Define and Test a Hypothesis. Run experiments
3 Scoop, Scrape, Sink & Sample Business-Relevant Data
4 Munge and Wrestle Data. Tame Data
5 Explore Data, Discover Data Playfully. Discover unknowns.
6 Model Data. Model Algorithms.
7 Understand Data Relationships
8 Tell the Machine How to Learn from Data
9 Create Data Products that Deliver Actionable Insight
10 Tell Relevant Business Stories from Data
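Item 2 above (define and test a hypothesis, run experiments) can be illustrated with a permutation test. This is only a minimal sketch: the experiment, the numbers, and the function name `permutation_test` are hypothetical and do not come from the talk.

```python
import random
from statistics import mean

def permutation_test(control, treatment, n_permutations=10_000, seed=0):
    """Two-sided permutation test for a difference in means.

    Returns the observed difference (treatment - control) and the share of
    random relabelings whose absolute difference is at least as large.
    """
    rng = random.Random(seed)
    observed = mean(treatment) - mean(control)
    pooled = list(control) + list(treatment)
    n_control = len(control)
    extreme = 0
    for _ in range(n_permutations):
        # Under the null hypothesis the group labels are interchangeable.
        rng.shuffle(pooled)
        diff = mean(pooled[n_control:]) - mean(pooled[:n_control])
        if abs(diff) >= abs(observed):
            extreme += 1
    return observed, extreme / n_permutations

# Hypothetical experiment: average basket value under the old and new page layout.
control = [12.1, 11.8, 12.5, 11.9, 12.3, 12.0, 11.7, 12.2]
treatment = [12.9, 13.1, 12.7, 13.4, 12.8, 13.0, 13.3, 12.6]

observed, p_value = permutation_test(control, treatment)
print(f"observed difference = {observed:.4f}, p-value = {p_value:.4f}")
```

A small p-value says the observed difference would be rare if the layout made no difference. It does not say how large the effect is in business terms, which is where the storytelling in item 10 comes in.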
29. [Some] Data Science Principles
1 Socio-Technical Systems (STS) are complex!
2 Data is never at rest
3 Data is dirty, deal with it
4 SVoT = LOL!
5 Data munging & data wrestling: 70% of the time
6 Simplification. Reduction. Distillation
7 Curiosity. Empiricism. Skepticism
30. Knowns & Unknowns
There are known knowns. These are things we know
that we know.
There are known unknowns. That is to say, there are
things that we know we don't know.
But there are also unknown unknowns. There are
things we don't know we don't know
Donald Rumsfeld
31. DIKUW FTW!
D | I | K | U | W
Data | Information | Knowledge | Understanding | Wisdom
PAST to FUTURE
Raw | What | How to | Why | When
Numbers | Description | Experience | Cause & Effect | Prediction
Letters | Context | Tested | Proven | What’s best
Symbols | Relationship | Instruction
Signals | Reports | Programs | Models
Roles: Data Engineer, Data Analyst, Data Miner, Data Scientist
Known Knowns, Known Unknowns, Unknown Unknowns
32. Data Discovery
Data Analyst
Data Scientist
The new reality for Business Intelligence and Big Data, Applied Data Labs
33. Data Models vs. Algorithmic Models
Data Modeling: Y ← F(X, random noise, parameters)
We understand the world
How well ‘my data model’ works
Statisticians, Data Analysts, Data Miners
Linear Regression, Logistic Regression
Known Distributions
Confidence Intervals
Predictor Variables, Goodness of Fit
VS.
Algorithmic Modeling: Y ← Black Box ← X
We don’t understand the world
The world produces data in a black-box
Data Scientists
Machine Learning, AI, Neural Nets
Random Forests, SVM, GBT
Unknown Multivariate Distributions
Iterative
Predictive Accuracy
“Statistical Modeling: The Two Cultures” Leo Breiman, 2001
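The contrast between the two cultures can be sketched on synthetic data. The code below is an illustrative assumption, not something shown in the talk: the data-modeling side assumes a straight line and fits it by least squares, while the algorithmic side uses k-nearest-neighbours regression as a simple stand-in for the black-box methods named on the slide (random forests, SVM, GBT). Both are judged by predictive accuracy on held-out data.

```python
import math
import random

def fit_line(xs, ys):
    """Data modeling: assume y = a + b*x + noise and estimate a, b by least squares."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    a = mean_y - b * mean_x
    return lambda x: a + b * x

def fit_knn(xs, ys, k=5):
    """Algorithmic modeling: assume no functional form, average the k nearest points."""
    pairs = list(zip(xs, ys))
    def predict(x):
        nearest = sorted(pairs, key=lambda pair: abs(pair[0] - x))[:k]
        return sum(y for _, y in nearest) / k
    return predict

def mse(model, xs, ys):
    """Mean squared prediction error of a fitted model on (xs, ys)."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

rng = random.Random(42)

def sample(n):
    # Synthetic "world": a non-linear process the modeler does not know about.
    xs = [rng.uniform(0, 6) for _ in range(n)]
    ys = [math.sin(x) + rng.gauss(0, 0.1) for x in xs]
    return xs, ys

train_x, train_y = sample(200)
test_x, test_y = sample(100)

line_mse = mse(fit_line(train_x, train_y), test_x, test_y)
knn_mse = mse(fit_knn(train_x, train_y), test_x, test_y)
print(f"linear model test MSE: {line_mse:.3f}")
print(f"k-NN model test MSE:   {knn_mse:.3f}")
```

The straight line is easy to explain but mis-specified here, so the black-box predictor is more accurate. If the world really were linear, the comparison could go the other way, which is the trade-off the slide points at.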
34. Learning from Data is Tricky
Statistical vs. Machine Learning
Supervised vs. Unsupervised Learning
Induction vs. Deduction
Sampling & Confidence Intervals
Probability Distribution
Deviation & Variance
Correlation vs. Causation
Causation & Prediction
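Correlation vs. causation is the item on this list that is easiest to show in a few lines. The simulation below is a hedged sketch with made-up variables (temperature, ice cream sales, sunburn cases): a hidden common cause makes two series strongly correlated although neither influences the other, and the correlation largely disappears once the common cause is held roughly fixed.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sum((x - mean_x) ** 2 for x in xs) ** 0.5
    sd_y = sum((y - mean_y) ** 2 for y in ys) ** 0.5
    return cov / (sd_x * sd_y)

rng = random.Random(7)

# Hypothetical confounder: daily temperature drives both series below.
temperature = [rng.uniform(10, 35) for _ in range(2000)]
ice_cream = [2.0 * t + rng.gauss(0, 5) for t in temperature]
sunburn = [0.5 * t + rng.gauss(0, 2) for t in temperature]

overall = pearson(ice_cream, sunburn)

# Hold the confounder roughly fixed: keep only days between 20 and 21 degrees.
stratum = [(i, s) for t, i, s in zip(temperature, ice_cream, sunburn) if 20 <= t < 21]
within = pearson([i for i, _ in stratum], [s for _, s in stratum])

print(f"correlation over all days:        {overall:.2f}")
print(f"correlation at fixed temperature: {within:.2f}")
```

A model trained on these data could still predict sunburn from ice cream sales, but an intervention on ice cream sales would change nothing.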
35. More Data or Better Models?
More Data Beats Better Algorithms, Omar Tawakol @BlueKai
Better Algorithms Beat More Data, Mark Torrance @RocketFuel
More Data or Better Models, Xavier Amatriain @Netflix
On Chomsky and the Two Cultures of Statistical Learning, Peter Norvig @Google
Specialist Knowledge is Useless and Unhelpful, Jeremy Howard @Kaggle
37. Data Science Process - I
Questions: 1 Known Unknowns? 2 We’d like to know…? 3 Outcomes? 4 What Data? 5 Hypothesis?
The World: Product Manufactured, Goods Shipped, Product Purchased, Phone Calls Made, Energy Consumed, Fraud Committed, Repair Requested, System
Ingest Raw Data: Transactions, Web-Scraping, Web-Click Logs, Sensor Data, Mobile Data, Docs, Emails, XLS, Social Feeds, RSS, Flume Sink HDFS
Munge Data: MapReduce, ETL, ELT, Data Wrangle, Data Cleansing, Data Jiu-Jitsu, Dim Reduction, Sample, Select, Join, Bind
The Dataset: Independence? Correlation? Covariance? Causality? Dimensionality? Missing Values? Relevant?
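The munging step between raw data and "The Dataset" (cleansing, missing values, selecting) might look like the sketch below. The records, field names, and the median-fill rule are illustrative assumptions. A real pipeline would choose its own imputation strategy.

```python
from statistics import median

# Hypothetical raw rows as they might arrive from ingestion: all strings, some dirty.
raw_rows = [
    {"store": "A", "units": "12", "temp_c": "18.5"},
    {"store": "A", "units": "12", "temp_c": "18.5"},  # exact duplicate
    {"store": "B", "units": "", "temp_c": "21.0"},    # missing units
    {"store": "C", "units": "7", "temp_c": "n/a"},    # unparseable temperature
    {"store": "D", "units": "30", "temp_c": "25.5"},
]

def to_float(text):
    """Parse a number, returning None for blanks and junk instead of raising."""
    try:
        return float(text)
    except (TypeError, ValueError):
        return None

def clean(rows, numeric_columns=("units", "temp_c")):
    """Drop exact duplicates, coerce types, and fill missing numbers with the column median."""
    seen, parsed = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key in seen:
            continue
        seen.add(key)
        record = {"store": row["store"]}
        for column in numeric_columns:
            record[column] = to_float(row[column])
        parsed.append(record)
    for column in numeric_columns:
        known = [r[column] for r in parsed if r[column] is not None]
        fill = median(known)
        for r in parsed:
            if r[column] is None:
                r[column] = fill
    return parsed

dataset = clean(raw_rows)
for record in dataset:
    print(record)
```

Filling with the median is only one answer to the "Missing Values?" question on the slide. Dropping the row or modeling the missingness are equally legitimate, depending on the data.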
38. Data Science Process - II
Stages: The Dataset, Explore Data, Represent Data, Discover Data, Deliver Insight, Learn From Data, Data Product, Visualize Insight
Learn From Data: Description & Inference, Data & Algorithm Models, Machine Learning, Networks & Graphs, Regression & Prediction, Classification & Clustering, Experiments & Iteration
Data Product: Objectives, Levers, Modeling, Simulation, Optimization, Visualization
Deliver Insight: Actionable, Predictive, Immediate Impact, Business Value, Easy to explain
40. A Data Product Is…
… Curated and crafted from raw data
… A result of exploration and iterations
… A machine that learns from data
… An answer to known unknowns or unknown unknowns
… A mechanism that triggers immediate business value
… A probabilistic window of future events or behavior
41. Data Jiu-Jitsu
Data
Jiu Jitsu Fight
$$$$
Data Product
Data Scientist
Data Jiu-Jitsu: ability to turn big data into data products that generate immediate business value
(DJ Patil @LinkedIn)
42. Developing Data Products
Objectives: What Outcome Am I Trying to Achieve?
Levers: What Inputs Can We Control?
Data: What Data Can We Collect?
Models: How the Levers Influence the Objectives
Adapted from “Designing Great Data Products. The Drivetrain Approach: A Four Step Approach to Building Data Products”
Jeremy Howard, Margit Zwemer, Mike Loukides, 2012
43. Objective-Based Data Products
What Outcome Am I Trying to Achieve?
The Model Assembly Line: Data, Modeler, Simulator, Optimizer, Actionable Outcome
Adapted from “Designing Great Data Products. The Drivetrain Approach: A Four Step Approach to Building Data Products”
Jeremy Howard, Margit Zwemer, Mike Loukides, 2012
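The Modeler, Simulator, Optimizer chain can be sketched for a single lever. Everything below is a hypothetical illustration, not the authors' implementation: the objective is revenue, the lever is price, the data are six invented (price, conversion rate) observations, and the model is a straight-line demand curve.

```python
# Hypothetical data: price charged and the share of visitors who bought at that price.
history = [(5.0, 0.50), (6.0, 0.44), (7.0, 0.41), (8.0, 0.35), (9.0, 0.29), (10.0, 0.26)]

def modeler(data):
    """Model how the lever (price) influences behaviour: conversion = a + b * price."""
    n = len(data)
    mean_p = sum(p for p, _ in data) / n
    mean_c = sum(c for _, c in data) / n
    b = (sum((p - mean_p) * (c - mean_c) for p, c in data)
         / sum((p - mean_p) ** 2 for p, _ in data))
    a = mean_c - b * mean_p
    # Clamp so the predicted conversion rate stays a valid probability.
    return lambda price: min(1.0, max(0.0, a + b * price))

def simulator(conversion_model, price, visitors=1000):
    """Simulate the objective (expected revenue) for one setting of the lever."""
    return visitors * conversion_model(price) * price

def optimizer(conversion_model, candidate_prices):
    """Choose the lever setting with the best simulated outcome."""
    return max(candidate_prices, key=lambda price: simulator(conversion_model, price))

conversion_model = modeler(history)
candidate_prices = [5.0 + 0.25 * step for step in range(21)]  # 5.00 to 10.00
best_price = optimizer(conversion_model, candidate_prices)
print(f"recommended price: {best_price:.2f}")
print(f"simulated revenue per 1000 visitors: {simulator(conversion_model, best_price):.2f}")
```

The actionable outcome is the recommended price, not the fitted coefficients. Swapping in a better modeler leaves the simulator and optimizer untouched, which is the point of treating them as an assembly line.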
45. Customer Lifecycle Value
Optimize CLV
Product Recommendations
Visualizer
Data
Modeler
Simulator
Optimizer
1 Products the customer may like
2 Price Elasticity
3 Probability of Purchase w/o Recommendation
4 Purchase Sequence
5 Causality Model
6 Patience Model
Adapted from “Designing Great Data Products. The Drivetrain Approach: A Four Step Approach to Building Data Products”
Jeremy Howard, Margit Zwemer, Mike Loukides, 2012
46. Automated Fruit Procurement
Confirm Purchase Orders in less than 2 hours
12,000 stores, 300 fruits, avg. shelf life 3 days
Safety stock levels? Demand vs. stock? Price vs. demand? Anomalies? Fruit shortages? Fruit write-offs?
Adapted from Blueyonder
47. Strawberries & the Weather
No sales vs X,XXX sales predicted
Why these huge stock write-offs?
A Predictive Model that calculates
strawberry purchases based on
Weather forecast
Sudden increase in temperature
Store temperature
Freezer sensor data
Remaining stock per shelf life
Sales TPoS feeds
Web searches, social mentions
Adapted from Blueyonder
48. Personalized Social Recommendations
Collaborative Filtering: Matching Skills to People
Prediction: Personalized Skills Recommendation
Adapted from “Developing Data Products” by Peter Skomoroch 5 Dec, 2012 Copyright LinkedIn
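A rough idea of how collaborative filtering can match skills to people is sketched below. This is not LinkedIn's method: the profiles are invented, and item-based cosine similarity over skill co-occurrence is just one common variant of the technique.

```python
from collections import defaultdict
from math import sqrt

# Hypothetical member profiles: person -> set of listed skills.
profiles = {
    "ana": {"python", "statistics", "machine learning"},
    "ben": {"python", "machine learning", "hadoop"},
    "chloe": {"statistics", "r", "machine learning"},
    "dev": {"python", "hadoop"},
    "eve": {"r", "statistics"},
}

def skill_similarity(profiles):
    """Cosine similarity between skills, based on how many people list both."""
    owners = defaultdict(set)
    for person, skills in profiles.items():
        for skill in skills:
            owners[skill].add(person)
    similarity = defaultdict(dict)
    for a in owners:
        for b in owners:
            if a != b:
                overlap = len(owners[a] & owners[b])
                if overlap:
                    similarity[a][b] = overlap / sqrt(len(owners[a]) * len(owners[b]))
    return similarity

def recommend(person, profiles, top_n=2):
    """Score skills the person lacks by their similarity to skills the person has."""
    similarity = skill_similarity(profiles)
    have = profiles[person]
    scores = defaultdict(float)
    for skill in have:
        for other, sim in similarity[skill].items():
            if other not in have:
                scores[other] += sim
    # Highest score first; ties broken alphabetically for a stable result.
    return sorted(scores, key=lambda s: (-scores[s], s))[:top_n]

print(recommend("dev", profiles))  # ['machine learning', 'statistics']
```

No skill is described by its content here. The recommendation comes entirely from which skills tend to appear together on other people's profiles.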
49. Colas: In Which US State Do I Invest Mktg. $?
What the Business Analyst Sent
What the Data Scientist did…
50. The Great Pop vs. Soda Page
http://www.popvssoda.com/
53. Interested in Data Science?
Join our community
http://www.meetup.com/Data-Science-London/
Follow us on Twitter
@ds_ldn
Check out our blog
http://datasciencelondon.org