Data Science: The Product Manager's Primer

by Andrew Koller and Doron Bergman
/Productschool @ProdSchool /ProductmanagementSF

Who we are
Andrew Koller
koller.andrew.j@gmail.com
Five years' experience as an
entrepreneur and product manager
including a background in statistical
physics modeling and data science.
Doron Bergman
PhD in theoretical solid state physics.
Three years' experience in large tech and
startups.

Overview
Everything you need to know to understand the world of data science in startups
➔ Back to the Basics
An overview of statistics and the mathematical basis
➔ Data Science and AI
How Data Science differs from statistics and makes predictions in the real world
➔ Putting it all together with an example
Provide a simple unifying message for what is to come
➔ Data Scientist perspectives
How to understand objections your data scientists might have and how to be their hero
NOTE: Find this presentation at tiny.cc/ProductSchoolDataScience.

How popular is Data Science
really?

Really Popular!
(Especially in the Bay Area)

Q: What is data science and
AI? Where do I start?
A: Statistics

Statistical
Background
A quick review and overview of the statistics
you’ll need to communicate with your team
➔ Regression
Describing a relationship between two
variables.
➔ Confidence
Understanding, to some measure,
how valid and strong the result is.

Q: What is statistics and why
are you talking about that?
I’m here for data science.

A: Statistics is a way to
describe and make
generalizations about a
population of data. It forms
the basis of Data Science.

Regression is the
description of a variety of
data points with a function.
The most basic form is
linear regression
in the form of y=Mx+b

Money and Wins
Money might not be able to buy
happiness, but it sure seems to buy
wins in Major League Baseball.
Y = M * x + b
Wins = 0.153 wins/$Million * Payroll
(in $Million) + 66.4 Wins
Or about $6.5 Million per win
Actual calculation will be left as an exercise
Image source: New York Times, 2010
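
One way to work through the exercise above, as a minimal sketch: the slope and intercept are the ones quoted on the slide, while the function names and the $100 million payroll are hypothetical and chosen only for illustration.

```python
# Coefficients quoted on the slide: wins per $ million of payroll, and baseline wins.
SLOPE_WINS_PER_MILLION = 0.153
INTERCEPT_WINS = 66.4


def predicted_wins(payroll_millions: float) -> float:
    """Evaluate Wins = M * Payroll + b for a payroll given in $ millions."""
    return SLOPE_WINS_PER_MILLION * payroll_millions + INTERCEPT_WINS


def cost_per_extra_win_millions() -> float:
    """One extra win costs 1 / slope, in $ millions."""
    return 1.0 / SLOPE_WINS_PER_MILLION


# Hypothetical payroll of $100 million (not a figure from the slide).
print(round(predicted_wins(100.0), 1))           # 81.7
print(round(cost_per_extra_win_millions(), 2))   # 6.54
```

The "$6.5 Million per win" figure is simply the inverse of the slope.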

Q: How well does a linear
relationship apply to
everything?
A: Pretty well*

US Population
From 1650 to 1850 the US population
grew nonlinearly
Y = A * e^(b * x)
Sadly not linear... Or is it?
Image source:
http://onlinestatbook.com/2/transformations/tukey.html

US Population
But after taking the log of the
population, the function becomes
log(Y) = M * x + b
Image source:
http://onlinestatbook.com/2/transformations/tukey.html

Many datasets can be made
linear after just one
transformation
Q: What if that isn’t good
enough?

The next step beyond
linear regression is called
polynomial
regression, in the
form of y = Nx^2 + Mx + b or
higher order. Each
additional term lets the
curve follow the data
more closely.
More general (math nerd)
form:
y = ∑ M_n x^n
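
A minimal sketch of polynomial regression using NumPy's `polyfit`. The data is synthetic (a made-up cubic plus noise), and the check only looks at how closely each curve matches the points it was fitted on.

```python
import numpy as np

# Synthetic, illustrative data: a cubic trend plus random noise.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 4.0, 15)
y = 2.0 + 0.5 * x - 0.8 * x**2 + 0.15 * x**3 + rng.normal(0, 0.3, size=x.size)


def training_error(degree: int) -> float:
    """Fit a polynomial of the given degree and return its mean squared residual."""
    coeffs = np.polyfit(x, y, degree)       # least-squares polynomial fit
    predicted = np.polyval(coeffs, x)
    return float(np.mean((predicted - y) ** 2))


errors = {degree: training_error(degree) for degree in (1, 2, 3)}
for degree, err in errors.items():
    print(degree, round(err, 4))
```

The error on the fitted points never goes up as terms are added. Whether the higher-order curve predicts new points better is a separate question.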

Q: How accurate is my
regression line? How is it
measured?

Accuracy (or alternatively
error) is measured by
taking the distance
between the value
predicted by the
regression and the actual
value. This is called the
residual.
Data scientists often
express residuals as
residuals squared.
A’s have a positive
residual
Mets have a negative
residual
Which team would you rather
be?
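
A small sketch of the residual calculation, using the payroll line from the earlier slide. The two teams and their payroll and win totals are hypothetical stand-ins, not the real A's or Mets figures.

```python
def predicted_wins(payroll_millions: float) -> float:
    """Regression line quoted earlier: Wins = 0.153 * Payroll + 66.4."""
    return 0.153 * payroll_millions + 66.4


def residual(payroll_millions: float, actual_wins: float) -> float:
    """Residual = actual value minus the value predicted by the regression."""
    return actual_wins - predicted_wins(payroll_millions)


# Hypothetical teams: (payroll in $ millions, actual wins).
low_payroll_team = (50.0, 85.0)     # wins more than its payroll predicts
high_payroll_team = (130.0, 75.0)   # wins fewer than its payroll predicts

for payroll, wins in (low_payroll_team, high_payroll_team):
    r = residual(payroll, wins)
    print(round(r, 2), round(r ** 2, 2))   # residual and residual squared
```

A positive residual means the team sits above the line; squaring removes the sign so errors in both directions add up.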

Great! So now we can make
predictions right?
Right?

Data Science
Understand how data science builds off of
statistics to make predictions and power
some of the most common uses.
➔ Numeric Predictions
➔ Categorization

Statistics primarily
DESCRIBES data sets, but is
not set up to PREDICT new
values.
Example, please???

Which line on the right is the “best”? If
there were another point in the set, where
do you think it might be?
Statistics says the black line is the
best.
Human intuition thinks the
green line might be better.
But we still don’t know.

Another example:
The graphs to the right
show an increasing number of
polynomial terms used to
fit data on house size vs
price.
Adapted from http://www.astroml.org/sklearn_tutorial/practical.html

Q: I think I get it. What do we
do about this problem?

Data scientists divide the
data into two:
Train and Test

Training set is used
to set up the model - aka
fitting parameters
Test set is used to
measure how well the
model predicts the value
of data.

Comparing actual and
predicted values (the
residuals) indicates
whether your model is
down for the count …

Great. I now understand the
data science process, but it’s
not yet magic. What else
can it do?

Using a model based on
parameters, a computer
can group items into
categories and make
choices

Problem:
We want to predict if a
stranger at the gate of our
castle is a Stark or a
Lannister.
Should we trust them?
We don’t want to get killed.

Name     Eye Color     Hair Color     Stark
Ned      Grey          Dark Brown     Y
Robb     Blue          Dark Brown     Y
Sansa    Blue          Red            Y
Arya     Grey          Brown          Y
Bran     Brown         Brown          Y
Rickon   Blue          Brown          Y
Lyanna   Blue          Brown          Y
Benjen   Brown         Brown          Y
Tywin    Green         Blonde         N
Tyrion   Green/Black   Dirty Blonde   N
Jaime    Green         Brown          N
Cersei   Green         Blonde         N

Name     Eye Color     Eye Number   Hair Color     Hair Number   Stark
Ned      Grey          4            Dark Brown     5             1
Robb     Blue          2            Dark Brown     5             1
Sansa    Blue          2            Red            2             1
Arya     Grey          4            Brown          3             1
Bran     Brown         3            Brown          3             1
Rickon   Blue          2            Brown          3             1
Lyanna   Blue          2            Brown          3             1
Benjen   Brown         3            Brown          3             1
Tywin    Green         1            Blonde         1             0
Tyrion   Green/Black   1.5          Dirty Blonde   2             0
Jaime    Green         1            Brown          3             0
Cersei   Green         1            Blonde         1             0
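
Turning categories into numbers can be sketched as a simple lookup. The number assignments are the ones shown in the table; the dictionary and function names are hypothetical.

```python
# Lookup tables copied from the slide's number assignments.
EYE_NUMBER = {"Green": 1, "Green/Black": 1.5, "Blue": 2, "Brown": 3, "Grey": 4}
HAIR_NUMBER = {"Blonde": 1, "Dirty Blonde": 2, "Red": 2, "Brown": 3, "Dark Brown": 5}


def encode(eye_color: str, hair_color: str, stark: str):
    """Convert one row of text labels into (eye number, hair number, 0 or 1)."""
    return (EYE_NUMBER[eye_color], HAIR_NUMBER[hair_color], 1 if stark == "Y" else 0)


print(encode("Grey", "Dark Brown", "Y"))           # (4, 5, 1)
print(encode("Green/Black", "Dirty Blonde", "N"))  # (1.5, 2, 0)
```

Once every column is a number, the same regression machinery used for payroll and wins can be applied.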

Training
Name     Eye Number   Hair Number   Stark
Ned      4            5             1
Sansa    2            5             1
Bran     3            2             1
Rickon   2            3             1
Tyrion   1.5          2             0
Jaime    1            3             0
Testing
Name     Eye Number   Hair Number   Stark
Robb     2            5             1
Arya     4            3             1
Tywin    1            1             0
Cersei   1            1             0

Training
Name     Eye Number   Hair Number   Stark   Model Outcome   Residuals Squared
Ned      4            5             1       1.3235          0.10465225
Sansa    2            5             1       0.7275          0.07425625
Bran     3            2             1       0.7786          0.04901796
Rickon   2            3             1       0.5629          0.19105641
Tyrion   1.5          2             0       0.3316          0.10995856
Jaime    1            3             0       0.2649          0.07017201
Stark = 0.298 * (eye number) +
0.0823 * (hair number) - 0.28
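
The training table can be reproduced by plugging each row into the formula. This sketch only evaluates the coefficients shown on the slide; it does not show how they were fitted, and the function name is hypothetical.

```python
def stark_score(eye_number: float, hair_number: float) -> float:
    """Model from the slide: 0.298 * eye + 0.0823 * hair - 0.28."""
    return 0.298 * eye_number + 0.0823 * hair_number - 0.28


# (name, eye number, hair number, Stark label) from the training table.
training_rows = [
    ("Ned", 4, 5, 1),
    ("Sansa", 2, 5, 1),
    ("Bran", 3, 2, 1),
    ("Rickon", 2, 3, 1),
    ("Tyrion", 1.5, 2, 0),
    ("Jaime", 1, 3, 0),
]

for name, eye, hair, label in training_rows:
    outcome = stark_score(eye, hair)
    residual_squared = (outcome - label) ** 2
    print(f"{name:<7} {outcome:.4f}  {residual_squared:.8f}")
```

Each printed row matches the Model Outcome and Residuals Squared columns.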

Test
Name     Eye Number   Hair Number   Stark   Model Outcome   Residuals Squared
Robb     2            5             1       0.7275          0.07425625
Arya     4            3             1       1.1589          0.02524921
Tywin    1            1             0       0.1003          0.01006009
Cersei   1            1             0       0.1003          0.01006009
Average Residual Squared
Training: 0.0998
Test: 0.0299
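
A sketch of how the two averages come out, plus one way to turn the model outcome into a yes/no answer at the gate. The 0.5 cutoff is an assumption for illustration; the slides do not state a threshold.

```python
def stark_score(eye_number: float, hair_number: float) -> float:
    """Model from the slide: 0.298 * eye + 0.0823 * hair - 0.28."""
    return 0.298 * eye_number + 0.0823 * hair_number - 0.28


# (eye number, hair number, Stark label) rows from the training and test tables.
training = [(4, 5, 1), (2, 5, 1), (3, 2, 1), (2, 3, 1), (1.5, 2, 0), (1, 3, 0)]
test = [(2, 5, 1), (4, 3, 1), (1, 1, 0), (1, 1, 0)]


def average_residual_squared(rows) -> float:
    squared = [(stark_score(eye, hair) - label) ** 2 for eye, hair, label in rows]
    return sum(squared) / len(squared)


def predict_is_stark(eye_number: float, hair_number: float) -> int:
    """Assumed decision rule: call it a Stark when the score is at least 0.5."""
    return 1 if stark_score(eye_number, hair_number) >= 0.5 else 0


print(round(average_residual_squared(training), 5))  # 0.09985
print(round(average_residual_squared(test), 5))      # 0.02991
print(all(predict_is_stark(e, h) == label for e, h, label in test))  # True
```

With that assumed cutoff, every row in both tables lands in the right house.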

Working with
data scientists
How to have better two-way conversations
with your data science team and handle
objections
➔ Data cleanliness
➔ Model fit

“The data just isn’t clean
enough to work with”
But it’s in a database. Isn’t
that good enough?

Real-world data is never
as clean as one would
hope. There is always
the danger of missing
fields, mistyped entries,
previously wrong
answers, etc.

Solutions:
● Remove bad data
columns and check the
effect on users
● Make assumptions
about missing data
● Spend time tracking
down better data
● Find alternative sources
for data
Always check the effect
that a change in the data
has on the user experience.
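
The first two options can be sketched with plain Python. The records and field names below are hypothetical, and filling with the median is just one possible assumption about missing data.

```python
from statistics import median

# Hypothetical user records; None marks a missing field.
records = [
    {"user_id": 1, "age": 34, "plan": "pro"},
    {"user_id": 2, "age": None, "plan": "free"},
    {"user_id": 3, "age": 29, "plan": None},
    {"user_id": 4, "age": 41, "plan": "free"},
]


def drop_rows_missing(rows, field):
    """Option 1: remove rows where the field is missing."""
    return [row for row in rows if row[field] is not None]


def fill_missing_with_median(rows, field):
    """Option 2: assume a missing value looks like the typical (median) one."""
    fill = median(row[field] for row in rows if row[field] is not None)
    return [{**row, field: fill if row[field] is None else row[field]} for row in rows]


print(len(drop_rows_missing(records, "plan")))              # 3
print(fill_missing_with_median(records, "age")[1]["age"])   # 34
```

Either choice changes what the model sees, which is why the effect on users is worth checking afterwards.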

“We can’t ship. The learning
curve is broken.”

Going back to our housing
problem will help us
identify what might be
going on.

A learning curve shows
how the model performs
on the training and test
sets as the size of the
training set increases.
Learning curves are used
to understand the basic
characteristics of the
model fit.
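
A learning curve can be computed roughly as follows. This is a sketch on synthetic data (a made-up curved trend), fitting a straight line with NumPy's `polyfit`; the set sizes and split are assumptions.

```python
import numpy as np

# Synthetic, illustrative data: a curved trend plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 3, size=60)
y = 1.5 * x**2 + rng.normal(0, 0.5, size=60)
x_train, y_train = x[:40], y[:40]
x_test, y_test = x[40:], y[40:]


def mse(coeffs, xs, ys) -> float:
    """Mean squared residual of a fitted polynomial on the given points."""
    return float(np.mean((np.polyval(coeffs, xs) - ys) ** 2))


def learning_curve(degree, sizes):
    """For each training-set size, fit a model and record train and test error."""
    rows = []
    for n in sizes:
        coeffs = np.polyfit(x_train[:n], y_train[:n], degree)
        rows.append((n, mse(coeffs, x_train[:n], y_train[:n]), mse(coeffs, x_test, y_test)))
    return rows


curve = learning_curve(degree=1, sizes=[5, 10, 20, 40])
for n, train_error, test_error in curve:
    print(n, round(train_error, 3), round(test_error, 3))
```

Plotting the two error columns against the size gives the learning curve; how far apart they sit, and how high they level off, is what the team is reading.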

A learning curve can show
bias, meaning the test
and training sets both give
similar answers, but the
answers are incorrect.
Data scientists can often fix
this problem with more
work.

The other issue is called
variance, meaning the
test and training sets give
different answers.
This type of issue is the
hardest for your data
science team to deal with.

Some solutions to
variance (overfit) or
bias (underfit) might be:
● Increasing the amount
of data (helps with variance)
● Figuring out the exact
impact on users
● Increasing complexity
of the model (helps with bias)
○ Higher order model
○ More variables
“This is cool. Where can I
learn more?”

Bibliography
For more information, there are many great resources online that give overviews as well
as in-depth info made for people who want to get into data science
➔ Andrew Ng’s Course
Machine Learning on Coursera
➔ Coursera
◆ Machine Learning Foundations: A Case Study Approach
◆ A Crash Course in Data Science
➔ Kaggle
Hosts competitions and datasets. Also has a tutorial to walk through a machine
learning example
➔ scikit-learn Documentation
Documentation for a popular Python machine learning library

