Data Science: The Product Manager's Primer
1. Data Science: The Product Manager's Primer
by Andrew Koller and Doron Bergman
/Productschool @ProdSchool /ProductmanagementSF
2. Who we are
Andrew Koller
koller.andrew.j@gmail.com
Five years of experience as an
entrepreneur and product manager
including a background in statistical
physics modeling and data science.
Doron Bergman
PhD in theoretical solid state physics.
Three years of experience in large tech companies and
startups.
3. Overview
Everything you need to know to understand the world of data science in startups
● Back to the Basics
An overview of statistics and the mathematical basis
● Data Science and AI
How data science differs from statistics and makes predictions in the real world
● Putting it all together with an example
Provide a simple unifying message for what is to come
● Data Scientist perspectives
How to understand objections your data scientists might have and how to be their hero
NOTE: Find this presentation at tiny.cc/ProductSchoolDataScience.
8. Q: What is data science and
AI? Where do I start?
A: Statistics
9. Statistical
Background
A quick review and overview of the statistics
you'll need to communicate with your team
● Regression
Describing a relationship between two
variables.
● Confidence
Understanding, to some measure,
how valid and strong the result is.
10. Q: What is statistics and why
are you talking about that?
I'm here for data science.
11. A: Statistics is a way to
describe and make
generalizations about a
population of data. It forms
the basis of Data Science.
12. Regression describes a
set of data points
with a function.
The most basic form is
linear regression,
in the form y = Mx + b
13. Money and Wins
Money might not be able to buy
happiness but it sure seems to buy
wins in Major League Baseball
Y = M * x + b
Wins = 0.153 wins/$Million * Payroll
(in $Million) + 66.4 Wins
Or about $6.5 Million per win
Actual calculation will be left as an exercise
Image source: New York Times, 2010
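For anyone who wants to try the exercise, here is a minimal sketch that plugs the slide's slope and intercept into y = Mx + b. The function names and the $100M example payroll are illustrative assumptions, not part of the original analysis.

```python
# Coefficients quoted on the slide (New York Times, 2010 fit).
WINS_PER_MILLION = 0.153   # slope M, in wins per $1M of payroll
BASELINE_WINS = 66.4       # intercept b, wins at zero payroll


def predicted_wins(payroll_millions: float) -> float:
    """Evaluate Wins = M * Payroll + b for a payroll given in $ millions."""
    return WINS_PER_MILLION * payroll_millions + BASELINE_WINS


def cost_per_win_millions() -> float:
    """Cost of one additional win, in $ millions: the inverse of the slope."""
    return 1.0 / WINS_PER_MILLION


if __name__ == "__main__":
    # The $100M payroll is an illustrative input, not a figure from the deck.
    print(f"Predicted wins at $100M payroll: {predicted_wins(100):.1f}")
    print(f"Cost per extra win: ${cost_per_win_millions():.2f}M")
```

The cost per win is simply 1 / M, which is where the "about $6.5 Million per win" figure comes from.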
14. Q: How well does a linear
relationship apply to
everything?
A: Pretty well*
15. US Population
From 1650 to 1850 the US population
grew nonlinearly
Y = A * (b * x)^n
Sadly not linear... Or is it?
Image source:
http://onlinestatbook.com/2/transformations/tukey.html
16. US Population
But after taking the log of the
population, the function becomes
log(Y) = M * x + b
Image source:
http://onlinestatbook.com/2/transformations/tukey.html
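A rough sketch of the log trick, assuming growth is roughly exponential (a fixed percentage per year), which is the case where taking the log of the population alone gives a straight line. The numbers are synthetic and made up for illustration; they are not the census data behind the chart, and `fit_line` is a hypothetical helper.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for y = m * x + b; returns (m, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    m = sxy / sxx
    return m, mean_y - m * mean_x


# Synthetic, made-up series: 1,000 people growing 3% per year for 200 years.
years = list(range(0, 201, 10))
population = [1000 * 1.03 ** t for t in years]

# Fit a straight line to log(population) instead of population itself.
slope, intercept = fit_line(years, [math.log(p) for p in population])

growth_rate = math.exp(slope) - 1        # yearly growth implied by the slope
start_population = math.exp(intercept)   # population implied at year 0
print(f"growth per year: {growth_rate:.3f}, start: {start_population:.0f}")
```

The slope of the fitted line is the log of the yearly growth factor, so a linear tool ends up describing a curve that is not linear at all.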
17. Many datasets can be made
linear after just one
transformation
Q: What if that isn't good
enough?
18. The next step beyond
linear regression is called
polynomial
regression, in the
form y = Nx^2 + Mx + b or
higher order. Each
additional term lets the curve
fit the data more closely.
More general (math nerd)
form:
y = Σ_n M_n * x^n
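To make the polynomial idea concrete, here is a small sketch that fits degree 1, 2 and 3 polynomials to the same points and compares how far each curve is from them. It solves the least-squares normal equations by hand to stay dependency-free; in practice a library would do this. The data points and function names are invented for illustration.

```python
def fit_polynomial(xs, ys, degree):
    """Least-squares coefficients [M_0, ..., M_degree] for y = sum M_n * x^n."""
    size = degree + 1
    # Normal equations (X^T X) M = X^T y, built from sums of powers of x.
    a = [[sum(x ** (i + j) for x in xs) for j in range(size)] for i in range(size)]
    rhs = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(size)]
    # Gaussian elimination with partial pivoting.
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        rhs[col], rhs[pivot] = rhs[pivot], rhs[col]
        for row in range(col + 1, size):
            factor = a[row][col] / a[col][col]
            for k in range(col, size):
                a[row][k] -= factor * a[col][k]
            rhs[row] -= factor * rhs[col]
    # Back substitution.
    coeffs = [0.0] * size
    for row in reversed(range(size)):
        known = sum(a[row][k] * coeffs[k] for k in range(row + 1, size))
        coeffs[row] = (rhs[row] - known) / a[row][row]
    return coeffs


def predict(coeffs, x):
    """Evaluate the polynomial y = sum M_n * x^n at x."""
    return sum(m * x ** n for n, m in enumerate(coeffs))


def sum_squared_error(coeffs, xs, ys):
    return sum((y - predict(coeffs, x)) ** 2 for x, y in zip(xs, ys))


# Made-up points that bend upward, so a straight line cannot follow them well.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.0, 2.1, 5.2, 9.8, 17.1, 26.0]

errors = [sum_squared_error(fit_polynomial(xs, ys, d), xs, ys) for d in (1, 2, 3)]
print(errors)  # error on the fitted points for degree 1, 2 and 3
```

The error here is measured on the same points used for fitting, so it can only shrink or stay flat as terms are added. That says nothing yet about how well the curve predicts new points.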
20. Accuracy (or alternatively
error) is measured by
taking the distance
between the value
predicted by the
regression and the actual
value. This is called the
residual.
Data scientists often express
residuals as squared
residuals.
The A's have a positive
residual
The Mets have a negative
residual
Which team would you rather
be?
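A quick sketch of residuals using the payroll line from the Money and Wins slide. The two teams and their payroll and win totals are hypothetical, chosen only so that one lands above the line and one below it.

```python
def predicted_wins(payroll_millions):
    """Regression line from the Money and Wins slide."""
    return 0.153 * payroll_millions + 66.4


def residual(actual_wins, payroll_millions):
    """Actual minus predicted: positive means more wins than payroll suggests."""
    return actual_wins - predicted_wins(payroll_millions)


# Hypothetical teams; the payroll and win figures are invented for illustration.
teams = {
    "Thrifty": {"payroll": 50.0, "wins": 81},       # cheap roster, wins a lot
    "Big Spender": {"payroll": 130.0, "wins": 79},  # costly roster, wins less
}

residuals = {name: residual(t["wins"], t["payroll"]) for name, t in teams.items()}
sum_of_squares = sum(r ** 2 for r in residuals.values())

for name, r in residuals.items():
    print(f"{name}: residual {r:+.2f} wins")
print(f"Sum of squared residuals: {sum_of_squares:.2f}")
```

Squaring makes every residual positive, so teams above and below the line both add to the total error instead of cancelling out.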
22. Data Science
Understand how data science builds on
statistics to make predictions and power
some of its most common uses.
● Numeric Predictions
Describing a relationship between two
variables.
● Categorization
Understanding, to some measure,
how valid and strong the result is.
24. Which line on the right is the "best", and if
there were another point in the set, where
do you think it might be?
Statistics says the black line is the
best.
Human intuition thinks the
green line might be better.
But we still don't know.
25. Another example:
The graphs to the right
show an increasing number of
polynomial terms used to
fit data on house size vs.
price.
Adapted from http://www.astroml.org/sklearn_tutorial/practical.html
26. Q: I think I get it. What do we
do about this problem?
35. Problem:
We want to predict if a
stranger at the gate of our
castle is a Stark or a
Lannister.
Should we trust them?
We don't want to get killed.
36. Name     Eye Color     Hair Color     Stark
    Ned      Grey          Dark Brown     Y
    Robb     Blue          Dark Brown     Y
    Sansa    Blue          Red            Y
    Arya     Grey          Brown          Y
    Bran     Brown         Brown          Y
    Rickon   Blue          Brown          Y
    Lyanna   Blue          Brown          Y
    Benjen   Brown         Brown          Y
    Tywin    Green         Blonde         N
    Tyrion   Green/Black   Dirty Blonde   N
    Jamie    Green         Brown          N
    Cersei   Green         Blonde         N
37. Name     Eye Color     Eye Number   Hair Color     Hair Number   Stark
    Ned      Grey          4            Dark Brown     5             1
    Robb     Blue          2            Dark Brown     5             1
    Sansa    Blue          2            Red            2             1
    Arya     Grey          4            Brown          3             1
    Bran     Brown         3            Brown          3             1
    Rickon   Blue          2            Brown          3             1
    Lyanna   Blue          2            Brown          3             1
    Benjen   Brown         3            Brown          3             1
    Tywin    Green         1            Blonde         1             0
    Tyrion   Green/Black   1.5          Dirty Blonde   2             0
    Jamie    Green         1            Brown          3             0
    Cersei   Green         1            Blonde         1             0
38. Training
    Name     Eye Number   Hair Number   Stark
    Ned      4            5             1
    Sansa    2            5             1
    Bran     3            2             1
    Rickon   2            3             1
    Tyrion   1.5          2             0
    Jamie    1            3             0
    Testing
    Name     Eye Number   Hair Number   Stark
    Robb     2            5             1
    Arya     4            3             1
    Tywin    1            1             0
    Cersei   1            1             0
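One way to see the train/test split in action is a nearest-neighbour rule: label a stranger with whatever the most similar person in the training table was. This is a swapped-in technique chosen for brevity, not necessarily the model the deck had in mind, and `predict_stark` is a hypothetical helper. The rows are copied from the Training and Testing tables above.

```python
# Rows from the Training and Testing tables: (name, eye number, hair number, stark).
TRAINING = [
    ("Ned", 4, 5, 1), ("Sansa", 2, 5, 1), ("Bran", 3, 2, 1),
    ("Rickon", 2, 3, 1), ("Tyrion", 1.5, 2, 0), ("Jamie", 1, 3, 0),
]
TESTING = [
    ("Robb", 2, 5, 1), ("Arya", 4, 3, 1), ("Tywin", 1, 1, 0), ("Cersei", 1, 1, 0),
]


def predict_stark(eye, hair):
    """Return the label (1 = Stark, 0 = not) of the closest training row."""
    def squared_distance(row):
        _, row_eye, row_hair, _ = row
        return (row_eye - eye) ** 2 + (row_hair - hair) ** 2

    nearest = min(TRAINING, key=squared_distance)
    return nearest[3]


# Predict each test row, then score the predictions against the known labels.
predictions = {name: predict_stark(eye, hair) for name, eye, hair, _ in TESTING}
correct = sum(predictions[name] == label for name, _, _, label in TESTING)
accuracy = correct / len(TESTING)
print(predictions, f"accuracy = {accuracy:.2f}")
```

The model only ever sees the training rows; the testing rows are held back to check whether it works on people it has not met. Keep in mind that the eye and hair numbers are arbitrary codes, so "distance" between them is a convenience, not a real measurement.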
41. Working with
data scientists
How to have better two-way conversations
with your data science team and handle
objections
● Data cleanliness
Describing a relationship between two
variables.
● Model fit
Understanding, to some measure,
how valid and strong the result is.
42. "The data just isn't clean
enough to work with"
But it's in a database, isn't
that good enough?
43. Real-world data is never
as clean as one would
hope. There is always
the danger of missing
fields, mistyped entries,
previously wrong
answers, etc.
44. Solutions:
● Removing bad data
columns and checking
the effect on users
● Making assumptions
about missing data
● Spending time tracking
down better data
● Finding alternative sources
for data
Always check the effect
that a change in the data
has on the user experience.
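A minimal sketch of the first two options: dropping a bad column, and filling gaps with an assumed value (here the column average). The records, field names and helper functions are hypothetical; real cleanup depends heavily on what the data means.

```python
from statistics import mean

# Hypothetical records; None marks a missing field and "abc" a mistyped entry.
records = [
    {"user": "a", "age": 34, "spend": 120.0},
    {"user": "b", "age": None, "spend": 80.0},
    {"user": "c", "age": "abc", "spend": 95.0},
    {"user": "d", "age": 41, "spend": None},
]


def is_number(value):
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def drop_column(rows, field):
    """Option 1: remove a column that is too unreliable to use."""
    return [{k: v for k, v in r.items() if k != field} for r in rows]


def fill_with_mean(rows, field):
    """Option 2: assume a missing or invalid value equals the column average."""
    average = mean(r[field] for r in rows if is_number(r[field]))
    return [{**r, field: r[field] if is_number(r[field]) else average} for r in rows]


without_age = drop_column(records, "age")
filled = fill_with_mean(records, "spend")

# Compare a downstream number before and after to see how much the fix matters.
known_spend = [r["spend"] for r in records if is_number(r["spend"])]
print("average spend, known values only:", round(mean(known_spend), 2))
print("average spend, after filling:", round(mean(r["spend"] for r in filled), 2))
print("columns after dropping age:", sorted(without_age[0]))
```

Whichever option is chosen, rerun the numbers that users actually see and compare them before and after the change.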
46. Going back to our housing
problem will help us
identify what might be
going on.
47. A learning curve shows
how the model performs
on the training and test data
sets as the size of the
training set increases.
Learning curves are used
to understand the basic
characteristics of the
model fit.
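Here is a rough sketch of how the numbers behind a learning curve could be produced: fit the same simple model on larger and larger slices of the training data and record the error on both the training slice and a fixed test set. The house data is synthetic (randomly generated), and the straight-line model and function names are assumptions for illustration.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = m * x + b; returns (m, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    m = sxy / sxx
    return m, mean_y - m * mean_x


def mean_squared_error(m, b, xs, ys):
    return sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Synthetic house data (invented): price = 100 + 0.2 * size plus random noise.
rng = random.Random(0)
sizes = [rng.uniform(500, 3000) for _ in range(60)]
prices = [100 + 0.2 * s + rng.gauss(0, 40) for s in sizes]

train_x, train_y = sizes[:40], prices[:40]
test_x, test_y = sizes[40:], prices[40:]

# One row per training-set size: (size, training error, test error).
learning_curve = []
for n in (2, 5, 10, 20, 40):
    m, b = fit_line(train_x[:n], train_y[:n])
    learning_curve.append((
        n,
        mean_squared_error(m, b, train_x[:n], train_y[:n]),
        mean_squared_error(m, b, test_x, test_y),
    ))

for n, train_error, test_error in learning_curve:
    print(f"n={n:2d}  train MSE={train_error:10.1f}  test MSE={test_error:10.1f}")
```

Plotting the two error columns against n gives the learning curve. With only two training points the line passes through both, so the training error starts at essentially zero even though the model has learned very little.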
48. A learning curve can show
bias, meaning the test
and training sets both give
similar answers but the
answers are incorrect.
Data scientists can often fix
this problem with more
work.
49. The other issue is called
variance, meaning the
test and training sets give
different answers.
This type of issue is the
hardest for your data
science team to deal with.
50. Some solutions to
variance or underfit
might be:
● Increasing the amount
of data
● Figuring out the exact
impact on users
● Increasing complexity
of the model
○ Higher order model
○ More variables
52. Bibliography
For more information, there are many great resources online that give overviews as well
as in-depth information for people who want to get into data science
● Andrew Ng's Course
Machine Learning on Coursera
● Coursera
○ Machine Learning Foundations: A Case Study Approach
○ A Crash Course in Data Science
● Kaggle
Hosts competitions and datasets. Also has a tutorial that walks through a machine
learning example
● scikit-learn Documentation
Documentation for a widely used machine learning library
54. www.productschool.com
Upcoming Workshops
RSVP on Eventbrite
August 3: From Building Products to Managing Them
August 10: Coding For Non Coders
August 17: Product Owners: How to Get Your Development Team to
Love You
August 24: PM Life at an Early Stage Startup