4. Three structures
1. Separating growth into in vs. out
2. Maturity level of departures
3. Retention lozenge
Three stories
1. Unemployment US vs. France
2. How to fix a casual video game
3. Great startup vs. bonfire
No sophisticated models
How to structure data
6. Similar unemployment pattern
[Figure: two charts, "Unemployment US" and "Unemployment France"; x-axis: quarters Q1 to Q4 over four years; y-axis: Unemployment (M), 0 to 6]
Numbers are made up; for real ones, see Pierre Cahuc and André Zylberberg, Labor Economics, The MIT Press, 2004.
7. Very different issues
[Figure: two charts, "Unemployment US" and "Unemployment France"; x-axis: quarters Q1 to Q4 over four years; y-axis: 0 to 6; series: Unemployment (M), Lost job, Found job]
8. How to build a detailed reference table
Period (day, week, month) | User ID | Present or active this period | Present or active last period | Last active (period) | Status
2018-01-01 | 12345 | TRUE | NULL | NULL | New
2018-01-08 | 12345 | TRUE | TRUE | 2018-01-01 | Active
2018-01-15 | 12345 | FALSE | TRUE | 2018-01-08 | Lapsed
2018-01-22 | 12345 | FALSE | FALSE | 2018-01-08 | Lost
2018-01-29 | 12345 | TRUE | FALSE | 2018-01-08 | Re-activated
…
SELECT … AS period, id, CASE WHEN …, LAG(…) OVER w, MAX(…) OVER w, CASE WHEN …
GROUP BY period, id
WINDOW w AS …
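The status logic in the table above can be sketched outside SQL as well. The following is a minimal Python sketch, not the query from the slide: the function name build_status_rows, the input shape (a set of active periods per user) and the exact status rules are assumptions inferred from the five example rows.

```python
def build_status_rows(periods, active_periods, user_id):
    """Build one row per period for a single user.

    periods: ordered list of period labels (here ISO dates of week starts).
    active_periods: set of periods in which the user was present or active.
    Rows are (period, user_id, active_now, active_last, last_active, status).
    """
    rows = []
    last_active = None   # most recent earlier period with activity
    active_last = None   # activity flag of the previous period, None before the first row
    for period in periods:
        active_now = period in active_periods
        if last_active is None:
            if not active_now:
                continue  # user not seen yet: no row at all
            status = "New"
        elif active_now:
            status = "Active" if active_last else "Re-activated"
        else:
            status = "Lapsed" if active_last else "Lost"
        rows.append((period, user_id, active_now, active_last, last_active, status))
        if active_now:
            last_active = period
        active_last = active_now
    return rows


# Same user and weeks as in the table above.
periods = ["2018-01-01", "2018-01-08", "2018-01-15", "2018-01-22", "2018-01-29"]
active = {"2018-01-01", "2018-01-08", "2018-01-29"}
rows = build_status_rows(periods, active, 12345)
for row in rows:
    print(row)
```

Every user gets exactly one status per period from their first appearance onwards, which is what lets the totals add up later.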
9. How to build an aggregated reference table
Period (day, week, month) | Status | Count | Last active | Source | Last action
2018-01-01 | New | 17 854 | | |
2018-01-08 | Active | 78 442 | | |
2018-01-15 | Lapsed | 12 325 | | |
2018-01-22 | Lost | 10 548 | | |
2018-01-29 | Re-activated | 2 428 | | |
SELECT … AS period, status, COUNT(*)
GROUP BY period, status
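The same GROUP BY can be sketched in Python. This is an illustration under assumptions: the rows are made up, the name aggregate is hypothetical, and each detailed row is reduced to (period, user ID, status).

```python
from collections import Counter

# Made-up detailed rows: (period, user_id, status).
detailed_rows = [
    ("2018-01-01", 1, "New"),
    ("2018-01-01", 2, "New"),
    ("2018-01-08", 1, "Active"),
    ("2018-01-08", 2, "Lapsed"),
    ("2018-01-08", 3, "New"),
]


def aggregate(rows):
    """Count users per (period, status), like GROUP BY period, status."""
    return Counter((period, status) for period, _user_id, status in rows)


counts = aggregate(detailed_rows)
for (period, status), count in sorted(counts.items()):
    print(period, status, count)

# Every detailed row lands in exactly one bucket, so the totals add up.
assert sum(counts.values()) == len(detailed_rows)
```

Adding a dimension such as source means adding it to the key tuple, just as it would be added to the GROUP BY.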
11. Distinct user statuses allow better insight
[Figure: chart "Players", series: Daily active, y-axis: 0 to 6; chart "Players funnel", series: Lost, Active, y-axis: 0.0% to 120.0%]
Numbers are made-up; they look nothing like a project I worked on.
[Figure: chart "Players", series: Daily active, New, Active, Lost, y-axis: 0 to 6]
13. Cohort
• a group of people with a shared characteristic (Cambridge Eng. Dict.)
• a group of people who all did something during the same period (Me)
• Don’t focus exclusively on registration: first order, or third, re-activation, etc.
14. Triangle of user experience
[Diagram: x-axis: Time of the action; y-axis: Time of the first action or registration; labels: Cohort, Promotion, Now]
15. Retention lozenge
[Diagram: x-axis: Time of the action; y-axis: Time of the first action or registration; labels: Too recently acquired, Now, 8 weeks, After 8 weeks]
17. More considerations
• Arbitrary thresholds
• Simple, imperfect, memorable
• Communicate: catchy names
• Alex Schultz, VP Growth Facebook
• More time-like metrics
• Activity totals vs. Behaviour step
• Time spent vs. since registration
• Demographic age vs. seniority
• Experience on wider platform
• Friends’ experience levels
Editor's notes
This talk is probably part of a series
I started by suggesting a list of 12 questions to ask: how mature is a data organization?
This is a small step on how
You can be a bit more systematic in handling your data
Long way to say I’m old and cranky
Small change since last time
Last presentation I talked about 12 things that
Needed to be there to make the small core part of data science
Mainly, it’s a good ETL process and good habits around it
A lot of that is good engineering and
addressing analyst frustrations
One of the most impactful parts is
reusable data structures
This is something that fewer people
in your organization are likely to ask
Because it’s a less glaring pain point
But enforcing consistent concepts is important
No sophisticated models
How to structure data
I’m more than happy to talk about
convoluted models
counter-intuitive corrections
Learned last week: Facebook credits itself with growth accounting
This, I learned during my Master’s
Before Facebook was founded
Well. One of my Master’s
Let’s have fun!
Let’s talk about unemployment!
You see, more jobs are created than destroyed
When unemployment goes down.
Same thing in both economies.
But how you get there is not the same
What is essential is that all things add up exactly
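That two economies can show the same curve while getting there differently can be checked with a few lines of arithmetic. This is a minimal sketch, not taken from the talk: the function name roll_forward, the two unnamed economies and every figure (in thousands) are made up for the example.

```python
def roll_forward(initial_stock, lost_jobs, found_jobs):
    """Return the unemployment stock after each period.

    Accounting identity: stock[t] = stock[t-1] + lost[t] - found[t].
    """
    stocks = []
    stock = initial_stock
    for lost, found in zip(lost_jobs, found_jobs):
        stock = stock + lost - found
        stocks.append(stock)
    return stocks


# Made-up quarterly flows, in thousands of people.
high_churn = roll_forward(5000, lost_jobs=[900, 800, 850, 800],
                          found_jobs=[1000, 900, 950, 900])
low_churn = roll_forward(5000, lost_jobs=[200, 100, 150, 100],
                         found_jobs=[300, 200, 250, 200])

print(high_churn)
print(low_churn)
# Same net curve, very different flows behind it.
assert high_churn == low_churn
```

The stock alone cannot tell the two apart; only the separate in and out flows can.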
The fact that a user is considered lapsed or lost after X or Y periods is rather arbitrary
Try to pick a number that separates well: few people transition from one to the other
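One possible way to look for a number that separates well is to measure how often users come back after being inactive for at least that long. This is only a sketch of one reading of the advice: the name return_rate_after, the input shape (one entry per inactivity spell, with its length in periods and whether the user eventually returned) and the figures are all assumptions.

```python
def return_rate_after(spells, threshold):
    """Share of inactivity spells of at least `threshold` periods that ended in a return.

    spells: list of (length_in_periods, returned) pairs.
    Returns None when no spell reaches the threshold.
    """
    reached = [returned for length, returned in spells if length >= threshold]
    if not reached:
        return None
    return sum(reached) / len(reached)


# Made-up inactivity spells: (length in periods, did the user come back?).
spells = [(1, True), (1, True), (2, True), (3, False),
          (5, False), (6, True), (8, False)]

for threshold in (1, 2, 3, 5):
    print(threshold, return_rate_after(spells, threshold))
```

A threshold where this rate is low means few users labelled lost turn out to be only lapsed.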
Once you have your Period and your status
You can…
You can also draw a transition graph
You can and you should add a lot of things to that group by:
you should have detailed totals by as many dimensions as you think are relevant
——————
Do you have any questions so far?
Does this make sense to you?
Do you see the applications to Data science?
Now we have a relevant distinction in our population
We can now train models trying to predict it!
we can compare leavers with non-leaving customers
With the same maturity or experience
Keep in mind:
any departure is temporary
Every cut-off is arbitrary
That doesn’t matter
What matters is that everyone is accounted for
The key feature so far is how statuses
add up to the active customers.
Would you invest in a company with that kind of growth?
Now, let’s apply the framework
Of course, you know about funnels & retention and you would have caught that
But even the funnel can be different with that insight:
big drops might be relevant, but not all obstacles are hostile
Good challenge
Do not confuse progress with retention
The wider idea in this talk is:
Time is a key dimension in any product experience
Cohorts are a great abstraction
You need to come up with more vocabulary around them
Two geometric arguments that explain a non-trivial concept
The key notion in those two parts,
and a word that I have oddly not used so far, is the cohort
What is a cohort?
What is the widest, simplest definition of a cohort?
Let’s represent every action by your users on a two-dimensional plane
X, or abscissa, is the time of the action
Y, or ordinate, is the time that user registered or first did something
In this corner, you can’t have any action
because that would involve time travel
Or some sort of weird CGI. Don’t go there.
Or rather, do go there:
count how many actions you have in there
That’s a great thing to check.
If there are any, you know you have a problem
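That check is short enough to sketch. The event shape (registration date, action date) and the name count_time_travellers are assumptions made for the illustration; the dates are made up.

```python
from datetime import date


def count_time_travellers(events):
    """Count actions dated before the user's registration or first action."""
    return sum(1 for registered_on, acted_on in events if acted_on < registered_on)


# Made-up events: (registered_on, acted_on).
events = [
    (date(2018, 1, 1), date(2018, 1, 1)),    # acting on the registration day is fine
    (date(2018, 1, 1), date(2018, 2, 3)),
    (date(2018, 3, 5), date(2018, 2, 20)),   # action before registration: suspicious
]

print(count_time_travellers(events))
```

Anything above zero points at a data problem (clocks, time zones, merged accounts) rather than at user behaviour.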
This graph is actually used commonly to represent retention,
Using colour maps
This triangle becomes very useful when you are studying retention
Same brown time travelers, still empty, hopefully
If you want to know if people are retained
after, say, eight weeks, you should exclude recent joiners
And most people remember to do that.
What people often overlook is to exclude activity
* that we know about *
after eight weeks of stay
That lozenge is what you should be looking at
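The two exclusions can be written as a single filter. A minimal sketch, assuming dates for registration and action and an eight-week horizon; the name in_lozenge and the example dates are made up.

```python
from datetime import date, timedelta


def in_lozenge(registered_on, acted_on, today, horizon=timedelta(weeks=8)):
    """Keep an action only if it falls inside the retention lozenge.

    1. The user registered at least `horizon` before today
       (recent joiners are excluded).
    2. The action happened within the first `horizon` of the user's stay
       (later activity that we already know about is excluded).
    """
    old_enough = registered_on + horizon <= today
    within_horizon = registered_on <= acted_on < registered_on + horizon
    return old_enough and within_horizon


today = date(2018, 6, 1)
print(in_lozenge(date(2018, 1, 1), date(2018, 1, 15), today))   # early action, old user
print(in_lozenge(date(2018, 1, 1), date(2018, 4, 1), today))    # action after 8 weeks
print(in_lozenge(date(2018, 5, 20), date(2018, 5, 21), today))  # too recently acquired
```

With both conditions applied, every user left in the data is described by the same eight weeks of history, whatever their registration date.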
Once again, you are probably better off
looking at detailed colour maps
But if you are trying to model retention, this is important.
Let's assume you want to know if
People who joined on week 1 are st
And you should make sure that
the right number to model is easier to find.
That’s the big lesson here:
* Make the right metric easier to find *
Once again,
Not directly related to machine learning
but getting those right is essential to building a relevant model
Two things I would like to just say a word about
First is: all those thresholds are arbitrary
The right unit might not be calendar time
Think about alternative ways of counting
Time in your service & calendar time since
- Or think beyond what you see: wider platform, relations
Do you have questions?
Or would you rather have me
give the floor to someone smarter than me?