Conclusions you reach with data are only valid if they correctly interpret your data set. In many organizations, the responsibility for collecting and aggregating data is distributed, so it can be hard to ensure that everyone who uses a data set understands the limitations of the signals in that pipeline.
As an example, many companies make important decisions about what events constitute an “active user,” and these decisions are reflected in the pipeline code. Changes to a pipeline may not be communicated to all downstream users, leading to misinformed conclusions even from correctly executed analyses.
In this talk, Richard will share three key questions to help ensure that you are interpreting your data correctly and drawing accurate conclusions.
5. A/B test was (very!) promising
Revenue: +$100 million
7. Twyman’s Law
Any piece of data or evidence that looks interesting or unusual is probably wrong!
8. Reality:
“User” in Office is hard to define. For instance, you can be an O365 user but not a OneDrive user.
“You are a OneDrive user if [blah blah blah].”
“I’m pretty sure I alerted all the right people.”
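To make the ambiguity concrete, here is a minimal Python sketch. The event records, product names, and both definition functions are invented for illustration; none of them come from the talk or from any real pipeline. It counts "users" under two plausible definitions and shows that the same data gives different answers.

```python
from typing import Callable, Iterable, Tuple

Event = Tuple[str, str, str]  # (user_id, product, action), a hypothetical schema

# Made-up event log, for illustration only.
EVENTS = [
    ("u1", "word", "open_doc"),
    ("u1", "onedrive", "upload"),
    ("u2", "excel", "open_doc"),
    ("u3", "onedrive", "view_shared_link"),
    ("u3", "word", "open_doc"),
]


def any_event(event: Event) -> bool:
    """Baseline: every event counts."""
    return True


def any_onedrive_activity(event: Event) -> bool:
    """Hypothetical definition A: any OneDrive event makes you a OneDrive user."""
    return event[1] == "onedrive"


def uploaded_to_onedrive(event: Event) -> bool:
    """Hypothetical definition B: only an upload makes you a OneDrive user."""
    return event[1] == "onedrive" and event[2] == "upload"


def count_users(events: Iterable[Event], qualifies: Callable[[Event], bool]) -> int:
    """Count distinct user ids that have at least one qualifying event."""
    return len({event[0] for event in events if qualifies(event)})


print("all users:", count_users(EVENTS, any_event))
print("definition A:", count_users(EVENTS, any_onedrive_activity))
print("definition B:", count_users(EVENTS, uploaded_to_onedrive))
```

If the pipeline quietly switches from one definition to the other, every downstream metric built on the count shifts, even though each analysis is executed correctly.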
12. Best case: I spent several (tense!) days finding out where we had a data bug.
Worse case: I claimed victory and shipped something that may be useless or even harmful.
Expected gains don’t pan out, eroding trust.
Worst case: saying “Well, the pipeline has been wrong before” whenever it disagrees with my intuition.
18. Where is your culture?
Restrictive:
“Data ownership is a power structure”
Permission needed to use data.
Beware the implicit permission: if you need human help then you need permission.
Path to production is slow, much of the data product is rewritten by product team.
The idea tax strangles your innovation.
Permissive:
Clear metrics set expectations. Ideas are widely sourced. A “try it” culture.
Everyone has meaningful access to data assets: they can see what exists, what it means, and how to access it.
Best practices, like testing, are encouraged even for data consumers.
19. Where is your culture?
Anarchy / Coherence
“Accidental production” causes tech debt.
Org changes erode knowledge and leave zombie jobs.
Post-hoc understanding of who uses what.
Automated data lineage keeps up to date with the organization.
Tests reinforce business-critical assumptions.
22. The fear loop is driven by lack of coherence.
Permissive / Restrictive
Anarchy / Coherence
Embarrassing, preventable mistakes
The control is for your own good.
Democracy
!
23. Let’s talk about tools
Permissive / Restrictive
Anarchy / Coherence
Tools can move you in this direction.
24. Let’s talk about tools
Permissive / Restrictive
Anarchy / Coherence
Tools can help, but they aren’t the answer.
25. Make the right thing to do the easy thing to do
Post-hoc understanding of what data creators are doing
• You really do want Data Catalogue/Lineage.
• Social aspect is critical: encourage people using similar data to chat. Maybe (… probably) they are reinventing wheels.
• “How do people use this well?” is the most important question your catalogue can answer.
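As a rough illustration of what a lineage record lets you ask, the sketch below walks a small dependency graph to list everything downstream of one data set. The data set names and the `LINEAGE` mapping are hypothetical, and a real catalogue would build this graph automatically rather than by hand.

```python
from collections import deque

# Hypothetical lineage: each data set maps to the data sets built directly from it.
LINEAGE = {
    "raw_events": ["sessions", "active_users"],
    "sessions": ["engagement_report"],
    "active_users": ["revenue_dashboard", "ab_test_metrics"],
    "engagement_report": [],
    "revenue_dashboard": [],
    "ab_test_metrics": [],
}


def downstream(lineage, dataset):
    """Return every data set that depends, directly or transitively, on `dataset`."""
    seen = set()
    queue = deque(lineage.get(dataset, []))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue  # already visited through another path
        seen.add(node)
        queue.extend(lineage.get(node, []))
    return sorted(seen)


# Who needs to hear about a change to the "active_users" definition?
print(downstream(LINEAGE, "active_users"))
```

The same traversal, pointed at the people who own each downstream data set, is one way to find who should be talking to each other.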
26. Make the right thing to do the easy thing to do
Testing?
• I think a little testing is helpful, but too much risks negative marginal ROI.
• Focus on integration testing or key assumptions with major downstream implications.
• Testing tools that let end users assert truths about their data/assumptions are preferable.
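A minimal sketch of what "asserting truths about your data" could look like for a data consumer, written in plain Python. The column names and the three assumptions are hypothetical examples, not rules from the talk; dedicated data-testing tools offer richer versions of the same idea.

```python
def check_assumptions(rows):
    """Return a list of violated assumptions for daily metric rows.

    Each row is a dict with the hypothetical keys:
    "date", "total_users", "active_users", "revenue".
    """
    failures = []

    # Assumption 1: exactly one row per date.
    dates = [row["date"] for row in rows]
    if len(dates) != len(set(dates)):
        failures.append("duplicate dates")

    # Assumption 2: active users can never exceed total users.
    if any(row["active_users"] > row["total_users"] for row in rows):
        failures.append("active_users exceeds total_users")

    # Assumption 3: revenue is never negative.
    if any(row["revenue"] < 0 for row in rows):
        failures.append("negative revenue")

    return failures


good = [
    {"date": "2020-01-01", "total_users": 100, "active_users": 40, "revenue": 10.0},
    {"date": "2020-01-02", "total_users": 110, "active_users": 45, "revenue": 12.5},
]
bad = [
    {"date": "2020-01-01", "total_users": 100, "active_users": 140, "revenue": 10.0},
    {"date": "2020-01-01", "total_users": 110, "active_users": 45, "revenue": -3.0},
]

print(check_assumptions(good))
print(check_assumptions(bad))
```

Keeping the list short, and limited to assumptions with major downstream implications, is in line with the point above about marginal ROI.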
28. Thanks! (and a favor)
I’m trying to interview as many people as I can about their data pipeline experiences for a larger project.
Can you share your experiences?
LinkedIn - https://www.linkedin.com/in/guyrt/