[presentation to Data Science Dojo meetup on 14 Jan 2015]
Years before anyone had uttered the words “data” and “science” together to mean a professional discipline, the NIH was funding the National Centers for Biomedical Computing. For five years, I was attached to one of these centers hosted at U.Michigan where scientists and computational experts were engaged in large projects attempting to explain complex phenomena in human disease by integrating the analysis of data and domain information. Rather than talk about the scientific success of this grand effort, I will discuss lessons from this experience that can guide us as we build our skills as data scientists and help others understand their data.
"Got a nail? I got a hammer": Lessons for data science from the "dawn" of big science
1. “Got a nail? I got a hammer!”
Lessons for Data science from the “dawn” of big science
Ben Keller
Data Science Dojo
14 January 2015
@vinegarbin
bjkeller.github.io
linkedin.com/in/bjkeller
Creative Commons Attribution-ShareAlike 4.0 International License
2. Some context
Almost 10 years ago the NIH program “National
Centers for Biomedical Computing” started
Goal: use computing to answer the questions posed by
driving projects in biomedicine
“Big science” [though maybe not the “dawn”]
3. A different perspective
Questions center around story-telling
“what molecular activity links this genetic
change to the symptoms of type 2 diabetes?”
Overriding goal to build software to find answers
Proof-of-concept analysis to drive development
5. A computational scientist – trained in
algorithms – thinks of a problem as
Given: a set of genes G, a covariance matrix M
over expression of genes in G
Find: a family of gene sets {Gi}, subsets of G,
such that...
...and will build tools that solve it
7. What do you see?
p1 p2 …
s1 42.211 9.3211 …
s2 2.192 8.9942 …
⋮ ⋮ ⋮ ⋱
8. We see what we recognize
the van carrying my
geology class stops
next to a rock feature
that looks like:
9. We see what we recognize
John, napping in the
back seat, wakes up
briefly and looks out
the window.
What did he see?
10. We see what we recognize
John will tell you he saw a chevron fold
formed by opposing pressure on the rock layers
11. We see what we recognize
Everyone else saw
that water flowing
along a crack had
formed a v-shaped
channel in steeply
sloping layers
(Sorry, John. You’re still wrong.)
12. So, what do you see?
p1 p2 …
s1 42.211 9.3211 …
s2 2.192 8.9942 …
⋮ ⋮ ⋮ ⋱
14. You might see
columns on which to do regression
a matrix on which to do matrix factorization
a graph connecting subjects to features
variables on which to measure mutual information
variables on which to do Bayesian inference
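A minimal sketch of the point this slide makes: the same small table can be recast into several of the views listed, and each recasting is a choice made by the analyst rather than something present in the data. The function names (`as_columns`, `as_matrix`, `as_bipartite_edges`) are hypothetical, chosen only for illustration; the four numbers are the ones visible on the "What do you see?" slide, and the elided rows and columns are left out.

```python
# The four values visible on the slide; elided rows and columns stay elided.
table = {
    "s1": {"p1": 42.211, "p2": 9.3211},
    "s2": {"p1": 2.192, "p2": 8.9942},
}


def as_columns(table):
    """View 1: one list of values per feature, as a regression would use."""
    subjects = sorted(table)
    features = sorted({f for row in table.values() for f in row})
    return {f: [table[s][f] for s in subjects] for f in features}


def as_matrix(table):
    """View 2: a plain subject-by-feature matrix, as a factorization would use."""
    subjects = sorted(table)
    features = sorted({f for row in table.values() for f in row})
    return [[table[s][f] for f in features] for s in subjects]


def as_bipartite_edges(table):
    """View 3: weighted edges linking each subject to each feature."""
    return [(s, f, v) for s in sorted(table) for f, v in sorted(table[s].items())]


columns = as_columns(table)
matrix = as_matrix(table)
edges = as_bipartite_edges(table)

print(columns)
print(matrix)
print(edges)
```

None of these views records what s1 or p1 actually means to the person who collected the data, which is the gap the following slides go on to discuss.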
15. You might see
A proxy problem that you
already know how to solve
16. We solve what we see
Scientist had results
linking genomic regions
to each other in
subjects with bipolar
disorder
Asking: what is common?
17. We solve what we see
Asking: what is common?
Previously used graph to
represent what was common in
recommender systems
18. We solve what we see
Asking: what is common?
Previously used graph to
represent what was common in
recommender systems
We get answers, but hard to
interpret biologically
[Figure: graph with nodes labeled CDKN2A/B, PPARG, HHEX, TCF7L2, "mortality", "g1", "repression"]
19. Cognitive engineering tells us that we have to
manage the relationship between
how we think of the problem
and
how it is represented by tools
20. Cognitive engineering tells us that we have to
manage the relationship between
the way we think of tasks
and
what is allowed by tools
22. Lesson:
see the data and problem as
the expert “owner” sees them
Read as:
- leave the data as it is, and avoid exposing
abstractions not already in the original problem
- provide an expert-understandable explanation of the
analysis, and, if you can’t, rethink whether the
approach is useful
23. "Need hot water for your bath?
Here’s a bucket,
a pot and a stove.
The well is outside!”
25. Stories we are looking for
are complex
[Figure: diagram with nodes A, B, C, D]
with interrelated data
26. Stories we are looking for
are complex
with interrelated data
and interrelated
chains of analysis
27. Stories we are looking for
are complex
with interrelated data
and interrelated
chains of analysis
Often have to translate data between tools,
and change perspective
29. Lesson
Any analysis is part of a larger question
Read as:
- reduce cognitive load of interpreting
between analysis steps
- understand how different steps relate
and try to help expert understand flow
of analysis
32. Corollary:
Appeal to cognitive science
Read as:
- use studies already done to understand
how scientists/experts do their work
- work with cog science expert to develop
understanding of domain experts
34. A complex problem
- involves uncertainty
- draws on incomplete and diverse
sources of information
- may be affected by several factors
and be driven by competing
objectives
(Mirel, Interaction design for complex problem solving, 2004)
36. A complex problem because
- uncertain what is an actual solution
- involves diverse, incomplete, and possibly
irrelevant information
- based on incomplete observations, affected
by technology/methodology
- conflicting objectives of predicting/
remediating/understanding disease
39. Corollary:
Embrace the uncertainty
Read as:
- expect not to know what expert needs,
and for them not to know what they need
- be agile: build analysis in conversation
with expert to push understanding
41. Corollary
Data will be “special”
Read as:
- understand where your data is coming
from, what it represents, and how the
data owner sees it
- understand sources of error/noise
45. Corollary
The question will change
Read as:
once analysis gives the answer, expert
may recognize it was the wrong question,
or may come up with another one