Having programmers do data science is terrible, if only everyone else were not even worse. The problem is, of course, tools. We seem to have settled on either a bunch of disparate libraries thrown into a more or less agnostic IDE, or some point-and-click wonder which, no matter how glossy, never seems to truly fit our domain once we get down to it. The dual Lisp tradition of grow-your-own-language and grow-your-own-editor gives me hope there is a third way.
This talk is a meditation on the ideal environment for doing data science and how to (almost) get there. I will cover how I approach data problems with Clojure (and why Clojure in the first place), what I believe the process of doing data science should look like, and the tools needed to get there. Some already exist (or can at least be bodged together); others can be made with relative ease (and we are already working on some of these); but a few will take a lot more hammock time.
10. The analytics chasm
[Diagram: scale marked < 2min, < 20min, project]
• Ideal: almost real-time, can be done during brainstorming without disrupting flow
• Squeeze in somewhere in the day
• Fail
• Roadmap ahoy!
13. Sharing results
• Have one canonical version that is always current.
• Concentrate discussion in one place and make it searchable and persistent.
• Include methodology (=code).
19. Code hidden, but can be expanded
• Questions, comments, & annotations
• Shareable
• Periodically re-run to keep it fresh
• #alderaan #sales #growth: discoverability
25. Data frame considered harmful
• Data frame (=table) conflates representation and abstraction
• Clojure excels in structure manipulation/encoding
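To make the "vector of maps instead of a data frame" point concrete, here is a minimal sketch using only clojure.core. The `sales` data and the `total-by-region` helper are made up for illustration; they come from neither the talk nor Huri.

```clojure
;; Hypothetical sample data: a plain vector of maps, not a special table type.
(def sales
  [{:region "north" :rep {:name "Ana" :team :a} :amount 120}
   {:region "south" :rep {:name "Bo"  :team :b} :amount 80}
   {:region "north" :rep {:name "Cy"  :team :a} :amount 200}])

;; Hypothetical helper: total amount per region, built from core functions.
(defn total-by-region [rows]
  (reduce (fn [acc {:keys [region amount]}]
            (update acc region (fnil + 0) amount))
          {}
          rows))

(total-by-region sales)
;; => {"north" 320, "south" 80}

;; Nested values are reachable with ordinary functions, no flattening step.
(map #(get-in % [:rep :name]) sales)
;; => ("Ana" "Bo" "Cy")
```

Because the rows are ordinary maps, nesting costs nothing extra, which is harder to express once everything has to fit a flat table.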
26. github.com/sbelak/huri
• No data structures, just functions over collections
• Composable (even DSLs — no macros!)
• Reasonably fast (transducers <3)
• Do-what-I-mean (auto-sort, liberal with inputs, …)
• Minimal buy-in
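As a rough illustration of "just functions over collections" with transducers, the sketch below uses clojure.core only. It does not show Huri's own API; `sales` and `north-amounts` are hypothetical names.

```clojure
;; Hypothetical sample data: a vanilla vector of maps.
(def sales
  [{:region "north" :amount 120}
   {:region "south" :amount 80}
   {:region "north" :amount 200}])

;; A transducer describes the transformation independently of its input
;; and output, and runs without building intermediate sequences.
(def north-amounts
  (comp (filter #(= "north" (:region %)))
        (map :amount)))

;; The same transformation can feed a reduction...
(transduce north-amounts + 0 sales)
;; => 320

;; ...or be poured into a collection.
(into [] north-amounts sales)
;; => [120 200]
```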
27. Composable data structure based DSLs
• ->> and partial friendly
• Support reaching into nested structures everywhere
• Vanilla vector of maps: interoperability
• Provide curried versions where possible
28. Composability is key to quick iterating
• Curried versions where possible
• ->> and partial friendly
• Side benefit: consistent API
• Generalised accessors (reaching into complex structures everywhere via comp): function, map key, “virtual” structure
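The composability points above can be sketched in plain Clojure. The `where` and `col` helpers and the `orders` data are hypothetical and written only for this example; they are not Huri's actual functions or signatures.

```clojure
;; Hypothetical helper: keep rows where (pred (accessor row)) is truthy.
;; Rows come last, so it threads with ->>; the two-argument arity returns
;; a function awaiting the rows (a curried version).
(defn where
  ([accessor pred] (partial where accessor pred))
  ([accessor pred rows] (filter (comp pred accessor) rows)))

;; Hypothetical helper: extract one "column" via any accessor.
(defn col [accessor rows]
  (map accessor rows))

;; Hypothetical sample data.
(def orders
  [{:id 1 :customer {:name "Ana" :country "UK"} :total 100}
   {:id 2 :customer {:name "Bo"  :country "DE"} :total 70}
   {:id 3 :customer {:name "Cy"  :country "UK"} :total 40}
   {:id 4 :customer {:name "Di"  :country "UK"} :total 60}])

;; An accessor is anything callable: a map key, a function, or a comp of
;; both that reaches into nested structure.
(->> orders
     (where (comp :country :customer) #{"UK"})
     (where :total #(> % 50))
     (col :total)
     (reduce +))
;; => 160

;; The curried version gives a reusable, named step.
(def uk-only (where (comp :country :customer) #{"UK"}))
(count (uk-only orders))
;; => 3
```

Keeping the collection as the last argument everywhere is what makes each step work with `->>` and `partial`, and it is also where the consistent API comes from.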
29. “This is possibly Clojure’s most important property: the syntax expresses the code’s semantic layers. An experienced reader of Clojure can skip over most of the code and have a lossless understanding of its high-level intent.”
— Z. Tellman, Elements of Clojure
36. huri.plot
• DSL that compiles to ggplot2
• Targets Gorilla REPL
• Follows the rest of Huri’s design philosophy
• bar chart, scatter plot, line chart, box & violin plot, heatmap, histogram
38. Takeaways
• Speed-of-answer matters
• Data science is about communication
• We don’t have to reinvent every wheel in Clojure
• Clojure is fantastic at structure manipulation; play to its strengths
• Blurring the line between environment and work is a powerful idea