The document discusses the future of data science, including increased use of functional programming, cloud notebooks, and probabilistic modeling of large and diverse datasets from IoT devices, drones, and satellites. It also predicts data scientists will displace traditional product managers as data becomes more important for decision making. Overall, the future involves analyzing exponentially larger volumes of diverse data using scalable cloud tools and probabilistic algorithms.
4. Whither Data Science?
FLAWED
twitter.com/josh_wills/status/198093512149958656
issue: aristotelian perspectives in a non-linear world…
5. Whither Data Science?
circa 2008: a large ad-tech firm, running one
of the largest Hadoop instances in the cloud,
execs said “Don’t bother” to dig into analysis
of geo, clustering, time series, etc.
7. Whither Data Science?
circa 2008: a large ad-tech firm, running one
of the largest Hadoop instances in the cloud,
execs said “Don’t bother” to dig into analysis
of geo, clustering, time series, etc.
We did anyway:
• people in SF don’t click online travel ads much,
however, people in Dodge City do… a lot!
• largest customer segment: flag poles, portable
generators, hammocks, sea salt, mail-order
steaks, defibrillators
8. Whither Data Science?
primary sources for the notion:
Cleveland, W. S.,
“Data Science: an Action Plan for Expanding
the Technical Areas of the Field of Statistics,”
International Statistical Review (2001), 69, 21-26.
http://cm.bell-labs.com/stat/doc/datascience.ps
Breiman, L.,
“Statistical Modeling: The Two Cultures,”
Statistical Science (2001), 16, 199-231.
http://projecteuclid.org/euclid.ss/1009213726
…also good to mention John Tukey
9. Whither Data Science?
we have a long, long way yet to go:
So many problems that we encounter
in industry can be represented as graphs…
Tensors provide means for representing
multiple-edge graphs, ostensibly solving
for a general case…
Even so, how much time have you spent
working with tensors for data science apps?
wikipedia.org
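To make the tensor idea concrete, here is a minimal sketch of a multiple-edge graph stored as a 3rd-order tensor, one adjacency slice per edge type. The node names, edge types, and helper functions are hypothetical and chosen only for illustration; plain nested lists stand in for a real tensor library.

```python
# Hypothetical toy data: a graph whose nodes can be linked by several kinds of edge.
nodes = ["alice", "bob", "carol"]
edge_types = ["follows", "emails"]
edges = [
    ("follows", "alice", "bob"),
    ("follows", "bob", "carol"),
    ("emails", "alice", "carol"),
    ("emails", "alice", "bob"),
]

node_ix = {name: i for i, name in enumerate(nodes)}
type_ix = {name: k for k, name in enumerate(edge_types)}


def build_tensor(edge_list):
    """Return T where T[k][i][j] == 1 if an edge of type k links node i to node j."""
    n = len(nodes)
    tensor = [[[0] * n for _ in range(n)] for _ in edge_types]
    for etype, src, dst in edge_list:
        tensor[type_ix[etype]][node_ix[src]][node_ix[dst]] = 1
    return tensor


def collapse(tensor):
    """Sum over the edge-type axis: how many edges link each ordered node pair."""
    n = len(nodes)
    return [
        [sum(tensor[k][i][j] for k in range(len(edge_types))) for j in range(n)]
        for i in range(n)
    ]


tensor = build_tensor(edges)
multiplicity = collapse(tensor)
```

Each slice `tensor[k]` is an ordinary adjacency matrix, so the single-edge-type graph is the special case with one slice.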
11. Arc 1: Who has the crystal ball?
TL;DR: Nods to some people who envisaged
and modeled our shared future…
12. Arc 1: Who has the crystal ball?
Theory, Eight Decades Ago:
what can be computed?
Haskell Curry
haskell.org
Alonzo Church
wikipedia.org
Praxis, Four Decades Ago:
algebra for applicative systems
John Backus
acm.org
David Turner
wikipedia.org
Reality, Two Decades Ago:
web apps, ML, machine data
Pattie Maes
MIT Media Lab
16. Arc 2: Why are we here?
TL;DR: We share the delightful role of…
speaking truth to power
17. Arc 2: Why are we here?
Reason 1:
early 19th c. Prussian/Napoleonic “General Staff”
organization => corporate IT silos
translated:
too many people saying “That is not my concern.”
action:
interdisciplinary teams tear down silos,
surfacing insights
18. Arc 2: Why are we here?
Reason 2:
19th-20th c. statistics emphasized defensibility
in lieu of predictability
translated:
defend one’s job, not boost top-line revenue
action:
focus on predictability; if you need to defend
your job, you should be working elsewhere
19. Arc 2: Why are we here?
Reason 3:
machine learning derives from several disciplines,
but ultimately is a subset of optimization
translated:
those disciplines rarely talked to each other,
so we have difficulty understanding them collectively
action:
learn to leverage optimization theory, thoroughly
20. Arc 2: Why are we here?
Reason 4:
university math curricula are still tilted toward
Cold War priorities
translated:
2-3 years of calculus filters for the mechanical
engineering candidates who can build the most
cost-effective ICBMs
action:
leadership must embrace how to leverage advanced
math for business use cases
21. Arc 2: Why are we here?
Reason 5:
brogrammers tend to emphasize logical
reasoning over analytic reasoning
translated:
left-brained lopsidedness wins temporarily,
then fails spectacularly
action:
ask security to walk the brogrammers
back to their cave
22. Arc 2: Why are we here?
Reason 6:
people can make intuitive decisions in
~4 dimensions at most, period
translated:
product managers as Steve Jobs wannabes
are poisonous
action:
leverage data science, visualization, machine learning
with distributed systems at scale to address the high
dimensionality of data
23. Arc 2: Why are we here?
Reason 7:
embracing perpetual learning curves represents
a promethean challenge
translated:
learning is hard, and many organizations go to
great lengths to minimize it
action:
learn efficiently, continually, with a great thirst
25. Arc 4: What happens next?
TL;DR: Brace yourselves…
26. Arc 4: What happens next?
• Full stack… no, really
• You’ll work with functional programming
and cloud-based notebooks
• Shift from modeling based on variance (batch)
towards probabilistic approximation
• Early data scientists displace the old-school
product managers
• IoT, drones, microsats: several orders of
magnitude more data up ahead
• leave SF – the more interesting data science
work to be accomplished is not here
27. Arc 4: What happens next?
Full stack… no, really
from visualization
to virtualization,
all points in-between
source: Microsoft
29. Arc 4: What happens next?
You’ll work with functional programming
and cloud-based notebooks
http://databricks.com/product
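As a rough illustration of the functional style one tends to use in a notebook, the sketch below aggregates click counts with pure functions and no in-place mutation. The data and the names `clicks`, `merge`, `totals`, and `heavy` are made up for the example; it is not taken from any particular notebook product.

```python
from functools import reduce

# Hypothetical (city, clicks) records, echoing the ad-tech anecdote earlier in the deck.
clicks = [("SF", 1), ("Dodge City", 7), ("SF", 0), ("Dodge City", 5)]


def merge(acc, pair):
    """Combine one record into the running totals, returning a new dict."""
    city, n = pair
    return {**acc, city: acc.get(city, 0) + n}


# Fold the records into per-city totals, starting from an empty dict.
totals = reduce(merge, clicks, {})

# Keep only the cities with more than 5 clicks in total.
heavy = list(filter(lambda kv: kv[1] > 5, totals.items()))
```

Because `merge` never mutates its inputs, the same fold can be split across partitions and the partial results combined, which is the property distributed frameworks rely on.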
31. Arc 4: What happens next?
Shift from modeling based on variance (batch)
towards probabilistic approximation
highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/
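One example of probabilistic approximation is a Bloom filter, which answers "have I seen this item?" in fixed memory at the cost of occasional false positives. The sketch below is a minimal version under stated assumptions: the class name, the default sizes, and the use of seeded SHA-256 digests as hash functions are illustrative choices, not a tuned implementation.

```python
import hashlib


class BloomFilter:
    """Approximate set membership: no false negatives, some false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, item):
        # Derive several hash positions by prefixing the item with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        # True means "possibly seen"; False means "definitely not seen".
        return all(self.bits[pos] for pos in self._positions(item))


seen = BloomFilter()
for user_id in ["u1", "u2", "u3"]:
    seen.add(user_id)
```

Memory stays at `num_bits` however many items are added; what grows instead is the false-positive rate, which is the trade the approximate approach makes.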
33. Arc 4: What happens next?
Early data scientists displace the old-school
product managers
35. Arc 4: What happens next?
IoT, drones, microsats: several orders of magnitude
more data up ahead
Layered Sensing Networks:
• microsats, e.g., Planet Labs, 400 km
• airships, e.g., JP Aerospace, 40 km
• atmosats, e.g., Titan Aerospace, 20 km
• drones, e.g., HoneyComb, 120 m
• robots, e.g., Blue River, 1 m
• sensors, e.g., Hortau, -0.3 m
37. Arc 4: What happens next?
leave SF – the more interesting data science
work to be accomplished is not here
39. Vector Quantization:
After we’ve cleaned up data, formulated workflows
in terms of monoids, used graph representation, and
parallelized with a wealth of linear algebra, much of
the heavy-lifting that remains on the clusters is in
optimization
For example, deep learning @Google
uses many layers of neural nets trained
with gradient descent optimization
Taming Latency Variability and Scaling Deep Learning
Jeff Dean @Google (2013)
youtu.be/S9twUcX1Zp0
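For readers who have not written the optimization step by hand, here is a minimal sketch of batch gradient descent fitting a line by least squares. The data, learning rate, and step count are assumptions picked so the toy problem converges; it only illustrates the update rule and is not how large-scale deep learning systems are implemented.

```python
def gradient_descent(xs, ys, lr=0.05, steps=2000):
    """Fit y ~ w * x + b by minimizing mean squared error with gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Partial derivatives of the mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        # Step against the gradient.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Hypothetical noise-free data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
w, b = gradient_descent(xs, ys)
```

Every step touches every data point, which is why this loop is the part of the workload whose cost grows with data volume.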
40. Vector Quantization:
One advantage of quantum algorithms is
estimating the gradient of a function with a
single query, regardless of dimension… Once
high-ROI apps are reworked to leverage lots of
ML and large clusters, SGD represents the
datacenter cost basis, notably the part that scales…
Want to slash costs exponentially?
Plug in quantum for a game-changer,
maybe
Fast quantum algorithm for
numerical gradient estimation
Stephen P. Jordan
Phys. Rev. Lett. 95, 050501 (2005)
arxiv.org/abs/quant-ph/0405146
dwavesys.com
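For contrast with the quantum result cited above, the classical baseline is easy to sketch: a forward-difference gradient estimate needs one function evaluation per dimension plus one at the base point. The function name, step size `h`, and test function below are illustrative assumptions.

```python
def forward_difference_gradient(f, point, h=1e-6):
    """Estimate the gradient of f at point, returning (gradient, evaluation count)."""
    evaluations = 0

    def counted(p):
        # Wrap f so every call is tallied.
        nonlocal evaluations
        evaluations += 1
        return f(p)

    base = counted(point)
    grad = []
    for i in range(len(point)):
        shifted = list(point)
        shifted[i] += h  # nudge one coordinate at a time
        grad.append((counted(shifted) - base) / h)
    return grad, evaluations


def sum_of_squares(p):
    return sum(x * x for x in p)


grad, evaluations = forward_difference_gradient(sum_of_squares, [1.0, 2.0, 3.0])
```

For a point with d coordinates the count is d + 1, so the classical cost grows linearly with dimension.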
41. Vector Quantization:
Proposal: let’s drop clusters of quantum
devices into lunar polar craters, so we
can handle massive vector quantization
workloads
• cryogenic environs (tens of kelvin in permanent shadow)
• near perpetual sunlight
for energy sources
• park routers at L4
• approx. $15B to finance,
i.e., ~6 days DoD budget
42. Vector Quantization:
We’ll just put this here…
a couple o’ Googly projects in progress:
qCraft: Quantum Physics In Minecraft
plus.google.com/u/1/+QuantumAILab/posts/grMbaaDGChH
lunar.xprize.org
“We’re going back to the Moon. For good.”
45. events:
Strata EU
Barcelona, Nov 19-21
strataconf.com/strataeu2014
Data Day Texas
Austin, Jan 10
datadaytexas.com
Strata CA
San Jose, Feb 18-20
strataconf.com/strata2015
Spark Summit East
NYC, Mar 18-19
spark-summit.org/east
Spark Summit 2015
SF, Jun 15-17
spark-summit.org
46. presenter:
monthly newsletter for updates,
events, conf summaries, etc.:
liber118.com/pxn/
Just Enough Math
O’Reilly, 2014
justenoughmath.com
preview: youtu.be/TQ58cWgdCpA
Enterprise Data Workflows
with Cascading
O’Reilly, 2013
shop.oreilly.com/product/0636920028536.do