Big Data in Texas: Then, Now, and Ahead
1. “Big Data in Texas:
Then, Now, and Ahead”
Paco Nathan,
Evil Mad Scientist @
Concurrent, Inc.
1
2. Then, Now, and Ahead
THEN
1. Keep Austin Weird?
2. Something Called Data Science
3. Rise Of The Machine Data
4. A Cambrian Explosion
5. Eat, Drink, Be Merry…
6. Data-Driven In TX
7. Roll Up Your Sleeves
2
3. observations…
Lynn asked me to talk about Data here today
A few weeks ago we stepped back for a moment
to reflect about what we’d seen happen in Austin
over the years
Both of us ran alternative bookstores in Austin,
twenty or so years ago, and we participated as
the Internet thing exploded in the 1990s
That was a blast –
3
11. observations…
Overall, it’s about systems thinking
We have a wealth of that here, at UT/Austin in particular…
Ilya Prigogine spent years here, which is just incredible
School of Architecture, with leading work in VR, GIS, etc.
Interactive innovations at ACTLab…
Quantitative emphasis at McCombs…
major intellectual resources here
11
12. Then, Now, and Ahead
NOW
1. Keep Austin Weird?
2. Something Called Data Science
3. Rise Of The Machine Data
4. A Cambrian Explosion
5. Eat, Drink, Be Merry…
6. Data-Driven In TX
7. Roll Up Your Sleeves
12
13. Data Science
[Team diagram, drawn over a background of mirrored log-event text]
Domain Expert: business process, stakeholder
Data Scientist: data science (data prep, discovery, modeling, etc.)
App Dev: software engineering, automation
Ops: systems engineering, availability
introduced capability
13
15. references…
by DJ Patil
Data Jujitsu
O’Reilly, 2012
amazon.com/dp/B008HMN5BE
Building Data Science Teams
O’Reilly, 2011
amazon.com/dp/B005O4U3ZE
15
16. Enterprise Data Workflows
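[Flow diagram, word count with a stop-word filter:
Document Collection ➞ Tokenize (Regex token) ➞ Scrub token ➞ HashJoin Left (RHS: Stop Word List) ➞ GroupBy token ➞ Count ➞ Word Count
M and R mark the map and reduce phases]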
cascading.org
16
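The word-count workflow in the diagram above (tokenize, scrub, filter against a stop-word list, group by token, count) can be sketched in a few lines of plain Python. This is only an illustration of the data flow, not the Cascading API; the stop-word list, the function names, and the sample documents are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical stop-word list; the list used on the slide is not shown.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "to", "is"}


def tokenize(text):
    """Split a document into raw tokens on whitespace and basic punctuation."""
    return re.split(r"[\s\[\]\(\),.]+", text)


def scrub(token):
    """Normalize a token: trim whitespace and lowercase."""
    return token.strip().lower()


def word_count(documents, stop_words=STOP_WORDS):
    """Count scrubbed tokens across all documents, skipping stop words."""
    counts = Counter()
    for doc in documents:
        for raw in tokenize(doc):
            token = scrub(raw)
            # Plays the role of the stop-word join: drop tokens found in the list.
            if token and token not in stop_words:
                counts[token] += 1  # group by token, then count
    return counts


docs = ["The rain in Spain", "Rain is in the forecast"]
counts = word_count(docs)
for word, n in counts.most_common():
    print(word, n)
```

In a cluster setting the tokenize, scrub, and filter steps would run on the map side and the grouping and counting on the reduce side; here everything runs in one process for readability.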
17. Enterprise Data Workflows
Over the past 5+ years, we’ve seen many large-
scale Enterprise production deployments based
on Cascading, Cascalog, Scalding, PyCascading,
Cascading.JRuby, etc.
Enterprise data workflows,
Machine learning at scale,
Big Data…
Why?
amazon.com/dp/1449358721
17
18. Then, Now, and Ahead
NOW
1. Keep Austin Weird?
2. Something Called Data Science
3. Rise Of The Machine Data
4. A Cambrian Explosion
5. Eat, Drink, Be Merry…
6. Data-Driven In TX
7. Roll Up Your Sleeves
18
19. Three broad categories of data
Curt Monash, 2010
dbms2.com/2010/01/17/three-broad-categories-of-data
• Human/Tabular data – human-generated data which fits well into tables/arrays
• Human/Nontabular data – all other data generated by humans
• Machine-Generated data
19
20. Three broad categories of data
Curt Monash, 2010
dbms2.com/2010/01/17/three-broad-categories-of-data
• Human/Tabular data – human-generated data which fits well into tables/arrays
• Human/Nontabular data – all other data generated by humans
• Machine-Generated data
• Adjusted Data – Dr. Don Easterbrook, Senate witness
20
21. Q3 1997: inflection point
Four independent teams were working toward horizontal
scale-out of workflows based on commodity hardware
This effort prepared the way for huge Internet successes
in the 1997 holiday season… AMZN, EBAY, Inktomi
(YHOO Search), then GOOG
MapReduce and the Apache Hadoop open source stack
emerged from this
21
22. Circa 1996: pre- inflection point
[Diagram labels: Stakeholder; Customers; BI Analysts (Excel pivot tables, PowerPoint slide decks); strategy; Product (requirements); Engineering (optimized code); Web App (transactions); SQL Query (result sets); RDBMS]
22
23. Circa 1996: pre- inflection point
[Diagram labels: Stakeholder; Customers; BI Analysts (Excel pivot tables, PowerPoint slide decks); strategy; Product (requirements); Engineering (optimized code); Web App (transactions); SQL Query (result sets); RDBMS]
Callout: “Throw it over the wall”
23
24. Circa 2001: post- big ecommerce successes
[Diagram labels: Stakeholder; Product; Customers; dashboards; UX; Engineering; Algorithmic Modeling (models, recommenders + classifiers); Web Apps (servlets); Middleware; aggregation; event history; SQL Query (result sets); customer transactions; Logs; DW; ETL; RDBMS]
24
25. Circa 2001: post- big ecommerce successes
[Diagram labels: Stakeholder; Product; Customers; dashboards; UX; Engineering; Algorithmic Modeling (models, recommenders + classifiers); Web Apps (servlets); Middleware; aggregation; event history; SQL Query (result sets); customer transactions; Logs; DW; ETL; RDBMS]
Callout: “Data products”
25
26. Circa 2013: clusters everywhere
[Diagram labels: Data Products; Customers; Domain Expert (business process); Prod; Workflow; dashboard metrics; data science; History; Web Apps, Mobile, etc.; s/w dev; services; Data Scientist; Planner; social interactions; discovery + modeling; optimized capacity; transactions, content; Eng; taps; App Dev; Use Cases Across Topologies; Hadoop, etc.; Log Events; In-Memory Data Grid; Ops; DW; batch; near time; Cluster Scheduler; introduced capability; existing SDLC; RDBMS]
26
27. Circa 2013: clusters everywhere
[Diagram labels: Data Products; Customers; Domain Expert (business process); Prod; Workflow; dashboard metrics; data science; History; Web Apps, Mobile, etc.; s/w dev; services; Data Scientist; Planner; social interactions; discovery + modeling; optimized capacity; transactions, content; Eng; taps; App Dev; Use Cases Across Topologies; Hadoop, etc.; Log Events; In-Memory Data Grid; Ops; DW; batch; near time; Cluster Scheduler; introduced capability; existing SDLC; RDBMS]
Callout: “Optimizing topologies”
27
28. references…
• Lambda Architecture: blending topologies
• Big Data by Nathan Marz, James Warren
• manning.com/marz
source: Nathan Marz
28
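One way to picture "blending topologies", as a minimal sketch and not code from the Marz and Warren book: a batch view recomputed from all history is merged at query time with a realtime view that covers only the events the last batch run has not yet absorbed. The event shape, the function names, and the page-view counts below are all hypothetical.

```python
from collections import Counter


def build_view(events):
    """Aggregate page-view events into a count per page."""
    return Counter(event["page"] for event in events)


def query(page, batch_view, realtime_view):
    """Answer a query by merging the batch view with the realtime view."""
    return batch_view[page] + realtime_view[page]


# Hypothetical data: the master dataset feeds the batch layer,
# recent events feed the low-latency layer until the next batch run.
master_events = [{"page": "home"}, {"page": "home"}, {"page": "docs"}]
recent_events = [{"page": "home"}]

batch_view = build_view(master_events)
realtime_view = build_view(recent_events)
home_views = query("home", batch_view, realtime_view)
print(home_views)
```

The same aggregation function serves both layers here only to keep the sketch short; in practice the batch and realtime paths usually run on different engines.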
29. references…
by Leo Breiman
Statistical Modeling: The Two Cultures
Statistical Science, 2001
bit.ly/eUTh9L
29
30. references…
Amazon
“Early Amazon: Splitting the website” – Greg Linden
glinden.blogspot.com/2006/02/early-amazon-splitting-website.html
eBay
“The eBay Architecture” – Randy Shoup, Dan Pritchett
addsimplicity.com/adding_simplicity_an_engi/2006/11/you_scaled_your.html
addsimplicity.com.nyud.net:8080/downloads/eBaySDForum2006-11-29.pdf
Inktomi (YHOO Search)
“Inktomi’s Wild Ride” – Eric Brewer (0:05:31 ff)
youtube.com/watch?v=E91oEn1bnXM
Google
“Underneath the Covers at Google” – Jeff Dean (0:06:54 ff)
youtube.com/watch?v=qsan-GQaeyk
perspectives.mvdirona.com/2008/06/11/JeffDeanOnGoogleInfrastructure.aspx
30
31. Then, Now, and Ahead
NOW
1. Keep Austin Weird?
2. Something Called Data Science
3. Rise Of The Machine Data
4. A Cambrian Explosion
5. Eat, Drink, Be Merry…
6. Data-Driven In TX
7. Roll Up Your Sleeves
31
32. Displacement
Geoffrey Moore
Mohr Davidow Ventures, author of Crossing The Chasm
Hadoop Summit, 2012:
what Amazon did to the retail sector… has put the
entire Global 1000 on notice over the next decade
data as the major force… mostly through apps –
verticals, leveraging domain expertise
Michael Stonebraker
INGRES, PostgreSQL, Vertica, VoltDB, Paradigm4, etc.
XLDB, 2012:
complex analytics workloads are now displacing
SQL as the basis for Enterprise apps
32
33. Drivers
algorithmic modeling + machine data
+ curation, metadata + Open Data
data products, as feedback into automation
evolution of feedback loops
a big part of the science in data science…
internet of things + complex analytics
accelerated evolution, additional feedback loops
taking this out into a highly social dimension
33
34. “A kind of Cambrian explosion”
source: National Geographic
34
36. A Thought Exercise
Consider that when a company like Caterpillar moves
into data science, they won’t be building the world’s
next search engine or social network
They will most likely be optimizing supply chain,
optimizing fuel costs, automating data feedback
loops integrated into their equipment…
Operations Research –
crunching amazing amounts of data
36
37. A Thought Exercise
That’s a $50B company,
in a market segment worth $250B
Upcoming: tractors as drones –
guided by complex, distributed data apps
37
39. Two Avenues to the App Layer
Enterprise: must contend with
complexity at scale every day…
incumbents extend current practices and
infrastructure investments
Start-ups: crave complexity and
scale to become viable…
new ventures move into Enterprise space
to compete using relatively lean staff
[Chart axes: complexity ➞, scale ➞]
39
40. Then, Now, and Ahead
AHEAD
1. Keep Austin Weird?
2. Something Called Data Science
3. Rise Of The Machine Data
4. A Cambrian Explosion
5. Eat, Drink, Be Merry…
6. Data-Driven In TX
7. Roll Up Your Sleeves
40
41. For instance…
Let’s drill down on that intersection of tractors and
crops, as a focus…
Some of the largest use cases for large-scale data
workflows which we encounter are in Agriculture
Here’s a sector which integrates some of those
themes from the Internet of Things, Caterpillar,
Climate Corp, etc.
41
42. Data and Agriculture, Ahead
• single largest employer, livelihood for 40% globally
• 500 million small farms worldwide
• most family farmers rely on rain-fed agriculture
• approx $2T agricultural real estate in US alone
• high annual rate of soil depletion
• cycles of flooding, drought, desertification
• high resolution from private satellite networks,
e.g., skyboximaging.com
• SMS networks for “business intelligence” among
family farmers in Ethiopia agrepedia.com
• microfinance, e.g., kiva.org, slowmoney.org
42
43. Data and Agriculture, Ahead
Consider the emerging reality of drone tractors,
guided by satellite feeds, with predictive analytics
accessing remote cloud-based clusters, crunching
data for crops planted per-plot, based on years of
history evaluated in time series analysis
It would be difficult to identify a bigger Big Data
problem in the world
43
44. Data and Agriculture, Ahead
You’ve heard about Peak Oil, Peak Phosphorus?
How about Peak Snow?
In other words, rising variance of snow pack levels,
with peak snow arriving earlier in the mountains…
which stresses the watersheds, infrastructure, etc.,
which in turn stress agriculture, energy, transportation,
financial markets, tax basis, etc.
Jeff Dozier, William Gail
“The Emerging Science of Environmental Applications”
The Fourth Paradigm, 2009
source: J. Dozier, et al., UCSB
44
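The two signals named on this slide, rising variance of snow pack levels and earlier peak snow, are both simple to measure once yearly records exist. The sketch below uses made-up numbers purely for illustration (they are not Dozier's data) and assumes each record holds a year, the day of year when the snowpack peaked, and the peak depth in centimeters.

```python
from statistics import mean, variance

# Hypothetical records: (year, day of year of peak snowpack, peak depth in cm).
records = [
    (2001, 95, 210), (2002, 93, 190), (2003, 94, 220), (2004, 90, 170),
    (2005, 91, 240), (2006, 88, 150), (2007, 87, 250), (2008, 85, 130),
]


def rolling_variance(values, window):
    """Sample variance over each consecutive window of the series."""
    return [
        variance(values[i:i + window])
        for i in range(len(values) - window + 1)
    ]


def slope(xs, ys):
    """Least-squares slope of ys against xs."""
    mx, my = mean(xs), mean(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den


years = [r[0] for r in records]
peak_days = [r[1] for r in records]
peak_depths = [r[2] for r in records]

depth_variance = rolling_variance(peak_depths, window=4)
peak_day_trend = slope(years, peak_days)  # days per year; negative means earlier

print("rolling variance of peak depth:", depth_variance)
print("trend in peak day (days/year):", peak_day_trend)
```

A growing rolling variance together with a negative trend in the peak day is the pattern the slide describes; real analysis would need far longer records and uncertainty estimates.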
45. Data and Agriculture, Ahead
Variance in the timing of the water cycle causes
stress on natural resources and infrastructure:
reservoirs, aqueducts, river ways, aquifers, levees,
farm lands, seawater incursion, etc.
Even in the face of so much IoT data looming,
we lack adequate data and modeling of snowpack,
snow melt, runoff, evaporation, water basins, etc.,
to understand the impact of these changes – now
needed to forecast where to change infrastructure
or strategies
There’s not much machine data up in the mountain
peaks, and satellite data only goes so far…
new opportunities for Big Data
source: J. Dozier, et al., UCSB
45
47. Data and Agriculture, Ahead
We can resolve these kinds of
problems; however, solutions
must leverage huge amounts
of data
47
48. Then, Now, and Ahead
AHEAD
1. Keep Austin Weird?
2. Something Called Data Science
3. Rise Of The Machine Data
4. A Cambrian Explosion
5. Eat, Drink, Be Merry…
6. Data-Driven In TX
7. Roll Up Your Sleeves
48
49. Everything’s Bigger in Texas
Agriculture is just one sector, one set of
problems to tackle
We have much, much more here in Texas
For example, Houston is a major center
for Maritime work…
check out:
marinexplore.org
49
50. Everything’s Bigger in Texas
There’s also the not-so-small matter of the
Energy and Transportation sectors
GE is putting sensors in each and every
wind generator, each and every jet engine –
again, the Internet of Things.
I’ve heard rumors there are a few of those
wind turbines out in West Texas?
50
51. Everything’s Bigger in Texas
Another of the fastest growing use cases we
see for large-scale predictive modeling is in
Telecom
Think about the stream of CDRs, billions of us
bipeds wandering about the planet with our
phones…
The firehose for that makes Twitter look like MySpace!
The value of location services as data products
for local businesses and communities is astounding
51
52. Then, Now, and Ahead
AHEAD
1. Keep Austin Weird?
2. Something Called Data Science
3. Rise Of The Machine Data
4. A Cambrian Explosion
5. Eat, Drink, Be Merry…
6. Data-Driven In TX
7. Roll Up Your Sleeves
52
53. What is needed?
Approximately 80% of the costs for data-related projects
get spent on data preparation – mostly on cleaning up
data quality issues: ETL, log file analysis, etc.
Unfortunately, data-related budgets for many companies tend
to go into frameworks which can only be used after cleanup
Most valuable skills:
‣ learn to use programmable tools that prepare data
‣ learn to generate compelling data visualizations
‣ learn to estimate the confidence for reported results
‣ learn to automate work, making analysis repeatable
source: D3
53
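For the third skill in the list above, estimating the confidence for reported results, one widely used technique is the percentile bootstrap: resample the data with replacement many times and read the interval off the spread of the recomputed statistic. This is a minimal sketch under stated assumptions; the function name, its parameters, and the sample measurements are hypothetical.

```python
import random
from statistics import mean


def bootstrap_ci(sample, stat=mean, n_resamples=2000, confidence=0.95, seed=42):
    """Percentile bootstrap confidence interval for a statistic of a sample."""
    rng = random.Random(seed)  # fixed seed keeps the analysis repeatable
    n = len(sample)
    estimates = sorted(
        stat(rng.choices(sample, k=n)) for _ in range(n_resamples)
    )
    alpha = (1.0 - confidence) / 2.0
    lower = estimates[int(alpha * n_resamples)]
    upper = estimates[int((1.0 - alpha) * n_resamples) - 1]
    return lower, upper


# Hypothetical measurements, e.g. a metric reported on a dashboard.
sample = [12.1, 9.8, 11.4, 10.9, 13.2, 10.1, 12.7, 11.8, 9.5, 12.3]
low, high = bootstrap_ci(sample)
print(f"mean = {mean(sample):.2f}, 95% CI = ({low:.2f}, {high:.2f})")
```

Reporting the interval alongside the point estimate, and fixing the random seed, also serves the fourth skill: the analysis can be rerun and checked by someone else.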
54. What else do we need?
• more emphasis on statistical thinking
• not SQL vs. NoSQL, but instead a focus
on apps as the process of structuring data
• multi-disciplinary teams,
not cubicles and silos
• evolving more feedback loops,
to drive more automation
• oddly enough, we need automation
to be able to employ more people
in intelligent, productive ways
• otherwise, we’re left with…
source: Schwa Corporation
54