If you need a different format (PDF, PPT) instead of Keynote, please email me: pnathan AT concurrentinc DOT com
An overview of Data Science for Enterprise Big Data. In other words, how to combine structured and unstructured data, leveraging the tools of automation and mathematics, for highly scalable businesses. We discuss management strategy for building Data Science teams, basic requirements of the "science" in Data Science, and typical data access patterns for working with Big Data. We review some great algorithms, tools, and truisms for building a Data Science practice, plus some great references for further study.
Presented initially at the Enterprise Big Data meetup at Tata Consultancy Services, Santa Clara, 2012-08-20 http://www.meetup.com/Enterprise-Big-Data/events/77635202/
Intro to Data Science for Enterprise Big Data
1. Intro to Data Science
Paco Nathan
Concurrent, Inc.
pnathan@concurrentinc.com
@pacoid
[title slide graphic: the Word Count workflow: Document Collection → Scrub → Tokenize (M) → HashJoin (Left) against a Stop Word List (RHS) via Regex → GroupBy token (R) → Count → Word Count]
Copyright ©2012, Concurrent, Inc.
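The Word Count workflow on the title slide (scrub, tokenize, filter against a stop-word list, group by token, count) can be sketched in a few lines of Python; the stop-word set and token regex below are hypothetical stand-ins for the diagram's Stop Word List and Regex steps:

```python
import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "of", "and"}  # hypothetical stop-word list (the RHS of the HashJoin)

def scrub(text):
    # Scrub: normalize case before tokenizing
    return text.lower()

def tokenize(text):
    # Tokenize: split on a simple word regex
    return re.findall(r"[a-z']+", text)

def word_count(documents):
    # M phase: scrub + tokenize each document;
    # the stop-word filter plays the role of the HashJoin (Left);
    # R phase: GroupBy token, then Count.
    counts = Counter()
    for doc in documents:
        for token in tokenize(scrub(doc)):
            if token not in STOP_WORDS:
                counts[token] += 1
    return counts

docs = ["The quick brown fox", "the lazy dog and the fox"]
print(word_count(docs).most_common(3))
```

The same pipeline shape (map, join, group, aggregate) is what the Cascading flow in the diagram expresses at cluster scale.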
3. core values
Data Science teams develop actionable insights,
building confidence for decisions.
That work may influence a few decisions worth
billions (e.g., M&A) or billions of small decisions
(e.g., AdWords)… probably somewhere in-between.
Solving for pattern, at scale.
NB: projects require teams, not sole players.
4. Intro to Data Science
backstory
5. personal timeline (1980s → 2010s)
school: Stanford
research: IBM, NASA
enterprise: Bell Labs, Moto
start-up CTO: BNTI
consult
lead data teams: Symbiot, Adknowledge, ShareThis, IMVU, etc.
6. inflection point: demand side
• huge Internet successes after the 1997 holiday season…
AMZN, EBAY, then GOOG, Inktomi (YHOO Search)
• consider this metric:
annual revenue per customer / operational data store size
dropped more than 100x within a few years after 1997
• storage and processing costs plummeted; now we must
work much smarter to extract ROI from Big Data…
our methods must adapt
• "conventional wisdom" of RDBMS and BI tools became
less viable; business cadre still focused on pivot tables
and pie charts… which tends toward inertia
• MapReduce and the Hadoop open source stack grew
directly out of this context… but that only solves parts
massive disruption in retail, advertising, etc.
"All of Fortune 500 is now on notice over the next 10-year
period." – Geoffrey Moore, 2012 (Mohr Davidow Ventures)
8. statistical thinking
[diagram: Process, Variation, Data, Tools]
a mode of thinking which includes both logical and analytical
reasoning: evaluating the whole of a problem, as well as its
component parts; attempting to assess the effects of changing
one or more variables
this approach attempts to understand not just problems and
solutions, but also the processes involved and their variances
particularly valuable in Big Data work when combined with hands-on
experience in physics – roughly 50% of my peers come from physics
or physical engineering… programmers typically don’t think this way
9. most valuable skills
• approximately 80% of the costs for data-related projects
get spent on data preparation – mostly on cleaning up
data quality issues
• unfortunately, data-related budgets for many companies tend
to go into frameworks which can only be used after clean up
• most valuable skills:
‣ learn to use programmable tools that prepare data
‣ learn to generate compelling data visualizations (e.g., D3)
‣ learn to estimate the confidence for reported results
‣ learn to automate work, making analysis repeatable
the rest of the skills – modeling,
algorithms, etc. – those are secondary
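Since roughly 80% of project cost goes into data preparation, the first skill above is worth a concrete sketch. This is a minimal, stdlib-only example of programmable data prep; the `name`/`revenue` schema and the skip-bad-rows policy are hypothetical choices for illustration:

```python
import csv
import io

def clean_rows(raw_csv):
    """Normalize a messy CSV: trim whitespace, coerce numerics,
    and drop rows with missing or malformed required fields."""
    cleaned = []
    for row in csv.DictReader(io.StringIO(raw_csv)):
        name = (row.get("name") or "").strip()
        rev = (row.get("revenue") or "").strip()
        if not name or not rev:
            continue  # data-quality issue: skip incomplete rows
        try:
            revenue = float(rev)
        except ValueError:
            continue  # non-numeric revenue: skip rather than guess
        cleaned.append({"name": name, "revenue": revenue})
    return cleaned

raw = "name,revenue\nAcme ,1200\n,300\nBeta,abc\nGamma,50\n"
print(clean_rows(raw))
```

Because the clean-up lives in code rather than in manual spreadsheet edits, the analysis stays repeatable, which is the fourth skill on the list.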
10. social caveats
• the phrase “This data cannot be correct!” may be an
early warning about the organization itself
• much depends on how the people whom you work
alongside tend to arrive at their decisions:
‣ probably good: Induction, Abduction, Circumscription
‣ probably poor: Deduction, Speculation, Justification
in general, one good data visualization
can put many ongoing verbal arguments to rest
(comic: xkcd)
16. Intro to Data Science
build:
data science teams
17. process
discovery: help people ask the right questions
modeling: allow automation to place informed bets
integration: deliver products at scale to customers
apps: leverage smarts in product features
systems: keep infrastructure running, cost-effective
(visualization: Gephi)
18. matrix = needs × roles
[matrix diagram: columns (needs) are discovery, modeling, integration, apps, systems;
rows (roles) are stakeholder, scientist, developer, ops]
19. matrix: usage
[matrix diagram: needs × roles]
conceptual tool for managing Data Science teams:
overlay your project requirements (needs)
with your team's strengths (roles)
that will show very quickly where to focus
NB: bring in individuals who cover 2-3 needs,
particularly for team leads
20. matrix: needs
[matrix diagram: needs × roles]
one dimension is "needs":
discovery, modeling, integration, apps, systems
these are the primary phases of leveraging Big Data…
stakeholders represent the domain: the key aspect
to leverage
analysts usually drive from discovery toward integration,
while the engineers tend to drive from systems toward
integration
NB: effective, hands-on management in Data Science
must live in the space of integration, not delegate it
21. matrix: roles
[matrix diagram: needs × roles]
one dimension is "roles":
stakeholder, scientist, developer, ops
each role leverages different disciplines, opportunities,
and risks… there's great power in pairing people with
complementary skills, in team environments where they
can recognize each other's priorities and perspectives
blurring these roles is wonderful, when you find great
people capable of doing so, e.g., DevOps… however,
when businesses get into trouble, they will tend to
"push down" these roles, blurring boundaries in
ways which stress teams and limit scalability
22. matrix: example team
[matrix diagram: an example team's coverage plotted on needs × roles]
23. matrix: example team
[matrix diagram: an example team's coverage plotted on needs × roles]
summary: this team seems heavy on systems, may need more overlap
between modeling and integration, particularly among team leads
24. typical hand-offs
[flow diagram: vendor data sources feed a data warehouse, query hosts,
and a production cluster; outputs include BI & dashboards, reporting,
presentations, and decision support; classifiers, predictive analytics,
and recommenders drive customer interactions via internal APIs, crons, etc.;
modeling and automation connect engineers and analysts with business
stakeholders; the phases span integrity, availability, discovery,
and communications, hand-offs among people throughout]
25. data priorities
• Availability
Top priority: providing access to data as needed.
Lack of availability causes large hidden costs to a business.
• Integrity
• Discovery
• Modeling
• Communications
26. data priorities
• Availability
• Integrity
Work within Engineering to ensure that customer data,
internal metrics, third-party sources, etc., get collected and
maintained in ways which are meaningful and consistent
for required business use cases.
• Discovery
• Modeling
• Communications
27. data priorities
• Availability
• Integrity
• Discovery
Analyze and visualize data on behalf of business stakeholders.
Leverage statistics so that we not only say "what" decisions to
take, but can answer "why?" and "how good are they?"
• Modeling
• Communications
28. data priorities
• Availability
• Integrity
• Discovery
• Modeling
Use business learnings in automated, scalable ways.
For example, manage an automated bid system.
Principally "algorithmic modeling", not "data modeling".
• Communications
29. data priorities
• Availability
• Integrity
• Discovery
• Modeling
• Communications
Work closely with stakeholders so that insights gleaned from
data + analysis are understood and important to the business.
The sum of learnings from this ongoing process represents
our primary value.
30. Intro to Data Science
theory:
wrangle the data
31. CAP theorem
[triangle diagram: C = strong consistency, A = high availability,
P = partition tolerance; eventual consistency sits along the A-P edge]
32. CAP theorem
“You can have at most two of these properties for any shared-data
system… the choice of which feature to discard determines the
nature of your system.” – Eric Brewer, 2000 (Inktomi)
• revenue transactions in ecommerce typically require
strong consistency and partition tolerance
• most analytics jobs for business use cases generally require
availability and eventual consistency, but tend to
not tolerate highly partitioned data
• ETL becomes an Achilles heel for "agile":
‣ agile/experiment-driven/scale-out, which leads to…
‣ provably-hard-to-detect metadata drift, leading to…
‣ high-risk technical debt
33. interpretation
• purpose: theoretical limits for data access patterns
• essence:
‣ consistency
‣ availability
‣ partition tolerance
• best case scenario: you may pick two … or spend
billions struggling to obtain all three at scale (GOOG)
• translated: cost of doing business
https://www.cs.berkeley.edu/~brewer/cs262b-2004/PODC-keynote.pdf
http://www.julianbrowne.com/article/viewer/brewers-cap-theorem
34. data access patterns
• the world is not made of data warehouses…
• a handful of common data access patterns prevail
• learn to recognize these for any given problem
• typically expressed as trade-offs among:
‣ speed & volume (latency and throughput)
‣ reads & writes (access and storage)
‣ consistency / availability / partition tolerance
as for roles on teams, some mixing is valuable;
OTOH, too much blurring of boundaries causes stress
35. data access patterns
• design patterns: originated in consensus negotiation
for architecture, later used in software engineering
• consider the corollaries in large-scale data work…
• essential advice:
select data frameworks based on
your data access patterns
• in other words, decouple use cases based on
needs – to avoid “one size fits all” blockers
• let’s review some examples…
36. access → frameworks → forfeits
("x" marks the CAP property forfeited)
financial transactions            | general ledger in RDBMS         | CAx
ad-hoc queries                    | RDS (hosted MySQL)              | CAx
reporting, dashboards             | like Pentaho                    | CAx
log rotation/persistence          | like Riak                       | xxP
search indexes                    | like Lucene/Solr                | xAP
static content, archives          | S3 (durable storage)            | xAP
customer facts                    | like Redis, Membase             | xAP
distributed counters, locks, sets | like Redis                      | xAP*
data objects CRUD                 | key/value, like NoSQL on MySQL  | CxP
authoritative metadata            | like Zookeeper                  | CxP
data prep, modeling at scale      | like Hadoop/Cascading + R       | CxP
graph analysis                    | like Hadoop + Redis + Gephi     | CxP
data marts                        | like Hadoop/HBase               | CxP
39. interpretation
• purpose: theoretical limits for scalable computation
• essence:
task overhead and data independence
define limits of parallelism for any given problem;
however, these also suggest how well a problem
can be scaled-out
• translated: return on investment
http://en.wikipedia.org/wiki/Amdahl's_law
http://www.bu.edu/tech/research/training/tutorials/matlab-pct/scalability/
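The limits cited above come from Amdahl's law: if a fraction p of a job can run in parallel across n workers, the best possible speedup is 1 / ((1 - p) + p / n). A tiny sketch makes the "return on investment" point concrete:

```python
def amdahl_speedup(parallel_fraction, workers):
    """Amdahl's law: speedup = 1 / ((1 - p) + p / n).
    The serial fraction (1 - p) caps the return on adding workers."""
    p = parallel_fraction
    return 1.0 / ((1.0 - p) + p / workers)

# even with 1000 workers, a 5% serial portion caps speedup below 20x
for n in (10, 100, 1000):
    print(n, round(amdahl_speedup(0.95, n), 2))
```

That cap is why reducing the serial portion (task overhead, iteration) often pays more than adding cluster nodes.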
40. parallel computation
• parallelism allows for horizontal scale-out, which
creates business "levers" in cost/performance at scale
• NB: MapReduce provides a compute framework which
is part-parallel and part-serial… that tends to
complicate app development
• most hard problems in industry have portions which
do not allow data independence, or which require
iteration
• current efforts in massively parallel algorithms research
may help to parallelize problems and reduce iteration –
estimates are 3-5 years out for industry use
GPUs and other hardware architecture advancements
will likely make Hadoop unrecognizable 3-5 years out
41. Intro to Data Science
theory:
manage the science
42. the science in data science
• Estimate Probability!
• Calculate Analytic Variance!!
• Apply Learning Theory!!!
• Manipulate Order Complexity!!!!
[background art: mirrored event-log lines, e.g. "NUI:DressUpMode",
"Client Inventory Panel Apply Product", "Unique Registration"]
43. probability estimation
“a random variable or stochastic variable is a
variable whose value is subject to variations”
“an estimator is a rule for calculating an
estimate of a given quantity based on observed
data”
estimators and probability
distributions provide the essential
basis for our insights
bayesian methods, shrinkage…
these are our friends
quantile estimation, empirical CDFs…
…versus frequentist notions
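Empirical CDFs and quantile estimation, mentioned above, need nothing beyond sorted data. A minimal sketch (the quantile rule here is a simple rank-based estimate, one of several common conventions):

```python
import bisect

def empirical_cdf(samples):
    """Return a step-function estimator F(x) = fraction of samples <= x."""
    data = sorted(samples)
    n = len(data)
    def cdf(x):
        # count of samples <= x via binary search on the sorted data
        return bisect.bisect_right(data, x) / n
    return cdf

def quantile(samples, q):
    """Simple rank-based quantile estimate from observed data."""
    data = sorted(samples)
    idx = max(0, min(len(data) - 1, int(q * len(data))))
    return data[idx]

obs = [2, 4, 4, 5, 7, 9]
F = empirical_cdf(obs)
print(F(4), quantile(obs, 0.5))
```

No distributional assumption is made: the data speak for themselves, which is exactly the appeal of empirical CDFs in exploratory work.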
44. analytic variance
our tools for automation leverage deep
understanding of covariance
cannot overstate the importance of
sampling… insist on metrics described
as confidence intervals, where valid
bootstrapping, bagging…
these are our friends
Monte Carlo methods resolve “black box”
problems
point estimates may help prevent
“uninformed” decisions
do not skimp on this part, ever…
a hard lesson learned from BI failures
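Bootstrapping, named above as a friend, turns any statistic into a confidence interval by resampling with replacement. A minimal percentile-bootstrap sketch (the resample count, alpha, and fixed seed are illustrative choices):

```python
import random

def bootstrap_ci(samples, stat, n_resamples=2000, alpha=0.05, seed=42):
    """Percentile bootstrap confidence interval for any statistic:
    resample with replacement, recompute the statistic, take quantiles."""
    rng = random.Random(seed)
    n = len(samples)
    stats = sorted(
        stat([rng.choice(samples) for _ in range(n)])
        for _ in range(n_resamples)
    )
    lo = stats[int((alpha / 2) * n_resamples)]
    hi = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

def mean(xs):
    return sum(xs) / len(xs)

data = [12, 15, 14, 10, 18, 11, 16, 13, 14, 15]
lo, hi = bootstrap_ci(data, mean)
print(f"mean = {mean(data):.1f}, 95% CI in ({lo:.1f}, {hi:.1f})")
```

Reporting the interval rather than the point estimate is precisely the "metrics described as confidence intervals" discipline the slide insists on.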
45. learning theory
in general, apps alternate between learning
patterns/rules and retrieving similar things…
statistical learning theory – rigorous,
prevents you from making billion dollar
mistakes, probably our future
machine learning – scalable, enables
you to make billion dollar mistakes, much
commercial emphasis
supervised vs. unsupervised
arguably, optimization is a related area
once Big Data projects get beyond merely
digesting log files, optimization will likely
become yet another buzzword :)
46. order complexity
techniques for manipulating order complexity:
dimensional reduction… with clustering
as a common case
e.g., you may have 100 million HTML docs,
but there are only ~10K useful keywords
low-dimensional structures, PCA
linear algebra tricks: eigenvalues, matrix
decomposition, etc.
many hard problems resolved by “divide and
conquer”
this is an area ripe for much advancement in
algorithms research near-term
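The "linear algebra tricks" above (eigenvalues, matrix decomposition) are what PCA reduces to. For 2-D data the eigendecomposition of the covariance matrix has a closed form, so a stdlib-only sketch fits in a few lines (the sample points are made up; the diagonal-case fallback assumes x carries the larger variance):

```python
import math

def pca_2d(points):
    """First principal component of 2-D data via the closed-form
    eigendecomposition of the 2x2 covariance matrix."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    # covariance matrix [[sxx, sxy], [sxy, syy]]
    sxx = sum((x - mx) ** 2 for x, _ in points) / n
    syy = sum((y - my) ** 2 for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points) / n
    # largest eigenvalue via trace/determinant of the 2x2 matrix
    tr, det = sxx + syy, sxx * syy - sxy * sxy
    lam = tr / 2 + math.sqrt(max(tr * tr / 4 - det, 0.0))
    # corresponding eigenvector; fall back to the x-axis when sxy ~ 0
    vx, vy = (lam - syy, sxy) if abs(sxy) > 1e-12 else (1.0, 0.0)
    norm = math.hypot(vx, vy)
    return lam, (vx / norm, vy / norm)

# points spread along y = x: the first component is near (0.707, 0.707)
pts = [(1, 1.1), (2, 1.9), (3, 3.2), (4, 3.8), (5, 5.1)]
lam, vec = pca_2d(pts)
print(round(vec[0], 2), round(vec[1], 2))
```

At higher dimensions the same idea, projecting onto the top eigenvectors of the covariance matrix, is how 100 million documents collapse onto a few thousand useful directions.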
47. Intro to Data Science
praxis
49. a sample of great algorithms…
[word cloud pairing business problems with techniques]
problems: time series analysis, seasonal variation, geospatial, funnel
optimization, topics, lang id, anti-fraud, regression, elasticity of demand,
recommender, key phrase, doc similarity, classifier, customer lifetime value,
market segmentation, dimensional reduction, customer experiments,
sessionization, social graph, "what if?"
techniques: hidden markov models, ARIMA, bayesian point estimates, kriging,
k-d trees, linear programming, cosine similarity, LDA, TextRank, LID, TF-IDF,
random forest, GLM/GAM, differential equations, k-medoids, PCA, LSH,
k-means||, probabilistic hashing, connected components, markov random walk,
association rules, multi-arm bandit, sample variance, affiliation networks,
MCMC, bootstrapping
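TF-IDF from the list above is small enough to sketch whole: term frequency weighted by inverse document frequency, so terms common in one document but rare across the collection score highest. The smoothed IDF variant used here is one common convention, not the only one:

```python
import math
from collections import Counter

def tf_idf(docs):
    """TF-IDF scores per document: term frequency weighted by
    smoothed inverse document frequency."""
    n = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter()                      # document frequency per term
    for tokens in tokenized:
        df.update(set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        total = len(tokens)
        scores.append({
            t: (count / total) * math.log((1 + n) / (1 + df[t]))
            for t, count in tf.items()
        })
    return scores

docs = ["big data tools", "big data teams", "pie charts"]
s = tf_idf(docs)
# 'tools' appears in fewer documents than 'big', so it scores higher in doc 0
print(s[0]["tools"] > s[0]["big"])
```

The same weighting underlies the key-phrase and doc-similarity items on the slide: cosine similarity between TF-IDF vectors is a standard first cut at "which documents are alike?".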
50. Intro to Data Science
Paco Nathan
Concurrent, Inc.
pnathan@concurrentinc.com
@pacoid
Copyright ©2012, Concurrent, Inc.
Editor's notes
responsible for net lift, or we work on something else