Big Data Analytics - Best of the Worst : Anti-patterns & Antidotes

Big Data Analytics -
The Best of the Worst
Krishna Sankar
@ksankar
https://www.linkedin.com/in/ksankar

About Me
o Data Scientist
• Decision Data Science & Product Data Science [Data Science Folk Knowledge http://goo.gl/O4svPx]
• Insights = Intelligence + Inference + Interface [https://goo.gl/s2KB6L]
• Predicting NFL with Elo like Nate Silver & 538 [NFL : http://goo.gl/Q2OgeJ, NBA’15 : https://goo.gl/aUhdo3]
o Have been speaking at OSCON [http://goo.gl/1MJLu], PyCon, Pydata [http://vimeo.com/63270513,
http://www.slideshare.net/ksankar/pydata-19] …
o Have done lots of things:
• Big Data (Retail, Bioinformatics, Financial, AdTech), Starting MS-CFRM, University of WA
• Written books (Web 2.0, Wireless, Java, …), standards, some work in AI
• Guest Lecturer at Naval PG School,…
o Studying MS-CFRM (Computational Finance/Risk management) UWA
o Full-day Spark workshop “Advanced Data Science w/ Spark” / Spark Summit-E’15[https://goo.gl/7SBKTC]
o Co-author : “Fast Data Processing with Spark”, Packt Publishing [http://goo.gl/eNtXpT]
o Reviewer : “Machine Learning with Spark” Packt Publishing
o Volunteer as Robotics Judge at First Lego league World Competitions
o @ksankar, doubleclix.wordpress.com

Background – Top 5
http://tcapp2.publishpath.com/rabbithole
http://conservationmagazine.org/wordpress/wp-content/uploads/2013/05/context-matters.jpg

1) Data Science
The art of building a model with known knowns
Which when let loose, works with unknown unknowns
Donald Rumsfeld is an armchair Data Scientist !
http://smartorg.com/2013/07/valuepoint19/
Diagram: knowns and unknowns (axes: The World – Knowns / Unknowns; You – Known / Unknown)
o Unknown Known
• Others know, you don’t
• Model Evolution/DevOps to capture this
o Capture in Models
o Facts, outcomes or scenarios we have not encountered, nor considered
o “Black swans”, outliers, long tails of probability distributions
o Lack of experience, imagination
o Potential facts, outcomes we are aware of, but not with certainty
o Stochastic processes, probabilities
o Known Knowns
• There are things we know that we know
o Known Unknowns
• That is to say, there are things that we now know we don't know
o But there are also Unknown Unknowns
• There are things we do not know we don't know
Goal of Big Data is Analytics

2) The pipeline is the context
o Scalable Model Deployment
o Big Data automation & purpose built appliances (soft/hard)
o Manage SLAs & response times
o Volume
o Velocity
o Streaming Data
o Canonical form
o Data catalog
o Data Fabric across the organization
o Access to multiple sources of data
o Think Hybrid – Big Data Apps, Appliances & Infrastructure
Collect, Store, Transform
o Metadata
o Monitor counters & Metrics
o Structured vs. Multi-structured
o Flexible & Selectable
§ Data Subsets
§ Attribute sets
o Refine model with
§ Extended Data subsets
§ Engineered Attribute sets
o Validation run across a larger data set
Reason, Model, Deploy
Data Management
Data Science
o Dynamic Data Sets
o 2 way key-value tagging of datasets
o Extended attribute sets
o Advanced Analytics
Explore, Visualize, Recommend, Predict
o Performance
o Scalability
o Refresh Latency
o In-memory Analytics
o Advanced Visualization
o Interactive Dashboards
o Map Overlay
o Infographics
¤ Bytes to Business a.k.a. Build the full stack
¤ Find Relevant Data For Business
¤ Connect the Dots

3) Mind Your “I”s, “C”s & “V”s
Volume, Velocity, Variety
Context, Connectedness
Intelligence, Interface, Inference
o Three Amigos
o Interface = Cognition
o Intelligence = Compute(CPU) & Computational(GPU)
o Infer Significance & Causality
CURATED SIGNALS > APPLIED INTELLIGENCE > STRATIFIED INFERENCE

4) Model Evolution & Concept Drift
(Chart axes: Complexity vs. Value)
o Dynamic dashboards
• Multi-dimensional pivots w/ customization
• Selectable algorithms on data subsets
• “Cluster Customer for 5 thanksgiving seasons”
o Learning Models
• Automatic feature selection & hyperparameter optimization as it gets more data
o Dynamic Models – Model selection based on context
o Automated Analytics - Let data tell the story
• Feature Learning, AI, Deep Learning
o Concept Drift
• Validate model assumptions + hyperparameters + features in the current context – after they are in production
Ref: Prof. Josh Bloom, Keynote: A Systems View of Machine Learning, #pydata Seattle’15
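The "validate after they are in production" idea can be sketched as a rolling error monitor. This is only a minimal illustration, not the method from the keynote: the class name `DriftMonitor`, the window size and the tolerance are all hypothetical choices, and real drift detection would usually also look at feature distributions.

```python
from collections import deque


class DriftMonitor:
    """Flags possible concept drift when the recent error rate
    rises above the error rate measured at validation time."""

    def __init__(self, baseline_error, window=100, tolerance=0.10):
        self.baseline_error = baseline_error  # error rate seen in the lab
        self.tolerance = tolerance            # hypothetical allowed degradation
        self.recent = deque(maxlen=window)    # 1 = wrong prediction, 0 = correct

    def observe(self, predicted, actual):
        # Record whether the production prediction matched the outcome.
        self.recent.append(0 if predicted == actual else 1)

    def error_rate(self):
        return sum(self.recent) / len(self.recent) if self.recent else 0.0

    def drifted(self):
        # Judge only once the window is full, to avoid noisy early alarms.
        full = len(self.recent) == self.recent.maxlen
        return full and self.error_rate() > self.baseline_error + self.tolerance


if __name__ == "__main__":
    monitor = DriftMonitor(baseline_error=0.05, window=10)
    for _ in range(10):
        monitor.observe(predicted=1, actual=1)   # model still behaves as validated
    print(monitor.drifted())
    for _ in range(5):
        monitor.observe(predicted=1, actual=0)   # the world has changed
    print(monitor.drifted())
```

A drift flag would then trigger the re-validation of assumptions, hyperparameters and features that the slide calls for.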

5) The Sense & Sensibility of a DataScientist DevOps
o Analytics in the lab = Investigative
• Interactive, Iterative, Explorative
• Output is usually decision data science
o Analytics in the factory = Operational
• Automated, systemic, transparent & explainable
• Output is embedded intelligence
• Embedded in customer facing decision systems
Ref: Josh Wills - From the labs to the factory,
https://doubleclix.wordpress.com/2013/11/17/of-building-data-products/
http://doubleclix.wordpress.com/2014/05/11/the-sense-sensibility-of-a-data-scientist-devops/
There is a chasm between Model/Reason and Deploy

6) Data is your product, regardless of what you sell
o Data is the lens through which you see the business and feel the pulse
o Collect the right data through “Thoughtful Data Design”
o Give Data Back in a Powerful Way
o But don’t confuse or overwhelm the users
• The users have to feel safe
• The users have to feel they are in control
o Never try to launch a complicated data product on a fixed schedule
o Offer progressively sophisticated products, leveraging the data & insights, across the different user population segments
• Customer segmentation & stratification is not just for retail !
Ref: Josh Wills - From the labs to the factory,
https://doubleclix.wordpress.com/2013/11/17/of-building-data-products/
http://doubleclix.wordpress.com/2014/05/11/the-sense-sensibility-of-a-data-scientist-devops/

“There are no routine statistical questions, only questionable statistical routines” - David Cox
Ref: Gabriele Corno, Natural History Museum in #London .. by George Thalassinos
Big Data Analytics - The Best of the Worst

Data Swamp
Blue Pill
o Typical case of “ungoverned data stores addressing a limited data science audience”
o The company has proudly crossed the chasm to the big data world with a shiny new Hadoop infrastructure.
o Now everyone starts putting their data into this “lake”.
o After a few months, the disks are full; Hadoop is replicating 3 copies; some bytes are even falling off the wires onto the floor – but no one has any clue what data is in there, how consistent it is, or whether it is semantically coherent
Red Pill-Data Curation
o Data Curation
• A consistent published schema
o Data Quality & Data Lineage, “descriptive metadata and an underlying mechanism to maintain it”, all are part of the data curation layer …
o Semantic consistency across diverse multi-structured multi-temporal transactions requires a level of data curation & discipline
o Design for the right “Data Gravity” & “Data Mass” as Van Lindberg mentioned, yesterday, in his keynote
• Not Data Molasses !

https://www.linkedin.com/pulse/data-lakes-udls-vs-analytics-platforms-gargi-adhav
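A curation layer with "a consistent published schema" can be sketched as a gate at the landing zone. Everything named here is an assumption for illustration: the `PUBLISHED_SCHEMA` fields, the `_source` lineage tag and the function names are hypothetical, not part of any tool mentioned in the deck.

```python
# Hypothetical published schema: field name -> expected Python type.
PUBLISHED_SCHEMA = {
    "customer_id": int,
    "order_ts": str,
    "amount": float,
}


def validate(record, schema=PUBLISHED_SCHEMA):
    """Return a list of problems; an empty list means the record conforms."""
    problems = []
    for field, expected in schema.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            actual = type(record[field]).__name__
            problems.append(f"{field}: expected {expected.__name__}, got {actual}")
    for field in record:
        if field not in schema:
            problems.append(f"unpublished field: {field}")
    return problems


def curate(records, source):
    """Split records into accepted and rejected.
    Accepted records get a simple lineage tag naming their source."""
    accepted, rejected = [], []
    for rec in records:
        problems = validate(rec)
        if problems:
            rejected.append((rec, problems))
        else:
            accepted.append({**rec, "_source": source})
    return accepted, rejected


if __name__ == "__main__":
    good = {"customer_id": 1, "order_ts": "2015-11-01T23:10:00", "amount": 9.5}
    bad = {"customer_id": "1", "amount": 9.5}
    ok, not_ok = curate([good, bad], source="web_logs")
    print(len(ok), "accepted;", len(not_ok), "rejected:", not_ok[0][1])
```

Rejecting at the door is one design choice; a quarantine area that keeps rejected records for later repair would serve the same goal of knowing what is in the lake.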

Big Data To Nowhere
Blue Pill
o IT sees an opportunity and starts building the infrastructure, sometimes massive, and puts petabytes of data in the Big Data Hub or lake or pool or … But there are no relevant business-facing apps.
o A conversation goes like this …
• Business : I heard that we have a big data infrastructure, cool. When can I show a demo to our customers ?
• IT : We have petabytes of data and I can show the Hadoop admin console. We even have the Spark UI !
• Business : … (unprintable)
Red Pill-Full Stack MVP
(see next slide)
o Build the full stack, i.e. bits to business …
o Build incremental Decision Data Science & Product Data Science layers, as appropriate …
o The following conversation is a lot better …
• Business : I heard that we have a big data infrastructure, cool. When can I show a demo to our customers ?
• IT : Actually we don’t have all the data. But from the transaction logs and customer data, we can infer that males between 34 and 36 buy a lot of stuff from us between 11:00 PM & 2:00 AM !
• Business : That is interesting … Show me a graph. BTW, do you know what the revenue and the profit margin are from these buys ?
• IT : Graph is no problem. We have a shiny app with the dynamic model over the web logs.
• IT : With the data we have, we only know that they comprise ~30% of our volume by transaction. But we do not have the order data in our Hadoop yet. We can … let me send out a budget request …

[Diagram: E2E analytics MVP stack]
o Pipeline stages: Collect, Store, Transform, Report, Visualize, Recommend, Predict, Reason, Model, Explore
o Landing Zone
o Real-Time: Kafka …
o ETL
o Data Hub
• Curated Data
• Storage : HDFS, Parquet
• Compute : Hadoop MR, Spark
o In-Memory Hub
o Reporting Hub
o Analytics Hub
o Dashboards
o APIs
o R/Python
o Compositional Analysis
o ML Engine: numPy, SciPy, Pandas, Spark, Azure ML, MPP/Impala
[Table columns: Reporting Hub | Analytics Hub | Hadoop MR]
o Long-running complex jobs - yearly pivots, multi-dimensional exact uniques ✔ ✔
o Real-time ad-hoc pivots, approx uniques (HLL) ✔
o Fast response with aggregated data subsets ✔

https://www.linkedin.com/pulse/why-how-make-mvp-analytics-ruoyu-bao
Build The E2E Analytics MVP Stack
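The stack diagram lists "Approx Uniques (HLL)" for real-time ad-hoc pivots. A minimal HyperLogLog sketch, assuming a SHA-1 based 64-bit hash and 2^12 registers (both arbitrary choices made here, not taken from the slides), shows why it suits that tier: memory is fixed at a few thousand small counters no matter how many items are seen.

```python
import hashlib
import math


class HyperLogLog:
    """Approximate distinct count using 2**p small registers."""

    def __init__(self, p=12):
        self.p = p
        self.m = 1 << p                              # number of registers
        self.registers = [0] * self.m
        self.alpha = 0.7213 / (1 + 1.079 / self.m)   # bias constant, valid for m >= 128

    def add(self, item):
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")        # 64-bit hash value
        idx = h >> (64 - self.p)                     # first p bits choose a register
        rest = h & ((1 << (64 - self.p)) - 1)        # remaining 64 - p bits
        # Rank = 1-based position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self):
        harmonic = sum(2.0 ** -r for r in self.registers)
        estimate = self.alpha * self.m * self.m / harmonic
        zeros = self.registers.count(0)
        if estimate <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            estimate = self.m * math.log(self.m / zeros)
        return int(round(estimate))


if __name__ == "__main__":
    hll = HyperLogLog()
    for visitor in range(50000):
        hll.add(f"user-{visitor}")
        hll.add(f"user-{visitor}")   # duplicates do not change the estimate
    print("approx uniques:", hll.count())
```

With 4096 registers the typical relative error is around 1.6%, which is the trade the diagram implies: exact uniques go to the long-running batch jobs, approximate ones to the fast path.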

A Data Too Far
Blue Pill
o You might get a few .gz files, a few .csv files and of course, Parquet files, in multiple systems
o Some will have IDs, some names, some aggregated by week, some aggregated by day and others pure transactional.
o The challenge is that we have the data, but there is no easy way to combine them for interesting inferences …
o “..The most creative things that happen with data are less about sophisticated algorithms and vast computation (though those are nice) than it is about putting together different pieces of data that were previously locked up in different silos.”
o Data Pipelines (e.g. Kafka) with in-line processing to ensure correctness, semantic and temporal congruence & integrity
Ref: Jay Kreps, Announcing Confluent
Where is the Tofu ?
Blue Pill
o It is very simple to produce “reasonable” recommendations
o But extremely difficult to improve them to become “great”
o And there is a huge difference in business value between reasonable & great …
o The Antidote : The insights and the algorithms should be relevant and scalable …
o There is a huge gap between Model-Reason and Deploy …
o Statistical significance need not mean business significance
o "Don't confuse the statistical significance of an experiment with the magnitude of the result, even though the word "significance" is often used for both" – Peter Norvig
Ref: Xavier Amatriain, when he talked about the Netflix Prize
"Knowledge is a process of piling up facts; wisdom lies in their simplification."
- Martin Fischer
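The gap between statistical and business significance is easy to see with a two-proportion z-test on a large experiment. The conversion counts below are made up for illustration; the point is only that a huge sample makes a tiny lift "significant".

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (absolute lift, z statistic, two-sided p-value)
    for conversion counts in control (a) and treatment (b)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    p_value = math.erfc(abs(z) / math.sqrt(2))   # two-sided normal tail
    return p_b - p_a, z, p_value


if __name__ == "__main__":
    # Hypothetical experiment: 2 million users per arm, 10.0% vs 10.1% conversion.
    lift, z, p_value = two_proportion_z_test(200_000, 2_000_000, 202_000, 2_000_000)
    print(f"lift={lift:.4f}  z={z:.2f}  p={p_value:.5f}")
```

Here the p-value falls well below 0.05 while the lift is a tenth of a percentage point. Whether that is worth deploying depends on revenue per conversion and the cost of running the new model, which the test itself says nothing about.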

Analytics - miscues
o Don’t Torture the Data
Down the rabbit hole, art by frostyshadows
http://frostyshadows.deviantart.com/art/Down-the-Rabbit-Hole-358090601

Design Principles
1. Start with needs*
2. Do less
3. Design with data
4. Do the hard work to make it simple
5. Iterate. Then iterate again.
6. Build for inclusion
7. Understand context
8. Build digital services, not websites
9. Be consistent, not uniform
10. Make things open: it makes things better
https://www.gov.uk/design-principles

Data Alone is not enough
o Data alone is not enough
• Induction not deduction - Every learner should embody some knowledge or assumptions beyond the data it is given in order to generalize beyond it
o Machine Learning is not magic – one cannot get something from nothing
• In order to infer, one needs the knobs & the dials
• One also needs a rich expressive dataset
o Data Scientists are not Data Alchemists
• Don’t expect Analytic Gold from a pack of data lead
A few useful things to know about machine learning- by Pedro Domingos
http://dl.acm.org/citation.cfm?id=2347755
https://www.flickr.com/photos/bionerd/3123155390

More Data Beats a Cleverer Algorithm
o More Data Beats a Cleverer Algorithm
• Or conversely select algorithms that improve with data
• Don’t optimize prematurely without getting more data
o Learn many models, not Just One
• Ensembles ! – Change the hypothesis space
• Netflix prize
• E.g. Bagging, Boosting, Stacking
o Simplicity Does not necessarily imply Accuracy
o Representable Does not imply Learnable
• Just because a function can be represented does not mean it can be learned
o Correlation Does not imply Causation
o http://doubleclix.wordpress.com/2014/03/07/a-glimpse-of-google-nasa-peter-norvig/
o A few useful things to know about machine learning - by Pedro Domingos
§ http://dl.acm.org/citation.cfm?id=2347755
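"Learn many models, not just one" can be sketched with bagging: fit the same weak learner on bootstrap resamples and take a majority vote. The learner here is a one-feature threshold ("stump") and the data set is invented, with one deliberately mislabeled point; both are assumptions made for illustration.

```python
import random
from collections import Counter


def fit_stump(points):
    """Pick the threshold on a 1-D feature that minimizes training errors.
    The stump predicts class 1 when x is above the threshold."""
    best_errors, best_threshold = None, None
    for threshold, _ in points:
        errors = sum((x > threshold) != bool(y) for x, y in points)
        if best_errors is None or errors < best_errors:
            best_errors, best_threshold = errors, threshold
    return best_threshold


def bagging(points, n_models=25, seed=0):
    """Fit one stump per bootstrap resample (sampling with replacement)."""
    rng = random.Random(seed)
    return [fit_stump([rng.choice(points) for _ in points]) for _ in range(n_models)]


def predict(thresholds, x):
    """Majority vote across the ensemble."""
    votes = Counter(int(x > t) for t in thresholds)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # x in 0..9 is class 0, x in 10..19 is class 1, plus one noisy label.
    data = [(x, 0) for x in range(10)] + [(x, 1) for x in range(10, 20)] + [(4, 1)]
    ensemble = bagging(data)
    print(predict(ensemble, 2), predict(ensemble, 18))
```

Boosting and stacking, the other two techniques named on the slide, combine models differently (reweighting errors and learning a meta-model, respectively) and are not shown here.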

In short …
o Build Full stack, iteratively building capabilities
o Identify the ‘Right’ Business Problems
o Create Valuable Data Perspectives
o Frame problems & bring analytics together with non-quantitative information to build compelling stories
o Embed Inference & Intelligence in products
https://www.linkedin.com/pulse/article/20141108013125-1290064-winning-at-analytics-takes-more-than-technology
http://www.kdnuggets.com/2014/09/hiring-data-scientist-what-to-look-for.html

Ogilvy & Mather Advertising: Morning view from the Ogilvy & Mather NY office, nicknamed the Chocolate Factory #TravelTuesday
Thank You
