Big Data Analytics -
The Best of the Worst
Krishna Sankar
@ksankar
https://www.linkedin.com/in/ksankar
About Me
o Data Scientist
• Decision Data Science & Product Data Science [Data Science Folk Knowledge http://goo.gl/O4svPx]
• Insights = Intelligence + Inference + Interface [https://goo.gl/s2KB6L]
• Predicting NFL with Elo like Nate Silver & 538 [NFL : http://goo.gl/Q2OgeJ, NBA’15 : https://goo.gl/aUhdo3]
o Have been speaking at OSCON [http://goo.gl/1MJLu], PyCon, PyData [http://vimeo.com/63270513, http://www.slideshare.net/ksankar/pydata-19] …
o Have done lots of things:
• Big Data (Retail, Bioinformatics, Financial, AdTech), Starting MS-CFRM, University of WA
• Written Books (Web 2.0, Wireless, Java, …), Standards, some work in AI
• Guest Lecturer at Naval PG School,…
o Studying MS-CFRM (Computational Finance/Risk management) UWA
o Full-day Spark workshop “Advanced Data Science w/ Spark” / Spark Summit-E’15[https://goo.gl/7SBKTC]
o Co-author : “Fast Data Processing with Spark”, Packt Publishing [http://goo.gl/eNtXpT]
o Reviewer : “Machine Learning with Spark” Packt Publishing
o Volunteer as Robotics Judge at FIRST LEGO League World Competitions
o @ksankar, doubleclix.wordpress.com
Background – Top 5
http://tcapp2.publishpath.com/rabbithole
http://conservationmagazine.org/wordpress/wp-content/uploads/2013/05/context-matters.jpg
1) Data Science
The art of building a model with known knowns
Which when let loose, works with unknown unknowns
Donald Rumsfeld is an armchair Data Scientist !
http://smartorg.com/2013/07/valuepoint19/
[Figure: 2×2 matrix; axes: The World (Knowns, Unknowns) vs. You (Unknown, Known)]
o Others know, you don’t
Model Evolution/DevOps to capture this
o Capture in Models
o Facts, outcomes or scenarios we have not encountered, nor considered
o “Black swans”, outliers, long tails of probability distributions
o Lack of experience, imagination
o Potential facts, outcomes we are aware of, but not with certainty
o Stochastic processes, Probabilities
o Known Knowns
• There are things we know that we know
o Known Unknowns
• That is to say, there are things that we now know we don't know
o But there are also Unknown Unknowns
• There are things we do not know we don't know
Goal of Big Data is Analytics
2) The pipeline is the context
o Scalable Model Deployment
o Big Data automation & purpose-built appliances (soft/hard)
o Manage SLAs & response times
o Volume
o Velocity
o Streaming Data
o Canonical form
o Data catalog
o Data Fabric across the organization
o Access to multiple sources of data
o Think Hybrid – Big Data Apps, Appliances & Infrastructure
Collect, Store, Transform
o Metadata
o Monitor counters & Metrics
o Structured vs. Multi-structured
o Flexible & Selectable
§ Data Subsets
§ Attribute sets
o Refine model with
§ Extended Data subsets
§ Engineered Attribute sets
o Validation run across a larger data set
Reason, Model, Deploy
Data Management
Data Science
o Dynamic Data Sets
o 2-way key-value tagging of datasets
o Extended attribute sets
o Advanced Analytics
Explore, Visualize, Recommend, Predict
o Performance
o Scalability
o Refresh Latency
o In-memory Analytics
o Advanced Visualization
o Interactive Dashboards
o Map Overlay
o Infographics
¤ Bytes to Business a.k.a. Build the full stack
¤ Find Relevant Data For Business
¤ Connect the Dots
3) Mind Your “I”s, “C”s & “V”s
Volume
Velocity
Variety
Context
Connectedness
Intelligence
Interface
Inference
o Three Amigos
o Interface = Cognition
o Intelligence = Compute(CPU) & Computational(GPU)
o Infer Significance & Causality
CURATED SIGNALS > APPLIED INTELLIGENCE > STRATIFIED INFERENCE
4) Model Evolution & Concept Drift
o Dynamic dashboards
• Multi-dimensional pivots w/ customization
o Selectable algorithms on data subsets
• “Cluster customers for 5 Thanksgiving seasons”
o Learning Models
• Automatic feature selection & hyperparameter optimization as it gets more data
o Dynamic Models – model selection based on context
o Automated Analytics – let data tell the story
• Feature learning, AI, deep learning
o Concept Drift
• Validate model assumptions + hyperparameters + features in the current context, after they are in production
[Figure axes: Complexity vs. Value]
Ref: Prof. Josh Bloom, Keynote: A Systems View of Machine Learning, #pydata Seattle’15
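One way to act on the concept-drift point is to compare the distribution a feature had at training time with what the model sees in production. The sketch below is only an illustration, not the method from the talk: it uses a two-sample Kolmogorov-Smirnov statistic, and the function names, the synthetic data and the 0.2 threshold are assumptions made for the example.

```python
import random
from bisect import bisect_right


def ks_statistic(reference, current):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between the two empirical CDFs."""
    ref = sorted(reference)
    cur = sorted(current)
    max_gap = 0.0
    # The gap between two step functions can only change at observed values.
    for x in ref + cur:
        cdf_ref = bisect_right(ref, x) / len(ref)
        cdf_cur = bisect_right(cur, x) / len(cur)
        max_gap = max(max_gap, abs(cdf_ref - cdf_cur))
    return max_gap


def has_drifted(reference, current, threshold=0.2):
    """Flag drift when the gap exceeds a threshold (0.2 is an arbitrary, assumed cut-off)."""
    return ks_statistic(reference, current) > threshold


# Synthetic stand-ins for one feature: training data, production data from the
# same context, and production data after the context has shifted.
rng = random.Random(42)
training = [rng.gauss(0.0, 1.0) for _ in range(2000)]
same_context = [rng.gauss(0.0, 1.0) for _ in range(2000)]
shifted_context = [rng.gauss(1.0, 1.0) for _ in range(2000)]

print("same context drifted:", has_drifted(training, same_context))
print("shifted context drifted:", has_drifted(training, shifted_context))
```

A check like this would typically run per feature on a schedule; a flagged feature is a prompt to revalidate assumptions, hyperparameters and features, not proof that the model is wrong.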
5) The Sense & Sensibility of a DataScientist DevOps
o Analytics in the lab = Investigative
• Interactive, iterative, explorative
• Output is usually decision data science
o Analytics in the factory = Operational
• Automated, systemic, transparent & explainable
• Output is embedded intelligence
• Embedded in customer-facing decision systems
Josh Wills, From the labs to the factory
https://doubleclix.wordpress.com/2013/11/17/of-building-data-products/
http://doubleclix.wordpress.com/2014/05/11/the-sense-sensibility-of-a-data-scientist-devops/
There is a chasm between Model/Reason and Deploy
6) Data is your product, regardless of what you sell
o Data is the lens through which you see the business and feel the pulse
o Collect the right data through “Thoughtful Data Design”
o Give Data Back in a Powerful Way
o But don’t confuse or overwhelm the users
• The users have to feel safe
• The users have to feel they are in control
o Never try to launch a complicated data product on a fixed schedule
o Offer progressively sophisticated products, leveraging the data & insights, across the different user population segments
• Customer segmentation & stratification is not just for retail !
Josh Wills, From the labs to the factory
https://doubleclix.wordpress.com/2013/11/17/of-building-data-products/
http://doubleclix.wordpress.com/2014/05/11/the-sense-sensibility-of-a-data-scientist-devops/
“There are no routine statistical questions, only questionable statistical routines” -- David Cox
Ref: Gabriele Corno, Natural History Museum in #London .. by George Thalassinos
Big Data Analytics - The Best of the Worst
Data Swamp
Blue Pill
o Typical case of “ungoverned data stores addressing a limited data science audience”
o The company has proudly crossed the chasm to the big data world with a shiny new Hadoop infrastructure.
o Now everyone starts putting their data into this “lake”.
o After a few months, the disks are full; Hadoop is replicating 3 copies; some bytes are even falling off the wires onto the floor – but no one has any clue about what data is in there, its consistency, or its semantic coherence
Red Pill-Data Curation
o Data Curation
• A consistent published schema
o Data Quality & Data Lineage, “descriptive metadata and an underlying mechanism to maintain it”, all are part of the data curation layer …
o Semantic consistency across diverse multi-structured, multi-temporal transactions requires a level of data curation & discipline
o Design for the right “Data Gravity” & “Data Mass” as Van Lindberg mentioned, yesterday, in his keynote
• Not Data Molasses !
https://www.linkedin.com/pulse/data-lakes-udls-vs-analytics-platforms-gargi-adhav
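The curation points above (a consistent published schema, plus descriptive metadata and lineage) can be sketched in a few lines. This is a minimal illustration under stated assumptions: the `TRANSACTION_SCHEMA` columns, the `curate` function and the lineage fields are hypothetical names invented here, not part of any Hadoop or data-lake API.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Hypothetical published schema for a "transactions" dataset: column name -> Python type.
TRANSACTION_SCHEMA = {"customer_id": str, "amount": float, "ts": str}


@dataclass
class CuratedBatch:
    rows: list       # rows that conform to the published schema
    rejected: list   # rows kept aside for inspection instead of silently landing in the lake
    lineage: dict    # descriptive metadata recorded alongside the data


def curate(rows, schema, source):
    """Split incoming rows by schema conformance and record simple lineage metadata."""
    accepted, rejected = [], []
    for row in rows:
        conforms = set(row) == set(schema) and all(
            isinstance(row[column], expected) for column, expected in schema.items()
        )
        (accepted if conforms else rejected).append(row)
    lineage = {
        "source": source,
        "schema_columns": sorted(schema),
        "curated_at": datetime.now(timezone.utc).isoformat(),
        "accepted": len(accepted),
        "rejected": len(rejected),
    }
    return CuratedBatch(accepted, rejected, lineage)


incoming = [
    {"customer_id": "c1", "amount": 9.5, "ts": "2015-11-01T23:10:00"},
    {"customer_id": "c2", "amount": "12", "ts": "2015-11-01T23:12:00"},  # wrong type
    {"customer_id": "c3", "amount": 4.0},                                # missing column
]
batch = curate(incoming, TRANSACTION_SCHEMA, source="pos-export")
print(batch.lineage["accepted"], "accepted,", batch.lineage["rejected"], "rejected")
```

The design choice worth noting is that non-conforming rows are set aside with a count in the lineage record, so the question "what is in there?" has an answer later.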
Big Data To Nowhere
Blue Pill
o IT sees an opportunity and starts building the infrastructure, sometimes massive, and puts petabytes of data in the Big Data Hub or lake or pool or … But no relevant business-facing apps.
o A conversation goes like this …
• Business : I heard that we have a big data infrastructure, cool. When can I show a demo to our customers ?
• IT : We have petabytes of data and I can show the Hadoop admin console. We even have the Spark UI !
• Business : … (unprintable)
Red Pill-Full Stack MVP
(see next slide)
o Build the full stack, i.e. bits to business …
o Build incremental Decision Data Science & Product Data Science layers, as appropriate …
o The following conversation is a lot better …
• Business : I heard that we have a big data infrastructure, cool. When can I show a demo to our customers ?
• IT : Actually we don’t have all the data. But from the transaction logs and customer data, we can infer that males between 34 and 36 buy a lot of stuff from us between 11:00 PM & 2:00 AM !
• Business : That is interesting … Show me a graph. BTW, do you know what the revenue and the profit margin from these buys are ?
• IT : Graph is no problem. We have a Shiny app with the dynamic model over the web logs.
• IT : With the data we have, we only know that they comprise ~30% of our volume by transaction. But we do not have the order data in our Hadoop yet. We can … let me send out a budget request …
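The kind of answer IT gives in the better conversation comes from joining transaction logs with customer data and measuring a segment's share of volume. The sketch below assumes tiny in-memory samples; `CUSTOMERS`, `TRANSACTIONS` and the helper names are made up for illustration, and a real stack would run the same logic in Spark or SQL.

```python
from datetime import datetime

# Hypothetical sample data standing in for customer records and transaction logs.
CUSTOMERS = {
    "c1": {"gender": "M", "age": 35},
    "c2": {"gender": "F", "age": 29},
    "c3": {"gender": "M", "age": 52},
}
TRANSACTIONS = [
    ("c1", "2015-11-02T23:15:00"),
    ("c1", "2015-11-03T00:40:00"),
    ("c1", "2015-11-03T01:55:00"),
    ("c2", "2015-11-03T09:10:00"),
    ("c2", "2015-11-03T12:30:00"),
    ("c2", "2015-11-03T23:45:00"),
    ("c3", "2015-11-03T14:00:00"),
    ("c3", "2015-11-03T15:20:00"),
    ("c3", "2015-11-04T00:05:00"),
    ("c1", "2015-11-04T13:00:00"),
]


def is_late_night(timestamp):
    """True for purchases from 11:00 PM up to (not including) 2:00 AM."""
    hour = datetime.fromisoformat(timestamp).hour
    return hour >= 23 or hour < 2


def is_target_segment(customer):
    """Males aged 34 to 36, the segment from the conversation."""
    return customer["gender"] == "M" and 34 <= customer["age"] <= 36


def segment_share(transactions, customers):
    """Fraction of all transactions made late at night by the target segment."""
    if not transactions:
        return 0.0
    hits = sum(
        1
        for customer_id, timestamp in transactions
        if customer_id in customers
        and is_target_segment(customers[customer_id])
        and is_late_night(timestamp)
    )
    return hits / len(transactions)


print(f"share of volume by transaction: {segment_share(TRANSACTIONS, CUSTOMERS):.0%}")
```

Revenue and margin cannot be computed from this data at all, which mirrors the point of the dialogue: the order data has to be brought in first.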
[Figure: end-to-end analytics MVP stack]
Stages: Collect, Store, Transform, Report, Visualize, Recommend, Predict, Reason, Model, Explore
o ML Engine: NumPy, SciPy, Pandas, Spark, Azure ML, MPP/Impala
o R/Python
o Compositional Analysis
o Data Hub
• Curated Data
• Storage : HDFS, Parquet
• Compute : Hadoop MR, Spark
o Landing Zone
o Dashboards, APIs
o Reporting Hub, Analytics Hub, In-Memory Hub
o ETL
o Real-Time: Kafka …
Workload fit (columns: Reporting Hub, Analytics Hub, Hadoop MR)
o Long-running complex jobs: yearly pivots, multi-dimensional exact uniques ✔ ✔
o Real-time ad-hoc pivots, approx. uniques (HLL) ✔
o Fast response with aggregated data subsets ✔
https://www.linkedin.com/pulse/why-how-make-mvp-analytics-ruoyu-bao
Build The E2E Analytics MVP Stack
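The stack contrasts exact uniques (long-running jobs) with approximate uniques via HLL (real-time pivots). To make that trade-off concrete, here is a small HyperLogLog sketch. It is a teaching version under assumptions chosen here (SHA-1 as the hash, 2^12 registers, only the small-range correction); a production system would use the HLL built into its query engine.

```python
import hashlib
import math


class HyperLogLog:
    """Approximate distinct counter using a fixed number of small registers."""

    def __init__(self, p=12):
        self.p = p                      # number of bits used to pick a register
        self.m = 1 << p                 # number of registers (4096 for p=12)
        self.registers = [0] * self.m

    def add(self, item):
        # Take 64 bits of a SHA-1 digest as the hash value (an assumption of this sketch).
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        index = h >> (64 - self.p)                  # first p bits select the register
        rest = h & ((1 << (64 - self.p)) - 1)       # remaining bits
        # Rank = position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[index] = max(self.registers[index], rank)

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)       # bias constant, valid for m >= 128
        harmonic = sum(2.0 ** -r for r in self.registers)
        estimate = alpha * self.m * self.m / harmonic
        zeros = self.registers.count(0)
        if estimate <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            estimate = self.m * math.log(self.m / zeros)
        return round(estimate)


hll = HyperLogLog()
for i in range(50000):
    hll.add(f"user-{i}")
    hll.add(f"user-{i}")   # duplicates do not change the estimate

print("approximate uniques:", hll.count(), "(exact: 50000)")
```

With 4096 registers the expected relative error is roughly 1.04 / sqrt(4096), about 1.6%, in exchange for constant memory; that is the reason approximate uniques fit real-time pivots while exact uniques stay in batch jobs.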
A Data Too Far
Blue Pill
o You might get a few .gz files, a few .csv files and, of course, Parquet files, in multiple systems
o Some will have IDs, some names; some are aggregated by week, some by day, and others are purely transactional.
o The challenge is that we have the data, but there is no easy way to combine them for interesting inferences …
Red Pill-Data Curation
o “..The most creative things that happen with data are less about sophisticated algorithms and vast computation (though those are nice) than it is about putting together different pieces of data that were previously locked up in different silos.”
o Data pipelines (e.g. Kafka) with in-line processing to ensure correctness, semantic and temporal congruence & integrity
Ref: Jay Kreps, Announcing Confluent
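The combining problem described above (IDs in one silo, names in another; daily rows here, weekly aggregates there) comes down to normalizing keys and time grain before joining. The sketch below shows one way to do that in plain Python; `NAME_TO_ID`, the row layouts and the function names are hypothetical, and in a pipeline this logic would sit in the in-line processing step.

```python
from collections import defaultdict
from datetime import date

# Hypothetical mapping from the names used in one silo to the IDs used in another.
NAME_TO_ID = {"Acme": "C001", "Globex": "C002"}


def week_of(day):
    """ISO (year, week) for a date, the common grain both sources are brought to."""
    iso_year, iso_week = day.isocalendar()[:2]
    return (iso_year, iso_week)


def roll_up_daily(daily_rows):
    """daily_rows: (customer_id, date, amount) -> {(customer_id, week): total}."""
    totals = defaultdict(float)
    for customer_id, day, amount in daily_rows:
        totals[(customer_id, week_of(day))] += amount
    return dict(totals)


def normalize_weekly(weekly_rows, name_to_id):
    """weekly_rows: (customer_name, iso_year, iso_week, amount), keyed by name."""
    normalized, unmatched = {}, []
    for name, iso_year, iso_week, amount in weekly_rows:
        customer_id = name_to_id.get(name)
        if customer_id is None:
            unmatched.append(name)      # surface join failures instead of dropping them silently
            continue
        normalized[(customer_id, (iso_year, iso_week))] = amount
    return normalized, unmatched


def combine(daily_rows, weekly_rows, name_to_id):
    """Full outer join of both sources on (customer_id, week)."""
    daily = roll_up_daily(daily_rows)
    weekly, unmatched = normalize_weekly(weekly_rows, name_to_id)
    combined = {
        key: {"daily_total": daily.get(key), "weekly_total": weekly.get(key)}
        for key in sorted(set(daily) | set(weekly))
    }
    return combined, unmatched


daily_rows = [
    ("C001", date(2015, 11, 2), 10.0),
    ("C001", date(2015, 11, 4), 5.0),
    ("C002", date(2015, 11, 9), 7.0),
]
week = week_of(date(2015, 11, 2))
weekly_rows = [("Acme", week[0], week[1], 15.0), ("Initech", week[0], week[1], 3.0)]

combined, unmatched = combine(daily_rows, weekly_rows, NAME_TO_ID)
for key, totals in combined.items():
    print(key, totals)
print("unmatched names:", unmatched)
```

Keeping the unmatched names and the `None` totals visible is deliberate: disagreements between silos are exactly the semantic and temporal incongruence the slide warns about.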
Where is the Tofu ?
Blue Pill
o It is very simple to produce “reasonable” recommendations
o But extremely difficult to improve them to become “great”
o And there is a huge difference in business value between reasonable & great …
Red Pill-Data Curation
o The Antidote : The insights and the algorithms should be relevant and scalable …
o There is a huge gap between Model-Reason and Deploy …
o Statistical significance need not mean business significance
o Don't confuse the statistical significance of an experiment with the magnitude of the result, even though the word "significance" is often used for both – Peter Norvig
Ref: Xavier Amatriain when he talked about the Netflix Prize
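Norvig's warning is easy to demonstrate with numbers: with enough traffic, a tiny lift becomes statistically significant. The sketch below uses a standard two-proportion z-test; the conversion counts are invented for illustration and the function name is an assumption of this example.

```python
import math


def two_proportion_z_test(conversions_a, n_a, conversions_b, n_b):
    """Return (absolute lift, z score, two-sided p-value) for B versus A."""
    rate_a = conversions_a / n_a
    rate_b = conversions_b / n_b
    pooled = (conversions_a + conversions_b) / (n_a + n_b)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / standard_error
    # Two-sided p-value from the normal distribution: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return rate_b - rate_a, z, p_value


# Invented experiment: 5 million users per arm, 10.00% vs 10.05% conversion.
lift, z, p_value = two_proportion_z_test(500_000, 5_000_000, 502_500, 5_000_000)
print(f"lift = {lift:.4%}, z = {z:.2f}, p = {p_value:.4f}")
```

Here the p-value falls below 0.05 while the lift is five hundredths of a percentage point. Whether that is worth deploying is a business question that the test cannot answer.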
"Knowledge is a process of piling up facts;
wisdom lies in their simplification."
- Martin Fischer
Analytics - miscues
o Don’t Torture the Data
Down the rabbit hole art by frostyshadows
http://frostyshadows.deviantart.com/art/Down-the-Rabbit-Hole-358090601
Design Principles
1. Start with needs*
2. Do less
3. Design with data
4. Do the hard work to make it simple
5. Iterate. Then iterate again.
6. Build for inclusion
7. Understand context
8. Build digital services, not websites
9. Be consistent, not uniform
10. Make things open: it makes things better
https://www.gov.uk/design-principles
Data Alone is not enough
o Data alone is not enough
• Induction not deduction - Every learner should embody some knowledge or assumptions beyond the data it is given in order to generalize beyond it
o Machine Learning is not magic – one cannot get something from nothing
• In order to infer, one needs the knobs & the dials
• One also needs a rich expressive dataset
o Data Scientists are not Data Alchemists
• Don’t expect Analytic Gold from a pack of data lead
A few useful things to know about machine learning- by Pedro Domingos
http://dl.acm.org/citation.cfm?id=2347755
https://www.flickr.com/photos/bionerd/3123155390
More Data Beats a Cleverer Algorithm
o More Data Beats a Cleverer Algorithm
• Or conversely select algorithms that improve with data
• Don’t optimize prematurely without getting more data
o Learn many models, not Just One
• Ensembles ! – Change the hypothesis space
• Netflix prize
• E.g. Bagging, Boosting, Stacking
o Simplicity Does not necessarily imply Accuracy
o Representable Does not imply Learnable
• Just because a function can be represented does not mean it can be learned
o Correlation Does not imply Causation
o http://doubleclix.wordpress.com/2014/03/07/a-glimpse-of-google-nasa-peter-norvig/
o A few useful things to know about machine learning - by Pedro Domingos
§ http://dl.acm.org/citation.cfm?id=2347755
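The "learn many models, not just one" point can be illustrated with bagging, the simplest of the three ensemble styles listed. The sketch below is a toy under stated assumptions: one-dimensional data, decision stumps as the base model, and function names invented for this example. It is not how the Netflix Prize ensembles were built.

```python
import random
from collections import Counter

# Toy 1-D dataset: label is 1 when x > 5, otherwise 0.
POINTS = [(x, int(x > 5)) for x in range(11)]


def fit_stump(points):
    """Pick the threshold t that minimizes training errors for the rule 'x > t means 1'."""
    best_threshold, best_errors = None, None
    for threshold in sorted({x for x, _ in points}):
        errors = sum((x > threshold) != bool(label) for x, label in points)
        if best_errors is None or errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold


def bagged_stumps(points, n_models=25, seed=0):
    """Bagging: fit one stump per bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    return [
        fit_stump([rng.choice(points) for _ in points])
        for _ in range(n_models)
    ]


def predict(thresholds, x):
    """Majority vote across the ensemble (n_models is odd, so there are no ties)."""
    votes = Counter(int(x > threshold) for threshold in thresholds)
    return votes.most_common(1)[0][0]


models = bagged_stumps(POINTS)
print("thresholds learned:", sorted(set(models)))
print("prediction for x=1:", predict(models, 1))
print("prediction for x=9:", predict(models, 9))
```

Each bootstrap sample yields a slightly different threshold; voting averages out that variance. Boosting and stacking change the hypothesis space in different ways and are not shown here.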
In short …
o Build Full stack, iteratively building capabilities
o Identify the ‘Right’ Business Problems
o Create Valuable Data Perspectives
o Frame problems & bring analytics together with non-quantitative information to build compelling stories
o Embed Inference & Intelligence in products
https://www.linkedin.com/pulse/article/20141108013125-1290064-winning-at-analytics-takes-more-than-technology
http://www.kdnuggets.com/2014/09/hiring-data-scientist-what-to-look-for.html
Ogilvy & Mather Advertising: Morning view from the Ogilvy & Mather NY office, nicknamed the Chocolate Factory #TravelTuesday
Thank You

Weitere ähnliche Inhalte

Was ist angesagt?

Big Data is changing abruptly, and where it is likely heading
Big Data is changing abruptly, and where it is likely headingBig Data is changing abruptly, and where it is likely heading
Big Data is changing abruptly, and where it is likely headingPaco Nathan
 
Data Science in 2016: Moving Up
Data Science in 2016: Moving UpData Science in 2016: Moving Up
Data Science in 2016: Moving UpPaco Nathan
 
GalvanizeU Seattle: Eleven Almost-Truisms About Data
GalvanizeU Seattle: Eleven Almost-Truisms About DataGalvanizeU Seattle: Eleven Almost-Truisms About Data
GalvanizeU Seattle: Eleven Almost-Truisms About DataPaco Nathan
 
Big Graph Analytics on Neo4j with Apache Spark
Big Graph Analytics on Neo4j with Apache SparkBig Graph Analytics on Neo4j with Apache Spark
Big Graph Analytics on Neo4j with Apache SparkKenny Bastani
 
Microservices, containers, and machine learning
Microservices, containers, and machine learningMicroservices, containers, and machine learning
Microservices, containers, and machine learningPaco Nathan
 
Using the search engine as recommendation engine
Using the search engine as recommendation engineUsing the search engine as recommendation engine
Using the search engine as recommendation engineLars Marius Garshol
 
Big Data Analytics with Storm, Spark and GraphLab
Big Data Analytics with Storm, Spark and GraphLabBig Data Analytics with Storm, Spark and GraphLab
Big Data Analytics with Storm, Spark and GraphLabImpetus Technologies
 
Towards a rebirth of data science (by Data Fellas)
Towards a rebirth of data science (by Data Fellas)Towards a rebirth of data science (by Data Fellas)
Towards a rebirth of data science (by Data Fellas)Andy Petrella
 
Fishing Graphs in a Hadoop Data Lake by Jörg Schad and Max Neunhoeffer at Big...
Fishing Graphs in a Hadoop Data Lake by Jörg Schad and Max Neunhoeffer at Big...Fishing Graphs in a Hadoop Data Lake by Jörg Schad and Max Neunhoeffer at Big...
Fishing Graphs in a Hadoop Data Lake by Jörg Schad and Max Neunhoeffer at Big...Big Data Spain
 
Jupyter for Education: Beyond Gutenberg and Erasmus
Jupyter for Education: Beyond Gutenberg and ErasmusJupyter for Education: Beyond Gutenberg and Erasmus
Jupyter for Education: Beyond Gutenberg and ErasmusPaco Nathan
 
Python for Data Science with Anaconda
Python for Data Science with AnacondaPython for Data Science with Anaconda
Python for Data Science with AnacondaTravis Oliphant
 
Agile data science: Distributed, Interactive, Integrated, Semantic, Micro Ser...
Agile data science: Distributed, Interactive, Integrated, Semantic, Micro Ser...Agile data science: Distributed, Interactive, Integrated, Semantic, Micro Ser...
Agile data science: Distributed, Interactive, Integrated, Semantic, Micro Ser...Andy Petrella
 
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingTiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingPaco Nathan
 
Follow the money with graphs
Follow the money with graphsFollow the money with graphs
Follow the money with graphsStanka Dalekova
 
Use of standards and related issues in predictive analytics
Use of standards and related issues in predictive analyticsUse of standards and related issues in predictive analytics
Use of standards and related issues in predictive analyticsPaco Nathan
 
Apache Spark and the Emerging Technology Landscape for Big Data
Apache Spark and the Emerging Technology Landscape for Big DataApache Spark and the Emerging Technology Landscape for Big Data
Apache Spark and the Emerging Technology Landscape for Big DataPaco Nathan
 
Graph Analytics in Spark
Graph Analytics in SparkGraph Analytics in Spark
Graph Analytics in SparkPaco Nathan
 
Big Data Science with H2O in R
Big Data Science with H2O in RBig Data Science with H2O in R
Big Data Science with H2O in RAnqi Fu
 

Was ist angesagt? (20)

Big Data is changing abruptly, and where it is likely heading
Big Data is changing abruptly, and where it is likely headingBig Data is changing abruptly, and where it is likely heading
Big Data is changing abruptly, and where it is likely heading
 
Data Science in 2016: Moving Up
Data Science in 2016: Moving UpData Science in 2016: Moving Up
Data Science in 2016: Moving Up
 
GalvanizeU Seattle: Eleven Almost-Truisms About Data
GalvanizeU Seattle: Eleven Almost-Truisms About DataGalvanizeU Seattle: Eleven Almost-Truisms About Data
GalvanizeU Seattle: Eleven Almost-Truisms About Data
 
Big Graph Analytics on Neo4j with Apache Spark
Big Graph Analytics on Neo4j with Apache SparkBig Graph Analytics on Neo4j with Apache Spark
Big Graph Analytics on Neo4j with Apache Spark
 
Microservices, containers, and machine learning
Microservices, containers, and machine learningMicroservices, containers, and machine learning
Microservices, containers, and machine learning
 
Using the search engine as recommendation engine
Using the search engine as recommendation engineUsing the search engine as recommendation engine
Using the search engine as recommendation engine
 
Big Data Analytics with Storm, Spark and GraphLab
Big Data Analytics with Storm, Spark and GraphLabBig Data Analytics with Storm, Spark and GraphLab
Big Data Analytics with Storm, Spark and GraphLab
 
Towards a rebirth of data science (by Data Fellas)
Towards a rebirth of data science (by Data Fellas)Towards a rebirth of data science (by Data Fellas)
Towards a rebirth of data science (by Data Fellas)
 
Fishing Graphs in a Hadoop Data Lake by Jörg Schad and Max Neunhoeffer at Big...
Fishing Graphs in a Hadoop Data Lake by Jörg Schad and Max Neunhoeffer at Big...Fishing Graphs in a Hadoop Data Lake by Jörg Schad and Max Neunhoeffer at Big...
Fishing Graphs in a Hadoop Data Lake by Jörg Schad and Max Neunhoeffer at Big...
 
Jupyter for Education: Beyond Gutenberg and Erasmus
Jupyter for Education: Beyond Gutenberg and ErasmusJupyter for Education: Beyond Gutenberg and Erasmus
Jupyter for Education: Beyond Gutenberg and Erasmus
 
Python for Data Science with Anaconda
Python for Data Science with AnacondaPython for Data Science with Anaconda
Python for Data Science with Anaconda
 
Agile data science: Distributed, Interactive, Integrated, Semantic, Micro Ser...
Agile data science: Distributed, Interactive, Integrated, Semantic, Micro Ser...Agile data science: Distributed, Interactive, Integrated, Semantic, Micro Ser...
Agile data science: Distributed, Interactive, Integrated, Semantic, Micro Ser...
 
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingTiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
 
Follow the money with graphs
Follow the money with graphsFollow the money with graphs
Follow the money with graphs
 
Use of standards and related issues in predictive analytics
Use of standards and related issues in predictive analyticsUse of standards and related issues in predictive analytics
Use of standards and related issues in predictive analytics
 
Spark streaming
Spark streamingSpark streaming
Spark streaming
 
Apache Spark and the Emerging Technology Landscape for Big Data
Apache Spark and the Emerging Technology Landscape for Big DataApache Spark and the Emerging Technology Landscape for Big Data
Apache Spark and the Emerging Technology Landscape for Big Data
 
李育杰/The Growth of a Data Scientist
李育杰/The Growth of a Data Scientist李育杰/The Growth of a Data Scientist
李育杰/The Growth of a Data Scientist
 
Graph Analytics in Spark
Graph Analytics in SparkGraph Analytics in Spark
Graph Analytics in Spark
 
Big Data Science with H2O in R
Big Data Science with H2O in RBig Data Science with H2O in R
Big Data Science with H2O in R
 

Ähnlich wie Big Data Analytics - Best of the Worst : Anti-patterns & Antidotes

351315535-Module-1-Intro-to-Data-Science-pptx.pptx
351315535-Module-1-Intro-to-Data-Science-pptx.pptx351315535-Module-1-Intro-to-Data-Science-pptx.pptx
351315535-Module-1-Intro-to-Data-Science-pptx.pptxXanGwaps
 
introduction to data science
introduction to data scienceintroduction to data science
introduction to data sciencebhavesh lande
 
Getting to Know Your Data with R
Getting to Know Your Data with RGetting to Know Your Data with R
Getting to Know Your Data with RStephen Withington
 
Moving Targets: Harnessing Real-time Value from Data in Motion
Moving Targets: Harnessing Real-time Value from Data in Motion Moving Targets: Harnessing Real-time Value from Data in Motion
Moving Targets: Harnessing Real-time Value from Data in Motion Inside Analysis
 
Advanced Analytics and Machine Learning with Data Virtualization
Advanced Analytics and Machine Learning with Data VirtualizationAdvanced Analytics and Machine Learning with Data Virtualization
Advanced Analytics and Machine Learning with Data VirtualizationDenodo
 
How Can Analytics Improve Business?
How Can Analytics Improve Business?How Can Analytics Improve Business?
How Can Analytics Improve Business?Inside Analysis
 
Analytical Innovation: How to Build the Next Generation Data Platform
Analytical Innovation: How to Build the Next Generation Data PlatformAnalytical Innovation: How to Build the Next Generation Data Platform
Analytical Innovation: How to Build the Next Generation Data PlatformVMware Tanzu
 
INF2190_W1_2016_public
INF2190_W1_2016_publicINF2190_W1_2016_public
INF2190_W1_2016_publicAttila Barta
 
Big data4businessusers
Big data4businessusersBig data4businessusers
Big data4businessusersBob Hardaway
 
Advanced Analytics and Machine Learning with Data Virtualization (India)
Advanced Analytics and Machine Learning with Data Virtualization (India)Advanced Analytics and Machine Learning with Data Virtualization (India)
Advanced Analytics and Machine Learning with Data Virtualization (India)Denodo
 
Level Seven - Expedient Big Data presentation
Level Seven - Expedient Big Data presentationLevel Seven - Expedient Big Data presentation
Level Seven - Expedient Big Data presentationDoug Denton
 
Introduction to data mining and data warehousing
Introduction to data mining and data warehousingIntroduction to data mining and data warehousing
Introduction to data mining and data warehousingEr. Nawaraj Bhandari
 
Accelerate Digital Transformation with an Enterprise Big Data Fabric
Accelerate Digital Transformation with an Enterprise Big Data FabricAccelerate Digital Transformation with an Enterprise Big Data Fabric
Accelerate Digital Transformation with an Enterprise Big Data FabricCambridge Semantics
 
Big data unit 2
Big data unit 2Big data unit 2
Big data unit 2RojaT4
 
March Towards Big Data - Big Data Implementation, Migration, Ingestion, Manag...
March Towards Big Data - Big Data Implementation, Migration, Ingestion, Manag...March Towards Big Data - Big Data Implementation, Migration, Ingestion, Manag...
March Towards Big Data - Big Data Implementation, Migration, Ingestion, Manag...Experfy
 
How Data Virtualization Puts Enterprise Machine Learning Programs into Produc...
How Data Virtualization Puts Enterprise Machine Learning Programs into Produc...How Data Virtualization Puts Enterprise Machine Learning Programs into Produc...
How Data Virtualization Puts Enterprise Machine Learning Programs into Produc...Denodo
 

Ähnlich wie Big Data Analytics - Best of the Worst : Anti-patterns & Antidotes (20)

351315535-Module-1-Intro-to-Data-Science-pptx.pptx
351315535-Module-1-Intro-to-Data-Science-pptx.pptx351315535-Module-1-Intro-to-Data-Science-pptx.pptx
351315535-Module-1-Intro-to-Data-Science-pptx.pptx
 
introduction to data science
introduction to data scienceintroduction to data science
introduction to data science
 
The Power of Data
The Power of DataThe Power of Data
The Power of Data
 
Getting to Know Your Data with R
Getting to Know Your Data with RGetting to Know Your Data with R
Getting to Know Your Data with R
 
Moving Targets: Harnessing Real-time Value from Data in Motion
Moving Targets: Harnessing Real-time Value from Data in Motion Moving Targets: Harnessing Real-time Value from Data in Motion
Moving Targets: Harnessing Real-time Value from Data in Motion
 
Advanced Analytics and Machine Learning with Data Virtualization
Advanced Analytics and Machine Learning with Data VirtualizationAdvanced Analytics and Machine Learning with Data Virtualization
Advanced Analytics and Machine Learning with Data Virtualization
 
How Can Analytics Improve Business?
How Can Analytics Improve Business?How Can Analytics Improve Business?
How Can Analytics Improve Business?
 
Analytical Innovation: How to Build the Next Generation Data Platform
Analytical Innovation: How to Build the Next Generation Data PlatformAnalytical Innovation: How to Build the Next Generation Data Platform
Analytical Innovation: How to Build the Next Generation Data Platform
 
INF2190_W1_2016_public
INF2190_W1_2016_publicINF2190_W1_2016_public
INF2190_W1_2016_public
 
Big data4businessusers
Big data4businessusersBig data4businessusers
Big data4businessusers
 
Dw 07032018-dr pl pradhan
Dw 07032018-dr pl pradhanDw 07032018-dr pl pradhan
Dw 07032018-dr pl pradhan
 
Advanced Analytics and Machine Learning with Data Virtualization (India)
Advanced Analytics and Machine Learning with Data Virtualization (India)Advanced Analytics and Machine Learning with Data Virtualization (India)
Advanced Analytics and Machine Learning with Data Virtualization (India)
 
Level Seven - Expedient Big Data presentation
Level Seven - Expedient Big Data presentationLevel Seven - Expedient Big Data presentation
Level Seven - Expedient Big Data presentation
 
A6 big data_in_the_cloud
A6 big data_in_the_cloudA6 big data_in_the_cloud
A6 big data_in_the_cloud
 
Introduction to data mining and data warehousing
Introduction to data mining and data warehousingIntroduction to data mining and data warehousing
Introduction to data mining and data warehousing
 
Accelerate Digital Transformation with an Enterprise Big Data Fabric
Accelerate Digital Transformation with an Enterprise Big Data FabricAccelerate Digital Transformation with an Enterprise Big Data Fabric
Accelerate Digital Transformation with an Enterprise Big Data Fabric
 
Big data unit 2
Big data unit 2Big data unit 2
Big data unit 2
 
March Towards Big Data - Big Data Implementation, Migration, Ingestion, Manag...
March Towards Big Data - Big Data Implementation, Migration, Ingestion, Manag...March Towards Big Data - Big Data Implementation, Migration, Ingestion, Manag...
March Towards Big Data - Big Data Implementation, Migration, Ingestion, Manag...
 
Big data Analytics
Big data AnalyticsBig data Analytics
Big data Analytics
 
How Data Virtualization Puts Enterprise Machine Learning Programs into Produc...
How Data Virtualization Puts Enterprise Machine Learning Programs into Produc...How Data Virtualization Puts Enterprise Machine Learning Programs into Produc...
How Data Virtualization Puts Enterprise Machine Learning Programs into Produc...
 

Big Data Analytics - Best of the Worst : Anti-patterns & Antidotes

  • 1. Big Data Analytics - The Best of the Worst Krishna Sankar @ksankar https://www.linkedin.com/in/ksankar
  • 2. About Me
    o Data Scientist
      • Decision Data Science & Product Data Science [Data Science Folk Knowledge http://goo.gl/O4svPx]
      • Insights = Intelligence + Inference + Interface [https://goo.gl/s2KB6L]
      • Predicting NFL with Elo like Nate Silver & 538 [NFL : http://goo.gl/Q2OgeJ, NBA’15 : https://goo.gl/aUhdo3]
    o Have been speaking at OSCON [http://goo.gl/1MJLu], PyCon, PyData [http://vimeo.com/63270513, http://www.slideshare.net/ksankar/pydata-19] …
    o Have done lots of things:
      • Big Data (Retail, Bioinformatics, Financial, AdTech), Starting MS-CFRM, University of WA
      • Written Books (Web 2.0, Wireless, Java, …), Standards, some work in AI
      • Guest Lecturer at Naval PG School, …
    o Studying MS-CFRM (Computational Finance/Risk Management), UWA
    o Full-day Spark workshop “Advanced Data Science w/ Spark” / Spark Summit-E’15 [https://goo.gl/7SBKTC]
    o Co-author : “Fast Data Processing with Spark”, Packt Publishing [http://goo.gl/eNtXpT]
    o Reviewer : “Machine Learning with Spark”, Packt Publishing
    o Volunteer as Robotics Judge at FIRST LEGO League World Competitions
    o @ksankar, doubleclix.wordpress.com
  • 3. Background – Top 5 http://tcapp2.publishpath.com/rabbithole http://conservationmagazine.org/wordpress/wp-content/uploads/2013/05/context-matters.jpg
  • 4. 1) Data Science
    The art of building a model with known knowns, which, when let loose, works with unknown unknowns
    Donald Rumsfeld is an armchair Data Scientist ! http://smartorg.com/2013/07/valuepoint19/
    [Figure: matrix of "The World" (Knowns, Unknowns) against "You" (Unknown, Known)]
    o Others know, you don’t
      Model Evolution/DevOps to capture this
    o Capture in Models
    o Facts, outcomes or scenarios we have not encountered, nor considered
    o “Black swans”, outliers, long tails of probability distributions
    o Lack of experience, imagination
    o Potential facts, outcomes we are aware of, but not with certainty
    o Stochastic processes, Probabilities
    o Known Knowns
      • There are things we know that we know
    o Known Unknowns
      • That is to say, there are things that we now know we don't know
    o But there are also Unknown Unknowns
      • There are things we do not know we don't know
    Goal of Big Data is Analytics
  • 5. 2) The pipeline is the context
    Stages: Collect, Store, Transform, Reason, Model, Deploy, Explore, Visualize, Recommend, Predict
    Layers: Data Management, Data Science
    o Scalable Model Deployment
    o Big Data automation & purpose built appliances (soft/hard)
    o Manage SLAs & response times
    o Volume
    o Velocity
    o Streaming Data
    o Canonical form
    o Data catalog
    o Data Fabric across the organization
    o Access to multiple sources of data
    o Think Hybrid – Big Data Apps, Appliances & Infrastructure
    o Metadata
    o Monitor counters & Metrics
    o Structured vs. Multi-structured
    o Flexible & Selectable
      § Data Subsets
      § Attribute sets
    o Refine model with
      § Extended Data subsets
      § Engineered Attribute sets
    o Validation run across a larger data set
    o Dynamic Data Sets
    o 2 way key-value tagging of datasets
    o Extended attribute sets
    o Advanced Analytics
    o Performance
    o Scalability
    o Refresh Latency
    o In-memory Analytics
    o Advanced Visualization
    o Interactive Dashboards
    o Map Overlay
    o Infographics
    ¤ Bytes to Business a.k.a. Build the full stack
    ¤ Find Relevant Data For Business
    ¤ Connect the Dots
  • 6. 3) Mind Your “I”s, “C”s & “V”s
    “V”s: Volume, Velocity, Variety
    “C”s: Context, Connectedness
    “I”s: Intelligence, Interface, Inference
    o Three Amigos
    o Interface = Cognition
    o Intelligence = Compute (CPU) & Computational (GPU)
    o Infer Significance & Causality
    CURATED SIGNALS > APPLIED INTELLIGENCE > STRATIFIED INFERENCE
  • 7. 4) Model Evolution & Concept Drift
    Axes: Complexity, Value
    o Dynamic dashboards
    o Multi-dimensional pivots w/ customization
    o Selectable algorithms on data subsets
    o “Cluster Customer for 5 thanksgiving seasons”
    o Learning Models
    o Automatic Feature Selection & hyperparameter optimizations as it gets more data
    o Dynamic Models – Model Selection based on context
    o Automated Analytics - Let Data tell story
    o Feature Learning, AI, Deep Learning
    o Concept Drift: Validate Model assumptions + hyperparameters + features in the current context – after they are in production
    Ref: Prof. Josh Bloom, Keynote: A Systems View of Machine Learning, #pydata Seattle’15
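The concept-drift point on this slide (re-validating a model after it is in production) can be sketched as a rolling accuracy monitor. This is a minimal illustration, not the speaker's method: the class name `DriftMonitor`, the window size and the tolerance are all hypothetical choices.

```python
from collections import deque


class DriftMonitor:
    """Flag possible concept drift when rolling production accuracy
    falls well below the accuracy measured at validation time."""

    def __init__(self, baseline_accuracy, window=100, tolerance=0.10):
        self.baseline = baseline_accuracy      # accuracy on the validation set
        self.tolerance = tolerance             # allowed drop before alerting
        self.outcomes = deque(maxlen=window)   # 1 = correct, 0 = wrong

    def record(self, predicted, actual):
        self.outcomes.append(1 if predicted == actual else 0)

    def rolling_accuracy(self):
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def drifted(self):
        # Only judge once the window is full, to avoid noisy early alerts.
        if len(self.outcomes) < self.outcomes.maxlen:
            return False
        return self.rolling_accuracy() < self.baseline - self.tolerance


monitor = DriftMonitor(baseline_accuracy=0.90, window=50, tolerance=0.10)
for _ in range(50):            # the model still matches reality
    monitor.record(1, 1)
stable = monitor.drifted()
for _ in range(50):            # behaviour changed: predictions are now wrong
    monitor.record(1, 0)
after_shift = monitor.drifted()
print(stable, after_shift)
```

In practice the labels often arrive late, so a monitor like this would probably be paired with checks on the input feature distributions as well.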
  • 8. 5) The Sense & Sensibility of a DataScientist DevOps
    o Analytics in the lab = Investigative
      • Interactive, Iterative, Explorative
      • Output is usually decision data science
    o Analytics in the factory = Operational
      • Automated, systemic, transparent & explainable
      • Output is embedded intelligence
      • Embedded in customer facing decision systems
    There is a chasm between Model/Reason and Deploy
    Josh Wills - From the labs to the factory, https://doubleclix.wordpress.com/2013/11/17/of-building-data-products/ http://doubleclix.wordpress.com/2014/05/11/the-sense-sensibility-of-a-data-scientist-devops/
  • 9. 6) Data is your product, regardless of what you sell
    o Data is the lens through which you see the business and feel the pulse
    o Collect the right data through “Thoughtful Data Design”
    o Give Data Back in a Powerful Way
    o But don’t confuse or overwhelm the users
      • The users have to feel safe
      • The users have to feel they are in control
    o Never try to launch a complicated data product on a fixed schedule
    o Offer progressively sophisticated products, leveraging the data & insights, across the different user population segments
      • Customer segmentation & stratification is not just for retail !
    Josh Wills - From the labs to the factory, https://doubleclix.wordpress.com/2013/11/17/of-building-data-products/ http://doubleclix.wordpress.com/2014/05/11/the-sense-sensibility-of-a-data-scientist-devops/
  • 10. “There are no routine statistical questions, only questionable statistical routines” -- David Cox
    Ref: Gabriele Corno, Natural History Museum in #London, by George Thalassinos
    Big Data Analytics - The Best of the Worst
  • 11. Data Swamp
    Blue Pill
    o Typical case of “ungoverned data stores addressing a limited data science audience”
    o The company has proudly crossed the chasm to the big data world with a shiny new Hadoop infrastructure.
    o Now everyone starts putting their data into this “lake”.
    o After a few months, the disks are full; Hadoop is replicating 3 copies; some bytes are even falling off the wires onto the floor – but no one has any clue what data is in there, or about its consistency and semantic coherence
    Red Pill - Data Curation
    o Data Curation
      • A consistent published schema
    o Data Quality & Data Lineage, “descriptive metadata and an underlying mechanism to maintain it”, are all part of the data curation layer …
    o Semantic consistency across diverse multi-structured multi-temporal transactions requires a level of data curation & discipline
    o Design for the right “Data Gravity” & “Data Mass” as Van Lindberg mentioned, yesterday, in his keynote
      • Not Data Molasses !
  • 12. Data Swamp
    Blue Pill
    o Typical case of “ungoverned data stores addressing a limited data science audience”
    o The company has proudly crossed the chasm to the big data world with a shiny new Hadoop infrastructure.
    o Now everyone starts putting their data into this “lake”.
    o After a few months, the disks are full; Hadoop is replicating 3 copies; some bytes are even falling off the wires onto the floor – but no one has any clue what data is in there, or about its consistency and semantic coherence
    Red Pill - Data Curation
    o Data Curation
      • A consistent published schema
    o Data quality & data lineage, “descriptive metadata and an underlying mechanism to maintain it”, are all part of the data curation layer …
    o Semantic consistency across diverse multi-structured multi-temporal transactions requires a level of data curation and discipline
    https://www.linkedin.com/pulse/data-lakes-udls-vs-analytics-platforms-gargi-adhav
  • 13. Big Data To Nowhere
    Blue Pill
    o IT sees an opportunity and starts building the infrastructure, sometimes massive, and puts petabytes of data in the Big Data Hub or lake or pool or … But there are no relevant business-facing apps.
    o A conversation goes like this …
      • Business : I heard that we have a big data infrastructure, cool. When can I show a demo to our customers ?
      • IT : We have petabytes of data and I can show the Hadoop admin console. We even have the Spark UI !
      • Business : … (unprintable)
    Red Pill - Full Stack MVP (see next slide)
    o Build the full stack, i.e. bits to business …
    o Build incremental Decision Data Science & Product Data Science layers, as appropriate …
    o The following conversation is a lot better …
      • Business : I heard that we have a big data infrastructure, cool. When can I show a demo to our customers ?
      • IT : Actually we don’t have all the data. But from the transaction logs and customer data, we can infer that males between 34 and 36 buy a lot of stuff from us between 11:00 PM & 2:00 AM !
      • Business : That is interesting … Show me a graph. BTW, do you know what the revenue and the profit margin from these buys are ?
      • IT : Graph is no problem. We have a shiny app with the dynamic model over the web logs.
      • IT : With the data we have, we only know that they comprise ~30% of our volume by transaction. But we do not have the order data in our Hadoop yet. We can … let me send out a budget request …
  • 14. [Architecture diagram]
    Stages: Collect, Store, Transform, Report, Visualize, Recommend, Predict, Reason, Model, Explore
    Components: ML Engine (NumPy, SciPy, Pandas, Spark, Azure ML, MPP/Impala), R/Python, Compositional Analysis, Data Hub, Curated Data, Storage : HDFS, Parquet, Compute : Hadoop MR, Spark, Landing Zone, Dashboards, APIs, Reporting Hub, Analytics Hub, ETL, In-Memory Hub, Real-Time, Kafka …
    Table columns: Reporting Hub, Analytics Hub, Hadoop MR
    o Long-Running Complex Jobs - Yearly pivots, Multi-dimensional Exact Uniques ✔ ✔
    o Real-time ad-hoc pivots, Approx Uniques (HLL) ✔
    o Fast Response with Aggregated data Subsets ✔
  • 15. Build The E2E Analytics MVP Stack
    Stages: Collect, Store, Transform, Report, Visualize, Recommend, Predict, Reason, Model, Explore
    Components: ML Engine (NumPy, SciPy, Pandas, Spark, Azure ML, MPP/Impala), R/Python, Compositional Analysis, Data Hub, Curated Data, Storage : HDFS, Parquet, Compute : Hadoop MR, Spark, Landing Zone, Dashboards, APIs, Reporting Hub, Analytics Hub, ETL, In-Memory Hub, Real-Time, Kafka …
    Table columns: Reporting Hub, Analytics Hub, Hadoop MR
    o Long-Running Complex Jobs - Yearly pivots, Multi-dimensional Exact Uniques ✔ ✔
    o Real-time ad-hoc pivots, Approx Uniques (HLL) ✔
    o Fast Response with Aggregated data Subsets ✔
    https://www.linkedin.com/pulse/why-how-make-mvp-analytics-ruoyu-bao
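The table contrasts exact uniques with approximate uniques (HLL). As a rough illustration of what an approximate distinct count does, here is a minimal HyperLogLog sketch in pure Python. It is a simplified teaching version (SHA-1 as the hash, small-range correction only), and the function name, parameter names and sample data are assumptions for illustration, not anything from the deck.

```python
import hashlib
import math


def hll_estimate(items, p=12):
    """Estimate the number of distinct items using 2**p registers."""
    m = 1 << p
    registers = [0] * m
    for item in items:
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")     # 64-bit hash value
        index = h >> (64 - p)                     # first p bits pick a register
        rest = h & ((1 << (64 - p)) - 1)          # remaining 64 - p bits
        # Rank = position of the leftmost 1-bit in the remaining bits.
        rank = (64 - p) - rest.bit_length() + 1
        registers[index] = max(registers[index], rank)

    alpha = 0.7213 / (1 + 1.079 / m)              # bias constant, valid for m >= 128
    estimate = alpha * m * m / sum(2.0 ** -r for r in registers)

    # Small-range correction: fall back to linear counting.
    zeros = registers.count(0)
    if estimate <= 2.5 * m and zeros:
        estimate = m * math.log(m / zeros)
    return estimate


# Hypothetical workload: 200,000 events from only 50,000 distinct user ids.
events = [f"user-{i % 50_000}" for i in range(200_000)]
exact = len(set(events))
approx = hll_estimate(events)
print(exact, round(approx))
```

With 4,096 registers the expected relative error is roughly 1.04 / sqrt(4096), about 1.6%, in exchange for a fixed, small memory footprint. That trade-off is what makes such sketches attractive for real-time ad-hoc pivots.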
  • 16. A Data Too Far
    Blue Pill
    o You might get a few .gz files, a few .csv files and of course, parquet files, in multiple systems
    o Some will have IDs, some names, some aggregated by week, some aggregated by day and others pure transactional.
    o The challenge is that we have the data, but there is no easy way to combine them for interesting inferences …
    Red Pill - Data Curation
    o “..The most creative things that happen with data are less about sophisticated algorithms and vast computation (though those are nice) than it is about putting together different pieces of data that were previously locked up in different silos.”
    o Data Pipelines (e.g. Kafka) with in-line processing to ensure correctness, semantic and temporal congruence & integrity
    Ref: Jay Kreps, Announcing Confluent
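One common way to reconcile sources held at different grains, as on this slide (some keyed by ID, some by name, some daily, some weekly), is to roll everything up to the coarsest shared grain and a shared key before joining. The sketch below uses made-up records; the field names, the ISO-week key and the id-to-name lookup are illustrative assumptions.

```python
from collections import defaultdict
from datetime import date

# Hypothetical inputs: daily sales keyed by customer id,
# weekly web visits keyed by customer name, and an id -> name lookup.
daily_sales = [
    {"customer_id": 1, "day": date(2015, 11, 2), "amount": 20.0},
    {"customer_id": 1, "day": date(2015, 11, 4), "amount": 5.0},
    {"customer_id": 2, "day": date(2015, 11, 10), "amount": 7.5},
]
weekly_visits = [
    {"customer": "alice", "iso_week": (2015, 45), "visits": 3},
    {"customer": "bob", "iso_week": (2015, 46), "visits": 1},
]
id_to_name = {1: "alice", 2: "bob"}


def iso_week(d):
    """Return (ISO year, ISO week number) for a date."""
    year, week, _ = d.isocalendar()
    return (year, week)


# Roll the daily data up to (customer name, ISO week).
sales_by_week = defaultdict(float)
for row in daily_sales:
    key = (id_to_name[row["customer_id"]], iso_week(row["day"]))
    sales_by_week[key] += row["amount"]

# Join the two sources on the shared key.
combined = {
    (v["customer"], v["iso_week"]): {
        "visits": v["visits"],
        "amount": sales_by_week.get((v["customer"], v["iso_week"]), 0.0),
    }
    for v in weekly_visits
}
print(combined)
```

Doing this once, in a curated pipeline, is cheaper than having every analyst rediscover the key mapping and the week boundaries on their own.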
  • 17. Where is the Tofu ?
    Blue Pill
    o It is very simple to produce “reasonable” recommendations
    o But extremely difficult to improve them to become “great”
    o And, there is a huge difference in business value between reasonable Data Set & great …
    Red Pill - Data Curation
    o The Antidote : The insights and the algorithms should be relevant and scalable …
    o There is a huge gap between Model-Reason and Deploy …
    o Statistical Significance need not mean business significance
    o Don't confuse the statistical significance of an experiment with the magnitude of the result, even though the word "significance" is often used for both – Peter Norvig
    Ref: Xavier Amatriain, when he talked about the Netflix Prize
    "Knowledge is a process of piling up facts; wisdom lies in their simplification." - Martin Fischer
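The distinction between statistical significance and the magnitude of a result can be made concrete with a two-proportion z-test on a very large sample. This is a minimal sketch with invented numbers; the function name and the A/B figures are assumptions for illustration.

```python
import math


def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Return (absolute lift, two-sided p-value) for two conversion rates."""
    p_a = success_a / n_a
    p_b = success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return p_b - p_a, p_value


# Hypothetical A/B test: 10.0% vs 10.1% conversion, 5 million users per arm.
lift, p_value = two_proportion_z_test(500_000, 5_000_000, 505_000, 5_000_000)
print(f"lift = {lift:.4f}, p-value = {p_value:.2e}")
```

Here the p-value is tiny, so the difference is statistically significant, yet the lift is only a tenth of a percentage point. Whether that matters to the business depends on margins and costs, which the test says nothing about.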
  • 18. Analytics - miscues
    o Don’t Torture the Data
  • 19. Design Principles
    1. Start with needs*
    2. Do less
    3. Design with data
    4. Do the hard work to make it simple
    5. Iterate. Then iterate again.
    6. Build for inclusion
    7. Understand context
    8. Build digital services, not websites
    9. Be consistent, not uniform
    10. Make things open: it makes things better
    https://www.gov.uk/design-principles
    Down the rabbit hole art by frostyshadows http://frostyshadows.deviantart.com/art/Down-the-Rabbit-Hole-358090601
  • 20. Data Alone is not enough
    o Data alone is not enough
      • Induction not deduction - Every learner should embody some knowledge or assumptions beyond the data it is given in order to generalize beyond it
    o Machine Learning is not magic – one cannot get something from nothing
      • In order to infer, one needs the knobs & the dials
      • One also needs a rich expressive dataset
    o Data Scientists are not Data Alchemists
      • Don’t expect Analytic Gold from a pack of data lead
    A few useful things to know about machine learning - by Pedro Domingos http://dl.acm.org/citation.cfm?id=2347755
    https://www.flickr.com/photos/bionerd/3123155390
  • 21. More Data Beats a Cleverer Algorithm
    o More Data Beats a Cleverer Algorithm
      • Or conversely select algorithms that improve with data
      • Don’t optimize prematurely without getting more data
    o Learn many models, not Just One
      • Ensembles ! – Change the hypothesis space
      • Netflix prize
      • E.g. Bagging, Boosting, Stacking
    o Simplicity Does not necessarily imply Accuracy
    o Representable Does not imply Learnable
      • Just because a function can be represented does not mean it can be learned
    o Correlation Does not imply Causation
    o http://doubleclix.wordpress.com/2014/03/07/a-glimpse-of-google-nasa-peter-norvig/
    o A few useful things to know about machine learning - by Pedro Domingos
      § http://dl.acm.org/citation.cfm?id=2347755
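Of the ensemble techniques named on this slide, bagging is the simplest to show: train many weak models on bootstrap samples and take a majority vote. The toy sketch below bags one-dimensional decision stumps; the function names and the data are hypothetical, and a real project would more likely use a library implementation.

```python
import random
from collections import Counter


def fit_stump(sample):
    """Pick the threshold t (from the sample's x values) that minimises
    errors for the rule: predict 1 if x >= t else 0."""
    best_t, best_errors = None, None
    for t in sorted({x for x, _ in sample}):
        errors = sum((1 if x >= t else 0) != y for x, y in sample)
        if best_errors is None or errors < best_errors:
            best_t, best_errors = t, errors
    return best_t


def bagged_stumps(data, n_models=25, seed=0):
    """Train each stump on a bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    return [fit_stump([rng.choice(data) for _ in data]) for _ in range(n_models)]


def predict(thresholds, x):
    """Majority vote across all stumps."""
    votes = Counter(1 if x >= t else 0 for t in thresholds)
    return votes.most_common(1)[0][0]


# Toy data: the label is 1 when x >= 10.
data = [(x, 1 if x >= 10 else 0) for x in range(20)]
model = bagged_stumps(data)
print(predict(model, 0), predict(model, 19))
```

Each stump sees a slightly different sample and so learns a slightly different threshold. The vote averages out those differences, which is the sense in which an ensemble changes the hypothesis space.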
  • 22. In short …
    o Build Full stack, iteratively building capabilities
    o Identify the ‘Right’ Business Problems
    o Create Valuable Data Perspectives
    o Frame problems & bring analytics together with non-quantitative information to build compelling stories
    o Embed Inference & Intelligence in products
    https://www.linkedin.com/pulse/article/20141108013125-1290064-winning-at-analytics-takes-more-than-technology
    http://www.kdnuggets.com/2014/09/hiring-data-scientist-what-to-look-for.html
  • 23. Thank You
    Ogilvy & Mather Advertising: Morning view from the Ogilvy & Mather NY office, nicknamed the Chocolate Factory #TravelTuesday