Advanced Analytics in Hadoop
Thomas W. Dinsmore
1
Advanced Analytics in Hadoop
• Use cases
• Architectures
• Current Options:
• Open Source
• Commercial
2
Analytics
3
Ad Hoc Queries
Reports
Data Access
Visualization
Data Manipulation
OLAP/ROLAP etc
Advanced Discovery
Predictive Analytics
Optimization
Simulation
Text Analytics
Geospatial Analytics
Econometrics
Dashboards
Scorecards
Streaming Analytics
Computational Complexity
Advanced Analytics
4
Advanced Discovery
Predictive Analytics
Optimization
Simulation
Text Analytics
Geospatial Analytics
Econometrics
Streaming Analytics
Computational Complexity
Advanced Analytics
5
Advanced Discovery
Predictive Analytics
Optimization
Simulation
Text Analytics
Geospatial Analytics
Econometrics
Streaming Analytics
Feature Extraction
Dimension Reduction
6
7
Analytics Platform
For some use cases, you must use all of the data.
8
• Anomaly Detection
• Affinity Analysis
• Clustering
• Social Network Analysis
• Collaborative Filtering
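To illustrate why a use case such as anomaly detection calls for all of the data, here is a minimal sketch. The data, the threshold, and the function names are hypothetical and do not come from the slides. It plants five rare outliers among 100,000 ordinary records and compares a full scan with a 1% random sample.

```python
import random


def find_anomalies(values, threshold):
    """Return the values above a fixed threshold (a stand-in for a real detector)."""
    return [v for v in values if v > threshold]


# Hypothetical data: 100,000 ordinary readings plus 5 rare extreme ones.
rng = random.Random(42)
records = [rng.gauss(100.0, 10.0) for _ in range(100_000)]
anomalies = [1_000.0, 1_100.0, 1_200.0, 1_300.0, 1_400.0]
records.extend(anomalies)
rng.shuffle(records)

THRESHOLD = 500.0  # assumed cut-off, far above any ordinary reading

# Scanning every record finds every planted anomaly.
found_in_full = find_anomalies(records, THRESHOLD)

# A 1% simple random sample, as one might draw to fit the data on one machine.
sample = rng.sample(records, k=len(records) // 100)
found_in_sample = find_anomalies(sample, THRESHOLD)

print(f"full scan found {len(found_in_full)} anomalies")
print(f"1% sample found {len(found_in_sample)} anomalies")
```

Each rare record has only about a 1% chance of landing in the sample, so a sampled analysis will usually miss most or all of them.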
For others, using all of the data is worth it.
9
Catastrophic Risk Modeling
Modeling with Fine-grained
Behavioral Data
10
Your Options (2013)
1. Apache Mahout
2. Code it yourself.
3. …
Architecture
11
Legacy Alongside
12
[Diagram: six HDFS nodes; label: Data]
Legacy Pass-Through
13
[Diagram: six HDFS nodes; labels: MapReduce, Data]
MapReduce Push-Down
14
[Diagram: MapReduce running over six HDFS nodes]
Advantages
• Co-exists with other applications
• Integrated workload management
• Simplified administration
Disadvantages
• MapReduce latency
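The push-down idea can be sketched in plain Python. This is an illustration only: it does not use the Hadoop API, and the function names and data are assumptions. The map step turns each data block into a small tuple of partial sums where the block lives. The reduce step adds those tuples, so only the summaries, never the raw rows, leave the nodes. Here the summaries are the sufficient statistics for a simple least-squares line.

```python
from functools import reduce


def map_partition(rows):
    """Map step: collapse one block of (x, y) rows into a tuple of partial sums."""
    n = sx = sy = sxx = sxy = 0.0
    for x, y in rows:
        n += 1
        sx += x
        sy += y
        sxx += x * x
        sxy += x * y
    return (n, sx, sy, sxx, sxy)


def combine(a, b):
    """Reduce step: add two tuples of partial sums element-wise."""
    return tuple(u + v for u, v in zip(a, b))


def fit_line(partitions):
    """Fit y = slope * x + intercept from the combined per-block statistics."""
    n, sx, sy, sxx, sxy = reduce(combine, map(map_partition, partitions))
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept


# Hypothetical data: points on y = 2x + 1, split across three "blocks".
partitions = [
    [(0, 1), (1, 3), (2, 5)],
    [(3, 7), (4, 9), (5, 11), (6, 13)],
    [(7, 15), (8, 17), (9, 19)],
]
slope, intercept = fit_line(partitions)
print(slope, intercept)
```

A single pass like this suits MapReduce well. Algorithms that need many passes over the data pay the job start-up and I/O latency on every pass.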
Co-Located In-Memory (Asymmetric)
15
[Diagram: YARN-managed cluster; each node runs HDFS and MapReduce]
Advantages
• Easy to adapt legacy apps
• Isolates analytic workload
Disadvantages
• Data moves within the cluster
• Requires YARN
Co-Located In-Memory (Symmetric)
16
[Diagram: YARN-managed cluster; each node runs HDFS and MapReduce]
Advantages
• Lowest latency
Disadvantages
• Upgrade every node
• Requires YARN
Summary: Architecture
• MapReduce Push-Down is current “champion”
• Stable
• Co-exists well with Hadoop ecosystem
• MR 1.0 penalizes performance
• Required: persistent in-memory processing
• YARN enables co-location
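The case for persistent in-memory processing shows up most clearly with iterative algorithms. The sketch below is a toy model, not Hadoop or Spark code: `BlockStore` and the fitting functions are hypothetical names, and the "model" is just a gradient-descent estimate of a mean. It counts how often the dataset is read from storage when every iteration is a separate job, and when the data stays resident in memory.

```python
class BlockStore:
    """Stand-in for HDFS that counts how many times the dataset is read."""

    def __init__(self, data):
        self._data = list(data)
        self.reads = 0

    def read(self):
        self.reads += 1
        return list(self._data)


def gradient_step(data, w, lr):
    """One gradient-descent step on mean((w - x)^2), whose optimum is the mean."""
    grad = sum(2.0 * (w - x) for x in data) / len(data)
    return w - lr * grad


def fit_rereading(store, iterations, lr=0.25):
    """MapReduce style: each iteration is a separate job that re-reads its input."""
    w = 0.0
    for _ in range(iterations):
        w = gradient_step(store.read(), w, lr)
    return w


def fit_cached(store, iterations, lr=0.25):
    """In-memory style: read once, then keep the data resident across iterations."""
    data = store.read()
    w = 0.0
    for _ in range(iterations):
        w = gradient_step(data, w, lr)
    return w


values = [1.0, 2.0, 3.0, 4.0]
store_a, store_b = BlockStore(values), BlockStore(values)
w_a = fit_rereading(store_a, 20)
w_b = fit_cached(store_b, 20)
print(store_a.reads, store_b.reads)  # 20 1
```

Both variants reach the same estimate. The difference is twenty storage reads versus one, which is the overhead an in-memory engine avoids.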
17
Open Source Projects
18
Apache Mahout
• Apache incubator project (2007)
• Machine learning library
• Included in most distributions
• Thin acceptance, few contributors
• Diverse architecture
• Single-node
• MapReduce
• New algos run on Spark
• Recently cleaned up
19
Apache Giraph
• Apache top-level project
• Runs in MapReduce
• Dedicated graph engine
• Used by Facebook, few others
• Dead in the water
• No presence in leading distros
• No significant commercial support
• No releases in 13 months
• No recent code commits on Git
20
GraphLab
• Carnegie Mellon project (2009)
• Distributed in-memory engine:
• Primarily graph analysis
• Selected machine learning algos
• Interface from Java, JavaScript, Python
• GraphLab Inc provides commercial support (2013, $6.75MM)
• Independent distribution, or through Pivotal
21
0xdata H2O
• Vendor-driven open source project
• 0xdata sells support, customization
• Distributed in-memory prediction engine
• Multiple deployment options:
• Standalone (with HDFS)
• Over YARN
• In MapReduce
• Claims 2,000+ users
• 4 public references
• Used by a leading P&C insurer
• Java, R, Python and Scala interfaces
22
Apache Spark
• Top-level Apache project (February 2014)
• Release 1.0 (May 2014)
• Distributed in-memory analytics
• Machine learning
• Graph analytics
• Streaming analytics
• Fast SQL
• Compatible with Hadoop storage
• Integrated with YARN
• Scala, Python, Java interfaces (+SparkR)
• Growing ecosystem
• Supported in leading Hadoop distributions
23
Apache Spark: Hadoop Distributions
24
Spark Components
MLLIB GraphX Spark Streaming Spark SQL Shark
Cloudera Yes Yes Yes Yes (Impala)
Hortonworks Yes (Storm) (Stinger)
MapR Yes Yes Yes Yes Yes
Pivotal Yes Yes Yes Yes Yes
IBM BigInsights
Summary: Open Source Projects
25
Project | 0xdata H2O 2.2 | Apache Giraph 1.1 | Apache Mahout 0.9 | Apache Spark 1.0 | GraphLab 2.2
Status | Independent | Top-Level | Top-Level | Top-Level | Independent
Architecture | Co-Located Memory-Centric | MapReduce | MapReduce | Co-Located Memory-Centric | Co-Located Memory-Centric
Interfaces | Java, Python, R, Scala | Java | Java | Java, Python, Scala (SparkR) | Python
Commercial Support | 0xdata | - | - | Databricks | GraphLab, Inc.
Distribution | Independent | Independent | All Hadoop Distributions | Cloudera, Hortonworks, MapR, Pivotal | Independent
Analytic Features
26
0xdata H2O 2.2 | Apache Giraph 1.1 | Apache Mahout 0.9 | Apache Spark 1.0 | GraphLab 2.2
Prediction +++ + +++
Dimension Reduction + +++ + +
Clustering + +++ + +++
Collaborative Filtering +++ + +++
Text Analytics +++ +++
Matrix Operations + +++ +
Graph Analysis + + +++
Analytic Features: Prediction
27
Mahout 0.9 Spark 1.0 H2O 2.2
Linear Regression +
Logistic Regression +
Generalized Linear Models +
Naive Bayes + + +
Decision Tree +
Gradient Boosted Trees +
Random Forests + +
Linear Support Vector Machine +
Deep Learning (Backprop MLP) +
Analytic Features: Dimension Reduction
28
Mahout 0.9 Spark 1.0 H2O 2.2 GraphLab 2.2
Singular Value Decomposition + +
Lanczos Algorithm + +
Stochastic SVD +
Principal Components Analysis + + +
Analytic Features: Clustering
29
Mahout 0.9 Spark 1.0 H2O 2.2 GraphLab 2.2
k-Means + + + +
Fuzzy k-Means +
Streaming k-Means +
Spectral Clustering + +
Analytic Features: Collaborative Filtering
30
Mahout 0.9 Spark 1.0 H2O 2.2 GraphLab 2.2
Item-Based +
Matrix Factorization with ALS + + +
Matrix Factorization with ALS, Implicit Feedback +
ALS with Parallel Coordinate Descent +
Weighted ALS +
Sparse ALS +
Analytic Features: Text Analytics
31
Mahout 0.9 Spark 1.0 H2O 2.2 GraphLab 2.2
Latent Dirichlet Allocation + +
Frequent Pattern Mining +
Collocations +
Matrix Operations
32
Mahout 0.9 Spark 1.0 H2O 2.2 GraphLab 2.2
Stochastic Gradient Descent + +
Limited-Memory BFGS +
RowSimilarityJob +
ConcatMatrices +
Summary: Open Source
• Giraph is toast
• Mahout may be recovering from roadkill status
• GraphLab outperforms Spark GraphX today in graph analytics
• 0xdata H2O outperforms Spark MLlib today in machine learning
• Spark catching up fast
• More resources and distribution
• Integrated platform for ML and graph analysis
33
Commercial Software
34
Alpine
• Business user interface
• Collaboration environment
• Broad library of techniques
• Strong cloud offering
• Leverages Hadoop (multiple distros), HAWQ or Pivotal Greenplum
• Push-down MapReduce
• Certified on Spark
• Small but growing customer base
35
IBM SPSS Analytics Server
• Introduced 2013
• Serves as “back end” for SPSS Modeler
• Uses push-down MR
• Limited analytic feature set
• IBM supports on multiple Hadoop distros
• Customer acceptance unknown
36
Revolution Analytics ScaleR
• ScaleR library of distributed statistics, machine learning functions
• Tools to distribute arbitrary R functions
• Runs in Cloudera, Hortonworks, Teradata, LSF clusters, MS HPC
• Hadoop edition uses MR push-down
• Tools simplify installation in large clusters
• R interface
• Partnerships with Alteryx, Qlik, MicroStrategy, Tableau provide business interfaces
37
Skytree Server
• Georgia Tech’s FastLab project, repurposed as commercial software
• Distributed machine learning platform
• Very opaque about technical details
• User interface is an API
• Co-located in Hadoop under YARN
• Just certified by Hortonworks
• Customer acceptance unknown
• No new public references in a year
• Used by leading credit card company
38
SAS High-Performance Analytics
• Distributed in-memory analytics
• Designed to run in special-purpose appliances (2011)
• Repurposed to run in Hadoop (2013)
• Co-exists poorly: cannot run SAS and MapReduce at the same time
• Reads entire dataset into memory
• Uses MPI to communicate among nodes
• Requires upgrades from standard Hadoop infrastructure
• Customer acceptance unknown
• No public references
• Generic success stories missing from Strata presos
39
SAS LASR Server
• SAS’ “other” distributed in-memory platform
• Back end for several end-user products
• SAS Visual Analytics (2012)
• SAS Visual Statistics (New)
• SAS In-Memory Statistics for Hadoop (New)
• Recently added statistics and machine learning
• Does not read raw HDFS; data must be transformed to proprietary SASHDAT
• Like HPA, reads entire dataset into memory
• 16-core, 256 GB node can load 75 GB table
• Runs DS2 programs, not legacy SAS programs
• Fast, but with limited feature set
• SAS claims 1,400 “sites” for Visual Analytics
• Many of those are standalone boxes
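The sizing figure quoted for LASR (a 256 GB node loading a 75 GB table) lends itself to a quick back-of-the-envelope calculation. The sketch below assumes that capacity scales linearly with node count. That is an assumption for illustration, not a vendor claim, and the function names are hypothetical.

```python
import math

NODE_RAM_GB = 256        # node memory quoted on the slide
LOADABLE_TABLE_GB = 75   # table size that such a node can load, per the slide


def usable_fraction():
    """Share of node RAM that ends up holding table data, per the slide's figures."""
    return LOADABLE_TABLE_GB / NODE_RAM_GB


def nodes_needed(table_gb):
    """Nodes of this size needed to hold a table in memory, assuming linear scaling."""
    return math.ceil(table_gb / LOADABLE_TABLE_GB)


print(f"{usable_fraction():.0%} of node RAM holds table data")  # 29%
print(nodes_needed(1000), "nodes for a 1,000 GB table")         # 14
```

Under these assumptions, less than a third of each node's memory holds data, so a 1 TB table needs roughly 3.5 TB of aggregate RAM.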
40
Summary: Commercial
• Alpine’s interface is compelling to business user
• IBM Analytics Server is a good first release
• RRE ScaleR appeals to R users, plays well in Hadoop sandbox
• Skytree Server: strong in prediction
• SAS: why two competing memory-centric architectures?
41
Progress
• Spark: maturing blindingly fast
• Rapidly expanding library of analytic features
• Growing developer community, ecosystem
• Commercial: from zero to many
42
Interesting Questions
• Will Mahout get a second wind?
• Will Spark MLlib displace 0xdata?
• Will Spark GraphX catch up to GraphLab?
• Can Spark Streaming compete with Storm and commercial entrants?
• How quickly will customers adopt memory-centric architecture for analytics?
• What will Alpine and MicroStrategy do with Spark?
• Will IBM distribute Spark in BigInsights?
• When will SAS announce a reference customer for HPA/LASR in Hadoop?
43

Weitere ähnliche Inhalte

Was ist angesagt?

Transitioning Compute Models: Hadoop MapReduce to Spark
Transitioning Compute Models: Hadoop MapReduce to SparkTransitioning Compute Models: Hadoop MapReduce to Spark
Transitioning Compute Models: Hadoop MapReduce to SparkSlim Baltagi
 
Big Data MDX with Mondrian and Apache Kylin
Big Data MDX with Mondrian and Apache KylinBig Data MDX with Mondrian and Apache Kylin
Big Data MDX with Mondrian and Apache Kylininovex GmbH
 
Customer Education Webcast: New Features in Data Integration and Streaming CDC
Customer Education Webcast: New Features in Data Integration and Streaming CDCCustomer Education Webcast: New Features in Data Integration and Streaming CDC
Customer Education Webcast: New Features in Data Integration and Streaming CDCPrecisely
 
New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...
New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...
New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...Rittman Analytics
 
Apache Spark Overview @ ferret
Apache Spark Overview @ ferretApache Spark Overview @ ferret
Apache Spark Overview @ ferretAndrii Gakhov
 
NoSQL Application Development with JSON and MapR-DB
NoSQL Application Development with JSON and MapR-DBNoSQL Application Development with JSON and MapR-DB
NoSQL Application Development with JSON and MapR-DBMapR Technologies
 
Flink SQL & TableAPI in Large Scale Production at Alibaba
Flink SQL & TableAPI in Large Scale Production at AlibabaFlink SQL & TableAPI in Large Scale Production at Alibaba
Flink SQL & TableAPI in Large Scale Production at AlibabaDataWorks Summit
 
From a single droplet to a full bottle, our journey to Hadoop at Coca-Cola Ea...
From a single droplet to a full bottle, our journey to Hadoop at Coca-Cola Ea...From a single droplet to a full bottle, our journey to Hadoop at Coca-Cola Ea...
From a single droplet to a full bottle, our journey to Hadoop at Coca-Cola Ea...DataWorks Summit/Hadoop Summit
 
Unify Stream and Batch Processing using Dataflow, a Portable Programmable Mod...
Unify Stream and Batch Processing using Dataflow, a Portable Programmable Mod...Unify Stream and Batch Processing using Dataflow, a Portable Programmable Mod...
Unify Stream and Batch Processing using Dataflow, a Portable Programmable Mod...DataWorks Summit
 
Demystify Big Data Breakfast Briefing: Herb Cunitz, Hortonworks
Demystify Big Data Breakfast Briefing:  Herb Cunitz, HortonworksDemystify Big Data Breakfast Briefing:  Herb Cunitz, Hortonworks
Demystify Big Data Breakfast Briefing: Herb Cunitz, HortonworksHortonworks
 
Pandas UDF: Scalable Analysis with Python and PySpark
Pandas UDF: Scalable Analysis with Python and PySparkPandas UDF: Scalable Analysis with Python and PySpark
Pandas UDF: Scalable Analysis with Python and PySparkLi Jin
 
Reimagining Devon Energy’s Data Estate with a Unified Approach to Integration...
Reimagining Devon Energy’s Data Estate with a Unified Approach to Integration...Reimagining Devon Energy’s Data Estate with a Unified Approach to Integration...
Reimagining Devon Energy’s Data Estate with a Unified Approach to Integration...Databricks
 
Fedbench - A Benchmark Suite for Federated Semantic Data Processing
Fedbench - A Benchmark Suite for Federated Semantic Data ProcessingFedbench - A Benchmark Suite for Federated Semantic Data Processing
Fedbench - A Benchmark Suite for Federated Semantic Data ProcessingPeter Haase
 
Kylin Engineering Principles
Kylin Engineering PrinciplesKylin Engineering Principles
Kylin Engineering PrinciplesXu Jiang
 
Introduction to the Hadoop EcoSystem
Introduction to the Hadoop EcoSystemIntroduction to the Hadoop EcoSystem
Introduction to the Hadoop EcoSystemShivaji Dutta
 
The Azure Cognitive Services on Spark: Clusters with Embedded Intelligent Ser...
The Azure Cognitive Services on Spark: Clusters with Embedded Intelligent Ser...The Azure Cognitive Services on Spark: Clusters with Embedded Intelligent Ser...
The Azure Cognitive Services on Spark: Clusters with Embedded Intelligent Ser...Databricks
 
Hadoop Summit Europe Talk 2014: Apache Hadoop YARN: Present and Future
Hadoop Summit Europe Talk 2014: Apache Hadoop YARN: Present and FutureHadoop Summit Europe Talk 2014: Apache Hadoop YARN: Present and Future
Hadoop Summit Europe Talk 2014: Apache Hadoop YARN: Present and FutureVinod Kumar Vavilapalli
 

Was ist angesagt? (20)

Transitioning Compute Models: Hadoop MapReduce to Spark
Transitioning Compute Models: Hadoop MapReduce to SparkTransitioning Compute Models: Hadoop MapReduce to Spark
Transitioning Compute Models: Hadoop MapReduce to Spark
 
Big Data MDX with Mondrian and Apache Kylin
Big Data MDX with Mondrian and Apache KylinBig Data MDX with Mondrian and Apache Kylin
Big Data MDX with Mondrian and Apache Kylin
 
Customer Education Webcast: New Features in Data Integration and Streaming CDC
Customer Education Webcast: New Features in Data Integration and Streaming CDCCustomer Education Webcast: New Features in Data Integration and Streaming CDC
Customer Education Webcast: New Features in Data Integration and Streaming CDC
 
New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...
New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...
New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...
 
Apache Spark Overview @ ferret
Apache Spark Overview @ ferretApache Spark Overview @ ferret
Apache Spark Overview @ ferret
 
Spark mhug2
Spark mhug2Spark mhug2
Spark mhug2
 
NoSQL Application Development with JSON and MapR-DB
NoSQL Application Development with JSON and MapR-DBNoSQL Application Development with JSON and MapR-DB
NoSQL Application Development with JSON and MapR-DB
 
Active Learning for Fraud Prevention
Active Learning for Fraud PreventionActive Learning for Fraud Prevention
Active Learning for Fraud Prevention
 
Flink SQL & TableAPI in Large Scale Production at Alibaba
Flink SQL & TableAPI in Large Scale Production at AlibabaFlink SQL & TableAPI in Large Scale Production at Alibaba
Flink SQL & TableAPI in Large Scale Production at Alibaba
 
From a single droplet to a full bottle, our journey to Hadoop at Coca-Cola Ea...
From a single droplet to a full bottle, our journey to Hadoop at Coca-Cola Ea...From a single droplet to a full bottle, our journey to Hadoop at Coca-Cola Ea...
From a single droplet to a full bottle, our journey to Hadoop at Coca-Cola Ea...
 
Unify Stream and Batch Processing using Dataflow, a Portable Programmable Mod...
Unify Stream and Batch Processing using Dataflow, a Portable Programmable Mod...Unify Stream and Batch Processing using Dataflow, a Portable Programmable Mod...
Unify Stream and Batch Processing using Dataflow, a Portable Programmable Mod...
 
Demystify Big Data Breakfast Briefing: Herb Cunitz, Hortonworks
Demystify Big Data Breakfast Briefing:  Herb Cunitz, HortonworksDemystify Big Data Breakfast Briefing:  Herb Cunitz, Hortonworks
Demystify Big Data Breakfast Briefing: Herb Cunitz, Hortonworks
 
Pandas UDF: Scalable Analysis with Python and PySpark
Pandas UDF: Scalable Analysis with Python and PySparkPandas UDF: Scalable Analysis with Python and PySpark
Pandas UDF: Scalable Analysis with Python and PySpark
 
Reimagining Devon Energy’s Data Estate with a Unified Approach to Integration...
Reimagining Devon Energy’s Data Estate with a Unified Approach to Integration...Reimagining Devon Energy’s Data Estate with a Unified Approach to Integration...
Reimagining Devon Energy’s Data Estate with a Unified Approach to Integration...
 
What's new in SQL on Hadoop and Beyond
What's new in SQL on Hadoop and BeyondWhat's new in SQL on Hadoop and Beyond
What's new in SQL on Hadoop and Beyond
 
Fedbench - A Benchmark Suite for Federated Semantic Data Processing
Fedbench - A Benchmark Suite for Federated Semantic Data ProcessingFedbench - A Benchmark Suite for Federated Semantic Data Processing
Fedbench - A Benchmark Suite for Federated Semantic Data Processing
 
Kylin Engineering Principles
Kylin Engineering PrinciplesKylin Engineering Principles
Kylin Engineering Principles
 
Introduction to the Hadoop EcoSystem
Introduction to the Hadoop EcoSystemIntroduction to the Hadoop EcoSystem
Introduction to the Hadoop EcoSystem
 
The Azure Cognitive Services on Spark: Clusters with Embedded Intelligent Ser...
The Azure Cognitive Services on Spark: Clusters with Embedded Intelligent Ser...The Azure Cognitive Services on Spark: Clusters with Embedded Intelligent Ser...
The Azure Cognitive Services on Spark: Clusters with Embedded Intelligent Ser...
 
Hadoop Summit Europe Talk 2014: Apache Hadoop YARN: Present and Future
Hadoop Summit Europe Talk 2014: Apache Hadoop YARN: Present and FutureHadoop Summit Europe Talk 2014: Apache Hadoop YARN: Present and Future
Hadoop Summit Europe Talk 2014: Apache Hadoop YARN: Present and Future
 

Andere mochten auch

Distributed Processing of Stream Text Mining
Distributed Processing of Stream Text MiningDistributed Processing of Stream Text Mining
Distributed Processing of Stream Text MiningLi Miao
 
R+Hadoop - Ask Bigger (and New) Questions and Get Better, Faster Answers
R+Hadoop - Ask Bigger (and New) Questions and Get Better, Faster AnswersR+Hadoop - Ask Bigger (and New) Questions and Get Better, Faster Answers
R+Hadoop - Ask Bigger (and New) Questions and Get Better, Faster AnswersRevolution Analytics
 
Advanced analytics
Advanced analyticsAdvanced analytics
Advanced analyticsShankar R
 
Real-time Analytics with Cassandra, Spark, and Shark
Real-time Analytics with Cassandra, Spark, and SharkReal-time Analytics with Cassandra, Spark, and Shark
Real-time Analytics with Cassandra, Spark, and SharkEvan Chan
 
Transitioning from Traditional DW to Apache® Spark™ in Operating Room Predict...
Transitioning from Traditional DW to Apache® Spark™ in Operating Room Predict...Transitioning from Traditional DW to Apache® Spark™ in Operating Room Predict...
Transitioning from Traditional DW to Apache® Spark™ in Operating Room Predict...Databricks
 
Relationship Extraction from Unstructured Text-Based on Stanford NLP with Spa...
Relationship Extraction from Unstructured Text-Based on Stanford NLP with Spa...Relationship Extraction from Unstructured Text-Based on Stanford NLP with Spa...
Relationship Extraction from Unstructured Text-Based on Stanford NLP with Spa...Spark Summit
 
The future of business intelligence
The future of business intelligence The future of business intelligence
The future of business intelligence Phocas Software
 
Business Applications of Predictive Modeling at Scale - KDD 2016 Tutorial
Business Applications of Predictive Modeling at Scale - KDD 2016 TutorialBusiness Applications of Predictive Modeling at Scale - KDD 2016 Tutorial
Business Applications of Predictive Modeling at Scale - KDD 2016 TutorialQiang Zhu
 
IBM SPSS Overview Text Analytics Brief
IBM SPSS Overview Text Analytics BriefIBM SPSS Overview Text Analytics Brief
IBM SPSS Overview Text Analytics BriefIan Balina
 
A Practical Guide: Building your Business Intelligence Business Case for 2017
A Practical Guide: Building your Business Intelligence Business Case for 2017A Practical Guide: Building your Business Intelligence Business Case for 2017
A Practical Guide: Building your Business Intelligence Business Case for 2017Sisense
 
What's New in Predictive Analytics IBM SPSS
What's New in Predictive Analytics IBM SPSSWhat's New in Predictive Analytics IBM SPSS
What's New in Predictive Analytics IBM SPSSVirginia Fernandez
 
White Paper - The Business Case For Business Intelligence
White Paper -  The Business Case For Business IntelligenceWhite Paper -  The Business Case For Business Intelligence
White Paper - The Business Case For Business IntelligenceDavid Walker
 
Big Data & Analytics Architecture
Big Data & Analytics ArchitectureBig Data & Analytics Architecture
Big Data & Analytics ArchitectureArvind Sathi
 
Tableau Software - Business Analytics and Data Visualization
Tableau Software - Business Analytics and Data VisualizationTableau Software - Business Analytics and Data Visualization
Tableau Software - Business Analytics and Data Visualizationlesterathayde
 
Introducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data ScienceIntroducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data ScienceDatabricks
 

Andere mochten auch (20)

Distributed Processing of Stream Text Mining
Distributed Processing of Stream Text MiningDistributed Processing of Stream Text Mining
Distributed Processing of Stream Text Mining
 
Science in text mining
Science in text miningScience in text mining
Science in text mining
 
R+Hadoop - Ask Bigger (and New) Questions and Get Better, Faster Answers
R+Hadoop - Ask Bigger (and New) Questions and Get Better, Faster AnswersR+Hadoop - Ask Bigger (and New) Questions and Get Better, Faster Answers
R+Hadoop - Ask Bigger (and New) Questions and Get Better, Faster Answers
 
Advanced analytics
Advanced analyticsAdvanced analytics
Advanced analytics
 
Real-time Analytics with Cassandra, Spark, and Shark
Real-time Analytics with Cassandra, Spark, and SharkReal-time Analytics with Cassandra, Spark, and Shark
Real-time Analytics with Cassandra, Spark, and Shark
 
Get your data analytics strategy right!
Get your data analytics strategy right!Get your data analytics strategy right!
Get your data analytics strategy right!
 
Are API Services Taking Over All the Interesting Data Science Problems?
Are API Services Taking Over All the Interesting Data Science Problems?Are API Services Taking Over All the Interesting Data Science Problems?
Are API Services Taking Over All the Interesting Data Science Problems?
 
Transitioning from Traditional DW to Apache® Spark™ in Operating Room Predict...
Transitioning from Traditional DW to Apache® Spark™ in Operating Room Predict...Transitioning from Traditional DW to Apache® Spark™ in Operating Room Predict...
Transitioning from Traditional DW to Apache® Spark™ in Operating Room Predict...
 
Relationship Extraction from Unstructured Text-Based on Stanford NLP with Spa...
Relationship Extraction from Unstructured Text-Based on Stanford NLP with Spa...Relationship Extraction from Unstructured Text-Based on Stanford NLP with Spa...
Relationship Extraction from Unstructured Text-Based on Stanford NLP with Spa...
 
The future of business intelligence
The future of business intelligence The future of business intelligence
The future of business intelligence
 
Business Applications of Predictive Modeling at Scale - KDD 2016 Tutorial
Business Applications of Predictive Modeling at Scale - KDD 2016 TutorialBusiness Applications of Predictive Modeling at Scale - KDD 2016 Tutorial
Business Applications of Predictive Modeling at Scale - KDD 2016 Tutorial
 
IBM SPSS Overview Text Analytics Brief
IBM SPSS Overview Text Analytics BriefIBM SPSS Overview Text Analytics Brief
IBM SPSS Overview Text Analytics Brief
 
A Practical Guide: Building your Business Intelligence Business Case for 2017
A Practical Guide: Building your Business Intelligence Business Case for 2017A Practical Guide: Building your Business Intelligence Business Case for 2017
A Practical Guide: Building your Business Intelligence Business Case for 2017
 
What's New in Predictive Analytics IBM SPSS
What's New in Predictive Analytics IBM SPSSWhat's New in Predictive Analytics IBM SPSS
What's New in Predictive Analytics IBM SPSS
 
White Paper - The Business Case For Business Intelligence
White Paper -  The Business Case For Business IntelligenceWhite Paper -  The Business Case For Business Intelligence
White Paper - The Business Case For Business Intelligence
 
SAS Institute: Big data and smarter analytics
SAS Institute: Big data and smarter analyticsSAS Institute: Big data and smarter analytics
SAS Institute: Big data and smarter analytics
 
Big Data & Analytics Architecture
Big Data & Analytics ArchitectureBig Data & Analytics Architecture
Big Data & Analytics Architecture
 
Tableau Software - Business Analytics and Data Visualization
Tableau Software - Business Analytics and Data VisualizationTableau Software - Business Analytics and Data Visualization
Tableau Software - Business Analytics and Data Visualization
 
Introducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data ScienceIntroducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data Science
 
Big Data and Advanced Analytics
Big Data and Advanced AnalyticsBig Data and Advanced Analytics
Big Data and Advanced Analytics
 

Ähnlich wie Advanced Analytics in Hadoop

Advanced Analytics and Big Data (August 2014)
Advanced Analytics and Big Data (August 2014)Advanced Analytics and Big Data (August 2014)
Advanced Analytics and Big Data (August 2014)Thomas W. Dinsmore
 
Bi on Big Data - Strata 2016 in London
Bi on Big Data - Strata 2016 in LondonBi on Big Data - Strata 2016 in London
Bi on Big Data - Strata 2016 in LondonDremio Corporation
 
Apache HAWQ and Apache MADlib: Journey to Apache
Apache HAWQ and Apache MADlib: Journey to ApacheApache HAWQ and Apache MADlib: Journey to Apache
Apache HAWQ and Apache MADlib: Journey to ApachePivotalOpenSourceHub
 
Modern Big Data Analytics Tools: An Overview
Modern Big Data Analytics Tools: An OverviewModern Big Data Analytics Tools: An Overview
Modern Big Data Analytics Tools: An OverviewGreat Wide Open
 
Hadoop Infrastructure and SoftServe Experience by Vitaliy Bashun, Data Architect
Hadoop Infrastructure and SoftServe Experience by Vitaliy Bashun, Data ArchitectHadoop Infrastructure and SoftServe Experience by Vitaliy Bashun, Data Architect
Hadoop Infrastructure and SoftServe Experience by Vitaliy Bashun, Data ArchitectSoftServe
 
Pacemaker hadoop infrastructure and soft serve experience
Pacemaker   hadoop infrastructure and soft serve experiencePacemaker   hadoop infrastructure and soft serve experience
Pacemaker hadoop infrastructure and soft serve experienceVitaliy Bashun
 
Gluent New World #02 - SQL-on-Hadoop : A bit of History, Current State-of-the...
Gluent New World #02 - SQL-on-Hadoop : A bit of History, Current State-of-the...Gluent New World #02 - SQL-on-Hadoop : A bit of History, Current State-of-the...
Gluent New World #02 - SQL-on-Hadoop : A bit of History, Current State-of-the...Mark Rittman
 
DataMass Summit - Machine Learning for Big Data in SQL Server
DataMass Summit - Machine Learning for Big Data  in SQL ServerDataMass Summit - Machine Learning for Big Data  in SQL Server
DataMass Summit - Machine Learning for Big Data in SQL ServerŁukasz Grala
 
Big Data and NoSQL for Database and BI Pros
Big Data and NoSQL for Database and BI ProsBig Data and NoSQL for Database and BI Pros
Big Data and NoSQL for Database and BI ProsAndrew Brust
 
Enkitec E4 Barcelona : SQL and Data Integration Futures on Hadoop :
Enkitec E4 Barcelona : SQL and Data Integration Futures on Hadoop : Enkitec E4 Barcelona : SQL and Data Integration Futures on Hadoop :
Enkitec E4 Barcelona : SQL and Data Integration Futures on Hadoop : Mark Rittman
 
Big Data tools in practice
Big Data tools in practiceBig Data tools in practice
Big Data tools in practiceDarko Marjanovic
 
Big Data in the Microsoft Platform
Big Data in the Microsoft PlatformBig Data in the Microsoft Platform
Big Data in the Microsoft PlatformJesus Rodriguez
 
BI, Reporting and Analytics on Apache Cassandra
BI, Reporting and Analytics on Apache CassandraBI, Reporting and Analytics on Apache Cassandra
BI, Reporting and Analytics on Apache CassandraVictor Coustenoble
 
Apache Spark - Las Vegas Big Data Meetup Dec 3rd 2014
Apache Spark - Las Vegas Big Data Meetup Dec 3rd 2014Apache Spark - Las Vegas Big Data Meetup Dec 3rd 2014
Apache Spark - Las Vegas Big Data Meetup Dec 3rd 2014cdmaxime
 
Simple, Modular and Extensible Big Data Platform Concept
Simple, Modular and Extensible Big Data Platform ConceptSimple, Modular and Extensible Big Data Platform Concept
Simple, Modular and Extensible Big Data Platform ConceptSatish Mohan
 
The Last Mile: Challenges and Opportunities in Data Tools (Strata 2014)
The Last Mile: Challenges and Opportunities in Data Tools (Strata 2014)The Last Mile: Challenges and Opportunities in Data Tools (Strata 2014)
The Last Mile: Challenges and Opportunities in Data Tools (Strata 2014)DataPad Inc.
 

Advanced Analytics in Hadoop

  • 1. Advanced Analytics in Hadoop, Thomas W. Dinsmore
  • 2. Advanced Analytics in Hadoop • Use cases • Architectures • Current Options: • Open Source • Commercial
  • 3. Analytics: Ad Hoc Queries, Reports, Data Access, Visualization, Data Manipulation, OLAP/ROLAP etc., Advanced Discovery, Predictive Analytics, Optimization, Simulation, Text Analytics, Geospatial Analytics, Econometrics, Dashboards, Scorecards, Streaming Analytics (axis: Computational Complexity)
  • 4. Advanced Analytics: Advanced Discovery, Predictive Analytics, Optimization, Simulation, Text Analytics, Geospatial Analytics, Econometrics, Streaming Analytics (axis: Computational Complexity)
  • 5. Advanced Analytics: Advanced Discovery, Predictive Analytics, Optimization, Simulation, Text Analytics, Geospatial Analytics, Econometrics, Streaming Analytics, Feature Extraction, Dimension Reduction
  • 8. For some use cases, you must use all of the data: Anomaly Detection, Affinity Analysis, Clustering, Social Network Analysis, Collaborative Filtering
  • 9. For others, using all of the data is worth it: Catastrophic Risk Modeling, Modeling with Fine-grained Behavioral Data
  • 10. Your Options (2013): 1. Apache Mahout 2. Code it yourself. 3. …
  • 12. Legacy Alongside (diagram: six HDFS nodes; Data)
  • 13. Legacy Pass-Through (diagram: six HDFS nodes; MapReduce; Data)
  • 14. MapReduce Push-Down (diagram: six HDFS nodes; MapReduce) Advantages: • Co-exists w/ other applications • Integrated workload management • Simplified administration Disadvantages: • MapReduce latency
  • 15. Co-Located In-Memory (Asymmetric) (diagram: YARN; six HDFS nodes, each with MapReduce) Advantages: • Easy to adapt legacy apps • Isolates analytic workload Disadvantages: • Data moves within the cluster • Requires YARN
  • 17. Summary: Architecture • MapReduce Push-Down is current “champion” • Stable • Co-exists well with Hadoop ecosystem • MR 1.0 penalizes performance • Required: persistent in-memory processing • YARN enables co-location
  • 19. Apache Mahout • Apache incubator project (2007) • Machine learning library • Included in most distributions • Thin acceptance, few contributors • Diverse architecture • Single-node • MapReduce • New algos run on Spark • Recently cleaned up
  • 20. Apache Giraph • Apache top-level project • Runs in MapReduce • Dedicated graph engine • Used by Facebook, few others • Dead in the water • No presence in leading distros • No significant commercial support • No releases in 13 months • No recent code commits on Git
  • 21. GraphLab • Carnegie Mellon project (2009) • Distributed in-memory engine: • Primarily graph analysis • Selected machine learning algos • Interface from Java, JavaScript, Python • GraphLab Inc provides commercial support (2013, $6.75MM) • Independent distribution, or through Pivotal
  • 22. 0xdata H2O • Vendor-driven open source project • 0xdata sells support, customization • Distributed in-memory prediction engine • Multiple deployment options: • Standalone (with HDFS) • Over YARN • In MapReduce • Claims 2,000+ users • 4 public references • Used by a leading P&C insurer • Java, R, Python and Scala interfaces
  • 23. Apache Spark • Top-level Apache project (2/14) • Release 1.0 (5/14) • Distributed in-memory analytics • Machine learning • Graph analytics • Streaming analytics • Fast SQL • Compatible with Hadoop storage • Integrated with YARN • Scala, Python, Java interfaces (+SparkR) • Growing ecosystem • Supported in leading Hadoop distributions
  • 24. Apache Spark: Hadoop Distributions. Spark components (columns): MLLIB, GraphX, Spark Streaming, Spark SQL, Shark • Cloudera: Yes, Yes, Yes, Yes, (Impala) • Hortonworks: Yes, (Storm), (Stinger) • MapR: Yes, Yes, Yes, Yes, Yes • Pivotal: Yes, Yes, Yes, Yes, Yes • IBM BigInsights
  • 25. Summary: Open Source Projects • 0xdata H2O 2.2: Status: Independent; Architecture: Co-Located Memory-Centric; Interfaces: Java, Python, R, Scala; Commercial Support: 0xdata; Distribution: Independent • Apache Giraph 1.1: Status: Top-Level; Architecture: MapReduce; Interfaces: Java; Distribution: Independent • Apache Mahout 0.9: Status: Top-Level; Architecture: MapReduce; Interfaces: Java; Distribution: All Hadoop Distributions • Apache Spark 1.0: Status: Top-Level; Architecture: Co-Located Memory-Centric; Interfaces: Java, Python, Scala (SparkR); Commercial Support: Databricks; Distribution: Cloudera, Hortonworks, MapR, Pivotal • GraphLab 2.2: Status: Independent; Architecture: Co-Located Memory-Centric; Interfaces: Python; Commercial Support: GraphLab, Inc.; Distribution: Independent
  • 26. Analytic Features. Columns: 0xdata H2O 2.2, Apache Giraph 1.1, Apache Mahout 0.9, Apache Spark 1.0, GraphLab 2.2 • Prediction: +++ + +++ • Dimension Reduction: + +++ + + • Clustering: + +++ + +++ • Collaborative Filtering: +++ + +++ • Text Analytics: +++ +++ • Matrix Operations: + +++ + • Graph Analysis: + + +++
  • 27. Analytic Features: Prediction. Columns: Mahout 0.9, Spark 1.0, H2O 2.2 • Linear Regression: + • Logistic Regression: + • Generalized Linear Models: + • Naive Bayes: + + + • Decision Tree: + • Gradient Boosted Trees: + • Random Forests: + + • Linear Support Vector Machine: + • Deep Learning (Backprop MLP): +
  • 28. Analytic Features: Dimension Reduction. Columns: Mahout 0.9, Spark 1.0, H2O 2.2, GraphLab 2.2 • Singular Value Decomposition: + + • Lanczos Algorithm: + + • Stochastic SVD: + • Principal Components Analysis: + + +
  • 29. Analytic Features: Clustering. Columns: Mahout 0.9, Spark 1.0, H2O 2.2, GraphLab 2.2 • k-Means: + + + + • Fuzzy k-Means: + • Streaming k-Means: + • Spectral Clustering: + +
  • 30. Analytic Features: Collaborative Filtering. Columns: Mahout 0.9, Spark 1.0, H2O 2.2, GraphLab 2.2 • Item-Based: + • Matrix Factorization with ALS: + + + • Matrix Factorization with ALS, Implicit Feedback: + • ALS with Parallel Coordinate Descent: + • Weighted ALS: + • Sparse ALS: +
  • 31. Analytic Features: Text Analytics. Columns: Mahout 0.9, Spark 1.0, H2O 2.2, GraphLab 2.2 • Latent Dirichlet Allocation: + + • Frequent Pattern Mining: + • Collocations: +
  • 32. Matrix Operations. Columns: Mahout 0.9, Spark 1.0, H2O 2.2, GraphLab 2.2 • Stochastic Gradient Descent: + + • Limited-Memory BFGS: + • RowSimilarityJob: + • ConcatMatrices: +
  • 33. Summary: Open Source • Giraph is toast • Mahout may be recovering from roadkill status • GraphLab outperforms Spark GraphX today in graph analytics • 0xdata H2O outperforms Spark MLLib today in machine learning • Spark catching up fast • More resources and distribution • Integrated platform for ML and graph analysis
  • 35. Alpine • Business user interface • Collaboration environment • Broad library of techniques • Strong cloud offering • Leverages Hadoop (multiple distros), Hawq or Pivotal Greenplum • Push-down MapReduce • Certified on Spark • Small but growing customer base
  • 36. IBM SPSS Analytics Server • Introduced 2013 • Serves as “back end” for SPSS Modeler • Uses push-down MR • Limited analytic feature set • IBM supports on multiple Hadoop distros • Customer acceptance unknown
  • 37. Revolution Analytics ScaleR • ScaleR library of distributed statistics, machine learning functions • Tools to distribute arbitrary R functions • Runs in Cloudera, Hortonworks, Teradata, LSF clusters, MS HPC • Hadoop edition uses MR push-down • Tools simplify installation in large clusters • R interface • Partnerships with Alteryx, Qlik, MicroStrategy, Tableau provide business interfaces
  • 38. Skytree Server • Georgia Tech’s FastLab project, repurposed as commercial software • Distributed machine learning platform • Very opaque about technical details • User interface is an API • Co-located in Hadoop under YARN • Just certified by Hortonworks • Customer acceptance unknown • No new public references in a year • Used by leading credit card company
  • 39. SAS High-Performance Analytics • Distributed in-memory analytics • Designed to run in special-purpose appliances (2011) • Repurposed to run in Hadoop (2013) • Co-exists poorly: cannot run SAS and MapReduce at the same time • Reads entire dataset into memory • Uses MPI to communicate among nodes • Requires upgrades from standard Hadoop infrastructure • Customer acceptance unknown • No public references • Generic success stories missing from Strata presos
  • 40. SAS LASR Server • SAS’ “other” distributed in-memory platform • Back end for several end-user products • SAS Visual Analytics (2012) • SAS Visual Statistics (New) • SAS In-Memory Statistics for Hadoop (New) • Recently added statistics and machine learning • Does not read raw HDFS; data must be transformed to proprietary SASHDAT • Like HPA, reads entire dataset into memory • A 16-core, 256 GB node can load a 75 GB table • Runs DS2 programs, not legacy SAS programs • Fast, but with limited feature set • SAS claims 1,400 “sites” for Visual Analytics • Many of those are standalone boxes
  • 41. Summary: Commercial • Alpine’s interface is compelling to business users • IBM Analytics Server is a good first release • RRE ScaleR appeals to R users, plays well in Hadoop sandbox • Skytree Server: strong in prediction • SAS: why two competing memory-centric architectures?
  • 42. Progress • Spark: blindingly fast maturity • Rapidly expanding library of analytic features • Growing developer community, ecosystem • Commercial: from zero to many
  • 43. Interesting Questions • Will Mahout get a second wind? • Will Spark MLLib displace 0xdata? • Will Spark GraphX catch up to GraphLab? • Can Spark Streaming compete with Storm and commercial entrants? • How quickly will customers adopt memory-centric architecture for analytics? • What will Alpine and MicroStrategy do with Spark? • Will IBM distribute Spark in BigInsights? • When will SAS announce a reference customer for HPA/LASR in Hadoop?
  • 44. Advanced Analytics in Hadoop Thomas W. Dinsmore 44
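
The MapReduce Push-Down architecture on slide 14 can be illustrated with a small sketch. This is not code from the deck or from any of the products it lists; it is plain Python standing in for a cluster, and the names `Stats`, `map_partition`, `combine`, and `mean_and_variance` are hypothetical. The idea it tries to show: each map task summarizes the data block it sits next to, and only the small summaries travel to the reducer.

```python
from dataclasses import dataclass
from functools import reduce


@dataclass(frozen=True)
class Stats:
    """Sufficient statistics for a mean and a population variance."""
    n: int = 0
    total: float = 0.0
    total_sq: float = 0.0


def map_partition(values):
    """Map step: summarize one data block where it is stored."""
    vals = list(values)
    return Stats(len(vals), sum(vals), sum(v * v for v in vals))


def combine(a, b):
    """Reduce step: merge two summaries; the order of merging does not matter."""
    return Stats(a.n + b.n, a.total + b.total, a.total_sq + b.total_sq)


def mean_and_variance(partitions):
    """Push the map step down to every partition, then reduce the summaries."""
    s = reduce(combine, (map_partition(p) for p in partitions), Stats())
    mean = s.total / s.n
    variance = s.total_sq / s.n - mean ** 2  # population variance
    return mean, variance


# Three lists stand in for three HDFS blocks (made-up values).
partitions = [[1.0, 2.0], [3.0, 4.0], [5.0]]
mean, variance = mean_and_variance(partitions)
print(mean, variance)
```

A single pass like this suits the push-down model. An iterative algorithm would launch a new job for every pass, which is one way to read the "MapReduce latency" disadvantage and the call for persistent in-memory processing on slide 17.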