1 
Advanced Analytics with Big Data 
Thomas W. Dinsmore
Advanced Analytics with Big Data 
•What do we mean by “Big Data”? 
•Do we need to use all of the data? 
•What analytics can run inside Big Data platforms? 
2
Big Data 
•Data that cannot be efficiently handled in a relational database 
•The three Vs: 
•Volume 
•Variety 
•Velocity 
3
Big Data Platforms 
•Hadoop ecosystem: MapReduce, Hive, Impala, Spark, etc. 
•Appliances: Teradata, IBM PureData, Pivotal, Oracle BDA, Vertica, ParAccel/Redshift, etc. 
•NoSQL/NewSQL: Cassandra, MongoDB, MemSQL 
•Streaming engines: InfoSphere Streams 
4 
Convergence: federated SQL engines (e.g., Pivotal HAWQ)
6 
Analytics Platform 
For aggregate models, you can simply sample the data and work offline.
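One way to draw such a sample without loading the full dataset is reservoir sampling, sketched below. This is a minimal illustration rather than anything the slides prescribe: the function names `reservoir_sample` and `estimate_mean` are hypothetical, and the input is assumed to be any iterable of numeric rows with at least one element.

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def reservoir_sample(rows: Iterable[T], k: int, seed: int = 0) -> List[T]:
    """Return up to k rows chosen uniformly at random from a stream of unknown length."""
    rng = random.Random(seed)
    sample: List[T] = []
    for i, row in enumerate(rows):
        if i < k:
            # Fill the reservoir with the first k rows.
            sample.append(row)
        else:
            # Keep the new row with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = row
    return sample


def estimate_mean(rows: Iterable[float], k: int, seed: int = 0) -> float:
    """Estimate an aggregate (here, the mean) from a sample instead of the full stream."""
    sample = reservoir_sample(rows, k, seed)
    return sum(sample) / len(sample)
```

The stream is read once and only `k` rows are held in memory, so the sample can be pulled off the platform and modeled offline.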
7 
•Anomaly Detection 
•Affinity Analysis 
•Microsegmentation 
•Social Network Analysis 
•Collaborative Filtering 
However, for some use cases you may need to use all of the data.
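A small calculation suggests why sampling can fail for a use case such as anomaly detection: if the records of interest are rare, a sample may contain none of them. The sketch below assumes a simple random sample drawn without replacement; the function name and the example figures are illustrative, not taken from the slides.

```python
from math import comb


def prob_sample_misses_all(total_rows: int, rare_rows: int, sample_size: int) -> float:
    """Probability that a simple random sample contains none of the rare rows."""
    if sample_size > total_rows - rare_rows:
        # The sample is too large to avoid every rare row.
        return 0.0
    # Hypergeometric probability of drawing zero rare rows.
    return comb(total_rows - rare_rows, sample_size) / comb(total_rows, sample_size)
```

For instance, with 1,000,000 rows, 5 anomalous rows, and a 1% sample, `prob_sample_misses_all(1_000_000, 5, 10_000)` comes out at about 0.95, so the sample would usually contain no anomalies at all.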
8 
Catastrophic Risk Modeling 
Modeling with Fine-grained Behavioral Data 
For other use cases, using all of the data is worth extra time and effort.
9 
[Diagram: HDFS nodes; Data]
Most legacy analytic packages can read HDFS files.
10 
[Diagram: HDFS nodes; MapReduce; Data]
Some tools also provide pass-through capabilities.
11 
[Diagram: HDFS nodes; MapReduce]
Advantages 
•Co-exists w/ other applications 
•Integrated workload management 
•Simplified administration 
Disadvantages 
•MapReduce latency 
Several tools translate user requests to MapReduce. This eliminates data movement and co-exists well with other applications.
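To make the push-down idea concrete, here is a minimal single-process sketch of how a request such as "mean and variance of a column" can be expressed as a map step and a reduce step. It is an illustration only, not the translation any particular tool performs: the partitions stand in for HDFS blocks, and all function names are hypothetical.

```python
from functools import reduce
from typing import Iterable, List, Tuple

# Partial aggregate computed where the data lives: (count, sum, sum of squares).
Partial = Tuple[int, float, float]


def map_partition(values: Iterable[float]) -> Partial:
    """Map step: summarize one partition locally."""
    n, s, ss = 0, 0.0, 0.0
    for v in values:
        n += 1
        s += v
        ss += v * v
    return n, s, ss


def combine(a: Partial, b: Partial) -> Partial:
    """Reduce step: merge two partial aggregates."""
    return a[0] + b[0], a[1] + b[1], a[2] + b[2]


def mean_and_variance(partitions: List[List[float]]) -> Tuple[float, float]:
    """Combine per-partition summaries into a global mean and population variance."""
    n, s, ss = reduce(combine, map(map_partition, partitions), (0, 0.0, 0.0))
    mean = s / n
    variance = ss / n - mean * mean
    return mean, variance
```

Only the three-number summaries leave each partition, which is the sense in which data movement is avoided. The sum-of-squares formula is used for brevity; it can lose precision on large or badly scaled data.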
12 
[Diagram: YARN; HDFS nodes, each with MapReduce]
Advantages 
•Easy to adapt legacy apps 
•Isolates analytic workload 
Disadvantages 
•Data moves within the cluster 
•Requires YARN 
YARN (*) makes it possible to bypass MapReduce and run analytics in memory on dedicated nodes. 
(*) Yet Another Resource Negotiator
13 
[Diagram: YARN; HDFS nodes, each with MapReduce]
Advantages 
•Lowest latency 
Disadvantages 
•Upgrade every node 
•Requires YARN 
Distributing in-memory analytics across the Hadoop cluster minimizes internal data movement.
14 
Open Source Projects
Apache Mahout 
•Apache incubator project (2007) 
•Machine learning library 
•Included in most distributions 
•Thin acceptance, few contributors 
•Diverse architecture 
•Single-node 
•MapReduce 
•New algos run on Spark 
•Recently cleaned up 
15
Apache Giraph 
•Apache top-level project 
•Runs in MapReduce 
•Dedicated graph engine 
•Used by Facebook, few others 
•Dead in the water 
•No presence in leading distros 
•No significant commercial support 
•No releases in 13 months 
•No recent code commits on Git 
16
GraphLab 
•Carnegie Mellon project (2009) 
•Distributed in-memory engine: 
•Primarily graph analysis 
•Selected machine learning algos 
•Interface from Java, JavaScript, Python 
•GraphLab Inc provides commercial support (2013, $6.75MM) 
•Independent distribution, or through Pivotal 
•Minimal development effort in the past six months 
17
0xdata H2O 
•Vendor-driven open source project 
•0xdata sells support, customization 
•Distributed in-memory prediction engine 
•Multiple deployment options: 
•Standalone (with HDFS) 
•Over YARN 
•In MapReduce 
•Claims 2,000+ users 
•4 public references 
•Used by a leading P&C insurer 
•Java, R, Python and Scala interfaces 
18
Apache Spark 
•Top-level Apache project (2/14) 
•Release 1.0.2 (8/14) 
•Distributed in-memory analytics 
•Machine learning 
•Graph analytics 
•Streaming analytics 
•Fast SQL 
•Compatible with Hadoop storage 
•Integrated with YARN 
•Scala, Python, Java interfaces (+SparkR) 
•Growing ecosystem 
•Supported in leading Hadoop distributions 
19
Analytic Features 
22 
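Platforms compared: 0xdata H2O 2.2, Apache Giraph 1.1, Apache Mahout 0.9, Apache Spark 1.0.2, GraphLab 2.2 
[Table: ratings listed in source order; empty cells and column alignment did not survive extraction]
Prediction: +++, +, +++ 
Dimension Reduction: +, +++, +, + 
Clustering: +, +++, +, +++ 
Collaborative Filtering: +++, +, +++ 
Text Analytics: +++, +++ 
Matrix Operations: +, +++, + 
Graph Analysis: +, +, +++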
Summary: Open Source 
•Giraph appears to be dead in the water 
•Mahout may be recovering from roadkill status 
•GraphLab outperforms Spark GraphX today in graph analytics 
•0xdata H2O currently has more machine learning features than Spark MLlib and a better R interface 
•Spark catching up fast 
•More resources and distribution 
•Integrated platform for ML and graph analysis 
23
24 
Commercial Software
Alpine 
•Business user interface 
•Collaboration environment 
•Broad library of techniques 
•Strong cloud offering 
•Leverages Hadoop (multiple distros), Pivotal HAWQ or Pivotal Greenplum 
•Push-down MapReduce 
•Certified on Spark 
•Small but growing customer base 
25
IBM SPSS Analytics Server 
•Introduced 2013 
•Serves as “back end” for SPSS Modeler 
•Uses push-down MR 
•Limited analytic feature set 
•IBM supports on multiple Hadoop distros 
•Customer acceptance unknown 
26
Revolution Analytics ScaleR 
•ScaleR library of distributed statistics, machine learning functions 
•Tools to distribute arbitrary R functions 
•Runs in Cloudera, Hortonworks, Teradata, LSF clusters, MS HPC 
•Hadoop edition uses MR push-down 
•Tools simplify installation in large clusters 
•R interface 
•Partnerships with Alteryx, Qlik, MicroStrategy, Tableau provide business interfaces 
27
Skytree Server 
•Georgia Tech’s FastLab project, repurposed as commercial software 
•Distributed machine learning platform 
•Very opaque about technical details 
•User interface is an API 
•Co-located in Hadoop under YARN 
•Just certified by Hortonworks 
•Customer acceptance unknown 
•No new public references in a year 
•Used by leading credit card company 
28
SAS High-Performance Analytics 
•Distributed in-memory analytics 
•Designed to run in special-purpose appliances (2011) 
•Repurposed to run in Hadoop (2013) 
•Co-exists poorly — cannot run SAS and MapReduce at the same time 
•Reads entire dataset into memory 
•Uses MPI to communicate among nodes 
•Requires upgrades from standard Hadoop infrastructure 
•Customer acceptance unknown 
•No public references 
•Generic success stories missing from Strata presos 
29
SAS LASR Server 
•SAS’ “other” distributed in-memory platform 
•Back end for several end-user products 
•SAS Visual Analytics (2012) 
•SAS Visual Statistics (New) 
•SAS In-Memory Statistics for Hadoop (New) 
•Recently added statistics and machine learning 
•Does not read raw HDFS; must be transformed to proprietary SASHDAT 
•Like HPA, reads entire dataset into memory. 
•A 16-core, 256 GB node can load a 75 GB table 
•Runs DS2 programs, not legacy SAS programs 
•Fast, but with limited feature set 
•SAS claims 1,400 “sites” for Visual Analytics 
•Many of those are standalone boxes 
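The sizing figure on this slide (a 256 GB node loading a 75 GB table) implies that well under a third of node memory is available for table data. The sketch below turns that into a rough capacity estimate. It assumes the 75 GB figure scales linearly across identical nodes, which the slides do not state, and the function names are hypothetical.

```python
import math


def loadable_fraction(node_ram_gb: float = 256.0, loadable_gb: float = 75.0) -> float:
    """Share of node memory usable for table data, per the slide's figures."""
    return loadable_gb / node_ram_gb


def nodes_needed(table_gb: float, loadable_gb_per_node: float = 75.0) -> int:
    """Rough node count for a table, ASSUMING capacity scales linearly per node."""
    return math.ceil(table_gb / loadable_gb_per_node)
```

Under these assumptions `loadable_fraction()` is about 0.29, and a 300 GB table would need `nodes_needed(300)`, that is 4 nodes. Treat the result as a starting point for a sizing discussion, not a vendor specification.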
30
Summary: Commercial 
•Alpine’s interface is compelling to business users 
•IBM Analytics Server is a good first release 
•Revolution R Enterprise (RRE) ScaleR appeals to R users, plays well in Hadoop sandbox 
•Skytree Server: strong in prediction 
•SAS: why two competing memory-centric architectures? 
31
Progress 
•Spark: blindingly fast maturity 
•Rapidly expanding library of analytic features 
•Growing developer community, ecosystem 
•Commercial: from zero to many 
32
Interesting Questions 
•Will Mahout get a second wind? 
•Will Spark MLlib displace 0xdata? 
•Will Spark GraphX catch up to GraphLab? 
•Can Spark Streaming compete with Storm and commercial entrants? 
•How quickly will customers adopt memory-centric architecture for analytics? 
•What will Alpine and MicroStrategy do with Spark? 
•When will SAS announce a reference customer for HPA/LASR in Hadoop? 
33
Questions 
34
Thank You 
35
36 
Advanced Analytics with Big Data 
Thomas W. Dinsmore

 

Kürzlich hochgeladen (20)

"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptx
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 

Advanced Analytics and Big Data (August 2014)

  • 1. Advanced Analytics with Big Data, Thomas W. Dinsmore
  • 2. Advanced Analytics with Big Data
      • What do we mean by “Big Data”?
      • Do we need to use all of the data?
      • What analytics can run inside Big Data platforms?
  • 3. Big Data
      • Data that cannot be efficiently handled in a relational database
      • The three Vs:
          • Volume
          • Variety
          • Velocity
  • 4. Big Data Platforms
      • Hadoop ecosystem: MapReduce, Hive, Impala, Spark, etc.
      • Appliances: Teradata, IBM PureData, Pivotal, Oracle BDA, Vertica, ParAccel/Redshift, etc.
      • NoSQL/NewSQL: Cassandra, Mongo, MemSQL
      • Streaming engines: InfoSphere Streams
      • Convergence: federated SQL engines (e.g., Pivotal HAWQ)
  • 5. For aggregate models, you can simply sample the data and work offline. [Diagram: Analytics Platform]
  • 6. However, for some use cases you may need to use all of the data.
      • Anomaly Detection
      • Affinity Analysis
      • Microsegmentation
      • Social Network Analysis
      • Collaborative Filtering
  • 7. For other use cases, using all of the data is worth extra time and effort.
      • Catastrophic Risk Modeling
      • Modeling with Fine-grained Behavioral Data
  • 8. Most legacy analytic packages can read HDFS files. [Diagram: HDFS nodes; Data]
  • 9. Some tools also provide pass-through capabilities. [Diagram: HDFS nodes; MapReduce; Data]
  • 10. Several tools translate user requests to MapReduce. This eliminates data movement and co-exists well with other applications. [Diagram: HDFS nodes; MapReduce]
      • Advantages
          • Co-exists with other applications
          • Integrated workload management
          • Simplified administration
      • Disadvantages
          • MapReduce latency
  • 11. YARN (Yet Another Resource Negotiator) makes it possible to bypass MapReduce and run analytics in memory on dedicated nodes. [Diagram: YARN; HDFS nodes running MapReduce]
      • Advantages
          • Easy to adapt legacy apps
          • Isolates analytic workload
      • Disadvantages
          • Data moves within the cluster
          • Requires YARN
  • 12. Distributing in-memory analytics across the Hadoop cluster minimizes internal data movement. [Diagram: YARN; HDFS nodes running MapReduce]
      • Advantages
          • Lowest latency
      • Disadvantages
          • Upgrade every node
          • Requires YARN
  • 13. Open Source Projects
  • 14. Apache Mahout
      • Apache incubator project (2007)
      • Machine learning library
      • Included in most distributions
      • Thin acceptance, few contributors
      • Diverse architecture
          • Single-node
          • MapReduce
          • New algos run on Spark
      • Recently cleaned up
  • 15. Apache Giraph
      • Apache top-level project
      • Runs in MapReduce
      • Dedicated graph engine
      • Used by Facebook, few others
      • Dead in the water
          • No presence in leading distros
          • No significant commercial support
          • No releases in 13 months
          • No recent code commits on Git
  • 16. GraphLab
      • Carnegie Mellon project (2009)
      • Distributed in-memory engine:
          • Primarily graph analysis
          • Selected machine learning algos
      • Interface from Java, JavaScript, Python
      • GraphLab Inc provides commercial support (2013, $6.75MM)
      • Independent distribution, or through Pivotal
      • Minimal development effort in the past six months
  • 17. 0xdata H2O
      • Vendor-driven open source project
      • 0xdata sells support, customization
      • Distributed in-memory prediction engine
      • Multiple deployment options:
          • Standalone (with HDFS)
          • Over YARN
          • In MapReduce
      • Claims 2,000+ users
      • 4 public references
      • Used by a leading P&C insurer
      • Java, R, Python and Scala interfaces
  • 18. Apache Spark
      • Top-level Apache project (2/14)
      • Release 1.0.2 (8/14)
      • Distributed in-memory analytics
          • Machine learning
          • Graph analytics
          • Streaming analytics
          • Fast SQL
      • Compatible with Hadoop storage
      • Integrated with YARN
      • Scala, Python, Java interfaces (+SparkR)
      • Growing ecosystem
      • Supported in leading Hadoop distributions
  • 19. Analytic Features
      • Products compared: 0xdata H2O 2.2, Apache Giraph 1.1, Apache Mahout 0.9, Apache Spark 1.0.2, GraphLab 2.2
      • Prediction: +++ + +++
      • Dimension Reduction: + +++ + +
      • Clustering: + +++ + +++
      • Collaborative Filtering: +++ + +++
      • Text Analytics: +++ +++
      • Matrix Operations: + +++ +
      • Graph Analysis: + + +++
  • 20. Summary: Open Source
      • Giraph appears to be dead in the water
      • Mahout may be recovering from roadkill status
      • GraphLab outperforms Spark GraphX today in graph analytics
      • 0xdata H2O currently has more machine learning features than Spark MLlib and a better R interface
      • Spark catching up fast
          • More resources and distribution
          • Integrated platform for ML and graph analysis
  • 22. Alpine
      • Business user interface
      • Collaboration environment
      • Broad library of techniques
      • Strong cloud offering
      • Leverages Hadoop (multiple distros), HAWQ or Pivotal Greenplum
      • Push-down MapReduce
      • Certified on Spark
      • Small but growing customer base
  • 23. IBM SPSS Analytics Server
      • Introduced 2013
      • Serves as “back end” for SPSS Modeler
      • Uses push-down MapReduce
      • Limited analytic feature set
      • IBM supports on multiple Hadoop distros
      • Customer acceptance unknown
  • 24. Revolution Analytics ScaleR
      • ScaleR library of distributed statistics, machine learning functions
      • Tools to distribute arbitrary R functions
      • Runs in Cloudera, Hortonworks, Teradata, LSF clusters, MS HPC
      • Hadoop edition uses MapReduce push-down
      • Tools simplify installation in large clusters
      • R interface
      • Partnerships with Alteryx, Qlik, MicroStrategy, Tableau provide business interfaces
  • 25. Skytree Server
      • Georgia Tech’s FastLab project, repurposed as commercial software
      • Distributed machine learning platform
      • Very opaque about technical details
      • User interface is an API
      • Co-located in Hadoop under YARN
      • Just certified by Hortonworks
      • Customer acceptance unknown
      • No new public references in a year
      • Used by leading credit card company
  • 26. SAS High-Performance Analytics
      • Distributed in-memory analytics
      • Designed to run in special-purpose appliances (2011)
      • Repurposed to run in Hadoop (2013)
      • Co-exists poorly: cannot run SAS and MapReduce at the same time
      • Reads entire dataset into memory
      • Uses MPI to communicate among nodes
      • Requires upgrades from standard Hadoop infrastructure
      • Customer acceptance unknown
      • No public references
      • Generic success stories missing from Strata presentations
  • 27. SAS LASR Server
      • SAS’ “other” distributed in-memory platform
      • Back end for several end-user products
          • SAS Visual Analytics (2012)
          • SAS Visual Statistics (New)
          • SAS In-Memory Statistics for Hadoop (New)
      • Recently added statistics and machine learning
      • Does not read raw HDFS; data must be transformed to proprietary SASHDAT
      • Like HPA, reads entire dataset into memory
      • A 16-core, 256 GB node can load a 75 GB table
      • Runs DS2 programs, not legacy SAS programs
      • Fast, but with limited feature set
      • SAS claims 1,400 “sites” for Visual Analytics
      • Many of those are standalone boxes
  • 28. Summary: Commercial
      • Alpine’s interface is compelling to business users
      • IBM Analytics Server is a good first release
      • RRE ScaleR appeals to R users, plays well in Hadoop sandbox
      • Skytree Server: strong in prediction
      • SAS: why two competing memory-centric architectures?
  • 29. Progress
      • Spark: maturing blindingly fast
      • Rapidly expanding library of analytic features
      • Growing developer community, ecosystem
      • Commercial: from zero to many
  • 30. Interesting Questions
      • Will Mahout get a second wind?
      • Will Spark MLlib displace 0xdata?
      • Will Spark GraphX catch up to GraphLab?
      • Can Spark Streaming compete with Storm and commercial entrants?
      • How quickly will customers adopt memory-centric architecture for analytics?
      • What will Alpine and MicroStrategy do with Spark?
      • When will SAS announce a reference customer for HPA/LASR in Hadoop?
  • 33. Advanced Analytics with Big Data, Thomas W. Dinsmore
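
Slide 5 says that for aggregate models you can sample the data and work offline. One way to draw such a sample from a data set too large to hold in memory is reservoir sampling, sketched below in plain Python. The deck does not name a sampling method, so the technique, the function name `reservoir_sample`, and the synthetic "amounts" stream are all assumptions made for illustration.

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def reservoir_sample(records: Iterable[T], k: int, seed: int = 0) -> List[T]:
    """Return up to k records chosen uniformly at random from a stream.

    The stream is read once and only k records are held in memory,
    so the source can be far larger than the machine doing the sampling.
    """
    rng = random.Random(seed)
    sample: List[T] = []
    for i, record in enumerate(records):
        if i < k:
            # Fill the reservoir with the first k records.
            sample.append(record)
        else:
            # Keep the new record with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = record
    return sample


# Hypothetical stream: 100,000 synthetic amounts cycling through 0..999.
population = (float(i % 1000) for i in range(100_000))
sample = reservoir_sample(population, k=5_000)

# An aggregate statistic estimated offline from the sample alone.
estimated_mean = sum(sample) / len(sample)
```

The true mean of this synthetic stream is 499.5, and the estimate from 5,000 sampled records should land close to it, which is the sense in which an aggregate model can tolerate sampling.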
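
Slide 6 lists use cases, such as anomaly detection, where a sample may not be enough. A small calculation suggests why: rare records are easily left out of a sample altogether. The sketch below assumes simple Bernoulli sampling, in which each record is kept independently with the same probability. That sampling scheme and the function name are illustrative assumptions and do not come from the deck.

```python
def prob_sample_misses_all(rare_count: int, sample_fraction: float) -> float:
    """Probability that a Bernoulli sample contains none of the rare records.

    Each record is assumed to be kept independently with probability
    sample_fraction, so all rare_count records are dropped with
    probability (1 - sample_fraction) ** rare_count.
    """
    if not 0.0 <= sample_fraction <= 1.0:
        raise ValueError("sample_fraction must be between 0 and 1")
    return (1.0 - sample_fraction) ** rare_count


# With a 1% sample, how likely is it that every anomalous record is missed?
for rare_count in (1, 10, 100):
    p_miss = prob_sample_misses_all(rare_count, sample_fraction=0.01)
    print(f"{rare_count:>3} anomalous records: P(sample has none) = {p_miss:.3f}")
```

Under these assumptions, even 100 anomalous records are absent from a 1% sample roughly a third of the time, while an aggregate statistic such as a mean is barely affected by the same sampling rate.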
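
Slide 10 describes tools that translate user requests into MapReduce. The sketch below shows what such a translation can look like for one simple request, "mean amount by region", written as explicit map, shuffle, and reduce steps in plain Python. It is a single-process illustration of the programming model, not the output of any product named in the deck; the record fields `region` and `amount` and all function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

Record = Dict[str, object]
Pair = Tuple[str, Tuple[float, int]]


def map_phase(record: Record) -> Iterator[Pair]:
    """Map: emit (key, (partial_sum, partial_count)) for one input record."""
    yield str(record["region"]), (float(record["amount"]), 1)


def shuffle(pairs: Iterable[Pair]) -> Dict[str, List[Tuple[float, int]]]:
    """Shuffle: group all emitted values by key, as the framework would."""
    groups: Dict[str, List[Tuple[float, int]]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[Tuple[float, int]]) -> Tuple[str, float]:
    """Reduce: combine partial sums and counts into a mean for one key."""
    total = sum(v[0] for v in values)
    count = sum(v[1] for v in values)
    return key, total / count


def mean_by_region(records: Iterable[Record]) -> Dict[str, float]:
    """The user-level request, expressed as map -> shuffle -> reduce."""
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())


records = [
    {"region": "east", "amount": 10.0},
    {"region": "east", "amount": 30.0},
    {"region": "west", "amount": 5.0},
]
result = mean_by_region(records)
```

On a real cluster the map and reduce functions run next to the data on each node, which is why this approach avoids data movement. The price is the job start-up and shuffle overhead the slide lists as MapReduce latency.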
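
Slide 27 notes that a 16-core, 256 GB node can load a 75 GB table into SAS LASR Server. The back-of-envelope sketch below turns that single data point into a memory ratio and a rough node count. The linear scaling it applies is an assumption made here for illustration, not a vendor sizing rule, and the function name is hypothetical.

```python
import math

NODE_MEMORY_GB = 256.0     # from the slide
LOADABLE_TABLE_GB = 75.0   # from the slide

# Share of node memory that ends up holding table data.
usable_fraction = LOADABLE_TABLE_GB / NODE_MEMORY_GB


def nodes_needed(table_gb: float) -> int:
    """Rough node count for a table, ASSUMING capacity scales linearly
    at 75 GB of table per 256 GB node (an illustrative assumption)."""
    if table_gb <= 0:
        raise ValueError("table_gb must be positive")
    return math.ceil(table_gb / LOADABLE_TABLE_GB)


print(f"Table data per node: {usable_fraction:.0%} of memory")
print(f"Nodes for a 1,000 GB table (linear assumption): {nodes_needed(1000.0)}")
```

Under that assumption, less than a third of each node's memory holds table data, which puts a rough cost on reading the entire dataset into memory.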