Advanced Analytics and Big Data (August 2014)

Advanced Analytics with Big Data
Thomas W. Dinsmore

•What do we mean by “Big Data”?
•Do we need to use all of the data?
•What analytics can run inside Big Data platforms?

Big Data
•Data that cannot be efficiently handled in a relational database
•The three Vs:
•Volume
•Variety
•Velocity

Big Data Platforms
•Hadoop ecosystem: MapReduce, Hive, Impala, Spark, etc.
•Appliances: Teradata, IBM PureData, Pivotal, Oracle BDA, Vertica, ParAccel/Redshift, etc.
•NoSQL/NewSQL: Cassandra, MongoDB, MemSQL
•Streaming engines: InfoSphere Streams
Convergence: federated SQL engines (e.g., Pivotal HAWQ)

Analytics Platform
For aggregate models, you can simply sample the data and work offline.
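One way to draw such a sample without loading the full dataset is reservoir sampling. The sketch below is illustrative only: the function name, the seed parameter and the toy stream are assumptions, not something taken from the deck.

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def reservoir_sample(records: Iterable[T], k: int, seed: int = 0) -> List[T]:
    """Return k records drawn uniformly at random from a stream of unknown length.

    Uses the classic single-pass reservoir algorithm (Algorithm R), so the
    stream is read once and only k records are held in memory.
    """
    rng = random.Random(seed)
    sample: List[T] = []
    for i, record in enumerate(records):
        if i < k:
            # Fill the reservoir with the first k records.
            sample.append(record)
        else:
            # Keep the new record with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = record
    return sample


if __name__ == "__main__":
    # Toy stream standing in for a large table (assumption for illustration).
    stream = range(100_000)
    sample = reservoir_sample(stream, k=1_000)
    print("sample mean:", sum(sample) / len(sample))
```

The sample can then be moved to a workstation and modeled offline with any conventional tool.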

However, for some use cases you may need to use all of the data:
•Anomaly Detection
•Affinity Analysis
•Microsegmentation
•Social Network Analysis
•Collaborative Filtering
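A rough way to see why sampling can fail for use cases such as anomaly detection: if each record is kept independently with a fixed probability, a handful of rare records is easily lost altogether. The sketch below assumes simple Bernoulli sampling; the function names are hypothetical.

```python
def prob_sample_misses_all(rare_count: int, sample_fraction: float) -> float:
    """Probability that a Bernoulli sample keeps none of the rare records.

    Assumes each record is kept independently with probability
    `sample_fraction` (an assumption made for this illustration).
    """
    if rare_count < 0:
        raise ValueError("rare_count must be non-negative")
    if not 0.0 <= sample_fraction <= 1.0:
        raise ValueError("sample_fraction must be between 0 and 1")
    return (1.0 - sample_fraction) ** rare_count


def expected_rare_in_sample(rare_count: int, sample_fraction: float) -> float:
    """Expected number of rare records that survive the sampling step."""
    return rare_count * sample_fraction


if __name__ == "__main__":
    # Example: 5 anomalous records in the full dataset, 1% sample.
    print(prob_sample_misses_all(5, 0.01))
    print(expected_rare_in_sample(5, 0.01))
```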

For other use cases, using all of the data is worth the extra time and effort:
•Catastrophic Risk Modeling
•Modeling with Fine-grained Behavioral Data

[Diagram: six HDFS nodes and a "Data" flow]
Most legacy analytic packages can read HDFS files.

[Diagram: six HDFS nodes, MapReduce, and a "Data" flow]
Some tools also provide pass-through capabilities.

[Diagram: six HDFS nodes and MapReduce]
Advantages
•Co-exists with other applications
•Integrated workload management
•Simplified administration
Disadvantages
•MapReduce latency
Several tools translate user requests to MapReduce. This eliminates data movement and co-exists well with other applications.
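To make the translation idea concrete, here is a minimal sketch of how a request such as "mean amount by segment" can be expressed as map, shuffle and reduce steps. It is a pure-Python stand-in, not the Hadoop API and not any vendor's implementation; the record layout and all function names are assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical record layout: (segment, amount).
Record = Tuple[str, float]
Partial = Tuple[float, int]  # (running sum, running count)


def map_phase(block: Iterable[Record]) -> Iterator[Tuple[str, Partial]]:
    """Runs where the data block lives: emit (key, (value, 1)) pairs."""
    for segment, amount in block:
        yield segment, (amount, 1)


def shuffle(pairs: Iterable[Tuple[str, Partial]]) -> Dict[str, List[Partial]]:
    """Group mapped pairs by key, as the framework would between phases."""
    grouped: Dict[str, List[Partial]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[Partial]) -> Tuple[str, float]:
    """Combine the partial sums and counts for one key into a mean."""
    total = sum(v for v, _ in values)
    count = sum(n for _, n in values)
    return key, total / count


def mean_by_segment(blocks: List[List[Record]]) -> Dict[str, float]:
    """Translate the user request into map -> shuffle -> reduce."""
    mapped = (pair for block in blocks for pair in map_phase(block))
    return dict(reduce_phase(k, vs) for k, vs in shuffle(mapped).items())


if __name__ == "__main__":
    # Two lists standing in for two HDFS blocks (toy data).
    blocks = [[("a", 1.0), ("b", 4.0)], [("a", 3.0)]]
    print(mean_by_segment(blocks))
```

Only the small (sum, count) pairs travel between phases; the raw records stay where they are stored.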

[Diagram: YARN over six nodes, each running HDFS and MapReduce]
Advantages
•Easy to adapt legacy apps
•Isolates analytic workload
Disadvantages
•Data moves within the cluster
•Requires YARN
YARN (*) makes it possible to bypass MapReduce and run analytics in memory on dedicated nodes.
(*) Yet Another Resource Negotiator

[Diagram: YARN over six nodes, each running HDFS and MapReduce]
Advantages
•Lowest latency
Disadvantages
•Upgrade every node
•Requires YARN
Distributing in-memory analytics across the Hadoop cluster minimizes internal data movement.
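A minimal sketch of why this layout keeps data movement low: each node summarizes the partition it already holds in memory, and only those small summaries are combined. The example fits a one-variable least-squares line; it is a single-process illustration under stated assumptions, not how any product in this deck is implemented, and all names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

Point = Tuple[float, float]  # (x, y)


@dataclass
class Stats:
    """Sufficient statistics for simple linear regression."""
    n: int = 0
    sx: float = 0.0
    sy: float = 0.0
    sxx: float = 0.0
    sxy: float = 0.0

    def merge(self, other: "Stats") -> "Stats":
        # Merging is plain addition, so partitions can be combined in any order.
        return Stats(
            self.n + other.n,
            self.sx + other.sx,
            self.sy + other.sy,
            self.sxx + other.sxx,
            self.sxy + other.sxy,
        )


def local_stats(partition: Sequence[Point]) -> Stats:
    """Work done on each node, against the partition it holds in memory."""
    s = Stats()
    for x, y in partition:
        s = s.merge(Stats(1, x, y, x * x, x * y))
    return s


def fit_line(partitions: List[Sequence[Point]]) -> Tuple[float, float]:
    """Combine per-node summaries and solve for (slope, intercept)."""
    total = Stats()
    for partition in partitions:
        total = total.merge(local_stats(partition))
    denom = total.n * total.sxx - total.sx ** 2
    if denom == 0:
        raise ValueError("x values are constant or no data was supplied")
    slope = (total.n * total.sxy - total.sx * total.sy) / denom
    intercept = (total.sy - slope * total.sx) / total.n
    return slope, intercept


if __name__ == "__main__":
    # Two lists standing in for data held on two nodes (toy data).
    print(fit_line([[(0.0, 1.0), (1.0, 3.0)], [(2.0, 5.0), (3.0, 7.0)]]))
```

Each node ships five numbers regardless of how many rows it holds.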

Apache Mahout
•Apache incubator project (2007)
•Machine learning library
•Included in most distributions
•Thin acceptance, few contributors
•Diverse architecture: single-node and MapReduce; new algos run on Spark
•Recently cleaned up

Apache Giraph
•Apache top-level project
•Runs in MapReduce
•Dedicated graph engine
•Used by Facebook, few others
•Dead in the water
•No presence in leading distros
•No significant commercial support
•No releases in 13 months
•No recent code commits on Git

GraphLab
•Carnegie Mellon project (2009)
•Distributed in-memory engine:
•Primarily graph analysis
•Selected machine learning algos
•Interface from Java, JavaScript, Python
•GraphLab Inc provides commercial support (2013, $6.75MM)
•Independent distribution, or through Pivotal
•Minimal development effort in the past six months

0xdata H2O
•Vendor-driven open source project
•0xdata sells support, customization
•Distributed in-memory prediction engine
•Multiple deployment options:
•Standalone (with HDFS)
•Over YARN
•In MapReduce
•Claims 2,000+ users
•4 public references
•Used by a leading P&C insurer
•Java, R, Python and Scala interfaces

Apache Spark
•Top-level Apache project (2/14)
•Release 1.0.2 (8/14)
•Distributed in-memory analytics
•Machine learning
•Graph analytics
•Streaming analytics
•Fast SQL
•Compatible with Hadoop storage
•Integrated with YARN
•Scala, Python, Java interfaces (+SparkR)
•Growing ecosystem
•Supported in leading Hadoop distributions

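Analytic Features
Products compared, in column order: 0xdata H2O 2.2, Apache Giraph 1.1, Apache Mahout 0.9, Apache Spark 1.0.2, GraphLab 2.2
Ratings per feature, listed left to right (blank cells omitted, so column positions are not shown):
•Prediction: +++, +, +++
•Dimension Reduction: +, +++, +, +
•Clustering: +, +++, +, +++
•Collaborative Filtering: +++, +, +++
•Text Analytics: +++, +++
•Matrix Operations: +, +++, +
•Graph Analysis: +, +, +++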

Summary: Open Source
•Giraph appears to be dead in the water
•Mahout may be recovering from roadkill status
•GraphLab outperforms Spark GraphX today in graph analytics
•0xdata H2O currently has more machine learning features than Spark MLlib and a better R interface
•Spark catching up fast
•More resources and distribution
•Integrated platform for ML and graph analysis

Alpine
•Business user interface
•Collaboration environment
•Broad library of techniques
•Strong cloud offering
•Leverages Hadoop (multiple distros), HAWQ or Pivotal Greenplum
•Push-down MapReduce
•Certified on Spark
•Small but growing customer base

IBM SPSS Analytics Server
•Introduced 2013
•Serves as “back end” for SPSS Modeler
•Uses push-down MapReduce (MR)
•Limited analytic feature set
•IBM supports on multiple Hadoop distros
•Customer acceptance unknown

Revolution Analytics ScaleR
•ScaleR library of distributed statistics, machine learning functions
•Tools to distribute arbitrary R functions
•Runs in Cloudera, Hortonworks, Teradata, LSF clusters, MS HPC
•Hadoop edition uses MR push-down
•Tools simplify installation in large clusters
•R interface
•Partnerships with Alteryx, Qlik, MicroStrategy, Tableau provide business interfaces

Skytree Server
•Georgia Tech’s FastLab project, repurposed as commercial software
•Distributed machine learning platform
•Very opaque about technical details
•User interface is an API
•Co-located in Hadoop under YARN
•Just certified by Hortonworks
•No new public references in a year
•Used by leading credit card company

SAS High-Performance Analytics
•Distributed in-memory analytics
•Designed to run in special-purpose appliances (2011)
•Repurposed to run in Hadoop (2013)
•Co-exists poorly — cannot run SAS and MapReduce at the same time
•Reads entire dataset into memory
•Uses MPI to communicate among nodes
•Requires upgrades from standard Hadoop infrastructure
•No public references
•Generic success stories missing from Strata presentations

SAS LASR Server
•SAS’ “other” distributed in-memory platform
•Back end for several end-user products
•SAS Visual Analytics (2012)
•SAS Visual Statistics (New)
•SAS In-Memory Statistics for Hadoop (New)
•Recently added statistics and machine learning
•Does not read raw HDFS; must be transformed to proprietary SASHDAT
•Like HPA, reads entire dataset into memory
•A 16-core, 256 GB node can load a 75 GB table
•Runs DS2 programs, not legacy SAS programs
•Fast, but with limited feature set
•SAS claims 1,400 “sites” for Visual Analytics
•Many of those are standalone boxes
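A back-of-envelope sizing sketch based on the one data point above (a 256 GB node loading a 75 GB table). It assumes, purely for illustration, that the loadable share of RAM stays constant and that capacity scales linearly with node count; neither assumption comes from SAS, and the function names are hypothetical.

```python
import math

# The single reference point quoted on the slide.
REFERENCE_NODE_RAM_GB = 256.0
REFERENCE_TABLE_GB = 75.0


def loadable_fraction() -> float:
    """Share of node RAM usable for table data, per the reference point."""
    return REFERENCE_TABLE_GB / REFERENCE_NODE_RAM_GB


def nodes_needed(table_gb: float, node_ram_gb: float = REFERENCE_NODE_RAM_GB) -> int:
    """Estimate node count for a table, ASSUMING linear scaling (not vendor guidance)."""
    if table_gb <= 0 or node_ram_gb <= 0:
        raise ValueError("sizes must be positive")
    per_node_gb = node_ram_gb * loadable_fraction()
    return math.ceil(table_gb / per_node_gb)


if __name__ == "__main__":
    print(f"loadable fraction of RAM: {loadable_fraction():.2f}")
    print("nodes for a 1,000 GB table:", nodes_needed(1000.0))
```

Under these assumptions, less than a third of each node's memory holds table data, which is worth keeping in mind when an architecture reads the entire dataset into memory.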

Summary: Commercial
•Alpine’s interface is compelling to business users
•IBM Analytics Server is a good first release
•RRE ScaleR appeals to R users, plays well in Hadoop sandbox
•Skytree Server: strong in prediction
•SAS: why two competing memory-centric architectures?

Progress
•Spark: blindingly fast maturity
•Rapidly expanding library of analytic features
•Growing developer community, ecosystem
•Commercial: from zero to many

Interesting Questions
•Will Mahout get a second wind?
•Will Spark MLlib displace 0xdata?
•Will Spark GraphX catch up to GraphLab?
•Can Spark Streaming compete with Storm and commercial entrants?
•How quickly will customers adopt memory-centric architecture for analytics?
•What will Alpine and MicroStrategy do with Spark?
•When will SAS announce a reference customer for HPA/LASR in Hadoop?
