The Analytics Frontier of the Hadoop Eco-System

Ted Willke
Senior Principal Engineer and GM

•Scalable commodity processing established with Hadoop MapReduce, with good libraries for machine learning and data mining
•Twitter libraries like Scalding improve upon MapReduce, providing a more generalized dataflow model
•YARN opened the door to in-memory iterative processing with Apache Spark, which has its own libraries, with others being ported
Today | Hadoop Analytics

•Variety - Expansion of data primitives in commercial use
•Speed - Data processing models evolving (batch -> streaming)
•Complexity – Monolithic analytics -> analytics pipelines
•Intelligence – Prescriptive ML -> Applying ML to ML itself
•Ease of Use – Gap between skills and needs growing
Trends | Hadoop Eco-System Analytics

Life Sciences
Personalized medicine, drug repurposing predictions, integration of heterogeneous data
Education
Personalized instruction, outcomes measurement and intervention
Network Security
Data fusion, threat assessment and identification
Retail
Inventory management, product display management, demand forecasting
Trends | Areas of Application (to name a few)

Variety | Primitives Usage Patterns
Model: Key-Value, Document, Graph, Column, SQL
Access: Sync (I/O), Async (Bus), Off-line (Queue)
Implementation: API (Remote), LIB (Local)

•When the problem is an information network
•When a graph is a natural way of expressing the algorithm
•When you want to study specific relationships
•When you want faster machine learning or solvers on sparse data
shortest path
central influence
sub networks
triangle count
Variety | Graphical Model
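
The four graph questions named on this slide (shortest path, central influence, sub networks, triangle count) can be illustrated on a toy graph. This is a minimal sketch in plain Python, not taken from the talk: the graph, the function names, and the use of vertex degree as a stand-in for "central influence" are all assumptions made for illustration.

```python
from collections import deque

# Hypothetical toy information network (undirected, adjacency sets).
GRAPH = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b"},
    "d": {"b", "e"},
    "e": {"d"},
    "x": {"y"},
    "y": {"x"},
}


def shortest_path(graph, src, dst):
    """Breadth-first search; returns one shortest path as a vertex list, or None."""
    parent = {src: None}
    queue = deque([src])
    while queue:
        v = queue.popleft()
        if v == dst:
            path = []
            while v is not None:
                path.append(v)
                v = parent[v]
            return path[::-1]
        for w in sorted(graph[v]):
            if w not in parent:
                parent[w] = v
                queue.append(w)
    return None


def degree_centrality(graph):
    """Vertex degree, used here as a crude proxy for central influence."""
    return {v: len(nbrs) for v, nbrs in graph.items()}


def sub_networks(graph):
    """Connected components found by depth-first search."""
    seen, components = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        comp, stack = set(), [start]
        while stack:
            v = stack.pop()
            if v in comp:
                continue
            comp.add(v)
            stack.extend(graph[v] - comp)
        seen |= comp
        components.append(comp)
    return components


def triangle_count(graph):
    """Count each triangle once by requiring u < v < w."""
    return sum(
        1
        for u in graph
        for v in graph[u] if u < v
        for w in graph[u] & graph[v] if v < w
    )
```

On this toy graph, `shortest_path(GRAPH, "a", "e")` walks a, b, d, e; there are two sub networks and one triangle (a, b, c).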

[Figure: graph of channel viewing behavior, showing current popular surfing patterns. Legend: program importance (centrality), from low to high. Example program IDs: SH002463130000, EP005544723744.]
Changes in surfing behavior may predict customer churn.
Variety | Graph Statistics

Preference and Similarity Recommendations
[Figure: graph of 1.7MM nodes and 23.9MM edges. Node types: User, Movie. Edge types: prefers, similar cast, similar topic.]
Example nodes:
userId: A0A22A5
title: The Godfather, genre: Crime drama, cast: [M. Brando, Al Pacino]
title: Scarface, genre: Crime drama, cast: [Al Pacino, M. Pfeiffer]
title: The Departed, genre: Crime drama, cast: [L. DiCaprio, M. Damon]
Example edge weights: 11.8, 0.67, 0.03, 14.98
Variety | Graph Search

Inputs: URL Ground-Truth Data, IP/Domain Reputations
420MM Records -> 74.5MM Nodes, 185MM Edges
Node types: URL, Domain, IP Address
Steps: Calculation of priors, LBP Messaging
Example nodes: 84.231.82.93, 86.39.155.137, forum.vsichko.com, hermansonskok.se, euskzzbz.nonetheups.com, keesenbep.spaces.live.com
Variety | Graphical Machine Learning

Variety | Loopy Belief Propagation on the (semantic) web
Reputations
Neutral
Good
Bad
Suspect
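
The two steps named on the previous slide, calculation of priors and LBP messaging, can be sketched on a tiny graph. This is a generic sum-product loopy belief propagation sketch with two states, not the implementation behind the talk. The node names (documentation-reserved example.com and 192.0.2.x), the prior values, and the homophily strength `same=0.8` are assumptions chosen only for illustration.

```python
STATES = ("good", "bad")


def loopy_bp(edges, priors, same=0.8, iters=20):
    """Synchronous sum-product LBP on an undirected graph with binary states.

    `same` is the assumed edge potential for neighbors sharing a state
    (homophily); `1 - same` is used when they differ.
    """
    nbrs = {v: set() for v in priors}
    for u, v in edges:
        nbrs[u].add(v)
        nbrs[v].add(u)

    # Messages start uniform: msgs[(i, j)][s] is i's message about j being in state s.
    msgs = {(u, v): {s: 0.5 for s in STATES} for u in nbrs for v in nbrs[u]}

    for _ in range(iters):
        new = {}
        for (i, j) in msgs:
            out = {}
            for xj in STATES:
                total = 0.0
                for xi in STATES:
                    term = priors[i][xi] * (same if xi == xj else 1.0 - same)
                    for k in nbrs[i]:
                        if k != j:          # exclude the recipient's own message
                            term *= msgs[(k, i)][xi]
                    total += term
                out[xj] = total
            z = sum(out.values())
            new[(i, j)] = {s: out[s] / z for s in STATES}
        msgs = new

    beliefs = {}
    for v in nbrs:
        b = {s: priors[v][s] for s in STATES}
        for k in nbrs[v]:
            for s in STATES:
                b[s] *= msgs[(k, v)][s]
        z = sum(b.values())
        beliefs[v] = {s: b[s] / z for s in STATES}
    return beliefs


# Hypothetical ground truth: one known-bad URL, one known-good URL.
PRIORS = {
    "url:bad.example.com/x": {"good": 0.05, "bad": 0.95},
    "domain:bad.example.com": {"good": 0.5, "bad": 0.5},
    "ip:192.0.2.1": {"good": 0.5, "bad": 0.5},
    "url:ok.example.com/": {"good": 0.95, "bad": 0.05},
    "domain:ok.example.com": {"good": 0.5, "bad": 0.5},
}
EDGES = [
    ("url:bad.example.com/x", "domain:bad.example.com"),
    ("url:bad.example.com/x", "ip:192.0.2.1"),
    ("domain:bad.example.com", "ip:192.0.2.1"),   # closes a loop
    ("url:ok.example.com/", "domain:ok.example.com"),
]
BELIEFS = loopy_bp(EDGES, PRIORS)
```

In this sketch the unlabeled domain and IP attached to the known-bad URL end up leaning "bad", and the domain attached to the known-good URL leans "good". Mapping beliefs onto the four reputation labels on this slide would need thresholds that the slides do not give.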

Variety | Unification with Apache Spark
Image Source: Databricks
•In-memory structures (RDDs) support both table and graph abstractions
•Batch processing and Spark streaming
Spark core: RDDs, Transformations, and Actions
Spark Streaming (real-time): DStreams, i.e., streams of RDDs
Spark SQL: SchemaRDDs
MLlib (machine learning): RDD-Based Matrices
GraphX (graph processing/machine learning): RDD-Based Graphs

Source: Marcus Paradies, GRAph Data-management & ExperienceS Workshop (GRADES 2014)
Variety | Unification within the In-Memory Database (IMDB)
•Index data structure for graph traversal
•Prototyped in SAP HANA distributed columnar IMDB
•Lays foundation for complex graph query and algorithms

Variety | Graph Traversal

Variety | Graph Indexing

Variety | Graph Traversal Results

[Diagram labels: Thing, Gateway, Network, Datacenter, Cloud Infrastructure, Data Platform, Analytics Platform, Services, UI]
Speed | Hadoop Meets The Internet of Things

Source: http://spark.apache.org/docs/1.0.0/streaming-programming-guide.html, accessed on 9/26/2014
Data Stream
Feature Processing
Model Updates
Learning
Distributed Messaging System (e.g., Kafka)
Speed | Stream Processing Pipeline

•Data replay (e.g., a bug is found or application improved)
•Getting faster and more efficient than “fast batch”
•Time-evolving models and computation
Speed | Challenges

Source: Jay Kreps, http://radar.oreilly.com/2014/07/questioning-the-lambda-architecture.html, accessed on 9/26/2014
Lambda
•Implement transform logic twice
•Federate information at query time
•Retains input data unchanged
Kappa
•Retain full replay window
•2nd instance can re-process
•Query against latest table
The thinking continues to evolve...
Speed | Cluster-Scale Stream Processing
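
The Kappa bullets (retain the full replay window, let a second instance re-process, query against the latest table) can be sketched with an in-memory list standing in for the retained log. Everything here is hypothetical and for illustration only: the log contents, the two transform versions, and the `TABLES`/`SERVING` names are assumptions, and no real messaging system is involved.

```python
# Retained input: an append-only log that is never modified in place.
LOG = ["Spark", "spark", "Hadoop"]


def run_job(log, transform):
    """Replay the whole log from offset 0 and build an output table of counts."""
    table = {}
    for record in log:
        key = transform(record)
        table[key] = table.get(key, 0) + 1
    return table


def transform_v1(record):
    # First version of the logic: case-sensitive keys (later judged to be a bug).
    return record


def transform_v2(record):
    # Improved logic: normalize case before counting.
    return record.lower()


# The first instance builds table v1; after the fix, a second instance
# re-processes the same retained log into a new table.
TABLES = {"v1": run_job(LOG, transform_v1)}
TABLES["v2"] = run_job(LOG, transform_v2)

# Queries are pointed at the latest table once the second instance has caught up.
SERVING = "v2"


def query(word):
    return TABLES[SERVING].get(word, 0)
```

The transform logic exists once per version rather than once per batch path and once per streaming path, which is the contrast with Lambda drawn on this slide.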

Apply DStreams to built-in:
•Machine Learning
•Graph Processing
Speed | Spark Streaming

•Mini-batch +/- windowing
•Analytics can be run on any of the resultant RDDs
•No provisions for merging RDDs
Speed | Spark (Discretized) Streaming
(Mini-) batch Streaming -> (Mini-) batch Analytics
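
Mini-batching with optional windowing can be sketched without Spark. The code below is plain Python and does not use the Spark Streaming API; `discretize`, `windowed`, the batch interval, and the sample events are assumptions introduced for illustration.

```python
from collections import defaultdict, deque


def discretize(events, batch_seconds):
    """Group (timestamp, value) events into consecutive mini-batches."""
    batches = defaultdict(list)
    for ts, value in events:
        batches[int(ts // batch_seconds)].append(value)
    if not batches:
        return []
    first, last = min(batches), max(batches)
    # Intervals with no events still produce an (empty) mini-batch.
    return [batches.get(i, []) for i in range(first, last + 1)]


def windowed(batches, window_len, slide=1):
    """Yield the contents of the last `window_len` mini-batches every `slide` batches."""
    window = deque(maxlen=window_len)
    for i, batch in enumerate(batches, start=1):
        window.append(batch)
        if i % slide == 0:
            yield [v for b in window for v in b]


# Hypothetical stream of (seconds, value) events.
EVENTS = [(0.2, 1), (0.7, 2), (1.1, 3), (2.5, 4), (2.9, 5), (3.0, 6)]
BATCHES = discretize(EVENTS, batch_seconds=1)

# Any analytic can run on each resulting batch or window; a sum is used here.
WINDOW_SUMS = [sum(w) for w in windowed(BATCHES, window_len=2)]
```

Each window is recomputed from its raw mini-batches; nothing carries model state from one window to the next, which is the limitation the "No provisions for merging RDDs" bullet points at.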

Image Source: GraphX project
•Graph processing engine on Spark
•Supports Pregel-style vertex programming
•View same data as either graphs or collections
Speed | GraphX API for Spark

•Current Spark streaming provides mini-batch streaming
•No concept of data (model) merging
•GraphX is currently designed for static graphs:
1.Merge table data prior to graph pipeline
2.Re-generate entire (accumulated) graph
3.Re-run machine learning at each window
Speed | Spark Streaming for GraphX?
Straightforward, but wastes computation and time. Can we do better?

•Merge information directly into data model used by algorithms
•Static algorithms -> Online algorithms
•Incremental re-computation triggered by changes in data values or data structure
•Possible with many machine learning algos (PageRank as example)
•Evolve IM data stores to maximize performance and freshness
•Better partitioning algorithms -> reduced data replication
•Dynamic indexing -> fast retrieval
Speed | Online Version of GraphX

Static PageRank (delta method)
Online PageRank
Speed | Online PageRank
Good for algos with abelian accumulators (commutative, associative, with inverse)
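
One way to see why an accumulator with an inverse matters: when an edge arrives, the source vertex's contribution to its existing neighbors shrinks, so a negative correction has to be merged into state that is already accumulated. The sketch below is a residual-push formulation of unnormalized PageRank, rank(v) = reset + (1 - reset) * sum over in-neighbors u of rank(u)/outdeg(u). It is an assumption-laden illustration, not the graph-streaming code measured in the talk; the class and function names are hypothetical.

```python
from collections import defaultdict


class OnlinePageRank:
    """Keeps rank and residual per vertex; residual = reset + damped in-flow - rank."""

    def __init__(self, reset=0.15, tol=1e-9):
        self.reset, self.tol = reset, tol
        self.out = defaultdict(set)
        self.rank, self.residual = {}, {}

    def _touch(self, v):
        if v not in self.rank:
            self.rank[v] = 0.0
            self.residual[v] = self.reset   # a new vertex still owes its reset mass

    def add_edge(self, u, w):
        self._touch(u)
        self._touch(w)
        if w in self.out[u]:
            return
        damp, k = 1.0 - self.reset, len(self.out[u])
        if k:
            # Existing out-neighbors now receive rank/(k+1) instead of rank/k:
            # a negative correction, which is where the inverse is needed.
            shrink = damp * self.rank[u] * (1.0 / (k + 1) - 1.0 / k)
            for z in self.out[u]:
                self.residual[z] += shrink
        self.out[u].add(w)
        self.residual[w] += damp * self.rank[u] / (k + 1)
        self._push()

    def _push(self):
        damp = 1.0 - self.reset
        work = [v for v, r in self.residual.items() if abs(r) > self.tol]
        while work:
            v = work.pop()
            delta = self.residual[v]
            if abs(delta) <= self.tol:
                continue
            self.rank[v] += delta
            self.residual[v] = 0.0
            nbrs = self.out[v]
            for w in nbrs:
                self.residual[w] += damp * delta / len(nbrs)
                if abs(self.residual[w]) > self.tol:
                    work.append(w)


def static_pagerank(edges, reset=0.15, iters=200):
    """From-scratch power iteration on the final graph, for comparison."""
    out, nodes = defaultdict(set), set()
    for u, w in edges:
        out[u].add(w)
        nodes.update((u, w))
    rank = {v: reset for v in nodes}
    for _ in range(iters):
        nxt = {v: reset for v in nodes}
        for u in nodes:
            for w in out[u]:
                nxt[w] += (1.0 - reset) * rank[u] / len(out[u])
        rank = nxt
    return rank


EDGES = [("a", "b"), ("b", "c"), ("c", "a"), ("a", "c"), ("d", "a")]
ONLINE = OnlinePageRank()
for edge in EDGES:
    ONLINE.add_edge(*edge)      # ranks are kept current after every arrival
STATIC = static_pagerank(EDGES)
```

After all edges have streamed in, the online ranks agree with the from-scratch result to within the push tolerance, while each arrival only touches vertices whose residual actually changed.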

[Chart: Convergence Rate. Y axis: normalized execution time (0.0 to 1.2). X axis: throughput, 50K to 1M edges/second. Series: naive, incremental.]
[Chart: Communication Overhead. Y axis: normalized messages sent (0% to 120%). X axis: throughput, 50K to 1M edges/second. Series: naive, incremental.]
•Algorithm: PageRank
•Reset probability: 0.15
•Convergence Threshold: 0.001
•1 Master + 3 workers
•Distributed Messaging System: Kafka
•Spark 1.1.0 + our graph streaming
Speed | (Really) Early Results for Online PageRank

Complexity | Challenges
•Feature Engineering for Data Science
•Monolithic Analytics -> Complex Pipeline Analytics

Complexity | Directed Acyclic Graphs of Actions
•Common in Data Science “feature engineering”
•Developed iteratively
•Becomes a new tool in the toolbox
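
A feature-engineering pipeline expressed as a directed acyclic graph of actions can be sketched with the standard-library `graphlib` module (Python 3.9+). The action names and the toy data are hypothetical; the only library call relied on is `TopologicalSorter(...).static_order()`.

```python
from graphlib import TopologicalSorter


# Hypothetical actions; each receives a dict of its upstream results.
def load(inputs):
    return [" Spark ", "hadoop", "SPARK"]


def clean(inputs):
    return [s.strip().lower() for s in inputs["load"]]


def lengths(inputs):
    return [len(s) for s in inputs["clean"]]


def vocab(inputs):
    return sorted(set(inputs["clean"]))


def features(inputs):
    index = {word: i for i, word in enumerate(inputs["vocab"])}
    return [(index[s], n) for s, n in zip(inputs["clean"], inputs["lengths"])]


ACTIONS = {"load": load, "clean": clean, "lengths": lengths,
           "vocab": vocab, "features": features}

# Each action maps to the set of actions it depends on.
DEPENDS_ON = {
    "load": set(),
    "clean": {"load"},
    "lengths": {"clean"},
    "vocab": {"clean"},
    "features": {"clean", "lengths", "vocab"},
}


def run_dag(actions, depends_on):
    """Run every action after all of its dependencies have produced results."""
    results = {}
    for name in TopologicalSorter(depends_on).static_order():
        inputs = {dep: results[dep] for dep in depends_on[name]}
        results[name] = actions[name](inputs)
    return results


RESULTS = run_dag(ACTIONS, DEPENDS_ON)
```

Wrapping `run_dag(ACTIONS, DEPENDS_ON)` in a function of its own turns the whole DAG into a single reusable action, which is one reading of the "new tool in the toolbox" bullet.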

Source: ISTC-Pervasive Computing
Discriminative structures come at multiple scales and varying deformations
Complexity | Hierarchical Matching Pursuit for Image Classification
•Feature learning
•Multiple layers to learn
•Multipath sparse coding

Source: ISTC-Pervasive Computing
•Robustness
–Local deformations such as translation, rotation and scaling
–Lighting condition changes
–Viewpoint and pose changes
–Large intra-class variations
•Hierarchy
–Sparse data: The total number of possible image patches grows exponentially with their sizes
–Shared structure: Large patches could share similar or even same small patches
Complexity | Robust Hierarchical Representations?

Source: Bo, Ren, & Fox, “Multipath Sparse Coding Using Hierarchical Pursuit,” IEEE CVPR 2013
Complexity | Object Recognition on Caltech 256 Benchmark
#Training Images    15      30      45      60
Local NBB [1]       33.5    40.1    -       -
LLC [2]             34.4    41.2    45.3    47.7
CRBM [3]            35.1    42.1    45.7    47.9
LASERC [4]          35.2    43.6    -       -
LP-beta [5]         -       45.8    -       -
Our Work            41.1    48.7    52.8    56.2
[1] S. McCann and D. Lowe, CVPR 12
[2] J. Wang et al., CVPR 10
[3] K. Sohn et al., ICCV 11
[4] K. Nguyen et al., ECCV 12
[5] P. Gehler and S. Nowozin, ICCV 09
Much better than the state of the art
(especially when given more data)

Source: Intel Collaborative Research Institute for Computational Intelligence (ICRI-CI)
[Diagram: Distributed Deep Learning Library on Spark and Hadoop, running across a cluster of IA/MIC nodes]
Complexity | Deep Learning Library for Spark
•Open source
•Spark MLlib contribution
•Optimized for IA
Complete POC in 2015

•Selecting the right data to process
•Selecting the right features to engineer
•Selecting the right algorithm to run
Intelligence | Challenges

Image Source: University of Nebraska-Lincoln
Intelligence | Ensemble Learning (Wisdom of the Crowd)
Trade computational power for automated experimentation
•Tackles the data and algorithm selection problem
•Diversification methods vary
–Bagging
–Boosting
•Combining techniques vary
–Majority vote on label
–Bucket of models
•N bagged predictors -> N times the computation
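
Bagging combined by majority vote can be sketched in a few lines. This is a toy illustration, not a production ensemble: the one-dimensional threshold "stump", the synthetic data, and the fixed seed are all assumptions. Training N stumps costs N times the work of training one, as the last bullet notes.

```python
import random
from collections import Counter


def train_stump(sample):
    """1-D threshold classifier: midpoint between the largest class-0 x
    and the smallest class-1 x in the (bootstrap) sample."""
    zeros = [x for x, y in sample if y == 0]
    ones = [x for x, y in sample if y == 1]
    if not zeros or not ones:            # degenerate sample with a single class
        label = 1 if ones else 0
        return lambda x: label
    threshold = (max(zeros) + min(ones)) / 2
    return lambda x: int(x > threshold)


def bagging(data, n_models, seed=0):
    """Train one stump per bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    return [
        train_stump([rng.choice(data) for _ in data])
        for _ in range(n_models)
    ]


def majority_vote(models, x):
    """Combine the predictors by taking the most common predicted label."""
    return Counter(m(x) for m in models).most_common(1)[0][0]


# Hypothetical separable data: label is 1 when x >= 10.
DATA = [(x, int(x >= 10)) for x in range(20)]
MODELS = bagging(DATA, n_models=11)
```

Points far from the class boundary, such as `majority_vote(MODELS, 2)` and `majority_vote(MODELS, 17)`, get the same label from every well-formed stump; the vote only matters for points near the boundary, where individual bootstrap samples disagree.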

Intelligence | Beyond Ensemble Learning
•Downsides of ensemble learning include the number of:
•Tunable parameters
•Selection criteria
•Companies claim that non-parametric methods that require no selection of criteria are in development
For now, it’s the Wisdom of the Crowd. Stay tuned!

Workflow stages: Ingest & Clean; Engineer Features; Structure Model; Train Model; Query & Analyze; Learn; Visualize
•Skills shortage at intersection of systems engineering and data analysis
•Painful data ingestion and preparation
•Tools that are not designed with loopbacks in mind
•Pipeline state not easy to manage, especially for collaboration
•Composing pipeline is DIY
Ease of Use | Data Science Workflow

Congratulations! You are a data scientist!

Decomposing the “data scientist”
Source: 2013 Report from Accenture Institute for High Performance

Source: http://www.cloudera.com/content/cloudera-content/cloudera-docs/HadoopTutorial/CDH4/Hadoop-Tutorial/ht_wordcount1_source.html, accessed on 9/30/2014
Ease of Use | Programming Languages
WordCount: The “Hello World” for Big Data
In Java MapReduce
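
For comparison, the map, shuffle, and reduce steps of WordCount can be sketched in plain Python. This is not the Cloudera Java listing the slide cites and involves no Hadoop; the function names and the whitespace tokenization are assumptions for illustration.

```python
from collections import defaultdict


def map_phase(lines):
    """Emit a (word, 1) pair for every whitespace-separated token."""
    for line in lines:
        for word in line.split():
            yield word, 1


def shuffle(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the grouped counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}


def word_count(lines):
    return reduce_phase(shuffle(map_phase(lines)))
```

For example, `word_count(["hello big data", "hello hadoop"])` counts "hello" twice and every other word once.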

Python
R
Dataflow GUI
...
Datacenter / Cloud
Network
Client
“Data Science” API
Connect
Manage
Secure
Analyze distributed and parallel
Manage
Secure
Connect
Analyze local
Query
Big Data Java/Scala/C++ Computational Frameworks
Big Data Algorithms
Cluster Workload Mgmt
Cluster Storage
Machine Learning & Statistics
Data Wrangling
Analyst Skills
The Other Skills
Ease of Use | Making Big Data Familiar

•One consistent API for:
–ETL & feature engineering
–Including Spark and whatever comes next
–Graph construction, databases, analytics, query
–Same API for Titan, Neo4j, etc.
–Same API for Giraph, GraphX, GraphLab, etc.
–Machine learning & statistical analytics
•Programming language integration
•Extensibility at the core
Ease of Use | API Functionality

POST https://site.com/joe/graphs/29/transforms
{
  "operation": "ml.cgd",
  "arguments": [{
    "edge_properties": ["rating"],
    "output_property_prefix": "cgd_",
    "vertex_type": "vertex_type",
    "edge_type": "splits",
    "max_supersteps": 20,
    "feature_dimension": 3,
    "convergence_threshold": 0,
    "cgd_lambda": 0.65,
    "learning_output_interval": 1,
    "bias_on": true,
    "num_iters": 3
  }]
}
Ease of Use | Run CGD (REST Call)

201 Created
{
  "operation": "ml.cgd",
  "arguments": (same as request),
  "id": 2,
  "created": "2014-01-31 10:51:05.1234",
  "depends_on": [{
    "link": {"method": "GET", "uri": "https://site.com/v1/graphs/29/transforms/1"},
    "type": "graphbuilder",
    "started": "2014-01-31 10:51:02.8899",
    "eta": null,
    "status": "pending"
  }],
  "links": [
    {"rel": "self", "method": "GET",
     "uri": "https://site.com/v1/graphs/29/transforms/2"},
    {"rel": "intel:idpat-progress", "method": "GET",
     "uri": "https://site.com/v1/graphs/29/transforms/2/progress"},
    {"rel": "intel:idpat-cancel", "method": "DELETE",
     "uri": "https://site.com/user/joe/graphs/29/transforms/2"}
  ]
}
Ease of Use | Run CGD (REST Response)
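
A client for the request and response shown on these two slides might look roughly like the following. This is a sketch using only the Python standard library; it builds the request object without sending it, since `site.com` is the slides' placeholder host. The helper names `build_cgd_request` and `find_link` are hypothetical, and the sample response reproduces only the `links` part of the slide.

```python
import json
import urllib.request


def build_cgd_request(base_url, graph_id, arguments):
    """Build (but do not send) the POST that submits an ml.cgd transform."""
    body = json.dumps({"operation": "ml.cgd", "arguments": [arguments]})
    return urllib.request.Request(
        f"{base_url}/graphs/{graph_id}/transforms",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def find_link(response, rel):
    """Return (method, uri) for the link with the given rel, or None."""
    for link in response.get("links", []):
        if link.get("rel") == rel:
            return link["method"], link["uri"]
    return None


REQUEST = build_cgd_request(
    "https://site.com/joe",
    29,
    {"edge_properties": ["rating"], "max_supersteps": 20, "cgd_lambda": 0.65},
)

# Links copied from the response slide; other fields omitted for brevity.
SAMPLE_RESPONSE = {
    "id": 2,
    "links": [
        {"rel": "self", "method": "GET",
         "uri": "https://site.com/v1/graphs/29/transforms/2"},
        {"rel": "intel:idpat-progress", "method": "GET",
         "uri": "https://site.com/v1/graphs/29/transforms/2/progress"},
        {"rel": "intel:idpat-cancel", "method": "DELETE",
         "uri": "https://site.com/user/joe/graphs/29/transforms/2"},
    ],
}
```

Looking links up by `rel` rather than constructing URLs by hand means a client can poll progress or cancel the job using whatever URIs the service returns.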

FILESYSTEMS AND NOSQL STORAGE
HW PLATFORM
APACHE HADOOP
APACHE SPARK
DATA WRANGLING
MACHINE LEARNING AND STATISTICS
Graphical Algorithms
Classical Algorithms
Graph Construction Tools
Useful String Manipulation
Useful Math Operators
“DATA SCIENCE” API
Intel Analytics Toolkit
Ease of Use | Delivering It
Unified UIs across the workflow
Easier feature & model creation
End-to-end graph pipeline
Fully scalable throughout
Multiple data primitives
Optimized for IA
Cloud & On-Prem
Python
Libraries
3rd Party GUIs/SDKs
Viz
Tools
Future Libraries
BI Connectors
Query Interfaces
...

Approach (label on slide): Graph
Algorithm                                     Category                  Applications/Use Cases
Loopy Belief Propagation (LBP)                Structured Prediction     Personalized recs, image de-noising
Label Propagation                             Structured Prediction     Personalized recommendations
Alternating Least Squares (ALS)               Collaborative Filtering   Recommenders
Conjugate Gradient Descent (CGD)              Collaborative Filtering   Recommenders
Connected Components                          Graph Analytics           Network manipulation, image analysis
Latent Dirichlet Allocation (LDA)             Topic Modeling            Document Clustering
Structure Attribute                           Clustering                Network analysis, consumer seg
K-Truss                                       Clustering                Social network analysis
KNN*                                          Clustering                Recommenders
Logistic Regression*                          Classification            Fraud detection
Random Forest*                                Classification            Fraud detection, consumer seg
Generalized Linear Model (Binomial, Poisson)  Non-linear Curve Fitting  Forecasting, pricing, market mix models
Association Rule Mining                       Data Mining               Market basket analysis, recommenders
Frequent Pattern Mining*                      Data Mining               Pattern Recognition
Ease of Use | A Full Spectrum of Analytics

Real Time Database
BQL – BigDAWG Query Language & Compiler
Analytics Libraries
Hardware Platforms
Applications, Visualization, Languages
“Narrow waist” provides portability
Historical / Analytics Databases
Spill
Stream
Ease of Use | Future Vision – BigDAWG

Ease of Use | Future Vision – BigDAWG
Real Time DBMSs
BQL – BigDAWG Query Language & Compiler
Visualization & Presentation, e.g., ScalaR, imMens, TweetMap, Prefetching
Languages, e.g., Julia, R, MLbase, GraphLab
SciDB
Analytics, e.g., ScaLAPACK, ML algos, plsh, other analytics packages
TupleWare
Hardware Platforms, e.g., NVM simulator, Xeon Phi, Xeon
Applications, e.g., medical data, astronomy, Twitter, urban sensing, IoT
TileDB
S-Store
“Narrow waist” provides portability
MyriaX
Historical / Analytics DBMSs
Spill
Stream

Ease of Use | BigDAWG Deliverables ‘15-’16
•Complete prototype “big data” stack and reference implementation
•Battle-tested on multiple use cases
•Standard federation language (BQL)
•Next-generation interface for analytics
•Next-generation stream processing system
Stay Tuned!
http://istc-bigdata.org/

1.Big Data Visualization, especially graph*
2.Big Data DB that supports relational and graph equally*
3.A better workflow manager (Like Oozie for Hadoop, Spark, etc.)*
4.UI partners (R, Julia, etc.)
5.Better portable machine learning models (like PMML) that also capture feature engineering (not just algos)
6.Cluster monitoring (GUI, etc.) that works across many big data tools
7.Distributed debugger (for Spark clusters, etc.) for profiling and troubleshooting
8.Cluster auto-configuration tools
•* - Open source STRONGLY preferred
Technology Wish List

•Intel Analytics Toolkit Beta program (now-January ’15) Have a POC, particularly graph? TED.WILLKE@INTEL.COM
•GRADES 2015, Melbourne Australia, May 31, 2015 Papers due March 15. HTTP://EVENT.CWI.NL/GRADES2015/
Call to Action
