In this video from the ISC Big Data'14 Conference, Ted Willke from Intel presents: The Analytics Frontier of the Hadoop Eco-System.
"The Hadoop MapReduce framework grew out of an effort to make it easy to express and parallelize simple computations that were routinely performed at Google. It wasn’t long before libraries, like Apache Mahout, were developed to enable matrix factorization, clustering, regression, and other more complex analyses on Hadoop. Now, many of these libraries and their workloads are migrating to Apache Spark because it supports a wider class of applications than MapReduce and is more appropriate for iterative algorithms, interactive processing, and streaming applications. What’s next beyond Spark? Where is big data analytics processing headed? How will data scientists program these systems? In this talk, we will explore the current analytics frontier, the popular debates, and discuss some potentially clever additions. We will also share the emergent data science applications and collaborative university research that inform our thinking."
Learn more:
http://www.isc-events.com/bigdata14/schedule.html
and
http://www.intel.com/content/www/us/en/software/intel-graph-solutions.html
Watch the video presentation: https://www.youtube.com/watch?v=qlfx495Ekw0
The Analytics Frontier of the Hadoop Eco-System
1. The Analytics Frontier of the Hadoop Eco-system
Ted Willke
Senior Principal Engineer and GM
2. •Scalable commodity processing established with Hadoop MapReduce, with good libraries for machine learning and data mining
•Twitter libraries like Scalding improve upon MapReduce, providing a more generalized dataflow model
•YARN opened door for in-memory iterative processing with Apache Spark, with its own libraries and others being ported
Today | Hadoop Analytics
3. •Variety - Expansion of data primitives in commercial use
•Speed - Data processing models evolving (batch → streaming)
•Complexity – Monolithic analytics → analytics pipelines
•Intelligence – Prescriptive ML → Applying ML to ML itself
•Ease of Use – Gap between skills and needs growing
Trends | Hadoop Eco-System Analytics
4. Life Sciences
Personalized medicine, drug repurposing predictions, integration of heterogeneous data
Education
Personalized instruction, outcomes measurement and intervention
Network Security
Data fusion, threat assessment and identification
Retail
Inventory management, product display management, demand forecasting
Trends | Areas of Application (to name a few)
7. •When the problem is an information network
•When a graph is a natural way of expressing the algorithm
•When you want to study specific relationships
•When you want faster machine learning or solvers on sparse data
shortest path
central influence
sub networks
triangle count
Variety | Graphical Model
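Two of the graph measures named on this slide, triangle count and shortest path, can be sketched on a toy graph. This is a minimal single-machine illustration in Python, not the distributed implementation discussed in the talk; the graph, node names, and function names are assumptions made for the example.

```python
from collections import deque

def triangle_count(adj):
    """Count triangles in an undirected graph given as {node: set(neighbors)}."""
    count = 0
    for u in adj:
        for v in adj[u]:
            if u < v:
                # Only count common neighbors w > v so each triangle is seen once.
                count += sum(1 for w in adj[u] & adj[v] if w > v)
    return count

def shortest_path(adj, source, target):
    """Breadth-first search; returns one shortest path as a node list, or None."""
    parent = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nxt in adj[node]:
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return None

# Hypothetical toy graph: one triangle (a, b, c) with a tail b - d - e.
graph = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b"},
    "d": {"b", "e"},
    "e": {"d"},
}
print(triangle_count(graph))
print(shortest_path(graph, "a", "e"))
```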
8. Graph of channel viewing behavior
Program importance (centrality): high to low
Current popular surfing patterns
SH002463130000
EP005544723744
Changes in surfing behavior may predict customer churn.
Variety | Graph Statistics
9. Preference and Similarity Recommendations
User
Movie
1.7MM Nodes
23.9MM Edges
similar cast
prefers
similar topic
User: userId: A0A22A5
Movie: title: The Godfather, genre: Crime drama, cast: [M. Brando, Al Pacino]
Movie: title: Scarface, genre: Crime drama, cast: [Al Pacino, M. Pfeiffer]
Movie: title: The Departed, genre: Crime drama, cast: [L. DiCaprio, M. Damon]
Edge weights shown: 11.8, 0.67, 0.03, 14.98
Variety | Graph Search
10. URL Ground-Truth Data
IP/Domain Reputations
420MM Records
74.5MM Nodes 185MM Edges
URL
Domain
IP Address
Calculation of priors
LBP Messaging
84.231.82.93
86.39.155.137
forum.vsichko.com
hermansonskok.se
euskzzbz.nonetheups.com
keesenbep.spaces.live.com
Variety | Graphical Machine Learning
11. Variety | Loopy Belief Propagation on the (semantic) web
Reputations
Neutral
Good
Bad
Suspect
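The two steps named on the previous slide, calculation of priors and LBP messaging, can be illustrated with a small sum-product sketch. This is a minimal single-machine version with two states (good, bad), assuming a simple homophily edge potential; the node names, prior values, and the `same` parameter are hypothetical and do not come from the talk's data set.

```python
import math

STATES = ("good", "bad")

def loopy_bp(priors, edges, same=0.8, iterations=20):
    """Synchronous sum-product LBP on a pairwise model with two states.

    priors: {node: (p_good, p_bad)}; edges: iterable of undirected (u, v) pairs.
    `same` is an assumed probability that two linked nodes share a reputation.
    """
    psi = [[same, 1.0 - same], [1.0 - same, same]]
    neighbors = {n: set() for n in priors}
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    # messages[(i, j)] is the message from i to j, one entry per state of j.
    messages = {(i, j): [0.5, 0.5] for i in neighbors for j in neighbors[i]}
    for _ in range(iterations):
        updated = {}
        for (i, j) in messages:
            out = []
            for sj in range(2):
                total = 0.0
                for si in range(2):
                    # Product of messages into i from every neighbor except j.
                    incoming = math.prod(
                        messages[(k, i)][si] for k in neighbors[i] if k != j
                    )
                    total += priors[i][si] * psi[si][sj] * incoming
                out.append(total)
            norm = sum(out)
            updated[(i, j)] = [x / norm for x in out]
        messages = updated
    beliefs = {}
    for i in neighbors:
        raw = [
            priors[i][s] * math.prod(messages[(k, i)][s] for k in neighbors[i])
            for s in range(2)
        ]
        beliefs[i] = [x / sum(raw) for x in raw]
    return beliefs

# Hypothetical graph with a cycle: one IP with a bad prior, the rest neutral.
priors = {
    "ip_1": (0.1, 0.9),
    "domain_1": (0.5, 0.5),
    "url_1": (0.5, 0.5),
    "url_2": (0.5, 0.5),
}
edges = [("ip_1", "domain_1"), ("domain_1", "url_1"),
         ("url_1", "ip_1"), ("domain_1", "url_2")]
for node, belief in loopy_bp(priors, edges).items():
    print(node, dict(zip(STATES, (round(b, 3) for b in belief))))
```

On graphs with cycles the beliefs are approximations and convergence is not guaranteed in general; on a tree the same procedure gives exact marginals.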
12. Variety | Unification with Apache Spark
Image Source: Databricks
•In-memory structures (RDDs) support both table and graph abstractions
•Batch processing and Spark streaming
Spark: RDDs, Transformations, and Actions
Spark Streaming (real-time): DStreams, streams of RDDs
Spark SQL: SchemaRDDs
MLlib (machine learning): RDD-Based Matrices
GraphX (graph processing/machine learning): RDD-Based Graphs
13. Source: Marcus Paradies, GRAph Data-management & ExperienceS Workshop (GRADES 2014)
Variety | Unification within the In-Memory Database (IMDB)
•Index data structure for graph traversal
•Prototyped in SAP HANA distributed columnar IMDB
•Lays foundation for complex graph query and algorithms
20. •Data replay (e.g., a bug is found or application improved)
•Getting faster and more efficient than “fast batch”
•Time-evolving models and computation
Speed | Challenges
21. Source: Jay Kreps, http://radar.oreilly.com/2014/07/questioning-the-lambda-architecture.html, accessed on 9/26/2014
Lambda
•Implement transform logic twice
•Federate information at query time
•Retains input data unchanged
Kappa
•Retain full replay window
•2nd instance can re-process
•Query against latest table
The thinking continues to evolve....
Speed | Cluster-Scale Stream Processing
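The Kappa-style points on this slide (retain the replay window, let a second instance re-process, query the latest table) can be sketched in a few lines. This is a toy in-memory illustration, assuming a list stands in for the retained log; `build_table`, `transform_v1`, and `transform_v2` are hypothetical names, not part of any real streaming system.

```python
def build_table(log, transform):
    """Replay the retained log through one version of the transform logic."""
    table = {}
    for key, value in log:
        table[key] = table.get(key, 0) + transform(value)
    return table

# Retained input log (the full replay window); records are never modified.
log = [("user1", 3), ("user2", 5), ("user1", 4)]

def transform_v1(value):
    return value          # original job

def transform_v2(value):
    return value * 2      # improved job, e.g. after a bug fix

serving_table = build_table(log, transform_v1)    # queries read this table
candidate_table = build_table(log, transform_v2)  # 2nd instance re-processes the log
serving_table = candidate_table                   # queries switch to the latest table
print(serving_table)
```

The transform logic exists once and is simply re-run over the log, which is the contrast with implementing it twice in a Lambda-style design.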
24. Image Source: GraphX project
•Graph processing engine on Spark
•Supports Pregel-style vertex programming
•View same data as either graphs or collections
Speed | GraphX API for Spark
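Pregel-style vertex programming can be illustrated with a small superstep loop. The sketch below is plain single-machine Python, not the GraphX API; it loosely mirrors the three user-supplied functions of a Pregel-style operator (a vertex program, a send function, and a message merge function), and all names and the example graph are assumptions.

```python
import math

def pregel(vertices, edges, initial_msg, vprog, send_msg, merge_msg,
           max_supersteps=50):
    """vertices: {id: value}; edges: list of directed (src, dst, attr) triples."""
    values = {v: vprog(v, val, initial_msg) for v, val in vertices.items()}
    active = set(values)
    for _ in range(max_supersteps):
        inbox = {}
        for src, dst, attr in edges:
            if src not in active:
                continue  # only vertices that changed last superstep send
            msg = send_msg(values[src], values[dst], attr)
            if msg is not None:
                inbox[dst] = merge_msg(inbox[dst], msg) if dst in inbox else msg
        if not inbox:
            break  # no messages in flight: the computation has converged
        for v, msg in inbox.items():
            values[v] = vprog(v, values[v], msg)
        active = set(inbox)
    return values

# Single-source shortest paths from "a" expressed as a vertex program.
vertices = {"a": 0.0, "b": math.inf, "c": math.inf, "d": math.inf, "e": math.inf}
edges = [("a", "b", 1.0), ("b", "c", 2.0), ("a", "c", 5.0), ("c", "d", 1.0)]

distances = pregel(
    vertices,
    edges,
    initial_msg=math.inf,
    vprog=lambda vid, dist, msg: min(dist, msg),
    send_msg=lambda src, dst, w: src + w if src + w < dst else None,
    merge_msg=min,
)
print(distances)
```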
25. •Current Spark streaming provides mini-batch streaming
•No concept of data (model) merging
•GraphX is currently designed for static graphs:
1. Merge table data prior to graph pipeline
2. Re-generate entire (accumulated) graph
3. Re-run machine learning at each window
Speed | Spark Streaming for GraphX?
Straightforward, but wastes computation and time. Can we do better?
26. •Merge information directly into data model used by algorithms
•Static algorithms -> Online algorithms
•Incremental re-computation triggered by changes in data values or data structure
•Possible with many machine learning algos (PageRank as example)
•Evolve IM data stores to maximize performance and freshness
•Better partitioning algorithms → reduced data replication
•Dynamic indexing → fast retrieval
Speed | Online Version of GraphX
27. Static PageRank (delta method)
Online PageRank
Speed | Online PageRank
Good for algos with abelian accumulators (commutative, associative, with inverse)
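One way to see why an accumulator with an inverse matters is a residual-push formulation of PageRank, sketched below. This is an illustrative assumption, not the speaker's implementation: ranks are unnormalized, rank mass at vertices without out-edges is dropped, and only edge insertion is handled. The class and method names are hypothetical. When an edge is added, the existing out-neighbors receive a negative delta, which is only possible because addition has an inverse.

```python
class OnlinePageRank:
    """Residual-push PageRank that absorbs edge insertions incrementally."""

    def __init__(self, damping=0.85, tol=1e-10):
        self.d = damping
        self.tol = tol
        self.out = {}        # vertex -> set of out-neighbors
        self.rank = {}       # rank mass already applied
        self.residual = {}   # rank mass (positive or negative) still to push

    def add_vertex(self, v):
        if v not in self.out:
            self.out[v] = set()
            self.rank[v] = 0.0
            self.residual[v] = 1.0 - self.d  # teleport term, not yet applied

    def add_edge(self, u, w):
        self.add_vertex(u)
        self.add_vertex(w)
        if w in self.out[u]:
            return
        old = len(self.out[u])
        new = old + 1
        # Existing neighbors now get a smaller share of u's rank: a negative delta.
        for x in self.out[u]:
            self.residual[x] += self.d * self.rank[u] * (1.0 / new - 1.0 / old)
        self.out[u].add(w)
        self.residual[w] += self.d * self.rank[u] / new

    def settle(self):
        """Push residuals until all are within tolerance; returns the ranks."""
        pending = [v for v in self.residual if abs(self.residual[v]) > self.tol]
        while pending:
            v = pending.pop()
            r = self.residual[v]
            if abs(r) <= self.tol:
                continue
            self.residual[v] = 0.0
            self.rank[v] += r
            if self.out[v]:
                share = self.d * r / len(self.out[v])
                for w in self.out[v]:
                    self.residual[w] += share
                    if abs(self.residual[w]) > self.tol:
                        pending.append(w)
        return dict(self.rank)

# Incremental: compute ranks, then absorb one new edge.
online = OnlinePageRank()
for u, w in [("a", "b"), ("b", "c"), ("c", "a")]:
    online.add_edge(u, w)
online.settle()
online.add_edge("a", "c")
incremental = online.settle()

# Baseline: rebuild the accumulated graph and recompute from scratch.
scratch = OnlinePageRank()
for u, w in [("a", "b"), ("b", "c"), ("c", "a"), ("a", "c")]:
    scratch.add_edge(u, w)
recomputed = scratch.settle()
print(incremental)
print(recomputed)
```

Because the pushes are commutative and associative, the order in which residuals are processed does not change the result beyond the tolerance, so the incremental and recomputed ranks agree.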
30. Complexity | Challenges
•Feature Engineering for Data Science
•Monolithic Analytics → Complex Pipeline Analytics
31. Complexity | Directed Acyclic Graphs of Actions
•Common in Data Science “feature engineering”
•Developed iteratively
•Becomes a new tool in the toolbox
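A feature-engineering pipeline expressed as a directed acyclic graph of actions might look like the following sketch. It uses Python's standard `graphlib.TopologicalSorter` (Python 3.9+) to order the steps; the step names, the `run_pipeline` helper, and the sample records are hypothetical and only illustrate the idea.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

def run_pipeline(steps, deps, source):
    """Run each step after its upstream steps.

    steps: {name: function(dict of upstream results) -> result}
    deps:  {name: set of upstream step names}; "source" is the raw input.
    """
    results = {"source": source}
    for name in TopologicalSorter(deps).static_order():
        if name == "source":
            continue
        upstream = {d: results[d] for d in deps.get(name, ())}
        results[name] = steps[name](upstream)
    return results

steps = {
    "clean": lambda up: [s.strip().lower() for s in up["source"]],
    "length": lambda up: [len(s) for s in up["clean"]],
    "words": lambda up: [len(s.split()) for s in up["clean"]],
    "features": lambda up: list(zip(up["length"], up["words"])),
}
deps = {
    "clean": {"source"},
    "length": {"clean"},
    "words": {"clean"},
    "features": {"length", "words"},
}

results = run_pipeline(steps, deps, ["  Hello World ", "Big Data"])
print(results["features"])
```

Once a pipeline like this has been developed iteratively, the whole DAG can be wrapped as a single step in a larger one, which is the sense in which it becomes a new tool in the toolbox.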
32. Source: ISTC-Pervasive Computing
Discriminative structures come at multiple scales and varying deformations
Complexity | Hierarchical Matching Pursuit for Image Classification
•Feature learning
•Multiple layers to learn
•Multipath sparse coding
33. Source: ISTC-Pervasive Computing
•Robustness
–Local deformations such as translation, rotation and scaling
–Lighting condition changes
–Viewpoint and pose changes
–Large intra-class variations
•Hierarchy
–Sparse data: The total number of possible image patches grows exponentially with their sizes
–Shared structure: Large patches could share similar or even same small patches
Complexity | Robust Hierarchical Representations?
34. Source: Bo, Ren, & Fox, “Multipath Sparse Coding Using Hierarchical Pursuit,” IEEE CVPR 2013
Complexity | Object Recognition on Caltech 256 Benchmark
#Training Images   15     30     45     60
Local NBB [1]      33.5   40.1   -      -
LLC [2]            34.4   41.2   45.3   47.7
CRBM [3]           35.1   42.1   45.7   47.9
LASERC [4]         35.2   43.6   -      -
LP-beta [5]        -      45.8   -      -
Our Work           41.1   48.7   52.8   56.2
[1] S. McCann and D. Lowe, CVPR 12 [2] J. Wang et al, CVPR 10 [3] K Sohn et al, ICCV 11 [4] K. Nguyen et al, ECCV 12 [5] P. Gehler and S. Nowozin, ICCV 09
Much better than the state of the art
(especially when given more data)
35. Source: Intel Collaborative Research Institute for Computational Intelligence (ICRI-CI)
Diagram: Distributed Deep Learning Library; Spark; Hadoop; six IA/MIC nodes
Complexity | Deep Learning Library for Spark
•Open source
•Spark MLlib contribution
•Optimized for IA
Complete POC in 2015
37. •Selecting the right data to process
•Selecting the right features to engineer
•Selecting the right algorithm to run
Intelligence | Challenges
38. Image Source: University of Nebraska-Lincoln
Intelligence | Ensemble Learning (Wisdom of the Crowd)
Trade computational power for automated experimentation
•Tackles the data and algorithm selection problem
•Diversification methods vary
•Bagging
•Boosting
•Combining techniques vary
•Majority vote on label
•Bucket of models
•N bagged predictors → N times the computation
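Bagging with a majority vote on the label can be sketched as follows. This is a minimal illustration with one-feature decision stumps as the base learners, assuming a toy separable data set; the function names and data are hypothetical. Training cost grows linearly with the number of bagged models.

```python
import random
from collections import Counter

def train_stump(samples):
    """Fit a one-feature threshold classifier on (x, label) pairs."""
    best = None
    for threshold in sorted({x for x, _ in samples}):
        for low_label, high_label in ((0, 1), (1, 0)):
            errors = sum(
                (low_label if x < threshold else high_label) != y
                for x, y in samples
            )
            if best is None or errors < best[0]:
                best = (errors, threshold, low_label, high_label)
    _, threshold, low_label, high_label = best
    return lambda x: low_label if x < threshold else high_label

def bagging(samples, n_models, rng):
    """Train n_models stumps, each on a bootstrap resample of the data."""
    models = []
    for _ in range(n_models):
        bootstrap = [rng.choice(samples) for _ in samples]
        models.append(train_stump(bootstrap))
    return models

def predict(models, x):
    """Combine the predictors by majority vote on the label."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]

# Hypothetical data: label 0 for x in 0..9, label 1 for x in 10..19.
data = [(x, 0) for x in range(10)] + [(x, 1) for x in range(10, 20)]
models = bagging(data, n_models=25, rng=random.Random(0))
print(predict(models, 2), predict(models, 18))
```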
39. Intelligence | Beyond Ensemble Learning
•Downsides of ensemble learning include the number of:
•Tunable parameters
•Selection criteria
•Companies claim that non-parametric methods that require no selection of criteria are in development
For now, it’s the Wisdom of the Crowd. Stay tuned!
41. Ingest & Clean
Engineer Features
Structure Model
Train Model
Query & Analyze
Learn
Visualize
Skills shortage at intersection of systems engineering and data analysis
Painful data ingestion and preparation
Tools that are not designed with loopbacks in mind
Pipeline state not easy to manage, especially for collaboration
Composing pipeline is DIY
Ease of Use | Data Science Workflow
45. Python
R
Dataflow GUI
...
Datacenter / Cloud
Network
Client
“Data Science” API
Connect
Manage
Secure
Analyze (distributed and parallel)
Manage
Secure
Connect
Analyze (local)
Query
Big Data Java/Scala/C++ Computational Frameworks
Big Data Algorithms
Cluster Workload Mgmt
Cluster Storage
Machine Learning & Statistics
Data Wrangling
Analyst Skills
The Other Skills
Ease of Use | Making Big Data Familiar
46. •One consistent API for:
–ETL & feature engineering
  –Including Spark and whatever comes next
–Graph construction, databases, analytics, query
  –Same API for Titan, Neo4j, etc.
  –Same API for Giraph, GraphX, GraphLab, etc.
–Machine learning & statistical analytics
•Programming language integration
•Extensibility at the core
Ease of Use | API Functionality
49. FILESYSTEMS AND NOSQL STORAGE
HW PLATFORM
APACHE HADOOP
APACHE SPARK
DATA WRANGLING
MACHINE LEARNING AND STATISTICS
Graphical Algorithms
Classical Algorithms
Graph Construction Tools
Useful String Manipulation
Useful Math Operators
“DATA SCIENCE” API
Intel Analytics Toolkit
Ease of Use | Delivering It
Unified UIs across the workflow
Easier feature & model creation
End-to-end graph pipeline
Fully scalable throughout
Multiple data primitives
Optimized for IA
Cloud & On-Prem
Python
Libraries
3rd Party GUIs/SDKs
Viz
Tools
Future Libraries
BI Connectors
Query Interfaces
...
50. Algorithm | Category | Applications/Use Cases
Loopy Belief Propagation (LBP) | Structured Prediction | Personalized recs, image de-noising
Label Propagation | Structured Prediction | Personalized recommendations
Alternating Least Squares (ALS) | Collaborative Filtering | Recommenders
Conjugate Gradient Descent (CGD) | Collaborative Filtering | Recommenders
Connected Components | Graph Analytics | Network manipulation, image analysis
Latent Dirichlet Allocation (LDA) | Topic Modeling | Document Clustering
Structure Attribute | Clustering | Network analysis, consumer seg
K-Truss | Clustering | Social network analysis
KNN* | Clustering | Recommenders
Logistic Regression* | Classification | Fraud detection
Random Forest* | Classification | Fraud detection, consumer seg
Generalized Linear Model (Binomial, Poisson) | Non-linear Curve Fitting | Forecasting, pricing, market mix models
Association Rule Mining | Data Mining | Market basket analysis, recommenders
Frequent Pattern Mining* | Data Mining | Pattern Recognition
Approach: Graph
Ease of Use | A Full Spectrum of Analytics
51. Real Time Database
BQL – BigDAWG Query Language & Compiler
Analytics Libraries
Hardware Platforms
Applications, Visualization, Languages
“Narrow waist” provides portability
Historical / Analytics Databases
Spill
Stream
Ease of Use | Future Vision – BigDAWG
52. Ease of Use | Future Vision – BigDAWG
Real Time DBMSs
BQL – BigDAWG Query Language & Compiler
Visualization & Presentation, e.g., ScalaR, imMens, TweetMap, Prefetching
Languages, e.g., Julia, R, MLbase, GraphLab
SciDB
Analytics, e.g., ScaLAPACK, ML algos, plsh, other analytics packages
TupleWare
Hardware Platforms, e.g., NVM simulator, Xeon Phi, Xeon
Applications, e.g., medical data, astronomy, Twitter, urban sensing, IoT
TileDB
S-Store
“Narrow waist” provides portability
MyriaX
Historical / Analytics DBMSs
Spill
Stream
53. Ease of Use | BigDAWG Deliverables ‘15-’16
•Complete prototype “big data” stack and reference implementation
•Battle-tested on multiple use cases
•Standard federation language (BQL)
•Next-generation interface for analytics
•Next-generation stream processing system
Stay Tuned!
http://istc-bigdata.org/
54. 1. Big Data Visualization, especially graph*
2. Big Data DB that supports relational and graph equally*
3. A better workflow manager (Like Oozie for Hadoop, Spark, etc.)*
4. UI partners (R, Julia, etc.)
5. Better portable machine learning models (like PMML) that also capture feature engineering (not just algos)
6. Cluster monitoring (GUI, etc.) that works across many big data tools
7. Distributed debugger (for Spark clusters, etc.) for profiling and troubleshooting
8. Cluster auto-configuration tools
* Open source STRONGLY preferred
Technology Wish List
55. •Intel Analytics Toolkit Beta program (now-January ’15) Have a POC, particularly graph? TED.WILLKE@INTEL.COM
•GRADES 2015, Melbourne Australia, May 31, 2015 Papers due March 15. HTTP://EVENT.CWI.NL/GRADES2015/
Call to Action