The Analytics Frontier of the Hadoop Eco-system 
Ted Willke 
Senior Principal Engineer and GM
•Scalable commodity processing established with Hadoop MapReduce, with good libraries for machine learning and data mining 
•Twitter libraries like Scalding improve upon MapReduce, providing a more generalized dataflow model 
•YARN opened the door to in-memory iterative processing with Apache Spark, which has its own libraries, with others being ported 
Today | Hadoop Analytics
•Variety - Expansion of data primitives in commercial use 
•Speed - Data processing models evolving (batch -> streaming) 
•Complexity - Monolithic analytics -> analytics pipelines 
•Intelligence - Prescriptive ML -> Applying ML to ML itself 
•Ease of Use – Gap between skills and needs growing 
Trends | Hadoop Eco-System Analytics
Life Sciences 
Personalized medicine, drug repurposing predictions, integration of heterogeneous data 
Education 
Personalized instruction, outcomes measurement and intervention 
Network Security 
Data fusion, threat assessment and identification 
Retail 
Inventory management, product display management, demand forecasting 
Trends | Areas of Application (to name a few)
Variety
Variety | Primitives Usage Patterns 
Key-Value Document Graph 
Sync (I/O) Async (Bus) Off-line (Queue) 
API (Remote) LIB (Local) 
Model 
Access 
Implementation 
Column SQL
•When the problem is an information network 
•When a graph is a natural way of expressing the algorithm 
•When you want to study specific relationships 
•When you want faster machine learning or solvers on sparse data 
shortest path 
central influence 
sub networks 
triangle count 
Variety | Graphical Model
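Two of the graph measures named on this slide, shortest path and triangle count, can be sketched in a few lines. This is a minimal illustration on a made-up five-node undirected graph; the node names and edges are hypothetical and do not come from the talk.

```python
from collections import deque

def shortest_path_length(adj, source, target):
    """Breadth-first search: hop count from source to target, or None."""
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == target:
            return dist
        for nxt in adj[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None

def triangle_count(adj):
    """Count each undirected triangle once by visiting its vertices in order."""
    count = 0
    for u in adj:
        for v in adj[u]:
            if u < v:
                # Common neighbours of u and v close a triangle.
                for w in adj[u] & adj[v]:
                    if v < w:
                        count += 1
    return count

# Hypothetical graph: a triangle a-b-c with a tail c-d-e.
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d"), ("d", "e")]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

hops = shortest_path_length(adj, "a", "e")   # a -> c -> d -> e
triangles = triangle_count(adj)              # only a-b-c
```

Both functions work on an adjacency map of sets, which is one simple way to hold a sparse graph in memory.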
High 
Program 
Importance 
(Centrality) 
Low 
Graph of channel viewing behavior 
Current popular 
surfing patterns 
SH002463130000 
EP005544723744 
Changes in surfing behavior may predict customer churn. 
Variety | Graph Statistics
Preference and Similarity Recommendations 
User 
Movie 
1.7MM Nodes 
23.9MM Edges 
similar cast 
prefers 
similar topic 
userId: A0A22A5 
title: The Godfather 
genre: Crime drama 
cast: [M. Brando, Al Pacino] 
title: Scarface genre: Crime drama cast: [Al Pacino, M. Pfeiffer] 
title: The Departed 
genre: Crime drama 
cast: [L. DiCaprio, M. Damon] 
weight=11.8 
weight=0.67 
weight=0.03 
weight=14.98 
Variety | Graph Search
URL Ground-Truth Data 
IP/Domain Reputations 
420MM Records 
74.5MM Nodes 185MM Edges 
URL 
Domain 
IP Address 
Calculation of priors 
LBP Messaging 
84.231.82.93 
86.39.155.137 
forum.vsichko.com 
hermansonskok.se 
euskzzbz.nonetheups.com 
keesenbep.spaces.live.com 
Variety | Graphical Machine Learning
Variety | Loopy Belief Propagation on the (semantic) web 
Reputations 
Neutral 
Good 
Bad 
Suspect
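The priors-then-messaging flow can be illustrated with a small sum-product belief propagation sketch. Everything concrete here is an assumption for illustration: two states (good, bad), a three-node chain of URL, domain, and IP, the prior values, and the compatibility matrix. It is not the implementation behind the reputation graph in the talk.

```python
def loopy_bp(nodes, edges, prior, compat, iterations=20):
    """Sum-product belief propagation on a pairwise model.

    prior[n]     : per-state prior weights for node n
    compat[a][b] : weight for neighbouring nodes being in states a and b
    Returns normalised beliefs per node.
    """
    states = range(len(compat))
    nbrs = {n: set() for n in nodes}
    for u, v in edges:
        nbrs[u].add(v)
        nbrs[v].add(u)
    # msgs[(u, v)][s] is u's message to v about v being in state s.
    msgs = {(u, v): [1.0 for _ in states] for u in nodes for v in nbrs[u]}
    for _ in range(iterations):
        new = {}
        for (u, v) in msgs:
            out = []
            for sv in states:
                total = 0.0
                for su in states:
                    incoming = 1.0
                    for w in nbrs[u]:
                        if w != v:  # exclude the recipient's own message
                            incoming *= msgs[(w, u)][su]
                    total += prior[u][su] * compat[su][sv] * incoming
                out.append(total)
            z = sum(out)
            new[(u, v)] = [x / z for x in out]
        msgs = new
    beliefs = {}
    for n in nodes:
        b = [prior[n][s] for s in states]
        for w in nbrs[n]:
            b = [b[s] * msgs[(w, n)][s] for s in states]
        z = sum(b)
        beliefs[n] = [x / z for x in b]
    return beliefs

# Hypothetical inputs. State 0 = good, state 1 = bad.
nodes = ["url", "domain", "ip"]
edges = [("url", "domain"), ("domain", "ip")]
prior = {"url": [0.1, 0.9], "domain": [0.5, 0.5], "ip": [0.5, 0.5]}
compat = [[0.8, 0.2], [0.2, 0.8]]  # neighbours tend to share a reputation

beliefs = loopy_bp(nodes, edges, prior, compat)
```

On this chain the suspicion attached to the URL spreads to the domain and, more weakly, to the IP address. On graphs with cycles the same update is only approximate, which is why the method is called loopy.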
Variety | Unification with Apache Spark 
Image Source: Databricks 
•In-memory structures (RDDs) support both table and graph abstractions 
•Batch processing and Spark streaming 
Spark 
RDDs, Transformations, and Actions 
Spark Streaming real-time 
Spark 
SQL 
MLlib 
machine learning 
DStreams: streams of RDDs 
SchemaRDDs 
RDD-Based Matrices 
GraphX 
graph processing/ 
machine learning 
RDD-Based Graphs
Source: Marcus Paradies, GRAph Data-management & ExperienceS Workshop (GRADES 2014) 
Variety | Unification within the In-Memory Database (IMDB) 
•Index data structure for graph traversal 
•Prototyped in SAP HANA distributed columnar IMDB 
•Lays foundation for complex graph query and algorithms
Variety | Graph Traversal 
Source: Marcus Paradies, GRAph Data-management & ExperienceS Workshop (GRADES 2014)
Variety | Graph Indexing 
Source: Marcus Paradies, GRAph Data-management & ExperienceS Workshop (GRADES 2014)
Variety | Graph Traversal Results 
Source: Marcus Paradies, GRAph Data-management & ExperienceS Workshop (GRADES 2014)
Speed
Cloud Infrastructure 
UI 
Data Platform 
Analytics Platform 
Datacenter 
Network 
Gateway 
Thing 
Services 
Speed | Hadoop Meets The Internet of Things
Source: http://spark.apache.org/docs/1.0.0/streaming-programming-guide.html, accessed on 9/26/2014 
Data Stream 
Feature Processing 
Model Updates 
Learning 
Distributed Messaging System 
(e.g., Kafka) 
Speed | Stream Processing Pipeline
•Data replay (e.g., a bug is found or application improved) 
•Getting faster and more efficient than “fast batch” 
•Time-evolving models and computation 
Speed | Challenges
Source: Jay Kreps, http://radar.oreilly.com/2014/07/questioning-the-lambda-architecture.html, accessed on 9/26/2014 
Lambda 
Kappa 
•Implement transform logic twice 
•Federate information at query time 
•Retains input data unchanged 
The thinking continues to evolve.... 
•Retain full replay window 
•2nd instance can re-process 
•Query against latest table 
Speed | Cluster-Scale Stream Processing
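The Kappa bullets (retain a full replay window, let a second instance re-process, query the latest table) can be sketched with an in-memory list standing in for the retained log. The event fields and the two transform versions are hypothetical, and a real deployment would read from a messaging system rather than a Python list.

```python
def run_job(log, transform):
    """Replay the retained log from the start and build an output table."""
    table = {}
    for event in log:
        key, value = transform(event)
        table[key] = table.get(key, 0) + value
    return table

# Retained input, never modified (the replay window).
log = [
    {"user": "u1", "clicks": 2},
    {"user": "u2", "clicks": 1},
    {"user": "u1", "clicks": 3},
]

def transform_v1(event):
    """Original logic: count events per user."""
    return event["user"], 1

def transform_v2(event):
    """Improved logic: sum clicks per user."""
    return event["user"], event["clicks"]

serving = run_job(log, transform_v1)     # table that queries read today
candidate = run_job(log, transform_v2)   # second instance replays the same log
old_table = serving
serving = candidate                      # switch queries to the latest table
```

Only one code path exists per version of the logic, which is the contrast with Lambda's duplicated batch and streaming transforms.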
Source: http://spark.apache.org/docs/1.0.0/streaming-programming-guide.html, accessed on 9/26/2014 
Apply DStreams to built-in libraries: 
•Machine Learning 
•Graph Processing 
Speed | Spark Streaming
Source: http://spark.apache.org/docs/1.0.0/streaming-programming-guide.html, accessed on 9/26/2014 
•Mini-batch +/- windowing 
•Analytics can be run on any of the resultant RDDs 
•No provisions for merging RDDs 
Speed | Spark (Discretized) Streaming 
(Mini-) batch Streaming -> (Mini-) batch Analytics
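Mini-batching with optional windowing can be sketched without Spark. The sketch assumes events arrive as (timestamp, value) pairs with non-negative timestamps; the function names and sample events are hypothetical and this is not the Spark Streaming API.

```python
from collections import deque

def mini_batches(events, batch_interval):
    """Group (timestamp, value) events into consecutive fixed-length batches."""
    batches = {}
    for ts, value in events:
        batches.setdefault(int(ts // batch_interval), []).append(value)
    if not batches:
        return []
    # Intervals with no events still produce an (empty) batch.
    return [batches.get(i, []) for i in range(max(batches) + 1)]

def windowed(batches, window_length):
    """After each batch, yield the contents of the last window_length batches."""
    window = deque(maxlen=window_length)
    for batch in batches:
        window.append(batch)
        yield [value for b in window for value in b]

events = [(0.2, "a"), (0.9, "b"), (1.1, "c"), (3.5, "d")]
batches = mini_batches(events, batch_interval=1.0)
windows = list(windowed(batches, window_length=2))
```

Each element of `batches` plays the role of one small batch, and each element of `windows` is what a batch analytic would see when run over a two-batch sliding window.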
Image Source: GraphX project 
•Graph processing engine on Spark 
•Supports Pregel-style vertex programming 
•View same data as either graphs or collections 
Speed | GraphX API for Spark
•Current Spark streaming provides mini-batch streaming 
•No concept of data (model) merging 
•GraphX is currently designed for static graphs: 
1.Merge table data prior to graph pipeline 
2.Re-generate entire (accumulated) graph 
3.Re-run machine learning at each window 
Speed | Spark Streaming for GraphX? 
Straightforward, but wastes computation and time. Can we do better?
•Merge information directly into data model used by algorithms 
•Static algorithms -> Online algorithms 
•Incremental re-computation triggered by changes in data values or data structure 
•Possible with many machine learning algos (PageRank as example) 
•Evolve IM data stores to maximize performance and freshness 
•Better partitioning algorithms -> reduced data replication 
•Dynamic indexing -> fast retrieval 
Speed | Online Version of GraphX
Static PageRank (delta method) 
Online PageRank 
Speed | Online PageRank 
Good for algos with abelian accumulators (commutative, associative, with inverse)
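One simple way to avoid starting from scratch when the graph changes is to warm-start PageRank from the previous ranks. The sketch below is an illustration of that idea only. It is not the delta method or the online algorithm from the talk; the four-node graph is hypothetical, and the reset probability and threshold match the values listed with the early results.

```python
def pagerank(out_edges, reset=0.15, tol=0.001, start=None, max_iter=1000):
    """Power iteration; returns (ranks, iterations used).

    start: optional ranks from an earlier run, used as the initial guess.
    """
    nodes = sorted(set(out_edges) | {v for vs in out_edges.values() for v in vs})
    n = len(nodes)
    rank = {u: (start or {}).get(u, 1.0 / n) for u in nodes}
    total = sum(rank.values())
    rank = {u: r / total for u, r in rank.items()}
    for iteration in range(1, max_iter + 1):
        nxt = {u: reset / n for u in nodes}
        for u in nodes:
            targets = out_edges.get(u) or nodes  # dangling node spreads evenly
            share = (1 - reset) * rank[u] / len(targets)
            for v in targets:
                nxt[v] += share
        change = sum(abs(nxt[u] - rank[u]) for u in nodes)
        rank = nxt
        if change < tol:
            return rank, iteration
    return rank, max_iter

# Hypothetical graph, then the same graph after one new edge d -> b arrives.
graph = {"a": ["b"], "b": ["c"], "c": ["a"], "d": ["a"]}
updated = {"a": ["b"], "b": ["c"], "c": ["a"], "d": ["a", "b"]}

base_rank, base_iters = pagerank(graph)
cold_rank, cold_iters = pagerank(updated)                   # recompute from scratch
warm_rank, warm_iters = pagerank(updated, start=base_rank)  # reuse earlier ranks
```

Both runs on the updated graph approach the same fixed point, so the warm-started ranks agree with the cold ones to within the convergence threshold. Comparing `warm_iters` with `cold_iters` shows how much work the earlier result saved on a given graph.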
[Figure: Convergence Rate. Normalized execution time (0.0 to 1.2) versus throughput (50K to 1M edges/second), naive versus incremental.] 
•Algorithm: PageRank 
•Reset probability: 0.15 
•Convergence threshold: 0.001 
•1 master + 3 workers 
•Distributed messaging system: Kafka 
•Spark 1.1.0 + our graph streaming 
[Figure: Communication Overhead. Normalized messages sent (0% to 120%) versus throughput (50K to 1M edges/second), naive versus incremental.] 
Speed | (Really) Early Results for Online PageRank
Complexity
Complexity | Challenges 
•Feature Engineering for Data Science 
•Monolithic Analytics -> Complex Pipeline Analytics
Complexity | Directed Acyclic Graphs of Actions 
•Common in Data Science “feature engineering” 
•Developed iteratively 
•Becomes a new tool in the toolbox 
[Figure: pipeline DAG with nodes labeled A, A, B, and C]
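A feature-engineering pipeline expressed as a directed acyclic graph of actions can be run in dependency order. This is a minimal sketch using `graphlib` from the Python standard library (Python 3.9 or later); the step names and the toy text input are hypothetical.

```python
from graphlib import TopologicalSorter

def run_pipeline(steps, deps, data):
    """Run each step after its dependencies; a step receives their outputs."""
    results = {}
    for name in TopologicalSorter(deps).static_order():
        inputs = [results[d] for d in deps.get(name, ())]
        results[name] = steps[name](data, *inputs)
    return results

# Hypothetical actions: each takes the raw data plus its upstream outputs.
steps = {
    "tokenize": lambda data: data.split(),
    "lengths": lambda data, tokens: [len(t) for t in tokens],
    "vocab": lambda data, tokens: sorted(set(tokens)),
    "features": lambda data, lengths, vocab: {
        "mean_len": sum(lengths) / len(lengths),
        "vocab_size": len(vocab),
    },
}
# deps maps each step to the steps it depends on.
deps = {
    "tokenize": [],
    "lengths": ["tokenize"],
    "vocab": ["tokenize"],
    "features": ["lengths", "vocab"],
}

results = run_pipeline(steps, deps, "big data big graphs")
```

Once a DAG like this works, it can be saved and reused as a single step in a larger pipeline, which is the sense in which it becomes a new tool in the toolbox.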
Source: ISTC-Pervasive Computing 
Discriminative structures come at multiple scales and varying deformations 
Complexity | Hierarchical Matching Pursuit for Image Classification 
•Feature learning 
•Multiple layers to learn 
•Multipath sparse coding
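The sparse coding building block can be illustrated with plain matching pursuit, which greedily picks the dictionary atom that best explains what is left of the signal. This sketch is a single-layer version and not the hierarchical or multipath variant from the talk; the dictionary and signal are hypothetical, and atoms are assumed to have unit norm.

```python
import math

def matching_pursuit(signal, dictionary, n_atoms):
    """Greedy sparse coding: returns ({atom index: coefficient}, residual)."""
    residual = list(signal)
    coeffs = {}
    for _ in range(n_atoms):
        # Correlation of every atom with the current residual.
        scores = [sum(a * r for a, r in zip(atom, residual)) for atom in dictionary]
        best = max(range(len(dictionary)), key=lambda i: abs(scores[i]))
        coeffs[best] = coeffs.get(best, 0.0) + scores[best]
        # Remove the chosen atom's contribution from the residual.
        residual = [r - scores[best] * a for r, a in zip(residual, dictionary[best])]
    return coeffs, residual

s = 1.0 / math.sqrt(2.0)
dictionary = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [s, s, 0.0],      # a shared structure spanning two coordinates
]
signal = [3.0, 3.0, 1.0]

coeffs, residual = matching_pursuit(signal, dictionary, n_atoms=2)
residual_norm = math.sqrt(sum(r * r for r in residual))
```

Two atoms are enough to reconstruct this signal because one atom captures the structure shared by the first two coordinates.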
Source: ISTC-Pervasive Computing 
•Robustness 
–Local deformations such as translation, rotation and scaling 
–Lighting condition changes 
–Viewpoint and pose changes 
–Large intra-class variations 
•Hierarchy 
–Sparse data: The total number of possible image patches grows exponentially with their sizes 
–Shared structure: Large patches could share similar or even same small patches 
Complexity | Robust Hierarchical Representations?
Source: Bo, Ren, & Fox, “Multipath Sparse Coding Using Hierarchical Pursuit,” IEEE CVPR 2013 
Complexity | Object Recognition on Caltech 256 Benchmark 
Method           #Training Images (accuracy, %)
                 15     30     45     60
Local NBB [1]    33.5   40.1   -      -
LLC [2]          34.4   41.2   45.3   47.7
CRBM [3]         35.1   42.1   45.7   47.9
LASERC [4]       35.2   43.6   -      -
LP-beta [5]      -      45.8   -      -
Our Work         41.1   48.7   52.8   56.2
[1] S. McCann and D. Lowe, CVPR 12 
[2] J. Wang et al., CVPR 10 
[3] K. Sohn et al., ICCV 11 
[4] K. Nguyen et al., ECCV 12 
[5] P. Gehler and S. Nowozin, ICCV 09 
Our work scores highest at every training-set size, with the widest margin at 60 training images (56.2 versus 47.9).
Source: Intel Collaborative Research Institute for Computational Intelligence (ICRI-CI) 
Distributed Deep Learning Library 
Spark 
Hadoop 
IA/MIC 
IA/MIC 
IA/MIC 
IA/MIC 
IA/MIC 
IA/MIC 
Complexity | Deep Learning Library for Spark 
•Open source 
•Spark MLlib contribution 
•Optimized for IA 
Complete POC in 2015
Intelligence
•Selecting the right data to process 
•Selecting the right features to engineer 
•Selecting the right algorithm to run 
Intelligence | Challenges
Image Source: University of Nebraska-Lincoln 
Intelligence | Ensemble Learning (Wisdom of the Crowd) 
Trade computational power for automated experimentation 
•Tackles the data and algorithm selection problem 
•Diversification methods vary 
•Bagging 
•Boosting 
•Combining techniques vary 
•Majority vote on label 
•Bucket of models 
•N bagged predictors -> N times the computation
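Bagging with a majority vote can be sketched with very small models. The sketch below trains decision stumps on bootstrap samples of a made-up one-dimensional dataset; the data, the stump learner, and the choice of 11 models are all assumptions for illustration.

```python
import random
from collections import Counter

def train_stump(sample):
    """Pick the threshold and orientation with the fewest training errors."""
    best = None
    for threshold in sorted({x for x, _ in sample}):
        for above in (0, 1):
            errors = sum(
                (above if x >= threshold else 1 - above) != y for x, y in sample
            )
            if best is None or errors < best[0]:
                best = (errors, threshold, above)
    _, threshold, above = best
    return lambda x: above if x >= threshold else 1 - above

def bagged_predictors(data, n_models, rng):
    """Train one stump per bootstrap sample (drawn with replacement)."""
    models = []
    for _ in range(n_models):
        sample = [rng.choice(data) for _ in data]
        models.append(train_stump(sample))
    return models

def majority_vote(models, x):
    """Combine the predictors by taking the most common label."""
    return Counter(m(x) for m in models).most_common(1)[0][0]

rng = random.Random(0)
data = [(x, 1 if x >= 5 else 0) for x in range(10)]  # hypothetical labels
models = bagged_predictors(data, n_models=11, rng=rng)
low_vote = majority_vote(models, 0)
high_vote = majority_vote(models, 9)
```

Training cost grows linearly with the number of bagged models, which is the computation trade-off the last bullet refers to.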
Intelligence | Beyond Ensemble Learning 
•Downsides of ensemble learning include the number of: 
•Tunable parameters 
•Selection criteria 
•Companies claim that non-parametric methods that require no selection of criteria are in development 
For now, it’s the Wisdom of the Crowd. Stay tuned!
Ease of Use
Ingest & Clean 
Engineer Features 
Structure Model 
Train Model 
Query & Analyze 
Learn 
Visualize 
Skills shortage at intersection of systems engineering and data analysis 
Painful data ingestion and preparation 
Tools that are not designed with loopbacks in mind 
Pipeline state not easy to manage, especially for collaboration 
Composing pipeline is DIY 
Ease of Use | Data Science Workflow
Congratulations! You are a data scientist!
Intel Confidential 
Decomposing the “data scientist” 
Source: 2013 Report from Accenture Institute for High Performance
Source: http://www.cloudera.com/content/cloudera-content/cloudera-docs/HadoopTutorial/CDH4/Hadoop-Tutorial/ht_wordcount1_source.html, accessed on 9/30/2014 
Ease of Use | Programming Languages 
WordCount: The “Hello World” for Big Data 
In Java MapReduce
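For contrast with the Java MapReduce version, the same map, shuffle, and reduce steps fit in a few lines of Python. This is a single-process sketch with hypothetical input lines, not a distributed job.

```python
from collections import defaultdict

def map_phase(line):
    """Emit (word, 1) for every word in a line."""
    for word in line.split():
        yield word, 1

def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(word, counts):
    """Sum the counts for one word."""
    return word, sum(counts)

lines = ["hello big data", "hello hadoop"]
pairs = [pair for line in lines for pair in map_phase(line)]
word_counts = dict(reduce_phase(w, c) for w, c in shuffle(pairs).items())
```

The gap in length between this and the Java version is the ease-of-use point the slide makes about programming languages.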
Python 
R 
Dataflow GUI 
... 
Datacenter / Cloud 
Network 
Client 
“Data Science” API 
Connect 
Manage 
Secure 
Analyze (distributed and parallel) 
Manage 
Secure 
Connect 
Analyze (local) 
Query 
Big Data Java/Scala/C++ Computational Frameworks 
Big Data Algorithms 
Cluster Workload Mgmt 
Cluster Storage 
Machine Learning & Statistics 
Data Wrangling 
Analyst Skills 
The Other Skills 
Ease of Use | Making Big Data Familiar
•One consistent API for: 
–ETL & feature engineering 
–Including Spark and whatever comes next 
–Graph construction, databases, analytics, query 
–Same API for Titan, Neo4j, etc. 
–Same API for Giraph, GraphX, GraphLab, etc. 
–Machine learning & statistical analytics 
•Programming language integration 
•Extensibility at the core 
Ease of Use | API Functionality
POST https://site.com/joe/graphs/29/transforms 
{ 
  "operation": "ml.cgd", 
  "arguments": [{ 
    "edge_properties": ["rating"], 
    "output_property_prefix": "cgd_", 
    "vertex_type": "vertex_type", 
    "edge_type": "splits", 
    "max_supersteps": 20, 
    "feature_dimension": 3, 
    "convergence_threshold": 0, 
    "cgd_lambda": 0.65, 
    "learning_output_interval": 1, 
    "bias_on": true, 
    "num_iters": 3 
  }] 
} 
Ease of Use | Run CGD (REST Call)
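From a client, the call could be assembled with the Python standard library roughly as follows. The URL and arguments are the ones on the slide; the JSON content type header is an assumption, and the request is only constructed, not sent, because site.com is a placeholder host.

```python
import json
import urllib.request

# Arguments copied from the slide's request body.
payload = {
    "operation": "ml.cgd",
    "arguments": [{
        "edge_properties": ["rating"],
        "output_property_prefix": "cgd_",
        "vertex_type": "vertex_type",
        "edge_type": "splits",
        "max_supersteps": 20,
        "feature_dimension": 3,
        "convergence_threshold": 0,
        "cgd_lambda": 0.65,
        "learning_output_interval": 1,
        "bias_on": True,
        "num_iters": 3,
    }],
}

request = urllib.request.Request(
    "https://site.com/joe/graphs/29/transforms",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},  # assumed, not shown on the slide
    method="POST",
)
# urllib.request.urlopen(request) would submit it against a real endpoint.
```

Building the body with `json.dumps` guarantees well-formed JSON, including the lowercase `true` for `bias_on`.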
201 Created 
{ 
  "operation": "ml.cgd", "arguments": (same as request), 
  "id": 2, "created": "2014-01-31 10:51:05.1234", 
  "depends_on": [{ 
    "link": {"method": "GET", "uri": "https://site.com/v1/graphs/29/transforms/1"}, 
    "type": "graphbuilder", "started": "2014-01-31 10:51:02.8899", "eta": null, 
    "status": "pending" 
  }], 
  "links": [ 
    {"rel": "self", "method": "GET", 
     "uri": "https://site.com/v1/graphs/29/transforms/2"}, 
    {"rel": "intel:idpat-progress", "method": "GET", 
     "uri": "https://site.com/v1/graphs/29/transforms/2/progress"}, 
    {"rel": "intel:idpat-cancel", "method": "DELETE", 
     "uri": "https://site.com/user/joe/graphs/29/transforms/2"} 
  ] 
} 
Ease of Use | Run CGD (REST Response)
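A client would typically follow the links in the response rather than build URLs itself. The sketch below looks up a link by its `rel` value; the `find_link` helper is hypothetical, and the dictionary reproduces only the `id` and `links` fields from the slide.

```python
# Subset of the response shown on the slide.
response = {
    "id": 2,
    "links": [
        {"rel": "self", "method": "GET",
         "uri": "https://site.com/v1/graphs/29/transforms/2"},
        {"rel": "intel:idpat-progress", "method": "GET",
         "uri": "https://site.com/v1/graphs/29/transforms/2/progress"},
        {"rel": "intel:idpat-cancel", "method": "DELETE",
         "uri": "https://site.com/user/joe/graphs/29/transforms/2"},
    ],
}

def find_link(resource, rel):
    """Return (method, uri) for the first link with the given rel, or None."""
    for link in resource.get("links", []):
        if link.get("rel") == rel:
            return link["method"], link["uri"]
    return None

progress = find_link(response, "intel:idpat-progress")
cancel = find_link(response, "intel:idpat-cancel")
```

Polling the progress link until the transform finishes, or issuing the cancel link's DELETE, then needs no knowledge of the server's URL layout.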
FILESYSTEMS AND NOSQL STORAGE 
HW PLATFORM 
APACHE HADOOP 
APACHE SPARK 
DATA WRANGLING 
MACHINE LEARNING AND STATISTICS 
Graphical Algorithms 
Classical Algorithms 
Graph Construction Tools 
Useful String Manipulation 
Useful Math Operators 
“DATA SCIENCE” API 
Intel Analytics Toolkit 
Ease of Use | Delivering It 
Unified UIs across the workflow 
Easier feature & model creation 
End-to-end graph pipeline 
Fully scalable throughout 
Multiple data primitives 
Optimized for IA 
Cloud & On-Prem 
Python 
Libraries 
3rd Party GUIs/SDKs 
Viz 
Tools 
Future Libraries 
BI Connectors 
Query Interfaces 
...
Algorithm                                      Category                   Applications/Use Cases
Loopy Belief Propagation (LBP)                 Structured Prediction      Personalized recommendations, image de-noising
Label Propagation                              Structured Prediction      Personalized recommendations
Alternating Least Squares (ALS)                Collaborative Filtering    Recommenders
Conjugate Gradient Descent (CGD)               Collaborative Filtering    Recommenders
Connected Components                           Graph Analytics            Network manipulation, image analysis
Latent Dirichlet Allocation (LDA)              Topic Modeling             Document clustering
Structure Attribute                            Clustering                 Network analysis, consumer segmentation
K-Truss                                        Clustering                 Social network analysis
KNN*                                           Clustering                 Recommenders
Logistic Regression*                           Classification             Fraud detection
Random Forest*                                 Classification             Fraud detection, consumer segmentation
Generalized Linear Model (Binomial, Poisson)   Non-linear Curve Fitting   Forecasting, pricing, market mix models
Association Rule Mining                        Data Mining                Market basket analysis, recommenders
Frequent Pattern Mining*                       Data Mining                Pattern recognition
Approach label shown on the slide: Graph 
Ease of Use | A Full Spectrum of Analytics
Real Time Database 
BQL – BigDAWG Query Language & Compiler 
Analytics Libraries 
Hardware Platforms 
Applications, Visualization, Languages 
“Narrow waist” provides portability 
Historical / Analytics Databases 
Spill 
Stream 
Ease of Use | Future Vision – BigDAWG
Ease of Use | Future Vision – BigDAWG 
Real Time DBMSs 
BQL – BigDAWG Query Language & Compiler 
Visualization & Presentation, e.g., ScalaR, imMens, TweetMap, Prefetching 
Languages, e.g, Julia, R, MLbase, GraphLab 
SciDB 
Analytics, e.g., ScaLAPACK, ML algos, plsh, other analytics packages 
TupleWare 
Hardware Platforms, e.g., NVM simulator, Xeon Phi, Xeon 
Applications, e.g., medical data, astronomy, Twitter, urban sensing, IoT 
TileDB 
S-Store 
“Narrow waist” provides portability 
MyriaX 
Historical / Analytics DBMSs 
Spill 
Stream
Ease of Use | BigDAWG Deliverables ‘15-’16 
•Complete prototype “big data” stack and reference implementation 
•Battle-tested on multiple use cases 
•Standard federation language (BQL) 
•Next-generation interface for analytics 
•Next-generation stream processing system 
Stay Tuned! 
http://istc-bigdata.org/
1.Big Data Visualization, especially graph* 
2.Big Data DB that supports relational and graph equally* 
3.A better workflow manager (Like Oozie for Hadoop, Spark, etc.)* 
4.UI partners (R, Julia, etc.) 
5.Better portable machine learning models (like PMML) that also capture feature engineering (not just algos) 
6.Cluster monitoring (GUI, etc.) that works across many big data tools 
7.Distributed debugger (for Spark clusters, etc.) for profiling and troubleshooting 
8.Cluster auto-configuration tools 
•* - Open source STRONGLY preferred 
Technology Wish List
•Intel Analytics Toolkit Beta program (now-January ’15) Have a POC, particularly graph? TED.WILLKE@INTEL.COM 
•GRADES 2015, Melbourne Australia, May 31, 2015 Papers due March 15. HTTP://EVENT.CWI.NL/GRADES2015/ 
Call to Action
The Analytics Frontier of the Hadoop Eco-System

Weitere ähnliche Inhalte

Was ist angesagt?

Machine Learning with Hadoop
Machine Learning with HadoopMachine Learning with Hadoop
Machine Learning with HadoopSangchul Song
 
Experimental Design for Distributed Machine Learning with Myles Baker
Experimental Design for Distributed Machine Learning with Myles BakerExperimental Design for Distributed Machine Learning with Myles Baker
Experimental Design for Distributed Machine Learning with Myles BakerDatabricks
 
Predicting Influence and Communities Using Graph Algorithms
Predicting Influence and Communities Using Graph AlgorithmsPredicting Influence and Communities Using Graph Algorithms
Predicting Influence and Communities Using Graph AlgorithmsDatabricks
 
Transforming AI with Graphs: Real World Examples using Spark and Neo4j
Transforming AI with Graphs: Real World Examples using Spark and Neo4jTransforming AI with Graphs: Real World Examples using Spark and Neo4j
Transforming AI with Graphs: Real World Examples using Spark and Neo4jFred Madrid
 
Information Exploitation at BBN
Information Exploitation at BBNInformation Exploitation at BBN
Information Exploitation at BBNPlamen Petrov
 
Cognitive Toolkit - Deep Learning framework from Microsoft
Cognitive Toolkit - Deep Learning framework from MicrosoftCognitive Toolkit - Deep Learning framework from Microsoft
Cognitive Toolkit - Deep Learning framework from MicrosoftŁukasz Grala
 
Large-Scaled Insurance Analytics Using Tweedie Models in Apache Spark with Ya...
Large-Scaled Insurance Analytics Using Tweedie Models in Apache Spark with Ya...Large-Scaled Insurance Analytics Using Tweedie Models in Apache Spark with Ya...
Large-Scaled Insurance Analytics Using Tweedie Models in Apache Spark with Ya...Databricks
 
END-TO-END MACHINE LEARNING STACK
END-TO-END MACHINE LEARNING STACKEND-TO-END MACHINE LEARNING STACK
END-TO-END MACHINE LEARNING STACKJan Wiegelmann
 
Jethro data meetup index base sql on hadoop - oct-2014
Jethro data meetup    index base sql on hadoop - oct-2014Jethro data meetup    index base sql on hadoop - oct-2014
Jethro data meetup index base sql on hadoop - oct-2014Eli Singer
 
Petabyte Scale Anomaly Detection Using R & Spark by Sridhar Alla and Kiran Mu...
Petabyte Scale Anomaly Detection Using R & Spark by Sridhar Alla and Kiran Mu...Petabyte Scale Anomaly Detection Using R & Spark by Sridhar Alla and Kiran Mu...
Petabyte Scale Anomaly Detection Using R & Spark by Sridhar Alla and Kiran Mu...Spark Summit
 
COCOA: Communication-Efficient Coordinate Ascent
COCOA: Communication-Efficient Coordinate AscentCOCOA: Communication-Efficient Coordinate Ascent
COCOA: Communication-Efficient Coordinate Ascentjeykottalam
 
Jean-François Puget, Distinguished Engineer, Machine Learning and Optimizatio...
Jean-François Puget, Distinguished Engineer, Machine Learning and Optimizatio...Jean-François Puget, Distinguished Engineer, Machine Learning and Optimizatio...
Jean-François Puget, Distinguished Engineer, Machine Learning and Optimizatio...MLconf
 
Leveraging Graphs for Better AI
Leveraging Graphs for Better AILeveraging Graphs for Better AI
Leveraging Graphs for Better AINeo4j
 
Data Science for Dummies - Data Engineering with Titanic dataset + Databricks...
Data Science for Dummies - Data Engineering with Titanic dataset + Databricks...Data Science for Dummies - Data Engineering with Titanic dataset + Databricks...
Data Science for Dummies - Data Engineering with Titanic dataset + Databricks...Rodney Joyce
 
Machine learning at scale challenges and solutions
Machine learning at scale challenges and solutionsMachine learning at scale challenges and solutions
Machine learning at scale challenges and solutionsStavros Kontopoulos
 
IBM Strategy for Spark
IBM Strategy for SparkIBM Strategy for Spark
IBM Strategy for SparkMark Kerzner
 
Machine Learning with Spark
Machine Learning with SparkMachine Learning with Spark
Machine Learning with Sparkelephantscale
 
Joseph Bradley, Software Engineer, Databricks Inc. at MLconf SEA - 5/01/15
Joseph Bradley, Software Engineer, Databricks Inc. at MLconf SEA - 5/01/15Joseph Bradley, Software Engineer, Databricks Inc. at MLconf SEA - 5/01/15
Joseph Bradley, Software Engineer, Databricks Inc. at MLconf SEA - 5/01/15MLconf
 
Ml infra at an early stage
Ml infra at an early stageMl infra at an early stage
Ml infra at an early stageNick Handel
 
Made to Measure: Ranking Evaluation using Elasticsearch
Made to Measure: Ranking Evaluation using ElasticsearchMade to Measure: Ranking Evaluation using Elasticsearch
Made to Measure: Ranking Evaluation using ElasticsearchDaniel Schneiter
 

Was ist angesagt? (20)

Machine Learning with Hadoop
Machine Learning with HadoopMachine Learning with Hadoop
Machine Learning with Hadoop
 
Experimental Design for Distributed Machine Learning with Myles Baker
Experimental Design for Distributed Machine Learning with Myles BakerExperimental Design for Distributed Machine Learning with Myles Baker
Experimental Design for Distributed Machine Learning with Myles Baker
 
Predicting Influence and Communities Using Graph Algorithms
Predicting Influence and Communities Using Graph AlgorithmsPredicting Influence and Communities Using Graph Algorithms
Predicting Influence and Communities Using Graph Algorithms
 
Transforming AI with Graphs: Real World Examples using Spark and Neo4j
Transforming AI with Graphs: Real World Examples using Spark and Neo4jTransforming AI with Graphs: Real World Examples using Spark and Neo4j
Transforming AI with Graphs: Real World Examples using Spark and Neo4j
 
Information Exploitation at BBN
Information Exploitation at BBNInformation Exploitation at BBN
Information Exploitation at BBN
 
Cognitive Toolkit - Deep Learning framework from Microsoft
Cognitive Toolkit - Deep Learning framework from MicrosoftCognitive Toolkit - Deep Learning framework from Microsoft
Cognitive Toolkit - Deep Learning framework from Microsoft
 
Large-Scaled Insurance Analytics Using Tweedie Models in Apache Spark with Ya...
Large-Scaled Insurance Analytics Using Tweedie Models in Apache Spark with Ya...Large-Scaled Insurance Analytics Using Tweedie Models in Apache Spark with Ya...
Large-Scaled Insurance Analytics Using Tweedie Models in Apache Spark with Ya...
 
END-TO-END MACHINE LEARNING STACK
END-TO-END MACHINE LEARNING STACKEND-TO-END MACHINE LEARNING STACK
END-TO-END MACHINE LEARNING STACK
 
Jethro data meetup index base sql on hadoop - oct-2014
Jethro data meetup    index base sql on hadoop - oct-2014Jethro data meetup    index base sql on hadoop - oct-2014
Jethro data meetup index base sql on hadoop - oct-2014
 
Petabyte Scale Anomaly Detection Using R & Spark by Sridhar Alla and Kiran Mu...
Petabyte Scale Anomaly Detection Using R & Spark by Sridhar Alla and Kiran Mu...Petabyte Scale Anomaly Detection Using R & Spark by Sridhar Alla and Kiran Mu...
Petabyte Scale Anomaly Detection Using R & Spark by Sridhar Alla and Kiran Mu...
 
COCOA: Communication-Efficient Coordinate Ascent
COCOA: Communication-Efficient Coordinate AscentCOCOA: Communication-Efficient Coordinate Ascent
COCOA: Communication-Efficient Coordinate Ascent
 
Jean-François Puget, Distinguished Engineer, Machine Learning and Optimizatio...
Jean-François Puget, Distinguished Engineer, Machine Learning and Optimizatio...Jean-François Puget, Distinguished Engineer, Machine Learning and Optimizatio...
Jean-François Puget, Distinguished Engineer, Machine Learning and Optimizatio...
 
Leveraging Graphs for Better AI
Leveraging Graphs for Better AILeveraging Graphs for Better AI
Leveraging Graphs for Better AI
 
Data Science for Dummies - Data Engineering with Titanic dataset + Databricks...
Data Science for Dummies - Data Engineering with Titanic dataset + Databricks...Data Science for Dummies - Data Engineering with Titanic dataset + Databricks...
Data Science for Dummies - Data Engineering with Titanic dataset + Databricks...
 
Machine learning at scale challenges and solutions
Machine learning at scale challenges and solutionsMachine learning at scale challenges and solutions
Machine learning at scale challenges and solutions
 
IBM Strategy for Spark
IBM Strategy for SparkIBM Strategy for Spark
IBM Strategy for Spark
 
Machine Learning with Spark
Machine Learning with SparkMachine Learning with Spark
Machine Learning with Spark
 
Joseph Bradley, Software Engineer, Databricks Inc. at MLconf SEA - 5/01/15
Joseph Bradley, Software Engineer, Databricks Inc. at MLconf SEA - 5/01/15Joseph Bradley, Software Engineer, Databricks Inc. at MLconf SEA - 5/01/15
Joseph Bradley, Software Engineer, Databricks Inc. at MLconf SEA - 5/01/15
 
Ml infra at an early stage
Ml infra at an early stageMl infra at an early stage
Ml infra at an early stage
 
Made to Measure: Ranking Evaluation using Elasticsearch
Made to Measure: Ranking Evaluation using ElasticsearchMade to Measure: Ranking Evaluation using Elasticsearch
Made to Measure: Ranking Evaluation using Elasticsearch
 

Ähnlich wie The Analytics Frontier of the Hadoop Eco-System

Azure Databricks for Data Scientists
Azure Databricks for Data ScientistsAzure Databricks for Data Scientists
Azure Databricks for Data ScientistsRichard Garris
 
Scaling up with Cisco Big Data: Data + Science = Data Science
Scaling up with Cisco Big Data: Data + Science = Data ScienceScaling up with Cisco Big Data: Data + Science = Data Science
Scaling up with Cisco Big Data: Data + Science = Data ScienceeRic Choo
 
How to build a data stack from scratch
How to build a data stack from scratchHow to build a data stack from scratch
How to build a data stack from scratchVinayak Hegde
 
Matching Data Intensive Applications and Hardware/Software Architectures
Matching Data Intensive Applications and Hardware/Software ArchitecturesMatching Data Intensive Applications and Hardware/Software Architectures
Matching Data Intensive Applications and Hardware/Software ArchitecturesGeoffrey Fox
 
Matching Data Intensive Applications and Hardware/Software Architectures
Matching Data Intensive Applications and Hardware/Software ArchitecturesMatching Data Intensive Applications and Hardware/Software Architectures
Matching Data Intensive Applications and Hardware/Software ArchitecturesGeoffrey Fox
 
Big Data Open Source Tools and Trends: Enable Real-Time Business Intelligence...
Big Data Open Source Tools and Trends: Enable Real-Time Business Intelligence...Big Data Open Source Tools and Trends: Enable Real-Time Business Intelligence...
Big Data Open Source Tools and Trends: Enable Real-Time Business Intelligence...Perficient, Inc.
 
Leveraging Graphs for Better AI
Leveraging Graphs for Better AILeveraging Graphs for Better AI
Leveraging Graphs for Better AINeo4j
 
Build Deep Learning Applications for Big Data Platforms (CVPR 2018 tutorial)
Build Deep Learning Applications for Big Data Platforms (CVPR 2018 tutorial)Build Deep Learning Applications for Big Data Platforms (CVPR 2018 tutorial)
Build Deep Learning Applications for Big Data Platforms (CVPR 2018 tutorial)Jason Dai
 
AnzoGraph DB: Driving AI and Machine Insights with Knowledge Graphs in a Conn...
AnzoGraph DB: Driving AI and Machine Insights with Knowledge Graphs in a Conn...AnzoGraph DB: Driving AI and Machine Insights with Knowledge Graphs in a Conn...
AnzoGraph DB: Driving AI and Machine Insights with Knowledge Graphs in a Conn...Cambridge Semantics
 
AI on Greenplum Using
 Apache MADlib and MADlib Flow - Greenplum Summit 2019
AI on Greenplum Using
 Apache MADlib and MADlib Flow - Greenplum Summit 2019AI on Greenplum Using
 Apache MADlib and MADlib Flow - Greenplum Summit 2019
AI on Greenplum Using
 Apache MADlib and MADlib Flow - Greenplum Summit 2019VMware Tanzu
 
Neo4j GraphTalk Düsseldorf - Building intelligent solutions with Graphs
Neo4j GraphTalk Düsseldorf - Building intelligent solutions with GraphsNeo4j GraphTalk Düsseldorf - Building intelligent solutions with Graphs
Neo4j GraphTalk Düsseldorf - Building intelligent solutions with GraphsNeo4j
 
Knowledge Graph for Machine Learning and Data Science
Knowledge Graph for Machine Learning and Data ScienceKnowledge Graph for Machine Learning and Data Science
Knowledge Graph for Machine Learning and Data ScienceCambridge Semantics
 
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習 Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習 Herman Wu
 
Using SparkML to Power a DSaaS (Data Science as a Service): Spark Summit East...
Using SparkML to Power a DSaaS (Data Science as a Service): Spark Summit East...Using SparkML to Power a DSaaS (Data Science as a Service): Spark Summit East...
Using SparkML to Power a DSaaS (Data Science as a Service): Spark Summit East...Spark Summit
 
Alex mang patterns for scalability in microsoft azure application
Alex mang   patterns for scalability in microsoft azure applicationAlex mang   patterns for scalability in microsoft azure application
Alex mang patterns for scalability in microsoft azure applicationCodecamp Romania
 
Designing Distributed Machine Learning on Apache Spark
Designing Distributed Machine Learning on Apache SparkDesigning Distributed Machine Learning on Apache Spark
Designing Distributed Machine Learning on Apache SparkDatabricks
 
The Challenges of Bringing Machine Learning to the Masses
The Challenges of Bringing Machine Learning to the MassesThe Challenges of Bringing Machine Learning to the Masses
The Challenges of Bringing Machine Learning to the MassesAlice Zheng
 
Transforming AI with Graphs: Real World Examples using Spark and Neo4j
Transforming AI with Graphs: Real World Examples using Spark and Neo4jTransforming AI with Graphs: Real World Examples using Spark and Neo4j
Transforming AI with Graphs: Real World Examples using Spark and Neo4jDatabricks
 
Apache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning Models
Apache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning ModelsApache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning Models
Apache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning ModelsAnyscale
 
A machine learning and data science pipeline for real companies
A machine learning and data science pipeline for real companiesA machine learning and data science pipeline for real companies
A machine learning and data science pipeline for real companiesDataWorks Summit
 


The Analytics Frontier of the Hadoop Eco-System

  • 1. The Analytics Frontier of the Hadoop Eco-system. Ted Willke, Senior Principal Engineer and GM
  • 2. Today | Hadoop Analytics
      • Scalable commodity processing established with Hadoop MapReduce, with good libraries for machine learning and data mining
      • Twitter libraries like Scalding improve upon MapReduce, providing a more generalized dataflow model
      • YARN opened door for in-memory iterative processing with Apache Spark, with its own libraries and others being ported
  • 3. Trends | Hadoop Eco-System Analytics
      • Variety: expansion of data primitives in commercial use
      • Speed: data processing models evolving (batch → streaming)
      • Complexity: monolithic analytics → analytics pipelines
      • Intelligence: prescriptive ML → applying ML to ML itself
      • Ease of Use: gap between skills and needs growing
  • 4. Trends | Areas of Application (to name a few)
      • Life Sciences: personalized medicine, drug repurposing predictions, integration of heterogeneous data
      • Education: personalized instruction, outcomes measurement and intervention
      • Network Security: data fusion, threat assessment and identification
      • Retail: inventory management, product display management, demand forecasting
  • 6. Variety | Primitives Usage Patterns
      • Model: Key-Value, Document, Graph, Column, SQL
      • Access: Sync (I/O), Async (Bus), Off-line (Queue)
      • Implementation: API (Remote), LIB (Local)
  • 7. Variety | Graphical Model
      • When the problem is an information network
      • When a graph is a natural way of expressing the algorithm
      • When you want to study specific relationships
      • When you want faster machine learning or solvers on sparse data
      • Figure labels: shortest path, central influence, sub networks, triangle count
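Two of the graph analyses named on this slide, shortest path and triangle count, can be sketched in a few lines of plain Python. This is only an illustration on a hypothetical five-vertex graph (`GRAPH` and both function names are assumptions, not taken from the talk); a cluster-scale system would use a distributed graph engine instead.

```python
from collections import deque

# Hypothetical toy graph: undirected adjacency sets keyed by vertex name.
GRAPH = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b", "d"},
    "d": {"b", "c", "e"},
    "e": {"d"},
}


def shortest_path_length(graph, source, target):
    """Breadth-first search; returns the hop count, or None if unreachable."""
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        vertex, dist = queue.popleft()
        if vertex == target:
            return dist
        for neighbor in graph[vertex]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None


def triangle_count(graph):
    """Count each triangle exactly once by requiring u < v < w."""
    count = 0
    for u in graph:
        for v in graph[u]:
            if u < v:
                # Common neighbors of u and v close a triangle.
                for w in graph[u] & graph[v]:
                    if v < w:
                        count += 1
    return count


hops = shortest_path_length(GRAPH, "a", "e")
triangles = triangle_count(GRAPH)
```

The ordering constraint `u < v < w` avoids counting the same triangle six times, which matters more as the edge count grows.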
  • 8. Variety | Graph Statistics
      • Graph of channel viewing behavior; program importance (centrality) shown on a scale from low to high
      • Current popular surfing patterns (figure labels: SH002463130000, EP005544723744)
      • Changes in surfing behavior may predict customer churn.
  • 9. Variety | Graph Search
      • Preference and similarity recommendations
      • User and Movie vertices: 1.7MM nodes, 23.9MM edges
      • Edge types: prefers, similar cast, similar topic
      • Example vertices:
          • userId: A0A22A5
          • title: The Godfather, genre: Crime drama, cast: [M. Brando, Al Pacino]
          • title: Scarface, genre: Crime drama, cast: [Al Pacino, M. Pfeiffer]
          • title: The Departed, genre: Crime drama, cast: [L. DiCaprio, M. Damon]
      • Example edge weights: 11.8, 0.67, 0.03, 14.98
  • 10. Variety | Graphical Machine Learning
      • Inputs: URL ground-truth data, IP/domain reputations (420MM records)
      • Graph: 74.5MM nodes, 185MM edges; vertex types URL, Domain, IP Address
      • Steps: calculation of priors, LBP messaging
      • Example vertices: 84.231.82.93, 86.39.155.137, forum.vsichko.com, hermansonskok.se, euskzzbz.nonetheups.com, keesenbep.spaces.live.com
  • 11. Variety | Loopy Belief Propagation on the (semantic) web
      • Reputations: Neutral, Good, Bad, Suspect
  • 12. Variety | Unification with Apache Spark (Image Source: Databricks)
      • In-memory structures (RDDs) support both table and graph abstractions
      • Batch processing and Spark streaming
      • Spark core: RDDs, Transformations, and Actions
      • Spark Streaming (real-time): DStreams, streams of RDDs
      • Spark SQL: SchemaRDDs
      • MLlib (machine learning): RDD-based matrices
      • GraphX (graph processing/machine learning): RDD-based graphs
  • 13. Variety | Unification within the In-Memory Database (IMDB)
      • Index data structure for graph traversal
      • Prototyped in SAP HANA distributed columnar IMDB
      • Lays foundation for complex graph query and algorithms
      • Source: Marcus Paradies, GRAph Data-management & ExperienceS Workshop (GRADES 2014)
  • 14. Variety | Graph Traversal (Source: Marcus Paradies, GRAph Data-management & ExperienceS Workshop, GRADES 2014)
  • 15. Variety | Graph Indexing (Source: Marcus Paradies, GRAph Data-management & ExperienceS Workshop, GRADES 2014)
  • 16. Variety | Graph Traversal Results (Source: Marcus Paradies, GRAph Data-management & ExperienceS Workshop, GRADES 2014)
  • 17. Speed
  • 18. Speed | Hadoop Meets The Internet of Things
      • Diagram labels: Cloud Infrastructure, UI, Data Platform, Analytics Platform, Datacenter, Network, Gateway, Thing, Services
  • 19. Speed | Stream Processing Pipeline
      • Diagram labels: Data Stream, Feature Processing, Model Updates, Learning, Distributed Messaging System (e.g., Kafka)
      • Source: http://spark.apache.org/docs/1.0.0/streaming-programming-guide.html, accessed on 9/26/2014
  • 20. Speed | Challenges
      • Data replay (e.g., a bug is found or application improved)
      • Getting faster and more efficient than “fast batch”
      • Time-evolving models and computation
  • 21. Speed | Cluster-Scale Stream Processing
      • Lambda:
          • Implement transform logic twice
          • Federate information at query time
          • Retains input data unchanged
      • Kappa:
          • Retain full replay window
          • 2nd instance can re-process
          • Query against latest table
      • The thinking continues to evolve...
      • Source: Jay Kreps, http://radar.oreilly.com/2014/07/questioning-the-lambda-architecture.html, accessed on 9/26/2014
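The Kappa-style replay described on this slide (retain the full input, let a second instance re-process it, then query the latest table) can be sketched in plain Python. Everything here is hypothetical: `log`, `build_table`, and the two transform versions are illustrative names, and a real deployment would replay from a retained messaging-system log rather than an in-memory list.

```python
def build_table(log, transform):
    """Replay the whole retained log through one version of the transform."""
    table = {}
    for key, value in log:
        table[key] = transform(table.get(key), value)
    return table


# Retained input: an append-only log that is never modified.
log = [("u1", 3), ("u2", 5), ("u1", 4)]


def transform_v1(current, value):
    """Hypothetical first version with a bug: it overwrites instead of summing."""
    return value


def transform_v2(current, value):
    """Hypothetical fixed version: accumulates per key."""
    return (current or 0) + value


# The first instance serves queries while the second replays the same log.
tables = {"v1": build_table(log, transform_v1)}
tables["v2"] = build_table(log, transform_v2)

# Once the replay has caught up, queries switch to the latest table.
latest = tables["v2"]
```

Because the log is kept unchanged, fixing the transform only requires a replay; there is no separate batch code path to keep in sync.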
  • 22. Speed | Spark Streaming
      • Apply DStreams to built-in:
          • Machine Learning
          • Graph Processing
      • Source: http://spark.apache.org/docs/1.0.0/streaming-programming-guide.html, accessed on 9/26/2014
  • 23. Speed | Spark (Discretized) Streaming
      • Mini-batch +/- windowing
      • Analytics can be run on any of the resultant RDDs
      • No provisions for merging RDDs
      • (Mini-) batch streaming → (mini-) batch analytics
      • Source: http://spark.apache.org/docs/1.0.0/streaming-programming-guide.html, accessed on 9/26/2014
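The mini-batch and windowing idea can be illustrated without Spark. The sketch below is plain Python, not the Spark Streaming API: `mini_batches` and `sliding_windows` are hypothetical helpers that cut a timestamped event list into fixed-length batches and then run a simple analytic (a count) over each sliding window of batches.

```python
from collections import Counter


def mini_batches(events, batch_seconds):
    """Group (timestamp, value) events into consecutive fixed-length batches."""
    batches = {}
    for timestamp, value in events:
        batches.setdefault(int(timestamp // batch_seconds), []).append(value)
    if not batches:
        return []
    first, last = min(batches), max(batches)
    # Intervals with no events become empty batches.
    return [batches.get(i, []) for i in range(first, last + 1)]


def sliding_windows(batches, window, slide):
    """Yield the values of `window` consecutive batches, advancing by `slide`."""
    for start in range(0, len(batches) - window + 1, slide):
        merged = []
        for batch in batches[start:start + window]:
            merged.extend(batch)
        yield merged


events = [(0.2, "a"), (0.9, "b"), (1.1, "a"), (2.5, "c"), (3.0, "a")]
batches = mini_batches(events, batch_seconds=1)

# Batch-style analytics applied to each window of mini-batches.
window_counts = [Counter(w) for w in sliding_windows(batches, window=2, slide=1)]
```

Each window is recomputed from scratch here, which mirrors the slide's point that the analytics remain batch analytics applied to small batches.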
  • 24. Speed | GraphX API for Spark (Image Source: GraphX project)
      • Graph processing engine on Spark
      • Supports Pregel-style vertex programming
      • View same data as either graphs or collections
  • 25. Speed | Spark Streaming for GraphX?
      • Current Spark streaming provides mini-batch streaming
      • No concept of data (model) merging
      • GraphX is currently designed for static graphs:
          1. Merge table data prior to graph pipeline
          2. Re-generate entire (accumulated) graph
          3. Re-run machine learning at each window
      • Straightforward, but wastes computation and time. Can we do better?
  • 26. Speed | Online Version of GraphX
      • Merge information directly into data model used by algorithms
      • Static algorithms → online algorithms
      • Incremental re-computation triggered by changes in data values or data structure
      • Possible with many machine learning algos (PageRank as example)
      • Evolve IM data stores to maximize performance and freshness
          • Better partitioning algorithms → reduced data replication
          • Dynamic indexing → fast retrieval
  • 27. Speed | Online PageRank
      • Static PageRank (delta method) vs. Online PageRank
      • Good for algos with abelian accumulators (commutative, associative, with inverse)
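Why the inverse matters can be shown with a small sketch. This is not the online PageRank implementation from the talk; it is a hypothetical illustration of one PageRank-style accumulation step, where each vertex sums `rank[u] / outdeg(u)` over its in-edges. Because addition has an inverse, adding an edge only needs the old shares retracted and the new shares applied, rather than a full recomputation.

```python
def contributions(out_edges, rank):
    """Full recomputation: each vertex sums rank[u] / outdeg(u) over in-edges."""
    acc = {v: 0.0 for v in rank}
    for u, targets in out_edges.items():
        for v in targets:
            acc[v] += rank[u] / len(targets)
    return acc


def add_edge_incremental(out_edges, rank, acc, u, w):
    """Update `acc` in place after adding edge u -> w, touching only u's targets."""
    old_targets = out_edges[u]
    for v in old_targets:
        # Inverse operation: retract the share u sent before the change.
        acc[v] -= rank[u] / len(old_targets)
    out_edges[u] = old_targets + [w]
    for v in out_edges[u]:
        # Apply the new, smaller share to every current target.
        acc[v] += rank[u] / len(out_edges[u])


out_edges = {"a": ["b"], "b": ["c"], "c": ["a"]}
rank = {"a": 1.0, "b": 1.0, "c": 1.0}

acc = contributions(out_edges, rank)
add_edge_incremental(out_edges, rank, acc, "a", "c")

# The incremental result should match a full recomputation on the new graph.
expected = contributions(out_edges, rank)
```

An accumulator without an inverse (for example `max`) could not retract the old share, so the whole neighborhood would have to be recomputed.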
  • 28. Speed | (Really) Early Results for Online PageRank
      • Setup: algorithm PageRank; reset probability 0.15; convergence threshold 0.001; 1 master + 3 workers; distributed messaging system Kafka; Spark 1.1.0 + our graph streaming
      • [Figure: Convergence Rate. Normalized execution time vs. throughput (edges/second, 50K to 1M), naive vs. incremental]
      • [Figure: Communication Overhead. Normalized messages sent vs. throughput (edges/second, 50K to 1M), naive vs. incremental]
  • 30. Complexity | Challenges
      • Feature Engineering for Data Science
      • Monolithic Analytics → Complex Pipeline Analytics
  • 31. Complexity | Directed Acyclic Graphs of Actions
      • Common in Data Science “feature engineering”
      • Developed iteratively
      • Becomes a new tool in the toolbox
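A feature-engineering pipeline expressed as a directed acyclic graph of actions can be sketched with the standard-library `graphlib` module (Python 3.9+). The step names, the `run_pipeline` helper, and the toy data are all hypothetical; the point is only that each action runs after the actions it depends on.

```python
from graphlib import TopologicalSorter


def run_pipeline(steps, dependencies, data):
    """Run each step after its dependencies, passing their results as inputs."""
    results = {}
    for name in TopologicalSorter(dependencies).static_order():
        inputs = [results[dep] for dep in dependencies.get(name, ())]
        results[name] = steps[name](data, *inputs)
    return results


# Hypothetical actions: each takes the raw data plus its dependencies' outputs.
steps = {
    "tokens": lambda data: [line.split() for line in data],
    "lengths": lambda data, tokens: [len(t) for t in tokens],
    "vocab": lambda data, tokens: sorted({w for t in tokens for w in t}),
    "features": lambda data, lengths, vocab: {
        "max_len": max(lengths),
        "vocab_size": len(vocab),
    },
}

# Maps each action to the actions it depends on (its predecessors in the DAG).
dependencies = {
    "tokens": [],
    "lengths": ["tokens"],
    "vocab": ["tokens"],
    "features": ["lengths", "vocab"],
}

results = run_pipeline(steps, dependencies, ["a b", "b c d"])
```

Once a DAG like this stabilizes, it can be wrapped as a single reusable action, which is one reading of "becomes a new tool in the toolbox".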
  • 32. Complexity | Hierarchical Matching Pursuit for Image Classification
      • Discriminative structures come at multiple scales and varying deformations
      • Feature learning
      • Multiple layers to learn
      • Multipath sparse coding
      • Source: ISTC-Pervasive Computing
  • 33. Complexity | Robust Hierarchical Representations?
      • Robustness
          – Local deformations such as translation, rotation and scaling
          – Lighting condition changes
          – Viewpoint and pose changes
          – Large intra-class variations
      • Hierarchy
          – Sparse data: The total number of possible image patches grows exponentially with their sizes
          – Shared structure: Large patches could share similar or even same small patches
      • Source: ISTC-Pervasive Computing
  • 34. Complexity | Object Recognition on Caltech 256 Benchmark
      #Training Images   15     30     45     60
      Local NBB [1]      33.5   40.1   -      -
      LLC [2]            34.4   41.2   45.3   47.7
      CRBM [3]           35.1   42.1   45.7   47.9
      LASERC [4]         35.2   43.6   -      -
      LP-beta [5]        -      45.8   -      -
      Our Work           41.1   48.7   52.8   56.2
      [1] S. McCann and D. Lowe, CVPR 12
      [2] J. Wang et al, CVPR 10
      [3] K Sohn et al, ICCV 11
      [4] K. Nguyen et al, ECCV 12
      [5] P. Gehler and S. Nowozin, ICCV 09
      • Much better than the state of the art (especially when given more data)
      • Source: Bo, Ren, & Fox, “Multipath Sparse Coding Using Hierarchical Pursuit,” IEEE CVPR 2013
  • 35. Complexity | Deep Learning Library for Spark
      • Diagram labels: Distributed Deep Learning Library, Spark, Hadoop, IA/MIC nodes
      • Open source
      • Spark MLlib contribution
      • Optimized for IA
      • Complete POC in 2015
      • Source: Intel Collaborative Research Institute for Computational Intelligence (ICRI-CI)
  • 37. Intelligence | Challenges
      • Selecting the right data to process
      • Selecting the right features to engineer
      • Selecting the right algorithm to run
  • 38. Intelligence | Ensemble Learning (Wisdom of the Crowd) (Image Source: University of Nebraska-Lincoln)
      • Trade computational power for automated experimentation
      • Tackles the data and algorithm selection problem
      • Diversification methods vary
          • Bagging
          • Boosting
      • Combining techniques vary
          • Majority vote on label
          • Bucket of models
      • N bagged predictors → N times the computation
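Bagging with a majority vote on the label can be sketched in plain Python. This is a minimal illustration under stated assumptions: the base learner is a hypothetical one-dimensional threshold classifier (`fit_stump`), the data set is a toy example, and none of the names come from the talk. Training N predictors costs roughly N times the computation of one, as the slide notes.

```python
import random
from collections import Counter


def fit_stump(sample):
    """Fit a 1-D threshold classifier that predicts 1 when x > threshold."""
    xs = sorted({x for x, _ in sample})
    # Candidate thresholds are midpoints between neighboring distinct values.
    candidates = [(a + b) / 2 for a, b in zip(xs, xs[1:])] or [xs[0]]

    def errors(threshold):
        return sum((x > threshold) != bool(y) for x, y in sample)

    best = min(candidates, key=errors)
    return lambda x: int(x > best)


def bagged_predictors(data, n_predictors, seed=0):
    """Train each predictor on a bootstrap sample drawn with replacement."""
    rng = random.Random(seed)
    return [
        fit_stump([rng.choice(data) for _ in data])
        for _ in range(n_predictors)
    ]


def majority_vote(predictors, x):
    """Combine the predictors by taking the most common predicted label."""
    votes = Counter(p(x) for p in predictors)
    return votes.most_common(1)[0][0]


# Toy data: label is 1 when x >= 5.
data = [(x, int(x >= 5)) for x in range(10)]
predictors = bagged_predictors(data, n_predictors=11)

low_label = majority_vote(predictors, 0)
high_label = majority_vote(predictors, 9)
```

An odd number of predictors avoids ties in a two-class vote.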
  • 39. Intelligence | Beyond Ensemble Learning
      • Downsides of ensemble learning include the number of:
          • Tunable parameters
          • Selection criteria
      • Companies claim that non-parametric methods that require no selection of criteria are in development
      • For now, it’s the Wisdom of the Crowd. Stay tuned!
  • 41. Ease of Use | Data Science Workflow
      • Workflow stages: Ingest & Clean, Engineer Features, Structure Model, Train Model, Query & Analyze, Learn, Visualize
      • Skills shortage at intersection of systems engineering and data analysis
      • Painful data ingestion and preparation
      • Tools that are not designed with loopbacks in mind
      • Pipeline state not easy to manage, especially for collaboration
      • Composing pipeline is DIY
  • 42. Congratulations! You are a data scientist!
  • 43. Decomposing the “data scientist” Source: 2013 Report from Accenture Institute for High Performance
  • 44. Source: http://www.cloudera.com/content/cloudera-content/cloudera-docs/HadoopTutorial/CDH4/Hadoop-Tutorial/ht_wordcount1_source.html, accessed on 9/30/2014 Ease of Use | Programming Languages WordCount: The “Hello World” for Big Data In Java MapReduce
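    The Java source shown on this slide did not survive in the transcript. As a rough illustration of what WordCount computes, here is a single-process sketch of the map, shuffle, and reduce steps in plain Python. It does not use the Hadoop API and is not the Cloudera tutorial code; the function names and sample lines are assumptions.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every whitespace-separated token."""
    for word in line.split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(word, counts):
    """Reduce: sum the per-word counts."""
    return word, sum(counts)


def word_count(lines):
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(word, counts)
                for word, counts in shuffle(pairs).items())


# Hypothetical input lines.
counts = word_count(["Hello World Bye World", "Hello Hadoop Goodbye Hadoop"])
print(counts)
```

    In a real cluster the map and reduce functions run on different machines and the shuffle moves data over the network; the sketch only mirrors the logical dataflow.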
  • 45. Python R Dataflow GUI ... Datacenter / Cloud Network Client “Data Science” API Connect Manage Secure Analyze (distributed and parallel) Manage Secure Connect Analyze (local) Query Big Data Java/Scala/C++ Computational Frameworks Big Data Algorithms Cluster Workload Mgmt Cluster Storage Machine Learning & Statistics Data Wrangling Analyst Skills The Other Skills Ease of Use | Making Big Data Familiar
  • 46. •One consistent API for: –ETL & feature engineering –Including Spark and whatever comes next –Graph construction, databases, analytics, query –Same API for Titan, Neo4j, etc. –Same API for Giraph, GraphX, GraphLab, etc. –Machine learning & statistical analytics •Programming language integration •Extensibility at the core Ease of Use | API Functionality
  • 47. POST https://site.com/joe/graphs/29/transforms
    {
      "operation": "ml.cgd",
      "arguments": [
        {
          "edge_properties": ["rating"],
          "output_property_prefix": "cgd_",
          "vertex_type": "vertex_type",
          "edge_type": "splits",
          "max_supersteps": 20,
          "feature_dimension": 3,
          "convergence_threshold": 0,
          "cgd_lambda": 0.65,
          "learning_output_interval": 1,
          "bias_on": true,
          "num_iters": 3
        }
      ]
    }
    Ease of Use | Run CGD (REST Call)
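    A client-side sketch of how such a call might be assembled with the Python standard library. The URL pattern and argument values are copied from the slide; the helper name, the `Content-Type` header, and the decision to build the request without sending it are assumptions (site.com is a placeholder host, so nothing is actually posted).

```python
import json
import urllib.request


def build_cgd_request(base_url, user, graph_id):
    """Build (but do not send) the POST request that starts a CGD transform."""
    payload = {
        "operation": "ml.cgd",
        "arguments": [{
            "edge_properties": ["rating"],
            "output_property_prefix": "cgd_",
            "vertex_type": "vertex_type",
            "edge_type": "splits",
            "max_supersteps": 20,
            "feature_dimension": 3,
            "convergence_threshold": 0,
            "cgd_lambda": 0.65,
            "learning_output_interval": 1,
            "bias_on": True,
            "num_iters": 3,
        }],
    }
    url = f"{base_url}/{user}/graphs/{graph_id}/transforms"
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},  # assumed header
        method="POST",
    )


request = build_cgd_request("https://site.com", "joe", 29)
print(request.get_method(), request.full_url)
# Against a real endpoint, urllib.request.urlopen(request) would submit it.
```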
  • 48. 201 Created
    {
      "operation": "ml.cgd",
      "arguments": (same as request),
      "id": 2,
      "created": "2014-01-31 10:51:05.1234",
      "depends_on": [
        {
          "link": {"method": "GET", "uri": "https://site.com/v1/graphs/29/transforms/1"},
          "type": "graphbuilder",
          "started": "2014-01-31 10:51:02.8899",
          "eta": null,
          "status": "pending"
        }
      ],
      "links": [
        {"rel": "self", "method": "GET", "uri": "https://site.com/v1/graphs/29/transforms/2"},
        {"rel": "intel:idpat-progress", "method": "GET", "uri": "https://site.com/v1/graphs/29/transforms/2/progress"},
        {"rel": "intel:idpat-cancel", "method": "DELETE", "uri": "https://site.com/user/joe/graphs/29/transforms/2"}
      ]
    }
    Ease of Use | Run CGD (REST Response)
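    The response advertises follow-up actions through its `links` array, so a client can look up a link by its `rel` value rather than hard-coding URLs. The sketch below parses an abbreviated, cleaned-up copy of the slide's response; the `arguments` field is omitted, and the helper names are assumptions rather than part of any published client library.

```python
import json

# Abbreviated copy of the slide's 201 response ("arguments" omitted).
RESPONSE_BODY = """
{
  "operation": "ml.cgd",
  "id": 2,
  "depends_on": [
    {"link": {"method": "GET",
              "uri": "https://site.com/v1/graphs/29/transforms/1"},
     "type": "graphbuilder", "eta": null, "status": "pending"}
  ],
  "links": [
    {"rel": "self", "method": "GET",
     "uri": "https://site.com/v1/graphs/29/transforms/2"},
    {"rel": "intel:idpat-progress", "method": "GET",
     "uri": "https://site.com/v1/graphs/29/transforms/2/progress"},
    {"rel": "intel:idpat-cancel", "method": "DELETE",
     "uri": "https://site.com/user/joe/graphs/29/transforms/2"}
  ]
}
"""


def link_for(response, rel):
    """Return (method, uri) for the link with the given rel, or None."""
    for link in response["links"]:
        if link["rel"] == rel:
            return link["method"], link["uri"]
    return None


def pending_dependencies(response):
    """List the types of upstream transforms that have not finished yet."""
    return [dep["type"] for dep in response["depends_on"]
            if dep["status"] == "pending"]


response = json.loads(RESPONSE_BODY)
progress = link_for(response, "intel:idpat-progress")
print(progress, pending_dependencies(response))
```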
  • 49. Stack layers: FILESYSTEMS AND NOSQL STORAGE; HW PLATFORM; APACHE HADOOP; APACHE SPARK; DATA WRANGLING; MACHINE LEARNING AND STATISTICS (Graphical Algorithms; Classical Algorithms; Graph Construction Tools; Useful String Manipulation; Useful Math Operators); “DATA SCIENCE” API. Intel Analytics Toolkit Ease of Use | Delivering It •Unified UIs across the workflow •Easier feature & model creation •End-to-end graph pipeline •Fully scalable throughout •Multiple data primitives •Optimized for IA •Cloud & On-Prem. Clients: Python Libraries; 3rd Party GUIs/SDKs; Viz Tools; Future Libraries; BI Connectors; Query Interfaces; ...
  • 50. Ease of Use | A Full Spectrum of Analytics
    Algorithm                                      Category                   Applications/Use Cases
    Loopy Belief Propagation (LBP)                 Structured Prediction      Personalized recs, image de-noising
    Label Propagation                              Structured Prediction      Personalized recommendations
    Alternating Least Squares (ALS)                Collaborative Filtering    Recommenders
    Conjugate Gradient Descent (CGD)               Collaborative Filtering    Recommenders
    Connected Components                           Graph Analytics            Network manipulation, image analysis
    Latent Dirichlet Allocation (LDA)              Topic Modeling             Document clustering
    Structure Attribute                            Clustering                 Network analysis, consumer seg
    K-Truss                                        Clustering                 Social network analysis
    KNN*                                           Clustering                 Recommenders
    Logistic Regression*                           Classification             Fraud detection
    Random Forest*                                 Classification             Fraud detection, consumer seg
    Generalized Linear Model (Binomial, Poisson)   Non-linear Curve Fitting   Forecasting, pricing, market mix models
    Association Rule Mining                        Data Mining                Market basket analysis, recommenders
    Frequent Pattern Mining*                       Data Mining                Pattern recognition
    Approach (row-group label on the slide): Graph
  • 51. Real Time Database BQL – BigDAWG Query Language & Compiler Analytics Libraries Hardware Platforms Applications, Visualization, Languages “Narrow waist” provides portability Historical / Analytics Databases Spill Stream Ease of Use | Future Vision – BigDAWG
  • 52. Ease of Use | Future Vision – BigDAWG Real Time DBMSs BQL – BigDAWG Query Language & Compiler Visualization & Presentation, e.g., ScalaR, imMens, TweetMap, Prefetching Languages, e.g, Julia, R, MLbase, GraphLab SciDB Analytics, e.g., ScaLAPACK, ML algos, plsh, other analytics packages TupleWare Hardware Platforms, e.g., NVM simulator, Xeon Phi, Xeon Applications, e.g., medical data, astronomy, Twitter, urban sensing, IoT TileDB S-Store “Narrow waist” provides portability MyriaX Historical / Analytics DBMSs Spill Stream
  • 53. Ease of Use | BigDAWG Deliverables ‘15-’16 •Complete prototype “big data” stack and reference implementation •Battle-tested on multiple use cases •Standard federation language (BQL) •Next-generation interface for analytics •Next-generation stream processing system Stay Tuned! http://istc-bigdata.org/
  • 54. 1. Big Data visualization, especially graph* 2. Big Data DB that supports relational and graph equally* 3. A better workflow manager (like Oozie for Hadoop, Spark, etc.)* 4. UI partners (R, Julia, etc.) 5. Better portable machine learning models (like PMML) that also capture feature engineering (not just algos) 6. Cluster monitoring (GUI, etc.) that works across many big data tools 7. Distributed debugger (for Spark clusters, etc.) for profiling and troubleshooting 8. Cluster auto-configuration tools (* Open source STRONGLY preferred) Technology Wish List
  • 55. •Intel Analytics Toolkit Beta program (now-January ’15) Have a POC, particularly graph? TED.WILLKE@INTEL.COM •GRADES 2015, Melbourne Australia, May 31, 2015 Papers due March 15. HTTP://EVENT.CWI.NL/GRADES2015/ Call to Action