SlideShare ist ein Scribd-Unternehmen logo
1 von 32
Downloaden Sie, um offline zu lesen
A Database-Hadoop Hybrid Approach
to Scalable Machine Learning
Makoto YUI, Isao Kojima AIST, Japan
<m.yui@aist.go.jp>
June 30, 2013
IEEE BigData Congress 2013, Santa Clara
Outline
1. Motivation & Problem Description
2. Our Hybrid Approach to Scalable Machine
Learning
 Architecture
 Our batch learning scheme on Hive
3. Experimental Evaluation
4. Conclusions and Future Directions
2
3
As we seen in the Keynote and the Panel discussion (2nd day)
of this conference, data analytics and machine learning are
obviously getting more attentions along with Big Data
Suppose then that you were a developer and your
manager is willing to ..
Manager
Developer
What’s possible choices around there?
In-database Analytics
 MADlib (open-source project lead by Greenplum)
 Bismarck (Project at wisc.edu, SIGMOD’12)
 SAS In-database Analytics
 Fuzzy Logix (Sybase)
and more..
Machine Learning on Hadoop
 Apache Mahout
 Vowpal Wabbit (open-source project at MS research)
 In-house analytical tools
e.g., Twitter, SIGMOD’12
4
Two popular schools of thought for performing large-scale
machine learning that does not fit in memory space
4 Issues needed to be considered
1. Scalability
5
Scalability would always be a problem when
handing Big Data
4 Issues needed to be considered
1. Scalability
2. Data movement
6
Data movement is important because
Moving data is a critical issue when the size of
dataset shift from terabytes to petabytes and
beyond
4 Issues needed to be considered
1. Scalability
2. Data movement
3. Transactions
7
Considering transactions is important for real-
time/online predictions because most of
transaction records, which are valuable for
predictions, are stored in relational databases
4 Issues needed to be considered
1. Scalability
2. Data movement
3. Transactions
4. Latency and throughput
8
Latency and throughput are the key issues for
achieving online prediction and/or real-time
analytics
Which is better?
9
Scalability
Data
movement
Transactions Latency Throughput
In-database
analytics
Machine learning
on Hadoop
+ Fault Tolerance
+ Straggler node
handling
+ Scale-out
Which is better?
10
Scalability
Data
movement
Transactions Latency Throughput
In-database
analytics
Machine learning
on Hadoop
It depends on where the data
initially stored and purposes
of using the data
+ HDFS is useful for
append-only and
archiving purposes
+ ETL Processing
(feature engineering)
+ RDBMS is reliable
as a transactional
data store
Which is better?
11
Scalability
Data
movement
Transactions Latency Throughput
In-database
analytics
Machine learning
on Hadoop
+ Small fraction
updates
+ Index-lookup for
online prediction
Which is better?
12
Scalability
Data
movement
Transactions
In-database
analytics
Machine learning
on Hadoop
+ Incremental
learning for each
training instance
- High latency
bottleneck in job
submitting process
+ Batch processing
Latency Throughput
Idea behind DB-Hadoop Hybrid approach
13
scalability
Data
movement
Transactions
Batch learning
on Hadoop
Incremental learning
and prediction in
a relational database
Just an illustration, you know
Latency Throughput
Inside the box
– How tocombine them
14
Postgres
Hadoop cluster
node
node
node
・・・
OLTP
transactions
Training data
Prediction model
Incremental
learning
implemented as a
database stored procedure
Trickle training data to Hadoop HDFS little by little
and bringing back prediction models periodicity
Batch
learning
15
Trickle
updates
Source
database
Trickle updates
in the queue periodically
Hadoop clusterRelational Database
Staging
table
Pull updates in the queue
Training
data sink
The Detailed Architecture
―DatatoPredictionCycle
Incremental
Learner
16
Trickle
updates
Source
database
Hadoop clusterRelational Database
Staging
table
Pull updates in the queue
Training
data sink
Prediction
model
Batch learning process
build a model
The Detailed Architecture
―DatatoPredictionCycle
Incremental
Learner
up-to-date
model Export a prediction
model
Trickle updates
in the queue periodically
17
Trickle
updates
Source
database
Hadoop clusterRelational Database
Staging
table
Pull updates in the queue
Training
data sink
Prediction
model
Batch learning process
build a model
The Detailed Architecture
―DatatoPredictionCycle
Incremental
Learner
up-to-date
model
Select the latest one
Insert a new one Export a prediction
model
Trickle updates
in the queue periodically
18
Trickle
updates
Source
database
Hadoop clusterRelational Database
Staging
table
Pull updates in the queue
Training
data sink
Prediction
model
Batch learning process
build a model
The Detailed Architecture
―DatatoPredictionCycle
Incremental
Learner
up-to-date
model
Select the latest one
Insert a new one Export a prediction
model
Transactional
updates
Online prediction
User can control the flow considering requirements and performance
 Real-time prediction is possible using database triggers on the staging table
Trickle updates
in the queue periodically
19
Trickle
updates
Source
database
Hadoop clusterRelational Database
Staging
table
Pull updates in the queue
Training
data sink
Prediction
model
Batch learning process
build a model
The Detailed Architecture
―DatatoPredictionCycle
Incremental
Learner
up-to-date
model
Select the latest one
Insert a new one Export a prediction
model
The workflow consists of continuous and
independent processes
Trickle updates
in the queue periodically
Existing Approach for Parallel Batch Learning
―MachineLearningasUserDefinedAggregates(UDAF)
20
train train
+1, <1,2>
..
+1, <1,7,9>
-1, <1,3, 9>
..
+1, <3,8>
merge
tuple
<label, array<features >
array<weight>
array<sum of weight>,
array<count>
Training
table
Prediction model
UDAF
-1, <2,7, 9>
..
+1, <3,8>
final
merge
merge
-1, <2,7, 9>
..
+1, <3,8>
train train
array<weight>
 Bottleneck in the final merge
Scalability is limited by the maximum
fan-out of the final merge
 Scalar aggregates computing a
large single result are not suitable
for S/N settings
 Parallel aggregation (as one in
Google Dremel) is not supported
in Hadoop/MapReduce
Aggregate tree (parallel aggregation)
to merge prediction models Problems Observed
Even though MPP databases and Hive parallelize
user-defined aggregates, the above problems
prevent using it
Purely Relational Approach for Parallel Learning
 Implemented a trainer as a set-returning
function, instead of UDAF
The purely relational way that scales on MPP
and Hive/Hadoop
 Shuffle by feature to Reducers
 Run trainers independently on mappers
and aggregate the results on reducers
Embarrassingly parallel as # of mappers and
reducers is controllable
21
+1, <1,2>
..
+1, <1,7,9>
-1, <1,3, 9>
..
+1, <3,8>
train train
tuple
<label, array<features>>
tuple<feature, weights>
Prediction model
UDTF
Relation
<feature, weights>
param-mix param-mix
Training
table
Shuffle
by feature
Our solution for parallel machine
learning on Hadoop/Hive
SELECT
feature, -- reducers perform model averaging in parallel
avg(weight) as weight
FROM (
SELECT trainLogistic(features,label,..) as (feature,weight)
FROM train
) t -- map-only task
GROUP BY feature; -- shuffled to reducers
Parameter Mixing
K. B. Hall et al. in Proc. NIPS workshop on Leaning on Cores, Clusters, and Clouds, 2010.
Key points
Experimental Evaluation
1. Compared the performance of our batch learning scheme
to state-of-the-art machine learning techniques, namely
Bismarck and Vowpal Wabbit
2. Conducted a online prediction scenario to see the latency
and throughput of our incremental learning scheme
Dataset KDD Cup 2012, Track 2 dataset, which is one of the one of the
largest publically available datasets for machine learning, provided by a
commercial search engine provider
Experimental Environment
In-house 33 commodity servers (32 slaves nodes for Hadoop)
each equipped with 8 processors and 24 GB memory
22
Given a prediction model is created with 80% of training data by batch
learning, the rest of data (20%) is supplied for incremental learning
 The task is predicting Click-Through-Rates of search engine ads
 The training data is about 235 million records in 23 GB
Performance Evaluation of Batch Learning
Our batch learning scheme on Hive is 5 and 7.65 times faster than
Vowpal Wabbit and Bismarck, respectively
23
AUC value (Green
Bar) represents
prediction accuracy
5x
7.65x
Throughput: 2.3 million tuples/sec on 32 nodes
Latency: 96 sec for training 235 million records of 23 GB
CAUTION: you can find the detailed number and setting in our paper
Performance Analysis in the Evaluation
24
Source
database
Hadoop clusterRelational Database
Staging
table
Training
data sink
Prediction
model
Incremental
Learner
up-to-date
model
 Low latency (5 sec) under
moderate (70,000 tuples/sec)
updates
 96 sec for training
 Excellent throughput (2.3 million
tuples/sec) on 32 nodes
5 s
96 s
Performance Analysis in the Evaluation
25
 Sqoop required 3 min 32 s to migrate a prediction model
80% model containing about 1.56 million records (323MB)
 Model conversion to a dense format, which is suited for online
learning/prediction on Postgres, required 58 seconds
Source
database
Hadoop clusterRelational Database
Staging
table
Training
data sink
Prediction
model
Incremental
Learner
up-to-date
model
5 s
96 s
212 s58 s
Non-trivial costs in model migration
Performance Analysis in the Evaluation
26
Source
database
Hadoop clusterRelational Database
Staging
table
Training
data sink
Prediction
model
Incremental
Learner
up-to-date
model
 “Data migration time > Training time” justifies the rationale
behind in-database analytics
 The cost of moving data is critical for online prediction as
well as in Big Data analysis
Model migration costs could be amortized with our approach
Key observations
Conclusions
 DB-Hadoop hybrid architecture for online prediction
in which the prediction model needs to be updated in a low latency process
 Design principal for achieving scalable machine learning on
Hadoop/Hive
Good throughput and Scalability
Our Batch learning scheme on Hive is 5 and 7.65 times faster
than Vowpal Wabbit and Bismarck, respectively
Acceptably Small Latency
Possibly less than 5 s, under moderate transactional updates
27
Going hybrid brings low latency to Big Data analytics
Directions for Future Work
28
Source
database
Hadoop clusterRelational Database
Staging
table
Training
data sink
Prediction
model
Incremental
Learner
up-to-date
model
Online
testing
 Integrating online testing schemes
(e.g., Multi-armed Bandits and A/B testing)
to the prediction pipeline
 Develop a scheme to select the best
prediction model among past models
for each user in each session
Online
prediction
Backup slides
29
Evaluation of Incremental Learning
 Given a prediction model is created with 80% of training
data by batch learning, the rest of data (20% ) is supplied for
incremental learning
30
Built model with ..
elapsed time
(in sec)
Throughput
tuples/sec AUC
Batch only (80%) 96.33 2067418.3 0.7177
+0.1% updates (80.1%) 4.99 36155.4 0.7197
+1% updates (81%) 25.96 69812.8 0.7242
+10% updates (90%) 256.03 71278.1 0.7291
+20% updates (100%) 499.61 72901.4 0.7349
Batch only (100%) 102.52 2298010.8 0.7356
Directions for Future Work
31
Source
database
Hadoop clusterRelational Database
Staging
table
Training
data sink
Prediction
model
Incremental
Learner
up-to-date
model
Take a common setting for OLTP that database is partitioned
across servers (a.k.a. Database Sharding) into consideration
Special Thanks
Font
Lato by Łukasz Dziedzic
Symbols by the Noun Project
Data Analysis designed by Brennan Novak
Elephant designed by Ted Mitchner
Scale designed by Laurent Patain
Heavy Load designed by Olivier Guin
Receipt designed by Benjamin Orlovski
Gauge designed by Márcio Duarte
Stopwatch designed by Ilsur Aptukov
Box designed by Travis J. Lee
Sprint Cycle designed by Jeremy J Bristol
Dilbert characters by Scott Adams Inc.
10-12-10 and 7-29-12
32

Weitere ähnliche Inhalte

Was ist angesagt?

Hadoop: The Default Machine Learning Platform ?
Hadoop: The Default Machine Learning Platform ?Hadoop: The Default Machine Learning Platform ?
Hadoop: The Default Machine Learning Platform ?Milind Bhandarkar
 
Hopsworks at Google AI Huddle, Sunnyvale
Hopsworks at Google AI Huddle, SunnyvaleHopsworks at Google AI Huddle, Sunnyvale
Hopsworks at Google AI Huddle, SunnyvaleJim Dowling
 
Big Data Analytics-Open Source Toolkits
Big Data Analytics-Open Source ToolkitsBig Data Analytics-Open Source Toolkits
Big Data Analytics-Open Source ToolkitsDataWorks Summit
 
Simple, Modular and Extensible Big Data Platform Concept
Simple, Modular and Extensible Big Data Platform ConceptSimple, Modular and Extensible Big Data Platform Concept
Simple, Modular and Extensible Big Data Platform ConceptSatish Mohan
 
Hopsworks in the cloud Berlin Buzzwords 2019
Hopsworks in the cloud Berlin Buzzwords 2019 Hopsworks in the cloud Berlin Buzzwords 2019
Hopsworks in the cloud Berlin Buzzwords 2019 Jim Dowling
 
Generating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNE
Generating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNEGenerating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNE
Generating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNEDataWorks Summit/Hadoop Summit
 
HopsML Meetup talk on Hopsworks + ROCm/AMD June 2019
HopsML Meetup talk on Hopsworks + ROCm/AMD June 2019HopsML Meetup talk on Hopsworks + ROCm/AMD June 2019
HopsML Meetup talk on Hopsworks + ROCm/AMD June 2019Jim Dowling
 
Extending Hadoop for Fun & Profit
Extending Hadoop for Fun & ProfitExtending Hadoop for Fun & Profit
Extending Hadoop for Fun & ProfitMilind Bhandarkar
 
Functional Programming and Big Data
Functional Programming and Big DataFunctional Programming and Big Data
Functional Programming and Big DataDataWorks Summit
 
An intriduction to hive
An intriduction to hiveAn intriduction to hive
An intriduction to hiveReza Ameri
 
Hivemall: Scalable machine learning library for Apache Hive/Spark/Pig
Hivemall: Scalable machine learning library for Apache Hive/Spark/PigHivemall: Scalable machine learning library for Apache Hive/Spark/Pig
Hivemall: Scalable machine learning library for Apache Hive/Spark/PigDataWorks Summit/Hadoop Summit
 
Cost effective BigData Processing on Amazon EC2
Cost effective BigData Processing on Amazon EC2Cost effective BigData Processing on Amazon EC2
Cost effective BigData Processing on Amazon EC2Sujee Maniyam
 
BIGDATA- Survey on Scheduling Methods in Hadoop MapReduce
BIGDATA- Survey on Scheduling Methods in Hadoop MapReduceBIGDATA- Survey on Scheduling Methods in Hadoop MapReduce
BIGDATA- Survey on Scheduling Methods in Hadoop MapReduceMahantesh Angadi
 
The Zoo Expands: Labrador *Loves* Elephant, Thanks to Hamster
The Zoo Expands: Labrador *Loves* Elephant, Thanks to HamsterThe Zoo Expands: Labrador *Loves* Elephant, Thanks to Hamster
The Zoo Expands: Labrador *Loves* Elephant, Thanks to HamsterMilind Bhandarkar
 
Future of Data Intensive Applicaitons
Future of Data Intensive ApplicaitonsFuture of Data Intensive Applicaitons
Future of Data Intensive ApplicaitonsMilind Bhandarkar
 
Hadoop Internals (2.3.0 or later)
Hadoop Internals (2.3.0 or later)Hadoop Internals (2.3.0 or later)
Hadoop Internals (2.3.0 or later)Emilio Coppa
 

Was ist angesagt? (20)

Hadoop: The Default Machine Learning Platform ?
Hadoop: The Default Machine Learning Platform ?Hadoop: The Default Machine Learning Platform ?
Hadoop: The Default Machine Learning Platform ?
 
Hopsworks at Google AI Huddle, Sunnyvale
Hopsworks at Google AI Huddle, SunnyvaleHopsworks at Google AI Huddle, Sunnyvale
Hopsworks at Google AI Huddle, Sunnyvale
 
Big Data Analytics-Open Source Toolkits
Big Data Analytics-Open Source ToolkitsBig Data Analytics-Open Source Toolkits
Big Data Analytics-Open Source Toolkits
 
Simple, Modular and Extensible Big Data Platform Concept
Simple, Modular and Extensible Big Data Platform ConceptSimple, Modular and Extensible Big Data Platform Concept
Simple, Modular and Extensible Big Data Platform Concept
 
Hopsworks in the cloud Berlin Buzzwords 2019
Hopsworks in the cloud Berlin Buzzwords 2019 Hopsworks in the cloud Berlin Buzzwords 2019
Hopsworks in the cloud Berlin Buzzwords 2019
 
Generating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNE
Generating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNEGenerating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNE
Generating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNE
 
HopsML Meetup talk on Hopsworks + ROCm/AMD June 2019
HopsML Meetup talk on Hopsworks + ROCm/AMD June 2019HopsML Meetup talk on Hopsworks + ROCm/AMD June 2019
HopsML Meetup talk on Hopsworks + ROCm/AMD June 2019
 
Extending Hadoop for Fun & Profit
Extending Hadoop for Fun & ProfitExtending Hadoop for Fun & Profit
Extending Hadoop for Fun & Profit
 
Functional Programming and Big Data
Functional Programming and Big DataFunctional Programming and Big Data
Functional Programming and Big Data
 
An intriduction to hive
An intriduction to hiveAn intriduction to hive
An intriduction to hive
 
Hadoop Architecture
Hadoop ArchitectureHadoop Architecture
Hadoop Architecture
 
Hivemall: Scalable machine learning library for Apache Hive/Spark/Pig
Hivemall: Scalable machine learning library for Apache Hive/Spark/PigHivemall: Scalable machine learning library for Apache Hive/Spark/Pig
Hivemall: Scalable machine learning library for Apache Hive/Spark/Pig
 
Cost effective BigData Processing on Amazon EC2
Cost effective BigData Processing on Amazon EC2Cost effective BigData Processing on Amazon EC2
Cost effective BigData Processing on Amazon EC2
 
BIGDATA- Survey on Scheduling Methods in Hadoop MapReduce
BIGDATA- Survey on Scheduling Methods in Hadoop MapReduceBIGDATA- Survey on Scheduling Methods in Hadoop MapReduce
BIGDATA- Survey on Scheduling Methods in Hadoop MapReduce
 
October 2014 HUG : Hive On Spark
October 2014 HUG : Hive On SparkOctober 2014 HUG : Hive On Spark
October 2014 HUG : Hive On Spark
 
Hadoop Primer
Hadoop PrimerHadoop Primer
Hadoop Primer
 
The Zoo Expands: Labrador *Loves* Elephant, Thanks to Hamster
The Zoo Expands: Labrador *Loves* Elephant, Thanks to HamsterThe Zoo Expands: Labrador *Loves* Elephant, Thanks to Hamster
The Zoo Expands: Labrador *Loves* Elephant, Thanks to Hamster
 
Future of Data Intensive Applicaitons
Future of Data Intensive ApplicaitonsFuture of Data Intensive Applicaitons
Future of Data Intensive Applicaitons
 
Hadoop Internals (2.3.0 or later)
Hadoop Internals (2.3.0 or later)Hadoop Internals (2.3.0 or later)
Hadoop Internals (2.3.0 or later)
 
Hadoop tutorial
Hadoop tutorialHadoop tutorial
Hadoop tutorial
 

Ähnlich wie A Database-Hadoop Hybrid Approach to Scalable Machine Learning

Anusua Trivedi, Data Scientist at Texas Advanced Computing Center (TACC), UT ...
Anusua Trivedi, Data Scientist at Texas Advanced Computing Center (TACC), UT ...Anusua Trivedi, Data Scientist at Texas Advanced Computing Center (TACC), UT ...
Anusua Trivedi, Data Scientist at Texas Advanced Computing Center (TACC), UT ...MLconf
 
BIGDATA- Survey on Scheduling Methods in Hadoop MapReduce Framework
BIGDATA- Survey on Scheduling Methods in Hadoop MapReduce FrameworkBIGDATA- Survey on Scheduling Methods in Hadoop MapReduce Framework
BIGDATA- Survey on Scheduling Methods in Hadoop MapReduce FrameworkMahantesh Angadi
 
Introduction to Apache Hadoop
Introduction to Apache HadoopIntroduction to Apache Hadoop
Introduction to Apache HadoopChristopher Pezza
 
Shared slides-edbt-keynote-03-19-13
Shared slides-edbt-keynote-03-19-13Shared slides-edbt-keynote-03-19-13
Shared slides-edbt-keynote-03-19-13Daniel Abadi
 
Design Issues and Challenges of Peer-to-Peer Video on Demand System
Design Issues and Challenges of Peer-to-Peer Video on Demand System Design Issues and Challenges of Peer-to-Peer Video on Demand System
Design Issues and Challenges of Peer-to-Peer Video on Demand System cscpconf
 
Survey of Parallel Data Processing in Context with MapReduce
Survey of Parallel Data Processing in Context with MapReduce Survey of Parallel Data Processing in Context with MapReduce
Survey of Parallel Data Processing in Context with MapReduce cscpconf
 
Survey on Performance of Hadoop Map reduce Optimization Methods
Survey on Performance of Hadoop Map reduce Optimization MethodsSurvey on Performance of Hadoop Map reduce Optimization Methods
Survey on Performance of Hadoop Map reduce Optimization Methodspaperpublications3
 
Harnessing Hadoop and Big Data to Reduce Execution Times
Harnessing Hadoop and Big Data to Reduce Execution TimesHarnessing Hadoop and Big Data to Reduce Execution Times
Harnessing Hadoop and Big Data to Reduce Execution TimesDavid Tjahjono,MD,MBA(UK)
 
Sawmill - Integrating R and Large Data Clouds
Sawmill - Integrating R and Large Data CloudsSawmill - Integrating R and Large Data Clouds
Sawmill - Integrating R and Large Data CloudsRobert Grossman
 
Performance evaluation and estimation model using regression method for hadoo...
Performance evaluation and estimation model using regression method for hadoo...Performance evaluation and estimation model using regression method for hadoo...
Performance evaluation and estimation model using regression method for hadoo...redpel dot com
 
Learn Hadoop
Learn HadoopLearn Hadoop
Learn HadoopEdureka!
 
How Hadoop Revolutionized Data Warehousing at Yahoo and Facebook
How Hadoop Revolutionized Data Warehousing at Yahoo and FacebookHow Hadoop Revolutionized Data Warehousing at Yahoo and Facebook
How Hadoop Revolutionized Data Warehousing at Yahoo and FacebookAmr Awadallah
 
PERFORMANCE EVALUATION OF BIG DATA PROCESSING OF CLOAK-REDUCE
PERFORMANCE EVALUATION OF BIG DATA PROCESSING OF CLOAK-REDUCEPERFORMANCE EVALUATION OF BIG DATA PROCESSING OF CLOAK-REDUCE
PERFORMANCE EVALUATION OF BIG DATA PROCESSING OF CLOAK-REDUCEijdpsjournal
 
International Journal of Distributed and Parallel systems (IJDPS)
International Journal of Distributed and Parallel systems (IJDPS)International Journal of Distributed and Parallel systems (IJDPS)
International Journal of Distributed and Parallel systems (IJDPS)ijdpsjournal
 
A data aware caching 2415
A data aware caching 2415A data aware caching 2415
A data aware caching 2415SANTOSH WAYAL
 
Hadoop live online training
Hadoop live online trainingHadoop live online training
Hadoop live online trainingHarika583
 
Hadoop ecosystem framework n hadoop in live environment
Hadoop ecosystem framework  n hadoop in live environmentHadoop ecosystem framework  n hadoop in live environment
Hadoop ecosystem framework n hadoop in live environmentDelhi/NCR HUG
 
Big Data Analytics 2014
Big Data Analytics 2014Big Data Analytics 2014
Big Data Analytics 2014Stratebi
 

Ähnlich wie A Database-Hadoop Hybrid Approach to Scalable Machine Learning (20)

Anusua Trivedi, Data Scientist at Texas Advanced Computing Center (TACC), UT ...
Anusua Trivedi, Data Scientist at Texas Advanced Computing Center (TACC), UT ...Anusua Trivedi, Data Scientist at Texas Advanced Computing Center (TACC), UT ...
Anusua Trivedi, Data Scientist at Texas Advanced Computing Center (TACC), UT ...
 
BIGDATA- Survey on Scheduling Methods in Hadoop MapReduce Framework
BIGDATA- Survey on Scheduling Methods in Hadoop MapReduce FrameworkBIGDATA- Survey on Scheduling Methods in Hadoop MapReduce Framework
BIGDATA- Survey on Scheduling Methods in Hadoop MapReduce Framework
 
Introduction to Apache Hadoop
Introduction to Apache HadoopIntroduction to Apache Hadoop
Introduction to Apache Hadoop
 
Shared slides-edbt-keynote-03-19-13
Shared slides-edbt-keynote-03-19-13Shared slides-edbt-keynote-03-19-13
Shared slides-edbt-keynote-03-19-13
 
Design Issues and Challenges of Peer-to-Peer Video on Demand System
Design Issues and Challenges of Peer-to-Peer Video on Demand System Design Issues and Challenges of Peer-to-Peer Video on Demand System
Design Issues and Challenges of Peer-to-Peer Video on Demand System
 
Survey of Parallel Data Processing in Context with MapReduce
Survey of Parallel Data Processing in Context with MapReduce Survey of Parallel Data Processing in Context with MapReduce
Survey of Parallel Data Processing in Context with MapReduce
 
IJET-V3I2P24
IJET-V3I2P24IJET-V3I2P24
IJET-V3I2P24
 
Survey on Performance of Hadoop Map reduce Optimization Methods
Survey on Performance of Hadoop Map reduce Optimization MethodsSurvey on Performance of Hadoop Map reduce Optimization Methods
Survey on Performance of Hadoop Map reduce Optimization Methods
 
Harnessing Hadoop and Big Data to Reduce Execution Times
Harnessing Hadoop and Big Data to Reduce Execution TimesHarnessing Hadoop and Big Data to Reduce Execution Times
Harnessing Hadoop and Big Data to Reduce Execution Times
 
Sawmill - Integrating R and Large Data Clouds
Sawmill - Integrating R and Large Data CloudsSawmill - Integrating R and Large Data Clouds
Sawmill - Integrating R and Large Data Clouds
 
Performance evaluation and estimation model using regression method for hadoo...
Performance evaluation and estimation model using regression method for hadoo...Performance evaluation and estimation model using regression method for hadoo...
Performance evaluation and estimation model using regression method for hadoo...
 
Learn Hadoop
Learn HadoopLearn Hadoop
Learn Hadoop
 
How Hadoop Revolutionized Data Warehousing at Yahoo and Facebook
How Hadoop Revolutionized Data Warehousing at Yahoo and FacebookHow Hadoop Revolutionized Data Warehousing at Yahoo and Facebook
How Hadoop Revolutionized Data Warehousing at Yahoo and Facebook
 
PERFORMANCE EVALUATION OF BIG DATA PROCESSING OF CLOAK-REDUCE
PERFORMANCE EVALUATION OF BIG DATA PROCESSING OF CLOAK-REDUCEPERFORMANCE EVALUATION OF BIG DATA PROCESSING OF CLOAK-REDUCE
PERFORMANCE EVALUATION OF BIG DATA PROCESSING OF CLOAK-REDUCE
 
International Journal of Distributed and Parallel systems (IJDPS)
International Journal of Distributed and Parallel systems (IJDPS)International Journal of Distributed and Parallel systems (IJDPS)
International Journal of Distributed and Parallel systems (IJDPS)
 
A data aware caching 2415
A data aware caching 2415A data aware caching 2415
A data aware caching 2415
 
Hadoop live online training
Hadoop live online trainingHadoop live online training
Hadoop live online training
 
Hadoop ecosystem framework n hadoop in live environment
Hadoop ecosystem framework  n hadoop in live environmentHadoop ecosystem framework  n hadoop in live environment
Hadoop ecosystem framework n hadoop in live environment
 
Big Data Analytics 2014
Big Data Analytics 2014Big Data Analytics 2014
Big Data Analytics 2014
 
Neo4j vs giraph
Neo4j vs giraphNeo4j vs giraph
Neo4j vs giraph
 

Mehr von Makoto Yui

Apache Hivemall and my OSS experience
Apache Hivemall and my OSS experienceApache Hivemall and my OSS experience
Apache Hivemall and my OSS experienceMakoto Yui
 
Introduction to Apache Hivemall v0.5.2 and v0.6
Introduction to Apache Hivemall v0.5.2 and v0.6Introduction to Apache Hivemall v0.5.2 and v0.6
Introduction to Apache Hivemall v0.5.2 and v0.6Makoto Yui
 
Introduction to Apache Hivemall v0.5.0
Introduction to Apache Hivemall v0.5.0Introduction to Apache Hivemall v0.5.0
Introduction to Apache Hivemall v0.5.0Makoto Yui
 
Idea behind Apache Hivemall
Idea behind Apache HivemallIdea behind Apache Hivemall
Idea behind Apache HivemallMakoto Yui
 
Introduction to Apache Hivemall v0.5.0
Introduction to Apache Hivemall v0.5.0Introduction to Apache Hivemall v0.5.0
Introduction to Apache Hivemall v0.5.0Makoto Yui
 
What's new in Hivemall v0.5.0
What's new in Hivemall v0.5.0What's new in Hivemall v0.5.0
What's new in Hivemall v0.5.0Makoto Yui
 
What's new in Apache Hivemall v0.5.0
What's new in Apache Hivemall v0.5.0What's new in Apache Hivemall v0.5.0
What's new in Apache Hivemall v0.5.0Makoto Yui
 
Revisiting b+-trees
Revisiting b+-treesRevisiting b+-trees
Revisiting b+-treesMakoto Yui
 
Incubating Apache Hivemall
Incubating Apache HivemallIncubating Apache Hivemall
Incubating Apache HivemallMakoto Yui
 
Hivemall meets Digdag @Hackertackle 2018-02-17
Hivemall meets Digdag @Hackertackle 2018-02-17Hivemall meets Digdag @Hackertackle 2018-02-17
Hivemall meets Digdag @Hackertackle 2018-02-17Makoto Yui
 
Apache Hivemall @ Apache BigData '17, Miami
Apache Hivemall @ Apache BigData '17, MiamiApache Hivemall @ Apache BigData '17, Miami
Apache Hivemall @ Apache BigData '17, MiamiMakoto Yui
 
機械学習のデータ並列処理@第7回BDI研究会
機械学習のデータ並列処理@第7回BDI研究会機械学習のデータ並列処理@第7回BDI研究会
機械学習のデータ並列処理@第7回BDI研究会Makoto Yui
 
Podling Hivemall in the Apache Incubator
Podling Hivemall in the Apache IncubatorPodling Hivemall in the Apache Incubator
Podling Hivemall in the Apache IncubatorMakoto Yui
 
Dots20161029 myui
Dots20161029 myuiDots20161029 myui
Dots20161029 myuiMakoto Yui
 
Hadoopsummit16 myui
Hadoopsummit16 myuiHadoopsummit16 myui
Hadoopsummit16 myuiMakoto Yui
 
HadoopCon'16, Taipei @myui
HadoopCon'16, Taipei @myuiHadoopCon'16, Taipei @myui
HadoopCon'16, Taipei @myuiMakoto Yui
 
Recommendation 101 using Hivemall
Recommendation 101 using HivemallRecommendation 101 using Hivemall
Recommendation 101 using HivemallMakoto Yui
 
Hivemall dbtechshowcase 20160713 #dbts2016
Hivemall dbtechshowcase 20160713 #dbts2016Hivemall dbtechshowcase 20160713 #dbts2016
Hivemall dbtechshowcase 20160713 #dbts2016Makoto Yui
 
Tdtechtalk20160425myui
Tdtechtalk20160425myuiTdtechtalk20160425myui
Tdtechtalk20160425myuiMakoto Yui
 
Tdtechtalk20160330myui
Tdtechtalk20160330myuiTdtechtalk20160330myui
Tdtechtalk20160330myuiMakoto Yui
 

Mehr von Makoto Yui (20)

Apache Hivemall and my OSS experience
Apache Hivemall and my OSS experienceApache Hivemall and my OSS experience
Apache Hivemall and my OSS experience
 
Introduction to Apache Hivemall v0.5.2 and v0.6
Introduction to Apache Hivemall v0.5.2 and v0.6Introduction to Apache Hivemall v0.5.2 and v0.6
Introduction to Apache Hivemall v0.5.2 and v0.6
 
Introduction to Apache Hivemall v0.5.0
Introduction to Apache Hivemall v0.5.0Introduction to Apache Hivemall v0.5.0
Introduction to Apache Hivemall v0.5.0
 
Idea behind Apache Hivemall
Idea behind Apache HivemallIdea behind Apache Hivemall
Idea behind Apache Hivemall
 
Introduction to Apache Hivemall v0.5.0
Introduction to Apache Hivemall v0.5.0Introduction to Apache Hivemall v0.5.0
Introduction to Apache Hivemall v0.5.0
 
What's new in Hivemall v0.5.0
What's new in Hivemall v0.5.0What's new in Hivemall v0.5.0
What's new in Hivemall v0.5.0
 
What's new in Apache Hivemall v0.5.0
What's new in Apache Hivemall v0.5.0What's new in Apache Hivemall v0.5.0
What's new in Apache Hivemall v0.5.0
 
Revisiting b+-trees
Revisiting b+-treesRevisiting b+-trees
Revisiting b+-trees
 
Incubating Apache Hivemall
Incubating Apache HivemallIncubating Apache Hivemall
Incubating Apache Hivemall
 
Hivemall meets Digdag @Hackertackle 2018-02-17
Hivemall meets Digdag @Hackertackle 2018-02-17Hivemall meets Digdag @Hackertackle 2018-02-17
Hivemall meets Digdag @Hackertackle 2018-02-17
 
Apache Hivemall @ Apache BigData '17, Miami
Apache Hivemall @ Apache BigData '17, MiamiApache Hivemall @ Apache BigData '17, Miami
Apache Hivemall @ Apache BigData '17, Miami
 
機械学習のデータ並列処理@第7回BDI研究会
機械学習のデータ並列処理@第7回BDI研究会機械学習のデータ並列処理@第7回BDI研究会
機械学習のデータ並列処理@第7回BDI研究会
 
Podling Hivemall in the Apache Incubator
Podling Hivemall in the Apache IncubatorPodling Hivemall in the Apache Incubator
Podling Hivemall in the Apache Incubator
 
Dots20161029 myui
Dots20161029 myuiDots20161029 myui
Dots20161029 myui
 
Hadoopsummit16 myui
Hadoopsummit16 myuiHadoopsummit16 myui
Hadoopsummit16 myui
 
HadoopCon'16, Taipei @myui
HadoopCon'16, Taipei @myuiHadoopCon'16, Taipei @myui
HadoopCon'16, Taipei @myui
 
Recommendation 101 using Hivemall
Recommendation 101 using HivemallRecommendation 101 using Hivemall
Recommendation 101 using Hivemall
 
Hivemall dbtechshowcase 20160713 #dbts2016
Hivemall dbtechshowcase 20160713 #dbts2016Hivemall dbtechshowcase 20160713 #dbts2016
Hivemall dbtechshowcase 20160713 #dbts2016
 
Tdtechtalk20160425myui
Tdtechtalk20160425myuiTdtechtalk20160425myui
Tdtechtalk20160425myui
 
Tdtechtalk20160330myui
Tdtechtalk20160330myuiTdtechtalk20160330myui
Tdtechtalk20160330myui
 

Kürzlich hochgeladen

Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI AgeCprime
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...panagenda
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Mark Goldstein
 
Manual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditManual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditSkynet Technologies
 
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Scott Andery
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch TuesdayIvanti
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfpanagenda
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Farhan Tariq
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfNeo4j
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesThousandEyes
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPathCommunity
 

Kürzlich hochgeladen (20)

Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI Age
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
 
Manual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditManual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance Audit
 
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch Tuesday
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdf
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to Hero
 

A Database-Hadoop Hybrid Approach to Scalable Machine Learning

  • 1. A Database-Hadoop Hybrid Approach to Scalable Machine Learning Makoto YUI, Isao Kojima AIST, Japan <m.yui@aist.go.jp> June 30, 2013 IEEE BigData Congress 2013, Santa Clara
  • 2. Outline 1. Motivation & Problem Description 2. Our Hybrid Approach to Scalable Machine Learning  Architecture  Our batch learning scheme on Hive 3. Experimental Evaluation 4. Conclusions and Future Directions 2
  • 3. 3 As we seen in the Keynote and the Panel discussion (2nd day) of this conference, data analytics and machine learning are obviously getting more attentions along with Big Data Suppose then that you were a developer and your manager is willing to .. Manager Developer
  • 4. What’s possible choices around there? In-database Analytics  MADlib (open-source project lead by Greenplum)  Bismarck (Project at wisc.edu, SIGMOD’12)  SAS In-database Analytics  Fuzzy Logix (Sybase) and more.. Machine Learning on Hadoop  Apache Mahout  Vowpal Wabbit (open-source project at MS research)  In-house analytical tools e.g., Twitter, SIGMOD’12 4 Two popular schools of thought for performing large-scale machine learning that does not fit in memory space
  • 5. 4 Issues needed to be considered 1. Scalability 5 Scalability would always be a problem when handing Big Data
  • 6. 4 Issues needed to be considered 1. Scalability 2. Data movement 6 Data movement is important because Moving data is a critical issue when the size of dataset shift from terabytes to petabytes and beyond
  • 7. 4 Issues needed to be considered 1. Scalability 2. Data movement 3. Transactions 7 Considering transactions is important for real- time/online predictions because most of transaction records, which are valuable for predictions, are stored in relational databases
  • 8. 4 Issues needed to be considered 1. Scalability 2. Data movement 3. Transactions 4. Latency and throughput 8 Latency and throughput are the key issues for achieving online prediction and/or real-time analytics
  • 9. Which is better? 9 Scalability Data movement Transactions Latency Throughput In-database analytics Machine learning on Hadoop + Fault Tolerance + Straggler node handling + Scale-out
  • 10. Which is better? 10 Scalability Data movement Transactions Latency Throughput In-database analytics Machine learning on Hadoop It depends on where the data initially stored and purposes of using the data + HDFS is useful for append-only and archiving purposes + ETL Processing (feature engineering) + RDBMS is reliable as a transactional data store
  • 11. Which is better? 11 Scalability Data movement Transactions Latency Throughput In-database analytics Machine learning on Hadoop + Small fraction updates + Index-lookup for online prediction
  • 12. Which is better? 12 Scalability Data movement Transactions In-database analytics Machine learning on Hadoop + Incremental learning for each training instance - High latency bottleneck in job submitting process + Batch processing Latency Throughput
  • 13. Idea behind DB-Hadoop Hybrid approach 13 scalability Data movement Transactions Batch learning on Hadoop Incremental learning and prediction in a relational database Just an illustration, you know Latency Throughput
  • 14. Inside the box – How tocombine them 14 Postgres Hadoop cluster node node node ・・・ OLTP transactions Training data Prediction model Incremental learning implemented as a database stored procedure Trickle training data to Hadoop HDFS little by little and bringing back prediction models periodicity Batch learning
  • 15. 15 Trickle updates Source database Trickle updates in the queue periodically Hadoop clusterRelational Database Staging table Pull updates in the queue Training data sink The Detailed Architecture ―DatatoPredictionCycle Incremental Learner
  • 16. 16 Trickle updates Source database Hadoop clusterRelational Database Staging table Pull updates in the queue Training data sink Prediction model Batch learning process build a model The Detailed Architecture ―DatatoPredictionCycle Incremental Learner up-to-date model Export a prediction model Trickle updates in the queue periodically
  • 17. 17 Trickle updates Source database Hadoop clusterRelational Database Staging table Pull updates in the queue Training data sink Prediction model Batch learning process build a model The Detailed Architecture ―DatatoPredictionCycle Incremental Learner up-to-date model Select the latest one Insert a new one Export a prediction model Trickle updates in the queue periodically
  • 18. 18 Trickle updates Source database Hadoop clusterRelational Database Staging table Pull updates in the queue Training data sink Prediction model Batch learning process build a model The Detailed Architecture ―DatatoPredictionCycle Incremental Learner up-to-date model Select the latest one Insert a new one Export a prediction model Transactional updates Online prediction User can control the flow considering requirements and performance  Real-time prediction is possible using database triggers on the staging table Trickle updates in the queue periodically
  • 19. 19 Trickle updates Source database Hadoop clusterRelational Database Staging table Pull updates in the queue Training data sink Prediction model Batch learning process build a model The Detailed Architecture ―DatatoPredictionCycle Incremental Learner up-to-date model Select the latest one Insert a new one Export a prediction model The workflow consists of continuous and independent processes Trickle updates in the queue periodically
  • 20. Existing Approach for Parallel Batch Learning ―MachineLearningasUserDefinedAggregates(UDAF) 20 train train +1, <1,2> .. +1, <1,7,9> -1, <1,3, 9> .. +1, <3,8> merge tuple <label, array<features > array<weight> array<sum of weight>, array<count> Training table Prediction model UDAF -1, <2,7, 9> .. +1, <3,8> final merge merge -1, <2,7, 9> .. +1, <3,8> train train array<weight>  Bottleneck in the final merge Scalability is limited by the maximum fan-out of the final merge  Scalar aggregates computing a large single result are not suitable for S/N settings  Parallel aggregation (as one in Google Dremel) is not supported in Hadoop/MapReduce Aggregate tree (parallel aggregation) to merge prediction models Problems Observed Even though MPP databases and Hive parallelize user-defined aggregates, the above problems prevent using it
  • 21. Purely Relational Approach for Parallel Learning  Implemented a trainer as a set-returning function, instead of UDAF The purely relational way that scales on MPP and Hive/Hadoop  Shuffle by feature to Reducers  Run trainers independently on mappers and aggregate the results on reducers Embarrassingly parallel as # of mappers and reducers is controllable 21 +1, <1,2> .. +1, <1,7,9> -1, <1,3, 9> .. +1, <3,8> train train tuple <label, array<features>> tuple<feature, weights> Prediction model UDTF Relation <feature, weights> param-mix param-mix Training table Shuffle by feature Our solution for parallel machine learning on Hadoop/Hive SELECT feature, -- reducers perform model averaging in parallel avg(weight) as weight FROM ( SELECT trainLogistic(features,label,..) as (feature,weight) FROM train ) t -- map-only task GROUP BY feature; -- shuffled to reducers Parameter Mixing K. B. Hall et al. in Proc. NIPS workshop on Leaning on Cores, Clusters, and Clouds, 2010. Key points
  • 22. Experimental Evaluation 1. Compared the performance of our batch learning scheme to state-of-the-art machine learning techniques, namely Bismarck and Vowpal Wabbit 2. Conducted a online prediction scenario to see the latency and throughput of our incremental learning scheme Dataset KDD Cup 2012, Track 2 dataset, which is one of the one of the largest publically available datasets for machine learning, provided by a commercial search engine provider Experimental Environment In-house 33 commodity servers (32 slaves nodes for Hadoop) each equipped with 8 processors and 24 GB memory 22 Given a prediction model is created with 80% of training data by batch learning, the rest of data (20%) is supplied for incremental learning  The task is predicting Click-Through-Rates of search engine ads  The training data is about 235 million records in 23 GB
  • 23. Performance Evaluation of Batch Learning Our batch learning scheme on Hive is 5 and 7.65 times faster than Vowpal Wabbit and Bismarck, respectively 23 AUC value (Green Bar) represents prediction accuracy 5x 7.65x Throughput: 2.3 million tuples/sec on 32 nodes Latency: 96 sec for training 235 million records of 23 GB CAUTION: you can find the detailed number and setting in our paper
  • 24. Performance Analysis in the Evaluation 24 Source database Hadoop clusterRelational Database Staging table Training data sink Prediction model Incremental Learner up-to-date model  Low latency (5 sec) under moderate (70,000 tuples/sec) updates  96 sec for training  Excellent throughput (2.3 million tuples/sec) on 32 nodes 5 s 96 s
  • 25. Performance Analysis in the Evaluation 25  Sqoop required 3 min 32 s to migrate a prediction model 80% model containing about 1.56 million records (323MB)  Model conversion to a dense format, which is suited for online learning/prediction on Postgres, required 58 seconds Source database Hadoop clusterRelational Database Staging table Training data sink Prediction model Incremental Learner up-to-date model 5 s 96 s 212 s58 s Non-trivial costs in model migration
  • 26. Performance Analysis in the Evaluation 26 Source database Hadoop clusterRelational Database Staging table Training data sink Prediction model Incremental Learner up-to-date model  “Data migration time > Training time” justifies the rationale behind in-database analytics  The cost of moving data is critical for online prediction as well as in Big Data analysis Model migration costs could be amortized with our approach Key observations
  • 27. Conclusions  DB-Hadoop hybrid architecture for online prediction in which the prediction model needs to be updated in a low latency process  Design principal for achieving scalable machine learning on Hadoop/Hive Good throughput and Scalability Our Batch learning scheme on Hive is 5 and 7.65 times faster than Vowpal Wabbit and Bismarck, respectively Acceptably Small Latency Possibly less than 5 s, under moderate transactional updates 27 Going hybrid brings low latency to Big Data analytics
  • 28. Directions for Future Work 28 Source database Hadoop clusterRelational Database Staging table Training data sink Prediction model Incremental Learner up-to-date model Online testing  Integrating online testing schemes (e.g., Multi-armed Bandits and A/B testing) to the prediction pipeline  Develop a scheme to select the best prediction model among past models for each user in each session Online prediction
  • 30. Evaluation of Incremental Learning  Given a prediction model is created with 80% of training data by batch learning, the rest of data (20% ) is supplied for incremental learning 30 Built model with .. elapsed time (in sec) Throughput tuples/sec AUC Batch only (80%) 96.33 2067418.3 0.7177 +0.1% updates (80.1%) 4.99 36155.4 0.7197 +1% updates (81%) 25.96 69812.8 0.7242 +10% updates (90%) 256.03 71278.1 0.7291 +20% updates (100%) 499.61 72901.4 0.7349 Batch only (100%) 102.52 2298010.8 0.7356
  • 31. Directions for Future Work 31 Source database Hadoop clusterRelational Database Staging table Training data sink Prediction model Incremental Learner up-to-date model Take a common setting for OLTP that database is partitioned across servers (a.k.a. Database Sharding) into consideration
  • 32. Special Thanks Font Lato by Łukasz Dziedzic Symbols by the Noun Project Data Analysis designed by Brennan Novak Elephant designed by Ted Mitchner Scale designed by Laurent Patain Heavy Load designed by Olivier Guin Receipt designed by Benjamin Orlovski Gauge designed by Márcio Duarte Stopwatch designed by Ilsur Aptukov Box designed by Travis J. Lee Sprint Cycle designed by Jeremy J Bristol Dilbert characters by Scott Adams Inc. 10-12-10 and 7-29-12 32