1. A Database-Hadoop Hybrid Approach to Scalable Machine Learning
Makoto Yui and Isao Kojima (AIST, Japan)
<m.yui@aist.go.jp>
June 30, 2013
IEEE BigData Congress 2013, Santa Clara
2. Outline
1. Motivation & Problem Description
2. Our Hybrid Approach to Scalable Machine Learning
Architecture
Our batch learning scheme on Hive
3. Experimental Evaluation
4. Conclusions and Future Directions
3. As we saw in the keynote and the panel discussion (day 2) of this conference, data analytics and machine learning are clearly attracting more attention along with Big Data.
Suppose, then, that you are a developer and your manager is willing to ..
(Dilbert cartoon: a manager and a developer.)
4. What are the possible choices out there?
Two popular schools of thought for performing large-scale machine learning on data that does not fit in memory:
In-database analytics
MADlib (open-source project led by Greenplum)
Bismarck (project at wisc.edu, SIGMOD'12)
SAS In-Database Analytics
Fuzzy Logix (Sybase)
and more..
Machine learning on Hadoop
Apache Mahout
Vowpal Wabbit (open-source project at Microsoft Research)
In-house analytical tools (e.g., Twitter, SIGMOD'12)
5. Four issues that need to be considered
1. Scalability: scalability is always a concern when handling Big Data.
2. Data movement: moving data becomes a critical issue when the size of a dataset shifts from terabytes to petabytes and beyond.
3. Transactions: considering transactions is important for real-time/online prediction, because most transaction records, which are valuable for predictions, are stored in relational databases.
4. Latency and throughput: latency and throughput are the key issues for achieving online prediction and/or real-time analytics.
10. Which is better?
(Comparison table: in-database analytics vs. machine learning on Hadoop, across scalability, data movement, transactions, latency, and throughput.)
It depends on where the data is initially stored and on the purpose of using the data.
+ HDFS is useful for append-only and archiving purposes, and for ETL processing (feature engineering).
+ An RDBMS is reliable as a transactional data store.
13. Idea behind the DB-Hadoop hybrid approach
Combine batch learning on Hadoop with incremental learning and prediction in a relational database, covering scalability, data movement, transactions, latency, and throughput between them.
(Just an illustration, you know.)
14. Inside the box ― how to combine them
(Diagram: a Postgres database and a Hadoop cluster of many nodes.)
OLTP transactions arrive at Postgres. Training data is trickled to Hadoop HDFS little by little, and prediction models are brought back periodically.
Batch learning runs on the Hadoop cluster; incremental learning is implemented as a database stored procedure.
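The incremental learner is described only as "a database stored procedure". As a minimal sketch of what it could compute, assuming sparse logistic regression trained by stochastic gradient descent (the deck does not specify the update rule, and all names here are hypothetical):

```python
# Hypothetical sketch of the incremental learner the deck places inside
# the database as a stored procedure. Assumes a sparse logistic-regression
# model updated by SGD; the slides do not specify the actual update rule.
import math
from collections import defaultdict

def sgd_update(model, features, label, lr=0.1):
    """Update sparse weights in place for one labeled example.

    model    -- dict feature -> weight (the prediction-model table)
    features -- list of active feature ids (as in the deck's <1,7,9>)
    label    -- +1 or -1
    """
    margin = label * sum(model[f] for f in features)
    grad = -label / (1.0 + math.exp(margin))  # d(log-loss)/d(margin)
    for f in features:
        model[f] -= lr * grad

model = defaultdict(float)
# a trickle of training records, as pulled from the staging table
for label, features in [(+1, [1, 2]), (-1, [1, 3, 9]), (+1, [3, 8])]:
    sgd_update(model, features, label)
```

In the deck's architecture this routine would run inside Postgres, fed by rows pulled from the staging-table queue.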
15. The Detailed Architecture ― Data-to-Prediction Cycle
(Diagram: a relational database and a Hadoop cluster.)
Trickle updates flow from the source database into a staging table, and the updates in the queue are trickled on periodically. The incremental learner pulls the updates in the queue, and training data flows to the training data sink on the Hadoop cluster.
17. The Detailed Architecture ― Data-to-Prediction Cycle
(Diagram as in slide 15, with the batch side added.)
On the Hadoop cluster, the batch learning process builds a model from the training data sink and exports the prediction model back to the relational database, where a new model is inserted. The incremental learner selects the latest one as the up-to-date model.
18. The Detailed Architecture ― Data-to-Prediction Cycle
(Diagram as in slide 17, with online prediction added.)
Transactional updates arrive at the staging table, and the up-to-date model serves online predictions.
Users can control the flow considering their requirements and performance.
Real-time prediction is possible using database triggers on the staging table.
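The claim that real-time prediction is possible via database triggers can be illustrated with a trigger-like hook: on each insert into the staging table, score the row with the up-to-date model and queue it for the learner. The scoring function and all names here are illustrative assumptions, not the paper's API:

```python
# Hypothetical sketch of "real-time prediction using database triggers":
# on each insert into the staging table, a trigger-like hook scores the
# row with the up-to-date model and queues it for the incremental
# learner. Assumes a sparse linear model; names are illustrative.
import math

def predict_ctr(model, features):
    """Logistic score from a sparse weight map: P(click | features)."""
    margin = sum(model.get(f, 0.0) for f in features)
    return 1.0 / (1.0 + math.exp(-margin))

def on_staging_insert(row, model, queue):
    """What a row-level trigger on the staging table could do."""
    queue.append(row)  # later pulled by the incremental learner
    return predict_ctr(model, row["features"])

queue = []
model = {1: 0.4, 7: -0.2}  # up-to-date model, as selected from the table
p = on_staging_insert({"features": [1, 7, 9]}, model, queue)
```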
19. The Detailed Architecture ― Data-to-Prediction Cycle
(Diagram as in slide 17.)
The workflow consists of continuous and independent processes.
20. Existing Approach for Parallel Batch Learning ― Machine Learning as User-Defined Aggregates (UDAF)
(Diagram: workers run "train" as a UDAF over partitions of the training table, consuming tuples <label, array<features>>; partial models <array<weight>, array<sum of weight>, array<count>> are merged pairwise and then combined in a final merge into the prediction model array<weight>.)
Problems observed:
Bottleneck in the final merge: scalability is limited by the maximum fan-out of the final merge.
Scalar aggregates computing a large single result are not suitable for shared-nothing (S/N) settings.
Parallel aggregation (such as the aggregate tree in Google Dremel, which would merge prediction models in parallel) is not supported in Hadoop/MapReduce.
Even though MPP databases and Hive parallelize user-defined aggregates, the above problems prevent using them.
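To make the final-merge bottleneck concrete, here is an illustrative rendition of the UDAF pattern: every worker ships its partial arrays to one node, so the final merge's work grows with (number of workers) × (model size). The count-weighted averaging rule and all names are assumptions, not the paper's implementation:

```python
# Sketch of the UDAF pattern the slide criticizes: each worker trains a
# partial dense model (weight and count arrays), and a single node does
# the final merge. With d features and n workers, that node receives
# n * d values -- the fan-in bottleneck. Averaging rule is illustrative.

def final_merge(partials):
    """Single-node merge: count-weighted average of partial models."""
    d = len(partials[0]["weight"])
    total = [0.0] * d
    counts = [0] * d
    for p in partials:  # every partial model lands on this one node
        for i in range(d):
            total[i] += p["weight"][i] * p["count"][i]
            counts[i] += p["count"][i]
    return [t / c if c else 0.0 for t, c in zip(total, counts)]

partials = [
    {"weight": [0.2, 0.0, 0.4], "count": [10, 0, 10]},
    {"weight": [0.4, 0.6, 0.0], "count": [10, 5, 0]},
]
merged = final_merge(partials)  # work grows with workers x features
```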
21. Purely Relational Approach for Parallel Learning
Our solution for parallel machine learning on Hadoop/Hive: implement the trainer as a set-returning function (a UDTF) instead of a UDAF ― a purely relational approach that scales on MPP databases and on Hive/Hadoop.
Key points:
Run the trainers independently on mappers, shuffle by feature to the reducers, and aggregate the results on the reducers.
Embarrassingly parallel, as the numbers of mappers and reducers are controllable.
(Diagram: mappers consume tuples <label, array<features>> from the training table; the UDTF emits a relation <feature, weights>; reducers perform parameter mixing per feature to produce the prediction model.)
SELECT
  feature, -- reducers perform model averaging in parallel
  avg(weight) as weight
FROM (
  SELECT trainLogistic(features,label,..) as (feature,weight)
  FROM train
) t -- map-only task
GROUP BY feature; -- shuffled to reducers
Parameter mixing: K. B. Hall et al., in Proc. NIPS Workshop on Learning on Cores, Clusters, and Clouds, 2010.
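The HiveQL above can be mirrored in plain Python to show what the reducers compute: avg(weight) grouped by feature over the (feature, weight) rows that the mappers' UDTF emits. The example rows below are stand-ins, not real trainer output:

```python
# Plain-Python rendition of the HiveQL above: each mapper's trainer emits
# (feature, weight) pairs (the UDTF's output relation), and the
# "GROUP BY feature / avg(weight)" step averages them per feature.
# The rows below are stand-ins; only the mixing step mirrors the SQL.
from collections import defaultdict

def parameter_mix(rows):
    """Per-feature model averaging: avg(weight) ... GROUP BY feature."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for feature, weight in rows:
        sums[feature] += weight
        counts[feature] += 1
    return {f: sums[f] / counts[f] for f in sums}

# (feature, weight) rows as emitted by two independently trained mappers
rows = [(1, 0.2), (2, 0.4), (1, 0.4), (3, -0.1)]
model = parameter_mix(rows)  # feature 1 is averaged across both mappers
```

Because each feature's average is computed by whichever reducer the feature hashes to, no single node ever has to hold all partial models, which is the point of the shuffle-by-feature design.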
22. Experimental Evaluation
1. Compared the performance of our batch learning scheme to state-of-the-art machine learning techniques, namely Bismarck and Vowpal Wabbit.
2. Conducted an online prediction scenario to measure the latency and throughput of our incremental learning scheme. Given a prediction model created with 80% of the training data by batch learning, the remaining data (20%) is supplied for incremental learning.
Dataset: the KDD Cup 2012, Track 2 dataset, one of the largest publicly available datasets for machine learning, provided by a commercial search engine provider. The task is predicting the click-through rates of search engine ads. The training data is about 235 million records in 23 GB.
Experimental environment: 33 in-house commodity servers (32 slave nodes for Hadoop), each equipped with 8 processors and 24 GB of memory.
23. Performance Evaluation of Batch Learning
Our batch learning scheme on Hive is 5x and 7.65x faster than Vowpal Wabbit and Bismarck, respectively.
(Bar chart: runtimes; the AUC values (green bars) represent prediction accuracy.)
Throughput: 2.3 million tuples/sec on 32 nodes.
Latency: 96 sec for training 235 million records (23 GB).
CAUTION: you can find the detailed numbers and settings in our paper.
24. Performance Analysis in the Evaluation
(Diagram: the data-to-prediction cycle, annotated with timings.)
On the database side: low latency (5 s) under moderate updates (70,000 tuples/sec).
On the Hadoop side: 96 s for training, with excellent throughput (2.3 million tuples/sec) on 32 nodes.
25. Performance Analysis in the Evaluation
(Diagram as in slide 24.)
Non-trivial costs in model migration:
Sqoop required 3 min 32 s (212 s) to migrate a prediction model; the 80% model contains about 1.56 million records (323 MB).
Model conversion to a dense format, which is suited for online learning/prediction on Postgres, required 58 seconds.
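The 58-second "model conversion to a dense format" step is not detailed in the slides; one plausible reading, sketched here, is converting the migrated sparse (feature, weight) rows into a dense array indexed by feature id, which gives O(1) lookups during scoring:

```python
# Illustrative sketch of the "model conversion to a dense format" step:
# the model migrated by Sqoop is a set of sparse (feature, weight) rows,
# and online prediction on Postgres is assumed here to use a dense array
# indexed by feature id. The layout is a guess; the paper's actual
# format is not given in the slides.

def to_dense(sparse_rows, num_features):
    """Turn (feature_id, weight) rows into a dense weight array."""
    dense = [0.0] * num_features
    for feature_id, weight in sparse_rows:
        dense[feature_id] = weight
    return dense

rows = [(0, 0.12), (3, -0.5), (7, 0.9)]
dense = to_dense(rows, num_features=8)
# dense lookups are O(1) array indexing, which favors low-latency scoring
```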
26. Performance Analysis in the Evaluation
Key observations:
"Data migration time > training time" justifies the rationale behind in-database analytics.
The cost of moving data is critical for online prediction as well as for Big Data analysis.
Model migration costs could be amortized with our approach.
27. Conclusions
A DB-Hadoop hybrid architecture for online prediction, in which the prediction model needs to be updated with low latency.
A design principle for achieving scalable machine learning on Hadoop/Hive.
Good throughput and scalability: our batch learning scheme on Hive is 5x and 7.65x faster than Vowpal Wabbit and Bismarck, respectively.
Acceptably small latency: possibly less than 5 s under moderate transactional updates.
Going hybrid brings low latency to Big Data analytics.
28. Directions for Future Work
(Diagram: the data-to-prediction cycle, extended with an online prediction and online testing stage.)
Integrate online testing schemes (e.g., multi-armed bandits and A/B testing) into the prediction pipeline.
Develop a scheme to select the best prediction model among past models for each user in each session.
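One way to realize the multi-armed-bandit idea above is an epsilon-greedy policy over past prediction models; the reward bookkeeping and the epsilon value below are illustrative choices, not from the paper:

```python
# Epsilon-greedy selection among past prediction models, as one way to
# realize the slide's "select the best prediction model" idea. Reward
# bookkeeping and epsilon are illustrative choices, not from the paper.
import random

class ModelBandit:
    def __init__(self, model_ids, epsilon=0.1, seed=42):
        self.rng = random.Random(seed)
        self.epsilon = epsilon
        self.stats = {m: [0.0, 0] for m in model_ids}  # reward sum, pulls

    def choose(self):
        """Mostly exploit the best-observed model; sometimes explore."""
        if self.rng.random() < self.epsilon:
            return self.rng.choice(list(self.stats))
        return max(self.stats,
                   key=lambda m: self.stats[m][0] / self.stats[m][1]
                   if self.stats[m][1] else float("inf"))

    def update(self, model_id, reward):
        self.stats[model_id][0] += reward
        self.stats[model_id][1] += 1

bandit = ModelBandit(["model_v1", "model_v2"])
for _ in range(200):
    m = bandit.choose()
    rate = 0.3 if m == "model_v2" else 0.1  # simulated click-through rates
    bandit.update(m, 1.0 if bandit.rng.random() < rate else 0.0)
```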
30. Evaluation of Incremental Learning
Given a prediction model created with 80% of the training data by batch learning, the remaining data (20%) is supplied for incremental learning.

Built model with ..       Elapsed time (s)   Throughput (tuples/s)   AUC
Batch only (80%)          96.33              2,067,418.3             0.7177
+0.1% updates (80.1%)     4.99               36,155.4                0.7197
+1% updates (81%)         25.96              69,812.8                0.7242
+10% updates (90%)        256.03             71,278.1                0.7291
+20% updates (100%)       499.61             72,901.4                0.7349
Batch only (100%)         102.52             2,298,010.8             0.7356
31. Directions for Future Work
(Diagram: the data-to-prediction cycle.)
Take into consideration a common OLTP setting in which the database is partitioned across servers (a.k.a. database sharding).
32. Special Thanks
Font
Lato by Łukasz Dziedzic
Symbols by the Noun Project
Data Analysis designed by Brennan Novak
Elephant designed by Ted Mitchner
Scale designed by Laurent Patain
Heavy Load designed by Olivier Guin
Receipt designed by Benjamin Orlovski
Gauge designed by Márcio Duarte
Stopwatch designed by Ilsur Aptukov
Box designed by Travis J. Lee
Sprint Cycle designed by Jeremy J Bristol
Dilbert characters by Scott Adams Inc.
10-12-10 and 7-29-12