Hivemall: Scalable Machine Learning Library for Apache Hive
Makoto YUI (m.yui@aist.go.jp, @myui)
National Institute of Advanced Industrial Science and Technology (AIST), Japan
Hadoop Summit 2014, San Jose
Plan of the talk
• What is Hivemall
• Why Hivemall
• What Hivemall can do
• How to use Hivemall
• How Hivemall works
• How to deal with iterations (compared with Spark)
• Experimental Evaluation
• Conclusion
What is Hivemall
• A collection of machine learning algorithms
implemented as Hive UDFs/UDTFs
• Classification & Regression
• Recommendation
• k-Nearest Neighbor Search
.. and more
• An open source project on GitHub
• Licensed under LGPL
• github.com/myui/hivemall (bit.ly/hivemall)
• 4 contributors
Reactions to the release
Motivation – Why a new ML framework?
Machine learning frameworks out there that run with Hadoop:
• Mahout
• Vowpal Wabbit (w/ Hadoop streaming)
• Spark MLlib
• 0xdata H2O
• Cloudera Oryx
Quick poll: How many people in this room are using them?
Motivation – Why a new ML framework?
Framework | User interface
Mahout | Java API programming
Spark MLlib/MLI | Scala API programming; Scala shell (REPL)
H2O | R programming; GUI
Cloudera Oryx | HTTP REST API programming
Vowpal Wabbit (w/ Hadoop streaming) | C++ API programming; command line
Existing distributed machine learning frameworks are NOT easy to use
Classification with Mahout
org/apache/mahout/classifier/sgd/TrainNewsGroups.java
Find the complete code at bit.ly/news20-mahout
Why Hivemall
1. Ease of use
• No programming
• Every machine learning step is done within HiveQL
• No compilation/packaging overhead
• Easy for existing Hive users
• You can evaluate Hivemall within 5 minutes or so
• Installation is just as follows
Why Hivemall
2. Scalable to data
• Scalable to # of training/testing instances
• Scalable to # of features
• Built-in support for feature hashing
• Scalable to the size of prediction model
• Suppose there are 200 labels * 100 million features ⇒ requires 150 GB
• Hivemall does not need the prediction model to fit in memory, in either training or prediction
• Feature engineering step is also scalable and parallelized using Hive
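A minimal sketch of the two ideas on this slide, feature hashing and model size, written in Python rather than HiveQL so it can be run anywhere. The hash function (MD5 from the standard library), the helper names, and the 8-byte weight size are assumptions made for illustration; they are not taken from Hivemall itself.

```python
import hashlib


def hash_feature(name: str, num_buckets: int = 2 ** 24) -> int:
    """Map an arbitrary feature name to a bucket index in [0, num_buckets)."""
    digest = hashlib.md5(name.encode("utf-8")).digest()  # stand-in hash (assumption)
    return int.from_bytes(digest[:8], "big") % num_buckets


def hash_features(features, num_buckets=2 ** 24):
    """Convert {name: value} into {bucket: value}, summing values on collisions."""
    hashed = {}
    for name, value in features.items():
        bucket = hash_feature(name, num_buckets)
        hashed[bucket] = hashed.get(bucket, 0.0) + value
    return hashed


def dense_model_bytes(num_labels, num_features, bytes_per_weight=8):
    """Size of a dense table holding one weight per (label, feature) pair."""
    return num_labels * num_features * bytes_per_weight


# Hypothetical example row; the feature names are made up.
row = {"user=42": 1.0, "query=hadoop": 1.0, "ad_position": 0.25}
print(hash_features(row))
# 200 labels * 100 million features, assuming 8-byte weights, in GiB
print(dense_model_bytes(200, 100_000_000) / 2 ** 30)
```

Under the 8-byte assumption the dense model comes to roughly 149 GiB, which is in line with the 150 GB quoted on the slide. Hashing bounds the number of feature dimensions regardless of how many distinct feature names appear in the data.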
Why Hivemall
3. Scalable to computing resources
• Exploiting the benefits of Hadoop & Hive
• Provisioning the machine learning service on Amazon Elastic MapReduce
• Provides an EMR bootstrap for the automated setup
Find an example at bit.ly/hivemall-emr
Why Hivemall
4. Supports the state-of-the-art online
learning algorithms (for classification)
• Fewer configuration parameters (no learning rate, unlike SGD)
• CW, AROW[1], and SCW[2] are not yet supported in the other ML frameworks
• Surprisingly fast convergence properties (a few iterations are enough)
1. Adaptive Regularization of Weight Vectors (AROW), Crammer et al., NIPS 2009
2. Exact Soft Confidence-Weighted Learning (SCW), Wang et al., ICML 2012
Why Hivemall
4. Supports the state-of-the-art online learning algorithms (for classification)
Algorithm | News20.binary classification accuracy
Perceptron | 0.9460
Passive-Aggressive (a.k.a. Online-SVM) | 0.9604
LibLinear | 0.9636
LibSVM/TinySVM | 0.9643
Confidence Weighted (CW) | 0.9656
AROW [1] | 0.9660
SCW [2] | 0.9662
(Accuracy improves from the top row to the bottom row.)
CW variants are very smart online ML algorithms
Why are CW variants so good?
Suppose a binary classification setting that classifies sentences as positive or negative
→ learn a weight for each word (each word is a feature)
Label | Feature vector
Positive | I like this author
Negative | I like this author, but found this book dull
A naïve update will reduce both W_like and W_dull at the same rate
CW variants adjust weights at different rates
Why are CW variants so good?
[Figure: a plain online learner adjusts only a weight, while CW variants adjust a weight and its confidence (covariance). Annotation: "At this confidence, the weight is 0.5".]
Why Hivemall
4. Supports the state-of-the-art online
learning algorithms (for classification)
• Fast convergence properties
• Perform small updates where confidence is already high
• Perform large updates where confidence is low (e.g., at the beginning)
• A few iterations are enough
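The "small update where confident, large update where not" behaviour can be sketched with an AROW-style update [1] that keeps one variance per feature. This is a simplified Python illustration with a diagonal covariance, not Hivemall's implementation; the function name, the regularization parameter `r`, and the initial variance of 1.0 are assumptions.

```python
def arow_update(weights, variances, features, label, r=1.0):
    """One AROW-style update with a diagonal covariance (simplified sketch).

    weights, variances: dicts keyed by feature (missing -> 0.0 / 1.0)
    features: dict of feature -> value; label: +1 or -1
    """
    margin = sum(weights.get(f, 0.0) * v for f, v in features.items())
    loss = max(0.0, 1.0 - label * margin)
    if loss == 0.0:
        return  # already classified with enough margin, no update
    margin_variance = sum(variances.get(f, 1.0) * v * v for f, v in features.items())
    beta = 1.0 / (margin_variance + r)
    alpha = loss * beta
    for f, v in features.items():
        var = variances.get(f, 1.0)
        # The step is scaled by the feature's variance: low variance
        # (high confidence) means a small change to the weight.
        weights[f] = weights.get(f, 0.0) + alpha * label * var * v
        variances[f] = var - beta * var * var * v * v  # confidence grows


def bag_of_words(sentence):
    return {w: 1.0 for w in sentence.lower().replace(",", "").split()}


weights, variances = {}, {}
for _ in range(3):
    arow_update(weights, variances, bag_of_words("I like this author"), +1)

before = dict(weights)
arow_update(weights, variances,
            bag_of_words("I like this author, but found this book dull"), -1)
delta_like = weights["like"] - before["like"]
delta_dull = weights["dull"] - before.get("dull", 0.0)
print(delta_like, delta_dull)
```

After the positive sentences, "like" has been seen several times and its variance has shrunk, so the negative sentence moves W_dull further than W_like, which is the behaviour the slides describe.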
What Hivemall can do
• Classification (both binary and multi-class)
  - Perceptron
  - Passive Aggressive (PA)
  - Confidence Weighted (CW)
  - Adaptive Regularization of Weight Vectors (AROW)
  - Soft Confidence Weighted (SCW)
• Regression
  - Logistic Regression using Stochastic Gradient Descent (SGD)
  - PA Regression
  - AROW Regression
• k-Nearest Neighbor & Recommendation
  - Minhash and b-Bit Minhash (LSH variant)
  - Brute-force search using similarity measures (cosine similarity)
• Feature engineering
  - Feature hashing
  - Feature scaling (normalization, z-score)
How to use Hivemall
[Figure: machine learning workflow. Training: labels and feature vectors → machine learning → prediction model. Prediction: feature vectors and the prediction model → labels. Highlighted step: Data preparation.]
How to use Hivemall - Data preparation
Define a Hive table for training/testing data
CREATE EXTERNAL TABLE e2006tfidf_train (
  rowid int,
  label float,
  features ARRAY<STRING>
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  COLLECTION ITEMS TERMINATED BY ","
STORED AS TEXTFILE LOCATION '/dataset/E2006-tfidf/train';
How to use Hivemall
[Figure: the training/prediction workflow diagram, with the Feature Engineering step highlighted.]
How to use Hivemall - Feature Engineering
Applying min-max normalization: transforming a label value to a value between 0.0 and 1.0
create view e2006tfidf_train_scaled
as
select
  rowid,
  rescale(label, ${min_label}, ${max_label}) as label,
  features
from
  e2006tfidf_train;
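Min-max normalization itself is a one-line formula. The Python sketch below shows what a function like `rescale` is expected to compute, assuming `${min_label}` and `${max_label}` hold the minimum and maximum of the label column; the handling of a degenerate range and the sample labels are assumptions for illustration.

```python
def rescale(value, min_value, max_value):
    """Min-max normalization: map value from [min_value, max_value] to [0.0, 1.0]."""
    if max_value == min_value:
        return 0.5  # arbitrary choice for a degenerate range (assumption)
    return (value - min_value) / (max_value - min_value)


# Hypothetical label values; min and max would normally come from a prior query.
labels = [-7.9, -3.2, -0.5]
min_label, max_label = min(labels), max(labels)
scaled = [rescale(y, min_label, max_label) for y in labels]
print(scaled)
```

The smallest label maps to 0.0 and the largest to 1.0, with everything else falling in between, which suits a learner whose output passes through a sigmoid.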
How to use Hivemall
[Figure: the training/prediction workflow diagram, with the Training step highlighted.]
How to use Hivemall - Training
Training by logistic regression
CREATE TABLE lr_model AS
SELECT
  feature,
  avg(weight) as weight -- reducers perform model averaging in parallel
FROM (
  SELECT logress(features,label,..) as (feature,weight) -- map-only task to learn a prediction model
  FROM train
) t
GROUP BY feature -- shuffle map outputs to reducers by feature
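The map/shuffle/reduce structure of this training query can be imitated in a few lines of Python. This is a sketch under stated assumptions: binary features, labels in {0.0, 1.0}, and a fixed learning rate `eta`; the real `logress` options are elided on the slide and are not reproduced here.

```python
import math
from collections import defaultdict


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_partition(rows, eta=0.1):
    """Mapper side: one SGD pass of logistic regression over one partition.

    rows: list of (features, label), features being a list of feature ids
    with implicit value 1.0. Yields (feature, weight) pairs, the way a
    UDTF emits output rows.
    """
    weights = defaultdict(float)
    for features, label in rows:
        predicted = sigmoid(sum(weights[f] for f in features))
        error = label - predicted
        for f in features:
            weights[f] += eta * error
    yield from weights.items()


def average_by_feature(emitted):
    """Reducer side: GROUP BY feature, avg(weight)."""
    sums, counts = defaultdict(float), defaultdict(int)
    for feature, weight in emitted:
        sums[feature] += weight
        counts[feature] += 1
    return {f: sums[f] / counts[f] for f in sums}


# Two hypothetical partitions, each handled by its own map task.
partitions = [
    [([1, 2], 1.0), ([1, 3], 0.0), ([2], 1.0)],
    [([2, 4], 1.0), ([3], 0.0), ([1, 3], 0.0)],
]
emitted = [pair for part in partitions for pair in train_partition(part)]
model = average_by_feature(emitted)
print(model)
```

Each partition learns its own weights independently, and the averaging step combines them per feature, so the resulting model is just a feature-to-weight relation.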
How to use Hivemall - Training
Training of the Confidence Weighted (CW) classifier
CREATE TABLE news20b_cw_model1 AS
SELECT
  feature,
  voted_avg(weight) as weight -- vote to use negative or positive weights for avg
FROM
  (SELECT
     train_cw(features,label)
       as (feature,weight)
   FROM
     news20b_train
  ) t
GROUP BY feature
Example weights for one feature: +0.7, +0.3, +0.2, -0.1, +0.7 (positive weights are in the majority, so the average is taken over them)
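A voted average can be sketched in Python as follows. The tie-breaking rule and the treatment of zero weights are assumptions of this sketch, since the slide only says that a vote decides whether negative or positive weights are averaged.

```python
def voted_avg(weights):
    """Average only the weights whose sign wins a majority vote."""
    positives = [w for w in weights if w > 0]
    negatives = [w for w in weights if w < 0]
    if not positives and not negatives:
        return 0.0  # nothing to vote on (assumption)
    # Ties go to the negative side here; this is an arbitrary choice.
    winners = positives if len(positives) > len(negatives) else negatives
    return sum(winners) / len(winners)


print(voted_avg([0.7, 0.3, 0.2, -0.1, 0.7]))
```

For the weights on the slide, the four positive values win the vote and their mean is about 0.475, while the single -0.1 is discarded. A plain `avg` would instead return about 0.36.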
Ensemble learning for stable prediction performance
Just stack prediction models by union all
create table news20mc_ensemble_model1 as
select
  label,
  cast(feature as int) as feature,
  cast(voted_avg(weight) as float) as weight
from
  (select
     train_multiclass_cw(addBias(features),label)
       as (label,feature,weight)
   from
     news20mc_train_x3
   union all
   select
     train_multiclass_arow(addBias(features),label)
       as (label,feature,weight)
   from
     news20mc_train_x3
   union all
   select
     train_multiclass_scw(addBias(features),label)
       as (label,feature,weight)
   from
     news20mc_train_x3
  ) t
group by label, feature;
How to use Hivemall
[Figure: the training/prediction workflow diagram, with the Prediction step highlighted.]
How to use Hivemall - Prediction
Prediction is done by LEFT OUTER JOIN between test data and prediction model
No need to load the entire model into memory
CREATE TABLE lr_predict
as
SELECT
  t.rowid,
  sigmoid(sum(m.weight)) as prob
FROM
  testing_exploded t LEFT OUTER JOIN
  lr_model m ON (t.feature = m.feature)
GROUP BY
  t.rowid
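Why a join avoids holding the model in memory can be illustrated with a sort-merge join in Python, where both inputs are streamed in feature order and only one model row is held at a time. The function names and sample rows are hypothetical, and the in-process `sorted` calls stand in for the distributed shuffle and sort that Hive would perform.

```python
import math


def merge_join(test_rows, model_rows):
    """Left outer join of (feature, rowid) with (feature, weight).

    Both inputs must be sorted by feature and model features must be
    unique. Yields (rowid, weight or None), None playing the role of NULL.
    """
    model_iter = iter(model_rows)
    current = next(model_iter, None)
    for feature, rowid in test_rows:
        while current is not None and current[0] < feature:
            current = next(model_iter, None)
        if current is not None and current[0] == feature:
            yield rowid, current[1]
        else:
            yield rowid, None


def predict(testing_exploded, model):
    """testing_exploded: (rowid, feature) pairs; model: (feature, weight) pairs."""
    test_rows = sorted((feature, rowid) for rowid, feature in testing_exploded)
    totals = {}
    for rowid, weight in merge_join(test_rows, sorted(model)):
        totals.setdefault(rowid, None)
        if weight is not None:  # SUM ignores NULLs
            totals[rowid] = (totals[rowid] or 0.0) + weight
    return {r: None if t is None else 1.0 / (1.0 + math.exp(-t))
            for r, t in totals.items()}


testing = [(1, "a"), (1, "b"), (2, "c"), (2, "zzz"), (3, "unknown")]
lr_model = [("a", 0.5), ("b", -0.5), ("c", 2.0)]
probs = predict(testing, lr_model)
print(probs)
```

Features missing from the model contribute nothing to the sum, and a row with no matching feature at all ends up with no score, mirroring how SUM over only NULLs yields NULL in SQL.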
How Hivemall works in the training
Machine learning algorithms are implemented as User-Defined Table-generating Functions (UDTFs)
[Figure: rows of the training table, each a tuple <label, array<features>> (e.g., +1, <1,2>; +1, <1,7,9>; -1, <1,3,9>; +1, <3,8>), are fed to parallel train UDTFs, which emit tuple<feature, weights>; the output is shuffled by feature and combined by param-mix into the prediction model, a relation <feature, weights>.]
• Friendly to the Hive relational query engine
  - Resulting prediction model is a relation of feature and its weight
• Embarrassingly parallel
  - # of mappers and reducers are configurable
• Bagging-like effect which helps to reduce the variance of each classifier/partition
Why not a UDAF (as in MADlib)?
Machine learning as an aggregate function
[Figure: merge tree. Four train tasks read tuple <label, array<features>> rows from the training table and each emits array<weight>; merge steps combine them into array<sum of weight>, array<count>; a final merge produces the prediction model. Going down the tree: 4 ops in parallel → 2 ops in parallel → no parallelism. Parallelism decreases while memory consumption grows.]
• Bottleneck in the final merge
• Throughput limited by its fan out
How to deal with Iterations
Iterations are mandatory to get a good prediction model
• However, MapReduce is not suited for iterations because the input/output of each MR job goes through HDFS
• Spark avoids this with in-memory computation
[Figure: MapReduce: Input → HDFS read → iter. 1 → HDFS write → HDFS read → iter. 2 → HDFS write → ...; Spark: Input → iter. 1 → iter. 2 → ...]
Training with Iterations in Spark
Logistic Regression example of Spark
val data = spark.textFile(...).map(readPoint).cache() // each node loads the data into memory once
for (i <- 1 to ITERATIONS) {
  // repeated MapReduce steps to do gradient descent
  val gradient = data.map(p =>
    (1 / (1 + exp(-p.y*(w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= gradient
}
This is just a toy example! Why?
Input to the gradient computation should be shuffled for each iteration (without it, more iterations are required)
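For readers without a Spark cluster, the same full-batch gradient descent can be written in plain Python. This is a sketch that mirrors the gradient formula of the toy example; the explicit `step` parameter and the four sample points are assumptions added for illustration.

```python
import math


def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def point_gradient(w, point):
    """Gradient of the logistic loss for one point; y is +1 or -1."""
    x, y = point
    scale = (1.0 / (1.0 + math.exp(-y * dot(w, x))) - 1.0) * y
    return [scale * xi for xi in x]


def train(points, iterations=100, step=0.1):
    w = [0.0] * len(points[0][0])
    for _ in range(iterations):
        total = [0.0] * len(w)
        for p in points:  # "map": one gradient per point
            g = point_gradient(w, p)
            total = [t + gi for t, gi in zip(total, g)]  # "reduce": sum
        w = [wi - step * ti for wi, ti in zip(w, total)]
    return w


# Hypothetical, linearly separable points: ([bias, feature], label)
points = [([1.0, 2.0], 1), ([1.0, 1.5], 1), ([1.0, -1.0], -1), ([1.0, -2.5], -1)]
w = train(points)
print(w)
```

Every iteration touches the whole data set once, which is why keeping the data cached in memory matters so much for this style of training.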
What does MLlib actually do?
Mini-batch Gradient Descent with Sampling (GradientDescent.scala, bit.ly/spark-gd)
val data = ..
for (i <- 1 to numIterations) {
  val sampled = ..  // sample a subset of the data (partitioned RDD)
  val gradient = .. // average the subgradients over the sampled data using Spark MapReduce
  w -= gradient
}
Iterations are mandatory for convergence because each iteration uses only a small fraction of the data
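The sample-then-average loop can be sketched in Python as follows. This illustrates the general technique only, not MLlib's code; the sampling fraction, step size, seed, and sample points are assumptions.

```python
import math
import random


def minibatch_gd(points, num_iterations=200, fraction=0.5, step=0.5, seed=0):
    """Mini-batch gradient descent for logistic regression with labels +1/-1."""
    rng = random.Random(seed)
    w = [0.0] * len(points[0][0])
    for _ in range(num_iterations):
        # Sample a subset of the data for this iteration.
        sampled = [p for p in points if rng.random() < fraction]
        if not sampled:
            continue
        total = [0.0] * len(w)
        for x, y in sampled:
            margin = y * sum(wi * xi for wi, xi in zip(w, x))
            scale = (1.0 / (1.0 + math.exp(-margin)) - 1.0) * y
            total = [t + scale * xi for t, xi in zip(total, x)]
        # Average the gradients over the sampled points, then take a step.
        w = [wi - step * t / len(sampled) for wi, t in zip(w, total)]
    return w


# Hypothetical, linearly separable points: ([bias, feature], label)
points = [([1.0, 2.0], 1), ([1.0, 1.5], 1), ([1.0, -1.0], -1), ([1.0, -2.5], -1)]
w = minibatch_gd(points)
print(w)
```

Because each iteration sees only part of the data, a single pass is not enough, which is the point made on the slide.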
How to deal with Iterations in Hivemall
Hivemall provides the amplify UDTF to emulate the effect of iterations in machine learning without several MapReduce steps
SET hivevar:xtimes=3;
CREATE VIEW training_x3
as
SELECT
  *
FROM (
  SELECT
    amplify(${xtimes}, *) as (rowid, label, features)
  FROM
    training
) t
CLUSTER BY rand();
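What amplification followed by a global random shuffle does to the rows can be sketched in Python. The function names, the reducer count, and the sample rows are hypothetical; the sketch only imitates the data movement, not Hive's execution.

```python
import random
from collections import Counter


def amplify(xtimes, rows):
    """Emit every input row xtimes times, as a UDTF producing extra rows."""
    for row in rows:
        for _ in range(xtimes):
            yield row


def cluster_by_random(rows, num_reducers, seed=0):
    """Send each row to a random reducer and order rows randomly within it."""
    rng = random.Random(seed)
    buckets = [[] for _ in range(num_reducers)]
    for row in rows:
        buckets[rng.randrange(num_reducers)].append(row)
    for bucket in buckets:
        rng.shuffle(bucket)
    return buckets


# Hypothetical training rows: (rowid, label, features)
training = [(1, 1.0, ["a"]), (2, 0.0, ["b"]), (3, 1.0, ["c"])]
buckets = cluster_by_random(amplify(3, training), num_reducers=2)
counts = Counter(row[0] for bucket in buckets for row in bucket)
print(counts)
```

Every row appears three times in total, spread over the reducers in random order, so a single training pass over the amplified view behaves much like three shuffled passes over the original data. The price is an extra shuffle of the amplified data across the cluster.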
Map-only shuffling and amplifying
The rand_amplify UDTF randomly shuffles the input rows within each map task
CREATE VIEW training_x3
as
SELECT
  rand_amplify(${xtimes}, ${shufflebuffersize}, *)
    as (rowid, label, features)
FROM
  training;
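One plausible way to shuffle inside a map task is to buffer amplified rows and emit them in random order whenever the buffer fills up. The Python sketch below assumes that mechanism; how Hivemall manages its buffer internally is not stated on the slide.

```python
import random


def rand_amplify(xtimes, buffer_size, rows, seed=0):
    """Amplify rows xtimes and emit them in a buffer-local random order."""
    rng = random.Random(seed)
    buffer = []
    for row in rows:
        for _ in range(xtimes):
            buffer.append(row)
            if len(buffer) >= buffer_size:
                rng.shuffle(buffer)
                yield from buffer
                buffer.clear()
    # Flush whatever is left at the end of the input.
    rng.shuffle(buffer)
    yield from buffer


out = list(rand_amplify(3, 4, range(5)))
print(out)
```

No data leaves the map task, so there is no extra shuffle stage, but the randomness is limited to rows that share a buffer. A larger buffer gives a better shuffle at the cost of more memory per task.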
Detailed plan w/ map-local shuffle
[Figure: query plan. Each map task: Table scan → Rand Amplifier → Logress UDTF → Partial aggregate → Map write. Shuffle (distributed by feature). Each reduce task: Merge → Aggregate → Reduce write.]
Scanned entries are amplified and then shuffled
Note that this is a pipelined operation: the Rand Amplifier operator is interleaved between the table scan and the training operator
Performance effects of amplifiers
Method | Elapsed time (sec) | AUC
Plain | 89.718 | 0.734805
amplifier + clustered by (a.k.a. global shuffle) | 479.855 | 0.746214
rand_amplifier (a.k.a. map-local shuffle) | 116.424 | 0.743392
For map-local shuffle, prediction accuracy improved with an acceptable overhead
Experimental Evaluation
Compared the performance of our batch learning scheme to state-of-the-art machine learning techniques, namely Bismarck and Vowpal Wabbit
• Dataset: KDD Cup 2012, Track 2 dataset, which is one of the largest publicly available datasets for machine learning, provided by a commercial search engine provider (bit.ly/hivemall-kdd-dataset)
  - The training data is about 235 million records in 33 GB
  - # of feature dimensions is about 54 million
• Task: Predicting click-through rates of search engine ads
• Experimental environment: In-house cluster of 33 commodity servers (32 slave nodes for Hadoop), each equipped with 8 processors and 24 GB memory
Performance comparison
Elapsed time (sec) for training (the lower, the better)
System | Elapsed time (sec)
Hivemall | 116.4
VW1 | 596.67
VW32 | 493.81
Bismarck | 755.24
[Figure: bar chart of AUC for Hivemall, VW1, VW32, and Bismarck; y-axis from 0.64 to 0.76; individual bar values are not recoverable.]
Prediction performance (AUC) is good
Throughput: 2.3 million tuples/sec on 32 nodes
Latency: 96 sec for training 235 million records of 23 GB
How about Spark 1.0 MLlib?
val training = MLUtils.loadLibSVMFile(sc,
  "hdfs://host:8020/small/training_libsvmfmt", multiclass = false)
val model = LogisticRegressionWithSGD.train(training, numIterations)
..
• Works fine for small data (10k training examples in about 1.5 MB) on 33 nodes, with 5 GB of memory allocated to each worker
• LoC is small and easy to understand
• However, Spark does not work for the large dataset (235 million training examples with 2^24 feature dimensions in about 33 GB)
• Further investigation is required
Conclusion
Hivemall is an open source library that provides a collection of machine learning algorithms as Hive UDFs/UDTFs
• Easy to use
• Scalable to computing resources
• Runs on Amazon EMR
• Supports state-of-the-art classification algorithms
• Plan to support Shark/Spark SQL
Project site: github.com/myui/hivemall or bit.ly/hivemall
Message of this talk: Please evaluate Hivemall yourself. Five minutes is enough for a quick start.
Slides available at bit.ly/hivemall-slide

Weitere ähnliche Inhalte

Was ist angesagt?

Flickr: Computer vision at scale with Hadoop and Storm (Huy Nguyen)
Flickr: Computer vision at scale with Hadoop and Storm (Huy Nguyen)Flickr: Computer vision at scale with Hadoop and Storm (Huy Nguyen)
Flickr: Computer vision at scale with Hadoop and Storm (Huy Nguyen)Yahoo Developer Network
 
Cost effective BigData Processing on Amazon EC2
Cost effective BigData Processing on Amazon EC2Cost effective BigData Processing on Amazon EC2
Cost effective BigData Processing on Amazon EC2Sujee Maniyam
 
Summary machine learning and model deployment
Summary machine learning and model deploymentSummary machine learning and model deployment
Summary machine learning and model deploymentNovita Sari
 
HopsML Meetup talk on Hopsworks + ROCm/AMD June 2019
HopsML Meetup talk on Hopsworks + ROCm/AMD June 2019HopsML Meetup talk on Hopsworks + ROCm/AMD June 2019
HopsML Meetup talk on Hopsworks + ROCm/AMD June 2019Jim Dowling
 
Hopsworks at Google AI Huddle, Sunnyvale
Hopsworks at Google AI Huddle, SunnyvaleHopsworks at Google AI Huddle, Sunnyvale
Hopsworks at Google AI Huddle, SunnyvaleJim Dowling
 
Koalas: Pandas on Apache Spark
Koalas: Pandas on Apache SparkKoalas: Pandas on Apache Spark
Koalas: Pandas on Apache SparkDatabricks
 
Koalas: Unifying Spark and pandas APIs
Koalas: Unifying Spark and pandas APIsKoalas: Unifying Spark and pandas APIs
Koalas: Unifying Spark and pandas APIsTakuya UESHIN
 
Hadoop Hive Tutorial | Hive Fundamentals | Hive Architecture
Hadoop Hive Tutorial | Hive Fundamentals | Hive ArchitectureHadoop Hive Tutorial | Hive Fundamentals | Hive Architecture
Hadoop Hive Tutorial | Hive Fundamentals | Hive ArchitectureSkillspeed
 
Apache Hive Tutorial
Apache Hive TutorialApache Hive Tutorial
Apache Hive TutorialSandeep Patil
 
Functional Programming and Big Data
Functional Programming and Big DataFunctional Programming and Big Data
Functional Programming and Big DataDataWorks Summit
 
The Feature Store in Hopsworks
The Feature Store in HopsworksThe Feature Store in Hopsworks
The Feature Store in HopsworksJim Dowling
 
Spark Autotuning: Spark Summit East talk by Lawrence Spracklen
Spark Autotuning: Spark Summit East talk by Lawrence SpracklenSpark Autotuning: Spark Summit East talk by Lawrence Spracklen
Spark Autotuning: Spark Summit East talk by Lawrence SpracklenSpark Summit
 
Asynchronous Hyperparameter Search with Spark on Hopsworks and Maggy
Asynchronous Hyperparameter Search with Spark on Hopsworks and MaggyAsynchronous Hyperparameter Search with Spark on Hopsworks and Maggy
Asynchronous Hyperparameter Search with Spark on Hopsworks and MaggyJim Dowling
 
Future of Data Intensive Applicaitons
Future of Data Intensive ApplicaitonsFuture of Data Intensive Applicaitons
Future of Data Intensive ApplicaitonsMilind Bhandarkar
 
The Zoo Expands: Labrador *Loves* Elephant, Thanks to Hamster
The Zoo Expands: Labrador *Loves* Elephant, Thanks to HamsterThe Zoo Expands: Labrador *Loves* Elephant, Thanks to Hamster
The Zoo Expands: Labrador *Loves* Elephant, Thanks to HamsterMilind Bhandarkar
 
Introduction to the Hadoop EcoSystem
Introduction to the Hadoop EcoSystemIntroduction to the Hadoop EcoSystem
Introduction to the Hadoop EcoSystemShivaji Dutta
 
PyData Meetup - Feature Store for Hopsworks and ML Pipelines
PyData Meetup - Feature Store for Hopsworks and ML PipelinesPyData Meetup - Feature Store for Hopsworks and ML Pipelines
PyData Meetup - Feature Store for Hopsworks and ML PipelinesJim Dowling
 
Using Familiar BI Tools and Hadoop to Analyze Enterprise Networks
Using Familiar BI Tools and Hadoop to Analyze Enterprise NetworksUsing Familiar BI Tools and Hadoop to Analyze Enterprise Networks
Using Familiar BI Tools and Hadoop to Analyze Enterprise NetworksDataWorks Summit
 
Extending Hadoop for Fun & Profit
Extending Hadoop for Fun & ProfitExtending Hadoop for Fun & Profit
Extending Hadoop for Fun & ProfitMilind Bhandarkar
 

Was ist angesagt? (20)

Flickr: Computer vision at scale with Hadoop and Storm (Huy Nguyen)
Flickr: Computer vision at scale with Hadoop and Storm (Huy Nguyen)Flickr: Computer vision at scale with Hadoop and Storm (Huy Nguyen)
Flickr: Computer vision at scale with Hadoop and Storm (Huy Nguyen)
 
Cost effective BigData Processing on Amazon EC2
Cost effective BigData Processing on Amazon EC2Cost effective BigData Processing on Amazon EC2
Cost effective BigData Processing on Amazon EC2
 
Summary machine learning and model deployment
Summary machine learning and model deploymentSummary machine learning and model deployment
Summary machine learning and model deployment
 
HopsML Meetup talk on Hopsworks + ROCm/AMD June 2019
HopsML Meetup talk on Hopsworks + ROCm/AMD June 2019HopsML Meetup talk on Hopsworks + ROCm/AMD June 2019
HopsML Meetup talk on Hopsworks + ROCm/AMD June 2019
 
Hopsworks at Google AI Huddle, Sunnyvale
Hopsworks at Google AI Huddle, SunnyvaleHopsworks at Google AI Huddle, Sunnyvale
Hopsworks at Google AI Huddle, Sunnyvale
 
Koalas: Pandas on Apache Spark
Koalas: Pandas on Apache SparkKoalas: Pandas on Apache Spark
Koalas: Pandas on Apache Spark
 
Koalas: Unifying Spark and pandas APIs
Koalas: Unifying Spark and pandas APIsKoalas: Unifying Spark and pandas APIs
Koalas: Unifying Spark and pandas APIs
 
Hadoop Hive Tutorial | Hive Fundamentals | Hive Architecture
Hadoop Hive Tutorial | Hive Fundamentals | Hive ArchitectureHadoop Hive Tutorial | Hive Fundamentals | Hive Architecture
Hadoop Hive Tutorial | Hive Fundamentals | Hive Architecture
 
Apache Hive Tutorial
Apache Hive TutorialApache Hive Tutorial
Apache Hive Tutorial
 
Functional Programming and Big Data
Functional Programming and Big DataFunctional Programming and Big Data
Functional Programming and Big Data
 
The Evolution of Apache Kylin
The Evolution of Apache KylinThe Evolution of Apache Kylin
The Evolution of Apache Kylin
 
The Feature Store in Hopsworks
The Feature Store in HopsworksThe Feature Store in Hopsworks
The Feature Store in Hopsworks
 
Spark Autotuning: Spark Summit East talk by Lawrence Spracklen
Spark Autotuning: Spark Summit East talk by Lawrence SpracklenSpark Autotuning: Spark Summit East talk by Lawrence Spracklen
Spark Autotuning: Spark Summit East talk by Lawrence Spracklen
 
Asynchronous Hyperparameter Search with Spark on Hopsworks and Maggy
Asynchronous Hyperparameter Search with Spark on Hopsworks and MaggyAsynchronous Hyperparameter Search with Spark on Hopsworks and Maggy
Asynchronous Hyperparameter Search with Spark on Hopsworks and Maggy
 
Future of Data Intensive Applicaitons
Future of Data Intensive ApplicaitonsFuture of Data Intensive Applicaitons
Future of Data Intensive Applicaitons
 
The Zoo Expands: Labrador *Loves* Elephant, Thanks to Hamster
The Zoo Expands: Labrador *Loves* Elephant, Thanks to HamsterThe Zoo Expands: Labrador *Loves* Elephant, Thanks to Hamster
The Zoo Expands: Labrador *Loves* Elephant, Thanks to Hamster
 
Introduction to the Hadoop EcoSystem
Introduction to the Hadoop EcoSystemIntroduction to the Hadoop EcoSystem
Introduction to the Hadoop EcoSystem
 
PyData Meetup - Feature Store for Hopsworks and ML Pipelines
PyData Meetup - Feature Store for Hopsworks and ML PipelinesPyData Meetup - Feature Store for Hopsworks and ML Pipelines
PyData Meetup - Feature Store for Hopsworks and ML Pipelines
 
Using Familiar BI Tools and Hadoop to Analyze Enterprise Networks
Using Familiar BI Tools and Hadoop to Analyze Enterprise NetworksUsing Familiar BI Tools and Hadoop to Analyze Enterprise Networks
Using Familiar BI Tools and Hadoop to Analyze Enterprise Networks
 
Extending Hadoop for Fun & Profit
Extending Hadoop for Fun & ProfitExtending Hadoop for Fun & Profit
Extending Hadoop for Fun & Profit
 

Andere mochten auch

Hive analytic workloads hadoop summit san jose 2014
Hive analytic workloads hadoop summit san jose 2014Hive analytic workloads hadoop summit san jose 2014
Hive analytic workloads hadoop summit san jose 2014alanfgates
 
Cost-based query optimization in Apache Hive
Cost-based query optimization in Apache HiveCost-based query optimization in Apache Hive
Cost-based query optimization in Apache HiveJulian Hyde
 
Hive acid-updates-summit-sjc-2014
Hive acid-updates-summit-sjc-2014Hive acid-updates-summit-sjc-2014
Hive acid-updates-summit-sjc-2014alanfgates
 
Hive/Pigを使ったKDD'12 track2の広告クリック率予測
Hive/Pigを使ったKDD'12 track2の広告クリック率予測Hive/Pigを使ったKDD'12 track2の広告クリック率予測
Hive/Pigを使ったKDD'12 track2の広告クリック率予測Makoto Yui
 
天猫后端技术架构优化实践
天猫后端技术架构优化实践天猫后端技术架构优化实践
天猫后端技术架构优化实践drewz lin
 
Hive Anatomy
Hive AnatomyHive Anatomy
Hive Anatomynzhang
 
Track A-3 Enterprise Data Lake in Action - 搭建「活」的企業 Big Data 生態架構
Track A-3 Enterprise Data Lake in Action - 搭建「活」的企業 Big Data 生態架構Track A-3 Enterprise Data Lake in Action - 搭建「活」的企業 Big Data 生態架構
Track A-3 Enterprise Data Lake in Action - 搭建「活」的企業 Big Data 生態架構Etu Solution
 
Hivemall: Scalable machine learning library for Apache Hive/Spark/Pig
Hivemall: Scalable machine learning library for Apache Hive/Spark/PigHivemall: Scalable machine learning library for Apache Hive/Spark/Pig
Hivemall: Scalable machine learning library for Apache Hive/Spark/PigDataWorks Summit/Hadoop Summit
 
Track A-2 基於 Spark 的數據分析
Track A-2 基於 Spark 的數據分析Track A-2 基於 Spark 的數據分析
Track A-2 基於 Spark 的數據分析Etu Solution
 
Hadoop, the Apple of Our Eyes (這些年,我們一起追的 Hadoop)
Hadoop, the Apple of Our Eyes (這些年,我們一起追的 Hadoop)Hadoop, the Apple of Our Eyes (這些年,我們一起追的 Hadoop)
Hadoop, the Apple of Our Eyes (這些年,我們一起追的 Hadoop)Kuo-Chun Su
 
Track A-1: Cloudera 大數據產品和技術最前沿資訊報告
Track A-1: Cloudera 大數據產品和技術最前沿資訊報告Track A-1: Cloudera 大數據產品和技術最前沿資訊報告
Track A-1: Cloudera 大數據產品和技術最前沿資訊報告Etu Solution
 
唯品会大数据实践 Sacc pub
唯品会大数据实践 Sacc pub唯品会大数据实践 Sacc pub
唯品会大数据实践 Sacc pubChao Zhu
 
豆瓣数据架构实践
豆瓣数据架构实践豆瓣数据架构实践
豆瓣数据架构实践Xupeng Yun
 
HIVE: Data Warehousing & Analytics on Hadoop
HIVE: Data Warehousing & Analytics on HadoopHIVE: Data Warehousing & Analytics on Hadoop
HIVE: Data Warehousing & Analytics on HadoopZheng Shao
 
Hive Quick Start Tutorial
Hive Quick Start TutorialHive Quick Start Tutorial
Hive Quick Start TutorialCarl Steinbach
 
Introduction To Map Reduce
Introduction To Map ReduceIntroduction To Map Reduce
Introduction To Map Reducerantav
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with HadoopPhilippe Julio
 
Design in Tech Report 2017
Design in Tech Report 2017Design in Tech Report 2017
Design in Tech Report 2017John Maeda
 

Andere mochten auch (18)

Hive analytic workloads hadoop summit san jose 2014
Hive analytic workloads hadoop summit san jose 2014Hive analytic workloads hadoop summit san jose 2014
Hive analytic workloads hadoop summit san jose 2014
 
Cost-based query optimization in Apache Hive
Cost-based query optimization in Apache HiveCost-based query optimization in Apache Hive
Cost-based query optimization in Apache Hive
 
Hive acid-updates-summit-sjc-2014
Hive acid-updates-summit-sjc-2014Hive acid-updates-summit-sjc-2014
Hive acid-updates-summit-sjc-2014
 
Hive/Pigを使ったKDD'12 track2の広告クリック率予測
Hive/Pigを使ったKDD'12 track2の広告クリック率予測Hive/Pigを使ったKDD'12 track2の広告クリック率予測
Hive/Pigを使ったKDD'12 track2の広告クリック率予測
 
天猫后端技术架构优化实践
天猫后端技术架构优化实践天猫后端技术架构优化实践
天猫后端技术架构优化实践
 
Hive Anatomy
Hive AnatomyHive Anatomy
Hive Anatomy
 
Track A-3 Enterprise Data Lake in Action - 搭建「活」的企業 Big Data 生態架構
Track A-3 Enterprise Data Lake in Action - 搭建「活」的企業 Big Data 生態架構Track A-3 Enterprise Data Lake in Action - 搭建「活」的企業 Big Data 生態架構
Track A-3 Enterprise Data Lake in Action - 搭建「活」的企業 Big Data 生態架構
 
Hivemall: Scalable machine learning library for Apache Hive/Spark/Pig
Hivemall: Scalable machine learning library for Apache Hive/Spark/PigHivemall: Scalable machine learning library for Apache Hive/Spark/Pig
Hivemall: Scalable machine learning library for Apache Hive/Spark/Pig
 
Track A-2 基於 Spark 的數據分析
Track A-2 基於 Spark 的數據分析Track A-2 基於 Spark 的數據分析
Track A-2 基於 Spark 的數據分析
 
Hadoop, the Apple of Our Eyes (這些年,我們一起追的 Hadoop)
Hadoop, the Apple of Our Eyes (這些年,我們一起追的 Hadoop)Hadoop, the Apple of Our Eyes (這些年,我們一起追的 Hadoop)
Hadoop, the Apple of Our Eyes (這些年,我們一起追的 Hadoop)
 
Track A-1: Cloudera 大數據產品和技術最前沿資訊報告
Track A-1: Cloudera 大數據產品和技術最前沿資訊報告Track A-1: Cloudera 大數據產品和技術最前沿資訊報告
Track A-1: Cloudera 大數據產品和技術最前沿資訊報告
 
唯品会大数据实践 Sacc pub
唯品会大数据实践 Sacc pub唯品会大数据实践 Sacc pub
唯品会大数据实践 Sacc pub
 
豆瓣数据架构实践
豆瓣数据架构实践豆瓣数据架构实践
豆瓣数据架构实践
 
HIVE: Data Warehousing & Analytics on Hadoop
HIVE: Data Warehousing & Analytics on HadoopHIVE: Data Warehousing & Analytics on Hadoop
HIVE: Data Warehousing & Analytics on Hadoop
 
Hive Quick Start Tutorial
Hive Quick Start TutorialHive Quick Start Tutorial
Hive Quick Start Tutorial
 
Introduction To Map Reduce
Introduction To Map ReduceIntroduction To Map Reduce
Introduction To Map Reduce
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with Hadoop
 
Design in Tech Report 2017
Design in Tech Report 2017Design in Tech Report 2017
Design in Tech Report 2017
 

Ähnlich wie Hivemall talk@Hadoop summit 2014, San Jose

Scalable Ensemble Machine Learning @ Harvard Health Policy Data Science Lab
Scalable Ensemble Machine Learning @ Harvard Health Policy Data Science LabScalable Ensemble Machine Learning @ Harvard Health Policy Data Science Lab
Scalable Ensemble Machine Learning @ Harvard Health Policy Data Science LabSri Ambati
 
Harnessing the Hadoop Ecosystem Optimizations in Apache Hive
Harnessing the Hadoop Ecosystem Optimizations in Apache HiveHarnessing the Hadoop Ecosystem Optimizations in Apache Hive
Harnessing the Hadoop Ecosystem Optimizations in Apache HiveQubole
 
Cloudera data-analyst-training
Cloudera data-analyst-trainingCloudera data-analyst-training
Cloudera data-analyst-trainingStarman Anoa
 
Building machine learning service in your business — Eric Chen (Uber) @PAPIs ...
Building machine learning service in your business — Eric Chen (Uber) @PAPIs ...Building machine learning service in your business — Eric Chen (Uber) @PAPIs ...
Building machine learning service in your business — Eric Chen (Uber) @PAPIs ...PAPIs.io
 
How to Upgrade Your Hadoop Stack in 1 Step -- with Zero Downtime
How to Upgrade Your Hadoop Stack in 1 Step -- with Zero DowntimeHow to Upgrade Your Hadoop Stack in 1 Step -- with Zero Downtime
How to Upgrade Your Hadoop Stack in 1 Step -- with Zero DowntimeIan Lumb
 
COMPARING THE PERFORMANCE OF ETL PIPELINE USING SPARK AND HIVE UNDER AZURE ...
COMPARING THE PERFORMANCE OF ETL PIPELINE USING SPARK AND HIVE   UNDER AZURE ...COMPARING THE PERFORMANCE OF ETL PIPELINE USING SPARK AND HIVE   UNDER AZURE ...
COMPARING THE PERFORMANCE OF ETL PIPELINE USING SPARK AND HIVE UNDER AZURE ...Megha Shah
 
Machine Learning with Hadoop
Machine Learning with HadoopMachine Learning with Hadoop
Machine Learning with HadoopSangchul Song
 
Hivemall tech talk at Redwood, CA
Hivemall tech talk at Redwood, CAHivemall tech talk at Redwood, CA
Hivemall tech talk at Redwood, CAMakoto Yui
 
Introduction to Designing and Building Big Data Applications
Introduction to Designing and Building Big Data ApplicationsIntroduction to Designing and Building Big Data Applications
Introduction to Designing and Building Big Data ApplicationsCloudera, Inc.
 
Hadoop training-and-placement
Hadoop training-and-placementHadoop training-and-placement
Hadoop training-and-placementsofia taylor
 
Hadoop training-and-placement
Hadoop training-and-placementHadoop training-and-placement
Hadoop training-and-placementIqbal Patel
 
Agile data warehousing
Agile data warehousingAgile data warehousing
Agile data warehousingSneha Challa
 
Interactive SQL-on-Hadoop and JethroData
Interactive SQL-on-Hadoop and JethroDataInteractive SQL-on-Hadoop and JethroData
Interactive SQL-on-Hadoop and JethroDataOfir Manor
 
Strata San Jose 2016: Scalable Ensemble Learning with H2O
Strata San Jose 2016: Scalable Ensemble Learning with H2OStrata San Jose 2016: Scalable Ensemble Learning with H2O
Strata San Jose 2016: Scalable Ensemble Learning with H2OSri Ambati
 
Hadoop online training
Hadoop online trainingHadoop online training
Hadoop online trainingsrikanthhadoop
 
Hadoop online training in india
Hadoop online training  in indiaHadoop online training  in india
Hadoop online training in indiaMadhu Trainer
 
The Big Data Puzzle, Where Does the Eclipse Piece Fit?
The Big Data Puzzle, Where Does the Eclipse Piece Fit?The Big Data Puzzle, Where Does the Eclipse Piece Fit?
The Big Data Puzzle, Where Does the Eclipse Piece Fit?J Langley
 

Hivemall talk@Hadoop summit 2014, San Jose

  • 1. National Institute of Advanced Industrial Science and Technology (AIST), Japan Makoto YUI m.yui@aist.go.jp, @myui Hivemall: Scalable Machine Learning Library for Apache Hive Hadoop Summit 2014, San Jose 1 / 43
  • 2. Plan of the talk • What is Hivemall • Why Hivemall • What Hivemall can do • How to use Hivemall • How Hivemall works • How to deal with iterations (compared with Spark) • Experimental Evaluation • Conclusion
  • 3. What is Hivemall • A collection of machine learning algorithms implemented as Hive UDFs/UDTFs • Classification & Regression • Recommendation • k-Nearest Neighbor Search • ... and more • An open source project on GitHub • Licensed under LGPL • github.com/myui/hivemall (bit.ly/hivemall) • 4 contributors
  • 4. Reactions to the release
  • 5. Reactions to the release
  • 6. Motivation – Why a new ML framework? • Machine learning frameworks out there that run with Hadoop: Mahout, Vowpal Wabbit (w/ Hadoop streaming), Spark MLlib, 0xdata H2O, Cloudera Oryx • Quick poll: How many people in this room are using them?
  • 7. Motivation – Why a new ML framework? Existing distributed machine learning frameworks are NOT easy to use
      Framework | User interface
      Mahout | Java API programming
      Spark MLlib/MLI | Scala API programming; Scala shell (REPL)
      H2O | R programming; GUI
      Cloudera Oryx | HTTP REST API programming
      Vowpal Wabbit (w/ Hadoop streaming) | C++ API programming; command line
  • 8. Classification with Mahout • org/apache/mahout/classifier/sgd/TrainNewsGroups.java • Find the complete code at bit.ly/news20-mahout
  • 9. Why Hivemall 1. Ease of use • No programming • Every machine learning step is done within HiveQL • No compilation/packaging overhead • Easy for existing Hive users • You can evaluate Hivemall within 5 minutes or so • Installation is just as follows
  • 10. Why Hivemall 2. Scalable to data • Scalable to the number of training/testing instances • Scalable to the number of features • Built-in support for feature hashing • Scalable to the size of the prediction model • Suppose there are 200 labels * 100 million features ⇒ requires 150 GB • Hivemall does not require the prediction model to fit in memory, in either training or prediction • The feature engineering step is also scalable and parallelized using Hive
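
The two claims on this slide (feature hashing and the 150 GB model) can be illustrated with a small Python sketch. It is not Hivemall code: `hash_feature` and `dense_model_size_gib` are hypothetical helpers, `zlib.crc32` is only a stand-in for whatever hash function Hivemall uses, and the 8 bytes per weight is an assumption chosen because it roughly reproduces the figure on the slide.

```python
import zlib


def hash_feature(name: str, num_buckets: int = 2 ** 24) -> int:
    """Map a feature name to a bucket index in [0, num_buckets).

    crc32 is a stand-in hash (an assumption); it is deterministic
    across runs, so the same name always lands in the same bucket.
    """
    return zlib.crc32(name.encode("utf-8")) % num_buckets


def dense_model_size_gib(num_labels: int, num_features: int,
                         bytes_per_weight: int = 8) -> float:
    """Size of a dense weight matrix in GiB (8 bytes/weight is assumed)."""
    return num_labels * num_features * bytes_per_weight / 2 ** 30


# 200 labels x 100 million features, one 8-byte weight each
print(round(dense_model_size_gib(200, 100_000_000)))  # 149
```

Hashing bounds the number of distinct feature indices regardless of how many raw feature names appear, at the cost of occasional collisions.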
  • 11. Why Hivemall 3. Scalable to computing resources • Exploiting the benefits of Hadoop & Hive • Provisioning the machine learning service on Amazon Elastic MapReduce • Provides an EMR bootstrap for the automated setup • Find an example on bit.ly/hivemall-emr
  • 12. Why Hivemall 4. Supports the state-of-the-art online learning algorithms (for classification) • Fewer configuration parameters (no learning rate, unlike SGD) • CW, AROW [1], and SCW [2] are not yet supported in the other ML frameworks • Surprisingly fast convergence (a few iterations are enough) • [1] Adaptive Regularization of Weight Vectors (AROW), Crammer et al., NIPS 2009 • [2] Exact Soft Confidence-Weighted Learning (SCW), Wang et al., ICML 2012
  • 13. Why Hivemall 4. Supports the state-of-the-art online learning algorithms (for classification) • CW variants are very smart online ML algorithms
      Algorithm | News20.binary classification accuracy (higher is better)
      Perceptron | 0.9460
      Passive-Aggressive (a.k.a. Online-SVM) | 0.9604
      LibLinear | 0.9636
      LibSVM/TinySVM | 0.9643
      Confidence Weighted (CW) | 0.9656
      AROW [1] | 0.9660
      SCW [2] | 0.9662
  • 14. Why are CW variants so good? • Suppose a binary classification setting that classifies sentences as positive or negative → learn a weight for each word (each word is a feature)
      Label | Feature vector
      Positive | I like this author
      Negative | I like this author, but found this book dull
    • A naïve update will reduce both W_like and W_dull at the same rate • CW variants adjust weights at different rates
  • 15. Why are CW variants so good? [Figure: two panels, "Adjust a weight" and "Adjust a weight & confidence"; axes: weight and confidence (covariance); annotation: "At this confidence, the weight is 0.5"]
  • 16. Why Hivemall 4. Supports the state-of-the-art online learning algorithms (for classification) • Fast convergence properties • Perform small updates where confidence is already high • Perform large updates where confidence is low (e.g., at the beginning) • A few iterations are enough
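
The confidence-dependent update described on the last three slides can be sketched with a diagonal-covariance AROW step (Crammer et al., 2009). This is an illustrative Python sketch, not Hivemall's implementation: the function name `arow_update`, the regularization value `r=1.0`, the initial confidence of 1.0 per feature, and the bag-of-words encoding of the two example sentences are all assumptions.

```python
def arow_update(w, sigma, x, y, r=1.0):
    """One AROW step with a diagonal covariance.

    w:     dict feature -> weight (updated in place)
    sigma: dict feature -> variance; lower variance = higher confidence
    x:     dict feature -> value, y: +1 or -1
    Returns the step size alpha (0.0 if the margin is already >= 1).
    """
    margin = y * sum(w.get(f, 0.0) * v for f, v in x.items())
    loss = max(0.0, 1.0 - margin)
    if loss == 0.0:
        return 0.0
    variance = sum(sigma.get(f, 1.0) * v * v for f, v in x.items())
    beta = 1.0 / (variance + r)
    alpha = loss * beta
    for f, v in x.items():
        s = sigma.get(f, 1.0)
        w[f] = w.get(f, 0.0) + alpha * y * s * v  # bigger step where variance is high
        sigma[f] = s - beta * s * s * v * v       # confidence grows once a feature is seen
    return alpha


w, sigma = {}, {}
positive = dict.fromkeys(["i", "like", "this", "author"], 1.0)
negative = dict.fromkeys(
    ["i", "like", "this", "author", "but", "found", "book", "dull"], 1.0)

arow_update(w, sigma, positive, +1)
like_before = w["like"]
arow_update(w, sigma, negative, -1)

# "like" was already seen once, so it moves less than the unseen "dull"
print(abs(w["like"] - like_before) < abs(w["dull"]))  # True
```

With these two sentences, the weight of "like" is reduced by less than the weight of "dull" on the negative example, which is the behaviour the slides attribute to CW variants.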
  • 18. What Hivemall can do
      • Classification (both one- and multi-class): Perceptron; Passive Aggressive (PA); Confidence Weighted (CW); Adaptive Regularization of Weight Vectors (AROW); Soft Confidence Weighted (SCW)
      • Regression: Logistic Regression using Stochastic Gradient Descent (SGD); PA Regression; AROW Regression
      • k-Nearest Neighbor & Recommendation: Minhash and b-Bit Minhash (LSH variant); Brute-force search using similarity measures (cosine similarity)
      • Feature engineering: Feature hashing; Feature scaling (normalization, z-score)
  • 19. How to use Hivemall: Data preparation [Workflow diagram: Training (Label, Feature Vector) → Prediction Model → Prediction (Feature Vector → Label)]
  • 20. How to use Hivemall - Data preparation • Define a Hive table for training/testing data:
      CREATE EXTERNAL TABLE e2006tfidf_train (
        rowid int,
        label float,
        features ARRAY<STRING>
      )
      ROW FORMAT DELIMITED
        FIELDS TERMINATED BY '\t'
        COLLECTION ITEMS TERMINATED BY ','
      STORED AS TEXTFILE
      LOCATION '/dataset/E2006-tfidf/train';
  • 21. How to use Hivemall: Feature Engineering [Workflow diagram: Training (Label, Feature Vector) → Prediction Model → Prediction (Feature Vector → Label)]
  • 22. How to use Hivemall - Feature Engineering • Applying a Min-Max Feature Normalization • Transforming a label value to a value between 0.0 and 1.0:
      CREATE VIEW e2006tfidf_train_scaled AS
      SELECT
        rowid,
        rescale(target, ${min_label}, ${max_label}) AS label,
        features
      FROM e2006tfidf_train;
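
Min-max normalization itself is a one-line formula, (value - min) / (max - min). The Python sketch below shows the arithmetic only; it is not the Hivemall UDF, and returning 0.5 when the minimum equals the maximum is an assumption made here to avoid a division by zero.

```python
def rescale(value: float, min_value: float, max_value: float) -> float:
    """Min-max normalization of value into the range [0.0, 1.0]."""
    if max_value == min_value:
        return 0.5  # assumption: pick the midpoint for a degenerate range
    return (value - min_value) / (max_value - min_value)


print(rescale(5.0, 0.0, 10.0))   # 0.5
print(rescale(-7.0, -7.0, -1.0)) # 0.0
```

The minimum and maximum must be computed over the training labels beforehand, which is what the `${min_label}` and `${max_label}` variables in the query stand for.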
  • 23. How to use Hivemall: Training [Workflow diagram: Training (Label, Feature Vector) → Prediction Model → Prediction (Feature Vector → Label)]
  • 24. How to use Hivemall - Training • Training by logistic regression:
      CREATE TABLE lr_model AS
      SELECT
        feature,
        avg(weight) AS weight       -- reducers perform model averaging in parallel
      FROM (
        SELECT logress(features, label, ..) AS (feature, weight)  -- map-only task to learn a prediction model
        FROM train
      ) t
      GROUP BY feature              -- shuffle map outputs to reducers by feature
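
The `avg(weight) ... GROUP BY feature` step amounts to model averaging: each map task emits its own (feature, weight) pairs, and the weights are averaged per feature. A minimal Python sketch of that aggregation, assuming each partial model is just a list of pairs (the function name `average_models` is hypothetical):

```python
from collections import defaultdict


def average_models(partial_models):
    """Average per-feature weights over the models learned by each mapper.

    partial_models: iterable of iterables of (feature, weight) pairs.
    A feature is averaged only over the mappers that emitted it,
    which mirrors avg() under GROUP BY.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for model in partial_models:
        for feature, weight in model:
            sums[feature] += weight
            counts[feature] += 1
    return {feature: sums[feature] / counts[feature] for feature in sums}


mapper_1 = [("like", 1.0), ("dull", -2.0)]
mapper_2 = [("like", 3.0)]
print(average_models([mapper_1, mapper_2]))  # {'like': 2.0, 'dull': -2.0}
```

Because the grouping key is the feature, different features can be averaged by different reducers independently.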
  • 25. How to use Hivemall - Training • Training of the Confidence Weighted (CW) classifier:
      CREATE TABLE news20b_cw_model1 AS
      SELECT
        feature,
        voted_avg(weight) AS weight
      FROM (
        SELECT train_cw(features, label) AS (feature, weight)
        FROM news20b_train
      ) t
      GROUP BY feature
    • voted_avg: vote on whether to use the negative or the positive weights for the average, e.g., +0.7, +0.3, +0.2, -0.1, +0.7
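
One reading of the voting rule on this slide is: count the positive and the negative weights, then average only the majority side. The Python sketch below follows that reading; it is an interpretation of the slide rather than the library's code, and the handling of ties (the negative side wins) and of all-zero input (returns 0.0) is an assumption.

```python
def voted_avg(weights):
    """Average the weights whose sign wins a majority vote."""
    positives = [w for w in weights if w > 0]
    negatives = [w for w in weights if w < 0]
    if not positives and not negatives:
        return 0.0  # assumption: nothing to vote on
    # assumption: on a tie the negative side is used
    winners = positives if len(positives) > len(negatives) else negatives
    return sum(winners) / len(winners)


# Slide example: four positive votes beat one negative vote,
# so the result is the mean of 0.7, 0.3, 0.2 and 0.7 (about 0.475).
print(voted_avg([0.7, 0.3, 0.2, -0.1, 0.7]))
```

Compared with a plain average, this keeps a single outlier of the opposite sign from pulling the merged weight toward zero.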
  • 26. Ensemble learning for stable prediction performance • Just stack prediction models by UNION ALL:
      CREATE TABLE news20mc_ensemble_model1 AS
      SELECT
        label,
        cast(feature AS int) AS feature,
        cast(voted_avg(weight) AS float) AS weight
      FROM (
        SELECT train_multiclass_cw(addBias(features), label) AS (label, feature, weight)
        FROM news20mc_train_x3
        UNION ALL
        SELECT train_multiclass_arow(addBias(features), label) AS (label, feature, weight)
        FROM news20mc_train_x3
        UNION ALL
        SELECT train_multiclass_scw(addBias(features), label) AS (label, feature, weight)
        FROM news20mc_train_x3
      ) t
      GROUP BY label, feature;
  • 27. How to use Hivemall: Prediction [Workflow diagram: Training (Label, Feature Vector) → Prediction Model → Prediction (Feature Vector → Label)]
  • 28. How to use Hivemall - Prediction • Prediction is done by a LEFT OUTER JOIN between the test data and the prediction model • No need to load the entire model into memory:
      CREATE TABLE lr_predict AS
      SELECT
        t.rowid,
        sigmoid(sum(m.weight)) AS prob
      FROM testing_exploded t
      LEFT OUTER JOIN lr_model m ON (t.feature = m.feature)
      GROUP BY t.rowid
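
The join-based prediction can be mimicked in a few lines of Python to show what the query computes: each test row is exploded into (rowid, feature) pairs, each pair looks up its weight in the model, and the summed weights go through a sigmoid. This is only a sketch of the logic; in Hive the lookup is a distributed join, whereas here the model is an in-memory dict, and a row with no matching feature yields 0.5 in this sketch (an assumption, since SQL would sum NULLs instead).

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def predict(test_rows, model):
    """test_rows: iterable of (rowid, feature); model: dict feature -> weight.

    Features missing from the model contribute nothing, like the
    unmatched side of a LEFT OUTER JOIN.
    """
    totals = {}
    for rowid, feature in test_rows:
        totals[rowid] = totals.get(rowid, 0.0) + model.get(feature, 0.0)
    return {rowid: sigmoid(total) for rowid, total in totals.items()}


model = {"like": 1.5, "dull": -2.0}
rows = [(1, "like"), (1, "author"), (2, "like"), (2, "dull")]
probs = predict(rows, model)
print(probs[1] > 0.5, probs[2] < 0.5)  # True True
```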
  • 30. How Hivemall works in the training • Machine learning algorithms are implemented as User-Defined Table-generating Functions (UDTFs) • [Diagram: the training table holds tuples <label, array<features>> such as (+1, <1,2>), (+1, <1,7,9>), (-1, <1,3,9>), (+1, <3,8>); each train UDTF emits tuples <feature, weights>, which are shuffled by feature and combined by param-mix into the prediction model, a relation <feature, weights>] • Friendly to the Hive relational query engine: the resulting prediction model is a relation of features and their weights • Embarrassingly parallel: the numbers of mappers and reducers are configurable • Bagging-like effect, which helps to reduce the variance of each classifier/partition
  • 31. Why not UDAF (as in MADlib)? Machine learning as an aggregate function • [Diagram: the training table holds tuples <label, array<features>>; four train operators each emit array<weight>, two merge operators combine them into array<sum of weight>, array<count>, and one final merge produces the prediction model; 4 ops in parallel → 2 ops in parallel → no parallelism] • Bottleneck in the final merge • Throughput limited by its fan-out • Memory consumption grows • Parallelism decreases
  • 32. How to deal with Iterations • Iterations are mandatory to get a good prediction model • However, MapReduce is not suited to iterations because the input and output of each MR job go through HDFS • Spark avoids this by in-memory computation • [Diagram: with MapReduce, each iteration (iter. 1, iter. 2, ...) performs an HDFS read and an HDFS write; with Spark, the input feeds iter. 1, iter. 2, ... without going through HDFS in between]
  • 33. Training with Iterations in Spark • Logistic regression example of Spark:
      val data = spark.textFile(...).map(readPoint).cache()  // each node loads the data into memory once
      for (i <- 1 to ITERATIONS) {
        // repeated MapReduce steps to do gradient descent
        val gradient = data.map(p =>
          (1 / (1 + exp(-p.y*(w dot p.x))) - 1) * p.y * p.x
        ).reduce(_ + _)
        w -= gradient
      }
    • This is just a toy example! Why? The input to the gradient computation should be shuffled for each iteration (without shuffling, more iterations are required)
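
For readers who want to run the gradient formula from this slide without a cluster, here is a single-machine Python sketch of the same batch update. It is not Spark code; `logistic_gradient` and `train` are hypothetical names, points are plain (x, y) tuples with y in {+1, -1}, and, as in the slide, no learning rate is applied.

```python
import math


def logistic_gradient(w, points):
    """Sum over points of (1 / (1 + exp(-y * w.x)) - 1) * y * x."""
    grad = [0.0] * len(w)
    for x, y in points:
        margin = y * sum(wi * xi for wi, xi in zip(w, x))
        scale = (1.0 / (1.0 + math.exp(-margin)) - 1.0) * y
        for i, xi in enumerate(x):
            grad[i] += scale * xi
    return grad


def train(points, dims, iterations):
    """Full-batch gradient descent; every iteration rescans all points."""
    w = [0.0] * dims
    for _ in range(iterations):
        grad = logistic_gradient(w, points)
        w = [wi - gi for wi, gi in zip(w, grad)]
    return w


print(logistic_gradient([0.0, 0.0], [([1.0, 2.0], 1)]))  # [-0.5, -1.0]
```

Each call to `logistic_gradient` corresponds to one map-and-reduce pass over the cached data in the Spark version.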
  • 34. What does MLlib actually do? • Mini-batch Gradient Descent with Sampling (GradientDescent.scala, bit.ly/spark-gd):
      val data = ..
      for (i <- 1 to numIterations) {
        val sampled =   // sample a subset of the data (partitioned RDD)
        val gradient =  // average the subgradients over the sampled data using Spark MapReduce
        w -= gradient
      }
    • Iterations are mandatory for convergence because each iteration uses only a small fraction of the data
  • 35. How to deal with Iterations in Hivemall • Hivemall provides the amplify UDTF to emulate the effect of iterations in machine learning without several MapReduce steps:
      SET hivevar:xtimes=3;
      CREATE VIEW training_x3 AS
      SELECT * FROM (
        SELECT amplify(${xtimes}, *) AS (rowid, label, features)
        FROM training
      ) t
      CLUSTER BY rand()
  • 36. Map-only shuffling and amplifying • The rand_amplify UDTF randomly shuffles the input rows within each map task:
      CREATE VIEW training_x3 AS
      SELECT rand_amplify(${xtimes}, ${shufflebuffersize}, *) AS (rowid, label, features)
      FROM training;
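
The difference between the two UDTFs can be shown with Python generators. This is a sketch of the idea under stated assumptions, not the Hivemall implementation: `amplify` repeats every row `xtimes` times in place, while `rand_amplify` additionally shuffles rows inside a bounded buffer (the flush-when-full policy and the `seed` parameter are assumptions made for this illustration).

```python
import random


def amplify(xtimes, rows):
    """Emit every input row xtimes times, in input order."""
    for row in rows:
        for _ in range(xtimes):
            yield row


def rand_amplify(xtimes, buffer_size, rows, seed=None):
    """Amplify rows and shuffle them locally within a bounded buffer."""
    rng = random.Random(seed)
    buffer = []
    for row in amplify(xtimes, rows):
        buffer.append(row)
        if len(buffer) >= buffer_size:
            rng.shuffle(buffer)   # local shuffle, no network transfer
            yield from buffer
            buffer = []
    rng.shuffle(buffer)           # flush whatever is left
    yield from buffer


print(list(amplify(3, ["a", "b"])))  # ['a', 'a', 'a', 'b', 'b', 'b']
print(sorted(rand_amplify(3, 4, ["a", "b"], seed=0)))
# ['a', 'a', 'a', 'b', 'b', 'b']
```

Both variants emit the same multiset of rows; only the order differs. The buffered shuffle stays inside one task, which is why it avoids the cost of the global shuffle in the CLUSTER BY variant.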
  • 37. Detailed plan w/ map-local shuffle • Map task: Table scan → Rand Amplifier → Logress UDTF → Partial aggregate → Map write • Shuffle (distributed by feature) • Reduce task: Merge → Aggregate → Reduce write • Scanned entries are amplified and then shuffled; note that this is a pipelined operation • The Rand Amplifier operator is interleaved between the table scan and the training operator
  • 38. Performance effects of amplifiers • With map-local shuffle, prediction accuracy improved with an acceptable overhead
      Method | Elapsed time (sec) | AUC
      Plain | 89.718 | 0.734805
      amplifier + clustered by (a.k.a. global shuffle) | 479.855 | 0.746214
      rand_amplifier (a.k.a. map-local shuffle) | 116.424 | 0.743392
  • 40. Experimental Evaluation • Compared the performance of our batch learning scheme to state-of-the-art machine learning techniques, namely Bismarck and Vowpal Wabbit • Dataset: KDD Cup 2012, Track 2 dataset, which is one of the largest publicly available datasets for machine learning, provided by a commercial search engine provider (bit.ly/hivemall-kdd-dataset) • The training data is about 235 million records in 33 GB • The number of feature dimensions is about 54 million • Task: predicting click-through rates of search engine ads • Experimental environment: 33 in-house commodity servers (32 slave nodes for Hadoop), each equipped with 8 processors and 24 GB of memory
  • 41. Performance comparison • Throughput: 2.3 million tuples/sec on 32 nodes • Latency: 96 sec for training 235 million records of 23 GB • Prediction performance (AUC) is good
      Elapsed time (sec) for training (the lower, the better): Hivemall 116.4 | VW1 596.67 | VW32 493.81 | Bismarck 755.24
      [Chart: AUC for Hivemall, VW1, VW32, and Bismarck; y-axis from 0.64 to 0.76]
  • 42. How about Spark 1.0 MLlib
      val training = MLUtils.loadLibSVMFile(sc, "hdfs://host:8020/small/training_libsvmfmt", multiclass = false)
      val model = LogisticRegressionWithSGD.train(training, numIterations)
      ..
    • Works fine for small data (10k training examples in about 1.5 MB) on 33 nodes with 5 GB of memory allocated to each worker • LoC is small and easy to understand • However, Spark does not work for the large dataset (235 million training examples of 2^24 feature dimensions in about 33 GB) • Further investigation is required
  • 43. Conclusion • Hivemall is an open source library that provides a collection of machine learning algorithms as Hive UDFs/UDTFs: easy to use; scalable to computing resources; runs on Amazon EMR; supports state-of-the-art classification algorithms; plans to support Shark/Spark SQL • Project site: github.com/myui/hivemall or bit.ly/hivemall • Message of this talk: please evaluate Hivemall by yourself; 5 minutes is enough for a quick start • Slides available at bit.ly/hivemall-slide