2. ➢2015/04 Joined Treasure Data, Inc.
➢1st Research Engineer in Treasure Data
➢My mission in TD is developing ML-as-a-Service
(MLaaS)
➢2010/04-2015/03 Senior Researcher at National
Institute of Advanced Industrial Science and
Technology, Japan.
➢Worked on a large-scale Machine Learning project
and Parallel Databases
➢2009/03 Ph.D. in Computer Science from NAIST
➢XML native database and Parallel Database systems
Who am I?
5. 1. What is Hivemall (short intro.)
2. Why Hivemall (motivations etc.)
3. How to use Hivemall
Agenda
6. What is Hivemall
Scalable machine learning library built as a collection of Hive
UDFs, licensed under the Apache License v2
[Figure: software stack, top layer to bottom layer]
Machine Learning: Hivemall, MLlib
Query Processing: Hive, Pig, SparkSQL
Parallel Data Processing Framework: MapReduce (MRv1), Apache Tez (DAG processing), Apache Spark
Resource Management: Apache YARN, Mesos
Distributed File System: Hadoop HDFS
7. Won IDG’s InfoWorld 2014
Bossie Awards 2014: The best open source big data tools
InfoWorld's top picks in distributed data processing, data analytics, machine
learning, NoSQL databases, and the Hadoop ecosystem
(awarded along w/ Spark, Tez, Jupyter notebook, Pandas, Impala, Kafka)
bit.ly/hivemall-award
9. List of supported Algorithms
Classification
✓ Perceptron
✓ Passive Aggressive (PA, PA1,
PA2)
✓ Confidence Weighted (CW)
✓ Adaptive Regularization of
Weight Vectors (AROW)
✓ Soft Confidence Weighted
(SCW)
✓ AdaGrad+RDA
✓ Factorization Machines
✓ RandomForest Classification
Regression
✓ Logistic Regression (SGD)
✓ AdaGrad (logistic loss)
✓ AdaDelta (logistic loss)
✓ PA Regression
✓ AROW Regression
✓ Factorization Machines
✓ RandomForest Regression
SCW is a good first choice
Try RandomForest if SCW does
not work well
Logistic regression is a good choice
when you need the probability of the
positive class
Factorization Machines work well
when features are sparse and
categorical
10. List of Algorithms for Recommendation
K-Nearest Neighbor
✓ Minhash and b-Bit Minhash
(LSH variant)
✓ Similarity Search on Vector
Space
(Euclid/Cosine/Jaccard/Angular)
Matrix Completion
✓ Matrix Factorization
✓ Factorization Machines
(regression)
Hivemall's each_top_k function is
useful for recommending the top-k items
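As a rough illustration of what a per-group top-k selection does, here is a plain Python sketch. It is not Hivemall's implementation; the function name `top_k_per_group` and the sample rows are hypothetical. The idea is to keep only the k highest-scoring items for each user:

```python
import heapq
from collections import defaultdict

def top_k_per_group(rows, k):
    """Return {group: [(score, item), ...]} with the k highest scores per group.

    rows is an iterable of (group, item, score) tuples.
    """
    grouped = defaultdict(list)
    for group, item, score in rows:
        grouped[group].append((score, item))
    # heapq.nlargest returns the k largest entries in descending order
    return {g: heapq.nlargest(k, pairs) for g, pairs in grouped.items()}

# Hypothetical (user, item, predicted score) rows
rows = [
    ("u1", "a", 0.9), ("u1", "b", 0.4), ("u1", "c", 0.7),
    ("u2", "a", 0.1), ("u2", "d", 0.8),
]
top2 = top_k_per_group(rows, 2)
```

Selecting per group avoids sorting the whole score table when only a few items per user are needed.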
12. ➢ CTR prediction of ad click logs
• Freakout Inc., Fan Communications, and more
• Replaced Spark MLlib w/ Hivemall at company X
Industry use cases of Hivemall
http://www.slideshare.net/masakazusano75/sano-hmm-2015051212
13. ➢ Gender prediction of ad click logs
• Scaleout Inc. and Fan Communications
http://eventdots.jp/eventreport/458208
Industry use cases of Hivemall
14. Industry use cases of Hivemall
➢ Value prediction of real estate
• Livesense
http://www.slideshare.net/y-ken/real-estate-tech-with-hivemall
25. Framework | User interface
Mahout | Java API programming
Spark MLlib/MLI | Scala API programming, Scala Shell (REPL)
H2O | R programming, GUI
Cloudera Oryx | HTTP REST API programming
Vowpal Wabbit (w/ Hadoop streaming) | C++ API programming, Command Line
Survey on existing ML frameworks
Existing distributed machine learning frameworks
are NOT easy to use
30. CREATE EXTERNAL TABLE e2006tfidf_train (
rowid int,
label float,
features ARRAY<STRING>
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
COLLECTION ITEMS TERMINATED BY ','
STORED AS TEXTFILE LOCATION '/dataset/E2006-tfidf/train';
How to use Hivemall - Data preparation
Define a Hive table for training/testing data
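To make the file layout behind this table definition concrete, the following Python sketch parses one line of such a text file: columns separated by a tab, and the items of the features array separated by commas. The sample line is hypothetical, and the "index:value" form of each feature string (with a bare index meaning value 1.0) is an assumption that the slide does not spell out.

```python
def parse_row(line):
    """Split one text line into (rowid, label, features)."""
    rowid, label, features = line.rstrip("\n").split("\t")
    return int(rowid), float(label), features.split(",")

def parse_feature(feature):
    """Split an 'index:value' string; a bare index is assumed to mean value 1.0."""
    index, sep, value = feature.partition(":")
    return index, float(value) if sep else 1.0

# Hypothetical input line: rowid <TAB> label <TAB> comma-separated features
row = parse_row("1\t-3.58\t102:0.5,7:0.25,31\n")
pairs = [parse_feature(f) for f in row[2]]
```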
34. How to use Hivemall - Training
CREATE TABLE lr_model AS
SELECT
feature,
avg(weight) as weight
FROM (
SELECT logress(features,label,..)
as (feature,weight)
FROM train
) t
GROUP BY feature
Training by logistic regression
Map-only task to learn a prediction model
Shuffle map outputs to reducers by feature
Reducers perform model averaging
in parallel
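The averaging step of the query above can be sketched in plain Python. This is only an illustration of `avg(weight) ... GROUP BY feature`, not Hivemall code; the function name `average_models` and the partial models are hypothetical, each dict standing in for the (feature, weight) rows emitted by one map task.

```python
from collections import defaultdict

def average_models(partial_models):
    """Average the weight of each feature across independently trained models.

    partial_models is a list of {feature: weight} dicts, one per map task.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for model in partial_models:
        for feature, weight in model.items():  # corresponds to the shuffle by feature
            sums[feature] += weight
            counts[feature] += 1
    # corresponds to avg(weight) ... GROUP BY feature on the reducers
    return {f: sums[f] / counts[f] for f in sums}

# Hypothetical outputs of three map tasks
partials = [{"f1": 0.5, "f2": -1.0}, {"f1": 1.5}, {"f2": -3.0, "f3": 0.25}]
model = average_models(partials)
```

A feature is averaged only over the map tasks that emitted it, which matches how `avg` behaves over grouped rows.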
35. How to use Hivemall - Training
CREATE TABLE news20b_cw_model1 AS
SELECT
feature,
voted_avg(weight) as weight
FROM
(SELECT
train_cw(features,label)
as (feature,weight)
FROM
news20b_train
) t
GROUP BY feature
Training of Confidence Weighted Classifier
Vote on whether to average the negative
or the positive weights, e.g.
+0.7, +0.3, +0.2, -0.1, +0.7
Training for the CW classifier
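A minimal Python sketch of the voting idea described on this slide follows. It is an interpretation of the slide, not the library's source: the weights of the majority sign are averaged and the minority is discarded. The tie-breaking rule and the result for an empty input are assumptions.

```python
def voted_avg(weights):
    """Average only the weights whose sign wins the majority vote."""
    positives = [w for w in weights if w > 0]
    negatives = [w for w in weights if w < 0]
    # Assumption: ties go to the negative side; the slide does not say
    winners = positives if len(positives) > len(negatives) else negatives
    # Assumption: return 0.0 when there is nothing to average
    return sum(winners) / len(winners) if winners else 0.0

# The weights shown on the slide: four positive votes, one negative
result = voted_avg([0.7, 0.3, 0.2, -0.1, 0.7])
```

For the slide's example the positive side wins, so the result is the mean of 0.7, 0.3, 0.2, and 0.7, and the single -0.1 is ignored.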
37. How to use Hivemall - Prediction
CREATE TABLE lr_predict
as
SELECT
t.rowid,
sigmoid(sum(m.weight)) as prob
FROM
testing_exploded t LEFT OUTER JOIN
lr_model m ON (t.feature = m.feature)
GROUP BY
t.rowid
Prediction is done by LEFT OUTER JOIN
between test data and prediction model
No need to load the entire model into memory
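The semantics of the prediction query can be sketched in Python as follows. For brevity the sketch keeps the model in a dict, which is exactly what the Hive join avoids, so treat it as an illustration of the join, sum, and sigmoid steps rather than of the execution strategy. The function names and sample data are hypothetical.

```python
import math
from collections import defaultdict

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def predict(testing_exploded, model):
    """testing_exploded: iterable of (rowid, feature); model: {feature: weight}."""
    scores = defaultdict(float)
    for rowid, feature in testing_exploded:
        # LEFT OUTER JOIN: a feature missing from the model yields NULL,
        # which sum() skips, so it contributes nothing here.
        # (Unlike SQL, a row with no matching feature at all gets 0.0, not NULL.)
        scores[rowid] += model.get(feature, 0.0)
    return {rowid: sigmoid(s) for rowid, s in scores.items()}

# Hypothetical exploded test rows and model
rows = [(1, "a"), (1, "b"), (2, "c"), (2, "unseen")]
probs = predict(rows, {"a": 1.0, "b": -1.0, "c": 2.0})
```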
39. Export Prediction Model to an RDBMS
Any RDBMS
TD export
Periodic export is very easy
in Treasure Data
103 -0.4896543622016907
104 -0.0955817922949791
105 0.12560302019119263
106 0.09214721620082855
Prediction
Model
40. Real-time Prediction on MySQL
SIGMOID(x) = 1.0 / (1.0 + exp(-x))
Prediction
Model
Label
Feature Vector
SELECT
sigmoid(sum(t.value * m.weight)) as prob
FROM
testing_exploded t LEFT OUTER JOIN
prediction_model m ON (t.feature = m.feature)
Online prediction on MySQL
Index lookups are very
efficient in RDBMSs
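The query above can be tried end to end with Python's built-in sqlite3 module, used here only as a stand-in for MySQL. The model rows are the four sample weights from the export slide; the feature vector is hypothetical. The sigmoid is applied in Python because not every SQLite build ships an exp() function, whereas MySQL provides EXP().

```python
import math
import sqlite3

conn = sqlite3.connect(":memory:")
# PRIMARY KEY gives the index used for per-feature lookups
conn.execute("CREATE TABLE prediction_model (feature INTEGER PRIMARY KEY, weight REAL)")
conn.execute("CREATE TABLE testing_exploded (feature INTEGER, value REAL)")
conn.executemany(
    "INSERT INTO prediction_model VALUES (?, ?)",
    [(103, -0.4896543622016907), (104, -0.0955817922949791),
     (105, 0.12560302019119263), (106, 0.09214721620082855)],
)
# Hypothetical feature vector; feature 999 has no weight in the model
conn.executemany(
    "INSERT INTO testing_exploded VALUES (?, ?)",
    [(103, 1.0), (105, 2.0), (999, 1.0)],
)

(score,) = conn.execute(
    """
    SELECT sum(t.value * m.weight)
    FROM testing_exploded t
    LEFT OUTER JOIN prediction_model m ON (t.feature = m.feature)
    """
).fetchone()

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

prob = sigmoid(score)
```

The unmatched feature produces a NULL product, which sum() skips, so only the features present in the model affect the score.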