What's new in Apache Hivemall v0.5.0

Hivemall v0.5.0
Research Engineer, Treasure Data
Makoto YUI @myui
@ApacheHivemall
2018/4/17 Hivemall meetup

What is Apache Hivemall
Scalable machine learning library built as a collection of Hive UDFs
• Multi/Cross platform
• Versatile
• Scalable
• Ease-of-use

Hivemall is easy and scalable …
• Ease-of-use: ML made easy for SQL developers
• Scalable: born to be parallel and scalable
100+ lines of code
CREATE TABLE lr_model AS
SELECT
feature, -- reducers perform model averaging in parallel
avg(weight) as weight
FROM (
SELECT logress(features,label,..) as (feature,weight)
FROM train
) t -- map-only task
GROUP BY feature; -- shuffled to reducers
This query automatically runs in parallel on Hadoop
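The averaging step of the query above can be pictured outside Hive as follows. This is a minimal Python sketch, not Hivemall code: `average_models` and the sample weights are hypothetical, and each dict stands in for the (feature, weight) rows emitted by one map task.

```python
from collections import defaultdict


def average_models(partial_models):
    """Mimic `SELECT feature, avg(weight) ... GROUP BY feature`.

    Each element of partial_models is a {feature: weight} dict,
    standing in for the rows produced by one map-only training task.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for model in partial_models:
        for feature, weight in model.items():
            sums[feature] += weight
            counts[feature] += 1
    # avg() only sees the rows that exist, so a feature is averaged
    # over the tasks that actually emitted a weight for it.
    return {feature: sums[feature] / counts[feature] for feature in sums}


# Hypothetical weights learned independently by two map tasks.
partial_models = [
    {"age": 0.5, "height": -1.0},
    {"age": 1.5, "bias": 0.25},
]
averaged = average_models(partial_models)
```

Because every map task trains on its own split and the reducers only average, no coordination is needed during training, which is what lets the query scale out.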

Hivemall is a multi/cross-platform ML library
HiveQL, SparkSQL/DataFrame API, Pig Latin
Prediction models built by Hive can be used from Spark and, conversely, prediction models built by Spark can be used from Hive.

Hivemall on Apache Hive

Hivemall on Apache Spark Dataframe

Hivemall on SparkSQL

Hivemall on Apache Pig

Online Prediction by Apache Streaming

What’s new in v0.5.0?
• Anomaly/Change Point Detection (Algorithm: ChangeFinder, SST)
• Topic Modeling (Soft Clustering) (Algorithm: LDA, pLSA)
• Hivemall on Spark v2.0/v2.1/v2.2: SparkSQL/DataFrame support, Top-k data processing

Generic Classifier/Regressor
Old style vs. new style from v0.5.0

Generic Classifier/Regressor
Available loss functions
For binary classification:
• HingeLoss
• LogLoss (synonym: logistic)
• SquaredHingeLoss
• ModifiedHuberLoss
For regression:
• Squared Loss
• Quantile Loss
• Epsilon Insensitive Loss
• Squared Epsilon Insensitive Loss
• Huber Loss
Optimizer
• SGD
• AdaGrad
• AdaDelta
• ADAM
Regularization
• L1
• L2
• ElasticNet
• RDA
Other options
• Iteration support
• mini-batch
• Early stopping
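For readers unfamiliar with the loss names listed above, here is a minimal Python sketch of their textbook forms. It is not taken from Hivemall: the function names, the default `epsilon` and `delta`, and the exact scaling constants are assumptions, and classification labels are assumed to be -1 or +1.

```python
import math


def hinge_loss(p, y):
    """Hinge loss for a raw score p and a label y in {-1, +1}."""
    return max(0.0, 1.0 - y * p)


def squared_hinge_loss(p, y):
    """Square of the hinge loss."""
    return hinge_loss(p, y) ** 2


def log_loss(p, y):
    """Logistic loss log(1 + exp(-y * p)) for y in {-1, +1}."""
    return math.log1p(math.exp(-y * p))


def squared_loss(p, y):
    """Squared error between prediction p and target y."""
    return (p - y) ** 2


def epsilon_insensitive_loss(p, y, epsilon=0.1):
    """Errors smaller than epsilon are ignored (epsilon is an assumed default)."""
    return max(0.0, abs(y - p) - epsilon)


def huber_loss(p, y, delta=1.0):
    """Quadratic near zero, linear in the tails (delta is an assumed default)."""
    r = abs(y - p)
    if r <= delta:
        return 0.5 * r * r
    return delta * (r - 0.5 * delta)
```

The classification losses penalize a wrong or low-margin sign of `y * p`, while the regression losses penalize the distance `|y - p|`.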

Generic Classifier/Regressor Hyperparameters
AdaGrad + RDA by default
-eta0 <arg> The initial learning rate [default: 0.1]
-iter,--iterations <arg> The maximum number of iterations [default: 10]
-lambda <arg> Regularization term [default: 0.0001]
-loss,--loss_function <arg> Loss function [HingeLoss (default), LogLoss, SquaredHingeLoss, ModifiedHuberLoss, or a regression loss: SquaredLoss, QuantileLoss, EpsilonInsensitiveLoss, SquaredEpsilonInsensitiveLoss, HuberLoss]
-mini_batch,--mini_batch_size <arg> Mini batch size [default: 1]. Expecting the value in range [1,100] or so.
-opt,--optimizer <arg> Optimizer to update weights [default: adagrad, sgd, adadelta, adam]
-reg,--regularization <arg> Regularization type [default: rda, l1, l2, elasticnet]
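The hyperparameters above are passed as a single option string. The following Python sketch assembles such a string from the flags and defaults listed on this slide. `build_options` is a hypothetical helper, not part of Hivemall, and comparing names case-insensitively is a choice of this sketch only.

```python
LOSSES = {
    "hingeloss", "logloss", "squaredhingeloss", "modifiedhuberloss",
    "squaredloss", "quantileloss", "epsiloninsensitiveloss",
    "squaredepsiloninsensitiveloss", "huberloss",
}
OPTIMIZERS = {"adagrad", "sgd", "adadelta", "adam"}
REGULARIZERS = {"rda", "l1", "l2", "elasticnet"}


def build_options(loss="HingeLoss", optimizer="adagrad", regularization="rda",
                  iterations=10, eta0=0.1, lam=0.0001, mini_batch=1):
    """Build an option string using the flags and defaults from the slide."""
    if loss.lower() not in LOSSES:
        raise ValueError(f"unknown loss function: {loss}")
    if optimizer.lower() not in OPTIMIZERS:
        raise ValueError(f"unknown optimizer: {optimizer}")
    if regularization.lower() not in REGULARIZERS:
        raise ValueError(f"unknown regularization: {regularization}")
    return (f"-loss {loss} -opt {optimizer} -reg {regularization} "
            f"-iter {iterations} -eta0 {eta0} -lambda {lam} "
            f"-mini_batch {mini_batch}")


default_options = build_options()
logistic_options = build_options(loss="LogLoss", optimizer="adam",
                                 regularization="l2", iterations=20)
```

Validating the names before submitting a query catches typos early, since a training job on a cluster is slow to fail.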

RandomForest in Hivemall
Ensemble of Decision Trees

What’s OOB in RandomForests?
uniform/stratified sampling
Image borrowed from http://alfredplpl.hatenablog.com/entry/2013/12/24/225420

Stratified Sampling
Source: https://bellcurve.jp/statistics/course/8007.html

What’s OOB in RandomForests?
Data not used for training is used to evaluate the model's accuracy.
Source: http://alfredplpl.hatenablog.com/entry/2013/12/24/225420
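The out-of-bag (OOB) idea can be sketched in a few lines of Python. This is an illustration only: `bootstrap_split` is a hypothetical helper, and it assumes uniform sampling with replacement (the stratified variant is not shown).

```python
import random


def bootstrap_split(n_rows, seed=42):
    """Draw a bootstrap sample and return (in_bag, out_of_bag) row indices."""
    rng = random.Random(seed)
    # Sample n_rows indices with replacement: some rows repeat, some never appear.
    in_bag = [rng.randrange(n_rows) for _ in range(n_rows)]
    in_bag_set = set(in_bag)
    # Rows never drawn are "out of bag" for this tree and can be used
    # to estimate its error without a separate validation set.
    oob = [i for i in range(n_rows) if i not in in_bag_set]
    return in_bag, oob


in_bag, oob = bootstrap_split(100)
```

For large samples roughly 1/e (about 37%) of the rows end up out of bag for each tree, so every tree gets a sizeable evaluation set for free.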

Training of RandomForest
Good news: Sparse Vector input (LIBSVM format) is supported since v0.5.0, in addition to Dense Vector!
train_randomforest_classifier(array<double|string> features, int label [, const string options, const array<double> classWeights])
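To make the relation between the two input shapes concrete, the sketch below expands a sparse vector into a dense one. It assumes each sparse element has the form `<index>[:<value>]` with 1-based indices and an implicit value of 1.0. `sparse_to_dense` is a hypothetical helper, not a Hivemall function.

```python
def sparse_to_dense(features, size):
    """Expand ["1:1.0", "3"]-style features into a dense list of floats."""
    dense = [0.0] * size
    for feature in features:
        index_str, sep, value_str = feature.partition(":")
        index = int(index_str)
        # Index 0 is reserved for the bias term, so valid indices start at 1.
        if not 1 <= index <= size:
            raise ValueError(f"index out of range: {index}")
        # A missing ":<value>" part means the value defaults to 1.0.
        dense[index - 1] = float(value_str) if sep else 1.0
    return dense


dense_a = sparse_to_dense(["1:1.0", "3:3.0"], 3)
dense_b = sparse_to_dense(["1:1.0", "3"], 3)
```

The sparse form only lists non-zero entries, which matters for wide feature spaces such as hashed categorical features.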

Supported Feature Vector Format of Random Forests
• Dense Vector (array<double>)
  1.0, 0.0, 3.0
• Sparse Vector (array<string>) in a LIBSVM format
  1:1.0, 2:0.0, 3:3.0
  1:1.0, 3:3.0
  1:1.0, 3
• feature := <index>[":"<value>]
  where index := <integer> starting with 1 (index = 0 is reserved for bias clause)
  and value := <floating point> (default 1.0 if not provided)

Feature Engineering – Feature Hashing
select feature_hashing(array("userid#4505:3.3","movieid#2331:4.999","movieid#2331"));
["1828616:3.3","6238429:4.999","6238429"]
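The hashing trick maps an arbitrary feature name to an integer index and keeps the value untouched. The Python sketch below illustrates the idea under loud assumptions: CRC32 is used as a stand-in hash and 2^24 as the index space, so the indices it produces will not match those returned by Hivemall's `feature_hashing`.

```python
import zlib


def feature_hashing(features, num_features=2 ** 24):
    """Replace feature names with hashed 1-based indices, keeping values."""
    hashed = []
    for feature in features:
        name, sep, value = feature.rpartition(":")
        if not sep:
            # No ":<value>" suffix: the whole string is the feature name.
            name = feature
        # Stand-in hash (CRC32), folded into the range [1, num_features].
        index = zlib.crc32(name.encode("utf-8")) % num_features + 1
        hashed.append(f"{index}:{value}" if sep else str(index))
    return hashed


hashed = feature_hashing(["userid#4505:3.3", "movieid#2331:4.999", "movieid#2331"])
```

As in the slide's example, the same name always maps to the same index whether or not a value is attached, and distinct names may occasionally collide.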

Random Forests Training Hyperparameters
-attrs,--attribute_types <arg> Comma separated attribute types (Q for quantitative variable and C for categorical variable, e.g., [Q,C,Q,C])
-depth,--max_depth <arg> The maximum tree depth [default: Integer.MAX_VALUE]
-leafs,--max_leaf_nodes <arg> The maximum number of leaf nodes [default: Integer.MAX_VALUE]
-min_samples_leaf <arg> The minimum number of samples in a leaf node [default: 1]
-rule,--split_rule <arg> Split algorithm [default: GINI, ENTROPY, CLASSIFICATION_ERROR]
-seed <arg> Seed value in long [default: -1 (random)]
-splits,--min_split <arg> A node that has at least `min_split` examples will be split [default: 2]
-stratified,--stratified_sampling Enable stratified sampling for unbalanced data
-subsample <arg> Sampling rate in range (0.0,1.0]
-trees,--num_trees <arg> The number of trees for each task [default: 50]
-vars,--num_variables <arg> The number of randomly selected features [default: ceil(sqrt(x[0].length))]. int(num_variables * x[0].length) is used if num_variables is in (0.0,1.0]
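The rule for `-vars` reads more easily as code. The following Python sketch follows the wording of the option description literally. `resolve_num_variables` is a hypothetical name, and treating any value in (0.0, 1.0], including exactly 1, as a ratio is taken from that wording rather than from the implementation.

```python
import math


def resolve_num_variables(num_features, num_variables=None):
    """Number of features sampled per split, following the -vars description."""
    if num_variables is None:
        # Default: ceil(sqrt(number of features)).
        return math.ceil(math.sqrt(num_features))
    if 0.0 < num_variables <= 1.0:
        # Values in (0.0, 1.0] are read as a fraction of the features.
        return int(num_variables * num_features)
    # Anything larger is taken as an absolute count.
    return int(num_variables)
```

For example, with 10 features the default is 4, `-vars 0.5` gives 5, and `-vars 3` gives 3.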

Prediction of RandomForest
Posterior probability based on votes over the classes predicted by the decision trees
Model reliability based on the OOB error rate
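A vote-based posterior can be sketched as follows in Python. `vote` is a hypothetical helper that counts every tree equally. Weighting trees by their OOB error rate, as the second point suggests, is not modeled here.

```python
from collections import Counter


def vote(predicted_labels):
    """Return (majority label, share of trees that voted for it)."""
    counts = Counter(predicted_labels)
    label, votes = counts.most_common(1)[0]
    # The share of votes serves as a simple posterior probability estimate.
    return label, votes / len(predicted_labels)


# Hypothetical predictions of four trees for a single test row.
label, probability = vote([1, 1, 0, 1])
```

A share close to 1.0 means the trees largely agree, while a share near 1/(number of classes) signals an uncertain prediction.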

Decision Tree Visualization

Decision Tree Visualization
http://viz-js.com/

Feature Engineering – Feature Binning
Maps quantitative variables to a fixed number of bins based on quantiles/distribution
Example: map ages into 3 bins
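Quantile-based binning can be illustrated with the Python standard library. This is a sketch of the general technique, not of Hivemall's binning functions: `quantile_boundaries`, `assign_bin`, and the sample ages are hypothetical, and `statistics.quantiles` requires Python 3.8 or later.

```python
import bisect
import statistics


def quantile_boundaries(values, num_bins):
    """Cut points that split the values into num_bins roughly equal groups."""
    return statistics.quantiles(values, n=num_bins)


def assign_bin(value, boundaries):
    """Index of the bin (0-based) that the value falls into."""
    return bisect.bisect_right(boundaries, value)


# Hypothetical ages mapped into 3 bins.
ages = [10, 20, 30, 40, 50, 60]
boundaries = quantile_boundaries(ages, 3)
bins = [assign_bin(age, boundaries) for age in ages]
```

With these sample ages the two youngest fall into bin 0, the next two into bin 1, and the two oldest into bin 2. The boundaries are learned once from training data and then reused to bin new values.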

Feature Engineering – Feature Binning

Evaluation Metrics

Map tiling functions

Map tiling functions
Tile(lat,lon,zoom) = xtile(lon,zoom) + ytile(lat,zoom) * 2^zoom
Zoom=10
Zoom=15
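The tile formula can be tried out with a short Python sketch. The slide only gives the combination formula, so the `xtile` and `ytile` bodies below are assumptions: they follow the common Web Mercator "slippy map" tiling scheme and are not copied from Hivemall. The exponent is read as the zoom level.

```python
import math


def xtile(lon, zoom):
    """Tile column for a longitude in degrees (assumed slippy-map scheme)."""
    n = 2 ** zoom
    x = math.floor((lon + 180.0) / 360.0 * n)
    return min(max(x, 0), n - 1)  # clamp lon = 180 into the last column


def ytile(lat, zoom):
    """Tile row for a latitude in degrees (assumed slippy-map scheme)."""
    n = 2 ** zoom
    lat_rad = math.radians(lat)
    y = math.floor((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    return min(max(y, 0), n - 1)  # clamp latitudes near the poles


def tile(lat, lon, zoom):
    """Single tile number: column plus row times the number of columns."""
    return xtile(lon, zoom) + ytile(lat, zoom) * 2 ** zoom
```

A higher zoom splits the map into more, smaller tiles, so the same coordinate gets a coarser bucket at Zoom=10 than at Zoom=15. Grouping rows by tile number is a cheap way to aggregate geospatial data.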

Sketch and NLP functions
SELECT count(distinct id) FROM data
SELECT approx_count_distinct(id) FROM data
select tokenize_ja(" ", "normal", null, null, "https://s3.amazonaws.com/td-hivemall/dist/kuromoji-user-dict-neologd.csv.gz");
[“ ”, "," "," "]

Other Supported Features
Anomaly Detection
✓ Local Outlier Factor (LoF)
✓ ChangeFinder
Clustering / Topic models
✓ Online mini-batch LDA
✓ Online mini-batch PLSA
Change Point Detection
✓ ChangeFinder
✓ Singular Spectrum Transformation

Anomaly/Change-point Detection by ChangeFinder
Efficient algorithm for finding change points and outliers from time-series data
J. Takeuchi and K. Yamanishi, "A Unifying Framework for Detecting Outliers and Change Points from Time Series", IEEE Transactions on Knowledge and Data Engineering, pp. 482-492, 2006.

Take this…

…and do this!

Change-point detection by Singular Spectrum Transformation
• T. Ide and K. Inoue, "Knowledge Discovery from Heterogeneous Dynamic Systems using Change-Point Correlations", Proc. SDM, 2005.
• T. Ide and K. Tsuda, "Change-point detection using Krylov subspace learning", Proc. SDM, 2007.

Online mini-batch LDA

Probabilistic Latent Semantic Analysis - training

Probabilistic Latent Semantic Analysis - predict

Future work for v0.5.2 and later
✓ Word2Vec support
✓ Multi-class Logistic Regression
✓ Field-aware Factorization Machines
✓ SLIM recommendation
✓ Merge Brickhouse UDFs
✓ XGBoost support
✓ LightGBM support
✓ Gradient Boosting
Related pull requests: PR#91, PR#116, PR#58, PR#111, PR#135

Brickhouse functions
https://github.com/klout/brickhouse
SELECT from_json(to_json(
    ARRAY(
      NAMED_STRUCT("country", "japan", "city", "tokyo"),
      NAMED_STRUCT("country", "japan", "city", "osaka")
    )
  ),'array<struct<city:string>>')

Prediction tracing of Decision Tree
Trace how a prediction was made

XGBoost support in Hivemall
Experimental; not yet supported in TD
SELECT train_xgboost_classifier(features, label) as (model_id, model)
FROM training_data;
SELECT rowid, AVG(predicted) as predicted
FROM (
  -- predict with each model
  SELECT xgboost_predict(rowid, features, model_id, model) AS (rowid, predicted)
  -- join each test record with each model
  FROM xgboost_models CROSS JOIN test_data_with_id
) t
GROUP BY rowid;

Conclusion and Takeaway
Hivemall is a multi/cross-platform ML library
providing a collection of machine learning algorithms as Hive UDFs/UDTFs
Try our first Apache release (v0.5.0)!
We welcome your contributions to Apache Hivemall :)
HiveQL, SparkSQL/DataFrame API, Pig Latin

Any feature request or questions?
BTW, we are hiring!

Hivemall Digdag

Machine Learning Workflow using Digdag
