Apache Hivemall is a scalable machine learning library for Apache Hive, Apache Spark, and Apache Pig.
Hivemall provides a number of machine learning functionalities across classification, regression, ensemble learning, and feature engineering through UDFs/UDAFs/UDTFs of Hive.
We released the first Apache release (v0.5.0-incubating) on Mar 5, 2018, and the project plans to release v0.5.2 in Q2, 2018.
We will first give a quick walk-through of the features, usage, what's new in v0.5.0, and the future roadmap of Apache Hivemall. Next, we will introduce Hivemall on Apache Spark in depth, including DataFrame integration and Spark 2.3 support in Hivemall.
1. Introduction to Apache Hivemall v0.5.0:
Machine Learning on Hive/Spark
Makoto YUI @myui
ApacheCon North America 2018
Takashi Yamamuro @maropu
@ApacheHivemall
1). Principal Engineer,
2). Research Engineer,
2. Plan of the talk
1. Introduction to Hivemall
A quick walk-through of features, usage, what's new in v0.5.0, and the future roadmap
2. Hivemall on Spark
New top-k join enhancement, and a feature plan for supporting Spark 2.3 and feature selection
3. We released the first Apache release v0.5.0 on Mar 3rd, 2018!
hivemall.incubator.apache.org
We plan to start voting for the 2nd Apache release (v0.5.2) next month (Oct 2018).
4. What's new in v0.5.0?
Anomaly/Change Point Detection
Algorithm: ChangeFinder, SST
Topic Modeling (Soft Clustering)
Algorithm: LDA, pLSA
Hivemall on Spark 2.0/2.1/2.2
SparkSQL/DataFrame support, Top-k data processing
5. What is Apache Hivemall
Scalable machine learning library built as a collection of Hive UDFs
Multi/Cross platform, Versatile, Scalable, Ease-of-use
6. Hivemall is easy and scalable …
Ease-of-use: ML made easy for SQL developers
Scalable: Born to be parallel and scalable
100+ lines of code
CREATE TABLE lr_model AS
SELECT
feature, -- reducers perform model averaging in parallel
avg(weight) as weight
FROM (
SELECT logress(features,label,..) as (feature,weight)
FROM train
) t -- map-only task
GROUP BY feature; -- shuffled to reducers
This query automatically runs in parallel on Hadoop
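The shape of this query (independent training in each map task, then per-feature averaging in the reducers) can be illustrated outside Hive with a small Python sketch. This is an illustration under assumptions, not Hivemall's implementation: `train_partition` is a hypothetical stand-in for `logress` that runs plain SGD on logistic loss with a fixed learning rate, and `average_models` mirrors `avg(weight) ... GROUP BY feature`.

```python
import math
from collections import defaultdict

def train_partition(rows, eta=0.1, epochs=5):
    """Hypothetical stand-in for one map task: SGD on logistic loss.

    rows: list of (features, label); features is {name: value}, label is 0 or 1.
    Returns {feature: weight}, i.e. the (feature, weight) pairs a mapper emits.
    """
    weights = defaultdict(float)
    for _ in range(epochs):
        for features, label in rows:
            margin = sum(weights[f] * v for f, v in features.items())
            predicted = 1.0 / (1.0 + math.exp(-margin))
            for f, v in features.items():
                weights[f] += eta * (label - predicted) * v
    return dict(weights)

def average_models(models):
    """Mirror avg(weight) GROUP BY feature: average over the models that emitted it."""
    grouped = defaultdict(list)
    for model in models:
        for feature, weight in model.items():
            grouped[feature].append(weight)
    return {feature: sum(ws) / len(ws) for feature, ws in grouped.items()}

# Two partitions of a toy training table (assumed data).
partitions = [
    [({"bias": 1.0, "x": 2.0}, 1), ({"bias": 1.0, "x": -1.0}, 0)],
    [({"bias": 1.0, "x": 1.5}, 1), ({"bias": 1.0, "x": -2.0}, 0)],
]
local_models = [train_partition(p) for p in partitions]  # map-only phase
lr_model = average_models(local_models)                   # reduce phase
```

Each partition is trained without any coordination, which is why the map phase parallelizes trivially; the only shuffle is the per-feature grouping.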
7. Hivemall is a multi/cross-platform ML library
HiveQL, SparkSQL/DataFrame API, Pig Latin
Prediction models built by Hive can be used from Spark, and conversely, prediction models built by Spark can be used from Hive.
8. Hivemall's Technology Stack
Machine Learning: Hivemall, MLlib
Query Processing: Hive, Pig, SparkSQL
Parallel Data Processing Framework: MapReduce (MRv1), Apache Tez (DAG processing), Apache Spark
Resource Management: Apache YARN, MESOS
Distributed File System / Cloud Storage: Hadoop HDFS, Amazon S3
14. Versatile
Hivemall is a versatile library
✓ Not only for machine learning
✓ Provides a bunch of generic utility functions
Each organization has its own set of UDFs for data preprocessing
Don't Repeat Yourself!
15. Hivemall generic functions
Array and Map
Bit and compress
> BASE91
> UNBASE91
String and NLP
> NORMALIZE_UNICODE
> SPLIT_WORDS
> IS_STOPWORD
> TOKENIZE
> TOKENIZE_JA/CN
> TF/IDF
> SINGULARIZE
Geo Spatial
> TILE
> MAP_URL
> HAVERSINE_DISTANCE
Top-k processing
JSON
> TO_JSON
> FROM_JSON
Brickhouse UDFs are merged in the v0.5.2 release.
We welcome contributions of your generic UDFs to Hivemall.
16. Top-k query processing
List top-2 students for each class
student  class  score
1        b      70
2        a      80
3        a      90
4        b      50
5        a      70
6        b      60
SELECT * FROM (
SELECT
*,
rank() over (partition by class order by score desc)
as rank
FROM table
) t
WHERE rank <= 2
The RANK() OVER query does not finish within 24 hours on a dataset of 20 million MOOC classes with an average of 1,000 students per class.
17. Top-k query processing
List top-2 students for each class
SELECT
each_top_k(
2, class, score,
class, student
) as (rank, score, class, student)
FROM (
SELECT * FROM table
DISTRIBUTE BY class SORT BY class
) t
EACH_TOP_K finishes in 2 hours.
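One plausible reason for the gap is that a per-group top-k only needs a bounded buffer of k rows per group, whereas a window rank orders every row in each partition. The Python sketch below illustrates that idea on the student table; `each_top_k_sketch` is a hypothetical name, and the code is an assumption about the general technique rather than Hivemall's implementation. It expects input already clustered by group, which is what `DISTRIBUTE BY class SORT BY class` arranges in the query.

```python
import heapq

def each_top_k_sketch(k, rows):
    """Yield (rank, score, group, payload) for the top-k rows of each group.

    rows: iterable of (group, score, payload), already clustered by group.
    """
    def flush(group, heap):
        for rank, (score, _, payload) in enumerate(sorted(heap, reverse=True), start=1):
            yield rank, score, group, payload

    current, heap = None, []
    for seq, (group, score, payload) in enumerate(rows):
        if group != current and heap:
            # Group boundary: emit the finished group and reset the buffer.
            yield from flush(current, heap)
            heap = []
        current = group
        item = (score, -seq, payload)  # -seq makes earlier rows win ties
        if len(heap) < k:
            heapq.heappush(heap, item)
        elif item > heap[0]:
            heapq.heapreplace(heap, item)  # evict the current k-th best
    if heap:
        yield from flush(current, heap)

# (student, class, score) rows from the slide.
table = [(1, "b", 70), (2, "a", 80), (3, "a", 90), (4, "b", 50), (5, "a", 70), (6, "b", 60)]
clustered = sorted(((cls, score, student) for student, cls, score in table), key=lambda r: r[0])
result = list(each_top_k_sketch(2, clustered))
# result == [(1, 90, 'a', 3), (2, 80, 'a', 2), (1, 70, 'b', 1), (2, 60, 'b', 6)]
```

Memory stays at k entries per group regardless of group size, and each row costs at most one O(log k) heap operation.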
20. List of Supported Algorithms
Classification
✓ Perceptron
✓ Passive Aggressive (PA, PA1, PA2)
✓ Confidence Weighted (CW)
✓ Adaptive Regularization of Weight Vectors (AROW)
✓ Soft Confidence Weighted (SCW)
✓ AdaGrad+RDA
✓ Factorization Machines
✓ RandomForest Classification
Regression
✓ Logistic Regression (SGD)
✓ AdaGrad (logistic loss)
✓ AdaDELTA (logistic loss)
✓ PA Regression
✓ AROW Regression
✓ Factorization Machines
✓ RandomForest Regression
SCW is a good first choice. Try RandomForest if SCW does not work.
Logistic regression is good for getting the probability of a positive class.
Factorization Machines is good where features are sparse and categorical.
22. Generic Classifier/Regressor
Available Loss functions
For Binary Classification:
• HingeLoss
• LogLoss (synonym: logistic)
• SquaredHingeLoss
• ModifiedHuberLoss
For Regression:
• Squared Loss
• Quantile Loss
• Epsilon Insensitive Loss
• Squared Epsilon Insensitive Loss
• Huber Loss
Optimizer
• SGD
• AdaGrad
• AdaDelta
• ADAM
Regularization
• L1
• L2
• ElasticNet
• RDA
Other options
• Iteration support
• mini-batch
• Early stopping
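For reference, several of the listed losses can be written out in a few lines of Python. These are the standard textbook definitions, not code taken from Hivemall; the function names, the label convention (y in {-1, +1} for classification), and the default values for `epsilon`, `delta`, and `tau` are assumptions made for this sketch.

```python
import math

def hinge_loss(p, y):
    """Binary classification; y in {-1, +1}, p is the raw score."""
    return max(0.0, 1.0 - y * p)

def squared_hinge_loss(p, y):
    return hinge_loss(p, y) ** 2

def log_loss(p, y):
    """Logistic loss log(1 + exp(-y * p)), written to avoid overflow."""
    z = y * p
    return math.log1p(math.exp(-z)) if z > 0 else -z + math.log1p(math.exp(z))

def squared_loss(p, y):
    """Regression; the 1/2 factor is a common convention, assumed here."""
    return 0.5 * (p - y) ** 2

def epsilon_insensitive_loss(p, y, epsilon=0.1):
    """Errors smaller than epsilon cost nothing."""
    return max(0.0, abs(p - y) - epsilon)

def huber_loss(p, y, delta=1.0):
    """Quadratic near zero, linear for residuals larger than delta."""
    r = abs(p - y)
    return 0.5 * r * r if r <= delta else delta * (r - 0.5 * delta)

def quantile_loss(p, y, tau=0.5):
    """Pinball loss; tau is the target quantile."""
    e = y - p
    return tau * e if e >= 0 else (tau - 1.0) * e
```

The choice mostly trades robustness against smoothness: Huber and epsilon-insensitive losses dampen the influence of outliers, while squared loss penalizes them heavily.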
24. Training of RandomForest
Good news: sparse vector input (LIBSVM format) has been supported since v0.5.0, in addition to dense vector input.
28. XGBoost support in Hivemall (beta version)
SELECT train_xgboost_classifier(features, label) as (model_id, model)
FROM training_data;
SELECT rowid, AVG(predicted) as predicted
FROM (
-- predict with each model
SELECT xgboost_predict(rowid, features, model_id, model) AS (rowid, predicted)
-- join each test record with each model
FROM xgboost_models CROSS JOIN test_data_with_id
) t
GROUP BY rowid;
29. Supported Algorithms for Recommendation
K-Nearest Neighbor
✓ Minhash and b-Bit Minhash (LSH variant)
✓ Similarity Search on Vector Space (Euclid/Cosine/Jaccard/Angular)
Matrix Completion
✓ Matrix Factorization
✓ Factorization Machines (regression)
The each_top_k function of Hivemall is useful for recommending top-k items.
30. Other Supported Algorithms
Feature Engineering
✓ Feature Hashing
✓ Feature Scaling (normalization, z-score)
✓ Feature Binning
✓ TF-IDF vectorizer
✓ Polynomial Expansion
✓ Amplifier
NLP
✓ Basic English text Tokenizer
✓ English/Japanese/Chinese Tokenizer
Evaluation metrics
✓ AUC, nDCG, logloss, precision/recall@K, etc.
32. Feature Engineering - Feature Binning
Maps quantitative variables to a fixed number of bins based on quantiles/distribution
Map Ages into 3 bins
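A minimal Python sketch of quantile-based binning, using an assumed list of ages. The helper names `quantile_cut_points` and `bin_index` are hypothetical and do not correspond to Hivemall's UDF signatures; the cut points come from the standard library's `statistics.quantiles` (Python 3.8+).

```python
from bisect import bisect_right
from statistics import quantiles

def quantile_cut_points(values, num_bins):
    """Return the num_bins - 1 interior cut points that split values into quantile bins."""
    return quantiles(values, n=num_bins, method="inclusive")

def bin_index(value, cut_points):
    """Return the 0-based bin index for value."""
    return bisect_right(cut_points, value)

ages = [18, 22, 25, 31, 38, 45, 52, 60, 71]  # assumed sample data
cuts = quantile_cut_points(ages, 3)           # two cut points for three bins
binned = [bin_index(a, cuts) for a in ages]
# binned == [0, 0, 0, 1, 1, 1, 2, 2, 2]
```

Because the cut points follow the data's quantiles, each bin receives roughly the same number of rows even when the values are skewed.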
35. Other Supported Features
Anomaly Detection
✓ Local Outlier Factor (LoF)
✓ ChangeFinder
Clustering / Topic models
✓ Online mini-batch LDA
✓ Online mini-batch PLSA
Change Point Detection
✓ ChangeFinder
✓ Singular Spectrum Transformation
36. Anomaly/Change-point Detection by ChangeFinder
Efficient algorithm for finding change points and outliers from time-series data
J. Takeuchi and K. Yamanishi, "A Unifying Framework for Detecting Outliers and Change Points from Time Series", IEEE Transactions on Knowledge and Data Engineering, pp. 482-492, 2006.
40. Change-point detection by Singular Spectrum Transformation
• T. Ide and K. Inoue, "Knowledge Discovery from Heterogeneous Dynamic Systems using Change-Point Correlations", Proc. SDM, 2005.
• T. Ide and K. Tsuda, "Change-point detection using Krylov subspace learning", Proc. SDM, 2007.
44. What's new in the coming v0.5.2
✓ Spark 2.3 support
✓ Merged Brickhouse UDFs
✓ Field-aware Factorization Machines
State-of-the-art method for CTR prediction, often used in Kaggle competitions
Yuchin Juan, Yong Zhuang, Wei-Sheng Chin, and Chih-Jen Lin, "Field-aware Factorization Machines for CTR Prediction", Proc. RecSys, 2016.
✓ SLIM recommendation
Very promising algorithm for top-k recommendation
Xia Ning and George Karypis, "SLIM: Sparse Linear Methods for Top-N Recommender Systems", Proc. ICDM, 2011.
45. Future work for v0.6 and later
✓ Word2Vec support
✓ Multi-class Logistic Regression
✓ More efficient XGBoost support
✓ LightGBM support
✓ Gradient Boosting
✓ Kafka KSQL UDF porting
PR#91
PR#116
49. Copyright©2018 NTT corp. All Rights Reserved.
57. // Downloads Spark v2.3 and launches a spark-shell with Hivemall
68. K-length priority queue
Computes top-K rows by using a priority queue
Only joins top-K rows
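The slide text does not preserve how this is implemented on Spark, so the following Python sketch only illustrates the general idea stated here: score candidate rows, keep the best K in a K-length priority queue, and materialize join output for those K rows only. The function `top_k_join`, the `closeness` scoring function, and the sample rows are all hypothetical.

```python
import heapq

def top_k_join(k, left_rows, right_rows, score):
    """For each left row, join only the k right rows with the highest score."""
    joined = []
    for left in left_rows:
        heap = []  # min-heap holding at most k (score, index) entries
        for idx, right in enumerate(right_rows):
            entry = (score(left, right), idx)
            if len(heap) < k:
                heapq.heappush(heap, entry)
            elif entry > heap[0]:
                heapq.heapreplace(heap, entry)  # drop the current k-th best
        # Only the surviving k rows are materialized as join output.
        for s, idx in sorted(heap, reverse=True):
            joined.append((left, right_rows[idx], s))
    return joined

users = [("u1", 1.5), ("u2", 5.0)]
items = [("i1", 0.0), ("i2", 2.0), ("i3", 4.0), ("i4", 9.0)]

def closeness(user, item):
    """Assumed score: items nearer to the user's value rank higher."""
    return -abs(user[1] - item[1])

pairs = top_k_join(2, users, items, closeness)
# pairs joins u1 with i2, i1 and u2 with i3, i2
```

Compared with joining everything and ranking afterwards, the output size drops from |left| x |right| rows to at most |left| x K.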
73. Data Extraction (e.g., by SQL)
Feature Selection (e.g., by scikit-learn)
Selected Features
Arun Kumar, Jeffrey Naughton, Jignesh M. Patel, and Xiaojin Zhu, "To Join or Not to Join?: Thinking Twice about Joins before Feature Selection", Proceedings of SIGMOD, 2016.
74. Data Extraction + Feature Selection
Join Pruning by Data Statistics
75. Conclusion and Takeaway
Hivemall is a multi/cross-platform ML library (HiveQL, SparkSQL/DataFrame API, Pig Latin) providing a collection of machine learning algorithms as Hive UDFs/UDTFs.
The 2nd Apache release (v0.5.2) will appear soon!
We welcome your contributions to Apache Hivemall.