1. Hivemall meets Digdag
Machine Learning Pipeline by SQL queries
Research Engineer, Treasure Data
Makoto YUI @myui
@ApacheHivemall
2018/2/17 HackerTackle
2. About me …
➢ 2015.04~ Research Engineer at Treasure Data, Inc.
• My mission is developing ML-as-a-Service in a Hadoop-as-a-service company
➢ 2010.04-2015.03 Senior Researcher at National Institute of Advanced Industrial Science and Technology, Japan.
• Developed Hivemall as a personal research project
➢ 2009.03 Ph.D. in Computer Science from NAIST
• Majored in Parallel Data Processing, not ML then
➢ Visiting scholar in CWI, Amsterdam and Univ. Edinburgh
let $succ := function($x) { $x+1 } return (for $i in (10,20,30) return $succ($i))
slideshare.net/myui/icde2010-nbgclock
3. About me …
Influenced by
✓ OCaml (for/let, type inference)
✓ Lisp (every object is a sequence/atomization)
✓ XPath
4. We Open-source! TD invented ..
• Streaming log collector
• Bulk data import/export
• Efficient binary serialization
• Machine learning on Hadoop
• Workflow Engine
• Embedded version of Fluentd
5. Plan of the talk
1. Introduction to Hivemall
2. ML workflow using Digdag
6. Hivemall entered Apache Incubator
on Sept 13, 2016
Since then, we have invited 3 contributors as new committers (one committer has been voted into the PPMC). Currently, we are working toward the first Apache release (v0.5.0).
hivemall.incubator.apache.org
8. Industry use cases of Hivemall
➢ T-mobile.au
➢ Klout – influencer marketing
bit.ly/klout-hivemall
bit.ly/2whJCQj
➢ Subaru
https://www.treasuredata.co.jp/customers/subaru/
9. Industry use cases of Hivemall
➢ CTR prediction of Ad click logs
• Freakout Inc., Fan Communications, and more
• Replaced Spark MLlib w/ Hivemall at company X
http://www.slideshare.net/masakazusano75/sano-hmm-20150512
10. Industry use cases of Hivemall
Minne (Japanese version of Etsy.com) uses Hivemall for item recommendation
https://speakerdeck.com/monochromegane/pepabo-minne-matrix-factorization-in-hivemall
11. Industry use cases of Hivemall
➢ Gender prediction of Ad click logs
• Scaleout Inc. and Fan Communications
http://eventdots.jp/eventreport/458208
12. Industry use cases of Hivemall
➢ Value prediction of real estate
• Livesense
http://www.slideshare.net/y-ken/real-estate-tech-with-hivemall
13. Industry use cases of Hivemall
➢ Churn Detection
• OISIX
http://www.slideshare.net/TaisukeFukawa/hivemall-meetup-vol2-oisix
[Figure: churn-prevention feedback loop]
• Data used for prediction (Web, Mobile): user attributes, user action log, claim histories, referrers, services used
• Find customers likely to churn using Hivemall
• Customers likely to leave receive direct and indirect countermeasures: giving points, call to care, guide to success, UI change
• Results feed back into the loop
14. What is Apache Hivemall
Scalable machine learning library built as a collection of Hive UDFs
• Multi/Cross platform
• Versatile
• Scalable
• Ease-of-use
15. Hivemall is easy and scalable …
Ease-of-use: ML made easy for SQL developers
Scalable: Born to be parallel and scalable
100+ lines of code
CREATE TABLE lr_model AS
SELECT
feature, -- reducers perform model averaging in parallel
avg(weight) as weight
FROM (
SELECT logress(features,label,..) as (feature,weight)
FROM train
) t -- map-only task
GROUP BY feature; -- shuffled to reducers
This query automatically runs in parallel on Hadoop
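The query above trains a model inside each map task and lets reducers average the weights per feature. The following is a minimal plain-Python sketch of that averaging step only. It is not Hivemall code: `average_models` and the sample weights are hypothetical names and values chosen for illustration.

```python
from collections import defaultdict

def average_models(partial_models):
    """Average per-feature weights across independently trained partial models.

    partial_models: iterable of dicts mapping feature -> weight, one dict per
    (hypothetical) map task. Mirrors `avg(weight) ... GROUP BY feature`.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for model in partial_models:
        for feature, weight in model.items():
            sums[feature] += weight
            counts[feature] += 1
    # A feature is averaged only over the models that emitted it,
    # which is what SQL avg() does after GROUP BY.
    return {feature: sums[feature] / counts[feature] for feature in sums}

# Two hypothetical mapper outputs (illustrative values only).
mapper_outputs = [
    {"age": 0.5, "price": -1.0},
    {"age": 1.5, "clicks": 2.0},
]
lr_model = average_models(mapper_outputs)
print(lr_model)
```

Because each partial model is trained without coordination, the only shuffle is the final group-by on `feature`, which is why the query parallelizes naturally.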
16. Hivemall is a multi/cross-platform ML library
Supported interfaces: HiveQL, SparkSQL/DataFrame API, Pig Latin
Prediction models built by Hive can be used from Spark, and conversely, prediction models built by Spark can be used from Hive.
24. Generic Classifier/Regressor
Available Loss functions
For Binary Classification:
• HingeLoss
• LogLoss (synonym: logistic)
• SquaredHingeLoss
• ModifiedHuberLoss
For Regression:
• Squared Loss
• Quantile Loss
• Epsilon Insensitive Loss
• Squared Epsilon Insensitive Loss
• Huber Loss
Optimizer
• SGD
• AdaGrad
• AdaDelta
• ADAM
Regularization
• L1
• L2
• ElasticNet
• RDA
Other options
• Iteration support
• mini-batch
• Early stopping
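To make a few of the listed losses concrete, here is a minimal Python sketch using their common textbook definitions, plus one SGD step for the logistic loss with optional L2 regularization. The function names, default parameters, and the 0.5 scaling factors are assumptions for illustration and may differ from Hivemall's internal definitions.

```python
import math

def hinge_loss(y, p):
    """Hinge loss for a label y in {-1, +1} and a raw score p."""
    return max(0.0, 1.0 - y * p)

def log_loss(y, p):
    """Logistic loss for a label y in {-1, +1} and a raw score p."""
    return math.log1p(math.exp(-y * p))

def squared_loss(y, p):
    """Squared loss (with the conventional 1/2 factor) for regression."""
    return 0.5 * (y - p) ** 2

def huber_loss(y, p, delta=1.0):
    """Quadratic near zero, linear for residuals larger than delta."""
    r = abs(y - p)
    return 0.5 * r * r if r <= delta else delta * (r - 0.5 * delta)

def epsilon_insensitive_loss(y, p, epsilon=0.1):
    """Residuals smaller than epsilon cost nothing."""
    return max(0.0, abs(y - p) - epsilon)

def sgd_step_logloss(w, x, y, eta=0.1, l2=0.0):
    """One SGD update of weights w on example (x, y) under logistic loss.

    eta is the learning rate; l2 is an optional L2 regularization strength.
    """
    score = sum(wi * xi for wi, xi in zip(w, x))
    g = -y / (1.0 + math.exp(y * score))  # derivative of the loss w.r.t. score
    return [wi - eta * (g * xi + l2 * wi) for wi, xi in zip(w, x)]

w = sgd_step_logloss([0.0, 0.0], [1.0, 2.0], y=1)
print(w)
```

Swapping the loss changes only the gradient term `g`; the optimizer and regularizer choices listed on the slide are orthogonal to it.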
25. Hivemall is a Versatile library ..
✓ Not only for Machine Learning
✓ Provides a bunch of generic utility functions
Each organization has its own set of UDFs for data preprocessing.
Don’t Repeat Yourself!
26. Hivemall generic functions
• Array and Map
• Bit and compress
• String and NLP
• Geo Spatial
• Top-k processing
> TF/IDF
> TILE
> MAP_URL
Brickhouse UDFs are merged in v0.5.2 release.
We welcome contributions of your generic UDFs to Hivemall.
27. Top-k query processing
List top-2 students for each class
student  class  score
1        b      70
2        a      80
3        a      90
4        b      50
5        a      70
6        b      60
SELECT * FROM (
  SELECT
    *,
    rank() over (partition by class order by score desc) as rank
  FROM table
) t
WHERE rank <= 2
The RANK() OVER query does not finish in 24 hours on a dataset with 20 million MOOC classes and an average of 1,000 students in each class.
28. Top-k query processing
List top-2 students for each class
student  class  score
1        b      70
2        a      80
3        a      90
4        b      50
5        a      70
6        b      60
SELECT
  each_top_k(
    2, class, score,
    class, student
  ) as (rank, score, class, student)
FROM (
  SELECT * FROM table
  DISTRIBUTE BY class SORT BY class
) t
EACH_TOP_K finishes in 2 hours.
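The idea behind a per-group top-k can be sketched in plain Python with one bounded heap per group, so at most k rows per class are held at any time. This is only an illustration of the technique on the slide's sample table, not Hivemall's implementation; the `each_top_k` function below is a hypothetical Python stand-in that borrows the UDTF's name and output column order.

```python
import heapq
from collections import defaultdict

def each_top_k(k, rows):
    """Return (rank, score, group, payload) for the k best rows of each group.

    rows: iterable of (group, score, payload) tuples.
    """
    heaps = defaultdict(list)  # group -> min-heap of (score, seq, payload)
    for seq, (group, score, payload) in enumerate(rows):
        heap = heaps[group]
        if len(heap) < k:
            heapq.heappush(heap, (score, seq, payload))
        elif score > heap[0][0]:
            # New row beats the current k-th best: evict the smallest.
            heapq.heapreplace(heap, (score, seq, payload))
    result = []
    for group in sorted(heaps):
        # Highest score first; earlier input order breaks ties.
        ranked = sorted(heaps[group], key=lambda t: (-t[0], t[1]))
        for rank, (score, _, payload) in enumerate(ranked, start=1):
            result.append((rank, score, group, payload))
    return result

# (class, score, student) rows from the sample table.
rows = [
    ("b", 70, 1), ("a", 80, 2), ("a", 90, 3),
    ("b", 50, 4), ("a", 70, 5), ("b", 60, 6),
]
top2 = each_top_k(2, rows)
for row in top2:
    print(row)
```

Unlike a window-function rank, which orders every row of a partition before filtering, this approach never keeps more than k candidates per group.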
31. More useful functions (Sketch, NLP)
SELECT count(distinct id) FROM data
SELECT approx_count_distinct(id) FROM data
select tokenize_ja(" ",
"normal", null, null, "https://s3.amazonaws.com/td-hivemall/dist/kuromoji-user-dict-neologd.csv.gz");
[“ ”, "," "," "]
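Approximate distinct counting trades a small error for bounded memory. The sketch below illustrates the general idea with a K-minimum-values (KMV) estimator, chosen here for brevity; it is not necessarily the algorithm behind Hivemall's `approx_count_distinct`, and the Python function of the same name, the helper `_hash01`, and the default `k` are assumptions for illustration.

```python
import hashlib
import heapq

def _hash01(value):
    """Map a value to a pseudo-uniform float in [0, 1) using MD5."""
    digest = hashlib.md5(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

def approx_count_distinct(values, k=1024):
    """Estimate the number of distinct values from the k smallest hashes."""
    kept = set()  # the (at most k) smallest distinct hashes seen so far
    heap = []     # same hashes, negated, so heap[0] is the largest kept one
    for v in values:
        h = _hash01(v)
        if h in kept:
            continue
        if len(heap) < k:
            heapq.heappush(heap, -h)
            kept.add(h)
        elif h < -heap[0]:
            evicted = -heapq.heapreplace(heap, -h)
            kept.discard(evicted)
            kept.add(h)
    if len(heap) < k:
        return len(heap)  # fewer than k distinct hashes: the count is exact
    kth_smallest = -heap[0]
    return int(round((k - 1) / kth_smallest))

ids = [i % 20000 for i in range(100000)]  # 20,000 distinct ids with repeats
estimate = approx_count_distinct(ids)
print(estimate)
```

Memory stays proportional to `k` regardless of input size, whereas an exact `count(distinct id)` must track every distinct value.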
32. List of Supported Algorithms
Classification
✓ Perceptron
✓ Passive Aggressive (PA, PA1, PA2)
✓ Confidence Weighted (CW)
✓ Adaptive Regularization of Weight Vectors (AROW)
✓ Soft Confidence Weighted (SCW)
✓ AdaGrad+RDA
✓ Factorization Machines
✓ RandomForest Classification
Regression
✓ Logistic Regression (SGD)
✓ AdaGrad (logistic loss)
✓ AdaDELTA (logistic loss)
✓ PA Regression
✓ AROW Regression
✓ Factorization Machines
✓ RandomForest Regression
SCW is a good first choice.
Try RandomForest if SCW does not work.
Logistic regression is good for getting the probability of a positive class.
Factorization Machines are good where features are sparse and categorical.
38. XGBoost support in Hivemall (beta version)
SELECT train_xgboost_classifier(features, label) as (model_id, model)
FROM training_data
SELECT rowid, AVG(predicted) as predicted
FROM (
  -- predict with each model
  SELECT xgboost_predict(rowid, features, model_id, model) AS (rowid, predicted)
  -- join each test record with each model
  FROM xgboost_models CROSS JOIN test_data_with_id
) t
GROUP BY rowid;
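The prediction query pairs every test row with every trained model and averages the per-model predictions by `rowid`. A minimal Python sketch of that pattern follows; the callables stand in for trained models and are purely hypothetical, as is the function name `predict_with_ensemble`.

```python
from collections import defaultdict
from itertools import product

def predict_with_ensemble(models, test_rows):
    """Average the predictions of all models for each test row.

    models: list of callables mapping features -> score (stand-ins for
            trained models).
    test_rows: list of (rowid, features) pairs.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    # product() plays the role of the CROSS JOIN: every model meets every row.
    for model, (rowid, features) in product(models, test_rows):
        sums[rowid] += model(features)
        counts[rowid] += 1
    # Equivalent of AVG(predicted) ... GROUP BY rowid.
    return {rowid: sums[rowid] / counts[rowid] for rowid in sums}

# Two toy "models" and two test rows (illustrative only).
toy_models = [lambda f: sum(f), lambda f: 0.0]
toy_rows = [(1, [1.0, 3.0]), (2, [2.0])]
predictions = predict_with_ensemble(toy_models, toy_rows)
print(predictions)
```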
39. Supported Algorithms for Recommendation
K-Nearest Neighbor
✓ Minhash and b-Bit Minhash (LSH variant)
✓ Similarity Search on Vector Space (Euclid/Cosine/Jaccard/Angular)
✓ DIMSUM (Cosine similarity)
Matrix Completion
✓ Matrix Factorization
✓ Factorization Machines (regression)
each_top_k function of Hivemall is useful for recommending top-k items
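As an illustration of the MinHash idea listed above, the following Python sketch estimates the Jaccard similarity of two item sets from fixed-length signatures. It is a generic textbook construction, not Hivemall's implementation; the function names, the seeded-MD5 hash family, and the signature length are assumptions.

```python
import hashlib

def _seeded_hash(seed, item):
    """One member of a hash family, selected by seed (MD5-based)."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.md5(data).digest()[:8], "big")

def minhash_signature(items, num_hashes=256):
    """Signature = minimum hash of a non-empty item set under each function."""
    return [min(_seeded_hash(seed, item) for item in items)
            for seed in range(num_hashes)]

def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions where the two minima agree."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)

def exact_jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

items_a = set(range(0, 100))
items_b = set(range(50, 150))
sig_a = minhash_signature(items_a)
sig_b = minhash_signature(items_b)
print(exact_jaccard(items_a, items_b), estimate_jaccard(sig_a, sig_b))
```

Comparing short signatures instead of full item sets is what makes nearest-neighbor search over many users or items tractable; longer signatures reduce the estimation error.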
42. Other Supported Features
Anomaly Detection
✓ Local Outlier Factor (LoF)
✓ ChangeFinder
Clustering / Topic models
✓ Online mini-batch LDA
✓ Online mini-batch PLSA
Change Point Detection
✓ ChangeFinder
✓ Singular Spectrum Transformation
43. Anomaly/Change-point Detection by ChangeFinder
Efficient algorithm for finding change points and outliers from time-series data
J. Takeuchi and K. Yamanishi, A Unifying Framework for Detecting Outliers and Change Points from Time Series, IEEE Transactions on Knowledge and Data Engineering, pp.482-492, 2006.
46. Anomaly/Change-point Detection by ChangeFinder
Efficient algorithm for finding change points and outliers from time-series data
J. Takeuchi and K. Yamanishi, A Unifying Framework for Detecting Outliers and Change Points from Time Series, IEEE Transactions on Knowledge and Data Engineering, pp.482-492, 2006.
47. Future work for v0.5.2 and later
✓ Word2Vec support
✓ Multi-class Logistic Regression
✓ Field-aware Factorization Machines
✓ SLIM recommendation
✓ More efficient XGBoost support
✓ LightGBM support
✓ DecisionTree prediction tracing
✓ Gradient Boosting
Related pull requests: PR#91, PR#116, PR#58, PR#111
58. Conclusion and Takeaway
Hivemall is a multi/cross-platform ML library providing a collection of machine learning algorithms as Hive UDFs/UDTFs
Digdag is a great workflow engine for managing complex ML pipelines
The first Apache release (v0.5.0) will appear soon!
We welcome your contributions to Apache Hivemall