Hivemall meets Digdag
Machine Learning Pipeline by SQL queries
Research Engineer, Treasure Data
Makoto YUI @myui
@ApacheHivemall
2018/2/17 HackerTackle
➢ 2015.04~ Research Engineer at Treasure Data, Inc.
• My mission is developing ML-as-a-Service in a Hadoop-as-a-service company
➢ 2010.04-2015.03 Senior Researcher at National Institute of Advanced Industrial Science and Technology, Japan.
• Developed Hivemall as a personal research project
➢ 2009.03 Ph.D. in Computer Science from NAIST
• Majored in Parallel Data Processing, not ML then
➢ Visiting scholar in CWI, Amsterdam and Univ. Edinburgh
About me …
let $succ := function($x) { $x+1 } return (for $i in (10,20,30) return $succ($i))
slideshare.net/myui/icde2010-nbgclock
About me …
Influenced by:
✓ OCaml (for/let, type inference)
✓ Lisp (every object is a sequence/atomization)
✓ XPath
We Open-source! TD invented ..
• Streaming log collector
• Bulk data import/export
• Efficient binary serialization
• Machine learning on Hadoop
• Workflow Engine
• Embedded version of Fluentd
Plan of the talk
1. Introduction to Hivemall
2. ML workflow using Digdag
Hivemall entered Apache Incubator
on Sept 13, 2016
Since then, we invited 3 contributors as new committers (a
committer has been voted as PPMC). Currently, we are working
toward the first Apache release (v0.5.0).
hivemall.incubator.apache.org
Industry use cases of Hivemall
➢ T-mobile.au
➢ Klout (influencer marketing)
bit.ly/klout-hivemall
bit.ly/2whJCQj
➢ Subaru
https://www.treasuredata.co.jp/customers/subaru/
Industry use cases of Hivemall
➢ CTR prediction of Ad click logs
• Freakout Inc., Fan Communications, and more
• Replaced Spark MLlib w/ Hivemall at company X
http://www.slideshare.net/masakazusano75/sano-hmm-20150512
Industry use cases of Hivemall
Minne (Japanese version of Etsy.com) uses Hivemall for item recommendation
https://speakerdeck.com/monochromegane/pepabo-minne-matrix-factorization-in-hivemall
Industry use cases of Hivemall
➢ Gender prediction of Ad click logs
• Scaleout Inc. and Fan Communications
http://eventdots.jp/eventreport/458208
Industry use cases of Hivemall
➢ Value prediction of real estate
• Livesense
http://www.slideshare.net/y-ken/real-estate-tech-with-hivemall
Industry use cases of Hivemall
➢ Churn Detection
• OISIX
http://www.slideshare.net/TaisukeFukawa/hivemall-meetup-vol2-oisix
Data used for prediction (Web, Mobile): user attributes, user action log, claim histories, referrers, services used
Find customers likely to churn using Hivemall
Countermeasures for customers likely to leave:
• Direct countermeasure: giving points, call to care
• Indirect countermeasure: guide to success, UI change
Feedback loop
What is Apache Hivemall
Scalable machine learning library built
as a collection of Hive UDFs
Ease-of-use
Scalable
Versatile
Multi/Cross platform
Hivemall is easy and scalable …
Ease-of-use: ML made easy for SQL developers
Scalable: born to be parallel and scalable
100+ lines
of code
CREATE TABLE lr_model AS
SELECT
feature, -- reducers perform model averaging in parallel
avg(weight) as weight
FROM (
SELECT logress(features,label,..) as (feature,weight)
FROM train
) t -- map-only task
GROUP BY feature; -- shuffled to reducers
This query automatically runs in parallel on Hadoop
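The data flow of the query above can be sketched in plain Python: each "mapper" trains a logistic regression model on its own shard, and the "reducers" average the weights per feature. This is only an illustration under assumptions: the slide does not show how logress works internally, so plain SGD is used here, and the names train_logress, average_models, predict and the toy shards are hypothetical.

```python
import math
from collections import defaultdict

def train_logress(rows, eta=0.1, epochs=20):
    """One 'mapper': plain SGD logistic regression over its own shard (assumed)."""
    w = defaultdict(float)
    for _ in range(epochs):
        for features, label in rows:
            z = sum(w[f] * v for f, v in features.items())
            p = 1.0 / (1.0 + math.exp(-z))
            for f, v in features.items():
                w[f] += eta * (label - p) * v
    return dict(w)

def average_models(models):
    """The 'reducers': GROUP BY feature, avg(weight)."""
    sums, counts = defaultdict(float), defaultdict(int)
    for model in models:
        for feature, weight in model.items():
            sums[feature] += weight
            counts[feature] += 1
    return {f: sums[f] / counts[f] for f in sums}

def predict(model, features):
    """Probability of the positive class under the averaged model."""
    z = sum(model.get(f, 0.0) * v for f, v in features.items())
    return 1.0 / (1.0 + math.exp(-z))

# Two toy shards standing in for the splits that separate map tasks would see.
shards = [
    [({"bias": 1.0, "x": 1.0}, 1), ({"bias": 1.0, "x": -1.0}, 0)],
    [({"bias": 1.0, "x": 2.0}, 1), ({"bias": 1.0, "x": -2.0}, 0)],
]
model = average_models([train_logress(shard) for shard in shards])
```

Because each shard is trained independently, the training step needs no coordination; only the small per-feature weight tables are shuffled for averaging.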
Hivemall is a multi/cross-platform ML library
HiveQL SparkSQL/Dataframe API Pig Latin
Hivemall is Multi/Cross platform ..
Multi/Cross
platform
Prediction models built by Hive can be used from Spark, and conversely, prediction models built by Spark can be used from Hive.
Hivemall’s Technology Stack
Machine Learning: Hivemall, MLlib
Query Processing: Hive, Pig, SparkSQL
Parallel Data Processing Framework: MapReduce (MRv1), Apache Tez (DAG processing), Apache Spark
Resource Management: Apache YARN, MESOS
Distributed File System: Hadoop HDFS
Cloud Storage: Amazon S3
Hivemall on Apache Hive
Hivemall on Apache Spark Dataframe
Hivemall on SparkSQL
Hivemall on Apache Pig
Online Prediction by Apache Streaming
Generic Classifier/Regressor
OLD Style New Style from v0.5.0
Generic Classifier/Regressor
Available Loss functions
For Binary Classification:
• HingeLoss
• LogLoss (synonym: logistic)
• SquaredHingeLoss
• ModifiedHuberLoss
For Regression:
• Squared Loss
• Quantile Loss
• Epsilon Insensitive Loss
• Squared Epsilon Insensitive Loss
• Huber Loss
Optimizer
• SGD
• AdaGrad
• AdaDelta
• ADAM
Regularization
• L1
• L2
• ElasticNet
• RDA
Other options
• Iteration support
• mini-batch
• Early stopping
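For readers unfamiliar with the loss functions named on this slide, here is a small Python sketch of their textbook definitions. It is not taken from Hivemall's source: scaling constants (such as the 0.5 factor) and default parameters (delta, eps, tau) are assumptions and may differ in the library. For the classification losses, y is a label in {-1, +1} and p is the raw model score.

```python
import math

# Binary classification losses: y in {-1, +1}, p is the raw score.
def hinge_loss(p, y):
    return max(0.0, 1.0 - y * p)

def squared_hinge_loss(p, y):
    return hinge_loss(p, y) ** 2

def log_loss(p, y):
    return math.log1p(math.exp(-y * p))

# Regression losses: y is the target value, p is the prediction.
def squared_loss(p, y):
    return 0.5 * (p - y) ** 2

def huber_loss(p, y, delta=1.0):
    r = abs(p - y)
    return 0.5 * r * r if r <= delta else delta * (r - 0.5 * delta)

def epsilon_insensitive_loss(p, y, eps=0.1):
    return max(0.0, abs(p - y) - eps)

def quantile_loss(p, y, tau=0.5):
    e = y - p
    return tau * e if e > 0 else (tau - 1.0) * e
```

Hinge-type losses stop penalizing examples once they are classified with enough margin, while Huber and epsilon-insensitive losses make regression less sensitive to outliers and small errors respectively.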
Versatile
Hivemall is a Versatile library ..
✓ Not only for Machine Learning
✓ Provides a bunch of generic utility functions
Each organization has its own set of UDFs for data preprocessing
Don’t Repeat Yourself!
Hivemall generic functions
Array and Map
Bit and compress
String and NLP
Geo Spatial
Top-k processing
> TF/IDF
> TILE
> MAP_URL
Brickhouse UDFs are merged in the v0.5.2 release.
We welcome contributions of your generic UDFs to Hivemall.
student class score
1 b 70
2 a 80
3 a 90
4 b 50
5 a 70
6 b 60
Top-k query processing
List top-2 students for each class
SELECT * FROM (
SELECT
*,
rank() over (partition by class order by score desc)
as rank
FROM table
) t
WHERE rank <= 2
The RANK() OVER query does not finish in 24 hours on a dataset of 20 million MOOC classes with an average of 1,000 students in each class.
student class score
1 b 70
2 a 80
3 a 90
4 b 50
5 a 70
6 b 60
Top-k query processing
List top-2 students for each class
SELECT
each_top_k(
2, class, score,
class, student
) as (rank, score, class, student)
FROM (
SELECT * FROM table
DISTRIBUTE BY class SORT BY class
) t
EACH_TOP_K finishes in 2 hours.
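The idea behind the each_top_k query can be sketched in Python using the student table from the slide. This is an illustration only, not Hivemall's implementation: it assumes that rows arrive clustered by class (the role of DISTRIBUTE BY / SORT BY) and that only the k best rows per group need to be kept, instead of ranking every row of the group.

```python
import heapq
from itertools import groupby
from operator import itemgetter

# (student, class, score) rows from the slide.
rows = [(1, "b", 70), (2, "a", 80), (3, "a", 90),
        (4, "b", 50), (5, "a", 70), (6, "b", 60)]

def each_top_k(k, rows):
    """Yield (rank, score, class, student) for the top-k scores of each class."""
    # Stand-in for DISTRIBUTE BY class SORT BY class: cluster rows by class.
    clustered = sorted(rows, key=itemgetter(1))
    for cls, group in groupby(clustered, key=itemgetter(1)):
        # nlargest keeps at most k candidates in memory per group.
        top = heapq.nlargest(k, group, key=itemgetter(2))
        for rank, (student, _, score) in enumerate(top, start=1):
            yield (rank, score, cls, student)

result = list(each_top_k(2, rows))
```

For the six rows above, result contains students 3 and 2 for class a and students 1 and 6 for class b.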
Map tiling functions
Tile(lat,lon,zoom)
= xtile(lon,zoom) + ytile(lat,zoom) * 2^n
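A worked Python sketch of the tile formula above. Two assumptions are made: the exponent n is taken to be the zoom level, and xtile/ytile are assumed to follow the standard Web Mercator ("slippy map") tile formulas used by OpenStreetMap. The slide does not spell out either point, so treat this as an illustration rather than Hivemall's exact behavior.

```python
import math

def xtile(lon, zoom):
    """Column index of the tile containing the longitude (assumed slippy-map formula)."""
    n = 1 << zoom
    x = int(math.floor((lon + 180.0) / 360.0 * n))
    return min(max(x, 0), n - 1)  # clamp so lon=180 stays in range

def ytile(lat, zoom):
    """Row index of the tile containing the latitude (assumed slippy-map formula)."""
    n = 1 << zoom
    lat_rad = math.radians(lat)
    y = int(math.floor((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n))
    return min(max(y, 0), n - 1)

def tile(lat, lon, zoom):
    """Single integer tile id: xtile + ytile * 2^zoom (n assumed to equal zoom)."""
    return xtile(lon, zoom) + ytile(lat, zoom) * (1 << zoom)

# Example: a point in central Tokyo at zoom level 10.
tokyo_tile = tile(35.6895, 139.6917, 10)
```

Nearby points share a tile id at a given zoom level, which makes the id usable as a coarse location feature or a GROUP BY key.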
Map tiling functions
Zoom=10
Zoom=15
More useful functions (Sketch, NLP)
SELECT count(distinct id) FROM data
SELECT approx_count_distinct(id) FROM data
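To illustrate why a sketch-based distinct count is cheaper than an exact one, here is a small Python example of one simple technique, the k-minimum-values (KMV) estimator. The slide does not say which algorithm approx_count_distinct uses, so KMV is a swapped-in stand-in chosen for brevity; the function below only shares the name and is hypothetical.

```python
import hashlib
import heapq

def _hash01(item):
    """Map an item to a pseudo-uniform value in [0, 1) using SHA-1."""
    digest = hashlib.sha1(str(item).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2 ** 64

def approx_count_distinct(items, k=1024):
    """KMV estimate: keep the k smallest distinct hashes, return (k - 1) / kth smallest."""
    heap = []        # max-heap (values negated) of the k smallest hashes seen
    members = set()  # hashes currently stored in the heap
    for item in items:
        h = _hash01(item)
        if h in members:
            continue
        if len(heap) < k:
            heapq.heappush(heap, -h)
            members.add(h)
        elif h < -heap[0]:
            evicted = -heapq.heappushpop(heap, -h)
            members.discard(evicted)
            members.add(h)
    if len(heap) < k:
        return len(heap)  # fewer than k distinct values seen: count is exact
    return int(round((k - 1) / -heap[0]))

ids = [i % 20000 for i in range(60000)]  # 20,000 distinct ids, each seen 3 times
estimate = approx_count_distinct(ids)
```

Memory use is bounded by k regardless of how many distinct ids exist, at the cost of a relative error of roughly 1/sqrt(k).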
select tokenize_ja("…", "normal", null, null, "https://s3.amazonaws.com/td-hivemall/dist/kuromoji-user-dict-neologd.csv.gz");
[…]
List of Supported Algorithms
Classification
✓ Perceptron
✓ Passive Aggressive (PA, PA1, PA2)
✓ Confidence Weighted (CW)
✓ Adaptive Regularization of Weight Vectors (AROW)
✓ Soft Confidence Weighted (SCW)
✓ AdaGrad+RDA
✓ Factorization Machines
✓ RandomForest Classification
Regression
✓Logistic Regression (SGD)
✓AdaGrad (logistic loss)
✓AdaDELTA (logistic loss)
✓PA Regression
✓AROW Regression
✓Factorization Machines
✓RandomForest Regression
SCW is a good first choice
Try RandomForest if SCW does not work
Logistic regression is good for getting the probability of a positive class
Factorization Machines work well where features are sparse and categorical
RandomForest in Hivemall
Ensemble of Decision Trees
Training of RandomForest
Sparse vector input (LIBSVM format) is supported since v0.5.0, in addition to dense vector input.
Prediction of RandomForest
Decision Tree Visualization
XGBoost support in Hivemall (beta version)
SELECT train_xgboost_classifier(features, label) as (model_id, model)
FROM training_data;
SELECT rowid, AVG(predicted) as predicted
FROM (
  -- predict with each model
  SELECT xgboost_predict(rowid, features, model_id, model) AS (rowid, predicted)
  -- join each test record with each model
  FROM xgboost_models CROSS JOIN test_data_with_id
) t
GROUP BY rowid;
Supported Algorithms for Recommendation
K-Nearest Neighbor
✓ Minhash and b-Bit Minhash
(LSH variant)
✓ Similarity Search on Vector Space
(Euclid/Cosine/Jaccard/Angular)
✓ DIMSUM (Cosine similarity)
Matrix Completion
✓ Matrix Factorization
✓ Factorization Machines
(regression)
The each_top_k function of Hivemall is useful for recommending top-k items
Other Supported Algorithms
Feature Engineering
✓Feature Hashing
✓Feature Scaling
(normalization, z-score)
✓ Feature Binning
✓ TF-IDF vectorizer
✓ Polynomial Expansion
✓ Amplifier
NLP
✓Basic English text tokenizer
✓Japanese Tokenizer
Evaluation metrics
✓AUC, nDCG, logloss, precision/recall@K, etc.
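As a concrete example of the feature scaling entries in this list, min-max normalization and z-score standardization can be sketched in Python as follows. The function names rescale and zscore and the handling of degenerate ranges are assumptions made for this illustration; the scores are the ones from the top-k example table.

```python
import statistics

def rescale(value, min_value, max_value):
    """Min-max normalization into [0, 1]; a constant column maps to 0.5 (assumed)."""
    if max_value == min_value:
        return 0.5
    return (value - min_value) / (max_value - min_value)

def zscore(value, mean, stddev):
    """Standardization: distance from the mean in units of standard deviation."""
    if stddev == 0:
        return 0.0
    return (value - mean) / stddev

scores = [70, 80, 90, 50, 70, 60]
lo, hi = min(scores), max(scores)
mu, sigma = statistics.mean(scores), statistics.pstdev(scores)

normalized = [rescale(s, lo, hi) for s in scores]
standardized = [zscore(s, mu, sigma) for s in scores]
```

In a SQL pipeline the statistics (min, max, mean, stddev) would typically be computed once over the training table and then joined back to each row.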
Evaluation Metrics
Other Supported Features
Anomaly Detection
✓Local Outlier Factor (LoF)
✓ChangeFinder
Clustering / Topic models
✓Online mini-batch LDA
✓Online mini-batch PLSA
Change Point Detection
✓ChangeFinder
✓Singular Spectrum Transformation
Anomaly/Change-point Detection by ChangeFinder
Efficient algorithm for finding change points and outliers in time-series data
J. Takeuchi and K. Yamanishi, "A Unifying Framework for Detecting Outliers and Change Points from Time Series," IEEE Transactions on Knowledge and Data Engineering, pp. 482-492, 2006.
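The two-stage idea behind this kind of detector (score each point against an online model, smooth the scores, then score the smoothed series again) can be sketched in Python. This is a deliberately simplified stand-in, not ChangeFinder itself: the SDAR autoregressive model of the cited paper is replaced by an exponentially discounted Gaussian, and all names and parameters (DiscountedGaussian, r, window) are hypothetical.

```python
import math

class DiscountedGaussian:
    """Online Gaussian whose mean and variance forget the past at rate r."""
    def __init__(self, r):
        self.r, self.mean, self.var = r, 0.0, 1.0

    def score_then_update(self, x):
        # Score = negative log-likelihood under the model before seeing x.
        var = max(self.var, 1e-6)
        score = 0.5 * math.log(2 * math.pi * var) + (x - self.mean) ** 2 / (2 * var)
        self.mean += self.r * (x - self.mean)
        self.var += self.r * ((x - self.mean) ** 2 - self.var)
        return score

def trailing_mean(values, window):
    """Average of the last `window` values at each position."""
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out

def change_scores(series, r=0.05, window=5):
    stage1 = DiscountedGaussian(r)
    outlier = [stage1.score_then_update(x) for x in series]   # per-point outlier score
    smoothed = trailing_mean(outlier, window)
    stage2 = DiscountedGaussian(r)
    change = trailing_mean([stage2.score_then_update(s) for s in smoothed], window)
    return outlier, change

# Synthetic series with a level shift at index 100.
series = [0.1 * math.sin(i) for i in range(100)] + \
         [5.0 + 0.1 * math.sin(i) for i in range(100, 200)]
outlier, change = change_scores(series)
outlier_peak = max(range(len(outlier)), key=outlier.__getitem__)
change_peak = max(range(len(change)), key=change.__getitem__)
```

On this synthetic series the outlier score peaks exactly at the shift, and the change-point score peaks within a few steps after it because of the smoothing windows.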
Take this…
…and do this!
Future work for v0.5.2 and later
✓ Word2Vec support
✓ Multi-class Logistic Regression
✓ Field-aware Factorization Machines
✓ SLIM recommendation
✓ More efficient XGBoost support
✓ LightGBM support
✓ DecisionTree prediction tracing
✓ Gradient Boosting
PR#91
PR#116
PR#58
PR#111
ML workflows are often really complex…
Real-world ML pipelines (could be more complex)
Datasource #1, Datasource #2, Datasource #3
Join
Extract Feature
Feature Engineering: Feature Scaling, Feature Hashing, Feature Selection
Train by Logistic Regression, Train by RandomForest, Train by Factorization Machines
Ensemble
Evaluate
Predict
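A pipeline like this is a dependency graph: some steps must run in sequence, while others (the three training steps) can run in parallel. The Python sketch below derives the execution stages from such a graph with the standard-library graphlib module (Python 3.9+). It does not use Digdag and is not its API; the task names are taken from the slide, and the exact edges between them are assumptions because the original diagram is not recoverable.

```python
from graphlib import TopologicalSorter

# task -> set of tasks it depends on (edges are assumed for illustration)
deps = {
    "join": {"datasource_1", "datasource_2", "datasource_3"},
    "extract_feature": {"join"},
    "feature_engineering": {"extract_feature"},
    "train_logistic_regression": {"feature_engineering"},
    "train_randomforest": {"feature_engineering"},
    "train_factorization_machines": {"feature_engineering"},
    "ensemble": {
        "train_logistic_regression",
        "train_randomforest",
        "train_factorization_machines",
    },
    "evaluate": {"ensemble"},
    "predict": {"evaluate"},
}

def execution_stages(deps):
    """Group tasks into stages; tasks within one stage can run in parallel."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    stages = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        stages.append(ready)
        sorter.done(*ready)
    return stages

stages = execution_stages(deps)
```

A workflow engine adds what this sketch leaves out: retries, scheduling, notifications, and secrets handling around each task.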
Hivemall Digdag
Technology Trends for 2017
https://www.thoughtworks.com/radar
Why Digdag?
➢ Manage workflows as code (simple YAML syntax)
➢ REST API endpoints
➢ Parallel/Sequential execution
➢ SLA, error notification
➢ Secrets management
➢ Docker support
➢ TD, EMR, BigQuery/Slack operators
➢ Embedded JavaScript engine
Programmer friendly, revision management
Plugin scheme for defining custom operators
Digdag features
SLA and error handling
Nestable, Parallel/Sequential Execution
Embedded JavaScript engine
Machine Learning Workflow using Digdag
Use case: CTR/CVR prediction
Workflow execution timeline
DEMO
Conclusion and Takeaway
Hivemall is a multi/cross-platform ML library
providing a collection of machine learning algorithms as Hive UDFs/UDTFs
The first Apache release (v0.5.0) will appear soon!
We welcome your contributions to Apache Hivemall
Digdag is a great workflow engine for managing complex ML pipelines
Any feature requests or questions?

Hivemall meets Digdag @Hackertackle 2018-02-17

  • 1. Hivemall meets Digdag: Machine Learning Pipeline by SQL queries. Makoto YUI (@myui), Research Engineer, Treasure Data. @ApacheHivemall. 2018/2/17, HackerTackle.
  • 2. About me: 2015.04~ Research Engineer at Treasure Data, Inc. (my mission is developing ML-as-a-Service in a Hadoop-as-a-service company). 2010.04-2015.03 Senior Researcher at National Institute of Advanced Industrial Science and Technology, Japan (developed Hivemall as a personal research project). 2009.03 Ph.D. in Computer Science from NAIST (majored in Parallel Data Processing, not ML then). Visiting scholar at CWI, Amsterdam and Univ. Edinburgh. let $succ := function($x) { $x+1 } return (for $i in (10,20,30) return $succ($i)) slideshare.net/myui/icde2010-nbgclock
  • 3. About me: influenced by OCaml (for/let, type inference), Lisp (every object is a sequence/atomization), and XPath.
  • 4. We Open-source! TD invented: streaming log collector; bulk data import/export; efficient binary serialization; machine learning on Hadoop; workflow engine; embedded version of Fluentd.
  • 5. Plan of the talk: 1. Introduction to Hivemall 2. ML workflow using Digdag
  • 6. Hivemall entered Apache Incubator on Sept 13, 2016. Since then, we invited 3 contributors as new committers (a committer has been voted as PPMC). Currently, we are working toward the first Apache release (v0.5.0). hivemall.incubator.apache.org
  • 8. Industry use cases of Hivemall: T-mobile.au; Klout (influencer marketing) bit.ly/klout-hivemall bit.ly/2whJCQj; Subaru https://www.treasuredata.co.jp/customers/subaru/
  • 9. Industry use cases of Hivemall: CTR prediction of Ad click logs. Freakout Inc., Fan Communications, and more. Replaced Spark MLlib w/ Hivemall at company X. http://www.slideshare.net/masakazusano75/sano-hmm-20150512
  • 10. Industry use cases of Hivemall: Minne (Japanese version of Etsy.com) uses Hivemall for item recommendation. https://speakerdeck.com/monochromegane/pepabo-minne-matrix-factorization-in-hivemall
  • 11. Industry use cases of Hivemall: Gender prediction of Ad click logs. Scaleout Inc. and Fan Communications. http://eventdots.jp/eventreport/458208
  • 12. Industry use cases of Hivemall: Value prediction of real estate. Livesense. http://www.slideshare.net/y-ken/real-estate-tech-with-hivemall
  • 13. Industry use cases of Hivemall: Churn Detection. OISIX. http://www.slideshare.net/TaisukeFukawa/hivemall-meetup-vol2-oisix [Diagram. Data used for prediction: Web and mobile user attributes, user action log, claim histories, referrers, services used. Find customers likely to churn using Hivemall. Customers likely to leave receive direct and indirect countermeasures: giving points, call to care, guide to success, UI change. Feedback loop.]
  • 14. What is Apache Hivemall: Scalable machine learning library built as a collection of Hive UDFs. Multi/Cross platform, Versatile, Scalable, Ease-of-use.
  • 15. Hivemall is easy and scalable. Ease-of-use: ML made easy for SQL developers. Scalable: born to be parallel and scalable. 100+ lines of code. CREATE TABLE lr_model AS SELECT feature, avg(weight) as weight FROM (SELECT logress(features,label,..) as (feature,weight) FROM train) t GROUP BY feature; The inner SELECT is a map-only task; GROUP BY feature shuffles its output to reducers, which perform model averaging in parallel via avg(weight). This query automatically runs in parallel on Hadoop.
  • 16. Hivemall is a multi/cross-platform ML library: HiveQL, SparkSQL/Dataframe API, Pig Latin. Prediction models built by Hive can be used from Spark, and conversely, prediction models built by Spark can be used from Hive.
  • 17. Hivemall's Technology Stack. Machine Learning: Hivemall, MLlib. Query Processing: Hive, Pig, SparkSQL. Parallel Data Processing Framework: MapReduce (MRv1), Apache Tez (DAG processing), Apache Spark. Resource Management: Apache YARN, MESOS. Distributed File System / Cloud Storage: Hadoop HDFS, Amazon S3.
  • 18. Hivemall on Apache Hive
  • 19. Hivemall on Apache Spark Dataframe
  • 21. Hivemall on Apache Pig
  • 22. Online Prediction by Apache Streaming
  • 23. Generic Classifier/Regressor: old style; new style from v0.5.0.
  • 24. Generic Classifier/Regressor. Available loss functions for binary classification: HingeLoss, LogLoss (synonym: logistic), SquaredHingeLoss, ModifiedHuberLoss. Available loss functions for regression: Squared Loss, Quantile Loss, Epsilon Insensitive Loss, Squared Epsilon Insensitive Loss, Huber Loss. Optimizer: SGD, AdaGrad, AdaDelta, ADAM. Regularization: L1, L2, ElasticNet, RDA. Other options: iteration support, mini-batch, early stopping.
  • 25. Hivemall is a versatile library: not only for machine learning, it also provides a bunch of generic utility functions. Each organization has its own set of UDFs for data preprocessing. Don't Repeat Yourself!
  • 26. Hivemall generic functions: Array and Map; Bit and compress; String and NLP; Geo Spatial; Top-k processing. Examples: TF/IDF, TILE, MAP_URL. Brickhouse UDFs are merged in v0.5.2 release. We welcome contributing your generic UDFs to Hivemall.
  • 27. Top-k query processing: list the top-2 students for each class. Table (student, class, score): (1, b, 70), (2, a, 80), (3, a, 90), (4, b, 50), (5, a, 70), (6, b, 60). SELECT * FROM ( SELECT *, rank() over (partition by class order by score desc) as rank FROM table ) t WHERE rank <= 2 The RANK() OVER query does not finish in 24 hours :-( when there are 20 million MOOC classes with an average of 1,000 students in each class.
  • 28. Top-k query processing: list the top-2 students for each class. Table (student, class, score): (1, b, 70), (2, a, 80), (3, a, 90), (4, b, 50), (5, a, 70), (6, b, 60). SELECT each_top_k( 2, class, score, class, student ) as (rank, score, class, student) FROM ( SELECT * FROM table DISTRIBUTE BY class SORT BY class ) t EACH_TOP_K finishes in 2 hours :-)
  • 30. Map tiling functions: Tile(lat,lon,zoom) = xtile(lon,zoom) + ytile(lat,zoom) * 2^n. Examples shown at Zoom=10 and Zoom=15.
  • 31. More useful functions (Sketch, NLP). Sketch: SELECT count(distinct id) FROM data can be replaced by SELECT approx_count_distinct(id) FROM data. NLP: select tokenize_ja("[Japanese text lost in extraction]", "normal", null, null, "https://s3.amazonaws.com/td-hivemall/dist/kuromoji-user-dict-neologd.csv.gz"); returns an array of tokens [Japanese output lost in extraction].
  • 32. List of Supported Algorithms. Classification: Perceptron; Passive Aggressive (PA, PA1, PA2); Confidence Weighted (CW); Adaptive Regularization of Weight Vectors (AROW); Soft Confidence Weighted (SCW); AdaGrad+RDA; Factorization Machines; RandomForest Classification. Regression: Logistic Regression (SGD); AdaGrad (logistic loss); AdaDELTA (logistic loss); PA Regression; AROW Regression; Factorization Machines; RandomForest Regression. SCW is a good first choice. Try RandomForest if SCW does not work. Logistic regression is good for getting the probability of the positive class. Factorization Machines are good where features are sparse and categorical.
  • 33. RandomForest in Hivemall: Ensemble of Decision Trees
  • 34. Training of RandomForest: sparse vector input (Libsvm format) is supported since v0.5.0, in addition to dense vectors!
  • 38. XGBoost support in Hivemall (beta version). Training: SELECT train_xgboost_classifier(features, label) as (model_id, model) FROM training_data Prediction: SELECT rowid, AVG(predicted) as predicted FROM ( SELECT xgboost_predict(rowid, features, model_id, model) AS (rowid, predicted) FROM xgboost_models CROSS JOIN test_data_with_id ) t GROUP BY rowid; The CROSS JOIN joins each test record with each model, the inner query predicts with each model, and the outer query averages the predictions per rowid.
  • 39. Supported Algorithms for Recommendation. K-Nearest Neighbor: Minhash and b-Bit Minhash (LSH variant); Similarity Search on Vector Space (Euclid/Cosine/Jaccard/Angular); DIMSUM (Cosine similarity). Matrix Completion: Matrix Factorization; Factorization Machines (regression). The each_top_k function of Hivemall is useful for recommending top-k items.
  • 40. Other Supported Algorithms. Feature Engineering: Feature Hashing; Feature Scaling (normalization, z-score); Feature Binning; TF-IDF vectorizer; Polynomial Expansion; Amplifier. NLP: Basic English text Tokenizer; Japanese Tokenizer. Evaluation metrics: AUC, nDCG, logloss, precision, recall@K, etc.
  • 42. Other Supported Features. Anomaly Detection: Local Outlier Factor (LoF); ChangeFinder. Clustering / Topic models: Online mini-batch LDA; Online mini-batch PLSA. Change Point Detection: ChangeFinder; Singular Spectrum Transformation.
  • 43. Anomaly/Change-point Detection by ChangeFinder: an efficient algorithm for finding change points and outliers in time-series data. J. Takeuchi and K. Yamanishi, A Unifying Framework for Detecting Outliers and Change Points from Time Series, IEEE Transactions on Knowledge and Data Engineering, pp.482-492, 2006.
  • 44. Anomaly/Change-point Detection by ChangeFinder: Take this…
  • 45. Anomaly/Change-point Detection by ChangeFinder: …and do this!
  • 46. Anomaly/Change-point Detection by ChangeFinder: an efficient algorithm for finding change points and outliers in time-series data. J. Takeuchi and K. Yamanishi, A Unifying Framework for Detecting Outliers and Change Points from Time Series, IEEE Transactions on Knowledge and Data Engineering, pp.482-492, 2006.
  • 47. Future work for v0.5.2 and later: Word2Vec support; Multi-class Logistic Regression; Field-aware Factorization Machines; SLIM recommendation; More efficient XGBoost support; LightGBM support; DecisionTree prediction tracing; Gradient Boosting. PR#91 PR#116 PR#58 PR#111
  • 48. ML workflows are often really complex…
  • 49. Real-world ML pipelines (could be more complex). [Diagram stages: Datasource #1, Datasource #2, Datasource #3; Join; Extract Feature; Feature Engineering (Feature Scaling, Feature Hashing); Feature Selection; Train by Logistic Regression, Train by RandomForest, Train by Factorization Machines; Ensemble; Evaluate; Predict.]
  • 51. Technology Trends for 2017. https://www.thoughtworks.com/radar
  • 52. Why Digdag? Manage workflows by code (simple YAML syntax); REST API endpoints; Parallel/Sequential execution; SLA, error notification; Secrets management; Docker support; TD, EMR, Bigquery/Slack operators; Embedded Javascript engine. Programmer friendly, revision management. Plugin scheme for defining custom operators.
  • 53. Digdag features: SLA and error handling; nestable, parallel/sequential execution; embedded Javascript engine.
  • 56. Use case: CTR/CVR prediction
  • 57. Workflow execution timeline. DEMO
  • 58. Conclusion and Takeaway: Hivemall is a multi/cross-platform ML library providing a collection of machine learning algorithms as Hive UDFs/UDTFs. The first Apache release (v0.5.0) will appear soon! We welcome your contributions to Apache Hivemall :-) Digdag is a great workflow engine for managing complex ML pipelines.
  • 59. Any feature requests or questions?
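
The map tiling formula on slide 30, Tile(lat,lon,zoom) = xtile(lon,zoom) + ytile(lat,zoom) * 2^n, can be illustrated with a small Python sketch. The slide does not define xtile, ytile, or n, so this sketch assumes the standard Web Mercator ("slippy map") tile definitions and takes n to be the zoom level. The function names mirror the slide, but this is not Hivemall's implementation, and the latitude limit is an assumption of the sketch.

```python
import math

# Assumption: Web Mercator tiles are only defined up to about +/-85.0511 degrees.
MAX_LAT = 85.0511


def xtile(lon: float, zoom: int) -> int:
    """Column index of the tile containing the longitude (assumed slippy-map formula)."""
    n = 1 << zoom  # number of tiles per row/column at this zoom level
    x = int(math.floor((lon + 180.0) / 360.0 * n))
    return min(max(x, 0), n - 1)  # clamp so lon = 180 stays in range


def ytile(lat: float, zoom: int) -> int:
    """Row index of the tile containing the latitude (assumed slippy-map formula)."""
    if abs(lat) > MAX_LAT:
        raise ValueError("latitude outside the Web Mercator range")
    n = 1 << zoom
    lat_rad = math.radians(lat)
    merc = math.log(math.tan(lat_rad) + 1.0 / math.cos(lat_rad))
    y = int(math.floor((1.0 - merc / math.pi) / 2.0 * n))
    return min(max(y, 0), n - 1)


def tile(lat: float, lon: float, zoom: int) -> int:
    """Single integer tile id: xtile + ytile * 2^zoom, as on the slide with n = zoom."""
    return xtile(lon, zoom) + ytile(lat, zoom) * (1 << zoom)


if __name__ == "__main__":
    # Nearby points share a tile id at a coarse zoom and separate at a finer one.
    for zoom in (10, 15):
        print(zoom, tile(35.6895, 139.6917, zoom), tile(35.6586, 139.7454, zoom))
```

Multiplying the row index by 2^zoom makes the id unique within a zoom level, which is what lets a tile id serve as a grouping key for geo-spatial aggregation.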