Hivemall meets Digdag @Hackertackle 2018-02-17

Data & Analytics
  1. 1. Hivemall meets Digdag: Machine Learning Pipeline by SQL queries. Makoto YUI (@myui, @ApacheHivemall), Research Engineer, Treasure Data. 2018/2/17 HackerTackle
  2. 2. About me … • 2015.04~ Research Engineer at Treasure Data, Inc. (my mission is developing ML-as-a-Service in a Hadoop-as-a-service company) • 2010.04-2015.03 Senior Researcher at National Institute of Advanced Industrial Science and Technology, Japan (developed Hivemall as a personal research project) • 2009.03 Ph.D. in Computer Science from NAIST (majored in parallel data processing, not ML, at the time) • Visiting scholar at CWI, Amsterdam and Univ. Edinburgh. let $succ := function($x) { $x+1 } return (for $i in (10,20,30) return $succ($i)) slideshare.net/myui/icde2010-nbgclock
  3. 3. About me … influenced by: ✓ OCaml (for/let, type inference) ✓ Lisp (every object is a sequence/atomization) ✓ XPath
  4. 4. We Open-source! TD invented: streaming log collector; bulk data import/export; efficient binary serialization; machine learning on Hadoop; workflow engine; embedded version of Fluentd
  5. 5. Plan of the talk: 1. Introduction to Hivemall 2. ML workflow using Digdag
  6. 6. Hivemall entered the Apache Incubator on Sept 13, 2016. Since then, we have invited 3 contributors as new committers (one committer has been voted in as a PPMC member). Currently, we are working toward the first Apache release (v0.5.0). hivemall.incubator.apache.org
  8. 8. Industry use cases of Hivemall • T-mobile.au • Klout (influencer marketing) bit.ly/klout-hivemall bit.ly/2whJCQj • Subaru https://www.treasuredata.co.jp/customers/subaru/
  9. 9. Industry use cases of Hivemall • CTR prediction of ad click logs: Freakout Inc., Fan Communications, and more; replaced Spark MLlib with Hivemall at company X http://www.slideshare.net/masakazusano75/sano-hmm-20150512
  10. 10. Industry use cases of Hivemall • Minne (a Japanese counterpart of Etsy.com) uses Hivemall for item recommendation https://speakerdeck.com/monochromegane/pepabo-minne-matrix-factorization-in-hivemall
  11. 11. Industry use cases of Hivemall • Gender prediction of ad click logs: Scaleout Inc. and Fan Communications http://eventdots.jp/eventreport/458208
  12. 12. Industry use cases of Hivemall • Value prediction of real estate: Livesense http://www.slideshare.net/y-ken/real-estate-tech-with-hivemall
  13. 13. Industry use cases of Hivemall • Churn detection: OISIX http://www.slideshare.net/TaisukeFukawa/hivemall-meetup-vol2-oisix [Diagram: data used for prediction (web, mobile; user attributes, user action log, claim histories, referrers, services used) is used to find customers likely to churn using Hivemall; customers likely to leave receive direct and indirect countermeasures (giving points, call to care, guide to success, UI change), which closes a feedback loop.]
  14. 14. What is Apache Hivemall? A scalable machine learning library built as a collection of Hive UDFs. Key properties: multi/cross-platform, versatile, scalable, ease of use
  15. 15. Hivemall is easy and scalable … Ease-of-use: ML made easy for SQL developers. Scalable: born to be parallel and scalable. 100+ lines of code
      CREATE TABLE lr_model AS
      SELECT
        feature, -- reducers perform model averaging in parallel
        avg(weight) as weight
      FROM (
        SELECT logress(features,label,..) as (feature,weight)
        FROM train
      ) t -- map-only task
      GROUP BY feature; -- shuffled to reducers
      This query automatically runs in parallel on Hadoop
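The query on this slide trains partial models in map tasks and averages the weights per feature in reducers. A minimal Python sketch of that idea follows, assuming plain SGD logistic regression on each partition. The names `train_partition` and `average_models`, the learning rate, and the toy data are hypothetical; the slides do not show how `logress` is implemented internally.

```python
import math
from collections import defaultdict

def train_partition(rows, eta=0.1, epochs=5):
    """Map side (assumed): fit one logistic-regression model on a single partition with SGD."""
    weights = defaultdict(float)
    for _ in range(epochs):
        for features, label in rows:  # features: {name: value}, label: 0 or 1
            z = sum(weights[f] * v for f, v in features.items())
            predicted = 1.0 / (1.0 + math.exp(-z))
            for f, v in features.items():
                weights[f] += eta * (label - predicted) * v
    return dict(weights)

def average_models(models):
    """Reduce side: the equivalent of GROUP BY feature with avg(weight)."""
    sums, counts = defaultdict(float), defaultdict(int)
    for model in models:
        for f, w in model.items():
            sums[f] += w
            counts[f] += 1
    return {f: sums[f] / counts[f] for f in sums}

# Two toy partitions standing in for two map tasks.
partitions = [
    [({"x": 1.0}, 1), ({"x": -1.0}, 0)],
    [({"x": 2.0}, 1), ({"x": -2.0}, 0)],
]
model = average_models(train_partition(p) for p in partitions)
```

Because each partition is trained independently, the map side needs no coordination; only the small (feature, weight) pairs are shuffled to the reducers.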
  16. 16. Hivemall is a multi/cross-platform ML library: HiveQL, SparkSQL/DataFrame API, Pig Latin. Prediction models built by Hive can be used from Spark, and conversely, prediction models built by Spark can be used from Hive
  17. 17. Hivemall’s Technology Stack [Diagram layers: Machine Learning: Hivemall, MLlib • Query Processing: Hive, Pig, SparkSQL • Parallel Data Processing Framework: MapReduce (MRv1), Apache Tez (DAG processing), Apache Spark • Resource Management: Apache YARN, MESOS • Distributed File System / Cloud Storage: Hadoop HDFS, Amazon S3]
  18. 18. Hivemall on Apache Hive
  19. 19. Hivemall on Apache Spark Dataframe
  20. 20. Hivemall on SparkSQL
  21. 21. Hivemall on Apache Pig
  22. 22. Online Prediction by Apache Streaming
  23. 23. Generic Classifier/Regressor: old style vs. new style from v0.5.0
  24. 24. Generic Classifier/Regressor. Available loss functions. For binary classification: • HingeLoss • LogLoss (synonym: logistic) • SquaredHingeLoss • ModifiedHuberLoss. For regression: • Squared Loss • Quantile Loss • Epsilon Insensitive Loss • Squared Epsilon Insensitive Loss • Huber Loss. Optimizer: • SGD • AdaGrad • AdaDelta • ADAM. Regularization: • L1 • L2 • ElasticNet • RDA. Other options: • Iteration support • mini-batch • Early stopping
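For readers unfamiliar with the loss functions named on this slide, here is a short Python sketch of several of them using common textbook definitions. These are assumptions for illustration: the slides do not give formulas, and Hivemall's exact scaling constants and default parameters (such as `epsilon` and `delta` below) may differ.

```python
import math

# Binary classification losses; y is assumed to be -1 or +1, score is the raw model output.
def hinge_loss(y, score):
    return max(0.0, 1.0 - y * score)

def squared_hinge_loss(y, score):
    return hinge_loss(y, score) ** 2

def log_loss(y, score):
    # Logistic loss: log(1 + exp(-y * score))
    return math.log1p(math.exp(-y * score))

# Regression losses.
def squared_loss(target, predicted):
    return 0.5 * (target - predicted) ** 2

def epsilon_insensitive_loss(target, predicted, epsilon=0.1):
    # Errors smaller than epsilon are ignored.
    return max(0.0, abs(target - predicted) - epsilon)

def huber_loss(target, predicted, delta=1.0):
    # Quadratic near zero, linear for residuals larger than delta.
    r = abs(target - predicted)
    return 0.5 * r * r if r <= delta else delta * (r - 0.5 * delta)
```

The hinge family penalizes only examples inside the margin, while Huber and epsilon-insensitive losses reduce the influence of outliers compared with squared loss.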
  25. 25. Hivemall is a versatile library: ✓ not only for machine learning ✓ provides a bunch of generic utility functions. Each organization has its own set of UDFs for data preprocessing. Don’t Repeat Yourself!
  26. 26. Hivemall generic functions: Array and Map • Bit and compress • String and NLP • Geo Spatial • Top-k processing. Examples: TF/IDF, TILE, MAP_URL. Brickhouse UDFs are merged in the v0.5.2 release. We welcome contributions of your generic UDFs to Hivemall
  27. 27. Top-k query processing: list the top-2 students for each class
      student  class  score
      1        b      70
      2        a      80
      3        a      90
      4        b      50
      5        a      70
      6        b      60
      SELECT * FROM ( SELECT *, rank() over (partition by class order by score desc) as rank FROM table ) t WHERE rank <= 2
      The RANK() OVER query does not finish within 24 hours :( on a dataset with 20 million MOOC classes and an average of 1,000 students per class
  28. 28. Top-k query processing: list the top-2 students for each class (same student/class/score table as on the previous slide)
      SELECT each_top_k( 2, class, score, class, student ) as (rank, score, class, student) FROM ( SELECT * FROM table DISTRIBUTE BY class SORT BY class ) t
      EACH_TOP_K finishes in 2 hours :)
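One plausible reason the each_top_k query scales better is that it only needs a bounded amount of state per group once rows arrive clustered by class, instead of fully sorting every partition. The following Python sketch illustrates that streaming pattern under this assumption; `each_top_k_sketch` is a hypothetical name, not Hivemall's implementation, and its tie handling differs from SQL `rank()`.

```python
import heapq
from itertools import groupby
from operator import itemgetter

def each_top_k_sketch(rows, k):
    """rows: (student, class, score) tuples that are already clustered by class."""
    for cls, group in groupby(rows, key=itemgetter(1)):
        heap = []  # min-heap holding at most k (score, student) pairs
        for student, _, score in group:
            if len(heap) < k:
                heapq.heappush(heap, (score, student))
            elif (score, student) > heap[0]:
                heapq.heapreplace(heap, (score, student))
        ranked = sorted(heap, reverse=True)
        for rank, (score, student) in enumerate(ranked, start=1):
            yield rank, score, cls, student

table = [(1, "b", 70), (2, "a", 80), (3, "a", 90), (4, "b", 50), (5, "a", 70), (6, "b", 60)]
clustered = sorted(table, key=itemgetter(1))  # stands in for DISTRIBUTE BY class SORT BY class
result = list(each_top_k_sketch(clustered, 2))
```

Memory per group stays at k entries regardless of how many students a class has, which matters when there are millions of groups.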
  29. 29. Map tiling functions
  30. 30. Map tiling functions: Tile(lat,lon,zoom) = xtile(lon,zoom) + ytile(lat,zoom) * 2^zoom (examples shown at Zoom=10 and Zoom=15)
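The tile formula combines an x and a y tile index into a single number. The sketch below assumes that `xtile` and `ytile` follow the widely used Web Mercator "slippy map" tile convention and that the exponent in the formula is the zoom level; the slide does not spell out either point, so treat both as assumptions.

```python
import math

def xtile(lon, zoom):
    """Column index of the tile containing the longitude (assumed Web Mercator tiling)."""
    n = 1 << zoom
    x = int(math.floor((lon + 180.0) / 360.0 * n))
    return min(max(x, 0), n - 1)  # clamp so lon = 180 stays in range

def ytile(lat, zoom):
    """Row index of the tile containing the latitude (assumed Web Mercator tiling)."""
    n = 1 << zoom
    lat_rad = math.radians(lat)
    y = int(math.floor((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n))
    return min(max(y, 0), n - 1)  # clamp near the poles

def tile(lat, lon, zoom):
    """Single tile id: xtile + ytile * 2^zoom, as in the formula on the slide."""
    return xtile(lon, zoom) + ytile(lat, zoom) * (1 << zoom)
```

Grouping records by `tile(lat, lon, zoom)` buckets nearby points together, with a larger zoom giving smaller cells.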
  31. 31. More useful functions (Sketch, NLP). Sketch: exact count SELECT count(distinct id) FROM data; approximate count SELECT approx_count_distinct(id) FROM data. NLP: select tokenize_ja("", "normal", null, null, "https://s3.amazonaws.com/td-hivemall/dist/kuromoji-user-dict-neologd.csv.gz"); (the Japanese input text and the resulting token array were lost in extraction)
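To show why an approximate distinct count can be much cheaper than an exact one, here is a small Python sketch of a K-minimum-values (KMV) estimator. KMV is a stand-in chosen for brevity: the slides do not say which sketch algorithm `approx_count_distinct` uses, and the function name and parameters below are hypothetical.

```python
import hashlib
import heapq

def kmv_estimate(items, k=256):
    """Estimate the number of distinct items while keeping only k hash values in memory."""
    heap = []        # max-heap (via negation) of the k smallest hashes seen so far
    members = set()  # hashes currently stored in the heap
    for item in items:
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        if h in members:
            continue
        if len(heap) < k:
            heapq.heappush(heap, -h)
            members.add(h)
        elif h < -heap[0]:
            evicted = -heapq.heappushpop(heap, -h)
            members.discard(evicted)
            members.add(h)
    if len(heap) < k:
        return float(len(heap))  # fewer than k distinct values: the count is exact
    kth_smallest = -heap[0]
    return (k - 1) / (kth_smallest / 2 ** 64)
```

The exact query must track every distinct id, whereas a sketch like this keeps a fixed-size summary and trades a small relative error for bounded memory.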
  32. 32. List of Supported Algorithms. Classification: ✓ Perceptron ✓ Passive Aggressive (PA, PA1, PA2) ✓ Confidence Weighted (CW) ✓ Adaptive Regularization of Weight Vectors (AROW) ✓ Soft Confidence Weighted (SCW) ✓ AdaGrad+RDA ✓ Factorization Machines ✓ RandomForest Classification. Regression: ✓ Logistic Regression (SGD) ✓ AdaGrad (logistic loss) ✓ AdaDELTA (logistic loss) ✓ PA Regression ✓ AROW Regression ✓ Factorization Machines ✓ RandomForest Regression. Tips: SCW is a good first choice; try RandomForest if SCW does not work. Logistic regression is good for getting the probability of the positive class. Factorization Machines work well when features are sparse and categorical
  33. 33. RandomForest in Hivemall: ensemble of decision trees
  34. 34. Training of RandomForest: sparse vector input (LIBSVM format) is supported since v0.5.0, in addition to dense vector input
  35. 35. Prediction of RandomForest
  36. 36. Decision Tree Visualization
  37. 37. Decision Tree Visualization
  38. 38. XGBoost support in Hivemall (beta version)
      SELECT train_xgboost_classifier(features, label) as (model_id, model) FROM training_data;
      SELECT
        rowid,
        AVG(predicted) as predicted
      FROM (
        -- predict with each model
        SELECT xgboost_predict(rowid, features, model_id, model) AS (rowid, predicted)
        -- join each test record with each model
        FROM xgboost_models CROSS JOIN test_data_with_id
      ) t
      GROUP BY rowid;
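The prediction query on this slide cross joins every test row with every trained model and then averages the per-model predictions for each rowid. A minimal Python sketch of that ensemble-averaging pattern follows; the `predict` callable and the toy "models" are hypothetical stand-ins, not XGBoost models.

```python
from collections import defaultdict

def average_predictions(models, test_rows, predict):
    """Cross join each test row with each model, then average predictions per rowid."""
    sums, counts = defaultdict(float), defaultdict(int)
    for model in models:
        for rowid, features in test_rows:
            sums[rowid] += predict(model, features)
            counts[rowid] += 1
    return {rowid: sums[rowid] / counts[rowid] for rowid in sums}

# Hypothetical stand-in models: each "model" is just a bias added to the feature sum.
models = [0.0, 0.5, 1.0]
test_rows = [("r1", [1.0, 2.0]), ("r2", [0.0, 0.0])]
averaged = average_predictions(models, test_rows, lambda bias, feats: sum(feats) + bias)
```

Averaging over models trained on different data partitions is what lets training run in parallel without sharing state between tasks.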
  39. 39. Supported Algorithms for Recommendation. K-Nearest Neighbor: ✓ Minhash and b-Bit Minhash (LSH variant) ✓ Similarity Search on Vector Space (Euclid/Cosine/Jaccard/Angular) ✓ DIMSUM (Cosine similarity). Matrix Completion: ✓ Matrix Factorization ✓ Factorization Machines (regression). The each_top_k function of Hivemall is useful for recommending top-k items
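As an illustration of similarity search followed by top-k selection, the Python sketch below computes cosine and Jaccard similarity and returns the k most similar items by brute force. The function names and toy vectors are hypothetical, and this is not how Hivemall implements these functions; approximate methods such as Minhash or DIMSUM exist precisely to avoid this all-pairs comparison.

```python
import heapq
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def jaccard_similarity(a, b):
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def top_k_similar(query, items, k):
    """Return the k (similarity, item_id) pairs with the highest cosine similarity."""
    scored = ((cosine_similarity(query, vec), item_id) for item_id, vec in items.items())
    return heapq.nlargest(k, scored)

items = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
neighbors = top_k_similar([1.0, 0.0], items, 2)
```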
  40. 40. Other Supported Algorithms. Feature Engineering: ✓ Feature Hashing ✓ Feature Scaling (normalization, z-score) ✓ Feature Binning ✓ TF-IDF vectorizer ✓ Polynomial Expansion ✓ Amplifier. NLP: ✓ Basic English text Tokenizer ✓ Japanese Tokenizer. Evaluation metrics: ✓ AUC, nDCG, logloss, precision/recall@K, etc.
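Two of the feature engineering steps listed here, feature scaling and feature hashing, can be sketched in a few lines of Python. This is an illustration under stated assumptions: the function names and bucket count are hypothetical, and CRC32 is used as the hash only because it ships with the standard library (the slides do not say which hash function Hivemall uses).

```python
import zlib

def rescale(value, min_value, max_value):
    """Min-max normalization to the range [0, 1]."""
    if max_value == min_value:
        return 0.5  # arbitrary choice for a constant feature
    return (value - min_value) / (max_value - min_value)

def zscore(value, mean, stddev):
    """Standardize a value to zero mean and unit variance."""
    return (value - mean) / stddev if stddev else 0.0

def hash_feature(name, num_buckets=2 ** 24):
    """Map a feature name to a bucket index (CRC32 is a stand-in hash)."""
    return zlib.crc32(name.encode("utf-8")) % num_buckets

def hash_features(features, num_buckets=2 ** 24):
    """Hash a {name: value} dict into {index: value}, summing values on collisions."""
    hashed = {}
    for name, value in features.items():
        index = hash_feature(name, num_buckets)
        hashed[index] = hashed.get(index, 0.0) + value
    return hashed
```

Hashing bounds the feature space to a fixed number of indices, so no dictionary of feature names has to be built or shared across tasks.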
  41. 41. Evaluation Metrics
  42. 42. Other Supported Features. Anomaly Detection: ✓ Local Outlier Factor (LoF) ✓ ChangeFinder. Clustering / Topic models: ✓ Online mini-batch LDA ✓ Online mini-batch PLSA. Change Point Detection: ✓ ChangeFinder ✓ Singular Spectrum Transformation
  43. 43. Anomaly/Change-point Detection by ChangeFinder: an efficient algorithm for finding change points and outliers in time-series data. J. Takeuchi and K. Yamanishi, A Unifying Framework for Detecting Outliers and Change Points from Time Series, IEEE Transactions on Knowledge and Data Engineering, pp.482-492, 2006.
  44. 44. Anomaly/Change-point Detection by ChangeFinder: take this…
  45. 45. Anomaly/Change-point Detection by ChangeFinder: …and do this!
  46. 46. Anomaly/Change-point Detection by ChangeFinder: an efficient algorithm for finding change points and outliers in time-series data. J. Takeuchi and K. Yamanishi, A Unifying Framework for Detecting Outliers and Change Points from Time Series, IEEE Transactions on Knowledge and Data Engineering, pp.482-492, 2006.
  47. 47. Future work for v0.5.2 and later: ✓ Word2Vec support ✓ Multi-class Logistic Regression ✓ Field-aware Factorization Machines ✓ SLIM recommendation ✓ More efficient XGBoost support ✓ LightGBM support ✓ DecisionTree prediction tracing ✓ Gradient Boosting (pull requests shown on the slide: PR#91, PR#116, PR#58, PR#111)
  48. 48. ML workflows are often really complex…
  49. 49. Real-world ML pipelines (could be more complex) [Diagram stages: Datasource #1, Datasource #2, Datasource #3 • Extract Feature • Join • Feature Engineering (Feature Scaling, Feature Hashing, Feature Selection) • Train by Logistic Regression, Train by RandomForest, Train by Factorization Machines • Ensemble • Evaluate • Predict]
  50. 50. Hivemall + Digdag
  51. 51. Technology Trends for 2017 https://www.thoughtworks.com/radar
  52. 52. Why Digdag? • Manage workflows by code (simple YAML syntax) • REST API endpoints • Parallel/sequential execution • SLA, error notification • Secrets management • Docker support • TD, EMR, BigQuery/Slack operators • Embedded JavaScript engine. Programmer friendly, revision management. Plugin scheme for defining custom operators
  53. 53. Digdag features: SLA and error handling • Nestable, parallel/sequential execution • Embedded JavaScript engine
  54. 54. 542018/2/17 HackerTackle Machine Learning Workflow using Digdag
  55. 55. 552018/2/17 HackerTackle Machine Learning Workflow using Digdag
  56. 56. 2018/2/17 HackerTackle 56 Use case: CTR/CVR prediction
  57. 57. 2018/2/17 HackerTackle 57 Workflow execution timeline DEMO
  58. 58. Conclusion and Takeaway: Hivemall is a multi/cross-platform ML library providing a collection of machine learning algorithms as Hive UDFs/UDTFs. Digdag is a great workflow engine for managing complex ML pipelines. The first Apache release (v0.5.0) will appear soon! We welcome your contributions to Apache Hivemall :)
  59. 59. Any feature requests or questions?