7. Hivemall's Vision: ML on SQL
2016/10/26 Hadoop Summit '16, Tokyo 7
Classification with Hivemall
CREATE TABLE lr_model AS
SELECT
feature, -- reducers perform model averaging in parallel
avg(weight) as weight
FROM (
SELECT logress(features,label,..) as (feature,weight)
FROM train
) t -- map-only task
GROUP BY feature; -- shuffled to reducers
✓ Machine Learning made easy for SQL developers
✓ Interactive and stable APIs with SQL abstraction
This SQL query automatically runs in parallel on Hadoop.
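The pattern behind this query can be sketched in plain Python. Here `train_partition` and `average_models` are hypothetical stand-ins (not Hivemall code) for `logress()` running as a map-only task on each split and `avg(weight)` doing model averaging per feature in the reducers:

```python
# A minimal sketch of per-partition SGD training followed by model averaging.
# train_partition() and average_models() are illustrative names, not Hivemall APIs.
import math

def train_partition(rows, lr=0.1, epochs=10):
    """SGD logistic regression over one data split; returns {feature: weight}."""
    w = {}
    for _ in range(epochs):
        for features, label in rows:  # features: list of (name, value)
            z = sum(w.get(f, 0.0) * v for f, v in features)
            p = 1.0 / (1.0 + math.exp(-z))          # predicted probability
            for f, v in features:
                w[f] = w.get(f, 0.0) - lr * (p - label) * v
    return w

def average_models(models):
    """Model averaging: mean weight per feature across partitions (the GROUP BY)."""
    merged = {}
    for w in models:
        for f, v in w.items():
            merged.setdefault(f, []).append(v)
    return {f: sum(vs) / len(vs) for f, vs in merged.items()}

# Two "map tasks", each with a toy dataset: (feature list, label)
part1 = [([("x", 1.0)], 1), ([("x", -1.0)], 0)]
part2 = [([("x", 2.0)], 1), ([("x", -2.0)], 0)]
model = average_models([train_partition(part1), train_partition(part2)])
```

The `GROUP BY feature` in the SQL plays the role of `average_models` here: each reducer receives all partial weights for one feature and emits their mean.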
9. List of supported Algorithms
Classification
• Perceptron
• Passive Aggressive (PA, PA1, PA2)
• Confidence Weighted (CW)
• Adaptive Regularization of Weight Vectors (AROW)
• Soft Confidence Weighted (SCW)
• AdaGrad+RDA
• Factorization Machines
• RandomForest Classification
Regression
• Logistic Regression (SGD)
• AdaGrad (logistic loss)
• AdaDELTA (logistic loss)
• PA Regression
• AROW Regression
• Factorization Machines
• RandomForest Regression
SCW is a good first choice. Try RandomForest if SCW does not work. Logistic regression is good for getting the probability of the positive class. Factorization Machines are a good fit where features are sparse and largely categorical.
12.
student  class  score
1        b      70
2        a      80
3        a      90
4        b      50
5        a      70
6        b      60
Top-k query processing
List top-2 students for each class
SELECT * FROM (
SELECT
*,
rank() over (partition by class order by score desc)
as rank
FROM table
) t
WHERE rank <= 2
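For illustration, a plain-Python analogue of the window query above (a hypothetical helper, not Hivemall code): sort each class by score descending, number the rows, and keep those with rank at most 2.

```python
# Sketch of rank() over (partition by class order by score desc) ... WHERE rank <= 2
from itertools import groupby
from operator import itemgetter

rows = [  # (student, class, score), from the slide's table
    (1, "b", 70), (2, "a", 80), (3, "a", 90),
    (4, "b", 50), (5, "a", 70), (6, "b", 60),
]

def top_k_per_class(rows, k=2):
    out = []
    # Partition by class, order by score descending within each class
    by_class = sorted(rows, key=lambda r: (r[1], -r[2]))
    for cls, grp in groupby(by_class, key=itemgetter(1)):
        for rank, (student, _, score) in enumerate(grp, start=1):
            if rank <= k:  # the WHERE rank <= 2 filter
                out.append((rank, score, cls, student))
    return out
```

Note that the SQL version materializes a rank for every row before filtering, which is exactly the overhead the next slide's `each_top_k` avoids.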
13.
student  class  score
1        b      70
2        a      80
3        a      90
4        b      50
5        a      70
6        b      60
Top-k query processing
List top-2 students for each class
SELECT
each_top_k(
2, class, score,
class, student
) as (rank, score, class, student)
FROM (
SELECT * FROM table
DISTRIBUTE BY class SORT BY class
) t
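A hypothetical sketch of the idea behind `each_top_k`: because the subquery clusters the input by class (DISTRIBUTE BY / SORT BY), one sequential pass with a small per-group heap suffices; no global ranking of all rows is needed. This is illustrative Python, not the Hivemall implementation:

```python
# Single-pass top-k per group over input already sorted by the group key.
import heapq
from itertools import groupby
from operator import itemgetter

rows = [  # (student, class, score), pre-clustered by class as the subquery guarantees
    (2, "a", 80), (3, "a", 90), (5, "a", 70),
    (1, "b", 70), (4, "b", 50), (6, "b", 60),
]

def each_top_k(rows, k=2):
    out = []
    for cls, grp in groupby(rows, key=itemgetter(1)):
        heap = []  # min-heap of (score, student); never holds more than k entries
        for student, _, score in grp:
            heapq.heappush(heap, (score, student))
            if len(heap) > k:
                heapq.heappop(heap)  # evict the current worst score
        # Emit ranks in descending score order, mirroring (rank, score, class, student)
        for rank, (score, student) in enumerate(sorted(heap, reverse=True), start=1):
            out.append((rank, score, cls, student))
    return out
```

Keeping only k candidates per group is what lets this beat the window-function approach, which must rank every row in every partition first.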
32. Other new features to come
✓ Spark 2.0 support
✓ XGBoost Integration
✓ Field-aware Factorization Machines
✓ Generalized Linear Model
  • Optimizer framework including ADAM
  • L1/L2 regularization
33. Copyright©2016 NTT corp. All Rights Reserved.
Hivemall on Spark
Takeshi Yamamuro @ NTT
Hadoop Summit 2016
Tokyo, Japan
35. What's Spark?
• 1. Unified Engine
  • supports end-to-end apps, e.g., MLlib and Streaming
• 2. High-level APIs
  • easy to use, with rich optimization
• 3. Integrates broadly
  • with storage systems, libraries, ...
36. What's Hivemall on Spark?
• A Hivemall wrapper for Spark
  • wrapper implementations for DataFrame/SQL
  • plus some utilities for ease of use in Spark
• The wrapper lets you...
  • run most Hivemall functions in Spark
  • try examples easily on your laptop
  • improve the performance of some functions in Spark
37. Why Hivemall on Spark?
• Hivemall already has many fascinating ML algorithms and useful utilities
• There are high barriers to adding new algorithms to MLlib
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark
39. Current Status
• Most Hivemall functions are supported in Spark v1.6 and v2.0
- For Spark v2.0
$ git clone https://github.com/myui/hivemall
$ cd hivemall
$ mvn package -Pspark-2.0 -DskipTests
$ ls target/*spark*
target/hivemall-spark-2.0_2.11-XXX-with-dependencies.jar
...
-Pspark-1.6 for Spark v1.6
40. Running an Example
• 1. Download a Spark binary
• 2. Fetch training and test data
• 3. Load the data in Spark
• 4. Build a model
• 5. Do predictions
41. Running an Example: 1. Download a Spark binary
http://spark.apache.org/downloads.html
42. Running an Example: 2. Fetch training and test data
• E2006 tfidf regression dataset
• http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression.html#E2006-tfidf
$ wget http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression/E2006.train.bz2
$ wget http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression/E2006.test.bz2
43. Running an Example: 3. Load data in Spark
$ <SPARK_HOME>/bin/spark-shell --conf spark.jars=hivemall-spark-2.0_2.11-XXX-with-dependencies.jar
// Create a DataFrame from the bzip'd libsvm-formatted file
scala> val trainDf = sqlContext.sparkSession.read.format("libsvm")
  .load("E2006.train.bz2")
scala> trainDf.printSchema
root
 |-- label: double (nullable = true)
 |-- features: vector (nullable = true)
44. Running an Example: 3. Load data in Spark
0.000357499151147113 6066:0.00079327062196048 6069:0.000311377727123504 6070:0.000306754934580457 6071:0.000276992485786437 6072:0.00039663531098024 6074:0.00039663531098024 6075:0.00032548335...
[Figure: trainDf is split into Partition1, Partition2, Partition3, ..., PartitionN]
The file loads in parallel because bzip2 is splittable.
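The libsvm text format shown above is simply `<label> <index>:<value> <index>:<value> ...` per line, which is why each partition can be parsed independently. A hypothetical parser sketch (Spark's `libsvm` data source does this internally, along with building sparse vectors):

```python
# Parse one line of libsvm-formatted text into (label, {index: value}).
def parse_libsvm_line(line):
    parts = line.split()
    label = float(parts[0])
    features = {int(i): float(v) for i, v in (p.split(":") for p in parts[1:])}
    return label, features

# Toy line in the same shape as the E2006 sample above
label, feats = parse_libsvm_line("0.000357 6066:0.00079 6069:0.00031")
```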
45. Running an Example: 4. Build a model - DataFrame
scala> :paste
val modelDf = trainDf.train_logregr($"features", $"label")
  .groupBy("feature")
  .agg("weight" -> "avg")
46. Running an Example: 4. Build a model - SQL
scala> trainDf.createOrReplaceTempView("TrainTable")
scala> :paste
val modelDf = sql("""
  | SELECT feature, AVG(weight) AS weight
  | FROM (
  |   SELECT train_logregr(features, label)
  |     AS (feature, weight)
  |   FROM TrainTable
  | ) t
  | GROUP BY feature
  """.stripMargin)
47. Running an Example: 5. Do predictions - DataFrame
scala> :paste
val df = testDf.select(rowid(), $"features")
  .explode_vector($"features")
  .cache
// Do predictions
df.join(modelDf, df("feature") === modelDf("feature"), "LEFT_OUTER")
  .groupBy("rowid")
  .agg(sigmoid(sum($"weight" * $"value")))
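The join-and-aggregate prediction step can be sketched in plain Python; the tiny `model` and `test_rows` values below are made up for illustration, not taken from the E2006 data:

```python
# Join each test row's exploded (feature, value) pairs with the model's
# (feature, weight) table, sum weight * value per rowid, squash with a sigmoid.
import math

model = {"f1": 0.8, "f2": -0.4}          # feature -> averaged weight

test_rows = {                            # rowid -> exploded (feature, value) pairs
    "r1": [("f1", 1.0), ("f2", 2.0)],
    "r2": [("f2", 1.0), ("f3", 5.0)],    # f3 absent from the model (LEFT OUTER join)
}

def predict(test_rows, model):
    preds = {}
    for rowid, pairs in test_rows.items():
        # Missing features contribute 0, like the LEFT OUTER JOIN's NULL weights
        z = sum(model.get(f, 0.0) * v for f, v in pairs)
        preds[rowid] = 1.0 / (1.0 + math.exp(-z))   # sigmoid
    return preds

preds = predict(test_rows, model)
```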
48. Running an Example: 5. Do predictions - SQL
scala> modelDf.createOrReplaceTempView("ModelTable")
scala> df.createOrReplaceTempView("TestTable")
scala> :paste
sql("""
  | SELECT t.rowid, sigmoid(SUM(t.value * m.weight)) AS predicted
  | FROM TestTable t
  | LEFT OUTER JOIN ModelTable m
  |   ON t.feature = m.feature
  | GROUP BY t.rowid
  """.stripMargin)
49. Improve Some Functions in Spark
• Spark has overheads to call Hive UD*Fs
• Hivemall heavily depends on them
• ex.1) Compute a sigmoid function
scala> val sigmoidFunc = (d: Double) => 1.0 / (1.0 + Math.exp(-d))
scala> val sparkUdf = functions.udf(sigmoidFunc)
scala> df.select(sparkUdf($"value"))
50. Improve Some Functions in Spark
• Spark has overheads to call Hive UD*Fs
• Hivemall heavily depends on them
• ex.1) Compute a sigmoid function
scala> val hiveUdf = HivemallOps.sigmoid
scala> df.select(hiveUdf($"value"))
52. Improve Some Functions in Spark
• Spark has overheads to call Hive UD*Fs
• Hivemall heavily depends on them
• ex.2) Compute top-k for each key group
scala> :paste
df.withColumn(
  "rank",
  rank().over(Window.partitionBy($"key").orderBy($"score".desc))
).where($"rank" <= topK)
53. Improve Some Functions in Spark
• Spark has overheads to call Hive UD*Fs
• Hivemall heavily depends on them
• ex.2) Compute top-k for each key group
  • Fixed the overhead issue for each_top_k
  • See pr#353: "Implement EachTopK as a generator expression" in Spark
scala> df.each_top_k(topK, "key", "score", "value")
54. Improve Some Functions in Spark
• ex.2) each_top_k is ~4 times faster than rank()!
55. 3rd-Party Library Integration
• XGBoost support is under development
  • a fast implementation of gradient tree boosting
  • widely used in Kaggle competitions
• This integration will let you...
  • load built models and predict in parallel
  • build multiple models in parallel for cross-validation