taxiData.registerTempTable("ml_nyc_taxi")

%sql SELECT * FROM ml_nyc_taxi

Showing the first 1000 rows.

medallion                        hack_license                     vendor_id rate_code store_and_fwd_flag pickup_datetime
2D4B95E2FA7B2E85118EC5CA4570FA58 CD2F522EEE1FF5F5A8D8B679E23576B3 CMT 1 N 2013-01-07T15:33:28.000+0000
0C5296F3C8B16E702F8F2E06F5106552 D2363240A9295EF570FC6069BC4F4C92 CMT 1 N 2013-01-07T22:25:46.000+0000
312E0CB058D7FC1A6494EDB66D360CD2 7B5156F38990963332B33298C8BAE25E CMT 1 N 2013-01-05T11:54:49.000+0000
DD98E2C3AF5C47B4449F720ECC5778D4 79807332B275653A2473554C7328500A CMT 1 N 2013-01-02T06:58:08.000+0000
0B57B9633A2FECD3D3B1944AFC7471CF CCD4367B417ED6634D986F573A552A62 CMT 1 N 2013-01-07T14:46:55.000+0000
Scatter plot for tip amount and fare amount

%sql SELECT tip_amount, fare_amount FROM ml_nyc_taxi WHERE tip_amount > 0 AND tip_amount < 10 AND fare_amount < 50

[Scatter plot of tip_amount (y) vs. fare_amount (x); sample based on the first 1000 rows.]
Transformation of data with standard dataframe operations

import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.sql.functions.udf

// Build a feature vector from passenger count and fare amount.
val toVec = udf[Vector, Int, Float] { (a, b) => Vectors.dense(a, b) }
val trainingData =
  taxiData
    .filter(toDouble(taxiData.col("tip_amount")) > 0.0)
    .withColumn("label", toDouble(taxiData.col("tip_amount")))
    .withColumn("features", toVec(taxiData.col("passenger_count"), taxiData.col("fare_amount")))
The pipeline concept of Spark ML
A Pipeline chains Transformers and Estimators.
A Transformer can also be the model produced by a previously trained Estimator.
This is important for easily:
training with different model parameters, e.g. for cross-validation (see the tuning sketch below)
training with different test and training data (train-validation split)
repeating the transformation steps before estimation
Also keep an eye on KeystoneML (http://keystone-ml.org), an ML pipeline framework on Spark with a richer set of operators.
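A minimal sketch of such tuning, reusing the pipeline, linearRegressionEstimator and trainingTaxiData built later in this notebook; ParamGridBuilder and TrainValidationSplit come from spark.ml.tuning:

import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.tuning.{ParamGridBuilder, TrainValidationSplit}

// Candidate parameter sets: the whole pipeline (selection, vector assembly,
// regression) is re-fit for each combination in the grid.
val paramGrid = new ParamGridBuilder()
  .addGrid(linearRegressionEstimator.regParam, Array(0.1, 0.3, 0.5))
  .addGrid(linearRegressionEstimator.elasticNetParam, Array(0.0, 0.8))
  .build()

// Hold out 20% of the training data for validation and keep the best model.
val tvs = new TrainValidationSplit()
  .setEstimator(pipeline)
  .setEvaluator(new RegressionEvaluator().setMetricName("rmse"))
  .setEstimatorParamMaps(paramGrid)
  .setTrainRatio(0.8)

val tunedModel = tvs.fit(trainingTaxiData)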
SQLTransformer:
Select and filter the relevant data

import org.apache.spark.ml.feature.SQLTransformer

val taxiDataSelector = new SQLTransformer().setStatement(
  "SELECT tip_amount_d as label, passenger_count, fare_amount FROM ml_nyc_taxi WHERE tip_amount_d > 0")
val selectedTaxiData = taxiDataSelector.transform(taxiData)

VectorAssembler:
Transform the data into labeled data as needed for ML estimators

import org.apache.spark.ml.feature.VectorAssembler

// Combine the raw numeric columns into a single "features" vector column.
val trainingDataAssembler = new VectorAssembler()
  .setInputCols(Array("passenger_count", "fare_amount"))
  .setOutputCol("features")
val assembledTaxiData = trainingDataAssembler.transform(selectedTaxiData)
assembledTaxiData.select("label", "features").show()

+------------------+----------+
|             label|  features|
+------------------+----------+
|1.2000000476837158| [1.0,5.5]|
| 4.199999809265137|[1.0,20.5]|
| 5.900000095367432|[1.0,29.0]|
| 5.380000114440918|[1.0,21.0]|
| 1.399999976158142| [6.0,6.5]|
|               1.0| [1.0,5.0]|
|              1.25| [1.0,4.5]|
|               3.0|[6.0,26.0]|
|               1.0|[1.0,14.5]|
|1.2999999523162842| [1.0,6.5]|
| 1.899999976158142| [5.0,9.5]|
|1.6200000047683716| [1.0,6.5]|
| 1.899999976158142| [1.0,9.0]|
|               2.0|[1.0,22.0]|
|               6.0|[1.0,25.0]|
|3.5999999046325684|[1.0,17.5]|
|1.2000000476837158| [1.0,6.0]|
|               7.5|[1.0,24.5]|
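One caveat not raised in the original: SQLTransformer substitutes the __THIS__ placeholder with whatever DataFrame it is applied to, whereas the statement above hard-codes the registered ml_nyc_taxi table, so inside the pipeline below it reads the full table regardless of which DataFrame fit() receives. If that is not intended, a placeholder-based variant would look like:

// Sketch: select from the transformer's input instead of a fixed temp table.
val taxiDataSelector = new SQLTransformer().setStatement(
  "SELECT tip_amount_d as label, passenger_count, fare_amount FROM __THIS__ WHERE tip_amount_d > 0")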
Initialize the estimator

import org.apache.spark.ml.regression.LinearRegression

// Create a LinearRegression instance. This instance is an Estimator.
val linearRegressionEstimator = new LinearRegression()
  .setMaxIter(10)
  .setRegParam(0.3)
  .setElasticNetParam(0.8)
// Print out the parameters, documentation, and any default values.
println("LinearRegression parameters:\n" + linearRegressionEstimator.explainParams() + "\n")

linearRegressionEstimator: org.apache.spark.ml.regression.LinearRegression = linReg_54024ee673fd

LinearRegression parameters:
elasticNetParam: the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty (default: 0.0, current: 0.8)
featuresCol: features column name (default: features)
fitIntercept: whether to fit an intercept term (default: true)
labelCol: label column name (default: label)
maxIter: maximum number of iterations (>= 0) (default: 100, current: 10)
predictionCol: prediction column name (default: prediction)
regParam: regularization parameter (>= 0) (default: 0.0, current: 0.3)
solver: the solver algorithm for optimization. If this is not set or empty, default value is 'auto'. (default: auto)
standardization: whether to standardize the training features before fitting the model (default: true)
tol: the convergence tolerance for iterative algorithms (default: 1.0E-6)
weightCol: weight column name. If this is not set or empty, we treat all instance weights as 1.0. (default: )
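A note on how the two knobs interact, which the parameter listing leaves implicit: with regParam = λ and elasticNetParam = α, the regularization term Spark's linear regression minimizes is

λ (α ‖w‖₁ + (1 − α)/2 ‖w‖₂²)

so the α = 0.8 used here is mostly an L1 (lasso-style) penalty, which tends to drive small coefficients to zero.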
Split the data into training and test set

val Array(trainingTaxiData, testTaxiData) = taxiData.randomSplit(Array(0.9, 0.1), seed = 12345)

Set up the transformation and estimation PIPELINE

import org.apache.spark.ml.{Pipeline, PipelineModel}

// Chain the SQL selection, the vector assembly and the regression estimator.
val pipeline = new Pipeline().setStages(Array(taxiDataSelector, trainingDataAssembler, linearRegressionEstimator))

Use the pipeline to train the model

// Learn a LinearRegression model inside the pipeline.
// val lrModel = linearRegressionEstimator.fit(trainingData)
val lrModel = pipeline.fit(trainingTaxiData)

Predict with the trained model on the test data

display(lrModel.transform(testTaxiData)
  .select("label", "prediction"))

[Scatter plot of label vs. prediction on the test data; sample based on the first 1000 rows.]
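The notebook stops at visual inspection; as a sketch, a quick numeric check would score the same predictions with spark.ml's RegressionEvaluator:

import org.apache.spark.ml.evaluation.RegressionEvaluator

// Root-mean-squared error of the tip predictions on the held-out test set.
val evaluator = new RegressionEvaluator()
  .setLabelCol("label")
  .setPredictionCol("prediction")
  .setMetricName("rmse")
val rmse = evaluator.evaluate(lrModel.transform(testTaxiData))
println(s"RMSE on test data: $rmse")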
How to get started with Spark ML

Set up your laptop (16+ GB RAM recommended)
mac$ brew install apache-spark
or get a Databricks Community Edition notebook (wait list)
Get data
Join an ML competition and get BIG data from Kaggle
Analyze the Panama Papers: https://github.com/amaboura/panama-papers-dataset-2016
Visualize the data (Databricks or Zeppelin notebook: https://zeppelin.incubator.apache.org/)
Throw some algorithms at it!
Have a coffee
And maybe read the docs: http://spark.apache.org/docs/latest/mllib-guide.html
Read the Kaggle competition forums and blog
Graphs from the Panama Papers