Introduction to Spark ML Pipelines Workshop slides - companion Jupyter notebooks in Python & Scala are available from my GitHub at https://github.com/holdenk/spark-intro-ml-pipeline-workshop
2. Who am I?
Holden
● I prefer she/her for pronouns
● Co-author of the Learning Spark book
● Software Engineer at IBM's Spark Technology Center
● @holdenkarau
● http://www.slideshare.net/hkarau
● https://www.linkedin.com/in/holdenkarau
3. What we are going to explore together!
● Who I think you all are
● Spark's two different ML APIs
● Running through a simple example with one
● A brief detour into some codegen funtimes
● Exercises!
● Model save/load
● Discussion of "serving" options
4. The different pieces of Spark
[Diagram: the pieces of the Apache Spark stack - Spark core; SQL & DataFrames; Streaming; language APIs for Scala, Java, Python, & R; graph tools (Bagel & GraphX); Spark ML; MLLib; community packages]
5. Who do I think you all are?
● Nice people*
● Some knowledge of Apache Spark core & maybe SQL
● Interested in using Spark for Machine Learning
● Familiar-ish with Scala or Java or Python
7. But maybe time to upgrade...
● Spark 1.5+ (Spark 1.6 would be best!)
● (built with Hive support if building from source)
8. Some pages to keep open:
http://bit.ly/sparkDocs
http://bit.ly/sparkPyDocs OR http://bit.ly/sparkScalaDoc
http://bit.ly/sparkMLGuide
https://github.com/holdenk/spark-intro-ml-pipeline-workshop
http://www.slideshare.net/hkarau
Download census data: https://archive.ics.uci.edu/ml/datasets/Adult
9. Getting some data for working with:
● census data: https://archive.ics.uci.edu/ml/datasets/Adult
● Goal: predict income > 50k
● Also included in the github repo
● Download that now if you haven't already
● We will add a header to the data
● http://pastebin.ca/3318687
10. So what are the two APIs?
● Traditional and Pipeline
● Pipeline is the new shiny future which will fix all problems*
● Traditional API works on RDDs
● Data preparation work is generally done in traditional Spark transformations
● Pipeline API works on DataFrames
● Often we want to apply some transformations to our data before feeding to the machine learning algorithm
● Makes it easy to chain these together
(*until replaced by a newer shinier future)
11. So what are DataFrames?
● Spark SQL's version of RDDs (it's for more than just SQL)
● Restricted data types, schema information, compile-time untyped
● Restricted operations (more relational style) - see the small sketch below
● Allow lots of fun extra optimizations
● Tungsten etc.
● We'll talk more about them (& Datasets) when we do the Spark SQL component of this workshop
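A minimal sketch of what the relational style looks like, assuming the census DataFrame df we load later in the workshop (the column names assume the header we add to the data):
# Schema information is carried along with the data
df.printSchema()
# Relational-style operations instead of arbitrary functions over rows
df.select("age", "education-num").show(5)
df.groupBy("category").count().show()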
12. Transformers, Estimators and Pipelines
● Transformers transform a DataFrame into another
● Estimators can be trained on a DataFrame to produce a transformer
● Pipelines chain together multiple transformers and estimators (a minimal sketch of the fit/transform pattern is below)
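A minimal sketch of the estimator/transformer pattern, using the same StringIndexer we set up later in these slides (assumes the census DataFrame df is already loaded):
from pyspark.ml.feature import StringIndexer
# An Estimator: fit() looks at the data and returns a Transformer
indexer = StringIndexer(inputCol="category", outputCol="category-index")
indexer_model = indexer.fit(df)
# A Transformer: transform() returns a new DataFrame with the extra column
indexed = indexer_model.transform(df)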
13. Let's start with loading some data
● We've got some CSV data, we could use textFile and parse by hand (a rough sketch of that is below)
● spark-packages can save us the trouble: it provides the spark-csv package by Hossein Falaki
● If we were building a Java project we could include the Maven coordinates
● For the Spark shell when launching add:
--packages com.databricks:spark-csv_2.10:1.3.0
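For contrast, a rough sketch of what parsing by hand might look like with textFile (no header handling, no type inference, no quoting support - this is what spark-csv saves us from):
lines = sc.textFile("resources/adult.data")
# Split each line on commas and strip whitespace - still just strings, no schema
parsed = lines.map(lambda line: [field.strip() for field in line.split(",")])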
14. Loading with sparkSQL & spark-csv
sqlContext.read returns a DataFrameReader
We can specify general properties & data specific options
● option("key", "value")
● spark-csv ones we will use are header & inferSchema
● format("formatName")
● built in formats include parquet, jdbc, etc. - today we will use com.databricks.spark.csv
● load("path")
15. Loading with sparkSQL & spark-csv
df = (sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .option("inferSchema", "true")
      .load("resources/adult.data"))
16. Let's explore training a Decision Tree
● Step 1: Data loading (done!)
● Step 2: Data prep (select features, etc.)
● Step 3: Train
● Step 4: Predict
17. Data prep / cleaning
● We need to predict a double (can be 0.0, 1.0, but type must be double)
● We need to train with a vector of features
Imports:
from pyspark.mllib.linalg import Vectors
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.param import Param, Params
from pyspark.ml.feature import Bucketizer, VectorAssembler, StringIndexer
from pyspark.ml import Pipeline
18. Data prep / cleaning continued
# Combines a list of double input features into a vector
assembler = VectorAssembler(inputCols=["age", "education-num"],
                            outputCol="features")
# String indexer converts a set of strings into doubles
indexer = StringIndexer(inputCol="category").setOutputCol("category-index")
# Can be used to combine pipeline components together
pipeline = Pipeline().setStages([assembler, indexer])
19. So a bit more about that pipeline
● Each of our previous components has a "fit" & "transform" stage
● Constructing the pipeline this way makes it easier to work with (only need to call one fit & one transform)
● Can re-use the fitted model on future data
model = pipeline.fit(df)
prepared = model.transform(df)
20. What does our pipeline look like so far?
[Diagram: Input Data → Assembler → Input Data + Vectors → StringIndexer → Input Data + Vectors + Cat ID. The Assembler is a regular transformer - no fitting required. The StringIndexer, while not an ML learning algorithm, still needs to be fit.]
21. Let's train a model on our prepared data:
# Specify model
dt = DecisionTreeClassifier(labelCol="category-index", featuresCol="features")
# Fit it
dt_model = dt.fit(prepared)
# Or as part of the pipeline
pipeline_and_model = Pipeline().setStages([assembler, indexer, dt])
pipeline_model = pipeline_and_model.fit(df)
22. And predict the results on the same data:
pipeline_model.transform(df).select("prediction", "category-index").take(20)
23. Exercise 1:
Go from the index to something useful
● We could manually look up the labels and then write a select statement
● Or we could look at the features on the StringIndexerModel and use IndexToString
● Our pipeline has an array of stages we can use for this
24. Solution:
from pyspark.ml.feature import IndexToString
labels = list(pipeline_model.stages[1].labels())
inverter = IndexToString(inputCol="prediction", outputCol="prediction-label", labels=labels)
inverter.transform(pipeline_model.transform(df)).select("prediction-label", "category").take(20)
# Pre Spark 1.6 use SQL if/else or similar
25. So what could we do for other types of data?
● org.apache.spark.ml.feature has a lot of options (a small text example is sketched below)
● HashingTF
● Tokenizer
● Word2Vec
● etc.
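A hedged sketch of chaining some of these for text data - the DataFrame text_df and its "text" column here are made up for illustration:
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml import Pipeline
# Split free text into words, then hash the words into a fixed-size term-frequency vector
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashing_tf = HashingTF(inputCol="words", outputCol="tf-features", numFeatures=1000)
text_pipeline = Pipeline(stages=[tokenizer, hashing_tf])
text_features = text_pipeline.fit(text_df).transform(text_df)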
26. Exercise 2: Add more features to your tree
● Finished quickly? Help others!
● Or tell me if adding these features helped or not…
● We can download a reserved "test" dataset, but how would we know if we couldn't do that?
(a sketch of one possible approach follows below)
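One possible way to add features (a sketch, not the only answer): index an extra categorical column and feed more numeric columns into the VectorAssembler. The workclass and hours-per-week column names assume the header we added earlier:
workclass_indexer = StringIndexer(inputCol="workclass", outputCol="workclass-index")
wider_assembler = VectorAssembler(
    inputCols=["age", "education-num", "hours-per-week", "workclass-index"],
    outputCol="features")
# Reuse the label indexer & decision tree from before
wider_pipeline = Pipeline().setStages([workclass_indexer, wider_assembler, indexer, dt])
wider_model = wider_pipeline.fit(df)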
27. And not just for getting data into doubles...
● Maybe a customer's cat food preference only matters if the owns_cats boolean is true
● Maybe the scale is _way_ off
● Maybe we've got stop words
● Maybe we know one component has a non-linear relation
● etc.
(a few sketches of these kinds of adjustments follow below)
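Some hedged sketches of the kinds of adjustments above - the column names and bucket boundaries here are made up for illustration:
from pyspark.ml.feature import StandardScaler, Bucketizer, StopWordsRemover
# Rescale a feature vector whose scale is way off
scaler = StandardScaler(inputCol="features", outputCol="scaled-features")
# Bucket a continuous value into coarse ranges (one way to capture a non-linear relation)
bucketizer = Bucketizer(splits=[0.0, 25.0, 45.0, 65.0, float("inf")],
                        inputCol="age", outputCol="age-bucket")
# Drop stop words from tokenized text
remover = StopWordsRemover(inputCol="words", outputCol="filtered-words")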
28. Cross-validation
because saving a test set is effort
● Automagically* fit your model params
● Because thinking is effort
● org.apache.spark.ml.tuning has the tools
● (not in Python yet so skipping for now)
29. Pipeline API has many models:
● org.apache.spark.ml.classification
● LogisticRegression, DecisionTreeClassifier, GBTClassifier, etc.
● org.apache.spark.ml.regression
● DecisionTreeRegressor, GBTRegressor, IsotonicRegression, LinearRegression, etc.
● org.apache.spark.ml.recommendation
● ALS
(a sketch of swapping one of these into our pipeline follows below)
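A sketch of swapping a different classifier into the same pipeline, reusing the assembler and indexer stages defined earlier:
from pyspark.ml.classification import LogisticRegression
lr = LogisticRegression(labelCol="category-index", featuresCol="features")
lr_pipeline = Pipeline().setStages([assembler, indexer, lr])
lr_model = lr_pipeline.fit(df)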
30. Exercise 3: Train a new model type
● Your choice!
● If you want to do regression - change what we are predicting (one possible regression sketch is below)
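One possible regression variant (a sketch): predict hours-per-week instead of the income category. The column name assumes the header we added earlier, and depending on your Spark version you may need to cast the label column to a double first:
from pyspark.ml.regression import DecisionTreeRegressor
dt_reg = DecisionTreeRegressor(labelCol="hours-per-week", featuresCol="features")
reg_pipeline = Pipeline().setStages([assembler, dt_reg])
reg_model = reg_pipeline.fit(df)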
31. So serving...
● Generally refers to using your model online
● Generating recommendations...
● In batch mode you can "just" save & use the Spark bits
● Spark's "native" formats (often parquet w/ metadata)
● Understood by Spark libraries and that's pretty much it
● If you are serving in the JVM you can load these but need Spark dependencies (albeit often not a Spark cluster)
● Some models support PMML export
● https://github.com/jpmml/openscoring etc.
● We can also write our own export & serving by hand :(
32. So what models are PMML exportable?
● Right now "old" style models
● KMeans, LinearRegression, RidgeRegression, Lasso, SVM, and binary LogisticRegression
● However if we look in the code we can sometimes find converters to the old style models and use this to export our "new" style model
● Waiting on https://issues.apache.org/jira/browse/SPARK-11171 / https://github.com/apache/spark/pull/9207 for pipeline models
● Not getting in for 2.0 :(
33. How to PMML export
toPMML
● returns a string or
● takes a path to local fs and saves results or
● takes a SparkContext & a distributed path and saves or
● takes a stream and writes result to stream
34. Optional* exercise time
● Take a model you trained and save it to PMML
● You will have to dig around in the Spark code to be able to do this
● Look at the file
● Load it into a serving system and try some predictions
● Note: PMML export currently only includes the model - not any transformations beforehand
● Also: you might need to train a new model
● If you don't get it don't worry - hints to follow :)