Introduction to Spark ML Pipelines Workshop slides - companion Jupyter notebooks in Python & Scala are available from my GitHub at https://github.com/holdenk/spark-intro-ml-pipeline-workshop
2. Who am I?
Holden
● I prefer she/her for pronouns
● Co-author of the Learning Spark book
● Software Engineer at IBM's Spark Technology Center
● @holdenkarau
● http://www.slideshare.net/hkarau
● https://www.linkedin.com/in/holdenkarau
3. What we are going to explore together!
● Who I think you all are
● Spark's two different ML APIs
● Running through a simple example with one
● A brief detour into some codegen funtimes
● Exercises!
● Model save/load
● Discussion of "serving" options
4. The different pieces of Spark
[Diagram: the pieces of the Apache Spark stack - Spark core; SQL & DataFrames; Streaming; language APIs for Scala, Java, Python, & R; graph tools (Bagel & GraphX); Spark ML; MLLib; community packages]
5. Who do I think you all are?
● Nice people*
● Some knowledge of Apache Spark core & maybe SQL
● Interested in using Spark for Machine Learning
● Familiar-ish with Scala or Java or Python
7. But maybe time to upgrade...
● Spark 1.5+ (Spark 1.6 would be best!)
● (built with Hive support if building from source)
8. Some pages to keep open:
http://bit.ly/sparkDocs
http://bit.ly/sparkPyDocs OR http://bit.ly/sparkScalaDoc
http://bit.ly/sparkMLGuide
https://github.com/holdenk/spark-intro-ml-pipeline-workshop
http://www.slideshare.net/hkarau
Download census data: https://archive.ics.uci.edu/ml/datasets/Adult
9. Getting some data for working with:
● census data: https://archive.ics.uci.edu/ml/datasets/Adult
● Goal: predict income > 50k
● Also included in the github repo
● Download that now if you haven't already
● We will add a header to the data
● http://pastebin.ca/3318687
10. So what are the two APIs?
● Traditional and Pipeline
● Pipeline is the new shiny future which will fix all problems*
● Traditional API works on RDDs
● Data preparation work is generally done in traditional Spark transformations
● Pipeline API works on DataFrames
● Often we want to apply some transformations to our data before feeding to the machine learning algorithm
● Makes it easy to chain these together
(*until replaced by a newer shinier future)
11. So what are DataFrames?
● Spark SQL's version of RDDs (it's for more than just SQL)
● Restricted data types, schema information, compile-time untyped
● Restricted operations (more relational style) - see the small sketch below
● Allow lots of fun extra optimizations
● Tungsten etc.
● We'll talk more about them (& Datasets) when we do the Spark SQL component of this workshop
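A minimal sketch of what the relational style looks like, assuming the census DataFrame df we load later in the workshop (the column names assume the header we add to the data):
# Schema information is carried along with the data
df.printSchema()
# Relational-style operations instead of arbitrary functions over rows
df.select("age", "education-num").show(5)
df.groupBy("category").count().show()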
12. Transformers, Estimators and Pipelines
● Transformers transform a DataFrame into another
● Estimators can be trained on a DataFrame to produce a transformer
● Pipelines chain together multiple transformers and estimators (a minimal sketch of the fit/transform pattern is below)
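A minimal sketch of the estimator/transformer pattern, using the same StringIndexer we set up later in these slides (assumes the census DataFrame df is already loaded):
from pyspark.ml.feature import StringIndexer
# An Estimator: fit() looks at the data and returns a Transformer
indexer = StringIndexer(inputCol="category", outputCol="category-index")
indexer_model = indexer.fit(df)
# A Transformer: transform() returns a new DataFrame with the extra column
indexed = indexer_model.transform(df)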
13. Let's start with loading some data
● We've got some CSV data, we could use textFile and parse by hand (a rough sketch of that is below)
● spark-packages can save us the trouble: it provides the spark-csv package by Hossein Falaki
● If we were building a Java project we could include the Maven coordinates
● For the Spark shell when launching add:
--packages com.databricks:spark-csv_2.10:1.3.0
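For contrast, a rough sketch of what parsing by hand might look like with textFile (no header handling, no type inference, no quoting support - this is what spark-csv saves us from):
lines = sc.textFile("resources/adult.data")
# Split each line on commas and strip whitespace - still just strings, no schema
parsed = lines.map(lambda line: [field.strip() for field in line.split(",")])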
14. Loading with sparkSQL & spark-csv
sqlContext.read returns a DataFrameReader
We can specify general properties & data specific options
● option("key", "value")
● spark-csv ones we will use are header & inferSchema
● format("formatName")
● built in formats include parquet, jdbc, etc. - today we will use com.databricks.spark.csv
● load("path")
15. Loading with sparkSQL & spark-csv
df = (sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .option("inferSchema", "true")
      .load("resources/adult.data"))
16. Let's explore training a Decision Tree
● Step 1: Data loading (done!)
● Step 2: Data prep (select features, etc.)
● Step 3: Train
● Step 4: Predict
17. Data prep / cleaning
● We need to predict a double (can be 0.0, 1.0, but type must be double)
● We need to train with a vector of features
Imports:
from pyspark.mllib.linalg import Vectors
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.param import Param, Params
from pyspark.ml.feature import Bucketizer, VectorAssembler, StringIndexer
from pyspark.ml import Pipeline
18. Data prep / cleaning continued
# Combines a list of double input features into a vector
assembler = VectorAssembler(inputCols=["age", "education-num"],
                            outputCol="features")
# String indexer converts a set of strings into doubles
indexer = StringIndexer(inputCol="category").setOutputCol("category-index")
# Can be used to combine pipeline components together
pipeline = Pipeline().setStages([assembler, indexer])
19. So a bit more about that pipeline
● Each of our previous components has a "fit" & "transform" stage
● Constructing the pipeline this way makes it easier to work with (only need to call one fit & one transform)
● Can re-use the fitted model on future data
model = pipeline.fit(df)
prepared = model.transform(df)
20. What does our pipeline look like so far?
[Diagram: Input Data → Assembler → Input Data + Vectors → StringIndexer → Input Data + Vectors + Cat ID. The Assembler is a regular transformer - no fitting required. The StringIndexer, while not an ML learning algorithm, still needs to be fit.]
21. Let's train a model on our prepared data:
# Specify model
dt = DecisionTreeClassifier(labelCol="category-index", featuresCol="features")
# Fit it
dt_model = dt.fit(prepared)
# Or as part of the pipeline
pipeline_and_model = Pipeline().setStages([assembler, indexer, dt])
pipeline_model = pipeline_and_model.fit(df)
22. And predict the results on the same data:
pipeline_model.transform(df).select("prediction", "category-index").take(20)
23. Exercise 1:
Go from the index to something useful
● We could manually look up the labels and then write a select statement
● Or we could look at the features on the StringIndexerModel and use IndexToString
● Our pipeline has an array of stages we can use for this
24. Solution:
from pyspark.ml.feature import IndexToString
labels = list(pipeline_model.stages[1].labels())
inverter = IndexToString(inputCol="prediction", outputCol="prediction-label", labels=labels)
inverter.transform(pipeline_model.transform(df)).select("prediction-label", "category").take(20)
# Pre Spark 1.6 use SQL if/else or similar
25. So what could we do for other types of data?
● org.apache.spark.ml.feature has a lot of options (a small text example is sketched below)
● HashingTF
● Tokenizer
● Word2Vec
● etc.
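A hedged sketch of chaining some of these for text data - the DataFrame text_df and its "text" column here are made up for illustration:
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml import Pipeline
# Split free text into words, then hash the words into a fixed-size term-frequency vector
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashing_tf = HashingTF(inputCol="words", outputCol="tf-features", numFeatures=1000)
text_pipeline = Pipeline(stages=[tokenizer, hashing_tf])
text_features = text_pipeline.fit(text_df).transform(text_df)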
26. Exercise 2: Add more features to your tree
● Finished quickly? Help others!
● Or tell me if adding these features helped or not…
● We can download a reserved "test" dataset, but how would we know if we couldn't do that?
(a sketch of one possible approach follows below)
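One possible way to add features (a sketch, not the only answer): index an extra categorical column and feed more numeric columns into the VectorAssembler. The workclass and hours-per-week column names assume the header we added earlier:
workclass_indexer = StringIndexer(inputCol="workclass", outputCol="workclass-index")
wider_assembler = VectorAssembler(
    inputCols=["age", "education-num", "hours-per-week", "workclass-index"],
    outputCol="features")
# Reuse the label indexer & decision tree from before
wider_pipeline = Pipeline().setStages([workclass_indexer, wider_assembler, indexer, dt])
wider_model = wider_pipeline.fit(df)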
27. And not just for getting data into doubles...
● Maybe a customer's cat food preference only matters if the owns_cats boolean is true
● Maybe the scale is _way_ off
● Maybe we've got stop words
● Maybe we know one component has a non-linear relation
● etc.
(a few sketches of these kinds of adjustments follow below)
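Some hedged sketches of the kinds of adjustments above - the column names and bucket boundaries here are made up for illustration:
from pyspark.ml.feature import StandardScaler, Bucketizer, StopWordsRemover
# Rescale a feature vector whose scale is way off
scaler = StandardScaler(inputCol="features", outputCol="scaled-features")
# Bucket a continuous value into coarse ranges (one way to capture a non-linear relation)
bucketizer = Bucketizer(splits=[0.0, 25.0, 45.0, 65.0, float("inf")],
                        inputCol="age", outputCol="age-bucket")
# Drop stop words from tokenized text
remover = StopWordsRemover(inputCol="words", outputCol="filtered-words")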
28. Cross-validation
because saving a test set is effort
● Automagically* fit your model params
● Because thinking is effort
● org.apache.spark.ml.tuning has the tools
● (not in Python yet so skipping for now)
29. Pipeline API has many models:
● org.apache.spark.ml.classification
● LogisticRegression, DecisionTreeClassifier, GBTClassifier, etc.
● org.apache.spark.ml.regression
● DecisionTreeRegressor, GBTRegressor, IsotonicRegression, LinearRegression, etc.
● org.apache.spark.ml.recommendation
● ALS
(a sketch of swapping one of these into our pipeline follows below)
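A sketch of swapping a different classifier into the same pipeline, reusing the assembler and indexer stages defined earlier:
from pyspark.ml.classification import LogisticRegression
lr = LogisticRegression(labelCol="category-index", featuresCol="features")
lr_pipeline = Pipeline().setStages([assembler, indexer, lr])
lr_model = lr_pipeline.fit(df)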
30. Exercise 3: Train a new model type
● Your choice!
● If you want to do regression - change what we are predicting (one possible regression sketch is below)
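One possible regression variant (a sketch): predict hours-per-week instead of the income category. The column name assumes the header we added earlier, and depending on your Spark version you may need to cast the label column to a double first:
from pyspark.ml.regression import DecisionTreeRegressor
dt_reg = DecisionTreeRegressor(labelCol="hours-per-week", featuresCol="features")
reg_pipeline = Pipeline().setStages([assembler, dt_reg])
reg_model = reg_pipeline.fit(df)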
31. So serving...
● Generally refers to using your model online
● Generating recommendations...
● In batch mode you can "just" save & use the Spark bits
● Spark's "native" formats (often parquet w/ metadata)
● Understood by Spark libraries and that's pretty much it
● If you are serving in the JVM you can load these but need Spark dependencies (albeit often not a Spark cluster)
● Some models support PMML export
● https://github.com/jpmml/openscoring etc.
● We can also write our own export & serving by hand :(
32. So what models are PMML exportable?
● Right now "old" style models
● KMeans, LinearRegression, RidgeRegression, Lasso, SVM, and binary LogisticRegression
● However if we look in the code we can sometimes find converters to the old style models and use this to export our "new" style model
● Waiting on https://issues.apache.org/jira/browse/SPARK-11171 / https://github.com/apache/spark/pull/9207 for pipeline models
● Not getting in for 2.0 :(
33. How to PMML export
toPMML
● returns a string or
● takes a path to local fs and saves results or
● takes a SparkContext & a distributed path and saves or
● takes a stream and writes result to stream
34. Optional* exercise time
● Take a model you trained and save it to PMML
● You will have to dig around in the Spark code to be able to do this
● Look at the file
● Load it into a serving system and try some predictions
● Note: PMML export currently only includes the model - not any transformations beforehand
● Also: you might need to train a new model
● If you don't get it don't worry - hints to follow :)