Introduction to Spark ML
Machine learning at scale
ApacheCon 2016
Hella-Legit
Who am I?
Holden
● I prefer she/her for pronouns
● Co-author of the Learning Spark book
● Software Engineer at IBM’s Spark Technology Center
● @holdenkarau
● http://www.slideshare.net/hkarau
● https://www.linkedin.com/in/holdenkarau
What we are going to explore together!
● Who I think you all are
● Spark’s two different ML APIs
● Running through a simple example with one
● A brief detour into some codegen funtimes
● Exercises!
● Model save/load
● Discussion of “serving” options
The different pieces of Spark
(Diagram of the Apache Spark stack)
● Spark core, with language APIs for Scala, Java, Python, & R
● SQL & DataFrames
● Streaming
● Graph tools (Bagel & GraphX)
● MLlib & Spark ML
● Community packages
Who do I think you all are?
● Nice people*
● Some knowledge of Apache Spark core & maybe SQL
● Interested in using Spark for Machine Learning
● Familiar-ish with Scala or Java or Python
Amanda
Skipping intro & set-up time :)
But maybe time to upgrade...
● Spark 1.5+ (Spark 1.6 would be best!)
○ (built with Hive support if building from source)
Amanda
Some pages to keep open:
http://bit.ly/sparkDocs
http://bit.ly/sparkPyDocs OR http://bit.ly/sparkScalaDoc
http://bit.ly/sparkMLGuide
https://github.com/holdenk/spark-intro-ml-pipeline-workshop
http://www.slideshare.net/hkarau
Download the census data: https://archive.ics.uci.edu/ml/datasets/Adult
Dwight Sipler
Getting some data for working with:
● census data: https://archive.ics.uci.edu/ml/datasets/Adult
● Goal: predict income > 50k
● Also included in the github repo
● Download that now if you haven’t already
● We will add a header to the data
○ http://pastebin.ca/3318687
Till Westermayer
So what are the two APIs?
● Traditional and Pipeline
○ Pipeline is the new shiny future which will fix all problems*
● Traditional API works on RDDs
○ Data preparation work is generally done in traditional Spark
transformations
● Pipeline API works on DataFrames
○ Often we want to apply some transformations to our data before
feeding to the machine learning algorithm
○ Makes it easy to chain these together
(*until replaced by a newer shinier future)
Steve Jurvetson
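For a sense of the contrast, a minimal hedged sketch (the toy RDD is made up on the spot; sc is an existing SparkContext):
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.classification import LogisticRegressionWithSGD
# Traditional API: RDD[LabeledPoint] in, model out
labeled_point_rdd = sc.parallelize([LabeledPoint(0.0, [1.0, 2.0]),
                                    LabeledPoint(1.0, [3.0, 4.0])])
old_model = LogisticRegressionWithSGD.train(labeled_point_rdd)
# Pipeline API: DataFrame in, fitted PipelineModel out - the rest of this workshop uses this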
So what are DataFrames?
● Spark SQL's version of RDDs (it's for more than just SQL)
● Restricted data types, schema information, compile time
untyped
● Restricted operations (more relational style)
● Allow lots of fun extra optimizations
○ Tungsten etc.
● We’ll talk more about them (& Datasets) when we do
the Spark SQL component of this workshop
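A quick hedged sketch of what that buys us (assuming an existing SparkContext sc and SQLContext sqlContext):
rdd = sc.parallelize([("pandas", 3), ("coffee", 2)])
df = sqlContext.createDataFrame(rdd, ["word", "count"])
df.printSchema()                   # schema information travels with the data
df.filter(df["count"] > 2).show()  # restricted, relational-style operations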
Transformers, Estimators and Pipelines
● Transformers transform a DataFrame into another
● Estimators can be trained on a DataFrame to produce a
transformer
● Pipelines chain together multiple transformers and
estimators
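A minimal sketch of how the pieces relate (the DataFrame df and its columns here are hypothetical - we load real data in a moment):
from pyspark.ml.feature import StringIndexer   # an Estimator
indexer = StringIndexer(inputCol="label-str", outputCol="label")
indexer_model = indexer.fit(df)            # fitting an Estimator produces a Transformer
indexed_df = indexer_model.transform(df)   # a Transformer maps one DataFrame to another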
Let’s start with loading some data
● We’ve got some CSV data, we could use textfile and
parse by hand
● spark-packages saves us some work here: the spark-csv
package by Hossein Falaki handles the parsing
○ If we were building a Java project we could include the Maven coordinates
○ For the Spark shell when launching add:
--packages com.databricks:spark-csv_2.10:1.3.0
Jess Johnson
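The same flag should work when launching pyspark (a sketch; match the artifact to your Scala & Spark versions):
pyspark --packages com.databricks:spark-csv_2.10:1.3.0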
Loading with sparkSQL & spark-csv
sqlContext.read returns a DataFrameReader
We can specify general properties & data specific options
● option(“key”, “value”)
○ spark-csv ones we will use are header & inferSchema
● format(“formatName”)
○ built-in formats include parquet, jdbc, etc.; today we will use com.databricks.spark.csv
● load(“path”)
Jess Johnson
Loading with sparkSQL & spark-csv
df = (sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .option("inferSchema", "true")
      .load("resources/adult.data"))
Jess Johnson
Let's explore training a Decision Tree
● Step 1: Data loading (done!)
● Step 2: Data prep (select features, etc.)
● Step 3: Train
● Step 4: Predict
Data prep / cleaning
● We need to predict a double (can be 0.0, 1.0, but type
must be double)
● We need to train with a vector of features
Imports:
from pyspark.mllib.linalg import Vectors
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.param import Param, Params
from pyspark.ml.feature import Bucketizer, VectorAssembler, StringIndexer
from pyspark.ml import Pipeline
Huang Yun Chung
Data prep / cleaning continued
# Combines a list of double input features into a vector
assembler = VectorAssembler(inputCols=["age", "education-num"],
                            outputCol="features")
# String indexer converts a set of strings into doubles
indexer = StringIndexer(inputCol="category",
                        outputCol="category-index")
# Can be used to combine pipeline components together
pipeline = Pipeline().setStages([assembler, indexer])
Huang Yun Chung
So a bit more about that pipeline
● Each of our previous components has a “fit” & “transform” stage
● Constructing the pipeline this way makes it easier to
work with (only need to call one fit & one transform)
● Can re-use the fitted model on future data
model=pipeline.fit(df)
prepared = model.transform(df)
Andrey
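For example, a hedged sketch of re-using the fitted prep pipeline on a held-out file (the adult.test path is an assumption based on the UCI download, not part of the repo instructions):
test_df = (sqlContext.read
           .format("com.databricks.spark.csv")
           .option("header", "true")
           .option("inferSchema", "true")
           .load("resources/adult.test"))
prepared_test = model.transform(test_df)   # same fitted stages, no re-fitting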
What does our pipeline look like so far?
(Diagram) Input Data → Assembler → Input Data + Vectors → StringIndexer → Input Data + Vectors + Cat ID
The assembler is a regular transformer - no fitting required. The StringIndexer, while not an ML learning algorithm, still needs to be fit.
Let's train a model on our prepared data:
# Specify model
dt = DecisionTreeClassifier(labelCol="category-index",
                            featuresCol="features")
# Fit it
dt_model = dt.fit(prepared)
# Or as part of the pipeline
pipeline_and_model = Pipeline().setStages([assembler, indexer, dt])
pipeline_model = pipeline_and_model.fit(df)
And predict the results on the same data:
pipeline_model.transform(df).select("prediction", "category-index").take(20)
Exercise 1:
Go from the index to something useful
● We could manually look up the labels and then write a
select statement
● Or we could look at the features on the
StringIndexerModel and use IndexToString
● Our pipeline has an array of stages we can use for this
Solution:
from pyspark.ml.feature import IndexToString
# stages[1] is the fitted StringIndexerModel; labels is a property in PySpark
labels = list(pipeline_model.stages[1].labels)
inverter = IndexToString(inputCol="prediction",
                         outputCol="prediction-label", labels=labels)
inverter.transform(pipeline_model.transform(df)).select(
    "prediction-label", "category").take(20)
# Pre Spark 1.6 use SQL if/else or similar
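Eyeballing 20 rows works, but to put a rough number on it we can use the built-in evaluator - a hedged sketch, not part of the original exercise (in Spark 1.6 the overall-accuracy metric is named "precision"):
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
evaluator = MulticlassClassificationEvaluator(labelCol="category-index",
                                              predictionCol="prediction",
                                              metricName="precision")
print(evaluator.evaluate(pipeline_model.transform(df)))  # accuracy on the training data, so optimistic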
So what could we do for other types of
data?
● org.apache.spark.ml.feature has a lot of options
○ HashingTF
○ Tokenizer
○ Word2Vec
○ etc.
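For example, a hedged sketch for text data (the "text" column is hypothetical):
from pyspark.ml.feature import Tokenizer, HashingTF
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashing_tf = HashingTF(inputCol="words", outputCol="text-features")
# Both are plain transformers, so they drop straight into a Pipeline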
Exercise 2: Add more features to your tree
● Finished quickly? Help others!
● Or tell me if adding these features helped or not
○ We could download a reserved “test” dataset, but how would we know if we couldn’t do that?
cobra libre
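One possible direction, as a hedged sketch (the extra column names assume the header we attached earlier - adjust to whatever yours are called):
workclass_indexer = StringIndexer(inputCol="workclass", outputCol="workclass-index")
wide_assembler = VectorAssembler(
    inputCols=["age", "education-num", "hours-per-week", "workclass-index"],
    outputCol="features")
wide_pipeline = Pipeline().setStages([workclass_indexer, wide_assembler, indexer, dt])
wide_model = wide_pipeline.fit(df)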
And not just for getting data into doubles...
● Maybe a customer’s cat food preference only matters if
the owns_cats boolean is true
● Maybe the scale is _way_ off
● Maybe we’ve got stop words
● Maybe we know one component has a non-linear
relation
● etc.
Cross-validation
because saving a test set is effort
● Automagically* fit your model params
● Because thinking is effort
● org.apache.spark.ml.tuning has the tools
○ (not in Python yet so skipping for now)
Jonathan Kotta
Pipeline API has many models:
● org.apache.spark.ml.classification
○ LogisticRegression (binary), DecisionTreeClassifier, GBTClassifier, etc.
● org.apache.spark.ml.regression
○ DecisionTreeRegressor, GBTRegressor, IsotonicRegression, LinearRegression, etc.
● org.apache.spark.ml.recommendation
○ ALS
carterse
Exercise 3: Train a new model type
● Your choice!
● If you want to do regression - change what we are
predicting
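A hedged sketch of one option - swapping a logistic regression into the same pipeline (this is just one possible answer, not the official solution):
from pyspark.ml.classification import LogisticRegression
lr = LogisticRegression(labelCol="category-index", featuresCol="features")
lr_pipeline = Pipeline().setStages([assembler, indexer, lr])
lr_model = lr_pipeline.fit(df)
lr_model.transform(df).select("prediction", "category-index").take(20)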
So serving...
● Generally refers to using your model online
○ Generating recommendations...
● In batch mode you can “just” save & use the Spark bits
● Spark’s “native” formats (often parquet w/ metadata)
○ Understood by Spark libraries and that’s pretty much it
○ If you are serving in the JVM you can load these, but you need Spark
dependencies (albeit often not a Spark cluster)
● Some models support PMML export
○ https://github.com/jpmml/openscoring etc.
● We can also write our own export & serving by hand :(
Ambernectar 13
So what models are PMML exportable?
● Right now “old” style models
○ KMeans, LinearRegression, RidgeRegression, Lasso, SVM, and binary LogisticRegression
○ However if we look in the code we can sometimes find converters to
the old style models and use this to export our “new” style model
● Waiting on https://issues.apache.org/jira/browse/SPARK-11171 / https://github.com/apache/spark/pull/9207 for pipeline models
● Not getting in for 2.0 :(
How to PMML export
toPMML
● returns a string or
● takes a path to local fs and saves results or
● takes a SparkContext & a distributed path and saves or
● takes a stream and writes result to stream
Optional* exercise time
● Take a model you trained and save it to PMML
○ You will have to dig around in the Spark code to be able to do this
● Look at the file
● Load it into a serving system and try some predictions
● Note: PMML export currently only includes the model -
not any transformations beforehand
● Also: you might need to train a new model
● If you don’t get it don’t worry - hints to follow :)
