Slide deck used by Praveen Devarao for the Apache Spark Machine Learning session organized by the Bangalore Spark Enthusiasts meetup group at the IBM campus on 10th September 2016.
The demo notebook can be found at https://gist.github.com/praveend/fe9a0c5eacd6b43ee210e88a374eb230
2. Agenda
• What is Machine Learning?
• The machine learning module in Spark
• SparkML pipelines
• Extraction, Selection and Tuning
• Demo
3. What is Machine Learning?
• A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E
• Field of study that gives computers the ability to learn without being explicitly programmed
4. How is it achieved?
• Build mathematical models for given tasks
• Represent the given dataset mathematically
• Apply statistical methods on this mathematical representation
• Tune and derive a model that can perform the needed task
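The four steps above can be sketched on a toy problem. This is a minimal plain-Python illustration, not SparkML code: the dataset, the function names `fit_line` and `mean_squared_error`, and the choice of ordinary least squares as the statistical method are all assumptions made for the example.

```python
# Hypothetical toy dataset, represented mathematically as (x, y) pairs:
# x = hours studied, y = exam score.
data = [(1.0, 52.0), (2.0, 54.0), (3.0, 56.0), (4.0, 58.0)]


def fit_line(points):
    """Fit the model y = slope * x + intercept by ordinary least squares."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    var_x = sum((x - mean_x) ** 2 for x, _ in points)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


def mean_squared_error(points, slope, intercept):
    """Performance measure: average squared gap between prediction and truth."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in points) / len(points)


slope, intercept = fit_line(data)            # derive the model from the data
error = mean_squared_error(data, slope, intercept)
prediction = slope * 5.0 + intercept         # use the model on a new input
```

Tuning would repeat the fit with different model choices and keep the one with the lowest error on held-out data.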
5. Categories of ML
• Supervised learning
  • The program is "trained" on a pre-defined set of "training examples", which then facilitate its ability to reach an accurate conclusion when given new data
  • The goal is to learn a general rule that maps inputs to outputs
• Unsupervised learning
  • No labels are given to the learning algorithm, leaving it on its own to find structure (patterns and relationships) in its input
  • Unsupervised learning can be a goal in itself (discovering hidden patterns in data) or a means towards an end (feature learning)
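The difference between the two categories can be seen in a small plain-Python sketch. The algorithms chosen here (a nearest-centroid classifier and a one-dimensional 2-means clustering) and all names and data are assumptions picked for brevity, not something the slides prescribe.

```python
def train_nearest_centroid(examples):
    """Supervised: learn one centroid per label from (value, label) pairs."""
    sums, counts = {}, {}
    for value, label in examples:
        sums[label] = sums.get(label, 0.0) + value
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}


def predict(centroids, value):
    """Map a new input to the label whose centroid is closest."""
    return min(centroids, key=lambda label: abs(centroids[label] - value))


def two_means(values, iterations=10):
    """Unsupervised: split unlabeled values into two groups (1-D k-means, k=2).

    Assumes the input holds at least two distinct values.
    """
    low, high = min(values), max(values)
    for _ in range(iterations):
        group_low = [v for v in values if abs(v - low) <= abs(v - high)]
        group_high = [v for v in values if abs(v - low) > abs(v - high)]
        low = sum(group_low) / len(group_low)
        high = sum(group_high) / len(group_high)
    return sorted(group_low), sorted(group_high)


# Supervised: labels are part of the training examples.
labeled = [(1.0, "small"), (2.0, "small"), (9.0, "large"), (10.0, "large")]
centroids = train_nearest_centroid(labeled)
label_for_3 = predict(centroids, 3.0)

# Unsupervised: the same values without labels; structure is discovered.
clusters = two_means([1.0, 2.0, 9.0, 10.0])
```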
7. SparkML – The Machine learning module of Spark
• APIs based on DataFrames
  • Distributed collection of data organized as columns
• Contains commonly used ML algorithms
  • Classification
  • Regression
  • Clustering
• Featurization - feature extraction, transformation, dimensionality reduction, and selection
• Pipelines - tools for constructing, evaluating, and tuning
• Persistence of models and pipelines
9. SparkML Pipelines
• Transformer: Algorithm to transform one DataFrame to another
• Estimator: Algorithm applied on a DataFrame to produce a transformer
• Parameters: Factors affecting the Estimators
• Pipeline: Chain of multiple transformers and estimators that forms the ML flow
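How the four concepts fit together can be sketched without Spark. The classes below are hypothetical stand-ins written for this illustration (a list of dicts plays the role of a DataFrame), not the SparkML API. They only mirror the contract described above: a transformer has `transform`, an estimator has `fit` and returns a transformer, and a pipeline chains both.

```python
class Lowercase:
    """Transformer: maps one dataset (list of row dicts) to another."""

    def transform(self, rows):
        return [{**row, "text": row["text"].lower()} for row in rows]


class MeanLengthThreshold:
    """Estimator: fit() learns from the data and returns a transformer."""

    def __init__(self, scale=1.0):
        self.scale = scale  # parameter that affects what fit() learns

    def fit(self, rows):
        mean_len = sum(len(row["text"]) for row in rows) / len(rows)
        return LengthModel(threshold=self.scale * mean_len)


class LengthModel:
    """Transformer produced by MeanLengthThreshold.fit()."""

    def __init__(self, threshold):
        self.threshold = threshold

    def transform(self, rows):
        return [{**row, "is_long": len(row["text"]) > self.threshold} for row in rows]


class Pipeline:
    """Chain of stages; fitting it yields a chain of transformers only."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):      # estimator: turn it into a transformer
                stage = stage.fit(rows)
            rows = stage.transform(rows)   # feed the output to the next stage
            fitted.append(stage)
        return PipelineModel(fitted)


class PipelineModel:
    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


rows = [{"text": "Spark"}, {"text": "Machine Learning"}]
model = Pipeline([Lowercase(), MeanLengthThreshold(scale=1.0)]).fit(rows)
result = model.transform(rows)
```

The same split exists in SparkML, where a `Pipeline` is itself an estimator and the fitted `PipelineModel` is a transformer.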
10. Extractors
• Algorithms to extract features from raw data
  • TermFrequency-InverseDocumentFrequency (TF-IDF)
  • Word2Vec:
    • 2 layer neural network that converts words to vectors
  • CountVectorizer:
    • Converts text documents to vectors of token counts
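The arithmetic behind token counts and TF-IDF can be sketched in plain Python. This is not the SparkML implementation: the corpus is made up, the function names are hypothetical, and the smoothed IDF form log((N + 1) / (df + 1)) is an assumption of the sketch.

```python
import math
from collections import Counter

# Hypothetical corpus: each document is already split into tokens.
docs = [["spark", "ml", "spark"], ["ml", "pipeline"], ["spark", "demo"]]


def term_counts(tokens):
    """Token counts for one document (what a count vectorizer records)."""
    return Counter(tokens)


def inverse_document_frequency(corpus):
    """Smoothed IDF per term: log((N + 1) / (df + 1)), with N documents."""
    n = len(corpus)
    doc_freq = Counter(term for doc in corpus for term in set(doc))
    return {term: math.log((n + 1) / (df + 1)) for term, df in doc_freq.items()}


def tf_idf(tokens, idf):
    """Weight each term count by how rare the term is across the corpus."""
    return {term: tf * idf[term] for term, tf in term_counts(tokens).items()}


idf = inverse_document_frequency(docs)
weights = tf_idf(docs[0], idf)
```

Terms that occur in many documents ("spark", "ml") receive a lower IDF than terms that occur in one ("pipeline", "demo"), so frequent but uninformative tokens are down-weighted.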
11. Transformers and Selectors
• Transformers:
  • Algorithms for scaling, modifying or converting features
  • Tokenizer
  • StringIndexer
  • VectorAssembler
  • PCA
• Selectors:
  • Libraries for selecting a subset of a larger set of features
  • VectorSlicer
  • RFormula
  • ChiSqSelector
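What two of the listed transformers do to a row can be sketched in plain Python. The functions below are hypothetical helpers, not the SparkML classes: `fit_string_indexer` assumes indices are assigned by descending label frequency (ties broken alphabetically), and `assemble` only handles scalar columns.

```python
from collections import Counter


def fit_string_indexer(labels):
    """Map each string label to an index; the most frequent label gets 0.0."""
    counts = Counter(labels)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(index) for index, label in enumerate(ordered)}


def assemble(row, input_cols):
    """Combine the chosen numeric columns of a row into one feature vector."""
    return [float(row[col]) for col in input_cols]


# Hypothetical column of city codes and one hypothetical row.
index_of = fit_string_indexer(["b", "a", "b", "c", "b", "a"])
row = {"age": 30, "income": 5.5, "city": "a"}
row["city_index"] = index_of[row["city"]]
features = assemble(row, ["age", "income", "city_index"])
```

Indexing turns a categorical string into a number, and assembling collects the numeric columns into the single vector column that ML algorithms expect as input.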
13. Model Evaluation Techniques
• Evaluation:
  • F1 Score
    Calculate precision and recall from the confusion matrix
    $\text{precision} = \frac{\text{True Positives}}{\text{Predicted Positives}}$, $\text{recall} = \frac{\text{True Positives}}{\text{Actual Positives}}$
  • ROC
Confusion Matrix
                  | Predicted Positive | Predicted Negative
Actual Positive   | True Positive      | False Negative
Actual Negative   | False Positive     | True Negative
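A short worked example makes the confusion-matrix formulas concrete. The counts are made up, the function name is hypothetical, and the sketch assumes at least one true positive so that no denominator is zero. F1 is computed as the harmonic mean of precision and recall.

```python
def precision_recall_f1(true_pos, false_pos, false_neg):
    """Compute precision, recall and F1 from confusion-matrix counts."""
    predicted_positives = true_pos + false_pos   # first column of the matrix
    actual_positives = true_pos + false_neg      # first row of the matrix
    precision = true_pos / predicted_positives
    recall = true_pos / actual_positives
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: 6 true positives, 2 false positives, 6 false negatives.
precision, recall, f1 = precision_recall_f1(true_pos=6, false_pos=2, false_neg=6)
```

With these counts, precision is 6 / 8 = 0.75, recall is 6 / 12 = 0.5, and F1 is 0.6. True negatives do not enter any of the three measures.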
14. SparkML Evaluators and Tuning
• Evaluators:
  • BinaryClassificationEvaluator
    • areaUnderROC & areaUnderPR
  • MulticlassClassificationEvaluator
    • f1, weightedPrecision, weightedRecall
  • RegressionEvaluator
    • MSE, RMSE
• Model Tuning and Selection:
  • CrossValidator
    • k (train, test) dataset pairs are created, one per fold
    • Trains and evaluates on every pair for each param setting
    • Expensive
  • TrainValidationSplit
    • 1 (train, test) dataset pair is created
    • Trains and evaluates each combination of the params only once
    • Less expensive than cross-validation
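The cost difference between the two tuning strategies comes down to how many (train, test) pairs each parameter combination is evaluated on. The plain-Python sketch below counts model fits. It is not the SparkML API: the helper names are hypothetical, the splits are deterministic rather than random, and `regParam` and `maxIter` are just example parameter names.

```python
from itertools import product


def k_fold_pairs(rows, k):
    """Build k (train, test) pairs; each fold is the test set exactly once."""
    folds = [rows[i::k] for i in range(k)]
    pairs = []
    for i in range(k):
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        pairs.append((train, folds[i]))
    return pairs


def train_validation_pair(rows, train_ratio=0.75):
    """Build a single (train, test) pair from one split of the data."""
    cut = int(len(rows) * train_ratio)
    return [(rows[:cut], rows[cut:])]


def param_grid(options):
    """Expand {name: [values]} into every combination of param settings."""
    names = sorted(options)
    return [dict(zip(names, combo)) for combo in product(*(options[n] for n in names))]


def count_fits(pairs, grid):
    """Every param combination is trained once per (train, test) pair."""
    return len(pairs) * len(grid)


rows = list(range(12))
grid = param_grid({"regParam": [0.01, 0.1], "maxIter": [10, 50, 100]})

cv_fits = count_fits(k_fold_pairs(rows, k=3), grid)        # 3 pairs x 6 settings
tvs_fits = count_fits(train_validation_pair(rows), grid)   # 1 pair x 6 settings
```

Cross-validation gives a more reliable estimate because every row is used for testing once, at the price of k times as many fits.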