Apache Spark Machine Learning
- Praveen Devarao
  
Agenda
•  What is Machine Learning?
•  The machine learning module in Spark
•  SparkML pipelines
•  Extraction, Selection and Tuning
•  Demo
  
What is Machine Learning?
•  A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E
•  Field of study that gives computers the ability to learn without being explicitly programmed
  
How is it achieved?
•  Build mathematical models for given tasks
    •  Represent the given dataset mathematically
    •  Apply statistical methods on this mathematical representation
    •  Tune and derive a model that can perform the needed task
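The steps above can be sketched on a toy problem. This is a minimal illustration in plain Python, not Spark code: the dataset (hours studied versus exam score) and the names `fit_line` and `predict` are hypothetical. The data is represented as two numeric lists, ordinary least squares is the statistical method applied, and the derived model is a slope and an intercept.

```python
# Hypothetical toy dataset: hours studied (feature) and exam score (label).
hours = [1.0, 2.0, 3.0, 4.0, 5.0]
scores = [52.0, 54.0, 56.0, 58.0, 60.0]


def fit_line(xs, ys):
    """Ordinary least squares for the model y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance and variance terms of the closed-form solution.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# "Training": derive the model parameters from the data.
slope, intercept = fit_line(hours, scores)


def predict(x):
    """Apply the derived model to an input it has not seen."""
    return slope * x + intercept


print(slope, intercept, predict(6.0))
```

The tuning step is left out here; with a closed-form fit there is nothing to tune, whereas iterative algorithms expose settings such as a step size or a regularization weight.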
  
Categories of ML
•  Supervised learning
    •  The program is “trained” on a pre-defined set of “training examples”, which then facilitate its ability to reach an accurate conclusion when given new data
    •  The goal is to learn a general rule that maps inputs to outputs
•  Unsupervised learning
    •  No labels are given to the learning algorithm, leaving it on its own to find structure (patterns and relationships) in its input
    •  Unsupervised learning can be a goal in itself (discovering hidden patterns in data) or a means towards an end (feature learning)
  
Categories of ML
[Figure: two panels with axes f1 and f2, titled Supervised and Unsupervised]
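The two categories can be contrasted on the same toy data. The sketch below is plain Python with hypothetical names and values: a nearest-mean classifier stands in for supervised learning (it needs the labels), and a one-dimensional k-means with k = 2 stands in for unsupervised learning (it sees only the points). It assumes the data contains at least two distinct values.

```python
points = [1.0, 1.5, 2.0, 8.0, 8.5, 9.0]
labels = ["low", "low", "low", "high", "high", "high"]  # used only by the supervised part


def train_centroids(xs, ys):
    """Supervised: learn one mean per label from labelled examples."""
    groups = {}
    for x, y in zip(xs, ys):
        groups.setdefault(y, []).append(x)
    return {label: sum(vals) / len(vals) for label, vals in groups.items()}


def classify(centroids, x):
    """Map a new input to the label whose mean is closest."""
    return min(centroids, key=lambda label: abs(centroids[label] - x))


def two_means(xs, iterations=10):
    """Unsupervised: split unlabelled points into two clusters (1-D k-means, k=2)."""
    centers = [min(xs), max(xs)]  # simple deterministic initialisation
    clusters = [[], []]
    for _ in range(iterations):
        clusters = [[], []]
        for x in xs:
            nearest = 0 if abs(x - centers[0]) <= abs(x - centers[1]) else 1
            clusters[nearest].append(x)
        centers = [sum(c) / len(c) for c in clusters]
    return clusters


model = train_centroids(points, labels)
print(classify(model, 7.0))   # label predicted for a new input
print(two_means(points))      # structure found without any labels
```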
  
SparkML – The Machine Learning module of Spark
•  APIs based on DataFrames
    •  Distributed collection of data organized as columns
•  Contains commonly used ML algorithms
    •  Classification
    •  Regression
    •  Clustering
•  Featurization - feature extraction, transformation, dimensionality reduction, and selection
•  Pipelines - tools for constructing, evaluating, and tuning
•  Persistence of models and pipelines
  
Machine Learning process
  
SparkML Pipelines
•  Transformer: Algorithm to transform one DataFrame to another
•  Estimator: Algorithm fit on a DataFrame to produce a Transformer
•  Parameters: Settings that configure Transformers and Estimators
•  Pipeline: Chain of multiple Transformers and Estimators that forms the ML flow
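The relationship between these four concepts can be sketched without Spark. The classes below are hypothetical plain-Python stand-ins that work on lists of strings instead of DataFrames; they only mirror the fit/transform contract and are not the Spark API.

```python
class Lowercase:
    """Transformer: maps one dataset to another; nothing is learned."""

    def transform(self, rows):
        return [row.lower() for row in rows]


class VocabularyModel:
    """Transformer produced by fitting VocabularyEstimator."""

    def __init__(self, vocabulary):
        self.vocabulary = vocabulary

    def transform(self, rows):
        # Replace each row by the indices of its known words.
        return [[self.vocabulary[w] for w in row.split() if w in self.vocabulary]
                for row in rows]


class VocabularyEstimator:
    """Estimator: learns from a dataset and returns a transformer."""

    def __init__(self, min_count=1):
        self.min_count = min_count  # a parameter that changes what is learned

    def fit(self, rows):
        counts = {}
        for row in rows:
            for word in row.split():
                counts[word] = counts.get(word, 0) + 1
        kept = sorted(w for w, c in counts.items() if c >= self.min_count)
        return VocabularyModel({w: i for i, w in enumerate(kept)})


class PipelineModel:
    """Fitted pipeline: a chain made only of transformers."""

    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class Pipeline:
    """Chain of stages; fit() turns every estimator into a transformer."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):
                stage = stage.fit(rows)   # estimator -> transformer
            rows = stage.transform(rows)  # feed the output to the next stage
            fitted.append(stage)
        return PipelineModel(fitted)


train = ["Spark ML", "spark pipelines", "ML pipelines"]
model = Pipeline([Lowercase(), VocabularyEstimator(min_count=2)]).fit(train)
print(model.transform(["Spark rocks", "ML Pipelines"]))
```

Fitting the pipeline once and reusing the resulting model on new data keeps training-time and prediction-time preprocessing identical, which is the main reason to chain the stages.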
  
Extractors
•  Algorithms to extract features from raw data
    •  TF-IDF (Term Frequency-Inverse Document Frequency)
    •  Word2Vec:
        •  2-layer neural network that converts words to vectors
    •  CountVectorizer:
        •  Converts documents to vectors of token counts
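As a rough illustration of the first extractor, TF-IDF weights can be computed by hand. The sketch below is plain Python on a hypothetical three-document corpus and assumes the smoothed form idf = log((N + 1) / (df + 1)), where N is the number of documents and df the number of documents containing the term; other libraries use slightly different variants.

```python
import math

# Hypothetical tokenized corpus.
docs = [
    ["spark", "ml", "spark"],
    ["ml", "pipelines"],
    ["spark", "pipelines", "tuning"],
]


def term_frequency(tokens):
    """Raw count of each token in one document (what a count vector holds)."""
    counts = {}
    for token in tokens:
        counts[token] = counts.get(token, 0) + 1
    return counts


def inverse_document_frequency(corpus):
    """Smoothed IDF per term: log((N + 1) / (df + 1))."""
    n_docs = len(corpus)
    doc_freq = {}
    for tokens in corpus:
        for token in set(tokens):  # count each document at most once
            doc_freq[token] = doc_freq.get(token, 0) + 1
    return {t: math.log((n_docs + 1) / (df + 1)) for t, df in doc_freq.items()}


def tf_idf(tokens, idf):
    """Weight each term count by how rare the term is across the corpus."""
    return {t: count * idf[t] for t, count in term_frequency(tokens).items()}


idf = inverse_document_frequency(docs)
weights = tf_idf(docs[0], idf)
print(weights)
```

A term that appears in fewer documents ("tuning" here) gets a larger IDF than one that appears in most of them, so it contributes more to the feature vector.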
  
	
  
Transformers and Selectors
•  Transformers:
    •  Algorithms for scaling, modifying or converting features
        •  Tokenizer
        •  StringIndexer
        •  VectorAssembler
        •  PCA
•  Selectors:
    •  Algorithms for selecting a subset of a larger set of features
        •  VectorSlicer
        •  RFormula
        •  ChiSqSelector
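What three of these stages do can be imitated in a few lines. The functions below are hypothetical plain-Python analogues, not the Spark classes: one indexes string categories by descending frequency (the default ordering assumed here), one assembles several columns into a single feature vector, and one slices a subset of that vector.

```python
def fit_string_indexer(values):
    """Map each distinct string to an index; the most frequent value gets 0.0."""
    counts = {}
    for value in values:
        counts[value] = counts.get(value, 0) + 1
    # Ties are broken alphabetically so the result is deterministic.
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {value: float(i) for i, value in enumerate(ordered)}


def assemble(row, columns):
    """Combine the named numeric columns of a row into one feature vector."""
    return [float(row[c]) for c in columns]


def slice_vector(vector, indices):
    """Keep only the features at the given positions."""
    return [vector[i] for i in indices]


# Hypothetical rows with one categorical and two numeric columns.
rows = [
    {"color": "red", "width": 2, "height": 3},
    {"color": "blue", "width": 1, "height": 5},
    {"color": "red", "width": 4, "height": 1},
]

index = fit_string_indexer([r["color"] for r in rows])
indexed = [dict(r, color_index=index[r["color"]]) for r in rows]
vectors = [assemble(r, ["color_index", "width", "height"]) for r in indexed]
print(vectors)
print(slice_vector(vectors[1], [1, 2]))
```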
  
Break!!	
  
Model evaluation Techniques
•  Evaluation:
    •  F1 Score
       Calculate precision and recall from the confusion matrix:
       precision = True Positives / Predicted Positives
       recall = True Positives / Actual Positives
       F1 = 2 × precision × recall / (precision + recall)
    •  ROC

Confusion Matrix
                   Predicted Positive   Predicted Negative
Actual Positive    True Positive        False Negative
Actual Negative    False Positive       True Negative
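The confusion matrix and the scores derived from it can be worked through on a small example. This is a plain-Python sketch with hypothetical labels and predictions; it assumes there is at least one predicted positive and one actual positive, otherwise the divisions are undefined.

```python
def confusion_counts(actual, predicted, positive=1):
    """Return (TP, FP, FN, TN) for a binary problem."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1
        elif p == positive:
            fp += 1  # predicted positive, actually negative
        elif a == positive:
            fn += 1  # predicted negative, actually positive
        else:
            tn += 1
    return tp, fp, fn, tn


def precision_recall_f1(tp, fp, fn):
    """Precision, recall and their harmonic mean (F1)."""
    precision = tp / (tp + fp)  # TP / predicted positives
    recall = tp / (tp + fn)     # TP / actual positives
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical ground truth and model output for ten examples.
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

tp, fp, fn, tn = confusion_counts(actual, predicted)
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
print(tp, fp, fn, tn)
print(precision, recall, f1)
```

Accuracy on this data is 7 out of 10, yet half of the actual positives are missed; precision and recall make that visible where a single accuracy figure does not.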
  
SparkML Evaluators and Tuning
•  Evaluators:
    •  BinaryClassificationEvaluator
        •  areaUnderROC & areaUnderPR
    •  MulticlassClassificationEvaluator
        •  f1, weightedPrecision, weightedRecall
    •  RegressionEvaluator
        •  MSE, RMSE
•  Model Tuning and Selection:
    •  CrossValidator
        •  k (train, test) dataset pairs are created, one per fold
        •  Trains and evaluates for different param settings
        •  Expensive
    •  TrainValidationSplit
        •  1 (train, test) dataset pair is created
        •  Evaluates each combination of the params only once
        •  Less expensive than cross-validation
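The cost difference between the two tuning strategies comes down to how many models are trained. The sketch below is plain Python with hypothetical helper names and hypothetical parameter names (`reg_param`, `max_iter`); it splits deterministically for readability, whereas a real tuner would normally shuffle the data first.

```python
from itertools import product


def k_fold_pairs(rows, k):
    """Create k (train, test) pairs; each fold serves as the test set once."""
    folds = [rows[i::k] for i in range(k)]
    pairs = []
    for i in range(k):
        test = folds[i]
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        pairs.append((train, test))
    return pairs


def train_validation_pair(rows, train_ratio=0.75):
    """Create a single (train, test) pair."""
    cut = int(len(rows) * train_ratio)
    return [(rows[:cut], rows[cut:])]


def param_grid(**options):
    """All combinations of the given parameter values."""
    names = sorted(options)
    return [dict(zip(names, combo))
            for combo in product(*(options[n] for n in names))]


def count_fits(pairs, grid):
    """Every parameter combination is trained once per (train, test) pair."""
    return len(pairs) * len(grid)


rows = list(range(12))
grid = param_grid(reg_param=[0.01, 0.1], max_iter=[10, 50, 100])

cv_fits = count_fits(k_fold_pairs(rows, 3), grid)
tvs_fits = count_fits(train_validation_pair(rows), grid)
print(len(grid), cv_fits, tvs_fits)
```

With 6 parameter combinations, 3-fold cross-validation trains 18 models while a single train/validation split trains 6; the price of the cheaper option is a less reliable estimate when the dataset is small.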
  
Demo	
  
Thank You
  
