Chapter 11: Machine Learning with MLlib
Learning Spark
by Holden Karau et al.
Overview: Machine Learning with MLlib
 System Requirements
 Machine Learning Basics
 Data Types
 Algorithms
 Feature Extraction
 Statistics
 Classification and Regression
 Clustering
 Collaborative Filtering and Recommendation
 Dimensionality Reduction
 Model Evaluation
 Tips and Performance Considerations
 Pipeline API
 Conclusion
11.1 Overview
 MLlib’s design and philosophy are simple: it lets you
invoke various algorithms on distributed datasets,
representing all data as RDDs.
 It contains only parallel algorithms that run well on
clusters.
 In Spark 1.0 and 1.1, MLlib’s interface is relatively
low-level.
 In Spark 1.2, MLlib gains an additional pipeline API
for building machine learning pipelines.
11.2 System Requirements
 MLlib requires some linear algebra libraries to be
installed on your machines:
 the gfortran runtime library
 NumPy, if you want to use MLlib in Python
 install the python-numpy or numpy package through your package manager
on Linux
 or use a third-party scientific Python installation like
Anaconda.
edX and Coursera Courses
 Introduction to Big Data with Apache Spark
 Spark Fundamentals I
 Functional Programming Principles in Scala
11.3 Machine Learning Basics
 Machine learning algorithms attempt to make
predictions or decisions based on training data.
 All learning algorithms require defining a set of features
for each item
 Most algorithms are defined only for numerical features
 specifically, a vector of numbers representing the value for each
feature
 Once data is represented as feature vectors, most
machine learning algorithms optimize a well-defined
mathematical function based on these vectors
 Finally, most learning algorithms have multiple
parameters that can affect results
11.4 Data Types
 MLlib contains a few specific data types, located in:
 org.apache.spark.mllib package (Java/Scala)
 pyspark.mllib (Python).
 The main ones are:
 Vector
 LabeledPoint
 Rating
 Various Model classes
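To make the shape of these types concrete, here is a minimal plain-Python sketch. The namedtuples below are stand-ins invented for illustration and are not the MLlib classes; only the field names (label and features, user/product/rating, and a sparse vector given as size, indices, values) follow the MLlib types.

```python
from collections import namedtuple

# Plain-Python stand-ins (assumptions for illustration, not the MLlib classes).
LabeledPoint = namedtuple("LabeledPoint", ["label", "features"])
Rating = namedtuple("Rating", ["user", "product", "rating"])
SparseVector = namedtuple("SparseVector", ["size", "indices", "values"])


def to_dense(vec):
    """Expand a SparseVector stand-in into a plain list of floats."""
    dense = [0.0] * vec.size
    for i, v in zip(vec.indices, vec.values):
        dense[i] = float(v)
    return dense


# A labeled example whose feature vector has two nonzero entries out of five.
point = LabeledPoint(label=1.0, features=SparseVector(5, [1, 3], [2.0, 7.0]))
# One user/product interaction, the input unit for recommendation.
rating = Rating(user=42, product=7, rating=4.5)
```

A sparse representation stores only the nonzero entries, which matters when most features are zero.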
11.5 Algorithms
 Feature Extraction
 mllib.feature package
 Statistics
 mllib.stat.Statistics class
 Classification and Regression
 use the LabeledPoint class (which resides in the mllib.regression package)
 Clustering
 K-means, as well as a variant called K-means||
 Collaborative Filtering and Recommendation
 mllib.recommendation.ALS class
 Dimensionality Reduction
 Model Evaluation
11.5.1 Feature Extraction
 TF-IDF (Term Frequency–Inverse Document Frequency)
 computes two statistics for each term in each document:
 the term frequency (TF)
 the inverse document frequency (IDF)
 MLlib has two algorithms that compute TF-IDF: HashingTF and IDF
 Scaling
 Normalization
 Word2Vec
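The TF-IDF computation can be sketched in plain Python. This is an illustration of the idea, not MLlib's implementation: the function names are hypothetical, crc32 is a stand-in hash chosen so results are reproducible, and the smoothed form log((m + 1) / (df + 1)) is an assumption about the IDF formula.

```python
import math
import zlib


def hashing_tf(tokens, num_features=1000):
    """Term-frequency vector: each token is hashed to a bucket and counted."""
    vec = [0.0] * num_features
    for tok in tokens:
        # crc32 is a stand-in hash chosen for reproducibility (an assumption).
        vec[zlib.crc32(tok.encode("utf-8")) % num_features] += 1.0
    return vec


def idf_weights(tf_vectors):
    """Smoothed IDF per bucket: log((m + 1) / (df + 1)) for m documents."""
    m = len(tf_vectors)
    num_features = len(tf_vectors[0])
    weights = []
    for j in range(num_features):
        # Document frequency: number of documents with a nonzero count in bucket j.
        df = sum(1 for vec in tf_vectors if vec[j] > 0)
        weights.append(math.log((m + 1) / (df + 1)))
    return weights


def tf_idf(docs, num_features=1000):
    """Tokenize on whitespace, hash to TF vectors, then weight by IDF."""
    tfs = [hashing_tf(doc.lower().split(), num_features) for doc in docs]
    idf = idf_weights(tfs)
    return [[tf * w for tf, w in zip(vec, idf)] for vec in tfs]


docs = ["spark runs fast", "spark mllib scales", "spark streams data"]
vectors = tf_idf(docs)
```

A term that occurs in every document, such as "spark" here, gets an IDF of zero and therefore contributes nothing to any vector.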
11.5.2 Statistics
 MLlib offers several widely used statistics functions
that work directly on RDDs
 Statistics.colStats(rdd)
 Statistics.corr(rdd, method)
 Statistics.corr(rdd1, rdd2, method)
 Statistics.chiSqTest(rdd)
 Apart from these methods, RDDs containing
numeric data offer several basic statistics such as
mean(), stdev(), and sum()
 RDDs support sample() and sampleByKey() to build
simple and stratified samples of data.
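As a rough guide to what column statistics and a Pearson correlation compute, here is a plain-Python sketch on a small in-memory dataset. It does not call MLlib; col_stats and pearson are hypothetical helper names, and the use of the sample (n - 1) variance is an assumption.

```python
import math


def col_stats(rows):
    """Per-column mean, min, max and sample variance of equal-length rows."""
    n = len(rows)
    stats = []
    for col in zip(*rows):
        mean = sum(col) / n
        var = sum((x - mean) ** 2 for x in col) / (n - 1) if n > 1 else 0.0
        stats.append({"mean": mean, "min": min(col), "max": max(col), "variance": var})
    return stats


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


rows = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
summary = col_stats(rows)
r = pearson([row[0] for row in rows], [row[1] for row in rows])
```

On an RDD the same quantities are computed in a distributed pass over the data rather than on a local list.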
11.5.3 Classification and Regression
 Classification and regression are two common forms
of supervised learning
 The difference between them:
 in classification, the predicted variable is discrete
 in regression, the predicted variable is continuous
 MLlib includes a variety of methods:
 Linear regression
 Logistic regression
 Support Vector Machines
 Naive Bayes
 Decision trees and random forests
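The difference between a continuous and a discrete prediction can be seen in a small logistic regression sketch: the model outputs a continuous probability, and thresholding it yields a discrete class. This is a plain-Python illustration with batch gradient descent on one feature; the function names, learning rate, and iteration count are assumptions, and MLlib's own optimizers are not shown.

```python
import math


def train_logistic(xs, ys, lr=0.5, iterations=200):
    """Fit p(y=1|x) = sigmoid(w*x + b) on 1-D inputs by batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(iterations):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))
            grad_w += (p - y) * x
            grad_b += (p - y)
        # Step against the average gradient of the log loss.
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


def predict_probability(model, x):
    """Continuous output in (0, 1)."""
    w, b = model
    return 1.0 / (1.0 + math.exp(-(w * x + b)))


def predict_class(model, x, threshold=0.5):
    """Discrete output: 1 if the probability reaches the threshold, else 0."""
    return 1 if predict_probability(model, x) >= threshold else 0


model = train_logistic([-2.0, -1.0, 1.0, 2.0], [0, 0, 1, 1])
```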
11.5.4 Clustering
 Clustering is the unsupervised learning task that
involves grouping objects into clusters of high
similarity
 MLlib includes the popular K-means algorithm for
clustering, as well as a variant called K-means|| that
provides better initialization in parallel environments
 K-means|| is similar to the K-means++ initialization
procedure often used in single-node settings.
 To invoke K-means:
 create a mllib.clustering.KMeans object (in Java/Scala)
 or call KMeans.train (in Python).
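For readers who have not seen K-means itself, the following is a minimal single-machine sketch of the basic iteration (Lloyd's algorithm). It is not MLlib code, and it deliberately uses a naive initialization (the first k points) instead of K-means|| or K-means++; the function names are hypothetical.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, max_iterations=20):
    """Lloyd's algorithm; the first k points serve as naive initial centers."""
    centers = [tuple(p) for p in points[:k]]
    for _ in range(max_iterations):
        # Assignment step: attach every point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: squared_distance(p, centers[c]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its members.
        new_centers = []
        for c, members in enumerate(clusters):
            if members:
                new_centers.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                new_centers.append(centers[c])  # keep an empty cluster's center
        if new_centers == centers:
            break
        centers = new_centers
    return centers


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centers = kmeans(points, k=2)
```

With a poor starting choice the algorithm can need extra iterations or settle on a worse clustering, which is why better initialization schemes exist.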
11.5.5 Collaborative Filtering and
Recommendation
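 Collaborative filtering:
 is a technique for recommender systems in which users' ratings of
and interactions with products are used to recommend new ones
 is attractive because it needs only a list of user/product
interactions as input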
 MLlib includes an implementation of Alternating
Least Squares (ALS)
 It is located in the mllib.recommendation.ALS class.
 To use ALS, you need to give it an RDD of
mllib.recommendation.Rating objects
 there are two variants of ALS: for explicit ratings (the default)
and for implicit ratings
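The alternating idea behind ALS can be sketched with a single latent factor per user and per item: fix the item factors and solve a least-squares problem for each user, then swap roles. This is a plain-Python illustration for explicit ratings only, not MLlib's implementation; the function name, the (user, item, rating) tuples, and the parameters reg and iterations are assumptions.

```python
def als_rank1(ratings, num_users, num_items, reg=0.01, iterations=100):
    """Alternate closed-form least-squares updates for one latent factor."""
    u = [1.0] * num_users
    v = [1.0] * num_items
    for _ in range(iterations):
        # Fix item factors, solve for each user factor.
        for user in range(num_users):
            num = sum(r * v[itm] for (usr, itm, r) in ratings if usr == user)
            den = sum(v[itm] ** 2 for (usr, itm, r) in ratings if usr == user) + reg
            if den > 0:
                u[user] = num / den
        # Fix user factors, solve for each item factor.
        for item in range(num_items):
            num = sum(r * u[usr] for (usr, itm, r) in ratings if itm == item)
            den = sum(u[usr] ** 2 for (usr, itm, r) in ratings if itm == item) + reg
            if den > 0:
                v[item] = num / den
    return u, v


# Ratings drawn from the rank-1 table [[1, 2, 3], [2, 4, 6]];
# the entry for user 1 and item 2 is held out.
ratings = [(0, 0, 1.0), (0, 1, 2.0), (0, 2, 3.0), (1, 0, 2.0), (1, 1, 4.0)]
u, v = als_rank1(ratings, num_users=2, num_items=3, reg=0.0)
predicted = u[1] * v[2]  # estimate for the held-out rating
```

A real model uses several latent factors and a nonzero regularization term; the one-factor case is shown only because each update reduces to a single division.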
11.5.6 Dimensionality Reduction
 Principal component analysis (PCA)
 the mapping to the lower-dimensional space is done such that the
variance of the data in the lower-dimensional representation is
maximized.
 PCA is currently available only in Java and Scala (as of MLlib 1.2).
 Singular value decomposition (SVD)
 The SVD factorizes an m × n matrix A into three matrices, A ≈ UΣVᵀ:
 U is an orthonormal matrix, whose columns are called left singular
vectors.
 Σ is a diagonal matrix with nonnegative diagonal entries in descending
order; these entries are called singular values.
 V is an orthonormal matrix, whose columns are called right singular
vectors.
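The "maximize the variance" idea behind PCA can be illustrated by finding the single direction of largest variance with power iteration. This is a plain-Python sketch, not MLlib's method; the function names are hypothetical, and the fixed starting vector is an arbitrary choice that would fail if it happened to be orthogonal to the answer.

```python
import math


def top_principal_component(rows, iterations=100):
    """Direction of maximum variance via power iteration on the scatter matrix."""
    n, d = len(rows), len(rows[0])
    means = [sum(col) / n for col in zip(*rows)]
    centered = [[x - m for x, m in zip(row, means)] for row in rows]
    # Scatter matrix S = X^T X of the centered data (d x d).
    scatter = [[sum(r[i] * r[j] for r in centered) for j in range(d)] for i in range(d)]
    vec = [1.0] + [0.0] * (d - 1)  # deterministic start (an arbitrary choice)
    for _ in range(iterations):
        nxt = [sum(scatter[i][j] * vec[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0.0:
            break
        vec = [x / norm for x in nxt]
    return vec


def project(rows, component):
    """One-dimensional representation: centered rows projected onto the component."""
    means = [sum(col) / len(rows) for col in zip(*rows)]
    return [sum((x - m) * c for x, m, c in zip(row, means, component)) for row in rows]


data = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
pc = top_principal_component(data)
scores = project(data, pc)
```

Here the points lie on the line y = x, so the recovered direction has equal components and the projection keeps all of the variance.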
11.5.7 Model Evaluation
 In Spark 1.2, MLlib contains an experimental set of
model evaluation functions, though only in Java and
Scala.
 In future versions of Spark, the pipeline API is
expected to include evaluation functions in all
languages.
11.6 Tips and Performance Considerations
 Preparing Features
 Scale your input features.
 Featurize text correctly.
 Label classes correctly.
 Configuring Algorithms
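 Caching RDDs to Reuse
 most MLlib algorithms are iterative, so cache() your input datasets
before passing them to MLlib
 if the data does not fit in memory, try persist(StorageLevel.DISK_ONLY).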
 Recognizing Sparsity
 Level of Parallelism
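The first tip, scaling input features, can be sketched as follows. This plain-Python version rescales each column to zero mean and unit standard deviation; the function name is hypothetical and the use of the population standard deviation (dividing by n) is an assumption, so results may differ slightly from a library scaler.

```python
import math


def standard_scale(rows):
    """Rescale each column to zero mean and unit (population) standard deviation."""
    n = len(rows)
    columns = list(zip(*rows))
    means = [sum(col) / n for col in columns]
    stds = [math.sqrt(sum((x - m) ** 2 for x in col) / n) for col, m in zip(columns, means)]
    # Constant columns (std of 0) are mapped to 0.0 to avoid division by zero.
    return [
        [(x - m) / s if s > 0 else 0.0 for x, m, s in zip(row, means, stds)]
        for row in rows
    ]


raw = [[1.0, 1000.0], [2.0, 2000.0], [3.0, 3000.0]]
scaled = standard_scale(raw)
```

After scaling, the two columns carry equal weight even though the raw values differ by three orders of magnitude.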
11.7 Pipeline API
 Starting in Spark 1.2, MLlib includes a pipeline API
similar to the one in scikit-learn.
 A pipeline is a series of algorithms (either feature
transformation or model fitting) that transform a dataset.
 Each stage of the pipeline may have parameters.
 The pipeline API uses a uniform representation of
datasets throughout: SchemaRDDs from
Spark SQL.
 The pipeline API is still experimental at the time of
writing
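The pipeline concept (a series of stages, some that only transform data and some that must be fitted first) can be sketched in a few lines of plain Python. All class names below are hypothetical and the data is a list of strings rather than a SchemaRDD; this shows the pattern, not Spark's API.

```python
class Tokenizer:
    """Transformer stage: needs no fitting."""

    def transform(self, rows):
        return [row.lower().split() for row in rows]


class CountModel:
    """Fitted stage produced by VocabularyCounter."""

    def __init__(self, vocab):
        self.vocab = vocab

    def transform(self, rows):
        return [[tokens.count(term) for term in self.vocab] for tokens in rows]


class VocabularyCounter:
    """Estimator stage: learns a vocabulary, then yields a CountModel."""

    def fit(self, rows):
        return CountModel(sorted({tok for tokens in rows for tok in tokens}))


class PipelineModel:
    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class Pipeline:
    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            # Estimators are fitted on the data as transformed so far.
            model = stage.fit(rows) if hasattr(stage, "fit") else stage
            rows = model.transform(rows)
            fitted.append(model)
        return PipelineModel(fitted)


model = Pipeline([Tokenizer(), VocabularyCounter()]).fit(["Spark is fast", "spark spark"])
features = model.transform(["Spark is Spark"])
```

Fitting the pipeline once and reusing the resulting model guarantees that new data passes through exactly the same stages as the training data.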
11.8 Conclusion
 MLlib ties directly to Spark’s other APIs,
letting you work on RDDs and get back results you
can use in other Spark functions.
 MLlib is one of the most actively developed parts of
Spark, so it is still evolving.


Editor's notes

  1. Spark is a “computational engine” that is responsible for scheduling, distributing, and monitoring applications consisting of many computational tasks across many worker machines, or a computing cluster. First, all libraries and higher-level components in the stack benefit from improvements at the lower layers. Second, the costs associated with running the stack are minimized, because instead of running 5–10 independent software systems, an organization needs to run only one. Finally, one of the largest advantages of tight integration is the ability to build applications that seamlessly combine different processing models.