Apache Spark
MLlib and Machine Learning on Spark
Petr Zapletal, Cake Solutions
Apache Spark and Big Data
1) History and market overview
2) Installation
3) MLlib and Machine Learning on Spark
4) Porting R code to Scala and Spark
5) Concepts - Core, SQL, GraphX, Streaming
6) Spark’s distributed programming model
7) Deployment
Table of contents
● Machine Learning Introduction
● Spark ML Support - MLlib
● Machine Learning Techniques
● Tips & Considerations
● ML Pipelines
● Q & A
Machine Learning
● Subfield of Artificial Intelligence (AI)
● Construction & study of systems that can learn from data
● Computers act without being explicitly programmed
● Can be seen as building blocks to make computers behave more intelligently
Terminology
● Features
o each item is described by a number of features
● Samples
o sample is an item to process
o document, picture, row in db, graph, ...
● Feature vector
o n-dimensional vector of numerical features representing some sample
● Labelled data
o data with known classification results
Categories
● Supervised learning
o labelled data are available
● Unsupervised learning
o no labelled data are available
● Semi-supervised learning
o mix of supervised and unsupervised learning
o usually only a small part of the data is labelled
● Reinforcement learning
o model continuously learns and relearns based on its actions and the effects/rewards of those actions
o reward feedback
Applications
● Speech recognition
● Effective web search
● Recommendation systems
● Computer vision
● Information retrieval
● Spam filtering
● Computational finance
● Fraud detection
● Medical diagnosis
● Stock market analysis
● Structural health monitoring
● ...
MLlib Introduction
● Spark’s scalable machine learning library
● Common learning algorithms and utilities
Benefits of MLlib
● Part of Spark
● Integrated workflow
● Scala, Java & Python API
● Broad coverage of applications & algorithms
● Rapid improvements in speed & robustness
● Ongoing development & Large community
● Easy to use, well documented
Typical Steps in ML Pipeline
Supported Algorithms
Data Types
● Vector
o both dense and sparse vectors
● LabeledPoint
o labelled data point for supervised learning
● Rating
o rating of a product by a user, used for recommendation
● Various Models
o result of a training algorithm
o used for predicting unknown data
● Matrices
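To make the difference between dense and sparse vectors concrete, here is a minimal sketch in plain Python. It does not use the MLlib API: `SparseVector`, `LabeledPoint` and `dot` below are hypothetical stand-ins written for illustration, and only loosely mirror the data types listed above.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class SparseVector:
    """Hypothetical stand-in: stores only the non-zero entries of a vector."""
    size: int
    values: Dict[int, float]  # index -> non-zero value

    def to_dense(self) -> List[float]:
        return [self.values.get(i, 0.0) for i in range(self.size)]


@dataclass(frozen=True)
class LabeledPoint:
    """Hypothetical stand-in: a label paired with a dense feature vector."""
    label: float
    features: List[float]


def dot(sparse: SparseVector, dense: List[float]) -> float:
    """Dot product that touches only the non-zero entries."""
    return sum(v * dense[i] for i, v in sparse.values.items())


sv = SparseVector(size=5, values={1: 2.0, 4: 3.0})
point = LabeledPoint(label=1.0, features=sv.to_dense())
```

The sparse form stores two numbers instead of five here; the saving grows with the share of zero entries.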
Feature Extraction & Basic Statistics
● Several classes for common operations
● Scaling, normalization, statistical summary, correlation, …
● Numeric RDD operations, sampling, …
● Random generators
● Word extraction (TF-IDF)
o generating feature vectors from text documents/web pages
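As an illustration of how TF-IDF turns text into feature weights, the sketch below computes it in plain Python. It is not the MLlib implementation: the function name `tf_idf` is hypothetical, and the weighting (raw counts times log(N / df)) is one common textbook variant chosen here as an assumption.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_idf(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Return one {term: weight} map per tokenized document.

    Uses raw term counts for TF and log(N / df) for IDF; other
    weighting variants exist and libraries may smooth these terms.
    """
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights


docs = [["spark", "mllib", "spark"], ["spark", "streaming"]]
weights = tf_idf(docs)
```

A term that appears in every document ("spark" here) gets weight zero, while terms specific to one document keep a positive weight.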
Classification
● Classify samples into predefined categories
● Supervised learning
● Binary classification (SVMs, logistic regression)
● Multiclass Classification (decision trees, naive Bayes)
● Spam vs. non-spam, fruit vs. logo, ...
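A binary classifier such as logistic regression can be sketched in a few lines of plain Python. This is an illustrative toy, not MLlib code: the names `train_logistic` and `predict`, the single "suspicious words" feature and all hyperparameters are assumptions made for the example.

```python
import math
from typing import List, Tuple


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(data: List[Tuple[int, List[float]]], epochs: int = 500,
                   step_size: float = 0.1) -> Tuple[List[float], float]:
    """Fit a binary logistic-regression model with batch gradient descent."""
    weights = [0.0] * len(data[0][1])
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for label, features in data:
            # Difference between predicted probability and true label.
            error = sigmoid(bias + sum(w * x for w, x in zip(weights, features))) - label
            grad_w = [g + error * x for g, x in zip(grad_w, features)]
            grad_b += error
        weights = [w - step_size * g / len(data) for w, g in zip(weights, grad_w)]
        bias -= step_size * grad_b / len(data)
    return weights, bias


def predict(weights: List[float], bias: float, features: List[float]) -> int:
    """Return 1 (spam) if the predicted probability is at least 0.5, else 0."""
    return 1 if sigmoid(bias + sum(w * x for w, x in zip(weights, features))) >= 0.5 else 0


# Toy feature: fraction of suspicious words in a message; label 1 = spam.
training = [(0, [0.0]), (0, [0.1]), (0, [0.2]), (1, [0.8]), (1, [0.9]), (1, [1.0])]
weights, bias = train_logistic(training)
```

Calling `predict(weights, bias, [0.9])` classifies a message with mostly suspicious words as spam, and `predict(weights, bias, [0.1])` classifies a mostly clean one as non-spam.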
Regression
● Predict value from observations, many techniques
● Predicted values are continuous
● Supervised learning
● Linear least squares, Lasso, ridge regression, decision trees
● House prices, stock exchange, power consumption, height of person, ...
Linear Regression Example
● The run method trains the model
● Parameters are set with the setters setNumIterations and setIntercept
● Stochastic Gradient Descent (SGD) is used to minimize the loss function
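The code shown on the original slide is not part of this transcript. As a stand-in, the following plain-Python sketch shows the idea of fitting a linear model with SGD. It is not the MLlib API: `train_linear_sgd` and its parameters `num_iterations` and `step_size` are hypothetical names chosen to echo the setters mentioned above.

```python
import random
from typing import List, Tuple


def train_linear_sgd(
    data: List[Tuple[float, List[float]]],  # (label, features) pairs
    num_iterations: int = 200,
    step_size: float = 0.05,
    seed: int = 42,
) -> Tuple[List[float], float]:
    """Fit weights and an intercept by minimizing squared error with SGD."""
    rng = random.Random(seed)
    n_features = len(data[0][1])
    weights = [0.0] * n_features
    intercept = 0.0
    for _ in range(num_iterations):
        # One pass over the samples in random order.
        for label, features in rng.sample(data, len(data)):
            prediction = intercept + sum(w * x for w, x in zip(weights, features))
            error = prediction - label
            # Gradient of 0.5 * error^2 with respect to each parameter.
            weights = [w - step_size * error * x for w, x in zip(weights, features)]
            intercept -= step_size * error
    return weights, intercept


# Noise-free samples of y = 2x + 1.
samples = [(2.0 * x + 1.0, [x]) for x in [0.0, 0.5, 1.0, 1.5, 2.0]]
weights, intercept = train_linear_sgd(samples)
```

On this noise-free data the fitted weight and intercept approach 2 and 1; with real data, the step size and iteration count usually need tuning.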
Clustering
● Grouping objects into groups (~ clusters) of high similarity
● Unsupervised learning -> groups are not predefined
● Number of clusters must be defined
● K-means, Gaussian Mixture Model (EM algorithm), Power Iteration Clustering (PIC), Latent Dirichlet Allocation (LDA)
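The k-means idea can be sketched as follows, in plain Python on 2-D points. This is a minimal illustration of Lloyd's algorithm, not the MLlib implementation; `k_means`, `squared_distance` and the choice of initial centers are assumptions made for the example.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def squared_distance(a: Point, b: Point) -> float:
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2


def k_means(points: Sequence[Point], centers: List[Point],
            max_iterations: int = 20) -> List[Point]:
    """Lloyd's algorithm: alternate assignment and center update."""
    for _ in range(max_iterations):
        # Assignment step: attach every point to its nearest center.
        clusters: List[List[Point]] = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: squared_distance(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its previous center).
        new_centers = [
            (sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c)) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        if new_centers == centers:
            break  # converged
        centers = new_centers
    return centers


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
# k = 2 is chosen up front; the initial centers are simply two of the points.
centers = k_means(points, centers=[points[0], points[3]])
```

The number of clusters is fixed by the length of the initial `centers` list, which reflects the requirement above that it be defined before training.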
Collaborative Filtering
● Used for recommender systems
● Creates and analyses matrix of ratings, predicts missing entries
● Explicit (given rating) vs implicit (views, clicks, likes, shares, ...) feedback
● Alternating least squares (ALS)
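To show what "alternating" means in ALS, here is a deliberately tiny sketch in plain Python with a single latent factor per user and item. It is not the MLlib implementation; `als_rank_one`, the regularization value and the toy ratings are all assumptions for illustration.

```python
from typing import Dict, List, Tuple

Ratings = Dict[Tuple[int, int], float]  # (user, item) -> observed rating


def als_rank_one(ratings: Ratings, n_users: int, n_items: int,
                 iterations: int = 50, reg: float = 0.001) -> Tuple[List[float], List[float]]:
    """Alternate closed-form least-squares updates of user and item factors."""
    user_f = [1.0] * n_users
    item_f = [1.0] * n_items
    for _ in range(iterations):
        # Fix the item factors and solve for each user factor.
        for u in range(n_users):
            seen = [(i, r) for (uu, i), r in ratings.items() if uu == u]
            user_f[u] = (sum(r * item_f[i] for i, r in seen)
                         / (reg + sum(item_f[i] ** 2 for i, _ in seen)))
        # Fix the user factors and solve for each item factor.
        for i in range(n_items):
            seen = [(u, r) for (u, ii), r in ratings.items() if ii == i]
            item_f[i] = (sum(r * user_f[u] for u, r in seen)
                         / (reg + sum(user_f[u] ** 2 for u, _ in seen)))
    return user_f, item_f


# Ratings follow user_scale * item_scale; the (1, 1) entry is held out.
observed = {(0, 0): 2.0, (0, 1): 4.0, (1, 0): 3.0}
user_f, item_f = als_rank_one(observed, n_users=2, n_items=2)
predicted = user_f[1] * item_f[1]  # estimate for the missing entry
```

Because the three observed ratings are consistent with a rank-one matrix, the estimate for the missing entry lands close to 6; real rating matrices need more factors and careful regularization.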
Dimensionality Reduction
● Process of reducing number of variables under consideration
● Performance needs, removing non-informative dimensions, plotting, ...
● Principal Component Analysis (PCA) - ignoring non-informative dimensions
● Singular Value Decomposition (SVD)
o factorizes matrix into 3 descriptive matrices
o storage savings, noise reduction
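The PCA idea of keeping only the most informative direction can be sketched in plain Python. The example finds the first principal component by power iteration on the covariance matrix; this is an illustrative technique chosen for brevity, not how MLlib computes PCA, and `first_principal_component` is a hypothetical name.

```python
import math
from typing import List


def first_principal_component(rows: List[List[float]], iterations: int = 100) -> List[float]:
    """Return the unit direction of greatest variance via power iteration."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    vec = [1.0] * d
    for _ in range(iterations):
        # Multiply by the covariance matrix, then renormalize to unit length.
        vec = [sum(cov[a][b] * vec[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in vec))
        vec = [x / norm for x in vec]
    return vec


# Points spread along the line y = x, with a little spread across it.
data = [[1.0, 1.1], [2.0, 1.9], [3.0, 3.2], [4.0, 3.9], [5.0, 5.1]]
component = first_principal_component(data)
projected = [sum(x * c for x, c in zip(row, component)) for row in data]  # 2-D -> 1-D
```

Each 2-D row is reduced to a single coordinate along the dominant direction, which here is close to the diagonal.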
Tips
● Preparing features
o each algorithm is only as good as input features
o probably the most important step in ML
o correct scaling, labeling for each algorithm
● Algorithm configuration
o performance greatly varies according to params
● Caching RDD for reuse
o most of the algorithms are iterative
o input dataset should be cached (cache() method) before passing it into an MLlib algorithm
● Recognizing sparsity
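As one example of feature preparation, the sketch below standardizes each column so that features on very different scales become comparable. It is plain Python rather than MLlib code; `standardize` is a hypothetical helper, and the use of the population standard deviation is an assumption.

```python
import math
from typing import List


def standardize(rows: List[List[float]]) -> List[List[float]]:
    """Rescale every column to zero mean and unit (population) standard deviation."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    stds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / n) for j in range(d)]
    # Constant columns carry no information; map them to 0 instead of dividing by 0.
    return [[(r[j] - means[j]) / stds[j] if stds[j] > 0 else 0.0 for j in range(d)]
            for r in rows]


# Column 0 is in metres, column 1 in grams: very different scales.
raw = [[1.0, 1000.0], [2.0, 3000.0], [3.0, 5000.0]]
scaled = standardize(raw)
```

After scaling, both columns take the same values, so neither dominates a distance or gradient computation merely because of its unit.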
Overfitting
● Model is overtrained to the training data
● Model describes random errors or noise instead of the underlying relationship
● Results in poor predictive performance on unseen data
Data Partitioning
● Supervised learning
● Partitioning labelled data
● Labelled data
o Training set
▪ set of samples used for learning
▪ experiments with algorithm parameters
o Test set
▪ testing the fitted model
▪ must not be used to tune the model any further
● Common separation - 70/30
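A 70/30 partition of labelled data can be sketched as below in plain Python. The helper name `train_test_split` and the fixed seed are assumptions for the example; the point is only that the data is shuffled once and the two parts never overlap.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def train_test_split(samples: Sequence[T], train_fraction: float = 0.7,
                     seed: int = 42) -> Tuple[List[T], List[T]]:
    """Shuffle a copy of the samples, then cut it at the requested fraction."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


labelled = [(float(i % 2), [float(i)]) for i in range(10)]  # (label, features)
train, test = train_test_split(labelled)
```

Spark's RDD API offers `randomSplit` for the same purpose on distributed data; its usage is not shown in the slides, so it is only mentioned here as a pointer.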
Performance
● 10-100x faster than Hadoop & Mahout
Steady Performance Gains
ML Pipelines
Pipeline API
● Pipeline is a series of algorithms (feature transformation, model fitting, ...)
● Easy workflow construction
● Distribution of parameters into each stage
● MLlib is easier to use
● Uses uniform dataset representation - SchemaRDD from Spark SQL (renamed DataFrame in Spark 1.3)
o multiple named columns (similar to SQL table)
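The pipeline idea, a series of stages that each read and add named columns, can be sketched in plain Python. This is not the Spark Pipeline API: the `Pipeline` class and the `tokenize` and `count_words` stages are hypothetical, and the sketch covers only transformation stages, not model fitting.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]  # one record with named columns


class Pipeline:
    """Hypothetical stand-in: applies column-to-column stages in order."""

    def __init__(self, stages: List[Callable[[Row], Row]]):
        self.stages = stages

    def transform(self, rows: List[Row]) -> List[Row]:
        for stage in self.stages:
            rows = [stage(row) for row in rows]
        return rows


def tokenize(row: Row) -> Row:
    # Adds a "words" column derived from the "text" column.
    return {**row, "words": str(row["text"]).lower().split()}


def count_words(row: Row) -> Row:
    # Adds an "n_words" column derived from the "words" column.
    return {**row, "n_words": len(row["words"])}


pipeline = Pipeline([tokenize, count_words])
result = pipeline.transform([{"text": "Spark MLlib pipelines"}])
```

Because every stage works on the same row-with-named-columns representation, stages can be reordered or swapped without changing the surrounding code.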
Demo
Conclusion
● What is Machine Learning
● Machine Learning Use Cases & Techniques
● Spark’s Machine Learning library - MLlib
● Tips for using MLlib and Spark
Questions

MLlib and Machine Learning on Spark

  • 1. Apache Spark MLlib and Machine Learning on Spark Petr Zapletal Cake Solutions
  • 2. Apache Spark and Big Data 1) History and market overview 2) Installation 3) MLlib and Machine Learning on Spark 4) Porting R code to Scala and Spark 5) Concepts - Core, SQL, GraphX, Streaming 6) Spark’s distributed programming model 7) Deployment
  • 3. Table of contents ● Machine Learning Introduction ● Spark ML Support - MLlib ● Machine Learning Techniques ● Tips & Considerations ● ML Pipelines ● Q & A
  • 4. Machine Learning ● Subfield of Artificial Intelligence (AI) ● Construction & Study of systems that can learn from data ● Computers act without being explicitly programmed ● Can be seen as building blocks to make computers behave more intelligently
  • 6. Terminology ● Features o each item is described by number of features ● Samples o sample is an item to process o document, picture, row in db, graph, ... ● Feature vector o n-dimensional vector of numerical features representing some sample ● Labelled data o data with known classification results
  • 8. Categories ● Supervised learning o labelled data are available ● Unsupervised learning o No labelled data is available ● Semi-supervised learning o mix of Supervised and Unsupervised learning o usually small part of data is labelled ● Reinforcement learning o model is continuously learn and relearn based on the actions and the effects/rewards from that actions. o reward feedback
  • 9. Applications ● Speech recognition ● Effective web search ● Recommendation systems ● Computer vision ● Information retrieval ● Spam filtering ● Computational finance ● Fraud detection ● Medical diagnosis ● Stock market analysis ● Structural health monitoring ● ...
  • 10. MLlib Introduction ● Spark’s scalable machine learning library ● Common learning algorithms and utilities
  • 11. Benefits of MLlib ● Part of Spark ● Integrated workflow ● Scala, Java & Python API ● Broad coverage of applications & algorithms ● Rapid improvements in speed & robustness ● Ongoing development & Large community ● Easy to use, well documented
  • 12. Typical Steps in ML Pipeline
  • 14. Data Types ● Vector o both dense and sparse vectors ● LabeledPoint o labelled data point for supervised learning ● Rating o rating of a product by a user, used for recommendation ● Various Models o result of a training algorithm o used for predicting unknown data ● Matrices
  • 15. Feature Extraction & Basic Statistics ● Several classes for common operations ● Scaling, normalization, statistical summary, correlation, … ● Numeric RDD operations, sampling, … ● Random generators ● Words extractions (TF-IDF) o generating feature vectors from text documents/web pages
  • 16. Classification ● Classify samples into predefined category ● Supervised learning ● Binary classification (SVMs, logistic regression) ● Multiclass Classification (decision trees, naive Bayes) ● Spam x non-spam, fruit x logo, ...
  • 17. Regression ● Predict value from observations, many techniques ● Predicted values are continuous ● Supervised learning ● Linear least squares, Lasso, ridge regression, decision trees ● House prices, stock exchange, power consumption, height of person, ...
  • 18. Linear Regression Example ● Method run trains the model ● Parameters are set with setters setNumIterations and setIntercept ● Stochastic Gradient Descent (SGD) algorithm is used to minimize the objective function
  • 19. Clustering ● Grouping objects into groups (~ clusters) of high similarity ● Unsupervised learning -> groups are not predefined ● Number of clusters must be defined ● K-means, Gaussian Mixture Model (EM algorithm), Power Iteration Clustering (PIC), Latent Dirichlet Allocation (LDA)
  • 20. Collaborative Filtering ● Used for recommender systems ● Creates and analyses matrix of ratings, predicts missing entries ● Explicit (given rating) vs implicit (views, clicks, likes, shares, ...) feedback ● Alternating least squares (ALS)
  • 21. Dimensionality Reduction ● Process of reducing the number of variables under consideration ● Performance needs, removing non-informative dimensions, plotting, ... ● Principal Component Analysis (PCA) - ignoring non-informative dims ● Singular Value Decomposition (SVD) o factorizes a matrix into 3 descriptive matrices o storage savings, noise reduction
  • 22. Tips ● Preparing features o each algorithm is only as good as input features o probably the most important step in ML o correct scaling, labeling for each algorithm ● Algorithm configuration o performance greatly varies according to params ● Caching RDD for reuse o most of the algorithms are iterative o input dataset should be cached (cache() method) before passing into MLlib algorithm ● Recognizing sparsity
  • 23. Overfitting ● Model is overtrained to the training data ● Model describes random errors or noise instead of the underlying relationship ● Results in poor predictive performance on unseen data
  • 24. Data Partitioning ● Supervised learning ● Partitioning labelled data ● Labelled data o Training set: set of samples used for learning; also used for experiments with algorithm parameters o Test set: used for testing the fitted model; the model must not be tuned any further at this point ● Common separation - 70/30
  • 25. Performance ● 10-100x faster than Hadoop & Mahout
  • 29. Pipeline API ● Pipeline is a series of algorithms (feature transformation, model fitting, ...) ● Easy workflow construction ● Distribution of parameters into each stage ● MLlib is easier to use ● Uses uniform dataset representation - SchemaRDD from SparkSQL ○ multiple named columns (similar to SQL table)
  • 30. Demo
  • 31. Conclusion ● What is Machine Learning ● Machine Learning Use Cases & Techniques ● Spark’s Machine Learning library - MLlib ● Tips for using MLlib and Spark
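
The linear regression example on slide 18 survives only as bullet points, so here is a minimal sketch of the idea in plain Python rather than Spark. The class `LinearRegressionSGD` and its snake_case setters are hypothetical names that loosely mirror the `run` / setter style described on the slide; they are not MLlib's API. The toy data (noise-free samples of y = 2x + 1) is also an assumption for illustration.

```python
import random


class LinearRegressionSGD:
    """Toy least-squares regression trained with stochastic gradient descent.

    Hypothetical class for illustration; this is not MLlib's implementation.
    """

    def __init__(self):
        self.num_iterations = 100
        self.step_size = 0.01
        self.fit_intercept = False

    def set_num_iterations(self, num_iterations):
        self.num_iterations = num_iterations
        return self  # returning self allows setter chaining

    def set_step_size(self, step_size):
        self.step_size = step_size
        return self

    def set_intercept(self, fit_intercept):
        self.fit_intercept = fit_intercept
        return self

    def run(self, points, seed=0):
        """Train on (label, features) pairs and return (weights, intercept)."""
        rng = random.Random(seed)
        weights = [0.0] * len(points[0][1])
        intercept = 0.0
        for _ in range(self.num_iterations):
            # One SGD step: pick a single random sample and follow the
            # gradient of its squared error 0.5 * (prediction - label)^2.
            label, features = rng.choice(points)
            prediction = intercept + sum(w * x for w, x in zip(weights, features))
            error = prediction - label
            weights = [w - self.step_size * error * x
                       for w, x in zip(weights, features)]
            if self.fit_intercept:
                intercept -= self.step_size * error
        return weights, intercept


# Noise-free samples of y = 2x + 1 for x = 0.0, 0.1, ..., 1.0.
points = [(2.0 * (i / 10) + 1.0, [i / 10]) for i in range(11)]

weights, intercept = (LinearRegressionSGD()
                      .set_num_iterations(5000)
                      .set_step_size(0.1)
                      .set_intercept(True)
                      .run(points))
print(weights, intercept)
```

On this toy data the learned weight and intercept should land close to 2 and 1. With noisy data a constant step size would keep the estimate jittering around the optimum, which is one reason the step size and iteration count are exposed as parameters.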

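The 70/30 partitioning from slide 24 can be sketched in a few lines of plain Python. The helper `train_test_split` and the toy `labelled` data are hypothetical names chosen for illustration; on a Spark RDD the `randomSplit` method serves the same purpose, although it splits by sampling and so returns only approximately the requested proportions.

```python
import random


def train_test_split(samples, train_fraction=0.7, seed=42):
    """Shuffle a copy of the samples and cut it into (training, test) lists."""
    shuffled = list(samples)               # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


# Hypothetical labelled data: (label, feature vector) pairs.
labelled = [(float(i % 2), [float(i)]) for i in range(100)]

training, test = train_test_split(labelled)
print(len(training), len(test))
```

The model is fitted and tuned on `training` only; `test` is held back for a single final evaluation, which is what makes its error estimate a fair indication of overfitting.
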
Editor's Notes

  1. "Reinforcement learning (RL) and supervised learning are usually portrayed as distinct methods of learning from experience. RL methods are often applied to problems involving sequential dynamics and optimization of a scalar performance objective, with online exploration of the effects of actions. Supervised learning methods, on the other hand, are frequently used for problems involving static input-output mappings and minimization of a vector error signal, with no explicit dependence on how training examples are gathered. As discussed by Barto and Dietterich (this volume), the key feature distinguishing RL and supervised learning is whether training information from the environment serves as an evaluation signal or as an error signal…"
  2. spark-1.3.0-snapshot
  3. “Term Frequency—Inverse Document Frequency, or TF-IDF, is a simple way to generate feature vectors from text documents (e.g. web pages). It computes two statistics for each term in each document: the term frequency, TF, which is the number of times the term occurs in that document, and the inverse document frequency, IDF, which measures how (in)frequently a term occurs across the whole document corpus. The product of these values, TF × IDF, shows how relevant a term is to a specific document (i.e. if it is common in that specific document but rare in the whole corpus).”
  4. logistic regression -> data are labeled 1 or 0 -> classification
  5. A large number of procedures have been developed for parameter estimation and inference in linear regression. These methods differ in computational simplicity of algorithms, presence of a closed-form solution, robustness with respect to heavy-tailed distributions, and theoretical assumptions needed to validate desirable statistical properties such as consistency and asymptotic efficiency. http://www.datasciencecentral.com/profiles/blogs/10-types-of-regressions-which-one-to-use http://scikit-learn.org/stable/modules/linear_model.html#logistic-regression Linear least squares fits a linear model by minimizing the sum of squared differences between observed and predicted values. Lasso (least absolute shrinkage and selection operator) is a version of least squares that adds an L1 penalty on the coefficients.
  6. http://people.apache.org/~pwendell/spark-1.3.0-snapshot1-docs/mllib-optimization.html Limited-memory BFGS (L-BFGS or LM-BFGS) is an optimization algorithm in the family of quasi-Newton methods that approximates the Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm using a limited amount of computer memory. It is a popular algorithm for parameter estimation in machine learning. SGD is a great general-purpose optimization algorithm, and it is easy to implement. I would generally use it first, before trying something more complicated. I believe SGD is just as good as, if not superior, to L-BFGS in the not highly varying (and sometimes even convex) optimization surfaces common in current NLP models. (I would nonetheless be interested in a controlled comparison between SGD and the L-BFGS using the Berkeley cache-flushing trick.)
  7. https://github.com/apache/spark/blob/master/docs/mllib-clustering.md The EM iteration alternates between performing an expectation (E) step, which creates a function for the expectation of the log-likelihood evaluated using the current estimate for the parameters, and a maximization (M) step, which computes parameters maximizing the expected log-likelihood found on the E step. These parameter-estimates are then used to determine the distribution of the latent variables in the next E step. Power iteration clustering (PIC) is a scalable and efficient algorithm for clustering vertices of a graph given pairwise similarities as edge properties, described in Lin and Cohen, Power Iteration Clustering. It computes a pseudo-eigenvector of the normalized affinity matrix of the graph via power iteration and uses it to cluster vertices. Latent Dirichlet allocation (LDA) is a topic model which infers topics from a collection of text documents. LDA can be thought of as a clustering algorithm as follows: Topics correspond to cluster centers, and documents correspond to examples (rows) in a dataset. Topics and documents both exist in a feature space, where feature vectors are vectors of word counts. Rather than estimating a clustering using a traditional distance, LDA uses a function based on a statistical model of how text documents are generated. LDA takes in a collection of documents as vectors of word counts. It learns clustering using expectation-maximization on the likelihood function. After fitting on the documents, LDA provides: Topics: Inferred topics, each of which is a probability distribution over terms (words). Topic distributions for documents: For each document in the training set, LDA gives a probability distribution over topics.
  8. http://www.puffinwarellc.com/index.php/news-and-articles/articles/30-singular-value-decomposition-tutorial.html
  9. http://stats.stackexchange.com/questions/19048/what-is-the-difference-between-test-set-and-validation-set
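
The TF-IDF computation quoted in note 3 can be sketched in plain Python as follows. This is an illustration under stated assumptions, not MLlib's implementation: the function `tf_idf` and the toy `documents` are hypothetical, term vectors are plain dictionaries rather than hashed sparse vectors, and the IDF uses the smoothed form log((N + 1) / (df + 1)), which is one common variant.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: TF x IDF weight} dict per tokenized document."""
    num_docs = len(documents)

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))

    # Smoothed inverse document frequency (an assumed variant, see lead-in).
    idf = {term: math.log((num_docs + 1) / (df + 1))
           for term, df in doc_freq.items()}

    vectors = []
    for tokens in documents:
        term_freq = Counter(tokens)  # TF: raw count of the term in this document
        vectors.append({term: count * idf[term]
                        for term, count in term_freq.items()})
    return vectors


# Hypothetical toy corpus, already split into lowercase tokens.
documents = [
    ["spark", "is", "fast"],
    ["spark", "mllib", "is", "scalable"],
    ["mllib", "mllib", "clusters", "data"],
]

vectors = tf_idf(documents)
for vector in vectors:
    print(vector)
```

In the first document, "fast" occurs in no other document and so receives a higher weight than "spark", which also appears in the second document. That is the "common here, rare elsewhere" behaviour the note describes.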