INTRODUCTION APACHE SPARK COLLABORATIVE FILTERING Q&A
Apache Spark
Large-scale recommendations with Apache Spark and Python
Christian S. Perone
christian.perone@gmail.com
AGENDA
INTRODUCTION
Big Data
The Elephant
APACHE SPARK
Apache Spark Introduction
Resilient Distributed Datasets
Data Frames
Spark and Machine Learning
COLLABORATIVE FILTERING
Introduction
Factorization
Practice time
Q&A
WHO AM I
Christian S. Perone
Machine Learning/Software Engineer
Blog
http://blog.christianperone.com
Open-source projects
https://github.com/perone
Twitter @tarantulae
Section I
INTRODUCTION
WHAT IS BIG DATA ?
Future is data-based
User generated content
Online / streaming
Internet of Things (IoT)
We want to be able to handle data, query it, build models, make
predictions, etc.
THE CASE AGAINST THE ELEPHANT
The truth is that Map-Reduce as a processing paradigm continues to be
severely restrictive, and is no more than a subset of richer processing
systems.
—Paper Trail, The Elephant was a Trojan Horse – 2014
(...) we don’t really use MapReduce anymore.
—Urs Hölzle, Google I/O Keynote (see context) – 2014
Every real distributed machine learning (ML) researcher/engineer knows
that MR is bad. ML algorithms are iterative and MR is not suited for
iterative algorithms, which is due to unnecessary frequent I/O (...).
—Kenneth Tran, On the imminent decline of MapReduce – 2014
THE CASE AGAINST THE ELEPHANT
The Mahout community decided to move its codebase onto modern data
processing systems that offer a richer programming model and more
efficient execution than Hadoop MapReduce. Mahout will therefore reject
new MapReduce algorithm implementations from now on (...).
—Mahout, Goodbye MapReduce – 2014
Section II
APACHE SPARK
WHAT IS APACHE SPARK ?
Apache Spark is a fast and expressive cluster computing system
compatible with Apache Hadoop.
It improves computation performance by means of:
In-memory computing primitives
General computation graphs
Spark has a rich API and bindings for Scala/Python/Java/R,
including an interactive shell for Python and Scala.
We will focus on the Python API.
APACHE SPARK - CONCEPTS
The main goal of Spark is to provide the user with an API to work with
distributed collections of data as if they were local. These
collections are called RDDs (Resilient Distributed Datasets).
Immutable collections of objects spread across a cluster
Built using parallel transformations (map, reduce, filter, group,
etc.)
RDDs can be rebuilt upon failure, and they are evaluated lazily
Controllable persistence for reuse (including caching in RAM)
APACHE SPARK - RDDS
APACHE SPARK - TRANSFORMATIONS VS ACTIONS
The operations that can be applied to RDDs fall into two main
types:
TRANSFORMATIONS
These are lazy operations that create new RDDs from existing
RDDs. Examples:
map, filter, union, distinct, etc.
ACTIONS
These are the operations that actually trigger computation and return
results or write them to disk. Examples:
count, collect, first
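The difference between lazy transformations and actions can be sketched in plain Python, without Spark. The `ToyRDD` class below is hypothetical and single-machine; it is not the Spark API. It only assumes that an RDD can be modeled as immutable source data plus a recorded list of transformations that is replayed when an action is called.

```python
class ToyRDD:
    """A toy, single-machine stand-in for an RDD (hypothetical, not the Spark API)."""

    def __init__(self, source, lineage=()):
        self._source = source    # original data, never modified
        self._lineage = lineage  # recorded transformations, not yet executed

    # Transformations: lazy, each one returns a NEW ToyRDD.
    def map(self, fn):
        return ToyRDD(self._source, self._lineage + (("map", fn),))

    def filter(self, predicate):
        return ToyRDD(self._source, self._lineage + (("filter", predicate),))

    # Actions: replay the recorded lineage and return a result.
    def _compute(self):
        items = iter(self._source)
        for kind, fn in self._lineage:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return items

    def collect(self):
        return list(self._compute())

    def count(self):
        return sum(1 for _ in self._compute())

    def first(self):
        return next(self._compute())


calls = []  # records every time `square` really runs


def square(x):
    calls.append(x)
    return x * x


rdd = ToyRDD([1, 2, 3, 4, 5, 6, 7, 8])
evens_squared = rdd.map(square).filter(lambda v: v % 2 == 0)  # nothing runs yet
calls_before_action = len(calls)   # still 0: transformations are lazy
result = evens_squared.collect()   # the action triggers the computation
print(calls_before_action, result)  # 0 [4, 16, 36, 64]
```

Because every transformation returns a new object and the source is never modified, any result can be recomputed from the source by replaying the lineage. That is the intuition behind rebuilding lost data after a failure.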
APACHE SPARK - JOB EXECUTION
SPARK INTERACTIVE SHELL
./bin/pyspark --master local[4]
Creating an RDD from a list:
>>> data = [1, 2, 3, 4, 5, 6, 7, 8]
>>> rdd = sc.parallelize(data)
Creating an RDD from a file:
>>> rdd = sc.textFile("data.txt")
TRANSFORMATIONS AND ACTIONS
Filtering and counting a big log:
>>> rdd_log = sc.textFile('nginx_access.log')
>>> rdd_log.filter(lambda l: 'x.html' in l).count()
238
Collecting the interesting lines:
>>> rdd_log = sc.textFile('nginx_access.log')
>>> lines = rdd_log.filter(lambda l: 'x.html' in l).collect()
>>> lines
['201.140.8.128 [19/Jun/2012:09:17:31 +0100] 
"GET /x.html HTTP/1.1"', (...)]
Breaking it down (filter is a lazy transformation; count is the action that runs it):
>>> filter_rdd = rdd_log.filter(lambda l: 'x.html' in l)
>>> filter_rdd.count()
238
APACHE SPARK - RDDS VS DATAFRAME
RDD code is usually not very intuitive to read for complex
computations: it describes how Spark is going to do the
computation instead of what you want to compute.
RDDs also miss some important optimizations, especially in
PySpark.
That’s why DataFrames are so awesome.
APACHE SPARK - DATAFRAMES
DataFrames provide a DSL for structured data manipulation.
Very similar to Pandas DataFrames (with methods for converting
between the two).
Can load data from JSON/Parquet/libsvm/etc.
The optimizer is able to look inside the operations.
APACHE SPARK - DATAFRAMES
./bin/pyspark --master local[4]
Creating a DataFrame from a JSON file:
>>> df = spark.read.json("example.json")
Filter by a column:
>>> df.filter(df["User"]=="Perone").count()
120
APACHE SPARK - CATALYST BASICS
Powers both the SQL queries and also the DataFrame API.
Extensible query optimizer.
APACHE SPARK - CATALYST BASICS
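Expression tree for x + (1 + 2):
Add(Attribute(x), Add(Literal(1), Literal(2)))
Rewrite rules (constant folding and removal of additions with zero):
tree.transform {
  case Add(Literal(c1), Literal(c2)) => Literal(c1 + c2)
  case Add(left, Literal(0)) => left
  case Add(Literal(0), right) => right
}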
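To make the rewrite rules concrete, here is a minimal Python sketch of the same three rules applied to the same tree. The classes `Attribute`, `Literal` and `Add` are hypothetical stand-ins defined here, not Catalyst's own classes, and the bottom-up traversal is an assumption of this sketch rather than a description of how Catalyst walks the tree.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attribute:
    name: str


@dataclass(frozen=True)
class Literal:
    value: int


@dataclass(frozen=True)
class Add:
    left: object
    right: object


def simplify(node):
    """Rewrite the tree bottom-up, applying the three rules from the slide."""
    if not isinstance(node, Add):
        return node  # leaves cannot be simplified further
    left, right = simplify(node.left), simplify(node.right)
    if isinstance(left, Literal) and isinstance(right, Literal):
        return Literal(left.value + right.value)  # constant folding
    if isinstance(right, Literal) and right.value == 0:
        return left                               # x + 0 => x
    if isinstance(left, Literal) and left.value == 0:
        return right                              # 0 + x => x
    return Add(left, right)


tree = Add(Attribute("x"), Add(Literal(1), Literal(2)))
folded = simplify(tree)
print(folded)  # Add(left=Attribute(name='x'), right=Literal(value=3))
```

The constant sub-expression 1 + 2 is computed once at planning time, so it does not have to be evaluated for every row.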
APACHE SPARK - SPARK.ML VS SPARK.MLLIB
As of Spark 2.0, the RDD-based APIs in the spark.mllib package
have entered maintenance mode. The primary Machine Learning
API for Spark is now the DataFrame-based API in the spark.ml
package.
—http://spark.apache.org/docs/latest/ml-guide.html
MLlib will still support the RDD-based API with bug fixes.
MLlib will not add new features to the RDD-based API.
In the Spark 2.x releases, MLlib will add features to the DataFrames-based
API to reach feature parity with the RDD-based API.
After reaching feature parity (roughly estimated for Spark 2.2), the
RDD-based API will be deprecated.
The RDD-based API is expected to be removed in Spark 3.0.
APACHE SPARK - STACK
APACHE SPARK - ML
The goal of Spark's ML library is to make practical machine learning
scalable and easy. At a high level, it provides tools such as:
ML Algorithms: algorithms such as classification, regression,
clustering, and collaborative filtering
Featurization: feature extraction, transformation,
dimensionality reduction, and selection
Pipelines: tools for constructing, evaluating, and tuning ML
Pipelines
Persistence: saving and loading models, Pipelines, etc.
Utilities: linear algebra, statistics, data handling, etc.
APACHE SPARK - ML
Word2vec example using spark.ml:
>>> from pyspark.ml.feature import Word2Vec
>>> documents = [
... ("Hi I heard about Spark".split(" "), ),
... ("I wish Java could use case classes".split(" "), ),
... ("Logistic regression models are neat".split(" "), )
... ]
>>> documentDF = spark.createDataFrame(documents, ["text"])
>>> documentDF.take(1)
[Row(text=[u'Hi', u'I', u'heard', u'about', u'Spark'])]
>>> word2Vec = Word2Vec(vectorSize=3, minCount=0,
... inputCol="text", outputCol="result")
>>> model = word2Vec.fit(documentDF)
>>> result = model.transform(documentDF)
>>> result.take(1)
[Row(text=[u'Hi', u'I', u'heard', u'about', u'Spark'],
result=DenseVector([-0.0168, 0.0042, -0.0308]))]
Section III
COLLABORATIVE FILTERING
COLLABORATIVE FILTERING
Collaborative filtering methods are based on collecting and
analyzing a large amount of information on users’ behaviors,
activities or preferences, and predicting what users will like based on
their similarity to other users.
Doesn’t rely on content, unlike content-based methods (so it can
recommend complex items)
Doesn’t need item/user metadata
Suffers from the “new item” problem
Cold start
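The idea of predicting from similar users can be sketched in a few lines of plain Python. The user names, items and ratings below are made up for illustration, and cosine similarity with a similarity-weighted average is just one simple choice; it is not the factorization approach covered next.

```python
from math import sqrt

# Hypothetical toy ratings: user -> {item: rating}
ratings = {
    "ana":   {"A": 5, "B": 3, "C": 4},
    "bruno": {"A": 5, "B": 3, "C": 4, "D": 5},
    "carla": {"A": 1, "B": 5, "D": 1},
}


def cosine_similarity(a, b):
    """Cosine similarity computed over the items both users rated."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[i] * b[i] for i in common)
    norm_a = sqrt(sum(a[i] ** 2 for i in common))
    norm_b = sqrt(sum(b[i] ** 2 for i in common))
    return dot / (norm_a * norm_b)


def predict(user, item):
    """Similarity-weighted average of the other users' ratings for `item`."""
    weighted, total = 0.0, 0.0
    for other, their in ratings.items():
        if other == user or item not in their:
            continue
        sim = cosine_similarity(ratings[user], their)
        weighted += sim * their[item]
        total += sim
    return weighted / total if total else None  # None: nobody rated the item


prediction = predict("ana", "D")    # pulled towards bruno, who rates like ana
unrated = predict("ana", "Z")       # a brand-new item has no ratings to use
```

Only the ratings are used, never item or user metadata. The `None` result for item "Z" is the "new item" problem in miniature: with no ratings for an item there is nothing to base a prediction on.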
EXPLICIT FACTORIZATION
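Approximate the ratings matrix (rows: users, e.g. Christian; columns: items, e.g. AC/DC Back in Black; ? marks an unknown rating) by a product of user factors x and item factors y:
$$
\begin{pmatrix} ? & 2 & 3 & 1 \\ 1 & ? & ? & 4 \\ 3 & 2 & ? & ? \\ 5 & 3 & 2 & ? \end{pmatrix} \approx X^T Y
$$
where column $x_u$ of $X$ holds the factors of user $u$ and column $y_i$ of $Y$ holds the factors of item $i$.
OPTIMIZATION
$$
\min_{x,y} \sum_{u,i} \left( r_{ui} - x_u^T y_i \right)^2 + \lambda \left( \sum_u \lVert x_u \rVert^2 + \sum_i \lVert y_i \rVert^2 \right)
$$
* biases omitted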
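One common way to minimize this objective is alternating least squares (ALS): fix the item factors and solve for the user factors, then swap. The sketch below runs ALS on the ratings matrix from the slide in plain Python. It is a simplified illustration under stated assumptions, not Spark's implementation: it uses a single latent factor per user and item so that each update is a scalar formula, sums the error only over observed ratings, and picks the regularization value `LAMBDA` and the starting point arbitrarily.

```python
# Ratings from the slide: rows are users, columns are items, None = unknown.
R = [
    [None, 2, 3, 1],
    [1, None, None, 4],
    [3, 2, None, None],
    [5, 3, 2, None],
]
LAMBDA = 0.1  # regularization strength (arbitrary choice for this sketch)

observed = [(u, i, r) for u, row in enumerate(R)
            for i, r in enumerate(row) if r is not None]


def objective(x, y):
    """Squared error on observed ratings plus the L2 penalty (biases omitted)."""
    error = sum((r - x[u] * y[i]) ** 2 for u, i, r in observed)
    penalty = LAMBDA * (sum(v * v for v in x) + sum(v * v for v in y))
    return error + penalty


def als_step(x, y):
    """One alternating pass: solve every x_u with y fixed, then every y_i."""
    x = list(x)
    for u in range(len(x)):
        num = sum(r * y[i] for uu, i, r in observed if uu == u)
        den = LAMBDA + sum(y[i] ** 2 for uu, i, r in observed if uu == u)
        x[u] = num / den  # exact minimizer of the objective in x_u
    y = list(y)
    for i in range(len(y)):
        num = sum(r * x[u] for u, ii, r in observed if ii == i)
        den = LAMBDA + sum(x[u] ** 2 for u, ii, r in observed if ii == i)
        y[i] = num / den  # exact minimizer of the objective in y_i
    return x, y


x, y = [1.0] * 4, [1.0] * 4  # one latent factor per user and per item
history = [objective(x, y)]
for _ in range(10):
    x, y = als_step(x, y)
    history.append(objective(x, y))

filled_in = x[0] * y[0]  # model's estimate for the unknown rating in row 0, column 0
```

Each update solves its sub-problem exactly, so the objective never increases from one pass to the next. With more than one latent factor the scalar division becomes a small linear system per user and per item.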
LET’S DO IT
Practice time!
Notebook at: https://github.com/perone/spark-als-intro
Load/parse data
Pandas integration, sampling, plotting
Spark SQL
Split data (train/test)
Build model
Train model
Evaluate model
Have fun!
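The "split data" and "evaluate model" steps can be illustrated without Spark. The sketch below is plain Python with made-up ratings and a trivial baseline in place of a trained model; the function names are hypothetical. In PySpark the corresponding tools would be `DataFrame.randomSplit` and `RegressionEvaluator` with `metricName="rmse"`, which the notebook may or may not use in exactly this way.

```python
import random
from math import sqrt


def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle (user, item, rating) rows and split them into train and test."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = int(len(rows) * test_fraction)
    return rows[n_test:], rows[:n_test]


def rmse(predict, rows):
    """Root-mean-square error of predict(user, item) against the true ratings."""
    squared = [(predict(u, i) - r) ** 2 for u, i, r in rows]
    return sqrt(sum(squared) / len(squared))


# Hypothetical ratings in the range 1..5, generated by a fixed formula.
def true_rating(u, i):
    return float((u + 2 * i) % 5 + 1)


rows = [(u, i, true_rating(u, i)) for u in range(10) for i in range(5)]
train, test = train_test_split(rows)

# Baseline "model": always predict the mean rating seen in the training set.
mean_rating = sum(r for _, _, r in train) / len(train)
baseline_rmse = rmse(lambda u, i: mean_rating, test)
```

Evaluating on rows the model never saw during training is what makes the RMSE an estimate of how well the recommendations generalize. A trained model is worth keeping only if it beats a baseline like this one.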
Section IV
Q&A
Apache¼ Sparkℱ 1.6 presented by Databricks co-founder Patrick WendellDatabricks
 
Apache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & librariesApache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & librariesWalaa Hamdy Assy
 
Graph Analytics in Spark
Graph Analytics in SparkGraph Analytics in Spark
Graph Analytics in SparkPaco Nathan
 
Large scale logistic regression and linear support vector machines using spark
Large scale logistic regression and linear support vector machines using sparkLarge scale logistic regression and linear support vector machines using spark
Large scale logistic regression and linear support vector machines using sparkMila, Université de Montréal
 
Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)Michael Rys
 
GraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communitiesGraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communitiesPaco Nathan
 
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-Airflow
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-AirflowPyconZA19-Distributed-workloads-challenges-with-PySpark-and-Airflow
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-AirflowChetan Khatri
 
Apache Spark 2.3 boosts advanced analytics and deep learning with Python
Apache Spark 2.3 boosts advanced analytics and deep learning with PythonApache Spark 2.3 boosts advanced analytics and deep learning with Python
Apache Spark 2.3 boosts advanced analytics and deep learning with PythonDataWorks Summit
 
Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...
Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...
Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...Michael Rys
 
Spark ML Pipeline serving
Spark ML Pipeline servingSpark ML Pipeline serving
Spark ML Pipeline servingStepan Pushkarev
 
Apache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetupApache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetupNed Shawa
 
Running Emerging AI Applications on Big Data Platforms with Ray On Apache Spark
Running Emerging AI Applications on Big Data Platforms with Ray On Apache SparkRunning Emerging AI Applications on Big Data Platforms with Ray On Apache Spark
Running Emerging AI Applications on Big Data Platforms with Ray On Apache SparkDatabricks
 
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingTiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingPaco Nathan
 
Evolution of spark framework for simplifying data analysis.
Evolution of spark framework for simplifying data analysis.Evolution of spark framework for simplifying data analysis.
Evolution of spark framework for simplifying data analysis.Anirudh Gangwar
 
Big Data Transformation Powered By Apache Spark.pptx
Big Data Transformation Powered By Apache Spark.pptxBig Data Transformation Powered By Apache Spark.pptx
Big Data Transformation Powered By Apache Spark.pptxKnoldus Inc.
 

Ähnlich wie Apache Spark - Intro to Large-scale recommendations with Apache Spark and Python (20)

HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
 
Big data clustering
Big data clusteringBig data clustering
Big data clustering
 
Apache Spark PDF
Apache Spark PDFApache Spark PDF
Apache Spark PDF
 
Apache¼ Sparkℱ 1.6 presented by Databricks co-founder Patrick Wendell
Apache¼ Sparkℱ 1.6 presented by Databricks co-founder Patrick WendellApache¼ Sparkℱ 1.6 presented by Databricks co-founder Patrick Wendell
Apache¼ Sparkℱ 1.6 presented by Databricks co-founder Patrick Wendell
 
Apache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & librariesApache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & libraries
 
Bds session 13 14
Bds session 13 14Bds session 13 14
Bds session 13 14
 
Graph Analytics in Spark
Graph Analytics in SparkGraph Analytics in Spark
Graph Analytics in Spark
 
Large scale logistic regression and linear support vector machines using spark
Large scale logistic regression and linear support vector machines using sparkLarge scale logistic regression and linear support vector machines using spark
Large scale logistic regression and linear support vector machines using spark
 
Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)
 
GraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communitiesGraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communities
 
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-Airflow
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-AirflowPyconZA19-Distributed-workloads-challenges-with-PySpark-and-Airflow
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-Airflow
 
Apache Spark 2.3 boosts advanced analytics and deep learning with Python
Apache Spark 2.3 boosts advanced analytics and deep learning with PythonApache Spark 2.3 boosts advanced analytics and deep learning with Python
Apache Spark 2.3 boosts advanced analytics and deep learning with Python
 
ASPgems - kappa architecture
ASPgems - kappa architectureASPgems - kappa architecture
ASPgems - kappa architecture
 
Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...
Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...
Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...
 
Spark ML Pipeline serving
Spark ML Pipeline servingSpark ML Pipeline serving
Spark ML Pipeline serving
 
Apache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetupApache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetup
 
Running Emerging AI Applications on Big Data Platforms with Ray On Apache Spark
Running Emerging AI Applications on Big Data Platforms with Ray On Apache SparkRunning Emerging AI Applications on Big Data Platforms with Ray On Apache Spark
Running Emerging AI Applications on Big Data Platforms with Ray On Apache Spark
 
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingTiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
 
Evolution of spark framework for simplifying data analysis.
Evolution of spark framework for simplifying data analysis.Evolution of spark framework for simplifying data analysis.
Evolution of spark framework for simplifying data analysis.
 
Big Data Transformation Powered By Apache Spark.pptx
Big Data Transformation Powered By Apache Spark.pptxBig Data Transformation Powered By Apache Spark.pptx
Big Data Transformation Powered By Apache Spark.pptx
 

Mehr von Christian Perone

Gradient-based optimization for Deep Learning: a short introduction
Gradient-based optimization for Deep Learning: a short introductionGradient-based optimization for Deep Learning: a short introduction
Gradient-based optimization for Deep Learning: a short introductionChristian Perone
 
Bayesian modelling for COVID-19 seroprevalence studies
Bayesian modelling for COVID-19 seroprevalence studiesBayesian modelling for COVID-19 seroprevalence studies
Bayesian modelling for COVID-19 seroprevalence studiesChristian Perone
 
Uncertainty Estimation in Deep Learning
Uncertainty Estimation in Deep LearningUncertainty Estimation in Deep Learning
Uncertainty Estimation in Deep LearningChristian Perone
 
PyTorch under the hood
PyTorch under the hoodPyTorch under the hood
PyTorch under the hoodChristian Perone
 
Machine Learning com Python e Scikit-learn
Machine Learning com Python e Scikit-learnMachine Learning com Python e Scikit-learn
Machine Learning com Python e Scikit-learnChristian Perone
 

Mehr von Christian Perone (6)

PyTorch 2 Internals
PyTorch 2 InternalsPyTorch 2 Internals
PyTorch 2 Internals
 
Gradient-based optimization for Deep Learning: a short introduction
Gradient-based optimization for Deep Learning: a short introductionGradient-based optimization for Deep Learning: a short introduction
Gradient-based optimization for Deep Learning: a short introduction
 
Bayesian modelling for COVID-19 seroprevalence studies
Bayesian modelling for COVID-19 seroprevalence studiesBayesian modelling for COVID-19 seroprevalence studies
Bayesian modelling for COVID-19 seroprevalence studies
 
Uncertainty Estimation in Deep Learning
Uncertainty Estimation in Deep LearningUncertainty Estimation in Deep Learning
Uncertainty Estimation in Deep Learning
 
PyTorch under the hood
PyTorch under the hoodPyTorch under the hood
PyTorch under the hood
 
Machine Learning com Python e Scikit-learn
Machine Learning com Python e Scikit-learnMachine Learning com Python e Scikit-learn
Machine Learning com Python e Scikit-learn
 

KĂŒrzlich hochgeladen

Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityWSO2
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024The Digital Insurer
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyKhushali Kathiriya
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingEdi Saputra
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...apidays
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Zilliz
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MIND CTI
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfOrbitshub
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxRustici Software
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Orbitshub
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...apidays
 

KĂŒrzlich hochgeladen (20)

Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital Adaptability
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 

Apache Spark - Intro to Large-scale recommendations with Apache Spark and Python

  • 1. INTRODUCTION APACHE SPARK COLLABORATIVE FILTERING Q&A Apache Spark Large-scale recommendations with Apache Spark and Python Christian S. Perone christian.perone@gmail.com
  • 2. AGENDA: INTRODUCTION (Big Data, The Elephant); APACHE SPARK (Apache Spark Introduction, Resilient Distributed Datasets, Data Frames, Spark and Machine Learning); COLLABORATIVE FILTERING (Introduction, Factorization, Practice time); Q&A
  • 3. WHO AM I: Christian S. Perone, Machine Learning/Software Engineer; Blog: http://blog.christianperone.com; Open-source projects: https://github.com/perone; Twitter: @tarantulae
  • 4. Section I: INTRODUCTION
  • 10. WHAT IS BIG DATA? Future is data-based; User-generated content; Online / streaming; Internet of Things (IoT). We want to be able to handle data, query it, build models, make predictions, etc.
  • 13. THE CASE AGAINST THE ELEPHANT "The truth is that Map-Reduce as a processing paradigm continues to be severely restrictive, and is no more than a subset of richer processing systems." (Paper Trail, The Elephant was a Trojan Horse, 2014) "(...) we don’t really use MapReduce anymore." (Urs Hölzle, Google I/O Keynote, 2014; see context) "Every real distributed machine learning (ML) researcher/engineer knows that MR is bad. ML algorithms are iterative and MR is not suited for iterative algorithms, which is due to unnecessary frequent I/O (...)." (Kenneth Tran, On the imminent decline of MapReduce, 2014)
  • 14. THE CASE AGAINST THE ELEPHANT "The Mahout community decided to move its codebase onto modern data processing systems that offer a richer programming model and more efficient execution than Hadoop MapReduce. Mahout will therefore reject new MapReduce algorithm implementations from now on (...)." (Mahout, Goodbye MapReduce, 2014)
  • 15. Section II: APACHE SPARK
  • 19. WHAT IS APACHE SPARK? Apache Spark is a fast and expressive cluster computing system compatible with Apache Hadoop. It improves computation performance by means of in-memory computing primitives and general computation graphs. Spark has a rich API and bindings for Scala/Python/Java/R, including an interactive shell for Python and Scala. We will focus on the Python API.
  • 24. APACHE SPARK - CONCEPTS The main goal of Spark is to provide the user with an API to work with distributed collections of data as if they were local. These collections are called RDDs (Resilient Distributed Datasets). Immutable collections of objects spread across a cluster. Built using parallel transformations (map, filter, groupBy, reduceByKey, etc.). RDDs can be rebuilt upon failure and they are lazy. Controllable persistence for reuse (including caching in RAM).
  • 25. APACHE SPARK - RDDS
  • 26. APACHE SPARK - TRANSFORMATIONS VS ACTIONS The operations that can be applied to RDDs are of two main types. TRANSFORMATIONS: lazy operations that create new RDDs from other RDDs. Examples: map, filter, union, distinct, etc. ACTIONS: operations that actually run the computation and either return the results or write them to disk. Examples: count, collect, first.
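
The distinction between lazy transformations and actions can be sketched without a cluster. The block below is a toy, single-machine model in plain Python, not Spark's API: the `ToyRDD` class, its `drop_cache` method and the `traced_square` helper are hypothetical names introduced only for illustration. It assumes an RDD can be pictured as a chain of recorded steps (its lineage) that runs only when an action asks for a result.

```python
class ToyRDD:
    """Toy, single-machine stand-in for an RDD (hypothetical, not Spark's API)."""

    def __init__(self, source=None, parent=None, op=None):
        self._source = source    # base data, only set on a root dataset
        self._parent = parent    # lineage: the dataset this one derives from
        self._op = op            # function applied to the parent's iterator
        self._cache = None       # filled by an action once persist() was requested
        self._persist = False

    # Transformations: return a new ToyRDD and compute nothing.
    def map(self, f):
        return ToyRDD(parent=self, op=lambda it: (f(x) for x in it))

    def filter(self, pred):
        return ToyRDD(parent=self, op=lambda it: (x for x in it if pred(x)))

    def persist(self):
        self._persist = True
        return self

    def _iterate(self):
        if self._cache is not None:
            return iter(self._cache)
        if self._parent is None:
            return iter(self._source)
        return self._op(self._parent._iterate())

    # Actions: trigger the computation.
    def collect(self):
        result = list(self._iterate())
        if self._persist:
            self._cache = result
        return result

    def count(self):
        return len(self.collect())

    def drop_cache(self):
        """Simulate losing the cached data, e.g. after a node failure."""
        self._cache = None


calls = []

def traced_square(x):
    calls.append(x)  # record every evaluation so laziness is observable
    return x * x

numbers = ToyRDD(source=[1, 2, 3, 4, 5, 6, 7, 8])
evens_squared = numbers.map(traced_square).filter(lambda x: x % 2 == 0).persist()

calls_before_action = len(calls)        # transformations did no work yet
first_result = evens_squared.collect()  # action: runs the whole pipeline
calls_after_action = len(calls)
evens_squared.collect()                 # served from the cache, no recomputation
calls_after_cached = len(calls)
evens_squared.drop_cache()
rebuilt = evens_squared.collect()       # recomputed from the recorded lineage
```

In this sketch no element is squared until `collect()` runs, a persisted result is reused by the second action, and dropping the cache only costs a recomputation because the steps needed to rebuild the data are still recorded.
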
  • 27. APACHE SPARK - JOB EXECUTION
  • 30. SPARK INTERACTIVE SHELL
      ./bin/pyspark --master local[4]
      Creating an RDD from a list:
      >>> data = [1, 2, 3, 4, 5, 6, 7, 8]
      >>> rdd = sc.parallelize(data)
      Creating an RDD from a file:
      >>> rdd = sc.textFile("data.txt")
  • 33. TRANSFORMATIONS AND ACTIONS
      Filtering and counting a big log:
      >>> rdd_log = sc.textFile('nginx_access.log')
      >>> rdd_log.filter(lambda l: 'x.html' in l).count()
      238
      Collecting the interesting lines:
      >>> rdd_log = sc.textFile('nginx_access.log')
      >>> lines = rdd_log.filter(lambda l: 'x.html' in l).collect()
      >>> lines
      ['201.140.8.128 [19/Jun/2012:09:17:31 +0100] "GET /x.html HTTP/1.1"', (...)]
      Breaking down:
      >>> filter_rdd = rdd_log.filter(lambda l: 'x.html' in l)
      >>> filter_rdd.count()
      238
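
Outside the shell there is no predefined `sc`, so a standalone script has to create the context itself. The following is a minimal sketch of the same filter-and-count session written that way. The function names `is_hit` and `count_hits`, the application name and the default arguments are assumptions made for illustration. The PySpark import sits inside the function only so that the predicate can be tried without Spark installed.

```python
def is_hit(line, needle="x.html"):
    """Predicate applied to every log line: keep lines that mention `needle`."""
    return needle in line


def count_hits(path, needle="x.html", master="local[4]"):
    """Count matching lines of a log file using a locally created SparkContext."""
    # Imported here so is_hit() stays usable on a machine without Spark.
    from pyspark import SparkContext

    sc = SparkContext(master, "log-filter-sketch")
    try:
        hits = sc.textFile(path).filter(lambda l: is_hit(l, needle))
        hits.cache()           # keep the filtered lines in memory after the first action
        total = hits.count()   # action: reads the file and applies the filter
        sample = hits.take(3)  # second action: can reuse the cached lines
        return total, sample
    finally:
        sc.stop()              # always release the local context
```

With Spark installed, `count_hits('nginx_access.log')` would return the count together with a few sample lines. Note that a plain substring test such as `'x.html' in line` also matches requests for `/index.html`, so a stricter pattern may be preferable on real logs.
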
  • 36. APACHE SPARK - RDDS VS DATAFRAME
    RDD code is usually not very intuitive to read for complex computations: it spells out how Spark is going to do the computation instead of describing what you want computed.
    RDDs also miss some important optimizations, especially in PySpark.
    That's why DataFrames are so awesome.
  • 40. APACHE SPARK - DATAFRAMES
    DataFrames provide a DSL for structured data manipulation.
    Very similar to Pandas DataFrames (with methods for converting between the two).
    Can load data from JSON/Parquet/libsvm/etc.
    The optimizer is able to look inside the operations.
  • 43. APACHE SPARK - DATAFRAMES
    ./bin/pyspark --master local[4]
    Creating a DataFrame from a JSON file:
    >>> df = spark.read.json("example.json")
    Filtering by a column:
    >>> df.filter(df["User"] == "Perone").count()
    120
  • 46. APACHE SPARK - CATALYST BASICS
    Powers both the SQL queries and the DataFrame API.
    Extensible query optimizer.
  • 49. APACHE SPARK - CATALYST BASICS
    Add(Attribute(x), Add(Literal(1), Literal(2)))

    tree.transform {
      case Add(Literal(c1), Literal(c2)) => Literal(c1+c2)
      case Add(left, Literal(0)) => left
      case Add(Literal(0), right) => right
    }
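The three rewrite rules above can be mimicked in plain Python to see what they do to the example tree. This is only a sketch of the idea, not Catalyst itself: the `Literal`, `Attribute` and `Add` classes and the `transform` function are hypothetical stand-ins, and the rules are applied bottom-up as a simplification (the traversal order used by Catalyst may differ).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Literal:
    value: int


@dataclass(frozen=True)
class Attribute:
    name: str


@dataclass(frozen=True)
class Add:
    left: object
    right: object


def transform(node):
    """Rewrite the children first, then apply the three rules to the node."""
    if not isinstance(node, Add):
        return node
    left, right = transform(node.left), transform(node.right)
    if isinstance(left, Literal) and isinstance(right, Literal):
        return Literal(left.value + right.value)   # Add(Literal, Literal) -> Literal
    if right == Literal(0):
        return left                                # Add(left, Literal(0)) -> left
    if left == Literal(0):
        return right                               # Add(Literal(0), right) -> right
    return Add(left, right)


tree = Add(Attribute("x"), Add(Literal(1), Literal(2)))
print(transform(tree))
# Add(left=Attribute(name='x'), right=Literal(value=3))
```

The constant subtree `Add(Literal(1), Literal(2))` is folded into `Literal(3)` once, so it does not have to be re-evaluated for every row.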
  • 51. APACHE SPARK - SPARK.ML VS SPARK.MLLIB
    As of Spark 2.0, the RDD-based APIs in the spark.mllib package have entered maintenance mode. The primary Machine Learning API for Spark is now the DataFrame-based API in the spark.ml package.
    Source: http://spark.apache.org/docs/latest/ml-guide.html
    MLlib will still support the RDD-based API with bug fixes.
    No more new features will be added to the RDD-based API.
    In the Spark 2.x releases, MLlib will add features to the DataFrames-based API to reach feature parity with the RDD-based API.
    After reaching feature parity (roughly estimated for Spark 2.2), the RDD-based API will be deprecated.
    The RDD-based API is expected to be removed in Spark 3.0.
  • 52. APACHE SPARK - STACK
  • 57. APACHE SPARK - ML
    Make practical machine learning scalable and easy. At a high level, it provides tools such as:
    ML Algorithms: algorithms such as classification, regression, clustering, and collaborative filtering
    Featurization: feature extraction, transformation, dimensionality reduction, and selection
    Pipelines: tools for constructing, evaluating, and tuning ML Pipelines
    Persistence: saving and loading models, Pipelines, etc.
    Utilities: linear algebra, statistics, data handling, etc.
  • 61. APACHE SPARK - ML
    Word2vec example using spark.ml:
    >>> from pyspark.ml.feature import Word2Vec
    >>> documents = [
    ...     ("Hi I heard about Spark".split(" "), ),
    ...     ("I wish Java could use case classes".split(" "), ),
    ...     ("Logistic regression models are neat".split(" "), )
    ... ]
    >>> documentDF = spark.createDataFrame(documents, ["text"])
    >>> documentDF.take(1)
    [Row(text=[u'Hi', u'I', u'heard', u'about', u'Spark'])]
    >>> word2Vec = Word2Vec(vectorSize=3, minCount=0,
    ...                     inputCol="text", outputCol="result")
    >>> model = word2Vec.fit(documentDF)
    >>> result = model.transform(documentDF)
    >>> result.take(1)
    [Row(text=[u'Hi', u'I', u'heard', u'about', u'Spark'], result=DenseVector([-0.0168, 0.0042, -0.0308]))]
  • 62. Section III: COLLABORATIVE FILTERING
  • 67. COLLABORATIVE FILTERING
    Collaborative filtering methods are based on collecting and analyzing a large amount of information on users' behaviors, activities or preferences, and predicting what users will like based on their similarity to other users.
    Doesn't rely on item content, unlike content-based methods (complex items)
    Doesn't need item/user metadata
    Suffers from the "new item" problem
    Cold start
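  • 69. EXPLICIT FACTORIZATION
    Approximate the ratings matrix (rows: users, e.g. Christian; columns: items, e.g. AC/DC, Back in Black), where ? marks an unknown rating:

    \[
    R = \begin{pmatrix} ? & 2 & 3 & 1 \\ 1 & ? & ? & 4 \\ 3 & 2 & ? & ? \\ 5 & 3 & 2 & ? \end{pmatrix},
    \qquad r_{ui} \approx x_u^T y_i
    \]

    OPTIMIZATION

    \[
    \min_{x,y} \sum_{u,i} \left( r_{ui} - x_u^T y_i \right)^2 + \lambda \left( \sum_u \lVert x_u \rVert^2 + \sum_i \lVert y_i \rVert^2 \right)
    \]

    * biases omitted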
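To make the objective concrete, here is a small pure-Python sketch that evaluates it for the 4x4 ratings matrix from the slide. Several things are assumptions for illustration only: the sum is taken over the known ratings (unknown ones are skipped), the factors are rank 2, and the values in `X` and `Y` are picked by hand rather than learned. The function names `dot` and `objective` are hypothetical helpers.

```python
# Ratings from the slide; None marks an unknown rating ("?").
R = [
    [None, 2, 3, 1],
    [1, None, None, 4],
    [3, 2, None, None],
    [5, 3, 2, None],
]


def dot(a, b):
    """Inner product of two equally sized vectors."""
    return sum(p * q for p, q in zip(a, b))


def objective(R, X, Y, lam):
    """Squared error on the known ratings plus an L2 penalty on all factors."""
    error = sum(
        (r - dot(X[u], Y[i])) ** 2
        for u, row in enumerate(R)
        for i, r in enumerate(row)
        if r is not None
    )
    penalty = sum(dot(x, x) for x in X) + sum(dot(y, y) for y in Y)
    return error + lam * penalty


# Hypothetical rank-2 factors, chosen by hand purely for illustration.
X = [[1.0, 0.5], [0.5, 1.5], [1.0, 1.0], [2.0, 0.5]]   # one row per user (x_u)
Y = [[2.0, 0.5], [1.0, 1.0], [1.0, 0.5], [0.5, 2.0]]   # one row per item (y_i)

print(objective(R, X, Y, lam=0.1))
```

A training procedure would search for the `X` and `Y` that make this number as small as possible; the regularization weight `lam` trades fit on the known ratings against the size of the factors.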
  • 70. LET'S DO IT
    Practice time!
    Notebook at: https://github.com/perone/spark-als-intro
    Load/parse data
    Pandas integration, sampling, plotting
    Spark SQL
    Split data (train/test)
    Build model
    Train model
    Evaluate model
    Have fun!
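Two of the steps listed above, splitting the data and evaluating a model, can be sketched in plain Python without Spark. This is not the notebook's code: the ratings are made-up triples, RMSE is assumed as the evaluation metric, and a global-mean baseline stands in for the trained model so that the example stays self-contained. The helper names `split` and `rmse` are hypothetical.

```python
import math
import random

# Made-up (user, item, rating) triples; the notebook's dataset is not reproduced here.
ratings = [(u, i, float((u * 3 + i * 2) % 5 + 1)) for u in range(20) for i in range(10)]


def split(data, test_fraction, seed):
    """Shuffle a copy of the data and cut it into train and test lists."""
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]


def rmse(predict, data):
    """Root-mean-square error of predict(user, item) against the true ratings."""
    squared = [(r - predict(u, i)) ** 2 for u, i, r in data]
    return math.sqrt(sum(squared) / len(squared))


train, test = split(ratings, test_fraction=0.2, seed=42)

# Baseline "model": always predict the mean rating seen during training.
global_mean = sum(r for _, _, r in train) / len(train)
baseline_rmse = rmse(lambda u, i: global_mean, test)
print(len(train), len(test), round(baseline_rmse, 3))
```

A baseline like this gives a reference point: a factorization model is only worth its cost if its error on the held-out test set is clearly lower.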
  • 71. Section IV: Q&A
  • 72. Q&A