SlideShare ist ein Scribd-Unternehmen logo
1 von 34
Downloaden Sie, um offline zu lesen
WITH
Gökhan Atıl
GÖKHAN ATIL
➤ Database Administrator
➤ Oracle ACE Director (2016)

ACE (2011)
➤ 10g/11g and R12 Oracle Certified Professional (OCP)
➤ Co-author of Expert Oracle Enterprise Manager 12c
➤ Founding Member and Vice President of TROUG
➤ Blogger (since 2008) gokhanatil.com
➤ Twitter: @gokhanatil
2
APACHE SPARK WITH PYTHON
➤ Introduction to Apache Spark
➤ Why Python (PySpark) instead of Scala?
➤ Spark RDD
➤ SQL and DataFrames
➤ Spark Streaming
➤ Spark Graphx
➤ Spark MLlib (Machine Learning)
3
INTRODUCTION TO APACHE SPARK
➤ A fast and general engine for large-scale data processing
➤ Top-Level Apache Project since 2014.
➤ Response to limitations in the MapReduce
➤ Run programs up to 100x faster than Hadoop MapReduce
in memory, or 10x faster on disk
➤ Implemented in Scala programming language, supports
Java, Scala, Python, R
➤ Runs on Hadoop, Mesos, Kubernetes, standalone, cloud
4
DOWNLOAD AND RUN ON YOUR PC
➤ https://spark.apache.org/downloads.html
➤ Extract and Spark is ready:
tar -xzf spark-2.3.0-bin-hadoop2.7.tgz
spark-2.3.0-bin-hadoop2.7/bin/spark
➤ You can also use PIP:
pip install pyspark
5
PYSPARK AND SPARK-SUBMIT
➤ PySpark is the interface that gives access to Spark using the
Python programming language
➤ The spark-submit script in Spark’s bin directory is used to
launch applications on a cluster
spark-submit example1.py
6
WHY PYTHON INSTEAD OF SCALA?
➤ If you know Scala, then use Scala!
➤ Learning curve: Python is comparatively easier to learn
➤ Easy to use: Code readability, maintainability and familiarity is
far better with Python
➤ Libraries: Python comes with great libraries for data analysis,
statistics and visualization (numpy, pandas, matplotlib etc...)
➤ Performance:  Scala is faster then Python but if your Python
code just calls Spark libraries, the differences in performance is
minimal (*)
7
Reminder: Any new feature added in Spark API will be
available in Scala first
RESILIENT DISTRIBUTED DATASET (RDD)
➤ RDDs are the core data structure in Spark
➤ Distributed, resilient, immutable, can store unstructured and
structured data, lazy evaluated
8
node 1
RDD
partition 1
node 2
RDD
partition 2
node 3
RDD
partition 3
RDD
RDD TRANSFORMATIONS AND ACTIONS
sc.textFile(*)
9
RDD T1 T2 T3 ACTION
LAZY EVALUATATION
SPARK CONTEXT
.collect().reduceByKey(*).filter(*).map(*)
TRANSFORMATIONS ACTIONS
➤ map
➤ filter
➤ flatMap
➤ mapPartitions
➤ reduceByKey
➤ union
➤ intersection
➤ join
10
➤ collect
➤ count
➤ first
➤ take
➤ takeSample
➤ takeOrdered
➤ saveAsTextFile
➤ foreach
HOW TO CREATE RDD IN PYSPARK
➤ Referencing a dataset in an external storage system:
rdd = sc.textFile( ... )
➤ Parallelizing already existing collection:
rdd = sc.parallelize( ... )
➤ Creating RDD from already existing RDDs:
rdd2 = rdd1.map( ... )
11
USERS.CSV (MOVIELENS DATABASE)
id | age | gender | occupation | zip
1|24|M|technician|85711
2|53|F|other|94043
3|23|M|writer|32067
4|24|M|technician|43537
5|33|F|other|15213
6|42|M|executive|98101
7|57|M|administrator|91344
8|36|M|administrator|05201
12
M = 670
F = 273
EXAMPLE #1: USE RDD TO GROUP DATA FROM CSV
from pyspark import SparkContext
sc = SparkContext.getOrCreate()
print sc.textFile( "users.csv" ) 
.map( lambda x: (x.split("|")[2], 1) ) 
.reduceByKey(lambda x,y:x+y).collect()
sc.stop()
13
M, 1
M, 1
F, 1
M, 1
[(u'M', 670), (u'F', 273)]
SPARK SQL AND DATAFRAMES
14
Catalyst
RDD
DataFrames/DataSetsSQL
SPARKSQL
MLlib GraphFrames
Structured
Streaming
➤ Spark SQL is Apache Spark's module for working with
structured data
DATAFRAMES AND DATASETS
➤ DataFrame is a distributed collection of "structured" data, organized
into named columns.
➤ Spark DataSets are statically typed, while Python is a dynamically
typed programming language so Python supports only DataFrames.
15
EXAMPLE #2: USE DATAFRAME TO GROUP DATA FROM CSV
from pyspark import SparkContext
from pyspark.sql import SparkSession
sc = SparkContext.getOrCreate()
spark = SparkSession(sc)
spark.read.load( "users.csv", format="csv", sep="|" ) 
.toDF( "id","age","gender","occupation","zip" ) 
.groupby( "gender" ).count().show()
sc.stop()
16
DATAFRAME VERSUS RDD
17
?
CATALYST OPTIMIZER
➤ Spark SQL uses Catalyst optimizer to optimize query plans.
➤ Supports cost-based optimization since Spark 2.2
18
SQL
DataFrame
DataSet
Query
Plan
Optimized
Query Plan
RDD
Code Generation
CONVERSION BETWEEN RDD AND DATAFRAME
➤ An RDD can be converted to DataFrame using
createDataFrame or toDF method:
rdd = sc.parallelize([("osman",21),("ahmet",25)])
df = rdd.toDF( "name STRING, age INT" )
df.show()
➤ You can access underlying RDD of a DataFrame using rdd
property:
df.rdd.collect()
[Row(name=u'osman',age=21),Row(name=u'ahmet',age=25)]
19
EXAMPLE #3: CREATE TEMPORARY VIEWS FROM DATAFRAMES
spark.read.load( "users.csv", format="csv", sep="|" ) 
.toDF( "id","age","gender","occupation","zip" ) 
.createOrReplaceTempView( "users" )
spark.sql( "select count(*) from users" ).show()
spark.sql( "select case when age < 25 then '-25' 
when age between 25 and 39 then '25-40' 
when age >= 40 then '40+' end age_group, 
count(*) from users group by age_group order by 1" ).show()
20
EXAMPLE #4: READ AND WRITE DATA
df = spark.read.load( "users.csv", format="csv", sep="|" ) 
.toDF( "id","age","gender","occupation","zip" )
df.write.saveAsTable("users")
df .write.save("users.json", format="json", mode="overwrite")
spark.sql("SELECT gender, count(*) FROM 
json.`users.json` GROUP BY gender").show()
21
HIVE
SPARK STREAMING (DSTREAMS)
➤ Scalable, high-throughput, fault-tolerant stream processing of
live data streams
➤ Supports: File, Socket, Kafka, Flume, Kinesis
➤ Spark Streaming receives live input data streams and divides
the data into batches
22
EXAMPLE #5: DISCRETIZED STREAMS (DSTREAMS)
ssc = StreamingContext(sc, 1)
stream_data = ssc.textFileStream("file:///tmp/stream") 
.map( lambda x: x.split(","))
stream_data.pprint()
ssc.start()
ssc.awaitTermination()
23
EXAMPLE #5: OUTPUT
24
STRUCTURED STREAMING
➤ Stream processing engine built on the Spark SQL engine
➤ Supports File and Kafka sources for production; Socket and
Rate sources for testing
25
EXAMPLE #6: STRUCTURED STREAMING
stream_data = spark.readStream 
.load( format="csv",path="/tmp/stream/*.csv",
schema="name string, points int" ) 
.groupBy("name").sum("points").orderBy( "sum(points)",
ascending=0 )
stream_data.writeStream.start( format="console",
outputMode="complete" ).awaitTermination()
26
EXAMPLE #6: OUTPUT
27
GRAPHX (GRAPHFRAMES)
➤ GraphX is a new component in Spark for graphs and graph-
parallel computation.
28
EXAMPLE #7: GRAPHFRAMES
vertex =
spark.createDataFrame([
(1, "Ahmet"),
(2, "Mehmet"),
(3, "Cengiz"),
(4, "Osman")],
["id", "name"])
edges =
spark.createDataFrame([
( 1, 2, "friend" ),
( 2, 1, "friend" ),
( 2, 3, "friend" ),
( 3, 2, "friend" ),
( 2, 4, "friend" ),
( 4, 2, "friend" ),
( 3, 4, "friend" ),
( 4, 3, "friend" )],
["src","dst", "relation"])
29
EXAMPLE #7: GRAPHFRAMES
pyspark --packages graphframes:graphframes:0.5.0-spark2.1-
s_2.11
import graphframes as gf
g = gf.GraphFrame(vertex, edges)
g.shortestPaths([4]).show()
30
1
2
3
4
MLLIB (MACHINE LEARNING)
➤ Supports common ML Algorithms such as classification,
regression, clustering, and collaborative filtering
➤ Featurization:
➤ Feature extraction (TF-IDF, Word2Vec, CountVectorizer ...)
➤ Transformation (Tokenizer, StopWordsRemover ...)
➤ Selection (VectorSlicer, RFormula ... )
➤ Pipelines: combine multiple algorithms into a single pipeline,
or workflow
➤ DataFrame-based API is primary API
31
EXAMPLE #8: ALTERNATING LEAST SQUARES (ALS)
def parseratings( x ):
v = x.split("::")
return (int(v[0]), int(v[1]), float(v[2]))
ratings = sc.textFile("ratings.dat").map(parseratings) 
.toDF( ["user", "id", "rating"] )
als = ALS(userCol="user", itemCol="id", ratingCol="rating")
model = als.fit(ratings)
model.recommendForAllUsers(10).show()
32
EXAMPLE #8 OUTPUT
33
Blog: www.gokhanatil.com Twitter: @gokhanatil

Weitere ähnliche Inhalte

Was ist angesagt?

What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...Simplilearn
 
Introduction to Spark Streaming
Introduction to Spark StreamingIntroduction to Spark Streaming
Introduction to Spark Streamingdatamantra
 
Introducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data ScienceIntroducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data ScienceDatabricks
 
Introduction to Spark Internals
Introduction to Spark InternalsIntroduction to Spark Internals
Introduction to Spark InternalsPietro Michiardi
 
Spark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark MeetupSpark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark MeetupDatabricks
 
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...Edureka!
 
Deep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDeep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDatabricks
 
Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...
Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...
Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...Simplilearn
 
Spark streaming
Spark streamingSpark streaming
Spark streamingWhiteklay
 
Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...
Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...
Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...Edureka!
 
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLabApache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLabCloudxLab
 
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Databricks
 
Apache Spark Core – Practical Optimization
Apache Spark Core – Practical OptimizationApache Spark Core – Practical Optimization
Apache Spark Core – Practical OptimizationDatabricks
 
Spark overview
Spark overviewSpark overview
Spark overviewLisa Hua
 
Airbyte @ Airflow Summit - The new modern data stack
Airbyte @ Airflow Summit - The new modern data stackAirbyte @ Airflow Summit - The new modern data stack
Airbyte @ Airflow Summit - The new modern data stackMichel Tricot
 

Was ist angesagt? (20)

Apache Spark Overview
Apache Spark OverviewApache Spark Overview
Apache Spark Overview
 
Spark architecture
Spark architectureSpark architecture
Spark architecture
 
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
 
Spark SQL
Spark SQLSpark SQL
Spark SQL
 
Introduction to Spark Streaming
Introduction to Spark StreamingIntroduction to Spark Streaming
Introduction to Spark Streaming
 
Introducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data ScienceIntroducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data Science
 
Spark
SparkSpark
Spark
 
Introduction to Spark Internals
Introduction to Spark InternalsIntroduction to Spark Internals
Introduction to Spark Internals
 
Spark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark MeetupSpark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark Meetup
 
Apache spark
Apache sparkApache spark
Apache spark
 
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...
 
Deep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDeep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache Spark
 
Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...
Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...
Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...
 
Spark streaming
Spark streamingSpark streaming
Spark streaming
 
Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...
Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...
Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...
 
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLabApache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
 
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
 
Apache Spark Core – Practical Optimization
Apache Spark Core – Practical OptimizationApache Spark Core – Practical Optimization
Apache Spark Core – Practical Optimization
 
Spark overview
Spark overviewSpark overview
Spark overview
 
Airbyte @ Airflow Summit - The new modern data stack
Airbyte @ Airflow Summit - The new modern data stackAirbyte @ Airflow Summit - The new modern data stack
Airbyte @ Airflow Summit - The new modern data stack
 

Ähnlich wie Introduction to Spark with Python

Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...Аліна Шепшелей
 
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...Inhacking
 
Apache Spark and DataStax Enablement
Apache Spark and DataStax EnablementApache Spark and DataStax Enablement
Apache Spark and DataStax EnablementVincent Poncet
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache SparkMammoth Data
 
Apache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & librariesApache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & librariesWalaa Hamdy Assy
 
Memulai Data Processing dengan Spark dan Python
Memulai Data Processing dengan Spark dan PythonMemulai Data Processing dengan Spark dan Python
Memulai Data Processing dengan Spark dan PythonRidwan Fadjar
 
An introduction To Apache Spark
An introduction To Apache SparkAn introduction To Apache Spark
An introduction To Apache SparkAmir Sedighi
 
DataFrame: Spark's new abstraction for data science by Reynold Xin of Databricks
DataFrame: Spark's new abstraction for data science by Reynold Xin of DatabricksDataFrame: Spark's new abstraction for data science by Reynold Xin of Databricks
DataFrame: Spark's new abstraction for data science by Reynold Xin of DatabricksData Con LA
 
Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)Michael Rys
 
Parallelizing Existing R Packages
Parallelizing Existing R PackagesParallelizing Existing R Packages
Parallelizing Existing R PackagesCraig Warman
 
Big data processing with Apache Spark and Oracle Database
Big data processing with Apache Spark and Oracle DatabaseBig data processing with Apache Spark and Oracle Database
Big data processing with Apache Spark and Oracle DatabaseMartin Toshev
 
Spark Study Notes
Spark Study NotesSpark Study Notes
Spark Study NotesRichard Kuo
 
Building highly scalable data pipelines with Apache Spark
Building highly scalable data pipelines with Apache SparkBuilding highly scalable data pipelines with Apache Spark
Building highly scalable data pipelines with Apache SparkMartin Toshev
 
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...Anant Corporation
 
Spark: The State of the Art Engine for Big Data Processing
Spark: The State of the Art Engine for Big Data ProcessingSpark: The State of the Art Engine for Big Data Processing
Spark: The State of the Art Engine for Big Data ProcessingRamaninder Singh Jhajj
 
Spark (Structured) Streaming vs. Kafka Streams
Spark (Structured) Streaming vs. Kafka StreamsSpark (Structured) Streaming vs. Kafka Streams
Spark (Structured) Streaming vs. Kafka StreamsGuido Schmutz
 
Spark Streaming Programming Techniques You Should Know with Gerard Maas
Spark Streaming Programming Techniques You Should Know with Gerard MaasSpark Streaming Programming Techniques You Should Know with Gerard Maas
Spark Streaming Programming Techniques You Should Know with Gerard MaasSpark Summit
 
An Introduction to Spark
An Introduction to SparkAn Introduction to Spark
An Introduction to Sparkjlacefie
 
An Introduct to Spark - Atlanta Spark Meetup
An Introduct to Spark - Atlanta Spark MeetupAn Introduct to Spark - Atlanta Spark Meetup
An Introduct to Spark - Atlanta Spark Meetupjlacefie
 

Ähnlich wie Introduction to Spark with Python (20)

Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
 
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...
 
Apache Spark and DataStax Enablement
Apache Spark and DataStax EnablementApache Spark and DataStax Enablement
Apache Spark and DataStax Enablement
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
 
Apache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & librariesApache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & libraries
 
Memulai Data Processing dengan Spark dan Python
Memulai Data Processing dengan Spark dan PythonMemulai Data Processing dengan Spark dan Python
Memulai Data Processing dengan Spark dan Python
 
An introduction To Apache Spark
An introduction To Apache SparkAn introduction To Apache Spark
An introduction To Apache Spark
 
DataFrame: Spark's new abstraction for data science by Reynold Xin of Databricks
DataFrame: Spark's new abstraction for data science by Reynold Xin of DatabricksDataFrame: Spark's new abstraction for data science by Reynold Xin of Databricks
DataFrame: Spark's new abstraction for data science by Reynold Xin of Databricks
 
Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)
 
Parallelizing Existing R Packages
Parallelizing Existing R PackagesParallelizing Existing R Packages
Parallelizing Existing R Packages
 
Big data processing with Apache Spark and Oracle Database
Big data processing with Apache Spark and Oracle DatabaseBig data processing with Apache Spark and Oracle Database
Big data processing with Apache Spark and Oracle Database
 
Spark core
Spark coreSpark core
Spark core
 
Spark Study Notes
Spark Study NotesSpark Study Notes
Spark Study Notes
 
Building highly scalable data pipelines with Apache Spark
Building highly scalable data pipelines with Apache SparkBuilding highly scalable data pipelines with Apache Spark
Building highly scalable data pipelines with Apache Spark
 
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...
 
Spark: The State of the Art Engine for Big Data Processing
Spark: The State of the Art Engine for Big Data ProcessingSpark: The State of the Art Engine for Big Data Processing
Spark: The State of the Art Engine for Big Data Processing
 
Spark (Structured) Streaming vs. Kafka Streams
Spark (Structured) Streaming vs. Kafka StreamsSpark (Structured) Streaming vs. Kafka Streams
Spark (Structured) Streaming vs. Kafka Streams
 
Spark Streaming Programming Techniques You Should Know with Gerard Maas
Spark Streaming Programming Techniques You Should Know with Gerard MaasSpark Streaming Programming Techniques You Should Know with Gerard Maas
Spark Streaming Programming Techniques You Should Know with Gerard Maas
 
An Introduction to Spark
An Introduction to SparkAn Introduction to Spark
An Introduction to Spark
 
An Introduct to Spark - Atlanta Spark Meetup
An Introduct to Spark - Atlanta Spark MeetupAn Introduct to Spark - Atlanta Spark Meetup
An Introduct to Spark - Atlanta Spark Meetup
 

Mehr von Gokhan Atil

Introduction to Cassandra
Introduction to CassandraIntroduction to Cassandra
Introduction to CassandraGokhan Atil
 
SQL or noSQL - Oracle Cloud Day Istanbul
SQL or noSQL - Oracle Cloud Day IstanbulSQL or noSQL - Oracle Cloud Day Istanbul
SQL or noSQL - Oracle Cloud Day IstanbulGokhan Atil
 
EM13c: Write Powerful Scripts with EMCLI
EM13c: Write Powerful Scripts with EMCLIEM13c: Write Powerful Scripts with EMCLI
EM13c: Write Powerful Scripts with EMCLIGokhan Atil
 
Oracle Enterprise Manager Cloud Control 13c for DBAs
Oracle Enterprise Manager Cloud Control 13c for DBAsOracle Enterprise Manager Cloud Control 13c for DBAs
Oracle Enterprise Manager Cloud Control 13c for DBAsGokhan Atil
 
Essential Linux Commands for DBAs
Essential Linux Commands for DBAsEssential Linux Commands for DBAs
Essential Linux Commands for DBAsGokhan Atil
 
Oracle Enterprise Manager Cloud Control 13c for DBAs
Oracle Enterprise Manager Cloud Control 13c for DBAsOracle Enterprise Manager Cloud Control 13c for DBAs
Oracle Enterprise Manager Cloud Control 13c for DBAsGokhan Atil
 
Enterprise Manager: Write powerful scripts with EMCLI
Enterprise Manager: Write powerful scripts with EMCLIEnterprise Manager: Write powerful scripts with EMCLI
Enterprise Manager: Write powerful scripts with EMCLIGokhan Atil
 
EMCLI Crash Course - DOAG Germany
EMCLI Crash Course - DOAG GermanyEMCLI Crash Course - DOAG Germany
EMCLI Crash Course - DOAG GermanyGokhan Atil
 
Oracle Enterprise Manager 12c: EMCLI Crash Course
Oracle Enterprise Manager 12c: EMCLI Crash CourseOracle Enterprise Manager 12c: EMCLI Crash Course
Oracle Enterprise Manager 12c: EMCLI Crash CourseGokhan Atil
 
TROUG & Turkey JUG Semineri: Veriye erişimin en hızlı yolu
TROUG & Turkey JUG Semineri: Veriye erişimin en hızlı yoluTROUG & Turkey JUG Semineri: Veriye erişimin en hızlı yolu
TROUG & Turkey JUG Semineri: Veriye erişimin en hızlı yoluGokhan Atil
 
Oracle 12c Database In Memory DBA SIG
Oracle 12c Database In Memory DBA SIGOracle 12c Database In Memory DBA SIG
Oracle 12c Database In Memory DBA SIGGokhan Atil
 
Oracle 12c Database In-Memory
Oracle 12c Database In-MemoryOracle 12c Database In-Memory
Oracle 12c Database In-MemoryGokhan Atil
 
Oracle DB Standard Edition: Başka Bir Arzunuz?
Oracle DB Standard Edition: Başka Bir Arzunuz?Oracle DB Standard Edition: Başka Bir Arzunuz?
Oracle DB Standard Edition: Başka Bir Arzunuz?Gokhan Atil
 
Enterprise Manager 12c ASH Analytics
Enterprise Manager 12c ASH AnalyticsEnterprise Manager 12c ASH Analytics
Enterprise Manager 12c ASH AnalyticsGokhan Atil
 
Using APEX to Create a Mobile User Interface for Enterprise Manager 12c
Using APEX to Create a Mobile User Interface for Enterprise Manager 12cUsing APEX to Create a Mobile User Interface for Enterprise Manager 12c
Using APEX to Create a Mobile User Interface for Enterprise Manager 12cGokhan Atil
 

Mehr von Gokhan Atil (15)

Introduction to Cassandra
Introduction to CassandraIntroduction to Cassandra
Introduction to Cassandra
 
SQL or noSQL - Oracle Cloud Day Istanbul
SQL or noSQL - Oracle Cloud Day IstanbulSQL or noSQL - Oracle Cloud Day Istanbul
SQL or noSQL - Oracle Cloud Day Istanbul
 
EM13c: Write Powerful Scripts with EMCLI
EM13c: Write Powerful Scripts with EMCLIEM13c: Write Powerful Scripts with EMCLI
EM13c: Write Powerful Scripts with EMCLI
 
Oracle Enterprise Manager Cloud Control 13c for DBAs
Oracle Enterprise Manager Cloud Control 13c for DBAsOracle Enterprise Manager Cloud Control 13c for DBAs
Oracle Enterprise Manager Cloud Control 13c for DBAs
 
Essential Linux Commands for DBAs
Essential Linux Commands for DBAsEssential Linux Commands for DBAs
Essential Linux Commands for DBAs
 
Oracle Enterprise Manager Cloud Control 13c for DBAs
Oracle Enterprise Manager Cloud Control 13c for DBAsOracle Enterprise Manager Cloud Control 13c for DBAs
Oracle Enterprise Manager Cloud Control 13c for DBAs
 
Enterprise Manager: Write powerful scripts with EMCLI
Enterprise Manager: Write powerful scripts with EMCLIEnterprise Manager: Write powerful scripts with EMCLI
Enterprise Manager: Write powerful scripts with EMCLI
 
EMCLI Crash Course - DOAG Germany
EMCLI Crash Course - DOAG GermanyEMCLI Crash Course - DOAG Germany
EMCLI Crash Course - DOAG Germany
 
Oracle Enterprise Manager 12c: EMCLI Crash Course
Oracle Enterprise Manager 12c: EMCLI Crash CourseOracle Enterprise Manager 12c: EMCLI Crash Course
Oracle Enterprise Manager 12c: EMCLI Crash Course
 
TROUG & Turkey JUG Semineri: Veriye erişimin en hızlı yolu
TROUG & Turkey JUG Semineri: Veriye erişimin en hızlı yoluTROUG & Turkey JUG Semineri: Veriye erişimin en hızlı yolu
TROUG & Turkey JUG Semineri: Veriye erişimin en hızlı yolu
 
Oracle 12c Database In Memory DBA SIG
Oracle 12c Database In Memory DBA SIGOracle 12c Database In Memory DBA SIG
Oracle 12c Database In Memory DBA SIG
 
Oracle 12c Database In-Memory
Oracle 12c Database In-MemoryOracle 12c Database In-Memory
Oracle 12c Database In-Memory
 
Oracle DB Standard Edition: Başka Bir Arzunuz?
Oracle DB Standard Edition: Başka Bir Arzunuz?Oracle DB Standard Edition: Başka Bir Arzunuz?
Oracle DB Standard Edition: Başka Bir Arzunuz?
 
Enterprise Manager 12c ASH Analytics
Enterprise Manager 12c ASH AnalyticsEnterprise Manager 12c ASH Analytics
Enterprise Manager 12c ASH Analytics
 
Using APEX to Create a Mobile User Interface for Enterprise Manager 12c
Using APEX to Create a Mobile User Interface for Enterprise Manager 12cUsing APEX to Create a Mobile User Interface for Enterprise Manager 12c
Using APEX to Create a Mobile User Interface for Enterprise Manager 12c
 

Kürzlich hochgeladen

Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...SelfMade bd
 
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...WSO2
 
WSO2Con204 - Hard Rock Presentation - Keynote
WSO2Con204 - Hard Rock Presentation - KeynoteWSO2Con204 - Hard Rock Presentation - Keynote
WSO2Con204 - Hard Rock Presentation - KeynoteWSO2
 
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital TransformationWSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital TransformationWSO2
 
WSO2CON 2024 - How to Run a Security Program
WSO2CON 2024 - How to Run a Security ProgramWSO2CON 2024 - How to Run a Security Program
WSO2CON 2024 - How to Run a Security ProgramWSO2
 
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfonteinmasabamasaba
 
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...masabamasaba
 
%in Benoni+277-882-255-28 abortion pills for sale in Benoni
%in Benoni+277-882-255-28 abortion pills for sale in Benoni%in Benoni+277-882-255-28 abortion pills for sale in Benoni
%in Benoni+277-882-255-28 abortion pills for sale in Benonimasabamasaba
 
%in Midrand+277-882-255-28 abortion pills for sale in midrand
%in Midrand+277-882-255-28 abortion pills for sale in midrand%in Midrand+277-882-255-28 abortion pills for sale in midrand
%in Midrand+277-882-255-28 abortion pills for sale in midrandmasabamasaba
 
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisamasabamasaba
 
%in Rustenburg+277-882-255-28 abortion pills for sale in Rustenburg
%in Rustenburg+277-882-255-28 abortion pills for sale in Rustenburg%in Rustenburg+277-882-255-28 abortion pills for sale in Rustenburg
%in Rustenburg+277-882-255-28 abortion pills for sale in Rustenburgmasabamasaba
 
Direct Style Effect Systems - The Print[A] Example - A Comprehension Aid
Direct Style Effect Systems -The Print[A] Example- A Comprehension AidDirect Style Effect Systems -The Print[A] Example- A Comprehension Aid
Direct Style Effect Systems - The Print[A] Example - A Comprehension AidPhilip Schwarz
 
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...masabamasaba
 
WSO2CON 2024 - WSO2's Digital Transformation Journey with Choreo: A Platforml...
WSO2CON 2024 - WSO2's Digital Transformation Journey with Choreo: A Platforml...WSO2CON 2024 - WSO2's Digital Transformation Journey with Choreo: A Platforml...
WSO2CON 2024 - WSO2's Digital Transformation Journey with Choreo: A Platforml...WSO2
 
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...masabamasaba
 
%in ivory park+277-882-255-28 abortion pills for sale in ivory park
%in ivory park+277-882-255-28 abortion pills for sale in ivory park %in ivory park+277-882-255-28 abortion pills for sale in ivory park
%in ivory park+277-882-255-28 abortion pills for sale in ivory park masabamasaba
 
WSO2CON 2024 - Navigating API Complexity: REST, GraphQL, gRPC, Websocket, Web...
WSO2CON 2024 - Navigating API Complexity: REST, GraphQL, gRPC, Websocket, Web...WSO2CON 2024 - Navigating API Complexity: REST, GraphQL, gRPC, Websocket, Web...
WSO2CON 2024 - Navigating API Complexity: REST, GraphQL, gRPC, Websocket, Web...WSO2
 
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024VictoriaMetrics
 

Kürzlich hochgeladen (20)

Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
 
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...
 
WSO2Con204 - Hard Rock Presentation - Keynote
WSO2Con204 - Hard Rock Presentation - KeynoteWSO2Con204 - Hard Rock Presentation - Keynote
WSO2Con204 - Hard Rock Presentation - Keynote
 
Abortion Pill Prices Tembisa [(+27832195400*)] 🏥 Women's Abortion Clinic in T...
Abortion Pill Prices Tembisa [(+27832195400*)] 🏥 Women's Abortion Clinic in T...Abortion Pill Prices Tembisa [(+27832195400*)] 🏥 Women's Abortion Clinic in T...
Abortion Pill Prices Tembisa [(+27832195400*)] 🏥 Women's Abortion Clinic in T...
 
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital TransformationWSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
 
Abortion Pills In Pretoria ](+27832195400*)[ 🏥 Women's Abortion Clinic In Pre...
Abortion Pills In Pretoria ](+27832195400*)[ 🏥 Women's Abortion Clinic In Pre...Abortion Pills In Pretoria ](+27832195400*)[ 🏥 Women's Abortion Clinic In Pre...
Abortion Pills In Pretoria ](+27832195400*)[ 🏥 Women's Abortion Clinic In Pre...
 
WSO2CON 2024 - How to Run a Security Program
WSO2CON 2024 - How to Run a Security ProgramWSO2CON 2024 - How to Run a Security Program
WSO2CON 2024 - How to Run a Security Program
 
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
 
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...
 
%in Benoni+277-882-255-28 abortion pills for sale in Benoni
%in Benoni+277-882-255-28 abortion pills for sale in Benoni%in Benoni+277-882-255-28 abortion pills for sale in Benoni
%in Benoni+277-882-255-28 abortion pills for sale in Benoni
 
%in Midrand+277-882-255-28 abortion pills for sale in midrand
%in Midrand+277-882-255-28 abortion pills for sale in midrand%in Midrand+277-882-255-28 abortion pills for sale in midrand
%in Midrand+277-882-255-28 abortion pills for sale in midrand
 
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
 
%in Rustenburg+277-882-255-28 abortion pills for sale in Rustenburg
%in Rustenburg+277-882-255-28 abortion pills for sale in Rustenburg%in Rustenburg+277-882-255-28 abortion pills for sale in Rustenburg
%in Rustenburg+277-882-255-28 abortion pills for sale in Rustenburg
 
Direct Style Effect Systems - The Print[A] Example - A Comprehension Aid
Direct Style Effect Systems -The Print[A] Example- A Comprehension AidDirect Style Effect Systems -The Print[A] Example- A Comprehension Aid
Direct Style Effect Systems - The Print[A] Example - A Comprehension Aid
 
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
 
WSO2CON 2024 - WSO2's Digital Transformation Journey with Choreo: A Platforml...
WSO2CON 2024 - WSO2's Digital Transformation Journey with Choreo: A Platforml...WSO2CON 2024 - WSO2's Digital Transformation Journey with Choreo: A Platforml...
WSO2CON 2024 - WSO2's Digital Transformation Journey with Choreo: A Platforml...
 
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
 
%in ivory park+277-882-255-28 abortion pills for sale in ivory park
%in ivory park+277-882-255-28 abortion pills for sale in ivory park %in ivory park+277-882-255-28 abortion pills for sale in ivory park
%in ivory park+277-882-255-28 abortion pills for sale in ivory park
 
WSO2CON 2024 - Navigating API Complexity: REST, GraphQL, gRPC, Websocket, Web...
WSO2CON 2024 - Navigating API Complexity: REST, GraphQL, gRPC, Websocket, Web...WSO2CON 2024 - Navigating API Complexity: REST, GraphQL, gRPC, Websocket, Web...
WSO2CON 2024 - Navigating API Complexity: REST, GraphQL, gRPC, Websocket, Web...
 
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024
 

Introduction to Spark with Python

  • 2. GÖKHAN ATIL ➤ Database Administrator ➤ Oracle ACE Director (2016)
 ACE (2011) ➤ 10g/11g and R12 Oracle Certified Professional (OCP) ➤ Co-author of Expert Oracle Enterprise Manager 12c ➤ Founding Member and Vice President of TROUG ➤ Blogger (since 2008) gokhanatil.com ➤ Twitter: @gokhanatil 2
  • 3. APACHE SPARK WITH PYTHON ➤ Introduction to Apache Spark ➤ Why Python (PySpark) instead of Scala? ➤ Spark RDD ➤ SQL and DataFrames ➤ Spark Streaming ➤ Spark Graphx ➤ Spark MLlib (Machine Learning) 3
  • 4. INTRODUCTION TO APACHE SPARK ➤ A fast and general engine for large-scale data processing ➤ Top-Level Apache Project since 2014. ➤ Response to limitations in the MapReduce ➤ Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk ➤ Implemented in Scala programming language, supports Java, Scala, Python, R ➤ Runs on Hadoop, Mesos, Kubernetes, standalone, cloud 4
  • 5. DOWNLOAD AND RUN ON YOUR PC ➤ https://spark.apache.org/downloads.html ➤ Extract and Spark is ready: tar -xzf spark-2.3.0-bin-hadoop2.7.tgz spark-2.3.0-bin-hadoop2.7/bin/spark ➤ You can also use PIP: pip install pyspark 5
  • 6. PYSPARK AND SPARK-SUBMIT ➤ PySpark is the interface that gives access to Spark using the Python programming language ➤ The spark-submit script in Spark’s bin directory is used to launch applications on a cluster spark-submit example1.py 6
  • 7. WHY PYTHON INSTEAD OF SCALA? ➤ If you know Scala, then use Scala! ➤ Learning curve: Python is comparatively easier to learn ➤ Easy to use: Code readability, maintainability and familiarity is far better with Python ➤ Libraries: Python comes with great libraries for data analysis, statistics and visualization (numpy, pandas, matplotlib etc...) ➤ Performance:  Scala is faster then Python but if your Python code just calls Spark libraries, the differences in performance is minimal (*) 7 Reminder: Any new feature added in Spark API will be available in Scala first
  • 8. RESILIENT DISTRIBUTED DATASET (RDD) ➤ RDDs are the core data structure in Spark ➤ Distributed, resilient, immutable, can store unstructured and structured data, lazy evaluated 8 node 1 RDD partition 1 node 2 RDD partition 2 node 3 RDD partition 3 RDD
  • 9. RDD TRANSFORMATIONS AND ACTIONS sc.textFile(*) 9 RDD T1 T2 T3 ACTION LAZY EVALUATATION SPARK CONTEXT .collect().reduceByKey(*).filter(*).map(*)
  • 10. TRANSFORMATIONS ACTIONS ➤ map ➤ filter ➤ flatMap ➤ mapPartitions ➤ reduceByKey ➤ union ➤ intersection ➤ join 10 ➤ collect ➤ count ➤ first ➤ take ➤ takeSample ➤ takeOrdered ➤ saveAsTextFile ➤ foreach
  • 11. HOW TO CREATE RDD IN PYSPARK ➤ Referencing a dataset in an external storage system: rdd = sc.textFile( ... ) ➤ Parallelizing already existing collection: rdd = sc.parallelize( ... ) ➤ Creating RDD from already existing RDDs: rdd2 = rdd1.map( ... ) 11
  • 12. USERS.CSV (MOVIELENS DATABASE) id | age | gender | occupation | zip 1|24|M|technician|85711 2|53|F|other|94043 3|23|M|writer|32067 4|24|M|technician|43537 5|33|F|other|15213 6|42|M|executive|98101 7|57|M|administrator|91344 8|36|M|administrator|05201 12 M = 670 F = 273
  • 13. EXAMPLE #1: USE RDD TO GROUP DATA FROM CSV from pyspark import SparkContext sc = SparkContext.getOrCreate() print sc.textFile( "users.csv" ) .map( lambda x: (x.split("|")[2], 1) ) .reduceByKey(lambda x,y:x+y).collect() sc.stop() 13 M, 1 M, 1 F, 1 M, 1 [(u'M', 670), (u'F', 273)]
  • 14. SPARK SQL AND DATAFRAMES 14 Catalyst RDD DataFrames/DataSetsSQL SPARKSQL MLlib GraphFrames Structured Streaming ➤ Spark SQL is Apache Spark's module for working with structured data
  • 15. DATAFRAMES AND DATASETS ➤ DataFrame is a distributed collection of "structured" data, organized into named columns. ➤ Spark DataSets are statically typed, while Python is a dynamically typed programming language so Python supports only DataFrames. 15
  • 16. EXAMPLE #2: USE DATAFRAME TO GROUP DATA FROM CSV from pyspark import SparkContext from pyspark.sql import SparkSession sc = SparkContext.getOrCreate() spark = SparkSession(sc) spark.read.load( "users.csv", format="csv", sep="|" ) .toDF( "id","age","gender","occupation","zip" ) .groupby( "gender" ).count().show() sc.stop() 16
  • 18. CATALYST OPTIMIZER ➤ Spark SQL uses Catalyst optimizer to optimize query plans. ➤ Supports cost-based optimization since Spark 2.2 18 SQL DataFrame DataSet Query Plan Optimized Query Plan RDD Code Generation
  • 19. CONVERSION BETWEEN RDD AND DATAFRAME ➤ An RDD can be converted to DataFrame using createDataFrame or toDF method: rdd = sc.parallelize([("osman",21),("ahmet",25)]) df = rdd.toDF( "name STRING, age INT" ) df.show() ➤ You can access underlying RDD of a DataFrame using rdd property: df.rdd.collect() [Row(name=u'osman',age=21),Row(name=u'ahmet',age=25)] 19
  • 20. EXAMPLE #3: CREATE TEMPORARY VIEWS FROM DATAFRAMES spark.read.load( "users.csv", format="csv", sep="|" ) .toDF( "id","age","gender","occupation","zip" ) .createOrReplaceTempView( "users" ) spark.sql( "select count(*) from users" ).show() spark.sql( "select case when age < 25 then '-25' when age between 25 and 39 then '25-40' when age >= 40 then '40+' end age_group, count(*) from users group by age_group order by 1" ).show() 20
  • 21. EXAMPLE #4: READ AND WRITE DATA df = spark.read.load( "users.csv", format="csv", sep="|" ) .toDF( "id","age","gender","occupation","zip" ) df.write.saveAsTable("users") df .write.save("users.json", format="json", mode="overwrite") spark.sql("SELECT gender, count(*) FROM json.`users.json` GROUP BY gender").show() 21 HIVE
  • 22. SPARK STREAMING (DSTREAMS) ➤ Scalable, high-throughput, fault-tolerant stream processing of live data streams ➤ Supports: File, Socket, Kafka, Flume, Kinesis ➤ Spark Streaming receives live input data streams and divides the data into batches 22
  • 23. EXAMPLE #5: DISCRETIZED STREAMS (DSTREAMS) ssc = StreamingContext(sc, 1) stream_data = ssc.textFileStream("file:///tmp/stream") .map( lambda x: x.split(",")) stream_data.pprint() ssc.start() ssc.awaitTermination() 23
  • 25. STRUCTURED STREAMING ➤ Stream processing engine built on the Spark SQL engine ➤ Supports File and Kafka sources for production; Socket and Rate sources for testing 25
  • 26. EXAMPLE #6: STRUCTURED STREAMING stream_data = spark.readStream .load( format="csv",path="/tmp/stream/*.csv", schema="name string, points int" ) .groupBy("name").sum("points").orderBy( "sum(points)", ascending=0 ) stream_data.writeStream.start( format="console", outputMode="complete" ).awaitTermination() 26
  • 28. GRAPHX (GRAPHFRAMES) ➤ GraphX is a new component in Spark for graphs and graph- parallel computation. 28
  • 29. EXAMPLE #7: GRAPHFRAMES vertex = spark.createDataFrame([ (1, "Ahmet"), (2, "Mehmet"), (3, "Cengiz"), (4, "Osman")], ["id", "name"]) edges = spark.createDataFrame([ ( 1, 2, "friend" ), ( 2, 1, "friend" ), ( 2, 3, "friend" ), ( 3, 2, "friend" ), ( 2, 4, "friend" ), ( 4, 2, "friend" ), ( 3, 4, "friend" ), ( 4, 3, "friend" )], ["src","dst", "relation"]) 29
  • 30. EXAMPLE #7: GRAPHFRAMES pyspark --packages graphframes:graphframes:0.5.0-spark2.1- s_2.11 import graphframes as gf g = gf.GraphFrame(vertex, edges) g.shortestPaths([4]).show() 30 1 2 3 4
  • 31. MLLIB (MACHINE LEARNING) ➤ Supports common ML Algorithms such as classification, regression, clustering, and collaborative filtering ➤ Featurization: ➤ Feature extraction (TF-IDF, Word2Vec, CountVectorizer ...) ➤ Transformation (Tokenizer, StopWordsRemover ...) ➤ Selection (VectorSlicer, RFormula ... ) ➤ Pipelines: combine multiple algorithms into a single pipeline, or workflow ➤ DataFrame-based API is primary API 31
  • 32. EXAMPLE #8: ALTERNATING LEAST SQUARES (ALS) def parseratings( x ): v = x.split("::") return (int(v[0]), int(v[1]), float(v[2])) ratings = sc.textFile("ratings.dat").map(parseratings) .toDF( ["user", "id", "rating"] ) als = ALS(userCol="user", itemCol="id", ratingCol="rating") model = als.fit(ratings) model.recommendForAllUsers(10).show() 32