SlideShare ist ein Scribd-Unternehmen logo
1 von 39
Downloaden Sie, um offline zu lesen
PYSPARK IN PRACTICE
PYDATA LONDON 2016
Ronert Obst Senior Data Scientist
Dat Tran Data Scientist
 
 
0
AGENDA
○ Short introduction
○ Data structures
○ Configuration and performance
○ Unit testing with PySpark
○ Data pipeline management and workflows
○ Online learning with PySpark streaming
○ Operationalisation
WHAT IS PYSPARK?
DATA STRUCTURES
DATA STRUCTURES
○ RDDs
○ DataFrames
○ DataSets
DATAFRAMES
○ Built on top of RDD
○ Include metadata
○ Turns PySpark API calls into query plan
○ Less flexibel than RDD
○ Python UDFs impact performance, use builtin functions whenever
possible
○ HiveContext ftw
PYTHON DATAFRAME PERFORMANCE
Source: Databricks - Spark in Production (Aaron Davidson)
CONFIGURATION AND
PERFORMANCE
CONFIGURATION PRECEDENCE
1. Programatically (SparkConf())
2. Command line (spark-submit --master yarn --executor-
memory 20G ...)
3. Configuration files (spark-defaults.conf, spark-env.sh)
SPARK 1.6 MEMORY MANAGEMENT
Source: Spark Memory Management (Alexey Grishchenko) (https://0x0fff.com/spark-
memory-management/)
WORKED EXAMPLE
CLUSTER HARDWARE:
○ 6 nodes with 16 cores and 64 GB RAM each
YARN CONFIG:
○ yarn.nodemanager.resource.memory-mb: 61 GB
○ yarn.nodemanager.resource.cpu-vcores: 15
SPARK-DEFAULTS.CONF
○ num-executors: 4 executors per node * 6 nodes = 24
○ executor-cores: 6
○ executor-memory: 11600 MB
PYTHON OOM
More Python memory, e.g. scikit-learn
○ Reduce concurrency (4 executors per node with 4 executor cores)
○ Set spark.python.worker.memory = 1024 MB (default is 512 MB)
○ Leaves spark.executor.memory = 10600 MB
OTHER USEFUL SETTINGS
○ Save substantial memory: spark.rdd.compress = true
○ Relaunch stragglers: spark.speculation = true
○ Fix Python 3 hashseed issue:
spark.executorEnv.PYTHONHASHSEED = 0
TUNING PERFORMANCE
○ Re-use RDDs with cache()(set storage level to MEMORY_AND_DISK)
○ Avoid groupByKey() on RDDs
○ Distribute by key data =
sql_context.read.parquet(hdfs://...).repartition(N_PARTITIONS,
"key")
○ Use treeReduceand treeAggregateinstead of reduceand aggregate
○ sc.broadcast(broadcast_var)small tables containing frequently accessed
data
UNIT TESTING WITH PYSPARK
MOST DATA SCIENTIST WRITE CODE THAT
JUST WORKS
BUGS ARE EVERYWHERE...
TDD IS THE SOLUTION!
THE REALITY CAN BE DIFFICULT...
EXPLORATION PHASE
PRODUCTION PHASE
# Import script modules here
import clustering
class ClusteringTest(unittest.TestCase):
def setUp(self):
"""Create a single node Spark application."""
conf = SparkConf()
conf.set("spark.executor.memory", "1g")
conf.set("spark.cores.max", "1")
conf.set("spark.app.name", "nosetest")
self.sc = SparkContext(conf=conf)
self.mock_df = self.mock_data()
def tearDown(self):
"""Stop the SparkContext."""
self.sc.stop()
TEST:
IMPLEMENTATION:
Full example:
def test_assign_cluster(self):
"""Check if rows are labeled are as expected."""
input_df = clustering.convert_df(self.sc, self.mock_df)
scaled_df = clustering.rescale_df(input_df)
label_df = clustering.assign_cluster(scaled_df)
self.assertEqual(label_df.map(lambda x: x.label).collect(), [0, 0, 0,
1, 1])
def assign_cluster(data):
"""Train kmeans on rescaled data and then label the rescaled data."""
kmeans = KMeans(k=2, seed=1, featuresCol="features_scaled", predictio
nCol="label")
model = kmeans.fit(data)
label_df = model.transform(data)
return label_df
https://github.com/datitran/spark-tdd-example
(https://github.com/datitran/spark-tdd-example)
DATA PIPELINE MANAGEMENT
AND WORKFLOWS
PIPELINE MANAGEMENT CAN BE
DIFFICULT...
IN THE PAST...
CUSTOM LOGGERS
○ 100% Python
○ Web dashboard
○ Parallel workers
○ Central scheduler
○ Many templates for various tasks e.g. Spark, SQL etc.
○ Well configured logger
○ Parameters
EXAMPLE SPARK TASK
class ClusteringTask(SparkSubmitTask):
"""Assign cluster label to some data."""
date = luigi.DateParameter()
num_k = luigi.IntParameter()
name = "Clustering with PySpark"
app = "../clustering.py"
def app_options(self):
return [self.num_k]
def requires(self):
return [FeatureEngineeringTask(date=self.date)]
def output(self):
return HdfsTarget("hdfs://...")
EXAMPLE CONFIG FILE
client.cfg
[resources]
spark: 2
[spark]
master: yarn
executor-cores: 3
num-executors: 11
executor-memory: 20G
WITH LUIGI
CAVEATS
○ No built-in job trigger
○ Documentation could be better
○ Logging can be too much when having multiple workers
○ Duplicated parameters in all downstream tasks
ONLINE LEARNING WITH
PYSPARK STREAMING
ONLINE LEARNING WITH PYSPARK
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.regression import StreamingLinearRegressionWithSGD
def parse(lp):
label = float(lp[lp.find('(') + 1: lp.find(',')])
vec = Vectors.dense(lp[lp.find('[') + 1: lp.find(']')].split(','))
return LabeledPoint(label, vec)
stream = scc.socketTextStream("localhost", 9999).map(parse)
numFeatures = 3
model = StreamingLinearRegressionWithSGD()
model.setInitialWeights([0.0, 0.0, 0.0])
model.trainOn(stream)
print(model.predictOn(stream.map(lambda lp: (lp.label, lp.features))))
ssc.start()
ssc.awaitTermination()
OPERATIONALISATION
RESTFUL API EXAMPLE WITH FLASK
import pandas as pd
from flask import Flask, request
app = Flask(__name__)
# Load model stored on redis or hdfs
app.model = load_model()
def predict(model, features):
"""Use the trained cluster centers to assign cluster to new feature
s."""
differences = pd.DataFrame(
model.values - features.values, columns=model.columns)
differences_square = differences.apply(lambda x: x**2)
col_sums = differences_square.sum(axis=1)
label = col_sums.idxmin()
return int(label)
@app.route("/clustering", methods=["POST"])
def clustering():
"""
curl -i -H "Content-Type: application/json" -X POST -d @example.json
"x.x.x.x:5000/clustering"
"""
features_json = request.json
features_df = pd.DataFrame(features_json)
return str(predict(load_model(), feature_df))
DEPLOY YOUR APPS WITH CF PUSH
○ Platform as a Service (PaaS)
○ "Here is my source code. Run it on the cloud for me. I do not care
how." (Onsi Fakhouri)
○ Trial version:
○ Python Cloud Foundry examples:
https://run.pivotal.io/ (https://run.pivotal.io/)
https://github.com/ihuston/python-
cf-examples (https://github.com/ihuston/python-cf-examples)
PySpark in practice slides
PySpark in practice slides

Weitere ähnliche Inhalte

Was ist angesagt?

Apache Spark Core – Practical Optimization
Apache Spark Core – Practical OptimizationApache Spark Core – Practical Optimization
Apache Spark Core – Practical OptimizationDatabricks
 
PySpark Programming | PySpark Concepts with Hands-On | PySpark Training | Edu...
PySpark Programming | PySpark Concepts with Hands-On | PySpark Training | Edu...PySpark Programming | PySpark Concepts with Hands-On | PySpark Training | Edu...
PySpark Programming | PySpark Concepts with Hands-On | PySpark Training | Edu...Edureka!
 
Introduction to Spark Internals
Introduction to Spark InternalsIntroduction to Spark Internals
Introduction to Spark InternalsPietro Michiardi
 
Apache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper OptimizationApache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper OptimizationDatabricks
 
Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...
Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...
Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...Simplilearn
 
Introduction to PySpark
Introduction to PySparkIntroduction to PySpark
Introduction to PySparkRussell Jurney
 
Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin...
Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin...Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin...
Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin...Edureka!
 
Spark introduction and architecture
Spark introduction and architectureSpark introduction and architecture
Spark introduction and architectureSohil Jain
 
Introduction to Spark with Python
Introduction to Spark with PythonIntroduction to Spark with Python
Introduction to Spark with PythonGokhan Atil
 
Apache Spark overview
Apache Spark overviewApache Spark overview
Apache Spark overviewDataArt
 
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in SparkSpark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in SparkBo Yang
 
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangApache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangDatabricks
 
A Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQLA Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQLDatabricks
 
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin HuaiA Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin HuaiDatabricks
 
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Databricks
 
Introducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data ScienceIntroducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data ScienceDatabricks
 

Was ist angesagt? (20)

Apache Spark Core – Practical Optimization
Apache Spark Core – Practical OptimizationApache Spark Core – Practical Optimization
Apache Spark Core – Practical Optimization
 
PySpark Programming | PySpark Concepts with Hands-On | PySpark Training | Edu...
PySpark Programming | PySpark Concepts with Hands-On | PySpark Training | Edu...PySpark Programming | PySpark Concepts with Hands-On | PySpark Training | Edu...
PySpark Programming | PySpark Concepts with Hands-On | PySpark Training | Edu...
 
Introduction to Spark Internals
Introduction to Spark InternalsIntroduction to Spark Internals
Introduction to Spark Internals
 
Spark architecture
Spark architectureSpark architecture
Spark architecture
 
Apache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper OptimizationApache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper Optimization
 
Apache Spark Core
Apache Spark CoreApache Spark Core
Apache Spark Core
 
Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...
Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...
Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...
 
Introduction to PySpark
Introduction to PySparkIntroduction to PySpark
Introduction to PySpark
 
Spark SQL
Spark SQLSpark SQL
Spark SQL
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin...
Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin...Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin...
Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin...
 
Spark introduction and architecture
Spark introduction and architectureSpark introduction and architecture
Spark introduction and architecture
 
Introduction to Spark with Python
Introduction to Spark with PythonIntroduction to Spark with Python
Introduction to Spark with Python
 
Apache Spark overview
Apache Spark overviewApache Spark overview
Apache Spark overview
 
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in SparkSpark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
 
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangApache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
 
A Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQLA Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQL
 
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin HuaiA Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
 
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
 
Introducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data ScienceIntroducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data Science
 

Ähnlich wie PySpark in practice slides

SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...Inhacking
 
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...Аліна Шепшелей
 
Xdp and ebpf_maps
Xdp and ebpf_mapsXdp and ebpf_maps
Xdp and ebpf_mapslcplcp1
 
S51281 - Accelerate Data Science in Python with RAPIDS_1679330128290001YmT7.pdf
S51281 - Accelerate Data Science in Python with RAPIDS_1679330128290001YmT7.pdfS51281 - Accelerate Data Science in Python with RAPIDS_1679330128290001YmT7.pdf
S51281 - Accelerate Data Science in Python with RAPIDS_1679330128290001YmT7.pdfDLow6
 
Living the Nomadic life - Nic Jackson
Living the Nomadic life - Nic JacksonLiving the Nomadic life - Nic Jackson
Living the Nomadic life - Nic JacksonParis Container Day
 
Nomad Multi-Cloud
Nomad Multi-CloudNomad Multi-Cloud
Nomad Multi-CloudNic Jackson
 
Apache Spark Introduction - CloudxLab
Apache Spark Introduction - CloudxLabApache Spark Introduction - CloudxLab
Apache Spark Introduction - CloudxLabAbhinav Singh
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkPatrick Wendell
 
Updates from Project Hydrogen: Unifying State-of-the-Art AI and Big Data in A...
Updates from Project Hydrogen: Unifying State-of-the-Art AI and Big Data in A...Updates from Project Hydrogen: Unifying State-of-the-Art AI and Big Data in A...
Updates from Project Hydrogen: Unifying State-of-the-Art AI and Big Data in A...Databricks
 
TensorFrames: Google Tensorflow on Apache Spark
TensorFrames: Google Tensorflow on Apache SparkTensorFrames: Google Tensorflow on Apache Spark
TensorFrames: Google Tensorflow on Apache SparkDatabricks
 
10 things i wish i'd known before using spark in production
10 things i wish i'd known before using spark in production10 things i wish i'd known before using spark in production
10 things i wish i'd known before using spark in productionParis Data Engineers !
 
Build Large-Scale Data Analytics and AI Pipeline Using RayDP
Build Large-Scale Data Analytics and AI Pipeline Using RayDPBuild Large-Scale Data Analytics and AI Pipeline Using RayDP
Build Large-Scale Data Analytics and AI Pipeline Using RayDPDatabricks
 
Profiling deep learning network using NVIDIA nsight systems
Profiling deep learning network using NVIDIA nsight systemsProfiling deep learning network using NVIDIA nsight systems
Profiling deep learning network using NVIDIA nsight systemsJack (Jaegeun) Han
 
Distributed Multi-GPU Computing with Dask, CuPy and RAPIDS
Distributed Multi-GPU Computing with Dask, CuPy and RAPIDSDistributed Multi-GPU Computing with Dask, CuPy and RAPIDS
Distributed Multi-GPU Computing with Dask, CuPy and RAPIDSPeterAndreasEntschev
 
Jump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and DatabricksJump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and DatabricksDatabricks
 
Scala Meetup Hamburg - Spark
Scala Meetup Hamburg - SparkScala Meetup Hamburg - Spark
Scala Meetup Hamburg - SparkIvan Morozov
 

Ähnlich wie PySpark in practice slides (20)

SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...
 
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
 
Apache Spark Workshop
Apache Spark WorkshopApache Spark Workshop
Apache Spark Workshop
 
Xdp and ebpf_maps
Xdp and ebpf_mapsXdp and ebpf_maps
Xdp and ebpf_maps
 
S51281 - Accelerate Data Science in Python with RAPIDS_1679330128290001YmT7.pdf
S51281 - Accelerate Data Science in Python with RAPIDS_1679330128290001YmT7.pdfS51281 - Accelerate Data Science in Python with RAPIDS_1679330128290001YmT7.pdf
S51281 - Accelerate Data Science in Python with RAPIDS_1679330128290001YmT7.pdf
 
Deeplearning in production
Deeplearning in productionDeeplearning in production
Deeplearning in production
 
Living the Nomadic life - Nic Jackson
Living the Nomadic life - Nic JacksonLiving the Nomadic life - Nic Jackson
Living the Nomadic life - Nic Jackson
 
Nomad Multi-Cloud
Nomad Multi-CloudNomad Multi-Cloud
Nomad Multi-Cloud
 
Apache Spark Introduction - CloudxLab
Apache Spark Introduction - CloudxLabApache Spark Introduction - CloudxLab
Apache Spark Introduction - CloudxLab
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache Spark
 
Updates from Project Hydrogen: Unifying State-of-the-Art AI and Big Data in A...
Updates from Project Hydrogen: Unifying State-of-the-Art AI and Big Data in A...Updates from Project Hydrogen: Unifying State-of-the-Art AI and Big Data in A...
Updates from Project Hydrogen: Unifying State-of-the-Art AI and Big Data in A...
 
TensorFrames: Google Tensorflow on Apache Spark
TensorFrames: Google Tensorflow on Apache SparkTensorFrames: Google Tensorflow on Apache Spark
TensorFrames: Google Tensorflow on Apache Spark
 
10 things i wish i'd known before using spark in production
10 things i wish i'd known before using spark in production10 things i wish i'd known before using spark in production
10 things i wish i'd known before using spark in production
 
Build Large-Scale Data Analytics and AI Pipeline Using RayDP
Build Large-Scale Data Analytics and AI Pipeline Using RayDPBuild Large-Scale Data Analytics and AI Pipeline Using RayDP
Build Large-Scale Data Analytics and AI Pipeline Using RayDP
 
NvFX GTC 2013
NvFX GTC 2013NvFX GTC 2013
NvFX GTC 2013
 
Deep Learning for Computer Vision: Software Frameworks (UPC 2016)
Deep Learning for Computer Vision: Software Frameworks (UPC 2016)Deep Learning for Computer Vision: Software Frameworks (UPC 2016)
Deep Learning for Computer Vision: Software Frameworks (UPC 2016)
 
Profiling deep learning network using NVIDIA nsight systems
Profiling deep learning network using NVIDIA nsight systemsProfiling deep learning network using NVIDIA nsight systems
Profiling deep learning network using NVIDIA nsight systems
 
Distributed Multi-GPU Computing with Dask, CuPy and RAPIDS
Distributed Multi-GPU Computing with Dask, CuPy and RAPIDSDistributed Multi-GPU Computing with Dask, CuPy and RAPIDS
Distributed Multi-GPU Computing with Dask, CuPy and RAPIDS
 
Jump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and DatabricksJump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and Databricks
 
Scala Meetup Hamburg - Spark
Scala Meetup Hamburg - SparkScala Meetup Hamburg - Spark
Scala Meetup Hamburg - Spark
 

Kürzlich hochgeladen

ENG 5 Q4 WEEk 1 DAY 1 Restate sentences heard in one’s own words. Use appropr...
ENG 5 Q4 WEEk 1 DAY 1 Restate sentences heard in one’s own words. Use appropr...ENG 5 Q4 WEEk 1 DAY 1 Restate sentences heard in one’s own words. Use appropr...
ENG 5 Q4 WEEk 1 DAY 1 Restate sentences heard in one’s own words. Use appropr...JojoEDelaCruz
 
Dust Of Snow By Robert Frost Class-X English CBSE
Dust Of Snow By Robert Frost Class-X English CBSEDust Of Snow By Robert Frost Class-X English CBSE
Dust Of Snow By Robert Frost Class-X English CBSEaurabinda banchhor
 
Oppenheimer Film Discussion for Philosophy and Film
Oppenheimer Film Discussion for Philosophy and FilmOppenheimer Film Discussion for Philosophy and Film
Oppenheimer Film Discussion for Philosophy and FilmStan Meyer
 
Student Profile Sample - We help schools to connect the data they have, with ...
Student Profile Sample - We help schools to connect the data they have, with ...Student Profile Sample - We help schools to connect the data they have, with ...
Student Profile Sample - We help schools to connect the data they have, with ...Seán Kennedy
 
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...Postal Advocate Inc.
 
4.16.24 Poverty and Precarity--Desmond.pptx
4.16.24 Poverty and Precarity--Desmond.pptx4.16.24 Poverty and Precarity--Desmond.pptx
4.16.24 Poverty and Precarity--Desmond.pptxmary850239
 
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATIONTHEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATIONHumphrey A Beña
 
How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17Celine George
 
AUDIENCE THEORY -CULTIVATION THEORY - GERBNER.pptx
AUDIENCE THEORY -CULTIVATION THEORY -  GERBNER.pptxAUDIENCE THEORY -CULTIVATION THEORY -  GERBNER.pptx
AUDIENCE THEORY -CULTIVATION THEORY - GERBNER.pptxiammrhaywood
 
Active Learning Strategies (in short ALS).pdf
Active Learning Strategies (in short ALS).pdfActive Learning Strategies (in short ALS).pdf
Active Learning Strategies (in short ALS).pdfPatidar M
 
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17Celine George
 
Transaction Management in Database Management System
Transaction Management in Database Management SystemTransaction Management in Database Management System
Transaction Management in Database Management SystemChristalin Nelson
 
Activity 2-unit 2-update 2024. English translation
Activity 2-unit 2-update 2024. English translationActivity 2-unit 2-update 2024. English translation
Activity 2-unit 2-update 2024. English translationRosabel UA
 
Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptx
Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptxQ4-PPT-Music9_Lesson-1-Romantic-Opera.pptx
Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptxlancelewisportillo
 
TEACHER REFLECTION FORM (NEW SET........).docx
TEACHER REFLECTION FORM (NEW SET........).docxTEACHER REFLECTION FORM (NEW SET........).docx
TEACHER REFLECTION FORM (NEW SET........).docxruthvilladarez
 
Keynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-designKeynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-designMIPLM
 
ROLES IN A STAGE PRODUCTION in arts.pptx
ROLES IN A STAGE PRODUCTION in arts.pptxROLES IN A STAGE PRODUCTION in arts.pptx
ROLES IN A STAGE PRODUCTION in arts.pptxVanesaIglesias10
 
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptxMULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptxAnupkumar Sharma
 

Kürzlich hochgeladen (20)

ENG 5 Q4 WEEk 1 DAY 1 Restate sentences heard in one’s own words. Use appropr...
ENG 5 Q4 WEEk 1 DAY 1 Restate sentences heard in one’s own words. Use appropr...ENG 5 Q4 WEEk 1 DAY 1 Restate sentences heard in one’s own words. Use appropr...
ENG 5 Q4 WEEk 1 DAY 1 Restate sentences heard in one’s own words. Use appropr...
 
YOUVE_GOT_EMAIL_PRELIMS_EL_DORADO_2024.pptx
YOUVE_GOT_EMAIL_PRELIMS_EL_DORADO_2024.pptxYOUVE_GOT_EMAIL_PRELIMS_EL_DORADO_2024.pptx
YOUVE_GOT_EMAIL_PRELIMS_EL_DORADO_2024.pptx
 
Dust Of Snow By Robert Frost Class-X English CBSE
Dust Of Snow By Robert Frost Class-X English CBSEDust Of Snow By Robert Frost Class-X English CBSE
Dust Of Snow By Robert Frost Class-X English CBSE
 
Oppenheimer Film Discussion for Philosophy and Film
Oppenheimer Film Discussion for Philosophy and FilmOppenheimer Film Discussion for Philosophy and Film
Oppenheimer Film Discussion for Philosophy and Film
 
Student Profile Sample - We help schools to connect the data they have, with ...
Student Profile Sample - We help schools to connect the data they have, with ...Student Profile Sample - We help schools to connect the data they have, with ...
Student Profile Sample - We help schools to connect the data they have, with ...
 
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...
 
4.16.24 Poverty and Precarity--Desmond.pptx
4.16.24 Poverty and Precarity--Desmond.pptx4.16.24 Poverty and Precarity--Desmond.pptx
4.16.24 Poverty and Precarity--Desmond.pptx
 
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATIONTHEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
 
Paradigm shift in nursing research by RS MEHTA
Paradigm shift in nursing research by RS MEHTAParadigm shift in nursing research by RS MEHTA
Paradigm shift in nursing research by RS MEHTA
 
How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17
 
AUDIENCE THEORY -CULTIVATION THEORY - GERBNER.pptx
AUDIENCE THEORY -CULTIVATION THEORY -  GERBNER.pptxAUDIENCE THEORY -CULTIVATION THEORY -  GERBNER.pptx
AUDIENCE THEORY -CULTIVATION THEORY - GERBNER.pptx
 
Active Learning Strategies (in short ALS).pdf
Active Learning Strategies (in short ALS).pdfActive Learning Strategies (in short ALS).pdf
Active Learning Strategies (in short ALS).pdf
 
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
 
Transaction Management in Database Management System
Transaction Management in Database Management SystemTransaction Management in Database Management System
Transaction Management in Database Management System
 
Activity 2-unit 2-update 2024. English translation
Activity 2-unit 2-update 2024. English translationActivity 2-unit 2-update 2024. English translation
Activity 2-unit 2-update 2024. English translation
 
Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptx
Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptxQ4-PPT-Music9_Lesson-1-Romantic-Opera.pptx
Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptx
 
TEACHER REFLECTION FORM (NEW SET........).docx
TEACHER REFLECTION FORM (NEW SET........).docxTEACHER REFLECTION FORM (NEW SET........).docx
TEACHER REFLECTION FORM (NEW SET........).docx
 
Keynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-designKeynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-design
 
ROLES IN A STAGE PRODUCTION in arts.pptx
ROLES IN A STAGE PRODUCTION in arts.pptxROLES IN A STAGE PRODUCTION in arts.pptx
ROLES IN A STAGE PRODUCTION in arts.pptx
 
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptxMULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
 

PySpark in practice slides

  • 1. PYSPARK IN PRACTICE PYDATA LONDON 2016 Ronert Obst Senior Data Scientist Dat Tran Data Scientist     0
  • 2. AGENDA ○ Short introduction ○ Data structures ○ Configuration and performance ○ Unit testing with PySpark ○ Data pipeline management and workflows ○ Online learning with PySpark streaming ○ Operationalisation
  • 5. DATA STRUCTURES ○ RDDs ○ DataFrames ○ DataSets
  • 6. DATAFRAMES ○ Built on top of RDD ○ Include metadata ○ Turns PySpark API calls into query plan ○ Less flexibel than RDD ○ Python UDFs impact performance, use builtin functions whenever possible ○ HiveContext ftw
  • 7. PYTHON DATAFRAME PERFORMANCE Source: Databricks - Spark in Production (Aaron Davidson)
  • 9. CONFIGURATION PRECEDENCE 1. Programatically (SparkConf()) 2. Command line (spark-submit --master yarn --executor- memory 20G ...) 3. Configuration files (spark-defaults.conf, spark-env.sh)
  • 10. SPARK 1.6 MEMORY MANAGEMENT Source: Spark Memory Management (Alexey Grishchenko) (https://0x0fff.com/spark- memory-management/)
  • 11. WORKED EXAMPLE CLUSTER HARDWARE: ○ 6 nodes with 16 cores and 64 GB RAM each YARN CONFIG: ○ yarn.nodemanager.resource.memory-mb: 61 GB ○ yarn.nodemanager.resource.cpu-vcores: 15 SPARK-DEFAULTS.CONF ○ num-executors: 4 executors per node * 6 nodes = 24 ○ executor-cores: 6 ○ executor-memory: 11600 MB
  • 12. PYTHON OOM More Python memory, e.g. scikit-learn ○ Reduce concurrency (4 executors per node with 4 executor cores) ○ Set spark.python.worker.memory = 1024 MB (default is 512 MB) ○ Leaves spark.executor.memory = 10600 MB
  • 13. OTHER USEFUL SETTINGS ○ Save substantial memory: spark.rdd.compress = true ○ Relaunch stragglers: spark.speculation = true ○ Fix Python 3 hashseed issue: spark.executorEnv.PYTHONHASHSEED = 0
  • 14. TUNING PERFORMANCE ○ Re-use RDDs with cache()(set storage level to MEMORY_AND_DISK) ○ Avoid groupByKey() on RDDs ○ Distribute by key data = sql_context.read.parquet(hdfs://...).repartition(N_PARTITIONS, "key") ○ Use treeReduceand treeAggregateinstead of reduceand aggregate ○ sc.broadcast(broadcast_var)small tables containing frequently accessed data
  • 15. UNIT TESTING WITH PYSPARK
  • 16. MOST DATA SCIENTIST WRITE CODE THAT JUST WORKS
  • 18. TDD IS THE SOLUTION!
  • 19. THE REALITY CAN BE DIFFICULT...
  • 21. PRODUCTION PHASE # Import script modules here import clustering class ClusteringTest(unittest.TestCase): def setUp(self): """Create a single node Spark application.""" conf = SparkConf() conf.set("spark.executor.memory", "1g") conf.set("spark.cores.max", "1") conf.set("spark.app.name", "nosetest") self.sc = SparkContext(conf=conf) self.mock_df = self.mock_data() def tearDown(self): """Stop the SparkContext.""" self.sc.stop()
  • 22. TEST: IMPLEMENTATION: Full example: def test_assign_cluster(self): """Check if rows are labeled are as expected.""" input_df = clustering.convert_df(self.sc, self.mock_df) scaled_df = clustering.rescale_df(input_df) label_df = clustering.assign_cluster(scaled_df) self.assertEqual(label_df.map(lambda x: x.label).collect(), [0, 0, 0, 1, 1]) def assign_cluster(data): """Train kmeans on rescaled data and then label the rescaled data.""" kmeans = KMeans(k=2, seed=1, featuresCol="features_scaled", predictio nCol="label") model = kmeans.fit(data) label_df = model.transform(data) return label_df https://github.com/datitran/spark-tdd-example (https://github.com/datitran/spark-tdd-example)
  • 24. PIPELINE MANAGEMENT CAN BE DIFFICULT...
  • 27. ○ 100% Python ○ Web dashboard ○ Parallel workers ○ Central scheduler ○ Many templates for various tasks e.g. Spark, SQL etc. ○ Well configured logger ○ Parameters
  • 28. EXAMPLE SPARK TASK class ClusteringTask(SparkSubmitTask): """Assign cluster label to some data.""" date = luigi.DateParameter() num_k = luigi.IntParameter() name = "Clustering with PySpark" app = "../clustering.py" def app_options(self): return [self.num_k] def requires(self): return [FeatureEngineeringTask(date=self.date)] def output(self): return HdfsTarget("hdfs://...")
  • 29. EXAMPLE CONFIG FILE client.cfg [resources] spark: 2 [spark] master: yarn executor-cores: 3 num-executors: 11 executor-memory: 20G
  • 31. CAVEATS ○ No built-in job trigger ○ Documentation could be better ○ Logging can be too much when having multiple workers ○ Duplicated parameters in all downstream tasks
  • 33. ONLINE LEARNING WITH PYSPARK from pyspark.mllib.linalg import Vectors from pyspark.mllib.regression import LabeledPoint from pyspark.mllib.regression import StreamingLinearRegressionWithSGD def parse(lp): label = float(lp[lp.find('(') + 1: lp.find(',')]) vec = Vectors.dense(lp[lp.find('[') + 1: lp.find(']')].split(',')) return LabeledPoint(label, vec) stream = scc.socketTextStream("localhost", 9999).map(parse) numFeatures = 3 model = StreamingLinearRegressionWithSGD() model.setInitialWeights([0.0, 0.0, 0.0]) model.trainOn(stream) print(model.predictOn(stream.map(lambda lp: (lp.label, lp.features)))) ssc.start() ssc.awaitTermination()
  • 35. RESTFUL API EXAMPLE WITH FLASK import pandas as pd from flask import Flask, request app = Flask(__name__) # Load model stored on redis or hdfs app.model = load_model() def predict(model, features): """Use the trained cluster centers to assign cluster to new feature s.""" differences = pd.DataFrame( model.values - features.values, columns=model.columns) differences_square = differences.apply(lambda x: x**2) col_sums = differences_square.sum(axis=1) label = col_sums.idxmin() return int(label)
  • 36. @app.route("/clustering", methods=["POST"]) def clustering(): """ curl -i -H "Content-Type: application/json" -X POST -d @example.json "x.x.x.x:5000/clustering" """ features_json = request.json features_df = pd.DataFrame(features_json) return str(predict(load_model(), feature_df))
  • 37. DEPLOY YOUR APPS WITH CF PUSH ○ Platform as a Service (PaaS) ○ "Here is my source code. Run it on the cloud for me. I do not care how." (Onsi Fakhouri) ○ Trial version: ○ Python Cloud Foundry examples: https://run.pivotal.io/ (https://run.pivotal.io/) https://github.com/ihuston/python- cf-examples (https://github.com/ihuston/python-cf-examples)