SlideShare ist ein Scribd-Unternehmen logo
1 von 49
Downloaden Sie, um offline zu lesen
Getting the best
Performance with PySpark
Who am I?
● My name is Holden Karau
● Prefered pronouns are she/her
● I’m a Principal Software Engineer at IBM’s Spark Technology Center
● previously Alpine, Databricks, Google, Foursquare & Amazon
● co-author of Learning Spark & Fast Data processing with Spark
○ co-author of a new book focused on Spark performance coming out next year*
● @holdenkarau
● Slide share http://www.slideshare.net/hkarau
● Linkedin https://www.linkedin.com/in/holdenkarau
● Github https://github.com/holdenk
● Spark Videos http://bit.ly/holdenSparkVideos
What is going to be covered:
● What I think I might know about you
● A quick background of how PySpark works
● RDD re-use (caching, persistence levels, and checkpointing)
● Working with key/value data
○ Why group key is evil and what we can do about it
● When Spark SQL can be amazing and wonderful
● A brief introduction to Datasets (new in Spark 1.6)
● Calling Scala code from Python with Spark
Torsten Reuschling
Or….
Huang
Yun
Chung
The different pieces of Spark
Apache Spark
SQL &
DataFrames
Streaming
Language
APIs
Scala,
Java,
Python, &
R
Graph
Tools
Spark ML
bagel &
Graph X
MLLib
Community
Packages
Jon Ross
Who I think you wonderful humans are?
● Nice* people
● Don’t mind pictures of cats
● Know some Apache Spark
● Want to scale your Apache Spark jobs
● Don’t overly mind a grab-bag of topics
Lori Erickson
A detour into PySpark’s internals
Photo by Bill Ward
Spark in Scala, how does PySpark work?
● Py4J + pickling + magic
○ This can be kind of slow sometimes
● RDDs are generally RDDs of pickled objects
● Spark SQL (and DataFrames) avoid some of this
So what does that look like?
Driver
py4j
Worker 1
Worker K
pipe
pipe
So how does that impact PySpark?
● Data from Spark worker serialized and piped to Python
worker
○ Multiple iterator-to-iterator transformations are still pipelined :)
● Double serialization cost makes everything more
expensive
● Python worker startup takes a bit of extra time
● Python memory isn’t controlled by the JVM - easy to go
over container limits if deploying on YARN or similar
● Error messages make ~0 sense
● etc.
Cat photo from http://galato901.deviantart.com/art/Cat-on-Work-Break-173043455
Photo from Cocoa Dream
Lets look at some old stand bys:
words = rdd.flatMap(lambda x: x.split(" "))
wordPairs = words.map(lambda w: (w, 1))
grouped = wordPairs.groupByKey()
grouped.mapValues(lambda counts: sum(counts))
warnings = rdd.filter(lambda x: x.lower.find("warning") != -1).
count()
Tomomi
RDD re-use - sadly not magic
● If we know we are going to re-use the RDD what should we do?
○ If it fits nicely in memory caching in memory
○ persisting at another level
■ MEMORY, MEMORY_AND_DISK
○ checkpointing
● Noisey clusters
○ _2 & checkpointing can help
● persist first for checkpointing
Richard Gillin
What is key skew and why do we care?
● Keys aren’t evenly distributed
○ Sales by zip code, or records by city, etc.
● groupByKey will explode (but it's pretty easy to break)
● We can have really unbalanced partitions
○ If we have enough key skew sortByKey could even fail
○ Stragglers (uneven sharding can make some tasks take much longer)
Mitchell
Joyce
groupByKey - just how evil is it?
● Pretty evil
● Groups all of the records with the same key into a single record
○ Even if we immediately reduce it (e.g. sum it or similar)
○ This can be too big to fit in memory, then our job fails
● Unless we are in SQL then happy pandas
PROgeckoam
So what does that look like?
(94110, A, B)
(94110, A, C)
(10003, D, E)
(94110, E, F)
(94110, A, R)
(10003, A, R)
(94110, D, R)
(94110, E, R)
(94110, E, R)
(67843, T, R)
(94110, T, R)
(94110, T, R)
(67843, T, R)(10003, A, R)
(94110, [(A, B), (A, C), (E, F), (A, R), (D, R), (E, R), (E, R), (T, R) (T, R)]
Tomomi
“Normal” Word count w/RDDs
lines = sc.textFile(src)
words = lines.flatMap(lambda x: x.split(" "))
word_count =
(words.map(lambda x: (x, 1))
.reduceByKey(lambda x, y: x+y))
word_count.saveAsTextFile(output)
No data is read or
processed until after
this line
This is an “action”
which forces spark to
evaluate the RDD
These are still
pipelined
inside of the
same python
executor
GroupByKey
reduceByKey
So what did we do instead?
● reduceByKey
○ Works when the types are the same (e.g. in our summing version)
● aggregateByKey
○ Doesn’t require the types to be the same (e.g. computing stats model or similar)
Allows Spark to pipeline the reduction & skip making the list
We also got a map-side reduction (note the difference in shuffled read)
Can just the shuffle cause problems?
● Sorting by key can put all of the records in the same partition
● We can run into partition size limits (around 2GB)
● Or just get bad performance
● So we can handle data like the above we can add some “junk” to our key
(94110, A, B)
(94110, A, C)
(10003, D, E)
(94110, E, F)
(94110, A, R)
(10003, A, R)
(94110, D, R)
(94110, E, R)
(94110, E, R)
(67843, T, R)
(94110, T, R)
(94110, T, R)
PROTodd
Klassy
Shuffle explosions :(
(94110, A, B)
(94110, A, C)
(10003, D, E)
(94110, E, F)
(94110, A, R)
(10003, A, R)
(94110, D, R)
(94110, E, R)
(94110, E, R)
(67843, T, R)
(94110, T, R)
(94110, T, R)
(94110, A, B)
(94110, A, C)
(94110, E, F)
(94110, A, R)
(94110, D, R)
(94110, E, R)
(94110, E, R)
(94110, T, R)
(94110, T, R)
(67843, T, R)(10003, A, R)
(10003, D, E)
javier_artiles
100% less explosions
(94110, A, B)
(94110, A, C)
(10003, D, E)
(94110, E, F)
(94110, A, R)
(10003, A, R)
(94110, D, R)
(94110, E, R)
(94110, E, R)
(67843, T, R)
(94110, T, R)
(94110, T, R)
(94110_A, A, B)
(94110_A, A, C)
(94110_A, A, R)
(94110_D, D, R)
(94110_T, T, R)
(10003_A, A, R)
(10003_D, D, E)
(67843_T, T, R)
(94110_E, E, R)
(94110_E, E, R)
(94110_E, E, F)
(94110_T, T, R)
Jennifer Williams
Well there is a bit of magic in the shuffle….
● We can reuse shuffle files
● But it can (and does) explode*
Sculpture by Flaming Lotus Girls
Photo by Zaskoda
Our saviour from serialization: DataFrames
● For the most part keeps data in the JVM
○ Notable exception is UDFs written in Python
● Takes our python calls and turns it into a query plan
● If we need more than the native operations in Spark’s
DataFrames
● be wary of Distributed Systems bringing claims of
usability….
So what are Spark DataFrames?
● More than SQL tables
● Not Pandas or R DataFrames
● Semi-structured (have schema information)
● tabular
● work on expression instead of lambdas
○ e.g. df.filter(df.col(“happy”) == true) instead of rdd.filter(lambda x: x.
happy == true))
Where can Spark SQL benefit perf?
● Structured or semi-structured data
● OK with having less* complex operations available to us
● We may only need to operate on a subset of the data
○ The fastest data to process isn’t even read
● Remember that non-magic cat? Its got some magic** now
○ In part from peeking inside of boxes
● non-JVM (aka Python & R) users: saved from double serialization cost! :)
**Magic may cause stack overflow. Not valid in all states. Consult local magic bureau before attempting
magic
Matti Mattila
Why is Spark SQL good for those things?
● Space efficient columnar cached representation
● Able to push down operations to the data store
● Optimizer is able to look inside of our operations
○ Regular spark can’t see inside our operations to spot the difference between (min(_, _)) and
(append(_, _))
Matti Mattila
How much faster can it be? (Scala)
Andrew Skudder
How much faster can it be? (Python)
Andrew Skudder
*Note: do not compare absolute #s with previous graph -
different dataset sizes because I forgot to write it down when I
made the first one.
Word count w/Dataframes
df = sqlCtx.read.load(src)
# Returns an RDD
words = df.select("text").flatMap(lambda x: x.text.split(" "))
words_df = words.map(
lambda x: Row(word=x, cnt=1)).toDF()
word_count = words_df.groupBy("word").sum()
word_count.write.format("parquet").save("wc.parquet")
Still have the double
serialization here :(
Or we can make a UDF
def function(x):
# Some magic
sqlContext.registerFunction(“name”, function,
IntegerType())
Buuuut….
● Our UDFs will be “slow” (e.g. require data copy from
executor and back)
Mixing Python & JVM code FTW:
● DataFrames are an example of pushing our processing
to the JVM
● Python UDFS & maps lose this benefit
● But we can write Scala UDFS and call them from
Python
○ py4j error messages can be difficult to understand :(
● Trickier with RDDs since stores pickled objects
Exposing functions to be callable from
Python:
// functions we want to be callable from python
object functions {
def kurtosis(e: Column): Column = new Column
(Kurtosis(EvilSqlTools.getExpr(e)))
def registerUdfs(sqlCtx: SQLContext): Unit = {
sqlCtx.udf.register("rowKurtosis", helpers.rowKurtosis _)
}
}
Calling the functions with py4j*:
● The SparkContext has a reference to the jvm (_jvm)
● Many Python objects which are wrappers of JVM
objects have _j[objtype] to get the JVM object
○ rdd._jrdd
○ df._jdf
○ sc._jsc
● These are all private and may change
*The py4j bridge only exists on the driver**
** Not exactly true but close enough
e.g.:
def register_sql_extensions(sql_ctx):
scala_sql_context = sql_ctx._ssql_ctx
spark_ctx = sql_ctx._sc
(spark_ctx._jvm.com.sparklingpandas.functions
.registerUdfs(scala_sql_context))
More things to keep in mind with DFs (in Python)
● Schema serialized as json from JVM
● toPandas is essentially collect
● joins can result in the cross product
○ big data x big data =~ out of memory
● Pre 2.0: Use the HiveContext
○ you don’t need a hive install
○ more powerful UDFs, window functions, etc.
DataFrames aren’t quite as lazy...
● Keep track of schema information
● Loading JSON data involves looking at the data
● Before if we tried to load non-existent data wouldn’t fail
right away, now fails right away
Photo by Dan G
The “future*”: Awesome UDFs
● Work going on in Scala land to translate simple Scala
into SQL expressions - need the Dataset API
○ Maybe we can try similar approaches with Python?
● Very early work going on to use Jython for simple UDFs
(e.g. 2.7 compat & no native libraries) - SPARK-15369
○ Early benchmarking w/word count 5% slower than native Scala UDF,
close to 65% faster than regular Python
● Willing to share your Python UDFs for benchmarking? -
http://bit.ly/pySparkUDF
*The future may or may not have better performance than today. But bun-bun the bunny has some lettuce so its
ok!
The “future*”: Faster interchange
● Faster interchange between Python and Spark (e.g.
Tungsten + Apache Arrow)? (SPARK-13391 & SPARK-
13534)
● Willing to share your Python UDFs for benchmarking? -
http://bit.ly/pySparkUDF
*The future may or may not have better performance than today. But bun-bun the bunny has some lettuce so its
ok!
Spark Testing Resources
● Libraries
○ Scala: spark-testing-base (scalacheck & unit) sscheck (scalacheck)
example-spark (unit)
○ Java: spark-testing-base (unit)
○ Python: spark-testing-base (unittest2), pyspark.test (pytest)
● Strata San Jose Talk (up on YouTube)
● Blog posts
○ Unit Testing Spark with Java by Jesse Anderson
○ Making Apache Spark Testing Easy with Spark Testing Base
○ Unit testing Apache Spark with py.test
raider of gin
Additional Spark Resources
● Programming guide (along with JavaDoc, PyDoc,
ScalaDoc, etc.)
○ http://spark.apache.org/docs/latest/
● Kay Ousterhout’s work
○ http://www.eecs.berkeley.edu/~keo/
● Books
● Videos
● Spark Office Hours
○ Normally in the bay area - will do Google Hangouts ones soon
○ follow me on twitter for future ones - https://twitter.com/holdenkarau
raider of gin
Learning Spark
Fast Data
Processing with
Spark
(Out of Date)
Fast Data
Processing with
Spark
(2nd edition)
Advanced
Analytics with
Spark
Learning Spark
Fast Data
Processing with
Spark
(Out of Date)
Fast Data
Processing with
Spark
(2nd edition)
Advanced
Analytics with
Spark
Coming soon:
Spark in Action
Coming soon:
High Performance Spark
And the next book…..
First four chapters are available in “Early Release”*:
● Buy from O’Reilly - http://bit.ly/highPerfSpark
● Chapter 9(ish) - Going Beyond Scala
Get notified when updated & finished:
● http://www.highperformancespark.com
● https://twitter.com/highperfspark
* Early Release means extra mistakes, but also a chance to help us make a more awesome
book.
Spark Videos
● Apache Spark Youtube Channel
● My Spark videos on YouTube -
○ http://bit.ly/holdenSparkVideos
● Spark Summit 2014 training
● Paco’s Introduction to Apache Spark
k thnx bye!
If you care about Spark testing and
don’t hate surveys: http://bit.
ly/holdenTestingSpark
Will tweet results
“eventually” @holdenkarau
PySpark Users: Have some simple
UDFs you wish ran faster you are
willing to share?:
http://bit.ly/pySparkUDF

Weitere ähnliche Inhalte

Was ist angesagt?

Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Databricks
 
Apache Spark overview
Apache Spark overviewApache Spark overview
Apache Spark overviewDataArt
 
Introduction to Spark Internals
Introduction to Spark InternalsIntroduction to Spark Internals
Introduction to Spark InternalsPietro Michiardi
 
Apache Spark Fundamentals
Apache Spark FundamentalsApache Spark Fundamentals
Apache Spark FundamentalsZahra Eskandari
 
Fine Tuning and Enhancing Performance of Apache Spark Jobs
Fine Tuning and Enhancing Performance of Apache Spark JobsFine Tuning and Enhancing Performance of Apache Spark Jobs
Fine Tuning and Enhancing Performance of Apache Spark JobsDatabricks
 
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...Spark Summit
 
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...Databricks
 
PySpark dataframe
PySpark dataframePySpark dataframe
PySpark dataframeJaemun Jung
 
PySpark in practice slides
PySpark in practice slidesPySpark in practice slides
PySpark in practice slidesDat Tran
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache SparkRahul Jain
 
Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsDatabricks
 
Apache Sparkにおけるメモリ - アプリケーションを落とさないメモリ設計手法 -
Apache Sparkにおけるメモリ - アプリケーションを落とさないメモリ設計手法 -Apache Sparkにおけるメモリ - アプリケーションを落とさないメモリ設計手法 -
Apache Sparkにおけるメモリ - アプリケーションを落とさないメモリ設計手法 -Yoshiyasu SAEKI
 
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital KediaTuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital KediaDatabricks
 
A Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQLA Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQLDatabricks
 
Programming in Spark using PySpark
Programming in Spark using PySpark      Programming in Spark using PySpark
Programming in Spark using PySpark Mostafa
 
Memory Management in Apache Spark
Memory Management in Apache SparkMemory Management in Apache Spark
Memory Management in Apache SparkDatabricks
 
Optimizing Delta/Parquet Data Lakes for Apache Spark
Optimizing Delta/Parquet Data Lakes for Apache SparkOptimizing Delta/Parquet Data Lakes for Apache Spark
Optimizing Delta/Parquet Data Lakes for Apache SparkDatabricks
 
Physical Plans in Spark SQL
Physical Plans in Spark SQLPhysical Plans in Spark SQL
Physical Plans in Spark SQLDatabricks
 
Top 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark ApplicationsTop 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark ApplicationsCloudera, Inc.
 
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...Databricks
 

Was ist angesagt? (20)

Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
 
Apache Spark overview
Apache Spark overviewApache Spark overview
Apache Spark overview
 
Introduction to Spark Internals
Introduction to Spark InternalsIntroduction to Spark Internals
Introduction to Spark Internals
 
Apache Spark Fundamentals
Apache Spark FundamentalsApache Spark Fundamentals
Apache Spark Fundamentals
 
Fine Tuning and Enhancing Performance of Apache Spark Jobs
Fine Tuning and Enhancing Performance of Apache Spark JobsFine Tuning and Enhancing Performance of Apache Spark Jobs
Fine Tuning and Enhancing Performance of Apache Spark Jobs
 
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
 
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
 
PySpark dataframe
PySpark dataframePySpark dataframe
PySpark dataframe
 
PySpark in practice slides
PySpark in practice slidesPySpark in practice slides
PySpark in practice slides
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIs
 
Apache Sparkにおけるメモリ - アプリケーションを落とさないメモリ設計手法 -
Apache Sparkにおけるメモリ - アプリケーションを落とさないメモリ設計手法 -Apache Sparkにおけるメモリ - アプリケーションを落とさないメモリ設計手法 -
Apache Sparkにおけるメモリ - アプリケーションを落とさないメモリ設計手法 -
 
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital KediaTuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
 
A Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQLA Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQL
 
Programming in Spark using PySpark
Programming in Spark using PySpark      Programming in Spark using PySpark
Programming in Spark using PySpark
 
Memory Management in Apache Spark
Memory Management in Apache SparkMemory Management in Apache Spark
Memory Management in Apache Spark
 
Optimizing Delta/Parquet Data Lakes for Apache Spark
Optimizing Delta/Parquet Data Lakes for Apache SparkOptimizing Delta/Parquet Data Lakes for Apache Spark
Optimizing Delta/Parquet Data Lakes for Apache Spark
 
Physical Plans in Spark SQL
Physical Plans in Spark SQLPhysical Plans in Spark SQL
Physical Plans in Spark SQL
 
Top 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark ApplicationsTop 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark Applications
 
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
 

Ähnlich wie Getting The Best Performance With PySpark

Getting the best performance with PySpark - Spark Summit West 2016
Getting the best performance with PySpark - Spark Summit West 2016Getting the best performance with PySpark - Spark Summit West 2016
Getting the best performance with PySpark - Spark Summit West 2016Holden Karau
 
Improving PySpark Performance - Spark Beyond the JVM @ PyData DC 2016
Improving PySpark Performance - Spark Beyond the JVM @ PyData DC 2016Improving PySpark Performance - Spark Beyond the JVM @ PyData DC 2016
Improving PySpark Performance - Spark Beyond the JVM @ PyData DC 2016Holden Karau
 
Beyond shuffling - Scala Days Berlin 2016
Beyond shuffling - Scala Days Berlin 2016Beyond shuffling - Scala Days Berlin 2016
Beyond shuffling - Scala Days Berlin 2016Holden Karau
 
Scaling with apache spark (a lesson in unintended consequences) strange loo...
Scaling with apache spark (a lesson in unintended consequences)   strange loo...Scaling with apache spark (a lesson in unintended consequences)   strange loo...
Scaling with apache spark (a lesson in unintended consequences) strange loo...Holden Karau
 
Streaming & Scaling Spark - London Spark Meetup 2016
Streaming & Scaling Spark - London Spark Meetup 2016Streaming & Scaling Spark - London Spark Meetup 2016
Streaming & Scaling Spark - London Spark Meetup 2016Holden Karau
 
Beyond Shuffling and Streaming Preview - Salt Lake City Spark Meetup
Beyond Shuffling and Streaming Preview - Salt Lake City Spark MeetupBeyond Shuffling and Streaming Preview - Salt Lake City Spark Meetup
Beyond Shuffling and Streaming Preview - Salt Lake City Spark MeetupHolden Karau
 
Beyond shuffling - Strata London 2016
Beyond shuffling - Strata London 2016Beyond shuffling - Strata London 2016
Beyond shuffling - Strata London 2016Holden Karau
 
Beyond Shuffling, Tips and Tricks for Scaling Apache Spark updated for Spark ...
Beyond Shuffling, Tips and Tricks for Scaling Apache Spark updated for Spark ...Beyond Shuffling, Tips and Tricks for Scaling Apache Spark updated for Spark ...
Beyond Shuffling, Tips and Tricks for Scaling Apache Spark updated for Spark ...Holden Karau
 
Improving PySpark performance: Spark Performance Beyond the JVM
Improving PySpark performance: Spark Performance Beyond the JVMImproving PySpark performance: Spark Performance Beyond the JVM
Improving PySpark performance: Spark Performance Beyond the JVMHolden Karau
 
The magic of (data parallel) distributed systems and where it all breaks - Re...
The magic of (data parallel) distributed systems and where it all breaks - Re...The magic of (data parallel) distributed systems and where it all breaks - Re...
The magic of (data parallel) distributed systems and where it all breaks - Re...Holden Karau
 
Introduction to Spark Datasets - Functional and relational together at last
Introduction to Spark Datasets - Functional and relational together at lastIntroduction to Spark Datasets - Functional and relational together at last
Introduction to Spark Datasets - Functional and relational together at lastHolden Karau
 
A fast introduction to PySpark with a quick look at Arrow based UDFs
A fast introduction to PySpark with a quick look at Arrow based UDFsA fast introduction to PySpark with a quick look at Arrow based UDFs
A fast introduction to PySpark with a quick look at Arrow based UDFsHolden Karau
 
Big data beyond the JVM - DDTX 2018
Big data beyond the JVM -  DDTX 2018Big data beyond the JVM -  DDTX 2018
Big data beyond the JVM - DDTX 2018Holden Karau
 
Testing and validating spark programs - Strata SJ 2016
Testing and validating spark programs - Strata SJ 2016Testing and validating spark programs - Strata SJ 2016
Testing and validating spark programs - Strata SJ 2016Holden Karau
 
Beyond Parallelize and Collect by Holden Karau
Beyond Parallelize and Collect by Holden KarauBeyond Parallelize and Collect by Holden Karau
Beyond Parallelize and Collect by Holden KarauSpark Summit
 
A super fast introduction to Spark and glance at BEAM
A super fast introduction to Spark and glance at BEAMA super fast introduction to Spark and glance at BEAM
A super fast introduction to Spark and glance at BEAMHolden Karau
 
Beyond Wordcount with spark datasets (and scalaing) - Nide PDX Jan 2018
Beyond Wordcount  with spark datasets (and scalaing) - Nide PDX Jan 2018Beyond Wordcount  with spark datasets (and scalaing) - Nide PDX Jan 2018
Beyond Wordcount with spark datasets (and scalaing) - Nide PDX Jan 2018Holden Karau
 
Keeping the fun in functional w/ Apache Spark @ Scala Days NYC
Keeping the fun in functional   w/ Apache Spark @ Scala Days NYCKeeping the fun in functional   w/ Apache Spark @ Scala Days NYC
Keeping the fun in functional w/ Apache Spark @ Scala Days NYCHolden Karau
 
Beyond Shuffling - Effective Tips and Tricks for Scaling Spark (Vancouver Sp...
Beyond Shuffling  - Effective Tips and Tricks for Scaling Spark (Vancouver Sp...Beyond Shuffling  - Effective Tips and Tricks for Scaling Spark (Vancouver Sp...
Beyond Shuffling - Effective Tips and Tricks for Scaling Spark (Vancouver Sp...Holden Karau
 
Build Large-Scale Data Analytics and AI Pipeline Using RayDP
Build Large-Scale Data Analytics and AI Pipeline Using RayDPBuild Large-Scale Data Analytics and AI Pipeline Using RayDP
Build Large-Scale Data Analytics and AI Pipeline Using RayDPDatabricks
 

Ähnlich wie Getting The Best Performance With PySpark (20)

Getting the best performance with PySpark - Spark Summit West 2016
Getting the best performance with PySpark - Spark Summit West 2016Getting the best performance with PySpark - Spark Summit West 2016
Getting the best performance with PySpark - Spark Summit West 2016
 
Improving PySpark Performance - Spark Beyond the JVM @ PyData DC 2016
Improving PySpark Performance - Spark Beyond the JVM @ PyData DC 2016Improving PySpark Performance - Spark Beyond the JVM @ PyData DC 2016
Improving PySpark Performance - Spark Beyond the JVM @ PyData DC 2016
 
Beyond shuffling - Scala Days Berlin 2016
Beyond shuffling - Scala Days Berlin 2016Beyond shuffling - Scala Days Berlin 2016
Beyond shuffling - Scala Days Berlin 2016
 
Scaling with apache spark (a lesson in unintended consequences) strange loo...
Scaling with apache spark (a lesson in unintended consequences)   strange loo...Scaling with apache spark (a lesson in unintended consequences)   strange loo...
Scaling with apache spark (a lesson in unintended consequences) strange loo...
 
Streaming & Scaling Spark - London Spark Meetup 2016
Streaming & Scaling Spark - London Spark Meetup 2016Streaming & Scaling Spark - London Spark Meetup 2016
Streaming & Scaling Spark - London Spark Meetup 2016
 
Beyond Shuffling and Streaming Preview - Salt Lake City Spark Meetup
Beyond Shuffling and Streaming Preview - Salt Lake City Spark MeetupBeyond Shuffling and Streaming Preview - Salt Lake City Spark Meetup
Beyond Shuffling and Streaming Preview - Salt Lake City Spark Meetup
 
Beyond shuffling - Strata London 2016
Beyond shuffling - Strata London 2016Beyond shuffling - Strata London 2016
Beyond shuffling - Strata London 2016
 
Beyond Shuffling, Tips and Tricks for Scaling Apache Spark updated for Spark ...
Beyond Shuffling, Tips and Tricks for Scaling Apache Spark updated for Spark ...Beyond Shuffling, Tips and Tricks for Scaling Apache Spark updated for Spark ...
Beyond Shuffling, Tips and Tricks for Scaling Apache Spark updated for Spark ...
 
Improving PySpark performance: Spark Performance Beyond the JVM
Improving PySpark performance: Spark Performance Beyond the JVMImproving PySpark performance: Spark Performance Beyond the JVM
Improving PySpark performance: Spark Performance Beyond the JVM
 
The magic of (data parallel) distributed systems and where it all breaks - Re...
The magic of (data parallel) distributed systems and where it all breaks - Re...The magic of (data parallel) distributed systems and where it all breaks - Re...
The magic of (data parallel) distributed systems and where it all breaks - Re...
 
Introduction to Spark Datasets - Functional and relational together at last
Introduction to Spark Datasets - Functional and relational together at lastIntroduction to Spark Datasets - Functional and relational together at last
Introduction to Spark Datasets - Functional and relational together at last
 
A fast introduction to PySpark with a quick look at Arrow based UDFs
A fast introduction to PySpark with a quick look at Arrow based UDFsA fast introduction to PySpark with a quick look at Arrow based UDFs
A fast introduction to PySpark with a quick look at Arrow based UDFs
 
Big data beyond the JVM - DDTX 2018
Big data beyond the JVM -  DDTX 2018Big data beyond the JVM -  DDTX 2018
Big data beyond the JVM - DDTX 2018
 
Testing and validating spark programs - Strata SJ 2016
Testing and validating spark programs - Strata SJ 2016Testing and validating spark programs - Strata SJ 2016
Testing and validating spark programs - Strata SJ 2016
 
Beyond Parallelize and Collect by Holden Karau
Beyond Parallelize and Collect by Holden KarauBeyond Parallelize and Collect by Holden Karau
Beyond Parallelize and Collect by Holden Karau
 
A super fast introduction to Spark and glance at BEAM
A super fast introduction to Spark and glance at BEAMA super fast introduction to Spark and glance at BEAM
A super fast introduction to Spark and glance at BEAM
 
Beyond Wordcount with spark datasets (and scalaing) - Nide PDX Jan 2018
Beyond Wordcount  with spark datasets (and scalaing) - Nide PDX Jan 2018Beyond Wordcount  with spark datasets (and scalaing) - Nide PDX Jan 2018
Beyond Wordcount with spark datasets (and scalaing) - Nide PDX Jan 2018
 
Keeping the fun in functional w/ Apache Spark @ Scala Days NYC
Keeping the fun in functional   w/ Apache Spark @ Scala Days NYCKeeping the fun in functional   w/ Apache Spark @ Scala Days NYC
Keeping the fun in functional w/ Apache Spark @ Scala Days NYC
 
Beyond Shuffling - Effective Tips and Tricks for Scaling Spark (Vancouver Sp...
Beyond Shuffling  - Effective Tips and Tricks for Scaling Spark (Vancouver Sp...Beyond Shuffling  - Effective Tips and Tricks for Scaling Spark (Vancouver Sp...
Beyond Shuffling - Effective Tips and Tricks for Scaling Spark (Vancouver Sp...
 
Build Large-Scale Data Analytics and AI Pipeline Using RayDP
Build Large-Scale Data Analytics and AI Pipeline Using RayDPBuild Large-Scale Data Analytics and AI Pipeline Using RayDP
Build Large-Scale Data Analytics and AI Pipeline Using RayDP
 

Mehr von Spark Summit

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang Spark Summit
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...Spark Summit
 
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang WuSpark Summit
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya RaghavendraSpark Summit
 
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...Spark Summit
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...Spark Summit
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingSpark Summit
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingSpark Summit
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...Spark Summit
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakSpark Summit
 
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimSpark Summit
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraSpark Summit
 
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Spark Summit
 
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...Spark Summit
 
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spark Summit
 
Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovSpark Summit
 
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Spark Summit
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkSpark Summit
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Spark Summit
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...Spark Summit
 

Mehr von Spark Summit (20)

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
 
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
 
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub Wozniak
 
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin Kim
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
 
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
 
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
 
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
 
Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim Simeonov
 
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir Volk
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
 

Kürzlich hochgeladen

NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...Boston Institute of Analytics
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPTBoston Institute of Analytics
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Jack DiGiovanna
 
Machine learning classification ppt.ppt
Machine learning classification  ppt.pptMachine learning classification  ppt.ppt
Machine learning classification ppt.pptamreenkhanum0307
 
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfBoston Institute of Analytics
 
While-For-loop in python used in college
While-For-loop in python used in collegeWhile-For-loop in python used in college
While-For-loop in python used in collegessuser7a7cd61
 
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...GQ Research
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...dajasot375
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Seán Kennedy
 
Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfchwongval
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Cantervoginip
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPramod Kumar Srivastava
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...limedy534
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Seán Kennedy
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样vhwb25kk
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queensdataanalyticsqueen03
 
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptxNLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptxBoston Institute of Analytics
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectBoston Institute of Analytics
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our WorldEduminds Learning
 

Kürzlich hochgeladen (20)

NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
 
Machine learning classification ppt.ppt
Machine learning classification  ppt.pptMachine learning classification  ppt.ppt
Machine learning classification ppt.ppt
 
Call Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort ServiceCall Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort Service
 
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
 
While-For-loop in python used in college
While-For-loop in python used in collegeWhile-For-loop in python used in college
While-For-loop in python used in college
 
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...
 
Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdf
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Canter
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queens
 
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptxNLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis Project
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our World
 

Getting The Best Performance With PySpark

  • 2. Who am I? ● My name is Holden Karau ● Prefered pronouns are she/her ● I’m a Principal Software Engineer at IBM’s Spark Technology Center ● previously Alpine, Databricks, Google, Foursquare & Amazon ● co-author of Learning Spark & Fast Data processing with Spark ○ co-author of a new book focused on Spark performance coming out next year* ● @holdenkarau ● Slide share http://www.slideshare.net/hkarau ● Linkedin https://www.linkedin.com/in/holdenkarau ● Github https://github.com/holdenk ● Spark Videos http://bit.ly/holdenSparkVideos
  • 3. What is going to be covered: ● What I think I might know about you ● A quick background of how PySpark works ● RDD re-use (caching, persistence levels, and checkpointing) ● Working with key/value data ○ Why group key is evil and what we can do about it ● When Spark SQL can be amazing and wonderful ● A brief introduction to Datasets (new in Spark 1.6) ● Calling Scala code from Python with Spark Torsten Reuschling
  • 5. The different pieces of Spark Apache Spark SQL & DataFrames Streaming Language APIs Scala, Java, Python, & R Graph Tools Spark ML bagel & Graph X MLLib Community Packages Jon Ross
  • 6. Who I think you wonderful humans are? ● Nice* people ● Don’t mind pictures of cats ● Know some Apache Spark ● Want to scale your Apache Spark jobs ● Don’t overly mind a grab-bag of topics Lori Erickson
  • 7. A detour into PySpark’s internals Photo by Bill Ward
  • 8. Spark in Scala, how does PySpark work? ● Py4J + pickling + magic ○ This can be kind of slow sometimes ● RDDs are generally RDDs of pickled objects ● Spark SQL (and DataFrames) avoid some of this
  • 9. So what does that look like? Driver py4j Worker 1 Worker K pipe pipe
  • 10. So how does that impact PySpark? ● Data from Spark worker serialized and piped to Python worker ○ Multiple iterator-to-iterator transformations are still pipelined :) ● Double serialization cost makes everything more expensive ● Python worker startup takes a bit of extra time ● Python memory isn’t controlled by the JVM - easy to go over container limits if deploying on YARN or similar ● Error messages make ~0 sense ● etc.
  • 11. Cat photo from http://galato901.deviantart.com/art/Cat-on-Work-Break-173043455 Photo from Cocoa Dream
  • 12. Lets look at some old stand bys: words = rdd.flatMap(lambda x: x.split(" ")) wordPairs = words.map(lambda w: (w, 1)) grouped = wordPairs.groupByKey() grouped.mapValues(lambda counts: sum(counts)) warnings = rdd.filter(lambda x: x.lower.find("warning") != -1). count() Tomomi
  • 13. RDD re-use - sadly not magic ● If we know we are going to re-use the RDD what should we do? ○ If it fits nicely in memory caching in memory ○ persisting at another level ■ MEMORY, MEMORY_AND_DISK ○ checkpointing ● Noisey clusters ○ _2 & checkpointing can help ● persist first for checkpointing Richard Gillin
  • 14. What is key skew and why do we care? ● Keys aren’t evenly distributed ○ Sales by zip code, or records by city, etc. ● groupByKey will explode (but it's pretty easy to break) ● We can have really unbalanced partitions ○ If we have enough key skew sortByKey could even fail ○ Stragglers (uneven sharding can make some tasks take much longer) Mitchell Joyce
  • 15. groupByKey - just how evil is it? ● Pretty evil ● Groups all of the records with the same key into a single record ○ Even if we immediately reduce it (e.g. sum it or similar) ○ This can be too big to fit in memory, then our job fails ● Unless we are in SQL then happy pandas PROgeckoam
  • 16. So what does that look like? (94110, A, B) (94110, A, C) (10003, D, E) (94110, E, F) (94110, A, R) (10003, A, R) (94110, D, R) (94110, E, R) (94110, E, R) (67843, T, R) (94110, T, R) (94110, T, R) (67843, T, R)(10003, A, R) (94110, [(A, B), (A, C), (E, F), (A, R), (D, R), (E, R), (E, R), (T, R) (T, R)] Tomomi
  • 17. “Normal” Word count w/RDDs lines = sc.textFile(src) words = lines.flatMap(lambda x: x.split(" ")) word_count = (words.map(lambda x: (x, 1)) .reduceByKey(lambda x, y: x+y)) word_count.saveAsTextFile(output) No data is read or processed until after this line This is an “action” which forces spark to evaluate the RDD These are still pipelined inside of the same python executor
  • 20. So what did we do instead? ● reduceByKey ○ Works when the types are the same (e.g. in our summing version) ● aggregateByKey ○ Doesn’t require the types to be the same (e.g. computing stats model or similar) Allows Spark to pipeline the reduction & skip making the list We also got a map-side reduction (note the difference in shuffled read)
  • 21. Can just the shuffle cause problems? ● Sorting by key can put all of the records in the same partition ● We can run into partition size limits (around 2GB) ● Or just get bad performance ● So we can handle data like the above we can add some “junk” to our key (94110, A, B) (94110, A, C) (10003, D, E) (94110, E, F) (94110, A, R) (10003, A, R) (94110, D, R) (94110, E, R) (94110, E, R) (67843, T, R) (94110, T, R) (94110, T, R) PROTodd Klassy
  • 22. Shuffle explosions :( (94110, A, B) (94110, A, C) (10003, D, E) (94110, E, F) (94110, A, R) (10003, A, R) (94110, D, R) (94110, E, R) (94110, E, R) (67843, T, R) (94110, T, R) (94110, T, R) (94110, A, B) (94110, A, C) (94110, E, F) (94110, A, R) (94110, D, R) (94110, E, R) (94110, E, R) (94110, T, R) (94110, T, R) (67843, T, R)(10003, A, R) (10003, D, E) javier_artiles
  • 23. 100% less explosions (94110, A, B) (94110, A, C) (10003, D, E) (94110, E, F) (94110, A, R) (10003, A, R) (94110, D, R) (94110, E, R) (94110, E, R) (67843, T, R) (94110, T, R) (94110, T, R) (94110_A, A, B) (94110_A, A, C) (94110_A, A, R) (94110_D, D, R) (94110_T, T, R) (10003_A, A, R) (10003_D, D, E) (67843_T, T, R) (94110_E, E, R) (94110_E, E, R) (94110_E, E, F) (94110_T, T, R) Jennifer Williams
  • 24. Well there is a bit of magic in the shuffle…. ● We can reuse shuffle files ● But it can (and does) explode* Sculpture by Flaming Lotus Girls Photo by Zaskoda
  • 25. Our saviour from serialization: DataFrames ● For the most part keeps data in the JVM ○ Notable exception is UDFs written in Python ● Takes our python calls and turns it into a query plan ● If we need more than the native operations in Spark’s DataFrames ● be wary of Distributed Systems bringing claims of usability….
  • 26. So what are Spark DataFrames? ● More than SQL tables ● Not Pandas or R DataFrames ● Semi-structured (have schema information) ● tabular ● work on expression instead of lambdas ○ e.g. df.filter(df.col(“happy”) == true) instead of rdd.filter(lambda x: x. happy == true))
  • 27. Where can Spark SQL benefit perf? ● Structured or semi-structured data ● OK with having less* complex operations available to us ● We may only need to operate on a subset of the data ○ The fastest data to process isn’t even read ● Remember that non-magic cat? Its got some magic** now ○ In part from peeking inside of boxes ● non-JVM (aka Python & R) users: saved from double serialization cost! :) **Magic may cause stack overflow. Not valid in all states. Consult local magic bureau before attempting magic Matti Mattila
  • 28. Why is Spark SQL good for those things? ● Space efficient columnar cached representation ● Able to push down operations to the data store ● Optimizer is able to look inside of our operations ○ Regular spark can’t see inside our operations to spot the difference between (min(_, _)) and (append(_, _)) Matti Mattila
  • 29. How much faster can it be? (Scala) Andrew Skudder
  • 30. How much faster can it be? (Python) Andrew Skudder *Note: do not compare absolute #s with previous graph - different dataset sizes because I forgot to write it down when I made the first one.
  • 31. Word count w/Dataframes df = sqlCtx.read.load(src) # Returns an RDD words = df.select("text").flatMap(lambda x: x.text.split(" ")) words_df = words.map( lambda x: Row(word=x, cnt=1)).toDF() word_count = words_df.groupBy("word").sum() word_count.write.format("parquet").save("wc.parquet") Still have the double serialization here :(
  • 32. Or we can make a UDF def function(x): # Some magic sqlContext.registerFunction(“name”, function, IntegerType())
  • 33. Buuuut…. ● Our UDFs will be “slow” (e.g. require data copy from executor and back)
  • 34. Mixing Python & JVM code FTW: ● DataFrames are an example of pushing our processing to the JVM ● Python UDFS & maps lose this benefit ● But we can write Scala UDFS and call them from Python ○ py4j error messages can be difficult to understand :( ● Trickier with RDDs since stores pickled objects
  • 35. Exposing functions to be callable from Python: // functions we want to be callable from python object functions { def kurtosis(e: Column): Column = new Column (Kurtosis(EvilSqlTools.getExpr(e))) def registerUdfs(sqlCtx: SQLContext): Unit = { sqlCtx.udf.register("rowKurtosis", helpers.rowKurtosis _) } }
  • 36. Calling the functions with py4j*: ● The SparkContext has a reference to the jvm (_jvm) ● Many Python objects which are wrappers of JVM objects have _j[objtype] to get the JVM object ○ rdd._jrdd ○ df._jdf ○ sc._jsc ● These are all private and may change *The py4j bridge only exists on the driver** ** Not exactly true but close enough
  • 37. e.g.: def register_sql_extensions(sql_ctx): scala_sql_context = sql_ctx._ssql_ctx spark_ctx = sql_ctx._sc (spark_ctx._jvm.com.sparklingpandas.functions .registerUdfs(scala_sql_context))
  • 38. More things to keep in mind with DFs (in Python) ● Schema serialized as json from JVM ● toPandas is essentially collect ● joins can result in the cross product ○ big data x big data =~ out of memory ● Pre 2.0: Use the HiveContext ○ you don’t need a hive install ○ more powerful UDFs, window functions, etc.
  • 39. DataFrames aren’t quite as lazy... ● Keep track of schema information ● Loading JSON data involves looking at the data ● Before if we tried to load non-existent data wouldn’t fail right away, now fails right away Photo by Dan G
  • 40. The “future*”: Awesome UDFs ● Work going on in Scala land to translate simple Scala into SQL expressions - need the Dataset API ○ Maybe we can try similar approaches with Python? ● Very early work going on to use Jython for simple UDFs (e.g. 2.7 compat & no native libraries) - SPARK-15369 ○ Early benchmarking w/word count 5% slower than native Scala UDF, close to 65% faster than regular Python ● Willing to share your Python UDFs for benchmarking? - http://bit.ly/pySparkUDF *The future may or may not have better performance than today. But bun-bun the bunny has some lettuce so its ok!
  • 41.
  • 42. The “future*”: Faster interchange ● Faster interchange between Python and Spark (e.g. Tungsten + Apache Arrow)? (SPARK-13391 & SPARK- 13534) ● Willing to share your Python UDFs for benchmarking? - http://bit.ly/pySparkUDF *The future may or may not have better performance than today. But bun-bun the bunny has some lettuce so its ok!
  • 43. Spark Testing Resources ● Libraries ○ Scala: spark-testing-base (scalacheck & unit) sscheck (scalacheck) example-spark (unit) ○ Java: spark-testing-base (unit) ○ Python: spark-testing-base (unittest2), pyspark.test (pytest) ● Strata San Jose Talk (up on YouTube) ● Blog posts ○ Unit Testing Spark with Java by Jesse Anderson ○ Making Apache Spark Testing Easy with Spark Testing Base ○ Unit testing Apache Spark with py.test raider of gin
  • 44. Additional Spark Resources ● Programming guide (along with JavaDoc, PyDoc, ScalaDoc, etc.) ○ http://spark.apache.org/docs/latest/ ● Kay Ousterhout’s work ○ http://www.eecs.berkeley.edu/~keo/ ● Books ● Videos ● Spark Office Hours ○ Normally in the bay area - will do Google Hangouts ones soon ○ follow me on twitter for future ones - https://twitter.com/holdenkarau raider of gin
  • 45. Learning Spark Fast Data Processing with Spark (Out of Date) Fast Data Processing with Spark (2nd edition) Advanced Analytics with Spark
  • 46. Learning Spark Fast Data Processing with Spark (Out of Date) Fast Data Processing with Spark (2nd edition) Advanced Analytics with Spark Coming soon: Spark in Action Coming soon: High Performance Spark
  • 47. And the next book….. First four chapters are available in “Early Release”*: ● Buy from O’Reilly - http://bit.ly/highPerfSpark ● Chapter 9(ish) - Going Beyond Scala Get notified when updated & finished: ● http://www.highperformancespark.com ● https://twitter.com/highperfspark * Early Release means extra mistakes, but also a chance to help us make a more awesome book.
  • 48. Spark Videos ● Apache Spark Youtube Channel ● My Spark videos on YouTube - ○ http://bit.ly/holdenSparkVideos ● Spark Summit 2014 training ● Paco’s Introduction to Apache Spark
  • 49. k thnx bye! If you care about Spark testing and don’t hate surveys: http://bit. ly/holdenTestingSpark Will tweet results “eventually” @holdenkarau PySpark Users: Have some simple UDFs you wish ran faster you are willing to share?: http://bit.ly/pySparkUDF