Improving PySpark Performance
Spark performance beyond the JVM
PyData Amsterdam 2016
Who am I?
● My name is Holden Karau
● Preferred pronouns are she/her
● I’m a Software Engineer at IBM
● previously Alpine, Databricks, Google, Foursquare & Amazon
● co-author of Learning Spark & Fast Data processing with Spark
○ co-author of a new book focused on Spark performance coming out this year*
● @holdenkarau
● Slide share http://www.slideshare.net/hkarau
● Linkedin https://www.linkedin.com/in/holdenkarau
● Github https://github.com/holdenk
● Spark Videos http://bit.ly/holdenSparkVideos
What is going to be covered:
● What I think I might know about you
● A brief look at Spark’s components
● A quick glance at PySpark’s architecture & its performance implications
● When Spark SQL can be amazing and wonderful
● How to mix JVM & Python code for Spark awesomeness
● Working with key/value data
○ Why groupByKey is evil and what we can do about it
Who I think you wonderful humans are?
● Nice* people
● Know Python
● Don’t mind pictures of cats
● Know some Apache Spark
○ Don’t know Spark? No worries
○ This isn’t an intro talk, but my last PyData talk was one and it’s on YouTube with some of my
other talks at http://bit.ly/holdenSparkVideos
○ Will still cover enough for everyone to follow along
● Want to make more performant PySpark Jobs
● Don’t overly mind a grab-bag of topics
Lori Erickson
Cat photo from http://galato901.deviantart.com/art/Cat-on-Work-Break-173043455
Photo from Cocoa Dream
What is Spark?
● General purpose distributed system
○ With a really nice API
○ And integrated libraries for many common tasks
● Apache project (one of the most active)
● Much faster than Hadoop Map/Reduce
The different pieces of Spark
[Diagram of Spark’s pieces]
● Apache Spark core, with language APIs for Scala, Java, Python, & R
● SQL & DataFrames
● Streaming
● MLlib & Spark ML
● Graph tools (Bagel & GraphX)
● Community packages
A detour into PySpark’s internals
Photo by Bill Ward
Spark in Scala, how does PySpark work?
● Py4J + pickling + magic
○ This can be kind of slow sometimes
● RDDs are generally RDDs of pickled objects
● Spark SQL (and DataFrames) avoid some of this
So what does that look like?
[Diagram: the driver JVM talks to Python over py4j; each worker (Worker 1 … Worker K) pipes data between the JVM and a Python worker process]
So how does that impact PySpark?
● Data from Spark worker serialized and piped to Python
worker
○ Multiple iterator-to-iterator transformations are still pipelined :)
● Double serialization cost makes everything more
expensive
● Python worker startup takes a bit of extra time
● Python memory isn’t controlled by the JVM - easy to go
over container limits if deploying on YARN or similar
● Error messages make ~0 sense
● etc.
Cat photo from http://galato901.deviantart.com/art/Cat-on-Work-Break-173043455
Word count w/RDDs
lines = sc.textFile(src)
words = lines.flatMap(lambda x: x.split(" "))
word_count = (words.map(lambda x: (x, 1))
              .reduceByKey(lambda x, y: x + y))
word_count.saveAsTextFile(output)
No data is read or processed until the saveAsTextFile line: that is an “action”, which forces Spark to evaluate the RDD. The flatMap and map transformations are still pipelined inside of the same Python executor.
Our saviour from serialization: DataFrames
● For the most part keeps data in the JVM
○ Notable exception is UDFs written in Python
● Takes our Python calls and turns them into a query plan
● If we need more than the native operations in Spark’s DataFrames we have to fall back to UDFs or RDDs
● be wary of Distributed Systems bringing claims of usability….
So what are Spark DataFrames?
● More than SQL tables
● Not Pandas or R DataFrames
● Semi-structured (have schema information)
● tabular
● work on expressions instead of lambdas
○ e.g. df.filter(df.col("happy") == true) instead of rdd.filter(lambda x: x.happy == true)
Just how fast are DataFrames? (scala)
Well let's try word count on DataFrames
● We need a way to split the words
● Could convert to an RDD
● Or extend with a UDF
● Or if we were in 2.0 (or Scala) we could use Datasets
DataFrames to RDDs and back
● map lets us work per-row
df.map(lambda row: row.text)
● Converting back
○ infer_schema
○ specify the schema
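A minimal sketch of the round trip (assuming a DataFrame df with a text column; createDataFrame with an explicit schema is the “specify the schema” path, the other call lets Spark infer it):

from pyspark.sql.types import StructType, StructField, StringType

# DataFrame -> RDD of Python objects (rows get pickled on the way into Python)
texts = df.map(lambda row: row.text)

# RDD -> DataFrame: either let Spark infer the schema by sampling rows...
inferred_df = sqlContext.createDataFrame(texts.map(lambda t: (t,)))
# ...or specify the schema explicitly and skip the inference pass
schema = StructType([StructField("text", StringType(), False)])
explicit_df = sqlContext.createDataFrame(texts.map(lambda t: (t,)), schema)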
Word count w/Dataframes
df = sqlCtx.read.load(src)
# Returns an RDD
words = df.select("text").flatMap(lambda x: x.text.split(" "))
words_df = words.map(
lambda x: Row(word=x, cnt=1)).toDF()
word_count = words_df.groupBy("word").sum()
word_count.write.format("parquet").save("wc.parquet")
Still have the double
serialization here :(
Or we can make a UDF
def function(x):
    # Some magic

sqlContext.registerFunction("name", function, IntegerType())
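As a concrete (made-up) example, a UDF that counts the words in a line, registered and then used from SQL; the UDF name, table name, and column name are all hypothetical:

from pyspark.sql.types import IntegerType

def count_words(text):
    # split on spaces and count the pieces
    return len(text.split(" "))

sqlContext.registerFunction("countWords", count_words, IntegerType())
df.registerTempTable("docs")  # assumes df has a "text" column
with_counts = sqlContext.sql("SELECT text, countWords(text) AS words FROM docs")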
Mixing Python & JVM code FTW:
● DataFrames are an example of pushing our processing
to the JVM
● Python UDFS & maps lose this benefit
● But we can write Scala UDFS and call them from
Python
○ py4j error messages can be difficult to understand :(
● Trickier with RDDs since they store pickled objects
Exposing functions to be callable from
Python:
// functions we want to be callable from python
object functions {
  def kurtosis(e: Column): Column =
    new Column(Kurtosis(EvilSqlTools.getExpr(e)))

  def registerUdfs(sqlCtx: SQLContext): Unit = {
    sqlCtx.udf.register("rowKurtosis", helpers.rowKurtosis _)
  }
}
Calling the functions with py4j*:
● The SparkContext has a reference to the JVM (_jvm)
● Many Python objects which are wrappers of JVM
objects have _j[objtype] to get the JVM object
○ rdd._jrdd
○ df._jdf
○ sc._jsc
● These are all private and may change
*The py4j bridge only exists on the driver**
** Not exactly true but close enough
e.g.:
def register_sql_extensions(sql_ctx):
    scala_sql_context = sql_ctx._ssql_ctx
    spark_ctx = sql_ctx._sc
    (spark_ctx._jvm.com.sparklingpandas.functions
        .registerUdfs(scala_sql_context))
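Once the extensions are registered, the Scala UDF can be used from Python SQL like any built-in function; a rough sketch, assuming the rowKurtosis UDF from the earlier slide and a DataFrame with numeric columns a, b, c (the table and column names are made up):

register_sql_extensions(sqlContext)
df.registerTempTable("measurements")
result = sqlContext.sql("SELECT rowKurtosis(a, b, c) AS k FROM measurements")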
More things to keep in mind with DFs (in Python)
● Schema serialized as json from JVM
● toPandas is essentially collect
● joins can result in the cross product
○ big data x big data =~ out of memory
● Use the HiveContext
○ you don’t need a hive install
○ more powerful UDFs, window functions, etc.
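A minimal sketch of swapping in the HiveContext to get window functions (no Hive install needed); the table and column names are made up:

from pyspark.sql import HiveContext

sqlContext = HiveContext(sc)
df = sqlContext.read.load(src)
df.registerTempTable("events")
ranked = sqlContext.sql("""
    SELECT user, ts,
           rank() OVER (PARTITION BY user ORDER BY ts) AS visit_rank
    FROM events""")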
DataFrames aren’t quite as lazy...
● Keep track of schema information
● Loading JSON data involves looking at the data
● Previously, trying to load non-existent data wouldn’t fail right away; now it fails right away
Why is groupByKey so slow anyways?
● Well the answer is it sort of depends on what our
distribution looks like
● If we’ve got highly skewed data this can cause all sorts
of problems, not just with groupByKey
● groupByKey is just the canary in the coal mine
LadyJillybean
Considerations for Key/Value Data
● What does the distribution of keys look like?
● What type of aggregations do we need to do?
● Do we want our data in any particular order?
● Are we joining with another RDD?
● What’s our partitioner?
○ If we don’t have an explicit one: what is the partition structure?
eleda 1
What is key skew and why do we care?
● Keys aren’t evenly distributed
○ Sales by postal code, or records by city, etc.
● groupByKey will explode (but it's pretty easy to break)
● We can have really unbalanced partitions
○ If we have enough key skew sortByKey could even fail
○ Stragglers (uneven sharding can make some tasks take much longer)
Mitchell Joyce
groupByKey - just how evil is it?
● Pretty evil
● Groups all of the records with the same key into a single record
○ Even if we immediately reduce it (e.g. sum it or similar)
○ This can be too big to fit in memory, then our job fails
● Unless we are in SQL then happy pandas
PROgeckoam
So what does that look like?
(94110, A, B)
(94110, A, C)
(10003, D, E)
(94110, E, F)
(94110, A, R)
(10003, A, R)
(94110, D, R)
(94110, E, R)
(94110, E, R)
(67843, T, R)
(94110, T, R)
(94110, T, R)
(67843, T, R)
(10003, A, R)
(94110, [(A, B), (A, C), (E, F), (A, R), (D, R), (E, R), (E, R), (T, R), (T, R)])
Let’s revisit wordcount with groupByKey
val words = rdd.flatMap(_.split(" "))
val wordPairs = words.map((_, 1))
val grouped = wordPairs.groupByKey()
grouped.mapValues(_.sum)
And now back to the “normal” version
val words = rdd.flatMap(_.split(" "))
val wordPairs = words.map((_, 1))
val wordCounts = wordPairs.reduceByKey(_ + _)
wordCounts
Let’s see what it looks like when we run the two
Quick pastebin of the code for the two: http://pastebin.com/CKn0bsqp
val rdd = sc.textFile("python/pyspark/*.py", 20) // Make sure we have many partitions
// Evil group by key version
val words = rdd.flatMap(_.split(" "))
val wordPairs = words.map((_, 1))
val grouped = wordPairs.groupByKey()
val evilWordCounts = grouped.mapValues(_.sum)
evilWordCounts.take(5)
// Less evil version
val wordCounts = wordPairs.reduceByKey(_ + _)
wordCounts.take(5)
[Spark UI screenshots: the groupByKey run vs. the reduceByKey run]
So what did we do instead?
● reduceByKey
○ Works when the types are the same (e.g. in our summing version)
● aggregateByKey
○ Doesn’t require the types to be the same (e.g. computing stats model or similar)
Allows Spark to pipeline the reduction & skip making the list
We also got a map-side reduction (note the difference in shuffled read)
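For example, computing a per-key sum and count (so we can later get a mean) without ever materializing the grouped list; a sketch assuming pairs is an RDD of (key, numeric value):

sums_and_counts = pairs.aggregateByKey(
    (0.0, 0),                                   # zero value: (sum, count)
    lambda acc, v: (acc[0] + v, acc[1] + 1),    # fold a value into a partition-local accumulator
    lambda a, b: (a[0] + b[0], a[1] + b[1]))    # merge accumulators across partitions
means = sums_and_counts.mapValues(lambda s_c: s_c[0] / s_c[1])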
So why did we read in python/*.py
If we had just read in the standard README.md file there wouldn’t be enough duplicated
keys for the reduceByKey & groupByKey difference to be really apparent
Which is why groupByKey can be safe sometimes
Can just the shuffle cause problems?
● Sorting by key can put all of the records in the same partition
● We can run into partition size limits (around 2GB)
● Or just get bad performance
● So that we can handle data like the above, we can add some “junk” (a salt) to our key (see the sketch after the example records)
(94110, A, B)
(94110, A, C)
(10003, D, E)
(94110, E, F)
(94110, A, R)
(10003, A, R)
(94110, D, R)
(94110, E, R)
(94110, E, R)
(67843, T, R)
(94110, T, R)
(94110, T, R)
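A rough sketch of the “junk in the key” (salting) trick for records like the above: spread a hot key such as 94110 across N sub-keys, reduce, then strip the salt and reduce again. The records RDD, the combine function, and the choice of 10 salts are all assumptions for illustration:

import random

NUM_SALTS = 10

salted = records.map(lambda kv: ((kv[0], random.randint(0, NUM_SALTS - 1)), kv[1]))
partial = salted.reduceByKey(combine)          # each (key, salt) group stays small
final = (partial
         .map(lambda kv: (kv[0][0], kv[1]))    # drop the salt
         .reduceByKey(combine))                # a handful of records per original key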
Todd Klassy
Shuffle explosions :(
(94110, A, B)
(94110, A, C)
(10003, D, E)
(94110, E, F)
(94110, A, R)
(10003, A, R)
(94110, D, R)
(94110, E, R)
(94110, E, R)
(67843, T, R)
(94110, T, R)
(94110, T, R)
(94110, A, B)
(94110, A, C)
(94110, E, F)
(94110, A, R)
(94110, D, R)
(94110, E, R)
(94110, E, R)
(94110, T, R)
(94110, T, R)
(67843, T, R)
(10003, A, R)
Everyone* needs reduce, let’s make it faster!
● reduce & aggregate have “tree” versions
● we already had free map-side reduction
● but now we can get even better!**
**And we might be able to make even cooler versions
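A sketch of the “tree” variants in PySpark, which combine partial results in log-depth stages instead of pulling every partition’s result straight back to the driver; the RDD of numbers is made up:

nums = sc.parallelize(range(1000), 100)

total = nums.treeReduce(lambda a, b: a + b, depth=2)
sum_and_count = nums.treeAggregate(
    (0, 0),                                     # (sum, count)
    lambda acc, x: (acc[0] + x, acc[1] + 1),
    lambda a, b: (a[0] + b[0], a[1] + b[1]),
    depth=2)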
RDD re-use - sadly not magic
● If we know we are going to re-use the RDD what should we do?
○ If it fits nicely in memory, cache it in memory
○ persisting at another level
■ MEMORY_ONLY, MEMORY_AND_DISK, DISK_ONLY, OFF_HEAP
● There are also _SER versions in Scala, but Python RDDs are already pickled
○ checkpointing
● Noisy clusters
○ _2 (replicated) storage levels & checkpointing can help
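A minimal sketch of the re-use options in PySpark (the checkpoint directory is a hypothetical path; pick one storage level per RDD since it can’t be changed once set):

from pyspark import StorageLevel

rdd.persist(StorageLevel.MEMORY_AND_DISK)       # or MEMORY_ONLY, DISK_ONLY, OFF_HEAP, ...
# rdd.persist(StorageLevel.MEMORY_AND_DISK_2)   # _2 = replicated, helps on noisy clusters

sc.setCheckpointDir("hdfs:///tmp/checkpoints")  # hypothetical path
rdd.checkpoint()                                # truncates the lineage on the next action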
Richard Gillin
Using other libraries
● built ins
○ just import!*
■ Except for Hive, compile with -PHive & then import
● spark-packages
○ --packages
● generic python
○ pre-install on workers (pssh, puppet, etc.)
○ add it with --py-files (for .zip, .egg, or .py dependencies)
○ sc.addPyFile
○ CDH + Continuum + Anaconda =~ sanity
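A rough sketch of shipping plain Python dependencies to the workers; the archive name and helper module are hypothetical:

# at submit time:  spark-submit --py-files deps.zip my_job.py
# or from a running SparkContext:
sc.addPyFile("deps.zip")

import my_helpers                   # importable on the workers once the file has shipped
cleaned = rdd.map(my_helpers.clean_record)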
And the next book…..
Still being written - signup to be notified when it is available:
● http://www.highperformancespark.com
● https://twitter.com/highperfspark
● Examples are Scala centric
○ Sorry!
○ but will try and port the examples to Python in the repo once finished with Scala & Java
○ does cover interaction with other languages
Spark Videos
● Apache Spark Youtube Channel
● My Spark videos on YouTube -
○ http://bit.ly/holdenSparkVideos
● Spark Summit 2014 training
● Paco’s Introduction to Apache Spark
Office hours - I’ll answer your questions :)*
● IRL this Monday
○ Booking.com offices @ 6pm - http://www.meetup.com/Amsterdam-Spark/events/228667345/
● On-line in the future
○ Follow me on twitter or linkedin and I’ll post when it’s going on
*If I can - if I don’t know I’ll try and figure it out but no guarantees
k thnx bye!
If you care about Spark testing and don’t hate surveys: http://bit.ly/holdenTestingSpark
Will tweet results
“eventually” @holdenkarau
Cat wave photo by Quinn Dombrowski
Preview: bringing codegen to Spark ML
● Based on Spark SQL’s code generation
○ First draft using quasiquotes
○ Switch to janino for Java compilation
● Initial draft for Gradient Boosted Trees
○ Based on DB’s work
○ First draft with QuasiQuotes
■ Moved to Java for speed
○ See SPARK-10387 for the details
Jon
@Override
public double call(Vector input) throws Exception {
  if (input.apply(1) <= 1.0) {
    return 0.1;
  } else {
    if (input.apply(0) <= 0.5) {
      return 0.0;
    } else {
      return 2.0;
    }
  }
}
[Decision tree diagram: split on (feature 1, threshold 1.0) with leaf 0.1, then on (feature 0, threshold 0.5) with leaves 0.0 and 2.0]
What the generated code looks like
Glenn Simmons