2. About Databricks
Founded by the creators of Spark in 2013; remains the top
contributor
End-to-end service for Spark on EC2
• Interactive notebooks, dashboards,
and production jobs
3. Our Goal for Spark
Unified engine across data workloads and platforms
[Diagram: SQL, Streaming, ML, Graph, Batch, … running on one unified engine]
4. Past 2 Years
Fast growth in libraries and
integration points
• New library for SQL + DataFrames
• 10x growth of ML library
• Pluggable data source API
• R language
Result: very diverse use of Spark
• Only 40% of users on Hadoop YARN
• Most users use at least 2 of Spark’s
built-in libraries
• 98% of Databricks customers use
SQL, 60% use Python
5. Beyond Libraries
Best thing about basing Spark’s libraries on a high-level API is
that we can also make big changes underneath them
Now working on some of the largest changes to Spark Core
since the project began
11. Tungsten: Preparing Spark for Next 5 Years
Substantially speed up execution by optimizing CPU efficiency, via:
(1) Off-heap memory management
(2) Runtime code generation
(3) Cache-aware algorithms
13. DataFrame API
Single-node tabular structure in R and Python, with APIs for:
• relational algebra (filter, join, …)
• math and stats
• input/output (CSV, JSON, …)
Google Trends for “data frame”
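A minimal pandas sketch of these single-node data-frame operations (the columns and values below are illustrative, loosely echoing the flights example on the next slide):

```python
import io

import pandas as pd

# Relational algebra: filter and join on small in-memory tables.
flights = pd.DataFrame({
    "carrier": ["UA", "UA", "AA", "B6"],
    "dep_delay": [2, 4, 2, -1],
})
carriers = pd.DataFrame({
    "carrier": ["UA", "AA", "B6"],
    "name": ["United", "American", "JetBlue"],
})
delayed = flights[flights["dep_delay"] > 0]     # filter
joined = delayed.merge(carriers, on="carrier")  # join

# Math and stats on a column.
mean_delay = joined["dep_delay"].mean()

# Input/output: CSV written to an in-memory buffer.
buf = io.StringIO()
joined.to_csv(buf, index=False)
print(mean_delay)
```

The same three capabilities (relational algebra, stats, I/O) are what the slide attributes to R and Python data frames in general.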
14. DataFrame: lingua franca for “small data”
head(flights)
#> Source: local data frame [6 x 16]
#>
#> year month day dep_time dep_delay arr_time arr_delay carrier tailnum
#> 1 2013 1 1 517 2 830 11 UA N14228
#> 2 2013 1 1 533 4 850 20 UA N24211
#> 3 2013 1 1 542 2 923 33 AA N619AA
#> 4 2013 1 1 544 -1 1004 -18 B6 N804JB
#> .. ... ... ... ... ... ... ... ... ...
15. Spark DataFrames
Structured data collections
with similar API to R/Python
• DataFrame = RDD + schema
Capture many operations as
expressions in a DSL
• Enables rich optimizations
df = jsonFile("tweets.json")
df.filter(df("user") === "matei")
  .groupBy("date")
  .sum("retweets")
[Chart: running time for Python RDD, Scala RDD, and DataFrame]
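A toy Python sketch (the class names are illustrative, not Spark’s API) of why capturing operations as expression objects, rather than opaque functions, enables optimization: the engine can inspect the expression tree instead of treating the predicate as a black box.

```python
class Col:
    """A column reference captured as data, not as an opaque function."""

    def __init__(self, name):
        self.name = name

    def __eq__(self, value):
        # Builds an expression tree instead of evaluating eagerly.
        return Expr("==", self, value)


class Expr:
    def __init__(self, op, left, right):
        self.op, self.left, self.right = op, left, right

    def evaluate(self, row):
        if self.op == "==":
            return row[self.left.name] == self.right
        raise ValueError(self.op)


def filter_rows(rows, expr):
    # Because expr is inspectable, an optimizer could push it down to a
    # data source or generate code for it; here we simply interpret it.
    return [r for r in rows if expr.evaluate(r)]


rows = [{"user": "matei", "retweets": 3}, {"user": "ion", "retweets": 1}]
result = filter_rows(rows, Col("user") == "matei")
```

With a plain lambda the filter condition would be opaque; as an `Expr` tree, the engine can see it compares the `user` column to a constant.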
17. 1. Off-Heap Memory Management
Store data outside JVM heap to avoid object overhead & GC
• For RDDs: fast serialization libraries
• For DataFrames & SQL: binary format we compute on directly
2-10x space savings, especially for strings, nested objects
Can use new RAM-like devices, e.g. flash, 3D XPoint
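The space cost of boxed objects versus a compact binary layout can be sketched in plain Python with the standard `struct` module (the sizes illustrate the idea; JVM object headers differ in detail):

```python
import struct
import sys

# A record as ordinary boxed objects: each field carries its own object
# header, plus a container holding pointers to the fields.
record = (2013, 1, 1, "UA")

# The same record packed into a fixed binary layout that an engine
# could compute on directly: three 32-bit ints plus a 2-byte string.
packed = struct.pack("iii2s", 2013, 1, 1, b"UA")

boxed_size = sys.getsizeof(record) + sum(sys.getsizeof(f) for f in record)
print(boxed_size, len(packed))  # the boxed layout is several times larger
```

The packed form is also GC-free: it is one flat buffer rather than five heap objects, which is the point of Tungsten’s off-heap storage.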
18. 2. Runtime Code Generation
Generate Java code for DataFrame and
SQL expressions requested by user
Avoids virtual calls and generics/boxing
Can do same in core, ML and graph
• Code-gen serializers, fused functions,
math expressions
[Chart: evaluating “SELECT a+a+a”, time in seconds: hand-written 9.3, code-generated 9.4, interpreted projection 36.7]
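The interpreted-versus-generated gap can be sketched in Python (the expression encoding is illustrative; Spark generates Java, not Python): interpretation walks a tree and dispatches per node on every row, while generated code is straight-line.

```python
# Interpreted evaluation: walk an expression tree for every row.
def interpret(expr, row):
    if expr[0] == "col":
        return row[expr[1]]
    if expr[0] == "+":
        return interpret(expr[1], row) + interpret(expr[2], row)
    raise ValueError(expr[0])


# Code generation: compile the same expression to straight-line code
# once, then run it per row with no tree walk or dispatch.
def codegen(expr):
    def emit(e):
        if e[0] == "col":
            return f"row[{e[1]!r}]"
        return f"({emit(e[1])} + {emit(e[2])})"

    src = f"lambda row: {emit(expr)}"
    return eval(compile(src, "<generated>", "eval"))


# The slide's example expression: a + a + a.
a_plus_a_plus_a = ("+", ("+", ("col", "a"), ("col", "a")), ("col", "a"))
fn = codegen(a_plus_a_plus_a)
print(fn({"a": 5}))  # 15, same result as interpret(...)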
19. 3. Cache-Aware Algorithms
Use custom memory layout to better leverage CPU cache
Example: AlphaSort-style prefix sort
• Store prefixes of sort key inside pointer array
• Compare prefixes to avoid full record fetches + comparisons
[Diagram: naïve layout stores pointer → record; cache-friendly layout stores key prefix alongside each pointer]
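A Python sketch of the prefix trick (the records and prefix length are illustrative): sort a compact array of (prefix, pointer) pairs, fetching the full record only when two prefixes tie.

```python
from functools import cmp_to_key

# Full records live elsewhere; each is identified by an index ("pointer").
records = ["banana split", "apple pie", "banana bread", "cherry tart"]

PREFIX_LEN = 4  # bytes of the sort key stored next to each pointer

# The compact (prefix, pointer) array that fits in CPU cache.
entries = [(rec[:PREFIX_LEN], i) for i, rec in enumerate(records)]


def compare(a, b):
    # Fast path: compare the in-cache prefixes.
    if a[0] != b[0]:
        return -1 if a[0] < b[0] else 1
    # Slow path (prefix tie): fetch the full records to break the tie.
    ra, rb = records[a[1]], records[b[1]]
    return (ra > rb) - (ra < rb)


entries.sort(key=cmp_to_key(compare))
sorted_records = [records[i] for _, i in entries]
```

Most comparisons touch only the small `entries` array; only the two records sharing the prefix "bana" force a full fetch, mirroring the cache-friendly layout in the diagram.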
22. Motivation
Network and storage speeds have improved 10x, but this
speed isn’t always easy to leverage!
Many challenges with:
• Keeping disk operations large (even on SSDs)
• Keeping network connections busy & balanced across cluster
• Doing all this on many cores and many disks
23. Sort Benchmark
Started by Jim Gray in 1987 to measure HW+SW advances
• Many entrants use purpose-built hardware & software
Participated in largest category: Daytona GraySort
• Sort 100 TB of 100-byte records in a fault-tolerant manner
Set a new world record (tied with UCSD)
• Saturated 8 SSDs and 10 Gbps network per node
• 1st time public cloud + open source won
24. On-Disk Sort Record
Time to sort 100 TB
2013 Record (Hadoop): 2100 machines, 72 minutes
2014 Record (Spark): 207 machines, 23 minutes
Also sorted 1 PB in 4 hours
Source: Daytona GraySort benchmark, sortbenchmark.org
27. Motivation
Query planning is crucial to performance in a distributed setting
• Level of parallelism in operations
• Choice of algorithm (e.g. broadcast vs. shuffle join)
Hard to do well for big data even with cost-based optimization
• Unindexed data => don’t have statistics
• User-defined functions => hard to predict
Solution: let Spark change the query plan adaptively
38. Advanced Example: Join
Goal: Bring together data items with the same key
Hybrid join: broadcast the popular keys, shuffle the rest
39. Advanced Example: Join
More details: SPARK-9850
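A single-process Python sketch of the hybrid join idea (the skew threshold and data are illustrative; Spark’s implementation is distributed): popular keys are joined via a broadcast table, the remaining keys via grouping, mimicking a shuffle.

```python
from collections import Counter, defaultdict

left = [("a", 1), ("a", 2), ("a", 3), ("b", 4)]  # "a" is a skewed key
right = [("a", "x"), ("b", "y")]

# Decide which keys are popular enough to broadcast instead of shuffle.
SKEW_THRESHOLD = 2
counts = Counter(k for k, _ in left)
popular = {k for k, c in counts.items() if c >= SKEW_THRESHOLD}

# Broadcast join for popular keys: ship the small right-side rows for
# those keys to every place holding left rows.
broadcast = defaultdict(list)
for k, w in right:
    if k in popular:
        broadcast[k].append(w)

results = []
for k, v in left:
    if k in popular:
        results += [(k, v, w) for w in broadcast[k]]

# Shuffle-style join for the remaining keys: group both sides by key.
grouped = defaultdict(list)
for k, w in right:
    if k not in popular:
        grouped[k].append(w)
for k, v in left:
    if k not in popular:
        results += [(k, v, w) for w in grouped[k]]
```

Broadcasting only the skewed keys avoids sending all of `left`’s "a" rows across the network, while the unpopular keys still get the cheap grouped path.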
40. Impact of Adaptive Planning
Level of parallelism: 2-3x
Choice of join algorithm: as much as 10x
Follow it at SPARK-9850
41. Effect of Optimizations in Core
Often, when we made one optimization, we saw all of the
Spark components get faster
• Scheduler optimization for Spark Streaming => SQL 2x faster
• Network optimizations => speed up all comm-intensive libraries
• Tungsten => DataFrames, SQL and parts of ML
Same applies to other changes in core, e.g. debug tools
42. Conclusion
Spark has grown a lot, but it still remains the most active open
source project in big data
Small core + high-level API => can make changes quickly
New hardware => exciting optimizations at all levels