In the last year Spark has seen substantial growth in adoption as well as the pace and scope of development. This talk will look forward and discuss both technical initiatives and the evolution of the Spark community.
On the technical side, I’ll discuss two key initiatives ahead for Spark. The first is a tighter integration of Spark’s libraries through shared primitives such as the DataFrame API. The second is across-the-board performance optimizations that exploit schema information embedded in Spark’s newer APIs. These initiatives are both designed to make Spark applications easier to write and faster to run.
On the community side, this talk will focus on the growing ecosystem of extensions, tools, and integrations evolving around Spark. I’ll survey popular language bindings, data sources, notebooks, visualization libraries, statistics libraries, and other community projects. Extensions will be a major point of growth in the future, and this talk will discuss how we can position the upstream project to help encourage and foster this growth.
Strata NYC 2015 - What's coming for the Spark community
1. What’s New in the Spark Community
Patrick Wendell | @pwendell
2. About Me
Co-Founder of Databricks
Founding committer of Apache Spark at U.C. Berkeley
Today, manage Spark effort @ Databricks
3. About Databricks
Team donated Spark to ASF in 2013; primary maintainers of Spark today
Hosted analytics stack based on Apache Spark
Managed clusters, notebooks, collaboration, and third-party apps
5. What is your familiarity with Spark?
1. Not very familiar with Spark – only very high level.
2. Understand the components/uses well, but I’ve never written code.
3. I’ve written Spark code for a POC or production use case.
6. “Spark is the Taylor Swift of big data software.”
- Derrick Harris, Fortune
7. Apache Spark Engine
Spark Core, with standard libraries on top: Streaming, SQL and DataFrame, MLlib, GraphX
Unified engine across diverse workloads & environments
Scale out, fault tolerant
Python, Java, Scala, and R APIs
8. This Talk
“What’s new” in Spark? And what’s coming?
Two parts: Technical roadmap and community developments
“The future is already here — it's just not very evenly distributed.”
- William Gibson
10. Spark Technical Directions
Higher-level APIs
Make developers more productive
Performance of key execution primitives
Shuffle, sorting, hashing, and state management
Pluggability and extensibility
Make it easy for other projects to integrate with Spark
11. Spark Technical Directions
Higher-level APIs
Make developers more productive
Performance of key execution primitives
Shuffle, sorting, hashing, and state management
Pluggability and extensibility
Make it easy for other projects to integrate with Spark
13. Computing an Average: MapReduce vs Spark

MapReduce (Java):

private IntWritable one = new IntWritable(1);
private IntWritable output = new IntWritable();

protected void map(LongWritable key, Text value, Context context) {
    String[] fields = value.toString().split("\t");
    output.set(Integer.parseInt(fields[1]));
    context.write(one, output);
}

IntWritable one = new IntWritable(1);
DoubleWritable average = new DoubleWritable();

protected void reduce(IntWritable key, Iterable<IntWritable> values, Context context) {
    int sum = 0;
    int count = 0;
    for (IntWritable value : values) {
        sum += value.get();
        count++;
    }
    average.set(sum / (double) count);
    context.write(key, average);
}

Spark (Python):

data = sc.textFile(...).map(lambda line: line.split("\t"))
averages = (data
    .map(lambda x: (x[0], [x[1], 1]))
    .reduceByKey(lambda x, y: [x[0] + y[0], x[1] + y[1]])
    .map(lambda x: [x[0], x[1][0] / x[1][1]])
    .collect())
14. Computing an Average with Spark

data = sc.textFile(...).map(lambda line: line.split("\t"))
averages = (data
    .map(lambda x: (x[0], [x[1], 1]))
    .reduceByKey(lambda x, y: [x[0] + y[0], x[1] + y[1]])
    .map(lambda x: [x[0], x[1][0] / x[1][1]])
    .collect())
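To make the semantics concrete, here is a minimal pure-Python sketch (no Spark required) of what the map/reduceByKey/map chain above computes; the sample (name, age) records are hypothetical:

```python
from collections import defaultdict

# Hypothetical input records, standing in for the parsed text file
rows = [("alice", 30), ("bob", 25), ("alice", 40)]

# map: each record becomes (key, [value, 1])
pairs = [(k, [v, 1]) for k, v in rows]

# reduceByKey: pairwise elementwise sum of the [sum, count] accumulators
acc = defaultdict(lambda: [0, 0])
for k, (s, c) in pairs:
    acc[k][0] += s
    acc[k][1] += c

# final map: key -> sum / count
averages = {k: s / c for k, (s, c) in acc.items()}
print(averages)  # {'alice': 35.0, 'bob': 25.0}
```

The same shape (accumulate a sum and a count per key, divide at the end) is exactly what both the MapReduce and the Spark versions express, which is the point of the comparison.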
15. Computing an Average with DataFrames

sqlCtx.table("people")
    .groupBy("name")
    .agg("name", avg("age"))
    .collect()
16. Spark DataFrame API
Explicit data model and schema
Selecting columns and filtering
Aggregation (count, sum, average, etc.)
User-defined functions
Joining different data sources
Statistical functions and easy plotting
Python, Scala, Java, and R

sqlCtx.table("people")
    .groupBy("name")
    .agg("name", avg("age"))
    .collect()
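As a rough illustration of the relational semantics, the DataFrame chain above corresponds to a plain SQL aggregation. This sketch uses Python's sqlite3 in place of Spark, with a hypothetical people table, just to show the query the chain expresses:

```python
import sqlite3

# In-memory stand-in for the "people" table; the DataFrame chain corresponds to
# SELECT name, AVG(age) FROM people GROUP BY name.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, age REAL)")
conn.executemany("INSERT INTO people VALUES (?, ?)",
                 [("alice", 30), ("bob", 25), ("alice", 40)])

rows = conn.execute(
    "SELECT name, AVG(age) FROM people GROUP BY name ORDER BY name"
).fetchall()
print(rows)  # [('alice', 35.0), ('bob', 25.0)]
```

Because the DataFrame API carries this explicit relational structure, Spark's optimizer can rewrite the plan before execution, which plain RDD lambdas do not allow.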
17. Ask more of your framework!

MapReduce         | Spark             | Spark + DataFrames
------------------|-------------------|---------------------------
Fault tolerance   | Fault tolerance   | Fault tolerance
Data distribution | Data distribution | Data distribution
                  | Set operators     | Set operators
                  | Operator DAG      | Operator DAG
                  | Caching           | Caching
                  |                   | Schema management
                  |                   | Relational semantics
                  |                   | Logical plan optimization
                  |                   | Storage push down and opt.
                  |                   | Analytic operations
                  |                   | …
18. Other high-level APIs

ML Pipelines: ds0 → tokenizer → ds1 → hashingTF → ds2 → lr → ds3 (lr.model)

SparkR:

> faithful <- read.df("faithful.json", "json")
> head(filter(faithful, faithful$waiting < 50))
##   eruptions waiting
## 1     1.750      47
## 2     1.750      47
## 3     1.867      48
19. Spark Technical Directions
Higher-level APIs
Make developers more productive
Performance of key execution primitives
Shuffle, sorting, hashing, and state management
Pluggability and extensibility
Make it easy for other projects to integrate with Spark
21. Project Tungsten: The CPU Squeeze

         2010      2015       Change
Storage  50+ MB/s  500+ MB/s  10X
         (HDD)     (SSD)
Network  1 Gbps    10 Gbps    10X
CPU      ~3 GHz    ~3 GHz     (no change)
22. Project Tungsten
Code generation for CPU efficiency
Code generation on by default and using Janino [SPARK-7956]
Beef up built-in UDF library (added ~100 UDFs with code gen):

AddMonths, ArrayContains, Ascii, Base64, Bin, BinaryMathExpression, CheckOverflow, CombineSets, Contains, CountSet, Crc32, DateAdd, DateDiff, DateFormatClass, DateSub, DayOfMonth, DayOfYear, Decode, Encode, EndsWith, Explode, Factorial, FindInSet, FormatNumber, FromUTCTimestamp, FromUnixTime, GetArrayItem, GetJsonObject, GetMapValue, Hex, InSet, InitCap, IsNaN, IsNotNull, IsNull, LastDay, Length, Levenshtein, Like, Lower, MakeDecimal, Md5, Month, MonthsBetween, NaNvl, NextDay, Not, PromotePrecision, Quarter, RLike, Round, Second, Sha1, Sha2, ShiftLeft, ShiftRight, ShiftRightUnsigned, SortArray, SoundEx, StartsWith, StringInstr, StringRepeat, StringReverse, StringSpace, StringSplit, StringTrim, StringTrimLeft, StringTrimRight, TimeAdd, TimeSub, ToDate, ToUTCTimestamp, TruncDate, UnBase64, UnaryMathExpression, Unhex, UnixTimestamp
23. Project Tungsten
Binary processing for memory management (all data types):
External sorting with managed memory
External hashing with managed memory

[Diagram: Managed Memory HashMap in Tungsten — the map holds (hash code, pointer) entries that reference key/value records packed contiguously into memory pages]
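The diagram's idea can be sketched in a few lines: instead of boxing each key and value as a JVM object, records are appended to a flat memory page and the hash map stores only (hash code, offset) pairs. This is a simplified illustration of the layout, not Tungsten's actual implementation:

```python
import struct

page = bytearray()  # flat "memory page" holding length-prefixed records
index = []          # hash map side: (hash code, offset-into-page) entries

def put(key: bytes, value: bytes):
    offset = len(page)
    # Append length-prefixed key then value, packed contiguously.
    page.extend(struct.pack("<I", len(key)) + key)
    page.extend(struct.pack("<I", len(value)) + value)
    index.append((hash(key) & 0xFFFFFFFF, offset))

def get(key: bytes):
    hc = hash(key) & 0xFFFFFFFF
    for h, off in index:
        if h != hc:
            continue  # hash code filters candidates without touching the page
        klen = struct.unpack_from("<I", page, off)[0]
        if page[off + 4: off + 4 + klen] == key:
            voff = off + 4 + klen
            vlen = struct.unpack_from("<I", page, voff)[0]
            return bytes(page[voff + 4: voff + 4 + vlen])
    return None

put(b"name", b"spark")
print(get(b"name"))  # b'spark'
```

The payoff of this layout is that comparisons and spills operate on raw bytes, avoiding object headers and garbage-collection pressure, which is what makes the external sorting and hashing above practical.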
24. Where are we going?

[Diagram: language frontends (Python, Java/Scala, R, SQL, …) all target the DataFrame API and a shared Logical Plan; the Tungsten backend can then execute on the JVM, LLVM, GPUs, or NVRAM]
26. Spark Technical Directions
Higher-level APIs
Make developers more productive
Performance of key execution primitives
Shuffle, sorting, hashing, and state management
Pluggability and extensibility
Make it easy for other projects to integrate with Spark
27. Pluggability: Rich IO Support

df = sqlContext.read \
    .format("json") \
    .option("samplingRatio", "0.1") \
    .load("/home/michael/data.json")

df.write \
    .format("parquet") \
    .mode("append") \
    .partitionBy("year") \
    .saveAsTable("fasterData")

Unified interface to reading/writing data in a variety of formats
28. Large Number of IO Integrations
Spark SQL’s Data Source API can read and write DataFrames using a variety of formats.

[Logos: built-in sources such as JSON and JDBC, plus many external community sources, and more…]
Find more sources at http://spark-packages.org/
30. Technical Directions
Early on, the focus was:
Can Spark be an engine that is faster and easier to use than Hadoop MapReduce?
Today the question is:
Can Spark & its ecosystem make big data as easy as little data?
32. Who is the “Spark Community”?
thousands of users
… hundreds of developers
… dozens of distributors
33. Getting a better vantage point
Databricks survey - feedback from more than 1,400 users
34. Community trends: Library & package ecosystem
Strata NY 2014: Widespread use of core RDD API
Today: Most use built-in and community libraries
51% of users use 3 or more libraries
35. Spark Packages
Strata NY 2014: Didn’t exist
Today: > 100 community packages
> ./bin/spark-shell --packages databricks/spark-avro:0.2
36. Spark Packages
API Extensions
Clojure API
Spark Kernel
Zeppelin Notebook
Indexed RDD
Deployment Utilities
Google Compute
Microsoft Azure
Spark Jobserver
Data Sources
Redshift
Avro
CSV
Elasticsearch
MongoDB
37. Increasing storage options
Strata NY 2014: IO primarily through Hadoop InputFormat API
January 2015: Spark adds native storage API
Today: Well over 20 natively integrated storage bindings
Cassandra, Elasticsearch, MongoDB, Avro, Parquet, ORC, HBase,
Redshift, SAP, CSV, Cloudant, Oracle, JDBC, SequoiaDB, Couchbase…
38. Deployment environments
Strata NY 2014: Traction in the Hadoop community
Today: Growth beyond Hadoop… increasingly public cloud
51% of respondents run Spark in public cloud
39. Wrapping it up
Spark has grown and developed quickly in the last year!
Looking forward, expect:
- Engineering effort on higher-level APIs and performance
- A broader surrounding ecosystem
- The unexpected
40. Where to learn more about Spark?
SparkHub community portal
Spark Summit conference - https://spark-summit.org/
Massive online course (edX)
Databricks Spark training
Books: