These are the slides from our recent HadoopIsrael meetup, dedicated to a comparison of the Spark and Tez frameworks.
At the end of the meetup there is a small update about our ImpalaToGo project.
2. Why compare them?
Both frameworks emerged as MapReduce
replacements
Both essentially provide a DAG of computations
Both are YARN applications.
Both reduce the latency of MR
Both promise to improve SQL capabilities
3. Our plan for today
To understand what Tez is
To recall what Spark is
To understand what they have in common and
what differentiates them.
To try to identify when each of them is
more applicable
4. MapReduce extension
While MapReduce can solve virtually any data
transformation problem, not all of them are
solved efficiently.
One of the main drawbacks of the current
MapReduce implementation is latency,
especially in job cascades.
5. MapReduce latency causes
1. Obtaining and initializing containers
2. Poll-oriented scheduling
3. In a series of jobs - persistence of intermediate
results
a. Serialization and deserialization costs
b. IO costs
c. HDFS costs
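Cause 3 can be sketched in a few lines of plain Scala (a simulation, not MR code): a chain of MR jobs serializes the intermediate result to storage on every job boundary, while a DAG engine can hand it to the next stage in memory.

```scala
import java.io._
import java.nio.file.Files

// Two "jobs": tokenize, then count words.
def tokenize(lines: Seq[String]): Seq[String] = lines.flatMap(_.split(" "))
def count(words: Seq[String]): Map[String, Int] =
  words.groupBy(identity).map { case (w, ws) => w -> ws.size }

val input = Seq("a b", "a")

// MR-style chaining: persist the intermediate result, then read it back
// (serialization + IO cost on every job boundary).
val tmp = Files.createTempFile("intermediate", ".bin").toFile
val out = new ObjectOutputStream(new FileOutputStream(tmp))
out.writeObject(tokenize(input)); out.close()
val in = new ObjectInputStream(new FileInputStream(tmp))
val persisted = in.readObject().asInstanceOf[Seq[String]]; in.close()
tmp.delete()

// DAG-style chaining: pass the intermediate result along in memory.
val chained = count(tokenize(input))

assert(count(persisted) == chained) // Map(a -> 2, b -> 1)
```

Both paths compute the same answer; the MR-style path just pays the extra serialization and IO tax in the middle.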
6. Common solutions to latency
problems in Spark and Tez
Container start overhead - container reuse
Polling-style scheduling - event-driven control
Building a DAG of computations to eliminate the
need to persist intermediate results.
7. Tez
Implementation language - Java
Client language - Java
Main abstraction - a DAG of computations
To the best of my understanding - an improvement
of MR as far as possible.
9. Vertex
A vertex is a collection of tasks running in the cluster
A task consists of inputs, outputs and a
processor.
Inputs can come from other vertices or from HDFS
Outputs can be sorted or not, and go to HDFS
or to other vertices
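The vertex/edge abstraction can be modeled in a few lines of plain Scala (an illustration of the concept, not the Tez API): vertices are named transformations, edges say whose output feeds whom, and the engine runs vertices in topological order.

```scala
// Toy model of a Tez-style DAG (illustrative names, not Tez classes).
case class Vertex(name: String, run: Seq[String] => Seq[String])
case class Edge(from: String, to: String)

def execute(vertices: Seq[Vertex], edges: Seq[Edge], input: Seq[String]): Map[String, Seq[String]] = {
  var results = Map.empty[String, Seq[String]]
  var remaining = vertices
  while (remaining.nonEmpty) {
    // A vertex is ready once all of its upstream vertices have produced output.
    val (ready, blocked) = remaining.partition(v =>
      edges.filter(_.to == v.name).forall(e => results.contains(e.from)))
    require(ready.nonEmpty, "cycle detected: not a DAG")
    for (v <- ready) {
      val upstream = edges.filter(_.to == v.name).flatMap(e => results(e.from))
      // Source vertices (no incoming edges) read the job input, e.g. from HDFS.
      results += v.name -> v.run(if (upstream.isEmpty) input else upstream)
    }
    remaining = blocked
  }
  results
}

val dag = Seq(
  Vertex("tokenize", _.flatMap(_.split(" "))),
  Vertex("count", ws => ws.groupBy(identity).map { case (w, g) => s"$w:${g.size}" }.toSeq.sorted)
)
val out = execute(dag, Seq(Edge("tokenize", "count")), Seq("a b", "a"))
// out("count") == Seq("a:2", "b:1")
```

In real Tez the same structure is built with `DAG`, `Vertex` and `Edge` objects plus processor descriptors, but the shape of the computation is the same.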
11. Edge data sources
● Persisted: output is available after the task exits, stored on the local FS. It may be lost later on.
● Persisted-Reliable: output is reliably stored on HDFS and will always be available.
● Ephemeral: output is kept in memory and is available only while the producer task is running.
15. Tez vs MapReduce
MapReduce can be expressed in Tez efficiently
It can be said that Tez is somewhat lower
level than MapReduce
16. Tez session
A Tez session allows us to reuse the Tez application
master for different DAGs.
The Tez AM is capable of caching containers.
IMO this contradicts YARN to some extent.
Tez sessions are similar in concept to the Spark
context
17. Tez - summary
Tez enables us to explicitly define a DAG of
computations and tune its execution.
Tez is tightly integrated with YARN.
MR can be efficiently expressed in terms of Tez
Tez programming is more complicated than
MR.
20. Spark - word of thanks
I want to mention the help of Reynold Xin from
Databricks (http://www.cs.berkeley.edu/~rxin/),
who helped me verify the findings of this
presentation.
Spark today is the most popular Apache project,
with more than 400 contributors.
21. Spark
Spark is a framework that enables us to
manipulate distributed collections, called
RDDs.
RDD stands for Resilient Distributed Dataset.
We can also view these manipulations as a DAG
of computations
22. RDD storage options
An RDD can live in the cluster in 3 forms.
- As native Scala objects. Fastest, most RAM
- As serialized blocks. Slower, less RAM
- As persisted blocks. Slowest, but minimal
RAM.
24. Spark - usability
While in MR (or in Tez) a simple WordCount is
pages of code, in Spark it is a few lines
val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
.map(word => (word, 1))
.reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
25. Implicit DAG definition
When we define a map in Spark, we define a one-
to-one, or "non-shuffle", dependency.
When we do a join or a group-by, we define a
"shuffle" dependency.
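The two dependency kinds can be simulated with plain Scala collections (partitions modeled as a sequence of sequences; this is an illustration, not Spark's API):

```scala
// Two input partitions of (word, count) pairs.
val partitions = Seq(Seq("a" -> 1, "b" -> 1), Seq("a" -> 1))

// Narrow ("non-shuffle") dependency: each output partition is computed from
// exactly one input partition - like map() in Spark.
val mapped = partitions.map(_.map { case (k, v) => (k, v * 2) })

// Shuffle dependency: an output partition may need records from every input
// partition - like join or groupBy/reduceByKey in Spark.
val shuffled = mapped.flatten.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).sum }

assert(mapped == Seq(Seq("a" -> 2, "b" -> 2), Seq("a" -> 2)))
assert(shuffled == Map("a" -> 4, "b" -> 2))
```

The narrow step keeps the partitioning intact; the shuffle step has to regroup data across all partitions, which is what forces a stage boundary in Spark.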
26. Explicit DAG definition
While it is not common, Spark does enable
explicit DAG definition.
Spark SQL uses this for performance
reasons.
28. Spark serialization
Spark uses pluggable serialization.
You can write your own or reuse existing
serialization frameworks.
Java serialization is the default and works
transparently.
Kryo is the fastest, to the best of my knowledge.
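"Works transparently" means the default serializer is plain Java serialization, so anything Spark ships between nodes or stores as serialized blocks just has to be `java.io.Serializable`; a short stdlib-only sketch of that round trip (switching to Kryo is a configuration change, `spark.serializer` set to `org.apache.spark.serializer.KryoSerializer`):

```scala
import java.io._

// Scala case classes are Serializable out of the box, which is why Java
// serialization "just works" for typical Spark records.
case class Record(word: String, count: Int)

val buf = new ByteArrayOutputStream()
val oos = new ObjectOutputStream(buf)
oos.writeObject(Record("spark", 1)); oos.close()

val ois = new ObjectInputStream(new ByteArrayInputStream(buf.toByteArray))
val copy = ois.readObject().asInstanceOf[Record]
assert(copy == Record("spark", 1)) // round-trips with no registration needed
```

Kryo typically produces smaller payloads and is faster, but may require registering classes up front - the price of leaving the transparent default.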
29. Spark deployment
Spark can be deployed standalone as well as
as a YARN application.
This means Spark can be used without
Hadoop.
31. Storage model
Tez works with HDFS data. A Tez job
transforms data from HDFS to HDFS.
Spark has the notion of an RDD, which can live in
memory or on HDFS.
An RDD can be in the form of native Scala objects,
something Tez cannot offer.
34. Job definition level
Tez is low level - we explicitly define vertices
and edges
Spark is "high level" oriented, though a low-level
API exists.
35. Target audience
Tez is built from the ground up to be the underlying
execution engine for high-level languages like
Hive and Pig
Spark is built to be very usable as is. At the
same time, there are a few frameworks built on
top of it - Spark SQL, MLlib, GraphX.
36. YARN integration
Tez is a ground-up YARN application
Spark is "moving" toward YARN.
Spark recently added "dynamic" executor
allocation on YARN.
In the near future they should be similar; for now Tez
has some edge.
37. Note on similarity
1. There is an initiative to run Hive on Spark:
https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark
2. There is an initiative to reuse MR shuffling for
Spark:
http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/
38. Applicability: Spark vs Tez
Interactive work with data, ad-hoc analysis:
Spark is much easier.
39. Data >> RAM
Processing huge data volumes, much bigger
than the cluster's RAM: Tez might be better, since
it is more "stream oriented", has a more mature
shuffle implementation, and closer YARN
integration.
40. Data << RAM
Since Spark can cache parsed data in memory,
it can be much better when we process data
smaller than the cluster's memory.
41. Building your own DSL
For Tez the low-level interface is the "main" one,
so building your own framework or language on
top of Tez can be simpler than on top of Spark.
47. Performance
First read: select count(*) from … where … :
20 minutes.
Subsequent reads:
where on a numeric column: 1 minute.
"grep" on a string: 10 minutes.
48. Cost
A scan of about 5 TB of strings cost us $1.16,
i.e. about $0.23 per TB.
Just for comparison, processing 1 TB of
data in BigQuery costs $5 - roughly 20 times more
49. POC
If you have data in S3 that you want to query,
we can do a POC together.