Spark vs Tez
By David Gruzman, BigDataCraft.com
Why do we compare them?
Both frameworks emerged as MapReduce replacements.
Both essentially provide a DAG of computations.
Both are YARN applications.
Both reduce the latency of MR.
Both promise to improve SQL capabilities.
Our plan for today
To understand what Tez is
To recall what Spark is
To understand what they have in common and what differentiates them
To try to identify when each of them is more applicable
MapReduce extension
While MapReduce can solve virtually any data transformation problem, it does not solve all of them efficiently.
One of the main drawbacks of the current MapReduce implementation is latency, especially in job cascades.
MapReduce latency causes
1. Obtaining and initializing containers
2. Poll-oriented scheduling
3. In a series of jobs - persistence of intermediate results
a. Serialization and deserialization costs
b. IO costs
c. HDFS costs
Common solutions to latency problems in Spark and Tez
Container start overhead - container reuse
Polling-style scheduling - event-driven control
Persistence of intermediate results - building a DAG of computations removes the need to persist them.
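A minimal sketch of the last point, using plain Scala collections rather than Spark or Tez. The names `CascadeVsDag`, `cascade`, `fused` and `materialized` are hypothetical, and the in-memory counter only stands in for writing an intermediate result to HDFS.

```scala
object CascadeVsDag {
  // Counts how many intermediate collections were fully stored.
  // In a real MR cascade each of these would be a write to HDFS.
  var materialized = 0

  private def persist[A](xs: Vector[A]): Vector[A] = { materialized += 1; xs }

  // MR-style cascade: every step finishes and stores its output
  // before the next step starts.
  def cascade(input: Vector[Int]): Int = {
    val step1 = persist(input.map(_ * 2))
    val step2 = persist(step1.filter(_ % 3 == 0))
    step2.sum
  }

  // DAG-style: the steps are chained lazily and evaluated in one pass,
  // so no intermediate collection is stored.
  def fused(input: Vector[Int]): Int =
    input.iterator.map(_ * 2).filter(_ % 3 == 0).sum

  def main(args: Array[String]): Unit = {
    val data = (1 to 10).toVector
    println(s"cascade = ${cascade(data)}, intermediates stored = $materialized")
    println(s"fused   = ${fused(data)}")
  }
}
```

Both variants compute the same result; only the cascade pays for storing each step's output.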
Tez
Implementation language - Java
Client language - Java
Main abstraction - DAG of computations
To the best of my understanding - an improvement of MR, taken as far as possible.
DAG - Vertices and Edges
Vertex
A vertex is a collection of tasks running in the cluster.
A task consists of inputs, outputs, and processors.
Inputs can come from other vertices or from HDFS.
Outputs can be sorted or not, and go to HDFS or to other vertices.
Tez edge types
One-to-one
Broadcast
Shuffle
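A minimal sketch of what the three edge types mean for data movement, modelled on plain Scala collections rather than on the Tez API. Each inner `Vector` stands for the output of one producer task or the input of one consumer task. The names `EdgeTypes`, `oneToOne`, `broadcast` and `shuffle` are hypothetical.

```scala
object EdgeTypes {
  type Partitions[A] = Vector[Vector[A]]

  // One-to-one: consumer task i reads only the output of producer task i.
  def oneToOne[A](producers: Partitions[A]): Partitions[A] = producers

  // Broadcast: every consumer task reads the output of all producer tasks.
  def broadcast[A](producers: Partitions[A], consumers: Int): Partitions[A] =
    Vector.fill(consumers)(producers.flatten)

  // Shuffle: each record is routed to a consumer task chosen by its key,
  // so records with equal keys meet in the same consumer.
  def shuffle[A](producers: Partitions[A], consumers: Int)(key: A => Int): Partitions[A] = {
    val byTarget = producers.flatten.groupBy(a => Math.floorMod(key(a), consumers))
    Vector.tabulate(consumers)(i => byTarget.getOrElse(i, Vector.empty[A]))
  }

  def main(args: Array[String]): Unit = {
    val producers: Partitions[Int] = Vector(Vector(1, 2), Vector(3, 4))
    println(s"one-to-one: ${oneToOne(producers)}")
    println(s"broadcast:  ${broadcast(producers, 3)}")
    println(s"shuffle:    ${shuffle(producers, 2)(identity)}")
  }
}
```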
Edge data sources
● Persisted: output is available after the task exits, but may be lost later on. Example: local FS.
● Persisted-Reliable: output is reliably stored and will always be available. Example: HDFS.
● Ephemeral: output is available only while the producer task is running. Example: in memory.
Tez edge scheduling
Sequential - the next task runs after the current task has finished
Concurrent - the next task can run while the current task is still running
Vertex Management
Need for dynamic parallelism
Tez vs MapReduce
MapReduce can be expressed in Tez efficiently.
Tez can be seen as somewhat lower level than MapReduce.
Tez session
A Tez session allows us to reuse the Tez application master for different DAGs.
The Tez AM is capable of caching containers.
IMO this contradicts the YARN model to some extent.
Tez sessions are similar in concept to the Spark context.
Tez - summary
Tez enables us to explicitly define a DAG of computations and tune its execution.
Tez is tightly integrated with YARN.
MR can be efficiently expressed in terms of Tez.
Tez programming is more complicated than MR.
Tez performance vs MR
Spark - word of thanks
I want to mention the help of Reynold Xin from Databricks (http://www.cs.berkeley.edu/~rxin/), who helped me verify the findings of this presentation.
Spark is today the most popular Apache project, with more than 400 contributors.
Spark
Spark is a framework that lets us manipulate distributed collections, called RDDs.
RDD stands for Resilient Distributed Dataset.
We can also view these manipulations as a DAG of computations.
RDD storage options
An RDD can live in the cluster in three forms:
- As native Scala objects. Fastest, most RAM.
- As serialized blocks. Slower, less RAM.
- As persisted blocks. Slowest, but minimal RAM.
DAG in Spark
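Spark - usability
While in MR (or in Tez) a simple WordCount takes pages of code, in Spark it is a few lines:
// `spark` here is assumed to be a SparkContext
val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))  // split each line into words
                 .map(word => (word, 1))             // pair each word with a count of 1
                 .reduceByKey(_ + _)                 // sum the counts per word
counts.saveAsTextFile("hdfs://...")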
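For readers without a cluster at hand, the same pipeline shape can be tried on an ordinary in-memory collection. This is only a sketch on plain Scala collections, not Spark code. `LocalWordCount` and `wordCount` are hypothetical names, and `groupBy` followed by a sum stands in for `reduceByKey`.

```scala
object LocalWordCount {
  // Same flatMap -> map -> reduce-by-key shape as the Spark snippet,
  // but on an in-memory Seq instead of an RDD read from HDFS.
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(line => line.split(" ").toSeq)
      .map(word => (word, 1))
      .groupBy { case (word, _) => word }
      .map { case (word, pairs) => (word, pairs.map(_._2).sum) }

  def main(args: Array[String]): Unit = {
    val counts = wordCount(Seq("to be or not to be", "to do"))
    counts.toSeq.sortBy(_._1).foreach { case (w, n) => println(s"$w\t$n") }
  }
}
```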
Implicit DAG definition
When we define a map in Spark, we define a one-to-one, or “non-shuffle”, dependency.
When we do a join or a group by, we define a “shuffle” dependency.
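A minimal sketch of how a chain of high-level calls can record a DAG implicitly. It is a toy model, not Spark's RDD implementation: `ImplicitDag`, `Dataset`, `lineage` and `shuffles` are hypothetical names, and each operation simply remembers its parent and the kind of dependency it introduces.

```scala
object ImplicitDag {
  sealed trait Dependency
  case object Narrow  extends Dependency // one-to-one, no data moves between partitions
  case object Shuffle extends Dependency // data is repartitioned by key

  // A dataset description that remembers how it was derived.
  final case class Dataset(name: String, parent: Option[(Dataset, Dependency)] = None) {
    def map(name: String): Dataset     = Dataset(name, Some((this, Narrow)))
    def groupBy(name: String): Dataset = Dataset(name, Some((this, Shuffle)))

    // Walks back to the source, collecting the dependency on each edge.
    def lineage: List[Dependency] = parent match {
      case Some((p, dep)) => p.lineage :+ dep
      case None           => Nil
    }

    // Number of shuffle edges the user defined without ever naming an edge.
    def shuffles: Int = lineage.count(_ == Shuffle)
  }

  def main(args: Array[String]): Unit = {
    val counts = Dataset("lines").map("words").map("pairs").groupBy("counts")
    println(counts.lineage.mkString(" -> "))
    println(s"shuffle edges: ${counts.shuffles}")
  }
}
```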
Explicit DAG definition
While it is not common, Spark does allow explicit DAG definition.
Spark SQL uses this for performance reasons.
Spark architecture
Spark serialization
Spark uses pluggable serialization.
You can write your own serializer or reuse existing serialization frameworks.
Java serialization is the default and works transparently.
Kryo is the fastest, to the best of my knowledge.
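A minimal sketch of what "pluggable" means here, using only the JDK. The `Serializer` trait below is a hypothetical plug-in point and not Spark's actual serializer interface; `JavaSerializer` shows the transparent default, and `PointSerializer` shows a hand-written alternative that trades generality for size. A Kryo-backed implementation would slot in the same way but is left out to keep the example dependency-free.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}
import java.nio.ByteBuffer

object PluggableSerialization {
  // Hypothetical plug-in point; Spark's real serializer interface differs.
  trait Serializer[A] {
    def serialize(value: A): Array[Byte]
    def deserialize(bytes: Array[Byte]): A
  }

  // Generic default: works for any java.io.Serializable value with no extra code.
  class JavaSerializer[A <: java.io.Serializable] extends Serializer[A] {
    def serialize(value: A): Array[Byte] = {
      val buffer = new ByteArrayOutputStream()
      val out = new ObjectOutputStream(buffer)
      try out.writeObject(value) finally out.close()
      buffer.toByteArray
    }
    def deserialize(bytes: Array[Byte]): A = {
      val in = new ObjectInputStream(new ByteArrayInputStream(bytes))
      try in.readObject().asInstanceOf[A] finally in.close()
    }
  }

  case class Point(x: Int, y: Int)

  // Hand-written alternative for one type: more work, fewer bytes.
  object PointSerializer extends Serializer[Point] {
    def serialize(p: Point): Array[Byte] =
      ByteBuffer.allocate(8).putInt(p.x).putInt(p.y).array()
    def deserialize(bytes: Array[Byte]): Point = {
      val buf = ByteBuffer.wrap(bytes)
      Point(buf.getInt(), buf.getInt())
    }
  }

  def main(args: Array[String]): Unit = {
    val p = Point(3, 4)
    val javaBytes   = (new JavaSerializer[Point]).serialize(p)
    val customBytes = PointSerializer.serialize(p)
    println(s"Java serialization: ${javaBytes.length} bytes")
    println(s"Hand-written:       ${customBytes.length} bytes")
    println(s"Round trip:         ${PointSerializer.deserialize(customBytes)}")
  }
}
```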
Spark deployment
Spark can be deployed standalone as well as in the form of a YARN application.
This means that Spark can be used without Hadoop.
Spark usage
[Diagram labels: Spark; Spark SQL; MLlib; GraphX; Applications/Shell]
Storage model
Tez works with HDFS data. A Tez job transforms data from HDFS to HDFS.
Spark has the notion of an RDD, which can live in memory or on HDFS.
An RDD can be in the form of native Scala objects, something Tez cannot offer.
Tez processing model
[Diagram: persistent dataset, Tez job, persistent dataset]
Spark processing model
[Diagram: persistent dataset, in-memory dataset, persistent dataset, in-memory dataset]
Job definition level
Tez is low level - we explicitly define vertices and edges.
Spark is “high level” oriented, although a low-level API exists.
Target audience
Tez is built from the ground up to be the underlying execution engine for high-level languages, like Hive and Pig.
Spark is built to be very usable as is. At the same time, there are a few frameworks built on top of it - Spark SQL, MLlib, GraphX.
YARN integration
Tez is a YARN application from the ground up.
Spark is “moving” toward YARN.
Spark recently added “dynamic” allocation of executors on YARN.
In the near future the two should be similar; for now Tez has some edge.
Note on similarity
1. There is an initiative to run Hive on Spark: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark
2. There is an initiative to reuse MR shuffling for Spark: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/
Applicability: Spark vs Tez
Interactive work with data, ad-hoc analysis: Spark is much easier.
Data >> RAM
Processing huge data volumes, much bigger than cluster RAM: Tez might be better, since it is more “stream oriented”, has a more mature shuffling implementation, and has closer YARN integration.
Data << RAM
Since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster’s memory.
Building your own DSL
For Tez the low-level interface is the “main” one, so building your own framework or language on top of Tez can be simpler than on top of Spark.
Links
http://www.slideshare.net/ydn/hive-hug
http://ampcamp.berkeley.edu/wp-content/uploads/2012/06/josh-rosen-amp-camp-2012-spark-python-api-final.pdf
http://www.quora.com/When-would-someone-use-Apache-Tez-instead-of-Apache-Spark-or-vice-versa
https://yhemanth.wordpress.com/2013/11/07/co
Update on ImpalaToGo
ImpalaToGo is a “light” version of Cloudera Impala, optimized to work with S3.
Architecture
S3
Cache layer on local SSD drives
ImpalaToGo Cluster
Data
A table with 28 billion records: one string column and a few numeric columns.
Size: 6 TB as CSV, stored as 1 TB of Parquet with Snappy compression.
Hardware
14 Amazon m3.2xlarge instances.
Each has 30 GB RAM, 8 cores, and 2 * 80 GB SSD.
Cost of this hardware - about $7 an hour.
Performance
First read: select count(*) from … where … - 20 minutes.
Subsequent reads:
where on a numeric column - 1 minute.
“grep” on a string - 10 minutes.
Cost
A scan of about 5 TB of strings cost us $1.16.
That is about $0.24 per TB.
For comparison, processing 1 TB of data in BigQuery costs $5 - roughly 20 times more.
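The arithmetic behind these figures can be reproduced from the numbers on the Hardware and Performance slides. This is a sketch under one assumption that the slides do not state explicitly: that the $1.16 is the price of running the $7-an-hour cluster for the 10 minutes of the “grep” scan. `ScanCost` and its field names are hypothetical.

```scala
object ScanCost {
  val clusterDollarsPerHour = 7.0  // from the Hardware slide
  val grepMinutes           = 10.0 // "grep" on string, from the Performance slide
  val scannedTb             = 5.0  // "about 5 TB of strings"
  val bigQueryDollarsPerTb  = 5.0  // figure quoted on the Cost slide

  // Assumption: the scan cost is simply cluster price times scan time.
  val scanDollars  = clusterDollarsPerHour * grepMinutes / 60.0
  val dollarsPerTb = scanDollars / scannedTb
  val ratio        = bigQueryDollarsPerTb / dollarsPerTb

  def main(args: Array[String]): Unit =
    println(f"scan = $$$scanDollars%.2f, per TB = $$$dollarsPerTb%.2f, BigQuery ratio = $ratio%.1f")
}
```

With these inputs the scan comes to a little under $1.17 and the BigQuery price to a bit more than 21 times the per-TB cost.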
POC
If you have data in S3 that you want to query, we can do a POC together.
